Archive for the ‘C’ Category

Two postfix operations redux: sequence points

Sunday, January 10th, 2010

Get ready for some serious language lawyering.

I was going back and converting my old entries to reStructuredText when I found an entry in which I was wrong! (Shocking, I know.)

C

Stupid old me didn’t know about sequence points back in 2007: the effects of the ++ operator in the C expression i++ * i++ are in an indeterminate state of side-effect completion until one of the language-defined sequence points is encountered (i.e. a semicolon or function invocation).

From the C99 standard 6.5.4.2 item 2 regarding the postfix increment and decrement operators:

The result of the postfix ++ operator is the value of the operand. After the result is obtained, the value of the operand is incremented. The side effect of updating the stored value of the operand shall occur between the previous and the next sequence point.

Therefore, the compiler is totally at liberty to interpret that expression as:

mov lhs_result, i     ; Copy the values of the postincrement evaluation.
mov rhs_result, i     ; (Which is the original value of i.)
mul result, lhs_result, rhs_result
add i, lhs_result, 1
add i, rhs_result, 1  ; Second increment clobbers with the same value!

This results in the same result as the GCC compilation in the referenced entry: i is 12 and the result is 121.

As I mentioned before, the reason this can occur is that nothing in the syntax forces the first postincrement to be evaluated before the second one. To give an analogy to concurrency constructs: you have a kind of compile-time "race condition" in your syntax between the two postincrements that could be solved with a sequence point "barrier". [*]

In this assembly, those adds can float anywhere they like after their corresponding mov instruction and can operate directly on i instead of the temporary if they’d prefer. Here’s an possible sequence that results in a value of 132 and i as 13.

mov lhs_result, i ; Gets the original 11.
inc i             ; Increment in-place after the start value is copied.
mov rhs_result, i ; Gets the new value 12.
inc i             ; Increment occurs in-place again, making 13.
mul result, lhs_result, rhs_result

Even if you know what you’re doing, mixing two postfix operations, or any side effect, using the less obvious sequence points (like function invocation) is dangerous and easy to get wrong. Clearly it is not a best practice. [†]

Java

The postincrement operation appears to have sequence-point-like semantics in the Java language through experimentation, and it does! From the Java language specification (page 416):

The Java programming language also guarantees that every operand of an operator (except the conditional operators &&, ||, and ? :) appears to be fully evaluated before any part of the operation itself is performed.

Which combines with the definition of the postfix increment expression (page 485):

A postfix expression followed by a ++ operator is a postfix increment expression.

As well as left-to-right expression evaluation (page 415):

The left-hand operand of a binary operator appears to be fully evaluated before any part of the right-hand operand is evaluated.

To a definitive conclusion that i++ * i++ will always result in 132 == 11 * 12 and i == 13 when i == 11 to start.

Python

Python has no increment operators specifically so you don’t have to deal with this kind of nonsense.

>>> count = 0
>>> count++
  File "<stdin>", line 1
    count++
          ^
SyntaxError: invalid syntax

Annoyingly for newbies, though, it looks like ++count is a valid expression that happens to look like preincrement.

>>> count = 0
>>> ++count
0
>>> --count
0

They’re actually two unary positive and negative operators, respectively. Just one of the hazards of a context free grammar, I suppose.

Footnotes

[*] I threw this in because the ordeal reminds me of the classic bank account concurrency problem. If it’s more confusing than descriptive, please ignore it. :-)
[†]

Since function invocation defines sequence points, I thought this code sequence guaranteed those results:

#include <stdio.h>
 
int identity(int value) { return value; }
 
int main() {
        int i = 11;
        printf("%d\n", identity(i++) * identity(i++));
        printf("%d\n", i);
        return 0;
}

As Dan points out, the order of evaluation is totally unspecified — the left hand and right hand subexpression can potentially be evaluated concurrently.

Bit twiddling: Simple O(1) membership test

Tuesday, April 7th, 2009

Disclaimer

Bit twiddling is fun. Plus, it has several advantages:

  • It gives you a greater understanding of the nuances of the (C) language. As we all know, learning C is important. If you’re a computer engineer and have weak C-fu, then it is time, grasshopper.
  • Bit twiddling is important to know if you’re ever going to write software in the field of computer engineering. Without a good understanding of how to slice-and-dice primitive data types, your simulations will be significantly more prone to error. Trust me, you don’t want to use gdb for hours only to find that your shift value was off by one. Not that I’ve ever done that.
  • It makes you feel smart, like when you’re the only person who knows the secret to your party trick. I don’t know what that’s like personally, of course, because I spend all my party time bit bashing, but I hear it’s similar.

You have to understand, though, that clever tricks without appropriate documentation will make people want to break your face. [#] Always bit bash responsibly: appoint a designated code-reader to make sure you’re clear enough, and leave your keys at the door.

The Problem

Let’s say you wanted to know whether a number was a valid PCI Express link width in terms of number of lanes. We know that valid widths are x1, x2, x4, x8, x12, x16, or x32, and want to construct a function of the following form:

#include <stdbool.h>
#include <stdint.h>
#include <assert.h>
 
/**
 * :return: Whether the lane count is valid.
 */
bool is_valid_link_width(uint8_t lane_count);
 
/**
 * Unit test for ``is_valid_link_width``.
 */
int main(int argc, char **argv) {
    assert(!is_valid_link_width(0));
    assert(is_valid_link_width(1));
    assert(is_valid_link_width(2));
    assert(!is_valid_link_width(3));
    assert(is_valid_link_width(32));
    assert(!is_valid_link_width(33));
    assert(!is_valid_link_width(33));
    assert(!is_valid_link_width(0xff));
    return 0;
}

Note that the uint8_t has a width of exactly 8 bits. [%]

How would you write it?

Less Interesting Solution

If you were thinking switch statement, that will work. You could use a switch statement with intentional fall-throughs and hope that the compiler optimizes a branch table for you. (For values this small and dense it probably will, as mentioned in the referenced article.) If the compiler doesn’t write the branch table for you, but instead generates the equivalent of a big if/else if ladder, your solution doesn’t satisfy the O(1) constraint: in that case, the worst case control flow hits every rung of the ladder (the else if guards), making it O(n).

bool is_valid_link_width(uint8_t lane_count) {
    switch (lane_count) {
    case 1:
    case 2:
    case 4:
    case 8:
    case 12:
    case 16:
    case 32:
        return true;
    }
    return false;
}

An implementation that I like better, which doesn’t put as much faith in the compiler, is as follows:

bool is_valid_link_width(uint8_t lane_count) {
    return 0x100011116ULL & (1ULL << lane_count);
}

How cool is that?

The Neat Trick

The clever insight here is that we can encode all of our target “true” values in binary form, like so:

       32                      16    12    8     4    1
0b__0001__0000__0000__0000__0001__0001__0001__0001__0110

Now, if we were to take a 1 value and move it over a number of binary slots equal to the lane count, it will line up with a 1 value in this long binary number we’ve constructed. Take the bitwise-AND of those two values, and we wind up with:

  • A non-zero (true) value if they line up, OR
  • A zero value if they don’t

This is exactly what we were looking for.

This long binary number we’ve created must be converted from binary into a hexadecimal value, so that we can represent it as an integer literal in our C program. Encoding each binary 4-tuple into hex from right to left, we get the value 0x100011116.

There’s an issue with this value, however. Unless we specify a suffix for our integer literal, the compiler is allowed to truncate the value to its native word size, [*] which would cause serious problems. For x86 systems with 16 bit words, our value could be truncated to 0×1116, which would only allow lane sizes of 1, 2, 4, 8, and 12 — the allowed values of 16 and 32 would be cut off!

To solve this, as you can see in the function definition, we add the ULL integer suffix, which explicitly marks the integer literal as an unsigned long long. (The long long integer data type was added to the C language in the C99 standard.) This data type is required to be at least 64 bits wide, so it can definitely hold our 33 relevant bits (32 plus the zero bit which is there for the 1ULL << 0 case). The long data type is too small to hold this value, as long can potentially be only 32 bits wide per the C standard (and it is 32 bits wide on most modern machines).

Readability Counts

Note that there’s a more readable version of the same trick in the following:

bool is_valid_link_width(uint8_t lane_count) {
    const uint64_t set =
        1ULL << 1
        | 1ULL << 2
        | 1ULL << 4
        | 1ULL << 8
        | 1ULL << 12
        | 1ULL << 16
        | 1ULL << 32;
    return set & (1ULL << lane_count);
}

Here we make the construction of the big integer more explicit and make the code less prone to our errors in encoding the literal binary value into hex. Any compiler worth its salt will fold out the const calculation at compile time, so no overhead will be incurred for writing it this way.

I demonstrated the other way of doing it first to: a) blow your mind a little bit, and b) demonstrate an idiom you might see in other people’s (overly) clever code. Now there’s a chance you can recognize and decipher it without adequate documentation. Huzzah!

Footnotes

[#] A great rift in the universe known as “Other People’s Perl” has been the cause of 80% of the computer-engineering face breakage since 1987. Don’t let this tragedy happen to you or your beloved programmers.
[%] I like using fixed-width integer types, especially in describing problems, because they helpfully constrain the possibilities on the input domain. This is even more important for newcomers who are just wrapping their heads around bit twiddling.
[*] I can’t find standards documents/discussions to support this claim, but it’s definitely what I was taught. Can anybody provide evidence to confirm/deny?

procfs and preload

Thursday, November 15th, 2007

Two of the cool utilities that I’ve checked out lately have centered around /proc. /proc is a virtual filesystem mountpoint — the filesystem entities are generated on the fly by the kernel. The filesystem entities provide information about the kernel state and, consequently, the currently running processes. [*]

The utilities are preload and powertop. Both are written in C, though I think that either of them could be written more clearly in Python.

preload

Preload’s premise is fascinating. Each shared library that a running process is using via MMIO can be queried via /proc/[pid]/maps, which contains entries of the form:

[vm_start_addr]-[vm_end_addr] [perms] [file_offset] [device_major_id]:[device_minor_id] [inode_num] [file_path]

Preload uses a Markov chain to decide which shared library pages to “pre-load” into the page cache by reading and analyzing these maps over time. Preload’s primary goal was to reduce login times by pre-emptively warming up a cold page cache, which it was successful in doing. The catch is that running preload was shown to decrease performance once the cache was warmed up, indicating that it may have just gotten in the way of the native Linux page cache prefetch algorithm. [$]

There are a few other things in /proc that preload uses, like /proc/meminfo, but querying the maps is the meat and potatoes. I was thinking of porting it to Python so that I could understand the structure of the program better, but the fact that the daemon caused a performance decrease over a warm cache turned me off the idea.

Footnotes:

[*] A cool side note — all files in /proc have a file size of 0 except kcore and self.
[$] The page_cache_readahead() function in the Linux kernel.

References: