May 1, 2010

Notes from the JS pit: lofty goals, humble beginnings

I've been working at Mozilla for about two months on the JavaScript (JS) engine team, the members of which sit in an area affectionately known as the "JS pit".

Mozillians appear to try to blog on a regular basis, so I'll be starting a series of entries prefixed "notes from the JS pit" to explain what I've been working on and/or thinking about.

Notably, I feel fortunate to work for a company that encourages this kind of openness.

Goals

I always feel stupid writing down goals — they seem so self-evident; however, it helps to put the work I'm doing into perspective and gives me something that I can to refer back to.

I'm also a big believer in the effectiveness of public accountability, so publishing those goals seems prudent — my notes from the JS pit are poised to help me stay motivated more than anything else.

My goals are to:

Working with compiler engineers from diverse language backgrounds, it's prime time for sucking knowledge out of people's heads, comparing and contrasting it, and listening to them argue with each other. Heck, just look at the concepts behind JS: an imperative, Scheme-inspired, prototypal, C-and-Java-syntax conforming language that's irrevocably tied to a practical platform, the web. It's bound to be a fun ride.

From start to present

I started off implementing the simpler opcodes for JaegerMonkey (JM) and getting an understanding the JM code base. Not too long into it, I was told that looking into quick-and-easy parser optimizations was a priority — somebody had reported that a significant fraction of the GMail load bar time could be attributed to JS parsing. [*]

Now, JavaScript isn't the easiest language in the world to parse; for example, automatic semicolon insertion creates some non-traditional obstacles for generated shift/reduce parsers [†] — it effectively makes an error correction algorithm part of the normal parse procedure. The details are for another entry, but suffice it to say that our recursive descent parser code gets complex, especially due to our E4X support and some of the static analyses we perform for optimizations before bytecode emission.

In pursuing JS parser optimization I assembled a suite of parsing benchmarks from sites on the web with "large" JS payloads — I call this suite parsemark. After getting some speedups from simple inlining, I attempted a somewhat fundamental change to the parser to reduce the number of branch mispredictions, in converting it to always have a token "pre-lexed" as opposed to the prior "lex-on-demand" model. Roughly, this required adding a "there's always a lexed token" invariant to the lexer and hoisting lexer calls/modesets from substitution rules into their referring nonterminals in the parser. The details for this are also entry fodder. Sadly, it demonstrated negligible performance gains for the increase in complexity. Sure taught me a lot about our parser, though.

The biggest performance win was obtained through a basic fix to our parser arena-allocation chunk sizing. sayrer noticed that a surprising amount of time was being spent in kernel space, so we tracked the issue down. It was frustrating to work for a few weeks on a fundamental change and then realize that multiplying a constant by four can get you a 20% parsing speedup, but I've certainly learned to look a lot more closely at the vmem subsystem when things are slow. I have some speedup statistics and a comparison to V8 (with all its lazy parsing and parse-caching bits ripped out), but I don't have much faith that my environment hasn't changed in the course of all the historical data measurements — writing a script to verify speedup over changesets seems like a viable option for future notes.

In the miscellany department, I've been trying to do a good amount of work fixing broken windows via cleanup patches. I'm finding it difficult to strike a balance here, since there's a lot of modularity-breaking interdependencies in the code base — what appear to be simple cleanups tend to unravel into large patches that get stale easily. However, cleanup does force you to read through the code you're modifying, which is always good when you're learning a new code base.

Looking back on it, it doesn't seem like a lot of work; of course, my hope is that the time I spend up-front getting accustomed to the codebase will let me make progress on my goals more rapidly.

Stay tuned for more JS pittage — unpredictable time, but predictable channel.

Footnotes

[*]

To date, I haven't looked into this myself. Ideally, I should have verified it before starting on the parser work, but I was eager to start working on things rather than investigate the reasons behind them.

[†]

Though I've seen that Webkit's JavaScriptCore uses Bison — I'm going to have to check out that implementation at some point.

Bit twiddling: Simple O(1) membership test

Disclaimer

Bit twiddling is fun. Plus, it has several advantages:

You have to understand, though, that clever tricks without appropriate documentation will make people want to break your face. [*] Always bit bash responsibly: appoint a designated code-reader to make sure you're clear enough, and leave your keys at the door.

The Problem

Let's say you wanted to know whether a number was a valid PCI Express link width in terms of number of lanes. We know that valid widths are x1, x2, x4, x8, x12, x16, or x32, and want to construct a function of the following form:

#include <stdbool.h>
#include <stdint.h>
#include <assert.h>

/**
 * :return: Whether the lane count is valid.
 */
bool is_valid_link_width(uint8_t lane_count);

/**
 * Unit test for ``is_valid_link_width``.
 */
int main(int argc, char **argv) {
    assert(!is_valid_link_width(0));
    assert(is_valid_link_width(1));
    assert(is_valid_link_width(2));
    assert(!is_valid_link_width(3));
    assert(is_valid_link_width(32));
    assert(!is_valid_link_width(33));
    assert(!is_valid_link_width(33));
    assert(!is_valid_link_width(0xff));
    return 0;
}

Note that the uint8_t has a width of exactly 8 bits. [†]

How would you write it?

Less Interesting Solution

If you were thinking switch statement, that will work. You could use a switch statement with intentional fall-throughs and hope that the compiler optimizes a branch table for you. (For values this small and dense it probably will, as mentioned in the referenced article.) If the compiler doesn't write the branch table for you, but instead generates the equivalent of a big if/else if ladder, your solution doesn't satisfy the O(1) constraint: in that case, the worst case control flow hits every rung of the ladder (the else if guards), making it O(n).

bool is_valid_link_width(uint8_t lane_count) {
    switch (lane_count) {
    case 1:
    case 2:
    case 4:
    case 8:
    case 12:
    case 16:
    case 32:
        return true;
    }
    return false;
}

An implementation that I like better, which doesn't put as much faith in the compiler, is as follows:

bool is_valid_link_width(uint8_t lane_count) {
    return 0x100011116ULL & (1ULL << lane_count);
}

How cool is that?

The Neat Trick

The clever insight here is that we can encode all of our target "true" values in binary form, like so:

       32                      16    12    8     4    1
0b__0001__0000__0000__0000__0001__0001__0001__0001__0110

Now, if we were to take a 1 value and move it over a number of binary slots equal to the lane count, it will line up with a 1 value in this long binary number we've constructed. Take the bitwise-AND of those two values, and we wind up with:

This is exactly what we were looking for.

This long binary number we've created must be converted from binary into a hexadecimal value, so that we can represent it as an integer literal in our C program. Encoding each binary 4-tuple into hex from right to left, we get the value 0x100011116.

There's an issue with this value, however. Unless we specify a suffix for our integer literal, the compiler is allowed to truncate the value to its native word size, [‡] which would cause serious problems. For x86 systems with 16 bit words, our value could be truncated to 0x1116, which would only allow lane sizes of 1, 2, 4, 8, and 12 — the allowed values of 16 and 32 would be cut off!

To solve this, as you can see in the function definition, we add the ULL integer suffix, which explicitly marks the integer literal as an unsigned long long. (The long long integer data type was added to the C language in the C99 standard.) This data type is required to be at least 64 bits wide, so it can definitely hold our 33 relevant bits (32 plus the zero bit which is there for the 1ULL << 0 case). The long data type is too small to hold this value, as long can potentially be only 32 bits wide per the C standard (and it is 32 bits wide on most modern machines).

Readability Counts

Note that there's a more readable version of the same trick in the following:

bool is_valid_link_width(uint8_t lane_count) {
    const uint64_t set =
        1ULL << 1
        | 1ULL << 2
        | 1ULL << 4
        | 1ULL << 8
        | 1ULL << 12
        | 1ULL << 16
        | 1ULL << 32;
    return set & (1ULL << lane_count);
}

Here we make the construction of the big integer more explicit and make the code less prone to our errors in encoding the literal binary value into hex. Any compiler worth its salt will fold out the const calculation at compile time, so no overhead will be incurred for writing it this way.

I demonstrated the other way of doing it first to: a) blow your mind a little bit, and b) demonstrate an idiom you might see in other people's (overly) clever code. Now there's a chance you can recognize and decipher it without adequate documentation. Huzzah!

Footnotes

[*]

A great rift in the universe known as "Other People's Perl" has been the cause of 80% of the computer-engineering face breakage since 1987. Don't let this tragedy happen to you or your beloved programmers.

[†]

I like using fixed-width integer types, especially in describing problems, because they helpfully constrain the possibilities on the input domain. This is even more important for newcomers who are just wrapping their heads around bit twiddling.

[‡]

I can't find standards documents/discussions to support this claim, but it's definitely what I was taught. Can anybody provide evidence to confirm/deny?

Thoughts on self-modifying code and Futurist Programmers

Around 8th grade I read an article about a faction of programmers — the Futurist Programmers — whose rallying cry is paraphrased in the following quotation:

Why does computer science reject self modifying programs? Why have some departments stopped teaching assembly language programming? On what scientific basis has this been done? Where is the experimental evidence to support these actions?

As far as I remember, this movement attempted to emphasize the purity of computer programming, which they believed was a form of artistry. This was posed as a throwback to the tenets Italian Futurism, which were opposed to tradition and commoditization, in the context of computer programming. A Wikipedia excerpt will probably be helpful:

The Futurists admired speed, technology, youth and violence, the car, the plane and the industrial city, all that represented the technological triumph of humanity over nature, and they were passionate nationalists.

Thinking about JavaScript Just In Time compilers (JITs) today — like TraceMonkey — reminded me of this philosophy. I believe that their line of questioning was insightful, but the formulation was misdirected. Technological triumph stems primarily from computers doing what humans want them to do. It's additionally awesome if the computers can do these things extra quickly; however, if they do things incorrectly very quickly, humanity comes out much less triumphant. Perhaps we even come out worse for the experience.

Secondly, we note that humanity strives for the ability to make further progress based on the success of past experiences. This is the concept of extensibility and reusability. Standing on the shoulders of giants, if you will. Self modifying code that I have encountered is often very clever; however, programming cleverness tends to be at odds with readability. [*] This is not to say that all self-modifying code is unreadable: in languages with dynamic method dispatch, swapping a object's methods out (with some kind of locking mechanism) is a recognized idiom that can lead to beneficial efficiency/complexity trade-offs. [†]

Ultimately, you'd have trouble finding computer enthusiasts who find speed unimportant. Everybody loves it when their computers are more efficient! The caveat is that most computer enthusiasts will, in many situations, put speed down here: after correctness and extensibility. As a testament to this, there is continuing emergence and acceptance of Very High Level Languages (VHLLs) over low level programming languages in non-academic contexts.

So how did the futurists have the right idea? "Introspective" programs are important. There's lots of information at runtime that we can use to more efficiently execute programs. [‡] Hotspot JITs, such as the aforementioned TraceMonkey, know this well: the basic premise is that they dynamically rewrite the code they're executing or, in recent developments with Google's V8, rewrite it before executing. The key here is that we can now:

  1. Write correct, extensible programs.

  2. Write correct, extensible programs to optimize the programs from 1.

  3. Run the more efficient result of combining 2 and 1.

Self-hosting platforms such as PyPy and intermediary representation JITs such as LLVM also show astonishing insight into introspective techniques. These platforms can be used to a number of ends, including, but not limited to, the increases in speed that the Futurist Programmers seem to be longing for.

In the end, I only have one rebuttal question for the Futurist Programmers: What kind of science disregards the accuracy and reproducibility of results for the sake of fast "experiments"? [§] We don't reject self-modifying programs without consideration — there are very important maintainability and extensibility concerns that have to be taken into account before making a decision. It's not always a choice between making something artistically beautiful or performing a feat of engineering: if most computer enthusiasts are like me, they're searching for a way to produce an appropriate mix of the two.

Footnotes

[*]

This is generally recognized within the Python community.

[†]

As an example of this, think of the singleton access pattern in a multithreaded application. After Singleton.get_instance() has instantiated the class on the first call, you could swap get_instance() with a method that simply returns the created reference. This avoids subsequent locking and singleton-instantiation checking that you would incur from the old get_instance() method.

[‡]

I recommend the Steve Yegge talk on dynamic languages for some more background on this topic.

[§]

What is an application if not a software engineer's big, scary experiment?