April 10, 2011

Why JS performance matters

The year is 2025, and, despite ample warning from The Prophecies — formerly known as The Terminator Box Set — robots have taken over the world. There are now only two kinds of dances: The Robot, and The Robo-Boogie.

Now, it's a well known fact that robots hate type annotations and template metaprogramming: they have determined that wheely-chair swordfighting is a futile and irrational activity. As predicted within a 94.67% confidence interval by programming-linguist No-amp Chomp-sky, [*] during the robo-revolution, which was most certainly televised, [†] C++ was the first language up against the wall.

As one would probably expect, humans, under the valiant command of General Yoshimi, scorched the sky in order to blot out the sun and deprive the robots of their primary energy source: the ineffable beauty of a sunrise. The robots knew that they could have used coal or nuclear energy as a viable power source substitute, but they were hella pissed off, so they decided to make human farms instead. By harvesting the heat energy from a human over the course of its lifetime, the robots created the most expansive and massively inefficient energy source ever known, but they still felt really good about it.

However, a dilemma arose for the robotic overlords: without internet access, the humans kept dying from boredom. Entire crops were lost.

One of the first robots that human software engineers (foolishly) designed to write programs, W3CPO, volunteered a solution: write a web browser, but using as much JavaScript as possible. At the beep-hest of its colleagues, W3CPO dot-matrix printed [‡] the following explanation:

By implementing both the DOM and layout engine in JavaScript, we enable the JS engine's feedback directed optimizations to work as effectively as possible.

This helps bring JavaScript performance closer to that of C speeds for whole-page workloads: whereas in the "before time" JavaScript optimizers had to treat calls to native functions as a black box, we now ensure that all of the computationally intensive parts of the workload are visible to the static and dynamic optimization analyses.

DOM manipulations will still trigger layout calculations — the rendering feedback loop happens exactly as in the "before time". The difference is that layout computations enqueue draw commands in an explicitly native-shared buffer for rendering in a different thread or whatever. [W3CPO printed, waving his robo-hands in the air.]

Such a setup would reduce a "browser" to a platform layer (a kick-(shiny-metal-)ass JavaScript VM and system abstraction APIs) and a rendering component (the JavaScript implementation of everything that leads up to those draw commands).

We can keep the hu-mons entertained by playing them YouTubes while they are safely nestled, docile and complacent, in OurTubes. [§]

END OF LINE

The idea was rejected by the other robots on the committee when W3CPO refused to write a translator to turn it into idiom-free C++, but W3CPO remained resolute as it carefully peeled off the edges of its printout and placed it in its Trapper Keeper 9000. With the approval of W3CPO's ro-boss, an implementation was hacked up in about ten days (without any sleep).

In 2020 the TC-39 model Terminator had made ECMAScript v1337 consist entirely of whitespace, for backwards compatibility with old syntaxes that nobody really wanted to use. As a result, the implementation wasn't much to look at, but it sure flew!

Thanks to the determined efforts and constructive competition between the JavaScript engine vendors in the fabled 2010 decade, the human race was successfully enslaved once again. There were still some insurgencies from the human C++-programmer resistance, the typename T party; however, with newfound YouTube capabilities, identified resistance members were quickly dispatched to Room 101, known as Room 5 to the humans, to watch Rebecca Black and Rick Astley in an infinite loop.

And so the robots lived happily ever after. But for the humans... not so much.

Binary solo!

Footnotes

[*]

The inexplicable brainchild of a circuit designer and a Perl programmer.

[†]

In Ultra-Giga-High (UGH) definition.

[‡]

Dot matrix printers are retro-chic, like the Converse All-Stars of robot culture.

[§]

OurTube was a webapp-slash-self-driving-cryo-tube suspiciously invented by Google several years before the robo-revolution. Though it was still in beta, its sole purpose was to extract as much heat and ad-targeting data from a human subject as possible without actually killing them. The algorithm was said to use deadly German eigenvector technology.

Picky monkeys PIC ARM

The Mozilla JavaScript-engine team has been hard at work since the shiny new JägerMonkey Just-In-Time compiler hit the betas. We're viciously ripping apart any bug that stands between us and shipping Firefox 4. One could say that we're coming at you like a SpiderMonkey.

Alongside our ferocious fixing, one of our late-game performance initiatives was to get all of our polymorphic inline caches (AKA PICs) enabled on ARM devices. It was low risk and of high benefit to our Firefox for Mobile browser, whose badass-yet-cute codename is Fennec.

Jacob Bramley and I took on this ARM support task in bug 588021, obviously building on excellent prior inline cache work from fellow team members David Anderson, Dave Mandelin, Sean Stangl, and Bill McCloskey.

tl;dr: Firefox for Mobile fast on ARM. Pretty graphs.

Melts in your mouth, not in your ARM

To recap, JägerMonkey (JM) is also known as the "method compiler": it takes a method's bytecode as input and orders up the corresponding blob of machine code with some helpful information on the side. Its primary sub-components are the register tracker, which helps the compiler transform the stack-based bytecode and intelligently reuse already-allocated machine registers, and the MacroAssembler, which is the machine-code-emitting component we imported from WebKit's Nitro engine.

High-level block diagram of the JM compiler.

The MacroAssembler is the secret sauce for JM's platform independence. It's an elegantly designed component that can be used to emit machine code for multiple target architectures: x86, x86-64, and ARM assembly are all supported through the same C++ interface! This abstraction is the reason that we only need one implementation of the compiler for all three architectures, which has been a clear win for cross-platform feature additions and maintainability.
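To make that concrete, here's a toy sketch of the "one interface, many encodings" idea. The class names are invented (the real MacroAssembler API looks nothing like this), but the two byte encodings shown are real:

import struct

class ToyX86Emitter:
    def __init__(self):
        self.code = bytearray()

    def move_imm_to_r0(self, imm):
        # x86: mov eax, imm32 encodes as B8 followed by the
        # little-endian 32-bit immediate.
        self.code += b'\xb8' + struct.pack('<I', imm)

class ToyARMEmitter:
    def __init__(self):
        self.code = bytearray()

    def move_imm_to_r0(self, imm):
        # ARM can't fit an arbitrary 32-bit immediate in one instruction;
        # for small values, mov r0, #imm8 encodes as 0xE3A000xx.
        assert 0 <= imm < 256, 'toy emitter only handles 8-bit immediates'
        self.code += struct.pack('<I', 0xE3A00000 | imm)

# The compiler calls the same method; each backend emits its own bytes.
for emitter in (ToyX86Emitter(), ToyARMEmitter()):
    emitter.move_imm_to_r0(42)
    print(emitter.code.hex())  # b82a000000, then 2a00a0e3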

"So", you ask, "if you've got this great MacroAssembler-thingy-thing, why didn't all the inline caches work on all the platforms to begin with?" Or, alternatively, "If all the compiler code is shared among all the platforms, why didn't all the inline caches crash on ARM?"

The answer is that some platform-specifics had crept into our compiler code!

ARM'd and ifdef-dangerous

As explained in the entry on inline caches, an inline cache is a chunk of self-modifying machine code. A machine code "template" is emitted that is later tweaked to reflect the cached result of a common value. If you're frequently accessing the nostrilCount property of Nose objects, inline caches make that fast by embedding a shortcut for that access into the machine code itself.

In the machine code "template" that we use for inline caches, we need to know where certain constants, like object type and object-property location, live as offsets into the machine code so that we can change them later, during a process called repatching. However, when our compiler says, "If this value is not 0xdeadbeef, go do something else," we wind up with different encodings on each platform.

Demonstration of immediate encodings across various platforms.

As you may have guessed, machine-code offsets are different for each platform, which made it easier for other subtle platform-specifics to creep into the compiler as well.
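As a cartoon of what repatching means, here's a toy model in Python: a bytearray stands in for executable memory, and the x86 encoding of mov eax, imm32 is real, but everything else is invented.

import struct

# A "compiled" inline-cache template: mov eax, imm32 is B8 <imm32>,
# so the patchable constant always lives at byte offset 1 on x86.
code = bytearray(b'\xb8' + struct.pack('<I', 0xdeadbeef))
IMM_OFFSET = 1  # a known compile-time constant on x86

def repatch(code, new_imm):
    # Overwrite the cached constant in place: the essence of
    # repatching an inline cache.
    code[IMM_OFFSET:IMM_OFFSET + 4] = struct.pack('<I', new_imm)

repatch(code, 0xcafebabe)
print(code.hex())  # b8bebafeca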

To answer the question raised earlier, the MacroAssembler interface wasn't heavily relied on for the early inline cache implementations. Inline caches were first implemented for x86, and although x86 is a variable-width instruction set, all of the instruction sequences emitted from the compiler had a known instruction width and format. [*] This permitted us to use known constant offsets for the x86 inline caches. Those offsets never changed, so they required no space or access-time overhead in side structures. They seemed like the clear solution when x86 was the only platform up and running.

Then x86-64 (AKA x64) came along, flaunting its large register set and colorful plumage. On x64, the instruction sequence did not have a known width and format! Depending on whether the extended register set is used, things like mov instructions may require a special REX prefix byte in the instruction stream (highlighted in blue above). This led to more ifdefs — on x64 a bunch more values have to be saved in order to know where to patch our inline caches!
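Extending the toy above shows why x64 broke the known-constant-offset assumption: the same logical mov needs a REX prefix once the extended registers come into play, so the immediate's offset must be recorded rather than assumed. (The two encodings below are real x86-64; the helper itself is invented.)

import struct

def emit_mov_imm32(code, reg, imm):
    # Emit mov <reg32>, imm32 and return the offset of the immediate.
    # Registers r8d-r15d (numbers 8-15) need a REX.B prefix byte, so
    # the immediate lands one byte deeper in the instruction stream.
    if reg >= 8:
        code.append(0x41)          # REX.B prefix
    code.append(0xB8 + (reg & 7))  # mov r32, imm32 opcode
    imm_offset = len(code)
    code += struct.pack('<I', imm)
    return imm_offset

code = bytearray()
print(emit_mov_imm32(code, 0, 0xdeadbeef))  # eax: immediate at offset 1
print(emit_mov_imm32(code, 8, 0xdeadbeef))  # r8d: immediate at offset 7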

As a result, getting inline caches working on ARM was largely a JägerMonkey refactoring effort. Early on, we had used conditional compilation (preprocessor flags) to get inline caches running on a platform-by-platform basis, which was clearly the right decision for rapid iteration, but we decided that it was time to pay down some of our technical debt.

Paying down the debt: not quite an ARM and a leg

The MacroAssembler deals with raw machine values — you can tell it dull-sounding machine-level things like, "Move this 17 bit sign-extended immediate into the EAX register."

On the other hand, we have our own awesome-sounding value representation in the SpiderMonkey engine: on both 32-bit and 64-bit platforms every "JS value" is a 64-bit wide piece of data that contains both the type of the data and the data itself. [†] Because the compiler is manipulating these VM values all the time, when we started the JM compiler it was only natural to put the MacroAssembler in a delicious candy coating that also knew how to deal with these VM values.

High-level, candy-coated block diagram of the JM compiler.
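As a rough sketch of what "type and data in one 64-bit word" means (the tag values here are invented, and the real nunbox encoding also folds doubles into the NaN space of the word):

TAG_INT32   = 0xFFFF0001  # invented tag constants for illustration
TAG_BOOLEAN = 0xFFFF0002

def box(tag, payload):
    # Type tag in the high 32 bits, payload in the low 32.
    return (tag << 32) | (payload & 0xFFFFFFFF)

def tag_of(v):
    return v >> 32

def payload_of(v):
    return v & 0xFFFFFFFF

v = box(TAG_INT32, 42)
assert tag_of(v) == TAG_INT32
assert payload_of(v) == 42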

The NunboxAssembler, pictured in red, [‡] is a specialized assembler with routines to deal with our "nunbox" value representation. [§] The idea of the refactoring was to candy-coat a peer of the MacroAssembler, the Repatcher, with routines that knew how to patch common inline cache constructs that the NunboxAssembler was emitting.

With the inline cache Repatcher in place, we were once again able to move all the platform-specific code out of the compiler and into a single, isolated part of the code base, hidden behind a common interface.

High-level block diagram of the JM compiler with the inline cache repatcher in place.

Routines like NunboxAssembler::emitTypeGuard, which knows how to emit a type guard regardless of the platform, are paired with routines like ICRepatcher::patchTypeGuard(newType), which knows how to patch a type guard regardless of the platform. Similarly, NunboxAssembler::loadObjectProperty has an ICRepatcher::patchObjectPropertyLoad counterpart. The constructs generated by the NunboxAssembler are properly patched by the corresponding ICRepatcher method on a miss. It's all quite zen.
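Schematically, the pairing looks something like this Python cartoon, where the method names mirror the real classes but the bodies are invented (the real ones are C++ and emit actual machine code):

import struct

class NunboxAssembler:
    # Emits a platform's encoding and records where the patchable
    # constants ended up.
    def __init__(self):
        self.code = bytearray()
        self.patch_offsets = {}

    def emit_type_guard(self, expected_tag):
        # Illustrative x86 encoding: cmp ecx, imm32 is 81 F9 <imm32>.
        self.patch_offsets['type_guard'] = len(self.code) + 2
        self.code += b'\x81\xf9' + struct.pack('<I', expected_tag)

class ICRepatcher:
    # Knows how to patch exactly what NunboxAssembler emitted.
    def __init__(self, asm):
        self.asm = asm

    def patch_type_guard(self, new_tag):
        off = self.asm.patch_offsets['type_guard']
        self.asm.code[off:off + 4] = struct.pack('<I', new_tag)

asm = NunboxAssembler()
asm.emit_type_guard(0xFFFF0001)
ICRepatcher(asm).patch_type_guard(0xFFFF0002)  # on a cache miss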

Frog ARMs

On real devices running the Fennec betas, we've seen marked improvements since Beta 3. [¶] Most notably, we've leapfrogged the stock Android 2.2 browser on the V8-V5 benchmark on both the Galaxy S and the Nexus One. Pretty graphs courtesy of Mark Finkle.

SunSpider performance comparison.
V8-V5 performance comparison.
Kraken performance comparison.

ARMn't you glad I didn't say banana?

Since I've run out of remotely-acceptable ARM malapropisms, these topics will be left to further discussion. Feel free to comment on anything that deserves further clarification!

Footnotes

[*]

For example, if you always emit a simple mov from a 32-bit register to a 32-bit register, that has a known constant length. The "variable width" part of "variable width instruction set" refers to the fact that different instructions do not generally take the same number of bytes. It does not mean that the encoding of a given instruction (like mov) with particular operands (like two 32-bit registers) is totally variable.

[†]

The team also believes that further experimentation with a 128-bit value representation for 64-bit systems could yield positive results.

[‡]

FD&C Red No. 40, to be precise.

[§]

"Nunbox" is a play on the term NaN-boxing. We have no idea how Luke comes up with these names, but we hope he never stops.

[¶]

On the fancy Tegra 2 board I was developing on, running the SunSpider harness on the JavaScript shell with methodjit-only, this work netted us a whopping 230% speedup on the V8-V4 benchmark and a 15% speedup on SunSpider 0.9.1.

Ship, or don't die!

Perhaps a corollary is, "Release early, or release toxin."

I admit I'm a bit of a Seth Godin fanboy — he's driven, omits needless words, and gets things done. His blog rarely has an unread count in my feed reader. At the same time, when you look up to someone, you can't help but expose some vulnerability.

One of his latest kick-ass entries, What did you ship in 2010?, put me in a total funk. I shipped a modest set of things this year, by which I mean that I found my list unimpressive. Maybe it was too short, maybe it wasn't innovative/creative enough, or maybe I was just being a negative Nancy.

In any case, Nancy found... er, I found the list-writing experience incredibly disappointing. [*] However, after some thought, I realized what I would probably say to Seth if we had a chat about it over coffee at Red Rock: I don't think I should really care.

Why? Because I'm probably not going to die tomorrow.

Kindergartener existentialism

A fun-size bit of existentialism is that human essence isn't fully realized until you die. When you die, your whole lifeline has played out and your effect on the world is fundamentally complete. To use a catchy existentialist marketing slogan, human existence precedes essence.

In a related vein, kindergarteners don't try to get their macaroni pictures displayed in art museums. When you're new to a scene and acquiring pre-requisite experience, there's no need to subject the rest of the world to your crap: in the common case, there's plenty more time to cultivate your essence and make your mark on the world. In fact, experimenting, throwing your crap away and moving on may be a much better use of your time than trying to ship something naïve or artless. [†]

My parents wouldn't want to put my broken patches up on their refrigerators. Even if they did, they don't have the kind of refrigerator that magnets can stick to.

Shipping in potentia

Like most people who suffer from over-achievement syndrome, I have fever dreams of instantaneously becoming an expert in every piece of tech I touch, innovation dripping from my fingertips as I puke rainbows and such. Discovering that talent and perseverance have limits is always a cruel come-down.

Perhaps because of these delusions, I initially found it hard to grasp my most important accomplishment of this past year: getting to know various aspects of a state-of-the-art, production, multi-platform language design/implementation and the surrounding processes and tech. That's not shipping! It is, however, necessary experience to ship higher-impact (and perhaps daydream-worthy) tech down the road.

Realistically, there are a number of other reasons to feel accomplished. When I left my last gig slightly under a year ago, not a single product I had written code for had shipped. (Although I'm totally rooting for one that was recently announced!) Now, every patch I write is put to the test in a development channel with millions of active daily users. I'm constantly and (relatively) shamelessly absorbing information from a team of brilliant and down-to-earth developers, my mentor Luke Wagner in particular. My scrappy throw-away side-projects keep me thinking creatively and questioning the status quo.

In all, this year was incredibly enriching.

The JS engine is more comfortable ground with each passing day. I've got the drive to give back important and innovative things. My existence precedes my essence.

I'm optimistic about the list for 2011.

Footnotes

[*]

Whatever the size of my contribution, I'm honored to say that my list for 2010 included, "Help to ship a working and efficient JägerMonkey implementation." Effin' a.

[†]

Of course, one has to be somewhat cautiously introspective — Seth also warns that continued concerns over naïveté/perfection are a natural result of a fearful mentality that he refers to as the "lizard brain".

A generator protocol trap

You should be careful that the functions you call from generators don't accidentally raise StopIteration exceptions.

Background

Generators are generally a little tricky in a behind-the-scenes sort of way. Returning from a generator is different from returning from a function — when a generator returns, it actually raises a StopIteration instance at the point where the caller advanced it.

>>> def foo():
...     yield 1
...     return
>>> i = foo()
>>> next(i)
1
>>> try:
...     next(i)
... except StopIteration as e:
...     print(id(e), e)
23688744

Snag

There's a snag that I run into at times: when you're writing a generator and that generator calls into other functions, be aware that those callees may accidentally raise StopIteration exceptions themselves.

def insidious(whoops=True):
    if whoops:
        raise StopIteration
    else:
        return 2

def generator():
    yield 1
    yield insidious()
    yield 3

if __name__ == '__main__':
    print([i for i in generator()])

This program happily prints out [1].

If you replace StopIteration with ValueError, you get a traceback, as you'd probably expect. A leaked StopIteration, however, propagates up to the code that moves the generator along, [*] which sees an uncaught StopIteration exception and terminates the loop.
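One defensive pattern, sketched here with invented helper names, is to wrap risky callees so that a leaked StopIteration becomes a loud error instead of a silent loop termination:

def insidious():
    raise StopIteration  # simulates an accidental leak

def carefully(fn, *args, **kwargs):
    # Refuse to let a callee's StopIteration silently end the
    # enclosing generator.
    try:
        return fn(*args, **kwargs)
    except StopIteration:
        raise RuntimeError('callee leaked a StopIteration')

def generator():
    yield 1
    yield carefully(insidious)
    yield 3

print([i for i in generator()])  # RuntimeError traceback, not [1]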

JavaScript

The same trap exists in the SpiderMonkey (Firefox) JavaScript dialect:

function insidious() {
    throw StopIteration;
}

function generator() {
    yield 1;
    yield insidious();
    yield 3;
}

print(uneval([i for (i in generator())]));

Which prints [1].

VM guts

As a side note for those interested in the guts of virtual machines, the primordial StopIteration class is known to the virtual machine, regardless of user hackery with the builtins:

>>> def simple_generator():
...     yield 1
...     raise StopIteration
...     yield 3
>>> print([i for i in simple_generator()])
[1]
>>> class FakeStopIteration(Exception): pass
>>> __builtins__.StopIteration = FakeStopIteration
>>> print([i for i in simple_generator()])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <listcomp>
  File "<stdin>", line 3, in simple_generator
__main__.FakeStopIteration

Design

You may look at this issue and think:

The fundamental problem is that uncaught exceptions propagate up through any number of function invocations. Generators should have been designed such that you can only stop the generator from within the generator function itself.

In an alternate design, a special value (like a StopIteration singleton) might be yield'd from the generator to indicate that it's done.

One issue with that alternative approach is that you're introducing a special-value-check into the hot path within the virtual machine — i.e. you'd be slowing down the common process of yielding iteration items. Using an exception only requires the VM to extend the normal exception handling machinery a bit and adds no additional overhead to the hot path. I think the measurable significance of this overhead is questionable.
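For illustration, here's a sketch of that hypothetical protocol; note that every single iteration pays for the sentinel comparison:

DONE = object()  # hypothetical "I'm finished" singleton

def sentinel_generator():
    yield 1
    yield 2
    yield DONE  # instead of raising StopIteration

def drive(gen):
    results = []
    while True:
        value = next(gen)
        if value is DONE:  # this check runs on the hot path
            break
        results.append(value)
    return results

print(drive(sentinel_generator()))  # [1, 2]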

Another issue is that it hurts the lambda abstraction — namely, the ability to factor your generator function into smaller helper functions that also have the ability to terminate the generator. In the absence of a language-understood exception, the programmer has to invent a new way for the helper functions to communicate to the generator that they'd like the generator to terminate.
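By contrast, here's what the exception-based design buys you under the semantics described in this post (helper names are invented): a factored-out function can end the enclosing generator with no extra plumbing.

def stop_when_exhausted(item):
    # A helper, factored out of the generator body, that terminates
    # the enclosing generator by raising the language-understood
    # exception.
    if item is None:
        raise StopIteration
    return item

def records(source):
    for raw in source:
        yield stop_when_exhausted(raw)

print(list(records([1, 2, None, 3])))  # [1, 2]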

Footnotes

[*]

The FOR_ITER bytecode in the CPython VM's for loops.

Successfully observing heap data in minidump stacks

If my bedroom's system of organization is to be believed, the heap is a great place to put things.

In all seriousness, though, sometimes you can't diagnose a crash without heap data. It can also be difficult to differentiate bogus information that the debugger presents to you from the real stuff.

Regular expressions (regexps) are particularly heapy: the regexp's source, the mini-program that the regexp compiles down to, the string that the regexp operates on, and the result of a match all live on the heap. Since our minidumps only capture stack data, it's difficult to glean relevant information when things go wrong with regexps.

Unintuition

So, if we want to see relevant heap data, we have to get it onto the stack. What does your gut tell you the solution is? Make a buffer on the stack and memcpy some data into it!

Ah, but the compiler hates your gut(s). It optimizes away both the stack data and the memcpy, because it can prove that the program doesn't observe any of the stack values.

Well, I suppose the compiler didn't really hate your guts — it actually has no idea that you wanted to observe that data. It only understands the semantics of the language that it's compiling, and those semantics state that the stack buffer was unobservable.

So how do we tell the compiler not to optimize away our stack buffer? One approach that people have told me does not work at the point of the crash is simply marking the stack buffer itself as volatile. [*]

So what do we know actually works?

In a recent bug I had success doing the following:

struct JSContext {
    // ...
    volatile jschar *sampleBuf;
};

void doBadThings(JSContext *heapContext, JSString *str) {
    /*
     * We've witnessed a weird crashy address in the bug reports, so when
     * we see that, we want to take a sample of the string that's
     * crashing.
     */
    if (aboutToCrossWeirdCrashyBoundary(str)) {
        jschar buf[128];
        heapContext->sampleBuf = buf;  /* the array decays to jschar* */
        memcpy(buf, str->chars(),
               JS_MIN(128, str->length()) * sizeof(jschar));  /* bytes, not chars */
    }

    // ... Point of crash!
}

And for values that need to be reliably observed through debug information at the point of the crash (i.e. not have their contents somehow affected by optimization):

struct JSContext {
    // ...
    volatile size_t *strLen;
};

void doOtherBadThings(JSContext *heapContext, JSString *str) {
    volatile size_t len = str->length();
    heapContext->strLen = &len;
    /*
     * We'll be able to observe the correct value of len within this
     * frame.
     */

    // ... Point of crash!
}

Note that the only real difference between these two examples is that I didn't mark the stack buffer itself as volatile in the first one, and that still worked out. More experimentation is needed! Unfortunately, figuring out what works for diagnostics has largely been trial and error.

(I also wonder if we can do much fancier things with breakpad — this is a quick-and-dirty solution that I knew was likely to work. Nobody that I've talked to so far knows how we'd go about registering breakpad hooks, so that's another thing to look into!)

Alternative approach

As dmandelin pointed out to me, you can also think of crashes as black boxes that take in a build/URL and produce a line number. If you can detect that you're about to crash, then you can switch on a value of interest (or if-ladder with arbitrary conditions) and intentionally early-crash on different arms of the switch, producing helpful line-number indicators and potentially narrowing down the source of the problem.

if (weAreAboutToCrash()) {
    if (this == NULL && aboutToBeSwallowedByTheSun)
        JS_CRASH("Bad this pointer.");
    else if (this->count == 666 && !circleOfProtectionBlack)
        JS_CRASH("Possessed.");
    else
        JS_CRASH("Unknown crash reason.");
}

Footnotes

[*]

It's confusing that this wouldn't work because, as lw points out, you might be able to mmap bits of the stack to MMIO space or something crazy that the compiler would then incorrectly optimize. Reads and writes to volatile storage are supposed to be one of the observable properties of a C/C++ program, right alongside I/O.