February 3, 2011

Picky monkeys PIC ARM

The Mozilla JavaScript-engine team has been hard at work since the shiny new JägerMonkey Just-In-Time compiler hit the betas. We're viciously ripping apart any bug that stands between us and shipping Firefox 4. One could say that we're coming at you like a SpiderMonkey.

Alongside our ferocious fixing, one of our late-game performance initiatives was to get all of our polymorphic inline caches (AKA PICs) enabled on ARM devices. It was low risk and of high benefit to our Firefox for Mobile browser, whose badass-yet-cute codename is Fennec.

Jacob Bramley and I took on this ARM support task in bug 588021, obviously building on excellent prior inline cache work from fellow team members David Anderson, Dave Mandelin, Sean Stangl, and Bill McCloskey.

tl;dr: Firefox for Mobile fast on ARM. Pretty graphs.

Melts in your mouth, not in your ARM

To recap, JägerMonkeyJM is also known as the "method compiler": it takes a method's bytecode as input and orders up the corresponding blob of machine code with some helpful information on the side. Its primary sub-components are the register tracker, which helps the compiler transform the stack-based bytecode and reuse already-allocated machine registers intelligently, and the MacroAssembler, which is the machine-code-emitting component we imported from Webkit's Nitro engine.

High level block diagram of the |JM| compiler.

The MacroAssembler is the secret sauce for JägerMonkeyJM's platform independence. It's an elegantly-designed component that can be used to emit machine code for multiple target architectures: all of x86, x86-64, and ARM assembly are supported through the same C++ interface! This abstraction is the reason that we only need one implementation of the compiler for all three architectures, which has been a clear win in terms of cross-platform feature additions and maintainability.

"So", you ask, "if you've got this great MacroAssembler-thingy-thing, why didn't all the inline caches work on all the platforms to begin with?" Or, alternatively, "If all the compiler code is shared among all the platforms, why didn't all the inline caches crash on ARM?"

The answer is that some platform-specifics had crept into our compiler code!

ARM'd and ifdef-dangerous

As explained in the entry on inline caches, an inline cache is a chunk of self-modifying machine code. A machine code "template" is emitted that is later tweaked to reflect the cached result of a common value. If you're frequently accessing the nostrilCount property of Nose objects, inline caches make that fast by embedding a shortcut for that access into the machine code itself.

In the machine code "template" that we use for inline caches, we need to know where certain constants, like object type and object-property location, live as offsets into the machine code so that we can change them later, during a process called repatching. However, when our compiler says, "If this value is not 0xdeadbeef, go do something else," we wind up with different encodings on each platform.

Demonstration of immediate encodings across various platforms.

As you may have guessed, machine-code offsets are different for each platform, which made it easier for other subtle platform-specifics to creep into the compiler as well.

To answer the question raised earlier, the MacroAssembler interface wasn't heavily relied on for the early inline cache implementations. Inline caches were first implemented for x86, and although x86 is a variable-width instruction set, all of the instruction sequences emitted from the compiler had a known instruction width and format. [*] This permitted us to use known-constant-offset values for the x86 platform inline caches. These known-constant-offsets never changed and so didn't require any space or access time overhead in side-structures. They seemed like the clear solution when x86 was the only platform to get up-and-running.

Then x86-64 (AKA x64) came along, flaunting its large register set and colorful plumage. On x64, the instruction sequence did not have a known width and format! Depending on whether the extended register set is used, things like mov instructions may require a special REX prefix byte in the instruction stream (highlighted in blue above). This led to more ifdefs — on x64 a bunch more values have to be saved in order to know where to patch our inline caches!

As a result, getting inline caches working on ARM was largely a JägerMonkey refactoring effort. Early on, we had used conditional compilation (preprocessor flags) to get inline caches running on a platform-by-platform basis, which was clearly the right decision for rapid iteration, but we decided that it was time to pay down some of our technical debt.

Paying down the debt: not quite an ARM and a leg

The MacroAssembler deals with raw machine values — you can tell it dull-sounding machine-level things like, "Move this 17 bit sign-extended immediate into the EAX register."

On the other hand, we have our own awesome-sounding value representation in the SpiderMonkey engine: on both 32-bit and 64-bit platforms every "JS value" is a 64-bit wide piece of data that contains both the type of the data and the data itself. [†] Because the compiler is manipulating these VM values all the time, when we started the JägerMonkeyJM compiler it was only natural to put the MacroAssembler in a delicious candy coating that also knew how to deal with these VM values.

High level, candy-coated block diagram of the |JM| compiler.

The NunboxAssembler, pictured in red, [‡] is a specialized assembler with routines to deal with our "nunbox" value representation. [§] The idea of the refactoring was to candy-coat a peer of the MacroAssembler, the Repatcher, with routines that knew how to patch common inline cache constructs that the NunboxAssembler was emitting.

With the inline cache Repatcher in place, we were once again able to move all the platform-specific code out of the compiler and into a single, isolated part of the code base, hidden behind a common interface.

High level block diagram of the |JM| compiler with the inline cache repatcher in place.

Routines like NunboxAssembler::emitTypeGuard, which knows how to emit a type guard regardless of the platform, are paired with routines like ICRepatcher::patchTypeGuard(newType), which knows how to patch a type guard regardless of platform. Similarly, NunboxAssembler::loadObjectProperty has a ICRepatcher::patchObjectPropertyLoad. The constructs that are generated by the NunboxAssembler are properly patched by the corresponding ICRepatcher method on a miss. It's all quite zen.

Frog ARMs

On real devices running the Fennec betas, we've seen marked improvements since Beta 3. [¶] Most notably, we've leapfrogged the stock Android 2.2 browser on the V8-V5 benchmark on both the Galaxy S and the Nexus One. Pretty graphs courtesy of Mark Finkle.

SunSpider performance comparisonV8-V5 performance comparisonKraken performance comparison

ARMn't you glad I didn't say banana?

Since I've run out of remotely-acceptable ARM malapropisms, these topics will be left to further discussion. Feel free to comment on anything that deserves further clarification!

Footnotes

[*]

For example, if you always emit a simple mov from a 32-bit register to a 32-bit register, that has a known constant length. The "variable width" part of "variable width instruction set" refers to the fact that different instructions do not generally take the same number of bytes. It does not mean that the encoding of a given instruction (like mov) with particular operands (like two 32-bit registers) is totally variable.

[†]

The team also believes that further experimentation with a 128-bit value representation for 64-bit systems could yield positive results.

[‡]

FD&C Red No. 40, to be precise.

[§]

"Nunbox" is a play on the term NaN-boxing. We have no idea how Luke comes up with these names, but we hope he never stops.

[¶]

On the fancy Tegra 2 board I was developing on, running the SunSpider harness on the JavaScript shell with methodjit-only, this work net us a whopping 230% speedup on the V8-V4 benchmark and a 15% speedup on SunSpider 0.9.1.

Idiomatic Python refactoring: for-else, "in" (contains) operator

I was perusing the App Engine SDK and I came across this snippet:

if self.choices:
  match = False
  for choice in self.choices:
    if choice == value:
      match = True
  if not match:
    raise BadValueError('Property %s is %r; must be one of %r' %
                        (self.name, value, self.choices))

Since I don't work with many other Python programmers, I always have trouble figuring out what interesting tidbits would be useful to post in, say, a blog entry. I don't have a good understanding of the popular knowledge level, but I figure that I can't go too wrong refactoring code written by Google engineers (who I naively assume are all as cool as Steve Yegge). [*]

The for-else statement

Let's forget about self for now [†] and refactor to use an obscure (but useful) Python feature, the for-else construct. for-else removes the necessity for the boolean-flag-state idiom from the original code, which is often used in lower level languages. [‡]

if choices:
    for choice in choices:
        if choice == value:
            break
    else:
        raise BadValueError

The for-else statement looks a little strange when you first encounter it, but I've come to love it. The else suite is evaluated if you don't break out of the for loop. In this case, if we didn't break out of the for loop, then we never found a value equivalent to choice.

We also gain some efficiency over the original by using the break statement as soon as we find a match: there's no need to keep looking if you've already found a result! This can save you from iterating over all len(choices) items if you find it's a valid choice in the first iteration.

in (contains) operator

Here is an even more readable and Python-like refactoring that uses the in operator: [§]

if choices and value not in choices:
    raise BadValueError

The in operator works on any iterable object and performs the same behavior as the code above: it looks for any item within self.choices such that choice == item. If it finds it early in the list, it won't keep looking. This is similar behavior to our early break statement from the first refactoring.

Just like the original code with the for loop, the in operator raises a TypeError if choices is not iterable. The in operator is effectively a drop-in replacement for the (more verbose) for loop when it comes to membership testing.

Footnotes

[*]

You should read his blog if you don't already.

[†]

For the language lawyers: we're forgetting about the fact that this code was intended to be executed in a bound instance method. ;)

[‡]

For example, C. For more information on programming languages and their "heights", see this Wikipedia entry.

[§]

Yeah, yeah... technically it's the not in operator.

Python's generators sure are handy

While rewriting some older code today, I ran across a good example of the clarity inherent in Python's generator expressions. Some time ago, I had written this weirdo construct:

for regex in date_regexes:
    match = regex.search(line)
    if match:
        break
else:
    return
# ... do stuff with the match

The syntax highlighting makes the problem fairly obvious: there's way too much syntax!

First of all, I used the semi-obscure "for-else" construct. For those of you who don't read the Python BNF grammar for fun (as in: the for statement), the definition may be useful:

So long as the for loop isn't (prematurely) terminated by a break statement, the code in the else suite gets evaluated. To restate (in the contrapositive): the code in the else suite doesn't get evaluated if the for loop is terminated with a break statement. From this definition we can deduce that if a match was found, I did not want to return early.

That's way too much stuff to think about. Generators come to the rescue!

def first(iterable):
    """:return: The first item in the iterable that evaluates
    as True.
    """
    for item in iterable:
        if item:
            return item
    return None

match = first(regex.search(line) for regex in regexes)
if not match:
    return
# ... do stuff with the match

At a glance, this is much shorter and more comprehensible. We pass a generator expression to the first function, which performs a kind of short-circuit evaluation — as soon as a match is found, we stop running regexes (which can be expensive). This is a pretty rockin' solution, so far as I can tell.

Prior to generator expressions, to do something similar to this we'd have to use a list comprehension, like so:

match = first([regex.search(line) for regex in regexes])
if not match:
    return
# ... do stuff with the match

We dislike this because the list comprehension will run all of the regexes, even if one already found a match. What we really want is the short circuit evaluation provided by generator expressions and the any builtin, as shown above. Huzzah!

Edit

Originally I thought that the any built-in returned the first object which evaluated to a boolean True, but it actually returns the boolean True if any of the objects evaluate to True. I've edited to reflect my mistake.