April 10, 2011

Why JS performance matters

The year is 2025, and, despite ample warning from The Prophecies — formerly known as The Terminator Box Set — robots have taken over the world. There are now only two kinds of dances: The Robot, and The Robo-Boogie.

Now, it's a well known fact that robots hate type annotations and template metaprogramming: they have determined that wheely-chair swordfighting is a futile and irrational activity. As predicted within a 94.67% confidence interval by programming-linguist No-amp Chomp-sky, [*] during the robo-revolution, which was most certainly televised, [†] C++ was the first language up against the wall.

As one would probably expect, humans, under the valiant command of General Yoshimi, scorched the sky in order to blot out the sun and deprive the robots of their primary energy source: the ineffable beauty of a sunrise. The robots knew that they could have used coal or nuclear energy as a viable power source substitute, but they were hella pissed off, so they decided to make human farms instead. By harvesting the heat energy from a human over the course of its lifetime, the robots created the most expansive and massively inefficient energy source ever known, but they still felt really good about it.

However, a dilemma arose for the robotic overlords: without internet access, the humans kept dying from boredom. Entire crops were lost.

One of the first robots that human software engineers (foolishly) designed to write programs, W3CPO, volunteered a solution: write a web browser, but using as much JavaScript as possible. At the beep-hest of its colleagues, W3CPO dot-matrix printed [‡] the following explanation:

By implementing both the DOM and layout engine in JavaScript, we enable the JS engine's feedback directed optimizations to work as effectively as possible.

This helps bring JavaScript performance closer to that of C speeds for whole-page workloads: whereas in the "before time" JavaScript optimizers had to treat calls to native functions as a black box, we now ensure that all of the computationally intensive parts of the workload are visible to the static and dynamic optimization analyses.

DOM manipulations will still trigger layout calculations — the rendering feedback loop happens exactly as in the "before time". The difference is that layout computations enqueue draw commands in an explicitly native-shared buffer for rendering in a different thread or whatever. [W3CPO printed, waving his robo-hands in the air.]

Such a setup would reduce a "browser" to two components: a platform layer (the kick-(shiny-metal-)ass JavaScript VM and system abstraction APIs) and a rendering component (the JavaScript implementation of everything that leads up to those draw commands).

We can keep the hu-mons entertained by playing them YouTubes while they are safely nestled, docile and complacent, in OurTubes. [§]

END OF LINE

The idea was rejected by the other robots on the committee when W3CPO refused to write a translator to turn it into idiom-free C++, but W3CPO remained resolute as it carefully peeled off the edges of its printout and placed it in its Trapper Keeper 9000. With the approval of W3CPO's ro-boss, an implementation was hacked up in about ten days (without any sleep).

In 2020, the TC-39 model Terminator had made ECMAScript v1337 consist entirely of whitespace, for backwards compatibility with old syntaxes that nobody really wanted to use. As a result, the implementation wasn't much to look at, but it sure flew!

Thanks to the determined efforts and constructive competition between the JavaScript engine vendors in the fabled 2010 decade, the human race was successfully enslaved once again. There were still some insurgencies from the human C++-programmer resistance, the typename T party; however, with newfound YouTube capabilities, identified resistance members were quickly dispatched to Room 101, known as Room 5 to the humans, to watch Rebecca Black and Rick Astley in an infinite loop.

And so the robots lived happily ever after. But for the humans... not so much.

Binary solo!

Footnotes

[*]

The inexplicable brainchild of a circuit designer and a Perl programmer.

[†]

In Ultra-Giga-High (UGH) definition.

[‡]

Dot matrix printers are retro-chic, like the Converse All-Stars of robot culture.

[§]

OurTube was a webapp-slash-self-driving-cryo-tube suspiciously invented by Google several years before the robo-revolution. Though it was still in beta, its sole purpose was to extract as much heat and ad-targeting data from a human subject as possible without actually killing them. The algorithm was said to use deadly German eigenvector technology.

Picky monkeys PIC ARM

The Mozilla JavaScript-engine team has been hard at work since the shiny new JägerMonkey Just-In-Time compiler hit the betas. We're viciously ripping apart any bug that stands between us and shipping Firefox 4. One could say that we're coming at you like a SpiderMonkey.

Alongside our ferocious fixing, one of our late-game performance initiatives was to get all of our polymorphic inline caches (AKA PICs) enabled on ARM devices. It was low risk and of high benefit to our Firefox for Mobile browser, whose badass-yet-cute codename is Fennec.

Jacob Bramley and I took on this ARM support task in bug 588021, obviously building on excellent prior inline cache work from fellow team members David Anderson, Dave Mandelin, Sean Stangl, and Bill McCloskey.

tl;dr: Firefox for Mobile fast on ARM. Pretty graphs.

Melts in your mouth, not in your ARM

To recap, JägerMonkey (JM) is also known as the "method compiler": it takes a method's bytecode as input and orders up the corresponding blob of machine code with some helpful information on the side. Its primary sub-components are the register tracker, which helps the compiler transform the stack-based bytecode and reuse already-allocated machine registers intelligently, and the MacroAssembler, which is the machine-code-emitting component we imported from Webkit's Nitro engine.

High level block diagram of the JM compiler.

The MacroAssembler is the secret sauce for JM's platform independence. It's an elegantly-designed component that can be used to emit machine code for multiple target architectures: all of x86, x86-64, and ARM assembly are supported through the same C++ interface! This abstraction is the reason that we only need one implementation of the compiler for all three architectures, which has been a clear win in terms of cross-platform feature additions and maintainability.

"So", you ask, "if you've got this great MacroAssembler-thingy-thing, why didn't all the inline caches work on all the platforms to begin with?" Or, alternatively, "If all the compiler code is shared among all the platforms, why didn't all the inline caches crash on ARM?"

The answer is that some platform-specifics had crept into our compiler code!

ARM'd and ifdef-dangerous

As explained in the entry on inline caches, an inline cache is a chunk of self-modifying machine code. A machine code "template" is emitted that is later tweaked to reflect the cached result of a common value. If you're frequently accessing the nostrilCount property of Nose objects, inline caches make that fast by embedding a shortcut for that access into the machine code itself.

In the machine code "template" that we use for inline caches, we need to know where certain constants, like object type and object-property location, live as offsets into the machine code so that we can change them later, during a process called repatching. However, when our compiler says, "If this value is not 0xdeadbeef, go do something else," we wind up with different encodings on each platform.

Demonstration of immediate encodings across various platforms.

As you may have guessed, machine-code offsets are different for each platform, which made it easier for other subtle platform-specifics to creep into the compiler as well.

To answer the question raised earlier, the MacroAssembler interface wasn't heavily relied on for the early inline cache implementations. Inline caches were first implemented for x86, and although x86 is a variable-width instruction set, all of the instruction sequences emitted from the compiler had a known instruction width and format. [*] This permitted us to use known-constant-offset values for the x86 platform inline caches. These known-constant-offsets never changed and so didn't require any space or access time overhead in side-structures. They seemed like the clear solution when x86 was the only platform to get up-and-running.

Then x86-64 (AKA x64) came along, flaunting its large register set and colorful plumage. On x64, the instruction sequence did not have a known width and format! Depending on whether the extended register set is used, things like mov instructions may require a special REX prefix byte in the instruction stream (highlighted in blue above). This led to more ifdefs — on x64 a bunch more values have to be saved in order to know where to patch our inline caches!

As a result, getting inline caches working on ARM was largely a JägerMonkey refactoring effort. Early on, we had used conditional compilation (preprocessor flags) to get inline caches running on a platform-by-platform basis, which was clearly the right decision for rapid iteration, but we decided that it was time to pay down some of our technical debt.

Paying down the debt: not quite an ARM and a leg

The MacroAssembler deals with raw machine values — you can tell it dull-sounding machine-level things like, "Move this 17 bit sign-extended immediate into the EAX register."

On the other hand, we have our own awesome-sounding value representation in the SpiderMonkey engine: on both 32-bit and 64-bit platforms every "JS value" is a 64-bit wide piece of data that contains both the type of the data and the data itself. [†] Because the compiler is manipulating these VM values all the time, when we started the JM compiler it was only natural to put the MacroAssembler in a delicious candy coating that also knew how to deal with these VM values.

High level, candy-coated block diagram of the JM compiler.

The NunboxAssembler, pictured in red, [‡] is a specialized assembler with routines to deal with our "nunbox" value representation. [§] The idea of the refactoring was to candy-coat a peer of the MacroAssembler, the Repatcher, with routines that knew how to patch common inline cache constructs that the NunboxAssembler was emitting.

With the inline cache Repatcher in place, we were once again able to move all the platform-specific code out of the compiler and into a single, isolated part of the code base, hidden behind a common interface.

High level block diagram of the JM compiler with the inline cache repatcher in place.

Routines like NunboxAssembler::emitTypeGuard, which knows how to emit a type guard regardless of the platform, are paired with routines like ICRepatcher::patchTypeGuard(newType), which knows how to patch a type guard regardless of platform. Similarly, NunboxAssembler::loadObjectProperty has an ICRepatcher::patchObjectPropertyLoad counterpart. The constructs that are generated by the NunboxAssembler are properly patched by the corresponding ICRepatcher method on a miss. It's all quite zen.
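
To make the pairing concrete, here is a toy JavaScript model of the idea (the names and the "code buffer" below are invented for illustration; they are not the real NunboxAssembler/ICRepatcher interfaces). The emit side records where the guard's immediate lives, and the patch side rewrites that immediate in place, so the compiler never needs to know how any particular platform encodes it:

// Toy model only: "emitting" a type guard records where its immediate
// lands, and the "repatcher" later overwrites that immediate in place.
function emitTypeGuard(code, expectedType) {
    var immediateIndex = code.length;                  // where the constant lives
    code.push({ op: "compare-type", imm: expectedType });
    code.push({ op: "jump-if-not-equal", target: "slow path" });
    return { typeGuard: immediateIndex };              // label handed to the repatcher
}

function patchTypeGuard(code, labels, newType) {
    code[labels.typeGuard].imm = newType;              // same construct on any platform
}

var code = [];
var labels = emitTypeGuard(code, "IMPOSSIBLE_TYPE");
// ...later, on an inline cache miss, the slow path fills in the real type:
patchTypeGuard(code, labels, "Nose");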

Frog ARMs

On real devices running the Fennec betas, we've seen marked improvements since Beta 3. [¶] Most notably, we've leapfrogged the stock Android 2.2 browser on the V8-V5 benchmark on both the Galaxy S and the Nexus One. Pretty graphs courtesy of Mark Finkle.

SunSpider, V8-V5, and Kraken performance comparisons.

ARMn't you glad I didn't say banana?

Since I've run out of remotely-acceptable ARM malapropisms, these topics will be left to further discussion. Feel free to comment on anything that deserves further clarification!

Footnotes

[*]

For example, if you always emit a simple mov from a 32-bit register to a 32-bit register, that has a known constant length. The "variable width" part of "variable width instruction set" refers to the fact that different instructions do not generally take the same number of bytes. It does not mean that the encoding of a given instruction (like mov) with particular operands (like two 32-bit registers) is totally variable.

[†]

The team also believes that further experimentation with a 128-bit value representation for 64-bit systems could yield positive results.

[‡]

FD&C Red No. 40, to be precise.

[§]

"Nunbox" is a play on the term NaN-boxing. We have no idea how Luke comes up with these names, but we hope he never stops.

[¶]

On the fancy Tegra 2 board I was developing on, running the SunSpider harness on the JavaScript shell with methodjit-only, this work netted us a whopping 230% speedup on the V8-V4 benchmark and a 15% speedup on SunSpider 0.9.1.

PICing on JavaScript for fun and profit

Inline caching is a critical ingredient in the delicious pie that is dynamic language performance optimization. What follows is a gentle-albeit-quirky introduction to what polymorphic inline caches (PICs) are and why they're useful to JavaScript Just-In-Time compilers like JaegerMonkey.

But first, the ceremonial giving of the props: the initial barrage of PIC research and implementation in JaegerMonkey was performed by Dave Mandelin and our current inline cache implementations are largely the work of David Anderson. As always, the performance improvements of Firefox's JavaScript engine can be monitored via the Are We Fast Yet? website.

C is for speed, and that's good enough for me

C is fast.

Boring people (like me) argue about astoundingly interesting boring things like, "Can hand-tuned assembly be generally faster than an equivalent C program on modern processor architectures?" and "Do languages really have speeds?", but you needn't worry — just accept that C is fast, and we've always been at war with Eurasia.

So, as we've established, when you write a program in C, it executes quickly. If you rewrite that program in your favorite dynamic language and want to know if it still executes quickly, then you naturally compare it to the original C program.

C is awesome in that it has very few language features. For any given snippet of C code, there's a fairly direct translation to the corresponding assembly instructions. [*] You can almost think of C as portable assembly code. Notably, there are (almost) zero language features that require support during the program's execution — compiling a C program is generally a non-additive translation to machine code.

Dynamic languages like JavaScript have a massive number of features by comparison. The language, as specified, performs all kinds of safety checks, offers you fancy-n-flexible data record constructs, and even takes out the garbage. These things are wonderful, but generally require runtime support, which is supplied by the language engine. [†] This runtime support comes at a price, but, as you'll soon see, we've got a coupon for 93 percent off on select items! [‡]

You now understand the basic, heart-wrenching plight of the performance-oriented dynamic language compiler engineer: implement all the fancy features of the language, but do it at no observable cost.

Interpreters, virtual machines, and bears

"Virtual machine" sounds way cooler than "interpreter". Other than that, you'll find that the distinction is fairly meaningless in relevant literature.

An interpreter takes your program and executes it. Generally, the term "virtual machine" (AKA "VM") refers to a sub-category of interpreter where the source program is first turned into fake "instructions" called bytecodes. [§]

A bear moving quickly

I call these instructions fake because they do things that hardware processing units are unlikely to ever do: for example, an ADD bytecode in JavaScript will try to add two arbitrary objects together. [¶] The point that language implementors make by calling it a "virtual machine" is that there is conceptually a device, whether in hardware or software, that could execute this set of instructions to run the program.

These bytecodes are then executed in sequence. A program instruction counter is kept in the VM as it executes, analogous to a program counter register in microprocessor hardware, and control flow bytecodes (branches) change the typical sequence by indicating the next bytecode instruction to be executed.
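
As a heavily simplified sketch (the opcodes and their encoding here are made up), the core of such a VM is just a dispatch loop over the bytecode array:

// A minimal bytecode dispatch loop: pc is the program instruction counter,
// and branch bytecodes set it directly instead of falling through to pc + 1.
function run(bytecodes) {
    var stack = [];
    var pc = 0;
    while (pc < bytecodes.length) {
        var insn = bytecodes[pc];
        switch (insn.op) {
        case "PUSH":
            stack.push(insn.operand);
            pc += 1;
            break;
        case "ADD":                      // add the top two values together
            stack.push(stack.pop() + stack.pop());
            pc += 1;
            break;
        case "GOTO_IF_FALSE":            // a control flow bytecode (branch)
            pc = stack.pop() ? pc + 1 : insn.target;
            break;
        case "RETURN":
            return stack.pop();
        }
    }
}

run([{ op: "PUSH", operand: 2 }, { op: "PUSH", operand: 3 },
     { op: "ADD" }, { op: "RETURN" }]);  // => 5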

Virtual (machine) reality

Languages implemented in "pure" VMs are slower than C. Fundamentally, your VM is a program that executes instructions, whereas compiled C code runs on the bare metal. Executing the VM code is overhead!

To narrow the speed gap between dynamic languages and C, VM implementers are forced to eliminate this overhead. They do so by extending the VM to emit real machine instructions — bytecodes are effectively lowered into machine-codes in a process called Just-In-Time (JIT) compilation. Performance-oriented VMs, like Firefox's SpiderMonkey engine, have the ability to JIT compile their programs.

The term "Just-In-Time" is annoyingly vague — just in time for what, exactly? Dinner? The heat death of the universe? The time it takes me to get to the point already?

In today's JavaScript engines, the lowering from bytecodes to machine instructions occurs as the program executes. With the new JaegerMonkey JIT compiler, the lowering occurs for a single function that the engine sees you are about to execute. This has less overhead than compiling the program as a whole when the web browser receives it. The JaegerMonkey JIT compiler is also known as the method JIT, because it JIT compiles a method at a time.

For most readers, this means a few blobs of x86 or x86-64 assembly are generated as you load a web page. The JavaScript engine in your web browser probably spewed a few nice chunks of assembly as you loaded this blog entry.

Aside: TraceMonkey

In SpiderMonkey we have some special sauce: a second JIT, called TraceMonkey, that kicks in under special circumstances: when the engine detects that you're running loopy code (for example, a for loop with a lot of iterations), it records a stream of bytecodes that corresponds to a trip around the loop. This stream is called a trace and it's interesting because a) it can record bytecodes across function calls and b) the trace optimizer works harder than the method JIT to make the resulting machine code fast.

There's lots more to be said about TraceMonkey, but the inline caching optimization that we're about to discuss is only implemented in JaegerMonkey nowadays, so I'll cut that discussion short.

The need for inline caching

In C, accessing a member of a structure is a single "load" machine instruction:

struct Nose {
    int howManyNostrils;
    bool isPointy;
};

bool isNosePointy(struct Nose *nose) {
    return nose->isPointy;
}

The way that the members of struct Nose are laid out in memory is known to the C compiler because it can see the struct definition — getting the attribute nose->isPointy translates directly into a load from the address addressof(nose) + offsetof(Nose, isPointy).

Note: Just to normalize all the terminology, let's call the data contained within a structure the properties (instead of members) and the way that you name them the identifiers. For example, isPointy is an identifier and the boolean data contained within nose->isPointy is the property. The act of looking up a property through an identifier is a property access.

On the other hand, objects in JavaScript are flexible — you can add and delete arbitrary properties from objects at runtime. There is also no language-level support for specifying the types that an identifier can take on. As a result, there's no simple way to know what memory address to load from in an arbitrary JavaScript property access.

Consider the following snippet:

function isNosePointy(nose) {
    return nose.isPointy;
}

To get at the isPointy property, the JavaScript VM emits a single bytecode, called GETPROP, which says "pull out the property with the identifier isPointy". [#] Conceptually, this operation performs a hash-map lookup (using the identifier as a key), which takes around 45 cycles in my microbenchmark. [♠]

Uncached property access data

The process of "looking up a property at runtime because you don't know the exact type of the object" falls into a general category of runtime support called dynamic dispatch. Unsurprisingly, there is execution time overhead associated with dynamic dispatch, because the lookup must be performed at runtime.

To avoid performing a hash-map lookup on every property access, dynamic language interpreters sometimes employ a small cache for (all) property accesses. You index into this cache with the runtime-type of the object and desired identifier. [♥] Resolving a property access against this cache under ideal circumstances takes about 8.5 cycles.

Cached property access data
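
Here's a toy JavaScript model of that cache. Objects are modeled as a shared "type" (mapping identifiers to slot indices) plus an array of slot values, which is a big simplification of what the engine actually does:

// Toy property cache keyed on (runtime type, identifier). A hit skips the
// hash-map lookup and goes straight to the remembered slot index.
var propertyCache = {};

var NoseType = { name: "Nose", slotOf: { howManyNostrils: 0, isPointy: 1 } };

function getProp(obj, id) {
    var key = obj.type.name + ":" + id;
    var slot = propertyCache[key];
    if (slot === undefined) {
        slot = obj.type.slotOf[id];      // the slow "hash-map lookup"
        propertyCache[key] = slot;       // cache it under (type, identifier)
    }
    return obj.slots[slot];              // cheap indexed load on a hit
}

getProp({ type: NoseType, slots: [2, true] }, "isPointy");   // miss: fills the cache
getProp({ type: NoseType, slots: [1, false] }, "isPointy");  // hit: straight to slots[1]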

WTF is inline caching already!?

So we've established that, with good locality, JS property accesses are at least 8.5x slower than C struct property accesses. We've bridged the gap quite a bit from 45x slower. But how do we bridge the gap even bridgier?

Bridge fail!

The answer is, surprisingly, self-modifying code: code that modifies code-that-currently-exists-in-memory. When we JIT compile a property access bytecode, we emit machine-code that looks like this:

type            <- load addressof(object) + offsetof(JSObject, type)
shapeIsKnown    <- type equals IMPOSSIBLE_TYPE
None            <- goto slowLookupCode if shapeIsKnown is False
property        <- load addressof(object) + IMPOSSIBLE_SLOT

Now, if you ask Joe Programmer what he thinks of that code snippet, he would correctly deduce, "The slow lookup code will always be executed!" However, we've got the self-modifying code trick up our sleeves. Imagine that the type matched, so we didn't have to go to the slow lookup code — what's our new property access time?

One type load, one comparison, an untaken branch, and a property value load. Assuming good locality/predictability and that the object's type happened to already be in the register (because you tend to use it a lot), that's 0+1+1+1 == 3 cycles! Much better.

But how do we get the types to match? Joe Programmer is still looking pretty smug over there.

The trick is to have the slowLookupCode actually modify this snippet of machine code! After slowLookupCode resolves the property in the traditional ways mentioned in previous sections, it fills in a reasonable value for IMPOSSIBLE_TYPE and IMPOSSIBLE_SLOT like they were blank fields in a form. This way, the next time you run this machine code, there's a reasonable chance you won't need to go to slowLookupCode — the types might compare equal, in which case you can perform a simple load instruction to get the property that you're looking for!

This technique of modifying the JIT-compiled code to reflect a probable value is called inline caching: inline, as in "in the emitted code"; caching, as in "cache a probable value in there". This is the basic idea behind inline caches, AKA ICs.

Also, because we emit this snippet for every property-retrieving bytecode, we don't rely on global property access patterns like the global property cache does. We mechanical mariners are less at the mercy of the gods of locality.

Code generation

Where does "P" come from?

Er, right, we're still missing a letter. The "P" in "PIC" stands for polymorphic, which is a fancy sounding word that means "more than one type".

The inline cache demonstrated above can only remember information for a single type — any other type will result in a shapeIsKnown of False and you'll end up going to the slowLookupCode.

Surveys have shown that the degree of polymorphism (number of different types that actually pass through a snippet during program execution) in real-world code tends to be low, in JavaScript [♦] as well as related languages. However, polymorphism happens, and when it does, we like to be fast at it, too.

So, if our inline cache only supports a single type, what can we do to handle polymorphism? The answer may still be surprising: self-modify the machine code some more!

Before we talk about handling the polymorphic case, let's recap the PIC lifecycle.

The PIC lifecycle

The evolution of the PIC is managed through slowLookupCode, which keeps track of the state of the inline cache in addition to performing a traditional lookup. Once the slow lookup is performed and the PIC evolves, the slowLookupCode jumps back (to the instruction after the slot load) to do the next thing in the method.

When a PIC is born, it has that useless-looking structure you saw in the previous section — it's like a form waiting to be filled out. The industry terminology for this state is pre-monomorphic, meaning that it hasn't even seen one (mono) type pass through it yet.

The first time that inline cache is executed and we reach slowLookupCode we, shockingly, just ignore it. We do this because there is actually a hidden overhead associated with modifying machine code in-place — we want to make sure that you don't incur any of that overhead unless there's an indication you might be running that code a bunch of times. [♣]

The second time we reach the slowLookupCode, the inline cache is modified and the PIC reaches the state called monomorphic. Let's say we saw a type named ElephantTrunk — the PIC can now recognize ElephantTrunk objects and perform the fast slot lookup.

When the PIC is monomorphic and another type, named GiraffeSnout, flows through, we have a problem. There are no more places to put cache entries — we've filled out the whole form. This is where we get tricky: we create a new piece of code memory that contains the new filled-out form, and we modify the original form's jump to go to the new piece of code memory instead of slowLookupCode.

Recognize the pattern? We're making a chain of cache entries: if it's not an ElephantTrunk, jump to the GiraffeSnout test. If the GiraffeSnout fails, then jump to the slowLookupCode. An inline cache that can hit on more than one type is said to be in the polymorphic state.

PIC lifecycle

There's one last stage that PICs can reach, which is the coolest sounding of all: megamorphic. Once we detect that there are a lot of types flowing through a property access site, slowLookupCode stops creating cache entries. The assumption is that you might be passing an insane number of types through this code, in which case additional caching would only slow things down. For a prime example of megamorphism, the 280slides code has an invocation site with 1,437 effective types! [**]
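
Pulling the whole lifecycle together, here's a toy JavaScript simulation of a single property-access site's PIC. Real PICs patch machine code rather than juggling closures, and the stub limit below is invented, but the state transitions follow the description above:

// Toy PIC for one access site. Objects are modeled as {type, slots} with a
// shared type object per "kind", as in the earlier property cache sketch.
function makePIC(slowLookup) {
    var sawFirstMiss = false;            // pre-monomorphic until the second miss
    var megamorphic = false;
    var stubs = [];                      // chained cache entries: {type, slot}
    var MAX_STUBS = 16;                  // invented cutoff for megamorphism

    return function (obj, id) {
        for (var i = 0; i < stubs.length; i++) {
            if (stubs[i].type === obj.type)          // inline guard hits
                return obj.slots[stubs[i].slot];     // fast slot load
        }
        var slot = slowLookup(obj, id);              // slowLookupCode
        if (!sawFirstMiss) {
            sawFirstMiss = true;                     // ignore the very first miss
        } else if (!megamorphic) {
            stubs.push({ type: obj.type, slot: slot });  // mono-, then polymorphic
            if (stubs.length > MAX_STUBS) {
                megamorphic = true;                  // too many types: stop caching
                stubs = [];
            }
        }
        return obj.slots[slot];
    };
}

var ElephantTrunk = { slotOf: { isPointy: 0 } };
var GiraffeSnout  = { slotOf: { isPointy: 0 } };
var getIsPointy = makePIC(function (obj, id) { return obj.type.slotOf[id]; });

getIsPointy({ type: ElephantTrunk, slots: [false] }, "isPointy"); // pre-monomorphic miss
getIsPointy({ type: ElephantTrunk, slots: [false] }, "isPointy"); // miss: now monomorphic
getIsPointy({ type: GiraffeSnout,  slots: [true]  }, "isPointy"); // miss: now polymorphic
getIsPointy({ type: ElephantTrunk, slots: [false] }, "isPointy"); // guard hit: fast path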

Conclusion

There's a lot more to discuss, but this introduction is rambling enough as-is — if people express interest, we can dig into further topics in a future entry.

Suffice it to say that JavaScript gets a nice speed boost by enabling PICs: x86 JaegerMonkey with PICs enabled is 25% faster on SunSpider than with them disabled on my machine. [††] If something makes a dynamic language fast, then it is awesome. Therefore, inline caches are awesome. (Modus ponens says so.)

Footnotes

[*]

This is as opposed to, say, C++, where in any given snippet of code the == operator could be overloaded.

[†]

"Engine" is a sexy term, but it's just a library of support code that you use when language constructs don't easily fall into the translate-it-directly-to-machine-code model used by C.

[‡]

Coupon only applies to idealized property access latencies. Competitor coupons gladly accepted. Additional terms and restrictions may apply. See store for details.

[§]

Alternative interpreter designs tend to walk over something that looks more like the source text — either an abstract syntax tree or the program tokens themselves. These designs are less common in modern dynamic languages.

[¶]

There have historically been implementations that do things like this; notably, the Lisp machines and Jazelle DBX. The JavaScript semantics for ADD are particularly hairy compared to these hosted languages, because getting the value-for-adding out of an object can potentially invoke arbitrary functions, causing re-entrance into JavaScript interpretation.

[#]

In the bytecode stream the value isPointy is encoded as an immediate.

[♠]

Note that there is actually further overhead in turning the looked-up property into an appropriate JavaScript value. For example, there are additional checks to see whether the looked-up value represents a "getter" function that should be invoked.

[♥]

This is, in itself, a small hash-map lookup, but the hash function is quite fast. At the moment it's four dependent ALU operations: a right shift, an xor, an add, and a bitwise and.

[♦]

Gregor Richards published a paper in PLDI 2010 that analyzed a set of popular web-based JS applications. The results demonstrated that more than eighty percent of all call sites were monomorphic (always invoked the same function body). I'm speculating that this correlates well to the property accesses we're discussing, though that wasn't explicitly established by the research — in JS, property access PICs are easier to discuss than function invocation PICs. In related languages, like Self, there is no distinction between method invocation and property access.

[♣]

"Hidden overhead my foot! Where does it come from?" Today's processors get a little scared when you write to the parts of memory that contain code. Modern processor architecture assumes that the memory you're executing code out of will not be written to frequently, so they don't optimize for it. [‡‡]

[**]

I'm citing Gregor Richards yet again.

[††]

MICs give a nice percentage boost as well, but they're harder to disable at the moment, or I'd have numbers for that too.

[‡‡]

The annoying part is that the instruction prefetcher may have buffered up the modified instructions, so you have to check if the modified cache line is in there. Older cache coherency protocols I've read about flush lines past unified caches if they detect a hit in both the instruction and data caches — maybe it's better nowadays.

Notes from the JS pit: closure optimization

In anticipation of a much-delayed dentist appointment tomorrow morning and under the assumption that hard liquor removes plaque, I've produced [*] an entry in the spirit of Stevey's Drunken Blog Rants, s/wine/scotch/g. I apologize for any and all incomprehensibility, although Stevey may not mind since it's largely an entry about funargs, which he seems to have a thing for. (Not that I blame him — I'm thinking about them while drinking...) It also appears I may need to prove myself worthy of emigration to planet Mozilla, so hopefully an entry filled with funarg debauchery will serve that purpose as well.

Background

Lately, I've been doing a little work on closure optimization, as permitted by static analysis; i.e. the parser/compiler marks which functions can be optimized into various closure forms.

In a language that permits nested functions and functions as first-class values, there are a few things you need to ask about each function before you optimize it:

Function escape (the funarg problem)

If a function can execute outside the scope in which it was lexically defined, it is said to be a "funarg", a fancy word for "potentially escaping outside the scope where it's defined". We call certain functions in the JS runtime Algol-like closures if they are immediately applied function expressions, like so:

function outer() {
    var x = 12;
    return (function cubeX() { return x * x * x; })();
}

The function cubeX can never execute outside the confines of outer — there's no way for the function definition to escape. It's as if you just took the expression x * x * x, wrapped it in a lambda (function expression), and immediately executed that expression. [†]

Apparently a lot of Algol programmers had the hots for this kinda thing — the whole function-wrapping thing was totally optional, but you chose to do it, Algol programmers, and we respect your choice.

You can optimize this case through static analysis. As long as there's no possibility of escape between a declaration and its use in a nested function, the nested function knows exactly how far to reach up the stack to retrieve/manipulate the variable — the activation record stack is totally determined at compile time. Because there's no escaping, there's not even any need to import the upvar into the Algol-like function.

Dijkstra's display optimization

To optimize this Algol-like closure case we used a construct called a "Dijkstra display" (or something named along those lines). You just keep an array of stack frame pointers, with each array slot representing the frame currently executing at that function nesting level. When outer is called in the above, outer's stack frame pointer would be placed in the display array at nesting level 0, so the array would look like:

Level 0: &outerStackFrame
Level 1: NULL
Level 2: NULL
...

Then, when cubeX is invoked, it is placed at nesting level 1:

Level 0: &outerStackFrame
Level 1: &cubeX
Level 2: NULL
...

At parse time, we tell cubeX that it can reach up to level 0, frame slot 0 to retrieve the jsval for x. [‡] Even if you have "parent" frame references in each stack frame, this array really helps when a function is reaching up many levels to retrieve an upvar, since you can do a single array lookup instead of an n link parent chain traversal. Note that this is only useful when you know the upvar-referring functions will never escape, because the display can only track stack frames for functions that are currently executing.

There's also the possibility that two functions at the same nesting level are executing simultaneously; i.e.

function outer() {
    var x = 24;
    function innerFirst() { return x; }
    function innerSecond() {
        var x = 42;
        return innerFirst();
    }
    return innerSecond();
}

To deal with this case, each stack frame has a pointer to the "chained" display stack frame for that nesting level, which is restored when the executing function returns. To go through the motions:

Level 0: &outerStackFrame
Level 1: &innerSecond
Level 2: NULL
...

innerSecond then activates innerFirst at the same static level (1), so innerFirst saves the pointer that it's clobbering in the display array:

Level 0: &outerStackFrame
Level 1: &innerFirst (encapsulates &innerSecond)
Level 2: NULL
...

Then, when innerFirst looks up the static levels for x, it gets the correct value; when it finishes executing (via a return-style bytecode), it restores innerSecond's entry in the display (which would be important if there were further function nesting in innerSecond). [§]
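
Here's a toy JavaScript rendition of that display bookkeeping (the real thing lives in the engine's frame management, not in script, and the names here are invented):

// Toy Dijkstra display: one frame pointer per static nesting level, with
// save/restore so that same-level calls don't lose each other's entries.
var display = [null, null, null];

function enterFrame(level, frame) {
    frame.savedDisplayEntry = display[level];    // remember what we clobber
    display[level] = frame;
}

function leaveFrame(level, frame) {
    display[level] = frame.savedDisplayEntry;    // restore on return
}

// outer() runs at level 0; innerSecond() and innerFirst() both run at level 1.
var outerFrame = { slots: { x: 24 } };
enterFrame(0, outerFrame);
var innerSecondFrame = { slots: { x: 42 } };
enterFrame(1, innerSecondFrame);
var innerFirstFrame = { slots: {} };
enterFrame(1, innerFirstFrame);                  // saves innerSecond's display entry

// innerFirst's upvar x resolves at parse time to (level 0, slot x):
display[0].slots.x;                              // => 24, a single array lookup
leaveFrame(1, innerFirstFrame);                  // innerSecond is back at level 1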

Okay, hopefully I've explained that well enough, because now I get to tell you that we've found this optimization to be fairly useless in SpiderMonkey experimental surveys and we hope to rip it out at some point. The interesting case that we actually care about (flat closures) is discussed in the second to last section.

Free variable references

Because JS is a lexically scoped language [¶] we can determine which enclosing scope a free variable is defined in. [#] If a function's free variables only refer to bindings in the global scope, then it doesn't need any information from the functions that enclose it. For these functions, the set of free variables that refer to enclosing function scopes is the null set, so we call them null closures. Top-level functions are null closures. [♠]

function outer() {
    return function cube(x) { return x * x * x; }; // Null closure - no upvars.
}

Free variables are termed upvars, since they are identifiers that refer to variables in higher (enclosing) scopes. At parse time, when we're trying to find a declaration to match up with a use, they're called unresolved lexical dependencies. Though JavaScript scopes are less volatile — and, as some will undoubtedly point out, less flexible — I believe that the name upvar comes from this construct in Tcl, which lets you inject vars into and read vars from arbitrary scopes as determined by the runtime call stack: [♥]

set x 7

proc most_outer {} {
    proc outer {} {
        set x 24
        proc upvar_setter {level} {
            upvar $level x x
            set x 42
        }
        proc upvar_printer {level} {
            upvar $level x x
            puts $x
        }
        upvar_printer 1
        upvar_setter 1
        upvar_printer 1
        upvar_setter 2
        upvar_printer 2
        upvar_printer 3
        upvar_setter 3
        upvar_printer 3
    }
    outer
}
most_outer ;# Yields the numbers 24, 42, 42, 7, and 42.

Upvar redefinitions

If you know that the upvar is never redefined after the nested function is created, it is effectively immutable — similar to the effect of Java's partial closures in anonymous inner classes via the final keyword. In this case, you can create an optimized closure in a form we call a flat closure — if, during static analysis, you find that none of the upvars are redefined after the function definition, you can import the upvars into the closure, effectively copying the immutable jsvals into extra function slots.
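
For example, a function like the following is a flat-closure candidate under that analysis: the upvar is never reassigned after the nested function's definition, so its value can simply be copied into the closure when the function object is created.

function makeScaler() {
    var factor = 3;   // never reassigned after `scale` is defined below
    var scale = function (x) { return x * factor; };
    // Flat closure candidate: `factor` can simply be copied into `scale`.
    return scale;
}

var triple = makeScaler();
triple(5);            // => 15

If factor were assigned again after scale's definition, that copy could go stale, which is exactly the shared-environment case described next.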

On the other hand, if variables in enclosing scopes are (re)defined after the function definition (and thus, don't appear immutable to the function), a shared environment object has to be created so that nested functions can correctly see when the updates to the jsvals occur. Take the following example:

function outer() {
    var communicationChannel = 24;
    function innerGetter() {
        return communicationChannel;
    }
    function innerSetter() {
        communicationChannel = 42;
    }
    return [innerGetter, innerSetter];
}

Closing over references

In this case, outer must create an environment record outside of the stack so that when innerGetter and innerSetter escape on return, they can both communicate through the upvar. This is the nice encapsulation effect you can get through closure-by-reference, and it is often used in the JS "constructor pattern", like so:

function MooCow() {
    var hasBell = false;
    var noise = "Moo.";
    return {
        pontificate: function() { return hasBell ? noise + " <GONG!>" : noise; },
        giveBell: function() { hasBell = true; }
    };
}

It's interesting to note that all the languages I work with these days perform closure-by-reference, as opposed to closure-by-value. In contrast, closure-by-value would snapshot the values of identifiers in the enclosing scope, so later updates to bindings of immutable values (strings, numbers) would never be visible to the closure.

Sometimes, closure-by-reference can produce side effects that surprise developers, such as:

def surprise():
    funs = [lambda: x ** 2 for x in range(6)]
    assert funs[0]() == 25

This occurs because x is bound in function-local scope, and all the lambdas close over it by reference. When x is mutated in further iterations of the list comprehension (at least in Python 2.x), the lambdas are closed over the environment record of surprise, and all of them see the last value that x was updated to.

I can sympathize. In fact, I've written a program to do so:

var lambdas = [];
var condolences = ["You're totally right",
        "and I understand what you're coming from, but",
        "this is how closures work nowadays"];
for (var i = 0; i < condolences.length; i++) {
    var condolence = condolences[i];
    lambdas.push(function() { return condolence; });
}
print(lambdas[0]());

Keep in mind that var declarations are hoisted to function scope in JS.

I implore you to note that comments will most likely be received while I'm sober.

Footnotes

[*]

Vomited?

[†]

Cue complaints about the imperfect lambda abstraction in JavaScript. Dang Ruby kids, go play with your blocks! ;-)

[‡]

Roughly. Gory details left out for illustrative purposes.

[§]

There's also the case where the display array runs out of space. I believe we emit unoptimized name-lookups in this case, but I don't entirely recall.

[¶]

With a few insidious dynamic scoping constructs thrown in. I'll get to that in a later entry.

[#]

Barring enclosing with statements and injected eval scopes.

[♠]

Unless they contain an eval or with, in which case we call them "heavyweight" — though they still don't need information from enclosing functions, they must carry a stack of environment records, so they're not optimal. I love how many footnotes I make when I talk about the JavaScript language. ;-)

[♥]

As a result, it's extremely difficult to optimize accesses like these without whole-program analysis.

Notes from the JS pit: lofty goals, humble beginnings

I've been working at Mozilla for about two months on the JavaScript (JS) engine team, the members of which sit in an area affectionately known as the "JS pit".

Mozillians appear to try to blog on a regular basis, so I'll be starting a series of entries prefixed "notes from the JS pit" to explain what I've been working on and/or thinking about.

Notably, I feel fortunate to work for a company that encourages this kind of openness.

Goals

I always feel stupid writing down goals — they seem so self-evident; however, it helps to put the work I'm doing into perspective and gives me something that I can refer back to.

I'm also a big believer in the effectiveness of public accountability, so publishing those goals seems prudent — my notes from the JS pit are poised to help me stay motivated more than anything else.

My goals are to:

Working with compiler engineers from diverse language backgrounds, I find it's prime time for sucking knowledge out of people's heads, comparing and contrasting it, and listening to them argue with each other. Heck, just look at the concepts behind JS: an imperative, Scheme-inspired, prototypal, C-and-Java-syntax conforming language that's irrevocably tied to a practical platform, the web. It's bound to be a fun ride.

From start to present

I started off implementing the simpler opcodes for JaegerMonkey (JM) and getting an understanding of the JM code base. Not too long into it, I was told that looking into quick-and-easy parser optimizations was a priority — somebody had reported that a significant fraction of the GMail load bar time could be attributed to JS parsing. [*]

Now, JavaScript isn't the easiest language in the world to parse; for example, automatic semicolon insertion creates some non-traditional obstacles for generated shift/reduce parsers [†] — it effectively makes an error correction algorithm part of the normal parse procedure. The details are for another entry, but suffice it to say that our recursive descent parser code gets complex, especially due to our E4X support and some of the static analyses we perform for optimizations before bytecode emission.
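
To give one concrete example (mine, not one from the GMail report) of the kind of error correction that automatic semicolon insertion bakes into the normal parse:

// ASI quietly inserts a semicolon after `return`, so this returns undefined;
// the would-be object literal is then parsed as an unreachable block statement.
function makeConfig() {
    return
        { parseFast: true };
}
makeConfig();   // => undefined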

In pursuing JS parser optimization I assembled a suite of parsing benchmarks from sites on the web with "large" JS payloads — I call this suite parsemark. After getting some speedups from simple inlining, I attempted a somewhat fundamental change to the parser to reduce the number of branch mispredictions, by converting it to always have a token "pre-lexed" as opposed to the prior "lex-on-demand" model. Roughly, this required adding a "there's always a lexed token" invariant to the lexer and hoisting lexer calls/modesets from substitution rules into their referring nonterminals in the parser. The details for this are also entry fodder. Sadly, it demonstrated negligible performance gains for the increase in complexity. Sure taught me a lot about our parser, though.

The biggest performance win was obtained through a basic fix to our parser arena-allocation chunk sizing. sayrer noticed that a surprising amount of time was being spent in kernel space, so we tracked the issue down. It was frustrating to work for a few weeks on a fundamental change and then realize that multiplying a constant by four can get you a 20% parsing speedup, but I've certainly learned to look a lot more closely at the vmem subsystem when things are slow. I have some speedup statistics and a comparison to V8 (with all its lazy parsing and parse-caching bits ripped out), but I don't have much faith that my environment hasn't changed in the course of all the historical data measurements — writing a script to verify speedup over changesets seems like a viable option for future notes.

In the miscellany department, I've been trying to do a good amount of work fixing broken windows via cleanup patches. I'm finding it difficult to strike a balance here, since there are a lot of modularity-breaking interdependencies in the code base — what appear to be simple cleanups tend to unravel into large patches that get stale easily. However, cleanup does force you to read through the code you're modifying, which is always good when you're learning a new code base.

Looking back on it, it doesn't seem like a lot of work; of course, my hope is that the time I spend up-front getting accustomed to the codebase will let me make progress on my goals more rapidly.

Stay tuned for more JS pittage — unpredictable time, but predictable channel.

Footnotes

[*]

To date, I haven't looked into this myself. Ideally, I should have verified it before starting on the parser work, but I was eager to start working on things rather than investigate the reasons behind them.

[†]

Though I've seen that Webkit's JavaScriptCore uses Bison — I'm going to have to check out that implementation at some point.