January 6, 2012

String representation in SpiderMonkey

I'm back from holiday break and I need to limber up my tech writing a bit. Reason 1337 of my ever-growing compendium, Nerdy Reasons to Love the Internet, is that there are always interesting discussions going on. [*] I came across "Never create Ruby strings longer than 23 characters" the other day, and despite the link-bait title, there's a nice discussion of string representation within MRI (a Ruby VM).

My recap will be somewhat abbreviated, since I've only given myself a chunk of the morning to write this, so feel free to ask for clarification / follow up in the comments.

Basic language overview

At the language level JavaScript strings are pretty easy to understand. They are immutable, same as in Python:

>>> foo = 'abc'
>>> foo[2] = 'd'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' object does not support item assignment
js> options('strict')
""
js> foo = 'abc'
"abc"
js> foo[2] = 'd'
typein:3: strict warning: foo[2] is read-only
"d"

But you can (without mutating any of the original values) compare them for equality, concat them, slice them, regexp replace/split/match on them, trim whitespace from them, slap them to chop vegetables, and so forth. (See the MDN docs for String.prototype methods.) In the VM, we need to make those operations fast, with an emphasis on the operations that the web uses heavily, which are ideally [†] the ones reflected in benchmarks.

Abstractly

In an abstract sense, a primitive string in SpiderMonkey is a GC cell (i.e. small header that is managed by the garbage collector) that has a length and refers to an array of UCS2 (uniformly 16-bit) characters. [‡]
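To make that concrete, here's a minimal sketch of such a cell. The names and layout are hypothetical (the real JSString header packs flags alongside the length and comes in several variants):

#include <stddef.h>

// Hypothetical sketch (not the real JSString layout): a small,
// GC-managed header that knows its length and points at UCS2 storage.
struct FakeStringCell {
    size_t length;          // number of 16-bit code units
    const char16_t *chars;  // UCS2 characters; may live outside the cell
};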

Recall that, in many dynamic language implementations, type tagging is used in order to represent the actual type of a statically-unknown-typed value at runtime. This generally allows you to work on integers (and, in SpiderMonkey, doubles) without allocating any space on the heap. Primitive strings are very important to distinguish quickly, and they are subtly distinct from (non-primitive) objects, so they have their own type tag in our value representation, as you can see in the following VM function:

/*
 * Convert the given value to a string.  This method includes an inline
 * fast-path for the case where the value is already a string; if the value is
 * known not to be a string, use ToStringSlow instead.
 */
static JS_ALWAYS_INLINE JSString *
ToString(JSContext *cx, const js::Value &v)
{
    if (v.isString())
        return v.toString();
    return ToStringSlow(cx, v);
}

Aside

In JavaScript there's an annoying distinction between primitive strings and string objects that you may have seen:

js> foo = new String('abc')
(new String("abc"))
js> foo.substr(0, 2)
"ab"
js> foo[2]
"c"
js> foo.toString()
"abc"

For simplicity and because they're uninteresting, let's pretend those new String things don't exist.

Atoms

The simplest string form to describe is called an "atom", which is somewhat similar to an interned string in Python. When you write a literal string or identifier in your JavaScript code, SpiderMonkey's parser turns it into one of these atoms.

(function() {
    // Both 'someObject' and 'twenty' are atomized at parse time!
    return someObject['twenty'];
})()

Note that the user has no overt control over which strings get atomized (i.e. there is no intern builtin). Also, there are a bunch of "primordial" atoms that the engine creates when it starts up: things like the empty string, prototype, apply, and so on.

The interesting property of atoms is that any two atoms can be compared in O(1) time (via pointer comparison). Some work is required on the part of the runtime to guarantee that property.
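That work amounts to hash-consing, as sketched below (hypothetical code; the real runtime uses its own hash table over GC-allocated strings):

#include <string>
#include <unordered_set>

// Hypothetical sketch of atomization as hash-consing: keep exactly one
// canonical copy of any given character sequence.
struct FakeAtomTable {
    std::unordered_set<std::u16string> atoms;

    // "Get or create": insert is a no-op if an equal string is already
    // present, and we always hand back the canonical element.
    const std::u16string *atomize(const std::u16string &chars) {
        return &*atoms.insert(chars).first;
    }
};

Since equal character sequences map to the one canonical object, table.atomize(u"proto") == table.atomize(u"proto") holds, and atom equality checks reduce to pointer comparisons.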

To get an atom within the VM, you have to say, "Hey SpiderMonkey runtime, atomize these characters for me!" In the general case the runtime then does a classic "get or create" via a hash table lookup: it determines whether or not those characters have an existing atom and, if not, creates one. The atomized primitive string that you get back always has its characters contiguous in memory — a property which is interesting by contrast to...

Ropes

Let's say that you had immutable strings, like in JavaScript, and you had three large books already in string form: let's call them AoCP I, II, and III. Then, some jerk thinks it would be funny to take the first third of the first book, the second third of the second book, and the third third of the third book, and splice them together into a single string.

What's the simplest thing that could possibly work? Let's say that each book is 8MiB long. You could allocate a new, 8MiB array of characters and memcpy the appropriate characters from each string into the resulting buffer, but now you've added 33% memory overhead and wasted quite a few cycles.
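In code, that simplest-possible approach looks something like this sketch (hypothetical helper, using 16-bit characters as the engine does):

#include <stdlib.h>
#include <string.h>

// The naive approach: allocate a fresh buffer and copy every
// constituent character. For the book example, that's a brand new
// 8MiB allocation plus 8MiB worth of copying.
char16_t *
NaiveConcat(const char16_t *lhs, size_t lhsLen,
            const char16_t *rhs, size_t rhsLen)
{
    char16_t *buf = static_cast<char16_t *>(
        malloc((lhsLen + rhsLen) * sizeof(char16_t)));
    if (!buf)
        return nullptr;
    memcpy(buf, lhs, lhsLen * sizeof(char16_t));
    memcpy(buf + lhsLen, rhs, rhsLen * sizeof(char16_t));
    return buf;
}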

A related issue is efficient appending and prepending. Let's say you have a program that does something like:

var resultString = '';

function onMoreText(text) {
    // If the text starts with a letter in the lower part of the alphabet,
    // put it at the beginning; otherwise, put it at the end.
    if (text[0] < 'l')
        resultString = text + resultString;
    else
        resultString = resultString + text;
}

If you did the naive "new string and memcpy" for all of the appends and prepends, you'd end up creating a lot of useless garbage inside the VM. The Python community has the conventional wisdom that you should build up a container (like a deque) and join on it, but it's difficult to hold the entire ecosystem of web programmers to such standards.

In the SpiderMonkey VM, the general solution to problems like these is to build up a tree-like data structure that represents the sequence of immutable substrings, and to collapse that data structure only when necessary. Say that you write this:

(function weirdJoin(a, b, c, d) {
    var lhs = a + b;
    var rhs = c + d;
    var result = lhs + rhs;
    return result;
})('I', 'love', 'SpiderMonkey', '!');

The concatenation is performed lazily by using a tree-like data structure (actually a DAG, since the same string cell can appear in the structure at multiple points) that we call a rope. Say that all of the arguments are different atoms — the resulting rope would look like:

[Figure: SpiderMonkey rope concatenation example]

Since strings are immutable at the language level, cycles can never form. When the character array is requested within the engine, a binary tree traversal is performed to flatten the constituent strings' characters into a single, newly-allocated buffer. Note that, when the rope is not flattened, the characters which constitute the overall string are not in a single contiguous region of memory — they're spread across several buffers!
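A minimal sketch of the structure, with hypothetical types (the real engine reuses the string cell itself as the rope node and avoids native recursion when flattening):

#include <string>

// Hypothetical rope node: a concatenation is just a small cell that
// points at its two halves.
struct FakeRope {
    const FakeRope *left = nullptr;   // non-null for interior (concat) nodes
    const FakeRope *right = nullptr;
    std::u16string leaf;              // characters, for leaf nodes

    // O(1) concatenation: copy no characters, just allocate a small
    // interior node that points at both halves.
    static FakeRope *Concat(const FakeRope *l, const FakeRope *r) {
        FakeRope *node = new FakeRope;
        node->left = l;
        node->right = r;
        return node;
    }

    // Flattening: traverse the tree, appending each leaf's characters
    // into one contiguous buffer.
    void FlattenInto(std::u16string &out) const {
        if (left) {
            left->FlattenInto(out);
            right->FlattenInto(out);
        } else {
            out += leaf;
        }
    }
};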

Dependent strings

How about when you do superHugeString.substr(4096, 16384)? Naively, you need to copy the characters in that range into a new string.

However, in SpiderMonkey there's also a concept of dependent strings, which simply reuse part of an existing string's character buffer. The derived string keeps the referred-to string alive so that the characters in the referred-to string's buffer can safely be reused.
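In sketch form, with a hypothetical layout and GC details elided, a dependent string is just a view into its base:

#include <stddef.h>

// Hypothetical dependent string: a substring is a view into the base
// string's buffer, plus an edge that keeps the base's GC cell alive.
struct FakeDependentString {
    const void *base;       // the referred-to string's GC cell
    const char16_t *chars;  // points directly into the base's buffer
    size_t length;
};

// superHugeString.substr(4096, 16384) then requires no copying:
//   { &superHugeString, superHugeStringChars + 4096, 16384 }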

Fancier and fancier!

Really small strings are a fairly common case: they are used for things like array property indices and single characters of strings — recall that, in JavaScript, all object properties are named by strings, unlike in languages like Python, which use arbitrary hashables. To optimize for this case, we have strings whose characters are embedded directly into their GC cell header, avoiding a heap-allocated character buffer. [§] We also pre-initialize many of these atoms (strings of length less than three, and the integers up to 256) when the runtime starts up, to bypass the typical hash table lookup overhead.
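A sketch of the layout trick (hypothetical; the real engine's variants and size thresholds differ):

#include <stddef.h>

// Hypothetical layout: short strings keep their characters inside the
// fixed-size GC cell header, so no separate buffer is allocated.
struct FakeInlineString {
    size_t length;
    union {
        const char16_t *external;  // long strings: separate heap buffer
        char16_t inlineChars[4];   // short strings: chars live right here
    };
};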

I'm out of time, but I hope this gives some insight into the good number of tricks that are played to make common uses of JavaScript strings fast within the SpiderMonkey VM. For you curious types, there's lots more information in the code!

Footnotes

[*]

Of course, reason 1336 is that there are so many lame 1337 references that it's actually still funny.

[†]

Note the emphasis on ideally. Ideally, it should make you chuckle.

[‡]

Fun aside: the maximum length of a string in our engine is currently bounded to 28 bits.

[§]

Our garbage collector is not currently capable of creating variable-sized allocations — we only have fixed-size header classes.

SpiderMonkey bubbles OOM retvals

Handling Out-of-Memory conditions gracefully in a large-scale application is notoriously difficult. [*] You generally have to detect that you're OOMing at an arbitrary allocation site, then carefully not allocate any memory, but delegate to part of your application that can recover or report what happened to the user, all without allocating any memory.

Did I mention that, once you've OOM'd, you really shouldn't allocate any memory?

Recovery, so far as I know, means flushing out any expendable caches or trying to coax the operating system to page stuff out to disk.

Of course, malloc may only indicate this OOM-like scenario (by returning NULL) if you're "lucky". If you're unlucky, on overcommit systems you can get taken out by the OOM killer, dispensing its unique brand of indiscriminate justice. [†] Or, you can crawl to brain-numbing speeds as your memory is relegated to swap and the bully operating system chants, "Stop DoSing yourself! Stop DoSing yourself!"

In any case, a NULL return value from malloc must not be propagated down the "successful allocation" paths, in order to avoid potentially-exploitable NULL pointer dereference vulnerabilities.

SpiderMonkey

In SpiderMonkey we check for error conditions, including OOM conditions, everywhere. (C++ exceptions were not attempted in the Mozilla code base, due to performance issues with the Windows ABI.) As a result, most functions in SpiderMonkey are fallible. A typical signature looks something like:

bool
DoSomeStuff(JSContext *cx)
{
    MonkeyCage *cage = cx->new_<MonkeyCage>();
    if (!cage)
        return false;

    // ...put monkeys in the cage, or something...

    return true;
}

A bool is returned in order to indicate failure. cx is the execution context, which is a fancy way of saying, "An object that you thread through all your engine functions because it holds references to junk you're going to need."

One thing you're definitely going to need is allocation functionality. You can get at allocation functionality through helper methods like cx->malloc_() and cx->new_<>() — these do some accounting (how many outstanding bytes have been malloc'd on this context?) and, if you run out of memory, flush some GC stuff to see if there's free space to be had.

When an error occurs while you DoSomeStuff, you set "exception" details (such as a JavaScript exception object or an "I've hit an OOM" flag) on the context. Then, you return false to your caller. And, if it doesn't know how to handle the exception, that caller returns false to its caller.

This transitive "bubbling" of a false return value to callers unwinds the C++ stack — RAII objects with destructors release any resources they had acquired — and permits callers who understand the error to handle it appropriately. In this sense, even out-of-memory errors are "catchable" by native code.
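A typical link in that chain looks like this sketch, which continues the hypothetical example above:

// Hypothetical caller: it can't handle the failure itself, so it
// propagates the false return; RAII destructors release resources as
// the stack unwinds through it.
bool
DoLotsOfStuff(JSContext *cx)
{
    if (!DoSomeStuff(cx))
        return false;  // exception details were already set on cx

    // ...do more fallible stuff...

    return true;
}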

At the outer membrane of the engine is the SpiderMonkey API, called the JSAPI. The functions in the JSAPI reflect this same fallibility: most of the API functions also return a boolean value and take a context which can be used to report an error.

V8

By contrast, V8 is able to use this construct to avoid a similar OOM check-and-bubble guideline:

void* Malloced::New(size_t size) {
  void* result = malloc(size);
  if (result == NULL) {
    v8::internal::FatalProcessOutOfMemory("Malloced operator new");
  }
  return result;
}

The FatalProcessOutOfMemory function calls a fatal error callback and then calls abort(). It doesn't look like the callback is capable of doing recovery and continuing execution, but I'm new to the codebase, so I'm not totally sure.

With process isolation, calling abort() when you run out of memory is somewhat excusable. If we did this in Firefox, which does not have process-per-tab, one page which hit such an OOM would take out the whole program. In Chrome, however, an OOM handled in this way will take out the current process-isolated tab group. When you have lots of tabs open in Chrome, a sizable number of tabs may be muxed onto a single process, in which case this might be considered bad user-level behavior, but it works well in the more common case (with a small number of tabs).

Ignoring opinions on whether aborting a tab group is somehow "worse" than alternatives, I have to say that not explicitly checking for OOMs is nice. Writing, maintaining, and validating the correctness of seldom-taken but abundantly-existent OOM paths is a tax on development.

"Infallible" allocation

Allocations can be bucketed into two simple categories:

Well-understood allocations of reasonable, generally fixed size (a double here, a small header there).

Allocations whose size is determined by user-controlled content (say, a buffer big enough to hold whatever the page hands you).

If you've attempted allocation recovery procedures but still fail to allocate something in the former category, life is tough. These allocations should generally succeed, because they're well-understood known quantities: a call to malloc(sizeof(double)) failing means you're up against a pretty serious memory limit, so it's tough to do more than display a pre-canned, "Sorry, everything is hosed!" dialog. Unless your process is architected such that you can cleave off and deallocate a large amount of currently used (and otherwise unreferred-to) memory at the user's behest with zero memory overhead, you don't have very good options.

By contrast, the latter category might fail to allocate just because the user-controlled content is sufficiently weird, malformed, or malicious.

To avoid the tax I mentioned in the last section, the Mozilla Gecko engine (outside of SpiderMonkey) has been moving to a strategy called "infallible malloc" for the former category of allocations. If a well-understood allocation of a reasonable and generally fixed size fails, it will first attempt memory recovery procedures [‡] and, failing at that, will cause the process to abort. With this scheme, you avoid the development tax of checking for OOM conditions.
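In sketch form, with hypothetical names for the allocator and its recovery hook:

#include <stdlib.h>

bool TryToReleaseSomeMemory();  // hypothetical recovery hook: flush caches, etc.

// Sketch of infallible malloc: for well-understood, fixed-size
// allocations, failure triggers recovery and, failing that, abort();
// a NULL return never reaches the caller.
void *
InfallibleMalloc(size_t size)
{
    for (;;) {
        if (void *p = malloc(size))
            return p;
        if (!TryToReleaseSomeMemory())
            abort();
    }
}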

Back to SpiderMonkey

So, is there practical benefit to bubbling an OOM through the whole engine to the API caller?

Currently, yes. Ultimately, probably not.

At the moment, both of the categories of allocation I mentioned are using the same mechanism, so hitting a content-induced OOM (second category) in SpiderMonkey will bubble the issue up to Gecko, which will be able to report that error and proceed as normal. In the future, however, there's no essential reason to bubble the issue up to Gecko: we could just use infallible malloc from SpiderMonkey directly.

SpiderMonkey already supports specification of a custom allocator at compile time, so we would just need to incrementally identify known- and fixed-size allocations and port them to use infallible malloc. That way, instead of bubbling up a return code from SpiderMonkey to ultimately cause an abort() from infallible malloc in Gecko, we could cause the abort() from SpiderMonkey's infallible malloc call directly, and avoid the OOM development tax.

Footnotes

[*]

With apologies to Erik Corry for misunderstanding how V8 handled OOMs in my recent reading!

[†]

This could occur due to a legitimate dereference of a memory chunk that malloc cheerily gave you — it just never happened to be backed by physical memory! Similarly, on a system where multiple applications are running simultaneously, any one of them is a potential candidate for OOM killing, so your process may get OOM killed because a different process legitimately dereferenced its overcommitted memory.

[‡]

I've been informed that this is not currently enabled, but is under active development by the MemShrink team.

What if MoCo were hit by a bus?

I went to my first Hybrid Factory event with some of my teammates this past evening, which was kind of like a mini-workshop. The event, titled "How to effectively work as a Tech Lead," was given by Derek Parham.

One of the main topics was delegation: the audience interaction portion of the event asked us to consider how we would approach delegating all of our current tasks to other people on our teams.

During this exercise sstangl brought up something profound: rather than delegating to other Mozilla Corporation-employed teammates, how much of our current task load could we delegate to the community members outside of Mozilla Corporation (AKA MoCo)? These are the people who voluntarily devote their time and effort towards advancing the Mozilla project.

Of course, the talk also covered the classic "bus test," which asks, "If [person X] were hit by a bus, how would we continue to function?" It wasn't a big leap to my asking, "If all of MoCo were hit by a bus, how well situated is the community to carry forward our outstanding tasks and projects?"

Like all fun hypotheticals, it's far fetched and a bit apocalyptic, but it forces you to think about your team's work and coordination methods in a very different light.

I suppose a related, follow-up question is: if the Mozilla organization is designed to empower a worldwide community, but we couldn't survive a MoCo bus scenario, then are we managing the project in a sustainable way?

Maybe people who oversee the project as a whole (and those who are more familiar with the philosophy behind our governance) have a definitive answer. In any case, it's interesting food for thought.

Collaboration and concentration

The latest in the "People problems are hard; I'm glad I just program computers" set of thought processes. If you have thought-provoking anecdotal data points, your comments are welcome!

There's a confounding and contentious dynamic in programming between collaboration and concentration. This goes beyond just individual and team efficiency — it's also about finding a balance that lets you enjoy and grow your craft.

There are certainly times that you want to hunker down in an LCD-monitor-adorned sensory deprivation chamber and do some badass debugging, find all the corner cases in that nasty algorithm, or suss out all the invariants in a complicated system.

But, on the other hand, there are certainly times that you need to write code during a flurry of peer interaction. If you're ramping up on a new system, the thirty seconds it takes you to ask another human being a question and receive a solid, well-informed answer usually trumps the hours you'd spend reaching the "same" conclusion on your own, with less certainty. Oh, and you'd have missed the trap everybody else already knows about, so you'd actually have figured it out wrong.

The extent of the dichotomy is clearly not limited to these examples. It's easy to think of times you've had to pepper somebody with questions as well as times you wanted to hole up in a cave to get some code done. For my systems programming experience I'd hazard a guess that it's around an 80/20 split between concentration and collaboration.

So, it's interesting to ponder: if you were building a team, how would you structure the work space or work activities to enable collaboration when it was necessary, but enable concentration in the common case?

I suppose it's doubly difficult because you also want your team to feel empowered — capable of seeking out collaborative help when they need it in order to get things done and make progress, but also empowered to cordon themselves off and have at it.

I can pretty easily find research that involves hospital workers in the early 1990s, but I have to imagine that a team of brilliant systems programmers is a different beast altogether.

Mapping the monkeysphere

Our PyPy friends (in FreeNode #pypy) saw the recent IonMonkey Q&A on InfoQ, but it led to some confusion:

<Alex_Gaynor> verte: I swear, everytime a JS vendor opens their mouth I get more and more confused
<verte> Alex_Gaynor: indeed. isn't [an SSA-based IR] covered in Fisher Price' My First Compiler?

FreeNode #pypy, 2011-05-02

I figured if the PyPy folks — the folks who wrote a language system in which you can write interpreters such that those interpreters can be traced by construction — aren't sure what the JS team is up to, then some clarification couldn't hurt. Many thanks to Alex Gaynor for taking the time to chat about good topics for further discussion.

SpiderMonkey: primordial ooze

Back in the day, there was only one monkey.

SpiderMonkey was the very first monkey name in the codebase, picked back when Brendan created the JavaScript engine implementation for Netscape 2. [*] I was told that the name was chosen because the spider monkey is the fastest of the monkeys, but I'm having trouble citing a reputable source — I feel the Internet Archive geocities page for user SpiderMonkeysRule3375 may have a tinge of bias.

Anyways, in the unimonkey era, it was clear that SpiderMonkey referred to "that chunk of code that runs JavaScript." That chunk of code had an interpreter component inside of it. Back in those days people said things like, "Duh, JavaScript is an interpreted language. Now pass me that anti-skip enabled CD-player because I just got the new Kris Kross album." Meanwhile, Chambers, Ungar, and Hölzle were getting so-called "interpreted" languages like Self compiling to within 5x the performance of optimized C code, [†] but that's a whole 'nother rant...

SpiderMonkey's interpreter-based development continued for many years. Then, with the release of Firefox 3.5, in a move generally regarded as epic, the JS engine folks decided to ship a trace compiler implementation to 90 million active daily users. The addition of this Just-In-Time (JIT) compilation capability was a large evolutionary step in the codebase, so in 2008 that evolutionary step in the Mozilla JS engine implementation became known as TraceMonkey.

So, to recap: the JS engine was still called SpiderMonkey, but the evolutionary step was called TraceMonkey. Think of it like a release version.

TraceMonkey: monkey two in the monkey sea

So, with the advent of TraceMonkey, the JS engine had two "execution modes": the interpreter and the tracer. Each "execution mode" has its own implementation of JavaScript bytecode execution:

The interpreter executes the bytecodes one at a time, the way the engine always had.

The tracer uses a compiler backend called NanoJIT. The observation that the tracer performs, called recording, actually builds the NanoJIT compiler's primary data structure, known in compiler jargon as its intermediate representation (IR). When the tracer is done observing, it asks NanoJIT to turn the data structure that it built into native assembly code. See the MDC docs on the Tracing JIT for more info.

Along the way, NanoJIT performs some of the alphabet-soup compiler optimizations on this data structure: Common Subexpression Elimination (CSE), Dead Code Elimination (DCE), and local register allocation. More recently, with efforts on the part of nnethercote, it's capable of Loop Invariant Code Motion and of distinguishing between alias groups. As I understand it, the tracer's NanoJIT compiler pipeline configuration does an incremental forwards pass as you emit instructions into the buffer, and then performs a subsequent backwards pass that does something I've heard called "destination driven code generation". Sorry that I'm short on details, but I haven't worked on NanoJIT.

NanoJIT's IR form is called linear Static Single Assignment (SSA). [‡] The term "SSA" is a piece of compiler jargon that sounds scary and inaccessible, but it's very simple when you whittle it down. If you think of instructions as producing values instead of writing to variables, then the sources of your instructions become other instructions. It's generally easier for a backend, like NanoJIT, to analyze instructions in the IR when they avoid sticking their results in a shared space (like a variable). Unfortunately, a full treatment of SSA form and why it's useful for optimizing compilers is future entry fodder.
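To make "instructions as producing values" concrete, here's a tiny sketch of a hypothetical IR node (nothing NanoJIT-specific):

// Hypothetical IR: an instruction's operands are other *instructions*,
// not variable names, so there's no shared mutable space to analyze.
struct Insn {
    const char *op;             // "param", "add", "mul", ...
    const Insn *lhs = nullptr;
    const Insn *rhs = nullptr;
};

// "var x = a + b; use(x * x)" in this form becomes:
//   i1 = param a
//   i2 = param b
//   i3 = add i1, i2
//   i4 = mul i3, i3   // both operands name instruction i3 directly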

I believe the term "linear SSA" refers to the fact that the tracer records a strictly linear sequence of instructions known as the trace. This recording process uses the instruction-as-producing-value model in building its data structure. The IR is qualified as "linear" because, in a linear sequence of instructions, there are no "joins", like you would deal with when compiling conditionals or loop headers.

[Figure: contrast of a linear IR and a region-based IR]

There's been talk of adding a "join" instruction to the NanoJIT IR (jargon: phi instruction), which would permit the tracer to form simple branching operations in a trace; for example, computing the max of two numbers, coercing a number value that may be an integer to a double, or inlining a simple/tiny "else" block (jargon: alternate) of a conditional. From a cursory look at the NanoJIT source I don't think that's happened yet.

JägerMonkey: brushin' the monkey-fangs with a bottle of Jack

The tracer was optimized and stabilized through Firefox 3.6. Post FF 3.6, slightly before I arrived on the scene, the JS team decided that a baseline JIT should be implemented to increase performance in situations where the tracer was suboptimal — either because the record-and-compile process was taking too long or because of trace instability.

In 2010, the team created JägerMonkey, the next evolutionary step in the SpiderMonkey engine. JägerMonkey added a whole-method compiler with minuscule compile times.

At this point we had three execution modes: the interpreter, the tracer, and the method JIT.

Conveniently, each of these modes corresponded to an evolutionary step in the engine, so the components were colloquially named after those steps:

Interpreter: SpiderMonkey
Tracer: TraceMonkey
Method JIT: JägerMonkey

Even though the components are referred to as such, the JS engine as a whole is named SpiderMonkey.

[Figure: pictorial representation of the MonkeySphere]

The JägerMonkey compiler loops over the bytecodes of a method, performing abstract interpretation to turn the stack-based bytecode program into register-based assembly. During abstract interpretation, the compiler performs a suite of local optimizations: copy propagation, constant propagation, arithmetic expression folding, and greedy local register allocation. However, because the JägerMonkey compiler is designed, in part, for efficient speed of compilation, it doesn't build an IR and doesn't perform any "global" (cross-basic-block) optimizations. In contrast to the tracer, as a whole-method JIT it has to deal with the full gamut of control flow, so it ends up dropping most of its optimization information at join points.
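Here's a highly simplified sketch of that abstract interpretation idea (hypothetical functions, emitting pretend assembly as text): the compiler maintains a model of the operand stack that records where each value will live, rather than the values themselves.

#include <stdio.h>
#include <stack>

std::stack<int> abstractStack;  // models the JS operand stack
int nextReg = 0;                // greedy, never-reuse "register allocator"

// getlocal: nothing is computed at compile time; we just record the
// register where the value will live and emit the load.
void
CompileGetLocal(int slot)
{
    int r = nextReg++;
    printf("  mov r%d, local[%d]\n", r, slot);
    abstractStack.push(r);
}

// add: pop the abstract operands, emit a register-register add, and
// record that the result lives where the left operand did.
void
CompileAdd()
{
    int rhs = abstractStack.top(); abstractStack.pop();
    int lhs = abstractStack.top(); abstractStack.pop();
    printf("  add r%d, r%d\n", lhs, rhs);
    abstractStack.push(lhs);
}

// "Compiling" the stack bytecode [getlocal 0; getlocal 1; add] emits
// two moves and one add, with no run-time stack traffic.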

The method compiler also works on the JS execution stack directly, instead of forming its own side-stack for executing on "unboxed" values, like our trace-compiled code does. The performance impact of "falling off trace" and figuring out how to restore (jargon: reify) the latent JS stack state was one of the things that the baseline JägerMonkey compiler was designed to avoid.

IonMonkey: highly focused monkey energy

The most recent evolutionary step for the engine has been dubbed "IonMonkey". The IonMonkey compiler introduces a bushel of ideas we haven't seen in the SpiderMonkey engine before. It's designed as a whole-method compiler, but IonMonkey knows quite a few tricks that the JägerMonkey compiler was never taught.

The primary purpose of IonMonkey is to perform whole-method (jargon: global) compiler optimizations on code with join points (jargon: regions), like loops and conditionals, using a full-fledged IR. Generally, this makes it a more powerful, modern compiler design, incorporating some of the "classical" compiler algorithms that you find in close-to-the-metal compiler suites like LLVM and GCC, except specialized to represent and work on higher-level JS semantics with minimal overhead.

Type specialization

You can significantly optimize JavaScript programs by specializing the generated code for the types of objects that actually flow through it. The IonMonkey compiler is designed to use type information from either static type-inference analysis or runtime type profiling to specialize the code it compiles. Each of these type-information gathering approaches has its own merits.

Type data collected over several runtime code executions is a promising alternative to the single-recording trace formation that we currently use in the tracer. For code that's identified as hot, runtime type profiling collects, over the course of several executions, the set of types observed at each program point. This set of types is used to bias the generated code — the compiler assumes the types that runtime profiling has observed are the ones that matter, and it uses branch instructions in the generated code to guard that this assumption holds.

Static type inference, on the other hand, is known to produce a (conservatively) correct set of types that can flow through each point in the program — this is what's called a sound whole-program analysis. When it's used to complement run-time type profiling, type inference is all about giving you the knowledge to eliminate guards and unnecessary fallback code.

The IonMonkey compiler is capable of consolidating guards, but some guards defy consolidation, [§] and eliminating even very predictable guards can be important in hot loops. [¶] With type inference, the compiler knows things about a point in a JS program like:

The value in a is definitely a dense array of length 42 (with no "holes") that only contains integer values.

When you know a is full of integers and that idx is an integer, you don't have to guard that the value of a[idx] is an integer. You can also use interval analysis to avoid the length-check guard that establishes idx is less than 42.
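As a sketch of the payoff, here is ordinary C++ that's morally equivalent to the specialized code the compiler could then emit (hypothetical, of course):

#include <stdint.h>

// Hypothetical lowering: with the inferred facts above (dense array
// of length 42, all elements ints, idx provably in [0, 42)), summing
// the elements needs no type guards and no bounds guards.
int32_t
SumSpecialized(const int32_t *aElements)
{
    int32_t sum = 0;
    for (int32_t idx = 0; idx < 42; ++idx)
        sum += aElements[idx];  // a plain, unboxed array read
    return sum;
}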

Design space

Personally, I don't think the design space for whole-method compilation of dynamic languages is well explored at this point. Making a HotSpot analog for JS has a lot of open-ended, how-will-it-work-best style questions.

Will there be a monkey king?

In the coming months the IonMonkey team will be answering an interesting question: do all of the existing execution modes have their place? The thing about adaptive compilation is just that... it's adaptive. Clearly, the fittest monkeys will survive.

An adaptive compiler can potentially subsume the roles of other execution modes. Yes, IonMonkey is being designed to whole-method optimize code that runtime profiling indicates is hot. However, despite the overhead of building an IR, it can potentially do quick block-local optimizations to make "baseline" generated code, like JägerMonkey did.

The IonMonkey backend can also potentially be used as the compiler for traces formed using the existing monitoring infrastructure — compilers that handle regions can work just as well on linear instruction sequences. Linear instruction sequences are generally easier to optimize, because you don't have to reconcile (jargon: meet) optimization information at join points.

The idea of a one-size-fits-all JS compiler platform is alluring, but, in my mind, it's not clear how feasible that really is. As the IonMonkey compiler comes online, the experimentation can start. The team will just have to bust out the range finders and measure the badassitude of the ion beam. We won't regress performance, so which monkeys remain standing has yet to be seen.

I suppose what I'm saying is that you can count on us to pit our vicious monkeys against each other in a deathmatch for your personal benefit.

Look at me, still talking, when there's science to do!

Footnotes

[*]

From the MDC wiki: "JS and NSPR have common roots in the dawn of time (Netscape 2)". Before that it had "prototype" names like Mocha and LiveScript.

[†]

Chambers, Ungar, Lee - An Efficient Implementation of SELF, 5. The Compiler. Also see the Oracle Labs SELF papers.

[‡]

A name which I'd never heard of before working on the JS engine.

[§]

Either because they are loop-iteration dependent or because of the presence of code constructs that you can't consolidate across.

[¶]

Microprocessor architecture perspective: even though modern microprocessors have loop buffers with pre-translated microcode and full branch prediction, you can't retire any of the guard's dependent instructions until you see that the guard has been predicted correctly.

[#]

The JVM's -server flag enables the highest optimization levels available in the JVM, a startup option typically reserved for more fixed, processing-intensive, server-class workloads.