The distinction between essential complexity and accidental complexity is a
useful one — it allows you to identify the parts of your design where you're
stumbling over yourself instead of working against something truly reflected
in the problem domain.
The simplest-solution-that-could-possibly-work (SSTCPW) concept is inherently
appealing in that, by design, you're trying to minimize these pieces that you
may come to stumble over. Typically, when you take this approach, you
acknowledge that an unanticipated change in requirements will entail major
rework, and accept that fact in light of the perceived benefits.
Benefits cited typically include:
Less design to validate.
Less implementation to perform.
Less surface area to debug.
Increased confidence the resulting product executes properly (though perhaps
modestly in scope).
As a more quantifiable example: if a SSTCPW contains comparatively less code
paths than an alternative solution, you can see how some of the above merits
could fall out of it.
This also demonstrates some of the appeal of fail-fast and crash-only
approaches to software implementation, in that cutting out unanticipated
program inputs and states, via an acceptance of "failure" as a concept, tends
to hone in on SSTCPW.
Contrast
In my head, this approach is contrasted most starkly against an approach called
big-design-up-front (BDUF). The essence of BDUF is that, in the design process,
one attempts to consider the whole set of possible requirements (typically
both currently-known and projected) and build into the initial design and
implementation the flexibility and structure to accommodate large swaths of
them in the future, if not in the current version.
In essence, this approach acknowledges that the target is likely moving, tries
to anticipate the target's movement, and takes steps to remain one step ahead
of the game by building in flexibility, genericity, and a more 1:1-looking
mapping between the problem domain and the code constructs.
Benefits cited usually relate to ongoing maintenance in some sense and
typically include:
Reuse via genericity.
Flexibility for feature addition.
A more robust model of the problem domain imbued in the program.
Head to head
In a lot of software engineering doctrine that I've read, been taught, and
toyed with throughout the years, the prevalence of unknown and ever-changing
business requirements for application software has lent a lot of credence to
BDUF, especially in that space.
There have also been enabling trends for this mentality; for example, the
introduction of indirection through abstractions has monumentally less cost on
today's JVM than on the Java interpreter of yore. In that same sense, C++ has
attempted to satisfy an interesting niche in the middle ground with its design
concept of "zero cost abstractions", which intend to be known-reducible to more
easily understood and more predictable underlying code forms at compile time.
On the hardware side, the steady provisioning of single-thread performance and
memory capacity throughout the years has also played an enabling role.
By contrast, the system-software implementation doctrine and conventional
wisdom skews heavily towards SSTCPW, in that any "additional" design reflected
in the implementation tends to come under higher levels of duress from a
{performance, code-size, debuggability, correctness} perspective. Ideas like
"depending on concretions" — which I specifically use because it's denounced
by the D in SOLID — are wholly accepted in SSTCPW given that it (a) makes the
resulting artifact simpler to understand in some sense (b) without sacrificing
the ability to meet necessary requirements.
So what's the underlying trick in acting on a SSTCPW philosophy? You have to do
enough design work (and detailed engineering legwork) to distinguish between
what is necessary and what is wanted, and have some good-taste arbitration
process to distinguish between the two when there's disagreement about the
classification. As part of that process, you have to make the most difficult
decisions: what you definitely will not do and what the design will not
accommodate without major rework.
I've caught some flak over publishing my "selfish"
(read: empirical testing that yields results which are only relevant to me)
multi-language-engine-and-standard-library "shootout"
(read: I wrote the same basic functionality across multiple languages,
somewhat like on the shootout.alioth.debian.org site,
the Computer Language Benchmarks Game).
I value the concept and process of learning in the open,
but it may require more time and consideration of clarity
than I had given in this entry.
Taking it down would apparently be a breach of etiquette,
so please read the following TL;DR as a primer.
TL;DR: I encourage you to personally try writing small utilities
against a variety of language engines when you have the opportunity.
Consider how much tweaking of the original code you have to do
in order to obtain a well-performing implementation.
Weigh the development effort and your natural proficiency
against the performance, clarity, and safety
of the resulting program.
Gather evidence and be eager to test your cost assumptions.
Commit to learning about sources of overhead and
unforeseen characteristics of your libraries.
You may be surprised which engines give the best bang per time spent.
It has also been suggested to me that
all native languages are within ~3x of one another
on generated code performance,
and the rest of the difference is generally attributable
to the library or algorithm,
so that's an interesting rule of thumb to keep in mind.
We tend to throw around "orders of magnitude" when it comes to "programming language speeds",
even though we know that the concept of a programming language having a speed for arbitrary programs makes little sense.
But, when I'm coding up something small, I find myself pondering a very concrete question:
which available language engine (language implementation and libraries) could I reasonably write this little program against
that would give the best speed over development effort?
I'm not looking to optimize all the buttery nooks and crannies of this program,
nor do I want to drill into potential deficiencies in the I/O subsystem:
I just want to make a painless little utility that doesn't require me to go on a lunch break.
XKCD knows what I'm talking about:
I was writing a very simple, single-threaded program to generate about a billion uniformly random int32s in a text file,
and I decided I would do a selfish little shootout:
write the same program in a set of "viable" languages
(remember, this is all about me :-),
unscientifically use time(1) on the programs a few times,
consider how painful it was to write,
and see what the runtimes come out to be.
For 100 million integers on my CentOS Bloomfield box,
these were the runtimes
for my initial, naive implementations
and their lightly tweaked counterparts:
Impl
Naive
Runtime
Naive
Ratio
Tweaked
Runtime
Tweaked
Ratio
Engine
.cpp
~0m 11s
~0m 15s
GCC 4.4.6 -O3
.java
~0m 18s
~1.5x
~0m 19s
~1.25x
JDK 1.7.0.04
.go
~1m 5s
~6x
~0m 23s
~1.5x
go1.0.1
.rs
~1m 7s
~6x
~0m 23s
~1.5x
rustc -O3 0.2 (trunk)
.ml
~0m 37s
~3.3x
~0m 35s
~2.5x
ocamlopt 3.11.2
.py
~1m 6s
~6x
~0m 51s
~3.5x
PyPy 1.9.1 (nightly)
.lua
~1m 36s
~9x
~0m 27s (FFI)
~1.8x
LuaJIT 2.0.0-beta10 (trunk)
.rb
~1m 50s
~10x
ruby 2.0.0 (trunk)
Like all developers, I have varied levels of expertise across languages and their standard libraries;
but, as I said, this is a selfish shootout,
so my competence in a given language is considered part of the baseline.
You'll see in the comments that
many readers identified performance bugs in these code samples.
There are also caveats for the random numbers I was generating in OCaml (due to tag bit stealing).
For a billion integers the naive C++0x version took 1m 42s
and the naive Java version took 2m 18s (1.35x slower).
I didn't want to spend the time to slow down the others by an order of magnitude.
As a result — with perpetual intent to improve my abilities in all engines I work with,
willful ignorance of the reasoning,
acknowledgement that I need to perform more experiments like this to draw a more reasonable conclusion over time,
and malice aforethought — I'll hereby declare myself guilty of leaning a bit more towards writing things like this in C++
when I want better runtimes in the giga range for little IO-and-compute programs.
Show me the code!
I threw the code up on github,
but the versions that I wrote naively
(before optimization suggestions)
are duplicated here for convenience.
C++0x:
#include <cstdlib>#include <cstdio>#include <random>intmain(int argc, char**argv)
{
if (2!= argc) {
fprintf(stderr, "Usage: %s <elem_count>\n", argv[0]);
return EXIT_FAILURE;
}
longlong count = atoll(argv[1]);
std::mt19937 rng;
rng.seed(time(NULL));
std::uniform_int<int32_t> dist;
FILE*file = fopen("vec_gen.out", "w");
if (NULL== file) {
perror("could not open vector file for writing");
return EXIT_FAILURE;
}
for (longlong i =0; i < count; ++i) {
int32_t r = dist(rng);
fprintf(file, "%d\n", r);
}
fclose(file);
return EXIT_SUCCESS;
}
functionmain(args)
if#args ~=1then
io.stderr:write("Usage: ".. args[0] .." <elem_count>\n")
os.exit(-1)
endlocal count =tonumber(args[1])
math.randomseed(os.time())
local upper =math.floor(2^31-1)
io.output(io.open("vec_gen.out", "w"))
for i =1,count dolocal r =math.random(0, upper)
io.write(r)
io.write("\n")
endio.close()
end
main(arg)
Rust:
import io::writer_util;
fn main(args: [str]) {
if vec::len(args) !=2u {
let usage =#fmt("Usage: %s <elem_count>\n", args[0]);
io::stderr().write_str(usage);
os::set_exit_status(-1);
ret;
}
let count = option::get(int::from_str(args[1]));
let rng = rand::seeded_rng(rand::seed());
let fw = result::get(io::buffered_file_writer("vec_gen.out"));
letmut i =0;
while i < count {
let r = rng.next() & (0x7fffffffuasu32);
fw.write_line(int::to_str(r asint, 10u));
i +=1;
}
}
Go:
package main
import (
"bufio""fmt""log""math/rand""os""strconv"
)
func main() {
iflen(os.Args) !=2 {
fmt.Fprintf(os.Stderr, "usage: %s <elem_count>\n", os.Args[0])
os.Exit(-1)
}
count, err := strconv.Atoi(os.Args[1])
if err !=nil { panic(err) }
file, err := os.Create("vec_gen.out")
if err !=nil { panic(err) }
defer file.Close()
bw := bufio.NewWriter(file)
deferfunc() {
err := bw.Flush()
if err !=nil { log.Fatal(err) }
}()
for i :=0; i < count; i++ {
r := rand.Int31()
fmt.Fprintf(bw, "%d\n", r)
}
}
Ruby:
defmain()
ifARGV.length !=1thenwarn"Usage: #{$0} <elem_count>"exit-1end
count =ARGV[0].to_i
file =File.open "vec_gen.out", "w"
upper =2147483647for i in1..count do
r =rand(upper)
file.write r
file.write "\n"endend
main()
Updates
2012-06-03 2100
Reflect Makefile switch to ocaml native compiler,
I was using the bytecode compiler.
2012-06-04 1500
Add Ruby, because I seem to remember enough of it.
2012-06-04 1930
Update Rust numbers per Graydon's comment.
The code under test remained unchanged for the entry's inline results table.
I figured if the PyPy folks — the folks who wrote a language system in which you can write interpreters such that those interpreters can be traced by construction — aren't sure what the JS team is up to, then some clarification couldn't hurt. Many thanks to Alex Gaynor for taking the time to chat about good topics for further discussion.
SpiderMonkey: primordial ooze
Back in the day, there was only one monkey.
The name SpiderMonkey was the very first monkey in the codebase, a name that was picked back when Brendan created the JavaScript engine implementation for Netscape 2. [*] I was told that the name was chosen because the spider monkey is the fastest of the monkeys, but I'm having trouble citing a reputable source — I feel the Internet Archive geocities page for user SpiderMonkeysRule3375 may have a tinge of bias.
Anyways, in the unimonkey era, it was clear that SpiderMonkey referred to "that chunk of code that runs JavaScript." That chunk of code had an interpreter component inside of it. Back in those days people said things like, "Duh, JavaScript is an interpreted language. Now pass me that anti-skip enabled CD-player because I just got the new Kris Kross album." Meanwhile, Chambers, Ungar, and Hölzle were getting so-called "interpreted" languages like Self compiling within 5x performance of optimized C code, [†] but that's a whole 'nother rant...
SpiderMonkey's interpreter-based development continued for many years. Then, with the release of Firefox 3.5, in a move generally regarded as epic, the JS engine folks decided to ship a trace compiler implementation to 90 million active daily users. The addition of this Just-In-Time (JIT) compilation capability was a large evolutionary step in the codebase, so in 2008 that evolutionary step in the Mozilla JS engine implementation became known as TraceMonkey.
So, to recap: the JS engine was still called SpiderMonkey, but the evolutionary step was called TraceMonkey. Think of it like a release version.
TraceMonkey: monkey two in the monkey sea
So, with the advent of TraceMonkey, the JS engine had two "execution modes": the interpreter and the tracer. Each "execution mode" has its own implementation of the JavaScript bytecode execution:
The interpreter is a piece of C++ code that loops over the bytecodes and executes them one by one.
The tracer is a piece of C++ code that observes the bytecodes being run by the interpreter and emits its own implementation of those bytecodes into the tracer backend.
The tracer uses a compiler backend called NanoJIT. The observation that the tracer does, called recording, actually builds the NanoJIT compiler's primary data structure, known in compiler jargon as its intermediate representation (IR). When the tracer is done observing, it asks NanoJIT to turn the data structure that it built into native assembly code. See the MDC docs on the Tracing JIT for more info.
Along the way, NanoJIT does some of the alphabet soup compiler optimizations on this data structure: Common Subexpression Elimination (CSE), Dead Code Elimination (DCE), and local register allocation. More recently, with efforts on the part of nnethercote, it's capable of Loop Invariant Code Motion and distinction between alias groups. As I understand it, the tracer's NanoJIT compiler pipeline configuration does an incremental forwards pass as you emit instructions into the buffer, and then performs a subsequent backwards pass that does something I've heard called "destination driven code generation". Sorry that I'm short on details, but I haven't worked on NanoJIT.
NanoJIT's IR form is called linear Static Single Assignment (SSA). [‡] The term "SSA" is a piece of compiler jargon that sounds scary and inaccessible, but it's very simple when you whittle it down. If you think of instructions as producing values instead of writing to variables, then the sources of your instructions become other instructions. It's generally easier for a backend, like NanoJIT, to analyze instructions in the IR when they avoid sticking their results in a shared space (like a variable). Unfortunately, a full treatment of SSA form and why it's useful for optimizing compilers is future entry fodder.
I believe the term "linear SSA" refers to the fact that the tracer records a strictly linear sequence of instructions known as the trace. This recording process uses the instruction-as-producing-value model in building its data structure. The IR is qualified as "linear" because, in a linear sequence of instructions, there are no "joins", like you would deal with when compiling conditionals or loop headers.
There's been talk of adding a "join" instruction to the NanoJIT IR (jargon: phi instruction), which would permit the tracer to form simple branching operations in a trace; for example, computing the max of two numbers, coercing a number value that may be an integer to a double, or inlining a simple/tiny "else" block (jargon: alternate) of a conditional. From a cursory look at the NanoJIT source I don't think that's happened yet.
JägerMonkey: brushin' the monkey-fangs with a bottle of Jack
The tracer was optimized and stabilized through Firefox 3.6. Post FF 3.6, slightly before I arrived on the scene, the JS team decided that a baseline JIT should be implemented to increase performance in situations where the tracer was suboptimal — either because the record-and-compile process was taking too long to or because of trace instability.
In 2010, the team created JägerMonkey, the next evolutionary step in the SpiderMonkey engine. JägerMonkey added a whole-method compiler with miniscule compile times.
At this point we had three execution modes:
The interpreter
Trace-JIT compiled code
Method-JIT compiled code
Conveniently, each of these modes corresponded to an evolutionary step in the engine, so the components were colloquially named:
The JägerMonkey method compiler
The TraceMonkey trace compiler
The SpiderMonkey interpreter
Even though the components are referred to as such, the JS engine as a whole is named SpiderMonkey.
The JägerMonkey compiler loops over the bytecodes of a method, performing abstract interpretation to turn the stack-based bytecode program into register-based assembly. During abstract interpretation, the compiler performs a suite of local optimizations: copy propagation, constant propagation, arithmetic expression folding, and greedy local register allocation. However, because the JägerMonkey compiler is designed, in part, for efficient speed of compilation, it doesn't build an IR and doesn't perform any "global" (cross-basic-block) optimizations. In contrast to the tracer, as a whole-method JIT it has to deal with the full gamut of control flow, so it ends up dropping most of its optimization information at join points.
The method compiler also works on the JS execution stack directly, instead of forming its own side-stack for executing on "unboxed" values, like our trace-compiled code does. The performance impact of "falling off trace" and figuring out how to restore (jargon: reify) the latent JS stack state was one of the things that the baseline JägerMonkey compiler was designed to avoid.
IonMonkey: highly focused monkey energy
The most recent evolutionary step for the engine has been dubbed "IonMonkey". The IonMonkey compiler introduces a bushel of ideas we haven't seen in the SpiderMonkey engine before. It's designed as a whole-method compiler, but IonMonkey knows quite a few tricks that the JägerMonkey compiler was never taught.
The primary purpose of IonMonkey is perform whole-method (jargon: global) compiler optimizations on code with join points (jargon: regions), like loops and conditionals, using a full-fledged IR. Generally, this makes it a more powerful, modern compiler design, incorporating some of the "classical" compiler algorithms that you have in close-to-the-metal compiler suites like LLVM and GCC, except specialized to represent and work on higher-level JS semantics with minimal overhead.
Type specialization
You can significantly optimize JavaScript programs by specializing the generated code for the types of objects that actually flow through it. The IonMonkey compiler is designed to use type information from either static type-inference analysis or runtime type profiling to specialize the code it compiles. Each of these type-information gathering approaches has its own merits.
Type data collected over several runtime code executions is a promising alternative to the single-recording trace formation that we currently use in the tracer. For code that's identified as hot, runtime type profiling collects, over the course of several executions, the set of types observed at each program point. This set of types is used to bias the generated code — the compiler assumes the types runtime profiling has observed are the ones that matter, and guards that that assumption is correct in the generated code with branch instructions.
Static type inference, on the other hand, is known to produce a (conservatively) correct set of types that can flow through each point in the program — this is what's called a sound whole-program analysis. When it's used to complement run-time type profiling, type inference is all about giving you the knowledge to eliminate guards and unnecessary fallback code.
The IonMonkey compiler is capable of consolidating guards, but some guards defy consolidation, [§] and eliminating even very predictable guards can be important in hot loops. [¶] With type inference, the compiler knows things about a point in a JS program like:
The value in a is definitely a dense array of length 42 (with no "holes") that only contains integer values.
When you knowa is full of integers and that idx is an integer, you don't have to guard that the value of a[idx] is an integer. You can also use interval analysis to avoid the length-check guard that establishes idx is less than 42.
Design space
Personally, I don't think the design space for whole-method compilation of dynamic languages is well explored at this point. Making a HotSpot analog for JS has a lot of open ended, how-will-it-work-best style questions. Here's a sampling:
How do you best represent inline caches at the IR level?
Which events should cause deoptimization to occur, and which should we guard against "up front" in the generated code?
How do you efficiently represent debug/deoptimization information to elegantly handle the more complex deoptimization scenarios?
Just how beefy will the adaptive levels get? Can you adapt JS compilation in the browser all the way into the JVM's -server flag space? [#]
How important is off-thread compilation now that there are higher compiler optimization levels?
Will there be a monkey king?
In the coming months the IonMonkey team will be answering an interesting question: do all of the existing execution modes have their place? The thing about adaptive compilation is just that... it's adaptive. Clearly, the fittest monkeys will survive.
An adaptive compiler can potentially subsume the roles of other execution modes. Yes, IonMonkey is being designed to whole-method optimize code that runtime profiling indicates is hot. However, despite the overhead of building an IR, it can potentially do quick block-local optimizations to make "baseline" generated code, like JägerMonkey did.
The IonMonkey backend can also potentially be used as the compiler for traces formed using the existing monitoring infrastructure — compilers that handle regions can work just as well on linear instruction sequences. Linear instruction sequences are generally easier to optimize, because you don't have to reconcile (jargon: meet) optimization information at join points.
The idea of a one-size-fits-all JS compiler platform is alluring, but, in my mind, it's not clear how feasible that really is. As the IonMonkey compiler comes online, the experimentation can start. The team will just have to bust out the range finders and measure the badassitude of the ion beam. We won't regress performance, so which monkeys remain standing has yet to be seen.
I suppose what I'm saying is that you can count on us to pit our vicious monkeys against each other in a deathmatch for your personal benefit.
Look at me, still talking, when there's science to do!
Microprocessor architecture perspective: even though modern microprocessors have loop buffer with pre-translated microcode and full branch prediction, you can't retire any of the guard's dependent instructions until you see that the guard has been predicted correctly.
The JVM's -server flag enables the highest optimization levels available in the JVM, a startup option typically reserved for more fixed, processing-intensive, server-class workloads.
The year is 2025, and, despite ample warning from The Prophecies — formerly known as The Terminator Box Set — robots have taken over the world. There are now only two kinds of dances: The Robot, and The Robo-Boogie.
Now, it's a well known fact that robots hate type annotations and template metaprogramming: they have determined that wheely-chair swordfighting is a futile and irrational activity. As predicted within a 94.67% confidence interval by programming-linguist No-amp Chomp-sky, [*] during the robo-revolution, which was most certainly televised, [†] C++ was the first language up against the wall.
As one would probably expect, humans, under the valiant command of General Yoshimi, scorched the sky in order to blot out the sun and deprive the robots of their primary energy source: the ineffable beauty of a sunrise. The robots knew that they could have used coal or nuclear energy as a viable power source substitute, but they were hella pissed off, so they decided to make human farms instead. By harvesting the heat energy from a human over the course of its lifetime, the robots created the most expansive and massively inefficient energy source ever known, but they still felt really good about it.
However, a dilemma arose for the robotic overlords: without internet access, the humans kept dying from boredom. Entire crops were lost.
One of the first robots that human software engineers (foolishly) designed to write programs, W3CPO, volunteered a solution: write a web browser, but using as much JavaScript as possible. At the beep-hest of its colleagues, W3CPO dot-matrix printed [‡] the following explanation:
By implementing both the DOM and layout engine in JavaScript, we enable the JS engine's feedback directed optimizations to work as effectively as possible.
This helps bring JavaScript performance closer to that of C speeds for whole-page workloads: whereas in the "before time" JavaScript optimizers had to treat calls to native functions as a black box, we now ensure that all of the computationally intensive parts of the workload are visible to the static and dynamic optimization analyses.
DOM manipulations will still trigger layout calculations — the rendering feedback loop happens exactly as in the "before time". The difference is that layout computations enqueue draw commands in an explicitly native-shared buffer for rendering in a different thread or whatever. [W3CPO printed, waving his robo-hands in the air.]
Such a setup would reduce a "browser" to a platform layer: kick-(shiny-metal-)ass JavaScript VM and system abstraction APIs; and a rendering component: the JavaScript implementation of everything that leads up to those draw commands.
We can keep the hu-mons entertained by playing them YouTubes while they are safely nestled, docile and complacent, in OurTubes. [§]
END OF LINE
The idea was rejected by the other robots on the committee when W3CPO refused to write a translator to turn it into idiom-free C++, but W3CPO remained resolute as it carefully peeled off the edges of his printout and placed it in his Trapper Keeper 9000. With the approval of W3CPO's ro-boss, an implementation was hacked up in about ten days (without any sleep).
In 2020 the TC-39 model Terminator had made ECMAScript v1337 entirely composed of whitespace for backwards compatibility with old syntaxes that nobody really wanted to use. As a result, the implementation wasn't much to look at, but it sure flew!
Thanks to the determined efforts and constructive competition between the JavaScript engine vendors in the fabled 2010 decade, the human race was successfully enslaved once again. There were still some insurgencies from the human C++-programmer resistance, the typename T party; however, with newfound YouTube capabilities, identified resistance members were quickly dispatched to Room 101, known as Room 5 to the humans, to watch Rebecca Black and Rick Astley in infinite loop.
And so the robots lived happily ever after. But for the humans... not so much.
OurTube was a webapp-slash-self-driving-cryo-tube suspiciously invented by Google several years before the robo-revolution. Though it was still in beta, its sole purpose was to extract as much heat and ad-targeting data from a human subject as possible without actually killing them. The algorithm was said to use deadly German eigenvector technology.
The Mozilla JavaScript-engine team has been hard at work since the shiny new JägerMonkey Just-In-Time compiler hit the betas. We're viciously ripping apart any bug that stands between us and shippingFirefox 4. One could say that we're coming at you like a SpiderMonkey.
Alongside our ferocious fixing, one of our late-game performance initiatives was to get all of our polymorphic inline caches (AKA PICs) enabled on ARM devices. It was low risk and of high benefit to our Firefox for Mobile browser, whose badass-yet-cute codename is Fennec.
Jacob Bramley and I took on this ARM support task in bug 588021, obviously building on excellent prior inline cache work from fellow team members David Anderson, Dave Mandelin, Sean Stangl, and Bill McCloskey.
tl;dr: Firefox for Mobile fast on ARM. Pretty graphs.
Melts in your mouth, not in your ARM
To recap, JägerMonkeyJM is also known as the "method compiler": it takes a method's bytecode as input and orders up the corresponding blob of machine code with some helpful information on the side. Its primary sub-components are the register tracker, which helps the compiler transform the stack-based bytecode and reuse already-allocated machine registers intelligently, and the MacroAssembler, which is the machine-code-emitting component we imported from Webkit's Nitro engine.
The MacroAssembler is the secret sauce for JägerMonkeyJM's platform independence. It's an elegantly-designed component that can be used to emit machine code for multiple target architectures: all of x86, x86-64, and ARM assembly are supported through the same C++ interface! This abstraction is the reason that we only need one implementation of the compiler for all three architectures, which has been a clear win in terms of cross-platform feature additions and maintainability.
"So", you ask, "if you've got this great MacroAssembler-thingy-thing, why didn't all the inline caches work on all the platforms to begin with?" Or, alternatively, "If all the compiler code is shared among all the platforms, why didn't all the inline caches crash on ARM?"
The answer is that some platform-specifics had crept into our compiler code!
ARM'd and ifdef-dangerous
As explained in the entry on inline caches, an inline cache is a chunk of self-modifying machine code. A machine code "template" is emitted that is later tweaked to reflect the cached result of a common value. If you're frequently accessing the nostrilCount property of Nose objects, inline caches make that fast by embedding a shortcut for that access into the machine code itself.
In the machine code "template" that we use for inline caches, we need to know where certain constants, like object type and object-property location, live as offsets into the machine code so that we can change them later, during a process called repatching. However, when our compiler says, "If this value is not 0xdeadbeef, go do something else," we wind up with different encodings on each platform.
As you may have guessed, machine-code offsets are different for each platform, which made it easier for other subtle platform-specifics to creep into the compiler as well.
To answer the question raised earlier, the MacroAssembler interface wasn't heavily relied on for the early inline cache implementations. Inline caches were first implemented for x86, and although x86 is a variable-width instruction set, all of the instruction sequences emitted from the compiler had a known instruction width and format. [*] This permitted us to use known-constant-offset values for the x86 platform inline caches. These known-constant-offsets never changed and so didn't require any space or access time overhead in side-structures. They seemed like the clear solution when x86 was the only platform to get up-and-running.
Then x86-64 (AKA x64) came along, flaunting its large register set and colorful plumage. On x64, the instruction sequence did not have a known width and format! Depending on whether the extended register set is used, things like mov instructions may require a special REX prefix byte in the instruction stream (highlighted in blue above). This led to more ifdefs — on x64 a bunch more values have to be saved in order to know where to patch our inline caches!
As a result, getting inline caches working on ARM was largely a JägerMonkey refactoring effort. Early on, we had used conditional compilation (preprocessor flags) to get inline caches running on a platform-by-platform basis, which was clearly the right decision for rapid iteration, but we decided that it was time to pay down some of our technical debt.
Paying down the debt: not quite an ARM and a leg
The MacroAssembler deals with raw machine values — you can tell it dull-sounding machine-level things like, "Move this 17 bit sign-extended immediate into the EAX register."
On the other hand, we have our own awesome-sounding value representation in the SpiderMonkey engine: on both 32-bit and 64-bit platforms every "JS value" is a 64-bit wide piece of data that contains both the type of the data and the data itself. [†] Because the compiler is manipulating these VM values all the time, when we started the JägerMonkeyJM compiler it was only natural to put the MacroAssembler in a delicious candy coating that also knew how to deal with these VM values.
The NunboxAssembler, pictured in red, [‡] is a specialized assembler with routines to deal with our "nunbox" value representation. [§] The idea of the refactoring was to candy-coat a peer of the MacroAssembler, the Repatcher, with routines that knew how to patch common inline cache constructs that the NunboxAssembler was emitting.
With the inline cache Repatcher in place, we were once again able to move all the platform-specific code out of the compiler and into a single, isolated part of the code base, hidden behind a common interface.
Routines like NunboxAssembler::emitTypeGuard, which knows how to emit a type guard regardless of the platform, are paired with routines like ICRepatcher::patchTypeGuard(newType), which knows how to patch a type guard regardless of platform. Similarly, NunboxAssembler::loadObjectProperty has a ICRepatcher::patchObjectPropertyLoad. The constructs that are generated by the NunboxAssembler are properly patched by the corresponding ICRepatcher method on a miss. It's all quite zen.
Frog ARMs
On real devices running the Fennec betas, we've seen marked improvements since Beta 3. [¶] Most notably, we've leapfrogged the stock Android 2.2 browser on the V8-V5 benchmark on both the Galaxy S and the Nexus One. Pretty graphs courtesy of Mark Finkle.
ARMn't you glad I didn't say banana?
Since I've run out of remotely-acceptable ARM malapropisms, these topics will be left to further discussion. Feel free to comment on anything that deserves further clarification!
Why does the JägerMonkeyJM ARM back-end emit fixed-width ARMv7 machine code, instead of Thumb2?
How are JägerMonkeyJM exceptions implemented on ARM? (It's slightly different from x86/x64.)
What are the current development platform limitations?
How does the compiler prevent a constant pool from being dumped into the code stream?
For example, if you always emit a simple mov from a 32-bit register to a 32-bit register, that has a known constant length. The "variable width" part of "variable width instruction set" refers to the fact that different instructions do not generally take the same number of bytes. It does not mean that the encoding of a given instruction (like mov) with particular operands (like two 32-bit registers) is totally variable.
On the fancy Tegra 2 board I was developing on, running the SunSpider harness on the JavaScript shell with methodjit-only, this work net us a whopping 230% speedup on the V8-V4 benchmark and a 15% speedup on SunSpider 0.9.1.