January 6, 2011

Ship, or don't die!

Perhaps a corollary is, "Release early, or release toxin."

I admit I'm a bit of a Seth Godin fanboy — he's driven, omits needless words, and gets things done. His blog rarely has an unread count in my feed reader. At the same time, when you look up to someone, you can't help but expose some vulnerability.

One of his latest kick-ass entries, What did you ship in 2010?, put me in a total funk. I shipped a modest set of things this year, by which I mean that I found my list unimpressive. Maybe it was too short, maybe it wasn't innovative/creative enough, or maybe I was just being a negative Nancy.

In any case, Nancy found... er, I found the list-writing experience incredibly disappointing. [*] However, after some thought, I realized what I would probably say to Seth if we had a chat about it over coffee at Red Rock: I don't think I should really care.

Why? Because I'm probably not going to die tomorrow.

Kindergartener existentialism

A fun-size bit of existentialism is that human essence isn't fully realized until you die. When you die, your whole lifeline has played out and your effect on the world is fundamentally complete. To use a catchy existentialist marketing slogan, human existence precedes essence.

In a related vein, kindergarteners don't try to get their macaroni pictures displayed in art museums. When you're new to a scene and acquiring pre-requisite experience, there's no need to subject the rest of the world to your crap: in the common case, there's plenty more time to cultivate your essence and make your mark on the world. In fact, experimenting, throwing your crap away and moving on may be a much better use of your time than trying to ship something naïve or artless. [†]

My parents wouldn't want to put my broken patches up on their refrigerators. Even if they did, they don't have those kinds of refrigerators that magnets can stick to.

Shipping in potentia

Like most people who suffer from over-achievement syndrome, I have fever dreams of instantaneously becoming an expert in every piece of tech I touch, innovation dripping from my fingertips as I puke rainbows and such. Discovering that talent and perseverance have limits is always a cruel come-down.

Perhaps because of these delusions, I initially found it hard to grasp my most important accomplishment of this past year: getting to know various aspects of a state-of-the-art, production, multi-platform language design/implementation and the surrounding processes and tech. That's not shipping! It is, however, necessary experience to ship higher-impact (and perhaps daydream-worthy) tech down the road.

Realistically, there are a number of other reasons to feel accomplished. When I left my last gig slightly under a year ago, not a single product I had written code for had shipped. (Although I'm totally rooting for one that was recently announced!) Now, every patch I write is put to the test in a development channel with millions of active daily users. I'm constantly and (relatively) shamelessly absorbing information from a team of brilliant and down-to-earth developers, my mentor Luke Wagner in particular. My scrappy throw-away side-projects keep me thinking creatively and questioning the status quo.

In all, this year was incredibly enriching.

The JS engine is more comfortable ground with each passing day. I've got the drive to give back important and innovative things. My existence precedes my essence.

I'm optimistic about the list for 2011.



Whatever the size of my contribution, I'm honored to say that my list for 2010 included, "Help to ship a working and efficient JägerMonkey implementation." Effin' a.


Of course, one has to be somewhat cautiously introspective — Seth also warns that continued concerns over naïvete/perfection are a natural result of a fearful mentality that he refers to as the "lizard brain".

A generator protocol trap

You should be cautious that the functions you call from generators don't accidentally raise StopIteration exceptions.


Generators are generally a little tricky in a behind-the-scenes sort of way. Performing a return from a generator is different from a function's return — a generator's return statement actually raises a StopIteration instance.

>>> def foo():
...     yield 1
...     return
>>> i = foo()
>>> next(i)
>>> try:
...     next(i)
... except StopIteration as e:
...     print(id(e), e)


There's a snag that I run into at times: when you're writing a generator and that generator calls into other functions, be aware that those callees may accidentally raise StopIteration exceptions themselves.

def insidious(whoops=True):
    if whoops:
        raise StopIteration
        return 2

def generator():
    yield 1
    yield insidious()
    yield 3

if __name__ == '__main__':
    print([i for i in generator()])

This program happily prints out [1].

If you substitute StopIteration with ValueError, you get a traceback, as you'd probably expect. The leakage of a StopIteration exception, however, propagates up to the code that moves the generator along, [*] which sees an uncaught StopIteration exception and terminates the loop.


The same trap exists in the SpiderMonkey (Firefox) JavaScript dialect:

function insidious() {
    throw StopIteration;

function generator() {
    yield 1;
    yield insidious();
    yield 3;

print(uneval([i for (i in generator())]));

Which prints [1].

VM guts

As a side note for those interested in the guts of virtual machines, the primordial StopIteration class is known to the virtual machine, regardless of user hackery with the builtins:

>>> def simple_generator():
...     yield 1
...     raise StopIteration
...     yield 3
>>> print([i for i in simple_generator()])
>>> class FakeStopIteration(Exception): pass
>>> __builtins__.StopIteration = FakeStopIteration
>>> print([i for i in simple_generator()])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <listcomp>
  File "<stdin>", line 3, in simple_generator


You may look at this issue and think:

The fundamental problem is that uncaught exceptions are raised over any number of function invocations. Generators should have been designed such that you have to stop the generator from the generator invocation.

In an alternate design, a special value (like a StopIteration singleton) might be yield'd from the generator to indicate that it's done.

One issue with that alternative approach is that you're introducing a special-value-check into the hot path within the virtual machine — i.e. you'd be slowing down the common process of yielding iteration items. Using an exception only requires the VM to extend the normal exception handling machinery a bit and adds no additional overhead to the hot path. I think the measurable significance of this overhead is questionable.

Another issue is that it hurts the lambda abstraction — namely, the ability to factor your generator function into smaller helper functions that also have the ability to terminate the generator. In the absence of a language-understood exception, the programmer has to invent a new way for the helper functions to communicate to the generator that they'd like the generator to terminate.



The FOR_ITER bytecode in CPython VM for-loops.

Remove the self-selection bias from Q&A sessions

I've been in quite a few painful Q&A sessions. I think we can do better.

When somebody volunteers to ask a question, their question is not necessarily of interest to anybody in the audience other than themselves. Despite that possibility, publicly answering the question necessarily takes up the time of every person in the audience. Remember the kid in class who asked 90% of the questions — questions that nobody else ever cared about?

There's a simple way to fix this.

Step 1: At the end of your presentation, ask for a show of hands for those people interested in having a Q&A session. If very few people raise their hands, you should be worried about the quality your presentation. However, in the less gloomy scenario where a good number of people raise their hands, you have a representative sample for the audience population that might be interested in the answer to any given question. (The other people should probably leave, but suggesting that would be rude.)

Step 2: After each person volunteers to ask a question, give the audience a quick poll by saying, "Raise your hand if you're interested in the answer to this question." If very few hands go up, reassure the person that their question is a good one [*] and tell them you'd love to stick around and chat afterwards.

That's it. Each several-minute irrelevant Q&A iteration is now reduced to a few seconds and you have additional feedback as to the topics the audience is interested in — the added audience involvement can't hurt either.


Real-time feedback software that the audience interacts with during your presentation can definitely do better. This is just a low-tech proposal to raise the bar a bit.

As my friend pointed out in discussing this, there's no simple way to determine the critical mass for answering a question — it's possible that any given question will only be interesting to a fraction of the audience. It's also possible that you, the presenter, know that the answer to a question is particularly interesting and should be answered without soliciting audience feedback. I believe in your intuition!



Even if it's not — otherwise, question-asking potentially becomes public humiliation, which is very undesirable.

PICing on JavaScript for fun and profit

Inline caching is a critical ingredient in the delicious pie that is dynamic language performance optimization. What follows is a gentle-albeit-quirky introduction to what polymorphic inline caches (PICs) are and why they're useful to JavaScript Just-In-Time compilers like JaegerMonkey.

But first, the ceremonial giving of the props: the initial barrage of PIC research and implementation in JaegerMonkey was performed by Dave Mandelin and our current inline cache implementations are largely the work of David Anderson. As always, the performance improvements of Firefox's JavaScript engine can be monitored via the Are We Fast Yet? website.

C is for speed, and that's good enough for me

C is fast.

Boring people (like me) argue about astoundingly interesting boring things like, "Can hand-tuned assembly be generally faster than an equivalent C program on modern processor architectures?" and "Do languages really have speeds?", but you needn't worry — just accept that C is fast, and we've always been at war with Eurasia.

So, as we've established, when you write a program in C, it executes quickly. If you rewrite that program in your favorite dynamic language and want to know if it still executes quickly, then you naturally compare it to the original C program.

C is awesome in that it has very few language features. For any given snippet of C code, there's a fairly direct translation to the corresponding assembly instructions. [*] You can almost think of C as portable assembly code. Notably, there are (almost) zero language features that require support during the program's execution — compiling a C program is generally a non-additive translation to machine code.

Dynamic languages like JavaScript have a massive number of features by comparison. The language, as specified, performs all kinds of safety checks, offers you fancy-n-flexible data record constructs, and even takes out the garbage. These things are wonderful, but generally require runtime support, which is supplied by the language engine. [†] This runtime support comes at a price, but, as you'll soon see, we've got a coupon for 93 percent off on select items! [‡]

You now understand the basic, heart-wrenching plight of the performance-oriented dynamic language compiler engineer: implement all the fancy features of the language, but do it at no observable cost.

Interpreters, virtual machines, and bears

"Virtual machine" sounds way cooler than "interpreter". Other than that, you'll find that the distinction is fairly meaningless in relevant literature.

An interpreter takes your program and executes it. Generally, the term "virtual machine" (AKA "VM") refers to a sub-category of interpreter where the source program is first turned into fake "instructions" called bytecodes. [§]

A bear moving quickly

I call these instructions fake because they do things that a hardware processing units are unlikely to ever do: for example, an ADD bytecode in JavaScript will try to add two arbitrary objects together. [¶] The point that languages implementors make by calling it a "virtual machine" is that there is conceptually a device, whether in hardware or software, that could execute this set of instructions to run the program.

These bytecodes are then executed in sequence. A program instruction counter is kept in the VM as it executes, analogous to a program counter register in microprocessor hardware, and control flow bytecodes (branches) change the typical sequence by indicating the next bytecode instruction to be executed.

Virtual (machine) reality

Languages implemented in "pure" VMs are slower than C. Fundamentally, your VM is a program that executes instructions, whereas compiled C code runs on the bare metal. Executing the VM code is overhead!

To narrow the speed gap between dynamic languages and C, VM implementers are forced to eliminate this overhead. They do so by extending the VM to emit real machine instructions — bytecodes are effectively lowered into machine-codes in a process called Just-In-Time (JIT) compilation. Performance-oriented VMs, like Firefox's SpiderMonkey engine, have the ability to JIT compile their programs.

The term "Just-In-Time" is annoyingly vague — just in time for what, exactly? Dinner? The heat death of the universe? The time it takes me to get to the point already?

In today's JavaScript engines, the lowering from bytecodes to machine instructions occurs as the program executes. With the new JaegerMonkey JIT compiler, the lowering occurs for a single function that the engine sees you are about to execute. This has less overhead than compiling the program as a whole when the web browser receives it. The JaegerMonkey JIT compiler is also known as the method JIT, because it JIT compiles a method at a time.

For most readers, this means a few blobs of x86 or x86-64 assembly are generated as you load a web page. The JavaScript engine in your web browser probably spewed a few nice chunks of assembly as you loaded this blog entry.

Aside: TraceMonkey

In SpiderMonkey we have some special sauce: a second JIT, called TraceMonkey, that kicks in under special circumstances: when the engine detects that you're running loopy code (for example, a for loop with a lot of iterations), it records a stream of bytecodes that corresponds to a trip around the loop. This stream is called a trace and it's interesting because a) it can record bytecodes across function calls and b) the trace optimizer works harder than the method JIT to make the resulting machine code fast.

There's lots more to be said about TraceMonkey, but the inline caching optimization that we're about to discuss is only implemented in JaegerMonkey nowadays, so I'll cut that discussion short.

The need for inline caching

In C, accessing a member of a structure is a single "load" machine instruction:

struct Nose {
    int howManyNostrils;
    bool isPointy;

bool isNosePointy(struct Nose *nose) {
    return nose->isPointy;

The way that the members of struct Nose are laid out in memory is known to the C compiler because it can see the struct definition — getting the attribute nose->isPointy translates directly into a load from the address addressof(nose) + offsetof(Nose, isPointy).

Note: Just to normalize all the terminology, let's call the data contained within a structure the properties (instead of members) and the way that you name them the identifiers. For example, isPointy is an identifier and the boolean data contained within nose->isPointy is the property. The act of looking up a property through an identifier is a property access.

On the other hand, objects in JavaScript are flexible — you can add and delete arbitrary properties from objects at runtime. There is also no language-level support for specifying the types that an identifier can take on. As a result, there's no simple way to know what memory address to load from in an arbitrary JavaScript property access.

Consider the following snippet:

function isNosePointy(nose) {
    return nose.isPointy;

To get at the isPointy property, the JavaScript VM emits a single bytecode, called GETPROP, which says "pull out the property with the identifier isPointy". [#] Conceptually, this operation performs a hash-map lookup (using the identifier as a key), which takes around 45 cycles in my microbenchmark. [♠]

Uncached property access data

The process of "looking up a property at runtime because you don't know the exact type of the object" falls into a general category of runtime support called dynamic dispatch. Unsurprisingly, there is execution time overhead associated with dynamic dispatch, because the lookup must be performed at runtime.

To avoid performing a hash-map lookup on every property access, dynamic language interpreters sometimes employ a small cache for (all) property accesses. You index into this cache with the runtime-type of the object and desired identifier. [♥] Resolving a property access against this cache under ideal circumstances takes about 8.5 cycles.

Cached property access data

WTF is inline caching already!?

So we've established that, with good locality, JS property accesses are at least 8.5x slower than C struct property accesses. We've bridged the gap quite a bit from 45x slower. But how do we bridge the gap even bridgier?

Bridge fail!

The answer is, surprisingly, self-modifying code: code that modifies code-that-currently-exists-in-memory. When we JIT compile a property access bytecode, we emit machine-code that looks like this:

type            <- load addressof(object) + offsetof(JSObject, type)
shapeIsKnown    <- type equals IMPOSSIBLE_TYPE
None            <- goto slowLookupCode if shapeIsKnown is False
property        <- load addressof(object) + IMPOSSIBLE_SLOT

Now, if you ask Joe Programmer what he thinks of that code snippet, he would correctly deduce, "The slow lookup code will always be executed!" However, we've got the self-modifying code trick up our sleeves. Imagine that the type matched, so we didn't have to go to the slow lookup code — what's our new property access time?

One type load, one comparison, an untaken branch, and a property value load. Assuming good locality/predictability and that the object's type happened to already be in the register (because you tend to use it a lot), that's 0+1+1+1 == 3 cycles! Much better.

But how do we get the types to match? Joe Programmer is still looking pretty smug over there.

The trick is to have the slowLookupCode actually modify this snippet of machine code! After slowLookupCode resolves the property in the traditional ways mentioned in previous sections, it fills in a reasonable value for IMPOSSIBLE_TYPE and IMPOSSIBLE_SLOT like they were blank fields in a form. This way, the next time you run this machine code, there's a reasonable chance you won't need to go to slowLookupCode — the types might compare equal, in which case you can perform a simple load instruction to get the property that you're looking for!

This technique of modifying the JIT-compiled code to reflect a probable value is called inline caching: inline, as in "in the emitted code"; caching, as in "cache a probable value in there". This the basic idea behind inline caches, AKA ICs.

Also, because we emit this snippet for every property-retrieving bytecode we don't rely on global property access patterns like the global property cache does. We mechanical mariners are less at the mercy of the gods of locality.

Code generation

Where does "P" come from?

Er, right, we're still missing a letter. The "P" in "PIC" stands for polymorphic, which is a fancy sounding word that means "more than one type".

The inline cache demonstrated above can only remember information for a single type — any other type will result is a shapeIsKnown of False and you'll end up going to the slowLookupCode.

Surveys have shown that the degree of polymorphism (number of different types that actually pass through a snippet during program execution) in real-world code tends to be low, in JavaScript [♦] as well as related languages. However, polymorphism happens, and when it does, we like to be fast at it, too.

So, if our inline cache only supports a single type, what can we do to handle polymorphism? The answer may still be surprising: self-modify the machine code some more!

Before we talk about handling the polymorphic case, let's recap the PIC lifecycle.

The PIC lifecycle

The evolution of the PIC is managed through slowLookupCode, which keeps track of the state of the inline cache in addition to performing a traditional lookup. Once the slow lookup is performed and the PIC evolves, the slowLookupCode jumps back (to the instruction after the slot load) to do the next thing in the method.

When a PIC is born, it has that useless-looking structure you saw in the previous section — it's like a form waiting to be filled out. The industry terminology for this state is pre-monomorphic, meaning that it hasn't even seen one (mono) type pass through it yet.

The first time that inline cache is executed and we reach slowLookupCode we, shockingly, just ignore it. We do this because there is actually a hidden overhead associated with modifying machine code in-place — we want to make sure that you don't incur any of that overhead unless there's an indication you might be running that code a bunch of times. [♣]

The second time we reach the slowLookupCode, the inline cache is modified and the PIC reaches the state called monomorphic. Let's say we saw a type named ElephantTrunk — the PIC can now recognize ElephantTrunk objects and perform the fast slot lookup.

When the PIC is monomorphic and another type, named GiraffeSnout, flows through, we have a problem. There are no more places to put cache entries — we've filled out the whole form. This is where we get tricky: we create a new piece of code memory that contains the new filled-out form, and we modify the original form's jump to go to the new piece of code memory instead of slowLookupCode.

Recognize the pattern? We're making a chain of cache entries: if it's not an ElephantTrunk, jump to the GiraffeSnout test. If the GiraffeSnout fails, then jump to the slowLookupCode. An inline cache that can hit on more than one type is said to be in the polymorphic state.

PIC lifecycle

There's one last stage that PICs can reach, which is the coolest sounding of all: megamorphic. Once we detect that there are a lot of types flowing through a property access site, slowLookupCode stops creating cache entries. The assumption is that you might be passing an insane number of types through this code, in which case additional caching would only only slow things down. For a prime example of megamorphism, the 280slides code has an invocation site with 1,437 effective types! [**]


There's a lot more to discuss, but this introduction is rambling enough as-is — if people express interest we can further discuss topics like:

Suffice it to say that JavaScript gets a nice speed boost by enabling PICs: x86 JaegerMonkey with PICs enabled is 25% faster on SunSpider than with them disabled on my machine. [††] If something makes a dynamic language fast, then it is awesome. Therefore, inline caches are awesome. (Modus ponens says so.)



This is as opposed to, say, C++, where in any given snippet of code the == operator could be overloaded.


"Engine" is a sexy term, but it's just a library of support code that you use when language constructs don't easily fall into the translate-it-directly-to-machine-code model used by C.


Coupon only applies to idealized property access latencies. Competitor coupons gladly accepted. Additional terms and restrictions may apply. See store for details.


Alternative interpreter designs tend to walk over something that looks more like the source text — either an abstract syntax tree or the program tokens themselves. These designs are less common in modern dynamic languages.


There have historically been implementations that do things like this; notably, the Lisp machines and Jazelle DBX. The JavaScript semantics for ADD are particularly hairy compared to these hosted languages, because getting the value-for-adding out of an object can potentially invoke arbitrary functions, causing re-entrance into JavaScript interpretation.


In the bytecode stream the value isPointy is encoded as an immediate.


Note that there is actually further overhead in turning the looked-up property into an appropriate JavaScript value. For example, there are additional checks to see whether the looked-up value represents a "getter" function that should be invoked.


This is, in itself, a small hash-map lookup, but the hash function is quite fast. At the moment it's four dependent ALU operations: right shift, xor, add, and.


Gregor Richards published a paper in PLDI 2010 that analyzed a set of popular web-based JS applications. The results demonstrated that more than eighty percent of all call sites were monomorphic (had the same function body). I'm speculating that this correlates well to the property accesses we're discussing, though that wasn't explicitly established by the research — in JS, property access PIC are easier to discuss than function invocation PICs. In related languages, like Self, there is no distinction between method invocation and property access.


"Hidden overhead my foot! Where does it come from?" Today's processors get a little scared when you write to the parts of memory that contain code. Modern processor architecture assumes that the memory you're executing code out of will not be written to frequently, so they don't optimize for it. [‡‡]


The annoying part is that the instruction prefetcher may have buffered up the modified instructions, so you have to check if the modified cache line is in there. Older cache coherency protocols I've read about flush lines past unified caches if they detect a hit in both the instruction and data caches — maybe it's better nowadays.


I'm citing Gregor Richards yet again.


MICs give a nice percentage boost as well, but they're harder to disable at the moment, or I'd have numbers for that too.

House rental, showers, and auctions

I'm currently in the prologue of the house-rental process. Our sights are set on a four-bedroom house with an interesting property: the room resources are of differing value — not in terms of square-footage, but other, less easily scalable factors.

So how do we dole out the resources of unequal value in a manner that everybody considers fair?

(Talk about first-world problems...)

Proposed solutions

Three systems for deciding the room allocations have been proposed, and one has been rejected:

I'm personally a fan of the simplicity and total lack of competition in the random system, but if the bidding system is more likely to lead to the optimal outcome, it must be considered! (Of course, care must be taken to address possible social ramifications.)

During casual discussion we noted that the bidding system would be strategically interesting. If a person were not really interested in getting a high-value room but didn't care all that much if they did, it is clearly in their favor to drive bids on the early rooms as high as possible — this would entail a decrease in the payment rate for a later room.

I perused the Wikipedia entry on auction theory and realize that we were assuming an open/ascending bid auction, but a first-price/sealed-bid auction might be less socially stressful. [†] My main fear is that open bidding would get carried away (it's exciting, after all) and participants would come to regret their bids.

In any case, it's an interesting problem I thought I would share. Suggestions are welcome; otherwise, I'll write a follow-up entry as to how it turns out.



This means it has no shower. Stupid terminology, but I suppose it has a nicer ring than "toilet room".


As an aside, the game theory described on that page also seems to discuss provable equilibrium of a single auction, whereas our auctions have dependent equilibrium values.