August 22, 2011

Blog engine hot swap: from WordPress to MicroClog

This entry marks my transition from WordPress to a small monster of my own creation, which I've named MicroClog.

Of all the things I've lost...

I first switched to WordPress, from Blogger, about three years ago.

For a long time, I felt that I wasn't crazy enough to write my own blog software. So, in a totally sane fashion, I:

Of course, with every edit, I repeated all of these steps.

Now, this wouldn't be so bad if writing weren't such a damn perfectionist art. I'm not sure how many trips I took around this loop of automation apathy for each blog entry, but I'd guess the average was around five. Each trip around the loop, I hated it more.

They say that the definition of insanity is doing the same thing over and over again, but expecting different results.

Eventually, I snapped, and decided something had to be done. [*]

I think I've successfully channeled my gripes into the implementation of MicroClog. I hope that, ultimately, greasing the wheels on this process will help flush the 83 entries in my drafts folder (along with a small handful of unfulfilled promises to write something) out to the internet.

The idea behind MicroClog

Writing about code is a total pain in most blog engines. Writing in reStructuredText rocks.
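
To give a flavor (a made-up snippet, not an actual MicroClog entry): a reST literal block, opened with a double colon, lets you quote code verbatim with zero HTML escaping or <pre> tag wrangling.

Check out this one-liner::

    void noop(void) { /* does nothing, beautifully */ }

Everything indented under the double colon renders as preformatted text.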

MicroClog chooses reStructuredText over WYSIWYG/HTML editing, and existing distributed revision control systems over an in-blog-engine revision control system. The current workflow for MicroClog is:

There are also ways to share drafts in a restricted fashion. I'm currently hacking together a "live preview" on the server side for the reST entries you're editing on the client side, using the fancy new server-sent events API.

Ultimately, there are a few simple tasks that I want to optimize for:

I love writing in my text editor — especially when writing about code — but I also want to marginalize the advantages WYSIWYG has over markup by making live previews as smooth as possible.

Feature creep

There are some more sordid incentives for me to have all my blog data easily queried and manipulated in a Django app. A few of the features I'd like to try adding in the future:

First class updates

I would really like to support the idea of an "update" or "followup" as a first class feature — manually hacking old entries to point at newer ones with followup content is lame, and engine support for that kind of workflow isn't difficult.

More widgets

I've always wanted to have a widget where I could select a handful of my hundred-odd drafts and generate a poll where users could select the title/intro blurb that was most interesting to them. Knowing what people are interested in reading gives me additional motivation.

Decoupling syndication and entry labeling

I've found that tying planet syndication directly to the feed generated for a label is bothersome. Sometimes I want to syndicate an entry to a planet even though the label isn't appropriate, and sometimes I want to use the label without syndicating the entry to a planet.

Statistics pr0n

Because data is fun to look at. Some ideas I've had:

  • tf/idf style analysis to suggest tags automatically

  • Plot of entries correlated against start/publish date/time

  • Start-to-publish duration versus word count

The good left undone

I'm still writing the software out of a private repo, because I've perpetrated some epic hackery in the interest of shipping.

There's a bunch done, but there's a lot more cleanup, generalization, and feature implementation to do. My day job isn't at all webdev related, so if you're knowledgeable and interested in helping me generalize a system like this for public source release, feel free to get in touch!

Footnotes

[*]

Yes, I considered writing an extension to WordPress, but I took a personal vow to stop writing PHP quite some time ago.

The latest in monkey development

We've optimized closures, PIC'd on JavaScript, ARM'd ourselves to the hilt, been eradicated by robot overlords, and mapped the monkeysphere. We've been through a lot together. You've had to read a lot of words.

I'm happy to present you with the latest in monkey developments — you've earned it:

Tiny tutorial: learning about GCC generated code with objdump

The aroma is calling you. The best part of waking up is x86-64 in your CPU. [*]

Let's say it's early in the morning and you're a little cranky, but you want to see if/how the compiler converts the following into branch-less code:

extern int a, b, c;

void test() {
    c += a == b;
}

You're using the boolean result from the binary comparison operator to, potentially, bump the value in c. You know that the result of the equality is a bit value because the spec wouldn't lie to you — you've been through so much together:

Each of the operators yields 1 if the specified relation is true and 0 if it is false. The result has type int.

—ISO C99 6.5.9 Equality Operators (3)

By using extern variables as operands, you prevent the compiler from constant-folding or optimizing anything away, since it doesn't know squat about the variables except for their type.
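
As a quick sanity check of that claim, here's a variant (not from the original entry) where the compiler does know the values. Run it through the objdump recipe below and you'll see the whole body optimize away:

/* With known values, gcc -O3 constant-folds the comparison and
 * dead-store-eliminates c, leaving nothing but a ret. */
void test_folded(void) {
    int a = 1, b = 1, c = 0;
    c += a == b;   /* folds to c = 1, and c is never read: all dead */
}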

To check out what the compiler generates, run the following:

gcc -O3 -Wall -c foo.c
objdump -d -r -Mintel foo.o

This gives you a disassembly of the optimized instruction sequence in Intel assembly syntax, with relocations — extern placeholders whose actual memory addresses get filled in by the linker — displayed inline:

0000000000000000 <test>:
   0:   8b 05 00 00 00 00       mov    eax,DWORD PTR [rip+0x0]        # 6 <test+0x6>
                        2: R_X86_64_PC32        a-0x4
   6:   3b 05 00 00 00 00       cmp    eax,DWORD PTR [rip+0x0]        # c <test+0xc>
                        8: R_X86_64_PC32        b-0x4
   c:   0f 94 c0                sete   al
   f:   0f b6 c0                movzx  eax,al
  12:   01 05 00 00 00 00       add    DWORD PTR [rip+0x0],eax        # 18 <test+0x18>
                        14: R_X86_64_PC32       c-0x4
  18:   c3                      ret

The crux of the fun is the sete opcode, part of the set* family of opcodes that set their 8-bit operand to the bit value of a condition flag.

Then, because 8-bit opcodes preserve the higher bits of the register they operate on, and you need to perform a clean add of a bit value (with no junk left hanging around in the higher bits), you zero-extend the 8-bit value into the corresponding 32-bit form.

Finally, you have the bit value in eax which you can simply add to (the placeholder for) c.

It's also fun to note that, even if you used a 64-bit wide type (a long on an LP64 system like mine), the same zero-extending code sequence would be generated! Because 32-bit operations don't preserve the higher (32) bits of the register they operate on, but instead clear them out, the movzx instruction actually zeroes out all the bits in rax aside from the 8 in al that you're zero-extending. For even more tutorial-imbuing goodness, you can try switching the extern declaration over to long and test it out for yourself.
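
To save yourself the typing, that variant is just a one-line change (same gcc and objdump recipe as before):

/* 64-bit operands: the sete/movzx pair still suffices, because the
 * 32-bit movzx clears the upper 32 bits of rax for free. */
extern long a, b, c;

void test(void) {
    c += a == b;
}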

Footnotes

[*]

You, too, can make over-dramatized faces like all the people in that 80s commercial!

Mapping the monkeysphere

Our PyPy friends (in FreeNode #pypy) saw the recent IonMonkey Q&A on InfoQ, but it led to some confusion:

<Alex_Gaynor> verte: I swear, everytime a JS vendor opens their mouth I get more and more confused
<verte> Alex_Gaynor: indeed. isn't [an SSA-based IR] covered in Fisher Price' My First Compiler?

FreeNode #pypy, 2011-05-02

I figured if the PyPy folks — the folks who wrote a language system in which you can write interpreters such that those interpreters can be traced by construction — aren't sure what the JS team is up to, then some clarification couldn't hurt. Many thanks to Alex Gaynor for taking the time to chat about good topics for further discussion.

SpiderMonkey: primordial ooze

Back in the day, there was only one monkey.

SpiderMonkey was the very first monkey in the codebase: the name was picked back when Brendan Eich created the JavaScript engine implementation for Netscape 2. [*] I was told that the name was chosen because the spider monkey is the fastest of the monkeys, but I'm having trouble citing a reputable source — I feel the Internet Archive geocities page for user SpiderMonkeysRule3375 may have a tinge of bias.

Anyways, in the unimonkey era, it was clear that SpiderMonkey referred to "that chunk of code that runs JavaScript." That chunk of code had an interpreter component inside of it. Back in those days people said things like, "Duh, JavaScript is an interpreted language. Now pass me that anti-skip enabled CD player because I just got the new Kris Kross album." Meanwhile, Chambers, Ungar, and Hölzle were getting so-called "interpreted" languages like Self to compile within 5x of the performance of optimized C code, [†] but that's a whole 'nother rant...

SpiderMonkey's interpreter-based development continued for many years. Then, with the release of Firefox 3.5, in a move generally regarded as epic, the JS engine folks decided to ship a trace compiler implementation to 90 million active daily users. The addition of this Just-In-Time (JIT) compilation capability was a large evolutionary step in the codebase, so in 2008 that evolutionary step in the Mozilla JS engine implementation became known as TraceMonkey.

So, to recap: the JS engine was still called SpiderMonkey, but the evolutionary step was called TraceMonkey. Think of it like a release version.

TraceMonkey: monkey two in the monkey sea

So, with the advent of TraceMonkey, the JS engine had two "execution modes": the interpreter and the tracer. Each "execution mode" has its own implementation of JavaScript bytecode execution:

  • The interpreter executes the bytecodes one at a time, dispatching on each one.

  • The tracer records the operations performed as hot bytecode sequences run, and compiles those recordings to native code.

The tracer uses a compiler backend called NanoJIT. The observation the tracer performs, called recording, actually builds the NanoJIT compiler's primary data structure, known in compiler jargon as its intermediate representation (IR). When the tracer is done observing, it asks NanoJIT to turn the data structure that it built into native assembly code. See the MDC docs on the Tracing JIT for more info.

Along the way, NanoJIT does some of the alphabet soup compiler optimizations on this data structure: Common Subexpression Elimination (CSE), Dead Code Elimination (DCE), and local register allocation. More recently, with efforts on the part of nnethercote, it's capable of Loop Invariant Code Motion and of distinguishing between alias groups. As I understand it, the tracer's NanoJIT compiler pipeline configuration does an incremental forwards pass as you emit instructions into the buffer, and then performs a subsequent backwards pass that does something I've heard called "destination driven code generation". Sorry that I'm short on details, but I haven't worked on NanoJIT.
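
For anyone who hasn't met the alphabet soup before, here's what two of those optimizations do, as a generic C-level illustration (NanoJIT performs them on its IR, not on C source):

int before(int a, int b) {
    int x = (a + b) * 2;
    int y = (a + b) * 3;   /* (a + b) is a common subexpression */
    int z = a - b;         /* z is never used: dead code */
    return x + y;
}

/* What CSE and DCE conceptually leave behind: */
int after(int a, int b) {
    int t = a + b;         /* CSE: (a + b) computed once */
    int x = t * 2;
    int y = t * 3;
    return x + y;          /* DCE: the a - b computation is gone */
}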

NanoJIT's IR form is called linear Static Single Assignment (SSA). [‡] The term "SSA" is a piece of compiler jargon that sounds scary and inaccessible, but it's very simple when you whittle it down. If you think of instructions as producing values instead of writing to variables, then the sources of your instructions become other instructions. It's generally easier for a backend, like NanoJIT, to analyze instructions in the IR when they avoid sticking their results in a shared space (like a variable). Unfortunately, a full treatment of SSA form and why it's useful for optimizing compilers is future entry fodder.
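
Here's the whittled-down version of that idea in C clothing (a sketch of the concept, not NanoJIT's actual IR):

/* Variable view: x names a storage location that gets overwritten. */
int variable_view(int a, int b) {
    int x = a + b;
    x = x * 2;
    return x;
}

/* SSA view: every value is defined exactly once, so each use points
 * unambiguously at the computation that produced it. */
int ssa_view(int a, int b) {
    const int x1 = a + b;   /* "instruction" producing the value x1 */
    const int x2 = x1 * 2;  /* its source is the instruction x1, not a variable */
    return x2;
}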

I believe the term "linear SSA" refers to the fact that the tracer records a strictly linear sequence of instructions known as the trace. This recording process uses the instruction-as-producing-value model in building its data structure. The IR is qualified as "linear" because, in a linear sequence of instructions, there are no "joins", like you would deal with when compiling conditionals or loop headers.

Contrast linear and region-based IR.

There's been talk of adding a "join" instruction to the NanoJIT IR (jargon: phi instruction), which would permit the tracer to form simple branching operations in a trace; for example, computing the max of two numbers, coercing a number value that may be an integer to a double, or inlining a simple/tiny "else" block (jargon: alternate) of a conditional. From a cursory look at the NanoJIT source I don't think that's happened yet.
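
To make the missing piece concrete, here's the max example in C, with the spot where a phi would go marked in comments (schematic, not actual NanoJIT IR):

int max(int a, int b) {
    int m;
    if (a > b)
        m = a;    /* in SSA terms: m1 = a on this path */
    else
        m = b;    /* in SSA terms: m2 = b on this path */
    /* join point: SSA needs m = phi(m1, m2) to select whichever
     * definition flowed in; a purely linear IR has no such instruction. */
    return m;
}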

JägerMonkey: brushin' the monkey-fangs with a bottle of Jack

The tracer was optimized and stabilized through Firefox 3.6. Post FF 3.6, slightly before I arrived on the scene, the JS team decided that a baseline JIT should be implemented to increase performance in situations where the tracer was suboptimal — either because the record-and-compile process was taking too long or because of trace instability.

In 2010, the team created JägerMonkey, the next evolutionary step in the SpiderMonkey engine. JägerMonkey added a whole-method compiler with minuscule compile times.

At this point we had three execution modes:

  • the interpreter

  • the tracer

  • the whole-method JIT

Conveniently, each of these modes corresponded to an evolutionary step in the engine, so the components were colloquially named:

  • the interpreter: SpiderMonkey

  • the tracer: TraceMonkey

  • the method JIT: JägerMonkey

Even though the components are referred to as such, the JS engine as a whole is named SpiderMonkey.

Pictorial representation of the MonkeySphere.

The JägerMonkey compiler loops over the bytecodes of a method, performing abstract interpretation to turn the stack-based bytecode program into register-based assembly. During abstract interpretation, the compiler performs a suite of local optimizations: copy propagation, constant propagation, arithmetic expression folding, and greedy local register allocation. However, because the JägerMonkey compiler is designed, in part, for efficient speed of compilation, it doesn't build an IR and doesn't perform any "global" (cross-basic-block) optimizations. In contrast to the tracer, as a whole-method JIT it has to deal with the full gamut of control flow, so it ends up dropping most of its optimization information at join points.
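
As a rough illustration of that transformation, take the statement c = a + b. With hypothetical opcode names (schematic, not actual SpiderMonkey bytecode), the stack form and the register form the compiler aims for look like:

; stack-based bytecode: operands live on an implicit stack
getlocal a        ; push the value of a
getlocal b        ; push the value of b
add               ; pop both, push a + b
setlocal c        ; pop the sum into c

; register-based output (x86-ish, Intel syntax)
mov    eax, [a]
add    eax, [b]
mov    [c], eax

The abstract interpretation is what tracks where each stack slot's value actually lives (a register, a constant, a memory slot), so the compiler can emit the register form directly instead of emitting code that simulates pushes and pops.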

The method compiler also works on the JS execution stack directly, instead of forming its own side-stack for executing on "unboxed" values, like our trace-compiled code does. The performance impact of "falling off trace" and figuring out how to restore (jargon: reify) the latent JS stack state was one of the things that the baseline JägerMonkey compiler was designed to avoid.

IonMonkey: highly focused monkey energy

The most recent evolutionary step for the engine has been dubbed "IonMonkey". The IonMonkey compiler introduces a bushel of ideas we haven't seen in the SpiderMonkey engine before. It's designed as a whole-method compiler, but IonMonkey knows quite a few tricks that the JägerMonkey compiler was never taught.

The primary purpose of IonMonkey is to perform whole-method (jargon: global) compiler optimizations on code with join points (jargon: regions), like loops and conditionals, using a full-fledged IR. Generally, this makes it a more powerful, modern compiler design, incorporating some of the "classical" compiler algorithms that you have in close-to-the-metal compiler suites like LLVM and GCC, except specialized to represent and work on higher-level JS semantics with minimal overhead.

Type specialization

You can significantly optimize JavaScript programs by specializing the generated code for the types of objects that actually flow through it. The IonMonkey compiler is designed to use type information from either static type-inference analysis or runtime type profiling to specialize the code it compiles. Each of these type-information gathering approaches has its own merits.

Type data collected over several runtime code executions is a promising alternative to the single-recording trace formation that we currently use in the tracer. For code that's identified as hot, runtime type profiling collects, over the course of several executions, the set of types observed at each program point. This set of types is used to bias the generated code — the compiler assumes the types runtime profiling has observed are the ones that matter, and uses branch instructions in the generated code to guard that the assumption holds.

Static type inference, on the other hand, is known to produce a (conservatively) correct set of types that can flow through each point in the program — this is what's called a sound whole-program analysis. When it's used to complement run-time type profiling, type inference is all about giving you the knowledge to eliminate guards and unnecessary fallback code.

The IonMonkey compiler is capable of consolidating guards, but some guards defy consolidation, [§] and eliminating even very predictable guards can be important in hot loops. [¶] With type inference, the compiler knows things about a point in a JS program like:

The value in a is definitely a dense array of length 42 (with no "holes") that only contains integer values.

When you know a is full of integers and that idx is an integer, you don't have to guard that the value of a[idx] is an integer. You can also use interval analysis to avoid the length-check guard that establishes idx is less than 42.
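
Here's a C-flavored sketch of what that buys. Every name below is made up for illustration; this is the shape of the generated code, not real engine internals:

typedef struct JSArrayStandIn JSArrayStandIn;       /* hypothetical */

extern int is_dense_int_array(JSArrayStandIn *a);   /* hypothetical guard helpers */
extern int dense_length(JSArrayStandIn *a);
extern int dense_int_element(JSArrayStandIn *a, int idx);
extern int fallback(JSArrayStandIn *a, int idx);

/* Profiling-only specialization: each assumption is guarded by a branch. */
int load_guarded(JSArrayStandIn *a, int idx) {
    if (!is_dense_int_array(a)) return fallback(a, idx);  /* type guard */
    if (idx >= dense_length(a)) return fallback(a, idx);  /* bounds guard */
    return dense_int_element(a, idx);
}

/* With inference proving "ints only" and interval analysis proving
 * idx < 42, both guards and the fallback path disappear. */
int load_inferred(JSArrayStandIn *a, int idx) {
    return dense_int_element(a, idx);
}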

Design space

Personally, I don't think the design space for whole-method compilation of dynamic languages is well explored at this point. Making a HotSpot analog for JS raises a lot of open-ended, how-will-it-work-best questions. Here's a sampling:

Will there be a monkey king?

In the coming months the IonMonkey team will be answering an interesting question: do all of the existing execution modes have their place? The thing about adaptive compilation is just that... it's adaptive. Clearly, the fittest monkeys will survive.

An adaptive compiler can potentially subsume the roles of other execution modes. Yes, IonMonkey is being designed to whole-method optimize code that runtime profiling indicates is hot. However, despite the overhead of building an IR, it can potentially do quick block-local optimizations to make "baseline" generated code, like JägerMonkey did.

The IonMonkey backend can also potentially be used as the compiler for traces formed using the existing monitoring infrastructure — compilers that handle regions can work just as well on linear instruction sequences. Linear instruction sequences are generally easier to optimize, because you don't have to reconcile (jargon: meet) optimization information at join points.

The idea of a one-size-fits-all JS compiler platform is alluring, but, in my mind, it's not clear how feasible that really is. As the IonMonkey compiler comes online, the experimentation can start. The team will just have to bust out the range finders and measure the badassitude of the ion beam. We won't regress performance, so which monkeys remain standing has yet to be seen.

I suppose what I'm saying is that you can count on us to pit our vicious monkeys against each other in a deathmatch for your personal benefit.

Look at me, still talking, when there's science to do!

Footnotes

[*]

From the MDC wiki: "JS and NSPR have common roots in the dawn of time (Netscape 2)". Before that it had "prototype" names like Mocha and LiveScript.

[†]

Chambers, Ungar, Lee - An Efficient Implementation of SELF, 5. The Compiler. Also see the Oracle Labs SELF papers.

[‡]

A name which I'd never heard of before working on the JS engine.

[§]

Either because they are loop-iteration dependent or because of the presence of code constructs that you can't consolidate across.

[¶]

Microprocessor architecture perspective: even though modern microprocessors have loop buffers with pre-translated microcode and full branch prediction, you can't retire any of the guard's dependent instructions until you see that the guard has been predicted correctly.

[#]

The JVM's -server flag enables the highest optimization levels available in the JVM, a startup option typically reserved for more fixed, processing-intensive, server-class workloads.

Remembrance for Codington

I marched past the broken-down memory fences into Codington, white-knuckle clutching my .vimrc caliber sawed-off. It was a ghost town — not another developer within picoseconds.

"Can't get distracted," I mumbled into the abyss, "Sweep the perimeter, get a feel for what we're dealing with."

All it took was a few femtos of strafing — what had happened in Codington hit you in the face like a bag of bricks. All you could see were mutables and const_casts strewn around among the faux-silver bullet casings and streaks of blood trailing indiscernibly into the distance.

The plague had descended upon Codington as it had so many cities before. I spat in disdain and snarled.

"Const disease."