November 14, 2012

ARM chars are unsigned by default

[Latest from the "I can't believe I'm writing a blog entry about this" department, but the context and surrounding discussion is interesting. --Ed]

If you're like me, or one of the other thousands of concerned parents who have borne C code into this cruel, topsy-turvy, and oftentimes undefined world, you read the C standard aloud to your programs each night. It's comforting to know that K&R are out there, somewhere, watching over them, as visions of Duff's Devices dance in their wee little heads.

The shocking truth

In all probability, you're one of today's lucky bunch who find out that the signedness of the char datatype in C is implementation-defined. The implication being: when you write char, the compiler is implicitly (but consistently) giving it either the signed or unsigned modifier. From the spec: [*]

The three types char, signed char, and unsigned char are collectively called the character types. The implementation shall define char to have the same range, representation, and behavior as either signed char or unsigned char.

...

Irrespective of the choice made, char is a separate type from the other two and is not compatible with either.

—ISO 9899:1999, section "6.2.5 Types"

Why is char distinct from the explicitly-signed variants to begin with? A great discussion of historical portability questions is given here:

Fast forward [to 1993] and you'll find no single "load character from memory and sign extend" in the ARM instruction set. That's why, for performance reasons, every compiler I'm aware of makes the default char type signed on x86, but unsigned on ARM. (A workaround for the GNU GCC compiler is the -fsigned-char parameter, which forces all chars to become signed.)

Portability and the ARM Processor, Trevor Harmon, 2003

It's worth noting, though, that in modern times the ISA has both LDRB (Load Register Byte) and LDRSB (Load Register Signed Byte) instructions, so a byte load with sign extension can be done in a single instruction. [†]

So what does this mean in practice? Conventional wisdom is that you use unsigned values when you're bit bashing (although you have to be extra careful bit-bashing types smaller than int due to promotion rules) and signed values when you're doing math, [‡] but now we have this third type, the implicit-signedness char. What's the conventional wisdom on that?

Signedness-un-decorated char is for ASCII text

If you find yourself writing:

char some_char = NUMERIC_VALUE;

You should probably reconsider. In that case, when you're clearly doing something numeric, spring for a signed char so the effect of arithmetic expressions across platforms is more clear. But the more typical usage is still good:

char some_char = 'a';

For numeric uses, also consider adopting a fixed-width or minimum-width datatype from <stdint.h>. You really don't want to hold the additional complexity of char signedness in your head, as integer promotion rules are already quite tricky.
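
To make the promotion point concrete, here's a quick sketch (the variable names are made up for the example) of a shift that looks unsigned but isn't, along with the <stdint.h> spelling that sidesteps the char question entirely:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    unsigned char byte = 0x80;

    // Surprise: `byte` is promoted to (signed) int before the shift, so this
    // shifts a 1 into the sign bit -- undefined behavior in C, even though
    // both operands look unsigned.
    uint32_t risky = byte << 24;

    // Casting up front keeps the arithmetic unsigned the whole way through.
    uint32_t safe = (uint32_t)byte << 24;

    // For numeric byte-sized data, say what you mean with <stdint.h> types
    // instead of a bare char.
    int8_t  delta = -1;
    uint8_t mask  = 0xff;

    printf("%x %x %d %u\n", (unsigned)risky, (unsigned)safe, (int)delta, (unsigned)mask);
    return 0;
}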

Examples to consider

Some of the following mistakes will trigger warnings, but you should realize there's something to be aware of in the warning spew (or a compiler option to consider changing) when you're cross-compiling for ARM.

Example of badness: testing the high bit

Let's say you wanted to see if the high bit were set on a char. If you assume signed chars, this easy-to-write comparison seems legit:

if (some_char < 0)

But if your char type is unsigned, that test will never pass.
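
If what you actually want to know is "is the top bit set?", one sign-agnostic way to ask is to mask the bit explicitly. A small sketch (the function name is just for illustration):

#include <stdbool.h>

// Works the same whether the platform's plain char is signed or unsigned:
// convert to unsigned char first, then test the bit directly.
bool high_bit_set(char c)
{
    return ((unsigned char)c & 0x80u) != 0;
}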

Example of badness: comparison to negative numeric literals

You could also make the classic mistake:

char c = getchar(); // Should actually be placed in an int!
while (c != EOF)

With an 8-bit unsigned char datatype and a 32-bit int datatype, this loop can never terminate, because the comparison against EOF never succeeds. Here's the breakdown:

When getchar() returns ((int) -1) to represent EOF, the assignment truncates that value to 0xFF (because char here is an unsigned 8-bit datatype). Then, when you compare against EOF, that unsigned char value is promoted to a signed int without sign extension (preserving its value, 255), and you end up comparing 0x000000FF (255 in decimal) against 0xFFFFFFFF (-1 in decimal). For all the values in the unsigned char range, I hope it's clear that this test will never pass. [§]

To make the example a little more obvious, we can replace the getchar() call and the EOF constant with a -1 literal, and the same thing happens:

char c = -1;
assert(c == -1); // This assertion fails. Yikes.

That last snippet can be tested by compiling in GCC with -fsigned-char and -funsigned-char if you'd like to see the difference in action.
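
For completeness, the portable spelling of the loop keeps the getchar() result in an int and only narrows to char after the EOF check. A sketch of the usual idiom:

#include <stdio.h>

int main(void)
{
    int c;  // int, so it can hold every unsigned char value *and* EOF

    while ((c = getchar()) != EOF) {
        // Only narrow to char (if you need to) after the EOF check has happened.
        char ch = (char)c;
        putchar(ch);
    }
    return 0;
}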

Footnotes

[*]

The spec goes on to say that you can figure out the underlying signedness by checking whether CHAR_MIN from <limits.h> is 0 or SCHAR_MIN. In C++ you could do the <limits>-based std::numeric_limits<char>::is_signed dance.
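
For example, a compile-time check could look something like this (the macro name is made up):

#include <limits.h>

#if CHAR_MIN == 0
#  define PLAIN_CHAR_IS_UNSIGNED 1
#else
#  define PLAIN_CHAR_IS_UNSIGNED 0
#endif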

[†]

Although the same encodings exist in the Thumb sub-ISA, the ARM sub-ISA encoding for LDRSB lacks a shift capability on the load output as a result of this historical artifact.

[‡]

Although sometimes the tradeoffs can be more subtle. Scott Meyers discusses more issues quite well, per usual.

[§]

Notably, if you make the same mistake in the signed char case you can breathe easier, because you'll sign extend for the comparison, making the test passable.

Accomplish your new year's resolution of being more badass

I know what you're going through — I've done it all before.

The market is teeming with products that purport to help you meet your badassery quota.

First you do the shakes. Then, you go with the bars that say they're infused with the good stuff, but just seem to leave a slightly corrugated taste in your mouth. Finally, you're droppin' hard-earned dinero on a facility that you don't need or a professional badassery trainer whose appointments you desperately wish you could cancel.

But I'm here to tell you that you don't need to shell out cash-money to become more badass, my friends. Not anymore, thanks to the beauty of open source, the ES6 plans wiki page, and our delightful SpiderMonkey technical staff who are standing by to receive your calls for mentorship.

Allow me to explain.

Badass begets badass

You may have seen Badass JavaScript, self-described as "A showcase of awesome JavaScript code that pushes the boundaries of what's possible on the web." Check out their badass year in review if you haven't. (Some of the stuff that the interwebs has done with JavaScript even has @NeckbeardHacker envious.)

It probably won't surprise you, but do you know what those folks love? JavaScript evolution. There's nothing quite so refreshing as a fresh-squeezed JS language feature that removes that irritating itching-burning sensation. Sets that do what you mean! String repetition that you can use without copy/pasting from your last project! Inverse hyperbolic sine that you can use for... your... math! (Again, all of this is in the ES6 plans.)

I, for example, have wanted String.prototype.startsWith very badly, to the point that I've started washing people's window panes against their will as they exit highway 101. Around here, odds are that a programmer sees my sign and implements the thing just to stop me from bothering them again. (A little tactic that I call SpiderGuerilla warfare.)

Me, holding my SpiderGuerilla sign.

So what are you waiting for?

I know, you're probably already quite beefcake, but here's my three step plan:

  1. Watch the SpiderMonkey hacking intro.

  2. Pick out a bug from the ES6 plans.

  3. Come talk to great people on irc.mozilla.org in channel #jsapi (for example, cdleary, jorendorff, luke, or whoever else is in there) or comment in the bug — tell them that you're on a quest to become even more badass, describe a bug that you're interested in taking, and give a quick note on what you've done with the engine so far — for example, walking through the video in step 1! We'll find you a mentor who will get you started on the right track.

Don't miss out on this exclusive offer — SpiderMonkey contribution is not sold in stores.

In fact, if you act now, we'll throw in an IonMonkey shirt (or another Firefox shirt of equivalent awesomeness) and publish a blurb about your feature in Mozilla hacks. Of course, you can also have yourself added to about:credits, providing that's what you are into.

IonMonkey shirt.

This one-of-a-kind offer is too ridonk to pass up. Just listen to this testimonial from one of our badass contributors:

I started contributing to SpiderMonkey and now I can write a JIT compiler from scratch in a matter of days. BEEFCAKE!

@evilpies [Liberally paraphrased]

See you in the tubes!

SpiderMonkey bubbles OOM retvals

Handling Out-of-Memory conditions gracefully in a large-scale application is notoriously difficult. [*] You generally have to detect that you're OOMing at an arbitrary allocation site, then carefully not allocate any memory, but delegate to part of your application that can recover or report what happened to the user, all without allocating any memory.

Did I mention that, once you've OOM'd, you really shouldn't allocate any memory?

Recovery, so far as I know, means flushing out any expendable caches or trying to coax the operating system to page stuff out to disk.

Of course, malloc may only indicate this OOM-like scenario (by returning NULL) if you're "lucky". If you're unlucky, on overcommit systems you can get taken out by the OOM killer, dispensing its unique brand of indiscriminate justice. [†] Or, your program can slow to a brain-numbing crawl as its memory is relegated to swap and the bully operating system chants, "Stop DoSing yourself! Stop DoSing yourself!"

In any case, a NULL return value from malloc must not be propagated down the "successful allocation" paths, in order to avoid potentially-exploitable NULL pointer dereference vulnerabilities.

SpiderMonkey

In SpiderMonkey we check for error conditions, including OOM conditions, everywhere. (C++ exceptions were not attempted in the Mozilla code base, due to performance issues with the Windows ABI.) As a result, most functions in SpiderMonkey are fallible. A typical signature looks something like:

bool
DoSomeStuff(JSContext *cx)
{
    MonkeyCage *cage = cx->new_<MonkeyCage>();
    if (!cage)
        return false;

    // ...put monkeys in the cage, or something...

    return true;
}

A bool is returned in order to indicate failure. cx is the execution context, which is a fancy way of saying, "An object that you thread through all your engine functions because it holds references to junk you're going to need."

One thing you're definitely going to need is allocation functionality. You can get at allocation functionality through helper methods like cx->malloc_() and cx->new_<>() — these do some accounting (how many outstanding bytes have been malloc'd on this context?) and, if you run out of memory, flush some GC stuff to see if there's free space to be had.

When an error occurs while you DoSomeStuff, you set "exception" details (such as a JavaScript exception object or an "I've hit an OOM" flag) on the context. Then, you return false to your caller. And, if it doesn't know how to handle the exception, that caller returns false to its caller.

This transitive "bubbling" of a false return value to callers unwinds the C++ stack — RAII objects with destructors release any resources they had acquired — and permits callers who understand the error to handle it appropriately. In this sense, even out-of-memory errors are "catchable" by native code.

At the outer membrane of the engine is the SpiderMonkey API, called the JSAPI. The functions in the JSAPI reflect this same fallibility: most of the API functions also return a boolean value and take a context which can be used to report an error.

V8

By contrast, V8 is able to use the following construct to avoid a similar OOM check-and-bubble guideline:

void* Malloced::New(size_t size) {
  void* result = malloc(size);
  if (result == NULL) {
    v8::internal::FatalProcessOutOfMemory("Malloced operator new");
  }
  return result;
}

The FatalProcessOutOfMemory calls a fatal error callback and then calls abort(). It doesn't look like the callback is capable of doing recovery and continuing execution, but I'm new to the codebase, so I'm not totally sure.

With process isolation, calling abort() when you run out of memory is somewhat excusable. If we did this in Firefox, which does not have process-per-tab, one page which hit such an OOM would take out the whole program. In Chrome, however, an OOM handled in this way will take out the current process-isolated tab group. When you have lots of tabs open in Chrome, a sizable number of tabs may be muxed to a single process, in which case this might be considered bad user-level behavior, but it works well in the more common case (with a small number of tabs).

Ignoring opinions on whether aborting a tab group is somehow "worse" than alternatives, I have to say that not explicitly checking for OOMs is nice. Writing, maintaining, and validating the correctness of seldom-taken but abundantly-existent OOM paths is a tax on development.

"Infallible" allocation

Allocations can be bucketed into two simple categories:

  1. Small, well-understood, fixed-size allocations that the engine needs in order to make progress (a node here, a little header structure there).

  2. Allocations whose size is driven by content (a giant string, a huge array) that may fail simply because the content asked for something unreasonable.

If you've attempted allocation recovery procedures but still fail on an allocation in the former category, life is tough. These should generally succeed, because they're well-understood known quantities: a call to malloc(sizeof(double)) failing means you're up against a pretty serious memory limit, so it's tough to do more than display a pre-canned, "Sorry, everything is hosed!" dialog. Unless your process is architected such that you can cleave off and deallocate a large amount of currently used (and otherwise unreferred to) memory at the user's behest with zero memory overhead, you don't have very good options.

By contrast, the latter might not succeed just because the user-controlled content is sufficiently weird, malformed, or malicious.

To avoid the tax I mentioned in the last section, the Mozilla Gecko engine (outside of SpiderMonkey) has been moving to use a strategy called "infallible malloc" for the former category of allocations. If a well-understood allocation of a reasonable and generally fixed size fails, it will first attempt memory recovery procedures [‡] and, failing at that, will cause the process to abort. With this scheme, you avoid the development tax of checking for OOM conditions.
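
The shape of such an allocator is simple enough; here's a minimal sketch (the function names are hypothetical, not Gecko's actual API), in which the only way the call returns is with a usable pointer:

#include <stdio.h>
#include <stdlib.h>

// Hypothetical hook where an embedder could flush caches, trigger a GC, and
// so on, before we give up for good.
static void try_to_recover_memory(void) { /* ... */ }

// "Infallible" allocation: callers never see NULL, so they never need an
// OOM-handling path. A failed small, fixed-size allocation aborts instead.
void *infallible_malloc(size_t size)
{
    void *p = malloc(size);
    if (!p) {
        try_to_recover_memory();
        p = malloc(size);
    }
    if (!p) {
        fprintf(stderr, "out of memory allocating %zu bytes\n", size);
        abort();
    }
    return p;
}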

Back to SpiderMonkey

So, is there practical benefit to bubbling an OOM through the whole engine to the API caller?

Currently, yes. Ultimately, probably not.

At the moment, both of the categories of allocation I mentioned are using the same mechanism, so hitting a content-induced OOM (second category) in SpiderMonkey will bubble the issue up to Gecko, which will be able to report that error and proceed as normal. In the future, however, there's no essential reason to bubble the issue up to Gecko: we could just use infallible malloc from SpiderMonkey directly.

SpiderMonkey already supports specification of a custom allocator at compile time, so we would just need to incrementally identify known- and fixed-size allocations and port them to use infallible malloc. That way, instead of bubbling up a return code from SpiderMonkey to ultimately cause an abort() from infallible malloc in Gecko, we could cause the abort() from SpiderMonkey's infallible malloc call directly, and avoid the OOM development tax.

Footnotes

[*]

With apologies to Erik Corry for misunderstanding how V8 handled OOMs in my recent reading!

[†]

This could occur due to a legitimate dereference of a memory chunk that malloc cheerily gave you — it just never happened to be backed by physical memory! Similarly, on a system where multiple applications are running simultaneously, any one of them is a potential candidate for OOM killing, so your process may get OOM killed because a different process legitimately dereferenced its overcommitted memory.

[‡]

I've been informed that this is not currently enabled, but is under active development by the MemShrink team.

Thinking of dynamic language code as a template

You can think of dynamic language code as a template. This code has generality because of duck typing; the idea is roughly, "If the code can work on a value, it will, with no muss or fuss."

Some programmers find this to be an empowering form of "do what I say" — you don't need to explain everything you want your code to do in terms that even an optimizing compiler can understand.

In traditional dynamic language VMs, this template is instantiated with concrete value types at runtime. When a value flows into a bit of code, the interpreter inspects the type of the value and then runs the appropriate "instantiation of the template".

Example

Consider the function:

function bitAndAdd(a, b, c) {
    return a + (b & c);
}

In its maximally generic, template-looking form it reads like this pseudo bytecode:

bitAndAdd<T0, T1, T2> =
    GETARG0 -> T0
    GETARG1 -> T1
    GETARG2 -> T2
    BITAND<T1, T2> -> T3
    ADD<T0, T3> -> T4

In the JS engine, the T-prefixed types can be any in the set {null, undefined, boolean, int32, double, object, string} — these are the engine internal value tags.

In a bytecode-based interpreter, each bytecode typically asks, "Is this thing that I'm acting on an int32? Or double? Or string? Or ...? Or user defined type?" [*] Once that's determined, it runs the appropriate sub-section of the bytecode implementation for that type. The question-asking part of the bytecode is, in effect, directing you to the appropriate template instantiation. (Note that bytecodes are kind of like mini-templates composed into the larger function template.)

When you call bitAndAdd<int32, int32, int32>, you get an instantiated code path at runtime that looks like:

bitAndAdd<int32, int32, int32> =
    GETARG0 -> int32
    GETARG1 -> int32
    GETARG2 -> int32
    BITAND<int32, int32> -> int32
    ADD<int32, int32> -> ...

This path represents all of the machine code executed in the interpreter that does useful work for this function. Imagine the bytecodes that make up the function as a bunch of enum values in an array. For each enum value in the array, the interpreter dispatches to a type-unspecialized bytecode implementation. Then, the unspecialized bytecode implementation inspects the input types and dispatches to the specialized implementation.

A decent example is the ADD opcode in JavaScript. The unspecialized version of ADD looks something like:

ADD<T0, T1>(T0 lhs, T1 rhs) {
    if (T0 is int32 and T1 is int32)
        result = ADD<int32, int32>(lhs.getInt32(), rhs.getInt32())
    else if (T0 is string and T1 is string)
        result = ADD<string, string>(lhs.getString(), rhs.getString())
    else if (...)
        ...
}

You can see the unspecialized ADD delegating to the type-specialized bytecode sub-implementations. A specialization like ADD<int32, int32> is fairly simple and reads something like:

ADD<int32, int32>(int32 lhs, int32 rhs) {
    result = lhs + rhs
    if (overflow)
        result = (double) lhs + (double) rhs;
}
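
Putting those pieces together as a runnable sketch (plain C, not SpiderMonkey's actual interpreter; the names are made up): a tiny bytecode array, the dispatch loop, and an unspecialized ADD that inspects operand tags before delegating:

#include <stdio.h>
#include <stddef.h>

typedef enum { OP_GETARG0, OP_GETARG1, OP_GETARG2, OP_BITAND, OP_ADD } Opcode;
typedef enum { TYPE_INT32, TYPE_DOUBLE, TYPE_STRING } Tag;

typedef struct {
    Tag tag;
    union { int i; double d; } u;
} Value;

// Unspecialized ADD: ask "what are my operand types?" and dispatch to the
// matching specialization, just like the pseudo bytecode above.
Value do_add(Value lhs, Value rhs)
{
    Value result;
    if (lhs.tag == TYPE_INT32 && rhs.tag == TYPE_INT32) {
        result.tag = TYPE_INT32;
        result.u.i = lhs.u.i + rhs.u.i;  // overflow check elided
    } else {
        result.tag = TYPE_DOUBLE;
        result.u.d = 0.0;  // ...coerce both sides and add, in a real engine...
    }
    return result;
}

// The outer loop: one dispatch per bytecode ("mini-template"), repeated for
// every opcode in the function's bytecode array.
Value run(const Opcode *code, size_t len, Value *args)
{
    Value stack[16];
    size_t sp = 0;
    for (size_t pc = 0; pc < len; pc++) {
        switch (code[pc]) {
        case OP_GETARG0: stack[sp++] = args[0]; break;
        case OP_GETARG1: stack[sp++] = args[1]; break;
        case OP_GETARG2: stack[sp++] = args[2]; break;
        case OP_BITAND: {
            Value rhs = stack[--sp];
            Value lhs = stack[--sp];
            Value v;
            v.tag = TYPE_INT32;  // assume int32 operands for brevity
            v.u.i = lhs.u.i & rhs.u.i;
            stack[sp++] = v;
            break;
        }
        case OP_ADD: {
            Value rhs = stack[--sp];
            Value lhs = stack[--sp];
            stack[sp++] = do_add(lhs, rhs);
            break;
        }
        }
    }
    return stack[sp - 1];
}

int main(void)
{
    Opcode bitAndAdd[] = { OP_GETARG0, OP_GETARG1, OP_GETARG2, OP_BITAND, OP_ADD };
    Value args[3];
    args[0].tag = TYPE_INT32; args[0].u.i = 7;
    args[1].tag = TYPE_INT32; args[1].u.i = 0xff;
    args[2].tag = TYPE_INT32; args[2].u.i = 0x0f;
    Value v = run(bitAndAdd, sizeof bitAndAdd / sizeof bitAndAdd[0], args);
    printf("%d\n", v.u.i);  // 7 + (0xff & 0x0f) == 22
    return 0;
}

Every trip around that for loop is dispatch overhead that a JIT-compiled version of the function simply doesn't pay.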

Specializing the template

Of course, with JIT technology, we try to be a lot smarter about making the instantiation efficient. There are a few key mechanisms:

Avoiding bytecode dispatch

The composition of mini-templates (bytecodes) into the larger program template (made out of bytecodes) causes "jump to the next bytecode in the program" cost in an interpreter loop. By creating a machine-code implementation for the sequence of bytecodes that make up a function, we naturally cause machine-code duplication, but avoid that interpretation overhead. This is a natural consequence of choosing to JIT compile a function.

Constant propagation / local type inference (prime example: JägerMonkey)

If the JIT sees that you're performing a bitwise-and, and the spec says that the result of a bitwise-and is always an int32, the JIT can safely assume that fact downstream in the code; for example, in:

function byteSquare(x) {
    var a = x & 0xff;
    var b = a * a;
    return b;
}

The inputs to the a * a multiply are int32s, without question. There's no reason to instantiate the machine code for other types when you know it's going to be an int32.

Whole program type inference (prime examples: JägerMonkey + Type Inference, IonMonkey)

Flowing conservative type information through all the code in the system tells you key interprocedural facts. For example, just by looking at your program it may be easy to see that the divmod(dividend, divisor) function is only ever called with two constant integers as arguments.

With that as the basic idea, the analysis can get a lot fancier; for example, analyses can see that you're only storing new Car() objects into the vehicle property of your Person records, and that new Car() will always have an int32 value in its mufflerCount attribute.

By considering everything [†] that the template could possibly do, we can still narrow down the possibilities in worthwhile cases. This lets us instantiate less code and know more facts for use in optimization.

Trace recording (prime example: TraceMonkey)

When a loop is determined to be important (i.e. it hits a certain execution count) a recorder notes which bytecodes execute in the interpreter and what types flow into them. This avoids both bytecode dispatch overhead and type ambiguity, creating a very highly specialized template.

However, if the types ever stop matching up with the observed bytecode execution — say, myObject.quotient used to be an int32, but when we ran the trace a subsequent time it turned out to be a double — the execution of that trace cannot continue, because it's instantiated for the wrong types downstream. Because of this possibility, a bunch of checks for type compatibility have to be inserted in the template instantiation.

It's important to note that the generated template is more than just type dependent — it can also be value dependent, because a trace only records a single execution path, and questions like "Do I execute the if body or the else body?" are answered by the values flowing through, not just their types.

Some folks argue that single-execution recording tends to create overly instantiated templates for typical dynamic language code.

Profiling (prime example: IonMonkey)

If you think of trace recording as observing bytecode for a single run of execution, profiling is observing bytecode for n runs of execution [‡] and accumulating the notes. This eases the pain of compiling "flip floppy" code, because you don't have to see the primarily-taken path the first time; you have n tries to get a feel for it.

This process lets you instantiate a template that's general enough to handle all the things you've seen over the n iterations you've observed.

While trace compilation culls out all but one path of control flow, profiling data can be used to cull out all unobserved (and presumably cold) control flow as well. This may increase the effectiveness and decrease the runtime of dataflow-analysis-based optimizations, because there's less unimportant code to consider. I'm not aware of existing profiling JITs that do this, however, which indicates to me that it may not be an overall win.

Further discussion

Dynamic language compilers try to use runtime feedback, as well as things they can infer about the code, to specialize the template in the best way possible and make the generated machine code blazingly fast. We want to be able to mix "things we definitely know" (from analysis) with "things we are pretty sure we know" (from runtime feedback) in order to get the maximum speedup possible.

One exciting thing about Just-In-Time compilation is that this runtime feedback makes it at least theoretically possible to exceed the speed of a corresponding static program; however, there are significant challenges involved. Whenever we start leaning too heavily on "things we are pretty sure we know," we need to be able to keep the world from falling apart when we're wrong.

It's actually a fairly emotional topic in compiler engineering: how do you hold the world together when everything you thought you knew is wrong?

Be strong. Until next time.

Footnotes

[*]

Of course, bytecodes that don't need to know about the types of their operands don't ask the questions. Note that GETARG0 takes no parameters, and so wouldn't need to ask any questions.

[†]

Nearly everything. The type inference algorithm actually makes some conservative simplifications to cut down on analysis time.

[‡]

To some approximation. As an alternative, you can also observe execution for some period of time t, which has some correlation to the number of runs of execution.

Understanding JIT spray

Steel your mind for a tale of intrigue, intertwingled with a complex topic in browser security. (It's kind of all over the place, but I might spray something useful.)

Our story, like so many others, starts out with a browser user like yourself, a bottle of red wine, and a devoted young hacker from the Eastern Bloc [*] that answers to the handle "Coleslaw". [†]

Winey-and-Cheesy Corporation, the largest international wine and cheese distributor, has just blitzkrieg bopped the mainstream media over the head with a tactical PR campaign — a free case of wine and sizable wheel of Gouda for the five millionth visitor to their website. [‡]

The only problem is that Winey-and-Cheesy's massively trafficked website... has been owned.

Coleslaw is something of a wunderkind, and has, through feats of social engineering and technical prowess paralleled only by terrible movies from the mid 90s, gained the ability to insert some arbitrary, special-sauce HTML and JavaScript into that promotional page.

Coleslaw intends to perform a "zero-day attack" — this means that there's a bug in the browser that Coleslaw knows about, but that the browser vendors are unaware of. Coleslaw thinks that this bug can be used to take over the machines of some unsuspecting users who visit the promotional page, capitalizing on their maniacal love of fine dining.

The attacker's dilemma

So, to recap, Coleslaw has found a bug in the browser. Coleslaw wants to exploit that bug in order to obtain arbitrary code execution — the ability to run whatever code Coleslaw feels like on the machine that's running the vulnerable browser. The question is, how does Coleslaw get from point A, "I see that there's a bug", to point B, "I can run anything I want on the vulnerable machine"? The process of figuring this out is called exploit development.

The exploit development process is a narrative about control. Coleslaw starts off by having control over a small set of things — the JavaScript and HTML on a page that the browser visits — but wants to end up controlling everything that the user controls on the vulnerable machine. The environment that internet sites have access to is supposed to be sandboxed; i.e. internet sites are expected to have a somewhat limited and carefully selected set of things they can control. For example, web sites that you happen to stumble across shouldn't be able to delete files off of your hard drive.

Strongly-related to this narrative about control is the concept of determinism. If Coleslaw has a concrete understanding that performing some action, like calling alert from JavaScript, always results in some consequence, like creating and displaying an alert window in the browser, then Coleslaw has effectively extended the realm of control from JavaScript to triggering-the-code-in-the browser-that-displays-an-alert-dialog. Barring bugs, the realm of control is always confined to the sandbox — the set of possible actions are those that the browser vendor permits an untrusted website to take.

Not all bugs are created equal

There are lots of different kinds of bugs that browser software can have. There's a relatively tiny set of bugs that permit control flow hijacking, which are generally of interest for gaining arbitrary code execution. Successful hijacking implies that you have the ability to control the address of the instruction being executed, which is commonly referred to as pseudo-register %eip (where ip is short for instruction pointer). With full control of %eip, the attacker can point it at any executable code — possibly at executable code that they've created.

Control flow hijacking is typically accomplished through some kind of memory corruption, stemming from errors in the use of type-unsafe programming constructs in the browser. In general, the bugs of interest for control flow hijacking are the ones that let an attacker influence a memory write — buffer overruns [§] and other invalid-write bugs that can clobber a return address, function pointer, vtable pointer, or some other value that later feeds an indirect jump.

There's also the possibility of using an attacker-controllable invalid memory read bug to cause an invalid write to happen further along in program execution. Bugs that cause segfaults are carefully evaluated by browser security teams to see if the invalid memory access being performed can be manipulated for use in control flow hijacking.

Platform-level mitigations: DEP, ASLR, and canaries

There are some nifty platform-level protections against traditional control flow hijacking techniques. They make both taking control of %eip and executing an attacker controlled code sequence more difficult.

One control-flow hijacking mitigation is stack smashing protection, which is enabled at compile time using a technique referred to as "canary values". An attacker could historically use stack buffer overruns to clobber the return address in a function frame with a target %eip value, and the ret instruction at the end of the function's machine code would return to that new (attacker controlled) address. With this mitigation enabled, however, the compiler places a special value on the stack between the local variables (where the buffer lives) and the return address. The compiler also augments the function epilogue with code that checks the canary on the stack against its original value just before returning. If a stack buffer overrun overwrites the return address, the canary that lives in the contiguous space between the locals and the return address will indicate that things have gone horribly wrong.
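
As a rough picture of what the canary is guarding (a sketch; the exact frame layout varies by compiler and platform):

#include <string.h>

void greet(const char *name)
{
    char buf[16];
    // Conceptual frame layout with stack protection enabled (higher
    // addresses toward the top):
    //
    //   [ return address ]  <- what the attacker wants to clobber
    //   [ saved frame ptr ]
    //   [ canary value    ]  <- re-checked just before the function returns
    //   [ buf[0..15]      ]  <- an overrun writes upward, trashing the canary
    //                           before it can reach the return address
    //
    strcpy(buf, name);  // classic stack buffer overrun if name is 16+ chars
}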

Generally, we tend to think of executables as containing all their executable code as static machine-code. Other than the code that the compiler spat out as specific sections of the executable, nothing else should run over the course of the program's execution. This expectation is codified in an OS-level mitigation called Data Execution Prevention (DEP).

The goal of DEP is to prevent things which are not code from being executed as code at runtime. Your program stack, for example, is just a bunch of space for data that your C function frames can't keep in registers. There's basically no reason that a sane program would ever want to start executing the stack area of memory as if it were code — if something like that were to happen, it would be better if your program just terminated, because it could be the pivotal point before an attacker like Coleslaw takes control. Program termination means loss of control for the attacker.

Trying to execute code that was not in the original binary will generally cause the program to fault. In a JIT, however, we purposefully create code at runtime, violating the all-the-code-is-in-the-binary assumption. As a result, we have to explicitly mark the machine code that we create as executable by calling to an operating system API function, like VirtualProtect or mprotect, to indicate that the data the process has created should really be executable.
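
On Unix-y systems, for example, the dance looks roughly like this (a minimal sketch with a made-up function name; a real JIT also worries about W^X policies and flushing instruction caches on some architectures):

#include <string.h>
#include <sys/mman.h>

// Copy freshly generated machine code into an executable mapping.
void *make_executable_copy(const void *code, size_t len)
{
    // Get a page-aligned, writable buffer from the OS.
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return NULL;

    memcpy(buf, code, len);

    // Flip the mapping to read+execute: this is the step that tells the OS
    // "yes, really run this data as code", punching through DEP.
    if (mprotect(buf, len, PROT_READ | PROT_EXEC) != 0) {
        munmap(buf, len);
        return NULL;
    }
    return buf;
}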

DEP's close friend from acronym club is Address Space Layout Randomization (ASLR). ASLR reduces determinism in the process that the attacker is trying to exploit by randomizing the stack address, library load addresses, heap address, and PEB address, amongst other key program components. With this mitigation, hardcoded constant addresses in attacker-crafted code become probabilistically unlikely to succeed at hitting their target. As an example, the start of the program stack could wind up being placed at one of 16 thousand locations!

This also means that the address of system DLLs, like the ones containing OS API functions like VirtualProtect and C library functions like system, are probabilistically unknown to the attacker. Since the browser ships linked with all ASLR-enabled DLLs, it's difficult to use linked DLL code as direct footholds in process space.
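
You can watch ASLR at work with a trivial program: run it a few times and the addresses it prints bounce around (assuming ASLR is enabled; the static-data address only moves when the executable is built position-independent):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int on_the_stack;
    static int in_static_data;
    void *on_the_heap = malloc(16);

    // With ASLR on, the stack and heap addresses change from run to run;
    // static data moves too if the binary is position-independent.
    printf("stack local: %p\n", (void *)&on_the_stack);
    printf("static data: %p\n", (void *)&in_static_data);
    printf("heap chunk:  %p\n", on_the_heap);

    free(on_the_heap);
    return 0;
}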

Coleslaw wants to run an attacker-controlled code payload, but DEP makes it difficult to execute that payload, since it won't be marked as executable by default.

Coleslaw wants to be able to turn the bug that relinquishes control of %eip into a reliable exploit, but ASLR makes it difficult to know where to point %eip in order to run exploit code.

I imagine that turning a crash into an exploit isn't trivial these days.

Staged shellcode payloads

The machine-code payloads that attackers create are referred to as shellcode. Shellcode is generally characterized by its size and its goal, which is usually reflected by the "stage" it's said to be running. For example, the very first shellcode to run, in computer science style, is referred to as "stage 0".

Intermediate stages of shellcode are often used to bootstrap more complex executable code sequences. The complexity involved in turning a bug into an exploit often prevents arbitrarily complex code sequences from executing immediately, so tinier code sequences are written that just delegate responsibility to a more easily formed executable payload. Constraints that apply to the code that the exploit starts running directly tend to disappear after you've gone through some amount of indirection.

Shellcode can easily embed astoundingly small code sequences called "egg hunters" to find the memory address of other attacker-controlled payloads. The egg hunters are designed to avoid faulting the application, because faults cause the attacker to lose control. They work by performing a series of fast-and-minimally-sized system calls to determine whether a virtual memory page is safe to traverse through and read to find the "egg" payload delimiter.

Once the address of a stage 1 data payload is determined, stage 0 shellcode may attempt to make that segment of memory executable. Despite ASLR, the address of the VirtualProtect function can be derived by hopping from the known TEB address to the PEB address to the DLL loader address mapping table. Once executable permissions have been added to the stage 1 shellcode, it can simply be jumped to.

Another alternative, if the stage 0 shellcode is executing out of a code space with both writable and executable permissions and sufficient available space, is to use what's called a "GetPC" shellcode sequence to determine the current value of %eip and then copy the contents of a stage 1 shellcode payload buffer into the current code space.

For some bugs it may be possible to create "common" stage 0 shellcode to bootstrap any other shellcode payload — this common shellcode is a valuable commodity for exploit toolkits.

JIT spray, deconstructed

As mentioned earlier, the JIT has to mark its own assembly buffers as executable. An attacker may look at using that fact to generate executable stage 0 shellcode, in order to bypass some of the pain inflicted by DEP. But how could you possibly use the JIT compilation process to make shellcode?

JIT spraying is the process of coercing the JIT engine to write many executable pages with embedded shellcode.

Blazakis, 2010

Dion Blazakis wrote the seminal paper on JIT spray, in which he presented a jaw-dropping example. Blazakis noticed that the following ActionScript [¶] code:

var y = (
    0x3c54d0d9 ^
    0x3c909058 ^
    0x3c59f46a ^
    0x3c90c801 ^
    0x3c9030d9 ^
    0x3c53535b ^
    ...
)

was JIT-compiled into the following instruction sequence:

addr    op imm          assembly
0       B8 D9D0543C     MOV EAX,3C54D0D9
5       35 5890903C     XOR EAX,3C909058
10      35 6AF4593C     XOR EAX,3C59F46A
15      35 01C8903C     XOR EAX,3C90C801
20      35 D930903C     XOR EAX,3C9030D9
25      35 5B53533C     XOR EAX,3C53535B

Check out the first line — it's showing that the first instruction is a MOV that places the 32-bit immediate payload into the EAX register. The 32-bit immediate payload from that instruction (3C54D0D9) is exactly the immediate that was used as the left-hand-side to the long XOR sequence in the original ActionScript code.

Now, if we look at the subsequent lines, we see that the addr column, which is showing the address of instructions relative to the start of the sequence, goes up by 5 every time. That's because each instruction after the initial MOV is performing an XOR against the original value in the accumulator register, EAX, exactly as the ActionScript program described.

Each of these instructions is exactly five bytes long — each instruction has a one-byte opcode prefix, given under the op column, followed by a 32 bit immediate constant: the opcode for MOV EAX,[imm32] is 0xB8, and the opcode sequence for XOR EAX,[imm32] is 0x35.

The immediate column may look confusing at a glance, but it's actually just the little-endian equivalent of the 32-bit immediate given in the assembly: the "little end" (least significant byte) goes "in" (at the lowest memory address), which is why the byte order looks flipped around from the one given in the assembly (and in the original ActionScript program).

It may not look so sinister, but the above table is actually deceiving you! In the table, all of the instructions are the same number of bytes (5) in length. On x86 CPUs, however, instructions are actually a variable number of bytes in length: instructions can be as small as a single byte, but can get quite long: the nop instruction is just a 0x90 opcode byte with no operands, whereas the movl $0xdeadbeef, 0x12345678(%ebx,%edx,1) instruction is significantly larger. [#]

As a result, when we look at this instruction sequence "crooked" — with a one-byte skew to the address — we decode a totally different sequence of instructions. I'll show you what I mean.

Our instructions in memory look like the following buffer:

static const char buf[] = {
    0xB8, 0xD9, 0xD0, 0x54, 0x3C,
    0x35, 0x58, 0x90, 0x90, 0x3C,
    0x35, 0x6A, 0xF4, 0x59, 0x3C,
    0x35, 0x01, 0xC8, 0x90, 0x3C,
    0x35, 0xD9, 0x30, 0x90, 0x3C,
    0x35, 0x5B, 0x53, 0x53, 0x3C
};

When we load this up in GDB and run the disassemble command, we confirm the instructions present in the above table:

(gdb) disassemble/r buf
Dump of assembler code for function buf:
   0x08048460 <+0>:  b8 d9 d0 54 3c mov    eax,0x3c54d0d9
   0x08048465 <+5>:  35 58 90 90 3c xor    eax,0x3c909058
   0x0804846a <+10>: 35 6a f4 59 3c xor    eax,0x3c59f46a
   0x0804846f <+15>: 35 01 c8 90 3c xor    eax,0x3c90c801
   0x08048474 <+20>: 35 d9 30 90 3c xor    eax,0x3c9030d9
   0x08048479 <+25>: 35 5b 53 53 3c xor    eax,0x3c53535b

But then, if we look at the buffer with a one-byte offset, we see a totally different set of instructions! Note the use of buf+1 as the disassembly target:

(gdb) disassemble/r (buf+1), (buf+sizeof(buf))
Dump of assembler code from 0x8048461 to 0x804847e:
   0x08048461 <buf+1>:       d9 d0  fnop
   0x08048463 <buf+3>:       54     push   esp
   0x08048464 <buf+4>:       3c 35  cmp    al,0x35
   0x08048466 <buf+6>:       58     pop    eax
   0x08048467 <buf+7>:       90     nop
   0x08048468 <buf+8>:       90     nop
   0x08048469 <buf+9>:       3c 35  cmp    al,0x35
   0x0804846b <buf+11>:      6a f4  push   0xfffffff4
   0x0804846d <buf+13>:      59     pop    ecx
   0x0804846e <buf+14>:      3c 35  cmp    al,0x35
   0x08048470 <buf+16>:      01 c8  add    eax,ecx
   0x08048472 <buf+18>:      90     nop
   0x08048473 <buf+19>:      3c 35  cmp    al,0x35
   0x08048475 <buf+21>:      d9 30  fnstenv [eax]
   0x08048477 <buf+23>:      90     nop
   0x08048478 <buf+24>:      3c 35  cmp    al,0x35
   0x0804847a <buf+26>:      5b     pop    ebx
   0x0804847b <buf+27>:      53     push   ebx

If you look down the middle part of the two disassemblies, before the assembly mnemonics, you can read that the bytes are the same from left to right: the first line of the first disassembly goes b8 d9 d0 54 3c, and the second disassembly starts on the second byte of that same sequence with d9 d0 54 3c, straddling multiple instructions. This is the magic of variable length instruction encoding: when you look at an instruction stream a little bit sideways, things can change very drastically.

Yo dawg, I heard you like X86 assembly...

It's not obvious, at first glance, just how clever this technique is.

The goal of the ActionScript code pattern is for the attacker to insert arbitrary bytes into the code stream that the JIT otherwise generates. The attacker then uses these arbitrary bytes as an alternate instruction stream. However, the attacker has to compensate for the non-attacker-controlled bytes that surround its own.

Each 32-bit immediate encoded in the ActionScript program has an MSB of 0x3c — that byte is little-endian encoded and placed, in memory, right before each of the 0x35s that represent the XOR EAX,[imm32] opcode.

Jumping to the 1-byte offset from the base address of the instruction stream starts us off executing 0xd9 0xd0 — a two-byte instruction that runs a no-op on the floating point unit. Both of these bytes were part of the attacker's immediate value: 0x3c54d0d9.

Effectively, the attacker is able to control 4 out of every 5 bytes per instruction in the stream. They are somewhat limited by the bytes fixed in the instruction stream, however — the MSB of each immediate is a 0x3c so that it can successfully combine with the 0x35 from the XOR EAX,[imm32] opcode to create a nop-like instruction, cmp al,0x35, that keeps the stream executing at the 1-byte offset.

It'd be ideal for the attacker if they could find a way to incorporate the 0x35 into an instruction in a useful way, instead of having to lose a byte in order to control it; however, there are lots of fun tricks that you can play to make compact instruction sequences. By making use of the stacky subset of x86 you can get a nice little MISCy program: pushes and pops are nice one-byte instructions that you can split across the semantic nops to simulate moves, and pushing 8-bit signed immediates only takes two bytes, as you can see at buf+11. Dumping your floating point coprocessor state out to the stack is a two byte sequence. Accessing the TEB is a three byte sequence. How can you not love x86?

For this particular code sequence, the attacker only has a 1/5 chance of jumping to an %eip that gives control back to the JIT program — if you land anywhere in the constant-encoded portion, the instruction sequence will be entirely different.

Outstanding issues

So now we know the basic requirements for pulling off a JIT spray attack:

  1. A bug that permits control flow hijacking, so that %eip can be pointed somewhere of the attacker's choosing.

  2. A way to coerce the JIT into emitting attacker-controlled bytes into memory that the JIT itself marks as executable (for example, the long constant expressions shown above).

  3. An address to point %eip at, so that execution lands inside those attacker-controlled bytes at the right (skewed) offset.

But wait, how do you know where to jump?

JIT spray opens up the possibility for an attacker to create a lot of very similar code via the JIT compiler, possibly with nop sled prefixes. As a result, one approach to bypassing both DEP and ASLR is to fill enough of the address space with JIT code that you can jump to a random location and hit an attacker-controlled portion of a valid JIT code buffer with reasonable probability.

But this leads to further questions: What address does the attacker pick to jump to? How much code memory does the attacker spray? Creating a reliable exploit seems significantly more difficult.

Blazakis' solution

In order to create a reliable exploit — as opposed to a probabilistic one — Blazakis used the techniques of pointer inferencing and heap feng shui.

The sandbox makes it particularly tricky to figure out where things live in memory. Those kinds of details definitely aren't supposed to be exposed through the sandbox. If the attacker were able to figure out the locations of things in memory space through the sandbox, it would be considered an information leak.

Pointer inferencing is the technique that Blazakis used to accurately determine the memory location of heapified ActionScript entities in the Flash VM. The inferencing described in Blazakis' paper is cleverly based on the fact that literal integer values in the Flash virtual machine are [♠] hashed alongside heap-object pointers. By observing the default dictionary enumeration order — the order in which keys exist in the hash table — Blazakis was able to narrow down the value of the object pointer to its exact location. [♥]

"Heap feng shui" is the process of understanding the memory allocation behaviors of the sandboxed environment that code is running in, and using that knowledge to place objects in some known locations in memory relative to each other. Blazakis noted that the ActionScript object heap expands in 16MiB increments and took into account the heuristics for executable allocations when loading ActionScript bytecode entities. Blazakis also relied on the usage of VirtualAlloc in the ActionScript memory allocator, with the knowledge that VirtualAlloc maps the first 64KiB aligned hole that's found in a linear scan through the virtual address space.

Blazakis was able to combine these techniques into reliable stage 0 shellcode execution by:

  1. Determining the exact pointer of the first object within a 16MiB heap chunk.

  2. Spraying just enough JIT code to place a JIT code allocation right after that 16MiB chunk.

  3. Determining the JIT spray address to be the object address + 16MiB.

  4. Adding a value like 0x101 to the base address to get an unaligned JIT code location, as described in the JIT spray section above.

  5. Jumping to that resulting address.

Back to the story: the law of large numbers

So, Coleslaw intends to use a multi-step process:

  1. Find bug that permits control flow hijacking

  2. Perform JIT spray

  3. Jump to probabilistic address for stage 0 shellcode

The importance of leaked information about the memory map becomes apparent here: it saves you from having to do a JIT spray and a jump-and-pray. However, given enough visitors, like the five million to Winey-and-Cheesy's giveaway, we have to start calculating expected values. As mitigations are added to lower the probability of success, we can see the expected value of ownage drop as well.

What will become of our anti-hero?

Will Coleslaw be able to use JIT spray to successfully exploit the browser?

Will pesky JavaScript engine hackers join forces with The League of Extraordinary Security Experts in an attempt to foil Coleslaw's plans with epic mitigations?

What will become of the wheel of Gouda? I hear that they don't keep for very long...

All this, and more, next week — same time, same channel — on Honest to a Segfault!

Footnotes

[*]

As forensics conjectured from their IP addresses.

[†]

I got tired of writing "the attacker" all the fricking time, so I decided to endear the attacker to you with a ridiculous name. Alice, Bob, and Eve couldn't make it to the party. Just roll with it.

[‡]

By now you can probably tell that I wrote this entry while I was hungry.

[§]

Overruns are effectively a subset of memory write bugs that are constrained to a single contiguous set of addresses in memory.

[¶]

ActionScript is the close relative of JavaScript that's used in the Adobe platform for programming Adobe Flash applications and the like.

[#]

To rattle off some more fun 1-byte instructions, there are RET (0xC3), LEAVE (0xC9), the INT3 interrupt instruction (0xCC), and push/pop for all registers (0x50 + regno, from Intel reference manual table 3-1).

[♠]

Were? I'm not sure how much things have changed since then.

[♥]

In JavaScript, by contrast, string (object) and integer keys never coexist in the same dictionary-like object. Weak maps require their keys to be objects and don't permit enumeration.