December 8, 2011

SpiderMonkey bubbles OOM retvals

Handling Out-of-Memory (OOM) conditions gracefully in a large-scale application is notoriously difficult. [*] You generally have to detect that you're OOMing at an arbitrary allocation site, then, while carefully avoiding any further allocation, delegate to the part of your application that can recover or report what happened to the user, all without allocating any memory.

Did I mention that, once you've OOM'd, you really shouldn't allocate any memory?

Recovery, so far as I know, means flushing out any expendable caches or trying to coax the operating system to page stuff out to disk.

Of course, malloc may only indicate this OOM-like scenario (by returning NULL) if you're "lucky". If you're unlucky, on overcommit systems you can get taken out by the OOM killer, dispensing its unique brand of indiscriminate justice. [†] Or, you can slow to a brain-numbing crawl as your memory is relegated to swap and the bully operating system chants, "Stop DoSing yourself! Stop DoSing yourself!"

In any case, a NULL return value from malloc must not be propagated down the "successful allocation" paths, in order to avoid potentially-exploitable NULL pointer dereference vulnerabilities.

SpiderMonkey

In SpiderMonkey we check for error conditions, including OOM conditions, everywhere. (C++ exceptions were not attempted in the Mozilla code base, due to performance issues with the Windows ABI.) As a result, most functions in SpiderMonkey are fallible. A typical signature looks something like:

bool
DoSomeStuff(JSContext *cx)
{
    MonkeyCage *cage = cx->new_<MonkeyCage>();
    if (!cage)
        return false; // OOM: the error has already been noted on cx; bubble up.

    // ...put monkeys in the cage, or something...

    return true;
}

A bool is returned in order to indicate success or failure. cx is the execution context, which is a fancy way of saying, "An object that you thread through all your engine functions because it holds references to junk you're going to need."

One thing you're definitely going to need is allocation functionality. You can get at allocation functionality through helper methods like cx->malloc_() and cx->new_<>() — these do some accounting (how many outstanding bytes have been malloc'd on this context?) and, if you run out of memory, flush some GC stuff to see if there's free space to be had.
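
As a rough sketch (the helper and function names here are approximations for illustration, not the actual implementation), the flow is something like:

void *
JSContext::malloc_(size_t nbytes)
{
    updateMallocCounter(nbytes);        // Accounting: outstanding bytes on this context.
    void *p = ::malloc(nbytes);
    if (!p) {
        // Last-ditch effort: flush some GC-owned memory, then retry once.
        runLastDitchGC();
        p = ::malloc(nbytes);
        if (!p)
            js_ReportOutOfMemory(this); // Note the OOM condition on the context.
    }
    return p;
}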

When an error occurs while you DoSomeStuff, you set "exception" details (such as a JavaScript exception object or an "I've hit an OOM" flag) on the context. Then, you return false to your caller. And, if it doesn't know how to handle the exception, that caller returns false to its caller.

This transitive "bubbling" of a false return value to callers unwinds the C++ stack — RAII objects with destructors release any resources they had acquired — and permits callers who understand the error to handle it appropriately. In this sense, even out-of-memory errors are "catchable" by native code.
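
For illustration, a caller that knows how to handle certain errors might look something like this (a hedged sketch: JS_IsExceptionPending and JS_ClearPendingException are real JSAPI calls, but the recovery logic is invented):

bool
DoStuffAndRecover(JSContext *cx)
{
    if (!DoSomeStuff(cx)) {
        // A pending JavaScript exception we understand? Handle it and continue.
        if (JS_IsExceptionPending(cx)) {
            // ...inspect the pending exception, decide we can recover...
            JS_ClearPendingException(cx);
            return true;
        }
        // Otherwise (e.g. an OOM flag), keep bubbling false to our caller.
        return false;
    }
    return true;
}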

At the outer membrane of the engine is the SpiderMonkey API, called the JSAPI. The functions in the JSAPI reflect this same fallibility: most of the API functions also return a boolean value and take a context which can be used to report an error.
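
For example, evaluating a script through the JSAPI follows the same pattern (a sketch based on the 2011-era API, assuming cx and a global object are already set up; exact signatures vary across versions):

const char *src = "6 * 7";
jsval rval;
if (!JS_EvaluateScript(cx, global, src, strlen(src),
                       "<inline>", 1, &rval)) {
    // false: an error (possibly an OOM) was reported on cx.
    return false;
}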

V8

By contrast, V8 is able to use the following construct to avoid a similar OOM check-and-bubble guideline:

void* Malloced::New(size_t size) {
  void* result = malloc(size);
  if (result == NULL) {
    v8::internal::FatalProcessOutOfMemory("Malloced operator new");
  }
  return result;
}

FatalProcessOutOfMemory calls a fatal-error callback and then calls abort(). It doesn't look like the callback is capable of doing recovery and continuing execution, but I'm new to the codebase, so I'm not totally sure.

With process isolation, calling abort() when you run out of memory is somewhat excusable. If we did this in Firefox, which does not have process-per-tab, one page that hit such an OOM would take out the whole program. In Chrome, however, an OOM handled this way takes out only the current process-isolated tab group. When you have lots of tabs open in Chrome, a sizable number of tabs may be muxed onto a single process, in which case this might be considered bad user-level behavior; but it works well in the more common case of a small number of tabs.

Ignoring opinions on whether aborting a tab group is somehow "worse" than alternatives, I have to say that not explicitly checking for OOMs is nice. Writing, maintaining, and validating the correctness of seldom-taken but abundantly-existent OOM paths is a tax on development.

"Infallible" allocation

Allocations can be bucketed into two simple categories:

1. Fixed-size, well-understood allocations that the engine requests for its own internal purposes; and,
2. Allocations whose size is determined by (potentially malicious) user-controlled content.

Allocations in the former category should generally succeed, because they're well-understood known quantities. If you've attempted allocation recovery procedures and a call to malloc(sizeof(double)) still fails, life is tough: you're up against a pretty serious memory limit, and it's hard to do much more than display a pre-canned "Sorry, everything is hosed!" dialog. Unless your process is architected such that you can cleave off and deallocate a large amount of currently-used (and otherwise unreferenced) memory at the user's behest, with zero memory overhead, you don't have very good options.

By contrast, an allocation in the latter category might fail simply because the user-controlled content is sufficiently weird, malformed, or malicious.

To avoid the tax I mentioned in the last section, the Mozilla Gecko engine (outside of SpiderMonkey) has been moving to a strategy called "infallible malloc" for the former category of allocations. If a well-understood allocation of a reasonable and generally fixed size fails, it will first attempt memory recovery procedures [‡] and, failing that, will cause the process to abort. With this scheme, you avoid the development tax of checking for OOM conditions.
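
The pattern is roughly the following (a sketch: Gecko's actual entry points are the moz_xmalloc family, and the recovery hook named here is hypothetical):

void *
infallible_malloc(size_t nbytes)
{
    void *p = malloc(nbytes);
    if (!p) {
        AttemptMemoryRecovery();  // E.g. flush expendable caches.
        p = malloc(nbytes);
        if (!p)
            abort();              // No fallible path for callers to maintain.
    }
    return p;                     // Callers never need to null-check.
}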

Back to SpiderMonkey

So, is there practical benefit to bubbling an OOM through the whole engine to the API caller?

Currently, yes. Ultimately, probably not.

At the moment, both of the categories of allocation I mentioned are using the same mechanism, so hitting a content-induced OOM (second category) in SpiderMonkey will bubble the issue up to Gecko, which will be able to report that error and proceed as normal. In the future, however, there's no essential reason to bubble the issue up to Gecko: we could just use infallible malloc from SpiderMonkey directly.

SpiderMonkey already supports specification of a custom allocator at compile time, so we would just need to incrementally identify known- and fixed-size allocations and port them to use infallible malloc. That way, instead of bubbling up a return code from SpiderMonkey to ultimately cause an abort() from infallible malloc in Gecko, we could cause the abort() from SpiderMonkey's infallible malloc call directly, and avoid the OOM development tax.
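
Hypothetically, porting a known, fixed-size allocation would collapse the check-and-bubble dance (infallibleNew_ is a made-up name, just to show the shape of the change):

// Before: fallible; the caller must check and bubble.
MonkeyCage *cage = cx->new_<MonkeyCage>();
if (!cage)
    return false;

// After: an OOM (post-recovery) aborts the process instead of
// returning NULL, so there is no failure path to maintain.
MonkeyCage *cage = cx->infallibleNew_<MonkeyCage>();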

Footnotes

[*]

With apologies to Erik Corry for misunderstanding how V8 handled OOMs in my recent reading!

[†]

This could occur due to a legitimate dereference of a memory chunk that malloc cheerily gave you — it just never happened to be backed by physical memory! Similarly, on a system where multiple applications are running simultaneously, any one of them is a potential candidate for OOM killing, so your process may get OOM killed because a different process legitimately dereferenced its overcommitted memory.

[‡]

I've been informed that this is not currently enabled, but is under active development by the MemShrink team.

Reviews redux

"Whoa, Billy reviewed a one-meg patch to the hairiest part of the codebase in just two hours!" [*]

It's pretty easy to identify what's wrong with that sentence. The speed of a review is not an achievement. Billy could literally have just checked the "yes, I reviewed it" button without looking at the patch.

... but an empty review looks pretty bad, especially as the size of the patch grows. So maybe Billy padded it out by identifying two hours' worth of style nits and asking for a few comments here and there. In any case, the code quality is no more assured after the review than before it.

Conventional wisdom is that it's economically prudent to do good code reviews: finding defects early incurs the lowest cost, review exerts a "peer pressure" motivation towards quality improvement, and review creates knowledge redundancy that mitigates the bus effect. In the research literature on code review, effectiveness is typically measured as "defects found per KLoC". [†] However, this dimension ignores the element of "time per review": I'm going to argue that the time it takes to give a good review varies with the complexity and size of the modifications.

Now, one can argue that, if Billy does anything more than ignorantly checking the little "I've reviewed this" box, he has the potential to add value. After all, code isn't going to be defect-free when it comes out of review, so we're just talking about a difference in degree. If we assume that truly obscure or systematic bugs won't jump out from a diff, what additional value is Billy really providing by taking a long time?

This is where it gets tricky. I think the reason that folks can have trouble deciding how long reviews should take is that we don't know what a review really entails. When I request that somebody review my patch, what will they try to suss out? What kind of code quality (in terms of functional correctness and safety) is actually being assured at the component level, across all reviewed code?

If you can't say that your reviews ensure some generally understood level of code quality (i.e. certain issues have definitively been considered), it's hard to say that you're using reviews as an effective tool.

Aside: even with clear expectations for the code review process, each party has to exercise some discipline and avoid the temptation to lean on the other party. For mental framing purposes, it's a defect-finding game in which you're adversaries: the developer wants to post a patch with as few defects as possible and the reviewer wants to find as many defects as they possibly can within a reasonable window of time.

A few best practices

From the research I've read on code review, these are two simple things that are supposed to increase defect-finding effectiveness:

Scan, then dig.

Do a preliminary pass to get the gist of how it's structured and what it's doing. Note down anything that looks fishy at a glance. Once you finish your scan, then do another pass that digs into all the corner cases you can think of and inspects each line thoroughly.

Keep checklists.

One checklist for self-reviews and one checklist for reviews of everybody else's stuff. I've seen it recommended that you scan through the code once for every checklist item to do a truly thorough review.

The self-review checklist is important because you tend to repeat the same mistakes until you've learned them cold. When you introduce a defect and it gets caught, figure out where it fits into your list and make a mental and/or physical note of the example, or add it as a new category.

Having a communal checklist can also be helpful for identifying group pain points. "Everybody screws up GC-rooting JSString-derived chars sometimes," is easily codified in a communal checklist document that the whole team can reference. In addition, this document helps newcomers avoid potential pitfalls and points out areas of the code that could generally be more usable / less error prone.

Here's another nice summary of more effective practices.

I'm personally of the opinion that, if you find something that you think is defective, you should try to write a test that demonstrates it. The beneficial outcomes of this are:

1. If the test fails, you've concretely demonstrated the defect rather than speculating about it in a review comment.
2. If the test passes, you've spared the author a false alarm and learned something about the code's actual behavior.
3. Either way, the test can be contributed to the suite so the issue stays covered going forward.

I think that, in an ideal situation, there are also linter tools in place to avoid style nits altogether: besides eliminating nits masquerading as legitimate review comments, automatically enforced stylistic consistency is nice in its own right.

Footnotes

[*]

Just in case you were wondering, Billy is not an actual person. I think I started using Billy as my hypothetical example person's name after I saw this fairly amusing video.

[†]

In the literature almost every substantive comment is grouped under the term "defect". This includes things like design decisions and suggested factorings. In the same sense that finding a behavioral error early has benefit, finding these issues early helps improve the overall quality of the product going forward.

Too smart, doesn't get quite so many things done?

We care about our craft. We're totally smart and get things done. No question.

But "smart and gets things done" has to have some kind of spectrum associated with it, right? There's at least a "smart" dimension and a "gets things done" dimension.

An easy question to ask is, "Am I overthinking?" (This is especially easy to ask if you're overthinking.)

We often quibble about how to get things done better [*] in terms of practicalities, but it often feels like people who ignore the long tail of practicalities achieve greatness with the least effort.

If you had to pick one, would it be better to over-think or to over-do?

(My advice: don't think about it too much.)

Footnotes

[*]

In some asymptotic sense of better.

What if MoCo were hit by a bus?

I went to my first Hybrid Factory event, which was kind of like a mini-workshop, with some of my teammates this past evening. The event, titled "How to effectively work as a Tech Lead," was given by Derek Parham.

One of the main topics was delegation: the audience interaction portion of the event asked us to consider how we would approach delegating all of our current tasks to other people on our teams.

During this exercise, sstangl brought up something profound: rather than delegating to other teammates employed by the Mozilla Corporation (AKA MoCo), how much of our current task load could we delegate to the community members outside of MoCo? These are the people who voluntarily devote their time and effort towards advancing the Mozilla project.

Of course, the talk also covered the classic "bus test," which asks, "If [person X] were hit by a bus, how would we continue to function?" It wasn't a big leap to my asking, "If all of MoCo were hit by a bus, how well situated is the community to carry on our outstanding tasks and projects?"

Like all fun hypotheticals, it's far fetched and a bit apocalyptic, but it forces you to think about your team's work and coordination methods in a very different light.

I suppose a related, follow-up question is: if the Mozilla organization is designed to empower a worldwide community, but we couldn't survive a MoCo bus scenario, then are we managing the project in a sustainable way?

Maybe people who oversee the project as a whole (and those who are more familiar with the philosophy behind our governance) have a definitive answer. In any case, it's interesting food for thought.

Collaboration and concentration

The latest in the "people problems are hard; I'm glad I just program computers" set of thought processes. If you have thought-provoking anecdotal data points, your comments are welcome!

There's a confounding and contentious dynamic in programming between collaboration and concentration. This goes beyond just individual and team efficiency — it's also about finding a balance that lets you enjoy and grow your craft.

There are certainly times that you want to hunker down in an LCD-monitor-adorned sensory deprivation chamber and do some badass debugging, find all the corner cases in that nasty algorithm, or suss out all the invariants in a complicated system.

But, on the other hand, there are certainly times that you need to write code during a flurry of peer interaction. If you're ramping up on a new system, the thirty seconds it takes you to ask another human being a question and receive a solid, well-informed answer usually trumps the hours you'd spend reaching the "same" conclusion. With less certainty. Oh, and you missed the trap everybody else already knows about, so you actually figured it out wrong.

The extent of the dichotomy is clearly not limited to these examples. It's easy to think of times you've had to pepper somebody with questions as well as times you wanted to hole up in a cave to get some code done. For my systems programming experience I'd hazard a guess that it's around an 80/20 split between concentration and collaboration.

So, it's interesting to ponder: if you were building a team, how would you structure the work space or work activities to enable collaboration when it was necessary, but enable concentration in the common case?

I suppose it's doubly difficult because you also want your team to feel empowered — capable of seeking out collaborative help when they need it in order to get things done and make progress, but also free to cordon themselves off and have at it.

I can pretty easily find research that involves hospital workers in the early 1990s, but I have to imagine that a team of brilliant systems programmers is a different beast altogether.