Archive for the ‘Languages’ Category

A prototypal binding trap

Sunday, September 5th, 2010

It always pains me to explain these little identifier resolution traps:

#!/usr/bin/env python3
 
class Egg:
 
    _next_id = 1
 
    def __init__(self):
        self.id = self._next_id
        self._next_id += 1
        assert Egg._next_id is self._next_id
 
 
if __name__ == '__main__':
    f = Egg()

Fails the assertion. It’s decomposing the assignment-update into its constituent tmp = self._next_id + 1; self._next_id = tmp components, but a programmer could reasonably expect a Lookup/Update hash map ADT operation to occur instead — get the slot by lookup, mutate the value found, and store back in an atomic sense, clobbering the class member with the updated value — but that’s not how it works.

This goes with the prototypal territory:

function Egg() {
    this.id = this.next_id;
    this.next_id += 1;
    assertEq(this.next_id, Egg.prototype.next_id);
}
 
Egg.prototype.next_id = 1;
var e = new Egg(); // Error: Assertion failed: got 2, expected 1

Fortunately, note that this behavior definitely has the semantics you want when you add inheritance to the mix. Is the self.viscosity that this class implementation is referring to a class member or an instance member in the base class?

#!/usr/bin/env python3
 
import sauce
 
class AwesomeSauce(sauce.Sauce):
 
    def __init__(self):
        super().__init__()
        self.viscosity += 12 # Deca-viscoses per milliliter.

The answer is that you don’t care — it’s being rebound on the AwesomeSauce instance no matter what.

Moral of the story is to be mindful when using update operations on class members — if you want to rebind a class member within an instance method, you’ve got to use the class name instead of self.

Coding style as a feature of language design

Thursday, July 29th, 2010

roc recently posted a thought-provoking entry titled, "Coding Style as a Failure of Language Design", in which he states:

Languages already make rules about syntax that are somewhat arbitrary. Projects imposing additional syntax restrictions indicate that the language did not constrain the syntax enough; if the language syntax was sufficiently constrained, projects would not feel the need to do it. Syntax would be uniform within and across projects, and developers would not need to learn multiple variants of the same language.

I totally agree with roc’s point that there is overhead in learning-and-conforming-to local style guidelines. I also agree that this overhead is unnecessary and that language implementers should find ways to eliminate it; however, I think that imposing additional arbitrary constraints on the syntax is heading in the wrong direction.

Your language’s execution engine [*] already has a method of normalizing crazy styles: it forms an abstract syntax tree. Before the abstract syntax tree (AST) is mutated [†] it is in perfect correspondence with the original source text, modulo the infinite number of possible formatting preferences. This is the necessary set of constraints on the syntax that can actually result in your program being executed as it is written. [‡]

So, why don’t we just lug that thing around instead of the source text itself?

The dream

The feature that languages should offer is a mux/demux service: mux an infinite number of formatting preferences into an AST (via a traditional parser); demux the AST into source text via an AST-decompiler, parameterized by an arbitrarily large set of formatting options. Language implementations could ship with a pair of standalone binaries. Seriously, the reference language implementation should understand its own formatting parameters at least as well as Eclipse does. [§]

Once you have the demux tool, you run it on your AST files as a post-checkout hook in your revision control system for instant style personalization. If the engine accepts the AST directly as input, you would only need to demux the files you planned to work on — if the engine accepted an AST directly as input in lieu of source text, this could even be an optimization.

Different execution engines are likely to use different ASTs, but there should be little problem with composability: checked-in AST goes through standalone demux with an arbitrary set of preferences, then through the alternate compiler’s mux. So long as the engines have the same language grammar for the source text, everybody’s happy, and you don’t have to waste time writing silly AST-to-AST-prime transforms.

In this model, linters are just composable AST observers/transforms that have no ordering dependencies. You could even offer a service for simple grammatical extensions without going so far as language level support. Want a block-end delimiter in the Python code you look at? [¶] Why not, just use a transform to rip it out before it leaves the front-end of the execution engine.

Reality

Of course, the set of languages we know and love has some overlap with the set of languages that totally suck to parse, whether due to preprocessors or context sensitivity or the desire to parse poems, but I would bet good money that there are solutions for such languages. In any case, the symmetric difference between those two sets could get with it, and new languages would be kind to follow suit. It would certainly be an interesting post-FF4 experiment for SpiderMonkey, as we’ve got a plan on file to clean up the parser interfaces for an intriguing ECMAScript strawman proposal anywho.

Footnotes

[*] Interpreter, compiler, translator, whatever.
[†] To do constant folding or what have you.
[‡] Oh yeah, and comments. We would have to keep those around too. They’re easy enough to throw away during the first pass over the AST.
[§] Even more ideal, you’d move all of that formatting and autocompletion code out of IDEs into a language service API.
[¶] Presumably because you despise all that is good and righteous in the world? ;-)

Notes from the JS pit: lofty goals, humble beginnings

Saturday, May 1st, 2010

I’ve been working at Mozilla for about two months on the JavaScript (JS) engine team, the members of which sit in an area affectionately known as the "JS pit".

Mozillians appear to try to blog on a regular basis, so I’ll be starting a series of entries prefixed "notes from the JS pit" to explain what I’ve been working on and/or thinking about.

Notably, I feel fortunate to work for a company that encourages this kind of openness.

Goals

I always feel stupid writing down goals — they seem so self-evident; however, it helps to put the work I’m doing into perspective and gives me something that I can to refer back to.

I’m also a big believer in the effectiveness of public accountability, so publishing those goals seems prudent — my notes from the JS pit are poised to help me stay motivated more than anything else.

My goals are to:

  • Do my part to meet and exceed performance targets in the JS engine. It’s an exciting time with lots of healthy competition to motivate this.
  • Immerse myself in language design concepts and seek out important implementation details, even if those details don’t pertain to something that I’m working on directly. This also entails some independent research and small projects to gain better understanding of foreign language concepts.
  • Deliberately take time to think constructively about alternative or innovative ways to accomplish our language goals. Taking things at face value as "the way we’ve always done it" is the way that innovative things get overlooked — this is relatively easy to do with a pair of fresh eyes, but similarly difficult because you’re the team noob.

Working with compiler engineers from diverse language backgrounds, it’s prime time for sucking knowledge out of people’s heads, comparing and contrasting it, and listening to them argue with each other. Heck, just look at the concepts behind JS: an imperative, Scheme-inspired, prototypal, C-and-Java-syntax conforming language that’s irrevocably tied to a practical platform, the web. It’s bound to be a fun ride.

From start to present

I started off implementing the simpler opcodes for JaegerMonkey (JM) and getting an understanding the JM code base. Not too long into it, I was told that looking into quick-and-easy parser optimizations was a priority — somebody had reported that a significant fraction of the GMail load bar time could be attributed to JS parsing. [*]

Now, JavaScript isn’t the easiest language in the world to parse; for example, automatic semicolon insertion creates some non-traditional obstacles for generated shift/reduce parsers [†] — it effectively makes an error correction algorithm part of the normal parse procedure. The details are for another entry, but suffice it to say that our recursive descent parser code gets complex, especially due to our E4X support and some of the static analyses we perform for optimizations before bytecode emission.

In pursuing JS parser optimization I assembled a suite of parsing benchmarks from sites on the web with "large" JS payloads — I call this suite parsemark. After getting some speedups from simple inlining, I attempted a somewhat fundamental change to the parser to reduce the number of branch mispredictions, in converting it to always have a token "pre-lexed" as opposed to the prior "lex-on-demand" model. Roughly, this required adding a "there’s always a lexed token" invariant to the lexer and hoisting lexer calls/modesets from substitution rules into their referring nonterminals in the parser. The details for this are also entry fodder. Sadly, it demonstrated negligible performance gains for the increase in complexity. Sure taught me a lot about our parser, though.

The biggest performance win was obtained through a basic fix to our parser arena-allocation chunk sizing. sayrer noticed that a surprising amount of time was being spent in kernel space, so we tracked the issue down. It was frustrating to work for a few weeks on a fundamental change and then realize that multiplying a constant by four can get you a 20% parsing speedup, but I’ve certainly learned to look a lot more closely at the vmem subsystem when things are slow. I have some speedup statistics and a comparison to V8 (with all its lazy parsing and parse-caching bits ripped out), but I don’t have much faith that my environment hasn’t changed in the course of all the historical data measurements — writing a script to verify speedup over changesets seems like a viable option for future notes.

In the miscellany department, I’ve been trying to do a good amount of work fixing broken windows via cleanup patches. I’m finding it difficult to strike a balance here, since there’s a lot of modularity-breaking interdependencies in the code base — what appear to be simple cleanups tend to unravel into large patches that get stale easily. However, cleanup does force you to read through the code you’re modifying, which is always good when you’re learning a new code base.

Looking back on it, it doesn’t seem like a lot of work; of course, my hope is that the time I spend up-front getting accustomed to the codebase will let me make progress on my goals more rapidly.

Stay tuned for more JS pittage — unpredictable time, but predictable channel.

Footnotes

[*] To date, I haven’t looked into this myself. Ideally, I should have verified it before starting on the parser work, but I was eager to start working on things rather than investigate the reasons behind them.
[†] Though I’ve seen that Webkit’s JavaScriptCore uses Bison — I’m going to have to check out that implementation at some point.

Code ☃ Unicode

Thursday, April 1st, 2010

Let’s come to terms: angle brackets and forward slashes are overloaded. Between relational operators, templates, bitwise shift operators, XML tags, (HTML/squiggly brace language) comments, division, regular expressions, and path separators, what don’t they do?

I think it’s clear to everyone that XML is the best and most human readable markup format ever conceived (for all data serialization and database backing store applications without exception), so it’s time for all that crufty old junk from yesteryear to learn its place. Widely adopted web standards (such as Binary XML and E4X) and well specified information exchange protocols (such as SOAP) speak for themselves through the synergy they’ve utilized in enterprise compute environments.

The results of a confidential survey I conducted conclusively demonstrate beyond any possibility of refutation that you type more angle brackets in an average markup document than you will type angle-bracket relational operators for the next ten years.

In conclusion, your life expectancy decreases as you continue to use the less-than operator and forward slash instead of accepting XML into your heart as a first-class syntax. I understand that some may not enjoy life or the pursuit of happiness and that they will continue to use deprecated syntaxes. To each their own.

I have contributed a JavaScript parser patch to rectify the situation: the ☃ operator is a heart-warming replacement for the (now XML-exclusive) pointy-on-the-left angle bracket and the commonly seen tilde diaeresis ⍨ replaces slash for delimiting regular expressions. I am confident this patch will achieve swift adoption, as it decreases the context sensitivity of the JavaScript parser, which is a clear and direct benefit for browser end users.

The (intolerably whitespace-sensitive) Python programming language nearly came to a similar conclusion to use unicode more pervasively, while simultaneously making it a real programming language by way of the use of types, but did not have enough garments to see it through.

Another interesting benefit: because JavaScript files may be UTF-16 encoded, this increases the utilization of bytes in the source text by filling the upper octets with non-zero values. This, in the aggregate, will increase the meaningful bandwidth utilization of the Internet as a whole.

Of course, I’d also recommend that C++ solve its nested template delimiter issue with ☃ and ☼ to close instead of increasing the context-sensitivity of the parser. [*] It clearly follows the logical flow of start/end delimiting.

As soon as Emoji are accepted as proper unicode code points, I will revise my recommendation to suggest using the standard poo emoticon for a template start delimiter, because increased giggling is demonstrated to reduce the likelihood of head-and-wall involved injuries during C++ compilation, second only to regular use of head protection while programming.

Footnotes

[*] Which provides a direct detriment to the end user — optimizing compilers spend most of their time in the parser.

Two postfix operations redux: sequence points

Sunday, January 10th, 2010

Get ready for some serious language lawyering.

I was going back and converting my old entries to reStructuredText when I found an entry in which I was wrong! (Shocking, I know.)

C

Stupid old me didn’t know about sequence points back in 2007: the effects of the ++ operator in the C expression i++ * i++ are in an indeterminate state of side-effect completion until one of the language-defined sequence points is encountered (i.e. a semicolon or function invocation).

From the C99 standard 6.5.4.2 item 2 regarding the postfix increment and decrement operators:

The result of the postfix ++ operator is the value of the operand. After the result is obtained, the value of the operand is incremented. The side effect of updating the stored value of the operand shall occur between the previous and the next sequence point.

Therefore, the compiler is totally at liberty to interpret that expression as:

mov lhs_result, i     ; Copy the values of the postincrement evaluation.
mov rhs_result, i     ; (Which is the original value of i.)
mul result, lhs_result, rhs_result
add i, lhs_result, 1
add i, rhs_result, 1  ; Second increment clobbers with the same value!

This results in the same result as the GCC compilation in the referenced entry: i is 12 and the result is 121.

As I mentioned before, the reason this can occur is that nothing in the syntax forces the first postincrement to be evaluated before the second one. To give an analogy to concurrency constructs: you have a kind of compile-time "race condition" in your syntax between the two postincrements that could be solved with a sequence point "barrier". [*]

In this assembly, those adds can float anywhere they like after their corresponding mov instruction and can operate directly on i instead of the temporary if they’d prefer. Here’s an possible sequence that results in a value of 132 and i as 13.

mov lhs_result, i ; Gets the original 11.
inc i             ; Increment in-place after the start value is copied.
mov rhs_result, i ; Gets the new value 12.
inc i             ; Increment occurs in-place again, making 13.
mul result, lhs_result, rhs_result

Even if you know what you’re doing, mixing two postfix operations, or any side effect, using the less obvious sequence points (like function invocation) is dangerous and easy to get wrong. Clearly it is not a best practice. [†]

Java

The postincrement operation appears to have sequence-point-like semantics in the Java language through experimentation, and it does! From the Java language specification (page 416):

The Java programming language also guarantees that every operand of an operator (except the conditional operators &&, ||, and ? :) appears to be fully evaluated before any part of the operation itself is performed.

Which combines with the definition of the postfix increment expression (page 485):

A postfix expression followed by a ++ operator is a postfix increment expression.

As well as left-to-right expression evaluation (page 415):

The left-hand operand of a binary operator appears to be fully evaluated before any part of the right-hand operand is evaluated.

To a definitive conclusion that i++ * i++ will always result in 132 == 11 * 12 and i == 13 when i == 11 to start.

Python

Python has no increment operators specifically so you don’t have to deal with this kind of nonsense.

>>> count = 0
>>> count++
  File "<stdin>", line 1
    count++
          ^
SyntaxError: invalid syntax

Annoyingly for newbies, though, it looks like ++count is a valid expression that happens to look like preincrement.

>>> count = 0
>>> ++count
0
>>> --count
0

They’re actually two unary positive and negative operators, respectively. Just one of the hazards of a context free grammar, I suppose.

Footnotes

[*] I threw this in because the ordeal reminds me of the classic bank account concurrency problem. If it’s more confusing than descriptive, please ignore it. :-)
[†]

Since function invocation defines sequence points, I thought this code sequence guaranteed those results:

#include <stdio.h>
 
int identity(int value) { return value; }
 
int main() {
        int i = 11;
        printf("%d\n", identity(i++) * identity(i++));
        printf("%d\n", i);
        return 0;
}

As Dan points out, the order of evaluation is totally unspecified — the left hand and right hand subexpression can potentially be evaluated concurrently.