Coding style as a feature of language design

roc recently posted a thought-provoking entry titled, "Coding Style as a Failure of Language Design", in which he states:

Languages already make rules about syntax that are somewhat arbitrary. Projects imposing additional syntax restrictions indicate that the language did not constrain the syntax enough; if the language syntax was sufficiently constrained, projects would not feel the need to do it. Syntax would be uniform within and across projects, and developers would not need to learn multiple variants of the same language.

I totally agree with roc’s point that there is overhead in learning-and-conforming-to local style guidelines. I also agree that this overhead is unnecessary and that language implementers should find ways to eliminate it; however, I think that imposing additional arbitrary constraints on the syntax is heading in the wrong direction.

Your language’s execution engine [*] already has a method of normalizing crazy styles: it forms an abstract syntax tree. Before the abstract syntax tree (AST) is mutated [†] it is in perfect correspondence with the original source text, modulo the infinite number of possible formatting preferences. This is the necessary set of constraints on the syntax that can actually result in your program being executed as it is written. [‡]

So, why don’t we just lug that thing around instead of the source text itself?

The dream

The feature that languages should offer is a mux/demux service: mux an infinite number of formatting preferences into an AST (via a traditional parser); demux the AST into source text via an AST-decompiler, parameterized by an arbitrarily large set of formatting options. Language implementations could ship with a pair of standalone binaries. Seriously, the reference language implementation should understand its own formatting parameters at least as well as Eclipse does. [§]
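
To make the mux/demux idea concrete, here is a minimal sketch in Python. It leans on the standard `ast` module for the mux half. The `Style` options and the `demux` function are made up for illustration, and they only cover names, constants, calls with positional arguments, and the four basic arithmetic operators. It is nowhere near a real formatter, just the shape of one.

```python
import ast
from dataclasses import dataclass


@dataclass
class Style:
    """Hypothetical formatting preferences; a real tool would have many more."""
    spaces_around_operators: bool = True
    space_after_comma: bool = True


_OPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*", ast.Div: "/"}


def mux(source: str) -> ast.AST:
    """Parse any formatting of an expression into the same AST."""
    return ast.parse(source, mode="eval")


def demux(node: ast.AST, style: Style) -> str:
    """Render the AST back to text according to `style` (tiny subset only)."""
    if isinstance(node, ast.Expression):
        return demux(node.body, style)
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Constant):
        return repr(node.value)
    if isinstance(node, ast.BinOp):
        sep = " " if style.spaces_around_operators else ""
        op = _OPS[type(node.op)]
        # Always parenthesize so precedence survives the round trip.
        return f"({demux(node.left, style)}{sep}{op}{sep}{demux(node.right, style)})"
    if isinstance(node, ast.Call):
        if node.keywords:
            raise NotImplementedError("keyword arguments are outside this sketch")
        comma = ", " if style.space_after_comma else ","
        args = comma.join(demux(arg, style) for arg in node.args)
        return f"{demux(node.func, style)}({args})"
    raise NotImplementedError(type(node).__name__)
```

Feeding `mux` either `f(a,b) +  c*2` or `f( a , b )+c * 2` gives the same tree, and each reader can then call `demux` with whatever `Style` they prefer.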

Once you have the demux tool, you run it on your AST files as a post-checkout hook in your revision control system for instant style personalization. If the engine accepted an AST directly as input in lieu of source text, you would only need to demux the files you planned to work on; this could even be an optimization.
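
As a rough sketch of what such a hook might call, the hypothetical `demux_tree` function below regenerates every Python file under a directory from its AST. Two loud assumptions: Python has no standard on-disk AST format, so this stand-in re-parses checked-in source text rather than reading stored ASTs, and `ast.unparse` (Python 3.9+) knows only one fixed style and discards comments, so it is a placeholder for a properly parameterized demux.

```python
import ast
import pathlib


def demux_tree(root, render=ast.unparse):
    """Rewrite every .py file under `root` from its AST using `render`.

    `render` stands in for a demux tool configured with personal
    preferences; the default, ast.unparse, drops comments.
    """
    rewritten = []
    for path in sorted(pathlib.Path(root).rglob("*.py")):
        tree = ast.parse(path.read_text(encoding="utf-8"))
        path.write_text(render(tree) + "\n", encoding="utf-8")
        rewritten.append(path)
    return rewritten
```

A post-checkout hook would then be a small script that calls `demux_tree` on the working copy, passing in a `render` function built from the developer's own preferences.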

Different execution engines are likely to use different ASTs, but there should be little problem with composability: checked-in AST goes through standalone demux with an arbitrary set of preferences, then through the alternate compiler’s mux. So long as the engines have the same language grammar for the source text, everybody’s happy, and you don’t have to waste time writing silly AST-to-AST-prime transforms.

In this model, linters are just composable AST observers/transforms that have no ordering dependencies. You could even offer a service for simple grammatical extensions without going so far as language level support. Want a block-end delimiter in the Python code you look at? [¶] Why not, just use a transform to rip it out before it leaves the front-end of the execution engine.
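
Here is a small sketch of linters as independent AST observers, again using Python's `ast` module. The two linter classes and the `run_linters` helper are invented for this example. Each one only reads the tree, so it should not matter which order they run in.

```python
import ast


class BareExceptLinter(ast.NodeVisitor):
    """Flags `except:` clauses that name no exception type."""

    def __init__(self):
        self.findings = []

    def visit_ExceptHandler(self, node):
        if node.type is None:
            self.findings.append((node.lineno, "bare except"))
        self.generic_visit(node)


class MutableDefaultLinter(ast.NodeVisitor):
    """Flags list/dict/set literals used as default argument values."""

    def __init__(self):
        self.findings = []

    def visit_FunctionDef(self, node):
        # kw_defaults may contain None entries; isinstance skips them.
        for default in node.args.defaults + node.args.kw_defaults:
            if isinstance(default, (ast.List, ast.Dict, ast.Set)):
                self.findings.append((node.lineno, "mutable default argument"))
        self.generic_visit(node)


def run_linters(source, linter_classes):
    """Parse once, then let each observer walk the same read-only tree."""
    tree = ast.parse(source)
    findings = []
    for linter_class in linter_classes:
        linter = linter_class()
        linter.visit(tree)
        findings.extend(linter.findings)
    return sorted(findings)
```

Because neither observer mutates the tree, `run_linters(src, [A, B])` and `run_linters(src, [B, A])` report the same findings.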

Reality

Of course, the set of languages we know and love has some overlap with the set of languages that totally suck to parse, whether due to preprocessors or context sensitivity or the desire to parse poems, but I would bet good money that there are solutions for such languages. In any case, the symmetric difference between those two sets could get with it, and new languages would be kind to follow suit. It would certainly be an interesting post-FF4 experiment for SpiderMonkey, as we’ve got a plan on file to clean up the parser interfaces for an intriguing ECMAScript strawman proposal anywho.

Footnotes

[*] Interpreter, compiler, translator, whatever.
[†] To do constant folding or what have you.
[‡] Oh yeah, and comments. We would have to keep those around too. They’re easy enough to throw away during the first pass over the AST.
[§] Even more ideal, you’d move all of that formatting and autocompletion code out of IDEs into a language service API.
[¶] Presumably because you despise all that is good and righteous in the world? ;-)


11 Responses to “Coding style as a feature of language design”

  1. Tweets that mention Honest to a Segfault» Blog Archive » Coding style as a feature of language design -- Topsy.com Says:

    [...] This post was mentioned on Twitter by Planet Mozilla and Planet Repeater, Chris Leary. Chris Leary said: Posted: Coding style as a feature of language design http://bit.ly/clKn0Z [...]

  2. Sebastian Redl Says:

    It’s a nice idea, but you would have to reinvent a good part of the current programming infrastructure. Textual diffs for RCS, for example, become pretty meaningless.

  3. cdleary Says:

    @Sebastian Redl: You can still use textual diffs (so long as you weren’t using grammatical extensions), but I’m sure people would prefer some form of AST diff so they could format to their liking. Line numbers become less significant, but it seems like you could somehow substitute AST node numbers.

  4. Simon Says:

    Good formatting isn’t something that can always be reverse-engineered from an AST. Eclipse is a good example – its Java “reformat” tool usually does a wonderful job, and can be configured to almost any style. But there are cases where the formatting of the code is significant to the human understanding of the code, and a tool that lacks the latter can’t reconstruct the former. Things like:

    String sql = new StringBuilder()
        .append("select * from table ")
        .append("where foo = 'bar' ")
        .append("and baz in ").append(subquery)
        .toString();

    With hand-formatting, it’s a lot more readable than a machine-formatted version that doesn’t know where newlines are desirable, and where they’re not.

  5. voracity Says:

    re: strawman. Oooh, that’s almost as powerful as the HTML/CSS DOM (or lisp, for that matter). (I dream of the day of being able to do live updates to JS via an OM — I don’t know if it would be useful, but it would be fun finding out.)

  6. cdleary Says:

    @Simon: Excellent point — sometimes programmers use an aberrant style to reflect semantic differences, like in your “returns this” builder chaining example. This actually seems like an argument against the baser premise that was originally given by roc, that coding styles should be further constrained to begin with. I’m not convinced that the benefits of the hand-formatting would outweigh the convenience of auto-formatting, but that’s a nice piece of evidence. (I don’t think you’d want to start putting formatting annotations in your AST, so I would assume it’s one or the other.)

  7. Robert O'Callahan Says:

    It’s certainly true that if it’s trivial to convert source code between different styles, style choices are less of an issue.

    However you probably still want to standardize particular styles for storage. It seems kinda hard to get useful annotate/blame/diffs from existing version control systems if you store ASTs. And one thing we learned from the failure of “structured editing” in the 70s and 80s (e.g. the CMU Gandalf project) is that it’s important to work well with existing infrastructure like that.

    You probably also want to standardize particular styles for exchanging patches in Bugzilla and elsewhere. I don’t want to have to integrate code reformatting tools into Bugzilla+pastebin+IRC clients, or into my Web browser, to avoid reading someone’s crazy idea of tasty style.

    And if you’re going to standardize those things, I think only the truly stubborn would insist on working with a nonstandard style at other times.

  8. cdleary Says:

    @Robert O’Callahan: My claim is even stronger than “easy conversion makes things easy” — I’m saying source constructs are inherently format-less, so ideally we would store and share them that way in our distributed revision control systems.

    I’m not familiar with the failure of structured editing (will look into it), but I would assume, given that line-based longest-common-subsequence diffs don’t work on binary files, we should be able to hook existing version control tools to invoke a different diff command on these AST file-types. I’m curious as to whether this qualifies as “working well with existing infrastructure”?

    I’m not sure I agree with this idea: “And if you’re going to standardize those things, I think only the truly stubborn would insist on working with a nonstandard style at other times.” If it were sufficiently easy to say, “give me a diff/source segment in the normative style,” people would still be happy to use their own style in the general case. I’ve run into many people who are really frustrated working in some standardized style or another.

  9. Simon Says:

    @Chris – to some degree, the auto formatting could be made smarter if it can recognise certain ‘idioms’ in the code, such as the chained “return this” methods of a builder (like StringBuilder, or Hibernate’s Criteria queries). Supported, perhaps, by some form of metadata – not explicit formatting instructions, but something to tell it “this class is a builder”, so it can apply “builder” formatting rules?

    It’s not perfect though, since I can’t see how *any* automated process could recognise that the fourth “append” call in my example above ought to be on the same line as the previous one, instead of on a new line like most builders. Though I suppose StringBuilder is an unusual case, a very generic builder, without a lot of semantics to the method names.

  10. Gerv Says:

    One thing you’ve missed is comments. Some styles have comments like this:

    /**********************************************************/
    /* This is a comment in a nice box */
    /**********************************************************/

    If, say, your preference was for 75 character lines instead of 80, this wouldn’t work well.

    Gerv

  11. cdleary Says:

    @Gerv: Ah, nice catch — you might have to continue to enforce style guidelines for comments. Human languages are a _lot_ harder to reformat/reflow. :-)

    It seems like there are a few common heuristics for comment styles you might be able to factor out in a language-neutral and most-of-what-you-want kind of way. For example, on one of my projects we used \param instead of @param doxygen-style comments; also, it’s probably not hard to convert simple things like “stars on every line” to “stars on just the first and last lines”.
