'Efficiency' tag

« Older entries
Newer entries »

Remove the self-selection bias from Q&A sessions

I've been in quite a few painful Q&A sessions. I think we can do better.

When somebody volunteers to ask a question, their question is not necessarily of interest to anybody in the audience other than themselves. Despite that possibility, publicly answering the question necessarily takes up the time of every person in the audience. Remember the kid in class who asked 90% of the questions — questions that nobody else ever cared about?

There's a simple way to fix this.

Step 1: At the end of your presentation, ask for a show of hands for those people interested in having a Q&A session. If very few people raise their hands, you should be worried about the quality your presentation. However, in the less gloomy scenario where a good number of people raise their hands, you have a representative sample for the audience population that might be interested in the answer to any given question. (The other people should probably leave, but suggesting that would be rude.)

Step 2: After each person volunteers to ask a question, give the audience a quick poll by saying, "Raise your hand if you're interested in the answer to this question." If very few hands go up, reassure the person that their question is a good one [*] and tell them you'd love to stick around and chat afterwards.

That's it. Each several-minute irrelevant Q&A iteration is now reduced to a few seconds and you have additional feedback as to the topics the audience is interested in — the added audience involvement can't hurt either.

Caveats

Real-time feedback software that the audience interacts with during your presentation can definitely do better. This is just a low-tech proposal to raise the bar a bit.

As my friend pointed out in discussing this, there's no simple way to determine the critical mass for answering a question — it's possible that any given question will only be interesting to a fraction of the audience. It's also possible that you, the presenter, know that the answer to a question is particularly interesting and should be answered without soliciting audience feedback. I believe in your intuition!

Footnotes

[*]

Even if it's not — otherwise, question-asking potentially becomes public humiliation, which is very undesirable.

PICing on JavaScript for fun and profit

Inline caching is a critical ingredient in the delicious pie that is dynamic language performance optimization. What follows is a gentle-albeit-quirky introduction to what polymorphic inline caches (PICs) are and why they're useful to JavaScript Just-In-Time compilers like JaegerMonkey.

But first, the ceremonial giving of the props: the initial barrage of PIC research and implementation in JaegerMonkey was performed by Dave Mandelin and our current inline cache implementations are largely the work of David Anderson. As always, the performance improvements of Firefox's JavaScript engine can be monitored via the Are We Fast Yet? website.

C is for speed, and that's good enough for me

C is fast.

Boring people (like me) argue about astoundingly interesting boring things like, "Can hand-tuned assembly be generally faster than an equivalent C program on modern processor architectures?" and "Do languages really have speeds?", but you needn't worry — just accept that C is fast, and we've always been at war with Eurasia.

So, as we've established, when you write a program in C, it executes quickly. If you rewrite that program in your favorite dynamic language and want to know if it still executes quickly, then you naturally compare it to the original C program.

C is awesome in that it has very few language features. For any given snippet of C code, there's a fairly direct translation to the corresponding assembly instructions. [*] You can almost think of C as portable assembly code. Notably, there are (almost) zero language features that require support during the program's execution — compiling a C program is generally a non-additive translation to machine code.

Dynamic languages like JavaScript have a massive number of features by comparison. The language, as specified, performs all kinds of safety checks, offers you fancy-n-flexible data record constructs, and even takes out the garbage. These things are wonderful, but generally require runtime support, which is supplied by the language engine. [†] This runtime support comes at a price, but, as you'll soon see, we've got a coupon for 93 percent off on select items! [‡]

You now understand the basic, heart-wrenching plight of the performance-oriented dynamic language compiler engineer: implement all the fancy features of the language, but do it at no observable cost.

Interpreters, virtual machines, and bears

"Virtual machine" sounds way cooler than "interpreter". Other than that, you'll find that the distinction is fairly meaningless in relevant literature.

An interpreter takes your program and executes it. Generally, the term "virtual machine" (AKA "VM") refers to a sub-category of interpreter where the source program is first turned into fake "instructions" called bytecodes. [§]

A bear moving quickly

I call these instructions fake because they do things that a hardware processing units are unlikely to ever do: for example, an ADD bytecode in JavaScript will try to add two arbitrary objects together. [¶] The point that languages implementors make by calling it a "virtual machine" is that there is conceptually a device, whether in hardware or software, that could execute this set of instructions to run the program.

These bytecodes are then executed in sequence. A program instruction counter is kept in the VM as it executes, analogous to a program counter register in microprocessor hardware, and control flow bytecodes (branches) change the typical sequence by indicating the next bytecode instruction to be executed.

Virtual (machine) reality

Languages implemented in "pure" VMs are slower than C. Fundamentally, your VM is a program that executes instructions, whereas compiled C code runs on the bare metal. Executing the VM code is overhead!

To narrow the speed gap between dynamic languages and C, VM implementers are forced to eliminate this overhead. They do so by extending the VM to emit real machine instructions — bytecodes are effectively lowered into machine-codes in a process called Just-In-Time (JIT) compilation. Performance-oriented VMs, like Firefox's SpiderMonkey engine, have the ability to JIT compile their programs.

The term "Just-In-Time" is annoyingly vague — just in time for what, exactly? Dinner? The heat death of the universe? The time it takes me to get to the point already?

In today's JavaScript engines, the lowering from bytecodes to machine instructions occurs as the program executes. With the new JaegerMonkey JIT compiler, the lowering occurs for a single function that the engine sees you are about to execute. This has less overhead than compiling the program as a whole when the web browser receives it. The JaegerMonkey JIT compiler is also known as the method JIT, because it JIT compiles a method at a time.

For most readers, this means a few blobs of x86 or x86-64 assembly are generated as you load a web page. The JavaScript engine in your web browser probably spewed a few nice chunks of assembly as you loaded this blog entry.

Aside: TraceMonkey

In SpiderMonkey we have some special sauce: a second JIT, called TraceMonkey, that kicks in under special circumstances: when the engine detects that you're running loopy code (for example, a for loop with a lot of iterations), it records a stream of bytecodes that corresponds to a trip around the loop. This stream is called a trace and it's interesting because a) it can record bytecodes across function calls and b) the trace optimizer works harder than the method JIT to make the resulting machine code fast.

There's lots more to be said about TraceMonkey, but the inline caching optimization that we're about to discuss is only implemented in JaegerMonkey nowadays, so I'll cut that discussion short.

The need for inline caching

In C, accessing a member of a structure is a single "load" machine instruction:

struct Nose {
    int howManyNostrils;
    bool isPointy;
};

bool isNosePointy(struct Nose *nose) {
    return nose->isPointy;
}

The way that the members of struct Nose are laid out in memory is known to the C compiler because it can see the struct definition — getting the attribute nose->isPointy translates directly into a load from the address addressof(nose) + offsetof(Nose, isPointy).

Note: Just to normalize all the terminology, let's call the data contained within a structure the properties (instead of members) and the way that you name them the identifiers. For example, isPointy is an identifier and the boolean data contained within nose->isPointy is the property. The act of looking up a property through an identifier is a property access.

On the other hand, objects in JavaScript are flexible — you can add and delete arbitrary properties from objects at runtime. There is also no language-level support for specifying the types that an identifier can take on. As a result, there's no simple way to know what memory address to load from in an arbitrary JavaScript property access.

Consider the following snippet:

function isNosePointy(nose) {
    return nose.isPointy;
}

To get at the isPointy property, the JavaScript VM emits a single bytecode, called GETPROP, which says "pull out the property with the identifier isPointy". [#] Conceptually, this operation performs a hash-map lookup (using the identifier as a key), which takes around 45 cycles in my microbenchmark. [♠]

Uncached property access data

The process of "looking up a property at runtime because you don't know the exact type of the object" falls into a general category of runtime support called dynamic dispatch. Unsurprisingly, there is execution time overhead associated with dynamic dispatch, because the lookup must be performed at runtime.

To avoid performing a hash-map lookup on every property access, dynamic language interpreters sometimes employ a small cache for (all) property accesses. You index into this cache with the runtime-type of the object and desired identifier. [♥] Resolving a property access against this cache under ideal circumstances takes about 8.5 cycles.

Cached property access data

WTF is inline caching already!?

So we've established that, with good locality, JS property accesses are at least 8.5x slower than C struct property accesses. We've bridged the gap quite a bit from 45x slower. But how do we bridge the gap even bridgier?

Bridge fail!

The answer is, surprisingly, self-modifying code: code that modifies code-that-currently-exists-in-memory. When we JIT compile a property access bytecode, we emit machine-code that looks like this:

type            <- load addressof(object) + offsetof(JSObject, type)
shapeIsKnown    <- type equals IMPOSSIBLE_TYPE
None            <- goto slowLookupCode if shapeIsKnown is False
property        <- load addressof(object) + IMPOSSIBLE_SLOT

Now, if you ask Joe Programmer what he thinks of that code snippet, he would correctly deduce, "The slow lookup code will always be executed!" However, we've got the self-modifying code trick up our sleeves. Imagine that the type matched, so we didn't have to go to the slow lookup code — what's our new property access time?

One type load, one comparison, an untaken branch, and a property value load. Assuming good locality/predictability and that the object's type happened to already be in the register (because you tend to use it a lot), that's 0+1+1+1 == 3 cycles! Much better.

But how do we get the types to match? Joe Programmer is still looking pretty smug over there.

The trick is to have the slowLookupCode actually modify this snippet of machine code! After slowLookupCode resolves the property in the traditional ways mentioned in previous sections, it fills in a reasonable value for IMPOSSIBLE_TYPE and IMPOSSIBLE_SLOT like they were blank fields in a form. This way, the next time you run this machine code, there's a reasonable chance you won't need to go to slowLookupCode — the types might compare equal, in which case you can perform a simple load instruction to get the property that you're looking for!

This technique of modifying the JIT-compiled code to reflect a probable value is called inline caching: inline, as in "in the emitted code"; caching, as in "cache a probable value in there". This the basic idea behind inline caches, AKA ICs.

Also, because we emit this snippet for every property-retrieving bytecode we don't rely on global property access patterns like the global property cache does. We mechanical mariners are less at the mercy of the gods of locality.

Code generation

Where does "P" come from?

Er, right, we're still missing a letter. The "P" in "PIC" stands for polymorphic, which is a fancy sounding word that means "more than one type".

The inline cache demonstrated above can only remember information for a single type — any other type will result is a shapeIsKnown of False and you'll end up going to the slowLookupCode.

Surveys have shown that the degree of polymorphism (number of different types that actually pass through a snippet during program execution) in real-world code tends to be low, in JavaScript [♦] as well as related languages. However, polymorphism happens, and when it does, we like to be fast at it, too.

So, if our inline cache only supports a single type, what can we do to handle polymorphism? The answer may still be surprising: self-modify the machine code some more!

Before we talk about handling the polymorphic case, let's recap the PIC lifecycle.

The PIC lifecycle

The evolution of the PIC is managed through slowLookupCode, which keeps track of the state of the inline cache in addition to performing a traditional lookup. Once the slow lookup is performed and the PIC evolves, the slowLookupCode jumps back (to the instruction after the slot load) to do the next thing in the method.

When a PIC is born, it has that useless-looking structure you saw in the previous section — it's like a form waiting to be filled out. The industry terminology for this state is pre-monomorphic, meaning that it hasn't even seen one (mono) type pass through it yet.

The first time that inline cache is executed and we reach slowLookupCode we, shockingly, just ignore it. We do this because there is actually a hidden overhead associated with modifying machine code in-place — we want to make sure that you don't incur any of that overhead unless there's an indication you might be running that code a bunch of times. [♣]

The second time we reach the slowLookupCode, the inline cache is modified and the PIC reaches the state called monomorphic. Let's say we saw a type named ElephantTrunk — the PIC can now recognize ElephantTrunk objects and perform the fast slot lookup.

When the PIC is monomorphic and another type, named GiraffeSnout, flows through, we have a problem. There are no more places to put cache entries — we've filled out the whole form. This is where we get tricky: we create a new piece of code memory that contains the new filled-out form, and we modify the original form's jump to go to the new piece of code memory instead of slowLookupCode.

Recognize the pattern? We're making a chain of cache entries: if it's not an ElephantTrunk, jump to the GiraffeSnout test. If the GiraffeSnout fails, then jump to the slowLookupCode. An inline cache that can hit on more than one type is said to be in the polymorphic state.

PIC lifecycle

There's one last stage that PICs can reach, which is the coolest sounding of all: megamorphic. Once we detect that there are a lot of types flowing through a property access site, slowLookupCode stops creating cache entries. The assumption is that you might be passing an insane number of types through this code, in which case additional caching would only only slow things down. For a prime example of megamorphism, the 280slides code has an invocation site with 1,437 effective types! [**]

Conclusion

There's a lot more to discuss, but this introduction is rambling enough as-is — if people express interest we can further discuss topics like:

  • Why in the name of Knuth is it okay to perform a linear search of the types in the inline cache? Can't we do better?

  • What are all the types of inline caches that you currently implement? What bytecodes do they correspond to? Who are the handsome fellows who implemented each of those inline caches and are they currently seeing anybody?

  • Why are you talking about types in a language like JavaScript, where typeof o is almost always "object"?!

  • What happens if the property lives on a prototype of the object, instead of the object itself? ...then it can't be a simple slot load! I feel hurt and betrayed... what other caveats haven't been discussed!?

  • Why might you choose to use monomorphic inline caches (MICs) instead of using PICs everywhere?

  • Does the TraceMonkey execution time suffer because hits in the inline cache prevent fills of the (global) property cache?

  • What kick-arse optimizations related to PICs have been discussed in the literature that have yet to be implemented in JaegerMonkey?

  • What the crap have I been talking about this whole time?

Suffice it to say that JavaScript gets a nice speed boost by enabling PICs: x86 JaegerMonkey with PICs enabled is 25% faster on SunSpider than with them disabled on my machine. [††] If something makes a dynamic language fast, then it is awesome. Therefore, inline caches are awesome. (Modus ponens says so.)

Footnotes

[*]

This is as opposed to, say, C++, where in any given snippet of code the == operator could be overloaded.

[†]

"Engine" is a sexy term, but it's just a library of support code that you use when language constructs don't easily fall into the translate-it-directly-to-machine-code model used by C.

[‡]

Coupon only applies to idealized property access latencies. Competitor coupons gladly accepted. Additional terms and restrictions may apply. See store for details.

[§]

Alternative interpreter designs tend to walk over something that looks more like the source text — either an abstract syntax tree or the program tokens themselves. These designs are less common in modern dynamic languages.

[¶]

There have historically been implementations that do things like this; notably, the Lisp machines and Jazelle DBX. The JavaScript semantics for ADD are particularly hairy compared to these hosted languages, because getting the value-for-adding out of an object can potentially invoke arbitrary functions, causing re-entrance into JavaScript interpretation.

[#]

In the bytecode stream the value isPointy is encoded as an immediate.

[♠]

Note that there is actually further overhead in turning the looked-up property into an appropriate JavaScript value. For example, there are additional checks to see whether the looked-up value represents a "getter" function that should be invoked.

[♥]

This is, in itself, a small hash-map lookup, but the hash function is quite fast. At the moment it's four dependent ALU operations: right shift, xor, add, and.

[♦]

Gregor Richards published a paper in PLDI 2010 that analyzed a set of popular web-based JS applications. The results demonstrated that more than eighty percent of all call sites were monomorphic (had the same function body). I'm speculating that this correlates well to the property accesses we're discussing, though that wasn't explicitly established by the research — in JS, property access PIC are easier to discuss than function invocation PICs. In related languages, like Self, there is no distinction between method invocation and property access.

[♣]

"Hidden overhead my foot! Where does it come from?" Today's processors get a little scared when you write to the parts of memory that contain code. Modern processor architecture assumes that the memory you're executing code out of will not be written to frequently, so they don't optimize for it. [‡‡]

[**]

The annoying part is that the instruction prefetcher may have buffered up the modified instructions, so you have to check if the modified cache line is in there. Older cache coherency protocols I've read about flush lines past unified caches if they detect a hit in both the instruction and data caches — maybe it's better nowadays.

[††]

I'm citing Gregor Richards yet again.

[‡‡]

MICs give a nice percentage boost as well, but they're harder to disable at the moment, or I'd have numbers for that too.

House rental, showers, and auctions

I'm currently in the prologue of the house-rental process. Our sights are set on a four-bedroom house with an interesting property: the room resources are of differing value — not in terms of square-footage, but other, less easily scalable factors.

  • One bedroom has an attached bathroom and a slightly larger closet (the master bedroom).

  • One bedroom is on its own floor with a half-bathroom. [*]

  • The other two share a full bathroom, whose shower is also a resource shared with the bedroom mentioned above.

So how do we dole out the resources of unequal value in a manner that everybody considers fair?

(Talk about first-world problems...)

Proposed solutions

Three systems for deciding the room allocations have been proposed, and one has been rejected:

  • Randomly assign choosing priority: this was dismissed because the person with the last priority inevitably feels "stuck with" a room.

  • Entirely random: we take four playing cards from a deck, associate each card with a room, shuffle the cards until everybody is satisfied, then deal each person a card. That's the room that they get. Everybody is equally "stuck with" a room, and not due to the actions of any other person.

  • Bidding system: the auction would move from perceived-most-valuable-room to perceived-least-valuable-room. Bidding starts with the master bedroom at the evenly-divided rental rate. Once a winner/price is determined for that room, the process is repeated for the room with the half-bathroom, with a newly-divided rental rate. Once a point is reached where nobody is willing to place a bid on a room, the rest are assigned at random. Each person, through their own decision to bid or not to bid, ends up with the room of their choosing.

I'm personally a fan of the simplicity and total lack of competition in the random system, but if the bidding system is more likely to lead to the optimal outcome, it must be considered! (Of course, care must be taken to address possible social ramifications.)

During casual discussion we noted that the bidding system would be strategically interesting. If a person were not really interested in getting a high-value room but didn't care all that much if they did, it is clearly in their favor to drive bids on the early rooms as high as possible — this would entail a decrease in the payment rate for a later room.

I perused the Wikipedia entry on auction theory and realize that we were assuming an open/ascending bid auction, but a first-price/sealed-bid auction might be less socially stressful. [†] My main fear is that open bidding would get carried away (it's exciting, after all) and participants would come to regret their bids.

In any case, it's an interesting problem I thought I would share. Suggestions are welcome; otherwise, I'll write a follow-up entry as to how it turns out.

Footnotes

[*]

This means it has no shower. Stupid terminology, but I suppose it has a nicer ring than "toilet room".

[†]

As an aside, the game theory described on that page also seems to discuss provable equilibrium of a single auction, whereas our auctions have dependent equilibrium values.

Tool teams should work like sleeper cells

I've had some unique experiences interacting-with and participating-in tool development at previous companies that I've worked for, with the quality of those experiences in the broad spectrum from train-wreck to near-satisfactory. From that mental scarring has emerged the weighty goo of an idea, which may be interesting food for thought. [*]

How it starts

At some point in a company's growth, management notices that there is a lot of sub-par, [†] redundant, and distributed tool development going on. Employees have been manufacturing quick-and-dirty tools in order to perform their jobs more efficiently.

Management then ponders the benefit of centralizing that tool development. It seems like an easy sell:

  • Help ensure ongoing productivity gains — mission-critical tools that stop working or contain heinous bugs are a hidden liability

  • Foster tool developer expertise on a specialized team

  • Eliminate needless repetition by creating shared code bases and consolidating infrastructure resources

Good management will also consider the negative repercussions of turning distributed and independent resources into a shared and centrally managed resource:

  • Everybody wants a slice of the tool pie, because tools make our life better, and there always seems to be some stupid crap to automate.

  • Everybody wants a piece of the compute resources, because more compute makes parallelizable analyses go faster.

  • If you build it, they will come.

How I've seen it work (warning: depressing, hyperbolic)

  1. A group at the company makes a strong enough case to the centralized-tool-management machinery — a request for tool development is granted.

  2. A series of inevitably painful meetings are scheduled where the customer dictates their requirements, after which the tool team either rejects them or misunderstands/mis-prioritizes them because: a) that's not how it works — they have to actively gather the requirements, and b) they don't have enough time to do all the silly little things that the customer wants.

    Because people are fighting each other to get what they want, everybody forgets that the customers haven't really described the problem domain in any relevant detail.

  3. The tool team developers are happy to go code in peace, without going back for more painful meetings. They create a tool according to their understanding of the requirements during the first iteration.

  4. The customer has no idea how the tool team came up with a product that was nothing like their expectation. They say something overly dramatic like, "it's all wrong," pissing off the tool team, and lose faith in the ability of the tool team to deliver the product they want.

  5. The customer goes back to doing it manually or continue to develop their own tools, expecting that the tool team will fail.

  6. The tool team fails because the customer lost interest in telling them what they actually needed and giving good feedback. It wasn't the tool that anybody was looking for because the process doomed it from the start.

I say that this scenario is depressing because tool teams exist to make life better for everybody — they enjoy writing software that makes your life easier. Working with a tool team should not be painful. You should want to jump for joy when you start working with them and take them out to beers when you're finished working with them, because they're just that good. I think that, by taking a less traditional approach, you will be able to achieve much better results...

How it should work

  1. A group at the company makes a strong enough case to the centralized-tool-management machinery — a request for tool development is granted.

  2. A small handful of tool team operatives [‡], probably around two or three people, split off from the rest of the tool team and are placed in the company hierarchy under the team of the customers. They sit the customers' cube farm, go to their meetings to listen (but no laptops!), etc., just like a typical team member would.

  3. The customer team brings the operatives up to speed on the automatable task that must be performed each day through immersion. Depending on the frequency, breadth, and duration of the manual processes, the operatives must perform this manual process somewhere on the scale from weeks to months, until they develop a full understanding of the variety of manual processes that must be performed. [§] All operatives should be 100% assigned to the manual tasks for this duration, temporarily offloading members of customer team after their ramp-up.

  4. Bam! With an unquestionably solid understanding of the problem domain, the tool team sleeper cells activate. 80% of the manual task load is transitioned off of the operatives so that they can begin development work. Agile-style iterations of 1-2 weeks should be used.

  5. After each iteration there must be a usable product (by definition of an iteration). As a result of this, a percentage of the manual task load is shifted back onto the operatives each iteration, augmenting the original 20%. If the tool is actually developing properly, the operatives will be able to cope with the increased load over time.

  6. As the feature set begins to stabilize or the manual task load approaches zero (because it has all been automated), the product is released to the customers for feedback and a limited amount of future-proofing is considered for final iterations.

  7. Most customer feedback is ignored, but a small and reasonable subset is acted on. If the operatives were able to make do with the full task load plus development, it's probably a lot better than it used to be, and the customer is just getting greedy.

  8. The customer takes the operatives out for beers, since the tool team saved them a crapload of time and accounted for all the issues in the problem domain.

  9. A single operative hangs back with the customer for a few more iterations to eyeball maintenance concerns and maybe do a little more future-proofing while the rest head back to the tool team. The one who hangs back gets some kind of special reward for being a team player.

Conclusion

In the sleeper cell approach, the operatives have a clear understanding of what's important through first hand knowledge and experience and, consequently, know the ways in which the software has to be flexible. It emulates the way that organic tool development is found in the wild, as described in the introductory paragraph, but puts the task of creating the actual tool in the hands of experienced tool developers (our operatives!).

I think it's also noteworthy that this approach adheres to a reasonable principle: to write a good program to automate a task, you have to know/understand the variety of ways in which you might perform that task by hand, across all the likely variables.

The operatives are forced to live with the fruits of their labor; i.e. a defect like slow load times will be more painful for them, because they have to work with their tool regularly and take on larger workloads on an ongoing basis, before developers can ever get their hands on it.

Notice that there's still the benefit through centralization of tool developers: central contact point for tool needs, cultivating expertise in developers, knowledge of shared code base, understanding of infrastructure and contact points for infrastructural resource needs; however, you avoid the weird customer disconnect that comes with time slicing a traditional tool team.

Tools developers may also find that they enjoy the team that they're working in so much that they request to stay on that team! How awesome of a pitch is that to new hires? "Do you have a strong background in software development? Work closely with established software experts, make connections to people who will love you when you're done awesome-ing their lives, and take a whirlwind tour of the company within one year."

Footnotes

[*]

Yes, I'm suggesting you digest my mind-goo.

[†]

For some definition of par.

[‡]

I'm calling them operatives now, because their roles are different from tool developers, as you'll see.

[§]

It is beneficial if a small seed of hatred for the manual task begins to develop, though care should be taken not to allow operatives to be consumed by said hatred.

Efficiency of list comprehensions

I'm psyched about the awesome comments on my previous entry, Python by example: list comprehensions. Originally this entry was just a response to those comments, but people who stumbled across this entry on the interwebz found the response format too confusing, so I've restructured it for posterity.

Efficiency of the more common usage

Let's look at the efficiency of list comprehensions in the more common usage, where the comprehension's list result is actually relevant (or, in compiler-speak, live-out).

Using the following program, you can see the time spent in each implementation and the corresponding bytecode sequence:

import dis
import inspect
import timeit


programs = dict(
    loop="""
result = []
for i in range(20):
    result.append(i * 2)
""",
   loop_faster="""
result = []
add = result.append
for i in range(20):
    add(i * 2)
""",
    comprehension='result = [i * 2 for i in range(20)]',
)


for name, text in programs.iteritems():
    print name, timeit.Timer(stmt=text).timeit()
    code = compile(text, '<string>', 'exec')
    dis.disassemble(code)
loop 11.1495118141
  2           0 BUILD_LIST               0
              3 STORE_NAME               0 (result)

  3           6 SETUP_LOOP              37 (to 46)
              9 LOAD_NAME                1 (range)
             12 LOAD_CONST               0 (20)
             15 CALL_FUNCTION            1
             18 GET_ITER
        >>   19 FOR_ITER                23 (to 45)
             22 STORE_NAME               2 (i)

  4          25 LOAD_NAME                0 (result)
             28 LOAD_ATTR                3 (append)
             31 LOAD_NAME                2 (i)
             34 LOAD_CONST               1 (2)
             37 BINARY_MULTIPLY
             38 CALL_FUNCTION            1
             41 POP_TOP
             42 JUMP_ABSOLUTE           19
        >>   45 POP_BLOCK
        >>   46 LOAD_CONST               2 (None)
             49 RETURN_VALUE
loop_faster 8.36096310616
  2           0 BUILD_LIST               0
              3 STORE_NAME               0 (result)

  3           6 LOAD_NAME                0 (result)
              9 LOAD_ATTR                1 (append)
             12 STORE_NAME               2 (add)

  4          15 SETUP_LOOP              34 (to 52)
             18 LOAD_NAME                3 (range)
             21 LOAD_CONST               0 (20)
             24 CALL_FUNCTION            1
             27 GET_ITER
        >>   28 FOR_ITER                20 (to 51)
             31 STORE_NAME               4 (i)

  5          34 LOAD_NAME                2 (add)
             37 LOAD_NAME                4 (i)
             40 LOAD_CONST               1 (2)
             43 BINARY_MULTIPLY
             44 CALL_FUNCTION            1
             47 POP_TOP
             48 JUMP_ABSOLUTE           28
        >>   51 POP_BLOCK
        >>   52 LOAD_CONST               2 (None)
             55 RETURN_VALUE
comprehension 7.08145213127
  1           0 BUILD_LIST               0
              3 DUP_TOP
              4 STORE_NAME               0 (_[1])
              7 LOAD_NAME                1 (range)
             10 LOAD_CONST               0 (20)
             13 CALL_FUNCTION            1
             16 GET_ITER
        >>   17 FOR_ITER                17 (to 37)
             20 STORE_NAME               2 (i)
             23 LOAD_NAME                0 (_[1])
             26 LOAD_NAME                2 (i)
             29 LOAD_CONST               1 (2)
             32 BINARY_MULTIPLY
             33 LIST_APPEND
             34 JUMP_ABSOLUTE           17
        >>   37 DELETE_NAME              0 (_[1])
             40 STORE_NAME               3 (result)
             43 LOAD_CONST               2 (None)
             46 RETURN_VALUE

List comprehensions perform better here because you don’t need to load the append attribute off of the list (loop program, bytecode 28) and call it as a function (loop program, bytecode 38). Instead, in a comprehension, a specialized LIST_APPEND bytecode is generated for a fast append onto the result list (comprehension program, bytecode 33).

In the loop_faster program, you avoid the overhead of the append attribute lookup by hoisting it out of the loop and placing the result in a fastlocal (bytecode 9-12), so it loops more quickly; however, the comprehension uses a specialized LIST_APPEND bytecode instead of incurring the overhead of a function call, so it still trumps.

Using list comprehensions for side effects

I want to address a point that was brought up in the previous entry as to the efficiency of for loops versus list comprehensions when used purely for side effects, but I'll discuss the subjective bit first, since that's the least sciency part.

Readability

Simple test – if you did need the result would the comprehension be easily understood? If the answer is yes then removing the assignment on the left hand side doesn’t magically make it less readable…

Michael Foord

First of all, thanks to Michael for his excellent and thought provoking comment!

My response is that removing the use of the result does indeed make it less readable, precisely because you're using a result-producing control flow construct where the result is not needed. I suppose I'm positing that it's inherently confusing to do that with your syntax: there's a looping form that doesn't produce a result, so that should be used instead. It's expressing your semantic intention via syntax.

For advanced Pythonistas it's easy for figure out what's going on at a glance, but comprehension-as-loop definitely has a "there's more than one way to do it" smell about it, which also makes it less amenable to people learning the language.

With a viable comprehension-as-loop option, every time a user goes to write a loop that doesn't require a result they now ask themselves, "Can I fit this into the list comprehension form?" Those mental branches are, to me, what "one way to do it" is designed to avoid. When I read Perl code, I take "mental exceptions" all the time because the author didn't use the construct that I would have used in the same situation. Minimizing that is a good thing, so I maintain that "no result needed" should automatically imply a loop construct.

Efficiency

Consider two functions, comprehension and loop:

def loop():
    accum = []
    for i in range(20):
        accum.append(i)
    return accum


def comprehension():
    accum = []
    [accum.append(i) for i in range(20)]
    return accum

N.B. This example is comparing the efficiency of a list comprehension where the result of the comprehension is ignored to a for loop that produces no result, as is discussed in the referenced entry, Python by example: list comprehensions.

Michael Foord comments:

Your alternative for the single line, easily readable, list comprehension is four lines that are less efficient because the loop happens in the interpreter rather than in C.

However, the disassembly, obtained via dis.dis(func) looks like the following for the loop:

2           0 BUILD_LIST               0
            3 STORE_FAST               0 (accum)

3           6 SETUP_LOOP              33 (to 42)
            9 LOAD_GLOBAL              0 (range)
           12 LOAD_CONST               1 (20)
           15 CALL_FUNCTION            1
           18 GET_ITER
      >>   19 FOR_ITER                19 (to 41)
           22 STORE_FAST               1 (i)

4          25 LOAD_FAST                0 (accum)
           28 LOAD_ATTR                1 (append)
           31 LOAD_FAST                1 (i)
           34 CALL_FUNCTION            1
           37 POP_TOP
           38 JUMP_ABSOLUTE           19
      >>   41 POP_BLOCK

5     >>   42 LOAD_FAST                0 (accum)
           45 RETURN_VALUE

And it looks like the following for the comprehension:

2           0 BUILD_LIST               0
            3 STORE_FAST               0 (accum)

3           6 BUILD_LIST               0
            9 DUP_TOP
           10 STORE_FAST               1 (_[1])
           13 LOAD_GLOBAL              0 (range)
           16 LOAD_CONST               1 (20)
           19 CALL_FUNCTION            1
           22 GET_ITER
      >>   23 FOR_ITER                22 (to 48)
           26 STORE_FAST               2 (i)
           29 LOAD_FAST                1 (_[1])
           32 LOAD_FAST                0 (accum)
           35 LOAD_ATTR                1 (append)
           38 LOAD_FAST                2 (i)
           41 CALL_FUNCTION            1
           44 LIST_APPEND
           45 JUMP_ABSOLUTE           23
      >>   48 DELETE_FAST              1 (_[1])
           51 POP_TOP

4          52 LOAD_FAST                0 (accum)
           55 RETURN_VALUE

By looking at the bytecode instructions, we see that the list comprehension is, at a language level, actually just "syntactic sugar" for the for loop, as mentioned by nes — they both lower down into the same control flow construct at a virtual machine level, at least in CPython.

The primary difference between the two disassemblies is that a superfluous list comprehension result is stored into fastlocal 1, which is loaded (bytecode 29) and appended to (bytecode 44) each iteration, creating some additional overhead — it's simply deleted in bytecode 48. Unless the POP_BLOCK operation (bytecode 41) of the loop disassembly is very expensive (I haven't looked into its implementation), the comprehension disassembly is guaranteed to be less efficient.

Because of this, I believe that Michael was mistaken in referring to an overhead that results from use of a for loop versus a list comprehension for CPython. It would be interesting to perform a survey of the list comprehension optimization techniques used in various Python implementations, but optimization seems difficult outside of something like a special Cython construct, because LOAD_GLOBAL range could potentially be changed from the builtin range function. Various issues of this kind are discussed in the (very interesting) paper The effect of unrolling and inlining for Python bytecode optimizations.