October 9, 2010

Why programming languages?

I am totally taken aback by the lack of hyperbolic romanticism in the foreword of the programming language book that I just got. There are horribly boring-sounding hooks along the lines of "computers are ubiquitous in the 21st century" and "connecting the theoretical foundations of computer science to modern platform architectures". Hopefully people don't judge a book by its cover or its foreword.

Pragmatic rhetoric is uninspiring — you could just as easily write a foreword for a book on ripping up floorboards and talk about how "floorboards are ubiquitous in the 21st century" and "connects the theoretical foundations of wood science to modern house architectures".

If I were to write a foreword that mentioned the reasons you should be interested in programming languages, it would go roughly like this, to which I hope there is no analogy for ripping up floorboards:


Rejoice, programming languages are the irrigation ditches of your mind-goo!

You've got brilliant ideas brewing inside your head, trying to claw their way out of your little mind and escape into the world. Unfortunately, the meat shell that your ideas live in is quite limited — you can barely eat and talk at the same time, and even then, you're not supposed to be talking with your mouth full. It's bad manners.

Luckily for you, computers provide another venue to express and execute your abstract thoughts. The way you express yourself to these unquestioning harbingers of awesome is through programming languages, whose programs cause actions to be taken in the computing device's virtual world.

Much like ice cream, programming languages come in many flavors (and some have those great cookie dough chunks that you find when you first bite into them). The flavor of a programming language shapes the way that people express and reason about their abstract thoughts — a result of the language's unique design and implementation. Because there's no "best" way to channel your thoughts into a computing device, our work on programming languages is never done!

What [whoever] has written here is [probably] a wonderful explanation of the way we currently bridge the gap between the mind and the computing device for the flavors of programming language we've invented-and-used thus far. Learning from historical successes and pitfalls is key to really understanding existing programming languages and evaluating the design decisions that you'll be making for the programming languages of tomorrow.

Christopher D. Leary
Guy With a Blog
The Internet

PICing on JavaScript for fun and profit

Inline caching is a critical ingredient in the delicious pie that is dynamic language performance optimization. What follows is a gentle-albeit-quirky introduction to what polymorphic inline caches (PICs) are and why they're useful to JavaScript Just-In-Time compilers like JaegerMonkey.

But first, the ceremonial giving of the props: the initial barrage of PIC research and implementation in JaegerMonkey was performed by Dave Mandelin and our current inline cache implementations are largely the work of David Anderson. As always, the performance improvements of Firefox's JavaScript engine can be monitored via the Are We Fast Yet? website.

C is for speed, and that's good enough for me

C is fast.

Boring people (like me) argue about astoundingly interesting boring things like, "Can hand-tuned assembly be generally faster than an equivalent C program on modern processor architectures?" and "Do languages really have speeds?", but you needn't worry — just accept that C is fast, and we've always been at war with Eurasia.

So, as we've established, when you write a program in C, it executes quickly. If you rewrite that program in your favorite dynamic language and want to know if it still executes quickly, then you naturally compare it to the original C program.

C is awesome in that it has very few language features. For any given snippet of C code, there's a fairly direct translation to the corresponding assembly instructions. [*] You can almost think of C as portable assembly code. Notably, there are (almost) zero language features that require support during the program's execution — compiling a C program is generally a non-additive translation to machine code.

Dynamic languages like JavaScript have a massive number of features by comparison. The language, as specified, performs all kinds of safety checks, offers you fancy-n-flexible data record constructs, and even takes out the garbage. These things are wonderful, but generally require runtime support, which is supplied by the language engine. [†] This runtime support comes at a price, but, as you'll soon see, we've got a coupon for 93 percent off on select items! [‡]

You now understand the basic, heart-wrenching plight of the performance-oriented dynamic language compiler engineer: implement all the fancy features of the language, but do it at no observable cost.

Interpreters, virtual machines, and bears

"Virtual machine" sounds way cooler than "interpreter". Other than that, you'll find that the distinction is fairly meaningless in relevant literature.

An interpreter takes your program and executes it. Generally, the term "virtual machine" (AKA "VM") refers to a sub-category of interpreter where the source program is first turned into fake "instructions" called bytecodes. [§]

A bear moving quickly

I call these instructions fake because they do things that hardware processing units are unlikely to ever do: for example, an ADD bytecode in JavaScript will try to add two arbitrary objects together. [¶] The point that language implementors make by calling it a "virtual machine" is that there is conceptually a device, whether in hardware or software, that could execute this set of instructions to run the program.

These bytecodes are then executed in sequence. A program instruction counter is kept in the VM as it executes, analogous to a program counter register in microprocessor hardware, and control flow bytecodes (branches) change the typical sequence by indicating the next bytecode instruction to be executed.
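To make that concrete, here's a toy dispatch loop — a Python sketch with a made-up four-instruction bytecode set, nothing like SpiderMonkey's actual opcodes:

# A four-instruction program: compute 1 + 2 and print it.
program = [
    ('PUSH', 1),
    ('PUSH', 2),
    ('ADD', None),
    ('PRINT', None),
]

def run(program):
    stack = []
    pc = 0                    # the VM's program instruction counter
    while pc < len(program):
        op, arg = program[pc]
        pc += 1               # branch bytecodes would assign to pc instead
        if op == 'PUSH':
            stack.append(arg)
        elif op == 'ADD':     # a real JavaScript ADD handles arbitrary objects!
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == 'PRINT':
            print(stack.pop())

run(program)  # prints 3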

Virtual (machine) reality

Languages implemented in "pure" VMs are slower than C. Fundamentally, your VM is a program that executes instructions, whereas compiled C code runs on the bare metal. Executing the VM code is overhead!

To narrow the speed gap between dynamic languages and C, VM implementers are forced to eliminate this overhead. They do so by extending the VM to emit real machine instructions — bytecodes are effectively lowered into machine-codes in a process called Just-In-Time (JIT) compilation. Performance-oriented VMs, like Firefox's SpiderMonkey engine, have the ability to JIT compile their programs.

The term "Just-In-Time" is annoyingly vague — just in time for what, exactly? Dinner? The heat death of the universe? The time it takes me to get to the point already?

In today's JavaScript engines, the lowering from bytecodes to machine instructions occurs as the program executes. With the new JaegerMonkey JIT compiler, the lowering occurs for a single function that the engine sees you are about to execute. This has less overhead than compiling the program as a whole when the web browser receives it. The JaegerMonkey JIT compiler is also known as the method JIT, because it JIT compiles a method at a time.

For most readers, this means a few blobs of x86 or x86-64 assembly are generated as you load a web page. The JavaScript engine in your web browser probably spewed a few nice chunks of assembly as you loaded this blog entry.

Aside: TraceMonkey

In SpiderMonkey we have some special sauce: a second JIT, called TraceMonkey, that kicks in under special circumstances: when the engine detects that you're running loopy code (for example, a for loop with a lot of iterations), it records a stream of bytecodes that corresponds to a trip around the loop. This stream is called a trace and it's interesting because a) it can record bytecodes across function calls and b) the trace optimizer works harder than the method JIT to make the resulting machine code fast.

There's lots more to be said about TraceMonkey, but the inline caching optimization that we're about to discuss is only implemented in JaegerMonkey nowadays, so I'll cut that discussion short.

The need for inline caching

In C, accessing a member of a structure is a single "load" machine instruction:

struct Nose {
    int howManyNostrils;
    bool isPointy;
};

bool isNosePointy(struct Nose *nose) {
    return nose->isPointy;
}

The way that the members of struct Nose are laid out in memory is known to the C compiler because it can see the struct definition — getting the attribute nose->isPointy translates directly into a load from the address addressof(nose) + offsetof(Nose, isPointy).

Note: Just to normalize all the terminology, let's call the data contained within a structure the properties (instead of members) and the way that you name them the identifiers. For example, isPointy is an identifier and the boolean data contained within nose->isPointy is the property. The act of looking up a property through an identifier is a property access.

On the other hand, objects in JavaScript are flexible — you can add and delete arbitrary properties from objects at runtime. There is also no language-level support for specifying the types that an identifier can take on. As a result, there's no simple way to know what memory address to load from in an arbitrary JavaScript property access.

Consider the following snippet:

function isNosePointy(nose) {
    return nose.isPointy;
}

To get at the isPointy property, the JavaScript VM emits a single bytecode, called GETPROP, which says "pull out the property with the identifier isPointy". [#] Conceptually, this operation performs a hash-map lookup (using the identifier as a key), which takes around 45 cycles in my microbenchmark. [♠]

Uncached property access data

The process of "looking up a property at runtime because you don't know the exact type of the object" falls into a general category of runtime support called dynamic dispatch. Unsurprisingly, there is execution time overhead associated with dynamic dispatch, because the lookup must be performed at runtime.

To avoid performing a hash-map lookup on every property access, dynamic language interpreters sometimes employ a small cache for (all) property accesses. You index into this cache with the runtime-type of the object and desired identifier. [♥] Resolving a property access against this cache under ideal circumstances takes about 8.5 cycles.

Cached property access data
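To make the keying scheme concrete, here's a toy Python model of such a cache. Everything here (the layout-and-values object representation, the function names) is invented for illustration, not SpiderMonkey internals:

property_cache = {}  # (runtime type, identifier) -> slot

def slow_lookup(layout, name):
    return layout[name]               # the ~45-cycle hash-map lookup

def get_property(obj, name):
    layout, values = obj              # layout stands in for the runtime type
    key = (id(layout), name)
    slot = property_cache.get(key)
    if slot is None:                  # miss: do the slow lookup, then cache
        slot = slow_lookup(layout, name)
        property_cache[key] = slot
    return values[slot]               # hit: the ~8.5-cycle case

nose_layout = {'howManyNostrils': 0, 'isPointy': 1}
nose = (nose_layout, [2, True])
print(get_property(nose, 'isPointy'))  # first access misses and fills the cache
print(get_property(nose, 'isPointy'))  # second access hits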

WTF is inline caching already!?

So we've established that, with good locality, JS property accesses are at least 8.5x slower than C struct property accesses. We've bridged the gap quite a bit from 45x slower. But how do we bridge the gap even bridgier?

Bridge fail!

The answer is, surprisingly, self-modifying code: code that modifies code-that-currently-exists-in-memory. When we JIT compile a property access bytecode, we emit machine-code that looks like this:

type            <- load addressof(object) + offsetof(JSObject, type)
shapeIsKnown    <- type equals IMPOSSIBLE_TYPE
None            <- goto slowLookupCode if shapeIsKnown is False
property        <- load addressof(object) + IMPOSSIBLE_SLOT

Now, if you ask Joe Programmer what he thinks of that code snippet, he would correctly deduce, "The slow lookup code will always be executed!" However, we've got the self-modifying code trick up our sleeves. Imagine that the type matched, so we didn't have to go to the slow lookup code — what's our new property access time?

One type load, one comparison, an untaken branch, and a property value load. Assuming good locality/predictability and that the object's type happened to already be in the register (because you tend to use it a lot), that's 0+1+1+1 == 3 cycles! Much better.

But how do we get the types to match? Joe Programmer is still looking pretty smug over there.

The trick is to have the slowLookupCode actually modify this snippet of machine code! After slowLookupCode resolves the property in the traditional ways mentioned in previous sections, it fills in a reasonable value for IMPOSSIBLE_TYPE and IMPOSSIBLE_SLOT like they were blank fields in a form. This way, the next time you run this machine code, there's a reasonable chance you won't need to go to slowLookupCode — the types might compare equal, in which case you can perform a simple load instruction to get the property that you're looking for!

This technique of modifying the JIT-compiled code to reflect a probable value is called inline caching: inline, as in "in the emitted code"; caching, as in "cache a probable value in there". This is the basic idea behind inline caches, AKA ICs.

Also, because we emit this snippet for every property-retrieving bytecode, we don't rely on global property access patterns the way the global property cache does. We mechanical mariners are less at the mercy of the gods of locality.

Code generation
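If it helps to see the "fill in the form" trick without the machine code, here's a toy Python model of a single access site, reusing the layout-and-values objects from the sketch above. The names are mine, invented for illustration — a real PIC patches the emitted instructions, not object attributes:

class PropertySite(object):
    """One property-access site, with a patchable 'form' for the fast path."""

    def __init__(self, name):
        self.name = name
        self.cached_layout = None   # plays the role of IMPOSSIBLE_TYPE
        self.cached_slot = None     # plays the role of IMPOSSIBLE_SLOT

    def get(self, obj):
        layout, values = obj
        if layout is self.cached_layout:     # the type guard
            return values[self.cached_slot]  # the fast slot load
        return self.slow_lookup(obj)

    def slow_lookup(self, obj):
        layout, values = obj
        slot = layout[self.name]             # traditional hash-map lookup
        self.cached_layout = layout          # "fill in the form" for next time
        self.cached_slot = slot
        return values[slot]

site = PropertySite('isPointy')
nose = ({'howManyNostrils': 0, 'isPointy': 1}, [2, True])
site.get(nose)  # misses: patches the cache
site.get(nose)  # hits: guard passes, direct slot load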

Where does "P" come from?

Er, right, we're still missing a letter. The "P" in "PIC" stands for polymorphic, which is a fancy sounding word that means "more than one type".

The inline cache demonstrated above can only remember information for a single type — any other type will result in a shapeIsKnown of False, and you'll end up going to the slowLookupCode.

Surveys have shown that the degree of polymorphism (number of different types that actually pass through a snippet during program execution) in real-world code tends to be low, in JavaScript [♦] as well as related languages. However, polymorphism happens, and when it does, we like to be fast at it, too.

So, if our inline cache only supports a single type, what can we do to handle polymorphism? The answer may still be surprising: self-modify the machine code some more!

Before we talk about handling the polymorphic case, let's recap the PIC lifecycle.

The PIC lifecycle

The evolution of the PIC is managed through slowLookupCode, which keeps track of the state of the inline cache in addition to performing a traditional lookup. Once the slow lookup is performed and the PIC evolves, the slowLookupCode jumps back (to the instruction after the slot load) to do the next thing in the method.

When a PIC is born, it has that useless-looking structure you saw in the previous section — it's like a form waiting to be filled out. The industry terminology for this state is pre-monomorphic, meaning that it hasn't even seen one (mono) type pass through it yet.

The first time that inline cache is executed and we reach slowLookupCode we, shockingly, just ignore it. We do this because there is actually a hidden overhead associated with modifying machine code in-place — we want to make sure that you don't incur any of that overhead unless there's an indication you might be running that code a bunch of times. [♣]

The second time we reach the slowLookupCode, the inline cache is modified and the PIC reaches the state called monomorphic. Let's say we saw a type named ElephantTrunk — the PIC can now recognize ElephantTrunk objects and perform the fast slot lookup.

When the PIC is monomorphic and another type, named GiraffeSnout, flows through, we have a problem. There are no more places to put cache entries — we've filled out the whole form. This is where we get tricky: we create a new piece of code memory that contains the new filled-out form, and we modify the original form's jump to go to the new piece of code memory instead of slowLookupCode.

Recognize the pattern? We're making a chain of cache entries: if it's not an ElephantTrunk, jump to the GiraffeSnout test. If the GiraffeSnout fails, then jump to the slowLookupCode. An inline cache that can hit on more than one type is said to be in the polymorphic state.

PIC lifecycle

There's one last stage that PICs can reach, which is the coolest sounding of all: megamorphic. Once we detect that there are a lot of types flowing through a property access site, slowLookupCode stops creating cache entries. The assumption is that you might be passing an insane number of types through this code, in which case additional caching would only slow things down. For a prime example of megamorphism, the 280slides code has an invocation site with 1,437 effective types! [**]
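To recap the whole lifecycle in toy form — again in Python, with invented names and a made-up MAX_ENTRIES cap standing in for the real engine's threshold — the state transitions look roughly like this:

MAX_ENTRIES = 8  # made-up cap; real engines tune this

class PolymorphicSite(object):
    def __init__(self, name):
        self.name = name
        self.entries = []        # chain of (layout, slot) cache entries
        self.times_missed = 0
        self.megamorphic = False

    def get(self, obj):
        layout, values = obj
        for cached_layout, slot in self.entries:  # walk the chain of tests
            if layout is cached_layout:
                return values[slot]               # fast path
        return self.slow_lookup(obj)

    def slow_lookup(self, obj):
        layout, values = obj
        slot = layout[self.name]                  # traditional hash-map lookup
        self.times_missed += 1
        if self.times_missed == 1:
            pass                                  # pre-monomorphic: ignore the first run
        elif not self.megamorphic:
            if len(self.entries) < MAX_ENTRIES:
                self.entries.append((layout, slot))  # monomorphic, then polymorphic
            else:
                self.megamorphic = True           # too many types: stop caching
        return values[slot]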


There's a lot more to discuss, but this introduction is rambling enough as-is — if people express interest, we can dig into further topics in a future entry.

Suffice it to say that JavaScript gets a nice speed boost by enabling PICs: x86 JaegerMonkey with PICs enabled is 25% faster on SunSpider than with them disabled on my machine. [††] If something makes a dynamic language fast, then it is awesome. Therefore, inline caches are awesome. (Modus ponens says so.)



[*] This is as opposed to, say, C++, where in any given snippet of code the == operator could be overloaded.

[†] "Engine" is a sexy term, but it's just a library of support code that you use when language constructs don't easily fall into the translate-it-directly-to-machine-code model used by C.

[‡] Coupon only applies to idealized property access latencies. Competitor coupons gladly accepted. Additional terms and restrictions may apply. See store for details.

[§] Alternative interpreter designs tend to walk over something that looks more like the source text — either an abstract syntax tree or the program tokens themselves. These designs are less common in modern dynamic languages.

[¶] There have historically been implementations that do things like this; notably, the Lisp machines and Jazelle DBX. The JavaScript semantics for ADD are particularly hairy compared to these hosted languages, because getting the value-for-adding out of an object can potentially invoke arbitrary functions, causing re-entrance into JavaScript interpretation.

[#] In the bytecode stream the value isPointy is encoded as an immediate.

[♠] Note that there is actually further overhead in turning the looked-up property into an appropriate JavaScript value. For example, there are additional checks to see whether the looked-up value represents a "getter" function that should be invoked.

[♥] This is, in itself, a small hash-map lookup, but the hash function is quite fast. At the moment it's four dependent ALU operations: right shift, xor, add, and.

[♦] Gregor Richards published a paper in PLDI 2010 that analyzed a set of popular web-based JS applications. The results demonstrated that more than eighty percent of all call sites were monomorphic (had the same function body). I'm speculating that this correlates well to the property accesses we're discussing, though that wasn't explicitly established by the research — in JS, property access PICs are easier to discuss than function invocation PICs. In related languages, like Self, there is no distinction between method invocation and property access.

[♣] "Hidden overhead my foot! Where does it come from?" Today's processors get a little scared when you write to the parts of memory that contain code. Modern processor architecture assumes that the memory you're executing code out of will not be written to frequently, so they don't optimize for it. [‡‡]

[‡‡] The annoying part is that the instruction prefetcher may have buffered up the modified instructions, so you have to check if the modified cache line is in there. Older cache coherency protocols I've read about flush lines past unified caches if they detect a hit in both the instruction and data caches — maybe it's better nowadays.

[**] I'm citing Gregor Richards yet again.

[††] MICs give a nice percentage boost as well, but they're harder to disable at the moment, or I'd have numbers for that too.

Learning Python by example: list comprehensions

My friend, who is starting to learn Python 2.x, asked me what this snippet did:

def collapse(seq):
    # Preserve order.
    uniq = []
    [uniq.append(item) for item in seq if not uniq.count(item)]
    return uniq

This is not a snippet that should be emulated (i.e. it's bad); however, it makes me happy: there are so many things that can be informatively corrected!

What is a list comprehension?

A list comprehension is a special brackety syntax to perform a transform operation with an optional filter clause that always produces a new sequence (list) object as a result. To break it down visually, you perform:

new_range = [i * i          for i in range(5)   if i % 2 == 0]

Which corresponds to:

*result*  = [*transform*    *iteration*         *filter*     ]

The filter piece answers the question, "should this item be transformed?" If the answer is yes, then the transform piece is evaluated and becomes an element in the result. The iteration [*] order is preserved in the result.

Go ahead and figure out what you expect new_range to be in the prior example. You can double check me in the Python shell, but I think it comes out to be:

>>> new_range = [i * i for i in range(5) if i % 2 == 0]
>>> print new_range
[0, 4, 16]

If it still isn't clicking, we can try to make the example less noisy by getting rid of the transform and filter — can you tell what this will produce?

>>> new_range = [i for i in range(5)]

So what's wrong with that first snippet?

As we observed in the previous section, a list comprehension always produces a result list, where the elements of the result list are the transformed elements of the iteration. That means, if there's no filter piece, there are exactly as many result elements as there were iteration elements.

Weird thing number one about the snippet — the list comprehension result is unused. It's created, mind you — list comprehension always create a value, even if you don't care what it is — but it just goes off to oblivion. (In technical terms, it becomes garbage.) When you don't need the result, just use a for loop! This is better:

def collapse(seq):
    """Preserve order."""
    uniq = []
    for item in seq:
        if not uniq.count(item):
            uniq.append(item)
    return uniq

It's two more lines, but it's less weird looking and wasteful. "Better for everybody who reads and runs your code" means you should do it.

Moral of the story: a list comprehension isn't just, "shorthand for a loop." It's shorthand for a transform from an input sequence to an output sequence with an optional filter. If it gets too complex or weird looking, just make a loop. It's not that hard and readers of your code will thank you.

Weird thing number two: the transform, uniq.append(item), produces None as its output value, because the return value from list.append is always None. Therefore, the result, even though it isn't kept anywhere, is a list of None values — one None for every item that makes it past the filter.

Weird thing number three: list.count(item) iterates over every element in the list looking for things that == to item. If you think through the case where you call collapse on an entirely unique sequence, you can tell that the collapse algorithm is O(n²). In fact, it's even worse than it may seem at first glance, because count will keep going all the way to the end of uniq, even if it finds item in the first index of uniq. What the original author really wanted was item not in uniq, which bails out early if it finds item in uniq.
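With that substitution, the corrected loop body reads:

for item in seq:
    if item not in uniq:  # bails out as soon as it sees a match
        uniq.append(item)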

Also worth mentioning for the computer-sciency folk playing along at home: if all elements of the sequence are comparable, you can bring that down to O(n * log n) by using a "shadow" sorted sequence and bisecting to test for membership. If the sequence is hashable you can bring it down to O(n), perhaps by using the set datatype if you are in Python >= 2.3. Note that the common cases of strings, numbers, and tuples (any built-in immutable datatype, for that matter) are hashable.
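For the hashable case, here's a sketch of that O(n) approach — collapse_hashable is my name for it, and the built-in set type assumes Python 2.4 or later (use the sets module before that):

def collapse_hashable(seq):
    """Preserve order; O(n) when all elements are hashable."""
    seen = set()
    uniq = []
    for item in seq:
        if item not in seen:  # O(1) membership test, on average
            seen.add(item)
            uniq.append(item)
    return uniq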

From Python history

It's interesting to note that Python Enhancement Proposal (PEP) #270 considered putting a uniq function into the language distribution, but withdrew it with the following statement:

Removing duplicate elements from a list is a common task, but there are only two reasons I can see for making it a built-in. The first is if it could be done much faster, which isn't the case. The second is if it makes it significantly easier to write code. The introduction of sets.py eliminates this situation since creating a sequence without duplicates is just a matter of choosing a different data structure: a set instead of a list.

Remember that sets can only contain hashable elements (same policy as dictionary keys) and are therefore not suitable for all uniq-ifying tasks, as mentioned in the last paragraph of the previous section.



"Iteration" is just a fancy word for "step through the sequence, element by element, and give that element a name." In our case we're giving the name i.

Inedible vectors of spam: learning non-reified generics by example

I've been playing with the new Java features, only having done minor projects in it since 1.4, and there have been a lot of nice improvements! One thing that made me do a double take, however, was a run-in with non-reified types in Java generics. Luckily, one of my Java-head friends was online and beat me with a stick of enlightenment until I understood what was going on.

In the Java generic system, the collections are represented by two separate, yet equally important concepts — the compile-time generic parameters and the run-time casts who check the collection members. These are their stories.

Dun, dun!

An example: worth a thousand lame intros

The following code is a distilled representation of the situation I encountered:

import java.util.Arrays;
import java.util.List;

public class BadCast {

    static interface Edible {}

    static class Spam implements Edible {}

    List<Spam> canSomeSpam() {
        return Arrays.asList(new Spam(), new Spam(), new Spam());
    }

    /**
     * @note Return type *must* be List<Edible> (because we intend to
     *       implement an interface that requires it).
     */
    List<Edible> castSomeSpam() {
        return (List<Edible>) canSomeSpam();
    }
}

It produced the following error in my IDE:

Cannot cast from List<BadCast.Spam> to List<BadCast.Edible>

At which point I scratched my head and thought, "If all Spams are Edible, [*] why won't it let me cast List<Spam> to List<Edible>? This seems silly."

Potential for error

A slightly expanded example points out where that simple view goes wrong: [†]

import java.util.Arrays;
import java.util.List;
import java.util.Vector;

public class GenericFun implements Runnable {

    static interface Edible {}

    static class Spam implements Edible {

        void decompose() {}
    }

    List<Spam> canSomeSpam() {
        return Arrays.asList(new Spam(), new Spam(), new Spam());
    }

    /**
     * Loves to stick his apples into things.
     */
    static class JohnnyAppleseed {

        static class Apple implements Edible {}

        JohnnyAppleseed(List<Edible> edibles) {
            edibles.add(new Apple());
        }
    }

    public void run() {
        List<Spam> spams = new Vector<Spam>(canSomeSpam());
        List<Edible> edibles = (List<Edible>) spams; // The cast the compiler rejects.
        new JohnnyAppleseed(edibles); // He puts his apple in our spams!
        for (Spam s : spams) {
            s.decompose(); // What does this do when it gets to the apple!?
        }
    }
}

We make a (mutable) collection of spams, but this time, unlike in the previous example, we keep a reference to that collection. Then, when we give it to JohnnyAppleseed, he sticks a damn Apple in there, invalidating the supposed type of spams! (If you still don't see it, note that the object referenced by spams is aliased to edibles.) Then, when we invoke the decompose method on the Apple that is confused with a Spam, what could possibly happen?!

The red pill: there is no runtime-generic-type-parameterization!

Though the above code won't compile, this kind of thing actually is possible, and it's where the implementation of generics starts to leak through the abstraction. To quote Neal Gafter:

Many people are unsatisfied with the restrictions caused by the way generics are implemented in Java. Specifically, they are unhappy that generic type parameters are not reified: they are not available at runtime. Generics are implemented using erasure, in which generic type parameters are simply removed at runtime. That doesn't render generics useless, because you get typechecking at compile-time based on the generic type parameters, and also because the compiler inserts casts in the code (so that you don't have to) based on the type parameters.


The implementation of generics using erasure also causes Java to have unchecked operations, which are operations that would normally check something at runtime but can't do so because not enough information is available. For example, a cast to the type List<String> is an unchecked cast, because the generated code checks that the object is a List but doesn't check whether it is the right kind of list.

At runtime, List<Edible> is no different from List. At compile-time, however, a List<Spam> cannot be cast to a List<Edible>, because the compiler knows what evil things you could then do (like sticking Apples in there).

But if you did stick an Apple in there (like I told you that you can actually do, with evidence to follow shortly), you wouldn't know anything was wrong until you tried to use it like a Spam. This is a clear violation of the "error out early" policy that allows you to localize your debugging. [‡]

In what way does the program error out when you try to use the masquerading Apple like a Spam? Well, when you write:

for (Spam s : spams) {
    s.decompose(); // What does this do when it gets to the apple!?
}

The code the compiler actually generates is:

for (Object s : spams) {
    ((Spam) s).decompose(); // Compiler-inserted cast!
}

At which point it's clear what will happen to the Apple instance — a ClassCastException, because it's not a Spam!

Exception in thread "main" java.lang.ClassCastException: GenericFun$JohnnyAppleseed$Apple cannot be cast to GenericFun$Spam
        at GenericFun.run(GenericFun.java:36)


Okay, so in the first example we didn't keep a reference to the List around, making it acceptable (but bad style) to perform an unchecked cast through the raw List type.

Since, under the hood, the generic type parameters are erased, there's no runtime difference between List<Edible> and plain ol' List. If we just cast to List, it will give us a warning:

Type safety: The expression of type List needs unchecked conversion to conform to List<BadCast.Edible>

The real solution, though, is to just make an unnecessary "defensive copy" when you cross this function boundary; i.e.

List<Edible> castSomeSpam() {
    return new Vector<Edible>(canSomeSpam());
}



[*] Obviously a point of contention among ham connoisseurs.

[†] This doesn't compile, because we're imagining that the cast were possible. Compilers don't respond well when you ask them to imagine things:

$ javac 'Imagine you could cast List<Spam> to List<Edible>!'
javac: invalid flag: Imagine you could cast List<Spam> to List<Edible>!
Usage: javac <options> <source files>
use -help for a list of possible options

[‡] Note that if you must do something like this, you can use a Collections.checkedList to get the early detection. Still, the client is going to be pissed that they tried to put their delicious Ham in there and got an unexpected ClassCastException — probably best to use Collections.unmodifiableList if the reference ownership isn't fully transferred.

Learning Python by example: RC4

One of my friends at work was fakeplaining [*] that he had been on the Python programming mailing list at work for some time, yet still did not know Python. Being hopelessly suggestible in the face of obvious sarcasm, I decided to sacrifice a few hours of sleep to the god of blog. [†]

Note that this entry is aimed at people who already know how to program and have been looking for a tidbit to try in Python.

There are a lot of side notes I've left out for simplicity of explanation; however, I also attempted to make the experience interesting by introducing one of Python's more advanced features, called "generator functions," into the mix. Hopefully it strikes a balance. Please comment if you are utterly confused by generators — I may add an alternate section that allows the reader to avoid them altogether.

You kata wanna...

A number of leaders in the programming community are hot on this trend called "code katas." I'm actually a big fan of this trend, mainly because I've been writing code for no reason, hating on it, throwing it away, and subsequently rewriting it for my entire life, but I now get to call it something cool- and ninja-sounding. Doing such things in my spare time is no longer considered "inexcusable nerdiness;" rather, it's my small endeavor to bring professionalism to the field of software engineering. *cough*

One reason that I really enjoy this new trend is that programmers are posting their own morsel-sized programming problems left and right, giving ample opportunities to explore new languages (and dusty corners of ones you know well) without accidentally becoming BDFL of a seminal framework or utility. [‡]

RC4 Pseudocode

Case in point, I'll use the recent kata from Programming Praxis for this Python exercise, as they provide straightforward pseudocode. Here's the encryption algorithm named RC4, as quoted from Programming Praxis:

The key array K is initialized like this:

for i from 0 to 255
    K[i] := i

j := 0

for i from 0 to 255
    j := (j + K[i] + key[i mod klen]) mod 256
    swap K[i], K[j]

Once the key array is initialized, the pseudo-random byte generator uses a similar calculation to build the stream of random bytes:

i := 0
j := 0

start:
    i := (i + 1) mod 256
    j := (j + K[i]) mod 256
    swap K[i], K[j]
    output K[ (K[i] + K[j]) mod 256 ]
    goto start

The first step in writing our RC4 program is to translate this pseudocode to Python, while the second step is to add a command line front-end for some off-the-cuff implementation experience.

If you'd like to look at the final program ahead of time, grab a copy of my reference implementation.

Porting the initialization

For initialization, we use a provided key to calculate a 256 entry integer sequence. Open a new file called rc4.py and write the following function:

def initialize(key):
    """Produce a 256-entry list based on `key` (a sequence of numbers) as
    the first step in RC4.

    Note: indices in key greater than 255 will be ignored.
    """
    k = range(256)
    j = 0
    for i in range(256):
        j = (j + k[i] + key[i % len(key)]) % 256
        k[i], k[j] = k[j], k[i]
    return k

The simplicity of the translation demonstrates why Python is sometimes called "executable pseudocode". Breaking it down line by line:

  1. The def statement defines a function named initialize that takes a single argument, key.

  2. A documentation string ("docstring" for short). In Python, documentation is associated with a function even at runtime, in contrast to traditional JavaDoc or POD. [§] If the first statement in a function is a string literal, it is used as the docstring for that function. [¶]

  3. The built-in range function returns a list of values. [#] "Built-in" is the terminology used for items that are "available all the time without explicitly importing anything."

     This function also has a two-argument form, range(start, stop); however, in the single argument form, start has a default of 0, so the function invocation returns a list of all the integers in the mathematical interval [0, 256), for a total of 256 values.

  4. There is only one for loop syntax: for [identifier] in [iterable]. Lists are iterable because they contain a sequence of objects.

  5. Finite collections also support the built-in function len([sizable]). The way that numerical arithmetic works and sequence indexing via seq[idx] should be familiar.

  6. Python has an elegant swap capability — what's important to note is that the entire right hand side is evaluated, then assigned to the left hand side.

  7. Python functions optionally return a value. If no return statement is encountered, None is returned, which indicates the absence of a value (docs).
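As a quick sanity check of your translation (assuming rc4.py is importable), the key schedule should come out as a permutation of 0 through 255 for any key:

>>> from rc4 import initialize
>>> k = initialize([ord(c) for c in 'an encryption key!'])
>>> len(k)
256
>>> sorted(k) == range(256)
True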

Generators: functions that pause

Python has a convenient feature, called "generator functions," that allows you to create a stream of values using function-definition syntax. [♠] You can think of generator functions as special functions that can pause and resume, returning a value each time it pauses.

The canonical example illustrates the concept well — use the interactive Python shell to explore how generator functions work, by running the python command without arguments. Make sure the version is python2.3 or above. Once you're in the interactive shell, type the following:

>>> def gen_counter():
...     i = 0
...     while True:
...         yield i
...         i += 1

Note the use of a yield statement, which tells Python that it is a generator function. Calling a generator function creates an iterable generator object, which can then produce a potentially infinite series of values:

>>> counter = gen_counter()
>>> print counter
<generator object gen_counter at 0xb7e3fbe4>
>>> counter.next()
0
>>> counter.next()
1

Note that because the stream of values is potentially infinite and lazily evaluated, there's no concept of length: it's not representative of a container so much as a sequence:

>>> len(counter)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: object of type 'generator' has no len()

Also important to note is that none of the local values in the generator instances are shared; i.e. instantiating a second generator has no effect on the first:

>>> one_counter = gen_counter()
>>> another_counter = gen_counter()
>>> one_counter.next()
0
>>> another_counter.next()
0

Since generators are iterable, you can use them in a for loop just like containers. (But watch out for infinite generators and no break condition in your for loop!)

If you're still confused, the official tutorial's section on generators may help to clarify things. For an in-depth look at generators and why they're awesome, David M. Beazley's PyCon presentation on generators is excellent.

Applying generators to RC4

This dovetails nicely with the second part of the algorithm, which requires a stream of values to XOR against. The generator is nearly a direct translation from the pseudocode, which you may also add to rc4.py:

def gen_random_bytes(k):
    """Yield a pseudo-random stream of bytes based on 256-byte array `k`."""
    i = 0
    j = 0
    while True:
        i = (i + 1) % 256
        j = (j + k[i]) % 256
        k[i], k[j] = k[j], k[i]
        yield k[(k[i] + k[j]) % 256]

Each time .next() is called on the generator instance, the function executes until the first yield statement is encountered, produces that value, and saves the function state for later.

Yes, we could create a big list of pseudo-random values the length of the text, but creating them all at the same time adds O(len(text)) memory overhead, whereas the generator is constant memory overhead (and computationally efficient).
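For instance, itertools.islice will pull a bounded prefix off the infinite stream without materializing anything big (this shell session assumes the functions above live in rc4.py):

>>> from itertools import islice
>>> from rc4 import initialize, gen_random_bytes
>>> k = initialize([ord(c) for c in 'an encryption key!'])
>>> len(list(islice(gen_random_bytes(k), 5)))
5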

Tying it together

Now we just need a function that does the XORing, which teaches us about strings and characters.

def run_rc4(k, text):
    cipher_chars = []
    random_byte_gen = gen_random_bytes(k)
    for char in text:
        byte = ord(char)
        cipher_byte = byte ^ random_byte_gen.next()
        cipher_chars.append(chr(cipher_byte))
    return ''.join(cipher_chars)

Line by line:

  1. An empty list is created to accumulate the cipher characters.

  2. The generator object is instantiated by calling the generator function.

  3. As you can see from the for loop, Python strings are iterable as sequences of characters. Characters in Python are just strings of length one, so you can think of a string iterator as stepping over all of its one-character substrings in order.

  4. To convert a textual character into its character-code numerical value, the built-in ord function is used (docs).

  5. The meat of the algorithm: XOR the textual character with the next pseudo-random byte from the byte stream.

  6. After obtaining the cipher-byte through the XOR, we want to convert back to a textual (character) representation, which we do via the built-in chr function (docs). We then place that character into a sequence of cipher characters. [♥]

  7. To join together characters to form a string, we use the str.join([iterable]) method (docs). [♦] Note that, on some platforms, this is much more efficient than using += (for string concatenation) over and over again. It's a best practice to use this sequence-joining idiom to avoid possible concatenation overhead. [♣]

Front-end fun

If you thought that the pseudo-code translation looked like a piece of cake, you may feel up to a challenge: write a command line interface that:

  1. Asks for an encryption key.

  2. Turns the key to a sequence of integer values and initializes with it.

  3. Continually asks for user-provided text to translate and spits out the corresponding cipher text.

What you need to know
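A couple of building blocks you'll probably reach for — my suggestions, not an exhaustive list: the raw_input built-in for prompting the user, and ord for turning the key text into a sequence of integers:

>>> key = raw_input('Enter an encryption key: ')
Enter an encryption key: an encryption key!
>>> [ord(c) for c in key][:5]
[97, 110, 32, 101, 110]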

If you need help

I wrote a reference implementation and posted it to github — feel free to check it out if you get stuck.

Here's a sample usage of my implementation:

RC4 Utility
The output values are valid Python strings. They may contain escape characters
of the form \xhh to avoid confusing your terminal emulator. Only the first 256
characters of the encryption key are used.

Enter an encryption key: an encryption key!

Enter plain or cipher text: Victory is mine!
Your RC4 text is: '\xecL\xce(\x16\x8e3\xf02!\xcd\xc6\x9a\xc0j\x98'

Enter plain or cipher text: '\xecL\xce(\x16\x8e3\xf02!\xcd\xc6\x9a\xc0j\x98'
Unescaping ciphertext...
Your RC4 text is: 'Victory is mine!'

Once you find that your cipher is reversible, you've probably got it right!

Again, please comment if anything is unclear.



Also known as "fitching." Often performed by those in brillig, slithy toves.


Could God make a blog entry so long and boring that God would proclaim "TL:DR?"


If I had a nickel for every time this happened to me, I would have no nickels.


This allows you to reflect on things and extract their documentation, which comes in handy when you're running in an interactive Python session or spitting out module-level documentation in a command line argument parser.


This same rule applies to classes and modules, which are beyond the scope of this entry.


Python lists are mutable sequences, implemented as vector ADTs under the hood.


The same task can be accomplished with a custom iterator class, but generators are much more concise and more readable — note that the generator that we end up with reads just like the pseudocode!


Note that the language having a built-in join(iterable) method on its string datatype eliminates the need for every iterable type to implement some form of iterable.join(str).


There's a way to use generators here as well, but the list of characters makes things simpler to understand for the moment. If you're feeling confident, convert this function to be a generator function at the end of the exercise and make it work with the rest of the program.


It's bad practice to assume you'll always be running on CPython — there are also JVM and .NET (CLR) interpreters. Remember, thou shalt not claimeth that, "all the world's a VAX!"