December 26, 2010

A generator protocol trap

Be careful that the functions you call from your generators don't accidentally raise StopIteration exceptions themselves.

Background

Generators are generally a little tricky in a behind-the-scenes sort of way. Performing a return from a generator is different from a function's return — a generator's return statement actually raises a StopIteration instance.

>>> def foo():
...     yield 1
...     return
>>> i = foo()
>>> next(i)
1
>>> try:
...     next(i)
... except StopIteration as e:
...     print(id(e), e)
23688744

Snag

There's a snag that I run into at times: when you're writing a generator and that generator calls into other functions, be aware that those callees may accidentally raise StopIteration exceptions themselves.

def insidious(whoops=True):
    if whoops:
        raise StopIteration
    else:
        return 2

def generator():
    yield 1
    yield insidious()
    yield 3

if __name__ == '__main__':
    print([i for i in generator()])

This program happily prints out [1].

If you substitute ValueError for StopIteration, you get a traceback, as you'd probably expect. A leaked StopIteration, however, propagates up to the code that moves the generator along, [*] which sees an apparently uncaught StopIteration and terminates the loop.
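If the callee can't be trusted not to leak, one defensive pattern is to catch a stray StopIteration at the call site and turn it into a loud error. A minimal sketch (guarded_generator is my name for it, not anything standard):

def guarded_generator():
    yield 1
    try:
        value = insidious()
    except StopIteration:
        # Surface the leak instead of silently ending iteration.
        raise RuntimeError('callee leaked a StopIteration')
    yield value
    yield 3

With whoops=True, iterating guarded_generator() now produces a traceback instead of a silently truncated [1]. (Python 3.7 eventually adopted essentially this conversion language-wide via PEP 479.)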

JavaScript

The same trap exists in the SpiderMonkey (Firefox) JavaScript dialect:

function insidious() {
    throw StopIteration;
}

function generator() {
    yield 1;
    yield insidious();
    yield 3;
}

print(uneval([i for (i in generator())]));

Which prints [1].

VM guts

As a side note for those interested in the guts of virtual machines, the primordial StopIteration class is known to the virtual machine, regardless of user hackery with the builtins:

>>> def simple_generator():
...     yield 1
...     raise StopIteration
...     yield 3
>>> print([i for i in simple_generator()])
[1]
>>> class FakeStopIteration(Exception): pass
>>> __builtins__.StopIteration = FakeStopIteration
>>> print([i for i in simple_generator()])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <listcomp>
  File "<stdin>", line 3, in simple_generator
__main__.FakeStopIteration

Design

You may look at this issue and think:

The fundamental problem is that uncaught exceptions are raised over any number of function invocations. Generators should have been designed such that you have to stop the generator from the generator invocation.

In an alternate design, a special value (like a StopIteration singleton) might be yield'd from the generator to indicate that it's done.

One issue with that alternative approach is that you're introducing a special-value check into the hot path within the virtual machine; i.e. you'd be slowing down the common process of yielding iteration items. Using an exception only requires the VM to extend the normal exception-handling machinery a bit, and adds no additional overhead to the hot path. That said, I think the measurable significance of this overhead is questionable.

Another issue is that it hurts the lambda abstraction — namely, the ability to factor your generator function into smaller helper functions that also have the ability to terminate the generator. In the absence of a language-understood exception, the programmer has to invent a new way for the helper functions to communicate to the generator that they'd like the generator to terminate.
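To make the trade-off concrete, here's a sketch of what the consumer side might look like under the hypothetical sentinel design (DONE and consume are invented names; this is not how any real Python works):

DONE = object()  # hypothetical "generator is finished" singleton

def consume(gen):
    while True:
        value = next(gen)
        # This check would run on every iteration -- the hot-path
        # cost described above.
        if value is DONE:
            break
        print(value)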

Footnotes

[*]

The FOR_ITER bytecode, which drives for loops in the CPython VM.

Learning Python by example: RC4

One of my friends at work was fakeplaining [*] that he had been on the Python programming mailing list at work for some time, yet still did not know Python. Being hopelessly suggestible in the face of obvious sarcasm, I decided to sacrifice a few hours of sleep to the god of blog. [†]

Note that this entry is aimed at people who already know how to program and have been looking for a tidbit to try in Python.

There are a lot of side notes I've left out for simplicity of explanation; however, I also attempted to make the experience interesting by introducing one of Python's more advanced features, called "generator functions," into the mix. Hopefully it strikes a balance. Please comment if you are utterly confused by generators — I may add an alternate section that allows the reader to avoid them altogether.

You kata wanna...

A number of leaders in the programming community are hot on this trend called "code katas." I'm actually a big fan of this trend, mainly because I've been writing code for no reason, hating on it, throwing it away, and subsequently rewriting it for my entire life, but I now get to call it something cool- and ninja-sounding. Doing such things in my spare time is no longer considered "inexcusable nerdiness;" rather, it's my small endeavor to bring professionalism to the field of software engineering. *cough*

One reason that I really enjoy this new trend is that programmers are posting their own morsel-sized programming problems left and right, giving ample opportunities to explore new languages (and dusty corners of ones you know well) without accidentally becoming BDFL of a seminal framework or utility. [‡]

RC4 Pseudocode

Case in point, I'll use the recent kata from Programming Praxis for this Python exercise, as they provide straightforward pseudocode. Here's the encryption algorithm named RC4, as quoted from Programming Praxis:

The key array K is initialized like this:

for i from 0 to 255
    K[i] := i

j := 0

for i from 0 to 255
    j := (j + K[i] + key[i mod klen]) mod 256
    swap K[i], K[j]

Once the key array is initialized, the pseudo-random byte generator uses a similar calculation to build the stream of random bytes:

i := 0
j := 0

start:
    i := (i + 1) mod 256
    j := (j + K[i]) mod 256
    swap K[i], K[j]
    output K[ (K[i] + K[j]) mod 256 ]
    goto start

The first step in writing our RC4 program is to translate this pseudocode to Python, while the second step is to add a command line front-end for some off-the-cuff implementation experience.

If you'd like to look at the final program ahead of time, grab a copy of my reference implementation.

Porting the initialization

For initialization, we use a provided key to calculate a 256-entry integer sequence. Open a new file called rc4.py and write the following function:

def initialize(key):
    """Produce a 256-entry list based on `key` (a sequence of numbers) as
    the first step in RC4.
    Note: indices in key greater than 255 will be ignored.
    """
    k = range(256)
    j = 0
    for i in range(256):
        j = (j + k[i] + key[i % len(key)]) % 256
        k[i], k[j] = k[j], k[i]
    return k

The simplicity of the translation demonstrates why Python is sometimes called "executable pseudocode". Breaking it down line by line:

  1. Defines a function named initialize that takes a single argument, key.

  2. A documentation string ("docstring" for short). In Python, documentation is associated with a function even at runtime, in contrast to traditional JavaDoc or POD. [§] If the first statement in a function is a string literal, it is used as the docstring for that function. [¶]

  3. The built-in range function returns a list of values. [#] "Built-in" is the terminology used for items that are "available all the time without explicitly importing anything."

    This function also has a two-argument form, range(start, stop); however, in the single-argument form, start defaults to 0, so the invocation returns a list of all the integers in the mathematical interval [0, 256), for a total of 256 values.

  4. There is only one for loop syntax: for [identifier] in [iterable]. Lists are iterable because they contain a sequence of objects.

  5. Finite collections also support the built-in function len([sizable]). The way that numerical arithmetic works and sequence indexing via seq[idx] should be familiar.

  6. Python has an elegant swap capability; what's important to note is that the entire right-hand side is evaluated, then assigned to the left-hand side.

  7. Python functions optionally return a value. If no return statement is encountered, None is returned, which indicates the absence of a value (docs).
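As a quick sanity check, note that initialize only ever swaps elements, so the result should always be a permutation of 0 through 255, no matter the key:

>>> k = initialize([ord(c) for c in 'my key'])
>>> len(k)
256
>>> sorted(k) == range(256)
True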

Generators: functions that pause

Python has a convenient feature, called "generator functions," that allows you to create a stream of values using function-definition syntax. [♠] You can think of a generator function as a special function that can pause and resume, handing back a value each time it pauses.

The canonical example illustrates the concept well — use the interactive Python shell to explore how generator functions work, by running the python command without arguments. Make sure the version is python2.3 or above. Once you're in the interactive shell, type the following:

>>> def gen_counter():
...     i = 0
...     while True:
...         yield i
...         i += 1
...
>>>

Note the use of a yield statement, which tells Python that it is a generator function. Calling a generator function creates an iterable generator object, which can then produce a potentially infinite series of values:

>>> counter = gen_counter()
>>> print counter
<generator object gen_counter at 0xb7e3fbe4>
>>> counter.next()
0
>>> counter.next()
1

Note that because the stream of values is potentially infinite and lazily evaluated, there's no concept of length: a generator isn't so much a container as a stream of values:

>>> len(counter)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: object of type 'generator' has no len()

Also important to note is that none of the local values in the generator instances are shared; i.e. instantiating a second generator has no effect on the first:

>>> one_counter = gen_counter()
>>> another_counter = gen_counter()
>>> one_counter.next()
0
>>> another_counter.next()
0

Since generators are iterable, you can use them in a for loop just like containers. (But watch out for infinite generators and no break condition in your for loop!)
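For example, an explicit break keeps our infinite counter from looping forever:

>>> for value in gen_counter():
...     if value >= 3:
...         break
...     print value
...
0
1
2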

If you're still confused, the official tutorial's section on generators may help to clarify things. For an in-depth look at generators and why they're awesome, David M. Beazley's PyCon presentation on generators is excellent.

Applying generators to RC4

Generators dovetail nicely with the second part of the algorithm, which requires a stream of pseudo-random values to XOR against. The generator is nearly a direct translation from the pseudocode, which you may also add to rc4.py:

def gen_random_bytes(k):
    """Yield a pseudo-random stream of bytes based on 256-byte array `k`."""
    i = 0
    j = 0
    while True:
        i = (i + 1) % 256
        j = (j + k[i]) % 256
        k[i], k[j] = k[j], k[i]
        yield k[(k[i] + k[j]) % 256]

Each time .next() is called on the generator instance, the function executes until the next yield statement is encountered, produces that value, and saves the function's state for later.

Yes, we could create a big list of pseudo-random values the length of the text, but creating them all up front adds O(len(text)) memory overhead, whereas the generator has constant memory overhead (and is computationally efficient).
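If you want to see that on-demand behavior for yourself, the standard library's itertools.islice composes nicely with the generator (assuming the initialize function from earlier):

from itertools import islice

k = initialize([ord(c) for c in 'key'])
# Only five bytes are ever produced; the stream stays lazy.
first_five = list(islice(gen_random_bytes(k), 5))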

Tying it together

Now we just need a function that does the XORing, which teaches us about strings and characters.

def run_rc4(k, text):
    cipher_chars = []
    random_byte_gen = gen_random_bytes(k)
    for char in text:
        byte = ord(char)
        cipher_byte = byte ^ random_byte_gen.next()
        cipher_chars.append(chr(cipher_byte))
    return ''.join(cipher_chars)

Line by line:

  1. An empty list is created to accumulate the cipher characters.

  2. The generator object is instantiated by calling the generator function.

  3. As you can see from the for loop, Python strings are iterable as sequences of characters. Characters in Python are just strings of length one, so you can think of a string iterator as stepping over all of its one-character substrings in order.

  4. To convert a textual character into its character-code numerical value, the built-in ord function is used (docs).

  5. The meat of the algorithm: XOR the textual character with the next pseudo-random byte from the byte stream.

  6. After obtaining the cipher-byte through the XOR, we want to convert back to a textual (character) representation, which we do via the built-in chr function (docs). We then place that character into a sequence of cipher characters. [♥]

  7. To join together characters to form a string, we use the str.join([iterable]) method (docs). [♦] Note that, on some platforms, this is much more efficient than using += (for string concatenation) over and over again. It's a best practice to use this sequence-joining idiom to avoid possible concatenation overhead. [♣]
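With all three pieces in place, you can verify the central property of RC4 in the interactive shell: because the cipher is a XOR against a key-derived byte stream, running it twice with the same key returns the plaintext. (Note the fresh initialize call for each run; run_rc4 mutates the key array as it goes.)

>>> key = [ord(c) for c in 'an encryption key!']
>>> cipher_text = run_rc4(initialize(key), 'Victory is mine!')
>>> run_rc4(initialize(key), cipher_text)
'Victory is mine!'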

Front-end fun

If you thought that the pseudo-code translation looked like a piece of cake, you may feel up to a challenge: write a command line interface that:

  1. Asks for an encryption key.

  2. Turns the key into a sequence of integer values and uses it to initialize the key array.

  3. Continually asks for user-provided text to translate and spits out the corresponding cipher text.

What you need to know
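Reading a line of user input is the only new ingredient: in Python 2, that's the built-in raw_input function. Here's a minimal sketch of one possible front-end; the prompts mirror the sample session below, but the loop structure is my own, and unlike the reference implementation it doesn't unescape previously produced cipher text:

def main():
    key_text = raw_input('Enter an encryption key: ')
    key = [ord(char) for char in key_text]
    while True:
        text = raw_input('Enter plain or cipher text: ')
        # Reinitialize for every message so encryption and decryption
        # start from the same key stream.
        print 'Your RC4 text is: %r' % run_rc4(initialize(key), text)

if __name__ == '__main__':
    main()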

If you need help

I wrote a reference implementation and posted it to github — feel free to check it out if you get stuck.

Here's a sample usage of my implementation:

===========
RC4 Utility
===========
The output values are valid Python strings. They may contain escape characters
of the form \xhh to avoid confusing your terminal emulator. Only the first 256
characters of the encryption key are used.

Enter an encryption key: an encryption key!

Enter plain or cipher text: Victory is mine!
Your RC4 text is: '\xecL\xce(\x16\x8e3\xf02!\xcd\xc6\x9a\xc0j\x98'

Enter plain or cipher text: '\xecL\xce(\x16\x8e3\xf02!\xcd\xc6\x9a\xc0j\x98'
Unescaping ciphertext...
Your RC4 text is: 'Victory is mine!'

Once you find that your cipher is reversible, you've probably got it right!

Again, please comment if anything is unclear.

Footnotes

[*]

Also known as "fitching." Often performed by those in brillig, slithy toves.

[†]

Could God make a blog entry so long and boring that God would proclaim "TL;DR?"

[‡]

If I had a nickel for every time this happened to me, I would have no nickels.

[§]

This allows you to reflect on things and extract their documentation, which comes in handy when you're running in an interactive Python session or spitting out module-level documentation in a command line argument parser.

[¶]

This same rule applies to classes and modules, which are beyond the scope of this entry.

[#]

Python lists are mutable sequences, implemented as vector ADTs under the hood.

[♠]

The same task can be accomplished with a custom iterator class, but generators are much more concise and more readable — note that the generator that we end up with reads just like the pseudocode!

[♥]

Note that the language having a built-in join(iterable) method on its string datatype eliminates the need for every iterable type to implement some form of iterable.join(str).

[♦]

There's a way to use generators here as well, but the list of characters makes things simpler to understand for the moment. If you're feeling confident, convert this function to be a generator function at the end of the exercise and make it work with the rest of the program.

[♣]

It's bad practice to assume you'll always be running on CPython — there are also JVM and .NET (CLR) interpreters. Remember, thou shalt not claimeth that, "all the world's a VAX!"

Generators and resource acquisition/release

One of the neatest things about language lawyers is that they have a keen eye for features of a language that may conflict with each other to produce fail. I, on the other hand, find it fun to stumble around in various languages and analyze interesting cases as I encounter them.

Generators in Python began as a subset of a more general concept: coroutines. Generators are an elegant and concise way to write reasonably sized state machines. For that reason, you'll see them heavily associated with iterators (which are more syntaxerific [*] to write in a language without generators, like Java).

I used to envision generators as little stack frames that were detached from the call stack and placed somewhere in outer space, eating moon cheese and playing with the Django pony, where they lived happily ever after. Surprisingly, that concept didn't match up with reality too well.

PEP 342: Coroutines via Enhanced Generators and PEP 325: Resource-Release Support for Generators are the language lawyer smack-down of my naive view. We used to be unable to perform proper resource acquisition within generators; notably, you couldn't yield from the try suite of a try/finally block, because the only way to guarantee that the finally block would ever run was to step the generator all the way to its StopIteration:

Restriction: A yield statement is not allowed in the try clause of a try/finally construct. The difficulty is that there's no guarantee the generator will ever be resumed, hence no guarantee that the finally block will ever get executed; that's too much a violation of finally's purpose to bear.

  • PEP 255 — Specification: Yield

Consider this little program, poc.py, which tries to do exactly that:

from threading import Lock

lock = Lock()

def gen():
    try:
        lock.acquire()
        yield 'Acquired!'
    finally:
        lock.release()

if __name__ == '__main__':
    g = gen()
    print g.next()

We see the addition of this capability in Python 2.5:

$ python2.4 poc.py
  File "poc.py", line 8
    yield 'Acquired!'
SyntaxError: 'yield' not allowed in a 'try' block with a 'finally' clause

$ python2.5 poc.py
Acquired!

Before Python 2.5, there was no way to tell a generator to die and give up its resources. As PEP 342 describes, Python 2.5 turns generators into simple coroutines, which we can force to release their resources [†] when necessary via the close method:

>>> import poc
>>> g = poc.gen()
>>> h = poc.gen()
>>> g.next()
'Acquired!'
>>> g.close() # Force it to release the resource, or we deadlock.
>>> h.next()
'Acquired!'
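Under the hood, close() raises a GeneratorExit exception at the point where the generator is paused, which is exactly what gives the finally suite its chance to run. A little sketch to make that visible (traced is my own toy example):

def traced():
    try:
        yield 'Acquired!'
    except GeneratorExit:
        print 'GeneratorExit raised at the paused yield'
        raise  # re-raise so close() completes cleanly

g = traced()
g.next()   # run up to the yield
g.close()  # prints the message above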

SimPy

If you're wondering how I came across this combination in day-to-day Python programming, it was largely due to SimPy. I was writing a PCI bus simulation [‡] for fun, to help get a grasp of the SimPy constructs and how they might affect normal object oriented design. [§] I wanted to "acquire" a bus grant, so I analyzed the applicability of with for this resource acquisition.

I went to Stack Overflow and submitted a "feeler" question to see if there was some conventional Python wisdom I was lacking: Is it safe to yield from within a "with" block in Python (and why)? The concept seemed relatively new to those in the discussion; however, the responses are still insightful.

The Lesson

This experience has demonstrated to me that there are two modes of thinking when it comes to Python generators: short-lived and long-lived.

Typical, pre-Python 2.5 generator usage, where generators are really used like generators, lets you gloss over the difference between a regular function and a generator. Really, all that you want to do with this kind of construct is get some values to be used right now. You're not doing anything super-fancy in the generator — it's just nicer syntax to have all of your local variables automatically saved in the generator function than doing it manually in an independent object.

Fancy, SimPy co-routine usage, where generators are managed as coroutines by a central dispatcher, makes a generator take on some more serious object-like semantics. Shared-resource acquisition across coroutine yields should scare you, at least as much as objects that acquire shared resources without releasing them right away. [¶] Perhaps more, seeing as how you're lulled into a state of confidence by understanding short-lived Python generator behaviors.

Footnotes

[*]

This word was invented to make me seem less biased against Java. Oh, also, even more props to Barbara Liskov (Turing Award winner) for the impetus of generator-based iterators in the CLU language.

[†]

We can do other things with the new capabilities, like feed values back into the generators:

def gen():
    feedback = (yield 'First')
    yield feedback

if __name__ == '__main__':
    g = gen()
    assert g.next() == 'First'
    assert g.send('Test') == 'Test'
[‡]

The original PCI bus is approximately the "Hello, World!" of platform architecture, so far as I can tell.

[§]

I still haven't gotten a solid grasp of the design methodology changes. If you want one generator to block until the success/failure of another subroutine, then you have to sleep and trigger wake events, with the possibility of interrupts. Can you tell I've never used a language with continuations before? ;-)

[¶]

Deadlocking on mutually exclusive resources is easy with a cooperatively multitasking dispatcher: one entity (coroutine) is holding the resource and yields, dispatcher picks another one that wants that same resource, performs a non-blocking acquisition, and then you have circular wait with no preemption == deadlock.

Python's generators sure are handy

While rewriting some older code today, I ran across a good example of the clarity inherent in Python's generator expressions. Some time ago, I had written this weirdo construct:

for regex in date_regexes:
    match = regex.search(line)
    if match:
        break
else:
    return
# ... do stuff with the match

The syntax highlighting makes the problem fairly obvious: there's way too much syntax!

First of all, I used the semi-obscure "for-else" construct. For those of you who don't read the Python BNF grammar for fun (as in: the for statement), a quick restatement of the semantics may be useful:

So long as the for loop isn't (prematurely) terminated by a break statement, the code in the else suite gets evaluated. To restate (in the contrapositive): the code in the else suite doesn't get evaluated if the for loop is terminated with a break statement. In the snippet above, that means the early return happens only when no regex produced a match.
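A tiny demonstration of the rule:

>>> for i in range(3):
...     if i == 5:
...         break
... else:
...     print 'no break: else suite runs'
...
no break: else suite runs
>>> for i in range(3):
...     if i == 1:
...         break
... else:
...     print 'no break: else suite runs'
...
>>>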

That's way too much stuff to think about. Generators come to the rescue!

def first(iterable):
    """:return: The first item in `iterable` that evaluates as True,
    or None if there is no such item.
    """
    for item in iterable:
        if item:
            return item
    return None

match = first(regex.search(line) for regex in date_regexes)
if not match:
    return
# ... do stuff with the match

At a glance, this is much shorter and more comprehensible. We pass a generator expression to the first function, which performs a kind of short-circuit evaluation — as soon as a match is found, we stop running regexes (which can be expensive). This is a pretty rockin' solution, so far as I can tell.
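You can watch the short-circuiting happen by wrapping the search in a tracing helper (traced_search and these two regexes are hypothetical, purely for illustration):

import re

date_regexes = [re.compile(r'\d{4}-\d{2}-\d{2}'),
                re.compile(r'\d{2}/\d{2}/\d{4}')]

def traced_search(regex, line):
    print 'running %s' % regex.pattern
    return regex.search(line)

line = 'updated 2010-12-26, rain expected'
match = first(traced_search(regex, line) for regex in date_regexes)
# Only "running \d{4}-\d{2}-\d{2}" is printed; the first regex
# matches, so the second is never evaluated.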

Prior to generator expressions, to do something similar to this we'd have to use a list comprehension, like so:

match = first([regex.search(line) for regex in date_regexes])
if not match:
    return
# ... do stuff with the match

We dislike this because the list comprehension will run all of the regexes, even if an earlier one already found a match. What we really want is the short-circuit evaluation provided by the generator expression fed to first, as shown above. Huzzah!

Edit

Originally I thought that the any built-in returned the first object which evaluated to a boolean True, but it actually returns the boolean True if any of the objects evaluate to True. I've edited to reflect my mistake.