September 20, 2009

Learning Python by example: RC4

One of my friends at work was fakeplaining [*] that he had been on the Python programming mailing list at work for some time, yet still did not know Python. Being hopelessly suggestible in the face of obvious sarcasm, I decided to sacrifice a few hours of sleep to the god of blog. [†]

Note that this entry is aimed at people who already know how to program and have been looking for a tidbit to try in Python.

There are a lot of side notes I've left out for simplicity of explanation; however, I also attempted to make the experience interesting by introducing one of Python's more advanced features, called "generator functions," into the mix. Hopefully it strikes a balance. Please comment if you are utterly confused by generators — I may add an alternate section that allows the reader to avoid them altogether.

You kata wanna...

A number of leaders in the programming community are hot on this trend called "code katas." I'm actually a big fan of this trend, mainly because I've been writing code for no reason, hating on it, throwing it away, and subsequently rewriting it for my entire life, but I now get to call it something cool- and ninja-sounding. Doing such things in my spare time is no longer considered "inexcusable nerdiness;" rather, it's my small endeavor to bring professionalism to the field of software engineering. *cough*

One reason that I really enjoy this new trend is that programmers are posting their own morsel-sized programming problems left and right, giving ample opportunities to explore new languages (and dusty corners of ones you know well) without accidentally becoming BDFL of a seminal framework or utility. [‡]

RC4 Pseudocode

Case in point, I'll use the recent kata from Programming Praxis for this Python exercise, as they provide straightforward pseudocode. Here's the encryption algorithm named RC4, as quoted from Programming Praxis:

The key array K is initialized like this:

for i from 0 to 255
    K[i] := i

j := 0

for i from 0 to 255
    j := (j + K[i] + key[i mod klen]) mod 256
    swap K[i], K[j]

Once the key array is initialized, the pseudo-random byte generator uses a similar calculation to build the stream of random bytes:

i := 0
j := 0

start:
    i := (i + 1) mod 256
    j := (j + K[i]) mod 256
    swap K[i], K[j]
    output K[ (K[i] + K[j]) mod 256 ]
    goto start

The first step in writing our RC4 program is to translate this pseudocode to Python, while the second step is to add a command line front-end for some off-the-cuff implementation experience.

If you'd like to look at the final program ahead of time, grab a copy of my reference implementation.

Porting the initialization

For initialization, we use a provided key to calculate a 256-entry integer sequence. Open a new file called rc4.py and write the following function:

def initialize(key):
    """Produce a 256-entry list based on `key` (a sequence of numbers) as
    the first step in RC4.
    Note: indices in key greater than 255 will be ignored.
    """
    k = range(256)
    j = 0
    for i in range(256):
        j = (j + k[i] + key[i % len(key)]) % 256
        k[i], k[j] = k[j], k[i]
    return k

The simplicity of the translation demonstrates why Python is sometimes called "executable pseudocode". Breaking it down line by line:

  1. The def statement defines a function named initialize that takes a single argument, key.

  2. A documentation string ("docstring" for short). In Python, documentation is associated with a function even at runtime, in contrast to traditional JavaDoc or POD. [§] If the first statement in a function is a string literal, it is used as the docstring for that function. [¶]

  3. The built-in range function returns a list of values. [#] "Built-in" is the terminology used for items that are "available all the time without explicitly importing anything."

    This function also has a two-argument form, range(start, stop); however, in the single argument form, start has a default of 0, so the function invocation returns a list of all the integers in the mathematical interval [0, 256), for a total of 256 values.

  4. There is only one for loop syntax: for [identifier] in [iterable]. Lists are iterable because they contain a sequence of objects.

  5. Finite collections also support the built-in function len([sizable]). The numerical arithmetic and the sequence indexing via seq[idx] should look familiar from other languages.

  6. Python has an elegant swap capability: the entire right-hand side is evaluated first, then assigned to the left-hand side.

  7. Python functions optionally return a value. If no return statement is encountered, None is returned, which indicates the absence of a value (docs).
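If you'd like to poke at those building blocks in isolation, here is a minimal sketch. The function swap_first_two is a hypothetical helper made up for this illustration, and wrapping range in list() is my addition so that the snippet should behave the same on Python 3, where range no longer returns a list.

```python
def swap_first_two(items):
    """Swap the first two entries of `items` in place."""
    # The right-hand side is evaluated in full before anything is assigned.
    items[0], items[1] = items[1], items[0]
    # No return statement: callers get None back.


numbers = list(range(3, 7))      # two-argument form: start=3, stop=7
result = swap_first_two(numbers)

print(swap_first_two.__doc__)    # the docstring is still there at runtime
print(numbers)                   # [4, 3, 5, 6]
print(len(list(range(256))))     # 256
print(result)                    # None
```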

Generators: functions that pause

Python has a convenient feature, called "generator functions," that allows you to create a stream of values using function-definition syntax. [♠] You can think of generator functions as special functions that can pause and resume, returning a value each time they pause.

The canonical example illustrates the concept well. Run the python command without arguments to start the interactive shell, where you can explore how generator functions work. Make sure the version is Python 2.3 or above. Once you're in the interactive shell, type the following:

>>> def gen_counter():
...     i = 0
...     while True:
...         yield i
...         i += 1
...
>>>

Note the use of a yield statement, which tells Python that gen_counter is a generator function. Calling a generator function creates an iterable generator object, which can then produce a potentially infinite series of values:

>>> counter = gen_counter()
>>> print counter
<generator object gen_counter at 0xb7e3fbe4>
>>> counter.next()
0
>>> counter.next()
1

Because the stream of values is potentially infinite and lazily evaluated, a generator has no concept of length. It represents a stream of values rather than a container:

>>> len(counter)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: object of type 'generator' has no len()

Also important to note is that none of the local values in the generator instances are shared; i.e. instantiating a second generator has no effect on the first:

>>> one_counter = gen_counter()
>>> another_counter = gen_counter()
>>> one_counter.next()
0
>>> another_counter.next()
0

Since generators are iterable, you can use them in a for loop just like containers. (But watch out for infinite generators and no break condition in your for loop!)
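Here is a small sketch of two ways to keep an infinite generator under control in a loop: an explicit break, and itertools.islice from the standard library, which cuts the stream off after a fixed number of items. The variable names are just for illustration.

```python
from itertools import islice


def gen_counter():
    i = 0
    while True:
        yield i
        i += 1


# Option 1: bail out of the for loop yourself.
first_five = []
for value in gen_counter():
    if value >= 5:
        break
    first_five.append(value)

# Option 2: let islice stop after five items.
also_five = list(islice(gen_counter(), 5))

print(first_five)   # [0, 1, 2, 3, 4]
print(also_five)    # [0, 1, 2, 3, 4]
```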

If you're still confused, the official tutorial's section on generators may help to clarify things. For an in-depth look at generators and why they're awesome, David M. Beazley's PyCon presentation on generators is excellent.

Applying generators to RC4

This dovetails nicely with the second part of the algorithm, which requires a stream of values to XOR against. The generator is nearly a direct translation from the pseudocode, which you may also add to rc4.py:

def gen_random_bytes(k):
    """Yield a pseudo-random stream of bytes based on 256-byte array `k`."""
    i = 0
    j = 0
    while True:
        i = (i + 1) % 256
        j = (j + k[i]) % 256
        k[i], k[j] = k[j], k[i]
        yield k[(k[i] + k[j]) % 256]

Each time .next() is called on the generator instance, the function executes until the next yield statement is encountered, produces that value, and saves the function state for later.

Yes, we could create a big list of pseudo-random values the length of the text, but creating them all at once adds O(len(text)) memory overhead, whereas the generator needs only constant extra memory and computes each byte on demand.
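To watch the laziness in action, the sketch below pulls a handful of bytes out of the stream with itertools.islice. It repeats the two functions from above so that it runs on its own, with list(range(256)) swapped in as my assumption for Python 3 compatibility; the key "Key" is an arbitrary example.

```python
from itertools import islice


def initialize(key):
    k = list(range(256))
    j = 0
    for i in range(256):
        j = (j + k[i] + key[i % len(key)]) % 256
        k[i], k[j] = k[j], k[i]
    return k


def gen_random_bytes(k):
    i = 0
    j = 0
    while True:
        i = (i + 1) % 256
        j = (j + k[i]) % 256
        k[i], k[j] = k[j], k[i]
        yield k[(k[i] + k[j]) % 256]


key = [ord(c) for c in "Key"]   # arbitrary example key

stream = gen_random_bytes(initialize(key))
first_four = list(islice(stream, 4))   # only four bytes get computed
next_four = list(islice(stream, 4))    # the same generator resumes where it paused

# gen_random_bytes swaps entries of k in place, so a second, independent
# stream should start from a freshly initialized k.
fresh = list(islice(gen_random_bytes(initialize(key)), 8))
print(fresh == first_four + next_four)   # True
```

One consequence worth keeping in mind for later: because the generator mutates k, encrypting and then decrypting each need their own call to initialize.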

Tying it together

Now we just need a function that does the XORing, which teaches us about strings and characters.

def run_rc4(k, text):
    cipher_chars = []
    random_byte_gen = gen_random_bytes(k)
    for char in text:
        byte = ord(char)
        cipher_byte = byte ^ random_byte_gen.next()
        cipher_chars.append(chr(cipher_byte))
    return ''.join(cipher_chars)

Line by line:

  1. An empty list is created to accumulate the cipher characters.

  2. The generator object is instantiated by calling the generator function.

  3. As you can see from the for loop, Python strings are iterable as sequences of characters. Characters in Python are just strings of length one, so you can think of a string iterator as stepping over all of its one-character substrings in order.

  4. To convert a textual character into its character-code numerical value, the built-in ord function is used (docs).

  5. The meat of the algorithm: XOR the textual character with the next pseudo-random byte from the byte stream.

  6. After obtaining the cipher-byte through the XOR, we want to convert back to a textual (character) representation, which we do via the built-in chr function (docs). We then place that character into a sequence of cipher characters. [♥]

  7. To join together characters to form a string, we use the str.join([iterable]) method (docs). [♦] Note that, on some platforms, this is much more efficient than using += (for string concatenation) over and over again. It's a best practice to use this sequence-joining idiom to avoid possible concatenation overhead. [♣]
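The ord / XOR / chr / join pipeline is easier to see without the RC4 machinery around it. In this sketch, xor_with is a hypothetical helper that XORs every character against one fixed mask (0x2A is an arbitrary choice) instead of a pseudo-random stream. It also hints at why the cipher is reversible: XORing twice with the same value gets you back where you started.

```python
def xor_with(text, mask):
    """XOR every character of `text` with the integer `mask` (0-255)."""
    chars = []
    for char in text:
        # character -> number, XOR, number -> character
        chars.append(chr(ord(char) ^ mask))
    return ''.join(chars)


scrambled = xor_with("Victory is mine!", 0x2A)
restored = xor_with(scrambled, 0x2A)

print(repr(scrambled))
print(restored)   # Victory is mine!
```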

Front-end fun

If you thought that the pseudo-code translation looked like a piece of cake, you may feel up to a challenge: write a command line interface that:

  1. Asks for an encryption key.

  2. Turns the key into a sequence of integer values and initializes with it.

  3. Continually asks for user-provided text to translate and spits out the corresponding cipher text.
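As a nudge for step 2 only, here is one possible way to turn a key string into integers. The helper key_to_ints is hypothetical (it is not taken from the reference implementation), and the sketch assumes a plain ASCII key. Prompting for the key is left to you: the built-in for that is raw_input on Python 2 and input on Python 3.

```python
def key_to_ints(key_text):
    """Hypothetical helper: map each character of the key to its code.

    Only the first 256 characters are kept, since initialize never
    looks past index 255 of the key anyway.
    """
    return [ord(char) for char in key_text[:256]]


key = key_to_ints("an encryption key!")
print(key[:3])   # [97, 110, 32]
```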

What you need to know

If you need help

I wrote a reference implementation and posted it to github — feel free to check it out if you get stuck.

Here's a sample usage of my implementation:

===========
RC4 Utility
===========
The output values are valid Python strings. They may contain escape characters
of the form \xhh to avoid confusing your terminal emulator. Only the first 256
characters of the encryption key are used.

Enter an encryption key: an encryption key!

Enter plain or cipher text: Victory is mine!
Your RC4 text is: '\xecL\xce(\x16\x8e3\xf02!\xcd\xc6\x9a\xc0j\x98'

Enter plain or cipher text: '\xecL\xce(\x16\x8e3\xf02!\xcd\xc6\x9a\xc0j\x98'
Unescaping ciphertext...
Your RC4 text is: 'Victory is mine!'

Once you find that your cipher is reversible, you've probably got it right!
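If you want a second opinion beyond reversibility, you can compare against a widely published RC4 test vector: key "Key" and plaintext "Plaintext" should encrypt to bbf316e8d940af0ad3. The sketch below is not my reference implementation. It is a stripped-down variant that works on bytearray instead of str (an assumption made so that it should run on both Python 2.6+ and Python 3), and rc4_bytes and to_hex are hypothetical helper names.

```python
def initialize(key):
    k = list(range(256))
    j = 0
    for i in range(256):
        j = (j + k[i] + key[i % len(key)]) % 256
        k[i], k[j] = k[j], k[i]
    return k


def gen_random_bytes(k):
    i = 0
    j = 0
    while True:
        i = (i + 1) % 256
        j = (j + k[i]) % 256
        k[i], k[j] = k[j], k[i]
        yield k[(k[i] + k[j]) % 256]


def rc4_bytes(key, data):
    """Hypothetical helper: run RC4 over raw bytes rather than str."""
    # A fresh k per call, because the generator shuffles k in place.
    stream = gen_random_bytes(initialize(bytearray(key)))
    return bytes(bytearray(byte ^ next(stream) for byte in bytearray(data)))


def to_hex(data):
    """Hypothetical helper: lowercase hex string for a run of bytes."""
    return ''.join('%02x' % byte for byte in bytearray(data))


cipher = rc4_bytes(b"Key", b"Plaintext")
print(to_hex(cipher))                             # bbf316e8d940af0ad3
print(rc4_bytes(b"Key", cipher) == b"Plaintext")  # True
```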

Again, please comment if anything is unclear.

Footnotes

[*]

Also known as "fitching." Often performed by those in brillig, slithy toves.

[†]

Could God make a blog entry so long and boring that God would proclaim "TL;DR?"

[‡]

If I had a nickel for every time this happened to me, I would have no nickels.

[§]

This allows you to reflect on things and extract their documentation, which comes in handy when you're running in an interactive Python session or spitting out module-level documentation in a command line argument parser.

[¶]

This same rule applies to classes and modules, which are beyond the scope of this entry.

[#]

Python lists are mutable sequences, implemented as vector ADTs under the hood.

[♠]

The same task can be accomplished with a custom iterator class, but generators are much more concise and more readable — note that the generator that we end up with reads just like the pseudocode!

[♥]

There's a way to use generators here as well, but the list of characters makes things simpler to understand for the moment. If you're feeling confident, convert this function to be a generator function at the end of the exercise and make it work with the rest of the program.

[♦]

Note that the language having a built-in join(iterable) method on its string datatype eliminates the need for every iterable type to implement some form of iterable.join(str).

[♣]

It's bad practice to assume you'll always be running on CPython — there are also JVM and .NET (CLR) interpreters. Remember, thou shalt not claimeth that, "all the world's a VAX!"