Posts Tagged ‘Introductory’

Learning Python by example: list comprehensions

Thursday, April 29th, 2010

My friend, who is starting to learn Python 2.x, asked me what this snippet did:

def collapse(seq):
    # Preserve order.
    uniq = []
    [uniq.append(item) for item in seq if not uniq.count(item)]
    return uniq

This is not a snippet that should be emulated (i.e. it’s bad); however, it makes me happy: there are so many things that can be informatively corrected!

What is a list comprehension?

A list comprehension is a special brackety syntax to perform a transform operation with an optional filter clause that always produces a new sequence (list) object as a result. To break it down visually, you perform:

new_range = [i * i          for i in range(5)   if i % 2 == 0]

Which corresponds to:

*result*  = [*transform*    *iteration*         *filter*     ]

The filter piece answers the question, "should this item be transformed?" If the answer is yes, then the transform piece is evaluated and becomes an element in the result. The iteration [*] order is preserved in the result.

Go ahead and figure out what you expect new_range to be in the prior example. You can double check me in the Python shell, but I think it comes out to be:

>>> new_range = [i * i for i in range(5) if i % 2 == 0]
>>> print new_range
[0, 4, 16]

If it still isn’t clicking, we can try to make the example less noisy by getting rid of the transform and filter — can you tell what this will produce?

>>> new_range = [i for i in range(5)]

So what’s wrong with that first snippet?

As we observed in the previous section, a list comprehension always produces a result list, where the elements of the result list are the transformed elements of the iteration. That means, if there’s no filter piece, there are exactly as many result elements as there were iteration elements.

Weird thing number one about the snippet — the list comprehension result is unused. It’s created, mind you — list comprehension always create a value, even if you don’t care what it is — but it just goes off to oblivion. (In technical terms, it becomes garbage.) When you don’t need the result, just use a for loop! This is better:

def colapse(seq):
    """Preserve order."""
    uniq = []
    for item in seq:
        if not uniq.count(item):
            uniq.append(item)
    return uniq

It’s two more lines, but it’s less weird looking and wasteful. "Better for everybody who reads and runs your code," means you should do it.

Moral of the story: a list comprehension isn’t just, "shorthand for a loop." It’s shorthand for a transform from an input sequence to an output sequence with an optional filter. If it gets too complex or weird looking, just make a loop. It’s not that hard and readers of your code will thank you.

Weird thing number two: the transform, list.append(item), produces None as its output value, because the return value from list.append is always None. Therefore, the result, even though it isn’t kept anywhere, is a list of None values of the same length as seq (notice that there’s no filter clause).

Weird thing number three: list.count(item) iterates over every element in the list looking for things that == to item. If you think through the case where you call collapse on an entirely unique sequence, you can tell that the collapse algorithm is O(n2). In fact, it’s even worse than it may seem at first glance, because count will keep going all the way to the end of uniq, even if it finds item in the first index of uniq. What the original author really wanted was item not in uniq, which bails out early if it finds item in uniq.

Also worth mentioning for the computer-sciency folk playing along at home: if all elements of the sequence are comparable, you can bring that down to O(n * log n) by using a "shadow" sorted sequence and bisecting to test for membership. If the sequence is hashable you can bring it down to O(n), perhaps by using the set datatype if you are in Python >= 2.3. Note that the common cases of strings, numbers, and tuples (any built-in immutable datatype, for that matter) are hashable.

From Python history

It’s interesting to note that Python Enhancement Proposal (PEP) #270 considered putting a uniq function into the language distribution, but withdrew it with the following statement:

Removing duplicate elements from a list is a common task, but there are only two reasons I can see for making it a built-in. The first is if it could be done much faster, which isn’t the case. The second is if it makes it significantly easier to write code. The introduction of sets.py eliminates this situation since creating a sequence without duplicates is just a matter of choosing a different data structure: a set instead of a list.

Remember that sets can only contain hashable elements (same policy as dictionary keys) and are therefore not suitable for all uniq-ifying tasks, as mentioned in the last paragraph of the previous section.

Footnotes

[*] "Iteration" is just a fancy word for "step through the sequence, element by element, and give that element a name." In our case we’re giving the name i.

Inedible vectors of spam: learning non-reified generics by example

Friday, December 4th, 2009

I’ve been playing with the new Java features, only having done minor projects in it since 1.4, and there have been a lot of nice improvements! One thing that made me do a double take, however, was a run-in with non-reified types in Java generics. Luckily, one of my Java-head friends was online and beat me with a stick of enlightenment until I understood what was going on.

In the Java generic system, the collections are represented by two separate, yet equally important concepts — the compile-time generic parameters and the run-time casts who check the collection members. These are their stories.

Dun, dun!

An example: worth a thousand lame intros

The following code is a distilled representation of the situation I encountered:

import java.util.Arrays;
import java.util.List;
 
public class BadCast {
 
    static interface Edible {}
 
    static class Spam implements Edible {}
 
    List<Spam> canSomeSpam() {
        return Arrays.asList(new Spam(), new Spam(), new Spam());
    }
 
    /**
     * @note Return type *must* be List<Editable> (because we intend to
     *       implement an interface that requires it).
     */
    List<Edible> castSomeSpam() {
        return (List<Edible>) canSomeSpam();
    }
 
}

It produced the following error in my IDE:

Cannot cast from List<BadCast.Spam> to List<BadCast.Edible>

At which point I scratched my head and thought, "If all Spams are Edible, [*] why won’t it let me cast List<Spam> to List<Edible>? This seems silly."

Potential for error

A slightly expanded example points out where that simple view goes wrong: [†]

import java.util.Arrays;
import java.util.List;
import java.util.Vector;
 
public class GenericFun implements Runnable {
 
    static interface Edible {}
 
    static class Spam implements Edible {
 
        void decompose() {}
    }
 
    List<Spam> canSomeSpam() {
        return Arrays.asList(new Spam(), new Spam(), new Spam());
    }
 
    /**
     * Loves to stick his apples into things.
     */
    static class JohnnyAppleseed {
 
        static class Apple implements Edible {}
 
        JohnnyAppleseed(List<Edible> edibles) {
            edibles.add(new Apple());
        }
 
    }
 
    @Override
    public void run() {
        List<Spam> spams = new Vector<Spam>(canSomeSpam());
        List<Edible> edibles = (List<Edible>) spams;
        new JohnnyAppleseed(edibles); // He puts his apple in our spams!
        for (Spam s : spams) {
            s.decompose(); // What does this do when it gets to the apple!?
        }
    }
}

We make a (mutable) collection of spams, but this time, unlike in the previous example, we keep a reference to that collection. Then, when we give it to JohnnyAppleseed, he sticks a damn Apple in there, invalidating the supposed type of spams! (If you still don’t see it, note that the object referenced by spams is aliased to edibles.) Then, when we invoke the decompose method on the Apple that is confused with a Spam, what could possibly happen?!

The red pill: there is no runtime-generic-type-parameterization!

Though the above code won’t compile, this kind of thing actually is possible, and it’s where the implementation of generics starts to leak through the abstraction. To quote Neal Gafter:

Many people are unsatisfied with the restrictions caused by the way generics are implemented in Java. Specifically, they are unhappy that generic type parameters are not reified: they are not available at runtime. Generics are implemented using erasure, in which generic type parameters are simply removed at runtime. That doesn’t render generics useless, because you get typechecking at compile-time based on the generic type parameters, and also because the compiler inserts casts in the code (so that you don’t have to) based on the type parameters.

The implementation of generics using erasure also causes Java to have unchecked operations, which are operations that would normally check something at runtime but can’t do so because not enough information is available. For example, a cast to the type List<String> is an unchecked cast, because the generated code checks that the object is a List but doesn’t check whether it is the right kind of list.

At runtime, List<Edible> is no different from List. At compile-time, however, a List<Edible> cannot be cast to from List<Spam>, because it knows what evil things you could then do (like sticking Apples in there).

But if you did stick an Apple in there (like I told you that you can actually do, with evidence to follow shortly), you wouldn’t know anything was wrong until you tried to use it like a Spam. This is a clear violation of the "error out early" policy that allows you to localize your debugging. [‡]

In what way does the program error out when you try to use the masquerading Apple like a Spam? Well, when you write:

for (Spam s : spams) {
    s.decompose(); // What does this do when it gets to the apple!?
}

The code the compiler actually generates is:

for (Object s : spams) {
    ((Spam)s).decompose();
}

At which point it’s clear what will happen to the Apple instance — a ClassCastException, because it’s not a Spam!

Exception in thread "main" java.lang.ClassCastException: GenericFun$JohnnyAppleseed$Apple cannot be cast to GenericFun$Spam
        at GenericFun.run(GenericFun.java:36)

Backpedaling

Okay, so in the first example we didn’t keep a reference to the List around, making it acceptable (but bad style) to perform an unchecked cast:

Since, under the hood, the generic type parameters are erased, there’s no runtime difference between List<Edible> and plain ol’ List. If we just cast to List, it will give us a warning:

Type safety: The expression of type List needs unchecked conversion to conform to List<BadCast.Edible>

The real solution, though, is to just make an unnecessary "defensive copy" when you cross this function boundary; i.e.

List<Edible> castSomeSpam() {
    return new Vector<Edible>(canSomeSpam());
}

Footnotes

[*] Obviously a point of contention among ham connoisseurs.
[†]

This doesn’t compile, because we’re imagining that the cast were possible. Compilers don’t respond well when you ask them to imagine things:

$ javac 'Imagine you could cast List<Spam> to List<Edible>!'
javac: invalid flag: Imagine you could cast List<Spam> to List<Edible>!
Usage: javac <options> <source files>
use -help for a list of possible options
[‡] Note that if you must do something like this, you can use a Collections.checkedList to get the early detection. Still, the client is going to be pissed that they tried to put their delicious Ham in there and got an unexpected ClassCastException — probably best to use Collections.unmodifiableList if the reference ownership isn’t fully transferred.

Learning Python by example: RC4

Sunday, September 20th, 2009

One of my friends at work was fakeplaining [*] that he had been on the Python programming mailing list at work for some time, yet still did not know Python. Being hopelessly suggestible in the face of obvious sarcasm, I decided to sacrifice a few hours of sleep to the god of blog. [†]

Note that this entry is aimed at people who already know how to program and have been looking for a tidbit to try in Python.

There are a lot of side notes I’ve left out for simplicity of explanation; however, I also attempted to make the experience interesting by introducing one of Python’s more advanced features, called "generator functions," into the mix. Hopefully it strikes a balance. Please comment if you are utterly confused by generators — I may add an alternate section that allows the reader to avoid them altogether.

You kata wanna…

A number of leaders in the programming community are hot on this trend called "code katas." I’m actually a big fan of this trend, mainly because I’ve been writing code for no reason, hating on it, throwing it away, and subsequently rewriting it for my entire life, but I now get to call it something cool- and ninja-sounding. Doing such things in my spare time is no longer considered "inexcusable nerdiness;" rather, it’s my small endeavor to bring professionalism to the field of software engineering. *cough*

One reason that I really enjoy this new trend is that programmers are posting their own morsel-sized programming problems left and right, giving ample opportunities to explore new languages (and dusty corners of ones you know well) without accidentally becoming BDFL of a seminal framework or utility. [‡]

RC4 Pseudocode

Case in point, I’ll use the recent kata from Programming Praxis for this Python exercise, as they provide straightforward pseudocode. Here’s the encryption algorithm named RC4, as quoted from Programming Praxis:

The key array K is initialized like this:

for i from 0 to 255
    K[i] := i
 
j := 0
 
for i from 0 to 255
    j := (j + K[i] + key[i mod klen]) mod 256
    swap K[i], K[j]

Once the key array is initialized, the pseudo-random byte generator uses a similar calculation to build the stream of random bytes:

i := 0
j := 0
 
start:
    i := (i + 1) mod 256
    j := (j + K[i]) mod 256
    swap K[i], K[j]
    output K[ (K[i] + K[j]) mod 256 ]
    goto start

The first step in writing our RC4 program is to translate this pseudocode to Python, while the second step is to add a command line front-end for some off-the-cuff implementation experience.

If you’d like to look at the final program ahead of time, grab a copy of my reference implementation.

Porting the initialization

For initialization, we use a provided key to calculate a 256 entry integer sequence. Open a new file called rc4.py and write the following function:

def initialize(key):
    """Produce a 256-entry list based on `key` (a sequence of numbers) as
    the first step in RC4.
    Note: indices in key greater than 255 will be ignored.
    """
    k = range(256)
    j = 0
    for i in range(256):
        j = (j + k[i] + key[i % len(key)]) % 256
        k[i], k[j] = k[j], k[i]
    return k

The simplicity of the translation demonstrates why Python is sometimes called "executable pseudocode". Breaking it down line by line:

  1. defines a function named initialize that takes a single argument, key.
  2. A documentation string ("docstring" for short). In Python, documentation is associated with a function even at runtime, in contrast to traditional JavaDoc or POD. [§] If the first statement in a function is a string literal, it is used as the docstring for that function. [¶]
  1. The built-in range function returns a list of values. [#] "Built-in" is the terminology used for items that are "available all the time without explicitly importing anything."

    This function also has a two-argument form, range(start, stop); however, in the single argument form, start has a default of 0, so the function invocation returns a list of all the integers in the mathematical interval [0, 256), for a total of 256 values.

  1. There is only one for loop syntax: for [identifier] in [iterable]. Lists are iterable because they contain a sequence of objects.
  2. Finite collections also support the built-in function len([sizable]). The way that numerical arithmetic works and sequence indexing via seq[idx] should be familiar.
  3. Python has an elegant swap capability — what’s important to note is that the entire right hand side is evaluated, then assigned to the left hand side.
  4. Python functions optionally return a value. If no return statement is encountered, None is returned, which indicates the absence of a value (docs).

Generators: functions that pause

Python has a convenient feature, called "generator functions," that allows you to create a stream of values using function-definition syntax. [♠] You can think of generator functions as special functions that can pause and resume, returning a value each time it pauses.

The canonical example illustrates the concept well — use the interactive Python shell to explore how generator functions work, by running the python command without arguments. Make sure the version is python2.3 or above. Once you’re in the interactive shell, type the following:

>>> def gen_counter():
...     i = 0
...     while True:
...         yield i
...         i += 1
...
>>>

Note the use of a yield statement, which tells Python that it is a generator function. Calling a generator function creates an iterable generator object, which can then produce a potentially infinite series of values:

>>> counter = gen_counter()
>>> print counter
<generator object gen_counter at 0xb7e3fbe4>
>>> counter.next()
0
>>> counter.next()
1

Note that because the stream of values is potentially infinite and lazily evaluated, there’s no concept of length: it’s not representative of a container so much as a sequence:

>>> len(counter)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: object of type 'generator' has no len()

Also important to note is that none of the local values in the generator instances are shared; i.e. instantiating a second generator has no effect on the first:

>>> one_counter = gen_counter()
>>> another_counter = gen_counter()
>>> one_counter.next()
0
>>> another_counter.next()
0

Since generators are iterable, you can use them in a for loop just like containers. (But watch out for infinite generators and no break condition in your for loop!)

If you’re still confused, the official tutorial‘s section on generators may help to clarify things. For an in-depth look at generators and why they’re awesome, David M. Beazley‘s PyCon presentation on generators is excellent.

Applying generators to RC4

This dove-tails nicely with the second part of the algorithm, which requires a stream of values to XOR against. The generator is nearly a direct translation from the pseudocode, which you may also add to rc4.py:

def gen_random_bytes(k):
    """Yield a pseudo-random stream of bytes based on 256-byte array `k`."""
    i = 0
    j = 0
    while True:
        i = (i + 1) % 256
        j = (j + k[i]) % 256
        k[i], k[j] = k[j], k[i]
        yield k[(k[i] + k[j]) % 256]

Each time .next() is called on the generator instance, the function executes until the first yield statement is encountered, produces that value, and saves the function state for later.

Yes, we could create a big list of pseudo-random values the length of the text, but creating them all at the same time adds O(len(text)) memory overhead, whereas the generator is constant memory overhead (and computationally efficient).

Tying it together

Now we just need a function that does the XORing, which teaches us about strings and characters.

def run_rc4(k, text):
    cipher_chars = []
    random_byte_gen = gen_random_bytes(k)
    for char in text:
        byte = ord(char)
        cipher_byte = byte ^ random_byte_gen.next()
        cipher_chars.append(chr(cipher_byte))
    return ''.join(cipher_chars)

Line by line:

  1. An empty list cipher character accumulator is created.
  2. The generator object is instantiated by calling the generator function.
  3. As you can see from the for loop, Python strings are iterable as sequences of characters. Characters in Python are just strings of length one, so you can think of a string iterator as stepping over all of its one-character substrings in order.
  1. To convert a textual character into its character-code numerical value, the built-in ord function is used (docs).
  2. The meat of the algorithm: XOR the textual character with the next pseudo-random byte from the byte stream.
  3. After obtaining the cipher-byte through the XOR, we want to convert back to a textual (character) representation, which we do via the built-in chr function (docs). We then place that character into a sequence of cipher characters. [♥]
  4. To join together characters to form a string, we use the str.join([iterable]) method (docs). [♦] Note that, on some platforms, this is much more efficient than using += (for string concatenation) over and over again. It’s a best practice to use this sequence-joining idiom to avoid possible concatenation overhead. [♣]

Front-end fun

If you thought that the pseudo-code translation looked like a piece of cake, you may feel up to a challenge: write a command line interface that:

  1. Asks for an encryption key.
  2. Turns the key to a sequence of integer values and initializes with it.
  3. Continually asks for user-provided text to translate and spits out the corresponding cipher text.

What you need to know

  • In Python 2.x print is a statement that is followed by comma-separated values, where each comma turns into a space. The print statement puts a newline at the end of its output by default:

    >>> print 'a', 'b', 'c', 'd'
    a b c d

    To suppress the newline (for example, in a loop), leave on a trailing comma:

    >>> for char in ['a', 'b', 'c', 'd']:
    ...     print char,
    ...
    a b c d

    If there’s something that you can’t do using the above, I’ll refer you to the Python tutorial’s section on fancier output formatting.

  • The built-in function called raw_input (docs) displays a message and then requests user input as a command line, returning the user input as a string. For example:

    name = raw_input('What is your name? ')
    print 'Your name is', name
  • The built-in function called repr (docs) returns the Python representation of a string (or any object) — this is useful for escaping strings with funky, non- printable characters in it, as our cipher algorithm is likely to do. For example, you’ll probably want to use something along the lines of:

    cipher_text = run_rc4(k, text)
    print 'Cipher text:', repr(cipher_text)
  • The character-escaping performed by repr can be reversed using the string decode method: str.decode('string_escape'). For example:

    >>> text_with_escapes = '\\x61\\x62\\x63'
    >>> print text_with_escapes
    \x61\x62\x63
    >>> print text_with_escapes.decode('string_escape')
    abc

    So if you want to allow the user to enter ciphertext at the command prompt, you can read it in and decode it from the escaped format.

  • If you just put the CLI front-end code to execute at the bottom of the rc4.py file, it will work; however, it’s a best practice to test to make sure that rc4.py is the file that’s being executed. To do this, wrap the command line interface code in a test like the following:

    if __name__ == '__main__': # This is the file being executed.
        main()

If you need help

I wrote a reference implementation and posted it to github — feel free to check it out if you get stuck.

Here’s a sample usage of my implementation:

===========
RC4 Utility
===========
The output values are valid Python strings. They may contain escape characters
of the form \xhh to avoid confusing your terminal emulator. Only the first 256
characters of the encryption key are used.
 
Enter an encryption key: an encryption key!
 
Enter plain or cipher text: Victory is mine!
Your RC4 text is: '\xecL\xce(\x16\x8e3\xf02!\xcd\xc6\x9a\xc0j\x98'
 
Enter plain or cipher text: '\xecL\xce(\x16\x8e3\xf02!\xcd\xc6\x9a\xc0j\x98'
Unescaping ciphertext...
Your RC4 text is: 'Victory is mine!'

Once you find that your cipher is reversible, you’ve probably got it right!

Again, please comment if anything is unclear.

Footnotes

[*] Also known as "fitching." Often performed by those in brillig, slithy toves.
[†] Could God make a blog entry so long and boring that God would proclaim "TL:DR?"
[‡] If I had a nickel for every time this happened to me, I would have no nickels.
[§] This allows you to reflect on things and extract their documentation, which comes in handy when you’re running in an interactive Python session or spitting out module-level documentation in a command line argument parser.
[¶] This same rule applies to classes and modules, which are beyond the scope of this entry.
[#] Python lists are mutable sequences, implemented as vector ADTs under the hood.
[♠] The same task can be accomplished with a custom iterator class, but generators are much more concise and more readable — note that the generator that we end up with reads just like the pseudocode!
[♥] Note that the language having a built-in join(iterable) method on its string datatype eliminates the need for every iterable type to implement some form of iterable.join(str).
[♦] There’s a way to use generators here as well, but the list of characters makes things simpler to understand for the moment. If you’re feeling confident, convert this function to be a generator function at the end of the exercise and make it work with the rest of the program.
[♣] It’s bad practice to assume you’ll always be running on CPython — there are also JVM and .NET (CLR) interpreters. Remember, thou shalt not claimeth that, "all the world’s a VAX!"