April 29, 2010

Learning Python by example: list comprehensions

My friend, who is starting to learn Python 2.x, asked me what this snippet did:

def collapse(seq):
    # Preserve order.
    uniq = []
    [uniq.append(item) for item in seq if not uniq.count(item)]
    return uniq

This is not a snippet that should be emulated (i.e. it's bad); however, it makes me happy: there are so many things that can be informatively corrected!

What is a list comprehension?

A list comprehension is a special brackety syntax to perform a transform operation with an optional filter clause that always produces a new sequence (list) object as a result. To break it down visually, you perform:

new_range = [i * i          for i in range(5)   if i % 2 == 0]

Which corresponds to:

*result*  = [*transform*    *iteration*         *filter*     ]

The filter piece answers the question, "should this item be transformed?" If the answer is yes, then the transform piece is evaluated and becomes an element in the result. The iteration [*] order is preserved in the result.

Go ahead and figure out what you expect new_range to be in the prior example. You can double check me in the Python shell, but I think it comes out to be:

>>> new_range = [i * i for i in range(5) if i % 2 == 0]
>>> print new_range
[0, 4, 16]

If it still isn't clicking, we can try to make the example less noisy by getting rid of the transform and filter — can you tell what this will produce?

>>> new_range = [i for i in range(5)]

So what's wrong with that first snippet?

As we observed in the previous section, a list comprehension always produces a result list, where the elements of the result list are the transformed elements of the iteration. That means, if there's no filter piece, there are exactly as many result elements as there were iteration elements.

Weird thing number one about the snippet — the list comprehension result is unused. It's created, mind you — list comprehension always create a value, even if you don't care what it is — but it just goes off to oblivion. (In technical terms, it becomes garbage.) When you don't need the result, just use a for loop! This is better:

def colapse(seq):
    """Preserve order."""
    uniq = []
    for item in seq:
        if not uniq.count(item):
            uniq.append(item)
    return uniq

It's two more lines, but it's less weird looking and wasteful. "Better for everybody who reads and runs your code," means you should do it.

Moral of the story: a list comprehension isn't just, "shorthand for a loop." It's shorthand for a transform from an input sequence to an output sequence with an optional filter. If it gets too complex or weird looking, just make a loop. It's not that hard and readers of your code will thank you.

Weird thing number two: the transform, list.append(item), produces None as its output value, because the return value from list.append is always None. Therefore, the result, even though it isn't kept anywhere, is a list of None values of the same length as seq (notice that there's no filter clause).

Weird thing number three: list.count(item) iterates over every element in the list looking for things that == to item. If you think through the case where you call collapse on an entirely unique sequence, you can tell that the collapse algorithm is O(n2). In fact, it's even worse than it may seem at first glance, because count will keep going all the way to the end of uniq, even if it finds item in the first index of uniq. What the original author really wanted was item not in uniq, which bails out early if it finds item in uniq.

Also worth mentioning for the computer-sciency folk playing along at home: if all elements of the sequence are comparable, you can bring that down to O(n * log n) by using a "shadow" sorted sequence and bisecting to test for membership. If the sequence is hashable you can bring it down to O(n), perhaps by using the set datatype if you are in Python >= 2.3. Note that the common cases of strings, numbers, and tuples (any built-in immutable datatype, for that matter) are hashable.

From Python history

It's interesting to note that Python Enhancement Proposal (PEP) #270 considered putting a uniq function into the language distribution, but withdrew it with the following statement:

Removing duplicate elements from a list is a common task, but there are only two reasons I can see for making it a built-in. The first is if it could be done much faster, which isn't the case. The second is if it makes it significantly easier to write code. The introduction of sets.py eliminates this situation since creating a sequence without duplicates is just a matter of choosing a different data structure: a set instead of a list.

Remember that sets can only contain hashable elements (same policy as dictionary keys) and are therefore not suitable for all uniq-ifying tasks, as mentioned in the last paragraph of the previous section.

Footnotes

[*]

"Iteration" is just a fancy word for "step through the sequence, element by element, and give that element a name." In our case we're giving the name i.