September 20, 2009

Learning Python by example: RC4

One of my friends at work was fakeplaining [*] that he had been on the Python programming mailing list at work for some time, yet still did not know Python. Being hopelessly suggestible in the face of obvious sarcasm, I decided to sacrifice a few hours of sleep to the god of blog. [†]

Note that this entry is aimed at people who already know how to program and have been looking for a tidbit to try in Python.

There are a lot of side notes I've left out for simplicity of explanation; however, I also attempted to make the experience interesting by introducing one of Python's more advanced features, called "generator functions," into the mix. Hopefully it strikes a balance. Please comment if you are utterly confused by generators — I may add an alternate section that allows the reader to avoid them altogether.

You kata wanna...

A number of leaders in the programming community are hot on this trend called "code katas." I'm actually a big fan of this trend, mainly because I've been writing code for no reason, hating on it, throwing it away, and subsequently rewriting it for my entire life, but I now get to call it something cool- and ninja-sounding. Doing such things in my spare time is no longer considered "inexcusable nerdiness;" rather, it's my small endeavor to bring professionalism to the field of software engineering. *cough*

One reason that I really enjoy this new trend is that programmers are posting their own morsel-sized programming problems left and right, giving ample opportunities to explore new languages (and dusty corners of ones you know well) without accidentally becoming BDFL of a seminal framework or utility. [‡]

RC4 Pseudocode

Case in point, I'll use the recent kata from Programming Praxis for this Python exercise, as they provide straightforward pseudocode. Here's the encryption algorithm named RC4, as quoted from Programming Praxis:

The key array K is initialized like this:

for i from 0 to 255
    K[i] := i

j := 0

for i from 0 to 255
    j := (j + K[i] + key[i mod klen]) mod 256
    swap K[i], K[j]

Once the key array is initialized, the pseudo-random byte generator uses a similar calculation to build the stream of random bytes:

i := 0
j := 0

start:
    i := (i + 1) mod 256
    j := (j + K[i]) mod 256
    swap K[i], K[j]
    output K[ (K[i] + K[j]) mod 256 ]
    goto start

The first step in writing our RC4 program is to translate this pseudocode to Python, while the second step is to add a command line front-end for some off-the-cuff implementation experience.

If you'd like to look at the final program ahead of time, grab a copy of my reference implementation.

Porting the initialization

For initialization, we use a provided key to calculate a 256-entry integer sequence. Open a new file called rc4.py and write the following function:

def initialize(key):
    """Produce a 256-entry list based on `key` (a sequence of numbers) as
    the first step in RC4.
    Note: indices in key greater than 255 will be ignored.
    """
    k = range(256)
    j = 0
    for i in range(256):
        j = (j + k[i] + key[i % len(key)]) % 256
        k[i], k[j] = k[j], k[i]
    return k

The simplicity of the translation demonstrates why Python is sometimes called "executable pseudocode". Breaking it down line by line:

  1. Defines a function named initialize that takes a single argument, key.

  2. A documentation string ("docstring" for short). In Python, documentation is associated with a function even at runtime, in contrast to traditional JavaDoc or POD. [§] If the first statement in a function is a string literal, it is used as the docstring for that function. [¶]

  3. The built-in range function returns a list of values. [#] "Built-in" is the terminology used for items that are "available all the time without explicitly importing anything."

    This function also has a two-argument form, range(start, stop); however, in the single-argument form, start defaults to 0, so the invocation returns a list of all the integers in the mathematical interval [0, 256), for a total of 256 values.

  4. There is only one for loop syntax: for [identifier] in [iterable]. Lists are iterable because they contain a sequence of objects.

  5. Finite collections also support the built-in function len([sizable]). Numerical arithmetic and sequence indexing via seq[idx] work the way you would expect.

  6. Python has an elegant swap capability — what's important to note is that the entire right hand side is evaluated, then assigned to the left hand side. (See the short shell example after this list.)

  7. Python functions optionally return a value. If no return statement is encountered, None is returned, which indicates the absence of a value (docs).
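
For example, here is the swap idiom in isolation (typed at the interactive Python shell, which is introduced below):

>>> a, b = 1, 2
>>> a, b = b, a  # The right side (2, 1) is built first, then unpacked.
>>> a, b
(2, 1)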

Generators: functions that pause

Python has a convenient feature, called "generator functions," that allows you to create a stream of values using function-definition syntax. [♠] You can think of generator functions as special functions that can pause and resume, yielding a value each time they pause.

The canonical example illustrates the concept well. Use the interactive Python shell to explore how generator functions work by running the python command without arguments; make sure the version is Python 2.3 or above. Once you're in the interactive shell, type the following:

>>> def gen_counter():
...     i = 0
...     while True:
...         yield i
...         i += 1
...
>>>

Note the use of a yield statement, which tells Python that it is a generator function. Calling a generator function creates an iterable generator object, which can then produce a potentially infinite series of values:

>>> counter = gen_counter()
>>> print counter
<generator object gen_counter at 0xb7e3fbe4>
>>> counter.next()
0
>>> counter.next()
1

Note that because the stream of values is potentially infinite and lazily evaluated, there's no concept of length: a generator represents a stream of values rather than a container:

>>> len(counter)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: object of type 'generator' has no len()

Also important to note is that none of the local values in the generator instances are shared; i.e. instantiating a second generator has no effect on the first:

>>> one_counter = gen_counter()
>>> another_counter = gen_counter()
>>> one_counter.next()
0
>>> another_counter.next()
0

Since generators are iterable, you can use them in a for loop just like containers. (But watch out for looping over an infinite generator with no break condition!)
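
For example, an explicit break keeps the infinite gen_counter from above in check:

>>> for value in gen_counter():
...     if value > 2:
...         break
...     print value
...
0
1
2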

If you're still confused, the official tutorial's section on generators may help to clarify things. For an in-depth look at generators and why they're awesome, David M. Beazley's PyCon presentation on generators is excellent.

Applying generators to RC4

This dovetails nicely with the second part of the algorithm, which requires a stream of values to XOR against. The generator is nearly a direct translation from the pseudocode, which you may also add to rc4.py:

def gen_random_bytes(k):
    """Yield a pseudo-random stream of bytes based on 256-byte array `k`."""
    i = 0
    j = 0
    while True:
        i = (i + 1) % 256
        j = (j + k[i]) % 256
        k[i], k[j] = k[j], k[i]
        yield k[(k[i] + k[j]) % 256]

Each time .next() is called on the generator instance, the function executes until the next yield statement is encountered, produces that value, and saves the function state for later.

Yes, we could create a big list of pseudo-random values the length of the text, but creating them all at once adds O(len(text)) memory overhead, whereas the generator has constant memory overhead (and is computationally efficient).
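
As a quick smoke test, the first few keystream bytes produced for the key 'Key' should match the widely published RC4 test vector for that key (EB 9F 77 ...):

>>> from rc4 import initialize, gen_random_bytes
>>> k = initialize([ord(char) for char in 'Key'])
>>> stream = gen_random_bytes(k)
>>> ['%02X' % stream.next() for _ in range(3)]
['EB', '9F', '77']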

Tying it together

Now we just need a function that does the XORing, which teaches us about strings and characters.

def run_rc4(k, text):
    cipher_chars = []
    random_byte_gen = gen_random_bytes(k)
    for char in text:
        byte = ord(char)
        cipher_byte = byte ^ random_byte_gen.next()
        cipher_chars.append(chr(cipher_byte))
    return ''.join(cipher_chars)

Line by line:

  1. An empty list is created to accumulate the cipher characters.

  2. The generator object is instantiated by calling the generator function.

  3. As you can see from the for loop, Python strings are iterable as sequences of characters. Characters in Python are just strings of length one, so you can think of a string iterator as stepping over all of its one-character substrings in order.

  4. To convert a textual character into its character-code numerical value, the built-in ord function is used (docs).

  5. The meat of the algorithm: XOR the textual character with the next pseudo-random byte from the byte stream.

  6. After obtaining the cipher-byte through the XOR, we want to convert back to a textual (character) representation, which we do via the built-in chr function (docs). We then place that character into a sequence of cipher characters. [♥]

  7. To join the characters together into a string, we use the str.join([iterable]) method (docs). [♦] Note that, on some Python implementations, this is much more efficient than using += (for string concatenation) over and over again. It's a best practice to use this sequence-joining idiom to avoid possible concatenation overhead. [♣]
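
Since XOR is its own inverse, feeding the cipher text back through with the same key recovers the plain text. Note the fresh call to initialize for the second pass, since gen_random_bytes mutates the key array in place:

>>> from rc4 import initialize, run_rc4
>>> key = [ord(char) for char in 'an encryption key!']
>>> cipher_text = run_rc4(initialize(key), 'Victory is mine!')
>>> run_rc4(initialize(key), cipher_text)
'Victory is mine!'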

Front-end fun

If you thought that the pseudo-code translation looked like a piece of cake, you may feel up to a challenge: write a command line interface that:

  1. Asks for an encryption key.

  2. Turns the key into a sequence of integer values and calls initialize with it.

  3. Continually asks for user-provided text to translate and spits out the corresponding cipher text.

What you need to know

At a minimum: raw_input prompts the user and returns a string, and cipher text full of \xhh escape sequences needs to be round-tripped somehow (Python 2's string_escape codec is one way to unescape it).
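
Here's one minimal sketch of such a front-end (Python 2, with illustrative prompts); my reference implementation differs in the details:

def main():
    key = raw_input('Enter an encryption key: ')
    key_bytes = [ord(char) for char in key]
    while True:
        text = raw_input('Enter plain or cipher text: ')
        if '\\x' in text:
            # Unescape quoted cipher text back into raw bytes.
            text = text.strip("'").decode('string_escape')
        # Re-initialize for every message so the keystream starts fresh.
        print 'Your RC4 text is: %r' % run_rc4(initialize(key_bytes), text)

if __name__ == '__main__':
    main()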

If you need help

I wrote a reference implementation and posted it to github — feel free to check it out if you get stuck.

Here's a sample usage of my implementation:

===========
RC4 Utility
===========
The output values are valid Python strings. They may contain escape characters
of the form \xhh to avoid confusing your terminal emulator. Only the first 256
characters of the encryption key are used.

Enter an encryption key: an encryption key!

Enter plain or cipher text: Victory is mine!
Your RC4 text is: '\xecL\xce(\x16\x8e3\xf02!\xcd\xc6\x9a\xc0j\x98'

Enter plain or cipher text: '\xecL\xce(\x16\x8e3\xf02!\xcd\xc6\x9a\xc0j\x98'
Unescaping ciphertext...
Your RC4 text is: 'Victory is mine!'

Once you find that your cipher is reversible, you've probably got it right!

Again, please comment if anything is unclear.

Footnotes

[*]

Also known as "fitching." Often performed by those in brillig, slithy toves.

[†]

Could God make a blog entry so long and boring that God would proclaim "TL;DR?"

[‡]

If I had a nickel for every time this happened to me, I would have no nickels.

[§]

This allows you to reflect on things and extract their documentation, which comes in handy when you're running in an interactive Python session or spitting out module-level documentation in a command line argument parser.

[¶]

This same rule applies to classes and modules, which are beyond the scope of this entry.

[#]

Python lists are mutable sequences, implemented as vector ADTs under the hood.

[♠]

The same task can be accomplished with a custom iterator class, but generators are much more concise and more readable — note that the generator that we end up with reads just like the pseudocode!

[♥]

Note that the language having a built-in join(iterable) method on its string datatype eliminates the need for every iterable type to implement some form of iterable.join(str).

[♦]

There's a way to use generators here as well, but the list of characters makes things simpler to understand for the moment. If you're feeling confident, convert this function to be a generator function at the end of the exercise and make it work with the rest of the program.

[♣]

It's bad practice to assume you'll always be running on CPython — there are also JVM and .NET (CLR) interpreters. Remember, thou shalt not claimeth that, "all the world's a VAX!"

Deferred object models, plus caveat

I've been doing a bunch of work in JavaScript lately using Dojo's asynchronous I/O object implementation, dojo.Deferred, which is based on Twisted Python's deferred model. Deferreds represent a piece of asynchronous I/O (i.e. data retrieval from a web service) and a chain of callbacks that should be performed when it completes.

Deferreds are an excellent way to keep your head from exploding when dealing with callback-oriented APIs, especially when you need to coordinate the point at which several asynchronous operations occur.

Object Models

Let's say that you happened to be hiding web service dependencies behind a language-specific abstraction barrier. Using deferreds to represent your lazy data retrieval is a very nice abstraction. [*]

/**
 * Constructor.
 */
function RemoteAnimal(uri, kwargs) {
    var self = this;
    self._uri = uri;
    if (!kwargs) {
        return;
    }
    self._children = kwargs.children;
}

/**
 * Simple (synchronous) getter.
 */
RemoteAnimal.prototype.getURI = function() {
    var self = this;
    return self._uri;
};

/**
 * Asynchronous fetcher.
 */
RemoteAnimal.prototype.fetchChildren = function() {
    var self = this;
    if (self._children !== undefined) {
        // Already fetched
        return self._children;
    }
    var deferred = new dojo.Deferred();
    asyncFind('animal', // type
        'Parent = ' + self.getURI(), // query
        function callback(results) {
            var objs = dojo.map(results, self.constructor);
            self._children = objs;
            deferred.callback(self._children);
        },
        function errback(errors) {
            window.console && console.dir(errors);
            throw new Error('Problem finding children');
        });
    return deferred;
};

The above uses an obvious naming convention to differentiate between synchronous and asynchronous getters — it's important to know when a method will be returning a deferred as opposed to the desired value itself!

There is, however, a sneaky issue with RemoteAnimal.prototype.fetchChildren -- can you tell what it is? You have to think pretty hard about what kinds of execution asynchronous I/O allows from the perspective of a caller.

Caveat

The key is to realize that fetchChildren could be called a whole lot of times in a row on a really slow connection (or if the caller is calling it in a for loop, which is more likely to be a problem). This will create n outstanding asyncFinds, triggering n asynchronous XMLHttpRequests, where we really only needed one. To fix this, you need to patch your code so that only one request can be outstanding at any given time, like so:

RemoteAnimal.prototype.fetchChildren = function() {
    var self = this;
    // Property where the first deferred gets stashed.
    var flagProperty = '_childrenDeferred';
    // Property where runner-up deferreds get stashed.
    var othersProperty = '_childrenDeferredOthers';

    if (self._children !== undefined) {
        // Already fetched
        return self._children;
    }
    var deferred = new dojo.Deferred();

    if (self[flagProperty]) {
        // Fetch is outstanding... latch on
        if (!(othersProperty in self)) {
            self[othersProperty] = [];
        }
        self[othersProperty].push(deferred);
        return deferred;
    }

    asyncFind('animal', // type
        'Parent = ' + self.getURI(), // query
        function callback(results) {
            var objs = dojo.map(results, self.constructor);
            self._children = objs;
            deferred.callback(self._children);
            delete self[flagProperty];
            if (!self[othersProperty]) {
                // No runner-ups.
                return;
            }
            // Trigger the runner-up deferreds in the order that they came.
            dojo.forEach(self[othersProperty], function(otherDeferred) {
                otherDeferred.callback(self._children);
            });
            delete self[othersProperty];
        },
        function errback(errors) {
            window.console && console.dir(errors);
            throw new Error('Problem finding children');
        });
    self[flagProperty] = deferred; // Flag as outstanding
    return deferred;
};

There's a non-trivial amount of added complexity there.

The annoying part is that you can't just tack the runner-up deferred triggers onto the original as callbacks. (That would make the construction so much easier!) Between the first and any later invocation of fetchChildren, there may be callbacks added to the original deferred that would arbitrarily change the outcome. As a result, we have to call the other deferreds back directly with the found children from the original fetch.

It's possible to factor out this runner-up-aggregating behavior in a highly reusable way — this code was meant to demonstrate the caveat and how to fix it. Encapsulating the runner-up triggering behavior is left as an exercise to the reader. :-)

Non-Obvious Operations with Deferreds

For educational purposes I've enumerated a few non-obvious operations that you can perform using deferreds. [†]

Barrier (block until all parallelized operations complete)
var deferreds = dojo.map(animals, function(animal) {
    return animal.fetchChildren();
});
var megaDeferred = DeferredList.gatherResults(deferreds);
megaDeferred.addCallback(function process(listOfChildren) {
    // "Oh yeah, there's definitely more than one children in there.."
    // Extra points if you can name the quote!
    assertEquals(listOfChildren.length, animals.length,
        'Must have equal number of animals and lists of children');
    // ...
});
return megaDeferred;

The great thing about this is that you can take embarrassingly parallel I/O operations and perform them all at once without the headaches that accompany threading.

Recursion (keep issuing I/O until a base condition is met) [‡]
RemoteAnimal.getSubtree = function(animals) {
    // Accumulates every animal found in the subtree, roots included.
    var all = animals.slice();

    function flattenAndFilter(listOfLists) {
        var accumulator = [];
        dojo.forEach(listOfLists, function(list) {
            if (!list) { // Filter nulls
                return;
            }
            accumulator.push.apply(accumulator, list);
        });
        // Record this generation before recursing on it.
        all.push.apply(all, accumulator);
        return accumulator;
    }

    function iterate(generation) {
        if (generation.length === 0) {
            return all; // Base case: no children left to fetch.
        }
        var deferreds = dojo.map(generation, function(animal) {
            return animal.fetchChildren();
        });
        var megaDeferred = DeferredList.gatherResults(deferreds);
        megaDeferred.addCallback(flattenAndFilter);
        megaDeferred.addCallback(iterate);
        return megaDeferred;
    }

    // Kick off the recursion with the initial generation.
    return iterate(animals);
};

The key to understanding this example is that returning a deferred from a callback chains that deferred in. In other words, if you return a deferred from a callback, all that deferred's callbacks will be executed before you come back to the original. That way, we can perform recursion without worrying whether or not somebody has added more callbacks onto the initial megaDeferred returned from the getSubtree function — the chained callbacks execute right away, before any callbacks added by the caller.

Footnotes

[*]

It's important to note that you may also want to implement vector operations for efficiency reasons, as in RemoteAnimal.getSubtree above.

[†]

Note that dojo.DeferredList.gatherResults is actually mapped as dojo.DeferredList.prototype.gatherResults in my version of Dojo (1.3.1) -- not sure what's up with that. It's certainly a factory function, so I've mapped it onto the dojo.DeferredList constructor as well.

[‡]

Some potential refactoring is omitted for clarity.

Python 3.1, you shouldn't have!

I highly appreciate the presents that the Python 3.1 team (unwittingly) got me for my birthday this year. This morning I wrote the following snippet to determine the day-frequency of birthday occurrences: [*]

#!/usr/bin/env python3.1

import collections
import datetime
from operator import itemgetter


birthday = {part: int(input('Enter birth {}: '.format(part)))
    for part in 'year month day'.split()}
now = datetime.datetime.now()
days = []
for year in range(birthday['year'], now.year + 1):
    iter_birthday = dict(birthday)
    iter_birthday['year'] = year
    date = datetime.date(**iter_birthday)
    days.append(date.strftime('%A'))

counter = collections.Counter(days)
fmt = '{:15} {:>3}'
for day, frequency in sorted(counter.items(), key=itemgetter(1),
        reverse=True):
    print(fmt.format(day, frequency))
print(fmt.format('Total', sum(counter.values())))

Check out the Python 3.1 goodness: automatically numbered format strings and collections.Counter. It's also a relief that I no longer have to type the iter in iteritems, the x in xrange, or the raw in raw_input — Python is even prettier without the 2.x cruft.
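
To poke at those two features in isolation, a quick Python 3.1 shell session:

>>> from collections import Counter
>>> Counter('abracadabra').most_common(1)
[('a', 5)]
>>> 'The {} is {}'.format('answer', 42)
'The answer is 42'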

Turns out that my original day of birth is still in the lead!

Updates

  1. Added dict comprehensions per astute comments. :-)

  2. Replaced the lambda argument to sorted with the more readable operator.itemgetter, which I didn't realize existed! Thanks @xtian.

Footnotes

[*]

Note that the script assumes your birthday has already occurred this year.

Registry pattern trumps import magic

The other night I saw an interesting tweet in the #Python Twitter channel -- Patrick was looking to harness the dynamism of a language like Python in a way that many Pythonistas would consider magical. [*] Coming from languages with more rigid execution models, it's understandably easy to confuse dynamic and magical. [†]

What is magic?

To quote the jargon file, magic is:

Characteristic of something that works although no one really understands why (this is especially called black magic).

Taken in the context of programming, magic refers to code that works without a straightforward way of determining why it works.

Today's more flexible languages provide the programmer with a significant amount of power at runtime, making the barrier to "accidental magic" much lower. As a programmer who works with dynamic languages, there's an important responsibility to keep in mind: err on the side of caution with the Principle of Least Surprise.

[T]o design usable interfaces, it's best when possible not to design an entire new interface model. Novelty is a barrier to entry; it puts a learning burden on the user, so minimize it.

This principle indicates that using well known design patterns and language idioms is a "best practice" in library design. When you follow that guideline, people will already have an understanding of the interface that you're providing; therefore, they will have one less thing to worry about in leveraging your library to write their code.

Discovery Mechanism Proposals

Patrick is solving a common category of problem: he wants to allow clients to flexibly extend his parsing library's capabilities. For example, if his module knows how to parse xml and yaml files out of the box, programmers using his library should be able to add their own rst and html parser capabilities with ease.

Patrick's proposal is to have the library discover parser classes automatically by scanning a directory of Python source files. If you were to do this, you would use the various utilities in the imp module to load the modules dynamically, then determine the appropriate classes via the inspect module. [‡]

My counter-proposal is this, which is also known as the Registry Pattern, a form of runtime configuration and behavior extension:

Parser library:

class UnknownMimetypeException(Exception): pass
class ParseError(Exception): pass

class IParser:
    """
    Reference interface for parser classes;
    inheritance is not necessary.
    """

    parseable_mimetypes = set()

    def __init__(self, file):
        self.file = file
        self.doctree = None

    def parse(self):
        """
        Parse :ivar:`file` and place the parsed document
        tree into :ivar:`doctree`.
        """
        raise NotImplementedError


class ParserFacade:
    """
    Assumes that there can only be one parser per mimetype.
    :ivar mimetype_to_parser_cls: Storage for parser registry.
    """

    def __init__(self):
        self.mimetype_to_parser_cls = {}

    def register_parser(self, cls):
        for mimetype in cls.parseable_mimetypes:
            self.mimetype_to_parser_cls[mimetype] = cls

        return cls # For use as a decorator.

    def parse(self, file, mimetype):
        """
        Determine the appropriate parser for the mimetype,
        create a parser to parse the file, and perform
        the parsing.

        :return: The parser object.
        """
        try:
            parser_cls = self.mimetype_to_parser_cls[mimetype]
        except KeyError:
            raise UnknownMimetypeException(mimetype)

        parser = parser_cls(file)
        parser.parse() # May raise ParseError
        return parser


default_facade = ParserFacade()
register_parser = default_facade.register_parser
parse = default_facade.parse

Client code:

from parser_lib import register_parser

@register_parser
class SpamParser:
    """
    Parses ``.spam`` files.
    Conforms to implicit parser interface of `parser_lib`.
    """

    parseable_mimetypes = {'text/spam'}

    def __init__(self, file):
        self.file = file
        self.doctree = None

    def parse(self):
        raise NotImplementedError

After the client code executes, the SpamParser will then be available for parsing text/spam mimetype files via parser_lib.parse. [§]
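
Calling code then stays blissfully ignorant of which parser class services a given mimetype. A hypothetical usage sketch (pretend parse is implemented rather than a NotImplementedError stub, and that a menu.spam file exists):

import parser_lib

with open('menu.spam') as file:
    parser = parser_lib.parse(file, 'text/spam')
    print(parser.doctree)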

Here are some of my considerations in determining which of these is the least magical.

Magical Allure

The problem with magic is that it is freaking cool and it drives all the ladies crazy. [¶] As a result, the right hemisphere of your developer-brain yearns for your library clients to read instructions like:

Drag and drop your Python code into my directory — I'll take care of it from there.

That's right, that's all there is to it.

Oh, I know what you're thinking — yes, I'm available — check out parser_lib.PHONE_NUMBER and give me a call sometime.

But, as you envision phone calls from sexy Pythonistas, the left hemisphere of your brain is screaming at the top of its lungs! [#]

Magic leaves the audience wondering how the trick is done, and the analytical side of the programmer mind hates that. It implies that there's a non-trivial abstraction somewhere that does reasonably complex things, but it's unclear where it can be found or how to leverage it differently.

Coders need control and understanding of their code and, by extension, as much control and understanding over third party code as is reasonably possible. Because of this, concise, loosely coupled, and extensible abstractions are always preferred to the imposition of elaborate usage design ideas on clients of your code. It's best to assume that people will want to leverage the functionality your code provides, but that you can't foresee the use cases.

To Reiterate: Dynamic does not Imply Magical

Revisiting my opening point: anecdotal evidence suggests that some members of the static typing camp see us programming-dynamism dynamos as anarchic lovers of programming chaos. Shoot-from-the-hip cowboys, strolling into lawless towns of code, type checking blowing by the vacant sheriff's station as tumbleweeds in the wind. (Enough imagery for you?) With this outlook, it's easy to see why you would start doing all sorts of fancy things when you cross into dynamism town — little do you know, we don't take kindly to that 'round these parts.

In other, more intelligible words, this is a serious misconception — dynamism isn't a free pass to disregard the Principle of Least Surprise — dynamism proponents still want order in the programming universe. Perhaps we value our sanity even more! The key insight is that programming dynamism allows you additional flexibility when it's required or practical to use; more rigid execution models require you to use workarounds, laboriously at times, to achieve a similar degree of flexibility.

As demonstrated by Marius' comment in my last entry, Python coders have a healthy respect for the power of late binding, arbitrary code execution on module import, and seamless platform integration. Accompanying this is a healthy wariness of black magic.

Caveat

It's possible that Patrick was developing a closed-system application (e.g. the Eclipse IDE) and not a library like I was assuming.

In the application case, extensions are typically discovered (though not necessarily activated) by enumerating a directory. When the user activates such an extension, the modules found within it are loaded into the application. This is the commonly found plugin model — it's typically more difficult to wrap the application interface and do configurations at load time, so the application developer must provide an extension hook.

However, the registration pattern should still be preferred to reflection in this case! When the extension is activated and the extension modules load, the registration decorator will be executed along with all the other top-level code in the extension modules.

The extension has the capability to inform the application of the extension's functionality instead of having the application query the plugin for its capabilities. This is a form of loosely coupled cooperative configuration that eases the burden on the application and eliminates the requirement to foresee the needs of the extensions. [♠]

Footnotes

[*]

Note that you can't call it dynamic programming, as that would alias a well known term from the branch of computer science concerned with algorithms. Programming language dynamism it is!

[†]

Much like a dehydrated wanderer in the desert mistakes a shapely pile of sand for an oasis!

[‡]

As of the date of this publishing, Patrick's implementation seems to have gone a bit astray with text processing of Python source files. Prefer dynamic module loading and inspection to text processing source code! Enumerating the reasons this is preferred is beyond the scope of this article.

[§]

In Python < 2.6, before class decorator syntax existed, you could still perform class decoration, just without the decorator syntax. Decorator syntax is just syntactic sugar for "invoke this function and rebind the identifier in this scope", like so:

class SomeClass(object):
    pass
SomeClass = my_class_decorator(SomeClass) # Decorate the class.

[¶]

Perhaps men as well, but I've never seen any TV evidence to justify that conclusion.

[#]

Yes, in this analogy brains have lungs. If you've read this far you're probably not a biologist anyway.

[♠]

Of course, the plugin model always has security implications. Unless you go out of your way to make a sandboxed Python environment for plugins, you need to trust the plugins that you activate — they have the ability to execute arbitrary code.

Monstrous polymorphism and a Python post-import hook decorator

I queue up a few thousand things to do before I get on an airplane: synchronize two-thousand Google Reader entries, load up a bunch of websites I've been meaning to read, and make sure for-fun projects are pulled from their most updated branches.

Then, once I get up in the air, I realize that I don't really want to do 90% of those things crammed into a seat with no elbow room. I end up doing one or two. Along with reading Stevey's Drunken Blog Rant: When Polymorphism Fails, this entry is all the productivity I can claim. The full code repository for this entry is online if you'd like to follow along.

Polymorphism Recap

The word "polymorphic" comes from Greek roots meaning "many shaped." (Or they lied to me in school — one of those.) From a worldly perspective I can see this meaning two things:

As it turns out, both of these concepts apply to the Object-Oriented programming, but the canonical meaning is the latter. [*] As Yegge says:

If you have a bunch of similar objects [...], and they're all supposed to respond differently to some situation, then you add a virtual method to them and implement it differently for each object.

(If you don't know what a virtual method is, the Wikipedia page has an alternate explanation.)

Yegge's Example

Yegge demonstrates that strictly adhering to the principles of polymorphism does not always produce the best design:

Let's say you've got a big installed base of monsters. [...] Now let's say one of your users wants to come in and write a little OpinionatedElf monster. [...] Let's say the OpinionatedElf's sole purpose in life is to proclaim whether it likes other monsters or not. It sits on your shoulder, and whenever you run into, say, an Orc, it screams bloodthirstily: "I hate Orcs!!! Aaaaaargh!!!" (This, incidentally, is how I feel about C++.)

The polymorphic approach to this problem is simple: go through every one of your 150 monsters and add a doesMrOpinionatedElfHateYou() method.

This is a great counterexample — it induces an instant recognition of absurdity.

He then touches on the fact that dynamic languages allow you to do neat things consistent with polymorphism due to the flexibility of the object structure (which is typically just a hash map from identifiers to arbitrary object values):

I guess if you could somehow enumerate all the classes in the system, and check if they derive from Monster, then you could do this whole thing in a few lines of code. In Ruby, I bet you can... but only for the already-loaded classes. It doesn't work for classes still sitting on disk! You could solve that, but then there's the network...

This is clearly impractical, but I figured there was some exploratory value to implementing this challenge in Python. This entry is a small walk-through for code to detect interface conformity by inspection, enumerate the classes in the environment, manipulate classes in place, and add an import hook to manipulate classes loaded from future modules.

The Antagonist

Double entendre intended. :-)

class OpinionatedElf(object):

    is_liked_by_class_name = {
        'OpinionatedElf': True,
        'Orc': False,
        'Troll': False}

    def __init__(self, name):
        self.name = name

    def be_scary(self):
        print("I'm small, ugly, and don't like the cut of your jib!")

    def proclaim_penchance(self, other):
        if not IMonster.is_conforming(other):
            print("I can't even tell what that is!")
            return
        is_liked = other.is_liked_by_elf()
        class_name = other.__class__.__name__
        if is_liked is None:
            print("I'm not sure how I feel about %s" % class_name)
            return
        emotion = 'love' if is_liked else 'hate'
        print('I %s %s!!! Aaaaaargh!!!' % (emotion, class_name))

Determining which Classes are Monsters

First of all, Python doesn't require (nor does it encourage) a rigid type hierarchy. Python's all about the interfaces, which are often implicit. Step one is to create a way to recognize classes that implement the monster interface:

class IMonster(object):

    required_methods = ['be_scary']

    def be_scary(self):
        raise NotImplementedError

    @classmethod
    def is_conforming(cls, object):
        result = all(callable(getattr(object, attr_name, None))
            for attr_name in cls.required_methods)
        logging.debug('%s conforms? %s', object, result)
        return result

assert IMonster.is_conforming(IMonster)

This is a simple little class — there are better third party libraries to use if you want real interface functionality (i.e. more generic conformity testing and Design By Contract).
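
A quick check with a hypothetical Orc class shows the duck-typed conformity test in action:

class Orc(object):
    def be_scary(self):
        print('Grrraaagh!')

assert IMonster.is_conforming(Orc)        # Has a callable be_scary.
assert not IMonster.is_conforming(object) # No be_scary at all.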

Enumerating the Classes in the Environment

All of the modules that have been loaded into the Python environment are placed into sys.modules. By inspecting each of these modules, we can manipulate the classes contained inside if they conform to our monster interface.

import sys

for name, module in sys.modules.iteritems():
    extend_monsters(module)

The extend_monsters function is a bit nuanced, because built-in and frozen modules also live in sys.modules. We skip those, along with abstract base classes, which have trouble with inspect.getmembers():

import imp
import inspect
import logging

def extend_monsters(module, extension_tag='_opinionated_extended'):
    """Extend monsters in the module's top-level namespace so they can
    tell whether they are liked by the :class:`OpinionatedElf`, and tag
    the module with :param:`extension_tag` as a flag.
    Do not attempt to extend already-flagged modules.
    Do not clobber existing methods with the extension method name.

    Warning: swallows exceptional cases where :param:`module`
        is builtin, frozen, or None.
    """
    name = module.__name__ if module else None
    logging.info('Instrumenting module %s', name)
    if not module or imp.is_builtin(name) or imp.is_frozen(name) \
            or getattr(module, extension_tag, False):
        logging.info('Skipping module: %s', name)
        return
    setattr(module, extension_tag, True) # Flag the module as extended.
    classes = inspect.getmembers(module, inspect.isclass)
    for name, cls in classes:
        logging.debug('%s: %s', name, cls)
        try:
            conforming = IMonster.is_conforming(cls)
        except AttributeError, e:
            if '__abstractmethods__' in str(e): # Abstract class.
                continue
            raise
        if not conforming:
            continue
        class_name = cls.__name__
        logging.debug('Instrumenting class %s', class_name)
        attr_name = 'is_liked_by_elf'
        if hasattr(cls, attr_name): # Don't clobber existing methods.
            logging.warn('Method already exists: %s', cls)
            continue
        logging.info('Setting %s on %s', attr_name, class_name)
        setattr(cls, attr_name,
            lambda self: OpinionatedElf.is_liked_by_class_name.get(
                self.__class__.__name__, None))

If we were going to be thorough, we would recurse on the members of the class to see if the class scope was enclosing any more IMonster classes, but you're never really going to find them all: if a module defines a monster class in a function-local scope, there's no good way to get the local class statement and modify it through inspection.

In any case, we're at the point where we can modify all monsters in the top-level namespace of already-loaded modules. What about modules that we have yet to load?

Post-import Hook

There is no standard post-import hook (that I know of) in Python. PEP 369 looks promising, but I couldn't find any record of additional work being done on it. The current import hooks, described in PEP 302, are all pre-import hooks. As such, you have to decorate the __import__ builtin, wrapping the original with your intended post-import functionality, like so: [†]

def import_decorator(old_import, post_processor):
    """
    :param old_import: The import function to decorate, most likely
        ``__builtin__.__import__``.
    :param post_processor: Function of the form
        `post_processor(module) -> module`.
    :return: A new import function, most likely to be assigned to
        ``__builtin__.__import__``.
    """
    assert all(callable(fun) for fun in (old_import, post_processor))
    def new_import(*args, **kwargs):
        module = old_import(*args, **kwargs)
        return post_processor(module)
    return new_import

After which we can replace the old __import__ with its decorated counterpart:

import __builtin__

__builtin__.__import__ = import_decorator(__builtin__.__import__,
    extend_monsters)
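
From then on, freshly imported modules get the same treatment. For example, given a hypothetical monsters_galore module that defines a conforming Troll class:

import monsters_galore # Routed through new_import; Troll gains is_liked_by_elf.

elf = OpinionatedElf('Crumpet')
elf.proclaim_penchance(monsters_galore.Troll())
# Prints: I hate Troll!!! Aaaaaargh!!!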

The Network

Yegge brings up the issue of dynamically generated classes by mentioning network communications, calling to mind examples such as Java's RMI and CORBA. This is a scary place to go, even just conceptualizing. If metaclasses are used, I don't see any difficulty in decorating __new__ with the same kind of inspection we employed above; however, code generation presents potentially insurmountable problems.

Decorating the eval family of functions to modify newly created classes seems possible, but it would be challenging and requires additional research on my part. exec is a keyword/statement, which I would think makes it a hopeless cause.

Footnotes

[*]

In accordance with the former, an object can implement many interfaces.

[†]

This function is actually a generic decorate-with-post-processing closure, but I added the references to import for more explicit documentation.