June 27, 2008

twill is mini-language done right

In my college career I found a book whose genre I can best describe as "computer science philosophy." The Pragmatic Programmer taught me several key philosophical ideals behind real-world code design, which I found to be quite useful in my every-day programming life. [*]

One of the aforementioned "Pragmatic Programming" practices that I had the most difficulty wrapping my head around dealt with the creation of mini-languages close to the problem domain (AKA domain specific languages). To me, this seemed like an unnecessary amount of additional work, when one could just specify an interface that was a close approximation of a mini-language in a semantically rich language. After looking at twill, I realize that it isn't too much work, specifically because you can create a close approximation in a semantically rich language.

Twill has a commands module that exports functions with simple names; for example, "go", "find", "back", etc. This is the more simple approximation of a mini-language that I had anticipated as a time savings. However, Twill also has a shell module which deals with the online execution of these functions and their arguments within a shell-like environment. This creates, as interpreted by the shell, a mini-language!

With Twill's elegant and maintainable (extensible) coding, the command loop is approximately 250 lines. It's not too hard to create a simple file reader to interpret commands line by line from a script and feed them into the command loop one by one, at which point you have a scriptable mini-language!

There's nothing extraordinarily amazing about any single component of this process, but the fact that a mini-language can be created with so little effort kind of stunned me — one of those "lightbulb going off in your head" situations. :)

Letting my mind wander a bit, I doubt that it's always possible to wrap an existing API with simple, domain-specific, script-like commands. If creating a domain-specific mini-language is a tangible possibility, it seems like a design decision you want to know early on.



My CS degree program had a tendency to emphasize provably correct programs, which I don't find great need for in my personal day-to-day work beyond some simple design-by-contract guidelines. This is the topic for another entry, however... perhaps several entries!

How not to do software version numbers

There's a really nice, production-ready package called Pygments available from the Cheese Shop. [*] It's a brilliant package, but the developers made a silly mistake. [†]

The release version numbers progress from 0.9 to 0.10.

Sure, this looks reasonable; especially given that in Python developers typically write their version numbers like the following:

>>> import pygments
>>> if pygments.__version__ < (0, 5):
...     raise ImportError('Pygments too old.')
Traceback (most recent call last):
ImportError: Pygments too old.

This would be fine progressing from the (0, 9) tuple to the (0, 10) tuple. However, if your clients interpret your version number as a float (which isn't fully unreasonable, given that it really is represented as a float in writing) and you progress from version 0.9 to 0.10, your code base just fell back 8 minor releases, as in the following:

>>> import pygments
>>> print pygments.__version__

Additionally, you annoy people like me who get errors like this:

2008-06-17 20:43:58,653 Trac[loader] ERROR: Skipping "TracPygments 0.3dev" ("Pygments&gt;=0.5" not found)

When their setup says this:

cdleary$ pygmentize -V
Pygments version 0.10, (c) 2006-2007 by Georg Brandl .

Anybody who looks at numerical data in a sorted directory is familiar with the results of this phenomenon:

cdleary@gamma:~/example$ ls -l
total 0
-rw-r----- 1 cdleary cdleary 0 2008-06-17 22:00 10
-rw-r----- 1 cdleary cdleary 0 2008-06-17 22:00 11
-rw-r----- 1 cdleary cdleary 0 2008-06-17 22:00 7
-rw-r----- 1 cdleary cdleary 0 2008-06-17 22:00 8
-rw-r----- 1 cdleary cdleary 0 2008-06-17 22:00 9

Which is resolved with a simple prefixing method, as follows:

cdleary@gamma:~/example$ ls -l
total 0
-rw-r----- 1 cdleary cdleary 0 2008-06-17 22:00 07
-rw-r----- 1 cdleary cdleary 0 2008-06-17 22:00 08
-rw-r----- 1 cdleary cdleary 0 2008-06-17 22:00 09
-rw-r----- 1 cdleary cdleary 0 2008-06-17 22:00 10
-rw-r----- 1 cdleary cdleary 0 2008-06-17 22:00 11

Now I have the unfortunate choice of either hacking up my installation to allow the release number, or downgrading.



The Cheese Shop is now called PyPi, but I liked the Monty Python Cheese Shop sketch enough to honor its memory.


Update: I've lightly edited this post to be more tactful — I was slightly peeved at the time of its writing. :( See Problems with Python version parsing for more insight.

World-readable plaintext passwords and toddler murder

What do world readable plaintext passwords and toddler murder have in common? They're both easy.

Oh, right... not to mention they're both bad! I, for one, have accepted our not-evil corporate overlords and have been using GMail since my full-time-student unbecoming. As a result, I was looking at the GMail notifiers available in the Ubuntu repository.

One, called cgmail, was written in Python and had a fairly beautiful codebase. cgmail tied nicely into gconf and had everything going for it. I totally would be using it if it didn't crash ten times during five minutes of configuration. [*]

Another, called gmail-notify, worked perfectly. The source looked like it was written by a Java programmer (you know, making a "main" method for classes and such) who didn't believe in refactoring or PEP8, which made me a little sad. What made me really sad was finding that it stored my password in plaintext in a word readable file, and I had never gotten any warning.

This is a bug on the part of two parties: the MOTU who maintains this package (I'll be submitting a bug report) and the creator of the program. The Gentoo wiki has a page on the ability to install via portage, from which I quote:

elog "Warning: if you check the 'save username and password' option"
elog "your password will be stored in plaintext in ~/.notifier.conf"
elog "with world-readable permissions. If this concerns you, do not"
elog "check the 'save username and password' option."

Ideally this would read: "There is no 'save username and password' option." Just to recap some things:

  1. Do design your program to allow for plugins that tie into keyring managers,

  2. Don't knowingly put some of my most sensitive data where any user on the system can read it, and

  3. Don't, for God's sake, let me install a program that does this without telling me!

I don't mind that Pidgin stores my password in plaintext because it's an Instant Messaging client and it's as careful as possible to use file permissions as protection. gmail-notify used my default umask, which is clearly not good enough, to protect perhaps the most personal data that I have.

You know who you should really feel sorry for, though? Linux-using toddlers.



I'll probably end up contributing to this project.

Python's generators sure are handy

While rewriting some older code today, I ran across a good example of the clarity inherent in Python's generator expressions. Some time ago, I had written this weirdo construct:

for regex in date_regexes:
    match = regex.search(line)
    if match:
# ... do stuff with the match

The syntax highlighting makes the problem fairly obvious: there's way too much syntax!

First of all, I used the semi-obscure "for-else" construct. For those of you who don't read the Python BNF grammar for fun (as in: the for statement), the definition may be useful:

So long as the for loop isn't (prematurely) terminated by a break statement, the code in the else suite gets evaluated. To restate (in the contrapositive): the code in the else suite doesn't get evaluated if the for loop is terminated with a break statement. From this definition we can deduce that if a match was found, I did not want to return early.

That's way too much stuff to think about. Generators come to the rescue!

def first(iterable):
    """:return: The first item in the iterable that evaluates
    as True.
    for item in iterable:
        if item:
            return item
    return None

match = first(regex.search(line) for regex in regexes)
if not match:
# ... do stuff with the match

At a glance, this is much shorter and more comprehensible. We pass a generator expression to the first function, which performs a kind of short-circuit evaluation — as soon as a match is found, we stop running regexes (which can be expensive). This is a pretty rockin' solution, so far as I can tell.

Prior to generator expressions, to do something similar to this we'd have to use a list comprehension, like so:

match = first([regex.search(line) for regex in regexes])
if not match:
# ... do stuff with the match

We dislike this because the list comprehension will run all of the regexes, even if one already found a match. What we really want is the short circuit evaluation provided by generator expressions and the any builtin, as shown above. Huzzah!


Originally I thought that the any built-in returned the first object which evaluated to a boolean True, but it actually returns the boolean True if any of the objects evaluate to True. I've edited to reflect my mistake.

procfs and preload

Two of the cool utilities that I've checked out lately have centered around /proc. /proc is a virtual filesystem mountpoint — the filesystem entities are generated on the fly by the kernel. The filesystem entities provide information about the kernel state and, consequently, the currently running processes. [*]

The utilities are preload and powertop. Both are written in C, though I think that either of them could be written more clearly in Python.


Preload's premise is fascinating. Each shared library that a running process is using via MMIO can be queried via /proc/[pid]/maps, which contains entries of the form:

[vm_start_addr]-[vm_end_addr] [perms] [file_offset] [device_major_id]:[device_minor_id] [inode_num] [file_path]

Preload uses a Markov chain to decide which shared library pages to "pre-load" into the page cache by reading and analyzing these maps over time. Preload's primary goal was to reduce login times by pre-emptively warming up a cold page cache, which it was successful in doing. The catch is that running preload was shown to decrease performance once the cache was warmed up, indicating that it may have just gotten in the way of the native Linux page cache prefetch algorithm. [†]

There are a few other things in /proc that preload uses, like /proc/meminfo, but querying the maps is the meat and potatoes. I was thinking of porting it to Python so that I could understand the structure of the program better, but the fact that the daemon caused a performance decrease over a warm cache turned me off the idea.




A cool side note — all files in /proc have a file size of 0 except kcore and self.


The page_cache_readahead() function in the Linux kernel.