August 30, 2008

Using Python identifiers to helpfully indicate protocols

Background

The Stroop Effect indicates that misleading identifiers will be more prone to improper use and will be more subtle when introducing bugs. Because of this phenomenon, I try to make my identifiers' intended usage as clear as possible without over-specifying and/or repeating myself. Additionally, I prefer programming languages which allow for latent typing, which has interesting results: I end up encoding protocol indicators into identifiers. [*]

An Example

If you're (still) reading this, you're most likely a Python programmer. When you find that there exists an identifier chunks, what "kind of thing" do you most expect chunks to be bound to? Since this is a very hand-wavy question, I'll provide some options to clarify:

  1. A sequence (iterable) of chunk-like objects.

  2. A callable that returns chunks.

  3. A mapping with chunk-like values (presumably not chunk-like keys).

  4. A number (which represents a count of chunk-like objects somewhere in the problem domain).

If you've got a number picked out then you can know that I'm the bachelor behind door number one. Since I would identify a lone chunk by the identifier chunk, the identifier says to me, "I'm identifying some (a collection of) chunks." By iterating, I'm asking to hold the chunks one at a time. (Yuck. :)

Callables and Action Words

If you chose door number two and think that it's a callable, then your bachelor is this Django project API, which I am using in this particular (chunky) example. This practice not at all uncommon, however, and another good example of this present within the Python standard library is the BaseHTTPServer with its version_string and date_time_string methods. I might be missing something major; however, I'm going to claim that callables should be identified with action word (verb) prefixes.

To me, it seems well worth it to put these prefixes onto identifiers that are intended to be callables to make their use more readily apparent. To my associative capacities, action words and callables go together like punch and pie. Since it helps clarify usage while writing code, it seems bound to help clarify potential errors while reading code, as in the following contrast:

for chunk in uploaded_file.get_chunks:
    """Looks wrong and feels wrong writing it... action word but no
    invocation.
    """
for chunk in uploaded_file.chunks:
    """Looks fine and feels okay writing it, but uploaded_file.chunks is
    really a method.
    """

Mappings and Bylines

If you chose number three and think that it's a mapping, I'm surprised. There's nothing about the identifier to indicate that there is a key-value relation. Additionally, attempting to iterate over chunks, if it is a mapping with non-chunk keys, will end up iterating over the (non-chunk) keys, like so:

>>> chunks = {1: 'Spam', 2: 'Eggs'}
>>> i = iter(chunks)
>>> repr(i)
'<dictionary-keyiterator object at 0xb7da4ea0>'
>>> i.next()
1

chunks being a mapping makes the code incompatible with the people who interpret the identifier as an iterable of chunks (number two), since the iterator method (__iter__) for a mapping iterates over the keys rather than the chunks. This is the kind of mistake that I dislike the most: a potentially silent one! [†]

To solve this potential ambiguity in my code I use "bylines", as in the following:

>>> chunk_by_health_value = {1: 'Spam', 2: 'Eggs'}
>>> health_values = iter(chunk_by_health_value)

Seeing the fact that the identifier has a _by_healthiness postfix tells me that I'm dealing with a mapping rather than a simple sequence, and the code tends to read in a more straightforward manner: if it has a _by_* postfix, that's what the default iterator will iterate over. In a similar fashion, if you had a mapping of healthiness to sequences of chunks, I would name the identifier chunks_by_healthiness. [‡]

Identifying Numbers

If you chose number four, I see where you're coming from but don't think the same way. Every identifier whose purpose is to reference a numerical count I either prefix with num_ or postfix with _count. This leaves identifiers like chunks free for sequences that I can call len() on, and indicates that chunk_count has a number-like interface.

Compare/Contrast with Hungarian notation

Though my day-to-day usage I find that this approach doesn't really suffer from the Wikipedia-listed criticisms with Hungarian notation.

Unless I sorely misunderstood the distinction, you could classify this system as a broadly applicable Apps Hungarian, since protocols are really all about semantic information (being file-like indicates the purpose of being used like a file!). Really, this guideline just developed from a desire to use identifiers that conform to the some general notions that we have of language and what language describes; I don't tend to think of chunks as something that I can invoke. (Invoke the chunks!)

For objects that span multiple protocols or aren't "inherently" tied to any given protocol, I just use the singular.

Potential Inconsistencies

Strings can be seen as an inconsistency in this schema. Strings really fall into an ordered sequence protocol, but identifiers are in the singular; i.e. "message". One could argue that strings are really more like sequences of symbols in Python and that the identifiers would be more consistent if we used something like: "message_letters". These sort of identifiers just seem impractical, so I'm going to reply that you're really identifying a message object adapted to the string protocol, so it's still a message. Feel free to tear this argument apart. ;)

Footnotes

[*]

A protocol in Python is roughly a "well understood" interface. For example, the Python file protocol starts with an open() ability that returns a file-like handle: this handle has a read() method, a write() method, and usually a seek() method. Anything that acts in a file-like way by using these same "well understood" methods can usually be used in place of a file, so the term protocol is used due to the lack of a de jure interface specification. For a really cool PEP on protocols and adapters, read Alex Martelli and Clark Evans' PEP 246. (Note: This PEP wasn't accepted because function annotations are coming along in Python 3, but it's still a really cool idea. :)

[†]

Perl has lots of silently conforming behavior that drives me nuts.

[‡]

I still haven't figured out a methodological way to scale this appropriately in extreme cases; i.e. a map whose values are also maps becomes something like chunk_by_healthiness_by_importance, which gets ugly real fast.