Archive for the ‘Best Practices’ Category

Registry pattern trumps import magic

Monday, June 1st, 2009

The other night I saw an interesting tweet in the #Python Twitter channel. Patrick was looking to harness the dynamism of a language like Python in a way that many Pythonistas would consider magical. [*] Coming from languages with more rigid execution models, it’s understandably easy to confuse dynamic and magical. [†]

What is magic?

To quote the jargon file, magic is:

Characteristic of something that works although no one really understands why (this is especially called black magic).

Taken in the context of programming, magic refers to code that works without a straightforward way of determining why it works.

Today’s more flexible languages provide the programmer with a significant amount of power at runtime, making the barrier to "accidental magic" much lower. If you work with dynamic languages, you have an important responsibility to keep in mind: err on the side of caution with the Principle of Least Surprise.

[T]o design usable interfaces, it’s best when possible not to design an entire new interface model. Novelty is a barrier to entry; it puts a learning burden on the user, so minimize it.

This principle indicates that using well known design patterns and language idioms is a "best practice" in library design. When you follow that guideline, people will already have an understanding of the interface that you’re providing; therefore, they will have one less thing to worry about in leveraging your library to write their code.

Discovery Mechanism Proposals

Patrick is solving a common category of problem: he wants to allow clients to flexibly extend his parsing library’s capabilities. For example, if his module knows how to parse xml and yaml files out of the box, programmers using his library should be able to add their own rst and html parser capabilities with ease.

Patrick’s proposal is this:

  • Have the programmer place all extension modules that might contain parser classes in a known directory.
  • In a factory class constructor, take a directory listing of the known directory.
  • Import every module present in that listing.
  • Inspect each module imported this way for class members.
  • For each class found, add it to an accumulator if it inherits from a Parser abstract base class provided by the module.

If you were to do this, you would use the various utilities in the imp module to load the modules dynamically, then determine the appropriate classes via the inspect module. [‡]
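
For comparison, here is a rough sketch of what the directory-scanning approach involves. It is written for current Python 3, where importlib has taken over the job of the imp module. The names Parser and discover_parsers are invented for illustration; this is not Patrick's code.

```python
import importlib.util
import inspect
import os


class Parser:
    """Hypothetical abstract base class that extension parsers inherit from."""


def discover_parsers(directory, base_cls=Parser):
    """Import every .py file in `directory` and collect subclasses of base_cls."""
    found = []
    for filename in sorted(os.listdir(directory)):
        if not filename.endswith('.py'):
            continue
        module_name = filename[:-3]
        path = os.path.join(directory, filename)
        spec = importlib.util.spec_from_file_location(module_name, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)  # Runs the module's top-level code.
        for _, obj in inspect.getmembers(module, inspect.isclass):
            # Skip the base class itself and classes merely imported
            # into the module from somewhere else.
            if (issubclass(obj, base_cls) and obj is not base_cls
                    and obj.__module__ == module_name):
                found.append(obj)
    return found
```

Note how much the caller has to take on faith: which directory is scanned, which files are executed, and which classes are picked up are all decided out of sight.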

My counter-proposal is the Registry Pattern, a form of runtime configuration and behavior extension:

  • Have the programmer import a decorator from our module.
  • Let them decorate any class [§] that conforms to the implicit Parser interface.

Parser library:

class UnknownMimetypeException(Exception): pass
class ParseError(Exception): pass
 
class IParser:
 
    """Reference interface for parser classes -- inheritance is not
    necessary."""
 
    parseable_mimetypes = set()
 
    def __init__(self, file):
        self.file = file
        self.doctree = None
 
    def parse(self):
        """Parse :ivar:`file` and place the parsed document tree into
        :ivar:`doctree`.
        """
        raise NotImplementedError
 
 
class ParserFacade:
 
    """Assumes that there can only be one parser per mimetype.
    :ivar mimetype_to_parser_cls: Storage for parser registry.
    """
 
    def __init__(self):
        self.mimetype_to_parser_cls = {}
 
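    def register_parser(self, cls):
        for mimetype in cls.parseable_mimetypes:
            self.mimetype_to_parser_cls[mimetype] = cls
        # Return the class so that this method works as a class decorator;
        # otherwise the decorated name would be rebound to None.
        return cls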
 
    def parse(self, file, mimetype):
        """Determine the appropriate parser for the mimetype, create a
        parser to parse the file, and perform the parsing.
 
        :return: The parser object.
        """
        try:
            parser_cls = self.mimetype_to_parser_cls[mimetype]
        except KeyError:
            raise UnknownMimetypeException(mimetype)
 
        parser = parser_cls(file)
        parser.parse() # May raise ParseError
        return parser
 
 
default_facade = ParserFacade()
register_parser = default_facade.register_parser
parse = default_facade.parse

Client code:

from parser_lib import register_parser
 
@register_parser
class SpamParser:
 
    """Parses ``.spam`` files.
    Conforms to implicit parser interface of `parser_lib`.
    """
 
    parseable_mimetypes = {'text/spam'}
 
    def __init__(self, file):
        self.file = file
        self.doctree = None
 
    def parse(self):
        raise NotImplementedError

After the client code executes, the SpamParser will then be available for parsing text/spam mimetype files via parser_lib.parse.
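
To see the whole round trip in one place, here is a condensed, self-contained sketch of the same idea with a toy parser that actually parses something. LineParser and the text/x-lines mimetype are made up for the example, and the facade is trimmed down from the listing above.

```python
import io


class UnknownMimetypeException(Exception):
    pass


class ParserFacade:
    def __init__(self):
        self.mimetype_to_parser_cls = {}

    def register_parser(self, cls):
        for mimetype in cls.parseable_mimetypes:
            self.mimetype_to_parser_cls[mimetype] = cls
        return cls  # Keep the decorated name bound to the class.

    def parse(self, file, mimetype):
        try:
            parser_cls = self.mimetype_to_parser_cls[mimetype]
        except KeyError:
            raise UnknownMimetypeException(mimetype)
        parser = parser_cls(file)
        parser.parse()
        return parser


facade = ParserFacade()


@facade.register_parser
class LineParser:
    """Hypothetical toy parser: the 'document tree' is a list of lines."""

    parseable_mimetypes = {'text/x-lines'}

    def __init__(self, file):
        self.file = file
        self.doctree = None

    def parse(self):
        self.doctree = [line.rstrip('\n') for line in self.file]


parser = facade.parse(io.StringIO('spam\neggs\n'), 'text/x-lines')
print(parser.doctree)  # ['spam', 'eggs']
```

Because the registry is an ordinary object, a test can build a fresh ParserFacade, register a stub parser, and throw the whole thing away afterwards.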

Here are some of my considerations in determining which of these is the least magical:

  • Which interface is the easiest to explain?
  • Which implementation will be the easiest to explain?
  • Which is more fragile? (Which is most likely to break when "special case uses" crop up?)
  • Which is easier to test?

Magical Allure

The problem with magic is that it is freaking cool and it drives all the ladies crazy. [¶] As a result, the right hemisphere of your developer-brain yearns for your library clients to read instructions like:

Drag and drop your Python code into my directory — I’ll take care of it from there.

That’s right, that’s all there is to it.

Oh, I know what you’re thinking — yes, I’m available — check out parser_lib.PHONE_NUMBER and give me a call sometime.

But, as you envision phone calls from sexy Pythonistas, the left hemisphere of your brain is screaming at the top of its lungs! [#]

Magic leaves the audience wondering how the trick is done, and the analytical side of the programmer mind hates that. It implies that there’s a non-trivial abstraction somewhere that does reasonably complex things, but it’s unclear where it can be found or how to leverage it differently.

Coders need control and understanding of their code and, by extension, as much control and understanding over third party code as is reasonably possible. Because of this, concise, loosely coupled, and extensible abstractions are always preferred to the imposition of elaborate usage design ideas on clients of your code. It’s best to assume that people will want to leverage the functionality your code provides, but that you can’t foresee the use cases.

To Reiterate: Dynamic does not Imply Magical

Revisiting my opening point: anecdotal evidence suggests that some members of the static typing camp see us programming-dynamism dynamos as anarchic lovers of programming chaos. Shoot-from-the-hip cowboys, strolling into lawless towns of code, type checking blowing by the vacant sheriff’s station as tumbleweeds in the wind. (Enough imagery for you?) With this outlook, it’s easy to see why you would start doing all sorts of fancy things when you cross into dynamism town. Little do you know, we don’t take kindly to that ’round these parts.

In other, more intelligible words, this is a serious misconception: dynamism isn’t a free pass to disregard the Principle of Least Surprise, and dynamism proponents still want order in the programming universe. Perhaps we value our sanity even more! The key insight is that programming dynamism does allow you additional flexibility when it’s required or practical to use. More rigid execution models require you to use workarounds, laboriously at times, for a similar degree of flexibility.

As demonstrated by Marius’ comment in my last entry, Python coders have a healthy respect for the power of late binding, arbitrary code execution on module import, and seamless platform integration. Accompanying this is a healthy wariness of black magic.

Caveat

It’s possible that Patrick was developing a closed-system application (e.g. the Eclipse IDE) and not a library like I was assuming.

In the application case, extensions are typically discovered (though not necessarily activated) by enumerating a directory. When the user activates such an extension, the modules found within it are loaded into the application. This is the commonly found plugin model — it’s typically more difficult to wrap the application interface and do configurations at load time, so the application developer must provide an extension hook.

However, the registration pattern should still be preferred to reflection in this case! When the extension is activated and the extension modules load, the registration decorator will be executed along with all the other top-level code in the extension modules.

The extension has the capability to inform the application of the extension’s functionality instead of having the application query the plugin for its capabilities. This is a form of loosely coupled cooperative configuration that eases the burden on the application and eliminates the requirement to foresee needs of the extensions. [♠]
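
A minimal sketch of how the two ideas combine in the application case: the application loads an extension module by path, and the module's own top-level code does the registering. The host_app module, activate_extension, and the throwaway spam_extension.py file are all invented for the example; a real application would ship its extension API as an ordinary importable package.

```python
import importlib.util
import os
import sys
import tempfile
import types

# Stand-in for the application's extension API module ("host_app" is made up).
host_app = types.ModuleType('host_app')
host_app.registry = {}


def register_parser(cls):
    for mimetype in cls.parseable_mimetypes:
        host_app.registry[mimetype] = cls
    return cls


host_app.register_parser = register_parser
sys.modules['host_app'] = host_app  # Make "from host_app import ..." work.


def activate_extension(path):
    """Load one extension module; its top-level code registers itself."""
    name = os.path.splitext(os.path.basename(path))[0]
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


# Write a throwaway extension to disk so the sketch runs end to end.
plugin_source = '''
from host_app import register_parser

@register_parser
class SpamParser:
    parseable_mimetypes = {'text/spam'}
'''
extension_dir = tempfile.mkdtemp()
plugin_path = os.path.join(extension_dir, 'spam_extension.py')
with open(plugin_path, 'w') as f:
    f.write(plugin_source)

activate_extension(plugin_path)
print(sorted(host_app.registry))  # ['text/spam']
```

The application never inspects the module it loaded. It only needs to know how to import it.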

Footnotes

[*] Note that you can’t call it dynamic programming, as that would alias a well known term from the branch of computer science concerned with algorithms. Programming language dynamism it is!
[†] Much like a dehydrated wanderer in the desert mistakes a shapely pile of sand for an oasis!
[‡] As of the date of this publishing, Patrick’s implementation seems to have gone a bit astray with text processing of Python source files. Prefer dynamic module loading and inspection to text processing source code! Enumerating the reasons this is preferred is beyond the scope of this article.
[§] In Python versions before 2.6, which lack class decorator syntax, you can still perform class decoration by hand. Decorator syntax is just syntactic sugar for "call this function and rebind the identifier in this scope", like so:

class SomeClass(object):
    pass
SomeClass = my_class_decorator(SomeClass) # Decorate the class.
[¶] Perhaps men as well, but I’ve never seen any TV evidence to justify that conclusion.
[#] Yes, in this analogy brains have lungs. If you’ve read this far you’re probably not a biologist anyway.
[♠] Of course, the plugin model always has security implications. Unless you go out of your way to make a sandboxed Python environment for plugins, you need to trust the plugins that you activate — they have the ability to execute arbitrary code.

Eliminating web service dependencies with a language-specific abstraction barrier

Thursday, April 9th, 2009

Hyperbolic analogy: Saying, “You shouldn’t need to wrap the web service interface, because it already provides an API,” is like saying, “You shouldn’t need different programming languages, because they’re all Turing complete.”

Web services tend to deliver raw data payloads from a flat interface and thus lack the usability of native language APIs. Inevitably, when you program RPC-like interfaces for no language in particular, you incur incompatibilities for every particular language’s best practices, idioms, and data models. [*] The issue of appropriately representing exceptions and/or error codes in RPC-like services is a notorious example of this.

There are additional specification mechanisms like WSDL [@] that allow us to make the payloads more object-like. Additional structure is indicated through the use of user-defined “complex types,” but this only gets you part of the way to a usable API for any given language. In Python, it’s a lot more sensible to perform an operation like in the following abstraction:

import getpass
from internal_tracker.service import InternalTracker
bug_service = InternalTracker(username=getpass.getuser(),
    password=getpass.getpass())
bug = bug_service.get_bug(123456)
bug.actionable.add('Chris Leary') # may raise ReadOnlyException
comment = Comment(text='Adding self to actionable')
bug.arbs.add_comment(comment)
bug.save() # may raise instanceof ServiceWriteException

Than to use an external web service API solution directly (despite using the excellent Suds library):

# Boilerplate
import getpass
import suds.client
import suds.wsse

client = suds.client.Client(wsdl_uri)
security = suds.wsse.Security()
security.tokens.append(suds.wsse.UsernameToken(getpass.getuser(),
    getpass.getpass()))
client.set_options(wsse=security)
internal_tracker_service = client.service
 
# Usage
service_bug = internal_tracker_service.GetBug(123456)
service_bug.actionable += ', Chris Leary'
# Do we check the response for all WebFault exceptions?
# (Do we check for and handle all the possible transport issues?)
internal_tracker_service.UpdateBug(service_bug)
comment = client.factory.create('Comment')
comment.BugId = service_bug.Id
comment.Text = 'Adding self to actionable'
# Again, what should we check?
internal_tracker_service.AddComment(comment)

Why is it good to have the layer of indirection?

Lemma 1: The former example actually reads like Python code. It raises problem-domain-relevant exceptions, uses keyword arguments appropriately, follows language naming conventions, and uses sensible language-specific data types that may be poorly represented in the web service. For example, actionable may be a big comma-delimited string according to the service, whereas it should clearly be modeled as a set of (unique) names, using Python’s set data type. Another example is BigIntegers being poorly represented as strings in order to keep the API language-neutral.
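
As a small illustration of the actionable example, here is one way the wrapper might convert between the wire format and a set. The function names and the exact comma-and-space wire format are assumptions, not details of any real service.

```python
def parse_actionable(raw):
    """Turn the service's comma-delimited string into a set of names."""
    return {name.strip() for name in raw.split(',') if name.strip()}


def format_actionable(names):
    """Serialize back to the assumed wire format, sorted for stable output."""
    return ', '.join(sorted(names))


names = parse_actionable('Alice Example, Bob Example')
names.add('Chris Leary')
names.add('Chris Leary')  # Adding twice is harmless with a set.
print(format_actionable(names))  # Alice Example, Bob Example, Chris Leary
```

Compare this with the string concatenation in the raw service example, which would happily append the same name twice.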

Lemma 2: The layer represents an extremely maintainable abstraction barrier between the client and the backing service. Should a team using the abstraction decide it’s prudent to switch to, say, Bugzilla, I would have no trouble writing a port for the backing service in which all client code would continue to work. Another example is a scenario in which we determine that the transport is unreliable for some reason, so decide all requests should be retried three times instead of one. [!] How many places will I need to make changes? How many client code bases do I potentially need to keep track of?
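
The retry scenario can be sketched as follows. Everything here is hypothetical (TransportError, the backend object, the shape of the bug record); the point is only that the retry policy lives in one method of the wrapper instead of in every client code base.

```python
class TransportError(Exception):
    """Hypothetical error for a request that failed in transit."""


class InternalTracker:
    """Hypothetical wrapper around a backing service object."""

    max_attempts = 3  # The single place to change the retry policy.

    def __init__(self, backend):
        self._backend = backend

    def _call_with_retry(self, func, *args):
        for attempt in range(1, self.max_attempts + 1):
            try:
                return func(*args)
            except TransportError:
                if attempt == self.max_attempts:
                    raise  # Out of attempts; let the caller see the failure.

    def get_bug(self, bug_id):
        return self._call_with_retry(self._backend.get_bug, bug_id)


class FlakyBackend:
    """Test double that fails twice before answering."""

    def __init__(self):
        self.calls = 0

    def get_bug(self, bug_id):
        self.calls += 1
        if self.calls < 3:
            raise TransportError('connection reset')
        return {'id': bug_id}


backend = FlakyBackend()
tracker = InternalTracker(backend)
print(tracker.get_bug(123456))  # {'id': 123456}
print(backend.calls)            # 3
```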

Why is it risky to use the web service interface?

If the web service API represents the problem domain correctly with constructs that make sense for your language, it’s fine to use directly. (As long as you’re confident you won’t have transport-layer issues.) If you’re near-certain that the backing service will not change, and/or you’re willing to risk all the client code that will depend on that API directly being instantaneously broken, it’s fine. The trouble occurs when one of these is not the case.

Let’s say that the backing service does change to Bugzilla. Chances are that hacking in adapter classes for the new service would be a horrible upgrade experience that entails:

  1. Repeated discovery of leaky abstractions,
  2. Greater propensity to bugs, [^] and
  3. More difficult maintenance going forward.

Client code that is tightly coupled to the service API would force a rewrite in order to avoid these issues.

Pragmatic Programming says to rely on reliable things, which is a rule that any reasonable person will agree with. [&] The abstraction barrier is reliable in its loose coupling (direct modeling of the problem domain), whereas direct use of the web service API could force a reliance on quirky external service facts, perhaps deep into client code.

Is there room for compromise?

This is the point in the discussion where we think something along the lines of, “Well, I can just fix the quirky things with a bunch of shims between my code and the service itself.” At that point, I contend, you’re really just implementing a half-baked version of the language-specific API. It’s better to make the abstractions appropriate for the target language and problem domain the first time around than to add shims incrementally and hope client code didn’t use the underlying quirks before you got to them. Heck, if the web service is extremely well suited to your language, you’ll end up proxying most of the time anyway, and the development will be relatively effortless. [$]
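
Here is a sketch of what "proxying most of the time" can look like in Python. FakeService and its methods are stand-ins invented for the example; the technique is plain attribute delegation via __getattr__, with hand-written methods only where the raw result needs cleaning up.

```python
class FakeService:
    """Stand-in for a web service client; both methods are hypothetical."""

    def GetBug(self, bug_id):
        return {'Id': bug_id, 'Actionable': 'Alice Example, Bob Example'}

    def Ping(self):
        return 'pong'


class TrackerProxy:
    def __init__(self, service):
        self._service = service

    def get_bug(self, bug_id):
        # Hand-written because the raw result needs de-quirking.
        raw = self._service.GetBug(bug_id)
        return {'id': raw['Id'],
                'actionable': set(raw['Actionable'].split(', '))}

    def __getattr__(self, name):
        # Only consulted when normal lookup fails, so get_bug above wins.
        return getattr(self._service, name)


tracker = TrackerProxy(FakeService())
print(tracker.Ping())  # pong
print(sorted(tracker.get_bug(1)['actionable']))  # ['Alice Example', 'Bob Example']
```

Sane service methods pass straight through, so the wrapper costs almost nothing to maintain until a quirk shows up.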

What about speed of deployment?

If we have language-specific APIs, won’t there be an additional delay while we wait for them to be updated whenever new capabilities are added to the backing service?

First of all, if the new capability is not within the problem domain of the library, it should be a separate API. This is the single responsibility principle applied to interfaces — you should be programming to an interface abstraction. Just because a backing service has a hodgepodge of responsibilities doesn’t mean that our language-specific API should as well. In fact, it probably shouldn’t. Let’s assume it is in the problem domain.

If the functionality is sane and ready for use in the target language, it should be really simple for the library owner to extend the language-specific API. In fact, if you’re using the proxy pattern, you may not have to do anything at all. Let’s assume that the functionality is quirky and you’re blocked waiting for the library owner to update with the language-specific shim, because it’s non-trivial.

Now our solution tends to vary based on the language. Languages like Python have what’s known as “gentlemen’s privacy”, based on the notion of a gentlemen’s agreement. Privacy constraints are not enforced at compile-time and/or run-time, so you can just reach through the abstraction barrier if you believe you know what you’re doing. Yes, you’re making an informed decision to violate encapsulation. Cases like this are exactly when it comes in handy.

assert not hasattr(bug_service, 'super_new_method_we_need')
# HACK: Violate abstraction -- we need this new capability right now
# and Billy-Bo, the library owner, is swamped!
suds_client = bug_service._suds_client
result = suds_client.SuperNewMethodWeNeed()
target_result = de_quirkify(result)

As you can see, we end up implementing the method de_quirkify to de-quirk the quirky web service result into a more language-specific data model — it’s bad form to make the code dependent on the web service’s quirky output form. We then submit our code for this method to the library owner and suggest that they use it as a basis for their implementation, so that a) they can get it done faster, and b) we can seamlessly factor the hack out.

For privacy-enforcing languages, you would need to expose a public API for getting at the private service, then tell people not to use it unless they know what they’re doing. As you can tell, you pretty much wind up with gentlemen’s privacy on that interface, anyway.

Footnotes

[*] In EE we call this kind of phenomenon an impedance mismatch, which results in power loss.
[@] And many others, with few successful ones.
[!] Or maybe we want to switch from HTTP transport to SMTP. Yes, this is actually possible. :-)
[^] In duplicating exact web service behaviors — you have to be backwards compatible with whichever behaviors the existing client code relies on.
[&] It’s a syntactic tautology. The caveat is that reasonable people will almost certainly quibble over the classification of what’s reliable.
[$] At least in Python or other languages with a good degree of dynamism, it will.

Generators and resource acquisition/release

Monday, April 6th, 2009

One of the neatest things about language lawyers is that they have a keen eye for features of a language that may conflict with each other to produce fail. I, on the other hand, find it fun to stumble around in various languages and analyze interesting cases as I encounter them.

Generators in Python started out as a subset of the more general concept of coroutines. Generators are an elegant and concise way to write reasonably sized state machines. For that reason, you’ll see them heavily associated with iterators (which are more syntaxerific [^] to write in a language without generators, like Java).

I used to envision generators as little stack frames that were detached from the call stack and placed somewhere in outer space, eating moon cheese and playing with the Django pony, where they lived happily ever after. Surprisingly, that concept didn’t match up with reality too well.

PEP 342: Coroutines via Enhanced Generators and PEP 325: Resource-Release Support for Generators are the language lawyer smack-down of my naive view. We used to be unable to perform proper resource acquisition within generators; notably, you couldn’t yield from the try suite of a try/finally block, because the only way to guarantee resource release in the finally block was to step the generator until a StopIteration exception:

Restriction: A yield statement is not allowed in the try clause of a try/finally construct. The difficulty is that there’s no guarantee the generator will ever be resumed, hence no guarantee that the finally block will ever get executed; that’s too much a violation of finally’s purpose to bear.
- PEP 255 — Specification: Yield

from threading import Lock
 
lock = Lock()
 
def gen():
    try:
        lock.acquire()
        yield 'Acquired!'
    finally:
        lock.release()
 
if __name__ == '__main__':
    g = gen()
    print g.next()

We see the addition of this capability in Python 2.5:

$ python2.4 poc.py 
  File "poc.py", line 8
    yield 'Acquired!'
SyntaxError: 'yield' not allowed in a 'try' block with a 'finally' clause
 
$ python2.5 poc.py 
Acquired!

Before Python 2.5 there was no way to tell the generator to die and give up its resources. As PEP 342 describes, Python 2.5 turns generators into simple coroutines, which we can force to release their resources [#] when necessary via the close method:

>>> import poc
>>> g = poc.gen()
>>> h = poc.gen()
>>> g.next()
'Acquired!'
>>> g.close() # Force it to release the resource, or we deadlock.
>>> h.next()
'Acquired!'
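
For readers on Python 3, the same experiment can be spelled as below. This is a sketch rather than a translation of poc.py line for line: it uses the next() built-in instead of the .next() method, takes the lock just before the try block, and checks Lock.locked() to make the release visible.

```python
from threading import Lock

lock = Lock()


def gen():
    lock.acquire()
    try:
        yield 'Acquired!'
    finally:
        lock.release()


g = gen()
print(next(g))        # Acquired!
print(lock.locked())  # True
g.close()             # Raises GeneratorExit at the paused yield; finally runs.
print(lock.locked())  # False
```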

SimPy

If you’re wondering how I came across this combination in day-to-day Python programming, it was largely due to SimPy. I was writing a PCI bus simulation [$] for fun, to help get a grasp of the SimPy constructs and how they might affect normal object oriented design. [%] I wanted to “acquire” a bus grant, so I analyzed the applicability of with for this resource acquisition.

I went to Stack Overflow and submitted a “feeler” question to see if there was some conventional Python wisdom I was lacking: Is it safe to yield from within a “with” block in Python (and why)?. The concept seemed relatively new to those in the discussion; however, the responses are still insightful.

The Lesson

This experience has demonstrated to me there are two modes of thinking when it comes to Python generators: short-lived and long-lived.

Typical, pre-Python 2.5 generator usage, where generators are really used like generators, lets you gloss over the difference between a regular function and a generator. Really, all that you want to do with this kind of construct is get some values to be used right now. You’re not doing anything super-fancy in the generator; it’s just nicer syntax to have all of your local variables automatically saved in the generator function than doing it manually in an independent object.

Fancy, SimPy coroutine usage, where generators are managed as coroutines by a central dispatcher, makes a generator take on some more serious object-like semantics. Shared-resource acquisition across coroutine yields should scare you, at least as much as objects that acquire shared resources without releasing them right away. [*] Perhaps more, seeing as how you’re lulled into a state of confidence by understanding short-lived Python generator behaviors.
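
The hazard is easy to reproduce with a toy round-robin dispatcher. This is not SimPy code; device, run_round_robin, and the bus lock are invented for the sketch. The second coroutine uses a non-blocking acquire so that the example reports the contention instead of hanging. Had it used a blocking acquire, the single thread would wait forever, because the first coroutine could never be resumed to release the lock.

```python
from threading import Lock

bus = Lock()
log = []


def device(name):
    # Non-blocking acquire so the sketch reports contention instead of hanging.
    if not bus.acquire(blocking=False):
        log.append(name + ' would block')
        return
    try:
        log.append(name + ' holds bus')
        yield  # Hand control back to the dispatcher while holding the bus.
        log.append(name + ' done')
    finally:
        bus.release()


def run_round_robin(coroutines):
    """Step each coroutine in turn until all of them have finished."""
    pending = list(coroutines)
    while pending:
        current = pending.pop(0)
        try:
            next(current)
        except StopIteration:
            continue
        pending.append(current)


run_round_robin([device('A'), device('B')])
print(log)  # ['A holds bus', 'B would block', 'A done']
```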

Footnotes

[^] This word was invented to make me seem less biased against Java. Oh, also, even more props to Barbara Liskov (Turing Award winner) for the impetus of generator-based iterators in the CLU language.
[#] We can do other things with the new capabilities, like feed values back into the generators:

def gen():
    feedback = (yield 'First')
    yield feedback
 
if __name__ == '__main__':
    g = gen()
    assert g.next() == 'First'
    assert g.send('Test') == 'Test'

[$] The original PCI bus is approximately the “Hello, World!” of platform architecture, so far as I can tell.
[%] I still haven’t gotten a solid grasp of the design methodology changes. If you want one generator to block until the success/failure of another subroutine, then you have to sleep and trigger wake events with the possibility of interrupts. Can you tell I’ve never used a language with continuations before? ;-)
[*] Deadlocking on mutually exclusive resources is easy with a cooperatively multitasking dispatcher: one entity (coroutine) is holding the resource and yields, the dispatcher picks another one that wants that same resource, that one performs a blocking acquisition, and then you have circular wait with no preemption == deadlock.