April 9, 2009

Eliminating web service dependencies with a language-specific abstraction barrier

Hyperbolic analogy: Saying, "You shouldn't need to wrap the web service interface, because it already provides an API," is like saying, "You shouldn't need different programming languages, because they're all Turing complete."

Web services tend to deliver raw data payloads from a flat interface and thus lack the usability of native language APIs. Inevitably, when you program RPC-like interfaces for no language in particular, you incur incompatibilities with every particular language's best practices, idioms, and data models. [*] The issue of appropriately representing exceptions and/or error codes in RPC-like services is a notorious example of this.
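
For example, many RPC-style services report failure through a status field in the response envelope; a language-specific wrapper can translate that convention back into idiomatic exceptions. A minimal sketch, where the fault-code fields and exception names are hypothetical:

class ServiceError(Exception):
    """Problem-domain exception; the raw service only reports numeric codes."""

class ReadOnlyException(ServiceError):
    pass

# Hypothetical convention: a fault_code of zero means success.
_FAULT_EXCEPTIONS = {403: ReadOnlyException}

def check_response(response):
    if response.fault_code != 0:
        exc_class = _FAULT_EXCEPTIONS.get(response.fault_code, ServiceError)
        raise exc_class(response.fault_message)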

There are additional specification mechanisms like WSDL [†] that allow us to make the payloads more object-like. Additional structure is indicated through user-defined "complex types," but this only gets you part of the way to a usable API for any given language. In Python, it's a lot more sensible to perform an operation with an abstraction like the following:

import getpass

from internal_tracker.service import InternalTracker, Comment

bug_service = InternalTracker(username=getpass.getuser(),
    password=getpass.getpass())
bug = bug_service.get_bug(123456)
bug.actionable.add('Chris Leary') # may raise ReadOnlyException
comment = Comment(text='Adding self to actionable')
bug.add_comment(comment)
bug.save() # may raise a ServiceWriteException

Than to use an external web service API directly (despite using the excellent suds library):

# Boilerplate
import getpass
import suds.client
import suds.wsse
from suds.wsse import UsernameToken

client = suds.client.Client(wsdl_uri)
security = suds.wsse.Security()
security.tokens.append(UsernameToken(getpass.getuser(),
    getpass.getpass()))
client.set_options(wsse=security)
internal_tracker_service = client.service

# Usage
service_bug = internal_tracker_service.GetBug(123456)
service_bug.actionable += ', Chris Leary'
# Do we check the response for all WebFault exceptions?
# (Do we check for and handle all the possible transport issues?)
internal_tracker_service.UpdateBug(service_bug)
comment = client.factory.create('Comment')
comment.BugId = service_bug.Id
comment.Text = 'Adding self to actionable'
# Again, what should we check?
internal_tracker_service.AddComment(comment)

Why is it good to have the layer of indirection?

Lemma 1: The former example actually reads like Python code. It raises problem-domain-relevant exceptions, uses keyword arguments appropriately, follows language naming conventions, and uses sensible language-specific data types that may be poorly represented in the web service. For example, actionable may be a big comma-delimited string according to the service, whereas it should clearly be modeled as a set of (unique) names, using Python's set data type. Another example is BigIntegers being poorly represented as strings in order to keep the API language-neutral.
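
To make the actionable example concrete, here's a minimal sketch of a wrapper class that adapts the service's comma-delimited string into a set (the payload attribute names are hypothetical):

class Bug(object):
    def __init__(self, service_bug):
        self._service_bug = service_bug
        # The service models actionable as one comma-delimited string;
        # a set of unique names is the natural Python representation.
        self.actionable = set(name.strip()
                              for name in service_bug.actionable.split(',')
                              if name.strip())

    def _serialize_actionable(self):
        # Convert back to the service's wire format when saving.
        return ', '.join(sorted(self.actionable))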

Lemma 2: The layer represents an extremely maintainable abstraction barrier between the client and the backing service. Should a team using the abstraction decide it's prudent to switch to, say, Bugzilla, I would have no trouble writing a port for the backing service in which all client code would continue to work. Another example is a scenario in which we determine that the transport is unreliable for some reason, so decide all requests should be retried three times instead of one. [‡] How many places will I need to make changes? How many client code bases do I potentially need to keep track of?
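
For the retry scenario, the answer can be "exactly one place": a single chokepoint inside the wrapper. A sketch, with a hypothetical transport exception:

import time

class TransportError(Exception):
    """Hypothetical exception raised by the transport layer."""

class InternalTracker(object):
    MAX_ATTEMPTS = 3  # the retry policy changes here, and only here

    def _call(self, method, *args):
        # Every service request funnels through this single method.
        for attempt in range(self.MAX_ATTEMPTS):
            try:
                return method(*args)
            except TransportError:
                if attempt == self.MAX_ATTEMPTS - 1:
                    raise  # out of retries; let the caller see the failure
                time.sleep(2 ** attempt)  # simple exponential backoff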

Why is it risky to use the web service interface?

If the web service API represents the problem domain correctly with constructs that make sense for your language, it's fine to use directly. (As long as you're confident you won't have transport-layer issues.) If you're near-certain that the backing service will not change, and/or you're willing to risk instantaneously breaking all the client code that depends directly on that API, it's also fine. The trouble occurs when one of these is not the case.

Let's say that the backing service does change to Bugzilla. Chances are that hacking in adapter classes for the new service would be a horrible upgrade experience that entails:

  1. Repeated discovery of leaky abstractions,

  2. Greater propensity for bugs, [§] and

  3. More difficult maintenance going forward.

Client code that is tightly coupled to the service API would force a rewrite in order to avoid these issues.

Pragmatic Programming says to rely on reliable things, which is a rule that any reasonable person will agree with. [¶] The abstraction barrier is reliable in its loose coupling (direct modeling of the problem domain), whereas direct use of the web service API could force a reliance on quirky external service facts, perhaps deep into client code.

Is there room for compromise?

This is the point in the discussion where we think something along the lines of, "Well, I can just fix the quirky things with a bunch of shims between my code and the service itself." At that point, I contend, you're really just implementing a half-baked version of the language-specific API. It's better to make the abstractions appropriate for the target language and problem domain the first time around than to incrementally add shims, hoping client code didn't use the underlying quirks before you got to them. Heck, if the web service is extremely well suited to your language, you'll end up proxying most of the time anyway, and the development will be relatively effortless. [#]
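
A minimal sketch of that proxying, assuming a suds-backed wrapper and reusing the hypothetical Bug adapter from earlier: delegate by default, and shim only the quirky operations.

class InternalTracker(object):
    def __init__(self, suds_client):
        self._suds_client = suds_client

    def get_bug(self, bug_id):
        # Quirky operation: explicitly adapted to the language's idioms.
        return Bug(self._suds_client.service.GetBug(bug_id))

    def __getattr__(self, name):
        # Everything else proxies straight through to the service, so
        # well-behaved operations cost nearly nothing to expose.
        return getattr(self._suds_client.service, name)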

What about speed of deployment?

If we have language-specific APIs, won't there be additional delay waiting for them to be updated when new capabilities are added to the backing service?

First of all, if the new capability is not within the problem domain of the library, it should be a separate API. This is the single responsibility principle applied to interfaces — you should be programming to an interface abstraction. Just because a backing service has a hodgepodge of responsibilities doesn't mean that our language-specific API should as well. In fact, it probably shouldn't. Let's assume it is in the problem domain.

If the functionality is sane and ready for use in the target language, it should be really simple for the library owner to extend the language-specific API. In fact, if you're using the proxy pattern, you may not have to do anything at all. Let's assume that the functionality is quirky and you're blocked waiting for the library owner to update with the language-specific shim, because it's non-trivial.

Now our solution tends to vary based on the language. Languages like Python have what's known as "gentlemen's privacy", based on the notion of a gentlemen's agreement. Privacy constraints are not enforced at compile-time and/or run-time, so you can just reach through the abstraction barrier if you believe you know what you're doing. Yes, you're making an informed decision to violate encapsulation. Cases like this are exactly when it comes in handy.

assert not hasattr(bug_service, 'super_new_method_we_need')
# HACK: Violate abstraction -- we need this new capability right now
# and Billy-Bo, the library owner, is swamped!
suds_client = bug_service._suds_client
result = suds_client.service.SuperNewMethodWeNeed()
target_result = de_quirkify(result)

As you can see, we end up implementing the method de_quirkify to de-quirk the quirky web service result into a more language-specific data model — it's bad form to make the code dependent on the web service's quirky output form. We then submit our code for this method to the library owner and suggest that they use it as a basis for their implementation, so that a) they can get it done faster, and b) we can seamlessly factor the hack out.

For privacy-enforcing languages, you would need to expose a public API for getting at the private service, then tell people not to use it unless they know what they're doing. As you can tell, you pretty much wind up with gentlemen's privacy on that interface, anyway.

Footnotes

[*]

In EE we call this kind of phenomenon an impedance mismatch, which results in power loss.

[†]

And many others, with few successful ones.

[‡]

Or maybe we want to switch from HTTP transport to SMTP. Yes, this is actually possible. :-)

[§]

In duplicating exact web service behaviors, you have to be backwards compatible with whichever behaviors the existing client code relies on.

[¶]

It's a syntactic tautology. The caveat is that reasonable people will almost certainly quibble over the classification of what's reliable.

[#]

At least in Python or other languages with a good degree of dynamism, it will.

Generators and resource acquisition/release

One of the neatest things about language lawyers is that they have a keen eye for features of a language that may conflict with each other to produce fail. I, on the other hand, find it fun to stumble around in various languages and analyze interesting cases as I encounter them.

Generators in Python are a subset of a more general concept: coroutines. Generators are an elegant and concise way to write reasonably sized state machines. For that reason, you'll see them heavily associated with iterators (which are more syntaxerific [*] to write in a language without generators, like Java).

I used to envision generators as little stack frames that were detached from the call stack and placed somewhere in outer space, eating moon cheese and playing with the Django pony, where they lived happily ever after. Surprisingly, that concept didn't match up with reality too well.

PEP 342: Coroutines via Enhanced Generators and PEP 325: Resource-Release Support for Generators are the language lawyer smack-down of my naive view. We used to be unable to perform proper resource acquisition within generators; notably, you couldn't yield from the try suite of a try/finally block, because the only way to guarantee the finally block would run was to step the generator until it raised StopIteration:

Restriction: A yield statement is not allowed in the try clause of a try/finally construct. The difficulty is that there's no guarantee the generator will ever be resumed, hence no guarantee that the finally block will ever get executed; that's too much a violation of finally's purpose to bear.

  • PEP 255 — Specification: Yield

from threading import Lock

lock = Lock()

def gen():
    try:
        lock.acquire()
        yield 'Acquired!'
    finally:
        lock.release()

if __name__ == '__main__':
    g = gen()
    print g.next()

We see the addition of this capability in Python 2.5:

$ python2.4 poc.py
  File "poc.py", line 8
    yield 'Acquired!'
SyntaxError: 'yield' not allowed in a 'try' block with a 'finally' clause

$ python2.5 poc.py
Acquired!

Before Python 2.5 there was no way to tell the generator to die and give up its resources. As PEP 342 describes, Python 2.5 turns generators into simple coroutines, which we can force to release their resources [†] when necessary via the close method:

>>> import poc
>>> g = poc.gen()
>>> h = poc.gen()
>>> g.next()
'Acquired!'
>>> g.close() # Force it to release the resource, or we deadlock.
>>> h.next()
'Acquired!'

SimPy

If you're wondering how I came across this combination in day-to-day Python programming, it was largely due to SimPy. I was writing a PCI bus simulation [‡] for fun, to help get a grasp of the SimPy constructs and how they might affect normal object-oriented design. [§] I wanted to "acquire" a bus grant, so I analyzed the applicability of with for this resource acquisition.

I went to Stack Overflow and submitted a "feeler" question to see if there was some conventional Python wisdom I was lacking: Is it safe to yield from within a "with" block in Python (and why)?. The concept seemed relatively new to those in the discussion; however, the responses are still insightful.

The Lesson

This experience has demonstrated to me that there are two modes of thinking when it comes to Python generators: short-lived and long-lived.

Typical, pre-Python 2.5 generator usage, where generators are really used like generators, lets you gloss over the difference between a regular function and a generator. Really, all that you want to do with this kind of construct is get some values to be used right now. You're not doing anything super-fancy in the generator — it's just nicer syntax to have all of your local variables automatically saved in the generator function than to save them manually in an independent object.
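
As a quick contrast, here's a toy example: the generator's locals are saved between next() calls for free, while the equivalent hand-rolled iterator object does its state bookkeeping manually.

def countdown(n):
    # Generator: n is preserved across next() calls automatically.
    while n > 0:
        yield n
        n -= 1

class Countdown(object):
    # Equivalent iterator object: the state bookkeeping is manual.
    def __init__(self, n):
        self.n = n

    def __iter__(self):
        return self

    def next(self):
        if self.n <= 0:
            raise StopIteration
        self.n -= 1
        return self.n + 1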

Fancy, SimPy coroutine usage, where generators are managed as coroutines by a central dispatcher, makes a generator take on some more serious object-like semantics. Shared-resource acquisition across coroutine yields should scare you, at least as much as objects that acquire shared resources without releasing them right away. [¶] Perhaps more, seeing as how you're lulled into a state of confidence by understanding short-lived Python generator behaviors.

Footnotes

[*]

This word was invented to make me seem less biased against Java. Oh, also, even more props to Barbara Liskov (Turing Award winner) for the impetus of generator-based iterators in the CLU language.

[†]

We can do other things with the new capabilities, like feed values back into the generators:

def gen():
    feedback = (yield 'First')
    yield feedback

if __name__ == '__main__':
    g = gen()
    assert g.next() == 'First'
    assert g.send('Test') == 'Test'
[‡]

The original PCI bus is approximately the "Hello, World!" of platform architecture, so far as I can tell.

[§]

I still haven't gotten a solid grasp of the design methodology changes. If you want one generator to block until the success/failure of another subroutine, you have to sleep and trigger wake events, with the possibility of interrupts. Can you tell I've never used a language with continuations before? ;-)

[¶]

Deadlocking on mutually exclusive resources is easy with a cooperatively multitasking dispatcher: one entity (coroutine) is holding the resource and yields, the dispatcher picks another one that wants that same resource and performs a non-blocking acquisition (which fails), and then you have circular wait with no preemption == deadlock.
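
A toy sketch of that scenario, using plain generators under a naive round-robin dispatcher (not SimPy itself):

from threading import Lock

lock_a, lock_b = Lock(), Lock()

def worker(first, second):
    while not first.acquire(False):  # non-blocking acquisition
        yield 'waiting'
    yield 'acquired first'
    while not second.acquire(False):  # never succeeds: circular wait
        yield 'waiting'
    yield 'acquired second'
    second.release()
    first.release()

def dispatch(coroutines, max_steps=5):
    # Round-robin dispatcher; max_steps keeps the demo from spinning forever.
    for _ in range(max_steps):
        for coroutine in coroutines:
            print coroutine.next()

if __name__ == '__main__':
    # Each worker grabs its first lock, then waits forever on the other's.
    dispatch([worker(lock_a, lock_b), worker(lock_b, lock_a)])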

Idiomatic Python refactoring: for-else, "in" (contains) operator

I was perusing the App Engine SDK and I came across this snippet:

if self.choices:
  match = False
  for choice in self.choices:
    if choice == value:
      match = True
  if not match:
    raise BadValueError('Property %s is %r; must be one of %r' %
                        (self.name, value, self.choices))

Since I don't work with many other Python programmers, I always have trouble figuring out what interesting tidbits would be useful to post in, say, a blog entry. I don't have a good understanding of the popular knowledge level, but I figure that I can't go too wrong refactoring code written by Google engineers (who I naively assume are all as cool as Steve Yegge). [*]

The for-else statement

Let's forget about self for now [†] and refactor to use an obscure (but useful) Python feature: the for-else construct. for-else removes the need for the boolean-flag idiom seen in the original code, which is common in lower-level languages. [‡]

if choices:
    for choice in choices:
        if choice == value:
            break
    else:
        raise BadValueError

The for-else statement looks a little strange when you first encounter it, but I've come to love it. The else suite is evaluated if you don't break out of the for loop. In this case, if we didn't break out of the for loop, then we never found a choice equal to value.

We also gain some efficiency over the original by using the break statement as soon as we find a match: there's no need to keep looking if you've already found a result! This can save you from iterating over all len(choices) items if you find a match in the first iteration.

in (contains) operator

Here is an even more readable and Python-like refactoring that uses the in operator: [§]

if choices and value not in choices:
    raise BadValueError

The in operator works on any iterable object and performs the same behavior as the code above: it looks for any item within choices such that value == item. If it finds a match early in the sequence, it won't keep looking. This is similar behavior to our early break statement from the first refactoring.

Just like the original code with the for loop, the in operator raises a TypeError if choices is not iterable. The in operator is effectively a drop-in replacement for the (more verbose) for loop when it comes to membership testing.

Footnotes

[*]

You should read his blog if you don't already.

[†]

For the language lawyers: we're forgetting about the fact that this code was intended to be executed in a bound instance method. ;)

[‡]

For example, C. For more information on programming languages and their "heights", see this Wikipedia entry.

[§]

Yeah, yeah... technically it's the not in operator.

Using Python identifiers to helpfully indicate protocols

Background

The Stroop Effect suggests that misleading identifiers are more prone to improper use and introduce subtler bugs. Because of this phenomenon, I try to make my identifiers' intended usage as clear as possible without over-specifying and/or repeating myself. Additionally, I prefer programming languages that allow for latent typing, which has an interesting result: I end up encoding protocol indicators into identifiers. [*]

An Example

If you're (still) reading this, you're most likely a Python programmer. When you find that there exists an identifier chunks, what "kind of thing" do you most expect chunks to be bound to? Since this is a very hand-wavy question, I'll provide some options to clarify:

  1. A sequence (iterable) of chunk-like objects.

  2. A callable that returns chunks.

  3. A mapping with chunk-like values (presumably not chunk-like keys).

  4. A number (which represents a count of chunk-like objects somewhere in the problem domain).

If you've got a number picked out, then know that I'm the bachelor behind door number one. Since I would identify a lone chunk by the identifier chunk, the identifier says to me, "I'm identifying some (a collection of) chunks." By iterating, I'm asking to hold the chunks one at a time. (Yuck. :)

Callables and Action Words

If you chose door number two and think that it's a callable, then your bachelor is this Django project API, which I am using in this particular (chunky) example. This practice is not at all uncommon, however, and another good example of it within the Python standard library is BaseHTTPServer, with its version_string and date_time_string methods. I might be missing something major; however, I'm going to claim that callables should be identified with action word (verb) prefixes.

To me, it seems well worth it to put these prefixes onto identifiers that are intended to be callables to make their use more readily apparent. To my associative capacities, action words and callables go together like punch and pie. Since it helps clarify usage while writing code, it seems bound to help clarify potential errors while reading code, as in the following contrast:

for chunk in uploaded_file.get_chunks:
    """Looks wrong and feels wrong writing it... action word but no
    invocation.
    """
for chunk in uploaded_file.chunks:
    """Looks fine and feels okay writing it, but uploaded_file.chunks is
    really a method.
    """

Mappings and Bylines

If you chose number three and think that it's a mapping, I'm surprised. There's nothing about the identifier to indicate that there is a key-value relation. Additionally, attempting to iterate over chunks, if it is a mapping with non-chunk keys, will end up iterating over the (non-chunk) keys, like so:

>>> chunks = {1: 'Spam', 2: 'Eggs'}
>>> i = iter(chunks)
>>> repr(i)
'<dictionary-keyiterator object at 0xb7da4ea0>'
>>> i.next()
1

chunks being a mapping makes the code incompatible with the people who interpret the identifier as an iterable of chunks (number one), since the iterator method (__iter__) for a mapping iterates over the keys rather than the chunks. This is the kind of mistake that I dislike the most: a potentially silent one! [†]

To solve this potential ambiguity in my code I use "bylines", as in the following:

>>> chunk_by_healthiness = {1: 'Spam', 2: 'Eggs'}
>>> healthiness_values = iter(chunk_by_healthiness)

Seeing that the identifier has a _by_healthiness postfix tells me that I'm dealing with a mapping rather than a simple sequence, and the code tends to read in a more straightforward manner: if it has a _by_* postfix, that's what the default iterator will iterate over. In a similar fashion, if you had a mapping of healthiness to sequences of chunks, I would name the identifier chunks_by_healthiness. [‡]

Identifying Numbers

If you chose number four, I see where you're coming from but don't think the same way. Every identifier whose purpose is to reference a numerical count I either prefix with num_ or postfix with _count. This leaves identifiers like chunks free for sequences that I can call len() on, and indicates that chunk_count has a number-like interface.
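
For example, with toy values:

>>> chunks = ['spam', 'eggs']
>>> chunk_count = len(chunks)  # clearly number-like: arithmetic is fine
>>> num_chunks = len(chunks)   # the prefix form reads equally clearly
>>> chunk_count + num_chunks
4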

Compare/Contrast with Hungarian notation

Through my day-to-day usage I find that this approach doesn't really suffer from the Wikipedia-listed criticisms of Hungarian notation.

Unless I sorely misunderstood the distinction, you could classify this system as a broadly applicable Apps Hungarian, since protocols are really all about semantic information (being file-like indicates the purpose of being used like a file!). Really, this guideline just developed from a desire to use identifiers that conform to some general notions that we have of language and what language describes; I don't tend to think of chunks as something that I can invoke. (Invoke the chunks!)

For objects that span multiple protocols or aren't "inherently" tied to any given protocol, I just use the singular.

Potential Inconsistencies

Strings can be seen as an inconsistency in this schema. Strings really fall into an ordered sequence protocol, but their identifiers are in the singular; i.e., "message". One could argue that strings are really more like sequences of symbols in Python and that the identifiers would be more consistent if we used something like "message_letters". These sorts of identifiers just seem impractical, so I'm going to reply that you're really identifying a message object adapted to the string protocol, so it's still a message. Feel free to tear this argument apart. ;)

Footnotes

[*]

A protocol in Python is roughly a "well understood" interface. For example, the Python file protocol starts with an open() ability that returns a file-like handle: this handle has a read() method, a write() method, and usually a seek() method. Anything that acts in a file-like way by using these same "well understood" methods can usually be used in place of a file, so the term protocol is used due to the lack of a de jure interface specification. For a really cool PEP on protocols and adapters, read Alex Martelli and Clark Evans' PEP 246. (Note: This PEP wasn't accepted because function annotations are coming along in Python 3, but it's still a really cool idea. :)
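
As a tiny illustration of that duck typing, anything with a read() method can usually stand in where a file is expected:

class UpperReader(object):
    """File-like adapter: wraps any file-like object, upper-casing reads."""
    def __init__(self, fileobj):
        self._fileobj = fileobj

    def read(self, size=-1):
        return self._fileobj.read(size).upper()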

[†]

Perl has lots of silently conforming behavior that drives me nuts.

[‡]

I still haven't figured out a methodological way to scale this appropriately in extreme cases; i.e. a map whose values are also maps becomes something like chunk_by_healthiness_by_importance, which gets ugly real fast.

Problems with Python __version__ parsing

As stated by Armin and commenters, [*] the change from 0.9 to 0.10 is a convention in open source versioning, and the fault seems to lie more with the version-parsers than the version-suppliers. [†] Armin also notes that the appropriate solution is to use:

from pkg_resources import parse_version

Despite it not being the fault of the version supplier, we've recognized that this can be an issue and can certainly take precautions against letting client code interpret __version__ as a float. Right now there are two ways that I can think of doing this:

  1. Keep __version__ as a tuple. If you keep __version__ in tuple form, you don't need to worry about client code forgetting to use the parse_version method. (See the comparison sketch after this list.)

  2. Use version numbers with more than one decimal point. This prohibits the version from being parsed as a float because it's not the correct format — taking the current Linux kernel version as an example:

    >>> __version__ = '2.6.26'
    >>> float(__version__)
    Traceback (most recent call last):
    ...
    ValueError: invalid literal for float(): 2.6.26
    >>> tuple(int(i) for i in __version__.split('.'))
    (2, 6, 26)
    

    This ensures that the client code will think about a more appropriate way to parse the version number than using the float builtin; however, it doesn't prevent people from performing an inappropriate string comparison the way the tuple form does.
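
Here's a quick comparison sketch showing why the tuple form is the safer default: tuples compare element-wise as numbers, while strings compare character-by-character.

>>> (0, 10) > (0, 9)   # tuple comparison does the right thing
True
>>> '0.10' > '0.9'     # string comparison does not
False
>>> float('0.10') > float('0.9')  # float parsing is just as wrong
False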

Footnotes

[*]

In … and 0.10 follows 0.9

[†]

Which seems to invalidate the implied conclusion of my title How not to do software version numbers, which I now realize was stupidly named.