Archive for the ‘Python’ Category

Two postfix operations redux: sequence points

Sunday, January 10th, 2010

Get ready for some serious language lawyering.

I was going back and converting my old entries to reStructuredText when I found an entry in which I was wrong! (Shocking, I know.)

C

Stupid old me didn’t know about sequence points back in 2007: the effects of the ++ operator in the C expression i++ * i++ are in an indeterminate state of side-effect completion until one of the language-defined sequence points is encountered (i.e. a semicolon or function invocation).

From the C99 standard 6.5.4.2 item 2 regarding the postfix increment and decrement operators:

The result of the postfix ++ operator is the value of the operand. After the result is obtained, the value of the operand is incremented. The side effect of updating the stored value of the operand shall occur between the previous and the next sequence point.

Therefore, the compiler is totally at liberty to interpret that expression as:

mov lhs_result, i     ; Copy the values of the postincrement evaluation.
mov rhs_result, i     ; (Which is the original value of i.)
mul result, lhs_result, rhs_result
add i, lhs_result, 1
add i, rhs_result, 1  ; Second increment clobbers with the same value!

This results in the same result as the GCC compilation in the referenced entry: i is 12 and the result is 121.

As I mentioned before, the reason this can occur is that nothing in the syntax forces the first postincrement to be evaluated before the second one. To give an analogy to concurrency constructs: you have a kind of compile-time "race condition" in your syntax between the two postincrements that could be solved with a sequence point "barrier". [*]

In this assembly, those adds can float anywhere they like after their corresponding mov instruction and can operate directly on i instead of the temporary if they’d prefer. Here’s an possible sequence that results in a value of 132 and i as 13.

mov lhs_result, i ; Gets the original 11.
inc i             ; Increment in-place after the start value is copied.
mov rhs_result, i ; Gets the new value 12.
inc i             ; Increment occurs in-place again, making 13.
mul result, lhs_result, rhs_result

Even if you know what you’re doing, mixing two postfix operations, or any side effect, using the less obvious sequence points (like function invocation) is dangerous and easy to get wrong. Clearly it is not a best practice. [†]

Java

The postincrement operation appears to have sequence-point-like semantics in the Java language through experimentation, and it does! From the Java language specification (page 416):

The Java programming language also guarantees that every operand of an operator (except the conditional operators &&, ||, and ? :) appears to be fully evaluated before any part of the operation itself is performed.

Which combines with the definition of the postfix increment expression (page 485):

A postfix expression followed by a ++ operator is a postfix increment expression.

As well as left-to-right expression evaluation (page 415):

The left-hand operand of a binary operator appears to be fully evaluated before any part of the right-hand operand is evaluated.

To a definitive conclusion that i++ * i++ will always result in 132 == 11 * 12 and i == 13 when i == 11 to start.

Python

Python has no increment operators specifically so you don’t have to deal with this kind of nonsense.

>>> count = 0
>>> count++
  File "<stdin>", line 1
    count++
          ^
SyntaxError: invalid syntax

Annoyingly for newbies, though, it looks like ++count is a valid expression that happens to look like preincrement.

>>> count = 0
>>> ++count
0
>>> --count
0

They’re actually two unary positive and negative operators, respectively. Just one of the hazards of a context free grammar, I suppose.

Footnotes

[*] I threw this in because the ordeal reminds me of the classic bank account concurrency problem. If it’s more confusing than descriptive, please ignore it. :-)
[†]

Since function invocation defines sequence points, I thought this code sequence guaranteed those results:

#include <stdio.h>
 
int identity(int value) { return value; }
 
int main() {
        int i = 11;
        printf("%d\n", identity(i++) * identity(i++));
        printf("%d\n", i);
        return 0;
}

As Dan points out, the order of evaluation is totally unspecified — the left hand and right hand subexpression can potentially be evaluated concurrently.

Why you should bother!

Friday, October 2nd, 2009

I write this entry in response to Why Should I Bother?. The answer, in short, is that I find Python to be a great language for getting things done, and you shouldn’t let stupid interviewers deter you from learning a language that allows you to get more done.

I think I’m a pretty tough interviewer, so I also describe the things I’d recommend that a Python coder knows before applying to a Java position, based on my own technical-interview tactics.

A spectrum of formality

As my friend pointed out to me long ago, many computer scientists don’t care about effective programming methods, because they prefer theory. Understandably, we computer scientists qua programmers (AKA software engineers) find ourselves rent in twain.

As a computer science degree candidate, you are inevitably enamored with complex formalities, terminology, [*] and a robust knowledge of mathematical models that you’ll use a few times per programming project (if you’re lucky). Pragmatic programming during a course of study often takes a back seat to familiar computer science concepts and conformance to industry desires.

As a programmer, I enjoy Python because I find it minimizes boilerplate, maximizes my time thinking about the problem domain, and permits me to use whichever paradigm works best. I find that I write programs more quickly and spend less time working around language deficiencies. Importantly, the execution model fits in my brain.

Real Computer Scientists (TM) tend to love pure-functional programming languages because they fit into mathematical models nicely — founded on recursion, Curry-Howard isomorphism, and what have you — whereas Python is strongly imperative and, in its dynamism, lacks the same sort of formality.

Languages like Java sit somewhere in the middle. They’re still strongly imperative (there are no higher-order functions in Java), but there are more formalities. As the most notable example, compile-time type checking eliminates the possibility of type errors, which gives some programmers a sense of safety. [†] Such languages still let scientists chew on some computer sciencey problems; for example, where values clash with the type system, like provably eliminating NullPointerExceptions, which is fun, but difficult!

As the cost of increased formality, this class of languages is more syntax-heavy and leans on design patterns to get some of the flexibility dynamic typing gives you up front.

It’s debatable which category of languages is easiest to learn, but Java-like languages have footholds in the industry from historical C++ developer bases, Sun’s successful marketing of Java off of C++, and the more recent successes of the C# .NET platform.

It makes sense that we’re predominantly taught this category of languages in school: as a result, we can play the percentages and apply for most available developer jobs. Given that we have to learn it, you might as well do some throw-away programming in it now and again to keep yourself from forgetting everything; however, I’d recommend, as a programmer, that you save the fun projects for whichever language(s) that you find most intriguing.

I picture ease-and-rapidity of development-and-maintenance on a spectrum from low to high friction — other languages I’ve worked in fall somewhere on that spectrum as higher friction than Python. Though many computer scientists much smarter than I seem to conflate formality and safety, I’m fairly convinced I attain code completion and maintainability goals more readily with the imperative and flexible Python language. Plus, perhaps most importantly, I have fun.

My technical-interview protocol

The protocol I use to interview candidates is pretty simple. [‡]

  • Analyze their background experience for potential weaknesses that would affect their ability to perform in the position.
  • Don’t let the candidate talk about anything until the last n minutes, when they can ask questions or comment on the interview experience.
  • Hone in on the areas of potential weakness to determine how bad they are (if they’re bad at all).
  • Evaluate how easy it is to overcome those weaknesses. Determine where their level in relevant areas falls in the Dreyfus model and how much it matters to the description of the position.

Potential Java interview weaknesses

Interviewing a candidate whose background is primarily Python based for a generic Java developer position (as in Sayamindu’s entry), I would immediately flag the following areas as potential weaknesses:

Primitive data types

A programmer can pretty much get away never knowing how a number works in Python, since you typically overflow to appropriately sized data types automatically.

The candidate needs to know what all the Java primitives are when the names are provided to them, and must be able to describe why you would choose to use one over another. Knowing pass-by-value versus pass-by-reference is a plus. In Python there is a somewhat similar distinction between mutable and immutable types — if they understand the subtleties of identifier binding, learning by-ref versus by-value will be a cinch. If they don’t know either, I’ll be worried.

Object oriented design

The candidate’s Python background must not be entirely procedural, or they won’t fare well in a Java environment (which forces object orientation). Additionally, this would indicate that they probably haven’t done much design work: even if they’re an object-orientation iconoclast, they have to know what they’re rebelling against and why.

They need to know:

  • When polymorphism is appropriate.
  • What should be exposed from an encapsulation perspective.
  • What the benefits of interfaces are (in a statically typed, single-inheritance language).

Basically, if they don’t know the fundamentals of object oriented design, I’ll assume they’ve only ever written "scripts," by which I mean, "Small, unimportant code that glues the I/O of several real applications together." I don’t use the term lightly.

Unit testing

If they’ve been writing real-world Python without a single unit test or doctest, they’ve been Doing it Wrong (TM).

unittest is purposefully modeled on xUnit. They may have to learn the new jUnit 4 decorator syntax when they start work, but they should be able to claim they’ve worked with a jUnit 3 -like API.

Abstract data structures

Python has tuples, lists and dictionaries — all polymorphic containers — and they’ll do everything that your programmer heart desires. [§] Some other languages don’t have such nice abstractions.

It’d be awesome if they knew:

  • The difference between injective and bijective and how those terms are important to hash function design. If they can tell me this, I’ll let them write my high-performance hash functions.

They must know (in ascending importance):

  • The difference between a HashMap and a TreeMap.
  • The difference between a vector and a linked list, or when one should preferred over the other. The names are unimportant — I’d clarify that a vector was a dynamically growing array.
  • The "difference" between a tree and a graph.

Turning attention to you, the reader: if you’re lacking in data structures knowledge, I recommend you read a data structures book and actually implement the data structures. Then, take a few minutes to figure out where you’d actually use them in an application. They stick in your head fairly well once you’ve implemented them once.

Some interviewers will ask stupid questions like how to implement sorting algorithms. Again, just pick up a data structures book and implement them once, and you’ll get the gist. Refresh yourself before the interview, because these are a silly favorite — very few people have to implement sorting functions anymore.

Design patterns

Design patterns serve several purposes:

  • They establish a common language for communicating proposed solutions to commonly found problems.
  • They prevent developers for inventing stupid solutions to a solved class of problems.
  • They contain a suite of workarounds for inflexibilities in statically typed languages.

I would want to assure myself that you had an appropriate knowledge of relevant design patterns. More important than the names: if I describe them to you, will you recognize them and their useful applications?

For example, have you ever used the observer pattern? Adapter pattern? Proxying? Facade? You almost certainly had to use all of those if you’ve done major design work in Python.

Background concepts

These are some things that I would feel extra good about if the candidate knew and could accurately describe how they relate to their Python analogs:

  • The importance of string builders (Python list joining idiom)
  • Basic idea of how I/O streams work (Python files under the hood)
  • Basic knowledge of typecasting (Python has implicit polymorphism)

Practical advice

Some (bad) interviewers just won’t like you because you don’t know their favorite language. If you’re interviewing for a position that’s likely to be Java oriented, find the easiest IDE out there and write an application in it for fun. Try porting a Python application you wrote and see how the concepts translate — that’s often an eye-opener. Or katas!

If you find yourself unawares in an interview with these "language crusaders," there’s nothing you can do but show that you have the capacity to learn their language in the few weeks vacation you have before you start. If it makes you feel better, keep a mental map from languages to number of jerks you’ve encountered — even normalizing by developer-base size the results can be surprising. ;-)

Footnotes

[*] Frequently unncessary terminology, often trending towards hot enterprise jargon, since that’s what nets the most jobs and grant money.
[†] Dynamic typing proponents are quick to point out that this doesn’t prevent flaws in reasoning, which are the more difficult class of errors, and that you’ll end up writing tests for these anyway.
[‡] Clearly candidates could exploit a vulnerability in my interview protocol: leave off things they know I’m likely to test that they know particularly well; however, I generally ask them to stop after I’m satisfied they know something. Plus, the less I know about their other weaknesses the more unsure I am about them, and thus the less likely I am to recommend them.
[§] Though not necessarily in a performant way; i.e. note the existence of collections.deque and bisect. Knowing Python, I’d quiz the candidate to see if they knew of the performant datatypes.

Learning Python by example: RC4

Sunday, September 20th, 2009

One of my friends at work was fakeplaining [*] that he had been on the Python programming mailing list at work for some time, yet still did not know Python. Being hopelessly suggestible in the face of obvious sarcasm, I decided to sacrifice a few hours of sleep to the god of blog. [†]

Note that this entry is aimed at people who already know how to program and have been looking for a tidbit to try in Python.

There are a lot of side notes I’ve left out for simplicity of explanation; however, I also attempted to make the experience interesting by introducing one of Python’s more advanced features, called "generator functions," into the mix. Hopefully it strikes a balance. Please comment if you are utterly confused by generators — I may add an alternate section that allows the reader to avoid them altogether.

You kata wanna…

A number of leaders in the programming community are hot on this trend called "code katas." I’m actually a big fan of this trend, mainly because I’ve been writing code for no reason, hating on it, throwing it away, and subsequently rewriting it for my entire life, but I now get to call it something cool- and ninja-sounding. Doing such things in my spare time is no longer considered "inexcusable nerdiness;" rather, it’s my small endeavor to bring professionalism to the field of software engineering. *cough*

One reason that I really enjoy this new trend is that programmers are posting their own morsel-sized programming problems left and right, giving ample opportunities to explore new languages (and dusty corners of ones you know well) without accidentally becoming BDFL of a seminal framework or utility. [‡]

RC4 Pseudocode

Case in point, I’ll use the recent kata from Programming Praxis for this Python exercise, as they provide straightforward pseudocode. Here’s the encryption algorithm named RC4, as quoted from Programming Praxis:

The key array K is initialized like this:

for i from 0 to 255
    K[i] := i
 
j := 0
 
for i from 0 to 255
    j := (j + K[i] + key[i mod klen]) mod 256
    swap K[i], K[j]

Once the key array is initialized, the pseudo-random byte generator uses a similar calculation to build the stream of random bytes:

i := 0
j := 0
 
start:
    i := (i + 1) mod 256
    j := (j + K[i]) mod 256
    swap K[i], K[j]
    output K[ (K[i] + K[j]) mod 256 ]
    goto start

The first step in writing our RC4 program is to translate this pseudocode to Python, while the second step is to add a command line front-end for some off-the-cuff implementation experience.

If you’d like to look at the final program ahead of time, grab a copy of my reference implementation.

Porting the initialization

For initialization, we use a provided key to calculate a 256 entry integer sequence. Open a new file called rc4.py and write the following function:

def initialize(key):
    """Produce a 256-entry list based on `key` (a sequence of numbers) as
    the first step in RC4.
    Note: indices in key greater than 255 will be ignored.
    """
    k = range(256)
    j = 0
    for i in range(256):
        j = (j + k[i] + key[i % len(key)]) % 256
        k[i], k[j] = k[j], k[i]
    return k

The simplicity of the translation demonstrates why Python is sometimes called "executable pseudocode". Breaking it down line by line:

  1. defines a function named initialize that takes a single argument, key.
  2. A documentation string ("docstring" for short). In Python, documentation is associated with a function even at runtime, in contrast to traditional JavaDoc or POD. [§] If the first statement in a function is a string literal, it is used as the docstring for that function. [¶]
  1. The built-in range function returns a list of values. [#] "Built-in" is the terminology used for items that are "available all the time without explicitly importing anything."

    This function also has a two-argument form, range(start, stop); however, in the single argument form, start has a default of 0, so the function invocation returns a list of all the integers in the mathematical interval [0, 256), for a total of 256 values.

  1. There is only one for loop syntax: for [identifier] in [iterable]. Lists are iterable because they contain a sequence of objects.
  2. Finite collections also support the built-in function len([sizable]). The way that numerical arithmetic works and sequence indexing via seq[idx] should be familiar.
  3. Python has an elegant swap capability — what’s important to note is that the entire right hand side is evaluated, then assigned to the left hand side.
  4. Python functions optionally return a value. If no return statement is encountered, None is returned, which indicates the absence of a value (docs).

Generators: functions that pause

Python has a convenient feature, called "generator functions," that allows you to create a stream of values using function-definition syntax. [♠] You can think of generator functions as special functions that can pause and resume, returning a value each time it pauses.

The canonical example illustrates the concept well — use the interactive Python shell to explore how generator functions work, by running the python command without arguments. Make sure the version is python2.3 or above. Once you’re in the interactive shell, type the following:

>>> def gen_counter():
...     i = 0
...     while True:
...         yield i
...         i += 1
...
>>>

Note the use of a yield statement, which tells Python that it is a generator function. Calling a generator function creates an iterable generator object, which can then produce a potentially infinite series of values:

>>> counter = gen_counter()
>>> print counter
<generator object gen_counter at 0xb7e3fbe4>
>>> counter.next()
0
>>> counter.next()
1

Note that because the stream of values is potentially infinite and lazily evaluated, there’s no concept of length: it’s not representative of a container so much as a sequence:

>>> len(counter)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: object of type 'generator' has no len()

Also important to note is that none of the local values in the generator instances are shared; i.e. instantiating a second generator has no effect on the first:

>>> one_counter = gen_counter()
>>> another_counter = gen_counter()
>>> one_counter.next()
0
>>> another_counter.next()
0

Since generators are iterable, you can use them in a for loop just like containers. (But watch out for infinite generators and no break condition in your for loop!)

If you’re still confused, the official tutorial’s section on generators may help to clarify things. For an in-depth look at generators and why they’re awesome, David M. Beazley’s PyCon presentation on generators is excellent.

Applying generators to RC4

This dove-tails nicely with the second part of the algorithm, which requires a stream of values to XOR against. The generator is nearly a direct translation from the pseudocode, which you may also add to rc4.py:

def gen_random_bytes(k):
    """Yield a pseudo-random stream of bytes based on 256-byte array `k`."""
    i = 0
    j = 0
    while True:
        i = (i + 1) % 256
        j = (j + k[i]) % 256
        k[i], k[j] = k[j], k[i]
        yield k[(k[i] + k[j]) % 256]

Each time .next() is called on the generator instance, the function executes until the first yield statement is encountered, produces that value, and saves the function state for later.

Yes, we could create a big list of pseudo-random values the length of the text, but creating them all at the same time adds O(len(text)) memory overhead, whereas the generator is constant memory overhead (and computationally efficient).

Tying it together

Now we just need a function that does the XORing, which teaches us about strings and characters.

def run_rc4(k, text):
    cipher_chars = []
    random_byte_gen = gen_random_bytes(k)
    for char in text:
        byte = ord(char)
        cipher_byte = byte ^ random_byte_gen.next()
        cipher_chars.append(chr(cipher_byte))
    return ''.join(cipher_chars)

Line by line:

  1. An empty list cipher character accumulator is created.
  2. The generator object is instantiated by calling the generator function.
  3. As you can see from the for loop, Python strings are iterable as sequences of characters. Characters in Python are just strings of length one, so you can think of a string iterator as stepping over all of its one-character substrings in order.
  1. To convert a textual character into its character-code numerical value, the built-in ord function is used (docs).
  2. The meat of the algorithm: XOR the textual character with the next pseudo-random byte from the byte stream.
  3. After obtaining the cipher-byte through the XOR, we want to convert back to a textual (character) representation, which we do via the built-in chr function (docs). We then place that character into a sequence of cipher characters. [♥]
  4. To join together characters to form a string, we use the str.join([iterable]) method (docs). [♦] Note that, on some platforms, this is much more efficient than using += (for string concatenation) over and over again. It’s a best practice to use this sequence-joining idiom to avoid possible concatenation overhead. [♣]

Front-end fun

If you thought that the pseudo-code translation looked like a piece of cake, you may feel up to a challenge: write a command line interface that:

  1. Asks for an encryption key.
  2. Turns the key to a sequence of integer values and initializes with it.
  3. Continually asks for user-provided text to translate and spits out the corresponding cipher text.

What you need to know

  • In Python 2.x print is a statement that is followed by comma-separated values, where each comma turns into a space. The print statement puts a newline at the end of its output by default:

    >>> print 'a', 'b', 'c', 'd'
    a b c d

    To suppress the newline (for example, in a loop), leave on a trailing comma:

    >>> for char in ['a', 'b', 'c', 'd']:
    ...     print char,
    ...
    a b c d

    If there’s something that you can’t do using the above, I’ll refer you to the Python tutorial’s section on fancier output formatting.

  • The built-in function called raw_input (docs) displays a message and then requests user input as a command line, returning the user input as a string. For example:

    name = raw_input('What is your name? ')
    print 'Your name is', name
  • The built-in function called repr (docs) returns the Python representation of a string (or any object) — this is useful for escaping strings with funky, non- printable characters in it, as our cipher algorithm is likely to do. For example, you’ll probably want to use something along the lines of:

    cipher_text = run_rc4(k, text)
    print 'Cipher text:', repr(cipher_text)
  • The character-escaping performed by repr can be reversed using the string decode method: str.decode('string_escape'). For example:

    >>> text_with_escapes = '\\x61\\x62\\x63'
    >>> print text_with_escapes
    \x61\x62\x63
    >>> print text_with_escapes.decode('string_escape')
    abc

    So if you want to allow the user to enter ciphertext at the command prompt, you can read it in and decode it from the escaped format.

  • If you just put the CLI front-end code to execute at the bottom of the rc4.py file, it will work; however, it’s a best practice to test to make sure that rc4.py is the file that’s being executed. To do this, wrap the command line interface code in a test like the following:

    if __name__ == '__main__': # This is the file being executed.
        main()

If you need help

I wrote a reference implementation and posted it to github — feel free to check it out if you get stuck.

Here’s a sample usage of my implementation:

===========
RC4 Utility
===========
The output values are valid Python strings. They may contain escape characters
of the form \xhh to avoid confusing your terminal emulator. Only the first 256
characters of the encryption key are used.
 
Enter an encryption key: an encryption key!
 
Enter plain or cipher text: Victory is mine!
Your RC4 text is: '\xecL\xce(\x16\x8e3\xf02!\xcd\xc6\x9a\xc0j\x98'
 
Enter plain or cipher text: '\xecL\xce(\x16\x8e3\xf02!\xcd\xc6\x9a\xc0j\x98'
Unescaping ciphertext...
Your RC4 text is: 'Victory is mine!'

Once you find that your cipher is reversible, you’ve probably got it right!

Again, please comment if anything is unclear.

Footnotes

[*] Also known as "fitching." Often performed by those in brillig, slithy toves.
[†] Could God make a blog entry so long and boring that God would proclaim "TL:DR?"
[‡] If I had a nickel for every time this happened to me, I would have no nickels.
[§] This allows you to reflect on things and extract their documentation, which comes in handy when you’re running in an interactive Python session or spitting out module-level documentation in a command line argument parser.
[¶] This same rule applies to classes and modules, which are beyond the scope of this entry.
[#] Python lists are mutable sequences, implemented as vector ADTs under the hood.
[♠] The same task can be accomplished with a custom iterator class, but generators are much more concise and more readable — note that the generator that we end up with reads just like the pseudocode!
[♥] Note that the language having a built-in join(iterable) method on its string datatype eliminates the need for every iterable type to implement some form of iterable.join(str).
[♦] There’s a way to use generators here as well, but the list of characters makes things simpler to understand for the moment. If you’re feeling confident, convert this function to be a generator function at the end of the exercise and make it work with the rest of the program.
[♣] It’s bad practice to assume you’ll always be running on CPython — there are also JVM and .NET (CLR) interpreters. Remember, thou shalt not claimeth that, "all the world’s a VAX!"