December 6, 2011

Reviews redux

"Whoa, Billy reviewed a one-meg patch to the hairiest part of the codebase in just two hours!" [*]

It's pretty easy to identify what's wrong with that sentence: the speed of a review is not an achievement. Billy could have literally just checked the "yes, I reviewed it" button without looking at the patch.

... but an empty review looks pretty bad, especially as the size of the patch grows. So maybe Billy padded it out by identifying two hours' worth of style nits and asking for a few comments here and there. In any case, the code quality is no more assured after the review than before it.

Conventional wisdom is that it's economically prudent to do good code reviews: finding defects early incurs the lowest cost, review has a 'peer pressure'-based motivation towards quality improvement, and review creates knowledge redundancy that mitigates the bus effect. In the research literature on code review, effectiveness is typically measured as "defects found per KLoC" (thousand lines of code). [†] However, that metric ignores the dimension of "time per review": I'm going to argue that the time it takes to give a good review varies with the complexity and size of the modifications.

Now, one can argue that, if Billy does anything more than ignorantly checking the little "I've reviewed this" box, he has the potential to add value. After all, code isn't going to be defect-free when it comes out of review, so we're just talking about a difference in degree. If we assume that truly obscure or systematic bugs won't jump out from a diff, what additional value is Billy really providing by taking a long time?

This is where it gets tricky. I think the reason that folks can have trouble deciding how long reviews should take is that we don't know what a review really entails. When I request that somebody review my patch, what will they try to suss out? What kind of code quality (in terms of functional correctness and safety) is actually being assured at the component level, across all reviewed code?

If you can't say that your reviews ensure some generally understood level of code quality (i.e. certain issues have definitively been considered), it's hard to say that you're using reviews as an effective tool.

Aside: even with clear expectations for the code review process, each party has to exercise some discipline and avoid the temptation to lean on the other party. For mental framing purposes, it's a defect-finding game in which you're adversaries: the developer wants to post a patch with as few defects as possible and the reviewer wants to find as many defects as they possibly can within a reasonable window of time.

A few best practices

From the research I've read on code review, these are two simple things that are supposed to increase defect-finding effectiveness:

Scan, then dig.

Do a preliminary pass to get the gist of how the patch is structured and what it's doing. Note down anything that looks fishy at a glance. Once you finish your scan, do another pass that digs into all the corner cases you can think of and inspects each line thoroughly.

Keep checklists.

Keep one checklist for self-reviews and one for reviews of everybody else's stuff. I've seen it recommended that you scan through the code once for every checklist item to do a truly thorough review.

The self-review checklist is important because you tend to repeat the same mistakes until you've learned to avoid them cold. When you introduce a defect and it gets caught, figure out where it fits into your list and make a mental and/or physical note of the example, or add it as a new category.

Having a communal checklist can also be helpful for identifying group pain points. "Everybody screws up GC-rooting JSString-derived chars sometimes" is easily codified in a communal checklist document that the whole team can reference. In addition, this document helps newcomers avoid potential pitfalls and points out areas of the code that could generally be more usable / less error-prone.
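
To make that checklist entry concrete, here's a minimal, self-contained sketch of the pitfall it describes. GCString, fakeCollect, and friends are made-up stand-ins, not real SpiderMonkey API; they just simulate character storage that a moving collector can relocate out from under a raw pointer.

    // Invented stand-ins for a GC-managed string and a collection pass.
    #include <cstdio>
    #include <cstdlib>
    #include <cstring>

    // Simulates a GC-managed string whose storage the collector may move.
    struct GCString {
        char *storage;
    };

    // Simulates a collection: relocates the storage and clobbers the old copy.
    void fakeCollect(GCString *str) {
        std::size_t len = std::strlen(str->storage);
        char *moved = static_cast<char *>(std::malloc(len + 1));
        std::memcpy(moved, str->storage, len + 1);
        std::memset(str->storage, '?', len);  // scribble over the old chars
        std::free(str->storage);
        str->storage = moved;
    }

    // The pitfall: a raw pointer to the chars is held across an operation that
    // can trigger a collection, so it dangles by the time it's used.
    void buggy(GCString *str) {
        const char *chars = str->storage;
        fakeCollect(str);                // anything that can allocate can collect
        std::printf("%s\n", chars);      // use-after-free
    }

    // The fix in this toy model: re-fetch the chars (or copy them) after any
    // call that can trigger a collection.
    void fixed(GCString *str) {
        fakeCollect(str);
        std::printf("%s\n", str->storage);
    }

    int main() {
        GCString str;
        str.storage = static_cast<char *>(std::malloc(6));
        std::memcpy(str.storage, "hello", 6);
        fixed(&str);     // prints "hello"
        (void)&buggy;    // not called: running it is a use-after-free
        std::free(str.storage);
        return 0;
    }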

Here's another nice summary of more effective practices.

I'm personally of the opinion that, if you find something you think is defective, you should try to write a test that demonstrates it. The beneficial outcomes of this are that you either demonstrate the defect concretely with a failing test or discover that your concern wasn't actually a defect, and a test that does expose a defect can stick around as a regression test once the fix lands.
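
Here's a sketch of what that can look like. The helper clampToLength and the suspected off-by-one are invented for illustration, and a real version would be written against the project's test harness rather than bare asserts.

    // Demonstrating a suspected defect with a test instead of a review comment.
    #include <cassert>
    #include <string>

    // Code under review: truncate s to at most `max` characters. A reviewer
    // suspects the boundary case (s.size() == max) is mishandled.
    std::string clampToLength(const std::string &s, std::string::size_type max) {
        return s.size() > max ? s.substr(0, max) : s;
    }

    int main() {
        // Either an assertion fires and the defect is demonstrated concretely,
        // or they all pass and the concern is resolved; either way the test can
        // join the suite as a regression guard.
        assert(clampToLength("abcdef", 3) == "abc");
        assert(clampToLength("abc", 3) == "abc");  // the flagged boundary case
        assert(clampToLength("", 3).empty());
        return 0;
    }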

I think in an ideal situation there are also linter tools in place to avoid style nits altogether: aside from nits masquerading as legitimate review comments, automatically enforced stylistic consistency is nice.

Footnotes

[*]

Just in case you were wondering, Billy is not an actual person. I think I started using Billy as my hypothetical example person's name after I saw this fairly amusing video.

[†]

In the literature almost every substantive comment is grouped under the term "defect". This includes things like design decisions and suggested factorings. In the same sense that finding a behavioral error early has benefit, finding these issues early helps improve the overall quality of the product going forward.