December 18, 2012

Quick tips for getting into systems programming

In reply

Andrew (@ndrwdn) asked a great followup question to the last entry on systems programming at my alma mater:

@cdleary Just read your blog post. Are there any resources you would recommend for a Java guy interested in doing systems programming?

What follows are a few quick-and-general pointers on "I want to start doing lower level stuff, but need a motivating direction for a starter project." They're somewhat un-tested because I haven't mentored any apps-to-systems transitions, but, as somebody who plays on both sides of that fence, I think they all sound pretty fun.

A word of warning: systems programming may feel crude at first compared to the managed languages and application-level design you're used to. However, even among experts, the prevalence of footguns motivates simple designs and APIs, which can be a beautiful thing. As a heuristic, when starting out, just code it the simple, ungeneralized way. If you're doing something interesting, hard problems are likely to present themselves anyhow!

Microcontrollers rock

Check out sites like hackaday.com to see the incredible feats that people accomplish through microcontrollers and hobby time. When starting out, it's great to get the tactile feedback of lighting up a bright blue LED or successfully sending that first UDP packet to your desktop at four in the morning.

Microcontroller-based development is also nice because you can build up your understanding of C code, if you're feeling rusty, from basic usage — say, keeping everything you need to store as a global variable or array — to fancier techniques as you improve and gain experience with what works well.

Although I haven't played with them specifically, I understand that Arduino boards are all the rage these days — there are great tutorials and support communities out on the web that love to help newbies get started with microcontrollers. AVR freaks was around even when I was programming on my STK500. I would recommend reading some forums to figure out which board looks right for you and your intended projects.

At school, people really took to Bruce Land's microcontroller class, because you can't help but feel the fiero as you work towards more and more ambitious project goals. Since that class is still being taught, look to the exercises and projects (link above) as good examples of what's possible with bright students and four credits worth of time. [*]

Start fixing bugs on low-level open source projects

Many open source projects love to see willing new contributors. Especially check out projects a) that are known for having good/friendly mentoring and b) that you think are cool (which will help you stay motivated).

I know one amazing person I worked with at Mozilla got into the project by taking his time to figure out how to properly patch some open bugs. If you take that route, either compare your patch to what the project member has already posted, or request that somebody give you feedback on your patch. This is another good way to pick up mentor-like connections.

Check out open courseware for conceptual background

I personally love the rapid evolution of open courseware we're seeing. If you're feeling confident, pick a random low-level thing you've heard-of-but-never-quite-understood, type it into a search engine, and do a deep dive on a lecture or series. If you want a more structured approach, a simple search for systems programming open courseware turns up some quite educational-looking results.

General specifics: OSes and reversing

@cdleary Some general but also OS implementation and perhaps malware analysis/RE.

OSes

If you're really into OSes, I think you should just dive in and try writing a little kernel on top of your hardware of choice in qemu (a hardware emulator). Quick searches turn up some seemingly excellent tutorials on writing simple OS kernels on qemu, and writing simple OSes for microcontrollers is often a student project topic in courses like the one I mention above. [†]

With some confidence, patience, maybe a programming guide, and recall of some low-level background from school, I think this should be doable. Some research will be required on effective methods of debugging, though — that's always the trick with bare metal coding.

Or, for something less audacious sounding: build your own Linux kernel with some modifications to figure out what's going on. There are plenty of guides on how to do this for your Linux distribution of choice, and you can learn a great deal just by fiddling around with code paths and using printk. Try doing something on the system (in userspace) that's simple to isolate in the kernel source using grep — like mmapping /dev/mem or accessing an entry in /proc — to figure out how it works, and leave no stone unturned.

I recommend taking copious notes, because I find that's the best way to trace out any complex system. Taking notes makes it easy to refer back to previous realizations and backtrack at will.

Read everything that interests you on Linux Kernel Newbies, and subscribe to kernel changelog summaries. Attempt to understand things that interest you in the source tree's /Documentation. Write a really simple Linux Kernel Module. Then, refer to freely available texts for help in making it do progressively more interesting things. Another favorite read of mine was Understanding the Linux Kernel, if you have a hobby budget or a local library that carries it.

Reversing

This I know less about — pretty much everybody I know that has done significant reversing is an IDA wizard, and I, at this point, am not. They are also typically Win32 experts, which I am not. Understanding obfuscated assembly is probably a lot easier with powerful and scriptable tools of that sort, which ideally also have a good understanding of the OS. [‡]

However, one of the things that struck me when I was doing background research for attack mitigation patches was how great the security community was at sharing information through papers, blog entries, and proof of concept code. Also, I found that there are a good number of videos online where security researchers share their insights and methods in the exploit analysis process. Video searches may turn up useful conference proceedings, or it may be more effective to work from the other direction: find conferences that deal with your topic of interest, and see which of those offer video recordings.

During my research on security-related things, a blog entry by Chris Rohlf caused Practical Malware Analysis to end up on my wishlist as an introductory text. Seems to have good reviews all around. Something else to check out on a trip to the library or online forums, perhaps.

Footnotes

[*]

At the end of the page somebody notes: "This page is transmitted using 100% recycled electrons." ;-)

[†]

Also, don't pass up a chance to browse through the qemu source. Want to know how to emulate a bunch of different hardware efficiently? Use the source, Luke! (Hint: it's a JIT. :-)

[‡]

One other neat thing we occasionally used for debugging at Mozilla was a VMware-based time-traveling virtual machine instance. It sounded like they were deprecating it a few years back, so I'm not sure of its status, but if it's still around it would literally allow you to play programs backwards!

What if MoCo were hit by a bus?

I went to my first Hybrid Factory event with some of my teammates this past evening, which was kind of like a mini-workshop. The event was titled, "How to effectively work as a Tech Lead," given by Derek Parham.

One of the main topics was delegation: the audience interaction portion of the event asked us to consider how we would approach delegating all of our current tasks to other people on our teams.

During this exercise sstangl brought up something profound: rather than other Mozilla Corporation-employed teammates, how much of our current task load could we delegate to the community members outside of Mozilla Corporation (AKA MoCo)? These are the people who voluntarily devote their time and effort towards advancing the Mozilla project.

Of course, the talk also covered the classic "bus test," which asks, "If [person X] were hit by a bus, how would we continue to function?" It wasn't a big leap to my asking, "If all of MoCo were hit by a bus, how well situated is the community to carry our outstanding tasks and projects?"

Like all fun hypotheticals, it's far fetched and a bit apocalyptic, but it forces you to think about your team's work and coordination methods in a very different light.

I suppose a related, follow-up question is: if the Mozilla organization is designed to empower a worldwide community, but we couldn't survive a MoCo bus scenario, then are we managing the project in a sustainable way?

Maybe people who oversee the project as a whole (and those who are more familiar with the philosophy behind our governance) have a definitive answer. In any case, it's interesting food for thought.

Contributing to SpiderMonkey

My latest experiment is "slide casting" — here's Contributing to SpiderMonkey (a slidecast that's less than four minutes long):

Transcript

Contributing to SpiderMonkey

This is a short presentation about contributing to Mozilla's SpiderMonkey JavaScript engine.

Business

As a guy who writes code, there are a few basic things I ask before I jump into any project:

  • Will I learn?

  • Do the people rock?

  • Will it ship?

  • Does it matter?

I guarantee that you'll learn from working on SpiderMonkey. It's an important language implementation created by some of the most brilliant and approachable coders I've ever had the privilege of working with.

We ship like clockwork. When a patch gets submitted to trunk, it's in the Firefox release 18 weeks later.

Hack: this technology could fall into the right hands [image]

And when it comes to finding meaning in your work, Mozilla makes life easy.

In my opinion, the Mozilla mission is technological utopianism at its finest. If you believe in using your technological superpowers to help people, Mozilla is for you.

Wrench

If you know how to write and debug C++ code, you have the skills to hack SpiderMonkey. We don't go crazy with the C++, so C coders should also feel pretty confident. The only tools required are the ones necessary to build Firefox.

SpiderMonkey is a language implementation, but don't let that get to you. Once you get your hands dirty (say, by fixing a few minor bugs) you'll realize that language VMs are no different from the other systems-level programs that you know and love.

See

The Mozilla project coordinates effort through Bugzilla. Every bit of work that we intend to do on the engine is tracked as a bug under the "JavaScript Engine" component at bugzilla.mozilla.org.

The JS team tries to tag good first bugs. If you see a good first bug that interests you, feel free to go in and make a comment stating your interest.

If you'd like to ease into development a little more, you can check out the latest ECMAScript specification and use that to create tests for our test suite. This is a great way to ensure SpiderMonkey engine quality and cross-browser compatibility.

Do

In typical open-source style, once you've found something that interests you, hack away!

And feel free to sample from the buffet: every bug that you work on teaches you about a different aspect of the engine.

You may also stumble onto a part of the engine that you'd like to specialize in. We love having domain experts hacking on our code as well!

Code

Once you've made a working improvement to the engine, make sure you get your work in! Add your changes as an attachment to an existing bug, or create a new bug in the JavaScript Engine component.

When you improve the engine, you can get your name added to about:credits, in a product that ships to something like half a billion users, which I think is pretty cool.

Lots of great details and walkthroughs are available on the "New to SpiderMonkey" wiki page.

Barrel (#coding, #jsapi)

Friendly people hang around in these IRC channels at irc.mozilla.org. #coding is for general questions, whereas #jsapi is for JS engine internals discussion. You can easily install ChatZilla as a Firefox add-on to get on IRC.

If you've had bad experiences with IRC in the past, fear not! I know, from personal experience, that the IRC moderators in these channels have a zero-tolerance policy for disrespectful behavior. We love (and are) our community.

On my back

I haven't provided any kind of engine internals overview, but I think this may be just enough information to get you intrepid hackers going.

I may find time to do more screencasts in the future, but don't wait on me. (I'm new to this whole medium and prefer to write code. ;-) In the meantime, there's a screencast intro on hacking the SpiderMonkey shell available on my blog.

Around

The beauty of software, especially open source, is that you can mess around without taking any risks and by satisfying very few dependencies (i.e. a computer and the ability to install open source software).

Like the slogan says, with you hacking on Mozilla, the technology may have fallen into the right hands.

So, I hope that you'll consider hacking with us!

Note

Please excuse my use of the colloquialism, "as a guy who writes code." On a second listen I realize it may be poorly placed, because I'm using my own criteria as an example of an arbitrary person who might be considering contributing to the Mozilla project. No gender implication for contributors was at all intended!

More fortunately, this note is a great opportunity for me to plug WoMoz, Mozilla's project to promote women's involvement in open source and encourage contributions. You can find members on #womoz on irc.mozilla.org.

Thoughts on the idea of "slidecasts"

Just to get this established up front: I'm super rusty at presenting any kind of material. Also, I've never tried to record a presentation on the same computer that I was reading notes off of. (Case in point: you can hear the clicking of a keypress when I change slides.)

Despite all this hedging, I'm not sure about slidecasts as a medium. I sometimes fumble when I ad-lib, so I effectively had to write out a whole script for these slides. That's why it sounds like I'm reading off of a piece of paper.

Screencasts (as opposed to slidecasts) are different because you're walking through a sequence of on-screen editing actions that are inherently difficult to put into words. It's also a lot of fun to see how somebody else uses their development environment.

Slidecasts harness the poignant messaging of slides, but lose the body language of recorded audience presentations, which is clearly an important component. Turning the slidecast script into words would have been simple, and potentially more accessible to people who don't have the time to watch video content at all.

...or maybe it's humanizing? I'm not sure. Perhaps I have to add more soaring rhetoric and fancy slides to make spoken word worthwhile.

Clearly, more experimentation is needed!

JS regexps implemented in JS

If you heart JS and want to learn more about how regular expressions work, I've got yet another fun project for you to work on.

Back when I was working more heavily on the regular expression engine, some ECMAScript spec correctness questions came up. As it turns out, the regular expression part of the specification is written in scary-sounding-but-really-not-that-hard-to-understand continuation passing style (and it really seems to make the most sense that way).

I tried to work through the first bug on paper, but I forgot to carry the one or something, and I got the wrong result. dmandelin quickly whipped up a program modeled on the spec to resolve that one example conclusively. Seeing how easy that was, I followed his lead and started working on a little library to resolve these questions in ways that save more trees.

I haven't worked on it much lately (according to this neat GitHub activity doohickey), but I put cdlre out on GitHub today, and I'd be happy to review pull requests. The regular expression specification for matching (ECMAScript revision 5 section 15.10) is easy to translate into running code, and I left a lot of features unimplemented, just for you!
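To give a flavor of what continuation passing style looks like for matching, here's a minimal sketch. It's in Python rather than JS, it is not code from cdlre, and it tracks only an input position (no captures, no flags). The names char, seq, alt, star, and match_at are illustrative assumptions, not anything from the spec or the library. Each matcher receives "what to do next" as a continuation, and backtracking falls out of trying the next alternative whenever the continuation reports failure.

```python
from typing import Callable, Optional

# A continuation takes the current input position and returns the end
# position of the overall match, or None if the rest of the match fails.
Continuation = Callable[[int], Optional[int]]
# A matcher takes the input string, a position, and a continuation.
Matcher = Callable[[str, int, Continuation], Optional[int]]


def char(ch: str) -> Matcher:
    """Match one literal character, then hand off to the continuation."""
    def m(s: str, i: int, k: Continuation) -> Optional[int]:
        if i < len(s) and s[i] == ch:
            return k(i + 1)
        return None
    return m


def seq(a: Matcher, b: Matcher) -> Matcher:
    """Match a, then b: b becomes part of a's continuation."""
    def m(s: str, i: int, k: Continuation) -> Optional[int]:
        return a(s, i, lambda j: b(s, j, k))
    return m


def alt(a: Matcher, b: Matcher) -> Matcher:
    """Try a; if the rest of the match fails, backtrack and try b."""
    def m(s: str, i: int, k: Continuation) -> Optional[int]:
        r = a(s, i, k)
        return r if r is not None else b(s, i, k)
    return m


def star(a: Matcher) -> Matcher:
    """Greedy zero-or-more repetition of a."""
    def m(s: str, i: int, k: Continuation) -> Optional[int]:
        def again(j: int) -> Optional[int]:
            # Refuse empty iterations so the loop cannot spin forever.
            if j == i:
                return None
            return m(s, j, k)
        r = a(s, i, again)
        # If no further iteration leads to success, stop repeating here.
        return r if r is not None else k(i)
    return m


def match_at(pattern: Matcher, s: str, start: int = 0) -> Optional[int]:
    """Return the end index of a match beginning at `start`, or None."""
    return pattern(s, start, lambda j: j)


# Hand-built equivalent of the pattern a(b|c)*d
pattern = seq(char("a"), seq(star(alt(char("b"), char("c"))), char("d")))
print(match_at(pattern, "abcbd"))  # 5
print(match_at(pattern, "abx"))    # None
```

The appeal of this style is that "the rest of the pattern" is always an explicit value, so a greedy repetition can give characters back just by calling its continuation at an earlier position.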

I'll just copy-pasta the rest from the project README:

Potential applications:

  • Regression testing the specification against host implementations.

  • Use in understanding why regular expressions succeed/fail to match.

  • Use in a metacircular interpreter (like Narcissus).

  • Use as a staging ground for regular expression optimizations and/or a regular expression compiler. (Such a compiler could target eval as a backend or a JIT code execution foreign function.)

Goals

  • Be capable of visualizing (or at least dumping out) the ECMAScript standard steps taken in matching a regular expression.

  • Be capable of enabling/disabling the de-facto quirks from various browsers which are not yet part of the standard.

  • Be capable of running a thorough regression suite against the host regular expression engine (presumably with a set of permitted quirk options).

  • Keep the JS code a direct translation from the spec where possible and practical.

I'm sure that hooking in a comprehensible visualization would be a helpful tool for web developers who want to harness the Indiana Jones-like power of regular expressions.

Go-go gadget community?

Thoughts on Stack Overflow

This is a short article detailing my thoughts on the recently released programming Q&A site, Stack Overflow.

Background

Historically, I've had three resources for programming questions:

Over the course of 80 days, I've found Stack Overflow to be a better resource than all three of the above, even when combined.

The way I see it, Stack Overflow (hereafter referred to as SO) is going strong for two fundamental reasons:

  1. SO baited the right community with the appropriate timing

  2. SO uses tags

Community

If you think about it, there's nothing about SO that ties it to programming questions, aside from the constitution. (On SO, the constitution is the site FAQ). All in all, SO is a Q&A framework. If it's just a Q&A framework, how did SO manage to stay on topic and under control from its inception? They took the right members in with the right timing.

The beta test population was roughly given by the following:

Podcast listeners

(Jeff's readers UNION Joel's readers) - the relatively uninterested - attrition

Beta testers

(Podcast listeners INTERSECT people that cared enough about the programming Q&A site to find an obscure signup form) - more attrition

Joel Spolsky and Jeff Atwood are both well-known in the blogosphere among readers interested in improving their programming skills and doing software the Right Way. Beginning with their reader base (already amenable to their cause) there were two significant levels of filtration, as reflected in the above pseudo-formulae, that ensured that SO started off with a group of people who a) cared and b) had a significant body of knowledge with respect to programming and good software practices. This is just the kind of constituency that you want to impart some positive momentum on a fledgling Q&A site.
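For the set-minded, the two pseudo-formulae translate directly into set operations. This is only a toy sketch: every name and population below is made up for illustration, and nothing here reflects real Stack Overflow data.

```python
# Hypothetical populations; the member names are invented for illustration.
jeff_readers = {"ada", "bob", "cy", "dee"}
joel_readers = {"cy", "dee", "eve", "fay"}
uninterested = {"bob"}
attrition = {"fay"}
found_signup_form = {"ada", "cy", "eve", "gus"}
more_attrition = {"eve"}

# (Jeff's readers UNION Joel's readers) - the relatively uninterested - attrition
podcast_listeners = (jeff_readers | joel_readers) - uninterested - attrition

# (Podcast listeners INTERSECT people who found the signup form) - more attrition
beta_testers = (podcast_listeners & found_signup_form) - more_attrition

print(sorted(podcast_listeners))  # ['ada', 'cy', 'dee', 'eve']
print(sorted(beta_testers))       # ['ada', 'cy']
```

Each step can only shrink the population, which is the whole point: whoever survives both filters cared enough to be there.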

The private beta provided an adequate growth period so that, at release, there were enough core members with a solid conception of the constitution that they helped to create. Additionally, the core was able to uphold the tenets of the constitution with power from the reputation that they built. (If there were a third reason for the site's success, it would be how empowered the high-rep members are to uphold the constitution.)

If SO had been released to the public in a Hollywood Launch, without the beta momentum they had, I believe it would have failed. The framework is not programming-specific — the community is.

Tagging

The site happens to be particularly well designed for programming questions in its tag-centric model. SO is a big pipe for programming questions with an unlimited number of virtual channels, each of which is denoted by a tag. With recently added capabilities to ignore or flag particular virtual channels, you (subtractively) take only the content that you want from the big pipe and prioritize the results. Exactly how nice this capability is will come to light when comparing SO to the other programming question outlets.

The tag model is also particularly well suited to the structure of knowledge in the programming domain, where the interests of individual constituents have a strong tendency to straddle several subdomains. Anecdotally, this is especially true for those who really care about their craft: the best programmers tend to have a great deal of depth to their knowledge, which inevitably ends up overlapping with other areas of interest. For example, most of today's great programmers use version control systems, convey information effectively through documentation, and recognize/employ design patterns. Many great programmers also understand more than one programming paradigm and program in more than one language. When you mix a number of these programmers together, you get some really strong sauce. The tag system allows these members to cut out the noise and exchange information in their subdomains of expertise.
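The subtractive "big pipe" idea described above is easy to sketch. What follows is a rough model of the behavior as I've described it, not SO's actual implementation: the Question class, the filter_feed function, and the sample questions are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    title: str
    tags: frozenset = frozenset()


def filter_feed(questions, ignored, favorites):
    """Drop questions carrying an ignored tag, then list favorites first."""
    kept = [q for q in questions if not (q.tags & ignored)]
    # sorted() is stable, so the original order survives within each group.
    # False sorts before True, so questions with a favorite tag come first.
    return sorted(kept, key=lambda q: not (q.tags & favorites))


big_pipe = [
    Question("Regions and generated code", frozenset({"c#"})),
    Question("Undoing a bad merge", frozenset({"git"})),
    Question("What is monkeypatching?", frozenset({"python", "ruby"})),
    Question("Calling C# from Python", frozenset({"python", "c#"})),
]

feed = filter_feed(big_pipe, ignored={"c#"}, favorites={"python"})
print([q.title for q in feed])
# ['What is monkeypatching?', 'Undoing a bad merge']
```

Note that a question tagged with several channels shows up for anybody following any one of them, which is how information crosses subdomains without anyone cross-posting.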

Community again: noobs

Don't take my subliminal messaging the wrong way: the noobs help. As they say, everybody starts out as a noob. It's clear that noobs pave the way for many others to follow by asking their noobish questions — that's rarely disputed. The really interesting thing is that noobs can provide a more brute force approach to answering questions correctly.

So long as the noobs are semi-informed, they're probably on SO because they're trying to learn about a topic of interest. Active learning processes are accompanied by reading and revisiting things that the more seasoned veterans haven't cared to think about in a long time. Noobs, with references fresh in their mind, can offer up suggestions or quotations (which they may or may not fully understand) while the rest of the members determine whether or not their information is helpful via votes and comments. Even if the noob's proposed answer is somehow incorrect, other members will learn exactly why. If other members thought the noob's answer was feasible as well, they'll be informed and corrected by seeing the dialog. This isn't something you get on an experts-only-answer site: interpolation of the truth through the correction of proposed answers.

There is, of course, the potential for Noobs of Mass Destruction (NMDs?) a la the Eternal September. If noobs outweigh the properly knowledgeable constituency so heavily that misconceptions are voted up far more rapidly than proper solutions, the site will suffer from a misinformation-shock. This misinformation may be corrected over time, but aside from Accepted Answers it's difficult to jump a correct answer to the top over highly up-voted incorrect answers. You need a critical mass of users that know what they're talking about to tip the scales with their votes and their arguments.
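To make the failure mode concrete, here's a toy ranking sketch. It assumes a simplified ordering rule (accepted answer first, then descending votes), which is my reading of the site's behavior rather than a documented algorithm, and the answers and vote counts are invented.

```python
def rank_answers(answers):
    """Order answers: accepted first, then by descending vote count."""
    return sorted(answers, key=lambda a: (not a["accepted"], -a["votes"]))


answers = [
    {"text": "popular misconception", "votes": 40, "accepted": False},
    {"text": "correct but late", "votes": 3, "accepted": False},
]

# With votes alone, the misconception stays on top.
print(rank_answers(answers)[0]["text"])  # popular misconception

# Only the asker's acceptance lets the correct answer jump the queue.
answers[1]["accepted"] = True
print(rank_answers(answers)[0]["text"])  # correct but late
```

Without acceptance, the correct answer needs 38 more votes just to pull ahead, and those votes have to come from people who know better.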

Lucky for us members, this didn't happen at public release. Even more lucky for the world of programmers, the success of the site and lack of an Eternal September-like phenomenon on SO will lead to more informed programmers from here forward, further reducing the chance for SO's quality to deteriorate. Really, it was just the initial gamble of going public and, as I mentioned before, SO got the timing right.

Community scaling through tagging

One of my favorite parts of all this is that tags allow the community to scale beautifully. If SO gains a thousand new C# programmers as members, does that hurt, say, the Python programmers? No: because of tags, more members can only mean a better site. "Stack Overflow is biased towards C#" is not a self-fulfilling prophecy. I'll explain why:

For argument's sake, let's say these are C# robots who only understand ways to use C# syntax to do what you want (i.e. "You can use regions to ease generated code injection. Beep."). If I'm a Python programmer who doesn't care about C#, I'm ignoring the tag anyway and don't get inundated with noise from the robots.

Inevitably, our hypothetical is incorrect and the C# programmers will all have knowledge which crosses into other subdomains. In the (slightly more realistic) case that the members are human beings who know C# along with some generic principles of programming and software design, they can only assist me in my cross-domain problems.

More C# programmers can only help the Python programmers. For all X and Y, more people interested in X can only help people interested in Y, so long as everybody tags everything appropriately. Except Lisp.

Comparison to other outlets

How does SO stack up against the alternatives? The primary differentiation comes in a few identifiable areas:

Structured

Folks on IRC, Usenet, or your buddy list have no real incentive to help you beyond the goodness of their hearts. I'm a starry-eyed idealist and I'm happy that this has worked historically, but it's readily apparent that people love playing for points. SO is one of those purely healthy forms of competition where everybody seems to win; from what I've seen, RTFM and "Google is Your Friend" trolls are consistently down-voted! The reputation system also appears to increase the responsiveness of the site — everybody is looking for the quick "Accepted Answer" grab if they can get it. I had figured that people would try to game the system, but it seems like most people with reputation have been sane, and the people with little reputation have their teeth pulled appropriately. Kudos to the karma system.

What differentiates SO from a big bulletin board is the three-tier threads. You have question (Original Poster), answer (many answers to one question), and comments (many comments to one answer). @replies allow for infinite "virtual" threading, but there's a clear indication of how the conversation is supposed to take place through the structure of the site. My experience with this format has led me to believe that it's ideal for removing noise from the answer tier (via short comments), without letting the meta-conversation get too crazy.

Threading on Usenet allows you to explore related topics of conversation with less friction, but it can be a big problem when you just want to know the answer that the Original Poster (OP) finally accepted. You often see such sub-conversations on Usenet get turned into new threads, while SO asks that you form the new thread pre-emptively as a new question. I have no problem with SO's approach, given the benefits of the three tiered conversation and the more precise indexing capabilities that result from structured threads.

Visibility of questions and answers is a big problem on IRC: there's a distinct fire-and-be-forgotten phenomenon in most channels, proportional to their noise level. Additionally, there's usually a few super gurus in each channel that can only handle one or two problems at a time, leading to,

[impatient-43/4] Can anybody answer my question^^!?!?

messages ad nauseam.

Asynchronous

Usenet does better than IRC in terms of question visibility because it's an asynchronous medium. IRC's synchronous format makes help a lot more interactive, but at great cost. In addition to the fire-and-be-forgotten phenomenon, you inevitably juggle O(n) synchronous channels simultaneously, where n is the number of topics you're interested in.

Also, remember that chat is exactly that: you're going to get unwanted noise. Other people's Q&As, off topic conversation, and sometimes spammers all interfere with your ability to communicate a problem and get an answer in real time. If you've ever tried reading an IRC log to determine the answer to your question, you probably understand this principle — once you mix anonymized handles in with a many-to-many conversation, you give up quickly.

The asynchronous model fits into everybody's day more nicely and scales much better. I haven't yet seen a question on SO where I said to myself, "This Q&A could have benefited greatly from an increased level of synchronous interaction." (Yeah, that's really how I talk to myself. Wanna fight about it?)

Centralized

As I mentioned, the big pipe is a beautiful thing. Some nice corollaries are:

One could argue that IRC's Freenode is similar in the virtual channel respect, but logging is certainly not centralized, and listening to many virtual channels simultaneously quickly converges to impossible. Unlike SO's multi-tag view, asking a question in one IRC channel is unlikely to get the attention of people who reside in other channels.

Newsgroups are all-over-the-place decentralized. It's definitely a web 1.0 technology. There's a bunch of services that consolidate information for newsgroups of interest (Google Groups, gmane), but due to the information being replicated all over the web, the page rank for a given Q&A will tend to be weaker as it's divided across the resources and components of the thread. Newsgroups don't tend to play together as nicely as SO tags — it's easy to see how a question like, "What's monkeypatching?" could be asked on comp.lang.python, comp.lang.ruby, and so on, without ever being referred to each other.

On SO, if you tag things properly, information naturally crosses virtual channels and is well indexed for search.

Persistent

IRC channels tend to get inundated with the same questions over and over, so they make an FAQ to persist a subset of the information that's routinely provided in the channel. Taken to its rational extreme, you could persist all the Q&A information in such a manner, in which case you'd have SO.

Some IRC channels get logged, but I rarely care where the logs are — there's little hope of you finding the answer from the log (as previously discussed). It's also unlikely that the page rank of any given log will be significant. In my IRC experience, you keep your own chat logs if you really care to find the conversations later on. In any case, this is much less elegant than SO's centralized and indexed persistence capabilities.

As I mentioned before, newsgroups have persistence, but it's not well centralized or indexed. Persistence is a moot point if you can't find what you're looking for.

Critical Thinking

Since I'm out of a job as a karma system and NMD doomsayer, I've got to talk about the potential for secondary Armageddon-like effects.

SO doesn't have a significant enough differentiation from refactormycode. Its mission is well differentiated, but it seems like the permitted content on SO is a superset of what can be found on refactormycode. I would consider this kind of Q&A noisy, but it certainly follows the same general format. It's possible the authors are cool with SO engulfing a lot of refactormycode material, but in that case I hope we get some better large code block support. If SO doesn't want it, it should be in the constitution.

I'm concerned about question staleness. Over time we'll see how venerable the Q&As are, but my immediate concern is the plot of views over time: is the drop off in number of views over time for a given question so significant that the return rate cannot overcome initial misconceptions? If misconceptions are introduced later, will users still be watching the thread? There's no "watch this thread" capability in SO for push notification, so to some extent the system expects you to check back at regular intervals to monitor activity on threads. This may be an unrealistic assumption. To be fair, the constitution explicitly states you may re-ask a question if you acknowledge that the other exists, which may prevent this from being such a big deal.

I'm curious as to how the number of non-programming, technical questions has trended over time. Potential problems in this area are alleviated by the constitution and the fact that sufficiently reputable members can close threads, but it's easy to see how there will be an inevitable flow of system administrative questions due to how knowledgeable the constituency is. If the site didn't have such good safeguards, it would easily swallow a whole lot of other Q&A domains that are indirectly programming related.