Thinking about software documentation as the output of a lossy compression algorithm


How many times have you heard or read the following comments?

  1. “The docs are always out of date”
  2. “I don’t bother reading the docs, I just read the source code”
  3. “If you write self-documenting code, you don’t need to write docs”

If you work in software, I bet the answer is: a lot. They range in truth value from #1, which is a tautology, to #3, which is a fantasy. I can relate to #2, since I had to do it just yesterday when using a semi-undocumented Ruby library.

I think all of these points of view (and more) are explained when you think about software documentation as the output of a lossy compression algorithm. Many (most? all?) of the things that you love and/or hate about the documentation for a particular piece of software are explained by this.

Quoth wiki:

Well-designed lossy compression technology often reduces file sizes significantly before degradation is noticed by the end-user. Even when noticeable by the user, further data reduction may be desirable (e.g., for real-time communication, to reduce transmission times, or to reduce storage needs).

As such, the features of software documentation are similar to the features of other products of lossy compression algorithms.

Take mp3s for example. The goal of an mp3 is not to be the highest-fidelity replication of the audio experience it represents. The goal of an mp3 is to provide a “good enough” audio experience given the necessary tradeoffs that had to be made because of constraints such as:

  • Time: How expensive in CPU time is it to run the decompression algorithm? Can the CPU on the device handle it?
  • Space: How much disk space will the file take on the device?

Similarly, we might say that the goal of software documentation is not to be the highest-fidelity replication of the “understanding” experience it (theoretically) represents. The goal of a piece of documentation is to provide a “good enough” learning experience given the necessary tradeoffs that had to be made because of constraints such as:

  • Time: How expensive in person time is it to “run the decompression algorithm”, i.e., learn how the system works well enough to write something coherent? And then to actually write it? How many technical writers does the organization employ per engineer? (In many companies it’s ~1 per 40-50 engineers) How many concurrent projects is that writer working on, across how many teams?
  • Space: How much information does the user need to use the system? How little information can you get away with providing before users give up in disgust?

Remember that fewer, more effective materials cost more to produce. This is similar to the way better compression algorithms may cost more than worse ones along various axes you care about (dollar cost for proprietary algorithms, CPU, memory, etc.)

It takes longer to write more concise documentation, draw clear system diagrams, etc., since those are signs that you actually understand the system better, and have thus compressed the relevant information about it into fewer bytes.

And oh by the way, in practice in an “agile” (lol) environment you don’t have enough time to write the “best” docs for any given feature X. Just like the programmer who wrote feature X would likely admit that she didn’t have enough time to write the “best” implementation according to her standards.

Quoth Pascal:

I would have written a shorter letter, but I did not have the time.

So the next time you are frustrated by the docs for some piece of software (if any docs exist at all), instead of some platitude about docs sucking, think “oh, lossy compression”.

(Image courtesy Jonathan Sureau under Creative Commons license.)

A Portable Scheme Module System



In this post I’d like to introduce load-module, a portable Scheme module system.

Why did I write a module system?

  • Simplicity: A single-file module system in about 200 lines of code
  • Understandability: The implementation avoids wizardry and should be accessible to anyone who knows the language
  • Portability: One system that can be used across multiple implementations

The way it works is this:

  1. You have a file (say, utils.scm) with Scheme code in it that implements stuff that you want to live in the same module.
  2. You create another file (utils.mod, but that extension is easy to change) which lists the procedures and syntax you want to export.
  3. The load-module procedure reads utils.scm, rewriting unexported procedure names such that only the procedures you want exported show up at the top-level. Everything else gets rewritten during load-time as an ignorable “gensym” of the form %--gensym-utils-random-integer-8190504171, where “utils” is the module name, and “random-integer” is the procedure internal to your module.

The module file format is very simple:

(define-module utils
  (exports random-integer atom? take))

The module system exports one procedure: load-module. Run it like so to get the procedures from the aforementioned hypothetical utils package into your environment:

> (load "load-module.scm")
> (load-module 'utils)
> (random-integer 199)
> (atom? 199)

If you care, there’s more information about over at the project README.

(Image courtesy Geoff Collins under Creative Commons license.)

Announcing cpan.el

The Cat's Eye Nebula, one of the first planetary nebulae discovered, also has one of the most complex forms known to this kind of nebula. Eleven rings, or shells, of gas make up the Cat's Eye. Credit: NASA, ESA, HEIC, and The Hubble Heritage Team (STScI/AURA) Acknowledgment: R. Corradi (Isaac Newton Group of Telescopes, Spain) and Z. Tsvetanov (NASA) The Hubble Space Telescope is a project of international cooperation between NASA and the European Space Agency. NASA's Goddard Space Flight Center manages the telescope. The Space Telescope Science Institute conducts Hubble science operations. Goddard is responsible for HST project management, including mission and science operations, servicing missions, and all associated development activities.

The CPAN shell is just another shell, so why not drive it from Emacs?

If you write Perl code in Emacs, you may have wondered why we don’t have a simple mode for driving the CPAN shell (at least I couldn’t find one!).

Well, I finally stopped wondering. It wasn’t that hard to rip out the sh-specific parts of shell.el and make a new mode for the CPAN shell.

Here’s the code:

It’s easy to load up and drive from Emacs:

(add-to-list 'load-path (expand-file-name "/path/to/cpan-el/"))
(setq cpan-file-name "cpan")

(require 'cpan)

To run it, type M-x cpan.

There aren’t too any bells and whistles yet (completion, etc.), but you it’s pretty small so feel free to hack away.

(Image courtesy NASA Goddard Photo and Video under Creative Commons License.)

Scripting JIRA


JIRA is everybody’s favorite enterprise issue management system. It’s the system many of us just love to hate.

Unlike some vocal people on the interwebs, I don’t “hate” JIRA, but I like to keep it at arm’s length. My job is not to be a JIRA jockey (although I do know me some JQL, about which more below) — my job is to get shit done.

And getting shit done quickly, for me at least, is usually a function of being able to control my tools from the command line, including via scripts.

That’s why I have written several scripts for interacting with JIRA from the command line (they’re explained below). You can get them from Github.

Note that they use some hard-coded values that match the JIRA server where I work, which you’ll have to change. However they do at least use your .netrc for passwords, etc., so you should be able to change them for your own use with a quick sed one-liner.

The functionality isn’t there to fully replace your web browser, especially if you work at a company that insists on enterprise-level JIRA jockeying, with crazy themes and labels and stuff, but here’s what’s included as of this writing — for each command, I’ve marked whether it is “plumbing” or “porcelain”:

  • jira-get-issue (plumbing): View an issue’s JSON, which you can pipe through tools such as jq to build other more generic tools you can control from your text editor, scripts, etc.
  • jira-create-issue (porcelain): This one is “end-user-ready”, in the sense that you call it and it pops up your text editor of choice so you can write the issue description, and when you close your editor the issue is created for you!
  • jira-search-issues (plumbing): Uses JQL to search your JIRA, and returns a bunch of issues as JSON for you to fiddle with however you prefer.
  • jira-add-comment (porcelain): Like jira-create-issue, this one pops open your text editor to add a comment to an issue.
  • jira-get-issue-status (plumbing): Gets back JSON describing an issue’s status. Used to check the status to transition an issue forwards or backwards using jira-set-issue-status (below).
  • jira-set-issue-status (porcelain-ish): Depending on an issue’s status, bump it forwards or backwards to an adjacent status. Help output describes the statuses and their ordering. (I’m not so sure about this one; more design work is needed, it’s still not really “porcelain”.)

As an example of using one of the plumbing tools to make a more user-friendly CLI tool, I have one I call jli that just gives me a list of my currently open issues:

#!/usr/bin/env sh

jira-search-issues "assignee = rloveland AND status not in \ 
(Closed, Resolved)" | jq '.key + " " + .fields.summary' | sed -e 's/\"//g';

Since it’s an arbitrary JQL query returning JSON, it’s easy to imagine how you might extend this to give separate lists per-project, ordered by status, etc. You can make your own little terminal- or text-file-based dashboard, send yourself a daily digest email using a larger script, and so forth.

There are some others, but I only have them on my work computer. For example, I have one that checks JIRA for new tickets by diffing the list of my assigned tickets and my local TODO list, which is in a text file.

Obviously these tools could use further development to be more modular and general-purpose, and they don’t come anywhere near covering the whole JIRA API, but they’re pretty small and easy to modify for your own use.

Most importantly, I use them every day at my job, so I know they work and are useful.

Hopefully they are useful to you too!

(Image courtesy Terry Robinson via Flickr under a Creative Commons license.)

Scripting Language Idioms: The “Seen” Hash

The “seen” hash technique is an idiom that lets you use a hash (or dictionary if you prefer) as a set data type. It’s good for generating a de-duplicated list of things, where each thing appears only once. If your language of choice has a real set data type, you may want to use that instead.

To illustrate I’ll offer a real-world use case.

The other day at work I needed to grab a bunch of information about git commits from a batch of automated emails. For reasons that don’t matter right now, our team (Docs) gets automated emails about git commits on our API (It’s not really what we asked for, but it’s what we could get somebody to build).

As a result, we get a bunch of emails formatted like this (personal details changed to protect the guilty):

--------- Project: Foo Details: something something garbage noise etc.

J. Random Luser ABC-123 did something to some other thing
Alyssa P. Hacker ABC-124 fixed mistake in ABC-123
Ben Bitdiddle ABC-125 yet another thing was done
J. Random Luser oh yeah this thing too
Ben Bitdiddle Merge ABC-123 into Foo/master

Luckily I don’t need to read all of these darn things. I have a filter set up on my mail client that saves them all in my ‘Archive’ folder, where I can safely ignore them.

When we’re getting ready to do our API release notes, I go into my ‘Archive’ folder and search for all of the emails with the subject “Project: Foo” that arrived between our last set of release notes and today. I end up with (say) about 100 files formatted like the above.

The format is: Name, JIRA ticket ID, description. Except that sometimes there is no JIRA ticket ID. And sometimes there are duplicate ticket IDs, since the emails contain messages about merge commits.

As a tech writer, I don’t need to look at the contents of every commit. I need to generate a (de-duplicated) list of JIRA ticket IDs, so I can go and review those tickets to see if there is user-facing docs work that needs to happen for those commits. (Sometimes I still need to look at the commits anyway because a ticket has a description like “change the frobnitz”, but hey.)

So I save all of these email files into a directory, and I write some code that loops over each file, generating a set of JIRA ticket IDs, which I then print out. Here’s the code what done it (it’s written in Perl but could as easily be Ruby or Python or whatevs):


use strict;
use warnings;
use feature     qw/ say   /;
use File::Slurp qw/ slurp /;

my @files = glob('*.eml');
my $jira_pat = '([A-Z]+-[0-9]+)';
my %seen;

for my $f (@files) {
  my @lines = slurp($f);

  for my $line (@lines) {
    next unless $line =~ /$jira_pat/; # Skip unless it has a JIRA ticket ID
    my $id = $1;                      # If it did match, save the capture
    $seen{$id}++;                     # Stick the ID in the hash (as a key)

say for sort keys %seen;        # Print out all the keys (which are de-duped)

The reason this trick works is that a hash table can’t have duplicate keys. Therefore the ‘$seen{$id}++’ bit means: “Stick the ID in the hash, and increment its value”. Based on the example email above, you end up with a hash table that looks like this:

  ABC-123 => 2,
  ABC-124 => 1,
  ABC-125 => 1,

Then we print the keys using the line say for sort keys %seen, which just means “print the hash keys in sorted order”.

Perl’s Autovivification FTW

Interestingly, part of the reason this idiom is cleaner in Perl than in, say, Ruby, is that Perl does something called “autovivification” of hash keys. Basically, it means that stuff gets created as soon as you mention it. That’s why you can call the ‘$seen{$id}++’ all in one line. (If you want more information about autovivification, there’s a good article on the Wikipedia.)

By contrast, in Ruby you have to first explicitly create the key’s value, and then increment it. As you can see below, if you try to bump the value of a key that doesn’t exist yet, you get an error (unless you use the technique from the Wikipedia article).

irb(main):015:0> RUBY_VERSION
=> "2.2.4"
irb(main):010:0> tix
=> {"ABC-123"=>2, "ABC-124"=>1}
irb(main):011:0> tix['ABC-125'] += 1
NoMethodError: undefined method `+' for nil:NilClass
    from (irb):11
    from c:/Ruby22/bin/irb:11:in `<main>'
irb(main):012:0> tix['ABC-125'] = 1
=> 1
irb(main):013:0> tix
=> {"ABC-123"=>2, "ABC-124"=>1, "ABC-125"=>1}
irb(main):014:0> tix['ABC-125'] += 1
=> 2

Further Reading

How Dangerous is the Samsung Galaxy Note 7? It’s Safer than Driving

Now that the Galaxy Note 7 has been officially discontinued, I’m not sure it’s worth worrying about the failure rate of this device. But there’s something that really bothered me about the coverage of the device’s various recalls and eventual discontinuation (is that a word?), which was that almost nobody seemed to be running the numbers on the actual failure rates.

If you do the arithmetic on the device failure rates, you end up looking at the situation rather differently. This is not to say that the device being discontinued was the wrong decision — all it takes is one person being horribly burned to create a panic and do serious damage to the company, not to mention that person!

Rather, I think it’s interesting to do the arithmetic as a way of exploring how humans think about risk. It may not surprise you to hear that I think we are really bad at this. And oftentimes it’s because we don’t run the numbers.

With that said, let’s look at some numbers.

According to this article on the Galaxy Note 7 recall, there were about 2.5 million devices sold in the initial batch, and, at least in early September, there had been 35 handsets discovered with the issue. A later report said that over 70 devices had overheated.

The best final count I could find is the one from the Consumer Product Safety Commission. According to the CPSC, there have been 92 reports of the batteries overheating.

How much danger was I really in? (I just returned my Galaxy Note 7 yesterday, which I LOVED, which is part of why I’m writing this.)

  • 2.5 million phones
  • 92 incidents of overheating

Turning to my trusty calculator, that looks like a 1 in 26,000 chance of the device overheating:

CL-USER> (/ 2.4e6 92)

Expressed as a percentage, there is approximately a 0.004% chance that your device would have been one of the ones to overheat:

CL-USER> (format t "~6$" (* (/ 92 2.4e6) 100))

However, let’s make a more conservative assumption that 1000 devices (over 10x as many) would eventually overheat. That’s still 0.04%, far less than one tenth of one percent. Of course, that number “less than one tenth of one percent” was quoted by Samsung themselves during the initial recall:

CL-USER> (format t "~6$" (* (/ 1000 2.4e6) 100))

Eventually, the bad PR due to overheating devices grew to be too much, and Samsung discontinued the model.

One lesson of this incident seems to be that you can make a product that is nearly perfect, with a 0.0038% failure rate, but if the failure mode is bad enough (it probably is), and if the media exposure is widespread enough to create a public outcry (it definitely was), you’re fucked. With lines like this one appearing in the Verge, it’s not hard to understand why Samsung realized they had to just kill it:

It’s easy to imagine how terrifying it would be to have a phone begin smoking like this on a plane or on your bedside table. No thanks.

The mobile device hardware industry is brutal. You can’t have a failure that occurs even in 0.0038% of devices. And even if you maintain that near-perfect safety record you still have to compete on price, features, and time-to-market. I really don’t envy those folks.

But what about risk assessment?

What I find most interesting, as I mentioned above, is what this reflects about how humans assess risk. For example, in 2012, 92 people died in car accidents every day, and nobody has ever considered doing a recall of all automobiles sold in the United States for being fundamentally unsafe!

According to the chart linked above, in 2012 there were 10.691 auto accident deaths per 100,000 people, which means that you had a 0.001% chance of literally dying in a horrible car crash:

CL-USER> (format t "~6$" (* (/ 10.691 10e5) 100))

Compare this to the 0.004% chance of your phone overheating that we calculated above (based on the 92 incidents figure). Given the rather imprecise way we’ve been slinging these numbers around, let’s just assume there’s a lot of error there, and that the figures are roughly equal.

That’s how we arrive at our conclusion:

You have as much chance of your Samsung Galaxy Note 7 overheating as you did of dying in a car crash in 2012.

(Note: this article and its headline do not constitute a claim that the Note 7 is “safe”, or that you should not return it as recommended, etc. This article is not advice on how to live your life, it’s just an exploration of how humans think about risk.)

How to Sync your Opera 12 bookmarks, history, and mail using Dropbox


In this tutorial I’m going to show you how to sync your Opera 12 bookmarks, history, and mail using Dropbox. It involves doing some annoying things to your files (symlinking, etc.) so that Dropbox does the hard work of keeping your bookmarks, mail, history, etc., updated across your various computers. On the bright side, you should only have to do it once.

This post assumes you are a Dropbox user, and you have Opera 12 installed on multiple computers. It doesn’t cover mobile devices, since there was never a mobile version of Opera’s desktop browser.

These instructions work for Linux and the Mac. I would like to do this on my Windows laptop too, but I haven’t been able to get it to work yet. I can’t seem to get Opera to use a link to a folder as if it’s a real folder. If someone knows how to do this, feel free to leave a comment.

One more thing:


OK, now that I’ve shouted a disclaimer at you, let’s do this.

A note on using Dropbox

You may have some concerns about Dropbox’s file synchronization messing up your stuff. For what it’s worth, I’ve been doing this for several months now, and have found that for the most part it just works. In my experience Dropbox basically just handles file updates with no drama.

If it does encounter drama (e.g., finds that it can’t handle a conflict between two different versions of a file), it creates a new version of the file called something like “autosave\ (glamdring2's\ conflicted\ copy\ 2016-02-27).win“, where “glamdring2” is the name of the computer with the conflicted file.

This means you can manually fix things if you have to, but you basically never have to as long as you take care to shut down Opera on computer A and wait until Dropbox is fully synced to start Opera on computer B. This normally takes just a few seconds.

Remember, you should not be doing this with your only copy of your Opera folder. You need to keep regular backups of your Opera folder, and syncing is not a backup.

Step 1. Find your Opera folder

Because Opera is a pretty sane app, it stores all of its junk in one folder (with the exception of the Mac version, about which more below).

Depending on your platform, you’ll find the folder in different places. For more information about where your version of Opera stores its files, see Files used by Opera.

On Linux, it’s easy. Everything is in


On Mac, there is a slight wrinkle. Opera 12 stores your Opera prefs and bookmarks in one place, and your mail in another. This means we need to get them all together in one place so we can sync them with Dropbox.

According to the link above, here’s where your bookmarks, history, etc. are stored if you’re a Mac user:


And it stores your mail in:

~/Library/Application Support/Opera

This is because (according to the docs), user data of “significant size” should go into this other random folder. Um, okay.

In any case, in order to get your Mac install of Opera 12 to work with the actual mail you’ll be syncing in your Dropbox folder, you need to do some goofy stuff with symbolic links, namely:

$ cd ~/Library/Application Support/Opera/
$ mv mail ~/Library/Opera
$ ln -s ~/Library/Opera/mail .

Now everything should all be in ~/Library/Opera (Mac) or ~/.opera (Linux).

Step 2. Copy your Opera folder into Dropbox

Now that we have all of our Opera things in one place, let’s copy the single Opera folder into Dropbox and link it back to wherever your Opera app is expecting the folder to be. On Linux you would do something like this in your terminal – note that Opera should not be running when you do this:

$ mv ~/.opera ~/.opera.bak # Minimal backup copy, you should really do more
$ cp -R ~/.opera.bak ~/Dropbox/opera_config

For Mac it’s pretty similar:

$ mv ~/Library/Support/Opera ~/Library/Support/Opera.bak # Minimal backup copy
$ cp -R ~/Library/Opera.bak ~/Dropbox/opera_config

In English, this:

  • Makes a backup of your Opera folder so if something gets messed up, you can put things back the way they were
  • Copies the backup into your Dropbox folder, where it can be synced across all of your machines

Step 3. Link the folder from Dropbox back to where the app wants it

Now we need to make it look to the Opera 12 app like everything is totally normal. We’ll do that by creating a link from the folder sitting in Dropbox back to where the app expects it to be.

Again, I’m really happy to hear from people about how to do this on Windows. Thus far all of my attempts have failed, and the way Windows handles what it calls “links” seems basically Crazytown. But it’s probably just that I don’t understand them well enough.

Anyway, here’s how you do the linking on Linux:

$ ln -s ~/Dropbox/opera_config ~/.opera

And on Mac:

$ ln -s ~/Dropbox/opera_config ~/Library/Opera

Step 4. Profit?

At this point you should be syncing your mail, bookmarks, history, and sessions across computers. When you shut down Opera 12 on your work machine (a Mac laptop), you can open up Opera 12 on your home workstation running Linux and be in the same session, looking at the same tabs and emails. Everything should just magically work.

If it doesn’t, please let me know in the comments and I’ll try to help.