Thinking about software documentation as the output of a lossy compression algorithm


How many times have you heard or read the following comments?

  1. “The docs are always out of date”
  2. “I don’t bother reading the docs, I just read the source code”
  3. “If you write self-documenting code, you don’t need to write docs”

If you work in software, I bet the answer is: a lot. They range in truth value from #1, which is a tautology, to #3, which is a fantasy. I can relate to #2, since I had to do it just yesterday when using a semi-undocumented Ruby library.

I think all of these points of view (and more) are explained when you think about software documentation as the output of a lossy compression algorithm. Many (most? all?) of the things that you love and/or hate about the documentation for a particular piece of software are explained by this.

Quoth wiki:

Well-designed lossy compression technology often reduces file sizes significantly before degradation is noticed by the end-user. Even when noticeable by the user, further data reduction may be desirable (e.g., for real-time communication, to reduce transmission times, or to reduce storage needs).

As such, the features of software documentation are similar to the features of other products of lossy compression algorithms.

Take mp3s for example. The goal of an mp3 is not to be the highest-fidelity replication of the audio experience it represents. The goal of an mp3 is to provide a “good enough” audio experience given the necessary tradeoffs that had to be made because of constraints such as:

  • Time: How expensive in CPU time is it to run the decompression algorithm? Can the CPU on the device handle it?
  • Space: How much disk space will the file take on the device?

Similarly, we might say that the goal of software documentation is not to be the highest-fidelity replication of the “understanding” experience it (theoretically) represents. The goal of a piece of documentation is to provide a “good enough” learning experience given the necessary tradeoffs that had to be made because of constraints such as:

  • Time: How expensive in person time is it to “run the decompression algorithm”, i.e., learn how the system works well enough to write something coherent? And then to actually write it? How many technical writers does the organization employ per engineer? (In many companies it’s ~1 per 40-50 engineers.) How many concurrent projects is that writer working on, across how many teams?
  • Space: How much information does the user need to use the system? How little information can you get away with providing before users give up in disgust?

Remember that fewer, more effective materials cost more to produce. This is similar to the way better compression algorithms may cost more than worse ones along various axes you care about (dollar cost for proprietary algorithms, CPU, memory, etc.).

It takes longer to write more concise documentation, draw clear system diagrams, etc., since those are signs that you actually understand the system better, and have thus compressed the relevant information about it into fewer bytes.

And oh by the way, in practice in an “agile” (lol) environment you don’t have enough time to write the “best” docs for any given feature X. Just like the programmer who wrote feature X would likely admit that she didn’t have enough time to write the “best” implementation according to her standards.

Quoth Pascal:

I would have written a shorter letter, but I did not have the time.

So the next time you are frustrated by the docs for some piece of software (if any docs exist at all), instead of some platitude about docs sucking, think “oh, lossy compression”.

(Image courtesy Jonathan Sureau under Creative Commons license.)

More Recommendations for Technical Writers


In the same spirit as the last post in this series, I have more recommendations for technical writers working in software-land. As before, I make no claims to living up to these recommendations myself. It’s like, aspirational, man.

Without further ado:

Learn Markdown

Yes, Markdown. It’s become a de facto standard plain text format for all kinds of web writing and documentation. It’s easy to use, it’s everywhere, and there are many tools that you can use to work with it.

See also: What is Markdown?

If you want to know what could happen to you if you start using it for things, see Ryan Tomayko’s Why You Should Not Use Markdown.

Learn your web browser’s development tools

From time to time you will need to peek under the covers of a web app you are documenting and see what’s actually going on. Especially nowadays with single page apps, browser-side local storage, and all that fun stuff.

Even if you never need to know that stuff to write your UI docs, you can use the dev tools to rewrite text in user interfaces for making nicer screenshots.

Strunk & White! Yes, Really!

Read it. Then try like hell to live it. I’ll confine the rest of this document to “technical” issues, since I can’t improve on the advice given in that book.

(Again, this is an area where I need to work harder on following my own advice. My long sentences and passive constructions are killing me.)

Think about learning to read code

Everything in our world runs on code (or will soon). Learning how to read code at least up to a basic level can help you figure out what the hell’s going on, sometimes. Yes, you can talk to an engineer, but your conversation will be more productive if you at least have a starting point.

The reason is simple: It’s usually easier to go to someone with a wrong idea and have them correct you with the right ideas than it is to get that same someone to teach you everything from scratch.

This doesn’t mean you have to become a programmer yourself (although you should try to learn how to write scripts to automate boring computer-y stuff, as I noted previously). But like most foreign languages, it helps to be able to read a little to get directions from the locals. And in this case the locals (programmers) speak code.

Seek out pathological edge cases

In any really large system or “platform” designed and built by multiple people on different teams, not all of whom are talking to each other three times a day about every single design decision they make, there will be a number of edge cases. In other words, there will be parts of the system that don’t play well together, or are at best, um, unintuitive in the way they behave.

It’s your job to find these and document them as well as you can. Sometimes this will be low-priority because the issues will be fixed “soon” (for some value of soon). Sometimes it will be necessary, though. You will have to use your own best judgment, along with input from your friends in Engineering and the Product organization, about when to do this.

You probably won’t get them all (not even close), but if you actively seek them out as you go, you will develop a mindset that will help you learn the system a little better.

Also: no one will assume this is part of your job or give you actual time to work on this; you just have to do it.

Don’t listen to anybody

Perhaps that should be rephrased as: “Don’t listen to anybody … just yet.” Until you’ve done your own research and testing and had your own look at the thing you’re documenting (whatever that thing is), you can’t really write about it for someone else. When Product folks, Engineering, etc. tell you the sky is blue, you should still stick your head out the nearest window.

It’s not that they don’t know what they’re talking about (they usually know more than you), it’s that the features they’re telling you about may exist soon at your layer of the onion, but right now they’re sitting on a git branch in somebody’s laptop, or as a line in a technical spec on the internal wiki, or in a weird corner of the internal-only sandbox environment that you will need to hunt down in order to actually use the thing.

If you are documenting a web API, only the things the APIs actually do are real. Everything else is bullshit. That’s part of the meaning of “API”.
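One way to hold yourself to that is to compare what the docs claim against what the API actually sends back. Here is a minimal sketch in Python. The field names and the sample response are invented for illustration; in practice the response text would come from a real call against whatever environment you can reach.

```python
import json


def compare_fields(documented, response_text):
    """Compare documented field names against a real JSON response body.

    Returns (missing, undocumented): fields the docs promise but the
    response lacks, and fields the response has but the docs omit.
    Assumes the response body is a single JSON object.
    """
    actual = set(json.loads(response_text))  # keys of the JSON object
    documented = set(documented)
    return sorted(documented - actual), sorted(actual - documented)


# Hypothetical example data: what the docs say vs. what came back.
documented_fields = ["id", "name", "budget"]
sample_response = '{"id": 7, "name": "spring-sale", "state": "active"}'

missing, undocumented = compare_fields(documented_fields, sample_response)
print("In the docs but not in the response:", missing)
print("In the response but not in the docs:", undocumented)
```

Either list being non-empty is a conversation starter with Engineering, not proof that anyone is wrong.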

Nobody will explicitly budget time for this, either. But you have to do it.

Do your own research, but prepare to be Wrong

It’s always faster to bug someone with a quick question, and sometimes that’s necessary. That said, I’ve been embarrassed by asking quick questions that turned out to have stupidly easy answers that I could have found out for myself.

It’s better to be known as someone who tries to solve their own problems before reaching out for help. Doing some light functional testing as you document, trying out a few database queries, or even reading through code (if that even makes sense and if you are able), will teach you things about the system that could prove useful to you in the future. It will at least give you something to talk about when you do sit down with someone else.

Having said all that, you will still be Wrong a lot (yes, that’s capitalized and bold). It’s one thing to use a thing, and another to build it. The engineers will always know a lot more than you about the systems. If you can get them to point out your mistakes, you’re doing your job! You’re learning things! So don’t get discouraged.

But didn’t I just tell you not to listen to anybody? Yes, but it’s never that simple – these are complex systems.

You are probably an unpaid part-time QA person, embrace it

In the course of writing documentation, you will naturally test things out to ensure that what you’re writing isn’t totally useless (even so, that happens sometimes). You’ll probably spend a lot of time doing this, and taking notes on what you find. These notes will then be integrated into the documentation you write, or into bug reports of some kind to your engineering colleagues (which will probably be rejected because you are Wrong – see above).

Again: no one will assume this is part of your job or give you actual time to work on it; you just have to do it.

Never Forget that The Train Never Stops

Finally – and most importantly – don’t let the fact that everything is changing all of the time get you down. The user interface will change drastically; new APIs will be added, and old ones deprecated and removed; people you’ve gotten to know and like will come and go. If you stay at a job like this for a couple of years, you will be rewriting your Year Two revision of the rewrite you did when you started.

Through it all, “the platform” can never stop running and changing: upgrades, changes, and so on all have to happen while the train is running down the tracks at eighty miles per hour. It doesn’t stop for the engineering teams working on it, and it certainly isn’t going to stop for you.

Your best weapon against change is automating as much drudge work as possible (again with the scripting!) so that you can focus on:

  1. Learning your company’s systems as well as you technically can, given your limited time and resources
  2. Knowing what’s getting built next, and why
  3. Knowing what the business cares about (this is usually what drives #2)

Finally, never forget that you are doing important work. Someday, when the people who designed and built the current system have moved on, people will wonder “Why does API service ‘X’ behave this way when I set the twiddle_frobs field on API ‘Y’ to true, but only when API Z’s ignore_frob_twiddle_settings field is not null?”

If you’ve done a good job (and gotten really lucky), there will be some documentation on that.

For a great perspective on why documentation is so important to the health of a company from a smart man who’s been around the block once or twice, see Tim Daly’s talk Literate Programming in the Large.

(Image courtesy typedow under CC-BY-NC-SA.)

Recommendations for Technical Writers

This document describes my recommendations for technical writers working in the software industry (at least the web-focused corner of it). Be forewarned: it’s opinionated. I don’t even live up to it, so you can think of it as an extended note to self.

Learn the UNIX command line

Most of the modern internet’s computing infrastructure runs on open source, UNIX-like operating systems (mostly Linux). If you’re not familiar with them, you will have a more difficult time learning about how large-scale software systems running on networked computers work, otherwise known as: virtually all software systems of importance [1].

To get there, you’ll need to know how to interact with the command line interface (also known as a “terminal”). This is a very plain-looking text-only interface where you will type commands formatted in a strict language understood by the computer. You will learn (if you haven’t already) that the computer always does exactly what you’ve asked it to, but not necessarily what you wanted it to do.

There are a number of “shells” you can use on a UNIX-like OS. I recommend Bash because it’s ubiquitous.

Learn REST API principles

Here’s a light-speed introduction to REST APIs. A large software system has lots of “objects”, which are generally just pieces of stored data about something. For example, where I work, there are objects that represent an advertising campaign or a user in the system.

It doesn’t really matter what data the objects represent; the point of an API is that it creates a single point of contact for external entities (usually programs) that want to interact with the data in the system. It ensures that only certain operations can be performed, and only by authorized parties.

Here’s where the REST part comes in: for any object represented in the system, you can (in theory) perform four basic operations on it, commonly abbreviated CRUD:

  • Create: Make a new object of that type
  • Read: Look at what information the object contains
  • Update: Change the information the object contains
  • Delete: Delete the object from the system altogether
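In web APIs these four operations are commonly mapped onto the HTTP methods POST, GET, PUT (or PATCH), and DELETE. Here is a minimal sketch of that mapping in Python. It uses an in-memory dictionary instead of a real server, and the `CampaignStore` class, its `handle` method, and the field names are all made up for illustration.

```python
import itertools


class CampaignStore:
    """Toy in-memory stand-in for an API's "campaign" objects (hypothetical)."""

    def __init__(self):
        self._objects = {}
        self._ids = itertools.count(1)  # hands out 1, 2, 3, ...

    def handle(self, method, object_id=None, data=None):
        """Dispatch an HTTP-style method to the matching CRUD operation.

        Returns a (status_code, body) pair loosely modeled on HTTP responses.
        """
        if method == "POST":  # Create
            new_id = next(self._ids)
            self._objects[new_id] = dict(data or {})
            return 201, {"id": new_id, **self._objects[new_id]}
        if object_id not in self._objects:
            return 404, {"error": "not found"}
        if method == "GET":  # Read
            return 200, {"id": object_id, **self._objects[object_id]}
        if method == "PUT":  # Update (replaces the stored data wholesale)
            self._objects[object_id] = dict(data or {})
            return 200, {"id": object_id, **self._objects[object_id]}
        if method == "DELETE":  # Delete
            del self._objects[object_id]
            return 204, None
        return 405, {"error": "method not allowed"}
```

Calling `handle("POST", data={"name": "spring-sale"})` on a fresh store creates object 1; the same ID can then be passed with "GET", "PUT", or "DELETE". A real API layers authentication and validation on top of this, which is where the "only by authorized parties" part comes in.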

Unfortunately, I know about RESTful APIs only from practice, not study. There’s a copy of O’Reilly’s RESTful Web Services at work that I’ve been meaning to read to fill in the gaps. See? This really is a note to self.

Learn to use a programmer’s text editor

You’re a professional who gets paid to work with text all day. Programmers are also professionals who get paid to work with text all day. You should be following their example by using a powerful plain text editor; it will allow you to automate away some of the tedious parts of editing.

Remember, word processors are written by programmers for others to use. Text editors are written by programmers for themselves to use. It’s not even a contest.

I recommend vim or Emacs. Pick one, it doesn’t matter.

Learn a programming language with strong UNIX integration

As noted above, you are going to be documenting software running on networked computers doing … networky stuff. For example, you might be making API calls to a “cloud” of some kind [2]. A lot of it will be tedious and potentially time-consuming, which means that you should automate it using a programming language. Once you are pretty familiar with a range of terminal commands, it won’t be too difficult to pick up a language with strong UNIX integration. Tastes vary; I am most familiar with Perl. I do not recommend wasting much time writing shell scripts, as there are too many weird edge cases for beginners to trip over.
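To give a flavor of the kind of drudge work worth automating, here is a small sketch that calls a list of API paths and records the status code each one returns. It is written in Python rather than Perl, purely as an illustration. The base URL and paths are hypothetical, and the `fetch` parameter is my own addition so the loop can be tried without touching a network.

```python
import urllib.error
import urllib.request


def http_status(url):
    """Return the HTTP status code for a GET request to url.

    Connection failures (urllib.error.URLError) are not handled here
    and will propagate to the caller.
    """
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx and 5xx responses arrive as exceptions; keep the code.
        return err.code


def check_endpoints(base_url, paths, fetch=http_status):
    """Call every path under base_url and return {path: status_code}."""
    return {path: fetch(base_url + path) for path in paths}


if __name__ == "__main__":
    # Offline demonstration with a fake fetcher and made-up URLs.
    fake_statuses = {
        "https://api.example.com/campaigns": 200,
        "https://api.example.com/frobs": 404,
    }
    results = check_endpoints(
        "https://api.example.com", ["/campaigns", "/frobs"], fetch=fake_statuses.get
    )
    for path, status in results.items():
        print(path, status)
```

Dropping the `fetch` argument makes the same loop issue real GET requests, which is the point at which this starts saving you an afternoon of clicking.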

Recommended Reading:

  • Learning Perl by Randal L. Schwartz, brian d foy, and Tom Phoenix
  • Modern Perl by chromatic


[1] Where “importance” is defined as: someone will pay you good money to document it. Personally, I’d love to get a paying job rewriting half of the Emacs manual. But I digress.

[2] A Freudian slip by the tech industry perhaps? After all, “the cloud” brings to mind an ephemeral, floating entity that is mostly vaporous, and exists at the whim of forces you don’t really understand, and certainly can’t control. In addition, it’s highly likely to be blown away at any time.