Thinking about software documentation as the output of a lossy compression algorithm


How many times have you heard or read the following comments?

  1. “The docs are always out of date”
  2. “I don’t bother reading the docs, I just read the source code”
  3. “If you write self-documenting code, you don’t need to write docs”

If you work in software, I bet the answer is: a lot. They range in truth value from #1, which is a tautology, to #3, which is a fantasy. I can relate to #2, since I had to do it just yesterday when using a semi-undocumented Ruby library.

I think all of these points of view (and more) are explained when you think about software documentation as the output of a lossy compression algorithm. Many (most? all?) of the things that you love and/or hate about the documentation for a particular piece of software are explained by this.

Quoth wiki:

Well-designed lossy compression technology often reduces file sizes significantly before degradation is noticed by the end-user. Even when noticeable by the user, further data reduction may be desirable (e.g., for real-time communication, to reduce transmission times, or to reduce storage needs).

As such, the features of software documentation are similar to the features of other products of lossy compression algorithms.

Take mp3s for example. The goal of an mp3 is not to be the highest-fidelity replication of the audio experience it represents. The goal of an mp3 is to provide a “good enough” audio experience given the necessary tradeoffs that had to be made because of constraints such as:

  • Time: How expensive in CPU time is it to run the decompression algorithm? Can the CPU on the device handle it?
  • Space: How much disk space will the file take on the device?

Similarly, we might say that the goal of software documentation is not to be the highest-fidelity replication of the “understanding” experience it (theoretically) represents. The goal of a piece of documentation is to provide a “good enough” learning experience given the necessary tradeoffs that had to be made because of constraints such as:

  • Time: How expensive in person time is it to “run the decompression algorithm”, i.e., learn how the system works well enough to write something coherent? And then to actually write it? How many technical writers does the organization employ per engineer? (In many companies it’s roughly one per 40-50 engineers.) How many concurrent projects is that writer working on, across how many teams?
  • Space: How much information does the user need to use the system? How little information can you get away with providing before users give up in disgust?

Remember that fewer, more effective materials cost more to produce. This is similar to the way better compression algorithms may cost more than worse ones along various axes you care about (dollar cost for proprietary algorithms, CPU, memory, etc.).

Concise documentation and clear system diagrams take longer to produce because they are evidence that you actually understand the system better, and have thus compressed the relevant information about it into fewer bytes.

And oh by the way, in practice in an “agile” (lol) environment you don’t have enough time to write the “best” docs for any given feature X. Just like the programmer who wrote feature X would likely admit that she didn’t have enough time to write the “best” implementation according to her standards.

Quoth Pascal:

I would have written a shorter letter, but I did not have the time.

So the next time you are frustrated by the docs for some piece of software (if any docs exist at all), instead of reaching for some platitude about docs sucking, think “oh, lossy compression”.

(Image courtesy Jonathan Sureau under Creative Commons license.)

Oh my, am I really considering XML?

Since writing Why Markdown is not my favorite text markup language, I’ve been thinking more about document formats.

More and more I begin to see the impetus for the design of XML, despite its sometimes ugly implementation. With XML you avoid much of the ambiguity of parsing plain-text-based formats and just write the document AST directly. Whether this is a good or bad thing seems to depend on the tools you have available to you, but I think I’m starting to see the light.
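For instance, where the meaning of a nested list in Markdown depends on each parser’s indentation rules, the XML form spells out the tree explicitly. A made-up fragment (not any particular schema) illustrates the idea:

```xml
<list>
  <item>
    <para>Fruits</para>
    <list>
      <item><para>Apple</para></item>
      <item><para>Banana</para></item>
    </list>
  </item>
</list>
```

There is nothing to guess at: the nesting you wrote is the nesting you get, in any conforming XML parser.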

At $WORK, for example, I’ve been writing directly in the “XML-ish” Confluence storage format since it was introduced in Confluence 4. Combined with the right editing environment (such as that provided by Emacs’ nxml-mode), it’s easy to navigate XML “structurally” in such a way that you no longer really see the tags.

It’s sort of like being Neo in The Matrix except that, instead of making cool shit happen in an immersive virtual world you’re, um, writing XML.

However, not all is roses in XML-land. In an ideal world, you could maintain a set of XML documents and reliably transform them into other valid formats using a simple set of tools that are easy to learn and use. In reality, many of the extant XML tools such as XSLT exhibit a design aesthetic that is deeply unappealing to most programmers. The semantics of XSLT are interesting, but the syntax appears to be a result of the mistakes that are often made when programmers decide to create their own DSLs. Olin Shivers has a good discussion of the often-broken “little language” phenomenon in his scsh paper.

Speaking of Scheme, it’s possible that something reasonable could be built with SXML. I’ve also had good results using Perl and Mojo::DOM to build Graphviz diagrams of the links among Confluence wiki pages as part of a hacked-together “link checker”. (Users of Confluence in an industrial setting will know that the built-in link checking only “sort of” works, which is indistinguishable in practice from not actually working; hence the need to build my own.)
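For readers who want the flavor of that link-checker idea without Mojo::DOM, here is a minimal sketch using only Python’s standard library. The page names and HTML below are invented; a real version would pull page bodies from the Confluence API rather than a hard-coded dict.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the targets of <a href="..."> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def page_links(html):
    """Return the href targets found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links

def to_dot(pages):
    """Render a {page_name: html} mapping as a Graphviz digraph."""
    lines = ["digraph wiki {"]
    for page, html in sorted(pages.items()):
        for target in page_links(html):
            lines.append(f'  "{page}" -> "{target}";')
    lines.append("}")
    return "\n".join(lines)

# Invented example pages standing in for real wiki content.
pages = {
    "Home": '<p><a href="Setup">Setup</a> and <a href="FAQ">FAQ</a></p>',
    "Setup": '<p>Back to <a href="Home">Home</a></p>',
}
print(to_dot(pages))
```

Feed the resulting DOT text to `dot -Tpng` and you get a picture of your wiki’s link structure; edges pointing at pages that don’t exist are your broken links.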

I’ve also been playing around with MIT Scheme’s built-in XML Parser, and so far I’m preferring it to the Perl or SXML way of doing things.

Include Code Samples in Markdown Files


Introduction

One useful feature that Markdown omits is any way to properly maintain formatted code samples in the text. Instead you have to indent your code samples “by hand”. This is easy to mess up, especially if you have a team of people all editing the same files.

Code indentation and formatting is an important issue if you are writing tech docs intended for engineers. It’s mostly about ease of readability. Badly formatted code is jarring to the eye of your reader and makes the rest of your documentation seem instantly suspect.

In this post I’ll share a technique (a script, really) that I’ve developed for “including” longer code samples into your Markdown documents from external files 1.

Motivation

To understand the motivation for this technique, let’s look at some made-up code samples. If you already understand why one might want to do this, feel free to skip down to the code.

First up is the simple case: a code snippet that’s just a few lines long.

# Spaceship docking mechanism.

my $foo = Foo->new;
$foo->rotate(90);
$foo->engage_maglocks;

That wasn’t too bad. However, you may need a longer code sample like the one shown below which uses a lot of indentation. You really don’t want to be manually indenting this inside a Markdown file.

;; Ye Olde Merge Sort.

(define (merge pred l r)
  (letrec ((merge-aux
            (lambda (pred left right result)
              (cond ((and (null? left) (null? right))
                     (reverse result))
                    ((and (not (null? left)) (not (null? right)))
                     (if (pred (car left) (car right))
                         (merge-aux pred
                                    (cdr left)
                                    right
                                    (cons (car left) result))
                         (merge-aux pred
                                    left
                                    (cdr right)
                                    (cons (car right) result))))
                    ((not (null? left))
                     (merge-aux pred (cdr left) right (cons (car left) result)))
                    ((not (null? right))
                     (merge-aux pred left (cdr right) (cons (car right) result)))
                    (else #f)))))
    (merge-aux pred l r '())))

(define (merge-sort xs pred)
  (let loop ((xs xs)
             (result '()))
    (cond ((and (null? xs) (null? (cdr result))) (car result))
          ((null? xs) (loop result xs))
          ((null? (cdr xs))
           (loop (cdr xs) (cons (car xs) result)))
          (else
           (loop (cddr xs)
                 (cons (merge < (first xs) (second xs)) result))))))

Code to solve the problem

An easier way to do this is to “include” the code samples from somewhere else. Then you can maintain the code samples in separate files where you will edit them with the right support for syntax highlighting and indentation from your favorite code $EDITOR.

The particular inline syntax I’ve settled on for this is as follows:

{include code_dir/YourFile.java}

Where code_dir is a directory containing all of your code samples, and YourFile.java is some random Java source file in that directory. (It doesn’t have to be Java, it could be any language.)

The include syntax is not that important. What’s important is that we can easily maintain our code and text separately. We can edit Markdown in a Markdown-aware editor, and code in a code-aware editor.

Then we can build a “final” version of the Markdown file which includes the properly formatted code samples. One way to do it is with this shell command, passing the input file as an argument (see below for the source of the expand_markdown_includes script):

$ expand_markdown_includes your-markdown-file.md.in > your-markdown-file.md

This assumes you use the convention that your not-quite-Markdown files (the ones with the {include *} syntax described here) use the extension .md.in.

Another nice thing about this method is that you can automate the “include and build” step using a workflow like the one described in Best. Markdown. Writing. Setup. Ever.
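If you prefer make over a hand-run command, the automation can be sketched as a small Makefile. This assumes the .md.in convention above and a markdown CLI on your PATH; note that expand_markdown_includes takes its input file as an argument, and that recipe lines must be indented with tabs:

```makefile
SOURCES := $(wildcard *.md.in)
HTML    := $(SOURCES:.md.in=.html)

all: $(HTML)

# Expand {include ...} directives into real Markdown.
%.md: %.md.in
	expand_markdown_includes $< > $@

# Render the expanded Markdown to HTML.
%.html: %.md
	markdown $< > $@

clean:
	rm -f $(SOURCES:.md.in=.md) $(HTML)

.PHONY: all clean
```

With this in place, `make` rebuilds only the files whose sources (or included code samples, if you add them as prerequisites) have changed.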

Finally, here is the source of the expand_markdown_includes script. The script itself is not that important. It could be improved in any number of ways. Furthermore, because it’s so trivial, you can rewrite it in your favorite language.

#!/usr/bin/env perl

use strict;
use warnings;
use File::Basename;
use File::Slurp qw< slurp >;

my $input_file = shift;
my $input_pathname_directory = dirname( $input_file );

my @input_lines = slurp( $input_file );

# Note: the braces are escaped for modern Perl, and the character
# class allows uppercase letters and digits so that filenames like
# YourFile.java actually match.
my $include_pat = qr/\{include ([\w.\/-]+)\}/;

for my $line ( @input_lines ) {
  if ( $line =~ $include_pat ) {
    my $include_pathname = $1;
    my $program_file = build_full_pathname($input_pathname_directory,
                                           $include_pathname);
    my @program_text = slurp( $program_file );
    # Indent each included line by four spaces so Markdown
    # renders it as a code block.
    for my $program_line ( @program_text ) {
      printf( "    %s", $program_line );
    }
  }
  else {
    print $line;
  }
}

sub build_full_pathname {
  my ($dir, $file) = @_;
  return $dir . '/' . $file;
}

Footnotes:

1

While I was writing this I decided to do a bit of web searching and I discovered this interesting Stack Overflow thread that mentions a number of different tools that solve this problem. However I rather like mine (of course!) since it doesn’t require any particular Markdown implementation, just the small preprocessing script presented here.

(Image courtesy Claudia Mont under a Creative Commons license.)

The Debugger is a Notebook


My style of programming has changed since the ODB. I now write insanely fast, making numerous mistakes. This gives me something to search for with the ODB. It’s fun.

– Bil Lewis, creator of the Omniscient Debugger

I. Programmers and Poets

In this short essay I will explore some similarities between the way (some) programmers work when doing exploratory programming and the way poets write poems. I will also sprinkle in some attempts to make the case that a good debugger is core to the experience of composing a new program that you don’t understand very well yet, and compare that with the experience of writing a poem. I know a lot about the former because I am not a very good programmer, so many programs that explore computer science concepts are “exploratory” for me; I know a lot about the latter because I am a reasonably good poet who has written many poems (which is to say, I have edited many poems, which is really more important).

This work is largely inspired by:

  • The experience of working on programs for SICP exercises and getting popped into the MIT Scheme debugger a lot! 1
  • Using the Scheme 48/scsh inspector a bunch while working on geiser-scsh
  • Writing a lot of poems

II. Generating {Program,Poem} Text

Computer program texts are compressed descriptions of computational processes designed to be experienced by computers and humans. Similarly, poems are compressed descriptions of cognitive and emotional processes designed to be experienced by humans.

Both artifacts strive to encapsulate something that was understood by the author(s) at one point in time and convey that understanding to a reader at another point in time (human or machine). In poetry world, there are a number of different ways to work. There are ostensibly some writers who think really hard for a moment and write a fully-formed sentence. Then they think for a few moments more and write down another fully-formed sentence. And so on.

In reality, there are very few poets who work this way. Most people work using an approximation of what Sussman beautifully describes as “problem-solving by debugging almost-right plans”. 2 This is actually how human beings create new things! As my professor told our writing workshop, “I can’t teach you how to write. I can only teach you how to edit your own work”. Few people write well, and fewer edit well. But in the end, writing and editing are actually the same thing. When you first edit a poem, you may correct minor errors in the text. The more important work is “running the program” of the poem in your head, reading it over and over, reciting it aloud, testing whether it achieves the aesthetic purpose you have set for it. You will add a pronoun in one place, and replace an adjective in another. You might remove the last line, or add another stanza entirely. Do this for long enough, and you may find the poem has changed substantially over the course of having been “debugged”. It may also achieve a goal that you didn’t know existed when you began writing it. I suspect there is something very similar at work when people are doing exploratory programming sessions.

III. Debugger as Crutch/Enabler

Debuggers are seen by some as a crutch. I agree that debuggers are a crutch. There’s a reason crutches were invented. Because without them, you would have to crawl along, dragging your broken leg behind you in the dirt. And we all have a “broken leg” of sorts when we’re working on a problem we don’t understand very well.

I’d like to propose a better metaphor for debuggers. The debugger is a notebook where you can sketch out different versions of your program. You may correct minor errors in a variable declaration, or change a parameter to a procedure. You might redefine an existing procedure as a higher-order procedure that replaces two or three more verbose ones. And so on. All inside the debugger!

A sufficiently powerful debugger will give you the freedom to sketch out an idea quickly, watch it break, and play around in the environment where the breakage occurred, reading variable bindings, entering new definitions, etc.
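The same break-and-poke-around workflow exists outside Scheme. Here is a minimal sketch using Python’s standard traceback machinery (the function and data are invented for illustration; in an interactive session you would drop into pdb.post_mortem() at this point instead of merely printing the bindings):

```python
import sys

def buggy_mean(xs):
    # Breaks on empty input: division by zero.
    return sum(xs) / len(xs)

try:
    buggy_mean([])
except ZeroDivisionError:
    tb = sys.exc_info()[2]
    # Walk to the innermost frame, where the failure happened.
    while tb.tb_next:
        tb = tb.tb_next
    # Read the variable bindings at the point of breakage,
    # just as you would in an interactive debugger.
    bindings = tb.tb_frame.f_locals
    print(bindings["xs"])  # → []
```

Seeing that `xs` was empty at the moment of failure is exactly the “reading variable bindings in the environment where the breakage occurred” experience, minus the interactivity.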

I think of this style of programming as “sketching in your notebook” because you don’t write poems by staring at a blank sheet of paper for two minutes and then carefully writing a fully-formed sentence. You have an initial idea or impulse, and then you express it as fast as you can! You write down whatever parts of it you can manage to catch hold of, since your brain is working and associating much faster than your pen can really capture. What you end up with is a pile of things, some of which you will discard, some of which are worth keeping just as they are, and some of which are worth keeping but non-optimal and will need to be rewritten. If you actually have an idea worth expressing, you are in much greater danger of not capturing something than you are of making a mess. You will always start by making a mess and then cleaning it up 3.

I submit that a sufficiently powerful, expressive debugging environment is as necessary to the programmer as a pocket notebook to the poet.

Interesting Reads

These essays explore writing, debugging, and thinking in more depth:

(Image courtesy fdecomite under Creative Commons License.)

Footnotes:

1

For more information about how cool the MIT Scheme debugger is, see Joe Marshall’s informative blog post.

2

This technique is mentioned on his CSAIL page here. For more information, see the link to his paper elsewhere on this page.

3

You may enjoy an interesting essay with this title: Make a Mess, Clean it Up!

Why Libraries and Librarians are Amazing


Libraries: because you cannot have a conscience without a memory.

For some time now I’ve been meaning to write about why libraries (and librarians!) are important. After all, I’m a child of the library. In particular, I’m a child of this library, where I spent many happy hours growing up. As the son of a blue-collar family that was anything but “bookish”, well, I don’t know where I’d be without libraries.

What follows is a collection of random thoughts (linky-blurbs, really) about the general amazingness of libraries and librarians:

  • Librarians fought, and are still fighting, surveillance-state idiocy like the PATRIOT Act.
  • On a lighter note, the Internet Archive also has this collection of classic MS-DOS games that you can play in your browser. When I saw titles like Street Fighter, Sim City, Donkey Kong, and Castle Wolfenstein in this list, I have to admit I kinda freaked out. The future is amazing. Growing up, I had to memorize my friends’ phone numbers! And now we have magic tiny emulated computers in our browser tabs.

See also:

Never trust a corporation to do a library’s job.

(Image courtesy Cher Amio under Creative Commons license.)

Why Markdown is not my favorite text markup language

There are many text markup languages that purport to allow you to write in a simple markup format and publish to the web. Markdown has arguably emerged as the “king” of these formats. I quite like it myself when it’s used for writing short documents with relatively simple formatting needs. However, it falls a bit short when you start to do more elaborate work. This is especially the case when you are trying to do any kind of “serious” technical authoring.

I know that “Markdown” has been used to write technical books. Game Programming Patterns is one excellent example; you can read more about the author’s use of Markdown here, and the script he uses to extend Markdown to meet his needs is here. (I recommend reading all of his essays about how he wrote the book, by the way; they’re truly inspiring.) Based on that author’s experience (and some of my own), I know that Markdown can absolutely be used as a base upon which to build ebooks, websites, wikis, and more. However, this is exactly why I put “Markdown” in quotes at the beginning of this paragraph. By the time you’ve extended Markdown to cover your more featureful technical authoring use cases, it really isn’t “just” Markdown anymore. This is fine if you just want to get something done quickly that meets your own needs, but it’s not ideal if you want to work with a meaningful system that can be standardized and built on.

Below I’ll address just a few of the needs of “industrial” technical writing (the kind that I do, ostensibly) where Markdown falls a little short. Lest this come off as too negative, it’s worth stating for the record that a homegrown combination of Markdown and a few scripts in a git repo with a Makefile is still an absolute paradise compared to almost all of the clunky proprietary tooling that is marketed and sold for the purposes of “mainstream” technical writing. I have turned to such a homebrewed setup myself in times of need. I’ve even written about how awesome writing in Markdown can be. However, this essay is an attempt to capture my thoughts on Markdown’s shortcomings. Like any good internet crank, I reserve the right to pull a Nickieben Bourbaki at a later date.

I. No native table support

If you are doing any kind of large-scale tech docs, you need tables. Although constraints are always good, and a simple list can probably replace 80% of your table usage if you’re disciplined, there are times when you really just need a big honkin’ table. And as much as I’m used to editing raw XML and HTML directly in Emacs using its excellent tooling to completely sidestep the unwanted “upgrade” to the Confluence editor at $WORK, most writers probably don’t want to be authoring tables directly in HTML (which is the “native” Markdown solution).
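To make the point concrete, even a small two-column table has to be dropped into the Markdown as raw HTML (the flags shown are made up for illustration):

```html
Some Markdown prose above the table.

<table>
  <tr><th>Flag</th><th>Meaning</th></tr>
  <tr><td>-v</td><td>verbose output</td></tr>
  <tr><td>-n</td><td>dry run</td></tr>
</table>

Some Markdown prose below the table.
```

That’s tolerable for a three-row example; at twenty rows and five columns, hand-maintained HTML becomes a genuine chore.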

II. No native table of contents support

Yes, I can write a script myself to do this. I can also use one of the dozens of such scripts written by others. However, I’d rather have something built in, and consider it a weakness of the format.
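For what it’s worth, such a script can be tiny. A sketch in Python, assuming ATX-style # headings; the slug function here only approximates common renderer behavior, and anchor-id rules vary by implementation:

```python
import re

def slugify(text):
    """Approximate the anchor ids most Markdown renderers generate."""
    slug = text.strip().lower()
    slug = re.sub(r"[^\w\s-]", "", slug)   # drop punctuation
    return re.sub(r"\s+", "-", slug)       # spaces become hyphens

def table_of_contents(markdown_text):
    """Build a nested bullet-list TOC from ATX (#) headings.

    Naive on purpose: it would also match # lines inside
    fenced code blocks.
    """
    toc = []
    for line in markdown_text.splitlines():
        m = re.match(r"(#{1,6})\s+(.*)", line)
        if m:
            depth = len(m.group(1))
            title = m.group(2).strip()
            indent = "  " * (depth - 1)
            toc.append(f"{indent}- [{title}](#{slugify(title)})")
    return "\n".join(toc)

doc = "# Intro\n## Getting Started\nsome text\n## FAQ\n"
print(table_of_contents(doc))
```

The output is itself Markdown, so the generated TOC can be pasted (or scripted) straight back into the document. But the very fact that every team ends up writing some variant of this is the point: it should be built in.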

III. Forcing the user to fall back to inline HTML is not really OK

Like tables, there are a number of other formatting and layout use cases that Markdown can’t handle natively. As with tables, you must resort to just slapping in some raw HTML. Two reasons why this isn’t so amazing are:

  • It’s hard for an editor to support well, since editing “regular” text markup and tag-based markup languages are quite different beasts
  • It punts complexity to thousands of users in order to preserve implementation simplicity for a small number of implementors

I can sympathize with the reasoning behind this design decision, since I am usually the guy making his own little hacks that meet simple use cases, but again: not really OK for serious work.

IV. Too many different ways to express the same formatting

This has led to a number of incompatibilities among the different “Markdown” renderers out there. Just a few of the areas where ambiguity exists are: headers, lists, code sections, and links. For an introduction to Markdown’s flexible semantics, see the original syntax docs. Then, for a more elaborate description of the inconsistencies and challenges of rendering Markdown properly, see Why is a spec needed?, written by the CommonMark folks.

V. Too many incompatible flavors

There are too many incompatible flavors of Markdown that each render a document slightly differently. For a good description of the ways different Markdown implementations diverge, see the Babelmark 2 FAQ.

The “incompatible flavors” issue will hopefully be addressed with the advent of the CommonMark standard, but if you read the spec, it doesn’t address points I, II, or III at all. This makes sense from the perspective of the author of a standards document: a spec isn’t very useful unless you can achieve consensus and adoption among all the slightly different implementations out there right now, and Markdown as commonly understood doesn’t try to support those cases anyway.

VI. No native means of validation

There will of course be a reference implementation and tests for CommonMark, which will ensure that the content is valid Markdown, but for large-scale documentation deployments, you really need the ability to validate that the documentation sets you’re publishing have certain properties. These properties might include, but aren’t limited to:

  • “Do all of the links have valid targets?”
  • “Is every page reachable from some other page?”

Markdown doesn’t care about this. And to be fair it never said it would! You are of course free to use other tools to perform all of the validations you care about on the resulting HTML output. This isn’t necessarily so bad (in fact it’s not as bad as points I and II in my opinion, since those actually affect you while you’re authoring), but it’s an issue to be aware of.
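To make the two checks concrete, here is a minimal sketch over an invented {page: [link targets]} mapping; a real deployment would build this mapping from the rendered HTML rather than by hand:

```python
from collections import deque

def broken_links(site):
    """Return (page, target) pairs whose target is not a known page."""
    return [(page, t) for page, targets in site.items()
            for t in targets if t not in site]

def unreachable_pages(site, root):
    """Return pages that no chain of links starting at `root` reaches."""
    seen = {root}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        for target in site.get(page, []):
            if target in site and target not in seen:
                seen.add(target)
                queue.append(target)
    return sorted(set(site) - seen)

# Invented example site.
site = {
    "Home": ["Install", "FAQ"],
    "Install": ["Home"],
    "FAQ": ["Missing"],   # broken link
    "Orphan": ["Home"],   # nothing links here
}
print(broken_links(site))               # → [('FAQ', 'Missing')]
print(unreachable_pages(site, "Home"))  # → ['Orphan']
```

Neither check needs to know anything about Markdown; that is both the workaround and the complaint.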

This is one area where XML has some neat tooling and properties, although I suppose you could do something workable with a strict subset of HTML. You could also use pandoc to generate XML, which you then validate according to your needs.

Conclusion

Markdown solves its original use case well, while punting on many others in classic Worse is Better fashion. To be fair to Markdown, it was never purported to be anything other than a simple set of formatting conventions for web writing. And it’s worth saying once more that, even given its limitations, a homegrown combination of Markdown and a few scripts in a git repo with a Makefile is still an absolute paradise compared to almost all of the clunky proprietary tooling that is marketed and sold for the purposes of “mainstream” technical writing.

Even so, I hope I’ve presented an argument for why Markdown is not ideal for large scale technical documentation work.

(Image courtesy Gerwin Sturm under a Creative Commons license.)

Best. Markdown. Writing. Setup. Ever.

(Screenshot: markdown-emacs-compilation.png, Emacs and Firefox side by side.)

When writing in a source format such as Markdown, it’s nice to be able to see your changes show up automatically in the output. One of my favorite ways to work is to have Emacs and Firefox open side by side (as shown above). Whenever I save my Markdown file, I want Emacs to automatically build a new HTML file from it, and I want Firefox to automatically refresh to show the latest changes.

Once you have this set up, all you have to do is write and save, write and save.

As it happens, fellow Redditor goodevilgenius was looking to accomplish just this workflow. I originally posted this answer on Reddit, but I’m reposting it here in the hope that it will help some kindly internet stranger someday.

I have this exact use case. I use compile-on-save mode and the Firefox Auto Reload extension.

So in a Markdown buffer (once you’ve installed compile-on-save mode):

  1. M-x compile-on-save-mode RET
  2. M-x compile RET markdown current-file.md > /tmp/current-file.html RET
  3. Open /tmp/current-file.html in Firefox.
  4. Write stuff and save. Emacs will auto-compile the Markdown, and Firefox will instantly auto-reload the HTML file.

With Emacs and Firefox open side-by-side, I find it pretty easy to enter a “flow” state, since all you have to do is write and save the file. Hope that helps!

The Emacs-savvy reader will note that this workflow isn’t confined to Markdown. For example, compile-on-save mode could kick off an XML doc build (or any other computation you like, for that matter).