Thinking about software documentation as the output of a lossy compression algorithm


How many times have you heard or read the following comments?

  1. “The docs are always out of date”
  2. “I don’t bother reading the docs, I just read the source code”
  3. “If you write self-documenting code, you don’t need to write docs”

If you work in software, I bet the answer is: a lot. They range in truth value from #1, which is a tautology, to #3, which is a fantasy. I can relate to #2, since I had to do it just yesterday when using a semi-undocumented Ruby library.

I think all of these points of view (and more) are explained when you think about software documentation as the output of a lossy compression algorithm. Many (most? all?) of the things that you love and/or hate about the documentation for a particular piece of software are explained by this.

Quoth wiki:

Well-designed lossy compression technology often reduces file sizes significantly before degradation is noticed by the end-user. Even when noticeable by the user, further data reduction may be desirable (e.g., for real-time communication, to reduce transmission times, or to reduce storage needs).

As such, the features of software documentation are similar to the features of other products of lossy compression algorithms.

Take mp3s for example. The goal of an mp3 is not to be the highest-fidelity replication of the audio experience it represents. The goal of an mp3 is to provide a “good enough” audio experience given the necessary tradeoffs that had to be made because of constraints such as:

  • Time: How expensive in CPU time is it to run the decompression algorithm? Can the CPU on the device handle it?
  • Space: How much disk space will the file take on the device?

Similarly, we might say that the goal of software documentation is not to be the highest-fidelity replication of the “understanding” experience it (theoretically) represents. The goal of a piece of documentation is to provide a “good enough” learning experience given the necessary tradeoffs that had to be made because of constraints such as:

  • Time: How expensive in person time is it to “run the decompression algorithm”, i.e., learn how the system works well enough to write something coherent? And then to actually write it? How many technical writers does the organization employ per engineer? (In many companies it’s ~1 per 40-50 engineers.) How many concurrent projects is that writer working on, across how many teams?
  • Space: How much information does the user need to use the system? How little information can you get away with providing before users give up in disgust?

Remember that fewer, more effective materials cost more to produce. This is similar to the way better compression algorithms may cost more than worse ones along various axes you care about (dollar cost for proprietary algorithms, CPU, memory, etc.).

It takes longer to write more concise documentation, draw clear system diagrams, etc., since those are signs that you actually understand the system better, and have thus compressed the relevant information about it into fewer bytes.

And oh by the way, in practice in an “agile” (lol) environment you don’t have enough time to write the “best” docs for any given feature X. Just like the programmer who wrote feature X would likely admit that she didn’t have enough time to write the “best” implementation according to her standards.

Quoth Pascal:

I would have written a shorter letter, but I did not have the time.

So the next time you are frustrated by the docs for some piece of software (if any docs exist at all), instead of some platitude about docs sucking, think “oh, lossy compression”.

(Image courtesy Jonathan Sureau under Creative Commons license.)


A Mini Python and Shell Tutorial


The following is an email I sent to a couple of coworkers whom I’d been teaching a short Python course for technical writers, using Automate the Boring Stuff with Python. The email was meant to show them a real-life example of how a technical writer can use Python and shell scripting to automate something that is, well, boring. In this case, the task was to clean up a CSV file containing a list of git commits to the AppNexus REST APIs.

Because of the way we received this data, it had duplicate entries, and lots of non-interesting merge commits that were unrelated to a feature (a feature is generally associated with a JIRA ticket). Our task was to review the commits and see if there was anything interesting that should be added to our monthly API release notes.

(The names of my coworkers have been changed, obv.)


To: Jane X. (‘REDACTED@appnexus.com’)

Subject: Filtered API git commits to review (bonus: mini Python & shell tutorial)

From: Rich Loveland (‘REDACTED@appnexus.com’)

CC: Victoria Y. (‘REDACTED@appnexus.com’)

Date: Wed, 18 Nov 2015 16:59:04 -0500

+Victoria for the code fun

Jane, the file of commit logs for you to review is attached (along with some others). But so what, that’s boring! Let’s talk about how it was made.

To make the really boring task of reviewing API git commits less awful, let’s do some programming for fun. First let’s write a short Python script to pull out only those commits that have a JIRA ticket ID in them (since we don’t care about the other ones), and call it ‘filter-commit-messages.py’:

  #!/usr/bin/env python

  import re
  import sys

  jira_pat = "[A-Z]+-[0-9]+"

  for line in sys.stdin.readlines():
      m = re.search(jira_pat, line)
      if m:
          # line already ends in a newline, so write it as-is
          # (print() would add an extra blank line after each match)
          sys.stdout.write(line)

This tries to match a regular expression against each line of its input (in this case the compiled API git commit list), and prints the line if the match occurs.

Let’s make it executable from our shell:

$ cd ~/bin
$ ln -s ~/work/code/filter-commit-messages.py filter-commit-messages
$ chmod +x ~/bin/filter-commit-messages
$ export PATH=$HOME/bin:$PATH

Then we can run it on the text file with the git commits like so:

$ filter-commit-messages < api-release-november-2015.csv

(The “<” in the shell means “Read your input from this place”.)

This prints out only the matching lines, but there are a lot of annoying extra lines in the output. We can get rid of those lines while sorting them like so:

$ filter-commit-messages < api-release-november-2015.csv | sort 

(The “|” in the shell means “Pass your output through to this other command”.)

Now that we are extracting only the important lines, let’s throw them in a file:

$ filter-commit-messages < api-release-november-2015.csv | sort > api-release-november-2015-actual.csv

(The “>” near the end means “Write all of the output to this place”.)

We can see how much less reading we have to do now by running a word count program (‘wc’) on the before and after files:

$ wc -l api-release-november-2015.csv # old
     201 api-release-november-2015.csv
$ wc -l api-release-november-2015-actual.csv # new
     115 api-release-november-2015-actual.csv

(The “-l” means “count the lines”.)

Now, since Jane and I each have to review half of the commits, we can use the ‘split’ shell command to break the file in half. Since we know the file is 115 lines, we need to tell ‘split’ how many lines to put in each half with the ‘-l’ option (see ‘man split’ in your terminal):

$ split -l 58 api-release-november-2015-actual.csv COMMITS-TO-REVIEW

‘split’ takes the last argument, “COMMITS-TO-REVIEW”, and creates two files based on that, “COMMITS-TO-REVIEWaa” and “COMMITS-TO-REVIEWab”, which we can rename for each reviewer:

$ mv COMMITS-TO-REVIEWaa COMMITS-TO-REVIEW-RICH
$ mv COMMITS-TO-REVIEWab COMMITS-TO-REVIEW-JANE

A nice thing is that because we sorted the lines of the files, each reviewer gets commits by a sorted subset of the engineers, making it easier to see their related commits next to each other.

p.s. We didn’t actually need a Python program for the first part, we could have just used ‘grep’ and stayed with shell commands. But hey!
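
For example, something like this (a sketch using the same JIRA-ID pattern as the Python script) gets you the filtered, sorted file in one line:

$ grep -E '[A-Z]+-[0-9]+' api-release-november-2015.csv | sort > api-release-november-2015-actual.csv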

p.p.s. With more work, this could all be put together into a single program if we were inclined, but since it doesn’t get used that often it’s probably OK to type a few commands.
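
If we ever did want that single program, a rough sketch might look like the following (hypothetical and untested; the file name, reviewer names, and the halving logic are hard-coded the same way we did things interactively above):

  #!/usr/bin/env python
  # Sketch: filter, sort, and split the commit list in one go.

  import re
  import sys

  JIRA_PAT = re.compile(r"[A-Z]+-[0-9]+")
  REVIEWERS = ["RICH", "JANE"]

  def main(infile):
      with open(infile) as f:
          # Keep only the lines that mention a JIRA ticket ID, then sort them.
          lines = sorted(line for line in f if JIRA_PAT.search(line))

      # Deal the sorted lines out in two contiguous chunks, like `split -l` did.
      half = (len(lines) + 1) // 2
      for reviewer, chunk in zip(REVIEWERS, [lines[:half], lines[half:]]):
          with open("COMMITS-TO-REVIEW-" + reviewer, "w") as out:
              out.writelines(chunk)

  if __name__ == "__main__":
      main(sys.argv[1])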

(Image courtesy William Hartman under Creative Commons license.)

Oh my, am I really considering XML?

Since writing Why Markdown is not my favorite text markup language, I’ve been thinking more about document formats.

More and more I begin to see the impetus for the design of XML, despite its sometimes ugly implementation. With XML you avoid much of the ambiguity of parsing plain-text-based formats and just write the document AST directly. Whether this is a good or bad thing seems to depend on the tools you have available to you, but I think I’m starting to see the light.

At $WORK, for example, I’ve been writing directly in the “XML-ish” Confluence storage format since it was introduced in Confluence 4. Combined with the right editing environment (such as that provided by Emacs’ nxml-mode), it’s easy to navigate XML “structurally” in such a way that you no longer really see the tags.

It’s sort of like being Neo in The Matrix except that, instead of making cool shit happen in an immersive virtual world you’re, um, writing XML.

However, not all is roses in XML-land. In an ideal world, you could maintain a set of XML documents and reliably transform them into other valid formats using a simple set of tools that are easy to learn and use. In reality, many of the extant XML tools such as XSLT exhibit a design aesthetic that is deeply unappealing to most programmers. The semantics of XSLT are interesting, but the syntax appears to be a result of the mistakes that are often made when programmers decide to create their own DSLs. Olin Shivers has a good discussion of the often-broken “little language” phenomenon in his scsh paper.

Speaking of Scheme, it’s possible that something reasonable can be built with SXML. I’ve also had good results using Perl and Mojo::DOM to build Graphviz diagrams of the links among Confluence wiki pages as part of a hacked-together “link checker” (Users of Confluence in an industrial setting will know that the built-in link-checking in Confluence only “sort of” works, which is indistinguishable in practice from not actually working — hence the need to build my own thing).
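
For a flavor of what that kind of tool involves, here is a rough Python sketch (not the Perl/Mojo::DOM code I actually used, and it just scans the HTML files in the current directory) that scrapes the links out of exported pages and prints a Graphviz DOT graph:

  #!/usr/bin/env python
  # Sketch: print a Graphviz DOT graph of the links between exported HTML pages.

  import glob
  from html.parser import HTMLParser

  class LinkCollector(HTMLParser):
      def __init__(self):
          HTMLParser.__init__(self)
          self.links = []

      def handle_starttag(self, tag, attrs):
          if tag == "a":
              for name, value in attrs:
                  if name == "href" and value:
                      self.links.append(value)

  print("digraph links {")
  for page in glob.glob("*.html"):
      collector = LinkCollector()
      with open(page, errors="replace") as f:
          collector.feed(f.read())
      for target in collector.links:
          print('  "%s" -> "%s";' % (page, target))
  print("}")

Piping that output into ‘dot -Tpng’ gives you a picture of the link structure.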

I’ve also been playing around with MIT Scheme’s built-in XML Parser, and so far I’m preferring it to the Perl or SXML way of doing things.

Why Markdown is not my favorite text markup language

There are many text markup languages that purport to allow you to write in a simple markup format and publish to the web. Markdown has arguably emerged as the “king” of these formats. I quite like it myself when it’s used for writing short documents with relatively simple formatting needs. However, it falls a bit short when you start to do more elaborate work. This is especially the case when you are trying to do any kind of “serious” technical authoring.

I know that “Markdown” has been used to write technical books. Game Programming Patterns is one excellent example; you can read more about the author’s use of Markdown here, and the script he uses to extend Markdown to meet his needs is here. (I recommend reading all of his essays about how he wrote the book, by the way. They’re truly inspiring.) Based on that author’s experience (and some of my own), I know that Markdown can absolutely be used as a base upon which to build ebooks, websites, wikis, and more. However, this is exactly why I used the term “Markdown” in quotes at the beginning of this paragraph. By the time you’ve extended Markdown to cover your more featureful technical authoring use cases, it really isn’t “just” Markdown anymore. This is fine if you just want to get something done quickly that meets your own needs, but it’s not ideal if you want to work with a meaningful system that can be standardized and built on.

Below I’ll address just a few of the needs of “industrial” technical writing (the kind that I do, ostensibly) where Markdown falls a little short. Lest this come off as too negative, it’s worth stating for the record that a homegrown combination of Markdown and a few scripts in a git repo with a Makefile is still an absolute paradise compared to almost all of the clunky proprietary tooling that is marketed and sold for the purposes of “mainstream” technical writing. I have turned to such a homebrewed setup myself in times of need. I’ve even written about how awesome writing in Markdown can be. However, this essay is an attempt to capture my thoughts on Markdown’s shortcomings. Like any good internet crank, I reserve the right to pull a Nickieben Bourbaki at a later date.

I. No native table support

If you are doing any kind of large-scale tech docs, you need tables. Although constraints are always good, and a simple list can probably replace 80% of your table usage if you’re disciplined, there are times when you really just need a big honkin’ table. And as much as I’m used to editing raw XML and HTML directly in Emacs using its excellent tooling to completely sidestep the unwanted “upgrade” to the Confluence editor at $WORK, most writers probably don’t want to be authoring tables directly in HTML (which is the “native” Markdown solution).

II. No native table of contents support

Yes, I can write a script myself to do this. I can also use one of the dozens of such scripts written by others. However, I’d rather have something built in, and consider it a weakness of the format.
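
For the record, a minimal version of such a script might look like this (a sketch that only understands ‘#’-style headings and only approximates GitHub-style anchor names):

  #!/usr/bin/env python
  # Sketch: print a Markdown table of contents for a Markdown file read on stdin.

  import re
  import sys

  HEADING = re.compile(r"^(#{1,6})\s+(.*)")

  for line in sys.stdin:
      m = HEADING.match(line)
      if m:
          level, title = len(m.group(1)), m.group(2).strip()
          # Approximate a GitHub-style anchor: drop punctuation, lowercase,
          # turn spaces into hyphens.
          anchor = re.sub(r"[^\w\s-]", "", title).lower().replace(" ", "-")
          print("%s- [%s](#%s)" % ("  " * (level - 1), title, anchor))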

III. Forcing the user to fall back to inline HTML is not really OK

Like tables, there are a number of other formatting and layout use cases that Markdown can’t handle natively. As with tables, you must resort to just slapping in some raw HTML. Two reasons why this isn’t so amazing are:

  • It’s hard for an editor to support well, since editing “regular” text markup and editing tag-based markup are quite different beasts
  • It punts complexity to thousands of users in order to preserve implementation simplicity for a small number of implementors

I can sympathize with the reasoning behind this design decision, since I am usually the guy making his own little hacks that meet simple use cases, but again: not really OK for serious work.

IV. Too many different ways to express the same formatting

This has led to a number of incompatibilities among the different “Markdown” renderers out there. Just a few of the areas where ambiguity exists are: headers, lists, code sections, and links. For an introduction to Markdown’s flexible semantics, see the original syntax docs. Then, for a more elaborate description of the inconsistencies and challenges of rendering Markdown properly, see Why is a spec needed?, written by the CommonMark folks.

V. Too many incompatible flavors

There are too many incompatible flavors of Markdown that each render a document slightly differently. For a good description of the ways different Markdown implementations diverge, see the Babelmark 2 FAQ.

The “incompatible flavors” issue will hopefully be addressed with the advent of the CommonMark Standard, but if you read the spec it doesn’t address points I, II, or III at all. This makes sense from the perspective of the author of a standards document: a spec isn’t very useful unless you can achieve consensus and adoption among all the slightly different implementations out there right now, and Markdown as commonly understood doesn’t try to support those cases anyway.

VI. No native means of validation

There will of course be a reference implementation and tests for CommonMark, which will ensure that the content is valid Markdown, but for large-scale documentation deployments, you really need the ability to validate that the documentation sets you’re publishing have certain properties. These properties might include, but aren’t limited to:

  • “Do all of the links have valid targets?”
  • “Is every page reachable from some other page?”

Markdown doesn’t care about this. And to be fair it never said it would! You are of course free to use other tools to perform all of the validations you care about on the resulting HTML output. This isn’t necessarily so bad (in fact it’s not as bad as points I and II in my opinion, since those actually affect you while you’re authoring), but it’s an issue to be aware of.

This is one area where XML has some neat tooling and properties. Although I suppose you could do something workable with a strict subset of HTML. You could also use pandoc to generate XML, which you then validate according to your needs.
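
Back at the Markdown level, a toy version of the first check might look like this (hypothetical; it only understands inline ‘[text](target)’ links, assumes the sources live in a ‘docs/’ directory, and skips external URLs and ‘#’ anchors):

  #!/usr/bin/env python
  # Sketch: check that relative link targets in a set of Markdown files exist.

  import glob
  import os
  import re

  LINK = re.compile(r"\[[^\]]*\]\(([^)#]+)")

  missing = 0
  for path in glob.glob("docs/*.md"):
      with open(path) as f:
          for lineno, line in enumerate(f, 1):
              for target in LINK.findall(line):
                  if target.startswith(("http://", "https://")):
                      continue  # external links are a separate problem
                  if not os.path.exists(os.path.join(os.path.dirname(path), target)):
                      print("%s:%d: missing link target %s" % (path, lineno, target))
                      missing += 1

  raise SystemExit(1 if missing else 0)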

Conclusion

Markdown solves its original use case well, while punting on many others in classic Worse is Better fashion. To be fair to Markdown, it never purported to be anything other than a simple set of formatting conventions for web writing. And it’s worth saying once more that, even given its limitations, a homegrown combination of Markdown and a few scripts in a git repo with a Makefile is still an absolute paradise compared to almost all of the clunky proprietary tooling that is marketed and sold for the purposes of “mainstream” technical writing.

Even so, I hope I’ve presented an argument for why Markdown is not ideal for large scale technical documentation work.

(Image courtesy Gerwin Sturm under a Creative Commons license.)

More Recommendations for Technical Writers


In the same spirit as the last post in this series, I have more recommendations for technical writers working in software-land. As before, I make no claims to living up to these recommendations myself. It’s like, aspirational, man.

Without further ado:

Learn Markdown

Yes, Markdown. It’s become a de facto standard plain text format for all kinds of web writing and documentation. It’s easy to use, it’s everywhere, and there are many tools that you can use to work with it.

See also: What is Markdown?

If you want to know what could happen to you if you start using it for things, see Ryan Tomayko’s Why You Should Not Use Markdown.

Learn your web browser’s development tools

From time to time you will need to peek under the covers of a web app you are documenting and see what’s actually going on. Especially nowadays with single page apps, browser-side local storage, and all that fun stuff.

Even if you never need to know that stuff to write your UI docs, you can use the dev tools to rewrite text in user interfaces for making nicer screenshots.

Strunk & White! Yes, Really!

Read it. Then try like hell to live it. I’ll confine the rest of this document to “technical” issues, since I can’t improve on the advice given in that book.

(Again, this is an area where I need to work harder on following my own advice. My long sentences and passive constructions are killing me.)

Think about learning to read code

Everything in our world runs on code (or will soon). Learning how to read code at least up to a basic level can help you figure out what the hell’s going on, sometimes. Yes, you can talk to an engineer, but your conversation will be more productive if you at least have a starting point.

The reason is simple: It’s usually easier to go to someone with a wrong idea and have them correct you with the right ideas than it is to get that same someone to teach you everything from scratch.

This doesn’t mean you have to become a programmer yourself (although you should try to learn how to write scripts to automate boring computer-y stuff, as I noted previously). But like most foreign languages, it helps to be able to read a little to get directions from the locals. And in this case the locals (programmers) speak code.

Seek out pathological edge cases

In any really large system or “platform” designed and built by multiple people on different teams, not all of whom are talking to each other three times a day about every single design decision they make, there will be a number of edge cases. In other words, there will be parts of the system that don’t play well together, or that are, at best, um, unintuitive in the way they behave.

It’s your job to find these and document them as well as you can. Sometimes this will be low-priority because the issues will be fixed “soon” (for some value of soon). Sometimes it will be necessary, though. You will have to use your own best judgment, along with input from your friends in Engineering and the Product organization, about when to do this.

You probably won’t get them all (not even close), but if you actively seek them out as you go, you will develop a mindset that will help you learn the system a little better.

Also: no one will assume this is part of your job or give you actual time to work on this; you just have to do it.

Don’t listen to anybody

Perhaps that should be rephrased as: “Don’t listen to anybody … just yet.” Until you’ve done your own research and testing and had your own look at the thing you’re documenting (whatever that thing is), you can’t really write about it for someone else. When Product folks, Engineering, etc. tell you the sky is blue, you should still stick your head out the nearest window.

It’s not that they don’t know what they’re talking about (they usually know more than you), it’s that the features they’re telling you about may exist soon at your layer of the onion, but right now they’re sitting on a git branch in somebody’s laptop, or as a line in a technical spec on the internal wiki, or in a weird corner of the internal-only sandbox environment that you will need to hunt down in order to actually use the thing.

If you are documenting a web API, only the things the APIs actually do are real. Everything else is bullshit. That’s part of the meaning of “API”.

Nobody will explicitly budget time for this, either. But you have to do it.

Do your own research, but prepare to be Wrong

It’s always faster to bug someone with a quick question, and sometimes that’s necessary. That said, I’ve been embarrassed by asking quick questions that turned out to have stupidly easy answers that I could have found out for myself.

It’s better to be known as someone who tries to solve their own problems before reaching out for help. Doing some light functional testing as you document, trying out a few database queries, or even reading through code (if that even makes sense and if you are able), will teach you things about the system that could prove useful to you in the future. It will at least give you something to talk about when you do sit down with someone else.

Having said all that, you will still be Wrong a lot (yes, that’s capitalized and bold). It’s one thing to use a thing, and another to build it. The engineers will always know a lot more than you about the systems. If you can get them to point out your mistakes, you’re doing your job! You’re learning things! So don’t get discouraged.

But didn’t I just tell you not to listen to anybody? Yes, but it’s never that simple – these are complex systems.

You are probably an unpaid part-time QA person, embrace it

In the course of writing documentation, you will naturally test things out to ensure that what you’re writing isn’t totally useless (even so, that happens sometimes). You’ll probably spend a lot of time doing this, and taking notes on what you find. These notes will then be integrated into the documentation you write, or into bug reports of some kind to your engineering colleagues (which will probably be rejected because you are Wrong – see above).

Again: no one will assume this is part of your job or give you actual time to work on it; you just have to do it.

Never Forget that The Train Never Stops

Finally – and most importantly – don’t let the fact that everything is changing all of the time get you down. The user interface will change drastically; new APIs will be added, and old ones deprecated and removed; people you’ve gotten to know and like will come and go. If you stay at a job like this for a couple of years, you will be rewriting your Year Two revision of the rewrite you did when you started.

Through it all, “the platform” can never stop running and changing: upgrades, changes, and so on all have to happen while the train is running down the tracks at eighty miles per hour. It doesn’t stop for the engineering teams working on it, and it certainly isn’t going to stop for you.

Your best weapon against change is automating as much drudge work as possible (again with the scripting!) so that you can focus on:

  1. Learning your company’s systems as well as you can technically given your limited time and resources
  2. Knowing what’s getting built next, and why
  3. Knowing what the business cares about (this is usually what drives #2)

Finally, never forget that you are doing important work. Someday, when the people who designed and built the current system have moved on, people will wonder “Why does API service ‘X’ behave this way when I set the twiddle_frobs field on API ‘Y’ to true, but only when API Z’s ignore_frob_twiddle_settings field is not null?”

If you’ve done a good job (and gotten really lucky), there will be some documentation on that.

For a great perspective on why documentation is so important to the health of a company from a smart man who’s been around the block once or twice, see Tim Daly’s talk Literate Programming in the Large.

(Image courtesy typedow under CC-BY-NC-SA.)

Applying Lean Principles to the Documentation Lifecycle


Earlier, I promised to post my notes from talks I attended at the 2014 STC Summit. This talk, by Alan Houser, was probably the most impactful of the Summit for me. The tl;dr version is simply this: Find out what your customers value, and spend your time doing that.

Below is a lightly edited version of the notes I took during the session. The content of the talk is copyright Mr. Houser, and any errors are mine.

Big Ideas

  • Build/measure/learn
  • get out of the building
  • minimum viable product
  • pivot

How much of what we do truly provides value to the customer?

What we care about

  • deliverables
  • schedules
  • tools
  • org structure
  • office politics
  • legacy file formats

What customers care about

  • can i find it?
  • does it help me?

The Pivot

Can we, based on data, adjust what we do?

“We’ve always done it this way”.

How Companies Pivot

  • budget cuts
  • re-org
  • reduction in force

What Works?

Do That.

What Doesn’t?

Don’t Do That.

What do you measure?

  • pages?
  • topics?
  • words/topic?
  • word count of doc set
  • average word count of headings?
  • readability score?
  • hours/topic?
  • percentage of reuse?
  • revisions/time
  • customer views/topic
  • number of unique words

Do You Get Out of the Building?

What is Waste?

  • things that don’t provide customer value
  • waste time, money, resources, focus
  • (some orgs try to do too much)
  • let’s document this corner case
  • let’s adjust this formatting
  • let’s deliver a CHM file

Let It Go!

Are you continually asking: How does this provide value?

Do you pivot when your process is not aligned with customer value?

Rocky Balboa did two things in the story:

1. Transformed himself

2. Massively Exceeded Expectations

How to exceed expectations?

1. learn something new

2. try something different

3. talk to customers

4. measure something you haven’t before

(Image courtesy dirtyf under Creative Commons license.)

Thoughts on the 2014 STC Summit

This is a collection of random thoughts based on my attendance at the 2014 STC Summit earlier this week. I will try to post my more detailed notes from the various individual talks over the next days and weeks.

Lots of proprietary tools, not so much open source

There are lots of proprietary document creation and management tools, and their vendors seem to be well-represented here. Coming from a hybrid tech-writing/programming background, I have to admit that some of the proprietary solutions looked sort of strange to me. Many appeared to be Windows-only to boot.

It seems like there is a lot of opportunity for open-source software to make inroads here. I wonder what it would take to bring the XML-editing capabilities of open source editors like Emacs and Vim up to date (if they aren’t already) to match the capabilities of proprietary tools like FrameMaker, Oxygen XML Editor, MadCap Flare, and the like.

There are a few reasons why open source and tech comm could be a match made in heaven. Especially when you consider that whatever improvements in process or tooling you create in open source environments are yours to keep, free of charge, forever. This is definitely not the case in proprietary environments. When you develop your own automation and tooling against proprietary tools, and the vendor breaks stuff, you’re often out of luck.

DITA and XML

It turns out that XML and the transformation tools that work on it such as XSLT and friends are pretty powerful. I have been aware of the existence of these technologies but haven’t used them much thus far in my career.

I feel like I understood the appeal of the DITA XML spec/style better after attending a great talk given by Caitlin Cronkhite and Ted Kuster from Salesforce. As I understood them to say, DITA is just a way of structuring your XML into topics that other tools can then use to create your documentation set with minimal repetition on the part of the writer. (I will put up my notes from that talk in another post.)

However, I admit I still don’t fully understand the reason for layering the proprietary environments over the top of the structure provided by DITA. I would probably prefer to author directly in XML using Emacs and nXML-mode. Alternatively, I’d use a markup language, such as a Markdown variant, that could be translated to XML with a script much like my own confluence2html, and build the various document sets I needed using Makefiles.

Key Takeaway: Do More Professional Development

The number one lesson from this trip was that I have a lot to learn (this should be evident from the preceding paragraphs). There are so many tools and techniques out there that I am not aware of. I’ve only been at this tech writing gig for a couple of years now, after all.

I look forward to engaging more tech writers working in other industries to learn about how they do what they do. My hope is that this will allow me to develop my own skills by stealing some of their best ideas while sharing some of my own crazy notions as well.