A Mini Python and Shell Tutorial


The following is an email I sent to a couple of coworkers whom I’d been teaching a short Python course for technical writers, using Automate the Boring Stuff with Python. The email was meant to show them a real-life example of how a technical writer can use Python and shell scripting to automate something that is, well, boring. In this case, the task was to clean up a CSV file containing a list of git commits to the AppNexus REST APIs.

Because of the way we received this data, it had duplicate entries, and lots of non-interesting merge commits that were unrelated to a feature (a feature is generally associated with a JIRA ticket). Our task was to review the commits and see if there was anything interesting that should be added to our monthly API release notes.

(The names of my coworkers have been changed, obv.)

To: Jane X. (‘REDACTED@appnexus.com’)

Subject: Filtered API git commits to review (bonus: mini Python & shell tutorial)

From: Rich Loveland (‘REDACTED@appnexus.com’)

CC: Victoria Y. (‘REDACTED@appnexus.com’)

Date: Wed, 18 Nov 2015 16:59:04 -0500

+Victoria for the code fun

Jane, the file of commit logs for you to review is attached (along with some others). But so what, that’s boring! Let’s talk about how it was made.

To make the really boring task of reviewing API git commits less awful, let’s do some programming for fun. First let’s write a short Python script to pull out only those commits that have a JIRA ticket ID in them (since we don’t care about the other ones), and call it ‘filter-commit-messages.py’:

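  #!/usr/bin/env python

  import re
  import sys

  # A JIRA ticket ID looks like "ABC-123": capital letters, a hyphen, digits.
  jira_pat = r"[A-Z]+-[0-9]+"

  for line in sys.stdin:
      m = re.search(jira_pat, line)
      if m:
          # The line still ends with its newline, so write it out as-is.
          sys.stdout.write(line)
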
This tries to match a regular expression against each line of its input (in this case the compiled API git commit list), and prints the line if the match occurs.
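
To get a feel for what the pattern does and doesn't match, here is a small sketch you can paste into a Python 3 prompt. The sample lines and ticket IDs are made up for illustration; they are not from the real commit file.

```python
import re

# Same pattern as in the script above.
JIRA_PAT = r"[A-Z]+-[0-9]+"

# Hypothetical sample lines, invented for illustration only.
sample_lines = [
    "a1b2c3d,jdoe,API-1234 Add a new field to the report service",
    "d4e5f6a,jdoe,Merge branch 'master' into develop",
    "0a9b8c7,asmith,Fix typo in README",
]


def has_ticket_id(line):
    """Return True if the line contains something shaped like a JIRA ID."""
    return re.search(JIRA_PAT, line) is not None


# Keep only the lines that mention a ticket.
matching = [line for line in sample_lines if has_ticket_id(line)]
for line in matching:
    print(line)
```

Note that the pattern is deliberately loose: a string like "UTF-8" has the same shape as a ticket ID, so a few false positives may sneak through. For a one-off review that's probably fine.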

Let’s make it executable from our shell:

$ cd ~/bin
$ ln -s ~/work/code/filter-commit-messages.py filter-commit-messages
$ chmod +x ~/bin/filter-commit-messages
$ export PATH=$HOME/bin:$PATH

Then we can run it on the text file with the git commits like so:

$ filter-commit-messages < api-release-november-2015.csv

(The “<” in the shell means “Read your input from this place”.)

This prints out only the matching lines, but there are a lot of annoying extra lines in the output. We can get rid of those lines while sorting them like so:

$ filter-commit-messages < api-release-november-2015.csv | sort | uniq

(The “|” in the shell means “Pass your output through to this other command”.)

Now that we are extracting only the important lines, let’s throw them in a file:

$ filter-commit-messages < api-release-november-2015.csv | sort | uniq > api-release-november-2015-actual.csv

(The “>” near the end means “Write all of the output to this place”.)

We can see how much less reading we have to do now by running a word count program (‘wc’) on the before and after files:

$ wc -l api-release-november-2015.csv # old
     201 api-release-november-2015.csv
$ wc -l api-release-november-2015-actual.csv # new
     115 api-release-november-2015-actual.csv

(The “-l” means “count the lines”.)

Now, since Jane and I each have to review half of the commits, we can use the ‘split’ shell command to break the file in half. Since we know the file is 115 lines, we need to tell ‘split’ how many lines to put in each half with the ‘-l’ option (see ‘man split’ in your terminal):

$ split -l 58 api-release-november-2015-actual.csv COMMITS-TO-REVIEW

‘split’ takes the last argument, “COMMITS-TO-REVIEW”, and creates two files based on that, “COMMITS-TO-REVIEWaa” and “COMMITS-TO-REVIEWab”, which we can rename for each reviewer.


A nice thing is that because we sorted the lines of the files, each reviewer gets commits by a sorted subset of the engineers, making it easier to see their related commits next to each other.

p.s. We didn’t actually need a Python program for the first part, we could have just used ‘grep’ and stayed with shell commands. But hey!

p.p.s. With more work, this could all be put together into a single program if we were inclined, but since it doesn’t get used that often it’s probably OK to type a few commands.
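
For the curious, here is a rough sketch of what that single program might look like in Python 3. It is only one way to do it: the function names are my own invention, and it drops duplicates with a set rather than with shell tools.

```python
import re

JIRA_PAT = re.compile(r"[A-Z]+-[0-9]+")


def filter_commits(lines):
    """Keep lines with a JIRA-style ID, drop duplicates, and sort them."""
    return sorted({line for line in lines if JIRA_PAT.search(line)})


def split_in_half(lines):
    """Split into two lists; the first gets the extra line if the count is odd."""
    middle = (len(lines) + 1) // 2
    return lines[:middle], lines[middle:]


# Hypothetical input, standing in for the lines of the real CSV file.
lines = [
    "API-2 Second change\n",
    "Merge branch 'master'\n",
    "API-1 First change\n",
    "API-2 Second change\n",
    "API-3 Third change\n",
]

kept = filter_commits(lines)
first, second = split_in_half(kept)
print("".join(first))
print("".join(second))
```

With 115 lines, `split_in_half` puts 58 lines in the first half, the same number we passed to ‘split’ by hand. Python sorts strings by character code, so the order may differ slightly from what ‘sort’ gives you in some locales.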

(Image courtesy William Hartman under Creative Commons license.)

Oh my, am I really considering XML?

Since writing Why Markdown is not my favorite text markup language, I’ve been thinking more about document formats.

More and more I begin to see the impetus for the design of XML, despite its sometimes ugly implementation. With XML you avoid much of the ambiguity of parsing plain-text-based formats and just write the document AST directly. Whether this is a good or bad thing seems to depend on the tools you have available to you, but I think I’m starting to see the light.

At $WORK, for example, I’ve been writing directly in the “XML-ish” Confluence storage format since it was introduced in Confluence 4. Combined with the right editing environment (such as that provided by Emacs’ nxml-mode), it’s easy to navigate XML “structurally” in such a way that you no longer really see the tags.

It’s sort of like being Neo in The Matrix except that, instead of making cool shit happen in an immersive virtual world you’re, um, writing XML.

However, not all is roses in XML-land. In an ideal world, you could maintain a set of XML documents and reliably transform them into other valid formats using a simple set of tools that are easy to learn and use. In reality, many of the extant XML tools such as XSLT exhibit a design aesthetic that is deeply unappealing to most programmers. The semantics of XSLT are interesting, but the syntax appears to be a result of the mistakes that are often made when programmers decide to create their own DSLs. Olin Shivers has a good discussion of the often-broken “little language” phenomenon in his scsh paper.

Speaking of Scheme, it’s possible that something reasonable can be built with SXML. I’ve also had good results using Perl and Mojo::DOM to build Graphviz diagrams of the links among Confluence wiki pages as part of a hacked-together “link checker” (Users of Confluence in an industrial setting will know that the built-in link-checking in Confluence only “sort of” works, which is indistinguishable in practice from not actually working — hence the need to build my own thing).

I’ve also been playing around with MIT Scheme’s built-in XML Parser, and so far I’m preferring it to the Perl or SXML way of doing things.

Why Markdown is not my favorite text markup language

There are many text markup languages that purport to allow you to write in a simple markup format and publish to the web. Markdown has arguably emerged as the “king” of these formats. I quite like it myself when it’s used for writing short documents with relatively simple formatting needs. However, it falls a bit short when you start to do more elaborate work. This is especially the case when you are trying to do any kind of “serious” technical authoring.

I know that “Markdown” has been used to write technical books. Game Programming Patterns is one excellent example; you can read more about the author’s use of Markdown here, and the script he uses to extend Markdown to meet his needs is here. (I recommend reading all of his essays about how he wrote the book, by the way. They’re truly inspiring.) Based on that author’s experience (and some of my own), I know that Markdown can absolutely be used as a base upon which to build ebooks, websites, wikis, and more. However, this is exactly why I used the term “Markdown” in quotes at the beginning of this paragraph. By the time you’ve extended Markdown to cover your more featureful technical authoring use cases, it really isn’t “just” Markdown anymore. This is fine if you just want to get something done quickly that meets your own needs, but it’s not ideal if you want to work with a meaningful system that can be standardized and built on.

Below I’ll address just a few of the needs of “industrial” technical writing (the kind that I do, ostensibly) where Markdown falls a little short. Lest this come off as too negative, it’s worth stating for the record that a homegrown combination of Markdown and a few scripts in a git repo with a Makefile is still an absolute paradise compared to almost all of the clunky proprietary tooling that is marketed and sold for the purposes of “mainstream” technical writing. I have turned to such a homebrewed setup myself in times of need. I’ve even written about how awesome writing in Markdown can be. However, this essay is an attempt to capture my thoughts on Markdown’s shortcomings. Like any good internet crank, I reserve the right to pull a Nickieben Bourbaki at a later date.

I. No native table support

If you are doing any kind of large-scale tech docs, you need tables. Although constraints are always good, and a simple list can probably replace 80% of your table usage if you’re disciplined, there are times when you really just need a big honkin’ table. And as much as I’m used to editing raw XML and HTML directly in Emacs using its excellent tooling to completely sidestep the unwanted “upgrade” to the Confluence editor at $WORK, most writers probably don’t want to be authoring tables directly in HTML (which is the “native” Markdown solution).

II. No native table of contents support

Yes, I can write a script myself to do this. I can also use one of the dozens of such scripts written by others. However, I’d rather have something built in, and consider it a weakness of the format.
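
To give a sense of how small such a script can be, here is a minimal sketch in Python 3. It only understands “#”-style headings, skips fenced code blocks, and invents its own anchor scheme (the `slugify` function below is an assumption; real renderers each generate anchors a little differently, which is rather the point of this essay).

```python
import re

# Matches "#"-style headings, levels 1 through 6, with optional closing hashes.
HEADING = re.compile(r"^(#{1,6})\s+(.*?)\s*#*\s*$")


def slugify(title):
    """Turn a heading into an anchor name (assumed scheme; renderers differ)."""
    slug = re.sub(r"[^\w\s-]", "", title.lower())
    return re.sub(r"\s+", "-", slug.strip())


def build_toc(markdown_text):
    """Return a nested Markdown list that links to each heading."""
    toc = []
    in_code = False
    for line in markdown_text.splitlines():
        if line.startswith("```"):
            # Toggle on every fence so "#" comments in code are ignored.
            in_code = not in_code
            continue
        if in_code:
            continue
        m = HEADING.match(line)
        if m:
            level = len(m.group(1))
            title = m.group(2)
            indent = "  " * (level - 1)
            toc.append("%s- [%s](#%s)" % (indent, title, slugify(title)))
    return "\n".join(toc)


# Hypothetical document, for illustration only.
doc = "# Title\n\nText\n\n## First part\n\n```\n# not a heading\n```\n\n## Second part\n"
print(build_toc(doc))
```

It works, but every team ends up with its own slightly different version of this, and none of them are part of the format.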

III. Forcing the user to fall back to inline HTML is not really OK

Like tables, there are a number of other formatting and layout use cases that Markdown can’t handle natively. As with tables, you must resort to just slapping in some raw HTML. Two reasons why this isn’t so amazing are:

  • It’s hard for an editor to support well, since editing “regular” text markup and tag-based markup languages are quite different beasts
  • It punts complexity to thousands of users in order to preserve implementation simplicity for a small number of implementors

I can sympathize with the reasoning behind this design decision, since I am usually the guy making his own little hacks that meet simple use cases, but again: not really OK for serious work.

IV. Too many different ways to express the same formatting

This has led to a number of incompatibilities among the different “Markdown” renderers out there. Just a few of the areas where ambiguity exists are: headers, lists, code sections, and links. For an introduction to Markdown’s flexible semantics, see the original syntax docs. Then, for a more elaborate description of the inconsistencies and challenges of rendering Markdown properly, see Why is a spec needed?, written by the CommonMark folks.

V. Too many incompatible flavors

There are too many incompatible flavors of Markdown that each render a document slightly differently. For a good description of the ways different Markdown implementations diverge, see the Babelmark 2 FAQ.

The “incompatible flavors” issue will hopefully be addressed with the advent of the CommonMark Standard, but if you read the spec it doesn’t address points I, II, or III at all. This makes sense from the perspective of the author of a standards document: a spec isn’t very useful unless you can achieve consensus and adoption among all the slightly different implementations out there right now, and Markdown as commonly understood doesn’t try to support those cases anyway.

VI. No native means of validation

There will of course be a reference implementation and tests for CommonMark, which will ensure that the content is valid Markdown, but for large-scale documentation deployments, you really need the ability to validate that the documentation sets you’re publishing have certain properties. These properties might include, but aren’t limited to:

  • “Do all of the links have valid targets?”
  • “Is every page reachable from some other page?”

Markdown doesn’t care about this. And to be fair it never said it would! You are of course free to use other tools to perform all of the validations you care about on the resulting HTML output. This isn’t necessarily so bad (in fact it’s not as bad as points I and II in my opinion, since those actually affect you while you’re authoring), but it’s an issue to be aware of.
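
As a rough illustration of checking the rendered HTML, here is a sketch in Python 3 that answers both questions for a small set of pages. It assumes the pages are held in a dictionary keyed by file name and that every link is a plain relative file name; external URLs, fragments, and directories are ignored for simplicity. The page names and contents are made up.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def links_in(html):
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


def broken_links(pages):
    """Return sorted (page, target) pairs whose target is not a known page."""
    return sorted(
        (name, target)
        for name, html in pages.items()
        for target in links_in(html)
        if target not in pages
    )


def orphan_pages(pages):
    """Return sorted names of pages that no other page links to."""
    linked = set()
    for name, html in pages.items():
        for target in links_in(html):
            if target != name:
                linked.add(target)
    return sorted(set(pages) - linked)


# Hypothetical documentation set, for illustration only.
pages = {
    "index.html": '<a href="guide.html">Guide</a> <a href="missing.html">Oops</a>',
    "guide.html": '<a href="index.html">Home</a>',
    "lonely.html": '<a href="index.html">Home</a>',
}

print(broken_links(pages))
print(orphan_pages(pages))
```

Here `missing.html` is reported as a broken link from `index.html`, and `lonely.html` is reported as a page nothing else links to.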

This is one area where XML has some neat tooling and properties. Although I suppose you could do something workable with a strict subset of HTML. You could also use pandoc to generate XML, which you then validate according to your needs.


Markdown solves its original use case well, while punting on many others in classic Worse is Better fashion. To be fair to Markdown, it was never purported to be anything other than a simple set of formatting conventions for web writing. And it’s worth saying once more that, even given its limitations, a homegrown combination of Markdown and a few scripts in a git repo with a Makefile is still an absolute paradise compared to almost all of the clunky proprietary tooling that is marketed and sold for the purposes of “mainstream” technical writing.

Even so, I hope I’ve presented an argument for why Markdown is not ideal for large-scale technical documentation work.

(Image courtesy Gerwin Sturm under a Creative Commons license.)

More Recommendations for Technical Writers


In the same spirit as the last post in this series, I have more recommendations for technical writers working in software-land. As before, I make no claims to living up to these recommendations myself. It’s like, aspirational, man.

Without further ado:

Learn Markdown

Yes, Markdown. It’s become a de facto standard plain text format for all kinds of web writing and documentation. It’s easy to use, it’s everywhere, and there are many tools that you can use to work with it.

See also: What is Markdown?

If you want to know what could happen to you if you start using it for things, see Ryan Tomayko’s Why You Should Not Use Markdown.

Learn your web browser’s development tools

From time to time you will need to peek under the covers of a web app you are documenting and see what’s actually going on. Especially nowadays with single page apps, browser-side local storage, and all that fun stuff.

Even if you never need to know that stuff to write your UI docs, you can use the dev tools to rewrite text in user interfaces for making nicer screenshots.

Strunk & White! Yes, Really!

Read it. Then try like hell to live it. I’ll confine the rest of this document to “technical” issues, since I can’t improve on the advice given in that book.

(Again, this is an area where I need to work harder on following my own advice. My long sentences and passive constructions are killing me.)

Think about learning to read code

Everything in our world runs on code (or will soon). Learning how to read code at least up to a basic level can help you figure out what the hell’s going on, sometimes. Yes, you can talk to an engineer, but your conversation will be more productive if you at least have a starting point.

The reason is simple: It’s usually easier to go to someone with a wrong idea and have them correct you with the right ideas than it is to get that same someone to teach you everything from scratch.

This doesn’t mean you have to become a programmer yourself (although you should try to learn how to write scripts to automate boring computer-y stuff, as I noted previously). But like most foreign languages, it helps to be able to read a little to get directions from the locals. And in this case the locals (programmers) speak code.

Seek out pathological edge cases

In any really large system or “platform” designed and built by multiple people on different teams, not all of whom are talking to each other three times a day about every single design decision they make, there will be a number of edge cases. In other words, there will be parts of the system that don’t play well together, or are at best, um, unintuitive in the way they behave.

It’s your job to find these and document them as well as you can. Sometimes this will be low-priority because the issues will be fixed “soon” (for some value of soon). Sometimes it will be necessary, though. You will have to use your own best judgment, along with input from your friends in Engineering and the Product organization, about when to do this.

You probably won’t get them all (not even close), but if you actively seek them out as you go, you will develop a mindset that will help you learn the system a little better.

Also: no one will assume this is part of your job or give you actual time to work on this; you just have to do it.

Don’t listen to anybody

Perhaps that should be rephrased as: “Don’t listen to anybody … just yet.” Until you’ve done your own research and testing and had your own look at the thing you’re documenting (whatever that thing is), you can’t really write about it for someone else. When Product folks, Engineering, etc. tell you the sky is blue, you should still stick your head out the nearest window.

It’s not that they don’t know what they’re talking about (they usually know more than you), it’s that the features they’re telling you about may exist soon at your layer of the onion, but right now they’re sitting on a git branch in somebody’s laptop, or as a line in a technical spec on the internal wiki, or in a weird corner of the internal-only sandbox environment that you will need to hunt down in order to actually use the thing.

If you are documenting a web API, only the things the APIs actually do are real. Everything else is bullshit. That’s part of the meaning of “API”.

Nobody will explicitly budget time for this, either. But you have to do it.

Do your own research, but prepare to be Wrong

It’s always faster to bug someone with a quick question, and sometimes that’s necessary. That said, I’ve been embarrassed by asking quick questions that turned out to have stupidly easy answers that I could have found out for myself.

It’s better to be known as someone who tries to solve their own problems before reaching out for help. Doing some light functional testing as you document, trying out a few database queries, or even reading through code (if that even makes sense and if you are able), will teach you things about the system that could prove useful to you in the future. It will at least give you something to talk about when you do sit down with someone else.

Having said all that, you will still be Wrong a lot (yes, that’s capitalized and bold). It’s one thing to use a thing, and another to build it. The engineers will always know a lot more than you about the systems. If you can get them to point out your mistakes, you’re doing your job! You’re learning things! So don’t get discouraged.

But didn’t I just tell you not to listen to anybody? Yes, but it’s never that simple – these are complex systems.

You are probably an unpaid part-time QA person, embrace it

In the course of writing documentation, you will naturally test things out to ensure that what you’re writing isn’t totally useless (even so, that happens sometimes). You’ll probably spend a lot of time doing this, and taking notes on what you find. These notes will then be integrated into the documentation you write, or into bug reports of some kind to your engineering colleagues (which will probably be rejected because you are Wrong – see above).

Again: no one will assume this is part of your job or give you actual time to work on it; you just have to do it.

Never Forget that The Train Never Stops

Finally – and most importantly – don’t let the fact that everything is changing all of the time get you down. The user interface will change drastically; new APIs will be added, and old ones deprecated and removed; people you’ve gotten to know and like will come and go. If you stay at a job like this for a couple of years, you will be rewriting your Year Two revision of the rewrite you did when you started.

Through it all, “the platform” can never stop running and changing: upgrades, changes, and so on all have to happen while the train is running down the tracks at eighty miles per hour. It doesn’t stop for the engineering teams working on it, and it certainly isn’t going to stop for you.

Your best weapon against change is automating as much drudge work as possible (again with the scripting!) so that you can focus on:

  1. Learning your company’s systems as well as you can technically given your limited time and resources
  2. Knowing what’s getting built next, and why
  3. Knowing what the business cares about (this is usually what drives #2)

Finally, never forget that you are doing important work. Someday, when the people who designed and built the current system have moved on, people will wonder “Why does API service ‘X’ behave this way when I set the twiddle_frobs field on API ‘Y’ to true, but only when API Z’s ignore_frob_twiddle_settings field is not null?”

If you’ve done a good job (and gotten really lucky), there will be some documentation on that.

For a great perspective on why documentation is so important to the health of a company from a smart man who’s been around the block once or twice, see Tim Daly’s talk Literate Programming in the Large.

(Image courtesy typedow under CC-BY-NC-SA.)

Applying Lean Principles to the Documentation Lifecycle


Earlier, I promised to post my notes from talks I attended at the 2014 STC Summit. This talk, by Alan Houser, was probably the most impactful of the Summit for me. The tl;dr version is simply this: Find out what your customers value, and spend your time doing that.

Below is a lightly edited version of the notes I took during the session. The content of the talk is copyright Mr. Houser, and any errors are mine.

Big Ideas

  • Build/measure/learn
  • get out of the building
  • minimum viable product
  • pivot

How much of what we do truly provides value to the customer?

What we care about

  • deliverables
  • schedules
  • tools
  • org structure
  • office politics
  • legacy file formats

What customers care about

  • can I find it?
  • does it help me?

The Pivot

Can we, based on data, adjust what we do?

“We’ve always done it this way”.

How Companies Pivot

  • budget cuts
  • re-org
  • reduction in force

What Works?

Do That.

What Doesn’t?

Don’t Do That.

What do you measure?

  • pages?
  • topics?
  • words/topic?
  • word count of doc set
  • average word count of headings?
  • readability score?
  • hours/topic?
  • percentage of reuse?
  • revisions/time
  • customer views/topic
  • number of unique words
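
(A note from me, not from the talk: several of these are easy to compute if your docs are plain text. Here is a minimal Python 3 sketch, assuming the doc set is a dictionary mapping topic titles to their text. The topics are made up, and what counts as a “word” is my own simplification.)

```python
import re

WORD = re.compile(r"[A-Za-z0-9']+")


def doc_metrics(topics):
    """Compute a few simple counts for a mapping of topic title -> text."""
    words_per_topic = {
        title: len(WORD.findall(text)) for title, text in topics.items()
    }
    all_words = [
        word.lower() for text in topics.values() for word in WORD.findall(text)
    ]
    total = len(all_words)
    return {
        "topics": len(topics),
        "total_words": total,
        "words_per_topic": words_per_topic,
        "average_words_per_topic": total / len(topics) if topics else 0,
        "unique_words": len(set(all_words)),
    }


# Hypothetical doc set, for illustration only.
topics = {
    "Install": "Run the installer. Then restart.",
    "Upgrade": "Run the upgrade script.",
}
metrics = doc_metrics(topics)
print(metrics)
```

Whether any of these numbers says anything about customer value is, of course, the real question.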

Do You Get Out of the Building?

What is Waste?

  • things that don’t provide customer value
  • waste time, money, resources, focus
  • (some orgs try to do too much)
  • let’s document this corner case
  • let’s adjust this formatting
  • let’s deliver a CHM file

Let It Go!

Are you continually asking: How does this provide value?

Do you pivot when your process is not aligned with customer value?

Rocky Balboa did two things in the story:

1. Transformed himself

2. Massively Exceeded Expectations

How to exceed expectations?

1. learn something new

2. try something different

3. talk to customers

4. measure something you haven’t before

(Image courtesy dirtyf under Creative Commons License)

Recommendations for Technical Writers

This document describes my recommendations for technical writers working in the software industry (at least the web-focused corner of it). Be forewarned: it’s opinionated. I don’t even live up to it, so you can think of it as an extended note to self.

Learn the UNIX command line

The entire modern internet computing infrastructure is built on open source, UNIX-like operating systems (mostly Linux). If you’re not familiar with them, you will have a more difficult time learning about how large-scale software systems running on networked computers work, otherwise known as: virtually all software systems of importance 1.

To get there, you’ll need to know how to interact with the command line interface (also known as a “terminal”). This is a very plain-looking text-only interface where you will type commands formatted in a strict language understood by the computer. You will learn (if you haven’t already) that the computer always does exactly what you’ve asked it to, but not necessarily what you wanted it to do.

There are a number of “shells” you can use on a UNIX-like OS. I recommend Bash because it’s ubiquitous.

Recommended Reading:

Learn REST API principles

Here’s a light-speed introduction to REST APIs. A large software system has lots of “objects”, which are generally just pieces of stored data about something. For example, where I work, there are objects that represent an advertising campaign or a user in the system.

It doesn’t really matter what data the objects represent; the point of an API is that it creates a single point of contact for external entities (usually programs) that want to interact with the data in the system. It ensures that only certain operations can be performed, and only by authorized parties.

Here’s where the REST part comes in: for any object represented in the system, you can (in theory) perform 4 basic operations on it:

  • Create: Make a new object of that type
  • Read: Look at what information the object contains
  • Update: Change the information the object contains
  • Delete: Delete the object from the system altogether
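
In HTTP-based APIs, these four operations are conventionally carried by the POST, GET, PUT (or sometimes PATCH), and DELETE methods. Here is a rough Python 3 sketch of that mapping. The service at `api.example.com` and the `campaign` object are hypothetical, and the code only builds requests; it never sends anything over the network.

```python
import json
from urllib.request import Request

# Conventional mapping; individual APIs vary (e.g. PATCH for partial updates).
CRUD_TO_HTTP = {
    "create": "POST",
    "read": "GET",
    "update": "PUT",
    "delete": "DELETE",
}

BASE_URL = "https://api.example.com"  # hypothetical service


def build_request(operation, object_type, object_id=None, data=None):
    """Build (but do not send) a request for one of the four operations."""
    url = "%s/%s" % (BASE_URL, object_type)
    if object_id is not None:
        url += "/%s" % object_id
    body = json.dumps(data).encode("utf-8") if data is not None else None
    return Request(url, data=body, method=CRUD_TO_HTTP[operation])


# Update the (made-up) campaign with ID 42.
req = build_request("update", "campaign", object_id=42, data={"name": "Spring sale"})
print(req.get_method(), req.full_url)
```

The useful thing to notice is that the URL names the object and the HTTP method names the operation. Real APIs bend this convention in all sorts of ways, which is one reason they need documenting.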

Unfortunately, I know about RESTful APIs only from practice, not study. There’s a copy of O’Reilly’s RESTful Web Services at work that I’ve been meaning to read to fill in the gaps. See? This really is a note to self.

Recommended Reading:

Learn to use a programmer’s text editor

You’re a professional who gets paid to work with text all day. Programmers are also professionals who get paid to work with text all day. You should be following their example by using a powerful plain text editor; it will allow you to automate away some of the tedious parts of editing.

Remember, word processors are written by programmers for others to use. Text editors are written by programmers for themselves to use. It’s not even a contest.

I recommend vim or Emacs. Pick one, it doesn’t matter.

Recommended Reading:

Learn a programming language with strong UNIX integration

As noted above, you are going to be documenting software running on networked computers doing … networky stuff. For example, you might be making API calls to a “cloud” of some kind 2. A lot of it will be tedious and potentially time-consuming, which means that you should automate it using a programming language. Once you are pretty familiar with a range of terminal commands, it won’t be too difficult to pick up a language with strong UNIX integration. Tastes vary; I am most familiar with Perl. I do not recommend wasting much time writing shell scripts, as there are too many weird edge cases for beginners to trip over.

Recommended Reading:

  • Learning Perl by Randal L. Schwartz, brian d foy, and Tom Phoenix
  • Modern Perl by chromatic


1 Where “importance” is defined as: someone will pay you good money to document it. Personally, I’d love to get a paying job rewriting half of the Emacs manual. But I digress.

2 A Freudian slip by the tech industry perhaps? After all, “the cloud” brings to mind an ephemeral, floating entity that is mostly vaporous, and exists at the whim of forces you don’t really understand, and certainly can’t control. In addition, it’s highly likely to be blown away at any time.