A Mini Python and Shell Tutorial


The following is an email I sent to a couple of coworkers whom I’d been teaching a short Python course for technical writers, using Automate the Boring Stuff with Python. The email was meant to show them a real-life example of how a technical writer can use Python and shell scripting to automate something that is, well, boring. In this case, the task was to clean up a CSV file containing a list of git commits to the AppNexus REST APIs.

Because of the way we received this data, it had duplicate entries and lots of uninteresting merge commits that were unrelated to any feature (a feature is generally associated with a JIRA ticket). Our task was to review the commits and see if there was anything interesting that should be added to our monthly API release notes.

(The names of my coworkers have been changed, obv.)


To: Jane X. (‘REDACTED@appnexus.com’)

Subject: Filtered API git commits to review (bonus: mini Python & shell tutorial)

From: Rich Loveland (‘REDACTED@appnexus.com’)

CC: Victoria Y. (‘REDACTED@appnexus.com’)

Date: Wed, 18 Nov 2015 16:59:04 -0500

+Victoria for the code fun

Jane, the file of commit logs for you to review is attached (along with some others). But so what, that’s boring! Let’s talk about how it was made.

To make the really boring task of reviewing API git commits less awful, let’s do some programming for fun. First let’s write a short Python script to pull out only those commits that have a JIRA ticket ID in them (since we don’t care about the other ones), and call it ‘filter-commit-messages.py’:

  #!/usr/bin/env python

  import re
  import sys

  # JIRA ticket IDs look like "ABC-123": capital letters, a hyphen, digits.
  jira_pat = "[A-Z]+-[0-9]+"

  for line in sys.stdin:
      m = re.search(jira_pat, line)
      if m:
          # The line still has its trailing newline, so write it as-is
          # rather than letting print() tack on a second one.
          sys.stdout.write(line)

This tries to match a regular expression against each line of its input (in this case the compiled API git commit list), and prints the line if a match occurs.
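
You can convince yourself the pattern works in an interactive Python session (the ticket ID here is made up):

  >>> import re
  >>> m = re.search("[A-Z]+-[0-9]+", "Merge APIDOCS-1234: fix broken curl example")
  >>> m.group()
  'APIDOCS-1234'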

Let’s make it executable from our shell:

$ cd ~/bin
$ ln -s ~/work/code/filter-commit-messages.py filter-commit-messages
$ chmod +x ~/bin/filter-commit-messages
$ export PATH=$HOME/bin:$PATH

Then we can run it on the text file with the git commits like so:

$ filter-commit-messages < api-release-november-2015.csv

(The “<” in the shell means “Read your input from this place”.)

This prints out only the matching lines, but the output still contains the duplicate entries we complained about earlier. We can drop the duplicates and sort the lines in one step by passing sort’s ‘-u’ (“unique”) flag:

$ filter-commit-messages < api-release-november-2015.csv | sort -u

(The “|” in the shell means “Pass your output through to this other command”, and sort’s “-u” means “Throw away duplicate lines”.)

Now that we are extracting only the important lines, let’s throw them in a file:

$ filter-commit-messages < api-release-november-2015.csv | sort -u > api-release-november-2015-actual.csv

(The “>” near the end means “Write all of the output to this place”.)

We can see how much less reading we have to do now by running a word count program (‘wc’) on the before and after files:

$ wc -l api-release-november-2015.csv # old
     201 api-release-november-2015.csv
$ wc -l api-release-november-2015-actual.csv # new
     115 api-release-november-2015-actual.csv

(The “-l” means “count the lines”.)

Now, since Jane and I each have to review half of the commits, we can use the ‘split’ shell command to break the file in half. Since we know the file is 115 lines, we need to tell ‘split’ how many lines to put in each half with the ‘-l’ option (see ‘man split’ in your terminal):

$ split -l 58 api-release-november-2015-actual.csv COMMITS-TO-REVIEW

‘split’ uses the last argument, “COMMITS-TO-REVIEW”, as a filename prefix and creates two files from it, “COMMITS-TO-REVIEWaa” and “COMMITS-TO-REVIEWab”, which we can rename for each reviewer:

$ mv COMMITS-TO-REVIEWaa COMMITS-TO-REVIEW-RICH
$ mv COMMITS-TO-REVIEWab COMMITS-TO-REVIEW-JANE

A nice side effect of sorting the lines is that each reviewer gets an alphabetized slice of the commits, so related commits from the same engineer end up next to each other.

p.s. We didn’t actually need a Python program for the first part; we could have just used ‘grep’ and stayed with shell commands, as shown below. But hey!
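
For the record, the ‘grep’ version would look something like this (‘-E’ turns on extended regular expressions, so the same pattern works unchanged):

$ grep -E '[A-Z]+-[0-9]+' api-release-november-2015.csv | sort -u > api-release-november-2015-actual.csv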

p.p.s. With more work, this could all be put together into a single program if we were so inclined (a rough sketch follows), but since it doesn’t get used that often it’s probably OK to type a few commands.
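
Something like this, say, with the reviewer names hard-coded (a sketch, not battle-tested):

  #!/bin/sh

  # Usage: split-api-commits COMMITS-FILE
  # Keep only JIRA-tagged commits, sort and dedupe them, then split the
  # result into two roughly equal files, one per reviewer.
  grep -E '[A-Z]+-[0-9]+' "$1" | sort -u > commits-actual.csv
  half=$(( ($(wc -l < commits-actual.csv) + 1) / 2 ))
  split -l "$half" commits-actual.csv COMMITS-TO-REVIEW
  mv COMMITS-TO-REVIEWaa COMMITS-TO-REVIEW-RICH
  mv COMMITS-TO-REVIEWab COMMITS-TO-REVIEW-JANE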

(Image courtesy William Hartman under Creative Commons license.)

How to Rock CSV Files with SQLite

Hey, have you ever had to work with large quantities of CSV data (read: Excel) in a structured way? Has that experience ever made you want to smash something in a fit of rage? Have you ever wondered why you’re wasting your precious time on this earth reading half-arsed Excel tutorials from 2001 to run a simple query across some data set? If you’re anything like me, the answer is yes. Here’s how to banish that problem from your world forever (at least the CSV-related bits).

In this post, I will describe how to import CSV files into SQLite 3 and do some SQL-y things to them (also known as “rocking” them). There are many other tutorials on how to do this sort of thing; this one is mine. It assumes you have SQLite already installed on your computer.

I’m using the following version:

$ sqlite3 --version
3.7.12 2012-04-03 19:43:07 86b8481be7e76cccc92d14ce762d21bfb69504af

The Data Set

Here’s the CSV data set we’ll be working with – the columns are id, advertiser, line_item, campaign, impressions, and clicks. It’s in a file called tiny-db.csv.

1, A          , ACME Axe Co.                        , Midwest Lumberjack Competitors           ,        1792 ,     21
2, A          , ACME Axe Co.                        , Northwestern Logging Equipment Intenders ,        4355 ,     34 
3, A          , Bar-None Ice Cream Cones            , New York Summer Heatwave                 ,      78231 ,   1408 
4, B          , Candles R-US                        , Home Decor Enthusiasts                   ,        2843 ,     65 
5, B          , Doggie Accessories                  , Southwestern Evening Weekends            ,        9486 ,    123 
6, C          , Evergreen Synthetic Xmas Trees      , Sun Belt Homeowners                      ,        2238 ,     46 
7, D          , Frog Hollow Athletic Socks          , Back-to-School                           ,        8198 ,    214 
8, D          , Frog Hollow Athletic Socks          , Practical Holiday  Shoppers              ,         103 ,     12 
9, E          , Best Soap & Candle                  , Clean Hands Stay Healthy                ,        3883 ,     41 
10, E          , Best Soap & Candle                  , Clean Hands Make Good Neighbors          ,        1292 ,    183 
11, E          , Best Soap & Candle                  , Bath Salt Relaxation                     ,         902 ,     81 
12, E          , Best Soap & Candle                  , Bath Salt Relaxation (plus Candles!)     ,        5352 ,    212 
13, F          , Farmer Snowplows                    , Northeast Winter Storms                  ,      12448 ,    256
14, F          , Farmer Snowplows                    , Upper Midwest Winter Storms              ,      23984 ,    782 
15, F          , Farmer Snowplows                    , Rocky Mountain Winter Storms             ,       8764 ,    128 
16, G          , Gretchen's Organic Chocolates       , East Coast NPR Podcast Listeners         ,      48996 ,    973 
17, H          , Hap's Go-Kart Track and Petting Zoo , City Folks; Country Weekends             ,        1108 ,     87 
18, H          , Hap's Go-Kart Track and Petting Zoo , Go-Kart Enthusiast Forums (Weekend)      ,        1872 ,    116 

Don’t worry too much about what those terms mean; they mostly have to do with online advertising and aren’t that relevant, except to say that:

  • Impressions are those moments when a person views an ad on a website
  • Clicks are recorded when people click on advertisements
  • I totally made them up (this should be obvious from the names, but still…)

Importing the CSV File

Here’s how to get the data into SQLite: first, make a table in the database that matches the columns in your CSV file. Then set the comma as your separator and import the data. Note that you may run into issues with fields that have commas inside them, such as “Hello, World!”, since this version’s separator-based ‘.import’ doesn’t understand quoted fields; it’s best to clean them up somehow beforehand, as sketched below.
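
For example, here’s a minimal Python sketch (one approach among many) that rewrites the data with a pipe character as the delimiter, so embedded commas can’t confuse the import; you’d then tell SQLite ‘.separator "|"’ instead:

  #!/usr/bin/env python

  # Re-emit CSV from stdin with "|" as the delimiter, so that commas
  # inside quoted fields can't confuse SQLite's separator-based import.
  import csv
  import sys

  reader = csv.reader(sys.stdin)
  writer = csv.writer(sys.stdout, delimiter="|")
  for row in reader:
      # Strip the decorative padding around each field while we're at it.
      writer.writerow([field.strip() for field in row])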

rloveland:code rloveland$ sqlite3 
SQLite version 3.7.12 2012-04-03 19:43:07
Enter ".help" for instructions
Enter SQL statements terminated with a ";"

sqlite> create table tiny (id integer, advertiser text, line_item text, campaign text, impressions integer, clicks integer);

sqlite> .separator ","

sqlite> .import tiny-db.csv tiny
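
To sanity-check the import, count the rows; the answer should match the 18 lines in the CSV file:

sqlite> select count(*) from tiny;
18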

Querying the Data Set

Now we can run SQL statements on the data.

Which rows had more than 9000 impressions?

sqlite> select * from tiny where impressions > 9000;

3, A          , Bar-None Ice Cream Cones            , New York Summer Heatwave                 ,78231,1408
5, B          , Doggie Accessories                  , Southwestern Evening Weekends            ,9486,123
13, F          , Farmer Snowplows                    , Northeast Winter Storms                  ,12448,256
14, F          , Farmer Snowplows                    , Upper Midwest Winter Storms              ,23984,782
16, G          , Gretchen's Organic Chocolates       , East Coast NPR Podcast Listeners         ,48996,973
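
(If we actually wanted a count rather than the rows themselves, an aggregate does it; given the five rows above, this returns 5.)

sqlite> select count(*) from tiny where impressions > 9000;
5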

Which rows had 10000 impressions or more, ordered by campaign?

sqlite> select * from tiny where impressions >= 10000 order by campaign;

16, G          , Gretchen's Organic Chocolates       , East Coast NPR Podcast Listeners         ,48996,973
3, A          , Bar-None Ice Cream Cones            , New York Summer Heatwave                 ,78231,1408
13, F          , Farmer Snowplows                    , Northeast Winter Storms                  ,12448,256
14, F          , Farmer Snowplows                    , Upper Midwest Winter Storms              ,23984,782

Using Full Text Search

Finally, let’s set up full-text search. We have to use something called a “virtual” table, created here with the FTS4 module, that mirrors our existing table.

sqlite> create virtual table tiny_fts3 using fts4 (id, advertiser, line_item, campaign, impressions, clicks);

sqlite> insert into tiny_fts3(id, advertiser, line_item, campaign, impressions, clicks) select id, advertiser, line_item, campaign, impressions, clicks from tiny;
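
Note that the virtual table is a copy, so it won’t see any later changes to ‘tiny’. If the base table were going to keep changing, a trigger along these lines (a sketch; overkill for a one-off analysis like this) would keep the mirror in sync for new rows:

sqlite> create trigger tiny_after_insert after insert on tiny
   ...> begin
   ...>   insert into tiny_fts3 (id, advertiser, line_item, campaign, impressions, clicks)
   ...>   values (new.id, new.advertiser, new.line_item, new.campaign, new.impressions, new.clicks);
   ...> end;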

Let’s run a few text searches using the MATCH keyword to make sure it works (note that FTS matching is case-insensitive, so searching for ‘Acme’ will find ‘ACME’):

sqlite> select * from tiny_fts3 where line_item match('Acme');

1,  A          , ACME Axe Co.                        , Midwest Lumberjack Competitors           ,1792,21
2, A          , ACME Axe Co.                        , Northwestern Logging Equipment Intenders ,4355,34

sqlite> select * from tiny_fts3 where line_item match('Candle');

9, E          , Best Soap & Candle                  , Clean Hands Stay Healthy                ,3883,41
10, E          , Best Soap & Candle                  , Clean Hands Make Good Neighbors          ,1292,183
11, E          , Best Soap & Candle                  , Bath Salt Relaxation                     ,902,81
12, E          , Best Soap & Candle                  , Bath Salt Relaxation (plus Candles!)     ,5352,212

Conclusion

That’s pretty much the whole show. Now go forth and become the master of all CSVs you survey!