Monthly Archives: February 2014

How to Rock CSV files with SQLite

Hey, have you ever had to work with large quantities of CSV data (read: Excel) in a structured way? Has that experience ever made you want to smash something in a fit of rage? Have you ever wondered why you’re wasting your precious time on this earth reading half-arsed Excel tutorials from 2001 to run a simple query across some data set? If you’re anything like me, the answer is yes. Here’s how to banish that problem from your world forever (at least the CSV-related bits).

In this post, I will describe how to import CSV files into SQlite 3 and do some SQL-y things to them (also known as “rocking” them). There are many other tutorials on how to do this sort of thing; this one is mine. It assumes you have SQLite already installed on your computer.

I’m using the following version:

$ sqlite3 --version
3.7.12 2012-04-03 19:43:07 86b8481be7e76cccc92d14ce762d21bfb69504af

The Data Set

Here’s the CSV data set we’ll be working with – the columns are id, advertiser, line_item, campaign, impressions, and clicks. It’s in a file called tiny-db.csv.

1,  A          , ACME Axe Co.                        , Midwest Lumberjack Competitors           ,        1792 ,     21 
2, A          , ACME Axe Co.                        , Northwestern Logging Equipment Intenders ,        4355 ,     34 
3, A          , Bar-None Ice Cream Cones            , New York Summer Heatwave                 ,      78231 ,   1408 
4, B          , Candles R-US                        , Home Decor Enthusiasts                   ,        2843 ,     65 
5, B          , Doggie Accessories                  , Southwestern Evening Weekends            ,        9486 ,    123 
6, C          , Evergreen Synthetic Xmas Trees      , Sun Belt Homeowners                      ,        2238 ,     46 
7, D          , Frog Hollow Athletic Socks          , Back-to-School                           ,        8198 ,    214 
8, D          , Frog Hollow Athletic Socks          , Practical Holiday  Shoppers              ,         103 ,     12 
9, E          , Best Soap & Candle                  , Clean Hands Stay Healthy                ,        3883 ,     41 
10, E          , Best Soap & Candle                  , Clean Hands Make Good Neighbors          ,        1292 ,    183 
11, E          , Best Soap & Candle                  , Bath Salt Relaxation                     ,         902 ,     81 
12, E          , Best Soap & Candle                  , Bath Salt Relaxation (plus Candles!)     ,        5352 ,    212 
13 , F          , Farmer Snowplows                    , Northeast Winter Storms                  ,      12448 ,    256 
14, F          , Farmer Snowplows                    , Upper Midwest Winter Storms              ,      23984 ,    782 
15, F          , Farmer Snowplows                    , Rocky Mountain Winter Storms             ,       8764 ,    128 
16, G          , Gretchen's Organic Chocolates       , East Coast NPR Podcast Listeners         ,      48996 ,    973 
17, H          , Hap's Go-Kart Track and Petting Zoo , City Folks; Country Weekends             ,        1108 ,     87 
18, H          , Hap's Go-Kart Track and Petting Zoo , Go-Kart Enthusiast Forums (Weekend)      ,        1872 ,    116 

Don’t worry too much about what those terms mean, they mostly have to do with online advertising stuff and aren’t that relevant, except to say that:

  • Impressions are those moments when a person views an ad on a website
  • Clicks are recorded when people click on advertisements
  • I totally made them up (this should be obvious from the names, but still…)

Importing the CSV File

Here’s how to get the data into sqlite: first you have to make a table in the database that matches the columns in your CSV file. Then, set the comma as your separator, and import the data. Note that you may run into issues with fields that have commas inside them such as “Hello, World!”; it’s best to clean them up somehow beforehand.

rloveland:code rloveland$ sqlite3 
SQLite version 3.7.12 2012-04-03 19:43:07
Enter ".help" for instructions
Enter SQL statements terminated with a ";"

sqlite> create table tiny (id integer, advertiser text, line_item text, campaign text, impressions integer, clicks integer);

sqlite> .separator ","

sqlite> .import tiny-db.csv tiny

Querying the Data Set

Now we can run SQL statements on the data.

How many rows had more than 9000 impressions?

sqlite> select * from tiny where impressions > 9000;

3, A          , Bar-None Ice Cream Cones            , New York Summer Heatwave                 ,78231,1408
5, B          , Doggie Accessories                  , Southwestern Evening Weekends            ,9486,123
13, F          , Farmer Snowplows                    , Northeast Winter Storms                  ,12448,256
14, F          , Farmer Snowplows                    , Upper Midwest Winter Storms              ,23984,782
16, G          , Gretchen's Organic Chocolates       , East Coast NPR Podcast Listeners         ,48996,973

How many rows had 10000 impressions or more? Can we order them by campaign?

sqlite> select * from tiny where impressions >= 10000 group by campaign;

16, G          , Gretchen's Organic Chocolates       , East Coast NPR Podcast Listeners         ,48996,973
3, A          , Bar-None Ice Cream Cones            , New York Summer Heatwave                 ,78231,1408
13, F          , Farmer Snowplows                    , Northeast Winter Storms                  ,12448,256
14, F          , Farmer Snowplows                    , Upper Midwest Winter Storms              ,23984,782

Using Full Text Search

Finally, let’s set up full text searching. We have to use something called a `virtual’ table that mirrors our existing table.

sqlite> create virtual table tiny_fts3 using fts4 (id, advertiser, line_item, campaign, impressions, clicks);

sqlite> insert into tiny_fts3(id, advertiser, line_item, campaign, impressions, clicks) select id, advertiser, line_item, campaign, impressions, clicks from tiny;

Let’s run a few text searches using the MATCH keyword to make sure it works:

sqlite> select * from tiny_fts3 where line_item match('Acme');

1,  A          , ACME Axe Co.                        , Midwest Lumberjack Competitors           ,1792,21
2, A          , ACME Axe Co.                        , Northwestern Logging Equipment Intenders ,4355,34

sqlite> select * from tiny_fts3 where line_item match('Candle');

9, E          , Best Soap & Candle                  , Clean Hands Stay Healthy                ,3883,41
10, E          , Best Soap & Candle                  , Clean Hands Make Good Neighbors          ,1292,183
11, E          , Best Soap & Candle                  , Bath Salt Relaxation                     ,902,81
12, E          , Best Soap & Candle                  , Bath Salt Relaxation (plus Candles!)     ,5352,212

Conclusion

That’s pretty much the whole show. Now go forth and become the master of all CSVs you survey!

Simple Network Search in Scheme

../img/winston-horn-network.png

I’ve been having fun translating some of the code in Winston and Horn’s Lisp into Scheme. This book is amazing – clearly written, with lots of motivating examples and applications. As SICP is to language implementation, Lisp is to application development, with chapters covering constraint propagation, forward and backward chaining, simulation, object-oriented programming, and so on. And it does include the obligatory Lisp interpreter in one chapter, if you’re into that sort of thing.

In this installment, based on Chapter 19, we will look at some simple strategies for searching for a path between two nodes on a network (a graph). The network we’ll be using is shown in the diagram above.

Here’s the same network, represented as an alist where each CAR:CDR pair represents a NODE:NEIGHBORS relationship:

'((f e)
  (e b d f)
  (d s a e)
  (c b)
  (b a c e)
  (a s b d)
  (s a d))

The high-level strategy the authors use is to traverse the network, building up a list of partial paths. If a partial path ever reaches the point where it describes a full path between the two network nodes we’re after, we’ve been successful.

As with trees, we can do either a breadth-first or depth-first traversal. Here’s what the intermediate partial paths will look like for a breadth-first traversal that builds a path between nodes S and F:

(s)
(s a)
(s d)
(s a b)
(s a d)
(s d a)
(s d e)
(s a b c)
(s a b e)
(s a d e)
(s d a b)
(s d e b)
'(s d e f)

Based on that output, we can deduce that every time we visit a node, we want to extend our partial paths list with that node. Here’s one option – its only problem is that it will happily build circular paths that keep us from ever finding the node we want:

(define (%buggy-extend path)             ;builds circular paths
  (map (lambda (new-node) 
         (cons new-node path))
       (%get-neighbor (first path))))

(Incidentally, I’ve become fond of the convention whereby internal procedures that aren’t part of a public-facing API are prefixed with the `%’ character. This can be found in some parts of the MIT Scheme sources, and I believe it’s used in Racket as well. I’ve started writing lots of my procedures using this notation to remind me that the code I’m writing is not the real `API’, that the design will need more work, and that the current code is just a first draft. I’m using that convention here.)

Here’s a better version that checks if we’ve already visited the node before adding it to the partial paths list – as a debugging aid it prints out the current path before extending it:

(define (%extend path)
  (display (reverse path))
  (newline)
  (map (lambda (new-node)
         (cons new-node path))
       (filter (lambda (neighbor)
                 (not (member neighbor path)))
               (%get-neighbor (first path)))))

You may have noticed the %GET-NEIGHBOR procedure; it’s just part of some silly data structure bookkeeping code. Please feel free to deride me in the comments for my use of a global variable. What can I say? I’m Scheming like it’s 1988 over here! Here’s the boilerplate:

(define *neighbors* '())

(define (%add-neighbor! k v)
  (let ((new-neighbor (cons k v)))
    (set! *neighbors*
          (cons new-neighbor *neighbors*))))

(define (%get-neighbor k)
  (let ((val (assoc k *neighbors*)))
    (if val
        (cdr val)
        '())))

(%add-neighbor! 's '(a d))
(%add-neighbor! 'a '(s b d))
(%add-neighbor! 'b '(a c e))
(%add-neighbor! 'c '(b))
(%add-neighbor! 'd '(s a e))
(%add-neighbor! 'e '(b d f))
(%add-neighbor! 'f '(e))

Now that we have our data structure and a way to extend our partial path list (non-circularly), we can write the main search procedure, %BREADTH-FIRST. The authors have a lovely way of explaining its operation:

BREADTH-FIRST is said to do a “breadth-first” search because it extends all partial paths out to uniform length before extending any to a greater length.

Here’s the code, translated to use a more Schemely, iterative named LET instead of the linear-recursive definition from the book 1:

(define (%breadth-first start finish network)
  (let ((queue (list (list start))))
    (let loop ((start start)
               (finish finish)
               (network network)
               (queue queue))
      (cond ((null? queue) '())                    ;Queue empty?
            ((equal? finish (first (first queue))) ;Finish found?
             (reverse (first queue)))              ;Return path.
            (else
             (loop start
                   finish               ;Try again.
                   network
                   (append
                    (rest queue)
                    (extend (first queue))))))))) ;New paths in front.

(A better way to write this procedure would be to implement a generic internal search procedure that takes its `breadthiness’ or `depthiness’ as a parameter. We could then wrap it with nicer public-facing search procedures specific names.)

Meanwhile, back at the REPL… we remind ourselves of what *neighbors* actually looks like, and then we search for a path between the nodes S and F.

> *neighbors*
'((f e) (e b d f) (d s a e) (c b) (b a c e) (a s b d) (s a d))
> (%breadth-first 's 'f *neighbors*)
(s)
(s a)
(s d)
(s a b)
(s a d)
(s d a)
(s d e)
(s a b c)
(s a b e)
(s a d e)
(s d a b)
(s d e b)
'(s d e f)

What fun! I can almost imagine using a three-dimensional variant of these searches for a space wargame with wormhole travel! Except, you know, they’d need to be much faster and more skillfully implemented. There’s also the tiny requirement to write the surrounding game…

Footnotes:

1 It shouldn’t need to be said, but: Of course the authors knew better; they were trying to hide that unnecessary complexity from the reader until later.