The Josephus Problem

../img/origami-wreath.jpg

The Josephus Problem is deceptively simple. A number of people stand in a circle. (Let’s call this number N.) They agree to count around the circle, and every Mth person will be killed. As they go around the circle, in what order do they die? Another way of looking at it is: who is the last person left alive?

For an interesting introduction, I recommend checking out the Wikipedia article.

I recently found this problem described in the 2nd edition of Sedgewick’s Algorithms, and thought it would be fun to implement in Scheme. Because I’m not a real programmer (and I had to translate to Scheme from the Pascal in the book), it took me a couple of hours instead of a couple of minutes. But what an enjoyable couple of hours!

I started, as usual for algorithmic programming tasks, with a simulation. In the book, Sedgewick provides an example: If there are 9 people, and every 5th person in the circle is killed, the order in which they’ll die is: 5, 1, 7, 4, 3, 6, 9, 2, 8.

It’s usually easier to start work on a problem by thinking about a simpler case, so that’s what I did. I chose N = 5, and M = 2. Below I’ll show the results of my “paper” simulation (OK, the paper was actually an Emacs buffer).

First, let me show you how to read my “notation”. It should be pretty easy to figure out. Below is an example of one “state snapshot”. I’ve labelled each line with its meaning. If you are a Scheme programmer, you will surmise that each named item corresponds to a “named-let” loop variable. In addition, the vertical bar character in the XS row represents the position of the "cursor" as it goes around the circle of 5 people.

;; I.                           <-- State Number
;; ------------
;; |1 2 3 4 5                   <--    XS
;; '()                          <--    YS
;; 0                            <-- COUNT

Above, we are in state I, at the start of the program. The cursor is at the beginning of the list XS, which represents the circle of people. YS is our results list, where we will store the "dead". As we travel around the circle and people are killed, they are added to YS. We could think of it as a "death toll", but really it's just a list of integers. As we go around the circle of people, we keep incrementing COUNT. When COUNT reaches the value of M, that person will be "killed", that is, added to YS.

When this happens, we reset the COUNT to 0 and keep going around the circle. There's a catch, however. As more dead people are added to our result list YS, we need to ignore the spaces those people used to occupy in our count as we build COUNT back up towards M. In other words, the circle will have 4 people in it after the first person is killed. This means that as the circle gets smaller and smaller, people are killed more frequently.

I'll show you how I've solved that problem in Scheme in a moment; first, the simulation:

;; I.
;; ------------
;; |1 2 3 4 5
;; '()
;; 0

;; II.
;; ------------
;; 1| 2 3 4 5
;; '()
;; 1

;; III.
;; ------------
;; 1 2| 3 4 5
;; '()
;; 2

;; IV.
;; ------------
;; 1 _ 3| 4 5
;; '(2)
;; 1

;; V.
;; ------------
;; 1 _ 3 4| 5
;; '(2)
;; 2

;; VI.
;; ------------
;; 1 _ 3 _ 5|
;; '(4 2)
;; 1

;; VII.
;; ------------
;; 1| _ 3 _ 5
;; '(4 2)
;; 2

;; VIII.
;; ------------
;; _ _ 3| _ 5
;; '(1 4 2)
;; 1

;; IX.
;; ------------
;; _ _ 3 _ 5|
;; '(1 4 2)
;; 2

;; X.
;; ------------
;; _ _ 3| _ _
;; '(5 1 4 2)
;; 1

;; XI.
;; ------------
;; _ _ _| _ _
;; '(5 1 4 2)
;; 2

;; XII.
;; ------------
;; _ _ _| _ _
;; '(3 5 1 4 2)
;; 2

Now that you've seen how the algorithm should work on paper, let's write some Scheme! I should state for the record that I did not read the Wikipedia article linked above until after I wrote this code. (This will be pretty obvious when you look at the code.) Here it is:

(define (josephus xs m)
  (let ((orig-xs (list-copy xs))
        (true-m (- m 1)))
    (let ((len (length xs)))
      (let loop ((xs xs)
                 (ys '())
                 (count 0))
        (cond 
         ;; If the dead list grows to the same length as our original
         ;; list, we know we're finished.
         ((= len (length ys))
          (reverse ys))
         ;; If XS is null, we have gone once around the circle.  We
         ;; start around again, leaving the count unchanged.
         ((null? xs)
          (loop orig-xs ys count))
         ;; If the current person we're looking at is already dead
         ;; (a ghost), move on.  And don't bother bumping the
         ;; count.
         ((member (car xs) ys)
          (loop (cdr xs) ys count))
         ;; If we're here, it's also the case that the current person
         ;; we're looking at is *not* in the dead list.  Therefore we
         ;; check if the count is equal to M.  If so, they must die.
         ((= count true-m)
          (loop (cdr xs) (cons (car xs) ys) 0))
         ;; If we get here, the current person we're looking at is
         ;; alive.  We check if the count is *not* equal to M.  If it
         ;; is not, we skip this person and bump the count.
         ((not (= count true-m))
          (loop (cdr xs) ys (+ count 1)))
         ;; We should never get here.
         (else #f))))))

How does it compare to the output described in Sedgewick? It seems to work!

> (josephus '(1 2 3 4 5 6 7 8 9) 5)
'(5 1 7 4 3 6 9 2 8)

There are several inefficiencies we could tackle, though. The first and most obvious is that we should try to calculate the solution mathematically (as shown in some of the Wikipedia examples) instead of building up lists using CONS. Another is that we make a "canonical" copy of the input list for restarting the loop. A third is the use of MEMBER to determine whether the person we're currently visiting is already in the "dead" list.
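If all you want is the survivor, rather than the full death order, there's a well-known recurrence that needs no lists at all: numbering positions from zero, the survivor of a circle of one is position 0, and the survivor of a circle of k is ((survivor of k - 1) + m) mod k. Here's a sketch in Scheme (JOSEPHUS-SURVIVOR is my name for it, not Sedgewick's):

```scheme
;; Survivor position via the recurrence j(1) = 0,
;; j(k) = (j(k-1) + m) mod k.  O(n) time, constant space:
;; no lists, no MEMBER.
(define (josephus-survivor n m)
  (let loop ((k 2) (pos 0))
    (if (> k n)
        (+ pos 1)                       ; convert to 1-based numbering
        (loop (+ k 1) (modulo (+ pos m) k)))))
```

For N = 9 and M = 5 this returns 8, which agrees with the last element of the death order above.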

For example, here's the trace output showing 19 calls to MEMBER for a pretty small input:

(josephus '(1 2 3 4 5) 2)
[Enter (member 1 '())
 Leave member #f]
... 17 more calls to member ...
[Enter (member 3 '(5 1 4 2))
 Leave member #f]
'(2 4 1 5 3)

That said, in this case I was happy just to get something working. If this were going to be used for anything "real", I would probably rewrite it statefully using vectors. Most of the fun was in coming up with this first draft without resorting to any stateful idioms.
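For what it's worth, here's a rough sketch of what that stateful rewrite might look like (JOSEPHUS/VECTOR is a made-up name; none of this is from Sedgewick):

```scheme
;; Track the dead with a vector of "alive" flags instead of
;; searching a list with MEMBER.
(define (josephus/vector xs m)
  (let* ((n (length xs))
         (people (list->vector xs))
         (alive? (make-vector n #t)))
    (let loop ((i 0) (count 0) (killed 0) (ys '()))
      (cond ((= killed n) (reverse ys))
            ;; Ghosts still get skipped, but the check is now a
            ;; constant-time vector lookup.
            ((not (vector-ref alive? i))
             (loop (modulo (+ i 1) n) count killed ys))
            ((= count (- m 1))
             (vector-set! alive? i #f)
             (loop (modulo (+ i 1) n) 0 (+ killed 1)
                   (cons (vector-ref people i) ys)))
            (else
             (loop (modulo (+ i 1) n) (+ count 1) killed ys))))))
```

This also drops the repeated calls to LENGTH on the growing dead list, since we can just count the kills as we go.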

(Image courtesy Chris Lott via CC-BY license.)

More Recommendations for Technical Writers

../img/green-fractal-tree.jpg

In the same spirit as the last post in this series, I have more recommendations for technical writers working in software-land. As before, I make no claims to living up to these recommendations myself. It’s like, aspirational, man.

Without further ado:

Learn Markdown

Yes, Markdown. It’s become a de facto standard plain text format for all kinds of web writing and documentation. It’s easy to use, it’s everywhere, and there are many tools that you can use to work with it.

See also: What is Markdown?

If you want to know what could happen to you if you start using it for things, see Ryan Tomayko’s Why You Should Not Use Markdown.

Learn your web browser’s development tools

From time to time you will need to peek under the covers of a web app you are documenting and see what’s actually going on. Especially nowadays with single page apps, browser-side local storage, and all that fun stuff.

Even if you never need to know that stuff to write your UI docs, you can use the dev tools to rewrite text in user interfaces for making nicer screenshots.

Strunk & White! Yes, Really!

Read it. Then try like hell to live it. I’ll confine the rest of this document to “technical” issues, since I can’t improve on the advice given in that book.

(Again, this is an area where I need to work harder on following my own advice. My long sentences and passive constructions are killing me.)

Think about learning to read code

Everything in our world runs on code (or will soon). Learning how to read code at least up to a basic level can help you figure out what the hell’s going on, sometimes. Yes, you can talk to an engineer, but your conversation will be more productive if you at least have a starting point.

The reason is simple: It’s usually easier to go to someone with a wrong idea and have them correct you with the right ideas than it is to get that same someone to teach you everything from scratch.

This doesn’t mean you have to become a programmer yourself (although you should try to learn how to write scripts to automate boring computer-y stuff, as I noted previously). But like most foreign languages, it helps to be able to read a little to get directions from the locals. And in this case the locals (programmers) speak code.

Seek out pathological edge cases

In any really large system or “platform” designed and built by multiple people on different teams, not all of whom are talking to each other three times a day about every single design decision they make, there will be a number of edge cases. In other words, there will be parts of the system that don’t play well together, or are, at best, um, unintuitive in the way they behave.

It’s your job to find these and document them as well as you can. Sometimes this will be low-priority because the issues will be fixed “soon” (for some value of soon). Sometimes it will be necessary, though. You will have to use your own best judgment, along with input from your friends in Engineering and the Product organization, about when to do this.

You probably won’t get them all (not even close), but if you actively seek them out as you go, you will develop a mindset that will help you learn the system a little better.

Also: no one will assume this is part of your job or give you actual time to work on this; you just have to do it.

Don’t listen to anybody

Perhaps that should be rephrased as: “Don’t listen to anybody … just yet.” Until you’ve done your own research and testing and had your own look at the thing you’re documenting (whatever that thing is), you can’t really write about it for someone else. When Product folks, Engineering, etc. tell you the sky is blue, you should still stick your head out the nearest window.

It’s not that they don’t know what they’re talking about (they usually know more than you), it’s that the features they’re telling you about may exist soon at your layer of the onion, but right now they’re sitting on a git branch in somebody’s laptop, or as a line in a technical spec on the internal wiki, or in a weird corner of the internal-only sandbox environment that you will need to hunt down in order to actually use the thing.

If you are documenting a web API, only the things the APIs actually do are real. Everything else is bullshit. That’s part of the meaning of “API”.

Nobody will explicitly budget time for this, either. But you have to do it.

Do your own research, but prepare to be Wrong

It’s always faster to bug someone with a quick question, and sometimes that’s necessary. That said, I’ve been embarrassed by asking quick questions that turned out to have stupidly easy answers that I could have found out for myself.

It’s better to be known as someone who tries to solve their own problems before reaching out for help. Doing some light functional testing as you document, trying out a few database queries, or even reading through code (if that even makes sense and if you are able), will teach you things about the system that could prove useful to you in the future. It will at least give you something to talk about when you do sit down with someone else.

Having said all that, you will still be Wrong a lot (yes, that’s capitalized and bold). It’s one thing to use a thing, and another to build it. The engineers will always know a lot more than you about the systems. If you can get them to point out your mistakes, you’re doing your job! You’re learning things! So don’t get discouraged.

But didn’t I just tell you not to listen to anybody? Yes, but it’s never that simple – these are complex systems.

You are probably an unpaid part-time QA person, embrace it

In the course of writing documentation, you will naturally test things out to ensure that what you’re writing isn’t totally useless (even so, that happens sometimes). You’ll probably spend a lot of time doing this, and taking notes on what you find. These notes will then be integrated into the documentation you write, or into bug reports of some kind to your engineering colleagues (which will probably be rejected because you are Wrong – see above).

Again: no one will assume this is part of your job or give you actual time to work on it; you just have to do it.

Never Forget that The Train Never Stops

Finally – and most importantly – don’t let the fact that everything is changing all of the time get you down. The user interface will change drastically; new APIs will be added, and old ones deprecated and removed; people you’ve gotten to know and like will come and go. If you stay at a job like this for a couple of years, you will be rewriting your Year Two revision of the rewrite you did when you started.

Through it all, “the platform” can never stop running and changing: upgrades, changes, and so on all have to happen while the train is running down the tracks at eighty miles per hour. It doesn’t stop for the engineering teams working on it, and it certainly isn’t going to stop for you.

Your best weapon against change is automating as much drudge work as possible (again with the scripting!) so that you can focus on:

  1. Learning your company’s systems as well as you can technically given your limited time and resources
  2. Knowing what’s getting built next, and why
  3. Knowing what the business cares about (this is usually what drives #2)

Finally, never forget that you are doing important work. Someday, when the people who designed and built the current system have moved on, someone will wonder “Why does API service ‘X’ behave this way when I set the twiddle_frobs field on API ‘Y’ to true, but only when API ‘Z’'s ignore_frob_twiddle_settings field is not null?”

If you've done a good job (and gotten really lucky), there will be some documentation on that.

For a great perspective on why documentation is so important to the health of a company from a smart man who's been around the block once or twice, see Tim Daly's talk Literate Programming in the Large.

(Image courtesy typedow under CC-BY-NC-SA.)

Applying Lean Principles to the Documentation Lifecycle

pipes

Earlier, I promised to post my notes from talks I attended at the 2014 STC Summit. This talk, by Alan Houser, was probably the most impactful of the Summit for me. The tl;dr version is simply this: Find out what your customers value, and spend your time doing that.

Below is a lightly edited version of the notes I took during the session. The content of the talk is copyright Mr. Houser, and any errors are mine.

Big Ideas

  • Build/measure/learn
  • get out of the building
  • minimum viable product
  • pivot

How much of what we do truly provides value to the customer?

What we care about

  • deliverables
  • schedules
  • tools
  • org structure
  • office politics
  • legacy file formats

What customers care about

  • can i find it?
  • does it help me?

The Pivot

Can we, based on data, adjust what we do?

“We’ve always done it this way”.

How Companies Pivot

  • budget cuts
  • re-org
  • reduction in force

What Works?

Do That.

What Doesn’t?

Don’t Do That.

What do you measure?

  • pages?
  • topics?
  • words/topic?
  • word count of doc set
  • average word count of headings?
  • readability score?
  • hours/topic?
  • percentage of reuse?
  • revisions/time
  • customer views/topic
  • number of unique words

Do You Get Out of the Building?

What is Waste?

  • things that don’t provide customer value
  • waste time, money, resources, focus
  • (some orgs try to do too much)
  • let’s document this corner case
  • let’s adjust this formatting
  • let’s deliver a CHM file

Let It Go!

Are you continually asking: How does this provide value?

Do you pivot when your process is not aligned with customer value?

Rocky Balboa did two things in the story:

1. Transformed himself

2. Massively Exceeded Expectations

How to exceed expectations?

1. learn something new

2. try something different

3. talk to customers

4. measure something you haven’t before

(Image courtesy dirtyf under Creative Commons License)

Thoughts on the 2014 STC Summit

This is a collection of random thoughts based on my attendance at the 2014 STC Summit earlier this week. I will try to post my more detailed notes from the various individual talks over the next days and weeks.

Lots of proprietary tools, not so much open source

There are lots of proprietary document creation and management tools, and their vendors seem to be well-represented here. Coming from a hybrid tech-writing/programming background, I have to admit that some of the proprietary solutions looked sort of strange to me. Many appeared to be Windows-only to boot.

It seems like there is a lot of opportunity for open-source software to make inroads here. I wonder what it would take to bring the XML-editing capabilities of open source editors like Emacs and Vim up to date (if they aren’t already) to match the capabilities of proprietary tools like FrameMaker, Oxygen, MadCap Flare, and the like.

There are a few reasons why open source and tech comm could be a match made in heaven, especially when you consider that whatever improvements in process or tooling you create in open source environments are yours to keep, free of charge, forever. This is definitely not the case in proprietary environments: when you develop your own automation and tooling against proprietary tools, and the vendor breaks stuff, you’re often out of luck.

DITA and XML

It turns out that XML and the transformation tools that work on it, such as XSLT and friends, are pretty powerful. I have been aware of the existence of these technologies but haven’t used them much thus far in my career.

I feel like I understood the appeal of the DITA XML spec/style better after attending a great talk given by Caitlin Cronkhite and Ted Kuster from Salesforce. As I understood them to say, DITA is just a way of structuring your XML into topics that other tools can then use to create your documentation set with minimal repetition on the part of the writer. (I will put up my notes from that talk in another post.)

However, I admit I still don’t fully understand the reason for layering the proprietary environments over the top of the structure provided by DITA. I would probably prefer to author directly in XML using Emacs and nXML-mode. Alternatively, I’d use a markup language, such as a Markdown variant, that could be translated to XML with a script much like my own confluence2html, and build the various document sets I needed using Makefiles.

Key Takeaway: Do More Professional Development

The number one lesson from this trip was that I have a lot to learn (this should be evident from the preceding paragraphs). There are so many tools and techniques out there that I am not aware of. I’ve only been at this tech writing gig for a couple of years now, after all.

I look forward to engaging more tech writers working in other industries to learn about how they do what they do. My hope is that this will allow me to develop my own skills by stealing some of their best ideas while sharing some of my own crazy notions as well.

Scsh Manual Available in Texinfo Format

../img/shells.jpg

I am pleased to announce that, many years later, the famous Scsh Reference Manual has been converted to Texinfo. You can get your very own copy here: https://github.com/rmloveland/scsh-manual-texinfo.

This conversion represents quite a bit of work. Although I would have liked to do the conversion with a program, there was just too much of an impedance mismatch between the original LaTeX sources and Texinfo. Something tells me that’s why this conversion hasn’t happened until now. Whatever the reason, I’m glad to finally collect some internet cool points, even if it is almost twenty years later.

In the course of the conversion, I uncovered a number of small spelling, grammatical, and usage issues, which have now been fixed. To be clear: this is no criticism of the original authors. Several of them appear to be non-native English speakers. Their English is miles better than my German!

The resulting manual builds very nicely into HTML, PDF, or Info files. The PDF in particular is nice, since you get beautiful, clickable links everywhere! And the Info files are quite easy to install in your local Emacs.

Speaking of Emacs, if you like hacking Scsh from Emacs, try my geiser-scsh setup. Better yet, help me hack on it!

(Image courtesy steeljam under CC-BY-NC-ND license.)

SICP Exercises 1.5 through 1.7

../img/origami-pockets.jpg

In this installment, we’ll look at exercises 1.5 through 1.7. Note that these exercises are taken from the first edition.

Exercise 1.5

The GOOD-ENOUGH? test used in computing square roots will not be very effective for finding the square roots of very small numbers. Also, in real computers, arithmetic operations are almost always performed with limited precision. This makes our test inadequate for very large numbers. Explain these statements, with tests showing how the test fails for small and large numbers. An alternative strategy for implementing GOOD-ENOUGH? is to watch how GUESS changes from one iteration to the next and to stop when the change is a very small fraction of the guess. Design a square-root procedure that uses this kind of test. Does this work better for small and large numbers?

First, here are the original definitions of GOOD-ENOUGH? and friends from the text:

(define (square x)
  (* x x))

(define (sqrt-iter guess x)
  (if (good-enough? guess x)
      guess
      (sqrt-iter (improve guess x)
                 x)))

(define (improve guess x)
  (average guess (/ x guess)))

(define (average x y)
  (/ (+ x y) 2))

(define (good-enough? guess x)
  (< (abs (- (square guess) x)) 0.001))

(define (new-sqrt x)
  (sqrt-iter 1.0 x))

Let's test the operation of NEW-SQRT. In the results below, NEW-SQRT is off by a factor of ten when taking the root of .00001 – a small number.

(sqrt .00001)
0.00316228 ;About three thousandths.
(new-sqrt .00001)
0.0313565 ;About three hundredths. Wat?

Why are we so far off? Let's trace GOOD-ENOUGH?, since it's used to decide when our result is, er, good enough.

> ,trace good-enough?
> (new-sqrt .00001)
[Enter (good-enough? 1. 1.e-05)
 Leave good-enough? #f]
[Enter (good-enough? 0.500005 1.e-05)
 Leave good-enough? #f]
[Enter (good-enough? 0.250012 1.e-05)
 Leave good-enough? #f]
[Enter (good-enough? 0.125026 1.e-05)
 Leave good-enough? #f]
[Enter (good-enough? 0.0625531 1.e-05)
 Leave good-enough? #f]
[Enter (good-enough? 0.0313565 1.e-05)
 Leave good-enough? #t]
0.0313565

In the last call to GOOD-ENOUGH?, we're still off by a factor of ten! This is because inside GOOD-ENOUGH?, we're checking our results against an absolute value (also known as a "magic number"), which is dumb. As suggested by the text, let's implement a better GOOD-ENOUGH? that watches how GUESS changes from one iteration to the next and returns true when the value of GUESS stops changing by much. This means we'll have to update SQRT-ITER, too.

(define (good-enough? last-guess this-guess x)
  ;; Watch how GUESS changes from one iteration to the next.
  (< (abs (- this-guess last-guess)) .0001))

(define (sqrt-iter last-guess this-guess x)
  (if (good-enough? last-guess this-guess x)
      this-guess
      (sqrt-iter this-guess (improve this-guess x) x)))

(define (new-sqrt x)
  (sqrt-iter 0 1.0 x))

Now we'll test our new GOOD-ENOUGH?:

> (sqrt .0001)
0.01
> (new-sqrt .0001)
0.01
> (sqrt .1)
0.316228
> (new-sqrt .1)
0.316228
> (sqrt .0000001)
0.000316228
> (new-sqrt .0000001)
0.000319804 ;Pretty darn close!

As we can see from the tests, this version of GOOD-ENOUGH? starts to lose accuracy when we get down to ten-millionths. I'm happy with that accuracy for this application. :-)
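The exercise also asks about very large numbers, and the relative test helps there too. A quick sketch of the problem (the numbers below come from the usual double-precision floating-point analysis, not from a trace):

```scheme
;; With the original absolute test, a call like this can loop forever:
;; near GUESS = 3.16e6, adjacent floating-point values of
;; (SQUARE GUESS) are about 0.002 apart -- wider than the 0.001
;; tolerance -- so GOOD-ENOUGH? may never return #t.
;;
;; (new-sqrt 1e13)   ; original version: may not terminate
;;
;; The relative test converges fine; the true value is about 3162277.66.
(new-sqrt 1e13)
```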

Exercise 1.6

Newton's method for cube roots is based on the fact that if y is an approximation for the cube root of x, then a better approximation is given by the value

(x/y**2 + 2y) / 3

Use this formula to implement a cube-root procedure analogous to the square root procedure.

After doing the last problem, this is pretty much a "color by numbers" implementation:

(define (cube-iter last-guess this-guess x)
  (if (good-enough? last-guess this-guess x)
      this-guess
      (cube-iter this-guess (cube-improve this-guess x) x)))

(define (cube-improve guess x)
  (/ (+ (/ x (square guess)) (* 2 guess)) 3))

(define (cube x)
  (cube-iter 0 1.0 x))

Testing it is also straightforward:

  • Take the cube root Y of a number, X
  • Use EXPT to cube Y
  • See if the number you get back is X

The tests look good. Note that GOOD-ENOUGH? is still being traced (just for fun).

> (cube 16)
[Enter (good-enough? 0 1. 16)
 Leave good-enough? #f]
[Enter (good-enough? 1. 6. 16)
 Leave good-enough? #f]
[Enter (good-enough? 6. 4.14815 16)
 Leave good-enough? #f]
[Enter (good-enough? 4.14815 3.07538 16)
 Leave good-enough? #f]
[Enter (good-enough? 3.07538 2.61415 16)
 Leave good-enough? #f]
[Enter (good-enough? 2.61415 2.5232 16)
 Leave good-enough? #f]
[Enter (good-enough? 2.5232 2.51985 16)
 Leave good-enough? #f]
[Enter (good-enough? 2.51985 2.51984 16)
 Leave good-enough? #t]
2.51984
> ##
2.51984
> (expt ## 3)
16.
> (expt 16 3)
4096
> (cube ##)
[Enter (good-enough? 0 1. 4096)
 Leave good-enough? #f]
[Enter (good-enough? 1. 1366. 4096)
 Leave good-enough? #f]
[Enter (good-enough? 1366. 910.667 4096)
 Leave good-enough? #f]
[Enter (good-enough? 910.667 607.113 4096)
 Leave good-enough? #f]
[Enter (good-enough? 607.113 404.746 4096)
 Leave good-enough? #f]
[Enter (good-enough? 404.746 269.839 4096)
 Leave good-enough? #f]
[Enter (good-enough? 269.839 179.911 4096)
 Leave good-enough? #f]
[Enter (good-enough? 179.911 119.983 4096)
 Leave good-enough? #f]
[Enter (good-enough? 119.983 80.0836 4096)
 Leave good-enough? #f]
[Enter (good-enough? 80.0836 53.6019 4096)
 Leave good-enough? #f]
[Enter (good-enough? 53.6019 36.2098 4096)
 Leave good-enough? #f]
[Enter (good-enough? 36.2098 25.1812 4096)
 Leave good-enough? #f]
[Enter (good-enough? 25.1812 18.9407 4096)
 Leave good-enough? #f]
[Enter (good-enough? 18.9407 16.4329 4096)
 Leave good-enough? #f]
[Enter (good-enough? 16.4329 16.0113 4096)
 Leave good-enough? #f]
[Enter (good-enough? 16.0113 16. 4096)
 Leave good-enough? #f]
[Enter (good-enough? 16. 16. 4096)
 Leave good-enough? #t]
16.

Exercise 1.7

Each of the following two procedures defines a method for adding two positive integers in terms of the more primitive operators 1+, which increments its argument by 1, and -1+, which decrements its argument by 1.

(define (incr n) (+ n 1))               ;Modern 1+
(define (decr n) (- n 1))               ;Modern -1+

;; The FOO and BAR designators are just used to differentiate these
;; procedures from our built-in addition procedure (and from each
;; other).
(define (foo+ a b)
  (if (= a 0)
      b
      (incr (foo+ (decr a) b))))

(define (bar+ a b)
  (if (= a 0)
      b
      (bar+ (decr a)
            (incr b))))

Using the substitution model, illustrate the process generated by each procedure in evaluating (+ 4 5). Are these processes iterative or recursive?

Let's start by tracing FOO+. This wasn't strictly necessary, but I like tracing things.

> (foo+ 3 4)
[Enter (foo+ 3 4)
[Enter (foo+ 2 4)
[Enter (foo+ 1 4)
[Enter (foo+ 0 4)
 Leave foo+ 4]
 Leave foo+ 5]
 Leave foo+ 6]
 Leave foo+ 7]
7

As you can see, we keep deferring operations until we reach our base case. This is shown in the way we enter the recursive FOO+ procedure call four times, and each time we decrement the variable A, but B remains the same. Then, when we hit our base case (when A is equal to 0), we make the addition calls we've been storing up all this time. You can tell this is happening because every time we `Leave' the call to FOO+, we're still adding to the value.

Here's another, possibly better way to visualize it:

(foo+ 3 4)
(incr (foo+ (decr 3) 4))
(incr (incr (foo+ (decr 2) 4)))
(incr (incr (incr (foo+ (decr 1) 4))))
(incr (incr (incr 4)))
(incr (incr 5))
(incr 6)
7

Therefore this procedure takes O(n) time and O(n) space, where n is the value of the first argument A (assuming it's called as (FOO+ A B)).

Now let's look at BAR+. Here's the text of the procedure again:

(define (bar+ a b)
  (if (= a 0)
      b
      (bar+ (decr a)
            (incr b))))

BAR+ looks like it will generate an iterative process. Its self-call appears to be in tail position. Let's see how it looks in the REPL:

> (bar+ 3 4)
[Enter (bar+ 3 4)
[Enter (bar+ 2 5)
[Enter (bar+ 1 6)
[Enter (bar+ 0 7)
 Leave bar+ 7]
 Leave bar+ 7]
 Leave bar+ 7]
 Leave bar+ 7]
7

In this case, we can be pretty sure we're generating an iterative process (with no deferred computations being built up) because when we `Leave' the call to BAR+ we're not doing anything additional to the result; it's just being passed back up to the top-level unchanged.

Here's a trace written out by hand:

(BAR+ 3 4)
(BAR+ 2 5)
(BAR+ 1 6)
(BAR+ 0 7)
7

As you can see, we don't defer operations that take up additional space. Therefore we're O(1) in space, and O(n) in time for this implementation.

(Image courtesy SallySetsForth under Creative Commons license.)

How to Rock CSV files with SQLite

Hey, have you ever had to work with large quantities of CSV data (read: Excel) in a structured way? Has that experience ever made you want to smash something in a fit of rage? Have you ever wondered why you’re wasting your precious time on this earth reading half-arsed Excel tutorials from 2001 to run a simple query across some data set? If you’re anything like me, the answer is yes. Here’s how to banish that problem from your world forever (at least the CSV-related bits).

In this post, I will describe how to import CSV files into SQLite 3 and do some SQL-y things to them (also known as “rocking” them). There are many other tutorials on how to do this sort of thing; this one is mine. It assumes you have SQLite already installed on your computer.

I’m using the following version:

$ sqlite3 --version
3.7.12 2012-04-03 19:43:07 86b8481be7e76cccc92d14ce762d21bfb69504af

The Data Set

Here's the CSV data set we'll be working with – the columns are id, advertiser, line_item, campaign, impressions, and clicks. It's in a file called tiny-db.csv.

1,  A          , ACME Axe Co.                        , Midwest Lumberjack Competitors           ,        1792 ,     21 
2, A          , ACME Axe Co.                        , Northwestern Logging Equipment Intenders ,        4355 ,     34 
3, A          , Bar-None Ice Cream Cones            , New York Summer Heatwave                 ,      78231 ,   1408 
4, B          , Candles R-US                        , Home Decor Enthusiasts                   ,        2843 ,     65 
5, B          , Doggie Accessories                  , Southwestern Evening Weekends            ,        9486 ,    123 
6, C          , Evergreen Synthetic Xmas Trees      , Sun Belt Homeowners                      ,        2238 ,     46 
7, D          , Frog Hollow Athletic Socks          , Back-to-School                           ,        8198 ,    214 
8, D          , Frog Hollow Athletic Socks          , Practical Holiday  Shoppers              ,         103 ,     12 
9, E          , Best Soap & Candle                  , Clean Hands Stay Healthy                ,        3883 ,     41 
10, E          , Best Soap & Candle                  , Clean Hands Make Good Neighbors          ,        1292 ,    183 
11, E          , Best Soap & Candle                  , Bath Salt Relaxation                     ,         902 ,     81 
12, E          , Best Soap & Candle                  , Bath Salt Relaxation (plus Candles!)     ,        5352 ,    212 
13, F          , Farmer Snowplows                    , Northeast Winter Storms                  ,      12448 ,    256 
14, F          , Farmer Snowplows                    , Upper Midwest Winter Storms              ,      23984 ,    782 
15, F          , Farmer Snowplows                    , Rocky Mountain Winter Storms             ,       8764 ,    128 
16, G          , Gretchen's Organic Chocolates       , East Coast NPR Podcast Listeners         ,      48996 ,    973 
17, H          , Hap's Go-Kart Track and Petting Zoo , City Folks; Country Weekends             ,        1108 ,     87 
18, H          , Hap's Go-Kart Track and Petting Zoo , Go-Kart Enthusiast Forums (Weekend)      ,        1872 ,    116 

Don't worry too much about what those terms mean; they mostly have to do with online advertising and aren't that relevant, except to say that:

  • Impressions are those moments when a person views an ad on a website
  • Clicks are recorded when people click on advertisements
  • I totally made them up (this should be obvious from the names, but still…)
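One way to make the two numeric columns meaningful together is the click-through rate (CTR), i.e. clicks divided by impressions. Here's a quick sketch using two rows of the sample data above, run through Python's standard sqlite3 module (the select would look the same at the sqlite prompt):

```python
import sqlite3

# Click-through rate = clicks / impressions, computed for two rows
# of the sample data above (a sketch; the table is built inline here).
conn = sqlite3.connect(':memory:')
conn.execute('create table tiny (id integer, campaign text, '
             'impressions integer, clicks integer)')
conn.executemany('insert into tiny values (?, ?, ?, ?)',
                 [(1, 'Midwest Lumberjack Competitors', 1792, 21),
                  (3, 'New York Summer Heatwave', 78231, 1408)])
for campaign, ctr in conn.execute(
        'select campaign, round(100.0 * clicks / impressions, 2) from tiny'):
    print(campaign, ctr)
```

The `100.0 *` forces floating-point division (both columns are integers), giving CTR as a percentage.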

Importing the CSV File

Here's how to get the data into SQLite: first, make a table in the database whose columns match the columns in your CSV file. Then set the comma as your separator and import the data. Note that you may run into trouble with fields that contain embedded commas, such as "Hello, World!"; it's best to clean those up somehow beforehand.
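If your file does contain embedded commas, one way to clean it up beforehand is to parse it with a quote-aware CSV reader and re-emit it with a separator that never appears in the data. Here's a sketch using Python's standard csv module; you'd then tell sqlite `.separator "|"` instead of ",":

```python
import csv
import io

def csv_to_pipe(in_text):
    """Parse CSV with proper quote handling, then re-emit with '|' as
    the separator so embedded commas can't confuse sqlite's .import."""
    reader = csv.reader(io.StringIO(in_text))
    out = io.StringIO()
    writer = csv.writer(out, delimiter='|', lineterminator='\n')
    for row in reader:
        # Strip the padding whitespace around each field while we're at it
        writer.writerow(field.strip() for field in row)
    return out.getvalue()

print(csv_to_pipe('1,"Hello, World!",42\n'))  # prints: 1|Hello, World!|42
```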

rloveland:code rloveland$ sqlite3 
SQLite version 3.7.12 2012-04-03 19:43:07
Enter ".help" for instructions
Enter SQL statements terminated with a ";"

sqlite> create table tiny (id integer, advertiser text, line_item text, campaign text, impressions integer, clicks integer);

sqlite> .separator ","

sqlite> .import tiny-db.csv tiny
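If you'd rather script this step than type at the prompt, Python's standard sqlite3 module can do the same create-and-import dance. This is just a sketch: SAMPLE stands in for the first two rows of tiny-db.csv, which you'd normally read from the file.

```python
import csv
import io
import sqlite3

# Stand-in for tiny-db.csv; in practice use open('tiny-db.csv', newline='')
SAMPLE = '''1,A,ACME Axe Co.,Midwest Lumberjack Competitors,1792,21
2,A,ACME Axe Co.,Northwestern Logging Equipment Intenders,4355,34
'''

conn = sqlite3.connect(':memory:')
conn.execute('create table tiny (id integer, advertiser text, line_item text, '
             'campaign text, impressions integer, clicks integer)')
# csv.reader yields one list of fields per row, which executemany binds
conn.executemany('insert into tiny values (?, ?, ?, ?, ?, ?)',
                 csv.reader(io.StringIO(SAMPLE)))
row_count = conn.execute('select count(*) from tiny').fetchone()[0]
print(row_count)  # -> 2
```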

Querying the Data Set

Now we can run SQL statements on the data.

Which rows had more than 9000 impressions?

sqlite> select * from tiny where impressions > 9000;

3, A          , Bar-None Ice Cream Cones            , New York Summer Heatwave                 ,78231,1408
5, B          , Doggie Accessories                  , Southwestern Evening Weekends            ,9486,123
13, F          , Farmer Snowplows                    , Northeast Winter Storms                  ,12448,256
14, F          , Farmer Snowplows                    , Upper Midwest Winter Storms              ,23984,782
16, G          , Gretchen's Organic Chocolates       , East Coast NPR Podcast Listeners         ,48996,973

Which rows had 10000 impressions or more? And can we order them by campaign?

sqlite> select * from tiny where impressions >= 10000 order by campaign;

16, G          , Gretchen's Organic Chocolates       , East Coast NPR Podcast Listeners         ,48996,973
3, A          , Bar-None Ice Cream Cones            , New York Summer Heatwave                 ,78231,1408
13, F          , Farmer Snowplows                    , Northeast Winter Storms                  ,12448,256
14, F          , Farmer Snowplows                    , Upper Midwest Winter Storms              ,23984,782

Using Full Text Search

Finally, let's set up full text searching. We have to use something called a `virtual' table, built here with the FTS4 extension, that mirrors our existing table.

sqlite> create virtual table tiny_fts3 using fts4 (id, advertiser, line_item, campaign, impressions, clicks);

sqlite> insert into tiny_fts3(id, advertiser, line_item, campaign, impressions, clicks) select id, advertiser, line_item, campaign, impressions, clicks from tiny;

Let's run a few text searches using the MATCH keyword to make sure it works:

sqlite> select * from tiny_fts3 where line_item match('Acme');

1,  A          , ACME Axe Co.                        , Midwest Lumberjack Competitors           ,1792,21
2, A          , ACME Axe Co.                        , Northwestern Logging Equipment Intenders ,4355,34

sqlite> select * from tiny_fts3 where line_item match('Candle');

9, E          , Best Soap & Candle                  , Clean Hands Stay Healthy                ,3883,41
10, E          , Best Soap & Candle                  , Clean Hands Make Good Neighbors          ,1292,183
11, E          , Best Soap & Candle                  , Bath Salt Relaxation                     ,902,81
12, E          , Best Soap & Candle                  , Bath Salt Relaxation (plus Candles!)     ,5352,212
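For completeness, the same mirror-and-search flow can be scripted too. Here's a sketch with Python's stdlib sqlite3 on two stand-in rows; it assumes the underlying SQLite library was compiled with FTS4, which most builds are. Note that FTS matching is case-insensitive for ASCII text, which is why searching for 'Acme' found the 'ACME' rows above.

```python
import sqlite3

# Mirror a regular table into an FTS4 virtual table, then MATCH against it.
# Assumes the underlying SQLite was compiled with FTS4 (most builds are).
conn = sqlite3.connect(':memory:')
conn.execute('create table tiny (id integer, line_item text)')
conn.executemany('insert into tiny values (?, ?)',
                 [(1, 'ACME Axe Co.'), (9, 'Best Soap & Candle')])
conn.execute('create virtual table tiny_fts3 using fts4 (id, line_item)')
conn.execute('insert into tiny_fts3(id, line_item) '
             'select id, line_item from tiny')
# ASCII matching is case-insensitive, so 'acme' finds 'ACME'
hits = conn.execute(
    "select line_item from tiny_fts3 where line_item match 'acme'").fetchall()
print(hits)  # -> [('ACME Axe Co.',)]
```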

Conclusion

That's pretty much the whole show. Now go forth and become the master of all CSVs you survey!