Advent of Code 2017, Day 2

This is my solution for Day 2 of this year’s Advent of Code.

You may also enjoy browsing the Day 2 solutions megathread on Reddit.

PROBLEM

The spreadsheet consists of rows of apparently-random numbers. To make sure the recovery process is on the right track, they need you to calculate the spreadsheet’s checksum. For each row, determine the difference between the largest value and the smallest value; the checksum is the sum of all of these differences.

For example, given the following spreadsheet:

```5 1 9 5
7 5 3
2 4 6 8
```

The first row’s largest and smallest values are 9 and 1, and their difference is 8.

The second row’s largest and smallest values are 7 and 3, and their difference is 4.

The third row’s difference is 6.

In this example, the spreadsheet’s checksum would be 8 + 4 + 6 = 18.

SOLUTION

```(define (line->list line)
;; String -> List
(in-port (make-string-input-port line)))
(map string->number fields))))

;; File -> List[List[Number]]
(call-with-input-file file
(lambda (port)
(results '()))
(if (eof-object? line)
results
(loop (read-line port) (cons line results)))))))

(define (main prog+args)
(write (apply + (map
(lambda (row)
(let* ((xs (line->list row))
(min (apply min xs))
(max (apply max xs)))
(- max min)))
rows)))
(newline)))
```

Advent of Code 2017, Day 1

This is my solution for Day 1 of this year’s Advent of Code.

You may also enjoy browsing the Day 1 solutions megathread on Reddit.

PROBLEM

The captcha requires you to review a sequence of digits (your puzzle input) and find the sum of all digits that match the next digit in the list. The list is circular, so the digit after the last digit is the first digit in the list.

For example:

• 1122 produces a sum of 3 (1 + 2) because the first digit (1) matches the second digit and the third digit (2) matches the fourth digit.

• 1111 produces 4 because each digit (all 1) matches the next.

• 1234 produces 0 because no digit matches the next.

• 91212129 produces 9 because the only digit that matches the next one is the last digit, 9.

SOLUTION

```(define captcha-input "5994521226795838")

(define (gather-matches s)
;; String -> List
(let ((in-port (make-string-input-port s)) (count 0) (head #f) (vals '()))
(let loop ((cur (read-char in-port)) (next (peek-char in-port)) (count count) (vals vals))
(if (eof-object? next)
(cons cur vals)
vals)
(cond ((= count 0)
(begin
(loop cur next (+ 1 count) vals)))
((char=? cur next)
(loop (read-char in-port) (peek-char in-port) (+ 1 count) (cons cur vals)))
(else (loop (read-char in-port) (peek-char in-port) (+ 1 count) vals)))))))

(define (main prog+args)
(matches* (map (lambda (c) (string->number (string c))) matches))
(sum (apply + matches*)))
(begin
(format #t "MATCHES*: ~A~%" matches*)
(format #t "SUM: ~A~%" sum))))
```

This post describes my solution for Day 2 of the Advent of Code.

Problem Description

First, the problem description (copied from the website):

Day 2: I Was Told There Would Be No Math

The elves are running low on wrapping paper, and so they need to submit an order for more. They have a list of the dimensions (length l, width w, and height h) of each present, and only want to order exactly as much as they need.

Fortunately, every present is a box (a perfect right rectangular prism), which makes calculating the required wrapping paper for each gift a little easier: find the surface area of the box, which is 2 x l x w + 2 x w x h + 2 x h x l. The elves also need a little extra paper for each present: the area of the smallest side.

For example:

• A present with dimensions 2x3x4 requires 2 x 6 + 2 x 12 + 2 x 8 = 52 square feet of wrapping paper plus 6 square feet of slack, for a total of 58 square feet.
• A present with dimensions 1x1x10 requires 2 x 1 + 2 x 10 + 2 x 10 = 42 square feet of wrapping paper plus 1 square foot of slack, for a total of 43 square feet.

All numbers in the elves’ list are in feet. How many total square feet of wrapping paper should they order?

Solution

Once again, we’ll be working in Scheme.

For this problem, I decided to create a “box” data type. In addition to the automatically generated accessors (thanks SRFI-9!), I wrote several procedures to perform calculations on boxes, namely:

• `SURFACE-AREA`: Calculate the box’s surface area.
• `SMALLEST-SIDE`: Determine which of the box’s sides has the smallest surface area (the extra material makes it easier to wrap).

`WRAPPING-PAPER` is just a “wrapper” (pun intended) around the first two.

`LINE->BOX`, `READ-BOXES`, and `SUM-BOXES` are all about parsing the input file contents and shuffling them into the box data type that we use to do the actual calculation. The only part that required a bit of thought was the line with `STRING-TOKENIZE` in `LINE->BOX`. In Perl I’d use `my @params = split /x/, \$line` without even thinking, but I was less familiar with Scheme’s facility for solving this problem, so it took a few minutes to puzzle out the right part of Scheme’s “API”. (`STRING-TOKENIZE` was helpfully provided by SRFI-13.)

Abstract data types FTW! I’ll be using them more as the month’s challenges progress.

```;; ,open srfi-9 srfi-13 sort

(define-record-type box
(make-box l w h)
box?
(l box-length set-box-length!)
(w box-width set-box-width!)
(h box-height set-box-height!))

(define (surface-area box)
;; Box -> Int
(let ((l (box-length box))
(w (box-width box))
(h (box-height box)))
(+ (* 2 l w)
(* 2 w h)
(* 2 h l))))

(define (smallest-side box)
;; Box -> Int
(define (smallest-two xs)
;; List -> List
(let ((sorted (sort-list xs <)))
(list (first sorted)
(second sorted))))
(let ((l (box-length box))
(w (box-width box))
(h (box-height box)))
(apply * (smallest-two (list l w h)))))

(define (wrapping-paper box)
;; Box -> Int
(let ((minimum (surface-area box))
(extra (smallest-side box)))
(+ minimum extra)))

(define (line->box line)
;; String -> Box
(define (line->lon s)
;; String -> List<Number>
(let ((xs (string-tokenize s (char-set-complement (char-set #\x)))))
(map string->number xs)))
(let* ((dims (line->lon line)))
(let ((l (first dims))
(w (second dims))
(h (third dims)))
(make-box l w h))))

;; Pathname -> List<Box>
(with-input-from-file file
(lambda ()
(ys '()))
(if (eof-object? line)
ys
(loop
(cons (line->box line) ys)))))))

(define (sum-boxes boxes)
;; List<Box> -> Int
(let ((xs (map wrapping-paper boxes)))
(apply + xs)))

;; eof
```

Make your own iterators in Scheme using closures

Introduction

I recently bought a copy of Higher Order Perl (also known as HOP) by Mark Dominus. In chapter 4, he introduces iterators. At a high level, an iterator is a black box with an input funnel, an output funnel, and a big red “NEXT” button. You put something into the funnel at time `T`; then, at some later time `T+n`, you press the “NEXT” button and something useful comes out.

Iterators are useful in cases when you want to access a potentially huge list of things one at a time, but you don’t have access to a computer with infinite memory.

In this post we’ll look at a moderately useful iterator example; a directory walker that recurses down a directory tree on your file system. The implementation is given in Scheme (specifically, the scsh dialect); the code and the descriptions thereof are heavily inspired by Dominus’ excellent book. However, this post contains the work of an amateur. Any errors or omissions in the following code and explanations are mine.

In order to build a useful iterator, we need to have a basic understanding of closures. A closure is just a function packaged up with an environment. The environment allows us to store up state between function calls. As noted above, it’s easiest just to think of it as a little black box that we can pass around in our program.

In Scheme and Common Lisp, you build functions with `LAMBDA`. Here’s a very simple closure example in Scheme, translated from the first example given in HOP:

```(define (upto m n)
(set! m (- m 1)) ;needed since we don't have post-increment (\$i++)
(lambda ()
(if (< m n)
(begin (set! m (+ m 1)) m)
#f)))
```

This is a function that returns a function (a closure). To use the closure returned by `UPTO`, assign it to a variable and call it whenever you want the next value. When the iterator is exhausted, it will return `#f`.

```> (define 1to10 (upto 1 10))
> (1to10)
1
> (1to10)
2
...
> (1to10)
9
> (1to10)
10
> (1to10)
#f
```

A trivial file tree iterator

A more interesting and practical use of this technique is to walk a directory tree on your computer. In the Scheme procedure below, we take a list of directories and then build and return a closure that, when called, will walk the directory tree breadth-first, printing out the names of files as we go.

```(define (dir-walk queue)
(let ((file #f))
(lambda ()
(if (not (null? queue))
(begin (set! file (car queue))
(set! queue (cdr queue))
(cond ((file-directory? file)
(let ((new-files (directory-files file)))
(set! queue (append queue (map (lambda (filename)
(string-append file "/" filename))
new-files)))
file))
((file-regular? file) file)
(else #f)))))))
```

The important part to notice is the inner `LAMBDA`. This is where we create a closure by packaging up a procedure with an environment. The closure’s environment remembers the contents of the variable `QUEUE` in memory, so we can ask for the elements of `QUEUE` at our leisure.

Here it is running on my machine:

```> (define dir-iter1 (dir-walk '("/Users/rloveland/Desktop/current/")))
> (dir-iter1)
"/Users/rloveland/Desktop/current/XProlog.jar"
> (dir-iter1)
"/Users/rloveland/Desktop/current/all-jira-issues.txt"
> (dir-iter1)
"/Users/rloveland/Desktop/current/automate-campaign-setup.txt"
> (dir-iter1)
"/Users/rloveland/Desktop/current/buflocal.scm"
> (dir-iter1)
"/Users/rloveland/Desktop/current/bundle.js"
```

Just for fun, here’s a version of the same function in Common Lisp. It’s essentially the same as the scsh version, save that it uses the nice CL pathname machinery instead of appending raw strings, and also makes use of the CL-FAD convenience library.

```(defun dir-walk (queue)
(let ((file nil))
(lambda ()
(if (not (null queue))
(progn
(setf file (car queue))
(setf queue (cdr queue))
(setf queue (append queue (mapcar #'(lambda (filename)
(merge-pathnames filename file))
new-files)))
file))
(t nil)))))))
```

Usage is also essentially the same:

```CL-USER> (defvar *tree-walker* (dir-walk '(#P"/Users/rloveland/Desktop/current/")))
*TREE-WALKER*
CL-USER> (type-of *tree-walker*)
FUNCTION
CL-USER> (funcall *tree-walker*)
#P"/Users/rloveland/Desktop/current/"
CL-USER> (funcall *tree-walker*)
#P"/Users/rloveland/Dropbox/current/.debuggerDefaults"
CL-USER> (funcall *tree-walker*)
#P"/Users/rloveland/Dropbox/current/.DS_Store"
CL-USER> (funcall *tree-walker*)
#P"/Users/rloveland/Dropbox/current/A"
CL-USER> (funcall *tree-walker*)
#P"/Users/rloveland/Dropbox/current/A.hi"
CL-USER> (funcall *tree-walker*)
#P"/Users/rloveland/Dropbox/current/A.hs"
```

A slightly less trivial file tree iterator

Listing out files and directories is nice, but not that useful. We’d like a way to see only those files and directories that have some interesting property.

This can be accomplished by passing in another argument to our `DIR-WALK` function: this argument will be yet another function that will test the current file to see whether it’s interesting to us. It’s pretty easy to change `DIR-WALK` to accept a function argument, `INTERESTING?`. This arbitrary function is used to check the file to see if we care about it.

This time around, when we build our internal queue, we use a call to `FILTER` to make sure that only interesting files get added.

```(define (dir-walk* interesting? queue)
(let ((file #f))
(lambda ()
(if (not (null? queue))
(begin (set! file (car queue))
(set! queue (cdr queue))
(cond
((file-directory? file)
(let ((new-files (directory-files file)))
(set! queue (append queue (filter interesting? (map (lambda (filename)
(string-append file "/" filename))
new-files))))
(if (interesting? file) file)))
((interesting? file) file)
(else #f)))
#f))))
```

And here it is in use; in this example, we pass an `INTERESTING?` function that asks for only files that are not marked as executable:

```> (define dir-iter2 (dir-walk* (lambda (f) (file-not-executable? f)) '("/home/rml/Desktop/current")))
> (dir-iter2)
> (dir-iter2)
"/home/rml/Desktop/current/A.hi"
> (dir-iter2)
"/home/rml/Desktop/current/A.hs"
> (dir-iter2)
"/home/rml/Desktop/current/A.o"
> (dir-iter2)
"/home/rml/Desktop/current/ABBREV.xlsx"
> (dir-iter2)
> (dir-iter2)
```

Conclusion

There are still bugs and unexpected behavior in `DIR-WALK*`. For example, on the first call to the resulting iterator there is no result due to the one-armed `IF`. There are also strange things that happen if we want to filter out directories from our results, since we mix together the `FILE-DIRECTORY?` check and the `INTERESTING?` check inside the iterator. However, despite these small nits, it can still do useful work on my machine 1, and it’s a good enough example of using closures to build iterators.

Here’s hoping you’ll enjoy playing with closures and iterators in your own code!

(Image courtesy zeitfanger.at under a Creative Commons License.)

Footnotes:

1 In the grandest internet tradition, “It works on my machine” ™.

Scsh is UNIX, Part 1: Sockets

Inspired by words from Ryan Tomayko and Jacob Kaplan-Moss, I set out to see how difficult it would be to implement a tiny UNIX echo server in scsh. It turns out to be fairly easy. In this post, we’ll cover:

• How the UNIX socket programming model works at a high level
• How to write a basic echo server in scsh

Unlike the echo servers of Tomayko and Kaplan-Moss, this one doesn’t use `fork` (yet). We will add forking to the server in a follow-up post. Hopefully this will make it easier for less experienced UNIX programmers (like me) to get started with.

The UNIX socket programming model for dummies

At a high level, the steps to open a server-side socket are:

• create a socket “thing”
• toggle some settings on it so the OS knows it’s for interweb use
• bind it to an address
• listen for connections on it
• start accepting those connections

scsh does its part to make this easy by providing a high-level Scheme procedure, `BIND-LISTEN-ACCEPT-LOOP`, which handles this stuff for you, in addition to some nifty error handling and other bookkeeping chores. But we’re going to ignore that for now and write everything by hand to see how it’s done.

You should be able to enter all of the code in the next section directly in the scsh top-level. You don’t need to open any additional packages; this is just plain old scsh.

Our echo server

Our server takes two arguments: the `PROC` is a procedure that causes the server to actually do something; we’ll write in a minute. The `PORT` is the port you want to serve on, e.g., 49995:

``````(define (server proc port)
(let* ((sock (create-socket protocol-family/internet socket-type/stream))
(listen-socket sock 5)
(let loop ()
(lambda () (accept-connection sock))
proc
(loop))))
``````

The first thing you’ll notice is that it’s pretty sequential and quite simple really. We just go through the socket opening dance we described earlier: open, configure, bind, listen, and accept.

Since this is an echo server, we’ll just, like, echo stuff:

``````(define (echo socket address)
(let* ((in (socket:inport socket))
(out (socket:outport socket))
(format out "~A~%" message)
(force-output out)))
``````

As you can see, our `ECHO` procedure takes a socket and an address. (We don’t do anything with the address in this example, but our procedure needs to “implement this interface”, as they say, in order to work. You can see this for yourself in the scsh 0.6.7 tarball; it’s in `scsh/network.scm`.)

In our `LET*`-binding we create some convenient locally bound names for the socket structure’s input and output ports, and then we read a line from the socket’s input port.

Since this is echo server, we just write the same data back out with `FORMAT`. We call `FORCE-OUTPUT` to flush output to the terminal immediately. Otherwise Scheme will wait for the operating system’s output buffer to fill before writing out, and it will appear to the user that nothing is happening.

Trying it out

Let’s start it up and see if it works. Assuming you’ve loaded the procedures above in your scsh image somehow, enter this at the REPL:

``````> (server echo 49995)
``````

The REPL should appear to “hang” while the server is running. Now go to your terminal and connect to the echo server. There are several ways to do it; here’s what I used:

``````~ \$ telnet localhost 49995
Trying ::1...
telnet: connect to address ::1: Connection refused
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
HELLO
HELLO
who's there?
who's there?
i don't know
i don't know
could you stop that?
could you stop that?
fine then
fine then
goodbye
goodbye
``````

Hopefully that was enough to get you started playing with sockets in scsh. I’ll write a followup post where we talk about the UNIX forking process model and update this example to make it a “preforking” server.

(Image courtesy Hope for Gorilla under a Creative Commons license.)

Driving the Scsh disassembler from Geiser

(Note: This post refers to functionality provided by geiser-scsh.)

Most Scheme 48 and scsh programmers are familiar with the `\,dis` command. It allows you to disassemble Scheme code (a procedure or continuation) into the “assembly” understood by the Scheme 48 virtual machine. Here’s an example using scsh’s `fork/pipe` procedure:

```> ,dis fork/pipe
fork/pipe
0 (protocol 0 +)
4 (make-env 1)
7 (global fork)
10 (push-local0 1)
13 (push)
14 (global really-fork/pipe)
17 (call 2)
```

This is pretty cool. However, I thought it would be even better if Geiser could run the disassembler on a region of Scheme code, and pop open the disassembler output in a new buffer. Geiser already supports similar functionality for macro-expansion via `GEISER-EXPAND-REGION`. There’s no reason why it shouldn’t do the same for the disassembler. So that’s what I taught it to do – in this screenshot I’m disassembling the `LETREC` expression:

(I have to admit I was inspired to work on this by my jealousy of the disassembler functionality available to Common Lispers via the `SLIME-DISASSEMBLE-{DEFINITION,SYMBOL}` commands in SLIME.)

How it Works

Here’s how it works:

• Highlight a region of Scheme (scsh) code in a buffer
• Call ‘`M-x geiser-scsh-disassemble-region`‘ (a.k.a., ‘`C-c RET C-d`‘).
• Emacs sends the Scheme code to the running `scsh` process for disassembly via a newly exported Geiser macro, `GE:DISASSEMBLE`. This macro wraps the code in a `LAMBDA` before sending it to the disassembler to work around the fact that the S48 `DISASSEMBLE` procedure won’t accept arbitrary Scheme expressions such as `(+ 1 1)`. (Wrapping expressions in `LAMBDA` like this is not ideal for reasons I’ll explain below.)

Future Work

This implementation is a proof of concept – don’t believe in it. Think of it as a prototype of how this functionality could work if built into Geiser proper. Here are some of the issues with it from an implementation perspective:

• It doesn’t use the correct Geiser protocols for sending Scheme code to evaluate, instead just piping in raw strings. This was expedient because of the way s48/scsh use non-reader-friendly strings like `{#Uncategorized}` as the final lines of output for procedures whose outputs are not defined by the standard. I think this can be fixed by coming up with a better implementation of the Geiser code evaluation protocols (in `scsh/geiser/evaluation.scm`) so that they handle all of the weird cases in S48 output.
• Related to the previous point, I’m doing some ugly regex stuff on the stringified output of the disassembler to make it nicer before piping it into the temp buffer.
• This functionality should really be added to Geiser itself, via the `GEISER-DEBUG-*` namespace. Then it would be forced to address both of the above points. Right now it’s just an ugly little hack in `geiser-scsh.el`. In principle, with the right infrastructure in `GEISER-DEBUG-*`, there’s nothing preventing a Guile or Racket implementation (here’s the Guile disassembler in action – you can see that it’s not so different from S48):
```scheme@(guile-user)> (define (foo n) (expt n n))
scheme@(guile-user)> ,x foo
Disassembly of #<procedure foo (n)>:

0    (assert-nargs-ee/locals 1)      ;; 1 arg, 0 locals
2    (toplevel-ref 1)                ;; `expt'
4    (local-ref 0)                   ;; `n'
6    (local-ref 0)                   ;; `n'
8    (tail-call 2)                                         at (unknown file):5:16
```
• The disassembler, like the macro-expansion functionality, should be able to work recursively. That is, in places where the S48 assembly makes procedure calls like `(global ge:really-fork/pipe)` (as it does in our first example way at the top of this post), you should be able to ask for the disassembled output of `GE:REALLY-FORK/PIPE` as well, all the way down to the core primitives. Currently, there are cases where you still need to call `\,dis <PROCEDURE>` at the command prompt. I think this limitation is created by the way we wrap the expression in a `LAMBDA` before sending it to the disassembler. A better design is needed, one that (for example) calls `PROCEDURE?` on the proposed input, and then decides whether the expression needs to be wrapped in a `LAMBDA` before going to the disassembler.

Scsh Manual Available in Texinfo Format

I am pleased to announce that, many years later, the famous Scsh Reference Manual has been converted to Texinfo. You can get your very own copy here: https://github.com/rmloveland/scsh-manual-texinfo.

This conversion represents quite a bit of work. Although I would have liked to do the conversion with a program, there was just too much of an impedance mismatch between the original LaTeX sources and Texinfo. Something tells me that’s why this conversion hasn’t happened until now. Whatever the reason, I’m glad to finally collect some internet cool points, even if it is almost twenty years later.

In the course of the conversion, I uncovered a number of small spelling, grammatical, and usage issues, which have now been fixed. To be clear: this is no criticism of the original authors. Several of them appear to be non-native English speakers. Their English is miles better than my German!

The resulting manual builds very nicely into HTML, PDF, or Info files. The PDF in particular is nice, since you get beautiful, clickable links everywhere! And the Info files are quite easy to install in your local Emacs.

Speaking of Emacs, if you like hacking Scsh from Emacs, try my geiser-scsh setup. Better yet, help me hack on it!

(Image courtesy steeljam under CC-BY-NC-ND license.)

Geiser and Scsh are Talking

In this post I’d like to announce an early version of scsh 1 support for Geiser 2. There’s plenty left to do, but enough of Geiser is working that you can write scsh with real completions provided by scsh itself! What’s more, if you want to you can hack on the Geiser support from the inside. :-}

Working features include:

• Symbol and module completion
• Module imports (mostly)
• view macroexpansions in buffer
• Talking to Geiser over its async protocol (the foundation of everything else)

Some features aren’t there yet:

• Jumping to procedure definition in source
• Show callers/callees of a procedure
• Jumping to module source
• Jumping to documentation for the thing at point (I’ve only gotten this to work sporadically for some reason)

Getting the Code

You can get the code here:

See the README and TODO.org files in the repo for more detailed information about project goals and what works and what doesn’t. Feel free to open issues on Github for any broken things (there are plenty).

You can also get a copy of the scsh manual in Texinfo here – please let me know if you can get the Geiser doc commands working with it, so far I haven’t had much success:

Getting Started

There are a few setup quirks:

• First, set Emacs’ `geiser-scheme-dir` variable to wherever you’re working from; here’s what I do:
```(setq geiser-scheme-dir (expand-file-name "~/Code/geiser-scsh/scheme/"))
```
• Second, go ahead and load up `geiser-scsh.el` from the git repo.
• Next, hop into Customize via `M-x customize-group ENTER geiser ENTER`, and set the following:
Custom Variable Value Note
Geiser Repl Skip Version Check P on You must set this, or else Emacs throws you into the debugger. This is a bug I need to fix.
Geiser Active Implementations scsh
Geiser Default Implementation scsh I do this since I’m mostly working in scsh, but it shouldn’t be necessary.

(Note that you’ll need to set `geiser-scheme-dir` back to its original value if you want to go back to using Guile or Racket.)

At this point, you should be able to do `M-x run-scsh` and enjoy your scsh hacking with nice completions and whatnot. If you’re going to hack on the Geiser support, do a `C-u M-x geiser-show-logs`; a buffer will open where you can listen to scsh and Emacs talking. And of course, the source is your friend. :-}

(Image courtesy RayMorris1 under CC-BY-NC-ND license.)

Word Count, scsh-style

Introduction

Just for fun, i’ve implemented a command line “word count” program in `scsh`. Along the way i’ve learned about a few really neat features of `scsh`, which I’ll discuss more below.

What is a word count program? In my perfect world, a word counter does only one thing: count the number of words it sees on standard input, and print that number to standard output. Let me give some examples of what should qualify as “words” for our purposes:

below
that’s
schleswig-holstein (one word!)
autonomy
friday’s
putrescence

Basic Design

If I were specifying a word count program in natural language, I might think of this series of steps:

1. Read standard input (fd 0) into some kind of input buffer.
2. Iterate over that buffer, checking for breaks (a.k.a. whitespace) between words.

A string like the following might prove tricky enough (feel free to replace the reference to “Satan” with your preferred \$EVIL_DEITY):

Every good dog goes to heaven – and if not? – well, I hear
Satan has de-
licious bones to chew!

By inspection, the sentence above has eighteen words. (Note that the implementation of `wc` in scsh described here (eventually) says 19, and GNU’s `wc` says there are 21–both are incorrect, but more on that below.) There are several edge cases to note in this sentence:

1. A space at the beginning of the sentence.
2. A clause inside em-dashes.
3. A hyphenated word occurring at the end of a line.

Some things we should probably do, given the above:

1. Ignore spaces at the beginnings of lines.
2. Ignore punctuation in general, such as periods, exclamation points…
3. ..except hyphens. Hyphens join two or more words into one. This should work across newlines; that would take care of the “hyphenated word at line’s end” case.

Of the above 3 items, the last point sounds the trickiest. I think we need to take care of the hyphen case eventually, but for now let’s punt on it and worry about getting something basic working. Starting at a high level, here’s my take on the program flow:

1. Loop over the lines of standard input.
2. For each line, split the line into words using an appropriately “clever” regular expression.
3. Keep a running total of how many words we’ve seen.

Implementation Notes

Sounds simple enough, right? Well, after some hacking around, here’s what I came up with. Items 1 and 3 from the above program flow list are pretty standard parts of any programming language, and scsh has them covered, as we’ll see. That leaves it up to me to implement the above-mentioned “clever” regular expression (which, as you’ll see shortly, could stand to be a little, um, “clever-er”).

In the course of describing this implementation, i’ll introduce several interesting features of scsh,

• The `rx’ regular expression notation
• The `awk’ macro
• The `regexp-fold-right’ procedure

and briefly mention another:

• Scheme 48/scsh “records”.

The Not-so-Clever Regular Expression

Rather than do it here, I’ve written the description of the regular expression using comments in the code itself. One of the strengths of shivers’ `rx’ regular expression package (standard in scsh) is that it allows the programmer to do matching notation using a syntax that, while not Scheme proper, is s-expression based rather than the traditional string-based syntax (Don’t worry: the string syntax is also supported). Because of the s-expressions, we can format these regexps easily in Emacs, as well as add descriptive inline comments.

The Perl aficionados among you will note that this is similar to Perl’s `-x’ regular expression modifier, which allows you to “Extend your pattern’s legibility by permitting whitespace and comments.” Of course, when we use scsh, we get all of that, and the rest of Scheme for free!

In any event, here is the expression that we match against every line of input:

```(define wc-rx (rx (:                   ; start rx matching sequence
(* whitespace)      ; 0 or more spaces to start the line
(+ alphanumeric)    ; match 1 or more chars in [0-9A-Za-z]
(?                  ; then, optionally match the following:
(or #\' #\`)       ; - apostrophe or backtick
(* alphanumeric))  ; - 0 or more [0-9a-zA-Z]
(* whitespace))))   ; - 0 or more spaces to end the line
```

Let’s see how it does against our word list from above:

Text Count
below 1
that’s 1
schleswig-holstein 2
autonomy 1
friday’s 1
putrescence 1

Looks like we’re failing to match correctly against “schleswig-holstein”, a hyphenated word. This is something of a corner case, since relatively few words in English use hyphens. Therefore, I’ll punt on it for now (as I did on the “hyphenated line break” case above), with the caveat that it may need to be implemented in future.

Looping with awk

I tried to avoid learning too much about scsh’s `awk’ macro at first. Unfortunately, I just didn’t take the time to read the manual in this case. My thinking was: Can’t I just write a `for-each` style Scheme loop? Well, yes. scsh actually comes with `regexp-for-each` procedure, which will happily loop across a string, gobbling up one match after another. The difficulty arose when i tried to update a state variable from inside my `awk’ macro loop. I just couldn’t seem to get it to work. Dramatization (This is not the actual code–I’ve since deleted it):

```(let ((count 0))
... stuff here ...
(set! count (+ 1 count)))
... etc. ...)
```

It turns out that this was extremely dumb, since `awk’ provides a nice state variable abstraction already – here’s the definition of the form from the excellent manual:

```(awk NEXT-RECORD RECORD+FIELD-VARS
[COUNTER] STATE-VAR-DECLS
CLAUSE_1 ...)
```

Using the block above (and the manual) as a reference, let’s look at the actual code in the section below – you’ll have to substitute the existing code below for the terms `NEXT-RECORD’ and the like:

1. Each time through the loop, `awk’ gets its next set of values using the procedure NEXT-RECORD; in this case, it uses a simple `read-line`.
2. The variable `line` (standing in for RECORD+FIELD-VARS) holds the line of input returned by `read-line`.
3. `awk’ provides an optional COUNTER variable, which is omitted since isn’t needed in this case.
4. We set the value of the state variable `words` to 0 (This is `STATE-VAR-DECLS’).
5. Finally, we get to `CLAUSE_1′. In the `awk’ macro, there are several types of clauses, which are all evaluated in sequence (this is unlike the way `cond` works, for example). In the most basic case, the clause is of the form `(/test/ /body/)`. As you can see here, test is simply `#t`, Scheme’s boolean truth value, so its corresponding body is evaluated every time through the loop. This causes the value of our state variable `words` to be incremented by the `length` of the list returned by the `regexp-fold-right’ block. For now we’ll just treat that block as a black box which looks at a string and returns a list of matching substrings.
```(awk (read-line) (line) ((words 0))
(#t (+ words
(length
(regexp-fold-right wc-rx (lambda (m i lis)
(cons (match:substring m 0) lis))
'() line)))))
```

Gathering Matches with `regexp-fold-right`

As we’ve just deduced by reading through the `awk’ loop, `regexp-fold-right` returns a list of substrings matching our regular expression `wc-rx`. We’re going to take the length of that list and add it to our running total. But how does this crazy `regexp-fold-right` work, anyway? It took me a while to figure out, as I hadn’t used any of the functional-style `fold` procedures before.

First, the code:

```(regexp-fold-right wc-rx (lambda (m i lis)
(cons (match:substring m 0) lis))
'() line)
```

My initial impressions: I can see that we’re using our regular expression `wc-rx` and doing something with a `lambda`-expression that `cons`-es part of the resulting `match` record onto an empty list. Somehow this is operating on the `line` of input read in by `awk’.

Looking at the code above with the manual open, we see the following:

• Again, `m` is a `match’ record returned by matching `wc-rx` against this line of input.
• `i` is the index of the end of the match described by `m`
• `lis` is the list of substring matches we’ll be building up by repeatedly `cons`-ing the substring in the match record `m` onto `lis` – in this way we build up a list of matching substrings of our regular expression

To make sure we really understand this, let’s step through an example. We’ll create a simple match against 2 lines of input, and diagram the states of variables `m`, `i`, `lis`, and `line` during each step. By watching the ways these variables change as we loop over each line, we’ll get a fuller understanding of the way `regexp-fold-right` works. Four iterations should be enough:

STEP 1
String “let’s go to the Boston aquarium!”
i 0
lis ‘()
(match:substring m 0) “”
STEP 2
String “go to the Boston aquarium!”
i 6
lis ‘(“let’s “)
(match:substring m 0) “let’s “
STEP 3
String “to the Boston aquarium!”
i 9
lis ‘(“go ” “let’s “)
(match:substring m 0) “go “
STEP 4
String “the Boston aquarium!”
i 12
lis ‘(“to ” “go ” “let’s “)
(match:substring m 0) “to “

Does it Work?

It’s been fun writing this program, but how does it stack up against GNU `wc`? According to my quick and dirty testing, quite well. Depending upon the type of file and how much weird formatting and markup it contains, our scsh word count will report slightly more or fewer words than GNU’s `wc`. Here are a few examples from readily available text files (The first file is Neal Stephenson’s In the Beginning was the Command Line, the second Homer’s Iliad):

File GNU wc wc.scm Discrepancy (%)
command-line.txt 36331 37262 0.025

Those numbers look pretty good to me. And remember that we still have room for improvement, since we don’t handle hyphenated words correctly (easy to implement), nor do we handle hyphenated line breaks (slightly harder). Remember from Basic Design that GNU word count doesn’t handle the hyphenated line break correctly either, and also appears to be incorrectly treating `–’ as a word, at least according to our single test.

Just for fun, I ran both implementations against this post – not surprisingly, they’re pretty close:

 Homegrown scsh version GNU version 1957 1988

Summary

To sum up, we’ve implemented a relatively sane word count program and introduced a few interesting areas of Scheme and scsh:

• `rx’ regular expression notation
• `awk’ loop macro
• `fold’-style functional iteration
• Scheme 48/scsh records

You can read more about these and other features on the scsh website. Below you’ll find the full version of the program described in this post.

Finally, thanks for reading, and happy scsh hacking!

Full Program Listing

```#!/usr/local/bin/scsh \
-e main -s

Display the number of words read from standard input
to standard output.
!#

(define wc-rx (rx (:                   ; start rx matching sequence
(* whitespace)      ; 0 or more spaces to start the line
(+ alphanumeric)    ; match 1 or more chars in [0-9A-Za-z]
(?                  ; then, optionally match the following:
(or #\' #\`)       ; - apostrophe or backtick
(* alphanumeric))  ; - 0 or more [0-9a-zA-Z]
(* whitespace))))   ; - 0 or more spaces to end the line

(define (main prog+args)
(display
(#t (+ words
(length
(regexp-fold-right wc-rx (lambda (m i lis)
(cons (match:substring m 0) lis))
'() line))))))
(newline))
```

Footnotes:

1 For more information about records, as well as the other features introduced here, consult the excellent manual at http://www.scsh.net/docu/man.html.

(Image courtesy Melisande under Creative Commons license.)

How I Built scsh 0.7 from Github (for newbies)

The following is a short message I sent to the scsh-users mailing list. It’s archived at Gmane, but I thought I’d add it here (with minor edits) as well for the sake of redundancy since it may prove useful to someone else.

Hey fellow scshers,

Warning: Newbie info follows, real hackers need not continue reading.

I’ve managed to build scsh 0.7 on my 32-bit Fedora box, but required slight tweaking of the Makefile so that scsh could see my Scheme48 1.9 development libraries. I had to tweak the Makefile vars like so:

CC = gcc -I/usr/local/include
LIBS = -lutil -L/usr/local/lib/scheme48-1.9T
SCHEME48 = /usr/local/bin/s48

Note that I already had Scheme 1.8 installed as `scheme48`, so I needed to call the 1.9 executable `s48`. Also note that installing the development version of Scheme48 1.9 (as required by scsh 0.7) will require you to have an already installed Scheme 48 of 1.6 or above.

I probably could have done this with a few `configure` flags like so (I think) 1:

./configure –libdir=/usr/local/lib/scheme48-1.9T
–includedir=/usr/local/include

Oh well. Posting this here if there are any mere lusers like myself hanging around.

It occurs to me that it might be nice to write up a step-by-step install guide that covers a number of systems (32- and 64-bit). I’ve got access to a 64-bit MacBook Pro; I think I’ll try that next and report back.

Best, Rich

Footnotes:

1 I tried these `configure` flags and they don’t work. Use the directions for modifying the Makefile instead.