Just for fun, i’ve implemented a command line “word count” program in
scsh. Along the way i’ve learned about a few really neat features of
scsh, which I’ll discuss more below.
What is a word count program? In my perfect world, a word counter does only one thing: count the number of words it sees on standard input, and print that number to standard output. Let me give some examples of what should qualify as “words” for our purposes:
schleswig-holstein (one word!)
If I were specifying a word count program in natural language, I might think of this series of steps:
- Read standard input (fd 0) into some kind of input buffer.
- Iterate over that buffer, checking for breaks (a.k.a. whitespace) between words.
A string like the following might prove tricky enough (feel free to replace the reference to “Satan” with your preferred $EVIL_DEITY):
Every good dog goes to heaven – and if not? – well, I hear
Satan has de-
licious bones to chew!
By inspection, the sentence above has eighteen words. (Note that the implementation of
wc in scsh described here (eventually) says 19, and GNU’s
wc says there are 21–both are incorrect, but more on that below.) There are several edge cases to note in this sentence:
- A space at the beginning of the sentence.
- A clause inside em-dashes.
- A hyphenated word occurring at the end of a line.
Some things we should probably do, given the above:
- Ignore spaces at the beginnings of lines.
- Ignore punctuation in general, such as periods, exclamation points…
- ..except hyphens. Hyphens join two or more words into one. This should work across newlines; that would take care of the “hyphenated word at line’s end” case.
Of the above 3 items, the last point sounds the trickiest. I think we need to take care of the hyphen case eventually, but for now let’s punt on it and worry about getting something basic working. Starting at a high level, here’s my take on the program flow:
- Loop over the lines of standard input.
- For each line, split the line into words using an appropriately “clever” regular expression.
- Keep a running total of how many words we’ve seen.
Sounds simple enough, right? Well, after some hacking around, here’s what I came up with. Items 1 and 3 from the above program flow list are pretty standard parts of any programming language, and scsh has them covered, as we’ll see. That leaves it up to me to implement the above-mentioned “clever” regular expression (which, as you’ll see shortly, could stand to be a little, um, “clever-er”).
In the course of describing this implementation, i’ll introduce several interesting features of scsh,
- The `rx’ regular expression notation
- The `awk’ macro
- The `regexp-fold-right’ procedure
and briefly mention another:
- Scheme 48/scsh “records”.
For more information on these and other features of scsh, I recommend reading the excellent manual.
The Not-so-Clever Regular Expression
Rather than do it here, I’ve written the description of the regular expression using comments in the code itself. One of the strengths of shivers’ `rx’ regular expression package (standard in scsh) is that it allows the programmer to do matching notation using a syntax that, while not Scheme proper, is s-expression based rather than the traditional string-based syntax (Don’t worry: the string syntax is also supported). Because of the s-expressions, we can format these regexps easily in Emacs, as well as add descriptive inline comments.
The Perl aficionados among you will note that this is similar to Perl’s `-x’ regular expression modifier, which allows you to “Extend your pattern’s legibility by permitting whitespace and comments.” Of course, when we use scsh, we get all of that, and the rest of Scheme for free!
In any event, here is the expression that we match against every line of input:
(define wc-rx (rx (: ; start rx matching sequence (* whitespace) ; 0 or more spaces to start the line (+ alphanumeric) ; match 1 or more chars in [0-9A-Za-z] (? ; then, optionally match the following: (or #\' #\`) ; - apostrophe or backtick (* alphanumeric)) ; - 0 or more [0-9a-zA-Z] (* whitespace)))) ; - 0 or more spaces to end the line
Let’s see how it does against our word list from above:
Looks like we’re failing to match correctly against “schleswig-holstein”, a hyphenated word. This is something of a corner case, since relatively few words in English use hyphens. Therefore, I’ll punt on it for now (as I did on the “hyphenated line break” case above), with the caveat that it may need to be implemented in future.
Looping with awk
I tried to avoid learning too much about scsh’s `awk’ macro at first. Unfortunately, I just didn’t take the time to read the manual in this case. My thinking was: Can’t I just write a
for-each style Scheme loop? Well, yes. scsh actually comes with
regexp-for-each procedure, which will happily loop across a string, gobbling up one match after another. The difficulty arose when i tried to update a state variable from inside my `awk’ macro loop. I just couldn’t seem to get it to work. Dramatization (This is not the actual code–I’ve since deleted it):
(let ((count 0)) (awk (read-line) (line) () ... stuff here ... (set! count (+ 1 count))) ... etc. ...)
It turns out that this was extremely dumb, since `awk’ provides a nice state variable abstraction already – here’s the definition of the form from the excellent manual:
(awk NEXT-RECORD RECORD+FIELD-VARS [COUNTER] STATE-VAR-DECLS CLAUSE_1 ...)
Using the block above (and the manual) as a reference, let’s look at the actual code in the section below – you’ll have to substitute the existing code below for the terms `NEXT-RECORD’ and the like:
- Each time through the loop, `awk’ gets its next set of values using the procedure NEXT-RECORD; in this case, it uses a simple
- The variable
line(standing in for RECORD+FIELD-VARS) holds the line of input returned by
- `awk’ provides an optional COUNTER variable, which is omitted since isn’t needed in this case.
- We set the value of the state variable
wordsto 0 (This is `STATE-VAR-DECLS’).
- Finally, we get to `CLAUSE_1′. In the `awk’ macro, there are several types of clauses, which are all evaluated in sequence (this is unlike the way
condworks, for example). In the most basic case, the clause is of the form
(/test/ /body/). As you can see here, test is simply
#t, Scheme’s boolean truth value, so its corresponding body is evaluated every time through the loop. This causes the value of our state variable
wordsto be incremented by the
lengthof the list returned by the `regexp-fold-right’ block. For now we’ll just treat that block as a black box which looks at a string and returns a list of matching substrings.
(awk (read-line) (line) ((words 0)) (#t (+ words (length (regexp-fold-right wc-rx (lambda (m i lis) (cons (match:substring m 0) lis)) '() line)))))
Gathering Matches with
As we’ve just deduced by reading through the `awk’ loop,
regexp-fold-right returns a list of substrings matching our regular expression
wc-rx. We’re going to take the length of that list and add it to our running total. But how does this crazy
regexp-fold-right work, anyway? It took me a while to figure out, as I hadn’t used any of the functional-style
fold procedures before.
First, the code:
(regexp-fold-right wc-rx (lambda (m i lis) (cons (match:substring m 0) lis)) '() line)
My initial impressions: I can see that we’re using our regular expression
wc-rx and doing something with a
cons-es part of the resulting
match record onto an empty list. Somehow this is operating on the
line of input read in by `awk’.
Looking at the code above with the manual open, we see the following:
mis a `match’ record returned by matching
wc-rxagainst this line of input.
iis the index of the end of the match described by
lisis the list of substring matches we’ll be building up by repeatedly
cons-ing the substring in the match record
lis– in this way we build up a list of matching substrings of our regular expression
To make sure we really understand this, let’s step through an example. We’ll create a simple match against 2 lines of input, and diagram the states of variables
line during each step. By watching the ways these variables change as we loop over each line, we’ll get a fuller understanding of the way
regexp-fold-right works. Four iterations should be enough:
|String||“let’s go to the Boston aquarium!”|
|(match:substring m 0)||“”|
|String||“go to the Boston aquarium!”|
|(match:substring m 0)||“let’s “|
|String||“to the Boston aquarium!”|
|lis||‘(“go ” “let’s “)|
|(match:substring m 0)||“go “|
|String||“the Boston aquarium!”|
|lis||‘(“to ” “go ” “let’s “)|
|(match:substring m 0)||“to “|
Does it Work?
It’s been fun writing this program, but how does it stack up against GNU
wc? According to my quick and dirty testing, quite well. Depending upon the type of file and how much weird formatting and markup it contains, our scsh word count will report slightly more or fewer words than GNU’s
wc. Here are a few examples from readily available text files (The first file is Neal Stephenson’s In the Beginning was the Command Line, the second Homer’s Iliad):
|File||GNU wc||wc.scm||Discrepancy (%)|
Those numbers look pretty good to me. And remember that we still have room for improvement, since we don’t handle hyphenated words correctly (easy to implement), nor do we handle hyphenated line breaks (slightly harder). Remember from Basic Design that GNU word count doesn’t handle the hyphenated line break correctly either, and also appears to be incorrectly treating `–’ as a word, at least according to our single test.
Just for fun, I ran both implementations against this post – not surprisingly, they’re pretty close:
|Homegrown scsh version||GNU version|
To sum up, we’ve implemented a relatively sane word count program and introduced a few interesting areas of Scheme and scsh:
- `rx’ regular expression notation
- `awk’ loop macro
- `fold’-style functional iteration
- Scheme 48/scsh records
You can read more about these and other features on the scsh website. Below you’ll find the full version of the program described in this post.
Finally, thanks for reading, and happy scsh hacking!
Full Program Listing
#!/usr/local/bin/scsh \ -e main -s Display the number of words read from standard input to standard output. !# (define wc-rx (rx (: ; start rx matching sequence (* whitespace) ; 0 or more spaces to start the line (+ alphanumeric) ; match 1 or more chars in [0-9A-Za-z] (? ; then, optionally match the following: (or #\' #\`) ; - apostrophe or backtick (* alphanumeric)) ; - 0 or more [0-9a-zA-Z] (* whitespace)))) ; - 0 or more spaces to end the line (define (main prog+args) (display (awk (read-line) (line) ((words 0)) (#t (+ words (length (regexp-fold-right wc-rx (lambda (m i lis) (cons (match:substring m 0) lis)) '() line)))))) (newline))