Word Count, scsh-style

../img/trastevere-quilt-opus-lvi.jpg

Introduction

Just for fun, i’ve implemented a command line “word count” program in scsh. Along the way i’ve learned about a few really neat features of scsh, which I’ll discuss more below.

What is a word count program? In my perfect world, a word counter does only one thing: count the number of words it sees on standard input, and print that number to standard output. Let me give some examples of what should qualify as “words” for our purposes:

below
that’s
schleswig-holstein (one word!)
autonomy
friday’s
putrescence

Basic Design

If I were specifying a word count program in natural language, I might think of this series of steps:

  1. Read standard input (fd 0) into some kind of input buffer.
  2. Iterate over that buffer, checking for breaks (a.k.a. whitespace) between words.

A string like the following might prove tricky enough (feel free to replace the reference to “Satan” with your preferred $EVIL_DEITY):

Every good dog goes to heaven – and if not? – well, I hear
Satan has de-
licious bones to chew!

By inspection, the sentence above has eighteen words. (Note that the implementation of wc in scsh described here (eventually) says 19, and GNU’s wc says there are 21–both are incorrect, but more on that below.) There are several edge cases to note in this sentence:

  1. A space at the beginning of the sentence.
  2. A clause inside em-dashes.
  3. A hyphenated word occurring at the end of a line.

Some things we should probably do, given the above:

  1. Ignore spaces at the beginnings of lines.
  2. Ignore punctuation in general, such as periods, exclamation points…
  3. ..except hyphens. Hyphens join two or more words into one. This should work across newlines; that would take care of the “hyphenated word at line’s end” case.

Of the above 3 items, the last point sounds the trickiest. I think we need to take care of the hyphen case eventually, but for now let’s punt on it and worry about getting something basic working. Starting at a high level, here’s my take on the program flow:

  1. Loop over the lines of standard input.
  2. For each line, split the line into words using an appropriately “clever” regular expression.
  3. Keep a running total of how many words we’ve seen.

Implementation Notes

Sounds simple enough, right? Well, after some hacking around, here’s what I came up with. Items 1 and 3 from the above program flow list are pretty standard parts of any programming language, and scsh has them covered, as we’ll see. That leaves it up to me to implement the above-mentioned “clever” regular expression (which, as you’ll see shortly, could stand to be a little, um, “clever-er”).

In the course of describing this implementation, i’ll introduce several interesting features of scsh,

  • The `rx’ regular expression notation
  • The `awk’ macro
  • The `regexp-fold-right’ procedure

and briefly mention another:

  • Scheme 48/scsh “records”.

For more information on these and other features of scsh, I recommend reading the excellent manual.

The Not-so-Clever Regular Expression

Rather than do it here, I’ve written the description of the regular expression using comments in the code itself. One of the strengths of shivers’ `rx’ regular expression package (standard in scsh) is that it allows the programmer to do matching notation using a syntax that, while not Scheme proper, is s-expression based rather than the traditional string-based syntax (Don’t worry: the string syntax is also supported). Because of the s-expressions, we can format these regexps easily in Emacs, as well as add descriptive inline comments.

The Perl aficionados among you will note that this is similar to Perl’s `-x’ regular expression modifier, which allows you to “Extend your pattern’s legibility by permitting whitespace and comments.” Of course, when we use scsh, we get all of that, and the rest of Scheme for free!

In any event, here is the expression that we match against every line of input:

(define wc-rx (rx (:                   ; start rx matching sequence
                   (* whitespace)      ; 0 or more spaces to start the line
                   (+ alphanumeric)    ; match 1 or more chars in [0-9A-Za-z]
                   (?                  ; then, optionally match the following:
                    (or #\' #\`)       ; - apostrophe or backtick
                    (* alphanumeric))  ; - 0 or more [0-9a-zA-Z]
                   (* whitespace))))   ; - 0 or more spaces to end the line

Let’s see how it does against our word list from above:

Text Count
below 1
that’s 1
schleswig-holstein 2
autonomy 1
friday’s 1
putrescence 1

Looks like we’re failing to match correctly against “schleswig-holstein”, a hyphenated word. This is something of a corner case, since relatively few words in English use hyphens. Therefore, I’ll punt on it for now (as I did on the “hyphenated line break” case above), with the caveat that it may need to be implemented in future.

Looping with awk

I tried to avoid learning too much about scsh’s `awk’ macro at first. Unfortunately, I just didn’t take the time to read the manual in this case. My thinking was: Can’t I just write a for-each style Scheme loop? Well, yes. scsh actually comes with regexp-for-each procedure, which will happily loop across a string, gobbling up one match after another. The difficulty arose when i tried to update a state variable from inside my `awk’ macro loop. I just couldn’t seem to get it to work. Dramatization (This is not the actual code–I’ve since deleted it):

(let ((count 0))
  (awk (read-line) (line) ()
    ... stuff here ...
    (set! count (+ 1 count)))
  ... etc. ...)

It turns out that this was extremely dumb, since `awk’ provides a nice state variable abstraction already – here’s the definition of the form from the excellent manual:

(awk NEXT-RECORD RECORD+FIELD-VARS
     [COUNTER] STATE-VAR-DECLS
     CLAUSE_1 ...)

Using the block above (and the manual) as a reference, let’s look at the actual code in the section below – you’ll have to substitute the existing code below for the terms `NEXT-RECORD’ and the like:

  1. Each time through the loop, `awk’ gets its next set of values using the procedure NEXT-RECORD; in this case, it uses a simple read-line.
  2. The variable line (standing in for RECORD+FIELD-VARS) holds the line of input returned by read-line.
  3. `awk’ provides an optional COUNTER variable, which is omitted since isn’t needed in this case.
  4. We set the value of the state variable words to 0 (This is `STATE-VAR-DECLS’).
  5. Finally, we get to `CLAUSE_1′. In the `awk’ macro, there are several types of clauses, which are all evaluated in sequence (this is unlike the way cond works, for example). In the most basic case, the clause is of the form (/test/ /body/). As you can see here, test is simply #t, Scheme’s boolean truth value, so its corresponding body is evaluated every time through the loop. This causes the value of our state variable words to be incremented by the length of the list returned by the `regexp-fold-right’ block. For now we’ll just treat that block as a black box which looks at a string and returns a list of matching substrings.
(awk (read-line) (line) ((words 0))
  (#t (+ words
         (length
          (regexp-fold-right wc-rx (lambda (m i lis)
                                     (cons (match:substring m 0) lis))
                             '() line)))))

Gathering Matches with regexp-fold-right

As we’ve just deduced by reading through the `awk’ loop, regexp-fold-right returns a list of substrings matching our regular expression wc-rx. We’re going to take the length of that list and add it to our running total. But how does this crazy regexp-fold-right work, anyway? It took me a while to figure out, as I hadn’t used any of the functional-style fold procedures before.

First, the code:

(regexp-fold-right wc-rx (lambda (m i lis)
                           (cons (match:substring m 0) lis))
                   '() line)

My initial impressions: I can see that we’re using our regular expression wc-rx and doing something with a lambda-expression that cons-es part of the resulting match record onto an empty list. Somehow this is operating on the line of input read in by `awk’.

Looking at the code above with the manual open, we see the following:

  • Again, m is a `match’ record returned by matching wc-rx against this line of input.
  • i is the index of the end of the match described by m
  • lis is the list of substring matches we’ll be building up by repeatedly cons-ing the substring in the match record m onto lis – in this way we build up a list of matching substrings of our regular expression

To make sure we really understand this, let’s step through an example. We’ll create a simple match against 2 lines of input, and diagram the states of variables m, i, lis, and line during each step. By watching the ways these variables change as we loop over each line, we’ll get a fuller understanding of the way regexp-fold-right works. Four iterations should be enough:

STEP 1
String “let’s go to the Boston aquarium!”
i 0
lis ‘()
(match:substring m 0) “”
STEP 2
String “go to the Boston aquarium!”
i 6
lis ‘(“let’s “)
(match:substring m 0) “let’s “
STEP 3
String “to the Boston aquarium!”
i 9
lis ‘(“go ” “let’s “)
(match:substring m 0) “go “
STEP 4
String “the Boston aquarium!”
i 12
lis ‘(“to ” “go ” “let’s “)
(match:substring m 0) “to “

Does it Work?

It’s been fun writing this program, but how does it stack up against GNU wc? According to my quick and dirty testing, quite well. Depending upon the type of file and how much weird formatting and markup it contains, our scsh word count will report slightly more or fewer words than GNU’s wc. Here are a few examples from readily available text files (The first file is Neal Stephenson’s In the Beginning was the Command Line, the second Homer’s Iliad):

File GNU wc wc.scm Discrepancy (%)
command-line.txt 36331 37262 0.025
iliad.txt 192656 194122 0.008

Those numbers look pretty good to me. And remember that we still have room for improvement, since we don’t handle hyphenated words correctly (easy to implement), nor do we handle hyphenated line breaks (slightly harder). Remember from Basic Design that GNU word count doesn’t handle the hyphenated line break correctly either, and also appears to be incorrectly treating `–’ as a word, at least according to our single test.

Just for fun, I ran both implementations against this post – not surprisingly, they’re pretty close:

Homegrown scsh version GNU version
1957 1988

Summary

To sum up, we’ve implemented a relatively sane word count program and introduced a few interesting areas of Scheme and scsh:

  • `rx’ regular expression notation
  • `awk’ loop macro
  • `fold’-style functional iteration
  • Scheme 48/scsh records

You can read more about these and other features on the scsh website. Below you’ll find the full version of the program described in this post.

Finally, thanks for reading, and happy scsh hacking!

Full Program Listing

#!/usr/local/bin/scsh \
-e main -s

Display the number of words read from standard input
to standard output.
!#

(define wc-rx (rx (:                   ; start rx matching sequence
                   (* whitespace)      ; 0 or more spaces to start the line
                   (+ alphanumeric)    ; match 1 or more chars in [0-9A-Za-z]
                   (?                  ; then, optionally match the following:
                    (or #\' #\`)       ; - apostrophe or backtick
                    (* alphanumeric))  ; - 0 or more [0-9a-zA-Z]
                   (* whitespace))))   ; - 0 or more spaces to end the line

(define (main prog+args)
  (display
   (awk (read-line) (line) ((words 0))
     (#t (+ words
            (length
             (regexp-fold-right wc-rx (lambda (m i lis)
                                        (cons (match:substring m 0) lis))
                                '() line))))))
  (newline))

Footnotes:

1 For more information about records, as well as the other features introduced here, consult the excellent manual at http://www.scsh.net/docu/man.html.

(Image courtesy Melisande under Creative Commons license.)

About these ads

2 thoughts on “Word Count, scsh-style

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s