Tag Archives: scheme-48

Make your own iterators in Scheme using closures



I recently bought a copy of Higher Order Perl (also known as HOP) by Mark Dominus. In chapter 4, he introduces iterators. At a high level, an iterator is a black box with an input funnel, an output funnel, and a big red “NEXT” button. You put something into the funnel at time T; then, at some later time T+n, you press the “NEXT” button and something useful comes out.

Iterators are useful in cases when you want to access a potentially huge list of things one at a time, but you don’t have access to a computer with infinite memory.

In this post we’ll look at a moderately useful iterator example; a directory walker that recurses down a directory tree on your file system. The implementation is given in Scheme (specifically, the scsh dialect); the code and the descriptions thereof are heavily inspired by Dominus’ excellent book. However, this post contains the work of an amateur. Any errors or omissions in the following code and explanations are mine.

Iterators are made with closures

In order to build a useful iterator, we need to have a basic understanding of closures. A closure is just a function packaged up with an environment. The environment allows us to store up state between function calls. As noted above, it’s easiest just to think of it as a little black box that we can pass around in our program.

In Scheme and Common Lisp, you build functions with LAMBDA. Here’s a very simple closure example in Scheme, translated from the first example given in HOP:

(define (upto m n)
  (set! m (- m 1)) ;needed since we don't have post-increment ($i++)
  (lambda ()
    (if (< m n)
        (begin (set! m (+ m 1)) m)

This is a function that returns a function (a closure). To use the closure returned by UPTO, assign it to a variable and call it whenever you want the next value. When the iterator is exhausted, it will return #f.

> (define 1to10 (upto 1 10))
> (1to10)
> (1to10)
> (1to10)
> (1to10)
> (1to10)

A trivial file tree iterator

A more interesting and practical use of this technique is to walk a directory tree on your computer. In the Scheme procedure below, we take a list of directories and then build and return a closure that, when called, will walk the directory tree breadth-first, printing out the names of files as we go.

(define (dir-walk queue)
  (let ((file #f))
    (lambda ()
      (if (not (null? queue))
          (begin (set! file (car queue))
                 (set! queue (cdr queue))
                 (cond ((file-directory? file)
                        (let ((new-files (directory-files file)))
                          (set! queue (append queue (map (lambda (filename)
                                                           (string-append file "/" filename))
                       ((file-regular? file) file)
                       (else #f)))))))

The important part to notice is the inner LAMBDA. This is where we create a closure by packaging up a procedure with an environment. The closure’s environment remembers the contents of the variable QUEUE in memory, so we can ask for the elements of QUEUE at our leisure.

Here it is running on my machine:

> (define dir-iter1 (dir-walk '("/Users/rloveland/Desktop/current/")))
> (dir-iter1)
> (dir-iter1)
> (dir-iter1)
> (dir-iter1)
> (dir-iter1)

Just for fun, here’s a version of the same function in Common Lisp. It’s essentially the same as the scsh version, save that it uses the nice CL pathname machinery instead of appending raw strings, and also makes use of the CL-FAD convenience library.

(defun dir-walk (queue)
  (let ((file nil))
    (lambda ()
      (if (not (null queue))
            (setf file (car queue))
            (setf queue (cdr queue))
            (cond ((cl-fad:directory-pathname-p file)
                   (let ((new-files (cl-fad:list-directory file)))
                     (setf queue (append queue (mapcar #'(lambda (filename)
                                                           (merge-pathnames filename file))
                  ((cl-fad:file-exists-p file) file)
                  (t nil)))))))

Usage is also essentially the same:

CL-USER> (defvar *tree-walker* (dir-walk '(#P"/Users/rloveland/Desktop/current/")))
CL-USER> (type-of *tree-walker*)
CL-USER> (funcall *tree-walker*)
CL-USER> (funcall *tree-walker*)
CL-USER> (funcall *tree-walker*)
CL-USER> (funcall *tree-walker*)
CL-USER> (funcall *tree-walker*)
CL-USER> (funcall *tree-walker*)

A slightly less trivial file tree iterator

Listing out files and directories is nice, but not that useful. We’d like a way to see only those files and directories that have some interesting property.

This can be accomplished by passing in another argument to our DIR-WALK function: this argument will be yet another function that will test the current file to see whether it’s interesting to us. It’s pretty easy to change DIR-WALK to accept a function argument, INTERESTING?. This arbitrary function is used to check the file to see if we care about it.

This time around, when we build our internal queue, we use a call to FILTER to make sure that only interesting files get added.

(define (dir-walk* interesting? queue)
  (let ((file #f))
    (lambda ()
      (if (not (null? queue))
          (begin (set! file (car queue))
                 (set! queue (cdr queue))
                  ((file-directory? file)
                        (let ((new-files (directory-files file)))
                          (set! queue (append queue (filter interesting? (map (lambda (filename)
                                                           (string-append file "/" filename))
                          (if (interesting? file) file)))
                  ((interesting? file) file)
                  (else #f)))

And here it is in use; in this example, we pass an INTERESTING? function that asks for only files that are not marked as executable:

> (define dir-iter2 (dir-walk* (lambda (f) (file-not-executable? f)) '("/home/rml/Desktop/current")))
> (dir-iter2)
> (dir-iter2)
> (dir-iter2)
> (dir-iter2)
> (dir-iter2)
> (dir-iter2)
> (dir-iter2)


There are still bugs and unexpected behavior in DIR-WALK*. For example, on the first call to the resulting iterator there is no result due to the one-armed IF. There are also strange things that happen if we want to filter out directories from our results, since we mix together the FILE-DIRECTORY? check and the INTERESTING? check inside the iterator. However, despite these small nits, it can still do useful work on my machine 1, and it’s a good enough example of using closures to build iterators.

Here’s hoping you’ll enjoy playing with closures and iterators in your own code!

(Image courtesy zeitfanger.at under a Creative Commons License.)


1 In the grandest internet tradition, “It works on my machine” ™.

Word Count, scsh-style



Just for fun, i’ve implemented a command line “word count” program in scsh. Along the way i’ve learned about a few really neat features of scsh, which I’ll discuss more below.

What is a word count program? In my perfect world, a word counter does only one thing: count the number of words it sees on standard input, and print that number to standard output. Let me give some examples of what should qualify as “words” for our purposes:

schleswig-holstein (one word!)

Basic Design

If I were specifying a word count program in natural language, I might think of this series of steps:

  1. Read standard input (fd 0) into some kind of input buffer.
  2. Iterate over that buffer, checking for breaks (a.k.a. whitespace) between words.

A string like the following might prove tricky enough (feel free to replace the reference to “Satan” with your preferred $EVIL_DEITY):

Every good dog goes to heaven – and if not? – well, I hear
Satan has de-
licious bones to chew!

By inspection, the sentence above has eighteen words. (Note that the implementation of wc in scsh described here (eventually) says 19, and GNU’s wc says there are 21–both are incorrect, but more on that below.) There are several edge cases to note in this sentence:

  1. A space at the beginning of the sentence.
  2. A clause inside em-dashes.
  3. A hyphenated word occurring at the end of a line.

Some things we should probably do, given the above:

  1. Ignore spaces at the beginnings of lines.
  2. Ignore punctuation in general, such as periods, exclamation points…
  3. ..except hyphens. Hyphens join two or more words into one. This should work across newlines; that would take care of the “hyphenated word at line’s end” case.

Of the above 3 items, the last point sounds the trickiest. I think we need to take care of the hyphen case eventually, but for now let’s punt on it and worry about getting something basic working. Starting at a high level, here’s my take on the program flow:

  1. Loop over the lines of standard input.
  2. For each line, split the line into words using an appropriately “clever” regular expression.
  3. Keep a running total of how many words we’ve seen.

Implementation Notes

Sounds simple enough, right? Well, after some hacking around, here’s what I came up with. Items 1 and 3 from the above program flow list are pretty standard parts of any programming language, and scsh has them covered, as we’ll see. That leaves it up to me to implement the above-mentioned “clever” regular expression (which, as you’ll see shortly, could stand to be a little, um, “clever-er”).

In the course of describing this implementation, i’ll introduce several interesting features of scsh,

  • The `rx’ regular expression notation
  • The `awk’ macro
  • The `regexp-fold-right’ procedure

and briefly mention another:

  • Scheme 48/scsh “records”.

For more information on these and other features of scsh, I recommend reading the excellent manual.

The Not-so-Clever Regular Expression

Rather than do it here, I’ve written the description of the regular expression using comments in the code itself. One of the strengths of shivers’ `rx’ regular expression package (standard in scsh) is that it allows the programmer to do matching notation using a syntax that, while not Scheme proper, is s-expression based rather than the traditional string-based syntax (Don’t worry: the string syntax is also supported). Because of the s-expressions, we can format these regexps easily in Emacs, as well as add descriptive inline comments.

The Perl aficionados among you will note that this is similar to Perl’s `-x’ regular expression modifier, which allows you to “Extend your pattern’s legibility by permitting whitespace and comments.” Of course, when we use scsh, we get all of that, and the rest of Scheme for free!

In any event, here is the expression that we match against every line of input:

(define wc-rx (rx (:                   ; start rx matching sequence
                   (* whitespace)      ; 0 or more spaces to start the line
                   (+ alphanumeric)    ; match 1 or more chars in [0-9A-Za-z]
                   (?                  ; then, optionally match the following:
                    (or #\' #\`)       ; - apostrophe or backtick
                    (* alphanumeric))  ; - 0 or more [0-9a-zA-Z]
                   (* whitespace))))   ; - 0 or more spaces to end the line

Let’s see how it does against our word list from above:

Text Count
below 1
that’s 1
schleswig-holstein 2
autonomy 1
friday’s 1
putrescence 1

Looks like we’re failing to match correctly against “schleswig-holstein”, a hyphenated word. This is something of a corner case, since relatively few words in English use hyphens. Therefore, I’ll punt on it for now (as I did on the “hyphenated line break” case above), with the caveat that it may need to be implemented in future.

Looping with awk

I tried to avoid learning too much about scsh’s `awk’ macro at first. Unfortunately, I just didn’t take the time to read the manual in this case. My thinking was: Can’t I just write a for-each style Scheme loop? Well, yes. scsh actually comes with regexp-for-each procedure, which will happily loop across a string, gobbling up one match after another. The difficulty arose when i tried to update a state variable from inside my `awk’ macro loop. I just couldn’t seem to get it to work. Dramatization (This is not the actual code–I’ve since deleted it):

(let ((count 0))
  (awk (read-line) (line) ()
    ... stuff here ...
    (set! count (+ 1 count)))
  ... etc. ...)

It turns out that this was extremely dumb, since `awk’ provides a nice state variable abstraction already – here’s the definition of the form from the excellent manual:

     CLAUSE_1 ...)

Using the block above (and the manual) as a reference, let’s look at the actual code in the section below – you’ll have to substitute the existing code below for the terms `NEXT-RECORD’ and the like:

  1. Each time through the loop, `awk’ gets its next set of values using the procedure NEXT-RECORD; in this case, it uses a simple read-line.
  2. The variable line (standing in for RECORD+FIELD-VARS) holds the line of input returned by read-line.
  3. `awk’ provides an optional COUNTER variable, which is omitted since isn’t needed in this case.
  4. We set the value of the state variable words to 0 (This is `STATE-VAR-DECLS’).
  5. Finally, we get to `CLAUSE_1′. In the `awk’ macro, there are several types of clauses, which are all evaluated in sequence (this is unlike the way cond works, for example). In the most basic case, the clause is of the form (/test/ /body/). As you can see here, test is simply #t, Scheme’s boolean truth value, so its corresponding body is evaluated every time through the loop. This causes the value of our state variable words to be incremented by the length of the list returned by the `regexp-fold-right’ block. For now we’ll just treat that block as a black box which looks at a string and returns a list of matching substrings.
(awk (read-line) (line) ((words 0))
  (#t (+ words
          (regexp-fold-right wc-rx (lambda (m i lis)
                                     (cons (match:substring m 0) lis))
                             '() line)))))

Gathering Matches with regexp-fold-right

As we’ve just deduced by reading through the `awk’ loop, regexp-fold-right returns a list of substrings matching our regular expression wc-rx. We’re going to take the length of that list and add it to our running total. But how does this crazy regexp-fold-right work, anyway? It took me a while to figure out, as I hadn’t used any of the functional-style fold procedures before.

First, the code:

(regexp-fold-right wc-rx (lambda (m i lis)
                           (cons (match:substring m 0) lis))
                   '() line)

My initial impressions: I can see that we’re using our regular expression wc-rx and doing something with a lambda-expression that cons-es part of the resulting match record onto an empty list. Somehow this is operating on the line of input read in by `awk’.

Looking at the code above with the manual open, we see the following:

  • Again, m is a `match’ record returned by matching wc-rx against this line of input.
  • i is the index of the end of the match described by m
  • lis is the list of substring matches we’ll be building up by repeatedly cons-ing the substring in the match record m onto lis – in this way we build up a list of matching substrings of our regular expression

To make sure we really understand this, let’s step through an example. We’ll create a simple match against 2 lines of input, and diagram the states of variables m, i, lis, and line during each step. By watching the ways these variables change as we loop over each line, we’ll get a fuller understanding of the way regexp-fold-right works. Four iterations should be enough:

String “let’s go to the Boston aquarium!”
i 0
lis ‘()
(match:substring m 0) “”
String “go to the Boston aquarium!”
i 6
lis ‘(“let’s “)
(match:substring m 0) “let’s “
String “to the Boston aquarium!”
i 9
lis ‘(“go ” “let’s “)
(match:substring m 0) “go “
String “the Boston aquarium!”
i 12
lis ‘(“to ” “go ” “let’s “)
(match:substring m 0) “to “

Does it Work?

It’s been fun writing this program, but how does it stack up against GNU wc? According to my quick and dirty testing, quite well. Depending upon the type of file and how much weird formatting and markup it contains, our scsh word count will report slightly more or fewer words than GNU’s wc. Here are a few examples from readily available text files (The first file is Neal Stephenson’s In the Beginning was the Command Line, the second Homer’s Iliad):

File GNU wc wc.scm Discrepancy (%)
command-line.txt 36331 37262 0.025
iliad.txt 192656 194122 0.008

Those numbers look pretty good to me. And remember that we still have room for improvement, since we don’t handle hyphenated words correctly (easy to implement), nor do we handle hyphenated line breaks (slightly harder). Remember from Basic Design that GNU word count doesn’t handle the hyphenated line break correctly either, and also appears to be incorrectly treating `–’ as a word, at least according to our single test.

Just for fun, I ran both implementations against this post – not surprisingly, they’re pretty close:

Homegrown scsh version GNU version
1957 1988


To sum up, we’ve implemented a relatively sane word count program and introduced a few interesting areas of Scheme and scsh:

  • `rx’ regular expression notation
  • `awk’ loop macro
  • `fold’-style functional iteration
  • Scheme 48/scsh records

You can read more about these and other features on the scsh website. Below you’ll find the full version of the program described in this post.

Finally, thanks for reading, and happy scsh hacking!

Full Program Listing

#!/usr/local/bin/scsh \
-e main -s

Display the number of words read from standard input
to standard output.

(define wc-rx (rx (:                   ; start rx matching sequence
                   (* whitespace)      ; 0 or more spaces to start the line
                   (+ alphanumeric)    ; match 1 or more chars in [0-9A-Za-z]
                   (?                  ; then, optionally match the following:
                    (or #\' #\`)       ; - apostrophe or backtick
                    (* alphanumeric))  ; - 0 or more [0-9a-zA-Z]
                   (* whitespace))))   ; - 0 or more spaces to end the line

(define (main prog+args)
   (awk (read-line) (line) ((words 0))
     (#t (+ words
             (regexp-fold-right wc-rx (lambda (m i lis)
                                        (cons (match:substring m 0) lis))
                                '() line))))))


1 For more information about records, as well as the other features introduced here, consult the excellent manual at http://www.scsh.net/docu/man.html.

(Image courtesy Melisande under Creative Commons license.)

How I Built scsh 0.7 from Github (for newbies)

The following is a short message I sent to the scsh-users mailing list. It’s archived at Gmane, but I thought I’d add it here (with minor edits) as well for the sake of redundancy since it may prove useful to someone else.

Hey fellow scshers,

Warning: Newbie info follows, real hackers need not continue reading.

I’ve managed to build scsh 0.7 on my 32-bit Fedora box, but required slight tweaking of the Makefile so that scsh could see my Scheme48 1.9 development libraries. I had to tweak the Makefile vars like so:

CC = gcc -I/usr/local/include
LIBS = -lutil -L/usr/local/lib/scheme48-1.9T
SCHEME48 = /usr/local/bin/s48

Note that I already had Scheme 1.8 installed as scheme48, so I needed to call the 1.9 executable s48. Also note that installing the development version of Scheme48 1.9 (as required by scsh 0.7) will require you to have an already installed Scheme 48 of 1.6 or above.

I probably could have done this with a few configure flags like so (I think) 1:

./configure –libdir=/usr/local/lib/scheme48-1.9T

Oh well. Posting this here if there are any mere lusers like myself hanging around.

It occurs to me that it might be nice to write up a step-by-step install guide that covers a number of systems (32- and 64-bit). I’ve got access to a 64-bit MacBook Pro; I think I’ll try that next and report back.

Best, Rich


1 I tried these configure flags and they don’t work. Use the directions for modifying the Makefile instead.