Scripting Language Idioms: The “Seen” Hash

The “seen” hash technique is an idiom that lets you use a hash (or dictionary if you prefer) as a set data type. It’s good for generating a de-duplicated list of things, where each thing appears only once. If your language of choice has a real set data type, you may want to use that instead.

To illustrate I’ll offer a real-world use case.

The other day at work I needed to grab a bunch of information about git commits from a batch of automated emails. For reasons that don’t matter right now, our team (Docs) gets automated emails about git commits on our API (It’s not really what we asked for, but it’s what we could get somebody to build).

As a result, we get a bunch of emails formatted like this (personal details changed to protect the guilty):

--------- Project: Foo Details: something something garbage noise etc.

J. Random Luser ABC-123 did something to some other thing
Alyssa P. Hacker ABC-124 fixed mistake in ABC-123
Ben Bitdiddle ABC-125 yet another thing was done
J. Random Luser oh yeah this thing too
Ben Bitdiddle Merge ABC-123 into Foo/master

Luckily I don’t need to read all of these darn things. I have a filter set up on my mail client that saves them all in my ‘Archive’ folder, where I can safely ignore them.

When we’re getting ready to do our API release notes, I go into my ‘Archive’ folder and search for all of the emails with the subject “Project: Foo” that arrived between our last set of release notes and today. I end up with (say) about 100 files formatted like the above.

The format is: Name, JIRA ticket ID, description. Except that sometimes there is no JIRA ticket ID. And sometimes there are duplicate ticket IDs, since the emails contain messages about merge commits.

As a tech writer, I don’t need to look at the contents of every commit. I need to generate a (de-duplicated) list of JIRA ticket IDs, so I can go and review those tickets to see if there is user-facing docs work that needs to happen for those commits. (Sometimes I still need to look at the commits anyway because a ticket has a description like “change the frobnitz”, but hey.)

So I save all of these email files into a directory, and I write some code that loops over each file, generating a set of JIRA ticket IDs, which I then print out. Here’s the code what done it (it’s written in Perl but could as easily be Ruby or Python or whatevs):

#!perl

use strict;
use warnings;
use feature     qw/ say   /;
use File::Slurp qw/ slurp /;

my @files = glob('*.eml');
my $jira_pat = '([A-Z]+-[0-9]+)';
my %seen;

for my $f (@files) {
  my @lines = slurp($f);

  for my $line (@lines) {
    next unless $line =~ /$jira_pat/; # Skip unless it has a JIRA ticket ID
    my $id = $1;                      # If it did match, save the capture
    $seen{$id}++;                     # Stick the ID in the hash (as a key)
  }
}

say for sort keys %seen;        # Print out all the keys (which are de-duped)

The reason this trick works is that a hash table can’t have duplicate keys. Therefore the ‘$seen{$id}++’ bit means: “Stick the ID in the hash, and increment its value”. Based on the example email above, you end up with a hash table that looks like this:

{
  ABC-123 => 2,
  ABC-124 => 1,
  ABC-125 => 1,
}

Then we print the keys using the line say for sort keys %seen, which just means “print the hash keys in sorted order”.
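
Since I claimed this could as easily be Python, here is a minimal sketch of the same idiom in that language. The function name unique_ticket_ids is made up for illustration, and the sample lines are just the example email from above; treat it as a sketch rather than the script I actually ran.

```python
import re

# Same pattern as the Perl version: capital letters, a hyphen, then digits.
JIRA_PAT = re.compile(r"[A-Z]+-[0-9]+")

def unique_ticket_ids(lines):
    """Map each ticket ID to the number of lines where it was the first match."""
    seen = {}
    for line in lines:
        m = JIRA_PAT.search(line)
        if not m:
            continue                            # no ticket ID on this line
        ticket = m.group(0)
        seen[ticket] = seen.get(ticket, 0) + 1  # a dict can't hold duplicate keys
    return seen

sample = [
    "J. Random Luser ABC-123 did something to some other thing",
    "Alyssa P. Hacker ABC-124 fixed mistake in ABC-123",
    "Ben Bitdiddle ABC-125 yet another thing was done",
    "J. Random Luser oh yeah this thing too",
    "Ben Bitdiddle Merge ABC-123 into Foo/master",
]

seen = unique_ticket_ids(sample)
for ticket in sorted(seen):
    print(ticket)
```

If you only care about the IDs and not the counts, Python's built-in set type does the same job with less ceremony.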

Perl’s Autovivification FTW

Interestingly, part of the reason this idiom is cleaner in Perl than in, say, Ruby, is that Perl does something called “autovivification” of hash keys. Basically, it means that stuff gets created as soon as you mention it. That’s why you can call the ‘$seen{$id}++’ all in one line. (If you want more information about autovivification, there’s a good article on the Wikipedia.)

By contrast, in Ruby you have to first explicitly create the key’s value, and then increment it. As you can see below, if you try to bump the value of a key that doesn’t exist yet, you get an error (unless you use the technique from the Wikipedia article).

irb(main):015:0> RUBY_VERSION
=> "2.2.4"
irb(main):010:0> tix
=> {"ABC-123"=>2, "ABC-124"=>1}
irb(main):011:0> tix['ABC-125'] += 1
NoMethodError: undefined method `+' for nil:NilClass
    from (irb):11
    from c:/Ruby22/bin/irb:11:in `<main>'
irb(main):012:0> tix['ABC-125'] = 1
=> 1
irb(main):013:0> tix
=> {"ABC-123"=>2, "ABC-124"=>1, "ABC-125"=>1}
irb(main):014:0> tix['ABC-125'] += 1
=> 2
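
For comparison, Python behaves like Ruby here: a plain dict raises an error when you bump a missing key. The sketch below reuses the tix name from the irb session purely for illustration, and shows collections.defaultdict as one way to get something close to Perl's behavior.

```python
from collections import defaultdict

tix = {"ABC-123": 2, "ABC-124": 1}

try:
    tix["ABC-125"] += 1          # the key does not exist yet
except KeyError as err:
    print("KeyError:", err)

# defaultdict(int) creates missing keys on first use, starting at int(), i.e. 0.
auto = defaultdict(int, tix)
auto["ABC-125"] += 1
print(dict(auto))
```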


A Mini Python and Shell Tutorial

wooly-mammoth-cp

The following is an email I sent to a couple of coworkers whom I’d been teaching a short Python course for technical writers, using Automate the Boring Stuff with Python. The email was meant to show them a real-life example of how a technical writer can use Python and shell scripting to automate something that is, well, boring. In this case, the task was to clean up a CSV file containing a list of git commits to the AppNexus REST APIs.

Because of the way we received this data, it had duplicate entries, and lots of non-interesting merge commits that were unrelated to a feature (a feature is generally associated with a JIRA ticket). Our task was to review the commits and see if there was anything interesting that should be added to our monthly API release notes.

(The names of my coworkers have been changed, obv.)


To: Jane X. (‘REDACTED@appnexus.com’)

Subject: Filtered API git commits to review (bonus: mini Python & shell tutorial)

From: Rich Loveland (‘REDACTED@appnexus.com’)

CC: Victoria Y. (‘REDACTED@appnexus.com’)

Date: Wed, 18 Nov 2015 16:59:04 -0500

+Victoria for the code fun

Jane, the file of commit logs for you to review is attached (along with some others). But so what, that’s boring! Let’s talk about how it was made.

To make the really boring task of reviewing API git commits less awful, let’s do some programming for fun. First let’s write a short Python script to pull out only those commits that have a JIRA ticket ID in them (since we don’t care about the other ones), and call it ‘filter-commit-messages.py’:

  #!/usr/bin/env python

  import re
  import sys

  jira_pat = "[A-Z]+-[0-9]+"

  for line in sys.stdin.readlines():
      m = re.search(jira_pat, line)
      if m:
          print(line)

This tries to match a regular expression against each line of its input (in this case the compiled API git commit list), and prints the line if the match occurs.

Let’s make it executable from our shell:

$ cd ~/bin
$ ln -s ~/work/code/filter-commit-messages.py filter-commit-messages
$ chmod +x ~/bin/filter-commit-messages
$ export PATH=$HOME/bin:$PATH

Then we can run it on the text file with the git commits like so:

$ filter-commit-messages < api-release-november-2015.csv

(The “<” in the shell means “Read your input from this place”.)

This prints out only the matching lines, but there are a lot of annoying extra lines in the output. We can get rid of those lines while sorting them like so:

$ filter-commit-messages < api-release-november-2015.csv | sort 

(The “|” in the shell means “Pass your output through to this other command”.)

Now that we are extracting only the important lines, let’s throw them in a file:

$ filter-commit-messages < api-release-november-2015.csv | sort > api-release-november-2015-actual.csv

(The “>” near the end means “Write all of the output to this place”.)

We can see how much less reading we have to do now by running a word count program (‘wc’) on the before and after files:

$ wc -l api-release-november-2015.csv # old
     201 api-release-november-2015.csv
$ wc -l api-release-november-2015-actual.csv # new
     115 api-release-november-2015-actual.csv

(The “-l” means “count the lines”.)

Now, since Jane and I each have to review half of the commits, we can use the ‘split’ shell command to break the file in half. Since we know the file is 115 lines, we need to tell ‘split’ how many lines to put in each half with the ‘-l’ option (see ‘man split’ in your terminal):

$ split -l 58 api-release-november-2015-actual.csv COMMITS-TO-REVIEW

‘split’ takes the last argument, “COMMITS-TO-REVIEW”, as a filename prefix and creates two files based on that, “COMMITS-TO-REVIEWaa” and “COMMITS-TO-REVIEWab”, which we can rename for each reviewer:

$ mv COMMITS-TO-REVIEWaa COMMITS-TO-REVIEW-RICH
$ mv COMMITS-TO-REVIEWab COMMITS-TO-REVIEW-JANE

A nice thing is that because we sorted the lines of the files, each reviewer gets commits by a sorted subset of the engineers, making it easier to see their related commits next to each other.

p.s. We didn’t actually need a Python program for the first part, we could have just used ‘grep’ and stayed with shell commands. But hey!

p.p.s. With more work, this could all be put together into a single program if we were inclined, but since it doesn’t get used that often it’s probably OK to type a few commands.
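
For the curious, here is a rough sketch of what that single program might look like in Python. The function name filter_sort_split and the demo lines are hypothetical, and Python's sort order may differ slightly from the shell's ‘sort’ depending on your locale, so treat this as an illustration rather than a drop-in replacement for the commands above.

```python
import re

JIRA_PAT = re.compile(r"[A-Z]+-[0-9]+")

def filter_sort_split(lines):
    """Keep lines with a JIRA ticket ID, sort them, and split them in half."""
    kept = sorted(line.rstrip("\n") for line in lines if JIRA_PAT.search(line))
    middle = (len(kept) + 1) // 2      # first half gets the extra line, if any
    return kept[:middle], kept[middle:]

# Hypothetical input lines, loosely modeled on a commit list.
demo = [
    "Ben Bitdiddle,ABC-125 yet another thing was done\n",
    "J. Random Luser,Merge branch 'master'\n",
    "Alyssa P. Hacker,ABC-124 fixed mistake\n",
    "J. Random Luser,ABC-123 did something\n",
]

first_half, second_half = filter_sort_split(demo)
print(first_half)
print(second_half)
```

With 115 matching lines, the first half gets 58 lines and the second 57, which is the same division that ‘split -l 58’ produces.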

(Image courtesy William Hartman under Creative Commons license.)

Announcing confluence2html

../img/kyoto-swan.jpg

If you use (or used to use) a Confluence wiki, you may need to deliver content that was written in wiki markup as HTML. Confluence can export an entire space to HTML, but not a single page (or section of a page). To overcome this limitation, I’ve written a script called confluence2html, which can convert a subset of Confluence’s wiki markup to HTML. (You can check it out at my GitHub page. For examples of the supported subset of wiki syntax, see the t/files directory.)

Unfortunately, the new versions of Confluence use a hacky “not-quite-XML” storage format that is terrible for writers and for which there are basically no existing XML tools either. If you are trying to get your content out of a newer version of Confluence and back into wiki markup, check out Graham Hannington’s site (search the page for “wikifier”). His page also has tools to help you wrangle the new format, if you care to (I don’t).

With a bit of editing, you should be able to get the output of Graham’s tool to conform to the subset of Confluence wiki markup supported by this script. It supports some common macros (including tables!), and I’ve found it really useful.

Right now, it’s a command line tool only. You can use it like so:

$ confluence2html < wiki-input.txt > html-output.html

If you don’t know how to run commands from the shell, you can read about it at LinuxCommand.org. If you are on Windows, you can run shell commands using Cygwin. If you’re on a Mac, open Applications > Utilities > Terminal in Finder.

At this point I’ll quit blathering and point you to the GitHub page. Again: for examples of the supported subset of wiki syntax, see the t/files directory. For documentation on the supported wiki syntax, see the README.

Happy writing!

(Image courtesy of caribb under a Creative Commons license.)

Just for Fun: Accumulate

Some years ago, I came across a neat little shell utility that Mark Dominus had written called accumulate. You can read more about it at his blog.

He thought it so trivial that he worried about insulting the intelligence of his readers by sharing the source code; since I have no readers, I have no such concerns. Here is my Python version:

#!/usr/bin/env python3
"""
An ugly not-quite-port of Mark J. Dominus' 'accumulate' utility,
described at [http://blog.plover.com/prog/accumulate.html]
"""

import sys

seen = {}

for line in sys.stdin:
  array = line.split()   # split() also discards surrounding whitespace
  if not array:          # skip blank lines, which have no key
    continue
  key = array[0]
  value = array[1:]
  if key in seen:
    seen[key].extend(value)
  else:
    seen[key] = value

for k, v in seen.items():
  v.reverse()
  print(k, ' '.join(v))

Word Count, scsh-style

../img/trastevere-quilt-opus-lvi.jpg

Introduction

Just for fun, I’ve implemented a command line “word count” program in scsh. Along the way I’ve learned about a few really neat features of scsh, which I’ll discuss more below.

What is a word count program? In my perfect world, a word counter does only one thing: count the number of words it sees on standard input, and print that number to standard output. Let me give some examples of what should qualify as “words” for our purposes:

below
that’s
schleswig-holstein (one word!)
autonomy
friday’s
putrescence

Basic Design

If I were specifying a word count program in natural language, I might think of this series of steps:

  1. Read standard input (fd 0) into some kind of input buffer.
  2. Iterate over that buffer, checking for breaks (a.k.a. whitespace) between words.

A string like the following might prove tricky enough (feel free to replace the reference to “Satan” with your preferred $EVIL_DEITY):

Every good dog goes to heaven – and if not? – well, I hear
Satan has de-
licious bones to chew!

By inspection, the sentence above has eighteen words. (Note that the implementation of wc in scsh described here (eventually) says 19, and GNU’s wc says there are 21–both are incorrect, but more on that below.) There are several edge cases to note in this sentence:

  1. A space at the beginning of the sentence.
  2. A clause inside em-dashes.
  3. A hyphenated word occurring at the end of a line.

Some things we should probably do, given the above:

  1. Ignore spaces at the beginnings of lines.
  2. Ignore punctuation in general, such as periods, exclamation points…
  3. …except hyphens. Hyphens join two or more words into one. This should work across newlines; that would take care of the “hyphenated word at line’s end” case.

Of the above 3 items, the last point sounds the trickiest. I think we need to take care of the hyphen case eventually, but for now let’s punt on it and worry about getting something basic working. Starting at a high level, here’s my take on the program flow:

  1. Loop over the lines of standard input.
  2. For each line, split the line into words using an appropriately “clever” regular expression.
  3. Keep a running total of how many words we’ve seen.
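
Before getting into the scsh version, the three steps above can be sketched in a few lines of Python. The WORD pattern here is only a stand-in I made up for the “clever” regular expression (letters and digits, optionally followed by an apostrophe or backtick and more letters and digits), and count_words is a hypothetical name.

```python
import re

# Stand-in pattern (an assumption): a run of letters/digits, optionally
# followed by an apostrophe or backtick and more letters/digits.
WORD = re.compile(r"[0-9A-Za-z]+(?:['`][0-9A-Za-z]*)?")

def count_words(lines):
    """Loop over lines, split each into words, and keep a running total."""
    total = 0
    for line in lines:
        total += len(WORD.findall(line))
    return total

print(count_words(["below", "that's", "schleswig-holstein", "autonomy"]))
```

Note that this simple pattern counts a hyphenated word as two words, which is the case being punted on for now.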

Implementation Notes

Sounds simple enough, right? Well, after some hacking around, here’s what I came up with. Items 1 and 3 from the above program flow list are pretty standard parts of any programming language, and scsh has them covered, as we’ll see. That leaves it up to me to implement the above-mentioned “clever” regular expression (which, as you’ll see shortly, could stand to be a little, um, “clever-er”).

In the course of describing this implementation, I’ll introduce several interesting features of scsh:

  • The `rx’ regular expression notation
  • The `awk’ macro
  • The `regexp-fold-right’ procedure

and briefly mention another:

  • Scheme 48/scsh “records”.

For more information on these and other features of scsh, I recommend reading the excellent manual.

The Not-so-Clever Regular Expression

Rather than do it here, I’ve written the description of the regular expression using comments in the code itself. One of the strengths of Shivers’ `rx’ regular expression package (standard in scsh) is that it lets the programmer write regular expressions in a syntax that, while not Scheme proper, is s-expression based rather than string-based (don’t worry: the traditional string syntax is also supported). Because of the s-expressions, we can format these regexps easily in Emacs, as well as add descriptive inline comments.

The Perl aficionados among you will note that this is similar to Perl’s `/x’ regular expression modifier, which allows you to “Extend your pattern’s legibility by permitting whitespace and comments.” Of course, when we use scsh, we get all of that, and the rest of Scheme for free!

In any event, here is the expression that we match against every line of input:

(define wc-rx (rx (:                   ; start rx matching sequence
                   (* whitespace)      ; 0 or more spaces to start the line
                   (+ alphanumeric)    ; match 1 or more chars in [0-9A-Za-z]
                   (?                  ; then, optionally match the following:
                    (or #\' #\`)       ; - apostrophe or backtick
                    (* alphanumeric))  ; - 0 or more [0-9a-zA-Z]
                   (* whitespace))))   ; - 0 or more spaces to end the line

Let’s see how it does against our word list from above:

Text                Count
below               1
that’s              1
schleswig-holstein  2
autonomy            1
friday’s            1
putrescence         1

Looks like we’re failing to match correctly against “schleswig-holstein”, a hyphenated word. This is something of a corner case, since relatively few words in English use hyphens. Therefore, I’ll punt on it for now (as I did on the “hyphenated line break” case above), with the caveat that it may need to be implemented in future.
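
For what it’s worth, here is one way the two hyphen cases might be handled, sketched in Python rather than scsh. Everything in it is an assumption on my part (the pattern, the name count_words_hyphen_aware, and the decision to glue a line-ending hyphen straight onto the next line), and it is not part of the scsh program described in this post.

```python
import re

# Letters/digits, then any number of (apostrophe | backtick | hyphen) + letters/digits.
HYPHEN_AWARE = re.compile(r"[0-9A-Za-z]+(?:['`-][0-9A-Za-z]+)*")

def count_words_hyphen_aware(text):
    """Count words, treating hyphenated words and line-end hyphens as one word."""
    # Join "de-\nlicious" into "delicious" before matching.
    joined = re.sub(r"-\n[ \t]*", "", text)
    return len(HYPHEN_AWARE.findall(joined))

tricky = ("Every good dog goes to heaven \u2013 and if not? \u2013 well, I hear\n"
          "Satan has de-\n"
          "licious bones to chew!")

print(count_words_hyphen_aware("schleswig-holstein"))
print(count_words_hyphen_aware(tricky))
```

The catch is that a word that really is hyphenated and happens to break at the end of a line loses its hyphen when joined; the count is still right, but the word itself is not.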

Looping with awk

I tried to avoid learning too much about scsh’s `awk’ macro at first. Unfortunately, I just didn’t take the time to read the manual in this case. My thinking was: Can’t I just write a for-each style Scheme loop? Well, yes. scsh actually comes with a regexp-for-each procedure, which will happily loop across a string, gobbling up one match after another. The difficulty arose when I tried to update a state variable from inside my `awk’ macro loop. I just couldn’t seem to get it to work. Dramatization (this is not the actual code; I’ve since deleted it):

(let ((count 0))
  (awk (read-line) (line) ()
    ... stuff here ...
    (set! count (+ 1 count)))
  ... etc. ...)

It turns out that this was extremely dumb, since `awk’ provides a nice state variable abstraction already – here’s the definition of the form from the excellent manual:

(awk NEXT-RECORD RECORD+FIELD-VARS
     [COUNTER] STATE-VAR-DECLS
     CLAUSE_1 ...)

Using the block above (and the manual) as a reference, let’s look at the actual code in the section below – you’ll have to substitute the existing code below for the terms `NEXT-RECORD’ and the like:

  1. Each time through the loop, `awk’ gets its next set of values using the procedure NEXT-RECORD; in this case, it uses a simple read-line.
  2. The variable line (standing in for RECORD+FIELD-VARS) holds the line of input returned by read-line.
  3. `awk’ provides an optional COUNTER variable, which is omitted since it isn’t needed in this case.
  4. We set the value of the state variable words to 0 (This is `STATE-VAR-DECLS’).
  5. Finally, we get to `CLAUSE_1’. In the `awk’ macro, there are several types of clauses, which are all evaluated in sequence (this is unlike the way cond works, for example). In the most basic case, the clause is of the form (test body). As you can see here, test is simply #t, Scheme’s boolean truth value, so its corresponding body is evaluated every time through the loop. This causes the value of our state variable words to be incremented by the length of the list returned by the `regexp-fold-right’ block. For now we’ll just treat that block as a black box which looks at a string and returns a list of matching substrings.

(awk (read-line) (line) ((words 0))
  (#t (+ words
         (length
          (regexp-fold-right wc-rx (lambda (m i lis)
                                     (cons (match:substring m 0) lis))
                             '() line)))))

Gathering Matches with regexp-fold-right

As we’ve just deduced by reading through the `awk’ loop, regexp-fold-right returns a list of substrings matching our regular expression wc-rx. We’re going to take the length of that list and add it to our running total. But how does this crazy regexp-fold-right work, anyway? It took me a while to figure out, as I hadn’t used any of the functional-style fold procedures before.

First, the code:

(regexp-fold-right wc-rx (lambda (m i lis)
                           (cons (match:substring m 0) lis))
                   '() line)

My initial impressions: I can see that we’re using our regular expression wc-rx and doing something with a lambda-expression that cons-es part of the resulting match record onto an empty list. Somehow this is operating on the line of input read in by `awk’.

Looking at the code above with the manual open, we see the following:

  • Again, m is a `match’ record returned by matching wc-rx against this line of input.
  • i is the index of the end of the match described by m
  • lis is the list of substring matches we’ll be building up by repeatedly cons-ing the substring in the match record m onto lis – in this way we build up a list of matching substrings of our regular expression
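
If fold procedures are as new to you as they were to me, a rough Python imitation may help. The regexp_fold_right function below is my own sketch, not scsh’s implementation: WC_RX is an approximate translation of wc-rx, and passing the end of each match as i simply follows the description in the list above.

```python
import re

# Approximate Python translation of wc-rx (an assumption, not scsh's own syntax).
WC_RX = re.compile(r"\s*[0-9A-Za-z]+(?:['`][0-9A-Za-z]*)?\s*")

def regexp_fold_right(rx, kons, knil, s):
    """Fold kons over the matches of rx in s, starting from the rightmost one."""
    acc = knil
    for m in reversed(list(rx.finditer(s))):
        acc = kons(m, m.end(), acc)   # m: the match, i: its end index, acc: lis
    return acc

matches = regexp_fold_right(
    WC_RX,
    lambda m, i, lis: [m.group(0)] + lis,   # plays the role of cons
    [],                                     # plays the role of '()
    "let's go to the Boston aquarium!",
)
print(matches)
print(len(matches))
```

Because each match is added to the front of the list while working from the right, the finished list comes out in left-to-right order.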

To make sure we really understand this, let’s step through an example. We’ll create a simple match against 2 lines of input, and diagram the states of variables m, i, lis, and line during each step. By watching the ways these variables change as we loop over each line, we’ll get a fuller understanding of the way regexp-fold-right works. Four iterations should be enough:

STEP 1
  String                 "let's go to the Boston aquarium!"
  i                      0
  lis                    '()
  (match:substring m 0)  ""
STEP 2
  String                 "go to the Boston aquarium!"
  i                      6
  lis                    '("let's ")
  (match:substring m 0)  "let's "
STEP 3
  String                 "to the Boston aquarium!"
  i                      9
  lis                    '("go " "let's ")
  (match:substring m 0)  "go "
STEP 4
  String                 "the Boston aquarium!"
  i                      12
  lis                    '("to " "go " "let's ")
  (match:substring m 0)  "to "

Does it Work?

It’s been fun writing this program, but how does it stack up against GNU wc? According to my quick and dirty testing, quite well. Depending upon the type of file and how much weird formatting and markup it contains, our scsh word count will report slightly more or fewer words than GNU’s wc. Here are a few examples from readily available text files (The first file is Neal Stephenson’s In the Beginning was the Command Line, the second Homer’s Iliad):

File              GNU wc  wc.scm  Discrepancy (fraction)
command-line.txt   36331   37262  0.025
iliad.txt         192656  194122  0.008

Those numbers look pretty good to me. And remember that we still have room for improvement, since we don’t handle hyphenated words correctly (easy to implement), nor do we handle hyphenated line breaks (slightly harder). Remember from Basic Design that GNU word count doesn’t handle the hyphenated line break correctly either, and also appears to be incorrectly treating `–’ as a word, at least according to our single test.
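
For anyone who wants to check my arithmetic, the discrepancy figures can be reproduced from the two counts. The snippet below is a small Python sketch; it assumes the discrepancy is measured relative to GNU wc’s count, and the discrepancy function name is made up.

```python
# (GNU wc count, wc.scm count) for each file, taken from the table above.
counts = {
    "command-line.txt": (36331, 37262),
    "iliad.txt": (192656, 194122),
}

def discrepancy(gnu, scsh):
    """Difference between the two counts as a fraction of the GNU wc count."""
    return (scsh - gnu) / gnu

for name, (gnu, scsh) in counts.items():
    d = discrepancy(gnu, scsh)
    print(f"{name}: {d:.4f} ({d:.2%})")
```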

Just for fun, I ran both implementations against this post – not surprisingly, they’re pretty close:

Homegrown scsh version  GNU version
1957                    1988

Summary

To sum up, we’ve implemented a relatively sane word count program and introduced a few interesting areas of Scheme and scsh:

  • `rx’ regular expression notation
  • `awk’ loop macro
  • `fold’-style functional iteration
  • Scheme 48/scsh records

You can read more about these and other features on the scsh website. Below you’ll find the full version of the program described in this post.

Finally, thanks for reading, and happy scsh hacking!

Full Program Listing

#!/usr/local/bin/scsh \
-e main -s

Display the number of words read from standard input
to standard output.
!#

(define wc-rx (rx (:                   ; start rx matching sequence
                   (* whitespace)      ; 0 or more spaces to start the line
                   (+ alphanumeric)    ; match 1 or more chars in [0-9A-Za-z]
                   (?                  ; then, optionally match the following:
                    (or #\' #\`)       ; - apostrophe or backtick
                    (* alphanumeric))  ; - 0 or more [0-9a-zA-Z]
                   (* whitespace))))   ; - 0 or more spaces to end the line

(define (main prog+args)
  (display
   (awk (read-line) (line) ((words 0))
     (#t (+ words
            (length
             (regexp-fold-right wc-rx (lambda (m i lis)
                                        (cons (match:substring m 0) lis))
                                '() line))))))
  (newline))

Footnotes:

1 For more information about records, as well as the other features introduced here, consult the excellent manual at http://www.scsh.net/docu/man.html.

(Image courtesy Melisande under Creative Commons license.)

A First Guile Script

https://logicgrimoire.files.wordpress.com/2012/09/wpid-fujimoto-perpetuus.jpg

Preamble

For a long time now, I’ve been looking to make my life more Lispy. As part of that transformation, I’ve begun porting some of my little Perl scripts over to Guile Scheme. Today I’m going to walk through a script that renames my files in a nice, *nix-friendly fashion. For example, if I download a file that someone has erroneously (if good-naturedly) called “My Cool Data.tar.gz”, this script will rename it to “my-cool-data.tar.gz”.

A note on filename style: I’ve never liked the common practice of naming files using underscores (`_’), so I use hyphens instead (`-’). It’s more Lispy! Also, regular expressions usually recognize the underscore character as part of a word, such that `my_cool_data’ is considered one word, whereas `my-cool-data’ will be treated as three, and the latter is almost always what I’d prefer (since those are, in fact, three words).
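
The underscore point is easy to check for yourself. Here is a tiny Python illustration (Python rather than Guile only because it is quick to try at a prompt), using the \w word-character class:

```python
import re

for name in ("my_cool_data", "my-cool-data"):
    # \w matches letters, digits, and the underscore, but not the hyphen.
    print(name, re.findall(r"\w+", name))
```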

Ok. So what about Guile then? It’s an R5RS-compatible scheme, so you get all of that goodness. If you’re an Emacs user, check out Geiser, which turns Emacs into an AWESOME Scheme hacking environment. You don’t need to be an Emacs weirdo like me to write programs in Guile, however. Vim works very nicely, as a matter of fact, and it also highlights Scheme source code beautifully.

Finally, not that it matters that much, but this short essay is also a literate program, thanks to Orgmode (a.k.a. “The Teal Unicorn”). Fun!

Program Headers and Modules

Just like any other *nix script, we need to declare a path to our interpreter, as well as any arguments to the interpreter itself. In Guile’s case, there are two things to notice: (1) The guile executable must be passed the -s argument to execute in script-mode, and (2) The opening #! in the interpreter path must be matched by a closing !# due to the way Scheme works (or at least, this particular Scheme).

Next, we declare the modules we’d like to use. In this case, it’s just the one: ice-9 regex. Please don’t ask me what the ice-9 part means, but Guile has a whole bunch of functionality under the ice-9 umbrella, such as regular expression support (which we’re using here), POSIX-related stuff, a getopt-long library, and more. For details see the fine manual. Or just type “C-h i C-s guile” as $DEITY intended.

#!/usr/local/bin/guile -s
!#

(use-modules (ice-9 regex))

Defining the main procedure

We’re ready to start writing our actual program! Because we’re exciting and creative folk, we’ll call our single procedure main.

We’ll go ahead and use a let statement to grab all but the first element of the program-arguments list and stick it in the args variable for brevity (the first element is the name of the executable file). This use of let isn’t really required in such a simple program, but I find that it makes things easier to read, and if I expand the program later, it’s easier to modify.

(define (main)
  (let ((args (cdr (program-arguments))))

We can’t just assume that the args list is going to have anything in it, however, so we’ll print a short message and exit the program if it’s empty. If it’s not empty, we travel on to the `else’ clause of the if expression.

(if (null? args)
      (begin (display "No arguments, exiting...")
             (newline)
             (exit))

Now that we’ve invoked the interpreter with the right incantations, loaded our required module, and checked the program arguments list to make sure that we have something there to process, we can write the part of the program that actually does something. Sweet!

In the `else’ clause of the if expression, we iterate over args using the for-each procedure. We use for-each in this case (rather than our beloved map) because we don’t want to build a new list by transforming each element of args, we just want to iterate over our list being all “side-effect-y” (a technical term that in this case means “affecting the state of stuff on disk”).

The best way to read Lisp code is usually “inside-out”. Begin with the innermost element, figure out what argument(s) it takes, and see what it passes along as a return value. That return value is then an input for something else. This is true in most computer languages, but in Lisp it becomes especially necessary to read things this way.

Therefore we’ll start inside the innermost expression, at regexp-substitute/global. The documentation says that it needs a port, a regular expression, and a string to match that regular expression against. Since regexp-substitute/global isn’t writing its output to a port, but passing its result out to string-downcase, we specify “no port” as #f. The 'post item has to do with making regexp-substitute/global recur on any unmatched parts of the string in arg, and the literal - is what we’d like to replace our matches with. For more comprehensive information on 'pre and 'post, I actually needed to consult the documentation on regexp-substitute, since regexp-substitute/global is apparently a special case of the former (and is perhaps implemented using regexp-substitute? I didn’t check, but it would be easy enough to do so).

Let’s look at that regex, [,'!_ \t]+. In English, it means “match a run of one or more commas, apostrophes, exclamation points, underscores, blank spaces or tabs”. As noted above, we want to replace each such run of characters with a single -.

For example, a string like Hey Kids I Have Spaces.txt would become Hey-Kids-I-Have-Spaces.txt. We then pass it out to the string-downcase procedure, which transforms it into hey-kids-i-have-spaces.txt.

That value is then passed as the second argument to the rename-file procedure, which renames arg (our original, uncool filename) to hey-kids-i-have-spaces.txt.

It’s all wrapped in a lambda expression, which does the job of creating and invoking a one-argument procedure out of the several we’ve discussed; this procedure is then applied to every item in our argument list args.

(for-each (lambda (arg)
            (rename-file arg
                         (string-downcase
                          (regexp-substitute/global
                           #f "[,'!_ \t]+" arg
                           'pre "-" 'post))))
          args))))
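
For comparison with the Perl-and-Python crowd, the same transformation can be sketched in Python. The names friendly_name and rename_all are made up for this example, and unlike the Guile script above, this sketch only rewrites the final component of each path (an assumption of mine, to avoid touching directory names).

```python
import os
import re

def friendly_name(name):
    """Replace runs of commas, apostrophes, '!', '_', spaces, tabs with '-'; lowercase."""
    return re.sub(r"[,'!_ \t]+", "-", name).lower()

def rename_all(paths):
    """Rename each file in place, changing only the last path component."""
    for path in paths:
        directory, base = os.path.split(path)
        os.rename(path, os.path.join(directory, friendly_name(base)))

print(friendly_name("Hey Kids I Have Spaces.txt"))
print(friendly_name("My Cool Data.tar.gz"))
```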

Invocation and Program Listing

In this way the file renaming operation that we’ve defined here is applied to each of our program’s arguments, and we invoke it like so (shown here operating on two files):

$ guile renamer.scm Hey\ Kids\ I\ Got\ Spaces.txt Oh_no_ugly_underscores.html

A final note: even for a program as simple as this, I didn’t sit down and bang it out all in one go. Especially with the regex, I was testing little parts of it at the REPL the whole way, consulting the documentation for these functions via the relevant Geiser and Emacs commands. But that’s a story for another day…

Finally, here’s the complete program listing:

#!/usr/local/bin/guile -s
!#

(use-modules (ice-9 regex))

(define (main)
  (let ((args (cdr (program-arguments))))
    (if (null? args)
        (begin (display "No arguments, exiting...")
               (newline)
               (exit))
        (for-each (lambda (arg)
                    (rename-file arg
                                 (string-downcase
                                  (regexp-substitute/global
                                   #f "[,'!_ \t]+" arg
                                   'pre "-" 'post))))
                  args))))

(main)

(Image courtesy Melisande under Creative Commons license.)