Scripting Language Idioms: The “Seen” Hash

The “seen” hash technique is an idiom that lets you use a hash (or dictionary if you prefer) as a set data type. It’s good for generating a de-duplicated list of things, where each thing appears only once. If your language of choice has a real set data type, you may want to use that instead.

To illustrate, I’ll offer a real-world use case.

The other day at work I needed to grab a bunch of information about git commits from a batch of automated emails. For reasons that don’t matter right now, our team (Docs) gets automated emails about git commits on our API. (It’s not really what we asked for, but it’s what we could get somebody to build.)

As a result, we get a bunch of emails formatted like this (personal details changed to protect the guilty):

--------- Project: Foo Details: something something garbage noise etc.

J. Random Luser ABC-123 did something to some other thing
Alyssa P. Hacker ABC-124 fixed mistake in ABC-123
Ben Bitdiddle ABC-125 yet another thing was done
J. Random Luser oh yeah this thing too
Ben Bitdiddle Merge ABC-123 into Foo/master

Luckily I don’t need to read all of these darn things. I have a filter set up on my mail client that saves them all in my ‘Archive’ folder, where I can safely ignore them.

When we’re getting ready to do our API release notes, I go into my ‘Archive’ folder and search for all of the emails with the subject “Project: Foo” that arrived between our last set of release notes and today. I end up with (say) about 100 files formatted like the above.

The format is: Name, JIRA ticket ID, description. Except that sometimes there is no JIRA ticket ID. And sometimes there are duplicate ticket IDs, since the emails contain messages about merge commits.

As a tech writer, I don’t need to look at the contents of every commit. I need to generate a (de-duplicated) list of JIRA ticket IDs, so I can go and review those tickets to see if there is user-facing docs work that needs to happen for those commits. (Sometimes I still need to look at the commits anyway because a ticket has a description like “change the frobnitz”, but hey.)

So I save all of these email files into a directory, and I write some code that loops over each file, generating a set of JIRA ticket IDs, which I then print out. Here’s the code what done it (it’s written in Perl but could as easily be Ruby or Python or whatevs):


use strict;
use warnings;
use feature     qw/ say   /;
use File::Slurp qw/ slurp /;

my @files = glob('*.eml');
my $jira_pat = '([A-Z]+-[0-9]+)';
my %seen;

for my $f (@files) {
  my @lines = slurp($f);

  for my $line (@lines) {
    next unless $line =~ /$jira_pat/; # Skip unless it has a JIRA ticket ID
    my $id = $1;                      # If it did match, save the capture
    $seen{$id}++;                     # Stick the ID in the hash (as a key)
  }
}

say for sort keys %seen;        # Print out all the keys (which are de-duped)

The reason this trick works is that a hash table can’t have duplicate keys. The ‘$seen{$id}++’ bit means: “Stick the ID in the hash if it isn’t there already, and increment its value”, so an ID that shows up many times still occupies just one key. Based on the example email above, you end up with a hash table that looks like this:

  ABC-123 => 2,
  ABC-124 => 1,
  ABC-125 => 1,

Then we print the keys using the line ‘say for sort keys %seen’, which just means “print the hash keys in sorted order”.
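For comparison, here is roughly how the same de-duplication might look in a language with a real set type. This is only a sketch in Ruby, not the script I actually ran: the helper name ‘ticket_ids’ is made up for illustration, and the sample lines are copied from the example email above rather than read from files.

```ruby
require 'set'

# Collect the first JIRA ticket ID found on each line into a Set.
# (ticket_ids is a hypothetical helper, not part of the original script.)
def ticket_ids(lines)
  ids = Set.new
  lines.each do |line|
    m = /([A-Z]+-[0-9]+)/.match(line)
    next unless m   # Skip lines with no ticket ID
    ids << m[1]     # Adding an ID that is already present is a no-op
  end
  ids
end

sample = [
  'J. Random Luser ABC-123 did something to some other thing',
  'Alyssa P. Hacker ABC-124 fixed mistake in ABC-123',
  'Ben Bitdiddle ABC-125 yet another thing was done',
  'J. Random Luser oh yeah this thing too',
  'Ben Bitdiddle Merge ABC-123 into Foo/master'
]

puts ticket_ids(sample).sort
```

The trade-off is that a set only remembers whether it has seen an ID, while the “seen” hash also gives you a count for free.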

Perl’s Autovivification FTW

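Interestingly, part of the reason this idiom is cleaner in Perl than in, say, Ruby, is that Perl does something called “autovivification”. Basically, it means that stuff gets created as soon as you use it: the first time you increment ‘$seen{$id}’, Perl creates the key, treats its missing value as 0, and bumps it to 1, all without complaint. That’s why you can write ‘$seen{$id}++’ in one line. (Strictly speaking, “autovivification” refers to Perl creating nested hashes and arrays on demand; the one-line increment relies on the closely related rule that an undefined value counts as 0. If you want more information about autovivification, there’s a good article on Wikipedia.)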

By contrast, in Ruby you have to explicitly create the key’s value first, and then increment it. A plain Ruby hash returns nil for a missing key, so as you can see below, if you try to bump the value of a key that doesn’t exist yet, you get an error (unless you use the technique from the Wikipedia article).

irb(main):015:0> RUBY_VERSION
=> "2.2.4"
irb(main):010:0> tix
=> {"ABC-123"=>2, "ABC-124"=>1}
irb(main):011:0> tix['ABC-125'] += 1
NoMethodError: undefined method `+' for nil:NilClass
    from (irb):11
    from c:/Ruby22/bin/irb:11:in `<main>'
irb(main):012:0> tix['ABC-125'] = 1
=> 1
irb(main):013:0> tix
=> {"ABC-123"=>2, "ABC-124"=>1, "ABC-125"=>1}
irb(main):014:0> tix['ABC-125'] += 1
=> 2
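One common way around that error in Ruby is to give the hash a default value when you create it. The sketch below assumes that approach (‘Hash.new(0)’); the helper name ‘count_tickets’ is hypothetical, and I’m not claiming this is the same technique the Wikipedia article describes.

```ruby
# Count how often each ticket ID appears.
# (count_tickets is a hypothetical helper name used for illustration.)
def count_tickets(ids)
  tix = Hash.new(0)               # Missing keys read as 0 instead of nil
  ids.each { |id| tix[id] += 1 }  # So += works on the first sighting too
  tix
end

tix = count_tickets(%w[ABC-123 ABC-124 ABC-125 ABC-123])
puts tix.keys.sort                # Print the de-duped IDs in sorted order
```

Note that merely reading a missing key from such a hash returns 0 without storing anything; the key only appears once you assign to it, which ‘+=’ does.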
