Monthly Archives: October 2016

Scripting Language Idioms: The “Seen” Hash

The “seen” hash technique is an idiom that lets you use a hash (or dictionary if you prefer) as a set data type. It’s good for generating a de-duplicated list of things, where each thing appears only once. If your language of choice has a real set data type, you may want to use that instead.

To illustrate I’ll offer a real-world use case.

The other day at work I needed to grab a bunch of information about git commits from a batch of automated emails. For reasons that don’t matter right now, our team (Docs) gets automated emails about git commits on our API. (It’s not really what we asked for, but it’s what we could get somebody to build.)

As a result, we get a bunch of emails formatted like this (personal details changed to protect the guilty):

--------- Project: Foo Details: something something garbage noise etc.

J. Random Luser ABC-123 did something to some other thing
Alyssa P. Hacker ABC-124 fixed mistake in ABC-123
Ben Bitdiddle ABC-125 yet another thing was done
J. Random Luser oh yeah this thing too
Ben Bitdiddle Merge ABC-123 into Foo/master

Luckily I don’t need to read all of these darn things. I have a filter set up on my mail client that saves them all in my ‘Archive’ folder, where I can safely ignore them.

When we’re getting ready to do our API release notes, I go into my ‘Archive’ folder and search for all of the emails with the subject “Project: Foo” that arrived between our last set of release notes and today. I end up with (say) about 100 files formatted like the above.

The format is: Name, JIRA ticket ID, description. Except that sometimes there is no JIRA ticket ID. And sometimes there are duplicate ticket IDs, since the emails contain messages about merge commits.

As a tech writer, I don’t need to look at the contents of every commit. I need to generate a (de-duplicated) list of JIRA ticket IDs, so I can go and review those tickets to see if there is user-facing docs work that needs to happen for those commits. (Sometimes I still need to look at the commits anyway because a ticket has a description like “change the frobnitz”, but hey.)

So I save all of these email files into a directory, and I write some code that loops over each file, generating a set of JIRA ticket IDs, which I then print out. Here’s the code what done it (it’s written in Perl but could as easily be Ruby or Python or whatevs):

#!perl

use strict;
use warnings;
use feature     qw/ say   /;
use File::Slurp qw/ slurp /;

my @files = glob('*.eml');
my $jira_pat = '([A-Z]+-[0-9]+)';
my %seen;

for my $f (@files) {
  my @lines = slurp($f);

  for my $line (@lines) {
    next unless $line =~ /$jira_pat/; # Skip unless it has a JIRA ticket ID
    my $id = $1;                      # If it did match, save the capture
    $seen{$id}++;                     # Stick the ID in the hash (as a key)
  }
}

say for sort keys %seen;        # Print out all the keys (which are de-duped)

The reason this trick works is that a hash table can’t have duplicate keys. Therefore the ‘$seen{$id}++’ bit means: “Stick the ID in the hash, and increment its value”. Based on the example email above, you end up with a hash table that looks like this:

{
  ABC-123 => 2,
  ABC-124 => 1,
  ABC-125 => 1,
}

Then we print the keys using the line ‘say for sort keys %seen’, which just means “print the hash keys in sorted order”.
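
For comparison, here's a rough sketch of the same de-duplication in Ruby using a real set type, as mentioned at the top of this post. The `ticket_ids` helper name is made up for illustration, and the sample lines are just the example email from earlier; the pattern is the same one the Perl script uses.

```ruby
require 'set'

# Hypothetical helper: collect the first JIRA ticket ID on each line into a Set.
def ticket_ids(lines)
  jira_pat = /([A-Z]+-[0-9]+)/
  ids = Set.new
  lines.each do |line|
    match = jira_pat.match(line)
    next unless match   # Skip lines with no JIRA ticket ID
    ids << match[1]     # Adding an ID that's already there changes nothing
  end
  ids
end

sample = [
  'J. Random Luser ABC-123 did something to some other thing',
  'Alyssa P. Hacker ABC-124 fixed mistake in ABC-123',
  'Ben Bitdiddle ABC-125 yet another thing was done',
  'J. Random Luser oh yeah this thing too',
  'Ben Bitdiddle Merge ABC-123 into Foo/master'
]

puts ticket_ids(sample).sort   # Print the de-duped IDs in sorted order
```

The trade-off is that a set only tells you which IDs showed up. The seen hash also keeps a count per ID, which is occasionally handy.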

Perl’s Autovivification FTW

Interestingly, part of the reason this idiom is cleaner in Perl than in, say, Ruby, is that Perl does something called “autovivification” of hash keys. Basically, it means that stuff gets created as soon as you mention it. That’s why you can write ‘$seen{$id}++’ all in one line: the key springs into existence on first use, and its undefined value counts as 0 when you increment it. (If you want more information about autovivification, there’s a good article on Wikipedia.)

By contrast, in Ruby you have to first explicitly create the key’s value, and then increment it. As you can see below, if you try to bump the value of a key that doesn’t exist yet, you get an error (unless you use the technique from the Wikipedia article).

irb(main):015:0> RUBY_VERSION
=> "2.2.4"
irb(main):010:0> tix
=> {"ABC-123"=>2, "ABC-124"=>1}
irb(main):011:0> tix['ABC-125'] += 1
NoMethodError: undefined method `+' for nil:NilClass
    from (irb):11
    from c:/Ruby22/bin/irb:11:in `<main>'
irb(main):012:0> tix['ABC-125'] = 1
=> 1
irb(main):013:0> tix
=> {"ABC-123"=>2, "ABC-124"=>1, "ABC-125"=>1}
irb(main):014:0> tix['ABC-125'] += 1
=> 2
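
One way around that error in Ruby (not necessarily the technique the Wikipedia article describes) is to give the hash a default value. This is a minimal sketch, assuming a default of 0 is what you want; the `count_ids` name is made up for illustration.

```ruby
# Hypothetical helper: tally how many times each ID appears.
def count_ids(ids)
  seen = Hash.new(0)               # Missing keys read as 0 instead of nil
  ids.each { |id| seen[id] += 1 }  # So += works the first time an ID shows up
  seen
end

tix = count_ids(%w[ABC-123 ABC-124 ABC-125 ABC-123])
p tix               # The counts per ID
puts tix.keys.sort  # The de-duplicated IDs
```

With the default in place, the Ruby version is about as compact as the Perl one.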


How Dangerous is the Samsung Galaxy Note 7? It’s Safer than Driving

Now that the Galaxy Note 7 has been officially discontinued, I’m not sure it’s worth worrying about the failure rate of this device. But there’s something that really bothered me about the coverage of the device’s various recalls and eventual discontinuation (is that a word?), which was that almost nobody seemed to be running the numbers on the actual failure rates.

If you do the arithmetic on the device failure rates, you end up looking at the situation rather differently. This is not to say that the device being discontinued was the wrong decision — all it takes is one person being horribly burned to create a panic and do serious damage to the company, not to mention that person!

Rather, I think it’s interesting to do the arithmetic as a way of exploring how humans think about risk. It may not surprise you to hear that I think we are really bad at this. And oftentimes it’s because we don’t run the numbers.

With that said, let’s look at some numbers.

According to this article on the Galaxy Note 7 recall, there were about 2.5 million devices sold in the initial batch, and, at least in early September, there had been 35 handsets discovered with the issue. A later report said that over 70 devices had overheated.

The best final count I could find is the one from the Consumer Product Safety Commission. According to the CPSC, there have been 92 reports of the batteries overheating.

How much danger was I really in? (I just returned my Galaxy Note 7 yesterday, which I LOVED, which is part of why I’m writing this.)

  • 2.5 million phones
  • 92 incidents of overheating

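Turning to my trusty calculator (which uses 2.4 million as the denominator, a touch under the 2.5 million sold, so the odds come out slightly worse than they really were), that looks like roughly a 1 in 26,000 chance of the device overheating: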

CL-USER> (/ 2.4e6 92)
26086.957

Expressed as a percentage, there is approximately a 0.004% chance that your device would have been one of the ones to overheat:

CL-USER> (format t "~6$" (* (/ 92 2.4e6) 100))
0.003833

However, let’s make a more conservative assumption that 1000 devices (over 10x as many) would eventually overheat. That’s still only about 0.04%, far less than one tenth of one percent (and “less than one tenth of one percent” is the figure Samsung themselves quoted during the initial recall):

CL-USER> (format t "~6$" (* (/ 1000 2.4e6) 100))
0.041667
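
If you'd rather not retype the arithmetic for every scenario, here's a small Ruby sketch that runs both cases in one go. The `failure_percent` helper is a made-up name, and the denominator is the same 2.4e6 used in the calculator sessions above.

```ruby
# Hypothetical helper: percentage of devices affected.
def failure_percent(incidents, devices)
  incidents.to_f / devices * 100
end

devices = 2.4e6  # Same denominator as the calculator sessions above

[92, 1000].each do |incidents|
  pct    = failure_percent(incidents, devices)
  one_in = (devices / incidents).round
  puts format('%4d incidents: %.6f%% (about 1 in %d)', incidents, pct, one_in)
end
```

Swapping in a different device count or incident count is a one-line change, which makes it easy to see how little the conclusion moves.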

Eventually, the bad PR due to overheating devices grew to be too much, and Samsung discontinued the model.

One lesson of this incident seems to be that you can make a product that is nearly perfect, with a 0.0038% failure rate, but if the failure mode is bad enough (it probably is), and if the media exposure is widespread enough to create a public outcry (it definitely was), you’re fucked. With lines like this one appearing in the Verge, it’s not hard to understand why Samsung realized they had to just kill it:

It’s easy to imagine how terrifying it would be to have a phone begin smoking like this on a plane or on your bedside table. No thanks.

The mobile device hardware industry is brutal. You can’t have a failure that occurs even in 0.0038% of devices. And even if you maintain that near-perfect safety record you still have to compete on price, features, and time-to-market. I really don’t envy those folks.

But what about risk assessment?

What I find most interesting, as I mentioned above, is what this reflects about how humans assess risk. For example, in 2012, 92 people died in car accidents every day, and nobody has ever considered doing a recall of all automobiles sold in the United States for being fundamentally unsafe!

According to the chart linked above, in 2012 there were 10.691 auto accident deaths per 100,000 people, which means that you had about a 0.01% chance of literally dying in a horrible car crash:

CL-USER> (format t "~6$" (* (/ 10.691 1e5) 100))
0.010691

Compare this to the 0.004% chance of your phone overheating that we calculated above (based on the 92 incidents figure). Given the rather imprecise way we’ve been slinging these numbers around, let’s just assume there’s a lot of error there, and that the figures are roughly equal.
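
To put the two figures side by side, and as a sanity check on the units, here's a small Ruby sketch. The `per_100k_to_percent` helper is a made-up name; the only thing it assumes is that “per 100,000” means dividing by 1e5 before converting to a percentage.

```ruby
# Hypothetical helper: turn "N per 100,000 people" into a percentage.
def per_100k_to_percent(rate)
  rate / 100_000.0 * 100
end

car_pct   = per_100k_to_percent(10.691)  # 2012 car crash deaths per 100,000
phone_pct = 92 / 2.4e6 * 100             # Overheating reports, as computed earlier

puts format('car crash: %.4f%%', car_pct)
puts format('overheat:  %.4f%%', phone_pct)
puts format('ratio:     %.1f', car_pct / phone_pct)
```

The two percentages land within a small multiple of each other, which is about as much precision as numbers this rough deserve.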

That’s how we arrive at our conclusion:

You have as much chance of your Samsung Galaxy Note 7 overheating as you did of dying in a car crash in 2012.

(Note: this article and its headline do not constitute a claim that the Note 7 is “safe”, or that you should not return it as recommended, etc. This article is not advice on how to live your life, it’s just an exploration of how humans think about risk.)