Whoa, Peter Norvig used some of my code!

I’m generally not one to be impressed by celebrity; you won’t catch me reading People or US Weekly, for example.  However, this morning I noticed with a shimmer of glee that Peter Norvig used some code that I wrote years ago in one of his recent projects.  So, just for the record, if Dr. Norvig ever shows up in US Weekly, I’ll pick one up!

In case you don’t know, Peter Norvig is the Director of Research at Google.  That’s interesting, but the real reason Dr. Norvig holds sway with me is his classic book, Paradigms of Artificial Intelligence Programming.  If it weren’t for that book, I almost certainly would not be doing what I’m doing today.  Its pages are where I came to understand lisps, and began to imagine what was possible and what I might be able to accomplish in computer science (final results yet to be determined, of course).  For that, I am extraordinarily grateful to him (and others, of course, but I’ll wait to talk about them when they get around to using some of my code! ;-) ).

Back to the story.  This morning, I decided to hop onto Google Analytics for a bit to check up on the traffic stats for our various websites.  Lo and behold, in the “top referrals” listing, I saw ‘norvig.com’.  “Well,” I thought to myself, “that’s interesting!”  A quick grep of the server logs (is there a screen in Google Analytics that actually provides you with the full referral URLs?) showed the referral URL to be Dr. Norvig’s “post” from last week, An Exercise in Species Barcoding.

A search for my name on that page shows that he needed a way to calculate the Levenshtein distance (also known as the edit distance) between two large strings.  His quick implementation (like most) operated in O(n^2) space, which would have required weeks of processing time in his particular case.  So, he looked around for a more efficient implementation, and found one that I wrote in October of 2003 that operated in linear space bounds (and was, ironically enough, my first-ever contribution to an open source project).  With a couple of tweaks to suit his specific needs, the code I wrote worked out nicely for him.

This story is satisfying and funny (for me, anyway) in a couple of different ways:

First, there’s the fact that (what I would now consider) throwaway work of mine is still floating around the nets six years later.  Remember kids, the Internet never forgets!

Second, it reminded me of what I was doing when I wrote that particular code.  I was building what would later become PDFTextStream’s first ground-truthing system[1] (although I don’t think I knew that term at the time).  It’s a lot more sophisticated now, but back in 2003 I was simply trying to save off the full (vetted and known-good) extracted text from each PDF document in our nascent test repository, so that later builds of PDFTextStream could compare their extracted text to those saved files.

Of course, it wouldn’t be practical to require that PDFTextStream produce identical output forever.  Some amount of slop had to be allowable, because (for example) if an extracted word was output with four spaces before it instead of two, that would generally be acceptable.  For that and other reasons, I wanted to test that current PDF text extracts matched the known-good extracts within a defined margin of error.  Unfortunately, I was ground-truthing full document extracts at that time, and most Levenshtein functions, with their quadratic space requirements, would need far too much memory to diff the multi-megabyte strings that were involved.

Solution: write my own Levenshtein function (loosely based on a pedagogical implementation by Mike Gilleland that had been incorporated into the Apache commons-lang project) that operated in linear space bounds.  Thankfully, I opted to offer the improvement back to the Apache commons-lang project and to Dr. Gilleland.  Had I not, Dr. Norvig would never have found that code, and I wouldn’t be writing this right now.
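For the curious, here is a rough sketch of the general idea, written in Python for brevity.  It is not the code that went into commons-lang, just an illustration of the standard two-row trick: each row of the edit-distance table depends only on the row before it, so there’s no need to keep the whole table in memory.  The names levenshtein and within_margin, and the 1% default margin, are made up for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b, keeping only two rows of the table."""
    # Put the shorter string along the row so memory is O(min(len(a), len(b))).
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current  # the older row is no longer needed
    return previous[-1]


def within_margin(expected: str, actual: str, margin: float = 0.01) -> bool:
    """Hypothetical check: is actual within `margin` (a fraction of the
    expected length, in edits) of the known-good text?"""
    allowed = int(len(expected) * margin)
    return levenshtein(expected, actual) <= allowed


print(levenshtein("kitten", "sitting"))  # 3
```

With something like this, a couple of extra spaces in a long extract would pass a within_margin check, while a badly mangled extract would not.  The running time is still quadratic; only the memory footprint shrinks.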

Third and finally, this story is satisfying because, hell, Peter Norvig used some of my code.  A person I respect and admire has found it convenient to use some minor thing I created years ago, and was thoughtful enough to say so.  I hope I can follow that example as I go along in my travels.

See, Dr. Norvig, I’m still learning from you.


[1] Ground truthing is a testing methodology often used in document processing systems where ideal or otherwise known-good output is cataloged, and then actual or current output is compared to it to determine relative accuracy.  PDFTextStream’s current ground-truthing system serves as a semi-rigorous smoke test of its aggregate text extraction accuracy while we’re doing active development, as well as an ironclad regression test for when we’re looking to cut a release.  Thankfully, it’s come a long, long way from the very naive approach I was pursuing in 2003.