Every person has a particular set of experiences they search for when choosing an occupation. I’ve always been fascinated with the act and process of discovery. Thankfully, helping to build and maintain PDFTextStream satisfies that fascination in spades, and in ways that I never anticipated.
One would assume that working on a piece of software that extracts text from PDF documents would be pretty dry work. And, to a certain extent, it is: supporting all of the intricacies and minutiae associated with a complex file format like PDF is not the most thrilling software development work.
However, what can be exciting about the experience is how it forces me to be exposed to things that I never would have seen otherwise. See, in order to ensure that PDFTextStream works well and continues to do so as it is improved and changed, we have developed a suite of test PDF documents. These documents must be examined one by one, fed into PDFTextStream, and records of the documents’ logical structure and text content saved off into what are called ‘ground truth’ files. Then, whenever a change is made to PDFTextStream, our automated tests compare all of the preexisting ground truth files with what PDFTextStream provides after it has been changed. This process of constantly tracking the impact of changes to PDFTextStream is critical in ensuring that it continues to be robust, providing high-quality output.
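For the curious, the ground truth cycle described above can be sketched in a few lines of Python. This is only an illustration of the idea, not our actual test harness: the `extract` callable stands in for whatever turns a PDF into text, the function names and the one-`.txt`-file-per-PDF layout are assumptions made for the example, and it compares text content only, leaving out the logical structure that real ground truth files also record.

```python
from pathlib import Path
from typing import Callable, List

# Hypothetical stand-in for a text extractor: takes a PDF path, returns its text.
Extractor = Callable[[Path], str]


def record_ground_truth(pdf_dir: Path, truth_dir: Path, extract: Extractor) -> None:
    """Save the extractor's current output for every PDF as its ground truth file."""
    truth_dir.mkdir(parents=True, exist_ok=True)
    for pdf in sorted(pdf_dir.glob("*.pdf")):
        truth_file = truth_dir / (pdf.stem + ".txt")
        truth_file.write_text(extract(pdf), encoding="utf-8")


def find_regressions(pdf_dir: Path, truth_dir: Path, extract: Extractor) -> List[str]:
    """Return the names of PDFs whose fresh output differs from the saved ground truth."""
    changed = []
    for pdf in sorted(pdf_dir.glob("*.pdf")):
        truth_file = truth_dir / (pdf.stem + ".txt")
        if not truth_file.exists():
            continue  # no ground truth recorded for this document yet
        if extract(pdf) != truth_file.read_text(encoding="utf-8"):
            changed.append(pdf.name)
    return changed
```

In a setup like this, `record_ground_truth` would run once after a document's output has been examined by hand, and `find_regressions` would run after every change to the extractor. An empty list means nothing moved; anything else is a document worth a second look.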
The point here, though, is that the process of building up and maintaining our suite of PDF documents (which numbers in the thousands now) exposes us to documents from nearly every corner of human activity. That’s thrilling for me, as I get the chance to read about things that I never would have come across had I not been involved in PDFTextStream. For example, our test suite includes PDF documents like:
- An issue of the newsletter produced by the National Multiple Sclerosis Society
- A research paper describing CFS, a Cryptographic File System for Unix that was developed at AT&T
- Various PDF versions of U.S. patents
- A maintenance worksheet that describes how to apply and care for a particular type of asphalt emulsion
- A whitepaper discussing various systems that help in managing spectral data
- An essay by Seth Godin called “Do Less” that discusses the need to be selective in one’s entrepreneurial ventures
- An English translation of an al Qaeda training manual seized by the Manchester, UK police in a raid of an al Qaeda cell house
- An article discussing options for 2D visualization of complex ontologies
- The 2004 roster for the University of Pittsburgh softball team
- A PDF version of a PowerPoint presentation about the excruciating financial minutiae of reinsurance
- An article about how to safely set up and use tower scaffolding
- A catalog of activities at the 2003 Melbourne Scarf Festival (who knew someone would ever host a lecture called “The Nature of Scarves”?)
As you can see, the list goes on and on and on. The world of human knowledge and experience is functionally infinite, but I love getting glimpses of obscure corners of it and making little personal discoveries. Pretty geeky, I know, but that’s not really surprising, is it?