Automated Quality Control, Part II

In my last post about quality control, I detailed the challenges we face in testing PDFTextStream in order to minimize hard faults, and some of the patchwork testing ‘strategy’ that we employed in the early days. Now, I’d like to walk you through the specific design goals and technical solutions that went into building our current automated quality control environment.

For some months, most of the PDF documents we tested PDFTextStream against were retrieved from the Internet using a collection of scripts. These scripts controlled a very simple workflow:

  1. Query for PDF URLs: A search engine (usually Yahoo, simply because I like its web services API) is queried for URLs that reference PDF documents containing a search term or phrase.
  2. Download PDFs: All of the URLs retrieved from the search engine are downloaded.
  3. Test PDFs with PDFTextStream: PDFTextStream is then tested against each of the PDF documents that were successfully downloaded.
  4. Report failures, suspicious results: Any errors thrown by PDFTextStream are reported, along with any suspicious log messages that might indicate a ‘soft failure’.
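The four steps above can be sketched as a single function. This is only an illustration of the workflow's shape, not our actual scripts: `run_once`, `Report`, and the three callables (`search`, `download`, `extract`) are hypothetical names, and the search engine and PDFTextStream calls are deliberately left as parameters rather than guessed at.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass
class Report:
    """Outcome of one run of the workflow (hypothetical structure)."""
    passed: List[str] = field(default_factory=list)
    failures: List[Tuple[str, str]] = field(default_factory=list)    # (url, error)
    suspicious: List[Tuple[str, str]] = field(default_factory=list)  # (url, log message)


def run_once(
    term: str,
    search: Callable[[str], Iterable[str]],      # step 1: term -> PDF URLs
    download: Callable[[str], Optional[bytes]],  # step 2: URL -> bytes, or None on failure
    extract: Callable[[bytes], List[str]],       # step 3: run the extractor, return log messages
) -> Report:
    report = Report()
    for url in search(term):
        pdf = download(url)
        if pdf is None:
            # Only successfully downloaded documents are tested.
            continue
        try:
            messages = extract(pdf)
        except Exception as exc:
            # Step 4: an error thrown by the extractor is a hard failure.
            report.failures.append((url, repr(exc)))
            continue
        if messages:
            # Step 4: log messages may indicate a soft failure.
            report.suspicious.extend((url, m) for m in messages)
        else:
            report.passed.append(url)
    return report
```

Passing the three stages in as functions keeps the sketch independent of any particular search API or extraction library, and makes it easy to exercise with stand-in functions.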

This approach is sound: thanks to the nature of search engine results, it makes it possible to test PDFTextStream against random collections of PDF documents. However, while the approach is effective in principle, our implementation of it was unenviable for some time:

So, drawing from these lessons, we set out to design and build a quality control environment. To me, the emphasis on ‘environment’ here is shorthand for a number of qualities that the system resulting from this effort should exhibit:

Those who know the fields will recognize these design principles as very similar to the ones relied upon in multi-agent systems and distributed computing. That is not accidental: from the start, we recognized that in order to test PDFTextStream to the degree that we thought necessary, we would need to test it against millions of PDF documents. That simply was not going to happen with any kind of manual, or even scheduled, system (such as simply running those old scripts from cron). Between that requirement and the need to have multiple ‘nodes’ running simultaneously in order to utilize all of the resources we have available, it was a no-brainer to borrow some of the concepts that are taken for granted by those steeped in the multi-agent systems field.
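To make the idea of several ‘nodes’ working at once a little more concrete, here is a deliberately small sketch. It assumes threads in a single process as a stand-in for nodes, which is much simpler than a real multi-machine environment; `run_terms` and `test_term` are hypothetical names, with `test_term` standing for one full pass of the workflow for a single search term that returns a failure count.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def run_terms(
    terms: Iterable[str],
    test_term: Callable[[str], int],  # hypothetical: runs the workflow, returns failure count
    nodes: int = 4,
) -> Dict[str, int]:
    """Run test_term for every search term, at most `nodes` at a time."""
    terms = list(terms)
    with ThreadPoolExecutor(max_workers=nodes) as pool:
        # pool.map preserves input order, so results line up with terms.
        counts = list(pool.map(test_term, terms))
    return dict(zip(terms, counts))
```

The point is only that work is handed out continuously to whichever worker is free, rather than being kicked off by hand or on a schedule.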

So, those are the design goals of our automated quality control environment, in broad strokes. It retains the fundamental workflow that was implemented long ago in that patchwork of scripts, but adds design principles that make the environment efficient, manageable, and effective at pushing PDFTextStream to its limits.