In my last post about quality control, I detailed the challenges we face in testing PDFTextStream in order to minimize hard faults, and some of the patchwork testing ‘strategy’ that we employed in the early days. Now, I’d like to walk you through the specific design goals and technical solutions that went into building our current automated quality control environment.
For some months, most of the PDF documents we tested PDFTextStream against were retrieved from the Internet using a collection of scripts. These scripts controlled a very simple workflow:
- **Query for PDF URLs:** A search engine (usually Yahoo, simply because I like its web services API) is queried for URLs that reference PDF documents containing a search term or phrase.
- **Download PDFs:** All of the URLs retrieved from the search engine are downloaded.
- **Test PDFs with PDFTextStream:** PDFTextStream is then tested against each of the PDF documents that were successfully downloaded.
- **Report failures and suspicious results:** Any errors thrown by PDFTextStream are reported, along with any spurious log messages that might indicate a ‘soft failure’.
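The four steps above can be sketched roughly as follows. This is an illustrative reconstruction, not our actual scripts: the search, download, and extraction functions are stand-ins (PDFTextStream itself is a library, faked here behind `extract_text`), and all URLs and return values are placeholders.

```python
def search_pdf_urls(query):
    """Stand-in for the search-engine step; returns URLs of PDF documents.

    The real scripts queried Yahoo's web services API with `query` and
    kept only results pointing at PDF files. Hardcoded here for illustration.
    """
    return ["http://example.com/a.pdf", "http://example.com/b.pdf"]


def download(url):
    """Stand-in for the download step; returns raw bytes, or None on failure."""
    return b"%PDF-1.4 placeholder document bytes"


def extract_text(pdf_bytes):
    """Stand-in for running PDFTextStream; raises on a hard fault."""
    if not pdf_bytes:
        raise ValueError("empty or unreadable document")
    return "extracted text"


def run_workflow(query):
    """Query -> download -> test -> report, as in the original scripts."""
    failures = []
    for url in search_pdf_urls(query):
        pdf = download(url)
        if pdf is None:
            continue  # download failed; nothing to test
        try:
            extract_text(pdf)
        except Exception as exc:  # a hard fault in the extractor
            failures.append((url, repr(exc)))
    return failures  # the 'report' step: everything that went wrong
```

With the placeholder stubs, `run_workflow("invoice")` returns an empty failure list; in practice, the interesting output was precisely the non-empty cases.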
This approach is sound: because search engine results are effectively arbitrary, it lets us test PDFTextStream against random collections of PDF documents. However, while the approach is effective in principle, our implementation of it was unenviable for some time:
- Being a collection of scripts, the process was manual, so testing runs happened only when someone was ‘at the helm’. This involved providing query strings for the search engine access phase, nursing the downloads in various ways, and then picking through the test results (failures weren’t ‘reported’ so much as they were spit out to a log file, which then had to be grepped through in order to find interesting nuggets).
- Since the process was manual, it couldn’t scale. That’s obviously bad, and led to significant restrictions on the number of PDF documents that could be reasonably tested in a given period. Beyond that, it led to our test box(es) sitting idle much of the time.
- Since failures (and ‘soft failures’) weren’t actually being reported or even recorded anywhere in any useful way, it was impossible (or really, really hard) to know which failures to concentrate on after the testing was finished. One always wants to focus on the bugs that are causing the most trouble, but we couldn’t readily tell which failures were most common, or even which of two different kinds of failures occurred more often. This made prioritizing work very difficult, akin to throwing darts blindfolded.
So, drawing from these lessons, we set out to design and build a quality control environment. To me, the emphasis on ‘environment’ here is shorthand for a number of qualities that the system resulting from this effort should exhibit:
- **Autonomy:** Each component of the environment (usually called a node) should operate asynchronously, moving through the workflow presented earlier without any intervention, assistance, or monitoring, whether from other systems or components or from people.
- **Scalability:** Each node (and each group of nodes) should be able to saturate all resources available to it: CPU capacity, bandwidth, disk, etc. Our aim here is to maximize the number of PDF documents PDFTextStream can be tested against in a given period, so having resources of any kind sitting idle is simply wasteful.
- **Auditability:** At any moment, we should be able to know what every node in the environment is doing, what it’s going to do next, and what it’s done since its inception. Further, we should be able to generate reports on what kinds of faults PDFTextStream has thrown, on which PDF documents, which build of PDFTextStream was used in each test, and so on. This makes it very simple to determine which errors should be focused on, and which can be put on the back burner.
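To make the auditability goal concrete, here is a minimal sketch of the kind of record-keeping it implies: each node logs every test to a shared database, and failure frequencies fall out of a simple query. The table layout, column names, and fault types below are hypothetical, not Snowtide’s actual schema.

```python
import sqlite3

# An in-memory database standing in for the environment's shared store.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE test_runs (
        node_id    TEXT,   -- which node ran the test
        build      TEXT,   -- PDFTextStream build under test
        pdf_url    TEXT,   -- document the test ran against
        fault_type TEXT    -- NULL means the test passed
    )
""")

# Illustrative records; fault types here are made up.
rows = [
    ("node-1", "2.1.0", "http://example.com/a.pdf", None),
    ("node-1", "2.1.0", "http://example.com/b.pdf", "XrefError"),
    ("node-2", "2.1.0", "http://example.com/c.pdf", "XrefError"),
    ("node-2", "2.1.0", "http://example.com/d.pdf", "FontError"),
]
conn.executemany("INSERT INTO test_runs VALUES (?, ?, ?, ?)", rows)

# Which fault types are most common? This is the query that tells us
# which bugs deserve attention first.
common = conn.execute("""
    SELECT fault_type, COUNT(*) AS n
    FROM test_runs
    WHERE fault_type IS NOT NULL
    GROUP BY fault_type
    ORDER BY n DESC
""").fetchall()
print(common)  # [('XrefError', 2), ('FontError', 1)]
```

The point is not the specific storage technology; it is that once every test is recorded with its node, build, document, and outcome, prioritization becomes a query rather than a grep through log files.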
Those familiar with such fields will recognize these design principles as very similar to those relied upon in multi-agent systems and distributed computing. That is not accidental: from the start, we recognized that in order to test PDFTextStream to the degree we thought necessary, we would need to test it against millions of PDF documents. That simply was not going to happen with any kind of manual, or even merely scheduled, system (such as running those old scripts from cron). Between that requirement and the need to have multiple ‘nodes’ running simultaneously in order to utilize all of the resources we have available, it was a no-brainer to borrow concepts that are taken for granted by those steeped in the multi-agent systems field.
So, those are the design goals of our automated quality control environment, in broad strokes. It retains the fundamental workflow implemented long ago in that patchwork of scripts, but adds design principles that make the environment efficient, manageable, and effective at pushing PDFTextStream to its limits.