High Noon - Chas Emerick

PDFlib released a PDF text extraction component, so let’s see how we stack up.

A week or so ago, Dan Shea at PlanetPDF posted a news item about PDFlib releasing a PDF text extraction library. That’s obviously very interesting to us, simply because until now, PDFTextStream has been the only library out there concentrating on PDF text extraction. My first reaction on reading this news was to shoot an email off to Dan, suggesting that a PDF text extraction library shootout of some kind might be in order. His reply was, “What do you have in mind?” Well, jeez, I hadn’t gotten that far yet.

I assume any comparison of text extraction libraries should focus on a few things immediately critical to the endeavor:

Text extraction accuracy
Operational performance and throughput
PDF compatibility (PDF specification support, decryption services, etc.)
Auxiliary features (accessibility of other content)

And then there are the extras that one looks for in any library:

Platform/Environment support
API clarity
Documentation and support
Vendor stability and longevity

Obviously, there’s a lot there, and since text extraction is a minute field compared to PDF generation, etc., Dan (or any other reviewer) would likely pick and choose what to focus on. May he (and others) always choose those aspects where we dominate… ;-)

In this particular situation, there’s also the complication of platform support: PDFlib’s component is available on a variety of platforms (through C bindings), whereas PDFTextStream is only available on the Java platform. That gives PDFlib an obvious advantage where Java isn’t in play, since we’re not showing up on .NET, Python, etc., yet.

Anything missing here? Feel free to email me with any aspects that you think are important.