PDFTextStream started out as a Java library, but is now available and supported for
Python. How that leap was made exemplifies how commercial and open source software efforts complement each other in the best of circumstances, and is also a fantastic case study in Java + Python integration.
In general, Java and Python don’t really mix. Their architectures, best-practices, object models, and philsophies are pretty divergent in a lot of ways. Because of this, you don’t often find them cohabiting peacefully.
However, there are significant advantages to be had by bringing these two environments together. Python is a really elegant language, and is very well-suited to whole classes of software development that are much more painful to tackle in Java. Java has its advantages as well: a very mature standard library, a huge array of third-party library support, fantastic development environments, and the backing of big players in IT. As always, there’s a right tool for each job, and sometimes Java works best, and sometimes Python works best, but a combination would truly be more than the sum of its parts.
As PDFTextStream got its legs in the market about 18 months ago, our consulting business picked up, and I began to look for a way to use Python for prototyping and custom development in conjunction with PDFTextStream. Of course, back then, PDFTextStream was only for Java, so some bridge-building was in order.
I came across JPype (
http://jpype.sourceforge.net), and found it to be a promising solution. JPype is an open-source Python library that gives “python programs full access to java class libraries”. Sounds good, and it was.
Eventually, however, we ran into some problems. Specifically, one of our clients wanted to have PDFTextStream extract text from PDF documents in-memory (i.e. without having the PDF file(s) on disk). That’s not problem with PDFTextStream — we added that feature in short order.
However, this client was also adamant in their desire for a Python-based solution. The rest of their application (with which our piece integrated) is 100% Python, and their performance requirements (think millions of PDF documents processed per month) made running PDFTextStream as some kind of service component unthinkable.
What’s the problem? JPype, circa summer of 2005, copied data between Python and Java. That means that, if you have a PDF file in memory in Python, and want to use PDFTextStream’s in-memory extraction capability, JPype made a copy of that PDF file data before passing it off into the target Java function or constructor.
Bad, bad, bad. That was a huge performance hit to the application, and simply unacceptable from the client’s (and users’) point of view.
The obvious course of action was to make JPype, in effect, “pass by reference” when working with significant chunks of data (byte arrays, Strings, etc). This was no simple task, but we soon contacted the maintainer of JPype, a friendly fellow named Steve Ménard, and explained our predicament.
Within a few days, he had hammered out the idea to expose Python strings (the byte array of the Python world in most environments) as DirectByteBuffer objects in Java. This was a great idea, and meshed nicely with PDFTextStream’s in-memory processing API. Steve and I hammered out a relatively informal work agreement and hourly rate, and it was assumed by both of us that his enhancements to JPype for our purposes would stay licensed under the Apache v2.0 license to be enjoyed by the rest of the JPype community.
Nailing down all the technical details took a few weeks, but in the end, Steve was successful. We were able to put PDFTextStream’s entire API to use from within Python in a way that sacrificed not one ounce of performance or functionality.
So what’s the upshot of all of this?
- Our consulting job completed with high praise from our customer, and our component of their application continues to hum away, extracting text from millions of PDF documents per month using PDFTextStream from Python
- We’ve since worked with Steve here and there as necessary in order to make additional tweaks to JPype. Because of his help, we now distribute a supported version of PDFTextStream for Python (click that for more technical details about the Python/Java integration made possible by JPype).
- The JPype project retains the new/improved functionality that we paid for, and the broader community continues to benefit from that.
- Steve got to pick up a new mac mini, plus whatever else he felt like buying with his hard-earned cash
That’s what I call a win-win situation, for us, for our customers, for Steve, and for the JPype project and its other users. In an ideal world, this is how open source and commercial software efforts should collaborate and cross-pollinate.