I just flipped the switch on v2.5.0 of PDFTextStream. It's a fairly significant release, representing hundreds of distinct improvements and bugfixes, most in response to feedback and experiences reported by Snowtide customers. If you find yourself needing to get data out of some PDF documents, you might want to give it a look…especially if existing open source libraries are falling down on certain documents or aren't cutting it performance-wise.
But, this piece isn't about PDFTextStream, not really. After prepping the release last night, I realized that PDFTextStream is ten years old, by at least one reckoning: though the first public release was in early 2004, I started the project two years prior, in early 2002, ten years ago. Ten years.
It's interesting to contemplate that I'm chiefly responsible for something that is ten years old, that is relied upon by lots of organizations internally, and by lots of companies as part of their own products. Aside from the odd personal retrospectives that can be had by someone in my situation (e.g. friends of mine have children that are around the same age as PDFTextStream; am I better or worse off having "had" the latter when I did instead of a son or daughter?), some thought has to be given to what the longevity and particular role of PDFTextStream (or, really, any other piece of long-lived software) implies and requires.
I don't know if there are any formal models for determining the maturity of a piece of software, but it seems that PDFTextStream should qualify by at least some measures, in addition to its vintage. So, for your consideration, some observations and opinions from someone who has a hand in a piece of mature software:
PDFTextStream is in production on three different classes of runtimes: all flavours of the JVM, both Microsoft and Mono varieties of the .NET CLR, and the CPython implementation of Python. This all flows from a single codebase, which reminds me of many kinds of mature systems (sometimes referred to as "legacy" once they're purely in maintenance mode, a stage of life that PDFTextStream certainly hasn't entered yet). Once constructed, such systems are often lifted out of their original runtime/platform/architecture to sit on top of whatever happens to be the flavour of the month, without touching the source tree.
Often, the effort required to make this happen simply isn't worth it; the less mature a piece of software is, the easier it is at any point to port it by brute force, e.g. rewriting something in C# or Haskell that was originally written in Java. This is how lots of libraries made the crossing from the JVM to .NET (NAnt and NHibernate are two examples off the top of my head).
However, the more mature a codebase, and the more challenging the domain, the more unthinkable such a plan becomes. For example, the prospect of rewriting PDFTextStream in C# to target .NET — or, if I had my druthers, rewriting PDFTextStream in Clojure to satisfy my geek id — is absolutely terrifying. All those years of fixes and tweaks in the PDFTextStream sources…trying to port all of them to a new implementation would constitute both technical and business suicide.
In PDFTextStream's case, going from its Java sources to a .NET assembly is fairly straightforward given the excellent IKVM cross-compiler. However, there's no easy Java->Python transpiler to reach for, and a bytecode cross-compiler wasn't available either. The best solution was to invest in making it possible to efficiently load and use a JVM from within CPython (via JNI). With that, PDFTextStream, derived from Java sources, ran without a hitch in production CPython environments. Maybe it was a hack, but it was, in relative terms, easier and safer than any alternative, and had no downsides in terms of performance or capabilities.
(I eventually nixed the CPython option a few years ago due to a lack of broad commercial interest.)
When I first started programming in Java, I sat aghast in the ominous glow of java.util.Date. It was a horror then, and remains so. Most of its methods have been marked as deprecated since 1997; and, despite the availability of all sorts of better options, it has not been removed from the standard library. Similar examples abound throughout the JRE and in all sorts of decidedly mature libraries.
For some time, I attributed this to sloth, or pointy-haired corporate policies, or accommodation of such characteristics amongst the broad userbase, or…god, I dunno, what are those guys thinking? In the abstract, if the physician's creed is to "do no harm", it seems that the engineer's should be "fix what's broken"; so, continual improvement should be the law of the land, API compatibility be damned.
Of course, it was naïve for me to think so. Brokenness is often in the eye of the beholder, and formal correctness is a rare thing outside of mathematics. Thus, the urge one has to "make things better" must be tempered by an understanding of the knock-on effects for whoever is living downstream of you. In particular, while making "fixes" to APIs that manifest breaking changes — either in terms of signatures or semantics — might make you feel better, there are repercussions:
Sarah: "Hey Gene, the new version of FooLib changes the semantics of the Bar(string) function. Do you want me to fix it now?"
Gene: "Sheesh, again? Well, weren't you looking at BazLib before?"
Sarah: "Yeah; BazLib isn't quite as slick, but Pete over in Accounts said he's not had any troubles with it."
Gene: "I'm sold. Stick with the current version of FooLib for now, but next time you're in that area of the code, swap it out for BazLib instead."
This is why semantic versioning is so important: when used and understood properly, it allows you to communicate a great deal of information in a single token. It's also why I can often be found urging people to make good breaking changes in v0.0.X releases of libraries, and why PDFTextStream hasn't had a breaking change in 6 years.
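To make the "single token" idea concrete, here is a minimal sketch of how a downstream user might read a version number under the semantic versioning rules. The names Version and may_break are hypothetical, invented for this illustration; the sketch assumes plain MAJOR.MINOR.PATCH strings (optionally prefixed with "v"), assumes the candidate is newer than the current version, and ignores pre-release and build tags.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class Version:
    """A plain MAJOR.MINOR.PATCH version (hypothetical helper)."""
    major: int
    minor: int
    patch: int

    @classmethod
    def parse(cls, text: str) -> "Version":
        # Accept "2.5.0" or "v2.5.0"; pre-release/build tags are not handled.
        major, minor, patch = (int(part) for part in text.lstrip("v").split("."))
        return cls(major, minor, patch)


def may_break(current: str, candidate: str) -> bool:
    """Return True if moving from current to candidate may break callers.

    Assumes candidate is the same as, or newer than, current.
    """
    old, new = Version.parse(current), Version.parse(candidate)
    if old == new:
        return False  # same release, nothing changes
    if old.major == 0 or new.major == 0:
        # 0.y.z is initial development: anything may change at any time.
        return True
    # From 1.0.0 onward, only a major version bump signals a breaking change.
    return new.major != old.major
```

Under these assumptions, may_break("2.4.0", "2.5.0") is False, while may_break("2.5.0", "3.0.0") and may_break("0.0.3", "0.0.4") are both True. The last case is why the 0.0.X range is the cheap place to get breaking changes out of the way.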
Of course there are parts of PDFTextStream's API that I'm not super proud of; I've learned a ton over the course of its ten year existence, and there are a lot of things I'd do differently if I knew then what I know now. However, overall, it works, and it works very well, and it would be selfish (not to mention a bad business decision) to start whacking away at changes that make the API aesthetically more pleasant, or of marginally higher quality, but which make customers miss a beat.
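One common way to improve an API without making anyone miss a beat is to add the better entry point alongside the old one, and keep the old one working with a deprecation notice. What follows is only a sketch of that general pattern, not anything from PDFTextStream itself; the functions extract_text and get_text are hypothetical stand-ins.

```python
import warnings


def extract_text(data: bytes, *, encoding: str = "utf-8") -> str:
    """Newer, preferred entry point (hypothetical)."""
    return data.decode(encoding)


def get_text(data: bytes) -> str:
    """Older entry point, kept so that existing callers keep working."""
    warnings.warn(
        "get_text() is deprecated; use extract_text() instead",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller's line
    )
    # Same behavior as before: delegate to the new function with old defaults.
    return extract_text(data)
```

Existing code that calls get_text() sees the same results as before, and anyone who wants the newer signature can move over on their own schedule.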
It seems to me that a good guideline might be that any breaking change needs to be accompanied by a corresponding 10x improvement in capability in order to be justifiable. This ties up well with the notion that a product new to the market must be 10x better than its competition in order to win; insofar as a new version of the same product with API breakage can potentially be considered as foreign as competing products, that new version is a new product.
If your hand is on the tiller of some mature software — or, some software that you would like to see live long enough to qualify as mature — your first priority at all times is to manage, a.k.a. minimize, risk for your users and customers.
As Prof. Christensen might say, software is hired to do a job. Now, "managing risk" isn't generally the job your software is hired to do, e.g. PDFTextStream's job is to efficiently extract content from any PDF document that is thrown at it, and do so faster and more accurately than the other alternatives. But, implicit in being hired for a job is not only that the task at hand will be completed appropriately, but that the thing being hired to do that job doesn't itself introduce risk.
The scope of software as risk management is huge, and goes way beyond technical considerations:
Even if one is selling a component library (which PDFTextStream essentially is), managing risk effectively for customers and users can be a key way to offer a sort of a whole product. Indeed, for many customers, managing risk is something that you must do, or you will simply never be hired for that job, no matter how well you fulfill the explicit requirements.