I was recently asked a question by a colleague working at NTNU's special historical collections (spesialsamlingene) about image formats and archives; this reminded me of work I used to do when I worked at a technology company.
The problem was that the company needed to provide its salespeople, and the salespeople working for agents around the world, with the certification documents the company held for its safety products. These documents were almost without exception held on paper, and were maintained in a folder archive by a single person. The trouble was that it was never easy to find and fax the requested document (and there would typically be many requests per day, for different documents). The solution, then, was to scan the documents and put them on the web (where this was allowed, of course), but many of the agents relied on low-bandwidth connections and so couldn't download these large scanned documents (and the scans had to be large in order to be legible).
Presented with this problem, I developed a framework to scan the documents at extremely high resolution (1200 dpi) as black-and-white bitmaps. Each file was then converted to EPS, and from EPS to PDF. The detour via EPS was necessary at the time, because I could see no easy way of producing standardized PDFs otherwise. The resulting file was about 17 KB per A4 page, and the print quality was typically that of a normal laser print.
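The distillation step can be sketched with Ghostscript (gs). This is a minimal illustration, not the original scripts; the file names are made up, and only the standard gs flags are assumed:

```python
# Sketch of the EPS -> PDF step of the pipeline using Ghostscript.
# File names here are illustrative; the original workflow is not preserved.

def eps_to_pdf_command(eps_path: str, pdf_path: str) -> list[str]:
    """Build the gs invocation that distills an EPS file into a PDF."""
    return [
        "gs",
        "-dBATCH", "-dNOPAUSE",     # run non-interactively, no prompts
        "-sDEVICE=pdfwrite",        # output device: PDF
        "-dEPSCrop",                # use the EPS bounding box as the page size
        f"-sOutputFile={pdf_path}",
        eps_path,
    ]

cmd = eps_to_pdf_command("certificate.eps", "certificate.pdf")
```

The command list would then be handed to a process runner (e.g. `subprocess.run(cmd, check=True)`) once gs is available on the machine.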
Of course, this process only worked well on non-raster documents (typically documents containing nothing but text), but it worked well in the given context. And I realized that this was an ideal opportunity to plan ahead.
Being interested in information and the technology around it, I had been working with Adobe's Extensible Metadata Platform (XMP) for some time, and I saw an ideal opportunity to implement it here.
I created a simple application based on the APIs Adobe provides for Acrobat and Photoshop, together with a few command-line tools such as gs, to build a workflow that took the raw scanned files, assembled them into a single document, prompted the user for metadata, and then locked the file down using standard Acrobat security. This worked fine, and the document could then be posted to the web. The only catch was that there was little immediate point in embedding the metadata, beyond the appeal of a file carrying its own metadata and the indestructible link that creates between content and metadata: the company's web presence was primitive, and there was no way to read the metadata from these files online.
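The metadata the user is prompted for ends up as an XMP packet, which is just RDF/XML embedded in the file. A minimal sketch of building such a packet with Dublin Core fields follows; the certificate values are invented for illustration, while the namespaces and element names are those of the standard XMP Dublin Core schema:

```python
# Build a minimal XMP packet (RDF/XML) with Dublin Core fields.
# The example values are made up; the namespaces are the standard ones.
import xml.etree.ElementTree as ET

XMP_TEMPLATE = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about=""
    xmlns:dc="http://purl.org/dc/elements/1.1/">
   <dc:title><rdf:Alt><rdf:li xml:lang="x-default">{title}</rdf:li></rdf:Alt></dc:title>
   <dc:creator><rdf:Seq><rdf:li>{creator}</rdf:li></rdf:Seq></dc:creator>
   <dc:description><rdf:Alt><rdf:li xml:lang="x-default">{description}</rdf:li></rdf:Alt></dc:description>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>"""

def build_xmp(title: str, creator: str, description: str) -> str:
    """Fill the template with user-supplied metadata values."""
    return XMP_TEMPLATE.format(title=title, creator=creator, description=description)

packet = build_xmp("Certificate 12345", "Example Certification Body",
                   "Safety product approval")
ET.fromstring(packet)  # sanity check: the packet is well-formed XML
```

A real tool would of course escape the values and write the packet into the PDF via a proper library rather than string formatting.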
Of course, my boss wondered why we were spending time doing this, so I showed him how the cryptically named files (each bore the certificate number issued by the certification body) were suddenly searchable on a standard PC, by certification body, product and so forth.
Spool forward a few years, and I get the question mentioned above: we’re looking at putting the image collection online, and I mention the metadata insertion system I outlined above.
The difference now is that such a system can deal with a variety of document types, insert metadata harvested from other sources (our OPAC, for example), and post files directly to the web simply by entering them into the workflow.
Additional functions could include "purposing": specifying conversion processes (typically in ImageMagick, but also using Adobe's graphics server technology) so that users get a file they can use in a word processor, or in their professional advertising campaign.
One of the things I like about this approach is that it dispenses with the need for a secondary database: the files themselves are the database. I presented a simple parser that retrieves XML metadata from PDFs in a previous post. Of course, the parser I wrote there is ludicrously simple, and uses a very rough technique to retrieve the data. Adobe's XMP SDK, on the other hand, provides proper methods for retrieving it, which makes this kind of approach look promising. It is especially heartening now that the PDF/A archiving standard for electronic documents has arrived.