
Fighting the Filing Cabinet March 12, 2007

Posted by Johan in Off Topic, Self-Management.

I am an idealist. I believe that I will be able to achieve what no academic has achieved before me. In short, I believe that I will find a way to manage without a filing cabinet.

For recent articles, this is not really a problem: everything is already in PDF format, so it’s a simple matter of putting it all on your hard drive. There is the minor problem of finding the articles again, but this is no longer much of an issue with Spotlight search in Mac OS X, or Google Desktop search for Windows (I believe Vista also has a Spotlight clone built in). The nifty thing about these new search functions is not just that they’re quite a lot faster than the old ones, but mainly that they let you search for text anywhere in the document, not just in the title.
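To make the difference between title-only and full-text search concrete, here is a minimal Python sketch. It assumes the articles have already been exported to plain-text files in one folder; the function names and the folder layout are made up for illustration, and this is not how Spotlight or Google Desktop are implemented (as far as I know, both consult a prebuilt index rather than reading every file on each search).

```python
from pathlib import Path


def search_titles(folder, term):
    """Return names of .txt files whose file name contains the term."""
    term = term.lower()
    return sorted(p.name for p in Path(folder).glob("*.txt")
                  if term in p.stem.lower())


def search_contents(folder, term):
    """Return names of .txt files whose text contains the term anywhere."""
    term = term.lower()
    hits = []
    for p in Path(folder).glob("*.txt"):
        # errors="ignore" keeps one badly encoded file from stopping the search
        text = p.read_text(encoding="utf-8", errors="ignore")
        if term in text.lower():
            hits.append(p.name)
    return sorted(hits)
```

A file called bower1970.txt would never turn up in a title search for "imagery", but it would turn up in a content search as long as the word appears somewhere in the text.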

The challenge, then, is older articles, photocopies from books, and such materials. You can scan these, and with the help of some software make PDFs out of them, but the beauty of searching for text is lost, as they will be scanned as images. Many older journal articles that actually are available online have also been scanned this way.

The solution is Optical Character Recognition (OCR), a technology that translates the image text into actual, machine-editable text, which you can search and copy. As a first step in my quest for a paperless office, I decided to look at the different OCR options that are available. I am a Mac user, but all the programs we’ll go through are also available for PC. For comparison purposes, I ran the same article through all of the programs. I chose the reference section, because it seems to give the OCR software the most trouble. Note that while the output I show here may seem noisy, the main text is usually processed a bit better.

Adobe Acrobat
Acrobat includes a basic OCR feature, under the Documents menu. It offers next to no options, but if left to its own devices, it seems to do a decent job. It does have a weird kink in that it refuses to scan pages that have both machine text and image text – this is sometimes the case with older articles that have been accessed online, where a front page or a header is added. Acrobat is able to ignore front pages and scan the remaining pages, but in cases where each page has a machine text header, you will need to go in and manually remove these to get Acrobat to scan. Here is what Acrobat’s machine text output looks like:

At present,
this hypothesis is rather indeterminate and
further research is required to refine or extend
ASCH, S. Reformulation of the problem of association.
American Psychologist, 1969, 24, 92-102.
BOWERG, . H. Mental imagery and associative learning.
In Lee Gregg (Ed.), Cognition in learning and
memory. New York: Wiley, 1970.
and familiarity in associative learning. Psychological
Monographs, 1960,74, No. 491.

Note that Acrobat considered the References header an image, which is why it doesn’t appear here. I’ve put the obvious errors in bold. Acrobat also frequently mixes up spaces, adding too many or none at all. To be fair, this is something that all the programs I tested do to some extent, and it matters very little as long as you aren’t using sentences as your search terms.
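If you do want to search for a phrase despite the unreliable spacing, one workaround is to ignore whitespace altogether when matching. The following is a rough Python sketch of that idea, assuming you have the machine text as a plain string; the helper names are my own and none of the programs reviewed here necessarily work this way.

```python
import re


def squash(text):
    """Lowercase the text and drop all whitespace."""
    return re.sub(r"\s+", "", text.lower())


def contains_phrase(ocr_text, phrase):
    """True if the phrase occurs in the OCR text, ignoring spacing and case."""
    return squash(phrase) in squash(ocr_text)
```

With this, a search for "1969, 24, 92-102" matches OCR output that reads "1969,24,92-102". The trade-off is that matches can now straddle word boundaries, and words that were hyphenated across a line break (such as "learn-" followed by "ing") are still missed.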

Acrobat handles the mix between the image text and the machine text very nicely. The original image text is preserved, and you can highlight text very easily. This may sound trivial, but in reality, OCR creates a second layer of the document, which stores the machine text, and Acrobat then hides this layer under the image text, making it match it perfectly. All in all, the original text is preserved, but the machine text remains easily searchable.
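One way to picture the hidden layer is as a list of recognised words, each tagged with the position it occupies on the scanned page. The Python sketch below is a conceptual model only: the class and field names are invented for illustration and say nothing about how the PDF format or Acrobat actually stores the layer.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str      # what the OCR engine recognised
    x: float       # left edge on the page
    y: float       # top edge on the page
    width: float
    height: float


@dataclass
class Page:
    image_file: str   # the scan the reader actually sees
    words: list       # invisible Word objects laid over the scan

    def find(self, term):
        """Return the boxes to highlight for every word containing the term."""
        term = term.lower()
        return [(w.x, w.y, w.width, w.height)
                for w in self.words if term in w.text.lower()]
```

Because every word carries its own box, a search hit can be highlighted exactly where the word sits in the image, which is what makes selecting and highlighting feel natural.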

Readiris Pro
Unlike Acrobat, Readiris is a dedicated OCR program. It is cheaper than Acrobat, but at $130, it’s not cheap by any other standards. For this, you do get superior control over the OCR algorithm. When you first open a file in Readiris, the program highlights what it considers the text chunks. You are then free to tweak this as you like, to ensure that things like Acrobat treating the References header as an image don’t happen. It’s worth noting that by default, Readiris doesn’t just put a hidden machine text layer under the image text – it replaces the image text with machine text in a default font. This looks alright, normally, but it does mean that when the scanner gets it wrong, you have no image text to fall back on. Fortunately, you can change this under the Settings/Text Format menu.

An example of machine text output:

At present,
this hypothesis is rather indeterminate and
further research is required to refine or extend
ASCH,S. Reformulation of the problem of association.
American Psychologist, 1969,24,92-102.
BOWER,G. H. Mental imagery and associative learn-
ing. In Lee Gregg (Ed.), Cognition in learning and
memory. New York: Wiley, 1970.
and familiarity in associative learning. Psycho-logical Monographs, 1960,74, No. 49l.

On casual inspection, this looks rather good. The actual PDF that is produced by using only machine text is not so impressive, however, as the font change appears to have made the text contract in unpleasant ways.

You will want to switch to the “image-text” option, which creates a hidden machine text layer, much as Acrobat does. Finally, it should be noted that Readiris has some pretty severe stability issues, at least in the Mac version. For the most part, ordinary scanning seems to be fine, but going into the settings is a minefield.

Scanr
In true Web 2.0 style, our last contestant is a web application. Scanr is a currently free website that is primarily designed for camera phones. The idea is that you take a picture of what you want to scan and email it to Scanr. The “scanned” image will then appear in your Scanr account, available for download as an image or PDF. Scanr applies some basic image processing, much like a photocopier: it strips out the shades of gray, cranks up the contrast, and essentially makes your scanned document look like a photocopy. This usually does make the document easier to read.

More to the point for our purposes, Scanr offers OCR. It may seem convoluted to attach your file to an email and send it off for processing, but to its advantage, Scanr is free, and you won’t need to buy a scanner as you can use your camera. There is, however, a major stumbling block: currently, Scanr can only accept certain image formats, not PDFs. Thus, to scan an old image text article like the one I used for my examples, you have to first save it as a JPEG, then attach it, email it over, wait for the image to get processed (which usually takes a minute or so), and then download the result. It’s pretty clear that this is not going to be a sustainable option when you’re dealing with filing cabinet-style amounts of documents.

In any case, here is what the output looks like:

fax experiences are translated for storage and out _ 7 -J , ,- « », o* _r , . , , , _ .e Paivio, A., Yuilli,J. C, &Madigan,S. A. Concrete- of which either surface sentences, imagery, or ness imagerv and meaningfuness values for 925 drawings may be generated, depending on the nouns. Journal of Experimental Psychology material and the task demands. At present, Monograph Supplement, 1968, 67 (1), 1-25. this hypothesis is rather indeterminate and Rohwer, W. D., Jr. Constraint, syntax, and meaning further research is required to refine or extend jrn paired-as;atfee ,,earnin?- y”‘”‘fl’ {A \r Learning ana Verbal Behavior, 1966, 5, 541-547. Jt* Schank, R. C. A conceptual dependency representa tion for a computer-oriented semantics. Technical Report No. CS-130, March 1969, Computer Science Department,

Like Acrobat, Scanr generates a hidden layer for the machine text, thus preserving the look of the original image text. Unfortunately, Scanr does this in a way that doesn’t even attempt to match the layout of the original text. Instead, all the machine text is placed invisibly at the top of the page, in a single paragraph in what can conservatively be estimated at font size 2. I tried my best to extract the same segment as in the previous examples, but I failed, as Scanr had garbled the two-column layout of the original article by treating it as a single column. This means that for anyone who is actually planning to copy and paste text, Scanr is not really an option.

It can also be seen that the output is far noisier than that of Acrobat or Readiris, with a host of bizarre symbols getting mixed up with misspelled words. That being said, the machine text is probably going to be serviceable for searching purposes. As long as your search term is mentioned more than once in the document, Scanr will probably get it right at least once.
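The reasoning behind that last point can be put into numbers. The Python sketch below assumes that each occurrence of a word is read correctly with the same probability and that the occurrences are independent of each other; both are simplifications, and the accuracy figure in the example is hypothetical, not something I measured.

```python
def chance_of_hit(word_accuracy, mentions):
    """Probability that at least one of `mentions` occurrences is read correctly.

    Assumes every occurrence is recognised independently with the
    same probability `word_accuracy` (a value between 0 and 1).
    """
    return 1 - (1 - word_accuracy) ** mentions
```

If, say, only 70% of words came out right, a term that appears three times would still be findable about 97% of the time, since all three occurrences would have to be mangled for the search to miss.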

For most purposes, Acrobat appears to be the sustainable option. Unfortunately, the limited customisability spells trouble, as there will be no way to correct the output if Acrobat fails to process a given document.

If you have deep pockets (or if you are familiar with Serial Box), Readiris may be a very useful backup option for those times when Acrobat fails. Once the garish machine text-only setting is changed, it performs just as well as Acrobat, though with none of its stability.

Scanr offers the budget option of the bunch. You can get away with using your digital camera instead of a scanner, and the price of the actual application can’t be beat. Since I will only use the machine text for searching, I don’t mind Scanr’s garbled output that much. A more critical flaw is the lack of PDF support, which makes it a very poor contender at present. If this is resolved, Scanr may become an appealing option: the process of uploading your file is going to be faster than the normal processing time in Acrobat or Readiris, and your email application most certainly takes up fewer resources than these two unashamed resource hogs.

So where does this leave my goal of a hard drive-based filing system? It’s pretty clear that there is no ideal solution, at present. A two-pronged mode of attack with both Acrobat and Readiris will produce workable results, but since both programs have their kinks, you won’t be able to automatise the process very much.

Time keeps coming up in this article, and for a good reason: scanning documents takes time, even if you splurge on a scanner with an automatic paper feeder. Running OCR on each document takes more time. While a filing cabinet is undesirable, there can be no doubt that just managing the electronic option is going to eat up considerable amounts of time.

There is also the issue of cost: a scanner with an automatic paper feeder will start somewhere around £300, and the cost of an Acrobat license (should you need to buy one) is hardly less than that. Add to that the cost of Readiris, and you are looking at around £800. That much money will buy you a filing cabinet or two, and will also cover the cost of having to take them with you when you move.

For the time being, then, I think I’ll stick with a hybrid system.

This post was inspired by a few recent posts about digital document filing: Playing With Wire, Lifehacker, Fazal Majid. If you want to learn more about the technical side of this, do look them up.

