Light blogging: organising life May 25, 2007

Posted by Johan in Links, Off Topic, Self-Management.
Wired has a few interesting articles about taxonomies. In Order is in the Eye of the Tagger, David Weinberger starts with today’s web 2.0 tagging systems, and traces their roots back to the original taxonomies proposed by Linnaeus. The central thesis is that older taxonomies such as that of Linnaeus were limited by the very fact that they were written on paper, which only allowed structure through hierarchies, going from high to low (human to worm, in Linnaeus’ system).

Modern tagging systems, such as those used by blogs, Flickr and del.icio.us, rely on a horizontal structure instead, where categories can be made up on the spot, by mixing and matching different tags to produce a list of the content you want to see. The flexibility of tagging is that items can exist in multiple categories, a solution that quickly results in chaos in traditional hierarchical tree structures.
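To make the "mixing and matching" idea concrete, here is a minimal sketch of a tag index in Python. It is only an illustration of the general principle, not how Flickr or del.icio.us actually work; the `TagIndex` class and the file names are made up for the example.

```python
from collections import defaultdict


class TagIndex:
    """Hypothetical in-memory tag index: each tag maps to a set of items."""

    def __init__(self):
        self._items_by_tag = defaultdict(set)

    def tag(self, item, *tags):
        """File one item under any number of tags at once."""
        for t in tags:
            self._items_by_tag[t.lower()].add(item)

    def find(self, *tags):
        """Return the items that carry every one of the given tags, sorted."""
        if not tags:
            return []
        sets = [self._items_by_tag.get(t.lower(), set()) for t in tags]
        return sorted(set.intersection(*sets))


# Made-up items, each living in several "categories" at the same time.
index = TagIndex()
index.tag("weinberger-wired.html", "taxonomy", "web2.0", "linnaeus")
index.tag("eol-promo.mov", "taxonomy", "biology", "video")
index.tag("holiday.jpg", "video-still", "biology")

print(index.find("taxonomy"))
print(index.find("taxonomy", "biology"))
```

A category here is nothing more than the intersection of whichever tags you ask for, which is why it can be invented on the spot. A folder tree would force each item into exactly one branch.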

A related article describes an attempt to impose this kind of tagging structure on Linnaeus’ taxonomy. The business-backed Encyclopedia of Life project seeks to do just that, as this article outlines. This is not the first such attempt, but unlike previous projects it has enough financial backing to make it borderline-feasible. To get an idea of what they’re trying to do, watch this promo video from YouTube:

Note the none-too-subtle similarities with this video. I guess plagiarism isn’t frowned upon quite as much in marketing as it is in academia.

While all this is pretty cool, especially for librarians and web designers, it’s easy to get carried away with what is essentially a filing system. On the page for the Encyclopedia of Life promo video, one commenter announced that this project is the biggest thing in biology since Watson and Crick, which is frankly delusional. Sure, structured information makes research faster and easier, but in itself, a perfectly organised bookshelf has no value. It’s what you make of it that matters.

It is interesting, however, to try to apply tagging to your own files. Unfortunately, no current operating system really supports filing by tag rather than by directory (although some try to reverse-engineer this feature anyway). I can’t wait for these technologies to move from the web to your hard drive – my own article filing system is already falling apart (4 folder levels deep and counting…), and I’m not even a postgrad yet.

Fighting the Filing Cabinet March 12, 2007

Posted by Johan in Off Topic, Self-Management.
I am an idealist. I believe that I will be able to achieve what no academic has achieved before me. In short, I believe that I will find a way to manage without a filing cabinet.

For recent articles, this is not really a problem – everything is already in PDF format, so it’s a simple matter of putting it all on your hard drive. There is the minor problem of finding the articles, but this is no longer so much of an issue with Spotlight search in Mac OS X, or Google Desktop search for Windows (I believe Vista has a Spotlight clone built-in, also). The nifty thing about these new search functions is not just that they’re quite a lot faster than the old ones, but mainly that they allow you to search for text anywhere in the document, not just in the title.

The challenge, then, is older articles, photocopies from books, and such materials. You can scan these, and with the help of some software make PDFs out of them, but the beauty of searching for text is lost, as they will be scanned as images. Many older journal articles that actually are available online have also been scanned this way.

The solution is Optical Character Recognition (OCR), technology that translates the image text into actual, machine-editable text, which you can search and copy. As a first step in my quest for a paperless office, I decided to look at the different OCR options that are available. I am a Mac user, but all the programs we’ll go through are available for PC also. For comparison purposes, I scanned the same article with all programs. I chose the reference section, because it seems to give scanners the most trouble. Note that while the output I show here may seem noisy, the main text is usually processed a bit better.

Adobe Acrobat
Acrobat includes a basic OCR feature, under the Document menu. It offers next to no options, but if left to its own devices, it seems to do a decent job. It does have a weird kink in that it refuses to scan pages that have both machine text and image text – this is sometimes the case with older articles that have been accessed online, where a front page or a header is added. Acrobat is able to ignore front pages and scan the remaining pages, but in cases where each page has a machine text header, you will need to go in and manually remove these to get Acrobat to scan. Here is what Acrobat’s machine text output looks like:

At present,
this hypothesis is rather indeterminate and
further research is required to refine or extend
ASCH, S. Reformulation of the problem of association.
American Psychologist, 1969, 24, 92-102.
BOWERG, . H. Mental imagery and associative learning.
In Lee Gregg (Ed.), Cognition in learning and
memory. New York: Wiley, 1970.
and familiarity in associative learning. Psychological
Monographs, 1960,74, No. 491.

Note that Acrobat considered the References header an image, which is why it doesn’t appear here. I’ve put the obvious errors in bold. Acrobat also frequently mixes up spaces, adding too many or none at all. To be fair, this is something that all the programs I tested do to some extent, and it matters very little as long as you aren’t using sentences as your search terms.
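If you did want to search for a longer phrase despite the scrambled spacing, one workaround is to strip spacing and punctuation from both the phrase and the OCR text before comparing them. The sketch below is my own illustration in Python, not a feature of Acrobat or Spotlight; the function names are invented, and it assumes plain ASCII text (accented letters would simply be dropped).

```python
import re


def squash(text):
    """Lower-case the text and drop everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def contains_ignoring_spacing(ocr_text, phrase):
    """True if the phrase occurs in the OCR text once spacing is ignored."""
    return squash(phrase) in squash(ocr_text)


# Two lines of the Acrobat output shown above, line break included.
acrobat_output = (
    "and familiarity in associative learning. Psychological\n"
    "Monographs, 1960,74, No. 491."
)

phrase = "Psychological Monographs, 1960, 74"
print(phrase in acrobat_output)                           # plain substring search
print(contains_ignoring_spacing(acrobat_output, phrase))  # spacing-tolerant search
```

The obvious trade-off is that throwing away word boundaries can produce the odd false match across two adjacent words, so this is better suited to long, distinctive phrases than to short ones.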

Acrobat handles the mix between the image text and the machine text very nicely. The original image text is preserved, and you can highlight text very easily. This may sound trivial, but in reality, OCR creates a second layer of the document, which stores the machine text, and Acrobat then hides this layer under the image text, making it match it perfectly. All in all, the original text is preserved, but the machine text remains easily searchable.

Readiris Pro
Unlike Acrobat, Readiris is a dedicated OCR program. It is cheaper than Acrobat, but at $130, it’s not cheap by any other standards. For this, you do get superior control over the OCR algorithm. When you first open a file in Readiris, the program highlights what it considers the text chunks. You are then free to tweak this as you like, to ensure that things like Acrobat treating the References header as an image don’t happen. It’s worth noting that by default, Readiris doesn’t just put a hidden machine text layer under the image text – it replaces the image text with machine text in a default font. This looks alright, normally, but it does mean that when the scanner gets it wrong, you have no image text to fall back on. Fortunately, you can change this under the Settings/Text Format menu.

An example of machine text output:

At present,
this hypothesis is rather indeterminate and
further research is required to refine or extend
ASCH,S. Reformulation of the problem of association.
American Psychologist, 1969,24,92-102.
BOWER,G. H. Mental imagery and associative learn-
ing. In Lee Gregg (Ed.), Cognition in learning and
memory. New York: Wiley, 1970.
and familiarity in associative learning. Psycho-logical Monographs, 1960,74, No. 49l.

On casual inspection, this looks rather good. The actual PDF that is produced by using only machine text is not so impressive however, as the font change appears to have made the text contract in unpleasant ways:

You will want to switch to the “image-text” option, which creates a hidden machine text layer, much like Acrobat. Finally, it should be noted that Readiris has some pretty severe stability issues, at least in the Mac version. For the most part, ordinary scanning seems to be fine, but going into the settings is a minefield.

Scanr
In true Web 2.0 style, our last contestant is a web application. Scanr is a currently free website that is primarily designed for camera phones. The idea is that you take a picture of what you want to scan and email it to Scanr. The “scanned” image will then appear in your Scanr account, available for download as an image or PDF. Scanr applies some basic image processing, much like a photocopier: it removes the greyscale, cranks the contrast, and essentially makes your scanned document look like a photocopy. This usually does make the document easier to read.

More relevant to our purposes, Scanr offers OCR. It may seem convoluted to attach and email your file off for processing, but to its advantage, Scanr is free, and you won’t need to buy a scanner as you can use your camera. There is, however, a major stumbling block: currently, Scanr can only accept certain image formats, not PDFs. Thus, to scan an old image text article like the one I used for my examples, you have to first save it as a JPEG, then attach it, email it over, wait for the image to get processed (which usually takes a minute or so), and then download the result. It’s pretty clear that this is not going to be a sustainable option when you’re dealing with filing cabinet-style amounts of documents.

In any case, here is what the output looks like:

fax experiences are translated for storage and out _ 7 -J , ,- « », o* _r , . , , , _ .e Paivio, A., Yuilli,J. C, &Madigan,S. A. Concrete- of which either surface sentences, imagery, or ness imagerv and meaningfuness values for 925 drawings may be generated, depending on the nouns. Journal of Experimental Psychology material and the task demands. At present, Monograph Supplement, 1968, 67 (1), 1-25. this hypothesis is rather indeterminate and Rohwer, W. D., Jr. Constraint, syntax, and meaning further research is required to refine or extend jrn paired-as;atfee ,,earnin?- y”‘”‘fl’ {A \r Learning ana Verbal Behavior, 1966, 5, 541-547. Jt* Schank, R. C. A conceptual dependency representa tion for a computer-oriented semantics. Technical Report No. CS-130, March 1969, Computer Science Department,

Like Acrobat, Scanr generates a hidden layer for the machine text, thus preserving the look of the original image text. Unfortunately, Scanr does this in a way that doesn’t even attempt to match the layout of the original text. Instead, all the machine text is placed invisibly at the top of the page, in a single paragraph in what can conservatively be estimated at font size 2. I tried my best to extract the same segment as in the previous examples, but I failed, as Scanr had garbled up the parallel-paragraph style of the original article by treating it as a single column. This means that for anyone who is actually planning to copy and paste text, Scanr is not really an option.

It can also be seen that the output is far noisier than that of Acrobat or Readiris, with a host of bizarre symbols getting mixed up with misspelled words. That being said, the machine text is probably going to be serviceable for searching purposes. As long as your search term is mentioned more than once in the document, Scanr will probably get it right at least once.
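For terms that only appear once, and appear misspelled, approximate matching can pick up the slack. Here is a rough sketch using `difflib` from the Python standard library. None of the tools reviewed here work this way as far as I know; the `fuzzy_hits` helper and the 0.8 similarity cutoff are my own assumptions, and the cutoff in particular would need tuning on real documents.

```python
import difflib
import re


def fuzzy_hits(ocr_text, term, cutoff=0.8):
    """Return words in the OCR text that closely resemble the search term."""
    words = sorted(set(re.findall(r"[a-z]+", ocr_text.lower())))
    return difflib.get_close_matches(term.lower(), words, n=5, cutoff=cutoff)


# A fragment of the Scanr output shown above.
scanr_output = (
    "of which either surface sentences, imagery, or ness imagerv and "
    "meaningfuness values for 925 drawings may be generated"
)

# The correctly spelled word never occurs, but the near miss is found.
print(fuzzy_hits(scanr_output, "meaningfulness"))
print(fuzzy_hits(scanr_output, "imagery"))
```

A lower cutoff catches more mangled words at the price of more unrelated hits, so some trial and error is needed.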

For most purposes, Acrobat appears to be the sustainable option. Unfortunately, the limited customisability spells trouble, as there will be no way to correct the output if Acrobat fails to process a given document.

If you have deep pockets (or if you are familiar with Serial Box), Readiris may be a very useful backup option for those times when Acrobat fails. Once the garish machine text-only setting is changed, it performs just as well as Acrobat, though with none of its stability.

Scanr offers the budget option of the bunch. You can get away with using your digital camera instead of a scanner, and the price of the actual application can’t be beat. Since I will only use the machine text for searching, I don’t mind Scanr’s garbled output that much. A more critical flaw is the lack of PDF support, which makes it a very poor contender at present. If this is resolved, Scanr may become an appealing option: the process of uploading your file is going to be faster than the normal processing time in Acrobat or Readiris, and your email application most certainly takes up fewer resources than these two unashamed resource hogs.

So where does this leave my goal of a hard drive-based filing system? It’s pretty clear that there is no ideal solution, at present. A two-pronged mode of attack with both Acrobat and Readiris will produce workable results, but since both programs have their kinks, you won’t be able to automatise the process very much.

Time keeps coming up in this article, and for a good reason: scanning documents takes time, even if you splurge on a scanner with an automatic paper feeder. Running OCR on each document takes more time. While a filing cabinet is undesirable, there can be no doubt that just managing the electronic option is going to eat up considerable amounts of time.

There is also the issue of cost: a scanner with an automatic paper feeder will start somewhere around £300, and the cost of an Acrobat license (should you need to buy one) is hardly less than that. Add to that the cost of Readiris, and you are looking at around £800. That much money will buy you a filing cabinet or two, and will also cover the cost of having to take them with you when you move.

For the time being, then, I think I’ll stick with a hybrid system.

This post was inspired by a few recent posts about digital document filing: Playing With Wire, Lifehacker, Fazal Majid. If you want to learn more about the technical side of this, do look them up.

Getting Progressively Better Organised… November 26, 2006

Posted by Johan in Off Topic, Self-Management.
As my course moves on, I find that keeping all my notes, articles and other random stuff organised is proving quite difficult. From the start of my degree, I’ve kept a tidy folder structure where everything I do is sorted by module, but this is not always intuitive, when you’re trying to find some random article that you vaguely remember reading last year.

I’ve found that, looking back, I’ve progressively gotten better at achieving some kind of order (though I don’t know if my improvements match the rate of new material that gets added). In true Piagetian fashion, I will outline a developmental trajectory:

The 3 Stages of Organisation

1. Ignorance
For almost all of my first year, I was actually in the habit of deleting PDFs that I had read, in some insane cleaning mania. This is really the one mistake that you cannot correct through later re-organisations – if it’s no longer on your hard drive, it’s gone. I learned this the hard way when a recent article in Wired about Daniel Langleben’s use of fMRI for lie detection prompted me to remember a meta-analysis by Ben-Shakhar and Elaad on the Guilty Knowledge Test, which I had read about a year before. I spent the better part of an afternoon scouring my hard drive, but no… Apparently I had, despite finding the article fascinating, decided to delete it after reading it. Fortunately, I stumbled upon the article again, casually referenced in a blog somewhere. Next time I don’t count on being so lucky.

Lesson learned: Save everything. PDFs aren’t that big, you can afford to gather up a few thousand.

2. Elaboration
It turns out that even if the article is on your hard drive, a cryptic title like “Ben-Shakhar2003.pdf” is not always going to be enough to jog your memory. Since I use a Mac, I can use the built-in Spotlight feature in OS X to search any text on the hard drive, including text in PDFs (similar features are available to PCs through Google Desktop Search, I hear)… But this doesn’t work with older articles, which are typically scanned as images. You can tell that this is the case if you are unable to highlight text for copying.

In addition to this, sometimes even searchable articles refuse to be found because your search terms are based on your own version of psychobabble, which may be subtly different from the psychobabble used in the article (e.g., short-term memory versus working memory).

And let’s face it – some researchers just don’t know how to write. There are plenty of important articles that I would just never want to attempt to decipher again.

Lesson learned: Write your own notes on articles. Jot down a few bullet points on the key findings, and save as a text file with the same name as the PDF.
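The "same name as the PDF" convention has a side benefit: it is easy to check mechanically which articles still lack notes. The following is a small sketch in Python, assuming the notes end in .txt and sit next to the PDF. The helper name is mine, and "Turkheimer2003.pdf" is a hypothetical file used only for the demonstration.

```python
import tempfile
from pathlib import Path


def pdfs_without_notes(folder):
    """Return PDFs under `folder` that have no same-named .txt notes file."""
    folder = Path(folder)
    return sorted(
        pdf for pdf in folder.rglob("*.pdf")
        if not pdf.with_suffix(".txt").exists()
    )


# Demonstration in a throwaway folder: one article with notes, one without.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "Ben-Shakhar2003.pdf").touch()
    (root / "Ben-Shakhar2003.txt").write_text("- key findings go here\n")
    (root / "Turkheimer2003.pdf").touch()  # hypothetical article, no notes yet
    missing = [pdf.name for pdf in pdfs_without_notes(root)]

print(missing)
```

Pointing `pdfs_without_notes` at a real articles folder gives a to-do list of papers that still need summarising, however deep the folder structure goes.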

3. Integration
At some point, sooner rather than later, this “note and PDF” system will become a little unwieldy, no matter how good your folder organisation is. My folder system is now four levels deep (e.g., Psychology/Perception/MAE Storage/Sources). Since this system is based on the current module, I’m fine as long as I stay within perception, for example. But as soon as I need to integrate findings from different modules, I end up jumping up and down folder structures like mad.

Additionally, this system still relies on me knowing what I’m looking for. This only works if I remember every single source I have ever saved.

Lesson learned: time to switch to an organisation program. Fortunately for Mac users, there are plenty of solid options. The most “Pro” alternatives are perhaps DEVONthink and Boswell. I tried both, and they are both improvements over a straight folder structure… However, I ended up settling for the decidedly non-pro VoodooPad, because hyperlinks are awesome.

Let me explain: VoodooPad is essentially a personal wiki (as in Wikipedia), that you keep for yourself. Without any scripting, you can create pages and link them together. It’s also easy to create outside links to locations on your hard drive. The really cool thing about this is that if I have a page where I sum up all my stuff on Behavioural Genetics, basically anytime I type those two words they are automatically linked to said page. You can set up aliases for each page, so I tend to create an alias for each article that I summarise in that page, e.g., Turkheimer, 2003. Now, every time I cite (Turkheimer, 2003), that citation becomes a link to my notes on the original article.

As you might imagine, this becomes extremely intuitive, once you get used to it. When typing out an essay plan, the plan itself contains direct links to every article I’ve used, and if I want to relate my writing to another area, I just type in something like “this is similar to the heritability estimate used in Behavioural Genetics”, and poof, I have direct links to my writing on heritability, and Behavioural Genetics.
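For the curious, the automatic linking can be imitated in a few lines of Python. To be clear, this is a sketch of the general idea and not VoodooPad's actual implementation. The `[[page|text]]` link notation, the `autolink` function and the page names in the alias table are all assumptions made for the example.

```python
import re


def autolink(text, aliases):
    """Wrap every known page title or alias in [[page|text]] markers.

    `aliases` maps a phrase (a page title or an alias) to its target page.
    """
    if not aliases:
        return text
    # Try longer phrases first, so a long title wins over a shorter one
    # that happens to be contained in it.
    phrases = sorted(aliases, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(p) for p in phrases) + r")\b",
        re.IGNORECASE,
    )
    lookup = {phrase.lower(): page for phrase, page in aliases.items()}
    return pattern.sub(
        lambda m: f"[[{lookup[m.group(1).lower()]}|{m.group(1)}]]", text
    )


# Hypothetical alias table: phrase as typed -> page it should link to.
aliases = {
    "Behavioural Genetics": "Behavioural Genetics",
    "heritability": "Heritability",
    "Turkheimer, 2003": "Turkheimer 2003 notes",
}

sentence = (
    "This is similar to the heritability estimate used in "
    "Behavioural Genetics (Turkheimer, 2003)."
)
linked = autolink(sentence, aliases)
print(linked)
```

The sketch also shows why the payoff grows with the amount of material entered: every new page or alias adds a phrase to the table, and every existing note that mentions it gains a link for free.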

I’ve only started to tap the full potential of VoodooPad. When you first start to enter data, obviously you get very few links lighting up, because there is no other data to link to. It is only once you go through the trouble of importing all your old notes into VoodooPad and giving them appropriate aliases that you start seeing what the application is capable of.

I’m under no illusion that I’ve now reached organisation nirvana. Stages 4 and 5 are out there, and I’m sure I’ll get there eventually. There is also an issue of optimal self-management to consider here. In other words, it’s easy to forget the time you spend figuring out how to self-manage better, when estimating how much productivity you can gain from switching to a new system. More on that in another post.

Self-Management November 24, 2006

Posted by Johan in Learning, Raves, Self-Management.
Mindhacks has an interesting post up on how Skinner was not a fascist, which deserves to be said in itself… But the really interesting bit is the article that is linked: Skinner as a Self-Manager, by Robert Epstein.

Self-management, for those unfamiliar with the Psychobabble, is basically how you get yourself to achieve your goals. If your goal is to write an essay, self-management would involve setting aside time to write, ensuring that your topic is appropriate, and so forth. By extension, self-management is the stuff that productivity websites like Lifehacker are concerned with. And it turns out that decades before GTD reared its ugly head, Skinner was already on it:

“a few examples and a brief analysis can’t begin to capture how pervasive self-management was in his life. It was much more than a few gizmos and timers. It was what many would call an attitude. He managed his own behavior almost continuously. When I was in graduate school, a fellow student mentioned that Fred seemed to dispose of envelopes and junk mail in an especially efficient way. I had never noticed this before, but it was true. When he opened his mail in the morning, he usually positioned his chair and trash can so that the very slightest flick of his wrist did the job. This was no accident, and it was part of the reason he was able to reply to virtually every letter he ever received, even until the end (Vargas, 1990).” (Epstein, 1997, p. 547)

More specifically:

“Fred kept lists of things to do, because people who keep lists of things to do do more things. He made schedules for himself to keep himself on track. We all use daily and weekly schedules, but Fred made longterm schedules as well —even 10- and 20-year schedules (Skinner, 1979, 1983b).” (Epstein, 1997, p. 554)

“He knew that the best ideas are often fleeting, so he developed special ways to capture them. He kept a notebook or a tape recorder by his bed and by his pool, for example. He knew that writing was a delicate and easily disrupted activity, so he took pains to shelter it from disruptions. He built special shelves so that his dictionaries and other reference books were always at arm’s reach. He used his writing desk for serious writing only; he answered letters and paid bills elsewhere.” (Epstein, 1997, p. 554)

Skinner was proud of his self-management skills, as he considered them a direct application of Behaviourism. And of course, this could be related to other scientific paradigms:

“Freud was unable to stop smoking cigars, up to 25 a day, though smoking must have been obviously related to the heavy ‘catarrh’ he suffered from most of his life, as well as to the protracted cancer of the jaw in his last years . . . an astonishing lack of self-understanding or self-control. Was he not bothered by it, or did much of his theory spring from the need to acknowledge that the habit was ‘bigger than he was’?” (Epstein, 1980, p. 341)

Of course, poking fun at psychoanalysis is a bit like shooting fish in a barrel. A final quote sums up the core of Skinner’s self-management:

“Fred’s most important self-management practice is implied in his writings but is nowhere clearly stated. He always spent a few minutes each day, often scattered throughout the day, searching for and analyzing variables of which his behavior seemed to be a function. It is not enough to live your life, he told me; you also need to analyze it and make changes in it frequently and regularly.” (Epstein, 1997, p. 559)

On reading Epstein’s article, you do not get the impression that this was an ascetic, disciplinarian lifestyle that Skinner imposed on himself. Self-management was a source of joy, a constant pet project that was carried out because it was a positive reinforcer, not a negative one.

I believe that this kind of constant, critical re-evaluation of your lifestyle is crucial, if you aim to achieve anything beyond the norm. Epstein goes as far as suggesting that this self-management skill was the crucial factor that enabled Skinner to achieve as much as he did. I’m not sure about that, but it’s nevertheless inspiring to see that at the base of it, it may not be just an abstract “genius” quality that separates those who achieve something from those who don’t. There is a method here, and it can be applied by just about anyone.

I will outline my own self-management attempts in a future post.

