Tuesday 1 February 2011

Working with OCR software

I was trying so hard not to digitize Brown’s Toronto Directory. I thought it would take up too much time and I really was trying to complete the linking of the other two directories to the people who had the fortune or misfortune to be represented amongst the inhabitants of "Institutions" in the 1861 census. About a month ago I succumbed. Since then I have been ensconced in yet another data collection dealing with Toronto in 1861.

First, was the compiler, W R Brown, part of the Brown family of The Globe, or was he somebody else? Originally I assumed he was part of the newspaper family, but I am beginning to wonder if my assumption was correct. Living at Idlewold in Rosedale would have meant a long trip to the office every day. Either a shaky knowledge of downtown Toronto or a great mix-up of entries just before typesetting led to a very noticeable error on Front Street West. The south side of the street suddenly jumped from Bay to Simcoe, omitting Jacques & Hays’ cabinet works and Union Station. Numbered premises began again after Simcoe with even numbers, which should have been on the north side. All the entries actually existed, but the order spoiled the meaning of the word “index”.

When I first discussed indexing city directories back in November I said that optical character recognition (OCR) was not the way to go with 19th-century printed pages. A couple of weeks ago I decided to see if another attempt with some software I bought years ago would be more successful.

The pages of the directory are filed individually in my computer and each page has two columns of entries. I open each directory page in my photo-finishing/drawing program, Paint.net, where I straighten the page and crop each of the columns to individual images or files. Then I take the separate columns to the OCR program and re-crop to get rid of extraneous lines and blobs of dirt on the image. The image is then “read” by the software. The result is not a pretty sight. To get even one line completely correct on a page of 60 entries is unusual. The software is exceedingly poor at telling apart similar-looking numbers, such as “3” and “8”, or “6” and “0”. The letters “v” and “y” are almost always misinterpreted. One gets used to these little vagaries.

A new discovery for me was that the OCR software permits find-and-replace corrections. I now start off work on each page by getting rid of unwanted double and triple spaces and removing all periods or full stops, things that aren’t wanted in my database style. Other start-up operations include altering the font to a uniform 20-point Times New Roman and typing a title right on the page, usually the book page number and the street it contains. Correcting a column takes about 20 minutes. Each column is saved to an individual spreadsheet file.
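For anyone who would rather script this first pass than click through find-and-replace, the two start-up cleanups above (collapsing extra spaces and dropping periods) can be sketched in a few lines of Python. This is not what I do in the OCR program itself; the function name and the sample line are invented purely for illustration, and it assumes the OCR text has been saved as plain text.

```python
import re

def clean_ocr_line(line: str) -> str:
    """Apply the two start-up cleanups to one line of OCR text.

    Hypothetical helper: the name and behaviour are an illustration,
    not a feature of any particular OCR package.
    """
    line = line.replace(".", "")        # remove periods / full stops
    line = re.sub(r" {2,}", " ", line)  # collapse double and triple spaces
    return line.strip()                 # trim stray spaces at either end

# Invented sample entry, not taken from the directory
sample = "12  Smith,  John.   carpenter."
print(clean_ocr_line(sample))  # prints: 12 Smith, John carpenter
```

Periods are removed before the spaces are collapsed, so a stop sitting between two spaces does not leave a double space behind. Misread characters such as “3” for “8” still have to be corrected by eye against the page image.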

Once I have a collection of 10 to 20 completed spreadsheets I take a break from OCR work and combine the spreadsheets into one large one where I arrange the data into a series of columns which will ultimately be moved to a database.
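The combining step could likewise be automated. The sketch below assumes each column's spreadsheet has been exported as a CSV file, which is an assumption on my part, since I actually do the merge by hand inside the spreadsheet program. The function name and the idea of tagging each row with its source file name are both hypothetical.

```python
import csv
from pathlib import Path

def combine_columns(csv_paths, out_path):
    """Append the rows of each per-column CSV file into one combined file.

    The source file name (without extension) is added as the first field,
    so each entry can be traced back to the page column it came from.
    Returns the number of rows written.
    """
    count = 0
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for path in csv_paths:
            path = Path(path)
            with open(path, newline="", encoding="utf-8") as src:
                for row in csv.reader(src):
                    if not any(cell.strip() for cell in row):
                        continue  # skip completely blank rows
                    writer.writerow([path.stem] + row)
                    count += 1
    return count
```

Called with a sorted list of the 10 to 20 column files and the name of an output file, it writes one long sheet in page order. Arranging the entries into their final database columns remains a separate, manual job.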

The whole process probably takes as long as copy-typing the whole book, but because there is such a variety of tasks involved, the boredom factor is reduced. Into the bargain, my aging body is subject to a lot of aches and pains, particularly in the neck and shoulders. Continuous copy-typing would send me into complete lock-down. The other advantage, a personal one, is that, once I have gone through both the OCR and spreadsheet steps, I have a much better knowledge of the street I have been looking at and the people who lived there.
