Monday, 14 February 2011

Brown's Directory Now a Database

It was January 11 when I wrote that I was attempting to make a database from an OCR scan of Brown’s Directory of Toronto dated 1861-62. By Friday (11 February), I had a listing of the complete street directory of approximately 8500 entries. OCR is workable on 19th-century printing after all, but find-and-replace was far more useful than the automated spell checker that the software makers expect users to employ.

I have also experimented with using OCR on the “names directory” in Brown. Having got as far as the D’s, I have decided that it is not worth the effort. In the “street directory” people are arranged by their addresses and the street names only come up as headers, while in the names directory there is an address on every line. OCR can find no uniformity in these street names, so there is no time saved over copy-typing. It also appears that entries in the alphabetical list are one per house, no more than in the street directory. Mitchell’s Directory of three years later included most employed people whether they were head of household or not. There is no point in duplicating Brown’s Directory by transcribing a second lot of the same information.

The work of matching the census to my new directory database has now begun. The first ward I attacked was St Andrew’s, which included many forms for commercial premises and where street names were omitted for the first division (Yonge Street west to York, Queen West to Adelaide). I had added as many clues from these forms as I could to my transcription, including all notes stating “Personal details at place of residence”, signed by the proprietor and, maybe, giving his home address. Two of the first three forms with nameless proprietors have now been matched to actual people. For instance, Mr Leask’s handwriting was very hard to read and I had not been able to recognize the street he lived on, although the house number was 174. A quick inspection of my sorted digitized database found him at 174 Gerrard Street East and also gave the nature of his business downtown.
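For anyone who would rather script this kind of lookup than sort and scan by eye, here is a minimal sketch in Python. It assumes the directory database has been loaded as a list of records with `surname`, `number` and `street` fields; those field names and the function `find_by_house_number` are my own inventions for illustration, and only the Leask address comes from the directory itself.

```python
def find_by_house_number(directory, surname, house_number):
    """Return directory entries matching a surname and a house number."""
    wanted = surname.strip().lower()
    return [
        entry for entry in directory
        if entry["surname"].strip().lower() == wanted
        and entry["number"] == house_number
    ]

# Illustrative rows only; just the Leask address at 174 comes from the post.
directory = [
    {"surname": "Leask", "number": 174, "street": "Gerrard Street East"},
    {"surname": "Leask", "number": 12, "street": "Example Street"},
    {"surname": "Smith", "number": 174, "street": "Example Street"},
]

for match in find_by_house_number(directory, "Leask", 174):
    print(match["number"], match["street"])
```

The point is that a legible house number plus a surname can be enough to recover a street name that the census handwriting hides.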

This kind of discovery challenged me to continue the census matching process. I have now finished the letter A for five wards. Two more city wards and the suburbs of Yorkville and East York to go. I hope temptations won’t lead me astray before the end of the alphabet. Certainly, if Brown’s Directory had come into my hands earlier, Caverhill and Mitchell would never have been done.

Tuesday, 1 February 2011

Working with OCR software

I was trying so hard not to digitize Brown’s Toronto Directory. I thought it would take up too much time and I really was trying to complete the linking of the other two directories to the people who had the fortune or misfortune to be represented amongst the inhabitants of "Institutions" in the 1861 census. About a month ago I succumbed. Since then I have been ensconced in yet another data collection dealing with Toronto in 1861.

First, was the compiler, W R Brown, part of the Brown family of The Globe, or was he somebody else? Originally I assumed he was part of the newspaper family, but I am beginning to wonder if my assumption was correct. Living at Idlewold in Rosedale would have meant a long trip to the office every day. Either his limited knowledge of downtown Toronto or a great mix-up of entries just before typesetting led to a very noticeable error on Front Street West. The south side of the street suddenly jumped from Bay to Simcoe, omitting Jacques & Hays’ cabinet works and Union Station. Numbered premises began again after Simcoe with even numbers, which should have been on the north side. All the entries actually existed, but the order spoiled the meaning of the word “index”.

When I first discussed indexing city directories back in November, I said that optical character recognition (OCR) was not the way to go with 19th-century printed pages. A couple of weeks ago I decided to see if another attempt with some software I bought years ago would be more successful.

The pages of the directory are filed individually in my computer and each page has two columns of entries. I open each directory page in my photo-finishing/drawing program, Paint.net, where I straighten the page and crop each of the columns to individual images or files. Then I take the separate columns to the OCR program and re-crop to get rid of extraneous lines and blobs of dirt on the image. The image is then “read” by the software. The result is not a pretty sight. To get one line completely correct on a page of 60 entries is unusual. The software is exceedingly poor at distinguishing numbers, confusing “3” with “8” and “6” with “0”. The letters “v” and “y” are almost always misinterpreted. One gets used to these little vagaries.

A new discovery for me was that the OCR software permits find-and-replace corrections. I now start work on each page by getting rid of unwanted double and triple spaces and removing all periods or full stops, things that aren’t wanted in my database style. Other start-up operations include altering the font to a uniform 20-point Times New Roman and typing a title right on the page, usually the book page number and the street it contains. Correcting a column takes about 20 minutes. Each column is saved to an individual spreadsheet file.
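The two start-up corrections (extra spaces and periods) could also be scripted outside the OCR program. What follows is only a minimal sketch in Python, assuming the OCR output has been exported as plain text. The function name `clean_ocr_line` and the sample line are made up for illustration, and this is not the find-and-replace feature of the OCR software itself.

```python
import re

def clean_ocr_line(line: str) -> str:
    """Apply the start-up corrections to one line of OCR text."""
    # Remove all periods (full stops); they are not wanted in the database style.
    line = line.replace(".", "")
    # Collapse runs of two or more spaces into a single space.
    line = re.sub(r" {2,}", " ", line)
    # Trim any space left at either end of the line.
    return line.strip()

# Hypothetical sample line, not taken from the directory.
sample = "12   Smith  J.  W.,   grocer."
print(clean_ocr_line(sample))
```

Misread digits and letters still have to be corrected by eye; a script like this only handles the purely mechanical tidying.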

Once I have a collection of 10 to 20 completed spreadsheets I take a break from OCR work and combine the spreadsheets into one large one where I arrange the data into a series of columns which will ultimately be moved to a database.
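The combining step can be sketched in Python as well, assuming each corrected column has been saved or exported as a CSV file. The function name `combine_columns`, the file layout and the extra source-file column are my own assumptions for illustration, not a description of what any spreadsheet program does.

```python
import csv
from pathlib import Path

def combine_columns(csv_paths, out_path):
    """Append the rows of several per-column CSV files into one file.

    Returns the number of rows written.
    """
    total = 0
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for path in csv_paths:
            with open(path, newline="", encoding="utf-8") as src:
                for row in csv.reader(src):
                    if not any(cell.strip() for cell in row):
                        continue  # skip blank rows
                    # Record the source file so each entry can be traced back.
                    writer.writerow([Path(path).name] + row)
                    total += 1
    return total
```

With a hypothetical folder named `columns`, a call such as `combine_columns(sorted(Path("columns").glob("*.csv")), "combined.csv")` would produce one large file. Keeping the source file name on every row makes it possible to go back to the original page image when an entry looks doubtful.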

The whole process probably takes as long as copy-typing the whole book, but because there is such a variety of tasks involved, the boredom factor is reduced. Into the bargain, my aging body is subject to a lot of aches and pains, particularly in the neck and shoulders. Continuous copy-typing would send me into complete lock-down. The other advantage, a personal one, is that, once I have gone through both the OCR and spreadsheet steps, I have a much better knowledge of the street I have been looking at and the people who lived there.