Scanning for OCR

Scanning Books and Articles for OCR

Which OCR Package?

In turn I've used:

OmniPage 10 Pro. Once the market leader, this is now obsolete and markedly inferior to the others. Do not buy it. One oddity: if you burned your .opd files onto CD-R to save space, make sure the copies you restore are not read-only, otherwise you won't be able to open them!
TextBridge Millennium. For English and French, I have found that TextBridge Millennium gives near-perfect results much of the time, and is far more accurate than OmniPage. It does have some odd bugs and missing features. It isn't possible to save the images you're scanning in, except by a convoluted process, nor to save your editing session, so if you edit 100 pages and the machine then crashes, you cannot pick up where you left off. There is no option to preserve line breaks, which is also an utter pain. It doesn't handle Latin well. It also treats the left square bracket character '[' as 'end of file' when you save the text to Word, HTML, ASCII or whatever, which is mildly incredible, and was a pain when I was working on some French and German.
ABBYY FineReader 5.0 Office with the Cyrillic option, and 6.0 Pro (not very different, and not worth the upgrade). There is a downloadable demo.
ReadIris Pro 8.0 (downloadable demo version). This has no facility for proofing the recognised text against the scanned page images. The OCR quality seems reasonable, but all proofing must be done either in the spell-checker facility ('Learn') or in the exported document, manually switching back and forth between it and the images. This means that it is probably not much use unless you don't proof in the OCR package anyway.
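
On the read-only .opd problem mentioned under OmniPage: files copied back from a CD-R usually keep their read-only flag. As a minimal sketch only, the Python below clears that flag in bulk. The function name make_writable is hypothetical, and it assumes the files have already been copied from the disc to a writable folder.

```python
import os
import stat
from pathlib import Path


def make_writable(folder, pattern="*.opd"):
    """Clear the read-only flag on matching files under folder.

    Returns the list of paths that were changed.
    """
    changed = []
    for path in Path(folder).rglob(pattern):
        if not path.is_file():
            continue
        mode = path.stat().st_mode
        if not mode & stat.S_IWRITE:
            # Adding the owner-write bit clears "read-only" on Windows
            # and makes the file writable by its owner elsewhere.
            os.chmod(path, mode | stat.S_IWRITE)
            changed.append(path)
    return changed
```

Calling make_writable on the folder holding the restored files returns the paths it had to change, so an empty list means nothing was read-only in the first place.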

I recommend FineReader. It will often give perfect results in English and French, although more often still there will be an error or two per page. German is also good. Cyrillic I found worked like a dream, and the output could then be pasted into an online translator. Latin works better than in the other packages. Greek is not handled well, however.

Scanner

I currently use:

A Hewlett-Packard ScanJet 6300-series: actually a 6350, which comes with the ADF (automatic document feeder) but lacks some feature of the top-end 6390 that I didn't need. It is capable of 600 dpi and can scan a page in a few seconds. It cost around $450.

Scanner notes

It is possible to scan with a $60 flatbed scanner, and I did so for years. However, I was amazed at how much better the results were with the ScanJet. My advice is: don't mess about with a cheap scanner. Your time is the most valuable thing you have, and a cheap scanner will have you sitting there into the night correcting errors which a small extra sum would have saved you.

The ADF is essential and can save a lot of time when you can use it (e.g. on articles). I once scanned a 120-page thesis in an hour using it. However, it can be temperamental: don't overload it, and be aware that it is quite fussy about temperature and humidity. Your originals must be in good condition, and don't let the room get too sticky on a hot day.

OCR Software Notes

Here are some notes on using FineReader:

Scan at about 400 dpi. You can get away with less for English, but you get best results at 400 dpi.
Photocopy the book and feed the copies through the sheet feeder. This saves time, and the pages will split nicely.
Do make sure the book is in the middle of the area being scanned if you want it to split in the right place! Few books are exactly A4, and so one always ends up with a border.
Sometimes when a page is scanned, the image is crooked. You can't fix this in FineReader without exporting to something like Paint Shop Pro. However, I usually find that this happens when there are black margins on the scan, because I scanned the book at A4 size and it isn't A4. If you use the eraser in the image window to remove these black borders, the page will often realign itself upright!
If you have something which contains French and German, select both as additional recognition languages. Otherwise you'll have to enter every accented character yourself.
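
The black-border clean-up described above can also be done before the image reaches the OCR package. The following is a rough sketch in plain Python, not a description of what FineReader does internally. It assumes the page is available as rows of greyscale values (0 for black, 255 for white); the function name trim_dark_borders and the threshold of 64 are hypothetical choices.

```python
def trim_dark_borders(page, threshold=64):
    """Return page with mostly-black edge rows and columns removed.

    page is a list of equal-length rows of greyscale values (0-255).
    A row or column counts as border if its mean value is below threshold.
    """
    def is_dark(values):
        return sum(values) / len(values) < threshold

    # Walk inwards from the top and bottom edges while rows are dark.
    top, bottom = 0, len(page)
    while top < bottom and is_dark(page[top]):
        top += 1
    while bottom > top and is_dark(page[bottom - 1]):
        bottom -= 1
    rows = page[top:bottom]
    if not rows:
        return []

    # Do the same from the left and right edges, column by column.
    left, right = 0, len(rows[0])
    while left < right and is_dark([row[left] for row in rows]):
        left += 1
    while right > left and is_dark([row[right - 1] for row in rows]):
        right -= 1
    return [row[left:right] for row in rows]
```

Only the outer edges are trimmed, so dark areas inside the page, such as a shadow down the spine, are left alone.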

Here's a further tip for you.

I have a photocopy of one book that has caused problems. The copies were made too dark, so there are often black blotches down the spine and across the text, and also where foxing occurred. This has hitherto defeated me, because it causes so much noise in the OCR that I end up almost retyping the lot. Since this is French text with accents, retyping is not practical.

However, I have just discovered a trick which allows OCR to cope with this sort of rubbish. If I scan the page using the HP Precision Scan Pro utility that came with my HP scanner, I can adjust various parameters. If I scan it as a greyscale image, with Highlight and Mid-Tones set to maximum and Shadow set to minimum, then I get a big 5 MB file, which is generally rather greyish and with reduced contrast. If I then pass this into FineReader, it can OCR it OK!

There is still some noise, but much reduced.
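
For anyone without the HP utility, a similar effect may be obtainable by adjusting the tones of an existing greyscale scan. The sketch below is a generic "levels" curve of the kind most image editors offer. It does not reproduce the HP sliders, whose internal behaviour is not described here, and the function name adjust_levels and its parameters are hypothetical. The idea is that a gamma above 1 lifts mid-grey blotches towards white while true black text stays black.

```python
def adjust_levels(pixels, shadow=0, highlight=255, gamma=1.0):
    """Apply a simple levels curve to greyscale values (0-255).

    shadow and highlight are the input values mapped to black and white;
    gamma > 1 brightens the mid-tones, gamma < 1 darkens them.
    """
    if highlight <= shadow:
        raise ValueError("highlight must be greater than shadow")
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    span = highlight - shadow
    out = []
    for value in pixels:
        # Normalise to 0..1, clamping values outside the chosen range.
        x = min(max((value - shadow) / span, 0.0), 1.0)
        out.append(round(255 * x ** (1.0 / gamma)))
    return out
```

With the defaults the values pass through unchanged, so the settings can be tried one at a time on a sample page to see which gives the OCR package the least noise.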

Proofing the OCR'd text

This is the bit that takes the time!

I don't find the spell-check facility very useful, I have to say.
If you find that a given error keeps cropping up, make a note of it, and do a search and replace to fix every occurrence. For instance, in Latin 'Iam' is always rendered 'lam'. Keeping a file of these errors will save you time.
Some errors only show up in certain fonts. The difference between '1' and 'l' is hard to see in Times Roman. Make sure you change the font when you export, and check your numbers!
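
The search-and-replace list and the number check can both be automated once the text has been exported. This is a minimal sketch, assuming plain-text output. The names CORRECTIONS, apply_corrections and suspicious_numbers are hypothetical, and the only entry in the list is the Latin example given above.

```python
import re

# Hypothetical correction list; the 'lam' -> 'Iam' entry is the Latin
# example from the text. Add your own recurring errors here.
CORRECTIONS = {
    "lam": "Iam",
}


def apply_corrections(text, corrections=CORRECTIONS):
    """Replace each known OCR error where it appears as a whole word."""
    for wrong, right in corrections.items():
        text = re.sub(r"\b" + re.escape(wrong) + r"\b", right, text)
    return text


def suspicious_numbers(text):
    """Return tokens that mix digits with 'l', 'I' or 'O'.

    Such tokens often hide a misread 1 or 0 and are worth checking by eye.
    """
    pattern = re.compile(r"\b(?=\w*\d)(?=\w*[lIO])\w+\b")
    return pattern.findall(text)
```

Matching on whole words matters: without it, the 'lam' rule would also mangle words such as 'clamor'. The second function only reports candidates; it makes no changes, since a token like 'l845' still needs a human to decide what it should be.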

Other

I always save without full formatting but retain fonts (so that italics and paragraphs are kept), pull the file into FrontPage, select the whole document and choose the default font and size. This is a nuisance when there is some Greek in it, but it removes all the font information, which in reality one does not want.
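
If the text is exported as HTML, the same clean-up can be approximated without FrontPage. The sketch below is a stand-in for that manual step, not a reproduction of it. It uses Python's standard html.parser module to drop every tag and attribute except a short list of structural ones. The class name FontStripper, the function strip_fonts and the choice of tags in KEEP are all hypothetical.

```python
from html.parser import HTMLParser

KEEP = {"p", "i", "em", "br"}  # hypothetical choice of tags to retain


class FontStripper(HTMLParser):
    """Rebuild the document keeping only the tags in KEEP, without attributes."""

    def __init__(self):
        # Leave entities such as &eacute; untouched rather than decoding them.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in KEEP:
            self.parts.append(f"<{tag}>")  # attributes are dropped

    def handle_endtag(self, tag):
        if tag in KEEP and tag != "br":
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")


def strip_fonts(html):
    """Return html with font tags and all attributes removed."""
    parser = FontStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

Because font tags are simply discarded, any Greek that relied on a special font will need the same attention afterwards as it does with the FrontPage route.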

Note that the makers of FineReader intend to bring out an add-on to handle German gothic ('Fraktur') fonts. I intend to buy it!

I hope that's useful - happy scanning! If you learn a tip I don't know, I'm more than happy to hear from you!

Constructive feedback is welcome; please send it to Roger Pearse.

Created 30th December, 2000. Updated for FineReader 16th November 2001. Rewritten for FR, 18th January 2003. Note added on Precision Scan, 17th July 2004.

This page has been online since 30th December 2000.
