Scanning Books and Articles for OCR

Which OCR Package?

In turn I've used:

I recommend FineReader.  It will often give perfect results in English and French, although more often still it will have an error or two per page.  German is also good.  Cyrillic I found worked like a dream, and the output could then be pasted into an online translator.  Latin works better than either of the others.  Greek is not handled well, however.

Scanner

I currently use: 

Scanner notes

It is possible to scan with a $60 flatbed scanner, and I did it for years.  However, I was amazed at how much better the results were with the ScanJet.  My advice is: don't mess about with a cheap scanner.  Your time is the most valuable thing you have, and a cheap scanner will have you sat there into the night correcting errors which a small extra sum would have saved you.

The automatic document feeder (ADF) is a necessity and can save a lot of time, if you can use it (e.g. on articles).  I once scanned a 120-page thesis in an hour using it.  However, it can be temperamental: don't overload it, and be aware that it is quite fussy about temperature and humidity.  Your originals must be in good condition, and don't let the room get too sticky on a hot day.

OCR Software Notes

Here are some notes on using FineReader:

Here's a further tip for you.  

I have a photocopy of one book that has caused problems.  The copies were made too dark, so in many places there are black blotches down the spine and onto the text, and also where foxing occurred.  This has hitherto defeated me, because it causes so much noise in OCR that I end up almost retyping the lot.  Since this is French text with accents, I can't do that.

However, I have just discovered a trick which allows OCR to cope with this sort of rubbish.  If I scan the page using the HP Precision Scan Pro utility that came with my (HP) scanner, I can adjust various parameters.  If I scan it as a greyscale image, with the Highlight and Mid-Tones set to maximum and Shadow set to minimum, then I get a big 5 MB file, which is generally rather greyish and with reduced contrast.  If I then pass this into FineReader, it can OCR it OK!

There is still some noise, but much reduced.
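
For anyone without the HP utility, a similar lightening can probably be approximated in software before the image goes to OCR.  The sketch below is in Python and is only an illustration: it is not what Precision Scan Pro does internally, and the function name lift_tones and its black_point and gamma parameters are made up for this example.  It takes 8-bit greyscale values (0 is black, 255 is white), brightens the mid-tones with a gamma curve, and raises the darkest value to a mid grey, which gives the greyish, reduced-contrast kind of result described above.

```python
def lift_tones(pixels, black_point=96, gamma=1.8):
    """Return lighter, lower-contrast copies of 8-bit greyscale values.

    Hypothetical helper for illustration only; black_point and gamma
    are assumed starting values, not settings taken from any scanner.
    """
    if not 0 <= black_point < 255:
        raise ValueError("black_point must be between 0 and 254")
    if gamma <= 0:
        raise ValueError("gamma must be positive")

    result = []
    for value in pixels:
        if not 0 <= value <= 255:
            raise ValueError("pixel values must be between 0 and 255")
        # Scale to 0.0-1.0, then apply the gamma curve:
        # gamma > 1 lightens the mid-tones.
        brightened = (value / 255.0) ** (1.0 / gamma)
        # Squeeze the output into black_point..255 so that nothing
        # is darker than a mid grey (this is what cuts the contrast).
        scaled = black_point + (255 - black_point) * brightened
        result.append(int(round(scaled)))
    return result


if __name__ == "__main__":
    # A few sample values running from black to white.
    print(lift_tones([0, 64, 128, 192, 255]))
```

With the default settings pure black (0) comes out as 96 and pure white (255) stays at 255, so solid black blotches end up as mid grey.  The right values for a given photocopy would have to be found by trial and error.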

Proofing the OCR'd text

This is the bit that takes the time!

Other

I always save without full formatting but retain fonts (so that italics and paragraphs are kept), then pull the file into FrontPage, select the whole document and choose the default font and size.  This is a nuisance when you have some Greek in it, but it removes all the font stuff which in reality one does not want.
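
The same clean-up could probably be scripted rather than done by hand.  The following is a minimal sketch in Python, assuming that the saved file is HTML and that the unwanted formatting is carried by font tags; neither assumption is confirmed above, and the function name strip_font_tags is invented for the example.  It deletes the opening and closing font tags and leaves everything else, including italics and paragraph tags, alone.

```python
import re

# Matches an opening <font ...> tag or a closing </font> tag,
# in upper or lower case.
FONT_TAG = re.compile(r"</?font\b[^>]*>", re.IGNORECASE)


def strip_font_tags(html):
    """Remove font tags from an HTML string, keeping the text inside.

    Hypothetical helper for illustration; it assumes the formatting
    to be removed lives in <font> tags and nowhere else.
    """
    return FONT_TAG.sub("", html)


if __name__ == "__main__":
    sample = '<p><font face="Arial" size="2">Some <i>text</i></font></p>'
    print(strip_font_tags(sample))
```

Like the manual method, this treats every font tag alike, so any passage that relies on a special font (Greek, for instance) would need to be dealt with separately.  Formatting held in style attributes or stylesheets is not touched by this sketch.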

Note that FineReader intend to bring out an add-on to handle German gothic ('Fraktur') fonts.  I intend to buy it!

I hope that's useful - happy scanning!  If you learn a tip I don't know, I'm more than happy to hear from you!

Constructive feedback is welcome and should be sent to Roger Pearse.

Created 30th December, 2000.  Updated for FineReader 16th November 2001.  Rewritten for FR, 18th January 2003.  Note added on Precision Scan, 17th July 2004.

This page has been online since 30th December 2000.
