PDF Scanning OCR

When you receive a document that was scanned rather than exported from the software that created it (MS Word, for example), it is just an image, i.e. a picture. Remember, to a computer, a picture of the letter "A" is not the same as the text character "A," so when you try to text-search an image, you get no hits because there's no text to search. Scanned litigation documents typically arrive in the TIFF (image) format, although many software and hardware packages can scan paper directly into PDF. For now, I'm not going to address using Acrobat or other tools as the scanning software. For our purposes today, let's just say you've got image files that you want to convert into something you can search.
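If you are not sure what kind of file you were handed, the first few bytes usually tell you. Here is a minimal sketch in Python, assuming classic TIFF files (which start with "II" or "MM" followed by the number 42) and PDFs whose "%PDF-" header sits at the very start of the file. The function names are mine, not part of Acrobat or any other tool.

```python
# Leading "magic" bytes for the two formats discussed here.
TIFF_LITTLE_ENDIAN = b"II*\x00"  # "II" + 42 stored as a little-endian 16-bit value
TIFF_BIG_ENDIAN = b"MM\x00*"     # "MM" + 42 stored as a big-endian 16-bit value
PDF_HEADER = b"%PDF-"


def sniff_format(header: bytes) -> str:
    """Guess the format from the first few bytes of a file."""
    if header.startswith((TIFF_LITTLE_ENDIAN, TIFF_BIG_ENDIAN)):
        return "tiff"
    if header.startswith(PDF_HEADER):
        return "pdf"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read just enough of a file on disk to classify it."""
    with open(path, "rb") as handle:
        return sniff_format(handle.read(8))
```

A result of "pdf" only tells you the container. A PDF made by a scanner can still be nothing but pictures, which is the case the rest of this article deals with.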

Searchable PDF
The unique thing about PDF is that you can have an exact image of the document, plus the text, plus all kinds of metadata ALL IN ONE FILE. This is a wonderful thing -- but I will expound on its wonderfulness later... With the "Paper Capture" tools in Acrobat, the software reads the picture and figures out what the text is. So while you still see the "image," the software can also read the underlying text. OCR is not perfect, and it works best on first-generation, laser-printed originals (just like your eyes do). In the past decade, however, OCR technology has gotten surprisingly accurate.
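To make the "image plus text in one file" idea concrete, here is a toy model in Python. It is only an illustration: the Page class and the find function are hypothetical and say nothing about how a real PDF is stored. The point is that a search looks at the hidden text layer, never at the pixels.

```python
from dataclasses import dataclass


@dataclass
class Page:
    image: bytes           # the scanned picture the reader sees
    hidden_text: str = ""  # text layer produced by OCR; empty before OCR


def find(pages: list[Page], term: str) -> list[int]:
    """Return 1-based page numbers whose text layer contains the term."""
    needle = term.casefold()
    return [
        number
        for number, page in enumerate(pages, start=1)
        if needle in page.hidden_text.casefold()
    ]


scan_only = [Page(image=b"...pixels...")]
after_ocr = [Page(image=b"...pixels...", hidden_text="Deposition of A. Smith")]

print(find(scan_only, "smith"))  # no text layer, so no hits
print(find(after_ocr, "smith"))  # the OCR text layer makes page 1 findable
```

The image bytes are identical in both cases. Only the presence of the text layer changes what a search can find.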

Scanning and OCR with Acrobat

Scan to PDF

One area that concerns many people is how to use the OCR (Optical Character Recognition) abilities of Acrobat. Here's an overview, and I'll try to deal with other OCR issues very soon.

Acrobat can import scanned images (TIFF works best) or talk to scanners and digital cameras through any TWAIN driver. Either way, it imports only pixel data, so a text recognition step is needed to create searchable text and possibly reduce file size. Acrobat calls this OCR step Paper Capture.

    * File > Import to bring in image data
    * Tools > Capture to set up your capture preferences. The "Normal" setting converts the image data to text. The "Both" setting keeps the image visible with the text data hidden behind it, so Find operations still take you to the right location in the document.
    * Find Next Suspect (Ctrl-H) to review the words Acrobat thinks it may not have recognized correctly. You will see a magnified version of the word/pixels in question. In the actual document, the word Acrobat chose is highlighted. Accept it or type over it.
    * If good looks are really important, you may need to go over most words/lines and edit their font properties with the touchup-text tool, a very lengthy and tedious process. Probably retyping in a word processor would be faster.
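The suspect review in the steps above boils down to a simple loop: flag every word the engine was unsure about, then either accept it or type over it. Here is a rough sketch in Python, assuming the OCR engine reports a confidence score per word. The Word class, the 0.80 threshold, and both function names are hypothetical; Acrobat does not expose its internals this way.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    confidence: float  # hypothetical engine score: 0.0 (a guess) to 1.0 (sure)


def find_suspects(words, threshold=0.80):
    """Return the indexes of words whose confidence falls below the threshold."""
    return [i for i, word in enumerate(words) if word.confidence < threshold]


def review(words, corrections, threshold=0.80):
    """Accept each suspect as-is unless a correction is supplied for its index."""
    suspects = set(find_suspects(words, threshold))
    reviewed = []
    for i, word in enumerate(words):
        if i in suspects and i in corrections:
            reviewed.append(corrections[i])  # "type over it"
        else:
            reviewed.append(word.text)       # "accept it"
    return " ".join(reviewed)


words = [Word("The", 0.99), Word("plaintifl", 0.42), Word("objected", 0.95)]
print(find_suspects(words))             # indexes of the words to look at
print(review(words, {1: "plaintiff"}))  # text after the reviewer's fix
```

Raising the threshold flags more words for review, and lowering it flags fewer, so it trades your time against accuracy.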

    * Convert scanned pages to searchable Adobe PDF files that anyone with the free Adobe Reader can view, navigate, and print.
    * Create reusable document-processing workflows tailored to different types of conversion projects.
    * Accurately perform OCR, font, and page recognition.
    * Automatically create intra-document links, including tables of contents, cross references, and indexes.
    * Efficiently correct OCR text suspects with the new QuickFix tool.
    * Use the new Zone tool to define areas of scanned pages to be treated as images, text, or even keywords.
    * Decrease processing time with workload balancing and multi-processor support (Cluster Edition only).
    * Create your own web interface with simple HTML pages using the Acrobat Capture SDK.

Bring your paper documents to life on the Web
Bridge the gap between your paper and digital workflows. Adobe® Acrobat® Capture® 3.0 is a professional production tool that teams with your scanner to convert volumes of paper documents into searchable Adobe Portable Document Format (PDF) files. Accurate OCR, advanced page and content recognition, and powerful cleanup tools let you turn all your important paper-based information into high-quality electronic documents ready for publication via the Web, intranets, extranets, CD-ROM, and more. Sophisticated productivity features streamline processing from start to finish, so you can get your jobs done more efficiently than ever.

When it's done, don't forget to File > Save the document. And there you have it. (At this point, I always like to do a little test by running a quick search on a word that I see on the first page. It just makes me feel better to know that it worked. I also have a continuing dialogue about what to do with the original TIFF file...)

As I said, if your image file is from a laser-printed copy, and it's a decent scan, the OCR accuracy is amazingly good. But it may have garbled some words, so if you want to get really fancy, go back to Document > Paper Capture and select "Find first OCR suspect" or "Find all OCR suspects." This identifies characters that the OCR engine had problems with, and gives you a chance to correct the text. You can fix the spelling if it's important to you -- say, for a proper name or a key term. That way you can be sure that the search software will find it. Otherwise, for a common word, I'd just save time and let it slide.
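If you can get the OCR text out of the file (by copy and paste, for instance), you can also check your important names and terms in bulk instead of hunting through suspects one at a time. Below is a minimal Python sketch using the standard difflib module. The function name, the 0.8 similarity cutoff, and the sample text are my own assumptions, and it handles single-word terms only.

```python
import difflib
import re


def missing_terms(ocr_text, key_terms, cutoff=0.8):
    """Map each key term absent from the OCR text to its closest OCR'd words."""
    words = sorted(set(re.findall(r"[A-Za-z']+", ocr_text.lower())))
    report = {}
    for term in key_terms:
        if term.lower() in words:
            continue  # the search software will find this one as-is
        # Near misses are likely OCR garbles worth correcting by hand.
        report[term] = difflib.get_close_matches(
            term.lower(), words, n=3, cutoff=cutoff
        )
    return report


ocr_text = "Deposition of Jane Kowalsky taken in Chicago"
print(missing_terms(ocr_text, ["Kowalski", "Chicago"]))
```

A term that comes back with a near miss is probably a garbled spelling you should correct. A term that comes back with an empty list may simply not be in the document.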
