How to improve Document Scanning and Zone OCR Accuracy

Zonal OCR (Optical Character Recognition) is a technique used to capture specific items from your documents. Invoice numbers, dates, client names, and similar fields all exist in freeform (unstructured) or fixed (structured) documents. In this article, we will explore the techniques available to improve the accuracy of the zones and get the best results.

If you are extracting text for indexing or file naming purposes, OCR Zone Processing is used to turn specific document regions, or zones, into usable data. Some of the most influential capabilities include:

  • Dynamic Zone Alignment
  • Zone Image Pre-Processing
  • Multi OCR engine voting
  • Regular Expression scripts
Download and use our Free ImageRamp tools for Document Assembly, OCR, or barcode separation page creation.  No registration or obligations.  

Zone alignment

Zone Anchor Words

Often, the location of a zone moves from image to image. This is due to changes in the form types, or even border issues with scanning. Auto cropping can help to some degree, but too often zones shift just enough to make fixed zones impossible to use. Some tools incorporate "Anchor Words", which are words recognized with high accuracy that serve as an anchor point. A general area of interest is defined on the document and the Anchor Word is used to align the zones to it. As long as the anchor word appears consistently, your zone positions should improve greatly with these kinds of tools. Only high confidence words should be used. In our example, high confidence words are shown along with the selected anchor word (Part).
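The anchor word idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Box` type, the `(text, confidence, box)` word tuples, the 0 to 100 confidence scale, and the function names are all hypothetical and do not come from any particular OCR product. The sketch finds the highest confidence occurrence of the anchor word on the scanned page, measures how far it moved from its position on the template, and shifts the zone by the same amount.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """A rectangle in pixel coordinates (hypothetical helper type)."""
    left: int
    top: int
    width: int
    height: int


def find_anchor(words, anchor_text, min_confidence=90):
    """Return the box of the most confident match for anchor_text, or None.

    `words` is assumed to be a list of (text, confidence, Box) tuples
    produced by an OCR pass over the general area of interest.
    """
    candidates = [
        (confidence, box)
        for text, confidence, box in words
        if text == anchor_text and confidence >= min_confidence
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda c: c[0])[1]


def align_zone(template_anchor, template_zone, found_anchor):
    """Shift the template zone by the offset between the two anchor positions."""
    dx = found_anchor.left - template_anchor.left
    dy = found_anchor.top - template_anchor.top
    return Box(template_zone.left + dx, template_zone.top + dy,
               template_zone.width, template_zone.height)


# Positions recorded when the zone was designed on a sample page.
template_anchor = Box(100, 200, 60, 20)
template_zone = Box(180, 200, 150, 20)

# Words found on a new scan; the low confidence "Part" is ignored.
page_words = [
    ("Part", 40, Box(300, 500, 60, 20)),
    ("Part", 97, Box(112, 193, 60, 20)),
    ("Qty", 95, Box(400, 193, 40, 20)),
]

anchor = find_anchor(page_words, "Part")
aligned_zone = align_zone(template_anchor, template_zone, anchor)
```

Here the anchor moved 12 pixels right and 7 pixels up, so the zone is read from `Box(192, 193, 150, 20)` instead of its template position. A real tool would likely also handle rotation and scaling, which this sketch leaves out.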

Zone Image Improvement

Since zone images are used temporarily for text extraction purposes, they can be further processed without consideration of the saved file. Some Image Processing options that can affect the OCR results include:

  • Line Removal removes grid lines that sometimes interfere with the OCR or cause extra characters, such as an "I", to be recognized. More advanced systems can remove lines that intersect with text and repair the area where the characters crossed the removed line.
  • Edge Smoothing can be used to deal with "spurs" and other edge issues on documents.
  • Adaptive Thresholding uses two-dimensional image processing, in which neighboring pixels are incorporated into the algorithm, to help smooth out the characters.
  • Pixel Expansion, or thickening, is helpful for lightly scanned images. The pixels are expanded in 4 or 8 directions to help the OCR recognize the objects.
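Two of the options above can be illustrated with a small, dependency-free Python sketch. It is not how any specific product implements them: the function names are hypothetical, images are assumed to be plain lists of rows, and the adaptive threshold shown is the simple "compare each pixel to the mean of its neighborhood" variant. `adaptive_threshold` turns a grayscale zone (0 = black, 255 = white) into a binary one (1 = ink), and `dilate` performs pixel expansion in 4 or 8 directions.

```python
def adaptive_threshold(gray, radius=1, offset=0):
    """Mark a pixel as ink (1) when it is darker than its local mean.

    `gray` is a list of rows of 0-255 values. The neighborhood is a square
    window of the given radius, clipped at the image border.
    """
    rows, cols = len(gray), len(gray[0])
    out = [[0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            window = [
                gray[ny][nx]
                for ny in range(max(0, y - radius), min(rows, y + radius + 1))
                for nx in range(max(0, x - radius), min(cols, x + radius + 1))
            ]
            local_mean = sum(window) / len(window)
            out[y][x] = 1 if gray[y][x] < local_mean - offset else 0
    return out


def dilate(binary, connectivity=4):
    """Thicken ink by copying each ink pixel to its 4 or 8 neighbors."""
    if connectivity == 4:
        offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    elif connectivity == 8:
        offsets = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                   if (dy, dx) != (0, 0)]
    else:
        raise ValueError("connectivity must be 4 or 8")
    rows, cols = len(binary), len(binary[0])
    out = [row[:] for row in binary]  # work on a copy of the input
    for y in range(rows):
        for x in range(cols):
            if binary[y][x]:
                for dy, dx in offsets:
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < rows and 0 <= nx < cols:
                        out[ny][nx] = 1
    return out


# A single dark pixel on a light background.
zone = [
    [200, 200, 200],
    [200, 50, 200],
    [200, 200, 200],
]
ink = adaptive_threshold(zone)       # only the center pixel is ink
thick4 = dilate(ink, connectivity=4)  # plus-shaped result
thick8 = dilate(ink, connectivity=8)  # fills the whole 3x3 block
```

Expanding in 8 directions thickens strokes more aggressively than 4, which can help very faint scans but may also merge characters that sit close together, so it is worth previewing the result on real zones.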

 

OCR Zone Preview includes Image Pre-Processing, Dual Voting and Regular Expressions

Using multiple OCR engines and word confidence

Some systems assign a confidence score to each word captured in an OCR Zone. When combined with multiple OCR engines, you can take the best result to obtain higher accuracy. OCR engines use different techniques to match characters and deal with broken and disjointed image data. Passing the zone image through each engine, scoring the confidence, and then using the best scoring engine can improve accuracy and recover data that would be missed when one engine fails on a zone.
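A minimal sketch of this "best score wins" selection is shown below. Everything in it is an assumption made for illustration: each engine is modeled as a plain Python callable that takes a zone image and returns `(text, confidence)` with confidence between 0 and 1, and the two fake engines stand in for real OCR libraries, whose actual APIs will differ.

```python
def best_zone_result(zone_image, engines, min_confidence=0.0):
    """Run every engine on the zone and keep the most confident result.

    `engines` maps an engine name to a callable returning (text, confidence).
    Returns None when there are no engines or the best score is too low.
    """
    results = []
    for name, recognize in engines.items():
        text, confidence = recognize(zone_image)
        results.append((confidence, name, text))
    if not results:
        return None
    confidence, name, text = max(results, key=lambda r: r[0])
    if confidence < min_confidence:
        return None
    return {"engine": name, "text": text, "confidence": confidence}


# Stand-in engines (hypothetical): one misreads a digit with low confidence.
def fake_engine_a(zone_image):
    return ("INV-1O234", 0.62)


def fake_engine_b(zone_image):
    return ("INV-10234", 0.94)


engines = {"engine_a": fake_engine_a, "engine_b": fake_engine_b}
winner = best_zone_result(zone_image=None, engines=engines, min_confidence=0.8)
```

In this example `winner["text"]` is `"INV-10234"` from `engine_b`. Returning `None` below the threshold lets the zone be flagged for manual review instead of silently accepting a weak result. Confidence scales differ between engines, so real scores may need to be normalized before they are compared.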

Using Regular Expressions

With the use of regular expressions, you can expand the search region to look for keywords, or define rules that return only the exact character string you want. If we have a zone from which we want to extract a zip code, we can look for a sequence of 5 digits, or 5 digits followed by a dash and 4 more digits (the ZIP+4 format).

Using Regex to find a Zip Code
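As a rough illustration of the zip code rule, here is a small Python sketch using the standard `re` module. It assumes US zip codes in either the 5-digit form or the 5-digit plus dash plus 4-digit (ZIP+4) form; the pattern and the function name are illustrative and are not taken from any specific capture tool.

```python
import re

# Five digits, optionally followed by a dash and four more digits.
# The \b word boundaries keep the pattern from matching inside longer numbers.
ZIP_PATTERN = re.compile(r"\b\d{5}(?:-\d{4})?\b")


def find_zip_codes(zone_text):
    """Return every zip code shaped string found in the OCR text of a zone."""
    return ZIP_PATTERN.findall(zone_text)


print(find_zip_codes("Springfield, IL 62704-1234"))  # ['62704-1234']
print(find_zip_codes("Austin TX 78701"))             # ['78701']
print(find_zip_codes("Invoice 1234567"))             # []
```

Because the rule only returns text that fits the expected shape, the zone can be drawn generously around the address block and the surrounding words are simply ignored.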

 

To Wrap It Up

When extracting text from an image, OCR accuracy, zone placement, and post-processing are crucial. To optimize your scanning success:

  • Use Good Pre-processing Techniques
    Good pre-processing can be as important as the scanning technologies involved. Encourage accuracy by setting document procedures and guidelines to:
    • Use adequate white space
    • Limit lines and gridlines
    • Limit the use of color
    • Use OCR friendly fonts and sizes
  • Scan at a Minimum of 200 to 300 DPI
  • Use an Intelligent Document and Data Capture Solution
    Software such as ImageRamp uses advanced cleanup and validation technology with preview and testing mechanisms.