Application of OCR in NLP Scenarios

baoshi.rao

What is OCR?

OCR (Optical Character Recognition) refers to the process by which electronic devices (such as scanners or digital cameras) examine printed characters on paper, determine their shapes by detecting patterns of dark and light, and then translate these shapes into computer text using character recognition methods.

In other words, OCR converts printed characters on paper into a black-and-white bitmap image by optical means, then uses recognition software to turn the text in that image into editable text that word-processing software can work with.

Since OCR relies on scanning or photography, it often has to cope with problems such as complex backgrounds and low resolution. Without a substantive understanding of OCR technology, some might assume that OCR recognition is a trivial task not worth discussing.

In reality, OCR in natural environments faces numerous challenges, including:

Complex backgrounds;
Presence of elements like watermarks, underlines, and borders;
Overlapping seals or stamps;
Low image contrast;
Tilted or blurred text;
Stains or wear;
Anti-counterfeiting marks;
A wide variety of fonts;
Variations in stroke depth and ink distribution during printing.

Typically, the performance of an OCR system is measured by metrics such as rejection rate, error rate, recognition speed, user interface friendliness, product stability, ease of use, and feasibility.
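As a rough illustration of the first two metrics, the sketch below computes a rejection rate and an error rate over a small labeled test set. The definitions used here are an assumption, not taken from the text: rejection rate is the share of characters the recognizer declines to classify, and error rate is the share of accepted characters that are wrong. The function name `ocr_rates` and the use of `None` to mark a rejected character are also hypothetical.

```python
from typing import Optional, Sequence, Tuple


def ocr_rates(truth: Sequence[str],
              predicted: Sequence[Optional[str]]) -> Tuple[float, float]:
    """Return (rejection_rate, error_rate) for aligned character lists.

    A prediction of None means the recognizer rejected the character.
    """
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")
    if not truth:
        raise ValueError("empty test set")

    rejected = sum(1 for p in predicted if p is None)
    accepted = len(truth) - rejected
    wrong = sum(1 for t, p in zip(truth, predicted)
                if p is not None and p != t)

    rejection_rate = rejected / len(truth)
    # Error rate is measured over accepted characters only (assumed convention).
    error_rate = wrong / accepted if accepted else 0.0
    return rejection_rate, error_rate


# Ground truth "OCR2024"; the recognizer confuses "0" with "O" and rejects one "2".
rejection, error = ocr_rates(list("OCR2024"),
                             ["O", "C", "R", "2", "O", None, "4"])
print(f"rejection rate: {rejection:.3f}, error rate: {error:.3f}")
```

Some evaluations instead divide errors by the total number of characters, so it is worth stating which convention is in use when comparing systems.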

Traditional OCR Processing Steps

Below, we briefly outline the traditional OCR processing steps:

Image Preprocessing

Preprocessing generally includes tasks such as tilt correction, grayscale conversion, image denoising, and binarization.
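To make one of these tasks concrete, here is a minimal sketch of grayscale conversion in plain Python. It assumes the image is a list of rows of (R, G, B) tuples and uses the common ITU-R BT.601 luma weights; both the data layout and the choice of weights are assumptions for illustration, and a real pipeline would normally rely on an image library.

```python
from typing import List, Tuple

RGB = Tuple[int, int, int]


def to_grayscale(image: List[List[RGB]]) -> List[List[int]]:
    """Convert rows of (R, G, B) pixels to gray values in the range 0 to 255."""
    return [
        # Weighted sum of the channels (BT.601 luma weights, assumed here).
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in image
    ]


# A 2x2 image: white, black, pure red, pure blue.
tiny = [[(255, 255, 255), (0, 0, 0)],
        [(255, 0, 0), (0, 0, 255)]]
gray = to_grayscale(tiny)
print(gray)
```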

Binarization:
Binarization sets each pixel in the image matrix to a gray value of either 0 (black) or 255 (white), producing an image that contains only black and white pixels. A grayscale image can take any value from 0 to 255, whereas a binarized image takes only the values 0 and 255.
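The sketch below shows one possible way to binarize a grayscale image. It picks a global threshold with Otsu's method, which is one well-known choice and not necessarily the method the author had in mind. The function names and the tiny sample image are hypothetical.

```python
from typing import List


def otsu_threshold(gray: List[List[int]]) -> int:
    """Return the threshold that maximizes between-class variance (Otsu)."""
    hist = [0] * 256
    for row in gray:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    sum_all = sum(i * count for i, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    weight_bg, sum_bg = 0, 0
    for t in range(256):
        weight_bg += hist[t]            # pixels with value <= t
        if weight_bg == 0:
            continue
        weight_fg = total - weight_bg   # pixels with value > t
        if weight_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        between = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def binarize(gray: List[List[int]], threshold: int) -> List[List[int]]:
    """Map every pixel to 255 (white) if above the threshold, else 0 (black)."""
    return [[255 if v > threshold else 0 for v in row] for row in gray]


# Light paper (values near 200) with a few dark ink pixels (values near 40).
page = [[200, 210, 40, 205],
        [198, 35, 30, 202],
        [201, 207, 45, 199]]
t = otsu_threshold(page)
binary = binarize(page, t)
print(t, binary)
```

A single global threshold works when lighting is even. For photographs with shadows or uneven illumination, a locally adaptive threshold is usually a better fit.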

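Common binarization methods include global thresholding, such as Otsu's method, and locally adaptive thresholding.

After preprocessing, the traditional pipeline continues with the following steps: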

Layout analysis: Dividing the scanned image into regions based on different attributes, such as horizontal text, vertical text, tables, and images.
Character segmentation: Cutting the text in the image into individual characters, paying attention to issues such as touching (stuck-together) characters.
Feature extraction: Extracting key features from character images and reducing dimensionality for subsequent character recognition algorithms.
Character recognition: Using feature vectors to recognize characters based on template matching or deep neural network classification.
Layout restoration: Reconstructing the original document's layout and outputting the recognition results in the same format.
Post-processing: Introducing error correction mechanisms or language models to fix confusions between similar-looking characters.
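Of the steps above, character segmentation is the easiest to show in a few lines. The sketch below uses a vertical projection profile, a classic technique chosen here for illustration: it counts ink pixels per column of a binarized text line and cuts wherever a column is empty. The function name `segment_columns` and the sample line are hypothetical, and ink is assumed to be 0 (black).

```python
from typing import List, Tuple


def segment_columns(binary: List[List[int]], ink: int = 0) -> List[Tuple[int, int]]:
    """Return (start, end) column ranges, end exclusive, one per character."""
    if not binary:
        return []
    width = len(binary[0])
    # Vertical projection: number of ink pixels in each column.
    profile = [sum(1 for row in binary if row[x] == ink) for x in range(width)]

    segments = []
    start = None
    for x, count in enumerate(profile):
        if count > 0 and start is None:
            start = x                      # entering a character
        elif count == 0 and start is not None:
            segments.append((start, x))    # leaving a character
            start = None
    if start is not None:
        segments.append((start, width))    # character touches the right edge
    return segments


W, B = 255, 0
line = [
    [W, B, B, W, W, B, W, B, B],
    [W, B, W, W, W, B, W, B, W],
    [W, B, B, W, W, B, W, B, B],
]
segments = segment_columns(line)
print(segments)
```

Two touching characters leave no empty column between them, so this approach returns them as a single segment. That limitation is one reason the segmentation step is error-prone in practice.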

Of course, these traditional OCR methods are somewhat outdated. The current trend is toward end-to-end text recognition based on deep learning, which treats recognition as a sequence learning problem so that no explicit character segmentation step is needed. Text images of varying size and text length are passed through a deep convolutional neural network (DCNN) and a recurrent neural network (RNN), and the whole text image is recognized at once, with segmentation handled implicitly inside the network.

OCR Technical Framework

Based on the technical framework above, here’s a brief introduction to the key steps and models:

1. Tilt Correction: Uses the AdvancedEAST deep learning model for pixel-level segmentation. AdvancedEAST is based on EAST (An Efficient and Accurate Scene Text Detector) and improves prediction accuracy on long text. Its network structure is as follows:

2. Text Line Detection with PixelLink: Proposed by researchers at Zhejiang University and Alibaba, this model treats scene text detection as an image segmentation problem and is reported to outperform earlier detection models that rely on bounding-box regression. The PixelLink network architecture uses VGG16 for feature extraction and includes:

Pixel segmentation to classify each pixel as text/non-text.
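Link prediction to decide, for each pixel and each of its eight neighbors, whether the two belong to the same text instance, so that adjacent text pixels can be merged and non-text pixels discarded.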

The final detection boxes are obtained by identifying connected components in the resulting mask.
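The grouping step can be sketched with a union-find structure: text pixels are joined whenever a positive link connects them, and each resulting group yields one box. This is a simplified illustration, not the PixelLink implementation. The input format (a boolean mask plus a dictionary of positive link offsets per pixel) and the function names are assumptions, and the box is axis-aligned for brevity.

```python
from typing import Dict, List, Set, Tuple

Pixel = Tuple[int, int]     # (row, col)
Offset = Tuple[int, int]    # (d_row, d_col) to one of the 8 neighbors


def group_pixels(text_mask: List[List[bool]],
                 links: Dict[Pixel, Set[Offset]]) -> List[List[Pixel]]:
    """Group text pixels into instances by following positive links."""
    parent: Dict[Pixel, Pixel] = {}

    def find(p: Pixel) -> Pixel:
        while parent[p] != p:
            parent[p] = parent[parent[p]]   # path halving
            p = parent[p]
        return p

    def union(a: Pixel, b: Pixel) -> None:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Only pixels classified as text take part in grouping.
    for r, row in enumerate(text_mask):
        for c, is_text in enumerate(row):
            if is_text:
                parent[(r, c)] = (r, c)

    # A positive link from either side joins two neighboring text pixels.
    for p in list(parent):
        for dr, dc in links.get(p, ()):
            q = (p[0] + dr, p[1] + dc)
            if q in parent:
                union(p, q)

    groups: Dict[Pixel, List[Pixel]] = {}
    for p in list(parent):
        groups.setdefault(find(p), []).append(p)
    return [sorted(g) for g in groups.values()]


def bounding_box(group: List[Pixel]) -> Tuple[int, int, int, int]:
    """Axis-aligned box as (min_row, min_col, max_row, max_col)."""
    rows = [r for r, _ in group]
    cols = [c for _, c in group]
    return min(rows), min(cols), max(rows), max(cols)


mask = [
    [True,  True,  False, True],
    [False, False, False, True],
]
links = {
    (0, 0): {(0, 1)},   # left instance: link to the right neighbor
    (0, 3): {(1, 0)},   # right instance: link to the pixel below
}
boxes = sorted(bounding_box(g) for g in group_pixels(mask, links))
print(boxes)
```

Because grouping depends on links rather than on mask adjacency alone, two text lines that touch in the mask can still be separated when the links between them are negative.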

3. Text Recognition with CRNN: This model consists of three parts:

Convolutional layers (CNN) to extract feature sequences from the input image.
Recurrent layers (RNN) to predict label distributions from the feature sequences.
Transcription layers (CTC) to convert the per-frame label distributions into the final recognition result by merging repeated labels and removing blank labels.

CRNN combines the CNN's robust feature extraction with the LSTM's sequence recognition capabilities. This avoids the difficult tasks of segmenting and recognizing single characters, while capturing the sequential dependencies between characters.
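The transcription step can be illustrated with greedy (best-path) CTC decoding: take the most likely label at each frame, merge consecutive repeats, then drop blanks. This is a minimal sketch. It assumes label 0 is the blank and that label k maps to `alphabet[k - 1]`. The function name and the toy input are hypothetical, and production systems often use beam search instead.

```python
from typing import List, Sequence

BLANK = 0  # index reserved for the CTC blank label (assumed convention)


def ctc_greedy_decode(probs: Sequence[Sequence[float]], alphabet: str) -> str:
    """Decode per-frame label distributions with best-path CTC decoding.

    probs[t][k] is the score of label k at frame t; label 0 is the blank
    and label k >= 1 maps to alphabet[k - 1].
    """
    # Most likely label index at each frame.
    best_path = [max(range(len(frame)), key=frame.__getitem__)
                 for frame in probs]

    decoded: List[str] = []
    previous = BLANK
    for label in best_path:
        # Keep a label only if it differs from the previous frame
        # and is not the blank.
        if label != previous and label != BLANK:
            decoded.append(alphabet[label - 1])
        previous = label
    return "".join(decoded)


def one_hot(k: int, size: int = 5) -> List[float]:
    """Toy distribution that puts most of the mass on label k."""
    return [0.9 if i == k else 0.025 for i in range(size)]


# Frame labels: h h _ e l l _ l o  (with "_" as the blank)
frames = [one_hot(k) for k in [1, 1, 0, 2, 3, 3, 0, 3, 4]]
text = ctc_greedy_decode(frames, "helo")
print(text)
```

The blank between the two runs of "l" is what lets the decoder keep a genuine double letter instead of merging it into one.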

Types of Text Recognition Available in the Market

General Text Recognition

Refers to the recognition of irregular documents, such as PDFs.

Card and Document Recognition

Includes ID cards, bank cards, business licenses, business cards, passports, household registers, driver's licenses, etc.

Invoice Recognition

Covers VAT invoices, fixed-amount invoices, train tickets, taxi receipts, travel itineraries, insurance policies, bank slips, etc.

Others

License plates, vehicle certificates, seal detection, etc.

Application Scenarios

Finally, let’s discuss OCR's application scenarios. As mentioned earlier, OCR plays an indispensable role in NLP-related products, particularly in document processing tasks such as PDF extraction, document review, and document comparison.

Remote Identity Verification: Combines OCR and facial recognition to automate the entry of user ID information and verify identities. Used in the finance, insurance, social security, and online-to-offline (O2O) industries to mitigate risk.

Content Moderation and Regulation: Automatically identifies text in images and videos to detect inappropriate content (e.g., violence, politics, ads), reducing manual review costs and business risks.

Digitization of Paper Documents and Invoices: Automates the recognition and entry of paper documents, invoices, and forms, reducing manual input costs and improving efficiency.