Transformer Based OCR Model - Image To Text Transformer

By
Sweety Bajaj
Product Manager

Transformer-Based OCR Model: How OCR Decoder works

As you probably already know, Optical Character Recognition (OCR) is the electronic conversion of images of typed, handwritten, or printed text into machine-encoded text. The source can be a scanned document, a photo of a document, or subtitle text superimposed on an image.

Let’s understand how an OCR pipeline works before we dig deeper into Transformer Based OCR.

A typical OCR pipeline consists of two modules.

  • A Text Detection Module
  • A Text Recognition Module


Text Detection Module


The Text Detection module, as the name suggests, detects where text is present in the source. It aims to localize all the text blocks within the image, either at word level (individual words) or at text-line level.

This task is comparable to an object detection problem, except that the objects of interest are text blocks. Popular object detection algorithms include YOLOv4/5, Detectron, Mask-RCNN, etc.


Text Recognition Module


The Text Recognition module aims to understand the content of each detected text block and convert the visual signals into natural language tokens.

A typical text recognition module consists of two sub-modules.

  • Word Piece Generation Module
  • Image Understanding

The workflow under the text recognition module works as follows.

  • The individual localized text boxes are resized to a fixed size, let's say 224x224, and passed as input to the image understanding module, which is typically a CNN module (ResNet with self-attention).
  • The image features from a particular network depth are extracted and passed as input to the Word Piece Generation Module, which is an RNN-based network. The output of this RNN is the machine-encoded text of the localized text boxes.
  • Using an appropriate loss function, the Text Recognition Module is trained until its performance reaches a satisfactory level.
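The first step of this workflow, cropping a detected box and resizing it to a fixed size, can be sketched in plain Python. This is only an illustration: the function names `crop` and `resize_nearest` are made up for this example, the image is a nested list of pixel values, and nearest-neighbour sampling is an assumed simplification (real pipelines usually rely on an image library with bilinear or bicubic interpolation).

```python
def crop(image, box):
    """Cut out a (left, top, right, bottom) box; right and bottom are exclusive."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def resize_nearest(image, out_h, out_w):
    """Resize a 2D list of pixels with nearest-neighbour sampling."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


# Toy 4x6 "page" whose pixel value encodes its position (row * 10 + column).
page = [[r * 10 + c for c in range(6)] for r in range(4)]

# Hypothetical detected text box covering rows 1-2 and columns 1-2.
text_box = crop(page, (1, 1, 3, 3))

# Resize to a fixed size (4x4 here; the article's example uses 224x224).
fixed = resize_nearest(text_box, 4, 4)
for row in fixed:
    print(row)
```

Every text box ends up with the same height and width, which is what lets the recognition network process boxes in batches.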

What makes transformer-based OCR different?

TrOCR is an end-to-end transformer-based OCR model for text recognition, and one of the first works to jointly leverage pre-trained image and text transformers.

The transformer-based OCR model looks like the diagram below. The left-hand side of the diagram is the Vision Transformer (ViT) encoder and the right-hand side is the RoBERTa (text transformer) decoder.


Vision Transformer (ViT) or Encoder:



An image is split into NxN patches, where each patch is treated similarly to a token in a sentence. The image patches are flattened (2D → 1D), linearly projected, and combined with positional embeddings. The resulting sequence is propagated through the transformer encoder layers.

In the case of OCR, the input is a series of localized text boxes. To keep the text boxes consistent, the image region of each text box is resized to HxW. The image is then decomposed into HW/(PxP) patches, each of size PxP, where P is the patch size.
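To make the patch arithmetic concrete, here is a minimal sketch in plain Python. The helper names are invented for this example, and the 384x384 image size with 16x16 patches is an assumed configuration used only to plug numbers into HW/(PxP).

```python
def num_patches(height, width, patch):
    """Number of non-overlapping patch x patch tiles in a height x width image."""
    if height % patch or width % patch:
        raise ValueError("height and width must be divisible by the patch size")
    return (height // patch) * (width // patch)


def split_into_patches(image, patch):
    """Split a 2D list of pixels into flattened patches, in row-major order."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = [
                image[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ]
            patches.append(flat)
    return patches


# Assumed example sizes: H = W = 384 and P = 16.
print(num_patches(384, 384, 16))  # 576

# Tiny 4x4 image with 2x2 patches, so the result is easy to inspect.
toy = [[r * 4 + c for c in range(4)] for r in range(4)]
print(split_into_patches(toy, 2))
# [[0, 1, 4, 5], [2, 3, 6, 7], [8, 9, 12, 13], [10, 11, 14, 15]]
```

Each inner list is one flattened patch, that is, one "token" of the image sequence.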

After that, the patches are flattened and linearly projected to D-dimensional vectors, which are the patch embeddings. The patch embeddings and two special tokens are given learnable 1D position embeddings according to their absolute positions. Then, the input sequence is passed through a stack of identical encoder layers.
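The way the encoder input is assembled can be sketched as follows. This is a toy illustration under stated assumptions: the sizes are tiny, all weights are random stand-ins for learned parameters, and the names (`linear`, `embed_patches`, and so on) are hypothetical rather than taken from any library.

```python
import random

random.seed(0)


def rand_matrix(rows, cols):
    """Random stand-in for a learned parameter matrix."""
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def linear(vec, weight, bias):
    """Compute weight @ vec + bias, with the weight stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]


def embed_patches(patches, weight, bias, special_tokens, pos_embed):
    """Project patches to D dims, prepend special tokens, add position embeddings."""
    tokens = list(special_tokens) + [linear(p, weight, bias) for p in patches]
    if len(tokens) != len(pos_embed):
        raise ValueError("need exactly one position embedding per token")
    return [[t + p for t, p in zip(tok, pos)] for tok, pos in zip(tokens, pos_embed)]


# Toy sizes (assumptions): 4 patches of 4 pixels each, D = 3, two special tokens.
patch_dim, d_model, n_patches, n_special = 4, 3, 4, 2
patches = rand_matrix(n_patches, patch_dim)
weight = rand_matrix(d_model, patch_dim)
bias = [0.0] * d_model
special = rand_matrix(n_special, d_model)
pos = rand_matrix(n_special + n_patches, d_model)

sequence = embed_patches(patches, weight, bias, special, pos)
print(len(sequence), len(sequence[0]))  # 6 3
```

The sequence has one D-dimensional vector per special token and per patch, and position i always receives the same position embedding regardless of the image content.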

Each Transformer layer has a multi-head self-attention module and a fully connected feed-forward network. Both parts are followed by a residual connection and layer normalization.

Note: Residual connections help gradients flow during backpropagation.
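The "sublayer, then residual connection, then layer normalization" pattern described above can be written as LayerNorm(x + Sublayer(x)). Below is a minimal sketch for a single vector. It leaves out the learnable scale and shift that layer normalization normally has, and `toy_feed_forward` is a hypothetical stand-in for a real attention or feed-forward sublayer.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def residual_block(x, sublayer):
    """Apply a sublayer, add the input back (residual), then normalize."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])


def toy_feed_forward(x):
    # Hypothetical sublayer: an elementwise affine map followed by ReLU.
    return [max(0.0, 2.0 * v - 1.0) for v in x]


out = residual_block([1.0, 2.0, 3.0, 4.0], toy_feed_forward)
print([round(v, 3) for v in out])
```

Because the input x is added back before normalization, the gradient always has a direct path around the sublayer.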


RoBERTa or Decoder:



The output embeddings from a certain depth of the ViT encoder are extracted and passed as input to the decoder module.


The decoder module is also a transformer with a stack of identical layers. These layers have a structure similar to the encoder layers, except that the decoder inserts an "encoder-decoder attention" module between the multi-head self-attention and the feed-forward network, so that it can distribute different attention over the output of the encoder. In the encoder-decoder attention module, the keys and values come from the encoder output, while the queries come from the decoder input.
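Here is a minimal single-head sketch of encoder-decoder attention using scaled dot-product attention. It is a simplification: a real layer first applies learned projections to the queries, keys, and values and uses several heads, both of which are omitted here. The function names and toy vectors are made up for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: one output vector per query."""
    scale = math.sqrt(len(keys[0]))
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out = [
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Toy data: 3 encoder outputs (image patches) and 2 decoder states (text tokens).
encoder_out = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_states = [[1.0, 0.0], [0.0, 2.0]]

# Queries come from the decoder; keys and values come from the encoder output.
context, weights = cross_attention(decoder_states, encoder_out, encoder_out)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of weights shows how strongly one text position attends to each image patch, and each row sums to 1.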

The embeddings from the decoder are projected from the model dimension (768) to the dimension of vocabulary size V (50265).
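As a rough sketch, this projection is a single linear layer that maps each decoder embedding to one score (logit) per vocabulary entry. The code below counts the weights such a layer needs for the dimensions quoted above and runs the same operation on a toy example. The function names are hypothetical, and whether the real layer includes a bias term is not stated in this article, so it is an optional assumption here.

```python
D_MODEL = 768       # model dimension quoted in the text
VOCAB_SIZE = 50265  # vocabulary size quoted in the text


def output_layer_parameters(d_model, vocab_size, with_bias=False):
    """Parameter count of a linear layer mapping d_model -> vocab_size."""
    return d_model * vocab_size + (vocab_size if with_bias else 0)


def project_to_vocab(hidden, weight, bias):
    """Return one logit per vocabulary entry for a single decoder embedding."""
    return [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(weight, bias)]


print(output_layer_parameters(D_MODEL, VOCAB_SIZE))  # 38603520

# Toy example: model dimension 2, vocabulary of 3 tokens.
toy_weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_bias = [0.0, 0.0, 0.0]
logits = project_to_vocab([1.0, 2.0], toy_weight, toy_bias)
print(logits)  # [1.0, 2.0, 3.0]
```

The logits are unnormalized scores; they only become probabilities after the softmax step.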

The softmax function calculates the probabilities over the vocabulary and we use beam search to get the final output.
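The decoding step can be illustrated with a small, self-contained beam search. This is a sketch under assumptions, not the decoding code of any particular model: `toy_decoder` is a hypothetical stand-in that returns fixed logits for a three-token vocabulary (token 2 plays the role of end-of-sequence), and no length normalization is applied.

```python
import math


def log_softmax(logits):
    """Log-probabilities over the vocabulary, computed stably."""
    m = max(logits)
    log_total = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_total for x in logits]


def beam_search(step_logits, beam_width, eos_id, max_len):
    """Keep the beam_width best partial sequences at every decoding step."""
    beams = [((), 0.0)]  # (token tuple, summed log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, lp in enumerate(log_softmax(step_logits(prefix))):
                candidates.append((prefix + (token,), score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            # Sequences that emitted end-of-sequence stop growing.
            (finished if seq[-1] == eos_id else beams).append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # keep unfinished hypotheses if max_len was reached
    return max(finished, key=lambda c: c[1])


def toy_decoder(prefix):
    # Hypothetical model: the logits depend only on how many tokens were emitted.
    table = [[2.0, 1.0, 0.0], [0.0, 2.0, 0.5]]
    return table[len(prefix)] if len(prefix) < 2 else [0.0, 0.0, 3.0]


best_tokens, best_score = beam_search(toy_decoder, beam_width=2, eos_id=2, max_len=5)
print(best_tokens)  # (0, 1, 2)
```

With a beam width of 1 this reduces to greedy decoding; wider beams trade extra computation for the chance to recover from a locally attractive but globally worse token choice.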

Advantages:

  • TrOCR, an end-to-end Transformer-based OCR model for text recognition with pre-trained CV and NLP models, is the first work that jointly leverages pre-trained image and text Transformers for the text recognition task in OCR.
  • TrOCR achieves state-of-the-art accuracy with a standard transformer-based encoder-decoder model, which is convolution-free and does not rely on any complex pre- or post-processing steps.

References:

TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models

https://arxiv.org/pdf/2109.10282.pdf

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

https://arxiv.org/pdf/2010.11929v2.pdf

Frequently asked questions

What technology is better than OCR?

OCR, short for "optical character recognition," only converts the characters it sees into text. The more advanced alternative is IDP, which stands for "Intelligent Document Processing," and it does more than recognize characters. It can analyze the whole content and the context of the document in several ways. Modern AI techniques like machine learning and natural language processing are used together to produce more meaningful results. As a result, IDP can extract the content and determine the organization and meaning of each item in the document, much as a human reader would.

What is the market for intelligent document processing?

Several industries use IDP. It is a trusted solution for automated data processing in finance, legal, insurance, logistics, and other sectors. Typical uses include saving time, improving accuracy in accounting, documenting loan applications, and other data processing tasks. Because it frees teams to concentrate on the essential operations of the business, IDP is also used in human resources departments for employee surveys, employee screening, resume processing, and other HR data.

What are the key innovation drivers supported by IDP?

IDP supports tremendous innovations in data-driven decision-making, deriving value from business documents and agile development.

How can IDP help organizations eliminate operational inefficiencies?

Businesses can improve operational efficiencies using IDP by automating repetitive tasks, reducing errors, and increasing the processing volume.

How can a business benefit from intelligent document processing systems in the context of accounting?

Intelligent Document Processing, or IDP, is well suited to accounting. It uses machine learning and powerful AI tools to handle data swiftly and accurately. Organizations find IDP useful because machines, unlike humans, don't tire or get sidetracked. What's more, they are less prone to expensive mistakes during paperwork management. This reliability improves operations with fewer mishaps and significantly boosts the organization's overall work quality and productivity.

What are the potential challenges or considerations when implementing IDP?

One of the major challenges when implementing IDP is getting the new workflows established as the norm. Personnel training, process changes, and full adoption across the organization all take time.

How does your solution handle corrections?

Did you know no system is 100% accurate all the time? When extraction errors occur, you want to correct them. We provide a simple UI that your business analyst will use to make corrections.

Does your solution work with handwriting?

Our solution excels at data extraction from handwriting. We've got proprietary methods and techniques that do the trick. It's pretty cool. See for yourself.
