Building and training a model to extract metadata from pictures of receipts, with automatic categorization and prediction, using OCR and AI



Architecture of The Proposed System

The flowchart diagram shows the design of the optical character recognition system. It represents the steps involved in the construction of the system and the flow of data. The system is divided into four main stages.

Flowchart Diagram


Major phases of our OCR

These phases are as follows:

Image acquisition: capturing the image from an external source such as a scanner or a camera.

Preprocessing: Once the image has been acquired, different preprocessing steps can be performed to improve its quality. Among the different preprocessing techniques are noise removal, thresholding, and baseline extraction.

Character segmentation: In this step, the characters in the image are separated so that they can be passed to the recognition engine. Among the simplest techniques are connected component analysis and projection profiles. However, in complex situations, where the characters are overlapping or broken or some noise is present in the image, advanced character segmentation techniques are required.

Feature extraction: The segmented characters are then processed to extract different features, on the basis of which the characters are recognized. Among the types of features that can be extracted from images are moments. The extracted features should be efficiently computable, minimize intra-class variation, and maximize inter-class variation.

Character classification: This step maps the features of the segmented image to different categories or classes. There are different types of character classification techniques. Structural classification techniques are based on features extracted from the structure of the image and use different decision rules to classify characters. Statistical pattern classification methods are based on probabilistic models and other statistical methods to classify the characters.

Post-processing: After classification, the results are not 100% correct, especially for complex languages. Post-processing techniques can be applied to improve the accuracy of OCR systems. These techniques utilize natural language processing as well as geometric and linguistic context to correct errors in the OCR results.

Experimental Dataset

The receipt dataset consists of 200 receipt images. Each receipt is shown in its entirety and includes the business name, business address, cost, itemized items, subtotal, tax (if applicable), and total. All receipt images are of average quality, with dimensions larger than 600 pixels on the longest side. This sample receipt image dataset is ideal for OCR and image pre-processing.

the receipt dataset we use

Feature Extraction

The most straightforward way of describing a character is by the actual raster image.

Another approach is to extract certain features that still characterize the symbols but leave out the unimportant attributes.

The techniques for extraction of such features are often divided into three main groups, where the features are found from:

• The distribution of points.

• Transformations and series expansions.

• Structural analysis.

The different groups of features may be evaluated according to their sensitivity to noise and deformation and the ease of implementation and use.

Robustness

1. Noise. Sensitivity to disconnected line segments, bumps, gaps, filled loops etc.

2. Distortions. Sensitivity to local variations like rounded corners, improper protrusions, dilations and shrinkage.

3. Style variation. Sensitivity to variation in style like the use of different shapes to represent the same character or the use of serifs, slants etc.

4. Translation. Sensitivity to movement of the whole character or its components.

5. Rotation. Sensitivity to change in orientation of the characters.

Pre-Processing

The image resulting from the scanning process (captured by a phone or a dedicated scanner) contains a certain amount of noise.

Depending on the resolution of the receipt, the characters may be smeared or broken.

Some of these defects may later cause poor recognition rates. In this stage, a series of operations is performed, viz. binarization, normalization, slant and rotation correction, noise removal, skew detection, character segmentation, filling, and thinning.

The main objective of pre-processing is to organize information so that the subsequent character recognition task becomes simpler.

It essentially enhances the image rendering it suitable for segmentation.

Normalization

Character normalization is a very important pre-processing operation for character recognition.

Normally, the character image is mapped onto a standard plane (with a predefined size) so as to give a representation of fixed dimensionality (the input character image must coincide with the corresponding template in terms of position, size, slant, and so on).

The goal of character normalization is to reduce the within-class variation of the shapes of the characters/digits, in order to facilitate the feature extraction process and improve classification accuracy.

The following figure demonstrates the normalization technique.


Original character; normalized character filled into the standard (normalized) plane.


Basically, there are two kinds of normalization transformation: linear and nonlinear.

The linear transformation is an affine transformation, and we mainly used linear transformations.

Nonlinear normalization is important when dealing with hand-printed characters.
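As an illustration, a linear (stretch-to-fit) normalization onto the standard plane can be sketched in a few lines of numpy; the 32x32 plane size and nearest-neighbour sampling are arbitrary choices for this example:

```python
import numpy as np

def normalize_char(img, size=32):
    """Linearly map a character image onto a size x size standard plane
    using nearest-neighbour sampling (a sketch of affine normalization)."""
    h, w = img.shape
    rows = np.arange(size) * h // size   # source row for each target row
    cols = np.arange(size) * w // size   # source column for each target column
    return img[np.ix_(rows, cols)]
```

In practice this is essentially what `cv2.resize` with nearest-neighbour interpolation does; a nonlinear normalization would instead redistribute the coordinates, for example according to local stroke density.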


Orientation and Script Detection (OSD)

We need to automatically detect and correct text orientation.

OSD mode must include both estimated text orientation and script/writing-system detection.

• Text orientation: The estimated rotation angle (in degrees) of the text in the input image.

• Script: The predicted writing system of the text in the image. The figure shows some of the different orientations we can face.

example of varying text orientations
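In practice we obtain OSD through `pytesseract.image_to_osd` (which requires the Tesseract binary); its output is plain text, so a small parser suffices. The sample output string below is illustrative of the format:

```python
import re

# Representative output of pytesseract.image_to_osd(image); the actual
# call needs the Tesseract binary, so here we parse a sample string.
SAMPLE_OSD = """Page number: 0
Orientation in degrees: 90
Rotate: 270
Orientation confidence: 2.32
Script: Latin
Script confidence: 1.85"""

def parse_osd(osd_text):
    """Extract the rotation needed to correct the image and the detected script."""
    rotate = int(re.search(r"Rotate: (\d+)", osd_text).group(1))
    script = re.search(r"Script: (\w+)", osd_text).group(1)
    return rotate, script

rotate, script = parse_osd(SAMPLE_OSD)  # (270, 'Latin')
# To correct the orientation we would then rotate the image by `rotate`
# degrees, e.g. with cv2.rotate or PIL's Image.rotate.
```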

Thinning, Smoothing

Smoothing implies both filling and thinning. Filling eliminates small breaks, gaps, and holes in the digitized characters, while thinning reduces the width of the lines. The figure below depicts the smoothing and thinning approach.

smoothing and thinning of a symbol

Blurring and Sampling

• Blurring: Uncontrolled environments tend to produce blurred images, especially if the end-user is using a smartphone that does not have some form of stabilization.

Blurring softens focused features and reveals more of the "structural" components of the image.

The image below illustrates the blur effect.

small median blur applied to the input image

The sampling rate determines the spatial resolution of the digitized image (figure below), while the quantization level determines the number of grey levels in the digitized image.

The magnitude of the sampled image is expressed as a digital value in image processing.

image sub-sampled
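Both operations can be sketched in numpy (in practice we would call `cv2.medianBlur` and `cv2.resize`); the 3x3 window and the sub-sampling factor are illustrative:

```python
import numpy as np

def median_blur3(img):
    """3x3 median filter, a numpy sketch of what cv2.medianBlur(img, 3)
    computes: each pixel is replaced by the median of its neighbourhood."""
    p = np.pad(img, 1, mode="edge")
    h, w = img.shape
    windows = [p[i:i + h, j:j + w] for i in range(3) for j in range(3)]
    return np.median(np.stack(windows), axis=0).astype(img.dtype)

def subsample(img, factor=2):
    """Reduce spatial resolution by keeping every `factor`-th pixel."""
    return img[::factor, ::factor]
```

The median filter removes isolated salt-and-pepper noise while preserving edges better than a simple averaging blur, which is why it is a common first step before thresholding.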

Thresholding

The thresholding process is important, as the result of recognition is totally dependent on the quality of the bilevel image.

We used a fixed threshold, where gray levels below this threshold are said to be black and levels above are said to be white.

The figure shows the result of global thresholding.

result of Global Thresholding on input image
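The fixed global threshold amounts to a one-liner (what `cv2.threshold` with `THRESH_BINARY` computes; the threshold value 128 is an arbitrary example):

```python
import numpy as np

def global_threshold(img, t=128):
    """Fixed global threshold: gray levels below t become black (0),
    levels at or above t become white (255)."""
    return np.where(img < t, 0, 255).astype(np.uint8)
```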


Adaptive Thresholding Selection Based on Topographical Image Analysis

For a high-contrast document with a uniform background, a preselected fixed threshold can be sufficient.

However, many receipts encountered in practice have a rather large range in contrast.

The best thresholding methods are usually those able to vary the threshold over the document, adapting to local properties such as contrast and brightness.

result of Adaptive Thresholding (B) on input image (A)
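A mean-based adaptive threshold in the spirit of `cv2.adaptiveThreshold` with `ADAPTIVE_THRESH_MEAN_C` can be sketched with an integral image for fast local means; the block size and constant C below are illustrative:

```python
import numpy as np

def adaptive_threshold(img, block=15, C=10):
    """Threshold each pixel against the mean of its block x block
    neighbourhood minus C (a numpy sketch of mean adaptive thresholding)."""
    r = block // 2
    p = np.pad(img.astype(np.float64), r, mode="edge")
    # integral image: ii[a, b] = sum of p[:a, :b], for O(1) window sums
    ii = np.pad(p.cumsum(0).cumsum(1), ((1, 0), (1, 0)))
    h, w = img.shape
    local_sum = (ii[block:block + h, block:block + w]
                 - ii[:h, block:block + w]
                 - ii[block:block + h, :w]
                 + ii[:h, :w])
    mean = local_sum / (block * block)
    return np.where(img > mean - C, 255, 0).astype(np.uint8)
```

Because each pixel is compared to its own neighbourhood, uneven illumination across the receipt no longer pushes whole regions to black or white, which is exactly the failure mode of the global threshold above.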

Binary Morphology

Morphology represents the shapes that are manifested in binary or gray-tone images.

The set of all the black pixels in a black-and-white image constitutes a complete description of the binary image.

Binary Dilation

The binary dilation of an image by a structuring element is the locus of the points covered by the structuring element, when its center lies within the non-zero points of the image.

An example of a dilated image is displayed in the figure.

An example of dilation operation

Binary Erosion

The binary erosion of an image by a structuring element is the locus of the points where a superimposition of the structuring element centered on the point is entirely contained in the set of non-zero elements of the image.

An example of an eroded image is displayed in the figure.

An example of erosion operation:

Duality between Dilation and Erosion

We can perform operations on binary images such as erosion or dilation which make objects smaller or bigger respectively, or opening and closing which separate or merge objects respectively:

Erode then dilate:

result of Erode then dilate

result of Dilation then erosion

Opening is a process in which an erosion operation is performed first, followed by a dilation operation.

Closing is a process in which a dilation operation is performed first, followed by an erosion operation.

result of Closing (a)to(b) and Opening (c)to(d) transformation
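The four morphological operations can be sketched directly in numpy (in practice `cv2.erode` and `cv2.dilate` are used); a 3x3 square structuring element is assumed:

```python
import numpy as np

def dilate(img, se=np.ones((3, 3), bool)):
    """Binary dilation: a pixel is set if the structuring element,
    centred on it, covers any foreground pixel."""
    r = se.shape[0] // 2
    p = np.pad(img.astype(bool), r)
    out = np.zeros(img.shape, dtype=bool)
    for i in range(se.shape[0]):
        for j in range(se.shape[1]):
            if se[i, j]:
                out |= p[i:i + img.shape[0], j:j + img.shape[1]]
    return out

def erode(img, se=np.ones((3, 3), bool)):
    """Binary erosion: a pixel survives only if the structuring element,
    centred on it, fits entirely inside the foreground."""
    r = se.shape[0] // 2
    p = np.pad(img.astype(bool), r)
    out = np.ones(img.shape, dtype=bool)
    for i in range(se.shape[0]):
        for j in range(se.shape[1]):
            if se[i, j]:
                out &= p[i:i + img.shape[0], j:j + img.shape[1]]
    return out

def opening(img):   # erode then dilate: removes small specks
    return dilate(erode(img))

def closing(img):   # dilate then erode: fills small holes
    return erode(dilate(img))
```

Opening a lone noise pixel removes it entirely, while closing a character with a small gap reconnects its strokes, which is why these two composites appear throughout the pre-processing pipeline.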

Data extraction

OCR engine Architecture

Customized Tesseract architecture:

customize Tesseract Architecture


Neural Network Architecture

Recurrent Neural Networks (RNN)

Limitation of traditional neural networks:
We don’t start our thinking from scratch every second; we understand each word based on our understanding of previous words. We don’t throw everything away and start thinking from scratch again: our thoughts have persistence. Traditional neural networks can’t do this, and it seems like a major shortcoming. Recurrent neural networks address this issue. They are networks with loops in them, allowing information to persist. In the diagram below, a chunk of neural network, A, looks at some input xt and outputs a value ht. A loop allows information to be passed from one step of the network to the next.

Recurrent Neural Networks have loops

A recurrent neural network can be thought of as multiple copies of the same network, each passing a message to a successor.
Consider what happens if we unroll the loop:

An unrolled recurrent neural network

The Problem of Long-Term Dependencies:
One of the appeals of RNNs is the idea that they might be able to connect previous information to the present task.
Sometimes we only need to look at recent information to perform the present task; where the gap between the relevant information and the place it is needed is small, RNNs can learn to use the past information.
But there are also cases where we need more context.
It’s entirely possible for the gap between the relevant information and the point where it is needed to become very large.
Unfortunately, as that gap grows, RNNs become unable to learn to connect the information.

(a) RNN with short-term dependencies; (b) RNN with long-term dependencies

Long Short Term Memory networks (LSTM)

As deep learning revolutionized the field of computer vision (as well as nearly every other field of computer science) in the 2010s, OCR accuracy was given a tremendous boost by specialized architectures called long short-term memory (LSTM) networks.
Now, in the 2020s, we’re seeing OCR become increasingly commercialized by tech giants such as Google, Microsoft, and Amazon. We exist in a fantastic time in the computer science field.
LSTMs are explicitly designed to avoid the long-term dependency problem.
Remembering information for long periods of time is practically their default behavior.
All recurrent neural networks have the form of a chain of repeating modules of neural network.
In standard RNNs, this repeating module will have a very simple structure, such as a single tanh layer.

The repeating module in a standard RNN contains a single layer

LSTMs also have this chain-like structure, but the repeating module has a different structure.
Instead of having a single neural network layer, there are four, interacting in a very special way.

The repeating module in an LSTM contains four interacting layers
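To make the four interacting layers concrete, here is a minimal numpy sketch of a single LSTM step; the dimensions and random weights are illustrative only:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x, h_prev, c_prev, W, U, b):
    """One step of a standard LSTM: the input, forget and output gates
    plus the candidate cell state are the four interacting layers."""
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b        # all four layers computed at once
    i = sigmoid(z[0 * H:1 * H])       # input gate
    f = sigmoid(z[1 * H:2 * H])       # forget gate
    o = sigmoid(z[2 * H:3 * H])       # output gate
    g = np.tanh(z[3 * H:4 * H])       # candidate values
    c = f * c_prev + i * g            # new cell state
    h = o * np.tanh(c)                # new hidden state
    return h, c

# Unrolling over a sequence = applying the same cell repeatedly
rng = np.random.default_rng(0)
D, H = 8, 16                          # illustrative input/hidden sizes
W = rng.normal(size=(4 * H, D))
U = rng.normal(size=(4 * H, H))
b = np.zeros(4 * H)
h = c = np.zeros(H)
for x in rng.normal(size=(5, D)):     # a sequence of 5 input vectors
    h, c = lstm_cell(x, h, c, W, U, b)
```

The forget gate f is what lets the cell state carry information across many steps unchanged, which is how LSTMs avoid the long-term dependency problem described above.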

Post-processing

The post-processing stage is used to increase recognition accuracy. OCR systems will never be 100% accurate, so we should assume there will be some mistakes.

Grouping

The result of plain symbol recognition on a document is a set of individual symbols.
However, these symbols in themselves do not usually contain enough information.
Instead, we would like to associate the individual symbols that belong to the same string with each other, making up words and numbers.
The process of performing this association of symbols into strings is commonly referred to as grouping.
The grouping of symbols into strings is based on the symbols’ location in the document (seller name, date pattern, total, currency, ...).
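A simplified sketch of this pattern-based grouping over the OCR output text; the sample text and field patterns are illustrative assumptions, not our full rule set:

```python
import re

# Illustrative OCR output after symbol recognition and line grouping.
OCR_TEXT = """COFFEE HOUSE
2 Espresso        4.00
TAX               0.40
TOTAL             4.40
12/05/2021"""

def group_fields(text):
    """Group recognized strings into named fields by location and pattern."""
    fields = {}
    m = re.search(r"TOTAL\s+(\d+\.\d{2})", text, re.I)   # amount after 'TOTAL'
    if m:
        fields["total"] = float(m.group(1))
    m = re.search(r"\b(\d{2}/\d{2}/\d{4})\b", text)      # a date pattern
    if m:
        fields["date"] = m.group(1)
    fields["seller"] = text.splitlines()[0].strip()      # first line heuristic
    return fields
```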

Categorization

We categorize the receipts, based on the text extracted after grouping, into categories of expenses (Figure III.28), e.g., meals, transportation, office supplies...
Receipt Classifier
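A minimal keyword-based sketch of such a classifier; the categories and keyword lists below are illustrative:

```python
# Illustrative category -> keyword mapping for expense classification.
CATEGORY_KEYWORDS = {
    "meals": ["restaurant", "cafe", "coffee", "pizza"],
    "transportation": ["taxi", "uber", "fuel", "parking"],
    "office supplies": ["paper", "ink", "stapler", "pen"],
}

def categorize(text):
    """Assign the category whose keywords occur most often in the
    extracted text; fall back to 'other' when nothing matches."""
    t = text.lower()
    scores = {cat: sum(t.count(kw) for kw in kws)
              for cat, kws in CATEGORY_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "other"
```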

Error-detection and correction

The goal of post-processing is to detect and correct misspellings in the OCR
output text after the input image has been scanned and completely processed.
We therefore provide the user with the possibility of correcting the extracted data.
The simple sequence diagram shown below helps illustrate the process.
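One common building block for this error correction is the Levenshtein (edit) distance between an OCR token and a lexicon of expected tokens; a sketch, with an illustrative lexicon:

```python
def levenshtein(a, b):
    """Edit distance between two strings (dynamic programming, O(len(a)*len(b)))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (ca != cb))) # substitution
        prev = cur
    return prev[-1]

def correct(word, lexicon, max_dist=2):
    """Replace an OCR token with its closest lexicon entry, if close enough."""
    best = min(lexicon, key=lambda w: levenshtein(word, w))
    return best if levenshtein(word, best) <= max_dist else word
```

For example, the common OCR confusion of O with 0 ("T0TAL") is one edit away from "TOTAL" and is corrected, while genuinely unknown tokens are left untouched.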

Formatting and converting

After the text is cleaned up and ready to be consumed, we convert the extracted data to JSON so that all HTTP responses are served as JSON.
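With Python's standard json module this step is a one-liner; the field values below are illustrative:

```python
import json

# Extracted fields after grouping, categorization and correction (example data).
fields = {"seller": "COFFEE HOUSE", "date": "12/05/2021",
          "total": 4.40, "category": "meals"}

payload = json.dumps(fields)  # the JSON body served to HTTP clients
```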

Training

Training and Recognition

Training is the ability to learn new fonts and new languages: the engine can be custom-made to recognize particular fonts or new languages.
The training fundamentals consist of:
• Character samples must be segregated by font
• Few samples are required; 4-10 of each combination is good, but one should be sufficient
• Not many fonts are required
• The number of different characters is limited only by memory
The first bullet explains that the engine requires a font to be trained.
Thus, to be able to recognize symbols with the engine, a conversion of vectorized symbols into a font is required.
The process is done by the following steps:
1. Convert symbols into vectorized symbols
2. Combine all vectorized symbols into a true type font
3. Create a tif image with the symbol font exposing the different symbols
4. Create a box file containing the data representation of each symbol from the tif image.
The box file is essential for the training process to understand which symbol corresponds to which data.
5. Generate a language training data file

Trigger of the training process

The training process can be triggered manually by the administrator, or automatically when new training data is present (when a user enters a new correction).

Tools, Libraries, and Packages for OCR

Tesseract

The Tesseract OCR engine was first developed as closed-source software by Hewlett-Packard (HP) Labs in the 1980s. However, after Tesseract was initially developed, very little was done to update the software, patch it, or include new state-of-the-art OCR algorithms.
Google then started sponsoring the development of Tesseract in 2006.

Tesseract OCR logo

PyTesseract

PyTesseract is a Python package developed by Matthias Lee, a PhD in computer science who focuses on software engineering performance. The PyTesseract library interfaces with the tesseract command-line binary using only one or two function calls.

OpenCV

To improve our OCR accuracy, we’ll need to utilize computer vision and image processing to "clean up" our input image, making it easier for Tesseract to correctly OCR the text in the image.
To facilitate our computer vision and image processing operations, we’ll use the OpenCV library, the de facto standard for computer vision and image processing. The OpenCV library provides Python bindings, making it a natural fit in our OCR ecosystem. The logo for OpenCV is shown below.

OpenCV logo

System overview architecture

The diagram shown below is an overview of the different containers: one handles users' HTTP requests, while the other is dedicated to training; their internal processes are also shown.

System overview architecture

Preparing for the training phase

The figure below illustrates how the preparation for the training phase works.

preparing for training phase overview

Image Pre-processing

We developed two different image-processing methods; the main difference is that the first method does not warp the receipt. We generally use the second method.

Image processing without warping

Image processing with warping


Binarization

We selected global thresholding methods and local thresholding methods based on their popularity and accuracy:

  • Skew Detection and Correction
  • Character Segmentation
  • Blur Detection in Text and Documents
  • Levenshtein distance
  • Canny Edge Detection (CED)
  • Finding Text Blobs in an Image with OpenCV

different pipelines for image pre-processing

Feature Extraction



Whitelisting and Blacklisting Characters with Tesseract

A whitelist specifies the characters that the OCR engine is allowed to recognize; if a character is not on the whitelist, it cannot be included in the output OCR results. The opposite of a whitelist is a blacklist: a blacklist specifies the characters that, under no circumstances, can be included in the output. We apply whitelisting and blacklisting with Tesseract.
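With PyTesseract, both lists are passed through the `config` string using Tesseract's `tessedit_char_whitelist` and `tessedit_char_blacklist` variables; a small helper (the digit whitelist is an example for amount fields, and `amount_roi` is a hypothetical image region):

```python
def whitelist_config(chars):
    """Build a Tesseract config string restricting output to `chars`."""
    return f"-c tessedit_char_whitelist={chars}"

digits_only = whitelist_config("0123456789.")
# Hypothetical usage (requires pytesseract and the Tesseract binary):
# text = pytesseract.image_to_string(amount_roi, config=digits_only)
```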

OCR Using Template Matching

Template matching is the process of accepting an input character and then matching it to a set of reference images (i.e., templates). If a given input receipt sufficiently matches a template, it is assigned to that template. OCR via template matching is a viable candidate for us, allowing us to optimize Tesseract accuracy. The figure below shows the different steps to generate a fingerprint (histogram representation) for each receipt.

OCR Using Template Matching
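A sketch of such a fingerprint as a normalized grey-level histogram, compared by histogram intersection; the bin count and similarity measure are illustrative choices:

```python
import numpy as np

def fingerprint(img, bins=32):
    """Normalized grey-level histogram used as a receipt 'fingerprint'."""
    hist, _ = np.histogram(img, bins=bins, range=(0, 256))
    return hist / hist.sum()

def similarity(f1, f2):
    """Histogram intersection: 1.0 for identical fingerprints, less otherwise."""
    return float(np.minimum(f1, f2).sum())
```

Matching an input receipt then reduces to computing its fingerprint and taking the template with the highest similarity, accepting the match only above a chosen threshold.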

Training an OCR Model

Tesseract 4.00 includes a new neural network-based recognition engine that delivers significantly higher accuracy (on document images) than the previous versions, in return for a significant increase in required compute power. On complex languages, however, it may actually be faster than base Tesseract. Neural networks require significantly more training data and train a lot slower than base Tesseract. For Latin-based languages, the existing model data provided has been trained on about 400,000 text lines spanning about 4,500 fonts. For other scripts, not as many fonts are available, but they have still been trained on a similar number of text lines. Instead of taking a few minutes to a couple of hours to train, Tesseract 4.00 takes a few days to a couple of weeks. Even with all this new training data, we might find it inadequate for our particular problem.

There are multiple options for training:

• Fine tune.

Starting with an existing trained language, train on your specific additional data. This may work for problems that are close to the existing training data but different in some subtle way, such as a particularly unusual font. It may work even with a small amount of training data.

• Cut off the top layer

(or some arbitrary number of layers) from the network and retrain a new top layer using the new data. If fine tuning doesn’t work, this is most likely the next best option. Cutting off the top layer could still work for training a completely new language or script, if you start with the most similar looking script. 

• Retrain from scratch.

This is a daunting task, unless you have a very representative and sufficiently large training set for your problem. If not, you are likely to end up with an over-fitted network that does really well on the training data, but not on the actual data.

Training a Custom Tesseract Model

Using tesstrain (http://pyimg.co/n6pdk [62]), we can train and fine-tune Tesseract’s deep learning-based LSTM models (the same model we’ve used to obtain high accuracy OCR results throughout this work). Perhaps most importantly, the tesstrain package provides instructions on how our training dataset should be structured, making it far easier to build our dataset and train the model. 

Choose model name

Choose a name for our model: receipt_model. Provide ground truth: place our ground truth, consisting of line images and transcriptions, in the folder data/receipt_model-ground-truth. This list of files will be split into training and evaluation data; we defined RATIO_TRAIN so that 20% of the data is held out for evaluation.

Train

We start the training process.
The following figure shows the output of the training process.

output of the training process

Flask Python Web Framework


Using Virtual Environments

The most convenient way to install Flask is to use a virtual environment. A virtual environment is a private copy of the Python interpreter onto which you can install packages privately, without affecting the global Python interpreter installed in your system. Virtual environments are very useful because they prevent package clutter and version conflicts in the system’s Python interpreter. Creating a virtual environment for each application ensures that applications have access to only the packages that they use, while the global interpreter remains neat and clean and serves only as a source from which more virtual environments can be created. As an added benefit, virtual environments don’t require administrator rights.

Flask

Flask is a micro-framework designed to create a web application in a short time. It implements only the core functionality, giving developers the flexibility to add features as required during implementation. It is a lightweight WSGI application framework that can be used for a pure backend as well as the frontend if need be. Werkzeug and Jinja are its two core libraries. Werkzeug provides the interactive debugger, the full request object, a routing system for endpoints, and HTTP utilities for handling entity tags, cache control, dates, cookies, etc. It also provides a threaded WSGI server for local development, including a test client for simulating HTTP requests. Jinja, in turn, is a further dependency of Flask.

Jinja is a full-featured template engine. Sandboxed execution, powerful XSS prevention, template inheritance, easy debugging, and configurable syntax are a few of its many features. In addition, the code written in the HTML template is compiled as Python code. The figure below shows the Flask logo.

Flask logo


REST API

OpenAPI Specification

OpenAPI Specification (OAS) is a standard for defining RESTful APIs in a manner that makes them understandable to both humans and machines. Files in this format can be read and used by multiple tools that help design, build, and manage APIs, and they also provide documentation for developers. OpenAPI has two versions: OpenAPI 2.0 and the latest, OpenAPI 3.0, released in 2017.

Swagger

A REST or RESTful API is a Web Application Programming Interface (API) which follows the Representational State Transfer architectural style. REST is a technical description of how the World Wide Web works. It was described by Roy Fielding in 2000 and has become a popular approach to designing Web APIs.

Swagger Editor’s online user interface API of our OCR system

User Interface


User Interface

Benchmark

Different iterations

result of different iterations

boxplot result of different iterations


Execution time
average execution time

Feature Importance in the prediction of accuracy


Correlation analysis of all fields

Thanks for reading! 
