Creating and training a model to extract metadata from pictures of receipts, categorize them automatically, and make predictions using OCR and AI
Architecture of the Proposed System
The flowchart diagram shows the design of the optical character recognition system. It represents the steps involved in the construction of the system and the flow of data. The system is divided into six main stages.
| Flowchart Diagram |
Major phases of our OCR
These phases are as follows:
Image acquisition: Capturing the image from an external source such as a scanner or a camera.
Preprocessing: Once the image has been acquired, various preprocessing steps can be performed to improve its quality. Among the different preprocessing techniques are noise removal, thresholding, and baseline extraction.
Character segmentation: In this step, the characters in the image are separated so that they can be passed to the recognition engine. Among the simplest techniques that can be used are connected component analysis and projection profiles. However, in complex situations where characters overlap, are broken, or noise is present in the image, more advanced character segmentation techniques are required.
Feature extraction: The segmented characters are then processed to extract different features, on the basis of which the characters are recognized. Among the types of features that can be extracted from images are moments. The extracted features should be efficiently computable, minimize intra-class variation, and maximize inter-class variation.
Character classification: This step maps the features of the segmented image to different categories or classes. There are different types of character classification techniques. Structural classification techniques are based on features extracted from the structure of the image and use different decision rules to classify characters. Statistical pattern classification methods rely on probabilistic models and other statistical methods to classify the characters.
Post processing: After classification, the results are not 100% correct, especially for complex languages. Post-processing techniques can be applied to improve the accuracy of OCR systems. These techniques utilize natural language processing and geometric and linguistic context to correct errors in the OCR results.
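These stages can be pictured as a simple function pipeline. The sketch below is ours, not part of the system itself; every stage is a trivial stand-in just to show how the phases compose:

```python
import numpy as np

def acquire():             # image acquisition (stub: synthetic grey image)
    return np.full((32, 32), 200, dtype=np.uint8)

def preprocess(img):       # e.g. thresholding to a bilevel image
    return (img > 128).astype(np.uint8)

def segment(img):          # character segmentation (stub: whole image as one glyph)
    return [img]

def extract_features(ch):  # e.g. image moments (stub: mean intensity)
    return ch.mean()

def classify(feat):        # map features to a character class (stub)
    return "A" if feat > 0.5 else "?"

def postprocess(text):     # e.g. language-model corrections (stub: identity)
    return text

chars = [classify(extract_features(c)) for c in segment(preprocess(acquire()))]
result = postprocess("".join(chars))
```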
Experimental Dataset
The receipt dataset consists of 200 receipt images. Each receipt is shown in its entirety and includes the business name, business address, cost, itemized purchases, subtotal, tax (if applicable), and total. All receipt images are of average quality with dimensions larger than 600 pixels (longest side). This sample receipt image dataset is well suited to OCR and image pre-processing.
| receipt data-set we use |
Feature Extraction
The most straightforward way of describing a character is by the actual raster image.
Another approach is to extract certain features that still characterize the symbols but leave out the unimportant attributes.
The techniques for extraction of such features are often divided into three main groups, where the features are found from:
• The distribution of points.
• Transformations and series expansions.
• Structural analysis.
The different groups of features may be evaluated according to their sensitivity to noise and deformation and the ease of implementation and use.
Robustness
1. Noise. Sensitivity to disconnected line segments, bumps, gaps, filled loops etc.
2. Distortions. Sensitivity to local variations like rounded corners, improper protrusions, dilations and shrinkage.
3. Style variation. Sensitivity to variation in style like the use of different shapes to represent the same character or the use of serifs, slants etc.
4. Translation. Sensitivity to movement of the whole character or its components.
5. Rotation. Sensitivity to change in orientation of the characters.
Pre-Processing
The image resulting from the scanning process (captured by a phone or a dedicated scanner) contains a certain amount of noise.
Depending on the resolution of the receipt, the characters may be smeared or broken.
Some of these defects may later cause poor recognition rates. In this stage a series of operations is performed, viz. binarization, normalization, slant and rotation correction, noise removal, skew detection, character segmentation, filling, and thinning.
The main objective of pre-processing is to organize information so that the subsequent character recognition task becomes simpler.
It essentially enhances the image rendering it suitable for segmentation.
Normalization
Character normalization is a very important pre-processing operation for character recognition.
Normally, the character image is mapped onto a standard plane (with a predefined size) so as to give a representation of fixed dimensionality (an input character image must coincide with a corresponding template in terms of position, size, slant, and so on).
The goal of character normalization is to reduce the within-class variation of the shapes of the characters/digits in order to facilitate the feature extraction process and improve classification accuracy.
The following figure demonstrates the normalization technique.
| normalization technique |
The linear transformation is called an affine transformation, and we mainly used the linear transformation.
Nonlinear normalization is important when dealing with hand-printed characters.
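As a minimal sketch of mapping a character onto a standard plane, the NumPy snippet below performs a linear (nearest-neighbor) normalization to a fixed 32x32 plane. The function name `normalize_char` is our own choice, not taken from the system:

```python
import numpy as np

def normalize_char(img, size=(32, 32)):
    """Map a character image onto a standard plane of fixed dimensionality
    by linearly scaling pixel coordinates (nearest-neighbor resampling)."""
    h, w = img.shape
    rows = np.arange(size[0]) * h // size[0]   # source row for each target row
    cols = np.arange(size[1]) * w // size[1]   # source column for each target column
    return img[rows[:, None], cols]
```

Any input character, regardless of its original dimensions, now yields a representation of the same fixed size, which is what the later feature extraction stage expects.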
Orientation and Script Detection(OSD)
We need to automatically detect and correct text orientation.
The OSD mode must include both estimated text orientation and script/writing-system detection.
• Text orientation: The estimated rotation angle (in degrees) of the text in the input image.
• Script: The predicted writing system of the text in the image. The figure shows some of the different orientations we may encounter.
| example of varying text orientations |
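With pytesseract installed, OSD would be obtained as `pytesseract.image_to_osd(image)` (this call requires the Tesseract binary, so it is left commented out below). The sketch only parses the OSD report into the two fields we need; the sample string mirrors Tesseract's report format:

```python
import re

# osd = pytesseract.image_to_osd(image)   # requires pytesseract + Tesseract binary

def parse_osd(osd_text):
    """Extract the rotation angle (degrees) and script name from an OSD report."""
    angle = int(re.search(r"Rotate: (\d+)", osd_text).group(1))
    script = re.search(r"Script: (\w+)", osd_text).group(1)
    return angle, script

sample = "Page number: 0\nOrientation in degrees: 270\nRotate: 90\nScript: Latin\n"
angle, script = parse_osd(sample)
```

The returned angle tells us how many degrees to rotate the image before the recognition stage.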
Thinning, Smoothing
The smoothing step comprises both filling and thinning. Filling eliminates small breaks, gaps, and holes in the digitized characters, while thinning reduces the width of the lines. The figure below depicts the smoothing and thinning approach.
| smoothing and thinning of a symbol |
Blurring and Sampling
• Blurring: Images captured in uncontrolled environments tend to be blurred, especially if the end-user is using a smartphone without some form of stabilization.
Blurring smooths out focused features and reveals more of the "structural" components of the image.
The image below illustrates the blur effect.
| small median blur to the input image |
The sampling rate determines the spatial resolution of the digitized image (figure below), while the quantization level determines the number of grey levels in the digitized image.
The magnitude of the sampled image is expressed as a digital value in image processing.
| image sub-sampled |
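The two ideas above, sub-sampling (spatial resolution) and quantization (number of grey levels), can each be sketched in a line of NumPy. The function names are ours:

```python
import numpy as np

def subsample(img, factor=2):
    """Reduce spatial resolution by keeping every `factor`-th pixel."""
    return img[::factor, ::factor]

def quantize(img, levels=4):
    """Reduce the number of grey levels in an 8-bit image by bucketing
    intensities into `levels` equal-width bins."""
    step = 256 // levels
    return (img // step) * step
```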
Thresholding
The thresholding process is important, as the result of the recognition is totally dependent on the quality of the bilevel image.
We used a fixed threshold, where grey levels below the threshold are said to be black and levels above are said to be white.
The figure shows the result of global thresholding.
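A fixed global threshold is a one-liner in NumPy. This is a sketch of the rule just described (the threshold value 128 is illustrative):

```python
import numpy as np

def global_threshold(gray, t=128):
    """Fixed global threshold: grey levels below `t` become black (0),
    levels at or above `t` become white (255)."""
    return np.where(gray < t, 0, 255).astype(np.uint8)

gray = np.array([[10, 200], [128, 90]], dtype=np.uint8)
bw = global_threshold(gray)   # → [[0, 255], [255, 0]]
```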
Adaptive Thresholding Selection Based on Topographical Image Analysis
For a high-contrast document with uniform background, a preselected fixed threshold can be sufficient.
However, many receipts encountered in practice have a rather large range in contrast.
The best methods for thresholding are usually those which are able to vary the threshold over the document, adapting to local properties such as contrast and brightness.
| result of Adaptive Thresholding (B) on input image (A) |
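The idea of varying the threshold with local properties can be sketched as a mean-of-neighborhood rule, the same idea as OpenCV's `cv2.adaptiveThreshold` with `ADAPTIVE_THRESH_MEAN_C`. The pure-NumPy version below is ours and deliberately uses a slow explicit loop for clarity; the block size and constant `c` are illustrative:

```python
import numpy as np

def adaptive_threshold(gray, block=15, c=10):
    """Threshold each pixel against the mean of its local `block` x `block`
    neighborhood minus a constant `c` (edge-padded sliding window)."""
    pad = block // 2
    padded = np.pad(gray.astype(float), pad, mode="edge")
    out = np.zeros_like(gray, dtype=np.uint8)
    for i in range(gray.shape[0]):
        for j in range(gray.shape[1]):
            local_mean = padded[i:i + block, j:j + block].mean()
            out[i, j] = 255 if gray[i, j] > local_mean - c else 0
    return out
```

Because each pixel is compared against its own neighborhood, dark text stays black even where the background brightness drifts across the receipt.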
Binary Morphology
Morphology represents the shapes that are manifested in binary or grey-tone images.
The set of all the black pixels in a black-and-white image constitutes a complete description of the binary image.
Binary Dilation
The binary dilation of an image by a structuring element is the locus of the points covered by the structuring element, when its center lies within the non-zero points of the image.
An example of dilation is displayed in the figure.
| An example of dilation operation |
Binary Erosion
The binary erosion of an image by a structuring element is the locus of the points where a superimposition of the structuring element centered on the point is entirely contained in the set of non-zero elements of the image.
An example of erosion is displayed in the figure.
| An example of erosion operation: |
Duality between Dilation and Erosion
We can perform operations on binary images such as erosion or dilation which make objects smaller or bigger respectively, or opening and closing which separate or merge objects respectively:
Erode then dilate:
| result of Erode then dilate |
| result of Dilation then erosion |
Opening is a process in which an erosion operation is performed first, followed by a dilation operation.
Closing is a process in which a dilation operation is performed first, followed by an erosion operation.
| result of Closing (a)to(b) and Opening (c)to(d) transformation |
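The four operations and the duality between dilation and erosion can be sketched with a 3x3 structuring element in pure NumPy. This is our own illustration (border pixels are handled by zero-padding, a simplification of the general definition):

```python
import numpy as np

def dilate(img):
    """Binary dilation with a 3x3 structuring element: a pixel is set if
    any pixel in its 3x3 neighborhood (including itself) is set."""
    p = np.pad(img, 1)                      # zero-pad the border
    out = np.zeros_like(img)
    for di in (0, 1, 2):
        for dj in (0, 1, 2):                # OR together the 9 shifted copies
            out |= p[di:di + img.shape[0], dj:dj + img.shape[1]]
    return out

def erode(img):
    """Binary erosion via duality: eroding the foreground is the same as
    dilating the background and complementing the result."""
    return 1 - dilate(1 - img)

def opening(img):   # erosion then dilation: removes small specks
    return dilate(erode(img))

def closing(img):   # dilation then erosion: fills small gaps and holes
    return erode(dilate(img))
```

Opening a lone noise pixel deletes it, while closing a one-pixel hole inside a solid region fills it, exactly the separate/merge behavior described above.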
Data Extraction
Neural Network Architecture
Recurrent Neural Networks (RNN)
| Recurrent Neural Networks have loops |
| (a) RNN short-term dependencies (b) RNN long-term dependencies |
Long Short Term Memory networks (LSTM)
| The repeating module in a standard RNN contains a single layer |
| The repeating module in an LSTM contains four interacting layers |
Post-processing
Grouping
Categorization
Error Detection and Correction
Formatting and Converting
Training
Training and Recognition
Trigger of the Training Process
Tools, Libraries, and Packages for OCR
Tesseract
| Tesseract OCR logo |
PyTesseract
OpenCV
| OpenCV logo |
System overview architecture
| System overview architecture |
Preparing for the Training Phase
| preparing for training phase overview |
Image Pre-processing
| Image processing without warping |
| Image processing with warping |
- Binarization
- Skew Detection and Correction
- Character Segmentation
- Blur Detection in Text and Documents
- Levenshtein distance
- Canny Edge Detection (CED)
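Of the techniques listed above, the Levenshtein distance is the one used to score OCR output against ground truth during error detection and evaluation. A standard dynamic-programming implementation:

```python
def levenshtein(a, b):
    """Minimum number of single-character edits (insertions, deletions,
    substitutions) needed to turn string `a` into string `b`."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution / match
        prev = cur
    return prev[-1]
```

For example, the common OCR confusion of reading "TOTAL" as "T0TAL" is a distance of 1 (one substitution).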
Finding Text Blobs in an Image with OpenCV
Whitelisting and Blacklisting Characters with Tesseract
A whitelist specifies the characters that the OCR engine is allowed to recognize; if a character is not on the whitelist, it cannot be included in the output OCR results. The opposite of a whitelist is a blacklist, which specifies the characters that, under no circumstances, can be included in the output. We apply whitelisting and blacklisting with Tesseract.
OCR Using Template Matching
Template matching is the process of accepting an input character and then matching it to a set of reference images (i.e., templates); a given input receipt is accepted if it sufficiently matches a template. OCR via template matching is a viable candidate for us, allowing us to optimize Tesseract's accuracy. The figure below shows the different steps to generate a fingerprint (histogram representation) for each receipt.

| OCR Using Template Matching |
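A histogram fingerprint and a matching score can be sketched as follows. The function names and the choice of cosine similarity are ours; the fingerprint is simply the normalized grey-level distribution of the receipt image:

```python
import numpy as np

def fingerprint(gray, bins=32):
    """Histogram 'fingerprint' of a receipt image: its normalized
    grey-level distribution, usable as a compact template."""
    hist, _ = np.histogram(gray, bins=bins, range=(0, 256))
    return hist / max(hist.sum(), 1)

def match_score(fp_a, fp_b):
    """Cosine similarity between two fingerprints (1.0 = identical)."""
    return float(np.dot(fp_a, fp_b) /
                 (np.linalg.norm(fp_a) * np.linalg.norm(fp_b)))
```

An input receipt is accepted when its score against a stored template exceeds a chosen similarity cutoff.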
Training an OCR Model
Tesseract 4.00 includes a new neural network-based recognition engine that delivers significantly higher accuracy (on document images) than the previous versions, in return for a significant increase in required compute power. On complex languages, however, it may actually be faster than base Tesseract. Neural networks require significantly more training data and train a lot slower than base Tesseract. For Latin-based languages, the existing model data provided has been trained on about 400,000 text lines spanning about 4,500 fonts. For other scripts, not as many fonts are available, but they have still been trained on a similar number of text lines. Instead of taking a few minutes to a couple of hours to train, Tesseract 4.00 takes a few days to a couple of weeks. Even with all this new training data, you might find it inadequate for our particular problem. There are multiple options for training:
• Fine tune. Starting with an existing trained language, train on your specific additional data. This may work for problems that are close to the existing training data but different in some subtle way, like a particularly unusual font. It may work with even a small amount of training data.
• Cut off the top layer (or some arbitrary number of layers) from the network and retrain a new top layer using the new data. If fine tuning doesn't work, this is most likely the next best option. Cutting off the top layer could still work for training a completely new language or script, if you start with the most similar-looking script.
• Retrain from scratch. This is a daunting task, unless you have a very representative and sufficiently large training set for your problem. If not, you are likely to end up with an over-fitted network that does really well on the training data but not on the actual data.
Training a Custom Tesseract Model
Using tesstrain (http://pyimg.co/n6pdk [62]), we can train and fine-tune Tesseract's deep learning-based LSTM models (the same models we have used to obtain high-accuracy OCR results throughout this work). Perhaps most importantly, the tesstrain package provides instructions on how our training dataset should be structured, making it far easier to build our dataset and train the model.
Choose model name
Choose a name for our model: receipt model.
Provide ground truth
Place our ground truth, consisting of line images and transcriptions, in the folder data/receipt model ground-truth. This list of files will be split into training and evaluation data; we defined the RATIO_TRAIN split at 20%.
Train
Start the training process. The following figure shows the output of the training process.

| output of the training process |
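The overall workflow sketched below is illustrative; the paths, model name, and variable values are our assumptions, so the tesstrain README should be treated as the authoritative reference for the exact Makefile variables:

```shell
# Sketch of the tesstrain fine-tuning workflow (values are illustrative)
git clone https://github.com/tesseract-ocr/tesstrain
cd tesstrain

# Ground truth: pairs of line images (.tif/.png) and transcriptions (.gt.txt)
# placed under data/<MODEL_NAME>-ground-truth
ls data/receipt-ground-truth   # e.g. line_0001.tif  line_0001.gt.txt  ...

# Fine-tune from the existing English model, holding out 20% for evaluation
make training MODEL_NAME=receipt START_MODEL=eng RATIO_TRAIN=0.80 \
     TESSDATA=/usr/share/tesseract-ocr/4.00/tessdata MAX_ITERATIONS=10000
```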
Flask: a Python Web Framework
Using Virtual Environments
The most convenient way to install Flask is to use a virtual environment. A virtual environment is a private copy of the Python interpreter onto which you can install packages privately, without affecting the global Python interpreter installed in your system. Virtual environments are very useful because they prevent package clutter and version conflicts in the system's Python interpreter. Creating a virtual environment for each application ensures that applications have access to only the packages that they use, while the global interpreter remains neat and clean and serves only as a source from which more virtual environments can be created. As an added benefit, virtual environments don't require administrator rights.
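On a typical Unix-like system this amounts to a few commands (the environment name `venv` is just a convention):

```shell
python3 -m venv venv          # create a private copy of the interpreter
source venv/bin/activate      # this shell now uses venv's python and pip
pip install flask             # installed into venv only, not system-wide
deactivate                    # leave the environment when done
```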
Flask
Flask is a micro-framework designed to create a web application in a short time. It implements only the core functionality, giving developers the flexibility to add features as required during implementation. It is a lightweight WSGI application framework that can be used for a pure backend as well as the frontend if need be. Werkzeug and Jinja are its two core libraries. Werkzeug provides an interactive debugger, a full request object, a routing system for endpoints, and HTTP utilities for handling entity tags, cache controls, dates, cookies, etc. It also provides a threaded WSGI server for local development, including a test client for simulating HTTP requests.
Jinja, the other dependency of Flask, is a full-featured template engine: sandboxed execution, powerful XSS prevention, template inheritance, easy debugging, and configurable syntax are a few of its many features. In addition, code written in the HTML templates is compiled as Python code. The figure below shows the Flask logo.
REST API
OpenAPI Specification
OpenAPI Specification (OAS) is a standard for defining RESTful APIs in a manner that makes them understandable for both humans and machines. The files made in this format can be read and used by multiple tools that help to design, build and manage APIs and they also provide documentation for developers. OpenAPI has two versions, OpenAPI 2.0 and the latest OpenAPI 3.0 released in 2017.
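A hypothetical OAS 3.0 fragment for a receipt-upload endpoint illustrates the format (the service name and path are our assumptions, not part of the implemented system):

```yaml
openapi: "3.0.0"
info:
  title: Receipt OCR API        # hypothetical service name
  version: "1.0"
paths:
  /ocr:
    post:
      summary: Extract metadata from a receipt image
      requestBody:
        content:
          multipart/form-data:
            schema:
              type: object
              properties:
                receipt:
                  type: string
                  format: binary
      responses:
        "200":
          description: Extracted fields (business name, total, tax, ...)
```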
Swagger
A REST or RESTful API is a Web Application Programming Interface (API) which follows the Representational State Transfer architectural style. REST is a technical description of how the World Wide Web works. It was described by Roy Fielding in 2000 and has become a popular approach to designing Web APIs.