What can modern OCR do in 2021?

Just a few years ago, Optical Character Recognition (OCR) solutions worked well enough only for the Indo-European group of languages. Typewritten texts in several languages were recognized reliably, while solutions for other languages were still under development. For English and Russian, for example, there were solutions for handwritten texts, but they could only process text written in a neat, calligraphic hand. Moreover, the images had to be scanned or photographed in high resolution, with the page held exactly parallel to the camera.

Even small changes in the slope of the surface or in the writing style sharply degraded the recognition quality of letters and words.

Similar shooting conditions apply to all recognition systems that handle texts in widely supported languages, both modern (German, French, Russian) and older scripts such as ancient Japanese or Burmese.

Under the shooting conditions described above, the output accuracy of current solutions is at least 90%, which is sufficient for many use cases. When the surface changes, for example text printed on a bottle or on fabric, the accuracy varies from 65% to 95%. In such cases it becomes necessary to build a custom solution for each individual situation: one for recognizing text on a bottle and a separate one for recognizing text on fabric. There is still no one-size-fits-all solution with acceptable quality.

At the moment, OCR tasks can be solved on almost any device equipped with a processor and a camera. In addition to personal computers, OCR can run on mobile devices and on single-board computers such as the Raspberry Pi. Existing solutions can run on computationally weak devices while sacrificing only a little quality.

In practice, working solutions are combinations of convolutional and recurrent networks, various heuristics, and custom language-processing methods. For example, if a certain symbol is difficult to recognize, the solution may analyze the adjacent symbols and check whether a candidate symbol fits into some word in which it could plausibly occur.
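As a minimal sketch of that idea (the vocabulary and the helper function below are purely illustrative, not part of any production system):

    # Toy vocabulary; a real system would use a full dictionary or language model.
    VOCABULARY = {"contract", "control", "amount", "account"}

    def resolve_ambiguous(word_with_gap, candidates):
        """Pick the candidate character that turns the word into a known one.

        word_with_gap uses '?' for the symbol the OCR step could not read,
        e.g. 'cont?act' with candidates 'rn'.
        """
        for ch in candidates:
            guess = word_with_gap.replace("?", ch, 1)
            if guess in VOCABULARY:
                return guess
        # Fall back to the first candidate if no dictionary word matches.
        return word_with_gap.replace("?", candidates[0], 1)

    print(resolve_ambiguous("cont?act", "rn"))  # -> 'contract'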

The old classical approaches and solutions based on simple machine learning models are hardly used anymore.

For text processing you can use ready-made solutions from Tesseract and ABBYY, as well as the Keras OCR or EasyOCR libraries. They will let you build a Proof of Concept, after which you can move toward a custom solution.
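Such a Proof of Concept can be as small as a few lines; here is a hedged sketch using pytesseract and EasyOCR (the file name 'scan.png' is a placeholder):

    # pip install pytesseract pillow easyocr
    from PIL import Image
    import pytesseract
    import easyocr

    # Tesseract: returns the recognized text as a plain string.
    text = pytesseract.image_to_string(Image.open("scan.png"), lang="eng")
    print(text)

    # EasyOCR: returns (bounding box, text, confidence) tuples.
    reader = easyocr.Reader(["en"])
    for box, word, confidence in reader.readtext("scan.png"):
        print(word, confidence)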

OCR solutions are not limited to language processing only. OCR is widely used for recognizing chemical and mathematical formulas as well as accounting and financial documents.

Many problems of the past have been resolved, but several other OCR problems still remain. The performance of OCR systems on mobile devices and computers without a GPU remains an issue. Due to the lack of a GPU, some solutions have to be changed or adjusted to run on a CPU, and for the sake of speed this affects the quality of the solution.

Another problem is the recognition of symbols on objects without a flat surface, as described above. For example, an inscription on a bottle, or text on a clothing tag, an instrument, or air luggage is harder to recognize; as a workaround, barcodes, which are more robust to recognition errors, can be used for identification.


Intelligent document processing

Nowadays, most companies do their business in digital form: they enter into contracts digitally rather than on paper, conduct business correspondence electronically, and so on. However, companies and organizations still have to keep paper copies of documents for legal and financial reasons.

The most direct example is employee records and workers' contributions to pension funds over the organization's entire history. The need to keep both paper and digital copies is justified by the fact that when an employee quits, the records of his or her contributions to the funds must still be stored safely, with the duplication additionally protecting the data against loss.

Therefore, there is already an urgent need to search, extract and analyze information within huge databases and repositories of scanned documents.

Another example is the analysis of data from various contracts and agreements (hereinafter, contracts). The important pieces of information are the contract number, the type of contract (employment, sales, service, etc.), the date of signing, the counterparties' names and their legal addresses, the contract period, and the main contract terms (for instance, a list of goods with their prices, payment options and delivery terms for sales contracts).

When it comes to analyzing huge amounts of text data, it is simply too big a task to do manually. It is also tedious, time-consuming, and expensive. Manually extracting data from large volumes of documents is also more likely to lead to mistakes and inconsistencies.

To do this well, there should be an effective algorithm for finding the required documents and their data, and a reliable automated data extraction mechanism. Our intelligent Machine Learning solutions make it possible to retrieve all the required data from scanned or photographed copies of handwritten or machine-typed documents. High data extraction accuracy and processing speed are achieved through Deep Learning algorithms, a modern technical stack and innovative approaches.

Let's move on to possible image processing methods.

First, the orientation of the document pages in space should be determined. The ideal case, of course, is when the pages were scanned on a traditional flatbed scanner. But there are cases when the pages were photographed at various angles. Such circumstances make the work of the Computer Vision algorithms and Neural Network models considerably more difficult, since they were trained for specific conditions and run before the Optical Character Recognition step.

For instance, a rotated 'V' is perceived as a '>' (greater-than) sign in about 90% of cases. Just as people read tilted text more slowly, Machine Learning algorithms make mistakes on it too. The old methods worked properly only with simple documents: a solution based on contrasting the image and calculating the rotation angle of the detected text rectangle. That approach worked under ideal conditions and was far from real-life cases.
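A minimal sketch of that older contrast-and-rotate approach with OpenCV is shown below; the file names are placeholders, and the sign of the correction may need flipping depending on the OpenCV version:

    # Sketch: threshold the page, take the minimal rotated rectangle
    # around the ink pixels and rotate the page by its angle.
    # pip install opencv-python numpy
    import cv2
    import numpy as np

    image = cv2.imread("page.jpg", cv2.IMREAD_GRAYSCALE)   # placeholder path
    _, thresh = cv2.threshold(image, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

    coords = np.column_stack(np.where(thresh > 0)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    # OpenCV versions report the angle differently; map it to a small skew.
    if angle > 45:
        angle -= 90
    elif angle < -45:
        angle += 90

    h, w = image.shape
    matrix = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    deskewed = cv2.warpAffine(image, matrix, (w, h), flags=cv2.INTER_CUBIC,
                              borderMode=cv2.BORDER_REPLICATE)
    cv2.imwrite("page_deskewed.jpg", deskewed)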

Modern methods that work properly under different conditions are based on Deep Convolutional Neural Networks. These can be fast solutions whose core is a lightweight deep neural network such as MobileNet, or more individually tailored solutions with their own architecture.

Having obtained the precise document boundaries in the image, the main areas of the document are then selected: there is a clear distinction between where the text is and where the tables and graphics are. Technically this is also done with Convolutional Neural Networks. For Object Detection and Localization tasks these can be networks such as ResNet and EfficientDet, or networks based on the U-Net architecture.

Having obtained the text boundaries in the document, we crop the text regions out and send them to OCR, the technology for recognizing characters and symbols. The texts can be in different languages such as English, German, Japanese, etc. If the language is not set in advance, the document language is determined automatically.
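A hedged sketch of such automatic language identification on OCR output, using the langdetect library as one common option (the sample strings are illustrative):

    # pip install langdetect
    from langdetect import detect

    for ocr_text in ["This Agreement is made between the parties",
                     "Der Vertrag tritt am 1. Juni in Kraft",
                     "本契約は以下の条件で締結される"]:
        print(detect(ocr_text))   # e.g. 'en', 'de', 'ja'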

The OCR result is an unstructured set of characters, and one of the main follow-up steps is to check the spelling of the words. A modern approach to this task is to use BERT word embeddings to represent the text in a compressed tensor form. Different approaches and algorithms, such as Transformers and classifiers from the Deep Learning family, are used when a custom solution is needed. The solutions provided by the spaCy, NLTK, AllenNLP and similar libraries can be leveraged as well.
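For illustration, a minimal sketch of obtaining such embeddings with the Hugging Face transformers library; the model name is just one common multilingual choice, not a statement about any particular production stack:

    # pip install transformers torch
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
    model = AutoModel.from_pretrained("bert-base-multilingual-cased")

    sentence = "The partes agree to the following terms."   # OCR output with an error
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # One contextual vector per token; downstream correction or classification
    # models consume this compressed representation instead of raw characters.
    token_embeddings = outputs.last_hidden_state    # shape: (1, tokens, 768)
    print(token_embeddings.shape)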

At the grammar-check stage, algorithms from the Machine Learning domain known as Natural Language Processing come into play. They analyze the context of the sentences to correct grammar mistakes, misused words and spelling mistakes with high accuracy. Once this is done, we obtain blocks of corrected text.

The next step is Named Entity Recognition, which identifies and categorizes key information (entities) in text into predetermined categories such as person names, locations, phone numbers, legal and actual addresses and so on. For example, it will detect that 'Ramon J McMillan' is a person's first and last name and assign it to the {Person} category.

Usually, a trained Named Entity Recognition model based on spaCy is used to identify and categorize the entities in text. But depending on the complexity of the task, we can also create a custom solution based on Deep Neural Networks within a short period of time.
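For illustration, a minimal example with a pretrained spaCy pipeline (the sentence is invented, and en_core_web_sm has to be downloaded first):

    # pip install spacy && python -m spacy download en_core_web_sm
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("This agreement is signed by Ramon J McMillan on 12 March 2021 in Boston.")

    for ent in doc.ents:
        print(ent.text, ent.label_)
    # Typical output (labels depend on the model):
    #   Ramon J McMillan PERSON
    #   12 March 2021 DATE
    #   Boston GPE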

After such pre-processing we can transfer the text blocks to another Natural Language Processing algorithm, Topic Classification, to identify the topics of the document sections.

Unlike the algorithms for Topic Modeling, the Machine Learning algorithms used for Topic Classification are supervised. This means we need to feed them document sections already labeled by topic, and the algorithms will learn how to label new, 'unseen' sections with these topics.

For example, for a sales contract this can mean identifying sections such as Sale of Goods, Delivery and Purchase Price. There are solutions based on popular Natural Language Processing libraries, but customized solutions using Deep Neural Networks deliver better quality.
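A hedged sketch of that supervised setup with scikit-learn; the labelled sections are toy examples, and a TF-IDF plus logistic regression pipeline stands in for whatever model is actually used:

    # pip install scikit-learn
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Toy labelled sections; a real training set would contain many documents.
    sections = [
        "The Seller agrees to sell and the Buyer agrees to buy the goods listed below.",
        "Delivery shall be made within 30 days of the order confirmation.",
        "The total purchase price amounts to USD 12,000 payable within 14 days.",
    ]
    topics = ["Sale of Goods", "Delivery", "Purchase Price"]

    classifier = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    classifier.fit(sections, topics)

    new_section = "Delivery shall be made by the Seller within 30 days of shipment."
    print(classifier.predict([new_section])[0])   # likely 'Delivery'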

Having obtained topic labels for the various text sections, we can build the logical structure of the document. Taking the topic into account, we then apply another pass of Named Entity Recognition, which helps identify the entities within each categorized section more precisely. For instance, it then becomes clear that the name and surname of a Seller or a Buyer are unlikely to appear in the Purchase Price section of a sales contract.

After the Machine Learning models have been trained, we can automatically extract the required information from similar contracts, normalizing and enriching the data if required.

The digitized data obtained can then be loaded, for instance, into the client's document management system and processed according to the company's business needs, thereby reducing the labor costs of manual processing of paper documents.


Leveraging Machine Learning in financial report analysis

Financial reports are a key part of the work of various financial institutions, since the timeliness and correctness of their processing often directly influences the company's strategy. The amount of data requiring manual processing is growing rapidly. It is becoming impossible for a person to provide a deep analysis of a report in time, so automated data processing comes to the forefront.

A sample financial report from the Morningstar agency

Financial reports differ in that, apart from textual information, they contain many visual components such as graphs, charts and other visual elements. Moreover, the text is scattered across various places on the page.

In this situation, there is a significant number of Machine Learning tasks that need to be solved. Optical Character Recognition, Deep Learning and Natural Language Processing form the main toolset of the necessary solutions.

It is very important to correctly find the required elements among the different textual structures, visual elements, tables, text blocks, and paragraphs. Recognizing the boundaries between running text and tabular presentation is a delicate task. As you can see in the image above, in both cases we have typographic fonts, and the entities differ only in the form of text presentation.

Once the correct structure of the document is determined, various OCR approaches are applied which, if necessary, turn the graphical representation of the text into actual text. The result is always text, but it will contain errors introduced by the algorithm as well as possible errors present in the document itself. Thus, the intermediate results must be processed by machine learning systems designed to correct errors.

Since the reports can be issued in different languages, the Machine Learning systems will differ too. For example, the structure of words in Indo-European languages differs dramatically from that in Asian languages or Hebrew.

Having passed this difficult path, we get plain text. However, that is not the final goal. It is important to understand what is written in the report. For these purposes, Natural Language Processing systems step onto the stage.

For instance, using NLP we tackle the challenge of the document's Topic Classification. Even financial reports differ: there are annual, monthly, weekly or biweekly reports with slightly different focuses, such as a profit and loss statement, a stock statement, and so on.

One of the most important NLP tasks is to define the entities in the documents, i.e. to understand which company is being written about, over what period of time, whether any important event happened, and so on. Based on the extracted entities it is possible to build a knowledge base about the report, so the ML system can start to form its ontology.

In closing, it is worth noting that many issues in financial report processing can be solved using Machine Learning; only the most obvious solutions have been described here.


Recognition complexity of Asian and Arabic languages

Google demonstrated text recognition for a document lying on a table at its annual Google I/O developer conference in 2018.

At first glance the task seemed quite simple, and it was strange that we had been waiting for it for so long, especially considering that Deep Convolutional Neural Networks were already in full swing at the big tech giants back in 2015.

Nowadays, Convolutional Neural Networks are used in the Computer Vision systems of self-driving cars, commercial drones, robots, closed-circuit television surveillance and other systems. You also use Convolutional Neural Networks every day if you have turned on the face recognition feature on your smartphone.

In the early 1990s, Yann LeCun, a founding father of convolutional networks, began developing his solution for recognizing certain characters of American English based on Convolutional Neural Networks. Nearly 30 years have passed since then, but many problems have still not been solved. One of the difficult tasks is the recognition of hieroglyphs and symbols of Asian and Arabic languages. Here we describe some of the problems faced by Data Science specialists who create and train Convolutional Neural Network models.

Each of the Asian or Arabic languages has a bunch of its own features, ranging from complex forms of hieroglyphs (their similarity, their structure, etc.) to the size of the dictionary. It is one thing to tackle symbol classification with a neural network when there are only 50-100 symbols, and quite another when there are 10,000.

And therefore, the 1st problem is Performance.

It is important to understand where our solution will be used. If it is our own server, we can install a Graphics Processing Unit (GPU) on it and use Deep Neural Networks of different types and complexity. If we need to make a desktop application for a client, there is no guarantee that the client will have a powerful GPU or that it will be configured properly. As a consequence, a neural network prediction that would take one minute can turn into hours of waiting.

The same problem occurs when we create a Computer Vision solution for mobile devices. For example, it is extremely difficult for a person in a new country not only to understand but also to type the symbols they see around them (street names, city signage, etc.). Ordinary translators are therefore not suitable for such a task. In this case we need modern systems that immediately translate inscriptions from a picture or video into the person's native language.

Even modern mobile devices do not have sufficiently powerful GPUs, so we have to choose from a limited set of neural networks that are cheap enough to run for prediction.

The 2nd problem is Complexity of symbols and their number.

It is rather difficult to say unequivocally how many characters the Chinese language has, but about 10,000 characters are in common use.

There are certain difficulties with hieroglyph recognition, such as:

  • a huge number of hieroglyphs to be classified,
  • a more complex character structure.
This also means that in order to achieve high quality it is necessary to use a large number of features (measurable properties of the object being analyzed), and these features take longer to compute on the symbol images.
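For illustration, classic handcrafted features such as HOG descriptors show why this adds up; the glyph below is a random stand-in for a normalized symbol image:

    # pip install scikit-image numpy
    import numpy as np
    from skimage.feature import hog

    glyph = np.random.rand(64, 64)          # stand-in for a 64x64 glyph image
    features = hog(glyph, orientations=9, pixels_per_cell=(8, 8),
                   cells_per_block=(2, 2))
    print(features.shape)                   # over a thousand values per glyph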

To prevent these problems from leading to severe slowdowns in the entire recognition system, many heuristics from the mathematical optimization field have to be used. They are primarily focused on quickly ruling out the large number of hieroglyphs that the current character clearly does not resemble. Unfortunately, this does not eliminate all the difficulties, while at the same time we need to reach a qualitatively new level.

To achieve the required quality, it is advisable to consider deeper and more complex neural network architectures such as WideResNet, SqueezeNet, etc. With these solutions it is possible to reach the required quality level, but they bring a serious loss of speed: roughly 3-5 times slower than the basic algorithm on a CPU. The neural network will obviously run much faster on a GPU than on a CPU, but since in the majority of scenarios recognition is performed on the client side, it is reasonable to assume that there is no GPU there.

Considering the requirements for network performance, it makes sense not to train one network on a large number of classes, that is, on the entire alphabet, but instead to train many neural networks on a small number of classes (subsets of the entire alphabet).

In general terms, the ideal system looks as follows: the alphabet is divided into groups of similar symbols. A first-level neural network classifies which group of symbols a given image belongs to. For each group, in turn, a second-level neural network is trained, which performs the final classification within that group.
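A minimal PyTorch sketch of this two-level scheme; the class counts, layer sizes and the tiny CNN are illustrative assumptions, not a description of a real production model:

    # pip install torch
    import torch
    import torch.nn as nn

    NUM_GROUPS = 100          # groups of visually similar symbols
    SYMBOLS_PER_GROUP = 100   # ~10,000 symbols in total

    def small_cnn(num_classes):
        """A deliberately tiny CNN for 64x64 grayscale glyph images."""
        return nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, num_classes),
        )

    router = small_cnn(NUM_GROUPS)                    # first-level network
    experts = nn.ModuleList(                          # second-level networks
        small_cnn(SYMBOLS_PER_GROUP) for _ in range(NUM_GROUPS))

    def classify(glyph):
        """Return (group index, symbol index within the group) for one glyph."""
        with torch.no_grad():
            group = router(glyph).argmax(dim=1).item()
            symbol = experts[group](glyph).argmax(dim=1).item()
        return group, symbol

    glyph = torch.randn(1, 1, 64, 64)   # stand-in for a preprocessed glyph image
    print(classify(glyph))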

The 3rd problem is Unclear character boundaries.

Let's take a look at the Burmese language. Burmese has several dialects, but there is a certain set of letters used in official communications, in the press, etc. It makes up about 30% of all letters used across the regions and includes 33 consonants and 12 additional characters.

Let's review a possible approach to recognizing such languages. First, we get a picture with text and process it (correct distortions, convert to black and white, etc.), then define the blocks on the page (titles, text, footnotes, pictures, tables, etc.). After that we divide the text blocks into separate lines, lines into words, words into letters, and finally recognize the letters. Then we assemble everything back into the text of the page. Since there is nothing special about dividing Burmese texts into blocks, we will describe the division into text lines, where certain difficulties may arise.

The algorithms use additional line characteristics, one of which is the baseline on which the main characters sit. The baseline must be identified in order to correctly formulate hypotheses about particular symbols and, accordingly, recognize them properly.

In Burmese, a large number of characters extend beyond the boundaries of the main part of the text line, which adds significant extra peaks to the histogram. Therefore, algorithms configured to recognize, for example, European languages do not determine the main parameters of the line quite correctly.

Because there are many semicircular characters in Burmese, the histogram contains a lot of "extra" peaks and troughs, which makes segmentation difficult, but that issue has been solved as well.
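For illustration, a sketch of the row-projection histogram that underlies this kind of line segmentation; the input path is a placeholder for a binarized text block:

    # pip install opencv-python numpy
    import cv2
    import numpy as np

    page = cv2.imread("burmese_block.png", cv2.IMREAD_GRAYSCALE)
    _, binary = cv2.threshold(page, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

    # Amount of "ink" per image row; peaks correspond to text lines,
    # troughs to the gaps between them.
    profile = (binary > 0).sum(axis=1)
    threshold = 0.05 * profile.max()

    lines, start = [], None
    for y, value in enumerate(profile):
        if value > threshold and start is None:
            start = y                       # entering a text line
        elif value <= threshold and start is not None:
            lines.append((start, y))        # leaving a text line
            start = None

    for top, bottom in lines:
        print(f"text line from row {top} to row {bottom}")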

The 4th problem is the Lack of native speakers and appropriate ready-made solutions.

Only a limited number of people know, for instance, Burmese or any other such language, and it is sometimes unrealistic or very expensive to have a native speaker on the team. Therefore, developers themselves often have to get to know the language from scratch, at least to some extent.

Conclusion.

At the moment, ready-made Optical Character Recognition solutions for recognizing texts in different languages are offered by ABBYY, but they are quite expensive and do not cover all languages. A framework suitable for very rough solutions to similar tasks is available only in Google's Tesseract OCR.

Therefore, the most beneficial approach to recognizing Asian or Arabic languages is the development of custom Machine Learning solutions based on Deep Convolutional Neural Networks.
