Nowadays, most companies conduct their business digitally: they sign contracts in digital rather than paper form, handle business correspondence electronically, and so on. However, companies and organizations still have to keep paper copies of documents for legal and financial reasons.

The most direct example is employee records and workers’ contributions to pension funds over the organization’s entire history. Keeping both paper and digital copies is justified because, when an employee leaves the company, the records of their contributions to the funds remain safely stored, which additionally protects the data against loss.

Therefore, there is already an urgent need to search, extract, and analyze information within huge databases and repositories of scanned documents.

Another example is the analysis of data from various contracts and agreements (hereinafter, contracts). The important pieces of information are the contract number, the contract type (employment, sales, service, etc.), the signing date, the counterparties’ names and legal addresses, the contract period, and the main contract terms (for instance, a list of goods with their prices, payment options, and delivery terms for sales contracts).
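
The fields listed above can be pictured as a target record that the extraction pipeline fills in. The following is only a sketch: the class and field names (ContractRecord, Party, LineItem) are assumptions made for illustration and do not describe an actual schema.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class Party:
    """One counterparty to the contract (hypothetical structure)."""
    name: str
    legal_address: str


@dataclass
class LineItem:
    """One entry from the list of goods in a sales contract."""
    description: str
    unit_price: float
    quantity: int = 1


@dataclass
class ContractRecord:
    """Hypothetical target record for the extracted contract fields."""
    contract_number: str
    contract_type: str                 # e.g. "employment", "sales", "service"
    signing_date: date
    parties: List[Party]
    valid_until: Optional[date] = None  # end of the contract period, if stated
    goods: List[LineItem] = field(default_factory=list)
    payment_terms: str = ""
    delivery_terms: str = ""

    def total_price(self) -> float:
        """Sum of unit price times quantity over all listed goods."""
        return sum(item.unit_price * item.quantity for item in self.goods)


# Made-up example values, for illustration only.
record = ContractRecord(
    contract_number="S-001",
    contract_type="sales",
    signing_date=date(2021, 3, 15),
    parties=[Party("Seller LLC", "1 Main St"), Party("Buyer GmbH", "2 Side St")],
    goods=[LineItem("Widget", 2.5, 4), LineItem("Gadget", 10.0)],
)
print(record.total_price())
```

Fields that are not present in every contract type, such as the list of goods, get empty defaults so the same record can describe an employment or service contract.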

When it comes to analyzing huge amounts of text data, the task is simply too big to do manually. It is also tedious, time-consuming, and expensive, and sifting through large amounts of data by hand is more likely to lead to mistakes and inconsistencies.

To do this well, two things are needed: an effective algorithm for finding the required documents and their data, and a reliable automated data extraction mechanism. Our intelligent Machine Learning solutions retrieve the required data from scanned and photographed copies of handwritten or machine-typed documents. High extraction accuracy and processing speed are achieved through Deep Learning algorithms, a modern technology stack, and innovative approaches.

Let’s go through the possible image processing methods.

First, the orientation of the document pages in space should be determined. The ideal case is when the pages were scanned on a traditional flatbed scanner. But there may be cases when the pages were photographed at different angles. Such conditions make the work considerably harder for the Computer Vision algorithms and Neural Network models that were trained for specific conditions and that run before the Optical Character Recognition step.

For instance, a letter ‘V’ rotated by 90 degrees is perceived as a ‘>’ (greater-than) sign. Just as people read flipped text slowly, Machine Learning algorithms make mistakes on it too. Older methods worked properly only with simple documents. They increased the image contrast and calculated the rotation angle of the selected text rectangle. That approach worked in ideal conditions but was too far from real-life cases.
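
To make the older, geometry-based idea concrete, here is a minimal sketch of estimating a rotation angle from the dark pixels of a high-contrast image. It swaps in a simpler technique than fitting a text rectangle: it takes the principal axis of the pixel coordinates. The function name estimate_skew_degrees and the synthetic input are assumptions for illustration.

```python
import math
from typing import Iterable, Tuple


def estimate_skew_degrees(points: Iterable[Tuple[float, float]]) -> float:
    """Angle (in degrees) of the dominant axis of a set of (x, y) points.

    The points stand for the coordinates of dark pixels in a thresholded
    text line. Note that in image coordinates the y axis usually points
    down, so the sign of the angle is flipped relative to this sketch.
    """
    pts = list(points)
    n = len(pts)
    if n < 2:
        raise ValueError("need at least two points")
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in pts) / n
    var_y = sum((y - mean_y) ** 2 for _, y in pts) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in pts) / n
    # Orientation of the principal axis of the 2x2 covariance matrix.
    return math.degrees(0.5 * math.atan2(2 * cov_xy, var_x - var_y))


# Synthetic "text line": points along a baseline tilted by 10 degrees.
tilt = math.radians(10)
line = [(t * math.cos(tilt), t * math.sin(tilt)) for t in range(100)]
angle = estimate_skew_degrees(line)
print(round(angle, 3))
```

An estimate like this holds up for a clean, single block of text, which matches the limitation described above: shadows, perspective distortion, or mixed content in a photograph break the assumption that the dark pixels line up along one axis.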

Modern methods that work properly in different conditions are based on Deep Convolutional Neural Networks. These can be either fast solutions built around a lightweight deep neural network such as MobileNet, or more specialized, streamlined solutions with their own architecture.

Once the precise document boundaries in the image are obtained, the main areas of the document are selected, with a clear distinction between text on the one hand and tables and graphics on the other. Technically, this is also implemented with Convolutional Neural Networks. For tasks such as Object Detection and Localization, these can be neural networks like ResNet and EfficientDet, or networks based on the U-Net architecture.

Once the text boundaries in the document are obtained, we crop the text regions and send them to OCR, the technology for recognizing characters and symbols. The texts may be in different languages, such as English, German, or Japanese. If the language is not specified in advance, it is determined automatically.

The OCR result is an unstructured set of characters, so one of the main requirements is to check the spelling of the words. A modern approach to this task is to use BERT word embeddings to represent the text in a compact tensor form. When a custom solution is needed, different approaches and algorithms from the Deep Learning family, such as Transformers and classifiers, are used. Solutions provided by spaCy, NLTK, AllenNLP, and similar libraries can be leveraged as well.

The grammar check is where algorithms from the Machine Learning domain called Natural Language Processing come in. They analyze the context of the sentences to correct grammar mistakes, misused words, and spelling mistakes with high accuracy. Once this is done, we obtain blocks of corrected text.
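
As a rough illustration of the correction step, the sketch below fixes OCR misspellings by matching each word against a known vocabulary. It is a deliberately simple stand-in that ignores context, unlike the embedding-based approaches mentioned above. The vocabulary, the cutoff value, and the function names are assumptions for illustration.

```python
import difflib
from typing import Sequence

# Tiny hypothetical domain vocabulary; a real one would be far larger.
VOCABULARY = [
    "the", "of", "shall", "pay", "contract", "seller", "buyer",
    "delivery", "payment", "price", "goods",
]


def correct_word(word: str, vocabulary: Sequence[str] = VOCABULARY,
                 cutoff: float = 0.75) -> str:
    """Return the closest vocabulary word, or the word itself if none is close."""
    lower = word.lower()
    # Leave known words and tokens containing digits or punctuation untouched.
    if lower in vocabulary or not lower.isalpha():
        return word
    matches = difflib.get_close_matches(lower, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_text(text: str) -> str:
    """Correct a whitespace-separated OCR string word by word."""
    return " ".join(correct_word(token) for token in text.split())


fixed = correct_text("the buyer shall pay the prlce")
print(fixed)
```

Because each word is corrected in isolation, this sketch cannot tell apart two valid words that are both close to the OCR output. That is the gap a context-aware model is meant to close.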

The next step is Named Entity Recognition, which identifies key information (entities) in the text and sorts it into predetermined categories such as person names, phone numbers, legal and actual addresses, and so on. For example, it will detect that ‘Ramon J McMillan’ is the first and last name of a person and assign it to the {Person} category.

Usually, a trained Named Entity Recognition model based on spaCy is used to identify and categorize the entities in text. But depending on the complexity of the task, we can also create a custom solution based on Deep Neural Networks within a short period of time.
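
The input and output of this step can be shown with a toy, rule-based tagger. This is not a trained model and not the spaCy API. It only illustrates what entity recognition produces: labeled spans with character offsets. The patterns and label names are assumptions for illustration and would miss most real-world names.

```python
import re
from typing import List, NamedTuple


class Entity(NamedTuple):
    text: str
    label: str
    start: int  # character offset where the entity begins
    end: int    # character offset just past the entity


# Hand-written toy patterns standing in for a trained model.
PATTERNS = {
    "PERSON": re.compile(r"\b[A-Z][a-z]+(?: [A-Z]\.?)? (?:Mc|Mac)?[A-Z][a-z]+\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def tag_entities(text: str) -> List[Entity]:
    """Return all pattern matches as labeled spans, ordered by position."""
    found = [
        Entity(match.group(), label, match.start(), match.end())
        for label, pattern in PATTERNS.items()
        for match in pattern.finditer(text)
    ]
    return sorted(found, key=lambda entity: entity.start)


entities = tag_entities("Ramon J McMillan can be reached at 555-123-4567.")
for entity in entities:
    print(entity.label, entity.text)
```

A trained model replaces the hand-written patterns with learned ones, but downstream steps can consume the same kind of labeled spans.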

After such pre-processing, we can pass the text blocks to another Natural Language Processing algorithm, Topic Classification, to identify the topics of the document sections.

Unlike the algorithms for Topic Modeling, the Machine Learning algorithms used for Topic Classification are supervised. This means we need to feed them document sections already labeled by topic, and the algorithms will learn how to label new, ‘unseen’ sections with these topics.
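
The supervised setup described above can be sketched with a small multinomial Naive Bayes classifier. This is a classic baseline swapped in for illustration, not the deep-learning models discussed in this article. The training sentences and the class name NaiveBayesTopicClassifier are made up for the example.

```python
import math
import re
from collections import Counter, defaultdict
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesTopicClassifier:
    """Multinomial Naive Bayes with add-one smoothing."""

    def fit(self, sections: List[str], labels: List[str]):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(sections, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text: str) -> str:
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocabulary)
            score = math.log(n / total)  # log prior of the topic
            for word in tokenize(text):
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Labeled sections (made up): this is the supervision the text refers to.
train_sections = [
    "The seller agrees to sell and the buyer agrees to buy the goods listed below",
    "Goods sold under this agreement are described in the attached list",
    "The seller shall deliver the goods to the warehouse within ten days",
    "Shipment and delivery costs are borne by the seller",
    "The buyer shall pay the purchase price of the goods by bank transfer",
    "Payment of the total price is due within thirty days of the invoice",
]
train_labels = [
    "Sales of Goods", "Sales of Goods",
    "Delivery", "Delivery",
    "Purchase Price", "Purchase Price",
]

model = NaiveBayesTopicClassifier().fit(train_sections, train_labels)
prediction = model.predict("Delivery of the shipment takes place at the warehouse")
print(prediction)
```

The model never sees the final sentence during training; it labels it from word statistics learned on the labeled sections.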

For example, for a sales contract, this can mean identifying sections such as Sales of Goods, Delivery, and Purchase Price. There are solutions based on popular Natural Language Processing libraries, but custom solutions using Deep Neural Networks tend to deliver better quality.

Having received topic labels for the text sections, we can build a logical structure of the document. Then, taking the topic into account, we apply another, topic-specific Named Entity Recognition pass, which helps identify the entities in each categorized section more accurately. For instance, it is unlikely that the first and last name of a Seller or a Buyer would appear in the Purchase Price section of a sales contract.
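
One simple way to picture how the topic informs entity recognition is a lookup of which entity types are plausible in each section. The mapping, the label names, and the function filter_by_topic below are assumptions for illustration. A real system would more likely use a separate model per topic than a hard filter.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical mapping from section topic to the entity types expected there.
EXPECTED_ENTITIES: Dict[str, Set[str]] = {
    "Sales of Goods": {"PERSON", "ORG", "PRODUCT"},
    "Delivery": {"ADDRESS", "DATE"},
    "Purchase Price": {"MONEY", "DATE"},
}


def filter_by_topic(topic: str,
                    entities: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep only (text, label) pairs whose label is plausible for the topic."""
    allowed = EXPECTED_ENTITIES.get(topic)
    if allowed is None:
        # Unknown topic: nothing to check against, so keep everything.
        return list(entities)
    return [(text, label) for text, label in entities if label in allowed]


candidates = [("Ramon J McMillan", "PERSON"), ("USD 1,250.50", "MONEY")]
kept = filter_by_topic("Purchase Price", candidates)
print(kept)
```

Here a person name detected inside a Purchase Price section is dropped as an unlikely entity, while the monetary amount is kept.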

Once the Machine Learning models are trained, we can automatically extract the required information from similar contracts, normalizing and enriching the data if required.
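
Normalization here can be as simple as bringing extracted dates and amounts into one canonical form. The sketch below assumes a small, fixed set of date formats and amounts written with a comma as the thousands separator and a period as the decimal mark. Both are assumptions, as are the function names.

```python
import re
from datetime import datetime
from decimal import Decimal
from typing import Optional

# Date layouts this sketch knows about; real documents will need more.
DATE_FORMATS = ("%d.%m.%Y", "%B %d, %Y", "%Y-%m-%d")


def normalize_date(raw: str) -> Optional[str]:
    """Return the date in ISO 8601 form (YYYY-MM-DD), or None if unrecognized."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def normalize_amount(raw: str) -> Optional[Decimal]:
    """Extract the first number from the string as an exact Decimal."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", raw)
    if match is None:
        return None
    return Decimal(match.group().replace(",", ""))


print(normalize_date("15.03.2021"))
print(normalize_amount("USD 1,250.50"))
```

Returning None for unrecognized input, instead of guessing, lets such values be routed to manual review.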

The resulting digitized data can be entered, for instance, into the client’s document management system and processed according to the company’s business needs, thereby reducing the labor costs of manually processing paper documents.

Innovative Future, Inc.
© 2021. All Rights Reserved.