At its annual Google I/O developer conference in 2018, Google demonstrated text recognition for a document lying on a table.
At first glance the task seems quite simple, and it was strange that we had been waiting for it for so long, especially given that Deep Convolutional Neural Networks were already in full swing at the big tech companies in 2015.
Nowadays, Convolutional Neural Networks are used in the Computer Vision systems of self-driving cars, commercial drones, robots, closed-circuit television surveillance and other systems. You also use Convolutional Neural Networks every day if you have turned on the face recognition feature on your smartphone.
In the early 1990s, Yann LeCun, a founding father of convolutional networks, began developing his solution, based on Convolutional Neural Networks, for recognizing certain characters of American English. Nearly 30 years have passed since then, but many problems have not yet been solved. One of the difficult tasks is the recognition of Asian logographic characters (often loosely called hieroglyphs) and of Arabic script. Here we describe some of the problems faced by Data Science specialists who create and train Convolutional Neural Network models.
Each Asian or Arabic language has its own peculiarities, ranging from the complex forms of its characters (their similarity, their structure, etc.) to the sheer size of the character set. It is one thing to classify only 50-100 symbols with a neural network, and quite another to classify 10,000.
The 1st problem, therefore, is Performance.
It is important to understand where our solution will be used. If it runs on our own server, we can install a Graphics Processing Unit (GPU) there and use Deep Neural Networks of different types and complexity. If we need to build a desktop application for a client, there is no guarantee that the client has a powerful GPU or that it is configured properly. As a consequence, a neural network prediction that should take about a minute can turn into hours of waiting.
The same problem occurs when we create a Computer Vision solution for mobile devices. For example, a person in a new country finds it extremely difficult not only to understand but even to type the symbols they see around them (street names, city signs, etc.), so ordinary translators are not suitable for such a task. What is needed here is a modern system that immediately translates inscriptions from a picture or video into the person's native language.
Even modern mobile devices do not have sufficiently powerful GPUs, so we have to choose from a limited set of lightweight neural networks whose predictions are cheap to compute.
The 2nd problem is the Complexity of symbols and their number.
It is difficult to say unequivocally how many characters the Chinese language has, but about 10,000 are in common use.
Recognizing such characters poses certain difficulties, such as:
- a huge number of hieroglyphs to be classified,
- more complex character structure.
To prevent these problems from severely slowing down the entire recognition system, many heuristics from the field of mathematical optimization have to be used. They focus primarily on quickly discarding the large number of characters that the current character clearly does not resemble. Unfortunately, this still does not eliminate all the difficulties, and at the same time we need to reach a qualitatively new level of recognition.
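One way to picture such a cut-off heuristic is sketched below. It is only an illustration under stated assumptions, not the method used in any particular product: the two features (ink density and aspect ratio), the `prune_candidates` helper, the tolerance and the prototype values are all hypothetical. The idea is that a few very cheap measurements rule out most classes before any expensive classifier runs.

```python
from typing import Dict, List, Tuple

Image = List[List[int]]          # binarized glyph: 1 = ink, 0 = background
Features = Tuple[float, float]   # (ink density, width / height)


def cheap_features(img: Image) -> Features:
    """Compute two inexpensive descriptors of a glyph image."""
    height, width = len(img), len(img[0])
    ink = sum(sum(row) for row in img)
    return ink / (height * width), width / height


def prune_candidates(img: Image,
                     prototypes: Dict[str, Features],
                     tolerance: float = 0.15) -> List[str]:
    """Keep only the classes whose stored features are close to the image's."""
    density, aspect = cheap_features(img)
    return [
        symbol
        for symbol, (proto_density, proto_aspect) in prototypes.items()
        if abs(proto_density - density) <= tolerance
        and abs(proto_aspect - aspect) <= tolerance
    ]


# Hypothetical per-class prototypes; real values would be measured on training data.
prototypes = {"一": (0.2, 1.0), "口": (0.6, 1.0), "日": (0.7, 0.6)}

# A 5x5 glyph containing a single horizontal stroke.
glyph = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]

candidates = prune_candidates(glyph, prototypes)
```

Only the classes left in `candidates` would then be passed to the slower, more accurate classifier.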
To achieve the required quality, it is advisable to consider deeper and more complex neural network architectures such as WideResNet, SqueezeNet, etc. These solutions can reach the required quality level, but at a serious cost in speed: they are approximately 3-5 times slower than the basic algorithm on a CPU. A neural network obviously runs much faster on a GPU than on a CPU, but since in the majority of scenarios recognition is performed on the client side, it is reasonable to presume that no GPU is available there.
Considering the requirements for network performance, it makes sense to train not a single network on a large number of classes (the entire alphabet), but many neural networks, each on a small number of classes (a subset of the alphabet).
In general terms, the ideal system can be described as follows: the alphabet is divided into groups of similar symbols. A first-level neural network determines which group of symbols a given image belongs to. For each group, in turn, a second-level neural network is trained, which performs the final classification within that group.
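The two-level scheme can be sketched as follows. This is a minimal illustration, assuming that every model is simply a callable returning one score per class; the `TwoLevelClassifier` name, the toy "models" and the example groups are hypothetical stand-ins for trained networks.

```python
from typing import Callable, Dict, List, Sequence

# A "model" is any callable mapping an image to a list of scores, one per class.
# In a real system these would be trained Convolutional Neural Networks.
Model = Callable[[Sequence[float]], List[float]]


def argmax(scores: Sequence[float]) -> int:
    """Index of the highest score."""
    return max(range(len(scores)), key=lambda i: scores[i])


class TwoLevelClassifier:
    def __init__(self,
                 group_model: Model,
                 symbol_models: Dict[int, Model],
                 group_symbols: Dict[int, List[str]]) -> None:
        self.group_model = group_model        # first level: which group?
        self.symbol_models = symbol_models    # second level: one model per group
        self.group_symbols = group_symbols    # symbols covered by each group

    def predict(self, image: Sequence[float]) -> str:
        group = argmax(self.group_model(image))
        scores = self.symbol_models[group](image)
        return self.group_symbols[group][argmax(scores)]


# Toy stand-ins: the "image" is just two numbers that the fake models read.
group_symbols = {0: ["土", "士"], 1: ["未", "末"]}   # groups of similar symbols
group_model = lambda img: [1 - img[0], img[0]]
symbol_models = {
    0: lambda img: [1 - img[1], img[1]],
    1: lambda img: [1 - img[1], img[1]],
}

clf = TwoLevelClassifier(group_model, symbol_models, group_symbols)
first = clf.predict([0.9, 0.2])
second = clf.predict([0.1, 0.7])
```

Each prediction costs two small forward passes instead of one pass through a network with thousands of output classes. The trade-off is that a mistake at the first level cannot be corrected at the second.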
The 3rd problem is Unclear character boundaries.
Let’s take a look at the Burmese language. Burmese has several dialects, but there is a certain set of letters that is used in official communications, in the press, etc. It makes up about 30% of all letters used in the regions and includes 33 consonants and 12 additional characters.
Let’s review a possible approach to recognizing such languages. First, we get a picture with text, process it (correct distortions, convert it to black and white, etc.) and define the blocks on the page (titles, text, footnotes, pictures, tables, etc.). After that we divide the text blocks into separate lines, lines into words and words into letters, and finally recognize the letters. Then we assemble everything back into the text of the page. Since there is nothing special about dividing Burmese texts into blocks, we will describe the division into text lines, where certain difficulties may arise.
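The line-splitting step can be illustrated with a deliberately simple sketch. It assumes an already binarized, deskewed text block and treats every run of non-empty pixel rows as one text line; real systems use more robust criteria, and the `split_into_lines` helper is a hypothetical name introduced only for this example.

```python
from typing import List, Tuple

Image = List[List[int]]  # binarized text block: 1 = ink, 0 = background


def row_profile(img: Image) -> List[int]:
    """Number of ink pixels in every pixel row."""
    return [sum(row) for row in img]


def split_into_lines(img: Image) -> List[Tuple[int, int]]:
    """Return (top, bottom) row ranges of text lines, bottom exclusive."""
    lines = []
    start = None
    for y, ink in enumerate(row_profile(img)):
        if ink > 0 and start is None:
            start = y                    # a text line begins
        elif ink == 0 and start is not None:
            lines.append((start, y))     # an empty row ends the line
            start = None
    if start is not None:                # the block ends inside a line
        lines.append((start, len(img)))
    return lines


block = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]

text_lines = split_into_lines(block)
```

For scripts whose characters reach far above and below the main body of the line, neighbouring lines may touch or overlap, and a rule this simple would merge them.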
The algorithms rely on additional characteristics of a text line, one of which is the baseline on which the main characters sit. The baseline must be identified in order to formulate correct hypotheses about individual symbols and, accordingly, to recognize them properly.
In Burmese, a large number of characters extend beyond the boundaries of the main part of the text line, which adds significant extra peaks to the line's histogram. As a result, algorithms tuned to recognize, for example, European languages do not determine the main parameters of the line quite correctly.
Because Burmese has many semicircular characters, there are many “extra” peaks and troughs, which makes classification difficult, but that issue has been solved as well.
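The effect of such extra peaks can be shown with a small sketch. It assumes that the histogram is the number of ink pixels in each pixel row of a line image, and that the baseline is taken as the row after which this count drops most sharply. Both are simplifying assumptions made for illustration, and `estimate_baseline` is a hypothetical helper, not the algorithm of any specific OCR engine.

```python
from typing import List

Image = List[List[int]]  # binarized text line: 1 = ink, 0 = background


def estimate_baseline(line_img: Image) -> int:
    """Return the row after which the ink count drops most sharply."""
    profile = [sum(row) for row in line_img]
    drops = [profile[y] - profile[y + 1] for y in range(len(profile) - 1)]
    return max(range(len(drops)), key=lambda y: drops[y])


# A line where only one character goes below the main body (rows 1-3).
latin_like = [
    [0, 1, 0, 0, 0, 0],   # a single ascender
    [1, 1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1, 1],
    [1, 1, 1, 1, 1, 0],   # bottom of the main body
    [0, 0, 0, 1, 0, 0],   # a single descender
    [0, 0, 0, 1, 0, 0],
]

# A line where most characters continue below the main body.
descender_heavy = [
    [0, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1, 1],
    [1, 1, 1, 1, 1, 0],   # bottom of the main body
    [1, 1, 0, 1, 1, 0],   # many strokes below it
    [1, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
]

baseline_a = estimate_baseline(latin_like)
baseline_b = estimate_baseline(descender_heavy)
```

In the first image the estimate lands on row 3, the bottom of the main body. In the second, the ink below the body moves the sharpest drop to row 5, so the same rule places the baseline too low.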
The 4th problem is the Lack of native speakers and of appropriate ready-made solutions.
Only a limited number of people know, for instance, Burmese or any other given language, and it is sometimes unrealistic or very expensive to have a native speaker on the team. Developers therefore often have to learn the language, to some extent, from scratch themselves.
At the moment, ready-made Optical Character Recognition solutions for recognizing texts in different languages are offered by ABBYY, but they are quite expensive and do not cover all languages. The only framework available for very rough solutions to similar tasks is Google Tesseract OCR.
Therefore, the most beneficial solution for recognizing Asian or Arabic languages is the development of custom Machine Learning solutions based on Deep Convolutional Neural Networks.