‘Don’t Panic’ by Peter Tunney

Using Image Processing to Detect Text

A demonstration of how easy it is to find words and letters in an image

Tim Chinenov
6 min read · Jan 23, 2019


Detecting text in images is a prototypical modern puzzle that incorporates image processing, computer vision, and machine learning. Many existing applications, such as Google Lens and CamScanner, do a splendid job of it. Both of these applications take the next step and implement an optical character recognition (OCR) algorithm to turn the image into actual text.

As part of a bigger project, I wanted to implement such an OCR system. While I’m sure there are plenty of libraries that can already accomplish this feat, I needed a certain level of control and customization for my purposes. I also wanted to take this opportunity to transition to OpenCV for image processing and computer vision tasks.

Up until this point, I have used Matlab’s built-in Image Processing Toolbox. The toolbox is splendid and makes image processing projects incredibly easy. Yet, with OpenCV being a common choice in industry (and my student Matlab license expiring soon), I thought it would be beneficial to explore this Python library.

Conceptually, finding text is fairly simple. Ideally, a threshold filter can be applied that separates the letters from a contrasting background. Once that is achieved, each letter becomes a blob that can be isolated by finding pixel regions. For my application, I was interested in different types of lettering (bold, italics, underline, etc.), not just the letters themselves. To test my algorithm, I used the well-known example below.

An Excerpt from the United States Constitution
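Before anything else, the page needs to be loaded as a grayscale image. The snippet below is a minimal sketch of that setup step; the filename ‘constitution.png’ is just a placeholder for whatever scan is being processed.

```python
import cv2

# Placeholder filename; swap in whatever scan or photo is being processed.
gray = cv2.imread("constitution.png", cv2.IMREAD_GRAYSCALE)
assert gray is not None, "image failed to load"
```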

As with most image processing tasks, the first step in this procedure is to apply a threshold. At first I played around with Otsu’s threshold method, a global threshold technique. I initially thought this would be simpler to code, but later realized that OpenCV has fairly convenient functions that let developers easily swap threshold algorithms.

Otsu’s method (top), Adaptive method (bottom)
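For reference, here is a rough sketch of how both thresholds can be applied in OpenCV. The block size and offset passed to the adaptive threshold are arbitrary starting values rather than tuned parameters, and THRESH_BINARY_INV is used so the letters come out white on a black background.

```python
import cv2

gray = cv2.imread("constitution.png", cv2.IMREAD_GRAYSCALE)

# Otsu's method: a single global threshold picked from the image histogram.
_, otsu = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Adaptive method: a local threshold computed over each 15x15 neighborhood,
# which holds up better under uneven lighting. The block size (15) and
# offset (8) are guesses to experiment with.
adaptive = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                 cv2.THRESH_BINARY_INV, 15, 8)
```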

The two methods produced fairly different results. Arguably, both can be used depending on the desired information. If the whole word is desired, then the blended result of Otsu’s method is more useful. If individual letters are needed, an adaptive threshold is preferred. Even if a whole word is desired, the word can be determined from its individual letters, as will be shown in a future article. An adaptive filter will also be more convenient if an image is not clean or has poor lighting.

With the image threshold applied, one would think that the letters can now be picked out. Taking this direct approach creates a very obvious glitch: cavities that occur within letters will be considered letters themselves. This bug occurs due to the way pixel regions are determined in OpenCV. As opposed to Matlab, which uses the ‘regionprops’ command to find pixel regions, OpenCV detects contours around regions of a certain pixel value. The function used is ‘findContours()’, and its output can be tweaked by the parameters it is given.

Cavities in the letters ‘g’, ‘o’, ‘e’, and ‘a’ are selected as separate objects

In order to avoid this mistake, we need to determine which contours are cavities. Contours need to be pulled along with their hierarchies, so that we can distinguish external contours from internal ones. This is done by setting the second argument of ‘findContours()’ to ‘RETR_CCOMP’. The function will then return an array of entries that describe the relationship of each contour to the other contours. One of the indices provides the index of a contour’s parent, or -1 if there is no parent.

With this data, we simply need to ignore any contours that have a parent. Looping through the contours, record the leftmost, rightmost, top, and bottom pixel locations of each one that passes this check. These coordinates will be used to build the bounding box for each letter.
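Putting those pieces together, a sketch of the contour-and-hierarchy loop might look like the following. It assumes OpenCV 4.x, where ‘findContours()’ returns two values (3.x returns three), and it uses ‘boundingRect()’ as a shortcut for the left/right/top/bottom bookkeeping described above.

```python
import cv2

gray = cv2.imread("constitution.png", cv2.IMREAD_GRAYSCALE)
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# RETR_CCOMP builds a two-level hierarchy: external contours and the holes
# inside them. hierarchy[0][i] holds [next, previous, first_child, parent].
contours, hierarchy = cv2.findContours(binary, cv2.RETR_CCOMP,
                                       cv2.CHAIN_APPROX_SIMPLE)

boxes = []
for contour, info in zip(contours, hierarchy[0]):
    if info[3] != -1:        # has a parent, so it is a cavity; skip it
        continue
    x, y, w, h = cv2.boundingRect(contour)
    boxes.append((x, y, w, h))

# Draw the boxes on a color copy of the page for inspection.
output = cv2.cvtColor(gray, cv2.COLOR_GRAY2BGR)
for x, y, w, h in boxes:
    cv2.rectangle(output, (x, y), (x + w, y + h), (0, 0, 255), 1)
cv2.imwrite("letters.png", output)
```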

Successful letter detection

By applying the above procedure, we get impressively clean results. Letters are isolated and bounded appropriately. The algorithm is actually “too good” in places, as seen where the dots above the letter ‘i’ are picked out as objects of their own. Yet the results are not without fault either. This specific font and text editor appears to place the letters ‘t’ and ‘y’ too close to each other. In the word ‘liberty’, we can see that ‘rty’ is considered a single letter. There are tricks to refine this algorithm, which I will discuss in future pieces.

Liberty holding me back

I ultimately want to use this script on handwritten text, so it was tested on two different people’s handwriting. One set was meant to be very clean, while the other, not so much (I’ll leave it to the reader to guess which is which). This test revealed some areas of trouble that will need to be overcome in the future.

For one, any letter drawn with disconnected strokes will be detected as multiple separate objects. This is most clearly visible in the letter ‘T’ in the image below, where the top of the ‘T’ is considered a separate object from the bottom. The opposite occurs as well: cursive-style strokes that join multiple letters together are interpreted as a single symbol. This happens with ‘Fo’ in the word ‘For’ below. Both of these issues will require some ingenuity to overcome.

A couple of immediate ideas come to mind. For letters that have been merged, we can look for letters whose bounding box is noticeably larger than those of the other letters in their cluster, paragraph, or sentence. If the bounding box is too large, that suggests letters were accidentally combined. From there arises the issue of performing a correct split.
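As a rough illustration of that idea, the helper below flags boxes that are much wider than the median box in a group. Both the function name and the 1.6 ratio are hypothetical starting points, not part of the script described above.

```python
import statistics

def flag_merged_letters(boxes, ratio=1.6):
    """Return boxes whose width is suspiciously large for their group.

    boxes is a list of (x, y, w, h) tuples; ratio is an untuned guess.
    """
    median_w = statistics.median(w for (_, _, w, _) in boxes)
    return [box for box in boxes if box[2] > ratio * median_w]
```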

To combine broken letters, we might first have to determine that a bounded region does not resemble an existing letter in our alphabet. This will most likely be done with some convolutional neural network. After determining that an object is not a letter, combinations of adjacent bounding boxes can be tested to see if a clearer letter is possible.
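The merging step itself is straightforward once two candidate boxes are chosen; a sketch of taking their union is shown below (again hypothetical, named here only for illustration). The harder part, deciding which boxes to try combining, is left for a later article.

```python
def union_boxes(a, b):
    """Union of two (x, y, w, h) boxes, e.g. the separated top and bottom of a 'T'."""
    x = min(a[0], b[0])
    y = min(a[1], b[1])
    right = max(a[0] + a[2], b[0] + b[2])
    bottom = max(a[1] + a[3], b[1] + b[3])
    return (x, y, right - x, bottom - y)
```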

Different handwriting styles tested

With the script working suitably on normal text, I will be moving on to the next steps of the project in later articles. Some other tasks I want to accomplish are letter recognition, paragraph estimation, and font and style determination.

GitHub

More Pictures

Constitution results (full)
Handwriting #1 results
Handwriting #2 results
