Convolutional neural networks, or CNNs, found applications in computer vision and audio processing. More generally they are suitable for tasks that use a number of features to identify interesting objects, events, sound, etc. The power of CNNs comes from the fact that they can be trained to come up with features that are hard to define by hand. What are “features”, you may ask? It is probably easier to understand this concept through example. Suppose your task is to write a program that can analyze an image of an LCD display and can recognize if the display shows 0 or 1. Thus your program is presented with images that look as follows:
Fig 1. Images representing 0 and 1 on an LCD display
A possible solution would be to convert images to a numeric representation. If 0 was used to represent black pixels, and 1 to represent white pixels, the above images would have numeric representation as shown below:
[[ 1 1 1 1 1] [[ 0 0 0 0 1] [ 1 0 0 0 1] [ 0 0 0 0 1] [ 1 0 0 0 1] [ 0 0 0 0 1] [ 1 0 0 0 1] [ 0 0 0 0 1] [ 1 1 1 1 1]] [ 0 0 0 0 1]]
Fig 2. Numeric representation of images of 0 and 1
One could make a vertical line a feature (x[i][j] == 1, i = 0 .. 5). Then all the program would have to do is to count the number of features found in each image. If it found two, it would declare that it sees a 0, if it found one, it would output 1. Of course as features go, this one is somewhat weak, as adding 8 to the set of displayed numbers would make the program incorrectly identify it as 0. The point is not, however, to come up with a set of fool-proof features, but rather to illustrate what a feature is.
Back to CNNs. One of the mostly lauded achievements of CNNs is their ability to recognize handwritten digits. It is quite hard to tell what makes a handwritten three a 3. But if you let a CNN look at a sufficient number of handwritten 3s, at some point it comes up, by the virtue of back propagation, with a set of features that, when present, uniquely identify a 3. There is a reasonably comprehensive tutorial on tensorflow.org site that shows how to program a neural network to solve this specific task. My goal is not to repeat it. Instead I go over the concepts that make CNNs such a powerful tool.
In this series of posts I show how to develop a CNN that can recognize all 10 digits shown by a hypothetical LCD display. I start with a simple linear regression to show that automatic derivation of features is not specific to CNNs. By adding a degree of freedom, where digits can appear in a larger image, I show that simple linear regression is not sufficient. One possible approach is to use deep neural networks (DNNs), but they too have limits. The final solution is probably the simplest CNN one can build. Despite its simplicity, it is 100% successful in recognizing all ten digits, regardless of their position on the screen.