# Understanding Convolutional Neural Networks

Convolutional neural networks, or CNNs, found applications in computer vision and audio processing. More generally they are suitable for tasks that use a number of features to identify interesting objects, events, sound, etc. The power of CNNs comes from the fact that they can be trained to come up with features that are hard to define by hand. What are “features”, you may ask? It is probably easier to understand this concept through example. Suppose your task is to write a program that can analyze an image of an LCD display and can recognize if the display shows 0 or 1. Thus your program is presented with images that look as follows:

Fig 1. Images representing 0 and 1 on an LCD display

A possible solution would be to convert images to a numeric representation. If 0 was used to represent black pixels, and 1 to represent white pixels, the above images would have numeric representation as shown below:

[[ 1  1  1  1  1]     [[ 0  0  0  0  1]
[ 1  0  0  0  1]      [ 0  0  0  0  1]
[ 1  0  0  0  1]      [ 0  0  0  0  1]
[ 1  0  0  0  1]      [ 0  0  0  0  1]
[ 1  1  1  1  1]]     [ 0  0  0  0  1]]

Fig 2. Numeric representation of images of 0 and 1

One could make a vertical line a feature (x[i][j] == 1, i = 0 .. 5). Then all the program would have to do is to count the number of features found in each image. If it found two, it would declare that it sees a 0, if it found one, it would output 1. Of course as features go, this one is somewhat weak, as adding 8 to the set of displayed numbers would make the program incorrectly identify it as 0. The point is not, however, to come up with a set of fool-proof features, but rather to illustrate what a feature is.

Back to CNNs. One of the mostly lauded achievements of CNNs is their ability to recognize handwritten digits. It is quite hard to tell what makes a handwritten three a 3. But if you let a CNN look at a sufficient number of handwritten 3s, at some point it comes up, by the virtue of back propagation, with a set of features that, when present, uniquely identify a 3. There is a reasonably comprehensive tutorial on tensorflow.org site that shows how to program a neural network to solve this specific task. My goal is not to repeat it. Instead I go over the concepts that make CNNs such a powerful tool.

In this series of posts I show how to develop a CNN that can recognize all 10 digits shown by a hypothetical LCD display. I start with a simple linear regression to show that automatic derivation of features is not specific to CNNs. By adding a degree of freedom, where digits can appear in a larger image, I show that simple linear regression is not sufficient. One possible approach is to use deep neural networks (DNNs), but they too have limits. The final solution is probably the simplest CNN one can build. Despite its simplicity, it is 100% successful in recognizing all ten digits, regardless of their position on the screen.

# How to recognize a digit

The first task is to train a model so that it can recognize 5 x 5 LCD digits, shown in Fig 1.

Fig 1. Images of ten LCD digits

To do this we use a linear regression, which is sufficiently powerful for a task of this complexity.

Each digit is represented as a 5 x 5 array of 0s and 1s. We flatten each one of the arrays into a vector of 25 floats. We stack all vectors forming a 2D array $X_{\mbox{flat}}$. Next, we compute $X_{\mbox{flat}} W + b$, where $W$ is a 25 by 10 matrix and $b$ is a vector of size 10.

$\left[ \begin{array}{ccccccc} 1 & 1 & 1 & \cdots & 1 & 1 & 1 \\ 0 & 0 & 0 & \cdots & 0 & 0 & 1 \\ 1 & 1 & 1 & \cdots & 1 & 1 & 1 \\ & & & \ddots & & & \\ 1 & 1 & 1 & \cdots & 1 & 1 & 1 \\ 1 & 1 & 1 & \cdots & 0 & 0 & 1 \end{array} \right] \times \left[ \begin{array}{cccc} w_{0,0} & w_{0,1} & \cdots & w_{0,9} \\ w_{1,0} & w_{1,1} & \cdots & w_{1,9} \\ w_{2,0} & w_{2,1} & \cdots & w_{2,9} \\ & & \ddots & \\ w_{23,0} & w_{23,1} & \cdots & w_{23,9} \\ w_{24,0} & w_{24,1} & \cdots & w_{24,9} \end{array} \right] + \left[ \begin{array}{c} b_0 \\ b_1 \\ \vdots \\ b_9 \end{array} \right]$

The above expression gives us, for each row of $X_{\mbox{flat}}$, a vector of ten numbers, $y$ referred to as logits. $y_i$ is proportional to the likelihood that the row represents digit $i$. When training a model, we use gradient descent to nudge the model so that if a row represents, say, a 0, then $y_0$ is greater than $y_1$$y_9$.

We cannot treat $y_i$ as probability, as there is no guarantee that each $y_i$ is between 0 and 1 and the sum of all $y_i$‘s adds to 1. However, this can be remedied by using softmax function. A value $h_i$ is expressed as

$\displaystyle h_i = {e^{y_i} \over \sum_{j} e^{y_j}}$

Using softmax converts values y to probabilities h. The goal of training a model is to make sure that when we see an image for, say, 0, the resulting vector $h$ is as close as possible to $[1, 0, 0, 0, 0, 0, 0, 0, 0, 0]$. This requirement is captured with the help of cross entropy function:

$\displaystyle -\frac{1}{10} \sum_{i = 0}^{10} t_i \lg h_i + (1 - t_i) \lg(1 - h_i)$

Where $t_i$ is the true (desired) value, taking either 1 or 0. Let us express the above using Tensorflow:

img_size = 5
shape_size = 5
kind_count = 10
learning_rate = 0.03
pixel_count = img_size * img_size

x = tf.placeholder(tf.float32,
shape=[None, img_size, img_size])
x_flat = tf.reshape(x, [-1, pixel_count])
y_true = tf.placeholder(tf.float32,
shape=[None, kind_count])
W = tf.Variable(tf.zeros([pixel_count, kind_count]))
b = tf.Variable(tf.zeros([kind_count]))
y_pred = tf.matmul(x_flat, W) + b
loss_op = tf.reduce_mean(
tf.nn.softmax_cross_entropy_with_logits(
labels=y_true, logits=y_pred))
learning_rate)
train_op = optimizer.minimize(loss_op)
correct_prediction = tf.equal(tf.argmax(y_pred, 1),
tf.argmax(y_true, 1))
accuracy_op = tf.reduce_mean(
tf.cast(correct_prediction, tf.float32))


First we set up a few parameters. Initially image size and shape size are going to be the same. This means that the shape completely fills the image. We set the number of kinds of images to 10, so that all ten digits are present. The learning rate defines how fast we follow the slope of the loss function. In our case we chose a conservative 0.03. Choosing a larger value can lead to a faster convergence, but it can also cause us to overshoot the minimum, once we are near it. Lines 7, 8 and 9 create placeholders for input data. Both x and y_true are going to be repeatedly fed batches of images and correct labels for those images. In line 12 we set up the weight matrix. In order to compute $X_{\mbox{flat}} W$, W must have pixel_count = 25 rows. It has 10 columns (or kind_count) to produce a vector of size 10. The i-th element of that vector is proportional to the likelihood that the given image represents digit i. Line 15 sets up the loss function. It is set as the mean value of softmax expression computed by taking predictions and true labels. Line 18 uses a gradient descent optimizer and line 20 uses it to create a training operation that minimizes the loss function by descending along the gradient of the mean value of the softmax expression. Line 21 computes, for each batch, how many predictions for that batch were correct. Finally, in line 23 we express accuracy as the sum of the correct predictions divided by the total number of predictions. Next comes the training of the model:

sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())
batch_maker = BatchMaker(img_data, true_kind, batch_size)
step_nbr = 0
while step_nbr < 5:
img_batch, label_batch = batch_maker.next()
train_op.run(feed_dict={x: img_batch, y_true: label_batch})
step_nbr += 1


We first train the model for five steps. After five steps we print a selection of images. As LCD digits are very regular and fit exactly inside each image, just 5 steps is sufficient to get 60% accuracy:

Fig 2. Predictions made by the model after five steps.

The model cannot distinguish between 1, 4, 7 and 9 and between 2 and 3. We continue training until accuracy is 1.

The final matrix W is shown in Fig 3. To better visualize W, we reshape each column of W into a 5 x 5 matrix. Then we normalize each value between -1 and 1. Finally, we plot 1 as red, -1 as blue, 0 as white, and intermediate values as shades of blue or red.

Fig 3. Matrix W at 100% accuracy.

The red color indicates positive interaction between the matrix and the image, while blue colors indicate negative interaction. Looking at 0, we can see that the model learned to distinguish between 0 and 8 by using negative weights for the line in the middle. For 1, the matrix strongly selects for a vertical line on the left, penalizing at the same time parts that would make the image look like 4, 7 or 9. 3 has positive interaction with almost all but 2 pixels, which would turn 3 into 8. The model did not learn the cleanest features, but it learned enough to tell each digit apart.

Clearly, linear regression can learn a number of features, given a well behaved problem. In the following posts we are going to show how a small change in the problem’s complexity causes linear regression to struggle. The change is to increase the image size, while keeping the digits sizes at 5. Deep neural networks are able to cope with this situation, at the expense of much longer training and much larger models. The final solution, that uses a convolutional neural network, can achieve fast training, 100% accuracy and small model size.

### Resources

A Jupyter Notebook with the above code can be found at GitHub’s LCD repository.