How to recognize a digit

The first task is to train a model so that it can recognize 5 x 5 LCD digits, shown in Fig 1. Fig 1. Images of ten LCD digits

To do this we use a linear regression, which is sufficiently powerful for a task of this complexity.

Each digit is represented as a 5 x 5 array of 0s and 1s. We flatten each one of the arrays into a vector of 25 floats. We stack all vectors forming a 2D array $X_{\mbox{flat}}$. Next, we compute $X_{\mbox{flat}} W + b$, where $W$ is a 25 by 10 matrix and $b$ is a vector of size 10. $\left[ \begin{array}{ccccccc} 1 & 1 & 1 & \cdots & 1 & 1 & 1 \\ 0 & 0 & 0 & \cdots & 0 & 0 & 1 \\ 1 & 1 & 1 & \cdots & 1 & 1 & 1 \\ & & & \ddots & & & \\ 1 & 1 & 1 & \cdots & 1 & 1 & 1 \\ 1 & 1 & 1 & \cdots & 0 & 0 & 1 \end{array} \right] \times \left[ \begin{array}{cccc} w_{0,0} & w_{0,1} & \cdots & w_{0,9} \\ w_{1,0} & w_{1,1} & \cdots & w_{1,9} \\ w_{2,0} & w_{2,1} & \cdots & w_{2,9} \\ & & \ddots & \\ w_{23,0} & w_{23,1} & \cdots & w_{23,9} \\ w_{24,0} & w_{24,1} & \cdots & w_{24,9} \end{array} \right] + \left[ \begin{array}{c} b_0 \\ b_1 \\ \vdots \\ b_9 \end{array} \right]$

The above expression gives us, for each row of $X_{\mbox{flat}}$, a vector of ten numbers, $y$ referred to as logits. $y_i$ is proportional to the likelihood that the row represents digit $i$. When training a model, we use gradient descent to nudge the model so that if a row represents, say, a 0, then $y_0$ is greater than $y_1$ $y_9$.

We cannot treat $y_i$ as probability, as there is no guarantee that each $y_i$ is between 0 and 1 and the sum of all $y_i$‘s adds to 1. However, this can be remedied by using softmax function. A value $h_i$ is expressed as $\displaystyle h_i = {e^{y_i} \over \sum_{j} e^{y_j}}$

Using softmax converts values y to probabilities h. The goal of training a model is to make sure that when we see an image for, say, 0, the resulting vector $h$ is as close as possible to $[1, 0, 0, 0, 0, 0, 0, 0, 0, 0]$. This requirement is captured with the help of cross entropy function: $\displaystyle -\frac{1}{10} \sum_{i = 0}^{10} t_i \lg h_i + (1 - t_i) \lg(1 - h_i)$

Where $t_i$ is the true (desired) value, taking either 1 or 0. Let us express the above using Tensorflow:

img_size = 5
shape_size = 5
kind_count = 10
learning_rate = 0.03
pixel_count = img_size * img_size

x = tf.placeholder(tf.float32,
shape=[None, img_size, img_size])
x_flat = tf.reshape(x, [-1, pixel_count])
y_true = tf.placeholder(tf.float32,
shape=[None, kind_count])
W = tf.Variable(tf.zeros([pixel_count, kind_count]))
b = tf.Variable(tf.zeros([kind_count]))
y_pred = tf.matmul(x_flat, W) + b
loss_op = tf.reduce_mean(
tf.nn.softmax_cross_entropy_with_logits(
labels=y_true, logits=y_pred))
learning_rate)
train_op = optimizer.minimize(loss_op)
correct_prediction = tf.equal(tf.argmax(y_pred, 1),
tf.argmax(y_true, 1))
accuracy_op = tf.reduce_mean(
tf.cast(correct_prediction, tf.float32))

First we set up a few parameters. Initially image size and shape size are going to be the same. This means that the shape completely fills the image. We set the number of kinds of images to 10, so that all ten digits are present. The learning rate defines how fast we follow the slope of the loss function. In our case we chose a conservative 0.03. Choosing a larger value can lead to a faster convergence, but it can also cause us to overshoot the minimum, once we are near it. Lines 7, 8 and 9 create placeholders for input data. Both x and y_true are going to be repeatedly fed batches of images and correct labels for those images. In line 12 we set up the weight matrix. In order to compute $X_{\mbox{flat}} W$, W must have pixel_count = 25 rows. It has 10 columns (or kind_count) to produce a vector of size 10. The i-th element of that vector is proportional to the likelihood that the given image represents digit i. Line 15 sets up the loss function. It is set as the mean value of softmax expression computed by taking predictions and true labels. Line 18 uses a gradient descent optimizer and line 20 uses it to create a training operation that minimizes the loss function by descending along the gradient of the mean value of the softmax expression. Line 21 computes, for each batch, how many predictions for that batch were correct. Finally, in line 23 we express accuracy as the sum of the correct predictions divided by the total number of predictions. Next comes the training of the model:

sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())
batch_maker = BatchMaker(img_data, true_kind, batch_size)
step_nbr = 0
while step_nbr < 5:
img_batch, label_batch = batch_maker.next()
train_op.run(feed_dict={x: img_batch, y_true: label_batch})
step_nbr += 1

We first train the model for five steps. After five steps we print a selection of images. As LCD digits are very regular and fit exactly inside each image, just 5 steps is sufficient to get 60% accuracy: Fig 2. Predictions made by the model after five steps.

The model cannot distinguish between 1, 4, 7 and 9 and between 2 and 3. We continue training until accuracy is 1.

The final matrix W is shown in Fig 3. To better visualize W, we reshape each column of W into a 5 x 5 matrix. Then we normalize each value between -1 and 1. Finally, we plot 1 as red, -1 as blue, 0 as white, and intermediate values as shades of blue or red. Fig 3. Matrix W at 100% accuracy.

The red color indicates positive interaction between the matrix and the image, while blue colors indicate negative interaction. Looking at 0, we can see that the model learned to distinguish between 0 and 8 by using negative weights for the line in the middle. For 1, the matrix strongly selects for a vertical line on the left, penalizing at the same time parts that would make the image look like 4, 7 or 9. 3 has positive interaction with almost all but 2 pixels, which would turn 3 into 8. The model did not learn the cleanest features, but it learned enough to tell each digit apart.

Clearly, linear regression can learn a number of features, given a well behaved problem. In the following posts we are going to show how a small change in the problem’s complexity causes linear regression to struggle. The change is to increase the image size, while keeping the digits sizes at 5. Deep neural networks are able to cope with this situation, at the expense of much longer training and much larger models. The final solution, that uses a convolutional neural network, can achieve fast training, 100% accuracy and small model size.

Resources

A Jupyter Notebook with the above code can be found at GitHub’s LCD repository.

Linear Regression with Multiple Variables in Tensorflow

In Lecture 4.1 Linear Regression with multiple variables Andrew Ng shows how to generalize linear regression with a single variable to the case of multiple variables. Andrew Ng introduces a bit of notation to derive a more succinct formulation of the problem. Namely, $n$ features $x_1$ $x_n$ are extended by adding feature $x_0$ which is always set to 1. This way the hypothesis can be expressed as: $h_{\theta}(x) = \theta_{0} x_0 + \theta_{1} x_1 + \cdots + \theta_{n} x_n = \theta^T x$

For $m$ examples, the task of linear regression can be expressed as a task of finding vector $\theta$ such that $\left[ \begin{array}{cccc} \theta_0 & \theta_1 & \cdots & \theta_n \end{array} \right] \times \left[ \begin{array}{ccccc} 1 & 1 & \cdots & 1 \\ x^{(1)}_1 & x^{(2)}_1 & \cdots & x^{(m)}_1 \\ & & \vdots \\ x^{(n)}_m & x^{(n)}_m & \cdots & x^{(n)}_m \\ \end{array} \right]$

is as close as possible to some observed values $y_1, y_2, \cdots, y_m$. The “as close as possible” typically means that the mean sum of square errors between $h_{\theta}(x^{(i)})$ and $y_i$ for $i \in [1, m]$ is minimized. This quantity is often referred to as cost or loss function: $J(\theta) = \dfrac{1}{2 m} \sum_{i=1}^{m} \left( h_{\theta}(x^{(i)}) - y_i\right)^2$

To express the above concepts in Tensorflow, and more importantly, have Tensorflow find $\theta$ that minimizes the cost function, we need to make a few adjustments. We rename vector $\theta$, as w. We are not using $x_0 = 1$. Instead, we use a tensor of size 0 (also known as scalar), called b to represent $x_0$. As it is easier to stack rows than columns, we form matrix $X$, in such a way that the i-th row is the i-th sample. Our formulation thus has the form $h_{w,b}(X) = \left[ \begin{array}{ccc} \text{---} & (x^{(1)})^T & \text{---} \\ \text{---} & (x^{(2)})^T & \text{---} \\ & \vdots & \\ \text{---} & (x^{(m)})^T & \text{---} \end{array} \right] \times \left[ \begin{array}{c} w_1 \\ w_2 \\ \vdots \\ w_m \end{array} \right] + b$

This leads to the following Python code:

X_in = tf.placeholder(tf.float32, [None, n_features], "X_in")
w = tf.Variable(tf.random_normal([n_features, 1]), name="w")
b = tf.Variable(tf.constant(0.1, shape=[]), name="b")

We first introduce a tf.placeholder named X_in. This is how we supply data into our model. Line 2 creates a vector w corresponding to $\theta$. Line 3 creates a variable b corresponding to $x_0$. Finally, line 4 expresses function h as a matrix multiplication of X_in and w plus scalar b.

y_in = tf.placeholder(tf.float32, [None, 1], "y_in")
loss_op = tf.reduce_mean(tf.square(tf.subtract(y_in, h)),
name="loss")

To define the loss function, we introduce another placeholder y_in. It holds the ideal (or target) values for the function h. Next we create a loss_op. This corresponds to the loss function. The difference is that, rather than being a function directly, it defines for Tensorflow operations that need to be run to compute a loss function. Finally, the training operation uses a gradient descent optimizer, that uses learning rate of 0.3, and tries to minimize the loss.

Now we have all pieces in place to create a loop that finds w and b that minimize the loss function.

with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
for batch in range(1000):
sess.run(train_op, feed_dict={
X_in: X_true,
y_in: y_true
})
w_computed = sess.run(w)
b_computed = sess.run(b)

In line 1 we create a session that is going to run operations we created before. First we initialize all global variables. In lines 3-7 we repeatedly run the training operation. It computes the value of h based on X_in. Next, it computes the current loss, based on h, and y_in. It uses the data flow graph to compute derivatives of the loss function with respect to every variable in the computational graph. It automatically adjusts them, using the specified learning rate of 0.3. Once the desired number of steps has been completed, we record the final values of vector w and scalar b computed by Tensorflow.

To see how well Tensorflow did, we print the final version of computed variables. We compare them with ideal values (which for the purpose of this exercise were initialized to random values):

print "w computed [%s]" % ', '.join(['%.5f' % x for x in w_computed.flatten()])
print "w actual   [%s]" % ', '.join(['%.5f' % x for x in w_true.flatten()])
print "b computed %.3f" % b_computed
print "b actual  %.3f" % b_true

w computed [5.48375, 90.52216, 48.28834, 38.46674]
w actual   [5.48446, 90.52165, 48.28952, 38.46534]
b computed -9.326
b actual  -9.331

Resources

You can download the Jupyter notebook with the above code from a github linear regression repository.

Multiple Variable Linear Regression using Tensorflow Layers

In version 1.0 of Tensorflow released in Feb 2017 a higher level APIs, called layers, were added. These allow a reduction in the amount of boilerplate code one has to write. For example, for linear regression with $n$ features we would always create a matrix $X$ and vector $y$ represented by placeholders. We would always create variables representing weights and biases, etc., and so on. By using layers this can be avoided. Instead, we need to focus on describing and supplying data to a regressor. Let us rewrite linear regression using layers. The code is shown below:

x_feature = tf.contrib.layers.real_valued_column('X', 4)
regressor = tf.contrib.learn.LinearRegressor(
[x_feature],
regressor.fit(
input_fn=create_training_fn(m_examples, w_true, b_true),
steps=500)
eval_dict = regressor.evaluate(
input_fn=create_training_fn(10, w_true, b_true), steps=1)

First we describe our features. In our simple case we create a real valued feature, named X, 4 columns wide. We could have created four features x1x4. However, this would make input function more complex. Next, we create a linear regressor. We pass feature columns to it. It must be an iterable. Otherwise it will fail with fairly mysterious errors. The second parameter is the optimizer. We chose to rely on the same gradient descent optimizer as in the last example. Having created the regressor we train it, by calling fit method. This is done with the help of the input function that feeds labeled data into it. We run it for 500 steps. Finally, we evaluate how well the regressor fits the data. In real application, the last step should be called with data not included in the training set.

The input function must return a pair. The first element of the pair must be a map from feature names to feature values. The second element must be the target values (i.e., labels) that the regressor is learning. In our case the function is fairly simple, as shown below:

def create_training_fn(m, w, b):
def training_fn_():
X = np.random.rand(m, w.shape)
return ({'X': tf.constant(X)},
tf.constant(np.matmul(X, w) + b))
return training_fn_

It generates a random set of input data, X, and computes the target value as $X w + b$. In real applications this function can be arbitrarily complex. It could, for example, read data and labels from files, returning a fixed number of rows at a time.

Last, let us see how well the LinearRegressor did. Typically, one has just the loss function as the guide. However, for us, we also know w and b. Thus we can compare them with what the regressor computed, by fetching regressor’s variables:

print "loss", eval_dict['loss']
print "w true ", w_true.T
print "w found", regressor.get_variable_value('linear/X/weight').T
print "b true  %.4f" % b_true
print "b found", regressor.get_variable_value('linear/bias_weight')

loss 8.81575e-06
w true  [ 1.7396  62.2283  59.7082  6.9788]
w found [ 1.7304  62.2178  59.6973  6.9751]
b true  -4.7938
b found -4.77659

Both the weights and the bias are very close to the one we used to train the regressor.

It is worth mentioning that regressors also offer a way of tracking internal state that can be used to analyze their behavior using TensorBoard. There are also methods that allow regressor’s state to be saved and later restored. Finally, the predict function can be used to compute regressor output, for any unlabeled input.