machine learning

# Machine Learning with Tensorflow

Coursera offers an excellent machine learning course by Andrew Ng. The course provides comprehensive coverage of topics ranging from linear regression, through neural networks, support vector machines, to unsupervised learning. It also comes with a number of exercises that are a perfect complement to the lectures. Only completing those exercises gives you a much greater mastery of various techniques. In 2015, Google opened Tensorflow machine learning API. This blog rewrites solutions to problems presented in ML lectures, but not the assignments(!), using Tensorflow. Tensorflow makes solutions to the original ML examples much simpler. However, more importantly, if one has understood Andrew Ng’s course, seeing the same problems expressed using Tensorflow provides a better platform for learning and understanding Tensorflow APIs themselves.

Note: Originally, this blog was to be devoted solely to Andrew Ng’s machine learning course. However, recently I decided to add examples of convolutional neural networks, auto-encoders, recurrent neural networks, etc. If you are just interested in material corresponding to Courser’a course, skip ahead to Linear Regression with Multiple Variables in Tensorflow.

# Linear Regression with Multiple Variables in Tensorflow

In Lecture 4.1 Linear Regression with multiple variables Andrew Ng shows how to generalize linear regression with a single variable to the case of multiple variables. Andrew Ng introduces a bit of notation to derive a more succinct formulation of the problem. Namely, $n$ features $x_1$$x_n$ are extended by adding feature $x_0$ which is always set to 1. This way the hypothesis can be expressed as:

$h_{\theta}(x) = \theta_{0} x_0 + \theta_{1} x_1 + \cdots + \theta_{n} x_n = \theta^T x$

For $m$ examples, the task of linear regression can be expressed as a task of finding vector $\theta$ such that

$\left[ \begin{array}{cccc} \theta_0 & \theta_1 & \cdots & \theta_n \end{array} \right] \times \left[ \begin{array}{ccccc} 1 & 1 & \cdots & 1 \\ x^{(1)}_1 & x^{(2)}_1 & \cdots & x^{(m)}_1 \\ & & \vdots \\ x^{(n)}_m & x^{(n)}_m & \cdots & x^{(n)}_m \\ \end{array} \right]$

is as close as possible to some observed values $y_1, y_2, \cdots, y_m$. The “as close as possible” typically means that the mean sum of square errors between $h_{\theta}(x^{(i)})$ and $y_i$ for $i \in [1, m]$ is minimized. This quantity is often referred to as cost or loss function:

$J(\theta) = \dfrac{1}{2 m} \sum_{i=1}^{m} \left( h_{\theta}(x^{(i)}) - y_i\right)^2$

To express the above concepts in Tensorflow, and more importantly, have Tensorflow find $\theta$ that minimizes the cost function, we need to make a few adjustments. We rename vector $\theta$, as w. We are not using $x_0 = 1$. Instead, we use a tensor of size 0 (also known as scalar), called b to represent $x_0$. As it is easier to stack rows than columns, we form matrix $X$, in such a way that the i-th row is the i-th sample. Our formulation thus has the form

$h_{w,b}(X) = \left[ \begin{array}{ccc} \text{---} & (x^{(1)})^T & \text{---} \\ \text{---} & (x^{(2)})^T & \text{---} \\ & \vdots & \\ \text{---} & (x^{(m)})^T & \text{---} \end{array} \right] \times \left[ \begin{array}{c} w_1 \\ w_2 \\ \vdots \\ w_m \end{array} \right] + b$

This leads to the following Python code:

X_in = tf.placeholder(tf.float32, [None, n_features], "X_in")
w = tf.Variable(tf.random_normal([n_features, 1]), name="w")
b = tf.Variable(tf.constant(0.1, shape=[]), name="b")


We first introduce a tf.placeholder named X_in. This is how we supply data into our model. Line 2 creates a vector w corresponding to $\theta$. Line 3 creates a variable b corresponding to $x_0$. Finally, line 4 expresses function h as a matrix multiplication of X_in and w plus scalar b.

y_in = tf.placeholder(tf.float32, [None, 1], "y_in")
loss_op = tf.reduce_mean(tf.square(tf.subtract(y_in, h)),
name="loss")


To define the loss function, we introduce another placeholder y_in. It holds the ideal (or target) values for the function h. Next we create a loss_op. This corresponds to the loss function. The difference is that, rather than being a function directly, it defines for Tensorflow operations that need to be run to compute a loss function. Finally, the training operation uses a gradient descent optimizer, that uses learning rate of 0.3, and tries to minimize the loss.

Now we have all pieces in place to create a loop that finds w and b that minimize the loss function.

with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
for batch in range(1000):
sess.run(train_op, feed_dict={
X_in: X_true,
y_in: y_true
})
w_computed = sess.run(w)
b_computed = sess.run(b)


In line 1 we create a session that is going to run operations we created before. First we initialize all global variables. In lines 3-7 we repeatedly run the training operation. It computes the value of h based on X_in. Next, it computes the current loss, based on h, and y_in. It uses the data flow graph to compute derivatives of the loss function with respect to every variable in the computational graph. It automatically adjusts them, using the specified learning rate of 0.3. Once the desired number of steps has been completed, we record the final values of vector w and scalar b computed by Tensorflow.

To see how well Tensorflow did, we print the final version of computed variables. We compare them with ideal values (which for the purpose of this exercise were initialized to random values):

print "w computed [%s]" % ', '.join(['%.5f' % x for x in w_computed.flatten()])
print "w actual   [%s]" % ', '.join(['%.5f' % x for x in w_true.flatten()])
print "b computed %.3f" % b_computed
print "b actual  %.3f" % b_true[0]

w computed [5.48375, 90.52216, 48.28834, 38.46674]
w actual   [5.48446, 90.52165, 48.28952, 38.46534]
b computed -9.326
b actual  -9.331


### Resources

You can download the Jupyter notebook with the above code from a github linear regression repository.

# Multiple Variable Linear Regression using Tensorflow Layers

In version 1.0 of Tensorflow released in Feb 2017 a higher level APIs, called layers, were added. These allow a reduction in the amount of boilerplate code one has to write. For example, for linear regression with $n$ features we would always create a matrix $X$ and vector $y$ represented by placeholders. We would always create variables representing weights and biases, etc., and so on. By using layers this can be avoided. Instead, we need to focus on describing and supplying data to a regressor. Let us rewrite linear regression using layers. The code is shown below:

x_feature = tf.contrib.layers.real_valued_column('X', 4)
regressor = tf.contrib.learn.LinearRegressor(
[x_feature],
regressor.fit(
input_fn=create_training_fn(m_examples, w_true, b_true),
steps=500)
eval_dict = regressor.evaluate(
input_fn=create_training_fn(10, w_true, b_true), steps=1)


First we describe our features. In our simple case we create a real valued feature, named X, 4 columns wide. We could have created four features x1x4. However, this would make input function more complex. Next, we create a linear regressor. We pass feature columns to it. It must be an iterable. Otherwise it will fail with fairly mysterious errors. The second parameter is the optimizer. We chose to rely on the same gradient descent optimizer as in the last example. Having created the regressor we train it, by calling fit method. This is done with the help of the input function that feeds labeled data into it. We run it for 500 steps. Finally, we evaluate how well the regressor fits the data. In real application, the last step should be called with data not included in the training set.

The input function must return a pair. The first element of the pair must be a map from feature names to feature values. The second element must be the target values (i.e., labels) that the regressor is learning. In our case the function is fairly simple, as shown below:

def create_training_fn(m, w, b):
def training_fn_():
X = np.random.rand(m, w.shape[0])
return ({'X': tf.constant(X)},
tf.constant(np.matmul(X, w) + b))
return training_fn_


It generates a random set of input data, X, and computes the target value as $X w + b$. In real applications this function can be arbitrarily complex. It could, for example, read data and labels from files, returning a fixed number of rows at a time.

Last, let us see how well the LinearRegressor did. Typically, one has just the loss function as the guide. However, for us, we also know w and b. Thus we can compare them with what the regressor computed, by fetching regressor’s variables:

print "loss", eval_dict['loss']
print "w true ", w_true.T[0]
print "w found", regressor.get_variable_value('linear/X/weight').T[0]
print "b true  %.4f" % b_true[0]
print "b found", regressor.get_variable_value('linear/bias_weight')[0]

loss 8.81575e-06
w true  [ 1.7396  62.2283  59.7082  6.9788]
w found [ 1.7304  62.2178  59.6973  6.9751]
b true  -4.7938
b found -4.77659


Both the weights and the bias are very close to the one we used to train the regressor.

It is worth mentioning that regressors also offer a way of tracking internal state that can be used to analyze their behavior using TensorBoard. There are also methods that allow regressor’s state to be saved and later restored. Finally, the predict function can be used to compute regressor output, for any unlabeled input.

machine learning

# Computing XNOR with a Neural Network

This tutorial shows how to use Tensorflow to create a neural network that mimics $\neg (x_1 \oplus x_2)$ function. This function, abbreviated as XNOR, returns 1 only if $x_1$ is equal to $x_2$. The values are summarized in the table below:

$\begin{array}{c|c|c} x_1 & x_2 & y \\ \hline 0 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \\ 1 & 1 & 1 \end{array}$

Andrew Ng shows in Lecture 8.5: Neural Networks – Representation how to construct a single neuron that can emulate a logical AND operation. The neuron is considered to act like a logical AND if it outputs a value close to 0 for (0, 0), (0, 1), and (1, 0) inputs, and a value close to 1 for (1, 1). This can be achieved as follows:

$h_{\mbox{and}}(x)=\dfrac{1}{e^{30 - 20x_1 - 20x_2}}$

To recreate the above in tensorflow we first create a function that takes theta, a vector of coefficients, together with x1 and x2. We use the vector to create three constants, represented by tf.constant. The first one is the bias unit. The next two are used to multiply x1 and x2, respectively. The expression is then fed into a sigmoid function, implemented by tf.nn.sigmoid.

Outside of the function we create two placeholders. For Tensorflow a tf.placeholder is an operation that is fed data. These are going to be our x1 and x2 variables. Next we create a h_and operation by calling the MakeModel function with the coefficient vector as suggested by Andrew Ng.

def MakeModel(theta, x1, x2):
h = tf.constant(theta[0]) + \
tf.constant(theta[1]) * x1 + tf.constant(theta[2]) * x2
return tf.nn.sigmoid(h)

x1 = tf.placeholder(tf.float32, name="x1")
x2 = tf.placeholder(tf.float32, name="x2")
h_and = MakeModel([-30.0, 20.0, 20.0], x1, x2)


We can then print the values to verify that our model works correctly. When creating Tensorflow operations, we do not create an actual program. Instead, we create a description of the program. To execute it, we need to create a session to run it:

with tf.Session() as sess:
print " x1 | x2 |  g"
print "----+----+-----"
for x in range(4):
x1_in, x2_in = x &amp;amp; 1, x &amp;gt; 1
print " %2.0f | %2.0f | %3.1f" % (
x1_in, x2_in, sess.run(h_and, {x1: x1_in, x2: x2_in}))


The above code produces the following output, confirming that we have correctly coded the AND function:

  x1| x2 |  g
----+----+-----
0 |  0 | 0.0
0 |  1 | 0.0
1 |  0 | 0.0
1 |  1 | 1.0


To get a better understanding of how a neuron, or more precisely, a sigmoid function with a linear input, emulates a logical AND, let us plot its values. Rather than just using four points, we compute its values for a set of 20 x 20 points from the range [0, 1]. First, we define a function that, for a given input function (a tensor) and a linear space, computes values of returned by the function (a tensor) when fed points from the linear space.

def ComputeVals(h, span):
vals = []
with tf.Session() as sess:
for x1_in in span:
vals.append([
sess.run(h, feed_dict={
x1: x1_in, x2: x2_in}) for x2_in in span
])
return vals


This is a rather inefficient way of doing this. However, at this stage we aim for clarity not efficiency. To plot values computed by the h_and tensor we use matplotlib. The result can be seen in Fig 1. We use coolwarm color map, with blue representing 0 and red representing 1.

Fig 1. Values of a neuron emulating the AND gate

Having created a logical AND, let us apply the same approach, and create a logical OR. Following Andrew Ng’s lecture, the bias is set to -10.0, while we use 20.0 as weights associated with x1 and x2. This has the effect of generating an input larger than or equal 10.0, if either x1 or x2 are 1, and -10, if both are zero. We reuse the same MakeModel function. We pass the same x1 and x2 as input, but change vector theta to [-10.0, 20.0, 20.0]

h_or = MakeModel([-10.0, 20.0, 20.0], x1, x2)
or_vals = ComputeVals(h_or, span)


When plotted with matplotlib we see the graph shown in Fig 2.

Fig 2. Values of a neuron emulating the OR gate

The negation can be crated by putting a large negative weight in front of the variable. Andrew Ng’s chose $10 - 20x$. This way $g(x)=1/(1 + e^{20x - 10})$ returns $0.00005$ for $latex x = 1$ and $0.99995$ for $x = 0$. By using −20 with both x1 and x2 we get a neuron that produces a logical and of negation of both variables, also known as the NOR gate: $h_{nor} = 1/(1+e^{-(10 - 20x1 - 20x2)})$.

h_nor = MakeModel([10.0, -20.0, -20.0], x1, x2)
nor_vals = ComputeVals(h_nor, span)


The plot of values of our h_nor function can be seen in Fig 3.

Fig 3. Value of a neuron emulating the NOR gate

With the last gate, we have everything in place. The first neuron generates values close to one when both x1 and x2 are 1, the third neuron generates value close to one when x1 and x2 are close to 0. Finally, the second neuron can perform a logical OR of values generated from two neurons. Thus our xnor neuron can be constructed by passing h_and and h_nor as inputs to h_or neuron. In Tensorflow this simply means that rather than passing x1 and x2 placeholders, when constructing h_or function, we pass h_and and h_nor tensors:

h_xnor = MakeModel([-10.0, 20.0, 20.0], h_nor, h_and)
xnor_vals = ComputeVals(h_xnor, span)


Again, to see what is happening, let us plot the values of h_xnor over the [0, 1] range. These are shown in Fig 4.

Fig 4. Value of a neural net emulating XNOR gate

In a typical Tensorflow application we would not see only constants being used to create a model. Instead constants are used to initialize variables. The reason we could use only constants is that we do not intend to train the model. Instead we already knew, thanks to Andrew Ng, the final values of all weights and biases.

Finally, the solution that we gave is quite inefficient. We will show next how by vectorising it one can speed it up by a factor of over 200 times. This is not an insignificant number, considering how simple our model is. In larger models vectorization can give us even more dramatic improvements.

### Resources

You can download the Jupyter notebook from which code snippets were presented above from github xnor-basic repository.

machine learning

# Vectorized XNOR

In the previous post we have showed how to encode XNOR function using a two layers deep neural net. The first layer consists of the NOR and AND gates. The second layer is a single OR gate. The Tensorflow implementation we developed is rather inefficient. This is due to the fact that all computations are done on individual variables. A better way is to create the model so that each layer of the neural net can be computed for a batch of inputs as matrix operations. The input to the first sigmoid function can be computed as follows:

$\left(\begin{array}{cc}x_1^{(1)} & x_2^{(1)} \\ x_1^{(2)} & x_2^{(2)} \\ \cdots \\ x_1^{(m)} & x_2^{(m)} \end{array}\right) \left(\begin{array}{cc} w_{11} & w_{12} \\ w_{21} & w_{22}\end{array}\right) + \left(\begin{array}{cc}b_1 & b_2\end{array}\right)$

This leads to the following model:

X = tf.placeholder(tf.float32, [None, 2], name="X")
W1 = tf.constant([[20.0, -20.0], [20.0, -20.0]])
b1 = tf.constant([-30, 10])
W2 = tf.constant([[20.0], [20.0]])
b2 = tf.constant([-10.0])


X representing x1 and x2, has unrestricted first dimension. This allows us to specify arbitrary many inputs. The first layer also computes both AND and NOR gates in a single computation. The second layer takes a single input, the previous layer, and again computes the output in two matrix operation. The function that computes values also undergoes changes

def ComputeValsFast(h, span):
x1, x2 = np.meshgrid(span, span)
X_in = np.column_stack([x1.flatten(), x2.flatten()])
with tf.Session() as sess:
return np.reshape(sess.run(h, feed_dict={X: X_in}), x1.shape)


It takes the vector that defines a space of input values, flattens it and stacks it as two columns. Then all values can be computed as a single call to sess.run(), followed by the reshaping operation. The difference? The original operation on a MacBook Air run in about 3.07s per loop. The reformulated, so-called fast version, runs in 14.5ms per loop. This level of speed allows us to recompute values of the optimized xor for 10,000, rather than the original 400 points leading to the image shown in Fig 1.

Fig 1. 100 x 100 XNOR values computed by a neural net.

### Resources

You can download the Jupyter notebook from which code snippets were presented above from github xnor-fast repository.