
Linear Regression with Multiple Variables in Tensorflow

In Lecture 4.1, Linear Regression with multiple variables, Andrew Ng shows how to generalize linear regression with a single variable to the case of multiple variables. He introduces a bit of notation to derive a more succinct formulation of the problem: the n features x_1, \ldots, x_n are extended with an additional feature x_0 that is always set to 1. This way the hypothesis can be expressed as:

h_{\theta}(x) = \theta_{0} x_0 + \theta_{1} x_1 + \cdots + \theta_{n} x_n = \theta^T x

For m examples, the task of linear regression can be expressed as the task of finding a vector \theta such that

\left[ \begin{array}{cccc} \theta_0 & \theta_1 & \cdots & \theta_n \end{array} \right] \times \left[ \begin{array}{cccc} 1 & 1 & \cdots & 1 \\ x^{(1)}_1 & x^{(2)}_1 & \cdots & x^{(m)}_1 \\ \vdots & \vdots & & \vdots \\ x^{(1)}_n & x^{(2)}_n & \cdots & x^{(m)}_n \\ \end{array} \right]

is as close as possible to some observed values y_1, y_2, \cdots, y_m. “As close as possible” typically means that the mean of the squared errors between h_{\theta}(x^{(i)}) and y_i for i \in [1, m] is minimized. This quantity is often referred to as the cost (or loss) function:

J(\theta) = \dfrac{1}{2 m} \sum_{i=1}^{m} \left( h_{\theta}(x^{(i)}) - y_i\right)^2
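
To make the cost function concrete, here is a minimal NumPy sketch of J(\theta) (the function and array names are illustrative, not part of the original post). Following the matrix formulation above, X holds one example per column, with the first row set to 1 for x_0:

import numpy as np

def cost(theta, X, y):
    # theta: (n+1, 1) parameters; X: (n+1, m) examples, first row all 1s;
    # y: (1, m) observed values.
    m = X.shape[1]
    h = theta.T.dot(X)                        # hypothesis for all m examples
    return np.sum((h - y) ** 2) / (2.0 * m)   # J(theta)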

To express the above concepts in Tensorflow, and more importantly, to have Tensorflow find \theta that minimizes the cost function, we need to make a few adjustments. We rename the vector \theta to w. We do not use x_0 = 1. Instead, we use a rank-0 tensor (also known as a scalar), called b, to play the role of the bias term \theta_0. As it is easier to stack rows than columns, we form the matrix X so that its i-th row is the i-th sample. Our formulation thus has the form

h_{w,b}(X) = \left[ \begin{array}{ccc} \text{---} & (x^{(1)})^T & \text{---} \\ \text{---} & (x^{(2)})^T & \text{---} \\ & \vdots & \\ \text{---} & (x^{(m)})^T & \text{---} \end{array} \right] \times \left[ \begin{array}{c} w_1 \\ w_2 \\ \vdots \\ w_n \end{array} \right] + b

This leads to the following Python code:

import tensorflow as tf

X_in = tf.placeholder(tf.float32, [None, n_features], name="X_in")  # one example per row
w = tf.Variable(tf.random_normal([n_features, 1]), name="w")        # corresponds to theta
b = tf.Variable(tf.constant(0.1, shape=[]), name="b")               # bias, theta_0
h = tf.add(tf.matmul(X_in, w), b)                                   # h = X w + b

We first introduce a tf.placeholder named X_in; this is how we supply data to our model. Next we create a vector w corresponding to \theta, and a scalar variable b corresponding to the bias term \theta_0. Finally, we express the function h as a matrix multiplication of X_in and w, plus the scalar b.

y_in = tf.placeholder(tf.float32, [None, 1], "y_in")
loss_op = tf.reduce_mean(tf.square(tf.subtract(y_in, h)),
                         name="loss")
train_op = tf.train.GradientDescentOptimizer(0.3).minimize(loss_op)

To define the loss function, we introduce another placeholder, y_in. It holds the ideal (or target) values for the function h. Next we create loss_op, which corresponds to the loss function. The difference is that, rather than being a function directly, it defines the Tensorflow operations that need to be run to compute the loss. Finally, the training operation uses a gradient descent optimizer with a learning rate of 0.3 to minimize the loss.
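
The snippets here feed X_true and y_true, and the printout at the end compares against w_true and b_true; the post does not show how these are built. A minimal sketch of one way such synthetic data could be generated (the sizes and ranges are assumptions, chosen to loosely match the printed results below):

import numpy as np

n_features = 4   # assumed; matches the four weights printed below
w_true = np.random.uniform(0.0, 100.0, (n_features, 1)).astype(np.float32)
b_true = np.random.uniform(-10.0, 10.0, (1,)).astype(np.float32)
X_true = np.random.random((100, n_features)).astype(np.float32)
y_true = X_true.dot(w_true) + b_true   # exact targets, no noise added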

Now we have all pieces in place to create a loop that finds w and b that minimize the loss function.

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for batch in range(1000):
        sess.run(train_op, feed_dict={
            X_in: X_true,
            y_in: y_true
        })
    w_computed = sess.run(w)
    b_computed = sess.run(b)

We first create a session that runs the operations we created before, and initialize all global variables. We then repeatedly run the training operation. Each run computes the value of h based on X_in, then computes the current loss based on h and y_in. It uses the data flow graph to compute the derivatives of the loss function with respect to every variable in the computational graph, and automatically adjusts them using the specified learning rate of 0.3. Once the desired number of steps has been completed, we record the final values of the vector w and the scalar b computed by Tensorflow.
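
If we wanted to watch the loss fall as training progresses, we could evaluate loss_op in the same call that runs the training step; a possible variation on the loop above (inside the same session):

for batch in range(1000):
    # Run the training step and fetch the current loss in one call.
    _, loss_val = sess.run([train_op, loss_op],
                           feed_dict={X_in: X_true, y_in: y_true})
    if batch % 100 == 0:
        print "batch %4d, loss %.6f" % (batch, loss_val)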

To see how well Tensorflow did, we print the final values of the computed variables and compare them with the ideal values (which for the purpose of this exercise were initialized to random values):

print "w computed [%s]" % ', '.join(['%.5f' % x for x in w_computed.flatten()])
print "w actual   [%s]" % ', '.join(['%.5f' % x for x in w_true.flatten()])
print "b computed %.3f" % b_computed
print "b actual  %.3f" % b_true[0]

w computed [5.48375, 90.52216, 48.28834, 38.46674]
w actual   [5.48446, 90.52165, 48.28952, 38.46534]
b computed -9.326
b actual   -9.331

Resources

You can download the Jupyter notebook with the above code from the github linear regression repository.


Computing XNOR with a Neural Network

This tutorial shows how to use Tensorflow to create a neural network that mimics the \neg (x_1 \oplus x_2) function. This function, abbreviated as XNOR, returns 1 only if x_1 is equal to x_2. The values are summarized in the table below:

\begin{array}{c|c|c} x_1 & x_2 & y \\ \hline 0 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \\ 1 & 1 & 1 \end{array}

Andrew Ng shows in Lecture 8.5: Neural Networks – Representation how to construct a single neuron that can emulate a logical AND operation. The neuron is considered to act like a logical AND if it outputs a value close to 0 for (0, 0), (0, 1), and (1, 0) inputs, and a value close to 1 for (1, 1). This can be achieved as follows:

h_{\mbox{and}}(x)=\dfrac{1}{1 + e^{30 - 20x_1 - 20x_2}}

To recreate the above in Tensorflow we first create a function that takes theta, a vector of coefficients, together with x1 and x2. We use the vector to create three constants, represented by tf.constant. The first one is the bias unit. The next two are used to multiply x1 and x2, respectively. The expression is then fed into a sigmoid function, implemented by tf.nn.sigmoid.

Outside of the function we create two placeholders. For Tensorflow, a tf.placeholder is an operation that is fed data; these are going to be our x1 and x2 variables. Next we create an h_and operation by calling the MakeModel function with the coefficient vector suggested by Andrew Ng.

import tensorflow as tf

def MakeModel(theta, x1, x2):
  # theta[0] is the bias; theta[1] and theta[2] weigh x1 and x2.
  h = tf.constant(theta[0]) + \
    tf.constant(theta[1]) * x1 + tf.constant(theta[2]) * x2
  return tf.nn.sigmoid(h)

x1 = tf.placeholder(tf.float32, name="x1")
x2 = tf.placeholder(tf.float32, name="x2")
h_and = MakeModel([-30.0, 20.0, 20.0], x1, x2)

We can then print the values to verify that our model works correctly. When creating Tensorflow operations, we do not create an actual program. Instead, we create a description of the program. To execute it, we need to create a session to run it:

with tf.Session() as sess:
  print " x1 | x2 |  g"
  print "----+----+-----"
  for x in range(4):
    x1_in, x2_in = x & 1, x > 1
    print " %2.0f | %2.0f | %3.1f" % (
        x1_in, x2_in, sess.run(h_and, {x1: x1_in, x2: x2_in}))

The above code produces the following output, confirming that we have correctly coded the AND function:

 x1 | x2 |  g
----+----+-----
  0 |  0 | 0.0
  0 |  1 | 0.0
  1 |  0 | 0.0
  1 |  1 | 1.0

To get a better understanding of how a neuron, or more precisely, a sigmoid function with a linear input, emulates a logical AND, let us plot its values. Rather than using just four points, we compute its values for a set of 20 x 20 points from the range [0, 1]. First, we define a function that, for a given input function (a tensor) and a linear space, computes the values returned by the function when fed points from the linear space.

def ComputeVals(h, span):
    vals = []
    with tf.Session() as sess:
        for x1_in in span:
            vals.append([
                sess.run(h, feed_dict={
                  x1: x1_in, x2: x2_in}) for x2_in in span
            ])
    return vals

This is a rather inefficient way of doing this; however, at this stage we aim for clarity, not efficiency. To plot the values computed by the h_and tensor we use matplotlib. The result can be seen in Fig 1. We use the coolwarm color map, with blue representing 0 and red representing 1.
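
The post does not include the plotting code itself; the following sketch shows one way Fig 1 could be produced (span is the linear space mentioned above, and and_vals the values computed for h_and):

import numpy as np
import matplotlib.pyplot as plt

span = np.linspace(0.0, 1.0, 20)      # 20 points in [0, 1]
and_vals = ComputeVals(h_and, span)   # 20 x 20 grid of h_and values
plt.imshow(and_vals, cmap='coolwarm', origin='lower', extent=[0, 1, 0, 1])
plt.colorbar()
plt.show()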

Fig 1. Values of a neuron emulating the AND gate

Having created a logical AND, let us apply the same approach to create a logical OR. Following Andrew Ng’s lecture, the bias is set to -10.0, while we use 20.0 as the weights associated with x1 and x2. This has the effect of generating an input greater than or equal to 10.0 if either x1 or x2 is 1, and -10.0 if both are zero. We reuse the same MakeModel function, passing the same x1 and x2 as inputs, but changing the vector theta to [-10.0, 20.0, 20.0]:

h_or = MakeModel([-10.0, 20.0, 20.0], x1, x2)
or_vals = ComputeVals(h_or, span)

When plotted with matplotlib we see the graph shown in Fig 2.

Fig 2. Values of a neuron emulating the OR gate

The negation can be created by putting a large negative weight in front of the variable. Andrew Ng chose 10 - 20x; this way g(x) = 1/(1 + e^{20x - 10}) returns 0.00005 for x = 1 and 0.99995 for x = 0. By using -20 with both x1 and x2 we get a neuron that produces a logical AND of the negations of both variables, also known as the NOR gate: h_{\mbox{nor}}(x) = 1/(1 + e^{-(10 - 20x_1 - 20x_2)}).

h_nor = MakeModel([10.0, -20.0, -20.0], x1, x2)
nor_vals = ComputeVals(h_nor, span)

The plot of values of our h_nor function can be seen in Fig 3.

Fig 3. Value of a neuron emulating the NOR gate

With the last gate, we have everything in place. The first neuron generates values close to one when both x1 and x2 are 1. The third neuron generates values close to one when both x1 and x2 are close to 0. Finally, the second neuron can perform a logical OR of the values generated by those two neurons. Thus our XNOR neuron can be constructed by passing h_and and h_nor as inputs to the h_or neuron. In Tensorflow this simply means that, rather than passing the x1 and x2 placeholders when constructing the h_or function, we pass the h_and and h_nor tensors:

h_xnor = MakeModel([-10.0, 20.0, 20.0], h_nor, h_and)
xnor_vals = ComputeVals(h_xnor, span)

Again, to see what is happening, let us plot the values of h_xnor over the [0, 1] range. These are shown in Fig 4.

Fig 4. Value of a neural net emulating XNOR gate

In a typical Tensorflow application we would not see constants alone being used to create a model; instead, constants are used to initialize variables. The reason we could use only constants here is that we did not intend to train the model: thanks to Andrew Ng, we already knew the final values of all weights and biases.
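
For illustration, here is what a trainable counterpart of MakeModel might look like, a hypothetical sketch in which the same constants merely initialize tf.Variable objects that an optimizer could then adjust:

def MakeTrainableModel(theta, x1, x2):
  # The constants now only provide initial values; the variables are trainable.
  b = tf.Variable(tf.constant(theta[0]), name="b")
  w1 = tf.Variable(tf.constant(theta[1]), name="w1")
  w2 = tf.Variable(tf.constant(theta[2]), name="w2")
  return tf.nn.sigmoid(b + w1 * x1 + w2 * x2)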

Finally, the solution we gave is quite inefficient. We will show next how, by vectorizing it, one can speed it up by a factor of over 200. This is not an insignificant number, considering how simple our model is. In larger models vectorization can give us even more dramatic improvements.

Resources

You can download the Jupyter notebook from which the above code snippets were taken from the github xnor-basic repository.


SVM with Tensorflow

Introduction

In Lecture 12 Andrew Ng introduces support vector machines (SVMs). As he shows, the intuition for what SVMs are can be gleaned from logistic regression. If we have a function h(x) = 1/(1 + e^{-\theta^T x}) that tells us how confident we are that a given x is a positive example, we wish to select \theta that results in h(x) \approx 1 for all positive examples. By the same token, we would like h(x) \approx 0 for all negative examples. The difference between the “least positive” and the “least negative” examples is the margin. By maximizing that margin we maximize the chances of yet unseen positive examples being recognized as such. The same holds for yet unseen negative examples. If we deal with linearly separable data, this is equivalent to finding a hyperplane (in the case of 2D data, just a line) that maximizes the margin between positive and negative examples (see Fig 1).

Fig 1. Positive (red) and negative (blue) examples with the separating hyperplane (line) and its margins.

Without going into formalities, which are much better explained in Andrew Ng’s lecture notes, the task of finding such a hyperplane can be cast as the task of finding support vectors for it. These can be found efficiently using gradient descent methods, with a slightly modified definition of the loss function.
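
Concretely, for labels y_i \in \{-1, +1\}, the commonly used soft-margin formulation minimizes a regularized hinge loss (shown here in its standard textbook form; Andrew Ng’s notes use an equivalent parameterization):

J(w, b) = C \sum_{i=1}^{m} \max\left(0, 1 - y_i \left(w^T x^{(i)} + b\right)\right) + \dfrac{1}{2} \|w\|^2

Examples lying outside the margin contribute nothing to the sum; the examples on or inside the margin are the support vectors, and they alone determine the solution.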

SVM with Tensorflow

Tensorflow added tf.contrib.learn.SVM in version 1.0. It implements the Estimator interface. As with other estimators, the approach is to create the estimator and fit known examples, while periodically evaluating its fitness on a validation set. Once the estimator is trained, it may be exported. From then on, for any new data, you use prediction to classify it.

Preparing Data

The first step is to prepare data similar to that shown in Fig 1. In a real application this data would be collected from external sources rather than generated. We generate a set of 500 random points. Each point is assigned a class: if, for a given point (x, y), y > x, the point is considered part of the positive class; otherwise it falls into the negative class. As randomly generated points are unlikely to have a margin separating positive from negative examples, we add the margin by pushing positive points left and up by (-\delta, \delta), and negative points right and down by (\delta, -\delta), where \delta = 0.5/\sqrt{2}.

import numpy as np

min_y = min_x = -5
max_y = max_x = 5
x_coords = np.random.uniform(min_x, max_x, (500, 1))
y_coords = np.random.uniform(min_y, max_y, (500, 1))
# Class 1 if y > x, class 0 otherwise.
clazz = np.greater(y_coords, x_coords).astype(int)
delta = 0.5 / np.sqrt(2.0)
# Push positive points left/up and negative points right/down by delta.
x_coords = x_coords + ((0 - clazz) * delta) + ((1 - clazz) * delta)
y_coords = y_coords + (clazz * delta) + ((clazz - 1) * delta)

Preparing Input Function

For the given data we create an input function. The role of this function is to feed data and labels to the estimator. The data for the SVM consist of a dictionary holding feature IDs and the features themselves. The labels tell the estimator the class to which each row of features belongs. In a more complex setup the input function could return batches of data read from a disk or over a network. It can indicate the end of data by raising a StopIteration or OutOfRangeError exception. For us, the function trivially returns all 500 points with their labels.

def input_fn():
  return {
      'example_id': tf.constant(
          map(lambda x: str(x + 1), np.arange(len(x_coords)))),
      'x': tf.constant(np.reshape(x_coords, [x_coords.shape[0], 1])),
      'y': tf.constant(np.reshape(y_coords, [y_coords.shape[0], 1])),
  }, tf.constant(clazz)

Training SVM

Once the input function is set up, we create a new SVM estimator. In the constructor we tell it the names of the features, which for us are real-valued columns. We also indicate which column holds the IDs for each row of features. Having done that, we ask the SVM to fit the input data for a fixed number of steps. Since our data is trivially separable, we limit the number of steps to just 30. Next, we run one more step to estimate the quality of the fit. For this trivial example the SVM achieves perfect accuracy. In a real application, the quality should be estimated on data separate from the data used to train the SVM.

feature1 = tf.contrib.layers.real_valued_column('x')
feature2 = tf.contrib.layers.real_valued_column('y')
svm_classifier = tf.contrib.learn.SVM(
  feature_columns=[feature1, feature2],
  example_id_column='example_id')
svm_classifier.fit(input_fn=input_fn, steps=30)
metrics = svm_classifier.evaluate(input_fn=input_fn, steps=1)
print "Loss", metrics['loss'], "\nAccuracy", metrics['accuracy']
Loss 0.00118758
Accuracy 1.0

Predicting Classes for New Data

Once the SVM has been trained, it can be used to predict the classes of new, previously unseen data. To simulate this step we again generate random points and feed them to the trained SVM. The SVM not only returns the class of each point, but also gives us the logits value, which can be used to estimate the confidence in the class assigned to the point. For example, the point (-0.27510791, -0.4940773) has class 0 and logits -0.28906667, indicating that it barely makes class 0. On the other hand, (3.39027299, -2.13721821), which also belongs to class 0, has logits -7.00896215.

x_predict = np.random.uniform(min_x, max_x, (20, 1))
y_predict = np.random.uniform(min_y, max_y, (20, 1))

def predict_fn():
  return {
    'x': tf.constant(x_predict),
    'y': tf.constant(y_predict),
  }

pred = list(svm_classifier.predict(input_fn=predict_fn))
predicted_class = map(lambda x: x['classes'], pred)
annotations = map(lambda x: '%.2f' % x['logits'][0], pred)

The results of classifying the random points, together with the logits value for each point, are shown in Fig 2.

Fig 2. SVM prediction for a set of random points.
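
The exact plotting code is again not shown in the post; here is a sketch of how such a figure could be drawn from the predicted_class and annotations values computed above:

import matplotlib.pyplot as plt

# Color each point by its predicted class and annotate it with its logits.
plt.scatter(x_predict[:, 0], y_predict[:, 0], c=predicted_class, cmap='coolwarm')
for x, y, note in zip(x_predict[:, 0], y_predict[:, 0], annotations):
    plt.annotate(note, xy=(x, y), xytext=(x + 0.1, y + 0.1))
plt.show()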

Resources

You can download the Jupyter notebook with the above code from the github svm repository.