Linear Regression

Linear Regression is a type of Regression Analysis where we fit a set of data points to a straight line.

For example, here are four (x,y) data points which appear roughly in the shape of a straight line.

x	y
1	4
3	4
6	7
9	8

Trial and Error

How do we find the equation for the line that best fits these four points. One way is trial and error.

In this graph we made three attempts, and that third line comes pretty close, but now we need to be more rigorous and more mathematically certain.

Least Squares

Predictor Function

Write the linear equation as a function of $x$. $$f(x) = ax + b$$ This is the predictor function. $f(x)$ gives the predicted value of $y$.

Error

Compare the predicted value to the actual value of y by subtracting. $$f(x) - y$$ This gives us the error of the predictor function for each data point. The error is also sometimes called the residual.

Squared Error

Square the error to make it more intense. $$(f(x) - y)²$$ The squared error.

Sum of Squared Error

Sum the squared error for all $n$ data points. $$\sum_{i=1}^n (f(x_i) - y_i)$$ The sum of the squared errors.

Vectorized

Now let's set up to calculate the least squares for all of the data points in one go using Matrix Math.

Rewrite the predictor function like this. $$f(X,θ) = \theta_0 x_0 + θ_1 x_1$$

where
$θ$ is a vector containing the values for $a$ and $b$ (in reverse order),
$X$ is a matrix containing the $x$ data points, prefixed with a column of ones, and
$y$ is a vector containing the $y$ data points, like this:

$$\theta=\begin{bmatrix}2\\.8\end{bmatrix}\hspace{30pt}x=\begin{bmatrix}1&1\\1&3\\1&6\\1&9\end{bmatrix}\hspace{30pt}y=\begin{bmatrix}4\\4\\7\\8\end{bmatrix}$$

Now the predictor function simplifies to $f(X,θ) = θ<sup>T</sup>X$ using element-wise matrix multiplication, and

Sum of Squared Errors = $∑(((θ X) - y)^2)$

In Octave:

Sum_of_Squared_Errors = sum(theta' .* x,2)

This equation is generalized to use any number of input values, by expanding $θ$ and $X$.

Least Squares

Now we have a single number to represent the accuracy of the predictor function. We can calculate the sum of the squared errors for different predictor functions, and the one with the smallest value, the least squares, is the best predictor.

Optimize

Our goal is to find the optimal $θ$, the one that gives us the least squares.

We will use three systematic methods based on least squares.

Brute Force (blue)
Normal Equation (orange)
Gradient Descent (green)

Normal Equation

Applying Differential Calculus to minimize the Sum of Squared Errors, we come up with this formula:

$$\theta = \left ( X^T X \right )^{-1}X^T y$$

But that was much too easy, so let's continue on with the Gradient Descent method.

Gradient Descent

We can use gradient descent to minimize the cost function. That is to say, we use gradient descent to find the optimal values of theta.

Gradient Descent is an algorithm for finding the optimal theta.

We run the predictor function many times, each time with a different value of theta. It is set up mathematically so that each iteration gets a little closer to the optimal theta.

Iterative Steps

Here is the pseudo code for gradient descent.

set number//of//iterations
set alpha α
set starting θ
For (1 to number_of_iterations)
    Calculate the cost function J(θ)
    Calculate the gradient g(θ)
    Subtract the gradient from θ
end loop

Hypothesis

We can rename the predictor function to the hypothesis. So $f(X,θ)$ becomes $h_θ (X)$. With each iteration, we test a new and improved hypothesis.

Cost Function

The Cost Function is another name for the Sum of Squared Errors formula. Or it can be a modified version of the Sum of Squared Errors formula.

Andrew Ng multiplies the Squared Errors formula by $1/2m$, where m is the number of data points. He says this makes the math work out better.

$$J(\theta ) = \left ( \frac{1}{2m} \right )\sum \left ( h_\theta (x) - y)^2 \right )$$ where $m$ = the number of data points

In Octave, it looks like this:

	J = (1/(2*m)) * sum((sum(theta' .* X,2) - y).^2);

Gradient

Like the derivative, the gradient represents the slope of the tangent of the graph of the function. More precisely, the gradient points in the direction of the greatest rate of increase of the function, and its magnitude is the slope of the graph in that direction.

But the way Andrew Ng is doing it, it's not a derivative function, it's just a percentage of the error times the X factor.

Also, the Andrew Ng course led me to believe that alpha is part of the gradient formula. But it is not. It is again taking a percentage of the gradient to apply to the new theta.

$$g(θ) = α(\frac{∂}{∂θ}) J(θ)$$ where
$α$ = learning rate, and
$\frac{∂}{∂θ}$ = partial derivative of $J(θ)$ with respect to $θ$

Expanded, it looks like this: $$g(θ) = α \frac{∂}{∂θ} (\frac{1}{2m})\sum_{i=1}^m(hθ(x_i)−y_i)^2$$

where
$α$ = learning rate, and
$\frac{∂}{∂θ} J(θ₀,θ₁) = $ derivative
$\frac{∂}{∂θ} J(θ₀,θ₁) = $ partial derivative (same thing)

$$\frac{∂}{∂θ} = 1/m = 1/2$$

In Octave

	gradient =  alpha * (1/(2*m)) * sum(((sum(theta' .* X,2) - y).*2) .* X);

Normalized Data

Variations

References

Source: Andrew Ng, Coursera Machine Language course, week 2
http://mathworld.wolfram.com/NormalEquation.html

Curriculum

Table of Contents

Linear Regression

Trial and Error

Least Squares

Predictor Function

Error

Squared Error

Sum of Squared Error

Vectorized

Least Squares

Optimize

Normal Equation

Gradient Descent

Iterative Steps

Hypothesis

Cost Function

Gradient

Normalized Data

Variations

References