Gradient Descent

Gradient descent is an algorithm used to minimize a cost function of two or more variables.

Each possible combination of values of those variables represents a point. The gradient is a vector that points from that point in the direction of steepest ascent. We follow in the opposite direction indicated by the gradient, from point to point, until we reach the minimum value of the cost function. The term “descent” describes this step-by-step movement towards the minimum point.

Cost Function

A cost function is any function that represents a cost or loss or error.

Gradient Ascent

The term gradient ascent is sometimes used to do the inverse and maximize a profit function.

Learning Rate

The learning rate is an arbitrary value, like .3, .03, or .003, that is the size of the step taken on each step. It is the portion of the gradient vector applied to the point to find the next step.

Algorithm

Here is the pseudo code for the gradient descent algorithm.

set number_of_iterations = 100 # arbitrary value
set learning_rate α = .1 # arbitrary value
set starting θ = [0 0] # random values
for (1 to number_of_iterations)
    set ∇g = []            #calculate the gradient
    set θ = θ - (α * ∇g)   #increment the point by a portion of the gradient
end loop

Applications

Gradient Descent is used with linear regression and with neural nets to minimize the error of a predictor function.

Example 1.

For the function

$$f(x,y) = (x^2 + y^2)$$

the gradient is

$$\nabla f(x,y) = \begin{bmatrix} \frac{\partial }{\partial x}f(x,y) & \frac{\partial }{\partial y}f(x,y) \end{bmatrix} \\ = \begin{bmatrix} 2x & 2y \end{bmatrix}$$

The graph of the function is a bowl-shaped concave parabola.

For the random point at $x=.75, y=-1.5$,

the gradient is given as $$ \nabla f(.75,-1.5) = \begin{bmatrix}1.5 & -3.0\end{bmatrix} $$

shown as the highest red dot on the graph.

We run the algorithm shown above and for each iteration the resulting point is shown by an additional red dot on the graph. The point gets closer and closer to the minimum point with every iteration. Final result:

Example 2.

For a linear regression example, we have the linear function and input data as follows.

$$Y = (aX + b) \enspace \text{ where } \enspace X = \begin{bmatrix}1\\ 4\\ 6\\ 9\end{bmatrix} Y = \begin{bmatrix}4\\ 4\\ 7\\ 8\end{bmatrix}$$

Our goal is to find optimal values for a and b.

For the cost function we will choose one-half the mean squared error: $$c = \frac{1}{2} * \frac{1}{m} * \sum(((aX + b) - Y)^2)$$

We seek to minimize this cost function using gradient descent.

Along with the cost function we need its gradient:

$$\nabla g = \begin{bmatrix} \frac{\partial}{\partial a} & \frac{\partial}{\partial b}\end{bmatrix}$$

which is a vector composed of the two partial derivatives:

$$\frac{\partial}{\partial a} = \frac{1}{m} * \sum((aX + b) - Y)$$ $$\frac{\partial}{\partial b} = \frac{1}{m} * \sum(((aX + b) - Y) * X)$$

Number of datapoints $m = 4$.

We choose a starting point of $a = .7$ and $b = 3.5$.

We set number_of_iterations to $40$.

And we set the learning rate $\alpha = .01$.

We run the gradient descent algorithm shown by pseudo-code above, and the result of each iteration is shown as a red dot on the graph.

Final result: the lowest value of the cost function is $0.13805$ for the optimal values of $a=0.60770$ and $b = 2.84019$.

Topics

Overfit

Variations

Batch Gradient Descent

This is the basic procedure we described above. The word batch describes the fact that we calculate the gradient on the entire set of training data at each iteration.

Stochastic Gradient Descent

The opposite of batch. Calculate the gradient on each datapoint, not the whole set of training data. Results in wider variation and big jumps at each step. Pro: Works with large datasets. May jump over the local minima to find the global minimum. Con: May continuously over-jump and never get to the minimum.

Mini-Batch Gradient Descent

Calculate the gradient on groups datapoints, but not the whole set of training data. A compromise between batch and mini-batch. Often the term Stochastic Gradient Descent is referring to Mini-batch.

Momentum

Add a fraction of the gradient of the previous iteration to the current iteration. Giving more weight to the overall general direction. Increasing momentum. Pro: Dampens the oscillations. Con: With too much momentum it may overshoot the minimum.

Nesterov Accelerated Gradient (NAG)

Yurii Nesterov in 1983 invented a modifed momentum technique. Apply the momentum step before the gradient step.

Adaptive Gradient (ADAGRAD)

Apply the Nesterov concept to the learning rate as well as to the gradient. Con: Learning rate gets so small that learning stops.

Adaptive Delta (ADADELTA)

Use a running average of previous gradients to apply to learning rate, so the learning rate does not get so small so fast.

RMSPROP

Adaptive Moment Estimation (ADAM)

References

The Octave scripts for these examples are available here: http://samantha.voyc.com/doc/script/octave/ Siraj Raval: The Evolution of Gradient Descent: https://youtu.be/nhqo0u1a6fw

Curriculum

Table of Contents