The mechanics of logistic regression are similar to those of linear regression, but the applications and the equations are different.
The term logistic regression is strange. A better term might be binary classification. The value of y is always either 1 or 0.
Differences between logistic regression and linear and non-linear regressions.
Similar to linear regression, except y is always either 1 or 0.
Why the word “logistic”?
Why the word “regression”, when it's actually a “classification” problem?
The most widely used AI algorithm.
Summary of differences between linear regression and logistic regression
| | Linear regression | Logistic regression |
|---|---|---|
| Application | Predictor | Classifier |
| Data Fitting | on the line | to one side of the line |
| Hypothesis | linear function | decision boundary function substituted into sigmoid function |
| Cost function | squared error | log of error |
1958, UK: David Cox invents logistic regression.
Logistic regression is a classification algorithm.
More specifically, it is a binary classification algorithm. It can determine yes or no. 1 or 0.
A classification function is actually a type of prediction.
| Binary Classification | Prediction |
|---|---|
| Malignant or benign | What is the probability the tumor is malignant |
| Spam or legitimate | What is the probability an email is spam |
| Fraud or legitimate | What is the probability a financial transaction is an attempted fraud |
| Pass or fail | What is the probability a given student will graduate |
| Jail or freedom | What is the probability a psychiatric patient will commit a crime |
data fitting = finding the decision boundary
1. formulate the decision boundary function
2. substitute the decision boundary function into the sigmoid function $h_\theta(x) = \frac{1}{1+e^{-\theta^T x}}$
We say: $y$ must be $0$ or $1$.
And then we say: the hypothesis $h_{\theta}(x)$ must be between $0$ and $1$.
Language Alert!
The hypothesis $h_{\theta}(x) = $ the probability that $y = 1$.
If $h_{\theta}(x) \ge .5$, then $y = 1$.
If $h_{\theta}(x) \lt .5$, then $y = 0$.
The hypothesis function is named the logistic function or the sigmoid function.
The predictor function is normally the Sigmoid function.
The word logistic is similar to the word logic which is sometimes used to mean binary. So for practical purposes we can think of the word logistic as synonymous with binary in this context. Language Alert!
A sigmoid function produces a sigmoid curve. The word sigmoid refers to an s-shaped curve - a curve that looks like the letter s. It comes from the word sigma which is a greek letter that looks something like the latin letter s.
Step function. Sigmoid function.
The function $h_\theta(x)$ gives the probability that $y==1$.
$$\sigma(z) \equiv \frac{1}{1+e^{-z}} \quad \text{or} \quad \sigma(z) \equiv \frac{1}{1+\exp\left(-\sum_j w_j x_j - b\right)}$$
$$h_\theta (x) = g(\theta^T x)\\ z = \theta^T x\\ g(z) = \frac{1}{1+e^{-z}}$$
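A minimal sketch of these formulas in Python, assuming NumPy. The names `sigmoid`, `hypothesis` and `classify` and the parameter values are illustrative choices, not part of the original notes.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, X):
    """h_theta(x) = g(theta^T x), evaluated for every row x of X."""
    return sigmoid(X @ theta)

def classify(theta, X, threshold=0.5):
    """Return 1 where h_theta(x) >= threshold, otherwise 0."""
    return (hypothesis(theta, X) >= threshold).astype(int)

# Made-up parameters and inputs, for illustration only.
theta = np.array([-5.0, 1.0])
X = np.array([[1.0, 2.0], [1.0, 5.0], [1.0, 8.0]])  # first column is the intercept term

print(hypothesis(theta, X))  # probabilities strictly between 0 and 1
print(classify(theta, X))    # the same rows mapped to 0 or 1
```

Here $z = \theta^T x$ is $-3$, $0$ and $3$ for the three rows, so the middle row sits exactly on the decision boundary where $h_\theta(x) = 0.5$.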
Total probability must equal 1. y must be either 1 or 0.
The cost function is based on the log of the error, not the squared error. $$J(\theta) = \frac{1}{m} \sum_{i=1}^m Cost(h_\theta(x_i),y_i)$$
Where the error or cost of each datapoint is:
\begin{align} Cost(h_\theta(x),y) &= -\log(h_\theta(x)) &\text{if } y=1\\ Cost(h_\theta(x),y) &= -\log(1-h_\theta(x)) &\text{if } y=0 \end{align}
These two conditional cases can be combined as follows.
$$Cost(h_\theta(x),y) = -y \log(h_\theta(x)) - (1-y) \log(1-h_\theta(x))$$
$$J(\theta) = -\frac{1}{m} \sum_{i=1}^m \left[ y_i \log(h_\theta(x_i)) + (1-y_i) \log(1-h_\theta(x_i)) \right]$$
This logarithm-based cost function comes from the Maximum Likelihood Estimation method from the field of Statistics.
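A brief sketch of that connection, assuming the $m$ training examples are independent and that $h_\theta(x_i)$ is the probability that $y_i = 1$:

\begin{align*} P(y_i \mid x_i; \theta) &= h_\theta(x_i)^{y_i} \, (1-h_\theta(x_i))^{1-y_i}\\ L(\theta) &= \prod_{i=1}^m h_\theta(x_i)^{y_i} \, (1-h_\theta(x_i))^{1-y_i}\\ \log L(\theta) &= \sum_{i=1}^m \left[ y_i \log(h_\theta(x_i)) + (1-y_i) \log(1-h_\theta(x_i)) \right] \end{align*}

Maximizing the log-likelihood $\log L(\theta)$ is therefore the same as minimizing $J(\theta) = -\frac{1}{m} \log L(\theta)$.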
Vectorized, like this:
$$h = g(X\theta)$$ $$J(\theta) = \frac{1}{m} \cdot \left ( -y^T \log(h) - (1-y)^T \log(1-h) \right )$$
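The vectorized cost can be sketched in NumPy as follows. The function name `cost` and the three-row dataset are assumptions made for the illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """J(theta) = (1/m) * (-y^T log(h) - (1 - y)^T log(1 - h)), with h = g(X theta)."""
    m = len(y)
    h = sigmoid(X @ theta)
    return (-(y @ np.log(h)) - (1 - y) @ np.log(1 - h)) / m

# Made-up data: an intercept column of ones plus one feature.
X = np.array([[1.0, 1.0], [1.0, 3.0], [1.0, 5.0]])
y = np.array([0.0, 0.0, 1.0])

print(cost(np.zeros(2), X, y))            # every h is 0.5 here, so J = log(2)
print(cost(np.array([-4.0, 1.0]), X, y))  # parameters that separate the classes cost less
```

This sketch does not guard against $h$ reaching exactly 0 or 1, where the logarithm is undefined.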
The gradient is the vector of the partial derivatives of $J$ with respect to $\theta_1$ and $\theta_2$:
\begin{align*} \frac{\partial J}{\partial \theta_1} &= \frac{1}{m} \sum_{i=1}^m (h_\theta(x_i) - y_i)\\ \frac{\partial J}{\partial \theta_2} &= \frac{1}{m} \sum_{i=1}^m (h_\theta(x_i) - y_i) x_i\\ \nabla J &= \begin{bmatrix} \frac{\partial J}{\partial \theta_1} \\ \frac{\partial J}{\partial \theta_2} \end{bmatrix} \end{align*}
$$\text{Repeat } \{\\ \theta := \theta - \alpha \nabla J(\theta)\\ \}$$
where $\alpha = $ learning rate
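The update rule can be sketched in NumPy as below. The vectorized gradient $\frac{1}{m} X^T (h - y)$ collects the partial derivatives into one vector. The function names, the fixed iteration count and the two-point dataset are assumptions for the illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient(theta, X, y):
    """Gradient of J: (1/m) * X^T (h - y), one partial derivative per parameter."""
    m = len(y)
    return X.T @ (sigmoid(X @ theta) - y) / m

def gradient_descent(X, y, theta, alpha, num_iters):
    """Repeat theta := theta - alpha * grad J(theta) for a fixed number of steps."""
    for _ in range(num_iters):
        theta = theta - alpha * gradient(theta, X, y)
    return theta

# Made-up data: an intercept column of ones plus one feature.
X_demo = np.array([[1.0, 0.0], [1.0, 1.0]])
y_demo = np.array([0.0, 1.0])

theta = gradient_descent(X_demo, y_demo, np.zeros(2), alpha=0.5, num_iters=100)
print(theta)
```

A fixed iteration count keeps the sketch short. Stopping once the cost no longer decreases noticeably is a common alternative.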
The data has one independent variable.
Tumor is malignant, yes or no.
Our $X,Y$ training data set is: $$X = \begin{bmatrix}1&1\\1&2\\1&4\\1&6\\1&8\end{bmatrix} \enspace Y = \begin{bmatrix}0\\ 0\\ 0\\ 1\\ 1\end{bmatrix}$$
where
X = a column of ones (the intercept term) followed by the size of the tumor, and
Y = 1 if the tumor is malignant, 0 if it is benign.
This data is shown by the red dots in the graph.
The decision boundary function is the linear function $\theta^T X$,
with the starting parameters: $$\theta = \begin{bmatrix} 3\\5\end{bmatrix}$$
With these parameters, the decision boundary function is the blue line on the graph and the predictor function is the red line.
After gradient descent the optimal parameter values are: $$\theta = \begin{bmatrix} 3\\5\end{bmatrix}$$
For a new datapoint X = 9, the classifier function gives: \begin{align*} h_\theta(9) &= \frac{1}{1+e^{-\theta^T [1\ 9]^T}}\\ h_\theta(9) &= .98 \rightarrow y = 1 \therefore \text{The tumor is malignant.} \end{align*}
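The whole example can be sketched end to end in NumPy. The starting point $\theta = 0$, the learning rate 0.1 and the 20,000 iterations are assumptions chosen for the illustration. Because this training set is perfectly separable, the parameters keep growing the longer gradient descent runs, so the fitted numbers depend on those choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Training data from the text: intercept column, then tumor size.
X = np.array([[1, 1], [1, 2], [1, 4], [1, 6], [1, 8]], dtype=float)
Y = np.array([0, 0, 0, 1, 1], dtype=float)

theta = np.zeros(2)      # assumed starting point
alpha = 0.1              # assumed learning rate
for _ in range(20000):   # assumed iteration count
    theta = theta - alpha * (X.T @ (sigmoid(X @ theta) - Y)) / len(Y)

x_new = np.array([1.0, 9.0])   # intercept term, tumor size 9
p = sigmoid(theta @ x_new)     # estimated probability that the tumor is malignant
label = int(p >= 0.5)          # 1 = malignant, 0 = benign

print(theta, p, label)
```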
Two independent variables $X_1$ and $X_2$.
Decision boundary is a line with positive slope.
Two independent variables $X_1$ and $X_2$.
Decision boundary is an ellipse.
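A boundary of this shape arises when the decision boundary function contains squared features. The sketch below assumes the hypothetical parameters $\theta = [-1, 1/4, 1/9]$ applied to the features $[1, x_1^2, x_2^2]$, which puts the boundary on the ellipse $x_1^2/4 + x_2^2/9 = 1$. These values are invented for the illustration and are not fitted to any data.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([-1.0, 1.0 / 4.0, 1.0 / 9.0])  # hypothetical, for illustration only

def predict(x1, x2):
    """Return 1 if the point lies on or outside the ellipse, otherwise 0."""
    features = np.array([1.0, x1 ** 2, x2 ** 2])
    return int(sigmoid(theta @ features) >= 0.5)

print(predict(0.0, 0.0))  # a point inside the ellipse
print(predict(3.0, 3.0))  # a point outside the ellipse
```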
Alternatives to gradient descent include conjugate gradient, BFGS, and L-BFGS.
These three algorithms are implemented in Octave and other libraries. They are more complex than gradient descent, but are often faster. Furthermore, you do not have to choose a learning rate, because these algorithms choose an appropriate learning rate for each iteration automatically.
Also, feature scaling (normalization) can make logistic regression run faster.
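One common form of feature scaling is standardization, sketched below with NumPy. The function name `standardize` and the sample values are assumptions for the illustration.

```python
import numpy as np

def standardize(features):
    """Scale each column to zero mean and unit standard deviation."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / std, mean, std

# Made-up raw features on very different scales (no intercept column here:
# a constant column has zero standard deviation and must be left unscaled).
raw = np.array([[1.0, 200.0], [2.0, 400.0], [4.0, 800.0], [8.0, 1600.0]])
scaled, mean, std = standardize(raw)

print(scaled.mean(axis=0))  # approximately 0 for every column
print(scaled.std(axis=0))   # approximately 1 for every column
```

The returned `mean` and `std` should be reused to scale any new datapoint before it is classified.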
| Dataset | Classes |
|---|---|
| Emails | friends, family, work, newsletters |
| Tomorrow's weather | sunny, cloudy, rain or snow |
To solve a multi-class classification problem with logistic regression, break it into separate binary classification problems, one per class.
For example, if you are predicting tomorrow's weather as sunny, cloudy, rain or snow, you must treat this as four different logistic regression problems: 1) sunny or not, 2) cloudy or not, 3) rain or not, 4) snow or not.
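This approach can be sketched as follows: train one "class $k$ or not" classifier per class, then pick the class whose classifier reports the highest probability. The function names, the learning rate, the iteration count and the three-cluster dataset are all assumptions for the illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_binary(X, y, alpha=0.05, num_iters=5000):
    """Fit one binary logistic regression classifier by gradient descent."""
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        theta = theta - alpha * (X.T @ (sigmoid(X @ theta) - y)) / len(y)
    return theta

def train_one_vs_all(X, labels, num_classes):
    """Train one classifier per class; row k of the result holds theta for class k."""
    return np.array([train_binary(X, (labels == k).astype(float))
                     for k in range(num_classes)])

def predict_one_vs_all(thetas, X):
    """For each row of X, pick the class with the highest predicted probability."""
    return np.argmax(sigmoid(X @ thetas.T), axis=1)

# Made-up 2-D points in three well-separated clusters (classes 0, 1, 2).
points = np.array([[0, 4], [0, 5], [1, 5], [-1, 5],
                   [-4, -4], [-5, -4], [-5, -5], [-4, -5],
                   [4, -4], [5, -4], [5, -5], [4, -5]], dtype=float)
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
X = np.hstack([np.ones((len(points), 1)), points])  # prepend the intercept column

thetas = train_one_vs_all(X, labels, num_classes=3)
print(predict_one_vs_all(thetas, X))
```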
optimization classification prediction
Prediction and classification are calculated in terms of probability. Every fact is a prediction.
“I'd like to give you an intuition about logistic regression.” language alert! intuition
why do we make theta a column vector and transpose it whenever we use it? why not make it a row vector to start with?
gradient is a vector, row or column?
logistic regression classifier: $h_p(x) = g(p_1 x_1 + p_2 x_2 + p_3 x_3)$
converge convergence : add to Gradient Descent section