"Logistic Regression is a kind of very classical and basic classification algorithm."

Logistic Regression is a classification algorithm; it can perform both binary and multi-class classification. It is also a supervised learning algorithm: in the binary case it implements a mapping from a given data set to the outputs 0 and 1. Although it is a classification model, the principle of regression still remains at its core.

1. From Linear Regression to Logistic Regression

In the Linear Regression model, the linear relationship between the output vector Y and the input sample matrix X is $ Y = X\theta $. In this case Y is continuous, hence it is a regression model. If we want the output Y to be a discrete value (for classification), one method is to apply a second transformation g(Y) to Y. Through the function g(Y), all continuous values can be mapped to discrete values: for example, Y belongs to category A when it falls in one real-number interval and to category B when it falls in another.

2. Binary Logistic Regression model

In Logistic Regression, we usually use a special function g to transform the continuous value:

$$g(z) = \frac{1}{1+e^{-z}}$$

This function has some very good properties, which make it well suited to a classification probability model (a short numerical check follows the list):

  • $g(z) \in (0,1)$
  • When z tends to positive infinity $+\infty$, g(z) tends to 1.
  • When z tends to negative infinity $-\infty$, g(z) tends to 0.
  • A good derivative property:

    $$g'(z) = g(z)(1-g(z))$$
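
Below is a minimal numpy sketch (the function and variable names are my own, not from the original) that defines the sigmoid and numerically checks the derivative property $g'(z) = g(z)(1-g(z))$:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

# Check the derivative property g'(z) = g(z) * (1 - g(z)) against a
# central finite difference on a small grid of z values.
z = np.linspace(-5.0, 5.0, 11)
eps = 1e-6
numeric_grad = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
analytic_grad = sigmoid(z) * (1 - sigmoid(z))
print(np.max(np.abs(numeric_grad - analytic_grad)))  # close to 0
```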

If $z = x\theta$, the general form of the logistic regression model is:

$$h_\theta(x) = \frac{1}{1+e^{-x\theta}}$$

x is a 1×n data vector to be predicted (n features).
$\theta$ is an n×1 parameter vector.
$x\theta$ is a scalar, the output of the linear regression part.
$h_\theta(x)$ can be understood as the probability of one certain category. We have this correspondence with our binary sample output y (assuming values 0 and 1):

  • If $h_\theta(x) > 0.5$ (i.e. $x\theta > 0$), predict y = 1.
  • If $h_\theta(x) < 0.5$ (i.e. $x\theta < 0$), predict y = 0.

The smaller the value of $h_\theta(x)$, the higher the probability of being classified as 0; conversely, the larger the value of $h_\theta(x)$, the higher the probability of being classified as 1. If the value is close to the critical point (0.5), the classification is less certain.

In matrix expression:

$$h_\theta(X) = \frac{1}{1+e^{-X\theta}}$$

$h_\theta(X)$ is the output of the logistic regression model, with dimensions m×1.
X is the input sample feature matrix, with dimensions m×n.
$\theta$ is the vector of model coefficients for classification, with dimensions n×1.
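
As a concrete reference for these shapes, here is a minimal numpy sketch (the function names, the 0.5 threshold argument, and the toy shapes are my own choices) of the matrix-form prediction:

```python
import numpy as np

def predict_proba(X, theta):
    """h_theta(X) = sigmoid(X @ theta): an m x 1 vector of P(y=1 | x)."""
    return 1.0 / (1.0 + np.exp(-X @ theta))

def predict(X, theta, threshold=0.5):
    """Apply the decision rule: predict y = 1 when h_theta(x) > 0.5."""
    return (predict_proba(X, theta) > threshold).astype(int)

# Toy example: m = 4 samples, n = 3 features (shapes chosen for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))       # m x n feature matrix
theta = rng.normal(size=(3, 1))   # n x 1 coefficient vector
print(predict_proba(X, theta).shape)  # (4, 1)
print(predict(X, theta).ravel())
```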

After understanding the binary logistic regression model, we have to look at its loss function. Our goal is to minimize the loss function to obtain the corresponding model coefficients $\theta$.

3. Loss Function of Binary Logistic Regression

Since the linear regression model's output is continuous, its loss function is the Sum of Squared Errors (SSE). For the logistic model, however, SSE leads to a non-convex loss, so it is not feasible in this case (we will prove this later in the article). Instead, we can use the Maximum Likelihood Method to derive the loss function.

Suppose the sample output takes two values, 0 or 1; then it follows a Bernoulli distribution.

$$\begin{aligned} &P(y=1|x,\theta) = h_\theta(x) \\ &P(y=0|x,\theta) = 1- h_\theta(x) \end{aligned}$$

Combining the above two formulas, the probability distribution function is:

$$P(y|x,\theta) = h_\theta(x)^y(1-h_\theta(x))^{1-y}$$

$y \in \{0,1\}$
By maximizing the likelihood function, we can derive the model coefficients $\theta$ that we need.

Assuming that the samples are independent and identically distributed, the likelihood function is:

$$L(\theta) = \prod_{i=1}^m (h_\theta(x^{(i)}))^{y^{(i)}} (1 - h_\theta(x^{(i)}))^{1-y^{(i)}}$$

m is the number of samples.

Taking the logarithm of the likelihood function and negating it gives the loss function:

$$\begin{aligned} J(\theta) &= -\ln(L(\theta)) \\ &= -\sum_{i=1}^m \left(y^{(i)}\ln(h_\theta(x^{(i)})) + (1-y^{(i)})\ln(1-h_\theta(x^{(i)}))\right) \end{aligned}$$

In Matrix Expression:

$$J(\theta) = -Y^T\ln(h_\theta(X)) - (E-Y)^T\ln(E-h_\theta(X))$$

E is an all-ones vector.
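
For reference, a minimal numpy sketch of this matrix-form loss (the function name and the small clipping constant `eps` are my own additions, used only to keep the logarithm away from 0):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(theta, X, Y, eps=1e-12):
    """J(theta) = -Y^T ln(h_theta(X)) - (E - Y)^T ln(E - h_theta(X))."""
    H = np.clip(sigmoid(X @ theta), eps, 1 - eps)  # m x 1 predicted probabilities
    E = np.ones_like(Y)                            # the all-ones vector E
    return (-Y.T @ np.log(H) - (E - Y).T @ np.log(E - H)).item()
```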

4. Optimization Methods for the Loss Function

There are many methods to minimize the loss function of the logistic regression model. The most common include gradient descent, coordinate descent, and Newton's method. Here we only derive the per-iteration update formula of gradient descent.

4.1 Algebraic Expression

$$J(\theta) = -\sum_{i=1}^m \left(y^{(i)}\ln(h_\theta(x^{(i)})) + (1-y^{(i)})\ln(1-h_\theta(x^{(i)}))\right)$$

$$\begin{aligned} \frac{\partial}{\partial\theta_j}J(\theta) &= -\frac{\partial}{\partial\theta_j}\sum_{i=1}^m \left(y^{(i)}\ln(h_\theta(x^{(i)})) + (1-y^{(i)})\ln(1-h_\theta(x^{(i)}))\right) \\ &= -\sum_{i=1}^m \left(\frac{y^{(i)}}{h_\theta(x^{(i)})} \frac{\partial h_\theta(x^{(i)})}{\partial\theta_j} - \frac{1-y^{(i)}}{1-h_\theta(x^{(i)})} \frac{\partial h_\theta(x^{(i)})}{\partial\theta_j}\right) \\ &= -\sum_{i=1}^m \left(\frac{y^{(i)}}{h_\theta(x^{(i)})} - \frac{1-y^{(i)}}{1-h_\theta(x^{(i)})}\right) \frac{\partial h_\theta(x^{(i)})}{\partial\theta_j} \\ &= -\sum_{i=1}^m \frac{y^{(i)} - h_\theta(x^{(i)})}{h_\theta(x^{(i)})(1-h_\theta(x^{(i)}))}\, g'(x^{(i)}\theta)\, \frac{\partial (x^{(i)}\theta)}{\partial\theta_j} \\ &= -\sum_{i=1}^m \frac{y^{(i)} - h_\theta(x^{(i)})}{h_\theta(x^{(i)})(1-h_\theta(x^{(i)}))}\, h_\theta(x^{(i)})(1-h_\theta(x^{(i)}))\, x_j^{(i)} \\ &= -\sum_{i=1}^m \left(y^{(i)} - h_\theta(x^{(i)})\right)x_j^{(i)} \\ &= \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} \end{aligned}$$

Each gradient descent iteration is then:

$$\theta_j = \theta_j - \alpha\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}$$

j = 0,1,2,...,n
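
As a sanity check on this formula, here is a minimal sketch (all names are mine) that evaluates the j-th partial derivative by the summation above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def partial_j(theta, X, y, j):
    """j-th partial derivative: sum_i (h_theta(x^(i)) - y^(i)) * x_j^(i)."""
    h = sigmoid(X @ theta)               # m x 1 predictions
    return float(np.sum((h - y) * X[:, [j]]))

# The full gradient is just these partials stacked over the feature indices j.
```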

4.2 Matrix Expression

$$J(\theta) = -Y^T\ln(h_\theta(X)) - (E-Y)^T\ln(E-h_\theta(X))$$

$$\begin{aligned} \frac{\partial}{\partial\theta}J(\theta) &= X^T \left[\frac{1}{h_\theta(X)} \odot h_\theta(X) \odot (E-h_\theta(X)) \odot (-Y) \right] + X^T \left[\frac{1}{E-h_\theta(X)} \odot h_\theta(X) \odot (E-h_\theta(X)) \odot (E-Y) \right] \\ &= X^T(h_\theta(X) - Y) \end{aligned}$$

This uses the chain rule of vector differentiation together with:

  1. $\frac{\partial}{\partial X}\ln(X) = 1/X$
  2. $\frac{\partial}{\partial z}g(z) = g(z)(1-g(z))$, where g(z) is the sigmoid function
  3. $\frac{\partial X\theta}{\partial \theta} = X$

The iteration formula of $\theta$ is:

$$\theta = \theta - \alpha X^T(h_\theta(X)-Y)$$
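
Putting this matrix update into a loop gives a minimal batch gradient descent trainer. This is a sketch under my own assumptions (zero initialization, a fixed step size $\alpha$, synthetic data with made-up shapes), not a production implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_gd(X, Y, alpha=0.01, n_iters=1000):
    """Batch gradient descent: theta <- theta - alpha * X^T (h_theta(X) - Y).

    X is the m x n feature matrix, Y the m x 1 label vector in {0, 1}.
    """
    theta = np.zeros((X.shape[1], 1))
    for _ in range(n_iters):
        theta -= alpha * (X.T @ (sigmoid(X @ theta) - Y))
    return theta

# Toy usage with synthetic, roughly separable data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
Y = (X @ np.array([[1.0], [-2.0]]) > 0).astype(float)  # 200 x 1 labels
print(fit_logistic_gd(X, Y).ravel())
```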

5. Regularization of Logistic Regression

In order to avoid overfitting, we need to consider regularization. The most common methods are L1 and L2 regularization.

  • $L_1$ regularization

    $$J(\theta) = -Y^T\ln(h_\theta(X)) - (E-Y)^T\ln(E-h_\theta(X)) + \alpha\|\theta\|_1$$

    The L1 loss function is not differentiable everywhere, so we can find the parameters that minimize it with methods such as Coordinate Descent or Least Angle Regression.

  • $L_2$ regularization

    $$J(\theta) = -Y^T\ln(h_\theta(X)) - (E-Y)^T\ln(E-h_\theta(X)) + \frac{1}{2}\alpha\|\theta\|_2^2$$

    The optimization of the L2 loss function is similar to ordinary logistic regression; a sketch of the regularized gradient step follows this list.
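
For the L2 case, the penalty $\frac{1}{2}\alpha\|\theta\|_2^2$ only adds an $\alpha\theta$ term to the gradient. A minimal sketch (the regularization weight is named `reg` here to avoid clashing with the learning rate, and all names are my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_gd_l2(X, Y, lr=0.01, reg=1.0, n_iters=1000):
    """Gradient descent on the L2-regularized loss.

    The regularized gradient is X^T (h_theta(X) - Y) + reg * theta.
    """
    theta = np.zeros((X.shape[1], 1))
    for _ in range(n_iters):
        grad = X.T @ (sigmoid(X @ theta) - Y) + reg * theta
        theta -= lr * grad
    return theta
```

In practice, library implementations such as scikit-learn's `LogisticRegression` expose both penalties through a `penalty` parameter; the L1 penalty requires a solver that supports it, such as `liblinear` or `saga`.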

6. Extending the Binary Case: Multinomial Logistic Regression

  • Binary Case

$$\begin{aligned} &P(y=1|x,\theta) = h_\theta(x) = \frac{1}{1+e^{-x\theta}} = \frac{e^{x\theta}}{1+e^{x\theta}} \\ &P(y=0|x,\theta) = 1-h_\theta(x) = \frac{1}{1+e^{x\theta}} \end{aligned}$$

Since y can only be 0 or 1, we have:

$$\ln\left(\frac{P(y=1|x,\theta)}{P(y=0|x,\theta)}\right) = x\theta$$

Now suppose there is a K-class classification model, where the sample output y takes the values 1, 2, ..., K. Following the binary case, we get:

$$\begin{aligned} &\ln\left(\frac{P(y=1|x,\theta)}{P(y=K|x,\theta)}\right) = x\theta_1 \\ &\ln\left(\frac{P(y=2|x,\theta)}{P(y=K|x,\theta)}\right) = x\theta_2 \\ &\cdots \\ &\ln\left(\frac{P(y=K-1|x,\theta)}{P(y=K|x,\theta)}\right) = x\theta_{K-1} \end{aligned}$$

There are K-1 equations above, and the sum of all probabilities is 1:

$$\sum_{i=1}^K P(y=i|x,\theta) = 1$$

This gives K equations in total. Solving these K equations, the probability distribution of the K-class logistic regression is as follows:

$$\begin{aligned} &P(y=k|x,\theta) = \frac{e^{x\theta_k}}{1+\sum_{t=1}^{K-1}e^{x\theta_t}}, \quad k=1,2,\ldots,K-1 \\ &P(y=K|x,\theta) = \frac{1}{1+\sum_{t=1}^{K-1}e^{x\theta_t}} \end{aligned}$$

The loss function derivation and optimization of multinomial logistic regression are similar to those of binary logistic regression; a small sketch of the resulting probability computation follows.
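
To make these K probabilities concrete, here is a minimal numpy sketch (the names, shapes, and the convention that class K is the reference class with an implicit score of 0 are my own way of organizing the formulas above):

```python
import numpy as np

def multiclass_proba(x, Theta):
    """Class probabilities P(y=1), ..., P(y=K) for one sample.

    x     : 1 x n sample vector
    Theta : n x (K-1) matrix whose columns are theta_1, ..., theta_{K-1}
    """
    scores = np.exp(x @ Theta).ravel()   # e^{x theta_k} for k = 1, ..., K-1
    denom = 1.0 + scores.sum()
    return np.append(scores / denom, 1.0 / denom)

# Toy example with n = 2 features and K = 3 classes
x = np.array([[0.5, -1.0]])
Theta = np.array([[0.3, -0.2],
                  [1.0,  0.4]])
p = multiclass_proba(x, Theta)
print(p, p.sum())   # the K probabilities sum to 1
```

In practice the exponentials are usually shifted by the maximum score before exponentiation for numerical stability; that detail is omitted here.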

7. Conclusion

Logistic regression, especially binary logistic regression, is a very common model, and it trains very quickly. Although it is not as mainstream as the support vector machine (SVM), it is sufficient for ordinary classification problems, and its training is also faster than SVM's.


Question: For logistic regression, why is the Sum of Squared Errors (SSE) non-convex and not suitable as the loss function?

Suppose we use SSE as logistic regression's loss function:

$$J(\theta) = \frac{1}{2}\sum_{i=1}^m(\hat{y}_i - y_i)^2$$

$$\hat{y}_i = \frac{1}{1+e^{-x_i\theta}}$$

Whether J is a convex function depends on whether its second derivative is always non-negative.

$$\begin{aligned} \frac{\partial}{\partial\theta_j}J(\theta) &= \sum_{i=1}^m(\hat{y}_i - y_i)\frac{\partial\hat{y}_i}{\partial\theta_j} \\ &= \sum_{i=1}^m(\hat{y}_i - y_i)\hat{y}_i(1-\hat{y}_i)x_j^{(i)} \\ &= \sum_{i=1}^m(-\hat{y}_i^3 + (y_i+1)\hat{y}_i^2 - y_i\hat{y}_i)x_j^{(i)} \end{aligned}$$

$$\begin{aligned} \frac{\partial^2}{\partial\theta_j^2}J(\theta) &= \sum_{i=1}^m(-3\hat{y}_i^2 + 2(y_i+1)\hat{y}_i - y_i)x_j^{(i)}\frac{\partial\hat{y}_i}{\partial\theta_j} \\ &= \sum_{i=1}^m\left[-3\hat{y}_i^2 + 2(y_i+1)\hat{y}_i - y_i\right]\hat{y}_i(1-\hat{y}_i)(x_j^{(i)})^2 \end{aligned}$$

$\hat{y} \in (0,1)$, hence $\hat{y}_i(1-\hat{y}_i)(x_j^{(i)})^2 > 0$ (when $x_j^{(i)} \neq 0$).
Therefore, the sign of the second derivative of J is determined by the term $\left[-3\hat{y}_i^2 + 2(y_i+1)\hat{y}_i - y_i\right]$,
and $y_i \in \{0,1\}$:

  • when $y_i=0$, the term is $\left[-3\hat{y}_i^2 + 2\hat{y}_i\right]$. This term is greater than 0 only when $\hat{y}_i < 2/3$.
  • when $y_i=1$, the term is $\left[-3\hat{y}_i^2 + 4\hat{y}_i - 1\right] = -(3\hat{y}_i-1)(\hat{y}_i-1)$. This term is greater than 0 only when $\hat{y}_i > 1/3$.

As we can see, the sign of the term depends on the prediction: it becomes negative when $y_i=0$ and $\hat{y}_i > 2/3$, or when $y_i=1$ and $\hat{y}_i < 1/3$. The second derivative of J is therefore not always non-negative, so J is not a convex function, and SSE is not suitable as the loss function. A short numerical check of this sign pattern follows.
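
Below is a minimal numerical sketch of that sign argument (the grid and variable names are my own): it evaluates the bracketed term on a grid of $\hat{y}$ values for both label values and prints its sign.

```python
import numpy as np

# Sign of the bracketed term -3*yhat^2 + 2*(y+1)*yhat - y, which (up to the
# positive factor yhat*(1-yhat)*x_j^2) determines the sign of the second
# derivative of the SSE loss.
yhat = np.linspace(0.05, 0.95, 10)
for y in (0, 1):
    term = -3 * yhat**2 + 2 * (y + 1) * yhat - y
    print(f"y={y}:", np.sign(term).astype(int))
# y=0: the term turns negative once yhat > 2/3
# y=1: the term is negative while yhat < 1/3
```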
