"A standard approach in regression analysis to approximate the solution of overdetermined systems."

The method of least squares is a standard approach in regression analysis to approximate the solution of overdetermined systems (sets of equations in which there are more equations than unknowns) by minimizing the sum of the squares of the residuals of every single equation. It is used for model fitting, that is, for finding the parameter values at which the loss function reaches its minimum.

1. Least squares principle

The general form of least squares is simple:

$$ f = \sum (y - \widehat{y})^2 $$

$f$ is the objective function
$y$ is the observed value (actual value)
$\widehat{y}$ is the theoretical value predicted from the model
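As a minimal sketch of this objective function (assuming NumPy; the arrays below are illustrative, not from the original text):

```python
import numpy as np

def objective(y, y_hat):
    """f = sum((y - y_hat)^2), the sum of squared residuals."""
    return np.sum((y - y_hat) ** 2)

# Illustrative observed values vs. model predictions
y = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.1, 1.9, 3.2])
print(objective(y, y_hat))  # ~0.06
```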

The objective function is also the loss function commonly used in machine learning. Our goal is to obtain the fitted model that minimizes this objective function. Take the simplest example, simple linear regression: we have $m$ samples, each with only one feature:
$$ (x_1,y_1), (x_2,y_2), \dots, (x_m,y_m) $$

The hypothesis model is:

$$ h_\theta(x) = \theta_0 + \theta_1 x $$

The loss function is:

$$ \begin{aligned} J(\theta_0,\theta_1) &= \sum_{i=1}^{m}(y_i - h_\theta(x_i))^2 \\ &= \sum_{i=1}^{m}(y_i - \theta_0 - \theta_1 x_i)^2 \end{aligned} $$

The least squares method finds the $(\theta_0,\theta_1)$ that minimizes the value of $J(\theta_0,\theta_1)$.

2. Least squares solution

2.1 Algebraic way

To minimize $J(\theta_0,\theta_1)$, we take the partial derivatives with respect to $\theta_0$ and $\theta_1$ and set both of them to 0, which gives a system of two linear equations in two unknowns.

  • Partial derivative of $J(\theta_0,\theta_1)$ with respect to $\theta_0$:

    $$ -2\sum_{i=1}^{m}(y_i-\theta_0-\theta_1 x_i) = 0 $$

  • Partial derivative of $J(\theta_0,\theta_1)$ with respect to $\theta_1$:

    $$ -2\sum_{i=1}^{m}(y_i-\theta_0-\theta_1 x_i)x_i = 0 $$

Now:

$$ \begin{cases} \sum_{i=1}^{m}(y_i-\theta_0-\theta_1 x_i) = 0 \\ \sum_{i=1}^{m}(y_i-\theta_0-\theta_1 x_i)x_i = 0 \end{cases} $$

$$ \Rightarrow \begin{cases} \sum_{i=1}^{m}y_i - m\theta_0 - \theta_1\sum_{i=1}^{m}x_i = 0 \\ \sum_{i=1}^{m}y_i x_i - \theta_0\sum_{i=1}^{m}x_i - \theta_1\sum_{i=1}^{m}x_i^2 = 0 \end{cases} $$

Then:

$$ \theta_0 = \frac{\sum x_i^2 \times \sum y_i - \sum x_i \times \sum x_i y_i}{m\sum x_i^2 - (\sum x_i)^2} $$

$$ \theta_1 = \frac{m\sum x_i y_i - \sum x_i \times \sum y_i}{m\sum x_i^2 - (\sum x_i)^2} $$

If each sample has more than one feature, we have multiple linear regression. We still set the partial derivatives to 0 to form a system of equations and then solve for the unknown parameters; the principle remains the same.
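As a minimal sketch of this algebraic solution (assuming NumPy; the function name and data below are illustrative, not from the original text), the closed-form formulas for $\theta_0$ and $\theta_1$ translate directly into code:

```python
import numpy as np

def simple_least_squares(x, y):
    """Closed-form theta_0 and theta_1 from the formulas derived above."""
    m = len(x)
    sum_x, sum_y = np.sum(x), np.sum(y)
    sum_xx, sum_xy = np.sum(x * x), np.sum(x * y)
    denom = m * sum_xx - sum_x ** 2
    theta_0 = (sum_xx * sum_y - sum_x * sum_xy) / denom
    theta_1 = (m * sum_xy - sum_x * sum_y) / denom
    return theta_0, theta_1

# Points generated from y = 1 + 2x should recover theta_0 = 1, theta_1 = 2
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])
print(simple_least_squares(x, y))  # (1.0, 2.0)
```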

2.2 Matrix way

The matrix form is more concise than the algebraic form, and it replaces explicit loops with matrix operations (the underlying computation is essentially unchanged). For this reason, many books and machine learning libraries use the matrix method for least squares.

Here we use the multiple linear regression model above to describe the matrix solution: $h_\theta(x_1,x_2,\dots,x_n) = \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n$. In matrix form:

$$ h_\theta(X) = X\theta $$

$X$ is an $m \times n$ matrix ($m$ records and $n$ features).
$\theta$ is an $n \times 1$ vector (the $n$ coefficients of the $n$ features).
$h_\theta(X)$ is an $m \times 1$ vector (the $m$ predicted $y$ values of the $m$ sample records).

The loss function is:

$$ J(\theta) = \frac{1}{2}(X\theta - Y)^T(X\theta - Y) $$

$Y$ is an $m \times 1$ vector (the $m$ actual $y$ values of the $m$ sample records).
The factor $\frac{1}{2}$ is mainly there to simplify the derivative later.

Set the derivative to 0:

$$ \frac{\partial}{\partial\theta}J(\theta) = X^T(X\theta - Y) = 0 $$

This is based on two matrix derivative formulas:
Formula 1: $\frac{\partial}{\partial X}(X^TX) = 2X$
Formula 2 (chain rule): $\nabla_X f(AX+B) = A^T\nabla f(AX+B)$

Then:

$$ X^TX\theta = X^TY $$

$$ \theta = (X^TX)^{-1}X^TY $$

$(X^TX)^{-1}X^TY$ is our answer.
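A minimal NumPy sketch of this matrix solution (assuming, as is common, that a column of ones is prepended to $X$ so that $\theta$ also contains the intercept $\theta_0$; the data and names below are illustrative):

```python
import numpy as np

def normal_equation(X, Y):
    """theta = (X^T X)^{-1} X^T Y, computed by solving the linear system."""
    return np.linalg.solve(X.T @ X, X.T @ Y)

# Same illustrative data as before, with a column of ones for the intercept
x = np.array([0.0, 1.0, 2.0, 3.0])
Y = np.array([1.0, 3.0, 5.0, 7.0])
X = np.column_stack([np.ones_like(x), x])  # shape (m, 2)
print(normal_equation(X, Y))  # [1. 2.]
```

Solving $X^TX\theta = X^TY$ with `np.linalg.solve` avoids forming the inverse explicitly, which is numerically better behaved than calling `np.linalg.inv`.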

3. Limitations and applicable scenarios of the least squares method

The least squares solution is simple and efficient; unlike gradient descent, it requires no iterations. But it still has some limitations:

  1. Least squares needs to compute the inverse of $X^TX$, but that inverse may not exist. In this case least squares cannot be used directly, while gradient descent still works. Alternatively, we can clean the sample data and remove redundant features so that the determinant of $X^TX$ is no longer 0, and then continue to use least squares (see the sketch after this list).

  2. When the number of features $n$ is very large, computing the inverse of $X^TX$ is very time-consuming, or even infeasible. In this case it is better to use gradient descent (if $n > 1000$, least squares is not recommended). Alternatively, we can use Principal Component Analysis (PCA) to reduce the dimensionality of the features and then apply least squares.

  3. If the fitting function (the model) is not linear, least squares cannot be used directly, while gradient descent still can. Alternatively, we can transform the fitting function into a linear one through some technique and then apply least squares.

  4. There are some special situations:

    • When the sample size $m$ is small, less than the number of features $n$, the fitting equation is underdetermined, and the commonly used optimization methods cannot fit the data.
    • When the sample size $m$ equals the number of features $n$, the parameters can be obtained by solving the system of equations directly.
    • When the sample size $m$ is greater than $n$, the fitting equation is overdetermined, which is the scenario where we usually use least squares.
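To illustrate the first limitation concretely, here is a minimal NumPy sketch (the data and names are illustrative): a redundant feature column makes $X^TX$ singular, so solving the normal equations fails, while an SVD-based solver such as `np.linalg.lstsq` still returns a minimum-norm solution.

```python
import numpy as np

# Two identical (redundant) feature columns make X^T X singular
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x, x])  # third column duplicates the second
Y = np.array([1.0, 3.0, 5.0, 7.0])

try:
    theta = np.linalg.solve(X.T @ X, X.T @ Y)  # normal equations fail here
except np.linalg.LinAlgError as err:
    print("Normal equations failed:", err)

# SVD-based least squares still returns a minimum-norm solution
theta, _, rank, _ = np.linalg.lstsq(X, Y, rcond=None)
print("theta:", theta, "rank:", rank)  # rank 2 < 3 columns reveals the redundancy
```

The reported rank (2 instead of 3) is also a quick way to detect the redundant feature before deciding whether to drop it or to switch methods.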