MLA -- Linear Regression Principle
"Linear Regression is the beginning and the most basic algorithm."
Linear Regression is the most basic problem in Machine Learning. Here is a brief summary of the principle of the Linear Regression algorithm.
1. Linear Regression Question
If we have $ m $ sample records, each sample having $ n $ features and one result value:

$ (x^{(1)}_1, x^{(1)}_2, ..., x^{(1)}_n, y_1), (x^{(2)}_1, x^{(2)}_2, ..., x^{(2)}_n, y_2), ..., (x^{(m)}_1, x^{(m)}_2, ..., x^{(m)}_n, y_m) $

Now the question is: if we get a new data point with only $ n $ features, $ (x^{(a)}_1, x^{(a)}_2, ..., x^{(a)}_n) $, how can we predict its result value $ y_a $, and what will that value be?
If $ y_a $ is a continuous value, it is a regression problem.
And if $ y_a $ is a discrete value, it is a classification problem.
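To make the setup concrete, here is a minimal sketch of such a data set in code (NumPy arrays with made-up numbers, purely for illustration):

```python
import numpy as np

# Hypothetical toy data: m = 4 samples, n = 2 features, one result value y.
X = np.array([[60.0, 5.0],
              [80.0, 10.0],
              [100.0, 3.0],
              [120.0, 8.0]])                 # shape (m, n)
y = np.array([150.0, 180.0, 260.0, 300.0])   # shape (m,)

# A new sample with only n features, whose result value y_a we want to predict.
x_new = np.array([90.0, 6.0])
```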
2. Linear Regression Model
If we decide to solve this problem with linear regression, we assume the model has the following form:

$ h_\theta(x_1, x_2, ..., x_n) = \theta_0 + \theta_1 x_1 + ... + \theta_n x_n $

$ \theta_i $ (i = 0, 1, ..., n) is the coefficient of each feature, which is also the model parameter we need to estimate. If we define an extra constant feature $ x_0 = 1 $, the model can be written in a simpler way:
$ h_\theta(x_0, x_1, ..., x_n) = \sum_{i=0}^{n} \theta_i x_i $

If we use matrix representation, the model becomes even simpler and more elegant:
$ h_\theta(X) = X\theta $
$ X $ is an $ m \times n $ matrix ($ m $ sample records and $ n $ features).
$ \theta $ is an $ n \times 1 $ vector (the $ n $ feature coefficients).
$ h_\theta(X) $ is an $ m \times 1 $ vector (the $ m $ predicted $ y $ values of the $ m $ sample records).
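A minimal sketch of the matrix form in code (assuming NumPy and a made-up $ \theta $; a column of ones is prepended so that $ x_0 = 1 $ and $ \theta_0 $ acts as the intercept):

```python
import numpy as np

# h_theta(X) = X @ theta in matrix form.
X = np.array([[60.0, 5.0],
              [80.0, 10.0],
              [100.0, 3.0]])
X_b = np.hstack([np.ones((X.shape[0], 1)), X])  # add the x_0 = 1 column

theta = np.array([10.0, 2.0, -1.0])             # made-up coefficients
h = X_b @ theta                                  # predictions, one per sample
print(h)
```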
Once we have determined the model prototype, we need to determine the Loss Function. Generally, we use the Mean Square Error as the loss function of the linear regression model. The algebraic representation of the loss function is as follows:

$ J(\theta_0, \theta_1, ..., \theta_n) = \frac{1}{2} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}_0, x^{(i)}_1, ..., x^{(i)}_n) - y_i \right)^2 $

In matrix representation:

$ J(\theta) = \frac{1}{2} (X\theta - Y)^T (X\theta - Y) $

$ Y $ is an $ m \times 1 $ vector (the $ m $ actual $ y $ values of the $ m $ sample records).
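A minimal sketch of this loss function in code (using the same NumPy matrix convention as above):

```python
import numpy as np

# Mean square error loss in matrix form:
# J(theta) = 1/2 * (X @ theta - Y)^T (X @ theta - Y)
def loss(X, Y, theta):
    residual = X @ theta - Y
    return 0.5 * residual @ residual  # scalar
```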
3. Linear Regression Algorithm
Once the loss function is known, our goal is to find the parameters $ \theta $ that minimize the value of the loss function. There are two common methods: Gradient Descent and Least Squares.
If we choose the Gradient Descent method, the iteration formula for $ \theta $ is:

$ \theta = \theta - \alpha X^T (X\theta - Y) $

where $ \alpha $ is the learning rate (step size).
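A minimal gradient-descent sketch for this update rule (the learning rate `alpha` and the iteration count are hypothetical values that need tuning):

```python
import numpy as np

# Gradient descent for J(theta) = 1/2 (X theta - Y)^T (X theta - Y);
# the gradient is X^T (X theta - Y).
def gradient_descent(X, Y, alpha=0.01, n_iters=1000):
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        theta = theta - alpha * X.T @ (X @ theta - Y)
    return theta
```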
If we choose the Least Squares method, the closed-form formula for $ \theta $ is:

$ \theta = (X^T X)^{-1} X^T Y $
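A minimal sketch of this normal-equation solution (using `np.linalg.solve` instead of an explicit inverse; it assumes $ X^T X $ is invertible):

```python
import numpy as np

# Least squares: theta = (X^T X)^{-1} X^T Y.
def least_squares(X, Y):
    return np.linalg.solve(X.T @ X, X.T @ Y)
```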
4.1. Generalization of linear regression: Polynomial Regression
If the model contains not only first-order terms of the features, but also second-order or higher-order terms, the model becomes Polynomial Regression. For example, suppose the sample data has 2 features $ (x_1, x_2) $.
Assume the model is in this form:

$ h_\theta(x_1, x_2) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2 + \theta_5 x_1 x_2 $

If we define $ x_3 = x_1^2 $, $ x_4 = x_2^2 $, $ x_5 = x_1 x_2 $, then the polynomial model turns back into linear regression:

$ h_\theta(x_1, x_2, ..., x_5) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \theta_4 x_4 + \theta_5 x_5 $

Therefore, the solution is to build a 5-feature sample $ (x_1, x_2, x_3, x_4, x_5) $ from each 2-feature sample $ (x_1, x_2) $, and then use this 5-feature data to train the linear regression model.
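A minimal sketch of this feature expansion (the helper name `polynomial_features` is made up; scikit-learn's `PolynomialFeatures` offers a similar expansion):

```python
import numpy as np

# Expand each 2-feature sample (x1, x2) into the 5 features
# (x1, x2, x1^2, x2^2, x1*x2); ordinary linear regression is then
# trained on the expanded data.
def polynomial_features(X):
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1 ** 2, x2 ** 2, x1 * x2])
```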
4.2. Generalization of linear regression: Generalized Linear Regression
In section 4.1, we generalized the feature side of the sample data. Now we try to generalize the $ y $ value of the sample data. For example, suppose $ Y $ does not have a linear relationship with $ X $, but $ \ln(Y) $ does:

$ \ln(Y) = X\theta $

In this case, by using $ \ln(y) $ instead of $ y $, we can still handle the problem with the linear regression model.
We can further generalize $ \ln(\cdot) $ to any monotonic differentiable function $ g(\cdot) $. Then the generalized linear regression form is:

$ g(Y) = X\theta $, i.e. $ Y = g^{-1}(X\theta) $
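A minimal sketch of this trick for the $ \ln(Y) $ case (assuming all $ y $ values are positive so the logarithm is defined):

```python
import numpy as np

# If ln(Y) = X theta, fit ordinary least squares on ln(Y) and map
# predictions back with the inverse function exp (i.e. g^{-1}).
def fit_log_linear(X, Y):
    return np.linalg.solve(X.T @ X, X.T @ np.log(Y))

def predict_log_linear(X, theta):
    return np.exp(X @ theta)
```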
5. Regularization of linear regression
In order to prevent overfitting of the model, we often add a regularization term when building the linear model. There are generally two kinds: L1 regularization and L2 regularization.
5.1 L1 regularization
L1 regularization of linear regression is usually called Lasso Regression.
The difference between it and general linear regression is that an L1 regularization term is added to the loss function. The L1 regularization term has a constant coefficient $ \alpha $ to adjust the weight between the mean square error term and the regularization term in the loss function:

$ J(\theta) = \frac{1}{2} (X\theta - Y)^T (X\theta - Y) + \alpha \lVert \theta \rVert_1 $

$ \alpha $ is a constant coefficient and needs to be tuned.
$ \lVert \theta \rVert_1 $ is the L1 norm of $ \theta $.
Lasso regression can shrink the coefficients of some features, and even force coefficients with small absolute values directly to 0, which enhances the generalization ability of the model.
The solution methods of Lasso regression are generally:
- Coordinate Descent
- Least Angle Regression
For details, please check the article Regularization of Regression -- Summary of Lasso Regression.
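As a quick illustration, here is a minimal sketch using scikit-learn's `Lasso` (the data and the regularization coefficient `alpha` are made up; `alpha` needs tuning in practice):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Made-up data: only the 1st and 4th features actually matter.
rng = np.random.default_rng(0)
X = rng.random((100, 5))
y = X @ np.array([3.0, 0.0, 0.0, 1.5, 0.0]) + 0.1 * rng.standard_normal(100)

lasso = Lasso(alpha=0.1)
lasso.fit(X, y)
print(lasso.coef_)  # some coefficients are driven exactly to 0
```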
5.2 L2 regularization
L2 regularization of linear regression is usually called Ridge Regression.
It adds an L2 regularization term to the loss function:

$ J(\theta) = \frac{1}{2} (X\theta - Y)^T (X\theta - Y) + \frac{1}{2} \alpha \lVert \theta \rVert_2^2 $

$ \alpha $ is a constant coefficient and needs to be tuned.
$ \lVert \theta \rVert_2 $ is the L2 norm (not the L1 norm).
Ridge regression shrinks the regression coefficients without discarding any feature, which makes the model relatively stable; but compared with Lasso regression, it keeps all the features in the model, so the interpretability is poorer.
The solution of Ridge regression is relatively simple; the Least Squares method is generally used. Here is the matrix derivation using least squares, which is similar to ordinary linear regression.
Setting the derivative of $ J(\theta) $ with respect to $ \theta $ to 0, we get:

$ X^T (X\theta - Y) + \alpha \theta = 0 $

Then:

$ \theta = (X^T X + \alpha E)^{-1} X^T Y $

$ E $ is the identity matrix.
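A minimal sketch of this closed-form Ridge solution (assuming NumPy; `alpha` is a hypothetical value that needs tuning):

```python
import numpy as np

# Closed-form Ridge: theta = (X^T X + alpha * E)^{-1} X^T Y,
# where E is the identity matrix.
def ridge_least_squares(X, Y, alpha=1.0):
    E = np.eye(X.shape[1])
    return np.linalg.solve(X.T @ X + alpha * E, X.T @ Y)
```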