Linear regression

regularizer · Oct 3, 2018 · 8 min read

This article summarizes what I’ve learned about linear regression based on the content from CS229, ISL, and linear algebra.

  • A cost/loss function quantifies the divergence between a hypothesis and the facts.
  • The modeling problem turns into an optimization problem that minimizes the least-squares cost function as the objective.
  • The optimization yields an “optimal” set of parameters, w, for the ordinary least-squares regression model.
  • The least-squares cost function makes sense based on the principle of maximum likelihood estimation and the common assumption that the errors follow a Gaussian distribution.
  • The Q-Q plot is used to verify the normality of the errors/residuals, which is the assumption behind the maximum likelihood estimation. The residual plot is used to verify whether non-linearity remains in the residuals and whether the variance of the residuals is heterogeneous across the fitted values.

Let’s say we have a problem: predicting a target value (y) from some data (x). Let’s assume there is a linear relationship between the dependent variable y and the independent variable x. The hypothesis is h(x, w), where w is the parameter. The problem now is how to find the optimal w for h(x, w). The definition of “optimal” is mathematically represented as optimizing a cost/loss/objective function Loss(w).

First, we need to define Loss(w). One of the most common cost functions is the sum of squared differences between h(x, w) and y:

Loss(w) = ∑(h(x, w) − y)²

Second, “optimal” is defined as minimizing Loss(w), which is intuitively straightforward because all we care about at this point is how well our hypothesis h(x, w) aligns with y, and Loss(w) quantifies the distance between h(x, w) and y. This cost function is called the ordinary least-squares cost function, and the resulting linear regression model is called the ordinary least-squares regression model.

Third, now that the modeling problem has turned into an optimization problem, how do we minimize the ordinary least-squares function? Because this function is differentiable, gradient descent (GD) can be applied:

Δw = −𝛼 × d(Loss(w))/dw

w ← w + Δw = w − 𝛼 × d(Loss(w))/dw

For ordinary least squares, gradient descent gives rise to the LMS (least mean squares), or Widrow-Hoff, update rule. Because Loss(w) is the sum of the loss across all the training samples, Δw is also the sum of the per-sample updates. That is, w is only updated once the total loss has been computed over all the samples. This is batch gradient descent.

Alternatively, stochastic gradient descent (SGD) updates w every time the loss of a single sample is computed, which updates w much more frequently than batch GD: if there are n samples in the training set, SGD updates n times more often per pass over the data. Mini-batch SGD sits in the middle of this spectrum: the training data is split into chunks of the same size, the so-called batch size, with each batch containing at least one and at most n samples. Interestingly, many empirical results show that mini-batch SGD works even better than batch GD for large datasets. The loss converges faster to the local/global minimum with mini-batch SGD when the learning rate and batch size are chosen properly.
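To make the loop concrete, here is a minimal NumPy sketch of mini-batch SGD on the least-squares loss. The data, shapes, and hyperparameters are made up for illustration; setting batch_size to len(y) recovers batch GD, and setting it to 1 gives plain SGD.

```python
import numpy as np

# Illustrative synthetic data: n samples, d features (made-up shapes).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=100)

def grad(w, Xb, yb):
    # Gradient of the least-squares loss sum((Xb @ w - yb)**2) w.r.t. w.
    return 2 * Xb.T @ (Xb @ w - yb)

w = np.zeros(3)
alpha = 1e-3       # learning rate
batch_size = 10    # 1 -> SGD, len(y) -> batch GD, in between -> mini-batch

for epoch in range(100):
    idx = rng.permutation(len(y))           # reshuffle each epoch
    for start in range(0, len(y), batch_size):
        b = idx[start:start + batch_size]
        w -= alpha * grad(w, X[b], y[b])    # update after each mini-batch

print(w)  # should be close to true_w
```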

Of course, gradient descent is not the only way to solve the optimization, but for differentiable functions it is the most widely used method. Furthermore, for this particular least-squares function, GD is not the only way to find w: linear algebra gives the normal equation XᵀXw = Xᵀy. However, in practice it is computationally expensive and numerically risky to invert a matrix directly. (More linear algebra is needed for the situations where the dimensions of the matrix, m and n, are not equal.)
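As a sketch of this closed-form route (reusing the X and y from the sketch above): solving the normal equation avoids forming an explicit inverse, and NumPy’s SVD-based least-squares solver covers the general, possibly rank-deficient case.

```python
import numpy as np

# Solve the normal equation (X^T X) w = X^T y without forming an inverse.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# In practice, an SVD-based least-squares solver is more robust,
# and it also handles non-square or rank-deficient X.
w_lstsq, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
```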

Great. Now we can roll up our sleeves and write some code to compute the weights/parameters w given a dataset. For example, we can use the gradient descent optimizer in TensorFlow to find w iteratively over batches and epochs until the loss converges to some low value. Problem solved! But there are two underlying assumptions we haven’t thought about very much.
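Here is what that might look like as a minimal sketch with the TensorFlow 2 API; the variable names are illustrative, and the data is reused from the NumPy sketch above.

```python
import tensorflow as tf

X_t = tf.constant(X, dtype=tf.float32)
y_t = tf.constant(y.reshape(-1, 1), dtype=tf.float32)

w = tf.Variable(tf.zeros([3, 1]))   # weights
b = tf.Variable(0.0)                # bias
opt = tf.keras.optimizers.SGD(learning_rate=1e-2)

for step in range(500):
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(tf.square(X_t @ w + b - y_t))  # MSE loss
    grads = tape.gradient(loss, [w, b])
    opt.apply_gradients(zip(grads, [w, b]))
```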

One assumption is the linear relationship between x and y. Let’s put it aside for now; we can add non-linearity to the hypothesis in the future. I want to add some notes on the second assumption: why does minimizing least squares give the “optimal” w for a predictive model? Let’s write down the relationship between y and h(x, w), where ℇ denotes the error for each sample.

y = h(x, w) + ℇ

Let’s assume the errors ℇ are IID according to a Gaussian distribution with zero mean and variance 𝜎². Thus,

p(ℇ) = 1/(√(2𝜋)·𝜎) × exp(−ℇ²/(2𝜎²))

p(y − h(x, w)) = 1/(√(2𝜋)·𝜎) × exp(−(y − h(x, w))²/(2𝜎²)) = p(y|x; w) = L(w|x, y)

Here, p(y|x; w) denotes the probability distribution of y conditioned on x and parameterized by w. Note that p(y|x), p(y; x), and p(y, x) denote y conditioned on x, y parameterized by x (where x is not a random variable), and the joint distribution of y and x, respectively. Here is a detailed explanation. Furthermore, note that y is not conditioned on w because w is not a random variable: although w is unknown, it is assumed to be a constant (a fixed, fact-based mechanism) from the frequentist’s point of view.

In the end, let’s clarify the equality p(y|x; w) = L(w|x, y). L(w|x, y) is a likelihood function while p(y|x; w) is a probability function, and the two are mathematically equivalent. It is just a change of representation to emphasize likelihood or probability: they share the identical formula but carry different meanings depending on which quantities are given. For example, p(y|x; w) represents the probability of y conditioned on x when w is given or known, while L(w|x, y) represents the likelihood of w computed from the given data x and y. An interesting and straightforward analogy for the relationship between the probability and likelihood functions is made in this post. With the likelihood function in hand, the principle of maximum likelihood says the optimal w should maximize the likelihood across all the samples:

L(w|X, Y) = ∏ L(w|x, y) = ∏ p(y|x; w) ∝ ∏ exp(−(y − h(x, w))²/(2𝜎²))

If we take the log of both sides, we want to find the w that maximizes −∑(y − h(x, w))²/(2𝜎²), that is, the w that minimizes ∑(y − h(x, w))², the least-squares cost function.
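Spelling the log step out with the Gaussian density from above:

log L(w|X, Y) = ∑ log p(y|x; w) = −n·log(√(2𝜋)·𝜎) − (1/(2𝜎²))·∑(y − h(x, w))²

Since 𝜎 is a constant, maximizing the log-likelihood over w is exactly minimizing ∑(y − h(x, w))².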

In summary, the idea that minimizing a least-squares function gives the optimal w is based on the principle of maximum likelihood estimation and the assumption that the error between the hypothesis and the “truth” follows a Gaussian distribution. This provides the statistical justification for ordinary least-squares fitting.

What if the error doesn’t follow a Gaussian distribution? The Q-Q plot is used to check the normality assumption: it can reveal whether the data is normally distributed, left/right skewed, or heavy/light tailed. Here is a post with more details about the Q-Q plot. In addition, there are more diagnostic plots for linear regression analysis. The residual plot is used to check whether there is non-linearity left in the residuals; if there is, a model that takes the non-linearity into account should be used. The standardized residual plot (scale-location) is used to check the assumption of equal variance (homoscedasticity); if there is heteroscedasticity, weighted least squares should be used. Lastly, the residual-leverage plot is used to identify “influential” points in the data, such as outliers.
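Here is a sketch of these diagnostics in Python, assuming the X, y, and fitted w from the earlier NumPy example; scipy.stats.probplot draws the Q-Q plot against normal quantiles.

```python
import matplotlib.pyplot as plt
from scipy import stats

fitted = X @ w               # fitted values from the earlier sketch
residuals = y - fitted

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Q-Q plot: residual quantiles vs. theoretical normal quantiles.
stats.probplot(residuals, dist="norm", plot=ax1)

# Residual plot: look for leftover non-linearity or unequal variance.
ax2.scatter(fitted, residuals, s=10)
ax2.axhline(0.0, linestyle="--")
ax2.set_xlabel("fitted values")
ax2.set_ylabel("residuals")
plt.show()
```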

Another thing to keep in mind is that the above process works well only when the data is representative. The randomness of collecting the training data leaves us with uncertainty in the estimate of w, so we need to bound the errors of w due to this randomness. This is going to be discussed in the next post.

A bit more on maximum likelihood estimation

Let’s learn a bit more about maximum likelihood estimation from this post. First, let’s clarify the relationship between a likelihood function and a probability function. The replier used this analogy, based on the formula a^b: when a is given, say 2^b, it is an exponential function; when b is given, say a², it is a quadratic function. Although the two share the same formula (a^b), they end up as different functions depending on which variable is given. Second, how is maximum likelihood estimation used to estimate w? Below is a very nice example from the post: write down the likelihood function L(w|x, y), then find the w that maximizes it. Here the formula is θ^H × (1 − θ)^T. Given H=7 and T=3, i.e., the data/observations, θ is estimated based on θ⁷ × (1 − θ)³. Conversely, given θ=0.5 and 10 flips, H and T would be estimated based on 0.5^H × 0.5^T.

The likelihood function is L(θ | heads=7, flips=10) = θ⁷ × (1 − θ)³, where the parameter θ is the coin’s probability of landing heads, based on one experiment with 10 flips and 7 heads. How do we compute θ? The principle of MLE tells us that the most likely θ is the one that maximizes the likelihood function. In this example, θ = 0.7 gives the maximum likelihood, so the MLE of θ from this data is 0.7.
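To see why θ = 0.7 is the maximizer, set the derivative of the log-likelihood to zero:

log L(θ) = 7·log θ + 3·log(1 − θ)

d(log L)/dθ = 7/θ − 3/(1 − θ) = 0 ⟹ 7(1 − θ) = 3θ ⟹ θ = 7/10 = 0.7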

The above example also reveals a potential drawback of MLE. It is not hard to flip a coin 10 times and get 7 heads in reality; then we would estimate the coin is biased with θ=0.7. However, how likely is it to encounter such a coin in reality? If we believe it is unlikely, then what went wrong in the above MLE calculation? The answer is the data. MLE relies solely on the dataset we currently have, which can be very limited, like the 10 flips in this example. With more data (more flips), we may get a different result from MLE. But what can we do if the data is limited? Here come Bayes’ theorem and the maximum a posteriori (MAP) estimate, which incorporate prior knowledge (something like θ = 0.5), so one can take into account both the prior and the data to estimate parameters. But it is also challenging to have a good estimate of the prior.
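To sketch how MAP folds in such a prior (this goes a bit beyond the post: a Beta(α, β) prior is the standard conjugate choice for a coin), the posterior is proportional to θ^(7+α−1) × (1 − θ)^(3+β−1), and the MAP estimate is its mode. With a prior centered at 0.5, say Beta(10, 10):

θ_MAP = (7 + α − 1)/(10 + α + β − 2) = 16/28 ≈ 0.57

The prior pulls the estimate from 0.7 back toward 0.5; with more flips, the data would come to dominate.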

Some more thoughts:

Note that in the above derivation, h(x, w) is a generic expression: it doesn’t have to be a linear model Xw + b. In fact, it could be any model, such as a neural network parameterized by w or an XGBoost model parameterized by w. As a result, the above derivation is essentially a statistical justification for estimating the parameters w of any given model by minimizing the MSE loss. The justification rests on the principle of maximum likelihood estimation and the assumption that the error/residual follows a Gaussian distribution. In other words, whenever minimizing the MSE loss is used to estimate model parameters w, the process is based on MLE and the Gaussian assumption, no matter which optimizer is used (SGD, Adam, etc.; what about black-box optimizers?). The choice of optimizer is just a choice of how to search the parameter space; the goal of minimizing the MSE loss stays the same. Likewise, estimating w by minimizing the cross-entropy loss is also based on MLE; the difference is the distributional assumption, with cross-entropy corresponding to a categorical (multinomial) likelihood and binary cross-entropy to a Bernoulli likelihood. The choice of loss function encodes the assumed distribution of the error/target. What method is based on MAP, then? A Bayesian neural network.
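A small numerical check of this point (a sketch with made-up residuals): for any fixed 𝜎, the Gaussian negative log-likelihood is an affine function of the MSE, so they share the same argmin no matter which optimizer searches the space.

```python
import numpy as np

def gaussian_nll(residuals, sigma=1.0):
    # Negative log-likelihood of residuals under N(0, sigma^2).
    n = len(residuals)
    return n * np.log(np.sqrt(2 * np.pi) * sigma) + np.sum(residuals**2) / (2 * sigma**2)

r = np.random.default_rng(1).normal(size=50)  # made-up residuals
n, sigma = len(r), 1.0
mse = np.mean(r**2)

# NLL = constant + n/(2*sigma^2) * MSE, so minimizing MSE maximizes likelihood.
assert np.isclose(gaussian_nll(r, sigma),
                  n * np.log(np.sqrt(2 * np.pi) * sigma) + n * mse / (2 * sigma**2))
```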

References:

Likelihood and probability function: https://www.quora.com/What-is-the-difference-between-probability-and-likelihood-1/answer/Jason-Eisner?share=cbfeda82&srid=zDgIt

The same reference in Mandarin: https://www.zhihu.com/question/54082000

A brief discussion of likelihood and probability (in Chinese): https://yangfangs.github.io/2018/04/06/the-different-of-likelihood-and-probability/#%E4%BE%8B%E5%A6%82

In short: both likelihood and probability describe a kind of possibility, and the mathematical formula may be identical; but “probability” emphasizes that the parameters are known and we ask for the probability of y, while “likelihood” emphasizes that the parameters are unknown and to be estimated, with the data x and y given.

Q-Q plot: https://data.library.virginia.edu/understanding-q-q-plots/

Heavy/light tailed distribution: https://www.bioss.ac.uk/people/adam/teaching/OR_EVT/2007/node11.html

Diagnostic plots for linear regression analysis: https://data.library.virginia.edu/diagnostic-plots/

Translation in Mandarin: https://blog.csdn.net/qq_35837578/article/details/88357551

https://zhuanlan.zhihu.com/p/49149862
