Logistic Regression

Oct 14, 2018

In this post, I’d like to look into classification. Here is the summary.

  • Logistic or sigmoid function, logits, the odds
  • Linear regression with a threshold doesn’t work for classification because the least-square loss is not a proper loss for classification.
  • The cross-entropy loss is justified for classification from a probabilistic point of view, which assumes P(y|x;W) follows a Bernoulli distribution.
  • The definition of cross entropy from information theory also fits a loss function for classification: it measures the difference between the predicted distribution (probabilities) and the true distribution (labels).
  • Some fundamentals of entropy, cross entropy and KL divergence (relative entropy) and their comparison.
  • Newton’s method is an optimization method for logistic regression.

The target variable y is discrete in a classification problem, in contrast to the continuous target in a regression problem. Herein, y is also called the label. For binary classification with y = 1 or 0, we could still fit a linear regression and turn its continuous output into a discrete label by setting a threshold, but the least-square loss is not a proper loss for classification. For example, a single extreme sample that is already correctly labeled can drag the fitted line and shift where it crosses the threshold, changing the predicted labels of other samples even though nothing about the classification task has changed. Here comes logistic regression.

h(W, x) = 1/(1 + exp(-Wx))
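Before going further, here is a minimal sketch of the least-square failure mode mentioned above (assuming NumPy is available; the toy data and the single extreme sample are made up for illustration):

```python
import numpy as np

# Six 1-D samples that are perfectly separable at x = 2.5.
x = np.array([0., 1., 2., 3., 4., 5.])
y = np.array([0., 0., 0., 1., 1., 1.])

def ls_boundary(x, y):
    """Fit y ~ a*x + b by least squares and return where the line crosses 0.5."""
    a, b = np.polyfit(x, y, 1)
    return (0.5 - b) / a

print(ls_boundary(x, y))    # 2.5, a sensible decision boundary

# Add one extreme sample that is already correctly labeled as y = 1.
x2 = np.append(x, 100.)
y2 = np.append(y, 1.)
print(ls_boundary(x2, y2))  # roughly 3.8: the sample at x = 3, labeled 1,
                            # now falls on the wrong side of the threshold
```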

The key thing in logistic regression is the logistic, or sigmoid, function.

p = 1/(1 + exp(-L))

A few things about the logistic function. First, because the output p is bounded between 0 and 1, a common interpretation of p is a probability. Second, L is the logit and p/(1-p) is the odds; the natural logarithm of the odds is known as the log-odds or logit.

L = ln(p/(1-p))

A probability of 0.5 corresponds to a logit of 0. A probability lower than 0.5 corresponds to a negative logit, while a probability higher than 0.5 corresponds to a positive logit.
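A quick numeric sketch of these relationships (assuming NumPy; the example probabilities are arbitrary):

```python
import numpy as np

def sigmoid(L):
    """Convert a logit L to a probability p."""
    return 1.0 / (1.0 + np.exp(-L))

def logit(p):
    """Convert a probability p to a logit L = ln(p / (1 - p))."""
    return np.log(p / (1.0 - p))

print(sigmoid(0.0))            # 0.5: a logit of 0 is a probability of 0.5
print(logit(0.5))              # 0.0
print(logit(0.2), logit(0.8))  # negative logit below 0.5, positive above
print(sigmoid(logit(0.73)))    # 0.73: the two functions are inverses
```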

Logistic regression

Following the same structure as the last post, first let’s make an assumption about the relationship between y and x. In binary classification, the linear combination of the features x determines the probability that x belongs to the class y=1, that is, P(y=1|x;W). The logistic function converts the logit Wx into a probability, and this is the hypothesis.

P(y = 1|x;W) = h(W, x) = 1/(1 + exp(-Wx))

Notice the variable on the left is P(y=1|x;W), not the label y. The hypothesis outputs a probability, and this probability is used to compute the loss, which is the sum of the cross entropy H(p, q) over all training samples.

H of one sample: H(y, h(W, x)) = -(y×log(h(W, x)) + (1-y)×log(1-h(W, x)))

All training samples: Loss(W) = ∑H(y, h(W, x))
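As a sketch (assuming NumPy; X, y and W below are made-up toy values), the per-sample cross entropy and the summed loss can be written as:

```python
import numpy as np

def h(W, X):
    """Hypothesis: the sigmoid of the linear combination Wx for every sample."""
    return 1.0 / (1.0 + np.exp(-(X @ W)))

def cross_entropy(y, p):
    """H for one sample: -(y*log(p) + (1-y)*log(1-p))."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def loss(W, X, y):
    """Sum of the per-sample cross entropies over the training set."""
    return np.sum(cross_entropy(y, h(W, X)))

# Toy data: 4 samples, 2 features each (a bias column could be appended to X).
X = np.array([[0.5, 1.0], [1.5, -0.5], [-1.0, 2.0], [2.0, 0.2]])
y = np.array([1., 0., 1., 0.])
W = np.array([0.1, -0.3])
print(loss(W, X, y))
```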

Similar to the least-square case, one may ask why cross-entropy is a proper loss function for classification. Here is a probabilistic explanation. The assumption is that P(y|x;W) follows a Bernoulli distribution, which makes sense for binary classification. Since P(y=1|x;W) = h(W, x) and P(y=0|x;W) = 1-h(W, x), the two cases can be combined as

P(y|x;W) = h(W, x)^y × (1-h(W, x))^(1-y)

According to the principle of maximum likelihood estimation (MLE), the likelihood function is L(W) = ∏ P(y|x;W). To maximize L(W), we can take the log to turn the product into a sum and bring down the exponents,

log L(W) = log(∏ P(y|x;W)) = ∑(y×log(h(W, x)) + (1-y)×log(1-h(W, x)))

Thus, log L(W) = -Loss(W). This shows that minimizing Loss(W) is equivalent to maximizing log L(W). Therefore, the cross-entropy loss is a proper loss function for classification from a probabilistic point of view, which assumes P(y|x;W) follows a Bernoulli distribution.
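A tiny numeric check of log L(W) = -Loss(W), reusing the toy h, loss, X, y and W from the sketch above (all of which are made-up values for illustration):

```python
# Log-likelihood under the Bernoulli assumption
p = h(W, X)
log_likelihood = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
print(log_likelihood, -loss(W, X, y))   # the two numbers agree
```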

In addition, the definition of cross entropy, H(p, q), for two distributions fits very well with the loss function. The cross entropy can be interpreted as the expected number of bits per message needed to encode events drawn from the true distribution p when using an optimal code for the distribution q. Cross entropy is commonly used in machine learning: H(y, h) is the cross entropy between the true label y and the predicted probability h. From the discussion above, we have seen that the sum of cross entropies turns out to be the negative log-likelihood in logistic regression. Some extra reading comparing the entropy of one distribution with the cross entropy and KL divergence of two distributions is here.
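A short sketch of the three quantities for two made-up discrete distributions p and q (assuming NumPy), including the identity H(p, q) = H(p) + KL(p||q):

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # "true" distribution (made up)
q = np.array([0.5, 0.3, 0.2])   # "predicted" distribution (made up)

entropy = -np.sum(p * np.log(p))         # H(p)
cross_entropy = -np.sum(p * np.log(q))   # H(p, q)
kl = np.sum(p * np.log(p / q))           # KL(p || q), the relative entropy

print(cross_entropy, entropy + kl)       # H(p, q) = H(p) + KL(p || q)
```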

Newton’s method

With the loss function defined, we can apply gradient descent to minimize the loss and find the optimal W. In addition to gradient descent, Newton’s method can also be used to fit a logistic regression. Newton’s method finds the parameter 𝜃 at which a function f(𝜃) is zero. When 𝜃 is a single scalar parameter, the update rule is

𝜃 = 𝜃 - f(𝜃)/f′(𝜃), where f′(𝜃) = df/d𝜃 evaluated at the current value of 𝜃

For logistic regression, the goal is to find the W that minimizes Loss(W), which is a convex optimization problem: there is a single minimum, attained where Loss’(W) = 0. Thus, when W is a single scalar parameter, the update rule according to Newton’s method is

W = W-Loss’(W)/Loss’’(W)
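A minimal sketch of this one-parameter update (assuming NumPy; the single-feature data are made up, with overlapping classes so a finite optimum exists), using the analytic first and second derivatives of the cross-entropy loss:

```python
import numpy as np

x = np.array([-2., -1., 1., 2., 3., 4.])
y = np.array([0., 1., 0., 1., 1., 1.])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = 0.0
for _ in range(10):
    p = sigmoid(w * x)
    grad = np.sum((p - y) * x)           # Loss'(w)
    hess = np.sum(p * (1 - p) * x ** 2)  # Loss''(w)
    w -= grad / hess                     # w = w - Loss'(w)/Loss''(w)
print(w)                                 # settles around 0.7 on this toy data
```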

When W is a vector, a more general update rule is

W = W - inverse(H)×𝛻Loss(W)

where H is the Hessian, a matrix whose entries are given by

Hij = ∂²Loss(W)/∂Wi∂Wj
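For the cross-entropy loss with the sigmoid hypothesis, differentiating twice gives 𝛻Loss(W) = Xᵀ(h - y) and H = Xᵀ diag(h(1-h)) X. Here is a minimal sketch of the resulting update (assuming NumPy; X and y are made-up toy data):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: the first column of X is a bias feature; the classes overlap,
# so the loss has a finite minimizer.
X = np.array([[1., -1., -1.], [1., 1., -1.], [1., 0., 1.],
              [1., 0., 0.], [1., 2., 2.], [1., 3., 0.]])
y = np.array([0., 0., 0., 1., 1., 1.])

W = np.zeros(X.shape[1])
for _ in range(10):
    p = sigmoid(X @ W)                      # h(W, x) for every sample
    grad = X.T @ (p - y)                    # gradient of Loss(W)
    H = X.T @ (X * (p * (1 - p))[:, None])  # Hessian: X^T diag(p(1-p)) X
    W = W - np.linalg.solve(H, grad)        # W = W - inverse(H) × gradient
print(W)                                    # learned weights [bias, w1, w2]
```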

In practice, Newton’s method typically converges in fewer iterations than gradient descent. However, each iteration is computationally much more expensive because it requires computing and inverting the Hessian H.

Gradient descent and Newton’s method are both common methods for unconstrained optimization, and both can be applied in linear and non-linear programming. The conjugate gradient method is a combination of those two. The Lagrange multiplier method handles optimization with constraints (e.g., quadratic programming).

Comments

Linear and logistic regression seem like the simplest models we learn in a statistics or machine learning class. I believed I had mastered them in class and thought it would be easy to write this summary. A nagging caveat had been in my mind for a while, but I thought it was minor, so I never tried hard enough to think it out loud or figure out what I was confused about. When I sat down to write this summary, it turned out not to be so trivial, and it paid off to spend the time getting it straight. This is another time I’ve realized I can learn something significant from a topic I thought I had already mastered in every detail.

Previously, the reason I convinced myself that logistic regression is a better choice for classification was an example in which linear regression was subject to large variance. I didn’t think hard enough to realize that the large variance is actually caused by searching for W with the least-square loss, which is far from optimal for classification. Another minor reason was the easy interpretation as probabilities. Other than that, I had always taken it as given that cross entropy should be used for classification because it seems intuitive.

Here I just think my confusion out loud. When I look at the mathematical formulas of linear and logistic regression, even though they are different, one thing is the same: the hypothesis increases monotonically with Wx. That is, when Wx goes up, y or P(y) goes up; when Wx goes down, y or P(y) goes down. Basically, the output of linear regression plays the role of the logit in logistic regression. Then I thought of the softmax layer of a neural network. For inference, we only need the max of the logits over classes to output the label; we don’t need the probabilities, because the logits are monotonically related to the probabilities. So I was struggling to understand why we bother converting the linear output into a probability at all. Now I have more convincing reasons (see the sketch after the list below).

  • The probabilities (or the softmax activation) are not needed during inference, but they are needed during training. Training requires the cross-entropy loss to backpropagate through the weights, so the probabilities must be computed in order to evaluate the cross-entropy loss and update the weights.
  • Assuming P(y|x;W) follows a Bernoulli/Multinomial distribution makes more sense than a Gaussian distribution for binary/multi-class classification. (Bernoulli -> Binomial -> Multinomial; Bernoulli: toss a coin once, Binomial: toss a coin multiple times, Multinomial: roll a k-sided die multiple times.) Thus, the cross-entropy loss is a more appropriate objective than the least-square loss for finding the optimal W.
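A quick sketch of the inference point above (assuming NumPy; the logits are made-up values): the argmax over logits and the argmax over softmax probabilities pick the same class, but the probabilities are still what the cross-entropy loss needs during training.

```python
import numpy as np

logits = np.array([2.0, -1.0, 0.5])         # made-up scores for 3 classes

probs = np.exp(logits - logits.max())       # softmax, shifted for numerical stability
probs /= probs.sum()

print(np.argmax(logits), np.argmax(probs))  # same predicted class for inference
print(-np.log(probs[0]))                    # cross-entropy loss if the true class is 0
```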
