
Logistic regression – introduction and advantages

Logistic regression applies maximum likelihood estimation after transforming the dependent variable into a logit variable (the natural log of the odds of the dependent variable occurring or not) with respect to the independent variables. In this way, logistic regression estimates the probability of a certain event occurring. In the following equation, the log of odds changes linearly as a function of the explanatory variables:

$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$$

Here, $p$ is the probability of the event occurring and $x_1, \ldots, x_k$ are the explanatory variables.

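As a minimal sketch of the estimation step (assuming statsmodels is available; the synthetic data and variable names below are illustrative, not from the text), the logit model can be fitted by maximum likelihood as follows:

    import numpy as np
    import statsmodels.api as sm

    # Synthetic data: two explanatory variables and a 0/1 outcome
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    log_odds = 0.5 + 1.5 * X[:, 0] - 1.0 * X[:, 1]    # linear predictor on the logit scale
    y = rng.binomial(1, 1 / (1 + np.exp(-log_odds)))  # Bernoulli (0/1) outcomes

    # Fit the logit model by maximum likelihood estimation
    model = sm.Logit(y, sm.add_constant(X)).fit()
    print(model.params)  # estimates of beta_0, beta_1, beta_2
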
One can simply ask: why odds or log(odds), and not probability? In fact, this is an interviewer's favorite question in analytics interviews.

The reason is as follows:

By converting probability to log(odds), we expand the range from [0, 1] to (-∞, +∞). Fitting a model on probability directly would run into this restricted-range problem; the log transformation also takes care of the non-linearity involved, so we can fit the response with a simple linear combination of the variables.
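A quick numerical check of this range expansion (a small sketch assuming only NumPy; the probability values are arbitrary):

    import numpy as np

    p = np.array([0.001, 0.1, 0.5, 0.9, 0.999])  # probabilities in (0, 1)
    log_odds = np.log(p / (1 - p))               # logit transform
    print(log_odds)  # roughly [-6.91, -2.20, 0.00, 2.20, 6.91]

Probabilities near 0 and 1 map to large negative and positive log-odds, so the transformed variable is unbounded on both sides.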

One more question to ask is: what happens if someone fits a linear regression on a 0-1 outcome rather than using logistic regression?

A brief explanation is provided in the following points:

  • Error terms tend to be large at the middle values of X (the independent variable) and small at the extreme values, which violates the linear regression assumption that errors have zero mean and are normally distributed
  • The model generates nonsensical predictions of greater than 1 and less than 0 at the end values of X
  • The ordinary least squares (OLS) estimates are inefficient and the standard errors are biased
  • The error variance is high at the middle values of X and low at the ends (heteroscedasticity), violating the constant-variance assumption

All the preceding issues are solved by using logistic regression, as the short sketch below illustrates.
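As a minimal sketch (assuming scikit-learn; the synthetic data here is for illustration only), fitting both models to the same 0-1 outcome shows linear regression escaping the [0, 1] range while logistic regression stays inside it:

    import numpy as np
    from sklearn.linear_model import LinearRegression, LogisticRegression

    # Synthetic 0/1 outcome driven by a single predictor
    rng = np.random.default_rng(42)
    x = np.linspace(-4, 4, 300).reshape(-1, 1)
    y = rng.binomial(1, 1 / (1 + np.exp(-2 * x.ravel())))

    ols = LinearRegression().fit(x, y)
    logit = LogisticRegression().fit(x, y)

    # OLS predictions escape the [0, 1] range at extreme values of x
    print(ols.predict(x).min(), ols.predict(x).max())

    # Logistic probabilities stay strictly inside (0, 1)
    proba = logit.predict_proba(x)[:, 1]
    print(proba.min(), proba.max())

Because the logistic model passes the linear predictor through the sigmoid, its fitted probabilities are bounded by construction.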