Logistic Regression

In logistic regression, the dependent variable is binary. The independent variables can be continuous or binary.

Here my model is:

P(Yi = 1) = exp(Xi'B) / (1 + exp(Xi'B)), or equivalently log[ P(Yi = 1) / (1 − P(Yi = 1)) ] = Xi'B

Why don’t we use linear regression in this case?

– In linear regression, the range of ‘y’ is the whole real line, but here y can take only 2 values (0 or 1) while X’B is continuous, so the usual linear regression is not appropriate in this situation (see the sketch after this list).
– Secondly, the error terms are not normally distributed.
– y follows a binomial distribution and hence is not normal.
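
The following is a minimal sketch of the first point, on simulated data with hypothetical variable names: fitted values from lm() can fall outside [0, 1], while fitted probabilities from a logistic glm() always lie in (0, 1).

# Simulated binary outcome: linear vs. logistic fit
set.seed(1)
x <- rnorm(100)
y <- rbinom(100, size = 1, prob = plogis(2 * x))

lin_fit <- lm(y ~ x)                        # linear probability model
log_fit <- glm(y ~ x, family = "binomial")  # logistic regression

range(fitted(lin_fit))   # can go below 0 or above 1
range(fitted(log_fit))   # always stays within (0, 1)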

Examples

HR Analytics: IT firms recruit large numbers of people, but one problem they encounter is that many candidates do not join after accepting the job offer. This results in cost overruns because the firm has to repeat the entire recruitment process. Now, when you receive an application, can you predict whether that applicant is likely to join the organization (Binary Outcome – Join / Not Join)?

Elections: Suppose that we are interested in the factors that influence whether a political candidate wins an election. The outcome (response) variable is binary (0/1): win or lose. The predictor variables of interest are the amount of money spent on the campaign and the amount of time spent campaigning negatively.

Predicting the category of the dependent variable for a given vector X of independent variables

Through logistic regression we have

P(Yi = 1) = exp(Xi'B) / (1 + exp(Xi'B)), where B is the vector of estimated coefficients.

Thus we choose a cut-off probability, say ‘p’, and if P(Yi = 1) > p then we classify Yi into class 1, otherwise into class 0.
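
As a quick illustration of this step (with hypothetical coefficient and feature values), the linear predictor X’B is turned into a probability and then into a class label using a 0.5 cut-off:

# Hypothetical coefficients and a new observation (intercept term included)
beta  <- c(-1.5, 0.8)
x_new <- c(1, 2.3)

p_hat <- plogis(sum(x_new * beta))   # P(Y = 1) = exp(X'B) / (1 + exp(X'B))
y_hat <- ifelse(p_hat > 0.5, 1, 0)   # cut-off probability p = 0.5
p_hat
y_hat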

Interpreting the logistic regression coefficients (Concept of Odds Ratio)

If we take the exponential of the i-th coefficient, we get the odds ratio for the i-th explanatory variable. If the odds ratio equals two, the odds of the event are 2 times the odds of the non-event for a one-unit increase in that variable. For example, suppose the dependent variable is customer attrition (whether the customer will close the relationship with the company) and the independent variable is citizenship status (National / Expat). An odds ratio of 3 for citizenship status means that the odds of an expat attriting are 3 times the odds of a national attriting.
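
In R, the odds ratios are simply the exponentiated coefficients of a fitted logistic model; ‘model’ below refers to the glm object fitted in the next section (or any model fitted with family = "binomial").

# Odds ratio for each explanatory variable
exp(coef(model))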

Logistic Regression in R

In this case, we are trying to estimate whether a person will have cancer depending on whether he smokes or not.
We fit the logistic regression with the glm() function and set family = "binomial".

model <- glm(Lung.Cancer..Y. ~ Smoking..X., data = data, family = "binomial")

The predicted probabilities are given by:

#Predicted Probabilities
model$fitted.values
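
Equivalently (assuming the same fitted model object), predict() with type = "response" returns the same values on the probability scale rather than the log-odds scale:

#Same probabilities via predict()
head(predict(model, type = "response"))
head(model$fitted.values)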

Predicting whether the person will have cancer or not when we choose the cut-off probability to be 0.5:

data$prediction <- model$fitted.values > 0.5
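
To get a rough sense of how well the 0.5 cut-off performs, a confusion matrix can be built from the observed outcome and the predicted class (a sketch assuming the same column names as above):

#Confusion matrix: observed outcome vs. predicted class (TRUE = predicted cancer)
table(Observed = data$Lung.Cancer..Y., Predicted = data$prediction)

#Overall accuracy at the 0.5 cut-off
mean((data$Lung.Cancer..Y. == 1) == data$prediction)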