Ridge Regression

It’s important to understand the concept of regularization before jumping to ridge regression.

1. Regularization
Regularization helps to solve the overfitting problem, in which a model performs well on training data but poorly on validation (test) data. Regularization addresses this by adding a penalty term to the objective function and using that penalty term to control the model's complexity.
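As a rough sketch, a regularized objective can be written as the original loss plus a weighted penalty (the notation below is generic and purely illustrative):

\[
\min_{\beta} \; L(\beta) \;+\; \lambda \, P(\beta), \qquad \lambda \ge 0
\]

where L(β) is the unpenalized loss (for example, the sum of squared errors), P(β) is a penalty on the coefficients, and λ controls how strongly model complexity is penalized.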

Regularization is generally useful in the following situations:

– Large number of variables
– Low ratio of the number of observations to the number of variables
– High multicollinearity

2. L1 Loss function or L1 Regularization
In L1 regularization we minimize the objective function by adding a penalty term proportional to the sum of the absolute values of the coefficients. This is also known as the least absolute deviations method. Lasso regression makes use of L1 regularization.
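In standard notation, the L1-penalized (lasso) objective for linear regression takes the form below (here n, p and x_ij denote the usual sample size, number of predictors and predictor values):

\[
\min_{\beta} \; \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 \;+\; \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
\]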

3. L2 Loss function or L2 Regularization
In L2 regularization we minimize the objective function by adding a penalty term proportional to the sum of the squares of the coefficients. Ridge regression, also called shrinkage regression, makes use of L2 regularization.
In general, L2 regularization performs better than L1 and is more efficient computationally. There is one area, however, where L1 is preferred over L2: L1 has built-in feature selection for sparse feature spaces. For example, suppose you are predicting whether a person has a brain tumor using more than 20,000 genetic markers (features); it is known that the vast majority of genes have little or no effect on the presence or severity of most diseases, so only a handful of features carry signal.
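To make the feature-selection point concrete, here is a small R sketch on simulated data (the simulation itself, its sizes, and object names such as X_sim and beta_true are assumptions made purely for illustration):

# Illustration: L1 (lasso) zeroes out irrelevant coefficients, L2 (ridge) only shrinks them
library(glmnet)
set.seed(1)
n = 100; p = 200                           # many more features than truly matter
X_sim = matrix(rnorm(n * p), n, p)
beta_true = c(3, -2, 1.5, rep(0, p - 3))   # only the first 3 features are relevant
y_sim = drop(X_sim %*% beta_true + rnorm(n))

lasso_fit = cv.glmnet(X_sim, y_sim, alpha = 1)   # alpha = 1 -> L1 (lasso)
ridge_fit = cv.glmnet(X_sim, y_sim, alpha = 0)   # alpha = 0 -> L2 (ridge)

sum(coef(lasso_fit, s = "lambda.min") != 0)      # few non-zero coefficients
sum(coef(ridge_fit, s = "lambda.min") != 0)      # essentially all coefficients non-zero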

In the linear regression objective function we minimize the sum of squared errors. In ridge regression (also known as shrinkage regression) we add a constraint on the sum of squares of the regression coefficients. Thus in ridge regression our objective function is:

\[
\min_{\beta} \; \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 \;+\; \lambda \sum_{j=1}^{p} \beta_j^2
\]

Here λ is the regularization parameter, which is a non-negative number. Note that we do not assume normality of the error terms.

Very Important Note:

We do not regularize the intercept term. The constraint applies only to the sum of squares of the regression coefficients of the X's.

We can see that ridge regression makes use of L2 regularization.
On solving the above objective function we get the ridge estimates of β as:

\[
\hat{\beta}_{\text{ridge}} = (X^{T}X + \lambda I)^{-1} X^{T} y
\]

(Here the predictors are taken to be centred so that the intercept drops out of the penalized problem, consistent with the note above that the intercept is not regularized.)
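As a quick sanity check on this closed form, here is a minimal R sketch that computes the ridge estimates directly on the swiss data used below (the centring step, the object names, and the choice of lambda = 1 are illustrative assumptions; glmnet additionally standardizes the predictors by default, so its coefficients will not match these exactly):

# Closed-form ridge estimates on centred data (illustrative sketch only)
X_c = scale(as.matrix(swiss[, -1]), center = TRUE, scale = FALSE)  # centre the predictors
y_c = swiss[, 1] - mean(swiss[, 1])                                # centre the response
lambda = 1                                                         # assumed value, for illustration
beta_ridge = solve(t(X_c) %*% X_c + lambda * diag(ncol(X_c))) %*% t(X_c) %*% y_c
beta_ridge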
How can we choose the regularization parameter λ?

If we choose λ = 0 we get back the usual OLS estimates. If λ is chosen to be very large, it will lead to underfitting. It is therefore highly important to determine a desirable value of λ. To tackle this, we plot the parameter estimates against different values of λ and select the smallest value of λ beyond which the estimates stabilize.
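One convenient way to draw such a plot in R is glmnet's built-in coefficient-path plot; the short sketch below uses the swiss data from the worked example that follows (the object name fit is just an illustrative choice):

library(glmnet)
fit = glmnet(as.matrix(swiss[, -1]), swiss[, 1], alpha = 0)   # alpha = 0 -> ridge
plot(fit, xvar = "lambda", label = TRUE)                      # coefficient paths; they stabilise as lambda grows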

R code for Ridge Regression
Considering the swiss data set, we create two objects: one containing the dependent variable and the other containing the independent variables.
X = swiss[,-1]  # independent variables (all columns except the first)
y = swiss[,1]   # dependent variable (Fertility)

We need to load the glmnet library to carry out ridge regression.
library(glmnet)

Using the cv.glmnet() function we can perform cross-validation. Setting alpha = 0 means we are carrying out ridge regression (note that glmnet's default is alpha = 1, which corresponds to lasso). The lambda argument is a sequence of lambda values over which cross-validation is performed.

set.seed(123)  # set the seed so the cross-validation results are reproducible
model = cv.glmnet(as.matrix(X), y, alpha = 0, lambda = 10^seq(4, -1, -0.1))  # ridge regression over a grid of lambda values

We take the best lambda using lambda.min and then obtain the regression coefficients using the predict function.
best_lambda = model$lambda.min  # lambda with the lowest cross-validated error

ridge_coeff = predict(model, s = best_lambda, type = "coefficients")
ridge_coeff

Printing ridge_coeff displays the coefficients obtained using ridge regression.
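Once the model has been fitted, the same predict function can also return fitted values; the sketch below simply predicts on the training matrix (the object name preds is an illustrative choice):

preds = predict(model, newx = as.matrix(X), s = best_lambda)  # fitted values at the chosen lambda
head(preds)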