Logistic Regression For Machine Learning

Logistic regression is a widely used technique in statistics and, more specifically, in predictive analytics. In statistics it is also called the logit model or the direct probability model. Its introduction is commonly credited to the statistician David Cox in 1958. Statistics offers many types of regression models, such as linear regression, logistic regression, multiple linear regression and lasso regression. Within a family, the variant is chosen mainly by the number of independent variables: with a single independent variable, simple linear regression is used; with more than one, multiple linear regression is used. The same distinction applies to simple and multiple logistic regression.

In this article we’ll discuss simple logistic regression, logistic regression as a machine learning technique, and how to perform logistic regression in R.

What is Logistic Regression?

Logistic regression is a supervised machine learning technique and a linear model (it is linear in the log-odds of the outcome). It counts as machine learning because the algorithm learns from a training data set and then produces output for new data. Broadly, when the dependent variable is binary, taking two values such as true/false or win/lose, logistic regression is the appropriate regression model.
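As a minimal sketch of the idea (not part of the original analysis), the logistic function maps the linear part of the model, the log-odds, to a probability between 0 and 1. The coefficients `b0` and `b1` and the values of `x` below are made up purely for illustration.

```r
# Logistic (sigmoid) function: converts log-odds to a probability in (0, 1)
logistic <- function(z) 1 / (1 + exp(-z))

# Hypothetical coefficients, chosen only for illustration
b0 <- -1.5
b1 <- 0.8

x <- c(-2, 0, 2, 4)          # example values of one independent variable
log_odds <- b0 + b1 * x      # the linear part of the model
prob <- logistic(log_odds)   # predicted probability that the outcome is 1

print(round(prob, 3))
```

Base R already provides this function as `plogis()`; the hand-written version is shown only to make the formula visible.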

Assumptions of Logistic Regression

1. The dependent variable is categorical. Linear regression assumes a continuous dependent variable, which a binary outcome does not satisfy, and this is the case logistic regression was designed for. Its residuals also behave differently from those of linear regression.

2. Multicollinearity should be checked to confirm that there is no, or only very low, correlation among the independent variables. The Variance Inflation Factor (VIF) is a common check.

3. Outliers should not be part of a logistic regression model. Use a boxplot to check for outliers in the data.

4. The sample size should be large enough for the model to give statistically reliable results.

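5. Unlike linear regression, logistic regression does not require normally distributed residuals. Instead, the observations should be independent, and each continuous independent variable should have a roughly linear relationship with the log-odds of the outcome.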
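To make the multicollinearity assumption concrete, here is a rough sketch that computes VIF by hand in base R on simulated data. The data frame `sim` and its columns `x1`, `x2`, `x3` are hypothetical; `x2` is deliberately built to be correlated with `x1`.

```r
set.seed(1)
n <- 200
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = 0.3)  # deliberately correlated with x1
x3 <- rnorm(n)                        # unrelated to the other two
sim <- data.frame(x1, x2, x3)

# VIF for one predictor = 1 / (1 - R^2), where R^2 comes from regressing
# that predictor on all the other predictors
vif_manual <- sapply(names(sim), function(v) {
  others <- setdiff(names(sim), v)
  r2 <- summary(lm(reformulate(others, response = v), data = sim))$r.squared
  1 / (1 - r2)
})

print(round(vif_manual, 2))
```

Here `x1` and `x2` should show clearly inflated values while `x3` stays close to 1. The `vif()` function from the car package reports a comparable diagnostic for a fitted model.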

Types of Logistic Regression

Logistic regression can be categorized into three types: binary, multinomial and ordinal logistic regression. The types are defined by how many values the dependent variable can take and whether those values are ordered.

Binary Logistic Regression: When the dependent variable takes only two values, it is called binary logistic regression. Example: True or False.

Multinomial Logistic Regression: If the dependent variable takes three or more values that have no natural order, it is multinomial logistic regression.

Ordinal Logistic Regression: If the dependent variable takes three or more values that have a natural order, it is ordinal logistic regression.
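The three kinds of dependent variable can be represented in R roughly as follows. This is only an illustrative sketch; the category labels are hypothetical.

```r
# Binary: exactly two possible values
binary_outcome <- factor(c("True", "False", "True"),
                         levels = c("False", "True"))

# Multinomial: three or more categories with no natural order
multinomial_outcome <- factor(c("Bus", "Car", "Train", "Car"))

# Ordinal: three or more categories with a natural order
ordinal_outcome <- factor(c("Low", "High", "Medium"),
                          levels = c("Low", "Medium", "High"),
                          ordered = TRUE)

print(nlevels(binary_outcome))          # number of categories
print(is.ordered(multinomial_outcome))  # unordered factor
print(is.ordered(ordinal_outcome))      # ordered factor
```

A binary outcome is typically fitted with `glm()` and `family = binomial()`. For the other two types, functions such as `polr()` from the MASS package (ordinal) or the mlogit package (multinomial) are common choices.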

Machine learning is a part of Artificial Intelligence (AI) and is about training a machine with a training data set. There are three main types of machine learning, distinguished by how learning happens: supervised learning, unsupervised learning and reinforcement learning. In supervised learning, the algorithm finds a trend or pattern in labeled training data and applies what it has learned to a test data set. Logistic regression, like any kind of regression, is a type of supervised learning: the first step is to train the model and the next is to test it. So, if the dependent variable is binary, multinomial or ordinal in nature, logistic regression is the machine learning technique used for predictive modeling.
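The train-then-test workflow of supervised learning can be sketched as follows. The data is simulated, and the 70/30 split and the 0.5 cutoff are assumptions made for illustration, not recommendations from the original text.

```r
set.seed(42)
n <- 500

# Simulated data: one predictor x and a binary outcome y
sim <- data.frame(x = rnorm(n))
sim$y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * sim$x))

# Randomly assign 70% of the rows to training, the rest to testing
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
train <- sim[train_idx, ]
test <- sim[-train_idx, ]

# Train on the training set only
fit <- glm(y ~ x, data = train, family = binomial())

# Apply what was learned to the unseen test set
test_prob <- predict(fit, newdata = test, type = "response")
test_class <- ifelse(test_prob > 0.5, 1, 0)
test_accuracy <- mean(test_class == test$y)

print(test_accuracy)
```

Measuring accuracy on rows the model never saw gives a more honest estimate than scoring the same data used for fitting.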

When to Select a Logistic Regression Model?

It is important to choose the regression model by looking at the data. Where the dependent variable is binary, multinomial or ordinal, logistic regression is performed. As an example, in employee attrition (churn) analysis, the data set has a churn column along with employee details like name, age, salary and so on. The employee details are the independent variables and employee churn is the dependent variable. In the churn column, employee retention is denoted as 1 and attrition as 0. In such a case, logistic regression should be used.
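A quick way to confirm that a dependent variable is binary before choosing the model might look like this. The `employees` data frame and all of its values are invented for illustration.

```r
# Hypothetical employee data; every value here is made up
employees <- data.frame(
  name   = c("A", "B", "C", "D", "E", "F"),
  age    = c(25, 41, 33, 52, 29, 38),
  salary = c(30000, 72000, 45000, 90000, 36000, 58000),
  churn  = c(0, 1, 1, 1, 0, 1)  # 1 = retained, 0 = left, as coded in the text
)

# Count how often each value of the dependent variable occurs
print(table(employees$churn))

# The outcome is binary if it contains exactly the two values 0 and 1
is_binary <- all(employees$churn %in% c(0, 1)) &&
  length(unique(employees$churn)) == 2
print(is_binary)
```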

Applications of Logistic Regression

As discussed earlier, the dependent variable in logistic regression is binary, multinomial or ordinal. Data of this kind appears in fraud detection, loan default prediction, employee attrition and many more.

Logistic Regression in R

R is a statistical tool used for statistical modeling. To perform logistic regression in R, follow the steps and code below. In this case, we’ll not split the data into a training set and a test set; instead we’ll take the final output and check the accuracy. Information value (IV) is not checked in the following example because it has only five independent variables. An IV check is required if the number of independent variables is more than 20.

As an example, and to show the code structure, we assume the data has independent variables named Independent_var_1 through Independent_var_5 and a dependent variable named Dep_var.

Libraries to load (install each one first with install.packages() if it is not already installed):

library(caret)

library(ggplot2)

library(MASS)

library(car)

library(mlogit)

library(sqldf)

library(Hmisc)

Import the data (from a CSV file):

setwd("C:\\Users\\Desktop\\Logistic regression\\Logistic Regression_1") ## File path

data <- read.csv("data1.csv") ## Reading csv data

head(data) ## Check first 6 rows of data set

str(data) ## Data sanity check

Change the variables into factor:

data$Independent_var_1 <- as.factor(data$Independent_var_1)

data$Independent_var_2 <- as.factor(data$Independent_var_2)

Descriptive Analytics:

summary(data)

boxplot(data$column_name) ## To check the outliers; replace column_name with the name of a numeric column

sapply(data, function(x) sum(is.na(x))) ## To check Missing Value

Find the column names:

names(data)

Perform Logistic Regression:

model <- glm(Dep_var ~ Independent_var_1 + Independent_var_2 + Independent_var_3 + Independent_var_4 + Independent_var_5, data = data, family = binomial())

## Dep_var as dependent variable & Independent_var as independent variable

summary(model)

Adjusting the final model:

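After running the “glm” model, the summary output shows a p-value for each variable. If any variable has a p-value above 0.05, drop the one with the highest p-value, refit the model, and repeat one variable at a time until every remaining variable is significant. The remaining variables form the final model.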
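One step of this elimination could be sketched as below. The data is simulated, with a hypothetical `noise` column that has no real effect on the outcome; the names `full` and `reduced` are also illustrative.

```r
set.seed(7)
n <- 300

# Simulated data: x1 and x2 drive the outcome, noise does not
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
sim$y <- rbinom(n, 1, plogis(0.3 + 1.5 * sim$x1 - 1.0 * sim$x2))

full <- glm(y ~ x1 + x2 + noise, data = sim, family = binomial())

# p-values of the predictors (the intercept row is dropped)
p_values <- summary(full)$coefficients[-1, "Pr(>|z|)"]
print(round(p_values, 4))

# Identify the variable with the highest p-value and drop it
# only if that p-value is above 0.05
worst <- names(which.max(p_values))
if (p_values[worst] > 0.05) {
  reduced <- update(full, as.formula(paste(". ~ . -", worst)))
} else {
  reduced <- full
}

print(formula(reduced))
```

In practice the step is repeated on `reduced` until no remaining variable exceeds the threshold.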

Accuracy Check:

vif(model) ## Check Variance Inflation Factor to understand multicollinearity. VIF output should be <2 for a good model.

modelChi <- model$null.deviance - model$deviance ## Model chi-square: drop in deviance from the null model, used to compute a pseudo R²
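The model chi-square can be turned into a significance test and a pseudo R² roughly as follows. This sketch uses simulated data and a hypothetical model `fit`; the pseudo R² shown is the Hosmer-Lemeshow version (model chi-square divided by the null deviance), which is one of several definitions in use.

```r
set.seed(3)
n <- 300

# Simulated data with one informative predictor
sim <- data.frame(x = rnorm(n))
sim$y <- rbinom(n, 1, plogis(0.2 + 1.0 * sim$x))
fit <- glm(y ~ x, data = sim, family = binomial())

# Drop in deviance from the null (intercept-only) model
model_chi <- fit$null.deviance - fit$deviance

# Degrees of freedom = number of predictors added to the null model
chi_df <- fit$df.null - fit$df.residual

# p-value for the hypothesis that the predictors add nothing
chi_p <- pchisq(model_chi, df = chi_df, lower.tail = FALSE)

# Hosmer-Lemeshow pseudo R^2: share of null deviance explained
pseudo_r2 <- model_chi / fit$null.deviance

print(c(chi_square = model_chi, df = chi_df, p_value = chi_p, pseudo_r2 = pseudo_r2))
```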

Predict:

prediction <- predict(model, newdata = data, type = "response")
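The predicted values are probabilities, so checking accuracy requires converting them to classes first. A minimal sketch on simulated data is shown here; the 0.5 cutoff and the names `fit`, `pred_class` and `confusion` are assumptions for illustration.

```r
set.seed(11)
n <- 400

# Simulated data and a fitted logistic model
sim <- data.frame(x = rnorm(n))
sim$y <- rbinom(n, 1, plogis(-0.3 + 1.4 * sim$x))
fit <- glm(y ~ x, data = sim, family = binomial())

# Predicted probabilities, then classes using a 0.5 cutoff
prob <- predict(fit, newdata = sim, type = "response")
pred_class <- factor(ifelse(prob > 0.5, 1, 0), levels = c(0, 1))
actual <- factor(sim$y, levels = c(0, 1))

# Confusion matrix: rows are predictions, columns are actual values
confusion <- table(Predicted = pred_class, Actual = actual)
print(confusion)

# Accuracy = correctly classified rows / all rows
accuracy <- sum(diag(confusion)) / sum(confusion)
print(accuracy)
```

Because the same rows are used for fitting and scoring here, this accuracy is likely to be optimistic compared with accuracy on unseen data.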

Export the data as a .csv file:

write.csv(cbind(data, prediction), "result.csv") ## Original data plus the predicted probabilities

All the best.
