Machine learning has a wide range of applications in programming and is useful for solving real-world problems. Machine learning algorithms are worth studying because they learn from data and experience without being explicitly programmed for each task. This article covers logistic regression in machine learning.
What is Logistic Regression in machine learning?
Logistic regression uses the logistic function to model a binary dependent variable. It describes the relationship between a binary dependent variable and one or more nominal, ordinal, interval, or ratio-level independent variables. It is applied in many fields, including machine learning, medicine, social science, human resources, and finance.
In machine learning and data science, around 70% of problems are said to be classification problems. Logistic regression is a machine learning algorithm for classification: it predicts a categorical dependent variable. The dependent variable is sometimes called the target variable, and the independent variables are called predictors.
In simple terms, logistic regression predicts P(Y=1) as a function of X. In its basic form the outcome is dichotomous, meaning there are two possible classes. Some classification problems have more than two classes. This kind of problem is known as a multiclass (multinomial) classification problem.
Characteristics of Logistic Regression
- In logistic regression, the target variable follows a Bernoulli distribution
- The coefficients are estimated by maximum likelihood
- Model fit is assessed with measures such as concordance and the KS statistic
- The fitted probability follows an S-shaped (sigmoid) curve
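To make the maximum likelihood point more concrete, here is a minimal sketch, using only the standard library, of the Bernoulli log-likelihood that the estimation tries to maximize. The function name `bernoulli_log_likelihood` and the sample labels and probabilities are illustrative assumptions, not part of any library.

```python
import math

def bernoulli_log_likelihood(y_true, p_pred):
    """Sum of log P(y_i) under a Bernoulli model with success probability p_i."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # Each observation contributes log(p) if y == 1 and log(1 - p) if y == 0.
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

# Hypothetical labels and two hypothetical sets of predicted probabilities.
labels = [1, 0, 1, 1]
confident_and_right = [0.9, 0.1, 0.8, 0.7]
uninformative = [0.5, 0.5, 0.5, 0.5]

ll_good = bernoulli_log_likelihood(labels, confident_and_right)
ll_flat = bernoulli_log_likelihood(labels, uninformative)
print(ll_good, ll_flat)  # the first value is larger (closer to zero)
```

Probabilities that agree with the observed labels give a higher log-likelihood, which is why maximizing it pushes the model toward coefficients that fit the data.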
Sigmoid Function (Logistic function)
Logistic regression uses the sigmoid function to turn a linear combination of the predictors into a probability. The function accepts any real number and returns a value between 0 and 1. The sigmoid function is given by the following equation.
p = 1 / (1 + e^{-y})
Whatever value of y you put in, p always lies between 0 and 1. A class label of 0 or 1 is then obtained by comparing p with a threshold, commonly 0.5.
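The behaviour of the sigmoid can be checked with a few lines of Python. This is a minimal sketch using only the standard library. The helper names `sigmoid` and `to_class` and the 0.5 threshold are assumptions chosen for illustration.

```python
import math

def sigmoid(y):
    """Map any real number y to a probability between 0 and 1."""
    if y >= 0:
        return 1.0 / (1.0 + math.exp(-y))
    # For negative y, use an algebraically equivalent form that avoids overflow in exp.
    e = math.exp(y)
    return e / (1.0 + e)

def to_class(p, threshold=0.5):
    """Turn a probability into a 0/1 label using a threshold (0.5 is an assumed default)."""
    return 1 if p >= threshold else 0

# Print the input, the probability, and the resulting class label.
for y in (-5, -1, 0, 1, 5):
    p = sigmoid(y)
    print(y, round(p, 4), to_class(p))
```

Large negative inputs give probabilities near 0, large positive inputs give probabilities near 1, and an input of 0 gives exactly 0.5.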
Python Implementation
- Collect data
Suppose you want a logistic regression model in Python that predicts whether candidates will be admitted to their preferred university. There are only two possible outcomes: admitted (represented as 1) or rejected (represented as 0).
You can create a logistic regression model in Python where the target variable indicates whether a candidate was admitted, and the three predictors are GMAT score, GPA, and years of work experience. When choosing a dataset, it is important to keep the sample size large to get reliable results.
- Importing Python Packages
Make sure that the pandas, scikit-learn (sklearn) and seaborn packages are installed in Python, then import what you need as follows:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import seaborn as sn
- Building a Dataframe
This is done by capturing the dataset in a pandas DataFrame.
import pandas as pd
candidates = {
    'gmat': [780, 750, 690, 710, 680, 730, 690, 720, 740, 690, 610, 690, 710, 680, 770, 610],
    'gpa': [4, 3.9, 3.3, 3.7, 3.9, 3.7, 2.3, 3.3, 3.3, 1.7, 2.7, 3.7, 3.7, 3.3, 3.3, 3],
    'work_experience': [3, 4, 3, 5, 4, 6, 1, 4, 5, 1, 3, 5, 6, 4, 3, 1],
    'admitted': [1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0],
}
df = pd.DataFrame(candidates, columns=['gmat', 'gpa', 'work_experience', 'admitted'])
print(df)
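Before fitting a model, it can help to look at how the two classes are distributed. The following is a small sketch with the same data. The variable name `class_counts` is an illustrative choice.

```python
import pandas as pd

candidates = {
    'gmat': [780, 750, 690, 710, 680, 730, 690, 720, 740, 690, 610, 690, 710, 680, 770, 610],
    'gpa': [4, 3.9, 3.3, 3.7, 3.9, 3.7, 2.3, 3.3, 3.3, 1.7, 2.7, 3.7, 3.7, 3.3, 3.3, 3],
    'work_experience': [3, 4, 3, 5, 4, 6, 1, 4, 5, 1, 3, 5, 6, 4, 3, 1],
    'admitted': [1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0],
}
df = pd.DataFrame(candidates)

# How many candidates fall into each class (1 = admitted, 0 = rejected)?
class_counts = df['admitted'].value_counts()
print(class_counts)

# Average of each predictor within each class.
print(df.groupby('admitted').mean())
```

This sample has 16 rows, 12 admitted and 4 rejected, so it is both small and imbalanced. Any accuracy figure computed from it should be read with that in mind.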
- Creating Logistic Regression in Python
Putting the pieces together gives the following code:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
candidates = {
    'gmat': [780, 750, 690, 710, 680, 730, 690, 720, 740, 690, 610, 690, 710, 680, 770, 610],
    'gpa': [4, 3.9, 3.3, 3.7, 3.9, 3.7, 2.3, 3.3, 3.3, 1.7, 2.7, 3.7, 3.7, 3.3, 3.3, 3],
    'work_experience': [3, 4, 3, 5, 4, 6, 1, 4, 5, 1, 3, 5, 6, 4, 3, 1],
    'admitted': [1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0],
}
df = pd.DataFrame(candidates, columns=['gmat', 'gpa', 'work_experience', 'admitted'])
# Predictors (X) and target (y)
X = df[['gmat', 'gpa', 'work_experience']]
y = df['admitted']
# Hold out 25% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
# Fit the model on the training set and predict on the test set
logistic_regression = LogisticRegression()
logistic_regression.fit(X_train, y_train)
y_pred = logistic_regression.predict(X_test)
# Cross-tabulate actual vs. predicted labels and draw the result as a heatmap
confusion_matrix = pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=['Predicted'])
sn.heatmap(confusion_matrix, annot=True)
plt.show()  # displays the plot when the code is run as a script
print('Accuracy: ', metrics.accuracy_score(y_test, y_pred))
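If you prefer a text summary to a plot, scikit-learn's own metrics can report the same information. The sketch below is one way to do that under a few assumptions: the variable names `model` and `cm` are illustrative, and `max_iter=1000` is a precaution against convergence warnings on unscaled data, not something the walkthrough above requires.

```python
import pandas as pd
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

candidates = {
    'gmat': [780, 750, 690, 710, 680, 730, 690, 720, 740, 690, 610, 690, 710, 680, 770, 610],
    'gpa': [4, 3.9, 3.3, 3.7, 3.9, 3.7, 2.3, 3.3, 3.3, 1.7, 2.7, 3.7, 3.7, 3.3, 3.3, 3],
    'work_experience': [3, 4, 3, 5, 4, 6, 1, 4, 5, 1, 3, 5, 6, 4, 3, 1],
    'admitted': [1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0],
}
df = pd.DataFrame(candidates)
X = df[['gmat', 'gpa', 'work_experience']]
y = df['admitted']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# max_iter is raised from the default as a precaution (an assumption, see above).
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are actual classes, columns are predicted classes, in the order given by labels.
cm = metrics.confusion_matrix(y_test, y_pred, labels=[0, 1])
print(cm)

# Precision, recall and F1 per class; zero_division=0 avoids warnings
# if a class is never predicted on such a small test set.
print(metrics.classification_report(y_test, y_pred, labels=[0, 1], zero_division=0))
```

With 16 rows and a 25% test split, the test set holds only 4 candidates, so each misclassification moves the accuracy by 25 percentage points.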
However, the job isn't done yet. The code above only gives you a confusion matrix and the accuracy. To see the test data and the predicted outcome for each test candidate, use the following code:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
candidates = {
    'gmat': [780, 750, 690, 710, 680, 730, 690, 720, 740, 690, 610, 690, 710, 680, 770, 610],
    'gpa': [4, 3.9, 3.3, 3.7, 3.9, 3.7, 2.3, 3.3, 3.3, 1.7, 2.7, 3.7, 3.7, 3.3, 3.3, 3],
    'work_experience': [3, 4, 3, 5, 4, 6, 1, 4, 5, 1, 3, 5, 6, 4, 3, 1],
    'admitted': [1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0],
}
df = pd.DataFrame(candidates, columns=['gmat', 'gpa', 'work_experience', 'admitted'])
X = df[['gmat', 'gpa', 'work_experience']]
y = df['admitted']
# Train on 75% of the dataset and test on the remaining 25%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
logistic_regression = LogisticRegression()
logistic_regression.fit(X_train, y_train)
y_pred = logistic_regression.predict(X_test)
print(X_test)  # test dataset
print(y_pred)  # predicted values, in the same row order as X_test
After running the code, you will see the test set alongside the predicted label for each test candidate, where 1 means admitted and 0 means rejected.
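Once a model is fitted, the same `predict` call works on candidates it has never seen, and `predict_proba` returns the estimated probability behind each label. The sketch below illustrates this. The three applicants in `new_candidates` are invented purely for illustration, and `max_iter=1000` is an assumed precaution against convergence warnings.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

candidates = {
    'gmat': [780, 750, 690, 710, 680, 730, 690, 720, 740, 690, 610, 690, 710, 680, 770, 610],
    'gpa': [4, 3.9, 3.3, 3.7, 3.9, 3.7, 2.3, 3.3, 3.3, 1.7, 2.7, 3.7, 3.7, 3.3, 3.3, 3],
    'work_experience': [3, 4, 3, 5, 4, 6, 1, 4, 5, 1, 3, 5, 6, 4, 3, 1],
    'admitted': [1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0],
}
df = pd.DataFrame(candidates)
X = df[['gmat', 'gpa', 'work_experience']]
y = df['admitted']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000)  # max_iter is an assumed precaution
model.fit(X_train, y_train)

# Hypothetical new applicants (values invented purely for illustration).
new_candidates = pd.DataFrame({
    'gmat': [590, 740, 680],
    'gpa': [2.0, 3.7, 3.3],
    'work_experience': [3, 4, 6],
})

predicted_labels = model.predict(new_candidates)
# Columns of predict_proba follow model.classes_, so column 1 holds P(admitted = 1).
predicted_probs = model.predict_proba(new_candidates)[:, 1]

new_candidates['predicted_admitted'] = predicted_labels
new_candidates['probability_admitted'] = predicted_probs
print(new_candidates)
```

The column names of the new DataFrame must match the predictors used for training, in the same order, for the predictions to be meaningful.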