Logistic Regression

Portuguese Version

To talk about Logistic Regression, I used the same dataset of the post about Multiple Linear Regression. It has scores of exams that help students to enter colleges.

The Logistic Regression uses binary values to classify and this dataset doesn’t have this type of data. So, I inserted a new column Admitted.

Table showing fourteen lines and ten columns

I added a new column with a basis on column Chance of Admit. If the chances are greater than 80% then the value is 1 if not is 0.

base['Admitted'] = [1 if chance > 0.8 else 0 for chance in base['Chance of Admit ']]

To get a better visualization I plotted the chart below.

labels = ['Yes', 'No']
x_pos = [0, 1]

admitted = len(base[base['Admitted'] == 1])
not_admitted = len(base[base['Admitted'] == 0])

plt.bar(x_pos, [admitted, not_admitted])
plt.xticks(x_pos, labels)
plt.title('Admitted students')

Bar chart showing total admitted and not admitted students

After I started to prepare the data. Firstly, I made a new DataFrame only with the necessary columns.

toefl = base.iloc[:, 2].values
gre   = base.iloc[:, 1].values
cgpa  = base.iloc[:, 6].values


x = pd.DataFrame(np.c_[toefl, gre, cgpa], columns=['toefl','gre', 'cgpa'])

# Admitted Column
y = base.iloc[:, -1].values

Table showing nine lines and three columns

I separated dataset on training and test data for I can create a model to predict.

X_train, X_test, y_train, y_test = train_test_split(x,y, test_size=0.20, random_state=42)

logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)

predictions = logmodel.predict(X_test)

Then I generated a Confusion Matrix and scores to evaluate the result.

cnf_matrix = metrics.confusion_matrix(y_test, predictions)
sns.heatmap(cnf_matrix, annot=True)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

Confusion Matrix

print("Accuracy: ", metrics.accuracy_score(y_test, predictions))
print("Precision: ", metrics.precision_score(y_test, predictions))
print("Recall: ", metrics.recall_score(y_test, predictions))

# Output
# Accuracy: 0.85
# Precision: 0.730769230769
# Recall: 0.703703703704

Considerations

With the Confusion Matrix we can conclude that:

True positive: 19 (the model predicted a positive result and it was positive).
True negative: 66 (the model predicted a negative result and it was negative).
False-positive: 7 (the model predicted a positive result and it was negative).
False-negative: 8 (the model predicted a negative result and it was positive).

and with the scores:

Accuracy: 85% was predicted correctly.
Precision: 73% of the time the model is correct about its predictions.
Recall: 70% of the time the model expected true result.

The contents that I have shared are about what I am learning, so, if you found some error, on the code, concepts, considerations even spelling please let me know, like this, you will help me a lot to avoid more mistakes.

python linear regression data science statistic

Discussion and feedback