Area Under the Receiver Operating Characteristic Curve (AUC) is a widely used numerical metric for evaluating and comparing the performance of binary classification models such as logistic regression.
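The AUC is the total area under the Receiver Operating Characteristic (ROC) curve, which plots the true positive rate against the false positive rate across all classification thresholds. It summarises the performance of the model as a single numerical value.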
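To make the "area under the ROC curve" definition concrete, the following minimal sketch builds the ROC curve with `roc_curve()` and integrates it with `auc()` (trapezoidal rule), both from `sklearn.metrics`. The labels and scores are small hypothetical values chosen for illustration; they are not taken from the breast cancer dataset used later.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# hypothetical true labels and predicted scores (illustration only)
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# ROC curve: false positive rate and true positive rate at each threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# area under the ROC curve using the trapezoidal rule
area = auc(fpr, tpr)

# roc_auc_score() computes the same quantity in a single call
direct = roc_auc_score(y_true, y_score)

print(area, direct)  # both are 0.75 for these values
```

Three of the four positive-negative pairs are ranked correctly here (only the positive scored 0.35 falls below the negative scored 0.4), which is why the area comes out at 0.75.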
AUC ranges from 0 to 1. A perfect model, which distinguishes the two classes without error, has an AUC of 1. A random model (predictions no better than chance) has an AUC of 0.5.
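The reference values above can be checked on a toy example. This is a sketch with hypothetical labels and scores; the constant-score model is used here as a deterministic stand-in for an uninformative (chance-level) model, and the reversed model illustrates the lower end of the range.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# hypothetical labels: three healthy (0) and three cancerous (1) patients
y_true = np.array([0, 0, 0, 1, 1, 1])

# perfect model: every positive scores higher than every negative
perfect_scores = np.array([0.1, 0.2, 0.3, 0.7, 0.8, 0.9])

# uninformative model: the same score for every patient
constant_scores = np.full(6, 0.5)

# reversed model: every positive scores lower than every negative
reversed_scores = 1 - perfect_scores

auc_perfect = roc_auc_score(y_true, perfect_scores)    # 1.0
auc_constant = roc_auc_score(y_true, constant_scores)  # 0.5
auc_reversed = roc_auc_score(y_true, reversed_scores)  # 0.0

print(auc_perfect, auc_constant, auc_reversed)
```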
AUC is commonly used for comparing and selecting the better model. The higher the AUC, the better the model.
The advantage of AUC is that it is scale-invariant (independent of the absolute scale of predicted probabilities) and classification-threshold-invariant (considers all possible threshold values).
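Scale invariance can be illustrated with a short sketch: any strictly increasing transformation of the scores keeps their ordering, so the AUC does not change. The labels, scores, and the two transformations (division by 10 and the logit) are hypothetical choices made for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# hypothetical true labels and predicted probabilities (illustration only)
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.2, 0.6, 0.5, 0.9, 0.3, 0.7])

# strictly increasing transformations: the scale changes, the ranking does not
rescaled = y_score / 10
logit = np.log(y_score / (1 - y_score))

auc_original = roc_auc_score(y_true, y_score)
auc_rescaled = roc_auc_score(y_true, rescaled)
auc_logit = roc_auc_score(y_true, logit)

# all three values are identical because only the ranking of scores matters
print(auc_original, auc_rescaled, auc_logit)
```

No threshold is chosen anywhere in this calculation, which reflects the threshold-invariant property: the metric is computed over all possible thresholds at once.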
In Python, the AUC can be calculated using the roc_auc_score() function from the sklearn package.
The following step-by-step example explains how to calculate the AUC in Python for a logistic regression model.
Getting the dataset
Fit the logistic regression model using the sample breast cancer dataset.
This sample breast cancer dataset includes four features (predictors: Age, BMI, Insulin, and Leptin) and a binary outcome, Classification, which records whether the patient is healthy (0) or has cancer (1).
    # import package
    import pandas as pd

    # load dataset
    df = pd.read_csv("https://reneshbedre.github.io/assets/posts/logit/breast_cancer_sample_2.csv")

    # view first few rows
    # Classification is the outcome with two levels with cancer (1) or healthy (0) patients
    df.head(2)

    # output
    #    Age        BMI  Insulin  Leptin  Classification
    # 0   48  23.500000    2.707  8.8071               0
    # 1   83  20.690495    3.115  8.8438               0
Split the dataset into training and test datasets. We will use the train_test_split() function from the sklearn package to split the data into 75% training and 25% test datasets (the function's default split when no test size is given).
The training dataset will be used for training the model and the test dataset will be used for prediction.
    # import package
    from sklearn.model_selection import train_test_split

    # split into training and testing
    df_train, df_test = train_test_split(df, random_state = 0)
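The 75%/25% proportions come from the default behaviour of train_test_split() when neither test_size nor train_size is passed. A minimal sketch that checks this on a hypothetical 100-row array (a stand-in for the real dataset, so that it runs without downloading anything):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# hypothetical data: 100 rows and 4 feature columns (illustration only)
data = np.arange(400).reshape(100, 4)

# default split: 25% of the rows go to the test set
train, test = train_test_split(data, random_state=0)
print(train.shape, test.shape)  # (75, 4) (25, 4)

# writing the test size explicitly gives the same split for the same seed
train_explicit, test_explicit = train_test_split(
    data, test_size=0.25, random_state=0
)
print(np.array_equal(test, test_explicit))  # True
```

Passing test_size explicitly is optional, but it makes the intended proportions visible to readers of the code.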
Fit the logistic regression model
Fit the logistic regression model using the training dataset:
    # import package
    from sklearn.linear_model import LogisticRegression

    # get X and y
    X_train = df_train[["Age", "BMI", "Insulin", "Leptin"]]
    y_train = df_train["Classification"]

    # fit the model
    fit = LogisticRegression(random_state = 0).fit(X_train, y_train)
Predict the outcome for the test dataset using the fitted model:
    # perform prediction
    # get X and y
    X_test = df_test[["Age", "BMI", "Insulin", "Leptin"]]
    y_test = df_test["Classification"]

    # calculate predicted probabilities
    pred_probs = fit.predict_proba(X_test)[:, 1]
Calculate the AUC using the roc_auc_score() function from the sklearn package. The roc_auc_score() function takes the true labels and the predicted scores (here, the predicted probabilities of the positive class) and returns the AUC.
    # import packages
    from sklearn.metrics import roc_auc_score

    # calculate AUC
    roc_auc_score(y_true = y_test, y_score = pred_probs)

    # output
    # 0.6078
The AUC of the fitted model is 0.6078.
The closer the AUC is to 1, the better the model. An AUC of 0.6078 is only slightly above the 0.5 expected from random prediction, which implies that the fitted model has poor discrimination and may not perform well in predicting whether the patient is healthy or has cancer.
This work is licensed under a Creative Commons Attribution 4.0 International License