Calculate AUC in R: Step-by-Step Guide With Example

Renesh Bedre    2 minute read

Area Under the Receiver Operating Characteristic Curve (AUC) is a widely used numerical metric for evaluating and comparing the performance of binary classification models such as binary logistic regression.

The Receiver Operating Characteristic (ROC) curve is a plot between the true positive rate (sensitivity) and the false positive rate (1-specificity) at different threshold values. The AUC is the entire area under this ROC curve and summarises the performance of the model as a single numerical value.
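
To see what goes into each point on the curve, here is a small illustrative sketch (the labels, probabilities, and 0.5 threshold are made up for this example) that computes the sensitivity and false positive rate at a single threshold:

# toy data: true labels (1 = positive) and predicted probabilities (illustrative only)
truth <- c(1, 1, 1, 0, 0, 0, 1, 0)
probs <- c(0.9, 0.8, 0.4, 0.3, 0.7, 0.2, 0.6, 0.1)

# classify at an assumed threshold of 0.5
pred <- as.integer(probs >= 0.5)

# true positive rate (sensitivity) and false positive rate (1 - specificity)
tpr <- sum(pred == 1 & truth == 1) / sum(truth == 1)
fpr <- sum(pred == 1 & truth == 0) / sum(truth == 0)
c(sensitivity = tpr, fpr = fpr)

Repeating this calculation over all possible thresholds and plotting the resulting pairs traces out the ROC curve.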

AUC ranges from 0 to 1. A perfect model (one that completely separates the two classes) has an AUC of 1, while a random model (one whose predictions are no better than chance) has an AUC of 0.5.
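
To illustrate these two extremes, the sketch below scores simulated data with the pROC package (introduced later in this post); the data are made up purely for demonstration:

# illustrative simulation: 50 controls (0) and 50 cases (1)
library(pROC)
set.seed(1)
labels <- c(rep(0, 50), rep(1, 50))

# a "perfect" score: every case scores higher than every control -> AUC of 1
perfect_scores <- c(runif(50, 0, 0.4), runif(50, 0.6, 1))
auc(labels, perfect_scores)

# a random score carries no class information -> AUC close to 0.5
random_scores <- runif(100)
auc(labels, random_scores)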

The higher the AUC, the better the model. AUC is therefore commonly used to compare models and select the best-performing one.

The advantage of AUC is that it is scale-invariant (independent of the absolute scale of predicted probabilities) and classification-threshold-invariant (considers all possible threshold values).
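
For example, because the AUC depends only on how the predictions rank the observations, any monotonic transformation of the predicted probabilities leaves it unchanged. The toy sketch below (made-up values) demonstrates this:

library(pROC)

# toy labels and model scores (illustrative values)
y <- c(0, 0, 1, 0, 1, 1, 0, 1)
scores <- c(0.10, 0.35, 0.62, 0.48, 0.91, 0.55, 0.20, 0.80)

# AUC from the raw scores
auc(y, scores)

# a monotonic transformation (here, the logistic function) changes the scale
# but not the ranking, so the AUC is identical
auc(y, plogis(scores))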

In R, the AUC can be calculated using the auc() function from the pROC package.

The following step-by-step example explains how to calculate the AUC in R for a logistic regression model.

Getting the dataset

We will fit a logistic regression model on a sample breast cancer dataset. This dataset contains four features (predictors) and an outcome variable indicating whether the patient is healthy or has cancer.

# load data
df <- read.csv("https://reneshbedre.github.io/assets/posts/logit/breast_cancer_sample.csv")

# view first few rows
# diagnosis is the outcome with two levels: cancer (1) or healthy (0)
head(df, 2)

 Age      BMI Glucose Insulin diagnosis
1  48 23.50000      70   2.707         0
2  83 20.69049      92   3.115         0

Split training and test datasets

Split the dataset into training and test sets. We will use the createDataPartition() function from the caret package to allocate 70% of the data to training and 30% to testing.

The training dataset will be used for training the model and the test dataset will be used for prediction.

# load package
library(caret)

# set random seed (for reproducibility)
set.seed(345)

# split into training and testing
index <- createDataPartition(df$diagnosis, p = 0.7, list = FALSE)

# Create the training and test datasets
train_df <- df[index, ]
test_df <- df[-index, ]
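
As an optional sanity check, you can confirm that the partitions contain roughly 70% and 30% of the rows:

# number of rows in each partition (roughly 70% / 30% of the full dataset)
nrow(train_df)
nrow(test_df)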

Fit the model

Fit the logistic regression model using the training dataset:

# fit logistic regression model
fit <- glm(diagnosis ~ Age + BMI + Glucose + Insulin, family = binomial(), data = train_df)
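
Optionally, before making predictions you can inspect the fitted coefficients (this step is not required for calculating the AUC):

# optional: view estimated coefficients, standard errors, and p-values
summary(fit)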

Perform prediction

Predict the outcome probabilities for the test dataset using the fitted model:

# predict probabilities on the test dataset
# type = "response" returns predicted probabilities rather than log-odds
pred_probs <- predict(fit, test_df, type = "response")

Calculate AUC

Calculate the AUC using the auc() function from the pROC package. The auc() function takes the true outcomes and the predicted probabilities, and returns the AUC.

# load packages 
library(pROC)

# calculate AUC
auc(test_df$diagnosis, pred_probs)

# output
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Area under the curve: 0.8571

The AUC of the fitted model is 0.8571, which indicates that the model has good discriminative ability for predicting whether a patient is healthy or has cancer.
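
Although the AUC alone summarises the performance, it can also help to visualise the underlying ROC curve. The following sketch reuses test_df$diagnosis and pred_probs from above; the plotting options shown are one reasonable choice, not the only one:

# build the ROC object from the true outcomes and predicted probabilities
roc_obj <- roc(test_df$diagnosis, pred_probs)

# plot the ROC curve; legacy.axes = TRUE puts 1 - specificity on the x-axis,
# and print.auc = TRUE adds the AUC value to the plot
plot(roc_obj, legacy.axes = TRUE, print.auc = TRUE)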

This work is licensed under a Creative Commons Attribution 4.0 International License
