# Logistic regression in Python (feature selection, model fitting, and prediction)

## What is logistic regression?

- Logistic regression models the binary (dichotomous) response variable (e.g. 0 and 1, true and false) as linear combinations of the single or multiple independent (also called predictor or explanatory) variables.
- Univariate logistic regression has one independent variable, and multivariate logistic regression has more than one independent variables.
- In logistic regression, the probability or odds of the response variable (instead of values as in linear regression) are modeled as function of the independent variables.
- For example, prediction of death or survival of patients, which can be coded as 0 and 1, can be predicted by metabolic markers.

## Logistic regression assumptions

Logistic regression does not require to follow the assumptions of normality and equal variances of errors as in linear regression, but it needs to follow the below assumptions

- The linear relationship between the continuous independent variables and log odds of the dependent variable
- No multicollinearity among the independent variables. Multicollinearity can be tested using the Variance Inflation Factor (VIF).
- No influential outliers
- Independence of errors (residuals) or no significant autocorrelation. The residuals should not be correlated with each other. This can be tested using the Durbin-Watson test.
- The sample size should be large (at least 50 observations per independent variables are recommended)

## Logistic regression vs. Linear regression

Logistic regression | Linear regression |
---|---|

Dependent variable is categorical (binary or dichotomous) variable | Dependent variable is a continuous variable |

Models the estimated probabilities of the events (predicted values are within the range of 0 and 1) | Models the quantitative response of dependent variable (predicted values can be outside the range of 0 and 1) |

Coefficients of regression are estimated using maximum likelihood estimation (MLE) method | Coefficients of regression are estimated using the least square method |

S-shaped (sigmoidal) curve between independent variables and predicted probabilities | Linear relationship between the dependent and independent variables |

Predicts the categorical response (class assignment) of the observations (classification model) | Predicts the quantitative response of the observations (regression model) |

Coefficients of regression interpreted in terms of odds or odds ratio (OR) | Coefficients of regression interpreted directly based on the estimated values |

log odds of the dependent variable has a linear relationship with the continuous independent variables | Outcome of dependent variable has a linear relationship with the independent variables |

It does not require to follow the assumptions of normality and equal variances of errors, but errors should be independent | It should follow assumptions of normality and equal variances of errors |

## Logistic regression model

The logistic regression model follows a binomial distribution, and the coefficients of regression (parameter estimates) are estimated using the maximum likelihood estimation (MLE). The logistic regression model the output as the odds, which assign the probability to the observations for classification.

## Odds and Odds ratio (OR)

- Odds is the ratio of the probability of an event happening to the probability of an event not happening
(
*p*∕ 1-*p*). Odds can range from 0 to +∞. - The odds ratio (OR) is the ratio of two odds. OR can range from 0 to +∞. OR is useful in interpreting the coefficients of regressions i.e effect of independent variables on the response variable, as coefficients of regressions would not be easy to interpret. OR can be obtained by exponentiating the coefficients of regressions.

## Logistic regression in python

- We will use statsmodels, sklearn, seaborn, and bioinfokit (v1.0.4 or later)
- Follow complete python code for cancer prediction using Logistic regression

Note: If you have your own dataset, you should import it as pandas dataframe. Learn how to import data using pandas

```
from bioinfokit.analys import get_data
# get dataset for model training
df_train = get_data('wdbc_train').data
df_train.head(2)
ID dign rad_mean text_mean peri_mean area_mean smooth_mean comp_mean conv_mean conv_p_mean sym_mean frac_dim_mean
0 8711202 1 17.68 20.74 117.40 963.7 0.11150 0.16650 0.18550 0.10540 0.1971 0.06166
1 869218 0 11.43 17.31 73.66 398.0 0.10920 0.09486 0.02031 0.01861 0.1645 0.06562
# get test dataset
df_test = get_data('wdbc_test').data
```

- This dataset represents the characteristics of breast cancer cell nuclei computed from the digitized images (Dua and Graff 2019; Dr. William H. Wolberg, University Of Wisconsin Hospital at Madison).
- The features calculated from the digitized cell images include, radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension for mean, standard error, and largest (worst) values. We have only used the mean values of these features (continuous variables) for regression analysis.
- The outcome (response variable) measured as malignant (1, positive class) or benign (0, negative class) (see dign variable in dataframe)
- Using the logistic regression model, I will build a classifier to predict the outcome as malignant or benign from given test samples

Data distribution for the binary outcome variable,

```
# get count plot for the cancer outcome
import seaborn as sns
ax = sns.countplot(x='dign', data=df_train)
plt.show()
```

**Note**: It is crucial to have balanced class distribution, i.e., there should be no
significant difference between positive and negative classes (commonly negative classes are more than positives in the
life science field). The models trained on datasets with imbalanced class distribution tend to be biased and show poor
performance toward minor class ^{4}.

### Feature selection for model training

- For good predictions of the regression outcome, it is essential to include the good independent variables (features) for fitting the regression model (e.g. variables that are not highly correlated). If you include all features, there are chances that you may not get all significant predictors in the model.
- Let’s visualize the data for correlation among the independent variables

```
from bioinfokit import visuz
X = df_train.iloc[:,2:12]
visuz.stat.corr_mat(df=X, cmap='RdBu')
```

- As you see in the correlation figure, several variables are highly correlated (multicollinearity) to each other (e.g. rad_mean and peri_mean). Multicollinearity can be an issue and reduce the performance of the fitted model. These spurious variables can be detected and dropped using various methods such the VIF, dimension reduction by PCA, recursive feature elimination (RFE), fitting models with all variables and removing insignificant variables, Chi-squared test etc.
- I have used the model fitting and to drop the features with high multicollinearity and insignificant variables.
- Based on model fitting and VIF analysis, I am using only text_mean, peri_mean, smooth_mean, conv_mean, and frac_dim_mean for the logistic regression analysis.

### Logistic regression model fitting

```
# logistic regression model
import statsmodels.api as sm
# get independent variables
X = df_train[['text_mean', 'peri_mean', 'smooth_mean', 'conv_mean', 'frac_dim_mean']]
# to get intercept -- this is optional
# X = sm.add_constant(X)
# get response variables
Y = df_train[['dign']]
# fit the model with maximum likelihood function
model = sm.Logit(endog=Y, exog=X).fit()
# output message
Optimization terminated successfully.
Current function value: 0.147065
Iterations 10
print(model.summary())
# output
Logit Regression Results
==============================================================================
Dep. Variable: dign No. Observations: 426
Model: Logit Df Residuals: 421
Method: MLE Df Model: 4
Date: Wed, 25 Nov 2020 Pseudo R-squ.: 0.7757
Time: 10:24:44 Log-Likelihood: -62.650
converged: True LL-Null: -279.29
Covariance Type: nonrobust LLR p-value: 1.794e-92
=================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------
text_mean 0.2511 0.060 4.204 0.000 0.134 0.368
peri_mean 0.0308 0.014 2.267 0.023 0.004 0.057
smooth_mean 103.0272 25.407 4.055 0.000 53.231 152.823
conv_mean 67.7988 8.614 7.871 0.000 50.916 84.682
frac_dim_mean -387.2520 56.462 -6.859 0.000 -497.915 -276.589
=================================================================================
# get odds ratio
np.exp(model.params)
# output
text_mean 1.285496e+00
peri_mean 1.031256e+00
smooth_mean 5.547844e+44
conv_mean 2.783928e+29
frac_dim_mean 6.585474e-169
```

### Interpretation

- The variable text_mean has an OR of 1.28 which suggests for one unit increase in text_mean we expect that about 1.28 times increase the odds of patient being malignant (assuming all other independent variables constant). Other independent variables can be interpreted in the same way.
- The
*p*values for all independent variables are significant (*p*< 0.05) and suggests that these variables are highly associated with the outcome. - Fractal dimension has a slight effect on cancer classification due to its very low OR
- The fitted model can be evaluated using the goodness-of-fit index pseudo R-squared (McFadden’s R2 index) which measures improvement in model likelihood over the null model (unlike OLS R-squared, which measures the proportion of explained variance). The pseudo R-squared value close to 1 suggests a better fitted model. However, it should be interpreted cautiously and other measures should also be considered for model evaluation. pseudo R-squared would be more useful when comparing the different models for similar datasets predicting the same outcome.

### Prediction of test dataset using fitted model

```
# get the predicted values for the test dataset [0, 1]
pred = model.predict(exog=df_test[['text_mean', 'peri_mean', 'smooth_mean', 'conv_mean', 'frac_dim_mean']])
pred.head()
# output
0 0.004102
1 0.585947
2 0.999832
3 0.032939
4 0.000001
# predicted values > 0.5 classified as malignant (1) and <= 0.05 as benign (0)
round(pred)
0 0.0
1 1.0
2 1.0
3 0.0
4 0.0
# get confusion matrix and accuracy of the prediction
# note: there may be slightly different results if you use sklearn LogisticRegression method
from sklearn.metrics import accuracy_score, confusion_matrix
confusion_matrix(y_true=list(df_test['dign']), y_pred=list(round(pred)))
# output
array([[79, 7],
[ 7, 50]], dtype=int64)
# fitted model accuracy
accuracy_score(y_true=list(df_test['dign']), y_pred=list(round(pred)))
# output
0.9020
```

Confusion matrix,

Predicted Observed |
B(0) | M(1) |
---|---|---|

B(0) | 79 | 7 |

M(1) | 7 | 50 |

In the confusion matrix, diagonal numbers (79 and 50) indicates the correct predictions [true negatives (TN) and true positives (TP)] for the benign (0) and malignant (1) outcomes for test cancer datasets. The other numbers (7 and 7) indicates incorrect predictions [false positives (FP) and false negatives (FN)]

Plot Receiver Operating Characteristic (ROC) curve,

```
from sklearn.metrics import roc_curve, auc, roc_auc_score
from bioinfokit.visuz import stat
fpr, tpr, thresholds = roc_curve(y_true=list(df_test['dign']), y_score=list(pred))
auc = roc_auc_score(y_true=list(df_test['dign']), y_score=list(pred))
# plot ROC
stat.roc(fpr=fpr, tpr=tpr, auc=auc, shade_auc=True, per_class=True, legendpos='upper center', legendanchor=(0.5, 1.08), legendcols=3)
```

#### Interpretation

- In ROC, we can summarize the model predictability based on the area under curve (AUC). AUC range from 0.5 to 1 and a model with higher AUC has higher predictability. AUC refers to the probability that randomly chosen benign patients will have high chances of classification as benign than randomly chosen malignant patients.
- The fitted model has AUC 0.9561 suggesting better predictability in classification for breast cancer. The points lying above the chance level and close to grey line (perfect performance) represents a model with higher predictability.
- The accuracy of the fitted model is 0.9020. Even though accuracy is a measure of model performance, it is not alone enough. The AUC outperforms accuracy for model predictability. Two models can have the same accuracy but can differ in AUC. The models which are evaluated solely on accuracy may lead to misleading classification.

## References

- Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
- Dr. William H. Wolberg, General Surgery Dept. University of Wisconsin, Clinical Sciences Center Madison, WI 53792
- Bewick V, Cheek L, Ball J. Statistics review 14: Logistic regression. Critical care. 2005 Feb 1;9(1):112.
- Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology. 1982 Apr;143(1):29-36.
- Smith TJ, McKenna CM. A comparison of logistic regression pseudo R2 indices. Multiple Linear Regression Viewpoints. 2013;39(2):17-26.
- Abdulhafedh A. Incorporating the multinomial logistic regression in vehicle crash severity modeling: a detailed overview. Journal of Transportation Technologies. 2017;7(03):279.
- Pearson RG, Thuiller W, Araújo MB, Martinez‐Meyer E, Brotons L, McClean C, Miles L, Segurado P, Dawson TP, Lees DC. Model‐based uncertainty in species range prediction. Journal of biogeography. 2006 Oct;33(10):1704-11.
- Josephat PK, Ame A. Effect of Testing Logistic Regression Assumptions on the Improvement of the Propensity Scores. Int. J. Stat. Appl. 2018;8:9-17.

This work is licensed under a Creative Commons Attribution 4.0 International License