Friedman test using Python (with examples and code)
This article explains how to perform the Friedman test in Python. You can refer to this article to know more about Friedman test, when to use Friedman test, assumptions, and how to interpret the Friedman test results.
Friedman test in Python
Friedman test data example
A researcher wants to study the effect of different locations on bacterial disease development in different plant varieties. The disease development is measured as a disease severity index with an ordinal scale (1 to 5, with 1 being no disease and 5 being severe disease symptoms). To check whether locations have an effect on disease development on each plant variety, the researcher evaluated the disease severity index for each plant variety at different locations.
Load the dataset
import pandas as pd df=pd.read_csv("https://reneshbedre.github.io/assets/posts/anova/plant_disease_friedman.csv") df.head(2) ` plant_var L1 L2 L3 L4 0 P1 4 2 5 4 1 P2 3 1 4 3 # convert to long format df_long = pd.melt(df.reset_index(), id_vars=['plant_var'], value_vars=['L1', 'L2', 'L3', 'L4']) df_long.columns = ['plant_var', 'locations', 'disease'] df_long.head(2) plant_var locations disease 0 P1 L1 4 1 P2 L1 3
Summary statistics and visualization of dataset
Get summary statistics based on dependent variable and covariate,
from dfply import * df_long >> group_by(X.locations) >> summarize(n=X['disease'].count(), mean=X['disease'].mean(), median=X['disease'].median(), std=X['disease'].std()) # output locations n mean median std 0 L1 5 4.2 4.0 0.836660 1 L2 5 1.4 1.0 0.547723 2 L3 5 4.0 4.0 0.707107 3 L4 5 4.0 4.0 0.707107
import seaborn as sns import matplotlib.pyplot as plt sns.boxplot(data=df_long, x="locations", y="disease", hue=df_long.locations.tolist()) plt.show()
perform Friedman test
We will use the
friedman function from
pingouin package to perform Friedman test in Python
Pass the following parameters to
- data : Dataframe (wide or long format)
- dv : Name of column in dataframe that contains dependent variable
- within : Name of column in dataframe that contains within-subject factor (treatment)
- subject : Name of column in dataframe that contains subjects (block)
import pingouin as pg pg.friedman(data=df_long, dv="disease", within="locations", subject="plant_var") # output Source W ddof1 Q p-unc Friedman locations 0.656522 3 9.847826 0.019905
Friedman test results with chi-squared test show that there are significant differences [χ2(3) = 9.84, p = 0.01] in disease severity in plant varieties based on their locations.
Friedman test effect size
From the result above, Kendall’s W is 0.656 and indicates a large effect size (degree of difference). Kendall’s W is based on Cohen’s interpretation guidelines (0.1: small effect; 0.3: moderate effect; and >0.5: large effect).
Friedman test is significant (there are significant differences among locations on disease severity), but it is an
omnibus test statistic and does not tell which locations have a significant effect on disease severity.
To know which locations are significantly different, I will perform the pairwise comparisons using the Conover post hoc test. In addition to Conover’s test, Wilcoxon-Nemenyi-McDonald-Thompson test (Nemenyi test) can also be used as post-hoc test for significant Friedman test.
The FDR method will be used to adjust the p values for multiple hypothesis testing at a 5% cut-off
I will use the
posthoc_conover_friedman function from the
scikit_posthocs package to perform Conover post-hoc test in
Pass the following parameters to
- a : pandas DataFrame
- y_col : Name of column in dataframe that contains dependent variable
- melted : Dataframe in long format (bool)
- group_col : Name of column in dataframe that contains within-subject factor (treatment)
- block_col : Name of column in dataframe that contains subjects (block)
- p_adjust : Adjust p value for multiple comparisons (see details here)
import scikit_posthocs as sp sp.posthoc_conover_friedman(a=df_long, y_col="disease", group_col="locations", block_col="plant_var", p_adjust="fdr_bh", melted=True) # output L1 L2 L3 L4 L1 1.000000 0.070557 0.902719 0.902719 L2 0.070557 1.000000 0.070557 0.070557 L3 0.902719 0.070557 1.000000 0.902719 L4 0.902719 0.070557 0.902719 1.000000
The multiple pairwise comparisons suggest that there are no statistically significant differences between different locations on disease severity for different plant varieties, despite there being low disease severity for location L2.
Enhance your skills with courses on Machine Learning and Python
- Machine Learning with Python
- Machine Learning for Data Analysis
- Cluster Analysis in Data Mining
- Python for Everybody Specialization
- MANOVA using R (with examples and code)
- What is p value and how to calculate p value by hand
- Repeated Measures ANOVA using Python and R (with examples)
- ANOVA using Python (with examples)
- Multiple hypothesis testing problem in Bioinformatics
If you have any questions, comments or recommendations, please email me at email@example.com
This work is licensed under a Creative Commons Attribution 4.0 International License