MANOVA using Python (using statsmodels and sklearn)
This article explains how to perform the one-way MANOVA in Python. You can refer to this article to know more about MANOVA, when to use MANOVA, assumptions, and how to interpret the MANOVA results.
One-way (one factor) MANOVA in Python
MANOVA example dataset
Suppose we have a dataset of various plant varieties (plant_var) and their associated phenotypic measurements for 
plant heights (height) and canopy volume (canopy_vol). We want to see if plant heights and canopy volume are 
associated with different plant varieties using MANOVA.
Load dataset,
import pandas as pd
df=pd.read_csv("https://reneshbedre.github.io/assets/posts/ancova/manova_data.csv")
df.head(2)
  plant_var  height  canopy_vol
0         A    20.0         0.7
1         A    22.0         0.8
Summary statistics and visualization of dataset
Get summary statistics based on each dependent variable,
from dfply import *
# summary statistics for dependent variable height 
df >> group_by(X.plant_var) >> summarize(n=X['height'].count(), mean=X['height'].mean(), std=X['height'].std())
# output
  plant_var   n   mean       std
0         A  10  18.90  2.923088
1         B  10  16.54  1.920185
2         C  10   3.05  1.039498
3         D  10   9.35  2.106735
# summary statistics for dependent variable canopy_vol 
df >> group_by(X.plant_var) >> summarize(n=X['canopy_vol'].count(), mean=X['canopy_vol'].mean(), std=X['canopy_vol'].std())
# output
  plant_var   n   mean       std
0         A  10  0.784  0.121308
1         B  10  0.608  0.096816
2         C  10  0.272  0.143279
3         D  10  0.474  0.094540
Visualize dataset,
import seaborn as sns
import matplotlib.pyplot as plt
fig, axs = plt.subplots(ncols=2)
sns.boxplot(data=df, x="plant_var", y="height", hue=df.plant_var.tolist(), ax=axs[0])
sns.boxplot(data=df, x="plant_var", y="canopy_vol", hue=df.plant_var.tolist(), ax=axs[1])
plt.show()
 
perform one-way MANOVA
from statsmodels.multivariate.manova import MANOVA
fit = MANOVA.from_formula('height + canopy_vol ~ plant_var', data=df)
print(fit.mv_test())
# output
                  Multivariate linear model
==============================================================
                                                              
--------------------------------------------------------------
       Intercept         Value  Num DF  Den DF F Value  Pr > F
--------------------------------------------------------------
          Wilks' lambda  0.0374 2.0000 35.0000 450.0766 0.0000
         Pillai's trace  0.9626 2.0000 35.0000 450.0766 0.0000
 Hotelling-Lawley trace 25.7187 2.0000 35.0000 450.0766 0.0000
    Roy's greatest root 25.7187 2.0000 35.0000 450.0766 0.0000
--------------------------------------------------------------
                                                              
--------------------------------------------------------------
       plant_var         Value  Num DF  Den DF F Value  Pr > F
--------------------------------------------------------------
          Wilks' lambda  0.0797 6.0000 70.0000  29.6513 0.0000
         Pillai's trace  1.0365 6.0000 72.0000  12.9093 0.0000
 Hotelling-Lawley trace 10.0847 6.0000 44.9320  58.0496 0.0000
    Roy's greatest root  9.9380 3.0000 36.0000 119.2558 0.0000
==============================================================
The Pillai’s Trace test statistics is statistically significant [Pillai’s Trace = 1.03, F(6, 72) = 12.90, p < 0.001] and indicates that plant varieties has a statistically significant association with both combined plant height and canopy volume.
post-hoc test
Here we will perform the linear discriminant analysis (LDA) using sklearn to see the differences between each group. 
LDA will discriminate the groups using information from both the dependent variables.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as lda
X = df[["height", "canopy_vol"]]
y = df["plant_var"]
post_hoc = lda().fit(X=X, y=y)
# get Prior probabilities of groups:
post_hoc.priors_
array([0.25, 0.25, 0.25, 0.25])
# get group means
post_hoc.means_
array([[18.9  ,  0.784],
       [16.54 ,  0.608],
       [ 3.05 ,  0.272],
       [ 9.35 ,  0.474]])
# get Coefficients of linear discriminants
post_hoc.scalings_
array([[-0.43883736, -0.2751091 ],
       [-1.39491582,  9.32562799]])
# get Proportion of trace (variance explained by each of the selected components)
post_hoc.explained_variance_ratio_
array([0.98545382, 0.01454618])
# plot 
X_new = pd.DataFrame(lda().fit(X=X, y=y).transform(X), columns=["lda1", "lda2"])
X_new["plant_var"] = df["plant_var"]
sns.scatterplot(data=X_new, x="lda1", y="lda2", hue=df.plant_var.tolist())
plt.show()
 
The LDA scatter plot discriminates against multiple plant varieties based on the two dependent variables. The C and D plant variety has a significant difference (well separated) as compared to A and B. A and B plant varieties are more similar to each other. Overall, LDA discriminated between multiple plant varieties.
Read detail about MANOVA here
If you have any questions, comments or recommendations, please email me at reneshbe@gmail.com
This work is licensed under a Creative Commons Attribution 4.0 International License
 
      