I need to get the statistics that pandas computes when it draws a box plot from a DataFrame, i.e. quartile 1, quartile 2 (the median), quartile 3, the lower whisker value, the upper whisker value, and the outliers. I used the following code to draw the boxplot:

import numpy as np  # needed for np.random.rand below
import pandas as pd
df = pd.DataFrame(np.random.rand(100, 5), columns=['A', 'B', 'C', 'D', 'E'])
pd.DataFrame.boxplot(df, return_type='both')

Is there a way to get these values without calculating them manually?

2 Answers

One option is to read the y data back from the plotted artists, which is probably most useful for the outliers (fliers):

_, bp = pd.DataFrame.boxplot(df, return_type='both')

outliers = [flier.get_ydata() for flier in bp["fliers"]]
boxes = [box.get_ydata() for box in bp["boxes"]]
medians = [median.get_ydata() for median in bp["medians"]]
whiskers = [whisker.get_ydata() for whisker in bp["whiskers"]]
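
To turn those raw arrays into one row per column, something like the sketch below may help. It assumes matplotlib's usual layout, where each box contributes two whisker lines (lower first, then upper) whose y data run from the box edge to the whisker end, and one median line and one flier line per box. The names rows and plot_stats are only illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # illustrative data, not the data from the question
df = pd.DataFrame(rng.random((100, 5)), columns=list("ABCDE"))

ax, bp = df.boxplot(return_type="both")

rows = {}
for i, col in enumerate(df.columns):
    # Assumption: whiskers are stored as [lower, upper] pairs, in column order.
    lower = bp["whiskers"][2 * i].get_ydata()      # [q1, lower whisker end]
    upper = bp["whiskers"][2 * i + 1].get_ydata()  # [q3, upper whisker end]
    rows[col] = {
        "whislo": lower[1],
        "q1": lower[0],
        "med": bp["medians"][i].get_ydata()[0],
        "q3": upper[0],
        "whishi": upper[1],
        "n_fliers": len(bp["fliers"][i].get_ydata()),
    }

# One row per DataFrame column, one column per statistic
plot_stats = pd.DataFrame.from_dict(rows, orient="index")
print(plot_stats)
```

Reading values back from the artists depends on how matplotlib happens to build the plot, so treat this as a convenience rather than a stable interface.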

For the quartiles and median (and hence the IQR, q3 - q1), it's probably more straightforward to use either

quantiles = df.quantile([0.01, 0.25, 0.5, 0.75, 0.99])

or, as suggested by WoodChopper

stats = df.describe()
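
Note that neither call gives the whisker ends or the outliers directly. A minimal sketch of deriving them from the quartiles follows, assuming the default whisker rule of 1.5 × IQR (matplotlib's whis=1.5); the names lo_fence, hi_fence, inside and summary are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # illustrative data with a few likely outliers
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["A", "B", "C"])

q1 = df.quantile(0.25)
q3 = df.quantile(0.75)
iqr = q3 - q1

# Assumed rule: points beyond 1.5 * IQR from the box count as outliers
lo_fence = q1 - 1.5 * iqr
hi_fence = q3 + 1.5 * iqr
inside = (df >= lo_fence) & (df <= hi_fence)

# Whiskers end at the most extreme data points that are still inside the fences
whislo = df.where(inside).min()
whishi = df.where(inside).max()

# Outliers are kept per column because their counts differ
outliers = {col: df.loc[~inside[col], col].to_numpy() for col in df}

summary = pd.DataFrame(
    {"q1": q1, "median": df.median(), "q3": q3, "whislo": whislo, "whishi": whishi}
)
print(summary)
```

The whiskers here are actual data points, not fixed percentiles, so they will generally differ from the 0.01 and 0.99 quantiles.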

  • To get the boxplot data, use matplotlib.cbook.boxplot_stats, which returns a list of dictionaries of the statistics used to draw a series of box-and-whisker plots with matplotlib.axes.Axes.bxp.
    • To get the boxplot statistics, pass an array to boxplot_stats.
      • This is not specific to pandas.
  • The default plotting backend for pandas is matplotlib, so boxplot_stats returns the same metrics that pandas.DataFrame.plot.box draws.
  • Pass the numeric columns of interest to boxplot_stats as an array, using df.values.
  • The columns must not contain NaN values.
  • Tested in python 3.11.4, pandas 2.1.0, matplotlib 3.7.2
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.cbook import boxplot_stats
import numpy as np

# test dataframe
np.random.seed(346)
df = pd.DataFrame(np.random.rand(100, 5), columns=['A', 'B', 'C', 'D', 'E'])

# plot the dataframe as needed
ax = df.plot.box(figsize=(8, 6), showmeans=True, grid=True)

  • Extract the boxplot metrics by passing an array to boxplot_stats.
    • Either boxplot_stats(df) or boxplot_stats(df.values) works.
    • The dicts are in the same order as the columns of df.
    • This data has no outliers (fliers) because np.random.rand draws uniformly from [0, 1), so no point falls beyond the whisker limits.
# create a dict of dicts, using each column name as the key for that column's dict of statistics
stats = dict(zip(df.columns, boxplot_stats(df)))

print(stats)
[out]:
{'A': {'cihi': 0.6008396701195271,
       'cilo': 0.45316512285356997,
       'fliers': array([], dtype=float64),
       'iqr': 0.47030110594253877,
       'mean': 0.49412631128104645,
       'med': 0.5270023964865486,
       'q1': 0.2603486498337239,
       'q3': 0.7306497557762627,
       'whishi': 0.9941975539538199,
       'whislo': 0.00892072823759571},
 'B': {'cihi': 0.5460977498205477,
       'cilo': 0.39283808760835964,
       'fliers': array([], dtype=float64),
       'iqr': 0.4880880962171596,
       'mean': 0.47578540593013985,
       'med': 0.4694679187144537,
       'q1': 0.2466015651284032,
       'q3': 0.7346896613455628,
       'whishi': 0.9906905357196321,
       'whislo': 0.002613905425137064},
 'C': {'cihi': 0.6327876179340386,
       'cilo': 0.47317829117336885,
       'fliers': array([], dtype=float64),
       'iqr': 0.5083099578365278,
       'mean': 0.5202481643792808,
       'med': 0.5529829545537037,
       'q1': 0.24608370844800756,
       'q3': 0.7543936662845353,
       'whishi': 0.9968264819096214,
       'whislo': 0.008450848029956215},
 'D': {'cihi': 0.5429786764060252,
       'cilo': 0.40089287519667627,
       'fliers': array([], dtype=float64),
       'iqr': 0.4525025516221303,
       'mean': 0.4948030963370377,
       'med': 0.4719357758013507,
       'q1': 0.279181107815125,
       'q3': 0.7316836594372553,
       'whishi': 0.9836196084903415,
       'whislo': 0.019864664399723786},
 'E': {'cihi': 0.5413819754851169,
       'cilo': 0.3838462046931251,
       'fliers': array([], dtype=float64),
       'iqr': 0.5017062764076173,
       'mean': 0.4922357500877824,
       'med': 0.462614090089121,
       'q1': 0.2490034171367362,
       'q3': 0.7507096935443536,
       'whishi': 0.9984043081918205,
       'whislo': 0.0036707224412856343}}
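
If the columns do contain NaN values, one possible workaround is to drop them per column before calling boxplot_stats, which I believe mirrors what pandas does when it plots. The sketch below is only an illustration under that assumption; the injected NaNs and the names rows, fliers and summary are not part of the original answer.

```python
import numpy as np
import pandas as pd
from matplotlib.cbook import boxplot_stats

rng = np.random.default_rng(0)  # illustrative data
df = pd.DataFrame(rng.random((100, 3)), columns=["A", "B", "C"])
df.loc[::10, "B"] = np.nan  # inject missing values for the example

rows = {}
fliers = {}
for col in df.columns:
    # boxplot_stats returns a list with one dict for a 1-D input
    s = boxplot_stats(df[col].dropna().to_numpy())[0]
    fliers[col] = s.pop("fliers")  # keep the variable-length outliers separately
    rows[col] = s

# One row per column, one column per scalar statistic
summary = pd.DataFrame.from_dict(rows, orient="index")
print(summary)
```

Keeping the fliers in their own dict avoids storing arrays of different lengths inside DataFrame cells.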
