title:
link: https://claireduvallet.wordpress.com/2018/01/21/__trashed/
author: cduvallet
description:
post_id: 1459
created: 2018/01/21 14:45:22
created_gmt: 2018/01/21 14:45:22
comment_status: open
post_name: __trashed
status: trash
post_type: post


In this post, I’m going to go through some tricks I’ve learned to plot pretty boxplots in Python.

Boxplots are my absolute favorite way to look at data, but the defaults in Python aren’t publication-level pretty. When making boxplots, I think it’s very important to also plot the underlying raw data points whenever possible. Especially for main-text figures and in cases without many points, this is the best way to simultaneously summarize the data and let readers come to their own interpretation of it. When I read a paper, I get fairly grumpy when I know that the sample size is low (say, fewer than 100) but I only get to see the boxplot. Just let me see the data; otherwise, I have to wonder if you’re trying to hide something…

Anyway. In this post, we’ll be using seaborn to draw the main boxplots, with some additional tweaking to make them beautiful. Of course, I didn’t come up with these solutions all on my own, and deep gratitude goes to Stack Overflow for teaching me these tricks (in addition to almost everything else I know about coding).

Generate data

First, let’s set up some toy data. Here, we’ll be simulating making some measurements on three different body sites for 15 healthy patients and 10 diseased patients.


import pandas as pd
import numpy as np

# Set up the data: each block holds [values, site names, labels] for one group
data = np.concatenate(
    [[np.random.normal(loc=1, size=15), 15*['site1'], 15*['healthy']],
     [np.random.normal(loc=3, size=15), 15*['site2'], 15*['healthy']],
     [np.random.normal(loc=0, size=15), 15*['site3'], 15*['healthy']],
     [np.random.normal(loc=1, size=10), 10*['site1'], 10*['disease']],
     [np.random.normal(loc=1, size=10), 10*['site2'], 10*['disease']],
     [np.random.normal(loc=3, size=10), 10*['site3'], 10*['disease']]],
    axis=1)  # join the groups side by side, giving one row each for value, site, label

# Transpose so that each measurement is a row
df = pd.DataFrame(columns=['value', 'site', 'label'], data=data.T)
# Mixing numbers and strings turned every entry into a string, so convert back
df['value'] = df['value'].astype(float)

# Show every ninth row
df.iloc[::9]


       value   site    label
0  -0.013008  site1  healthy
9   1.477600  site1  healthy
18  3.598471  site2  healthy
27  3.775151  site2
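
One thing to keep in mind: the toy data above are drawn without a fixed random seed, so the numbers you get will differ from the ones shown here. If you want repeatable values, here is a minimal sketch of one way to do it. It is an alternative to the code above, not what I originally ran: the seed value (42), the `groups` list, and the `seeded_df` name are all made up for illustration. It also builds the table one group at a time, so the `value` column stays numeric and never needs converting back from strings.

```python
import numpy as np
import pandas as pd

# Assumption: a fixed seed (42 is arbitrary) so that the draws are repeatable
rng = np.random.default_rng(42)

# (mean, number of patients, site, label) for each group, mirroring the setup above
groups = [
    (1, 15, 'site1', 'healthy'),
    (3, 15, 'site2', 'healthy'),
    (0, 15, 'site3', 'healthy'),
    (1, 10, 'site1', 'disease'),
    (1, 10, 'site2', 'disease'),
    (3, 10, 'site3', 'disease'),
]

# Build one small DataFrame per group, then stack them into one long table
seeded_df = pd.concat(
    [pd.DataFrame({'value': rng.normal(loc=loc, size=n),
                   'site': site,
                   'label': label})
     for loc, n, site, label in groups],
    ignore_index=True,
)

# Sanity check: number of measurements per label and site
print(seeded_df.groupby(['label', 'site']).size())
```

The resulting table has the same three columns as `df`, so it should be usable as a drop-in replacement in the rest of the post.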