Carlo Cruz-Albrecht

Python Libraries: Matplotlib and Seaborn Tutorial

How to Use Matplotlib and Seaborn for Data Science


Matplotlib reference: https://matplotlib.org/2.1.2/index.html
Seaborn reference: https://seaborn.pydata.org/

Matplotlib is a ubiquitous plotting library for Python that lets you customize nearly every element of a figure. Seaborn, which is built on top of Matplotlib, lets you make attractive statistical graphs very quickly, though with fewer customization options. Both work well with pandas and NumPy.

To best learn the material, I recommend using Jupyter Notebook to play with the code and exercises yourself! Jupyter Notebooks allow you to write text and run python code in the same document. Download the notebook.

Install jupyter:

pip3 install jupyter

Launch your notebook (opens in browser):

jupyter notebook [name_of_file.ipynb]

Alternatively, you can run Jupyter Notebooks in Google Drive using Colaboratory.

Import

Note: We’ll be relying on Pandas and Numpy in this tutorial.

We need to import matplotlib! In a Jupyter notebook, adding %matplotlib inline makes figures render directly below the cell that creates them.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

Using Matplotlib built into Pandas

If you have a Pandas dataframe, it actually comes with some basic plotting functions that run the matplotlib code for you. It’s a nice shortcut!

yearly_data contains the number of registered babies per year.

yearly_data.head()
      Count
Year
1910   9164
1911   9984
1912  17944
1913  22094
1914  26925

Line Graphs

yearly_data.plot(kind="line")  #kind='line' is optional
<matplotlib.axes._subplots.AxesSubplot at 0x11a052198>
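
The cell that loads yearly_data is not shown in this tutorial. As a minimal sketch for experimenting, a stand-in can be built from the five rows printed by yearly_data.head() (the name yearly_sample is made up for this example). It also shows that DataFrame.plot() returns a Matplotlib Axes, which is the object echoed in the output above.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Stand-in for yearly_data, built only from the five rows shown by head().
# The real table is loaded elsewhere and covers many more years.
yearly_sample = pd.DataFrame(
    {"Count": [9164, 9984, 17944, 22094, 26925]},
    index=pd.Index([1910, 1911, 1912, 1913, 1914], name="Year"),
)

# DataFrame.plot() draws with Matplotlib and returns the Axes it drew on,
# so the usual Axes methods are available for further tweaks.
ax = yearly_sample.plot(kind="line")
ax.set_ylabel("Count")
```

Keeping the returned Axes in a variable is optional, but it is handy when you want to adjust labels or limits after pandas has drawn the plot.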


Study: Name History

# don't worry about this function unless you want to learn about groupby
def your_name_history(name):
    return baby_names[baby_names['Name'] == name].groupby('Year').sum()
table = your_name_history('John')

table.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x114a44470>


Bar Graphs

We can modify our data before we graph it to analyze different things.

yearly_data.plot(kind="bar")
plt.axis('off')  # hide the axes, since a tick label for every year would overlap
(-0.5, 106.5, 0.0, 580000.05000000005)


Class Exercise:

How could we graph only the 15 years after World War II (i.e. 1945-1960)?

Hint: create a table with only the desired years first

modified = yearly_data.loc[1945:1960]

modified.plot(kind="bar", figsize=(15,8))
<matplotlib.axes._subplots.AxesSubplot at 0x11b7f5fd0>
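
One detail behind this solution: .loc slices by index label and keeps both endpoints, unlike Python's positional slicing, so yearly_data.loc[1945:1960] spans 16 rows if every year is present. A small sketch of the difference, using a stand-in built from the five rows printed by yearly_data.head() earlier (yearly_sample is a hypothetical name):

```python
import pandas as pd

# Stand-in table built from the five rows shown by yearly_data.head().
yearly_sample = pd.DataFrame(
    {"Count": [9164, 9984, 17944, 22094, 26925]},
    index=pd.Index([1910, 1911, 1912, 1913, 1914], name="Year"),
)

# .loc slices by label and keeps BOTH endpoints: 1911, 1912 and 1913.
by_label = yearly_sample.loc[1911:1913]

# .iloc slices by position and drops the end: rows 1 and 2 only.
by_position = yearly_sample.iloc[1:3]

print(by_label.index.tolist())     # [1911, 1912, 1913]
print(by_position.index.tolist())  # [1911, 1912]
```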


Plot with Matplotlib

Line Graphs

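Use plt.plot() to create line graphs! You typically pass a sequence of x-values and a sequence of y-values. If you pass only one sequence, it is treated as the y-values and plotted against the positions 0, 1, 2, and so on.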

np.random.seed(42) # To ensure that the random number generation is always the same
plt.plot(np.arange(0, 7, 1), np.random.rand(7, 1))
plt.show()
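
As a small sketch of how plt.plot() treats its arguments, the example below draws the same y-values twice: once with explicit x-values and once with none. The numbers are arbitrary and chosen only for illustration.

```python
import matplotlib.pyplot as plt

heights = [3, 1, 4, 1, 5]  # arbitrary example values

# Explicit x and y: the first sequence is x, the second is y.
(explicit_line,) = plt.plot([10, 20, 30, 40, 50], heights, label="explicit x")

# y only: x defaults to 0, 1, 2, ... (one position per value).
(implicit_line,) = plt.plot(heights, label="implicit x")

plt.legend()
```

plt.plot() returns a list of Line2D objects, which is why each result is unpacked with a one-element tuple.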


%matplotlib inline

plt.plot(np.arange(0, 7, 1), np.random.rand(7, 1))
# plt.show() no longer required
[<matplotlib.lines.Line2D at 0x11bfb12e8>]


Histograms

@author previous sections: Carlo Cruz-Albrecht
@author this section and below: Jun Seo Park

To explore other types of charts, let’s load in a built-in dataset from Seaborn and first take a quick peek:

import seaborn as sns  # not imported earlier; needed for load_dataset and the seaborn plots below

tips = sns.load_dataset('tips')
tips.head()
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4

Histograms can be plotted in matplotlib using plt.hist(). It takes one required argument: the values to bin along the x-axis.

plt.hist(tips['total_bill'])
(array([  7.,  42.,  68.,  51.,  31.,  19.,  12.,   7.,   3.,   4.]),
 array([  3.07 ,   7.844,  12.618,  17.392,  22.166,  26.94 ,  31.714,
         36.488,  41.262,  46.036,  50.81 ]),
 <a list of 10 Patch objects>)
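
The tuple printed above is what plt.hist() returns: the count in each bin, the bin edges (always one more edge than there are bins), and the drawn bars. A minimal sketch using just the five total_bill values shown by tips.head(), with bins=3 chosen arbitrarily:

```python
import matplotlib.pyplot as plt

# The five total_bill values printed by tips.head() above.
bills = [16.99, 10.34, 21.01, 23.68, 24.59]

# bins=3 splits the range [min, max] into three equal-width bins.
counts, edges, bars = plt.hist(bills, bins=3)

print(counts)  # number of values falling in each bin
print(edges)   # bin boundaries: len(edges) == len(counts) + 1
plt.xlabel("Total Bill")
plt.ylabel("Number of bills")
```

When bins is omitted, Matplotlib uses 10 bins, which matches the 10 counts and 11 edges in the output above.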


Scatterplots

Scatterplots can be made using plt.scatter(). It takes two required arguments: x-values and y-values.

plt.scatter(tips['total_bill'], tips['tip'])
<matplotlib.collections.PathCollection at 0x11beb8390>


plt.scatter(tips['total_bill'], tips['tip'])
plt.xlabel('Total Bill')
plt.ylabel('Tip Amount')
plt.title('Total Bill vs Tip Amount')
<matplotlib.text.Text at 0x11bed28d0>


plt.figure(figsize=(15, 10)) # Increase the size of the returned plot

# Points with smoker == 'Yes'
plt.scatter(x=tips.loc[tips['smoker'] == 'Yes', 'total_bill'],
            y=tips.loc[tips['smoker'] == 'Yes', 'tip'],
            label='Smoker', alpha=0.6)

# Points with smoker == 'No'
plt.scatter(x=tips.loc[tips['smoker'] == 'No', 'total_bill'],
            y=tips.loc[tips['smoker'] == 'No', 'tip'],
            label='Non-Smoker', alpha=0.6)

plt.xlabel('Total Bill')
plt.ylabel('Tip Amount')
plt.title('Total Bill vs Tip Amount (by Smoking Habits)')
plt.legend()
<matplotlib.legend.Legend at 0x11d5f2f28>
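
The two near-identical plt.scatter() calls can also be written as a loop over groups. This is a sketch of that idea rather than the tutorial's own code: it uses only the five rows printed by tips.head() (stored under the made-up name tips_sample) and groups by sex instead of smoker, because all five of those rows are non-smokers.

```python
import pandas as pd
import matplotlib.pyplot as plt

# The first five rows of tips, as printed by tips.head() earlier.
tips_sample = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61],
    "sex": ["Female", "Male", "Male", "Male", "Female"],
})

fig, ax = plt.subplots(figsize=(8, 5))

# groupby yields (group label, sub-table) pairs; draw one scatter per group
# so each group gets its own color and legend entry.
for label, group in tips_sample.groupby("sex"):
    ax.scatter(group["total_bill"], group["tip"], label=label, alpha=0.6)

ax.set_xlabel("Total Bill")
ax.set_ylabel("Tip Amount")
ax.set_title("Total Bill vs Tip Amount (by Sex)")
ax.legend()
```

With the full tips table loaded, replacing tips_sample with tips and "sex" with "smoker" should reproduce the split shown above without repeating the filtering code.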


Exercises in Matplotlib

We’ll do the exercises using a famous dataset: the iris dataset. First, let’s load it in and take a look:

iris = sns.load_dataset('iris')
iris.head()
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa


Let’s also take a look at the different species:

iris['species'].unique()
array(['setosa', 'versicolor', 'virginica'], dtype=object)

Exercise:

Create a basic scatterplot of the petal lengths versus the petal widths. Label your axes (use the documentation linked above to make them meaningful)!

plt.scatter(iris['petal_length'], iris['petal_width'])
<matplotlib.collections.PathCollection at 0x11d68d0b8>
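
The exercise also asks for axis labels, which the one-line solution above leaves out. A possible labeled version is sketched here on just the five rows printed by iris.head(), so that it runs on its own. The name iris_sample is made up, and the units assume centimeters, as in the sepal labels used elsewhere in this tutorial.

```python
import pandas as pd
import matplotlib.pyplot as plt

# The five rows printed by iris.head(); swap in the full iris table if loaded.
iris_sample = pd.DataFrame({
    "petal_length": [1.4, 1.4, 1.3, 1.5, 1.4],
    "petal_width": [0.2, 0.2, 0.2, 0.2, 0.2],
})

fig, ax = plt.subplots()
ax.scatter(iris_sample["petal_length"], iris_sample["petal_width"])

# Label both axes and add a title so the plot can be read on its own.
ax.set_xlabel("Petal Length (cm)")
ax.set_ylabel("Petal Width (cm)")
ax.set_title("Petal Length vs Petal Width")
```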


Exercise:

This time, create the same scatterplot, but assign a different color for each flower species.

plt.scatter(x=iris.loc[iris['species'] == 'setosa', 'petal_length'],
            y=iris.loc[iris['species'] == 'setosa', 'petal_width'],
            label='setosa')
plt.scatter(x=iris.loc[iris['species'] == 'versicolor', 'petal_length'],
            y=iris.loc[iris['species'] == 'versicolor', 'petal_width'],
            label='versicolor')
plt.scatter(x=iris.loc[iris['species'] == 'virginica', 'petal_length'],
            y=iris.loc[iris['species'] == 'virginica', 'petal_width'],
            label='virginica')
plt.legend();


def plot_by_species(species, x, y):
    plt.scatter(x=iris.loc[iris['species'] == species, x],
                y=iris.loc[iris['species'] == species, y],
                label=species)

for species in iris['species'].unique():
    plot_by_species(species, 'sepal_length', 'sepal_width')

plt.xlabel('Sepal Length (cm)')
plt.ylabel('Sepal Width (cm)')
plt.title('Sepal Length vs Sepal Width (by Species)')
plt.legend()
<matplotlib.legend.Legend at 0x11d624cf8>


Seaborn

Histogram

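Back to the tips dataset to explore seaborn! First off is seaborn’s take on the histogram, sns.distplot(). By default, it normalizes the histogram to a density and overlays a kernel density estimate (KDE); if you would like seaborn to just show a plain histogram of counts, you can add the argument kde=False. Note that sns.distplot() is deprecated as of seaborn 0.11, where sns.histplot() and sns.displot() are its replacements.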

sns.set()

plt.figure(figsize=(15, 10))
plt.subplot(1, 2, 1)
sns.distplot(tips['total_bill'])

plt.subplot(1, 2, 2)
sns.distplot(tips['total_bill'], kde=False)
<matplotlib.axes._subplots.AxesSubplot at 0x1227d54e0>


Scatterplot

To create a scatterplot using seaborn, you can use sns.lmplot(). It takes x-values and y-values, and overlays a least-squares regression line with a shaded confidence band (a 95% confidence interval by default).

Note: Instead of indexing into the DataFrame yourself, you pass it in as data and refer to its columns by name.

sns.lmplot(x='total_bill', y='tip', data=tips)
<seaborn.axisgrid.FacetGrid at 0x11bf87cc0>
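
The straight line that lmplot() draws by default is an ordinary least-squares fit. As a sketch of what that means, the slope and intercept can be computed directly with np.polyfit(). The numbers here come from the five rows printed by tips.head(), so they will differ from the fit on the full dataset.

```python
import numpy as np

# total_bill and tip from the five rows printed by tips.head().
total_bill = np.array([16.99, 10.34, 21.01, 23.68, 24.59])
tip = np.array([1.01, 1.66, 3.50, 3.31, 3.61])

# A degree-1 polynomial fit minimizes the sum of squared vertical distances
# between the points and the line tip = slope * total_bill + intercept.
slope, intercept = np.polyfit(total_bill, tip, deg=1)

predicted = slope * total_bill + intercept
residuals = tip - predicted
print(f"tip ~ {slope:.3f} * total_bill + {intercept:.3f}")
```

A useful sanity check on any least-squares line is that its residuals sum to (numerically) zero.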


Let’s do that same plot from earlier, where we split the points by smoker. It’s a lot easier in seaborn, since we only need to pass in one additional argument, hue:

sns.lmplot(x='total_bill', y='tip', hue='smoker', data=tips)
<seaborn.axisgrid.FacetGrid at 0x1229b72b0>


Cool. Do smokers’ and non-smokers’ generosities differ by time of day (lunch versus dinner)? Let’s try out the row and col (column) arguments:

sns.lmplot(x='total_bill', y='tip', row='time', col='smoker', data=tips)
<seaborn.axisgrid.FacetGrid at 0x12296dfd0>


Seaborn Exercises

Exercise:

Your turn! Create a histogram of the petal widths in the iris dataset.

sns.distplot(iris['petal_width'], bins=np.linspace(0, 3, 20))
<matplotlib.axes._subplots.AxesSubplot at 0x122c63b00>


Exercise:

Now try to create a scatterplot of petal lengths versus petal widths, and color the points based on the species of flowers. Feel free to turn off the regression line using fit_reg=False.

sns.lmplot(x='petal_length', y='petal_width', hue='species', data=iris, fit_reg=False)
<seaborn.axisgrid.FacetGrid at 0x1231cc4e0>
