Python Libraries Interactive Notebook

These are the solutions! To work on the interactive notebook yourself, download it here: Python_Libraries_Workshop.ipynb.

Don't have Anaconda installed? You can run Jupyter Notebooks interactively with Google Drive.
Download file above > Upload to Google Drive > right click file > "Open with" > 'Colaboratory' (click 'connect more apps' if Colaboratory isn't an option).

Today, we will be learning about four fundamental Python libraries. They are widely used, and plenty of documentation can be found online. Don't be afraid to search Google/Stack Overflow!

  1. Numpy: https://docs.scipy.org/doc/numpy-dev/user/index.html
  2. Pandas: http://pandas.pydata.org/pandas-docs/stable/
  3. Matplotlib: https://matplotlib.org/2.1.2/index.html
  4. Seaborn: https://seaborn.pydata.org/

Table of Contents

I. Numpy
II. Pandas
III. Plot with Pandas
IV. Matplotlib
V. Seaborn

Jupyter Notebook Recap

To run a cell: select cell, press SHIFT + ENTER

The last line of a cell is always displayed

In [2]:
"this will NOT be displayed"
"this will be displayed"
Out[2]:
'this will be displayed'

If cells contain ... we expect you to replace ... with your code :)

Import

numpy, pandas, matplotlib, and seaborn are made by other people! We need to import these modules in order to use them.

In [3]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
# %matplotlib inline
# sns.set()

Numpy

Numpy's main use is np.array.

Numpy arrays take less space than built-in lists and come with a wide variety of useful functions.
In [4]:
# make an array
a = np.array([2,3,4])
a
Out[4]:
array([2, 3, 4])
In [5]:
# make a 2-dimensional array (matrix)
matrix = np.array([ [1,2,3],
                    [4,5,6],
                    [7,8,9] ])
matrix
Out[5]:
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

Linear Algebra!

In [6]:
# you can multiply matrices with np.dot
np.dot(matrix, a)
Out[6]:
array([20, 47, 74])
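
To see what np.dot computed above, here is a small sketch: each output entry is the sum of one row of matrix multiplied element by element with a. The @ operator is another way to write the same product. The names by_hand and with_operator are ours, just for illustration.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])
a = np.array([2, 3, 4])

# each entry is the sum of (row * a), computed one row at a time
by_hand = np.array([np.sum(row * a) for row in matrix])

# the @ operator performs the same matrix-vector product as np.dot
with_operator = matrix @ a

print(by_hand)        # [20 47 74]
print(with_operator)  # [20 47 74]
```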

Arithmetic with numpy!

You can add/subtract/multiply/divide numpy arrays element by element! Built-in python lists do not work this way: + concatenates lists and * repeats them.

In [7]:
a + 5
Out[7]:
array([7, 8, 9])
In [8]:
a * -1
Out[8]:
array([-2, -3, -4])
In [9]:
b = np.array([3, 2, 1])
a + b
Out[9]:
array([5, 5, 5])
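
For comparison, here is a short sketch of how the same operators behave on a built-in list versus a numpy array. The variable names numbers_list and numbers_array are made up for this example.

```python
import numpy as np

numbers_list = [2, 3, 4]
numbers_array = np.array(numbers_list)

# on lists, + concatenates and * repeats
list_plus = numbers_list + [5]
list_times = numbers_list * 2

# on arrays, + and * work element by element
array_plus = numbers_array + 5
array_times = numbers_array * 2

print(list_plus)    # [2, 3, 4, 5]
print(list_times)   # [2, 3, 4, 2, 3, 4]
print(array_plus)   # [7 8 9]
print(array_times)  # [4 6 8]
```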

If you try to perform elementwise operations on two arrays whose lengths differ (and neither has length 1), an error will occur. Try running the following cell!

In [10]:
# Run me!
b + np.array([1, 2, 3, 4])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-10-01af955792b6> in <module>()
      1 # Run me!
----> 2 b + np.array([1, 2, 3, 4])

ValueError: operands could not be broadcast together with shapes (3,) (4,) 
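
The word "broadcast" in the error message refers to numpy's rule for stretching a smaller array to fit a larger one. A minimal sketch, assuming we only care about 1-dimensional arrays: a length-1 array can be stretched, but lengths 3 and 4 cannot be matched. The name stretched is ours.

```python
import numpy as np

b = np.array([3, 2, 1])

# a length-1 array is stretched ("broadcast") to match b
stretched = b + np.array([10])
print(stretched)   # [13 12 11]

# lengths 3 and 4 cannot be matched, so numpy raises a ValueError
try:
    b + np.array([1, 2, 3, 4])
    failed = False
except ValueError as error:
    failed = True
    print("ValueError:", error)
```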

Use len(array) to find the length of an array.

In [11]:
len(b)
Out[11]:
3
In [12]:
len(np.array([1, 2, 3, 4]))
Out[12]:
4

Conditionals apply to every element of a numpy array as well. This will come in handy later!

In [13]:
a = np.array([1, 2, 3, 1, 1])
a == 1
Out[13]:
array([ True, False, False,  True,  True], dtype=bool)
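
Here is a quick sketch of why a True/False array is handy: it can be summed to count matches, and it can be used inside square brackets to keep only the matching elements. The name mask is our own choice.

```python
import numpy as np

a = np.array([1, 2, 3, 1, 1])
mask = (a == 1)

# True counts as 1 and False as 0, so the sum is the number of matches
number_of_ones = np.sum(mask)
print(number_of_ones)   # 3

# indexing with the mask keeps only the elements where mask is True
only_ones = a[mask]
bigger_than_one = a[a > 1]
print(only_ones)        # [1 1 1]
print(bigger_than_one)  # [2 3]
```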

Essential array functions

Why do we use Numpy? Numpy provides a multitude of useful functions for arrays. We'll teach you a few (many more exist!)

Exercise: Search online how to find the mean of a numpy array.

In [14]:
x = np.array([1, 5, -7, 18, 1, -2, 4])
In [15]:
# Find the mean of array x
x_mean = np.mean(x)

Here, we'll give you a list of some useful numpy functions. Remember, you can easily find info about these by searching google / numpy documentation!

In [16]:
np.sum(x)
Out[16]:
20
In [17]:
np.min(x)
Out[17]:
-7
In [18]:
np.max(x)
Out[18]:
18
In [19]:
np.median(x)
Out[19]:
1.0
In [20]:
np.cumsum(x)
Out[20]:
array([ 1,  6, -1, 17, 18, 16, 20])
In [21]:
np.abs(x)
Out[21]:
array([ 1,  5,  7, 18,  1,  2,  4])

What do you think np.cumsum does? Note, numpy has a similar function np.cumprod. Try it!

What do you think np.diff does?

In [22]:
np.diff(x)
Out[22]:
array([  4, -12,  25, -17,  -3,   6])
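
One way to check your guesses: np.cumsum keeps a running total, np.cumprod keeps a running product, and np.diff subtracts each element from the one after it. The sketch below shows that np.diff undoes np.cumsum (apart from the first element). The names running_total, steps, and running_product are ours.

```python
import numpy as np

x = np.array([1, 5, -7, 18, 1, -2, 4])

running_total = np.cumsum(x)    # 1, 1+5, 1+5-7, ...
steps = np.diff(running_total)  # differences between neighbors

# the differences of the running total give back x without its first element
print(np.array_equal(steps, x[1:]))   # True

running_product = np.cumprod(np.array([1, 2, 3, 4]))
print(running_product)                # [ 1  2  6 24]
```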

Two super useful functions in numpy are np.arange and np.linspace. They allow you to craft arrays with equidistant values:

  • np.arange asks for [start], stop, and [step] (stop itself is not included)
  • np.linspace asks for start, stop, and num (stop is included by default)
In [23]:
np.arange(0, 100, 10)
Out[23]:
array([ 0, 10, 20, 30, 40, 50, 60, 70, 80, 90])
In [24]:
np.linspace(0, 100, 15)
Out[24]:
array([   0.        ,    7.14285714,   14.28571429,   21.42857143,
         28.57142857,   35.71428571,   42.85714286,   50.        ,
         57.14285714,   64.28571429,   71.42857143,   78.57142857,
         85.71428571,   92.85714286,  100.        ])
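
A small sketch of the main difference between the two: np.arange stops before the stop value, while np.linspace includes it and works out the spacing from the number of points. The names by_step and by_count are made up for this example.

```python
import numpy as np

by_step = np.arange(0, 100, 10)     # step of 10, stop value excluded
by_count = np.linspace(0, 100, 11)  # 11 points, stop value included

print(by_step[-1])                 # 90
print(by_count[-1])                # 100.0
print(by_count[1] - by_count[0])   # 10.0, i.e. (stop - start) / (num - 1)
```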

Python

Using np.array in Python is a little bit different from using built-in lists.

In [25]:
a = np.array([2, 3, 4])
b = [2, 3, 4]
print(a)
print(b)
[2 3 4]
[2, 3, 4]

Adding values to np.array is different

In [26]:
b.append("hello")
b
Out[26]:
[2, 3, 4, 'hello']
In [27]:
a = np.append(a, 'hello')
a
Out[27]:
array(['2', '3', '4', 'hello'], 
      dtype='<U21')
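
Notice that the cell above had to write a = np.append(a, 'hello'). A minimal sketch of why: list.append changes the list in place, while np.append returns a new array and leaves the original alone. The names numbers_list, numbers_array, and longer are ours.

```python
import numpy as np

numbers_list = [2, 3, 4]
numbers_list.append(5)        # changes numbers_list itself
print(numbers_list)           # [2, 3, 4, 5]

numbers_array = np.array([2, 3, 4])
longer = np.append(numbers_array, 5)   # builds and returns a new array
print(numbers_array)          # [2 3 4]  (unchanged)
print(longer)                 # [2 3 4 5]
```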

For loops work the same way

In [28]:
c = np.array([1, 2, 3, 4, 5])
cumulative_product = 1

for element in c:
    cumulative_product *= element
    
cumulative_product
Out[28]:
120

Numpy Exercises

Use np.arange to create an array called arr1 that contains every odd number from 1 to 100, inclusive.

In [29]:
arr1 = np.arange(1, 100, 2)
arr1
Out[29]:
array([ 1,  3,  5,  7,  9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33,
       35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63, 65, 67,
       69, 71, 73, 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 97, 99])

Use arr1 to create an array arr2 of every number divisible by 4 from 1 to 200, inclusive.

In [30]:
arr2 = (arr1 + 1) * 2 
arr2
Out[30]:
array([  4,   8,  12,  16,  20,  24,  28,  32,  36,  40,  44,  48,  52,
        56,  60,  64,  68,  72,  76,  80,  84,  88,  92,  96, 100, 104,
       108, 112, 116, 120, 124, 128, 132, 136, 140, 144, 148, 152, 156,
       160, 164, 168, 172, 176, 180, 184, 188, 192, 196, 200])

Create the same array, but using np.linspace instead. Call this array arr3.

In [31]:
arr3 = np.linspace(4, 200, 50)
arr3
Out[31]:
array([   4.,    8.,   12.,   16.,   20.,   24.,   28.,   32.,   36.,
         40.,   44.,   48.,   52.,   56.,   60.,   64.,   68.,   72.,
         76.,   80.,   84.,   88.,   92.,   96.,  100.,  104.,  108.,
        112.,  116.,  120.,  124.,  128.,  132.,  136.,  140.,  144.,
        148.,  152.,  156.,  160.,  164.,  168.,  172.,  176.,  180.,
        184.,  188.,  192.,  196.,  200.])

Print the following summary statistics for arr3:

  • minimum
  • 1st quartile (Hint: See np.percentile())
  • median
  • mean
  • standard deviation
  • 3rd quartile
  • max
In [32]:
print('Minimum: '            + str(np.min(arr3)))
print('1st quartile: '       + str(np.percentile(arr3, 25)))
print('Median: '             + str(np.median(arr3)))
print('Mean: '               + str(np.mean(arr3)))
print('Standard Deviation: ' + str(np.std(arr3)))
print('3rd Quartile: '       + str(np.percentile(arr3, 75)))
print('Max: '                + str(np.max(arr3)))
Minimum: 4.0
1st quartile: 53.0
Median: 102.0
Mean: 102.0
Standard Deviation: 57.7234787586
3rd Quartile: 151.0
Max: 200.0
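
As a possible shortcut, np.percentile also accepts a list of percentiles, so the quartiles can be computed in one call. This is just a sketch; the names q1, median, q3, and iqr are our own.

```python
import numpy as np

arr3 = np.linspace(4, 200, 50)

# one call returns the 25th, 50th, and 75th percentiles together
q1, median, q3 = np.percentile(arr3, [25, 50, 75])
iqr = q3 - q1   # interquartile range

print(q1, median, q3)   # 53.0 102.0 151.0
print(iqr)              # 98.0
```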

Pandas

Pandas is all about tables!

A table is called a 'dataframe' in Pandas. Consider the table fruit_info:

color fruit
red apple
orange orange
yellow banana
pink raspberry

Pandas Series

Let's break this table down. DataFrames consist of columns called Series. Series act like numpy arrays.

How to make a Series:

  1. create a numpy array
  2. call pd.Series(array, name="...")    # name can be anything

Exercise: Make a Series that contains the colors from fruit_info and has name='color'

In [33]:
array = np.array(['red', 'orange', 'yellow', 'pink'])
color_column = pd.Series(array, name="color")
color_column
Out[33]:
0       red
1    orange
2    yellow
3      pink
Name: color, dtype: object

Exercise: Make another Series for the fruit column:

In [34]:
array = np.array(['apple', 'orange', 'banana', 'raspberry'])
fruit_column = pd.Series(array, name="fruit")
fruit_column
Out[34]:
0        apple
1       orange
2       banana
3    raspberry
Name: fruit, dtype: object

Combine your Series into a table!

pd.concat([ series1, series2, series3, ... ], axis=1)

Don't forget axis=1 or you'll just make a giant Series.

In [35]:
fruit_info = pd.concat([color_column, fruit_column], axis=1)
fruit_info
Out[35]:
color fruit
0 red apple
1 orange orange
2 yellow banana
3 pink raspberry
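
To see what the axis argument changes, here is a short sketch that combines the same two Series both ways and compares the shapes. The names side_by_side and stacked are made up for this example.

```python
import pandas as pd

color_column = pd.Series(['red', 'orange', 'yellow', 'pink'], name='color')
fruit_column = pd.Series(['apple', 'orange', 'banana', 'raspberry'], name='fruit')

# axis=1 places the Series next to each other as columns
side_by_side = pd.concat([color_column, fruit_column], axis=1)

# the default (axis=0) stacks them into one long Series
stacked = pd.concat([color_column, fruit_column])

print(side_by_side.shape)   # (4, 2)
print(stacked.shape)        # (8,)
```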

What if we were given the DataFrame and we want to extract the columns?

In [36]:
fruit_info['fruit'] # we get the fruit_column Series back!
Out[36]:
0        apple
1       orange
2       banana
3    raspberry
Name: fruit, dtype: object

Dictionaries

Also, we can manually create tables by using a python dictionary. A dictionary has the following format:

d = { "name of column"   :  [  list of values  ],
      "name of column 2" :  [  list of values  ],
                        ...
                        ...
    }
In [37]:
d = { 'fruit' : ['apple', 'orange', 'banana', 'raspberry'],
      'color' : ['red', 'orange', 'yellow', 'pink']
    }
In [38]:
fruit_info_again = pd.DataFrame(d)
fruit_info_again
Out[38]:
color fruit
0 red apple
1 orange orange
2 yellow banana
3 pink raspberry

Add Columns

Add a column to table labeled "new column" like so:

table['new column'] = array

In [39]:
fruit_info['inventory'] = np.array([23, 18, 50, 20])
fruit_info
Out[39]:
color fruit inventory
0 red apple 23
1 orange orange 18
2 yellow banana 50
3 pink raspberry 20

Exercise: Add a column called rating that assigns your rating from 1 to 5 for each fruit :)

In [40]:
fruit_info['rating'] = np.array([3, 4, 5, 5])

fruit_info  # should now include a rating column
Out[40]:
color fruit inventory rating
0 red apple 23 3
1 orange orange 18 4
2 yellow banana 50 5
3 pink raspberry 20 5

Drop

Exercise: Now, use the .drop() method to drop the color column.

In [41]:
fruit_info_without_color = fruit_info.drop('color', axis=1) # must include axis=1

fruit_info_without_color
Out[41]:
fruit inventory rating
0 apple 23 3
1 orange 18 4
2 banana 50 5
3 raspberry 20 5

California Baby Names

Time to use a real dataset!

You can read a .csv file into pandas using pd.read_csv( url ).

Create a variable called baby_names that loads this data: https://raw.githubusercontent.com/carlocrza/Data_Science_Society/master/ca_baby_names.csv

In [42]:
baby_names = pd.read_csv("https://raw.githubusercontent.com/carlocrza/Data_Science_Society/master/baby_names.csv")

Let's display the table. We can just type baby_names and run the cell but baby_names is HUGE! So, let's display just the first five rows with:

DataFrame.head( # of rows )

In [43]:
baby_names.head(5)
Out[43]:
State Sex Year Name Count
0 CA F 1910 Mary 295
1 CA F 1910 Helen 239
2 CA F 1910 Dorothy 220
3 CA F 1910 Margaret 163
4 CA F 1910 Frances 134

Row, Column Selection

Follow the structure:

table.loc[rows, columns]

table.loc[2:8, [ 'Name', 'Count']]

The above code will select columns "Name" and "Count" from the rows labeled 2 through 8. Unlike regular Python slicing, .loc includes both endpoints.

In [44]:
# Returns the name of our columns
baby_names.columns
Out[44]:
Index(['State', 'Sex', 'Year', 'Name', 'Count'], dtype='object')
In [45]:
baby_names.loc[2:8, ['Name', "Count"]]
Out[45]:
Name Count
2 Dorothy 220
3 Margaret 163
4 Frances 134
5 Ruth 128
6 Evelyn 126
7 Alice 118
8 Virginia 101
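
Note that .loc selects by label and includes the last row of the slice. pandas also has .iloc, which selects by position and excludes the end, like regular Python slicing. A small sketch using the first five rows shown earlier (the table name first_rows is ours):

```python
import pandas as pd

first_rows = pd.DataFrame({'Name': ['Mary', 'Helen', 'Dorothy', 'Margaret', 'Frances'],
                           'Count': [295, 239, 220, 163, 134]})

# .loc uses row labels and column names; the end label 3 is included
by_label = first_rows.loc[1:3, ['Name']]

# .iloc uses integer positions; the end position 3 is excluded
by_position = first_rows.iloc[1:3, [0]]

print(by_label['Name'].tolist())      # ['Helen', 'Dorothy', 'Margaret']
print(by_position['Name'].tolist())   # ['Helen', 'Dorothy']
```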

Exercise: Return a table that includes rows 1000-1005 and only includes the column "Name".

In [46]:
baby_names.loc[1000:1005, ['Name']]
Out[46]:
Name
1000 Sara
1001 Teresa
1002 Cecilia
1003 Eda
1004 Etta
1005 Faye
In [47]:
# Want to select EVERY row?
# Don't put anything before and after the colon :
baby_names.loc[:, ['Sex', 'Name']].head(4)
Out[47]:
Sex Name
0 F Mary
1 F Helen
2 F Dorothy
3 F Margaret

Selecting an entire Column

Remember we can extract the column in the form of a Series using:

table_name['Name of column']

In [48]:
name_column = baby_names['Name']
name_column.head(5) # we can also use .head with Series!
Out[48]:
0        Mary
1       Helen
2     Dorothy
3    Margaret
4     Frances
Name: Name, dtype: object

Selecting rows with a Boolean Array

Lastly, we can select rows based off of True / False data. Let's go back to the simpler fruit_info table.

In [49]:
fruit_info
Out[49]:
color fruit inventory rating
0 red apple 23 3
1 orange orange 18 4
2 yellow banana 50 5
3 pink raspberry 20 5
In [50]:
# select row only if corresponding value in *selection* is True
selection = np.array([True, False, True, False])
fruit_info[selection]
Out[50]:
color fruit inventory rating
0 red apple 23 3
2 yellow banana 50 5

Filtering Data

So far we have selected data based off of row numbers and column headers. Let's work on filtering data more precisely.

table[condition]

In [51]:
condition = baby_names['Name'] == 'Carlo'
baby_names[condition].head(5)
Out[51]:
State Sex Year Name Count
219486 CA M 1917 Carlo 10
219941 CA M 1918 Carlo 8
220848 CA M 1920 Carlo 10
221935 CA M 1922 Carlo 9
222457 CA M 1923 Carlo 10

The above code only selects rows that have Name equal to 'Carlo'. Change it to your name!

Apply multiple conditions!

table[ (condition 1) & (condition 2) ]

Class Exercise: select the names in Year 2000 that have larger than 3000 counts.

In [52]:
result = baby_names[(baby_names['Count'] > 3000) & (baby_names['Year'] == 2000)]
result.head(3)
Out[52]:
State Sex Year Name Count
320198 CA M 2000 Daniel 4339
320199 CA M 2000 Anthony 3837
320200 CA M 2000 Jose 3803
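
Besides & (and), conditions can be combined with | (or) and flipped with ~ (not). Each condition needs its own parentheses. Here is a sketch on the smaller fruit_info table; the names both, either, and not_five are made up for this example.

```python
import pandas as pd

fruit_info = pd.DataFrame({'fruit': ['apple', 'orange', 'banana', 'raspberry'],
                           'inventory': [23, 18, 50, 20],
                           'rating': [3, 4, 5, 5]})

# & keeps rows where BOTH conditions are True
both = fruit_info[(fruit_info['inventory'] > 19) & (fruit_info['rating'] == 5)]

# | keeps rows where AT LEAST ONE condition is True
either = fruit_info[(fruit_info['inventory'] > 40) | (fruit_info['rating'] == 3)]

# ~ flips True and False
not_five = fruit_info[~(fruit_info['rating'] == 5)]

print(both['fruit'].tolist())       # ['banana', 'raspberry']
print(either['fruit'].tolist())     # ['apple', 'banana']
print(not_five['fruit'].tolist())   # ['apple', 'orange']
```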

Thorough explanation:

Remember that calling baby_names['Name'] returns a Series of all of the names.

Checking if values in the Series are equal to 'Carlo' results in a Series of True/False values.

Then, we select rows based off of this boolean array. Thus, we could also do:

In [53]:
names = baby_names['Name']
equalto_Carlo = (names == 'Carlo')  # equalto_Carlo is now an array of True/False variables!
baby_names[equalto_Carlo].head(5)
Out[53]:
State Sex Year Name Count
219486 CA M 1917 Carlo 10
219941 CA M 1918 Carlo 8
220848 CA M 1920 Carlo 10
221935 CA M 1922 Carlo 9
222457 CA M 1923 Carlo 10

Using Numpy with Pandas

How many rows does our baby_names table have?

In [54]:
len(baby_names)
Out[54]:
367931

That's a lot of rows! We can't just look at the table and understand it.

Luckily, Numpy functions treat pandas Series as np.arrays.

Exercise: What are the oldest and most recent years that we have data from in baby_names? HINT: np.min, np.max

In [55]:
recent_year = np.max(baby_names['Year'])
oldest_year = np.min(baby_names['Year'])
(recent_year, oldest_year)
Out[55]:
(2016, 1910)

Exercise: How many babies were born in CA in 2015?

Hint: the 'Count' column refers to the number of occurrences of a baby name. How could we find the total number of babies? Now narrow that to only 2015.

In [56]:
baby_names_2015 = baby_names[baby_names['Year'] == 2015]
baby_names_2015_counts = baby_names_2015['Count']
number_baby_names_2015 = np.sum(baby_names_2015_counts)
number_baby_names_2015
Out[56]:
429730
In [57]:
np.sum(baby_names[baby_names['Year'] == 2015]['Count'])
Out[57]:
429730

np.unique

In [58]:
# return an array with an element for each unique value in the Series/np.array
np.unique(baby_names['Sex'])
Out[58]:
array(['F', 'M'], dtype=object)
In [59]:
# demo
states = np.unique(baby_names['State']) # okay now we know this dataset only involves California babies.
In [60]:
names = np.unique(baby_names['Name'])
len(names)
Out[60]:
19234

Copy vs View

Depending on how you format your code, pandas might be returning a copy of the dataframe (i.e. a whole new dataframe, but just with the same values), or a view of the dataframe (i.e. the same dataframe itself).

In [61]:
carlos_fruits = fruit_info.copy()
carlos_fruits
Out[61]:
color fruit inventory rating
0 red apple 23 3
1 orange orange 18 4
2 yellow banana 50 5
3 pink raspberry 20 5

Let's say Carlo is happy with those ratings. But Jun Seo loves bananas! Let's make a "new" dataframe and change the ratings accordingly:

In [62]:
junseos_fruits = carlos_fruits
junseos_fruits['rating'] = [3, 4, 9999, 5]
junseos_fruits
Out[62]:
color fruit inventory rating
0 red apple 23 3
1 orange orange 18 4
2 yellow banana 50 9999
3 pink raspberry 20 5

And taking a look back at Carlo's fruits:

In [63]:
carlos_fruits
Out[63]:
color fruit inventory rating
0 red apple 23 3
1 orange orange 18 4
2 yellow banana 50 9999
3 pink raspberry 20 5

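Wait, Carlo's banana rating shouldn't be that high! What happened is that junseos_fruits = carlos_fruits did not create a new dataframe at all: both names now refer to the same dataframe, so a change made through one name shows up through the other. Then did our shenanigans affect the original fruit_info dataframe too?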

In [64]:
fruit_info
Out[64]:
color fruit inventory rating
0 red apple 23 3
1 orange orange 18 4
2 yellow banana 50 5
3 pink raspberry 20 5

No, because when we called carlos_fruits = fruit_info.copy(), we asked pandas to forcibly create a brand new dataframe with identical values instead.
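
One way to check which situation you are in is Python's is operator, which tells you whether two names point at the very same object. A minimal sketch with a made-up two-row table (the names original, same_object, and independent are ours):

```python
import pandas as pd

original = pd.DataFrame({'fruit': ['apple', 'banana'], 'rating': [3, 5]})

same_object = original         # just a second name for the same dataframe
independent = original.copy()  # a brand new dataframe with identical values

print(same_object is original)   # True
print(independent is original)   # False

# changing the copy leaves the original untouched
independent.loc[1, 'rating'] = 9999
print(original.loc[1, 'rating'])   # 5
```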

SettingWithCopyWarning

This is arguably one of the most frustrating warnings you will see while using pandas. TL;DR: When assigning values, use a single .loc[rows, columns] call instead of chaining square brackets.

Let's say Jun Seo strongly dislikes apples.

In [65]:
junseos_fruits[junseos_fruits['fruit'] == 'apple']
Out[65]:
color fruit inventory rating
0 red apple 23 3
In [66]:
junseos_fruits[junseos_fruits['fruit'] == 'apple']['rating'] = -100
junseos_fruits
Out[66]:
color fruit inventory rating
0 red apple 23 3
1 orange orange 18 4
2 yellow banana 50 9999
3 pink raspberry 20 5
In [67]:
junseos_fruits['rating']
Out[67]:
0       3
1       4
2    9999
3       5
Name: rating, dtype: int64
In [68]:
junseos_fruits['rating'][0] = -100
junseos_fruits
Out[68]:
color fruit inventory rating
0 red apple 23 -100
1 orange orange 18 4
2 yellow banana 50 9999
3 pink raspberry 20 5
In [69]:
junseos_fruits.loc[1, 'rating'] = 1738
junseos_fruits
Out[69]:
color fruit inventory rating
0 red apple 23 -100
1 orange orange 18 1738
2 yellow banana 50 9999
3 pink raspberry 20 5
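
Putting the pieces together, a sketch of how the apple rating could have been changed in one step: pass the True/False condition as the rows and the column name as the columns of a single .loc call. The table name fruits is ours, and the ratings are the ones from fruit_info.

```python
import pandas as pd

fruits = pd.DataFrame({'fruit': ['apple', 'orange', 'banana', 'raspberry'],
                       'rating': [3, 4, 5, 5]})

# one .loc call: rows where fruit is 'apple', column 'rating'
fruits.loc[fruits['fruit'] == 'apple', 'rating'] = -100

print(fruits['rating'].tolist())   # [-100, 4, 5, 5]
```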

[optional] Group By

We won't have time to go through this thoroughly in lab. However, we encourage you to look into this material if you want to go further. Feel free to ask us any questions!

In the previous section we calculated the number of baby names registered in 2015.

In [70]:
np.sum(baby_names[baby_names['Year'] == 2015]['Count'])
Out[70]:
429730

There are 107 years though. If we wanted to know how many babies were born in California for each year we need to do something more efficient.

groupby to the rescue!

Groupby allows us to split our table into groups, each group having one similarity.

For example if we group by "Year" we would create 107 groups because there are 107 unique years.

`baby_names.groupby('Year')`

Now we have 107 groups but what do we do with them? We can apply the function sum to each group. This will sum the numerical column 'Count', which reduces each group to a single row: Year and sum.

Excellent tutorial: http://bconnelly.net/2013/10/summarizing-data-in-python-with-pandas/

In [71]:
# this will apply sum to the "Count" column of each year group
# (selecting [['Count']] first keeps the text columns out of the sum)
yearly_data = baby_names.groupby('Year')[['Count']].sum()
yearly_data.head(5)
Out[71]:
Count
Year
1910 9164
1911 9984
1912 17944
1913 22094
1914 26925
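
If groupby still feels abstract, here is a tiny sketch on a made-up table (the counts below are invented for illustration and are not from the real dataset). It shows that the grouped sum matches what we would get by filtering one year at a time. The names toy, per_year, and filtered_1910 are ours.

```python
import pandas as pd

# invented numbers, small enough to add up by hand
toy = pd.DataFrame({'Year': [1910, 1910, 1911, 1911, 1911],
                    'Name': ['Mary', 'Helen', 'Mary', 'Helen', 'Ruth'],
                    'Count': [10, 20, 5, 7, 8]})

# split into one group per year, then sum the Count column in each group
per_year = toy.groupby('Year')['Count'].sum()

# the same number for 1910, computed by filtering as in the previous section
filtered_1910 = toy[toy['Year'] == 1910]['Count'].sum()

print(int(per_year.loc[1910]))   # 30
print(int(per_year.loc[1911]))   # 20
print(int(filtered_1910))        # 30
```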

Plot with Pandas

In [72]:
%matplotlib inline

Pandas.plot documentation

Pandas comes with a built-in plot method that can be very useful! DataFrame.plot actually uses matplotlib behind the scenes!

yearly_data contains the number of registered babies per year.

In [73]:
yearly_data.head()
Out[73]:
Count
Year
1910 9164
1911 9984
1912 17944
1913 22094
1914 26925

Line Graphs

In [74]:
yearly_data.plot(kind="line")  #kind='line' is optional
Out[74]:
<matplotlib.axes._subplots.AxesSubplot at 0x11a052198>

Study: Name History

In [75]:
# don't worry about this function unless you want to learn about groupby
def your_name_history(name):
    return baby_names[baby_names['Name'] == name].groupby('Year')[['Count']].sum()
In [76]:
table = your_name_history('John')

table.plot()
Out[76]:
<matplotlib.axes._subplots.AxesSubplot at 0x114a44470>

Bar Graphs

We can modify our data before we graph it to analyze different things.

In [77]:
yearly_data.plot(kind="bar")
plt.axis('off')
Out[77]:
(-0.5, 106.5, 0.0, 580000.05000000005)

Class Exercise: How could we graph only the 15 years after World War II (i.e. 1945-1960)?

Hint: create a table with only the desired years first

In [78]:
modified = yearly_data.loc[1945:1960]

modified.plot(kind="bar", figsize=(15,8))
Out[78]:
<matplotlib.axes._subplots.AxesSubplot at 0x11b7f5fd0>

Matplotlib

Line Graphs

Use plt.plot() to create line graphs! The required arguments are a list of x-values and a list of y-values.

In [79]:
np.random.seed(42) # To ensure that the random number generation is always the same
plt.plot(np.arange(0, 7, 1), np.random.rand(7, 1))
plt.show()
In [80]:
%matplotlib inline

plt.plot(np.arange(0, 7, 1), np.random.rand(7, 1))
# plt.show() no longer required
Out[80]:
[<matplotlib.lines.Line2D at 0x11bfb12e8>]

Histograms

@author previous sections: Carlo Cruz-Albrecht
@author this section and below: Jun Seo Park

To explore other types of charts, let's load in a built-in dataset from Seaborn and first take a quick peek:

In [81]:
tips = sns.load_dataset('tips')
tips.head()
Out[81]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4

Histograms can be plotted in matplotlib using plt.hist(). This will take one required argument of the x-axis variable.

In [82]:
plt.hist(tips['total_bill'])
Out[82]:
(array([  7.,  42.,  68.,  51.,  31.,  19.,  12.,   7.,   3.,   4.]),
 array([  3.07 ,   7.844,  12.618,  17.392,  22.166,  26.94 ,  31.714,
         36.488,  41.262,  46.036,  50.81 ]),
 <a list of 10 Patch objects>)
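
The output above is a tuple of bin counts, bin edges, and the drawn bars. As a rough sketch of where the first two come from, np.histogram computes counts and edges without drawing anything. The array bills below is made up for illustration, and we ask for 3 bins to keep it easy to follow by hand.

```python
import numpy as np

bills = np.array([1, 2, 2, 3, 3, 3, 4, 10])   # made-up values

# 3 equal-width bins between the smallest and largest value
counts, edges = np.histogram(bills, bins=3)

print(counts.tolist())   # [6, 1, 1]
print(edges.tolist())    # [1.0, 4.0, 7.0, 10.0]
```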

Scatterplots

Scatterplots can be made using plt.scatter(). It takes in two arguments: x-values and y-values.

In [83]:
plt.scatter(tips['total_bill'], tips['tip'])
Out[83]:
<matplotlib.collections.PathCollection at 0x11beb8390>
In [84]:
plt.scatter(tips['total_bill'], tips['tip'])
plt.xlabel('Total Bill')
plt.ylabel('Tip Amount')
plt.title('Total Bill vs Tip Amount')
Out[84]:
<matplotlib.text.Text at 0x11bed28d0>
In [85]:
plt.figure(figsize=(15, 10)) # Increase the size of the returned plot

# Points with smoker == 'Yes'
plt.scatter(x=tips.loc[tips['smoker'] == 'Yes', 'total_bill'], 
            y=tips.loc[tips['smoker'] == 'Yes', 'tip'],
            label='Smoker', alpha=0.6)

# Points with smoker == 'No'
plt.scatter(x=tips.loc[tips['smoker'] == 'No', 'total_bill'], 
            y=tips.loc[tips['smoker'] == 'No', 'tip'],
            label='Non-Smoker', alpha=0.6)

plt.xlabel('Total Bill')
plt.ylabel('Tip Amount')
plt.title('Total Bill vs Tip Amount (by Smoking Habits)')
plt.legend()
Out[85]:
<matplotlib.legend.Legend at 0x11d5f2f28>

Exercises in Matplotlib

We'll do the exercises using a famous dataset: the iris dataset. First, let's load it in and take a look:

In [86]:
iris = sns.load_dataset('iris')
iris.head()
Out[86]:
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa


Let's also take a look at the different species:

In [87]:
iris['species'].unique()
Out[87]:
array(['setosa', 'versicolor', 'virginica'], dtype=object)

Exercise: Create a basic scatterplot of the petal lengths versus the petal widths. Label your axes (use the documentation linked above to make them meaningful)!

In [88]:
plt.scatter(iris['petal_length'], iris['petal_width'])
Out[88]:
<matplotlib.collections.PathCollection at 0x11d68d0b8>

Exercise: This time, create the same scatterplot, but assign a different color for each flower species.

In [89]:
plt.scatter(x=iris.loc[iris['species'] == 'setosa', 'petal_length'],
            y=iris.loc[iris['species'] == 'setosa', 'petal_width'],
            label='setosa')
plt.scatter(x=iris.loc[iris['species'] == 'versicolor', 'petal_length'],
            y=iris.loc[iris['species'] == 'versicolor', 'petal_width'],
            label='versicolor')
plt.scatter(x=iris.loc[iris['species'] == 'virginica', 'petal_length'],
            y=iris.loc[iris['species'] == 'virginica', 'petal_width'],
            label='virginica')
plt.legend();
In [90]:
def plot_by_species(species, x, y):
    plt.scatter(x=iris.loc[iris['species'] == species, x],
             y=iris.loc[iris['species'] == species, y],
             label=species)

for species in iris['species'].unique():
    plot_by_species(species, 'sepal_length', 'sepal_width')

plt.xlabel('Sepal Length (cm)')
plt.ylabel('Sepal Width (cm)')
plt.title('Sepal Length vs Sepal Width (by Species)')
plt.legend()
Out[90]:
<matplotlib.legend.Legend at 0x11d624cf8>

Seaborn

Histogram

Back to the tips dataset to explore seaborn! First off is seaborn's take on the histogram, sns.distplot(). By default, it normalizes the histogram to a density and overlays a kernel density estimate; if you would like seaborn to just show a plain histogram of counts, you can add the argument kde=False. Note: seaborn 0.11 and later deprecate sns.distplot() in favor of sns.histplot() and sns.displot().

In [91]:
sns.set()

plt.figure(figsize=(15, 10))
plt.subplot(1, 2, 1)
sns.distplot(tips['total_bill'])

plt.subplot(1, 2, 2)
sns.distplot(tips['total_bill'], kde=False)
Out[91]:
<matplotlib.axes._subplots.AxesSubplot at 0x1227d54e0>

Scatterplot

To create a scatterplot using seaborn, you can use sns.lmplot(). It takes x-values and y-values, and overlays a least-squares regression line with a shaded confidence band (95% by default).

Note: You can use pandas indexing, but check out the fancy ability to refer to columns by their names instead.

In [92]:
sns.lmplot(x='total_bill', y='tip', data=tips)
Out[92]:
<seaborn.axisgrid.FacetGrid at 0x11bf87cc0>

Let's do that same plot from earlier, where we colored the points by smoker. It's a lot easier in seaborn, since we only need to pass in an additional argument of hue:

In [93]:
sns.lmplot(x='total_bill', y='tip', hue='smoker', data=tips)
Out[93]:
<seaborn.axisgrid.FacetGrid at 0x1229b72b0>

Cool. Do smokers' and non-smokers' generosities differ by time of day (lunch vs. dinner)? Let's try out the row and col (column) arguments:

In [94]:
sns.lmplot(x='total_bill', y='tip', row='time', col='smoker', data=tips)
Out[94]:
<seaborn.axisgrid.FacetGrid at 0x12296dfd0>

Seaborn Exercises

Exercise: Your turn! Create a histogram of the petal widths in the iris dataset.

In [95]:
sns.distplot(iris['petal_width'], bins=np.linspace(0, 3, 20))
Out[95]:
<matplotlib.axes._subplots.AxesSubplot at 0x122c63b00>

Exercise: Now try to create a scatterplot of petal lengths versus petal widths, and color the points based on the species of flowers. Feel free to turn off the regression line using fit_reg=False.

In [96]:
sns.lmplot(x='petal_length', y='petal_width', hue='species', data=iris, fit_reg=False)
Out[96]:
<seaborn.axisgrid.FacetGrid at 0x1231cc4e0>

That's the end of our workshop.

We hope you learned something. Keep this notebook handy for reference later!

Hope to see you at our workshop next week: Python Modeling