Carlo Cruz-Albrecht

Python Libraries: Pandas Tutorial

How to Use Pandas for Datascience


Pandas reference: http://pandas.pydata.org/pandas-docs/stable/

Pandas is a useful python library for data manipulation and analysis.

Install jupyter:

pip3 install jupyter

Launch your notebook (opens in browser):

jupyter notebook [name_of_file.ipynb]

Alternatively, you can run Jupyter Notebooks in Google Drive using Colaboratory.

Import

pandas is a library made by other people! We need to import it. We’ll also be using numpy (see previous tutorial)

import numpy as np
import pandas as pd

A table is called a ‘dataframe’ in Pandas. Consider the table fruit_info:

**color****fruit**
redapple
orangeorange
yellowbanana
pinkraspberry

Pandas Series

Let’s break this table down. DataFrames consist of columns called Series. Series act like numpy arrays.

How to make a Series:

  1. create a numpy array
  2. call pd.Series(array, name="...")    # name can be anything
Exercise:

Make a Series that contains the colors from fruit_info and has name='color'

array = np.array(['red', 'orange', 'yellow', 'pink'])
color_column = pd.Series(array, name="color")
color_column
0       red
1    orange
2    yellow
3      pink
Name: color, dtype: object
Exercise:

Make another Series for the fruit column:

array = np.array(['apple', 'orange', 'banana', 'raspberry'])
fruit_column = pd.Series(array, name="fruit")
fruit_column
0        apple
1       orange
2       banana
3    raspberry
Name: fruit, dtype: object

Combine your Series into a table!

pd.concat([ series1, series2, series3, ... ], 1)

Don’t forget the 1 or you’ll just make a giant Series.

fruit_info = pd.concat([color_column, fruit_column], 1)
fruit_info
color fruit
0 red apple
1 orange orange
2 yellow banana
3 pink raspberry

What if we were given the DataFrame and we want to extract the columns?

fruit_info['fruit'] # we get the fruit_column Series back!
0        apple
1       orange
2       banana
3    raspberry
Name: fruit, dtype: object

Dictionaries

Also, we can manually create tables by using a python dictionary. A dictionary has the following format:

d = { "name of column"   :  [  list of values  ],
      "name of column 2" :  [  list of values  ],
                        ...
                        ...
    }```
    


```python
d = { 'fruit' : ['apple', 'orange', 'banana', 'raspberry'],
      'color' : ['red', 'orange', 'yellow', 'pink']
    }
fruit_info_again = pd.DataFrame(d)
fruit_info_again
color fruit
0 red apple
1 orange orange
2 yellow banana
3 pink raspberry

Add Columns

Add a column to table labeled “new column” like so:

table['new column'] = array

fruit_info['inventory'] = np.array([23, 18, 50, 20])
fruit_info
color fruit inventory
0 red apple 23
1 orange orange 18
2 yellow banana 50
3 pink raspberry 20
Exercise:

Add a column called rating that assigns your rating from 1 to 5 for each fruit :)

fruit_info['rating'] = np.array([3, 4, 5, 5])

fruit_info  # should now include a rating column
color fruit inventory rating
0 red apple 23 3
1 orange orange 18 4
2 yellow banana 50 5
3 pink raspberry 20 5

Drop

Exercise:

Now, use the .drop() method to drop the color column.

fruit_info_without_color = fruit_info.drop('color', 1) # must include axis=1

fruit_info_without_color
fruit inventory rating
0 apple 23 3
1 orange 18 4
2 banana 50 5
3 raspberry 20 5

California Baby Names

Time to use a real dataset!

You can read a .csv file into pandas using pd.read_csv( url ).

Create a variable called baby_names that loads this data: https://raw.githubusercontent.com/carlocrza/Data_Science_Society/master/baby_names.csv

baby_names = pd.read_csv("https://raw.githubusercontent.com/carlocrza/Data_Science_Society/master/baby_names.csv")

Let’s display the table. We can just type baby_names and run the cell but baby_names is HUGE! So, let’s display just the first five rows with:

DataFrame.head( # of rows )

baby_names.head(5)
State Sex Year Name Count
0 CA F 1910 Mary 295
1 CA F 1910 Helen 239
2 CA F 1910 Dorothy 220
3 CA F 1910 Margaret 163
4 CA F 1910 Frances 134

Row, Column Selection

Follow the structure:

table.loc[rows, columns]

table.loc[2:8, [ 'Name', 'Count']]

The above code will select columns “Name” and “Count” from rows 2 through 8.

# Returns the name of our columns
baby_names.columns
Index(['State', 'Sex', 'Year', 'Name', 'Count'], dtype='object')
baby_names.loc[2:8, ['Name', "Count"]]
Name Count
2 Dorothy 220
3 Margaret 163
4 Frances 134
5 Ruth 128
6 Evelyn 126
7 Alice 118
8 Virginia 101
Exercise:

Return a table that includes rows 1000-1005 and only includes the column “Name”.

baby_names.loc[1000:1005, ['Name']]
Name
1000 Sara
1001 Teresa
1002 Cecilia
1003 Eda
1004 Etta
1005 Faye
# Want to select EVERY row?
# Don't put anything before and after the colon :
baby_names.loc[:, ['Sex', 'Name']].head(4)
Sex Name
0 F Mary
1 F Helen
2 F Dorothy
3 F Margaret

Selecting an entire Column

Remember we can extract the column in the form of a Series using:

table_name['Name of column']

name_column = baby_names['Name']
name_column.head(5) # we can also use .head with Series!
0        Mary
1       Helen
2     Dorothy
3    Margaret
4     Frances
Name: Name, dtype: object

Selecting rows with a Boolean Array

Lastly, we can select rows based off of True / False data. Let’s go back to the simpler fruit_info table.

fruit_info
color fruit inventory rating
0 red apple 23 3
1 orange orange 18 4
2 yellow banana 50 5
3 pink raspberry 20 5
# select row only if corresponding value in *selection* is True
selection = np.array([True, False, True, False])
fruit_info[selection]
color fruit inventory rating
0 red apple 23 3
2 yellow banana 50 5

Filtering Data

So far we have selected data based off of row numbers and column headers. Let’s work on filtering data more precisely.

table[condition]

condition = baby_names['Name'] == 'Carlo'
baby_names[condition].head(5)
State Sex Year Name Count
219486 CA M 1917 Carlo 10
219941 CA M 1918 Carlo 8
220848 CA M 1920 Carlo 10
221935 CA M 1922 Carlo 9
222457 CA M 1923 Carlo 10

The above code only selects rows that have Name equal to ‘Carlo’. Change it to your name!

Apply multiple conditions!

table[ (condition 1) & (condition 2) ]

Class Exercise:

select the names in Year 2000 that have larger than 3000 counts.

result = baby_names[(baby_names['Count'] > 3000) & (baby_names['Year'] == 2000)]
result.head(3)
State Sex Year Name Count
320198 CA M 2000 Daniel 4339
320199 CA M 2000 Anthony 3837
320200 CA M 2000 Jose 3803

Thorough explanation:

Remember that calling baby_names['Name'] returns a Series of all of the names.

Checking if values in the series are equal to Carlo results in an array of {True, False} values.

Then, we select rows based off of this boolean array. Thus, we could also do:

names = baby_names['Name']
equalto_Carlo = (names == 'Carlo')  # equalto_Carlo is now an array of True/False variables!
baby_names[equalto_Carlo].head(5)
State Sex Year Name Count
219486 CA M 1917 Carlo 10
219941 CA M 1918 Carlo 8
220848 CA M 1920 Carlo 10
221935 CA M 1922 Carlo 9
222457 CA M 1923 Carlo 10

Using Numpy with Pandas

How many rows does our baby_names table have?

len(baby_names)
367931

That’s a lot of rows! We can’t just look at the table and understand it.

Luckily, Numpy functions treat pandas Series as np.arrays.

Exercise:

What is oldest and most recent year that we have data from in baby_names? HINT: np.min, np.max

recent_year = np.max(baby_names['Year'])
oldest_year = np.min(baby_names['Year'])
(recent_year, oldest_year)
(2016, 1910)
Exercise:

How many baby names were born in CA in 2015?

Hint: the ‘Count’ column refers the the number of occurrences of a baby name. How could we find the total number of baby names? Now narrow that to only 2015.

baby_names_2015 = baby_names[baby_names['Year'] == 2015]
baby_names_2015_counts = baby_names_2015['Count']
number_baby_names_2015 = np.sum(baby_names_2015_counts)
number_baby_names_2015
429730
np.sum(baby_names[baby_names['Year'] == 2015]['Count'])
429730

np.unique

# return an array with an element for each unique value in the Series/np.array
np.unique(baby_names['Sex'])
array(['F', 'M'], dtype=object)
# demo
states = np.unique(baby_names['State']) # okay now we know this dataset only involves California babies.
names = np.unique(baby_names['Name'])
len(names)
19234

Copy vs View

Depending on how you format your code, pandas might be returning a copy of the dataframe (i.e. a whole new dataframe, but just with the same values), or a view of the dataframe (i.e. the same dataframe itself).

carlos_fruits = fruit_info.copy()
carlos_fruits
color fruit inventory rating
0 red apple 23 3
1 orange orange 18 4
2 yellow banana 50 5
3 pink raspberry 20 5

Let’s say Carlo is happy with those ratings. But Jun Seo loves bananas! Let’s make a “new” dataframe and change the ratings accordingly:

junseos_fruits = carlos_fruits
junseos_fruits['rating'] = [3, 4, 9999, 5]
junseos_fruits
color fruit inventory rating
0 red apple 23 3
1 orange orange 18 4
2 yellow banana 50 9999
3 pink raspberry 20 5

And taking a look back at Carlo’s fruits:

carlos_fruits
color fruit inventory rating
0 red apple 23 3
1 orange orange 18 4
2 yellow banana 50 9999
3 pink raspberry 20 5

Wait, Carlo’s banana rating shouldn’t be that high! What happened is that junseos_fruits returned a view on Carlo’s dataframe. Then did our shenanigans affect the original fruit_info dataframe too?

fruit_info
color fruit inventory rating
0 red apple 23 3
1 orange orange 18 4
2 yellow banana 50 5
3 pink raspberry 20 5

No, because when we called carlos_fruits = fruit_info.copy(), we asked pandas to forcibly create a brand new dataframe with identical values instead.

SettingWithCopyWarning

This is arguably one of the most frustrating warnings you will see while using pandas. TL;DR: Use .loc instead of square brackets to index into data.

Let’s say Jun Seo strongly dislikes apples.

junseos_fruits[junseos_fruits['fruit'] == 'apple']
color fruit inventory rating
0 red apple 23 3
junseos_fruits[junseos_fruits['fruit'] == 'apple']['rating'] = -100
junseos_fruits
color fruit inventory rating
0 red apple 23 3
1 orange orange 18 4
2 yellow banana 50 9999
3 pink raspberry 20 5
junseos_fruits['rating']
0       3
1       4
2    9999
3       5
Name: rating, dtype: int64
junseos_fruits['rating'][0] = -100
junseos_fruits
color fruit inventory rating
0 red apple 23 -100
1 orange orange 18 4
2 yellow banana 50 9999
3 pink raspberry 20 5
junseos_fruits.loc[1, 'rating'] = 1738
junseos_fruits
color fruit inventory rating
0 red apple 23 -100
1 orange orange 18 1738
2 yellow banana 50 9999
3 pink raspberry 20 5

[optional] Group By

We won’t have time to go through this thoroughly in lab. However, we encourage you to look into this material if you want to go further. Feel free to ask us any questions!

In the previous section we calculated the number of baby names registered in 2015.

np.sum(baby_names[baby_names['Year'] == 2015]['Count'])
429730

There are 107 years though. If we wanted to know how many babies were born in California for each year we need to do something more efficient.

groupby to the rescue!

Groupby allows us to split our table into groups, each group having one similarity.

For example if we group by “Year” we would create 107 groups because there are 107 unique years.

`baby_names.groupby('Year')`

Now we have 107 groups but what do we do with them? We can apply the function sum to each group. This will sum the other numerical column, ‘Counts’ which reduces each group to a single row: Year and sum.

Excellent tutorial: http://bconnelly.net/2013/10/summarizing-data-in-python-with-pandas/

# this will apply sum to the "Count" column of each year group
yearly_data = baby_names.groupby('Year').sum()
yearly_data.head(5)
Count
Year
1910 9164
1911 9984
1912 17944
1913 22094
1914 26925

Further reading: http://bconnelly.net/2013/10/summarizing-data-in-python-with-pandas/

Plot with Pandas

%matplotlib inline

Pandas.plot documentation

Pandas comes with a built-in plot method that can be very useful! pandas.plot actually uses matplotlib behind the scenes!

yearly_data contains the number of registered babies per year.

yearly_data.head()
Count
Year
1910 9164
1911 9984
1912 17944
1913 22094
1914 26925

Line Graphs

yearly_data.plot(kind="line")  #kind='line' is optional
<matplotlib.axes._subplots.AxesSubplot at 0x11a052198>

png

Study: Name History

# don't worry about this function unless you want to learn about groupby
def your_name_history(name):
    return baby_names[baby_names['Name'] == name].groupby('Year').sum()
table = your_name_history('John')

table.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x114a44470>

png

Bar Graphs

We can modify our data before we graph it to analyze different things.

yearly_data.plot(kind="bar")
plt.axis('off')
(-0.5, 106.5, 0.0, 580000.05000000005)

png

Class Exercise:

How could we graph only the 15 years after World War II (i.e. 1945-1960)?

Hint: create a table with only the desired years first

modified = yearly_data.loc[1945:1960]

modified.plot(kind="bar", figsize=(15,8))
<matplotlib.axes._subplots.AxesSubplot at 0x11b7f5fd0>

png