Pandas reference: http://pandas.pydata.org/pandas-docs/stable/
Pandas is a useful python library for data manipulation and analysis.
Install jupyter:
pip3 install jupyter
Launch your notebook (opens in browser):
jupyter notebook [name_of_file.ipynb]
Alternatively, you can run Jupyter Notebooks in Google Drive using Colaboratory.
pandas
is a library made by other people! We need to import it. We’ll also be using numpy (see previous tutorial)
import numpy as np
import pandas as pd
A table is called a ‘dataframe’ in Pandas. Consider the table fruit_info
:
**color** | **fruit** |
red | apple |
orange | orange |
yellow | banana |
pink | raspberry |
Let’s break this table down. DataFrames consist of columns called Series
. Series act like numpy arrays.
How to make a Series:
array
pd.Series(array, name="...")
# name can be anythingMake a Series that contains the colors from fruit_info
and has name='color'
array = np.array(['red', 'orange', 'yellow', 'pink'])
color_column = pd.Series(array, name="color")
color_column
0 red
1 orange
2 yellow
3 pink
Name: color, dtype: object
Make another Series for the fruit column:
array = np.array(['apple', 'orange', 'banana', 'raspberry'])
fruit_column = pd.Series(array, name="fruit")
fruit_column
0 apple
1 orange
2 banana
3 raspberry
Name: fruit, dtype: object
Combine your Series into a table!
pd.concat([ series1, series2, series3, ... ], 1)
Don’t forget the 1
or you’ll just make a giant Series.
fruit_info = pd.concat([color_column, fruit_column], 1)
fruit_info
color | fruit | |
---|---|---|
0 | red | apple |
1 | orange | orange |
2 | yellow | banana |
3 | pink | raspberry |
What if we were given the DataFrame and we want to extract the columns?
fruit_info['fruit'] # we get the fruit_column Series back!
0 apple
1 orange
2 banana
3 raspberry
Name: fruit, dtype: object
Also, we can manually create tables by using a python dictionary. A dictionary has the following format:
d = { "name of column" : [ list of values ],
"name of column 2" : [ list of values ],
...
...
}```
```python
d = { 'fruit' : ['apple', 'orange', 'banana', 'raspberry'],
'color' : ['red', 'orange', 'yellow', 'pink']
}
fruit_info_again = pd.DataFrame(d)
fruit_info_again
color | fruit | |
---|---|---|
0 | red | apple |
1 | orange | orange |
2 | yellow | banana |
3 | pink | raspberry |
Add a column to table
labeled “new column” like so:
table['new column'] = array
fruit_info['inventory'] = np.array([23, 18, 50, 20])
fruit_info
color | fruit | inventory | |
---|---|---|---|
0 | red | apple | 23 |
1 | orange | orange | 18 |
2 | yellow | banana | 50 |
3 | pink | raspberry | 20 |
Add a column called rating
that assigns your rating from 1 to 5 for each fruit :)
fruit_info['rating'] = np.array([3, 4, 5, 5])
fruit_info # should now include a rating column
color | fruit | inventory | rating | |
---|---|---|---|---|
0 | red | apple | 23 | 3 |
1 | orange | orange | 18 | 4 |
2 | yellow | banana | 50 | 5 |
3 | pink | raspberry | 20 | 5 |
Now, use the .drop()
method to drop the color
column.
fruit_info_without_color = fruit_info.drop('color', 1) # must include axis=1
fruit_info_without_color
fruit | inventory | rating | |
---|---|---|---|
0 | apple | 23 | 3 |
1 | orange | 18 | 4 |
2 | banana | 50 | 5 |
3 | raspberry | 20 | 5 |
Time to use a real dataset!
You can read a .csv
file into pandas using pd.read_csv( url )
.
Create a variable called baby_names
that loads this data: https://raw.githubusercontent.com/carlocrza/Data_Science_Society/master/baby_names.csv
baby_names = pd.read_csv("https://raw.githubusercontent.com/carlocrza/Data_Science_Society/master/baby_names.csv")
Let’s display the table. We can just type baby_names
and run the cell but baby_names is HUGE! So, let’s display just the first five rows with:
DataFrame.head( # of rows )
baby_names.head(5)
State | Sex | Year | Name | Count | |
---|---|---|---|---|---|
0 | CA | F | 1910 | Mary | 295 |
1 | CA | F | 1910 | Helen | 239 |
2 | CA | F | 1910 | Dorothy | 220 |
3 | CA | F | 1910 | Margaret | 163 |
4 | CA | F | 1910 | Frances | 134 |
Follow the structure:
table.loc[rows, columns]
table.loc[2:8, [ 'Name', 'Count']]
The above code will select columns “Name” and “Count” from rows 2 through 8.
# Returns the name of our columns
baby_names.columns
Index(['State', 'Sex', 'Year', 'Name', 'Count'], dtype='object')
baby_names.loc[2:8, ['Name', "Count"]]
Name | Count | |
---|---|---|
2 | Dorothy | 220 |
3 | Margaret | 163 |
4 | Frances | 134 |
5 | Ruth | 128 |
6 | Evelyn | 126 |
7 | Alice | 118 |
8 | Virginia | 101 |
Return a table that includes rows 1000-1005 and only includes the column “Name”.
baby_names.loc[1000:1005, ['Name']]
Name | |
---|---|
1000 | Sara |
1001 | Teresa |
1002 | Cecilia |
1003 | Eda |
1004 | Etta |
1005 | Faye |
# Want to select EVERY row?
# Don't put anything before and after the colon :
baby_names.loc[:, ['Sex', 'Name']].head(4)
Sex | Name | |
---|---|---|
0 | F | Mary |
1 | F | Helen |
2 | F | Dorothy |
3 | F | Margaret |
Remember we can extract the column in the form of a Series using:
table_name['Name of column']
name_column = baby_names['Name']
name_column.head(5) # we can also use .head with Series!
0 Mary
1 Helen
2 Dorothy
3 Margaret
4 Frances
Name: Name, dtype: object
Lastly, we can select rows based off of True / False data. Let’s go back to the simpler fruit_info
table.
fruit_info
color | fruit | inventory | rating | |
---|---|---|---|---|
0 | red | apple | 23 | 3 |
1 | orange | orange | 18 | 4 |
2 | yellow | banana | 50 | 5 |
3 | pink | raspberry | 20 | 5 |
# select row only if corresponding value in *selection* is True
selection = np.array([True, False, True, False])
fruit_info[selection]
color | fruit | inventory | rating | |
---|---|---|---|---|
0 | red | apple | 23 | 3 |
2 | yellow | banana | 50 | 5 |
So far we have selected data based off of row numbers and column headers. Let’s work on filtering data more precisely.
table[condition]
condition = baby_names['Name'] == 'Carlo'
baby_names[condition].head(5)
State | Sex | Year | Name | Count | |
---|---|---|---|---|---|
219486 | CA | M | 1917 | Carlo | 10 |
219941 | CA | M | 1918 | Carlo | 8 |
220848 | CA | M | 1920 | Carlo | 10 |
221935 | CA | M | 1922 | Carlo | 9 |
222457 | CA | M | 1923 | Carlo | 10 |
The above code only selects rows that have Name equal to ‘Carlo’. Change it to your name!
table[ (condition 1) & (condition 2) ]
select the names in Year 2000 that have larger than 3000 counts.
result = baby_names[(baby_names['Count'] > 3000) & (baby_names['Year'] == 2000)]
result.head(3)
State | Sex | Year | Name | Count | |
---|---|---|---|---|---|
320198 | CA | M | 2000 | Daniel | 4339 |
320199 | CA | M | 2000 | Anthony | 3837 |
320200 | CA | M | 2000 | Jose | 3803 |
Remember that calling baby_names['Name']
returns a Series of all of the names.
Checking if values in the series are equal to Carlo
results in an array of {True, False} values.
Then, we select rows based off of this boolean array. Thus, we could also do:
names = baby_names['Name']
equalto_Carlo = (names == 'Carlo') # equalto_Carlo is now an array of True/False variables!
baby_names[equalto_Carlo].head(5)
State | Sex | Year | Name | Count | |
---|---|---|---|---|---|
219486 | CA | M | 1917 | Carlo | 10 |
219941 | CA | M | 1918 | Carlo | 8 |
220848 | CA | M | 1920 | Carlo | 10 |
221935 | CA | M | 1922 | Carlo | 9 |
222457 | CA | M | 1923 | Carlo | 10 |
How many rows does our baby_names
table have?
len(baby_names)
367931
That’s a lot of rows! We can’t just look at the table and understand it.
Luckily, Numpy functions treat pandas Series as np.arrays.
Exercise:What is oldest and most recent year that we have data from in baby_names
?
HINT: np.min, np.max
recent_year = np.max(baby_names['Year'])
oldest_year = np.min(baby_names['Year'])
(recent_year, oldest_year)
(2016, 1910)
How many baby names were born in CA in 2015?
Hint: the ‘Count’ column refers the the number of occurrences of a baby name. How could we find the total number of baby names? Now narrow that to only 2015.
baby_names_2015 = baby_names[baby_names['Year'] == 2015]
baby_names_2015_counts = baby_names_2015['Count']
number_baby_names_2015 = np.sum(baby_names_2015_counts)
number_baby_names_2015
429730
np.sum(baby_names[baby_names['Year'] == 2015]['Count'])
429730
# return an array with an element for each unique value in the Series/np.array
np.unique(baby_names['Sex'])
array(['F', 'M'], dtype=object)
# demo
states = np.unique(baby_names['State']) # okay now we know this dataset only involves California babies.
names = np.unique(baby_names['Name'])
len(names)
19234
Depending on how you format your code, pandas might be returning a copy of the dataframe (i.e. a whole new dataframe, but just with the same values), or a view of the dataframe (i.e. the same dataframe itself).
carlos_fruits = fruit_info.copy()
carlos_fruits
color | fruit | inventory | rating | |
---|---|---|---|---|
0 | red | apple | 23 | 3 |
1 | orange | orange | 18 | 4 |
2 | yellow | banana | 50 | 5 |
3 | pink | raspberry | 20 | 5 |
Let’s say Carlo is happy with those ratings. But Jun Seo loves bananas! Let’s make a “new” dataframe and change the ratings accordingly:
junseos_fruits = carlos_fruits
junseos_fruits['rating'] = [3, 4, 9999, 5]
junseos_fruits
color | fruit | inventory | rating | |
---|---|---|---|---|
0 | red | apple | 23 | 3 |
1 | orange | orange | 18 | 4 |
2 | yellow | banana | 50 | 9999 |
3 | pink | raspberry | 20 | 5 |
And taking a look back at Carlo’s fruits:
carlos_fruits
color | fruit | inventory | rating | |
---|---|---|---|---|
0 | red | apple | 23 | 3 |
1 | orange | orange | 18 | 4 |
2 | yellow | banana | 50 | 9999 |
3 | pink | raspberry | 20 | 5 |
Wait, Carlo’s banana rating shouldn’t be that high! What happened is that junseos_fruits returned a view on Carlo’s dataframe. Then did our shenanigans affect the original fruit_info dataframe too?
fruit_info
color | fruit | inventory | rating | |
---|---|---|---|---|
0 | red | apple | 23 | 3 |
1 | orange | orange | 18 | 4 |
2 | yellow | banana | 50 | 5 |
3 | pink | raspberry | 20 | 5 |
No, because when we called carlos_fruits = fruit_info.copy()
, we asked pandas to forcibly create a brand new dataframe with identical values instead.
This is arguably one of the most frustrating warnings you will see while using pandas. TL;DR: Use .loc instead of square brackets to index into data.
Let’s say Jun Seo strongly dislikes apples.
junseos_fruits[junseos_fruits['fruit'] == 'apple']
color | fruit | inventory | rating | |
---|---|---|---|---|
0 | red | apple | 23 | 3 |
junseos_fruits[junseos_fruits['fruit'] == 'apple']['rating'] = -100
junseos_fruits
color | fruit | inventory | rating | |
---|---|---|---|---|
0 | red | apple | 23 | 3 |
1 | orange | orange | 18 | 4 |
2 | yellow | banana | 50 | 9999 |
3 | pink | raspberry | 20 | 5 |
junseos_fruits['rating']
0 3
1 4
2 9999
3 5
Name: rating, dtype: int64
junseos_fruits['rating'][0] = -100
junseos_fruits
color | fruit | inventory | rating | |
---|---|---|---|---|
0 | red | apple | 23 | -100 |
1 | orange | orange | 18 | 4 |
2 | yellow | banana | 50 | 9999 |
3 | pink | raspberry | 20 | 5 |
junseos_fruits.loc[1, 'rating'] = 1738
junseos_fruits
color | fruit | inventory | rating | |
---|---|---|---|---|
0 | red | apple | 23 | -100 |
1 | orange | orange | 18 | 1738 |
2 | yellow | banana | 50 | 9999 |
3 | pink | raspberry | 20 | 5 |
We won’t have time to go through this thoroughly in lab. However, we encourage you to look into this material if you want to go further. Feel free to ask us any questions!
In the previous section we calculated the number of baby names registered in 2015.
np.sum(baby_names[baby_names['Year'] == 2015]['Count'])
429730
There are 107 years though. If we wanted to know how many babies were born in California for each year we need to do something more efficient.
groupby
to the rescue!
Groupby allows us to split our table into groups, each group having one similarity.
For example if we group by “Year” we would create 107 groups because there are 107 unique years.
Now we have 107 groups but what do we do with them? We can apply the function sum
to each group. This will sum the other numerical column, ‘Counts’ which reduces each group to a single row: Year and sum.
Excellent tutorial: http://bconnelly.net/2013/10/summarizing-data-in-python-with-pandas/
# this will apply sum to the "Count" column of each year group
yearly_data = baby_names.groupby('Year').sum()
yearly_data.head(5)
Count | |
---|---|
Year | |
1910 | 9164 |
1911 | 9984 |
1912 | 17944 |
1913 | 22094 |
1914 | 26925 |
Further reading: http://bconnelly.net/2013/10/summarizing-data-in-python-with-pandas/
%matplotlib inline
Pandas comes with a built-in plot
method that can be very useful! pandas.plot
actually uses matplotlib
behind the scenes!
yearly_data
contains the number of registered babies per year.
yearly_data.head()
Count | |
---|---|
Year | |
1910 | 9164 |
1911 | 9984 |
1912 | 17944 |
1913 | 22094 |
1914 | 26925 |
yearly_data.plot(kind="line") #kind='line' is optional
<matplotlib.axes._subplots.AxesSubplot at 0x11a052198>
# don't worry about this function unless you want to learn about groupby
def your_name_history(name):
return baby_names[baby_names['Name'] == name].groupby('Year').sum()
table = your_name_history('John')
table.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x114a44470>
We can modify our data before we graph it to analyze different things.
yearly_data.plot(kind="bar")
plt.axis('off')
(-0.5, 106.5, 0.0, 580000.05000000005)
How could we graph only the 15 years after World War II (i.e. 1945-1960)?
Hint: create a table with only the desired years first
modified = yearly_data.loc[1945:1960]
modified.plot(kind="bar", figsize=(15,8))
<matplotlib.axes._subplots.AxesSubplot at 0x11b7f5fd0>