Python Libraries: Pandas Tutorial

How to Use Pandas for Datascience

Pandas reference: http://pandas.pydata.org/pandas-docs/stable/

Pandas is a useful python library for data manipulation and analysis.

Install jupyter:

pip3 install jupyter

Launch your notebook (opens in browser):

jupyter notebook [name_of_file.ipynb]

Alternatively, you can run Jupyter Notebooks in Google Drive using Colaboratory.

Import

pandas is a library made by other people! We need to import it. We’ll also be using numpy (see previous tutorial)

import numpy as np
import pandas as pd

A table is called a ‘dataframe’ in Pandas. Consider the table fruit_info:

color	fruit
red	apple
orange	orange
yellow	banana
pink	raspberry

Pandas Series

Let’s break this table down. DataFrames consist of columns called Series. Series act like numpy arrays.

How to make a Series:

create a numpy array
call pd.Series(array, name="...") # name can be anything

Exercise:

Make a Series that contains the colors from fruit_info and has name='color'

array = np.array(['red', 'orange', 'yellow', 'pink'])
color_column = pd.Series(array, name="color")
color_column

     red
  orange
  yellow
    pink
Name: color, dtype: object

Exercise:

Make another Series for the fruit column:

array = np.array(['apple', 'orange', 'banana', 'raspberry'])
fruit_column = pd.Series(array, name="fruit")
fruit_column

      apple
     orange
     banana
  raspberry
Name: fruit, dtype: object

Combine your Series into a table!

pd.concat([ series1, series2, series3, ... ], 1)

Don’t forget the 1 or you’ll just make a giant Series.

fruit_info = pd.concat([color_column, fruit_column], 1)
fruit_info

	color	fruit
0	red	apple
1	orange	orange
2	yellow	banana
3	pink	raspberry

What if we were given the DataFrame and we want to extract the columns?

fruit_info['fruit'] # we get the fruit_column Series back!

      apple
     orange
     banana
  raspberry
Name: fruit, dtype: object

Dictionaries

Also, we can manually create tables by using a python dictionary. A dictionary has the following format:

d = { "name of column"   :  [  list of values  ],
      "name of column 2" :  [  list of values  ],
                        ...
                        ...
    }```
    


```python
d = { 'fruit' : ['apple', 'orange', 'banana', 'raspberry'],
      'color' : ['red', 'orange', 'yellow', 'pink']
    }

fruit_info_again = pd.DataFrame(d)
fruit_info_again

	color	fruit
0	red	apple
1	orange	orange
2	yellow	banana
3	pink	raspberry

Add Columns

Add a column to table labeled “new column” like so:

table['new column'] = array

fruit_info['inventory'] = np.array([23, 18, 50, 20])
fruit_info

	color	fruit	inventory
0	red	apple	23
1	orange	orange	18
2	yellow	banana	50
3	pink	raspberry	20

Exercise:

Add a column called rating that assigns your rating from 1 to 5 for each fruit :)

fruit_info['rating'] = np.array([3, 4, 5, 5])

fruit_info  # should now include a rating column

	color	fruit	inventory	rating
0	red	apple	23	3
1	orange	orange	18	4
2	yellow	banana	50	5
3	pink	raspberry	20	5

Drop

Exercise:

Now, use the .drop() method to drop the color column.

fruit_info_without_color = fruit_info.drop('color', 1) # must include axis=1

fruit_info_without_color

	fruit	inventory	rating
0	apple	23	3
1	orange	18	4
2	banana	50	5
3	raspberry	20	5

California Baby Names

Time to use a real dataset!

You can read a .csv file into pandas using pd.read_csv( url ).

Create a variable called baby_names that loads this data: https://raw.githubusercontent.com/carlocrza/Data_Science_Society/master/baby_names.csv

baby_names = pd.read_csv("https://raw.githubusercontent.com/carlocrza/Data_Science_Society/master/baby_names.csv")

Let’s display the table. We can just type baby_names and run the cell but baby_names is HUGE! So, let’s display just the first five rows with:

DataFrame.head( # of rows )

baby_names.head(5)

	State	Sex	Year	Name	Count
0	CA	F	1910	Mary	295
1	CA	F	1910	Helen	239
2	CA	F	1910	Dorothy	220
3	CA	F	1910	Margaret	163
4	CA	F	1910	Frances	134

Row, Column Selection

Follow the structure:

table.loc[rows, columns]

table.loc[2:8, [ 'Name', 'Count']]

The above code will select columns “Name” and “Count” from rows 2 through 8.

# Returns the name of our columns
baby_names.columns

Index(['State', 'Sex', 'Year', 'Name', 'Count'], dtype='object')

baby_names.loc[2:8, ['Name', "Count"]]

	Name	Count
2	Dorothy	220
3	Margaret	163
4	Frances	134
5	Ruth	128
6	Evelyn	126
7	Alice	118
8	Virginia	101

Exercise:

Return a table that includes rows 1000-1005 and only includes the column “Name”.

baby_names.loc[1000:1005, ['Name']]

	Name
1000	Sara
1001	Teresa
1002	Cecilia
1003	Eda
1004	Etta
1005	Faye

# Want to select EVERY row?
# Don't put anything before and after the colon :
baby_names.loc[:, ['Sex', 'Name']].head(4)

	Sex	Name
0	F	Mary
1	F	Helen
2	F	Dorothy
3	F	Margaret

Selecting an entire Column

Remember we can extract the column in the form of a Series using:

table_name['Name of column']

name_column = baby_names['Name']
name_column.head(5) # we can also use .head with Series!

      Mary
     Helen
   Dorothy
  Margaret
   Frances
Name: Name, dtype: object

Selecting rows with a Boolean Array

Lastly, we can select rows based off of True / False data. Let’s go back to the simpler fruit_info table.

fruit_info

	color	fruit	inventory	rating
0	red	apple	23	3
1	orange	orange	18	4
2	yellow	banana	50	5
3	pink	raspberry	20	5

# select row only if corresponding value in *selection* is True
selection = np.array([True, False, True, False])
fruit_info[selection]

	color	fruit	inventory	rating
0	red	apple	23	3
2	yellow	banana	50	5

Filtering Data

So far we have selected data based off of row numbers and column headers. Let’s work on filtering data more precisely.

table[condition]

condition = baby_names['Name'] == 'Carlo'
baby_names[condition].head(5)

	State	Sex	Year	Name	Count
219486	CA	M	1917	Carlo	10
219941	CA	M	1918	Carlo	8
220848	CA	M	1920	Carlo	10
221935	CA	M	1922	Carlo	9
222457	CA	M	1923	Carlo	10

The above code only selects rows that have Name equal to ‘Carlo’. Change it to your name!

Apply multiple conditions!

table[ (condition 1) & (condition 2) ]

Class Exercise:

select the names in Year 2000 that have larger than 3000 counts.

result = baby_names[(baby_names['Count'] > 3000) & (baby_names['Year'] == 2000)]
result.head(3)

	State	Sex	Year	Name	Count
320198	CA	M	2000	Daniel	4339
320199	CA	M	2000	Anthony	3837
320200	CA	M	2000	Jose	3803

Thorough explanation:

Remember that calling baby_names['Name'] returns a Series of all of the names.

Checking if values in the series are equal to Carlo results in an array of {True, False} values.

Then, we select rows based off of this boolean array. Thus, we could also do:

names = baby_names['Name']
equalto_Carlo = (names == 'Carlo')  # equalto_Carlo is now an array of True/False variables!
baby_names[equalto_Carlo].head(5)

	State	Sex	Year	Name	Count
219486	CA	M	1917	Carlo	10
219941	CA	M	1918	Carlo	8
220848	CA	M	1920	Carlo	10
221935	CA	M	1922	Carlo	9
222457	CA	M	1923	Carlo	10

Using Numpy with Pandas

How many rows does our baby_names table have?

len(baby_names)

That’s a lot of rows! We can’t just look at the table and understand it.

Luckily, Numpy functions treat pandas Series as np.arrays.

Exercise:

What is oldest and most recent year that we have data from in baby_names? HINT: np.min, np.max

recent_year = np.max(baby_names['Year'])
oldest_year = np.min(baby_names['Year'])
(recent_year, oldest_year)

(2016, 1910)

Exercise:

How many baby names were born in CA in 2015?

Hint: the ‘Count’ column refers the the number of occurrences of a baby name. How could we find the total number of baby names? Now narrow that to only 2015.

baby_names_2015 = baby_names[baby_names['Year'] == 2015]
baby_names_2015_counts = baby_names_2015['Count']
number_baby_names_2015 = np.sum(baby_names_2015_counts)
number_baby_names_2015

np.sum(baby_names[baby_names['Year'] == 2015]['Count'])

np.unique

# return an array with an element for each unique value in the Series/np.array
np.unique(baby_names['Sex'])

array(['F', 'M'], dtype=object)

# demo
states = np.unique(baby_names['State']) # okay now we know this dataset only involves California babies.

names = np.unique(baby_names['Name'])
len(names)

Copy vs View

Depending on how you format your code, pandas might be returning a copy of the dataframe (i.e. a whole new dataframe, but just with the same values), or a view of the dataframe (i.e. the same dataframe itself).

carlos_fruits = fruit_info.copy()
carlos_fruits

	color	fruit	inventory	rating
0	red	apple	23	3
1	orange	orange	18	4
2	yellow	banana	50	5
3	pink	raspberry	20	5

Let’s say Carlo is happy with those ratings. But Jun Seo loves bananas! Let’s make a “new” dataframe and change the ratings accordingly:

junseos_fruits = carlos_fruits
junseos_fruits['rating'] = [3, 4, 9999, 5]
junseos_fruits

	color	fruit	inventory	rating
0	red	apple	23	3
1	orange	orange	18	4
2	yellow	banana	50	9999
3	pink	raspberry	20	5

And taking a look back at Carlo’s fruits:

carlos_fruits

	color	fruit	inventory	rating
0	red	apple	23	3
1	orange	orange	18	4
2	yellow	banana	50	9999
3	pink	raspberry	20	5

Wait, Carlo’s banana rating shouldn’t be that high! What happened is that junseos_fruits returned a view on Carlo’s dataframe. Then did our shenanigans affect the original fruit_info dataframe too?

fruit_info

	color	fruit	inventory	rating
0	red	apple	23	3
1	orange	orange	18	4
2	yellow	banana	50	5
3	pink	raspberry	20	5

No, because when we called carlos_fruits = fruit_info.copy(), we asked pandas to forcibly create a brand new dataframe with identical values instead.

SettingWithCopyWarning

This is arguably one of the most frustrating warnings you will see while using pandas. TL;DR: Use .loc instead of square brackets to index into data.

Let’s say Jun Seo strongly dislikes apples.

junseos_fruits[junseos_fruits['fruit'] == 'apple']

	color	fruit	inventory	rating
0	red	apple	23	3

junseos_fruits[junseos_fruits['fruit'] == 'apple']['rating'] = -100
junseos_fruits

	color	fruit	inventory	rating
0	red	apple	23	3
1	orange	orange	18	4
2	yellow	banana	50	9999
3	pink	raspberry	20	5

junseos_fruits['rating']

     3
     4
  9999
     5
Name: rating, dtype: int64

junseos_fruits['rating'][0] = -100
junseos_fruits

	color	fruit	inventory	rating
0	red	apple	23	-100
1	orange	orange	18	4
2	yellow	banana	50	9999
3	pink	raspberry	20	5

junseos_fruits.loc[1, 'rating'] = 1738
junseos_fruits

	color	fruit	inventory	rating
0	red	apple	23	-100
1	orange	orange	18	1738
2	yellow	banana	50	9999
3	pink	raspberry	20	5

[optional] Group By

We won’t have time to go through this thoroughly in lab. However, we encourage you to look into this material if you want to go further. Feel free to ask us any questions!

In the previous section we calculated the number of baby names registered in 2015.

np.sum(baby_names[baby_names['Year'] == 2015]['Count'])

There are 107 years though. If we wanted to know how many babies were born in California for each year we need to do something more efficient.

groupby to the rescue!

Groupby allows us to split our table into groups, each group having one similarity.

For example if we group by “Year” we would create 107 groups because there are 107 unique years.

`baby_names.groupby('Year')`

Now we have 107 groups but what do we do with them? We can apply the function sum to each group. This will sum the other numerical column, ‘Counts’ which reduces each group to a single row: Year and sum.

Excellent tutorial: http://bconnelly.net/2013/10/summarizing-data-in-python-with-pandas/

# this will apply sum to the "Count" column of each year group
yearly_data = baby_names.groupby('Year').sum()
yearly_data.head(5)

	Count
Year
1910	9164
1911	9984
1912	17944
1913	22094
1914	26925

Further reading: http://bconnelly.net/2013/10/summarizing-data-in-python-with-pandas/

Plot with Pandas

%matplotlib inline

Pandas.plot documentation

Pandas comes with a built-in plot method that can be very useful! pandas.plot actually uses matplotlib behind the scenes!

yearly_data contains the number of registered babies per year.

yearly_data.head()

	Count
Year
1910	9164
1911	9984
1912	17944
1913	22094
1914	26925

Line Graphs

yearly_data.plot(kind="line")  #kind='line' is optional

<matplotlib.axes._subplots.AxesSubplot at 0x11a052198>

png

Study: Name History

# don't worry about this function unless you want to learn about groupby
def your_name_history(name):
    return baby_names[baby_names['Name'] == name].groupby('Year').sum()

table = your_name_history('John')

table.plot()

<matplotlib.axes._subplots.AxesSubplot at 0x114a44470>

png

Bar Graphs

We can modify our data before we graph it to analyze different things.

yearly_data.plot(kind="bar")
plt.axis('off')

(-0.5, 106.5, 0.0, 580000.05000000005)

png

Class Exercise:

How could we graph only the 15 years after World War II (i.e. 1945-1960)?

Hint: create a table with only the desired years first

modified = yearly_data.loc[1945:1960]

modified.plot(kind="bar", figsize=(15,8))

<matplotlib.axes._subplots.AxesSubplot at 0x11b7f5fd0>

png