Carlo Cruz-Albrecht

Python Libraries: Numpy Tutorial

How to Use Numpy for Data Science


Numpy reference: https://docs.scipy.org/doc/numpy-1.13.0/reference/

Numpy is a Python library for fast, memory-efficient operations on arrays.

Install jupyter:

pip3 install jupyter

Launch your notebook (opens in browser):

jupyter notebook [name_of_file.ipynb]

Alternatively, you can run Jupyter Notebooks in Google Drive using Colaboratory.

Import

Numpy is a third-party library, not part of Python itself. We need to import a library before we can use it.

import numpy as np

Numpy Arrays

The core of Numpy is the array, which you create with np.array.

Numpy arrays take less space than built-in lists and come with a wide variety of useful functions.

# make an array
a = np.array([2,3,4])
a
array([2, 3, 4])
# make a 2-dimensional array (matrix)
matrix = np.array([ [1,2,3],
                    [4,5,6],
                    [7,8,9] ])
matrix
array([[1, 2, 3],
      [4, 5, 6],
      [7, 8, 9]])
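
To make the space claim and the structure of these arrays more concrete, the sketch below inspects a few array attributes and compares the memory used by an array with that of an equivalent list. The names n, as_list and as_array and the size of 1000 elements are illustrative assumptions, and exact byte counts depend on the Python build and platform, so treat the comparison as a rough estimate.

```python
import sys
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

print(matrix.shape)  # (rows, columns)
print(matrix.ndim)   # number of dimensions
print(matrix.dtype)  # element type, platform dependent (often int64)

# Rough memory comparison for 1000 integers (illustrative size).
n = 1000
as_list = list(range(n))
as_array = np.arange(n, dtype=np.int64)

# A list stores pointers to separate int objects, so count both.
list_bytes = sys.getsizeof(as_list) + sum(sys.getsizeof(v) for v in as_list)
# nbytes counts the array's data buffer only: 8 bytes per int64 element.
array_bytes = as_array.nbytes

print(list_bytes, array_bytes)
```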

Linear Algebra

# np.dot computes the matrix product (here, a matrix times a vector)
np.dot(matrix, a)
array([20, 47, 74])
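
As a small aside on the product above, the @ operator (available since Python 3.5) computes the same matrix product as np.dot for 1-D and 2-D arrays. This sketch re-creates matrix and a so that it runs on its own.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])
a = np.array([2, 3, 4])

dot_result = np.dot(matrix, a)  # matrix-vector product
at_result = matrix @ a          # same product with the @ operator

print(dot_result)  # [20 47 74]
print(at_result)   # [20 47 74]

# Note: * is element-wise multiplication, not the matrix product.
elementwise = matrix * a
print(elementwise)
```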

Arithmetic

These operations are convenient and extremely fast, much faster than accomplishing the same thing with a for loop.

You can add, subtract, multiply, and divide Numpy arrays element by element. Built-in Python lists do not support element-wise arithmetic.

a + 5
array([7, 8, 9])
a * -1
array([-2, -3, -4])
b = np.array([3, 2, 1])
a + b
array([5, 5, 5])
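
To illustrate the contrast with built-in lists, the sketch below applies the same operators to a list and to an array. The variable names are illustrative.

```python
import numpy as np

values_list = [2, 3, 4]
values_array = np.array(values_list)

# With lists, + concatenates and * repeats.
print(values_list + [5])  # [2, 3, 4, 5]
print(values_list * 2)    # [2, 3, 4, 2, 3, 4]

# Adding a number to a list is not defined at all.
try:
    values_list + 5
    list_add_failed = False
except TypeError:
    list_add_failed = True
print(list_add_failed)  # True

# With arrays, the same operators work element by element.
print(values_array + 5)  # [7 8 9]
print(values_array * 2)  # [4 6 8]
print(values_array / 2)  # division always produces floats
```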

Length Errors

If you try to perform an operation on two arrays whose shapes are incompatible (for example, lengths 3 and 4), an error will occur. Try running the following cell!

# Run me!
b + np.array([1, 2, 3, 4])
---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

<ipython-input-10-01af955792b6> in <module>()
      1 # Run me!
----> 2 b + np.array([1, 2, 3, 4])


ValueError: operands could not be broadcast together with shapes (3,) (4,) 

You will also get an error (an IndexError) when you try to access an index that does not exist in the array.
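
Both kinds of error can be caught with try/except, which is handy when experimenting. A minimal sketch follows, which also shows two cases that do not raise: a length-1 array is stretched ("broadcast") to match the other operand, and negative indices count from the end.

```python
import numpy as np

b = np.array([3, 2, 1])

# Shapes (3,) and (4,) cannot be broadcast together.
try:
    b + np.array([1, 2, 3, 4])
except ValueError as err:
    print("ValueError:", err)

# A length-1 array is compatible with any length.
print(b + np.array([10]))  # [13 12 11]

# Valid indices for b are 0, 1, 2 (or -1, -2, -3 from the end).
try:
    b[3]
except IndexError as err:
    print("IndexError:", err)

print(b[-1])  # 1
```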

Essential Array Functions

Why do we use Numpy? It provides a multitude of useful functions for arrays. We’ll teach you a few here; many more exist.

Exercise:

Search online how to find the mean of a numpy array.

Use len(array) to find the length of an array.

len(b)
3
len(np.array([1, 2, 3, 4]))
4

Comparisons apply to every element of a Numpy array as well, producing an array of True/False values. This will come in handy later!

a = np.array([1, 2, 3, 1, 1])
a == 1
array([ True, False, False,  True,  True], dtype=bool)
x = np.array([1, 5, -7, 18, 1, -2, 4])
# Find the mean of array x
x_mean = np.mean(x)
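
Element-wise comparisons become useful when combined with indexing: a boolean array can select the matching elements (often called boolean masking). A minimal sketch using the same x follows; the names mask and positives are illustrative.

```python
import numpy as np

x = np.array([1, 5, -7, 18, 1, -2, 4])

mask = x > 0          # one True/False per element
positives = x[mask]   # keep only elements where mask is True

print(mask)
print(positives)           # [ 1  5 18  1  4]
print(np.sum(mask))        # 5, since True counts as 1
print(np.mean(positives))  # 5.8
print(np.mean(x))          # mean of all elements, 20 / 7
```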

Here is a list of some useful Numpy functions. Remember, you can easily find information about them by searching Google or the Numpy documentation!

np.sum(x)
20
np.min(x)
-7
np.max(x)
18
np.median(x)
1.0
np.cumsum(x)
array([ 1,  6, -1, 17, 18, 16, 20])
np.abs(x)
array([ 1,  5,  7, 18,  1,  2,  4])

What do you think np.cumsum does? Note that Numpy has a similar function, np.cumprod. Try it!

What do you think np.diff does?

np.diff(x)
array([  4, -12,  25, -17,  -3,   6])
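
One way to check your guesses: np.cumsum keeps a running total, np.cumprod keeps a running product, and np.diff subtracts each element from the one after it. The sketch below compares np.diff with the equivalent slicing expression; the variable names are illustrative.

```python
import numpy as np

x = np.array([1, 5, -7, 18, 1, -2, 4])

running_sum = np.cumsum(x)
running_product = np.cumprod(x)
differences = np.diff(x)

# np.diff(x) is the same as x[1:] - x[:-1]
manual_differences = x[1:] - x[:-1]

print(running_sum)
print(running_product)
print(differences)

# The last running total equals the overall sum.
print(running_sum[-1] == np.sum(x))  # True
```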

Two very useful Numpy functions are np.arange and np.linspace. They let you create arrays of evenly spaced values:

np.arange(0, 100, 10)
array([ 0, 10, 20, 30, 40, 50, 60, 70, 80, 90])
np.linspace(0, 100, 15)
array([   0.        ,    7.14285714,   14.28571429,   21.42857143,
         28.57142857,   35.71428571,   42.85714286,   50.        ,
         57.14285714,   64.28571429,   71.42857143,   78.57142857,
         85.71428571,   92.85714286,  100.        ])
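
The two functions differ in how the range is specified. A short sketch of the difference: np.arange takes a step and stops before the end value, while np.linspace takes the number of points and, by default, includes the end value.

```python
import numpy as np

by_step = np.arange(0, 100, 10)     # step of 10, stop value 100 excluded
by_count = np.linspace(0, 100, 11)  # 11 points, stop value 100 included

print(by_step)   # integers 0 through 90
print(by_count)  # floats 0.0 through 100.0

# endpoint=False makes linspace exclude the stop value as well.
no_endpoint = np.linspace(0, 100, 10, endpoint=False)
print(no_endpoint)
```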

Python Lists vs Numpy Arrays

Working with Numpy arrays is a little different from working with built-in Python lists.

a = np.array([2, 3, 4])
b = [2, 3, 4]
print(a)
print(b)
[2 3 4]
[2, 3, 4]

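Adding values to a Numpy array works differently. Because an array holds a single data type, appending a string converts every element to a string: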

b.append("hello")
b
[2, 3, 4, 'hello']
a = np.append(a, 'hello')
a
array(['2', '3', '4', 'hello'],
      dtype='<U21')
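
One detail of np.append is easy to miss: it returns a new array and leaves the original untouched, which is why the result has to be assigned to a variable. A minimal sketch with a numeric value (the name appended is illustrative):

```python
import numpy as np

a = np.array([2, 3, 4])

appended = np.append(a, 5)  # returns a new array

print(a)         # [2 3 4], unchanged
print(appended)  # [2 3 4 5]

# Appending a number keeps the integer data type.
print(appended.dtype.kind)  # i
```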

For loops work the same way as with lists.

c = np.array([1, 2, 3, 4, 5])
cumulative_product = 1

for element in c:
    cumulative_product *= element
    
cumulative_product
120
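
For comparison, the same result is available without a loop through np.prod. A minimal sketch:

```python
import numpy as np

c = np.array([1, 2, 3, 4, 5])

# Loop version, as above.
cumulative_product = 1
for element in c:
    cumulative_product *= element

# Built-in reduction that multiplies all elements together.
product = np.prod(c)

print(cumulative_product)  # 120
print(product)             # 120
```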

Numpy Exercises

Use np.arange to create an array called arr1 that contains every odd number from 1 to 100, inclusive.

arr1 = np.arange(1, 100, 2)
arr1
array([ 1,  3,  5,  7,  9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33,
       35, 37, 39, 41, 43, 45, 47, 49, 51, 53, 55, 57, 59, 61, 63, 65, 67,
       69, 71, 73, 75, 77, 79, 81, 83, 85, 87, 89, 91, 93, 95, 97, 99])

Use arr1 to create an array arr2 of every number divisible by 4 from 1 to 200, inclusive.

arr2 = (arr1 + 1) * 2 
arr2
array([  4,   8,  12,  16,  20,  24,  28,  32,  36,  40,  44,  48,  52,
        56,  60,  64,  68,  72,  76,  80,  84,  88,  92,  96, 100, 104,
       108, 112, 116, 120, 124, 128, 132, 136, 140, 144, 148, 152, 156,
       160, 164, 168, 172, 176, 180, 184, 188, 192, 196, 200])

Create the same array, but using np.linspace instead. Call this array arr3.

arr3 = np.linspace(4, 200, 50)
arr3
array([   4.,    8.,   12.,   16.,   20.,   24.,   28.,   32.,   36.,
         40.,   44.,   48.,   52.,   56.,   60.,   64.,   68.,   72.,
         76.,   80.,   84.,   88.,   92.,   96.,  100.,  104.,  108.,
        112.,  116.,  120.,  124.,  128.,  132.,  136.,  140.,  144.,
        148.,  152.,  156.,  160.,  164.,  168.,  172.,  176.,  180.,
        184.,  188.,  192.,  196.,  200.])
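
Note that arr3 holds floats while arr2 holds integers, because np.linspace returns floating-point values by default. The sketch below compares the two and adds a third construction (arr2_direct, a name introduced here for illustration) that builds the multiples of 4 directly with np.arange.

```python
import numpy as np

arr1 = np.arange(1, 100, 2)
arr2 = (arr1 + 1) * 2
arr3 = np.linspace(4, 200, 50)

print(arr2.dtype.kind)  # i (integer)
print(arr3.dtype.kind)  # f (float)

# Same values despite the different data types.
print(np.array_equal(arr2, arr3))  # True

# Convert to integers if needed.
arr3_int = arr3.astype(int)

# Alternative: step through the multiples of 4 directly.
arr2_direct = np.arange(4, 201, 4)
print(np.array_equal(arr2, arr2_direct))  # True
```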

Print the following summary statistics for arr3:

print('Minimum: '            + str(np.min(arr3)))
print('1st quartile: '       + str(np.percentile(arr3, 25)))
print('Median: '             + str(np.median(arr3)))
print('Mean: '               + str(np.mean(arr3)))
print('Standard Deviation: ' + str(np.std(arr3)))
print('3rd Quartile: '       + str(np.percentile(arr3, 75)))
print('Max: '                + str(np.max(arr3)))
Minimum: 4.0
1st quartile: 53.0
Median: 102.0
Mean: 102.0
Standard Deviation: 57.7234787586
3rd Quartile: 151.0
Max: 200.0
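
np.percentile also accepts a list of percentiles, so the three quartiles can be computed in one call. It is also worth knowing that np.std computes the population standard deviation by default (ddof=0); passing ddof=1 gives the sample standard deviation. A sketch using f-strings for formatting, with illustrative variable names:

```python
import numpy as np

arr3 = np.linspace(4, 200, 50)

# One call returns all three quartiles.
q1, median, q3 = np.percentile(arr3, [25, 50, 75])
print(f'Quartiles: {q1}, {median}, {q3}')

population_std = np.std(arr3)      # divides by n (default, ddof=0)
sample_std = np.std(arr3, ddof=1)  # divides by n - 1
print(f'Population std: {population_std:.4f}')  # 57.7235
print(f'Sample std: {sample_std:.4f}')
```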

Conclusions

While it may not have been obvious from the toy examples in this tutorial, Numpy is much faster than Python lists when we are dealing with huge, multi-dimensional arrays.

Applying arithmetic operations or functions to Numpy arrays is also much faster than manually going through a Python for loop to accomplish the same task.
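
To see the difference on your own machine, the sketch below times a Python loop against the equivalent array operation using the standard library's timeit module. The function names, array size, and repeat count are arbitrary choices made for illustration, and timings vary by machine, so no specific numbers are claimed here.

```python
import timeit
import numpy as np

# int64 avoids overflow when squaring on platforms with a 32-bit default.
data = np.arange(100_000, dtype=np.int64)

def square_with_loop(values):
    """Square each element with an explicit Python loop."""
    result = np.empty_like(values)
    for i in range(len(values)):
        result[i] = values[i] * values[i]
    return result

def square_vectorized(values):
    """Square each element with one array operation."""
    return values * values

# Both approaches produce the same result.
same = np.array_equal(square_with_loop(data), square_vectorized(data))
print(same)  # True

# Time 5 runs of each approach.
loop_time = timeit.timeit(lambda: square_with_loop(data), number=5)
vector_time = timeit.timeit(lambda: square_vectorized(data), number=5)
print(f'loop: {loop_time:.4f} s, vectorized: {vector_time:.4f} s')
```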