Lecture Notes 14#

Reading excel#

import pandas as pd
pd.read_csv('/home/python-lll/Downloads/survey-data - person.csv')
id personal family
0 dyer William Dyer
1 pb Frank Pabodie
2 lake Anderson Lake
3 roe Valentina Roerich
4 danforth Frank Danforth
pd.read_excel('/home/python-lll/Downloads/survey-data.xlsx')
id personal family
0 dyer William Dyer
1 pb Frank Pabodie
2 lake Anderson Lake
3 roe Valentina Roerich
4 danforth Frank Danforth
  • Use option sheet_name=None to get all tabs as a dictionary

pd.read_excel('/home/python-lll/Downloads/survey-data.xlsx', sheet_name=None)
{'person':          id   personal    family
 0      dyer    William      Dyer
 1        pb      Frank   Pabodie
 2      lake   Anderson      Lake
 3       roe  Valentina   Roerich
 4  danforth      Frank  Danforth,
 'site':     name    lat    long
 0   DR-1 -49.85 -128.57
 1   DR-3 -47.15 -126.72
 2  MSK-4 -48.87 -123.40,
 'survey':     taken person quant  reading
 0     619   dyer   rad     9.82
 1     619   dyer   sal     0.13
 2     622   dyer   rad     7.80
 3     622   dyer   sal     0.09
 4     734     pb   rad     8.41
 5     734   lake   sal     0.05
 6     734     pb  temp   -21.50
 7     735     pb   rad     7.22
 8     735    NaN   sal     0.06
 9     735    NaN  temp   -26.00
 10    751     pb   rad     4.35
 11    751     pb  temp   -18.50
 12    751   lake   sal     0.10
 13    752   lake   rad     2.19
 14    752   lake   sal     0.09
 15    752   lake  temp   -16.00
 16    752    roe   sal    41.60
 17    837   lake   rad     1.46
 18    837   lake   sal     0.21
 19    837    roe   sal    22.50
 20    844    roe   rad    11.25,
 'visited':     id   site      dated
 0  619   DR-1 1927-02-08
 1  622   DR-1 1927-02-10
 2  734   DR-3 1930-01-07
 3  735   DR-3 1930-01-12
 4  751   DR-3 1930-02-26
 5  752   DR-3        NaT
 6  837  MSK-4 1932-01-14
 7  844   DR-1 1932-03-22}
dfs = pd.read_excel('/home/python-lll/Downloads/survey-data.xlsx', sheet_name=None)
dfs['person']
id personal family
0 dyer William Dyer
1 pb Frank Pabodie
2 lake Anderson Lake
3 roe Valentina Roerich
4 danforth Frank Danforth

Reading google spreadsheets#

  • Two steps: download and read from local disk witn read_excel

  • One step: Modify url with to end with .../export?format=xlsx and pass to read_excel

dfs = pd.read_excel('https://docs.google.com/spreadsheets/d/1VVw6O5ncoc2R-bBvu1Xc0PqylA6tKW2pyh5bDHhE_r8/export?format=xlsx', sheet_name=None)
dfs['person']
id personal family
0 dyer William Dyer
1 pb Frank Pabodie
2 lake Anderson Lake
3 roe Valentina Roerich
4 danforth Frank Danforth
dfs.keys()
dict_keys(['person', 'site', 'survey', 'visited'])
person = dfs['person']
site = dfs['site']
survey = dfs['survey']
visited = dfs['visited']
person
id personal family
0 dyer William Dyer
1 pb Frank Pabodie
2 lake Anderson Lake
3 roe Valentina Roerich
4 danforth Frank Danforth

Common operations with dataframes#

Select columns#

person[['personal', 'family']]
personal family
0 William Dyer
1 Frank Pabodie
2 Anderson Lake
3 Valentina Roerich
4 Frank Danforth

Get unique values in a column#

survey # which are the unique quantities we measure?
taken person quant reading
0 619 dyer rad 9.82
1 619 dyer sal 0.13
2 622 dyer rad 7.80
3 622 dyer sal 0.09
4 734 pb rad 8.41
5 734 lake sal 0.05
6 734 pb temp -21.50
7 735 pb rad 7.22
8 735 NaN sal 0.06
9 735 NaN temp -26.00
10 751 pb rad 4.35
11 751 pb temp -18.50
12 751 lake sal 0.10
13 752 lake rad 2.19
14 752 lake sal 0.09
15 752 lake temp -16.00
16 752 roe sal 41.60
17 837 lake rad 1.46
18 837 lake sal 0.21
19 837 roe sal 22.50
20 844 roe rad 11.25
survey['quant']
0      rad
1      sal
2      rad
3      sal
4      rad
5      sal
6     temp
7      rad
8      sal
9     temp
10     rad
11    temp
12     sal
13     rad
14     sal
15    temp
16     sal
17     rad
18     sal
19     sal
20     rad
Name: quant, dtype: object
  • Using the built-in set

  • Using the pandas Series method unique

set(survey['quant'])
{'rad', 'sal', 'temp'}
survey['quant'].unique()
array(['rad', 'sal', 'temp'], dtype=object)
  • If we want to know how the values are distributed

survey['quant'].value_counts()
sal     9
rad     8
temp    4
Name: quant, dtype: int64

sorting by a column#

person
id personal family
0 dyer William Dyer
1 pb Frank Pabodie
2 lake Anderson Lake
3 roe Valentina Roerich
4 danforth Frank Danforth
person.sort_values('family')
id personal family
4 danforth Frank Danforth
0 dyer William Dyer
2 lake Anderson Lake
1 pb Frank Pabodie
3 roe Valentina Roerich

filtering with boolean Series#

visited
id site dated
0 619 DR-1 1927-02-08
1 622 DR-1 1927-02-10
2 734 DR-3 1930-01-07
3 735 DR-3 1930-01-12
4 751 DR-3 1930-02-26
5 752 DR-3 NaT
6 837 MSK-4 1932-01-14
7 844 DR-1 1932-03-22
at_dr_1 = visited['site'] == 'DR-1'
at_dr_1
0     True
1     True
2    False
3    False
4    False
5    False
6    False
7     True
Name: site, dtype: bool
visited[at_dr_1]
id site dated
0 619 DR-1 1927-02-08
1 622 DR-1 1927-02-10
7 844 DR-1 1932-03-22
  • Combining with logical operators

    • | for OR

    • & for AND

at_dr_1_or_3 = (visited['site'] == 'DR-1') | (visited['site'] == 'DR-3')
at_dr_1_or_3
0     True
1     True
2     True
3     True
4     True
5     True
6    False
7     True
Name: site, dtype: bool
visited[at_dr_1_or_3]
id site dated
0 619 DR-1 1927-02-08
1 622 DR-1 1927-02-10
2 734 DR-3 1930-01-07
3 735 DR-3 1930-01-12
4 751 DR-3 1930-02-26
5 752 DR-3 NaT
7 844 DR-1 1932-03-22
'DR-3'.startswith('DR')
True
at_dr = visited['site'].str.startswith('DR')
at_dr
0     True
1     True
2     True
3     True
4     True
5     True
6    False
7     True
Name: site, dtype: bool
visited[at_dr]
id site dated
0 619 DR-1 1927-02-08
1 622 DR-1 1927-02-10
2 734 DR-3 1930-01-07
3 735 DR-3 1930-01-12
4 751 DR-3 1930-02-26
5 752 DR-3 NaT
7 844 DR-1 1932-03-22

select all record from survey with salinity values outsited [0, 1]#

survey
taken person quant reading
0 619 dyer rad 9.82
1 619 dyer sal 0.13
2 622 dyer rad 7.80
3 622 dyer sal 0.09
4 734 pb rad 8.41
5 734 lake sal 0.05
6 734 pb temp -21.50
7 735 pb rad 7.22
8 735 NaN sal 0.06
9 735 NaN temp -26.00
10 751 pb rad 4.35
11 751 pb temp -18.50
12 751 lake sal 0.10
13 752 lake rad 2.19
14 752 lake sal 0.09
15 752 lake temp -16.00
16 752 roe sal 41.60
17 837 lake rad 1.46
18 837 lake sal 0.21
19 837 roe sal 22.50
20 844 roe rad 11.25
sal = survey['quant'] == 'sal'
sal
0     False
1      True
2     False
3      True
4     False
5      True
6     False
7     False
8      True
9     False
10    False
11    False
12     True
13    False
14     True
15    False
16     True
17    False
18     True
19     True
20    False
Name: quant, dtype: bool
out_of_range = (survey['reading'] > 1) | (survey['reading'] < 0) # | is a logical OR operation
out_of_range
0      True
1     False
2      True
3     False
4      True
5     False
6      True
7      True
8     False
9      True
10     True
11     True
12    False
13     True
14    False
15     True
16     True
17     True
18    False
19     True
20     True
Name: reading, dtype: bool
sal & out_of_range  # & is a logical AND operation
0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16     True
17    False
18    False
19     True
20    False
dtype: bool
survey[sal & out_of_range]
taken person quant reading
16 752 roe sal 41.6
19 837 roe sal 22.5

Add new columns#

Provide temp readings in Fahrenheit#

fahrenheit = survey[survey['quant'] == 'temp']['reading']*9/5 + 32
fahrenheit
6     -6.7
9    -14.8
11    -1.3
15     3.2
Name: reading, dtype: float64
survey['fahrenheit'] = fahrenheit
survey
taken person quant reading fahrenheit
0 619 dyer rad 9.82 NaN
1 619 dyer sal 0.13 NaN
2 622 dyer rad 7.80 NaN
3 622 dyer sal 0.09 NaN
4 734 pb rad 8.41 NaN
5 734 lake sal 0.05 NaN
6 734 pb temp -21.50 -6.7
7 735 pb rad 7.22 NaN
8 735 NaN sal 0.06 NaN
9 735 NaN temp -26.00 -14.8
10 751 pb rad 4.35 NaN
11 751 pb temp -18.50 -1.3
12 751 lake sal 0.10 NaN
13 752 lake rad 2.19 NaN
14 752 lake sal 0.09 NaN
15 752 lake temp -16.00 3.2
16 752 roe sal 41.60 NaN
17 837 lake rad 1.46 NaN
18 837 lake sal 0.21 NaN
19 837 roe sal 22.50 NaN
20 844 roe rad 11.25 NaN

Missing data#

visited  # find missing data
id site dated
0 619 DR-1 1927-02-08
1 622 DR-1 1927-02-10
2 734 DR-3 1930-01-07
3 735 DR-3 1930-01-12
4 751 DR-3 1930-02-26
5 752 DR-3 NaT
6 837 MSK-4 1932-01-14
7 844 DR-1 1932-03-22
missing_date = visited['dated'].isna() # is not valid 
missing_date
0    False
1    False
2    False
3    False
4    False
5     True
6    False
7    False
Name: dated, dtype: bool
visited[missing_date]
id site dated
5 752 DR-3 NaT

what is the most recent reading#

visited
id site dated
0 619 DR-1 1927-02-08
1 622 DR-1 1927-02-10
2 734 DR-3 1930-01-07
3 735 DR-3 1930-01-12
4 751 DR-3 1930-02-26
5 752 DR-3 NaT
6 837 MSK-4 1932-01-14
7 844 DR-1 1932-03-22
visited.sort_values('dated', ascending=False)
id site dated
7 844 DR-1 1932-03-22
6 837 MSK-4 1932-01-14
4 751 DR-3 1930-02-26
3 735 DR-3 1930-01-12
2 734 DR-3 1930-01-07
1 622 DR-1 1927-02-10
0 619 DR-1 1927-02-08
5 752 DR-3 NaT
visited.sort_values('dated', ascending=False).head(1)
id site dated
7 844 DR-1 1932-03-22

operation with grouping#

what is the average measurement for each person and quantity#

survey
taken person quant reading fahrenheit
0 619 dyer rad 9.82 NaN
1 619 dyer sal 0.13 NaN
2 622 dyer rad 7.80 NaN
3 622 dyer sal 0.09 NaN
4 734 pb rad 8.41 NaN
5 734 lake sal 0.05 NaN
6 734 pb temp -21.50 -6.7
7 735 pb rad 7.22 NaN
8 735 NaN sal 0.06 NaN
9 735 NaN temp -26.00 -14.8
10 751 pb rad 4.35 NaN
11 751 pb temp -18.50 -1.3
12 751 lake sal 0.10 NaN
13 752 lake rad 2.19 NaN
14 752 lake sal 0.09 NaN
15 752 lake temp -16.00 3.2
16 752 roe sal 41.60 NaN
17 837 lake rad 1.46 NaN
18 837 lake sal 0.21 NaN
19 837 roe sal 22.50 NaN
20 844 roe rad 11.25 NaN
survey.groupby('quant')['reading'].mean().round(2)
quant
rad      6.56
sal      7.20
temp   -20.50
Name: reading, dtype: float64
survey.groupby(['person', 'quant'])['reading'].mean().round(2)
person  quant
dyer    rad       8.81
        sal       0.11
lake    rad       1.82
        sal       0.11
        temp    -16.00
pb      rad       6.66
        temp    -20.00
roe     rad      11.25
        sal      32.05
Name: reading, dtype: float64

plotting reviewed#

with arrays#

import numpy as np
import matplotlib.pyplot as plt
x = np.arange(0, 1, .1)
y1 = x**2
y2 = x**3
plt.plot(x, y1, label='square')
plt.plot(x, y2, label='cube')
plt.legend()
<matplotlib.legend.Legend at 0x7f54d11f7b50>
_images/5dd41087b17959826ce12380bbaffab349f925816f44ca37d39dbe5e1aca25e4.png

with dataframes#

pd.DataFrame({'square': y1, 'cube': y2}, index=x).plot()
<Axes: >
_images/5dd41087b17959826ce12380bbaffab349f925816f44ca37d39dbe5e1aca25e4.png