Select

library(reticulate)
py_install("pandas")
py_install("numpy")
py_install("matplotlib")
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Read CSV

Let’s read a csv file so we can have some data to use for demonstration. We will contend with Pandas DataFrame and Series structures in this document.

Series

s = pd.Series(np.random.randn(5), index = ["a","b","c","d","e"])
s
a    1.110744
b   -0.978069
c    2.241242
d   -3.491906
e   -0.331338
dtype: float64

Series Attributes

Pandas Series come with various attributes and methods to help you manipulate and analyze data effectively. Here are a few essential ones:

  • values: Returns the Series data as a NumPy array.
  • index: Returns the index (labels) of the Series.
  • shape: Returns a tuple representing the dimensions of the Series.
  • size: Returns the number of elements in the Series.
  • mean(), sum(), min(), max(): Calculate summary statistics of the data.
  • unique(), nunique(): Get unique values or the number of unique values.
  • sort_values(), sort_index(): Sort the Series by values or index labels.
  • isnull(), notnull(): Check for missing (NaN) or non-missing values.
  • apply(): Apply a custom function to each element of the Series.

DataFrame


  • First we read the csv using pd.read_csv()
# Read csv file
df = pd.read_csv('D:/yourdataiq/main/datasets/avgpm25.csv')

Preview

df.head(5)
        pm25  fips region  longitude   latitude
0   9.771185  1003   east -87.748260  30.592781
1   9.993817  1027   east -85.842858  33.265810
2  10.688618  1033   east -87.725960  34.731480
3  11.337424  1049   east -85.798919  34.459133
4  12.119764  1055   east -86.032125  34.018597

If we wanted to view parts of the df we can target specific rows and columns using:

  • getitem = by item
  • iloc = by label
  • loc = by location

getitem []


One Column

  • Passing a single label results in that column represented as a series

Note: cannot be used to fetch by position. This will cause an error: df[2]

df["region"]
0      east
1      east
2      east
3      east
4      east
       ... 
571    west
572    west
573    west
574    west
575    west
Name: region, Length: 576, dtype: object

Multiple Columns

df[["pm25", "fips"]]
          pm25   fips
0     9.771185   1003
1     9.993817   1027
2    10.688618   1033
3    11.337424   1049
4    12.119764   1055
..         ...    ...
571   4.955570  56029
572   6.549239  56033
573   5.632587  56035
574   6.349710  56037
575   4.565808  56039

[576 rows x 2 columns]

Slice of Rows

  • It can be used to extract a slice of the first N rows
  • Or a slice from row X to row Y df[x:y]
df[0:2]
       pm25  fips region  longitude   latitude
0  9.771185  1003   east -87.748260  30.592781
1  9.993817  1027   east -85.842858  33.265810

Range of Rows

  • We can extract the slice UP TO row Number
# Extract UP TO row 2
df[:2]
       pm25  fips region  longitude   latitude
0  9.771185  1003   east -87.748260  30.592781
1  9.993817  1027   east -85.842858  33.265810
  • Extract from row number to end
# Extract from row 574 to end
df[574:]
         pm25   fips region   longitude   latitude
574  6.349710  56037   west -109.168198  41.640708
575  4.565808  56039   west -110.673534  43.612124

loc by Label


.loc or .at can be used to target labeled rows or columns

Row X

  • To access a specific row by label
# to access row 3 the output is a list
df.loc[2]
pm25         10.688618
fips              1033
region            east
longitude    -87.72596
latitude      34.73148
Name: 2, dtype: object
  • if we want the output to be a df place above code inside df[]

NOTE: the value above is a series, just one row. If we wanted the answer to be a df then enclose the argument in [[]]. If we are extracting/slicing more than one row or column the answer will be returned as a df

df.loc[[2]]
        pm25  fips region  longitude  latitude
2  10.688618  1033   east  -87.72596  34.73148

X Rows X Columns

df.loc[[0,1,2], ["pm25","region"]]
        pm25 region
0   9.771185   east
1   9.993817   east
2  10.688618   east

All Rows 1 Column

# All rows for one column
df.loc[:,["pm25"]]
          pm25
0     9.771185
1     9.993817
2    10.688618
3    11.337424
4    12.119764
..         ...
571   4.955570
572   6.549239
573   5.632587
574   6.349710
575   4.565808

[576 rows x 1 columns]

All Rows X Columns

df.loc[:,["pm25","region"]]
          pm25 region
0     9.771185   east
1     9.993817   east
2    10.688618   east
3    11.337424   east
4    12.119764   east
..         ...    ...
571   4.955570   west
572   6.549239   west
573   5.632587   west
574   6.349710   west
575   4.565808   west

[576 rows x 2 columns]

Slice Rows X Columns

df.loc[550:553,["pm25","region"]]
          pm25 region
550  12.840365   east
551   5.715239   east
552  11.366288   east
553  11.098744   east

1 Cell

df.loc[550,["pm25"]]
pm25    12.840365
Name: 550, dtype: object
  • To get a scalar from above just omit the [] in the column argument
# results in a scalar
df.loc[550,"pm25"]
np.float64(12.8403648688476)

at by Label


Retrieve Value

Similar to .loc in that both provide label-based lookups. But at is used to extract or set a single value in DF or Series

df.at[550,'pm25']
np.float64(12.8403648688476)

Set Value

  • You can set a value at a specific cell using at as well just assign a value to it like this
df.at[550,'pm25'] = 1094
# check accuracy with
df.at[550,'pm25']
1094

iloc by Position


.iloc[] is an indexer used for integer-location-based indexing of data in a DataFrame. It allows users to select specific rows and columns by providing integer indices

  • Syntax: df.iloc(row index position, column index position)
  • iloc has been deprecated
  • Remember it’s (row,colum)
  • allowed input:
    • integer
    • list or array of integers: [4,12,0]
    • slice with integers: 1:7
    • boolean array
    • a callable
    • a tuple of row and column indexes

1 Row All Columns

# First row all columns
df.iloc[0]
pm25          9.771185
fips              1003
region            east
longitude    -87.74826
latitude     30.592781
Name: 0, dtype: object

NOTE: the value above is a series, just one row. If we wanted the answer to be a df then enclose the argument in [[]]. If we are extracting/slicing more than one row or column the answer will be returned as a df

  • Here is how to retrieve the above series as a dataframe
# return as a df
a = df.iloc[[0]]
a
       pm25  fips region  longitude   latitude
0  9.771185  1003   east  -87.74826  30.592781
type(a)
<class 'pandas.core.frame.DataFrame'>

X Rows 1 Column

df.iloc[[0,1,2], 0]
0     9.771185
1     9.993817
2    10.688618
Name: pm25, dtype: float64

All Rows 1 Column

# All rows one column
df.iloc[:,2]
0      east
1      east
2      east
3      east
4      east
       ... 
571    west
572    west
573    west
574    west
575    west
Name: region, Length: 576, dtype: object

X Rows All Columns

# Slice of rows, all columns
df.iloc[0:2,:]
       pm25  fips region  longitude   latitude
0  9.771185  1003   east -87.748260  30.592781
1  9.993817  1027   east -85.842858  33.265810

X Rows X Columns

# Slice of rows, slice of columns
df.iloc[0:2,0:3]
       pm25  fips region
0  9.771185  1003   east
1  9.993817  1027   east

1 Cell

# One cell in 1 row 1 column
df.iloc[0,0]
np.float64(9.77118522614686)

iat by Position


Similar to at but iat targets the df by position and returns a specific value as a scalar directly

df.iat[0,0]
np.float64(9.77118522614686)

Filter


Boolean

Rows = True

  • Select rows that meet a certain boolean condition
  • Let’s say we only want rows where pm25 < 11
  • Note that it will include all columns in the df, and we end up with 389 rows
  • pm25 <11 will create a list of rows with True or False values, the row where pm25<11 would be True and the others will be False
  • In order to filter out the True rows we enclose the list within df[] and get df1
  • Now we have a new df1 with just the values that meet the condition
# Select rows that meet the boolean condition pm25 < 11
df1= df[df["pm25"]<11]
df1
          pm25   fips region   longitude   latitude
0     9.771185   1003   east  -87.748260  30.592781
1     9.993817   1027   east  -85.842858  33.265810
2    10.688618   1033   east  -87.725960  34.731480
5    10.827805   1069   east  -85.350387  31.189731
8     9.414423   1097   east  -88.139667  30.722256
..         ...    ...    ...         ...        ...
571   4.955570  56029   west -108.999016  44.568961
572   6.549239  56033   west -106.968130  44.783629
573   5.632587  56035   west -109.991147  42.750949
574   6.349710  56037   west -109.168198  41.640708
575   4.565808  56039   west -110.673534  43.612124

[389 rows x 5 columns]

DataFrame = True

  • Be careful using this command because it could involve a large volume of data or on the other end of the spectrum a very small number of data depending on your df
  • What if we want to search the entire df for a certain value and we don’t know which column or row it might be in
df[df > 11]

# Here we get an error
TypeError: '>' not supported between instances of 'str' and 'int'
You can see why, as we discovered earlier when we previewed the dtypes that some columns are integers while some are strings, and we can't compare str to int

NaN

  • Continuing with the section prior how about if we search for all cells not equal NaN
  • As you can see we get 576 rows the original number and 5 columns as well so we have to assume that all values are NOT NaN
# Let's extract all cells that are not NaN
df[df != "NaN"]
          pm25   fips region   longitude   latitude
0     9.771185   1003   east  -87.748260  30.592781
1     9.993817   1027   east  -85.842858  33.265810
2    10.688618   1033   east  -87.725960  34.731480
3    11.337424   1049   east  -85.798919  34.459133
4    12.119764   1055   east  -86.032125  34.018597
..         ...    ...    ...         ...        ...
571   4.955570  56029   west -108.999016  44.568961
572   6.549239  56033   west -106.968130  44.783629
573   5.632587  56035   west -109.991147  42.750949
574   6.349710  56037   west -109.168198  41.640708
575   4.565808  56039   west -110.673534  43.612124

[576 rows x 5 columns]
# How about all cells  in pm25 that are NaN
df[df["pm25"] == "NaN" ]
Empty DataFrame
Columns: [pm25, fips, region, longitude, latitude]
Index: []

Equal

  • of course we can always test the equality of a value in a column
# Let's say we want to filter out all rows where "region" = west. Of course we can do it this way
df[df["region"]== "west"]  
          pm25   fips region   longitude   latitude
15    6.058860   2020   west -149.762097  61.191900
16   11.101467   2090   west -147.568384  64.818590
17    7.308113   2110   west -134.511579  58.351422
18    7.147626   2170   west -149.481089  61.762742
19    6.929844   4003   west -109.904319  31.750272
..         ...    ...    ...         ...        ...
571   4.955570  56029   west -108.999016  44.568961
572   6.549239  56033   west -106.968130  44.783629
573   5.632587  56035   west -109.991147  42.750949
574   6.349710  56037   west -109.168198  41.640708
575   4.565808  56039   west -110.673534  43.612124

[134 rows x 5 columns]

ISIN

  • isin is a very useful boolean condition, and it does what it sounds like
  • If we want to filter a column for specific row values we can use either df[“columname”] = variable or we can use isin
# or we can do it using isin
df[df["region"].isin(["west"])]
          pm25   fips region   longitude   latitude
15    6.058860   2020   west -149.762097  61.191900
16   11.101467   2090   west -147.568384  64.818590
17    7.308113   2110   west -134.511579  58.351422
18    7.147626   2170   west -149.481089  61.762742
19    6.929844   4003   west -109.904319  31.750272
..         ...    ...    ...         ...        ...
571   4.955570  56029   west -108.999016  44.568961
572   6.549239  56033   west -106.968130  44.783629
573   5.632587  56035   west -109.991147  42.750949
574   6.349710  56037   west -109.168198  41.640708
575   4.565808  56039   west -110.673534  43.612124

[134 rows x 5 columns]
  • Why use isin?
  • Because we can have an entire list of values, it’s like an OR statement built-in
  • Let’s say we want to know the rows that have a values of “west” or “south”
  • As it appears from the output, that none of the rows have a region == “south”
df[df["region"].isin(["west","south"])]
          pm25   fips region   longitude   latitude
15    6.058860   2020   west -149.762097  61.191900
16   11.101467   2090   west -147.568384  64.818590
17    7.308113   2110   west -134.511579  58.351422
18    7.147626   2170   west -149.481089  61.762742
19    6.929844   4003   west -109.904319  31.750272
..         ...    ...    ...         ...        ...
571   4.955570  56029   west -108.999016  44.568961
572   6.549239  56033   west -106.968130  44.783629
573   5.632587  56035   west -109.991147  42.750949
574   6.349710  56037   west -109.168198  41.640708
575   4.565808  56039   west -110.673534  43.612124

[134 rows x 5 columns]

Unique

  • We can find all the unique elements in a column
  • Let’s say we wanted to know how many regions are in the df
  • That would be the 3rd column or the column labeled “region”
  • Syntax: df.colname.unique()
df["region"].unique()
array(['east', 'west'], dtype=object)

NUnique

  • This is a derivative of .unique() as it gives us the number or unique values in a column
df["region"].nunique()
2

Columns


See Pandas intro for more column manipulation.

Columnnames

  • We can target the column index directly just as we did when we listed all the column names/index earlier with
# Print column names/index
df.columns
Index(['pm25', 'fips', 'region', 'longitude', 'latitude'], dtype='object')

Columname x To End

# print 3rd columnname to end
df.columns[2:]
Index(['region', 'longitude', 'latitude'], dtype='object')

1 Column

# List first columnname
df.columns[0]
'pm25'

1 Column All Rows

  • In the code above we extracted one columname
  • If we want to extract the values for each row in that column we just factor it with the DataFrame, in this case df
# All rows for 1 column, all we do is enclose the code above inside df[]
df[df.columns[0]]
0       9.771185
1       9.993817
2      10.688618
3      11.337424
4      12.119764
         ...    
571     4.955570
572     6.549239
573     5.632587
574     6.349710
575     4.565808
Name: pm25, Length: 576, dtype: float64

Series


getitem []

1 Cell

s["b"]
np.float64(-0.9780693521984415)

Slice of Cells

s["a":"d"]
a    1.110744
b   -0.978069
c    2.241242
d   -3.491906
dtype: float64

iloc by Position

s
a    1.110744
b   -0.978069
c    2.241242
d   -3.491906
e   -0.331338
dtype: float64

1 Cell

s.iloc[1]
np.float64(-0.9780693521984415)

First X Cells

s.iloc[:3]
a    1.110744
b   -0.978069
c    2.241242
dtype: float64

From X to End

s.iloc[3:]
d   -3.491906
e   -0.331338
dtype: float64

Slice of Cells

s.iloc[0:2]
a    1.110744
b   -0.978069
dtype: float64

Calculate

Calculations

  • Let’s create a series
  • Then extract the unique values from the series
  • When we don’t supply an index one is created for us
d = pd.Series([2,1,3,3,4,1,5,6,2])
d
0    2
1    1
2    3
3    3
4    4
5    1
6    5
7    6
8    2
dtype: int64
# Let's multiply first element=1 by 5
d[1]*5
np.int64(5)
# Let's multiply first element=1 by the fourth=4
d[1]*d[4]
np.int64(4)

> median

s[s > s.median()]
a    1.110744
c    2.241242
dtype: float64

median

s.median()
np.float64(-0.3313383669873633)

Unique

  • Let’s find the unique values in the series d
d.unique()
array([2, 1, 3, 4, 5, 6])
  • We can also use pandas unique method
pd.unique(d)
array([2, 1, 3, 4, 5, 6])