library(reticulate)
py_install("pandas")
py_install("numpy")
py_install("matplotlib")
Select
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Read CSV
Let’s read a csv file so we can have some data to use for demonstration. We will contend with Pandas DataFrame and Series structures in this document.
Series
= pd.Series(np.random.randn(5), index = ["a","b","c","d","e"])
s s
a 1.110744
b -0.978069
c 2.241242
d -3.491906
e -0.331338
dtype: float64
Series Attributes
Pandas Series come with various attributes and methods to help you manipulate and analyze data effectively. Here are a few essential ones:
- values: Returns the Series data as a NumPy array.
- index: Returns the index (labels) of the Series.
- shape: Returns a tuple representing the dimensions of the Series.
- size: Returns the number of elements in the Series.
- mean(), sum(), min(), max(): Calculate summary statistics of the data.
- unique(), nunique(): Get unique values or the number of unique values.
- sort_values(), sort_index(): Sort the Series by values or index labels.
- isnull(), notnull(): Check for missing (NaN) or non-missing values.
- apply(): Apply a custom function to each element of the Series.
DataFrame
- First we read the csv using
pd.read_csv()
# Read csv file
= pd.read_csv('D:/yourdataiq/main/datasets/avgpm25.csv') df
Preview
5) df.head(
pm25 fips region longitude latitude
0 9.771185 1003 east -87.748260 30.592781
1 9.993817 1027 east -85.842858 33.265810
2 10.688618 1033 east -87.725960 34.731480
3 11.337424 1049 east -85.798919 34.459133
4 12.119764 1055 east -86.032125 34.018597
If we wanted to view parts of the df we can target specific rows and columns using:
- getitem = by item
- iloc = by label
- loc = by location
getitem []
One Column
- Passing a single label results in that column represented as a series
Note: cannot be used to fetch by position. This will cause an error: df[2]
"region"] df[
0 east
1 east
2 east
3 east
4 east
...
571 west
572 west
573 west
574 west
575 west
Name: region, Length: 576, dtype: object
Multiple Columns
"pm25", "fips"]] df[[
pm25 fips
0 9.771185 1003
1 9.993817 1027
2 10.688618 1033
3 11.337424 1049
4 12.119764 1055
.. ... ...
571 4.955570 56029
572 6.549239 56033
573 5.632587 56035
574 6.349710 56037
575 4.565808 56039
[576 rows x 2 columns]
Slice of Rows
- It can be used to extract a slice of the first N rows
- Or a slice from row X to row Y df[x:y]
0:2] df[
pm25 fips region longitude latitude
0 9.771185 1003 east -87.748260 30.592781
1 9.993817 1027 east -85.842858 33.265810
Range of Rows
- We can extract the slice UP TO row Number
# Extract UP TO row 2
2] df[:
pm25 fips region longitude latitude
0 9.771185 1003 east -87.748260 30.592781
1 9.993817 1027 east -85.842858 33.265810
- Extract from row number to end
# Extract from row 574 to end
574:] df[
pm25 fips region longitude latitude
574 6.349710 56037 west -109.168198 41.640708
575 4.565808 56039 west -110.673534 43.612124
loc by Label
.loc or .at can be used to target labeled rows or columns
Row X
- To access a specific row by label
# to access row 3 the output is a list
2] df.loc[
pm25 10.688618
fips 1033
region east
longitude -87.72596
latitude 34.73148
Name: 2, dtype: object
- if we want the output to be a df place above code inside df[]
NOTE: the value above is a series, just one row. If we wanted the answer to be a df then enclose the argument in [[]]. If we are extracting/slicing more than one row or column the answer will be returned as a df
2]] df.loc[[
pm25 fips region longitude latitude
2 10.688618 1033 east -87.72596 34.73148
X Rows X Columns
0,1,2], ["pm25","region"]] df.loc[[
pm25 region
0 9.771185 east
1 9.993817 east
2 10.688618 east
All Rows 1 Column
# All rows for one column
"pm25"]] df.loc[:,[
pm25
0 9.771185
1 9.993817
2 10.688618
3 11.337424
4 12.119764
.. ...
571 4.955570
572 6.549239
573 5.632587
574 6.349710
575 4.565808
[576 rows x 1 columns]
All Rows X Columns
"pm25","region"]] df.loc[:,[
pm25 region
0 9.771185 east
1 9.993817 east
2 10.688618 east
3 11.337424 east
4 12.119764 east
.. ... ...
571 4.955570 west
572 6.549239 west
573 5.632587 west
574 6.349710 west
575 4.565808 west
[576 rows x 2 columns]
Slice Rows X Columns
550:553,["pm25","region"]] df.loc[
pm25 region
550 12.840365 east
551 5.715239 east
552 11.366288 east
553 11.098744 east
1 Cell
550,["pm25"]] df.loc[
pm25 12.840365
Name: 550, dtype: object
- To get a scalar from above just omit the [] in the column argument
# results in a scalar
550,"pm25"] df.loc[
np.float64(12.8403648688476)
at by Label
Retrieve Value
Similar to .loc in that both provide label-based lookups. But at is used to extract or set a single value in DF or Series
550,'pm25'] df.at[
np.float64(12.8403648688476)
Set Value
- You can set a value at a specific cell using at as well just assign a value to it like this
550,'pm25'] = 1094
df.at[# check accuracy with
550,'pm25']
df.at[1094
iloc by Position
.iloc[] is an indexer used for integer-location-based indexing of data in a DataFrame. It allows users to select specific rows and columns by providing integer indices
- Syntax:
df.iloc(row index position, column index position)
- iloc has been deprecated
- Remember it’s (row,colum)
- allowed input:
- integer
- list or array of integers: [4,12,0]
- slice with integers: 1:7
- boolean array
- a callable
- a tuple of row and column indexes
1 Row All Columns
# First row all columns
0] df.iloc[
pm25 9.771185
fips 1003
region east
longitude -87.74826
latitude 30.592781
Name: 0, dtype: object
NOTE: the value above is a series, just one row. If we wanted the answer to be a df then enclose the argument in [[]]. If we are extracting/slicing more than one row or column the answer will be returned as a df
- Here is how to retrieve the above series as a dataframe
# return as a df
= df.iloc[[0]]
a a
pm25 fips region longitude latitude
0 9.771185 1003 east -87.74826 30.592781
type(a)
<class 'pandas.core.frame.DataFrame'>
X Rows 1 Column
0,1,2], 0] df.iloc[[
0 9.771185
1 9.993817
2 10.688618
Name: pm25, dtype: float64
All Rows 1 Column
# All rows one column
2] df.iloc[:,
0 east
1 east
2 east
3 east
4 east
...
571 west
572 west
573 west
574 west
575 west
Name: region, Length: 576, dtype: object
X Rows All Columns
# Slice of rows, all columns
0:2,:] df.iloc[
pm25 fips region longitude latitude
0 9.771185 1003 east -87.748260 30.592781
1 9.993817 1027 east -85.842858 33.265810
X Rows X Columns
# Slice of rows, slice of columns
0:2,0:3] df.iloc[
pm25 fips region
0 9.771185 1003 east
1 9.993817 1027 east
1 Cell
# One cell in 1 row 1 column
0,0] df.iloc[
np.float64(9.77118522614686)
iat by Position
Similar to at but iat targets the df by position and returns a specific value as a scalar directly
0,0] df.iat[
np.float64(9.77118522614686)
Filter
Boolean
Rows = True
- Select rows that meet a certain boolean condition
- Let’s say we only want rows where pm25 < 11
- Note that it will include all columns in the df, and we end up with 389 rows
- pm25 <11 will create a list of rows with True or False values, the row where pm25<11 would be True and the others will be False
- In order to filter out the True rows we enclose the list within df[] and get df1
- Now we have a new df1 with just the values that meet the condition
# Select rows that meet the boolean condition pm25 < 11
= df[df["pm25"]<11]
df1 df1
pm25 fips region longitude latitude
0 9.771185 1003 east -87.748260 30.592781
1 9.993817 1027 east -85.842858 33.265810
2 10.688618 1033 east -87.725960 34.731480
5 10.827805 1069 east -85.350387 31.189731
8 9.414423 1097 east -88.139667 30.722256
.. ... ... ... ... ...
571 4.955570 56029 west -108.999016 44.568961
572 6.549239 56033 west -106.968130 44.783629
573 5.632587 56035 west -109.991147 42.750949
574 6.349710 56037 west -109.168198 41.640708
575 4.565808 56039 west -110.673534 43.612124
[389 rows x 5 columns]
DataFrame = True
- Be careful using this command because it could involve a large volume of data or on the other end of the spectrum a very small number of data depending on your df
- What if we want to search the entire df for a certain value and we don’t know which column or row it might be in
> 11]
df[df
# Here we get an error
TypeError: '>' not supported between instances of 'str' and 'int'
as we discovered earlier when we previewed the dtypes that some columns are integers while some are strings, and we can't compare str to int You can see why,
NaN
- Continuing with the section prior how about if we search for all cells not equal NaN
- As you can see we get 576 rows the original number and 5 columns as well so we have to assume that all values are NOT NaN
# Let's extract all cells that are not NaN
!= "NaN"] df[df
pm25 fips region longitude latitude
0 9.771185 1003 east -87.748260 30.592781
1 9.993817 1027 east -85.842858 33.265810
2 10.688618 1033 east -87.725960 34.731480
3 11.337424 1049 east -85.798919 34.459133
4 12.119764 1055 east -86.032125 34.018597
.. ... ... ... ... ...
571 4.955570 56029 west -108.999016 44.568961
572 6.549239 56033 west -106.968130 44.783629
573 5.632587 56035 west -109.991147 42.750949
574 6.349710 56037 west -109.168198 41.640708
575 4.565808 56039 west -110.673534 43.612124
[576 rows x 5 columns]
# How about all cells in pm25 that are NaN
"pm25"] == "NaN" ] df[df[
Empty DataFrame
Columns: [pm25, fips, region, longitude, latitude]
Index: []
Equal
- of course we can always test the equality of a value in a column
# Let's say we want to filter out all rows where "region" = west. Of course we can do it this way
"region"]== "west"] df[df[
pm25 fips region longitude latitude
15 6.058860 2020 west -149.762097 61.191900
16 11.101467 2090 west -147.568384 64.818590
17 7.308113 2110 west -134.511579 58.351422
18 7.147626 2170 west -149.481089 61.762742
19 6.929844 4003 west -109.904319 31.750272
.. ... ... ... ... ...
571 4.955570 56029 west -108.999016 44.568961
572 6.549239 56033 west -106.968130 44.783629
573 5.632587 56035 west -109.991147 42.750949
574 6.349710 56037 west -109.168198 41.640708
575 4.565808 56039 west -110.673534 43.612124
[134 rows x 5 columns]
ISIN
- isin is a very useful boolean condition, and it does what it sounds like
- If we want to filter a column for specific row values we can use either df[“columname”] = variable or we can use isin
# or we can do it using isin
"region"].isin(["west"])] df[df[
pm25 fips region longitude latitude
15 6.058860 2020 west -149.762097 61.191900
16 11.101467 2090 west -147.568384 64.818590
17 7.308113 2110 west -134.511579 58.351422
18 7.147626 2170 west -149.481089 61.762742
19 6.929844 4003 west -109.904319 31.750272
.. ... ... ... ... ...
571 4.955570 56029 west -108.999016 44.568961
572 6.549239 56033 west -106.968130 44.783629
573 5.632587 56035 west -109.991147 42.750949
574 6.349710 56037 west -109.168198 41.640708
575 4.565808 56039 west -110.673534 43.612124
[134 rows x 5 columns]
- Why use isin?
- Because we can have an entire list of values, it’s like an OR statement built-in
- Let’s say we want to know the rows that have a values of “west” or “south”
- As it appears from the output, that none of the rows have a region == “south”
"region"].isin(["west","south"])] df[df[
pm25 fips region longitude latitude
15 6.058860 2020 west -149.762097 61.191900
16 11.101467 2090 west -147.568384 64.818590
17 7.308113 2110 west -134.511579 58.351422
18 7.147626 2170 west -149.481089 61.762742
19 6.929844 4003 west -109.904319 31.750272
.. ... ... ... ... ...
571 4.955570 56029 west -108.999016 44.568961
572 6.549239 56033 west -106.968130 44.783629
573 5.632587 56035 west -109.991147 42.750949
574 6.349710 56037 west -109.168198 41.640708
575 4.565808 56039 west -110.673534 43.612124
[134 rows x 5 columns]
Unique
- We can find all the unique elements in a column
- Let’s say we wanted to know how many regions are in the df
- That would be the 3rd column or the column labeled “region”
- Syntax:
df.colname.unique()
"region"].unique() df[
array(['east', 'west'], dtype=object)
NUnique
- This is a derivative of .unique() as it gives us the number or unique values in a column
"region"].nunique() df[
2
Columns
See Pandas intro for more column manipulation.
Columnnames
- We can target the column index directly just as we did when we listed all the column names/index earlier with
# Print column names/index
df.columns
Index(['pm25', 'fips', 'region', 'longitude', 'latitude'], dtype='object')
Columname x To End
# print 3rd columnname to end
2:] df.columns[
Index(['region', 'longitude', 'latitude'], dtype='object')
1 Column
# List first columnname
0] df.columns[
'pm25'
1 Column All Rows
- In the code above we extracted one columname
- If we want to extract the values for each row in that column we just factor it with the DataFrame, in this case df
# All rows for 1 column, all we do is enclose the code above inside df[]
0]] df[df.columns[
0 9.771185
1 9.993817
2 10.688618
3 11.337424
4 12.119764
...
571 4.955570
572 6.549239
573 5.632587
574 6.349710
575 4.565808
Name: pm25, Length: 576, dtype: float64
Series
getitem []
1 Cell
"b"] s[
np.float64(-0.9780693521984415)
Slice of Cells
"a":"d"] s[
a 1.110744
b -0.978069
c 2.241242
d -3.491906
dtype: float64
iloc by Position
s
a 1.110744
b -0.978069
c 2.241242
d -3.491906
e -0.331338
dtype: float64
1 Cell
1] s.iloc[
np.float64(-0.9780693521984415)
First X Cells
3] s.iloc[:
a 1.110744
b -0.978069
c 2.241242
dtype: float64
From X to End
3:] s.iloc[
d -3.491906
e -0.331338
dtype: float64
Slice of Cells
0:2] s.iloc[
a 1.110744
b -0.978069
dtype: float64
Calculate
Calculations
- Let’s create a series
- Then extract the unique values from the series
- When we don’t supply an index one is created for us
= pd.Series([2,1,3,3,4,1,5,6,2])
d d
0 2
1 1
2 3
3 3
4 4
5 1
6 5
7 6
8 2
dtype: int64
# Let's multiply first element=1 by 5
1]*5 d[
np.int64(5)
# Let's multiply first element=1 by the fourth=4
1]*d[4] d[
np.int64(4)
> median
> s.median()] s[s
a 1.110744
c 2.241242
dtype: float64
median
s.median()
np.float64(-0.3313383669873633)
Unique
- Let’s find the unique values in the series d
d.unique()
array([2, 1, 3, 4, 5, 6])
- We can also use pandas unique method
pd.unique(d)
array([2, 1, 3, 4, 5, 6])