library(reticulate)
# Installed it the first time so no need to again comment it out
#py_install("pandas")
#py_install("numpy")
#py_install("matplotlib")
Pandas DF & Series
Pandas is a popular open-source data manipulation and analysis library for the Python programming language. It provides a powerful and flexible set of tools for working with structured data, making it a fundamental tool for data scientists, analysts, and engineers. User guide.
Pandas is designed to handle data in various formats, such as tabular data, time series data, and more, making it an essential part of the data processing workflow in many industries.
Data Structure:
- A DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled rows and column (like a df in R, or an excel sheet)
- A series is a one-dimensional labeled array, basically a single column or row of data holding data of any type
Data Import and Export: Pandas makes it easy to read data from various sources, including CSV files, Excel spreadsheets, SQL databases, and more. It can also export data to these formats, enabling seamless data exchange.
Data Merging and Joining: You can combine multiple DataFrames using methods like merge and join, similar to SQL operations, to create more complex datasets from different sources.
Efficient Indexing: Pandas provides efficient indexing and selection methods, allowing you to access specific rows and columns of data quickly.
Custom Data Structures: You can create custom data structures and manipulate data in ways that suit your specific needs, extending Pandas’ capabilities.
Import Pandas
Read CSV
- First we read the csv using
pd.read_csv()
import pandas as pd
import numpy as np
# Read csv file
= 'D:/yourdataiq/main/datasets/avgpm25.csv'
file_path = pd.read_csv(file_path) df
DataFrame
View Data
Head
- Let’s we view the first
(n)
rows of the df usingDataFrame.head(n)
or - the last rows using
tail(n)
- Since we already assigned the DataFrame to df we can just use
df.head(n)
5) df.head(
pm25 fips region longitude latitude
0 9.771185 1003 east -87.748260 30.592781
1 9.993817 1027 east -85.842858 33.265810
2 10.688618 1033 east -87.725960 34.731480
3 11.337424 1049 east -85.798919 34.459133
4 12.119764 1055 east -86.032125 34.018597
Types
- We can view each column data type bu using
dtypes
df.dtypes
pm25 float64
fips int64
region object
longitude float64
latitude float64
dtype: object
Index
- To get a better understanding of how a df is structured let’s display the df index
- The output tells us we have rows=576 with the first value start=0 being 0 and it increases in step=1 till it stops at 575
- Remember it starts at 0
df.index
RangeIndex(start=0, stop=576, step=1)
Tail
- To better understand the RangeIndex let’s look at the last 5 rows of the df
- As you see the last index is 575 even though the RangeIndex: stop=576
5) df.tail(
pm25 fips region longitude latitude
571 4.955570 56029 west -108.999016 44.568961
572 6.549239 56033 west -106.968130 44.783629
573 5.632587 56035 west -109.991147 42.750949
574 6.349710 56037 west -109.168198 41.640708
575 4.565808 56039 west -110.673534 43.612124
Shape
- This gives us the size of the df in a tuple (rows, cols)
df.shape
(576, 5)
Columns
- Columns are index just like we used df.index to get the index of the rows let’s use
- df.columns to get the index of the columns
df.columns
Index(['pm25', 'fips', 'region', 'longitude', 'latitude'], dtype='object')
Add Columns
- We can simply add columns if none are provided with
= ['Name','Phone',....] df.columns
Rename Columns
- We can do this in several ways
# Method 1
= {'Oldname' : 'Newname'})
df.rename(columns
# Method 2
= ["Col1", "Col2", ....]
df.columns
# Method 3 is to use a range where df.shape[1] is the number of columns from shape()
= range(df.shape[1])
df.columns # so if the shape is (214, 8) then shape[1] = 8 and the range(8) would give us an index from 0 to 7, so the new column names will be 0..3...7
Convert type
# convert column to another type AS AN EXAMPLE
'pm25'] = df['pm25'].astype(int) df[
Extract Column as List
- Refer to the Pandas - Select page for more
- We can extract a column by name. Let’s use the example where we renamed the columns based on the number in the range from 0 to 7 as described above
- So if we want to extract a column as a list we use [ single bracket ]
Extract Column as DF
- To continue with the thought, to extract the columns to a df use [[ double brackets ]]
# To extract one column named 2 as a list
2]
df[# two columns as a list
0, 2]
df[
# Extract column as a df we use double bracket
0,2]] df[[
Extract Top 10 rows
- See example in GDP Table to CSV project
Describe
- To show a quick statistic summary of the data
df.describe()
pm25 fips longitude latitude
count 576.000000 576.000000 576.000000 576.000000
mean 9.836358 28431.092014 -91.651369 38.559141
std 2.277034 15923.382269 15.755843 5.123974
min 3.382626 1003.000000 -158.035861 19.683885
25% 8.548799 16037.500000 -97.380743 35.295496
50% 10.046697 28034.000000 -87.373118 39.093971
75% 11.356012 41045.000000 -80.722611 41.751875
max 18.440731 56039.000000 -68.261663 64.818590
Isnull
Built-in functions to identify missing values.
- The output is boolean indicating whether the value that is passed is missing or not
# here below you'll see that the values are False meaning isnull=False
= df.isnull()
missing 5) missing.head(
pm25 fips region longitude latitude
0 False False False False False
1 False False False False False
2 False False False False False
3 False False False False False
4 False False False False False
Notnull
= df.notnull()
notmissing 5) notmissing.head(
pm25 fips region longitude latitude
0 True True True True True
1 True True True True True
2 True True True True True
3 True True True True True
4 True True True True True
Count Null
- We can use a loop to count the number of missing values in each column
- value_counts counts the number of True values
for column in missing.columns.values.tolist():
print(column)
print(missing[column].value_counts())
print(" ")
pm25
pm25
False 576
Name: count, dtype: int64
fips
fips
False 576
Name: count, dtype: int64
region
region
False 576
Name: count, dtype: int64
longitude
longitude
False 576
Name: count, dtype: int64
latitude
latitude
False 576
Name: count, dtype: int64
Manipulate
Transpose
- If you want to flip direction of df use transpose
df.T
0 1 2 ... 573 574 575
pm25 9.771185 9.993817 10.688618 ... 5.632587 6.34971 4.565808
fips 1003 1027 1033 ... 56035 56037 56039
region east east east ... west west west
longitude -87.74826 -85.842858 -87.72596 ... -109.991147 -109.168198 -110.673534
latitude 30.592781 33.26581 34.73148 ... 42.750949 41.640708 43.612124
[5 rows x 576 columns]
Sort_index
- If you ever need to sort your index
- Syntax:
*, axis=0, level=None, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True, ignore_index=False, key=None) DataFrame.sort_index(
- Note: ascending = True is the default, so if you want descending =False to reverse it
- Note: axis =0 is default which is the row index, to change it to sort by column use: axis =1
- Since our rows: axis=0 are already sorted in ascending order let’s reverse them
=False) df.sort_index(ascending
pm25 fips region longitude latitude
575 4.565808 56039 west -110.673534 43.612124
574 6.349710 56037 west -109.168198 41.640708
573 5.632587 56035 west -109.991147 42.750949
572 6.549239 56033 west -106.968130 44.783629
571 4.955570 56029 west -108.999016 44.568961
.. ... ... ... ... ...
4 12.119764 1055 east -86.032125 34.018597
3 11.337424 1049 east -85.798919 34.459133
2 10.688618 1033 east -87.725960 34.731480
1 9.993817 1027 east -85.842858 33.265810
0 9.771185 1003 east -87.748260 30.592781
[576 rows x 5 columns]
- Let’s sort by column and keep the default of ascending
- See example inGDP Table to CSV project
=1) df.sort_index(axis
fips latitude longitude pm25 region
0 1003 30.592781 -87.748260 9.771185 east
1 1027 33.265810 -85.842858 9.993817 east
2 1033 34.731480 -87.725960 10.688618 east
3 1049 34.459133 -85.798919 11.337424 east
4 1055 34.018597 -86.032125 12.119764 east
.. ... ... ... ... ...
571 56029 44.568961 -108.999016 4.955570 west
572 56033 44.783629 -106.968130 6.549239 west
573 56035 42.750949 -109.991147 5.632587 west
574 56037 41.640708 -109.168198 6.349710 west
575 56039 43.612124 -110.673534 4.565808 west
[576 rows x 5 columns]
Sort_values
- This is comes in handy often, we can sort either index or columns and in ascending or descending just as we did with sort_index
- Most defaults and arguments still apply
- Syntax:
*, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False, key=None) DataFrame.sort_values(by,
- Since it is not helpful to sort the index=0 let’s sort the values by column 1 = (pm25)
="pm25") df.sort_values(by
pm25 fips region longitude latitude
568 3.382626 56009 west -105.502721 42.857379
51 3.494351 6033 west -122.751791 39.023264
570 4.132739 56021 west -104.762083 41.196199
471 4.175901 46071 west -101.669286 43.768510
85 4.186090 8039 west -104.292156 39.293642
.. ... ... ... ... ...
45 16.194524 6019 west -119.903470 36.638374
75 16.251904 6099 west -120.958800 37.613805
52 16.661801 6037 west -118.234216 34.088510
60 17.429049 6065 west -116.803605 33.783307
50 18.440731 6031 west -119.811272 36.155141
[576 rows x 5 columns]
Copy
- Before manipulating a df it’s always a good idea to copy it
- Simply use
df.copy()
= df.copy() df2
Transform
Python’s Transform function returns a self-produced dataframe with transformed values after applying the function specified in its parameter.
# Create df
=pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), columns=['a', 'b', 'c'])
df df
a b c
0 1 2 3
1 4 5 6
2 7 8 9
# Let's say we want to add 5 to each element
+ 5 df
a b c
0 6 7 8
1 9 10 11
2 12 13 14
- It’s easy to get it done as above with just an addition, and we come out with the same results
# Let's do it using Transform function
= df.transform(func = lambda x : x + 5)
df1 df1
a b c
0 6 7 8
1 9 10 11
2 12 13 14
- What if we want to take the square root of df?
import math
np.sqrt(df)
a b c
0 1.000000 1.414214 1.732051
1 2.000000 2.236068 2.449490
2 2.645751 2.828427 3.000000
= df.transform(func = ['sqrt'])
sr sr
a b c
sqrt sqrt sqrt
0 1.000000 1.414214 1.732051
1 2.000000 2.236068 2.449490
2 2.645751 2.828427 3.000000
Round
- Rounding a calculation or a float can be done using the
np.round()
- Note here we are rounding the value in df[col2] back to the same column
- We could also perform a calculation and round at the same time
- See more examples in
- Bank SQLite3_BS project
- GDP SQLite3_Pandasproject
# Round col2 back into col2
'col2']] = np.round(df[['col2']], 2)
df[[
# Calculate, round and assign a new name to col2
'col3']] = np.round(df[['col2']]/1000, 2) df[[
Convert Type
- We can convert a df type with a simple
astype(
) as you can see in GDP Table to CSV project - Or use
pd.to_numeric()
as shown in GDP SQLite3_Pandas project
'GDP (Millions USD)'] = df['GDP (Millions USD)'].astype(float)
df['GDP_USD_millions'] = pd.to_numeric(df['GDP_USD_millions'], errors='coerce') df[
Convert DF to Dict
Here we’ll go through an example of reading a CSV file into a df then converting the df to a dictionary.
- The entire example is detailed in Bank SQLite3_BS project
def get_exchange_rate(filename):
= pd.read_csv(filename)
exchange_df = exchange_df.set_index('Currency')['Rate'.to_dict()]
df_dict return df_dict
Series
More information is here.
Once again, a series is a one dimensional labeled array holding any type of data
Create from ndarray
- Data can be any data, a dictionary, an ndarray, a scalar value
- index: is a list of axis labels, and at times will declare what type of data it is
= pd.Series(data, index = index) s
Set index
- If data is an ndarray, index must be the same length as data
- If no index is passed one will be created automatically having values [0,….., len(data)-1]
- Create a series = s of 5 random numbers, with an index from a:e
= pd.Series(np.random.randn(5), index = ["a","b","c","d","e"])
s s
a 0.146504
b 0.571251
c 0.318455
d 0.444745
e 1.161085
dtype: float64
Index
- Similar to df we use
.index
to view the index of a series
= pd.Series(np.random.randn(5), index = ["a","b","c","d","e"])
s s.index
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
Type
s.dtype
dtype('float64')
Create from dict
dict = {"b": 1, "a": 0, "c": 2}
dict) pd.Series(
b 1
a 0
c 2
dtype: int64
Create from scalar
6.0, index=["a","b","c","turkey","g"]) pd.Series(
a 6.0
b 6.0
c 6.0
turkey 6.0
g 6.0
dtype: float64