Pandas DF & Series

Pandas is a popular open-source data manipulation and analysis library for the Python programming language. It provides a powerful and flexible set of tools for working with structured data, making it a fundamental tool for data scientists, analysts, and engineers. User guide.

Pandas is designed to handle data in various formats, such as tabular data, time series data, and more, making it an essential part of the data processing workflow in many industries.

Data Structure:

A DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled rows and column (like a df in R, or an excel sheet)
A series is a one-dimensional labeled array, basically a single column or row of data holding data of any type

Data Import and Export: Pandas makes it easy to read data from various sources, including CSV files, Excel spreadsheets, SQL databases, and more. It can also export data to these formats, enabling seamless data exchange.

Data Merging and Joining: You can combine multiple DataFrames using methods like merge and join, similar to SQL operations, to create more complex datasets from different sources.

Efficient Indexing: Pandas provides efficient indexing and selection methods, allowing you to access specific rows and columns of data quickly.

Custom Data Structures: You can create custom data structures and manipulate data in ways that suit your specific needs, extending Pandas’ capabilities.

Import Pandas

library(reticulate)
# Installed it the first time so no need to again comment it out
#py_install("pandas")
#py_install("numpy")
#py_install("matplotlib")

Read CSV

First we read the csv using pd.read_csv()

import pandas as pd
import numpy as np

# Read csv file
file_path = 'D:/yourdataiq/main/datasets/avgpm25.csv'
df = pd.read_csv(file_path)

DataFrame

View Data

Head

Let’s we view the first(n) rows of the df using DataFrame.head(n) or
the last rows using tail(n)
Since we already assigned the DataFrame to df we can just use df.head(n)

df.head(5)

        pm25  fips region  longitude   latitude
0   9.771185  1003   east -87.748260  30.592781
1   9.993817  1027   east -85.842858  33.265810
2  10.688618  1033   east -87.725960  34.731480
3  11.337424  1049   east -85.798919  34.459133
4  12.119764  1055   east -86.032125  34.018597

Types

We can view each column data type bu using dtypes

df.dtypes

pm25         float64
fips           int64
region        object
longitude    float64
latitude     float64
dtype: object

Index

To get a better understanding of how a df is structured let’s display the df index
The output tells us we have rows=576 with the first value start=0 being 0 and it increases in step=1 till it stops at 575
Remember it starts at 0

df.index

RangeIndex(start=0, stop=576, step=1)

Tail

To better understand the RangeIndex let’s look at the last 5 rows of the df
As you see the last index is 575 even though the RangeIndex: stop=576

df.tail(5)

         pm25   fips region   longitude   latitude
571  4.955570  56029   west -108.999016  44.568961
572  6.549239  56033   west -106.968130  44.783629
573  5.632587  56035   west -109.991147  42.750949
574  6.349710  56037   west -109.168198  41.640708
575  4.565808  56039   west -110.673534  43.612124

Shape

This gives us the size of the df in a tuple (rows, cols)

df.shape

(576, 5)

Columns

Columns are index just like we used df.index to get the index of the rows let’s use
df.columns to get the index of the columns

df.columns

Index(['pm25', 'fips', 'region', 'longitude', 'latitude'], dtype='object')

Add Columns

We can simply add columns if none are provided with

df.columns = ['Name','Phone',....]

Rename Columns

We can do this in several ways

# Method 1
df.rename(columns = {'Oldname' : 'Newname'})

# Method 2
df.columns = ["Col1", "Col2", ....]

# Method 3 is to use a range where df.shape[1] is the number of columns from shape()
df.columns = range(df.shape[1])
# so if the shape is (214, 8) then shape[1] = 8 and the range(8) would give us an index from 0 to 7, so the new column names will be 0..3...7

Convert type

# convert column to another type AS AN EXAMPLE
df['pm25'] = df['pm25'].astype(int)

Extract Column as List

Refer to the Pandas - Select page for more
We can extract a column by name. Let’s use the example where we renamed the columns based on the number in the range from 0 to 7 as described above
So if we want to extract a column as a list we use [ single bracket ]

Extract Column as DF

To continue with the thought, to extract the columns to a df use [[ double brackets ]]

# To extract one column named 2 as a list
df[2]
# two columns as a list
df[0, 2]

# Extract column as a df we use double bracket
df[[0,2]]

Extract Top 10 rows

See example in GDP Table to CSV project

Describe

To show a quick statistic summary of the data

df.describe()

             pm25          fips   longitude    latitude
count  576.000000    576.000000  576.000000  576.000000
mean     9.836358  28431.092014  -91.651369   38.559141
std      2.277034  15923.382269   15.755843    5.123974
min      3.382626   1003.000000 -158.035861   19.683885
25%      8.548799  16037.500000  -97.380743   35.295496
50%     10.046697  28034.000000  -87.373118   39.093971
75%     11.356012  41045.000000  -80.722611   41.751875
max     18.440731  56039.000000  -68.261663   64.818590

Isnull

Built-in functions to identify missing values.

The output is boolean indicating whether the value that is passed is missing or not

# here below you'll see that the values are False meaning isnull=False
missing = df.isnull()
missing.head(5)

    pm25   fips  region  longitude  latitude
0  False  False   False      False     False
1  False  False   False      False     False
2  False  False   False      False     False
3  False  False   False      False     False
4  False  False   False      False     False

Notnull

notmissing = df.notnull()
notmissing.head(5)

   pm25  fips  region  longitude  latitude
0  True  True    True       True      True
1  True  True    True       True      True
2  True  True    True       True      True
3  True  True    True       True      True
4  True  True    True       True      True

Count Null

We can use a loop to count the number of missing values in each column
value_counts counts the number of True values

for column in missing.columns.values.tolist():
        print(column)
        print(missing[column].value_counts())
        print(" ")

pm25
pm25
False    576
Name: count, dtype: int64
 
fips
fips
False    576
Name: count, dtype: int64
 
region
region
False    576
Name: count, dtype: int64
 
longitude
longitude
False    576
Name: count, dtype: int64
 
latitude
latitude
False    576
Name: count, dtype: int64

Manipulate

Transpose

If you want to flip direction of df use transpose

df.T

                 0          1          2    ...         573         574         575
pm25        9.771185   9.993817  10.688618  ...    5.632587     6.34971    4.565808
fips            1003       1027       1033  ...       56035       56037       56039
region          east       east       east  ...        west        west        west
longitude  -87.74826 -85.842858  -87.72596  ... -109.991147 -109.168198 -110.673534
latitude   30.592781   33.26581   34.73148  ...   42.750949   41.640708   43.612124

[5 rows x 576 columns]

Sort_index

If you ever need to sort your index
Syntax:

DataFrame.sort_index(*, axis=0, level=None, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True, ignore_index=False, key=None)

Note: ascending = True is the default, so if you want descending =False to reverse it
Note: axis =0 is default which is the row index, to change it to sort by column use: axis =1
Since our rows: axis=0 are already sorted in ascending order let’s reverse them

df.sort_index(ascending=False)

          pm25   fips region   longitude   latitude
575   4.565808  56039   west -110.673534  43.612124
574   6.349710  56037   west -109.168198  41.640708
573   5.632587  56035   west -109.991147  42.750949
572   6.549239  56033   west -106.968130  44.783629
571   4.955570  56029   west -108.999016  44.568961
..         ...    ...    ...         ...        ...
4    12.119764   1055   east  -86.032125  34.018597
3    11.337424   1049   east  -85.798919  34.459133
2    10.688618   1033   east  -87.725960  34.731480
1     9.993817   1027   east  -85.842858  33.265810
0     9.771185   1003   east  -87.748260  30.592781

[576 rows x 5 columns]

Let’s sort by column and keep the default of ascending
See example inGDP Table to CSV project

df.sort_index(axis=1)

      fips   latitude   longitude       pm25 region
0     1003  30.592781  -87.748260   9.771185   east
1     1027  33.265810  -85.842858   9.993817   east
2     1033  34.731480  -87.725960  10.688618   east
3     1049  34.459133  -85.798919  11.337424   east
4     1055  34.018597  -86.032125  12.119764   east
..     ...        ...         ...        ...    ...
571  56029  44.568961 -108.999016   4.955570   west
572  56033  44.783629 -106.968130   6.549239   west
573  56035  42.750949 -109.991147   5.632587   west
574  56037  41.640708 -109.168198   6.349710   west
575  56039  43.612124 -110.673534   4.565808   west

[576 rows x 5 columns]

Sort_values

This is comes in handy often, we can sort either index or columns and in ascending or descending just as we did with sort_index
Most defaults and arguments still apply
Syntax:

DataFrame.sort_values(by, *, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False, key=None)

Since it is not helpful to sort the index=0 let’s sort the values by column 1 = (pm25)

df.sort_values(by="pm25")

          pm25   fips region   longitude   latitude
568   3.382626  56009   west -105.502721  42.857379
51    3.494351   6033   west -122.751791  39.023264
570   4.132739  56021   west -104.762083  41.196199
471   4.175901  46071   west -101.669286  43.768510
85    4.186090   8039   west -104.292156  39.293642
..         ...    ...    ...         ...        ...
45   16.194524   6019   west -119.903470  36.638374
75   16.251904   6099   west -120.958800  37.613805
52   16.661801   6037   west -118.234216  34.088510
60   17.429049   6065   west -116.803605  33.783307
50   18.440731   6031   west -119.811272  36.155141

[576 rows x 5 columns]

Copy

Before manipulating a df it’s always a good idea to copy it
Simply use df.copy()

df2 = df.copy()

Transform

Python’s Transform function returns a self-produced dataframe with transformed values after applying the function specified in its parameter.

# Create df
df=pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), columns=['a', 'b', 'c'])
df

# Let's say we want to add 5 to each element
df + 5

    a   b   c
0   6   7   8
1   9  10  11
2  12  13  14

It’s easy to get it done as above with just an addition, and we come out with the same results

# Let's do it using Transform function
df1 = df.transform(func = lambda x : x + 5)
df1

    a   b   c
0   6   7   8
1   9  10  11
2  12  13  14

What if we want to take the square root of df?

import math
np.sqrt(df)

          a         b         c
0  1.000000  1.414214  1.732051
1  2.000000  2.236068  2.449490
2  2.645751  2.828427  3.000000

sr = df.transform(func = ['sqrt'])
sr

          a         b         c
       sqrt      sqrt      sqrt
0  1.000000  1.414214  1.732051
1  2.000000  2.236068  2.449490
2  2.645751  2.828427  3.000000

Round

Rounding a calculation or a float can be done using the np.round()
Note here we are rounding the value in df[col2] back to the same column
We could also perform a calculation and round at the same time
See more examples in
- Bank SQLite3_BS project
- GDP SQLite3_Pandasproject

# Round col2 back into col2
df[['col2']] = np.round(df[['col2']], 2)

# Calculate, round and assign a new name to col2
df[['col3']] = np.round(df[['col2']]/1000, 2)

Convert Type

We can convert a df type with a simple astype() as you can see in GDP Table to CSV project
Or use pd.to_numeric() as shown in GDP SQLite3_Pandas project

df['GDP (Millions USD)'] = df['GDP (Millions USD)'].astype(float)
df['GDP_USD_millions'] = pd.to_numeric(df['GDP_USD_millions'], errors='coerce')

Convert DF to Dict

Here we’ll go through an example of reading a CSV file into a df then converting the df to a dictionary.

The entire example is detailed in Bank SQLite3_BS project

def get_exchange_rate(filename):
        exchange_df = pd.read_csv(filename)
        df_dict = exchange_df.set_index('Currency')['Rate'.to_dict()]
        return df_dict

Series

More information is here.

Once again, a series is a one dimensional labeled array holding any type of data

Create from ndarray

Data can be any data, a dictionary, an ndarray, a scalar value
index: is a list of axis labels, and at times will declare what type of data it is

s = pd.Series(data, index = index)

Set index

If data is an ndarray, index must be the same length as data
If no index is passed one will be created automatically having values [0,….., len(data)-1]
Create a series = s of 5 random numbers, with an index from a:e

s = pd.Series(np.random.randn(5), index = ["a","b","c","d","e"])
s

a   -0.153017
b   -1.019730
c    0.933597
d    0.317367
e    1.190710
dtype: float64

Index

Similar to df we use .index to view the index of a series

s = pd.Series(np.random.randn(5), index = ["a","b","c","d","e"])
s.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

Type

s.dtype

dtype('float64')

Create from dict

dict = {"b": 1, "a": 0, "c": 2} 
pd.Series(dict)

b    1
a    0
c    2
dtype: int64

Create from scalar

pd.Series(6.0, index=["a","b","c","turkey","g"])

a         6.0
b         6.0
c         6.0
turkey    6.0
g         6.0
dtype: float64