Pandas DF & Series

Pandas is a popular open-source data manipulation and analysis library for the Python programming language. It provides a powerful and flexible set of tools for working with structured data, making it a fundamental tool for data scientists, analysts, and engineers. User guide.

Pandas is designed to handle data in various formats, such as tabular data, time series data, and more, making it an essential part of the data processing workflow in many industries.

Data Structure:

  • A DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled rows and column (like a df in R, or an excel sheet)
  • A series is a one-dimensional labeled array, basically a single column or row of data holding data of any type

Data Import and Export: Pandas makes it easy to read data from various sources, including CSV files, Excel spreadsheets, SQL databases, and more. It can also export data to these formats, enabling seamless data exchange.

Data Merging and Joining: You can combine multiple DataFrames using methods like merge and join, similar to SQL operations, to create more complex datasets from different sources.

Efficient Indexing: Pandas provides efficient indexing and selection methods, allowing you to access specific rows and columns of data quickly.

Custom Data Structures: You can create custom data structures and manipulate data in ways that suit your specific needs, extending Pandas’ capabilities.

Import Pandas

library(reticulate)
# Installed it the first time so no need to again comment it out
#py_install("pandas")
#py_install("numpy")
#py_install("matplotlib")

Read CSV

  • First we read the csv using pd.read_csv()
import pandas as pd
import numpy as np

# Read csv file
file_path = 'D:/yourdataiq/main/datasets/avgpm25.csv'
df = pd.read_csv(file_path)

DataFrame


View Data

Types

  • We can view each column data type bu using dtypes
df.dtypes
pm25         float64
fips           int64
region        object
longitude    float64
latitude     float64
dtype: object

Index

  • To get a better understanding of how a df is structured let’s display the df index
  • The output tells us we have rows=576 with the first value start=0 being 0 and it increases in step=1 till it stops at 575
  • Remember it starts at 0
df.index
RangeIndex(start=0, stop=576, step=1)

Tail

  • To better understand the RangeIndex let’s look at the last 5 rows of the df
  • As you see the last index is 575 even though the RangeIndex: stop=576
df.tail(5)
         pm25   fips region   longitude   latitude
571  4.955570  56029   west -108.999016  44.568961
572  6.549239  56033   west -106.968130  44.783629
573  5.632587  56035   west -109.991147  42.750949
574  6.349710  56037   west -109.168198  41.640708
575  4.565808  56039   west -110.673534  43.612124

Shape

  • This gives us the size of the df in a tuple (rows, cols)
df.shape
(576, 5)

Columns

  • Columns are index just like we used df.index to get the index of the rows let’s use
  • df.columns to get the index of the columns
df.columns
Index(['pm25', 'fips', 'region', 'longitude', 'latitude'], dtype='object')

Add Columns

  • We can simply add columns if none are provided with
df.columns = ['Name','Phone',....]

Rename Columns

  • We can do this in several ways
# Method 1
df.rename(columns = {'Oldname' : 'Newname'})

# Method 2
df.columns = ["Col1", "Col2", ....]

# Method 3 is to use a range where df.shape[1] is the number of columns from shape()
df.columns = range(df.shape[1])
# so if the shape is (214, 8) then shape[1] = 8 and the range(8) would give us an index from 0 to 7, so the new column names will be 0..3...7

Convert type

# convert column to another type AS AN EXAMPLE
df['pm25'] = df['pm25'].astype(int)

Extract Column as List

  • Refer to the Pandas - Select page for more
  • We can extract a column by name. Let’s use the example where we renamed the columns based on the number in the range from 0 to 7 as described above
  • So if we want to extract a column as a list we use [ single bracket ]

Extract Column as DF

  • To continue with the thought, to extract the columns to a df use [[ double brackets ]]
# To extract one column named 2 as a list
df[2]
# two columns as a list
df[0, 2]

# Extract column as a df we use double bracket
df[[0,2]]

Extract Top 10 rows

Describe

  • To show a quick statistic summary of the data
df.describe()
             pm25          fips   longitude    latitude
count  576.000000    576.000000  576.000000  576.000000
mean     9.836358  28431.092014  -91.651369   38.559141
std      2.277034  15923.382269   15.755843    5.123974
min      3.382626   1003.000000 -158.035861   19.683885
25%      8.548799  16037.500000  -97.380743   35.295496
50%     10.046697  28034.000000  -87.373118   39.093971
75%     11.356012  41045.000000  -80.722611   41.751875
max     18.440731  56039.000000  -68.261663   64.818590

Isnull

Built-in functions to identify missing values.

  • The output is boolean indicating whether the value that is passed is missing or not
# here below you'll see that the values are False meaning isnull=False
missing = df.isnull()
missing.head(5)
    pm25   fips  region  longitude  latitude
0  False  False   False      False     False
1  False  False   False      False     False
2  False  False   False      False     False
3  False  False   False      False     False
4  False  False   False      False     False

Notnull

notmissing = df.notnull()
notmissing.head(5)
   pm25  fips  region  longitude  latitude
0  True  True    True       True      True
1  True  True    True       True      True
2  True  True    True       True      True
3  True  True    True       True      True
4  True  True    True       True      True

Count Null

  • We can use a loop to count the number of missing values in each column
  • value_counts counts the number of True values
for column in missing.columns.values.tolist():
        print(column)
        print(missing[column].value_counts())
        print(" ")
pm25
pm25
False    576
Name: count, dtype: int64
 
fips
fips
False    576
Name: count, dtype: int64
 
region
region
False    576
Name: count, dtype: int64
 
longitude
longitude
False    576
Name: count, dtype: int64
 
latitude
latitude
False    576
Name: count, dtype: int64
 

Manipulate

Transpose

  • If you want to flip direction of df use transpose
df.T
                 0          1          2    ...         573         574         575
pm25        9.771185   9.993817  10.688618  ...    5.632587     6.34971    4.565808
fips            1003       1027       1033  ...       56035       56037       56039
region          east       east       east  ...        west        west        west
longitude  -87.74826 -85.842858  -87.72596  ... -109.991147 -109.168198 -110.673534
latitude   30.592781   33.26581   34.73148  ...   42.750949   41.640708   43.612124

[5 rows x 576 columns]

Sort_index

  • If you ever need to sort your index
  • Syntax:
DataFrame.sort_index(*, axis=0, level=None, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True, ignore_index=False, key=None)
  • Note: ascending = True is the default, so if you want descending =False to reverse it
  • Note: axis =0 is default which is the row index, to change it to sort by column use: axis =1
  • Since our rows: axis=0 are already sorted in ascending order let’s reverse them
df.sort_index(ascending=False)
          pm25   fips region   longitude   latitude
575   4.565808  56039   west -110.673534  43.612124
574   6.349710  56037   west -109.168198  41.640708
573   5.632587  56035   west -109.991147  42.750949
572   6.549239  56033   west -106.968130  44.783629
571   4.955570  56029   west -108.999016  44.568961
..         ...    ...    ...         ...        ...
4    12.119764   1055   east  -86.032125  34.018597
3    11.337424   1049   east  -85.798919  34.459133
2    10.688618   1033   east  -87.725960  34.731480
1     9.993817   1027   east  -85.842858  33.265810
0     9.771185   1003   east  -87.748260  30.592781

[576 rows x 5 columns]
  • Let’s sort by column and keep the default of ascending
  • See example inGDP Table to CSV project
df.sort_index(axis=1)
      fips   latitude   longitude       pm25 region
0     1003  30.592781  -87.748260   9.771185   east
1     1027  33.265810  -85.842858   9.993817   east
2     1033  34.731480  -87.725960  10.688618   east
3     1049  34.459133  -85.798919  11.337424   east
4     1055  34.018597  -86.032125  12.119764   east
..     ...        ...         ...        ...    ...
571  56029  44.568961 -108.999016   4.955570   west
572  56033  44.783629 -106.968130   6.549239   west
573  56035  42.750949 -109.991147   5.632587   west
574  56037  41.640708 -109.168198   6.349710   west
575  56039  43.612124 -110.673534   4.565808   west

[576 rows x 5 columns]

Sort_values

  • This is comes in handy often, we can sort either index or columns and in ascending or descending just as we did with sort_index
  • Most defaults and arguments still apply
  • Syntax:
DataFrame.sort_values(by, *, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False, key=None)
  • Since it is not helpful to sort the index=0 let’s sort the values by column 1 = (pm25)
df.sort_values(by="pm25")
          pm25   fips region   longitude   latitude
568   3.382626  56009   west -105.502721  42.857379
51    3.494351   6033   west -122.751791  39.023264
570   4.132739  56021   west -104.762083  41.196199
471   4.175901  46071   west -101.669286  43.768510
85    4.186090   8039   west -104.292156  39.293642
..         ...    ...    ...         ...        ...
45   16.194524   6019   west -119.903470  36.638374
75   16.251904   6099   west -120.958800  37.613805
52   16.661801   6037   west -118.234216  34.088510
60   17.429049   6065   west -116.803605  33.783307
50   18.440731   6031   west -119.811272  36.155141

[576 rows x 5 columns]

Copy

  • Before manipulating a df it’s always a good idea to copy it
  • Simply use df.copy()
df2 = df.copy()

Transform

Python’s Transform function returns a self-produced dataframe with transformed values after applying the function specified in its parameter.

# Create df
df=pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), columns=['a', 'b', 'c'])
df
   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9
# Let's say we want to add 5 to each element
df + 5
    a   b   c
0   6   7   8
1   9  10  11
2  12  13  14
  • It’s easy to get it done as above with just an addition, and we come out with the same results
# Let's do it using Transform function
df1 = df.transform(func = lambda x : x + 5)
df1
    a   b   c
0   6   7   8
1   9  10  11
2  12  13  14
  • What if we want to take the square root of df?
import math
np.sqrt(df)
          a         b         c
0  1.000000  1.414214  1.732051
1  2.000000  2.236068  2.449490
2  2.645751  2.828427  3.000000
sr = df.transform(func = ['sqrt'])
sr
          a         b         c
       sqrt      sqrt      sqrt
0  1.000000  1.414214  1.732051
1  2.000000  2.236068  2.449490
2  2.645751  2.828427  3.000000

Round

  • Rounding a calculation or a float can be done using the np.round()
  • Note here we are rounding the value in df[col2] back to the same column
  • We could also perform a calculation and round at the same time
  • See more examples in
# Round col2 back into col2
df[['col2']] = np.round(df[['col2']], 2)

# Calculate, round and assign a new name to col2
df[['col3']] = np.round(df[['col2']]/1000, 2)

Convert Type

df['GDP (Millions USD)'] = df['GDP (Millions USD)'].astype(float)
df['GDP_USD_millions'] = pd.to_numeric(df['GDP_USD_millions'], errors='coerce')

Convert DF to Dict

Here we’ll go through an example of reading a CSV file into a df then converting the df to a dictionary.

def get_exchange_rate(filename):
        exchange_df = pd.read_csv(filename)
        df_dict = exchange_df.set_index('Currency')['Rate'.to_dict()]
        return df_dict

Series


More information is here.

Once again, a series is a one dimensional labeled array holding any type of data

Create from ndarray

  • Data can be any data, a dictionary, an ndarray, a scalar value
  • index: is a list of axis labels, and at times will declare what type of data it is
s = pd.Series(data, index = index)

Set index

  • If data is an ndarray, index must be the same length as data
  • If no index is passed one will be created automatically having values [0,….., len(data)-1]
  • Create a series = s of 5 random numbers, with an index from a:e
s = pd.Series(np.random.randn(5), index = ["a","b","c","d","e"])
s
a    0.146504
b    0.571251
c    0.318455
d    0.444745
e    1.161085
dtype: float64

Index

  • Similar to df we use .index to view the index of a series
s = pd.Series(np.random.randn(5), index = ["a","b","c","d","e"])
s.index
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

Type

s.dtype
dtype('float64')

Create from dict

dict = {"b": 1, "a": 0, "c": 2} 
pd.Series(dict)
b    1
a    0
c    2
dtype: int64

Create from scalar

pd.Series(6.0, index=["a","b","c","turkey","g"])
a         6.0
b         6.0
c         6.0
turkey    6.0
g         6.0
dtype: float64