GroupBy & Pivot_Table
GroupBy
If we want to group and aggregate columns we can use groupby () and it functions just as it does in R or SQL.
Pivot Table
In PySpark we can use Rollup to subtotal and group columns by categories or whatever condition we wish. In Pandas we can use pivot_table()
pandas.pivot_table(data, values=None, index=None, columns=None, aggfunc=‘mean’, fill_value=None, margins=False, dropna=True, margins_name=‘All’, observed=<no_default>, sort=True)
data: DataFrame
- valueslist-like or scalar, optional
-
Column(s) to aggregate.
- index: column, Grouper, array, or list of the previous
-
The column(s) you want to use the row labels in the resultiing pivot table. Keys to group by on the pivot table index. If a list is passed, it can contain any of the other types (except list). If an array is passed, it must be the same length as the data and will be used in the same manner as column values.
- columns: column, Grouper, array, or list of the previous
-
The column(s) you want to use as column labels in the resulting pivot table. Keys to group by on the pivot table column. If a list is passed, it can contain any of the other types (except list). If an array is passed, it must be the same length as the data and will be used in the same manner as column values.
- aggfunc: function, list of functions, dict, default “mean”
-
The aggregation function to apply to the values(e.g. sum, mean, If a list of functions is passed, the resulting pivot table will have hierarchical columns whose top level are the function names (inferred from the function objects themselves). If a dict is passed, the key is column to aggregate and the value is function or list of functions. If
margin=True
, aggfunc will be used to calculate the partial aggregates. - fill_valuescalar, default None
-
Value to replace missing values with (in the resulting pivot table, after aggregation).
- marginsbool, default False
-
If
margins=True
, specialAll
columns and rows will be added with partial group aggregates across the categories on the rows and columns. - dropnabool, default True
-
Do not include columns whose entries are all NaN. If True, rows with a NaN value in any column will be omitted before computing margins.
- margins_namestr, default ‘All’
-
Name of the row / column that will contain the totals when margins is True.