Aggregation And Grouping
**Aggregation:** We will study aggregations like `sum()`, `mean()`, `median()`, `min()`, and `max()`, in which a single number gives insight into the nature of a potentially large dataset.
**Grouping:** When we are dealing with large datasets, it is useful to group data by common categories (the values of a particular column). To accomplish this, we use the `groupby()` method of a pandas DataFrame.
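As a minimal illustration (with made-up numbers), each of these aggregates collapses a whole Series of values into a single summary number:

```python
import pandas as pd

ser = pd.Series([2, 4, 6, 8])
print(ser.sum())             # 20
print(ser.mean())            # 5.0
print(ser.median())          # 5.0
print(ser.min(), ser.max())  # 2 8
```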
🛳 Titanic Ship Data for Demonstration
- We will use the Titanic survivors dataset for our analysis
import numpy as np
import pandas as pd
titanic = pd.read_csv('data/titanic.csv')
# fetching column names using the .keys() method
print(titanic.keys())
Index(['survived', 'pclass', 'sex', 'age', 'fare', 'embarked', 'who',
'embark_town', 'alive', 'alone'],
dtype='object')
print(titanic.head())  # show the first five rows, plus the header
survived pclass sex age fare embarked who embark_town alive \
0 0 3 male 22.0 7.2500 S man Southampton no
1 1 1 female 38.0 71.2833 C woman Cherbourg yes
2 1 3 female 26.0 7.9250 S woman Southampton yes
3 1 1 female 35.0 53.1000 S woman Southampton yes
4 0 3 male 35.0 8.0500 S man Southampton no
alone
0 False
1 False
2 True
3 False
4 True
**Features of Dataset:** The columns of a DataFrame are referred to as the features of a dataset. These features can be:
- Quantitative feature: any value represented by numbers. We apply numerical metrics (sum, mean, std) to them
- Categorical feature: values are categories that can be used to group the dataset. GroupBy is applied on categorical features (a quick way to tell the two apart is shown below)
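A rough but handy check is to inspect each column's dtype: numeric and boolean dtypes are usually quantitative features, while `object` columns are usually categorical:

```python
# numeric/bool dtypes -> usually quantitative features;
# object dtypes -> usually categorical features
print(titanic.dtypes)
```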
The `.describe()` method computes several common aggregates for each numeric column (those containing integers and floats); categorical columns are excluded by default:
print(titanic.describe())
survived pclass age fare
count 891.000000 891.000000 714.000000 891.000000
mean 0.383838 2.308642 29.699118 32.204208
std 0.486592 0.836071 14.526497 49.693429
min 0.000000 1.000000 0.420000 0.000000
25% 0.000000 2.000000 20.125000 7.910400
50% 0.000000 3.000000 28.000000 14.454200
75% 1.000000 3.000000 38.000000 31.000000
max 1.000000 3.000000 80.000000 512.329200
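That said, `describe()` can also summarize the categorical columns when asked explicitly via its `include` argument:

```python
# summarize only the object (string) columns: reports count, unique,
# top (most frequent value), and freq for each such column
print(titanic.describe(include='object'))
```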
For categorical features, we can use the `.unique()` method to obtain the unique instances of each categorical feature:
# fetching unique values under column 'embark_town'
titanic['embark_town'].unique()
array(['Southampton', 'Cherbourg', 'Queenstown', nan], dtype=object)
For categorical features, we can use the `.value_counts()` method to obtain the frequency count of each category in a column:
# fetching all unique instances of column 'embark_town'
# along with the frequency count
titanic['embark_town'].value_counts()
Southampton 644
Cherbourg 168
Queenstown 77
Name: embark_town, dtype: int64
We can use the `normalize=True` argument to get the proportion of each frequency count:
# the sum of all values is equal to 1
titanic['embark_town'].value_counts(normalize=True)
Southampton 0.724409
Cherbourg 0.188976
Queenstown 0.086614
Name: embark_town, dtype: float64
GroupBy, or split-apply-combine, is one of the most common and useful strategies of data analysis:
- The split step involves breaking up and grouping a DataFrame depending on the values of the specified column.
- The apply step involves computing some function, usually an aggregation, transformation, or filtering, within the individual groups. Apply can take the following forms:
  - Aggregation
  - Transformation
  - Filtration
- The combine step merges the results of these operations into an output array (a hands-on sketch of all three steps follows below).
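To make the three steps concrete, here is a minimal sketch (with illustrative data) of split-apply-combine done by hand, without `groupby()`:

```python
import pandas as pd

toy = pd.DataFrame({'key': ['A', 'B', 'A', 'B'],
                    'value': [1, 2, 3, 4]})

# Split: one sub-DataFrame per unique key
pieces = {k: toy[toy['key'] == k] for k in toy['key'].unique()}

# Apply: an aggregate (here, the mean) within each piece
results = {k: piece['value'].mean() for k, piece in pieces.items()}

# Combine: merge the per-group results into a single Series
print(pd.Series(results))  # A: 2.0, B: 3.0
# equivalent one-liner: toy.groupby('key')['value'].mean()
```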
**Split (Creating a DataFrameGroupBy Object):** Suppose we would like to group by the column `sex`. Doing this creates a GroupBy object, which stores the data of the individual groups in the form of key-value pairs; we can fetch these using the `.groups` attribute of this object:
# grouping by column 'sex'
group_by_sex = titanic.groupby('sex')
# the output is DataFrameGroupBy object
group_by_sex
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f9778b203d0>
# getting all groups as key-value pair
group_by_sex.groups
{'female': Int64Index([ 1, 2, 3, 8, 9, 10, 11, 14, 15, 18,
...
866, 871, 874, 875, 879, 880, 882, 885, 887, 888],
dtype='int64', length=314),
'male': Int64Index([ 0, 4, 5, 6, 7, 12, 13, 16, 17, 20,
...
873, 876, 877, 878, 881, 883, 884, 886, 889, 890],
dtype='int64', length=577)}
Suppose we are interested in the `.mean()` of all column values once the data is grouped by `sex`. This step applies the mean to each group and then combines the results to give us a Series (for a single-column selection) or a DataFrame (for a multiple-column selection). We pass `numeric_only=True` so that the non-numeric columns are skipped (newer versions of pandas raise an error otherwise):
titanic.groupby('sex').mean(numeric_only=True)
survived pclass age fare alone
sex
female 0.742038 2.159236 27.915709 44.479818 0.401274
male 0.188908 2.389948 30.726645 25.523893 0.712305
➞ Suppose further that, instead of getting the mean of all column values, we are only interested in finding the mean of the values under the columns `survived` and `age`:
print(titanic.groupby('sex')[['survived', 'age']].mean())
survived age
sex
female 0.742038 27.915709
male 0.188908 30.726645
➞ In addition, we can also fetch a specific group using `get_group('group_name_here')` and apply an aggregator to that object:
group_by_sex.get_group('male').mean(numeric_only=True)
survived 0.188908
pclass 2.389948
age 30.726645
fare 25.523893
alone 0.712305
dtype: float64
We are not limited to grouping by a single column. In this example, we group the DataFrame by the columns `sex` and `alive` and then apply the `mean` aggregator. Further, we are only interested in the output for the `age` and `fare` columns:
print(titanic.groupby(['sex', 'alive'])[['age', 'fare']].mean())
age fare
sex alive
female no 25.046875 23.024385
yes 28.847716 51.938573
male no 31.618056 21.960993
yes 27.276022 40.821484
We discussed earlier that the groupby function creates a GroupBy object, which stores the values as dictionary-style key-value pairs. To demonstrate this concept, we will apply a `for` loop to the GroupBy object:
# let's get the shape of each group when we group by 'pclass'
for (name, group) in titanic.groupby('pclass'):
print(f"{name}: shape={group.shape}")
1: shape=(216, 10)
2: shape=(184, 10)
3: shape=(491, 10)
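Since iterating over a GroupBy object yields `(name, sub-DataFrame)` pairs, one idiom for materializing this dictionary view explicitly is:

```python
# build an actual dict mapping each group name to its sub-DataFrame
groups = dict(list(titanic.groupby('pclass')))
print(groups[1].head())  # rows for first-class passengers
```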
GroupBy objects have `aggregate()`, `filter()`, `transform()`, and `apply()` methods that efficiently implement a variety of useful operations before combining the grouped data. The first step is to construct a DataFrame for this discussion:
rand = np.random.RandomState(42)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': rand.randint(1, 10, size=6),
                   'data2': rand.randint(1, 20, size=6)})
print(df)
key data1 data2
0 A 7 11
1 B 4 11
2 C 8 4
3 A 5 8
4 B 7 3
5 C 3 2
In the above example, we calculated just a single aggregate (the mean). However, using the `aggregate()` method, we can compute multiple aggregates in a single command. The `aggregate()` method takes a string, a function, or a list of all the required aggregates to compute:
# passing a list of the aggregates we need for data1 and data2
df.groupby('key').aggregate([min,max,np.median,np.mean,np.std])
data1 data2
min max median mean std min max median mean std
key
A 5 7 6.0 6.0 1.414214 8 11 9.5 9.5 2.121320
B 4 7 5.5 5.5 2.121320 3 11 7.0 7.0 5.656854
C 3 8 5.5 5.5 3.535534 2 4 3.0 3.0 1.414214
Suppose we don't want to compute one type of aggregate for all columns; rather, we want to compute the median for `data1` and the mean for `data2`:
# we will pass a dictionary of key (column name) and value (aggregator)
df.groupby('key').aggregate({'data1':np.median,
'data2':np.mean})
data1 data2
key
A 6.0 9.5
B 5.5 7.0
C 5.5 3.0
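As an aside, newer versions of pandas (0.25+) also support named aggregation, where each output column gets an explicit name from a `(column, aggregate)` pair:

```python
# named aggregation: each output column is named by its keyword argument
df.groupby('key').agg(data1_median=('data1', 'median'),
                      data2_mean=('data2', 'mean'))
```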
➞ Now that we grasp the concept of the `aggregate()` method, let's apply it to our Titanic data:
titanic.groupby('sex')[['age', 'fare']].aggregate([min, max, np.mean])
age fare
min max mean min max mean
sex
female 0.75 63.0 27.915709 6.75 512.3292 44.479818
male 0.42 80.0 30.726645 0.00 512.3292 25.523893
A filtering operation allows us to drop data based on some group property; the filter is specified as a function that returns True or False for each group. Let's apply `filter` to keep only the rows whose group has a standard deviation of `data1` greater than 2:
df.groupby('key').filter(lambda x: x['data1'].std() > 2)
key data1 data2
1 B 4 11
2 C 8 4
4 B 7 3
5 C 3 2
Let's find the standard deviation for each key, so we can confirm that key `A` was dropped because the standard deviation of its `data1` is less than 2:
print(df.groupby('key').std())
data1 data2
key
A 1.414214 2.121320
B 2.121320 5.656854
C 3.535534 1.414214
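The same filtering can also be expressed as a boolean mask: `transform` (covered next) broadcasts the per-group statistic back onto every row, so we can index the original DataFrame directly. A small equivalent sketch:

```python
# equivalent to the filter above: keep rows whose group's std of data1 > 2
df[df.groupby('key')['data1'].transform('std') > 2]
```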
In the above example, when we applied the `aggregate` function, we ended up with a reduced version of the data (we went from 6 rows down to 3). The `transform` method, by contrast, returns a transformed version of the full data: the output has the same shape as the input. Suppose we would like to broadcast the group-wise sums of `data1` and `data2` onto each row:
df_new_col = df.groupby('key').transform('sum')
print(df_new_col)
data1 data2
0 12 19
1 11 14
2 11 6
3 12 19
4 11 14
5 11 6
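A common use of `transform` is to standardize values within each group; a minimal sketch:

```python
# group-wise z-score: subtract the group mean, divide by the group std
df.groupby('key')[['data1', 'data2']].transform(lambda x: (x - x.mean()) / x.std())
```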
The `apply()` method applies an arbitrary function to each group and combines the results. In the example below, we add another column, `data3`, which is the sum of the corresponding row values of the columns `data1` and `data2`:
def apply_func(x):
    # x is the sub-DataFrame for one group; add a derived column to it
    x['data3'] = x['data1'] + x['data2']
    return x
print(df.groupby('key').apply(apply_func))
key data1 data2 data3
0 A 7 11 18
1 B 4 11 15
2 C 8 4 12
3 A 5 8 13
4 B 7 3 10
5 C 3 2 5
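Note that `apply()` is more flexible than `transform()`: the function may also return a scalar per group, in which case the combined result is a Series indexed by the group keys. For example:

```python
# scalar per group: ratio of the group sums of data1 and data2
print(df.groupby('key').apply(lambda g: g['data1'].sum() / g['data2'].sum()))
```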