Hierarchical Indexing In Pandas
While Pandas does provide Panel and Panel4D objects to natively handle three-dimensional and four-dimensional data, a far more common practice is to use hierarchical indexing (also known as multi-indexing) to incorporate multiple index levels within a single index. Doing this, higher-dimensional data can be compactly represented with the familiar one-dimensional Series and two-dimensional DataFrame objects
import numpy as np
import pandas as pd
First, let’s create a multi-index data from the tuples as follows:
# defining multi-index in form of tuples
index = [('City A', 2018),('City A', 2019),
('City B', 2018), ('City B', 2019),
('City C', 2018), ('City C', 2019)]
index
[('City A', 2018),
('City A', 2019),
('City B', 2018),
('City B', 2019),
('City C', 2018),
('City C', 2019)]
Second, provide the above multi-index data to Pandas
pd.MultiIndex.from_tuples()
function:# creating multi-index in Pandas from Tuples (we created above)
index_pd = pd.MultiIndex.from_tuples(index)
index_pd
MultiIndex([('City A', 2018),
('City A', 2019),
( 'City B', 2018),
( 'City B', 2019),
('City C', 2018),
('City C', 2019)],
)
Third, define the data,
pop
for our multi-index series, in the form of list
:# Defining data for the multi-index
pop = [33871648, 37253956, 18976457, 19378102, 20851820, 25145561]
Fourth, use
pd.Series
constructor with data and index as arguments:# Linking multi-index with population data
pop = pd.Series(pop, index=index_pd)
pop
City A 2018 33871648
2019 37253956
City B 2018 18976457
2019 19378102
City C 2018 20851820
2019 25145561
dtype: int64
- In the above example, the first two columns of the Series representation show the multiple index values, while the third column shows the data
- Notice that some entries are missing in the first column: in this multi-index representation, any blank entry indicates the same value as the line above it
➞ Indexing and Slicing syntax will be the same as we covered in Indexing Pandas Series and DataFrame
Population of City A, for all years:
pop['City A']
2018 33871648
2019 37253956
dtype: int64
Population of all cities, for year 2018
pop[:,2018]
City A 33871648
City B 18976457
City C 20851820
dtype: int64
Population of City A for year 2018
pop['City A',2018]
33871648
We could easily have stored the same data using a simple DataFrame with index and column labels. The
unstack()
method will quickly convert a multi-indexed Series into a conventionally indexed DataFramepop_df = pop.unstack()
print(pop_df)
year 2018 2019
city
City B 18976457 19378102
City A 33871648 37253956
City C 20851820 25145561
The
stack()
method provides the opposite operation than unstack()
— converts DataFrame to multi-indexed Seriespop_df = pop_df.stack()
pop_df
City B 2018 18976457
2019 19378102
City A 2018 33871648
2019 37253956
City C 2018 20851820
2019 25145561
dtype: int64
Just as we were able to use multi-indexing to represent two-dimensional DataFrame within a one-dimensional Series, we can also use it to represent data of three or more dimensions in a Series or DataFrame. Each extra level in a multi-index represents an extra dimension of data.
pop_df = pd.DataFrame({'total': pop,
'under18': [9267089, 9284094, 4687374, 4318033, 5906301, 6879014]})
print(pop_df)
total under18
city year
City B 2018 18976457 9267089
2019 19378102 9284094
City A 2018 33871648 4687374
2019 37253956 4318033
City C 2018 20851820 5906301
2019 25145561 6879014
Let’s find the percentage of under 18 population in each city, each year:
pop_u18_percent = pop_df['under18'] / pop_df['total']
pop_u18_percent
City A 2018 0.273594
2019 0.249211
City B 2018 0.247010
2019 0.222831
City C 2018 0.283251
2019 0.273568
dtype: float64
print(pop_u18_percent.unstack())
year 2018 2019
city
City B 0.247010 0.222831
City A 0.273594 0.249211
City C 0.283251 0.273568
In Section 1, we studied one way to create a multi-index object using tuples. In this section, we will study various methods/techniques for creating multi-index object and using it to create Series and DataFrame:
➞ The most straightforward way to construct a multi-indexed Series or DataFrame is to simply pass a list of two or more index arrays to the
pd.Series()
or pd.DataFrame
constructor. The number of data points should be equal to number of indices.create_mi_df = pd.DataFrame(np.random.rand(4,2),
index=[['a','a','b','b'],[1,2,1,2]],
columns=['data1','data2'])
print(create_mi_df)
data1 data2
a 1 0.695555 0.776309
2 0.696634 0.502602
b 1 0.322619 0.127614
2 0.293457 0.415007
➞ We can also create multi-index Series by passing dictionary with appropriate tuples as keys, Pandas will automatically recognize the indices and data values:
data_for_series = {('California', 2000): 33871648,
('California', 2010): 37253956,
('Texas', 2000): 20851820,
('Texas', 2010): 25145561,
('New York', 2000): 18976457,
('New York', 2010): 19378102}
pd.Series(data_for_series)
California 2000 33871648
2010 37253956
Texas 2000 20851820
2010 25145561
New York 2000 18976457
2010 19378102
dtype: int64
We can use the class method available in the
pd.MultiIndex
pd.MultiIndex.from_arrays([['a','a','b','b'],[1,2,1,2]])
MultiIndex([('a', 1),
('a', 2),
('b', 1),
('b', 2)],
)
pd.MultiIndex.from_tuples([('a',1),('a',2),('b',1),('b',2)])
MultiIndex([('a', 1),
('a', 2),
('b', 1),
('b', 2)],
)
This one is easiest of all three, needs to input least amount of data:
pd.MultiIndex.from_product([['a','b'],[1,2]])
MultiIndex([('a', 1),
('a', 2),
('b', 1),
('b', 2)],
)
In this sub-section, we will learn various methods to name the multi-index:
In sub-section 2.1, we studied three explicit multi-index constructor. In them, we can provide keyword argument
names=[]
to define the name of each index level:pd.MultiIndex.from_arrays([['a','a','b','b'],[1,2,1,2]], names=['alpha','num'])
MultiIndex([('a', 1),
('a', 2),
('b', 1),
('b', 2)],
names=['alpha', 'num'])
If we have already created a multi-index Series or DataFrame object without index level names, we can use the method
index.names=[]
to explicitly set the names of each index level# reproducing Series that don't have index name
pop
City A 2018 33871648
2019 37253956
City B 2018 18976457
2019 19378102
City C 2018 20851820
2019 25145561
dtype: int64
# setting index level names
pop.index.names = ['city','year']
pop
city year
City A 2018 33871648
2019 37253956
City B 2018 18976457
2019 19378102
City C 2018 20851820
2019 25145561
dtype: int64
In a DataFrame, the rows and columns are completely symmetric, and just as the rows can have multiple levels of indices, the columns can also have multiple levels
# creating hierarchical (multiple) indices and columns
indices = pd.MultiIndex.from_product([[2018,2019],[1,2]],
names=['year','exam'])
columns = pd.MultiIndex.from_product([['Tom', 'Harry', 'John'], ['HR', 'Marketing']],
names=['student', 'marks'])
# mock data
# how do we know that we need to have 4*6 = 24 data points?
data = np.random.randint(50,100, size=(4,6))
# create Dataframe with multiple indices and colums
df_multi = pd.DataFrame(data,
index=indices,
columns=columns)
# result
print(df_multi)
student Tom Harry John
marks HR Marketing HR Marketing HR Marketing
year exam
2018 1 58 85 54 91 73 69
2 70 98 64 70 79 96
2019 1 67 79 87 79 81 70
2 81 72 90 88 80 66
Remember that[]
method of indexing and slicing on Series object, applies on the index labels
# multi-index series to perform indexing and slicing
pop
city year
City A 2018 33871648
2019 37253956
City B 2018 18976457
2019 19378102
City C 2018 20851820
2019 25145561
dtype: int64
Let’s use indexing techniques to answer some basic question about the above multi-indexed Series,
pop
→ What is population of City A in 2018
pop['City A', 2018]
33871648
→ What is population of City A in all available years
pop['City A']
year
2018 33871648
2019 37253956
dtype: int64
→ To use
.loc
, make sure that index is sorted. If it is not, use .index.sort()
on Series or DataFrame object. Now, let ask question, what is the population of City A and City B, in all available year# using .loc
pop.loc['City A':'City B']
city year
City A 2018 33871648
2019 37253956
City B 2018 18976457
2019 19378102
dtype: int64
→ We can also use the integer based indexing. Let’s fetch first two rows using
[:2]
# using integer value of index
pop[:2]
city year
City A 2018 33871648
2019 37253956
dtype: int64
→ What is population of all cities, for year 2018
pop[:,2018]
city
City A 33871648
City B 18976457
City C 20851820
dtype: int64
→ What is population of City A and City B in all available years:
# fancy indexing
# pay attention that we use the array here
pop[['City A','City B']]
city year
City B 2018 18976457
2019 19378102
City A 2018 33871648
2019 37253956
dtype: int64
Remember that[]
method of indexing and slicing on DataFrame object, applies on the column labels. Therefore to apply indexing on index level, we can use.iloc[]
andloc[]
# reproduing dataframe we will work on
print(df_multi)
student Tom Harry John
marks HR Marketing HR Marketing HR Marketing
year exam
2018 1 58 85 54 91 73 69
2 70 98 64 70 79 96
2019 1 67 79 87 79 81 70
2 81 72 90 88 80 66
Let’s use indexing techniques to answer some basic question about the above multi-indexed DataFrame,
df_multi
→ What are marks of student, Tom, in all the subjects , for all available years and exams:
print(df_multi['Tom'])
marks HR Marketing
year exam
2018 1 58 85
2 70 98
2019 1 67 79
2 81 72
→ Tom marks in HR, for all available years and exams:
df_multi['Tom','HR']
year exam
2018 1 81
2 80
2019 1 79
2 55
Name: (Tom, HR), dtype: int64
➞ Fetching first row of a multi-index DataFrame using
iloc[]
methodprint(df_multi.iloc[:1])
student Tom Harry John
marks HR Marketing HR Marketing HR Marketing
year exam
2018 1 58 85 54 91 73 69
➞ Fetching first two rows and first two columns using
iloc[,]
method. The integers for slicing that we provide before the ,
in iloc[ , ]
applies to row and after the ,
applies to columnprint(df_multi.iloc[:2,:2])
student Tom
marks HR Marketing
year exam
2018 1 58 85
2 70 98
→ We can also use the explicit values of index and column labels using
.loc[]
For example, let’s get score of all students, in all the subjects, for all the exams, but only in year 2018:print(df_multi.loc[2018])
student Tom Harry John
marks HR Marketing HR Marketing HR Marketing
exam
1 58 85 54 91 73 69
2 70 98 64 70 79 96
→ We can also use
.loc[ , ]
to slice at both index and column levels. Let’s fetch scores of Tom, in all subjects and all exams, but only in year 2018:print(df_multi.loc[2018,'Tom'])
marks HR Marketing
exam
1 58 85
2 70 98
We saw few examples of this concept, sub-section 1.1. under
stack()
and unstack()
methods, but there are many more ways to finely control the rearrangement of data between hierarchical indices and columnsWe can sort the index of a Series or DataFrame object using
.sort_index()
method:# defining index of Series
index_series = pd.MultiIndex.from_product([['c','a','b'],[1,2,3]])
# defining data
data_series = np.random.randint(100, size=9)
# constructing Series object
series_object = pd.Series(data_series, index=index_series)
# printing Series in unordered (original) form
print(series_object)
# printing Series in ordered (alphabetically) form
print(series_object.sort_index())
c 1 34
2 72
3 98
a 1 40
2 40
3 74
b 1 48
2 19
3 11
dtype: int64
a 1 40
2 40
3 74
b 1 48
2 19
3 11
c 1 34
2 72
3 98
dtype: int64
Earlier, in sub-section 1.1, we applied
stack
and unstuck
on Pandas Series object. Let’s us apply the same methods on DataFrame (in the example below, we intentionally edit the DataFrame by removing “John” so that the DataFrame is easy to read and understand)# reproducing multi-index DataFrame
print(df_multi)
student Tom Harry
marks HR Marketing HR Marketing
year exam
2018 1 65 69 87 84
2 76 98 53 87
2019 1 58 62 70 93
2 62 71 83 66
Let unstack the results, which by default applies to
level=-1
, i.e, the last index in the multi-index series.# unstack
df_multi_unstack = df_multi.unstack()
print(df_multi_unstack)
As we can see, the index name ‘exam’ is now unstacked and become part of another level in the column:
student Tom Harry
marks HR Marketing HR Marketing
exam 1 2 1 2 1 2 1 2
year
2018 65 76 69 98 87 53 84 87
2019 58 62 62 71 70 83 93 66
Let’s
unstack
with level=0
which will unstack the index, name year
into another level in the column# unstack with 'level=0'
print(df_multi.unstack(level=0))
student Tom Harry
marks HR Marketing HR Marketing
year 2018 2019 2018 2019 2018 2019 2018 2019
exam
1 65 58 69 62 87 70 84 93
2 76 62 98 71 53 83 87 66
Let’s stack one of the DataFrame columns into index. By default, it applies to the last level in the column, which is
exam
in our example:df_multi_unstack.stack()
student Harry Tom
marks HR Marketing HR Marketing
year exam
2018 1 87 84 65 69
2 53 87 76 98
2019 1 70 93 58 62
2 83 66 62 71
Let suppose, we would like to stack the
student
column instead of exam
To do that we can provide level=0
because student
is the at position of 0
print(df_multi_unstack.stack(level=0))
marks HR Marketing
exam 1 2 1 2
year student
2018 Harry 87 53 84 87
Tom 65 76 69 98
2019 Harry 70 83 93 66
Tom 58 62 62 71
Index to column: We can use
reset_index
method to turn the index labels into columns. We can also fine control the result using various parameters of this method :# reproducing multi-index series, pop
pop
city year
City B 2018 18976457
2019 19378102
City A 2018 33871648
2019 37253956
City C 2018 20851820
2019 25145561
dtype: int64
Let’s apply
reset_index()
which will turn the old indices into columns and new integer based sequential index is used:print(pop.reset_index())
city year 0
0 City B 2018 18976457
1 City B 2019 19378102
2 City A 2018 33871648
3 City A 2019 37253956
4 City C 2018 20851820
5 City C 2019 25145561
Last column has no name, so let’s give it a name to make the results both presentable and meaningful:
# let give name to the column
pop_resetindex = pop.reset_index(name='population')
print(pop_resetindex)
city year population
0 City B 2018 18976457
1 City B 2019 19378102
2 City A 2018 33871648
3 City A 2019 37253956
4 City C 2018 20851820
5 City C 2019 25145561
If we don’t want to reset all indices to columns, we can use the argument
level=
to fine tune our resultsColumn-to-index: We can use
set_index()
method to build a multi-index Series or DataFrame by providing the list of column labels that we would like to convert into indices:print(pop_resetindex.set_index(['city','year']))
population
city year
City B 2018 18976457
2019 19378102
City A 2018 33871648
2019 37253956
City C 2018 20851820
2019 25145561
In this section, we will perform
sum()
, mean()
, max()
kind of aggregation on multi-index DataFrame# reproducing multi-index dataframe
print(df_multi)
student Tom Harry John
marks HR Marketing HR Marketing HR Marketing
year exam
2018 1 58 85 54 91 73 69
2 70 98 64 70 79 96
2019 1 67 79 87 79 81 70
2 81 72 90 88 80 66
Let suppose we would like to find the mean scores in each year, for each subject and each student. To accomplish this, we will use the keyword argument
level=year
df_mean = df_multi.mean(level='year')
print(df_mean)
student Tom Harry John
marks HR Marketing HR Marketing HR Marketing
year
2018 64.0 91.5 59.0 80.5 76.0 82.5
2019 74.0 75.5 88.5 83.5 80.5 68.0
Let suppose we would like to find the mean scores each year for each subject, for all subjects and exams. To accomplish this, we will use two keyword arguments
level='marks
and axis=1
(to tell Pandas to look for level under column)print(df_mean.mean(axis=1, level='marks'))
marks HR Marketing
year
2018 66.333333 84.833333
2019 81.000000 75.666667
Last modified 4mo ago