Combining Datasets In Pandas
We will start with basic examples of concatenation of
Series
and DataFrames
objects, with the pd.concat
function; later we will dive into more sophisticated in-memory merge
and join
implemented in Pandas.import numpy as np
import pandas as pd
We will first define a function that will be used to make the DataFrame from letters and numbers, with fewer keystrokes and help us keep the code clean:
def make_df(columns, indices):
data = {c: [f"{c}{i}" for i in indices]
for c in columns}
return pd.DataFrame(data, indices)
# understanding the components of function
data = {c: [f"{c}{i}" for i in range(3)]
for c in 'ABC'}
data
{'A': ['A0', 'A1', 'A2'], 'B': ['B0', 'B1', 'B2'], 'C': ['C0', 'C1', 'C2']}
# example to run function
example_df = make_df('ABC', [1,2,3])
print(example_df)
A B C
1 A1 B1 C1
2 A2 B2 C2
3 A3 B3 C3
Pandas has a function,
pd.concat()
, which has a similar syntax to np.concatenate
but contains a number of options Syntax for Pandas concat function:pd.concat(objs, paramters)
Parameter | Default |
---|---|
axis= | 0 |
join= | ‘outer’ |
join_axes= | None |
ignore_index= | False |
keys= | None |
levels= | None |
names= | None |
verify_integrity= | False |
copy | True |
# concatenating along columns (stacking one on top of other),
# because default value of keyword argument, index=0
ser1 = pd.Series(['A','B','C'], index=[1,2,3])
ser2 = pd.Series(['X','Y','Z'], index=[4,5,6])
pd.concat([ser1,ser2])
1 A
2 B
3 C
4 X
5 Y
6 Z
dtype: object
→ In this first example, for both DataFrame objects that we are going to concatenate, column names are the same i.e, A,B and C and indices are different. Because we are going to use
index=0
default value which will concatenate along columns:# concatenating along columns (stacking one on top of other),
# because default value of keyword argument, index=0
df1 = make_df('ABC',[1,2,3])
df2 = make_df('ABC',[4,5,6])
print(df1)
print(df2)
print(pd.concat([df1,df2]))
A B C
1 A1 B1 C1
2 A2 B2 C2
3 A3 B3 C3
A B C
4 A4 B4 C4
5 A5 B5 C5
6 A6 B6 C6
A B C
1 A1 B1 C1
2 A2 B2 C2
3 A3 B3 C3
4 A4 B4 C4
5 A5 B5 C5
6 A6 B6 C6
→ In this second example, for both DataFrame objects that we are going to concatenate, column names are the different and indices are same (1,2). Because we are going to use
index=1
which will concatenate along rows:# concatenating along rows
# keyword argument, axis=1
df3 = make_df(['A','B'], [1,2])
df4 = make_df(['C','D'], [1,2])
print(df3); print(df4); print(pd.concat([df3,df4], axis=1))
A B
1 A1 B1
2 A2 B2
C D
1 C1 D1
2 C2 D2
A B C D
1 A1 B1 C1 D1
2 A2 B2 C2 D2
One important difference between
np.concatenate
and pd.concat
is that Pandas concatenation preserves indices, even if the result will have duplicate indices# making two series with same index
sr3 = pd.Series(['A','B'], index=[1,2])
sr4 = pd.Series(['A','B'], index=[1,2])
pd.concat([sr3,sr4])
1 A
2 B
1 A
2 B
dtype: object
pd.concat()
gives us a few ways to handle the repeated indices issuevarify_integrity=True
checks whether the new concatenated axis contains duplicates. If yes, it will raise the ValueEror
pd.concat([sr3,sr4], verify_integrity=True)
ValueError: Indexes have overlapping values: Int64Index([1, 2], dtype='int64')
We can use
keys=[]
kwarg to form Multi-index Series or DataFramepd.concat([sr3,sr4], keys=['y','z'])
y 1 A
2 B
z 1 A
2 B
dtype: object
In the examples above, we discussed cases of concatenating DataFrames with shared column names. In practice, data from different sources might have different sets of column names, and
pd.concat
offers several options to handle this.# look what happens when we concat the two df with not identical columns
df5 = make_df('ABC',[1,2])
df6 = make_df('BCD', [3,4])
pd.concat([df5,df6], sort=False)
#remember the default value of kwarg, join=outer
A B C D
1 A1 B1 C1 NaN
2 A2 B2 C2 NaN
3 NaN B3 C3 D3
4 NaN B4 C4 D4
join='inner'
: joins intersection of columns in the DataFrames
print(pd.concat([df5,df6], join='inner'))
B C
1 B1 C1
2 B2 C2
3 B3 C3
4 B4 C4
Series and DataFrame objects have an
.append()
method that can accomplish the same behaviour like .concat()
but in fewer keystrokes Unlike the append()
and extend()
methods of Python lists, the append()
method in Pandas does not modify the original objectprint(df1.append(df2))
A B C
1 A1 B1 C1
2 A2 B2 C2
3 A3 B3 C3
4 A4 B4 C4
5 A5 B5 C5
6 A6 B6 C6
We will very briefly discuss
pd.join
, but the discussion is mainly focused on pd.merge
functionA note on Relational Algebra
pd.merge()
works in a manner that is considered to be a subset of what is known as relational algebra — formal set of rules for manipulating relational data. Pandas implements several of these fundamental building blocks in the pd.merge()
function and the related join()
method of Series and DataFrames.The
pd.merge()
function implements a number of types of joins: the one-to-one, many-to-one, and many-to-many joins — the type of join performed depends on the form of the input data# creating two DataFrames with one identical column name
df11 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df12 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
'hire_date': [2004, 2008, 2012, 2014]})
print(df11)
print(df12)
employee group
0 Bob Accounting
1 Jake Engineering
2 Lisa Engineering
3 Sue HR
employee hire_date
0 Lisa 2004
1 Bob 2008
2 Jake 2012
3 Sue 2014
When we merge these two DataFrame object,
df11
and df12
, the pd.merge()
function recognizes that each DataFrame has a common employee
column, and automatically joins using this column as a keydf13 = pd.merge(df11,df12)
print(df13)
employee group hire_date
0 Bob Accounting 2008
1 Jake Engineering 2012
2 Lisa Engineering 2004
3 Sue HR 2014
It is type of Join in which one of the two matching key columns contains duplicate entries. For the many-to-one case, the resulting DataFrame will preserve those duplicate entries as appropriate:
df14 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
'supervisor': ['Carly', 'Guido', 'Steve']})
print(df13)
print(df14)
employee group hire_date
0 Bob Accounting 2008
1 Jake Engineering 2012
2 Lisa Engineering 2004
3 Sue HR 2014
group supervisor
0 Accounting Carly
1 Engineering Guido
2 HR Steve
When we merge these two DataFrame object,
df13
and df14
, the pd.merge()
function recognizes the common column group
and automatically joins using this column as a key. However, group
column under df13
contain 4 values (3 unique), while under df14
contain 3 values (all 3 unique) In this case, the many-to-one use same supervisor
value for both Engineering
rowprint(pd.merge(df13,df14))
employee group hire_date supervisor
0 Bob Accounting 2008 Carly
1 Jake Engineering 2012 Guido
2 Lisa Engineering 2004 Guido
3 Sue HR 2014 Steve
If the key column in both the left and right DataFrame contains duplicates, then the result is a many-to-many merge. Example makes understanding it easier:
df15 = pd.DataFrame({'group': ['Accounting', 'Accounting', 'Engineering', 'Engineering', 'HR', 'HR'],
'skills': ['math', 'spreadsheets', 'coding', 'linux', 'spreadsheets', 'organization']})
print(df11); print(df15)
employee group
0 Bob Accounting
1 Jake Engineering
2 Lisa Engineering
3 Sue HR
group skills
0 Accounting math
1 Accounting spreadsheets
2 Engineering coding
3 Engineering linux
4 HR spreadsheets
5 HR organization
When we merge these two DataFrame objects,
df11
and df15
, the pd.merge()
function recognizes the common column group
and automatically joins using this column as a key. However, there are duplicates values under group
for both DataFrame object so the join would be many-to-many join. Bob
is related to accounting
group
but accounting
has two skills
, so both will have their own row. Same logic goes for Jake
, Lisa
under engineering
print(pd.merge(df11,df15))
employee group skills
0 Bob Accounting math
1 Bob Accounting spreadsheets
2 Jake Engineering coding
3 Jake Engineering linux
4 Lisa Engineering coding
5 Lisa Engineering linux
6 Sue HR spreadsheets
7 Sue HR organization
Often the column names will not match so nicely as we have seen in above examples, but
pd.merge()
provides a variety of options to handle this scenario and explicitly tell to merge which column/key.Explicitly specify the name of the key column using the
on=
keyword argument, which takes a column name or a list of column namesprint(pd.merge(df11,df12, on='employee'))
employee group hire_date
0 Bob Accounting 2008
1 Jake Engineering 2012
2 Lisa Engineering 2004
3 Sue HR 2014
For example, we may have a dataset in which the employee name is labeled as “name” rather than “employee” but the values are same inside i.e, just the column label is different. In such case, we can specify which column to merge
left_on=
and which to merge on right_on=
Example will make it clearer:df16 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
'salary': [70000, 80000, 120000, 90000]})
print(df16)
name salary
0 Bob 70000
1 Jake 80000
2 Lisa 120000
3 Sue 90000
df17 = pd.merge(df11,df16, left_on='employee', right_on='name')
print(df17)
employee group name salary
0 Bob Accounting Bob 70000
1 Jake Engineering Jake 80000
2 Lisa Engineering Lisa 120000
3 Sue HR Sue 90000
What to do if we want to delete a specified label from an index or column? use
.drop()
method. There are two ways to implement this:- Specify the label names and
axis
(0 if along row, 1 if along column) - Directly specify the
columns
orindex
name to remove the column or row, respectively.
This is the example of the first case — specifying the label name and axis:
print(df17.drop('name', axis=1))
employee group salary
0 Bob Accounting 70000
1 Jake Engineering 80000
2 Lisa Engineering 120000
3 Sue HR 90000
Sometimes, rather than merging on a column, we would instead like to merge on an index.
left_index=
and right_index=
keyword arguments, takes in boolean value of True or False.set_index()
method will set the provided column as the index of a DataFrame# first let set_index of df11 and df12 so we can merge on rows
df11a = df11.set_index('employee')
df12a = df12.set_index('employee')
print(df11a)
print(df12a)
group
employee
Bob Accounting
Jake Engineering
Lisa Engineering
Sue HR
hire_date
employee
Lisa 2004
Bob 2008
Jake 2012
Sue 2014
print(pd.merge(df11a, df12a, right_index=True, left_index=True))
group hire_date
employee
Bob Accounting 2008
Jake Engineering 2012
Lisa Engineering 2004
Sue HR 2014
To join by indices we can also use
.join
method, which by-default merges by indices. Here is an example:print(df11a.join(df12a))
group hire_date
employee
Bob Accounting 2008
Jake Engineering 2012
Lisa Engineering 2004
Sue HR 2014
print(df11a)
print(df16)
group
employee
Bob Accounting
Jake Engineering
Lisa Engineering
Sue HR
name salary
0 Bob 70000
1 Jake 80000
2 Lisa 120000
3 Sue 90000
We would like to merge DataFrame
df11a
and df16
However, the common key is index, employee
from df11a
and column, name
from df16
Therefore, we will use left_index=True
for df11a
and right_on='name'
for df16
print(pd.merge(df11a, df16, left_index=True, right_on='name'))
group name salary
0 Accounting Bob 70000
1 Engineering Jake 80000
2 Engineering Lisa 120000
3 HR Sue 90000
The
how
keyword argument defines the type of merge to perform, which takes one of these four values:how=inner
which is default value and keeps the intersection of keys. It preserve the keys order of the DataFrame mentioned on the lefthow=outer
keeps the union of the keyshow=left
keeps the keys of DataFrame mentioned on the left sidehow=right
keeps the keys of DataFrame mentioned on the right side
# starting by defining dataframes
df18 = pd.DataFrame({'name': ['Parker', 'Piper', 'Mia'],
'food': ['chicken', 'rice', 'bread']}, columns=['name', 'food'])
df19 = pd.DataFrame({'name': ['Mia', 'Justin'],
'drink': ['soda', 'water']}, columns=['name', 'drink'])
print(df18)
print(df19)
name food
0 Parker chicken
1 Piper rice
2 Mia bread
name drink
0 Mia soda
1 Justin water
print(pd.merge(df18, df19))
name food drink
0 Mia bread soda
Let change the order of frames in the
pd.merge()
print(pd.merge(df19, df18))
name drink food
0 Mia soda bread
print(pd.merge(df18,df19, how='outer'))
name food drink
0 Parker chicken NaN
1 Piper rice NaN
2 Mia bread soda
3 Justin NaN water
print(pd.merge(df18,df19, how='left'))
name food drink
0 Parker chicken NaN
1 Piper rice NaN
2 Mia bread soda
print(pd.merge(df18,df19, how='right'))
name food drink
0 Mia bread soda
1 Justin NaN water
We may end up in a situation where our two input DataFrames have two or more same labeled column. We can use
on=
to specify column name to merge on.df20 = pd.DataFrame({'name': ['A','B','C'],
'rank': [1,2,3]})
df21 = pd.DataFrame({'name':['C','A','B'],
'rank':[1,2,3]})
print(pd.merge(df20,df21, on='name'))
name rank_x rank_y
0 A 1 2
1 B 2 3
2 C 3 1
Instead of appending the column names with _x and _y we can give our own keywords, using argument
suffixes=[]
print(pd.merge(df20,df21, on='name', suffixes=['_bio','_stats']))
name rank_bio rank_stats
0 A 1 2
1 B 2 3
2 C 3 1
Last modified 6mo ago