Indexing Pandas Series and DataFrame
Techniques learned in NumPy, such as indexing, slicing, fancy indexing, boolean masking, and combinations of these, also apply to Pandas Series and DataFrame objects.
A Series object acts in many ways like a one-dimensional NumPy array, and in many ways like a standard Python dictionary. Like a dictionary, a Series maps a collection of keys (its index) to a collection of values.
import numpy as np
import pandas as pd
# making Data Series
data_series = pd.Series([1,2,3,4,5],
index=['a','b','c','d','e'])
data_series
a 1
b 2
c 3
d 4
e 5
dtype: int64
- We can use dictionary-like Python expressions, for example testing whether a key is in the index
'a' in data_series
True
- We can fetch the index of a Series object using the .keys() method
data_series.keys()
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
- We can fetch (index, value) pairs using the .items() method
list(data_series.items())
[('a', 1), ('b', 2), ('c', 3), ('d', 4), ('e', 5)]
- Just like a Python dictionary, we can extend a Pandas Series by assigning a value to a new index label
data_series['f'] = 6
data_series
a 1
b 2
c 3
d 4
e 5
f 6
dtype: int64
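The dictionary analogy can be sketched in a few lines. This is a minimal illustration, not part of the original walkthrough; the name `scores` and its values are made up for the example.

```python
import pandas as pd

# A Series can be built directly from a dict: the keys become the index
scores = pd.Series({'a': 1, 'b': 2, 'c': 3})

# Membership tests look at the index (the keys), not at the values
has_label = 'b' in scores      # 'b' is an index label
has_value = 3 in scores        # 3 is only a value, not a label

# Assigning to an existing label overwrites it;
# assigning to a new label extends the Series
scores['a'] = 10
scores['d'] = 4
```

As with a dictionary, assignment either updates or inserts depending on whether the key already exists.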
We can perform the same operations on a Series object as we do on NumPy arrays: indexing, slicing, masking, and fancy indexing.
- Indexing by providing the explicit index (a string, in our case)
data_series['d']
4
- Slicing with strings as the index. ALERT: when you slice with an explicit index (i.e., data_series[:'d']), the stop index is included in the slice
data_series[:'d']
a 1
b 2
c 3
d 4
dtype: int64
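- Indexing by providing the implicit (integer) index. Note that recent pandas releases deprecate this positional fallback on a label-indexed Series; .iloc, covered below, is the explicit way to index by position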
data_series[0]
1
- Slicing by providing the implicit (integer) index. ALERT: the stop index is not included in the output
data_series[1:3]
b 2
c 3
dtype: int64
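The difference between the two slicing behaviours is easier to see side by side. The following is a small sketch under the same setup as above; the name `letters` is illustrative only.

```python
import pandas as pd

letters = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

# Label-based slice: the stop label 'd' is part of the result
by_label = letters['b':'d']

# Position-based slice: the stop position 3 is not part of the result
by_position = letters[1:3]

label_keys = list(by_label.index)        # labels b, c, d
position_keys = list(by_position.index)  # labels b, c
```

Both slices start at the same element, yet the label-based one is one item longer because its end point is inclusive.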
- In masking, we pass a boolean array inside [] to get a subset of the Series. This boolean array can be the result of a conditional expression. For masking, we can pass a single condition or a group of conditions, as the examples below show:
# a conditional expression that results in a boolean Series
data_series > 3
a False
b False
c False
d True
e True
f True
dtype: bool
# boolean masking
data_series[(data_series > 3)]
d 4
e 5
f 6
dtype: int64
# another masking example with multiple conditions
data_series[(data_series > 0) & (data_series <4)]
a 1
b 2
c 3
dtype: int64
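Conditions can also be combined with `|` (or) and inverted with `~` (not). Each comparison needs its own parentheses because `&`, `|` and `~` bind more tightly than comparison operators in Python. A short sketch, using an illustrative Series named `values`:

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6], index=list('abcdef'))

# OR: keep the values at either end of the range
extremes = values[(values < 2) | (values > 5)]

# NOT: invert a mask with ~
not_small = values[~(values <= 3)]

# A mask is an ordinary boolean Series, so it can be stored and reused
is_even = values % 2 == 0
evens = values[is_even]
```

Storing a mask under a descriptive name, as with `is_even`, tends to keep longer filters readable.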
- Fancy indexing fetches values at arbitrary index points, as opposed to simple slicing, which fetches values in a regular order ([1:10] or [::2], for example)
# fetch first and last item of the Series
data_series[[0,-1]]
a 1
f 6
dtype: int64
# fetch the values at labels 'a' and 'e'
data_series[['a','e']]
a 1
e 5
dtype: int64
PROBLEM:
- The slicing examples above show how mixing explicit and implicit indexing can be confusing. This is especially true when the index labels are themselves integers.
- For example, if your Series has an explicit integer index, an indexing operation such as data[1] will use the explicit index, that is, fetch the value labeled 1 and not the second item as implicit indexing would. However, a slicing operation like data[1:3] will use implicit Python-style slicing, that is, fetch the 2nd and 3rd items in the Series object
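The mismatch described above can be reproduced with a tiny Series. This is a sketch with made-up data; the labels 1, 3 and 5 are chosen so that labels and positions never coincide.

```python
import pandas as pd

labelled = pd.Series(['x', 'y', 'z'], index=[1, 3, 5])

# Single integer: treated as a label, so this is the FIRST item
by_label = labelled[1]

# Integer slice: treated as positions, so this is the 2nd and 3rd items
by_slice = labelled[1:3]

slice_labels = list(by_slice.index)   # the labels of the sliced items
```

The same number `1` means "label 1" in the first expression and "position 1" in the second, which is the source of the confusion.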
SOLUTION:
- Because of this potential confusion in the case of integer indexes, Pandas provides some special indexer attributes that explicitly expose certain indexing schemes:
# first, make a pd.Series where the confusion can happen
pd_series = pd.Series([10,20,30,40,50],
index=[1,2,3,4,5])
pd_series
1 10
2 20
3 30
4 40
5 50
dtype: int64
# Now suppose you want the second item, at position 1.
# [1] is treated as an explicit index label,
# so it gives us the first item instead
pd_series[1]
10
.loc always references the explicit index scheme. It is an indexer attribute used with square brackets, not a method call
pd_series.loc[1]
10
.iloc always references the implicit (positional) index scheme
pd_series.iloc[1]
20
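The two indexers also differ for slices and for lists of indices. Below is a minimal sketch on the same kind of integer-labeled Series; the name `numbers` is illustrative.

```python
import pandas as pd

numbers = pd.Series([10, 20, 30, 40, 50], index=[1, 2, 3, 4, 5])

# Slices: .loc includes the stop label, .iloc excludes the stop position
label_slice = numbers.loc[1:3]         # labels 1, 2, 3
position_slice = numbers.iloc[1:3]     # positions 1, 2

# Fancy indexing: lists of labels versus lists of positions
label_pick = numbers.loc[[1, 5]]       # labels 1 and 5
position_pick = numbers.iloc[[0, -1]]  # first and last positions
```

Writing `.loc` or `.iloc` every time makes the intended scheme visible in the code, regardless of what type the index labels are.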
A DataFrame object acts in many ways like a two-dimensional NumPy array, and in many ways like a dictionary of related Series objects. We will see how:
DataFrame as a dictionary of related Series objects
# reproducing the data series we constructed earlier
# reproducing population dictionary
population_dict = {'California': 38332521,
'Texas': 26448193,
'New York': 19651127,
'Florida': 19552860,
'Illinois': 12882135}
population_series = pd.Series(population_dict)
# making the area dictionary
area_dict = {'California': 423967,
'Texas': 695662,
'New York': 141297,
'Florida': 170312,
'Illinois': 149995}
area_series = pd.Series(area_dict)
states_dataframe = pd.DataFrame({'population': population_series,
'area': area_series})
states_dataframe
| | population | area |
|---|---|---|
| California | 38332521 | 423967 |
| Texas | 26448193 | 695662 |
| New York | 19651127 | 141297 |
| Florida | 19552860 | 170312 |
| Illinois | 12882135 | 149995 |
- Individual column data can be accessed via dictionary-style indexing
states_dataframe['population']
California 38332521
Texas 26448193
New York 19651127
Florida 19552860
Illinois 12882135
Name: population, dtype: int64
- We can also access a column's values by using the column name as an attribute
states_dataframe.population
California 38332521
Texas 26448193
New York 19651127
Florida 19552860
Illinois 12882135
Name: population, dtype: int64
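Attribute-style access is a convenience with some limits: it cannot be used when the column name is not a valid Python identifier, and it gives way to an existing DataFrame method of the same name. The sketch below uses two hypothetical column names, 'pop' and 'land area', chosen only to trigger those cases.

```python
import pandas as pd

# 'pop' collides with the DataFrame.pop() method;
# 'land area' contains a space, so frame.land area is not valid syntax
frame = pd.DataFrame({'pop': [1, 2, 3], 'land area': [4, 5, 6]})

# The attribute resolves to the method, not to the column
attribute_is_method = callable(frame.pop)

# Dictionary-style indexing works for any column name
pop_column = frame['pop']
area_column = frame['land area']
```

For this reason dictionary-style indexing is the safer default, especially when assigning to a column.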
- Dictionary-style syntax can also be used to modify the object or add a new column to a DataFrame object
states_dataframe['density'] = states_dataframe['population'] / states_dataframe['area']
states_dataframe
| | population | area | density |
|---|---|---|---|
| California | 38332521 | 423967 | 90.413926 |
| Texas | 26448193 | 695662 | 38.018740 |
| New York | 19651127 | 141297 | 139.076746 |
| Florida | 19552860 | 170312 | 114.806121 |
| Illinois | 12882135 | 149995 | 85.883763 |
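The .values attribute provides the underlying values of a DataFrame object as a NumPy array. The outputs shown from here on include only the population and area columns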
states_dataframe.values
array([[38332521, 423967],
[26448193, 695662],
[19651127, 141297],
[19552860, 170312],
[12882135, 149995]])
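Because a NumPy array has a single dtype, the array returned for a DataFrame with mixed column types is upcast to a common type. The sketch below illustrates this with `.to_numpy()`, the method the pandas documentation recommends over `.values`; the frame and its column names are invented for the example.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'count': [1, 2], 'ratio': [0.5, 1.5]},
                     index=['x', 'y'])

# A single integer column keeps an integer dtype
ints_only = frame[['count']].to_numpy()

# Integer and float columns together are upcast to one float dtype
mixed = frame.to_numpy()

# Rows are reached by position, as with any two-dimensional array
first_row = mixed[0]
```

This is why an integer column can come back as floats once a float column such as a computed ratio sits next to it.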
The .T attribute transposes a DataFrame object (columns become rows, and rows become columns)
states_dataframe.T
| | California | Texas | New York | Florida | Illinois |
|---|---|---|---|---|---|
| population | 38332521 | 26448193 | 19651127 | 19552860 | 12882135 |
| area | 423967 | 695662 | 141297 | 170312 | 149995 |
states_dataframe.values[0]
array([38332521, 423967])
💡 Remember that [] indexing applies to column labels in a DataFrame object, as opposed to row labels in a Series object
states_dataframe['population']
California 38332521
Texas 26448193
New York 19651127
Florida 19552860
Illinois 12882135
Name: population, dtype: int64
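A quick way to see the column-first behaviour is to pass a row label to `[]`. The sketch below rebuilds a three-row version of the states table; it also shows that slices and boolean masks inside `[]` select rows, which is a convention of pandas rather than something implied by the single-label case.

```python
import pandas as pd

states = pd.DataFrame({'population': [38332521, 26448193, 19651127],
                       'area': [423967, 695662, 141297]},
                      index=['California', 'Texas', 'New York'])

column = states['area']              # a column label: works

try:
    states['Texas']                  # a row label is not a column
    row_lookup_failed = False
except KeyError:
    row_lookup_failed = True

row = states.loc['Texas']            # single rows are reached through .loc

# Slices and boolean masks inside [] operate on rows
first_two = states[:2]
large = states[states['area'] > 400000]
```

A single label means "column", while a slice or a mask means "rows", so `.loc` and `.iloc` are the less ambiguous choice when selecting rows.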
.loc always references the explicit index scheme
states_dataframe.loc['New York']
population 19651127
area 141297
Name: New York, dtype: int64
states_dataframe.loc[:'New York']
| | population | area |
|---|---|---|
| California | 38332521 | 423967 |
| Texas | 26448193 | 695662 |
| New York | 19651127 | 141297 |
# selection on both rows and columns
states_dataframe.loc[:'New York',:'area']
| | population | area |
|---|---|---|
| California | 38332521 | 423967 |
| Texas | 26448193 | 695662 |
| New York | 19651127 | 141297 |
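`.loc` also accepts a boolean mask for the rows and a list of labels for the columns, which combines masking and fancy indexing in one expression. A minimal sketch, rebuilding the states table with its density column:

```python
import pandas as pd

states = pd.DataFrame(
    {'population': [38332521, 26448193, 19651127, 19552860, 12882135],
     'area': [423967, 695662, 141297, 170312, 149995]},
    index=['California', 'Texas', 'New York', 'Florida', 'Illinois'])
states['density'] = states['population'] / states['area']

# Rows where density exceeds 100, restricted to two named columns
dense = states.loc[states['density'] > 100, ['population', 'density']]

dense_states = list(dense.index)
dense_columns = list(dense.columns)
```

The threshold of 100 is arbitrary and only serves to keep a subset of the rows.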
.iloc always references the implicit (positional) index scheme
states_dataframe.iloc[2]
population 19651127
area 141297
Name: New York, dtype: int64
states_dataframe.iloc[:3]
| | population | area |
|---|---|---|
| California | 38332521 | 423967 |
| Texas | 26448193 | 695662 |
| New York | 19651127 | 141297 |
states_dataframe.iloc[:3,:1]
| | population |
|---|---|
| California | 38332521 |
| Texas | 26448193 |
| New York | 19651127 |
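The same selection can usually be written with either indexer, as long as the inclusive and exclusive end points are accounted for. The sketch below checks this on a rebuilt copy of the states table.

```python
import pandas as pd

states = pd.DataFrame(
    {'population': [38332521, 26448193, 19651127, 19552860, 12882135],
     'area': [423967, 695662, 141297, 170312, 149995]},
    index=['California', 'Texas', 'New York', 'Florida', 'Illinois'])
states['density'] = states['population'] / states['area']

# Positions: rows 0..2 and columns 0..1 (stop positions excluded)
by_position = states.iloc[:3, :2]

# Labels: rows up to 'New York' and columns up to 'area' (stops included)
by_label = states.loc[:'New York', :'area']

same_selection = by_position.equals(by_label)
```

Stop position 3 corresponds to stop label 'New York' at position 2, which reflects the exclusive versus inclusive rule for the two schemes.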