Pandas Series and DataFrame Objects
In this first part, we introduce the two primary components of Pandas: the Series and DataFrame objects.
A Pandas Series is a one-dimensional array of indexed data.
- A Pandas Series is essentially a column
- Values inside a NumPy array have an implicitly defined integer index, whereas a Pandas Series has an explicitly defined index, which can be an integer or any other data type
- A Pandas Series object can be created from a list, an array, or a dictionary
The Pandas Series constructor has the following common parameters:
pd.Series(data=, index=, dtype=)
- data= keyword argument: the first keyword argument of the pd.Series() constructor is data=; however, we don't need to set it explicitly if we provide the data as the first argument
- index= keyword argument: the default index is the integers from 0 to n-1, where n is the number of elements in the Series. However, we can specify a custom index using the index= keyword argument. These integers or other values are collectively called the index of the Series, and each individual index element is called a label
- dtype= keyword argument: used to explicitly set the data type of the Series object
- Additional parameters include name and copy
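As a minimal sketch of the parameters listed above, the following builds a Series with a custom index, an explicit dtype, and a name. The variable name scores and the label values are illustrative assumptions, not part of the original tutorial.

```python
import pandas as pd

# data is passed positionally, so data= does not need to be spelled out
scores = pd.Series(
    [1, 2, 3],
    index=["a", "b", "c"],   # custom labels instead of the default 0..n-1
    dtype="float64",         # force floats even though the input is integers
    name="scores",           # optional name attached to the Series
)

print(scores)
print(scores.dtype, scores.name)
```

Setting dtype= here converts the integer input to floats at construction time, which is often simpler than converting afterwards.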
# importing pandas and numpy
import pandas as pd
import numpy as np
# creating panda series object
pd_series = pd.Series([0.25,0.50,0.75,1.0])
# printing panda series object
pd_series
0 0.25
1 0.50
2 0.75
3 1.00
dtype: float64
Here [0,1,2,3] is the sequence of index labels, shown alongside the sequence of values [0.25,0.50,0.75,1.0].
We can use the built-in attributes of the Pandas object to fetch these indices and values.
We use the .values attribute to get the values of a Series object:
# fetch values of given Series
pd_series.values
array([0.25, 0.5 , 0.75, 1. ])
We use the .index attribute to get the indices of a Series object:
# fetch indices of given Series
pd_series.index
RangeIndex(start=0, stop=4, step=1)
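The two accessors above can be inspected a little further. The sketch below, which reuses the same example data, checks what kind of object each one returns; the variable names values and index are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

pd_series = pd.Series([0.25, 0.50, 0.75, 1.0])

# .values and .index are attributes, so they are accessed without parentheses
values = pd_series.values
index = pd_series.index

print(type(values))   # a NumPy ndarray
print(type(index))    # a pandas index type (RangeIndex for the default index)
print(list(index))    # the labels as a plain Python list
```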
We will first create the Pandas Series object by providing the data in the form of an explicit list; the index is automatically set to the integers from 0 to n-1:
# creating Series object from list
pd.Series([1,2,3,4])
0 1
1 2
2 3
3 4
dtype: int64
We can also create the Series object by providing a previously defined 1D NumPy array:
# defining numpy array
arr = np.array([1,2,3,4])
# creating Series from Numpy Array
pd.Series(arr)
0 1
1 2
2 3
3 4
dtype: int64
Unlike a NumPy array, which has an implicit integer index, the index of a Pandas object can be of any data type (int, float, str, or a combination of them). Let's explicitly set a string-based index:
# string as index
data_index_string = pd.Series([0.25,0.50,0.75,1.0],
index=['w','x','y','z'])
data_index_string
w 0.25
x 0.50
y 0.75
z 1.00
dtype: float64
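To illustrate the "combination of them" case, here is a small sketch with labels of three different types in one index. The labels and the variable name mixed are hypothetical and chosen only for illustration.

```python
import pandas as pd

# labels of different types in a single index (hypothetical labels)
mixed = pd.Series([10, 20, 30], index=["a", 2, 3.5])

print(mixed.index)        # the index dtype is object because the labels are mixed
print(mixed.loc["a"])     # look up by string label
print(mixed.loc[3.5])     # look up by float label
```

Mixed-type indexes work, but a single label type is usually easier to reason about.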
A Pandas Series object can also be created from a dictionary. To understand the conceptual parallel, remember this:
- A dictionary is a structure that maps arbitrary keys to a set of arbitrary values
- A Series is a structure that maps typed keys to a set of typed values
# defining dictionary, key-value pairs
population_dict = {'California': 38332521,
'Texas': 26448193,
'New York': 19651127,
'Florida': 19552860,
'Illinois': 12882135}
# creating Series from dictionary
population_series = pd.Series(population_dict)
population_series
California 38332521
Texas 26448193
New York 19651127
Florida 19552860
Illinois 12882135
dtype: int64
Indexing: We can use index label to fetch the corresponding value
population_series['New York']
19651127
This is equivalent to using the implicit integer position. As New York is at position 2, we can also fetch its value by position. Recent pandas versions deprecate plain [] with an integer on a label-indexed Series, so we use .iloc:
population_series.iloc[2]
19651127
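A minimal sketch of the two access styles side by side, using the explicit .loc (label) and .iloc (position) accessors so that the intent is unambiguous. The shortened three-state dictionary is an assumption made for brevity.

```python
import pandas as pd

population_series = pd.Series(
    {"California": 38332521, "Texas": 26448193, "New York": 19651127}
)

by_label = population_series.loc["New York"]   # label-based lookup
by_position = population_series.iloc[2]        # position-based lookup (0-based)

print(by_label, by_position)
```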
In the following examples, we will see how we can use the index= keyword argument to construct the Series object from a subset of the data provided.
→ Using a scalar with an explicit index, where the index defines the number of scalar instances in the Series object. Look at the example below:
# using scalar
# number of instances of '10' is
# defined by index=
pd.Series(10,index=[1,2,3,4,5])
1 10
2 10
3 10
4 10
5 10
dtype: int64
→ Using a dictionary, but only a subset of it, by providing the index of the required values
# pd.Series() takes-in all dict values
# but Series will be made from only those values
# whose keys are explicitly mentioned in index=
pd.Series({'a':1, 'b':2, 'c':3}, index=['a','b'])
a 1
b 2
dtype: int64
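A related case worth sketching is what happens when index= asks for a label that is not a key of the dictionary. In the sketch below, the label 'd' and the variable name subset are assumptions for illustration; the missing entry is filled with NaN, which also turns the values into floats.

```python
import pandas as pd

# 'd' is requested in index= but is not a key of the dictionary
subset = pd.Series({"a": 1, "b": 2, "c": 3}, index=["a", "b", "d"])

print(subset)
print(subset.isna())   # True only for the label that had no matching key
```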
If a Series is analogous to a one-dimensional array with flexible indices, a DataFrame is analogous to a two-dimensional array with both flexible row indices and flexible column names. Just as you might think of a 2D array as an ordered sequence of aligned (sharing the same index) 1D columns, you can think of a DataFrame as a sequence of aligned (sharing the same index) Series objects.
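The idea of aligned Series objects can be sketched with two small Series that share labels but list them in a different order. The names height and weight and their values are made up for illustration.

```python
import pandas as pd

# two Series that share labels but list them in a different order
height = pd.Series([1.5, 1.8], index=["x", "y"])
weight = pd.Series([70, 55], index=["y", "x"])

# each Series becomes a column; rows are matched by label, not by position
df = pd.DataFrame({"height": height, "weight": weight})
print(df)
```

Row x ends up with the weight that was labeled x in the second Series, even though it was listed second there.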
The Pandas DataFrame constructor has the following common parameters:
pd.DataFrame(data=, index=, columns=, dtype=)
- The DataFrame constructor has essentially the same keyword arguments as the Pandas Series constructor
- However, a DataFrame can't be constructed from a scalar (single value) alone
- Besides, it also takes an additional columns= keyword argument, which represents the labels for the columns. The default value of columns is (0, 1, 2, …, n-1), where n is the number of columns
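One nuance about scalars is worth a short sketch. A bare scalar is rejected, but, as an assumption about current pandas behavior that is worth verifying against your installed version, a scalar is accepted when both index= and columns= are supplied, in which case it is broadcast to every cell. The labels r1, r2, c1, c2 are hypothetical.

```python
import pandas as pd

# a bare scalar gives the constructor no shape to work with
try:
    pd.DataFrame(5)
    scalar_alone_failed = False
except ValueError:
    scalar_alone_failed = True

# with both index= and columns= supplied, the scalar fills every cell
filled = pd.DataFrame(5, index=["r1", "r2"], columns=["c1", "c2"])

print(scalar_alone_failed)
print(filled)
```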
A DataFrame built from a plain list seems similar to the Series object we created earlier, but we can set the column label in a DataFrame object by using the columns= kwarg. In the absence of this kwarg, the label of the first column defaults to 0, as can be seen in the example below:
df_list = pd.DataFrame([1,2,3,4])
df_list
| | 0 |
---|---|
0 | 1 |
1 | 2 |
2 | 3 |
3 | 4 |
We can also explicitly set the label for column, as you can see in the example below:
df_list = pd.DataFrame([1,2,3,4], columns=['col1'])
df_list
| | col1 |
---|---|
0 | 1 |
1 | 2 |
2 | 3 |
3 | 4 |
We can use a 2D array to construct a DataFrame with more than one column. If we don't provide the columns= kwarg, the column labels default to (0, 1, 2, …, n-1). See the example below:
df_2darray = pd.DataFrame([[1,2],[3,4]])
df_2darray
| | 0 | 1 |
---|---|---|
0 | 1 | 2 |
1 | 3 | 4 |
However, we can also set custom (explicit) index and column names:
df_2darray_custom = pd.DataFrame([[1,2],[3,4]],
                                 index=['row1','row2'],
                                 columns=['col1','col2'])
df_2darray_custom
| | col1 | col2 |
---|---|---|
row1 | 1 | 2 |
row2 | 3 | 4 |
We can also create a DataFrame object from a previously defined Series object:
# defining Series object
pd_sr = pd.Series([100,200,300,400])
# constructing DataFrame from Series object
pd_df = pd.DataFrame(pd_sr)
pd_df
| | 0 |
---|---|
0 | 100 |
1 | 200 |
2 | 300 |
3 | 400 |
Let's explicitly set the index and column labels:
# defining Series object with index labels
pd_sr = pd.Series([100,200,300,400],
index=['a','b','c','d'])
# constructing DataFrame from Series object
# with custom columns labels
pd_df_custom = pd.DataFrame(pd_sr,
columns=['hundreds'])
pd_df_custom
| | hundreds |
---|---|
a | 100 |
b | 200 |
c | 300 |
d | 400 |
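As an alternative sketch of the same conversion, a Series also offers the to_frame() method, which produces a one-column DataFrame directly. This is a different route from the constructor call above, and the variable name pd_df_alt is an assumption.

```python
import pandas as pd

pd_sr = pd.Series([100, 200, 300, 400], index=["a", "b", "c", "d"])

# to_frame() converts the Series into a one-column DataFrame;
# name= sets the column label, and the Series index becomes the row index
pd_df_alt = pd_sr.to_frame(name="hundreds")
print(pd_df_alt)
```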
In a dictionary key-value pair, the value can be another dictionary. We will use this concept to construct our DataFrame object. Pay particular attention to how the keys are used to assign the index and columns values of the DataFrame:
# reproducing population dictionary
population_dict = {'California': 38332521,
'Texas': 26448193,
'New York': 19651127,
'Florida': 19552860,
'Illinois': 12882135}
# making the area dictionary
area_dict = {'California': 423967,
'Texas': 695662,
'New York': 141297,
'Florida': 170312,
'Illinois': 149995}
# constructing DataFrame using dictionaries
states = pd.DataFrame({'population': population_dict,
'area': area_dict})
states
- The keys provided to pd.DataFrame() are used as column labels
- The keys of the nested dictionaries are used as index labels
| | population | area |
---|---|---|
California | 38332521 | 423967 |
Texas | 26448193 | 695662 |
New York | 19651127 | 141297 |
Florida | 19552860 | 170312 |
Illinois | 12882135 | 149995 |
In the following example, dictionaries are nested inside a list, and we provide data= to the DataFrame constructor in the form of this list. Pay special attention to how the keys of the dictionaries are used as column labels:
# first, we will use simple for loop
# to construct the 'list of dictionaries'
list_of_dict = [{'a': i, 'b': 2*i, 'c': 3*i}
for i in range(5)]
list_of_dict
[{'a': 0, 'b': 0, 'c': 0},
{'a': 1, 'b': 2, 'c': 3},
{'a': 2, 'b': 4, 'c': 6},
{'a': 3, 'b': 6, 'c': 9},
{'a': 4, 'b': 8, 'c': 12}]
# creating DataFrame from the above 'list of dictionaries'
pd.DataFrame(list_of_dict)
| | a | b | c |
---|---|---|---|
0 | 0 | 0 | 0 |
1 | 1 | 2 | 3 |
2 | 2 | 4 | 6 |
3 | 3 | 6 | 9 |
4 | 4 | 8 | 12 |
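In the example above every dictionary has the same keys. As a sketch of the other case, the hypothetical records below do not share all keys; the columns become the union of the keys, and the gaps are filled with NaN.

```python
import pandas as pd

# the two dictionaries do not share the same set of keys
records = [{"a": 1, "b": 2}, {"b": 3, "c": 4}]
df_records = pd.DataFrame(records)

print(df_records)
```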
We will fetch the commonly used attributes of a DataFrame:
print(f"Index: {states.index}")
print(f"Columns Names: {states.columns}")
print(f"Shape: {states.shape}")
print(f"Size: {states.size}")
print(f"Values: {states.values}")
Index: Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')
Columns Names: Index(['population', 'area'], dtype='object')
Shape: (5, 2)
Size: 10
Values: [[38332521 423967]
[26448193 695662]
[19651127 141297]
[19552860 170312]
[12882135 149995]]
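The relationship between these attributes can be sketched on a shortened, two-state version of the states DataFrame (the reduction to two states is an assumption for brevity). The sketch also shows dtypes and to_numpy(), which are not used in the original example.

```python
import pandas as pd

states = pd.DataFrame({
    "population": {"California": 38332521, "Texas": 26448193},
    "area": {"California": 423967, "Texas": 695662},
})

n_rows, n_cols = states.shape            # shape is a (rows, columns) tuple
print(n_rows, n_cols)
print(states.size == n_rows * n_cols)    # size counts every cell
print(states.dtypes)                     # one dtype per column
print(states.to_numpy())                 # the underlying data as a NumPy array
```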
Indexing (using []) on a DataFrame object applies to the column labels:
# fetch all values where column label = population
states['population']
California 38332521
Texas 26448193
New York 19651127
Florida 19552860
Illinois 12882135
Name: population, dtype: int64
# fetch all values where column label = area
states['area']
California 423967
Texas 695662
New York 141297
Florida 170312
Illinois 149995
Name: area, dtype: int64
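Since [] selects columns, rows need a different accessor. The sketch below, again on a shortened two-state DataFrame (an assumption for brevity), contrasts a single column label, a list of column labels, and a row selected with .loc.

```python
import pandas as pd

states = pd.DataFrame({
    "population": {"California": 38332521, "Texas": 26448193},
    "area": {"California": 423967, "Texas": 695662},
})

one_column = states["area"]        # single label -> Series
sub_frame = states[["area"]]       # list of labels -> DataFrame
one_row = states.loc["Texas"]      # rows are selected by label with .loc

print(type(one_column).__name__, type(sub_frame).__name__)
print(one_row)
```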
- Both Pandas Series and DataFrame objects contain an explicit index that lets us reference and modify their data. In some of the above examples, we explicitly provided the index= keyword argument to pd.Series and pd.DataFrame. However, the index object can also be predefined using the pd.Index() constructor
- This Index object can be considered either as an immutable array or as an ordered set
# creating Pandas Index object
index_obj = pd.Index([1,2,3,4,5])
index_obj
Int64Index([1, 2, 3, 4, 5], dtype='int64')
The Index object works in many ways like an array. For example, we can use standard indexing techniques:
# fetch first index
index_obj[0]
1
# fetch every other index, starting from first
index_obj[::2]
Int64Index([1, 3, 5], dtype='int64')
However, the Index object is an immutable array, i.e., its values can't be changed. If we try to change them, it results in a TypeError:
Index does not support mutable operations
The Index object follows many of the conventions used by Python's built-in set data structure, so that unions, intersections, differences, and other combinations can be computed in a familiar way:
# creating index objects
indx1 = pd.Index([1,3,5,7,9,10,12])
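indx2 = pd.Index([0,2,4,6,7,8,9,10])
# intersection of sets
# (the & | ^ operator forms no longer perform set operations in recent pandas versions)
indx1.intersection(indx2)
Int64Index([7, 9, 10], dtype='int64')
# union of sets
indx1.union(indx2)
Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12], dtype='int64')
# symmetric difference
indx1.symmetric_difference(indx2)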
Int64Index([0, 1, 2, 3, 4, 5, 6, 8, 12], dtype='int64')
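Two points from this section can be sketched together: the immutability of an Index, and the one-sided difference, which the text mentions but does not show. The flag name assignment_failed and the variable only_in_first are assumptions for illustration.

```python
import pandas as pd

indx1 = pd.Index([1, 3, 5, 7, 9, 10, 12])
indx2 = pd.Index([0, 2, 4, 6, 7, 8, 9, 10])

# item assignment on an Index raises TypeError
try:
    indx1[0] = 100
    assignment_failed = False
except TypeError:
    assignment_failed = True

# labels that appear in indx1 but not in indx2
only_in_first = indx1.difference(indx2)

print(assignment_failed)
print(list(only_in_first))
```

Because the Index cannot be changed in place, it can be shared safely between several Series and DataFrame objects.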