Vectorized String Operations
import numpy as np
import pandas as pd
Vectorization is the process of applying an operation to multiple items (in an array, for example) in one go.
x = np.array([1,2,3,4,5])
# a vectorized operation: multiply every element by 10
x * 10
array([10, 20, 30, 40, 50])
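To make the idea concrete, here is a minimal sketch comparing the vectorized expression with an explicit element-by-element loop. The names `looped` and `vectorized` are illustrative only and do not appear elsewhere in this section.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])

# Explicit loop: visit each element and multiply it by 10
looped = np.array([item * 10 for item in x])

# Vectorized form: one expression applied to the whole array
vectorized = x * 10

# Both approaches produce the same values
print(np.array_equal(looped, vectorized))
```

The two forms agree on the result; the vectorized one is shorter and leaves the looping to NumPy.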
However, vectorizing operations on an “array of strings” is not as straightforward. Pandas addresses this need with vectorized string operations, available through the various str methods.
names_series = pd.Series(['tom','JOhn','MARIA'])
names_series
0 tom
1 JOhn
2 MARIA
dtype: object
names_series.str.capitalize()
0 Tom
1 John
2 Maria
dtype: object
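One practical benefit of the str methods is how they treat missing values. The sketch below is an illustration rather than part of the original example: `names_with_gap` is a hypothetical variant of `names_series` with one missing entry added. A plain Python loop raises an error on the missing entry, while `.str.capitalize()` passes it through as missing.

```python
import pandas as pd

# Hypothetical data: the earlier names plus one missing value
names_with_gap = pd.Series(['tom', None, 'JOhn', 'MARIA'])

# A plain Python loop fails when it reaches the missing entry
try:
    [s.capitalize() for s in names_with_gap]
    loop_failed = False
except AttributeError:
    loop_failed = True

# The vectorized method skips the missing entry and leaves it missing
capitalized = names_with_gap.str.capitalize()

print(loop_failed)
print(capitalized)
```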
Let’s first define a Pandas Series to work with:
# pandas Series used in this section
names = pd.Series(['Walter White', 'Jesse Pinkman', 'Skyler White', 'Hank Shrader', 'Mike Ehrmantraut', 'Gus Fring'])
names
0 Walter White
1 Jesse Pinkman
2 Skyler White
3 Hank Shrader
4 Mike Ehrmantraut
5 Gus Fring
dtype: object
Nearly all of Python’s built-in string methods are mirrored by a Pandas vectorized string method. The pandas documentation has the complete list.
# let's apply some of these string methods to the pandas Series
# to upper case
names.str.upper()
0 WALTER WHITE
1 JESSE PINKMAN
2 SKYLER WHITE
3 HANK SHRADER
4 MIKE EHRMANTRAUT
5 GUS FRING
dtype: object
# to check whether each string consists only of digits
names.str.isdigit()
0 False
1 False
2 False
3 False
4 False
5 False
dtype: bool
# to get length of each item in the array
names.str.len()
0 12
1 13
2 12
3 12
4 16
5 9
dtype: int64
# to get a boolean Series that is True where the string starts with 'W'
names.str.startswith('W')
0 True
1 False
2 False
3 False
4 False
5 False
dtype: bool
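A boolean result like this is most useful as a mask for selecting rows. The following sketch assumes the same `names` Series and uses `str.endswith()`, the counterpart of `str.startswith()`, to pick out one family. The names `mask` and `whites` are illustrative only.

```python
import pandas as pd

names = pd.Series(['Walter White', 'Jesse Pinkman', 'Skyler White',
                   'Hank Shrader', 'Mike Ehrmantraut', 'Gus Fring'])

# Boolean mask: True where the string ends with 'White'
mask = names.str.endswith('White')

# Indexing with the mask keeps only the matching rows
whites = names[mask]
print(whites)
```

The original index labels are preserved in the filtered result, so the selected rows can still be traced back to the full Series.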
A regular expression is a special syntax for matching a string or a set of strings. The topic is broad and can be dry, so we will only sample the plain-vanilla basics here.
The following methods accept regular expressions to examine the content of each string element, and follow some of the API conventions of Python’s built-in re module.
# let's apply the str.extract() method with a regular expression to extract the first names
names.str.extract('([A-Za-z]+)')
0
0 Walter
1 Jesse
2 Skyler
3 Hank
4 Mike
5 Gus
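Other str methods accept regular expressions as well. As a rough sketch, assuming the same `names` Series, `str.contains()` tests each string against a pattern, and `str.extract()` with named groups returns one labelled column per group. The patterns and the group names `first` and `last` are choices made for this illustration.

```python
import pandas as pd

names = pd.Series(['Walter White', 'Jesse Pinkman', 'Skyler White',
                   'Hank Shrader', 'Mike Ehrmantraut', 'Gus Fring'])

# str.contains: True where the string starts with G or H
starts_g_or_h = names.str.contains(r'^[GH]')
print(starts_g_or_h)

# str.extract with named groups: each group becomes a named column
parts = names.str.extract(r'(?P<first>[A-Za-z]+) (?P<last>[A-Za-z]+)')
print(parts)
```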
# getting first letter of each element in the array
# using standard indexing method
names.str[0]
0 W
1 J
2 S
3 H
4 M
5 G
dtype: object
# getting first letter of each element in the array
# using str.get() method
names.str.get(0)
0 W
1 J
2 S
3 H
4 M
5 G
dtype: object
# str.slice()
names.str.slice(0,2)
0 Wa
1 Je
2 Sk
3 Ha
4 Mi
5 Gu
dtype: object
# str.split()
names.str.split()
0 [Walter, White]
1 [Jesse, Pinkman]
2 [Skyler, White]
3 [Hank, Shrader]
4 [Mike, Ehrmantraut]
5 [Gus, Fring]
dtype: object
# str.split() with str.get(0) to get first name
names.str.split().str.get(0)
0 Walter
1 Jesse
2 Skyler
3 Hank
4 Mike
5 Gus
dtype: object
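When every piece of the split is needed, not only the first, `str.split()` can return a DataFrame directly through its `expand=True` argument. The sketch below assumes the same `names` Series; the column labels `first` and `last` are assigned here for illustration.

```python
import pandas as pd

names = pd.Series(['Walter White', 'Jesse Pinkman', 'Skyler White',
                   'Hank Shrader', 'Mike Ehrmantraut', 'Gus Fring'])

# expand=True spreads the split pieces across columns instead of lists
first_last = names.str.split(expand=True)

# Give the positional columns readable labels
first_last.columns = ['first', 'last']
print(first_last)
```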
The get_dummies() method lets you quickly split out indicator variables into a DataFrame.
dummy = pd.DataFrame({'info': ['A|B|C','A','A|C'],
                      'name': ['tom','dick','harry']})
print(dummy)
info name
0 A|B|C tom
1 A dick
2 A|C harry
# using get_dummies
print(dummy['info'].str.get_dummies('|'))
A B C
0 1 1 1
1 1 0 0
2 1 0 1
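Once the indicators are in a DataFrame, they can be combined with the original columns or summarized. This is a small sketch of one possible follow-up, assuming the same `dummy` DataFrame; the names `indicators`, `combined`, and `counts` are illustrative only.

```python
import pandas as pd

dummy = pd.DataFrame({'info': ['A|B|C', 'A', 'A|C'],
                      'name': ['tom', 'dick', 'harry']})

# One indicator column per code found in the 'info' strings
indicators = dummy['info'].str.get_dummies('|')

# Attach the indicators to the names, aligned on the row index
combined = dummy[['name']].join(indicators)
print(combined)

# Column sums count how many rows carry each code
counts = indicators.sum()
print(counts)
```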