Filtering And Sorting In Pandas

This article covers a very basic overview of filtering and sorting techniques in Pandas
import numpy as np
import pandas as pd

1. FILTERING

1.1. Conditional Operators

In Pandas (just like we covered in Numpy), we can create โ€˜filter conditionsโ€™ for DataFrame. For example, we can use conditional operators on a DataFrame column features, which return the boolean Series representing DataFrame that passes the filter condition
# reading csv and saving its content as 'df'
df = pd.read_csv('data/pedata.csv')
โ€‹
df.head()
Text
Industry Name
Number of firms
Current PE
Trailing PE
Forward PE
0
Advertising
47
20.01
23.77
13.84
1
Aerospace/Defense
77
35.11
44.26
22.91
2
Air Transport
18
14.87
10.55
10.16
3
Apparel
51
25.76
54.57
21.97
4
Auto & Truck
13
14.77
16.76
20.52
Letโ€™s apply a filter to column Current PE values:
df['Current PE'] > 50
Result, will be a boolean array with True returned for all cells that passes the conditional operator and False otherwise:
0 False
1 False
2 False
3 False
4 False
...
91 False
92 False
93 False
94 True
95 True
Name: Current PE, Length: 96, dtype: bool
Applying filter to column, Number of firms
df['Number of firms'] > 50
0 False
1 True
2 False
3 True
4 False
...
91 False
92 False
93 False
94 True
95 True
Name: Number of firms, Length: 96, dtype: bool

1.2. Filter using Functions

For columns with string values: we can use str.startswith, str.endswith and str.contains functions Using ~ before the operation, negates the filter conditions
df['Industry Name'].str.startswith('A')
0 True
1 True
2 True
3 True
4 True
...
91 False
92 False
93 False
94 False
95 False
Name: Industry Name, Length: 96, dtype: bool
df['Industry Name'].str.contains('Air')
0 False
1 False
2 True
3 False
4 False
...
91 False
92 False
93 False
94 False
95 False
Name: Industry Name, Length: 96, dtype: bool
# .isin function checks the value in the provided list
โ€‹
df['Current PE'].isin([10,20])
0 False
1 False
2 False
3 False
4 False
...
91 False
92 False
93 False
94 False
95 False
Name: Current PE, Length: 96, dtype: bool

1.3. Masking (Retrieving Rows that passes the Filter condition)

  • In above two sections, we studied how to create the filter that returns a list of boolean (True or False) along rows that passes or fails the test
  • We can apply the same filter under the square brackets [] of DataFrame variable to actually retrieve the rows, passing the filter condition.
df[df['Current PE'] > 150]
Text
Industry Name
Number of firms
Current PE
Trailing PE
Forward PE
14
Cable TV
14
156.56
80.57
26.37
30
Entertainment
107
150.39
47.68
40.25
38
Healthcare Products
242
159.85
84.43
71.38
43
Hotel/Gaming
65
170.91
134.20
22.75
45
Information Services
69
283.37
46.23
28.00
46
Insurance (General)
19
693.05
67.57
24.42
55
Oil/Gas Distribution
24
605.72
69.41
16.78
67
Reinsurance
2
153.21
57.40
15.90
70
Retail (Building Supply)
17
201.74
238.80
18.33
74
Retail (Online)
70
319.22
243.82
86.28
df[df['Industry Name'].str.startswith('A')]
Text
Industry Name
Number of firms
Current PE
Trailing PE
Forward PE
0
Advertising
47
20.01
23.77
13.84
1
Aerospace/Defense
77
35.11
44.26
22.91
2
Air Transport
18
14.87
10.55
10.16
3
Apparel
51
25.76
54.57
21.97
4
Auto & Truck
13
14.77
16.76
20.52
5
Auto Parts
46
16.42
17.58
15.63
Using multiple conditions. Let suppose we need to know the industries with Current PE greater than 100 and Forward PE greater than 120
df[(df['Current PE'] > 100) & (df['Forward PE'] > 120)]
Text
Industry Name
Number of firms
Current PE
Trailing PE
Forward PE
77
Semiconductor
72
109.36
97.09
248.11
87
Telecom. Services
67
115.24
742.09
121.34
As another example, let suppose we are interested to know which industries has Forward PE greater than its Current PE value
df[df['Forward PE'] > df['Current PE']]
Text
Industry Name
Number of firms
Current PE
Trailing PE
Forward PE
4
Auto & Truck
13
14.77
16.76
20.52
6
Bank (Money Center)
7
10.56
10.23
12.17
9
Beverage (Soft)
34
34.49
39.87
143.56
11
Brokerage & Investment Banking
39
14.08
18.05
16.34
15
Chemical (Basic)
43
14.40
16.11
22.42
16
Chemical (Diversified)
6
9.63
10.48
10.13
20
Computers/Peripherals
48
24.13
28.92
30.93
21
Construction Supplies
44
22.33
39.58
26.20
24
Drugs (Pharmaceutical)
267
22.48
58.18
35.43
25
Education
35
21.30
22.20
26.03
27
Electronics (Consumer & Office)
20
18.40
64.24
18.80
47
Insurance (Life)
24
15.27
21.05
66.72
53
Oil/Gas (Integrated)
4
12.73
22.67
31.99
54
Oil/Gas (Production and Exploration)
269
19.20
8.66
34.96
65
Real Estate (Operations & Services)
57
23.20
32.46
33.99
77
Semiconductor
72
109.36
97.09
248.11
78
Semiconductor Equip
39
25.57
39.73
28.46
81
Software (Entertainment)
86
60.25
33.98
82.81
82
Software (Internet)
30
90.14
66.75
100.71
84
Steel
32
10.61
14.34
24.60
85
Telecom (Wireless)
18
27.21
25.66
29.17
87
Telecom. Services
67
115.24
742.09
121.34
91
Trucking
33
17.56
18.36
23.54

2. SORTING

2.1. Sort by Feature

  • We can use sort_values function to sort DataFrame by one or more of its columns
  • we can either provide a single column label or list of column labels to sort by
  • Keyword arguments, ascending=True tells to sort in ascending order, ascending=False will sort in descending order
df.sort_values('Current PE')
Text
Industry Name
Number of firms
Current PE
Trailing PE
Forward PE
18
Coal & Related Energy
22
7.06
10.30
7.04
16
Chemical (Diversified)
6
9.63
10.48
10.13
6
Bank (Money Center)
7
10.56
10.23
12.17
84
Steel
32
10.61
14.34
24.60
53
Oil/Gas (Integrated)
4
12.73
22.67
31.99
...
...
...
...
...
...
70
Retail (Building Supply)
17
201.74
238.80
18.33
45
Information Services
69
283.37
46.23
28.00
74
Retail (Online)
70
319.22
243.82
86.28
55
Oil/Gas Distribution
24
605.72
69.41
16.78
46
Insurance (General)
19
693.05
67.57
24.42
96 rows ร— 5 columns
df.sort_values('Number of firms', ascending=False)
Text
Industry Name
Number of firms
Current PE
Trailing PE
Forward PE
94
Total Market
7053
60.52
70.85
35.79
95
Total Market (without financials)
5878
62.49
76.83
39.72
7
Banks (Regional)
611
16.99
15.41
13.70
23
Drugs (Biotechnology)
503
77.30
77.56
30.21
83
Software (System & Application)
363
144.40
110.90
76.82
...
...
...
...
...
...
6
Bank (Money Center)
7
10.56
10.23
12.17
16
Chemical (Diversified)
6
9.63
10.48
10.13
53
Oil/Gas (Integrated)
4
12.73
22.67
31.99
76
Rubber& Tires
4
15.27
21.55
8.95
67
Reinsurance
2
153.21
57.40
15.90
96 rows ร— 5 columns
# when using a list of column labels, any label used after first
# acts as a tiebreaker for its preceding label
df.sort_values(['Current PE', 'Number of firms'])
Text
Industry Name
Number of firms
Current PE
Trailing PE
Forward PE
18
Coal & Related Energy
22
7.06
10.30
7.04
16
Chemical (Diversified)
6
9.63
10.48
10.13
6
Bank (Money Center)
7
10.56
10.23
12.17
84
Steel
32
10.61
14.34
24.60
53
Oil/Gas (Integrated)
4
12.73
22.67
31.99
...
...
...
...
...
...
70
Retail (Building Supply)
17
201.74
238.80
18.33
45
Information Services
69
283.37
46.23
28.00
74
Retail (Online)
70
319.22
243.82
86.28
55
Oil/Gas Distribution
24
605.72
69.41
16.78
46
Insurance (General)
19
693.05
67.57
24.42
96 rows ร— 5 columns

3. MORE EXAMPLES

In this section, we will load IMDB dataset. It is not a complete dataset of all IMDB, but a subset of 1,000 popular movies on IMDB from 2006 to 2016
imdb = pd.read_csv('data/imdb.csv')
โ€‹
imdb.head()
Text
Rank
Title
Genre
Description
Director
Actors
Year
Runtime (Minutes)
Rating
Votes
Revenue (Millions)
Metascore
0
1
Guardians of the Galaxy
Action,Adventure,Sci-Fi
A group of intergalactic criminals are forced ...
James Gunn
Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...
2014
121
8.1
757074
333.13
76.0
1
2
Prometheus
Adventure,Mystery,Sci-Fi
Following clues to the origin of mankind, a te...
Ridley Scott
Noomi Rapace, Logan Marshall-Green, Michael Fa...
2012
124
7.0
485820
126.46
65.0
2
3
Split
Horror,Thriller
Three girls are kidnapped by a man with a diag...
M. Night Shyamalan
James McAvoy, Anya Taylor-Joy, Haley Lu Richar...
2016
117
7.3
157606
138.12
62.0
3
4
Sing
Animation,Comedy,Family
In a city of humanoid animals, a hustling thea...
Christophe Lourdelet
Matthew McConaughey,Reese Witherspoon, Seth Ma...
2016
108
7.2
60545
270.32
59.0
4
5
Suicide Squad
Action,Adventure,Fantasy
A secret government agency recruits some of th...
David Ayer
Will Smith, Jared Leto, Margot Robbie, Viola D...
2016
123
6.2
393727
325.02
40.0
print(imdb.shape)
(1000, 12)
๐Ÿค” Show us all movies from 2016 with rating greater than 8.5
imdb[(imdb['Year'] == 2016) & (imdb['Rating'] >= 8.5)].head()
Text
Rank
Title
Genre
Description
Director
Actors
Year
Runtime (Minutes)
Rating
Votes
Revenue (Millions)
Metascore
96
97
Kimi no na wa
Animation,Drama,Fantasy