Filtering And Sorting In Pandas
This article covers a very basic overview of filtering and sorting techniques in Pandas
import numpy as np
import pandas as pd
In Pandas (just like we covered in Numpy), we can create ‘filter conditions’ for DataFrame. For example, we can use conditional operators on a DataFrame column features, which return the boolean Series representing DataFrame that passes the filter condition
# reading csv and saving its content as 'df'
df = pd.read_csv('data/pedata.csv')
df.head()
Text | Industry Name | Number of firms | Current PE | Trailing PE | Forward PE |
---|---|---|---|---|---|
0 | Advertising | 47 | 20.01 | 23.77 | 13.84 |
1 | Aerospace/Defense | 77 | 35.11 | 44.26 | 22.91 |
2 | Air Transport | 18 | 14.87 | 10.55 | 10.16 |
3 | Apparel | 51 | 25.76 | 54.57 | 21.97 |
4 | Auto & Truck | 13 | 14.77 | 16.76 | 20.52 |
Let’s apply a filter to column
Current PE
values:df['Current PE'] > 50
Result, will be a boolean array with
True
returned for all cells that passes the conditional operator and False
otherwise:0 False
1 False
2 False
3 False
4 False
...
91 False
92 False
93 False
94 True
95 True
Name: Current PE, Length: 96, dtype: bool
Applying filter to column,
Number of firms
df['Number of firms'] > 50
0 False
1 True
2 False
3 True
4 False
...
91 False
92 False
93 False
94 True
95 True
Name: Number of firms, Length: 96, dtype: bool
For columns with string values: we can use
str.startswith
, str.endswith
and str.contains
functions Using ~
before the operation, negates the filter conditionsdf['Industry Name'].str.startswith('A')
0 True
1 True
2 True
3 True
4 True
...
91 False
92 False
93 False
94 False
95 False
Name: Industry Name, Length: 96, dtype: bool
df['Industry Name'].str.contains('Air')
0 False
1 False
2 True
3 False
4 False
...
91 False
92 False
93 False
94 False
95 False
Name: Industry Name, Length: 96, dtype: bool
# .isin function checks the value in the provided list
df['Current PE'].isin([10,20])
0 False
1 False
2 False
3 False
4 False
...
91 False
92 False
93 False
94 False
95 False
Name: Current PE, Length: 96, dtype: bool
- In above two sections, we studied how to create the filter that returns a list of boolean (True or False) along rows that passes or fails the test
- We can apply the same filter under the square brackets
[]
of DataFrame variable to actually retrieve the rows, passing the filter condition.
df[df['Current PE'] > 150]
Text | Industry Name | Number of firms | Current PE | Trailing PE | Forward PE |
---|---|---|---|---|---|
14 | Cable TV | 14 | 156.56 | 80.57 | 26.37 |
30 | Entertainment | 107 | 150.39 | 47.68 | 40.25 |
38 | Healthcare Products | 242 | 159.85 | 84.43 | 71.38 |
43 | Hotel/Gaming | 65 | 170.91 | 134.20 | 22.75 |
45 | Information Services | 69 | 283.37 | 46.23 | 28.00 |
46 | Insurance (General) | 19 | 693.05 | 67.57 | 24.42 |
55 | Oil/Gas Distribution | 24 | 605.72 | 69.41 | 16.78 |
67 | Reinsurance | 2 | 153.21 | 57.40 | 15.90 |
70 | Retail (Building Supply) | 17 | 201.74 | 238.80 | 18.33 |
74 | Retail (Online) | 70 | 319.22 | 243.82 | 86.28 |
df[df['Industry Name'].str.startswith('A')]
Text | Industry Name | Number of firms | Current PE | Trailing PE | Forward PE |
---|---|---|---|---|---|
0 | Advertising | 47 | 20.01 | 23.77 | 13.84 |
1 | Aerospace/Defense | 77 | 35.11 | 44.26 | 22.91 |
2 | Air Transport | 18 | 14.87 | 10.55 | 10.16 |
3 | Apparel | 51 | 25.76 | 54.57 | 21.97 |
4 | Auto & Truck | 13 | 14.77 | 16.76 | 20.52 |
5 | Auto Parts | 46 | 16.42 | 17.58 | 15.63 |
Using multiple conditions. Let suppose we need to know the industries with
Current PE
greater than 100 and Forward PE
greater than 120df[(df['Current PE'] > 100) & (df['Forward PE'] > 120)]
Text | Industry Name | Number of firms | Current PE | Trailing PE | Forward PE |
---|---|---|---|---|---|
77 | Semiconductor | 72 | 109.36 | 97.09 | 248.11 |
87 | Telecom. Services | 67 | 115.24 | 742.09 | 121.34 |
As another example, let suppose we are interested to know which industries has
Forward PE
greater than its Current PE
valuedf[df['Forward PE'] > df['Current PE']]
Text | Industry Name | Number of firms | Current PE | Trailing PE | Forward PE |
---|---|---|---|---|---|
4 | Auto & Truck | 13 | 14.77 | 16.76 | 20.52 |
6 | Bank (Money Center) | 7 | 10.56 | 10.23 | 12.17 |
9 | Beverage (Soft) | 34 | 34.49 | 39.87 | 143.56 |
11 | Brokerage & Investment Banking | 39 | 14.08 | 18.05 | 16.34 |
15 | Chemical (Basic) | 43 | 14.40 | 16.11 | 22.42 |
16 | Chemical (Diversified) | 6 | 9.63 | 10.48 | 10.13 |
20 | Computers/Peripherals | 48 | 24.13 | 28.92 | 30.93 |
21 | Construction Supplies | 44 | 22.33 | 39.58 | 26.20 |
24 | Drugs (Pharmaceutical) | 267 | 22.48 | 58.18 | 35.43 |
25 | Education | 35 | 21.30 | 22.20 | 26.03 |
27 | Electronics (Consumer & Office) | 20 | 18.40 | 64.24 | 18.80 |
47 | Insurance (Life) | 24 | 15.27 | 21.05 | 66.72 |
53 | Oil/Gas (Integrated) | 4 | 12.73 | 22.67 | 31.99 |
54 | Oil/Gas (Production and Exploration) | 269 | 19.20 | 8.66 | 34.96 |
65 | Real Estate (Operations & Services) | 57 | 23.20 | 32.46 | 33.99 |
77 | Semiconductor | 72 | 109.36 | 97.09 | 248.11 |
78 | Semiconductor Equip | 39 | 25.57 | 39.73 | 28.46 |
81 | Software (Entertainment) | 86 | 60.25 | 33.98 | 82.81 |
82 | Software (Internet) | 30 | 90.14 | 66.75 | 100.71 |
84 | Steel | 32 | 10.61 | 14.34 | 24.60 |
85 | Telecom (Wireless) | 18 | 27.21 | 25.66 | 29.17 |
87 | Telecom. Services | 67 | 115.24 | 742.09 | 121.34 |
91 | Trucking | 33 | 17.56 | 18.36 | 23.54 |
- We can use
sort_values
function to sort DataFrame by one or more of its columns - we can either provide a single column label or list of column labels to sort by
- Keyword arguments,
ascending=True
tells to sort in ascending order,ascending=False
will sort in descending order
df.sort_values('Current PE')
Text | Industry Name | Number of firms | Current PE | Trailing PE | Forward PE |
---|---|---|---|---|---|
18 | Coal & Related Energy | 22 | 7.06 | 10.30 | 7.04 |
16 | Chemical (Diversified) | 6 | 9.63 | 10.48 | 10.13 |
6 | Bank (Money Center) | 7 | 10.56 | 10.23 | 12.17 |
84 | Steel | 32 | 10.61 | 14.34 | 24.60 |
53 | Oil/Gas (Integrated) | 4 | 12.73 | 22.67 | 31.99 |
... | ... | ... | ... | ... | ... |
70 | Retail (Building Supply) | 17 | 201.74 | 238.80 | 18.33 |
45 | Information Services | 69 | 283.37 | 46.23 | 28.00 |
74 | Retail (Online) | 70 | 319.22 | 243.82 | 86.28 |
55 | Oil/Gas Distribution | 24 | 605.72 | 69.41 | 16.78 |
46 | Insurance (General) | 19 | 693.05 | 67.57 | 24.42 |
96 rows × 5 columns
df.sort_values('Number of firms', ascending=False)
Text | Industry Name | Number of firms | Current PE | Trailing PE | Forward PE |
---|---|---|---|---|---|
94 | Total Market | 7053 | 60.52 | 70.85 | 35.79 |
95 | Total Market (without financials) | 5878 | 62.49 | 76.83 | 39.72 |
7 | Banks (Regional) | 611 | 16.99 | 15.41 | 13.70 |
23 | Drugs (Biotechnology) | 503 | 77.30 | 77.56 | 30.21 |
83 | Software (System & Application) | 363 | 144.40 | 110.90 | 76.82 |
... | ... | ... | ... | ... | ... |
6 | Bank (Money Center) | 7 | 10.56 | 10.23 | 12.17 |
16 | Chemical (Diversified) | 6 | 9.63 | 10.48 | 10.13 |
53 | Oil/Gas (Integrated) | 4 | 12.73 | 22.67 | 31.99 |
76 | Rubber& Tires | 4 | 15.27 | 21.55 | 8.95 |
67 | Reinsurance | 2 | 153.21 | 57.40 | 15.90 |
96 rows × 5 columns
# when using a list of column labels, any label used after first
# acts as a tiebreaker for its preceding label
df.sort_values(['Current PE', 'Number of firms'])
Text | Industry Name | Number of firms | Current PE | Trailing PE | Forward PE |
---|---|---|---|---|---|
18 | Coal & Related Energy | 22 | 7.06 | 10.30 | 7.04 |
16 | Chemical (Diversified) | 6 | 9.63 | 10.48 | 10.13 |
6 | Bank (Money Center) | 7 | 10.56 | 10.23 | 12.17 |
84 | Steel | 32 | 10.61 | 14.34 | 24.60 |
53 | Oil/Gas (Integrated) | 4 | 12.73 | 22.67 | 31.99 |
... | ... | ... | ... | ... | ... |
70 | Retail (Building Supply) | 17 | 201.74 | 238.80 | 18.33 |
45 | Information Services | 69 | 283.37 | 46.23 | 28.00 |
74 | Retail (Online) | 70 | 319.22 | 243.82 | 86.28 |
55 | Oil/Gas Distribution | 24 | 605.72 | 69.41 | 16.78 |
46 | Insurance (General) | 19 | 693.05 | 67.57 | 24.42 |
96 rows × 5 columns
In this section, we will load IMDB dataset. It is not a complete dataset of all IMDB, but a subset of 1,000 popular movies on IMDB from 2006 to 2016
imdb = pd.read_csv('data/imdb.csv')
imdb.head()
Text | Rank | Title | Genre | Description | Director | Actors | Year | Runtime (Minutes) | Rating | Votes | Revenue (Millions) | Metascore |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Guardians of the Galaxy | Action,Adventure,Sci-Fi | A group of intergalactic criminals are forced ... | James Gunn | Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S... | 2014 | 121 | 8.1 | 757074 | 333.13 | 76.0 |
1 | 2 | Prometheus | Adventure,Mystery,Sci-Fi | Following clues to the origin of mankind, a te... | Ridley Scott | Noomi Rapace, Logan Marshall-Green, Michael Fa... | 2012 | 124 | 7.0 | 485820 | 126.46 | 65.0 |
2 | 3 | Split | Horror,Thriller | Three girls are kidnapped by a man with a diag... | M. Night Shyamalan | James McAvoy, Anya Taylor-Joy, Haley Lu Richar... | 2016 | 117 | 7.3 | 157606 | 138.12 | 62.0 |
3 | 4 | Sing | Animation,Comedy,Family | In a city of humanoid animals, a hustling thea... | Christophe Lourdelet | Matthew McConaughey,Reese Witherspoon, Seth Ma... | 2016 | 108 | 7.2 | 60545 | 270.32 | 59.0 |
4 | 5 | Suicide Squad | Action,Adventure,Fantasy | A secret government agency recruits some of th... | David Ayer | Will Smith, Jared Leto, Margot Robbie, Viola D... | 2016 | 123 | 6.2 | 393727 | 325.02 | 40.0 |
print(imdb.shape)
(1000, 12)
🤔 Show us all movies from 2016 with rating greater than 8.5
imdb[(imdb['Year'] == 2016) & (imdb['Rating'] >= 8.5)].head()
Text | Rank | Title | Genre | Description | Director | Actors | Year | Runtime (Minutes) | Rating | Votes | Revenue (Millions) | Metascore |
---|---|---|---|---|---|---|---|---|---|---|---|---|
96 | 97 | Kimi no na wa | Animation,Drama,Fantasy |