Io File Read And Write
We will learn how to read the three most commonly used file format - csv, xls(xlsx) and json
- CSV is a comma separated key-value pairs, where
key
is the column name andvalue
its corresponding column values pd.read_csv()
constructor is used to read thecsv
file- First (and only required) keyword argument is the location of the file. The returned output is the DataFrame with integer indexes and comma-separated column names as column labels for the DataFrame
- However, when we specify the keyword argument
index_col
, we can specify which columns we want to use as row labels
In the example below, we are going to read the file which is available here. I dropped some columns in the file to make it presentable because our main purpose is to read/write files rather than working with actual content of the file. In addition, following short terms are used column labels:
- Number of firms = No
- Average Unlevered Beta = AU Beta
- Average Levered Beta = AL Beta
import numpy as np
import pandas as pd
# location of csv file we are going to read
csv_location = "data/totalbeta.csv"
csv_df = pd.read_csv(csv_location)
# .head() shows only the first five rows in output
csv_df.head()
Text | Industry | No. | AU Beta | AL Beta |
---|---|---|---|---|
0 | Advertising | 47 | 0.93 | 1.44 |
1 | Aerospace/Defense | 77 | 1.08 | 1.23 |
2 | Air Transport | 18 | 0.84 | 1.44 |
3 | Apparel | 51 | 0.83 | 1.06 |
4 | Auto & Truck | 13 | 0.53 | 1.10 |
Let use the kwarg
index_col
to use specified column value as row label:# Assigning column "Industry Name" as row labels
# Index positon of "industry Name" is 0
pd.read_csv(csv_location, index_col=0).head()
Text | No. | AU Beta | AL Beta |
---|---|---|---|
Industry | | | |
--- | --- | --- | --- |
Advertising | 47 | 0.93 | 1.44 |
Aerospace/Defense | 77 | 1.08 | 1.23 |
Air Transport | 18 | 0.84 | 1.44 |
Apparel | 51 | 0.83 | 1.06 |
Auto & Truck | 13 | 0.53 | 1.10 |
- We can read all spreadsheets in an excel workbook or individual sheets inside the workbook
pd.read_excel
constructor is used to read an excel file- We can provide keyword argument,
sheet_name
to read a specific sheet in workbook — it can be either integer index or exact sheet name as string.sheet_name=None
returns all sheets - We can provide keyword argument,
index_col
to set the column name as row label (just like we do inpd.read_csv
)
xls_location = "data/totalbeta.xls"
xls_df = pd.read_excel(xls_location)
xls_df.head()
Text | Industry | No. | AU Beta | AL Beta |
---|---|---|---|---|
0 | Advertising | 47 | 0.934953 | 1.439612 |
1 | Aerospace/Defense | 77 | 1.078522 | 1.231584 |
2 | Air Transport | 18 | 0.843673 | 1.435348 |
3 | Apparel | 51 | 0.829641 | 1.055097 |
4 | Auto & Truck | 13 | 0.525735 | 1.095074 |
- Let’s fetch all sheet names inside the workbook using kwarg
sheet_name=None
and.keys
method:
# reading by sheet_name
# first, let fetch all the sheet names in the workbook
pd.read_excel(xls_location, sheet_name=None).keys()
odict_keys(['US', 'Global', 'Europe', 'Emerging'])
- Let’s store each sheet content inside the variable
# storing each sheet in its own dataframe
us_sheet = pd.read_excel(xls_location, sheet_name='US')
global_sheet = pd.read_excel(xls_location, sheet_name='Global')
europe_sheet = pd.read_excel(xls_location, sheet_name='Europe')
emerging_sheet = pd.read_excel(xls_location, sheet_name='Emerging')
# read the first sheet by integer indexing
pd.read_excel(xls_location, sheet_name=0).head()
Text | Industry | No. | AU Beta | AL Beta |
---|---|---|---|---|
0 | Advertising | 47 | 0.934953 | 1.439612 |
1 | Aerospace/Defense | 77 | 1.078522 | 1.231584 |
2 | Air Transport | 18 | 0.843673 | 1.435348 |
3 | Apparel | 51 | 0.829641 | 1.055097 |
4 | Auto & Truck | 13 | 0.525735 | 1.095074 |
# read the first sheet by its sheet name
pd.read_excel(xls_location, sheet_name='US').head()
Text | Industry | No. | AU Beta | AL Beta |
---|---|---|---|---|
0 | Advertising | 47 | 0.934953 | 1.439612 |
1 | Aerospace/Defense | 77 | 1.078522 | 1.231584 |
2 | Air Transport | 18 | 0.843673 | 1.435348 |
3 | Apparel | 51 | 0.829641 | 1.055097 |
4 | Auto & Truck | 13 | 0.525735 | 1.095074 |
# reading more than one sheets
excel_file = pd.read_excel(xls_location, sheet_name=[0,-1])
excel_file.keys()
odict_keys([0, -1])
excel_file[-1].head()
Text | Industry Name | No. | AU Beta | AL Beta |
---|---|---|---|---|
0 | Advertising | 103 | 1.463880 | 1.495675 |
1 | Aerospace/Defense | 89 | 1.099578 | 1.175447 |
2 | Air Transport | 86 | 0.590283 | 1.111339 |
3 | Apparel | 907 | 0.635054 | 0.749096 |
4 | Auto & Truck | 83 | 1.127485 | 1.379411 |
- Think of json file as a python dictionary with key-value pair
pd.read_json()
constructor is used to read json file- When no keyword argument is used: outer keys are used as column labels and inner keys are used as row labels
- When we use
orient='index'
keyword argument, outer keys are used as row labels and inner keys are used as column labels
json_location = "data/totalbeta.json"
json_df = pd.read_json(json_location)
json_df.head()
Text | Industry | No. | AU Beta | AL Beta |
---|---|---|---|---|
0 | Advertising | 47 | 0.93 | 1.44 |
1 | Aerospace/Defense | 77 | 1.08 | 1.23 |
2 | Air Transport | 18 | 0.84 | 1.44 |
3 | Apparel | 51 | 0.83 | 1.06 |
4 | Auto & Truck | 13 | 0.53 | 1.10 |
We will write in three most commonly used formats - csv, xls(xlsx) and json
.to_csv()
constructor is used to write incsv
format- The first keyword argument is the name of file
- By default,
to.csv()
uses the index labels as first column in the csv file, which can be non-meaningful if just integer, we can provideindex=False
keyword argument to not write the row label into a csv file - for the purpose of demonstration, we will use the data we have stored above under variables
csv_df
,excel_df
andjson_df
# writing
csv_df.to_csv('data/tocsv_file.csv')
# reading the written file
pd.read_csv("data/tocsv_file.csv").head()
Text | Unnamed: 0 | Industry | No. | AU Beta | AL Beta |
---|---|---|---|---|---|
0 | 0 | Advertising | 47 | 0.93 | 1.44 |
1 | 1 | Aerospace/Defense | 77 | 1.08 | 1.23 |
2 | 2 | Air Transport | 18 | 0.84 | 1.44 |
3 | 3 | Apparel | 51 | 0.83 | 1.06 |
4 | 4 | Auto & Truck | 13 | 0.53 | 1.10 |
We can see that the index label is used as first column, which is meaningless in our case, let use
index=False
to not use the index as first column# setting index=False
# writing
csv_df.to_csv('data/tocsv_index_false.csv', index=False)
# reading the written file
pd.read_csv("data/tocsv_index_false.csv").head()
Text | Industry | No. | AU Beta | AL Beta |
---|---|---|---|---|
0 | Advertising | 47 | 0.93 | 1.44 |
1 | Aerospace/Defense | 77 | 1.08 | 1.23 |
2 | Air Transport | 18 | 0.84 | 1.44 |
3 | Apparel | 51 | 0.83 | 1.06 |
4 | Auto & Truck | 13 | 0.53 | 1.10 |
.to_excel()
constructor is used to write single sheet to the excel workbook- To write multiple sheets in excel workbook, first we need to load the excel file using
pd.ExcelWriter
, then use it as first argument in the.to_excel()
- default label for sheet is
Sheet1
,Sheet2
unless we provide it explicitly usingsheet_name
keyword argument - to avoid writing row label as first column, use
index=False
# writing
xls_df.to_excel('data/toexcel_file.xls', index=False)
# reading the written file
pd.read_excel('data/toexcel_file.xls').head()
Text | Industry | No. | AU Beta | AL Beta |
---|---|---|---|---|
0 | Advertising | 47 | 0.934953 | 1.439612 |
1 | Aerospace/Defense | 77 | 1.078522 | 1.231584 |
2 | Air Transport | 18 | 0.843673 | 1.435348 |
3 | Apparel | 51 | 0.829641 | 1.055097 |
4 | Auto & Truck | 13 | 0.525735 | 1.095074 |
# writing individual sheets to excel workbook
with pd.ExcelWriter('data/toexcel_file_multiple_sheets.xls') as writer:
us_sheet.to_excel(writer, index=False, sheet_name='usa')
europe_sheet.to_excel(writer, index=False, sheet_name='europe')
global_sheet.to_excel(writer, index=False, sheet_name='global')
emerging_sheet.to_excel(writer, index=False, sheet_name='emerging')
# reading the written file
xls_df_new = pd.read_excel('data/toexcel_file_multiple_sheets.xls', sheet_name=None)
xls_df_new.keys()
odict_keys(['usa', 'europe', 'global', 'emerging'])
xls_df_new['usa'].head()
Text | Industry | No. | AU Beta | AL Beta |
---|---|---|---|---|
0 | Advertising | 47 | 0.934953 | 1.439612 |
1 | Aerospace/Defense | 77 | 1.078522 | 1.231584 |
2 | Air Transport | 18 | 0.843673 | 1.435348 |
3 | Apparel | 51 | 0.829641 | 1.055097 |
4 | Auto & Truck | 13 | 0.525735 | 1.095074 |
.to_json()
constructor is used- Meaning and application of
orient='index'
in.to_json
will remain the same as explained inpd.read_json()
# writing
json_df.to_json('data/tojson_file.json')
# reading the written file
pd.read_json('data/tojson_file.json').head()
Text | Industry | No. | AU Beta | AL Beta |
---|---|---|---|---|
0 | Advertising | 47 | 0.93 | 1.44 |
1 | Aerospace/Defense | 77 | 1.08 | 1.23 |
2 | Air Transport | 18 | 0.84 | 1.44 |
3 | Apparel | 51 | 0.83 | 1.06 |
4 | Auto & Truck | 13 | 0.53 | 1.10 |
Last modified 6mo ago