Io File Read And Write

1. READING THE FILE

We will learn how to read the three most commonly used file format - csv, xls(xlsx) and json

1.1 Reading a CSV file

  • CSV is a comma separated key-value pairs, where key is the column name and value its corresponding column values
  • pd.read_csv() constructor is used to read the csv file
  • First (and only required) keyword argument is the location of the file. The returned output is the DataFrame with integer indexes and comma-separated column names as column labels for the DataFrame
  • However, when we specify the keyword argument index_col, we can specify which columns we want to use as row labels
In the example below, we are going to read the file which is available here. I dropped some columns in the file to make it presentable because our main purpose is to read/write files rather than working with actual content of the file. In addition, following short terms are used column labels:
  • Number of firms = No
  • Average Unlevered Beta = AU Beta
  • Average Levered Beta = AL Beta
import numpy as np
import pandas as pd
# location of csv file we are going to read
csv_location = "data/totalbeta.csv"
csv_df = pd.read_csv(csv_location)
# .head() shows only the first five rows in output
csv_df.head()
Text
Industry
No.
AU Beta
AL Beta
0
Advertising
47
0.93
1.44
1
Aerospace/Defense
77
1.08
1.23
2
Air Transport
18
0.84
1.44
3
Apparel
51
0.83
1.06
4
Auto & Truck
13
0.53
1.10
Let use the kwarg index_col to use specified column value as row label:
# Assigning column "Industry Name" as row labels
# Index positon of "industry Name" is 0
pd.read_csv(csv_location, index_col=0).head()
Text
No.
AU Beta
AL Beta
Industry
---
---
---
---
Advertising
47
0.93
1.44
Aerospace/Defense
77
1.08
1.23
Air Transport
18
0.84
1.44
Apparel
51
0.83
1.06
Auto & Truck
13
0.53
1.10

1.2. Reading an Excel File

  • We can read all spreadsheets in an excel workbook or individual sheets inside the workbook
  • pd.read_excel constructor is used to read an excel file
  • We can provide keyword argument, sheet_name to read a specific sheet in workbook — it can be either integer index or exact sheet name as string. sheet_name=None returns all sheets
  • We can provide keyword argument, index_col to set the column name as row label (just like we do in pd.read_csv)
xls_location = "data/totalbeta.xls"
xls_df = pd.read_excel(xls_location)
xls_df.head()
Text
Industry
No.
AU Beta
AL Beta
0
Advertising
47
0.934953
1.439612
1
Aerospace/Defense
77
1.078522
1.231584
2
Air Transport
18
0.843673
1.435348
3
Apparel
51
0.829641
1.055097
4
Auto & Truck
13
0.525735
1.095074
  • Let’s fetch all sheet names inside the workbook using kwarg sheet_name=None and .keys method:
# reading by sheet_name
# first, let fetch all the sheet names in the workbook
pd.read_excel(xls_location, sheet_name=None).keys()
odict_keys(['US', 'Global', 'Europe', 'Emerging'])
  • Let’s store each sheet content inside the variable
# storing each sheet in its own dataframe
us_sheet = pd.read_excel(xls_location, sheet_name='US')
global_sheet = pd.read_excel(xls_location, sheet_name='Global')
europe_sheet = pd.read_excel(xls_location, sheet_name='Europe')
emerging_sheet = pd.read_excel(xls_location, sheet_name='Emerging')
# read the first sheet by integer indexing
pd.read_excel(xls_location, sheet_name=0).head()
Text
Industry
No.
AU Beta
AL Beta
0
Advertising
47
0.934953
1.439612
1
Aerospace/Defense
77
1.078522
1.231584
2
Air Transport
18
0.843673
1.435348
3
Apparel
51
0.829641
1.055097
4
Auto & Truck
13
0.525735
1.095074
# read the first sheet by its sheet name
pd.read_excel(xls_location, sheet_name='US').head()
Text
Industry
No.
AU Beta
AL Beta
0
Advertising
47
0.934953
1.439612
1
Aerospace/Defense
77
1.078522
1.231584
2
Air Transport
18
0.843673
1.435348
3
Apparel
51
0.829641
1.055097
4
Auto & Truck
13
0.525735
1.095074
# reading more than one sheets
excel_file = pd.read_excel(xls_location, sheet_name=[0,-1])
excel_file.keys()
odict_keys([0, -1])
excel_file[-1].head()
Text
Industry Name
No.
AU Beta
AL Beta
0
Advertising
103
1.463880
1.495675
1
Aerospace/Defense
89
1.099578
1.175447
2
Air Transport
86
0.590283
1.111339
3
Apparel
907
0.635054
0.749096
4
Auto & Truck
83
1.127485
1.379411

1.3. Reading a JSON file

  • Think of json file as a python dictionary with key-value pair
  • pd.read_json() constructor is used to read json file
  • When no keyword argument is used: outer keys are used as column labels and inner keys are used as row labels
  • When we use orient='index' keyword argument, outer keys are used as row labels and inner keys are used as column labels
json_location = "data/totalbeta.json"
json_df = pd.read_json(json_location)
json_df.head()
Text
Industry
No.
AU Beta
AL Beta
0
Advertising
47
0.93
1.44
1
Aerospace/Defense
77
1.08
1.23
2
Air Transport
18
0.84
1.44
3
Apparel
51
0.83
1.06
4
Auto & Truck
13
0.53
1.10

2. Writing to File

We will write in three most commonly used formats - csv, xls(xlsx) and json

2.1. Writing to a CSV file

  • .to_csv() constructor is used to write in csv format
  • The first keyword argument is the name of file
  • By default, to.csv() uses the index labels as first column in the csv file, which can be non-meaningful if just integer, we can provide index=False keyword argument to not write the row label into a csv file
  • for the purpose of demonstration, we will use the data we have stored above under variables csv_df, excel_df and json_df
# writing
csv_df.to_csv('data/tocsv_file.csv')
# reading the written file
pd.read_csv("data/tocsv_file.csv").head()
Text
Unnamed: 0
Industry
No.
AU Beta
AL Beta
0
0
Advertising
47
0.93
1.44
1
1
Aerospace/Defense
77
1.08
1.23
2
2
Air Transport
18
0.84
1.44
3
3
Apparel
51
0.83
1.06
4
4
Auto & Truck
13
0.53
1.10
We can see that the index label is used as first column, which is meaningless in our case, let use index=False to not use the index as first column
# setting index=False
# writing
csv_df.to_csv('data/tocsv_index_false.csv', index=False)
# reading the written file
pd.read_csv("data/tocsv_index_false.csv").head()
Text
Industry
No.
AU Beta
AL Beta
0
Advertising
47
0.93
1.44
1
Aerospace/Defense
77
1.08
1.23
2
Air Transport
18
0.84
1.44
3
Apparel
51
0.83
1.06
4
Auto & Truck
13
0.53
1.10

2.2. Writing to Excel File

  • .to_excel() constructor is used to write single sheet to the excel workbook
  • To write multiple sheets in excel workbook, first we need to load the excel file using pd.ExcelWriter, then use it as first argument in the .to_excel()
  • default label for sheet is Sheet1, Sheet2 unless we provide it explicitly using sheet_name keyword argument
  • to avoid writing row label as first column, use index=False
# writing
xls_df.to_excel('data/toexcel_file.xls', index=False)
# reading the written file
pd.read_excel('data/toexcel_file.xls').head()
Text
Industry
No.
AU Beta
AL Beta
0
Advertising
47
0.934953
1.439612
1
Aerospace/Defense
77
1.078522
1.231584
2
Air Transport
18
0.843673
1.435348
3
Apparel
51
0.829641
1.055097
4
Auto & Truck
13
0.525735
1.095074
# writing individual sheets to excel workbook
with pd.ExcelWriter('data/toexcel_file_multiple_sheets.xls') as writer:
us_sheet.to_excel(writer, index=False, sheet_name='usa')
europe_sheet.to_excel(writer, index=False, sheet_name='europe')
global_sheet.to_excel(writer, index=False, sheet_name='global')
emerging_sheet.to_excel(writer, index=False, sheet_name='emerging')
# reading the written file
xls_df_new = pd.read_excel('data/toexcel_file_multiple_sheets.xls', sheet_name=None)
xls_df_new.keys()
odict_keys(['usa', 'europe', 'global', 'emerging'])
xls_df_new['usa'].head()
Text
Industry
No.
AU Beta
AL Beta
0
Advertising
47
0.934953
1.439612
1
Aerospace/Defense
77
1.078522
1.231584
2
Air Transport
18
0.843673
1.435348
3
Apparel
51
0.829641
1.055097
4
Auto & Truck
13
0.525735
1.095074

2.3. Write to JSON File

  • .to_json() constructor is used
  • Meaning and application of orient='index' in .to_json will remain the same as explained in pd.read_json()
# writing
json_df.to_json('data/tojson_file.json')
# reading the written file
pd.read_json('data/tojson_file.json').head()
Text
Industry
No.
AU Beta
AL Beta
0
Advertising
47
0.93
1.44
1
Aerospace/Defense
77
1.08
1.23
2
Air Transport
18
0.84
1.44
3
Apparel
51
0.83
1.06
4
Auto & Truck
13
0.53
1.10