library(reticulate)
py_install("numpy")
py_install("pandas")
py_install("pyodide.http")
py_install("seaborn")
py_install("openpyxl")
I/O
import pandas as pd
import numpy as np
# pip install pyodide.http
# pip install seaborn
# pip install openpyxl
#install(['seaborn','xml','openpyxl'])
# from pyodide.http import pyfetch
.TXT file .READ
This first section we’ll focus on OPEN
OPEN ‘r’
- Reading text files involves extracting and processing the data stored within them. Text files can have various structures, and how you read them depends on their format.
- Here’s a general guide on reading text files with different structures.
- Plain text files contain unformatted text without any specific structure
- You can read plain text files line by line or load all the content into your memory
- Syntax:
# Open file in read ('r') mode
= open('sourcefile','r') material
with OPEN ‘r’
- It is always good to close the file after using Open, and it gets tedious to always type the close() command, we can avoid all that using
with()
which automatically closes the file afterread()
# Use with() which automatically closes the file after reading
with open(sourcefile, 'r') as file:
.READ
.read() Reads the content of a file into a variable as a string separated by
# Use with() which automatically closes the file after reading
with open(sourcefile, 'r') as file:
# Use read method to read the entire file
= file.read()
material # Now you can manipulate the content that's stored in material
print(material)
.READLINES
.readlines() reads all the lines line by line and save to a list
with open(sourcefile, 'r') as file:
= file.readlines() listoflines
- Then you can print any line by index
# Print first line
0] listoflines[
.READLINE
.readline() outputs every line as an element in a list. Reads one line ONLY
- material[0] is the first element of the list and corresponds to the first line and so on
- every time you call the method readline() it will read the next line and store its value in the next element of the list material[x]
- Therefore you can use a loop to go through the entire file and extract each line to an element in the list material
with open(sourcefile, 'r') as file:
= file.readline() # reads the first line
line1 = file.readline() # reads the second line line2
Extract Line
- You can work with each line you read, for example
- Print the lines that contain a certain word
with open(sourcefile, 'r') as file:
= file.readline() # reads the first line
line1 = file.readline() # reads the second line
line2 print(line1) # Prints the first line
if 'santa' in line 2:
print("Santa is in the house!?")
All Lines For Loop
- Loop through the file and print lines
with open(sourcefile, 'r') as file:
= 0;
i for line in file:
print("Iteration", str(i),": ", line)
= i+1 i
- Use a loop to read lines until no more lines are left
- Like reading the entire book line by line
All Lines While Loop
with open(sourcefile, 'r') as file:
while True:
= file.readline()
line if not line:
break # stop when all the lines have been read
You can specify the number of chars to read
Extract Chars in Line
- We can limit the number of characters to read from a line in the file by sending an argument in
readline()
- To read the first 4 chars we use:
material = sourcefile.readline(4)
- If we call 4 chars of a 15 char line then the next time we call the method it will read start with char 5 and count out
It reads the specified number of characters from the current position. So if we read the first 5 chars, the next read() will start where it left off, with the 6th character
with open(sourcefile, 'r') as file:
# Use read method to read the first 5 characters
= file.readline(5) material
SEEK Chars
- To read a specific character from a specific position in the file you use
seek()
seek()
moves the cursor (file pointer) to a particular position or char- The position is in bytes so you need to know the byte offset of the desired character
- To read the 11th character we use index 10 (it is 0-based index)
.seek(offset,from)
- changes the position by ‘offset’ bytes with respect to ‘from’. From can take the value of 0,1,2 corresponding to beginning, relative to current position and end
with open(sourcefile, 'r') as file:
file.seek(10) # read first 11 chars in file
.TXT file .WRITE
with OPEN ‘w’
- Similar to open(.., ‘r’) we change the mode to ‘w’ to write.
- We can also use ‘a’ to append which appends to the existing file and not write over it
- Here you can see the similarities
# Open file in write ('w') mode
list = open('sourcefile','w')
.WRITE
- Open the file using the write() to save the text file to a list
.write will overwrite existing data in the file
with open(sourcefile, 'w') as writefile:
"Write this line1 to file") writefile.write(
- We can check to see if it worked using the open read
with open(sourcefile, 'r') as checkfile:
print(checkfile.read())
.WRITE Lines
- write multiple lines
- similar to read it works the same way in reverse
with open(sourcefile, 'w') as writefile:
"This is line A\n")
writefile.write("this is line B\n")
writefile.write(# note the new line \n at the end of each line
# check the file to see if it worked
with open(sourcefile, 'r') as checkfile:
print(checkfile.read())
.WRITE List
- To write a list of lines to a file use a loop
= ["This is line A\n", "This is line B\n", "This is line C\n"]
Lines
with open('/sourcefile.txt', 'w') as writefile:
for line in lines:
print("We are about to add this line: ",line)
writefile.write(line)
.TXT file APPEND
with OPEN ‘a’
Appending to a file just adds to the end of it without affecting the previous content
- Writing to a file erases the existing data
- We just change the mode to ‘a’ and use the write method
with open(sourcefile, 'a') as writefile:
"This is line C\n")
writefile.write("this is line D\n")
writefile.write(# note the new line \n at the end of each line
# check the file to see if it worked
with open(sourcefile, 'r') as checkfile:
print(checkfile.read())
.TXT file A+
- It is used for appending and reading
with open('/Example2.txt', 'a+') as testwritefile:
print("Initial Location: {}".format(testwritefile.tell()))
= testwritefile.read()
data if (not data): #empty strings return false in python
print('Read nothing')
else:
print(testwritefile.read())
0,0) # move 0 bytes from beginning.
testwritefile.seek(
print("\nNew Location : {}".format(testwritefile.tell()))
= testwritefile.read()
data if (not data):
print('Read nothing')
else:
print(data)
print("Location after read: {}".format(testwritefile.tell()) )
Other Modes
It’s fairly inefficient to open the file in a or w and then reopening it in r to read any lines. Luckily we can access the file in the following modes:
- r+ : Reading and writing. Cannot truncate the file.
- w+ : Writing and reading. Truncates the file.
- a+ : Appending and Reading. Creates a new file, if none exists
The difference between w+ and r+: Both of these modes allow access to read and write methods however, opening a file in w+ overwrites it and deletes all pre-existing data
Opening the file in w is akin to opening the .txt file, moving your cursor to the beginning of the text file, writing new text and deleting everything that follows.
Manipulate File
.TRUNCATE
To work with a file on existing data, use r+ and a+. When using r+ it is useful to add .truncate() method at the end of your data. This will reduce the file to your data and delete everything that follows
with open('/Example2.txt', 'r+') as testwritefile:
0,0) #write at beginning of file
testwritefile.seek("Line 1" + "\n")
testwritefile.write("Line 2" + "\n")
testwritefile.write("Line 3" + "\n")
testwritefile.write("Line 4" + "\n")
testwritefile.write("finished\n")
testwritefile.write(#Uncomment the line below
testwritefile.truncate()0,0)
testwritefile.seek(print(testwritefile.read())
.TELL
Whereas opening the file in a is similiar to opening the .txt file, moving your cursor to the very end and then adding the new pieces of text.
It is often very useful to know where the ‘cursor’ is in a file and be able to control it. The following methods allow us to do precisely this -
.tell()
- returns the current position in bytes.seek()
we saw earlier
Copy File to File
You can copy the contents of one file to another by reading from the source file and writing to the destination file
# Open the source file for reading
with open('source.txt', 'r') as source_file:
# Open the destination file for writing
with open('destination.txt', 'w') as destination_file:
# Read lines from the source file and copy them to the destination file
for line in source_file:
destination_file.write(line)
# Destination file is automatically closed when the 'with' block exits
# Source file is automatically closed when the 'with' block exits
File Modes
More can be found here inI/O using Python.
Mode | Syntax | Description |
---|---|---|
‘r’ | 'r' |
Read mode. Opens an existing file for reading. Raises an error if the file doesn’t exist. |
‘w’ | 'w' |
Write mode. Creates a new file for writing. Overwrites the file if it already exists. |
‘a’ | 'a' |
Append mode. Opens a file for appending data. Creates the file if it doesn’t exist. |
‘x’ | 'x' |
Exclusive creation mode. Creates a new file for writing but raises an error if the file already exists. |
‘rb’ | 'rb' |
Read binary mode. Opens an existing binary file for reading. |
‘wb’ | 'wb' |
Write binary mode. Creates a new binary file for writing. |
‘ab’ | 'ab' |
Append binary mode. Opens a binary file for appending data. |
‘xb’ | 'xb' |
Exclusive binary creation mode. Creates a new binary file for writing but raises an error if it already exists. |
‘rt’ | 'rt' |
Read text mode. Opens an existing text file for reading. (Default for text files) |
‘wt’ | 'wt' |
Write text mode. Creates a new text file for writing. (Default for text files) |
‘at’ | 'at' |
Append text mode. Opens a text file for appending data. (Default for text files) |
‘xt’ | 'xt' |
Exclusive text creation mode. Creates a new text file for writing but raises an error if it already exists. |
‘r+’ | 'r+' |
Read and write mode. Opens an existing file for both reading and writing. |
‘w+’ | 'w+' |
Write and read mode. Creates a new file for reading and writing. Overwrites the file if it already exists. |
‘a+’ | 'a+' |
Append and read mode. Opens a file for both appending and reading. Creates the file if it doesn’t exist. |
‘x+’ | 'x+' |
Exclusive creation and read/write mode. Creates a new file for reading and writing but raises an error if it already exists. |
PANDAS
.CSV read
Look in Pandas - Intro for more
# Read the CSV file into a DataFrame
= 'D:/yourdataiq/main/datasets/avgpm25.csv'
file_path = pd.read_csv(file_path) df
- We can still use the open
- You will notice we used wb mode which is write & byte
from pyodide.http import pyfetch
= "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0101EN-SkillsNetwork/labs/Module%205/data/addresses.csv"
filename
async def download(url, filename):
= await pyfetch(url)
response if response.status == 200:
with open(filename, "wb") as f:
await response.bytes())
f.write(
await download(filename, "addresses.csv")
= pd.read_csv("addresses.csv", header=None)
df 5) df.head(
Write JSON to File
JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write.JSON is built on two structures:
- A collection of name/value pairs. In various languages, this is realized as an object, record, struct, dictionary, hash table, keyed list, or associative array.
- An ordered list of values. In most languages, this is realized as an array, vector, list, or sequence.
- The text in JSON is done through quoted string which contains the values in key-value mappings within { }. It is similar to the dictionary in Python.
Serialization
The act of writing to a JSON file is called serialization.
- To handle the data flow in a file, the JSON library in Python uses the dump() or dumps() function to convert the Python objects into their respective JSON object. This makes it easy to write data to files.
import json
= {
person 'first_name' : 'Mark',
'last_name' : 'abc',
'age' : 27,
'address': {
"streetAddress": "21 2nd Street",
"city": "New York",
"state": "NY",
"postalCode": "10021-3100"
} }
Dump
- json.dump() method can be used for writing to JSON file.
- Syntax: json.dump(dict, file_pointer)
- Parameters:
- dictionary – name of the dictionary which should be converted to JSON object.
- file pointer – pointer of the file opened in write or append mode.
- Parameters:
with open('person.json', 'w') as f: # writing JSON object
json.dump(person, f)
Dumps
- json.dumps() that helps in converting a dictionary to a JSON object.
- It takes two parameters:
- dictionary – name of the dictionary which should be converted to JSON object.
- indent – defines the number of units for indentation
# Serializing json
= json.dumps(person, indent = 4)
json_object
# Writing to sample.json
with open("sample.json", "w") as outfile:
outfile.write(json_object)
219
print(json_object)
{
"first_name": "Mark",
"last_name": "abc",
"age": 27,
"address": {
"streetAddress": "21 2nd Street",
"city": "New York",
"state": "NY",
"postalCode": "10021-3100"
}
}
Our python objects are now serialized to the file. For deserialization back to the python object we just use the load() function
Read JSON Files
- Language is similar to a dictionary
- Import json
- open file with the ‘r’ attribute
- Load the file into an object (we’ll use the file we created above)
import json
with open('sample.json','r') as openfile:
= json.load(openfile)
json_object
print(json_object)
print(type(json_object))
Read XLSX Files
- XLSX is a Microsoft Excel Open XML file format. It is another type of Spreadsheet file format.
- In XLSX data is organized under the cells and columns in a sheet.
import urllib.request
"https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0101EN-SkillsNetwork/labs/Module%205/data/file_example_XLSX_10.xlsx", "sample.xlsx")
urllib.request.urlretrieve(
= "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0101EN-SkillsNetwork/labs/Module%205/data/file_example_XLSX_10.xlsx"
filename
async def download(url, filename):
= await pyfetch(url)
response if response.status == 200:
with open(filename, "wb") as f:
await response.bytes())
f.write(
await download(filename, "file_example_XLSX_10.xlsx")
= pd.read_excel("file_example_XLSX_10.xlsx") df
READ_XML
We can also read the downloaded xml file using the read_xml function present in the pandas library which returns a Dataframe object.
For more information read the pandas.read_xml documentation.
# Herein xpath we mention the set of xml nodes to be considered for migrating to the dataframe which in this case is details node under employees.
=pd.read_xml("Sample-employee-XML-file.xml", xpath="/employees/details") df
Read XML Files
- Pandas does not read this type of file directly but here is how we can parse it
- XML is also known as Extensible Markup Language. As the name suggests, it is a markup language.
- It has certain rules for encoding data. XML file format is a human-readable and machine-readable file format.
- Pandas does not include any methods to read and write XML files.
- Here, we will take a look at how we can use other modules to read data from an XML file, and load it into a Pandas DataFrame.
import pandas as pd
import xml.etree.ElementTree as etree
= etree.parse('filename.xml')
tree = tree.getroot()
root
# create column headers
= ['Name','Phone',....]
columns # assign columns to df
= pd.DataFrame(columns = columns)
df
# Create a loop to go through the document to collect the necessary data and append to a df
for node in root:
= node.find('name').text
name = node.find('phonenumber').text
phonenumber = node.find('birthday').text
birthday = df.append(pd.Series([name, phonenumer, birthday], index=columns)..., ignore_index = True) df
Another Example
# Parse the XML file
= etree.parse("Sample-employee-XML-file.xml")
tree
# Get the root of the XML tree
= tree.getroot()
root
# Define the columns for the DataFrame
= ["firstname", "lastname", "title", "division", "building", "room"]
columns
# Initialize an empty DataFrame
= pd.DataFrame(columns=columns)
datatframe
# Iterate through each node in the XML root
for node in root:
# Extract text from each element
= node.find("firstname").text
firstname = node.find("lastname").text
lastname = node.find("title").text
title = node.find("division").text
division = node.find("building").text
building = node.find("room").text
room
# Create a DataFrame for the current row
= pd.DataFrame([[firstname, lastname, title, division, building, room]], columns=columns)
row_df
# Concatenate with the existing DataFrame
= pd.concat([datatframe, row_df], ignore_index=True) datatframe
Write to XML File
The xml.etree.ElementTree module comes built-in with Python. It provides functionality for parsing and creating XML documents. ElementTree represents the XML document as a tree. We can move across the document using nodes which are elements and sub-elements of the XML file.
For more information please read the xml.etree.ElementTree documentation.
import xml.etree.ElementTree as ET
# create the file structure
= ET.Element('employee')
employee = ET.SubElement(employee, 'details')
details = ET.SubElement(details, 'firstname')
first = ET.SubElement(details, 'lastname')
second = ET.SubElement(details, 'age')
third = 'Shiv'
first.text = 'Mishra'
second.text = '23'
third.text
# create a new XML file with the results
= ET.ElementTree(employee)
mydata1 # myfile = open("items2.xml", "wb")
# myfile.write(mydata)
with open("new_sample.xml", "wb") as files:
mydata1.write(files)
Read HTML Content
The top-level read_html()
function can accept an HTML string/file/URL and will parse HTML tables into list of pandas DataFrames
. Let’s look at a few examples.
read_html returns a list of DataFrame objects, even if there is only a single table contained in the HTML content.
Scrape Table w Pandas
pd.read_html(url)
- Set url to the path
- Without setting any options
= "https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list"
url pd.read_html(url)
[ Bank Name ... Fund Sort ascending
0 The First National Bank of Lindsay ... 10547
1 Republic First Bank dba Republic Bank ... 10546
2 Citizens Bank ... 10545
3 Heartland Tri-State Bank ... 10544
4 First Republic Bank ... 10543
5 Signature Bank ... 10540
6 Silicon Valley Bank ... 10539
7 Almena State Bank ... 10538
8 First City Bank of Florida ... 10537
9 The First State Bank ... 10536
[10 rows x 7 columns]]
The drawback is this method ignores many attributes we might want from the page. But if all we needed was the table and what’s in it, this could work. If we need links or other elements then it might be smart to use other methods of scraping the page
Read Binary Files
“Binary” files are any files where the format isn’t made up of readable characters. It contain formatting information that only certain applications or processors can understand. While humans can read text files, binary files must be run on the appropriate software or processor before humans can read them.
Binary files can range from image files like JPEGs or GIFs, audio files like MP3s or binary document formats like Word or PDF.
Let’s see how to read an Image file.
Image Files
- Python supports very powerful tools when it comes to image processing. Let’s see how to process the images using the PIL library.
- PIL is the Python Imaging Library which provides the python interpreter with image editing capabilities.
# importing PIL
from PIL import Image
# Uncomment if running locally
# import urllib.request
# urllib.request.urlretrieve("https://hips.hearstapps.com/hmg-prod.s3.amazonaws.com/images/dog-puppy-on-garden-royalty-free-image-1586966191.jpg", "dog.jpg")
= "https://hips.hearstapps.com/hmg-prod.s3.amazonaws.com/images/dog-puppy-on-garden-royalty-free-image-1586966191.jpg"
filename
async def download(url, filename):
= await pyfetch(url)
response if response.status == 200:
with open(filename, "wb") as f:
await response.bytes())
f.write(
await download(filename, "./dog.jpg")
# Read image
= Image.open('./dog.jpg','r')
img
# Output image
img.show()
Save to CSV
- To save a DataFrame to a CSV file, use the to_csv method and specify the filename with a “.csv” extension.
- Pandas provides other functions for saving DataFrames in different formats.
'filename.csv', index = false) df.to_csv(
Read/Save Other Data Formats
Data Formate | Read | Save |
---|---|---|
csv | pd.read_csv() |
df.to_csv() |
json | pd.read_json() |
df.to_json() |
excel | pd.read_excel() |
df.to_excel() |
hdf | pd.read_hdf() |
df.to_hdf() |
sql | pd.read_sql() |
df.to_sql() |
… | … | … |
IO tools
The pandas I/O API is a set of top level reader
functions accessed like pandas.read_csv()
that generally return a pandas object. The corresponding writer
functions are object methods that are accessed like DataFrame.to_csv()
. Below is a table containing available readers
and writers
.
Format Type | Data Description | Reader | Writer |
---|---|---|---|
text | CSV | read_csv | to_csv |
text | Fixed-Width Text File | read_fwf | |
text | JSON | read_json | to_json |
text | HTML | read_html | to_html |
text | LaTeX | Styler.to_latex | |
text | XML | read_xml | to_xml |
text | Local clipboard | read_clipboard | to_clipboard |
binary | MS Excel | read_excel | to_excel |
binary | OpenDocument | read_excel | |
binary | HDF5 Format | read_hdf | to_hdf |
binary | Feather Format | read_feather | to_feather |
binary | Parquet Format | read_parquet | to_parquet |
binary | ORC Format | read_orc | to_orc |
binary | Stata | read_stata | to_stata |
binary | SAS | read_sas | |
binary | SPSS | read_spss | |
binary | Python Pickle Format | read_pickle | to_pickle |
SQL | SQL | read_sql | to_sql |
SQL | Google BigQuery | read_gbq | to_gbq |