pip install bs4
pip install html5lib
Documentation can be found on the Beautiful Soup documentation page.
BeautifulSoup
Beautiful Soup is a Python library for pulling data out of HTML and XML files; here we will focus on HTML. It works by representing the HTML as a set of objects with methods used to parse the HTML.
We can navigate the HTML as a tree, and/or filter out what we are looking for. Consider:
<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<h3><b id='boldest'>Lebron James</b></h3>
<p> Salary: $ 92,000,000 </p>
<h3> Stephen Curry</h3>
<p> Salary: $85,000, 000 </p>
<h3> Kevin Durant </h3>
<p> Salary: $73,200, 000</p>
</body>
</html>
- We can store it as a string in the variable my_html:
="<!DOCTYPE html><html><head><title>Page Title</title></head><body><h3><b id='boldest'>Lebron James</b></h3><p> Salary: $ 92,000,000 </p><h3> Stephen Curry</h3><p> Salary: $85,000, 000 </p><h3> Kevin Durant </h3><p> Salary: $73,200, 000</p></body></html>" my_html
Parser
Beautiful Soup supports the HTML parser included in Python’s standard library, but it also supports a number of third-party Python parsers. One is the lxml parser. Depending on your setup, you might install lxml with the following command:
$ pip install lxml
This table summarizes the advantages and disadvantages of each parser library:
| Parser | Typical usage | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Python’s html.parser | BeautifulSoup(markup, "html.parser") | Batteries included; decent speed | Not as fast as lxml; less lenient than html5lib |
| lxml’s HTML parser | BeautifulSoup(markup, "lxml") | Very fast; lenient | External C dependency |
| lxml’s XML parser | BeautifulSoup(markup, "lxml-xml") or BeautifulSoup(markup, "xml") | Very fast; the only currently supported XML parser | External C dependency |
| html5lib | BeautifulSoup(markup, "html5lib") | Extremely lenient; parses pages the same way a web browser does; creates valid HTML5 | Very slow; external Python dependency |
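Parser choice matters most for malformed markup: html5lib repairs documents the way a browser would, while html.parser is more literal. A minimal sketch comparing the two parsers installed above (the exact output depends on your library versions):
from bs4 import BeautifulSoup
broken = "<a></p>"   # deliberately invalid markup
print(BeautifulSoup(broken, "html.parser"))   # typically left almost as-is
print(BeautifulSoup(broken, "html5lib"))      # typically repaired and wrapped in <html><head><body>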
Constructor
from bs4 import BeautifulSoup
import requests
soup = BeautifulSoup(my_html, 'html5lib')
- To parse a document, pass it into the BeautifulSoup constructor.
- The BeautifulSoup object represents the document as a nested data structure: first, the document is converted to Unicode and HTML entities are converted to Unicode characters.
- Beautiful Soup transforms a complex HTML document into a complex tree of Python objects.
- The BeautifulSoup object can create other types of objects; here we will cover BeautifulSoup and Tag objects, and later NavigableString objects.
- We can use the method prettify() to display the HTML in its nested structure:
print(soup.prettify())
<!DOCTYPE html>
<html>
<head>
<title>
Page Title
</title>
</head>
<body>
<h3>
<b id="boldest">
Lebron James
</b>
</h3>
<p>
Salary: $ 92,000,000
</p>
<h3>
Stephen Curry
</h3>
<p>
Salary: $85,000, 000
</p>
<h3>
Kevin Durant
</h3>
<p>
Salary: $73,200, 000
</p>
</body>
</html>
Beautiful Soup transforms a complex HTML document into a complex tree of Python objects, but you’ll only ever have to deal with about four kinds of objects: Tag, NavigableString, BeautifulSoup, and Comment. These objects represent the HTML elements that comprise the page.
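A quick way to see these object kinds, using the soup built above (a minimal sketch; Comment objects only appear when the markup contains HTML comments):
type(soup)            # <class 'bs4.BeautifulSoup'>
type(soup.b)          # <class 'bs4.element.Tag'>
type(soup.b.string)   # <class 'bs4.element.NavigableString'>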
HTML Attributes
A tag may have any number of attributes; the tag <b id="boldest"> has an attribute id whose value is boldest.
- You can access a tag’s attributes by treating the tag like a dictionary
- You can access that dictionary directly as attrs
tag_child = soup.h3.b  # one way to obtain the <b id="boldest"> tag
tag_child
<b id="boldest">Lebron James</b>
- Access the attribute
tag_child['id']
'boldest'
- Access the entire dictionary directly
tag_child.attrs
{'id': 'boldest'}
- You can use the get method as well
tag_child.get('id')
'boldest'
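The text inside a tag is itself an object: Beautiful Soup represents it as a NavigableString, mentioned earlier. A short sketch using the tag above:
tag_string = tag_child.string   # the text inside the <b> tag
tag_string                      # 'Lebron James'
type(tag_string)                # <class 'bs4.element.NavigableString'>
str(tag_string)                 # convert to a regular Python string if needed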
Tables
Filter
Filters allow you to find complex patterns; the simplest filter is a string.
- In this section we will pass a string to a filter method, and Beautiful Soup will perform a match against that exact string.
- Consider the following HTML of rocket launches:
<table>
<tr>
<td id='flight' >Flight No</td>
<td>Launch site</td>
<td>Payload mass</td>
</tr>
<tr>
<td>1</td>
<td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a></td>
<td>300 kg</td>
</tr>
<tr>
<td>2</td>
<td><a href='https://en.wikipedia.org/wiki/Texas'>Texas</a></td>
<td>94 kg</td>
</tr>
<tr>
<td>3</td>
<td><a href='https://en.wikipedia.org/wiki/Florida'>Florida<a> </td>
<td>80 kg</td>
</tr>
</table>
- Store it in a string
="<table><tr><td id='flight'>Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr> <td>1</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida<a></td><td>300 kg</td></tr><tr><td>2</td><td><a href='https://en.wikipedia.org/wiki/Texas'>Texas</a></td><td>94 kg</td></tr><tr><td>3</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida<a> </td><td>80 kg</td></tr></table>" table
table_bs = BeautifulSoup(table, 'html5lib')
print(table_bs.prettify())
<html>
<head>
</head>
<body>
<table>
<tbody>
<tr>
<td id="flight">
Flight No
</td>
<td>
Launch site
</td>
<td>
Payload mass
</td>
</tr>
<tr>
<td>
1
</td>
<td>
<a href="https://en.wikipedia.org/wiki/Florida">
Florida
</a>
<a>
</a>
</td>
<td>
300 kg
</td>
</tr>
<tr>
<td>
2
</td>
<td>
<a href="https://en.wikipedia.org/wiki/Texas">
Texas
</a>
</td>
<td>
94 kg
</td>
</tr>
<tr>
<td>
3
</td>
<td>
<a href="https://en.wikipedia.org/wiki/Florida">
Florida
</a>
<a>
</a>
</td>
<td>
80 kg
</td>
</tr>
</tbody>
</table>
</body>
</html>
Find All
The find_all() method looks through a tag’s descendants and retrieves all descendants that match your filters.
- Syntax: the method signature is find_all(name, attrs, recursive, string, limit, **kwargs)
Name
- When we set the name parameter to a tag name, the method will extract all the tags with that name, along with their children.
- If you remember, 'tr' is the name of the tag for each row in a table.
- So let’s extract all rows using find_all.
- The result is a Python iterable, just like a list; each element is a Tag object:
table_rows = table_bs.find_all('tr')
table_rows
[<tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>, <tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td><td>300 kg</td></tr>, <tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr>, <tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td><td>80 kg</td></tr>]
- Extract first row by location
first_row = table_rows[0]
first_row
<tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>
type(first_row)
<class 'bs4.element.Tag'>
Child of Row
- We can also target a child of the row, since each row contains td tags
- As you can see, there is also a string within the tag
- Remember, accessing a tag name as an attribute returns ONLY the first tag it encounters
first_row_child = first_row.td
first_row_child
<td id="flight">Flight No</td>
String
- Just as before, we can extract the string:
first_row_child.string
'Flight No'
Print all Rows
for i, row in enumerate(table_rows):
    print('row', i, 'is', row)
row 0 is <tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>
row 1 is <tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td><td>300 kg</td></tr>
row 2 is <tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr>
row 3 is <tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td><td>80 kg</td></tr>
Iterate and Extract Cells
Now that we have filtered all the table rows, we can iterate through them, extract the cell values, and build a table (a small sketch that assembles the values appears after the string-extraction example below).
- As each row is a Tag object, we can apply the method find_all to it and extract the table cells into the object cells using the tag td; these are all the children with the name td.
- The result is a list; each element corresponds to a cell and is a Tag object, so we can iterate through this list as well.
- We can extract the content using the string attribute.
for i, row in enumerate(table_rows):
    print("row", i)
    cells = row.find_all('td')
    for j, cell in enumerate(cells):
        print("column", j, "cell", cell)
row 0
column 0 cell <td id="flight">Flight No</td>
column 1 cell <td>Launch site</td>
column 2 cell <td>Payload mass</td>
row 1
column 0 cell <td>1</td>
column 1 cell <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td>
column 2 cell <td>300 kg</td>
row 2
column 0 cell <td>2</td>
column 1 cell <td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td>
column 2 cell <td>94 kg</td>
row 3
column 0 cell <td>3</td>
column 1 cell <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td>
column 2 cell <td>80 kg</td>
Extract String
- If you look carefully, rows 1 and 3 don’t display column 1 correctly: those cells were entered with an unclosed a tag, so each contains more than one child element and the string attribute returns None.
for i, row in enumerate(table_rows):
    print("row", i)
    cells = row.find_all('td')
    for j, cell in enumerate(cells):
        print("column", j, "cell", cell.string)
row 0
column 0 cell Flight No
column 1 cell Launch site
column 2 cell Payload mass
row 1
column 0 cell 1
column 1 cell None
column 2 cell 300 kg
row 2
column 0 cell 2
column 1 cell Texas
column 2 cell 94 kg
row 3
column 0 cell 3
column 1 cell None
column 2 cell 80 kg
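Putting the two loops together, a minimal sketch that collects the cell strings into a plain list of rows; the None values from the malformed cells are kept as-is:
launch_table = []
for row in table_rows:
    cells = row.find_all('td')
    launch_table.append([cell.string for cell in cells])
launch_table
# expected: [['Flight No', 'Launch site', 'Payload mass'], ['1', None, '300 kg'], ['2', 'Texas', '94 kg'], ['3', None, '80 kg']]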
- You can also match against a list of tag names
list_input = table_bs.find_all(name=["tr", "td"])
list_input
[<tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>, <td id="flight">Flight No</td>, <td>Launch site</td>, <td>Payload mass</td>, <tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td><td>300 kg</td></tr>, <td>1</td>, <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td>, <td>300 kg</td>, <tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr>, <td>2</td>, <td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td>, <td>94 kg</td>, <tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td><td>80 kg</td></tr>, <td>3</td>, <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td>, <td>80 kg</td>]
Find
Unlike find_all(), the find() method retrieves only the first match.
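A related point worth remembering: when nothing matches, find() returns None while find_all() returns an empty list. A quick sketch against the table parsed above:
print(table_bs.find('h2'))       # no <h2> in the markup above, so this prints None
print(table_bs.find_all('h2'))   # an empty list: []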
2 Tables
<h3>Rocket Launch </h3>
<p>
<table class='rocket'>
<tr>
<td>Flight No</td>
<td>Launch site</td>
<td>Payload mass</td>
</tr>
<tr>
<td>1</td>
<td>Florida</td>
<td>300 kg</td>
</tr>
<tr>
<td>2</td>
<td>Texas</td>
<td>94 kg</td>
</tr>
<tr>
<td>3</td>
<td>Florida </td>
<td>80 kg</td>
</tr>
</table>
</p>
<p>
<h3>Pizza Party </h3>
<table class='pizza'>
<tr>
<td>Pizza Place</td>
<td>Orders</td>
<td>Slices </td>
</tr>
<tr>
<td>Domino's Pizza</td>
<td>10</td>
<td>100</td>
</tr>
<tr>
<td>Little Caesars</td>
<td>12</td>
<td >144 </td>
</tr>
<tr>
<td>Papa John's </td>
<td>15 </td>
<td>165</td>
</tr>
="<h3>Rocket Launch </h3><p><table class='rocket'><tr><td>Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr><td>1</td><td>Florida</td><td>300 kg</td></tr><tr><td>2</td><td>Texas</td><td>94 kg</td></tr><tr><td>3</td><td>Florida </td><td>80 kg</td></tr></table></p><p><h3>Pizza Party </h3><table class='pizza'><tr><td>Pizza Place</td><td>Orders</td> <td>Slices </td></tr><tr><td>Domino's Pizza</td><td>10</td><td>100</td></tr><tr><td>Little Caesars</td><td>12</td><td >144 </td></tr><tr><td>Papa John's </td><td>15 </td><td>165</td></tr>" two_tables
HTML Parser
two_tables = BeautifulSoup(two_tables, 'html.parser')
Find the First Table
- Since find returns the first match, we use it here to find the first table tag, which corresponds to the first table
"table") two_tables.find(
<table class="rocket"><tr><td>Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr><td>1</td><td>Florida</td><td>300 kg</td></tr><tr><td>2</td><td>Texas</td><td>94 kg</td></tr><tr><td>3</td><td>Florida </td><td>80 kg</td></tr></table>
Find Using Class
- Remember, class is a Python keyword, so to find the second table by its attribute class='pizza' we write class_ with a trailing underscore to let Python know we are not using the reserved word
two_tables.find('table', class_='pizza')
<table class="pizza"><tr><td>Pizza Place</td><td>Orders</td> <td>Slices </td></tr><tr><td>Domino's Pizza</td><td>10</td><td>100</td></tr><tr><td>Little Caesars</td><td>12</td><td>144 </td></tr><tr><td>Papa John's </td><td>15 </td><td>165</td></tr></table>
Attributes
If an argument is not recognized, it will be turned into a filter on the tag’s attributes.
- For example, with the id argument, Beautiful Soup will filter against each tag’s id attribute.
- The first td element has an id value of 'flight', so we can filter based on that id value.
Filter
- Filter by a certain id value
# filter based on the id='flight' attribute
table_bs.find_all(id='flight')
[<td id="flight">Flight No</td>]
- Filter by link to the Florida Wikipedia page
- We already know there should be two as we saw above
list_input = table_bs.find_all(href="https://en.wikipedia.org/wiki/Florida")
list_input
[<a href="https://en.wikipedia.org/wiki/Florida">Florida</a>, <a href="https://en.wikipedia.org/wiki/Florida">Florida</a>]
- Filter on href=True to extract all the elements (here, the anchor tags) that have an href value
table_bs.find_all(href=True)
[<a href="https://en.wikipedia.org/wiki/Florida">Florida</a>, <a href="https://en.wikipedia.org/wiki/Texas">Texas</a>, <a href="https://en.wikipedia.org/wiki/Florida">Florida</a>]
- Filter elements that don’t have href values
table_bs.find_all(href=False)
[<html><head></head><body><table><tbody><tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td><td>300 kg</td></tr><tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr><tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td><td>80 kg</td></tr></tbody></table></body></html>, <head></head>, <body><table><tbody><tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td><td>300 kg</td></tr><tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr><tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td><td>80 kg</td></tr></tbody></table></body>, <table><tbody><tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td><td>300 kg</td></tr><tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr><tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td><td>80 kg</td></tr></tbody></table>, <tbody><tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td><td>300 kg</td></tr><tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr><tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td><td>80 kg</td></tr></tbody>, <tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>, <td id="flight">Flight No</td>, <td>Launch site</td>, <td>Payload mass</td>, <tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td><td>300 kg</td></tr>, <td>1</td>, <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td>, <a></a>, <td>300 kg</td>, <tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr>, <td>2</td>, <td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td>, <td>94 kg</td>, <tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td><td>80 kg</td></tr>, <td>3</td>, <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td>, <a> </a>, <td>80 kg</td>]
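Since each match is a Tag, the href values themselves can be pulled out with dictionary-style access; a minimal sketch using the anchors found above:
links = [a['href'] for a in table_bs.find_all('a', href=True)]
links
# ['https://en.wikipedia.org/wiki/Florida', 'https://en.wikipedia.org/wiki/Texas', 'https://en.wikipedia.org/wiki/Florida']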
Soup id=x
- Use the soup object to find the element whose id attribute is 'boldest'
soup.find_all(id='boldest')
[<b id="boldest">Lebron James</b>]
FindAll String
- Find all occurrences of the string 'Florida'
table_bs.find_all(string='Florida')
['Florida', 'Florida']
Example: Table with BS
Before proceeding to scrape a website, you need to examine its contents and the way the data is organized. Open the URL below in your browser and check how many rows and columns there are in the color table.
= "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/labs/datasets/HTMLColorCodes.html"
url1
= requests.get(url1).text
data1 = BeautifulSoup(data1, 'html5lib')
soup
# find the first table on the page using the <table> tag
= soup.find('table') table1
Get all the Rows and Columns
# use the <tr> tag to extract all the rows
for row in table1.find_all('tr'):
    # get all the columns, which use the <td> tag
    cols = row.find_all('td')
    color_name = cols[2].string   # the value in column 3 is the color name
    color_code = cols[3].string   # the value in column 4 is the color code
    print("{}------>{}".format(color_name, color_code))
Color Name------>None
lightsalmon------>#FFA07A
salmon------>#FA8072
darksalmon------>#E9967A
lightcoral------>#F08080
coral------>#FF7F50
tomato------>#FF6347
orangered------>#FF4500
gold------>#FFD700
orange------>#FFA500
darkorange------>#FF8C00
lightyellow------>#FFFFE0
lemonchiffon------>#FFFACD
papayawhip------>#FFEFD5
moccasin------>#FFE4B5
peachpuff------>#FFDAB9
palegoldenrod------>#EEE8AA
khaki------>#F0E68C
darkkhaki------>#BDB76B
yellow------>#FFFF00
lawngreen------>#7CFC00
chartreuse------>#7FFF00
limegreen------>#32CD32
lime------>#00FF00
forestgreen------>#228B22
green------>#008000
powderblue------>#B0E0E6
lightblue------>#ADD8E6
lightskyblue------>#87CEFA
skyblue------>#87CEEB
deepskyblue------>#00BFFF
lightsteelblue------>#B0C4DE
dodgerblue------>#1E90FF
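The first printed line comes from the header row, whose fourth cell has no simple string. A minimal follow-up sketch, assuming the header is the first row as in the output above, that skips it and collects the colors into a dictionary:
colors = {}
for row in table1.find_all('tr')[1:]:   # skip the header row
    cols = row.find_all('td')
    if len(cols) >= 4 and cols[2].string and cols[3].string:
        colors[cols[2].string.strip()] = cols[3].string.strip()
colors.get('salmon')   # '#FA8072' if the page matches the output above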