pip install bs4
pip install html5lib
Documentation can be found on the Beautiful Soup documentation page.
BeautifulSoup
Beautiful Soup is a Python library for pulling data out of HTML and XML files; here we will focus on HTML. It works by representing the HTML as a set of objects with methods used to parse the HTML.
We can navigate the HTML as a tree, and/or filter out what we are looking for. Consider:
<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<h3><b id='boldest'>Lebron James</b></h3>
<p> Salary: $ 92,000,000 </p>
<h3> Stephen Curry</h3>
<p> Salary: $85,000, 000 </p>
<h3> Kevin Durant </h3>
<p> Salary: $73,200, 000</p>
</body>
</html>
- We can store it as a string in the variable my_html:
="<!DOCTYPE html><html><head><title>Page Title</title></head><body><h3><b id='boldest'>Lebron James</b></h3><p> Salary: $ 92,000,000 </p><h3> Stephen Curry</h3><p> Salary: $85,000, 000 </p><h3> Kevin Durant </h3><p> Salary: $73,200, 000</p></body></html>" my_html
Parser
Beautiful Soup supports the HTML parser included in Python’s standard library, but it also supports a number of third-party Python parsers. One is the lxml parser. Depending on your setup, you might install lxml with the following command:
$ pip install lxml
This table summarizes the advantages and disadvantages of each parser library:
| Parser | Typical usage | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Python’s html.parser | BeautifulSoup(markup, "html.parser") | Batteries included; decent speed | Not as fast as lxml; less lenient than html5lib |
| lxml’s HTML parser | BeautifulSoup(markup, "lxml") | Very fast; lenient | External C dependency |
| lxml’s XML parser | BeautifulSoup(markup, "lxml-xml") or BeautifulSoup(markup, "xml") | Very fast; the only currently supported XML parser | External C dependency |
| html5lib | BeautifulSoup(markup, "html5lib") | Extremely lenient; parses pages the same way a web browser does; creates valid HTML5 | Very slow; external Python dependency |
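Parser choice matters most for malformed markup: html5lib repairs documents the way a browser would, while html.parser is more literal. A minimal sketch comparing the two parsers installed above (the exact output depends on your library versions):
from bs4 import BeautifulSoup
broken = "<a></p>"   # deliberately invalid markup
print(BeautifulSoup(broken, "html.parser"))   # typically left almost as-is
print(BeautifulSoup(broken, "html5lib"))      # typically repaired and wrapped in <html><head><body>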
Constructor
from bs4 import BeautifulSoup
import requests
soup = BeautifulSoup(my_html, 'html5lib')
- To parse a document, pass it into the BeautifulSoup constructor.
- The BeautifulSoup object represents the document as a nested data structure: first, the document is converted to Unicode and HTML entities are converted to Unicode characters.
- Beautiful Soup transforms a complex HTML document into a complex tree of Python objects.
- The BeautifulSoup object can create other types of objects; here we will cover BeautifulSoup and Tag objects, and later NavigableString objects.
- We can use the method prettify() to display the HTML in its nested structure:
print(soup.prettify())
<!DOCTYPE html>
<html>
<head>
<title>
Page Title
</title>
</head>
<body>
<h3>
<b id="boldest">
Lebron James
</b>
</h3>
<p>
Salary: $ 92,000,000
</p>
<h3>
Stephen Curry
</h3>
<p>
Salary: $85,000, 000
</p>
<h3>
Kevin Durant
</h3>
<p>
Salary: $73,200, 000
</p>
</body>
</html>
Beautiful Soup transforms a complex HTML document into a complex tree of Python objects, but you’ll only ever have to deal with about four kinds of objects: Tag, NavigableString, BeautifulSoup, and Comment. These objects represent the HTML elements that comprise the page.
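A quick way to see these object kinds, using the soup built above (a minimal sketch; Comment objects only appear when the markup contains HTML comments):
type(soup)            # <class 'bs4.BeautifulSoup'>
type(soup.b)          # <class 'bs4.element.Tag'>
type(soup.b.string)   # <class 'bs4.element.NavigableString'>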
HTML Attributes
A tag may have any number of attributes; the tag <b id="boldest"> has an attribute id whose value is boldest.
- You can access a tag’s attributes by treating the tag like a dictionary
- You can access that dictionary directly as attrs
tag_child = soup.h3.b  # one way to obtain the <b id="boldest"> tag
tag_child
<b id="boldest">Lebron James</b>
- Access the attribute
tag_child['id']
'boldest'
- Access the entire dictionary directly
tag_child.attrs
{'id': 'boldest'}
- You can use the get method as well
tag_child.get('id')
'boldest'
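The text inside a tag is itself an object: Beautiful Soup represents it as a NavigableString, mentioned earlier. A short sketch using the tag above:
tag_string = tag_child.string   # the text inside the <b> tag
tag_string                      # 'Lebron James'
type(tag_string)                # <class 'bs4.element.NavigableString'>
str(tag_string)                 # convert to a regular Python string if needed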
Tables
Filter
Filters allow you to find complex patterns; the simplest filter is a string.
- In this section we will pass a string to a filter method, and Beautiful Soup will perform a match against that exact string.
- Consider the following HTML of rocket launches:
<table>
<tr>
<td id='flight' >Flight No</td>
<td>Launch site</td>
<td>Payload mass</td>
</tr>
<tr>
<td>1</td>
<td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a></td>
<td>300 kg</td>
</tr>
<tr>
<td>2</td>
<td><a href='https://en.wikipedia.org/wiki/Texas'>Texas</a></td>
<td>94 kg</td>
</tr>
<tr>
<td>3</td>
<td><a href='https://en.wikipedia.org/wiki/Florida'>Florida<a> </td>
<td>80 kg</td>
</tr>
</table>
- Store it in a string
="<table><tr><td id='flight'>Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr> <td>1</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida<a></td><td>300 kg</td></tr><tr><td>2</td><td><a href='https://en.wikipedia.org/wiki/Texas'>Texas</a></td><td>94 kg</td></tr><tr><td>3</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida<a> </td><td>80 kg</td></tr></table>" table
table_bs = BeautifulSoup(table, 'html5lib')
print(table_bs.prettify())
<html>
<head>
</head>
<body>
<table>
<tbody>
<tr>
<td id="flight">
Flight No
</td>
<td>
Launch site
</td>
<td>
Payload mass
</td>
</tr>
<tr>
<td>
1
</td>
<td>
<a href="https://en.wikipedia.org/wiki/Florida">
Florida
</a>
<a>
</a>
</td>
<td>
300 kg
</td>
</tr>
<tr>
<td>
2
</td>
<td>
<a href="https://en.wikipedia.org/wiki/Texas">
Texas
</a>
</td>
<td>
94 kg
</td>
</tr>
<tr>
<td>
3
</td>
<td>
<a href="https://en.wikipedia.org/wiki/Florida">
Florida
</a>
<a>
</a>
</td>
<td>
80 kg
</td>
</tr>
</tbody>
</table>
</body>
</html>
Find All
The find_all() method looks through a tag’s descendants and retrieves all descendants that match your filters.
- Syntax: the method signature is find_all(name, attrs, recursive, string, limit, **kwargs)
Name
- When we set the name parameter to a tag name, the method will extract all the tags with that name, along with their children.
- If you remember, 'tr' is the name of the tag for each row in a table.
- So let’s extract all rows using find_all.
- The result is a Python iterable, just like a list; each element is a Tag object:
table_rows = table_bs.find_all('tr')
table_rows
[<tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>, <tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td><td>300 kg</td></tr>, <tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr>, <tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td><td>80 kg</td></tr>]
- Extract first row by location
first_row = table_rows[0]
first_row
<tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>
type(first_row)
<class 'bs4.element.Tag'>
Child of Row
- We can also target a child of the row, since each row contains td tags
- As you can see, there is also a string within the tag
- Remember, accessing a tag name as an attribute returns ONLY the first tag it encounters
first_row_child = first_row.td
first_row_child
<td id="flight">Flight No</td>
String
- Just as before, we can extract the string:
first_row_child.string
'Flight No'
Print all Rows
for i, row in enumerate(table_rows):
    print('row', i, 'is', row)
row 0 is <tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>
row 1 is <tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td><td>300 kg</td></tr>
row 2 is <tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr>
row 3 is <tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td><td>80 kg</td></tr>
Iterate and Extract Cells
Now that we have filtered all the table rows, we can iterate through them, extract the cell values, and build a table (a small sketch that assembles the values appears after the string-extraction example below).
- As each row is a Tag object, we can apply the method find_all to it and extract the table cells into the object cells using the tag td; these are all the children with the name td.
- The result is a list; each element corresponds to a cell and is a Tag object, so we can iterate through this list as well.
- We can extract the content using the string attribute.
for i, row in enumerate(table_rows):
    print("row", i)
    cells = row.find_all('td')
    for j, cell in enumerate(cells):
        print("column", j, "cell", cell)
row 0
column 0 cell <td id="flight">Flight No</td>
column 1 cell <td>Launch site</td>
column 2 cell <td>Payload mass</td>
row 1
column 0 cell <td>1</td>
column 1 cell <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td>
column 2 cell <td>300 kg</td>
row 2
column 0 cell <td>2</td>
column 1 cell <td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td>
column 2 cell <td>94 kg</td>
row 3
column 0 cell <td>3</td>
column 1 cell <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td>
column 2 cell <td>80 kg</td>
Extract String
- If you look carefully, rows 1 and 3 don’t display column 1 correctly: those cells were entered with an unclosed a tag, so each contains more than one child element and the string attribute returns None.
for i, row in enumerate(table_rows):
    print("row", i)
    cells = row.find_all('td')
    for j, cell in enumerate(cells):
        print("column", j, "cell", cell.string)
row 0
column 0 cell Flight No
column 1 cell Launch site
column 2 cell Payload mass
row 1
column 0 cell 1
column 1 cell None
column 2 cell 300 kg
row 2
column 0 cell 2
column 1 cell Texas
column 2 cell 94 kg
row 3
column 0 cell 3
column 1 cell None
column 2 cell 80 kg
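Putting the two loops together, a minimal sketch that collects the cell strings into a plain list of rows; the None values from the malformed cells are kept as-is:
launch_table = []
for row in table_rows:
    cells = row.find_all('td')
    launch_table.append([cell.string for cell in cells])
launch_table
# expected: [['Flight No', 'Launch site', 'Payload mass'], ['1', None, '300 kg'], ['2', 'Texas', '94 kg'], ['3', None, '80 kg']]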
- You can also match against a list of tag names
list_input = table_bs.find_all(name=["tr", "td"])
list_input
[<tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>, <td id="flight">Flight No</td>, <td>Launch site</td>, <td>Payload mass</td>, <tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td><td>300 kg</td></tr>, <td>1</td>, <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td>, <td>300 kg</td>, <tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr>, <td>2</td>, <td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td>, <td>94 kg</td>, <tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td><td>80 kg</td></tr>, <td>3</td>, <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td>, <td>80 kg</td>]
Find
Unlike find_all(), the find() method retrieves only the first match.
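A related point worth remembering: when nothing matches, find() returns None while find_all() returns an empty list. A quick sketch against the table parsed above:
print(table_bs.find('h2'))       # no <h2> in the markup above, so this prints None
print(table_bs.find_all('h2'))   # an empty list: []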
2 Tables
<h3>Rocket Launch </h3>
<p>
<table class='rocket'>
<tr>
<td>Flight No</td>
<td>Launch site</td>
<td>Payload mass</td>
</tr>
<tr>
<td>1</td>
<td>Florida</td>
<td>300 kg</td>
</tr>
<tr>
<td>2</td>
<td>Texas</td>
<td>94 kg</td>
</tr>
<tr>
<td>3</td>
<td>Florida </td>
<td>80 kg</td>
</tr>
</table>
</p>
<p>
<h3>Pizza Party </h3>
<table class='pizza'>
<tr>
<td>Pizza Place</td>
<td>Orders</td>
<td>Slices </td>
</tr>
<tr>
<td>Domino's Pizza</td>
<td>10</td>
<td>100</td>
</tr>
<tr>
<td>Little Caesars</td>
<td>12</td>
<td >144 </td>
</tr>
<tr>
<td>Papa John's </td>
<td>15 </td>
<td>165</td>
</tr>
="<h3>Rocket Launch </h3><p><table class='rocket'><tr><td>Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr><td>1</td><td>Florida</td><td>300 kg</td></tr><tr><td>2</td><td>Texas</td><td>94 kg</td></tr><tr><td>3</td><td>Florida </td><td>80 kg</td></tr></table></p><p><h3>Pizza Party </h3><table class='pizza'><tr><td>Pizza Place</td><td>Orders</td> <td>Slices </td></tr><tr><td>Domino's Pizza</td><td>10</td><td>100</td></tr><tr><td>Little Caesars</td><td>12</td><td >144 </td></tr><tr><td>Papa John's </td><td>15 </td><td>165</td></tr>" two_tables
HTML Parser
two_tables = BeautifulSoup(two_tables, 'html.parser')
Find the First Table
- Since find returns the first match, we use it here to find the first table tag, which corresponds to the first table
"table") two_tables.find(
<table class="rocket"><tr><td>Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr><td>1</td><td>Florida</td><td>300 kg</td></tr><tr><td>2</td><td>Texas</td><td>94 kg</td></tr><tr><td>3</td><td>Florida </td><td>80 kg</td></tr></table>
Find Using Class
- Remember, class is a Python keyword, so to find the second table by its attribute class='pizza' we write class_ with a trailing underscore to let Python know we are not using the reserved word
two_tables.find('table', class_='pizza')
<table class="pizza"><tr><td>Pizza Place</td><td>Orders</td> <td>Slices </td></tr><tr><td>Domino's Pizza</td><td>10</td><td>100</td></tr><tr><td>Little Caesars</td><td>12</td><td>144 </td></tr><tr><td>Papa John's </td><td>15 </td><td>165</td></tr></table>
Attributes
If an argument is not recognized, it will be turned into a filter on the tag’s attributes.
- For example, with the id argument, Beautiful Soup will filter against each tag’s id attribute.
- The first td element has an id value of 'flight', so we can filter based on that id value.
Filter
- Filter by a certain id value
# filter based on the id='flight' attribute
table_bs.find_all(id='flight')
[<td id="flight">Flight No</td>]
- Filter by link to the Florida Wikipedia page
- We already know there should be two as we saw above
list_input = table_bs.find_all(href="https://en.wikipedia.org/wiki/Florida")
list_input
[<a href="https://en.wikipedia.org/wiki/Florida">Florida</a>, <a href="https://en.wikipedia.org/wiki/Florida">Florida</a>]
- Filter on href=True to extract all the elements (here, the anchor tags) that have an href value
table_bs.find_all(href=True)
[<a href="https://en.wikipedia.org/wiki/Florida">Florida</a>, <a href="https://en.wikipedia.org/wiki/Texas">Texas</a>, <a href="https://en.wikipedia.org/wiki/Florida">Florida</a>]
- Filter elements that don’t have href values
table_bs.find_all(href=False)
[<html><head></head><body><table><tbody><tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td><td>300 kg</td></tr><tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr><tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td><td>80 kg</td></tr></tbody></table></body></html>, <head></head>, <body><table><tbody><tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td><td>300 kg</td></tr><tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr><tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td><td>80 kg</td></tr></tbody></table></body>, <table><tbody><tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td><td>300 kg</td></tr><tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr><tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td><td>80 kg</td></tr></tbody></table>, <tbody><tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td><td>300 kg</td></tr><tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr><tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td><td>80 kg</td></tr></tbody>, <tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>, <td id="flight">Flight No</td>, <td>Launch site</td>, <td>Payload mass</td>, <tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td><td>300 kg</td></tr>, <td>1</td>, <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td>, <a></a>, <td>300 kg</td>, <tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr>, <td>2</td>, <td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td>, <td>94 kg</td>, <tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td><td>80 kg</td></tr>, <td>3</td>, <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td>, <a> </a>, <td>80 kg</td>]
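Since each match is a Tag, the href values themselves can be pulled out with dictionary-style access; a minimal sketch using the anchors found above:
links = [a['href'] for a in table_bs.find_all('a', href=True)]
links
# ['https://en.wikipedia.org/wiki/Florida', 'https://en.wikipedia.org/wiki/Texas', 'https://en.wikipedia.org/wiki/Florida']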
Soup id=x
- Use the soup object to find the element whose id attribute is 'boldest'
soup.find_all(id='boldest')
[<b id="boldest">Lebron James</b>]
FindAll String
- Find all occurrences of the string 'Florida'
table_bs.find_all(string='Florida')
['Florida', 'Florida']
Example: Table with BS
Before proceeding to scrape a website, you need to examine its contents and the way the data is organized. Open the URL below in your browser and check how many rows and columns there are in the color table.
= "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/labs/datasets/HTMLColorCodes.html"
url1
= requests.get(url1).text
data1 = BeautifulSoup(data1, 'html5lib')
soup
# find the first table on the page using the <table> tag
= soup.find('table') table1
Get all the Rows and Columns
# use the <tr> tag to extract all the rows
for row in table1.find_all('tr'):
    # get all the columns, which use the <td> tag
    cols = row.find_all('td')
    color_name = cols[2].string   # the value in column 3 is the color name
    color_code = cols[3].string   # the value in column 4 is the color code
    print("{}------>{}".format(color_name, color_code))
Color Name------>None
lightsalmon------>#FFA07A
salmon------>#FA8072
darksalmon------>#E9967A
lightcoral------>#F08080
coral------>#FF7F50
tomato------>#FF6347
orangered------>#FF4500
gold------>#FFD700
orange------>#FFA500
darkorange------>#FF8C00
lightyellow------>#FFFFE0
lemonchiffon------>#FFFACD
papayawhip------>#FFEFD5
moccasin------>#FFE4B5
peachpuff------>#FFDAB9
palegoldenrod------>#EEE8AA
khaki------>#F0E68C
darkkhaki------>#BDB76B
yellow------>#FFFF00
lawngreen------>#7CFC00
chartreuse------>#7FFF00
limegreen------>#32CD32
lime------>#00FF00
forestgreen------>#228B22
green------>#008000
powderblue------>#B0E0E6
lightblue------>#ADD8E6
lightskyblue------>#87CEFA
skyblue------>#87CEEB
deepskyblue------>#00BFFF
lightsteelblue------>#B0C4DE
dodgerblue------>#1E90FF
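The first printed line comes from the header row, whose fourth cell has no simple string. A minimal follow-up sketch, assuming the header is the first row as in the output above, that skips it and collects the colors into a dictionary:
colors = {}
for row in table1.find_all('tr')[1:]:   # skip the header row
    cols = row.find_all('td')
    if len(cols) >= 4 and cols[2].string and cols[3].string:
        colors[cols[2].string.strip()] = cols[3].string.strip()
colors.get('salmon')   # '#FA8072' if the page matches the output above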