Documentation can be found on their page.

BeautifulSoup

Beautiful Soup is a Python library for pulling data out of HTML and XML files; we will focus on HTML files. It does this by representing the HTML as a set of objects whose methods are used to parse the document.

We can navigate the HTML as a tree and/or filter for what we are looking for. First install the packages:

pip install bs4
pip install html5lib

Then consider the following page:
<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<h3><b id='boldest'>Lebron James</b></h3>
<p> Salary: $ 92,000,000 </p>
<h3> Stephen Curry</h3>
<p> Salary: $85,000, 000 </p>
<h3> Kevin Durant </h3>
<p> Salary: $73,200, 000</p>
</body>
</html>
  • We can store it as a string in the variable my_html:
my_html="<!DOCTYPE html><html><head><title>Page Title</title></head><body><h3><b id='boldest'>Lebron James</b></h3><p> Salary: $ 92,000,000 </p><h3> Stephen Curry</h3><p> Salary: $85,000, 000 </p><h3> Kevin Durant </h3><p> Salary: $73,200, 000</p></body></html>"

Parser

Beautiful Soup supports the HTML parser included in Python’s standard library, but it also supports a number of third-party Python parsers. One is the lxml parser. Depending on your setup, you might install lxml with a command such as:

$ pip install lxml

This table summarizes the advantages and disadvantages of each parser library:

Parser | Typical usage | Advantages | Disadvantages
Python’s html.parser | BeautifulSoup(markup, "html.parser") | Batteries included; decent speed | Not as fast as lxml; less lenient than html5lib
lxml’s HTML parser | BeautifulSoup(markup, "lxml") | Very fast | External C dependency
lxml’s XML parser | BeautifulSoup(markup, "lxml-xml") or BeautifulSoup(markup, "xml") | Very fast; the only currently supported XML parser | External C dependency
html5lib | BeautifulSoup(markup, "html5lib") | Extremely lenient; parses pages the same way a web browser does; creates valid HTML5 | Very slow; external Python dependency
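
Since lxml and html5lib are optional installs, a script may want to fall back to the built-in parser when they are missing. The following is a minimal sketch of that idea; the helper name make_soup and the preference order are assumptions for illustration, not part of Beautiful Soup. It relies on the FeatureNotFound exception that Beautiful Soup raises when a requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: parse with the first parser that is installed."""
    for parser_name in preferred:
        try:
            return BeautifulSoup(markup, parser_name), parser_name
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is available")


demo_soup, parser_used = make_soup("<p>Hello <b>world</b></p>")
print(parser_used, demo_soup.b.string)
```

Which parser is reported depends on what is installed on your machine; html.parser is always available because it ships with Python.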

Constructor

from bs4 import BeautifulSoup
import requests
soup = BeautifulSoup(my_html, 'html5lib')
  • To parse a document, pass it into the BeautifulSoup constructor
  • The BeautifulSoup object represents the document as a nested data structure:
    • First, the document is converted to Unicode, and HTML entities are converted to Unicode characters
    • Beautiful Soup transforms a complex HTML document into a complex tree of Python objects.
    • The BeautifulSoup object can create other types of objects.
    • Here, we will cover BeautifulSoup and Tag objects.
    • Finally, we will look at NavigableString objects.
    • We can use the method prettify() to display the HTML in the nested structure:
print(soup.prettify())
<!DOCTYPE html>
<html>
 <head>
  <title>
   Page Title
  </title>
 </head>
 <body>
  <h3>
   <b id="boldest">
    Lebron James
   </b>
  </h3>
  <p>
   Salary: $ 92,000,000
  </p>
  <h3>
   Stephen Curry
  </h3>
  <p>
   Salary: $85,000, 000
  </p>
  <h3>
   Kevin Durant
  </h3>
  <p>
   Salary: $73,200, 000
  </p>
 </body>
</html>
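
The constructor notes above mention that HTML entities are converted to Unicode characters. A small sketch of that behaviour follows; the sample markup and variable names are made up for illustration, and html.parser is used only so that no extra package is needed.

```python
from bs4 import BeautifulSoup

# Sample markup (an assumption for this sketch) containing HTML entities.
entity_soup = BeautifulSoup(
    "<p>caf&eacute; &amp; cr&egrave;me: &euro;5</p>", "html.parser"
)

# The parsed string holds real Unicode characters, not the entity names.
entity_text = entity_soup.p.string
print(entity_text)
```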

Beautiful Soup transforms a complex HTML document into a complex tree of Python objects. But you’ll only ever have to deal with about four kinds of objects: Tag, NavigableString, BeautifulSoup, and Comment. These objects represent the HTML elements that comprise the page.
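
To make the four kinds of objects concrete, here is a minimal sketch that parses a tiny made-up snippet and checks the type of each piece. The snippet and variable names are assumptions for illustration only.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# A tiny document with one tag, one text string and one comment.
kinds_soup = BeautifulSoup("<p>Hello<!-- a note --></p>", "html.parser")

paragraph = kinds_soup.p               # a Tag
text_node, note = paragraph.contents   # a NavigableString and a Comment

print(type(kinds_soup).__name__)   # BeautifulSoup
print(type(paragraph).__name__)    # Tag
print(type(text_node).__name__)    # NavigableString
print(type(note).__name__)         # Comment
```

Comment is a special kind of NavigableString, so isinstance(note, NavigableString) is also true.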

Tags


  • Let’s say we want the title of the page and the name of the top-paid player.
  • We can use the Tag object for this.
  • A Tag object corresponds to an XML or HTML tag in the original document, for example .title, .b, .h3, .p or .body
tag = soup.b
type(tag)
<class 'bs4.element.Tag'>

Title

  • Let’s pull out the page title using the .title tag
  • This page happens to have only one

If there is more than one tag with the same name, ONLY the first one is returned

tag_object = soup.title
tag_object
<title>Page Title</title>

Name

  • Every tag has a name, available as .name. For example, the tag_object above, which represents the .title tag, is named:
tag_object.name
'title'

Attrs

  • An HTML or XML tag may have any number of attributes.
  • In our example, let’s see what the attrs of the .b tag are
  • As you see below, .attrs gives us the dictionary of that tag’s attributes
  • Remember the markup from above:
 <body>
  <h3>
   <b id="boldest">
    Lebron James
   </b>
tag = soup.b
tag.attrs
{'id': 'boldest'}
tag['id']
'boldest'
tag.attrs.keys()
dict_keys(['id'])

Contents

  • A tag’s children are available in a list called .contents
  • Let’s look at the contents of soup
soup.contents
['html', <html><head><title>Page Title</title></head><body><h3><b id="boldest">Lebron James</b></h3><p> Salary: $ 92,000,000 </p><h3> Stephen Curry</h3><p> Salary: $85,000, 000 </p><h3> Kevin Durant </h3><p> Salary: $73,200, 000</p></body></html>]
  • Now let’s look at the .contents of a single tag
  • soup.h3 refers to the first h3 it encounters
  • So we get the children of that first h3 only
soup.h3.contents
[<b id="boldest">Lebron James</b>]
  • Loop through the entire page and retrieve the contents of every h3
for heading in soup.find_all('h3'):
    h = heading.contents
    print(h)
[<b id="boldest">Lebron James</b>]
[' Stephen Curry']
[' Kevin Durant ']
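
The loop above returns lists that mix tags and padded strings. If only the player names are wanted, one option is the get_text() method, which these notes do not otherwise cover. The sketch below uses a shortened copy of the page and html.parser; the variable names are assumptions.

```python
from bs4 import BeautifulSoup

# Shortened copy of the example page: only the three headings.
players_html = (
    "<body><h3><b id='boldest'>Lebron James</b></h3>"
    "<h3> Stephen Curry</h3><h3> Kevin Durant </h3></body>"
)
players_soup = BeautifulSoup(players_html, "html.parser")

# get_text(strip=True) joins all text inside the tag and trims whitespace,
# so it works whether or not the name is wrapped in a <b> tag.
names = [heading.get_text(strip=True) for heading in players_soup.find_all("h3")]
print(names)  # ['Lebron James', 'Stephen Curry', 'Kevin Durant']
```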

Heading H3

tag_object = soup.h3
tag_object
<h3><b id="boldest">Lebron James</b></h3>

Child

  • We can navigate down the tree of objects
  • So let’s move down the tag_object of h3 set above and
  • Extract the bold text using .b
# Move one step down from tag_object or to the child of tag_object
tag_child = tag_object.b
tag_child
<b id="boldest">Lebron James</b>

Parent

  • We can access the parent of a child tag_child
  • Which should be h3
parent_tag = tag_child.parent
parent_tag
<h3><b id="boldest">Lebron James</b></h3>
  • We can also see the parent of the tag_object which would be the body
tag_object.parent
<body><h3><b id="boldest">Lebron James</b></h3><p> Salary: $ 92,000,000 </p><h3> Stephen Curry</h3><p> Salary: $85,000, 000 </p><h3> Kevin Durant </h3><p> Salary: $73,200, 000</p></body>

Sibling

  • Let’s see what siblings the Lebron James heading has
  • That is, what other tags are at the same level in the document
  • We can find the next one with next_sibling, which here is the paragraph
sibling_1 = tag_object.next_sibling
sibling_1
<p> Salary: $ 92,000,000 </p>
  • Let’s see the sibling of sibling
  • Which would be the next h3
sibling_2 = sibling_1.next_sibling
sibling_2
<h3> Stephen Curry</h3>
  • To find the salary of Stephen Curry
  • We need the paragraph that follows sibling_2
curry_salary = sibling_2.next_sibling
curry_salary
<p> Salary: $85,000, 000 </p>
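
Chaining next_sibling works here because my_html has no whitespace between tags. When the markup is spread over several lines, the next sibling of a tag is often a whitespace string. A hedged sketch of a more tolerant approach uses find_next_sibling(), which skips ahead to the next sibling tag with a given name. The shortened HTML and variable names below are assumptions.

```python
from bs4 import BeautifulSoup

# Same content as the example page, but with line breaks between the tags.
salary_html = """
<h3><b id='boldest'>Lebron James</b></h3>
<p> Salary: $ 92,000,000 </p>
<h3> Stephen Curry</h3>
<p> Salary: $85,000, 000 </p>
"""
salary_soup = BeautifulSoup(salary_html, "html.parser")

# With line breaks in the source, next_sibling is the newline, not the <p>.
raw_sibling = salary_soup.h3.next_sibling

# find_next_sibling("p") skips strings and returns the next <p> tag.
salaries = {}
for heading in salary_soup.find_all("h3"):
    paragraph = heading.find_next_sibling("p")
    if paragraph is not None:
        salaries[heading.get_text(strip=True)] = paragraph.get_text(strip=True)

print(repr(raw_sibling))
print(salaries)
```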

HTML Attributes


A tag may have any number of attributes. For example, the tag <b id="boldest"> has an attribute id whose value is boldest.

  • You can access a tag’s attributes by treating the tag like a dictionary
  • You can access that dictionary directly as attrs
tag_child
<b id="boldest">Lebron James</b>
  • Access the attribute
tag_child['id']
'boldest'
  • Access the entire dictionary directly
tag_child.attrs
{'id': 'boldest'}
  • You can use the get method as well
tag_child.get('id')
'boldest'
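
The difference between the access styles above shows up when an attribute is missing. The sketch below rebuilds the b tag on its own and compares them; the variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

attr_soup = BeautifulSoup("<b id='boldest'>Lebron James</b>", "html.parser")
bold = attr_soup.b

# .get() returns None (or a supplied default) when the attribute is absent.
missing = bold.get("class")
fallback = bold.get("class", "no-class")

# Dictionary-style access raises KeyError for a missing attribute.
try:
    bold["class"]
    raised = False
except KeyError:
    raised = True

# has_attr() checks for an attribute without reading its value.
has_id = bold.has_attr("id")
print(missing, fallback, raised, has_id)  # None no-class True True
```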

Tables


Filter

Filters allow you to find complex patterns; the simplest filter is a string.

  • In this section we will pass a string to different filter methods, and Beautiful Soup will perform a match against that exact string.
  • Consider the following HTML of rocket launches:
<table>
  <tr>
    <td id='flight' >Flight No</td>
    <td>Launch site</td> 
    <td>Payload mass</td>
   </tr>
  <tr> 
    <td>1</td>
    <td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a></td>
    <td>300 kg</td>
  </tr>
  <tr>
    <td>2</td>
    <td><a href='https://en.wikipedia.org/wiki/Texas'>Texas</a></td>
    <td>94 kg</td>
  </tr>
  <tr>
    <td>3</td>
    <td><a href='https://en.wikipedia.org/wiki/Florida'>Florida<a> </td>
    <td>80 kg</td>
  </tr>
</table>
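  • Store it in a string. Note that in this string the links in rows 1 and 3 end with <a> instead of </a>, which affects the parsed output below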
table="<table><tr><td id='flight'>Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr> <td>1</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida<a></td><td>300 kg</td></tr><tr><td>2</td><td><a href='https://en.wikipedia.org/wiki/Texas'>Texas</a></td><td>94 kg</td></tr><tr><td>3</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida<a> </td><td>80 kg</td></tr></table>"
table_bs = BeautifulSoup(table, 'html5lib')
print(table_bs.prettify())
<html>
 <head>
 </head>
 <body>
  <table>
   <tbody>
    <tr>
     <td id="flight">
      Flight No
     </td>
     <td>
      Launch site
     </td>
     <td>
      Payload mass
     </td>
    </tr>
    <tr>
     <td>
      1
     </td>
     <td>
      <a href="https://en.wikipedia.org/wiki/Florida">
       Florida
      </a>
      <a>
      </a>
     </td>
     <td>
      300 kg
     </td>
    </tr>
    <tr>
     <td>
      2
     </td>
     <td>
      <a href="https://en.wikipedia.org/wiki/Texas">
       Texas
      </a>
     </td>
     <td>
      94 kg
     </td>
    </tr>
    <tr>
     <td>
      3
     </td>
     <td>
      <a href="https://en.wikipedia.org/wiki/Florida">
       Florida
      </a>
      <a>
      </a>
     </td>
     <td>
      80 kg
     </td>
    </tr>
   </tbody>
  </table>
 </body>
</html>

Find All

The find_all() method looks through a tag’s descendants and retrieves all descendants that match your filters.

  • Method signature: find_all(name, attrs, recursive, string, limit, **kwargs)
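
Two parameters in this signature, limit and recursive, are not exercised elsewhere in these notes. The following sketch shows what they do on a small made-up table; html.parser is used, which does not insert a tbody element.

```python
from bs4 import BeautifulSoup

mini_table = "<table><tr><td>1</td><td>2</td></tr><tr><td>3</td></tr></table>"
mini_soup = BeautifulSoup(mini_table, "html.parser")

# limit stops the search after the given number of matches.
first_two_cells = mini_soup.find_all("td", limit=2)

# recursive=False only considers direct children of the tag.
direct_rows = mini_soup.table.find_all("tr", recursive=False)
direct_cells = mini_soup.table.find_all("td", recursive=False)

print(len(first_two_cells))  # 2
print(len(direct_rows))      # 2: the rows are direct children of the table
print(direct_cells)          # []: the cells are grandchildren of the table
```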

Name

  • When we set the name parameter to a tag name, the method will extract all the tags with that name, together with their children.
  • If you remember, 'tr' is the name of the tag for each row in a table
  • So let’s extract all rows using find_all
  • The result is a ResultSet, which behaves like a Python list; each element is a Tag object:
table_rows = table_bs.find_all('tr')
table_rows
[<tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>, <tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td><td>300 kg</td></tr>, <tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr>, <tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td><td>80 kg</td></tr>]
  • Extract first row by location
first_row = table_rows[0]
first_row
<tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>
type(first_row)
<class 'bs4.element.Tag'>

Child of Row

  • We can of course target a child of the row, since each row contains td tags
  • As you can see, there is also a string within the tag
  • Remember that .td returns ONLY the first td tag it encounters
first_row_child = first_row.td
first_row_child
<td id="flight">Flight No</td>

String

  • Just as we did before we can extract the string
first_row_child.string
'Flight No'

Iterate Extract Cells

Now that we have all the table rows, we can iterate through them and extract the cell values.

  • As row is a Tag object, we can apply the method find_all to it and collect the table cells in the variable cells using the tag name td; this returns every descendant of the row named td.
  • The result is a list in which each element corresponds to a cell and is a Tag object; we can iterate through this list as well.
  • We can extract the content using the string attribute.
for i, row in enumerate(table_rows):
    print("row", i)
    cells = row.find_all('td')
    for j, cell in enumerate(cells):
        print('column', j, "cell", cell)
row 0
column 0 cell <td id="flight">Flight No</td>
column 1 cell <td>Launch site</td>
column 2 cell <td>Payload mass</td>
row 1
column 0 cell <td>1</td>
column 1 cell <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td>
column 2 cell <td>300 kg</td>
row 2
column 0 cell <td>2</td>
column 1 cell <td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td>
column 2 cell <td>94 kg</td>
row 3
column 0 cell <td>3</td>
column 1 cell <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td>
column 2 cell <td>80 kg</td>

Extract String

  • If you look carefully, rows 1 and 3 return None for column 1. The malformed <a> markup gives those cells two child tags, and .string is None when a tag has more than one child
for i, row in enumerate(table_rows):
    print("row", i)
    cells = row.find_all('td')
    for j, cell in enumerate(cells):
        print('column', j, "cell", cell.string)
row 0
column 0 cell Flight No
column 1 cell Launch site
column 2 cell Payload mass
row 1
column 0 cell 1
column 1 cell None
column 2 cell 300 kg
row 2
column 0 cell 2
column 1 cell Texas
column 2 cell 94 kg
row 3
column 0 cell 3
column 1 cell None
column 2 cell 80 kg
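
One way to avoid the None values above is to read each cell with get_text() instead of .string, since get_text() concatenates all text inside the cell regardless of how many child tags it has. The sketch below uses a shortened, two-column version of the launch table with the same unclosed <a> tag; the variable names and the use of html.parser are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Shortened launch table; row 1 ends its link with <a> instead of </a>.
launch_html = (
    "<table>"
    "<tr><td>Flight No</td><td>Launch site</td></tr>"
    "<tr><td>1</td><td><a href='https://en.wikipedia.org/wiki/Florida'>Florida<a></td></tr>"
    "<tr><td>2</td><td><a href='https://en.wikipedia.org/wiki/Texas'>Texas</a></td></tr>"
    "</table>"
)
launch_soup = BeautifulSoup(launch_html, "html.parser")

# Build a list of rows, each row being a list of cleaned cell strings.
rows_text = [
    [cell.get_text(strip=True) for cell in row.find_all("td")]
    for row in launch_soup.find_all("tr")
]
print(rows_text)
```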
  • Match against a list
list_input = table_bs.find_all(name=["tr", "td"])
list_input
[<tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>, <td id="flight">Flight No</td>, <td>Launch site</td>, <td>Payload mass</td>, <tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td><td>300 kg</td></tr>, <td>1</td>, <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td>, <td>300 kg</td>, <tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr>, <td>2</td>, <td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td>, <td>94 kg</td>, <tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td><td>80 kg</td></tr>, <td>3</td>, <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td>, <td>80 kg</td>]

Find

Unlike find_all(), the find() method returns only the first match.
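
The two methods also differ when nothing matches, which matters when writing defensive scraping code. A short sketch on a made-up one-cell table (variable names are assumptions):

```python
from bs4 import BeautifulSoup

tiny_soup = BeautifulSoup(
    "<table class='rocket'><tr><td>1</td></tr></table>", "html.parser"
)

# There is no <h3> in this markup.
no_match_find = tiny_soup.find("h3")          # None
no_match_find_all = tiny_soup.find_all("h3")  # empty list

# When there is a match, find() returns the same element as the first
# item of find_all().
first_cell = tiny_soup.find("td")
same_cell = tiny_soup.find_all("td", limit=1)[0]
print(no_match_find, no_match_find_all, first_cell == same_cell)
```

Because find() can return None, it is worth checking the result before accessing attributes on it.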

2 Tables

<h3>Rocket Launch </h3>

<p>
<table class='rocket'>
  <tr>
    <td>Flight No</td>
    <td>Launch site</td> 
    <td>Payload mass</td>
  </tr>
  <tr>
    <td>1</td>
    <td>Florida</td>
    <td>300 kg</td>
  </tr>
  <tr>
    <td>2</td>
    <td>Texas</td>
    <td>94 kg</td>
  </tr>
  <tr>
    <td>3</td>
    <td>Florida </td>
    <td>80 kg</td>
  </tr>
</table>
</p>
<p>

<h3>Pizza Party  </h3>
  
    
<table class='pizza'>
  <tr>
    <td>Pizza Place</td>
    <td>Orders</td> 
    <td>Slices </td>
   </tr>
  <tr>
    <td>Domino's Pizza</td>
    <td>10</td>
    <td>100</td>
  </tr>
  <tr>
    <td>Little Caesars</td>
    <td>12</td>
    <td >144 </td>
  </tr>
  <tr>
    <td>Papa John's </td>
    <td>15 </td>
    <td>165</td>
  </tr>
two_tables="<h3>Rocket Launch </h3><p><table class='rocket'><tr><td>Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr><td>1</td><td>Florida</td><td>300 kg</td></tr><tr><td>2</td><td>Texas</td><td>94 kg</td></tr><tr><td>3</td><td>Florida </td><td>80 kg</td></tr></table></p><p><h3>Pizza Party  </h3><table class='pizza'><tr><td>Pizza Place</td><td>Orders</td> <td>Slices </td></tr><tr><td>Domino's Pizza</td><td>10</td><td>100</td></tr><tr><td>Little Caesars</td><td>12</td><td >144 </td></tr><tr><td>Papa John's </td><td>15 </td><td>165</td></tr>"

HTML Parser

two_tables = BeautifulSoup(two_tables, 'html.parser')

Find the First Table

  • Since find returns only the first match, find("table") returns the first table tag, which is the rocket launch table
two_tables.find("table")
<table class="rocket"><tr><td>Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr><td>1</td><td>Florida</td><td>300 kg</td></tr><tr><td>2</td><td>Texas</td><td>94 kg</td></tr><tr><td>3</td><td>Florida </td><td>80 kg</td></tr></table>

Find Using Class

  • Remember that class is a reserved word in Python, so to find the second table by its attribute class='pizza' we pass class_ with a trailing underscore instead
two_tables.find('table', class_='pizza')
<table class="pizza"><tr><td>Pizza Place</td><td>Orders</td> <td>Slices </td></tr><tr><td>Domino's Pizza</td><td>10</td><td>100</td></tr><tr><td>Little Caesars</td><td>12</td><td>144 </td></tr><tr><td>Papa John's </td><td>15 </td><td>165</td></tr></table>
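
An equivalent way to filter on class, which avoids the underscore, is to pass a dictionary through the attrs parameter. The sketch below compares both forms on a shortened two-table snippet; the snippet and variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

tables_html = (
    "<table class='rocket'><tr><td>Florida</td></tr></table>"
    "<table class='pizza'><tr><td>Domino's Pizza</td></tr></table>"
)
tables_soup = BeautifulSoup(tables_html, "html.parser")

# Keyword form with the trailing underscore.
by_keyword = tables_soup.find("table", class_="pizza")

# Dictionary form using the attrs parameter.
by_attrs = tables_soup.find("table", attrs={"class": "pizza"})

# class is a multi-valued attribute, so its value is stored as a list.
print(by_keyword == by_attrs, by_keyword["class"])  # True ['pizza']
```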

Attributes


If the argument is not recognized it will be turned into a filter on the tag’s attributes.

  • For example with the id argument, Beautiful Soup will filter against each tag’s id attribute.
  • For example, the first td element has an id of flight, so we can filter on that id value.

Filter

  • Filter by certain id value
# filter based on the id = 'flight'
table_bs.find_all(id='flight')
[<td id="flight">Flight No</td>]
  • Filter by link to the Florida Wikipedia page
  • We already know there should be two as we saw above
list_input=table_bs.find_all(href="https://en.wikipedia.org/wiki/Florida")
list_input
[<a href="https://en.wikipedia.org/wiki/Florida">Florida</a>, <a href="https://en.wikipedia.org/wiki/Florida">Florida</a>]
  • Filter with href=True to extract all the tags that have an href attribute
table_bs.find_all(href=True)
[<a href="https://en.wikipedia.org/wiki/Florida">Florida</a>, <a href="https://en.wikipedia.org/wiki/Texas">Texas</a>, <a href="https://en.wikipedia.org/wiki/Florida">Florida</a>]
  • Filter elements that don’t have href values
table_bs.find_all(href = False)
[<html><head></head><body><table><tbody><tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td><td>300 kg</td></tr><tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr><tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td><td>80 kg</td></tr></tbody></table></body></html>, <head></head>, <body><table><tbody><tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td><td>300 kg</td></tr><tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr><tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td><td>80 kg</td></tr></tbody></table></body>, <table><tbody><tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td><td>300 kg</td></tr><tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr><tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td><td>80 kg</td></tr></tbody></table>, <tbody><tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr><tr> <td>1</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td><td>300 kg</td></tr><tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr><tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td><td>80 kg</td></tr></tbody>, <tr><td id="flight">Flight No</td><td>Launch site</td> <td>Payload mass</td></tr>, <td id="flight">Flight No</td>, <td>Launch site</td>, <td>Payload mass</td>, <tr> <td>1</td><td><a 
href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td><td>300 kg</td></tr>, <td>1</td>, <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a></a></td>, <a></a>, <td>300 kg</td>, <tr><td>2</td><td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td><td>94 kg</td></tr>, <td>2</td>, <td><a href="https://en.wikipedia.org/wiki/Texas">Texas</a></td>, <td>94 kg</td>, <tr><td>3</td><td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td><td>80 kg</td></tr>, <td>3</td>, <td><a href="https://en.wikipedia.org/wiki/Florida">Florida</a><a> </a></td>, <a> </a>, <td>80 kg</td>]
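
As the output above shows, href=False matches every tag without an href, including html, body and every td. To find only the anchors that lack an href, one option is to pass a function as the filter; find_all calls it on each tag and keeps those for which it returns True. The function name anchor_without_href and the shortened markup are assumptions for this sketch.

```python
from bs4 import BeautifulSoup

link_html = (
    "<td><a href='https://en.wikipedia.org/wiki/Florida'>Florida</a><a></a></td>"
)
link_soup = BeautifulSoup(link_html, "html.parser")


def anchor_without_href(tag):
    """Hypothetical filter: True for <a> tags that have no href attribute."""
    return tag.name == "a" and not tag.has_attr("href")


empty_anchors = link_soup.find_all(anchor_without_href)
print(empty_anchors)  # [<a></a>]
```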

Soup id=x

  • Use the soup object to find the element whose id is boldest
soup.find_all(id='boldest')
[<b id="boldest">Lebron James</b>]

FindAll String

  • Find all the strings that are exactly 'Florida'
table_bs.find_all(string='Florida')
['Florida', 'Florida']
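
Because a plain string filter requires an exact match, text with extra whitespace such as 'Florida ' is skipped. The string parameter also accepts a compiled regular expression, which matches anywhere inside the text. The sketch below contrasts the two on a made-up row; the variable names are assumptions.

```python
import re

from bs4 import BeautifulSoup

places_soup = BeautifulSoup(
    "<table><tr><td>Florida</td><td>Florida </td><td>Texas</td></tr></table>",
    "html.parser",
)

# Exact match: the cell with the trailing space is not returned.
exact_matches = [str(s) for s in places_soup.find_all(string="Florida")]

# Regular expression: matches any string containing 'Florida'.
pattern_matches = [
    str(s) for s in places_soup.find_all(string=re.compile("Florida"))
]

print(exact_matches)    # ['Florida']
print(pattern_matches)  # ['Florida', 'Florida ']
```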

Example: Table with BS

Before proceeding to scrape a website, you need to examine its contents and the way its data is organized. Open the URL below in your browser and check how many rows and columns there are in the color table.

url1 = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/labs/datasets/HTMLColorCodes.html"

data1 = requests.get(url1).text
soup = BeautifulSoup(data1, 'html5lib')

# find the first table on the page using the <table> tag
table1 = soup.find('table')

Get all the Rows

Get all the Columns

# use the <tr> tag to extract all the rows
for row in table1.find_all('tr'):
    # get all the columns, which use the <td> tag
    cols = row.find_all('td')
    color_name = cols[2].string     # store the value in column 3 as color_name
    color_code = cols[3].string     # store the value in column 4 as color_code
    print("{}------>{}".format(color_name, color_code))
Color Name------>None
lightsalmon------>#FFA07A
salmon------>#FA8072
darksalmon------>#E9967A
lightcoral------>#F08080
coral------>#FF7F50
tomato------>#FF6347
orangered------>#FF4500
gold------>#FFD700
orange------>#FFA500
darkorange------>#FF8C00
lightyellow------>#FFFFE0
lemonchiffon------>#FFFACD
papayawhip------>#FFEFD5
moccasin------>#FFE4B5
peachpuff------>#FFDAB9
palegoldenrod------>#EEE8AA
khaki------>#F0E68C
darkkhaki------>#BDB76B
yellow------>#FFFF00
lawngreen------>#7CFC00
chartreuse------>#7FFF00
limegreen------>#32CD32
lime------>#00FF00
forestgreen------>#228B22
green------>#008000
powderblue------>#B0E0E6
lightblue------>#ADD8E6
lightskyblue------>#87CEFA
skyblue------>#87CEEB
deepskyblue------>#00BFFF
lightsteelblue------>#B0C4DE
dodgerblue------>#1E90FF
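
The loop above assumes every row has at least four cells and prints None for the header's code column. A more defensive variant might skip short rows and keep only values that look like hex codes. The sketch below runs on a small inline table so that it needs no network access; that table is a hypothetical stand-in and is not the actual markup of the page at url1.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the color page: a header row, two data rows
# and one malformed row with too few cells.
sample_page = """
<table>
  <tr><td>Number</td><td>Color</td><td>Color Name</td><td><b>Hex Code</b></td></tr>
  <tr><td>1</td><td>&nbsp;</td><td>lightsalmon</td><td>#FFA07A</td></tr>
  <tr><td>2</td><td>&nbsp;</td><td>salmon</td><td>#FA8072</td></tr>
  <tr><td>3</td></tr>
</table>
"""
sample_table = BeautifulSoup(sample_page, "html.parser").find("table")

colors = {}
for row in sample_table.find_all("tr"):
    cols = row.find_all("td")
    if len(cols) < 4:
        continue  # skip rows that do not have name and code columns
    color_name = cols[2].get_text(strip=True)
    color_code = cols[3].get_text(strip=True)
    if color_code.startswith("#"):  # drops the header row
        colors[color_name] = color_code

print(colors)  # {'lightsalmon': '#FFA07A', 'salmon': '#FA8072'}
```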