```python
# Start by importing requests
import requests
```
REST APIs
REST APIs work by sending a request, and the request is communicated via an HTTP message.
- The HTTP message usually contains JSON. This JSON holds instructions for the operation we would like the service or resource to perform.
- In a similar manner, the API returns a response via an HTTP message, and this response is usually contained within JSON.
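As a minimal sketch of the JSON side of this exchange (the payload below is invented for illustration), Python's built-in json module can parse the kind of body an API sends back:

```python
import json

# A JSON payload such as an API might return in an HTTP response body
body = '{"operation": "translate", "text": "hello", "status": "ok"}'

# Parse the JSON into a Python dict to read the result
data = json.loads(body)
print(data["operation"])  # translate
print(data["status"])     # ok
```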
HTML Basics
The Process
Web scraping, also known as web harvesting or web data extraction, is the process of extracting information from websites or web pages. It involves automated retrieval of data from web sources.
HTTP request
When you, the client, use a web page, your browser sends an HTTP request to the server where the page is hosted. The server tries to find the desired resource, by default "index.html". If your request is successful, the server sends the object to the client in an HTTP response. The response includes information such as the type of the resource, the length of the resource, and other details.
- The process typically begins with an HTTP request.
- A web scraper sends an HTTP request to a specific URL, similar to how a web browser would when you visit a website.
- The request is usually an HTTP GET request, which retrieves the web page’s content.
Web page retrieval
- The web server hosting the website responds to the request by returning the requested web page’s HTML content. This content includes the visible text and media elements and the underlying HTML structure that defines the page’s layout.
HTML parsing
- Once the HTML content is received, you need to parse the content.
- Parsing involves breaking down the HTML structure into components, such as tags, attributes, and text content.
- You can use BeautifulSoup in Python. It creates a structured representation of the HTML content that can be easily navigated and manipulated.
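BeautifulSoup is the usual choice; as a dependency-free sketch of the same idea, Python's built-in html.parser module can walk the tags (the HTML string and tag choices here are made up for illustration):

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collect the text found inside <h3> and <p> tags."""
    def __init__(self):
        super().__init__()
        self.capture = False
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag in ('h3', 'p'):
            self.capture = True

    def handle_endtag(self, tag):
        if tag in ('h3', 'p'):
            self.capture = False

    def handle_data(self, data):
        if self.capture:
            self.results.append(data)

parser = TextCollector()
parser.feed("<html><body><h3>Player A</h3><p>Salary: $1M</p></body></html>")
print(parser.results)  # ['Player A', 'Salary: $1M']
```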
Data Extraction
- With the HTML content parsed, web scrapers can now identify and extract the specific data they need.
- This data can include text, links, images, tables, product prices, news articles, and more.
- Scrapers locate the data by searching for relevant HTML tags, attributes, and patterns in the HTML structure.
Data transformation
- Extracted data may need further processing and transformation.
- For instance, you can remove HTML tags from text, convert data formats, or clean up messy data.
- This step ensures the data is ready for analysis or other use cases.
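A quick cleanup pass might look like this (the raw string and the parsing rules are invented for illustration):

```python
import re

raw = "  <b>Price:</b> $1,234.56  "

# Remove leftover HTML tags
text = re.sub(r'<[^>]+>', '', raw)

# Collapse leading, trailing, and internal whitespace
text = ' '.join(text.split())

# Convert the price portion into a float for analysis
price = float(text.split('$')[1].replace(',', ''))
print(text)   # Price: $1,234.56
print(price)  # 1234.56
```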
Storage
- After extraction and transformation, you can store the scraped data in various formats, such as databases, spreadsheets, JSON, or CSV files.
- The choice of storage format depends on the specific project’s requirements.
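For example, scraped rows can be written to CSV with the standard library (the rows are invented; an io.StringIO stands in for a file on disk):

```python
import csv
import io

rows = [{'name': 'Player A', 'salary': 1000000},
        {'name': 'Player B', 'salary': 2000000}]

# StringIO stands in for a real file opened with open('players.csv', 'w', newline='')
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=['name', 'salary'])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```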
Automation
- In many cases, scripts or programs automate web scraping.
- These automation tools allow recurring data extraction from multiple web pages or websites.
- Automated scraping is especially useful for collecting data from dynamic websites that regularly update their content.
HTML Structure
Hypertext Markup Language (HTML) serves as the foundation of web pages. Understanding its structure is crucial for web scraping.
- `<html>` is the root element of an HTML page.
- `<head>` contains meta-information about the HTML page.
- `<body>` displays the content on the web page, often the data of interest.
- `<h3>` tags are type 3 headings, making text larger and bold; here they are typically used for player names.
- `<p>` tags represent paragraphs and contain the player salary information.
HTML Tag
HTML tags define the structure of web content and can contain attributes.
- An HTML tag consists of an opening (start) tag and a closing (end) tag.
- Tags have names (for example, `<a>` for an anchor tag).
- Tags may contain attributes, each with an attribute name and value, providing additional information to the tag.
HTML Tree
You can visualize HTML documents as trees with tags as nodes.
- Tags can contain strings and other tags, making them the tag’s children.
- Tags within the same parent tag are considered siblings.
- For example, the `<html>` tag contains both `<head>` and `<body>` tags, making them children (and therefore descendants) of `<html>`. `<head>` and `<body>` are siblings.
HTML Tables
HTML tables are essential for presenting structured data.
- Define an HTML table using the `<table>` tag.
- Each table row is defined with a `<tr>` tag.
- The first row often uses the table header tag, typically `<th>`.
- Table cells are represented by `<td>` tags, which define individual cells in a row.
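Putting those tags together, a minimal table (values invented for illustration) looks like this:

```html
<table>
  <tr><th>Player</th><th>Salary</th></tr>
  <tr><td>Player A</td><td>$1M</td></tr>
  <tr><td>Player B</td><td>$2M</td></tr>
</table>
```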
HTTP Request Method
Now that we have covered the process and HTML structure, let's put it all together and recap:
When you, the client, use a web page, your browser sends an HTTP request to the server where the page is hosted. The server tries to find the desired resource, by default "index.html". If your request is successful, the server sends the object to the client in an HTTP response. The response includes information such as the type of the resource, the length of the resource, and other details.
URL
We send the HTTP request to a URL. What is a Uniform Resource Locator (URL)?
A URL is broken down into:
- Scheme: the protocol, such as `http://`
- Base URL: used to find the location, such as `www.yourdataiq.com`
- Route: the location on the server, such as `/images/blah.png`
There are different HTTP request methods; you can import the requests library with `import requests`.
Request
- The process can be broken into a request and a response.
- A request using the GET method begins with a start line, which contains the `GET` method (an HTTP method), the location of the resource such as `/index.html`, and the HTTP version.
- The request headers pass additional information with an HTTP request.
- When an HTTP request is made, an HTTP method is sent which tells the server what action to perform.
- A list of possible actions is below
| HTTP Method | Description |
|---|---|
| GET | Retrieves data from the server |
| POST | Submits data to the server |
| PUT | Updates data on the server |
| DELETE | Deletes data from the server |
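As a sketch, the standard library's urllib can build (without sending) a request for each of these methods; httpbin.org/anything is just a placeholder URL:

```python
from urllib.request import Request

methods = []
for m in ('GET', 'POST', 'PUT', 'DELETE'):
    # Build, but do not send, a request; get_method() reports the HTTP method
    req = Request('http://httpbin.org/anything', method=m)
    methods.append(req.get_method())
print(methods)  # ['GET', 'POST', 'PUT', 'DELETE']
```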
Response
- The response is what we get back from the server.
- It starts with the HTTP version and status code, followed by a descriptive phrase (such as OK).
- The headers contain useful information.
- The body contains the information we requested, in this case an HTML document.
- Note: some requests can also have a body, as we will see with POST.
Status Codes
- Common status codes include 200 (OK), 301 (Moved Permanently), 404 (Not Found), and 500 (Internal Server Error).
Example - Text Request
Setup
- Let’s send a GET request to the IBM main page, www.ibm.com
- The os module allows us to perform tasks such as:
- accessing and manipulating files
- getting info about the working directory
- creating, renaming, deleting directories and files
- executing shell commands
Send request
- Set the URL to www.ibm.com; I named it url1 so we don’t confuse it with the example later in this section
- Send a GET request and assign the response to resp
```python
url1 = 'https://www.ibm.com/'
resp = requests.get(url1)
```
Read Response Code
- As you can see in the output, status code 200 means OK
resp.status_code
200
List Request Headers
- You can use the resp.request.headers attribute to view the request headers
resp.request.headers
{'User-Agent': 'python-requests/2.32.3', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive', 'Cookie': '_abck=AB0F188D44986D039B8C254479FCE322~-1~YAAQfu0ZuHei5/iSAQAAYLwF/gyPs1Tffmn1RfKFZORbfpQ3ZXnY2NZi9Y6UWuxieYujI3jhkmVmCLPOwXJ1YEuyX4TLwj2IflpQy4r76McXrMu5shiLe3G+0U8ez6VaP1jR7IKXJoHqcEdT2lqnfSYLq+HTSZxJpXxccJKmzqFUs3fqE6OkEdiyxmT++/wzuCt1OrXfWb8ns6ObXUDAbibyOJu2JGB5L1HNpLQKSB5Wvqa7QSBzhR2J8XV/0IPPC6+AU0f4Z4jWNvWiSr56uSI5CvfqWdZiPQRqpRXbCwmgCUbEhbEH2cGArmQ4AK8a2l3EPOI2XtT3m9fMkDZJM1fIIMtLS11NguKWpfLMR/INWAtPVIrTm1xZv4Hgx6o4M/DT/G2KAC4m0MwHdBOVHPewEyFfhAQ=~-1~-1~-1; bm_sz=0FB750F75FE11A5D79D534500EEC97DD~YAAQfu0ZuHii5/iSAQAAYLwF/hljxyEyPoHzRC81FNaLSp/l10G0JkcDSBdg2JCldWOEwaXXKFM1X+hW0l1b5Fvh0cD9IYZKfy5PM/DKTA2Wp5xW4D2cSfH97ZfanjgxxwhUjAJc0yhvNEnzyWYt4GxeYDjiYJK5x2SRa5QWEoNqEZ+kFBBhbcS28l3m74OzvA178/+vMiLEnyBxpFofZCNF9FPNf04vX+f0xHK31W2r6F8gMVGJo7yzbBw5pt0i3vZQ50m5qqCI6XLr74W6GmTEzGiJZHHEONh41VUyRyYplQEu2TBDhMD2z0sZoaOPnxvf1AORr5zPdqe0N/xheUxxOBa1GIRWZlW1kTpmiu6ZyyWKdP6uEMj7jCtxeUuQ7rU=~3425860~3224884'}
- Or use the response's headers attribute, which returns a dictionary of HTTP response headers
- This might be easier to read
```python
header = resp.headers
header
```
{'Content-Security-Policy': 'upgrade-insecure-requests', 'x-frame-options': 'SAMEORIGIN', 'Last-Modified': 'Tue, 05 Nov 2024 20:21:43 GMT', 'ETag': '"2a42a-6263026537382-gzip"', 'Accept-Ranges': 'bytes', 'Content-Type': 'text/html;charset=utf-8', 'X-Content-Type-Options': 'nosniff', 'Cache-Control': 'max-age=600', 'Expires': 'Tue, 05 Nov 2024 20:40:41 GMT', 'X-Akamai-Transformed': '9 28017 0 pmb=mTOE,2', 'Content-Encoding': 'gzip', 'Date': 'Tue, 05 Nov 2024 20:30:41 GMT', 'Content-Length': '28244', 'Connection': 'keep-alive', 'Vary': 'Accept-Encoding', 'Strict-Transport-Security': 'max-age=31536000'}
List Keys
- We can extract values from the header dictionary
- Let’s look at the keys list
list(header)
['Content-Security-Policy',
'x-frame-options',
'Last-Modified',
'ETag',
'Accept-Ranges',
'Content-Type',
'X-Content-Type-Options',
'Cache-Control',
'Expires',
'X-Akamai-Transformed',
'Content-Encoding',
'Date',
'Content-Length',
'Connection',
'Vary',
'Strict-Transport-Security']
Read Values
Now that we have the list of keys we can read any of the values
header['Date']
'Tue, 05 Nov 2024 20:30:41 GMT'
Look at the content-type
header['Content-Type']
'text/html;charset=utf-8'
Request Body
- As we mentioned before, a GET request doesn’t include anything in the body
- The response body is where the server returns the requested material, so req_body here is None
```python
req_body = resp.request.body
req_body
```
Read Text
- We noticed the content-type is text/html
- so let’s read the text attribute
resp.text[0:200]
'\n<!DOCTYPE HTML>\n<html lang="en">\n<head>\r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n <meta charset="UTF-8"/>\r\n <meta name="languageCode" content="en"/>\r\n <meta name="country'
# Notice the only difference is b at the beginning of the output
resp.content[0:200]
b'\n<!DOCTYPE HTML>\n<html lang="en">\n<head>\r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n <meta charset="UTF-8"/>\r\n <meta name="languageCode" content="en"/>\r\n <meta name="country'
Example - Image Request
- Just as we requested text in the above example, we can request an image
- I will not break this down into sections as it is redundant
- I’ll place comments in the code
```python
url2 = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0101EN-SkillsNetwork/IDSNlogo.png'
req2 = requests.get(url2)
```
```python
# list the keys in the header
heads = req2.headers
list(heads)
```
['Date',
'X-Clv-Request-Id',
'Server',
'X-Clv-S3-Version',
'Accept-Ranges',
'x-amz-request-id',
'ETag',
'Content-Type',
'Last-Modified',
'Content-Length']
heads['Content-Type']
'image/png'
Save Image
The response object contains the image as a bytes-like object. As a result, we must save it using a file object. First, we specify the file path and name.
```python
import os
from PIL import Image

# specify the file path and name
path = os.path.join(os.getcwd(), 'image.png')

# save the file using the content attribute
with open(path, 'wb') as f:
    f.write(req2.content)

# view the image
Image.open(path)
```
Example Download File
Data
- Here is the file we want to download to our system
https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0101EN-SkillsNetwork/labs/Module%205/data/Example1.txt
Download & Save
```python
import os
import requests
import pandas as pd

url = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0101EN-SkillsNetwork/labs/Module%205/data/Example1.txt'
path = os.path.join(os.getcwd(), 'example1.txt')
r = requests.get(url)
with open(path, 'wb') as f:
    f.write(r.content)
```
GET Request
As with the earlier GET request, we can add a query string to modify the results of our query.
- Just as before, we send a GET request to the server.
- Like before we have the base URL; in the route we append `/get`, which indicates we would like to perform a `GET` request.
url_get = 'http://httpbin.org/get'
- A query string is a part of a uniform resource locator (URL) that sends additional information to the web server. The query starts with a `?`, followed by a series of parameter and value pairs. Here the first parameter name is `name` and the value is `Joseph`; the second parameter name is `ID` and the value is `123`. Each parameter and value is separated by an equals sign, `=`, and the series of pairs is separated by the ampersand, `&`.
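The same query string can be built by hand with the standard library, which makes the `?`, `=`, and `&` structure explicit:

```python
from urllib.parse import urlencode

params = {"name": "Joseph", "ID": "123"}

# urlencode joins each name=value pair with '&'
query = urlencode(params)
full_url = 'http://httpbin.org/get?' + query
print(query)     # name=Joseph&ID=123
print(full_url)  # http://httpbin.org/get?name=Joseph&ID=123
```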
Create Dictionary to Pass
- To create a query string, build a dictionary: the keys are the parameter names and the values are the parameter values.
payload = {"name": "Joseph", "ID": "123"}
Send Request
- Pass the dictionary payload to the `params` parameter of the `get()` function
r = requests.get(url_get, params=payload)
Review Request
- Print out the url if we want to see it
r.url
'http://httpbin.org/get?name=Joseph&ID=123'
View Request Body
- Remember, a GET request has no body, so this will be None
r.request.body
View Request Status
- Let’s look at the response status code
r.status_code
200
Read Response as Text
- You can see from the output that it might be in JSON format
- Let’s check for sure
r.text
'{\n "args": {\n "ID": "123", \n "name": "Joseph"\n }, \n "headers": {\n "Accept": "*/*", \n "Accept-Encoding": "gzip, deflate", \n "Host": "httpbin.org", \n "User-Agent": "python-requests/2.32.3", \n "X-Amzn-Trace-Id": "Root=1-672a8072-6bd2448c063879fb26f3ba68"\n }, \n "origin": "76.92.196.161", \n "url": "http://httpbin.org/get?name=Joseph&ID=123"\n}\n'
View Content-Type
r.headers['Content-Type']
'application/json'
Read Response Json
- As the content is in the JSON format, use the `json()` method to return it as a `dict`
r.json()
{'args': {'ID': '123', 'name': 'Joseph'},
'headers': {'Accept': '*/*',
'Accept-Encoding': 'gzip, deflate',
'Host': 'httpbin.org',
'User-Agent': 'python-requests/2.32.3',
'X-Amzn-Trace-Id': 'Root=1-672a8072-6bd2448c063879fb26f3ba68'},
'origin': '76.92.196.161',
'url': 'http://httpbin.org/get?name=Joseph&ID=123'}
- As you see in the output, the key args holds exactly what we sent with the GET request
List Args
r.json()['args']
{'ID': '123', 'name': 'Joseph'}
POST Request
- Like a `GET` request, a `POST` request is used to send data to a server
- But the `POST` request sends the data in the request body
- To send the POST request in Python, we change the route in the URL to `/post`
Set up URL
url_post = 'http://httpbin.org/post'
Create Dictionary
- This endpoint expects data as a file or as a form. A form is a convenient way to configure an HTTP request to send data to a server.
- To make a `POST` request we use the `post()` function; the variable `payload` is passed to the parameter `data`.
payload = {"name": "Joseph", "ID": "123"}
Send Request
r_post = requests.post(url_post, data=payload)
Compare Requests
- Let’s compare the two requests:
- GET request
- Post request
- Note that the POST request doesn’t carry any data in its URL but sends it in the body
print("POST request URL:",r_post.url )
print("GET request URL:",r.url)
POST request URL: http://httpbin.org/post
GET request URL: http://httpbin.org/get?name=Joseph&ID=123
Compare Bodies
- Let’s compare bodies as well
print("POST request body:",r_post.request.body)
print("GET request body:",r.request.body)
POST request body: name=Joseph&ID=123
GET request body: None
View the Request Form
- As we mentioned, the payload is sent as form data in the body, which httpbin echoes back under the form key
- Let’s look at it
r_post.json()['form']
{'ID': '123', 'name': 'Joseph'}