from bs4 import BeautifulSoup
import requests

# Fetch the page, then parse its HTML into a BeautifulSoup object
URL = "http://www.example.com"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
APIs & Web Scraping
APIs
APIs (Application Programming Interfaces) are essential for any engineer because they provide a way to access data and functionality from other systems, which can save time and resources. For instance, APIs can be used to integrate applications into the existing architecture of a server or application, allowing developers to communicate between various products and services without requiring direct impleme
APIs have a wide range of applications, some of which are:
- Social media platforms: Social media platforms like Facebook, Twitter, and Instagram use APIs to allow developers to access their data and functionality. This allows developers to create applications that can interact with these platforms and provide additional functionality to users.
- E-commerce websites: E-commerce websites like Amazon and eBay use APIs to allow developers to access their product catalogs and other data. This allows developers to create applications that can interact with these platforms and provide additional functionality to users.
- Weather applications: Weather applications like AccuWeather and The Weather Channel use APIs to access weather data from various sources. This allows developers to create applications that can provide users with up-to-date weather information.
- Maps and navigation applications: Maps and navigation applications like Google Maps and Waze use APIs to access location data and other information. This allows developers to create applications that can provide users with directions, traffic updates, and other location-based information.
- Payment gateways: Payment gateways like PayPal and Stripe use APIs to allow developers to access their payment processing functionality. This allows developers to create applications that can process payments securely and efficiently.
- Messaging applications: Messaging applications like WhatsApp and Facebook Messenger use APIs to allow developers to access their messaging functionality. This allows developers to create applications that can interact with these platforms and provide additional functionality to users.
Pros & Cons
The advantages of using APIs:
- Automation. Less human effort is required, and workflows can be easily updated to become faster and more productive.
- Efficiency. Using the capabilities of an existing API is usually faster than implementing the same functionality from scratch.
The disadvantage of using APIs:
- Security. A poorly integrated API can be vulnerable to attacks, which may lead to data breaches or losses with financial or reputational consequences.
REST APIs
- REST APIs work by sending a request, which is communicated as an HTTP message.
- The request often carries a JSON payload that describes the operation we would like the service or resource to perform.
- Similarly, the API returns a response as an HTTP message, and the response data is usually formatted as JSON.
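The request and response cycle described above can be sketched with Python's standard library alone. This is a minimal illustration, not a real client: the endpoint `https://api.example.com/items`, the payload fields, and the sample response body are all hypothetical, and nothing is actually sent over the network.

```python
import json
from urllib.request import Request


def build_json_request(url: str, payload: dict) -> Request:
    """Build (but do not send) an HTTP POST request carrying a JSON body."""
    body = json.dumps(payload).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


def parse_json_response(raw_body: bytes) -> dict:
    """Decode a JSON response body into a Python dictionary."""
    return json.loads(raw_body.decode("utf-8"))


# Hypothetical endpoint and payload, for illustration only.
request = build_json_request(
    "https://api.example.com/items", {"name": "widget", "quantity": 2}
)
print(request.get_method())                # HTTP verb of the request
print(request.get_header("Content-type"))  # urllib stores header names capitalized
print(request.data)                        # the JSON payload as bytes

# A made-up response body, standing in for what a service might return.
sample_response = b'{"id": 17, "name": "widget", "quantity": 2}'
print(parse_json_response(sample_response)["id"])
```

In practice the `requests` library used later in this document wraps the same steps: it serializes the payload, sends the HTTP message, and exposes the decoded JSON through `response.json()`.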
Free Open Source APIs
Random User API
RandomUser is an open-source, free API providing developers with randomly generated users to be used as placeholders for testing purposes.
- This makes the tool similar to Lorem Ipsum, but it generates placeholder people instead of placeholder text.
- The API can return multiple results, and a request can specify details of the generated users such as gender, email, image, username, address, title, first and last name, and more.
- More information on RandomUser can be found here.
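As a rough sketch of how such a request might be put together, the code below builds a query URL and pulls names out of a response. The base URL `https://randomuser.me/api/`, the `results` and `gender` query parameters, and the shape of the JSON response are assumptions that should be checked against the RandomUser documentation. The sample body is made up for illustration and is not real API output.

```python
import json
from typing import Optional
from urllib.parse import urlencode

# Assumed base URL of the RandomUser service; verify against its documentation.
BASE_URL = "https://randomuser.me/api/"


def build_query_url(results: int = 1, gender: Optional[str] = None) -> str:
    """Return a request URL asking for `results` users, optionally of one gender."""
    params = {"results": results}
    if gender is not None:
        params["gender"] = gender
    return f"{BASE_URL}?{urlencode(params)}"


def full_names(response_body: str) -> list:
    """Extract 'First Last' for each user in a RandomUser-style JSON response."""
    users = json.loads(response_body)["results"]
    return [f"{user['name']['first']} {user['name']['last']}" for user in users]


# Made-up sample shaped like the assumed response format.
sample_body = json.dumps({
    "results": [
        {"gender": "female",
         "name": {"title": "Ms", "first": "Ada", "last": "Example"},
         "email": "ada@example.com"},
        {"gender": "female",
         "name": {"title": "Dr", "first": "Grace", "last": "Sample"},
         "email": "grace@example.com"},
    ]
})

print(build_query_url(results=2, gender="female"))
print(full_names(sample_body))
```

With the `requests` library, the URL returned by `build_query_url` could be passed to `requests.get`, and the text of the response handed to `full_names`.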
Fruityvice API
- The Fruityvice API is a web service that provides data for all kinds of fruit.
- You can use Fruityvice to find out interesting information about fruit and educate yourself.
- The web service is completely free.
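JSON returned by services like this is often nested, which makes it awkward to load into a table. The sketch below flattens nested records into single-level rows with dotted keys. The record layout (a `nutritions` object inside each fruit) is an assumption about the Fruityvice response format, and the values are made up for illustration, not real nutrition data.

```python
import json


def flatten(record: dict, parent_key: str = "", sep: str = ".") -> dict:
    """Flatten nested dictionaries into a single level with dotted keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the key prefix along
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Made-up records shaped like the assumed response; values are illustrative only.
sample_body = """[
  {"name": "Apple", "family": "Rosaceae", "nutritions": {"calories": 52, "sugar": 10.3}},
  {"name": "Banana", "family": "Musaceae", "nutritions": {"calories": 96, "sugar": 17.2}}
]"""

rows = [flatten(fruit) for fruit in json.loads(sample_body)]
for row in rows:
    print(row)
```

Each flattened row has the same keys, so the list can be passed directly to `pd.DataFrame(rows)` for analysis.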
Free APIs List
Here is a page that contains a list of free public APIs.
Web Scraping
Web scraping, also known as web harvesting or web data extraction, is a technique used to extract large amounts of data from websites. Much of the data on websites is unstructured or only loosely structured, and web scraping enables us to convert it into a structured form.
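To make "converting into a structured form" concrete, here is a minimal sketch that turns an HTML table into a list of dictionaries. It uses Python's built-in `html.parser` module instead of a scraping library so that it runs without extra installs. The class name `TableRowParser` and the HTML fragment are made up for illustration.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


# Made-up HTML fragment standing in for a scraped page.
html = """
<table>
  <tr><th>Country</th><th>Capital</th></tr>
  <tr><td>France</td><td>Paris</td></tr>
  <tr><td>Japan</td><td>Tokyo</td></tr>
</table>
"""

parser = TableRowParser()
parser.feed(html)
header, *body = parser.rows
records = [dict(zip(header, row)) for row in body]
print(records)
```

The markup goes in as one long string and comes out as records keyed by column name, which is the shape most analysis tools expect.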
Web Scraping in Data Science
In the field of data science, web scraping plays an integral role. It is used for various purposes such as:
- Data Collection: Web scraping is a primary method of collecting data from the internet. This data can be used for analysis, research, etc.
- Real-time Application: Web scraping is used for real-time applications like weather updates, price comparison, etc.
- Machine Learning: Web scraping provides the data needed to train machine learning models.
Beautiful Soup
BeautifulSoup is a Python library for pulling data out of HTML and XML files. It builds a parse tree from the page source, which can be used to extract data in a hierarchical and more readable way.
Scrapy
Scrapy is an open-source and collaborative web crawling framework for Python. It is used to crawl websites and extract data from their pages.
pip install scrapy

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/tag/humor/']

    def parse(self, response):
        # Yield one item per quote block found on the page
        for quote in response.css('div.quote'):
            yield {'quote': quote.css('span.text::text').get()}
Selenium
Selenium: Selenium is a tool used for controlling web browsers through programs and automating browser tasks.
pip install selenium

from selenium import webdriver

# Launch Firefox and open the page
driver = webdriver.Firefox()
driver.get("http://www.example.com")
Example - Scraping
Web scraping is the process of extracting information from websites or web pages. It can save time by automating data collection.
- One such tool we can use with Python is BeautifulSoup
- BeautifulSoup allows you to extract specific parts of a page based on their tags, attributes, or text
- To make requests to a server, we also need to import requests
- Below is an example of how to scrape the Wikipedia page for IBM
library(reticulate)
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Specify the URL of the webpage you want to scrape
url = 'https://en.wikipedia.org/wiki/IBM'
# Send an HTTP GET request to the webpage
response = requests.get(url)
# Store the HTML content in a variable
html_content = response.text
# Create a BeautifulSoup object to parse the HTML
soup = BeautifulSoup(html_content, 'html.parser')
# Display a snippet of the HTML content
print(html_content[:500])
<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-
Extract Links
- Recall that links in HTML are marked with the ‘a’ (anchor) tag
- To extract all the links from the IBM page, we call BeautifulSoup's find_all(‘a’)
# Find all <a> tags (anchor tags) in the HTML; the result is a list
links = soup.find_all('a')
# Iterate through the list of links and print their text (output is too long to show here)
for link in links:
    print(link.text)
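The link text is often less useful than the link target. With BeautifulSoup the target is available through `link.get('href')`. The sketch below shows the same idea with Python's built-in `html.parser` module, which is the parser backend named in `BeautifulSoup(html_content, 'html.parser')`, so it runs without a network connection or extra installs. It also resolves relative links against the page URL with `urljoin`. The class name `LinkCollector` and the sample HTML, including the `/wiki/Watson` path, are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Record (text, absolute URL) for every <a> tag that has an href."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None   # target of the link currently being read
        self._text = []     # text fragments inside that link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Turn relative paths into absolute URLs
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


# Made-up fragment standing in for part of a downloaded page.
sample_html = (
    '<p>See <a href="/wiki/Watson">Watson</a> and '
    '<a href="https://example.com/">an example</a>.</p>'
)

collector = LinkCollector("https://en.wikipedia.org/wiki/IBM")
collector.feed(sample_html)
for text, url in collector.links:
    print(text, "->", url)
```

Resolving relative paths matters because many links on a page omit the domain, so the raw `href` alone cannot be requested directly.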
Scrape Tables with Pandas
read_html
Pandas allows us to read table data directly from websites’ tables and present it in a format suitable for analysis. See Webscraping GDP Table in Case Studies section.
Here is a segment:
url = "https://web.archive.org/web/20230902185326/https://en.wikipedia.org/wiki/List_of_countries_by_GDP_%28nominal%29"
# read_html returns a list of DataFrames, one per table found on the page
tables = pd.read_html(url)
# Since we know which table we want, we can select it by its index in the list
df = tables[3]
More will be detailed in other documents in this section.