Web Scraping 101: Intro to BeautifulSoup and Python Requests

Sometimes the data you need isn’t in a nice CSV file; it’s stuck on a website. Web scraping is the process of using code to automatically read and extract that data. Among the tools available, BeautifulSoup is a popular choice for its ease of use and flexibility.

We need two libraries:

  1. requests: To fetch the HTML code of the page (just like your browser does).
  2. BeautifulSoup (bs4): To parse that HTML and pull out the content we want.

Step 1: Installation

pip install requests beautifulsoup4

Step 2: Fetch the Page

Let’s scrape a dedicated practice site: http://quotes.toscrape.com. It’s built for exactly this purpose, so it’s safe and legal to scrape.

import requests
from bs4 import BeautifulSoup

url = "http://quotes.toscrape.com"
response = requests.get(url)

# Check if it worked
if response.status_code == 200:
    print("Successfully fetched the page!")
else:
    print(f"Failed to fetch the page. Status code: {response.status_code}")

Step 3: Parse the HTML

Now we feed the page content into BeautifulSoup, which turns the raw HTML into an object we can search.

soup = BeautifulSoup(response.text, 'html.parser')

# Let's find the first quote on the page.
# (We know it's in a <span> with class="text" because we inspected the page in our browser first!)
quote = soup.find('span', class_='text')

print(quote.text)
# Output: "The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking."

Step 4: Get All of Them

find_all returns a list of every matching tag, so we can grab every quote on the page in one call.

all_quotes = soup.find_all('span', class_='text')

for q in all_quotes:
    print(q.text)
    print("---")

A Warning on Ethics

Always check a website’s robots.txt file (e.g., google.com/robots.txt) to see which parts of the site they allow automated tools to access. Scrape considerately: aggressive scraping can overwhelm or even crash small sites, so space out your requests.
