Practical Web Scraping for Data Science: Web scraping, also known as web harvesting or web data extraction, is a technique for extracting data from websites. It involves writing code that parses HTML content and pulls out the information you need. Web scraping is an essential tool for data science because it lets data scientists gather information from online sources quickly and at scale. In this article, we will walk through practical web scraping techniques for data science using Python.
Before diving into the practical aspects of web scraping, it is essential to understand its legal and ethical implications. Make sure the data you extract is not copyrighted and that the website's terms of service permit scraping. It is also important to avoid overloading a website with requests: a flood of rapid-fire requests can look like, and effectively become, a denial-of-service attack, so throttle your scraper and respect the site's robots.txt rules.
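One concrete courtesy check is the site's robots.txt file. Python's standard library ships urllib.robotparser for exactly this. The sketch below parses a hypothetical robots.txt snippet locally; the rules shown are illustrative, not IMDb's actual policy, and in real use you would call rp.set_url() with the site's robots.txt URL followed by rp.read() to fetch the live file.

```python
from urllib import robotparser

# Hypothetical robots.txt rules, for illustration only
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# Check whether a generic crawler may fetch a given path
print(rp.can_fetch("*", "https://example.com/chart/top"))   # True
print(rp.can_fetch("*", "https://example.com/private/x"))   # False
print(rp.crawl_delay("*"))                                  # 2
```

The crawl_delay value, when a site specifies one, tells you how many seconds to sleep between requests to stay polite.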

Now let’s dive into the practical aspects of web scraping for data science. The first step is to identify the website that contains the data you want to extract. In this example, we will use the website “https://www.imdb.com” to extract information about movies. The website contains a list of top-rated movies, and we will extract the movie title, release year, and rating.
To begin, we need to install the following Python libraries: Requests, Beautiful Soup, and Pandas. These libraries are essential for web scraping and data manipulation.
!pip install requests
!pip install beautifulsoup4
!pip install pandas
After installing the necessary libraries, we can begin writing the code to extract the data. The next step is to send a request to the website and retrieve the HTML content.
import requests
url = 'https://www.imdb.com/chart/top'
headers = {'User-Agent': 'Mozilla/5.0'}  # IMDb may reject the default Requests user agent
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # stop early on a 4xx/5xx response instead of parsing an error page
Once we have the HTML content, we can use Beautiful Soup to parse the HTML and extract the information we want.
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')
movies = soup.select('td.titleColumn')
The select method returns all elements matching a CSS selector. In this example, we are selecting every td element with the class "titleColumn." We can now loop through the movies list and extract the movie title, release year, and rating.
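To see select in isolation, here is a self-contained sketch run against a hand-written snippet that mimics the chart's markup. The HTML below is an assumption for demonstration, not fetched from IMDb.

```python
from bs4 import BeautifulSoup

# Minimal markup mimicking one row of the chart (illustrative only)
html = """
<table>
  <tr>
    <td class="titleColumn"><a>The Shawshank Redemption</a>
        <span class="secondaryInfo">(1994)</span></td>
    <td class="ratingColumn imdbRating">9.2</td>
  </tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')
cells = soup.select('td.titleColumn')      # tag name plus class selector
print(len(cells))                          # 1
print(cells[0].find('a').get_text())       # The Shawshank Redemption
```

The same selector syntax works for ids ('#main'), descendants ('table a'), and attribute filters, which makes select a flexible alternative to chained find calls.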
movie_titles = []
release_years = []
ratings = []
for movie in movies:
    title = movie.find('a').get_text()
    year = movie.find('span', class_='secondaryInfo').get_text()[1:-1]  # strip the parentheses
    # the rating cell is a sibling of the title cell, not inside it
    rating = movie.find_next_sibling('td', class_='ratingColumn imdbRating').get_text().strip()
    movie_titles.append(title)
    release_years.append(year)
    ratings.append(rating)
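Real pages change over time, so find() can return None and crash a loop like the one above mid-scrape. A defensive variant guards each lookup before calling get_text(). The snippet below is a self-contained sketch using the same hypothetical markup structure:

```python
from bs4 import BeautifulSoup

# Illustrative markup for a single title cell (an assumption, not live data)
html = ('<td class="titleColumn"><a>12 Angry Men</a> '
        '<span class="secondaryInfo">(1957)</span></td>')
cell = BeautifulSoup(html, 'html.parser').td

link = cell.find('a')
span = cell.find('span', class_='secondaryInfo')

# Fall back to None instead of raising AttributeError on missing elements
title = link.get_text() if link else None
year = span.get_text().strip('()') if span else None
print(title, year)  # 12 Angry Men 1957
```

Collecting None values and inspecting them afterwards makes it obvious which rows of the page no longer match your assumptions.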
Finally, we can create a Pandas dataframe to store the extracted data.
import pandas as pd
df = pd.DataFrame({'Title': movie_titles, 'Year': release_years, 'Rating': ratings})
print(df.head())
The output is a dataframe containing the movie title, release year, and rating:

                        Title  Year Rating
0    The Shawshank Redemption  1994    9.2
1               The Godfather  1972    9.1
2      The Godfather: Part II  1974    9.0
3             The Dark Knight  2008    9.0
4                12 Angry Men  1957    8.9
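Note that the scraped columns arrive as strings; for analysis you will usually want numeric types, and writing the result to disk preserves it between runs. A small post-processing sketch (the two-row dataframe stands in for the scraped one, and the file name top_movies.csv is an arbitrary choice):

```python
import pandas as pd

# Stand-in for the scraped dataframe built above (all columns are strings)
df = pd.DataFrame({'Title': ['The Shawshank Redemption', 'The Godfather'],
                   'Year': ['1994', '1972'],
                   'Rating': ['9.2', '9.1']})

df['Year'] = pd.to_numeric(df['Year'])      # becomes an integer column
df['Rating'] = pd.to_numeric(df['Rating'])  # becomes a float column

df.to_csv('top_movies.csv', index=False)    # persist for later analysis
print(df.dtypes)
```

With numeric columns in place, operations such as df.sort_values('Rating') or df[df['Year'] < 2000] behave as expected instead of comparing strings.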