Web Scraping Using Beautiful Soup – Part 1

What is Web Scraping?

Web scraping is the process of extracting data from websites. It is also known as web data extraction or web harvesting. The technique is only about three decades old.

Why Web Scraping?

The purpose of web scraping is to fetch data from any website automatically, saving a huge amount of manual labour in collecting data. For example, you can collect all the reviews of a movie from the IMDb website and then perform text analytics on that corpus of reviews to gain insights about the movie.

Ways to do Web Scraping

There are several paid and free web scraping tools available in the market today.

We can also write our own scraping code in Python using the Requests and Beautiful Soup libraries. Another package, Scrapy, can be used for the same purpose. In this article we will learn how to use Beautiful Soup for web scraping.

Required tools and knowledge

  • Python
  • HTML
  • Jupyter Notebook
  • Requests
  • BeautifulSoup4
  • Pandas

Innumerable websites provide a lot of numeric or text information. Before writing any scraping code, we need to identify what data we are going to scrape from the website. That helps us target those particular sections of the web page while coding. For example, opencodez.com hosts several posts on various technologies. I want to create a spreadsheet containing the title of every article written, its short description, its author, its date and the web link to the article. The screenshot below shows the sections I need to target in my code.

Understanding the website structure (CSS/HTML)

Every website has its own structure and is built with HTML along with CSS and JavaScript. HTML comprises tags, which we need to understand, and it helps greatly if we understand the structure of our target website. We can use w3schools to pick up some basic HTML/CSS knowledge, and very lucid information on several HTML tags is provided at https://www.elated.com/first-10-html-tags/

When we right-click any section of the website and hit Inspect Element, we can see its structure. The screenshot below shows this for the same article section used in the snapshot above.

Once Inspect Element is clicked, the details open up as follows:

Observe the author name “Pavan” and its enclosing span tag in the snapshot.

Understanding pagination

The opencodez website uses pagination, i.e. we have several pages to scrape in order to collect all the articles. The first screenshot of the homepage below shows the web address and the pagination at the bottom. If we hit the ‘Last’ button, the address changes as highlighted in the second screenshot and points to page number 15; you can see that the web link ends in ‘/page/15’ to mark the page address. We will apply some logic to scrape this website up to page 15.

A snapshot of the home page is shown below.

Once we hit the Last button, the URL changes as in the snapshot below, showing the web link in terms of the page number, which is 15 in our case.


Scraping the first page to begin

If we change the page number in the address bar, we can see the various pages from 0 to 15. We will begin by scraping the first page, https://www.opencodez.com/page/0.

As the first step we will send a request to the URL and store its response in a variable named response. The server returns the full HTML code of the page in this response.
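A minimal sketch of this step, using the Requests library and the first-page URL identified above:

```python
import requests

# Send a GET request to the first page and keep the server's reply
url = 'https://www.opencodez.com/page/0'
response = requests.get(url)

# response.content now holds the raw HTML of the page
print(response.status_code)  # 200 means the request succeeded
```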

Then we parse the HTML response content with html.parser and name the resulting object soup.
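Continuing from the response above, the parsing step might look like this:

```python
from bs4 import BeautifulSoup

# Parse the raw HTML with Python's built-in html.parser
soup = BeautifulSoup(response.content, 'html.parser')
```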

Now let us see what the response looks like.

We will use the prettify function to organize it. See how the output is structured after running this command.
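That step is just a call to prettify on the soup object, for example:

```python
# prettify() re-indents the parsed HTML, one tag per line,
# which makes the page structure much easier to read
print(soup.prettify())
```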

Let us observe the page section from which we have to pull the details. If we inspect its element by the right-click method described earlier, we see that the href and title of any article lie within an h2 tag with a class named title.

The HTML code for the article title and its link is in the blue box above.

We will pull all of these with the following command.
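The command itself appears as a screenshot in the original post; a sketch that targets the tag and class named above (the variable name title_tags is illustrative):

```python
# Grab every <h2 class="title"> element on the page
title_tags = soup.find_all('h2', attrs={'class': 'title'})
print(len(title_tags))  # one entry per article listed on the page
```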

A list of 12 elements is returned, one per article on the page. From these we will extract the titles and hrefs of all the posted articles with the following commands.
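Assuming each of those headings wraps a single anchor tag, the titles and links can be pulled out as follows (the list names are illustrative):

```python
# The <a> tag inside each heading carries the article title as its
# text and the article address in its href attribute
titles = [tag.find('a').get_text(strip=True) for tag in title_tags]
links = [tag.find('a').get('href') for tag in title_tags]
```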

To collect the short description, author and date of each post, we need to target the div tag with the class named “post-content image-caption-format-1”.

The commands would be as follows:
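The original commands appear as a screenshot; the sketch below assumes the short description sits in a p tag and the author and date sit in span tags with theme-specific class names (theauthor and thetime are assumptions here, to be confirmed with Inspect Element):

```python
# Each article summary lives in a div with this class
content_divs = soup.find_all('div', attrs={'class': 'post-content image-caption-format-1'})

# Child selectors below are assumptions based on the Inspect Element
# view described earlier: description in <p>, author and date in <span>
descriptions = [div.find('p').get_text(strip=True) for div in content_divs]
authors = [div.find('span', attrs={'class': 'theauthor'}).get_text(strip=True) for div in content_divs]
dates = [div.find('span', attrs={'class': 'thetime'}).get_text(strip=True) for div in content_divs]
```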

Further explanation of code

Once these are collected for the first page, we need a loop to pull the same details from the remaining pages of the pagination. We use a for loop and append the values one after another. A variable page_number is incremented to build the address of the next page, which is fed as an argument to the function. After all the data from every page has been collected, we create a data frame from these variables and use the pandas package to store it in a csv.

The Complete Code
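The complete code was published as a screenshot in the original post; the script below reconstructs the same flow under the assumptions stated above (the helper function name, the span class names and the output file name are illustrative):

```python
import requests
import pandas as pd
from bs4 import BeautifulSoup


def scrape_page(page_number, titles, links, descriptions, authors, dates):
    """Scrape one listing page and append its details to the running lists."""
    url = 'https://www.opencodez.com/page/' + str(page_number)
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Article titles and links sit inside <h2 class="title"><a ...>
    for tag in soup.find_all('h2', attrs={'class': 'title'}):
        titles.append(tag.find('a').get_text(strip=True))
        links.append(tag.find('a').get('href'))

    # Short description, author and date sit inside the post-content div
    # (the child tag and class names here are assumptions about the theme)
    for div in soup.find_all('div', attrs={'class': 'post-content image-caption-format-1'}):
        descriptions.append(div.find('p').get_text(strip=True))
        authors.append(div.find('span', attrs={'class': 'theauthor'}).get_text(strip=True))
        dates.append(div.find('span', attrs={'class': 'thetime'}).get_text(strip=True))


titles, links, descriptions, authors, dates = [], [], [], [], []

# Loop over every page of the pagination, from page 0 up to page 15
for page_number in range(0, 16):
    scrape_page(page_number, titles, links, descriptions, authors, dates)

# Collect everything into a data frame and save it as a csv
df = pd.DataFrame({'Title': titles, 'Link': links, 'Description': descriptions,
                   'Author': authors, 'Date': dates})
df.to_csv('opencodez_articles.csv', index=False)
print(df.shape)
```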

Output file

A snapshot of the csv file is provided below.

Using output in various ways

Now we have our desired csv. We can do some exploratory data analysis on this data, for example to count the number of articles written by each author or to analyse the number of articles per year. We can also create a word cloud from the corpus of the brief description column to see the most frequently used words in the posts. These will be dealt with in the next post.

Word of caution for web scraping

The legality of this practice, however, is not well defined. Websites usually state in their terms of use and in their robots.txt file whether they allow scrapers or not. So please be careful not to tread in restricted territory, and do not hit the URL with a huge number of requests in a short duration, as this can cause issues for the website itself 🙂

Hope you found it useful.

In the next article we will see what we can do with the scraped data. Please stay tuned!

Web Scraping Using Beautiful Soup Word Cloud – Part 2