Web Scraping Using Beautiful Soup: Word Cloud + Python – Tutorial Part 2

Web Scraping Using Beautiful Soup:

In our previous article, we covered what web scraping is, why it is useful, the different ways to scrape the web, and a step-by-step tutorial on web scraping using Beautiful Soup.

What to do with the data we scraped?

There are several things we can do to explore the data we collected by scraping the opencodez website. To extend our learning of what can be done with the structured data we extracted, we will explain two interesting and powerful topics. The first is word cloud generation; the second is Topic Modeling, which falls under the umbrella of NLP (Natural Language Processing).

Word Cloud

1) What is a word cloud:

A word cloud is a visual representation that highlights the high-frequency words in a corpus of text, after we have removed the common English words that carry little meaning, called stopwords, along with stray alphanumeric tokens.

2) Use of word cloud:

A word cloud is an interesting way to look at text data and gain useful insight instantly, without reading the whole text.

3) Required tools and knowledge:

  • Python
  • Pandas
  • wordcloud
  • matplotlib

4) A summary of Code:

In the web scraping code from the last article, we created a data frame named df using the pandas library and exported that data to a CSV file. In this article, we will treat that CSV as fresh input and start the code anew. We will focus only on the column named Article_Para, which contains most of the text data. We will convert the Article_Para column into a single string and apply the WordCloud function to the text with various parameters. Finally, we will plot the word cloud using the matplotlib library.

5) Code:
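Below is a minimal sketch of the word cloud code described above. The input file name scraped_data.csv is an assumption; use whatever name you gave the CSV exported in Part 1.

import pandas as pd
import matplotlib.pyplot as mplot
from wordcloud import WordCloud, STOPWORDS

# Load the data exported in Part 1 (the file name is an assumption)
df = pd.read_csv("scraped_data.csv")

# Join all values of the Article_Para column into one long string
text = " ".join(str(para) for para in df["Article_Para"])

# Generate the word cloud; STOPWORDS removes common English words
wordcloud = WordCloud(width=800, height=400, background_color="white",
                      stopwords=STOPWORDS, max_words=200).generate(text)

# Display the word cloud and hide the axes
mplot.figure(figsize=(10, 5))
mplot.imshow(wordcloud, interpolation="bilinear")
mplot.axis('off')
mplot.show()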

6) Explanation of some terms used in the code:

Stopwords are common words used in sentence construction. They generally add no value to the sentence and do not help us gain any insight, for example: a, the, this, that, who, etc. We remove these words from our analysis by passing STOPWORDS as a parameter to the WordCloud function. mplot.axis('off') disables the axis display in the word cloud output.
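If filler words still dominate the cloud, the default STOPWORDS set can be extended before generating; the extra words below are purely illustrative.

# Extend the default stopword set with additional filler words (illustrative)
custom_stopwords = set(STOPWORDS)
custom_stopwords.update(["will", "also", "one", "using"])
wordcloud = WordCloud(stopwords=custom_stopwords, background_color="white").generate(text)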

7) Word cloud output:

(Image: word cloud generated from the opencodez article text)

8) Reading the output:

The prominent words are QA, SQL, Testing, Developer, Microservices, etc., which tells us the most frequently used words in the Article_Para column of the data frame. This gives instant insight into the kinds of articles, and the concepts they cover, that we can expect to find on the site.

Topic Modeling

1) What is Topic Modeling:

Topic modeling is a technique that falls under NLP. It tries to identify the various topics, and flavors of topics, that exist in a corpus of text or documents.

2) Use of Topic Modeling:

Topic modeling identifies which topics are present in a particular text or document. For instance, the IMDB page for Troy has around 5,000+ reviews. If we collect all the available reviews and perform topic modeling on that review text, the topics identified can instantly help us understand the different facets of the movie Troy, such as the heroics of Achilles or the tragic end of Hector. In our opencodez.com text data, we can recognize the various types of topics that the articles offer to readers.

3) Required tools and knowledge:

  • Python
  • Pandas
  • gensim
  • NLTK

4) A Summary of Code:

We are going to use LDA (Latent Dirichlet Allocation) for topic modeling, via the gensim library. The NLTK (Natural Language Toolkit) library will be used to clean and tokenize our text data. Tokenization breaks a string of text into separate pieces: words, punctuation, symbols, etc. The tokenized data is then converted into a corpus. (The theory behind LDA is beyond the scope of this article.) We create an LDA model on the corpus using gensim to generate topics, and print them to see the output.

5) Code:
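A minimal sketch of the topic modeling code described above, again assuming the Part 1 export is named scraped_data.csv (the file name is an assumption):

import pandas as pd
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from gensim import corpora
from gensim.models.ldamodel import LdaModel

# If the NLTK stopword list is missing, run once:
# import nltk; nltk.download('stopwords')

df = pd.read_csv("scraped_data.csv")

tokenizer = RegexpTokenizer(r"\w+")
stop_words = set(stopwords.words("english"))

# Clean and tokenize each article paragraph
texts = []
for para in df["Article_Para"].astype(str):
    tokens = tokenizer.tokenize(para.lower())
    texts.append([t for t in tokens if t not in stop_words and not t.isnumeric()])

# Map each token to an id and build the bag-of-words corpus
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Train the LDA model and print 5 topics with 7 words each
ldamodel = LdaModel(corpus, num_topics=5, id2word=dictionary, passes=15)
for topic in ldamodel.print_topics(num_topics=5, num_words=7):
    print(topic)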

6) Reading the output:

We can change the parameter values to get any number of topics, or any number of words displayed per topic. Here we asked for 5 topics with 7 words each. We can observe that the topics relate to Java, Salesforce, unit testing, and microservices. If we increase the number of topics to, say, 10, we can surface other flavors of the existing topics as well.
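For example, continuing from the sketch above, retraining with a larger num_topics surfaces additional flavors:

# Retrain with more topics to surface additional flavors
ldamodel = LdaModel(corpus, num_topics=10, id2word=dictionary, passes=15)
for topic in ldamodel.print_topics(num_topics=10, num_words=7):
    print(topic)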

Conclusion:

In this two-article series, we have taken you through web scraping, how it can be used for data collection and interpretation, and, finally, effective ways to present the resulting analysis. Access to such data helps us make informed decisions.

I hope you find the information useful. Please do not hesitate to leave comments or questions; we will do our best to answer.
