Getting Started with Web Scraping
Across different domains, we often encounter scenarios where we need to procure data from a particular web page or blog. This data could be as simple as author names, product reviews, sales numbers, or images.
We'll cover a very simple process to extract product reviews from Amazon and save them to a text file and a CSV file. The same process can be replicated to scrape reviews for products, movies, and more from other web pages.
Importing Libraries
First, we'll import the libraries necessary for this job.
# Importing requests to extract content from a URL
# (https://realpython.com/python-requests/#the-get-request)
import requests

# Importing BeautifulSoup for web scraping
from bs4 import BeautifulSoup as bs
Scraping Product Reviews From the Website
Now that we have all the libraries, we'll start scraping Amazon for product reviews.
# Creating an empty list to store reviews
review_list = []
A for loop that runs once for each review page we want to scrape:
for i in range(1, 5):
    temp_list = []
    url = "https://www.amazon.in/Alchemist-Paulo-Coelho/product-reviews/8172234988/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=" + str(i)
    response = requests.get(url)
    soup = bs(response.content, "html.parser")
    reviews = soup.find_all("span", attrs={"class": "a-size-base review-text review-text-content"})
    for review in reviews:
        temp_list.append(review.text)
    review_list = review_list + temp_list
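One detail worth noting: range(1, 5) yields 1 through 4, so the loop scrapes four pages, not five. A quick self-contained check of the page numbers and URLs the loop visits (base URL shortened here for illustration):

```python
# range(1, 5) produces page numbers 1 through 4 (the stop value is excluded)
base = "https://www.amazon.in/Alchemist-Paulo-Coelho/product-reviews/8172234988/?pageNumber="
urls = [base + str(i) for i in range(1, 5)]

print(len(urls))  # four pages in total
print(urls[0])    # ...pageNumber=1
print(urls[-1])   # ...pageNumber=4
```

To scrape five pages, the call would need to be range(1, 6).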
Let's now walk through how the code works.
for i in range(1, 5): # the range function provides the number of pages we would like our code to scrape through

response = requests.get(url) # the get method is used to retrieve data from the specified resource

soup = bs(response.content, "html.parser") # creating a soup object to iterate over the extracted content

reviews = soup.find_all("span", attrs={"class": "a-size-base review-text review-text-content"}) # extracting the content under specific tags; the find_all() method looks through a tag's descendants and retrieves all descendants that match our filters
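To make the class filter concrete, here is a small self-contained example of find_all() on an inline HTML snippet; the markup and class name below are illustrative, not Amazon's actual page structure:

```python
from bs4 import BeautifulSoup as bs

# A toy HTML fragment standing in for a real page (illustrative markup)
html = """
<div>
  <span class="review-text">Great book!</span>
  <span class="review-text">Life changing.</span>
  <span class="other">Not a review</span>
</div>
"""

soup = bs(html, "html.parser")
# Only the two spans whose class matches the filter are returned
reviews = soup.find_all("span", attrs={"class": "review-text"})
texts = [r.text for r in reviews]
print(texts)  # ['Great book!', 'Life changing.']
```

The third span is skipped because its class does not match, which is exactly how the review spans are singled out on the real page.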
Writing Scraped Data Into a Text File
# Writing reviews to a text file
with open("reviews.txt", "w", encoding="utf8") as output:
    output.write(str(review_list))
And so we have our product reviews saved in a text file!
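Since the introduction also mentioned saving to a CSV file, here is a minimal sketch using Python's built-in csv module; the sample review_list below stands in for the scraped data:

```python
import csv

# Sample data standing in for the scraped review_list
review_list = ["Great book!", "A timeless classic."]

# Writing each review as its own row, with a header row on top
with open("reviews.csv", "w", newline="", encoding="utf8") as f:
    writer = csv.writer(f)
    writer.writerow(["review"])
    for review in review_list:
        writer.writerow([review.strip()])
```

A one-review-per-row CSV is easier to load back later (with the csv module or pandas) than the single stringified list stored in the text file.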