How to Scrape Flipkart with Python and Visualize the Data in Power BI

In this blog, we will see how to scrape Flipkart and then use the extracted data in Power BI.

The software and libraries used in the code and visualization are listed below:

  • BS4
  • Pandas
  • Power BI
  • Requests
  • Time

Before we start coding, let's understand the website and the data we will be extracting. We will extract data for mobile phones priced under Rs. 10,000. Here is a snapshot of the website, with the details to be scraped highlighted.

[Screenshot: Flipkart mobile listing page with the details to be scraped highlighted]

Next, we identify the HTML class ids that will be used during the scraping procedure. We then import the necessary libraries and create empty lists to store the highlighted fields.

import requests
import pandas as pd
import time
from bs4 import BeautifulSoup
# Empty lists to hold the scraped fields, one entry per phone
mobile_model_name=[]
mobile_description=[]
mobile_cost=[]
mobile_rating=[]
mobile_rating_count_review=[]

Next, we construct the page URL from a base URL so that we can extract data from multiple pages with a "for" loop, and then check the status code of the response. If the status code is not "200", the connection failed, and we print the message "Request returned error code:" followed by the status code returned by the request.

If the status code is "200", we parse the response with "html.parser" and use "find_all" to collect every element with the parent class id "_13oc-S"; each of these holds one product card, and they are stored in "results". We then use a "for" loop over each element "results_value" returned in "results".

for page_num in range(1,11):
    # Build the page URL from the base URL and the page number
    web_base_url='https://www.flipkart.com/mobiles/~mobile-phones-under-rs10000/pr?sid=tyy%2C4io&page='
    concat_web_url=web_base_url + str(page_num)
    web_url_response=requests.get(concat_web_url)

    if web_url_response.status_code!=200:
        print("Request returned error code: " + str(web_url_response.status_code))
    else:
        soup = BeautifulSoup(web_url_response.content, "html.parser")
        # Each "_13oc-S" div is one product card on the listing page
        results = soup.find_all("div", class_="_13oc-S")
        
        for results_value in results:

            # Fall back to a default value when an element is missing
            result_mobile_model_name=results_value.find("div", class_="_4rR01T")
            if result_mobile_model_name is None:
                mobile_model_name.append("NA")
            else:
                mobile_model_name.append(result_mobile_model_name.text)
            
            result_mobile_description=results_value.find("ul", class_="_1xgFaf")
            if result_mobile_description is None:
                mobile_description.append("NA")
            else:
                mobile_description.append(result_mobile_description.text)
            
            result_mobile_cost=results_value.find("div", class_="_30jeq3 _1_WHN1")
            if result_mobile_cost is None:
                mobile_cost.append("0")    
            else:
                mobile_cost.append(result_mobile_cost.text)
            
            result_mobile_rating=results_value.find("div", class_="_3LWZlK")
            if result_mobile_rating is None:
                mobile_rating.append("0")
            else:
                mobile_rating.append(result_mobile_rating.text)
            
            result_mobile_rating_count_review=results_value.find("span", class_="_2_R_DZ")
            if result_mobile_rating_count_review is None:
                mobile_rating_count_review.append("0 Ratings & 0 Reviews")
            else:
                mobile_rating_count_review.append(result_mobile_rating_count_review.text)

    # Pause between pages so the IP address does not get blocked
    time.sleep(15)

Within each card, we look up the element with class id "_4rR01T" (and likewise the other class ids shown above). If the result is "None", the element was not found, so we append "NA" or "0" to the corresponding list, depending on the data type.

We follow the same procedure for the other lists. After each page, we pause for 15 seconds so that our IP address does not get blocked.
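To see the fallback pattern in isolation, here is a small sketch against a static HTML snippet — a hypothetical stand-in for one Flipkart result card, with the rating div deliberately absent:

```python
from bs4 import BeautifulSoup

# Hypothetical static snippet standing in for one Flipkart result card;
# the class names mirror those used in the scraper above.
html = """
<div class="_13oc-S">
  <div class="_4rR01T">Sample Phone (4GB RAM)</div>
  <div class="_30jeq3 _1_WHN1">&#8377;9,999</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
card = soup.find("div", class_="_13oc-S")

# find() returns None when no element matches, so we fall back to a default
name = card.find("div", class_="_4rR01T")
name_text = "NA" if name is None else name.text

rating = card.find("div", class_="_3LWZlK")   # not present in this snippet
rating_text = "0" if rating is None else rating.text

print(name_text)    # Sample Phone (4GB RAM)
print(rating_text)  # 0
```

The same `None` check guards every field in the scraper, so one card with a missing price or rating cannot shift all the lists out of alignment.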

Once the data is ready in the lists "mobile_model_name", "mobile_cost", "mobile_description", "mobile_rating_count_review", and "mobile_rating", we load it into the pandas dataframe "df_data". We then apply some transformations: setting the data types of the columns "Model Cost", "Model Rating", "Model Review Count", and "Model Rating Count"; applying strip and split functions on the "Model Cost" column to remove extra characters such as "," and "₹"; and doing the same on the "Model Review Count" and "Model Rating Count" columns to remove the "Reviews" and "Ratings" strings. Finally, we delete the original column "Model Rating and Review Count" that held the concatenated data.

df_data=pd.DataFrame()
df_data['Model Name']=mobile_model_name
df_data['Model Description']=mobile_description
df_data['Model Cost']=mobile_cost
df_data['Model Rating']=mobile_rating
df_data['Model Rating and Review Count']=mobile_rating_count_review

# Cast the rating to a numeric type
df_data['Model Rating']=df_data['Model Rating'].astype(float)

# Remove the currency symbol and thousands separators, then cast to float
df_data['Model Cost']=(df_data['Model Cost']
                       .str.replace("₹","")
                       .str.replace(",","")
                       .str.strip()
                       .astype(float))

# Split "<n> Ratings & <m> Reviews" into two numeric columns
df_data['Model Rating Count']=(df_data['Model Rating and Review Count']
                               .str.split("&").str[0]
                               .str.replace("Ratings","")
                               .str.replace(",","")
                               .str.strip()
                               .astype(int))

df_data['Model Review Count']=(df_data['Model Rating and Review Count']
                               .str.split("&").str[1]
                               .str.replace("Reviews","")
                               .str.replace(",","")
                               .str.strip()
                               .astype(int))

# Drop the original concatenated column
del df_data['Model Rating and Review Count']
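As a quick sanity check, the same cleaning steps can be exercised on a couple of hypothetical raw values in the shape the scraper collects:

```python
import pandas as pd

# Hypothetical raw values matching the formats seen on the listing page
df = pd.DataFrame({
    "Model Cost": ["₹9,999", "₹7,499"],
    "Model Rating and Review Count": ["12,345 Ratings & 1,234 Reviews",
                                      "0 Ratings & 0 Reviews"],
})

# Strip the currency symbol and thousands separators, then cast to float
df["Model Cost"] = (df["Model Cost"]
                    .str.replace("₹", "")
                    .str.replace(",", "")
                    .str.strip()
                    .astype(float))

# Split the combined column into numeric rating and review counts
parts = df["Model Rating and Review Count"].str.split("&")
df["Model Rating Count"] = (parts.str[0].str.replace("Ratings", "")
                            .str.replace(",", "").str.strip().astype(int))
df["Model Review Count"] = (parts.str[1].str.replace("Reviews", "")
                            .str.replace(",", "").str.strip().astype(int))

print(df["Model Cost"].tolist())          # [9999.0, 7499.0]
print(df["Model Rating Count"].tolist())  # [12345, 0]
print(df["Model Review Count"].tolist())  # [1234, 0]
```

If Flipkart changes the wording of the combined column, the `astype(int)` casts will fail loudly, which is a useful early warning that the cleaning rules need updating.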

Let’s go through the entire program:

import requests
import pandas as pd
import time
from bs4 import BeautifulSoup

# Empty lists to hold the scraped fields, one entry per phone
mobile_model_name=[]
mobile_description=[]
mobile_cost=[]
mobile_rating=[]
mobile_rating_count_review=[]

for page_num in range(1,11):
    # Build the page URL from the base URL and the page number
    web_base_url='https://www.flipkart.com/mobiles/~mobile-phones-under-rs10000/pr?sid=tyy%2C4io&page='
    concat_web_url=web_base_url + str(page_num)
    web_url_response=requests.get(concat_web_url)

    if web_url_response.status_code!=200:
        print("Request returned error code: " + str(web_url_response.status_code))
    else:
        soup = BeautifulSoup(web_url_response.content, "html.parser")
        # Each "_13oc-S" div is one product card on the listing page
        results = soup.find_all("div", class_="_13oc-S")
        
        for results_value in results:

            # Fall back to a default value when an element is missing
            result_mobile_model_name=results_value.find("div", class_="_4rR01T")
            if result_mobile_model_name is None:
                mobile_model_name.append("NA")
            else:
                mobile_model_name.append(result_mobile_model_name.text)
            
            result_mobile_description=results_value.find("ul", class_="_1xgFaf")
            if result_mobile_description is None:
                mobile_description.append("NA")
            else:
                mobile_description.append(result_mobile_description.text)
            
            result_mobile_cost=results_value.find("div", class_="_30jeq3 _1_WHN1")
            if result_mobile_cost is None:
                mobile_cost.append("0")    
            else:
                mobile_cost.append(result_mobile_cost.text)
            
            result_mobile_rating=results_value.find("div", class_="_3LWZlK")
            if result_mobile_rating is None:
                mobile_rating.append("0")
            else:
                mobile_rating.append(result_mobile_rating.text)
            
            result_mobile_rating_count_review=results_value.find("span", class_="_2_R_DZ")
            if result_mobile_rating_count_review is None:
                mobile_rating_count_review.append("0 Ratings & 0 Reviews")
            else:
                mobile_rating_count_review.append(result_mobile_rating_count_review.text)

    # Pause between pages so the IP address does not get blocked
    time.sleep(15)
    
df_data=pd.DataFrame()
df_data['Model Name']=mobile_model_name
df_data['Model Description']=mobile_description
df_data['Model Cost']=mobile_cost
df_data['Model Rating']=mobile_rating
df_data['Model Rating and Review Count']=mobile_rating_count_review

# Cast the rating to a numeric type
df_data['Model Rating']=df_data['Model Rating'].astype(float)

# Remove the currency symbol and thousands separators, then cast to float
df_data['Model Cost']=(df_data['Model Cost']
                       .str.replace("₹","")
                       .str.replace(",","")
                       .str.strip()
                       .astype(float))

# Split "<n> Ratings & <m> Reviews" into two numeric columns
df_data['Model Rating Count']=(df_data['Model Rating and Review Count']
                               .str.split("&").str[0]
                               .str.replace("Ratings","")
                               .str.replace(",","")
                               .str.strip()
                               .astype(int))

df_data['Model Review Count']=(df_data['Model Rating and Review Count']
                               .str.split("&").str[1]
                               .str.replace("Reviews","")
                               .str.replace(",","")
                               .str.strip()
                               .astype(int))

# Drop the original concatenated column
del df_data['Model Rating and Review Count']

Once the data is ready, we use Power BI to visualize it. Power BI has a built-in Python script connector that can be used to load the data.

[Screenshot: Power BI Get Data window with the Python script connector]

Just click on "Python Script", paste the code in the space provided, and click "OK" as shown below:

[Screenshot: Power BI Python script dialog with the code pasted in]
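The pasted script just needs to leave a pandas DataFrame in scope; Power BI's Python connector presents every DataFrame defined in the script as a selectable table in the Navigator. A minimal sketch of such a script, with hypothetical inline rows standing in for the scraped Flipkart data, looks like:

```python
import pandas as pd

# In Power BI's "Python script" dialog, every DataFrame defined in the
# script appears as a selectable table in the Navigator window.
# Hypothetical sample rows stand in for the scraped data here.
df_data = pd.DataFrame({
    "Model Name": ["Sample Phone A", "Sample Phone B"],
    "Model Cost": [9999.0, 7499.0],
    "Model Rating": [4.3, 4.1],
})
```

In practice you would paste the full scraping and cleaning script from above, so that refreshing the Power BI dataset re-runs the scrape.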

You will be redirected to the "Power Query Editor", where Power BI automatically creates three steps: "Source", "Navigation", and "Changed Type". To these, we add further steps to build our dashboard.

[Screenshot: Power Query Editor showing the Source, Navigation, and Changed Type steps]

Here is the final view of the published dashboard:

[Screenshot: final published Power BI dashboard]

We hope you found this blog helpful. For more information, contact X-Byte Enterprise Crawling or ask for a free quote!