Web Scraping Mobile Phone Data from Flipkart Using Python

Extract product data from Flipkart with Pandas, Selenium, BeautifulSoup4, and CSV.

Today, the Internet holds far more data than it did a decade ago. According to Forbes, the amount of data we generate every day is mind-boggling: about 2.5 quintillion bytes are produced daily at the present pace, a pace made possible by the growth of Internet of Things (IoT) devices. Most businesses depend heavily on this information, whether it comes as text, video, audio, images, or other formats, to beat their competitors and succeed. Unfortunately, most of this data is not openly available: the majority of websites do not offer an option to save the data they display. That is where web scraping tools come in, extracting data from different websites.

What is Web Scraping?

Web scraping is the process of automatically downloading the data shown on a website using a computer program. A scraping tool can go through many pages of a website and automate the tedious job of manually copying and pasting the data. Web scraping matters because, whatever the industry, the web holds information that can offer actionable business insights and an edge over your competitors.

Steps Associated with Web Scraping

To fetch data through web scraping with Python, we need to go through these steps:

  • Get the URL of the page you wish to scrape.
  • Inspect the page.
  • Find the data you need to scrape.
  • Write the code.
  • Run the code and scrape the data.
  • Lastly, store the data in the required format.

Packages Utilized for Web Scraping

We’ll use the following Python packages:

Pandas: Pandas is a library for data analysis and manipulation. We use it to store the data in the desired format.

BeautifulSoup4: BeautifulSoup4 is a Python library for parsing HTML documents. It builds parse trees, which are useful for extracting tags from HTML strings.

Selenium: Selenium is a tool designed for running automated tests of web applications. Although scraping is not its main purpose, Selenium is also used from Python for data scraping because it can access JavaScript-rendered content (which a parser like BeautifulSoup cannot do on its own). We’ll use Selenium to download the HTML content from Flipkart and to watch interactively what is taking place.

CSV: The csv module implements classes for reading and writing tabular data in CSV format.
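
As a small illustration of the csv module described above, here is a minimal sketch that writes a header and one record. It writes to an in-memory buffer instead of a file so it can run anywhere; the two-column layout is a shortened, assumed version of the full record used later in this article.

```python
import csv
import io

# One sample record, shortened to two columns for illustration
rows = [('REDMI 9i (Nature Green, 64 GB)', '₹8,499')]

# An in-memory text buffer stands in for a real file opened with open(...)
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['Model', 'Price'])   # header row
writer.writerows(rows)                # data rows

csv_text = buffer.getvalue()
print(csv_text)
```

Both fields contain commas, so the writer wraps them in double quotes automatically; this is why we can store values such as prices without cleaning them first.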

Project Demonstration

Import Libraries

Let’s begin by importing the necessary packages.


import csv
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd

Starting the WebDriver

We start by creating a WebDriver object from the webdriver module imported above; we can then use this object for any browser operation we need. Here, we create a Chrome instance.


# Creating an instance of webdriver for google chrome
driver = webdriver.Chrome()

Next, we use the driver to open the Flipkart website in Chrome.


# Using webdriver we'll now open the flipkart website in chrome
url = 'https://flipkart.com'
# We'll use the get method of driver and pass in the URL
driver.get(url)

Now, there are a couple of ways to perform a product search:

The first way is to automate the browser: find the search input element, insert the text, and press the Enter key. The image here shows it:

flipkart-1

However, this type of automation is unnecessary and increases the chance of program failure. A good rule for web scraping is to automate only what is absolutely necessary.

Instead, type a term into the search box and press Enter. You’ll see that the search term is embedded in the page URL. We can use this pattern to create a function that builds the required URL for the driver to retrieve. This is more efficient in the long term and less prone to program failure. See the image given below:

flipkart-2

Now, let’s copy the pattern and make a function that inserts the search term using string formatting.


def get_url(search_item):
    '''
    This function fetches the URL of the item that you want to search
    '''
    template = 'https://www.flipkart.com/search?q={}&as=on&as-show=on&otracker=AS_Query_HistoryAutoSuggest_1_4_na_na_na&otracker1=AS_Query_HistoryAutoSuggest_1_4_na_na_na&as-pos=1&as-type=HISTORY&suggestionId=mobile+phones&requestId=e625b409-ca2a-456a-b53c-0fdb7618b658&as-backfill=on'
    # We are replacing every space with '+' to adhere to the pattern
    search_item = search_item.replace(" ","+")
    return template.format(search_item)

We now have a function that produces a URL from the search term we provide.


# Checking whether the function is working properly or not
url = get_url('mobile phones')
print(url)
https://www.flipkart.com/search?q=mobile+phones&as=on&as-show=on&otracker=AS_Query_HistoryAutoSuggest_1_4_na_na_na&otracker1=AS_Query_HistoryAutoSuggest_1_4_na_na_na&as-pos=1&as-type=HISTORY&suggestionId=mobile+phones&requestId=e625b409-ca2a-456a-b

The function produces the same URL pattern as before.
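
Replacing spaces by hand works for simple terms, but characters such as '+' or '&' inside a search term would break the URL. As a hedged alternative, the sketch below builds the query string with urllib.parse.urlencode from the standard library. The function name build_search_url is hypothetical, and the sketch assumes that the q and page parameters alone are enough, dropping the tracking parameters from the copied pattern.

```python
from urllib.parse import urlencode

# Assumption: the search endpoint works with only the 'q' (and optional 'page') parameter
BASE_URL = 'https://www.flipkart.com/search'

def build_search_url(search_item, page=None):
    """Return a search URL with the term (and optional page number) safely encoded."""
    params = {'q': search_item}
    if page is not None:
        params['page'] = page
    # urlencode turns spaces into '+' and escapes reserved characters
    return BASE_URL + '?' + urlencode(params)

print(build_search_url('mobile phones'))
print(build_search_url('mobile phones', page=2))
```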

Scraping the Collection

Now, we will retrieve the content of the webpage from which we wish to scrape data.

To do that, we need to create a BeautifulSoup object that will parse the HTML content from the page source.

We create the soup object from driver.page_source, which holds the HTML text, and use Python's default HTML parser to parse it.


# Loading the search results page built by get_url
driver.get(url)
# Creating a soup object using driver.page_source to retrieve the HTML text and then we'll use the default html parser to parse
# the HTML.
soup = BeautifulSoup(driver.page_source, 'html.parser')

flipkart-3

Now we can see that each record is displayed as a card (a box) holding all the details we need for one mobile phone. So let’s find the tag for the cards that contain the data we wish to scrape.

We’ll scrape: model, stars, total ratings, total reviews, RAM, storage capacity, expandable storage option, display, camera information, battery, processor, warranty, and price.

Inspect the Tags

Usually, the data is nested inside tags. So we need to inspect the page to see under which tag the data we want to extract is nested. To inspect a page, right-click on an element and choose ‘Inspect’.

flipkart-4

We can use the a tag with class=_1fQZEK to get all the cards, and from there we can easily find the information for every mobile phone.

Prototype a Single Record

The first page lists 24 mobile phones, so let’s pick the first one.


# finding all the cards on the page
results = soup.find_all('a',{'class':"_1fQZEK"})
# picking the 1st card from the complete list of cards
item = results[0]

To scrape the phone model, we find the div tag with class=_4rR01T.


# Extracting the model of the phone from the 1st card
model = item.find('div',{'class':"_4rR01T"}).text
model
'REDMI 9i (Nature Green, 64 GB)'

Similarly, to get the star rating that users gave the phone, we find the div tag with class=_3LWZlK.


# Extracting Stars from 1st card
star = item.find('div',{'class':"_3LWZlK"}).text
star
'4.3'
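
Note that find returns None when a card has no matching element (for example, a phone without any rating yet), and calling .text on None raises an AttributeError. A minimal sketch of a guard is shown here; the helper name safe_text is hypothetical, and it relies only on the find call and .text attribute already used above.

```python
def safe_text(item, tag, class_name, default=None):
    """Return the stripped text of the first matching element, or `default` if absent."""
    element = item.find(tag, {'class': class_name})
    if element is None:
        return default
    return element.text.strip()

# Hypothetical usage with a card taken from the results list:
# star = safe_text(item, 'div', '_3LWZlK')
```

With such a helper, a missing field becomes an empty cell in the final table instead of stopping a long scraping run.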

Now, scrape the remaining data by finding the corresponding tags.


# Extracting whether there is an option of expanding the storage or not
expandable = item.find('li',{'class':"rgWa7D"}).text[item.find('li',{'class':"rgWa7D"}).text.find('|')+1:][13:]
expandable
'Expandable Upto 512 GB'
# Extracting the display option from the 1st card
display = item.find_all('li')[1].text.strip()
display
'16.59 cm (6.53 inch) HD+ Display'
# Extracting camera options from the 1st card
camera = item.find_all('li')[2].text.strip()
camera
'13MP Rear Camera | 5MP Front Camera'
# Extracting the battery option from the 1st card
battery = item.find_all('li')[3].text
battery
'5000 mAh Lithium Polymer Battery'
# Extracting the processor option from the 1st card
processor = item.find_all('li')[4].text.strip()
processor
'MediaTek Helio G25 Processor'
# Extracting Warranty from the 1st card
warranty = item.find_all('li')[-1].text.strip()
warranty
'Brand Warranty of 1 Year Available for Mobile and 6 Months for Accessories'
# Extracting price of the model from the 1st card
price = item.find('div',{'class':'_30jeq3 _1_WHN1'}).text
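
The fixed offsets used for the memory line ([0:10] and [13:]) fit a string such as '4 GB RAM | 64 GB ROM | Expandable Upto 512 GB', but they would cut the text in the wrong place for a phone with, say, '128 GB ROM'. A hedged alternative is to split on the '|' separator instead. The function name split_memory_spec is hypothetical, and the sketch assumes the three parts always appear in the order RAM, storage, expandable.

```python
def split_memory_spec(text):
    """Split 'RAM | ROM | Expandable' text into three stripped parts.

    Missing trailing parts are returned as empty strings.
    """
    parts = [part.strip() for part in text.split('|')]
    # Pad to three entries so that unpacking never fails
    parts += [''] * (3 - len(parts))
    ram, storage, expandable = parts[:3]
    return ram, storage, expandable

print(split_memory_spec('4 GB RAM | 64 GB ROM | Expandable Upto 512 GB'))
print(split_memory_spec('4 GB RAM | 128 GB ROM'))
```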

Generalize a Pattern

It’s time to make a function that scrapes all the data from a single card, so we can apply it to every card on a page.


def extract_phone_model_info(item):
    """
    This function scrapes model, price, ram, storage, stars, total ratings, total reviews,
    storage expandable option, display option, camera quality, battery, processor, warranty of a phone model at flipkart
    """
    # Extracting the model of the phone from the 1st card
    model = item.find('div',{'class':"_4rR01T"}).text
    # Extracting Stars from 1st card
    star = item.find('div',{'class':"_3LWZlK"}).text
    # Extracting Number of Ratings from 1st card
    num_ratings = item.find('span',{'class':"_2_R_DZ"}).text.replace('\xa0&\xa0'," ; ")[0:item.find('span',{'class':"_2_R_DZ"}).text.replace('\xa0&\xa0'," ; ").find(';')].strip()
    # Extracting Number of Reviews from 1st card
    reviews = item.find('span',{'class':"_2_R_DZ"}).text.replace('\xa0&\xa0'," ; ")[item.find('span',{'class':"_2_R_DZ"}).text.replace('\xa0&\xa0'," ; ").find(';')+1:].strip()
    # Extracting RAM from the 1st card
    ram = item.find('li',{'class':"rgWa7D"}).text[0:item.find('li',{'class':"rgWa7D"}).text.find('|')]
    # Extracting Storage/ROM from 1st card
    storage = item.find('li',{'class':"rgWa7D"}).text[item.find('li',{'class':"rgWa7D"}).text.find('|')+1:][0:10].strip()
    # Extracting whether there is an option of expanding the storage or not
    expandable = item.find('li',{'class':"rgWa7D"}).text[item.find('li',{'class':"rgWa7D"}).text.find('|')+1:][13:]
    # Extracting the display option from the 1st card
    display = item.find_all('li')[1].text.strip()
    # Extracting camera options from the 1st card
    camera = item.find_all('li')[2].text.strip()
    # Extracting the battery option from the 1st card
    battery = item.find_all('li')[3].text
    # Extracting the processor option from the 1st card
    processor = item.find_all('li')[4].text.strip()
    # Extracting Warranty from the 1st card
    warranty = item.find_all('li')[-1].text.strip()
    # Extracting price of the model from the 1st card
    price = item.find('div',{'class':'_30jeq3 _1_WHN1'}).text
    result = (model,star,num_ratings,reviews,ram,storage,expandable,display,camera,battery,processor,warranty,price)
    return result
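
The function above keeps the rating and review counts as text such as '4,06,452 Ratings'. If numbers are more convenient for later analysis, the combined string could instead be parsed with a regular expression. The sketch below is an assumption-laden alternative, not the method used in this article: the name parse_counts is hypothetical, and it expects the 'Ratings' and 'Reviews' labels to appear exactly as in the sample.

```python
import re

def parse_counts(text):
    """Return (ratings, reviews) as integers; None where a count is missing."""
    def to_int(match):
        # Remove the thousands separators before converting
        return int(match.group(1).replace(',', '')) if match else None

    ratings = re.search(r'([\d,]+)\s*Ratings', text)
    reviews = re.search(r'([\d,]+)\s*Reviews', text)
    return to_int(ratings), to_int(reviews)

# Sample text in the shape found on a card (with non-breaking spaces around '&')
print(parse_counts('4,06,452 Ratings\xa0&\xa023,336 Reviews'))
```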

Put all the data from a single page into one list.


# Now putting all the information from all the cards/phone models and putting them into a list
records_list = []
results = soup.find_all('a',{'class':"_1fQZEK"})
for item in results:
    records_list.append(extract_phone_model_info(item))

Let’s see how the data for the first page looks by creating a DataFrame from the list created above.


# Creating a DataFrame from the list of records to preview the 1st page
pd.DataFrame(records_list,columns=['model',"star","num_ratings"
   ,"reviews",'ram',"storage","expandable","display","camera","battery","processor","warranty","price"])

Table

Model Stars Num_of_Ratings Reviews Ram Storage Expandable Display Camera Battery Processor Warranty Price
0 REDMI 9i (Nature Green, 64 GB) 4.3 4,06,452 Ratings 23,336 Reviews 4 GB RAM 64 GB ROM Expandable Upto 512 GB 16.59 cm (6.53 inch) HD+ Display 13MP Rear Camera | 5MP Front Camera 5000 mAh Lithium Polymer Battery MediaTek Helio G25 Processor Brand Warranty of 1 Year Available for Mobile … ₹8,499
1 realme C21 (Cross Blue, 64 GB) 4.4 63,273 Ratings 2,912 Reviews 4 GB RAM 64 GB ROM Expandable Upto 256 GB 16.51 cm (6.5 inch) HD+ Display 13MP + 2MP + 2MP | 5MP Front Camera 5000 mAh Battery MediaTek Helio G35 Processor 1 Year Warranty for Mobile and 6 Months for Ac… ₹9,499
2 realme C21 (Cross Black, 64 GB) 4.4 63,273 Ratings 2,912 Reviews 4 GB RAM 64 GB ROM Expandable Upto 256 GB 16.51 cm (6.5 inch) HD+ Display 13MP + 2MP + 2MP | 5MP Front Camera 5000 mAh Battery MediaTek Helio G35 Processor 1 Year Warranty for Mobile and 6 Months for Ac… ₹9,499
3 realme C21 (Cross Black, 32 GB) 4.4 51,035 Ratings 2,564 Reviews 3 GB RAM 32 GB ROM Expandable Upto 256 GB 16.51 cm (6.5 inch) HD+ Display 13MP + 2MP + 2MP | 5MP Front Camera 5000 mAh Battery MediaTek Helio G35 Processor 1 Year Warranty for Mobile and 6 Months for Ac… ₹8,499
4 realme C21 (Cross Blue, 32 GB) 4.4 51,035 Ratings 2,564 Reviews 3 GB RAM 32 GB ROM Expandable Upto 256 GB 16.51 cm (6.5 inch) HD+ Display 13MP + 2MP + 2MP | 5MP Front Camera 5000 mAh Battery MediaTek Helio G35 Processor 1 Year Warranty for Mobile and 6 Months for Ac… ₹8,499
5 REDMI 9 Power (Mighty Black, 64 GB) 4.3 1,30,038 Ratings 9,051 Reviews 4 GB RAM 64 GB ROM 16.59 cm (6.53 inch) Full HD+ Display 48MP + 8MP + 2MP + 2MP | 8MP Front Camera 6000 mAh Battery Qualcomm Snapdragon 662 Processor 1 year manufacturer warranty for device and 6 … ₹10,999

Next Page’s Navigation

Next, we write a customized function that helps us get data from different pages.


# Extending get_url with a placeholder for the page number
def get_url(search_item):
    '''
    This function fetches the URL of the item that you want to search
    '''
    template = 'https://www.flipkart.com/search?q={}&as=on&as-show=on&otracker=AS_Query_HistoryAutoSuggest_1_4_na_na_na&otracker1=AS_Query_HistoryAutoSuggest_1_4_na_na_na&as-pos=1&as-type=HISTORY&suggestionId=mobile+phones&requestId=e625b409-ca2a-456a-b53c-0fdb7618b658&as-backfill=on'
    search_item = search_item.replace(" ","+")
    # Add term query to URL
    url = template.format(search_item)
    # Add page query placeholder
    url += '&page={}'
    return url
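
To make the role of the page placeholder concrete, here is a minimal sketch of how one URL per results page can be generated with str.format. The generator name page_urls is hypothetical, and the search URL is a shortened, assumed form of the one returned by get_url, written with a page= query parameter.

```python
def page_urls(search_url, first_page, last_page):
    """Yield one URL per results page by filling the '{}' page placeholder."""
    for page in range(first_page, last_page + 1):
        yield search_url.format(page)

# Assumed, shortened shape of the URL returned by get_url
search_url = 'https://www.flipkart.com/search?q=mobile+phones&page={}'
urls = list(page_urls(search_url, 1, 3))
for u in urls:
    print(u)
```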

Put All Pieces Together

Let’s combine everything we have made so far. In the end, we write a main function that takes a search query, scrapes 464 pages, and saves the data of around 11,000 mobile phones to a CSV file, which we can then load into a DataFrame.


# Importing necessary Libraries
import csv
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd

def get_url(search_item):
    '''
    This function fetches the URL of the item that you want to search
    '''
    template = 'https://www.flipkart.com/search?q={}&as=on&as-show=on&otracker=AS_Query_HistoryAutoSuggest_1_4_na_na_na&otracker1=AS_Query_HistoryAutoSuggest_1_4_na_na_na&as-pos=1&as-type=HISTORY&suggestionId=mobile+phones&requestId=e625b409-ca2a-456a-b53c-0fdb7618b658&as-backfill=on'
    search_item = search_item.replace(" ","+")
    # Add term query to URL
    url = template.format(search_item)
    # Add page query placeholder
    url += '&page={}'
    return url

def extract_phone_model_info(item):
    """
    This function extracts model, price, ram, storage, stars , number of ratings, number of reviews, 
    storage expandable option, display option, camera quality, battery , processor, warranty of a phone model at flipkart
    """
    # Extracting the model of the phone from the 1st card
    model = item.find('div',{'class':"_4rR01T"}).text
    # Extracting Stars from 1st card
    star = item.find('div',{'class':"_3LWZlK"}).text
    # Extracting Number of Ratings from 1st card
    num_ratings = item.find('span',{'class':"_2_R_DZ"}).text.replace('\xa0&\xa0'," ; ")[0:item.find('span',{'class':"_2_R_DZ"}).text.replace('\xa0&\xa0'," ; ").find(';')].strip()
    # Extracting Number of Reviews from 1st card
    reviews = item.find('span',{'class':"_2_R_DZ"}).text.replace('\xa0&\xa0'," ; ")[item.find('span',{'class':"_2_R_DZ"}).text.replace('\xa0&\xa0'," ; ").find(';')+1:].strip()
    # Extracting RAM from the 1st card
    ram = item.find('li',{'class':"rgWa7D"}).text[0:item.find('li',{'class':"rgWa7D"}).text.find('|')]
    # Extracting Storage/ROM from 1st card
    storage = item.find('li',{'class':"rgWa7D"}).text[item.find('li',{'class':"rgWa7D"}).text.find('|')+1:][0:10].strip()
    # Extracting whether there is an option of expanding the storage or not
    expandable = item.find('li',{'class':"rgWa7D"}).text[item.find('li',{'class':"rgWa7D"}).text.find('|')+1:][13:]
    # Extracting the display option from the 1st card
    display = item.find_all('li')[1].text.strip()
    # Extracting camera options from the 1st card
    camera = item.find_all('li')[2].text.strip()
    # Extracting the battery option from the 1st card
    battery = item.find_all('li')[3].text
    # Extracting the processor option from the 1st card
    processor = item.find_all('li')[4].text.strip()
    # Extracting Warranty from the 1st card
    warranty = item.find_all('li')[-1].text.strip()
    # Extracting price of the model from the 1st card
    price = item.find('div',{'class':'_30jeq3 _1_WHN1'}).text
    result = (model,star,num_ratings,reviews,ram,storage,expandable,display,camera,battery,processor,warranty,price)
    return result

def main(search_item):
    '''
    This function scrapes the details from all the result pages and saves them to a csv file
    '''
    driver = webdriver.Chrome()
    records = []
    url = get_url(search_item)
    # range's upper bound is exclusive, so this covers pages 1 to 464
    for page in range(1,465):
        driver.get(url.format(page))
        soup = BeautifulSoup(driver.page_source,'html.parser')
        results = soup.find_all('a',{'class':"_1fQZEK"})
        for item in results:
            records.append(extract_phone_model_info(item))
    driver.close()
    # Saving the data into a csv file
    with open('Flipkart_results.csv','w',newline='',encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['Model','Stars','Num_of_Ratings','Reviews','Ram','Storage','Expandable',
                        'Display','Camera','Battery','Processor','Warranty','Price'])
        writer.writerows(records)
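
The page count in main is hard-coded, so the loop keeps requesting pages even if the site returns fewer results than expected. One possible refinement, sketched below under stated assumptions, is to stop at the first page that yields no cards. The names scrape_pages and fetch_page are hypothetical; fetch_page stands in for the Selenium and BeautifulSoup step and is replaced by a small fake here so the sketch runs without a browser.

```python
def scrape_pages(fetch_page, max_pages):
    """Collect records page by page, stopping early at the first empty page."""
    records = []
    for page in range(1, max_pages + 1):
        page_records = fetch_page(page)
        if not page_records:
            # No cards found: assume we have passed the last results page
            break
        records.extend(page_records)
    return records

# Stand-in data: pages 1 and 2 have records, every later page is empty
fake_site = {1: [('phone A',), ('phone B',)], 2: [('phone C',)]}
collected = scrape_pages(lambda page: fake_site.get(page, []), max_pages=464)
print(collected)
```

In a real run, a short pause between page requests (for example with time.sleep) would also be a considerate addition.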

Run the main function to scrape the data of all mobile phones available on the different pages.


%%time
# Running the scraper for the search term 'mobile phones' (the %%time magic must be the first line of the cell)
main('mobile phones')
Wall time: 40min 54s

View the data

Let’s see what the result looks like.


scraped_df = pd.read_csv('C:\\Users\\DELL\\Desktop\\Jupyter Notebook\\Jovian Web Scraping\\Amazon Products Web Scrapper\\Flipkart_results.csv')
scraped_df.head()

Model Stars Num_of_Ratings Reviews Ram Storage Expandable Display Camera Battery Processor Warranty Price
0 REDMI 9i (Nature Green, 64 GB) 4.3 4,06,452 Ratings 23,336 Reviews 4 GB RAM 64 GB ROM Expandable Upto 512 GB 16.59 cm (6.53 inch) HD+ Display 13MP Rear Camera | 5MP Front Camera 5000 mAh Lithium Polymer Battery MediaTek Helio G25 Processor Brand Warranty of 1 Year Available for Mobile … ₹8,499
1 realme C21 (Cross Black, 64 GB) 4.4 63,273 Ratings 2,912 Reviews 4 GB RAM 64 GB ROM Expandable Upto 256 GB 16.51 cm (6.5 inch) HD+ Display 13MP + 2MP + 2MP | 5MP Front Camera 5000 mAh Battery MediaTek Helio G35 Processor 1 Year Warranty for Mobile and 6 Months for Ac… ₹9,499
2 realme C21 (Cross Blue, 64 GB) 4.4 63,273 Ratings 2,912 Reviews 4 GB RAM 64 GB ROM Expandable Upto 256 GB 16.51 cm (6.5 inch) HD+ Display 13MP + 2MP + 2MP | 5MP Front Camera 5000 mAh Battery MediaTek Helio G35 Processor 1 Year Warranty for Mobile and 6 Months for Ac… ₹9,499
3 realme C21 (Cross Blue, 64 GB) 4.4 63,273 Ratings 2,912 Reviews 4 GB RAM 64 GB ROM Expandable Upto 256 GB 16.51 cm (6.5 inch) HD+ Display 13MP + 2MP + 2MP | 5MP Front Camera 5000 mAh Battery MediaTek Helio G35 Processor 1 Year Warranty for Mobile and 6 Months for Ac… ₹9,499