How to Extract Different Prices from an E-commerce Website

Writing a new scraper for every e-commerce website is tedious and expensive. This tutorial demonstrates a few techniques for building a simple, general scraper that recognizes common patterns and extracts the price from an e-commerce product page.

Let’s take a quick look at some product pages and identify design patterns in how product prices are displayed on different sites.

Sephora.com

[Screenshot: Sephora.com product page]

Amazon.com

[Screenshot: Amazon.com product page]

Patterns and Observations

By looking at these product pages, we identified the following patterns:

    • The price generally appears above the other currency figures on the page
    • The price is the currency figure with the largest font size
    • Prices look like currency figures (never like words)
    • The price appears within the first 600 pixels of page height

    Of course, there are exceptions to these observations; we’ll discuss how to deal with them later in this post. Together, these observations are enough to build an effective, general scraper.

    Implementing a General E-commerce Scraper

    1st Step: Installation

    This tutorial uses Google Chrome as the web browser. If you don’t have it, install it and follow along.

    Instead of the regular Google Chrome browser, developers often use Puppeteer, a Node.js library that drives Chrome programmatically and can run it headless. This removes the need to run a GUI application to run a scraper. Setting it up, however, is outside the scope of this tutorial.

    2nd Step: Chrome Developer Tools

    The code presented here is kept as simple as possible, so it won’t be able to fetch the price from every product page out there.

    For now, open any Sephora or Amazon product page in Google Chrome.

    • Visit the product page in Google Chrome
    • Right-click anywhere on the page and choose ‘Inspect’ to open Chrome DevTools
    • Click the Console tab in DevTools

    In the Console tab, you can enter JavaScript code, and the browser will execute it in the context of the loaded web page. You can learn more about DevTools in the official documentation.

      3rd Step: Running a JavaScript Snippet

      Copy the JavaScript snippet below and paste it into the console.

      // Collect every element inside <body>
      let elements = [...document.querySelectorAll('body *')]

      // Turn an element into a plain record: text, position and, for likely candidates, font size
      function createRecordFromElement(element) {
        const text = element.textContent.trim()
        const record = {}
        const bBox = element.getBoundingClientRect()
        if (text.length <= 30 && !(bBox.x == 0 && bBox.y == 0)) {
          record['fontSize'] = parseInt(getComputedStyle(element)['fontSize'], 10)
        }
        record['y'] = bBox.y
        record['x'] = bBox.x
        record['text'] = text
        return record
      }

      let records = elements.map(createRecordFromElement)

      // A record can be a price if it is near the top, has a font size and looks like a currency figure
      function canBePrice(record) {
        if (
          record['y'] > 600 ||
          record['fontSize'] == undefined ||
          !record['text'].match(/(^(US ){0,1}(rs\.|Rs\.|RS\.|\$|₹|INR|USD|CAD|C\$){0,1}(\s){0,1}[\d,]+(\.\d+){0,1}(\s){0,1}(AED){0,1}$)/)
        )
          return false
        else return true
      }

      let possiblePriceRecords = records.filter(canBePrice)

      // Largest font first; ties go to the record higher up the page.
      // The comparator has to return a number, not a boolean.
      let priceRecordsSortedByFontSize = possiblePriceRecords.sort(function (a, b) {
        if (a['fontSize'] == b['fontSize']) return a['y'] - b['y']
        return b['fontSize'] - a['fontSize']
      })

      // Guard against pages where no candidate was found
      console.log(priceRecordsSortedByFontSize.length > 0 ? priceRecordsSortedByFontSize[0]['text'] : 'No price found')

      Press the ‘Enter’ key and you will see the product price displayed in the console.

      If you don’t see a price, you have probably visited a product page that is an exception to our observations. This is quite common; we’ll discuss how to extend the script to cover more such pages. In the meantime, try one of the sample pages mentioned in step 2.

      [Animated GIF: extracting the price from an Amazon.com product page]

      How Does It Work?

      First, we collect all the HTML DOM elements in the page:

      let elements = [...document.querySelectorAll('body *')]

      We need to convert each of these elements into a simple JavaScript object that stores its XY position, font size, and text content, something like {'text': 'Tennis Ball', 'fontSize': 14, 'x': 100, 'y': 200}. For that, we write a function like the one below:

      function createRecordFromElement(element) {
        const text = element.textContent.trim() // Gets the text content of the element
        const record = {} // Starts a plain JavaScript object
        const bBox = element.getBoundingClientRect()
        // getBoundingClientRect is a standard DOM method available in Chrome; it returns
        // an object containing the element's x and y values, width and height
        if (text.length <= 30 && !(bBox.x == 0 && bBox.y == 0)) {
          record['fontSize'] = parseInt(getComputedStyle(element)['fontSize'], 10)
        }
        // getComputedStyle is a standard browser function; it returns an object
        // holding all the computed style data. As this call is fairly
        // time-consuming, we only collect the font size of elements whose
        // text content is at most 30 characters long and whose x and y coordinates are not both 0
        record['y'] = bBox.y
        record['x'] = bBox.x
        record['text'] = text
        return record
      }

      Now, convert all the collected elements into JavaScript objects by applying this function to every element with JavaScript’s map function.

      let records = elements.map(createRecordFromElement)

      Recall the observations we made about how prices are displayed. We can now filter the records down to those that match these observations. For that, we need a function that tells us whether a given record matches them.

      function canBePrice(record) {
        if (
          record['y'] > 600 ||
          record['fontSize'] == undefined ||
          !record['text'].match(/(^(US ){0,1}(rs\.|Rs\.|RS\.|\$|₹|INR|USD|CAD|C\$){0,1}(\s){0,1}[\d,]+(\.\d+){0,1}(\s){0,1}(AED){0,1}$)/)
        )
          return false
        else return true
      }

      We use a regular expression to check whether the given text is a currency figure. You can modify this regular expression if it doesn’t cover the pages you’re testing with.

      Now we can keep only the records that are possibly price records:
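To see which strings the regular expression accepts before changing it, you can run it against a few sample strings. The sketch below is only an illustration: the sample strings and the names PRICE_PATTERN, EXTENDED_PRICE_PATTERN and looksLikePrice are our own and not part of the script above, and adding € and £ is just one assumed example of an extension. It runs in the DevTools console or in Node.js.

```javascript
// The same pattern canBePrice uses, stored in a constant so it can be tried on its own.
const PRICE_PATTERN = /(^(US ){0,1}(rs\.|Rs\.|RS\.|\$|₹|INR|USD|CAD|C\$){0,1}(\s){0,1}[\d,]+(\.\d+){0,1}(\s){0,1}(AED){0,1}$)/;

// Assumed extension for illustration: the same pattern with € and £ added as symbols.
const EXTENDED_PRICE_PATTERN = /(^(US ){0,1}(rs\.|Rs\.|RS\.|\$|₹|€|£|INR|USD|CAD|C\$){0,1}(\s){0,1}[\d,]+(\.\d+){0,1}(\s){0,1}(AED){0,1}$)/;

// Hypothetical sample strings, not taken from any real page.
const samples = ['$24.00', 'US $1,299.99', 'Rs. 499', '120 AED', '€19.99', '$10 - $20', 'Free shipping'];

// Returns true when the whole string looks like a single currency figure.
function looksLikePrice(text, pattern = PRICE_PATTERN) {
  return pattern.test(text);
}

// Print each sample with the verdict of the original and the extended pattern.
for (const sample of samples) {
  console.log(sample, looksLikePrice(sample), looksLikePrice(sample, EXTENDED_PRICE_PATTERN));
}
```

With the original pattern, '€19.99' is rejected because € is not among the listed symbols, while the extended pattern accepts it. A range such as '$10 - $20' is rejected by both, so price ranges would need a pattern of their own.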

      let possiblePriceRecords = records.filter(canBePrice)

      Finally, as we observed, the price is the currency figure with the largest font size. If several currency figures share the largest font size, the price is most likely the one positioned highest on the page. We sort our records on these conditions using JavaScript’s sort function.

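      let priceRecordsSortedByFontSize = possiblePriceRecords.sort(function (a, b) {
        // Same font size: the record higher up the page (smaller y) comes first
        if (a['fontSize'] == b['fontSize']) return a['y'] - b['y']
        // Otherwise the larger font size comes first
        return b['fontSize'] - a['fontSize']
      })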
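JavaScript’s sort expects the comparison function to return a number: negative when a should come first, positive when b should come first, and zero when the order doesn’t matter. The following minimal sketch applies such a comparator to a few made-up records (the sample data and the name byFontSizeThenPosition are hypothetical), so you can check the ordering without loading a page.

```javascript
// Hypothetical records shaped like the output of createRecordFromElement.
const sampleRecords = [
  { text: '$5.99', fontSize: 12, x: 40, y: 420 },
  { text: '$30.00', fontSize: 24, x: 40, y: 310 },
  { text: '$24.00', fontSize: 24, x: 40, y: 180 },
];

// Negative result: a comes first. Positive result: b comes first.
function byFontSizeThenPosition(a, b) {
  if (a.fontSize !== b.fontSize) return b.fontSize - a.fontSize; // larger font first
  return a.y - b.y; // same font size: higher on the page first
}

// Copy before sorting, because sort() reorders the array in place.
const sortedRecords = [...sampleRecords].sort(byFontSizeThenPosition);
console.log(sortedRecords.map((record) => record.text));
// → [ '$24.00', '$30.00', '$5.99' ]
```

The two 24px records tie on font size, so the one with the smaller y value wins and ends up at index 0.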

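      Now we just have to print the top record to the console, taking care of the case where no candidate was found:

      console.log(priceRecordsSortedByFontSize.length > 0 ? priceRecordsSortedByFontSize[0]['text'] : 'No price found')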

      Taking It Further

      Moving to a Scalable Program That Doesn’t Depend on a GUI

      You can replace the regular Google Chrome browser with headless Chrome driven by Puppeteer. It is perhaps one of the quickest options for rendering web pages, and it runs on the same engine as Google Chrome. Once Puppeteer is set up, you can programmatically inject our script into the headless browser and have the price returned to a function in your program.
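As a rough sketch of what that could look like, the code below wraps the console script in a function and runs it inside a page through Puppeteer. It assumes Node.js with the puppeteer package installed (npm install puppeteer); the names findPriceInPage and extractPrice, the viewport size, and the example URL are our own choices for illustration, not part of the original script.

```javascript
// Runs inside the page, so it may only use browser APIs and its own local variables.
function findPriceInPage() {
  const pricePattern = /(^(US ){0,1}(rs\.|Rs\.|RS\.|\$|₹|INR|USD|CAD|C\$){0,1}(\s){0,1}[\d,]+(\.\d+){0,1}(\s){0,1}(AED){0,1}$)/;
  const candidates = [...document.querySelectorAll('body *')]
    .map((element) => {
      const text = element.textContent.trim();
      const box = element.getBoundingClientRect();
      const record = { text, x: box.x, y: box.y };
      // Only look up the font size for short texts that are actually positioned.
      if (text.length <= 30 && !(box.x === 0 && box.y === 0)) {
        record.fontSize = parseInt(getComputedStyle(element).fontSize, 10);
      }
      return record;
    })
    .filter((r) => r.y <= 600 && r.fontSize !== undefined && pricePattern.test(r.text))
    .sort((a, b) => (b.fontSize - a.fontSize) || (a.y - b.y));
  return candidates.length > 0 ? candidates[0].text : null;
}

// Opens the URL in headless Chrome and returns the detected price text, or null.
async function extractPrice(url) {
  const puppeteer = require('puppeteer'); // loaded here so the file parses without it
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // The 600 pixel rule depends on the layout, so fix the viewport size.
    await page.setViewport({ width: 1280, height: 800 });
    await page.goto(url, { waitUntil: 'networkidle2' });
    return await page.evaluate(findPriceInPage);
  } finally {
    await browser.close();
  }
}

// Usage (the URL is a placeholder):
// extractPrice('https://example.com/some-product').then(console.log);
```

Keeping the in-page logic in one function that touches only browser APIs means the same code can still be pasted into the DevTools console for debugging.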

      Improving and Enhancing the Script

      You will quickly notice that some product pages don’t work with this script, because they don’t follow the assumptions we made about how product prices are displayed or the patterns we identified.

      Unfortunately, there is no “holy grail” or perfect solution to this problem. It is possible, however, to study more pages, identify more patterns, and improve the scraper accordingly.

      Another step you could take to handle other pages is to use Artificial Intelligence or Machine Learning based methods to recognize and classify patterns and automate the process to a greater extent. This is a growing field, and we at X-Byte are already using these methods with varying degrees of success.
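One simple way to make room for such improvements is to turn the observations into parameters that can be tuned per site. The sketch below is an assumption of ours, not part of the original script: pickPrice, DEFAULT_OPTIONS and the sample records are hypothetical, and the records are expected to have the shape produced by createRecordFromElement.

```javascript
// The thresholds from our observations, gathered in one place so they can be overridden.
const DEFAULT_OPTIONS = {
  maxY: 600,
  pattern: /(^(US ){0,1}(rs\.|Rs\.|RS\.|\$|₹|INR|USD|CAD|C\$){0,1}(\s){0,1}[\d,]+(\.\d+){0,1}(\s){0,1}(AED){0,1}$)/,
};

// Returns the text of the most likely price record, or null if nothing qualifies.
function pickPrice(records, options = {}) {
  const { maxY, pattern } = { ...DEFAULT_OPTIONS, ...options };
  const candidates = records
    .filter((r) => r.fontSize !== undefined && r.y <= maxY && pattern.test(r.text))
    .sort((a, b) => (b.fontSize - a.fontSize) || (a.y - b.y));
  return candidates.length > 0 ? candidates[0].text : null;
}

// Made-up page where the price sits lower than 600 pixels.
const lowPriceRecords = [
  { text: 'Add to cart', fontSize: 18, x: 20, y: 300 },
  { text: '$42.50', fontSize: 22, x: 20, y: 750 },
];

console.log(pickPrice(lowPriceRecords)); // → null
console.log(pickPrice(lowPriceRecords, { maxY: 900 })); // → $42.50
```

Returning null instead of throwing makes it easy to retry with relaxed settings, or to log the page as one that needs a new pattern.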

      If you need help with Amazon price scraping, see our tutorial written specifically for Amazon:

      Please DO NOT contact X-Byte for help with the tutorials or code through the contact form or by phone; instead, add a comment at the end of this tutorial page to get help.

      Disclaimer

      The code given in our tutorials is for learning and illustration purposes only. We are not responsible for how it is used and assume no liability for any harmful use of the source code. The mere presence of this code on our website does not imply that we encourage scraping in general or scraping the sites referenced in the code and the accompanying tutorial. The tutorial only illustrates the technique of programming a web scraper for general internet sites. We are not obligated to provide support for the code; however, if you add your questions in the comment section, we may occasionally address them.