Web Scraping Hotel Details on TripAdvisor Using Python

Web scraping is a process in which computers automatically collect data from websites without human interaction. This blog post walks you through scraping hotel details from TripAdvisor with Python and feeding them into your own program, so you can build a custom list of hotels to compare prices and pull up reviews on the fly.

Beyond travel, this technique can be applied to any industry where data needs to be collected from many different websites in bulk, quickly, accurately, and at little cost.

Introduction

Web scraping is the automated extraction of data from websites. There are two types of web scraping: content scraping and structure scraping. Content scraping extracts textual content from a website’s pages, whereas structure scraping extracts relational data from the HTML structure itself.

A web scraper is an agent that performs web scraping to extract information for further use.

Web scrapers have many uses, such as monitoring online trends or news, updating existing data sets by extracting fresh information from websites for further analysis, maintaining sites, and detecting and fixing broken links.

Although web scraping can be done manually, software is generally used to automate it. Python is a popular language for web scraping because it has several libraries that make it easy to extract data from websites.

Importing Packages


We need to import a few packages to scrape data from a website. Selenium drives a real browser, which helps because TripAdvisor renders much of its content with JavaScript. BeautifulSoup lets us parse the HTML and extract data, and pandas lets us store the results in a data frame. For pages that can be fetched with a plain HTTP request, an HTTP client such as requests (or httpx, used later in this post) works as well.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By 
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
from bs4 import BeautifulSoup
import pandas as pd

TripAdvisor URL

The first step in web scraping is to find the website URL we want to scrape. We can start by looking at the TripAdvisor home page. From there, we can navigate to the page for a specific hotel.
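
As a small sketch of that navigation step (using the Selenium imports above), we can open the home page, type a query into the search box, and let TripAdvisor take us to the hotel's page. The search-box selector and the hotel name typed into it are assumptions for illustration; check them against the live page.

# Launch Chrome and open the TripAdvisor home page
driver = webdriver.Chrome(service=Service())
driver.get("https://www.tripadvisor.com/")
time.sleep(3)

# The selector for the search box is an assumption -- inspect the live page to confirm it
search_box = driver.find_element(By.CSS_SELECTOR, "input[type='search']")
search_box.send_keys("The Grand Hotel, Valletta")
search_box.send_keys(Keys.ENTER)
time.sleep(3)

# The browser now sits on the search results / hotel page whose URL we want to scrape
print(driver.current_url)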


Scraping Hotel Details in Python


Once we have the URL for the hotel, we can start scraping the data. We use the requests package to make a GET request to the hotel’s TripAdvisor page, which gives us the HTML of the page. We then parse that HTML with BeautifulSoup and find the elements that contain the data we care about: the hotel’s name, rating, number of reviews, and price. Finally, we extract the data from those elements, store it in a list, write the hotel’s name and price to a text file for later use, and keep everything in a Pandas data frame.
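
Here is a rough sketch of that flow. The hotel URL is a placeholder and the element lookups are assumptions -- TripAdvisor’s markup changes often, so inspect the page and adjust the selectors before relying on this.

import requests
from bs4 import BeautifulSoup
import pandas as pd

# Placeholder URL -- replace with the hotel page you found in the previous step
url = "https://www.tripadvisor.com/Hotel_Review-Example_Hotel.html"
# A browser-like User-Agent makes the request less likely to be rejected
headers = {"User-Agent": "Mozilla/5.0"}

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "html.parser")

# The hotel name usually sits in the page's main <h1>; rating, review count and
# price need selectors discovered by inspecting the page in your browser
name_tag = soup.find("h1")
hotel = {"name": name_tag.get_text(strip=True) if name_tag else None}

# Store the result in a data frame and write it out for later use
df = pd.DataFrame([hotel])
df.to_csv("hotels.txt", index=False, sep="\t")
print(df)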

This article aims to write a piece of web-scraping code that extracts the details of well-known hotels around the world and compares them by rating, location, price, and reviews.

import asyncio
import math
import logging
from urllib.parse import urljoin
import httpx
from parsel import Selector

log = logging.getLogger(__name__)


def parse_search_page(response):
    """Parse hotel id, url and name from a TripAdvisor search results page"""
    sel = Selector(text=response.text)
    parsed = []
    # we go through each result box and extract id, url and name:
    for result_box in sel.css("div.listing_title>a"):
        parsed.append(
            {
                "id": result_box.xpath("@id").get("").split("_")[-1],
                "url": result_box.xpath("@href").get(""),
                "name": result_box.xpath("text()").get("").split(". ")[-1],
            }
        )
    return parsed


async def scrape_search(query: str, session: httpx.AsyncClient):
    """Scrape all search results of a search query"""
    # scrape the first page of results
    log.info(f"{query}: scraping first search results page")
    # search_location() is a helper (not shown here) that resolves the query to
    # TripAdvisor location data, including the hotel search URL
    hotel_search_url = 'https://www.tripadvisor.com/' + (await search_location(query, session))['HOTELS_URL']
    log.info(f"found hotel search url: {hotel_search_url}")
    first_page = await session.get(hotel_search_url)

    # extract paging meta information from the first page: how many result pages are there?
    sel = Selector(text=first_page.text)
    total_results = int(sel.xpath("//div[@data-main-list-match-count]/@data-main-list-match-count").get())
    next_page_url = sel.css('a[data-page-number="2"]::attr(href)').get()
    page_size = int(sel.css('a[data-page-number="2"]::attr(data-offset)').get())
    total_pages = int(math.ceil(total_results / page_size))

    # scrape remaining pages concurrently
    log.info(f"{query}: found total {total_results} results, {page_size} results per page ({total_pages} pages)")
    other_page_urls = [
        # note "oa" stands for "offset anchors"
        urljoin(str(first_page.url), next_page_url.replace(f"oa{page_size}", f"oa{page_size * i}"))
        for i in range(1, total_pages)
    ]
    # assert that we didn't accidentally produce duplicate urls (that would mean something went wrong)
    assert len(set(other_page_urls)) == len(other_page_urls)
    other_pages = await asyncio.gather(*[session.get(url) for url in other_page_urls])

    # parse all data and return listing preview results
    results = []
    for response in [first_page, *other_pages]:
        results.extend(parse_search_page(response))
    return results
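
With the parser and the search scraper in place, a minimal way to run them might look like the sketch below. The browser-like headers are an assumption to reduce the chance of TripAdvisor rejecting the client; scrape_search comes from the code above (it relies on the search_location helper that is not shown in this post).

import asyncio
import httpx

async def run(query: str):
    # Browser-like headers make the client less likely to be blocked outright
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Accept-Language": "en-US,en;q=0.9",
    }
    async with httpx.AsyncClient(headers=headers, timeout=30.0, follow_redirects=True) as session:
        return await scrape_search(query, session)

results = asyncio.run(run("Malta"))
print(f"scraped {len(results)} hotel previews")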

Reading the Data into a Data Frame using Pandas


Once we have extracted the data from the HTML, we can use the pandas package to read it into a data frame, which makes it much easier to analyze. Data sitting in an HTML table can be read directly with pd.read_html.
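
Since our parser returns a list of dictionaries rather than an HTML table, pd.DataFrame is the more direct route here. A small sketch, with made-up rows standing in for the scraper's output:

import pandas as pd

# Made-up rows standing in for the list returned by scrape_search() above
results = [
    {"id": "101", "url": "/Hotel_Review-Example_Hotel_One.html", "name": "Example Hotel One"},
    {"id": "102", "url": "/Hotel_Review-Example_Hotel_Two.html", "name": "Example Hotel Two"},
]

df = pd.DataFrame(results)
print(df.head())

# For pages that expose real <table> elements, pandas can parse them directly:
# tables = pd.read_html(html_source)  # returns a list of DataFrames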

Here is what our program looks like so far: it scrapes the hotel data from TripAdvisor and stores it in a Pandas data frame.

def parse_hotels(driver):
    """Parse the loaded hotel listing page with BeautifulSoup

    Args:
        driver (webdriver.Chrome): The driver instance with the hotel listing page loaded
    """
    # Getting the HTML page source
    html_source = driver.page_source

    # Creating the BeautifulSoup object with the html source
    soup = BeautifulSoup(html_source,"html.parser")
    
    # Finding all the Hotel Div's in the BeautifulSoup object 
    hotel_tags = soup.find_all("div",{"data-prwidget-name":"meta_hsx_responsive_listing"})
    
    # Parsing the hotel details
    for hotel in hotel_tags:
        # Skip sponsored listings (they carry the "ui_merchandising_pill" badge)
        sponsored = hotel.find("span", class_="ui_merchandising_pill") is not None
        if not sponsored:
            # parse_hotel_details() is a helper (defined elsewhere) that extracts the
            # individual fields from a single hotel tag
            parse_hotel_details(hotel)
    print("The Hotels details in the current page are parsed")

Now that we have our hotel information stored in a Pandas data frame, we can plot the ratings of different hotels against each other to better understand how they differ. This can give us good insight into which hotels stand out and help us make informed decisions when booking.

Cleaning the Data


Once we have the data in a data frame, we can clean it up. This may involve removing duplicates, filling in missing values, or changing the data format. In this case, we will remove the duplicates, and we also want to drop the comments that belong to hotels outside our comparison so that only the hotels in our analysis remain. To accomplish this, we will use regular expressions. Here are a few example patterns:

phone_pattern = r".?(\d{3}).*(\d{3}).*(\d{4})"
date_pattern = r"(\d{2}).(\d{2}).(\d{4})"
name_pattern = r"(\w+),\s(Mr|Ms|Mrs|Dr).?\s(\w+)"
url_pattern = r"(https?)://(www)?.?(\w+).(\w+)/?(\w+)?"

This code will replace every " #" with a space and every "&amp;" entity with "&", and append the cleaned comment before or after the hotel entry.
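
A short sketch of that cleanup with pandas, using made-up rows in place of the scraped comments:

import pandas as pd

# Made-up rows standing in for the scraped comments
df = pd.DataFrame({
    "hotel": ["Hotel A", "Hotel A", "Hotel B"],
    "comment": ["Great location &amp; friendly staff #1", "Great location &amp; friendly staff #1", "Clean rooms &amp; quiet #2"],
})

# Drop exact duplicate rows
df = df.drop_duplicates()

# Replace " #" with a space and decode the "&amp;" entity back to "&"
df["comment"] = (
    df["comment"]
    .str.replace(" #", " ", regex=False)
    .str.replace("&amp;", "&", regex=False)
)

print(df)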

We can now store this data in a list and display our ranked hotel list to see how they compare.

Once our data is clean, we can analyze it further by plotting hotel ratings against prices in a scatter plot, as sketched below.
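
A minimal sketch of that plot, using hypothetical name, rating, and price columns in place of the cleaned data:

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical cleaned data -- in practice this comes from the scraping steps above
df = pd.DataFrame({
    "name": ["Hotel A", "Hotel B", "Hotel C"],
    "rating": [4.5, 4.0, 3.5],
    "price": [220, 150, 95],
})

# One point per hotel: rating on the x-axis, nightly price on the y-axis
ax = df.plot.scatter(x="rating", y="price")
for _, row in df.iterrows():
    ax.annotate(row["name"], (row["rating"], row["price"]))
plt.title("Hotel price vs. rating")
plt.show()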

Conclusion

  • Web scraping is a powerful tool for collecting data from websites.
  • Python makes it easy to scrape data from websites using a few different packages.
  • Once you have the data, you can use Pandas to read it into a data frame for further analysis.
  • Given a hotel’s URL, you can scrape its data and store it in a Pandas data frame.
  • You can access the data in the data frame using the Pandas package.
  • You can also clean the data and strip out unwanted values with regex.