How Airbnb Website Data Is Scraped Using Python, BeautifulSoup, and Selenium
About Airbnb

Airbnb allows renting private stays at reasonable prices. The company had a successful IPO towards the end of 2020, thanks to its brilliant idea of becoming a travel marketplace and its flawless execution. It also plays a significant role in the IT industry as a supporter of open-source projects [2], the most famous of which is Airflow.

Every listing contains a wealth of information, ranging from the kind of Wi-Fi to the list of kitchen utensils. Because Airbnb does not offer a public API, we will have to rely on a workaround, web scraping, for our small instructional project.

Initiating

Python will be used as the programming language since it is ideal for experimentation and has a large online community. Furthermore, there is a library for almost every requirement. Today, we’ll use two of them as our primary tools:

  • BeautifulSoup: Allows quick data extraction from HTML files.
  • Selenium: A tool for scripting web-browser operations that can be used for a variety of purposes.

We prefer to work in WSL as much as possible. Python packages can be installed with pip or conda. With Selenium, the setup becomes more involved, so you can contact X-Byte Enterprise Crawling for an efficient guide to setting up Selenium on WSL.

Plan the Scraping

A visitor checks out the destination on the desired dates and clicks the search button. The Airbnb ranking system then displays a list of options for them to choose from. That’s a search page, with numerous listings displayed at once and only a few lines of information for each one.

After exploring for a while, our visitor clicks on a listing and is redirected to a detail page, where they can obtain more information about the selected property.

We want to scrape as much information as possible, so we’ll process both search and detail pages. Furthermore, we must consider the listings located beyond the first search page. A normal search page yields 20 results, and a destination can provide up to 15 pages (Airbnb restricts further access).

It appears to be quite straightforward. Two primary portions of the software must be implemented: (1) reading a search page and (2) retrieving data from a detail page. Let’s get to work on some programming!
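Before writing any code, here is a minimal sketch of that plan; the two function names are placeholders for the parts implemented in the sections below.

# High-level plan (placeholder names; the real implementations follow below)
for page_url in search_page_urls:            # up to 15 search pages per destination
    listings = read_search_page(page_url)    # (1) read a search page, ~20 listings each
    for listing in listings:
        details = read_detail_page(listing)  # (2) retrieve data from a detail page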

Accessing the Listings

With Python, scraping web pages is a breeze. The function that extracts HTML and converts it to a Beautiful Soup object is as follows:

def scrape_page(page_url):
    """Extracts HTML from a webpage"""

    answer = requests.get(page_url)
    content = answer.content
    soup = BeautifulSoup(content, features='html.parser')

    return soup

Beautiful Soup makes it simple to traverse and retrieve the elements of an HTML tree. Getting the text from a “div” object with the class “foobar” is as simple as:

text = soup.find("div", {"class": "foobar"}).get_text()
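One thing to keep in mind: find() returns None when nothing matches, so it is safer to guard the call before extracting the text. A minimal sketch:

element = soup.find("div", {"class": "foobar"})
text = element.get_text() if element is not None else ""  # avoids AttributeError when the element is missing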

Individual listings are the objects of our attention on the Airbnb search page. We must first establish their tag types and class names to access them. The simplest way to do this is to investigate the page with the Chrome developer tools (press F12).

[Screenshot: inspecting a listing with the Chrome developer tools]

A “div” item with the class “_8s3ctt” contains the listing. We also know that a single search page can have up to 20 different listings. We can get them all at once with the Beautiful Soup method findAll:

def extract_listing(page_url):
    """Extracts listings from an Airbnb search page"""

    page_soup = scrape_page(page_url)
    listings = page_soup.findAll("div", {"class": "_8s3ctt"})

    return listings

A minor disadvantage of scraping is that the aforementioned identifier is only temporary, since Airbnb can alter it in any upcoming release.
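Since the class name can change, it is worth sanity-checking how many listings a page actually returns before scraping at scale. A minimal usage sketch, with a placeholder search URL:

listings = extract_listing('https://www.airbnb.com/s/Some-Destination/homes')  # placeholder URL
print(len(listings))  # a full search page should yield around 20 listings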

Finally, let us get to the core of the article: data extraction.

Scraping Listings’ Basic Features

We can get high-level data about the listings from the search page, such as their name, total price, average rating, and so on.

All of these features live in different HTML objects with different classes. As a result, we could write a separate extraction for each feature:

name = soup.find('div', {'class':'_hxt6u1e'}).get('aria-label')
price = soup.find('span', {'class':'_1p7iugi'}).get_text()
...

Instead, you can build a single extraction function that can be reused to access various components of the page:

def extract_element_data(soup, params):
    """Extracts data from a specified HTML element"""

    # 1. Find the right tag
    if 'class' in params:
        elements_found = soup.find_all(params['tag'], params['class'])
    else:
        elements_found = soup.find_all(params['tag'])

    # 2. Extract text from these tags
    if 'get' in params:
        element_texts = [el.get(params['get']) for el in elements_found]
    else:
        element_texts = [el.get_text() for el in elements_found]

    # 3. Select a particular text or concatenate all of them
    tag_order = params.get('order', 0)
    if tag_order == -1:
        output = '**__**'.join(element_texts)
    else:
        output = element_texts[tag_order]

    return output

We now have everything we need to process an entire page of listings and extract general details from every one of them. The whole code can be found in a git repo; here is an example that extracts only two features:

RULES_SEARCH_PAGE = {
    'name': {'tag': 'div', 'class': '_hxt6u1e', 'get': 'aria-label'},
    'rooms': {'tag': 'div', 'class': '_kqh46o', 'order': 0},
}

listing_soups = extract_listing(page_url)

features_list = []
for listing in listing_soups:
    features_dict = {}
    for feature in RULES_SEARCH_PAGE:
        features_dict[feature] = extract_element_data(listing, RULES_SEARCH_PAGE[feature])
    features_list.append(features_dict)
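At this point it can be handy to inspect what was collected; a minimal sketch that turns the list of dictionaries into a table (pandas is also used later in the full script):

import pandas as pd

df = pd.DataFrame(features_list)
print(df.head())
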
Accessing Every Page Per Location

When it comes to statistics, more data is better. Airbnb gives us access to up to 300 listings per location, and we’re going to harvest them all. There are several options for paginating through the search results.

Simply adding the items_offset parameter to our original URL will suffice. Let’s build a list of all the links for each location:

def build_urls(url, listings_per_page=20, pages_per_location=15):
    """Builds links for all search pages for a given location"""

    url_list = []
    for i in range(pages_per_location):
        offset = listings_per_page * i
        url_pagination = url + f'&items_offset={offset}'
        url_list.append(url_pagination)

    return url_list

Half of the work is now complete. We can execute our parser and obtain the essential features for all listings in a given location; all we have to do is supply the starting URL.
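Here is a minimal usage sketch that ties these pieces together; the starting URL is a placeholder, and the try/except guard (mirroring the full script below) handles features missing from a listing:

start_url = 'https://www.airbnb.com/s/Some-Destination/homes?checkin=2021-04-06&checkout=2021-04-13'  # placeholder

features_list = []
for page_url in build_urls(start_url):
    for listing in extract_listing(page_url):
        features_dict = {}
        for feature in RULES_SEARCH_PAGE:
            try:
                features_dict[feature] = extract_element_data(listing, RULES_SEARCH_PAGE[feature])
            except:
                features_dict[feature] = 'empty'  # feature not present on this listing
        features_list.append(features_dict)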

Dynamic Pages

You may have noticed that loading a detail page takes a while, around 3–4 seconds. Until the page has fully loaded, we can only access its basic HTML, which doesn’t include most of the listing features we want to scrape.

Unfortunately, the requests package doesn’t let us wait until all of the page elements have loaded. Selenium, however, can handle this task: it can imitate real-world user behaviour such as waiting for all of the JavaScript to load, scrolling, clicking buttons, filling out forms, and so on.

Waiting and clicking are exactly what we need here: you must click on the corresponding elements to obtain the amenities and price details.

To sum it up, our current actions are as follows:

  • Set up the Selenium driver.
  • Go to the detail page.
  • Wait until all of the buttons have been loaded.
  • Click the buttons.
  • Wait a few moments for all of the elements to load.
  • Extract the HTML code.

Let us implement this as a Python function:

def extract_soup_js(listing_url, waiting_time=[5, 1]):
    """Extracts HTML from JS pages: open, wait, click, wait, extract"""

    options = Options()
    options.add_argument('--headless')
    options.add_argument('--no-sandbox')
    driver = webdriver.Chrome(options=options)

    driver.get(listing_url)
    time.sleep(waiting_time[0])

    try:
        driver.find_element_by_class_name('_13e0raay').click()
    except:
        pass  # amenities button not found
    try:
        driver.find_element_by_class_name('_gby1jkw').click()
    except:
        pass  # prices button not found

    time.sleep(waiting_time[1])
    detail_page = driver.page_source

    driver.quit()

    return BeautifulSoup(detail_page, features='html.parser')

Retrieving detailed listing information is now only a matter of time, since we have all of the essential pieces at hand. We must investigate the page thoroughly with the Chrome developer tools, note down the names and classes of the HTML tags, feed all of this to extract_element_data, and enjoy the results.
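As a small illustration, here is how those pieces could be combined for a single detail page; the listing URL is a placeholder, and the class name is the one used for the nightly price elsewhere in this article:

detail_url = 'https://www.airbnb.com/rooms/12345'  # placeholder listing URL
detail_soup = extract_soup_js(detail_url)
price_per_night = extract_element_data(detail_soup, {'tag': 'div', 'class': '_ymq6as'})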

But… the process is very slow.

Parallel Execution

Because we don’t have to wait for JavaScript elements to load, scraping all 15 result pages for each location is quite quick. After a few seconds, we have a dataset with the core features of all listings, including the URLs of their detail pages.

Processing one detail page takes at least 5–6 seconds: time for the page to render plus some time for the script to run. At the same time, the entire operation only uses about 3–8% of the CPU on my laptop. Let’s make better use of it!

Rather than accessing 300 webpages in a single loop, we can divide the URLs into batches and process them in parallel. We’ll have to experiment a little to figure out the best batch size.

from multiprocessing import Pool

with Pool(8) as pool:
    result = pool.map(scrape_detail_page, url_list)

Keep in mind that each Chrome instance still needs around 5 seconds to load all the elements on a page, so launching too many processes at once can stall the scraping or even make it fail.
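A minimal sketch of batched parallel processing, assuming url_list holds the detail page URLs and scrape_detail_page is defined as in the full script below; the pool size and batch size are just starting points to experiment with:

from multiprocessing import Pool

def chunks(items, size):
    """Yields consecutive batches of a given size."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

if __name__ == '__main__':
    results = []
    with Pool(8) as pool:
        for batch in chunks(url_list, 40):  # process the URLs in batches of 40
            results.extend(pool.map(scrape_detail_page, batch))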

Results
[Screenshot: a sample of the scraped data]

After wrapping all the functions into a single script, the complete code looks like this:

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver import ActionChains

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

import json
import time

import os

import pandas as pd

from multiprocessing import Pool

MAYRHOFEN_LINK = 'https://www.airbnb.com/s/Mayrhofen--Austria/homes?query=Mayrhofen%2C%20Austria&checkin=2021-04-06&checkout=2021-04-13&adults=4'

RULES_SEARCH_PAGE = {
    'url': {'tag': 'a', 'get': 'href'},
    'name': {'tag': 'div', 'class': '_hxt6u1e', 'get': 'aria-label'},
    'name_alt': {'tag': 'a', 'get': 'aria-label'},
    'header': {'tag': 'div', 'class': '_b14dlit'},
    'rooms': {'tag': 'div', 'class': '_kqh46o'},
    'facilities': {'tag': 'div', 'class': '_kqh46o', 'order': 1},
    'badge': {'tag': 'div', 'class': '_17bkx6k'},
    'rating_n_reviews': {'tag': 'span', 'class': '_18khxk1'},
    'price': {'tag': 'span', 'class': '_1p7iugi'},
    'price_alt': {'tag': 'span', 'class': '_olc9rf0'},
    'superhost': {'tag': 'div', 'class': '_ufoy4t'},
}

RULES_DETAIL_PAGE = {
    'location': {'tag': 'span', 'class': '_jfp88qr'},
    
    'specialties_1': {'tag': 'div', 'class': 't1bchdij', 'order': -1},
    'specialties_2': {'tag': 'div', 'class': '_1qsawv5', 'order': -1},

    'price_per_night': {'tag': 'div', 'class': '_ymq6as'},
    
    'refundables': {'tag': 'div', 'class': '_cexc0g', 'order': -1},
        
    'prices_1': {'tag': 'li', 'class': '_ryvszj', 'order': -1},
    'prices_2': {'tag': 'li', 'class': '_adhikmk', 'order': -1},
    
    'listing_ratings': {'tag': 'span', 'class': '_4oybiu', 'order': -1},
    
    'host_joined': {'tag': 'div', 'class': '_1fg5h8r', 'order': 1},
    'host_feats': {'tag': 'span', 'class': '_pog3hg', 'order': -1},
    
    'lang_responses': {'tag': 'li', 'class': '_1q2lt74', 'order': -1},
    'house_rules': {'tag': 'div', 'class': '_u827kd', 'order': -1},
}


def extract_listings(page_url, attempts=10):
    """Extracts all listings from a given page"""
    
    listings_max = 0
    listings_out = [BeautifulSoup('', features='html.parser')]
    for idx in range(attempts):
        try:
            answer = requests.get(page_url, timeout=5)
            content = answer.content
            soup = BeautifulSoup(content, features='html.parser')
            listings = soup.findAll("div", {"class": "_gig1e7"})
        except:
            # if no response - return a list with an empty soup
            listings = [BeautifulSoup('', features='html.parser')]

        if len(listings) == 20:
            listings_out = listings
            break

        if len(listings) >= listings_max:
            listings_max = len(listings)
            listings_out = listings

    return listings_out
        
        
def extract_element_data(soup, params):
    """Extracts data from a specified HTML element"""
    
    # 1. Find the right tag
    if 'class' in params:
        elements_found = soup.find_all(params['tag'], params['class'])
    else:
        elements_found = soup.find_all(params['tag'])
        
    # 2. Extract text from these tags
    if 'get' in params:
        element_texts = [el.get(params['get']) for el in elements_found]
    else:
        element_texts = [el.get_text() for el in elements_found]
        
    # 3. Select a particular text or concatenate all of them
    tag_order = params.get('order', 0)
    if tag_order == -1:
        output = '**__**'.join(element_texts)
    else:
        output = element_texts[tag_order]
    
    return output


def extract_listing_features(soup, rules):
    """Extracts all features from the listing"""
    features_dict = {}
    for feature in rules:
        try:
            features_dict[feature] = extract_element_data(soup, rules[feature])
        except:
            features_dict[feature] = 'empty'

    return features_dict


def extract_soup_js(listing_url, waiting_time=[20, 1]):
    """Extracts HTML from JS pages: open, wait, click, wait, extract"""

    options = Options()
    options.add_argument('--headless')
    options.add_argument('--blink-settings=imagesEnabled=false')
    driver = webdriver.Chrome(options=options)

    # if the URL is not valid - return an empty soup
    try:
        driver.get(listing_url)
    except:
        print(f"Wrong URL: {listing_url}")
        return BeautifulSoup('', features='html.parser')

    # waiting for an element on the bottom of the page to load ("More places to stay")
    try:
        myElem = WebDriverWait(driver, waiting_time[0]).until(
            EC.presence_of_element_located((By.CLASS_NAME, '_4971jm')))
    except:
        pass

    # click cookie policy
    try:
        driver.find_element_by_xpath(
            "/html/body/div[6]/div/div/div[1]/section/footer/div[2]/button").click()
    except:
        pass

    # alternative click cookie policy
    try:
        element = driver.find_element_by_xpath("//*[@data-testid='main-cookies-banner-container']")
        element.find_element_by_xpath("//button[@data-testid='accept-btn']").click()
    except:
        pass

    # looking for price details
    price_dropdown = 0
    try:
        element = driver.find_element_by_class_name('_gby1jkw')
        price_dropdown = 1
    except:
        pass

    # if the element is present - click on it
    if price_dropdown == 1:
        for i in range(10):  # 10 attempts to scroll to the price button
            try:
                actions = ActionChains(driver)
                driver.execute_script("arguments[0].scrollIntoView(true);", element)
                actions.move_to_element_with_offset(element, 5, 5)
                actions.click().perform()
                break
            except:
                pass

    # looking for amenities
    driver.execute_script("window.scrollTo(0, 0);")
    try:
        driver.find_element_by_class_name('_13e0raay').click()
    except:
        pass  # amenities button not found

    time.sleep(waiting_time[1])
    detail_page = driver.page_source

    driver.quit()

    return BeautifulSoup(detail_page, features='html.parser')


def scrape_detail_page(base_features):
    """Scrapes the detail page and merges the result with basic features"""

    detailed_url = 'https://www.airbnb.com' + base_features['url']
    soup_detail = extract_soup_js(detailed_url)

    features_detailed = extract_listing_features(soup_detail, RULES_DETAIL_PAGE)
    features_amenities = extract_amenities(soup_detail)
    features_detailed['amenities'] = features_amenities

    features_all = {**base_features, **features_detailed}

    return features_all


def extract_amenities(soup):

    amenities = soup.find_all('div', {'class': '_aujnou'})

    amenities_dict = {}
    for amenity in amenities:
        header = amenity.find('div', {'class': '_1crk6cd'}).get_text()
        values = amenity.find_all('div', {'class': '_1dotkqq'})
        values = [v.find(text=True) for v in values]

        amenities_dict['amenity_' + header] = values

    return json.dumps(amenities_dict)


class Parser:
    def __init__(self, link, out_file):
        self.link = link
        self.out_file = out_file

    def build_urls(self, listings_per_page=20, pages_per_location=15):
        """Builds links for all search pages for a given location"""

        url_list = []
        for i in range(pages_per_location):
            offset = listings_per_page * i
            url_pagination = self.link + f'&items_offset={offset}'
            url_list.append(url_pagination)

        self.url_list = url_list

    def process_search_pages(self):
        """Extract features from all search pages"""

        features_list = []
        for page in self.url_list:
            listings = extract_listings(page)
            for listing in listings:
                features = extract_listing_features(listing, RULES_SEARCH_PAGE)
                features['sp_url'] = page
                features_list.append(features)

        self.base_features_list = features_list

    def process_detail_pages(self):
        """Runs detail pages processing in parallel"""

        n_pools = os.cpu_count() // 2

        with Pool(n_pools) as pool:
            result = pool.map(scrape_detail_page, self.base_features_list)
        pool.close()
        pool.join()

        self.all_features_list = result

    def save(self, feature_set='all'):
        if feature_set == 'basic':
            pd.DataFrame(self.base_features_list).to_csv(self.out_file, index=False)
        elif feature_set == 'all':
            pd.DataFrame(self.all_features_list).to_csv(self.out_file, index=False)
        else:
            pass

    def parse(self):
        self.build_urls()
        self.process_search_pages()
        self.process_detail_pages()
        self.save('all')


if __name__ == "__main__":
    new_parser = Parser(MAYRHOFEN_LINK, './test.csv')
    t0 = time.time()
    new_parser.parse()
    print(time.time() - t0)

Interacting with actual data has the disadvantage of being imperfect: many fields need cleaning and preprocessing, and there are “empty” columns. Some features turned out to be useless because they are either blank or always contain the same values.

We could also improve the script. It might run faster with alternative parallelization approaches, and investigating page loading times would leave fewer empty columns.

Recap

Here, we learned how to:

  • Scrape web pages using Python and BeautifulSoup.
  • Deal with dynamic pages using Selenium.
  • Parallelize the code with multiprocessing.

We can assist you in scraping Airbnb websites using Python, BeautifulSoup, and Selenium.

For any queries, feel free to contact X-Byte Enterprise Crawling.