How to Use Python to Scrape Real Estate Website Data and Perform Data Wrangling

In every data science project, one of the most frequently asked questions is where to find the data. The answer is that an enormous amount of data is publicly accessible on the internet, and much of it is free; you only need to extract it and shape it into something useful for your business. Almost any type of business can use this freely available data to drive improvements, and web scraping is the way to collect it.

To demonstrate web scraping in this blog, we will extract data from domain.com.au, a property website. We will collect the price, number of bedrooms, number of bathrooms, number of parking spaces, address, and location of all houses listed in Melbourne, Australia.

Before using Python, you should know some fundamentals of HTML (HyperText Markup Language):

  • All webpages are written in HTML.
  • HTML defines a webpage’s structure.
  • HTML elements label pieces of content such as “this is a heading”, “this is a paragraph”, “this is a link”, and so on.
  • HTML elements tell the browser how to display the content.
  • HTML is the standard markup language for creating webpages.

A simple HTML document looks like this:

<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<h1>This is a Heading</h1>
<p>This is a paragraph.</p>
</body>
</html>

Where:

The <!DOCTYPE html> declaration specifies that this document is written in HTML5.

The <html> element is the root element of an HTML page.

The <head> element contains meta information about the HTML page.

The <title> element specifies a title for the HTML page (shown in the browser’s title bar or in the page’s tab).

The <body> element defines the document’s body and is the container for all visible content such as headings, paragraphs, images, hyperlinks, tables, lists, etc.

The <h1> element defines a large heading.

The <p> element defines a paragraph.

You can view the HTML document of any website by right-clicking anywhere on a webpage and choosing “View page source” (available in Google Chrome and Microsoft Edge). All the content of the webpage sits inside the HTML document in a well-structured format; you just need to extract the required data from it.

1. Collecting Data


Several Python libraries are available for fetching an HTML document and parsing it into the format you need.


# sample code to fetch an HTML document and parse it into the required format
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("https://www.domain.com.au/sale/melbourne-region-vic/")
bsobj = BeautifulSoup(html, "lxml")

In the code above, urlopen fetches the HTML document from the given webpage, and BeautifulSoup parses it using the lxml parser. The lxml parser is fast and forgiving, but you can use another parser if you prefer, such as Python's built-in html.parser.
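For instance, if lxml is not installed, a minimal sketch using the built-in parser (our suggestion, not part of the original code) would be:

# a minimal sketch using Python's built-in parser instead of lxml
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("https://www.domain.com.au/sale/melbourne-region-vic/")
bsobj = BeautifulSoup(html, "html.parser")  # no third-party parser required
print(bsobj.title)  # quick sanity check that the page parsed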

The search page at "https://www.domain.com.au/sale/melbourne-region-vic/" lists all the properties currently for sale in Melbourne.

However, we need the webpage of every Melbourne property listed there. We can do this by extracting all the property URLs on the page and storing them in a list. One more thing to note: the Melbourne search on domain.com.au spans about 50 result pages, and the page above is only the first one, so we have to visit all 50 pages and collect the URLs of every advertised house. That means a loop with 50 iterations, one per page.


from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

# home url of domain.com.au
home_url = "https://www.domain.com.au"
# the search results span 50 pages, so iterate over page numbers 1 to 50
page_numbers = range(1, 51)
# list to store the urls of all properties
list_of_links = []
# loop over all 50 search (melbourne region) pages
for page in page_numbers:

    # fetching the html document of the search page (sorted by price, descending)
    html = urlopen(home_url + "/sale/melbourne-region-vic/?sort=price-desc&page=" + str(page))
    # parsing the html document with the 'lxml' parser
    bsobj = BeautifulSoup(html, "lxml")
    # finding all links inside the 'ul' tag whose 'data-testid' is 'results'
    all_links = bsobj.find("ul", {"data-testid": "results"}).findAll("a", href=re.compile("https://www.domain.com.au/*"))

    # inner loop: some listings are projects, which contain several properties
    # inside their own project page, so those pages must be opened as well
    for link1 in all_links:
        # checking whether the link is a project page and, if so, collecting its inner listings
        if 'project' in link1.attrs['href']:
            inner1_html = urlopen(link1.attrs['href'])
            inner1_bsobj = BeautifulSoup(inner1_html, "lxml")
            for link2 in inner1_bsobj.find("div", {"name": "listing-details__other-listings"}).findAll("a", href=re.compile("https://www.domain.com.au/*")):
                if 'href' in link2.attrs:
                    list_of_links.append(link2.attrs['href'])
        else:
            list_of_links.append(link1.attrs['href'])

You can copy and paste the code above, adjust it to your requirements, and run it.

Here, we did two things differently:

  1. We used a search page sorted by price. We did this so that it becomes easier to impute the missing prices of houses; we explain this in the data wrangling section below.
  2. An inner loop is used because some listings are projects, and each project has more property URLs inside its own page.

Now we have a unique URL for every property in Melbourne, Australia. The next step is to open each URL and extract the price, number of bedrooms, bathrooms and parking spaces, the address, and the location.


# removing duplicate links while maintaining the order of urls
abc_links = []
for i in list_of_links:
    if i not in abc_links:
        abc_links.append(i)

# defining the regular expressions used for data extraction
pattern = re.compile(r'>(.+)(.+?).*')               # feature value (group 1) and label (group 2)
pattern1 = re.compile(r'>(.+)<.')                   # text between tags (price, address)
pattern2 = re.compile(r'destination=(.+)" rel=.')   # latitude,longitude in the map link
basic_feature_list = []
# loop to iterate through each url
for link in abc_links:
    
    # opening urls
    html = urlopen(link)
    
    # converting html document to 'lxml' format
    bsobj = BeautifulSoup(html, "lxml")
    
    # extracting address/name of property
    property_name = bsobj.find("h1", {"class": "css-164r41r"})
    
    # extracting baths, rooms, parking etc
    all_basic_features = bsobj.find("div", {"class": "listing-details__listing-summary-features css-er59q5"}).findAll("span", {"data-testid": "property-features-text-container"})
    
    # extracting property price
    property_price = bsobj.find("div", {"data-testid": "listing-details__summary-title"})
    
    # extracting latitudes and longitudes
    lat_long = bsobj.find("a", {"target": "_blank", 'rel': "noopener noreferer"})
    
    # dictionary to store temporary data
    basic_feature_dict = {}
    
    # some properties do not list all four features (rooms, baths, parking, area),
    # so store whichever features are present
    for feature in all_basic_features:
        matches = pattern.findall(str(feature))
        if matches:
            # matches[0][0] is the feature value, matches[0][1] is its label
            basic_feature_dict[matches[0][1]] = matches[0][0]
    # putting None if the price is missing
    if property_price is None:
        basic_feature_dict['price'] = None
        
    else:
        basic_feature_dict['price'] = pattern1.findall(str(property_price))[0]
        
    # putting None if the property name/address is missing
    if property_name is None:
        basic_feature_dict['name'] = None
        
    else:
        basic_feature_dict['name'] = pattern1.findall(str(property_name))[0]
        
    # putting None if the latitude and longitude are missing
    if lat_long is None:
        basic_feature_dict['lat'] = None
        basic_feature_dict['long'] = None
        
    else:
        basic_feature_dict['lat'] = pattern2.findall(str(lat_long))[0].split(',')[0]
        basic_feature_dict['long'] = pattern2.findall(str(lat_long))[0].split(',')[1]
    # appending the collected data for this property to the list
    basic_feature_list.append(basic_feature_dict)

The output of the code above is a list of dictionaries holding all the scraped data. We now convert it into separate individual lists, because the remaining cleaning and extraction steps are easier to perform on lists.
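The snippets that follow refer to individual lists such as price_list and beds_list. The conversion step is not shown in the original post, so here is a minimal sketch of it; the dictionary keys are assumptions based on the extraction code above (the feature labels come from the page text, while the others were set explicitly):

# a minimal sketch (assumed key names) splitting the list of dictionaries into individual lists
price_list = [d.get('price') for d in basic_feature_list]
name_list = [d.get('name') for d in basic_feature_list]
lat_list = [d.get('lat') for d in basic_feature_list]
long_list = [d.get('long') for d in basic_feature_list]
beds_list = [d.get('Beds') for d in basic_feature_list]        # label assumed from page text
baths_list = [d.get('Baths') for d in basic_feature_list]      # label assumed from page text
parking_list = [d.get('Parking') for d in basic_feature_list]  # label assumed from page text
area_list = [d.get('Area') for d in basic_feature_list]        # label assumed from page text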


import random

# creating a new empty price list
actual_price_list = []
# defining the regular expressions used to extract property prices
pattern1 = re.compile(r'\$\s?([0-9,\.]+).*\s?.+\s?\$\s?([0-9,\.]+)')
pattern2 = re.compile(r'\$([0-9,\.]+)')
# iterating through price_list
for i in range(len(price_list)):

    # check whether a single price or a price range is given

    if str(price_list[i]).count('$') == 1:
        b_num = pattern2.findall(str(price_list[i]))

        # if the digit string is 5 characters or fewer, the price is quoted in
        # millions, so it has to be scaled up
        if len(b_num[0].replace(',', '')) > 5:
            actual_price_list.append(float(b_num[0].replace(',', '')))
        else:
            actual_price_list.append(float(b_num[0].replace(',', ''))*1000000)

    elif str(price_list[i]).count('$') == 2:
        a_num = pattern1.findall(str(price_list[i]))
        random_error = random.randint(0, 10000)

        # same check as above: short digit strings mean the prices are quoted in millions
        if len(a_num[0][0].replace(',', '')) > 5 and len(a_num[0][1].replace(',', '')) > 5:

            # take the average of the two prices in the given range
            avg_price = (float(a_num[0][0].replace(',', '')) + float(a_num[0][1].replace(',', '')))/2
        else:
            avg_price = ((float(a_num[0][0].replace(',', '')) + float(a_num[0][1].replace(',', '')))/2)*1000000

        # nudge the average price by a uniformly generated random amount
        avg_price = avg_price + random_error
        actual_price_list.append(avg_price)
    else:
        # no price given: insert a placeholder to be imputed later
        actual_price_list.append('middle_price')

We now have all the information in list form.

2. Data Wrangling


Some sellers do not want to reveal their property’s price, so they leave it out of the advertisement. Sometimes the price field is empty, and sometimes it contains text like ‘price on inspection’ or ‘contact agent’. Other sellers give a price range instead of a single price, or surround the price with extra text before it, after it, or both. The code above handles all of these cases: it extracts the price when one is given and inserts a placeholder when it is not.
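To illustrate, here is how the two price patterns defined earlier behave on a few made-up price strings (the sample strings are our own, not taken from the site):

# illustrative check of the two price patterns on assumed sample strings
import re
pattern1 = re.compile(r'\$\s?([0-9,\.]+).*\s?.+\s?\$\s?([0-9,\.]+)')  # price range
pattern2 = re.compile(r'\$([0-9,\.]+)')                               # single price
print(pattern2.findall("$1,250,000"))     # ['1,250,000'] -> single full price
print(pattern1.findall("$1.1m - $1.2m"))  # [('1.1', '1.2')] -> range quoted in millions
print(pattern2.findall("Contact agent"))  # [] -> no price, a placeholder is stored instead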

Because many people do not publish their house price on the website, there can be a lot of missing prices. We now have to impute them, and we used a trick for that.

The trick is that we sorted the search results by price, so the houses with a displayed price and those without one are sorted together. The website performs this sorting using the price the owner supplied to the site, even when that price is not shown to users. That is exactly why we scraped the listings with the results sorted by price.

An example makes this clear. Suppose there are 10 houses sorted by price and the prices of houses 4 and 5 are missing. Since the list is sorted, we can take the mean of the prices of houses 3 and 6 and impute the missing values with it. The code below does essentially this:


# loop to impute missing values at the start of the list, where there is no
# left-hand neighbour to take a mean with; use the first known price instead
for i in range(len(actual_price_list)):
    if actual_price_list[i] != 'middle_price':
        for a in range(i, -1, -1):
            actual_price_list[a] = actual_price_list[i]
        break

# for the remaining gaps: take the mean of the neighbouring known prices, then
# drift it by a uniformly generated random error (scaled to the price level) and impute
for i in range(len(actual_price_list)):
    if actual_price_list[i] == 'middle_price':
        for j in range(i, len(actual_price_list)):
            if actual_price_list[j] != 'middle_price':
                mid = (actual_price_list[i-1] + actual_price_list[j])/2
                if actual_price_list[j] > 12000000:
                    for k in range(i, j):
                        random_error = random.randint(-1000000, 1000000)
                        mid = mid + random_error
                        actual_price_list[k] = mid
                    break
                elif actual_price_list[j] > 5000000:
                    for k in range(i, j):
                        random_error = random.randint(-100000, 100000)
                        mid = mid + random_error
                        actual_price_list[k] = mid
                    break
                else:
                    for k in range(i, j):
                        random_error = random.randint(-10000, 10000)
                        mid = mid + random_error
                        actual_price_list[k] = mid
                    break
            elif j == len(actual_price_list)-1:
                # no known price to the right: reuse the last known price with a small error
                for n in range(i, len(actual_price_list)):
                    random_error = random.randint(-1000, 1000)
                    a_price = actual_price_list[i-1]
                    a_price = a_price + random_error
                    actual_price_list[n] = a_price
                break

Making the DataFrame:


import pandas as pd
house_dict = {}
house_dict['Beds'] = beds_list
house_dict['Baths'] = baths_list
house_dict['Parking'] = parking_list
house_dict['Area'] = area_list
house_dict['Address'] = name_list
house_dict['Latitude'] = lat_list
house_dict['Longitude'] = long_list
house_dict['Price'] = actual_price_list
house_df = pd.DataFrame(house_dict)
house_df.info()

The ‘Area’ column has many null values that cannot be imputed, so we drop it.


house_df.drop('Area', axis=1, inplace=True)

Next, convert the Beds, Baths, and Parking columns from strings to numeric types.


house_df["Beds"] = pd.to_numeric(house_df["Beds"])
house_df["Baths"] = pd.to_numeric(house_df["Baths"])
house_df["Parking"] = pd.to_numeric(house_df["Parking"])

Now run some descriptive analytics to find problems in the data and fix them. For instance, use a scatter plot to check for outliers, or a histogram to inspect the data distribution.


# scatter plot
house_df.plot.scatter(x='Beds',y='Baths')
# histogram
house_df["Price"].plot.hist(bins = 50)

Data cleansing is an iterative process. Its first step is data auditing, in which we identify the kinds of anomalies that reduce data quality. Data auditing means programmatically checking the data against pre-specified validation rules and producing a report on data quality and its problems; a few statistical tests are often applied at this stage as well.
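As a minimal sketch of such an audit (the validation rules below are our own assumptions, not a standard set):

# a minimal data-audit sketch; the validation rules are assumptions for illustration
report = {
    "rows": len(house_df),
    "duplicate_rows": int(house_df.duplicated().sum()),
    "missing_per_column": house_df.isna().sum().to_dict(),
    "non_positive_prices": int((house_df["Price"] <= 0).sum()),
    "implausible_bed_counts": int((house_df["Beds"] > 20).sum()),
}
print(report)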

Data anomalies can be broadly classified into three groups:

1. Syntactic Anomalies

These concern the values and formats used to represent the entities. Syntactic anomalies include lexical errors, domain format errors, and other irregularities.

2. Semantic Anomalies

Semantic anomalies prevent the data collection from being a comprehensive, non-redundant representation of the mini-world. They include contradictions, integrity constraint violations, invalid tuples, and duplicates.

3. Coverage Anomalies

Coverage anomalies reduce the number of entities and entity properties from the mini-world that are represented in the data collection. They appear as missing values and missing tuples.

There are many ways to deal with these anomalies; we will not go into the details here because our scraped data does not contain them.

Data can also be transformed to suit the task at hand. Predicting house prices is a regression problem; if we model it with linear regression, we can apply transformations so that the data better satisfies linear regression's assumptions.
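For example, house prices are usually right-skewed, so a log transformation (our choice for illustration, not part of the original workflow) often brings the distribution closer to what linear regression handles well:

# a minimal sketch: log-transforming the skewed Price column (assumed transformation)
import numpy as np
house_df["LogPrice"] = np.log(house_df["Price"])
house_df["LogPrice"].plot.hist(bins=50)  # should look closer to a normal distribution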

We can also create new features from the existing ones to make the data more useful. Here we will add a new column containing each house's distance from the city centre (the code appears after the section on missing values below).

Missing Values

Sources of missing values:

Data Scraping: it is quite possible for something to go wrong during the scraping procedure. In such cases, we should double-check the scraped data against the source with the data guardians. Hashing procedures can also be used to verify that the scraped data is correct. Errors introduced during the scraping stage are generally easy to find and easy to correct.

Data Collection: these errors occur at the time the data is collected and are harder to correct.

Missing values can be further categorized into four types:

Missing Completely at Random: the probability of a value being missing is the same for all observations. For instance, respondents decide whether to declare their earnings by tossing a fair coin: if it lands heads, they declare their earnings; otherwise they do not. Each observation thus has an equal chance of a missing value.

Missing at Random: values are missing randomly, but the missing ratio varies with the values/levels of other input variables. For instance, when collecting age data, females may have more missing values than males.

Missing Depending on Unobserved Predictors: missing values are not random and are related to input variables that were not observed. For instance, in a medical study, if a particular diagnostic test causes discomfort, patients are more likely to drop out of the study. This missingness is not random unless ‘discomfort’ is included as an input variable for all patients.

Missing Depending on the Missing Value Itself: the probability of a value being missing is directly related to the value itself. For instance, people with lower or higher incomes are more likely not to respond to questions about their earnings.

With a little investigation, you can reliably discover which kind of missingness is present in your data. Our scraped data is missing completely at random, and the data set is large, so we simply delete the rows containing any None values.


import math
cleaned_house_df = house_df.dropna(how='any')
cleaned_house_df.reset_index(drop = True, inplace = True)
# radius of the earth in kilometres
r = 6378
dis_to_city = []
for i in range(len(cleaned_house_df)):
    
    lat1_n = math.radians(-37.818078)  # Melbourne city centre latitude
    lat2 = math.radians(float(cleaned_house_df['Latitude'][i]))
    
    lon1_n = math.radians(144.96681)  # Melbourne city centre longitude
    lon2 = math.radians(float(cleaned_house_df['Longitude'][i]))
    
    lon_diff_n = lon2 - lon1_n
    lat_diff_n = lat2 - lat1_n
    
    # haversine formula for great-circle distance
    a_n = math.sin(lat_diff_n / 2)**2 + math.cos(lat1_n) * math.cos(lat2) * math.sin(lon_diff_n / 2)**2
    c_n = 2 * math.atan2(math.sqrt(a_n), math.sqrt(1 - a_n))
    
    dis_to_city.append(round(r*c_n, 4))
    
cleaned_house_df['distance_to_city'] = dis_to_city

The last step is exporting the DataFrame to a tabular file format such as a CSV or an Excel file.


# exporting to csv file
cleaned_house_df.to_csv('real_estate_data_csv.csv', index=False)
# exporting to excel file (requires an Excel writer package such as openpyxl)
cleaned_house_df.to_excel('real_estate_data_excel.xlsx', index=False)
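As a quick sanity check (our own addition, not in the original workflow), the exported CSV can be read back to confirm the round trip:

# sanity check: read the exported file back and compare the shapes
check_df = pd.read_csv('real_estate_data_csv.csv')
print(check_df.shape, cleaned_house_df.shape)  # the shapes should match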

Now that we have the property information stored in a pandas DataFrame, we can plot the features of different houses against each other to understand better how they differ. This can give us good insight into the Melbourne housing market and help us make informed decisions when buying or selling property.