How Are Python and BeautifulSoup Used to Scrape Top Insurance Companies' Data?

Why Do You Need Web Scraping?

It all begins with data, the collection of facts. Businesses and organizations need data for market research. This data can be gathered through interviews, observations, surveys, and questionnaires, as well as from government archives and the Internet.

Web scraping is a technique for extracting large amounts of relevant data from websites and saving it to a file or database. The scraped data is usually in tabular or spreadsheet format (e.g., a CSV file).

In this blog, we’ll scrape the website value.today as our web scraping project.

Here is an overview of the steps we will follow:

  • Download the webpage using requests.
  • Parse the HTML source code using beautifulsoup4.
  • Extract company names, CEOs, global rankings, market capitalization, annual revenue, employee count, and company URLs.
  • Using Pandas, compile the data and generate a CSV file.

How to Perform Web scraping?

Python is a fantastic language that provides packages such as Beautiful Soup, Requests, and Pandas, which are used to extract data from HTML code and transform it into various formats (CSV, XML, JSON) depending on the application.

HTML: HTML (Hypertext Markup Language) is the code used to structure a website and its content. It consists of tags that specify how a web browser should format and display the content.

BeautifulSoup is a Python library that extracts data from HTML and XML files.

Requests is the de facto standard Python library for making HTTP requests.

HTTP is a protocol that is used to retrieve resources such as HTML documents.

Let us extract the web page of the top insurance companies by market capitalization.

At the end of the project, we will create a CSV file in the below format:

companies_name,CEOs_name,world_ranks,market_capitalizations_in_billion_dollars,annual_revenues_in_million_dollars,number_of_employees,companies_URLs
BERKSHIRE HATHAWAY,Warren Buffett,8,543.68,286260.0,391500.0,https://www.berkshirehathaway.com/
UNITEDHEALTH GROUP,David S. Wichmann,18,332.73,255630.0,320000.0,https://www.unitedhealthgroup.com/
BANK OF AMERICA CORPORATION,Brian Moynihan,20,262.2,85530.0,208000.0,https://www.bankofamerica.com/
WELLS FARGO & COMPANY,Charles W. Scharf,65,124.78,72340.0,258700.0,https://www.wellsfargo.com/
AIA GROUP,Lee Yuan Siong,91,152.33,50360.0,23000.0,http://www.aia.com/

Download the Webpage using requests

We’ll use the requests Python library to download the web page.

Let’s get started with installing and importing requests.

!pip install requests --upgrade --quiet
import requests

We can use requests.get to download a web page:

topics_url = 'https://www.value.today/world-top-companies/insurance'
response = requests.get(topics_url)

requests.get returns a response object that contains the data from the web page as well as some additional information. Using response.text, we can access the contents of the web page.

page_content = response.text

page_content[:1000]
'<!DOCTYPE html>\n<html lang="en" dir="ltr" prefix="content: http://purl.org/rss/1.0/modules/content/  dc: http://purl.org/dc/terms/  foaf: http://xmlns.com/foaf/0.1/  og: http://ogp.me/ns#  rdfs: http://www.w3.org/2000/01/rdf-schema#  schema: http://schema.org/  sioc: http://rdfs.org/sioc/ns#  sioct: http://rdfs.org/sioc/types#  skos: http://www.w3.org/2004/02/skos/core#  xsd: http://www.w3.org/2001/XMLSchema# ">\n<head>\n<meta charset="utf-8"/>\n<script async src="//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>\n<script>(adsbygoogle=window.adsbygoogle||[]).push({google_ad_client:"ca-pub-2407955258669770",enable_page_level_ads:true});</script><script>window.google_analytics_uacct="UA-121331115-1";(function(i,s,o,g,r,a,m){i["GoogleAnalyticsObject"]=r;i[r]=i[r]||function(){(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)})(window,document,"script","https://www'

Using requests, we successfully fetched the web page. The cell above, page_content[:1000], shows the first 1,000 characters of value.today's HTML. We can also save the page to a file and view it locally within Jupyter by selecting “File > Open.”

with open('world-insurance.html','w',encoding = "utf-8") as file:
    file.write(page_content)

This saved page will look similar to the original page.
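
As a quick check, we can read the saved file back and confirm it matches what we downloaded (this should evaluate to True if the write succeeded):

# Read the saved HTML back and compare it with the downloaded content
with open('world-insurance.html', 'r', encoding='utf-8') as file:
    saved_content = file.read()

saved_content == page_content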

Parse the HTML source code using beautifulsoup4

To parse the HTML source code of the web page downloaded in the previous section, we’ll use the Beautiful Soup Python library. We’ll also add a helper function.

!pip install beautifulsoup4 --upgrade --quiet
from bs4 import BeautifulSoup
doc = BeautifulSoup(response.text, 'html.parser')

Once the page has been parsed, we can use doc to retrieve data from it.

type(doc)
bs4.BeautifulSoup
doc.find('title')
<title>World Top Insurance Companies by Market Value as on 2021</title>

As shown above, we were able to retrieve the title of the web page.
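
If we only need the text inside the tag rather than the whole element, every BeautifulSoup tag exposes a .text attribute:

# Extract just the text content of the <title> tag
doc.find('title').text
'World Top Insurance Companies by Market Value as on 2021'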

Let’s make a reusable helper function, get_page, which can download a web page and generate a Beautiful Soup doc for any given URL.

def get_page(url):
    """Download a web page and return a beautiful soup doc"""
    # Download the webpage
    response = requests.get(url)
    
    # Check if the download was successful
    if response.status_code != 200:
        raise Exception('Unable to download page {}'.format(url))
    
    # Get the page HTML
    page_contents = response.text
    
    # Create a bs4 doc
    doc = BeautifulSoup(page_contents, 'html.parser')
    return doc

We can use this function to download any web page from its URL. response.status_code returns the HTTP status code of the request; a successful response has a status code in the range 200–299. response.text returns the response’s HTML content. The 'html.parser' argument selects Python’s built-in HTML parser, which Beautiful Soup uses to parse the HTML while web scraping.
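
For example, we can recreate the parsed document for our target page with a single call:

# Download and parse the insurance companies page in one step
doc = get_page(topics_url)
doc.find('title')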

Extract Company Names, CEOs, World Ranks, Market Capitalization, Annual Revenue, Number of Employees, Company URLs, etc.

All of the information can be retrieved from the li tags of the web page.

Let’s create a variable that holds all li tags with the class 'row well clearfix'.

company_block = doc.find_all('li',class_='row well clearfix')
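
As a quick sanity check, we can count how many blocks were found; in our run each page lists ten companies:

# Each li tag with class 'row well clearfix' corresponds to one company entry
len(company_block)
10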

Extracting Company Names

Let’s write a helper function that extracts the company names from the web page.

def name_of_companies(company_block):
    company_names = []
    for tag in company_block:
        c_name = tag.find('div',class_='field field--name-node-title field--type-ds field--label-hidden field--item')

        company_names.append(c_name.find('a').text)
    return company_names

Let us check the function name_of_companies

#Let's check the function
name_of_companies(company_block)
['BERKSHIRE HATHAWAY',
 'UNITEDHEALTH GROUP',
 'BANK OF AMERICA CORPORATION',
 'WELLS FARGO & COMPANY',
 'AIA GROUP',
 'ROYAL BANK OF CANADA',
 'PING AN INSURANCE (GROUP) COMPANY OF CHINA',
 'BANK OF CHINA',
 'STATE FARM',
 'TORONTO-DOMINION BANK']

Extracting CEO Names

A helper function for extracting CEO names from the web page.

def name_of_CEOs(company_block):
    CEO_names = []
    for tag in company_block:
        names = tag.find('div',class_='clearfix col-sm-12 field field--name-field-ceo field--type-entity-reference field--label-above')
        try:
            ceo = names.find('a').text
            CEO_names.append(ceo)
        except AttributeError:
            CEO_names.append(None)
    return CEO_names

Let Us Check the Function name_of_CEOs

# Let's call the function
name_of_CEOs(company_block)
['Warren Buffett',
 'David S. Wichmann',
 'Brian Moynihan',
 'Charles W. Scharf',
 'Lee Yuan Siong',
 'David I. McKay',
 'Ma Mingzhe',
 'Gao Yingxin',
 None,
 'Bharat Masrani']

Extracting World Ranks

A helper function for obtaining World Ranks from a web page.

def ranks_of_world(company_block):
    world_ranks = []

    for tag in company_block:
        rank = tag.find('div', class_='clearfix col-sm-6 field field--name-field-world-rank-sep-01-2021- field--type-integer field--label-above')

        world_ranks.append(rank.find('div',class_='field--item').text)
    return world_ranks

Let us Check the Function ranks_of_world

ranks_of_world(company_block)
['8', '18', '20', '65', '91', '93', '94', '111', '131', '133']
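
Note that the ranks come back as strings because we scraped the tag text; further down the list they also contain thousands separators (e.g., '36,479'). If integers are needed, one option is to strip the commas and cast:

# Convert the scraped rank strings to integers, removing thousands separators
[int(rank.replace(',', '')) for rank in ranks_of_world(company_block)]
[8, 18, 20, 65, 91, 93, 94, 111, 131, 133]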

Extracting Market Capitalization

A helper function for obtaining Market Capitalization from a website.

def market_caps(company_block):
    market_capitalization_in_dollars = []

    for tag in company_block:
        market_cap = tag.find('div',class_='clearfix col-sm-6 field field--name-field-market-value-jan012021 field--type-float field--label-above')
        try:
            caps = market_cap.find('div',class_='field--item').text
            replace_caps = caps.replace(' Billion USD',"")
            market_capitalization_in_dollars.append(float(replace_caps))
        except AttributeError:
            market_capitalization_in_dollars.append(None)
    return market_capitalization_in_dollars

Let us Check the Function market_caps

# Let's call the function
market_caps(company_block)
[543.68, 332.73, 262.2, 124.78, 152.33, 116.72, 233.34, 129.25, None, 102.4]
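
Notice that market_caps returns None where the market capitalization is missing (State Farm in our run), so any aggregation must skip the None entries. For example, a rough total for this page:

# Sum the market caps on this page, skipping missing values
sum(cap for cap in market_caps(company_block) if cap is not None)

In our run this comes to roughly 1997.43 billion USD.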

Extracting Annual Revenue

A helper function for obtaining Annual Revenue from a web page.

def annual_rev(company_block):
    annual_revenue_in_dollars = []

    for tag in company_block:
        annual_revenue = tag.find('div',class_='clearfix col-sm-12 field field--name-field-revenue-in-usd field--type-float field--label-inline')
        try:
            revenue = annual_revenue.find('div',class_='field--item').text
            replace_string = revenue.replace(',',"").replace(' Million USD',"")
            annual_revenue_in_dollars.append(int(replace_string))
        except AttributeError:
            annual_revenue_in_dollars.append(None)
    return annual_revenue_in_dollars

Let us Check the Function annual_rev

# Let's call the function
annual_rev(company_block)
[286260, 255630, 85530, 72340, 50360, 37367, 166950, 82215, 81730, 34568]

Extracting Number of Employees

A helper function for extracting the number of employees from a web page.

def employees(company_block):
    no_of_employees = []

    for tag in company_block:
        employee = tag.find('div',class_='clearfix col-sm-12 field field--name-field-employee-count field--type-integer field--label-inline')
        try:
            n_employee = employee.find('div',class_='field--item').text
            replace_string = n_employee.replace(',',"")
            no_of_employees.append(int(replace_string))
        except AttributeError:
            no_of_employees.append(None)
    return no_of_employees

Let us Check the Function employees

# Let's call the function
employees(company_block)
[391500, 320000, 208000, 258700, 23000, 83842, 376900, 309384, 59000, 89598]

Extracting the Company URLs

A helper function for extracting the URLs of the Company from a web page.

def extract_urls(company_block):    
    company_urls = []

    for tag in company_block:
        c_url = tag.find('div',class_='clearfix col-sm-12 field field--name-field-company-website field--type-link field--label-above')
        try:

            company_urls.append(c_url.find('a')['href'])
        except AttributeError:
            company_urls.append(None)
    return company_urls

Check the Function extract_urls

extract_urls(company_block)
['https://www.berkshirehathaway.com/',
 'https://www.unitedhealthgroup.com/',
 'https://www.bankofamerica.com/',
 'https://www.wellsfargo.com/',
 'http://www.aia.com/',
 'https://www.rbcroyalbank.com',
 'http://www.pingan.com/',
 'https://www.boc.cn/en/',
 'https://www.statefarm.com/',
 'https://www.td.com']

We now have all of the functions we need to extract the required data from a web page. It’s time to write a function that combines all of the helper functions above and collects their output into a dictionary.

Let’s define the scrape_page function, which will loop through all of the pages on value.today (there are 53 pages, indexed 0 to 52).

def scrape_page():
    all_info_dict = {
            'companies_name':[],
            'CEOs_name':[],
            'world_ranks':[],
            'market_capitalizations_in_billion_dollars':[],
            'annual_revenues_in_million_dollars':[],
            'number_of_employees':[],
            'companies_URLs':[]
            }
    for page in range(0, 53):
        url = f"https://www.value.today/world-top-companies/insurance?title=&field_headquarters_of_company_target_id&field_company_category_primary_target_id&field_company_website_uri=&field_market_cap_aug_01_2021__value=&page={page}"
        company_block = get_page(url).find_all('li',class_='row well clearfix')

        all_info_dict['companies_name'] += name_of_companies(company_block)
        all_info_dict['CEOs_name'] += name_of_CEOs(company_block)
        all_info_dict['world_ranks'] += ranks_of_world(company_block)
        all_info_dict['market_capitalizations_in_billion_dollars'] += market_caps(company_block)
        all_info_dict['annual_revenues_in_million_dollars'] += annual_rev(company_block)
        all_info_dict['number_of_employees'] += employees(company_block)
        all_info_dict['companies_URLs'] += extract_urls(company_block)
    return all_info_dict

In the scrape_page function above, we first created an empty dictionary (a structure that stores data in key: value pairs) named all_info_dict, containing an empty list for each of the keys 'companies_name', 'CEOs_name', 'world_ranks', 'market_capitalizations_in_billion_dollars', 'annual_revenues_in_million_dollars', 'number_of_employees', and 'companies_URLs'. These keys hold the values produced by the helper functions defined earlier in the section.

The for loop iterates through all of the website’s pages, finding the li tags on each page and storing them in the variable company_block.

Each list is then extended with the output of the corresponding helper function, so all of the results accumulate in all_info_dict.
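
One caveat: this loop fires 53 requests at the site back to back. A polite refinement (a hypothetical variant, not part of the original script) is to pause briefly before each download:

import time

def get_page_politely(url, delay_seconds=1):
    """Like get_page, but waits before each request.
    The one-second default is an assumption; tune it to the site's tolerance."""
    time.sleep(delay_seconds)
    return get_page(url)

Swapping get_page_politely into scrape_page slows the crawl but reduces the load on the server.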

Compiling the Information and Creating a CSV File Using Pandas

From the dictionary, we’ll create a pandas DataFrame.

Pandas is a Python library that is used to work with data sets.

A DataFrame is a data structure that organises information into a two-dimensional table of rows and columns.

# Create pandas dataframe from dictionary
import pandas as pd
scrape_page_dataframe = pd.DataFrame(scrape_page())
     companies_name  CEOs_name  world_ranks  market_capitalizations_in_billion_dollars  annual_revenues_in_million_dollars  number_of_employees  companies_URLs
0    BERKSHIRE HATHAWAY  Warren Buffett  8  543.680  286260.0  391500.0  https://www.berkshirehathaway.com/
1    UNITEDHEALTH GROUP  David S. Wichmann  18  332.730  255630.0  320000.0  https://www.unitedhealthgroup.com/
2    BANK OF AMERICA CORPORATION  Brian Moynihan  20  262.200  85530.0  208000.0  https://www.bankofamerica.com/
3    WELLS FARGO & COMPANY  Charles W. Scharf  65  124.780  72340.0  258700.0  https://www.wellsfargo.com/
4    AIA GROUP  Lee Yuan Siong  91  152.330  50360.0  23000.0  http://www.aia.com/
...  ...  ...  ...  ...  ...  ...  ...
522  NOVUS AC  None  36,479  0.005  NaN  NaN  None
523  INSR INSURANCE GROUP ASA  None  36,703  0.010  NaN  NaN  None
524  ATLAS FINANCIAL HOLDINGS, INC  None  37,108  NaN  NaN  NaN  None
525  HEALTH REVENUE ASSURANCE HOLDIN  None  37,682  NaN  NaN  NaN  None
526  TRIAD GUARANTY  None  38,342  0.001  NaN  NaN  None
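
Before saving, it is worth inspecting the DataFrame with the standard pandas methods head() and info(), which show the first rows and summarize column types and missing values:

# Preview the first five rows and summarize the columns
scrape_page_dataframe.head()
scrape_page_dataframe.info()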

Saving the Extracted Data into a CSV File:

scrape_page_dataframe.to_csv('scrape_page_dataframe.csv',index=None)
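
As a final sanity check, we can load the CSV back with pandas and confirm its shape (527 rows and 7 columns in our run):

# Reload the saved CSV and verify its dimensions
reloaded = pd.read_csv('scrape_page_dataframe.csv')
reloaded.shape
(527, 7)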

Conclusion

Here’s a brief breakdown of the steps we took to scrape the top insurance companies from value.today:

  • Using requests, we downloaded the webpage.
  • We used beautifulsoup4 to parse the HTML source code of the web page.
  • We extracted company names, CEOs, global rankings, market capitalization, annual revenue, employee count, and company URLs.
  • Using Pandas, we compiled the data and generated a CSV file.

The CSV file we created follows the format shown at the start of this post.

Here is the complete script for the Project:

import requests
from bs4 import BeautifulSoup
import pandas as pd


def get_page(url):
    """Download a web page and return a beautiful soup doc"""
    # Download the webpage
    response = requests.get(url)    
    # Check if the download was successful
    if response.status_code != 200:
        raise Exception('Unable to download page {}'.format(url))    
    # Get the page HTML
    page_contents = response.text    
    # Create a bs4 doc
    doc = BeautifulSoup(page_contents, 'html.parser')
    return doc


def name_of_companies(company_block):
    company_names = []
    for tag in company_block:
        c_name = tag.find('div',class_='field field--name-node-title field--type-ds field--label-hidden field--item')

        company_names.append(c_name.find('a').text)
    return company_names


def name_of_CEOs(company_block):
    CEO_names = []
    for tag in company_block:
        names = tag.find('div',class_='clearfix col-sm-12 field field--name-field-ceo field--type-entity-reference field--label-above')
        try:
            ceo = names.find('a').text
            CEO_names.append(ceo)
        except AttributeError:
            CEO_names.append(None)
    return CEO_names


def ranks_of_world(company_block):
    world_ranks = []

    for tag in company_block:
        rank = tag.find('div', class_='clearfix col-sm-6 field field--name-field-world-rank-sep-01-2021- field--type-integer field--label-above')

        world_ranks.append(rank.find('div',class_='field--item').text)
    return world_ranks


def market_caps(company_block):
    market_capitalization_in_dollars = []

    for tag in company_block:
        market_cap = tag.find('div',class_='clearfix col-sm-6 field field--name-field-market-value-jan012021 field--type-float field--label-above')
        try:
            caps = market_cap.find('div',class_='field--item').text
            replace_caps = caps.replace(' Billion USD',"")
            market_capitalization_in_dollars.append(float(replace_caps))
        except AttributeError:
            market_capitalization_in_dollars.append(None)
    return market_capitalization_in_dollars


def annual_rev(company_block):
    annual_revenue_in_dollars = []

    for tag in company_block:
        annual_revenue = tag.find('div',class_='clearfix col-sm-12 field field--name-field-revenue-in-usd field--type-float field--label-inline')
        try:
            revenue = annual_revenue.find('div',class_='field--item').text
            replace_string = revenue.replace(',',"").replace(' Million USD',"")

            annual_revenue_in_dollars.append(int(replace_string))

        except AttributeError:
            annual_revenue_in_dollars.append(None)
    return annual_revenue_in_dollars


def employees(company_block):
    no_of_employees = []

    for tag in company_block:
        employee = tag.find('div',class_='clearfix col-sm-12 field field--name-field-employee-count field--type-integer field--label-inline')
        try:
            n_employee = employee.find('div',class_='field--item').text
            replace_string = n_employee.replace(',',"")
            no_of_employees.append(int(replace_string))
        except AttributeError:
            no_of_employees.append(None)
    return no_of_employees



def extract_urls(company_block):    
    company_urls = []

    for tag in company_block:
        c_url = tag.find('div',class_='clearfix col-sm-12 field field--name-field-company-website field--type-link field--label-above')
        try:

            company_urls.append(c_url.find('a')['href'])
        except AttributeError:
            company_urls.append(None)
    return company_urls



def scrape_page():
    all_info_dict = {
            'companies_name':[],
            'CEOs_name':[],
            'world_ranks':[],
            'market_capitalizations_in_billion_dollars':[],
            'annual_revenues_in_million_dollars':[],
            'number_of_employees':[],
            'companies_URLs':[]
            }
    for page in range(0, 53):
        url = f"https://www.value.today/world-top-companies/insurance?title=&field_headquarters_of_company_target_id&field_company_category_primary_target_id&field_company_website_uri=&field_market_cap_aug_01_2021__value=&page={page}"
        company_block = get_page(url).find_all('li',class_='row well clearfix')

        all_info_dict['companies_name'] += name_of_companies(company_block)
        all_info_dict['CEOs_name'] += name_of_CEOs(company_block)
        all_info_dict['world_ranks'] += ranks_of_world(company_block)
        all_info_dict['market_capitalizations_in_billion_dollars'] += market_caps(company_block)
        all_info_dict['annual_revenues_in_million_dollars'] += annual_rev(company_block)
        all_info_dict['number_of_employees'] += employees(company_block)
        all_info_dict['companies_URLs'] += extract_urls(company_block)
    return all_info_dict

Looking to extract top insurance companies' data? Contact X-Byte Enterprise Crawling now!