How Are Python and BeautifulSoup Used to Scrape Top Insurance Companies' Data?

Why Do You Need Web Scraping?

It all begins with data, the collection of facts. Businesses and organizations need data for market research. This data can be gathered through interviews, observations, surveys, and questionnaires, as well as from government archives and the Internet.

Web scraping is a technique for extracting large amounts of relevant data from websites and saving it to a file or database. The scraped data is usually in tabular or spreadsheet format (e.g., a CSV file).

In this blog, we’ll scrape the website value.today as our web scraping project.

Here is an overview of the steps we will follow:

  • Download the webpage using requests.
  • Parse the HTML source code using beautifulsoup4.
  • Extract company names, CEOs, global rankings, market capitalization, annual revenue, employee count, and company URLs.
  • Using Pandas, compile the data and generate a CSV file.

How to Perform Web scraping?

Python is a fantastic language that provides packages such as Beautiful Soup, Requests, and Pandas, which are used to extract data from HTML code and transform it into various formats (CSV, XML, JSON) depending on the application.

HTML: HTML (Hypertext Markup Language) is the code used to structure a website and its content. It consists of tags that specify how a web browser should format and display the content.

BeautifulSoup is a Python library that extracts data from HTML and XML files.

Requests is the de facto standard Python library for making HTTP requests.

HTTP is a protocol that is used to retrieve resources such as HTML documents.

Let us extract the web page of the top insurance companies by market capitalization.

At the end of the project, we will create a CSV file in the below format:

companies_name,CEOs_name,world_ranks,market_capitalizations_in_billion_dollars,annual_revenues_in_million_dollars,number_of_employees,companies_URLs
BERKSHIRE HATHAWAY,Warren Buffett,8,543.68,286260.0,391500.0,https://www.berkshirehathaway.com/
UNITEDHEALTH GROUP,David S. Wichmann,18,332.73,255630.0,320000.0,https://www.unitedhealthgroup.com/
BANK OF AMERICA CORPORATION,Brian Moynihan,20,262.2,85530.0,208000.0,https://www.bankofamerica.com/
WELLS FARGO & COMPANY,Charles W. Scharf,65,124.78,72340.0,258700.0,https://www.wellsfargo.com/
AIA GROUP,Lee Yuan Siong,91,152.33,50360.0,23000.0,http://www.aia.com/

Download the Webpage using requests

We’ll use the requests Python library to download the web page.

Let’s get started with installing and importing requests.

!pip install requests --upgrade --quiet
import requests

We can use requests.get to download a web page:

topics_url = 'https://www.value.today/world-top-companies/insurance'
response = requests.get(topics_url)

requests.get returns a response object that contains the data from the web page as well as some additional information. Using response.text, we can access the contents of the web page.

page_content = response.text

page_content[:1000]
'<!DOCTYPE html>\n<html lang="en" dir="ltr" prefix="content: http://purl.org/rss/1.0/modules/content/  dc: http://purl.org/dc/terms/  foaf: http://xmlns.com/foaf/0.1/  og: http://ogp.me/ns#  rdfs: http://www.w3.org/2000/01/rdf-schema#  schema: http://schema.org/  sioc: http://rdfs.org/sioc/ns#  sioct: http://rdfs.org/sioc/types#  skos: http://www.w3.org/2004/02/skos/core#  xsd: http://www.w3.org/2001/XMLSchema# ">\n<head>\n<meta charset="utf-8"/>\n<script async src="//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>\n<script>(adsbygoogle=window.adsbygoogle||[]).push({google_ad_client:"ca-pub-2407955258669770",enable_page_level_ads:true});</script><script>window.google_analytics_uacct="UA-121331115-1";(function(i,s,o,g,r,a,m){i["GoogleAnalyticsObject"]=r;i[r]=i[r]||function(){(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)})(window,document,"script","https://www'

Using requests, we successfully fetched the web page. The cell above, page_content[:1000], shows the first 1,000 characters of value.today's HTML. We can also save the page to a file and view it locally within Jupyter by selecting “File > Open.”

with open('world-insurance.html','w',encoding = "utf-8") as file:
    file.write(page_content)

This saved page will look similar to the original page.
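
As a quick check, we can read the saved file back and confirm it matches what we downloaded (this should evaluate to True if the write succeeded):

# Read the saved HTML back and compare it with the downloaded content
with open('world-insurance.html', 'r', encoding='utf-8') as file:
    saved_content = file.read()

saved_content == page_content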

Parse the HTML source code using beautifulsoup4

To parse the HTML source code of the web page downloaded in the previous section, we’ll use the Beautiful Soup Python library. We’ll also add a helper function.

!pip install beautifulsoup4 --upgrade --quiet
from bs4 import BeautifulSoup
doc = BeautifulSoup(response.text, 'html.parser')

Once the page has been parsed, we can use doc to retrieve data from it.

type(doc)
bs4.BeautifulSoup
doc.find('title')
<title>World Top Insurance Companies by Market Value as on 2021</title>

As shown above, we were able to retrieve the title of the web page.
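
If we only need the text inside the tag rather than the whole element, every BeautifulSoup tag exposes a .text attribute:

# Extract just the text content of the <title> tag
doc.find('title').text
'World Top Insurance Companies by Market Value as on 2021'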

Let’s make a reusable helper function, get_page, which can download a web page and generate a Beautiful Soup doc for any given URL.

def get_page(url):
    """Download a web page and return a beautiful soup doc"""
    # Download the webpage
    response = requests.get(url)
    
    # Check if the download was successful
    if response.status_code != 200:
        raise Exception('Unable to download page {}'.format(url))
    
    # Get the page HTML
    page_contents = response.text
    
    # Create a bs4 doc
    doc = BeautifulSoup(page_contents, 'html.parser')
    return doc

We can use this function to download any web page from its URL. response.status_code returns the HTTP status code of the request; a successful response has a status code in the range 200–299. response.text returns the response’s HTML content. The 'html.parser' argument selects Python’s built-in HTML parser, which Beautiful Soup uses to parse the HTML while web scraping.
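
For example, we can recreate the parsed document for our target page with a single call:

# Download and parse the insurance companies page in one step
doc = get_page(topics_url)
doc.find('title')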

Extract Company Names, CEOs, World Ranks, Market Capitalization, Annual Revenue, Number of Employees, Company URLs, etc.

All of the information can be retrieved from the li tags of the web page.

Let’s create a variable that holds all li tags with the class 'row well clearfix'.

company_block = doc.find_all('li',class_='row well clearfix')
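
As a quick sanity check, we can count how many blocks were found; in our run each page lists ten companies:

# Each li tag with class 'row well clearfix' corresponds to one company entry
len(company_block)
10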

Extracting Company Names

Let’s write a helper function that extracts the company names from the web page.

def name_of_companies(company_block):
    company_names = []
    for tag in company_block:
        c_name = tag.find('div',class_='field field--name-node-title field--type-ds field--label-hidden field--item')

        company_names.append(c_name.find('a').text)
    return company_names

Let us check the function name_of_companies

#Let's check the function
name_of_companies(company_block)
['BERKSHIRE HATHAWAY',
 'UNITEDHEALTH GROUP',
 'BANK OF AMERICA CORPORATION',
 'WELLS FARGO & COMPANY',
 'AIA GROUP',
 'ROYAL BANK OF CANADA',
 'PING AN INSURANCE (GROUP) COMPANY OF CHINA',
 'BANK OF CHINA',
 'STATE FARM',
 'TORONTO-DOMINION BANK']

Extracting CEO Names

A helper function for extracting CEO names from the web page.

def name_of_CEOs(company_block):
    CEO_names = []
    for tag in company_block:
        names = tag.find('div',class_='clearfix col-sm-12 field field--name-field-ceo field--type-entity-reference field--label-above')
        try:
            ceo = names.find('a').text
            CEO_names.append(ceo)
        except AttributeError:
            CEO_names.append(None)
    return CEO_names

Let Us Check the Function name_of_CEOs

# Let's call the function
name_of_CEOs(company_block)
['Warren Buffett',
 'David S. Wichmann',
 'Brian Moynihan',
 'Charles W. Scharf',
 'Lee Yuan Siong',
 'David I. McKay',
 'Ma Mingzhe',
 'Gao Yingxin',
 None,
 'Bharat Masrani']

Extracting World Ranks

A helper function for obtaining World Ranks from a web page.

def ranks_of_world(company_block):
    world_ranks = []

    for tag in company_block:
        rank = tag.find('div', class_='clearfix col-sm-6 field field--name-field-world-rank-sep-01-2021- field--type-integer field--label-above')

        world_ranks.append(rank.find('div',class_='field--item').text)
    return world_ranks

Let us Check the Function ranks_of_world

ranks_of_world(company_block)
['8', '18', '20', '65', '91', '93', '94', '111', '131', '133']
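
Note that the ranks come back as strings because we scraped the tag text; further down the list they also contain thousands separators (e.g., '36,479'). If integers are needed, one option is to strip the commas and cast:

# Convert the scraped rank strings to integers, removing thousands separators
[int(rank.replace(',', '')) for rank in ranks_of_world(company_block)]
[8, 18, 20, 65, 91, 93, 94, 111, 131, 133]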

Extracting Market Capitalization

A helper function for obtaining Market Capitalization from a website.

def market_caps(company_block):
    market_capitalization_in_dollars = []

    for tag in company_block:
        market_cap = tag.find('div',class_='clearfix col-sm-6 field field--name-field-market-value-jan012021 field--type-float field--label-above')
        try:
            caps = market_cap.find('div',class_='field--item').text
            replace_caps = caps.replace(' Billion USD',"")
            market_capitalization_in_dollars.append(float(replace_caps))
        except AttributeError:
            market_capitalization_in_dollars.append(None)
    return market_capitalization_in_dollars

Let us Check the Function market_caps

# Let's call the function
market_caps(company_block)
[543.68, 332.73, 262.2, 124.78, 152.33, 116.72, 233.34, 129.25, None, 102.4]
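
Notice that market_caps returns None where the market capitalization is missing (State Farm in our run), so any aggregation must skip the None entries. For example, a rough total for this page:

# Sum the market caps on this page, skipping missing values
sum(cap for cap in market_caps(company_block) if cap is not None)

In our run this comes to roughly 1997.43 billion USD.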

Extracting Annual Revenue

A helper function for obtaining Annual Revenue from a web page.

def annual_rev(company_block):
    annual_revenue_in_dollars = []

    for tag in company_block:
        annual_revenue = tag.find('div',class_='clearfix col-sm-12 field field--name-field-revenue-in-usd field--type-float field--label-inline')
        try:
            revenue = annual_revenue.find('div',class_='field--item').text
            replace_string = revenue.replace(',',"").replace(' Million USD',"")
            annual_revenue_in_dollars.append(int(replace_string))
        except AttributeError:
            annual_revenue_in_dollars.append(None)
    return annual_revenue_in_dollars

Let us Check the Function annual_rev

# Let's call the function
annual_rev(company_block)
[286260, 255630, 85530, 72340, 50360, 37367, 166950, 82215, 81730, 34568]

Extracting Number of Employees

A helper function for extracting the number of employees from a web page.

def employees(company_block):
    no_of_employees = []

    for tag in company_block:
        employee = tag.find('div',class_='clearfix col-sm-12 field field--name-field-employee-count field--type-integer field--label-inline')
        try:
            n_employee = employee.find('div',class_='field--item').text
            replace_string = n_employee.replace(',',"")
            no_of_employees.append(int(replace_string))
        except AttributeError:
            no_of_employees.append(None)
    return no_of_employees

Let us Check the Function employees

# Let's call the function
employees(company_block)
[391500, 320000, 208000, 258700, 23000, 83842, 376900, 309384, 59000, 89598]

Extracting the Company URLs

A helper function for extracting the URLs of the Company from a web page.

def extract_urls(company_block):    
    company_urls = []

    for tag in company_block:
        c_url = tag.find('div',class_='clearfix col-sm-12 field field--name-field-company-website field--type-link field--label-above')
        try:

            company_urls.append(c_url.find('a')['href'])
        except AttributeError:
            company_urls.append(None)
    return company_urls

Check the Function extract_urls

extract_urls(company_block)
['https://www.berkshirehathaway.com/',
 'https://www.unitedhealthgroup.com/',
 'https://www.bankofamerica.com/',
 'https://www.wellsfargo.com/',
 'http://www.aia.com/',
 'https://www.rbcroyalbank.com',
 'http://www.pingan.com/',
 'https://www.boc.cn/en/',
 'https://www.statefarm.com/',
 'https://www.td.com']

We now have all of the functions we need to extract the required data from a web page. It’s time to write a function that combines all of the helper functions above and collects their output into a dictionary.

Let’s define the scrape_page function, which will loop through all of the pages on value.today (there are 53 pages, indexed 0 to 52).

def scrape_page():
    all_info_dict = {
            'companies_name':[],
            'CEOs_name':[],
            'world_ranks':[],
            'market_capitalizations_in_billion_dollars':[],
            'annual_revenues_in_million_dollars':[],
            'number_of_employees':[],
            'companies_URLs':[]
            }
    for page in range(0, 53):
        url = f"https://www.value.today/world-top-companies/insurance?title=&field_headquarters_of_company_target_id&field_company_category_primary_target_id&field_company_website_uri=&field_market_cap_aug_01_2021__value=&page={page}"
        company_block = get_page(url).find_all('li',class_='row well clearfix')

        all_info_dict['companies_name'] += name_of_companies(company_block)
        all_info_dict['CEOs_name'] += name_of_CEOs(company_block)
        all_info_dict['world_ranks'] += ranks_of_world(company_block)
        all_info_dict['market_capitalizations_in_billion_dollars'] += market_caps(company_block)
        all_info_dict['annual_revenues_in_million_dollars'] += annual_rev(company_block)
        all_info_dict['number_of_employees'] += employees(company_block)
        all_info_dict['companies_URLs'] += extract_urls(company_block)
    return all_info_dict

In the scrape_page function above, we first created an empty dictionary (a structure that stores data in key: value pairs) named all_info_dict, containing an empty list for each of the keys 'companies_name', 'CEOs_name', 'world_ranks', 'market_capitalizations_in_billion_dollars', 'annual_revenues_in_million_dollars', 'number_of_employees', and 'companies_URLs'. These keys hold the values produced by the helper functions defined earlier in the section.

The for loop iterates through all of the website’s pages, finding the li tags on each page and storing them in the variable company_block.

Each list is then extended with the output of the corresponding helper function, so all of the results accumulate in all_info_dict.
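
One caveat: this loop fires 53 requests at the site back to back. A polite refinement (a hypothetical variant, not part of the original script) is to pause briefly before each download:

import time

def get_page_politely(url, delay_seconds=1):
    """Like get_page, but waits before each request.
    The one-second default is an assumption; tune it to the site's tolerance."""
    time.sleep(delay_seconds)
    return get_page(url)

Swapping get_page_politely into scrape_page slows the crawl but reduces the load on the server.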

Compiling the Information and Creating a CSV File Using Pandas

From the dictionary, we’ll create a pandas DataFrame.

Pandas is a Python library that is used to work with data sets.

A DataFrame is a data structure that organises information into a two-dimensional table of rows and columns.

# Create pandas dataframe from dictionary
import pandas as pd
scrape_page_dataframe = pd.DataFrame(scrape_page())
     companies_name  CEOs_name  world_ranks  market_capitalizations_in_billion_dollars  annual_revenues_in_million_dollars  number_of_employees  companies_URLs
0    BERKSHIRE HATHAWAY  Warren Buffett  8  543.680  286260.0  391500.0  https://www.berkshirehathaway.com/
1    UNITEDHEALTH GROUP  David S. Wichmann  18  332.730  255630.0  320000.0  https://www.unitedhealthgroup.com/
2    BANK OF AMERICA CORPORATION  Brian Moynihan  20  262.200  85530.0  208000.0  https://www.bankofamerica.com/
3    WELLS FARGO & COMPANY  Charles W. Scharf  65  124.780  72340.0  258700.0  https://www.wellsfargo.com/
4    AIA GROUP  Lee Yuan Siong  91  152.330  50360.0  23000.0  http://www.aia.com/
...  ...  ...  ...  ...  ...  ...  ...
522  NOVUS AC  None  36,479  0.005  NaN  NaN  None
523  INSR INSURANCE GROUP ASA  None  36,703  0.010  NaN  NaN  None
524  ATLAS FINANCIAL HOLDINGS, INC  None  37,108  NaN  NaN  NaN  None
525  HEALTH REVENUE ASSURANCE HOLDIN  None  37,682  NaN  NaN  NaN  None
526  TRIAD GUARANTY  None  38,342  0.001  NaN  NaN  None
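
Before saving, it is worth inspecting the DataFrame with the standard pandas methods head() and info(), which show the first rows and summarize column types and missing values:

# Preview the first five rows and summarize the columns
scrape_page_dataframe.head()
scrape_page_dataframe.info()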

Saving the Extracted Data into a CSV File:

scrape_page_dataframe.to_csv('scrape_page_dataframe.csv',index=None)
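
As a final sanity check, we can load the CSV back with pandas and confirm its shape (527 rows and 7 columns in our run):

# Reload the saved CSV and verify its dimensions
reloaded = pd.read_csv('scrape_page_dataframe.csv')
reloaded.shape
(527, 7)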

Conclusion

Here’s a brief breakdown of the steps we took to scrape the top insurance companies from value.today:

  • Using requests, we downloaded the webpage.
  • We used beautifulsoup4 to parse the HTML source code of the web page.
  • We extracted company names, CEOs, global rankings, market capitalization, annual revenue, employee count, and company URLs.
  • Using Pandas, we compiled the data and generated a CSV file.

The CSV file we created follows the format shown at the start of this post.

Here is the complete script for the Project:

import requests
from bs4 import BeautifulSoup
import pandas as pd


def get_page(url):
    """Download a web page and return a beautiful soup doc"""
    # Download the webpage
    response = requests.get(url)    
    # Check if the download was successful
    if response.status_code != 200:
        raise Exception('Unable to download page {}'.format(url))    
    # Get the page HTML
    page_contents = response.text    
    # Create a bs4 doc
    doc = BeautifulSoup(page_contents, 'html.parser')
    return doc


def name_of_companies(company_block):
    company_names = []
    for tag in company_block:
        c_name = tag.find('div',class_='field field--name-node-title field--type-ds field--label-hidden field--item')

        company_names.append(c_name.find('a').text)
    return company_names


def name_of_CEOs(company_block):
    CEO_names = []
    for tag in company_block:
        names = tag.find('div',class_='clearfix col-sm-12 field field--name-field-ceo field--type-entity-reference field--label-above')
        try:
            ceo = names.find('a').text
            CEO_names.append(ceo)
        except AttributeError:
            CEO_names.append(None)
    return CEO_names


def ranks_of_world(company_block):
    world_ranks = []

    for tag in company_block:
        rank = tag.find('div', class_='clearfix col-sm-6 field field--name-field-world-rank-sep-01-2021- field--type-integer field--label-above')

        world_ranks.append(rank.find('div',class_='field--item').text)
    return world_ranks


def market_caps(company_block):
    market_capitalization_in_dollars = []

    for tag in company_block:
        market_cap = tag.find('div',class_='clearfix col-sm-6 field field--name-field-market-value-jan012021 field--type-float field--label-above')
        try:
            caps = market_cap.find('div',class_='field--item').text
            replace_caps = caps.replace(' Billion USD',"")
            market_capitalization_in_dollars.append(float(replace_caps))
        except AttributeError:
            market_capitalization_in_dollars.append(None)
    return market_capitalization_in_dollars


def annual_rev(company_block):
    annual_revenue_in_dollars = []

    for tag in company_block:
        annual_revenue = tag.find('div',class_='clearfix col-sm-12 field field--name-field-revenue-in-usd field--type-float field--label-inline')
        try:
            revenue = annual_revenue.find('div',class_='field--item').text
            replace_string = revenue.replace(',',"").replace(' Million USD',"")

            annual_revenue_in_dollars.append(int(replace_string))

        except AttributeError:
            annual_revenue_in_dollars.append(None)
    return annual_revenue_in_dollars


def employees(company_block):
    no_of_employees = []

    for tag in company_block:
        employee = tag.find('div',class_='clearfix col-sm-12 field field--name-field-employee-count field--type-integer field--label-inline')
        try:
            n_employee = employee.find('div',class_='field--item').text
            replace_string = n_employee.replace(',',"")
            no_of_employees.append(int(replace_string))
        except AttributeError:
            no_of_employees.append(None)
    return no_of_employees



def extract_urls(company_block):    
    company_urls = []

    for tag in company_block:
        c_url = tag.find('div',class_='clearfix col-sm-12 field field--name-field-company-website field--type-link field--label-above')
        try:

            company_urls.append(c_url.find('a')['href'])
        except AttributeError:
            company_urls.append(None)
    return company_urls



def scrape_page():
    all_info_dict = {
            'companies_name':[],
            'CEOs_name':[],
            'world_ranks':[],
            'market_capitalizations_in_billion_dollars':[],
            'annual_revenues_in_million_dollars':[],
            'number_of_employees':[],
            'companies_URLs':[]
            }
    for page in range(0, 53):
        url = f"https://www.value.today/world-top-companies/insurance?title=&field_headquarters_of_company_target_id&field_company_category_primary_target_id&field_company_website_uri=&field_market_cap_aug_01_2021__value=&page={page}"
        company_block = get_page(url).find_all('li',class_='row well clearfix')

        all_info_dict['companies_name'] += name_of_companies(company_block)
        all_info_dict['CEOs_name'] += name_of_CEOs(company_block)
        all_info_dict['world_ranks'] += ranks_of_world(company_block)
        all_info_dict['market_capitalizations_in_billion_dollars'] += market_caps(company_block)
        all_info_dict['annual_revenues_in_million_dollars'] += annual_rev(company_block)
        all_info_dict['number_of_employees'] += employees(company_block)
        all_info_dict['companies_URLs'] += extract_urls(company_block)
    return all_info_dict

Looking to extract top insurance companies' data? Contact X-Byte Enterprise Crawling now!