How Web Scraping Is Used To Extract Data From Transfermarkt?

February 08, 2022

The cornerstone of fields like Data Science is obtaining data and turning it into information. Sometimes obtaining it is straightforward: for example, you can visit a government website, download one of its raw data files, and run a quick analysis on a .csv file.

In other situations, however, data is harder to access; for example, the information you need for an analysis may only be available on a web page. In that case, Beautiful Soup, a Python package, can be used to perform web scraping.

Beautiful Soup is one of the most commonly used Python libraries for getting web data; it can extract data from HTML and XML files and includes several functions that make finding particular data on websites simple and quick.

Here, as an example, we will scrape data from Transfermarkt, a website that provides news and other information about games, clubs, players, and transfers from the world of soccer (football).


We will collect the name, the country of the previous league, and the price of the 25 most expensive players in AFC Ajax history; this information is available on the Transfermarkt website.


The image above shows the page listing the 25 most expensive AFC Ajax signings.

Extracting Information

Before getting the data, you will need to import the libraries needed to run the application such as Beautiful Soup, Pandas, and Requests.

import requests
from bs4 import BeautifulSoup
import pandas as pd

Next, the application downloads the web page content with the requests library, which requests the page from the server, and then uses the BeautifulSoup library to convert the result of that request (a Response object) into a BeautifulSoup object for data extraction.

# To request the page, we have to tell the website that we are a browser,
# which is why we set the headers variable
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

# endereco_da_pagina stands for the data page address
endereco_da_pagina = ""

# The objeto_response variable stores the downloaded web page
objeto_response = requests.get(endereco_da_pagina, headers=headers)

# Now we will create a BeautifulSoup object from our objeto_response.
# The 'html.parser' argument selects which parser is used to create the object;
# a parser is software that converts an input into a data structure.
pagina_bs = BeautifulSoup(objeto_response.content, 'html.parser')

The variable pagina_bs now has all the HTML information on our data page.
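
The download and parsing steps can also be wrapped in small helpers. The following is a minimal sketch, not part of the original script: the names `parse_html` and `fetch_page`, the shortened User-Agent string, and the 10-second timeout are all assumptions chosen for illustration. It adds a timeout and an HTTP status check, so a blocked or failed request does not silently produce an empty result.

```python
import requests
from bs4 import BeautifulSoup

# Same idea as the headers variable above: identify the client as a browser.
# The User-Agent string is shortened here for readability (assumption).
HEADERS = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}


def parse_html(content):
    """Turn raw HTML (bytes or str) into a BeautifulSoup object."""
    return BeautifulSoup(content, "html.parser")


def fetch_page(url, timeout=10):
    """Download a page and parse it, failing loudly on HTTP errors."""
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return parse_html(response.content)


# Offline check of the parsing step on a tiny hand-written snippet.
demo = parse_html(b"<html><body><h1>Ajax</h1></body></html>")
print(demo.h1.text)
```

With a real address, `pagina_bs = fetch_page(endereco_da_pagina)` would replace the two separate steps shown earlier.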

Now let us scrape the data from our variable. The information we need is in a table. Each row of the table represents a player: the name is in an anchor (<a>) with the class “spielprofil_tooltip”; the country of the previous league is shown as a flag image (<img>) with the class “flaggenrahmen” in the seventh column (<td>); and the price is in a table cell (<td>) with the class “rechts hauptlink” in the eighth column of each row.
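
To make this structure concrete, here is a small sketch that runs against a hand-written HTML sample instead of the live page. The sample is an assumption: it has only three columns, and its player names, countries, and prices are made up. It also reads the table row by row, which is a different approach from the three separate `find_all` calls used in this article, but it keeps each player's fields together.

```python
from bs4 import BeautifulSoup

# Hand-written sample that imitates the structure described above.
# The real page has more columns; names and values here are made up.
SAMPLE_HTML = """
<table>
  <tr>
    <td><a class="spielprofil_tooltip" href="#">Player One</a></td>
    <td><img class="flaggenrahmen" title="Netherlands" src="nl.png"></td>
    <td class="rechts hauptlink">£20.50m</td>
  </tr>
  <tr>
    <td><a class="spielprofil_tooltip" href="#">Player Two</a></td>
    <td><img class="flaggenrahmen" title="Brazil" src="br.png"></td>
    <td class="rechts hauptlink">£15.00m</td>
  </tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

rows = []
for row in soup.find_all("tr"):
    name_tag = row.find("a", {"class": "spielprofil_tooltip"})
    flag_tag = row.find("img", {"class": "flaggenrahmen", "title": True})
    price_tag = row.find("td", {"class": "rechts hauptlink"})
    if name_tag is None or flag_tag is None or price_tag is None:
        continue  # skip header rows or rows missing any of the three fields
    rows.append(
        (name_tag.get_text(strip=True), flag_tag["title"], price_tag.get_text(strip=True))
    )

print(rows)
```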

The BeautifulSoup library will then be used to obtain this information.

We’ll start by getting the names of the players.

nomes_jogadores = [] # List that will receive all the players names

# The find_all() method returns all tags that match the filters passed to it
tags_jogadores = pagina_bs.find_all("a", {"class": "spielprofil_tooltip"})
# In our case, we are finding all anchors with the class "spielprofil_tooltip"

# Now we will keep only the text (the name) of each player's tag
for tag_jogador in tags_jogadores:
    nomes_jogadores.append(tag_jogador.text)

Next, we will retrieve the countries of the players’ previous leagues.

pais_jogadores = [] # List that will receive the names of the countries of the players' previous leagues

tags_ligas = pagina_bs.find_all("td", {"class": None})
# This returns all the cells in the table that have no class attribute set

for tag_liga in tags_ligas:
    # The find() method returns the first image whose class is "flaggenrahmen" and that has a title
    imagem_pais = tag_liga.find("img", {"class": "flaggenrahmen", "title": True})
    # The imagem_pais variable holds all the image information,
    # including the title, which contains the name of the country of the flag image
    if imagem_pais is not None: # Only add the country when a matching image was found
        pais_jogadores.append(imagem_pais["title"])

Finally, we will get the players’ prices.

custos_jogadores = []

tags_custos = pagina_bs.find_all("td", {"class": "rechts hauptlink"})

for tag_custo in tags_custos:
    texto_preco = tag_custo.text
    # The price text contains characters that we don't need, such as the currency symbol (£ or €) and m (million), so we remove them
    texto_preco = texto_preco.replace("£", "").replace("€", "").replace("m", "")
    # We will now convert the value to a numeric variable (float)
    preco_numerico = float(texto_preco)
    custos_jogadores.append(preco_numerico)
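
The chain of `replace` calls fails with a `ValueError` as soon as a cell contains something other than a number, such as a placeholder dash. A slightly more defensive variant is sketched below using only the standard library. The function name `parse_price_millions` and the sample strings are assumptions for illustration; the sketch also assumes a dot as the decimal separator and values expressed in millions.

```python
import re


def parse_price_millions(text):
    """Return the price in millions as a float, or None if no number is found."""
    # Keep only the first decimal number; currency symbols and the trailing
    # "m" are ignored. Assumes a dot as the decimal separator.
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        return None
    return float(match.group())


# Made-up sample strings, including one without any number.
for sample in ["£20.50m", " €15m ", "-"]:
    print(repr(sample), "->", parse_price_millions(sample))
```

Prices given in other units (thousands, for instance) would need their own handling; this sketch does not attempt it.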

You now have all the information you need, so any analysis can be performed. You can run it with the pandas library and its DataFrame class, which stores data in rows and columns, much like a table.

# Creating a DataFrame with our data
df = pd.DataFrame({"Jogador":nomes_jogadores,"Preço (milhão de euro)":custos_jogadores,"País de Origem":pais_jogadores})

# Printing our gathered data
print(df)

The DataFrame now holds the data we collected through web scraping!
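
As an illustration of a first analysis, the sketch below totals the prices per previous-league country. The three lists are made-up stand-ins for the scraped ones, and the English column names are chosen only for this example. Because the lists are built by separate searches, it also checks that they have the same length before creating the DataFrame.

```python
import pandas as pd

# Made-up stand-ins for the three scraped lists.
nomes = ["Player One", "Player Two", "Player Three"]
custos = [20.5, 15.0, 12.0]
paises = ["Netherlands", "Brazil", "Netherlands"]

# pd.DataFrame raises ValueError if the lists differ in length, so check first
# to get a clearer message about which list is off.
tamanhos = {"nomes": len(nomes), "custos": len(custos), "paises": len(paises)}
if len(set(tamanhos.values())) != 1:
    raise ValueError(f"Scraped lists are misaligned: {tamanhos}")

df_demo = pd.DataFrame({"player": nomes, "price_m": custos, "country": paises})

# Total spending per previous-league country, largest first.
por_pais = df_demo.groupby("country")["price_m"].sum().sort_values(ascending=False)
print(por_pais)
```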
