
Selling products on Amazon is one of the easiest, most convenient, and least expensive ways to start an online business. It is also possible to scrape seller information from Amazon. In this tutorial, we will build a simple Amazon scraper in Python to extract sellers’ details.
Here, we are going to scrape the details of the sellers listed on Amazon, such as the seller name, category, seller page URL, business address, phone number, and email.
So let’s get started.
We will use Scrapy, a Python framework well suited to large-scale web scraping. To scrape Amazon, you need a few modules and dependencies installed or set up on your machine, such as Scrapy, PyMySQL, and pandas.
As you know, such packages can be installed with pip or conda.
To create a basic Scrapy project, we suggest creating a dedicated folder for it. Navigate to that folder in the command prompt, and you can create the Scrapy project just by executing the following:
It’s a basic Scrapy project, which as you know can be created with “scrapy startproject <_project name_>”. Here the project name is amazon_seller.
We can proceed by creating a spider to extract seller links so that we can start scraping; a spider is created with “scrapy genspider <_spider name_> <_url to scrape_>”.
Here’s the command for that.
The structure of the Scrapy project looks like the image below. However, it is created specifically for Amazon seller extraction, so it has one extra file (databaseconfig.py). Why? We’ll discuss that shortly.
Here, in the directory “\amazon_seller\amazon_seller\”, there is the additional file named “databaseconfig.py” mentioned above.
The file contains the variables and values listed below.
1. Values required to form the database connection string
2. Names of the schema tables
3. Table-creation statements
4. Some file paths (to tell the process where to save the output CSV at the end of execution)
Here is a snippet for the same.
host = "localhost"
username = "root"
passwd = "your password here"
db = "amazon_seller"
table_name1 = "seller_list"
table_name2 = "seller_info"
table1_create_table = """CREATE TABLE IF NOT EXISTS %s
    (Id int NOT NULL AUTO_INCREMENT,
    seller_name varchar(100) NOT NULL,
    seller_category varchar(100) NOT NULL,
    seller_url varchar(255) NOT NULL,
    status varchar(10) DEFAULT 'pending', ...
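As a minimal sketch of how this config is consumed, the table name can be formatted into the creation string before running it once at startup. Note that the closing of the CREATE statement (the PRIMARY KEY line) and the execution step are our own illustration, since the original snippet is truncated:

```python
# Sketch: substitute the table name into the CREATE TABLE template from
# databaseconfig.py. The PRIMARY KEY line and closing parenthesis are
# illustrative additions; the real file continues beyond the excerpt above.
table_name1 = "seller_list"
table1_create_table = """CREATE TABLE IF NOT EXISTS %s
    (Id int NOT NULL AUTO_INCREMENT,
    seller_name varchar(100) NOT NULL,
    seller_category varchar(100) NOT NULL,
    seller_url varchar(255) NOT NULL,
    status varchar(10) DEFAULT 'pending',
    PRIMARY KEY (Id))"""

create_stmt = table1_create_table % table_name1
print(create_stmt.splitlines()[0])  # → CREATE TABLE IF NOT EXISTS seller_list

# With a live MySQL connection one would then run, for example:
# import pymysql
# con = pymysql.connect(host, username, passwd, db)
# con.cursor().execute(create_stmt)
```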
We have a spider created to extract seller links – seller_links.py
There is a small process behind how the start URL is formed, because Amazon does not list all the sellers of each and every category together in one place.
Once you understand how the start URLs are created and changed, we can move further and use them.
The spider that extracts seller links is created by the genspider command discussed above and is named seller_links.py.
We have split the logic of seller_links.py into three functions to keep the process fast and smooth, instead of writing all the code in one function.
In this function, a request is sent to the seller-link page for each category ID. See the screenshot below.
for amazon_category in amazon_categories:
    yield scrapy.Request(
        url="https://www.amazon.de/mn/search/other?_encoding=UTF8&language=en_GB&page=1&pickerToList=enc-merchantbin&rh=n%3A" + amazon_category,
        callback=self.parse_next,
        method="GET",
        meta={'amazon_category': amazon_category}
    )
All the category IDs are stored in the variable amazon_categories.
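A minimal sketch of how those IDs turn into start URLs; the two IDs below are placeholders, not real Amazon browse-node IDs:

```python
# Sketch: build one start URL per category ID, as the request loop above does.
# The two IDs are placeholders standing in for real numeric browse-node IDs.
BASE = ("https://www.amazon.de/mn/search/other?_encoding=UTF8&language=en_GB"
        "&page=1&pickerToList=enc-merchantbin&rh=n%3A")

amazon_categories = ["1000000", "2000000"]
start_urls = [BASE + category for category in amazon_categories]
print(start_urls[0])
```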
This function mainly executes two tasks:
1) It filters out categories whose products are sold by Amazon itself.
2) Otherwise, it collects the links of the alphabet index shown on the page; each letter links to the sellers whose names start with that letter.
This is because the page shows only the top sellers for that particular category. Below is a screenshot for that.
The response of each seller-link page is parsed, and the following details are extracted from it.
def seller_link_extract(self, response):
    item = AmazonSellerLinkItem()
    item['seller_category'] = "_".join(response.xpath('//*[@class="a-row a-spacing-base"]//a/text()').extract())
    sellers = response.xpath('//*[@class="s-see-all-indexbar-column"]//a')
    for seller in sellers:
        item['seller_name'] = seller.xpath('./@title').extract_first()
        item['sellers_page_url'] = "https://www.amazon.de/sp?_encoding=UTF8&asin=&isCBA=&marketplaceID=&orderID=&seller=" + str(seller.xpath('./@href').re('%3A(.*?)&')[0])
        yield item
The spider created for seller information extraction is get_seller_info.py.
As we have stored all the URLs, we can now use them to extract the sellers’ data. There are two functions in total in the spider created for data extraction.
The first fetches all the links and sends a request to the next function. That’s it.
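A sketch of that first function’s core idea, with a plain list standing in for the seller_list table; in the real spider these rows come from MySQL via databaseconfig.py, and the callback name below is illustrative:

```python
# Sketch: keep only the URLs whose status is still 'pending'; each of these
# would be turned into a scrapy.Request in the real spider.
rows = [
    ("https://www.amazon.de/sp?seller=AAAA", "pending"),
    ("https://www.amazon.de/sp?seller=BBBB", "done"),
    ("https://www.amazon.de/sp?seller=CCCC", "pending"),
]

pending = [url for url, status in rows if status == "pending"]
# for url in pending:
#     yield scrapy.Request(url=url, callback=self.parse_seller_info)
print(len(pending))  # → 2
```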
try:
    merchantinfo['BusinessAddress'] = ",".join(response.xpath('//*[contains(text(),"Business Address:") or contains(text(),"Geschäftsadresse:")]/following-sibling::ul//text()').extract())
except Exception as e:
    merchantinfo['BusinessAddress'] = ''
try:
    merchantinfo['PhoneNumber'] = response.xpath('//*[@class="a-column a-span6"]//*[contains(text(),"Phone number:") or contains(text(),"Telefonnummer:")]/parent::span/text()').extract_first()
    if not merchantinfo['PhoneNumber']:
        try:
            merchantinfo['PhoneNumber'] = re.findall(b"Telefon:(.*?)", response.body)[0]
        except:
            merchantinfo['PhoneNumber'] = re.findall(rb"Tel\+Fax\.:(.*?)", response.body)[0]
except Exception as e:
    merchantinfo['PhoneNumber'] = ''
try:
    con = pymysql.connect(dbc.host, dbc.username, dbc.passwd)
    Name = dbc.csv + 'amazon_data.csv'
    qry = "select * from amazon_seller.seller_info"
    df = pd.read_sql(qry, con)
    df.columns = ['Id', 'SellerName', 'Category', 'SellerPage', 'BusinessAddress', 'PhoneNumber', 'Email']
    df.to_csv(Name, index=None)
    print('CSV file generated')
except Exception as e:
    print(e)
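If pandas is not available, the same export can be sketched with the standard csv module; the single row below is dummy data standing in for the seller_info query result:

```python
import csv
import io

# Sketch: write the seller rows to CSV without pandas. The one row here is
# dummy data; in practice the rows come from the seller_info table.
columns = ['Id', 'SellerName', 'Category', 'SellerPage',
           'BusinessAddress', 'PhoneNumber', 'Email']
rows = [(1, 'Example Shop', 'Books', 'https://www.amazon.de/sp?seller=AAAA',
         'Berlin, DE', '+49 30 000000', 'shop@example.com')]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(columns)
writer.writerows(rows)
print(buf.getvalue().splitlines()[0])
```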
This is it. Now, the sellers’ information has been extracted! In this way, you can run the code and extract information from any Amazon marketplace.
However, there are a few things to take care of during the final execution. Amazon, on any of its marketplaces, does not respond as quickly or as smoothly as we might expect, so you need to try various proxies and/or a list of user agents, apply them dynamically or randomly, and see what works. We used both here, alternately.
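A minimal sketch of random user-agent rotation; the strings below are shortened examples, not a vetted list:

```python
import random

# Sketch: pick a random user agent per request. The strings are
# abbreviated examples only; real lists use full UA strings.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (X11; Linux x86_64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]

headers = {"User-Agent": random.choice(USER_AGENTS)}
# In Scrapy, the headers can be passed per request:
# yield scrapy.Request(url, headers=headers, callback=...)
```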
Second, specifically for this requirement: phone numbers and emails differ enormously in type and pattern, so make sure you have them covered by all means. For phone numbers, a module called “phonenumbers” is used – give it a look.
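Before handing raw strings to phonenumbers, a rough cleanup can help. The regex-based helper below is our own illustration, not part of the original code:

```python
import re

# Sketch: strip everything except digits (and a leading '+') from a scraped
# phone string before validating it with the phonenumbers library.
def rough_clean(raw):
    raw = raw.strip()
    plus = "+" if raw.startswith("+") else ""
    return plus + re.sub(r"\D", "", raw)

print(rough_clean("+49 (0)30 123-456"))  # → +49030123456
```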
We hope this tutorial gave you a better idea of how to scrape seller information from Amazon or similar e-commerce websites. As a company, we understand e-commerce data, having worked with it before. If you are interested in professional help with scraping complex websites, let us know, and we will be glad to help you.