How to Scrape Houzz Product Images Using Python and Beautiful Soup

Today, we will see how to scrape Houzz product image data using Python and Beautiful Soup in a simple and clean manner. The objective of this tutorial is to get you started on real-world problem solving while keeping things simple, so that you become familiar with the tools and get practical results as quickly as possible.

The first thing we need is to make sure that Python 3 is installed. If you don’t have Python 3, install it before proceeding.

Then, you can install Beautiful Soup using:

pip3 install beautifulsoup4

We also need requests to fetch the page, lxml to parse the HTML, and soupsieve to use CSS selectors. Install them with:

pip3 install requests soupsieve lxml

Once they are installed, open your editor and type in:

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests

Now, let’s visit the Houzz listing page and inspect the data we can get.

This is how it looks:

[Image: Houzz listing page]

Now let’s go back to the code and try to get the data by pretending that we are a browser, like this:

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests

headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9'}
url = 'https://www.houzz.in/photos/kitchen-design-ideas-phbr0-bp~t_26043'

response = requests.get(url, headers=headers)

soup = BeautifulSoup(response.content, 'lxml')

# Print the parsed page so that running the script shows the HTML
print(soup.prettify())

Then, save it as scrapeHouzz.py.

If you run it:

python3 scrapeHouzz.py

You will see the HTML of the entire page.
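Before parsing, it can help to confirm that the request actually succeeded, since a blocked or failed request may return an error page instead of the listing. The following is a minimal sketch and not part of the original script: fetch_html is a hypothetical helper name, and the 10-second timeout is an arbitrary assumption. The session argument can be a requests.Session or the requests module itself, since both expose get with these parameters.

```python
def fetch_html(session, url, headers, timeout=10):
    """Return the page body as bytes, or raise if the request failed.

    `session` is assumed to be a requests.Session (or the requests module).
    """
    response = session.get(url, headers=headers, timeout=timeout)
    # raise_for_status() raises an HTTPError for 4xx/5xx responses,
    # so an error page is never parsed as if it were the listing.
    response.raise_for_status()
    return response.content
```

With this helper, html = fetch_html(requests, url, headers) followed by BeautifulSoup(html, 'lxml') would stand in for the request and parsing lines of the script.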

Next, let’s use CSS selectors to get the data we need. To do that, let’s open the page in Chrome and open the Inspect tool.

[Image: Houzz page HTML in the Chrome Inspect tool]

Here, we can see that the data for each individual product is contained in a div with the class ‘hz-space-card’. We can select these easily with the CSS selector ‘.hz-space-card’. So, this is how the code will look:

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests

headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9'}
url = 'https://www.houzz.in/photos/kitchen-design-ideas-phbr0-bp~t_26043'

response = requests.get(url, headers=headers)

soup = BeautifulSoup(response.content, 'lxml')

# Each product card is a div with the class hz-space-card
for item in soup.select('.hz-space-card'):
    print('----------------------------------------')
    print(item)

This prints the content of every container that holds product data.

[Image: script output showing each product container]
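To see what the class selector does without hitting the live site, here is a small offline sketch. The markup is made up and only imitates the structure described above, and it uses Python's built-in 'html.parser' instead of 'lxml' so that it runs with Beautiful Soup alone.

```python
from bs4 import BeautifulSoup

# Made-up markup that only imitates the card structure described above.
SAMPLE_HTML = """
<div class="hz-space-card"><a href="/photos/1">Kitchen one</a></div>
<div class="hz-space-card featured"><a href="/photos/2">Kitchen two</a></div>
<div class="other-card"><a href="/photos/3">Not a match</a></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# '.hz-space-card' matches any element whose class list contains
# hz-space-card, even when the element has other classes as well.
cards = soup.select(".hz-space-card")

print(len(cards))  # prints 2
for card in cards:
    print(card.get_text(strip=True))
```

The third div is ignored because its class list does not contain hz-space-card, which is why the selector picks up only the product containers on the real page.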

Now, let’s pick out the classes inside these containers that hold the data we need. We can see that the title is in an element with the class hz-space-card__photo-title, the image in one with hz-image, and so on. So, this is how the code looks when we get the titles, images, user names, descriptions, and links:

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests

headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9'}
url = 'https://www.houzz.in/photos/kitchen-design-ideas-phbr0-bp~t_26043'

response = requests.get(url, headers=headers)

soup = BeautifulSoup(response.content, 'lxml')

for item in soup.select('.hz-space-card'):
    try:
        print('----------------------------------------')

        # Title, image URL, user name, description, and link of each card
        print(item.select('.hz-space-card__photo-title')[0].get_text().strip())
        print(item.select('.hz-image')[0]['src'])
        print(item.select('.hz-space-card__user-name')[0].get_text().strip())
        print(item.select('.hz-space-card-unify__photo-description')[0].get_text().strip())
        print(item.select('.hz-space-card__image-link-container')[0]['href'])
    except (IndexError, KeyError):
        # The card is missing one of the elements or attributes; skip the rest of it
        continue

If you run it, it will print all the information:

[Image: script output showing titles, image URLs, user names, and links]

Great news! We have got all of them.
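One caveat with the loop above is that a card missing any single field stops printing partway through. A possible alternative, sketched here under a few assumptions, collects each card into a dictionary and stores None for anything that is absent. The helper names (text_or_none, attr_or_none, parse_cards) are hypothetical, the base URL is assumed from the listing URL used earlier, and 'html.parser' is used so the sketch runs without lxml.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumed from the listing URL used in this tutorial.
BASE_URL = "https://www.houzz.in"


def text_or_none(card, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    node = card.select_one(selector)
    return node.get_text(strip=True) if node else None


def attr_or_none(card, selector, attr):
    """Return an attribute of the first match, or None if it is missing."""
    node = card.select_one(selector)
    return node.get(attr) if node else None


def parse_cards(html):
    """Turn listing HTML into a list of dictionaries, one per product card."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.select(".hz-space-card"):
        link = attr_or_none(card, ".hz-space-card__image-link-container", "href")
        rows.append({
            "title": text_or_none(card, ".hz-space-card__photo-title"),
            "image": attr_or_none(card, ".hz-image", "src"),
            "user": text_or_none(card, ".hz-space-card__user-name"),
            # urljoin leaves absolute links alone and resolves relative ones.
            "link": urljoin(BASE_URL, link) if link else None,
        })
    return rows
```

Calling parse_cards(response.content) on the page fetched earlier would return a list that can be filtered, for example to keep only the rows whose "image" value is not None.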

If you want to use this in production and scale to thousands of links, you will find that Houzz quickly blocks your IP address. In this scenario, using a rotating proxy service to rotate IPs is a must. You can use a service like Proxies API to route your calls through a pool of millions of residential proxies.
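As a general illustration of the idea, and not the API of any particular proxy service, the sketch below rotates requests through a list of proxies on the client side. The proxy addresses are placeholders, the function names are hypothetical, and session is assumed to be a requests.Session (or the requests module), whose get method accepts a proxies dictionary mapping each scheme to a proxy URL.

```python
from itertools import cycle

# Placeholder proxy endpoints; replace them with the ones your provider gives you.
EXAMPLE_PROXIES = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
]


def proxy_configs(proxy_urls):
    """Yield requests-style proxy dictionaries, cycling through proxy_urls forever."""
    for proxy in cycle(proxy_urls):
        yield {"http": proxy, "https": proxy}


def fetch_all(session, urls, headers, proxy_urls, timeout=10):
    """Fetch each URL through the next proxy in the rotation.

    Returns a dictionary mapping each URL to its response body in bytes.
    """
    configs = proxy_configs(proxy_urls)
    pages = {}
    for url in urls:
        response = session.get(
            url, headers=headers, proxies=next(configs), timeout=timeout
        )
        response.raise_for_status()
        pages[url] = response.content
    return pages
```

A call such as fetch_all(requests, list_of_urls, headers, EXAMPLE_PROXIES) would send the first URL through the first proxy, the second through the second, and so on, wrapping around when the list is exhausted. A managed rotating proxy service typically does this rotation for you behind a single endpoint.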

If you want to scale your crawling speed and don’t want to set up your own infrastructure, you can use our cloud-based crawler to easily scrape thousands of URLs at high speed from our network of crawlers.