How to Scrape Amazon Prime Video Data Using BeautifulSoup & Selenium

Selenium is an extremely powerful tool for web scraping, but it has some flaws, which is fair: it was built mainly to test web applications. BeautifulSoup, on the other hand, was developed specifically for parsing and extracting data, and it is extremely good at it.

However, even BeautifulSoup has its limits: it cannot help when the required data sits behind a "wall", that is, a page that requires a user login or some other user action before the data becomes accessible.

That's where Selenium comes in: we use it to automate the user interactions on the website, then hand the page over to BeautifulSoup to scrape the data once we are past the "wall".

Integrating Selenium with BeautifulSoup gives us an extremely powerful web scraping tool.

Selenium can automate user interactions and also extract data on its own, but BeautifulSoup is much more efficient at the extraction step.
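To illustrate the handoff between the two tools: in the real script the HTML comes from Selenium's driver.page_source, but a static string parses exactly the same way. A minimal sketch (the movie, and the markup around it, are made up; only the class names mirror the selectors used later in this article):

```python
from bs4 import BeautifulSoup

# Static HTML standing in for driver.page_source (made-up sample).
html = """
<div class="av-hover-wrapper">
  <h1 class="tst-hover-title">Some Comedy</h1>
  <span class="dv-grid-beard-info"><span>IMDb 8.4</span></span>
</div>
"""

page = BeautifulSoup(html, 'html.parser')  # 'lxml' also works if installed
for tile in page.find_all('div', attrs={"class": "av-hover-wrapper"}):
    name = tile.find('h1').text.strip()
    rating = tile.find('span', attrs={"class": "dv-grid-beard-info"}).span.text
    print(name, rating)  # Some Comedy IMDb 8.4
```

Whatever produced the HTML (a browser, a file, a request), BeautifulSoup only ever sees a string, which is why the two tools combine so cleanly.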

We will use BeautifulSoup and Selenium to extract movie information such as name, description, and rating from the Comedy category on Amazon Prime Video, and then filter the movies by their IMDb ratings.

So, let’s start.

First, let's import the necessary modules:


from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup as soup
from time import sleep
from selenium.common.exceptions import NoSuchElementException
import pandas as pd

Then, create three empty lists to hold the movie details:


movie_names = []
movie_descriptions = []
movie_ratings = []

For this program to work, you need the Chrome WebDriver. Make sure you download a driver version that matches the version of your installed Chrome browser.

Next, let's define a function open_site() that opens the Amazon Prime sign-in page:


def open_site():
    options = webdriver.ChromeOptions()
    options.add_argument("--disable-notifications")
    driver = webdriver.Chrome(executable_path='PATH/TO/YOUR/CHROME/DRIVER', options=options)
    driver.get(r'https://www.amazon.com/ap/signin?accountStatusPolicy=P1&clientContext=261-1149697-3210253&language=en_US&openid.assoc_handle=amzn_prime_video_desktop_us&openid.claimed_id=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.identity=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.mode=checkid_setup&openid.ns=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0&openid.ns.pape=http%3A%2F%2Fspecs.openid.net%2Fextensions%2Fpape%2F1.0&openid.pape.max_auth_age=0&openid.return_to=https%3A%2F%2Fwww.primevideo.com%2Fauth%2Freturn%2Fref%3Dav_auth_ap%3F_encoding%3DUTF8%26location%3D%252Fref%253Ddv_auth_ret')
    sleep(5)
    # Fill in the sign-in form with your own credentials
    driver.find_element_by_id('ap_email').send_keys('ENTER YOUR EMAIL ID')
    driver.find_element_by_id('ap_password').send_keys('ENTER YOUR PASSWORD', Keys.ENTER)
    sleep(2)
    search(driver)

Now, define the search() function, which searches for the genre we specify:


def search(driver):
    # Search for the genre in Prime Video's search bar
    driver.find_element_by_id('pv-search-nav').send_keys('Comedy Movies', Keys.ENTER)

    # Scroll until the page height stops changing (infinite scrolling)
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        sleep(5)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height

    # Hand the fully loaded page over to BeautifulSoup
    html = driver.page_source
    page_soup = soup(html, 'lxml')
    tiles = page_soup.find_all('div', attrs={"class": "av-hover-wrapper"})

    for tile in tiles:
        movie_name = tile.find('h1', attrs={"class": "_1l3nhs tst-hover-title"})
        movie_description = tile.find('p', attrs={"class": "_36qUej _1TesgD tst-hover-synopsis"})
        movie_rating = tile.find('span', attrs={"class": "dv-grid-beard-info"})
        if movie_name is None or movie_description is None or movie_rating is None:
            continue  # skip tiles that are missing any of the fields
        rating = movie_rating.span.text
        try:
            # Keep only movies rated above 8.0 (and below 10.0, as a sanity check)
            if 8.0 < float(rating[-3:]) < 10.0:
                movie_descriptions.append(movie_description.text)
                movie_ratings.append(rating)
                movie_names.append(movie_name.text)
                print(movie_name.text, rating)
        except ValueError:
            pass  # rating text does not end in a number
    dataFrame()

This function searches for a genre and scrolls to the bottom of the page. Since Amazon Prime Video uses infinite scrolling, we keep scrolling with the JavaScript executor until the page height stops changing, then grab the page source with driver.page_source and pass it to BeautifulSoup.
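The scroll-until-stable pattern is independent of Selenium itself: keep scrolling until the reported document height stops growing. Here is a sketch with the JavaScript executor injected as a plain function so the logic can be tested without a browser (scroll_to_bottom, fake_execute, and the height numbers are all illustrative, not part of the original script):

```python
def scroll_to_bottom(execute_script, pause=lambda: None):
    """Keep scrolling until document height stops growing.

    execute_script stands in for driver.execute_script; pause for sleep.
    """
    last_height = execute_script("return document.body.scrollHeight")
    while True:
        execute_script("window.scrollTo(0, document.body.scrollHeight);")
        pause()
        new_height = execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return new_height
        last_height = new_height

# Fake executor simulating a page that loads more content twice.
heights = iter([1000, 2000, 3000, 3000])
def fake_execute(script):
    return next(heights) if script.startswith("return") else None

print(scroll_to_bottom(fake_execute))  # 3000
```

The pause between scrolls matters on the real site: it gives the lazy-loaded tiles time to arrive, otherwise the height can look stable before the page is actually done.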

The if statement keeps only movies rated above 8.0 and below 10.0, just to be safe.
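The filter relies on rating[-3:] slicing the last three characters of the rating text, e.g. "8.2" from a string that ends in the figure. A small sketch of that logic in isolation (passes_filter and the sample strings are made up for illustration; real Prime Video rating text may differ):

```python
def passes_filter(rating_text, low=8.0, high=10.0):
    """Return True if the trailing numeric rating lies in (low, high)."""
    try:
        value = float(rating_text[-3:])  # e.g. "IMDb 8.2" -> "8.2"
    except ValueError:
        return False  # text does not end in a number
    return low < value < high

print(passes_filter("IMDb 8.2"))   # True
print(passes_filter("IMDb 7.9"))   # False
print(passes_filter("Subtitles"))  # False
```

Note that a fixed three-character slice is fragile: it works for ratings like "8.2" but would mis-slice a text ending in "10", which is one more reason to keep the try/except around the conversion.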

Now, it's time to create a pandas DataFrame to store all the movie data in:


def dataFrame():
    details = {
        'Movie Name' : movie_names,
        'Description' : movie_descriptions,
        'Rating' : movie_ratings
    }
    # from_dict with orient='index' tolerates lists of unequal length;
    # transposing then turns the dict keys back into columns
    data = pd.DataFrame.from_dict(details, orient='index')
    data = data.transpose()
    data.to_csv('Comedy.csv')
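Why the from_dict/transpose detour instead of calling pd.DataFrame(details) directly? If one list ends up shorter than the others (say, a description failed to scrape), the direct constructor raises an error, while this route pads the missing cells with NaN. A small demonstration with made-up values:

```python
import pandas as pd

details = {
    'Movie Name': ['A', 'B', 'C'],
    'Description': ['desc a', 'desc b'],  # one short: padded with NaN
    'Rating': ['IMDb 8.4', 'IMDb 9.1', 'IMDb 8.7'],
}
# Each key becomes a row; short rows are padded, then transposed to columns.
data = pd.DataFrame.from_dict(details, orient='index').transpose()
print(data)
```

With clean, equal-length lists both approaches give the same result; the detour is simply the more forgiving one for scraped data.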

Finally, let's call the function:


open_site()
(Screenshot: sample data)

Your result won't look exactly like this; we formatted the sheet a bit, adjusting column widths and wrapping the text. Other than that, it should look like the one shown here.

Conclusion

While BeautifulSoup and Selenium work very well together and can deliver good results, there are other modules out there that are equally powerful.

If you have any other queries related to scraping Amazon Prime Video data, you can always contact X-Byte!