How to Scrape Netflix Movies & TV Shows Data, EDA (Exploratory Data Analysis) & Visualization with Python?

September 30, 2021

Netflix, Inc. is an American media and technology service provider and production company headquartered in Los Gatos, California. It was founded in 1997 by Marc Randolph and Reed Hastings in Scotts Valley, California. The company's primary business is a subscription-based streaming service that offers films and television series online, including in-house productions.

Netflix is an extremely popular entertainment service used by people across the globe. This EDA explores a Netflix dataset through graphs and visualizations built with the Python libraries seaborn and matplotlib.

For the Netflix movies and TV shows data, EDA, and visualizations, we use the Netflix Movies and TV Shows dataset from Kaggle. This dataset includes movies and TV shows available on Netflix as of 2019. It was gathered from Flixable, a third-party search engine for Netflix.

Importing Libraries

Let’s import the libraries needed.

import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

Loading a Dataset

With the Pandas library, we load the CSV file into a data frame named netflix_df.

netflix_df = pd.read_csv("netflix_titles.csv")

Then check the first five rows with netflix_df.head().

  | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description
0 | 81145628 | Movie | Norm of the North: King Sized Adventure | Richard Finn, Tim Maltby | Alan Marriott, Andrew Toth, Brian Dobson, Cole... | United States, India, South Korea, China | September 9, 2019 | 2019 | TV-PG | 90 min | Children & Family Movies, Comedies | Before planning an awesome wedding for his gra...
1 | 80117401 | Movie | Jandino: Whatever it Takes | NaN | Jandino Asporaat | United Kingdom | September 9, 2016 | 2016 | TV-MA | 94 min | Stand-Up Comedy | Jandino Asporaat riffs on the challenges of ra...
2 | 70234439 | TV Show | Transformers Prime | NaN | Peter Cullen, Sumalee Montano, Frank Welker, J.. | United States | September 8, 2018 | 2013 | TV-Y7-FV | 1 Season | Kids' TV | With the help of three human allies, the Autob..
3 | 80058654 | TV Show | Transformers: Robots in Disguise | NaN | Will Friedle, Darren Criss, Constance Zimmer, ... | United States | September 8, 2018 | 2016 | TV-Y7 | 1 Season | Kids' TV | When a prison ship crash unleashes hundreds of...
4 | 80125979 | Movie | #realityhigh | Fernando Lebrija | Nesta Cooper, Kate Walsh, John Michael Higgins... | United States | September 8, 2017 | 2017 | TV-14 | 99 min | Comedies | When nerdy high schooler Dani finally attracts...

This dataset has 6,234 titles and 12 columns. At a glance, it looks like a typical movie and TV show data frame, except that it has no viewer ratings. We can also see NaN values in some columns.
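The first look described here can be reproduced with a few pandas calls. The snippet below is a minimal sketch: `sample_df` is a hypothetical three-row stand-in for `netflix_df` (values copied from the preview), so the example runs without the CSV file.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row stand-in for netflix_df (values taken from the preview)
sample_df = pd.DataFrame({
    "show_id": [81145628, 80117401, 70234439],
    "type": ["Movie", "Movie", "TV Show"],
    "title": ["Norm of the North: King Sized Adventure",
              "Jandino: Whatever it Takes",
              "Transformers Prime"],
    "director": ["Richard Finn, Tim Maltby", np.nan, np.nan],
    "release_year": [2019, 2016, 2013],
})

print(sample_df.head())            # first rows (up to five)
n_rows, n_cols = sample_df.shape   # number of rows and columns
print(f"{n_rows} rows, {n_cols} columns")
sample_df.info()                   # dtypes and non-null counts per column
```

On the real data, the same calls on `netflix_df` should report the row and column counts quoted in the text.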

Data Reporting & Cleaning

Data cleaning is the process of identifying incorrect, inaccurate, irrelevant, incomplete, or missing data and then modifying, replacing, or deleting it as required. It is considered a fundamental part of data science.

print('\nColumns with missing value:')
print(netflix_df.isnull().any())

Columns with missing value:
show_id         False
type            False
title           False
director         True
cast             True
country          True
date_added       True
release_year    False
rating           True
duration        False
listed_in       False
description     False
dtype: bool

From this output, we can see that there are 6,234 entries and 12 columns to work with in the EDA. Five columns contain null values: “director,” “cast,” “country,” “date_added,” and “rating.”

show_id            0
type               0
title              0
director        1969
cast             570
country          476
date_added        11
release_year       0
rating            10
duration           0
listed_in          0
description        0
dtype: int64

There are 3,036 null values in the whole dataset: 1,969 under “director,” 570 under “cast,” 476 under “country,” 11 under “date_added,” and 10 under “rating.” We need to handle all of these null values before diving into EDA and modeling.
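The per-column counts and the overall total can be computed directly; on the real data the column counts add up as 1,969 + 570 + 476 + 11 + 10 = 3,036. The sketch below shows the calls on a small hypothetical frame (`sample_df` and its values are made up for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for netflix_df
sample_df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", np.nan, np.nan, "Y"],
    "country": ["India", "Japan", np.nan, "Spain"],
    "rating": ["TV-14", "TV-MA", "TV-PG", np.nan],
})

missing_per_column = sample_df.isnull().sum()   # nulls in each column
has_missing = sample_df.isnull().any()          # True where a column has any null
total_missing = int(missing_per_column.sum())   # nulls in the whole frame

# Show only the columns that actually have gaps, largest first
print(missing_per_column[missing_per_column > 0].sort_values(ascending=False))
print("Total missing values:", total_missing)
```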

Imputation is a way of treating missing values by filling them in with a chosen technique, such as the mean, the mode, or a predictive model. Here, we use the fillna function from Pandas for imputation. The alternative is to drop the rows that have missing values, using the dropna function from Pandas.

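# Fill missing text fields with explicit placeholder labels
netflix_df["director"] = netflix_df["director"].fillna("No Director")
netflix_df["cast"] = netflix_df["cast"].fillna("No Cast")
netflix_df["country"] = netflix_df["country"].fillna("Country Unavailable")
# Drop the few rows that lack a date_added or rating value
netflix_df.dropna(subset=["date_added", "rating"], inplace=True)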

The easiest way to get rid of missing values would be to delete every row that contains one. However, that would hurt the EDA because it discards information. Since “director,” “cast,” and “country” hold most of the null values, we chose to label each missing value in those columns as unavailable. The other two columns, “date_added” and “rating,” are missing in only a small number of rows, so those rows are dropped from the dataset. In the end, no missing values remain in the data frame.
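The sketch below applies the same fill-and-drop strategy to a small hypothetical frame and checks that no nulls remain. It passes a dict of per-column placeholders to a single `fillna` call, which is one of several equivalent ways to write the step; `sample_df` and its values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for netflix_df with gaps in several columns
sample_df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", np.nan, "Y", "Z"],
    "cast": [np.nan, "P, Q", "R", "S"],
    "country": ["India", np.nan, "Japan", "Spain"],
    "date_added": ["September 9, 2019", "September 8, 2018", np.nan, "September 8, 2017"],
    "rating": ["TV-14", "TV-MA", "TV-PG", np.nan],
})

# Impute: one placeholder label per column (other columns are left untouched)
placeholders = {
    "director": "No Director",
    "cast": "No Cast",
    "country": "Country Unavailable",
}
cleaned_df = sample_df.fillna(value=placeholders)

# Drop: remove the rows that are missing date_added or rating
cleaned_df = cleaned_df.dropna(subset=["date_added", "rating"])

rows_lost = len(sample_df) - len(cleaned_df)
print("Rows dropped:", rows_lost)
print("Nulls remaining:", int(cleaned_df.isnull().sum().sum()))
```

Comparing the row count before and after makes the cost of the drop step explicit.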


Exploratory Visualization and Analysis

1. Netflix Content by Type

The Netflix dataset contains both movies and TV shows, so the first step is to compare the two. Let’s count the movies and TV shows in the dataset to see which makes up the larger part.

plt.title("Percentage of Netflix Titles that are either Movies or TV Shows")
g = plt.pie(netflix_df.type.value_counts(), explode=(0.025, 0.025), labels=netflix_df.type.value_counts().index, colors=['red', 'black'], autopct='%1.1f%%', startangle=180)

There are more than 4,000 movies and nearly 2,000 TV shows, so movies make up the larger part. Movie titles account for 68.5% of the dataset and TV show titles for 31.5%.
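The counts and percentages behind the pie chart can also be printed as a table. This is a minimal sketch using `value_counts(normalize=True)`; the `types` series is a hypothetical stand-in for `netflix_df.type`.

```python
import pandas as pd

# Hypothetical stand-in for netflix_df.type
types = pd.Series(["Movie", "Movie", "TV Show", "Movie"], name="type")

counts = types.value_counts()                                    # absolute counts per type
shares = types.value_counts(normalize=True).mul(100).round(1)    # share of each type in percent

summary = pd.DataFrame({"titles": counts, "percent": shares})
print(summary)
```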

2. Amount of Content as a Function of Time

Next, we look at how much content has been added to Netflix over the years. Since we are interested in when Netflix added each title to its platform, we add a “year_added” column that holds the year taken from the “date_added” column.
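The plotting code relies on per-year frames (`netflix_year_df`, `movies_year_df`, `shows_year_df`) whose construction is not shown. The sketch below is one possible way to build them, not necessarily the way they were built for this analysis. The helper `titles_per_year`, the demo frame, and the `'year'`/`'date_added'` output column layout are assumptions, chosen so that the result matches the `x` and `y` names used in the `sns.lineplot` calls.

```python
import pandas as pd

def titles_per_year(df):
    """Count titles by the year they were added.

    Output columns ('year', 'date_added') are an assumed layout that
    matches the x/y names used in the line plots.
    """
    added = pd.to_datetime(df["date_added"].str.strip(),
                           format="%B %d, %Y", errors="coerce")
    counts = added.dt.year.dropna().astype(int).value_counts().sort_index()
    return pd.DataFrame({"year": counts.index, "date_added": counts.values})

# Hypothetical stand-in for the cleaned netflix_df
demo_df = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "Movie"],
    "title": ["A", "B", "C", "D"],
    "date_added": ["September 9, 2019", " September 8, 2018",
                   "September 8, 2018", "January 1, 2019"],
})

# Split by content type, then count titles per year for each subset
demo_movies_df = demo_df[demo_df["type"] == "Movie"]
demo_shows_df = demo_df[demo_df["type"] == "TV Show"]

all_year_df = titles_per_year(demo_df)
movies_year_demo = titles_per_year(demo_movies_df)
shows_year_demo = titles_per_year(demo_shows_df)
print(all_year_df)
```

With the real data, the same pattern would give `netflix_movies_df` and `netflix_shows_df` from the type filter, and `netflix_year_df = titles_per_year(netflix_df)` and so on for the plots.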

fig, ax = plt.subplots(figsize=(13, 7))
sns.lineplot(data=netflix_year_df, x='year', y='date_added')
sns.lineplot(data=movies_year_df, x='year', y='date_added')
sns.lineplot(data=shows_year_df, x='year', y='date_added')
ax.set_xticks(np.arange(2008, 2020, 1))
plt.title("Total content added across all years (up to 2019)")
plt.legend(['Total', 'Movie', 'TV Show'])

Based on the timeline, we can conclude that the popular streaming platform started gaining traction after 2013. Since then, the amount of content added has grown considerably. The growth in the number of movies on Netflix is much larger than the growth in TV shows. Around 1,300 new movies were added in both 2018 and 2019. We can also see that Netflix has focused more on movies than on TV shows in recent years.

3. Countries by Amount of Produced Content

Next, we rank the countries by the amount of content they have produced on Netflix. A title can list several countries, so before counting we split the country field into individual entries and remove titles with no country available.

filtered_countries = netflix_df.set_index('title').country.str.split(', ', expand=True).stack().reset_index(level=1, drop=True)
filtered_countries = filtered_countries[filtered_countries != 'Country Unavailable']
g = sns.countplot(y=filtered_countries, order=filtered_countries.value_counts().index[:15])
plt.title('Top 15 Countries Contributor on Netflix')

The chart shows the top 15 contributing countries on Netflix. The country with the largest amount of content is the United States.
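The same split-and-count pattern recurs for directors, genres, and cast. As an alternative to the `split(expand=True).stack()` chain, `Series.explode` turns each list of values into one row per value. The helper `top_values` below is a hypothetical sketch of that idea, and the demo frame is made up for illustration.

```python
import pandas as pd

def top_values(df, column, n=15, placeholder=None, sep=", "):
    """Count individual entries in a separator-delimited column; return the n most common."""
    values = df[column].str.split(sep).explode().str.strip()
    if placeholder is not None:
        # Skip the label that was used to fill missing values
        values = values[values != placeholder]
    return values.value_counts().head(n)

# Hypothetical stand-in for netflix_df
demo_df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "country": ["United States, India", "India",
                "Country Unavailable", "United States, India, China"],
})

top_countries = top_values(demo_df, "country", n=2,
                           placeholder="Country Unavailable")
print(top_countries)
```

The resulting index can be passed as the `order` argument of `sns.countplot`, in the same way as `value_counts().index[:15]` in the code here.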

4. Top Directors on Netflix

To find the directors with the most titles, we can visualize the counts.

filtered_directors = netflix_df[netflix_df.director != 'No Director'].set_index('title').director.str.split(', ', expand=True).stack().reset_index(level=1, drop=True)
plt.title('Top 10 Director Based on The Number of Titles')
sns.countplot(y = filtered_directors, order=filtered_directors.value_counts().index[:10], palette='Blues')

The directors with the most titles on Netflix are mostly international.

5. Top Genres on Netflix

filtered_genres = netflix_df.set_index('title').listed_in.str.split(', ', expand=True).stack().reset_index(level=1, drop=True);
g = sns.countplot(y = filtered_genres, order=filtered_genres.value_counts().index[:20])
plt.title('Top 20 Genres on Netflix')

From this graph, we can see that International Movies take first place, followed by Dramas and Comedies.

6. Content by Ratings

# Count titles per rating for each type, aligned on one shared list of ratings
ratings = sorted(netflix_df.rating.unique())
count_movies = netflix_movies_df.rating.value_counts().reindex(ratings, fill_value=0)
count_shows = netflix_shows_df.rating.value_counts().reindex(ratings, fill_value=0)
plt.title('Amount of Content by Rating (Movies vs TV Shows)')
# Stack the TV show bars on top of the movie bars
plt.bar(ratings, count_movies, label='Movies')
plt.bar(ratings, count_shows, bottom=count_movies, label='TV Shows')
plt.legend()

The largest share of Netflix content carries the “TV-14” rating, which marks material that parents or adult guardians may find unsuitable for children under 14 years of age. For TV shows, however, the most common rating is “TV-MA,” which the TV Parental Guidelines assign to television programs designed for mature audiences only.

7. Top Actors on Netflix Based on Total Titles

filtered_cast_shows = netflix_shows_df[netflix_shows_df.cast != 'No Cast'].set_index('title').cast.str.split(', ', expand=True).stack().reset_index(level=1, drop=True)
plt.title('Top 10 Actor TV Shows Based on The Number of Titles')
sns.countplot(y=filtered_cast_shows, order=filtered_cast_shows.value_counts().index[:10], palette='pastel')

The top actor in Netflix TV shows, based on total titles, is Takahiro Sakurai.

filtered_cast_movie = netflix_movies_df[netflix_movies_df.cast != 'No Cast'].set_index('title').cast.str.split(', ', expand=True).stack().reset_index(level=1, drop=True)
plt.title('Top 10 Actor Movies Based on The Number of Titles')
sns.countplot(y = filtered_cast_movie, order=filtered_cast_movie.value_counts().index[:10], palette='pastel')

The top actor in Netflix movies, based on total titles, is Anupam Kher.


We have drawn several interesting conclusions from the Netflix movies and TV shows dataset; here is a summary of some of them:

  • The country that produces the most content is the United States.
  • The popular streaming platform started gaining traction after 2014. Since then, the amount of content added has grown significantly.
  • International Movies is the most common genre on Netflix.
  • The largest share of Netflix content carries the “TV-14” rating.
  • The most common content type on Netflix is Movies.
  • The top actor in Netflix movies, based on total titles, is Anupam Kher.
  • The top actor in Netflix TV shows, based on total titles, is Takahiro Sakurai.
  • The director with the most titles on Netflix is Jan Suter.