By Wei Jiang
Netflix is one of the most popular streaming services today and is considered one of the top streaming companies in the United States. A big question is how Netflix keeps its user base interested over time. Is it by making popular movies available to stream? Is it by releasing specific genres of movies and TV shows based on past trends?
In this notebook, I will use a Netflix movie dataset to see whether there is any relationship between a country and the number of movies/TV shows released in that country. Is this trend consistent in each country, and if so, is it a reflection of what Netflix wants to focus on, or is Netflix releasing TV shows and movies at random?
Tools used in the notebook are:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import math
data = pd.read_csv("netflix_titles.csv")
# Drop columns that are not needed
data = data.drop(columns=["show_id", "director", "rating", "description"])
# Convert the listed_in column into a list so that it is easier to use
# Convert release_year into an int
# Convert country into a list, or np.nan if not listed
for index, row in data.iterrows():
    data.at[index, "listed_in"] = row["listed_in"].split(",")
    data.at[index, "release_year"] = int(row["release_year"])
    country = row["country"]
    # A missing country is read in as a float NaN
    if isinstance(country, float):
        country = np.nan
    else:
        country = country.split(",")
    data.at[index, "country"] = country
# Drop all rows without a country because they are not useful in the analysis
# (dropna returns a new DataFrame, so the result has to be assigned back)
data = data.dropna(subset=["country"])
data.head()
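The row-by-row cleaning loop above can also be written with pandas' vectorized string methods. The following is only a minimal sketch on a tiny made-up table; the `sample` and `cleaned` names and the two rows are assumptions for illustration and are not taken from netflix_titles.csv.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the real CSV (rows invented for illustration only)
sample = pd.DataFrame({
    "listed_in": ["Dramas, International Movies", "Kids' TV"],
    "release_year": ["2019", "2016"],
    "country": ["United States, India", np.nan],
})

# Split comma-separated strings into lists; missing values stay NaN
sample["listed_in"] = sample["listed_in"].str.split(",")
sample["country"] = sample["country"].str.split(",")
sample["release_year"] = sample["release_year"].astype(int)

# Keep only the rows where the country is known
cleaned = sample.dropna(subset=["country"])
print(cleaned)
```

As in the loop, the split pieces keep their leading spaces (for example " India"), so they still need to be stripped before they are used as keys.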
The data now contains the information needed for the analysis. The important fields are:
The first step is to find out which type of show is more popular on Netflix and whether this has changed across the years. The data goes back to 1964, but Netflix started streaming most of its content in 2007. So the focus will be on data from 2007 onwards, not on the years before Netflix began its streaming service.
# Unique years in the data
years = data.release_year.unique()
# Filter the list so it only has data from 2007 to 2020
years = list(filter(lambda year: 2007 <= year <= 2020, years))
# Keep only type and release year columns and group by year
# (a scalar key lets get_group accept a plain year value)
groupByYear = data.groupby("release_year")
groupByYear = groupByYear[["type", "release_year"]]
# Go through each year and tally up the movie and TV show types
movieMap = []
tvMap = []
for year in years:
    # Get the group for this year
    targetYear = groupByYear.get_group(year)
    # Local count for each year
    movieCount = 0
    tvCount = 0
    # Counting: anything that is not a "Movie" is treated as a TV show
    for index, row in targetYear.iterrows():
        if str(row["type"]) == "Movie":
            movieCount += 1
        else:
            tvCount += 1
    # Add it to the list
    movieMap.append([year, movieCount])
    tvMap.append([year, tvCount])
# Generate the tables and merge them on year
movieDF = pd.DataFrame(movieMap, columns=["Year", "Movie Count"])
tvDF = pd.DataFrame(tvMap, columns=["Year", "TV Count"])
mergedDF = movieDF.merge(tvDF, how="left", left_on="Year", right_on="Year")
mergedDF.head()
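The per-year tally above is a plain cross-tabulation of year against type, so it can also be sketched with `pd.crosstab`. The `sample` rows below are invented for illustration, and `recent` and `counts` are hypothetical names, not part of the original notebook.

```python
import pandas as pd

# Invented rows standing in for the cleaned dataset
sample = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "Movie", "TV Show"],
    "release_year": [2018, 2018, 2018, 2019, 2019],
})

# Keep 2007-2020 only, then count titles per (year, type) pair
recent = sample[sample["release_year"].between(2007, 2020)]
counts = pd.crosstab(recent["release_year"], recent["type"])
print(counts)
```

One difference from the loop: `crosstab` creates one column per distinct value of `type`, whereas the loop lumps everything that is not "Movie" into the TV count.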
# Movies per year
plt.bar(mergedDF["Year"], mergedDF["Movie Count"], color="b")
plt.xlabel("Year")
plt.ylabel("Movie Count")
plt.show()
# TV shows per year
plt.bar(mergedDF["Year"], mergedDF["TV Count"], color="r")
plt.xlabel("Year")
plt.ylabel("TV Show Count")
plt.show()
# Movies and TV shows on the same axes
plt.bar(mergedDF["Year"], mergedDF["Movie Count"], color="b", label="Movie")
plt.bar(mergedDF["Year"], mergedDF["TV Count"], color="r", label="TV")
plt.xlabel("Year")
plt.ylabel("Show Count")
plt.legend(loc="upper left")
plt.show()
Across the years, more movies than TV shows were released on Netflix every year except 2019, when TV shows outnumbered movies. According to William L. Hosch from Britannica, Netflix reached its biggest revenue generator in 2018. Perhaps Netflix started making more of its original series, which tend to fall in the TV show category.
Counting the number of movies and TV shows per year is not a very accurate comparison, because movies could have been more popular in the earlier years and, being much shorter, they are much easier to produce. Here, I will take the duration of each movie/TV show and standardize it.
Movies tend to last 90+ minutes, while a TV show's duration is listed in seasons. On average, a TV show has about 13 episodes per season, and each episode lasts about 50 minutes. This brings the average season to about 650 minutes, which is about 7 movies.
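The arithmetic behind that estimate can be checked in a few lines. The three constants are the rough averages quoted in the text, not values measured from the dataset, and the variable names are only illustrative.

```python
EPISODES_PER_SEASON = 13   # assumed average, from the text
MINUTES_PER_EPISODE = 50   # assumed average, from the text
MOVIE_MINUTES = 90         # assumed typical movie length, from the text

# Length of one season in minutes
season_minutes = EPISODES_PER_SEASON * MINUTES_PER_EPISODE
# How many 90-minute movies fit into one season
movies_per_season = season_minutes / MOVIE_MINUTES

print(season_minutes)               # 650
print(round(movies_per_season, 1))  # 7.2
```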
# Method to extract the number from a duration string
def extractInt(string):
    nums = ""
    for i in range(len(string)):
        if string[i].isdigit():
            nums += string[i]
    return int(nums)
# Go through each row and convert each duration into an int, in minutes
for index, row in data.iterrows():
    data.at[index, "release_year"] = int(row["release_year"])
    duration = row["duration"]
    # Movies are listed in minutes ("min"), TV shows in seasons
    if "min" in duration:
        duration = extractInt(duration)
    else:
        # An average season is assumed to be 13 episodes of about 50 minutes each
        duration = extractInt(duration)
        duration = duration * 13 * 50
    # Update the DataFrame
    data.at[index, "duration"] = duration
data.head()
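The same conversion can be sketched without an explicit loop by pulling the number out with a regular expression. This is an alternative under assumptions, not the notebook's method: the three duration strings are invented examples of the "N min" / "N Season(s)" formats, and `SEASON_MINUTES`, `number`, and `is_movie` are hypothetical names.

```python
import pandas as pd

# Invented examples of the two duration formats
sample = pd.DataFrame({"duration": ["90 min", "1 Season", "3 Seasons"]})

# First run of digits in each string, as an integer
number = sample["duration"].str.extract(r"(\d+)", expand=False).astype(int)
# Movies are the rows whose duration is given in minutes
is_movie = sample["duration"].str.contains("min")

SEASON_MINUTES = 13 * 50  # assumed average season length, from the text

# Keep minutes for movies; convert seasons to minutes for TV shows
sample["minutes"] = number.where(is_movie, number * SEASON_MINUTES)
print(sample)
```

Unlike `extractInt`, which concatenates every digit in the string, `str.extract` keeps only the first run of digits; for these two formats the result is the same.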
Using the same counting method as before, this time I take the duration into account so that TV shows and movies are weighted on the same scale. For example:
1 Movie = 90 minutes = 1 count
1 Season = 650 minutes = about 7 counts
# Unique years in the data
years = data.release_year.unique()
# Filter the list so it only has data from 2007 to 2020
years = list(filter(lambda year: 2007 <= year <= 2020, years))
# Average movie duration in minutes
avgMovieDuration = data.groupby("type").get_group("Movie").duration.mean()
print(avgMovieDuration)
# Keep only type, release year and duration columns and group by year
groupByYear = data.groupby("release_year")
groupByYear = groupByYear[["type", "release_year", "duration"]]
# Go through each year and tally up the weighted movie and TV show counts
movieMap = []
tvMap = []
for year in years:
    # Get the group for this year
    targetYear = groupByYear.get_group(year)
    # Local count for each year
    movieCount = 0
    tvCount = 0
    # Counting: each title counts as its duration in average-movie units, rounded up
    for index, row in targetYear.iterrows():
        duration = row["duration"]
        if str(row["type"]) == "Movie":
            movieCount += math.ceil(duration/avgMovieDuration)
        else:
            tvCount += math.ceil(duration/avgMovieDuration)
    # Add it to the list
    movieMap.append([year, movieCount])
    tvMap.append([year, tvCount])
# Generate the tables and merge them on year
movieDF = pd.DataFrame(movieMap, columns=["Year", "Movie Count"])
tvDF = pd.DataFrame(tvMap, columns=["Year", "TV Count"])
mergedDF = movieDF.merge(tvDF, how="left", left_on="Year", right_on="Year")
mergedDF.head()
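To make the weighting concrete, here is a small sketch of the same rule, ceil(duration / average movie duration), applied to three invented titles. The numbers are chosen so that the average movie is exactly 90 minutes; in the notebook the baseline is whatever `avgMovieDuration` prints for the real data. The names `avg_movie`, `weight`, and `weighted` are hypothetical.

```python
import math
import pandas as pd

# Invented titles; durations are already in minutes
sample = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show"],
    "release_year": [2018, 2018, 2018],
    "duration": [80, 100, 650],
})

# Baseline: mean movie length (90.0 for these made-up rows)
avg_movie = sample.loc[sample["type"] == "Movie", "duration"].mean()

# Each title counts as ceil(duration / average movie length)
sample["weight"] = sample["duration"].apply(lambda d: math.ceil(d / avg_movie))

# Sum the weights per year and type
weighted = (
    sample.groupby(["release_year", "type"])["weight"]
    .sum()
    .unstack(fill_value=0)
)
print(weighted)
```

Because the rule rounds up, the 100-minute movie counts as 2 and the 650-minute season counts as 8 against a 90-minute baseline; a longer baseline brings the season closer to the "about 7" quoted in the text.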
# Weighted movie count per year
plt.bar(mergedDF["Year"], mergedDF["Movie Count"], color="b")
plt.xlabel("Year")
plt.ylabel("Movie Count")
plt.show()
# Weighted TV show count per year
plt.bar(mergedDF["Year"], mergedDF["TV Count"], color="r")
plt.xlabel("Year")
plt.ylabel("TV Show Count")
plt.show()
# Variables for the scatter plot
year = mergedDF["Year"]
tvCount = mergedDF["TV Count"]
movieCount = mergedDF["Movie Count"]
plt.figure(figsize=(10,10))
plt.scatter(year, tvCount, c='r', label="TV")
plt.scatter(year, movieCount, c='b', label="Movie")
plt.xlabel("Years")
plt.ylabel("Movie/TV Show Count")
plt.legend(loc="upper left")
plt.show()
After the transformation, there are actually more TV shows than movies once duration is taken into account. From the graph, it looks like TV shows became more popular right after 2014 and continued that trend, while movies peaked around 2017 and began to fall afterwards.
This suggests that viewers increasingly prefer TV shows over movies, because the TV show count continues to grow while the movie count begins to decline.
Next, the data should be categorized by country, because not all TV shows and movies are available in every country. In this step, we will see if there is a correlation between the number of movies and TV shows and the country. Could Netflix release more movies or shows to certain countries because of demand?
In this section, I will only look at:
# Get a list of country names
countries = {
    "Movie Count": {},
    "TV Count": {}
}
# For each data point, register the country names
for index, row in data.iterrows():
    countryList = row["country"]
    # Skip rows with no country (NaN is a float)
    if type(countryList) == float:
        continue
    for country in countryList:
        country = country.strip()
        # Check the inner dict, so a country is only initialized once
        if country not in countries["Movie Count"]:
            # Start every year at zero for the movie and TV counts
            countries["Movie Count"][country] = {year: 0 for year in years}
            countries["TV Count"][country] = {year: 0 for year in years}
# Go through each data item and add the count to the appropriate country
for index, row in data.iterrows():
    # Calculate the movie/TV show weight from its duration
    duration = row["duration"]
    category = row["type"]
    year = row["release_year"]
    count = math.ceil(duration/avgMovieDuration)
    countryList = row["country"]
    if type(countryList) == float:
        continue
    for country in countryList:
        country = country.strip()
        # Only count years in the years list that we defined earlier
        if year in years:
            if category == "Movie":
                countries["Movie Count"][country][year] += count
            else:
                countries["TV Count"][country][year] += count
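The nested-dictionary tally above gives a title that lists several countries one count in each of them. A comparable result can be sketched with `DataFrame.explode`, which turns each list element into its own row. The rows, the precomputed `weight` column, and the names `per_country` and `totals` are assumptions for illustration only.

```python
import pandas as pd

# Invented rows; "weight" stands for the duration-based count
sample = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie"],
    "release_year": [2018, 2018, 2019],
    "country": [["United States", " India"], ["India"], ["India"]],
    "weight": [1, 7, 2],
})

# One row per (title, country) pair, with whitespace removed from names
per_country = sample.explode("country").reset_index(drop=True)
per_country["country"] = per_country["country"].str.strip()

# Total weight per country, year and type
totals = per_country.groupby(["country", "release_year", "type"])["weight"].sum()
print(totals)
```

Unlike the dictionaries, this only produces entries for combinations that actually occur, so years with no titles are absent rather than zero.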
# Target countries to look at the trends
targetCountries = [
    "United States",
    "United Kingdom",
    "Hong Kong",
    "China",
    "France",
    "India",
    "South Korea",
    "Thailand",
    "Australia",
    "Canada"
]
# Make a dataframe for each country that shows the number of movies and TV shows for each year
# Then plot it
for country in targetCountries:
    countryMovieDict = countries["Movie Count"][country]
    countryTVDict = countries["TV Count"][country]
    tvCount = []
    movieCount = []
    for year in years:
        tvCount.append(countryTVDict[year])
        movieCount.append(countryMovieDict[year])
    dataFrame = {
        "Year": years,
        "TV Count": tvCount,
        "Movie Count": movieCount
    }
    countryDF = pd.DataFrame(dataFrame)
    year = countryDF["Year"]
    tvCount = countryDF["TV Count"]
    movieCount = countryDF["Movie Count"]
    plt.figure(figsize=(10,10))
    plt.scatter(year, tvCount, c='r', label="TV")
    plt.scatter(year, movieCount, c='b', label="Movie")
    plt.xlabel("Years")
    plt.ylabel("Movie/TV Show Count")
    plt.title(country)
    plt.legend(loc="upper left")
    # Linear regression line
    # Movie
    m, b = np.polyfit(year, movieCount, 1)
    plt.plot(year, m*year+b, c="b")
    # TV show
    m, b = np.polyfit(year, tvCount, 1)
    plt.plot(year, m*year+b, c="r")
    plt.show()
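The trend lines in these plots come from `np.polyfit` with degree 1, which returns the slope and intercept of the least-squares straight line. A minimal sketch on invented, exactly linear counts shows how to read the result; the `years` and `counts` arrays here are made up and are not the notebook's data.

```python
import numpy as np

# Invented counts that grow by exactly 4 per year
years = np.array([2015, 2016, 2017, 2018])
counts = np.array([10, 14, 18, 22])

# Degree-1 fit: coefficients come back highest power first
slope, intercept = np.polyfit(years, counts, 1)
fitted = slope * years + intercept

print(round(slope, 2))  # change in count per year
```

A positive slope means the count rises from year to year, and where the movie and TV lines cross is where the two fitted counts are equal.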
From the scatter plots of the 10 countries, we can see that the country does affect how many movies and TV shows Netflix releases. For example, Netflix predominantly releases movies in India, more so than in any other country. For every country other than India, the movie and TV show counts intersect at some point around 2008-2010.
From this, I can conclude that Netflix is focusing on TV shows rather than movies. From personal experience, Netflix Originals tend to include more TV shows than movies as well. This could also reflect the culture of the world, because people may feel that a TV show provides more in-depth character development and plot: it can span many seasons, each roughly the length of 7 movies on average. Movies, on the other hand, are limited to about 90 minutes on average and cannot explore as detailed a plot.
According to Dudley from Quora, TV shows are often cheaper to make and are usually 2-10 times faster in production. This supports the idea that Netflix is leaning towards TV shows because they have a higher profit margin or are cheaper to stream on its service.