Esteban Duran Sampdro
In Japan, anime is short for animation, so over there the word refers to any animated work, such as a Pixar film or a South Park episode. Outside Japan, the term refers to cartoons from Japan, particularly the ones with big flashy eyes and extravagant hairstyles. This isolation of the term in the "west" has downgraded the word to a genre, a mere format for children's shows.
However, this scrutiny also points out what it really is: a medium for storytelling that happens to excel at fictional stories. More specifically, it is an expansion of the animation medium, in the same way the Japanese might see it simply as animation from home. The reason to call it an expansion and not a subset (although an expansion is itself a sort of subset) is the variety of genres in anime: there are more genres in anime than the traditional 'western' ones.
If a set is categorized into genres, a plain subset of it cannot contain more genres than the original set does (disregarding combinations of the original ones). Anime, however, adds new categories to the possible genres. Hence anime expands the animation storytelling medium.
Before I lose myself in semantics, the point is that anime, as an isolated storytelling medium, has shown remarkable growth in genre variety. What is captivating about this source is that anime shows have a large number of new genres and/or combinations of them, and that the medium retains the attention of a huge fan community inside and outside of Japan.
Moreover, in a world more connected than ever, trends and inspirations flow faster and faster. A medium that shows such a high variety of categories of artistic expression is therefore worthy of study, as findings may help us understand the evolution of mediums of fiction, the appearance of anime trends, and/or correlations with trends in other sources of fiction.
MyAnimeList is "the world's largest anime and manga database and community," whose users make lists of anime or manga (Japanese comics), rate the shows, write reviews, and participate in community forums. The Kaggle data set, by user azathoth42, contains crawled data from MyAnimeList, with information about different animes and their metrics (score, genre, popularity, release/broadcasting dates...), user demographics, and user lists (anime, score, status...).
This tutorial seeks to explore and analyze the "growth" of the genre in anime over the years, and some of the metrics that may help understand the perception of these genres and different changes.
To explore, manipulate, and analyze the data, we use Python 3 with the following libraries: pandas, NumPy, Matplotlib, seaborn, Graphviz, IPython, and scikit-learn.
Below we load the CSV (comma-separated values) files for the different animes and for the lists of anime per user. Both tables are loaded into pandas data frames.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_graphviz
from IPython.display import Image
import pydotplus
import warnings
warnings.filterwarnings('ignore')
animedf = pd.read_csv('anime_cleaned.csv')
anime_listsdf = pd.read_csv('animelists_cleaned.csv')
animedf.head()
Above are the first five rows of the anime table, which contains about 6600 different anime entries. There are 33 different attributes per entry, some of which will be dropped, and others are explored in more detail later on. Descriptions of these values are given as the data processing goes on.
anime_listsdf.head(10)
Above are the first ten rows of the users' anime lists table. Each row is identified by the combination of its username and anime_id values, which indicates that the anime is in that user's list. There are 31 million different entries. The other attributes will be detailed as the table is cleaned.
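The claim that a row is identified by its username and anime_id pair can be checked directly. The snippet below is a minimal sketch on a made-up three-column table (the values and the `toy_lists` name are assumptions for illustration, not part of the data set); it uses `DataFrame.duplicated` to flag repeated pairs.

```python
import pandas as pd

# Toy stand-in for the users' lists table (values are made up for illustration).
toy_lists = pd.DataFrame({
    "username": ["ana", "ana", "ben", "ben"],
    "anime_id": [1, 2, 1, 1],
    "my_score": [8, 7, 9, 9],
})

# True for every row whose (username, anime_id) pair already appeared in an earlier row.
dup_mask = toy_lists.duplicated(subset=["username", "anime_id"])
n_duplicates = int(dup_mask.sum())
print(n_duplicates)
```

On the real table, a count of zero would support treating the pair as a key; anything else would point at rows that need a closer look.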
We will brush up the data in the anime table first.
We can remove several columns from the table due to redundancy or uselessness of the data in the context of analyzing genre.
The source attribute describes the inspiration of the story and/or style of the anime, such as a book or game. Looking at the possible source categories, using the unique function on the 'source' column of the anime dataframe, makes me wish there were more detail on the 'Game' and 'Other' labels, as Japan has a history of producing shows in order to promote other products, i.e. card games, toys, video games. A division of games into video and non-video games, and if possible a division of other into toys, food, or candy, might give more understanding of the appearance of new genres in anime. Also, there are a few sources that are too similar to, if not a sub-format of, the manga source, so those sources may simply be replaced by manga. Furthermore, the visual novel, novel, and light novel sources shouldn't be considered a single source, since they are truly different from one another: novel refers to novels in print; light novel to novels with pictures; and visual novel to a novel told with pictures through a video game. They might still sound similar, but they are distinct enough to have different influences on genre.
From the type column, the rows with the label Music may be removed, as that format is more related to the music medium than to the storytelling one. I don't mean to say that music can't tell a story, just that it's not its forte. For reference, the type OVA stands for original video animation and ONA for original net animation: basically animes that were not released on TV, as a movie, or in a special format, but were released independently on tape or online.
Among the different durations, one might notice that some animes have a length under a minute. Reading around, those animes turn out to be sub/hidden stories within other animes, which is too specific, so there is no need to include them in the analysis. They get removed by excluding rows whose duration contains the string "sec.", which is unique to animes under a minute.
I am also removing the animes whose status is not yet aired, since some of their metrics, such as score or popularity, wouldn't bear a realistic or stable value.
From the genre column, the animes that have no genre value are removed, since this tutorial focuses on genre. Furthermore, there are only two animes that lack a genre in the data.
Also since years are integer values, changing the value of aired_from_year to an int might be useful, or cleaner.
All of this data cleaning is done below.
#removal of columns
animedf.drop(['title_english', 'title_japanese', 'title_synonyms',
'image_url','airing', 'aired_string', 'aired', 'background',
'premiered', 'broadcast', 'related', 'producer', 'licensor',
'opening_theme', 'ending_theme'], axis = 1, inplace = True)
#replacing sources similar to manga, with manga
animedf['source'] = animedf['source'].replace(['4-koma manga', 'Web manga', 'Digital manga'], 'Manga')
#removing lines with type music
animedf = animedf[animedf['type'] != 'Music']
#removing animes with duration under a minute (regex = False so the period is matched literally)
animedf = animedf[~animedf['duration'].str.contains("sec.", regex = False)]
#removing animes that haven't aired
animedf = animedf[animedf['status'] != 'Not yet aired']
#removing animes without genre value
animedf = animedf[pd.notnull(animedf['genre'])]
#changing the data type of aired_from_year from double to int
animedf['aired_from_year'] = animedf['aired_from_year'].astype(int)
animedf.reset_index(drop = True, inplace = True)
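The duration filter relies on substring matching, and `Series.str.contains` treats its pattern as a regular expression by default, so a period matches any character. The sketch below illustrates the difference on made-up duration strings (the strings, including the hypothetical "7 secs", are assumptions and may not occur in the data set).

```python
import pandas as pd

# Made-up duration strings in the style of the data set; "7 secs" is hypothetical.
durations = pd.Series(
    ["24 min. per ep.", "45 sec.", "30 sec. per ep.", "1 hr. 55 min.", "7 secs"]
)

# Default regex=True: "." matches any character, so "secs" is matched as well.
regex_hits = durations.str.contains("sec.")
# regex=False: the pattern is taken literally, "sec" followed by a period.
literal_hits = durations.str.contains("sec.", regex=False)

# Rows that survive the literal filter.
kept = durations[~literal_hits].tolist()
print(kept)
```

Whether the two variants differ on the real column depends on the strings it contains; the literal form simply states the intent more exactly.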
The genre attribute, which has the form of a list of genres per anime, requires special attention. To unfold these sublists, 43 different columns, one per genre, are added to the data frame; if an anime has that genre, the column contains a 1, and a 0 otherwise.
Below is the code to add the columns to the dataframe, along with code to collect each anime's genres as a list of strings, which will be useful later on.
#list of the 43 different genres in anime
genres = ['Action', 'Adventure', 'Cars', 'Comedy', 'Dementia', 'Demons',
'Drama', 'Ecchi', 'Fantasy', 'Game', 'Harem', 'Hentai', 'Historical',
'Horror', 'Josei', 'Kids', 'Magic', 'Martial Arts', 'Mecha',
'Military', 'Music', 'Mystery', 'Parody', 'Police', 'Psychological',
'Romance', 'Samurai', 'School', 'Sci-Fi', 'Seinen', 'Shoujo',
'Shoujo Ai', 'Shounen', 'Shounen Ai', 'Slice of Life', 'Space',
'Sports', 'Super Power', 'Supernatural', 'Thriller', 'Vampire',
'Yaoi', 'Yuri']
#adding one 0/1 column per genre; the genre string is split first so that
#'Shoujo' does not also match 'Shoujo Ai' (same for 'Shounen' and 'Shounen Ai')
genre_lists = animedf['genre'].str.split(', ')
for g in genres:
    animedf[g] = genre_lists.apply(lambda gl, g = g: int(g in gl))
#collection of the list of genres in each anime
anime_genres_set = list(map(str, (animedf['genre'])))
animedf.head()
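As an alternative to the loop above, pandas can build the same kind of 0/1 columns in one call. This is a minimal sketch on a made-up table (the titles and genre strings are assumptions), using `Series.str.get_dummies`, which matches whole labels rather than substrings.

```python
import pandas as pd

# Toy table with the same comma-separated genre format (made-up rows).
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genre": ["Action, Shounen", "Romance, Shounen Ai", "Comedy"],
})

# One 0/1 column per distinct genre label found in the column.
dummies = toy["genre"].str.get_dummies(sep=", ")
toy_encoded = toy.join(dummies)

# For comparison: plain substring matching also flags 'Shounen Ai' as 'Shounen'.
substring_hits = toy["genre"].str.contains("Shounen", regex=False)
print(toy_encoded)
```

One design difference: `get_dummies` only creates columns for genres that actually occur, while the explicit list of 43 genres guarantees a fixed set of columns.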
Now we clean the users' anime lists data.
Some columns from this table may be removed as well.
The my_rewatching column contains either 1s, 0s, or null values, so filling the null values with zeros would tidy this piece of data.
Some username values are null. It would be difficult to treat the null users as a single user, especially since there could be overlapping rows with a null username and the same anime_id, so the rows for the null users are removed.
The possible values of the status attribute are described as follows: 1, watching; 2, completed; 3, on hold; 4, dropped; 6, plan to watch. But among the values found there are also 5, 55, 0, and 33. Since 5, 55, and 0 are not statuses with any meaning, they are removed, and the value 33 is interpreted as a typo and replaced with a 3.
Below is the code to do this data cleaning, with a print of the cleaned data frame.
#deleting columns from user anime list data frame
anime_listsdf.drop(['my_watched_episodes', 'my_start_date', 'my_finish_date', 'my_last_updated', 'my_rewatching_ep', 'my_tags'], axis=1, inplace = True)
#replacing null with zeros
anime_listsdf['my_rewatching'] = anime_listsdf['my_rewatching'].fillna(0)
#removing rows with null username value
anime_listsdf = anime_listsdf[pd.notnull(anime_listsdf['username'])]
#removing rows with an invalid status value
anime_listsdf = anime_listsdf[anime_listsdf['my_status'] != 5]
anime_listsdf = anime_listsdf[anime_listsdf['my_status'] != 55]
anime_listsdf = anime_listsdf[anime_listsdf['my_status'] != 0]
#changing to 3 the rows whose status value is 33
anime_listsdf['my_status'] = anime_listsdf['my_status'].replace([33], 3)
anime_listsdf.reset_index(drop = True, inplace = True)
anime_listsdf.head(10)
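The same status rules can also be written as a whitelist of valid codes. The following is a small sketch on made-up rows (the usernames and the `valid_status` name are assumptions); it replaces 33 first and then keeps only the documented statuses.

```python
import pandas as pd

# Made-up rows covering each case: valid codes, the 33 typo, junk codes, a null user.
toy_status = pd.DataFrame({
    "username": ["ana", "ben", None, "cy", "dee", "eli"],
    "my_status": [1, 33, 2, 55, 6, 0],
})

# Documented statuses: watching, completed, on hold, dropped, plan to watch.
valid_status = {1, 2, 3, 4, 6}

# Drop rows without a user, then treat 33 as a typo for 3.
cleaned = toy_status.dropna(subset=["username"]).copy()
cleaned["my_status"] = cleaned["my_status"].replace(33, 3)
# Keep only the documented status codes.
cleaned = cleaned[cleaned["my_status"].isin(valid_status)].reset_index(drop=True)
print(cleaned)
```

A whitelist has the side effect of also dropping any unexpected code that was not noticed during inspection, which may or may not be desirable.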
In this section, data will be explored via graphs and reformatted for further cleaning, understanding, and analysis. Although the data contains several variables to analyze, only a few will be considered, and the user data will not be considered in this section.
Below we see a graph of the count of sources in the anime data. Manga and Original are the most common sources of inspiration. After them come a handful of other sources of fiction, each with a similar amount of influence, followed by a tail of miscellaneous sources with low influence that might be better described if they were merged somehow.
plt.figure(figsize = (15, 5))
source_bar = sns.countplot(x = 'source', data = animedf)
Below there is code to merge Picture book, Card game, Music, and Radio into Other, and Novel and Book into Novel.
animedf['source'] = animedf['source'].replace(['Picture book', 'Card game', 'Music', 'Radio'], 'Other')
animedf['source'] = animedf['source'].replace(['Novel', 'Book'], 'Novel')
Below there is a graph of the sources after the change, and the categories with the lowest values are still the same as the ones in the original graph, but simply placed together.
plt.figure(figsize = (15, 5))
source_bar = sns.countplot(x = 'source', data = animedf)
Below there is a graph of the status of the anime. Seeing that the last update of this data was a year ago, discarding the status column is fine, as it might bear wrong information: many of the currently airing animes have probably finished airing. So there is also code to remove this column.
plt.figure(figsize = (15, 5))
status_bar = sns.countplot(x = 'status', data = animedf)
animedf.drop(['status'], axis = 1, inplace = True)
Below there is a graph of the different general scores per anime. Doesn't say much other than a few have made it to the top. Looks pretty though.
plt.figure(figsize = (150, 50))
score_bar = sns.countplot(x = 'score', data = animedf)
Looking at the data per anime in a single graph reveals how difficult it is to analyze some values individually per anime, so it is time to look at these values in some kind of aggregate. Since genre is the focus of this tutorial, the code below makes a new dataframe that has the different genres as rows and the years that the anime table spans as columns. This is a nicer way to group the values per genre than some awkward group by over a list of all the genre columns. I learned the method below, which unpacks column values that have the shape of a list of strings, in this Kaggle tutorial by Kevin Mario Gerard, an analysis of a movie dataset. This method is used throughout the tutorial. The resulting dataframe helps view and understand the data better, at least the growth of each genre per year.
#getting min and max year values, to define column values
min_year = animedf['aired_from_year'].min()
max_year = animedf['aired_from_year'].max()
#define data frame with range of years as column values and genres as rows
anime_genredf = pd.DataFrame(index = genres, columns = range(min_year, max_year + 1))
#filling the data frame with zeros
anime_genredf = anime_genredf.fillna(value = 0)
#getting an array of the years of individual animes
year = np.array(animedf['aired_from_year'])
y = 0
#dictionary to collect number of different genres in total
genre_count = {}
#for loop to iterate through the list of genres per anime
for g in anime_genres_set:
    #split the anime's genre into a list
    split_genre = list(map(str, g.split(', ')))
    #iterate through the genres of the anime
    for gs in split_genre:
        #add count for that genre in that year in the dataframe
        anime_genredf.loc[gs, year[y]] = anime_genredf.loc[gs, year[y]] + 1
        #count the dictionary
        if gs in genre_count:
            genre_count[gs] = genre_count[gs] + 1
        else:
            genre_count[gs] = 1
    y+=1
anime_genredf
Above is printed the resulting data frame. As one can see, only a few genres appear in the earlier years, whereas every genre has nonzero values in the latter ones. This clearly shows change, growth, and evolution of the storytelling medium.
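The genre-by-year table can also be sketched without an explicit loop. The example below is a hedged alternative on made-up rows (the toy values are assumptions, and this is not the method from the referenced Kaggle tutorial): it splits the genre string, uses `DataFrame.explode` to get one row per anime and genre pair, and counts with `pd.crosstab`.

```python
import pandas as pd

# Toy table with the two columns the counting step needs (made-up rows).
toy = pd.DataFrame({
    "genre": ["Action, Comedy", "Comedy", "Action, Drama", "Comedy, Drama"],
    "aired_from_year": [1999, 1999, 2000, 2000],
})

# One row per (anime, genre) pair.
long_df = toy.assign(genre=toy["genre"].str.split(", ")).explode("genre")

# Rows are genres, columns are years, cells count the anime.
genre_by_year = pd.crosstab(long_df["genre"], long_df["aired_from_year"])
print(genre_by_year)
```

Unlike the frame built above, `crosstab` only creates columns for years that occur in the data; a `reindex` over the full year range would be needed to get the empty years as zero columns.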
To get a further feeling for and understanding of this newly found format, let's make some graphs. First, we are going to look at the number/frequency of animes per genre. Below you'll find the code that puts the dictionary containing the total number of animes per genre, which was collected while building the years-per-genre dataframe, into a proper format for graphing, in this case a pandas Series object. The values are sorted to make the graphs easier to read, in particular the pie chart that is also generated by the rest of the code in the cell.
#making a series object containing the number of animes per genre in the data
gens = pd.Series(genre_count)
#sorting the data
gens = gens.sort_values(ascending = False)
#plotting of pie chart per genre
label = list(map(str, gens.keys()))
plt.figure(figsize = (10, 10))
plt.pie(gens, labels = label, autopct = '%1.1f%%')
plt.show()
From the pizza above, clearly the comedy and action slices are the only decent ones. One might say that the most frequent genres are similar to most of the genres in live-action or animation from the "west." But then in the seventh position a Japan-only genre appears: Shounen, which basically means for young males. Those tend to be the best-known anime, at least in the "west," for example, Dragon Ball Z or Naruto. Then from the 12th position on appear other Japan-specific genres such as Mecha, robots; Ecchi, sex appeal; Seinen, young adult men; and Shoujo, teenage females. To further examine this data, below there's a bar graph of the same data.
plt.figure(figsize = (15,5))
gens.plot(kind = 'bar')
The bar graph represents the frequency of each genre among the animes collected. This shows Comedy and Action at the top, a prevalence of the usual genres near the top, and then the Japan-specific ones with low values, which was also shown by the percentages in the pie chart above. But this makes sense: a sub-medium that comes from a storytelling medium will have its basis in the general genres of the supra-medium and then develop its own.
Now, using the new dataframe, we plot the change in the number of animes of the different genres through the years. It looks cool, but it's hard to really say anything with such a small palette of colors and so many values in the graph. So let us take a look at the same graph for the top and bottom five genres in the data frame.
anime_genredf.T.plot(grid = True)
plt.legend(loc = 9, bbox_to_anchor = (0.5, -0.2), ncol = 5)
Below you'll find the code to get the top and bottom five genres through the years in order to plot them, and the code to plot the top five genres.
#grab genres with top and bot 5 amount of anime from the dataset
gens_top10 = gens.nlargest(5)
gens_bot10 = gens.nsmallest(5)
#grab the subset of the dataframe by the top and low 5 genres
gen_top10df = anime_genredf.loc[gens_top10.keys()]
gen_bot10df = anime_genredf.loc[gens_bot10.keys()]
gen_top10df.T.plot(grid = True)
plt.legend(loc=9, bbox_to_anchor=(0.5, -0.2), ncol=5)
From here we see that these genres have grown a lot, from having very few, probably single-digit, values up to the 1980s. After that, each picks up numbers, with comedy and action growing almost exponentially, while adventure and drama seem to show more linear growth, with some variation along the years. They all take a dip at the end; this is probably due to the low count of anime in 2018, as this data covers only a few titles of that year.
Below you'll find the graph of the bottom five genres in the data set. Here we see that the genres had their first appearances at: Cars, ~65; Dementia, ~74; Shounen-ai, ~82; Yaoi, ~93; Yuri, maybe 04, hard to tell. All of them show a similar type of variation, where they may stay silent during periods of 5 to 10 years. Shounen-ai grabs the highest number and the highest popularity in the mid-teens. Cars and Dementia vary in a more similar fashion, between the early thousands and three animes per year.
To clarify what each of the Japanese genres displayed here stands for: Shounen-ai refers to romantic non-sexual relationships among boys, primarily intended for female audiences. Yaoi means boy love, again intended for female audiences. And conversely, yuri means girl love.
gen_bot10df.T.plot(grid = True)
plt.legend(loc=9, bbox_to_anchor=(0.5, -0.2), ncol=5)
Also, comparing the top 5 vs. bottom 5 graphs, one can see how much more prevalent the top 5 genres are than the strange or too Japan-specific ones: 3 of the top genres have been above 50 animes a year since the 2000s, and comedy has never gone below 100 since ~08, while the bottom genres either have a short history or have simply had low coverage over the years.
So far, this exploration of the genres in each year has been interesting. However, these are merely the frequencies of the genres throughout the years; what about the other variables? In the next section, we will explore more with the newly found format of genre per year, but reflecting attributes other than frequency. In the cell below there's code to generate three different dataframes with the same rows (genres) and columns (years), but with the values of popularity, score, and membership, each in a different data frame.
genres_pop = []
genres_score = []
genres_mem = []
#collecting special labels for each new data frame
for g in genres:
    g_pop = g + ' popularity'
    g_score = g + ' score'
    g_mem = g + ' members'
    genres_pop.append(g_pop)
    genres_score.append(g_score)
    genres_mem.append(g_mem)
#create dataframe for the popularity data per genre by year
anime_popdf = pd.DataFrame(index = genres_pop, columns = range(min_year, max_year+1))
anime_popdf = anime_popdf.fillna(value = 0.0)
#get the popularity column to use to fill values
anime_pop = np.array(animedf['popularity'])
#dictionary to count the popularity per genre
pop_count = {}
#create dataframe for the score data per genre by year
anime_scoredf = pd.DataFrame(index = genres_score, columns = range(min_year, max_year+1))
anime_scoredf = anime_scoredf.fillna(value = 0.0)
#get the score column to use to fill values
anime_score = np.array(animedf['score'])
#dictionary to count score per genre
score_count = {}
#create dataframe for the membership data per genre by year
anime_memdf = pd.DataFrame(index = genres_mem, columns = range(min_year, max_year+1))
anime_memdf = anime_memdf.fillna(value = 0.0)
#get the members column to use to fill values
anime_mem = np.array(animedf['members'])
#dictionary to count members per genre
mem_count = {}
#fill in the values of the newly created dataframes and dictionaries
y=0
for g in anime_genres_set:
    split_genre = list(map(str, g.split(', ')))
    for gs in split_genre:
        anime_popdf.loc[gs+' popularity', year[y]] = anime_popdf.loc[gs+' popularity', year[y]] + anime_pop[y]
        anime_scoredf.loc[gs+ ' score', year[y]] = anime_scoredf.loc[gs+ ' score', year[y]] + anime_score[y]
        anime_memdf.loc[gs+ ' members', year[y]] = anime_memdf.loc[gs+ ' members', year[y]] + anime_mem[y]
        if gs+' popularity' in pop_count:
            pop_count[gs+' popularity'] = pop_count[gs+' popularity'] + anime_pop[y]
            score_count[gs+' score'] = score_count[gs+' score'] + anime_score[y]
            mem_count[gs+' members'] = mem_count[gs+' members'] + anime_mem[y]
        else:
            pop_count[gs+' popularity'] = anime_pop[y]
            score_count[gs+' score'] = anime_score[y]
            mem_count[gs+' members'] = anime_mem[y]
    y+=1
#standardizing the score and popularity data frames as those values reflect a quality metric rather than a quantitative one
anime_popdf2 = (anime_popdf - anime_popdf.mean()) / anime_popdf.std(ddof = 1)
anime_scoredf2 = (anime_scoredf - anime_scoredf.mean()) / anime_scoredf.std(ddof = 1)
#averaging the count of the scores by the total number of animes in each genre in the data set
for g in genres:
    score_count[g+' score'] = score_count[g+' score']/genre_count[g]
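To make the standardization step concrete, here is a small worked sketch on a made-up genres-by-years table (the numbers are assumptions chosen so the arithmetic is easy to follow). Because `mean()` and `std()` work per column, each year is standardized across genres, so a value says how a genre compares with the other genres in that same year.

```python
import pandas as pd

# Made-up totals: rows are genres, columns are years.
totals = pd.DataFrame(
    {2000: [2.0, 4.0, 6.0], 2001: [10.0, 10.0, 40.0]},
    index=["Action", "Comedy", "Drama"],
)

# Column-wise z-score with the sample standard deviation (ddof=1).
z = (totals - totals.mean()) / totals.std(ddof=1)
print(z)
```

For the year 2000 the mean is 4 and the sample standard deviation is 2, so the standardized values are -1, 0, and 1. After the transformation every column has mean 0 and standard deviation 1, which removes the overall growth in the number of titles from the comparison.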
Next, to examine these three variables in these genres per year format, we will treat them as we did with genre frequency per year format. So we will see bar graphs of each variable per genre and line plots of each variable per genre over the years.
pops = pd.Series(pop_count)
pops = pops.sort_values(ascending = False)
plt.figure(figsize=(15,5))
pops.plot(kind='bar')
The bar graph above shows the popularity of each genre on the site, that is, the number of visits to animes of that genre. It highly resembles the genre frequency plot. However, there are some differences: slice of life anime has more visits than romance anime, even though there are more romance anime than slice of life ones. The same happens with cars: anime about cars are more visited than Josei, Thriller, Shounen Ai, Shoujo Ai, and Dementia anime, regardless of there being fewer anime about cars than in those genres. This shows that the anime produced doesn't reflect the global demand for genres. I say global as MyAnimeList is a site more popular outside of Japan than in Japan.
mems = pd.Series(mem_count)
mems = mems.sort_values(ascending = False)
plt.figure(figsize=(15,5))
mems.plot(kind='bar')
The bar graph above shows the user membership number of animes by genre. Again the frequency of a genre seems to influence this attribute, as Comedy and Action remain at the top, just like in the frequency count. But not completely. An example is the Kids genre, which has a mid-range frequency but is among the lowest in member count. This shows how underrated the medium is: just because it is expressed via drawings doesn't mean that it's meant for kids, and clearly the fans' genre taste reflects this.
scores = pd.Series(score_count)
scores = scores.sort_values(ascending = False)
plt.figure(figsize=(15,5))
scores.plot(kind='bar')
Above there's a bar graph with the average scores of the anime genres. This is clearly not influenced by the frequency of the genres, as it was averaged by the number of animes that contain that genre. Because these are averages, though, they all range from 6.5 to 7.5. Also, it's interesting to see comedy so low in the graph, with about 7, and Thriller so high, while it was one of the genres with a very low count. I guess those animes must mostly be good.
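The averaging behind this graph (sum of scores per genre divided by the number of animes in that genre) can be reproduced on a tiny example. This is only a sketch with made-up scores, using `explode` and `groupby` rather than the dictionaries built above.

```python
import pandas as pd

# Made-up scores; an anime contributes its score to every genre it lists.
toy = pd.DataFrame({
    "genre": ["Action, Comedy", "Comedy", "Action, Thriller"],
    "score": [7.0, 6.0, 9.0],
})

# One row per (anime, genre) pair, then the mean score of each genre.
long_df = toy.assign(genre=toy["genre"].str.split(", ")).explode("genre")
avg_score = long_df.groupby("genre")["score"].mean().sort_values(ascending=False)
print(avg_score)
```

In this toy case Thriller tops the ranking with a single title, which mirrors the caveat above: a genre with very few animes can reach a high average on the strength of one or two good shows.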
Below, there are three line plots, each showing the change of one of the three values being inspected, by genre per year.
#figsize is passed to plot because DataFrame.plot creates its own figure
anime_popdf2.T.plot(grid = True, figsize = (15, 8))
plt.legend(loc=9, bbox_to_anchor=(0.5, -0.2), ncol=5)
The graph above shows the change of a genre's popularity through time.
#figsize is passed to plot because DataFrame.plot creates its own figure
anime_scoredf2.T.plot(grid = True, figsize = (15, 5))
plt.legend(loc=9, bbox_to_anchor=(0.5, -0.2), ncol=5)
The line plot above shows the change of score through time in each genre. This graph and the one before are highly similar, which shows that the scores somewhat resemble the popularities of anime, or at least the scores attributed per anime genre.
#figsize is passed to plot because DataFrame.plot creates its own figure
anime_memdf.T.plot(grid = True, figsize = (15, 5))
plt.legend(loc=9, bbox_to_anchor=(0.5, -0.2), ncol=5)
Above you'll find a graph of the change in the number of MAL users that are members of animes of a genre, per year of release. This is interesting to see, as it shows that people tend to like the recent animes more than the old ones, even though there are very good old ones. But maybe that is because there was less anime back in the day than now.
These line plots bear the same palette and quantity of lines as the frequency of genres per year graph. So, to better explore this data, six more line graphs are provided, two per attribute: one with the top 5 genres in that attribute and the other with the bottom 5. Below you will find the code to print these six graphs and the graphs themselves. They won't be commented on, but are left for the reader to see for themselves and for further analysis.
pops_top10 = pops.nlargest(5)
pops_bot10 = pops.nsmallest(5)
scores_top10 = scores.nlargest(5)
scores_bot10 = scores.nsmallest(5)
mems_top10 = mems.nlargest(5)
mems_bot10 = mems.nsmallest(5)
pop_top10df = anime_popdf.loc[pops_top10.keys()]
pop_bot10df = anime_popdf.loc[pops_bot10.keys()]
score_top10df = anime_scoredf2.loc[scores_top10.keys()]
score_bot10df = anime_scoredf2.loc[scores_bot10.keys()]
mem_top10df = anime_memdf.loc[mems_top10.keys()]
mem_bot10df = anime_memdf.loc[mems_bot10.keys()]
#figsize is passed to plot because DataFrame.plot creates its own figure
pop_top10df.T.plot(grid = True, figsize = (15, 5))
plt.legend(loc=9, bbox_to_anchor=(0.5, -0.2), ncol=5)
pop_bot10df.T.plot(grid = True, figsize = (15, 5))
plt.legend(loc=9, bbox_to_anchor=(0.5, -0.2), ncol=5)
mem_top10df.T.plot(grid = True, figsize = (15, 5))
plt.legend(loc=9, bbox_to_anchor=(0.5, -0.2), ncol=5)
mem_bot10df.T.plot(grid = True, figsize = (15, 5))
plt.legend(loc=9, bbox_to_anchor=(0.5, -0.2), ncol=5)
score_top10df.T.plot(grid = True, figsize = (15, 5))
plt.legend(loc=9, bbox_to_anchor=(0.5, -0.2), ncol=5)
score_bot10df.T.plot(grid = True, figsize = (15, 5))
plt.legend(loc=9, bbox_to_anchor=(0.5, -0.2), ncol=5)
To end this data exploration section, here I print a dataframe of the count of different combinations of genres throughout the years. The method to build this data frame is the same as with the four previous ones. No graph will be provided, as sometimes a table is enough to explore data. Also because that is another rabbit hole I shan't enter right now.
genre_mix_count = {}
for g in anime_genres_set:
    split_genre = list(map(str, g.split(', ')))
    split_genre.sort()
    g_str = ' '.join(split_genre)
    if g_str in genre_mix_count:
        genre_mix_count[g_str] = genre_mix_count[g_str] + 1
    else:
        genre_mix_count[g_str] = 1
genre_mix_set = list(map(str, genre_mix_count.keys()))
anime_mix_genredf = pd.DataFrame(index = genre_mix_set, columns = range(min_year, max_year+1))
anime_mix_genredf = anime_mix_genredf.fillna(value = 0)
y=0
for g in anime_genres_set:
    split_genre = list(map(str, g.split(', ')))
    split_genre.sort()
    g_str = ' '.join(split_genre)
    anime_mix_genredf.loc[g_str, year[y]] = anime_mix_genredf.loc[g_str, year[y]] + 1
    y+=1
anime_mix_genredf.sort_index()
Not the whole table above is visible here, as the code found 2600 different combinations in this data set. However, the trend of having more cells filled in the later years than in the first ones is still interesting. This table is far more sparse than the one for uncombined genres, as this frame counts the animes strictly by their genre combinations.
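Counting combinations regardless of the order in which the genres are listed comes down to normalizing each genre list before counting. The sketch below does this with the standard library on made-up genre strings (an assumption for illustration); it uses sorted tuples as keys instead of space-joined strings, which keeps multi-word genres such as Slice of Life unambiguous.

```python
from collections import Counter

# Made-up genre strings; the first two list the same genres in a different order.
toy_genres = ["Comedy, Action", "Action, Comedy", "Slice of Life, Comedy", "Comedy"]

# Sort each genre list so that order does not matter, then count the tuples.
combo_counts = Counter(tuple(sorted(g.split(", "))) for g in toy_genres)

for combo, count in combo_counts.most_common():
    print(combo, count)
```

Tuples are hashable, so they can also serve directly as index labels if the counts are later moved into a dataframe.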
In this section, machine learning algorithms are applied. Specifically, I used the statistical procedure of Principal Component Analysis (PCA) and the Random Forest machine learning algorithm.
First, I use principal component analysis on the attributes of the dataframe that aren't genres: score, rank (the place of the anime in the data set according to its score), popularity, members, favorites (the number of people who consider the anime a favorite), and duration in minutes (per episode or for the film).
For PCA, each column of the dataframe is first mean-centered by subtracting the mean of the column from each value in the column. Then the covariance matrix of the dataframe is calculated. For a mean-centered data matrix $X$ with $m$ rows, where $x_i$ is the $i$-th row written as a column vector, this amounts to multiplying the transposed matrix by the matrix itself, $$ C = \frac{1}{m-1}\sum_{i=1}^m x_i x_i^T = \frac{1}{m-1}X^TX $$ which sums the products of the centered values and yields a square, symmetric matrix, which is essential for the algorithm. Then its eigenvalues are calculated; each eigenvalue is the variance of the data along one principal component, and so describes how relevant that component is in the dataframe.
# make a dataframe that contain the attributes to make PCA on
a_pca = animedf[['score', 'rank', 'popularity', 'members', 'favorites','duration_min']]
# mean center the columns (subtract each column's mean from its values)
a_pca = a_pca - a_pca.mean()
# make covariance matrix of the dataframe
a_pca_cov = np.cov(a_pca.T)
a_pca_cov[np.isnan(a_pca_cov)] = 0
#Calculate the eigen values of the covariance matrix.
a_pca_eig_val, a_pca_eig_vec = np.linalg.eig(a_pca_cov)
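i = 0
#print the eigen values next to the column names; the pairing is positional only,
#since each eigen value belongs to a principal component, not to a single column
for col in a_pca:
    print ('%-*s : %s' % (15, col, str(a_pca_eig_val[i])))
    i = i + 1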
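Above we see the results of PCA on the attributes that aren't a genre, with the eigenvalues printed next to the attribute names. Read positionally, score is the most relevant attribute. Then comes rank, which wasn't analyzed in the section before, followed by popularity, favorites, then members, and lastly duration in minutes. Strictly speaking, each eigenvalue measures the variance along a principal component, which is a combination of all the attributes, so this attribute-by-attribute reading is only a rough guide. Still, it makes me wish I had done PCA before the data exploration section, as I would have looked into the rank data rather than the members.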
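The PCA steps described above can be sketched end to end on synthetic data. The example below is a minimal illustration, not the analysis itself: the random table, its column scales, and the use of `np.linalg.eigh` (which is meant for symmetric matrices) with an explicit sort are assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic table: 200 rows, 3 attributes with very different spreads.
X = rng.normal(size=(200, 3)) * np.array([5.0, 1.0, 0.2])

# Mean-center each column, then form the covariance matrix X^T X / (m - 1).
Xc = X - X.mean(axis=0)
C = (Xc.T @ Xc) / (Xc.shape[0] - 1)

# eigh returns the eigenvalues of a symmetric matrix in ascending order.
eig_val, eig_vec = np.linalg.eigh(C)
order = np.argsort(eig_val)[::-1]          # largest variance first
eig_val, eig_vec = eig_val[order], eig_vec[:, order]

# Share of the total variance carried by each principal component.
explained = eig_val / eig_val.sum()
print(explained)
```

Each column of `eig_vec` holds the weights of the original attributes in one component, so inspecting the first column is one way to see which attributes drive the dominant direction. Attributes on very different scales, such as members versus score, dominate the result unless the columns are also scaled to unit variance first.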
Below, I start the process of doing PCA on the genres themselves. The cell below collects a dataframe containing only the genre columns for the PCA.
animedf_genres_pca = animedf[['Action',
'Adventure', 'Cars', 'Comedy', 'Dementia', 'Demons', 'Drama', 'Ecchi',
'Fantasy', 'Game', 'Harem', 'Hentai', 'Historical', 'Horror', 'Josei',
'Kids', 'Magic', 'Martial Arts', 'Mecha', 'Military', 'Music',
'Mystery', 'Parody', 'Police', 'Psychological', 'Romance', 'Samurai',
'School', 'Sci-Fi', 'Seinen', 'Shoujo', 'Shoujo Ai', 'Shounen',
'Shounen Ai', 'Slice of Life', 'Space', 'Sports', 'Super Power',
'Supernatural', 'Thriller', 'Vampire', 'Yaoi', 'Yuri']]
# another dataframe that will be useful for analysis later on (a copy of the genre columns)
animedf_scores = animedf_genres_pca.copy()
# the code below mean centers the column of each genre by subtracting the column mean
animedf_genres_pca = animedf_genres_pca - animedf_genres_pca.mean()
# here the covariance matrix is calculated for the genres dataframe
animedf_pca_cov = np.cov(animedf_genres_pca.T)
animedf_pca_cov[np.isnan(animedf_pca_cov)] = 0
#print the covariance matrix
animedf_pca_cov
# here the eigen values are calcualted
animedf_pca_eig_val, animedf_pca_eig_vec = np.linalg.eig(animedf_pca_cov)
# here the eigen values ar printed
i = 0
for col in animedf_genres_pca:
print ('%-*s : %s' % (15, col, str(animedf_pca_eig_val[i])))
i = i + 1
It is interesting to see that Action has the highest relevance here and not Comedy, even though Comedy was the most frequent genre. But this may just mean that the Action genre is mixed with other genres more often than Comedy is.
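One way to check whether a genre is more mixed with other genres is to count co-occurrences directly. The sketch below is only an illustration on a made-up 0/1 genre table (the `toy` rows are assumptions); the same two lines could be tried on the genre columns of the anime dataframe.

```python
import pandas as pd

# Toy 0/1 genre table (made-up rows, not the MyAnimeList data).
toy = pd.DataFrame({
    'Action':  [1, 1, 1, 0, 0],
    'Comedy':  [0, 0, 1, 1, 1],
    'Fantasy': [1, 1, 0, 0, 0],
    'Mecha':   [1, 0, 0, 0, 0],
})

# Co-occurrence counts: entry (a, b) is the number of titles tagged with both.
co = toy.T @ toy

# For each genre, the average number of *other* genres on its titles.
titles_per_genre = toy.sum()
others = (co.sum(axis=1) - titles_per_genre) / titles_per_genre
print(others.sort_values(ascending=False))
```

In this toy table Action and Comedy are equally frequent, yet Action titles carry more additional genres on average, which is the kind of pattern the paragraph above speculates about.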
Below, there is code to fit a random forest classifier to the genres dataframe. A random forest builds several decision trees via bagging, i.e. each tree is trained on a bootstrap sample of the rows, and the predictions of the different trees are combined for classification. Each tree uses the Gini index, $$ 1-\sum_{i=1}^{c}\big[p(i|t)\big]^2 $$ where p(i|t) is the fraction of entries of class i at node t and c is the number of classes, to decide which genre to use as a deciding factor. The Gini index measures how mixed the classes at a node are, and the more a split on a genre lowers the index, the more relevant that genre is considered as a branching factor.
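To make the Gini index concrete, here is a minimal sketch that evaluates the formula above for the labels at one node and for the two children of a split. The helper names `gini_impurity` and `split_impurity` and the example labels are hypothetical; they are not part of scikit-learn, which computes this internally.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity 1 - sum_i p(i|t)^2 of the class labels at one node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(left, right):
    """Size-weighted impurity of the two children produced by a split."""
    n = len(left) + len(right)
    return (len(left) * gini_impurity(left) + len(right) * gini_impurity(right)) / n

# Hypothetical node with six anime labelled by release decade.
node = ['1990s', '1990s', '1990s', '2010s', '2010s', '2010s']
print(gini_impurity(node))                  # 0.5
# A split that separates the decades perfectly leaves no impurity.
print(split_impurity(node[:3], node[3:]))   # 0.0
```

A tree prefers the split with the largest drop from the parent's impurity to the weighted impurity of its children, here from 0.5 to 0.0.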
# here the random forest classifier is created
clf = RandomForestClassifier()
clf = clf.fit(animedf_genres_pca, animedf['title'])
# loop to print the importance of each genre in the model
for genre, importance in zip(genres, clf.feature_importances_):
    print('%-*s : %s' % (15, genre, importance))
Here Comedy does come out as the most relevant factor of the random forest. Its high frequency must have made it a relevant deciding factor for the trees to branch on. What is also interesting is that the most relevant genres are ones that appear in the West as well. Among these are Action, Drama, Adventure and, well, Comedy.
The next few cells contain code to plot one of the resulting trees.
# get the class names (anime titles) without strange characters, such as emojis;
# export_graphviz expects them in the same sorted order as clf.classes_
titles = [str(t).encode('ascii', 'ignore').decode('ascii') for t in clf.classes_]
# get a tree to graph
estimator = clf.estimators_[5]
# write the tree to the file 'dot_data' in the DOT graph description language
export_graphviz(estimator, out_file='dot_data',
                feature_names=genres,
                class_names=titles,
                rounded=True, filled=True)
# build the graph to show
graph = pydotplus.graphviz.graph_from_dot_file('dot_data-Copy1')
You may notice that the name of the file is different from the one created in the cell before. That is because the original file contained about 7800 nodes, and pydotplus couldn't plot more than about 1300. So dot_data-Copy1 is a copy of the original file minus about 6500 nodes, which I removed directly in the file. It has about 1300 nodes, so I can at least show a good portion of the tree, which is plotted below.
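As an alternative to trimming the DOT file by hand, export_graphviz accepts a max_depth argument that limits how deep the drawing goes. The sketch below shows the idea on synthetic data (the arrays `X` and `y`, the feature names, and the forest settings are all assumptions); it is not how the files above were produced.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_graphviz

# Synthetic stand-in data (assumption): 300 rows of 0/1 "genre" flags and a
# label with 16 possible values that depends on the first four columns.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(300, 8))
y = X[:, 0] * 8 + X[:, 1] * 4 + X[:, 2] * 2 + X[:, 3]
feature_names = ['g%d' % i for i in range(X.shape[1])]

forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
tree = forest.estimators_[5]

# out_file=None returns the DOT source as a string; max_depth truncates the
# drawing (not the fitted tree) below the given depth.
full_dot = export_graphviz(tree, out_file=None, feature_names=feature_names)
short_dot = export_graphviz(tree, out_file=None, feature_names=feature_names,
                            max_depth=2)
print(len(full_dot.splitlines()), len(short_dot.splitlines()))
```

The truncated string can be passed to pydotplus.graph_from_dot_data in the same way as the full one, which keeps the top of the tree readable without editing any file.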
#1350 node graph
Image(graph.create_png())
graph2 = pydotplus.graphviz.graph_from_dot_file('dot_data-Copy1-Copy1')
I also plotted a tree with 100 nodes, below, to have a better tree to look at. As you can see, the genres were used to determine each anime's place in the tree.
#100 node graph
Image(graph2.create_png())
Now, the trees above aren't as informative as one would like. Considering that score came out as the most important attribute in the PCA above of the attributes that weren't genres, I'll make another anime genres dataframe, one that holds the score of the anime in each of its genre values, and run PCA and another random forest classifier on it.
# code to make the anime genres dataframe with scores: each genre value is
# multiplied by the score of its anime (rows are aligned on the index)
animedf_scores = animedf_scores.mul(animedf['score'], axis=0)
animedf_scores.head()
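The score weighting can be illustrated on a tiny made-up table. This is a sketch with assumed data (`flags` and `score` are hypothetical), showing how DataFrame.mul with axis=0 scales every row by the matching entry of a Series, matching rows by index label rather than by position.

```python
import pandas as pd

# Toy data (assumption): three anime, two genre flags, and their scores.
flags = pd.DataFrame({'Action': [1, 0, 1], 'Comedy': [0, 1, 1]},
                     index=[10, 11, 12])
score = pd.Series([8.5, 7.0, 6.0], index=[10, 11, 12])

# axis=0 lines the scores up with the rows, so each row is scaled by its own
# anime's score; a genre the anime does not have stays at 0.
flags_scored = flags.mul(score, axis=0)
print(flags_scored)
```

Because the rows are matched by label, this keeps working when the index does not start at 0 or has gaps, which a positional loop would not handle.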
Below, there is code to run PCA on the anime genres with score values.
# mean-center the anime genres with score values
# (a new dataframe, so animedf_scores itself stays unchanged)
animedf_scores_pca = animedf_scores - animedf_scores.mean()
# make the covariance matrix
animedf_scores_pca_cov = np.cov(animedf_scores_pca.T)
animedf_scores_pca_cov[np.isnan(animedf_scores_pca_cov)] = 0
# calculate and print the eigenvalues
animedf_scores_pca_eig_val, animedf_scores_pca_eig_vec = np.linalg.eig(animedf_scores_pca_cov)
for col, val in zip(animedf_scores_pca.columns, animedf_scores_pca_eig_val):
    print('%-*s : %s' % (15, col, val))
Here it is interesting to find that Thriller isn't significant, even though it was the genre with the best average score overall, while Action, Adventure, Cars, and Comedy have the highest values, and Military and Music have the lowest.
Below, there is code to fit a random forest classifier to the anime genres with scores dataframe.
# random forest classifier for the anime genres with scores as values dataframe.
clf2 = RandomForestClassifier()
clf2 = clf2.fit(animedf_scores, animedf['title'])
# print the importance of each genre when the scores of each anime are taken into account.
for genre, importance in zip(genres, clf2.feature_importances_):
    print('%-*s : %s' % (15, genre, importance))
# write the graph data of one tree of the random forest to the file 'dot_data_scores'
estimator2 = clf2.estimators_[5]
export_graphviz(estimator2, out_file='dot_data_scores',
                feature_names=genres,
                class_names=titles,
                rounded=True, filled=True)
Here once again Comedy comes out on top. It's followed by genres that are also found in Western storytelling mediums: Action, Fantasy, Drama, and Adventure. It's interesting that the results of the two random forests are so similar, yet slightly different. So score does influence an anime's value, but so does the genre it belongs to.
From here on I'll generate another random forest classifier, this time for the anime genres per year dataframe, and use one user's data to generate that user's anime genres per year, in order to calculate which years the user would prefer based on the genres they liked.
# Make a pandas Series containing the number of anime per year from the anime dataframe;
# this will be used as the target values for the model
year = np.array(animedf['aired_from_year'])
years = range(min_year, max_year + 1)
year_count = {}
for y in years:
    year_count[str(y)] = 0
for y in year:
    if str(y) in year_count:
        year_count[str(y)] = year_count[str(y)] + 1
year_counts = pd.Series(year_count)
# Make a pandas Series containing the user names and the number of anime
# rated by each user, from the user anime list dataframe.
# This was used to look at the sizes of the different lists;
# from there I picked one user to use as my guinea pig.
user_set = list(map(str, anime_listsdf['username']))
user_count = {}
for u in user_set:
    if u in user_count:
        user_count[u] = user_count[u] + 1
    else:
        user_count[u] = 1
users = pd.Series(user_count)
users = users.sort_values(ascending=False)
# Get the table of the anime from the user's list into a dataframe
anime_id_set = list(map(str, animedf['anime_id']))
fire_dragon_animesdf = anime_listsdf[anime_listsdf['username'] == 'Fire-Dragon']
fire_dragon_animesdf = fire_dragon_animesdf.reset_index(drop=True)
fire_dragon_anime_id_set = list(map(str, fire_dragon_animesdf['anime_id']))
# collect the matching rows of animedf, keeping the order of the user's list
fire_dragon_animesdf2 = pd.concat(
    [animedf[animedf['anime_id'] == int(a)] for a in fire_dragon_anime_id_set])
fire_dragon_genres_set = list(map(str, fire_dragon_animesdf2['genre']))
fire_dragon_year = np.array(fire_dragon_animesdf2['aired_from_year'])
fire_dragon_years = range(min_year, max_year + 1)
fire_dragon_year_count = {}
for y in fire_dragon_years:
    fire_dragon_year_count[str(y)] = 0
for y in fire_dragon_year:
    if str(y) in fire_dragon_year_count:
        fire_dragon_year_count[str(y)] = fire_dragon_year_count[str(y)] + 1
fire_dragon_year_counts = pd.Series(fire_dragon_year_count)
# get the user's anime list dataframe in the genre-per-year format
fire_dragon_genredf = pd.DataFrame(0, index=genres, columns=range(min_year, max_year + 1))
fire_dragon_genre_count = {}
# y indexes the user's anime, so the year is looked up in fire_dragon_year
for y, g in enumerate(fire_dragon_genres_set):
    for gs in g.split(', '):
        fire_dragon_genredf.loc[gs, fire_dragon_year[y]] += 1
        if gs in fire_dragon_genre_count:
            fire_dragon_genre_count[gs] += 1
        else:
            fire_dragon_genre_count[gs] = 1
Below you'll find the code that makes the model. The model uses the transpose of the anime genres by year frequency dataframe as the X input, so the columns/attributes are the genres and the rows/entries are the years. The target values are the total anime per year. The relevance of each genre is printed here too.
# fitting the random forest classifier model
# X variables: anime genre by year dataframe
# Y: number of anime per year
clf = RandomForestClassifier()
clf = clf.fit(anime_genredf.T.sort_index(), year_counts.sort_index())
# loop to print the importance of each genre in the model
for genre, importance in zip(genres, clf.feature_importances_):
    print('%-*s : %s' % (15, genre, importance))
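When reading the importances, it can help to label them with the column names of the training data and sort them. The sketch below does this on synthetic data (the columns `g0` to `g3` and the label are assumptions, chosen so that one feature clearly drives the target).

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in (assumption): 4 "genre" columns, label driven by 'g0'.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.integers(0, 2, size=(400, 4)),
                 columns=['g0', 'g1', 'g2', 'g3'])
y = X['g0']

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Label each importance with the column it was computed for, then sort.
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```

Taking the labels from the training columns avoids relying on a separate list being in the same order as the columns the model was fitted on.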
Below there is code to plot one decision tree of the generated model. Since the rows of the model were the years, the tree might be interpreted as the likelihood that an anime is from some year based on its genres. The leaves, which have no genre attached to them, simply specify the possible years at the end of the path of genres of a certain anime. It's interesting to see that the more Japanese or the too specific genres tend to appear in the later years, while the earlier years are defined by more traditional genres. It's also interesting to see that the early genres branch out to the right with earlier year values.
estimator = clf.estimators_[4]
dot_data = export_graphviz(estimator, out_file=None,
                           feature_names=genres,
                           class_names=list(year_count.keys()),
                           rounded=True, filled=True)
graph = pydotplus.graph_from_dot_data(dot_data)
Image(graph.create_png())
The cell below contains a prediction of the model based on a user's anime list. It also prints a list reflecting the years for which the user's anime list got a prediction, based on genre.
# prediction based on a MAL user's anime list
predicted_genres = clf.predict(fire_dragon_genredf.T)
# code to print the prediction results, one line per year
liist_years = list(year_count.keys())
for y, f in zip(liist_years, predicted_genres):
    print(y, f)
To wrap up this tutorial, let me say that even though I attempted to divide it into data cleaning, exploration, and analysis sections, every section showed components of the previous ones. This reflects how the process of analyzing data has no linear progress, but often requires stepping back to previous stages of analysis. This is not seen in the last stage of this tutorial only because there is more to explore in the model, as there is more to explore in any other stage. During each stage, several lines of questioning about the data were left unanswered, with more to clean, analyze, understand, and/or run models on. Further analysis can shed more light on anime genres; looking into the rank attribute, for instance, would have been interesting.
Furthermore, I shared some fun facts about anime throughout the tutorial. For those curious about the subject, here are a couple of anime YouTube channels that cover several topics explored here. But my main recommendation is to simply watch some anime.