Analysis of Genre in Anime

Esteban Duran Sampdro

Introduction

In Japan, anime is short for animation, so over there the word refers to any animated work, such as a Pixar film or a South Park episode. Outside Japan, by contrast, the term refers to cartoons from Japan, particularly the ones with big flashy eyes and extravagant hairstyles. This isolation of the term in the "west" has downgraded the word to a genre, a mere format for children's shows.

However, this scrutiny also points out what anime really is: a medium for storytelling that happens to excel at fictional stories. More specifically, it is an expansion of the animation medium, much as the Japanese might see it simply as animation from home. The reason to call it an expansion and not a subset (although an expansion is itself a sort of subset) is the variety of genres in anime; I mean that there are more genres in anime than the traditional 'western' ones.

A set that categorizes its members into genres cannot have a subset that contains more genres than the original set does (disregarding combinations of the original ones), yet anime adds to the possible genre categories. Hence anime expands the animated storytelling medium.

Before I lose myself in semantics, the point is that anime, taken as an isolated storytelling medium, has shown remarkable growth in genre variety. The fact that anime shows have a large number of new genres and/or combinations of them, and that they retain the attention of a huge fan community inside and outside of Japan, is what is captivating about this source.

Moreover, in a world more connected than ever, trends and inspirations flow faster and faster, which means that a medium showing such a high variety of categories of artistic expression is worthy of study. Findings may help to understand the evolution of mediums of fiction, the appearance of anime trends, and/or correlations with trends in other sources of fiction.

MyAnimeList is "the world's largest anime and manga database and community," whose users make lists of anime or manga (Japanese comics), rate the shows, write reviews, and participate in community forums. The Kaggle data set, by user azathoth42, contains crawled data from MyAnimeList, with information about different animes and their metrics (score, genre, popularity, release/broadcasting dates...), user demographics, and user lists (anime, score, status...).

This tutorial seeks to explore and analyze the "growth" of the genre in anime over the years, and some of the metrics that may help understand the perception of these genres and different changes.

Data Loading

To explore, manipulate, and analyze the data, I use Python 3 with the libraries pandas, NumPy, Matplotlib, seaborn, Graphviz, IPython, and scikit-learn.

Below we load the CSV (comma-separated values) files for the animes and for the lists of anime per user. Each table is loaded into a pandas data frame.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_graphviz
from IPython.display import Image
import pydotplus
import warnings
warnings.filterwarnings('ignore')

animedf = pd.read_csv('anime_cleaned.csv')
anime_listsdf = pd.read_csv('animelists_cleaned.csv')
animedf.head()

Above are the first five rows of the anime table, which contains about 6600 different anime entries. There are 33 different attributes per entry, some of which will be dropped, while others are explored in more detail later on. Descriptions of these values are given as the data processing goes on.

anime_listsdf.head(10)

Above are the first ten rows of the users' anime lists table. Each row is identified by the combination of its username and anime_id values, which indicates that the anime is in that user's list. There are 31 million entries. The other attributes will be detailed as the table is cleaned.
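Whether the (username, anime_id) pair really identifies a row can be checked directly. Here is a minimal sketch on a made-up toy table (the usernames, ids, and scores are invented for illustration, not taken from the data set):

```python
import pandas as pd

# Toy stand-in for the users' lists table; all values are made up.
toy_lists = pd.DataFrame({
    'username': ['ana', 'ana', 'bob', 'bob'],
    'anime_id': [1, 2, 1, 1],
    'my_score': [8, 7, 9, 9],
})

# True for every row whose (username, anime_id) pair already appeared earlier.
is_repeat = toy_lists.duplicated(subset=['username', 'anime_id'])
n_repeats = int(is_repeat.sum())
print(n_repeats)
```

On the real table, a count of zero would confirm that the pair works as a key.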

Data Washing

We will brush up the data in the anime table first.

We can remove several columns from the table due to redundancy or uselessness of the data in the context of analyzing genre.

title_english, title_japanese, title_synonyms, and background may be removed, as the title column is enough of a string representation of the anime.
airing, aired_string, aired, premiered, and broadcast may be removed, since aired_from_year is enough information about the temporal origin of the show.
producer and licensor may be removed, as studio provides sufficient information regarding the making of the anime, and the studio variable probably has a higher impact on genre than the producer or licensor. One may consider the studio the artists' house, whereas the others are more the money/resources houses.
opening_theme and ending_theme may be removed, as those are simply the intro and outro song names, along with image_url. These columns bear no relevance to this analysis.
Finally, related is removed, as each value has the form of a dictionary that tends to point to a manga_id or an anime_id from MyAnimeList, and since the manga data is not available, this column is of no use here.

The source attribute describes the inspiration for the story and/or style of the anime, such as a book or game. Looking at the possible source categories, obtained with the unique function on the 'source' column of the anime dataframe, makes me wish there were more detail in the 'Game' and 'Other' labels, as Japan has a history of producing shows in order to promote other products, e.g. card games, toys, and video games. A division of games into video and non-video games and, if possible, a division of Other into toys, food, or candy might give more insight into the appearance of new genres in anime. Also, there are a few sources that are too similar to, if not a sub-format of, the manga source, so those sources are replaced simply by Manga. On the other hand, the Visual novel, Novel, and Light novel sources shouldn't be merged into a single source, since they are truly different from one another: Novel refers to novels in print, Light novel to novels with pictures, and Visual novel to a novel told with pictures through a video game. They might still sound similar, but they are distinct enough to have different influences on genre.
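As a small illustration of how the source labels can be inspected, here is a sketch on a toy column (the values below are a made-up sample, not the real distribution):

```python
import pandas as pd

# Toy 'source' column; the real column comes from anime_cleaned.csv.
toy_sources = pd.Series(
    ['Manga', 'Original', 'Light novel', 'Manga', 'Game', 'Other', 'Visual novel'],
    name='source',
)

# unique() lists each label once; value_counts() adds how often each appears.
labels = sorted(toy_sources.unique())
counts = toy_sources.value_counts()
print(labels)
print(counts)
```

value_counts is handy here because it shows at a glance which labels are too rare to be worth keeping separate.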

From the type column, the rows with the label Music may be removed, as that format is more related to the music medium than to the storytelling one. I don't mean to say that music can't tell a story, just that it's not its forte. Also, for reference, the type OVA stands for original video animation and ONA for original net animation: animes that were not released in TV, movie, or special format, but were released independently on tape or online.

Among the different durations of anime, one might notice that some have a length under a minute. Reading around, those animes refer to sub/hidden stories within other animes, which is too specific, so there is no need to include them in the analysis. They get removed by excluding rows whose duration contains the string "sec.", which is unique to animes under a minute.
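One detail worth knowing for this filter: str.contains treats its pattern as a regular expression by default, so the dot in "sec." would match any character. Passing regex=False makes the match literal. A minimal sketch, assuming duration strings shaped like the made-up ones below:

```python
import pandas as pd

# Toy duration strings; the exact formats in the real column are an assumption.
toy_duration = pd.Series(['24 min. per ep.', '45 sec. per ep.', '1 hr. 30 min.', '30 sec.'])

# regex=False matches the literal text 'sec.' instead of 'sec' plus any character.
under_a_minute = toy_duration.str.contains('sec.', regex=False)
kept = toy_duration[~under_a_minute].tolist()
print(kept)
```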

I am also removing the animes whose status is Not yet aired, since some of their metrics, such as score or popularity, wouldn't bear a realistic or stable value.

From the genre column, the animes that have no genre value are removed, since this tutorial focuses on genre. Furthermore, there are only two animes that lack a genre in the data.

Also, since years are integer values, changing the type of aired_from_year to int makes the column cleaner.
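A plain conversion to int only works when no year is missing. Whether the real column has gaps is not stated here, so the following is just a sketch of two options on a toy column with one missing value:

```python
import pandas as pd

# Toy year column stored as floats, with one missing value (an assumption).
toy_years = pd.Series([1998.0, 2005.0, None, 2017.0], name='aired_from_year')

# Option 1: pandas' nullable integer dtype keeps the missing entry.
as_nullable = toy_years.astype('Int64')

# Option 2: drop the missing rows first, then use a plain int.
as_plain = toy_years.dropna().astype(int)
print(as_plain.tolist())
```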

All of this data cleaning is done below.

#removal of columns
animedf.drop(['title_english', 'title_japanese', 'title_synonyms',
              'image_url','airing', 'aired_string', 'aired', 'background', 
              'premiered', 'broadcast', 'related', 'producer', 'licensor', 
              'opening_theme', 'ending_theme'], axis = 1, inplace = True)

#replacing sources similar to manga with manga (assigning back instead of an inplace call on a column selection)
animedf['source'] = animedf['source'].replace(['4-koma manga', 'Web manga', 'Digital manga'], 'Manga')

#removing lines with type music
animedf = animedf[animedf['type'] != 'Music']

#removing animes with duration under a minute (regex = False so the dot is matched literally)
animedf = animedf[~animedf['duration'].str.contains("sec.", regex = False)]

#removing animes that haven't aired
animedf = animedf[animedf['status'] != 'Not yet aired']

#removing animes without genre value
animedf = animedf[pd.notnull(animedf['genre'])]

#changing the data type of aired_from_year from double to int
animedf['aired_from_year'] = animedf['aired_from_year'].astype(int)

animedf.reset_index(drop = True, inplace = True)

The genre attribute, which has the form of a list of genres per anime, requires special attention. In order to unfold these lists, 43 columns, one per genre, are added to the data frame; a column contains a 1 if the anime has that genre and a 0 otherwise.

Below is the code to add the columns to the dataframe, along with code to collect each anime's genres as a list of strings, which will be useful later on.

#list of the 43 different genres in anime
genres = ['Action', 'Adventure', 'Cars', 'Comedy', 'Dementia', 'Demons', 
          'Drama', 'Ecchi', 'Fantasy', 'Game', 'Harem', 'Hentai', 'Historical',
          'Horror', 'Josei', 'Kids', 'Magic', 'Martial Arts', 'Mecha',
          'Military', 'Music', 'Mystery', 'Parody', 'Police', 'Psychological',
          'Romance', 'Samurai', 'School', 'Sci-Fi', 'Seinen', 'Shoujo',
          'Shoujo Ai', 'Shounen', 'Shounen Ai', 'Slice of Life', 'Space',
          'Sports', 'Super Power', 'Supernatural', 'Thriller', 'Vampire',
          'Yaoi', 'Yuri']

#adding an int indicator column per genre in each anime
for g in genres:
    #exact match on the split list, so 'Shoujo' does not also match 'Shoujo Ai'
    animedf[g] = animedf['genre'].str.split(', ').apply(lambda gl: int(g in gl))
    
#collection of the list of genres in each anime
anime_genres_set = list(map(str, (animedf['genre'])))
animedf.head()
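One thing to watch when building indicator columns from a comma-separated string is substring matching: a plain substring test for 'Shoujo' is also true for 'Shoujo Ai'. The sketch below shows this on toy data and uses pandas' str.get_dummies as an alternative way to get exact indicators. It is not the code used in this tutorial, and the titles and genres are made up:

```python
import pandas as pd

# Toy table with a comma-separated genre string per anime.
toy = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'genre': ['Action, Shounen', 'Romance, Shoujo Ai', 'Comedy, Shoujo'],
})

# Substring test: row 1 ('Shoujo Ai') is wrongly flagged as 'Shoujo'.
substring_hits = toy['genre'].str.contains('Shoujo').tolist()

# One 0/1 column per distinct genre, matched exactly on the ', ' separator.
indicators = toy['genre'].str.get_dummies(sep=', ')
print(substring_hits)
print(indicators)
```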

Now we clean the users' anime lists data.

Some columns from this table may be removed as well:

my_start_date and my_finish_date are too user-specific. Also, the lack of values in some columns, such as my_finish_date, would make it difficult to derive data relevant to genre, such as the mean time taken to complete a show of a specific genre.
my_rewatching_ep is redundant with my_rewatching: it is enough to know whether a user deems the anime valuable enough to rewatch, without the number of episodes. Also, there was a user who claimed to have rewatched a lewd anime for more than thousands of years, so getting rid of internet troll data is good too.
my_watched_episodes is not relevant for this genre analysis, as simply knowing that the user was interested in an anime of a given genre, regardless of the number of episodes watched, will suffice.
my_last_updated and my_tags are irrelevant for this analysis, as the last time the user updated the show doesn't matter and the tags vary too much, often being unique to a show and/or user.

The my_rewatching column contains either 1s, 0s, or null values, so filling the null values with zeros tidies this piece of data.

Some username values are null. It would be difficult to treat the null user as a single user, especially since several rows with a null username could share the same anime_id, so the rows for null users are removed.

The possible values of the status attribute are described as follows: 1, watching; 2, completed; 3, on hold; 4, dropped; 6, plan to watch. But among the values found there are also 5, 55, 0, and 33. Since 5, 55, and 0 are not statuses with any meaning, those rows are removed, and the value 33 is interpreted as a typo and replaced with 3.

Below is the code to do this data washing, with a print of the cleaned data frame.

#deleting columns from user anime list data frame
anime_listsdf.drop(['my_watched_episodes', 'my_start_date', 'my_finish_date', 'my_last_updated', 'my_rewatching_ep', 'my_tags'], axis=1, inplace = True)

#replacing null with zeros
anime_listsdf['my_rewatching'] = anime_listsdf['my_rewatching'].fillna(0)

#removing rows with null username value
anime_listsdf = anime_listsdf[pd.notnull(anime_listsdf['username'])]

#removing rows with an invalid status value
anime_listsdf = anime_listsdf[anime_listsdf['my_status'] != 5]
anime_listsdf = anime_listsdf[anime_listsdf['my_status'] != 55]
anime_listsdf = anime_listsdf[anime_listsdf['my_status'] != 0]

#changing to 3 the rows whose status value is 33
anime_listsdf['my_status'] = anime_listsdf['my_status'].replace([33], 3)

anime_listsdf.reset_index(drop = True, inplace = True)
anime_listsdf.head(10)
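The same status rules can also be expressed as a whitelist of valid codes. The following is a minimal sketch on a made-up toy table, not a replacement for the cell above; VALID_STATUS is a name introduced here for illustration:

```python
import pandas as pd

# Toy stand-in for the users' lists table; all values are made up.
toy_status = pd.DataFrame({
    'username': ['ana', 'bob', None, 'cy', 'dee', 'eli'],
    'my_status': [1, 33, 2, 55, 6, 0],
    'my_rewatching': [0.0, None, 1.0, None, 1.0, 0.0],
})

# Status codes with a documented meaning (hypothetical constant name).
VALID_STATUS = [1, 2, 3, 4, 6]

cleaned = toy_status.dropna(subset=['username']).copy()
cleaned['my_rewatching'] = cleaned['my_rewatching'].fillna(0).astype(int)
# Treat 33 as a typo for 3 before filtering, so those rows are kept.
cleaned['my_status'] = cleaned['my_status'].replace(33, 3)
cleaned = cleaned[cleaned['my_status'].isin(VALID_STATUS)].reset_index(drop=True)
print(cleaned)
```

With a whitelist, any other unexpected code that shows up later is dropped as well, instead of having to be listed one by one.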

Data Stretching

In this section, the data is explored via graphs and reformatted for further cleaning, understanding, and analysis. Although the data contains several variables to analyze, only a few will be considered, and the user data will not be considered in this section.

Below we see a graph of the count of each source in the anime data. Manga and Original are the most common sources of inspiration. Then come several other sources of fiction with a similar amount of influence, followed by a tail of a few miscellaneous sources with low influence that might be described better if they were merged somehow.

plt.figure(figsize = (15, 5))
source_bar = sns.countplot(x = 'source', data = animedf)

Below is the code to merge Picture book, Card game, Music, and Radio into Other, and Novel and Book into Novel.

animedf['source'] = animedf['source'].replace(['Picture book', 'Card game', 'Music', 'Radio'], 'Other')
animedf['source'] = animedf['source'].replace(['Novel', 'Book'], 'Novel')

Below is a graph of the sources after the change. The categories with the lowest values are still the same as in the original graph, simply placed together.

plt.figure(figsize = (15, 5))
source_bar = sns.countplot(x = 'source', data = animedf)

Below is a graph of the status of the anime. Since this data was last updated a year ago, the status column may carry outdated information, as many of the currently airing anime have probably finished airing. So the code also removes this column.

plt.figure(figsize = (15, 5))
status_bar = sns.countplot(x = 'status', data = animedf)
animedf.drop(['status'], axis = 1, inplace = True)

Below is a graph of the different overall scores per anime. It doesn't say much, other than that a few have made it to the top. Looks pretty though.

plt.figure(figsize = (150, 50))
score_bar = sns.countplot(x = 'score', data = animedf)

Looking at the data per anime in a single graph reveals how difficult it is to analyze some values individually per anime, so it is time to look at these values in some kind of aggregate. Since genre is the focus of this tutorial, the code below makes a new dataframe that has the genres as rows and, as columns, the years that the anime table spans. This is a nicer way to group the values per genre than some awkward group-by over the list of all the genre columns. I learned the method below, which unpacks column values that have the shape of a list of strings, from a Kaggle tutorial by Kevin Mario Gerard analyzing a movie dataset. This method is used throughout the tutorial. The resulting dataframe helps to view and understand the data better, at least the growth of each genre per year.

#getting min and max year values, to define column values
min_year = animedf['aired_from_year'].min()
max_year = animedf['aired_from_year'].max()

#define data frame with range of years as column values and genres as rows
anime_genredf = pd.DataFrame(index = genres, columns = range(min_year, max_year + 1))

#filling the data frame with zeros
anime_genredf = anime_genredf.fillna(value = 0)

#getting an array of the years of individual animes
year = np.array(animedf['aired_from_year'])
y = 0

#dictionary to collect number of different genres in total
genre_count = {}

#for loop to iterate through the list of genres per anime
for g in anime_genres_set:
    
    #split the anime's genre into a list
    split_genre = list(map(str, g.split(', ')))
    
    #iterate through the genres's of the anime
    for gs in split_genre:
        
        #add count for that genre in that year in the dataframe
        anime_genredf.loc[gs, year[y]] = anime_genredf.loc[gs, year[y]] + 1
        
        #count the dictionary
        if gs in genre_count:
            genre_count[gs] = genre_count[gs] + 1
        else:
            genre_count[gs] = 1
    y+=1
anime_genredf

Above is the resulting data frame. As one can see, only a few genres appear in the earlier years, whereas every genre has nonzero values in the later ones. This clearly shows change, growth, and evolution of the storytelling medium.
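For comparison, the same genre-by-year count can be sketched with pandas' own reshaping tools, explode and crosstab. This is an alternative under toy data, not the method used above, and the rows below are made up:

```python
import pandas as pd

# Toy table: comma-separated genres and the first year of airing.
toy = pd.DataFrame({
    'genre': ['Action, Comedy', 'Comedy', 'Action, Drama', 'Comedy, Drama'],
    'aired_from_year': [1990, 1990, 1991, 1991],
})

# One row per (anime, genre) pair.
long = (
    toy.assign(genre=toy['genre'].str.split(', '))
       .explode('genre')
       .reset_index(drop=True)
)

# Rows are genres, columns are years, cells are counts of anime.
genre_by_year = pd.crosstab(long['genre'], long['aired_from_year'])
print(genre_by_year)
```

Unlike the loop above, crosstab only creates columns for years that actually occur; a reindex over the full range of years would be needed to show empty years as zeros.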

To get a further feel for and understanding of this new format, let's make some graphs. First, we look at the number of animes per genre. Below is the code that takes the dictionary with the total number of animes per genre, which was collected while building the genre-per-year dataframe, and puts it in a proper format for graphing, in this case a pandas Series. The values are sorted to make the graphs easier to read, in particular the pie chart that is generated by the rest of the code in the cell.

#making a series object containing the number of animes per genre in the data
gens = pd.Series(genre_count)

#sorting the data
gens = gens.sort_values(ascending = False)

#plotting of pie chart per genre
label = list(map(str, gens.keys()))
plt.figure(figsize = (10, 10))
plt.pie(gens, labels = label, autopct = '%1.1f%%')
plt.show()

From the pizza above, clearly the comedy and action slices are the only decent ones. One might say that the most frequent genres are similar to most of the genres in live-action or animation from the "west." But then in the seventh position a Japanese-only genre appears: Shounen, which basically means for young males. Those tend to be the best-known anime, at least in the "west," for example Dragon Ball Z or Naruto. Then from the 12th position on appear other Japan-specific genres such as Mecha, robots; Ecchi, sex appeal; Seinen, young adult men; and Shoujo, teenage females. To further examine this data, below there's a bar graph of the same data.

plt.figure(figsize = (15,5))
gens.plot(kind = 'bar')

The bar graph represents the frequency of each genre among the animes collected. It shows Comedy and Action at the top, a prevalence of the usual genres near the top, and then the Japan-specific ones with low values, as the percentages in the pie chart above also showed. But this makes sense: a sub-medium that comes from a storytelling medium will have its basis in the general genres of the supra-medium and then develop its own.

Now, using the new-format dataframe, we plot the change in population size of the different genres through the years. It looks cool, but it's hard to really say anything with such a small palette of colors and so many values in the graph. So let us take a look at the same graph for the top and bottom five genres in the data frame.

anime_genredf.T.plot(grid = True)
plt.legend(loc = 9, bbox_to_anchor = (0.5, -0.2), ncol = 5)

Below you'll find the code to get the top and bottom five genres through the years in order to plot them, and the code to plot the top five genres.

#grab genres with top and bot 5 amount of anime from the dataset
gens_top10 = gens.nlargest(5)
gens_bot10 = gens.nsmallest(5)

#grab the subset of the dataframe by the top and low 5 genres
gen_top10df = anime_genredf.loc[gens_top10.keys()]
gen_bot10df = anime_genredf.loc[gens_bot10.keys()]

gen_top10df.T.plot(grid = True)
plt.legend(loc=9, bbox_to_anchor=(0.5, -0.2), ncol=5)

From here we see that these genres have grown a lot, from very few, probably single-digit, values up to the 1980s. Then each picks up numbers, with comedy and action growing almost exponentially, while adventure and drama seem to show more linear growth, with some variation along the years. They all take a dip at the end; this is probably due to the low count of anime in 2018, as this data covers only a few titles from that year.

Below you'll find the graph of the bottom five genres in the data set. Here we see that the genres had their first appearances at around: Cars, ~65; Dementia, ~74; Shounen Ai, ~82; Yaoi, ~93; Yuri, maybe 04, hard to tell. All of them show a similar type of variation, where they may stay silent for periods of 5 to 10 years. Shounen Ai reaches the highest number and highest popularity in the mid-teens. Cars and Dementia vary in a more similar fashion, between the early thousands and three animes per year.

To clarify what each of the Japanese genres displayed here stands for: Shounen Ai refers to romantic, non-sexual relationships among boys, primarily intended for female audiences. Yaoi means boys' love, again intended for female audiences. And conversely, Yuri means girls' love.

gen_bot10df.T.plot(grid = True)
plt.legend(loc=9, bbox_to_anchor=(0.5, -0.2), ncol=5)

Also, comparing the top 5 vs. bottom 5 graphs, one can see how much more prevalent the top 5 genres are than the strange or the too Japan-specific ones, with 3 of the top genres above 50 animes a year since the 2000s, and comedy never going below 100 since ~08. The bottom genres, meanwhile, either have a short history or have simply had low coverage over the years.

So far, this exploration of genre in each year has been interesting. However, these are mere frequencies of the genres throughout the years; what about the other variables? In the next part we explore further with the genre-per-year format, but reflecting attributes other than frequency. In the cell below there's code to generate three dataframes with the same rows, genres, and columns, years, but holding the values of popularity, score, and membership, each in a different data frame.

genres_pop = []
genres_score = []
genres_mem = []

#collecting special labels for each new data frame
for g in genres:
    g_pop = g + ' popularity'
    g_score = g + ' score'
    g_mem = g + ' members'
    genres_pop.append(g_pop)
    genres_score.append(g_score)
    genres_mem.append(g_mem)

#create dataframe for the popularity data per genre by year
anime_popdf = pd.DataFrame(index = genres_pop, columns = range(min_year, max_year+1))
anime_popdf = anime_popdf.fillna(value = 0.0)

#get the popularity column to use to fill values
anime_pop = np.array(animedf['popularity'])

#dictionary to count the popularity per genre
pop_count = {}

#create dataframe for the score data per genre by year
anime_scoredf = pd.DataFrame(index = genres_score, columns = range(min_year, max_year+1))
anime_scoredf = anime_scoredf.fillna(value = 0.0)

#get the score column to use to fill values
anime_score = np.array(animedf['score'])

#dictionary to count score per genre
score_count = {}

#create dataframe for the membership data per genre by year
anime_memdf = pd.DataFrame(index = genres_mem, columns = range(min_year, max_year+1))
anime_memdf = anime_memdf.fillna(value = 0.0)

#get the members column to use to fill values
anime_mem = np.array(animedf['members'])

#dictionary to count members per genre
mem_count = {}

#fill in the values of the newly created dataframes and dictionaries
y=0
for g in anime_genres_set:
    split_genre = list(map(str, g.split(', ')))
    for gs in split_genre:
        anime_popdf.loc[gs+' popularity', year[y]] = anime_popdf.loc[gs+' popularity', year[y]] + anime_pop[y]
        anime_scoredf.loc[gs+ ' score', year[y]] = anime_scoredf.loc[gs+ ' score', year[y]] + anime_score[y]
        anime_memdf.loc[gs+ ' members', year[y]] = anime_memdf.loc[gs+ ' members', year[y]] + anime_mem[y]
        if gs+' popularity' in pop_count:
            pop_count[gs+' popularity'] = pop_count[gs+' popularity'] + anime_pop[y]
            score_count[gs+' score'] = score_count[gs+' score'] + anime_score[y]
            mem_count[gs+' members'] = mem_count[gs+' members'] + anime_mem[y]
        else:
            pop_count[gs+' popularity'] = anime_pop[y]
            score_count[gs+' score'] = anime_score[y]
            mem_count[gs+' members'] = anime_mem[y]
    y+=1
    
#standardizing the score and popularity data frames, as those values reflect a quality metric rather than a quantitative one
anime_popdf2 = (anime_popdf - anime_popdf.mean()) / anime_popdf.std(ddof = 1)
anime_scoredf2 = (anime_scoredf - anime_scoredf.mean()) / anime_scoredf.std(ddof = 1)

#averaging the summed scores by the total number of animes in each genre in the data set
for g in genres:
    score_count[g+' score'] = score_count[g+' score']/genre_count[g]
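The standardization used in the cell above subtracts the column mean and divides by the column standard deviation, so each year is standardized across genres. A minimal sketch on a toy frame (the labels and numbers are made up):

```python
import numpy as np
import pandas as pd

# Toy genre-by-year frame of summed scores.
toy = pd.DataFrame(
    {1990: [1.0, 2.0, 3.0], 1991: [10.0, 20.0, 60.0]},
    index=['Action score', 'Comedy score', 'Drama score'],
)

# mean() and std() work column by column, so every year ends up
# with mean 0 and sample standard deviation 1 across the genres.
standardized = (toy - toy.mean()) / toy.std(ddof=1)
print(standardized)
```

A year in which every genre has the same value has a standard deviation of zero, so that column would come out as NaN.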

Next, to examine these three variables in the genre-per-year format, we treat them as we did the genre frequency. So we will see bar graphs of each variable per genre and line plots of each variable per genre over the years.

pops = pd.Series(pop_count)
pops = pops.sort_values(ascending = False)
plt.figure(figsize=(15,5))
pops.plot(kind='bar')

The bar graph above shows the popularity of each genre on the site, read here as the number of visits to animes of that genre. It highly resembles the genre frequency plot. However, there are some differences: the Slice of Life genre has more visits than the Romance genre, even though there are more Romance anime than Slice of Life ones. The same happens with Cars: anime about cars are more visited than Josei, Thriller, Shounen Ai, Shoujo Ai, and Dementia anime, regardless of there being fewer anime about cars than in those genres. This suggests that the anime produced doesn't reflect the global demand for genres. I say global as MyAnimeList is a site more popular outside Japan than in Japan. One caveat: MyAnimeList reports popularity as a rank in which a lower number means a more popular title, so if the popularity column holds that rank, these sums should be read with caution.

mems = pd.Series(mem_count)
mems = mems.sort_values(ascending = False)
plt.figure(figsize=(15,5))
mems.plot(kind='bar')

The bar graph above shows the number of user memberships of animes by genre. Again the frequency of a genre seems to influence this attribute, as Comedy and Action remain at the top, just like in the frequency count. But not completely. An example is the Kids genre, which has a mid-range frequency but is among the lowest in member count. This shows how underrated the medium is: just because it is expressed via drawings doesn't mean that it's meant for kids, and clearly the fans' taste in genre reflects this.

scores = pd.Series(score_count)
scores = scores.sort_values(ascending = False)
plt.figure(figsize=(15,5))
scores.plot(kind='bar')

Above there's a bar graph with the average scores of the anime genres. This is clearly not influenced by the frequency of the genres, as each sum was averaged by the number of animes that contain that genre. Because these are averages, they all range from 6.5 to 7.5. Also, it's interesting to see Comedy so low in the graph, at about 7, and Thriller so high while it was one of the genres with a very low count. I guess those animes must mostly be good.
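The average score per genre can also be sketched with explode and groupby. This is an alternative formulation on made-up toy data, not the dictionary-based code used above:

```python
import pandas as pd

# Toy table: comma-separated genres and one score per anime.
toy = pd.DataFrame({
    'genre': ['Action, Comedy', 'Comedy', 'Action, Thriller'],
    'score': [7.0, 6.0, 9.0],
})

# One row per (anime, genre) pair, then the mean score within each genre.
long = toy.assign(genre=toy['genre'].str.split(', ')).explode('genre')
mean_score = long.groupby('genre')['score'].mean().sort_values(ascending=False)
print(mean_score)
```

Note how a genre with a single highly rated title, like Thriller in this toy example, lands at the top, which is the same effect a low-count genre can have in the real graph.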

Below there are three line plots, each showing the change by genre per year of one of the three values being inspected.

#DataFrame.plot opens its own figure, so the size is passed to it directly
anime_popdf2.T.plot(grid = True, figsize = (15, 8))
plt.legend(loc=9, bbox_to_anchor=(0.5, -0.2), ncol=5)

The graph above shows the change of a genre's popularity through time.

#DataFrame.plot opens its own figure, so the size is passed to it directly
anime_scoredf2.T.plot(grid = True, figsize = (15, 5))
plt.legend(loc=9, bbox_to_anchor=(0.5, -0.2), ncol=5)

The line plot above shows the change of score through time in each genre. This graph and the one before are highly similar, which shows that the scores somewhat resemble the popularities of anime, or at least the scores attributed per anime genre.

#DataFrame.plot opens its own figure, so the size is passed to it directly
anime_memdf.T.plot(grid = True, figsize = (15, 5))
plt.legend(loc=9, bbox_to_anchor=(0.5, -0.2), ncol=5)

Above you'll find a graph of the number of MAL users that are members of animes of a genre, per year of release. This is interesting to see, as it suggests that people tend to like recent animes more than old ones, even though there are very good old ones. But maybe that is because there was less anime back in the day than now.

These line plots bear the same palette and quantity of lines as the frequency of genres per year graph. So to better explore this data, six more line graphs are provided, two per attribute: one with the top 5 genres for that attribute and the other with the bottom 5. Below you will find the code to print these six graphs and the graphs themselves. They aren't commented on, but are left for the reader to examine and for further analysis.

pops_top10 = pops.nlargest(5)
pops_bot10 = pops.nsmallest(5)
scores_top10 = scores.nlargest(5)
scores_bot10 = scores.nsmallest(5)
mems_top10 = mems.nlargest(5)
mems_bot10 = mems.nsmallest(5)

pop_top10df = anime_popdf.loc[pops_top10.keys()]
pop_bot10df = anime_popdf.loc[pops_bot10.keys()]
score_top10df = anime_scoredf2.loc[scores_top10.keys()]
score_bot10df = anime_scoredf2.loc[scores_bot10.keys()]
mem_top10df = anime_memdf.loc[mems_top10.keys()]
mem_bot10df = anime_memdf.loc[mems_bot10.keys()]
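The selection step used above works in two moves: nlargest or nsmallest picks labels from the per-genre totals, and .loc then pulls the matching rows out of the genre-by-year frame. A minimal sketch on toy data (two genres instead of five, and all numbers made up):

```python
import pandas as pd

# Toy per-genre totals and a matching genre-by-year frame.
totals = pd.Series({'Comedy': 50, 'Action': 40, 'Drama': 20, 'Cars': 2, 'Yuri': 1})
by_year = pd.DataFrame(
    {2016: [30, 25, 12, 1, 0], 2017: [20, 15, 8, 1, 1]},
    index=totals.index,
)

# Labels of the two largest and two smallest totals.
top2 = totals.nlargest(2)
bottom2 = totals.nsmallest(2)

# Rows of the yearly frame for just those labels.
top_by_year = by_year.loc[top2.index]
bottom_by_year = by_year.loc[bottom2.index]
print(top_by_year)
print(bottom_by_year)
```

This only works when the index labels of the totals and of the yearly frame match exactly, which is why the labels carry the same suffix in both.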

#DataFrame.plot opens its own figure, so the size is passed to it directly
pop_top10df.T.plot(grid = True, figsize = (15, 5))
plt.legend(loc=9, bbox_to_anchor=(0.5, -0.2), ncol=5)

#DataFrame.plot opens its own figure, so the size is passed to it directly
pop_bot10df.T.plot(grid = True, figsize = (15, 5))
plt.legend(loc=9, bbox_to_anchor=(0.5, -0.2), ncol=5)

#DataFrame.plot opens its own figure, so the size is passed to it directly
mem_top10df.T.plot(grid = True, figsize = (15, 5))
plt.legend(loc=9, bbox_to_anchor=(0.5, -0.2), ncol=5)

#DataFrame.plot opens its own figure, so the size is passed to it directly
mem_bot10df.T.plot(grid = True, figsize = (15, 5))
plt.legend(loc=9, bbox_to_anchor=(0.5, -0.2), ncol=5)

#DataFrame.plot opens its own figure, so the size is passed to it directly
score_top10df.T.plot(grid = True, figsize = (15, 5))
plt.legend(loc=9, bbox_to_anchor=(0.5, -0.2), ncol=5)

#DataFrame.plot opens its own figure, so the size is passed to it directly
score_bot10df.T.plot(grid = True, figsize = (15, 5))
plt.legend(loc=9, bbox_to_anchor=(0.5, -0.2), ncol=5)

To end this data exploration section, here I print a dataframe of the count of the different combinations of genres throughout the years. The method to build this data frame is the same as for the four previous ones. No graph is provided, as sometimes a table is enough to explore data. Also because that is another rabbit hole I shan't enter right now.

genre_mix_count = {}
for g in anime_genres_set:
    split_genre = list(map(str, g.split(', ')))
    split_genre.sort()
    g_str = ' '.join(split_genre)
    if g_str in genre_mix_count:
        genre_mix_count[g_str] = genre_mix_count[g_str] + 1
    else:
        genre_mix_count[g_str] = 1
genre_mix_set = list(map(str, genre_mix_count.keys()))
anime_mix_genredf = pd.DataFrame(index = genre_mix_set, columns = range(min_year, max_year+1))
anime_mix_genredf = anime_mix_genredf.fillna(value = 0)
y=0
for g in anime_genres_set:
    split_genre = list(map(str, g.split(', ')))
    split_genre.sort()
    g_str = ' '.join(split_genre)
    anime_mix_genredf.loc[g_str, year[y]] = anime_mix_genredf.loc[g_str, year[y]] + 1
    y+=1
anime_mix_genredf.sort_index()

Only part of the table is visible here, since it contains about 2600 different genre combinations in this data set. The trend of more cells being filled in the later years than in the first ones still holds. The frame is far sparser than the one for single genres, though, because it counts each anime strictly by its exact combination of genres.
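
The counting approach described above can also be sketched more compactly with pandas. The snippet below is a minimal illustration on a made-up three-row frame: the column names `genre` and `aired_from_year` mirror the ones used in this tutorial, while the toy rows and the helper name `combo_key` are assumptions made for the example.

```python
import pandas as pd

# Toy stand-in for the anime data frame (hypothetical rows, for illustration only)
toy = pd.DataFrame({
    'genre': ['Comedy, Action', 'Action, Comedy', 'Drama'],
    'aired_from_year': [2001, 2002, 2001],
})

def combo_key(genre_string):
    """Normalise a genre string so that the order of the genres does not matter."""
    return ' '.join(sorted(genre_string.split(', ')))

# One row per combination, one column per year, cells hold the counts
combo_by_year = pd.crosstab(toy['genre'].map(combo_key), toy['aired_from_year'])
print(combo_by_year)
```

Sorting the genres before joining them is what makes 'Comedy, Action' and 'Action, Comedy' land in the same row, as in the loop above.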

Data folding¶

In this section, machine learning algorithms are applied. Specifically, I used the statistical procedure of Principal Component Analysis (PCA) and the Random Forest machine learning algorithm.

First, I apply principal component analysis to the attributes of the data frame that aren't genres: score, rank (the place of the anime in the data set according to its score), popularity, members, favorites (the number of people who consider the anime a favorite) and duration in minutes (per episode, or of the film).

For PCA, each column of the data frame is first mean-centered by subtracting the mean of the column from each value in the column. Then the covariance matrix of the data frame is calculated. With the $m$ observations as the rows of the centered matrix $X$, this is $$ C = \frac{1}{m-1}X^TX $$ which is what np.cov computes. The diagonal of $C$ holds the variance of each attribute and the off-diagonal entries hold the covariances between pairs of attributes, so the matrix is square and symmetric, which is essential for the algorithm. Then the eigenvalues and eigenvectors of $C$ are calculated. Each eigenvector is a principal component (a weighted combination of the attributes), and its eigenvalue is the variance of the data along that component.
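
The steps described above can be sketched with NumPy on made-up data. The array `X`, its size and its scale factors are assumptions for illustration, not values from the data set. `np.linalg.eigh` is used here instead of `np.linalg.eig` because the covariance matrix is symmetric; that is a substitution in this sketch, not the call used in this tutorial.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 200 observations (rows) of 3 made-up attributes (columns)
X = rng.normal(size=(200, 3)) * np.array([5.0, 1.0, 0.2])

# 1. Mean-center each column
Xc = X - X.mean(axis=0)

# 2. Covariance matrix (3 x 3); same result as np.cov(X, rowvar=False)
m = Xc.shape[0]
C = Xc.T @ Xc / (m - 1)

# 3. Eigendecomposition of the symmetric matrix (eigenvalues come back ascending)
eig_val, eig_vec = np.linalg.eigh(C)

# 4. Reorder the components from largest to smallest variance
order = np.argsort(eig_val)[::-1]
eig_val, eig_vec = eig_val[order], eig_vec[:, order]

# Share of the total variance carried by each component
explained = eig_val / eig_val.sum()
print(explained)
```

Each column of `eig_vec` is one component, and the entries of that column say how much every original attribute contributes to it.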

# make a dataframe that contains the attributes to run PCA on
a_pca = animedf[['score', 'rank', 'popularity', 'members', 'favorites','duration_min']]

# mean-center the columns
# (np.cov also centers internally, so this only makes the step explicit)
a_pca = a_pca - a_pca.mean()

# make the covariance matrix of the dataframe; NaN entries are set to 0
a_pca_cov = np.cov(a_pca.T)
a_pca_cov[np.isnan(a_pca_cov)] = 0

# calculate the eigenvalues and eigenvectors of the covariance matrix
a_pca_eig_val, a_pca_eig_vec = np.linalg.eig(a_pca_cov)

# print the eigenvalues in the order numpy returns them
# (the column name on each line is only a positional label)
for col, val in zip(a_pca, a_pca_eig_val):
    print ('%-*s : %s' % (15, col, str(val)))

score           : 11330329171.004313
rank            : 9931736.137941651
popularity      : 4751504.706314461
members         : 0.4007824837441376
favorites       : 629.4417294705529
duration_min    : 0.0

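Above we see the results of PCA on the attributes that aren't genres. Reading the printout line by line, score gets the largest value, followed by rank (which wasn't analyzed in the previous section), popularity, favorites, members and, lastly, duration in minutes. These labels should be read with care: np.linalg.eig returns one eigenvalue per principal component, in no guaranteed order, rather than one per column, and a value of about 1.1e10 cannot be the variance of a score that never exceeds 10, so it most likely reflects a large-valued column such as members. Taken at face value, the printout makes me wish I had done PCA before the data stretching section, as I would have looked into the rank data rather than the members.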
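
Attributes measured on very different scales, such as a score out of 10 next to member counts in the hundreds of thousands, let the largest-scale column dominate the covariance matrix. A common remedy is to standardise each column before PCA. The sketch below shows the effect on invented numbers: the column names are borrowed from this tutorial, but the values are random and the variable names are assumptions for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
toy = pd.DataFrame({
    'score': rng.uniform(1, 10, size=500),
    'members': rng.uniform(0, 1_000_000, size=500),
})

# Raw covariance: the variance of 'members' dwarfs that of 'score'
raw_val, raw_vec = np.linalg.eigh(np.cov(toy.T))

# Standardise to zero mean and unit variance, then repeat
z = (toy - toy.mean()) / toy.std()
std_val, std_vec = np.linalg.eigh(np.cov(z.T))

# Loadings: contribution of each original column to each component
# (columns are the components, in ascending order of eigenvalue)
loadings = pd.DataFrame(raw_vec, index=toy.columns)
print(loadings.abs().round(3))
```

In the raw version the component with the largest eigenvalue is almost entirely `members`; after standardising, both columns enter on equal footing.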

Below, I start the process of doing PCA on the genres themselves. The cell below collects a data frame with only the genre columns to run the PCA on.

animedf_genres_pca = animedf[['Action',
       'Adventure', 'Cars', 'Comedy', 'Dementia', 'Demons', 'Drama', 'Ecchi',
       'Fantasy', 'Game', 'Harem', 'Hentai', 'Historical', 'Horror', 'Josei',
       'Kids', 'Magic', 'Martial Arts', 'Mecha', 'Military', 'Music',
       'Mystery', 'Parody', 'Police', 'Psychological', 'Romance', 'Samurai',
       'School', 'Sci-Fi', 'Seinen', 'Shoujo', 'Shoujo Ai', 'Shounen',
       'Shounen Ai', 'Slice of Life', 'Space', 'Sports', 'Super Power',
       'Supernatural', 'Thriller', 'Vampire', 'Yaoi', 'Yuri']].copy()

# a separate copy of the genre columns that will be useful for analysis later on
animedf_scores = animedf_genres_pca.copy()

# the code below mean-centers the column of each genre
animedf_genres_pca = animedf_genres_pca - animedf_genres_pca.mean()

# here the covariance matrix is calculated for the genres dataframe
animedf_pca_cov = np.cov(animedf_genres_pca.T)
animedf_pca_cov[np.isnan(animedf_pca_cov)] = 0

# print the covariance matrix
animedf_pca_cov

array([[ 1.01287325e-01,  1.77877693e-02,  1.16684912e-03, ...,
         1.85906525e-03, -7.18214675e-04, -1.21331871e-04],
       [ 1.77877693e-02,  1.03436001e-01,  5.31654237e-05, ...,
        -1.52547963e-03, -7.11608138e-04, -2.54855448e-04],
       [ 1.16684912e-03,  5.31654237e-05,  6.21442311e-03, ...,
        -9.47739758e-05, -2.71032923e-05, -9.70677725e-06],
       ...,
       [ 1.85906525e-03, -1.52547963e-03, -9.47739758e-05, ...,
         1.46033357e-02, -6.48544863e-05, -2.32269957e-05],
       [-7.18214675e-04, -7.11608138e-04, -2.71032923e-05, ...,
        -6.48544863e-05,  4.26977748e-03, -6.64241473e-06],
       [-1.21331871e-04, -2.54855448e-04, -9.70677725e-06, ...,
        -2.32269957e-05, -6.64241473e-06,  1.53773093e-03]])

# here the eigenvalues are calculated
animedf_pca_eig_val, animedf_pca_eig_vec = np.linalg.eig(animedf_pca_cov)

# here the eigenvalues are printed, in the order numpy returns them
for col, val in zip(animedf_genres_pca, animedf_pca_eig_val):
    print ('%-*s : %s' % (15, col, str(val)))

Action          : 0.20824871490177183
Adventure       : 0.16173280008609456
Cars            : 0.13383112462140861
Comedy          : 0.12777131066381173
Dementia        : 0.11459763878293253
Demons          : 0.10361714172211209
Drama           : 0.08496683915700538
Ecchi           : 0.0807164728840884
Fantasy         : 0.07329055213835972
Game            : 0.06888331449801913
Harem           : 0.06358295507282939
Hentai          : 0.060082649241124116
Historical      : 0.05497420949027923
Horror          : 0.05370730101663125
Josei           : 0.051139365748614836
Kids            : 0.050108698873739144
Magic           : 0.0015074411831921422
Martial Arts    : 0.004151432974639757
Mecha           : 0.005692218637241859
Military        : 0.006469851714450555
Music           : 0.007075277242758995
Mystery         : 0.007554367593643342
Parody          : 0.00919343564183585
Police          : 0.010321941389332925
Psychological   : 0.012076366572411429
Romance         : 0.01297158315586666
Samurai         : 0.04578583956597838
School          : 0.04494867389678284
Sci-Fi          : 0.04247341761532464
Seinen          : 0.03988828765556133
Shoujo          : 0.018211215331074785
Shoujo Ai       : 0.03654869415928129
Shounen         : 0.035569458955450176
Shounen Ai      : 0.0203483971423483
Slice of Life   : 0.021716132164281428
Space           : 0.02255126922728222
Sports          : 0.023646334338755876
Super Power     : 0.02494567491134203
Supernatural    : 0.026025325604896097
Thriller        : 0.02768312739278635
Vampire         : 0.029622324262363324
Yaoi            : 0.032192730793145344
Yuri            : 0.031070535821701436

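It is interesting to see Action with the highest value here rather than Comedy, which was the most frequent genre. One possible reading is that the action genre is mixed with other genres more than comedy is, although the eigenvalues describe principal components rather than single genres, so the genre name on each line is only a positional label.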

Below is the code that fits a random forest classifier to the genres data frame. A random forest builds several decision trees, each trained on a bootstrap sample of the data (bagging), and classifies by combining the predictions of the individual trees. Each tree uses the Gini index, $$ 1-\sum_{i=1}^{c}\big[p(i|t)\big]^2 $$ where $p(i|t)$ is the proportion of class $i$ among the samples at node $t$ and $c$ is the number of classes, to decide which genre to use as a deciding factor. The Gini index measures how mixed the classes at a node are, and the genre whose split lowers it the most is chosen as the branching factor. Genres that produce large decreases across the forest receive a high importance.
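
To make the formula concrete, here is a small sketch in plain Python that computes the Gini index of a node and the decrease obtained by a split. The function names `gini_impurity` and `gini_gain` and the toy labels are made up for this illustration; scikit-learn performs the equivalent computation internally.

```python
from collections import Counter

def gini_impurity(labels):
    """1 minus the sum of squared class proportions at a node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_gain(parent, left, right):
    """Impurity decrease obtained by splitting parent into left and right."""
    n = len(parent)
    weighted = (len(left) / n) * gini_impurity(left) \
             + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - weighted

parent = ['a', 'a', 'b', 'b']
print(gini_impurity(parent))                      # 0.5
print(gini_gain(parent, ['a', 'a'], ['b', 'b']))  # 0.5
```

A node with two equally frequent classes has an index of 0.5, and a split that separates them perfectly removes all of it.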

# here the random forest classifier is created
clf = RandomForestClassifier()
clf = clf.fit(animedf_genres_pca, animedf['title'])

#for loop to print the importance of each genre in the model
i = 0
for f in clf.feature_importances_:
    print('%-*s : %s' % (15, genres[i], f))
    i+=1

Action          : 0.06181133873551685
Adventure       : 0.05169026361694601
Cars            : 0.0027764155530549902
Comedy          : 0.06085812717454431
Dementia        : 0.005047604985361501
Demons          : 0.017983493408926847
Drama           : 0.0663725779021305
Ecchi           : 0.019776303631141256
Fantasy         : 0.04724518275674286
Game            : 0.014962501154166663
Harem           : 0.01918363517563968
Hentai          : 0.005094460473504785
Historical      : 0.03118779215137587
Horror          : 0.01448942572758496
Josei           : 0.0067467827314553405
Kids            : 0.015415541205027578
Magic           : 0.03192335122843964
Martial Arts    : 0.013511802795257998
Mecha           : 0.019124711028957797
Military        : 0.020932661888186464
Music           : 0.012024334000181026
Mystery         : 0.027763197044633196
Parody          : 0.01903734908833824
Police          : 0.01115002339882598
Psychological   : 0.01724188690553605
Romance         : 0.044382484543152696
Samurai         : 0.006450628280721514
School          : 0.03684574818685506
Sci-Fi          : 0.04076958575444434
Seinen          : 0.03049517676223018
Shoujo          : 0.025364632588285967
Shoujo Ai       : 0.0033655880591732094
Shounen         : 0.041697515516933195
Shounen Ai      : 0.0035904703218131288
Slice of Life   : 0.03772829723953469
Space           : 0.010726672444314503
Sports          : 0.013891952397834802
Super Power     : 0.02245627102697034
Supernatural    : 0.04813992344679433
Thriller        : 0.007261048777259688
Vampire         : 0.009219149470627
Yaoi            : 0.0022158577902757427
Yuri            : 0.0020482336313032467

Here Comedy does come out near the top of the random forest, just behind Drama and Action. Its high frequency must have made it rather relevant as a deciding factor for the trees to branch out. What is also interesting is that the most relevant genres are ones that appear in the west as well: drama, action, comedy and adventure.
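
Rankings like this are easier to read when the importances are paired with their column names and sorted. The sketch below does that on a toy problem: the three genre columns, the random 0/1 data and the target that depends only on Comedy are all assumptions made for the example, not part of the data set.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Hypothetical 0/1 genre indicators and a made-up target
X = pd.DataFrame(rng.integers(0, 2, size=(300, 3)),
                 columns=['Action', 'Comedy', 'Drama'])
y = X['Comedy']  # by construction, only Comedy carries information

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Pair each importance with its column name, then sort from high to low
importances = pd.Series(forest.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```

Building the Series from the columns of the training frame keeps names and values aligned without a manual counter.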

The next few cells contain the code to plot one of the resulting trees.

# get the anime titles without strange characters, such as emojis
titles = [t.encode('ascii', 'ignore').decode('ascii') for t in animedf['title']]

# pick one tree of the forest to graph
estimator = clf.estimators_[5]

# write the tree to the file 'dot_data' in the DOT graph description language
# (export_graphviz returns None when out_file is given, so nothing is assigned)
export_graphviz(estimator, out_file = 'dot_data',
                feature_names = genres,
                class_names = titles,
                rounded = True, filled = True)

# build the graph to show from the trimmed copy of that file
graph =  pydotplus.graphviz.graph_from_dot_file('dot_data-Copy1')

You may notice that the name of the file is different from the one created in the cell before. That is because the original file contained about 7800 nodes, and pydotplus couldn't plot more than about 1300. So dot_data-Copy1 is a copy of the original file minus about 6500 nodes, which I removed directly in the file. With its roughly 1300 nodes it at least shows a good portion of the tree, which is plotted below.
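
An alternative to trimming the file by hand is to ask export_graphviz for only the top levels of the tree through its max_depth parameter. The sketch below is a minimal illustration of that option, using the iris data bundled with scikit-learn as a stand-in for the genres frame; the depth of 2 and the variable names are arbitrary choices for the example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_graphviz

# Stand-in data; the genres data frame of this tutorial would take its place
iris = load_iris()
forest = RandomForestClassifier(n_estimators=10, random_state=0)
forest.fit(iris.data, iris.target)

# out_file=None returns the DOT source as a string;
# max_depth=2 exports only the top levels of the chosen tree
dot_small = export_graphviz(forest.estimators_[5], out_file=None,
                            feature_names=iris.feature_names,
                            max_depth=2, rounded=True, filled=True)
print(dot_small.count('->'), 'edges exported')
```

The resulting string can be handed to pydotplus.graph_from_dot_data in the same way as the other trees in this tutorial. Unlike a hand-trimmed file, the cut is always made at a fixed depth rather than at an arbitrary node.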

#1350 node graph
Image(graph.create_png())

dot: graph is too large for cairo-renderer bitmaps. Scaling by 0.378656 to fit

graph2 =  pydotplus.graphviz.graph_from_dot_file('dot_data-Copy1-Copy1')

I also plotted a tree with 100 nodes, below, to have a better tree to look at. As you can see, the genres were used to determine each anime's place in the tree.

#100 node graph
Image(graph2.create_png())

The trees above aren't as informative as one would like. Since the score attribute came out as the most important one in the PCA above of the attributes that weren't genres, I'll make another anime genres data frame that holds the score of the anime in each of its genre columns, and run PCA and another random forest classifier on it.

# make the anime genres dataframe with scores:
# every row is multiplied by the score of its anime
animedf_scores = animedf_scores.mul(animedf['score'], axis=0)
animedf_scores.head()

Below is the code to run PCA on the anime genres with score values.

# mean-center the anime genres with score values
animedf_scores_pca = animedf_scores - animedf_scores.mean()

# make the covariance matrix
animedf_scores_pca_cov = np.cov(animedf_scores_pca.T)
animedf_scores_pca_cov[np.isnan(animedf_scores_pca_cov)] = 0

# calculate the eigenvalues and print them in the order numpy returns them
animedf_scores_pca_eig_val, animedf_scores_pca_eig_vec = np.linalg.eig(animedf_scores_pca_cov)
for col, val in zip(animedf_scores_pca, animedf_scores_pca_eig_val):
    print ('%-*s : %s' % (15, col, str(val)))

Action          : 569.6454296709717
Adventure       : 458.41595892878365
Cars            : 416.4309227026693
Comedy          : 369.67919067761716
Dementia        : 317.32572959407895
Demons          : 284.8099747850559
Drama           : 242.8785705128754
Ecchi           : 221.4042584006225
Fantasy         : 194.9721759302791
Game            : 182.14686006784768
Harem           : 171.0193580616645
Hentai          : 160.63748532161605
Historical      : 146.0212624565787
Horror          : 140.76522282605313
Josei           : 138.30294914913688
Kids            : 127.2601294977736
Magic           : 121.21430240975677
Martial Arts    : 111.20860120013498
Mecha           : 107.23662059909003
Military        : 2.6079948698483966
Music           : 7.985337573802201
Mystery         : 14.281759282787638
Parody          : 14.603107203653868
Police          : 17.842728183718606
Psychological   : 19.59438028934403
Romance         : 27.430868853601037
Samurai         : 32.67153231333125
School          : 35.26446333860263
Sci-Fi          : 33.482758565892645
Seinen          : 93.70560677572298
Shoujo          : 91.94369953138047
Shoujo Ai       : 47.78520185511384
Shounen         : 84.56410912213262
Shounen Ai      : 80.77728736354254
Slice of Life   : 52.66202139815055
Space           : 76.71657954161341
Sports          : 74.01191410143434
Super Power     : 56.47776700088487
Supernatural    : 58.73483288582561
Thriller        : 61.52876671034716
Vampire         : 63.699921053640615
Yaoi            : 68.25595882103366
Yuri            : 67.615409408053

Here it is interesting to find that Thriller isn't significant, even though it was the genre with the best average score overall, while Action, Adventure, Cars and Comedy have the highest values and Military and Music have the lowest.

Below is the code to fit a random forest classifier to the anime genres with scores data frame.

# random forest classifier for the dataframe of anime genres with scores as values
clf2 = RandomForestClassifier()
clf2 = clf2.fit(animedf_scores, animedf['title'])

# print the importance of each genre when the scores of each anime are taken into account
for g, f in zip(genres, clf2.feature_importances_):
    print('%-*s : %s' % (15, g, f))

# write the graph data of one tree of the random forest to the file 'dot_data_scores'
# (export_graphviz returns None when out_file is given, so nothing is assigned)
estimator2 = clf2.estimators_[5]
export_graphviz(estimator2, out_file = 'dot_data_scores',
                feature_names = genres,
                class_names = titles,
                rounded = True, filled = True)

Action          : 0.07008139032595893
Adventure       : 0.05183343419023747
Cars            : 0.002653374609766805
Comedy          : 0.12025767599786039
Dementia        : 0.0033776531476829856
Demons          : 0.009621687697103526
Drama           : 0.0542014901289842
Ecchi           : 0.020597328152408736
Fantasy         : 0.055710527751135525
Game            : 0.008704652194405124
Harem           : 0.010004070770789996
Hentai          : 0.032720475971119885
Historical      : 0.02078175891602459
Horror          : 0.010118450990500865
Josei           : 0.0039006233679865676
Kids            : 0.020664609561622405
Magic           : 0.024520024565166075
Martial Arts    : 0.008725906937894845
Mecha           : 0.021565300667639276
Military        : 0.01368254930757582
Music           : 0.01569604994129462
Mystery         : 0.020773086619215987
Parody          : 0.012639739906950031
Police          : 0.008395355820793959
Psychological   : 0.009421604576074091
Romance         : 0.04817158602290189
Samurai         : 0.005286562534908811
School          : 0.040559280316425306
Sci-Fi          : 0.047135810726809835
Seinen          : 0.023303889862266083
Shoujo          : 0.021788923882717495
Shoujo Ai       : 0.0028738435675004404
Shounen         : 0.04618550296036187
Shounen Ai      : 0.002840745930174975
Slice of Life   : 0.04197965076651229
Space           : 0.009981541507602535
Sports          : 0.017040489718065203
Super Power     : 0.016999319005012576
Supernatural    : 0.032566098240889414
Thriller        : 0.004088985336965522
Vampire         : 0.005021583447173585
Yaoi            : 0.0025097666038804253
Yuri            : 0.0010175974536388408

Here comedy comes out clearly on top. It is followed by genres that are also found in western storytelling mediums: action, fantasy, drama and adventure. It's interesting that the results of the two random forests are so similar yet slightly different. So the score does influence an anime's value, but so do the genres it belongs to.

From here on I'll fit another random forest classifier, this time on the anime genres per year data frame. I'll then use a user's data to build that user's genres per year frame, in order to calculate which years the user would prefer based on the genres they liked.

#Make a pandas series object containing the sum of animes per year from the anime data frame
#this will be used as the target values for the model
year = np.array(animedf['aired_from_year'])
years = range(min_year, max_year+1)
year_count = {}
for y in years:
    year_count[str(y)] = 0
for y in year:
    if str(y) in year_count:
        year_count[str(y)] = year_count[str(y)] + 1
year_counts = pd.Series(year_count)

#Make a pandas series object containing the user names and the number
#of animes rated by those users, from the user anime list data frame.
#This was used to look at the numbers in the different lists;
#from there I picked one user and decided to use him as my guinea pig.
user_set = list(map(str, (anime_listsdf['username'])))
user_count = {}
for u in user_set:
    if u in user_count:
        user_count[u] = user_count[u] + 1
    else:
        user_count[u] = 1
users = pd.Series(user_count)
users = users.sort_values(ascending = False)

#Get the rows of the anime data frame that appear in the user's list
anime_id_set = list(map(str, (animedf['anime_id'])))
fire_dragon_animesdf = anime_listsdf[anime_listsdf['username'] == 'Fire-Dragon']
fire_dragon_animesdf = fire_dragon_animesdf.reset_index(drop=True)
fire_dragon_anime_id_set = list(map(str, (fire_dragon_animesdf['anime_id'])))
#one lookup per list entry, concatenated in list order
#(DataFrame.append was removed in pandas 2.0)
fire_dragon_animesdf2 = pd.concat(
    [animedf[animedf['anime_id'] == int(a)] for a in fire_dragon_anime_id_set])
fire_dragon_genres_set = list(map(str, (fire_dragon_animesdf2['genre'])))

fire_dragon_year = np.array(fire_dragon_animesdf2['aired_from_year'])
fire_dragon_years = range(min_year, max_year+1)
fire_dragon_year_count = {}
for y in years:
    fire_dragon_year_count[str(y)] = 0
for y in fire_dragon_year:
    if str(y) in fire_dragon_year_count:
        fire_dragon_year_count[str(y)] = fire_dragon_year_count[str(y)] + 1
fire_dragon_year_counts = pd.Series(fire_dragon_year_count)

#build the user's anime list data frame in the genre per year format
fire_dragon_genredf = pd.DataFrame(index = genres, columns = range(min_year, max_year+1))
fire_dragon_genredf = fire_dragon_genredf.fillna(value = 0)
fire_dragon_genre_count = {}
#pair each anime of the user with its own airing year
#(fire_dragon_year, not the year array of the whole anime table)
for g, yr in zip(fire_dragon_genres_set, fire_dragon_year):
    for gs in g.split(', '):
        fire_dragon_genredf.loc[gs, yr] = fire_dragon_genredf.loc[gs, yr] + 1
        fire_dragon_genre_count[gs] = fire_dragon_genre_count.get(gs, 0) + 1
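
The genre per year table built in the cell above can also be sketched with pandas reshaping tools instead of explicit loops. The snippet below is an illustration on three invented rows: the column names follow this tutorial, while the toy data, the variable names and the year range 2009 to 2012 are assumptions for the example.

```python
import pandas as pd

# Hypothetical slice of a user's list after joining it with the anime table
user_toy = pd.DataFrame({
    'genre': ['Action, Comedy', 'Comedy', 'Drama, Action'],
    'aired_from_year': [2010, 2010, 2012],
})

# One row per (anime, genre) pair
long_form = user_toy.assign(genre=user_toy['genre'].str.split(', ')).explode('genre')

# Genres as rows, years as columns, counts in the cells
genre_by_year = pd.crosstab(long_form['genre'], long_form['aired_from_year'])

# Add the years the user has nothing from, so the columns match the full table
all_years = range(2009, 2013)  # stands in for range(min_year, max_year + 1)
genre_by_year = genre_by_year.reindex(columns=all_years, fill_value=0)
print(genre_by_year)
```

Because every genre is exploded from its own row, each count is tied to the airing year of the same anime. To feed such a table to a model fitted on all genres, the rows would need the same treatment, for example with reindex(index=genres, fill_value=0).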

Below you'll find the code that fits the model. The model uses the transpose of the anime genres by year frequency data frame as the X input, so the genres are the columns/attributes and the years are the rows/entries. The target values are the total number of anime per year. The relevance of each genre is printed here too.

#fitting the random forest classifier model
#X variables: anime genre by year data frame
#Y: number of animes per year
clf = RandomForestClassifier()
clf = clf.fit(anime_genredf.T.sort_index(),year_counts.sort_index())

#for loop to print the importance of each genre in the model
i = 0
for f in clf.feature_importances_:
    print('%-*s : %s' % (15, genres[i], f))
    i+=1

Action          : 0.06419017768809478
Adventure       : 0.08138745741418628
Cars            : 0.02522577145749895
Comedy          : 0.044736576971380865
Dementia        : 0.011081173886719338
Demons          : 0.00659564817280392
Drama           : 0.05223272107647633
Ecchi           : 0.0122512824643915
Fantasy         : 0.04142758198357238
Game            : 0.0027305754138387994
Harem           : 0.009974722201407566
Hentai          : 0.007987668545459863
Historical      : 0.0632215473903633
Horror          : 0.004936526181672431
Josei           : 0.013265686595736327
Kids            : 0.05279393122081933
Magic           : 0.016061600288386973
Martial Arts    : 0.014866093247951523
Mecha           : 0.04874525307763434
Military        : 0.04016049483690995
Music           : 0.007427962002427611
Mystery         : 0.012610982394926942
Parody          : 0.011237663996019501
Police          : 0.01664530301854258
Psychological   : 0.011946260921495288
Romance         : 0.01435505682143147
Samurai         : 0.014215505167323072
School          : 0.019494366579904013
Sci-Fi          : 0.052274820532239585
Seinen          : 0.01120721445750385
Shoujo          : 0.02842625830474195
Shoujo Ai       : 0.007509588304382229
Shounen         : 0.029382044890027496
Shounen Ai      : 0.005454059759508344
Slice of Life   : 0.03336182281831116
Space           : 0.01124623295402872
Sports          : 0.030235977845238276
Super Power     : 0.015493099654994296
Supernatural    : 0.021586751873401314
Thriller        : 0.007622928512224686
Vampire         : 0.007154953312236786
Yaoi            : 0.007231066523450243
Yuri            : 0.010007589240335667

Below is the code to plot one decision tree of the model. Since the rows of the model were the years, the tree might be interpreted as the likelihood that an anime is from some year based on its genres. The leaves, which have no genre attached to them, simply specify the possible years for the path of genres of a certain anime. It's interesting to see that the more Japanese or more specific genres tend to appear in the later years, while the earlier years are defined by more traditional genres. It is also interesting that the early genres branch out to the right with earlier year values.

estimator = clf.estimators_[4]

dot_data = export_graphviz(estimator, out_file=None, 
                feature_names = genres,
                class_names = list(year_count.keys()),
                rounded = True, filled = True)
graph =  pydotplus.graph_from_dot_data(dot_data)
Image(graph.create_png())

The cell below contains the prediction of the model for a user's anime list. It also prints a list with the value predicted for each year, based on the genres in the user's list.

#prediction based on a MAL user anime list
predicted_genres = clf.predict(fire_dragon_genredf.T)

#Code to print the predictions results
liist_years = list(year_count.keys())
i = 0
for f in predicted_genres:
    print(liist_years[i], f)
    i+=1

1942 0
1943 1
1944 1
1945 0
1946 0
1947 0
1948 0
1949 0
1950 0
1951 0
1952 0
1953 0
1954 0
1955 0
1956 0
1957 1
1958 1
1959 1
1960 2
1961 1
1962 4
1963 4
1964 1
1965 1
1966 1
1967 2
1968 1
1969 4
1970 1
1971 2
1972 4
1973 4
1974 2
1975 5
1976 3
1977 8
1978 8
1979 3
1980 8
1981 8
1982 8
1983 8
1984 8
1985 8
1986 8
1987 29
1988 8
1989 29
1990 49
1991 49
1992 8
1993 8
1994 11
1995 67
1996 49
1997 8
1998 8
1999 56
2000 49
2001 71
2002 63
2003 49
2004 63
2005 99
2006 190
2007 246
2008 35
2009 133
2010 154
2011 71
2012 227
2013 227
2014 305
2015 410
2016 490
2017 410
2018 82

Data wearing¶

To wrap up this tutorial, let me say that even though I attempted to divide it into data cleaning, exploration and analysis sections, each section showed components of the previous ones. This reflects how the process of analyzing data is not linear, but often requires stepping back to earlier stages of analysis. That is not seen in the last stage of this tutorial only because there is more to explore in the model, as there is in every other stage. At each stage, several lines of questioning about the data were left unanswered, whether about cleaning, analyzing, understanding or running models on it. Further analysis can shed more light on anime genres; looking into the rank attribute, for instance, would have been interesting.

Furthermore, I shared some fun facts about anime throughout the tutorial. For those curious about the subject, here are a couple of anime YouTube channels that cover several of the topics explored here. But my main recommendation is to simply watch some anime.

	anime_id	title	title_english	title_japanese	title_synonyms	image_url	type	source	episodes	status	...	broadcast	related	producer	licensor	studio	genre	opening_theme	ending_theme	duration_min	aired_from_year
0	11013	Inu x Boku SS	Inu X Boku Secret Service	妖狐×僕SS	Youko x Boku SS	https://myanimelist.cdn-dena.com/images/anime/...	TV	Manga	12	Finished Airing	...	Fridays at Unknown	{'Adaptation': [{'mal_id': 17207, 'type': 'man...	Aniplex, Square Enix, Mainichi Broadcasting Sy...	Sentai Filmworks	David Production	Comedy, Supernatural, Romance, Shounen	['"Nirvana" by MUCC']	['#1: "Nirvana" by MUCC (eps 1, 11-12)', '#2: ...	24.0	2012.0
1	2104	Seto no Hanayome	My Bride is a Mermaid	瀬戸の花嫁	The Inland Sea Bride	https://myanimelist.cdn-dena.com/images/anime/...	TV	Manga	26	Finished Airing	...	Unknown	{'Adaptation': [{'mal_id': 759, 'type': 'manga...	TV Tokyo, AIC, Square Enix, Sotsu	Funimation	Gonzo	Comedy, Parody, Romance, School, Shounen	['"Romantic summer" by SUN&LUNAR']	['#1: "Ashita e no Hikari (明日への光)" by Asuka Hi...	24.0	2007.0
2	5262	Shugo Chara!! Doki	Shugo Chara!! Doki	しゅごキャラ！！どきっ	Shugo Chara Ninenme, Shugo Chara! Second Year	https://myanimelist.cdn-dena.com/images/anime/...	TV	Manga	51	Finished Airing	...	Unknown	{'Adaptation': [{'mal_id': 101, 'type': 'manga...	TV Tokyo, Sotsu	NaN	Satelight	Comedy, Magic, School, Shoujo	['#1: "Minna no Tamago (みんなのたまご)" by Shugo Cha...	['#1: "Rottara Rottara (ロッタラロッタラ)" by Buono! ...	24.0	2008.0
3	721	Princess Tutu	Princess Tutu	プリンセスチュチュ	NaN	https://myanimelist.cdn-dena.com/images/anime/...	TV	Original	38	Finished Airing	...	Fridays at Unknown	{'Adaptation': [{'mal_id': 1581, 'type': 'mang...	Memory-Tech, GANSIS, Marvelous AQL	ADV Films	Hal Film Maker	Comedy, Drama, Magic, Romance, Fantasy	['"Morning Grace" by Ritsuko Okazaki']	['"Watashi No Ai Wa Chiisaikeredo" by Ritsuko ...	16.0	2002.0
4	12365	Bakuman. 3rd Season	Bakuman.	バクマン。	Bakuman Season 3	https://myanimelist.cdn-dena.com/images/anime/...	TV	Manga	25	Finished Airing	...	Unknown	{'Adaptation': [{'mal_id': 9711, 'type': 'mang...	NHK, Shueisha	NaN	J.C.Staff	Comedy, Drama, Romance, Shounen	['#1: "Moshimo no Hanashi (もしもの話)" by nano.RIP...	['#1: "Pride on Everyday" by Sphere (eps 1-13)...	24.0	2012.0

	username	anime_id	my_watched_episodes	my_start_date	my_finish_date	my_score	my_status	my_rewatching	my_last_updated	my_tags
0	karthiga	21	586	0000-00-00	0000-00-00	9	1	NaN	2013-03-03 10:52:53	NaN
1	karthiga	59	26	0000-00-00	0000-00-00	7	2	NaN	2013-03-10 13:54:51	NaN
2	karthiga	74	26	0000-00-00	0000-00-00	7	2	NaN	2013-04-27 16:43:35	NaN
3	karthiga	120	26	0000-00-00	0000-00-00	7	2	NaN	2013-03-03 10:53:57	NaN
4	karthiga	178	26	0000-00-00	0000-00-00	7	2	0.0	2013-03-27 15:59:13	NaN
5	karthiga	210	161	0000-00-00	0000-00-00	7	2	NaN	2013-03-10 13:57:06	NaN
6	karthiga	232	70	0000-00-00	0000-00-00	6	2	NaN	2013-03-09 17:24:42	NaN
7	karthiga	233	78	0000-00-00	0000-00-00	6	2	NaN	2013-03-10 05:29:44	NaN
8	karthiga	249	167	0000-00-00	0000-00-00	8	2	NaN	2013-03-19 16:04:46	NaN
9	karthiga	269	366	0000-00-00	0000-00-00	10	2	NaN	2013-03-03 09:39:23	NaN

	anime_id	title	type	source	episodes	status	duration	rating	score	scored_by	...	Supernatural
0	11013	Inu x Boku SS	TV	Manga	12	Finished Airing	24 min. per ep.	PG-13 - Teens 13 or older	7.63	139250	...	1
1	2104	Seto no Hanayome	TV	Manga	26	Finished Airing	24 min. per ep.	PG-13 - Teens 13 or older	7.89	91206	...	0
2	5262	Shugo Chara!! Doki	TV	Manga	51	Finished Airing	24 min. per ep.	PG - Children	7.55	37129	...	0
3	721	Princess Tutu	TV	Original	38	Finished Airing	16 min. per ep.	PG-13 - Teens 13 or older	8.21	36501	...	0
4	12365	Bakuman. 3rd Season	TV	Manga	25	Finished Airing	24 min. per ep.	PG-13 - Teens 13 or older	8.67	107767	...	0

	1942	1943	1944	1945	...	2009	2010	2011	2012	2013	2014	2015	2016	2017	2018
Action	0	0	0	1	...	80	97	102	101	93	114	139	156	137	62
Adventure	0	1	0	0	...	47	36	31	42	36	52	55	70	65	21
Cars	0	0	0	0	...	3	0	0	3	0	2	2	2	0	1
Comedy	1	1	1	0	...	137	148	136	153	161	200	231	238	191	79
Dementia	0	0	0	0	...	0	0	2	0	2	3	0	3	3	2
Demons	0	0	0	0	...	5	9	13	11	20	9	19	16	16	14
Drama	0	0	0	0	...	41	34	47	46	46	49	57	75	81	34
Ecchi	0	0	0	0	...	30	40	41	36	28	34	45	30	27	11
Fantasy	0	0	0	0	...	58	48	54	61	78	94	100	116	133	44
Game	0	0	0	0	...	5	4	9	9	12	23	7	28	23	5
Harem	0	0	0	0	...	15	19	24	19	23	23	28	18	13	4
Hentai	0	0	0	0	...	18	18	24	20	19	15	32	27	19	7
Historical	1	1	1	0	...	16	28	22	16	13	22	18	33	35	10
Horror	0	0	0	0	...	4	5	7	15	5	9	9	18	18	7
Josei	0	0	0	0	...	2	6	6	7	8	6	1	6	3	2
Kids	0	0	0	0	...	17	17	12	19	16	23	32	33	21	15
Magic	0	0	0	0	...	23	23	27	22	27	32	33	49	67	15
Martial Arts	0	0	0	0	...	9	10	12	7	6	11	5	7	6	4
Mecha	0	0	0	0	...	14	15	9	15	17	30	25	23	20	11
Military	0	0	1	1	...	12	18	10	13	8	9	19	16	19	6
Music	0	0	0	0	...	9	9	8	11	16	21	31	33	31	7
Mystery	0	0	0	0	...	21	18	24	23	19	25	32	35	34	11
Parody	0	0	0	0	...	16	16	13	16	19	10	30	26	21	11
Police	0	0	0	0	...	3	2	1	4	4	7	5	10	5	3
Psychological	0	0	0	0	...	6	2	6	9	11	17	12	13	12	8
Romance	0	0	0	0	...	52	42	53	66	61	70	61	66	71	27
Samurai	0	0	0	0	...	4	13	8	6	6	4	5	8	5	2
School	0	0	0	0	...	41	47	56	80	77	82	100	103	99	26
Sci-Fi	0	0	0	0	...	33	37	34	52	34	60	54	68	64	26
Seinen	0	0	0	0	...	15	30	24	35	28	40	46	41	28	18
Shoujo	0	0	0	0	...	16	16	19	20	22	30	23	32	18	8
Shoujo Ai	0	0	0	0	...	8	2	3	4	0	4	7	0	3	2
Shounen	0	0	0	0	...	32	49	51	54	48	65	69	79	64	22
Shounen Ai	0	0	0	0	...	0	2	4	2	1	6	1	7	2	2
Slice of Life	0	0	0	0	...	29	28	41	54	71	91	90	87	94	39
Space	0	0	0	0	...	5	3	4	4	7	15	9	15	9	3
Sports	0	0	0	0	...	11	11	11	13	20	20	23	41	24	9
Super Power	0	0	0	0	...	25	26	34	19	19	18	27	17	22	8
Supernatural	0	0	0	0	...	30	47	65	52	50	56	70	82	64	30
Thriller	0	0	0	0	...	10	5	5	6	5	5	4	4	2	4
Vampire	0	0	0	0	...	2	5	10	3	6	0	12	7	8	2
Yaoi	0	0	0	0	...	3	3	0	1	1	0	0	1	0	0
Yuri	0	0	0	0	...	0	3	1	0	0	1	2	0	0	0

	1942	1943	1944	1945	1946	1947	1948	1949	1950	1951	...	2009	2010	2011	2012	2013	2014	2015	2016	2017	2018
Action	0	0	0	0	0	0	0	0	0	0	...	1	0	1	0	0	0	1	2	3	1
Action Adventure	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
Action Adventure Cars Comedy Kids Police	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
Action Adventure Cars Comedy Sci-Fi Shounen	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
Action Adventure Cars Mecha Sci-Fi Shounen Sports	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
Action Adventure Cars Sci-Fi	0	0	0	0	0	0	0	0	0	0	...	1	0	0	0	0	0	0	0	0	0
Action Adventure Comedy	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
Action Adventure Comedy Demons Drama Ecchi Horror Mystery Romance Sci-Fi	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
Action Adventure Comedy Demons Fantasy Historical Magic Romance Shounen Supernatural	0	0	0	0	0	0	0	0	0	0	...	1	0	0	0	0	0	0	0	0	0
Action Adventure Comedy Demons Fantasy Magic	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
Action Adventure Comedy Demons Fantasy Martial Arts Shounen Super Power	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
Action Adventure Comedy Demons Shounen Super Power	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
Action Adventure Comedy Demons Shounen Supernatural	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	1	1	0	0
Action Adventure Comedy Drama	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	2	2	2	1
Action Adventure Comedy Drama Ecchi Mecha Romance Sci-Fi	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
Action Adventure Comedy Drama Fantasy Historical	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
Action Adventure Comedy Drama Fantasy Josei	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
Action Adventure Comedy Drama Fantasy Kids	0	0	0	0	0	0	0	0	0	0	...	1	0	0	1	0	0	0	0	0	0
Action Adventure Comedy Drama Fantasy Magic Military Shounen	0	0	0	0	0	0	0	0	0	0	...	1	0	1	0	0	0	0	0	0	0
Action Adventure Comedy Drama Fantasy Shounen	0	0	0	0	0	0	0	0	0	0	...	1	0	0	1	0	0	0	2	0	0
Action Adventure Comedy Drama Fantasy Shounen Super Power	0	0	0	0	0	0	0	0	0	0	...	0	0	0	1	1	0	2	0	1	0
Action Adventure Comedy Drama Harem Martial Arts Mecha Romance Sci-Fi Shounen	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
Action Adventure Comedy Drama Historical Military Romance	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
Action Adventure Comedy Drama Horror Mecha Mystery Psychological Sci-Fi	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
Action Adventure Comedy Drama Josei Supernatural	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	1	0
Action Adventure Comedy Drama Mecha Military Music Romance Sci-Fi Shounen	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
Action Adventure Comedy Drama Mecha Music Sci-Fi Shounen	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
Action Adventure Comedy Drama Mecha Sci-Fi Shounen	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
Action Adventure Comedy Drama Romance Sci-Fi	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
Action Adventure Comedy Drama School Sci-Fi Shounen Sports	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
School Sports	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	1	0	0
School Supernatural	0	0	0	0	0	0	0	0	0	0	...	0	1	0	0	0	0	0	0	0	0
Sci-Fi	0	0	0	0	0	0	0	0	0	0	...	0	1	0	2	0	0	0	3	7	0
Sci-Fi Seinen	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	1	0	0	2	0	0
Sci-Fi Seinen Slice of Life	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
Sci-Fi Seinen Space	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
Sci-Fi Shounen	0	0	0	0	0	0	0	0	0	0	...	1	0	0	0	0	0	0	0	0	0
Sci-Fi Shounen Sports Super Power	0	0	0	0	0	0	0	0	0	0	...	0	0	1	0	0	0	0	0	0	0
Sci-Fi Slice of Life	0	0	0	0	0	0	0	0	0	0	...	0	1	0	0	0	0	0	0	0	0
Sci-Fi Space	0	0	0	0	0	0	0	0	0	0	...	0	0	0	1	0	1	1	0	0	0
Sci-Fi Sports	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
Sci-Fi Thriller	0	0	0	0	0	0	0	0	0	0	...	0	0	1	0	0	0	1	0	1	1
Seinen	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
Seinen Slice of Life	0	0	0	0	0	0	0	0	0	0	...	0	0	0	1	0	0	2	0	1	2
Seinen Slice of Life Supernatural	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	1	0	0	0	0	0
Seinen Sports	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	1	0	0	0	0
Shoujo	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
Shoujo Slice of Life	0	0	0	0	0	0	0	0	0	0	...	0	1	0	0	1	0	0	0	0	0
Shounen	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
Shounen Sports	0	0	0	0	0	0	0	0	0	0	...	0	0	0	1	0	1	1	1	0	0
Shounen Sports Super Power	0	0	0	0	0	0	0	0	0	0	...	0	0	1	1	1	0	0	0	0	0
Shounen Super Power	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	1	0	0
Shounen Supernatural	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	1	0
Slice of Life	0	0	0	0	0	0	0	0	0	0	...	0	0	1	2	3	6	5	6	9	1
Slice of Life Space	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	1	0	0	0	0	0
Slice of Life Supernatural	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	1	1	0	0	1
Space	0	0	0	0	0	0	0	0	0	0	...	0	0	0	1	0	0	0	0	0	0
Sports	0	0	0	0	0	0	0	0	0	0	...	0	0	1	0	1	0	0	4	2	3
Supernatural	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	1	0	0
Yaoi	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
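
Counting whole genre combinations per year, rather than single genres, can be sketched in a similar way. The data and the names `year`, `genres` and `combo` below are assumptions made for illustration. The sketch joins labels with a comma so that multi-word genres such as "Slice of Life" stay readable, whereas the table above shows them separated by spaces.

```python
import pandas as pd

# Hypothetical toy data; column names are assumptions.
shows = pd.DataFrame({
    "year": [2016, 2016, 2017, 2017],
    "genres": [
        ["School", "Comedy"],
        ["Comedy", "School"],
        ["Sports"],
        ["Slice of Life"],
    ],
})

# Sort the genres of each title so the same set always yields the same label,
# then join them into a single combination string.
shows["combo"] = shows["genres"].apply(lambda g: ", ".join(sorted(g)))

# Rows: genre combination, columns: year, cells: number of titles with exactly that combination.
combo_by_year = pd.crosstab(shows["combo"], shows["year"])
print(combo_by_year)
```

Here every title falls into exactly one row, so each column sums to the number of titles in that year.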

	Comedy	Drama	Fantasy	...	Supernatural
0	7.63	0.00	0.00	...	7.63
1	7.89	0.00	0.00	...	0.00
2	7.55	0.00	0.00	...	0.00
3	8.21	8.21	8.21	...	0.00
4	8.67	8.67	0.00	...	0.00
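
In the table above, each row appears to repeat a title's score under every genre it belongs to and to hold 0.00 elsewhere. A matrix with that layout could be built as in the sketch below. The toy rows, the column names `score` and `genres`, and the choice of 0.0 as the fill value are assumptions for illustration, not the notebook's confirmed method.

```python
import pandas as pd

# Hypothetical toy data shaped like the table above; values are illustrative only.
shows = pd.DataFrame({
    "score": [7.5, 8.0, 8.5],
    "genres": [
        ["Comedy", "Supernatural"],
        ["Comedy", "Drama", "Fantasy"],
        ["Comedy", "Drama"],
    ],
})

# One row per (title, genre) pair; the original row index is kept and repeated.
exploded = shows.explode("genres").reset_index()

# Rows: title index, columns: genre, cells: the title's score, or 0.0 if the genre is absent.
score_by_genre = (
    exploded.pivot(index="index", columns="genres", values="score")
    .fillna(0.0)
)
print(score_by_genre)
```

If per-genre averages are computed from such a matrix, the 0.0 placeholders should be excluded (for example by keeping them as missing values), otherwise they would pull every mean downward.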