The other day, my brother and I were arguing over which ship was the most popular in anime (overall, historically)… I said it was L and Yagami Light from Death Note, and he said it was Eren and Mikasa from Attack on Titan. Without robustly defining “popular” and “in anime”, it’s hard to say which way of measuring is better than another, but since this is just for fun, I decided not to dwell on that. I just wanted to prove my brother wrong…
I decided to measure a ship’s “popularity” by how many fanfictions had been written about it (though there’s probably a lot of room for debate there). Archive of Our Own had helpfully published a data dump of all works published on their site through March 2021, so I used that. Feel free to check it out! Sadly, though, you won’t find any Demon Slayer, Spy x Family, or Jujutsu Kaisen.
import os
import glob
import pandas as pd
df = pd.read_csv('tag.csv')
# most-used tags first
dg = df.sort_values(by='cached_count', ascending = False)
ds = dg[dg['name'] != "Redacted"]
# fandom tags only
dl = ds[ds['type'] == "Fandom"]
Their data is (very, very nicely) organized into two parts: tags and work data (the latter a hefty 1GB). Each tag has an ID and is sorted into a category (fandom, relationship, character, etc…). Each tag also has a “cached_count”, which is just how many times it shows up, and which will be very helpful later.
I wanted to first take a look at which ships were most popular overall, and quickly realized what the problem would be with finding the most popular ships in anime:
dr = ds[ds['type'] == "Relationship"]
dr.head(10)
| | id | type | name | canonical | cached_count | merger_id |
---|---|---|---|---|---|---|
173025 | 264659 | Relationship | Derek Hale/Stiles Stilinski | True | 122223 | NaN |
4700 | 5672 | Relationship | Castiel/Dean Winchester | True | 111991 | NaN |
8900 | 11006 | Relationship | Sherlock Holmes/John Watson | True | 87435 | NaN |
76021 | 110293 | Relationship | James "Bucky" Barnes/Steve Rogers | True | 77276 | NaN |
85 | 99 | Relationship | Draco Malfoy/Harry Potter | True | 74244 | NaN |
5973 | 7265 | Relationship | Steve Rogers/Tony Stark | True | 64923 | NaN |
378701 | 607596 | Relationship | Stucky | False | 54045 | 110293.0 |
256699 | 450395 | Relationship | Harry Styles/Louis Tomlinson | True | 48225 | NaN |
352836 | 575567 | Relationship | Aziraphale/Crowley (Good Omens) | True | 39319 | NaN |
180531 | 276512 | Relationship | Keith/Lance (Voltron) | True | 37464 | NaN |
First, I don’t read that much fanfiction myself, and second, I haven’t watched enough anime to write down my own list of ships that would probably show up on this list, especially when it’s thousands and thousands of entries long… So how do you pick out the entries that are about anime in the first place?
reltags = dr['id'].tolist()
renames = dr['name'].tolist()
titles = pd.read_csv('titles.csv')
tags = dl[dl['name'].isin(titles['title'])]
tag = tags[['id', 'name', 'cached_count']]
animetags = tag['id'].tolist()
animetags = [str(x) for x in animetags]
I sorted the tags CSV so that it listed fandoms by count and, without a good or efficient computational way to determine whether an entry was anime or not, manually collected a list of anime titles. This actually went a lot faster than you’d think, as there aren’t many titles that get a significant amount of fanfiction written about them. Of course, this was based on my own knowledge, but I think I did a pretty decent job…
tags.head()
| | id | type | name | canonical | cached_count | merger_id |
---|---|---|---|---|---|---|
494603 | 758208 | Fandom | Haikyuu!! | True | 130918 | NaN |
11296 | 13999 | Fandom | Naruto | True | 105108 | NaN |
466212 | 721553 | Fandom | Shingeki no Kyojin \| Attack on Titan | True | 60008 | NaN |
358410 | 582724 | Fandom | Miraculous Ladybug | True | 55895 | NaN |
10317 | 12845 | Fandom | Hetalia: Axis Powers | True | 43092 | NaN |
def checker(s):
    # True if any tag in the work's tag list is one of the anime fandom tags
    return any(x in animetags for x in s)

def multichecker(s):
    if ',' in s:
        return True
    else:
        return False

def nonan(d):
    # NaN is never equal to itself, so v == v drops NaN values
    return {k: v for k, v in d.items() if v == v}

def topfive(d):
    # keeps tags that appear at least 5 times
    return {k: v for k, v in d.items() if v >= 5.0}

tagdict = pd.Series(tag.name.values, index = tag.id).to_dict()
tdict = {str(k): v for k, v in tagdict.items()}

def replace_titles(s):
    return tdict[s]
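To make the two dictionary filters above a little more concrete, here is a minimal sketch of what they do to one row of counts. The tag IDs and counts are made up for illustration; only the two helper functions mirror the ones above.

```python
def nonan(d):
    # NaN is the only float that is not equal to itself,
    # so the v == v test silently drops NaN entries
    return {k: v for k, v in d.items() if v == v}

def topfive(d):
    # despite the name, this keeps every count of at least 5
    return {k: v for k, v in d.items() if v >= 5.0}

# hypothetical tag IDs and co-occurrence counts for a single fandom
counts = {"101": 12.0, "102": float("nan"), "103": 3.0, "104": 5.0}

cleaned = nonan(counts)
frequent = topfive(cleaned)

print(cleaned)
print(frequent)
```

Only tags `101` and `104` survive both filters: `102` is dropped for being NaN, and `103` for falling under the threshold.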
Now, I could get information on every entry that had any anime tagged:
os.chdir(r'/Users/emilyzou/Desktop/final/chunks')
filelist = glob.glob('*.csv')
for file in filelist:
    da = pd.read_csv(file)
    # each work stores its tag IDs as one '+'-separated string
    da['taglist'] = da['tags'].map(lambda s: str(s).split('+'))
    da['checkisin'] = da['taglist'].apply(checker)
    # keep only works with at least one anime fandom tag
    animed = da[da['checkisin'] == True].copy()
    animed['anime'] = animed['taglist'].apply(lambda x: list(set(x) & set(animetags)))
    animed['anime'] = animed['anime'].apply(lambda x: x[0])
    anidf = animed[['creation date', 'language', 'word_count', 'taglist', 'anime']].reset_index()
    anidf['tag_length'] = anidf['taglist'].apply(len)
    # one row per (work, tag), then count every tag per anime
    explode = anidf.explode('taglist')
    exploded = explode.groupby(['anime', 'taglist']).size().unstack(fill_value = None).reset_index()
    removeplease = ['anime', 'taglist']
    list1 = [x for x in list(exploded.columns) if x not in removeplease]
    exploded['anime'] = exploded['anime'].astype(pd.StringDtype())
    exploded['multicheck'] = exploded['anime'].apply(multichecker)
    data = exploded[exploded['multicheck'] == False]
    dict = data.set_index(['anime']).to_dict('index')
    dic = {k:nonan(v) for k, v in dict.items()}
    dic5 = {k:topfive(v) for k, v in dic.items()}
    dictt = {replace_titles(k): v for k, v in dic5.items()}
    # look up names for every other tag ID that showed up
    dm = df
    tagged = list(data.columns)
    dm['id'] = dm['id'].astype(pd.StringDtype())
    othertags = dm[dm['id'].isin(tagged)]
    tag2dict = pd.Series(othertags.name.values, index = othertags.id).to_dict()
    t2dict = {str(k):v for k,v in tag2dict.items()}
    def replace_tags(s):
        if s in list(t2dict.keys()):
            return t2dict[s]
        else:
            return None
    def dictreplace_tags(dict):
        return {replace_tags(k): v for k, v in dict.items()}
    # swap IDs for names on both levels, then save this chunk's counts
    dictt = {replace_titles(k): dictreplace_tags(v) for k, v in dic5.items()}
    pd.DataFrame.from_dict(dictt, orient = 'index').to_csv('results{}'.format(file))
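The heart of that loop is the explode, groupby, size, unstack chain, which is easier to follow on a toy table. This is only a sketch: the tag IDs are invented, `fandom_ids` stands in for `animetags`, and it fills missing combinations with 0 instead of leaving them as NaN the way the loop does.

```python
import pandas as pd

# toy works table; the '+'-separated tag IDs are made up
works = pd.DataFrame({"tags": ["10+21+22", "10+21", "11+22", "30+21"]})
fandom_ids = ["10", "11"]  # stand-in for animetags

works["taglist"] = works["tags"].map(lambda s: str(s).split("+"))

# keep only works that carry at least one fandom tag
has_fandom = works["taglist"].apply(lambda tags: any(t in fandom_ids for t in tags))
works = works[has_fandom].copy()

# record which fandom each remaining work belongs to
works["anime"] = works["taglist"].apply(
    lambda tags: [t for t in tags if t in fandom_ids][0]
)

# one row per (work, tag), then count how often each tag appears per fandom
exploded = works.explode("taglist")
counts = exploded.groupby(["anime", "taglist"]).size().unstack(fill_value=0)

print(counts)
```

Each row of `counts` is a fandom and each column is a tag that appeared alongside it, which is the same shape the loop writes out for every chunk. The fandom's own tag shows up as a column too.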
The works data, like I wrote before, is really, really big, so I had split it into equal chunks of 10,000 lines… which created 105 separate CSV files. The for loop above does all the processing needed to spit out every *other* tag that got tagged alongside an anime, and does this for every single file.
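The splitting step itself isn’t shown here. One way it could be done, as a sketch rather than the exact method used, is with the `chunksize` option of `pd.read_csv`. The function name `split_csv`, the `chunk_000.csv` naming scheme, and the demo file are all hypothetical.

```python
import os
import tempfile

import pandas as pd

def split_csv(path, out_dir, rows_per_chunk=10_000):
    """Write consecutive slices of a large CSV to numbered files in out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    # with chunksize set, read_csv yields DataFrames instead of loading everything
    for i, chunk in enumerate(pd.read_csv(path, chunksize=rows_per_chunk)):
        out_path = os.path.join(out_dir, f"chunk_{i:03d}.csv")
        chunk.to_csv(out_path, index=False)
        written.append(out_path)
    return written

# tiny demonstration on a throwaway 25-row file, split into chunks of 10 rows
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "works_demo.csv")
    demo = pd.DataFrame({"tags": [f"{n}+{n + 1}" for n in range(25)]})
    demo.to_csv(src, index=False)

    files = split_csv(src, os.path.join(tmp, "chunks"), rows_per_chunk=10)
    sizes = [len(pd.read_csv(f)) for f in files]

print(sizes)  # [10, 10, 5]
```

Every chunk keeps the header row, so each file can be read on its own with `pd.read_csv`, and only the last file comes out shorter.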
os.chdir(r'/Users/emilyzou/Desktop/final/chunks/results')
csv_files = []
filelist = glob.glob('*.csv')
for file in filelist:
    _dg = pd.read_csv(file)
    csv_files.append(_dg)
merged = pd.concat(csv_files)
Then, I merged all 105 of my chunks into one big file. Let’s take a look at what we got!
merged = merged.rename(columns = {'Unnamed: 0': 'anime'})
all = merged.groupby('anime').sum()
all ['total'] = all.sum(numeric_only = True, axis = 1)
all.sort_values(by = ['total'], ascending = False).to_csv('all.csv')
all.sort_values(by = ['total'], ascending = False).head()
anime | General Audiences | One Piece | Roronoa Zoro | Teen And Up Audiences | Fluff | Nami (One Piece) | Monkey D. Luffy | Nico Robin | Unnamed: 9 | Portgas D. Ace | ... | Magic-Users | Kamui (Gintama) | Crossdressing | Major character death - Freeform | Riding | Fukawa Touko/Togami Byakuya | POV Outsider | Alternate Universe - Bakery | Alternate Universe - Neighbors | total |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Haikyuu!! | 9880.0 | 0.0 | 0.0 | 10344.0 | 8847.0 | 0.0 | 0.0 | 0.0 | 67.0 | 0.0 | ... | 0.0 | 0.0 | 5.0 | 5.0 | 6.0 | 0.0 | 5.0 | 6.0 | 5.0 | 310418.0 |
Naruto | 2757.0 | 11.0 | 0.0 | 3569.0 | 1353.0 | 0.0 | 0.0 | 0.0 | 67.0 | 0.0 | ... | 9.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 99093.0 |
Shingeki no Kyojin \| Attack on Titan | 1379.0 | 0.0 | 0.0 | 1941.0 | 1208.0 | 0.0 | 0.0 | 0.0 | 39.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 76630.0 |
Dangan Ronpa - All Media Types | 1494.0 | 0.0 | 0.0 | 2263.0 | 1357.0 | 0.0 | 0.0 | 0.0 | 81.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 0.0 | 0.0 | 0.0 | 48513.0 |
Miraculous Ladybug | 3066.0 | 0.0 | 0.0 | 2933.0 | 1523.0 | 0.0 | 0.0 | 0.0 | 198.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 38693.0 |
5 rows × 1428 columns
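The merge-and-sum step is also easier to see on a small example. Here is a rough sketch with two made-up per-chunk result tables; the tag names and counts are illustrative only.

```python
import pandas as pd

# two hypothetical per-chunk result tables
chunk_a = pd.DataFrame({
    "anime": ["Haikyuu!!", "Naruto"],
    "Fluff": [7.0, 5.0],
    "Angst": [float("nan"), 6.0],
})
chunk_b = pd.DataFrame({
    "anime": ["Haikyuu!!"],
    "Fluff": [8.0],
    "Humor": [5.0],
})

# stack the chunks; a tag missing from one chunk becomes NaN in its rows
merged = pd.concat([chunk_a, chunk_b])

# add up each fandom's rows; NaN is skipped, so an all-NaN cell sums to 0.0
totals = merged.groupby("anime").sum()
totals["total"] = totals.sum(numeric_only=True, axis=1)

print(totals.sort_values(by="total", ascending=False))
```

This is why the big table is full of `0.0` entries: any tag that never made it into a fandom’s per-chunk results ends up as a zero after the sum. It may also be worth keeping in mind that the cutoff of 5 was applied inside each chunk, so a tag used fewer than five times within a single chunk adds nothing to these totals.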
d = pd.read_csv('all.csv').set_index(['anime'])
d.T.sort_values(by = ['Haikyuu!!'], ascending = False).to_csv('trans.csv')
Looks like Haikyuu is the most popular fandom within anime… let’s see which tags get circulated the most!
d.T.sort_values(by = ['Haikyuu!!'], ascending = False).iloc[:,0:1].head(20)
anime | Haikyuu!! |
---|---|
total | 310418.0 |
Haikyuu!! | 31388.0 |
M/M | 24043.0 |
No Archive Warnings Apply | 17782.0 |
Choose Not To Use Archive Warnings | 11513.0 |
Teen And Up Audiences | 10344.0 |
General Audiences | 9880.0 |
Fluff | 8847.0 |
Hinata Shouyou | 8539.0 |
Oikawa Tooru | 7225.0 |
Kuroo Tetsurou | 7123.0 |
Kageyama Tobio | 6927.0 |
Bokuto Koutarou | 6823.0 |
Akaashi Keiji | 6086.0 |
Tsukishima Kei | 5806.0 |
Iwaizumi Hajime | 5582.0 |
F/M | 5217.0 |
Kozume Kenma | 5177.0 |
Angst | 4798.0 |
Sugawara Koushi | 4593.0 |
Looks about right.
I’m going to transpose the dataframe just to see if there’s anything glaringly wrong.
d.T
anime | Haikyuu!! | Naruto | Shingeki no Kyojin \| Attack on Titan | Dangan Ronpa - All Media Types | Miraculous Ladybug | One Piece | RWBY | Avatar: Legend of Korra | Hunter X Hunter | Hetalia: Axis Powers | ... | Yu-Gi-Oh! ARC-V | Umineko no Naku Koro ni \| When the Seagulls Cry | Fate/stay night (Visual Novel) | Fate/Zero | No. 6 (Anime & Manga) | Pocket Monsters: Diamond & Pearl & Platinum \| Pokemon Diamond Pearl Platinum Versions | Psycho-Pass | Senki Zesshou Symphogear | Pokemon Mystery Dungeon | Saiyuki |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
General Audiences | 9880.0 | 2757.0 | 1379.0 | 1494.0 | 3066.0 | 1604.0 | 1212.0 | 899.0 | 632.0 | 959.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
One Piece | 0.0 | 11.0 | 0.0 | 0.0 | 0.0 | 5429.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Roronoa Zoro | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1794.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Teen And Up Audiences | 10344.0 | 3569.0 | 1941.0 | 2263.0 | 2933.0 | 1654.0 | 1641.0 | 914.0 | 869.0 | 1072.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Fluff | 8847.0 | 1353.0 | 1208.0 | 1357.0 | 1523.0 | 759.0 | 426.0 | 545.0 | 477.0 | 161.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Fukawa Touko/Togami Byakuya | 0.0 | 0.0 | 0.0 | 6.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
POV Outsider | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Alternate Universe - Bakery | 6.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Alternate Universe - Neighbors | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
total | 310418.0 | 99093.0 | 76630.0 | 48513.0 | 38693.0 | 37209.0 | 31140.0 | 20980.0 | 20486.0 | 19684.0 | ... | 26.0 | 22.0 | 20.0 | 20.0 | 15.0 | 15.0 | 12.0 | 10.0 | 6.0 | 5.0 |
1428 rows × 80 columns
These numbers are a bit off… I know from looking at the initial overall ship data that Levi/Eren Yeager should make up a significant chunk, but that isn’t reflected here.
I guess most people don’t simultaneously tag a fandom and their ship, but that’s okay. We’re just building a list of anime ships by collecting every tag that gets used alongside an anime fandom tag. It’d be pretty hard for a ship to fall through the cracks this way.
I made a list of every relationship tag in the original tag CSV file, and then used a list comprehension to keep only the columns whose names match a relationship tag. That gives us every ship tag that got used alongside our list of anime.
ships = all[[c for c in all.columns if c in renames]]
ships.T.max().sort_values(ascending = False).head(20)
anime
Haikyuu!! 3664.0
Naruto 1245.0
Avatar: Legend of Korra 945.0
RWBY 907.0
Dangan Ronpa - All Media Types 867.0
Hunter X Hunter 800.0
Shingeki no Kyojin | Attack on Titan 690.0
InuYasha - A Feudal Fairy Tale 525.0
Dragon Ball 210.0
Super Dangan Ronpa 2 209.0
Fairy Tail 151.0
Fullmetal Alchemist: Brotherhood & Manga 140.0
Bleach 128.0
Fruits Basket 103.0
Gintama 89.0
Durarara!! 73.0
Kuroshitsuji | Black Butler 58.0
One Piece 48.0
Yu-Gi-Oh! 5D's 37.0
Tennis no Oujisama | Prince of Tennis 33.0
dtype: float64
This is what we’d get if we only looked at ships that got tagged with an anime title, which is unfortunately not realistic.
ships.max().sort_values(ascending = False)
Iwaizumi Hajime/Oikawa Tooru 3664.0
Hinata Shouyou/Kageyama Tobio 3325.0
Tsukishima Kei/Yamaguchi Tadashi 2107.0
Sawamura Daichi/Sugawara Koushi 1797.0
Uchiha Sasuke/Uzumaki Naruto 1245.0
...
Kisaragi Shintaro/Tateyama Ayano 5.0
Hyuuga Neji/Nara Shikamaru 5.0
Ishimaru Kiyotaka & Oowada Mondo 5.0
Roy Mustang/Riza Hawkeye 5.0
Iason Mink/Riki 5.0
Length: 211, dtype: float64
But we don’t care about these numbers. With these column names, we can get our final list of relevant ship tags:
shiplist = list(ships.columns.values)
shiplist[0:15]
['Higurashi Kagome/InuYasha',
'Haruno Sakura/Uchiha Sasuke',
'Hyuuga Hinata/Uzumaki Naruto',
'Senju Hashirama/Senju Tobirama',
'Hatake Kakashi/Umino Iruka',
'Uchiha Sasuke/Uzumaki Naruto',
'Hoshigaki Kisame/Uchiha Itachi',
'Senju Tobirama/Uchiha Madara',
'Naegi Makoto/Togami Byakuya',
'Hinata Hajime/Komaeda Nagito',
'Korra/Asami Sato',
'Hisoka/Illumi Zoldyck',
'Gon Freecs/Killua Zoldyck',
'Gon Freecs & Killua Zoldyck',
'Grimmjow Jaegerjaques/Kurosaki Ichigo']
Now, let’s filter it with our original tag CSV!
animeships = ds[ds['name'].isin(shiplist)]
animeships.head(20)
| | id | type | name | canonical | cached_count | merger_id |
---|---|---|---|---|---|---|
175042 | 267347 | Relationship | Minor or Background Relationship(s) | True | 35799 | NaN |
660211 | 976131 | Relationship | Levi/Eren Yeager | False | 21010 | 5582955.0 |
929282 | 1329922 | Relationship | Iwaizumi Hajime/Oikawa Tooru | True | 18027 | NaN |
494604 | 758209 | Relationship | Hinata Shouyou/Kageyama Tobio | True | 17150 | NaN |
11553 | 14303 | Relationship | Uchiha Sasuke/Uzumaki Naruto | True | 13393 | NaN |
376089 | 604125 | Relationship | Other Relationship Tags to Be Added | True | 11085 | NaN |
554549 | 836528 | Relationship | Sawamura Daichi/Sugawara Koushi | True | 8743 | NaN |
972783 | 1408234 | Relationship | Tsukishima Kei/Yamaguchi Tadashi | True | 8035 | NaN |
597975 | 893104 | Relationship | Levi/Erwin Smith | False | 7521 | 2403411.0 |
687472 | 1011847 | Relationship | Marco Bott/Jean Kirstein | True | 6755 | NaN |
11548 | 14298 | Relationship | Hatake Kakashi/Umino Iruka | True | 6741 | NaN |
577424 | 865991 | Relationship | Nanase Haruka/Tachibana Makoto | True | 6319 | NaN |
271602 | 471636 | Relationship | Korra/Asami Sato | True | 6188 | NaN |
8297 | 10230 | Relationship | Edward Elric/Roy Mustang | True | 6114 | NaN |
733271 | 1072769 | Relationship | Blake Belladonna/Yang Xiao Long | True | 5556 | NaN |
76081 | 110362 | Relationship | Haruno Sakura/Uchiha Sasuke | True | 5071 | NaN |
458156 | 711036 | Relationship | Hinata Hajime/Komaeda Nagito | True | 4572 | NaN |
667984 | 986370 | Relationship | eruri | False | 4273 | 2403411.0 |
953482 | 1362296 | Relationship | Azumane Asahi/Nishinoya Yuu | True | 4224 | NaN |
78454 | 113712 | Relationship | Heiwajima Shizuo/Orihara Izaya | True | 4127 | NaN |
It looks like, as of 2021 at least, Levi and Eren Yeager from Attack on Titan were the most popular ship (skipping catch-all tags like “Minor or Background Relationship(s)”). They’re followed by Iwaizumi Hajime and Oikawa Tooru from Haikyuu!!, behind by around 3,000. Hinata Shouyou and Kageyama Tobio, also from Haikyuu!!, take third place, lagging by just under 1,000. Fourth place is Uchiha Sasuke and Uzumaki Naruto from Naruto.
Fifth and sixth place also go to the Haikyuu!! fandom. Fifth place is the Sawamura Daichi and Sugawara Koushi pairing, and sixth place is Tsukishima Kei/Yamaguchi Tadashi. Attack on Titan makes a comeback at seventh and eighth place, with Levi/Erwin Smith and Marco Bott/Jean Kirstein, though both lag far behind Levi/Eren Yeager. Naruto takes ninth place with Hatake Kakashi/Umino Iruka.
Despite Attack on Titan’s spot in first place, it looks like Haikyuu!! dominates the leaderboard overall, with numerous ships enjoying similar levels of popularity.
Tenth place sees the first anime other than Attack on Titan, Haikyuu!!, or Naruto, with Nanase Haruka/Tachibana Makoto from Free! Eleventh place sees the first non-Male/Male pairing, with Korra/Asami Sato from Legend of Korra. Twelfth is Edward Elric/Roy Mustang from Fullmetal Alchemist: Brotherhood, and thirteenth is Blake Belladonna/Yang Xiao Long from RWBY.
animeships = animeships.reset_index()
print (animeships[animeships['name'] == "L/Yagami Light"])
index id type name canonical cached_count \
23 9812 12239 Relationship L/Yagami Light True 3881
merger_id
23 NaN
print (animeships[animeships['name'] == "Mikasa Ackerman/Eren Yeager"])
index id type name canonical \
54 687489 1011866 Relationship Mikasa Ackerman/Eren Yeager True
cached_count merger_id
54 1622 NaN
Looks like I was closer to being correct! Even though we were both far, far from being right overall… Of course, this may be because of the way I chose to measure a ship’s popularity… it’s pretty easy to argue that a more popular ship would inspire more fanfiction, but fanfiction writers make up a specialized group within a fandom… and they tend to heavily favor Male/Male relationships. Thanks for reading! Apologies to any anime/AO3 enthusiasts if I made any obvious missteps… I can only be considered an amateur in both realms.