Competitive Pokemon battling is a form of play in which players adhere to agreed-upon rule sets to create a competitive environment. This kind of play often takes place on a website called Pokemon Showdown, which simulates Pokemon battles as they occur in the mainline Pokemon games and is very popular due to its ease of use and efficiency. On the site, each individual Pokemon is assigned a tier, a label that represents how well it performs in the competitive landscape.
On November 18th, 2022, Pokemon Scarlet and Violet were released for the Nintendo Switch. Whenever a new Pokemon game is released, new Pokemon are typically introduced alongside it; in this case, $103$ new Pokemon were added. On Pokemon Showdown, there is a period that typically lasts around two months during which website administrators observe player games in order to gather data and assign tiers to these brand-new Pokemon. Using different machine learning models, we would like to predict these tiers ahead of Pokemon Showdown. In other words, we want to use machine learning to converge to the true tiers of these new Pokemon faster than Pokemon Showdown would organically, leading to balanced and fair gameplay more quickly.
Information regarding both competitive Pokemon battling and Pokemon Showdown is available at these pages:
# importing the libraries necessary for this project.
import pandas as pd
import json
import os
Gathering data for this project was split into two parts: the understanding phase and the collection phase. We already knew that we wanted to use pokemonshowdown.com and smogon.com as our sources of information, so the next step was to understand how the data was stored on the site and how we would be able to extract it.
When scraping a website, it's tempting to immediately reach for libraries like BeautifulSoup or Puppeteer. However, this can often be unnecessary: depending on the site we are targeting, the data may already be clean and simply available in memory. During the exploration phase of the collection, we first analyzed how the site worked: we looked at various source files for information, logs, or network requests. As it turned out, within 20 minutes of exploration, we discovered that all of the site's data was stored inside JavaScript objects that were loaded when the site first loads. After filtering through dozens of requests, we were able to identify a select few which held most of the site's data regarding Pokemon. From there, we were able to copy the objects from the console into their own files, which ended up saving us a lot more time than writing a complete Python web scraper would have.
After gathering most of the site's data, we began to explore what data might be of interest to us. This part was largely experimentation in pandas. A lot of the later stages of this experimentation are detailed below, but the early stages consisted mostly of understanding what the data we had collected actually was. After developing a basic understanding of the site's data and how it was structured, we were able to move on to the pure data processing step.
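As a quick sanity check during this exploration, it helps to confirm that each copied object parses as valid JSON and to peek at its top-level keys before doing any real processing. A minimal sketch, assuming the console dumps were saved under the ./data/ paths used below (this snippet is illustrative and was not part of the original pipeline):
import json
# sanity check: confirm each copied console dump parses as JSON and inspect its top-level keys
for path in ['./data/battle_pokedex.json',
             './data/battle_type_chart.json',
             './data/battle_teambuilder_table.json']:
    with open(path) as f:
        obj = json.load(f)
    print(path, '->', len(obj), 'top-level keys, e.g.', list(obj)[:5])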
At this stage of the project, the objective was clear: work in Python and Pandas as much as possible. The reason for this is that we were handling JSON files often hundreds of thousands of lines long, and we absolutely did not want to make any changes that were not reproducible.
Once we had an understanding of the files we had from Pokemon Showdown, we could use the Pandas functions .read_json()
and .read_csv()
to turn these files into Pandas dataframes, where we can manipulate the data much more easily.
# load the various data sources. We take the transpose when necessary due to the structure of the JSON file.
pokemons_df = pd.read_json('./data/battle_pokedex.json').T
battle_types_df = pd.read_json('./data/battle_type_chart.json').T
battle_team_builder_table = open('./data/battle_teambuilder_table.json')
usage_df = pd.read_csv('./data/pokemon_usage.csv')
Since we're pulling our data from many different sources, the set of Pokemon we would like to work with is often represented in slightly different ways. Consider the Pokemon Venusaur, for example. There are two variations of this Pokemon: Venusaur-mega and Venusaur-gmax. Some sources include these variations as separate Pokemon entirely, while other sources forego including such variations altogether. This project will focus on the set of Pokemon that does not include these variations, both because using a variation is often a player's choice and because data regarding these variations is usually sparse or inconsistent.
Let's begin by loading in the set of Pokemon that this project will focus on. That is, the set of Pokemon that excludes variations. This set is found in the battle_team_builder_table
JSON file in an entry named learnsets. We extract this entry here and turn it into its own dataframe, transposing it so that each row (observation) is a single Pokemon.
team_builder_json = json.load(battle_team_builder_table) # this json object contains the set we want.
pokemon_moves = pd.DataFrame(team_builder_json['learnsets']).T # turn this entry into a dataframe
pokemon_moves
blizzard | bubblebeam | cut | doubleedge | earthquake | fissure | fly | icebeam | megakick | megapunch | ... | doodle | makeitrain | ruination | collisioncourse | electrodrift | gigatonhammer | armorcannon | bitterblade | paleowave | shadowstrike | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
missingno | 123456789 | 123456789 | 123 | 123456789 | 123456789 | 123456789 | 123 | 123456789 | 123456789 | 123456789 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
bulbasaur | NaN | NaN | 123456789p | 123456789pqg | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
ivysaur | NaN | NaN | 123456789p | 123456789pqg | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
venusaur | NaN | NaN | 123456789p | 123456789pqg | 3456789pqg | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
charmander | NaN | NaN | 123456789p | 123456789 | NaN | NaN | NaN | NaN | 123456789g | 123456789g | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
duohm | 456789qg | NaN | NaN | NaN | NaN | NaN | NaN | 456789qg | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
dorsoil | NaN | NaN | NaN | 456789qg | 456789qg | 456789qg | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
protowatt | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
venomicon | NaN | NaN | NaN | NaN | NaN | NaN | 89g | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
saharaja | NaN | NaN | NaN | NaN | 89g | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1196 rows × 818 columns
The columns of the pokemon_moves
dataframe correspond to every move in the Pokemon video games. A move is an action that a Pokemon can perform in a battle; we will elaborate on these features in more depth shortly. For now, if a given cell is not NaN, the Pokemon in that row is able to perform the move in that column. We will have to process these columns later to convert them into the one-hot encoded format we desire. The specific values such as 123456789
are more or less irrelevant for our purposes. What's important here is that the set of Pokemon represented by the observations in this dataframe is the set we desire, plus some additional fake Pokemon that must be removed. To clarify, a fake Pokemon is a fan-made entity that isn't official; these fan-made Pokemon are beyond the scope of this project. We'll begin removing these fake Pokemon entries now.
df = pokemons_df[['num', 'baseStats', 'abilities']] # creating our main dataframe, we filter only for fields we care about
df = df.sort_values(by=['num']) # sort by the Pokemon number. Some are negative, these are fake pokemons
df.drop(df[df['num'] < 1].index, inplace=True) # removing fake pokemon. These Pokemon are represented by negative num col vals.
df = df.reset_index(drop=False).rename(columns={'index': 'name'}) # index the dataframe using numbers, used to join later
# extract base stats
base_stats = pd.json_normalize(df['baseStats'])
df = df.join(base_stats)
df = df.drop('baseStats', axis=1)
df
name | num | abilities | hp | atk | def | spa | spd | spe | |
---|---|---|---|---|---|---|---|---|---|
0 | bulbasaur | 1 | {'0': 'Overgrow', 'H': 'Chlorophyll'} | 45 | 49 | 49 | 65 | 65 | 45 |
1 | ivysaur | 2 | {'0': 'Overgrow', 'H': 'Chlorophyll'} | 60 | 62 | 63 | 80 | 80 | 60 |
2 | venusaurgmax | 3 | {'0': 'Overgrow', 'H': 'Chlorophyll'} | 80 | 82 | 83 | 100 | 100 | 80 |
3 | venusaurmega | 3 | {'0': 'Thick Fat'} | 80 | 100 | 123 | 122 | 120 | 80 |
4 | venusaur | 3 | {'0': 'Overgrow', 'H': 'Chlorophyll'} | 80 | 82 | 83 | 100 | 100 | 80 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1293 | toedscool | 1006 | {'0': 'Mycelium Might'} | 40 | 40 | 35 | 50 | 100 | 70 |
1294 | toedscruel | 1007 | {'0': 'Mycelium Might'} | 80 | 70 | 65 | 80 | 120 | 100 |
1295 | kingambit | 1008 | {'0': 'Defiant', '1': 'Supreme Overlord', 'H':... | 100 | 135 | 120 | 60 | 85 | 50 |
1296 | clodsire | 1009 | {'0': 'Poison Point', '1': 'Water Absorb', 'H'... | 130 | 75 | 60 | 45 | 100 | 20 |
1297 | annihilape | 1010 | {'0': 'Vital Spirit', '1': 'Inner Focus', 'H':... | 110 | 115 | 80 | 50 | 90 | 90 |
1298 rows × 9 columns
The set of Pokemon stored in the dataframe df
still includes the variations we do not want. However, fake Pokemon were easily filtered out of it using the num column, which is a unique number corresponding to each Pokemon. We'll now use df, which no longer includes fake Pokemon, to similarly remove fake Pokemon from the variation-free set currently in pokemon_moves. This will give us the proper set of Pokemon to focus on: one with no variations of a Pokemon and no fake Pokemon.
We'll now return to the pokemon_moves
dataframe to remove fake Pokemon using the df
dataframe.
pokemon_moves = pokemon_moves.copy()
pokemon_moves['name'] = pokemon_moves.index # turns dataframe index into column to make further operations easier.
pokemon_moves = pokemon_moves[pokemon_moves['name'].isin(list(df['name']))] # this will remove all fake pokemon from pokemon_moves
pokemon_moves = pokemon_moves.drop('name', axis=1) # the name column that was added is no longer needed
pokemon_moves
blizzard | bubblebeam | cut | doubleedge | earthquake | fissure | fly | icebeam | megakick | megapunch | ... | doodle | makeitrain | ruination | collisioncourse | electrodrift | gigatonhammer | armorcannon | bitterblade | paleowave | shadowstrike | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
bulbasaur | NaN | NaN | 123456789p | 123456789pqg | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
ivysaur | NaN | NaN | 123456789p | 123456789pqg | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
venusaur | NaN | NaN | 123456789p | 123456789pqg | 3456789pqg | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
charmander | NaN | NaN | 123456789p | 123456789 | NaN | NaN | NaN | NaN | 123456789g | 123456789g | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
charmeleon | NaN | NaN | 123456789p | 123456789 | NaN | NaN | NaN | NaN | 123456789g | 123456789g | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
charcadet | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
armarouge | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | 9a | NaN | NaN | NaN |
ceruledge | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9a | NaN | NaN |
toedscool | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
toedscruel | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1126 rows × 818 columns
Now we have the exact set of Pokemon we want to focus on in the form of the observations in the pokemon_moves
dataframe. With our proper set of Pokemon in hand, let's introduce the features that we will be focusing on.
We'll be joining dataframes eventually to generate the complete data that our models will train on. Dataframes will be joined into the df
dataframe. This table will consist of the following columns.
number $\in \mathbb N$: The unique Pokemon number
tier (String): The label that our models will be training on
hp $\in \mathbb N$: Health statistic of the Pokemon
attack $\in \mathbb N$: Attack statistic of the Pokemon
defence $\in \mathbb N$: Defense statistic of the Pokemon
special_attack $\in \mathbb N$: Special attack statistic of the Pokemon
special_defence $\in \mathbb N$: Special defence statistic of the Pokemon
speed $\in \mathbb N$: Speed statistic of the Pokemon
usage $\in [0, 1]$: The usage of the Pokemon in competitive play
move $m \in \{0, 1\}$ for all $m \in$ move: One-hot columns representing whether the Pokemon can perform this move
ability $a \in \{0, 1\}$ for all $a \in$ ability: One-hot columns representing whether the Pokemon has this ability

To start, let's remove the unwanted Pokemon from df, which will be the main dataframe from now on.
df = df.copy()
df = df[df['name'].isin(list(pokemon_moves.index.values))] # keep only Pokemon whose names appear in pokemon_moves, removing the unwanted variations from df
df
name | num | abilities | hp | atk | def | spa | spd | spe | |
---|---|---|---|---|---|---|---|---|---|
0 | bulbasaur | 1 | {'0': 'Overgrow', 'H': 'Chlorophyll'} | 45 | 49 | 49 | 65 | 65 | 45 |
1 | ivysaur | 2 | {'0': 'Overgrow', 'H': 'Chlorophyll'} | 60 | 62 | 63 | 80 | 80 | 60 |
4 | venusaur | 3 | {'0': 'Overgrow', 'H': 'Chlorophyll'} | 80 | 82 | 83 | 100 | 100 | 80 |
5 | charmander | 4 | {'0': 'Blaze', 'H': 'Solar Power'} | 39 | 52 | 43 | 60 | 50 | 65 |
6 | charmeleon | 5 | {'0': 'Blaze', 'H': 'Solar Power'} | 58 | 64 | 58 | 80 | 65 | 80 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1293 | toedscool | 1006 | {'0': 'Mycelium Might'} | 40 | 40 | 35 | 50 | 100 | 70 |
1294 | toedscruel | 1007 | {'0': 'Mycelium Might'} | 80 | 70 | 65 | 80 | 120 | 100 |
1295 | kingambit | 1008 | {'0': 'Defiant', '1': 'Supreme Overlord', 'H':... | 100 | 135 | 120 | 60 | 85 | 50 |
1296 | clodsire | 1009 | {'0': 'Poison Point', '1': 'Water Absorb', 'H'... | 130 | 75 | 60 | 45 | 100 | 20 |
1297 | annihilape | 1010 | {'0': 'Vital Spirit', '1': 'Inner Focus', 'H':... | 110 | 115 | 80 | 50 | 90 | 90 |
1126 rows × 9 columns
Now that we have a rudimentary table of a Pokemon's name, unique number, stats (we use the term stats to refer to the hp
, atk
, def
, spa
, spd
, and spe
features), and abilities, we can convert the abilities column, which is currently a dictionary for each Pokemon, into one-hot encoded columns based on whether or not a Pokemon has access to a given ability. The current dictionary keys are unnecessary here; the values of each Pokemon's abilities dictionary are what we wish to convert to one-hot encoded columns. To clarify, if a Pokemon has access to an ability, the cell [Pokemon][ability] will be 1; otherwise, it will be 0. We'll make use of this technique extensively throughout this project.
To clarify, abilities are unique passive behaviors that a Pokemon has during a battle. More information is available here.
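Before encoding the real abilities column, here is a toy illustration of the technique on two hypothetical rows (not the actual data): each dictionary value becomes its own indicator column, and columns that end up with the same name after encoding, because the same ability appears in different slots, are summed back together.
import pandas as pd
# toy illustration (hypothetical rows) of the one-hot trick used below and throughout the project
toy = pd.DataFrame({'0': ['Overgrow', 'Chlorophyll'], 'H': ['Chlorophyll', None]},
                   index=['bulbasaur', 'examplemon'])
toy_onehot = pd.get_dummies(toy, columns=['0', 'H'], prefix='', prefix_sep='')  # one column per ability value
toy_onehot = toy_onehot.groupby(level=0, axis=1).sum()  # merge duplicate ability columns
print(toy_onehot)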
abilities = pd.json_normalize(df['abilities']) # extract abilities into columns
# create an abilities table with pokemon names as index and four ability columns based on dict keys
abilities = pd.concat([df, abilities], axis=1)[['name', '0', '1', 'H', 'S']].set_index('name')
# create dummy columns for each ability, essentially one-hot encoding the abilities
abilities = pd.get_dummies(abilities, columns=['0', '1', 'H', 'S'], prefix='', prefix_sep='')
abilities = abilities.groupby(level=0, axis=1).sum() # group columns with the same name
abilities = abilities.sort_index(axis=1) # sort columns alphabetically
abilities = abilities.rename(columns=lambda name: 'abi_' + name.replace(' ', '_').lower()) # rename columns to be more readable
abilities = abilities[abilities.index.notnull()] #Dropping NaN rows that were originally variations of Pokemon.
abilities
abi_adaptability | abi_aftermath | abi_air_lock | abi_analytic | abi_anger_point | abi_anger_shell | abi_anticipation | abi_arena_trap | abi_armor_tail | abi_aroma_veil | ... | abi_weak_armor | abi_well-baked_body | abi_white_smoke | abi_wimp_out | abi_wind_power | abi_wind_rider | abi_wonder_guard | abi_wonder_skin | abi_zen_mode | abi_zero_to_hero | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
name | |||||||||||||||||||||
bulbasaur | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
ivysaur | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
venusaur | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
charmander | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
charmeleon | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
toedscool | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
toedscruel | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
kingambit | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
clodsire | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
annihilape | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1126 rows × 293 columns
Here is the dataframe of one-hot encoded abilities, which indicates if a Pokemon has access to any given ability. We'll return to this dataframe later to concatenate it to the main df
dataframe.
Before we move on to the next set of features, we'll rename a few columns of the main dataframe to make understanding easier.
# rename the columns to something nicer to work with
df = df.rename(columns={
'num': 'number',
'atk': 'attack',
'spa': 'special_attack',
'def': 'defence',
'spd': 'special_defence',
'spe': 'speed',
})
df = df.set_index('name', drop=True) # set index to be the pokemon name, used to join later
df
number | abilities | hp | attack | defence | special_attack | special_defence | speed | |
---|---|---|---|---|---|---|---|---|
name | ||||||||
bulbasaur | 1 | {'0': 'Overgrow', 'H': 'Chlorophyll'} | 45 | 49 | 49 | 65 | 65 | 45 |
ivysaur | 2 | {'0': 'Overgrow', 'H': 'Chlorophyll'} | 60 | 62 | 63 | 80 | 80 | 60 |
venusaur | 3 | {'0': 'Overgrow', 'H': 'Chlorophyll'} | 80 | 82 | 83 | 100 | 100 | 80 |
charmander | 4 | {'0': 'Blaze', 'H': 'Solar Power'} | 39 | 52 | 43 | 60 | 50 | 65 |
charmeleon | 5 | {'0': 'Blaze', 'H': 'Solar Power'} | 58 | 64 | 58 | 80 | 65 | 80 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
toedscool | 1006 | {'0': 'Mycelium Might'} | 40 | 40 | 35 | 50 | 100 | 70 |
toedscruel | 1007 | {'0': 'Mycelium Might'} | 80 | 70 | 65 | 80 | 120 | 100 |
kingambit | 1008 | {'0': 'Defiant', '1': 'Supreme Overlord', 'H':... | 100 | 135 | 120 | 60 | 85 | 50 |
clodsire | 1009 | {'0': 'Poison Point', '1': 'Water Absorb', 'H'... | 130 | 75 | 60 | 45 | 100 | 20 |
annihilape | 1010 | {'0': 'Vital Spirit', '1': 'Inner Focus', 'H':... | 110 | 115 | 80 | 50 | 90 | 90 |
1126 rows × 8 columns
Now we'll focus on the set of Pokemon move features. Similar to abilities, we want each Pokemon move to be a one-hot encoded column corresponding to whether or not a Pokemon can perform the move. As we mentioned briefly before, pokemon_moves, the dataframe we used previously to obtain the set of Pokemon we're focusing on, is also where Pokemon move information is stored. As such, we'll return to this dataframe and process its columns into the one-hot encoded form we desire. We'll begin by discarding the previous values associated with each move and storing only whether or not the Pokemon can perform the move, using 1
when a Pokemon can perform a move and 0
otherwise.
More on Pokemon moves here
pokemon_moves = pokemon_moves.copy()
# replace all values with either 0 or 1 depending on if the pokemon can perform that move
pokemon_moves[pokemon_moves.notnull()] = 1
pokemon_moves.fillna(0, inplace=True)
pokemon_moves = pokemon_moves.sort_index(axis=1) # sort the columns alphabetically for good measure
# here, we append "move_" to the beginning of each column so that we don't get name collision
pokemon_moves.rename(columns=lambda col_name: 'move_' + col_name, inplace=True)
pokemon_moves
move_absorb | move_accelerock | move_acid | move_acidarmor | move_acidspray | move_acrobatics | move_acupressure | move_aerialace | move_aeroblast | move_afteryou | ... | move_workup | move_worryseed | move_wrap | move_wringout | move_xscissor | move_yawn | move_zapcannon | move_zenheadbutt | move_zingzap | move_zippyzap | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
bulbasaur | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
ivysaur | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
venusaur | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
charmander | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
charmeleon | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
charcadet | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
armarouge | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
ceruledge | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
toedscool | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
toedscruel | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1126 rows × 818 columns
Now we have a dataframe of one-hot encoded moves as we desired. Similar to the abilities dataframe, we'll return to pokemon_moves
later on when concatenating all dataframes together into the main df
dataframe.
The third set of features we'll be focusing on is Pokemon typing. There are 18 types and they are used to determine how Pokemon will interact with each other during a battle. Each Pokemon can have at most two types. Again, we'll be creating a dataframe of one-hot encoded types that will be appended to df
later.
More on types
#BANNED_COLS = ['Bird'] # a strange additional type that appears in the data
types_df = pokemons_df[['types']] # create a dataframe for our types
types_df = types_df.explode('types') # each pokemon can have up to 2 types, so we explode the list into a row for each
types_df = pd.get_dummies(types_df, columns=['types'], prefix='', prefix_sep='') # one hot encode the types
types_df = types_df.groupby(level=0).sum() # group columns with the same name, summing the values
# remove banned columns
#types_df = types_df.drop(columns=BANNED_COLS)
types_df = types_df.rename(columns=lambda name: 'type_' + name.replace(' ', '_').lower()) # rename columns to be more readable
types_df
type_bird | type_bug | type_dark | type_dragon | type_electric | type_fairy | type_fighting | type_fire | type_flying | type_ghost | type_grass | type_ground | type_ice | type_normal | type_poison | type_psychic | type_rock | type_steel | type_water | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
abomasnow | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
abomasnowmega | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
abra | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
absol | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
absolmega | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
zubat | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
zweilous | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
zygarde | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
zygarde10 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
zygardecomplete | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1385 rows × 19 columns
As you'll notice, the number of observations in types_df
does not match that of the previous dataframes. This is because we retrieved the Pokemon type data from the same source that contained variations of Pokemon, which means the variations we removed previously are still present in types_df
. To rectify this, we'll apply the same solution we used to get the proper set of Pokemon earlier.
types_df = types_df.copy()
types_df['name'] = types_df.index # adding the index as a column to make operations easier
types_df = types_df[types_df['name'].isin(list(df.index.values))] # this will remove all undesired Pokemon from types_df
types_df = types_df.drop('name', axis=1) # dropping the added name column since it's no longer needed
types_df
type_bird | type_bug | type_dark | type_dragon | type_electric | type_fairy | type_fighting | type_fire | type_flying | type_ghost | type_grass | type_ground | type_ice | type_normal | type_poison | type_psychic | type_rock | type_steel | type_water | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
abomasnow | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
abra | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
absol | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
accelgor | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
aegislash | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
zoruahisui | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
zubat | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
zweilous | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
zygarde | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
zygarde10 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1126 rows × 19 columns
types_df
is now ready to be concatenated to df
later.
We'll now begin processing data related to Pokemon tier. There are many different game formats on Pokemon Showdown, and each assigns a different tier to our set of Pokemon. For this project, we're specifically focusing on the gen8natdex
format, as this is the most recent format before the release of Pokemon Scarlet and Violet. The data in team_builder_json
is stored as a list in which each nested list marks the start of a new tier and each plain string entry is a Pokemon belonging to the current tier. So to retrieve this data and put it into a Pandas dataframe, we'll first need to process it.
tiers = team_builder_json['gen8natdex']['tiers'] # extract the tiers from team_builder specifically for gen8natdex
# build a tier object
pokemon_tiers = {
'name': [],
'tier': []
}
# parse the tiers into a dataframe
current_tier = ''
for el in tiers:
if type(el) is list: # it's a new tier
current_tier = el[1] # set the current tier
else:
# it's a pokemon
pokemon_tiers['name'].append(el)
pokemon_tiers['tier'].append(current_tier)
pokemon_tiers = pd.DataFrame(pokemon_tiers) # finally converting processed data to dataframe
pokemon_tiers
name | tier | |
---|---|---|
0 | arghonaut | CAP |
1 | astrolotl | CAP |
2 | aurumoth | CAP |
3 | caribolt | CAP |
4 | cawmodore | CAP |
... | ... | ... |
1180 | yungoos | LC |
1181 | zigzagoon | LC |
1182 | zigzagoongalar | LC |
1183 | zorua | LC |
1184 | zubat | LC |
1185 rows × 2 columns
Again, the set of Pokemon here doesn't align with the set we're focusing on. This is partly because fake Pokemon and a few Pokemon variations have tier rankings that we'll need to filter out later. More importantly, it's natural that the set of Pokemon that have tier labels is smaller than the total set of Pokemon we're focusing on. After all, the $103$ new Pokemon that were just released in Pokemon Scarlet and Violet don't have associated tier labels yet! For these Pokemon, the tier will be NaN when we concatenate pokemon_tiers
with df
.
The final feature that we need comes from a set of CSV files indicating how often a Pokemon is used on Pokemon Showdown. This data actually comes from Smogon, the website responsible for administrating Pokemon Showdown. Each individual file covers only a single month of play, which is not a large enough window to give ample data. As such, we take into account 11 months of play using 11 corresponding CSV files. These files contain a column named real
, which represents the raw number of times a given Pokemon was used in high-level play that month. For Pokemon that have usage in more than one month (the case for most Pokemon), we simply accumulate their usage.
USAGE_PATH = './data/usage'
file_lst = os.listdir(USAGE_PATH)
all_usage_df = [] # list of dataframes
# load each csv into its own dataframe
for file_name in file_lst:
file_path = os.path.join(USAGE_PATH, file_name)
# load the current month into a df
month_df = pd.read_csv(file_path)
all_usage_df.append(month_df)
usage_df = pd.concat(all_usage_df, axis=0, ignore_index=True) # concatenate all the dataframes into one
usage_df = usage_df[['pokemon', 'real']] # we only care about the pokemon name and the number of times it was used
usage_df = usage_df.groupby('pokemon').agg('sum') # groupby pokemon name and add up all uses
usage_df = usage_df.rename(columns={'real': 'num_usage'}) # rename our column to something more fitting
usage_df.index = usage_df.index.str.lower() # lowercase pokemon names
usage_df.index = usage_df.index.str.replace(r'[-%\':.]', '', regex=True) # clean up pokemon names; we don't want - or %. This helps later
usage_df
num_usage | |
---|---|
pokemon | |
abomasnow | 17831 |
abomasnowmega | 23020 |
abra | 2566 |
absol | 26056 |
absolmega | 72637 |
... | ... |
zorua | 1093 |
zubat | 1405 |
zweilous | 967 |
zygarde | 309752 |
zygarde10 | 25629 |
1068 rows × 1 columns
The set of Pokemon is once again not the same as our valid set because it comes from Smogon, not Pokemon Showdown directly. This is actually why we had to process out characters such as %
or -
, because some Pokemon names are not recorded exactly the same on Smogon and Pokemon Showdown. Furthermore, it's natural that the new Scarlet and Violet Pokemon do not have usage values yet, for the same reason they do not have tier labels yet. Regardless, we'll have to filter out the Pokemon not in our valid set before concatenating usage_df
to df
. Before we do this however, it will be easier to concatenate all data we've accumulated thus far into df
.
df = df.copy()
df['name'] = df.index # adding the index as a column to make concatenating easier
# concatenate one-hot encoded abilities with the main dataframe
df = df.set_index('name', drop=True) # setting index to name to mirror the structure of abilities dataframe
df = pd.concat([df, abilities], axis=1, join='inner') # concatenating
df = df.drop('abilities', axis=1) # no longer needed since we have one-hot encoded columns representing the same data.
df = pd.concat([df, pokemon_moves], axis=1, join='inner') #concatenate one-hot encoded moves with the main dataframe
df = df.join(types_df) #join main dataframe with one-hot encoded type columns
# update our dataframe with the correct tier information
df = df.copy()
df['name'] = df.index # adding a name column to help merge with pokemon_tiers
df = pd.merge(df,pokemon_tiers, on='name', how='left') # merging
df
number | hp | attack | defence | special_attack | special_defence | speed | abi_adaptability | abi_aftermath | abi_air_lock | ... | type_ground | type_ice | type_normal | type_poison | type_psychic | type_rock | type_steel | type_water | name | tier | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 45 | 49 | 49 | 65 | 65 | 45 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | bulbasaur | LC |
1 | 2 | 60 | 62 | 63 | 80 | 80 | 60 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ivysaur | NFEs not in a higher tier |
2 | 3 | 80 | 82 | 83 | 100 | 100 | 80 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | venusaur | RU |
3 | 4 | 39 | 52 | 43 | 60 | 50 | 65 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | charmander | LC |
4 | 5 | 58 | 64 | 58 | 80 | 65 | 80 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | charmeleon | NFEs not in a higher tier |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1121 | 1006 | 40 | 40 | 35 | 50 | 100 | 70 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | toedscool | NaN |
1122 | 1007 | 80 | 70 | 65 | 80 | 120 | 100 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | toedscruel | NaN |
1123 | 1008 | 100 | 135 | 120 | 60 | 85 | 50 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | kingambit | NaN |
1124 | 1009 | 130 | 75 | 60 | 45 | 100 | 20 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | clodsire | NaN |
1125 | 1010 | 110 | 115 | 80 | 50 | 90 | 90 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | annihilape | NaN |
1126 rows × 1139 columns
We almost have our desired dataframe! All we need to do now is further process usage_df
and then add that data to df
.
We'll now take a look at Pokemon in df
that are not in usage_df
. These should be Pokemon that either have no recorded usage yet (the new Scarlet and Violet Pokemon) or are variations we failed to remove previously. We'll find them using an anti-join.
# anti join
anti_df = df[~df['name'].isin(usage_df.index)]
anti_df = anti_df['name']
anti_df
28 pikachucosplay 29 pikachurockstar 30 pikachubelle 31 pikachupopstar 32 pikachulibre ... 1121 toedscool 1122 toedscruel 1123 kingambit 1124 clodsire 1125 annihilape Name: name, Length: 158, dtype: object
While most of these Pokemon are certainly those that were released in Pokemon Scarlet and Violet, a handful such as the pikachucosplay or pikachulibre are variations that should have been removed previously. Now we'll drop these Pokemon variations by creating lists of these erroneous observations and using said lists to remove these observations from df
.
# getting the specific rows that need to be filtered using .contains along the name column
pikachu_df = df[df['name'].str.contains('pikachu')]
pikachu_df = pikachu_df[pikachu_df.name != 'pikachu']
eevee_row = df[df['name'].str.contains('eeveestarter')]
totem_pokemon_df = df[df['name'].str.contains('totem')]
df.set_index('name', inplace=True) # will help with removing these observations
# removing rows from the main dataframe if a given row is in the list of values accumulated above
df = df.drop(eevee_row.name.values.tolist())
df = df.drop(pikachu_df.name.values.tolist())
df = df.drop(totem_pokemon_df.name.values.tolist())
df
number | hp | attack | defence | special_attack | special_defence | speed | abi_adaptability | abi_aftermath | abi_air_lock | ... | type_grass | type_ground | type_ice | type_normal | type_poison | type_psychic | type_rock | type_steel | type_water | tier | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
name | |||||||||||||||||||||
bulbasaur | 1 | 45 | 49 | 49 | 65 | 65 | 45 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | LC |
ivysaur | 2 | 60 | 62 | 63 | 80 | 80 | 60 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | NFEs not in a higher tier |
venusaur | 3 | 80 | 82 | 83 | 100 | 100 | 80 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | RU |
charmander | 4 | 39 | 52 | 43 | 60 | 50 | 65 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | LC |
charmeleon | 5 | 58 | 64 | 58 | 80 | 65 | 80 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NFEs not in a higher tier |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
toedscool | 1006 | 40 | 40 | 35 | 50 | 100 | 70 | 0 | 0 | 0 | ... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN |
toedscruel | 1007 | 80 | 70 | 65 | 80 | 120 | 100 | 0 | 0 | 0 | ... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN |
kingambit | 1008 | 100 | 135 | 120 | 60 | 85 | 50 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | NaN |
clodsire | 1009 | 130 | 75 | 60 | 45 | 100 | 20 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | NaN |
annihilape | 1010 | 110 | 115 | 80 | 50 | 90 | 90 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN |
1099 rows × 1138 columns
Now we know that every Pokemon that won't have a usage value is a Pokemon that was released in Pokemon Scarlet and Violet. These Pokemon will have a usage value of NaN when we join usage_df
with df
, but we'll replace NaN with $0$ here since these new Pokemon haven't actually been used yet on Pokemon Showdown.
df = df.copy()
df = df.join(usage_df) # joining usage with main dataframe
df = df.fillna(0) # replace the NaN's generated by Scarlet and Violet Pokemon with 0; note this also sets their missing tier to 0, which we use later to separate them out
df = df.drop('number', axis=1) # dropping the unique pokemon identifier number since we're done processing.
df
hp | attack | defence | special_attack | special_defence | speed | abi_adaptability | abi_aftermath | abi_air_lock | abi_analytic | ... | type_ground | type_ice | type_normal | type_poison | type_psychic | type_rock | type_steel | type_water | tier | num_usage | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
name | |||||||||||||||||||||
bulbasaur | 45 | 49 | 49 | 65 | 65 | 45 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | LC | 5513.0 |
ivysaur | 60 | 62 | 63 | 80 | 80 | 60 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | NFEs not in a higher tier | 2868.0 |
venusaur | 80 | 82 | 83 | 100 | 100 | 80 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | RU | 128430.0 |
charmander | 39 | 52 | 43 | 60 | 50 | 65 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | LC | 2306.0 |
charmeleon | 58 | 64 | 58 | 80 | 65 | 80 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NFEs not in a higher tier | 1808.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
toedscool | 40 | 40 | 35 | 50 | 100 | 70 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 |
toedscruel | 80 | 70 | 65 | 80 | 120 | 100 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 |
kingambit | 100 | 135 | 120 | 60 | 85 | 50 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0.0 |
clodsire | 130 | 75 | 60 | 45 | 100 | 20 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0.0 |
annihilape | 110 | 115 | 80 | 50 | 90 | 90 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 |
1099 rows × 1138 columns
We finally have a dataframe of all the Pokemon we wish to focus on that contains all of our desired features! With this, our data processing is nearly complete. From here, we'll simply make the dataframe more readable and ensure that the dataframe is ready to be used by our machine learning models.
A few label names are misleading or capture nuances not necessary for this project. The tier label Uber by technicality
, for example, still consists of Pokemon considered to be in the Uber
tier despite the "technicality". We'll also rename NFEs not in a higher tier
to NFE
simply to make it easier to read.
#renaming tiers using Pandas .loc() function
df.loc[df["tier"] == "Uber by technicality", "tier"] = "Uber"
df.loc[df["tier"] == "OU by technicality", "tier"] = "OU"
df.loc[df["tier"] == "NFEs not in a higher tier", "tier"] = "NFE"
Now that we're happy with our processed dataframe df
we'll separate out the new Pokemon from Scarlet and Violet into a dataframe called test_df
for later testing on our machine learning models.
test_df = df[df['tier'] == 0] # Pokemon with tier 0 are those whose tier was originally NaN. So these are new Pokemon.
df = df[df['tier'] != 0] # Removing those same new Pokemon from the main dataframe.
test_df = test_df.copy()
test_df['name'] = test_df.index # adding a name column so that we can tell which Pokemon we're testing later on.
test_df
hp | attack | defence | special_attack | special_defence | speed | abi_adaptability | abi_aftermath | abi_air_lock | abi_analytic | ... | type_ice | type_normal | type_poison | type_psychic | type_rock | type_steel | type_water | tier | num_usage | name | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
name | |||||||||||||||||||||
growlithehisui | 60 | 75 | 45 | 65 | 50 | 55 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0.0 | growlithehisui |
arcaninehisui | 95 | 115 | 80 | 95 | 80 | 90 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0.0 | arcaninehisui |
voltorbhisui | 40 | 30 | 50 | 55 | 55 | 100 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | voltorbhisui |
electrodehisui | 60 | 50 | 70 | 80 | 80 | 150 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | electrodehisui |
taurospaldea | 75 | 110 | 105 | 30 | 70 | 100 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | taurospaldea |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
toedscool | 40 | 40 | 35 | 50 | 100 | 70 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | toedscool |
toedscruel | 80 | 70 | 65 | 80 | 120 | 100 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | toedscruel |
kingambit | 100 | 135 | 120 | 60 | 85 | 50 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0.0 | kingambit |
clodsire | 130 | 75 | 60 | 45 | 100 | 20 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0.0 | clodsire |
annihilape | 110 | 115 | 80 | 50 | 90 | 90 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | annihilape |
134 rows × 1139 columns
Finally, we'll normalize the main dataframe to pass to the models later on. This is done by iterating over all numeric columns and dividing each by its maximum absolute value. If any cells become NaN in the normalization process (from dividing by 0, for example), we'll fill those cells with 0.
normalized_df = df.copy() # creating a copy of the original dataframe to normalize all values.
# getting a list of columns in the normalized_df that need to be normalized.
columns_lst = list(normalized_df.columns)
columns_lst.remove("tier")
# normalizing all necessary columns in normalized_df.
for curr_col in columns_lst:
normalized_df[curr_col] = normalized_df[curr_col] / normalized_df[curr_col].abs().max()
normalized_df = normalized_df.fillna(0) # in case some cells became NaN.
normalized_df
hp | attack | defence | special_attack | special_defence | speed | abi_adaptability | abi_aftermath | abi_air_lock | abi_analytic | ... | type_ground | type_ice | type_normal | type_poison | type_psychic | type_rock | type_steel | type_water | tier | num_usage | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
name | |||||||||||||||||||||
bulbasaur | 0.176471 | 0.270718 | 0.213043 | 0.375723 | 0.282609 | 0.225 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | LC | 0.004296 |
ivysaur | 0.235294 | 0.342541 | 0.273913 | 0.462428 | 0.347826 | 0.300 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | NFE | 0.002235 |
venusaur | 0.313725 | 0.453039 | 0.360870 | 0.578035 | 0.434783 | 0.400 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | RU | 0.100076 |
charmander | 0.152941 | 0.287293 | 0.186957 | 0.346821 | 0.217391 | 0.325 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | LC | 0.001797 |
charmeleon | 0.227451 | 0.353591 | 0.252174 | 0.462428 | 0.282609 | 0.400 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NFE | 0.001409 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
glastrier | 0.392157 | 0.801105 | 0.565217 | 0.375723 | 0.478261 | 0.150 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | RU | 0.006382 |
spectrier | 0.392157 | 0.359116 | 0.260870 | 0.838150 | 0.347826 | 0.650 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Uber | 0.015167 |
calyrex | 0.392157 | 0.441989 | 0.347826 | 0.462428 | 0.347826 | 0.400 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | RU | 0.001998 |
calyrexice | 0.392157 | 0.911602 | 0.652174 | 0.491329 | 0.565217 | 0.250 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | Uber | 0.090404 |
calyrexshadow | 0.392157 | 0.469613 | 0.347826 | 0.953757 | 0.434783 | 0.750 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | Uber | 0.739218 |
965 rows × 1138 columns
We'll save df
, normalized_df
, and test_df
to csv files to speed up model training later.
# save the dataframe to a csv
df.to_csv('./data/model_data.csv', header=True, index=False)
normalized_df.to_csv('./data/normalized_model_data.csv', header=True, index=False)
test_df.to_csv('./data/model_test_data.csv', header=True, index=False)
For our first model, let's see how a simpler classification model performs with our normalized Pokemon data by training a multiclass logistic regression model with softmax.
Using scikit-learn, we split the data into 80% training and 20% testing, prepare each dataset by replacing missing data with 0s, and clean up unnecessary columns.
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pokemon_df = pd.read_csv("data/normalized_model_data.csv")
# create dummy columns for each tier label
pokemon_df = pd.get_dummies(pokemon_df, columns=['tier'])
# we replace our missing data with 0
pokemon_df = pokemon_df.fillna(0)
# split dataset into train/test sets
train, test = train_test_split(pokemon_df, test_size=0.2)
# separate features (X) from labels (Y): drop the one-hot tier columns from each feature set and keep them as the label sets
trainX = train.drop(columns=['tier_LC',
'tier_NFE',
'tier_OU',
'tier_RU',
'tier_RUBL',
'tier_UU',
'tier_UUBL',
'tier_Uber'])
trainY = train[['tier_LC',
'tier_NFE',
'tier_OU',
'tier_RU',
'tier_RUBL',
'tier_UU',
'tier_UUBL',
'tier_Uber']]
testX = test.drop(columns=['tier_LC',
'tier_NFE',
'tier_OU',
'tier_RU',
'tier_RUBL',
'tier_UU',
'tier_UUBL',
'tier_Uber'])
testY = test[['tier_LC',
'tier_NFE',
'tier_OU',
'tier_RU',
'tier_RUBL',
'tier_UU',
'tier_UUBL',
'tier_Uber']]
print(trainX.shape)
print(trainY.shape)
(772, 1137) (772, 8)
Now, we'll train a logistic regression classifier, applying softmax to the output in order to generate a predicted probability for each of our labels. We chose to use TensorFlow for this model since it allows us to easily apply softmax to the output using the 'activation' parameter. We train the model over 500 epochs, using cross-entropy loss to assess our model's predictions and optimizing with stochastic gradient descent.
We chose to train over 500 epochs after testing different numbers of iterations, since it resulted in the highest training accuracy while avoiding overfitting. We also decided that a learning rate of 0.01 for stochastic gradient descent was optimal in order to prevent the model from diverging.
import tensorflow as tf
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
model = Sequential()
model.add(Dense(
trainY.shape[1],
input_shape=(trainX.shape[1],),
activation="softmax"))
sgd = SGD(0.01)
epochs = 500
model.compile(optimizer=sgd,
loss='categorical_crossentropy',
metrics=['accuracy'])
H = model.fit(trainX, trainY, validation_data=(testX, testY),
epochs=epochs, batch_size=512, verbose=0)
max_val_acc_index = max(range(len(H.history['val_accuracy'])), key=lambda i: H.history['val_accuracy'][i])
print("Top Training Loss: {0}\nTop Training Accuracy: {1}\nTop Validation Loss: {2}\nTop Validation Accuracy: {3}"
.format(H.history['loss'][max_val_acc_index], H.history['accuracy'][max_val_acc_index], H.history['val_loss'][max_val_acc_index], H.history['val_accuracy'][max_val_acc_index]))
Top Training Loss: 0.9764702320098877 Top Training Accuracy: 0.7059585452079773 Top Validation Loss: 1.2300726175308228 Top Validation Accuracy: 0.7046632170677185
Here we plot the accuracy and loss of our multiclass logistic regression model on both our training and test sets over the number of epochs.
plt.style.use("ggplot")
plt.figure()
plt.plot(np.arange(0, epochs), H.history["loss"], label="train_loss")
plt.plot(np.arange(0, epochs), H.history["val_loss"], label="val_loss")
plt.plot(np.arange(0, epochs), H.history["accuracy"], label="train_acc")
plt.plot(np.arange(0, epochs), H.history["val_accuracy"], label="val_acc")
plt.title("Training Loss and Accuracy")
plt.xlabel("Epoch #")
plt.ylabel("Loss/Accuracy")
plt.legend()
plt.show()
The improvements in the validation accuracy and loss leveled out at around 500 epochs. This simple multiclass logistic regression did better than we expected, considering it has only one layer. There was little overfitting; the training loss and validation loss are similar, and the accuracies are even closer.
For our next model, we chose a feedforward neural network in order to improve our validation accuracy since this model can learn a more complex non-linear decision boundary for classifying Pokemon tiers. We chose to use TensorFlow again for this model.
We tested several configurations for our neural network, including different numbers of layers, types of layers, and activation functions. First, we determined that our neural network would only require $1$ hidden layer, since performance was not improved by adding additional layers, and since our training set has a small number of observations we wanted to avoid overfitting. For the hidden layer's size, we tested several unit counts, including 64, 128, 256, and 512, choosing powers of 2 in order to optimize performance speed for GPU parallelization. We found that 512 hidden units performed the best on validation accuracy.
We also tested using Dense and Dropout layers in our model, where Dense layers are fully connected, taking input from every neuron in the previous layer, and Dropout layers randomly zero out a fraction of their inputs during training in order to reduce overfitting. A dropout rate of $5\%$ performed the best consistently, probably due to our small dataset. For the activation function, we tested the Rectified Linear Unit (ReLU), Leaky Rectified Linear Unit (Leaky ReLU), sigmoid, and hyperbolic tangent (tanh), and found that Leaky ReLU was strongest at picking out the most important features to predict on.
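A minimal sketch of the kind of hidden-unit sweep described above, reusing the train/test split from earlier (the helper function and loop are a hypothetical reconstruction; the actual experiments were run interactively rather than with this exact code):
# hypothetical sketch of the hidden-unit sweep described above
def best_val_accuracy(units, dropout=0.05, epochs=500):
    m = Sequential()
    m.add(Dense(units, input_shape=(trainX.shape[1],),
                activation=tf.keras.layers.LeakyReLU(alpha=0.01)))
    m.add(Dropout(dropout))
    m.add(Dense(trainY.shape[1], activation="softmax"))
    m.compile(optimizer=SGD(0.01), loss="categorical_crossentropy", metrics=["accuracy"])
    h = m.fit(trainX, trainY, validation_data=(testX, testY),
              epochs=epochs, batch_size=512, verbose=0)
    return max(h.history["val_accuracy"])

for units in (64, 128, 256, 512):
    print(units, "hidden units -> best validation accuracy:", round(best_val_accuracy(units), 3))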
Below, we train a neural network with one hidden Leaky ReLU layer of 512 units followed by a $5\%$ dropout layer, and we apply a softmax output layer with 8 units, equal to the number of tier labels, mapping the output to a probability distribution over our tiers. Again, this model is trained using cross-entropy loss and optimized using SGD with a learning rate of 0.01 to avoid divergence, over 500 epochs, after which the validation loss and accuracy leveled off.
model = Sequential()
model.add(Dense(512,
activation=tf.keras.layers.LeakyReLU(alpha=0.01)))
model.add(Dropout(0.05))
model.add(Dense(trainY.shape[1],
activation="softmax"))
sgd = SGD(0.01)
epochs = 500
model.compile(loss="categorical_crossentropy", optimizer=sgd,
metrics=["accuracy"])
H = model.fit(trainX, trainY, validation_data=(testX, testY),
epochs=epochs, batch_size=512, verbose=0)
max_val_acc_index = max(range(len(H.history['val_accuracy'])), key=lambda i: H.history['val_accuracy'][i])
print("Top Training Loss: {0}\nTop Training Accuracy: {1}\nTop Validation Loss: {2}\nTop Validation Accuracy: {3}"
.format(H.history['loss'][max_val_acc_index], H.history['accuracy'][max_val_acc_index], H.history['val_loss'][max_val_acc_index], H.history['val_accuracy'][max_val_acc_index]))
Top Training Loss: 0.7692697048187256 Top Training Accuracy: 0.734455943107605 Top Validation Loss: 1.0907363891601562 Top Validation Accuracy: 0.7253885865211487
Again, we'll plot the accuracy and loss, this time of our feedforward neural network, on both our training and test sets over the number of epochs.
plt.style.use("ggplot")
plt.figure()
plt.plot(np.arange(0, epochs), H.history["loss"], label="train_loss")
plt.plot(np.arange(0, epochs), H.history["val_loss"], label="val_loss")
plt.plot(np.arange(0, epochs), H.history["accuracy"], label="train_acc")
plt.plot(np.arange(0, epochs), H.history["val_accuracy"], label="val_acc")
plt.title("Training Loss and Accuracy")
plt.xlabel("Epoch #")
plt.ylabel("Loss/Accuracy")
plt.legend()
plt.show()
The improvements in the validation accuracy and loss leveled out at around 500 epochs. This neural network did not perform much better than the vanilla multiclass logistic regression, most likely due to a lack of data and the complexity of the problem. There was a little overfitting; while the training and validation accuracies were about the same, the training loss was lower. This can be chalked up to the fact that the model was, at the end of the day, trained on the training set, but it still generalizes well.
from interpret import set_visualize_provider
from interpret.provider import InlineProvider
set_visualize_provider(InlineProvider())
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np
from interpret.glassbox import ExplainableBoostingClassifier
from interpret import show
pokemon_df = normalized_df
train, test = train_test_split(pokemon_df, test_size=0.2)
trainX = train.drop(columns=['tier'])
trainY = train[['tier']]
testX = test.drop(columns=['tier'])
testY = test[['tier']]
trainX = trainX.replace(np.nan, 0)
testX = testX.replace(np.nan, 0)
ebm = ExplainableBoostingClassifier()
ebm.fit(trainX, trainY)
/Users/firstsingularity/opt/anaconda3/lib/python3.8/site-packages/interpret/glassbox/ebm/ebm.py:568: UserWarning: Detected multiclass problem. Forcing interactions to 0. Multiclass interactions work except for global visualizations, so the line below setting interactions to zero can be disabled if you know what you are doing. warn("Detected multiclass problem. Forcing interactions to 0. Multiclass interactions work except for global visualizations, so the line below setting interactions to zero can be disabled if you know what you are doing.")
ExplainableBoostingClassifier()
Since the feedforward neural network we made was not giving us a particularly high accuracy when predicting competitive tiers, we decided to train an Explainable Boosting Machine (EBM) to at least give us more insight into what determines a Pokemon's competitive viability. The main purpose of this insight would be to assist in crafting the initial placements of Pokemon into tiers, hastening their convergence to their eventual tiers.
The architecture of and training methods used by an EBM are dissimilar to those of a neural network. To prevent the entangling of features that occurs in a neural network (which results in a lack of feature-specific insight), EBMs are formulated in a way that keeps each feature's contribution separate.
$$g(E[y])=\beta_0 + \sum f_i(x_i)$$

The equation above is the formulation for a vanilla Generalized Additive Model (GAM), which is the family of models that the EBM comes from. Each function, $f_i$, is applied to an individual feature, after which the results are linearly combined. Each function can be non-linear and is learned during the training phase. There is one important deviation from this standard architecture present in the EBM:
$$g(E[y])=\beta_0 + \sum f_i(x_i) + \sum f_{i,j}(x_i, x_j)$$

The difference shown above is the addition of pairwise interaction terms to the model. This allows for more flexibility in training without sacrificing too much interpretability during testing. In this particular use case, however, pairwise interaction terms were not present, since the library we used strongly recommends that they be left out for multiclass classification.
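As a toy illustration of this additive structure (hypothetical shape functions and coefficients, not values learned from our data), a prediction is just the intercept plus each feature's individual contribution, passed through the inverse link function:
import numpy as np
# toy GAM-style prediction; the real EBM learns each f_i as a boosted ensemble of single-feature trees
beta_0 = -1.0
f_attack = lambda x: 2.5 * x        # hypothetical shape function for normalized attack
f_speed = lambda x: 1.5 * x ** 2    # hypothetical shape function for normalized speed

def predict_proba(attack, speed):
    score = beta_0 + f_attack(attack) + f_speed(speed)  # g(E[y]) = beta_0 + sum of f_i(x_i)
    return 1 / (1 + np.exp(-score))                     # invert a logistic link g

print(predict_proba(attack=0.8, speed=0.6))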
For training, EBMs use a round-robin style of training a single-feature decision tree at a time, using the residual left after the predictions of the already-trained decision trees for the other features. This process is referred to as "gradient boosting", or just "boosting". In addition to boosting, EBMs use "bagging," a process by which the model is trained on subsets of the training data and the predictions are averaged across the sets. Using backfitting (the algorithm used to train each individual tree in the context of the previous trees' residuals), boosting, bagging, and pairwise interaction terms, EBMs are able to get close to the overall predictive power of a vanilla neural network, especially when there is little data to work with.
For additional reading, please refer to this paper.
ebm_global = ebm.explain_global()
show(ebm_global)