Predicting Pokemon Competitive Battling Viability

Competitive Pokemon battling is a form of play where players adhere to certain sets of rules to create a competitive environment. Oftentimes, this kind of play is conducted on a website called Pokemon Showdown. Pokemon Showdown simulates Pokemon battling as it occurs in the mainline Pokemon games and is very popular due to its ease of use and efficiency. On the site each individual Pokemon is assigned a tier, or label, that represents how well a Pokemon does in the competitive landscape.

On November 18th, 2022, Pokemon Scarlet and Violet were released for the Nintendo Switch. Whenever a new Pokemon game is released, new Pokemon are typically introduced as well. In particular, $103$ new Pokemon were introduced in these games. On Pokemon Showdown, there's a period that typically lasts around 2 months where website administrators observe player games in order to gather data and assign tiers to these brand new Pokemon. Using different machine learning models, we would like to predict these tiers ahead of Pokemon Showdown. In other words, we want to use machine learning to converge to the true tiers of these new Pokemon faster than Pokemon Showdown would organically, leading to balanced and fair gameplay more quickly.

Information regarding both competitive Pokemon battling and Pokemon Showdown is available at these pages:

Data Collection

Gathering data for this project was split into two phases: understanding and collection. We already knew that we wanted to use pokemonshowdown.com and smogon.com as our sources of information, so the next step was to understand how the data was stored on the sites and how we would be able to extract it.

When scraping a website, it's tempting to immediately jump to libraries like Beautiful Soup or Puppeteer. However, this is often unnecessary: depending on the site we are targeting, the data may already be clean and simply available in memory. During the exploration phase of the collection, we first analyzed how the site worked: we looked at various source files for information, logs, and network requests. Within 20 minutes of exploration, we discovered that all of the site's data was stored inside JavaScript objects that were loaded on the initial load of the site. After filtering through dozens of requests, we were able to identify a select few that held most of the site's data regarding Pokemon. From there, we were able to copy the objects from the console to their own files, which saved us far more time than writing a complete Python web scraper would have.

After gathering most of the site's data, we began to explore what data might be of interest to us. This part was largely experimentation in pandas. A lot of the later stages of this experimentation are detailed below, but the early stages consisted mostly of understanding what the data we had collected was. After developing a basic understanding of the site's data and how it was structured, we were able to move on to the pure data processing step.

Data Processing

At this stage of the project, the objective was clear: work in Python and Pandas as much as possible. Because we were handling JSON files that were often hundreds of thousands of lines long, we absolutely did not want to make any changes that were not reproducible.

Once we had an understanding of the files we had from Pokemon Showdown, we could use the Pandas functions .read_json() and .read_csv() to turn these files into Pandas dataframes. Inside dataframes, we can manipulate the data much more easily.
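As a rough illustration of this loading step, the sketch below reads a small JSON string and a small CSV string into dataframes. The in-memory strings, the column names other than real, and the orient="index" choice are assumptions made for the example; the real files are much larger and their exact layout is not shown here.

```python
import io

import pandas as pd

# Hypothetical stand-ins for files saved from Pokemon Showdown and Smogon.
raw_json = io.StringIO(
    '{"bulbasaur": {"num": 1, "hp": 45}, "venusaur": {"num": 3, "hp": 80}}'
)
raw_csv = io.StringIO("pokemon,real\nbulbasaur,120\nvenusaur,450\n")

# orient="index" treats each top-level JSON key as a row label.
pokedex_df = pd.read_json(raw_json, orient="index")
usage_df = pd.read_csv(raw_csv)

print(pokedex_df)
print(usage_df)
```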

Since we're pulling our data from many different sources, the set of Pokemon we would like to work with is often represented in slightly different ways. Consider the Pokemon Venusaur, for example. There are two variations of this Pokemon: Venusaur-mega and Venusaur-gmax. Some sources include these types of variations as separate Pokemon entirely, while other sources forgo including such variations altogether. This project will focus on the set of Pokemon that does not include these variations. This is because the use of these variations is oftentimes player choice. Furthermore, data regarding these variations is usually sparse or inconsistent.

Let's begin by loading in the set of Pokemon that this project will focus on. That is, the set of Pokemon that excludes variations. This set is found in the battle_team_builder_table JSON file in an entry named learnedsets. We extract this entry here and turn it into its own dataframe. We then transpose the table so that each row (observation) is a single Pokemon.
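The extract-and-transpose step might look roughly like the sketch below. Only the learnedsets key and the pokemon_moves name come from the text; the tiny dictionary, its move names, and the assumption that the entry maps each Pokemon to its moves are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature of the battle_team_builder_table JSON.
team_builder_json = {
    "learnedsets": {
        "bulbasaur": {"tackle": 123456789, "vinewhip": 123456789},
        "charmander": {"tackle": 123456789, "ember": 123456789},
    }
}

# A dict of dicts becomes one column per outer key (one per Pokemon),
# so transposing gives one row per Pokemon and one column per move.
pokemon_moves = pd.DataFrame(team_builder_json["learnedsets"]).T

print(pokemon_moves)
```

Moves a Pokemon cannot learn show up as NaN, which is what the later one-hot step relies on.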

The columns of the pokemon_moves dataframe correspond to every move in the Pokemon video games. A move is an action that a Pokemon can perform in a battle. We will elaborate on the features we're considering in more depth shortly. For right now, however, if a given cell is not NaN, this means that a Pokemon is able to perform a given move. Each column represents a single move. However, we will have to process these columns later to convert them to the one-hot encoded format we desire. The specific values such as 123456789 are more or less garbage. What's important here is that the set of Pokemon represented by the observations in this dataframe is the set we desire, with some additional fake Pokemon that must be removed. To clarify, a fake Pokemon is a fan-made entity that isn't official. These fan-made Pokemon are beyond the scope of this project. We'll begin removing these fake Pokemon entries now.

The set of Pokemon stored in the dataframe df is a set of Pokemon that does include the variations we do not desire. However, using df, fake Pokemon were easily filtered out using the num column, which is a unique number that corresponds to a Pokemon. We'll now use df which no longer includes fake Pokemon to similarly remove fake Pokemon from the set of Pokemon without variations currently in pokemon_moves. This will achieve the proper set of Pokemon we want to focus on. To reiterate, this is the set of Pokemon that includes no variations of a Pokemon and no fake Pokemon.

We'll now return to the pokemon_moves dataframe to remove fake Pokemon using the df dataframe.
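A minimal sketch of this two-step filter is shown below. The toy rows, the name fakemon, and the rule that fake Pokemon have a non-positive num are assumptions for the example; the text only says that the num column was used.

```python
import pandas as pd

# df includes variations (venusaurmega) and a fake entry (fakemon).
df = pd.DataFrame(
    {"num": [1, 3, 3, -1]},
    index=["bulbasaur", "venusaur", "venusaurmega", "fakemon"],
)
# pokemon_moves has no variations but still contains the fake entry.
pokemon_moves = pd.DataFrame(
    {"tackle": [123456789, 123456789, 123456789]},
    index=["bulbasaur", "venusaur", "fakemon"],
)

# Assumption: official Pokemon have a positive num.
df = df[df["num"] > 0]

# Keep only the rows of pokemon_moves whose name survived in df.
pokemon_moves = pokemon_moves[pokemon_moves.index.isin(df.index)]

print(pokemon_moves.index.tolist())
```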

Now we have the exact set of Pokemon we want to focus on in the form of the observations in the pokemon_moves dataframe. Since we now have our proper set of Pokemon to focus on, let's introduce the features that we will be focusing on.

We'll be joining dataframes eventually to generate the complete data that our models will train on. Dataframes will be joined into the df dataframe. This table will consist of the following columns.

To start, let's remove the unwanted Pokemon from df, which will be the main dataframe now.

We now have a rudimentary table of each Pokemon's name, unique number, stats (we use the term stats to refer to the hp, atk, def, spa, spd, and spe features), and abilities. The abilities column currently holds a dictionary for each Pokemon, which we'll convert to one-hot encoded columns based on whether a Pokemon has access to a given ability. The dictionary keys are unnecessary and beyond the scope here; the values of each Pokemon's abilities dictionary are what we wish to convert to one-hot encoded columns. To clarify, if a Pokemon has access to an ability, the cell [Pokemon][ability] will be 1; otherwise, the cell value will be 0. We'll make use of this technique extensively throughout this project.
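One way to sketch this conversion in Pandas is to explode the dictionary values into one row per (Pokemon, ability) pair and then build indicator columns. The toy rows and dictionary keys below are illustrative assumptions, and this is not necessarily how the original notebook did it.

```python
import pandas as pd

# Toy data: each cell of "abilities" is a dictionary whose values are
# ability names. The keys are placeholders and are ignored.
df = pd.DataFrame(
    {
        "abilities": [
            {"0": "Overgrow", "H": "Chlorophyll"},
            {"0": "Blaze", "H": "Solar Power"},
            {"0": "Chlorophyll"},
        ]
    },
    index=["bulbasaur", "charmander", "oddish"],
)

# One row per (Pokemon, ability) pair, keeping the Pokemon as the index.
exploded = df["abilities"].apply(lambda d: list(d.values())).explode()

# Indicator columns per pair, then collapse back to one row per Pokemon.
abilities_df = pd.get_dummies(exploded).astype(int).groupby(level=0).max()

print(abilities_df)
```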

To clarify, abilities are unique passive behaviors that a Pokemon has during a battle. More information is available here.

Here is the dataframe of one-hot encoded abilities, which indicates if a Pokemon has access to any given ability. We'll return to this dataframe later to concatenate it to the main df dataframe.

Before we move on to the next set of features, we'll rename a few columns of the main dataframe to make understanding easier.

Now we'll focus on the Pokemon moves set of features. Similar to abilities, we want each Pokemon move to be a one-hot encoded column corresponding to whether or not a Pokemon can perform that move. As we mentioned briefly before, pokemon_moves, the dataframe we used previously to obtain the set of Pokemon we're focusing on for this project, is also where Pokemon move information is stored. As such, we'll return to this dataframe and process the columns to achieve the one-hot encoded columns we desire. We'll begin by discarding the previous values associated with each move that a Pokemon can perform, and only store whether or not the Pokemon can perform the move, using 1 when it can and 0 otherwise.
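Because a non-NaN cell already means the Pokemon can perform the move, this conversion can be sketched in a single expression. The two-row dataframe below is a made-up stand-in for the real pokemon_moves.

```python
import numpy as np
import pandas as pd

# Toy stand-in: NaN means the Pokemon cannot perform the move.
pokemon_moves = pd.DataFrame(
    {"tackle": [123456789, 123456789], "ember": [np.nan, 123456789]},
    index=["bulbasaur", "charmander"],
)

# notna() yields True/False; casting to int gives the 1/0 encoding.
pokemon_moves = pokemon_moves.notna().astype(int)

print(pokemon_moves)
```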

More on Pokemon moves here

Now we have a dataframe of one-hot encoded moves as we desired. Similar to the abilities dataframe, we'll return to pokemon_moves later on when concatenating all dataframes together into the main df dataframe.

The third set of features we'll be focusing on is Pokemon typing. There are 18 types and they are used to determine how Pokemon will interact with each other during a battle. Each Pokemon can have at most two types. Again, we'll be creating a dataframe of one-hot encoded types that will be appended to df later.
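For list-valued data such as types, a compact alternative sketch is to join each list into a delimited string and let Pandas split it into indicator columns. The toy series below is an assumption about how the type lists might be held in memory.

```python
import pandas as pd

# Toy stand-in: each Pokemon maps to a list of one or two types.
types = pd.Series(
    {
        "bulbasaur": ["Grass", "Poison"],
        "charmander": ["Fire"],
        "oddish": ["Grass", "Poison"],
    }
)

# Join each list with "|" and expand into one 0/1 column per type.
types_df = types.str.join("|").str.get_dummies(sep="|")

print(types_df)
```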

More on types

As you'll notice, the number of observations in types_df does not match that of previous dataframes. This is because we retrieved the Pokemon type data from the same source that contained variations of Pokemon. As a result, the Pokemon variations we had removed previously are still present in types_df. To rectify this, we'll apply the same solution we used to get the proper set of Pokemon earlier.

types_df is now ready to be concatenated to df later.

We'll now begin processing data related to Pokemon tier. There are many different game formats on Pokemon Showdown, and each assigns a different tier to our set of Pokemon. For this project, we're specifically focusing on the gen8natdex format, as this is the most recent format before the release of Pokemon Scarlet and Violet. The data in team_builder_json is stored in nested lists, where each individual tier is an individual list. So to retrieve this data and put it into a Pandas dataframe, we'll first need to process it.
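Flattening nested tier lists into a dataframe can be sketched as below. The exact nesting is not shown in the text, so the layout (one inner list per tier, holding the tier name and its Pokemon), the gen8natdex key placement, and the placeholder Pokemon names are all assumptions.

```python
import pandas as pd

# Hypothetical layout of the tier data inside team_builder_json.
team_builder_json = {
    "gen8natdex": [
        ["OU", ["pokemon_a", "pokemon_b"]],
        ["UU", ["pokemon_c"]],
    ]
}

# Emit one (name, tier) row per Pokemon.
rows = [
    (name, tier)
    for tier, names in team_builder_json["gen8natdex"]
    for name in names
]
pokemon_tiers = pd.DataFrame(rows, columns=["name", "tier"]).set_index("name")

print(pokemon_tiers)
```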

Again, the set of Pokemon here doesn't align with the set we're focusing on. This is in part because once again fake Pokemon and a few Pokemon variations have a tier ranking that we'll need to filter out later. But more importantly, it's natural that the set of Pokemon who have tier labels is smaller than the total set of Pokemon we're focusing on. After all, the $103$ new Pokemon that were just released in Pokemon Scarlet and Violet don't have associated tier labels yet! For these Pokemon, the tier will be NaN when we concatenate pokemon_tiers with df.

The final feature that we need comes from a set of csv files that indicate how often a Pokemon is used on Pokemon Showdown. This data actually comes from Smogon, which is the website responsible for administrating Pokemon Showdown. Each individual file is associated with only a single month of play, and a single month was not large enough to give ample data. As such, we take into account 11 months of play using 11 corresponding csv files. These files contain a column named real, which represents the raw number of times a given Pokemon was used in high-level play each month. For Pokemon that have usage in more than one month (this is the case for most Pokemon), we simply accumulate their usage.
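Accumulating usage across monthly files might be sketched as follows. Only the real column name comes from the text; the pokemon column name, the in-memory CSV strings, the counts, and the name-cleaning rule (lowercase, letters and digits only) are assumptions for illustration.

```python
import io

import pandas as pd

# Two hypothetical monthly files; the project used 11.
monthly_csvs = [
    "pokemon,real\nMr. Mime,5\nPorygon-Z,100\n",
    "pokemon,real\nPorygon-Z,150\nType: Null,40\n",
]
frames = [pd.read_csv(io.StringIO(text)) for text in monthly_csvs]
usage_df = pd.concat(frames, ignore_index=True)

# Normalize names so they line up across sources (assumed convention).
usage_df["pokemon"] = (
    usage_df["pokemon"].str.lower().str.replace(r"[^a-z0-9]", "", regex=True)
)

# Sum the raw counts for Pokemon that appear in more than one month.
usage_df = usage_df.groupby("pokemon")["real"].sum().rename("usage").to_frame()

print(usage_df)
```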

The set of Pokemon is once again not the same as our valid set because it comes from Smogon, not Pokemon Showdown directly. This is actually why we had to process out characters such as % or -, because some Pokemon names are not recorded in exactly the same way on Smogon and Pokemon Showdown. Furthermore, it's natural that the new Pokemon from Scarlet and Violet do not have usage values yet, for the same reason they do not have tier labels yet. Regardless, we'll have to filter out the Pokemon not in our valid set before concatenating usage_df to df. Before we do this, however, it will be easier to concatenate all the data we've accumulated thus far into df.

We almost have our desired dataframe! All we need to do now is further process usage_df and then add that data to df.

We'll now take a look at Pokemon in df that are not in usage_df. These should be Pokemon that either do not have an associated tier ranking yet or variations we failed to remove previously. We'll do this using an anti-join.
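An anti-join can be sketched in Pandas with a left merge and the indicator flag, keeping only rows found on the left side. The toy frames, the name column, and the usage count are assumptions for the example.

```python
import pandas as pd

df = pd.DataFrame({"name": ["bulbasaur", "pikachucosplay", "sprigatito"]})
usage_df = pd.DataFrame({"name": ["bulbasaur"], "usage": [1200]})

# indicator=True adds a _merge column saying where each row was found.
merged = df.merge(usage_df[["name"]], on="name", how="left", indicator=True)

# Rows present only in df are the ones with no usage data.
missing = merged.loc[merged["_merge"] == "left_only", "name"].tolist()

print(missing)
```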

While most of these Pokemon are certainly those that were released in Pokemon Scarlet and Violet, a handful such as pikachucosplay or pikachulibre are variations that should have been removed previously. Now we'll drop these Pokemon variations by creating lists of these erroneous observations and using said lists to remove these observations from df.

Now we know that every Pokemon that won't have a usage value is a Pokemon that was released in Pokemon Scarlet and Violet. These Pokemon will have a usage value of NaN when we join usage_df with df, but we'll replace NaN with $0$ here since these new Pokemon haven't actually been used yet on Pokemon Showdown.
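The join-then-fill step might look like the sketch below. The toy frames and the usage column name are assumptions; indexing both frames by Pokemon name is also assumed.

```python
import pandas as pd

df = pd.DataFrame({"num": [1, 906]}, index=["bulbasaur", "sprigatito"])
usage_df = pd.DataFrame({"usage": [1200]}, index=["bulbasaur"])

# A left join keeps every Pokemon in df; unmatched rows get NaN usage.
df = df.join(usage_df, how="left")

# New Pokemon have not been used yet, so their usage becomes 0.
df["usage"] = df["usage"].fillna(0)

print(df)
```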

We finally have a dataframe of all the Pokemon we wish to focus on that contains all of our desired features! With this, our data processing is nearly complete. From here, we'll simply make the dataframe more readable and ensure that the dataframe is ready to be used by our machine learning models.

A few label names are misleading or carry nuances that are not necessary for this project. The tier label Uber by technicality, for example, still consists of Pokemon considered to be in the Uber tier despite the "technicality", so we'll rename it to Uber. We'll also rename NFEs not in a higher tier to NFE simply to make it easier to read.
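Relabeling can be sketched with a simple mapping. The label strings are taken from the text as written; the tier column name and the toy rows are assumptions.

```python
import pandas as pd

df = pd.DataFrame(
    {"tier": ["OU", "Uber by technicality", "NFEs not in a higher tier"]},
    index=["pokemon_a", "pokemon_b", "pokemon_c"],
)

# Map the verbose labels onto the shorter ones; other labels are untouched.
tier_renames = {
    "Uber by technicality": "Uber",
    "NFEs not in a higher tier": "NFE",
}
df["tier"] = df["tier"].replace(tier_renames)

print(df)
```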

Now that we're happy with our processed dataframe df we'll separate out the new Pokemon from Scarlet and Violet into a dataframe called test_df for later testing on our machine learning models.

Finally, we'll normalize the main dataframe to pass to the models later on. This will be done by iterating over all numeric columns and normalizing each one. If any cells become NaN in the normalizing process due to division by 0 or some such, we'll fill those cells with 0.
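The text does not say which normalization was used, so the sketch below assumes min-max scaling. It shows how a constant column produces 0/0 = NaN and how those cells are then filled with 0. The toy columns are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "hp": [45, 80, 60],
        "some_move": [1, 1, 1],  # constant column: max - min is 0
        "tier": ["OU", "UU", "OU"],
    }
)

normalized_df = df.copy()
numeric_cols = df.select_dtypes(include="number").columns

for col in numeric_cols:
    col_min, col_max = df[col].min(), df[col].max()
    # Assumed min-max scaling; a constant column yields 0/0 = NaN here.
    normalized_df[col] = (df[col] - col_min) / (col_max - col_min)

# Replace the NaN cells produced by dividing by zero.
normalized_df[numeric_cols] = normalized_df[numeric_cols].fillna(0)

print(normalized_df)
```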

We'll save df, normalized_df, and test_df to csv files to speed up model training later.

Multiclass Logistic Regression

For our first model, let's see how a simpler classification model performs with our normalized Pokemon data by training a multiclass logistic regression model with softmax.

Using Sklearn, we split the data into 80% training and 20% testing and prepare each dataset by replacing missing data with 0s and cleaning up unnecessary columns.
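A minimal sketch of this split is shown below. The toy dataframe, the choice of which columns count as unnecessary, and the fixed random_state are assumptions for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the normalized dataframe.
normalized_df = pd.DataFrame(
    {
        "name": [f"pokemon_{i}" for i in range(10)],
        "hp": np.linspace(0, 1, 10),
        "usage": [0.1, np.nan, 0.3, 0.4, np.nan, 0.6, 0.7, 0.8, 0.9, 1.0],
        "tier": ["OU", "UU"] * 5,
    }
)

# Drop non-feature columns and replace missing values with 0.
X = normalized_df.drop(columns=["name", "tier"]).fillna(0)
y = normalized_df["tier"]

# 80% training, 20% testing; random_state only makes the sketch repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(len(X_train), len(X_test))
```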

Now, we'll train a logistic regression classifier, applying softmax to the output in order to generate a predicted probability for each of our labels. We chose to use TensorFlow for this model since it allows us to easily apply softmax to the output using the 'activation' parameter. We train the model over 500 epochs, using cross-entropy loss to assess our model's predictions and optimizing with stochastic gradient descent.
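For reference, the softmax output and cross-entropy loss described above can be written as follows. The symbols are introduced here for illustration: $x$ is a Pokemon's feature vector, $W$ and $b$ are the layer's weights and biases, $K$ is the number of tiers, and $y_k$ is $1$ for the true tier and $0$ otherwise.

$$z = Wx + b, \qquad \hat{y}_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}, \qquad L = -\sum_{k=1}^{K} y_k \log \hat{y}_k$$

Stochastic gradient descent then updates each parameter $\theta$ as $\theta \leftarrow \theta - \eta \nabla_\theta L$, where $\eta$ is the learning rate.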

We chose to train over 500 epochs after testing different numbers of iterations, since it resulted in the highest training accuracy while avoiding overfitting. We also decided a learning rate of 0.01 for stochastic gradient descent was optimal in order to prevent the model from diverging.

Here we plot the accuracy and loss of our multiclass logistic regression model on both our training and test sets over the number of epochs.

The improvements in the validation accuracy and loss leveled out at around 500 epochs. This simple multiclass logistic regression did better than we expected, considering it has only one layer. There was little overfitting; the training loss and validation loss are similar, and the accuracies are even closer.

Feedforward Neural Network

For our next model, we chose a feedforward neural network in order to improve our validation accuracy since this model can learn a more complex non-linear decision boundary for classifying Pokemon tiers. We chose to use TensorFlow again for this model.

We tested several configurations for our neural network, including different numbers of layers, types of layers, and activation functions. First, we determined that our neural network would only require $1$ hidden layer, since performance was not improved by adding additional layers and, because our training set has a small number of observations, we wanted to avoid overfitting. For each layer's hidden units, we tested 64, 128, 256, and 512 units, choosing powers of 2 in order to optimize performance speed for GPU parallelization. We found that 512 hidden units performed the best on validation accuracy.

We also tested using Dense and Dropout layers in our model. Dense layers are fully connected, taking input from every neuron in the previous layer, while Dropout layers randomly zero out a fraction of their input values during training in order to reduce overfitting. A dropout of $5\%$ performed the best consistently, probably due to our small dataset. To determine the activation function for each layer, we tested Rectified Linear Unit (ReLU), Leaky Rectified Linear Unit (Leaky ReLU), sigmoid, and hyperbolic tangent (tanh), and found that Leaky ReLU gave the strongest performance.

Below, we train the data on a neural network with two hidden Leaky ReLU layers, each with 128 units, and we apply a softmax output layer with 8 units, which is equivalent to the number of labels, mapping the output to a probability distribution over our tiers. Again, this model is trained using cross-entropy loss and optimized using SGD, with a learning rate of 0.01 to avoid divergence, over 500 epochs, after which the validation loss and accuracy leveled off.
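To make the shape of this architecture concrete, here is a NumPy sketch of a single forward pass through two Leaky ReLU hidden layers of 128 units and an 8-unit softmax output. It is not the TensorFlow model used in the project: the random weights, the input width of 20 features, and the Leaky ReLU slope of 0.01 are assumptions, and no training is performed.

```python
import numpy as np


def leaky_relu(z, alpha=0.01):
    # alpha is an assumed slope for negative inputs.
    return np.where(z > 0, z, alpha * z)


def softmax(z):
    # Subtract the row maximum for numerical stability.
    exp = np.exp(z - z.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
n_features, n_hidden, n_tiers = 20, 128, 8  # n_features is a placeholder

# Randomly initialized weights stand in for trained parameters.
W1 = rng.normal(scale=0.1, size=(n_features, n_hidden))
W2 = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
W3 = rng.normal(scale=0.1, size=(n_hidden, n_tiers))
b1, b2, b3 = np.zeros(n_hidden), np.zeros(n_hidden), np.zeros(n_tiers)


def forward(X):
    h1 = leaky_relu(X @ W1 + b1)
    h2 = leaky_relu(h1 @ W2 + b2)
    return softmax(h2 @ W3 + b3)


probs = forward(rng.normal(size=(4, n_features)))
print(probs.shape)
```

Each row of probs is a probability distribution over the 8 tiers, so the predicted tier for a Pokemon is the index of the largest entry in its row.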

Again, we'll plot the accuracy and loss of our model, this time the feedforward neural network, on both our training and test sets over the number of epochs.

The improvements in the validation accuracy and loss leveled out at around 500 epochs. This neural network did not perform much better than the vanilla multiclass logistic regression, most likely due to a lack of data and the complexity of the problem. There was a little overfitting; while the training and validation accuracies were about the same, the training loss was lower. This can be chalked up to the fact that the model was, at the end of the day, trained on the training set, but it still generalizes well.

Explainable Boosting Machine (EBM)

Since the feedforward neural network we made was not giving us a particularly high accuracy when predicting competitive tiers, we decided to train an Explainable Boosting Machine (EBM) to at least give us more insight into what determines a Pokemon's competitive viability. The main purpose of this insight would be to assist in crafting the initial placements of Pokemon into tiers, hastening the convergence of Pokemon to their eventual tiers.

The architecture of an EBM, and the methods used to train it, differ from those of a neural network (NN). To prevent the fusion of features that occurs in an NN (which results in a lack of specific insight), EBMs are formulated in a way that keeps each feature separate.

$$g(E[y])=\beta_0 + \sum f_i(x_i)$$

The equation above is the formulation for a vanilla Generalized Additive Model (GAM), which is the family of models that the EBM comes from. Each function, $f_i$, is applied to an individual feature, after which the results are linearly combined. Each function can be non-linear and is learned during the training phase. There is one important deviation from this standard architecture present in the EBM:

$$g(E[y])=\beta_0 + \sum f_i(x_i) + \sum f_{i,j}(x_i, x_j)$$

The difference shown above is the addition of pairwise interaction terms to the model. This allows for more flexibility in training without sacrificing too much interpretability during testing. In this particular use case, however, pairwise interaction terms were not present, since the library we used strongly recommends that they be left out for multiclass classification.

For training, EBMs proceed in a round-robin fashion: one single-feature decision tree is trained at a time, using the residual left after the predictions of the trees already trained on the other features. This process is referred to as "gradient boosting", or just "boosting". In addition to boosting, EBMs use "bagging," a process by which the model is trained on subsets of the training data and the predictions are averaged across the sets. Using backfitting (the algorithm used to train each individual tree in the context of the previous trees' residuals), boosting, bagging, and pairwise interaction terms, EBMs are able to get close to the overall predictive power of a vanilla NN, especially when there is little data to work with.
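The round-robin idea can be illustrated with a heavily simplified sketch. It is not the EBM library's algorithm: it does regression rather than multiclass classification, uses depth-one trees (stumps), omits bagging and interaction terms, and every name, the synthetic data, and the hyperparameters are made up for the example.

```python
import numpy as np


def fit_stump(x, residual):
    """Find the single split on one feature that best fits the residual."""
    best = (np.inf, x[0], 0.0, 0.0)  # (sse, threshold, left_mean, right_mean)
    for threshold in np.unique(x)[:-1]:  # skip the max so the right side is non-empty
        left = x <= threshold
        left_mean, right_mean = residual[left].mean(), residual[~left].mean()
        sse = ((residual[left] - left_mean) ** 2).sum()
        sse += ((residual[~left] - right_mean) ** 2).sum()
        if sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def fit_additive(X, y, rounds=50, learning_rate=0.1):
    """Cycle over features, boosting one small tree per feature per round."""
    n_rows, n_features = X.shape
    intercept = y.mean()
    # contributions[:, j] holds f_j(x_j) evaluated on the training rows.
    contributions = np.zeros((n_rows, n_features))
    for _ in range(rounds):
        for j in range(n_features):  # round-robin over features
            residual = y - intercept - contributions.sum(axis=1)
            threshold, left_mean, right_mean = fit_stump(X[:, j], residual)
            update = np.where(X[:, j] <= threshold, left_mean, right_mean)
            contributions[:, j] += learning_rate * update
    return intercept, contributions


rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2  # synthetic additive target

intercept, contributions = fit_additive(X, y)
mse = np.mean((y - intercept - contributions.sum(axis=1)) ** 2)
print(mse, y.var())
```

Because each tree only ever sees one feature, the learned contribution of that feature stays separate and can be plotted on its own, which is the property that makes this family of models interpretable.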

For additional reading, please refer to this paper.

Above are the results of training the EBM on our data. The explain_global function shows us the average amount a particular feature contributes to classification at different values. This helps us determine which features are most important when it comes to a Pokemon's competitive viability, which in turn helps us understand the meta of the competitive landscape better. Furthermore, we can zoom in on a single feature to observe the feature function in graph form, where the value of the feature is plotted against its contribution to classification for each class label. For example, at about $120$ speed, a Pokemon is more likely to be placed in Ubers or OU (the two highest tiers). From here, we can try to find out why using auxiliary knowledge; for example, maybe a speed of $120$ lets a Pokemon out-speed a couple of highly used Pokemon that dominate a certain tier. Likewise, it seems like competitive viability falls off significantly when going below $70$ speed, perhaps because it inhibits the survivability of the respective Pokemon, even if its other stats are high.

Note: The diagrams I am referring to here may not be visible; in that case, please look at the EBM_insights folder for the images.

The EBM performs reasonably well on the validation set, as can be seen above. Typically, EBMs under-perform compared to neural networks, but since there was a lack of data, the neural network's strengths did not get to fully shine. Although the accuracy is not high enough to simply use the model to predict tiers in a vacuum, the model also provides insight into why it made those predictions, meaning it is easy to build on with domain knowledge.

At just a glance, we can tell that the model worked well for typical cases (high stats, legendary Pokemon, pseudo-legendaries, pre-evolutions, etc.). The model still fails to capture the full scope of the problem, however. One oversight is the introduction of new moves and abilities to the game, which we could not train on. Furthermore, there are new game-changing mechanics (terastalization). The largest piece of unknown information is which Pokemon will be allowed in the competitive format simply from an availability perspective. Every new generation, The Pokemon Company decides to make some past Pokemon unavailable in the new game. This can impact the tiers of other Pokemon, since Pokemon that are unavailable create a power vacuum. Overall, this model serves as a starting point for humans to use data-backed insight to make their job of rating Pokemon easier.