Better predictions for League matches

I’m predicting the winner of League of Legends ranked games with machine learning. The models look at player histories, champions picked, player ranks, blue vs red side, solo vs team queue, etc. The last time I wrote about accuracy improvements my best was 61.2% accuracy with gradient boosting trees.

Since then I’ve increased the amount of data from about 45,000 matches to 1.8 million matches. I’ve done analysis and the trends are much more reliable.

Experiments with 1.8 million matches are slow so I usually use 200k and sometimes 50k to test code. Almost always the trends in 200k are the same as 1.8 million but they run in minutes or hours compared to hours or days.

I keep a spreadsheet with the outcome of each experiment and notes that indicate the model used, features used, and any other tweaks. This graph shows the progress since the last post.

Accuracy improvements Sep Oct 2015

The graph fluctuates so much because I sometimes test ideas on 50,000 matches. The models are worse but it allows rapid testing.

I also test multiple model types. It smooths out towards the end because I wasn’t experimenting as much with weaker models. On this particular problem, gradient boosting trees and neural networks are clearly stronger than logistic regression and random forests.

Feature engineering

Most of my progress came from feature engineering: getting from the starting point of 61.2% accuracy to 67.2%. I also ran the most experiments in this area: around 80 of 120 experiments.

The initial drop in accuracy was the result of adding more matches without fully updating the database of player histories: the players’ ranked history stats were available for a smaller percentage of the data. Accuracy dropped to 58% despite having 1.8 million matches.

After running a full crawl of all player ranked stats the problem was solved and gradient boosting trees were up to 62.3%. That’s 1% higher than the previous best just from having more data.

Previously I had dropped features with low weights in the models because they make the data sparser. With a small dataset you can get small but real improvements by dropping them. There wasn’t anything particularly special about these features; they’re just mins and maxes of individual player features by team. Adding them back increased the feature count from 41 to 53 and improved accuracy: gradient boosting went from 62.3% to 62.4%, and logistic regression went from 60.7% to 63.3%, which was the new best.

The next big change was looking up each player’s current rank (e.g., Silver 1, 50 LP). In previous experiments I only used their ending rank from the previous season because it’s easy to access. I had to write a new crawler and let it run over a weekend to fetch every player’s current league, division, and points. I converted that to a single numeric value and added features for the average rank of each side and a diff to make them easier to compare (1).

This was very successful, achieving 65.8% accuracy with logistic regression and 67.1% with gradient boosting trees.
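
For concreteness, the conversion and the per-side features look roughly like this (a simplified sketch; the constants and names are illustrative, not my production code):

```python
# Sketch: collapse tier/division/LP into one number, then build per-side
# average and diff features. Master/challenger (which have no divisions)
# would need special handling.
TIER_BASE = {"BRONZE": 0, "SILVER": 5, "GOLD": 10, "PLATINUM": 15, "DIAMOND": 20}
DIVISION_OFFSET = {"V": 0, "IV": 1, "III": 2, "II": 3, "I": 4}

def rank_to_number(tier, division, league_points):
    """One division spans 100 LP, so Silver I at 50 LP -> (5 + 4) * 100 + 50."""
    return (TIER_BASE[tier] + DIVISION_OFFSET[division]) * 100 + league_points

def team_rank_features(blue_players, red_players):
    """Each player is a (tier, division, LP) tuple."""
    blue_avg = sum(rank_to_number(*p) for p in blue_players) / len(blue_players)
    red_avg = sum(rank_to_number(*p) for p in red_players) / len(red_players)
    return {"blue_avg_rank": blue_avg,
            "red_avg_rank": red_avg,
            "rank_diff": blue_avg - red_avg}
```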

I also extracted the role each person played within the team. A standard team comp has the following five roles: solo top lane, solo mid lane, solo jungle, bottom/marksman, and bottom/support. With these assignments we can compare the player on each side of the matchup: do we expect blue side or red side to win in the mid lane? What about jungle? And so on.

Unfortunately the data is extremely messy for the lane/role assignments. I put a lot of effort into making sensible default values but it still needs more work (5).

After all that work though, logistic regression and gradient boosting trees only improved by 0.1%.

I also revisited indicator features for the champions played on each side and the summoner spells selected. This increases the number of features from 61 to 331 and usually makes the models overfit. I only ran a small number of tests with logistic regression but found on the 200k dataset that it improved accuracy from 66.3% to 66.5%.

Neural networks

Feature engineering is productive, but once you run out of ideas it’s worth trying different model types and hyperparameter tuning. I’d been meaning to use neural networks but they aren’t supported in scikit-learn (2).

After surveying Python neural network libraries I found that almost all of them use Theano on the backend sort of like how scikit-learn uses numpy. I really didn’t want to write the NN code manually in Theano. Lasagne provides shorthand functions to create network layers but doesn’t help with the optimization. Keras is much closer to scikit-learn in that it provides an easier interface and you pick from multiple optimization methods. And I don’t need to understand Theano at all to use it. So I went with Keras.

At first I tried replicating logistic regression as a sanity test. A neural network with sigmoid activation and no hidden units is actually just logistic regression. Unfortunately I couldn’t get similar accuracy to scikit-learn logistic regression no matter what I tried. I got 62.2% in Keras vs 66-67% in scikit-learn logistic regression. When I tried using the ReLU activation function instead of sigmoid I could get 65%. I also tried tanh but that got 64.3%. I never figured out why I couldn’t reproduce it and moved on. (3)
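
The sanity test itself is tiny; with the current Keras API it looks something like this (the 2015-era API differed a bit, and the data here is just a placeholder):

```python
import numpy as np
from keras import Input
from keras.layers import Dense
from keras.models import Sequential

n_features = 61  # the feature count I was using at this point

# A single sigmoid output and no hidden layer: structurally the same model
# as logistic regression.
model = Sequential([
    Input(shape=(n_features,)),
    Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="sgd",
              metrics=["accuracy"])

# X: (n_matches, n_features) scaled features; y: 1 if blue side won.
X = np.random.rand(1000, n_features)     # placeholder data for the sketch
y = np.random.randint(0, 2, size=1000)
model.fit(X, y, epochs=50, batch_size=1000, verbose=0)
```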

Neural networks are sensitive to hyperparameters just like most other algorithms, but they have many more of them: the number of layers, layer widths, regularization, dropout, maxout, maxnorm, the optimizer algorithm and its settings, activation functions, and so on. The easiest way to test would be scikit-learn’s GridSearchCV or RandomizedSearchCV. Unfortunately Keras models aren’t compatible, so you need to write a scikit-learn class that wraps the Keras model (code here). I made many mistakes doing it before finding this note. It’s pretty janky; it uses introspection to decide which members of your class are hyperparameters for get_params and set_params (hint: anything ending in an underscore is excluded). Keras also has a scikit-learn wrapper, but it doesn’t look like you can run a grid search over layer sizes with it.
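
Mine looked roughly like the sketch below (simplified, and using the current Keras API rather than what existed in 2015). The constructor arguments double as the tunable hyperparameters.

```python
from keras import Input
from keras.layers import Dense, Dropout
from keras.models import Sequential
from sklearn.base import BaseEstimator, ClassifierMixin

class KerasBinaryClassifier(BaseEstimator, ClassifierMixin):
    # Constructor arguments become hyperparameters via get_params/set_params;
    # none of them end in an underscore (trailing underscores are reserved
    # for fitted attributes).
    def __init__(self, n_features=61, hidden_units=75, dropout=0.5,
                 epochs=50, batch_size=1000):
        self.n_features = n_features
        self.hidden_units = hidden_units
        self.dropout = dropout
        self.epochs = epochs
        self.batch_size = batch_size

    def fit(self, X, y):
        model = Sequential([
            Input(shape=(self.n_features,)),
            Dense(self.hidden_units, activation="relu"),
            Dropout(self.dropout),
            Dense(1, activation="sigmoid"),
        ])
        model.compile(loss="binary_crossentropy", optimizer="adam",
                      metrics=["accuracy"])
        model.fit(X, y, epochs=self.epochs, batch_size=self.batch_size,
                  verbose=0)
        self.model_ = model
        return self

    def predict(self, X):
        return (self.model_.predict(X, verbose=0) > 0.5).astype(int).ravel()
```

With that, a GridSearchCV over hidden_units and dropout runs like it would for any other scikit-learn estimator.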

Ranting aside, I can run grid searches over different network configurations, dropout settings, activation functions, etc.

I was able to get to 67.4% accuracy (new best) after two days with Keras using a neural network with one hidden layer of 75-100 units, 0.5 dropout, and ReLU activation function. Since then I’ve tried tons of experiments which I’ll summarize:

  • Adding a second small hidden layer is harmful. I was comparing 61 input -> 75 hidden -> 1 output vs 61 -> 75 -> 5 -> 1. From watching the run it’s fitting the training data much better but generalizing much worse. I couldn’t find any dropout settings that would compensate for the overfitting enough.
  • Maxnorm: Literature shows that it’s helpful in conjunction with dropout but I didn’t get any gains at all.
  • Maxout: Literature shows bigger gains than maxnorm in conjunction with dropout, but I could only just barely match the same score, and only by ensuring that it wasn’t increasing the number of parameters in the model (i.e., for maxout 2, use half as many hidden units).
  • 0.5 dropout seems best. When I increased features from 61 to 331 the best dropout was around 0.8 (which more or less allows it to ignore all the extra features). I only put the dropout layer after the hidden layer. I may have tried a dropout layer after the inputs but found it didn’t help.
  • I found the best results with mini-batch training of about 1,000 matches per batch followed by full-batch training.
  • I used the Adam optimizer because I saw a talk that said it’s the best/easiest choice.
  • I tried adding a GaussianNoise layer on the input with 0.1 noise but it was slightly harmful.
  • I ran many experiments on early stopping not shown in the graph but was unable to find any settings that gave the same accuracy. However, I did learn that Keras early stopping can only use val_loss or val_acc; it can’t stop on training loss. Also, the code works incorrectly if stopping on accuracy; it stops if the minimum doesn’t decrease for the specified number of epochs but with accuracy we want the max to increase.
    • To get reasonable results I had to set the patience value pretty high (20 epochs).
    • The best I could do with early stopping was 67.3% accuracy in 3.6 minutes vs 67.45% accuracy in 7.0 minutes without it. So about a factor of 2 faster but not as accurate. Probably good for quick tests though.
    • For accuracy stopping I tried modifying the Keras class to handle accuracy correctly but it didn’t seem to work well (maybe stopping on accuracy is fundamentally bad).

After all that tuning I couldn’t beat 67.4% on the 200k dataset and couldn’t beat 67.0% on 1.8m.

Gradient boosting with init

Gradient boosting decision trees start from a base prediction and correct its errors with each new tree. By default the base is just predicting the majority class, but you can supply a full estimator to use instead.

I tried using logistic regression as the base and found that it helped slightly (67.1% -> 67.3%). It was a huge pain due to the lack of documentation, but Stack Overflow saved me.
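
For reference, with a recent scikit-learn (which accepts any estimator providing fit and predict_proba as init directly) the setup is roughly:

```python
# Sketch: gradient boosting that starts from logistic regression predictions
# and corrects them tree by tree. Hyperparameters are illustrative.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

model = GradientBoostingClassifier(
    init=LogisticRegression(max_iter=1000),  # the base predictions to correct
    n_estimators=300,
    learning_rate=0.2,
    subsample=0.9,
)
# model.fit(X_train, y_train); model.score(X_test, y_test)
```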

The only issue is that it doesn’t seem to work with multiple threads so mostly I don’t use it. But it seems less sensitive to hyperparameter tuning.

Model types: odds and ends

I tried support vector machines briefly and learned why nobody uses them anymore: O(n^2) runtime so they just don’t scale. I’m sure there are ways around it. You could run the kernel on log n carefully selected points but scikit-learn doesn’t have that.

I also tried elastic net, which is logistic regression with both L1 and L2 norms. I vaguely remember benefits with this for maxent language models. But it didn’t improve over L2 logistic regression and the implementation in scikit-learn was harder to use in cross-validation.

Revisiting hyperparameter tuning

I hadn’t tuned the parameters of gradient boosting trees like I had for neural networks because they’re slow. But I’ve been going back through gradient boosting and random forests to reassess the hyperparams. My goal is to improve runtime and hopefully improve accuracy a little.

Gradient boosting trees

My previous hyperparameters for gradient boosting trees were very suboptimal. After tuning I improved from 67.1% to 67.9% (best results yet). The important settings were the learning rate and the number of trees. I’d set the learning rate to 0.9 (a very poor choice) and the number of trees to 100 to match the random forests. The best settings were around 0.2-0.25 learning rate and 300 trees. Possibly I had picked the 0.9 learning rate from hyperparameter tuning back when my data set was leaking the test info and I was getting 90-100% accuracy. The default learning rate of 0.1 was also poor.

I also found tiny gains from subsample 0.9, which helps reduce overfitting. I tried subsample at 0.5 and 0.75 but that was awful. Subsample 0.9 should speed up training slightly so I’m using that now. I also found small gains by tuning min_samples_leaf down to 10 from 20. (4)
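
The search itself is nothing special; roughly this, with grid values centered on what ended up working:

```python
# Sketch of the gradient boosting re-tuning; grid values are illustrative.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "learning_rate": [0.1, 0.2, 0.25],
    "n_estimators": [100, 300],
    "subsample": [0.9, 1.0],
    "min_samples_leaf": [10, 20],
}
search = GridSearchCV(GradientBoostingClassifier(), param_grid,
                      scoring="accuracy", cv=5, n_jobs=3)
# search.fit(X, y); print(search.best_params_, search.best_score_)
```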

Random forests

I also tried tuning the random forests and found that I was using too few trees, so I upped that from 100 to 150. I was hoping to find the “elbow” in the graph of accuracy vs number of trees but it’s smoother than I’d like:

Accuracy vs number of trees in random forest

I have no idea why there’s a blip at 100.

I also tried re-tuning min_samples_leaf and min_samples_split; higher values reduce overfitting and speed up training. I didn’t see much gain. I’m using min_samples_leaf 7 and min_samples_split 50.

Conclusions

I re-trained and tested the best settings on 200k matches with 5-fold cross-validation with 3 threads:

Model                    Accuracy (200k)  Training time
Gradient boosting trees  67.7%            43.9 min
Neural networks          67.4%            7.3 min
Logistic regression      66.6%            0.6 min
Random forests           66.3%            14.1 min

I’m not sure why gradient boosting lost 0.2% from previous runs but the hybrid with logistic regression gets 67.9% (not listed above) so I’m not too worried.

I might have to try a scikit-learn wrapper for xgboost. I’ve seen Kagglers have better success with xgboost than with scikit-learn and it’s supposedly faster.

Below are the best results to date in one table. The runs on 1.8 million matches aren’t necessarily with the same hyperparams as the corresponding 200k tests because it takes so long to rerun.

Model                    200k matches  1.8m matches
Gradient boosting trees  67.9%         67.7%
Neural networks          67.4%         67.0%
Logistic regression      66.6%         66.1%
Random forests           66.3%         66.2%

So what’s next? Unfortunately the ranked season ends next week and there are massive overhauls for season 6. Especially with ranked team builder queue, I expect that more players will get their best role so matches will be less predictable. I’d really like to hit 70% accuracy but I’m running out of time before everything changes.

Things that might help:

  • Crawl normal (unranked) game stats. Sometimes a player doesn’t have ranked stats for a champion but has stats from normals that could at least show whether they’re new to the champion or not. This would take a couple days to code and a few full days of crawling. I’d guess 0.1-0.5% gain from this.
  • Ensemble of classifiers. Unfortunately I didn’t see a wrapper for this in scikit-learn so I haven’t tried it yet (see the sketch after this list). Gradient boosting trees and neural networks learn in very different ways so they should combine well. I’d guess a 0.3-1.0% gain from this, though it would need to be more complex than majority voting.
  • Improve team queue prediction. I’m using the solo queue ranking of the players and adding them up but I should look up the team ranking. I could also start tracking the win rates of pairs of teams, which may capture strength or weakness of team strategy. I’d guess 0.1-0.3% gain from this.
  • Make my own ELO score. I could easily make an ELO score that’s updated as I generate the dataset. This wouldn’t have any data leakage problems like current rank but if I have lots of players that only show up in a few games then it won’t be stable. I’d guess 0-0.2% gain.
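
For the ensemble idea, a rough sketch of what I have in mind is below. Newer scikit-learn releases have a VotingClassifier that does this (it isn’t an option for me right now); with soft voting it averages predicted probabilities rather than taking a majority vote, and the Keras model would need the scikit-learn wrapper from earlier to slot in.

```python
# Sketch of a soft-voting ensemble; component models and settings illustrative.
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

ensemble = VotingClassifier(
    estimators=[
        ("gbt", GradientBoostingClassifier(n_estimators=300, learning_rate=0.2)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",  # average predicted probabilities instead of majority vote
)
# ensemble.fit(X_train, y_train); ensemble.score(X_test, y_test)
```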

Extra: Predictability tests

It’s good to understand when your models are doing well and poorly. I looked at this before but there weren’t too many interesting trends. I’ve changed my tests in three ways: logistic regression instead of random forests (for speed), the full 1.8 million matches instead of 44k, and using the players’ current league/rank to gauge the level of the match instead of their rank from the previous season.

Predictability by league

The black line is the overall average and the thin blue lines show plus or minus one standard deviation around the main blue line. Generally, lower leagues are much more predictable. There isn’t a statistically significant difference between silver and gold or between master and challenger, but the differences between bronze and silver, gold and platinum, platinum and diamond, and diamond and master are all significant.

It’s probably because lower-ranked players make more errors in the pick/ban phase that they haven’t learned to avoid yet. Also, high-level players are more capable of playing all roles reasonably, whereas lower-ranked players might only play one role well.

Predictability by version

This shows the prediction accuracy of the model by game version. I removed all game versions with a low number of matches. The standard deviation for most versions is around 0.2%.

The trend worries me. It’s completely unlike what I saw on older data (basically flat). It likely means that the features for current league are partially revealing the outcome of previous matches. Unfortunately the Riot API doesn’t provide a player’s historical ranking so it’s not possible to look up their ranking at the time they played a past game. I could drop the feature but it’s useful. Or I could recrawl each player’s league every day but I don’t have enough cloud storage to store that.

Notes

(1) Logistic regression has no trouble comparing two features but it’s more effective in tree-based methods to add a diff feature.

(2) This recently changed.

(3) I should have tried an absurdly high number of iterations. Structurally the models are the same and I set the regularization and feature scaling the same but the optimizers are different. If I did it all over again I’d test scikit-learn’s SGD implementation to compare directly to Keras SGD in addition to a very high number of iterations.

(4) Generally I like this setting because it reduces overfitting and also speeds up runtime. min_samples_split and min_samples_leaf interact a bit though.

(5) First off, the lane/role fields are sometimes empty or may have a lane (bottom) without the role (carry vs support). If the fields are empty I pick the most common lane/role in the data. If lane is present but the role is missing I fill that in with the most common role for that lane and champion. Even still sometimes the data is wrong and lists the support as a second mid lane. This could be better solved by using a classifier to clean up the data but that’s a bigger project. So sometimes I get matches where one side has bot/support and the other side doesn’t. In this case I look up the win rate of that champion as bot/support. If both supports are correctly tagged then I’ll look up the win rate of that matchup (e.g., Morgana bot support vs Blitzcrank bot support).

Feature scaling is important, but not how I expected

Currently I’m getting up to speed with the Keras library for neural networks. After about a day and a half of effort I have a neural network that’s tied with my best results ever for predicting the winner of League of Legends matches.

Keras doesn’t provide exactly the same methods as a scikit-learn model so I have to write some code to fit it into the cross validation and hyperparameter tuning frameworks. Then I’ll have an apples-to-apples comparison.

I learned that feature scaling is critically important for Keras models. Some models won’t even improve beyond random predictions without it! I knew that feature scaling was good for speeding up convergence but didn’t think modern optimizers would suffer total failure without it.

If you aren’t familiar with feature scaling, it’s just making sure that all your features have a similar range of values. The scaling I’m using subtracts the average value and divides by standard deviation so most values will fall in the -1 to 1 range after scaling (called standardization or a z-score). It’s also possible to scale each feature so that the minimum is 0 and maximum is 1.
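
In scikit-learn both variants are one-liners; a quick sketch with toy numbers:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[3000.0, 0.52], [1200.0, 0.48], [500.0, 0.61]])  # toy features

# Standardization (z-score): subtract the mean, divide by the standard deviation.
X_standardized = StandardScaler().fit_transform(X)

# Min-max scaling: squash each feature into the 0-1 range instead.
X_minmax = MinMaxScaler().fit_transform(X)
```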

Feature scaling is important for machine learning models that use some form of gradient descent, like logistic regression and neural networks. In previous experiments I’d tried logistic regression with and without scaling and it had only a minor impact. However, I only ran those tests after finding a reasonable range of regularization constants to tune. Unfortunately I learned the hard way that the optimal regularization constant is radically different for scaled vs. unscaled data (1).

What I learned today

  • Optimal regularization constant (C) is dependent on feature scaling.
  • Feature scaling speeds up optimization by about 6x (including the time to scale).
  • scikit-learn has three optimizers: liblinear, lbfgs, and newton-cg
    • With feature scaling the three optimizers give almost identical results for each C value. lbfgs is consistently worse than newton-cg and liblinear but only by about 0.01%.
    • Without feature scaling, liblinear is consistently better than newton-cg or lbfgs. The best result from liblinear is about 0.02% better than the best from newton-cg and about 1.66% better than the best result of lbfgs. For my problem, 1.66% is about the gain I can get with a couple weeks of feature engineering.
    • lbfgs is drastically faster than newton-cg or liblinear
  • LogisticRegressionCV is about 2x faster than GridSearchCV for lbfgs. The ratio is more dramatic if you’re tuning more than 10 C values because it initializes the weights for each C value closer to the optimum, i.e., a warm start. See the sketch after this list.
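
Here’s the sketch of the two approaches (using the current scikit-learn module layout; the C grid and fold count are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import GridSearchCV

Cs = np.logspace(-3, 3, 10)

# LogisticRegressionCV tunes C internally and can warm-start each fit from
# the previous solution as it moves along the regularization path.
lr_cv = LogisticRegressionCV(Cs=Cs, solver="lbfgs", cv=10, n_jobs=3)

# GridSearchCV refits a fresh LogisticRegression for every (C, fold) pair.
grid = GridSearchCV(LogisticRegression(solver="lbfgs"), {"C": list(Cs)},
                    cv=10, n_jobs=3)
# Both: .fit(X_scaled, y); best C via lr_cv.C_ or grid.best_params_
```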

Tests within LogisticRegressionCV at 50k samples

  • lbfgs: 0.2 min
  • newton-cg: 1.0 min
  • liblinear: 1.4 min

Tests with GridSearchCV at 50k samples

It’s tuning the same C values just doing it a different way.

  • lbfgs: 0.6 min
  • newton-cg: 1.7 min
  • liblinear: 1.4 min

Note that GridSearchCV and LogisticRegressionCV are the same speed with liblinear because liblinear doesn’t support a “warm start” in the optimization.

Tests at 200k samples

  • lbfgs LogisticRegressionCV: 1.2 minutes [what I’m switching to]
  • newton-cg LogisticRegressionCV: 4.8 minutes
  • lbfgs GridSearchCV: 2.9 minutes
  • liblinear GridSearchCV: 7.5 minutes
  • liblinear GridSearchCV without feature scaling: 33.0 minutes [what I’ve been using for weeks :( ]

After these optimizations it’s 27.5x faster.

I didn’t even try tests with 1.8 million samples. From memory that took about 3 hours for liblinear GridSearchCV without feature scaling and tuning 5 C values instead of 10. If the same speedup holds it should train in about 6.5 minutes.

I also checked the accuracy at the optimal C value for both. The fastest run found an optimum of 66.37% accuracy. The slowest (no feature scaling) found an optimum of 66.26% accuracy. So it’s not that lbfgs is cutting corners here; we’re actually gaining accuracy due to better optimization of C value with feature scaling.

Why does this matter?

When I’m trying out a new feature I don’t know if it’s useful or not. So I test it at 50k samples. But sometimes the feature isn’t reliable in a small data set. I may find that it fluctuates the accuracy by 0.01 at 50k samples but gives a 0.2 gain at 2 million. Big improvements are clear even at 50k though; this is mostly helping me to quickly make decisions about small changes.

It really matters because speed of iteration is life. What I mean is being able to test ideas quickly.

Notes

  1. The optimal C value will differ when the useful features have radically different scales. This happens because the scale of the optimal weights depends on the scale of the inputs, and the regularization penalty (whether L1 or L2) depends on the scale of those weights.
  2. These tests tried 10 C values each. In my previous experiments I dropped it down to 4-5 values to try and speed up my experiments.
  3. For all tests I used n_jobs=3 on a quad-core cpu. They all use the exact same 10-fold cross-validation splits and the splitter is initialized with random seed 4.
  4. Comparing LogisticRegressionCV and GridSearchCV is reasonable if you’re only tuning the regularization constant C. If you’re tuning other parameters like the max number of iterations or trying L1 vs L2 then you have to use GridSearchCV.

Bigger League of Legends data set

Background

Riot granted me an app key so I can crawl a lot more data. The downside is that I had to re-engineer much of my system because I couldn’t use the free MongoLab tier with that much data. To give a ballpark sense, my mongo data directory is 46 GB for 1.8 million matches and 1.2 million players.

The larger data set should lead to more accurate predictions of who will win. But first I want to check the distributions in the data in case anything’s changed.

The old code started entirely from featured matches. The featured matches endpoint gives 5 matches which gets me 50 players. For each queued player I’d look up their most recent 20 games and add those matches. Then I’d also get a list of new players from those matches. And I’d repeat this over and over.

There were a couple more tweaks: I’d set a date for when each player’s history should be refreshed by estimating when they’ll have 20 new matches. And I skip 3v3, ARAM, or other games; I only crawl ranked 5v5.

The new crawling setup starts from featured matches but also crawls the endpoint of challenger and master leagues. Instead of loading their match history I use the new match list endpoint which lets me get their entire season 5 history at once and when I’m recrawling I can get exactly the new games. Beyond that it’s the same just that I can crawl much more quickly. And I’m also looking up their current league too.

Game version

The majority of my old data was from when I started crawling: 5.14-5.16.

In the new setup it’s pretty even from 5.5 to 5.16.

Matches crawled by version

5.17 doesn’t have much data because it was just released when I did this crawl. The blip in 5.7 is around when URF mode was released – maybe that increases ranked play as well?

Either way, I have a much more even distribution across versions than before.

Players by tier

The old code figured out a player’s league from the highest league they’d reached in past seasons – this info is easily available in the match data because the game uses it to set the borders on the loading screen.

The majority of players were diamond but also many players are new and didn’t have a value:

Games by most common highest achieved season tier in old data set.

The new data is a much cleaner distribution: many more players in platinum than diamond, gold than platinum, etc.

Players by league in the updated data set.

This is similar to the distributions shown on League of Graphs and League of Legends Summoners. One of the interesting parts of those sites is showing the stats by division as well; you can see that division V typically has many more players than other divisions. That’s because it’s easier to be promoted a tier than demoted a tier.

Players by tier and division in the updated data set.

The same trend is shown – there are many more players in gold V than silver 1, for instance. (The league labels got mangled a bit when exporting from google sheets, sorry!)

Blue side win rate

Overall the blue side has a small advantage. In solo queue the blue side wins 50.5% of the time. Previously I saw that it varies by league but wasn’t statistically significant (read: possibly due to randomness). But now I have enough data that the trends are statistically significant!

Win rate by tier in solo queue

Thick line is the win rate. Thin lines show plus or minus one standard deviation. Note that the axis doesn’t go to zero.

For the curious, see note (1) on how the tiers are computed.

From platinum down to bronze the blue side has a win rate of 50.8% to 51.2% and those stats are roughly within a standard deviation of each other (not significant).

But then the blue side has a 48.5% win rate in diamond and 47.0% in masters! The change from platinum to diamond is significant and so is the change from diamond to master. The difference between master and challenger could just be randomness; there aren’t enough challenger games to tell for sure.

Maybe breaking it down by division will help find what’s going on?

Thick line is the win rate. Thin lines show plus or minus one standard deviation. Note that the axis doesn’t go to zero.

The entirety of the drop happens between platinum I and diamond III, going from a 50.4% win rate for blue down to 47.4%. Each of the successive drops in between is statistically significant.

Possible explanations:

  • Red side gets first ban and first pick. Maybe players don’t take full advantage of this until around diamond III.
  • The red side (higher elo) will tend to have more professional players from LCS. This is something I heard Heisendong say: At high level, some wins are determined based on which team gets LCS pros (like how at low level some losses are due to disconnects).

Win rate by tier in team queue

Last time I analyzed only 44 thousand games and about 14 thousand were team games. There wasn’t really enough data to get a clear picture.

Now the trend for team games is clear.

Win rate by tier team queue

Thick line is the win rate. Thin lines show plus or minus one standard deviation. Note that the axis doesn’t go to zero.

The changes from silver to gold, gold to platinum, platinum to diamond, and diamond to master tier are significant. There isn’t really enough data for bronze to say and challenger data is so sparse that I didn’t show it.

The closest published stat I could find was 52.8% win rate for blue side in team queue vs 50.9% for solo queue on League of Graphs but that doesn’t show the effect of higher tiers.

There are two clear trends: 1) blue side has a bigger advantage in team queue than solo queue and 2) the advantage grows at higher rankings.

It seems to agree with professional results: overall 57% win rate for blue side and 59% in NA+EU Spring split 2015 according to esportsfanz.

The overwhelming advantage of blue side in team queue is likely due to first pick and ban. Like solo queue, players at platinum begin to take more advantage of first pick/ban and it escalates to master tier. The advantage in team queue is likely much greater because you’re coordinating the draft strategy with your teammates. Furthermore, trading your pick is more effective with a team than with random solo queue players.

From what I can tell, the overwhelming first pick advantage is why OGN uses blind pick in game 5 of a best of 5: If the teams are so close that they make it to game 5, first pick advantage won’t be the deciding factor of the entire match. (2)

Notes

(1) To get the tier+division of a match I convert all players’ current tier, division, and league points into their total league points. I filter out anyone that’s division V with zero LP because I don’t know what their level should be. Then I average and convert back from league points into tier, division, and leftover league points.

I need to improve this for team games by looking up the league of the team not just the individual players.

(2) There’s a wealth of practice and information for regular draft but do they have enough analyst support for blind pick? How hard is it to practice challenger-level Zed vs. Zed play? That said both teams have the same challenges so it should still be fair.

Predicting League matches: Are some matches more predictable?

I’ve been working on improving the accuracy of my machine learning models and redesigning my software now that I have a much lower rate limit.

But I’ve been wondering whether some matches are more predictable than others. I’m sure it’s true but I don’t know what the trends are. Possibly lower skill players make more errors in pick/ban phase? Sometimes a game change makes a champion too strong and matches seem more predictable if one side gets them.

Setup

I’m evaluating random forest classification with 10-fold cross validation. After making predictions and computing prediction accuracy I can analyze for correlations between accuracy and various properties (1):

  • What’s the accuracy in patch 5.16 vs 5.15?
  • What’s the accuracy at high skill vs low skill?
  • What’s the accuracy when certain champions are picked?

Just to be extra clear, I’m not looking at win rates. I’m looking to see how well my model is learning in certain situations. This will help me assess the strengths and weaknesses of my model and the features I’m using.

I’m testing this using random forests with hyperparameters that were good in previous tests. The results would probably differ slightly for gradient boosting trees and may differ more for logistic regression.
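
Concretely, the analysis is just out-of-fold predictions grouped by a match property; something like this sketch (column names are illustrative):

```python
# Sketch: break out-of-fold prediction accuracy down by a match property.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

def accuracy_by_group(matches, feature_cols, group_col):
    """matches is a DataFrame with features plus 'blue_won' and group_col."""
    model = RandomForestClassifier(n_estimators=150, min_samples_leaf=7)
    predictions = cross_val_predict(model, matches[feature_cols],
                                    matches["blue_won"], cv=10)
    correct = predictions == matches["blue_won"]
    return correct.groupby(matches[group_col]).mean()

# accuracy_by_group(matches, feature_cols, "version")  # accuracy per patch
# accuracy_by_group(matches, feature_cols, "tier")     # accuracy per tier
```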

Game Version

Riot updates the game balance about every two weeks. Sometimes new champions are added and existing champions are tweaked, and either way they may end up too strong or too weak. The most egregious recent change was 5.16, in which Skarner was nearly 100% banned or picked.

Prediction accuracy by version

Raw data here with standard deviations (couldn’t figure out error bars in Google Sheets). I dropped versions with fewer than 100 matches. Most of the data is 5.14-5.16 so those patches comprise most of the overall average. The big blips seem statistically significant unless I computed standard deviations incorrectly.

Some comments on what changed in some patches:

  • 5.13: Devourer reworked (some junglers very strong now). Runeglaive buffed, making mid Ezreal even stronger. AP item overhaul making build paths easier. Tahm Kench released but very low win rate.
  • 5.14.0.329: UI overhauled. Gangplank reworked, above average strength. Miss Fortune update (bot adc), brought up to average. Elise (jungle) buffed, becomes near 100% pick or ban. Ryze (top) nerfed, almost not played anymore. Runeglaive Ezreal no longer viable.
  • 5.15: Fiora (top) reworked, underpowered at first and minor patch to improve her. Sivir (bot adc) ult nerfed.
  • 5.16.0.342: Skarner (jungle) overpowered. A few strong top laners. Elise (jungle) nerfed. New items for tanks. Armor items reworked (generally armor lowered).
  • 5.16.0.344: Skarner hotfixed to be less strong.

It’s tough to summarize each patch in one or two sentences (the patch notes are several pages long and don’t clearly indicate what each bug fix version number means). Originally I was thinking that patches with overpowered or underpowered champions would be more predictable. That’s certainly the case with the Skarner patch. But it’s not the case with the Elise patch.

My feeling overall is that the differences are fairly small. In general, even if my model were very good, any patch with major changes would probably be hard to predict until there’s enough data on the changing win rates. Patches with overpowered champions might be more predictable. Even if the overpowered champions are consistently banned, that gives a small advantage to blue side in the ban phase.

League tier

Your ranked tier is a rough measure of overall skill level: bronze, silver, gold, platinum, diamond, master, and challenger. Are matches more predictable in bronze? In a previous post I saw a blue side advantage in low tiers and red side advantage in higher tiers.

If you aren’t familiar with League of Legends one thing to note is that the number of players at each tier follows a sort of exponential distribution. See LoL Summoners’ Global Stats for instance. 27% of players are bronze, 42% silver, 20% gold, 8% platinum, 2% diamond, 0.04% master, 0.02% challenger.

Note that I’m still just using highestAchievedSeasonTier which is the max of all previous seasons and many players don’t have a value. Also I have the median value for blue and red side separately so I only counted games where they were the same.

Prediction accuracy by tier

Aside from platinum the others are all within about a standard deviation of one another. I don’t even have a guess as to why platinum is less predictable; it has the most data of these groups.

Maybe when I look up their current season tier and crawl more data I can update this and get more accurate results. That should confirm the weird dip in predictability of platinum matches or show that it was an artifact of the way I was measuring.

Champion selection

Do we better know the outcome when certain champions are picked? Many players feel some champions are really bad and the game is lost if they’re picked (Teemo, Yorick, and Tahm Kench come to mind. Urgot used to be this way).

There are 126 champions and I can’t graph that well. Plus it may lead to poor conclusions. Instead I’ll look at the difference from average over the standard deviation, or the standard score. 2-3 standard deviations away from average is pretty significant. (2)
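
The calculation itself is simple; here’s a sketch, using a binomial standard error as the standard deviation:

```python
import math

def champion_z_score(champ_correct, champ_games, overall_accuracy):
    """How far a champion's prediction accuracy deviates from the average."""
    champ_accuracy = champ_correct / champ_games
    std_error = math.sqrt(overall_accuracy * (1 - overall_accuracy) / champ_games)
    return (champ_accuracy - overall_accuracy) / std_error
```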

Here are the champions with the highest deviations in prediction accuracy from the average of 60.3%:

  • Varus: 64.0% accuracy in 2,200 games (z-score 3.62)
  • Nocturne: 63.7% accuracy in 2,116 games (z-score 3.21)
  • Tahm Kench: 63.1% accuracy in 2,363 games (z-score 2.82)
  • Kha’Zix: 58.1% accuracy in 2,829 games (z-score -2.40)
  • Galio: 53.7% accuracy in 650 games (z-score -3.38)

Of these champions, Tahm Kench and Kha’Zix have pretty low win rates. Varus and Galio have win rates that are very dependent on the lane in which they’re played.

One problem in analyzing this data is that the impact of champion picks is very dependent on the game version and individual player. If we have a few Varus mains with over 200 games played on him, they will likely have much higher win rates and lead to more predictable matches.

I don’t feel comfortable drawing any conclusions from this analysis except to say that looking at individual champions alone clearly doesn’t explain what my models are learning. In previous experiments I found that indicator variables for champion picks didn’t help the model; it added too much sparseness and the model couldn’t learn as well. So the models can’t easily learn to predict outcome purely based on the champions.

The raw data is here if you’d like to look.

Thoughts

It’s tempting to draw conclusions that fit my intuitions, but that would be a biased interpretation of the data. I don’t think my intuitions are upheld in this analysis, whether for the predictability of different game versions, different skill levels, or the presence of certain champions.

The conclusions could differ when I redo this with a much larger data set, especially when I get players’ current tier rather than max tier from previous seasons.

What’s next?

I redesigned much of my crawling and storage infrastructure to handle a much larger number of matches. It isn’t optimized yet but I have full match data for 1.8 million matches and ranked stats for 94 thousand players. In the short term I’m regenerating my feature matrix to test machine learning with much more data and I’ll report on the results of that.

I’ve also crawled each player’s current tier/rank, so I can experiment with those features once I’ve run tests on the larger data set. I’ll revisit feature engineering in general.

Beyond that I’ll likely revisit hyperparameter tuning and then try out neural networks (there may be enough data to make use of them now).

And eventually I want to start focusing more on ranked teams – at a high level teams have match histories against one another which better mimics professional play.

Notes

(1) I’m not saying features because I don’t directly include the game version in the features (it was too sparse to be useful).

(2) I’m being cagey about claiming statistical significance because I’m doing multiple tests – the chance of one accidentally passing a significance test is reasonably high. I’m just doing a quick test to get a ballpark idea for now because I’ll soon redo everything with a much larger sample size. At that time I’ll look into doing statistical analysis that is more robust.

Predicting League match outcomes: Week 2

I’ve continued to log my experimental accuracy in predicting League of Legends matches (see part 1) and this graph picks up from where I left off last time (around 64% accuracy).

Experimental accuracy part 2

When I hit 100% I learned I had data leakage. Then it dropped back down after fixing the leakage problems. (1) Also note that the cyclic patterns in the graph happen because I run random forest, then logistic regression, then gradient boosting trees most of the time. If I could do it again I’d make a different column in the spreadsheet for each one.

Gradient boosting trees revisited

Early on I tried gradient boosting trees but they didn’t perform well. They also take a while to train so I stopped experimenting with them. Now that the overall accuracy was decent I decided to revisit them to see if I did something wrong.

While random forests were getting 64.7% accuracy and logistic regression hit 61.6%, gradient boosting trees were 82.5% accurate. Clearly something was wrong.

Data leakage 2.0

The easiest way to debug the problem is to inspect the models. I can’t actually view all 100 trees in gradient boosting, but I can check out the feature_importances_ field of the learned model. They give scores for how much each feature contributes to the overall model. At the same time I can check the feature_importances_ in the random forest model and the array of weights for logistic regression. If gradient boosting is clearly learning to cheat and the others aren’t, I can compare to see which features are more important in gradient boosting.
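
The inspection itself is just sorting feature_importances_; something like:

```python
# Sketch: print the most important features of a fitted tree-ensemble model.
import numpy as np

def top_features(model, feature_names, n=10):
    order = np.argsort(model.feature_importances_)[::-1][:n]
    for i in order:
        print(f"{feature_names[i]}: {model.feature_importances_[i]:.3f}")

# top_features(gradient_boosting_model, feature_names)
# top_features(random_forest_model, feature_names)
# For logistic regression, inspect the weights instead:
# sorted(zip(feature_names, logistic_model.coef_[0]), key=lambda t: -abs(t[1]))
```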

The big difference was that gradient boosting trees heavily relied on the win rates computed from match history. There was one set of features for win rate by champion and one set for win rate by champion and game version.

I removed those two sets of features and accuracy was believable again: around 58-59%.

What was the leakage? I never found all the leaks but I found two issues:

There was an issue in the per-player statistics from the ranked stats endpoint. I hadn’t realized that it can only provide stats on a per-season basis, and that it defaults to the current season. Some of the matches in my database are from SEASON2014 but the stats are SEASON2015. So when I subtracted the match outcome from the win rate, that was incorrect for 2014 matches.

The second issue was another misunderstanding. Say the ranked stats indicate that you played Morgana 1 time and won 0 times, and the timestamp indicates that these stats were computed on Monday at 5pm GMT. Now say we process a match in which you played Morgana on Monday at 4pm GMT and won. This is inconsistent with the ranked stats. Previously I’d just make sure neither value could be subtracted below 0, but really, if the data is inconsistent I shouldn’t adjust either the wins or the total games played (which is what I do now).

How could this happen? The match data is being updated in a live database. The ranked stats seem like they’re from a Hadoop job. Probably the data is copied over from the live database to Hadoop for batch processing periodically. Then a giant job is used to recompute ranked stats now and then, which may also have some processing time. Even if they’re using the same database (say HBase) then the list of matches would be fixed at the start of processing and the timestamps may be set at the end.

The strangest thing is that fixing these two problems increases the accuracy of gradient boosting to almost 100%. So the features more cleanly represent leakage somehow. I couldn’t figure out how but decided I should compute match history statistics chronologically anyway.

Aside: Diagnosing by decision tree images

Random forests and gradient boosting trees are both collections of large numbers of decision trees (I’m using 100-500). It’s too hard to just look inside. But I can take a single decision tree and visualize that.

Scikit-learn fortunately has a function to export a graph file of the decision tree. Then you can plug it into webgraphviz if it’s small enough or render locally. Unfortunately the trees get so big that they crash renderers. Even with max depth of 5 it’s a very horizontal image:

decision_tree.dot
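
The export step looks like this (toy data here; the real tree is trained on the match features):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_graphviz

X = np.random.rand(500, 4)             # placeholder feature matrix
y = np.random.randint(0, 2, size=500)  # placeholder labels

tree = DecisionTreeClassifier(max_depth=5).fit(X, y)
export_graphviz(tree, out_file="decision_tree.dot",
                feature_names=["f0", "f1", "f2", "f3"], filled=True)
# Render locally with graphviz (dot -Tpng decision_tree.dot -o tree.png)
# or paste the .dot contents into webgraphviz if it's small enough.
```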

Second aside: While trying to make a graph fit on this blog post I started lowering the max_depth and found that decision trees actually aren’t completely crap; they just overfit horribly with the default parameters in sklearn.

Aside: What may have worked

There were predominantly two features being used. I could have tried learning a decision tree on only those two features and dumping it out, or plotting a decision surface.

Also, it feels like the trees were implicitly diffing the features and checking whether two numbers differed by exactly 1 when there was leakage, because that wasn’t something logistic regression was able to learn.

Regenerating the data with win rates computed chronologically

Although it nags at me, the specific bug isn’t important. Computing all match-history-based info can be done iteratively as I generate the data if I generate matches chronologically. I process one match then update the win rates for that patch and those champions and move to the next match. Leakage isn’t possible when computing it this way (at least for the match history features).
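
In code the idea is just: compute each match’s features from the running totals, and only afterwards fold that match’s outcome into the totals. A sketch with illustrative field names:

```python
from collections import defaultdict

def champion_win_rate_features(matches):
    """matches must be sorted chronologically; yields one feature dict each."""
    wins = defaultdict(int)
    games = defaultdict(int)
    for match in matches:
        features = {}
        for player in match["players"]:
            key = (player["id"], player["champion"])
            features[f"{player['slot']}_champ_win_rate"] = (
                wins[key] / games[key] if games[key] else 0.5)
        yield features
        # Only now update the totals, so later matches see this result
        # but this match can't leak its own outcome.
        for player in match["players"]:
            key = (player["id"], player["champion"])
            games[key] += 1
            if player["won"]:
                wins[key] += 1
```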

This brought me back down to 57-58% accuracy for random forests and logistic regression. Out of curiosity I tried plain decision trees for prediction and they’re around 52-53% with these features. (2)

Using backoff for sparse win rates

Win rates are just like language modeling probabilities (my background is language modeling). You’re computing the probability of an outcome given some conditional information like the champion selected, game version, and/or player. As you use more parameters the data gets very sparse. Often there’s no history of a player’s win rate on a given champion.

Previously I would default these values to 50%. Now I explicitly set a backoff/fallback value: when computing a player’s win rate on a champion, I back off to the aggregate win rate on that champion (both from ranked stats). The most basic form is that when the player’s data is empty, the value should be the more general one. But it’s even better to also handle sparseness, such as a player with only 2 games played on a champion.

So I combine the primary feature and secondary feature based on how sparse the primary feature is.

primary weight = num games played / (num games played + crossover)

secondary weight = 1 – primary weight

The crossover parameter is the number of games at which the two are averaged equally. Through experimentation I found that 10 worked better than 5 or 20. With enough games played the value will default to the primary one (player+champion) but when there are a few games it will nudge the more general value (champion only) towards the player’s history. When there are no games at all it’ll seamlessly back off to (champion only).
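
As a function, the backoff looks roughly like this:

```python
def backoff_win_rate(player_champ_wins, player_champ_games,
                     champion_win_rate, crossover=10):
    """Blend a player's win rate on a champion with the champion's overall rate.

    With 0 games this returns the champion win rate; at `crossover` games the
    two are averaged equally; with many games the player's own rate dominates.
    """
    primary_weight = player_champ_games / (player_champ_games + crossover)
    secondary_weight = 1 - primary_weight
    primary = (player_champ_wins / player_champ_games
               if player_champ_games else 0.0)
    return primary_weight * primary + secondary_weight * champion_win_rate

# backoff_win_rate(0, 0, 0.51)   -> 0.51 (no history: champion average)
# backoff_win_rate(8, 10, 0.51)  -> 0.655 (halfway between 0.80 and 0.51)
```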

This got me to 60% accuracy with random forests and logistic regression. Gradient boosting trees reached 62.5%.

Lesson learned: Very sensitive to having ranked stats available

Now that the data is extracted chronologically I can compute other features, such as the player’s current winning or losing streak. So I regenerated the data with winning and losing streaks.

The numbers dropped down to about 58% accuracy. I played around some more and found that it was 58% even without the new features… the ongoing data crawling from the backend had changed my data and my numbers were lower.

After poking around I found that tons of player ids were queued up to have their ranked stats pulled in for the first time. So I halted the ongoing crawling and wrote a script to do a one-time crawl of ALL players. This took half a day, but when it finished I found that random forests and logistic regression were back up to 60% accuracy. Gradient boosting was about 61.2%.

It’s a good thing that I learned that accuracy is so sensitive to fetching ranked stats to compute player win rates. But it’s also a shame because it means my system needs full coverage of that long list of players, which takes a while.

Beyond this I reduced my features from about 52 to about 42, which speeds up training and very slightly improves accuracy by making the models more general. Mostly I dropped several _min and _max aggregate features.

What’s next?

Currently I’m working on two initiatives:

  1. Which matches are more predictable? I’m trying to break down prediction accuracy by player skill level, game version, etc to understand the value of what I’m building.
  2. Riot approved my app key so I can query much faster. At the same time I need to drastically reduce data stored on MongoLab so I’m refactoring to store just skeleton info in MongoLab like match ids, player ids, timestamps, and player rankings. And I’ll cache Riot API lookups in a mongo instance on my own machine running mongo 3 for compression. So far the refactor is going well but it takes time.

Notes

(1) I’m calling it data leakage but specifically I mean that the features used to represent the problem unintentionally contain a feature that’s derived from what we’re trying to predict. For instance, when I subtract out the result of the match from the win rates that can leak information if I do it incorrectly.

(2) I later found that decision trees can do reasonably well with hyperparameter tuning. With default params I got around 52% accurate. With a little tuning I got 58.5%. Compare to 60.4% for logistic regression with slight tuning.

Predicting League match outcomes: First week of machine learning

Edit 8/26: I meant to include learning curves but forgot. Added initial learning curve and final learning curve.

Just a reminder, my goal is to predict the winner of a League of Legends ranked 5×5 game based on the pick/ban phase, including any player stats. This is an intermediate goal towards predicting the outcome of professional matches.

Before you read much further, decide on your own hypotheses: How accurate is good? How accurate is bad? Is 100% achievable? What would it mean? (Look back after reading more to assess your predictions.)

In the previous post I described my infrastructure for building a database of players and matches from the Riot API. This one will describe the process of machine learning including the horrible failures.

Data and evaluation

Once my crawling infrastructure was running semi-smoothly I started working on the machine learning infrastructure. At first I had about 20,000 matches crawled. As I went on I updated the dataset and it’s now about 44,000 matches. The reason I updated the dataset during experimentation is 1) I goofed in my software design so that any time I needed to add new columns from the match data I had to reprocess all of my match collection and 2) 20,000 really isn’t enough for good machine learning in this space. Now I’m keeping my data stable for a few days of experimentation then updating if/when I get significantly more.

For evaluation I’m using 10-fold cross-validation so that all matches in the data set contribute to training and testing (but at different times – I’m never testing on my training data except to judge how much I’m overfitting).

I’m evaluating in terms of accuracy: the percent of the time that I correctly guess which side will win.

Problem setup

Two-class classification is straightforward: the output is a simple yes or no. To frame the problem this way, I’m predicting whether the blue side will win.

The following is a graph of experiments from August 18-24. The left side is the oldest experiment and right side is the most recent.

Accuracy over time

There was a terrible mistake in the middle and I wanted to hide it for the purpose of graphing, but I think it’s important not to hide mistakes so that we can learn. Aside from that middle part there’s a rough upward trend. I started off around 51-55% accuracy and now I’m achieving 61-64% accuracy.

I’ll discuss these four phases separately:

Accuracy over time in 4 regions

Phase 1: Basics

When starting out I do a few things:

  1. Try to understand my data
  2. Build a very basic set of features for each match
  3. Test a few machine learning algorithms and tune hyperparameters

Exploring the trends

I have a basic understanding of why teams win from playing the game. Overall I know that blue side has an advantage but that in ranked matches Riot puts the higher ranked team on red side and also gives them first pick/ban to balance out the blue side advantage. Also it’s clear that certain champions have a higher probability of winning than others and it’s dependent on the game version (changes every 2 weeks).

Overall let’s look at win rate by side in this data:

Blue side: 50.6% win rate (in 43,905 ranked games)

Compare this to League of Graphs, which shows 51% win rate for solo queue and 52.9% for team queue. They probably have more data than me and it’s probably more balanced between different levels of play. I started crawling from featured matches and challenger/master so my data is likely biased towards the top levels of play.

But even then, given a game how do I decide whether it’s diamond level or gold level? The Riot API doesn’t provide this information. There are two methods:

  1. In the match participant info, the highestAchievedSeasonTier tells you the highest tier they achieved in previous seasons, which is used for the loading screen borders.
  2. You can look up the current league/tier each player is in. But you can’t look up the tier they were at the time they played a game in their match history.

#2 is probably more accurate but adds a ton more server lookups so I’m using their highest previously achieved tier as each player’s current level. Given each individual’s ranking I combine them by taking the most common one. Sometimes that may mean the most common is unranked though. (1)

Games crawled by most common highest achieved season tier

Even though I started the crawl with current challenger and master players most of what I get are former diamond and platinum players and a ton of formerly unranked players.

How does it break down by solo queue and team queue? I prioritize team games higher when crawling but solo queue is more popular.

Solo queue: ~30,000 games
Team queue: ~14,000 games

So I have a good number of team games to hopefully learn different trends.

Looking at the blue side advantage by queue and tier:

Blue side win rate by tier and queue

I have zero master team games so that data point is missing. In solo queue there’s an interesting trend where the blue side has more advantage at lower ranks and red side has advantage at higher ranks. In team queue the blue side is even or favored to win at all ranks.

I tested statistical significance between a few pairs like solo queue silver vs platinum but the difference wasn’t significant. In other words, it’s possible that the difference between any two pairs is just due to random chance. There’s clearly something fishy and it’s possible that more data would reveal a difference that can’t be attributed to chance. Or it’s possible that there’s a better statistical test for this situation which is sort of a mixture of rank correlation with binomial distributions.

Per champion win rates

It’s common to analyze win rates by champion per patch. Since I’m just starting out, I computed them by champion only. There are two ways to compute win rate and they led to different numbers:

  1. From ranked summoner stats add up wins and losses
  2. From matches add up wins and losses

I expected the same results but was wrong. What I found led me to this:

Total win rate from matches: 50.0% in 439 thousand instances (num matches * 10 players)
Total win rate from summoner ranked stats: 52.5% in 2.9 million instances

At first I thought there must be a bug but didn’t find any. I’m guessing that it’s because my crawl is biased towards higher level players. For a player to reach diamond they had to have a higher than 50% win rate for a period of time and then stabilize at 50% once they reach their skill level. So their total history is over 50% win rate but their recent history should be around 50%. (2)

Win rates by champion look pretty sensible so I won’t get into that.

Basic machine learning

For machine learning we need to reduce each match to a fixed-size list of features. The features generally can be yes/no, numeric, or categorical. I started with the following:

  1. Is this solo or team queue?
    1. 1 feature
  2. What’s the most common tier on each side?
    1. 2 features
  3. One feature per (side, champion), for instance: Does blue side have Ahri? Does red side have Ahri? Does blue side have Annie? Does red side have Annie?
    1. 272 features

I started with three classifiers:

  1. Random Forest: Easy to use and generalizes well especially with lots of trees. Flexible with feature inputs but can be slow.
  2. Logistic Regression: Also easy, especially with numeric or yes/no inputs.
  3. Gradient Boosting Trees: Kaggle competitions are usually won by a neural network or by gradient boosting trees. When I’ve used them they tended to beat out random forests but are slower to train.

My first 6 experiments were just variations on these three with some hyperparameter tuning, like the number of trees for random forests, min_samples_split for random forests, or the regularization constant for logistic regression. My conclusion from this was that gradient boosting wasn’t helpful when random forests had the same number of trees and that logistic regression was comparable but much faster to train than both. Logistic regression with L2 regularization was better than L1. From then on out I stopped using gradient boosting and stopped trying L1 regularization in logistic regression.

I converted the highestAchievedSeasonTier to numbers like challenger=1, master=2, diamond=3, … and took an actual average. At first I accidentally did integer division rather than floating point, but when I switched to floating point division the results were slightly worse. Probably that’s because it allows the classifiers to try comparisons that are too sparse.

I generated a learning curve with random forests to check whether I’m overfitting or underfitting.

Learning curve for the initial experiment: random forest classifier with 10 trees, regularization params disabled.

It’s the worst of both worlds! The gap between training and testing accuracy is enormous (overfitting). It’s a little better if we set min_samples_split but the test curve is still flat, so even regularization doesn’t help much. At the time I felt we were also underfitting because additional data wasn’t helping, but I was wrong: it’s fitting the training data almost perfectly.
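
This is roughly how I generate those curves; synthetic data from make_classification stands in for the real feature matrix, so the exact scores won’t match the plot:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=5000, n_features=275, n_informative=30,
                           random_state=0)
sizes, train_scores, test_scores = learning_curve(
    RandomForestClassifier(n_estimators=10), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=3, scoring="accuracy")
for n, train, test in zip(sizes, train_scores.mean(axis=1), test_scores.mean(axis=1)):
    print("%5d matches: train=%.3f test=%.3f" % (n, train, test))
# A large, persistent gap between the train and test columns is the overfitting
# described above.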

In a moment of desperation I tried enriching the feature space on the mistaken assumption that I also had underfitting. It felt silly to just check whether Annie is on blue side: Annie could be played mid or support, and honestly some people probably play top/jungle/adc Annie (and probably have lower win rates doing so). Unfortunately I’d removed the timeline data that indicates each player’s lane to save space. Instead I tried fields like Blue.Player1.IsAnnie, Blue.Player1.IsAhri, etc. At this point I also included the summoner spells – it’s useful to know whether a support player selects ignite or exhaust, for instance. So I had more columns like Blue.Player1.HasExhaust and so on. Instead of the 272 champion indicator features this meant 1,260 champion features plus 220 summoner spell features.

This was much worse! The summoner spell features did get used by the classifiers, but the model was overfitting.

I went through some twists and turns in desperation and found that per-side champion indicators were better than per-side-per-player, and the same trend held for summoner spells, so now I had Blue.NumFlash, Blue.NumIgnite, and so on. In other words, champion features were back down to 272 and summoner spell features were reduced from 220 to 22.
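
The collapsed per-side features look something like this (the names are placeholders for whatever is in the match data):

from collections import Counter

def side_features(prefix, champions, summoner_spells):
    features = {}
    for champ in set(champions):                        # 0/1 indicator per champion
        features["%s.Has%s" % (prefix, champ)] = 1
    for spell, count in Counter(summoner_spells).items():
        features["%s.Num%s" % (prefix, spell)] = count  # e.g. Blue.NumFlash = 5
    return features

print(side_features("Blue",
                    ["Annie", "Ahri", "Ashe", "Braum", "Caitlyn"],
                    ["Flash", "Ignite", "Flash", "Exhaust", "Flash", "Smite",
                     "Flash", "Teleport", "Flash", "Heal"]))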

These tweaks took me from my initial 52.6% accuracy to 54.4% accuracy.

Phase 2: Elation, depression, and typos

The second phase was about looking up each summoner’s win rate history with the champion they’re playing and their win rate history overall. So I had columns like Blue.1.ChampionWinRate, Blue.1.ChampionGamesPlayed, Blue.1.TotalWinRate, etc.

I knew I was cheating – the ranked stats may have been crawled after the specific match so it may be leaking the outcome of the match. Even with leaking info the accuracy was around 74%. (I’d expected higher)

Then I added some code to subtract out the outcome of the match in question when looking up win rates and games played. Unfortunately I had two variables named “won” and unintentionally subtracted 1 win every time, so the models could still deduce the outcome. I was fooled because subtracting it out decreased accuracy to 69%, so it seemed like I’d solved the data leakage.

I tried experiments for days, but when I went to refactor some code, accuracy mysteriously dropped to 56%. Once I found the bug I realized that all the intermediate conclusions were invalid too – why would the random forest use another feature if it already knows the outcome some of the time?

So I fixed the bug and started over from 56% accuracy.
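
For reference, the corrected lookup is conceptually this simple; the bug was subtracting a win whether or not the player actually won (the field names here are illustrative):

def champion_winrate_excluding_match(wins, losses, player_won_this_match):
    # Remove exactly this match's contribution from the lifetime totals.
    if player_won_this_match:
        wins -= 1
    else:
        losses -= 1
    games = wins + losses
    return wins / float(games) if games else 0.5   # fallback when no other history

print(champion_winrate_excluding_match(30, 20, True))    # 29/49 = 0.59...
print(champion_winrate_excluding_match(30, 20, False))   # 30/49 = 0.61...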

Phase 3: Reality check

Progress can be slow in feature engineering. If I had maybe 1000x more data perhaps I could throw this all into a neural network and avoid grinding away at it but I only have 44k data points.

This phase was a grind to work up from 56% accuracy to 59%. Then I updated my data and dropped back down to 58%.

For starters, the progress from 54% accuracy in phase 1 to 56% was due solely to each player’s win rate. Previously I’d seen that having separate features per player made the data too sparse, so I took the sum, min, and max to make features like Blue_Champion_WinRate_Sum: the sum of the blue players’ historical win rates on the champions they’re playing (subtracting out the outcome of the current match where appropriate).
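
The aggregation itself is tiny; roughly:

def aggregate(prefix, per_player_values):
    return {prefix + "_Sum": sum(per_player_values),
            prefix + "_Min": min(per_player_values),
            prefix + "_Max": max(per_player_values)}

# e.g. blue side's five champion win rates, leakage already removed
print(aggregate("Blue_Champion_WinRate", [0.52, 0.48, 0.61, 0.55, 0.50]))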

Some of the experiments:

  1. Total number of games played per player (combined with sum, min, max, and log of the sum). Helped a little
  2. Breakdown of expected damage types (percent physical, magic, true damage)
    1. Note that this started to show differences between random forests and logistic regression. For random forests the raw number was best. For logistic regression it was better to have bucketed features like “top 20% most physical damage comps” or “top 20% most magical damage comps” (see the binning sketch after this list).
    2. This would be correlated with win/loss because it’s easier for opponents to itemize defensively against a single type of damage.
  3. Added win rate features from the total win rates of all players on the champions played (again combined with sum, min, max).
  4. (Failed) Tried individual win rate on the champion divided by popularity. I was hoping to identify people that were good with rarely picked champions because they get an element of surprise. But it wasn’t helpful.
  5. Dropped all champion indicator columns. This gave me a good improvement and somewhat shocked me. What I learned is that with this amount of data it’s much better to build in champion knowledge to aggregated stats like win rates and damage types. Including these columns causes the algorithms to overfit the training data more.
  6. Revisiting hyperparameters. Random forests improved more with 500 trees but it’s slow to train so usually I run with 100 trees.
  7. Refreshed my data and lost 1% accuracy :( Part of the reason is that the player histories hadn’t been looked up for all the new players yet.
  8. Realized that I’d just added patch 5.16 data, in which Skarner has over a 60% win rate, so I need to account for Skarner somehow.
  9. Added the game version as a feature which was slightly worse.
  10. Refreshed the data set with ~44k matches and added champion indicators back for slight improvement
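
The binning mentioned in experiment 2 is just pandas qcut plus one-hot encoding, which is also where the *_qcut5_* features in the importance list at the end come from. A sketch with random stand-in data:

import numpy as np
import pandas as pd

# Stand-in for the real per-team damage fractions.
df = pd.DataFrame({"Blue_Damage_physical": np.random.uniform(0.1, 0.9, size=1000)})

# Split into five equal-population buckets, then one yes/no column per bucket.
buckets = pd.qcut(df["Blue_Damage_physical"], q=5)
dummies = pd.get_dummies(buckets, prefix="Blue_Damage_physical_qcut5")
print(dummies.head())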

Phase 4: Progress

In previous experiments, champion indicators were a huge flop. They’re just too sparse. When you think about champion indicators per side there are 126 features per side. If we consider that there are 6 bans, [(126 – 6) choose 10] is about 116,068,180,000,000 different possible configurations of champions. That’s 2.6 billion times more possible champion selects than the number of matches I have crawled.
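
A quick sanity check on that count (math.comb needs Python 3.8+):

from math import comb

configs = comb(126 - 6, 10)      # champions left after bans, choose the 10 picked
print(configs)                   # 116,068,178,638,776
print(configs / 44000.0)         # ~2.6 billion times the matches I have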

So I did what worked before: Compute a statistic per person then take a sum, min, and max. This time I precomputed a table of win rate per (game version, champion). If there was no data I looked up the champion overall win rate.

To compute this I had to move away from the ranked stats endpoint, which doesn’t have version info. So I’m computing it over the set of matches I’ve crawled, removing the outcome of the current match when looking it up.
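
A sketch of that lookup table with the fallback; the leakage subtraction from earlier is omitted here:

from collections import defaultdict

patch_record = defaultdict(lambda: [0, 0])   # (version, champion) -> [wins, games]
champ_record = defaultdict(lambda: [0, 0])   # champion -> [wins, games]

def record_result(version, champion, won):
    for rec in (patch_record[(version, champion)], champ_record[champion]):
        rec[0] += int(won)
        rec[1] += 1

def winrate(version, champion):
    wins, games = patch_record[(version, champion)]
    if games == 0:                            # no data for this patch: fall back
        wins, games = champ_record[champion]
    return wins / float(games) if games else 0.5

record_result("5.16", "Skarner", True)
record_result("5.16", "Skarner", True)
record_result("5.15", "Skarner", False)
print(winrate("5.16", "Skarner"))   # 1.0 on this toy data
print(winrate("5.17", "Skarner"))   # no 5.17 data, falls back to 2/3 overall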

This change got me from 58% accuracy up to 63% accuracy.

I got about 0.5% accuracy gain by dropping the champion columns and the game version columns. This has the nice bonus that training is MUCH faster with so many fewer columns.

When I think about the way random forests work, they compare a number to a threshold for each floating point feature. So you can have a tree like:

if blue_winrate_sum > 0.55:
   if red_winrate_sum < 0.55:
      ...
   else:
      ...
else:
   ...

But that’s a pretty dangerous game – there’s no reason for the random forest to pick the same thresholds for each side.

So I decided to add diffs to allow it to directly compare win rates, like (blue_winrate_sum - red_winrate_sum). So now it can learn something like:

if (blue_winrate_sum - red_winrate_sum) > 0:
   ...
else:
   ...

Encoding some win rates this way got me up to 64%.

Here’s the learning curve now:

Learning curve showing the impact of more data: random forest with 100 trees, regularization disabled.

It’s still overfitting, but accuracy improves with more data. Removing or simplifying features is likely to help, as is filling in better default values. Getting more data will help too.

Conclusions

This post is too damn long.

Random forests are more accurate than logistic regression now, probably because they can represent combinations of features better. But until my most recent experiments, logistic regression had similar accuracy and was faster to train.

My progress on this problem is mostly attributable to feature engineering. Some of that comes from knowing League of Legends and some is the constant struggle against data sparseness and overfitting. Although regularization helps to balance sparse data, it’s not nearly as effective as designing features that are less sparse.

Notes

(1) This is an area I could likely improve, for instance by taking the most common non-unranked rank. I’m also considering conflating challenger and master because players say there’s little difference.

(2) Unfortunately I didn’t notice this difference until somewhere around phase 4 so I used the ranked stats to look up champion win rates until then.

Current features with feature importances from random forests

For the curious, here are the current features and the list of feature importances from random forest learning. My feature names are pretty bad so here’s a rough legend.

Modifiers

Blue_: Feature is for blue side

Red_: Feature is for red side

Delta_: Feature is the blue side value minus red side value

_Sum: The sum of per-player values for red or blue side.

_LogSum: The log of the sum of per-player values for red or blue side. This is usually used for the number of games played, because the difference between 100 and 105 games matters much less than the difference between 5 and 10 games.

_Min: The min of per-player values for red or blue side.

_Max: The max of per-player values for red or blue side.

Values

MatchHistPatchWinRate: The win rate for (game version, champion), which is computed from match histories.

MatchHistWinRate: The win rate for (champion), which is computed from match histories.

TotalWinRate: The win rate for (player) for all matches in their ranked stats excluding the current match.

GeneralWinRate: The win rate for (champion), which is computed from ranked stats. (I’ll try removing these features soon)

Damage_true: Fraction of the team’s damage that’s “true” damage. This is computed by aggregating expected damage numbers per match for each champion.

Damage_physical: Same but for physical damage.

Damage_magical: Same but for magical damage

GeneralPlayRate: In general how often is each champion played. Percent of total games.

Played: The number of games played on the champion per player.

WinRate: Each player’s win rate on their champ.

Combined_WR_LP: _WinRate_Sum * _Played_LogSum

QueueType_RANKED_TEAM_5x5: Is this team queue?

QueueType_RANKED_SOLO_5x5: Is this solo queue?

Summoners_Ignite: How many ignite spells were taken by the team? 0-5. Same format for other summoner spells.

Tier: Numeric value for challenger, master, diamond, platinum, gold, silver, bronze, unranked.

Delta_MatchHistPatchWinRate_Sum: 0.0783417150388
Blue_MatchHistPatchWinRate_Min: 0.0323549344068
Red_MatchHistPatchWinRate_Max: 0.0313488027432
Red_MatchHistPatchWinRate_Min: 0.0312881349164
Blue_MatchHistPatchWinRate_Max: 0.0306717664898
Delta_MatchHistWinRate_Sum: 0.0287678556769
Delta_TotalWinRate_Sum: 0.0276201874232
Red_MatchHistWinRate_Max: 0.0276070957926
Blue_MatchHistWinRate_Max: 0.0263995490777
Delta_GeneralWinRate_Sum: 0.0232490041282
Red_MatchHistWinRate_Min: 0.0215945236077
Blue_MatchHistWinRate_Min: 0.020329523968
Blue_Damage_true : 0.0189126204086
Red_TotalWinRate_Max: 0.0185370040173
Red_Damage_true : 0.018432428736
Delta_GeneralPlayRate_Sum: 0.0182740974154
Blue_TotalWinRate_Max: 0.0179750673806
Blue_Damage_physical: 0.0178206785779
Red_Damage_physical : 0.0176842953121
Red_Damage_magic : 0.0175138996467
Blue_Damage_magic : 0.0173994840467
Red_GeneralPlayRate_Min: 0.0159850121751
Blue_GeneralPlayRate_Min: 0.0157020093749
Red_Combined_WR_LP : 0.0150776361251
Red_TotalPlayed_Sum : 0.0141792367679
Blue_GeneralWinRate_Min: 0.0139296297018
Red_TotalPlayed_Max : 0.0138542469971
Red_TotalPlayed_LogSum: 0.0136744596289
Red_GeneralWinRate_Min: 0.0136513142511
Red_GeneralWinRate_Max: 0.0136279491338
Blue_WinRate_Sum : 0.0135539271417
Blue_TotalPlayed_Sum: 0.0134913754497
Blue_GeneralWinRate_Max: 0.0134111710295
Blue_TotalPlayed_LogSum: 0.0132455587734
Red_Played_LogSum : 0.0131352239828
Red_Played_Sum : 0.013119619869
Red_WinRate_Sum : 0.0130897656654
Red_Played_Max : 0.0129777766529
Blue_Combined_WR_LP : 0.0129752348834
Blue_TotalPlayed_Max: 0.0128538678549
Blue_Played_Sum : 0.011959672534
Blue_Played_LogSum : 0.0119042299448
Blue_Played_Max : 0.0116590197237
Red_WinRate_Max : 0.0111025449999
Red_GeneralPlayRate_Max: 0.0108553703343
Blue_WinRate_Max : 0.0107640266013
Blue_GeneralPlayRate_Max: 0.0106759217472
Red_Tier : 0.00791245927754
Blue_Tier : 0.00737769089121
Red_WinRate_Min : 0.00624995727823
Blue_WinRate_Min : 0.00624583713921
Blue_TotalWinRate_Min: 0.00541460174423
Red_TotalWinRate_Min: 0.00538417607259
Red_Summoners_Ignite: 0.00393322706347
Blue_Summoners_Ignite: 0.0036346189246
QueueType_RANKED_TEAM_5x5: 0.0025569841657
Red_Summoners_Exhaust: 0.00250370199147
QueueType_RANKED_SOLO_5x5: 0.00241526202611
Blue_Summoners_Exhaust: 0.00233453697897
Red_Summoners_Flash : 0.00172553159503
Blue_Damage_true_qcut5_(0.0331, 0.0424]: 0.00168115359604
Blue_Summoners_Teleport: 0.00168044078121
Red_Damage_physical_qcut5_(0.532, 0.59]: 0.00167882503717
Red_Summoners_Teleport: 0.00165874470876
Red_Damage_true_qcut5_(0.0329, 0.0424]: 0.00164035443627
Blue_Summoners_Flash: 0.00162409480129
Blue_Damage_magic_qcut5_(0.365, 0.425]: 0.00160973133415
Red_Damage_true_qcut5_(0.0424, 0.0585]: 0.00160545365344
Blue_Damage_true_qcut5_(0.0424, 0.0588]: 0.00159505732403
Red_Damage_magic_qcut5_(0.363, 0.423]: 0.00158946539755
Blue_Damage_physical_qcut5_(0.53, 0.589]: 0.00157602298871
Red_Damage_true_qcut5_(0.0256, 0.0329]: 0.00156570595649
Red_Damage_magic_qcut5_(0.298, 0.363]: 0.0015619551631
Blue_Damage_true_qcut5_(0.0259, 0.0331]: 0.00154027231206
Red_Damage_physical_qcut5_(0.59, 0.658]: 0.00149894639082
Red_TotalPlayed_Min : 0.00147979827299
Blue_Damage_magic_qcut5_(0.425, 0.492]: 0.00144937349364
Blue_Damage_physical_qcut5_(0.463, 0.53]: 0.00144713707292
Blue_Damage_magic_qcut5_(0.298, 0.365]: 0.00143075088626
Blue_Damage_true_qcut5_[0.00407, 0.0259]: 0.00142928172729
Red_Damage_magic_qcut5_(0.423, 0.491]: 0.00142108685736
Blue_Damage_physical_qcut5_(0.589, 0.658]: 0.0014153193219
Red_Damage_true_qcut5_[0.004, 0.0256]: 0.00139654834611
Red_Damage_physical_qcut5_(0.464, 0.532]: 0.00139123098978
Red_Damage_true_qcut5_(0.0585, 0.217]: 0.00135422801033
Blue_Damage_true_qcut5_(0.0588, 0.216]: 0.00133514872817
Red_Summoners_Ghost : 0.0012825018216
Red_Damage_magic_qcut5_[0.0439, 0.298]: 0.0012564652682
Blue_Summoners_Ghost: 0.00123342041717
Blue_TotalPlayed_Min: 0.00121063184839
Blue_Damage_physical_qcut5_(0.658, 0.932]: 0.00120105753152
Blue_Damage_magic_qcut5_(0.492, 0.85]: 0.00116191345092
Red_Damage_magic_qcut5_(0.491, 0.867]: 0.00115009507273
Red_Damage_physical_qcut5_(0.658, 0.92]: 0.00114326762739
Red_Damage_physical_qcut5_[0.113, 0.464]: 0.00112895810494
Blue_Damage_magic_qcut5_[0.0376, 0.298]: 0.00112384366323
Blue_Damage_physical_qcut5_[0.111, 0.463]: 0.00111729364361
Blue_Summoners_Barrier: 0.00104733673701
Blue_Summoners_Heal : 0.00103136162457
Red_Summoners_Barrier: 0.00101362007145
Red_Summoners_Heal : 0.000882613108646
Red_Played_Min : 0.000747445158982
Blue_Played_Min : 0.000717326533907
Red_Summoners_Cleanse: 0.000608349212887
Blue_Summoners_Cleanse: 0.000588182703567
Blue_Summoners_Smite: 0.000216492493282
Red_Summoners_Smite : 0.000174875332852
Blue_Summoners_Clairvoyance: 1.59854560378e-05
Blue_Summoners_Clarity: 1.07734826485e-05
Red_Summoners_Clarity: 7.13387206181e-06
Red_Summoners_Clairvoyance: 5.87287195701e-06

Predicting League match outcomes: Gathering data

The goal

I’d like to take the results of the pick/ban phase of professional League of Legends matches and compute each side’s probability of winning. In part I find it interesting to watch analysis of the pick/ban phase by regular casters or in Saint’s VOD reviews. How much of the game is really determined at pick/ban? Was a match unwinnable after a bad pick/ban or is it just slightly harder? How much does it depend on the specific players? How much depends on the history of those two teams? Is there a measurable effect on win probability from jet lag? And so on.

Unfortunately there isn’t that much data for professional matches. The game is constantly changing and the teams/players are also changing quickly. With so many things changing at once, 1-2 games per week per team is probably not enough to easily develop a machine learning system. So I’ll work on a simpler problem and try to apply the insights I learn to professional matches.

Trying a simpler problem

Can I predict the outcome of a non-professional match based on pick/ban and the player histories? This is an easier problem in a few ways:

  1. Data is easier to get – it’s available via Riot API.
  2. There’s a lot more data.
  3. Even team queue has a lot more data than pro.

So the problem is much simpler now: I need to crawl match info and build up a database of match and player info. Once I have that I can start working on machine learning.

Crawling the data

There’s no list of all active matches or all matches per day in the Riot API. The only available list has the current featured matches. You can get 5 matches and it updates maybe every 5-10 minutes. At best we can get 1,440 matches per day this way. In practice, there will be duplicates and some matches aren’t ranked 5×5. The other difficulty is that I’m using Heroku for scheduling so at best I can run something every 10 minutes, or max 720 matches per day.

To be able to build up data faster I wrote software that acts like a web crawler: You have some starting points or seeds and crawl outwards from there.

In my case there are three seeds from the API:

  1. Featured matches
  2. List of challenger players
  3. List of masters players

The process is constantly adding more player and match ids to look up. It works like so:

  1. Look up all challenger and master players. Add any new ones to the player database
  2. Update all players, up to N lookups and prioritize those that have no data or are scheduled for update
    1. Look up their ranked stats and save them
    2. Look up their match history and queue all new matches for lookup
    3. From match history, compute what time and day we expect them to have played 15 new matches and save that as the date to update this player’s info
  3. Update all ranked 5×5 matches without detailed info, up to M
    1. Usually we have a match id and player ids already but don’t know the outcome of the match or various stats.
  4. Look up the featured matches and queue all new players and queue the matches

The reason I look up featured matches at the end is that the match lookup fails before they’re completed, so I want enough of a delay for the matches to finish before the match lookups start. In retrospect it would’ve been better to set a “don’t crawl before” date in the database, but this mostly accomplishes the same goal except for very long games.
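
Here’s a rough sketch of one crawl cycle tying those steps together. The db and riot_api objects are stand-ins for my actual code, so treat this as pseudocode that happens to parse:

def crawl_cycle(db, riot_api, max_players=200, max_matches=400):
    # 1. seed from the challenger and master leagues
    for player_id in riot_api.challenger_ids() + riot_api.master_ids():
        db.players.add_if_new(player_id)

    # 2. refresh the highest-priority players
    for player in db.players.due_for_update(limit=max_players):
        db.save_ranked_stats(player.id, riot_api.ranked_stats(player.id))
        history = riot_api.match_history(player.id)
        db.matches.queue_new(history)
        player.next_update = db.estimate_time_of_next_15_games(history)

    # 3. fill in queued ranked 5x5 matches we only know by id
    for match_id in db.matches.missing_details(limit=max_matches):
        db.save_match(riot_api.match(match_id))

    # 4. featured matches last, so in-progress games have time to finish
    for match in riot_api.featured_matches():
        db.players.queue_new(match.participant_ids)
        db.matches.queue_new([match.id])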

Issues

It took some trial and error to get this working. Some of the mistakes I’ve made at a design level:

  • I didn’t know about the challenger and masters APIs at first so it all just started from featured. That meant I wasn’t getting quite as many high-level games as I wanted.
  • I set the player recrawl date incorrectly: at first I took the difference between the first and last match, computed the number of games played per day, then set the recrawl date to the last match plus 15 games times that rate. But when a player plays 15 games in a row and then stops for weeks, this puts them at the top of the recrawl queue. Instead you need to project forward from the current time rather than the last match (there’s a sketch of the corrected computation after this list). (Fixing this dramatically increased what was crawled)
  • At first I didn’t limit the number crawled per type at all. What would happen is that my script would never finish the per-player lookups before Heroku killed it to start the next run. So I’d never process any queued matches.
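
A sketch of the corrected recrawl-date computation: project forward from now using the player’s observed play rate. The helper below is illustrative, not my exact code.

from datetime import datetime, timedelta

def next_recrawl_date(first_match, last_match, num_matches, games_until_recrawl=15):
    days_active = max((last_match - first_match).total_seconds() / 86400.0, 1.0)
    games_per_day = num_matches / days_active
    days_until = games_until_recrawl / max(games_per_day, 0.1)  # cap very idle players
    # Key fix: project from the current time, not from the player's last match.
    return datetime.utcnow() + timedelta(days=days_until)

print(next_recrawl_date(datetime(2015, 9, 1), datetime(2015, 9, 20), 60))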

Rate limiting

The developer key allows something like 60 requests per minute and 500 per 10 minutes. In theory, if you wait 1.2s or a little more between requests you should be fine. But there are other internal rate limits that aren’t exposed, which lead to 429 “rate limit exceeded” errors.

Instead of trying to figure it out I just used exponential backoff: try waiting 1.3 seconds between requests. If you get a 429 error, wait 13 seconds. If you get it again, wait 130 seconds. You could continue this, but I’ve never hit a limit after waiting 130 seconds.
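
In code it’s just a small loop around each request; this sketch assumes the requests library:

import time
import requests

def fetch_with_backoff(url, max_attempts=3):
    delay = 1.3                          # normal pause to stay under the rate limit
    for attempt in range(max_attempts):
        response = requests.get(url)
        if response.status_code != 429:
            time.sleep(delay)
            return response
        delay *= 10                      # 429: back off to 13s, then 130s
        time.sleep(delay)
    return response                      # still rate limited after max_attempts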

This solves the 429 problem 100%. Even if I’m running two copies of the script simultaneously this solves the issue. One side note is that it’s kind to Riot servers as well – it’s unlikely that you’ll ever get two 429 errors in a row so they don’t need to waste processing on requests that’ll just be rejected.

But there are many other kinds of errors to handle in the API. Some are transient and will clear up in a few seconds anyway (500 and 503 are like this). Others will return the same result every time (400, 401, 404, and 422 are like this). For the transient errors I just try again after a delay; I don’t use exponential backoff for those because it isn’t necessary.

Python APIs

I took a brief look at the existing Python wrappers for the Riot API and felt that the documentation was a bit lacking, so I didn’t use them. It probably would’ve been better for the community if I’d forked one of those modules and added documentation. At first I was just in a rush to get something working, and now after developing for a while I basically have yet another API wrapper. Sigh.

To be fair though, it doesn’t take a lot of effort to wrap a REST API, so not that much effort was wasted. I would’ve needed to spend all that time reading the Riot docs on field values anyway.

Compute and storage

Heroku is handy for running periodic scripts with the scheduler addon; I’m using that for hourly processing. I use MongoDB on MongoLab for storage, which is great for storing JSON. After fixing my design issues I found that I’m hitting space quotas on MongoLab.

As a hack I delete all the stats, timelines, runes, and masteries for participants in matches. This helps, but Mongo doesn’t actually free the disk space; it keeps it around for reuse. So I don’t get the space back until I run repairDatabase manually.
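
The cleanup looks something like this with pymongo; the field names are guesses at the shape of my match documents rather than something to copy verbatim, and repairDatabase applies to the older MongoDB versions MongoLab ran at the time:

from pymongo import MongoClient

client = MongoClient("mongodb://<mongolab-uri>")     # placeholder connection string
matches = client.lol.matches

# Strip the bulky per-participant blobs from matches that are already processed.
for match in matches.find({}, {"participants": 1}):
    for participant in match.get("participants", []):
        for field in ("stats", "timeline", "runes", "masteries"):
            participant.pop(field, None)
    matches.update_one({"_id": match["_id"]},
                       {"$set": {"participants": match["participants"]}})

# Space isn't returned to the OS until the database is repaired/compacted.
client.lol.command("repairDatabase")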

Even so, I’m running into space issues again. MongoLab is migrating all free databases to MongoDB 3.0 at the end of September, which allows compression and should greatly reduce my usage.

Probably before then I’ll have to strip the records down to their bare minimum so that I can keep crawling.

How well does it work?

It works great except that I’m running out of MongoLab storage space. Otherwise I have about 54k players in the database and 83k matches (but only 53% have been looked up fully).

In any case, it’s working well enough that I have a decent data set to use for machine learning and I’ll post my early experiments to that end soon.