Predicting League match outcomes: Gathering data

The goal

I’d like to take the results of the pick/ban phase of professional League of Legends matches and compute each team’s probability of winning. Part of my interest comes from watching pick/ban analysis by regular casters or in Saint’s VOD reviews. How much of the game is really determined at pick/ban? Was a match unwinnable after a bad pick/ban, or just slightly harder? How much does it depend on the specific players? How much depends on the history of those two teams? Is there a measurable effect from jet lag? And so on.

Unfortunately there isn’t much data for professional matches. The game is constantly changing and the teams/players are changing quickly too. With so many things changing at once, 1-2 games per week per team probably isn’t enough to easily develop a machine learning system. So I’ll work on a simpler problem and try to apply what I learn to professional matches.

Trying a simpler problem

Can I predict the outcome of a non-professional match based on pick/ban and the player histories? This is an easier problem in a few ways:

  1. Data is easier to get – it’s available via the Riot API.
  2. There’s a lot more data.
  3. Even team queue has a lot more data than pro.

So the problem is much simpler now: I need to crawl match info and build up a database of match and player info. Once I have that I can start working on machine learning.

Crawling the data

There’s no list of all active matches or all matches per day in the Riot API. The only available list is the current featured matches: you get 5 matches and it updates maybe every 5-10 minutes, so at best that’s 1,440 matches per day. In practice there will be duplicates and some matches aren’t ranked 5×5. The other difficulty is that I’m using Heroku’s scheduler, so at best I can run something every 10 minutes, for a max of 720 matches per day.

To build up data faster I wrote software that acts like a web crawler: you start from some seeds and crawl outwards from there.

In my case there are three seeds from the API:

  1. Featured matches
  2. List of challenger players
  3. List of masters players

The process is constantly adding more player and match ids to look up. It works like so:

  1. Look up all challenger and master players. Add any new ones to the player database.
  2. Update players, up to N lookups, prioritizing those that have no data or are due for an update:
    1. Look up their ranked stats and save them.
    2. Look up their match history and queue all new matches for lookup.
    3. From the match history, estimate the date and time by which they’ll have played 15 new matches and save that as the date to next update this player’s info.
  3. Update ranked 5×5 matches that lack detailed info, up to M.
    1. Usually we already have a match id and player ids but don’t know the outcome of the match or various stats.
  4. Look up the featured matches, queueing all new players and the matches themselves.

The reason I look up featured matches at the end is that the match lookup fails while a game is still in progress, so I want enough of a delay for featured matches to finish before their lookups start. In retrospect it would’ve been better to set a “don’t crawl before” date in the database, but this mostly accomplishes the same goal except for very long games.
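The per-run loop above can be sketched roughly as follows. This is a simplified in-memory model, not my actual code: the dict shapes and the `lookup_player`/`lookup_match` callables are hypothetical stand-ins for the real Riot API calls and database, and the fixed one-day recrawl placeholder replaces the 15-games estimate described in step 2.3.

```python
import time

def crawl_iteration(players, matches, lookup_player, lookup_match,
                    max_players=100, max_matches=200, now=None):
    """One scheduled run of the crawler (sketch).

    players: id -> {"next_update": timestamp, or None if never crawled}
    matches: id -> {"detailed": bool}
    lookup_player(pid) returns match ids from that player's history;
    lookup_match(mid) fetches match detail. Both are stand-ins.
    """
    now = time.time() if now is None else now

    # Prioritize players with no data, then the earliest-scheduled updates.
    due = sorted(players, key=lambda p: (players[p]["next_update"] is not None,
                                         players[p]["next_update"] or 0))
    for pid in due[:max_players]:
        for mid in lookup_player(pid):
            matches.setdefault(mid, {"detailed": False})  # queue new matches
        # The real version estimates when ~15 new games will exist; a fixed
        # one-day placeholder keeps this sketch simple.
        players[pid]["next_update"] = now + 24 * 3600

    # Then fill in detail for queued matches, up to the per-run cap.
    pending = [m for m in sorted(matches) if not matches[m]["detailed"]]
    for mid in pending[:max_matches]:
        lookup_match(mid)
        matches[mid]["detailed"] = True
    return pending[:max_matches]
```

The caps (`max_players`, `max_matches`) matter because Heroku kills the process at the end of the scheduling window; without them the per-player phase starves the match phase, which is exactly the mistake described below.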


It took some trial and error to get this working. Some of the mistakes I’ve made at a design level:

  • I didn’t know about the challenger and masters APIs at first so it all just started from featured. That meant I wasn’t getting quite as many high-level games as I wanted.
  • I set the player recrawl date incorrectly: at first I took the difference between the first and last match, computed the number of games played per day, then set the recrawl date to the last match plus 15 games times the rate. But when a player plays 15 games in a row and then stops for weeks, this puts them at the top of the recrawl queue. Instead you need to anchor the projected date to the current time. (Fixing this dramatically increased what was crawled.)
  • At first I didn’t limit the number crawled per type at all. My script would never finish the per-player lookups before Heroku killed it to start the next run, so I’d never process any queued matches.
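The recrawl-date fix in the second bullet comes down to which timestamp you anchor the projection to. A minimal sketch (the function name and 24-hour fallback are my own choices, not from the original code):

```python
def next_recrawl_time(match_times, now, games_until_recrawl=15):
    """Estimate when a player will have ~15 new matches to fetch.

    match_times: ascending unix timestamps of the player's recent matches.
    The broken version anchored the projection at the *last match* time;
    anchoring at `now` keeps long-idle players from jumping to the head
    of the recrawl queue.
    """
    if len(match_times) < 2:
        return now + 24 * 3600  # no rate estimate yet; check back in a day
    rate = (len(match_times) - 1) / (match_times[-1] - match_times[0])
    return now + games_until_recrawl / rate  # seconds until ~15 new games
```

For a player who last played weeks ago, `now + 15 / rate` is always in the future, whereas `last_match + 15 / rate` may be weeks in the past, which is what flooded the queue.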

Rate limiting

The developer key allows something like 60 requests per minute and 500 per 10 minutes. In theory, waiting 1.2 seconds or a little more between requests should keep you under those limits. But there are other internal rate limits that aren’t exposed, leading to 429 (rate limit exceeded) errors.

Instead of trying to figure it out I just used exponential backoff: wait 1.3 seconds between requests. If you get a 429 error, wait 13 seconds. If you get it again, wait 130 seconds. You can continue this, but I’ve never hit a limit after waiting 130 seconds.

This solved the 429 problem completely, even when I’m running two copies of the script simultaneously. A side note is that it’s kind to Riot’s servers as well – it’s unlikely you’ll ever get two 429 errors in a row, so they don’t waste processing on requests that’ll just be rejected.

But there are many other kinds of errors to handle in the API. Some are transient and will clear up in a few seconds anyway (500 and 503 are like this). Others will return the same result every time (400, 401, 404, and 422 are like this). For the transient errors I just retry after a short delay; I don’t use exponential backoff there because it isn’t necessary.
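The whole retry policy fits in one small wrapper. This is a sketch, not my actual client: `request` stands in for whatever does the HTTP call and returns a `(status, body)` pair, and `sleep` is injectable so the behavior can be tested without waiting.

```python
import time

TRANSIENT = {500, 503}            # retry after a short fixed delay
PERMANENT = {400, 401, 404, 422}  # same answer every time; don't retry

def fetch_with_backoff(request, base_delay=1.3, factor=10, max_tries=4,
                       sleep=time.sleep):
    """Call request() until it succeeds, backing off 10x on each 429."""
    delay = base_delay
    for _ in range(max_tries):
        status, body = request()
        if status == 429:
            delay *= factor       # 1.3s -> 13s -> 130s -> ...
            sleep(delay)
            continue
        if status in TRANSIENT:
            sleep(base_delay)     # plain retry; no exponential growth
            continue
        if status in PERMANENT:
            return None           # caller should give up on this request
        return body
    raise RuntimeError("gave up after repeated rate limiting")
```

Treating the permanent errors as a `None` result rather than a retry is what keeps a bad match id from stalling the whole crawl.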

Python APIs

I took a look at the existing Python wrappers for Riot API briefly and felt that the documentation was a bit lacking so I didn’t use them. It probably would’ve been better for the community if I’d forked their module and added documentation/etc. At first I was just in a rush to get something working. And now after developing for a while I have basically yet another API wrapper. Sigh.

To be fair though, it doesn’t take a lot of effort to wrap a REST API, so it’s not that much effort wasted. I would’ve needed to spend all that time reading the Riot docs on field values anyway.

Compute and storage

Heroku is handy for running periodic scripts with the scheduler addon; I’m using that for hourly processing. I use MongoDB on MongoLab for storage, which is great for storing JSON. After fixing my design issues I found I was hitting MongoLab’s space quotas.

As a hack, I delete the stats, timelines, runes, and masteries for participants in matches. This helps, but Mongo doesn’t actually free the disk space – it tries to reuse the freed objects – so I don’t get the space back until I run repairDatabase manually.
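The stripping itself is just dropping the bulkiest per-participant keys before saving the document. A sketch on a plain dict (the field names follow the Riot match schema as I recall it; treat them as illustrative, and in practice this would be a Mongo update rather than an in-memory edit):

```python
def strip_participant_bulk(match):
    """Remove the largest per-participant sub-documents from a match dict,
    keeping the fields needed for prediction (champion, team, outcome)."""
    for participant in match.get("participants", []):
        for key in ("stats", "timeline", "runes", "masteries"):
            participant.pop(key, None)  # tolerate already-stripped records
    return match
```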

Even so, I’m running into issues again. MongoLab is migrating all free databases to MongoDB 3.0 at the end of September, which allows compression and should greatly reduce my usage.

Probably before then I’ll have to strip the records down to their bare minimum so that I can keep crawling.

How well does it work?

It works great except that I’m running out of MongoLab storage space. Otherwise I have about 54k players in the database and 83k matches (but only 53% have been looked up fully).

In any case, it’s working well enough that I have a decent data set to use for machine learning and I’ll post my early experiments to that end soon.


13 thoughts on “Predicting League match outcomes: Gathering data”

    1. Before I got a high rate API key from Riot it was slow; my data was limited by my API limit. Once I got the faster key I think doing a full data refresh of ~2 mil matches took 2-3 days. If you’re just trying to get your own matches and people you play with then it should be pretty quick… I think the last time I tried that it took only a minute or so. When I tried for someone that plays a lot like Annie Bot I think it took maybe 20 min lol… when I was testing some new code for player-specific info I had to limit it to just the last year of data so I could process his games.

      1. I tried KDA early on but it wasn’t that useful when I also had their win rate and num games on that champ. I thought it was surprising cause you sometimes win by a lot and other times win by a little and the KDA could show that.

      2. How big was your training data? I’ve gathered a training set of 1,291 matches and a test set of 1,345 matches. I test my logistic model and get 93% on the test set. Can you suggest what the problem is? I don’t believe such high accuracy is real. Is my data set too small?

      3. Depending on the experiment I used either 50,000 matches or 200,000 and depending on the model I got around 65-69% accuracy after all the experimentation.

        93% accuracy sounds high. When I got that high it was because I had features that accidentally included the outcome of the match I was predicting, like the win rate over their total games (even into the future). One time I had 2 features like that and gradient boosting got 99% accuracy which is how I learned that it was wrong.

        For logistic regression, take a look at the weights that it learns. If 1-2 features are very strong then investigate those features.

        It’s also possible that higher accuracy is possible because you’re predicting for a single person. For example on my own games if I play anyone but Morgana I have a much lower win rate. I understand that so I mostly play Morgana but sometimes you find a player that has very high win rate on one champion and very low on all others but doesn’t change their pick strategy to optimize winning.

    2. Pretty interesting! I also include the win rate of a player (by the champion he plays in the game) into my features and the weight that my model learns is also high on it! Can you explain a good reason not to do so?

      1. There’s no reason not to include this feature but you have to be very careful about how to represent it. Like say I play a game on Tuesday with Janna. I should only compute my Janna win rate with games before Tuesday. In the extreme case, suppose I’ve only played two Janna games – one last week and one on Tuesday. If I include Tuesday’s game when I compute my Janna win rate then I’m giving the model a lot of insight into the outcome of the game.

        That’s an extreme case though. I think what I did was to compute the win rate on the champion but subtract the game’s own outcome out of the computation. But then I still had some strange issue I couldn’t figure out that was clearly leaking the outcome of the match, so instead I switched to computing win rate over past games only.
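        As a concrete sketch of what “past games only” means (the tuple layout and function name here are made up for illustration, not from my actual feature code):

```python
def winrate_before(games, champion, cutoff):
    """Win rate on `champion` using only games strictly before `cutoff`.

    games: list of (timestamp, champion, won) tuples. Returns None when
    there is no prior history, so the model can treat "no data" as its
    own signal instead of seeing a fake 0% or 50%.
    """
    past = [won for ts, champ, won in games
            if champ == champion and ts < cutoff]
    if not past:
        return None
    return sum(past) / len(past)
```

        The strict `ts < cutoff` is the whole fix: the game being predicted can never contribute to its own feature.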

  1. The bottom of this post has I think most of my features except the summoner history ones: I’ll try and remember to dig up a more recent list.

    I didn’t find much benefit to the specific champions as features but instead I looked up the global win rates for them in the current patch and each player’s win rate and number of games. I think that list doesn’t have each player’s win rate/num games but I remember that being the most important. Also their current league is important (league in last season didn’t seem useful).

    For the most part I got the feeling that the models mostly learned to spot bad choices like first time on a champ, first time ranked, a few bad champions in certain patches, etc.

    I think when I go back and update that work the mastery points will be really important because it lets you get more reliable information across seasons and across normal/ranked.

  2. “didn’t find much benefit to the specific champions as features but instead I looked up the global win rates for them in the current patch and each player’s win rate and number of games.”

    So with the winrate of the 5 players on a team and the winrate of each of their champions you could predict whether that team would win with over 60% accuracy? Or did you include the winrate of the champions on the opposing team? If you included the winrate of players on both teams, that doesn’t seem very useful to me since in the pregame lobby you can’t see who the players are on the opposing team.

    1. Eventually I got to like 69% accuracy but yeah the way I did it wouldn’t be useful in pregame lobby except potentially for pro play. But for pro play I would’ve had to represent the problem differently cause there’s just not enough data for pro play for each patch.

      1. Could you clarify how it would potentially be useful for pregame in pro play but not regular ranked play? I don’t understand the difference. I ask these questions because I’m trying to make predictions myself and I was wondering if you had any tips, I’m only concerned with Silver =D I had almost 90% accuracy until I got rid of lookahead bias and now I’m at 50% =[
