Tuning dropout for each network size

In the previous post I tested a range of shallow networks from 50 hidden units to 1000. On the smaller dataset (50k rows) additional network complexity hurts: It’s just overfitting. On the larger dataset (200k rows) the additional complexity helps because the amount of data prevents the network from overfitting.

But I learned from the Stanford CNN class that I made a mistake: it’s bad to treat network complexity as a form of regularization; instead it’s better to pick the most complex network you can and tune dropout (and/or L2). I’d amend that advice: it only really applies when you have compute to spare. If the pace of experimentation is limited by runtime, then reducing the network size may be a good way to get both regularization and experimental efficiency.

Just as a reminder this is a part of my ongoing project to predict the winner of ranked matches in League of Legends based on information from champion select and the player histories.

Here’s the graph from last time, showing that added network complexity is harmful on a small dataset. I kept the default of 0.5 dropout.

Shallow NN hidden units on 50k dataset

The hidden layer config is shown on the x-axis as Python tuples; this graph is a series of experiments all with a single hidden layer of varying widths.

Tuning dropout for each network size

I replicated the test but stopped at 600 units; it takes increasingly longer to train the wider networks and I was running several times more tests than before. Let’s start with tuning dropout separately for each hidden layer size:

Scaling number of units with and without dropout tuning on 50k dataset

The hidden layer config is shown on the x-axis as Python tuples; this graph is a series of experiments with different numbers of hidden units in a single hidden layer. This was run with 10-fold cross-validation and the following dropout values were tested: [0.4, 0.5, 0.6, 0.7, 0.8].
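Conceptually the sweep is just a two-dimensional grid, keeping the best dropout value for each width. Here’s a minimal sketch of the bookkeeping, assuming a hypothetical evaluate(hidden_units, dropout) helper that runs the 10-fold cross-validation and returns mean accuracy (the widths listed are illustrative):

dropout_values = [0.4, 0.5, 0.6, 0.7, 0.8]
hidden_sizes = [50, 100, 200, 300, 400, 500, 600]

results = {}
for units in hidden_sizes:
    # report the best dropout setting for each network width
    results[units] = max(evaluate(units, rate) for rate in dropout_values)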

Now we see a different trend: when tuning dropout in conjunction with the network size, the added capacity doesn’t lead to overfitting. If anything it improves accuracy slightly then plateaus.

It’s also interesting that the trend is more consistent, but that could be the result of taking the max over 4 tests. I’m not sure how to assess that independently.

Also note that the best dropout value I tested for 200-600 is 0.8 (the max value I tested). Higher values may have been better but I didn’t have time to test more.

Weight initialization

I replicated these tests with the he_uniform initialization (the default is glorot_uniform). Last time I saw benefits from he_uniform but didn’t test thoroughly.

he_uniform vs glorot_uniform across network size with and without dropout tuning

The dashed lines show the results without tuning dropout – he_uniform is generally an improvement. It isn’t any more consistent than glorot_uniform though (previously I was thinking it might be).

When we tune dropout for each network size (solid lines) they’re almost identical. Looking into the best dropout values per size, the tuned values for he_uniform tend to be lower values of dropout. It still transitions to 0.8 for the larger networks but not until network size 300 in contrast to 200 for glorot_uniform. Another way of looking at it: The default 0.5 dropout is closer to optimal for he_uniform and therefore it fares better when dropout isn’t tuned.

I can only guess at the cause… probably the additional randomness in initialization with the He correction is starting the network off with additional diversity. Dropout itself is forcing diversity so maybe we don’t need to force it as much.

All that said, I’m happy to have learned more: Although He initialization is important for deep networks with ReLU, for shallow networks it’s a minor improvement. But depending on the test it may appear to be an improvement if dropout isn’t tuned because it changes the optimal value of dropout.

 

Switching from deep to wide

In the previous post I found gains by adding a second hidden layer. But I accidentally found even better results with wider networks of a single hidden layer. I’ve done more systematic experimentation and wanted to share. Just as a reminder this is a part of my ongoing project to predict the winner of ranked matches in League of Legends based on information from champion select and the player histories.

On the smaller dataset (50k rows) increasing the size of the network is harmful: it’s just overfitting. This is probably a part of the reason why I originally decided to stay with a small network but I’ll show that this conclusion doesn’t apply to the larger dataset.

Shallow NN hidden units on 50k dataset

The hidden layer config is shown on the x-axis as Python tuples; this graph is a series of experiments all with a single hidden layer of varying widths.

I began to replicate this test on the larger dataset and unfortunately found that my training code doesn’t scale. It works relatively well for networks under 500 hidden units but slows down disproportionately beyond that. Normally the 200k dataset takes 4x longer to train than the 50k dataset. In this case, though, I suspect the full-batch training is what scales nonlinearly, probably because of the 200k x 500 matrix of hidden activations it has to push through at once.

But I learned this the hard way: It took perhaps 12 hours to train all sizes on the 50k dataset with 20-fold cross-validation so I expected around 48 hours for the 200k dataset. After letting it train for an entire week I gave up.

In short I’ve changed my training schedule:

  • Old: 100 epochs in batches of 1024 followed by 100 epochs of full-batch training
  • New: 100 epochs in batches of 1024 followed by 100 epochs in batches of 5000
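In Keras this change is just two successive fit calls with different batch sizes. A minimal sketch, assuming an already-compiled model (nb_epoch is the Keras 1-era spelling of epochs):

model.fit(X, y, nb_epoch=100, batch_size=1024)  # stage 1: small batches
model.fit(X, y, nb_epoch=100, batch_size=5000)  # stage 2: larger batches instead of full batch (batch_size=len(X))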

In the graph below I separated the two training methods but it doesn’t seem to matter much.

Shallow NN configurations on 200k dataset

The trend is unsteady especially considering the standard deviations but it’s reasonably consistent. When I first tested wider networks I tried 250 and 500 units, which got 68.2-68.4% accuracy. Those results were probably just outliers. Even so, the wider networks fare quite well (considerably better than the deeper networks I tried last time).

If there’s one thing I’ve learned it’s that nothing quite goes the way you plan. After starting these experiments I began following along with Stanford’s CS231n class on convolutional neural networks. One of the things Andrej said is directly applicable: rather than tuning the network size to the data, pick the biggest network you can and tune dropout.

It makes sense: Increasing dropout will sort of shift the excess capacity for learning over to ensembling. But I’m also not entirely happy with the added training time of the larger networks and that may slow down my overall experimentation. On the other hand, thinking about it this way is directly analogous to the way you tune regularization for logistic regression so it simplifies code a little. I tested 50 and 250 units on the 50k dataset in conjunction with tuning dropout and found the same accuracy for both network sizes once dropout was tuned for each one. Maybe this weekend I’ll try replicating the network size graph on the 50k dataset in conjunction with dropout tuning.

Another thing I learned from the class is that the Keras defaults suggest an incorrect weight initialization, or at least one that’s incorrect in conjunction with ReLU. The default is glorot_uniform but it should be he_uniform or he_normal. Both kinds of initialization compensate for the diminishing variance of activations as you stack layers, but the Glorot versions are only correct for sigmoid and tanh activations. With ReLU about half of the activations are initially zero, and the He versions compensate in the initialization so that the variance of the activations doesn’t vanish as you go deeper into the network (a problem because vanishing activations mean tiny gradients and tiny initial updates).
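For concreteness, the switch is a one-argument change on each Dense layer. A rough sketch of the kind of shallow network I’m training, using the Keras 1-era init argument (newer versions call it kernel_initializer); the layer sizes here are illustrative:

from keras.models import Sequential
from keras.layers import Dense, Dropout

def build_net(n_features, hidden=75, dropout=0.5, init="he_uniform"):
    # glorot_uniform is the Keras default; he_uniform rescales for ReLU
    # so the activation variance doesn't shrink at initialization
    model = Sequential()
    model.add(Dense(hidden, input_dim=n_features, init=init, activation="relu"))
    model.add(Dropout(dropout))
    model.add(Dense(1, init=init, activation="sigmoid"))
    model.compile(loss="mse", optimizer="adam", metrics=["accuracy"])
    return model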

Neural network initializations over 3 dropout settings and 2 network configurations on 50k dataset

Experiment on 50k dataset of dropout [0.4, 0.5, 0.6] x 3 initializations shown above x 2 network configurations: (50,) and (50, 50)

The truth is that I didn’t expect it to matter at all with shallow networks but it’s a reliable improvement. In some ways switching from glorot_uniform to he_uniform is more important than tuning the dropout param.

Since this test I’ve added the initialization test to many of the configuration experiments on the 200k dataset. Generally he_uniform has a more stable trend from one configuration to the next and is usually better than glorot. I haven’t tested enough configurations to know whether the degradations from Glorot are due to outliers or not.

One brief note: I think the class examples used glorot_normal and he_normal. Not really sure why uniform is better but I’d guess that it has better symmetry-breaking properties.

Next steps

  • Replicate the network size scaling graph in conjunction with tuning the dropout. I’m hoping to see that it’s flat or slightly increasing.
  • Replicate scaling graphs with he_uniform to see if it really is more consistent from one configuration to the next.
  • Get this working on GPU – I spent a few hours but found that it’s challenging to get CUDA installed under Windows, or at least a version of CUDA that Theano can use. I looked into TensorFlow instead but it’s actually even less supported on Windows than Theano.
  • I’ve been focusing too much on tuning the individual models, I think… it’d be great to get back to feature engineering.
  • I’ve been meaning to update the datasets for season 6, which started a few weeks ago. Lots has changed in the game: Champion select is done in a way where you’re more likely to get the role you want. There are tons of champion and item changes. And team queue for 5v5 is gone – now you can queue up as 2-5 people in the same queue as solo but it tries to match the teams or subteams in addition to the players.
    • I’ve been avoiding thinking about this because it’s enough of a change in data that my intuitions from this data may not transfer 100%. And I have to do more engineering work, in particular the player history stats need both season 5 and season 6 information.

Probably I’ll take a break so I can catch up on homework for the Stanford class.

Gains from deep learning

Back from the holidays! I’ve finally made some progress with neural networks, particularly a deep network. This is a part of my ongoing project to predict the winner of ranked matches in League of Legends based on information from champion select and the player histories. Previously I’d been working on ensemble methods, but concluded that they’re more of a last-mile improvement.

First off, what is deep learning? It’s any neural network with 2 or more hidden layers.

Deep neural networks in conjunction with convolutional layers are the reason machine learning has made so much progress on face and object detection. They’re also responsible for large improvements in speech recognition over the past decade. More recently language modeling has seen large improvements due to recurrent neural networks. (2) That said there’s a ton of variation in deep learning so we use the underwhelming definition “any neural network with 2 or more hidden layers”.

What changed for my problem? I revisited an assumption I had: that subsequent layers model higher levels of abstraction, so it makes sense to reduce the number of units in deeper hidden layers. I had tried a network that was 56 inputs -> 75 hidden units -> 5 hidden units -> 1 output. It was consistently worse than 56 inputs -> 75 hidden -> 1 output. At the time I concluded that deep learning wasn’t applicable to my problem. After all, neural networks don’t dominate all kinds of machine learning problems.

Why revisit my assumption? I was reflecting on my future over the holidays and wanted more practice with deep networks; they’re taking over industry and I keep thinking “I wish I could do that!” Had I been busy with something else or not been reflecting I may have never tried this.

Deep networks

As usual I started development on the 50k dataset because it’s fast. I did a quick first test then expanded my search to include the following configurations. I’m only listing the configuration of the hidden units because the input and output layer sizes are determined by my feature matrix (55 inputs, 1 output).

  • 75 hidden: 66.49% +/- 0.65%
  • 75 hidden -> 75 hidden: 66.34% +/- 0.56%
  • 30 hidden -> 30 hidden: 66.19% +/- 0.55%
  • 20 hidden -> 20 hidden: 66.04% +/- 0.54%
  • 75 hidden -> 5 hidden: 65.99% +/- 0.53%
  • 10 hidden -> 10 hidden: 65.98% +/- 0.61%

The network with two hidden layers of 75 units each does surprisingly well, enough to make me question my previous judgment on a network of 75 and 5.

I was surprised that the smaller networks didn’t do so well; my thought process was that a deeper network with a similar number of parameters might have an easier time learning. The 30->30 network has 2,610 weights in the model compared to 4,256 for the 75 network. In contrast the 75->75 network has 9,900 parameters. (1)
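Those counts are just products of consecutive layer widths (56 inputs, ignoring bias terms); a quick helper to check the arithmetic:

def weight_count(layer_sizes):
    # weights only, no bias terms
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

weight_count([56, 30, 30, 1])   # 2,610 for the 30 -> 30 network
weight_count([56, 75, 75, 1])   # 9,900 for the 75 -> 75 network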

I ran some of these configurations on the 200k dataset and finally achieved gains over the 75 unit network!

Accuracy vs network configuration

That’s a great improvement over the previous best neural network on this dataset. Furthermore it’s competitive with gradient boosting which tends to get around 67.8%-67.9% accuracy.

Additional experiments

I tried additional network configurations: 100 -> 100, 100 -> 75, and 75 -> 75 -> 75. The results seem slightly improved for the 100 -> 100 and 100 -> 75 networks and much worse for 75 -> 75 -> 75.

I tuned the dropout parameter in the range 0.4-0.6 and then again 0.4-0.51 (default 0.5) on the 75->75 network. The trend isn’t a smooth curve so I’m hesitant to draw conclusions. The sheer amount of variation worries me. If nothing else, higher values seem to be worse.

Accuracy vs dropout

Based on a tip from Yann LeCun I tried setting the error function to binary cross entropy rather than mean squared error (MSE). The results were wholly worse: There was no configuration in which it was better. I tested this in conjunction with dropout experiments and found that the models trained with MSE had accuracy in the range 67.4-68.1. Binary cross entropy training had accuracy 66.9-67.2.
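For anyone replicating this, the switch is a single argument at compile time (both loss names are standard Keras spellings; everything else stays the same):

model.compile(loss="mse", optimizer="adam", metrics=["accuracy"])                  # what I'd been using
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])  # the suggested alternative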

I increased the number of training epochs, thinking that deeper networks need more time to converge. Previously I was running 100 epochs in batches of 1024 followed by 100 epochs of full batch training. I also tried 200/200 and 300/300. Generally the 200/200 runs were an improvement and the 300/300 runs were worse.

Accuracy vs training iterations

Batch spec ((100, 1024), (100, -1)) means 100 epochs in batches of 1024 followed by 100 epochs in batches of max size (full batch).

It seems that iterating more on a deep network is just overfitting. So I tried early stopping based on 10% held-out validation accuracy with a patience of 20 epochs: if accuracy doesn’t improve for 20 epochs, training stops. In theory this should fix any issues with the 300/300 epoch training but it didn’t; early stopping was generally worse overall. I found this surprising because I’ve heard Geoff Hinton say that early stopping is a sort of free lunch – you’re speeding up training and preventing overfitting at the same time.
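The early stopping setup is roughly the following (a sketch, assuming an already-compiled model; argument names shift a little between Keras versions):

from keras.callbacks import EarlyStopping

stopper = EarlyStopping(monitor="val_acc", patience=20, verbose=1)
model.fit(X, y, nb_epoch=300, batch_size=1024,
          validation_split=0.1, callbacks=[stopper])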

Accuracy with and without early stopping

During the experiments I came to realize that due to randomness in initialization and training (dropout) there will always be some outliers in a batch of results. If I run 20 tests of the same model there’s a good chance one or two will have top-notch accuracy and one or two will perform horribly. So I’ve become more skeptical of results and I’m recording not just accuracy but standard deviation of the cross validation more often. I’m also tending to look more at the distribution of results while tuning other hyperparameters (the graphs above are like this).

As I was writing this up I came to see that the shallow network with 100 hidden units was a little better than the 75 unit network. It’s funny because I used to use 100 units normally and reduced to 75 to speed up experimentation. I tried a shallow network with 250 hidden units and found that it got 68.42% (best ever) so now I’m going through much more thorough testing of wider shallow networks.

Odds and ends

Before the holidays I ran a few more tests with ensembles. First I tried encouraging diversity in the ensemble: my thinking was to tweak the settings so that the individual classifiers would each overfit. So I reduced the min-samples-split setting in gradient boosting and reduced dropout in the neural network. It didn’t help though.

I also tried adding logistic regression and random forests as components in the stacked ensemble. When testing on the 50k dataset, logistic regression improved the accuracy of the ensemble to get best results. Adding random forests to the ensemble hurt performance. Funny enough, when stacking with linear regression it assigns random forests a negative weight. I’m amazed that they’re that bad.

Before I’d gotten into deep networks I revisited the activation function. In Keras you can pick ReLU, sigmoid, or tanh. I picked ReLU after initially finding it vastly better than sigmoid. In retesting I found that ReLU was vastly superior to both sigmoid and tanh for the hidden units. The results were pretty close for the output unit.

Thoughts

Stepping away from the problem and coming back to it gave me a fresh perspective and helped make progress, especially by retesting assumptions.

Next steps

  • Thorough evaluation of shallow networks. Tests of hidden layer size are in progress.
  • Tune the learning rate. I haven’t touched the default learning rate at all.
  • Revisit maxout now that I have deeper networks

Notes

  1. Due to dropout maybe I should think of each layer as about half the size?
  2. I’m grossly simplifying. Probably none of the progress would’ve happened without GPU-accelerated linear algebra. Or without better learning techniques for deep networks (rprop/rmsprop, adam, tweaks on sgd). Or without the massive increase in training data (NNs not as useful for small data). It’s tough to choose but I’d also include dropout, maxout, LSTM, GRU and ReLU as part of the neural network renaissance.

Ensembles part 2

I’ve been using ensembles of my best classifiers to slightly improve accuracy at predicting League of Legends winners. Previously I tried scikit-learn’s VotingClassifier and also experimented with probability calibration.

Since then I’ve followed up on two ideas: 1) tuning classifier weights with VotingClassifier and 2) making an ensemble that combines individual classifier probabilities using logistic regression (and other methods!).

Tuning classifier weights

The best ensemble combines gradient boosting trees and a neural network. VotingClassifier allows for weights on the individual classifiers: By default they’re evenly weighted but you can skew the voting towards one classifier. Gradient boosting is a stronger individual classifier so it should probably get more weight than the neural network.

I evaluated several weights with 10-fold cross validation. (1) Side note: This time I’m trying out Plotly because it can make charts with error bars. I’m just showing plus or minus one standard deviation on the error bars.

Prediction accuracy vs ensemble weights

The default weights are 0.5 for each classifier.

Roughly there’s a hump around 0.8 weight on gradient boosting but it’s very rough. Almost all results are within a standard deviation of each other.

Generally I’ll stick with 0.8 weight for now. One note of caution: I’m reporting results on my overall evaluation so I’m cheating the evaluation a little in picking 0.8.

Stacking: Classifier to combine classifiers

Tuning the classifier weight bothers me. I’m tuning this equation by hand:

ensemble_prob(X) = prob_1(X) * weight + prob_2(X) * (1 - weight)

But linear regression is designed for this. I could use logistic regression to take the individual classifier probabilities as features and output the class label.

This approach is called stacking and has been used to achieve best results in several machine learning competitions. MLWave has a great section on stacking in their ensemble guide.

Stacking is a little more complex than it sounds because the output of the individual classifiers is different on training data vs unseen data. For stacking to work, the combiner classifier needs probabilities on unseen data. So to get it all to work, you need to subdivide the training data with another layer of cross-validation. (3)

Fortunately I have scikit-learn’s CalibratedClassifierCV as a guide: It uses cross-validation on the training data to learn the calibration. I created StackedEnsembleClassifier (github link) which accepts a list of base classifiers and a combiner. Regression and classification are handled differently in scikit-learn so I have a separate class to use linear regression for combination. Both of them run 3-fold cross-validation behind the scenes just like CalibratedClassifierCV.
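Stripped down, the idea behind the stacked classifier looks something like this. This is a sketch using the modern scikit-learn API rather than my actual StackedEnsembleClassifier code; base_classifiers, X, and y stand in for the real inputs:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Out-of-fold probabilities from each base model become the features for
# the combiner, so the combiner never sees probabilities from a model
# that was fit on the same rows.
meta_features = np.column_stack([
    cross_val_predict(clf, X, y, cv=3, method="predict_proba")[:, 1]
    for clf in base_classifiers
])
combiner = LogisticRegression().fit(meta_features, y)
# At prediction time each base model is refit on all of X and its
# probabilities on new data are fed through the combiner.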

Adding another layer of cross-validation slows training so I’ve focused on the 50k dataset. To try and compensate for higher variation in the small dataset I’ve run these tests with 20-fold cross-validation rather than the usual 5 or 10.

Base classifiers and ensembles on 50k dataset

Linear regression is surprisingly good as a combiner. The weights are around 0.35-0.45 on gradient boosting at this data size and the two model weights sum to about 1. For this dataset the manually-tuned weighting is fairly bad, barely improving over the individual classifiers.

Logistic regression and neural networks are also quite good. One thing with the neural network is that it’s much better with sigmoid than ReLU (probably because sigmoid is smoother).

Random forests don’t work well. It could be because they’re so poorly calibrated. I tried a second implementation in which I use the predict method of the combiner rather than predict_proba but that doesn’t do well with random forests either. I’ve seen other researchers use them successfully as combiners so I may just need to add hyperparameter tuning.

I also ran limited tests on the 200k dataset with 10-fold cross-validation:

Accuracy on 200k dataset

First off, gradient boosting has huge variation from one run to the next (even though it’s an average of 10 cross-validation runs and the cross-validation has a fixed random seed). My first run had learning rate 0.2167 but I’ve been using 0.221 for the ensemble. Usually I’ve seen it around 67.8% accurate.

The ensemble is somewhat helpful with 200k data points but not nearly as much as the 50k dataset. It’s plausible that I’ve tuned my hyperparameters for the individual classifiers too much on the 200k data and they need to overfit more for the ensemble to be effective. (2) It’s also plausible that I’m at the limit of predictability with this data.

Another interesting find is that linear regression picks weights around 0.5-0.6 on gradient boosting in contrast with the hand-tuned 0.8. Also I found that fit_intercept=False is slightly better.

I found that increasing the number of folds in the nested cross-validation for the ensemble is slightly helpful. But it increased runtime multiplicatively so 3 folds is enough.

The stacked ensembles tended to have lower standard deviation than the manually weighted ensemble. In particular, the logistic regression combiner tends to have lower standard deviation (i.e., it’s more dependable). That said, we should expect the stacked ensembles to have lower standard deviation because we’re effectively averaging over 2 base classifiers times 3 folds rather than just 2 base classifiers. In other words, the cross-validation used to train the combiner smooths out some of the randomness in the individual models.

Although the runtime is higher, I feel much more comfortable with linear regression and logistic regression stacking than manual weighting of the ensemble. Partly it’s better science but they’re also more reliable.

Next steps

I don’t think focusing on better ensembles will get me much gain but I suspect that what I’ve learned will be useful. There are a few little ideas I didn’t investigate but that I’d like to someday:

  • It’d be great to merge the linear regression code in with the general stacked classifier.
  • Diversity among the ensemble members is important and I’d like to explore some automation to encourage diversity, possibly a grid search over the ensemble member hyperparams much like VotingClassifier. There might be an efficient way to do this if I have a larger ensemble (say 10 variations) and do a single training/testing pass to select a diverse subset to then use in the real ensemble. Or I could just tweak the parameters to overfit somewhat.
    • Still need a way to have ensemble members use different subsets of the features.
  • Training time is too slow. Maybe there’s a way to mimic LogisticRegressionCV for other classifiers by designing a sort of warm_start for each classifier.

Notes

  1. Even with 10-fold cross-validation the randomness is just too much. Maybe 20-fold would’ve been smoother. Also note this graph is a composite of 3-4 searches over ensemble weights so that’s why the points aren’t uniformly spread out.
  2. Saying they need to overfit is a bit counterintuitive. Really what I need is to increase the overall diversity of the ensemble. By making nice general individual models I may have forced them to output similar classifications.
  3. I learned the hard way that held-out data is necessary. I was trying to save some effort and just train on probabilities on the training data. But it’s even worse than the worst classifier in the ensemble! In the one test I recorded, training a logistic regression ensemble this way got 65.0% accuracy vs 66.3-66.4% from the individual classifiers.

 

Ensemble notes

I thought probability calibration would be difficult but it’s pretty easy. My ensemble code looks like this:

from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier

estimators = []

# NnWrapper is my scikit-learn wrapper around the Keras model
estimators.append(("nn", NnWrapper(dropout=0.5)))
estimators.append(("gb", GradientBoostingClassifier(n_estimators=300)))

if calibrate:
    estimators = [(n, CalibratedClassifierCV(c)) for n, c in estimators]

model = VotingClassifier(estimators, voting="soft")

The only downside is that training the calibration-wrapped model is now going to run 3-fold cross-validation on the training data to learn the calibration. When predicting on new data it uses the calibrated probabilities from each of the three classifiers and averages them.

I ran several tests on the 200k x 61 dataset and found that calibration helps a tiny bit. But I also found that the results fluctuate much more than my non-ensemble classifiers even though I’m testing the ensemble in 5-fold cross-validation. (1)

Here are the raw numbers over 4 tests with and without calibration. Each of those tests is a 5-fold cross-validation run so the standard deviations are showing the variation between the different folds.

           Without calibration       With calibration
           Accuracy    Std dev       Accuracy    Std dev
Run 1      67.93       0.12          68.11       0.48
Run 2      67.77       0.14          67.87       0.19
Run 3      67.86       0.18          67.89       0.19
Run 4      67.96       0.15          67.89       0.16
Average    67.88                     67.94
Spread     0.19                      0.24
Runtime    51.7 min                  95 min

My first run with calibration was amazing: 68.1% accuracy is the best I’ve ever seen on this data! But it was a fluke. Also the standard deviation is very high, which likely means that one of the folds was accidentally very easy to predict. Although the average accuracy is higher with calibration it’s almost entirely due to that one outlier.

Calibration about doubles the runtime as well. When I stop to think about it, it’s running 3 training runs over 2/3 of the data instead of 1 run so it should be around double.

One cautionary note about CalibratedClassifierCV: It checks the numpy datatype of the output value and uses stratified cross validation if it’s boolean or int. That’s what I want. Luckily Pandas automatically set my output type to int64. If you’re loading the data without Pandas be warned that numpy’s default datatype is a floating point type which would lead to non-stratified cross validation. Non-stratified cross-validation can sometimes lead to bad evaluations especially if the output classes are sparse.
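A cheap guard if you’re not loading through Pandas (the cast is all it takes):

# Integer labels make CalibratedClassifierCV use stratified cross-validation.
y = y.astype(int)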

Conclusion: Calibration helps a little but it’s not worth the additional runtime.

Isotonic calibration

The default calibration is labeled as sigmoid but is really just feeding the output value of the base classifier into logistic regression. This is also called Platt scaling.

Isotonic calibration only uses the ranking of the probability value compared to other output probabilities. It’s a more capable learner but prone to overfitting on small data sets.
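Switching methods is a single argument on CalibratedClassifierCV (base_clf here is whatever classifier you’re wrapping):

from sklearn.calibration import CalibratedClassifierCV

platt = CalibratedClassifierCV(base_clf, method="sigmoid")      # the default: Platt scaling
isotonic = CalibratedClassifierCV(base_clf, method="isotonic")  # rank-based, wants more data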

I’ve only done one test with isotonic calibration so far and the accuracy was about the same but the standard deviation of accuracy across the folds was halved.

Conclusion: Possible minor benefit from isotonic calibration. Need more tests.

Hard voting

I’ve been using soft voting, which averages the probabilities of the individual classifiers. When you only have two good models it’s the only real option, since two hard votes can tie. I briefly tried including logistic regression in the soft voting ensemble but found that it was harmful.

However, I didn’t try hard voting. Not all classifiers provide probabilities so hard voting is sometimes necessary. Another advantage is that I don’t have to consider calibration.

Model Accuracy on 200k x 61, 10-fold
Hard voting: Neural network + Gradient boosting + Logistic regression 67.57%
Soft voting: Neural network + Gradient boosting + Logistic regression 67.53%
Soft voting: Neural network + Gradient boosting (previous best) 67.88%

Hard and soft voting with three classifiers are well within a standard deviation of each other. But they’re well beyond a standard deviation lower than soft voting with the two best individual models.

I had a feeling hard voting wouldn’t do anything magical but was curious if maybe there’s some way I can use my less accurate models to improve the overall ensemble accuracy. It may be useful to use logistic regression in a soft-voting ensemble but I’d need to tune weights for each classifier.

Conclusion: Not much different from soft voting except that it doesn’t work with weighting.

Thoughts and next steps

Overall I get the feeling that ensembles can get me that last fraction of a percent on a problem. In a competitive environment like Kaggle that can be the difference between top 50 and top 5. Also when you’ve run out of ideas for feature engineering it’s another option for that last little bit of improvement.

Assorted ideas for further improvement:

  • Run a grid search over weights in the 2-model soft voting ensemble
  • Use the code for CalibratedClassifierCV as a guide to use logistic regression to combine models. I tried something like this before but didn’t correctly train it like how CalibratedClassifierCV is implemented.
  • Set up the ensemble so that different models can use different feature subsets. I’d like to have a model that uses the list of champions on each side but most models overfit that data (logistic regression is the exception). I think probably I need to migrate my code to scikit-learn pipelines to have an ensemble with different feature subsets.

Notes

  1. Clarification just in case: The evaluation is using 5-fold cross validation but inside each of those 5 folds CalibratedClassifierCV is splitting the training set into 3 folds.

scikit-learn 0.17 is out!

Scikit-learn 0.17 adds features and improvements that might help me:

  • stochastic average gradient solver for logistic regression is faster on big data sets
  • speed and memory enhancements in several classes
  • ensemble classifier that supports hard and soft voting as well as hyperparameter tuning of the components in grid search
  • robust feature scaler does standard scaling but excludes outliers from the standard range

The full changelog is here. I’ve been testing the changes to see how they’ll impact my work in predicting match winners in League of Legends.

Stochastic average gradient

Like lbfgs and newton-cg, sag supports warm_start so it works well in conjunction with LogisticRegressionCV to tune the regularization parameter.
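The comparison is just a solver swap on LogisticRegressionCV (a sketch; X_scaled and y stand in for the usual feature matrix and labels):

from sklearn.linear_model import LogisticRegressionCV

clf = LogisticRegressionCV(Cs=10, cv=5, solver="sag", n_jobs=3)
clf.fit(X_scaled, y)   # sag converges much faster on scaled features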

First I tried it on the 200k match dataset with 61 features. I repeated the tests to make sure the numbers were stable.

Solver Training time Accuracy
lbfgs 0.5 min 66.59%
lbfgs 0.5 min 66.59%
sag 1.7 min 66.60%
sag 1.8 min 66.60%
newton-cg 2.4 min 66.60%
newton-cg 2.6 min 66.60%

sag is faster than newton-cg but still about 3x slower than lbfgs. It does eke out that last 0.01% accuracy though.

sag is designed for large data sets so I also tried on the 1.8 mil x 61 dataset:

Solver Training time Accuracy
lbfgs 7.2 min 66.07%
sag 45.8 min 66.07%

It’s over 6x slower and achieves the same accuracy. Maybe sag’s benefit really shines on datasets with a large number of features: sklearn team testing used 500 features and 47k features.

Conclusion: Staying with lbfgs.

Other performance

The patch notes briefly mentioned speed and memory improvements in random forests and gradient boosting.

RandomForestClassifier

Tried this on the 200k x 61 dataset:

Version Training time Accuracy
Random Forest 0.16.1 14.1 min 66.34%
Random Forest 0.17 13.7 min 66.39%

The training time and accuracy fluctuations could just be differences due to randomization; random forests tend to fluctuate more than other methods from test to test. In the worst case, it doesn’t seem that much has changed. In the best case there are slight improvements.

Conclusion: Random forest is about the same, but I didn’t test memory usage.

GradientBoostingClassifier

Gradient boosting trains much more slowly than other methods so I started on the 50k x 61 dataset. I ran some tests multiple times to be certain of the results.

Version Training time Accuracy
Gradient Boosting 0.16.1 7.6 min 66.08%
Gradient Boosting 0.16.1 with feature scaling 8.8 min 66.10%
Gradient Boosting 0.17 11.0 min 66.17%
Gradient Boosting 0.17 11.6 min 66.34%
Gradient Boosting 0.17 with feature scaling 11.7 min 66.14%
Gradient Boosting 0.17 with feature scaling 11.7 min 66.17%
Gradient Boosting 0.17 presort=False 14.0 min 65.94%
Gradient Boosting 0.17 max_features=auto 11.2 min 66.19%

Gradient boosting is clearly slower in 0.17 and generally a tad more accurate. The default presort setting is good for runtime and accuracy. Feature scaling doesn’t really help. Adjusting the max_features setting seems to help a touch (should reduce variance and improve training time).

I also tested on the 200k x 61 data:

Version Training time Accuracy
Gradient Boosting 0.16.1 43.9 min 67.66%
Gradient Boosting 0.17 62.3 min 67.75%

Again it’s slower but more accurate. I’ve opened a ticket and right now it’s under investigation. It sounds like a change in the error computation may be the culprit.

Conclusion: Gradient boosting 45% slower but a little more accurate, fix is being investigated.

VotingClassifier

In the previous post I described possible directions to get from 67.9% accuracy up to 70.0% and suggested that an ensemble of the best classifiers may be a fruitful direction but may take a bit of time to code.

Well, two things changed. First off, I found a great guide on making an ensemble in scikit-learn. I implemented a simple ensemble and improved my best results from 67.9% accuracy to 68.0% accuracy by a soft-voting ensemble of gradient boosting and neural networks. It’s not as much as I expected but it’s progress.

The second change is that scikit-learn 0.17 added VotingClassifier, implemented by Sebastian Raschka (who wrote the guide and implementation I found earlier). I ported my ensemble code to scikit-learn and it works great (though I had to change my neural network wrapper to return two columns rather than one for binary classification).

That said, I wish it had a flag to perform calibration of the probabilities of the individual classifiers. I’m currently looking into calibrating but not finding that it helps; gradient boosting has more skewed probabilities than neural networks which leads to more weight on gradient boosting. That’s an unintentionally good decision: putting more weight on the stronger classifier.

Conclusion: VotingClassifier is easy and works like a charm.

RobustScaler

In general using the robust scaler seems like an easy solution to save time in preprocessing your data.

I tried it with logistic regression because it’s so sensitive to feature scaling. But after several tests I didn’t find any difference in either scaling+training time or accuracy.

Thoughts

I bolded the main point of each section so I won’t summarize. But I like the direction scikit-learn is taking.

Better predictions for League matches

I’m predicting the winner of League of Legends ranked games with machine learning. The models look at player histories, champions picked, player ranks, blue vs red side, solo vs team queue, etc. The last time I wrote about accuracy improvements my best was 61.2% accuracy with gradient boosting trees.

Since then I’ve increased the amount of data from about 45,000 matches to 1.8 million matches. I’ve done analysis and the trends are much more reliable.

Experiments with 1.8 million matches are slow so I usually use 200k and sometimes 50k to test code. Almost always the trends in 200k are the same as 1.8 million but they run in minutes or hours compared to hours or days.

I keep a spreadsheet with the outcome of each experiment and notes that indicate the model used, features used, and any other tweaks. This graph shows the progress since the last post.

Accuracy improvements Sep Oct 2015

The graph fluctuates so much because I sometimes test ideas on 50,000 matches. The models are worse but it allows rapid testing.

I also test multiple model types. It smooths out towards the end because I wasn’t experimenting as much with weaker models. On this particular problem, gradient boosting trees and neural networks are clearly stronger than logistic regression and random forests.

Feature engineering

Most of my progress came from feature engineering: getting from the starting point of 61.2% accuracy to 67.2%. I also ran the most experiments in this area: around 80 of 120 experiments.

The initial drop in accuracy was the result of adding additional matches without fully updating the database of player histories, so the players’ ranked history stats were available for a smaller percentage of the data. Accuracy dropped to 58% despite having 1.8 million matches.

After running a full crawl of all player ranked stats the problem was solved and gradient boosting trees were up to 62.3%. That’s 1% higher than the previous best just from having more data.

Previously I had dropped features with low weights in the models because they made the data more sparse; with a small dataset you can get small but real improvements by dropping them. There wasn’t anything particularly special about these features; they’re just mins and maxes of individual player features by team. Adding them back increased the feature count from 41 to 53 and brought accuracy gains: gradient boosting improved from 62.3% to 62.4% and logistic regression improved from 60.7% to 63.3%, which is the new best.

The next big change was looking up each player’s current rank (e.g., Silver 1, 50 LP). In previous experiments I only used their ending rank from the previous season because it’s easy to access. I had to write a new crawler and let it run over a weekend to fetch every player’s current league, division, and points. I converted that to a single numeric value and added features for the average rank of each side and a diff to make them easier to compare (1).

This was very successful, achieving 65.8% accuracy with logistic regression and 67.1% with gradient boosting trees.
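The conversion itself is simple in code; a rough sketch of the kind of mapping I mean (the exact scale is illustrative, and blue_players / red_players stand in for per-player (tier, division, LP) tuples):

TIERS = ["BRONZE", "SILVER", "GOLD", "PLATINUM", "DIAMOND", "MASTER", "CHALLENGER"]

def rank_to_number(tier, division, league_points):
    # e.g. Silver 1, 50 LP; divisions run from V (lowest) to I (highest)
    return TIERS.index(tier) * 500 + (5 - division) * 100 + league_points

blue_avg = sum(rank_to_number(*p) for p in blue_players) / 5.0
red_avg = sum(rank_to_number(*p) for p in red_players) / 5.0
rank_diff = blue_avg - red_avg   # the diff feature that helps the tree models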

I also extracted the role each person played within the team. A standard team comp has the following five roles: solo top lane, solo mid lane, solo jungle, bottom/marksman, and bottom/support. With these assignments we can compare the player on each side of the matchup: do we expect blue side or red side to win in the mid lane? What about jungle? And so on.

Unfortunately the data is extremely messy for the lane/role assignments. I put a lot of effort into making sensible default values but it still needs more work (5).

After all that work though, logistic regression and gradient boosting trees only improved by 0.1%.

I also revisited indicator features for the champions played on each side and the summoner spells selected. This increases the number of features from 61 to 331 and usually makes the models overfit. I only ran a small number of tests with logistic regression but found on the 200k dataset that it improved accuracy from 66.3% to 66.5%.

Neural networks

Feature engineering is good but once you run out of ideas it’s good to try different model types and hyperparameter tuning. I’d been meaning to use neural networks but they aren’t supported in scikit-learn (2).

After surveying Python neural network libraries I found that almost all of them use Theano on the backend sort of like how scikit-learn uses numpy. I really didn’t want to write the NN code manually in Theano. Lasagne provides shorthand functions to create network layers but doesn’t help with the optimization. Keras is much closer to scikit-learn in that it provides an easier interface and you pick from multiple optimization methods. And I don’t need to understand Theano at all to use it. So I went with Keras.

At first I tried replicating logistic regression as a sanity test. A neural network with sigmoid activation and no hidden units is actually just logistic regression. Unfortunately I couldn’t get similar accuracy to scikit-learn logistic regression no matter what I tried. I got 62.2% in Keras vs 66-67% in scikit-learn logistic regression. When I tried using the ReLU activation function instead of sigmoid I could get 65%. I also tried tanh but that got 64.3%. I never figured out why I couldn’t reproduce it and moved on. (3)

Neural networks are sensitive to hyperparameters just like most other algorithms. But you could say they have many more hyperparameters: the number of layers, layer widths, regularization, dropout, maxout, maxnorm, optimizer algorithm, optimizer settings, activation functions, and so on. The easiest way to test would be scikit-learn’s GridSearchCV or RandomizedSearchCV. Unfortunately Keras models aren’t compatible, so you need to write a scikit-learn class that wraps the Keras model (code here). I made many mistakes doing it before finding this note. It’s pretty janky; it uses introspection to decide which members of your class are hyperparameters for get_params and set_params (hint: anything ending in underscore is excluded). Keras has a scikit-learn wrapper too but it doesn’t look like you can run a grid search over layer sizes with it.
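If you’re writing your own wrapper, the skeleton is small. A minimal sketch (not my actual wrapper; build_keras_model is a hypothetical helper that builds and compiles the network):

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin

class KerasWrapper(BaseEstimator, ClassifierMixin):
    # BaseEstimator's introspection-based get_params/set_params pick up
    # anything assigned in __init__, which is what GridSearchCV needs.
    def __init__(self, hidden_units=75, dropout=0.5, init="glorot_uniform"):
        self.hidden_units = hidden_units
        self.dropout = dropout
        self.init = init

    def fit(self, X, y):
        self.model_ = build_keras_model(X.shape[1], self.hidden_units,
                                        self.dropout, self.init)
        self.model_.fit(X, y, nb_epoch=100, batch_size=1024, verbose=0)
        return self

    def predict_proba(self, X):
        p = self.model_.predict(X).ravel()
        return np.column_stack([1 - p, p])   # two columns for binary classification

    def predict(self, X):
        return (self.model_.predict(X).ravel() > 0.5).astype(int)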

Ranting aside, I can run grid searches over different network configurations, dropout settings, activation functions, etc.

I was able to get to 67.4% accuracy (new best) after two days with Keras using a neural network with one hidden layer of 75-100 units, 0.5 dropout, and ReLU activation function. Since then I’ve tried tons of experiments which I’ll summarize:

  • Adding a second small hidden layer is harmful. I was comparing 61 input -> 75 hidden -> 1 output vs 61 -> 75 -> 5 -> 1. From watching the run it’s fitting the training data much better but generalizing much worse. I couldn’t find any dropout settings that would compensate for the overfitting enough.
  • Maxnorm: Literature shows that it’s helpful in conjunction with dropout but I didn’t get any gains at all.
  • Maxout: Literature shows bigger gains than maxnorm in conjunction with dropout but I could only just barely get it to have the same score by ensuring that it wasn’t increasing the number of parameters in the model (i.e., for maxout 2, use half as many hidden units)
  • 0.5 dropout seems best. When I increased features from 61 to 331 the best dropout was like 0.8 (which is more or less allowing it to ignore all the extra features). I only put the dropout layer after the hidden layer. I may have tried a dropout layer after inputs but found it didn’t help.
  • I found best results by doing mini batch with about 1000 matches per batch followed by full-batch training.
  • Used the Adam optimizer because I saw a talk that said it’s the best/easiest choice.
  • I tried adding a GaussianNoise layer on the input with 0.1 noise but it was slightly harmful.
  • I ran many experiments on early stopping not shown in the graph but was unable to find any settings that gave the same accuracy. However, I did learn that Keras early stopping can only use val_loss or val_acc; it can’t stop on training loss. Also, the code works incorrectly if stopping on accuracy; it stops if the minimum doesn’t decrease for the specified number of epochs but with accuracy we want the max to increase.
    • To get reasonable results I had to set the patience value pretty high (20 epochs).
    • The best I could do with early stopping was 67.3% accuracy in 3.6 minutes vs 67.45% accuracy in 7.0 minutes without it. So about a factor of 2 faster but not as accurate. Probably good for quick tests though.
    • For accuracy stopping I tried modifying the Keras class to handle accuracy correctly but it didn’t seem to work well (maybe stopping on accuracy is fundamentally bad).

After all that tuning I couldn’t beat 67.4% on the 200k dataset and couldn’t beat 67.0% on 1.8m.

Gradient boosting with init

Gradient boosting decision trees start from a base classifier and correct the errors in the model with each new tree. It starts from predicting the majority class by default but you can supply a full estimator to use as default.

I tried a test with using logistic regression as the base and found that it helped slightly (67.1% -> 67.3%). It was a huge pain though due to lack of documentation but Stack Overflow saved me.
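For reference, recent scikit-learn versions accept any estimator with fit and predict_proba as the init argument directly, so the hookup looks roughly like this (back then it needed the Stack Overflow workaround):

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

# Boosting starts from logistic regression's predictions instead of the class prior
gb = GradientBoostingClassifier(n_estimators=300, learning_rate=0.22,
                                init=LogisticRegression())
gb.fit(X, y)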

The only issue is that it doesn’t seem to work with multiple threads so mostly I don’t use it. But it seems less sensitive to hyperparameter tuning.

Model types: odds and ends

I tried support vector machines briefly and learned why nobody uses them anymore: O(n^2) runtime so they just don’t scale. I’m sure there are ways around it. You could run the kernel on log n carefully selected points but scikit-learn doesn’t have that.

I also tried elastic net, which is logistic regression with both L1 and L2 norms. I vaguely remember benefits with this for maxent language models. But it didn’t improve over L2 logistic regression and the implementation in scikit-learn was harder to use in cross-validation.

Revisiting hyperparameter tuning

I hadn’t tuned the parameters of gradient boosting trees like I had for neural networks because they’re slow. But I’ve been going back through gradient boosting and random forests to reassess the hyperparams. My goal is to improve runtime and hopefully improve accuracy a little.

Gradient boosting trees

My previous hyperparameters for gradient boosting trees were very suboptimal. After tuning I improved from 67.1% to 67.9% (best results yet). The important settings were the learning rate and the number of trees. I’d set the learning rate to 0.9 (a very poor choice) and the number of trees to 100 to match the random forests. The best settings were around 0.2-0.25 learning rate and 300 trees. Possibly I’d picked the 0.9 learning rate from hyperparameter tuning back when my dataset was leaking test info and I was getting 90-100% accuracy. The default learning rate of 0.1 was also poor.

I also found tiny gains from subsample 0.9, which helps reduce overfitting. I tried subsample at 0.5 and 0.75 but those were awful. Subsample 0.9 should also speed up training slightly so I’m using that now. I also found small gains by tuning min_samples_leaf down to 10 from 20. (4)
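Putting the tuned settings together (these are all real GradientBoostingClassifier parameters; the values are what worked on my data, not a general recommendation):

from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier(
    learning_rate=0.22,     # best was around 0.2-0.25; the 0.9 I had and the 0.1 default were both worse
    n_estimators=300,       # up from 100
    subsample=0.9,          # tiny accuracy gain, slightly faster training
    min_samples_leaf=10,    # down from 20
)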

Random forests

I also tried tuning with random forests and found that I was using too few trees so upped that from 100 to 150. I was hoping to find the “elbow” in the graph of accuracy vs number of trees but it’s smoother than I’d like:

Accuracy vs number of trees in random forest

I have no idea why there’s a blip at 100.

I also tried re-tuning min_samples_leaf and min_samples_split; higher values reduce overfitting and speed up training. I didn’t see much gain. I’m using min_samples_leaf 7 and min_samples_split 50.

Conclusions

I re-trained and tested the best settings on 200k matches with 5-fold cross-validation with 3 threads:

Accuracy (200k) Training time
Gradient boosting trees 67.7% 43.9 min
Neural networks 67.4% 7.3 min
Logistic regression 66.6% 0.6 min
Random forests 66.3% 14.1 min

I’m not sure why gradient boosting lost 0.2% from previous runs but the hybrid with logistic regression gets 67.9% (not listed above) so I’m not too worried.

I might have to try a scikit-learn wrapper for xgboost. I’ve seen Kagglers have better success with xgboost than with scikit-learn and it’s supposedly faster.

Below are the best results to date in one table. The runs on 1.8 million matches aren’t necessarily with the same hyperparams as the corresponding 200k tests because it takes so long to rerun.

200k matches 1.8m matches
Gradient boosting trees 67.9% 67.7%
Neural networks 67.4% 67.0%
Logistic regression 66.6% 66.1%
Random forests 66.3% 66.2%

So what’s next? Unfortunately the ranked season ends next week and there are massive overhauls for season 6. Especially with ranked team builder queue, I expect that more players will get their best role so matches will be less predictable. I’d really like to hit 70% accuracy but I’m running out of time before everything changes.

Things that might help:

  • Crawl normal (unranked) game stats. Sometimes a player doesn’t have ranked stats for a champion but has stats from normals that could at least show whether they’re new to the champion or not. This would take a couple days to code and a few full days of crawling. I’d guess 0.1-0.5% gain from this.
  • Ensemble of classifiers. Unfortunately I didn’t see a wrapper for this in scikit-learn so I haven’t tried it yet. Gradient boosting trees and neural networks are learning in a very different manner so they should combine well. I’d guess 0.3-1.0% gain from this though it would be more complex than majority voting.
  • Improve team queue prediction. I’m using the solo queue ranking of the players and adding them up but I should look up the team ranking. I could also start tracking the win rates of pairs of teams, which may capture strength or weakness of team strategy. I’d guess 0.1-0.3% gain from this.
  • Make my own ELO score. I could easily make an ELO score that’s updated as I generate the dataset. This wouldn’t have any data leakage problems like current rank but if I have lots of players that only show up in a few games then it won’t be stable. I’d guess 0-0.2% gain.
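The Elo idea in the last bullet is easy to prototype; a standard Elo update looks like this (the K factor and 400-point scale are the usual chess-style defaults, not something I’ve tuned):

def expected_score(rating_a, rating_b):
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update_elo(rating_a, rating_b, a_won, k=24):
    delta = k * ((1.0 if a_won else 0.0) - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta

# A team's rating could simply be the average of its five players' ratings,
# updated in chronological order as the dataset is generated.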

Extra: Predictability tests

It’s good to understand where your models do well and poorly. I looked at this before but there weren’t too many interesting trends. I’ve changed my tests in three ways: logistic regression instead of random forests (for speed), the full 1.8 million matches instead of 44k, and using the players’ current league/rank rather than their previous-season rank to tell the level of the match.

Predictability by league

The black line is the overall average and the thin blue lines show plus or minus one standard deviation around the main blue line. Generally lower leagues are much more predictable. There isn’t a statistically significant difference between silver and gold or between master and challenger, but the differences between bronze and silver, gold and platinum, platinum and diamond, and diamond and master are all significant.

It’s probably because lower-rank players make more mistakes in the pick/ban phase that they haven’t yet learned to avoid. Also, high-level players are more capable of playing all roles reasonably well, whereas lower-rank players might play only one role well.

Predictability by version

This shows the prediction accuracy of the model by game version. I removed all game versions with a low number of matches. The standard deviation for most versions is around 0.2%.

The trend worries me. It’s completely unlike what I saw on older data (basically flat). It likely means that the features for current league are partially revealing the outcome of previous matches. Unfortunately the Riot API doesn’t provide a player’s historical ranking so it’s not possible to look up their ranking at the time they played a past game. I could drop the feature but it’s useful. Or I could recrawl each player’s league every day but I don’t have enough cloud storage to store that.

Notes

(1) Logistic regression has no trouble comparing two features but it’s more effective in tree-based methods to add a diff feature.

(2) This recently changed.

(3) I should have tried an absurdly high number of iterations. Structurally the models are the same and I set the regularization and feature scaling the same but the optimizers are different. If I did it all over again I’d test scikit-learn’s SGD implementation to compare directly to Keras SGD in addition to a very high number of iterations.

(4) Generally I like this setting because it reduces overfitting and also speeds up runtime. min_samples_split and min_samples_leaf interact a bit though.

(5) First off, the lane/role fields are sometimes empty or may have a lane (bottom) without the role (carry vs support). If the fields are empty I pick the most common lane/role in the data. If lane is present but the role is missing I fill that in with the most common role for that lane and champion. Even still sometimes the data is wrong and lists the support as a second mid lane. This could be better solved by using a classifier to clean up the data but that’s a bigger project. So sometimes I get matches where one side has bot/support and the other side doesn’t. In this case I look up the win rate of that champion as bot/support. If both supports are correctly tagged then I’ll look up the win rate of that matchup (e.g., Morgana bot support vs Blitzcrank bot support).