Tuning dropout for each network size

In the previous post I tested a range of shallow networks from 50 hidden units to 1000. On the smaller dataset (50k rows) additional network complexity hurts: It’s just overfitting. On the larger dataset (200k rows) the additional complexity helps because the amount of data prevents the network from overfitting.

But I learned from the Stanford CNN class that I made a mistake: it’s bad to use network size as a form of regularization; instead, pick the most complex network you can and tune dropout (and/or L2). I’d amend that advice slightly: it applies when you have compute to spare. If the pace of experimentation is limited by runtime, then reducing the network size can be a reasonable way to get both regularization and experimental efficiency.

Just as a reminder this is a part of my ongoing project to predict the winner of ranked matches in League of Legends based on information from champion select and the player histories.

Here’s the graph from last time, showing that added network complexity is harmful on a small dataset. I kept the default of 0.5 dropout.

Shallow NN hidden units on 50k dataset

The hidden layer config is shown on the x-axis as Python tuples; this graph is a series of experiments all with a single hidden layer of varying widths.

Tuning dropout for each network size

I replicated the test but stopped at 600 units; the wider networks take increasingly long to train and I was running several times more tests than before. Let’s start with tuning dropout separately for each hidden layer size:

Scaling number of units with and without dropout tuning on 50k dataset

The hidden layer config is shown on the x-axis as Python tuples; this graph is a series of experiments with different numbers of hidden units in a single hidden layer. This was run with 10-fold cross-validation and the following dropout values were tested: [0.4, 0.5, 0.6, 0.7, 0.8].

Now we see a different trend: when tuning dropout in conjunction with the network size, the added capacity doesn’t lead to overfitting. If anything it improves accuracy slightly then plateaus.

It’s also interesting that the trend is more consistent, but that could be a side effect of taking the max over 4 tests. I’m not sure how to assess that independently.

Also note that the best dropout value for the 200-600 unit networks is 0.8, the largest value I tested. Higher values may have been better still, but I didn’t have time to test more.
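The tuning loop itself is nothing fancy. Here’s a rough sketch of the grid, written against the current Keras API with toy data standing in for my feature matrix (the original code ran on an older Keras/Theano setup and used 10-fold cross-validation rather than a single validation split):

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

rng = np.random.RandomState(0)
X = rng.rand(2000, 56).astype("float32")   # toy stand-in for the real feature matrix
y = rng.randint(0, 2, 2000)                # toy 0/1 labels

results = {}
for hidden in [50, 100, 200, 300, 400, 500, 600]:
    for dropout in [0.4, 0.5, 0.6, 0.7, 0.8]:
        model = Sequential([
            Dense(hidden, activation="relu", input_dim=X.shape[1]),
            Dropout(dropout),
            Dense(1, activation="sigmoid"),
        ])
        model.compile(optimizer="adam", loss="mse", metrics=["accuracy"])
        # Simple held-out split here; the real experiments used 10-fold cross-validation
        history = model.fit(X, y, epochs=100, batch_size=1024,
                            validation_split=0.1, verbose=0)
        results[(hidden, dropout)] = history.history["val_accuracy"][-1]

best_size, best_dropout = max(results, key=results.get)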

Weight initialization

I replicated these tests with the he_uniform initialization (the default is glorot_uniform). Last time I saw benefits from he_uniform but didn’t test thoroughly.

he_uniform vs glorot_uniform across network size with and without dropout tuning

The dashed lines show the results without tuning dropout: he_uniform is generally an improvement. It isn’t any more consistent than glorot_uniform though (previously I thought it might be).

When we tune dropout for each network size (solid lines) they’re almost identical. Looking into the best dropout values per size, the tuned values for he_uniform tend to be lower values of dropout. It still transitions to 0.8 for the larger networks but not until network size 300 in contrast to 200 for glorot_uniform. Another way of looking at it: The default 0.5 dropout is closer to optimal for he_uniform and therefore it fares better when dropout isn’t tuned.

I can only guess at the cause: the He correction probably starts the network off with more diversity in its initial weights. Dropout also forces diversity, so maybe we don’t need to force it as much.

All that said, I’m happy to have learned more: Although He initialization is important for deep networks with ReLU, for shallow networks it’s a minor improvement. But depending on the test it may appear to be an improvement if dropout isn’t tuned because it changes the optimal value of dropout.


Switching from deep to wide

In the previous post I found gains by adding a second hidden layer. But I accidentally found even better results with wider networks of a single hidden layer. I’ve done more systematic experimentation and wanted to share. Just as a reminder this is a part of my ongoing project to predict the winner of ranked matches in League of Legends based on information from champion select and the player histories.

On the smaller dataset (50k rows) increasing the size of the network is harmful: it’s just overfitting. This is probably a part of the reason why I originally decided to stay with a small network but I’ll show that this conclusion doesn’t apply to the larger dataset.

Shallow NN hidden units on 50k dataset

The hidden layer config is shown on the x-axis as Python tuples; this graph is a series of experiments all with a single hidden layer of varying widths.

I began to replicate this test on the larger dataset and unfortunately found that my training code doesn’t scale. It works reasonably well for networks of under 500 hidden units but slows down disproportionately beyond that. Normally the 200k dataset takes about 4x longer to train than the 50k dataset. In this case, though, I suspect the full-batch training is what scales nonlinearly, probably because of the 200k x 500 matrix of hidden activations.

But I learned this the hard way: It took perhaps 12 hours to train all sizes on the 50k dataset with 20-fold cross-validation so I expected around 48 hours for the 200k dataset. After letting it train for an entire week I gave up.

In short I’ve changed my training:
Old: 100 epochs in batches of 1024 followed by 100 epochs of full batch training
New: 100 epochs in batches of 1024 followed by 100 epochs of batches of 5000
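For reference, the two schedules are just consecutive fit calls on the same model. A minimal sketch using the current Keras API with toy data (the real code uses my own wrapper on an older Keras/Theano setup):

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

rng = np.random.RandomState(0)
X = rng.rand(5000, 61).astype("float32")   # toy stand-in for the 200k x 61 feature matrix
y = rng.randint(0, 2, 5000)

model = Sequential([
    Dense(500, activation="relu", input_dim=X.shape[1]),   # one wide hidden layer
    Dropout(0.5),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="mse", metrics=["accuracy"])

# Stage 1: 100 epochs in mini-batches of 1024
model.fit(X, y, epochs=100, batch_size=1024, verbose=0)
# Stage 2 (new schedule): 100 more epochs in batches of 5000 instead of full batch
model.fit(X, y, epochs=100, batch_size=5000, verbose=0)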

In the graph below I separated the two training methods but it doesn’t seem to matter much.

Shallow NN configurations on 200k dataset

The trend is noisy, especially considering the standard deviations, but the overall direction is reasonably consistent. When I first tested wider networks I tried 250 and 500 units, which got 68.2-68.4% accuracy. Those results were probably just outliers. Even so, the wider networks fare quite well (considerably better than the deeper networks I tried last time).

If there’s one thing I’ve learned, it’s that nothing quite goes the way you plan. After starting these experiments I began following Stanford’s CS231n class on convolutional neural networks. One of the things Andrej said is directly applicable: rather than tuning the network size to the data, pick the biggest network you can and tune dropout.

It makes sense: Increasing dropout will sort of shift the excess capacity for learning over to ensembling. But I’m also not entirely happy with the added training time of the larger networks and that may slow down my overall experimentation. On the other hand, thinking about it this way is directly analogous to the way you tune regularization for logistic regression so it simplifies code a little. I tested 50 and 250 units on the 50k dataset in conjunction with tuning dropout and found the same accuracy for both network sizes once dropout was tuned for each one. Maybe this weekend I’ll try replicating the network size graph on the 50k dataset in conjunction with dropout tuning.

Another thing I learned from the class is that Keras suggests an incorrect default weight initialization, or at least one that’s incorrect in conjunction with ReLU. The default is glorot_uniform but it should be he_uniform or he_normal. Both families of initialization compensate for the diminishing variance of activations as you stack layers, but the Glorot versions are only correct for sigmoid and tanh activations. With ReLU, about half of the activations are initially zero, and the He versions compensate in the initialization so that the variance of the activations doesn’t vanish as you go deeper into the network (a problem because the gradients, and therefore the initial updates, end up far too small).
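In Keras this is a per-layer argument. A sketch using the current API (older versions spell the argument init rather than kernel_initializer; the layer sizes here are placeholders):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

n_features = 61   # placeholder: whatever the feature matrix width is

model = Sequential([
    # He initialization for the ReLU hidden layer
    Dense(50, activation="relu", input_dim=n_features, kernel_initializer="he_uniform"),
    Dropout(0.5),
    # Glorot remains a reasonable default for the sigmoid output
    Dense(1, activation="sigmoid", kernel_initializer="glorot_uniform"),
])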

Neural network initializations over 3 dropout settings and 2 network configurations on 50k dataset

Experiment on 50k dataset of dropout [0.4, 0.5, 0.6] x 3 initializations shown above x 2 network configurations: (50,) and (50, 50)

The truth is that I didn’t expect it to matter at all with shallow networks but it’s a reliable improvement. In some ways switching from glorot_uniform to he_uniform is more important than tuning the dropout param.

Since this test I’ve added the initialization test to many of the configuration experiments on the 200k dataset. Generally he_uniform has a more stable trend from one configuration to the next and is usually better than glorot. I haven’t tested enough configurations to know whether the degradations from Glorot are due to outliers or not.

One brief note: I think the class examples used glorot_normal and he_normal. Not really sure why uniform is better but I’d guess that it has better symmetry-breaking properties.

Next steps

  • Replicate the network size scaling graph in conjunction with tuning the dropout. I’m hoping to see that it’s flat or slightly increasing.
  • Replicate scaling graphs with he_uniform to see if it really is more consistent from one configuration to the next.
  • Get this working on GPU – I spent a few hours but found that it’s challenging to get CUDA installed under Windows or at least, a version of CUDA that Theano can use. I looked into TensorFlow instead but that’s actually less supported than Theano.
  • I’ve been focusing too much on tuning the individual models, I think; it’d be great to get back to feature engineering.
  • I’ve been meaning to update the datasets for season 6, which started a few weeks ago. Lots has changed in the game: Champion select is done in a way where you’re more likely to get the role you want. There are tons of champion and item changes. And team queue for 5v5 is gone – now you can queue up as 2-5 people in the same queue as solo but it tries to match the teams or subteams in addition to the players.
    • I’ve been avoiding thinking about this because it’s enough of a change in data that my intuitions from this data may not transfer 100%. And I have to do more engineering work, in particular the player history stats need both season 5 and season 6 information.

Probably I’ll take a break so I can catch up on homework for the Stanford class.

Gains from deep learning

Back from the holidays! I’ve finally made some progress with neural networks, particularly a deep network. This is a part of my ongoing project to predict the winner of ranked matches in League of Legends based on information from champion select and the player histories. Previously I’d been working on ensemble methods, but concluded that they’re more of a last-mile improvement.

First off, what is deep learning? It’s any neural network with 2 or more hidden layers.

Deep neural networks in conjunction with convolutional layers are the reason machine learning has made so much progress on face and object detection. They’re also responsible for large improvements in speech recognition over the past decade. More recently language modeling has seen large improvements due to recurrent neural networks. (2) That said there’s a ton of variation in deep learning so we use the underwhelming definition “any neural network with 2 or more hidden layers”.

What changed for my problem? I revisited an assumption I’d made: that subsequent layers model higher levels of abstraction, so it makes sense to reduce the number of units in deeper hidden layers. I had tried a network that was 56 inputs -> 75 hidden units -> 5 hidden units -> 1 output. It was consistently worse than 56 inputs -> 75 hidden -> 1 output. At the time I concluded that deep learning wasn’t applicable to my problem. After all, neural networks don’t dominate every kind of machine learning problem.

Why revisit my assumption? I was reflecting on my future over the holidays and wanted more practice with deep networks; they’re taking over industry and I keep thinking “I wish I could do that!” Had I been busy with something else or not been reflecting I may have never tried this.

Deep networks

As usual I started development on the 50k dataset because it’s fast. I did a quick first test then expanded my search to include the following configurations. I’m only listing the configuration of the hidden units because the input and output layer sizes are determined by my feature matrix (55 inputs, 1 output).

  • 75 hidden: 66.49% +/- 0.65%
  • 75 hidden -> 75 hidden: 66.34% +/- 0.56%
  • 30 hidden -> 30 hidden: 66.19% +/- 0.55%
  • 20 hidden -> 20 hidden: 66.04% +/- 0.54%
  • 75 hidden -> 5 hidden: 65.99% +/- 0.53%
  • 10 hidden -> 10 hidden: 65.98% +/- 0.61%

The network with two hidden layers of 75 units each does surprisingly well, enough to make me question my previous judgment on a network of 75 and 5.

I was surprised that the smaller networks didn’t do as well; my thinking was that a deeper network with a similar number of parameters might have an easier time learning. The 30->30 network has 2,610 weights compared to 4,275 for the 75-unit network. In contrast the 75->75 network has 9,900. (1)
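For reference, these counts tally the weight matrices only (no bias terms), which is consistent with the numbers above:

def n_weights(layer_sizes):
    # Weight count of a fully-connected network, ignoring biases:
    # sum of (fan_in x fan_out) over consecutive layer pairs.
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

print(n_weights([56, 30, 30, 1]))   # 2610
print(n_weights([56, 75, 1]))       # 4275
print(n_weights([56, 75, 75, 1]))   # 9900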

I ran some of these configurations on the 200k dataset and finally achieved gains over the 75 unit network!

Accuracy vs network configuration

That’s a great improvement over the previous best neural network on this dataset. Furthermore it’s competitive with gradient boosting which tends to get around 67.8%-67.9% accuracy.

Additional experiments

I tried additional network configurations: 100 -> 100, 100 -> 75, and 75 -> 75 -> 75. The results seem slightly improved for 100 -> 100 and 100 -> 75 and much worse for 75 -> 75 -> 75.

I tuned the dropout parameter in the range 0.4-0.6 and then again 0.4-0.51 (default 0.5) on the 75->75 network. The trend isn’t a smooth curve so I’m hesitant to draw conclusions, and the sheer amount of variation worries me. If nothing else, higher values seem to be worse.

Accuracy vs dropout

Based on a tip from Yann LeCun I tried setting the error function to binary cross entropy rather than mean squared error (MSE). The results were wholly worse: there was no configuration in which it was better. I tested this in conjunction with the dropout experiments and found that models trained with MSE had accuracy in the range 67.4-68.1% while binary cross entropy training got 66.9-67.2%.

I increased the number of training epochs, thinking that deeper networks need more time to converge. Previously I was running 100 epochs in batches of 1024 followed by 100 epochs of full batch training. I also tried 200/200 and 300/300. Generally the 200/200 runs were an improvement and the 300/300 runs were worse.

Accuracy vs training iterations

Batch spec ((100, 1024), (100, -1)) means 100 epochs in batches of 1024 followed by 100 epochs in batches of max size (full batch).

It seems that iterating more on a deep network just overfits. So I tried early stopping based on 10% held-out validation accuracy with a patience of 20 epochs: if accuracy doesn’t improve for 20 epochs, training stops. In theory this should fix any issues with the 300/300 epoch training, but it failed; early stopping was generally worse overall. I found this surprising because I’ve heard Geoff Hinton say that early stopping is a sort of free lunch – you’re speeding up training and preventing overfitting at the same time.
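The setup is roughly the following in Keras (current API; assuming a compiled model and X, y as in the earlier sketches – older versions spell the monitored quantity val_acc):

from tensorflow.keras.callbacks import EarlyStopping

# Stop if held-out validation accuracy hasn't improved for 20 epochs
early_stop = EarlyStopping(monitor="val_accuracy", patience=20)
model.fit(X, y, epochs=300, batch_size=1024,
          validation_split=0.1,       # 10% held-out validation set
          callbacks=[early_stop])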

Accuracy with and without early stopping

During the experiments I came to realize that due to randomness in initialization and training (dropout) there will always be some outliers in a batch of results. If I run 20 tests of the same model there’s a good chance one or two will have top-notch accuracy and one or two will perform horribly. So I’ve become more skeptical of results and I’m recording not just accuracy but standard deviation of the cross validation more often. I’m also tending to look more at the distribution of results while tuning other hyperparameters (the graphs above are like this).

As I was writing this up I came to see that the shallow network with 100 hidden units was a little better than the 75 unit network. It’s funny because I used to use 100 units normally and reduced to 75 to speed up experimentation. I tried a shallow network with 250 hidden units and found that it got 68.42% (best ever) so now I’m going through much more thorough testing of wider shallow networks.

Odds and ends

Before the holidays I ran a few more tests with ensembles. First I tried encouraging diversity in the ensemble: my thinking was to tweak the settings so that the individual classifiers would each overfit, so I reduced the min-samples-split setting in gradient boosting and reduced dropout in the neural network. It didn’t help though.

I also tried adding logistic regression and random forests as components in the stacked ensemble. When testing on the 50k dataset, adding logistic regression improved the ensemble to its best accuracy yet. Adding random forests hurt performance. Funny enough, when stacking with linear regression it assigns random forests a negative weight. I’m amazed that they’re that bad.

Before I’d gotten into deep networks I revisited the activation function. In Keras you can pick ReLU, sigmoid, or tanh. I picked ReLU after initially finding it vastly better than sigmoid. In retesting I found that ReLU was vastly superior to both sigmoid and tanh for the hidden units. The results were pretty close for the output unit.

Thoughts

Stepping away from the problem and coming back to it gave me a fresh perspective and helped make progress, especially by retesting assumptions.

Next steps

  • Thorough evaluation of shallow networks. Tests of hidden layer size are in progress.
  • Tune the learning rate. I haven’t touched the default learning rate at all.
  • Revisit maxout now that I have deeper networks

Notes

  1. Due to dropout maybe I should think of each layer as about half the size?
  2. I’m grossly simplifying. Probably none of the progress would’ve happened without GPU-accelerated linear algebra. Or without better learning techniques for deep networks (rprop/rmsprop, adam, tweaks on sgd). Or without the massive increase in training data (NNs not as useful for small data). It’s tough to choose but I’d also include dropout, maxout, LSTM, GRU and ReLU as part of the neural network renaissance.

Ensembles part 2

I’ve been using ensembles of my best classifiers to slightly improve accuracy at predicting League of Legends winners. Previously I tried scikit-learn’s VotingClassifier and also experimented with probability calibration.

Since then I’ve followed up on two ideas: 1) tuning classifier weights with VotingClassifier and 2) making an ensemble that combines individual classifier probabilities using logistic regression (and other methods!).

Tuning classifier weights

The best ensemble combines gradient boosting trees and a neural network. VotingClassifier allows for weights on the individual classifiers: By default they’re evenly weighted but you can skew the voting towards one classifier. Gradient boosting is a stronger individual classifier so it should probably get more weight than the neural network.

I evaluated several weights with 10-fold cross validation. (1) Side note: This time I’m trying out Plotly because it can make charts with error bars. I’m just showing plus or minus one standard deviation on the error bars.

Prediction accuracy vs ensemble weights

The default weights are 0.5 for each classifier.

Roughly speaking there’s a hump around 0.8 weight on gradient boosting, but the trend is very noisy; almost all results are within a standard deviation of each other.

I’ll stick with 0.8 weight for now. One note of caution: I’m picking 0.8 based on the same overall evaluation I report, so I’m cheating the evaluation a little.
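For reference, here’s roughly how the weighted soft vote is expressed with scikit-learn’s VotingClassifier; the two base models below are just runnable stand-ins for the real gradient boosting + neural network pair:

from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

ensemble = VotingClassifier(
    estimators=[("gb", GradientBoostingClassifier(n_estimators=300)),
                ("nn", LogisticRegression())],   # stand-in for the Keras network wrapper
    voting="soft",
    weights=[0.8, 0.2],   # skew the averaged probabilities toward gradient boosting
)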

Stacking: Classifier to combine classifiers

Tuning the classifier weight bothers me. I’m tuning this equation by hand:

ensemble_prob(X) = prob_1(X) * weight + prob_2(X) * (1 - weight)

But linear regression is designed for this. I could use logistic regression to take the individual classifier probabilities as features and output the class label.

This approach is called stacking and has been used to achieve best results in several machine learning competitions. MLWave has a great section on stacking in their ensemble guide.

Stacking is a little more complex than it sounds because the output of the individual classifiers is different on training data vs unseen data. For stacking to work, the combiner classifier needs probabilities on unseen data. So to get it all to work, you need to subdivide the training data with another layer of cross-validation. (3)

Fortunately I have scikit-learn’s CalibratedClassifierCV as a guide: It uses cross-validation on the training data to learn the calibration. I created StackedEnsembleClassifier (github link) which accepts a list of base classifiers and a combiner. Regression and classification are handled differently in scikit-learn so I have a separate class to use linear regression for combination. Both of them run 3-fold cross-validation behind the scenes just like CalibratedClassifierCV.
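This isn’t my StackedEnsembleClassifier, but the core idea fits in a few lines with the current scikit-learn API: train the combiner on out-of-fold probabilities, then refit the base models on all the training data.

import numpy as np
from sklearn.model_selection import cross_val_predict

def stacked_fit_predict(base_models, combiner, X_train, y_train, X_test):
    # Out-of-fold probability of the positive class from each base model
    # (3 folds, matching CalibratedClassifierCV's behind-the-scenes CV)
    train_meta = np.column_stack([
        cross_val_predict(m, X_train, y_train, cv=3, method="predict_proba")[:, 1]
        for m in base_models
    ])
    combiner.fit(train_meta, y_train)

    # Refit each base model on the full training set, then stack its test probabilities
    test_meta = np.column_stack([
        m.fit(X_train, y_train).predict_proba(X_test)[:, 1] for m in base_models
    ])
    return combiner.predict(test_meta)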

Adding another layer of cross-validation slows training so I’ve focused on the 50k dataset. To try and compensate for higher variation in the small dataset I’ve run these tests with 20-fold cross-validation rather than the usual 5 or 10.

Base classifiers and ensembles on 50k dataset

Linear regression is surprisingly good as a combiner. The weights are around 0.35-0.45 on gradient boosting at this data size and the two model weights sum to about 1. For this dataset the manually-tuned weighting is fairly bad, barely improving over the individual classifiers.

Logistic regression and neural networks are also quite good. One thing with the neural network is that it’s much better with sigmoid than ReLU (probably because sigmoid is smoother).

Random forests don’t work well. It could be because they’re so poorly calibrated. I tried a second implementation in which I use the predict method of the combiner rather than predict_proba but that doesn’t do well with random forests either. I’ve seen other researchers use them successfully as a combiner so I may just need to add hyperparameter tuning.

I also ran limited tests on the 200k dataset with 10-fold cross-validation:

Accuracy on 200k dataset

First off, gradient boosting has huge variation from one run to the next (even though it’s an average of 10 cross-validation runs and the cross-validation has a fixed random seed). My first run had learning rate 0.2167 but I’ve been using 0.221 for the ensemble. Usually I’ve seen it around 67.8% accurate.

The ensemble is somewhat helpful with 200k data points but not nearly as much as the 50k dataset. It’s plausible that I’ve tuned my hyperparameters for the individual classifiers too much on the 200k data and they need to overfit more for the ensemble to be effective. (2) It’s also plausible that I’m at the limit of predictability with this data.

Another interesting find is that linear regression picks weights around 0.5-0.6 on gradient boosting in contrast with the hand-tuned 0.8. Also I found that fit_intercept=False is slightly better.

I found that increasing the number of folds in the nested cross-validation for the ensemble is slightly helpful. But it increased runtime multiplicatively so 3 folds is enough.

The stacked ensembles tended to have lower standard deviation than the manually weighted ensemble. In particular, the logistic regression combiner tends to have lower standard deviation (i.e., it’s more dependable). That said, we should expect the stacked ensembles to have lower standard deviation because we effectively have 2 base classifiers times 3 folds rather than simply 2 base classifiers. In other words, the cross-validation used to train the combiner smooths out some of the randomness in the individual models.

Although the runtime is higher, I feel much more comfortable with linear regression and logistic regression stacking than manual weighting of the ensemble. Partly it’s better science but they’re also more reliable.

Next steps

I don’t think focusing on better ensembles will get me much gain but I suspect that what I’ve learned will be useful. There are a few little ideas I didn’t investigate but that I’d like to someday:

  • It’d be great to merge the linear regression code in with the general stacked classifier.
  • Diversity among the ensemble members is important and I’d like to explore some automation to encourage diversity, possibly a grid search over the ensemble member hyperparams much like VotingClassifier. There might be an efficient way to do this if I have a larger ensemble (say 10 variations) and do a single training/testing pass to select a diverse subset to then use in the real ensemble. Or I could just tweak the parameters to overfit somewhat.
    • Still need a way to have ensemble members use different subsets of the features.
  • Training time is too slow. Maybe there’s a way to mimic LogisticRegressionCV for other classifiers by designing a sort of warm_start for each classifier.

Notes

  1. Even with 10-fold cross-validation the randomness is just too much. Maybe 20-fold would’ve been smoother. Also note this graph is a composite of 3-4 searches over ensemble weights so that’s why the points aren’t uniformly spread out.
  2. Saying they need to overfit is a bit counterintuitive. Really what I need is to increase the overall diversity of the ensemble. By making nice general individual models I may have forced them to output similar classifications.
  3. I learned the hard way that held-out data is necessary. I was trying to save some effort and just train on probabilities on the training data. But it’s even worse than the worst classifier in the ensemble! In the one test I recorded, training a logistic regression ensemble this way got 65.0% accuracy vs 66.3-66.4% from the individual classifiers.


Ensemble notes

I thought probability calibration would be difficult but it’s pretty easy. My ensemble code looks like this:

from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier

estimators = []
estimators.append(("nn", NnWrapper(dropout=0.5)))                        # my Keras network wrapped with a scikit-learn interface
estimators.append(("gb", GradientBoostingClassifier(n_estimators=300)))

# Optionally wrap each base model so its probabilities are calibrated (internal 3-fold CV)
if calibrate:
  estimators = [(name, CalibratedClassifierCV(clf)) for name, clf in estimators]

model = VotingClassifier(estimators, voting="soft")

The only downside is that training the calibration-wrapped model is now going to run 3-fold cross-validation on the training data to learn the calibration. When predicting on new data it uses the calibrated probabilities from each of the three classifiers and averages them.

I ran several tests on the 200k x 61 dataset and found that calibration helps a tiny bit. But I also found that the results fluctuate much more than my non-ensemble classifiers even though I’m testing the ensemble in 5-fold cross-validation. (1)

Here are the raw numbers over 4 tests with and without calibration. Each of those tests is a 5-fold cross-validation run so the standard deviations are showing the variation between the different folds.

           Without calibration         With calibration
           Accuracy    Std dev         Accuracy    Std dev
           67.93       0.12            68.11       0.48
           67.77       0.14            67.87       0.19
           67.86       0.18            67.89       0.19
           67.96       0.15            67.89       0.16
Average    67.88                       67.94
Spread     0.19                        0.24
Runtime    51.7 min                    95 min

My first run with calibration was amazing: 68.1% accuracy is the best I’ve ever seen on this data! But it was a fluke. Also the standard deviation is very high, which likely means that one of the folds was accidentally very easy to predict. Although the average accuracy is higher with calibration it’s almost entirely due to that one outlier.

Calibration about doubles the runtime as well. When I stop to think about it, it’s running 3 training runs over 2/3 of the data instead of 1 run so it should be around double.

One cautionary note about CalibratedClassifierCV: It checks the numpy datatype of the output value and uses stratified cross validation if it’s boolean or int. That’s what I want. Luckily Pandas automatically set my output type to int64. If you’re loading the data without Pandas be warned that numpy’s default datatype is a floating point type which would lead to non-stratified cross validation. Non-stratified cross-validation can sometimes lead to bad evaluations especially if the output classes are sparse.
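A minimal way to guard against that, with toy data standing in for the real feature matrix:

import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.RandomState(0)
X = rng.rand(1000, 61)                      # toy stand-in for the 200k x 61 features
y = (rng.rand(1000) > 0.5).astype(int)      # cast labels to int so the internal CV is stratified

calibrated = CalibratedClassifierCV(GradientBoostingClassifier(n_estimators=300), cv=3)
calibrated.fit(X, y)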

Conclusion: Calibration helps a little but it’s not worth the additional runtime.

Isotonic calibration

The default calibration is labeled as sigmoid but is really just feeding the output value of the base classifier into logistic regression. This is also called Platt scaling.

Isotonic calibration fits a monotonic (rank-preserving) mapping from the classifier’s output scores to probabilities. It’s a more flexible learner but prone to overfitting on small data sets.
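Switching between the two calibration methods is a single argument (the base model here is a placeholder):

from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import GradientBoostingClassifier

base_model = GradientBoostingClassifier(n_estimators=300)
platt = CalibratedClassifierCV(base_model, method="sigmoid", cv=3)      # default: Platt scaling
isotonic = CalibratedClassifierCV(base_model, method="isotonic", cv=3)  # monotonic fit; wants more data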

I’ve only done one test with isotonic calibration so far and the accuracy was about the same but the standard deviation of accuracy across the folds was halved.

Conclusion: Possible minor benefit from isotonic calibration. Need more tests.

Hard voting

I’ve been using soft voting, which averages the probabilities of the individual classifiers. When you only have two good models it’s really the only option: hard voting has no sensible way to break a two-way disagreement. I briefly tried including logistic regression in the soft voting ensemble but found that it was harmful.

However, I didn’t try hard voting. Not all classifiers provide probabilities so hard voting is sometimes necessary. Another advantage is that I don’t have to consider calibration.

Model                                                                    Accuracy (200k x 61, 10-fold)
Hard voting: Neural network + Gradient boosting + Logistic regression   67.57%
Soft voting: Neural network + Gradient boosting + Logistic regression   67.53%
Soft voting: Neural network + Gradient boosting (previous best)         67.88%

Hard and soft voting with three classifiers are well within a standard deviation of each other. But they’re well beyond a standard deviation lower than soft voting with the two best individual models.

I had a feeling hard voting wouldn’t do anything magical but was curious if maybe there’s some way I can use my less accurate models to improve the overall ensemble accuracy. It may be useful to use logistic regression in a soft-voting ensemble but I’d need to tune weights for each classifier.

Conclusion: Not much different from soft voting except that it doesn’t work with weighting.

Thoughts and next steps

Overall I get the feeling that ensembles can get me that last fraction of a percent on a problem. In a competitive environment like Kaggle that can be the difference between top 50 and top 5. Also when you’ve run out of ideas for feature engineering it’s another option for that last little bit of improvement.

Assorted ideas for further improvement:

  • Run a grid search over weights in the 2-model soft voting ensemble
  • Use the code for CalibratedClassifierCV as a guide to use logistic regression to combine models. I tried something like this before but didn’t correctly train it like how CalibratedClassifierCV is implemented.
  • Set up the ensemble so that different models can use different feature subsets. I’d like to have a model that uses the list of champions on each side but most models overfit that data (logistic regression is the exception). I think probably I need to migrate my code to scikit-learn pipelines to have an ensemble with different feature subsets.

Notes

  1. Clarification just in case: The evaluation is using 5-fold cross validation but inside each of those 5 folds CalibratedClassifierCV is splitting the training set into 3 folds.

scikit-learn 0.17 is out!

Scikit-learn 0.17 adds features and improvements that might help me:

  • stochastic average gradient solver for logistic regression is faster on big data sets
  • speed and memory enhancements in several classes
  • ensemble classifier that supports hard and soft voting as well as hyperparameter tuning of the components in grid search
  • robust feature scaler does standard scaling but excludes outliers from the standard range

The full changelog is here. I’ve been testing the changes to see how they’ll impact my work in predicting match winners in League of Legends.

Stochastic average gradient

Like lbfgs and newton-cg, sag supports warm_start so it works well in conjunction with LogisticRegressionCV to tune the regularization parameter.
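A sketch of that setup (the Cs grid, fold count, and scaling step are my choices here, not necessarily what I ran):

from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# sag converges much faster on standardized features; LogisticRegressionCV
# warm-starts across the regularization grid
model = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(Cs=10, cv=5, solver="sag", max_iter=1000),
)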

First I tried it on the 200k match dataset with 61 features, repeating the tests to be more confident in the numbers.

Solver      Training time   Accuracy
lbfgs       0.5 min         66.59%
lbfgs       0.5 min         66.59%
sag         1.7 min         66.60%
sag         1.8 min         66.60%
newton-cg   2.4 min         66.60%
newton-cg   2.6 min         66.60%

sag is faster than newton-cg but still about 3x slower than lbfgs. It does eke out that last 0.01% accuracy though.

sag is designed for large data sets so I also tried on the 1.8 mil x 61 dataset:

Solver   Training time   Accuracy
lbfgs    7.2 min         66.07%
sag      45.8 min        66.07%

It’s over 6x slower and achieves the same accuracy. Maybe sag’s benefit really shines on datasets with a large number of features: sklearn team testing used 500 features and 47k features.

Conclusion: Staying with lbfgs.

Other performance

The patch notes briefly mentioned speed and memory improvements in random forests and gradient boosting.

RandomForestClassifier

Tried this on the 200k x 61 dataset:

Version                Training time   Accuracy
Random Forest 0.16.1   14.1 min        66.34%
Random Forest 0.17     13.7 min        66.39%

The training time and accuracy fluctuations could just be differences due to randomization; random forests tend to fluctuate more than other methods from test to test. In the worst case, it doesn’t seem that much has changed. In the best case there are slight improvements.

Conclusion: Random forest is about the same, but I didn’t test memory usage.

GradientBoostingClassifier

Gradient boosting trains much more slowly than other methods so I started on the 50k x 61 dataset. I ran some tests multiple times to be certain of the results.

Version                                         Training time   Accuracy
Gradient Boosting 0.16.1                        7.6 min         66.08%
Gradient Boosting 0.16.1 with feature scaling   8.8 min         66.10%
Gradient Boosting 0.17                          11.0 min        66.17%
Gradient Boosting 0.17                          11.6 min        66.34%
Gradient Boosting 0.17 with feature scaling     11.7 min        66.14%
Gradient Boosting 0.17 with feature scaling     11.7 min        66.17%
Gradient Boosting 0.17 presort=False            14.0 min        65.94%
Gradient Boosting 0.17 max_features=auto        11.2 min        66.19%

Gradient boosting is clearly slower in 0.17 and generally a tad more accurate. The default presort setting is good for runtime and accuracy. Feature scaling doesn’t really help. Adjusting the max_features setting seems to help a touch (should reduce variance and improve training time).

I also tested on the 200k x 61 data:

Version                    Training time   Accuracy
Gradient Boosting 0.16.1   43.9 min        67.66%
Gradient Boosting 0.17     62.3 min        67.75%

Again it’s slower but more accurate. I’ve opened a ticket and right now it’s under investigation. It sounds like a change in the error computation may be the culprit.

Conclusion: Gradient boosting is roughly 45% slower but a little more accurate; a fix is being investigated.

VotingClassifier

In the previous post I described possible directions to get from 67.9% accuracy up to 70.0% and suggested that an ensemble of the best classifiers may be a fruitful direction but may take a bit of time to code.

Well, two things changed. First off, I found a great guide on making an ensemble in scikit-learn. I implemented a simple ensemble and improved my best result from 67.9% accuracy to 68.0% with a soft-voting ensemble of gradient boosting and neural networks. It’s not as much as I expected but it’s progress.

The second change is that scikit-learn 0.17 added VotingClassifier, implemented by Sebastian Raschka (who wrote the guide and implementation I found earlier). I ported my ensemble code to scikit-learn and it works great (though I had to change my neural network wrapper to return two columns rather than one for binary classification).

That said, I wish it had a flag to calibrate the probabilities of the individual classifiers. I’m currently looking into calibration but not finding that it helps; gradient boosting produces more extreme probabilities than the neural network, which effectively gives it more weight in the soft vote. That’s an unintentionally good outcome: more weight on the stronger classifier.

Conclusion: VotingClassifier is easy and works like a charm.

RobustScaler

In general using the robust scaler seems like an easy solution to save time in preprocessing your data.

I tried it with logistic regression because it’s so sensitive to feature scaling. But after several tests I didn’t find any difference in either scaling+training time or accuracy.
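The comparison amounts to swapping the scaler in front of logistic regression; a sketch:

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler, StandardScaler

standard = make_pipeline(StandardScaler(), LogisticRegression())
# RobustScaler centers on the median and scales by the IQR, so outliers
# don't stretch the feature range
robust = make_pipeline(RobustScaler(), LogisticRegression())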

Thoughts

I bolded the main point of each section so I won’t summarize. But I like the direction scikit-learn is taking.