Scikit-learn 0.17 adds features and improvements that might help me:
- stochastic average gradient solver for logistic regression is faster on big data sets
- speed and memory enhancements in several classes
- ensemble classifier that supports hard and soft voting as well as hyperparameter tuning of the components in grid search
- robust feature scaler does standard scaling but excludes outliers from the standard range
The full changelog is here. I’ve been testing the changes to see how they’ll impact my work in predicting match winners in League of Legends.
Stochastic average gradient
Like lbfgs and newton-cg, sag supports warm_start so it works well in conjunction with LogisiticRegressionCV to tune the regularization parameter.
First I tried on the 200k match dataset with 61 features. I repeated the tests for better accuracy.
sag is faster than newton-cg but still about 3x slower than lbfgs. It does eke out that last 0.01% accuracy though.
sag is designed for large data sets so I also tried on the 1.8 mil x 61 dataset:
It’s over 6x slower and achieves the same accuracy. Maybe sag’s benefit really shines on datasets with a large number of features: sklearn team testing used 500 features and 47k features.
Conclusion: Staying with lbfgs.
The patch notes briefly mentioned speed and memory improvements in random forests and gradient boosting.
Tried this on the 200k x 61 dataset:
|Random Forest 0.16.1||14.1 min||66.34%|
|Random Forest 0.17||13.7 min||66.39%|
The training time and accuracy fluctuations could just be differences due to randomization; random forests tend to fluctuate more than other methods from test to test. In the worst case, it doesn’t seem that much has changed. In the best case there are slight improvements.
Conclusion: Random forest is about the same, but I didn’t test memory usage.
Gradient boosting trains much more slowly than other methods so I started on the 50k x 61 dataset. I ran some tests multiple times to be certain of the results.
|Gradient Boosting 0.16.1||7.6 min||66.08%|
|Gradient Boosting 0.16.1 with feature scaling||8.8 min||66.10%|
|Gradient Boosting 0.17||11.0 min||66.17%|
|Gradient Boosting 0.17||11.6 min||66.34%|
|Gradient Boosting 0.17 with feature scaling||11.7 min||66.14%|
|Gradient Boosting 0.17 with feature scaling||11.7 min||66.17%|
|Gradient Boosting 0.17 presort=False||14.0 min||65.94%|
|Gradient Boosting 0.17 max_features=auto||11.2 min||66.19%|
Gradient boosting is clearly slower in 0.17 and generally a tad more accurate. The default presort setting is good for runtime and accuracy. Feature scaling doesn’t really help. Adjusting the max_features setting seems to help a touch (should reduce variance and improve training time).
I also tested on the 200k x 61 data:
|Gradient Boosting 0.16.1||43.9 min||67.66%|
|Gradient Boosting 0.17||62.3 min||67.75%|
Again it’s slower but more accurate. I’ve opened a ticket and right now it’s under investigation. It sounds like a change in the error computation may be the culprit.
Conclusion: Gradient boosting 45% slower but a little more accurate, fix is being investigated.
In the previous post I described possible directions to get from 67.9% accuracy up to 70.0% and suggested that an ensemble of the best classifiers may be a fruitful direction but may take a bit of time to code.
Well, two things changed. First off, I found a great guide on making an ensemble in scikit-learn. I implemented a simple ensemble and improved my best results from 67.9% accuracy to 68.0% accuracy by a soft-voting ensemble of gradient boosting and neural networks. It’s not as much as I expected but it’s progress.
The second change is that scikit-learn 0.17 added VotingClassifier, implemented by Sebastian Raschka (who wrote the guide and implementation I found earlier). I ported my ensemble code to scikit-learn and it works great (though I had to change my neural network wrapper to return two columns rather than one for binary classification).
That said, I wish it had a flag to perform calibration of the probabilities of the individual classifiers. I’m currently looking into calibrating but not finding that it helps; gradient boosting has more skewed probabilities than neural networks which leads to more weight on gradient boosting. That’s an unintentionally good decision: putting more weight on the stronger classifier.
Conclusion: VotingClassifier is easy and works like a charm.
In general using the robust scaler seems like an easy solution to save time in preprocessing your data.
I tried it with logistic regression because it’s so sensitive to feature scaling. But after several tests I didn’t find any difference in either scaling+training time or accuracy.
I bolded the main point of each section so I won’t summarize. But I like the direction the scikit-learn is taking.