I ran a quick Gradient Boosted Trees vs Neural Nets check using scikit-learn's dev branch, which makes it more convenient to work with tabular datasets that mix numerical and categorical features (e.g. the Adult Census dataset).
Let's start with the GBRT model. It's now possible to reproduce the SOTA number on this dataset in a few lines of code and ~2 s of compute (CV included) on my laptop.
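For reference, here is a minimal sketch of that kind of workflow (not necessarily the exact notebook; the ordinal-encoding step and default hyperparameters are my assumptions):

```python
import numpy as np

from sklearn.compose import make_column_selector as selector, make_column_transformer
from sklearn.datasets import fetch_openml
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

X, y = fetch_openml("adult", version=2, as_frame=True, return_X_y=True)

# Integer-encode the categorical columns; unseen categories become NaN,
# which HistGradientBoostingClassifier treats as missing values.
categorical_columns = selector(dtype_include="category")(X)
preprocessor = make_column_transformer(
    (
        OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=np.nan),
        categorical_columns,
    ),
    remainder="passthrough",
)
gbrt = make_pipeline(
    preprocessor,
    HistGradientBoostingClassifier(
        # The encoded categorical columns come first in the transformed array.
        categorical_features=list(range(len(categorical_columns))),
    ),
)
print(cross_val_score(gbrt, X, y, cv=5).mean())
```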
1/n
For neural networks, feature preprocessing is make-or-break.
I was pleasantly surprised to observe that by intuitively composing basic building blocks from scikit-learn (`OneHotEncoder`, `SplineTransformer`, and `MLPClassifier`), it's possible to approach the predictive performance of trees on this dataset.
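Roughly, the pipeline looks like this (a sketch with guessed hyperparameters such as `n_knots` and `hidden_layer_sizes`, not the exact notebook):

```python
from sklearn.compose import make_column_selector as selector, make_column_transformer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, SplineTransformer

preprocessor = make_column_transformer(
    # One-hot encode the categorical columns, ignoring unseen categories.
    (OneHotEncoder(handle_unknown="ignore"), selector(dtype_include="category")),
    # Expand each numerical column on a B-spline basis.
    (SplineTransformer(n_knots=10), selector(dtype_exclude="category")),
)
nn = make_pipeline(
    preprocessor,
    MLPClassifier(hidden_layer_sizes=(256,), early_stopping=True),
)
# Evaluate the same way as the GBRT: cross_val_score(nn, X, y, cv=5)
```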
2/n
Note that the neural net is ~10x slower to fit than the tree-based model on my Apple M1 laptop.
I did not try to use an expensive GPU with PyTorch.
Note, however, that I did configure conda-forge's numpy to link against Apple Accelerate, which leverages the M1 chip's built-in hardware acceleration for linear algebra and is typically around 3x faster than OpenBLAS on the M1's CPU.
It's possible that with float32 ops (instead of float64) that speed-up would be even wider, though. Unfortunately it's not yet easy to do w/ sklearn.
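To check which BLAS backend numpy ends up linked against, something like this works (the conda command shown in the comment is the usual conda-forge mechanism, not necessarily the exact setup used here):

```python
import numpy as np

# Prints the BLAS/LAPACK libraries numpy was built against (Accelerate vs OpenBLAS).
np.show_config()
# With conda-forge, the backend is typically switched at install time via the BLAS
# metapackage variants, e.g.: conda install "libblas=*=*accelerate"
```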
3/n
It was interesting to see that the neural network's predictive accuracy degrades by one or two points if standard scaling of the numerical features is used instead of splines, or if too few knots are used for the splines.
For this particular dataset, it seems important to use a feature preprocessing with an axis-aligned prior for the numerical features.
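A sketch of that ablation (assuming the Adult `X`, `y` loaded in the first snippet; knot counts and MLP size are illustrative guesses):

```python
from sklearn.compose import make_column_selector as selector, make_column_transformer
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, SplineTransformer, StandardScaler

for name, numerical_preprocessor in [
    ("standard scaling", StandardScaler()),
    ("splines, 3 knots", SplineTransformer(n_knots=3)),
    ("splines, 10 knots", SplineTransformer(n_knots=10)),
]:
    variant = make_pipeline(
        make_column_transformer(
            (OneHotEncoder(handle_unknown="ignore"), selector(dtype_include="category")),
            (numerical_preprocessor, selector(dtype_exclude="category")),
        ),
        MLPClassifier(hidden_layer_sizes=(256,), early_stopping=True),
    )
    print(name, cross_val_score(variant, X, y, cv=5).mean())
```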
4/n
This is in line with the numbers in the AD column of Table 6 of this very interesting paper:
On Embeddings for Numerical Features in Tabular Deep Learning
Yury Gorishniy, Ivan Rubachev, Artem Babenko
https://arxiv.org/abs/2203.05556
Note that I did not do extensive parameter tuning, but my notebook is not too far from those numbers.
I might try to implement the periodic features as a preprocessor in the future.
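A rough idea of what such a preprocessor could look like (random sin/cos frequencies loosely inspired by the paper; `PeriodicFeatures` is a hypothetical transformer, not part of scikit-learn):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class PeriodicFeatures(TransformerMixin, BaseEstimator):
    """Map each numerical feature to sin/cos of randomly drawn frequencies."""

    def __init__(self, n_frequencies=8, sigma=1.0, random_state=None):
        self.n_frequencies = n_frequencies
        self.sigma = sigma
        self.random_state = random_state

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        rng = np.random.default_rng(self.random_state)
        # One set of random frequencies per input feature.
        self.frequencies_ = rng.normal(
            scale=self.sigma, size=(X.shape[1], self.n_frequencies)
        )
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        # Broadcast to shape (n_samples, n_features, n_frequencies).
        angles = 2 * np.pi * X[:, :, None] * self.frequencies_[None, :, :]
        return np.concatenate(
            [np.sin(angles), np.cos(angles)], axis=-1
        ).reshape(X.shape[0], -1)
```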
5/n
Meanwhile, I also checked the calibration of the tree-based and nn-based models.
The conclusion is that both models are well calibrated by default, as long as you use early stopping.
If you disable early stopping and `max_iter` is too small (underfitting) or too large (overfitting), the models can be significantly under-confident or over-confident.
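A minimal sketch of that kind of calibration check (assumes a pipeline such as `gbrt` or `nn` from the earlier snippets; the split and `n_bins` are illustrative):

```python
from sklearn.calibration import CalibrationDisplay
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
gbrt.fit(X_train, y_train)
# Reliability diagram: predicted probability vs observed frequency of the positive class.
CalibrationDisplay.from_estimator(gbrt, X_test, y_test, n_bins=15)
```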
6/n
Here is the link to the rendered notebook:
It also includes a similar study on California Housing which has only numerical features.
For this dataset, spline features degrade performance, which I found quite surprising. But standard scaling makes the neural network competitive with (albeit still slower than) the tree-based model.
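A sketch of that all-numerical variant (illustrative defaults, not the exact notebook):

```python
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_housing, y_housing = fetch_california_housing(return_X_y=True)
nn_reg = make_pipeline(StandardScaler(), MLPRegressor(early_stopping=True))
gbrt_reg = HistGradientBoostingRegressor()
for name, reg in [("MLP", nn_reg), ("GBRT", gbrt_reg)]:
    print(name, cross_val_score(reg, X_housing, y_housing, cv=5).mean())
```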
7/7.