I have previously been exploring uncertainty measures that we can build into our machine learning models, making it easier to see whether a concrete prediction can be trusted. This involved confidence intervals for datasets and prediction intervals for models; see the previous posts in this series for a more in-depth treatment of all of these.
Many people have contacted me about implementations of these methods, as it is still somewhat of a hassle to implement them when we simply have a model or a dataset at hand and just want some quick uncertainty estimates. This led me to develop the Python library doubt, which aims to make this process as easy as possible. In this post I will cover a few common use cases of the library and attempt to convince you that the step from only having point predictions to also having uncertainty bounds does not have to be complicated.
This post is part of my series on quantifying uncertainty:
 Confidence intervals
 Parametric prediction intervals
 Bootstrap prediction intervals
 Quantile regression
 Quantile regression forests
 Doubt
 Monitoring with uncertainty
Setting up
Installing the library is as simple as with most other Python libraries. Simply run
pip install doubt
in your favorite terminal, and you are good to go!
Prelude: doubt.datasets
Throughout this demo post I will be using various real-world test datasets, all of which have been implemented in the doubt library. They all share a uniform API. We load in a dataset as follows, here using the FacebookComments dataset:
>>> from doubt.datasets import FacebookComments
>>> dataset = FacebookComments()
>>> dataset.shape
(199030, 54)
To see more information about an individual dataset, simply use the help function:
>>> help(dataset)
class FacebookComments(doubt.datasets._dataset.BaseDataset)
 FacebookComments(cache: Union[str, NoneType] = '.dataset_cache')

 Instances in this dataset contain features extracted from Facebook posts.
 The task associated with the data is to predict how many comments the
 post will receive.


 Parameters:
 cache (str or None, optional):
 The name of the cache. It will be saved to `cache` in the
 current working directory. If None then no cache will be saved.
 Defaults to '.dataset_cache'.

 Attributes:
 shape (tuple of integers):
 Dimensions of the data set
 columns (list of strings):
 List of column names in the data set

 Class attributes:
 url (string):
 The url where the raw data files can be downloaded
 feats (iterable):
 The column indices of the feature variables
 trgts (iterable):
 The column indices of the target variables

 Features:
 page_popularity (int):
 Defines the popularity of support for the source of the document
 (...)

 Targets:
 ncomments (int): The number of comments in the next `h_local` hours

 Source:
 https://archive.ics.uci.edu/ml/datasets/Facebook+Comment+Volume+Dataset
To split the dataset into a feature matrix and a target vector, use the split
method, which also allows splitting into train/test sets:
>>> X, y = dataset.split()
>>> X.shape, y.shape
((199030, 54), (199030,))
>>>
>>> X_train, X_test, y_train, y_test = dataset.split(test_size=0.1)
>>> X_train.shape, X_test.shape, y_train.shape, y_test.shape
((179035, 54), (19995, 54), (179035,), (19995,))
Uncertainty estimates from an existing model
A common scenario is when you have an existing model, usually an off-the-shelf model from scikit-learn, and you would like to produce good uncertainty bounds around your predictions. This would allow you to be more helpful to your clients, who might not be so interested in the point estimate as in roughly what range of outcomes they should expect.
The doubt library handles this using bootstrapping methods, as described in my previous post.
Normally our pipeline would look as follows:
>>> from doubt.datasets import PowerPlant
>>> from sklearn.linear_model import LinearRegression
>>>
>>> X, y = PowerPlant().split()
>>> model = LinearRegression()
>>> model.fit(X, y)
>>> model.predict([[10, 30, 1000, 50]])
array([481.92031021])
We only have to change that to the following:
>>> from doubt.datasets import PowerPlant
>>> from sklearn.linear_model import LinearRegression
>>> from doubt import Boot
>>>
>>> X, y = PowerPlant().split()
>>> model = Boot(LinearRegression())
>>> model.fit(X, y)
>>> model.predict([10, 30, 1000, 50], uncertainty=0.05)
(481.9203102126274, array([473.43314309, 490.0313962 ]))
Here the uncertainty parameter denotes how much uncertainty we allow in our estimates: an uncertainty of 0.0 would mean that our prediction interval has to span all possible values, and an uncertainty of 1.0 would give us a point estimate. A common value is 0.05, giving us the traditional 95% prediction interval.
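To get a feel for the parameter, we can compare how the interval narrows as we allow more uncertainty. Here is a minimal sketch, reusing the fitted Boot model from above (the exact numbers will of course vary):

# Compare interval widths at a few uncertainty levels; a lower
# uncertainty should yield a wider prediction interval
for uncertainty in (0.2, 0.05, 0.01):
    pred, interval = model.predict([10, 30, 1000, 50], uncertainty=uncertainty)
    print(f"uncertainty={uncertainty}: width {interval[1] - interval[0]:.2f}")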
Note that predictions will take longer, as the Boot wrapper computes many predictions instead of a single one, to be able to measure the uncertainty. By default it produces $\sqrt{N}$ predictions, where $N$ is the number of samples in the dataset, but this can be set manually through the n_boots parameter in the predict method:
>>> model.predict([10, 30, 1000, 50], uncertainty=0.05, n_boots=3)
(482.09909346090336, array([473.68305016, 490.16338123]))
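If you want to sanity-check these intervals, one option is to measure their empirical coverage on a held-out test set: roughly 95% of the true values should fall inside a 95% interval. Here is a rough sketch, under the assumption that Boot.predict also accepts a matrix of samples and returns one interval per row:

from doubt import Boot
from doubt.datasets import PowerPlant
from sklearn.linear_model import LinearRegression

# Fit on the training split and predict intervals on the test split
X_train, X_test, y_train, y_test = PowerPlant().split(test_size=0.1)
model = Boot(LinearRegression())
model.fit(X_train, y_train)
preds, intervals = model.predict(X_test, uncertainty=0.05)

# Fraction of true targets falling inside their predicted interval;
# this should be roughly 95% if the intervals are well calibrated
covered = (intervals[:, 0] <= y_test) & (y_test <= intervals[:, 1])
print(f"Empirical coverage: {covered.mean():.1%}")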
Uncertainty estimates with random forests
The above bootstrapping method works really well, but for ensemble models like random forests the predictions become prohibitively slow. This is because of the sheer number of predictions that need to be calculated: if the forest consists of 100 decision trees and we are producing 100 bootstrapped predictions, we suddenly have to compute 10,000 predictions just to get a single prediction interval out.
An alternative, much faster method is to use quantile regression forests. These only need to compute a single prediction for every decision tree in the forest, just as usual, but the idea is that the uncertainties are based on the target values present in the leaf nodes. To read more about this, see my previous post.
Note however that this model requires multiple samples to be present in each leaf node, meaning that to get sensible prediction intervals we need to enforce this by limiting the size of the trees. Here we do that by limiting the number of leaf nodes:
>>> from doubt import QuantileRegressionForest
>>> from doubt.datasets import Concrete
>>> import numpy as np
>>>
>>> X, y = Concrete().split()
>>> model = QuantileRegressionForest(max_leaf_nodes=8)
>>> model.fit(X, y)
>>> model.predict(np.ones(8), uncertainty=0.25)
(16.933590347847982, array([ 8.93456428, 26.0664534 ]))
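To see why this matters, we can compare the intervals from the size-limited forest above with a forest allowed to grow its trees fully; with only one sample per leaf, the leaf quantiles collapse and the intervals become unreliable. A small sketch, reusing X, y and np from above, and assuming that max_leaf_nodes=None grows unrestricted trees, as in scikit-learn:

# Compare intervals with and without a limit on the tree size; unlimited
# trees may end up with a single sample per leaf, collapsing the quantiles
# (assuming max_leaf_nodes=None means no limit, as in scikit-learn)
for max_leaf_nodes in (8, None):
    model = QuantileRegressionForest(max_leaf_nodes=max_leaf_nodes)
    model.fit(X, y)
    pred, interval = model.predict(np.ones(8), uncertainty=0.25)
    print(f"max_leaf_nodes={max_leaf_nodes}: interval {interval}")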
Linear quantile regression
More classical quantile regression methods are also available in doubt, wrapping the corresponding model from the excellent statsmodels library. The procedure is the same as above:
>>> from doubt import QuantileLinearRegression
>>> from doubt.datasets import ForestFire
>>>
>>> X, y = ForestFire().split()
>>> model = QuantileLinearRegression(uncertainty=0.05)
>>> model.fit(X, y)
>>> model.predict([7, 5, 2, 4, 80, 20, 90, 5, 8, 50, 7, 0])
(7.342714665649355, array([2.27209153e-08, 6.38229624e+01]))
Note that a main difference with the quantile regression model is that we have to include the uncertainty parameter in the constructor rather than in the predict method. This is simply because the model needs to be fitted to a specific uncertainty, and will need to be refitted if the uncertainty changes.
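So if we want intervals at, say, the 90% level instead, we construct and fit a fresh model. A small sketch, reusing the ForestFire data from above:

# Fit a separate model for a 90% prediction interval, since the
# uncertainty level is baked in at fitting time
model90 = QuantileLinearRegression(uncertainty=0.10)
model90.fit(X, y)
pred, interval = model90.predict([7, 5, 2, 4, 80, 20, 90, 5, 8, 50, 7, 0])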
Future development
There are several features I would like to implement in the doubt library.
Firstly, prediction intervals for classification tasks. This is not as straightforward as simply treating the predicted probabilities as a regression problem, since the observed targets are hard labels (effectively rounded probabilities), which skews the residuals.
Secondly, I would like to include support for neural networks. The bootstrapped methods still work, but inference is very time-consuming. I have previously covered quantile neural networks, but my implementation of that seems quite fragile, and a more robust version would be useful.
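In the meantime, since the Boot wrapper only assumes a scikit-learn-style fit/predict interface, it should in principle already work with a neural network. Here is a sketch using scikit-learn's MLPRegressor; the choice of network and its settings are my own illustration, not part of doubt, and the caveat from above applies that the many bootstrapped predictions make this slow:

from doubt import Boot
from doubt.datasets import PowerPlant
from sklearn.neural_network import MLPRegressor

X, y = PowerPlant().split()

# Wrap a small neural network in the bootstrap wrapper; the API is the
# same as before, but the many bootstrapped predictions make it slow
model = Boot(MLPRegressor(hidden_layer_sizes=(64,), max_iter=500))
model.fit(X, y)
pred, interval = model.predict([10, 30, 1000, 50], uncertainty=0.05)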