Uncertainty Estimation Using a Single Deep Deterministic Neural Network
Joost van Amersfoort 1 Lewis Smith 1 Yee Whye Teh 2 Yarin Gal 1
1 OATML, Department of Computer Science, University of Oxford  2 Department of Statistics, University of Oxford. Correspondence to: Joost van Amersfoort <[email protected]>.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Abstract

We propose a method for training a deterministic deep model that can find and reject out of distribution data points at test time with a single forward pass. Our approach, deterministic uncertainty quantification (DUQ), builds upon ideas of RBF networks. We scale training in these with a novel loss function and centroid updating scheme and match the accuracy of softmax models. By enforcing detectability of changes in the input using a gradient penalty, we are able to reliably detect out of distribution data. Our uncertainty quantification scales well to large datasets, and using a single model, we improve upon or match Deep Ensembles in out of distribution detection on notably difficult dataset pairs such as FashionMNIST vs. MNIST, and CIFAR-10 vs. SVHN.

Figure 1. Uncertainty results on the two moons dataset: (a) Deep Ensembles, (b) our model, DUQ. Yellow indicates high certainty, while blue indicates uncertainty. DUQ is certain only on the data distribution, and uncertain away from it: the ideal result. Deep Ensembles is uncertain only along the decision boundary, and certain elsewhere.

1. Introduction

Estimating uncertainty reliably and efficiently has remained an open problem with many important applications, such as guiding exploration in Reinforcement Learning (Osband et al., 2016) or selecting data points for which to acquire labels in Active Learning (Houlsby et al., 2011). Until now, most approaches for estimating uncertainty in deep learning rely on ensembling (Lakshminarayanan et al., 2017) or Monte Carlo sampling (Gal & Ghahramani, 2016). In this paper, we introduce a deep model that is able to estimate uncertainty in a single forward pass. We call our model DUQ, Deterministic Uncertainty Quantification, and we construct it by re-examining ideas originally suggested in the 90s. We combine these with recent advances and make a number of improvements that enable scalable training of modern deep learning architectures. We evaluate our model against the current best approach for estimating uncertainty in Deep Learning, Deep Ensembles, and show that DUQ compares favourably on a number of evaluations, such as out of distribution (OoD) detection of FashionMNIST vs. MNIST, and CIFAR-10 vs. SVHN.

We visualise how DUQ performs on the two moons dataset in Figure 1. We see that DUQ is only certain on the training data, and its certainty decreases away from it. Deep Ensembles are not able to obtain meaningful uncertainty on this dataset, because of a lack of diversity in the different models in the ensemble. We make our code publicly available1.

1 https://github.com/y0ast/deterministic-uncertainty-quantification

DUQ consists of a deep model and a set of feature vectors corresponding to the different classes (centroids). A prediction is made by computing a kernel function, a distance function, between the feature vector computed by the model and the centroids. This type of model is called an RBF network (LeCun et al., 1998a), and uncertainty is measured as the distance between the model output and the closest centroid. A data point whose feature vector is far away from all centroids does not belong to any class and can be considered out of distribution. In this paper, we define uncertainty to be predictive uncertainty.

The model is trained by minimising the distance to the correct centroid, while maximising it with respect to the others. This incentivises the model to put the features of training data close to a particular centroid; however, there is no mechanism that dictates what should happen away from the training data. Therefore we need to enforce that DUQ is
sensitive to changes in the input, such that we can reliably detect out of distribution data and avoid mapping out of distribution data to in distribution feature representations — an effect we call feature collapse. The upper bound of this sensitivity can be quantified by the Lipschitz constant of the model. We are interested in models for which this sensitivity is not too low, but also not too high, because that could hurt generalisation and optimisation. DUQ achieves this result by regularising the Jacobian with respect to the input, as was first introduced by Drucker & Le Cun (1992).

In practice, RBF networks prove difficult to optimise, because of instability of the centroids and a saturating loss. We propose to make training stable by updating the centroids using an exponential moving average of the feature vectors of the data points assigned to them, as was introduced in van den Oord et al. (2017). We use a "one vs the rest" loss function, minimising the distance to the correct centroid while maximising the other distances. We find that these two changes stabilise training and lead to accuracies that are similar to the standard softmax and cross entropy set up on standard datasets such as FashionMNIST and CIFAR-10.

Uncertainty quantification in deep neural networks with a softmax output is generally done by measuring the entropy of the predictive distribution, so the maximally uncertain output is achieved by uniformly assigning probabilities over all the classes. The only way to achieve a uniform output for out of distribution data is by training on additional data and hoping it generalises to out of distribution samples at test time. This does not happen in practice, and it is found that the only uncertainty that can reliably be captured by looking at the entropy of the softmax distribution is aleatoric uncertainty (Gal, 2016; Hein et al., 2019). In DUQ, it is possible to predict that none of the classes seen during training is a good fit, when the distance between the model output and all centroids is large.

The contributions of this paper are as follows:

• We stabilise training of RBF networks and show, for the first time, that these types of models can achieve competitive accuracy versus softmax models.

• We show how two-sided Jacobian regularisation makes it possible to obtain reliable uncertainty estimates for RBF networks.

• We obtain excellent uncertainty in a single forward pass, while maintaining competitive accuracy.

2. Methods

Figure 2. A depiction of the architecture of DUQ. The input is mapped to feature space, where it is assigned to the closest centroid. The distance to that centroid is the uncertainty.

DUQ consists of a deep feature extractor, such as a ResNet (He et al., 2016), but without the softmax layer. Instead, we have one learnable weight matrix Wc per class, c. Using the output and the class centroids, we compute the exponentiated distance between the model output and the centroids:

K_c(f_\theta(x), e_c) = \exp\left( -\frac{\frac{1}{n} \| W_c f_\theta(x) - e_c \|_2^2}{2 \sigma^2} \right),    (1)

with f_\theta : \mathbb{R}^m \to \mathbb{R}^d our model, m the input dimension, d the output dimension, and parameters \theta. e_c is the centroid for class c, a vector of length n. W_c is a weight matrix of size n (centroid size) by d (feature extractor output size), and \sigma a hyper parameter sometimes called the length scale. This function is also referred to as a Radial Basis Function (RBF) kernel. The class dependent weight matrix allows feature insensitivity on a class by class basis, minimising the potential for feature collapse. A prediction is made by taking the class c with the maximum correlation (minimum distance) between data point x and class centroids E = \{e_1, \ldots, e_C\}:

\arg\max_c K_c(f_\theta(x), e_c).    (2)

We define the uncertainty in this model as the distance to the closest centroid, i.e. replacing the arg max operator by a max in Equation (2).
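As an illustration, Equations (1) and (2) can be written as a short PyTorch-style sketch. The tensor shapes and the way the per-class weight matrices are stored below are our own choices for this sketch, not necessarily those of the released implementation:

```python
import torch

def rbf_kernel(features, W, centroids, sigma):
    # features: (batch, d) output of the feature extractor f_theta
    # W: (classes, n, d), one weight matrix W_c per class
    # centroids: (classes, n), one centroid e_c per class
    # Equation (1): K_c = exp(-(1/n) * ||W_c f_theta(x) - e_c||^2 / (2 * sigma^2))
    embeddings = torch.einsum("cnd,bd->bcn", W, features)   # (batch, classes, n)
    diff = embeddings - centroids.unsqueeze(0)               # broadcast over the batch
    sq_dist = diff.pow(2).mean(dim=-1)                       # mean over n = (1/n) * squared L2 norm
    return torch.exp(-sq_dist / (2 * sigma ** 2))            # (batch, classes)

def predict(kernel_values):
    # Equation (2): the predicted class is the centroid with the largest kernel value;
    # the certainty score is that maximum, so uncertainty is the distance to the closest centroid.
    certainty, pred_class = kernel_values.max(dim=-1)
    return pred_class, certainty
```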
The loss function is the sum of the binary cross entropy between each class' kernel value K_c(\cdot, e_c) and a one-hot (binary) encoding of the label. For a particular data point \{x, y\} in our data set \{X, Y\}:

L(x, y) = - \sum_c \left[ y_c \log(K_c) + (1 - y_c) \log(1 - K_c) \right],    (3)

where we shorten K_c(f_\theta(x), e_c) as K_c. During training, we average the loss over a minibatch of data points, and perform stochastic gradient descent on \theta and W = \{W_1, \cdots, W_C\}.
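A minimal sketch of Equation (3), treating each class as an independent binary ("one vs the rest") problem over the kernel values; the label encoding and tensor shapes are assumptions made for this sketch:

```python
import torch
import torch.nn.functional as F

def duq_loss(kernel_values, labels, num_classes):
    # kernel_values: (batch, classes), each entry K_c in (0, 1]
    # labels: (batch,) integer class labels
    # Equation (3): binary cross entropy per class, summed over classes,
    # averaged over the minibatch.
    targets = F.one_hot(labels, num_classes).float()
    bce = F.binary_cross_entropy(kernel_values, targets, reduction="none")
    return bce.sum(dim=-1).mean()
```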
The class centroids, E, are updated using an exponential moving average of the feature vectors of data points belonging to that class. If the model parameters, \theta and W, are held constant, then this update rule leads to the closed form solution for the centroids that minimises the loss:
N_{c,t} = \gamma N_{c,t-1} + (1 - \gamma)\, n_{c,t},    (4)

m_{c,t} = \gamma m_{c,t-1} + (1 - \gamma) \sum_i W_c f_\theta(x_{c,t,i}),    (5)

e_{c,t} = \frac{m_{c,t}}{N_{c,t}},    (6)

where n_{c,t} is the number of data points assigned to class c in minibatch t, x_{c,t,i} is element i of a minibatch at time t with class c, and \gamma is the momentum, which we usually set between [0.99, 0.999]. This method of updating centroids was introduced in the Appendix of van den Oord et al. (2017) for updating quantised latent variables. The high momentum leads to stable optimisation that is robust to initialisation.
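A sketch of the update in Equations (4)-(6), kept outside the computation graph since the centroids are not trained by gradient descent. Tensor shapes follow the earlier sketch, the momentum value is only an example from the stated range, and the running statistics are assumed to be initialised to positive values:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def update_centroids(N, m, embeddings, labels, num_classes, gamma=0.999):
    # N: (classes,) exponential moving average of the class counts, Equation (4)
    # m: (classes, n) exponential moving average of the summed class embeddings, Equation (5)
    # embeddings: (batch, classes, n) class-specific embeddings W_c f_theta(x) for the minibatch
    # labels: (batch,) integer class labels
    one_hot = F.one_hot(labels, num_classes).float()                # (batch, classes)
    batch_counts = one_hot.sum(dim=0)                               # n_{c,t}
    batch_sums = torch.einsum("bc,bcn->cn", one_hot, embeddings)    # sum_i W_c f_theta(x_{c,t,i})
    N.mul_(gamma).add_((1 - gamma) * batch_counts)
    m.mul_(gamma).add_((1 - gamma) * batch_sums)
    return m / N.unsqueeze(-1)                                      # e_{c,t}, Equation (6)
```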
The proposed set up leads to the centroids being pushed further away at each minibatch, without converging to a stable point. We avoid this by regularising the l2 norm of \theta. This restricts the model to sensible solutions and aids optimisation.

2.1. Gradient Penalty

As discussed in the introduction, without further regularisation deep networks are prone to feature collapse. We find that it can be avoided by regularising the representation map using a gradient penalty. Gradient penalties were first introduced to aid generalisation in Drucker & Le Cun (1992), who named it "double backpropagation". Recently, this type of penalty has been used successfully in training Wasserstein GANs (Gulrajani et al., 2017) to regularise the Lipschitz constant.

In our set up, we consider the following two-sided penalty:

\lambda \cdot \left[ \left\| \nabla_x \sum_c K_c \right\|_2^2 - 1 \right]^2,    (7)

where \|\cdot\|_2 is the l2 norm and the targeted Lipschitz constant is 1. We found empirically that regularising the gradient of \sum_c K_c works better than regularising the gradient of f_\theta(x) or of K_c(x) (which is the vector of kernel distances for input x). A similar approach was taken for softmax models by Ross & Doshi-Velez (2018).

The two-sided penalty was introduced by Gulrajani et al. (2017), who mention that despite a one-sided penalty being sufficient to satisfy their requirements, the two-sided penalty proved to be better in practice. The one-sided penalty is defined as:

\lambda \cdot \max\left(0, \left\| \nabla_x \sum_c K_c \right\|_F^2 - 1\right).    (8)

In Section 4.1, we show the difference between the one-sided and two-sided penalties experimentally. We find the two-sided penalty to be ideal for enforcing sensitivity, while still allowing strong generalisation.
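A sketch of the penalties in Equations (7) and (8) using torch.autograd.grad. The default weight and the exact placement of the square on the gradient norm follow the equations above; this is an illustration rather than a reproduction of the released code:

```python
import torch

def gradient_penalty(x, kernel_values, two_sided=True, lambda_=1.0):
    # x must be created with requires_grad=True before the forward pass.
    summed = kernel_values.sum(dim=-1)                       # sum_c K_c, one scalar per example
    grads = torch.autograd.grad(
        outputs=summed,
        inputs=x,
        grad_outputs=torch.ones_like(summed),
        create_graph=True,                                   # so the penalty itself can be backpropagated
    )[0]
    grad_norm_sq = grads.flatten(start_dim=1).pow(2).sum(dim=1)
    if two_sided:
        penalty = (grad_norm_sq - 1.0) ** 2                  # Equation (7)
    else:
        penalty = torch.clamp(grad_norm_sq - 1.0, min=0.0)   # Equation (8)
    return lambda_ * penalty.mean()
```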
2.2. Intuition about Gradient Penalty

A gradient penalty enforces smoothness, limiting how quickly the output of a function changes as the input x changes. Smoothness is important for generalisation, especially if we are using a kernel which depends on distances in the representation space. It is simple to show that regularising the l2 norm of the Jacobian, J, enforces a Lipschitz constraint at least locally, since for a small region around x we have g(x + \epsilon) - g(x) \approx J_g(x)\epsilon \leq \|J(x)\|_2 \|\epsilon\|_2.

However, smoothness still leaves us vulnerable to the feature collapse problem outlined earlier, where multiple inputs are mapped to the same g(x). Lipschitz smooth functions can collapse their inputs — the constant function g(x) = c is Lipschitz for any Lipschitz constant L. Collapsing features can be beneficial for accuracy, but it hurts our ability to perform out of distribution detection, since it has the potential to make input points indistinguishable in the representation space. We find empirically in our work that the two-sided penalty is extremely important: using the one-sided penalty, i.e. enforcing only smoothness, is not sufficient to produce the sensitive behaviour we want in our representation. This can be seen in Figure 4b, in contrast to Figure 1b with the two-sided penalty.

By keeping the norm of the Jacobian above some value, intuitively we encourage sensitivity of the learnt function, by preventing it from collapsing to a locally constant function that ignores all changes in the input space. This argument is speculative, as this regularisation scheme has no effect on sensitivity in directions orthogonal to the local Jacobian, and more work is needed to explain definitively why this penalty seems to encourage sensitivity, as it would seem mathematically that collapsing the representation would still be possible. However, we find empirically that it is important for preserving out of distribution performance. In Appendix C, we evaluate a number of alternative approaches, such as using a reversible model as feature extractor (guaranteed to be invertible) and computing the Jacobian with respect to the vector K_c and f_\theta(x).

2.3. Epistemic and Aleatoric Uncertainty

When quantifying uncertainty, it can be useful to distinguish between "epistemic" and "aleatoric" uncertainty. Epistemic uncertainty comes from uncertainty in the parameters of the model. This uncertainty is high for out of distribution data, but also, for example, for informative data points in active learning (Houlsby et al., 2011). Aleatoric uncertainty is uncertainty inherent in the data, such as an image of a 3 that is similar to an 8 (Smith & Gal, 2018). In this case, the true class cannot be determined.

Figure 3. The uncertainty learned by DUQ on a simple problem of classifying samples from two overlapping Gaussian distributions. Yellow indicates certainty, while blue indicates uncertainty. There is significant aleatoric uncertainty due to the overlap between the classes. DUQ can express high aleatoric uncertainty by placing centroids close to each other in feature space, and is able to learn this in practice if the task needs it, as shown here by the higher uncertainty around the 0 mark on the x axis.

In practice, DUQ captures both aleatoric and epistemic uncertainty. Informally, when a point is far from all centroids in feature space there is epistemic uncertainty, while aleatoric uncertainty is expressed by placing centroids close to each other in feature space (see Figure 3) and mapping a data point close to both of them. It is important that the centroids are close in feature space, because otherwise the model would not map the data point in between them, as that incurs a large loss, following Equation (3). We do not currently have a formal way to distinguish between these two kinds of uncertainty in DUQ. Solving this problem is an interesting direction for future research.

2.4. Why Sensitivity can be at odds with Classification

In this section we analyse some of the trade-offs and assumptions encoded in detecting out-of-distribution inputs. We show in a toy experiment that standard classification losses can hurt out-of-distribution detection. Consider fitting a model on a problem with two features, x1 and x2, both sampled from a unit Gaussian, and output y, such that y = sign(x1) · ε, where ε is noise with a low probability of flipping the label. The optimal decision function in terms of the empirical risk, no matter the algorithm, is the function f(x1, x2) = sign(x1). But this says nothing about the out of distribution behaviour. What happens if we now see the input x1, x2 = 1, 1000? By our definition of the problem, this is out of distribution, as it lies many standard deviations away from the observed data. But should it be detected as out of distribution? The data does not define what could be given as the input, at least if we take a conventional empirical risk minimisation approach.

In this situation, it seems natural to prefer the kind of decisions which would be made by a generative model, for example. If x1 and x2 represent medical data, then presumably a highly abnormal value for x2 is notable, and we would like to detect it. However, if x2 is a truly irrelevant variable, say, the temperature on the surface of a distant planet, then presumably our model is correct to ignore its value, even if the value of the irrelevant variable is highly abnormal. When training using empirical risk minimisation, features not relevant to classification accuracy can simply be ignored by the feature extractors of a neural network. This makes out-of-distribution detection more difficult using feature space methods, even those that use a distance loss as we do. It is important to note that there is a potential tension here with classification accuracy. Enforcing sensitivity can make accurate classification harder because it forces the model to represent changes in the input — as in the example above, these may be irrelevant to the causal structure of the problem. If we know about invariances that are appropriate for the problem at hand, we can enforce these by corresponding construction of the network. For example, we enforce translation invariance by using convolutional networks in this paper.
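The toy problem described above can be reproduced in a few lines. The sketch below is our own construction (a regularised logistic regression rather than a neural network; the l1 penalty mimics a feature extractor that drops the irrelevant x2 entirely) and illustrates that empirical risk minimisation gives a confident prediction for the out of distribution input (1, 1000), with nothing flagging it as abnormal:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 10_000
X = rng.standard_normal((n, 2))              # x1, x2 ~ N(0, 1); x2 is irrelevant to the label
flip = rng.random(n) < 0.05                  # label noise with a low flip probability
y = ((X[:, 0] > 0) ^ flip).astype(int)       # y = sign(x1), occasionally flipped

# The l1 penalty drives the coefficient of the uninformative x2 to (near) zero.
clf = LogisticRegression(penalty="l1", solver="liblinear").fit(X, y)

# (1, 1000) is hundreds of standard deviations out of distribution in x2,
# yet the classifier returns a confident prediction based on x1 alone.
print(clf.predict_proba([[1.0, 1000.0]]))
```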
3. Related Work

The largest body of research on obtaining uncertainty in deep learning is on Bayesian neural networks (MacKay, 1992; Neal, 2012). While exact inference in them is intractable, a range of approximate methods have been proposed. Mean-field variational inference methods, such as Bayes by Backprop (Blundell et al., 2015) and Radial BNNs (Farquhar et al., 2020), are a promising direction but have not yet led to stable training on large image datasets. A more scalable alternative is MC Dropout (Gal & Ghahramani, 2016), which is very simple to implement and evaluate. In practice, these variational Bayesian methods are outperformed by Deep Ensembles (Lakshminarayanan et al., 2017). This is a simple, non-Bayesian method that involves training multiple deep models from different initialisations and a different data set ordering. Snoek et al. (2019) showed that Deep Ensembles consistently outperform Bayesian neural networks that were trained using variational inference. This performance comes at the expense of computational cost: Deep Ensembles' memory and compute use scales linearly with the number of ensemble elements at both train and test time.

Aside from using discriminative models, there have also been attempts at finding out of distribution data using generative models. Nalisnick et al. (2019a) showed that simply measuring the likelihood under the data distribution does not work. Recently, a more advanced approach that involves separating the likelihood of the semantic foreground from the background did show promising results on selected datasets (Ren et al., 2019). While generative models are a
promising avenue for out of distribution detection, they are not able to assess predictive uncertainty: given that a data point is in distribution, can our discriminative model actually make a reliable prediction? Further, generative models are significantly more expensive to train than classification models.

Our approach is distinct from both ensembles/Monte Carlo methods, which aim to find different explanations for the data and increase uncertainty when these disagree, and generative models, which model the data distribution directly. Instead, our approach is more related to pre-deep learning kernel methods (Quinn & Sugiyama, 2014; Schölkopf et al., 2000), such as Gaussian processes, which revert to a prior away from data, and Support Vector Machines, where the distance to the separating hyperplane is informative of the uncertainty. These approaches have never scaled to high dimensional data, because of a lack of well performing kernel functions.

The decision function based on kernel distances was first used in the context of convolutional neural networks by LeCun et al. (1998a). These models were quickly abandoned for softmax models, because they were difficult to scale and optimise with gradient-based approaches, due to saturating gradients and unstable centroids. Notable improvements in our work over the original are the updating mechanism of the centroids and the loss function that is based on a multivariate Bernoulli, solving the problems of unstable centroids and saturating gradients.

Regularising the Jacobian has a long history, starting with Drucker & Le Cun (1992) and more recently Ross & Doshi-Velez (2018). Both papers aim to regularise the l2 norm of the Jacobian down to zero: in the first case to obtain better generalisation, while the second paper aims to achieve adversarial robustness and interpretability. In neither case are the authors interested in increasing the Jacobian. Gulrajani et al. (2017) showed how a gradient penalty can be applied to training GANs with the Wasserstein distance, which was a more scalable and simpler alternative to weight clipping. They use the double-sided penalty and mention it works better in practice. Follow up work has analysed the penalty in more detail and concluded that, contrary to our case, for training Wasserstein GANs the one-sided penalty is preferable theoretically and practically (Jolicoeur-Martineau & Mitliagkas, 2019; Petzka et al., 2017).

4. Experiments

We show the behaviour of DUQ in two dimensions, using the two moons dataset, and show the effect of leaving out the gradient penalty and of using a one-sided penalty. We continue by looking at the out of distribution detection performance for some notably difficult data set pairs (Nalisnick et al., 2019a), such as FashionMNIST vs. MNIST, and CIFAR-10 vs. SVHN. We further study sensitivity to two important hyper parameters, the length scale σ and the gradient penalty weight λ, and propose how to tune them without relying on example OoD data.

Figure 4. Uncertainty results for two variations of DUQ: (a) without gradient penalty, and (b) with a one-sided gradient penalty (λ = 1). Yellow indicates certainty, while blue indicates uncertainty. Both results are significantly worse than DUQ with a two-sided penalty.

4.1. Two Moons

We use the scikit-learn (Pedregosa et al., 2011) implementation of this dataset and describe the model architecture and optimisation details in Appendix A.1. For colouring the visualisations, we normalise the colour map within the figure.
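For reference, the dataset itself is generated with scikit-learn; the sample count and noise level below are placeholders, since the exact settings are given in Appendix A.1, which is not reproduced here:

```python
from sklearn.datasets import make_moons

# Two interleaving half-circles in 2D; X has shape (1000, 2), y in {0, 1}.
X, y = make_moons(n_samples=1000, noise=0.1, random_state=0)
```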
The result of our model trained with a two-sided gradient penalty is shown in Figure 1b. The uncertainty is exactly as one would expect for the two moons dataset: certain on the training data, uncertain away from it and in the heart within the two moons. The difference with Deep Ensembles is striking (Figure 1a). The uncertainty for DUQ is quantified as the distance to the closest centroid (the max over the kernel distances), while the uncertainty for Deep Ensembles is computed as the predictive entropy of the average output; see Appendix B. The ensemble elements were trained separately using the same model as described in Appendix A.1, but without L2 regularisation, to encourage diverse solutions.
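For the Deep Ensembles baseline, the uncertainty is the predictive entropy of the averaged softmax output. A minimal sketch, assuming the per-member logits are already collected:

```python
import torch

def ensemble_entropy(member_logits):
    # member_logits: (members, batch, classes) logits from each ensemble element
    probs = torch.softmax(member_logits, dim=-1).mean(dim=0)     # average predictive distribution
    return -(probs * torch.log(probs + 1e-12)).sum(dim=-1)       # entropy per data point
```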
Discussion While Figure 1b is an impressive result in deep learning, it is worth highlighting that Gaussian processes are able to obtain such a result too. A good visualisation can be found in Bradshaw et al. (2017). Interestingly, even though Deep Ensembles have been successfully applied to many large datasets (Snoek et al., 2019), they fail to estimate uncertainty well on the two moons dataset. This is due to the simplicity and low dimensionality of this dataset: the ensemble elements generalise in nearly the same way — with a diagonal line dividing the top left and the bottom right.

Gradient Penalty In Section 2.1, we introduced the two-sided gradient penalty. Figure 4 shows why it is important. In Figure 4a, we show the result of having no gradient
penalty, which shows that the model is certain even far away from the data. In Figure 4b, we see that the uncertainty does not improve when only a one-sided penalty is applied. In both cases, there are 'blobs' sticking out of the training data domain that are also classified with high certainty.

Hyper parameters We found classification performance on two moons to be insensitive to our setting of the gradient penalty weight λ, likely because of the simplicity of the two moons dataset. For the uncertainty visualisation, we found it important to set the length scale to be small (in the interval [0.05, 0.5]), despite accuracy not being affected by this hyper parameter. In the following experiments, we will discuss methods for picking the length scale and the weight of the gradient penalty.

4.2. FashionMNIST vs MNIST

In this experiment, we assess the quality of our uncertainty estimation by looking at how well we can separate the test set of FashionMNIST (Xiao et al., 2017) from the test set of MNIST (LeCun et al., 1998b) by looking only at the uncertainty predicted by the model. We train our model on FashionMNIST and we expect it to assign low uncertainty to the FashionMNIST test set, but high uncertainty to MNIST, since the model has never seen that dataset before and it is very different from FashionMNIST.

During evaluation we compute uncertainty scores on both test sets and measure, for a range of thresholds, how well the two are separated. As in previous work (Ren et al., 2019), we report the AUROC metric, where a higher value is better and 1 indicates that all FashionMNIST data points have a higher certainty than all MNIST data points. We picked FashionMNIST vs. MNIST because it is a notably difficult dataset pair (Nalisnick et al., 2019a), while MNIST vs. NotMNIST (Bulatov, 2011) is much simpler.
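Concretely, the AUROC is computed from the per-example certainty scores of the two test sets. A sketch using scikit-learn, where the score arrays are assumed to come from the max-kernel certainty described in Section 2:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def ood_auroc(certainty_in, certainty_ood):
    # Label in-distribution points 1 and OoD points 0: an AUROC of 1 means every
    # FashionMNIST point received a higher certainty than every MNIST point.
    scores = np.concatenate([certainty_in, certainty_ood])
    labels = np.concatenate([np.ones_like(certainty_in), np.zeros_like(certainty_ood)])
    return roc_auc_score(labels, scores)
```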
Experimental set up Our model is a three layer convolutional network and we report all architectural and optimisation details in Appendix A.2. It is important to note that at test time we set Batch Normalization to evaluation mode, meaning that we use the mean and standard deviation of the feature activations computed from the training set (i.e. FashionMNIST). It is unlikely that in practice we would get an entire batch of (uncorrelated) OoD points, so we cannot normalise using test time batch statistics2. Further, we use the same data normalisation for the out of distribution set as for the in distribution set. Skipping either of these steps makes the problem artificially simple.

2 Just one datapoint needs to have significantly different activation statistics for the entire batch to be easily detectable.

Figure 5. ROC curve for DUQ trained on FashionMNIST and evaluated on FashionMNIST and MNIST. The task is to separate these data sets based on uncertainty estimates.

Length Scale Most hyper parameters, such as the learning rate or weight decay parameter, can be set using the standard train/validation split. However there are two hyper parameters that are particularly important: the length scale σ and the gradient penalty weight λ. We set the length scale by doing a grid search over the interval (0, 1] while keeping λ = 0. We pick the value that leads to the highest validation accuracy. Following this process, we found that a length scale of 0.1 leads to the highest accuracy, as measured over five runs. While this process might not result in a length scale that leads to the best OoD performance, it works well in practice.

Gradient Penalty Setting the λ parameter is more involved: from Section 2.4, we know that the accuracy can suffer as a result of gaining the ability to do out of distribution detection, so we cannot rely on it to select the best λ. We also cannot use the AUROC score on the MNIST dataset, because that would give the method an unfair advantage: we cannot assume access to the OoD set in advance in practice.3 Instead we use a third dataset on which we evaluate the AUROC and select our λ values based on that. We follow previous work (Ren et al., 2019) and use NotMNIST as the third dataset for this pair. The results can be seen in Table 1. As expected, the accuracy goes down as λ increases, and we also observe that the best AUROC result for NotMNIST coincides with the best score for MNIST, which shows that the strategy of selecting a hyper parameter based on the NotMNIST data set is reasonable. We note that while NotMNIST generalises to MNIST, we cannot rely on this property in general. Therefore, we propose an alternative method for model selection based on predictive uncertainty in Section 4.3.

3 If we do assume access, then we can trivially train a binary classifier on the original and OoD set.

Comparison We show our results and compare with alternative methods in Table 2. Our proposed method, DUQ, outperforms all other classification based methods. The only method that is better is LL ratio (Ren et al., 2019), which is based on generative models. These types of models are more
computationally costly to train than DUQ. The PixelCNN++ (Salimans et al., 2017) used by LL ratio for FashionMNIST uses 2 blocks of 5 gated ResNet layers, while our model is a simple three layer convolutional network. An alternative, competitive approach is Mahalanobis Distance (Lee et al., 2018), which computes a distance in the feature space of a pretrained softmax/cross entropy model in combination with a number of dataset specific augmentations. The method relies on hyper parameter tuning using 1,000 samples of the out of distribution dataset.

Method                        AUROC
DUQ                           0.955
LL ratio (generative model)   0.994
Single model                  0.843
5 - Deep Ensembles (ours)     0.861
5 - Deep Ensembles (ll)       0.839
Mahalanobis Distance (ll)     0.942

Table 2. Results on FashionMNIST, with MNIST as OoD set. Deep Ensembles is by Lakshminarayanan et al. (2017), Mahalanobis Distance by Lee et al. (2018), LL ratio by Ren et al. (2019). Results marked by (ll) are obtained from Ren et al. (2019); (ours) is implemented using our architecture. Single model is our architecture, but trained with softmax/cross entropy.

The difference in AUROC between our Deep Ensemble result and Ren et al. (2019)'s is due to using different architectures. For a fair comparison, we use the same architecture for the ensemble elements as for DUQ (replacing the class dependent final layer by the usual single linear layer). In Figure 5, we show the complete ROC curve for our implementation of Deep Ensembles and DUQ. We see that DUQ outperforms Deep Ensembles at all chosen rates.
Accuracy and Gradient Penalty To confirm that training using DUQ's distance based output achieves competitive accuracy, we train two models using our architecture: the standard softmax and cross entropy set up, and DUQ with λ = 0. We obtained 92.4% ± 0.1 accuracy for the softmax model, and for our proposed set up 92.4% ± 0.2, both averaged over five runs. The results show that we can obtain competitive accuracy using DUQ, resolving previous problems with RBF networks. In Table 1, we show how accuracy changes for an increasingly weighted gradient penalty. The accuracy only degrades slightly, while AUROC is improved.

λ       Acc (FM)       AUROC (NM)       AUROC (M)
0       92.4% ± .2     0.933 ± .009     0.948 ± .004
0.05    92.4% ± .2     0.946 ± .018     0.955 ± .007
0.1     92.4% ± .1     0.938 ± .0018    0.948 ± .005
0.2     92.2% ± .1     0.945 ± .019     0.944 ± .011
0.3     92.3% ± .1     0.944 ± .013     0.941 ± .011
0.5     92.0% ± .1     0.946 ± .014     0.932 ± .009
1.0     91.9% ± .1     0.945 ± .018     0.934 ± .006

Table 1. FM stands for FashionMNIST, NM for NotMNIST, and M for MNIST. The results are mean/std computed from 5 experiment repetitions. We show AUROC for separating FashionMNIST from NotMNIST and MNIST; higher is better. We see that the gradient penalty improves AUROC performance slightly, but performance on this dataset pair is already very strong.

Figure 6. Rejection classification plot: accuracy on a combination of FashionMNIST and MNIST test sets. The x-axis indicates the proportion of data rejected based on the uncertainty score. The theoretical maximum is computed from a classifier that has 100% accuracy on FashionMNIST and rejects all MNIST points first.

Rejection Classification In Figure 6, we visualise how well these algorithms work in a more realistic scenario. We combine the FashionMNIST and MNIST test sets, then we reject a certain portion of the combined dataset by uncertainty score. Next we compute the accuracy on the remaining data for each portion, considering all predictions on the OoD MNIST set to be incorrect. We expect the accuracy to go up as we reject more of the data points on which the model is uncertain. Ideally, we reject the incorrectly classified FashionMNIST points and all MNIST points. The Theoretical Maximum is computed by assuming a model that has perfect accuracy on the FashionMNIST test set and is able to reject all MNIST data before any FashionMNIST data. This experiment combines out of distribution detection with detecting difficult to classify data points, which is closer to actual deployment scenarios than the AUROC metric, and is also a practically informed evaluation method suggested by Filos et al. (2019). Note that the ensemble model has an accuracy of 93.6% on FashionMNIST, giving it a 1.2% head start on DUQ, which has an accuracy of 92.4%. We see that DUQ outperforms Deep Ensembles in this more realistic scenario.
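The rejection curve of Figure 6 can be computed directly from the certainty scores and the predictions. The sketch below is our own, with OoD predictions counted as wrong as described above; it returns the accuracy on the retained data for each rejection fraction:

```python
import numpy as np

def rejection_curve(certainty, correct, fractions=None):
    # certainty: (N,) certainty score for the combined test set
    # correct: (N,) 1 if the prediction is correct, 0 otherwise (always 0 for OoD points)
    if fractions is None:
        fractions = np.linspace(0.0, 0.9, 10)
    order = np.argsort(certainty)                 # least certain first
    accuracies = []
    for frac in fractions:
        kept = order[int(frac * len(order)):]     # reject the `frac` least certain points
        accuracies.append(correct[kept].mean())
    return np.array(accuracies)
```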
4.3. CIFAR-10 vs SVHN

In this section we look at the CIFAR-10 dataset (Krizhevsky et al., 2014), with SVHN (Netzer et al., 2019) as OoD set. We use a ResNet-18 (He et al., 2016) as feature extractor fθ(·), specifically the version provided by PyTorch (Paszke et al., 2017) with some minor modifications: we use 64 filters in the first convolutional layer, and skip the first pooling operation and the last linear layer. CIFAR-10 is a difficult dataset for out of distribution detection for several reasons. There is a significant amount of data noise: some of the dog and cat examples are not distinguishable using only 32 by 32 pixels. The training set is small compared to its complexity, making it easy to overfit without data augmentation.

Experimental set up As in the previous section, we tune the length scale using the accuracy on the validation set, and find that 0.1 works best from a range of [0.05, 1]. We train for a fixed 75 epochs and reduce the learning rate by a factor of 0.2 at 25 and 50 epochs. We use random horizontal flips and random crops as data augmentation and find that this is enough regularisation to prevent the model from overfitting. All architectural and optimisation details are described in Appendix A.3. We obtain an accuracy of 94.1% ± 0.2 using the standard softmax/cross entropy loss. A Deep Ensemble of several softmax models obtains an accuracy of 95.2%. DUQ without a gradient penalty (λ = 0) obtains 94.2% ± 0.2 accuracy, while the accuracy of DUQ with λ = 0.5 is 93.2% ± 0.4.

Gradient Penalty For CIFAR-10, we do not use a third dataset to set λ. Instead, we avoid using more data and use in-distribution uncertainty. We measure this using the AUROC of detecting correctly and incorrectly classified validation set data points using the predicted uncertainty. We found that optimising λ using this procedure also transfers to λ values that lead to strong out of distribution detection performance. In general, this approach is preferable over using a third dataset, because it is difficult to find an appropriate out of distribution dataset which has the same characteristics as those encountered during deployment. Imagine a particularly difficult traffic situation or an MRI scan which shows a new type of disease: these scenarios have no reasonable out of distribution set available. Generative models are not able to take this approach, because they do not have predictive uncertainty. Even if we use a hybrid model (Nalisnick et al., 2019b), then the discriminative part, a softmax/cross entropy model, does not have reliable predictive uncertainty.

Results In Figure 7, we show a normalised histogram of the kernel distances of CIFAR-10 and SVHN. We see that most of CIFAR-10 is very close to 1, while SVHN is uniformly spread out over the range of distances. This shows that DUQ works as expected and that out of distribution data ends up away from all of the centroids in feature space.

Figure 7. A histogram of uncertainty estimates as computed using DUQ (λ = 0.5). CIFAR-10 and SVHN are clearly separated. The counts are normalised, because the SVHN test set is significantly larger than CIFAR-10's.

The rejection classification plot, Figure 8, is created similarly to the previous experiment in the last section. Note that this time the Theoretical Maximum line is significantly lower, because the SVHN test set contains close to 26,000 elements, while CIFAR-10's only contains 10,000. This means that the best possible accuracy when 100% of the data is considered is about 28%. We see that DUQ and Deep Ensembles perform similarly.

Figure 8. Rejection classification plot, which shows model performance on a mix of CIFAR-10 and SVHN, while rejecting uncertain points. The theoretical maximum is achieved when a hypothetical classifier obtains 100% accuracy on CIFAR-10 and rejects all SVHN data points first. We see that DUQ and a 5 element Deep Ensemble perform very similarly.

In Table 3, we compare DUQ with several alternative methods. We see that DUQ performs competitively with a number of recent approaches. Interestingly, on these more complicated data sets Deep Ensembles performs the best. We suspect this is because the complexity of the data set allows the ensemble elements to be more diverse while still explaining the data well.

We further see a significant gap between DUQ with and without a gradient penalty: there is a big improvement going from λ = 0 to λ = 0.5. We suspect this is because there is a lot of within class variation, which incentivises the model to collapse more diverse data points to the class centroids.
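The in-distribution criterion used above to select λ only needs the validation set. A minimal sketch, assuming the per-point certainty scores and correctness labels are available; the λ with the highest score would be chosen:

```python
from sklearn.metrics import roc_auc_score

def in_distribution_auroc(correct_val, certainty_val):
    # AUROC of separating correctly from incorrectly classified validation points
    # using the predicted certainty: higher is better, and no OoD data is required.
    return roc_auc_score(correct_val, certainty_val)
```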
Method                         AUROC
DUQ (λ = 0.5)                  0.927 ± 0.013
DUQ (λ = 0)                    0.861 ± 0.032
LL ratio (generative model)    0.930
Single model                   0.906 ± 0.007
3 - Deep Ensembles             0.926 ± 0.010
5 - Deep Ensembles             0.933 ± 0.008
10 - Deep Ensembles            0.941
15 - Deep Ensembles            0.942

Table 3. Deep Ensembles is by Lakshminarayanan et al. (2017), but re-implemented and evaluated using our architecture. LL ratio is as reported in Ren et al. (2019). Single model is our architecture, but trained with softmax/cross entropy. We show the AUROC for separating CIFAR-10 from SVHN.

Runtime One of the main advantages of DUQ over Deep Ensembles is computational cost. For Deep Ensembles, both computation and memory cost scale linearly in the number of ensemble components, during both train and test time. DUQ has to compute the Jacobian at training time, which is expensive, but at test time there is only a marginal overhead over a softmax based model. Training for one epoch on a modern 1080 Ti GPU takes 21 seconds for a softmax/cross entropy model, which leads to 105 seconds for a Deep Ensemble with 5 components. DUQ with gradient penalty needs 103 seconds for one epoch at training time, but only 27 seconds without gradient penalty. DUQ is 25% slower at test time than a single softmax/cross entropy model, but about 4 times faster than a Deep Ensemble with 5 components.

5. Conclusion

We introduced DUQ, Deterministic Uncertainty Quantification, a simple method for obtaining uncertainty using a deep neural network in a single forward pass. Evaluations show that our method is better in some scenarios and competitive in others with the more computationally expensive Deep Ensembles.

Interesting future work would be to place DUQ in a probabilistic framework, enabling a calibrated notion of uncertainty and a rigorous way of separating out epistemic and aleatoric uncertainty.

6. Acknowledgements

We thank Andreas Kirsch, Luisa Zintgraf, Bas Veeling, Milad Alizadeh, Christos Louizos, and Bobby He for helpful discussions and feedback. We also thank the rest of OATML for feedback at several stages of the project. JvA/LS are grateful for funding by the EPSRC (grant references EP/N509711/1 and EP/L015897/1, respectively). JvA is also grateful for funding by Google-DeepMind.

References

Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. Weight uncertainty in neural network. In International Conference on Machine Learning, pp. 1613–1622, 2015.

Bradshaw, J., Matthews, A. G. d. G., and Ghahramani, Z. Adversarial examples, uncertainty, and transfer testing robustness in Gaussian process hybrid deep networks. arXiv preprint arXiv:1707.02476, 2017.

Bulatov, Y. notMNIST dataset. Tech. Rep. [Online]. Available: http://yaroslavvb.blogspot.com/2011/09/notmnist-dataset.html, 2011.

Dinh, L., Krueger, D., and Bengio, Y. NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.

Drucker, H. and Le Cun, Y. Improving generalization performance using double backpropagation. IEEE Transactions on Neural Networks, 3(6):991–997, 1992.

Farquhar, S., Osborne, M., and Gal, Y. Radial Bayesian neural networks: Beyond discrete support in large-scale Bayesian deep learning. Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics, 2020.

Filos, A., Farquhar, S., Gomez, A. N., Rudner, T. G., Kenton, Z., Smith, L., Alizadeh, M., de Kroon, A., and Gal, Y. A systematic comparison of Bayesian deep learning robustness in diabetic retinopathy tasks. arXiv preprint arXiv:1912.10481, 2019.

Gal, Y. Uncertainty in deep learning. University of Cambridge, 1:3, 2016.

Gal, Y. and Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pp. 1050–1059, 2016.

Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. C. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pp. 5767–5777, 2017.

He, K., Zhang, X., Ren, S., and Sun, J. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034, 2015.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
Hein, M., Andriushchenko, M., and Bitterwolf, J. Why ReLU networks yield high-confidence predictions far away from the training data and how to mitigate the problem. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 41–50, 2019.

Hoffman, J., Roberts, D. A., and Yaida, S. Robust learning with Jacobian regularization. arXiv preprint arXiv:1908.02729, 2019.

Houlsby, N., Huszár, F., Ghahramani, Z., and Lengyel, M. Bayesian active learning for classification and preference learning. arXiv preprint arXiv:1112.5745, 2011.

Hutchinson, M. F. A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Communications in Statistics-Simulation and Computation, 19(2):433–450, 1990.

Jacobsen, J.-H., Smeulders, A., and Oyallon, E. i-RevNet: Deep invertible networks. In ICLR 2018 - International Conference on Learning Representations, 2018.

Jolicoeur-Martineau, A. and Mitliagkas, I. Connections between support vector machines, Wasserstein distance and gradient-penalty GANs. arXiv preprint arXiv:1910.06922, 2019.

Krizhevsky, A., Nair, V., and Hinton, G. The CIFAR-10 dataset. Online: http://www.cs.toronto.edu/kriz/cifar.html, 55, 2014.

Lakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pp. 6402–6413, 2017.

LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998a.

LeCun, Y., Cortes, C., and Burges, C. J. The MNIST database of handwritten digits, 1998. URL http://yann.lecun.com/exdb/mnist, 10:34, 1998b.

Lee, K., Lee, K., Lee, H., and Shin, J. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In Advances in Neural Information Processing Systems, pp. 7167–7177, 2018.

MacKay, D. J. Bayesian methods for adaptive models. PhD thesis, California Institute of Technology, 1992.

Nalisnick, E., Matsukawa, A., Teh, Y. W., Gorur, D., and Lakshminarayanan, B. Do deep generative models know what they don't know? In International Conference on Learning Representations, 2019a. URL https://openreview.net/forum?id=H1xwNhCcYm.

Nalisnick, E., Matsukawa, A., Teh, Y. W., Gorur, D., and Lakshminarayanan, B. Hybrid models with deep and invertible features. arXiv preprint arXiv:1902.02767, 2019b.

Neal, R. M. Bayesian learning for neural networks, volume 118. Springer Science & Business Media, 2012.

Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. The street view house numbers (SVHN) dataset, 2019.

Osband, I., Blundell, C., Pritzel, A., and Van Roy, B. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems, pp. 4026–4034, 2016.

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in PyTorch. 2017.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825–2830, 2011.

Petzka, H., Fischer, A., and Lukovnicov, D. On the regularization of Wasserstein GANs. arXiv preprint arXiv:1709.08894, 2017.

Quinn, J. A. and Sugiyama, M. A least-squares approach to anomaly detection in static and sequential data. Pattern Recognition Letters, 40:36–40, 2014.

Ren, J., Liu, P. J., Fertig, E., Snoek, J., Poplin, R., Depristo, M., Dillon, J., and Lakshminarayanan, B. Likelihood ratios for out-of-distribution detection. In Advances in Neural Information Processing Systems 32, pp. 14680–14691, 2019.

Ross, A. S. and Doshi-Velez, F. Improving the adversarial robustness and interpretability of deep neural networks by regularizing their input gradients. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

Salimans, T., Karpathy, A., Chen, X., and Kingma, D. P. PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517, 2017.

Schölkopf, B., Williamson, R. C., Smola, A. J., Shawe-Taylor, J., and Platt, J. C. Support vector method for novelty detection. In Advances in Neural Information Processing Systems, pp. 582–588, 2000.

Smith, L. and Gal, Y. Understanding measures of uncertainty for adversarial example detection. arXiv preprint arXiv:1803.08533, 2018.

Snoek, J., Ovadia, Y., Fertig, E., Lakshminarayanan, B., Nowozin, S., Sculley, D., Dillon, J., Ren, J., and Nado, Z. Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift. In Advances in Neural Information Processing Systems 32, pp. 13969–13980. Curran Associates, Inc., 2019.

van den Oord, A., Vinyals, O., et al. Neural discrete representation learning. In Advances in Neural Information Processing Systems, pp. 6306–6315, 2017.

Xiao, H., Rasul, K., and Vollgraf, R. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.