Chapter 6 - Feedforward Deep Networks

Download as pdf or txt
Download as pdf or txt
You are on page 1of 27

Chapter 6

Feedforward Deep Networks


6.1

Formalizing and Generalizing Neural Networks

Feedforward supervised neural networks were among the rst and most successful learning algorithms (Rumelhart et al., 1986d,c). They are also called deep networks, multilayer Perceptron (MLP), or simply neural networks and the vanilla architecture with a
single hidden layer is illustrated in Figure 6.1.

Figure 6.1: Vanilla (shallow) MLP, with one sigmoid hidden layer, computing vectorvalued hidden unit vector h = sigmoid(c + W x) with weight matrix W and bias or
oset vector c. The output vector is obtained via another learned ane transformation
y = b + V h, with weight matrix V and output bias (oset) vector b.
MLPs can learn powerful non-linear transformations: in fact, with enough hidden
units they can represent arbitrarily complex but smooth functions (see Section 6.4).
This is achieved by composing simpler but still non-linear learned transformations. By
104

transforming the data non-linearly into a new space, a classication problem that was
not linearly separable (not solvable by a linear classier) can become separable, as
illustrated in Figure 6.2.

Figure 6.2: Each layer of a trained neural network non-linearly transforms its input,
distorting the space so that the task becomes easier to perform, e.g., linear classication
in the above gures. Top: a vanilla neural network with 2 hidden units in its single
hidden layer can transform the 2-D input space so that the two classes (in red/pink vs
blue) become linearly separable (the red and blue curves correspond to the manifold
near which examples concentrate). Bottom: with a larger hidden layer (100 here),
the MNIST digit images (with 28 28 = 784 pixels) can be transformed so that the
classes (each shown with a dierent color) can be much more easily classied by the
output linear+softmax layer (over 10 digit categories). Both gures are reproduced with
permission by Chris Olah from http://colah.github.io/, where many more insightful
visualizations can be found.

105

In addition to covering the basics of such networks, this chapter introduces a general
formalism for gradient-based optimization of parametrized families of functions, often
in the context of conditional maximum likelihood criteria (Section 6.2).
They bring together a number of important machine learning concepts already introduced in the previous chapters:
Dene a parametrized family of functions f describing how the learner will
behave on new examples, i.e., what output f (x) will produce given some input
x. Training consists in choosing the parameter (usually represented by a vector)
given some training examples (x, y) sampled from an unknown data generating
distribution P (X, Y ).
Dene a loss function L describing what scalar loss L( y , y) is associated with
each supervised example (x, y), as a function of the learners output y = f (x)
and the target output y.
Dene a training criterion and a regularizer. The objective of training is
ideally to minimize the expected loss EX,Y [L(f(X), Y )] over X, Y sampled from
the unknown data generating distribution P (X, Y ). However this is not possible
because the expectation makes use of the true underlying P (X, Y ) but we only
have access to a nite number of training examples, i.e. of pairs (X,Y ). Instead,
one denes a training criterion which usually includes an empirical average of the
loss over the training set, plus some additional terms (called regularizers) which
enforce some preferences over the choices of .
Dene an optimization procedure to approximately minimize the training criterion1 . The simplest such optimization procedure is stochastic gradient descent,
described in Section 4.3.
Example 6.1.1 illustrates these concepts for the case of a vanilla neural network for
regression.
In chapter 17, we consider generalizations of the above framework to the unsupervised and semi-supervised cases, where Y may not exist or may not always be present.
An unsupervised loss can then be dened solely as a function of the input x and some
function f (x) that one wishes to learn.
1
It is generally not possible to analytically obtain a global minimum of the training criterion, and
iterative numerical optimization methods are used instead.

106

Example 6.1.1. Vanilla (Shallow) Multi-Layer Neural Network for


Regression

Based on the above denitions, we could pick the family of input-output functions to be
f (x) = b + V sigmoid(c + W x),
illustrated in Figure 6.1, where sigmoid(a) = 1/(1 + ea) is applied elementwise, the input is the vector x Rni , the hidden layer outputs are the elements
of the vector h = sigmoid(c + W x) with nh entries, the parameters are =
(b, c, V , W ) (viewed as the attened vectorized version of the tuple) with
b R no a vector the same dimension as the output (n o), c Rnh of the same
dimension as h (number of hidden units), V R nonh and W Rn h ni being
weight matrices.
The loss function for this classical example could be the squared error
L(
y , y) = ||
y y||2 (see Section 5.8.1 showing that it makes y an estimator of E[Y |x]). The regularizer could be the ordinary L2 weight decay
P
P
|||| 2 = ( ij W 2ij + ki Vki2 ), where we dene the set of weights as the concatenation of the elements of matrices W and V . The L 2 weight decay thus
penalizes the squared norm of the weights, with a scalar that is larger to penalize stronger and yield smaller weights. Combining the loss function and the
regularizer gives the training criterion, which is the objective function during
training:
n

C() = ||||2 +

1 X (t)
||y (b + V sigmoid(c + W x(t) ))||2 .
n t=1

where (x (t) , y(t) ) is the t-th training example, an (input,target) pair. Finally,
the classical training procedure in this example is stochastic gradient descent,
which iteratively updates according to

2 + L(f (x (t)), y (t) )

L(f (x(t) ), y(t) ),


where = (b, c) contains the oset (bias) parameters, = (W , V ) the weight
matrices, is a learning rate and t is incremented after each training example,
modulo n.

107

6.2

Parametrizing a Learned Predictor

There are many ways to dene the family of input-output functions, loss function,
regularizer and optimization procedure, and the most common ones are described below,
while more advanced ones are left to later chapters, in particular Chapters 10 and 16.

6.2.1

Family of Functions

A motivation for the family of functions dened by multi-layer neural networks is to


compose anetransformations andnon-linearities, where the choice of parameter values
controls the degree and location of these non-linearities. As discussed in Section 6.4
below, with the appropriate choice of parameters, multi-layer neural networks can in
principle approximate smooth functions, with more hidden units allowing one to achieve
better approximations.
A multi-layer neural network with more than one hidden layer can be dened by
generalizing the above structure, e.g., as follows, where we chose to use hyperbolic
tangent2 activation functions instead of sigmoid activation functions:
hk = tanh(bk + W k h k1 )
where h 0 = x is the input of the neural net, h k (for k > 0) is the output of the k-th
hidden layer, which has weight matrix W k and oset (or bias) vector b k. If we want
the output f (x) to lie in some desired range, then we typically dene an output nonlinearity (which we did not have in the above Example 6.1.1). The non-linearity for the
output layer is of course generally dierent from the tanh, depending on the type of
output to be predicted and the associated loss function (see below).
As discussed in Chapter 12 (Section 12.5) there are several other non-linearities
besides the sigmoid and the hyperbolic tangent which have been successfully used with
neural networks. In particular, TODO discusses alternative non-linear units, such as
the rectied linear unit (max(0, b + w x)) and the maxout unit (max i (b i + W:,i x))
which have been particularly successful in the case of deep feedforward or convolutional
networks. These and other non-linear neural network activation functions commonly
found in the literature are summarized below.
Rectier or rectied linear unit (ReLU) or positive part: h(x) = max(0, b+
w x), also written (b + w x)+ .
Hyperbolic tangent: h(x) = tanh(b + w x).
Sigmoid: h(x) = 1/(1 + e(b+wx)).
Softmax: h(x) = softmax(b + W x). It is mostly used as output non-linearity for
predicting discrete probabilities. See denition and discussion below, Eq. 6.1.
2 which

is linearly related to the sigmoid as follows: tanh(x) = 2 sigmoid(2x) 1

108

Radial basis function or RBF unit: h(x) = e||wx|| / . This is heavily used
in kernel SVMs (Boser et al., 1992; Scholkopf et al., 1999) and has the advantage
that such units can be easily initialized as a random subset of the input examples (Powell, 1987; Niranjan and Fallside, 1990).
Softplus: this is a smooth version of the rectier, introduced in Dugas et al. (2001)
for function approximation and in Nair and Hinton (2010) in RBMs. Glorot et al.
(2011a) compared the softplus and rectier and found better results with the latter,
in spite of the very similar shape and the dierentiability and non-zero derivative
of the softplus everywhere, contrary to the rectier.
Hard tanh: this is shaped similarly to the tanh and the rectier but unlike the
latter, it is bounded, h(x) = max(1, min(1, x)). It was introduced by Collobert
(2004).
Absolute value rectication: h(x) = |x| (may be applied on the dot product
output or on the output of a tanh unit). It is also related to the rectier and
has been used for object recognition from images (Jarrett et al., 2009b), where
it makes sense to seek features that are invariant under a polarity reversal of the
input illumination.
Maxout: this is discussed in more detail in TODO It generalizes the rectier but
introduces multiple weight vectors wi (called lters) for each unit. h(x) = maxi (bi + wi x).
This is not an exhaustive list but covers most of the non-linearities and unit computations seen in the deep learning and neural nets literature. Many variants are possible.
As discussed in Section 6.3, the structure (also called architecture) of the family of
input-output functions can be varied in many ways, which calls for a generic principle
for eciently computing gradients, described in that section. For example, a common
variation is to connect layers that are not adjacent, with so-called skip connections,
which are found in the visual cortex (for which the word layer should be replaced by
the word area). Other common variations depart from a full connectivity between
adjacent layers. For example, each unit at layer k may be connected to only a subset of
units at layer k 1. A particular case of such form of sparse connectivity is discussed
in chapter 9 with convolutional networks. In general, the set of connections between
units of the whole network only needs to form a directed acyclic graph in order to dene
a meaningful computation (see the ow graph formalism below, Section 6.3). When
units of the network are connected to themselves through a cycle, one has to properly
dene what computation is to be done, and this is what is done in the case of recurrent
networks, treated in Chapter 10.

6.2.2

Loss Function and Conditional Log-Likelihood

In the 80s and 90s the most commonly used loss function was the squared error
L(f (x), y) = ||f (x) y|| 2 . As discussed in Section 5.8.1, if f is unrestricted (nonparametric), minimizing the expected value of the loss function over some data-generating
109

distribution P (X, Y ) yields f (x) = E[Y |X], the true conditional expectation of Y given
X 3 . See also Section 5.8.1 for more details. This tells us what the neural network
is trying to learn. Replacing the squared error by an absolute value makes the neural
network try to estimate not the conditional expectation but the conditional median.
However, when y is a discrete label, i.e., for classication problems, other loss functions such as the Bernoulli negative log-likelihood 4 have been found to be more appropriate than the squared error. In the case where y {0, 1} is binary this gives
L(f (x), y) = y log f (x) (1 y) log(1 f (x)) and it can be shown that the optimal
(non-parametric) f minimizing this criterion is f (x) = P (Y = 1|x). Note that in order
for the above expression of the criterion to make sense, f (x) must be strictly between 0
and 1 (an undened or innite value would otherwise arise). To achieve this, it is common to use the sigmoid as non-linearity for the output layer, which matches well with
the Binomial negative log-likelihood criterion5 . As explained below (Section 6.2.2), the
cross-entropy criterion allows gradients to pass through the output non-linearity even
when the neural network produces a condently wrong answer, unlike the squared error
criterion coupled with a sigmoid or softmax non-linearity.
Learning a Conditional Probability Model
More generally, one can dene a loss function as corresponding to a conditional loglikelihood, i.e., the negative log-likelihood (NLL) criterion
L NLL(f (x), y) = log P (Y = y|X = x).
See Section 5.8.2 which shows that this criterion yields an estimator of the true conditional probability of Y given X. For example, if Y is a continuous random variable and
we assume that, given X, it has a Gaussian distribution with mean f (X) and variance
2 , then
1
log P (Y |X) = (f (X) Y )2/ 2 + log(22 ).
2
Up to an additive and multiplicative constant (which would give the same choice of ),
this is therefore equivalent to the squared error loss. Once we understand this principle,
we can readily generalize it to other distributions, as appropriate. For example, it
is straightforward to generalize the univariate Gaussian to the multivariate case, and
under appropriate parametrization consider the variance to be a parameter or even
parametrized function of x (for example with output units that are guaranteed to be
positive, or forming a positive denite matrix).
Similarly, for discrete variables, the Binomial negative log-likelihood criterion corresponds to the conditional log-likelihood associated with the Bernoulli distribution with
3

Proving this is not dicult and is very instructive.


This is often called cross entropy in the literature, even though the term cross entropy should also
apply to many other losses that can be viewed as negative log-likelihood, discussed below in more detail.
TODO: where do we actually discuss this? may as well discuss it in maximum likelihood section
5
In reference to statistical models, this match between the loss function and the output nonlinearity is similar to the choice of a link function in generalized linear models (McCullagh and Nelder,
1989).
4

110

probability p = f (x) of generating Y = 1 given X = x (and probability 1 p of


generating Y = 0):
log P (Y |X) = 1Y =1 log p 1Y =0 log(1 p) = Y log f (X) (1 Y ) log(1 f (X)).
Softmax
When y is discrete (in some nite set, say {1, . . . , N }) but not binary, the Bernoulli
distribution is extended to the Multinoulli (Murphy, 2012), Section 3.10.2 6. This distribution is specied by a vector of N 1 probabilities whose sum is less or equal to
1, each element of which provides the probability pi = P (y = i|x). We thus need a
vector-valued output non-linearity that produces such a vector of probabilities, and a
commonly used non-linearity for this purpose is the softmax non-linearity introduced
earlier.
e ai
(6.1)
p = softmax(a) p i = P aj .
e
j

where typically a = b + W h is the vector of scores whose elements a i are associated


with each category i. The corresponding loss function is therefore L NLL(p, y) = log py .
Note how minimizing this loss will push a y up (increase the score ay associated with
the correct label y) while pushing ai (for i 6= y) down (decreasing the score of the other
labels, in the context x). The rst eect comes from the numerator of the softmax
while the second eect comes from the normalizing denominator. These forces cancel on
a specic example only if py = 1 and they cancel in average over examples (say sharing
the same x) if p i equals the fraction of times that y = i for this value x. The same
principles and the role of the normalization constant (or partition function) can be
seen at play in the training of Markov Random Fields, Boltzmann machines and RBMs,
in Chapter 14. Note other interesting properties of the softmax. First of all, the gradient
of the Multinoulli negative log-likelihood with respect to the softmax argument a is

py
LNLL(p(a), y) =
( log p y (a))
a
py
a
1
=
py (a)(1 i=y p)
p y(a)
= (p 1 i=y)

(6.2)

which does not saturate, the gradient does not vanish when the output probabilities
approach 0 or 1, except when the model is providing the correct answer. Specically,
lets consider the case where the correct label is i, i.e. y = i. The element of the gradient
associated with an erroneous label, say j 6= i, is

L NLL (p(a), y) = p j .
a j
6

(6.3)

The Multinoulli is also known as the categorical or generalized Bernoulli distribution, which is a
special case of the Multinomial distribution with one trial, and the term multinomial is often used even
in this case.

111

So if the model correctly predicts a low probability that the ry = j, i.e. that p j 0,
then the gradient is also close to zero.
There are other loss functions such as the squared error applied to softmax (or
sigmoid) outputs (which was popular in the 80s and 90s) which have vanishing gradient
when an output unit saturates (when the derivative of the non-linearity is near 0), even
if the output is completely wrong (Solla et al., 1988). This may be a problem because it
means that the parameters will basically not change, even though the output is wrong.
To see how the squared error interacts with the softmax output, we need to introduce
a one-hot encoding of the label, y = [0, . . . , 0, 1, 0, . . . , 0], i.e for the label y = i, we have
yi = 1 and y j = 0, j 6= i. We will again consider that we have the output of the
network to be p = softmax(a), where, as before, a is the input to the softmax function
( i.e. a = b + W h with h the output of the last hidden layer).
For the squared error loss L2 (p(a), y) = ||p(a) y|| 2, the gradient of the loss with
respect to the input vector to the softmax, a, is given by:


L(p(a), y) =
a
p p
= 2(p(a) y)

(1 p),

where indicates an element-wise multiplication. Let us again consider the case where
y = i, the gradient associated with an erroneous label, say j 6= i, is

L (p(a), y) = 2(p(a) y)
a j 2

(1 p).

(6.4)

So if the model correctly predicts a low probability that the ry = j, i.e. that p j 0,
then the gradient is also close to zero.
Another property that could potentially be interesting is that the softmax output is
invariant under additive changes of its input vector: softmax(a) = softmax(a + b) when
b is a scalar added to all the elements of vector a. Finally, it is interesting to think
of the softmax as a way to create a form of competition between the units (typically
output units, but not necessarily) that participate in it: the strongest lter output
ai gets reinforced (because the exponential increases faster for larger values) and the
other units get inhibited. This is analogous to the lateral inhibition that is believed to
exist between nearby neurons in cortex, and at the extreme (when the ais are large in
magnitude) becomes a form of winner-take-all (one of the outputs is 1 and the others is
0). A more computationally expensive form of competition is found with sparse coding,
described in Section 20.3.
Neural Net Outputs as Parameters of a Conditional Distribution
In general, for any parametric probability distribution P (y|) with parameters , we
can construct a conditional distributionP (y|x) by making a parametrized function of
x, and learning that function:
P (y| = f (x))
112

where f (x) is the output of a predictor, x is its input, and y can be thought of as a
target. This established choice of words comes from the common cases of classication
and regression, where f (X) is really a prediction associated with Y , or its expected
value. However, in general = f (X) may contain parameters of the distribution of Y
other than its expected value. For example, it could contain its variance or covariance,
in the case where Y is conditionally Gaussian. In the above examples, with the squared
error loss, is the mean of the Gaussian which captures the conditional distribution
of Y (which means that the variance is considered xed, not a function of X). In
the common classication case, contains the probabilities associated with the various
events of interest.
Once we view things in this way, we automatically get as the natural cost function
the negative log-likelihood L(X, Y ) = log P (Y | = f (X)). Besides , there could be
other parameters that control the distribution of Y . For example, we may wish to learn
the variance of a conditional Gaussian, even when the latter is not a function of X. In
this case, the solution can be computed analytically because the maximum likelihood
estimator of variance is simply the empirical mean of the squared dierence between
observations Y and their expected value (here f (X)). In other words, the conditional
variance can be estimated from the mean squared error.
Besides the Gaussian, a simple and common example is the case where Y is binary
(i.e. Bernoulli distributed), where it is enough to specify = P (Y= 1|X). In the
Multinoulli case, is generally specied by a vector of probabilities summing to 1, e.g.,
via the softmax non-linearity discussed above.
Another interesting and powerful example of output distribution for neural networks
is the mixture model, and in particular the Gaussian mixture model, introduced in Section 3.10.5. Neural networks that compute the parameters of a mixture model were
introduced in Jacobs et al. (1991); Bishop (1994). In the case of the Gaussian mixture
model with N components,
P (y|x) =

N
X

P (C = i|x)N (y|i (x), i(x)),

i=1

the neural network must have three kinds of outputs, P (C = i|x), i(x), and i (x),
which must satisfy dierent constraints:
1. Mixture components P (C = i|x): these form a Multinoulli distribution over the
N dierent components C, and can typically be obtained by a softmax over an
N-vector, to guarantee that these outputs are positive and sum to 1.
2. Means i (x): these indicate the center or mean associated with each Gaussian
component, and are unconstrained (typically with no non-linearity at all for these
output units). If Y is a d-vector, then the network must output an N d matrix
containing all these N d-vectors.
3. Covariances i (x): these specify the covariance matrix for each component i. In
many models the variance is unconditional (does not depend on x) and diagonal or
113

even scalar. If it is diagonal or scalar, only positivity must be enforced (e.g., using
the softplusnon-linearity). If it is full and conditional, then a parametrization must
be chosen that guarantees positive-deniteness of the predicted covariance matrix.
This can be achieved by writing i (x) = B i(x)B>
i (x), where B i is an unconstrained
square matrix. One practical issue if the the matrix is full is that computing
the likelihood is expensive, requiring O(d3 ) computation for the determinant and
inverse of i (x) (or equivalently, and more commonly done, its eigendecomposition
or that of Bi(x)).
It has been reported that gradient-based optimization of conditional Gaussian mixtures
(on the output of neural networks) can be nicky, in part because one gets divisions (by
the variance) which can be numerically unstable (when some variance gets to be small
for a particular example, yielding very large gradients). One solution is to clip gradients
(see Section 10.6.7 and Mikolov (2012); Pascanu and Bengio (2012); Graves (2013);
Pascanu et al. (2013a)), while another is to scale the gradients heuristically (Murray
and Larochelle, 2014).
Multiple Output Variables
When Y is actually a tuple formed by multiple random variables Y1 , Y2, . . . , Y k, then one
has to choose an appropriate form for their joint distribution, conditional on X = x. The
simplest and most common choice is to assume that the Y i are conditionally independent,
i.e.,
k
Y
P (Y 1, Y 2 , . . . , Yk|x) =
P (Y i |x).
i=1

This brings us back to the single variable case, especially since the log-likelihood now
decomposes into a sum of terms log P (Y i |x). If each P (Yi|x) is separately parametrized
(e.g. a dierent neural network), then we can train these neural networks independently.
However, a more common and powerful choice assumes that the dierent variables Y i
share some common factors that can be represented in some hidden layer of the network
(such as the top hidden layer). See Sections 6.5 and 7.12 for a deeper treatment of the
notion of underlying factors of variation and multi-task training. See also Figure 7.6
illustrating these concepts.
If the conditional independence assumption is considered too strong, what can we do?
At this point it is useful to step back and consider everything we know about learning
a joint probability distribution. Any probability distribution P (Y ) parametrized by
parameters can be turned into a conditional distribution P (Y |x) by making a
function = f (x) parametrized by . This is not only true of the simple parametric
distributions we have seen above (Gaussian, Bernoulli, Multinoulli) but also of more
complex joint distributions typically represented by a graphical model. Much of this
book is devoted to such graphical models (see Chapters 14, 19, 20, 21). For more
concrete examples of such structured output models with deep learning, see Section 13.4.

114

6.2.3

Training Criterion and Regularizer

The loss function (often interpretable as a negative log-likelihood) tells us what we would
like the learner to capture. Maximizing the conditional log-likelihood over the true
distribution, i.e., minimizing the expected loss EX,Y [ log P (Y |X)], makes P (Y |X)
estimate the true P (Y |X) associated with the unknown data generating distribution,
within the boundaries of what the chosen family of functions allows. See Section 5.8.2
for a proof. In practice we cannot minimize this expectation because we do not know
P (X, Y ) and because computing and minimizing this integral exactly would generally
be intractable. Instead we are going to approximately minimize a training criterion
C() based on the empirical average of the loss (over the training set). The simplest
P
such criterion is the average training loss 1n nt=1 L(f (x (t)), y(t) ), where the training
set is a set of n examples (x(t) , y (t) ). However, better results can often be achieved by
crippling the learner and preventing it from simply nding the best that minimizes the
average training loss. This means that we combine the evidence coming from the data
(the training set average loss) with some a priori preference on the dierent values that
can take (the regularizer). If this concept (and the related concepts of generalization,
overtting and undertting) are not clear, please return to Sections 5.3 and 5.6.2 for a
refresher.
The most common regularizer is simply an additive term equal to a regularization
coecient times the squared norm of the parameters7, |||| 22 . This regularizer is often
called the weight decay or L2 weight decay or shrinkage because adding the squared
L2 norm to the training criterion pushes the weights towards zero, in proportion to
their magnitude. For example, when using stochastic gradient descent, each step of the
update with regularizer term 2 ||||2 would look like
L(f (x), y)
where

is the learning rate (see Section 4.3). This can be rewritten


(1 ) L(f (x), y)

where we see that the rst term systematically shaves o a small proportion of before
adding the gradient.
Another commonly used regularizer is the so-called L1 regularizer, which adds a term
P
proportional to the L1 (absolute value) norm, || 1 = i |i|. The L1 regularizer also
prefers values that are small, but whereas the L2 weight decay has only a weak preference
for 0 rather than small values, the L1 regularizer continues pushing the parameters
towards 0 with a constant gradient even when they get very small. As a consequence, it
tends to bring some of the parameters to exactly 0 (how many depends on how large the
regularization coecient is chosen). When optimizing with an approximate iterative
and noisy method such as stochastic gradient descent, no actual 0 is obtained. On the
7

In principle, one could have dierent priors on dierent parameters, e.g., it is common to treat
the output weights with a separate regularization coecient, but the more hyper-parameters, the more
dicult is their optimization, discussed in Chapter 12.

115

other hand, the L1 regularizer tolerates large values of some parameters (only additively
removing a small quantity compared to the situation without regularization) whereas
the L2 weight decay aggressively punishes and prevents large parameter values. The L1
and L2 regularizers can be combined, as in the so-called elastic net (Zou and Hastie,
2005), and this is commonly done in deep networks 8 .
Note how in the context of maximum likelihood, regularizers can generally be interpreted as Bayesian priors on the parameters P () or on the learned function, as
discussed in Sections 5.6.2 and 5.7. Later in this book we discuss regularizers that are
data-dependent (i.e., cannot be expressed easily as a pure prior), such as the contractive
penalty (Chapter 16) as well as regularization methods that are dicult to interpret
simply as added terms in the training criterion, such as dropout (Section 7.11).

6.2.4

Optimization Procedure

Once a training criterion is dened, we enter the realm of iterative optimization procedures, with the slight twist that if possible we want to monitor held-out performance
(for example the average loss on a validation set, which usually does not include the
regularizing terms) and not just the value of the training criterion. Monitoring the validation set performance allows one to stop training at a point where generalization is
estimated to be the best among the training iterations. This is called early stopping and
is a standard machine learning technique, discussed in Sec. 7.8.
Section 4.3 has already covered the basics of gradient-based optimization. The simplest such technique is stochastic gradient descent (with minibatches), in which the
parameters are updated after computing the gradient of the average loss over a minibatch of examples (e.g. 128 examples) and making an update in the direction opposite
to that gradient (or opposite to some accumulated average of such gradients, i.e., the
momentum technique, reviewed in Section 12.5.3). The most commonly used optimization procedures for multi-layer neural networks and deep learning in general are either
variants of stochastic gradient descent (typically with minibatches), second-order methods (the most commonly used being L-BFGS and nonlinear conjugate gradients) applied
on large minibatches (e.g. of size 10000) or on the whole training set at once, or natural gradient methods (Amari, 1997; Park et al., 2000; Le Roux et al., 2008; Pascanu
and Bengio, 2013). Exact second-order and natural gradient methods are computationally too expensive for large neural networks because they involve matrices of dimension
equal to the square of the number of parameters. Approximate methods are discussed
in Section 8.4.
On smaller datasets or when computing can be parallelized, second-order methods
have a computational advantage because they are easy to parallelize and can still perform
many updates per unit of time. On larger datasets (and in the limit of an innite dataset,
i.e., a stream of training examples) one cannot aord to wait for seeing the whole dataset
before making an update, so that favors either stochastic gradient descent variants
8

See the Deep Learning Tutorials at http://deeplearning.net/tutorial/gettingstarted.


html#l1-and-l2- regularization

116

(possibly with minibatches to take advantage of fast or parallelized implementations of


matrix-matrix products) or second-order methods with minibatches.
A more detailed discussion of issues arising with optimization methods in deep learning can be found in Chapter 8. In particular, many design choices in the construction
of the family of functions, loss function and regularizer can have a major impact on
the diculty of optimizing the training criterion. Furthermore, instead of using generic
optimization techniques, one can design optimization procedures that are specic to
the learning problem and chosen architecture of the family of functions, for example by
initializing the parameters of the nal optimization routine from the result of a dierent optimization (or a series of optimizations, each initialized from the previous one).
Because of the non-convexity of the training criterion, the initial conditions can make
a very important dierence, and this is what is exploited in the various pre-training
strategies, Sections 8.6.4 and 17.1, as well as with curriculum learning (Bengio et al.,
2009), Section 8.7.

6.3

Flow Graphs and Back-Propagation

The term back-propagation is often misunderstood as meaning the whole learning algorithm for multi-layer neural networks. Actually it just means the method for computing
gradients in such networks. Furthermore, it is generally understood as something very
specic to multi-layer neural networks, but once its derivation is understood, it can easily be generalized to arbitrary functions (for which computing a gradient is meaningful),
and we describe this generalization here, focusing on the case of interest in machine
learning where the output of the function to dierentiate (e.g., the loss L or the training
criterion C) is a scalar and we are interested in its derivative with respect to a set of parameters (considered to be the elements of a vector ). The partial derivative of C with
respect to (called the gradient) tells us whether should be increased or decreased
in order to decrease C, and is a crucial tool in optimizing the training objective. It
can be readily proven that the back-propagation algorithm has optimal computational
complexity in the sense that there is no algorithm that can compute the gradient faster
(in the O() sense, i.e., up to an additive and multiplicative constant).
The basic idea of the back-propagation algorithm is that the partial derivative of
the cost C with respect to parameters can be decomposed recursively by taking into
consideration the composition of functions that relate to C, via intermediate quantities
that mediate that inuence, e.g., the activations of hidden units in a deep neural network.
In this section we call these intermediate values uj (indexed by j) and consider the
general case in which they form a directed acyclic graph that has C as its nal node
u N , that depends of all the other nodes uj . The back-propagation algorithm exploits
C
the chain rule for derivatives to compute C
uj when ui has already been computed for
successors u i of uj in the graph, e.g., the hidden units in the next layer downstream.
C
This recursion can be initialized by noting that u
= 1 and at each step only requires
N
u i , when u is a
to use the partial derivatives associated with each arc of the graph, u
i
j
successor of uj.
117

Figure 6.3: The chain rule, illustrated in the simplest possible case, with z a scalar
function of y, which is itself a scalar function of x. A small change x in x gets turned
into a small change y in y through the partial derivative y
, from the rst-order Taylor
x
approximation of y(x), and similarly for z(y). Plugging the equation for y into the
equation for z yields the chain rule.

6.3.1

Chain Rule

In order to apply the back-propagation algorithm, we take advantage of the chain rule:
C(g()) = g() C(g())

g()

(6.5)

which works also when C, g or are vectors rather than scalars (in which case the
corresponding partial derivatives are understood as Jacobian matrices of the appropriate
dimensions). In the purely scalar case we can understand the chain rule as follows: a
small change in will propagate into a small change in g() by getting multiplied by
g()
. Similarly, a small change in g() will propagate into a small change in C(g())
by getting multiplied by g() C(g()). Hence a small change in rst gets multiplied
g()

by to obtain the change in g() and this then gets multiplied by g()C(g()) to
obtain the change in C(g()). Hence the ratio of the change in C(g()) to the change in
is the product of these partial derivatives. The partial derivative measures the locally
linear inuence of a variable on another. This is illustrated in Figure 6.3, with x = ,
y = g(), and z = C(g()).
Now, if g is a vector, we can rewrite the above as follows:
C(g()) =

X C(g()) gi ()
i

gi ()

which sums over the inuences of on C(g()) through all the intermediate variables
g i (). This is illustrated in Figure 6.4 with x = , yi = g i(), and z = C(g()).

118

Figure 6.4: Top: The chain rule, when there are two intermediate variables y1 and y 2
between x and z, creating two paths for changes in x to propagate and yield changes in
z. Bottom: more general case, with n intermediate variables y 1 to yn.

6.3.2

Back-Propagation in an MLP

Whereas example 6.1.1 illustrated the case of of an MLP with a single hidden layer let
us consider in this section back-propagation for an ordinary but deep MLP, i.e., like the
above vanilla MLP but with several hidden layers. For this purpose, we will recursively
apply the chain rule illustrated in Figure 6.4. The algorithm proceeds by rst computing
the gradient of the cost C with respect to output units, and these are used to compute
the gradient of C with respect to the top hidden layer activations. We can then continue
computing the gradients of lower level hidden units one at a time in the same way The
gradients on hidden and output units can be used to compute the gradient of C with
respect to the parameters (e.g. weights and biases) of each layer (i.e., that directly
contribute to the output of that layer).
Algorithm 6.1 describes in matrix-vector form the forward propagation computation
for a classical multi-layer network with M layers, where each layer computes an ane

119

Algorithm 6.1 Forwardcomputation associated with input x for a deep neural network
with ordinary ane layers composed with an arbitrary elementwise dierentiable (almost
everywhere) non-linearity f. There are Msuch layers, each mapping their vectorvalued input h k to a pre-activation vector ak via a weight matrix W(k) which is then
transformed via f into hk+1 . The input vector x corresponds to h 0 and the predicted
outputs
y corresponds to h M. The cost function L(y , y) depends on the output y and
on a target y (see Section 6.2.2 for examples of loss functions). The loss may be added
to a regularizer (see Section 6.2.3 and Chapter 7) to obtain the example-wise cost C.
Algorithm 6.2 shows how to compute gradients of C with respect to parameters W and
b.
h0 = x
for k = 1 . . . , M do
a(k) = b(k) + W (k)h (k1)
h(k) = f(a(k) )
end for
y = h(M )
C = L(
y, y) +
transformation (dened by a bias vector b(k) and a weight matrix W (k) followed by a
non-linearity f . In general, the non-linearity may be dierent on dierent layers, and
this is typically the case for the output layer (see Section 6.2.1). Hence each unit at
layer k computes an output h (k)
as follows:
i
a (k)
= b(k)
+
i
i

Wij(k)h (k1)
j

(k)
hi

= f(a (k) )

(6.6)

where we separate the ane transformation from the non-linear activation operations
for ease of exposition of the back-propagation computations.
These are described in matrix-vector form by Algorithm 6.2 and proceed from the
output layer towards the rst hidden layer, as outlined above.

6.3.3

Back-Propagation in a General Flow Graph

More generally, we can think about decomposing a function C() into a more complicated
graph of computations. This graph is called a ow graph. Each node u i of the graph
denotes a numerical quantity that is obtained by performing a computation requiring
the values uj of other nodes, with j < i. The nodes satisfy a partial order which
dictates in what order the computation can proceed. In practical implementations of
such functions (e.g. with the criterion C() or its value estimated on a minibatch), the
nal computation is obtained as the composition of simple functions taken from a given
set (such as the set of numerical operations that the numpy library can perform on arrays
of numbers).
120

Algorithm 6.2 Backward computation for the deep neural network of Algorithm 6.1,
which uses in addition to the input x a target y. This computation yields the gradients on
the activations a (k) for each layer k, starting from the output layer and going backwards
to the rst hidden layer. From these gradients, which can be interpreted as an indication
of how each layers output should change to reduce error, one can obtain the gradient on
the parameters of each layer. The gradients on weights and biases can be immediately
used as part of a stochastic gradient update (performing the update right after the
gradients have been computed) or used with other gradient-based optimization methods.
After the forward computation, compute the gradient on the output layer:
g y C = y L(
y , y) + y
(typically is only a function of parameters not activations, so the last term would
be zero)
for k = M down to 1 do
Convert the gradient on the layers output into a gradient into the pre-nonlinearity
activation (element-wise multiplication if f is element-wise):
g a(k) C = g f 0 (a(k) )
Compute gradients on weights and biases (including the regularization term, where
needed):
b (k)C = g + b(k)
W (k) C = g h(k1)> + W(k)
Propagate the gradients w.r.t. the next lower-level hidden layers activations:
g h(k1) C = W (k)> a(k) C
end for
We will dene the back-propagation in a general ow-graph, using the following
generic notation: ui = f i(a i), where a i is a list of arguments for the application of f i to
the values u j for the parents of i in the graph: ai = (uj )jparents(i) . This is illustrated
in Figure 6.5.

121

uN

ui = fi (ai )
fi

ai
uj

u1 u 2
Figure 6.5: Illustration of recursive forward computation, where at each node ui we
compute a value u i = fi (ai ), with ai being the list of values from parents u j of node
u i . Following Algorithm 6.3, the overall inputs to the graph are u1 . . . , uM (e.g., the
parameters we may want to tune during training), and there is a single scalar output
u N (e.g., the loss which we want to minimize).
The overall computation of the function represented by the ow graph can thus be
summarized by the forward computation algorithm, Algorithm 6.3.
In addition to having some code that tells us how to compute f i(a i ) for some values in the vector ai , we also need some code that tells us how to compute its partial
i(a i)
derivatives, fa
with respect to any immediate argument aik. Let k = (i, j) denote
ik
the index of uj in the list ai . Note that uj could inuence ui through multiple paths.
i(ai )
ui
Whereas u
would denote the total gradient adding up all of these inuences, f
a ik
j
only denotes the derivative of f i with respect to its specic k-th argument, keeping the
other arguments xed, i.e., only considering the inuence through the arc from u j to
u i . In general, when manipulating partial derivatives, one should keep clear in ones
mind (and implementation) the notational and semantic distinction between a partial
derivative that includes all paths and one that includes only the immediate eect of a
functions argument on the function output, with the other arguments considered xed.
For example consider f3 (a3,1, a3,2 ) = ea 3,1+a3,2 and f 2 (a 2,1 ) = a22,1, while u 3 = f3 (u2 , u 1)
122

Algorithm 6.3 Flow graph forwardcomputation. Each node computes numerical value
u i by applying a function f i to its argument list ai that comprises the values of previous
nodes uj , j < i, with j parents(i). The input to the ow graph is the vector x, and is
set into the rst M nodes u1 to u M . The output of the ow graph is read o the last
(output) node uN .
for i = 1 . . . , M do
ui xi
end for
for i = M + 1 . . . , N do
ai (u j )jparents(i)
ui fi (a i)
end for
return uN
and u 2 = f2(u1 ), illustrated in Figure 6.6. The direct derivative of f3 with respect to its
f3 = e a 3,1+a 3,2 while if we consider the variables u and u to which
argument a 3,2 is a
3
1
3,2
these correspond, there are two paths from u 1 to u3 , and we obtain as derivative the
u1+u 2 (1 + 2u ). The results are
3
sum of partial derivatives over these two paths, u
1
u = e
1

3
dierent because u
u 1 involves not just the direct dependency of u 3 on u 1 but also the
indirect dependency through u 2.

u3

u2
u1
Figure 6.6: Illustration of indirect eect and direct eect of variable u1 on variable u 3
in a ow graph, which means that the derivative of u3 with respect to u 1 must include
the sum of two terms, one for the direct eect (derivative of u 3 with respect to its
rst argument) and one for the indirect eect through u 2 (involving the product of
the derivative of u 3 with respect to u 2 times the derivative of u 2 with respect to u 1).
Forward computation of uis (as in Figure 6.5) is indicated with upward full arrows,
while backward computation (of derivatives with respect to ui s, as in Figure 6.7) is
indicated with downward dashed arrows.
123

@uN
=1
@uN

@u N
@u i

@u N
@uj

Figure 6.7: Illustration of recursive backward computation, where we associate to each


node j not just the values uj computed in the forward pass (Figure 6.5, bold upward
N
arrows) but also the gradient u
u j with respect to the output scalar node uN . These
gradients are recursively computed in exactly the opposite order, as described in AlgoN
rithm 6.4 by using the already computed u
of the children i of j (dashed downward
ui
arrows).
Armed with this understanding, we can dene the back-propagation algorithm as
follows, in Algorithm 6.4, which would be computed after the forward propagation
(Algorithm 6.3) has been performed. Note the recursive nature of the application of the
chain rule, in Algorithm 6.4: we compute the gradient on node j by re-using the already
N
computed gradient for children nodes i, starting the recurrence from the trivial u
u N = 1
that sets the gradient for the output node. This is illustrated in Figure 6.7.
124

Algorithm 6.4 Back-propagation computation of a ow graph (full, upward arrows),


which itself produces an additional ow graph (dashed, backward arrows). See the
forward propagation in a ow-graph (Algorithm 6.3, to be performed rst) and the
N
required data structure. In addition, a quantity u
u i needs to be stored (and computed)
at each node, for the purpose of gradient back-propagation. Below the notation (i, j)
is the index of uj as an argument to fi. The back-propagation algorithm eciently
N for all is (traversing the graph backwards this time), and in particular
computes u
ui
we are interested in the derivatives of the output node uN with respect to the inputs
u 1 . . . , uM (which could be the parameters, in a learning setup). The cost of the overall
computation is proportional to the number of arcs in the graph, assuming that the
partial derivative associated with each arc requires a constant time. This is of the same
order as the number of computations for the forward propagation.
uN
uN

1
for j = N P
1 down to 1 do
u N
fi(ai )
N
i:jparents(i) u
uj
u i ai,(i,j)
end for
M
N
return u
ui
i=1

This recursion is a form of ecient factorization of the total gradient, i.e., it is an


application of the principles of dynamic programming 9 . Indeed, the derivative of the
output node with respect to any node can also be written down in this intractable form:
u N
=
u i

n
Y
ukj
u kj1

paths u k ...,uk n : k 1 =i,kn =N j=2


1

where the paths u k1 . . . , uk n go from the node k1 = i to the nal node kn = N in the ow
graph. Computing the sum as above would be intractable because the number of possible
paths can be exponential in the depth of the graph. The back-propagation algorithm
is ecient because it employs a dynamic programming strategy to re-use rather than
re-compute partial sums associated with the gradients on intermediate nodes.
Although the above was stated as if the u is were scalars, exactly the same procedure
can be run with ui s being tuples of numbers (more easily represented by vectors). In
that case the equations remain valid, and the multiplication of scalar partial derivatives
N
becomes the multiplication of a row vector of gradients u
with a Jacobian of partial
u i
derivatives associated with the j i arc of the graph,

fi (ai )
.
ai,(i,j)

In the case where

minibatches are used during training, ui would actually be a whole matrix (the extra
dimension being for the examples in the minibatch). This would then turn the basic
computation into matrix-matrix products rather than matrix-vector products, and the
9

Here we refer to dynamic programming in the sense of table-lling algorithms that avoid recomputing frequently used subexpressions. In the context of machine learning, dynamic programming
can also refer to iterating Bellmans equations. That is not the kind of dynamic programming we refer
to here.

125

former can be computed much more eciently than a sequence of matrix-vector products
(e.g. with the BLAS library), especially so on modern computers and GPUs, that rely
more and more on parallelization through many cores.
More implementation issues regarding the back-propagation algorithm are discussed
in Chapters 11 and 12, regarding respectively GPU implementations (Section 11.2)
and debugging tricks (Section 12.5.1). Section 12.5.2 also discusses a natural generalization of the back-propagation algorithm in which one manipulates not numbers but
symbolic expressions, i.e., turning a program that performs a computation (decomposed
as a ow graph expressed in the forward propagation algorithm) into another program
(another ow graph) that performs gradient computation (i.e. automatically generating the back-propagation program, given the forward propagation graph). Using such
symbolic automatic dierentiation procedures (implemented for example in the Theano
library10 ), one can compute second derivatives and other kinds of derivatives, as well as
apply numerical stabilization and simplications on a ow graph.

6.4

Universal Approximation Properties and Depth

When there is no hidden layer, and for convex loss functions and regularizers (which is
typically the case), we obtain a convex training criterion. We end up with linear regression, logistic regression or other log-linear models that are used in many applications of
machine learning. This is appealing (no local minima, no saddle point) and convenient
(no strong dependency on initial conditions) but comes with a major disadvantage: such
learners are very limited in their ability to represent more complex functions. In the
case of classication tasks, we are limited to linear decision surfaces. In practice, this
limitation can be somewhat circumvented by handcrafting a large enough or discriminant enough set of hardwired features extracted from the raw input and on which the
linear predictor can be easily trained. See next section for a longer discussion about
feature learning.
To make the family of functions rich enough, the other option is to introduce one
or more hidden layers. It can be proven (White, 1990; Barron, 1993; Girosi, 1994)
that with a single hidden layer of a sucient size and a reasonable choice of nonlinearity (including the sigmoid, hyperbolic tangent, and RBF unit), one can represent
any smooth function to any desired accuracy (the greater the required accuracy, the
more hidden units are required). However, these theorems generally do not tell us
how many hidden units will be required to achieve a given accuracy for particular data
distributions: in fact the worse case scenario is generally going to require an exponential
number of hidden units (to basically record every input conguration that needs to be
distinguished). It is easier to understand how bad things can be by considering the case
of binary input vectors: the number of possible binary functions on vectors v {0, 1}d is
d
2 2 and selecting one such function requires 2 d bits, which will in general require O(2d)
degrees of freedom.
However, machine learning is not about learning any possible function with equal
10

http://deeplearning.net/software/theano/

126

ease: we mostly care about the kinds of functions that are needed to represent the
world around us or the task of interest. The no-free-lunch theorems for machine
learning (Wolpert, 1996) essentially say that no learning algorithm is universal, i.e.,
dominates all the others against all possible ground truths (data generating distribution
or target function). However, for a given family of tasks, some learning algorithms are
denitely better. Choosing a learning algorithm is equivalent to choosing a prior, i.e.,
making a larger bet for some guesses of the underlying target function than for others.
If the target function is likely under the chosen prior, then this prior (i.e., this learning
algorithm) turns out to be a good choice. The best choice is the unrealistic prior that
puts all probability mass on the true target function. This would only happen if there is
nothing to be learned any more because we already know the target function. Having a
universal approximation property just means that this prior is broad enough to cover or
come close to any possible target function and this is a useful property but is, in itself,
clearly not sucient to provide good generalization (not to speak of the optimization
issues and other computational concerns that may be associated with a choice of prior
and learning algorithm).
One of the central priors that is exploited in deep learning is that the target function
to be learned can be eciently represented as a deep composition of simpler functions
(features), where features at one level can be re-used to dene many features at the
next level. This is connected to the notion of underlying factors described in the next
section. One therefore assumes that these factors or features are organized at multiple
levels, corresponding to multiple levels of representation. The number of such levels is
what we call depth in this context. The computation of these features can therefore be
laid down in a ow graph or circuit, whose depth is the length of the longest path from
an input node to an output node. Note that we can dene the operations performed
in each node in dierent ways. For example, do we consider a node that computes
the ane operations of a neural net followed by a node that computes the non-linear
neuron activation, or do we consider both of these operations as one node or one level
of the graph? Hence the notion of depth really depends on the allowed operations at
each node and one ow graph of a given depth can be converted into an equivalent
ow graph of a dierent depth by redening which operations can be performed at each
node. However, for neural networks, we typically talk about a depth 1 network if there
is no hidden layer, a depth 2 network if there is one hidden layer, etc. The universal
approximation properties for neural nets basically tell us that depth 2 is sucient to
approximate any reasonable function to any desired nite accuracy.
From the point of view of approximation properties, the important result is that one
can nd families of functions which can be approximated very eciently when a particular depth is allowed, but which might require a much larger (typically exponentially
larger) model (e.g. more hidden units) if depth is insucient (or is limited to 2). Such
results have been proven for logic gates (H
astad, 1986), linear threshold units with nonnegative weights (H
astad and Goldmann, 1991), polynomials (Delalleau and Bengio,
2011) organized as deep sum-product networks (Poon and Domingos, 2011), and more
recently, for deep networks of rectier units (Pascanu et al., 2013b). Of course, there
is no guarantee that the kinds of functions we want to learn in applications of machine
127

learning (and in particular for AI) share such a property. However, a lot of experimental
evidence suggests that better generalization can be obtained with more depth, for many
such AI tasks (Bengio, 2009; Mesnil et al., 2011; Goodfellow et al., 2011; Ciresan et al.,
2012; Krizhevsky et al., 2012b; Sermanet et al., 2013; Farabet et al., 2013a; Couprie
et al., 2013; Ebrahimi et al., 2013). This suggests that indeed depth is a useful prior
and that in order to take advantage of it, the learners family of function needs to allow
multiple levels of representation.

6.5

Feature / Representation Learning

Let us consider again the single layer networks such as the Perceptron, linear regression
and logistic regression: such linear models are appealing because training them involves
a convex optimization problem 11, i.e., an optimization problem with some convergence
guarantees towards a global optimum, irrespective of initial conditions. Simple and wellunderstood optimization algorithms are available in this case. However, this limits the
representational capacity too much: many tasks, for a given choice of input representation x (the raw input features), cannot be solved by using only a linear predictor. What
are our options to avoid that limitation?
1. One option is to use a kernel machine (Williams and Rasmussen, 1996; Scholkopf
et al., 1999), i.e., to consider a xed mapping from x to (x), where (x) is of
much higher dimension. In this case, f (x) = b + w (x) can be linear in the
parameters (and in (x)) and optimization remains convex (or even analytic). By
the exploiting the kernel trick, we can computationally handle a high-dimensional
(x) (or even an innite-dimensional one) so long as the kernel K(u, v) = (u)(v)
(where is the appropriate dot product for the space of ()) can be computed
eciently. If (x) is of high enough dimension, we can always have enough capacity
to t the training set, but generalization is not at all guaranteed: it will depend
on the appropriateness of the choice of as a feature space for our task. Kernel
machines theory clearly identies the choice of to the choice of a prior. This leads
to kernel engineering, which is equivalent to feature engineering, discussed next.
The other type of kernel (that is very commonly used) embodies a very broad prior,
2
such as smoothness, e.g., the Gaussian (or RBF) kernel K(u, v) = e||uv||/ .
Unfortunately, this prior may be insucient, i.e., too broad and sensitive to the
curse of dimensionality, as introduced in Section 5.12 and developed in more detail
in Chapter 17.
2. Another option is to manually engineer the representation or features (x). Most
industrial applications of machine learning rely on hand-crafted features and most
of the research and development eort (as well as a very large fraction of the
scientic literature in machine learning and its applications) goes into designing
new features that are most appropriate to the task at hand. Clearly, faced with
11
or even one for which an analytic solution can be computed, with linear regression or the case of
some Gaussian process regression models

128

a problem to solve and some prior knowledge in the form of representations that
are believed to be relevant, the prior knowledge can be very useful. This approach
is therefore common in practice, but is not completely satisfying because it involves a very task-specic engineering work and a laborious never-ending eort to
improve systems by designing better features. If there were some more general
feature learning approaches that could be applied to a large set of related tasks
(such as those involved in AI), we would certainly like to take advantage of them.
Since humans seem to be able to learn a lot of new tasks (for which they were
not programmed by evolution), it seems that such broad priors do exist. This
whole question is discussed in a lot more detail in Bengio and LeCun (2007b), and
motivates the third option.
3. The third option is to learn the features, or learn the representation. In a sense,
it allows one to interpolate between the almost agnostic approach of a kernel machine with a general-purpose smoothness kernel (such as RBF SVMs and other
non-parametric statistical models) and full designer-provided knowledge in the
form of a xed representation that is perfectly tailored to the task. This is equivalent to the idea of learning the kernel, except that whereas most kernel learning
methods only allow very few degrees of freedom in the learned kernel, representation learning methods such as those discussed in this book (including multi-layer
neural networks) allow the feature function () to be very rich (with a number of parameters that can be in the millions or more, depending on the amount
of data available). This is equivalent to learning the hidden layers, in the case
of a multi-layer neural network. Besides smoothness (which comes for example
from regularizers such as weight decay), other priors can be incorporated in this
feature learning. The most celebrated of these priors is depth, discussed above
(Section 6.4). Other priors are discussed in Chapter 17.
This whole discussion is clearly not specic to neural networks and supervised learning, and is one of the central motivations for this book.

6.6

Piecewise Linear Hidden Units

Most of the recent improvement in the performance of deep neural networks can be
attributed to increases in computational power and the size of datasets. The machine
learning algorithms involved in recent state-of-the-art systems have mostly existed since
the 1980s, with very few recent conceptual advances contributing signicantly to increased performance.
One of the main algorithmic improvements that has had a signicant impact is the
use of piecewise linear units, such as absolute value rectiers, rectied linear units. Such
units consist of two linear pieces and their behavior is driven by a single weight vector.
Jarrett et al. (2009a) observed that using a rectifying non-linearity is the single most
important factor in improving the performance of a recognition system among several
dierent factors of neural network architecture design.
129

For small datasets, Jarrett et al. (2009a) observed that rectifying non-linearities is
even more important than learning the weights of the hidden layers. Random weights are
sucient to propagate useful information through a rectied linear network, allowing the
classier layer at the top to learn how to map dierent feature vectors to class identities.
When more data is available, learning becomes relevant because it can extract more
knowledge from it, and learning typically beats xed or random settings of parameters.
Glorot et al. (2011b) showed that learning is far easier in deep rectied linear networks
than in deep networks that have curvature or two-sided saturation in their activation
functions. Because the behavior of the unit is linear over half of its domain, it is easy for
an optimization algorithm to tell how to improve the behavior of a unit, even when the
units activations are far from optimal. Just as piecewise linear networks are good at
propagating information forward, back-propagation in such a network is also piecewise
linear and propagates information about the error derivatives to all of the gradients in
the network. Each piecewise linear function can be decomposed into dierent regions
corresponding to dierent linear pieces. When we change a parameter of the network,
the resulting change in the networks activity is linear until the point that it causes some
unit to go from one linear piece to another. Traditional units such as sigmoids are more
prone to discarding information due to saturation both in forward propagation and in
back-propagation, and the response of such a network to a change in a single parameter
may be highly nonlinear even in a small neighborhood.
One drawback to rectied linear units is that they cannot learn via gradient-based
methods on examples for which their activation is zero. This problem can be mitigated
by initializing the biases to a small positive number, but it is still possible for a rectied
linear unit to learn to de-activate and then never be activated again. Goodfellow et al.
(2013a) introduced maxout units and showed that maxout units can successfully learn
in conditions where rectied linear units become stuck. Maxout units are also piecewise
linear, but unlike rectied linear units, each piece of the linear function has its own
weight vector, so whichever piece is active can always learn.
This same general principle of using linear behavior to obtain easier optimization also
applies in other contexts besides deep linear networks. Recurrent networks, which need
to propagate information through several time steps, are much easier to train when they
are more linear. One of the best-performing recurrent network architectures, the LSTM,
propagates information through time via summationa particular straightforward kind
of linear activation. This is discussed further in Section 10.6.4.
In addition to helping to propagate information and making optimization easier,
piecewise linear units also have some nice properties that can make them easier to
regularize. This is discussed further in Section 7.11.
Sigmoidal non-linearities still perform well in some contexts, but piecewise linear
units are now by far the most popular kind of hidden units.

6.7

Historical Notes

130

You might also like