Taking the derivative of the loss function of a neural network can be quite cumbersome. Even taking the derivative of a single layer in a neural network often results in expressions cluttered with indices. In this post I’d like to show an index-free way to do it.
Consider the map
$$x \mapsto \sigma(Wx + b),$$
where $W \in \mathbb{R}^{m\times n}$ is the weight matrix, $b \in \mathbb{R}^{m}$ is the bias, $x \in \mathbb{R}^{n}$ is the input, and $\sigma$ is the activation function. Usually $\sigma$ represents both a scalar function (i.e. mapping $\mathbb{R} \to \mathbb{R}$) and the function mapping $\mathbb{R}^{m} \to \mathbb{R}^{m}$ which applies $\sigma$ in each coordinate. In training neural networks, we try to optimize for the best parameters $W$ and $b$, so we need to take the derivative with respect to $W$ and $b$. Hence, we consider the map
$$(W, b) \mapsto \sigma(Wx + b).$$
This map is a composition of the map $(W, b) \mapsto Wx + b$ and $\sigma$, and since the former map is linear in the joint variable $(W, b)$, the derivative of the composition should be pretty simple. What makes the computation a little less straightforward is the fact that we are usually not used to viewing matrix-vector products $Wx$ as linear maps in $W$ but in $x$. So let's rewrite the thing:
There are two particular notions which come in handy here: the Kronecker product of matrices and the vectorization of matrices. Vectorization takes some $A \in \mathbb{R}^{m\times n}$, given columnwise as $A = \begin{bmatrix} a_{1} & \cdots & a_{n}\end{bmatrix}$, and maps it by
$$\operatorname{vec}(A) = \begin{bmatrix} a_{1}\\ \vdots\\ a_{n}\end{bmatrix} \in \mathbb{R}^{mn}.$$
The Kronecker product of matrices $A \in \mathbb{R}^{m\times n}$ and $B \in \mathbb{R}^{p\times q}$ is the matrix
$$A \otimes B = \begin{bmatrix} a_{11}B & \cdots & a_{1n}B\\ \vdots & & \vdots\\ a_{m1}B & \cdots & a_{mn}B\end{bmatrix} \in \mathbb{R}^{mp\times nq}.$$
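In NumPy, for example, vectorization is just column-major flattening and the Kronecker product is `np.kron`; here is a minimal sketch (the concrete matrices are made up for illustration):

```python
import numpy as np

A = np.array([[1., 2., 3.],
              [4., 5., 6.]])      # a 2x3 matrix, given columnwise

vec_A = A.reshape(-1, order="F")  # vec(A): stack the columns, length 2*3 = 6
print(vec_A)                      # [1. 4. 2. 5. 3. 6.]

B = np.array([[0., 1.],
              [1., 0.]])
print(np.kron(A, B).shape)        # Kronecker product: shape (2*2, 3*2) = (4, 6)
```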
We will build on the following marvelous identity: for matrices $A$, $X$, $B$ of compatible size we have that
$$\operatorname{vec}(AXB) = (B^{T} \otimes A)\operatorname{vec}(X).$$
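The identity is also easy to check numerically; a small NumPy sanity check with random matrices of compatible sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
X = rng.standard_normal((4, 5))
B = rng.standard_normal((5, 2))

vec = lambda M: M.reshape(-1, order="F")   # columnwise vectorization

lhs = vec(A @ X @ B)
rhs = np.kron(B.T, A) @ vec(X)
print(np.allclose(lhs, rhs))               # True
```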
Why is this helpful? It allows us to rewrite
$$Wx = \operatorname{vec}(I_{m}Wx) = (x^{T} \otimes I_{m})\operatorname{vec}(W).$$
So we can also rewrite
$$Wx + b = \begin{bmatrix} x^{T} \otimes I_{m} & I_{m}\end{bmatrix}\begin{bmatrix}\operatorname{vec}(W)\\ b\end{bmatrix}.$$
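In code this rewriting looks as follows (again only a NumPy sketch with arbitrary small sizes $m$ and $n$):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 4
W = rng.standard_normal((m, n))
b = rng.standard_normal(m)
x = rng.standard_normal(n)

vec = lambda M: M.reshape(-1, order="F")

# Wx as a linear map applied to vec(W): (x^T kron I_m) vec(W)
print(np.allclose(W @ x, np.kron(x.reshape(1, -1), np.eye(m)) @ vec(W)))  # True

# Wx + b as one matrix applied to the stacked parameter vector (vec(W), b)
M = np.hstack([np.kron(x.reshape(1, -1), np.eye(m)), np.eye(m)])          # (m, m*n + m)
p = np.concatenate([vec(W), b])
print(np.allclose(W @ x + b, M @ p))                                      # True
```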
So our map $(W, b) \mapsto \sigma(Wx + b)$ can be rewritten as
$$\begin{bmatrix}\operatorname{vec}(W)\\ b\end{bmatrix} \mapsto \sigma\left(\begin{bmatrix} x^{T} \otimes I_{m} & I_{m}\end{bmatrix}\begin{bmatrix}\operatorname{vec}(W)\\ b\end{bmatrix}\right),$$
mapping $\mathbb{R}^{mn+m} \to \mathbb{R}^{m}$. Since this map is just a composition of $\sigma$ applied coordinate-wise and a linear map, now given as a matrix, its derivative (i.e. the Jacobian, a matrix in $\mathbb{R}^{m\times(mn+m)}$) is calculated simply as
$$D_{(W,b)}\,\sigma(Wx+b) = \operatorname{diag}\bigl(\sigma'(Wx+b)\bigr)\begin{bmatrix} x^{T} \otimes I_{m} & I_{m}\end{bmatrix}.$$
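For concreteness, here is a NumPy sketch of this formula with $\sigma = \tanh$ as an example activation, checked against a forward-difference approximation of the Jacobian:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 3, 4
W = rng.standard_normal((m, n))
b = rng.standard_normal(m)
x = rng.standard_normal(n)

sigma = np.tanh                          # example activation
dsigma = lambda z: 1.0 - np.tanh(z)**2   # its derivative, applied coordinate-wise
vec = lambda M: M.reshape(-1, order="F")

def layer(p):
    """The layer as a function of the stacked parameters p = (vec(W), b)."""
    Wp = p[:m * n].reshape((m, n), order="F")
    bp = p[m * n:]
    return sigma(Wp @ x + bp)

# Jacobian from the formula: diag(sigma'(Wx+b)) [x^T kron I_m, I_m]
z = W @ x + b
A = np.hstack([np.kron(x.reshape(1, -1), np.eye(m)), np.eye(m)])  # (m, m*n + m)
J = np.diag(dsigma(z)) @ A

# Compare with forward differences
p0 = np.concatenate([vec(W), b])
eps = 1e-6
J_fd = np.column_stack([(layer(p0 + eps * e) - layer(p0)) / eps
                        for e in np.eye(m * n + m)])
print(np.allclose(J, J_fd, atol=1e-4))   # True (up to finite-difference error)
```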
While this representation of the derivative of a single layer of a neural network with respect to its parameters is not particularly simple, it is still index free and moreover, straightforward to implement in languages which provide functions for the Kronecker product and vectorization. If you do this, make sure to take advantage of sparse matrices for the identity matrix and the diagonal matrix as otherwise the memory of your computer will be flooded with zeros.
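For instance, with SciPy one could keep the identity matrices and the diagonal factor sparse (a sketch using `scipy.sparse`; sizes and activation are again just examples):

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(3)
m, n = 100, 200
W = rng.standard_normal((m, n))
b = rng.standard_normal(m)
x = rng.standard_normal(n)

z = W @ x + b
dsigma = 1.0 - np.tanh(z)**2          # sigma'(Wx+b) for sigma = tanh (example)

I_m = sp.identity(m, format="csr")    # sparse identity: no dense zeros stored
A = sp.hstack([sp.kron(sp.csr_matrix(x.reshape(1, -1)), I_m), I_m], format="csr")
J = sp.diags(dsigma) @ A              # sparse Jacobian, shape (m, m*n + m)
print(J.shape, J.nnz)                 # only about m*(n+1) nonzero entries
```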
Now let's add a scalar function $\ell \colon \mathbb{R}^{m} \to \mathbb{R}$ (e.g. to produce a scalar loss that we can minimize), i.e. we consider the map
$$(W, b) \mapsto \ell\bigl(\sigma(Wx+b)\bigr).$$
The derivative is obtained by just another application of the chain rule:
$$D_{(W,b)}\,\ell\bigl(\sigma(Wx+b)\bigr) = D\ell\bigl(\sigma(Wx+b)\bigr)\operatorname{diag}\bigl(\sigma'(Wx+b)\bigr)\begin{bmatrix} x^{T} \otimes I_{m} & I_{m}\end{bmatrix}.$$
If we want to take gradients, we just transpose the expression and get
$$\nabla_{(W,b)}\,\ell\bigl(\sigma(Wx+b)\bigr) = \begin{bmatrix} x \otimes I_{m}\\ I_{m}\end{bmatrix}\operatorname{diag}\bigl(\sigma'(Wx+b)\bigr)\nabla\ell\bigl(\sigma(Wx+b)\bigr).$$
Note that the right-hand side is indeed a vector in $\mathbb{R}^{mn+m}$ and hence can be reshaped into a tuple of an $m\times n$ matrix and an $m$-vector, matching the shapes of $W$ and $b$.
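As a last sketch, here is the gradient formula in NumPy with the example loss $\ell(y) = \tfrac{1}{2}\|y\|^{2}$ (so that $\nabla\ell(y) = y$) and $\sigma = \tanh$, including the reshape back into an $m\times n$ matrix and an $m$-vector:

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 3, 4
W = rng.standard_normal((m, n))
b = rng.standard_normal(m)
x = rng.standard_normal(n)

z = W @ x + b
y = np.tanh(z)                      # sigma(Wx + b) with sigma = tanh (example)
grad_ell = y                        # nabla ell(y) for ell(y) = 0.5*||y||^2 (example loss)
dsigma = 1.0 - np.tanh(z)**2        # sigma'(Wx + b)

# Gradient: [x kron I_m; I_m] diag(sigma'(Wx+b)) nabla ell(sigma(Wx+b))
A_T = np.vstack([np.kron(x.reshape(-1, 1), np.eye(m)), np.eye(m)])  # (m*n + m, m)
grad = A_T @ (dsigma * grad_ell)                                    # vector in R^(m*n + m)

# Reshape back into an m x n matrix and an m-vector
grad_W = grad[:m * n].reshape((m, n), order="F")
grad_b = grad[m * n:]
print(grad_W.shape, grad_b.shape)   # (3, 4) (3,)
```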
A final remark: the Kronecker product is related to tensor products. If $A \in \mathbb{R}^{m\times n}$ and $B \in \mathbb{R}^{p\times q}$ represent linear maps $\mathbb{R}^{n} \to \mathbb{R}^{m}$ and $\mathbb{R}^{q} \to \mathbb{R}^{p}$, respectively, then $A \otimes B$ represents the tensor product of the maps, $\mathbb{R}^{n} \otimes \mathbb{R}^{q} \to \mathbb{R}^{m} \otimes \mathbb{R}^{p}$. This relation to tensor products and tensors explains where the "tensor" in TensorFlow comes from.