Taking the derivative of the loss function of a neural network can be quite cumbersome. Even taking the derivative of a single layer in a neural network often results in expressions cluttered with indices. In this post I’d like to show an index-free way to do it.

Consider the map
$$ (W,b) \mapsto \sigma(Wx+b), $$
where $W\in\mathbb{R}^{m\times n}$ is the weight matrix, $b\in\mathbb{R}^{m}$ is the bias, $x\in\mathbb{R}^{n}$ is the input, and $\sigma$ is the activation function. Usually $\sigma$ represents both a scalar function (i.e. mapping $\mathbb{R}\to\mathbb{R}$) and the function mapping $\mathbb{R}^{m}\to\mathbb{R}^{m}$ which applies $\sigma$ in each coordinate. In training neural networks, we would try to optimize for best parameters $W$ and $b$. So we need to take the derivative with respect to $W$ and $b$. So we consider the map
$$ f:\mathbb{R}^{m\times n}\times\mathbb{R}^{m}\to\mathbb{R}^{m},\quad f(W,b) = \sigma(Wx+b). $$

This map is a composition of the map $(W,b)\mapsto Wx+b$ and $\sigma$, and since the former map is linear in the joint variable $(W,b)$, the derivative of $f$ should be pretty simple. What makes the computation a little less straightforward is the fact that we are usually not used to viewing the matrix-vector product $Wx$ as a linear map in $W$ but in $x$. So let's rewrite the thing.

There are two particular notions which come in handy here: the Kronecker product of matrices and the vectorization of matrices. Vectorization takes some given $W\in\mathbb{R}^{m\times n}$ columnwise and maps it by
$$ \mathrm{vec}:\mathbb{R}^{m\times n}\to\mathbb{R}^{mn},\quad \mathrm{vec}(W) = \begin{pmatrix} w_{1}\\ \vdots\\ w_{n} \end{pmatrix}, $$
where $w_{1},\dots,w_{n}\in\mathbb{R}^{m}$ are the columns of $W$.

The Kronecker product of matrices $A\in\mathbb{R}^{m\times n}$ and $B\in\mathbb{R}^{p\times q}$ is a matrix in $\mathbb{R}^{mp\times nq}$, namely
$$ A\otimes B = \begin{pmatrix} a_{11}B & \cdots & a_{1n}B\\ \vdots & & \vdots\\ a_{m1}B & \cdots & a_{mn}B \end{pmatrix}. $$
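Both operations are available out of the box in NumPy; a minimal sketch (assuming NumPy, with `vec` implemented via column-major reshaping):

```python
import numpy as np

# Vectorization stacks the columns of a matrix into one long vector.
# NumPy reshapes row-major by default, so we request Fortran (column-major) order.
def vec(A):
    return A.reshape(-1, order="F")

W = np.array([[1, 2],
              [3, 4]])
print(vec(W))    # columns stacked: [1 3 2 4]

# The Kronecker product builds the block matrix (a_ij * B).
A = np.eye(2)
B = np.array([[0, 1],
              [1, 0]])
print(np.kron(A, B).shape)    # (4, 4)
```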

We will build on the following marvelous identity: For matrices $A$, $X$, $B$ of compatible size we have that
$$ \mathrm{vec}(AXB) = (B^{T}\otimes A)\,\mathrm{vec}(X). $$
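The identity is easy to verify numerically; a quick NumPy check on random matrices of compatible size:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
X = rng.standard_normal((4, 5))
B = rng.standard_normal((5, 2))

def vec(M):
    # column-major reshape implements columnwise vectorization
    return M.reshape(-1, order="F")

# vec(A X B) = (B^T ⊗ A) vec(X)
lhs = vec(A @ X @ B)
rhs = np.kron(B.T, A) @ vec(X)
print(np.allclose(lhs, rhs))    # True
```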

Why is this helpful? It allows us to rewrite
$$ Wx = \mathrm{vec}(Wx) = \mathrm{vec}(I_{m}Wx) = (x^{T}\otimes I_{m})\,\mathrm{vec}(W). $$

So we can also rewrite
$$ Wx + b = \begin{bmatrix} x^{T}\otimes I_{m} & I_{m} \end{bmatrix} \begin{pmatrix} \mathrm{vec}(W)\\ b \end{pmatrix}. $$
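To check that $Wx+b$ really equals the matrix $[\,x^{T}\otimes I_{m}\ \ I_{m}\,]$ applied to the stacked vector of $\mathrm{vec}(W)$ and $b$, here is a small NumPy sketch:

```python
import numpy as np

m, n = 3, 4
rng = np.random.default_rng(5)
W = rng.standard_normal((m, n))
b = rng.standard_normal(m)
x = rng.standard_normal(n)

# The block matrix [x^T ⊗ I_m , I_m] and the stacked parameter (vec(W), b)
M = np.hstack([np.kron(x[None, :], np.eye(m)), np.eye(m)])
theta = np.concatenate([W.reshape(-1, order="F"), b])

print(np.allclose(M @ theta, W @ x + b))    # True
```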

So our map $f$ mapping $(W,b)\mapsto\sigma(Wx+b)$ can be rewritten as
$$ \tilde f(\theta) = \sigma\left(\begin{bmatrix} x^{T}\otimes I_{m} & I_{m} \end{bmatrix}\theta\right),\quad \theta = \begin{pmatrix} \mathrm{vec}(W)\\ b \end{pmatrix}, $$

mapping $\mathbb{R}^{mn+m}\to\mathbb{R}^{m}$. Since $\tilde f$ is just a composition of $\sigma$ applied coordinate-wise and a linear map, now given as a matrix, the derivative of $\tilde f$ (i.e. the Jacobian, a matrix in $\mathbb{R}^{m\times(mn+m)}$) is calculated simply as
$$ D\tilde f(\theta) = \mathrm{diag}\bigl(\sigma'(Wx+b)\bigr)\begin{bmatrix} x^{T}\otimes I_{m} & I_{m} \end{bmatrix}. $$
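This formula can be validated against finite differences; a sketch in NumPy, assuming $\sigma=\tanh$ (so that $\sigma'=1-\tanh^{2}$):

```python
import numpy as np

m, n = 3, 4
rng = np.random.default_rng(1)
W = rng.standard_normal((m, n))
b = rng.standard_normal(m)
x = rng.standard_normal(n)

# Layer as a function of theta = (vec(W), b), with sigma = tanh
M = np.hstack([np.kron(x[None, :], np.eye(m)), np.eye(m)])  # [x^T ⊗ I_m , I_m]
theta = np.concatenate([W.reshape(-1, order="F"), b])
f = lambda t: np.tanh(M @ t)

# Jacobian from the formula: diag(sigma'(Wx+b)) [x^T ⊗ I_m , I_m]
z = W @ x + b
J = np.diag(1 - np.tanh(z) ** 2) @ M

# Compare with central finite differences, column by column.
eps = 1e-6
J_fd = np.column_stack([
    (f(theta + eps * e) - f(theta - eps * e)) / (2 * eps)
    for e in np.eye(theta.size)
])
print(np.allclose(J, J_fd, atol=1e-6))    # True
```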

While this representation of the derivative of a single layer of a neural network with respect to its parameters is not particularly simple, it is still index free and, moreover, straightforward to implement in languages which provide functions for the Kronecker product and vectorization. If you do this, make sure to take advantage of sparse matrices for the identity matrix $I_{m}$ and the diagonal matrix $\mathrm{diag}(\sigma'(Wx+b))$, as otherwise the memory of your computer will be flooded with zeros.

Now let’s add a scalar function $\ell:\mathbb{R}^{m}\to\mathbb{R}$ (e.g. to produce a scalar loss that we can minimize), i.e. we consider the map
$$ g(\theta) = \ell\left(\sigma\left(\begin{bmatrix} x^{T}\otimes I_{m} & I_{m} \end{bmatrix}\theta\right)\right). $$

The derivative is obtained by just another application of the chain rule:
$$ Dg(\theta) = D\ell\bigl(\sigma(Wx+b)\bigr)\,\mathrm{diag}\bigl(\sigma'(Wx+b)\bigr)\begin{bmatrix} x^{T}\otimes I_{m} & I_{m} \end{bmatrix}. $$

If we want to take gradients, we just transpose the expression and get
$$ \nabla g(\theta) = \begin{bmatrix} x\otimes I_{m}\\ I_{m} \end{bmatrix} \mathrm{diag}\bigl(\sigma'(Wx+b)\bigr)\,\nabla\ell\bigl(\sigma(Wx+b)\bigr). $$

Note that the right hand side is indeed a vector in $\mathbb{R}^{mn+m}$ and hence can be reshaped to a tuple of an $m\times n$ matrix and an $m$-vector.
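The gradient formula and the reshaping can be tried out numerically; a NumPy sketch assuming $\sigma=\tanh$ and the squared-error loss $\ell(a)=\tfrac12\|a-y\|^{2}$ (so $\nabla\ell(a)=a-y$), both choices for illustration only:

```python
import numpy as np

m, n = 3, 4
rng = np.random.default_rng(3)
W = rng.standard_normal((m, n))
b = rng.standard_normal(m)
x = rng.standard_normal(n)
y = rng.standard_normal(m)      # a target, to make the loss concrete

z = W @ x + b
u = (1 - np.tanh(z) ** 2) * (np.tanh(z) - y)  # diag(sigma'(z)) grad l(sigma(z))

# Gradient as one long vector in R^(mn+m): [x ⊗ I_m ; I_m] u
grad_theta = np.concatenate([np.kron(x[:, None], np.eye(m)) @ u, u])

# Reshape back to a (W-shaped, b-shaped) pair; order="F" undoes
# the columnwise vectorization.
grad_W = grad_theta[: m * n].reshape(m, n, order="F")
grad_b = grad_theta[m * n :]

# Sanity check: for this loss the W-gradient is the rank-one matrix u x^T.
print(np.allclose(grad_W, np.outer(u, x)))    # True
print(np.allclose(grad_b, u))                 # True
```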

A final remark: the Kronecker product is related to tensor products. If $A$ and $B$ represent the linear maps $\mathcal{A}:V\to W$ and $\mathcal{B}:X\to Y$, respectively, then $A\otimes B$ represents the tensor product of the maps, $\mathcal{A}\otimes\mathcal{B}:V\otimes X\to W\otimes Y$. This relation to tensor products and tensors explains where the tensor in TensorFlow comes from.
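On elementary tensors this means $(A\otimes B)(v\otimes w)=(Av)\otimes(Bw)$, i.e. the Kronecker matrix acts factor-wise, exactly as the tensor product of the two maps should; a quick NumPy check:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((4, 5))
v = rng.standard_normal(3)
w = rng.standard_normal(5)

# (A ⊗ B)(v ⊗ w) = (Av) ⊗ (Bw): the Kronecker matrix applied to an
# elementary tensor acts on each factor separately.
lhs = np.kron(A, B) @ np.kron(v, w)
rhs = np.kron(A @ v, B @ w)
print(np.allclose(lhs, rhs))    # True
```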