### Uncategorized

In my previous post I illustrated why it is not possible to compute the Jordan canonical form numerically (i.e. in floating point numbers). The simple reason: For every matrix ${A}$ and every ${\epsilon>0}$ there is a matrix ${A_{\epsilon}}$ which differs from ${A}$ by at most ${\epsilon}$ (e.g. in every entry – but all norms for matrices are equivalent, so this does not really play a role) such that ${A_{\epsilon}}$ is diagonalizable. So why should you bother about computing the Jordan canonical form anyway? Or even learning or teaching it? Well, the prime application of the Jordan canonical form is to calculate solutions of linear systems of ODEs. The equation

$\displaystyle y'(t) = Ay(t),\quad y(0) = y_{0}$

with matrix ${A\in {\mathbb R}^{n\times n}}$ and initial value ${y_{0}\in{\mathbb R}^{n}}$ (both could also be complex). This system has a unique solution which can be given explicitly with the help of the matrix exponential as

$\displaystyle y(t) = \exp(At)y_{0}$

where the matrix exponential is

$\displaystyle \exp(At) = \sum_{k=0}^{\infty}\frac{A^{k}t^{k}}{k!}.$

It is not always simple to work out the matrix exponential by hand. The straightforward way would be to calculate all the powers of ${A}$, weight them by ${1/k!}$ and sum the series. This may be a challenge, even for simple matrices. My favorite example is the matrix

$\displaystyle A = \begin{bmatrix} 0 & 1\\ 1 & 1 \end{bmatrix}.$

Its first powers are

$\displaystyle A^{2} = \begin{bmatrix} 1 & 1\\ 1 & 2 \end{bmatrix},\quad A^{3} = \begin{bmatrix} 1 & 2\\ 2 & 3 \end{bmatrix}$

$\displaystyle A^{4} = \begin{bmatrix} 2 & 3\\ 3 & 5 \end{bmatrix},\quad A^{5} = \begin{bmatrix} 3 & 5\\ 5 & 8 \end{bmatrix}.$

You may notice that the Fibonicci numbers appear (and this is pretty clear on a second thought). So, finding a explicit form for ${\exp(A)}$ leads us to finding an explicit form for the ${k}$-th Fibonacci number (which is possible, but I will not treat this here).

Another way is diagonalization: If ${A}$ is diagonalizable, i.e. there is an invertible matrix ${S}$ and a diagonal matrix ${D}$ such that

$\displaystyle S^{-1}AS = D\quad\text{or, equivalently}\quad A = SDS^{-1},$

you see that

$\displaystyle \exp(At) = S\exp(Dt)S^{-1}$

and the matrix exponential of a diagonal matrix is simply the exponential function applied to the diagonal entries.

But not all matrices are diagonalizable! The solution that is usually presented in the classroom is to use the Jordan canonical form instead and to compute the matrix exponential of Jordan blocks (using that you can split a Jordan block ${J = D+N}$ into the sum of a diagonal matrix ${D}$ and a nil-potent matrix ${N}$ and since ${D}$ and ${N}$ commute one can calculate ${\exp(J) = \exp(D)\exp(N)}$ and both matrix exponentials are quite easy to compute).

But in light of the fact that there are a diagonalizable matrices arbitrarily close to any matrix, on may ask: What about replacing a non-diagonalizable matrix ${A}$ with a diagonalizable one (with a small error) and then use this one?

Let’s try this on a simple example:

We consider

$\displaystyle A = \begin{bmatrix} -1 & 1\\ 0 & -1 \end{bmatrix}$

which is not diagonalizable. The linear initial value problem

$\displaystyle y' = Ay,\quad y(0) = y_{0}$

has the solution

$\displaystyle y(t) = \exp( \begin{bmatrix} -t & t\\ 0 & -t \end{bmatrix}) y_{0}$

and the matrix exponential is

$\displaystyle \begin{array}{rcl} \exp( \begin{bmatrix} -t & t\\ 0 & -t \end{bmatrix}) & = &\exp(\begin{bmatrix} -t & 0\\ 0 & -t \end{bmatrix})\exp(\begin{bmatrix} 0 & t\\ 0 & 0 \end{bmatrix})\\& = &\begin{bmatrix} \exp(-t) & 0\\ 0 & \exp(-t) \end{bmatrix}\begin{bmatrix} 1 & t\\ 0 & 1 \end{bmatrix}\\ &=& \begin{bmatrix} \exp(-t) & t\exp(-t)\\ 0 & \exp(-t) \end{bmatrix}. \end{array}$

So we get the solution

$\displaystyle y(t) = \begin{bmatrix} e^{-t}(y^{0}_{1} + ty^{0}_{2})\\ e^{-t}y^{0}_{2} \end{bmatrix}.$

Let us take a close-by matrix which is diagonalizable. For some small ${\epsilon}$ we choose

$\displaystyle A_{\epsilon} = \begin{bmatrix} -1 & 1\\ 0 & -1+\epsilon \end{bmatrix}.$

Since ${A_{\epsilon}}$ is upper triangular, it has its eigenvalues on the diagonal. Since ${\epsilon\neq 0}$, there are two distinct eigenvalues and hence, ${A_{\epsilon}}$ is diagonalizable. Indeed, with

$\displaystyle S = \begin{bmatrix} 1 & 1\\ 0 & \epsilon \end{bmatrix},\quad S^{-1}= \begin{bmatrix} 1 & -\tfrac1\epsilon\\ 0 & \tfrac1\epsilon \end{bmatrix}$

we get

$\displaystyle A = S \begin{bmatrix} -1 & 0 \\ 0 & -1+\epsilon \end{bmatrix}S^{-1}.$

The matrix exponential of ${A_{\epsilon}t}$ is

$\displaystyle \begin{array}{rcl} \exp(A_{\epsilon}t) &=& S\exp( \begin{bmatrix} -t & 0\\ 0 & -t(1-\epsilon) \end{bmatrix} )S^{-1}\\ &=& \begin{bmatrix} e^{-t} & \tfrac{e^{-(1-\epsilon)t} - e^{-t}}{\epsilon}\\ 0 & e^{-(1-\epsilon)t} \end{bmatrix}. \end{array}$

Hence, the solution of ${y' = Ay}$, ${y(0) = y_{0}}$ is

$\displaystyle y(t) = \begin{bmatrix} e^{-t}y^{0}_{1} + \tfrac{e^{-(1-\epsilon)t} - e^{-t}}{\epsilon}y^{0}_{2}\\ e^{-(1-\epsilon)t}y^{0}_{2} \end{bmatrix}.$

How is this related to the solution of ${y'=Ay}$? How far is it away?

Of course, the lower right entry of ${\exp(A_{\epsilon}t)}$ converges to ${e^{-t}}$ for ${\epsilon \rightarrow 0}$, but what about the upper right entry? Note that the entry

$\displaystyle \tfrac{e^{-(1-\epsilon)t} - e^{-t}}{\epsilon}$

is nothing else that the (negative) difference quotient for the derivative of the function ${f(a) = e^{-at}}$ at ${a=1}$. Hence

$\displaystyle \tfrac{e^{-(1-\epsilon)t} - e^{-t}}{\epsilon} \stackrel{\epsilon\rightarrow 0}{\longrightarrow} -f'(1) = te^{-t}$

and we get

$\displaystyle \exp(A_{\epsilon}t) \stackrel{\epsilon\rightarrow 0}{\longrightarrow} \begin{bmatrix} e^{-t} & te^{-t}\\ 0 & e^{-t} \end{bmatrix} = \exp(At)$

as expected.

It turns out that a fairly big $\epsilon$ is already enough to get a quite good approximation and even the correct asymptotics: The blue curve it first component of the exact solution (initialized with the second standard basis vector), the red one corresponds $\epsilon = 0.1$ and the yellow on (pretty close to the blue on) is for $\epsilon = 0.01$.

to  \$\e

I remember from my introductory class in linear algebra that my instructor said

It is impossible to calculate the Jordan canonical form of a matrix numerically.

Another quote I remember is

The Jordan canonical form does not depend continuously on the matrix.

For both quotes I did not remember the underlying reasons and since I do teach an introductory class on linear algebra this year, I got thinking about these issues again.

Here is a very simple example for the fact in the second quote:

Consider the matrix

$\displaystyle A_{\varepsilon} = \begin{pmatrix}1 & \varepsilon\\ 0 & 1\end{pmatrix}$

for ${\varepsilon>0}$. This matrix has ${1}$ as a double eigenvalue, but the corresponding eigenspace is one-dimensional and spanned by ${v_{1} = e_{1}}$. To extend this vector to a basis we calculate a principle vector by solving

$\displaystyle (A_{\varepsilon}-I)v_{2} = v_{1}$

$\displaystyle v_{2} = \begin{pmatrix} 0\\\varepsilon^{-1} \end{pmatrix}.$

Defining ${S = [v_{1}\, v_{2}]}$ we get the Jordan canonical form of ${A_{\varepsilon}}$ as

$\displaystyle J_{\varepsilon} = S^{-1}A_{\varepsilon}S = \begin{pmatrix} 1 & 1\\ 0 & 1 \end{pmatrix}$

So we have

$\displaystyle A_{\varepsilon}\rightarrow A = I\quad\text{and}\quad J_{\varepsilon} \rightarrow J = \begin{pmatrix} 1 & 1\\ 0 & 1 \end{pmatrix},$

but ${J}$ is not the Jordan canonical form of ${A}$. So, in short: The Jordan canonical form of the limit of ${A_{\varepsilon}}$ is not the limit of the Jordan canonical form of ${A_{\varepsilon}}$. In other words: Taking limits does not commute with forming the Jordan canonical form.

A side note: Of course, the Jordan canonical form is not even unique in general, so speaking of “dependence on the matrix” is an issue. What we have shown is, that there is no way to get continuous dependence on the matrix even if non-uniqueness is not an issue (like in the example above).

What about the first quote? Why is it impossible to compute the Jordan canonical form numerically? Let’s just try! We start with the simplest non-diagonalizable matrix

$\displaystyle A = \begin{pmatrix} 1 & 1\\ 0 & 1 \end{pmatrix}$

If we ask MATLAB or Octave to do the eigenvalue decomposition we get

>> [V,D] = diag(A)
V =

1.00000 -1.00000
0.00000  0.00000

D =

1 0
0 1

We see that ${V}$ does not seem to be invertible and indeed we get 

>> rank(V)
ans = 1


What is happening? MATLAB did not promise the produce an invertible ${V}$ and it does not promise that the putcome would fulfill ${V^{-1}AV = D}$ (which is my definition of diagonalizability). It does promise that ${AV = VD}$ and in fact

>> A*V
ans =

1.00000  1.00000
0.00000 -0.00000

>> V*D
ans =

1.00000 -1.00000
0.00000  0.00000

>> A*V-V*D
ans =

0.0000e+00 2.2204e-16
0.0000e+00 0.0000e+00

so the promised identity is fulfilled up to machine precision (which is actually equal $\texttt{2.2204e-16}$ and we denote it by ${\epsilon}$ from here on).

How did MATLAB diagonalize this matrix? Here is the thing: The diagonalizable matrices are dense in ${{\mathbb C}^{n\times n}}$! (You probably have heard that before…) What does that mean numerically? Any matrix that you represent in floating point numbers is actually a representative of a whole bunch of matrices. Each entry is only known up to a certain precision. But this bunch of matrices does contain some matrix which is diagonalizable! This is exactly, what it means to be a dense set! So it is impossible to say if a matrix given in floating point numbers is actually diagonalizable or not. So, what matrix was diagonalized by MATLAB? Let us have closer look at the matrix ${V}$: The entries in the first row are in fact ${1}$ and ${-1}$:

>> V(1,:)
ans =
1 -1

In the second row we have 

>> V(2,1)
ans =
0
>> V(2,2)
ans = 2.2204e-16

and there we have it. The inverse of ${V}$ does exist (although the matrix has numerical rank ${1}$) and it is

>> inv(V) warning: matrix singular to machine precision, rcond = 1.11022e-16
ans =

1.0000e+00 4.5036e+15
0.0000e+00 4.5036e+15

and note that $\texttt{4.5036e+15}$ is indeed just the inverse of the machine precision, so this inverse is actually 100% accurate. Recombining gives

>> inv(V)*D*V warning: matrix singular to machine precision, rcond = 1.11022e-16
ans =

1 0
0 1


which is not even close to our original ${A}$.
How is that? Here is a solution. The matrix

$\displaystyle \tilde A = \begin{pmatrix} 1 & 1\\ 0 & 1+\epsilon^{-2} \end{pmatrix}$

is indistinguishable from ${A}$ numerically. However, it has two distinct eigenvalues, so it is diagonalizable. Indeed, a basis of eigenvectors is

$\displaystyle \tilde V = \begin{pmatrix} 1 & -1\\ 0 & \epsilon^{-2} \end{pmatrix}$

which is indistinguishable from ${V}$ above and it holds that

$\displaystyle \tilde V^{-1}\tilde A\tilde V = \begin{pmatrix} 1 & 0\\ 0 & 1+\epsilon^{-2}. \end{pmatrix}$

which is indistinguishable from $D$.

Im Wintersemster 2018/19 habe ich die Vorlesung “Lineare Algebra 1” gehalten. Hier die lecture notes dazu:

Taking the derivative of the loss function of a neural network can be quite cumbersome. Even taking the derivative of a single layer in a neural network often results in expressions cluttered with indices. In this post I’d like to show an index-free way to do it.

Consider the map ${\sigma(Wx+b)}$ where ${W\in{\mathbb R}^{m\times n}}$ is the weight matrix, ${b\in{\mathbb R}^{m}}$ is the bias, ${x\in{\mathbb R}^{n}}$ is the input, and ${\sigma}$ is the activation function. Usually ${\sigma}$ represents both a scalar function (i.e. mapping ${{\mathbb R}\mapsto {\mathbb R}}$) and the function mapping ${{\mathbb R}^{m}\rightarrow{\mathbb R}^{m}}$ which applies ${\sigma}$ in each coordinate. In training neural networks, we would try to optimize for best parameters ${W}$ and ${b}$. So we need to take the derivative with respect to ${W}$ and ${b}$. So we consider the map

$\displaystyle \begin{array}{rcl} G(W,b) = \sigma(Wx+b). \end{array}$

This map ${G}$ is a concatenation of the map ${(W,b)\mapsto Wx+b}$ and ${\sigma}$ and since the former map is linear in the joint variable ${(W,b)}$, the derivative of ${G}$ should be pretty simple. What makes the computation a little less straightforward is the fact the we are usually not used to view matrix-vector products ${Wx}$ as linear maps in ${W}$ but in ${x}$. So let’s rewrite the thing:

There are two particular notions which come in handy here: The Kronecker product of matrices and the vectorization of matrices. Vectorization takes some ${W\in{\mathbb R}^{m\times n}}$ given columnwise ${W = [w_{1}\ \cdots\ w_{n}]}$ and maps it by

$\displaystyle \begin{array}{rcl} \mathrm{Vec}:{\mathbb R}^{m\times n}\rightarrow{\mathbb R}^{mn},\quad \mathrm{Vec}(W) = \begin{bmatrix} w_{1}\\\vdots\\w_{n} \end{bmatrix}. \end{array}$

The Kronecker product of matrices ${A\in{\mathbb R}^{m\times n}}$ and ${B\in{\mathbb R}^{k\times l}}$ is a matrix in ${{\mathbb R}^{mk\times nl}}$

$\displaystyle \begin{array}{rcl} A\otimes B = \begin{bmatrix} a_{11}B & \cdots &a_{1n}B\\ \vdots & & \vdots\\ a_{m1}B & \cdots & a_{mn}B \end{bmatrix}. \end{array}$

We will build on the following marvelous identity: For matrices ${A}$, ${B}$, ${C}$ of compatible size we have that

$\displaystyle \begin{array}{rcl} \mathrm{Vec}(ABC) = (C^{T}\otimes A)\mathrm{Vec}(B). \end{array}$

Why is this helpful? It allows us to rewrite

$\displaystyle \begin{array}{rcl} Wx & = & \mathrm{Vec}(Wx)\\ & = & \mathrm{Vec}(I_{m}Wx)\\ & = & \underbrace{(x^{T}\otimes I_{m})}_{\in{\mathbb R}^{m\times mn}}\underbrace{\mathrm{Vec}(W)}_{\in{\mathbb R}^{mn}}. \end{array}$

So we can also rewrite

$\displaystyle \begin{array}{rcl} Wx +b & = & \mathrm{Vec}(Wx+b )\\ & = & \mathrm{Vec}(I_{m}Wx + b)\\ & = & \underbrace{ \begin{bmatrix} x^{T}\otimes I_{m} & I_{m} \end{bmatrix} }_{\in{\mathbb R}^{m\times (mn+m)}}\underbrace{ \begin{bmatrix} \mathrm{Vec}(W)\\b \end{bmatrix} }_{\in{\mathbb R}^{mn+m}}\\ &=& ( \underbrace{\begin{bmatrix} x^{T} & 1 \end{bmatrix}}_{\in{\mathbb R}^{1\times(n+1)}}\otimes I_{m}) \begin{bmatrix} \mathrm{Vec}(W)\\b \end{bmatrix}. \end{array}$

So our map ${G(W,b) = \sigma(Wx+b)}$ mapping ${{\mathbb R}^{m\times n}\times {\mathbb R}^{m}\rightarrow{\mathbb R}^{m}}$ can be rewritten as

$\displaystyle \begin{array}{rcl} \bar G( \begin{pmatrix} \mathrm{Vec}(W)\\b \end{pmatrix}) = \sigma( ( \begin{bmatrix} x^{T} & 1 \end{bmatrix}\otimes I_{M}) \begin{bmatrix} \mathrm{Vec}(W)\\b \end{bmatrix}) \end{array}$

mapping ${{\mathbb R}^{mn+m}\rightarrow{\mathbb R}^{m}}$. Since ${\bar G}$ is just a concatenation of ${\sigma}$ applied coordinate wise and a linear map, now given as a matrix, the derivative of ${\bar G}$ (i.e. the Jacobian, a matrix in ${{\mathbb R}^{m\times (mn+m)}}$) is calculated simply as

$\displaystyle \begin{array}{rcl} D\bar G( \begin{pmatrix} \mathrm{Vec}(W)\\b \end{pmatrix}) & = & D\sigma(Wx+b)( \begin{bmatrix} x^{T} & 1 \end{bmatrix}\otimes I_{M})\\ &=& \underbrace{\mathrm{diag}(\sigma'(Wx+b))}_{\in{\mathbb R}^{m\times m}}\underbrace{( \begin{bmatrix} x^{T} & 1 \end{bmatrix}\otimes I_{M})}_{\in{\mathbb R}^{m\times(mn+m)}}\in{\mathbb R}^{m\times(mn+m)}. \end{array}$

While this representation of the derivative of a single layer of a neural network with respect to its parameters is not particularly simple, it is still index free and moreover, straightforward to implement in languages which provide functions for the Kronecker product and vectorization. If you do this, make sure to take advantage of sparse matrices for the identity matrix and the diagonal matrix as otherwise the memory of your computer will be flooded with zeros.

Now let’s add a scalar function ${L}$ (e.g. to produce a scalar loss that we can minimize), i.e. we consider the map

$\displaystyle \begin{array}{rcl} F( \begin{pmatrix} \mathrm{Vec}(W)\\b \end{pmatrix}) = L(G(Wx+b)) = L(\bar G( \begin{pmatrix} \mathrm{Vec}(W)\\b \end{pmatrix}). \end{array}$

The derivative is obtained by just another application of the chain rule:

$\displaystyle \begin{array}{rcl} DF( \begin{pmatrix} \mathrm{Vec}(W)\\b \end{pmatrix}) = DL(G(Wx+b))D\bar G( \begin{pmatrix} \mathrm{Vec}(W)\\b \end{pmatrix}). \end{array}$

If we want to take gradients, we just transpose the expression and get

$\displaystyle \begin{array}{rcl} \nabla F( \begin{pmatrix} \mathrm{Vec}(W)\\b \end{pmatrix}) &=& D\bar G( \begin{pmatrix} \mathrm{Vec}(W)\\b \end{pmatrix})^{T} DL(G(Wx+b))^{T}\\ &=& ([x^{T}\ 1]\otimes I_{m})^{T}\mathrm{diag}(\sigma'(Wx+b))\nabla L(G(Wx+b))\\ &=& \underbrace{( \begin{bmatrix} x\\ 1 \end{bmatrix} \otimes I_{m})}_{\in{\mathbb R}^{(mn+m)\times m}}\underbrace{\mathrm{diag}(\sigma'(Wx+b))}_{\in{\mathbb R}^{m\times m}}\underbrace{\nabla L(G(Wx+b))}_{\in{\mathbb R}^{m}}. \end{array}$

Note that the right hand side is indeed vector in ${{\mathbb R}^{mn+m}}$ and hence, can be reshaped to a tupel ${(W,b)}$ of an ${m\times n}$ matrix and an ${m}$ vector.

A final remark: the Kronecker product is related to tensor products. If ${A}$ and ${B}$ represent linear maps ${X_{1}\rightarrow Y_{1}}$ and ${X_{2}\rightarrow Y_{2}}$, respectively, then ${A\otimes B}$ represents the tensor product of the maps, ${X_{1}\otimes X_{2}\rightarrow Y_{1}\otimes Y_{2}}$. This relation to tensor products and tensors explains where the tensor in TensorFlow comes from.

This is a short follow up on my last post where I wrote about the sweet spot of the stepsize of the Douglas-Rachford iteration. For the case $\beta$-Lipschitz + $\mu$-strongly monotone, the iteration with stepsize $t$ converges linear with rate

$\displaystyle r(t) = \tfrac{1}{2(1+t\mu)}\left(\sqrt{2t^{2}\mu^{2}+2t\mu + 1 +2(1 - \tfrac{1}{(1+t\beta)^{2}} - \tfrac1{1+t^{2}\beta^{2}})t\mu(1+t\mu)} + 1\right)$

Here is animated plot of this contraction factor depending on $\beta$ and $\mu$ and $t$ acts as time variable:

What is interesting is, that this factor has increasing or decreasing in $t$ depending on the values of $\beta$ and $\mu$.

For each pair $(\beta,\mu)$ there is a best $t^*$ and also a smallest contraction factor $r(t^*)$. Here are plots of these quantities:

Comparing the plot of te optimal contraction factor to the animated plot above, you see that the right choice of the stepsize matters a lot.

The fundamental theorem of calculus relates the derivative of a function to the function itself via an integral. A little bit more precise, it says that one can recover a function from its derivative up to an additive constant (on a simply connected domain).

In one space dimension, one can fix some ${x_{0}}$ in the domain (which has to be an interval) and then set

$\displaystyle F(x) = \int_{x_{0}}^{x}f'(t) dt.$

Then ${F(x) = f(x) + c}$ with ${c= - f(x_{0})}$.

Actually, a similar claim is true in higher space dimensions: If ${f}$ is defined on a simply connected domain in ${{\mathbb R}^{d}}$ we can recover ${f}$ from its gradient up to an additive constant as follows: Select some ${x_{0}}$ and set

$\displaystyle F(x) = \int_{\gamma}\nabla f(x)\cdot dx \ \ \ \ \ (1)$

for any path ${\gamma}$ from ${x_{0}}$ to ${x}$. Then it holds under suitable conditions that

$\displaystyle F(x) = f(x) - f(x_{0}).$

And now for something completely different: Convex functions and subgradients.

A function ${f}$ on ${{\mathbb R}^{n}}$ is convex if for every ${x}$ there exists a subgradient ${x^{*}\in{\mathbb R}^{n}}$ such that for all ${y}$ one has the subgradient inequality

$\displaystyle f(y) \geq f(x) + \langle x^{*}, y-x\rangle.$

Writing this down for ${x}$ and ${y}$ with interchanged roles (and ${y^{*}}$ as corresponding subgradient to ${y}$), we see that

$\displaystyle \langle x-y,x^{*}-y^{*}\rangle \geq 0.$

In other words: For a convex function ${f}$ it holds that the subgradient ${\partial f}$ is a (multivalued) monotone mapping. Recall that a multivalued map ${A}$ is monotone, if for every ${y \in A(x)}$ and ${y^{*}\in A(y)}$ it holds that ${\langle x-y,x^{*}-y^{*}\rangle \geq 0}$. It is not hard to see that not every monotone map is actually a subgradient of a convex function (not even, if we go to “maximally monotone maps”, a notion that we sweep under the rug in this post). A simple counterexample is a (singlevalued) linear monotone map represented by

$\displaystyle \begin{bmatrix} 0 & 1\\ -1 & 0 \end{bmatrix}$

(which can not be a subgradient of some ${f}$, since it is not symmetric).

Another hint that monotonicity of a map does not imply that the map is a subgradient is that subgradients have some stronger properties than monotone maps. Let us write down the subgradient inequalities for a number of points ${x_{0},\dots x_{n}}$:

$\displaystyle \begin{array}{rcl} f(x_{1}) & \geq & f(x_{0}) + \langle x_{0}^{*},x_{1}-x_{0}\rangle\\ f(x_{2}) & \geq & f(x_{1}) + \langle x_{1}^{*},x_{2}-x_{1}\rangle\\ \vdots & & \vdots\\ f(x_{n}) & \geq & f(x_{n-1}) + \langle x_{n-1}^{*},x_{n}-x_{n-1}\rangle\\ f(x_{0}) & \geq & f(x_{n}) + \langle x_{n}^{*},x_{0}-x_{n}\rangle. \end{array}$

If we sum all these up, we obtain

$\displaystyle 0 \geq \langle x_{0}^{*},x_{1}-x_{0}\rangle + \langle x_{1}^{*},x_{2}-x_{1}\rangle + \cdots + \langle x_{n-1}^{*},x_{n}-x_{n-1}\rangle + \langle x_{n}^{*},x_{0}-x_{n}\rangle.$

This property is called ${n}$-cyclical monotonicity. In these terms we can say that a subgradient of a convex function is cyclical monotone, which means that it is ${n}$-cyclically monotone for every integer ${n}$.

By a remarkable result by Rockafellar, the converse is also true:

Theorem 1 (Rockafellar, 1966) Let ${A}$ by a cyclically monotone map. Then there exists a convex function ${f}$ such that ${A = \partial f}$.

Even more remarkable, the proof is somehow “just an application of the fundamental theorem of calculus” where we recover a function by its subgradient (up to an additive constant that depends on the basepoint).

Proof: we aim to “reconstruct” ${f}$ from ${A = \partial f}$. The trick is to choose some base point ${x_{0}}$ and corresponding ${x_{0}^{*}\in A(x_{0})}$ and set

$\displaystyle \begin{array}{rcl} f(x) &=& \sup\left\{ \langle x_{0}^{*}, x_{1}-x_{0}\rangle + \langle x_{1}^{*},x_{2}-x_{1}\rangle + \cdots + \langle x_{n-1}^{*},x_{n}-x_{n-1}\rangle\right.\\&& \qquad\left. + \langle x_{n}^{*},x-x_{n}\rangle\right\}\qquad\qquad (0)\end{array}$

where the supremum is taken over all integers ${m}$ and all pairs ${x_{i}^{*}\in A(x_{i})}$ ${i=1,\dots, m}$. As a supremum of affine functions ${f}$ is clearly convex (even lower semicontinuous) and ${f(x_{0}) = 0}$ since ${A}$ is cyclically monotone (this shows that ${f}$ is proper, i.e. not equal to ${\infty}$ everywhere). Finally, for ${\bar x^{*}\in A(\bar x)}$ we have

$\displaystyle \begin{array}{rcl} f(x) &\geq& \sup\left\{ \langle x_{0}^{*}, x_{1}-x_{0}\rangle + \langle x_{1}^{*},x_{2}-x_{1}\rangle + \cdots + \langle x_{n-1}^{*},x_{n}-x_{n-1}\rangle\right.\\ & & \qquad \left.+ \langle x_{n}^{*},\bar x-x_{n}\rangle + \langle \bar x^{*},\bar x-\bar x\rangle\right\} \end{array}$

with the supremum taken over all integers ${m}$ and all pairs ${x_{i}^{*}\in A(x_{i})}$ ${i=1,\dots, m}$. The right hand side is equal to ${f(x) + \langle \bar x^{*},x-\bar x\rangle}$ and this shows that ${f}$ is indeed convex. $\Box$

Where did we use the fundamental theorem of calculus? Let us have another look at equation~(0). Just for simplicity, let us denote ${x_{i}^{*} =\nabla f(x_{i})}$. Now consider a path ${\gamma}$ from ${x_{0}}$ to ${x}$ and points ${0=t_{0}< t_{1}<\cdots < t_{n}< t_{n+1} = 1}$ with ${\gamma(t_{i}) = x_{i}}$. Then the term inside the supremum of~(0) equals

$\displaystyle \langle \nabla f(\gamma(t_{0})),\gamma(t_{1})-\gamma(t_{0})\rangle + \dots + \langle \nabla f(\gamma(t_{n})),\gamma(t_{n+1})-\gamma(t_{n})\rangle.$

This is Riemannian sum for an integral of the form ${\int_{\gamma}\nabla f(x)\cdot dx}$. By monotonicity of ${f}$, we increase this sum, if we add another point ${\bar t}$ (e.g. ${t_{i}<\bar t, and hence, the supremum does converge to the integral, i.e.~(0) is equal to

$\displaystyle f(x) = \int_{\gamma}\nabla f(x)\cdot dx$

where ${\gamma}$ is any path from ${x_{0}}$ to ${x}$.

In my last blog post I wrote about Luxemburg norms which are constructions to get norms out of a non-homogeneous function ${\phi:[0,\infty[\rightarrow[0,\infty[}$ which satisfies ${\phi(0) = 0}$ and are increasing and convex (and thus, continuous). The definition of the Luxemburg norm in this case is

$\displaystyle \|x\|_{\phi} := \inf\left\{\lambda>0\ :\ \sum_{k}\phi\left(\frac{|x_{k}|}{\lambda}\right)\leq 1\right\},$

and we saw that ${\|x\|_{\phi} = c}$ if ${\sum_{k}\phi\left(\frac{|x_{k}|}{c}\right)= 1}$.

Actually, one can have a little more flexibility in the construction as one can also use different functions ${\phi}$ in each coordinate: If ${\phi_{k}}$ are functions as ${\phi}$ above, we can define

$\displaystyle \|x\|_{\phi_{k}} := \inf\left\{\lambda>0\ :\ \sum_{k}\phi_{k}\left(\frac{|x_{k}|}{\lambda}\right)\leq 1\right\},$

and it still holds that ${\|x\|_{\phi_{k}} = c}$ if and only if ${\sum_{k}\phi_{k}\left(\frac{|x_{k}|}{c}\right)= 1}$. The proof that this construction indeed gives a norm is the same as in the one in the previous post.

This construction allows, among other things, to construct norms that behave like different ${p}$-norms in different directions. Here is a simple example: In the case of ${x\in{\mathbb R}^{d}}$ we can split the variables into two groups, say the first ${k}$ variables and the last ${d-k}$ variables. The first group shall be treated with a ${p}$-norm and the second group with a ${q}$-norm. For the respective Luxemburg norm one has

$\displaystyle \|x\|:= \inf\left\{\lambda>0\ :\ \sum_{i=1}^{k}\frac{|x_{i}|^{p}}{\lambda^{p}} + \sum_{i=k+1}^{d}\frac{|x_{i}|^{q}}{\lambda^{q}}\leq 1\right\},$

Note that there is a different way to do a similar thing, namely a mixed norm defined as

$\displaystyle \|x\|_{p,q}:= \left(\sum_{i=1}^{k}|x_{i}|^{p}\right)^{1/p} + \left(\sum_{i=k+1}^{d}|x_{i}|^{q}\right)^{1/q}.$

As any two norms, these are equivalent, but they induce a different geometry. On top of that, one could in principle also consider ${\Phi}$ functionals

$\displaystyle \Phi_{p,q}(x) = \sum_{i=1}^{k}|x_{i}|^{p} + \sum_{i=k+1}^{d}|x_{i}|^{q},$

which is again something different.

A bit more general, we may consider all these three conditions for general partitions of the index sets and a different exponent for each set.

Here are some observations on the differences:

• For the Luxemburg norm the only thing that counts are the exponents (or functions ${\phi_{k}}$). If you partition the index set into two parts but give the same exponents to both, the Luxemburg norm is the same as if you would consider the two parts as just one part.
• The mixed norm is not the ${p}$-norm, even if the set the exponent to ${p}$ for every part.
• The Luxemburg norm has the flexibility to use other functionals than just the powers.
• For the mixed norms one could consider additional mixing by not just summing the norms of the different parts, which is the same as taking the ${1}$-norm of the vector of norms. Of course, other norms are possible, e.g. constructions like

$\displaystyle \left(\left(\sum_{i=1}^{k}|x_{i}|^{p}\right)^{r/p} + \left(\sum_{i=k+1}^{d}|x_{i}|^{q}\right)^{r/q}\right)^{1/r}$

are also norms. (Actually, the term mixed norm is often used for the above construction with ${p=q\neq r}$.)

Here are some pictures that show the different geometry that these three functionals induce. We consider ${d=3}$ i.e., three-dimensional space, and picture the norm-balls (of level sets in the case the functionals ${\Phi}$).

• Consider the case ${k=1}$ and the first exponent to be ${p=1}$ and the second ${q=2}$. The mixed norm is

$\displaystyle \|x\|_{1,2} = |x_{1}| + \sqrt{x_{2}^{2}+x_{3}^{2}},$

the ${\Phi}$-functional is

$\displaystyle \Phi(x)_{1,2} = |x_{1}| + x_{2}^{2}+x_{3}^{2},$

and for the Luxemburg norm it holds that

$\displaystyle \|x\| = c\iff \frac{|x_{1}|}{c} + \frac{x_{2}^{2} + x_{3}^{2}}{c^{2}} = 1.$

Here are animated images of the respective level sets/norm balls for radii ${0.25, 0.5, 0.75,\dots,3}$:

You may note the different shapes of the balls of the mixed norm and the Luxemburg norm. Also, the shape of their norm balls stays the same as you scale the radius. The last observation is not true for the ${\Phi}$ functional: Different directions scale differently.

• Now consider ${k=2}$ and the same exponents. This makes the mixed norm equal to the ${1}$-norm, since

$\displaystyle \|x\|_{1,2} = |x_{1}| + |x_{2}| + \sqrt{x_{3}^{2}} = \|x\|_{1}.$

The ${\Phi}$-functional is

$\displaystyle \Phi(x)_{1,2} = |x_{1}| + |x_{2}|+x_{3}^{2},$

and for the Luxemburg norm it holds that

$\displaystyle \|x\| = c\iff \frac{|x_{1}| + |x_{2}|}{c} + \frac{x_{3}^{2}}{c^{2}} = 1.$

Here are animated images of the respective level sets/norm balls of the ${\Phi}$ functional and the Luxemburg norm for the same radii as above (I do not show the balls for the mixed norm – they are just the standard cross-polytopes/${1}$-norm balls/octahedra): Again note how the Luxemburg ball keeps its shape while the level sets of the ${\Phi}$-functional changes shape while scaling.

• Now we consider three different exponents: ${p_{1}=1}$, ${p_{2} = 2}$ and ${p_{3} = 3}$. The mixed norm is again the ${1}$-norm. The ${\Phi}$-functional is

$\displaystyle \Phi(x)_{1,2} = |x_{1}| + x_{2}^{2}+|x_{3}|^{3},$

and for the Luxemburg norm it holds that

$\displaystyle \|x\| = c\iff \frac{|x_{1}|}{c} + \frac{x_{2}^{2}}{c^{2}} + \frac{|x_{3}|^{3}}{c^{3}} = 1.$

Here are animated images of the respective level sets/norm balls of the ${\Phi}$ functional and the Luxemburg norm for the same radii as above (again, the balls for the mixed norm are just the standard cross-polytopes/${1}$-norm balls/octahedra):

Next Page »