I have on open position for a PhD student – here is the official job-ad:

The group of Prof. Dirk Lorenz at the Institute of Analysis and Algebra has an open PhD position for a Scientific Assistant (75\% TV-L EG 13). The position is available as soon as possible and is initially limited to three years.

The scientific focus of the group includes optimization for inverse problems and machine learning, and mathematical imaging. Besides teaching and research, the position includes work in projects or preperation of projects.

We offer

• a dynamic team and a creative research and work environment
• mentoring and career planning programs (offered by TU Braunschweig), possibilities for personal qualification, language courses and the possibility to participate in international conferences
• flexible work hours and a family friendly work environment.

We are looking for candidates with

• a degree (Masters or Diploma) in mathematics above average,
• a focus on optimization and/or numerical mathematics and applications, e.g. imaging or machine learning
• programming skills in MATLAB, Julia and/or Python
• good knowledge of German and/or English
• capacity for teamwork, independent work, high level of motivation and organizational talent

Equally qualified severely challenged persons will be given preference. The TU Braunschweig especially encourages women to apply for the position. Please send your application including CV, copies of certificates and letters of recommendation (if any) in electronic form via e-mail to Dirk Lorenz.

Contact Prof. Dirk Lorenz | Tel. +49 531 391 7423| d.lorenz@tu-braunschweig.de

Please forward to anyone interested!

I can’t claim that I am an expert in machine learning. I’d rather say that I am merely a tourist in this area. Anyway, here is a small piece of thought on how (supervised) machine learning imitates human learning.

What are some features of human learning? Ideally, humans aim to understand a subject. To achieve this, they study examples, try to make their own deductions, do experiments, make predictions and test them. The overall goal is to get to the heart of things.

What are features of so called supervised machine learning: The methods get training data, i.e. pairs in input and output that match. The foremost goal of the method is to perform good on test data, i.e. to produce correct output to an input the method hasn’t seen before. In practice, one sets up a fairly general model (such as a neural network or a kernelized support vector machine) and often does as little modeling of the task at hand as possible.

This does not sound as though supervised machine learning and human learning are the same or even related. Their goals and methods are genuinely different.

But let us look at how human learn for a very specific task: Preparing for an exam. Recently I had to prepare several larger exams in mathematics for engineers, each with hundred plus students and got to think how they approach the task of learning. When the exam comes closer, the interactions with the students get more frequent. I had a large “ask anything” meeting, I had students coming to office hours, and I had “digital office hours” where the students could ask question via a messenger in a chat room. So I had quite some interactions and could get a little insight into their way of learning, into their problems and progress.

Here are some observations of how the students tried to learn: The question I got were mostly about the exercises we had handed out (in other words, the students asked for information on the “training data”). They were studying heavy on these exercises, barely using the textbook or their lecture notes to look up theorems or definitions (in other words, some were working “model free” or with a “general purpose model” which says something like “do computations following general rules”). They work with the underlying assumption that the exam is made up of questions similar to the exercises (and actually, this is a reasonable assumption – I announced this in class) (in other words, the test data comes from the same distribution as the training data).

Viewed like this, learning of humans (for an exam, that is) and machine learning sound much more similar. And the similarities do not end here. Also some known problems with machine learning methods can be observed with the students: Students get stuck in local minima (they reach a point where further improvement in impossible by revising the seen data – even though they could, in principle, learn more from the given exercises, they keep going the known paths, practicing the known computations, not learning new techniques). Students overfit to the training data (on the test data, aka the exam, they face new problems and oftentimes apply the learned methods to tasks where they don’t work, getting wrong results which would be true if the problem would be a little different). The trained students are vulnerable to adversarial attacks (for every problem I posed as exercises I could make a slight change that would confuse most students). Also, similar to recent observations in machine learning, overparametrization helps to avoid overfitting and overparametrization helps to avoid spurious local valleys, i.e. when the students have more techniques at hand, which is related to a more flexible machine learning method, they do better on unseen data and do not get stuck at bad local minima where no improvement is possible.

Granted, some observation are a kind of a stretch, but still, in conclusion, I’d advocate to replace the term “machine learning” with machine cramming (the German version would be maschinelles Büffeln or maschinelles Pauken).

The Poincaré-constant of a domain ${\Omega\subset\mathbb{R}^{n}}$ is the smallest constant ${C}$ such that the estimate

$\displaystyle \|u- \bar u\|_{p}\leq C\|\nabla u\|_{p}$

holds (where ${\bar u = |\Omega|^{-1}\int_{\Omega} u}$ is the mean value of ${u}$).

These constants are known for some classes of domains and some values of ${p}$: E.g. Payne and Weinberger showed in 1960 that for ${p=2}$ and convex ${\Omega}$ the constant is ${\tfrac{\mathrm{diam}(\Omega)}{\pi}}$ and Acosta and Duran showed in 2004 that for ${p=1}$ and convex ${\Omega}$ one gets ${\tfrac{\mathrm{diam}(\Omega)}2}$.

I do not know of any other results on these constants, and while playing around with these kind of things, I derived another one: Since I often work with images ${u:\Omega\rightarrow [0,1]}$ I will assume that ${0\leq u\leq 1}$ in the following.

Using the co-area formula we can express the ${1}$-norm of the gradient via the length of the level sets: Denoting by ${\mathcal{H}^{n-1}}$ the ${n-1}$-dimensional Hausdorff measure (which is, roughly speaking, the length in the case of ${n=2}$) and with ${E_{t}}$ the level set of ${u}$ at level ${t}$, the co-area formula states that

$\displaystyle \int_{\Omega}|\nabla u| = \int_{0}^{1}\mathcal{H}^{n-1}(\partial E_{t}) dt. \ \ \ \ \ (1)$

Now we combine this with the isoperimetric inequality, which is

$\displaystyle \mathcal{H}^{n-1}(\partial E_{t})\geq n |B_{1}|^{\tfrac1n}|E_{t}|^{\tfrac{n-1}{n}}, \ \ \ \ \ (2)$

where ${B_{1}}$ is unit ball and ${|\cdot|}$ denotes the Lebesgue measure. Combining~(1) and~(2) we get

$\displaystyle \int_{\Omega}|\nabla u| \geq n |B_{1}|^{\tfrac1n}\int_{0}^{1}|E_{t}|^{\tfrac{n-1}{n}} dt. \ \ \ \ \ (3)$

Now we use the trivial fact that

$\displaystyle u(x)^{2} = \int_{0}^{u(x)}2t dt = \int_{0}^{1}2t\chi_{\{x\mid u(x)\geq t\}}(x)dt$

combined with Fubini to get

$\displaystyle \int_{\Omega}u^{2} = \int_{0}^{1}2t\int_{\Omega}\chi_{\{x\mid u(x)\geq t\}}(x)dx\, dt = 2\int_{0}^{1}t |E_{t}|dt. \ \ \ \ \ (4)$

Since ${|E_{t}|/|\Omega|\leq 1}$ and ${(n-1)/n<1}$ it holds that ${|E_{t}|\leq |\Omega|^{\tfrac1n}|E_{t}|^{\tfrac{n-1}n}}$. Combining this with~(4) we get

$\displaystyle \int_{\Omega}u^{2}\leq 2|\Omega|^{\tfrac1n}\int_{0}^{1}t |E_{t}|^{\tfrac{n-1}n}dt \leq 2|\Omega|^{\tfrac1n}\int_{0}^{1} |E_{t}|^{\tfrac{n-1}n}dt.$

Finally, we use~(3) to obtain

$\displaystyle \int_{\Omega}u^{2}\leq \frac{2|\Omega|^{\tfrac1n}}{n|B_{1}|^{\tfrac1n}}\int_{\Omega}|\nabla u|$

in other words

$\displaystyle \|u\|_{2}^{2}\leq \frac{2|\Omega|^{\tfrac1n}}{n|B_{1}|^{\tfrac1n}}\|\nabla u\|_{1}.$

This estimate is quite explicit, does not need the subtraction of the mean value, does not need convexity of ${\Omega}$, but also does not obey the scaling ${u\mapsto au}$ (which is of no surprise since we used the condition ${0\leq u\leq 1}$ which also does not obey this scaling).

In dimension ${n=2}$ the estimate takes the simpler form

$\displaystyle \|u\|_{2}^{2}\leq \sqrt{\tfrac{|\Omega|}{\pi}}\|\nabla u\|_{1}.$

In sparse recovery one aims to leverage the sparsity of some vector ${u}$ to get a good reconstruction of this vector from a noisy and indirect measurement ${u^{0}}$. Sparsity can be expressed in two different ways: The analysis way and the synthesis way. For both of them you need a system of other vectors ${v_{k}}$.

• The analysis way: Here you consider the inner products ${\langle u,v_{k}\rangle}$ and the sparsity assumption is, that this sequence is sparse.
• The synthesis way: Here you assume that there is sparse sequence of coefficients ${\psi_{k}}$ such that ${u=\sum_{k} \psi_{k}v_{k}}$.

The first approach is called “analysis” since here one analyzes the vector ${u}$ with the system of vectors. In the second approach, you use the vectors ${v_{k}}$ to synthesize the vector ${u}$.

We will keep things very simple here and only consider a noisy observation (and not an indirect one). The recovery approach in the analysis way is then to find ${u}$ as a solution of

$\displaystyle \tfrac12\|u-u_{0}\| + \alpha R(\langle u,v_{k}\rangle)$

while for the synthesis way you write ${u = \sum_{k} \psi_{k}v_{k} = V\psi}$ (where we collected the vectors ${v_{k}}$ as the columns of the matrix ${V}$) and solve

$\displaystyle \tfrac12\|V\psi-u_{0}\| + \alpha R(\psi)$

The functional ${R}$ should enforce the sparsity of the vector of coefficients in the former case and the sparsity of ${\psi}$ in the latter case. With the matrix ${V}$ the analysis way reads as

$\displaystyle \tfrac12\|u-u_{0}\| + \alpha R(V^{T}u)$

One important observation is, that both approaches coincide if the matrix ${V}$ is orthonormal: Since ${V^{-1} = V^{T}}$ in this case, ${u=V\psi}$ is equivalent to ${\psi = V^{T}u}$, i.e. ${\psi_{k} = \langle u,v_{k}\rangle}$. For other sets of vectors ${v_{k}}$, this is not true and it is not so clear, how the analysis and the synthesis approach are related. In this post I’d like to show one instance where both approaches do lead to results that are not even close to being similar.

1. The “ROF analysis” approach to denoising

If ${u^{0}}$ is an image, the famous Rudin-Osher-Fatemi denoising model fits under the “analysis” framework: The vectors ${v_{k}}$ are all the finite differences of neighboring pixels such that ${V^{T}u = \nabla u}$ with the discrete gradient ${\nabla u}$. As sparsifying functional you take a mixed ${2}$${1}$ norm, i.e. ${R(\nabla u) = \|\nabla u\|_{2,1} = \sum_{ij}\sqrt{(\partial_{1}u_{ij})^{2} + (\partial_{2}u_{ij})^{2}}}$. The denoised ${u}$ is the solution to

$\displaystyle \min_{u} \tfrac12\|u-u^{0}\|^{2} + \alpha\|\nabla u\|_{2,1}$

Thus, the model will lead to some denoised ${u}$ with a sparse gradient. This sparse gradient is partly the reason for the success of the model in that it keeps edges, but is also responsible for the drawback of this approach: The denoised image will be mostly piesewise constant.

2. The “ROF synthesis” approach

As ${V^{T} = \nabla}$, the corresponding synthesis approach would be to solve

$\displaystyle \min_{\psi}\tfrac12\|\nabla^{T}\psi - u^{0}\|^{2} + \alpha \|\psi\|_{2,1}$

and the set ${u = \nabla^{T}\psi}$. In this approach, the denoised image will be a sparse linear combination of particular two-pixel images. If you think about that, it does not really sound like a good idea to approximate an image by a sparse linear combination of two-pixel images. (One technicality: The operator ${\nabla^{T}}$ is not onto – ${\nabla}$ has a non-trivial kernel, consisting of the constant images and hence, all images in the range of ${\nabla^{T}}$ have zero mean. Thus, to get something remotely close to ${u^{0}}$ one should subtract the mean of ${u^{0}}$ before the reconstruction and add it back afterwards.)

Here are some results:

Note that the results actually fit to what we would have predicted: For larger ${\alpha}$ we get a “sparser gradient” (less edges, less variation) for the analysis approach, and “sparser two-pixel combinations” for the synthesis approach. In fact, for the synthesis approach to give something that is remotely close to the image, one needs quite a lot two-pixel images, i.e. a very low sparsity.

In conclusion, one should not consider the analysis and synthesis approach to be something similar.

Incidentally, the dual of the analysis approach

$\displaystyle \min_{u} \tfrac12\|u-u^{0}\|^{2} + \alpha\|\nabla u\|_{2,1}$

can be written as

$\displaystyle \min_{\|\phi\|_{2,\infty}\leq\alpha}\tfrac12\|\nabla^{T}\phi-u^{0}\|^{2} = \min_{\phi}\tfrac12\|\nabla^{T}\phi-u^{0}\|^{2} + I_{\|\cdot\|_{2,\infty}\leq\alpha}(\phi)$

which looks related to the synthesis approach (one basically replaces the ${2,1}$-norm by its conjugate function). Actually, a similar relation between the dual of the analysis approach and the synthesis approach always holds.

Consider the saddle point problem

$\displaystyle \min_{x}\max_{y}F(x) + \langle Kx,y\rangle - G(y). \ \ \ \ \ (1)$

(where I omit all the standard assumptions, like convexity, continuity ans such…). Fenchel-Rockafellar duality says that solutions are characterized by the inclusion

$\displaystyle 0 \in\left( \begin{bmatrix} \partial F & 0\\ 0 & \partial G \end{bmatrix} + \begin{bmatrix} 0 & K^{T}\\ -K & 0 \end{bmatrix}\right) \begin{bmatrix} x^{*}\\y^{*} \end{bmatrix}$

Noting that the operators

$\displaystyle A = \begin{bmatrix} \partial F & 0\\ 0 & \partial G \end{bmatrix},\quad B = \begin{bmatrix} 0 & K^{T}\\ -K & 0 \end{bmatrix}$

are both monotone, we may apply any of the splitting methods available, for example the Douglas-Rachford method. In terms of resolvents

$\displaystyle R_{tA}(z) := (I+tA)^{-1}(z)$

this method reads as

$\displaystyle \begin{array}{rcl} z^{k+1} & = & R_{tB}(\bar z^{k})\\ \bar z^{k+1}& = & R_{tA}(2z^{k+1}-\bar z^{k}) + \bar z^{k}-z^{k+1}. \end{array}$

For the saddle point problem, this iteration is (with ${z = (x,y)}$)

$\displaystyle \begin{array}{rcl} x^{k+1} &=& R_{t\partial F}(\bar x^{k})\\ y^{k+1} &=& R_{t\partial G}(\bar y^{k})\\ \begin{bmatrix} \bar x^{k+1}\\ \bar y^{k+1} \end{bmatrix} & = & \begin{bmatrix} I & tK^{T}\\ -tK & I \end{bmatrix}^{-1} \begin{bmatrix} 2x^{k+1}-\bar x^{k}\\ 2y^{k+1}-\bar y^{k} \end{bmatrix} + \begin{bmatrix} \bar x^{k}- x^{k+1}\\ \bar y^{k}-y^{k+1} \end{bmatrix}. \end{array}$

The first two lines involve proximal steps and we assume that they are simple to implement. The last line, however, involves the solution of a large linear system. This can be broken down to a slightly smaller linear system involving the matrix ${(I+t^{2}K^{T}K)}$ as follows: The linear system equals

$\displaystyle \begin{array}{rcl} \bar x^{k+1} & = & x^{k+1} - tK^{T}(y^{k+1}+\bar y^{k+1}-\bar y^{k})\\ \bar y^{k+1} & = & y^{k+1} + tK(x^{k+1} + \bar x^{k+1}-\bar x^{k}). \end{array}$

Plugging ${\bar y^{k+1}}$ from the second equation into the first gives

$\displaystyle \bar x^{k+1} = x^{k+1} - tK^{T}(2y^{k+1}-\bar y^{k}) - tK^{T}K(x^{k+1}-\bar x^{k+1}-\bar x^{k})$

Denoting ${d^{k+1}= x^{k+1}+\bar x^{k+1}-\bar x^{k}}$ this can be written as

$\displaystyle (I+t^{2}K^{T}K)d^{k+1} = (2x^{k+1}-\bar x^{k}) - tK^{T}(2y^{k+1}-\bar y^{k}).$

and the second equation is just

$\displaystyle \bar y^{k+1} = y^{k+1} + tKd^{k+1}.$

This gives the overall iteration

$\displaystyle \begin{array}{rcl} x^{k+1} &=& R_{t\partial F}(\bar x^{k})\\ y^{k+1} &=& R_{t\partial G}(\bar y^{k})\\ d^{k+1} &=& (I+t^{2}K^{T}K)^{-1}(2x^{k+1}-\bar x^{k} - tK(2y^{k+1}-\bar y^{k}))\\ \bar x^{k+1}&=& \bar x^{k}-x^{k+1}+d^{k+1}\\ \bar y^{k+1}&=& y^{k+1}+tKd^{k+1} \end{array}$

This is nothing else than using the Schur complement or factoring as

$\displaystyle \begin{bmatrix} I & tK^{T}\\ -tK & I \end{bmatrix} = \begin{bmatrix} 0 & 0\\ 0 & I \end{bmatrix} + \begin{bmatrix} I\\tK \end{bmatrix} (I + t^{2}K^{T}K)^{-1} \begin{bmatrix} I & -tK^{T} \end{bmatrix}$

and has been applied to imaging problems by O’Connor and Vandenberghe in “Primal-Dual Decomposition by Operator Splitting and Applications to Image Deblurring” (doi). For many problems in imaging, the involved inversion may be fairly easy to perform (if ${K}$ is the image gradient, for example, we only need to solve an equation with an operator like ${(I - t^{2}\Delta)}$ and appropriate boundary conditions). However, there are problems where this inversion is a problem.

I’d like to show the following trick to circumvent the matrix inversion, which I learned from Bredies and Sun’s “Accelerated Douglas-Rachford methods for the solution of convex-concave saddle-point problems”: Here is a slightly different saddle point problem

$\displaystyle \min_{x}\max_{y,x_{p}}F(x) + \langle Kx,y\rangle + \langle Hx,x_{p}\rangle- G(y) - I_{\{0\}}(x_{p}). \ \ \ \ \ (2)$

We added a new dual variable ${x_{p}}$, which is forced to be zero by the additional indicator functional ${I_{\{0\}}}$. Hence, the additional bilinear term ${\langle Hx,x_{p}\rangle}$ is also zero, and we see that ${(x,y)}$ is a solution of (1) if and only if ${(x,y,0)}$ is a solution of (2). In other words: The problem just looks differently, but is, in essence, the same as before.

Now let us write down the Douglas-Rachford iteration for (2). We write this problem as

$\displaystyle \min_{x}\max_{\tilde y} F(x) + \langle \tilde Kx,\tilde y\rangle -\tilde G(\tilde y)$

with

$\displaystyle \tilde y = \begin{bmatrix} y\\x_{p} \end{bmatrix}, \quad \tilde K = \begin{bmatrix} K\\H \end{bmatrix}, \quad \tilde G(\tilde y) = \tilde G(y,x_{p}) = G(y) + I_{\{0\}}(x_{p}).$

Writing down the Douglas-Rachford iteration gives

$\displaystyle \begin{array}{rcl} x^{k+1} &=& R_{t\partial F}(\bar x^{k})\\ \tilde y^{k+1} &=& R_{t\partial \tilde G}(\bar{ \tilde y}^{k})\\ \begin{bmatrix} \bar x^{k+1}\\ \bar {\tilde y}^{k+1} \end{bmatrix} & = & \begin{bmatrix} I & t\tilde K^{T}\\ -t\tilde K & I \end{bmatrix}^{-1} \begin{bmatrix} 2x^{k+1}-\bar x^{k}\\ 2\tilde y^{k+1}-\bar {\tilde y}^{k} \end{bmatrix} + \begin{bmatrix} \bar x^{k}- x^{k+1}\\ \bar {\tilde y}^{k}-\tilde y^{k+1} \end{bmatrix}. \end{array}$

Switching back to variables without a tilde, we get, using ${R_{tI_{\{0\}}}(x) = 0}$,

$\displaystyle \begin{array}{rcl} x^{k+1} &=& R_{t\partial F}(\bar x^{k})\\ y^{k+1} &=& R_{t\partial \tilde G}(\bar{ y}^{k})\\ x_{p}^{k+1} &=& 0\\ \begin{bmatrix} \bar x^{k+1}\\ \bar {y}^{k+1}\\ \bar x_{p}^{k+1} \end{bmatrix} & = & \begin{bmatrix} I & tK^{T} & tH^{T}\\ -t K & I & 0\\ -t H & 0 & I \end{bmatrix}^{-1} \begin{bmatrix} 2x^{k+1}-\bar x^{k}\\ 2 y^{k+1}-\bar {y}^{k}\\ 2x_{p}^{k+1}-\bar x_{p}^{k} \end{bmatrix} + \begin{bmatrix} \bar x^{k}- x^{k+1}\\ \bar {y}^{k}-y^{k+1}\\ \bar x_{p}^{k}-x_{p}^{k+1} \end{bmatrix}. \end{array}$

First not that ${x_{p}^{k+1}=0}$ throughout the iteration and from the last line of the linear system we get that

$\displaystyle \begin{array}{rcl} -tH\bar x^{k+1} + \bar x_{p}^{k+1} = -\bar x_{p}^{k} -tH(\bar x^{k}-x^{k+1}) + \bar x_{p}^{k} \end{array}$

which implies that

$\displaystyle \bar x_{p}^{k+1} = tH\bar x^{k+1}.$

Thus, both variables ${x_{p}^{k}}$ and ${\bar x_{p}^{k}}$ disappear in the iteration. Now we rewrite the remaining first two lines of the linear system as

$\displaystyle \begin{array}{rcl} \bar x^{k+1} + tK^{T}\bar y^{k+1} + t^{2}H^{T}H\bar x^{k+1} &=& x^{k+1} + tK^{T}(\bar y^{k}-y^{k+1}) + t^{2}H^{T}H\bar x^{k}\\ \bar y^{k+1}-tK\bar x^{k+1} &=& y^{k+1} + tK(x^{k+1}-\bar x^{k}). \end{array}$

Again denoting ${d^{k+1}=x^{k+1}+\bar x^{k+1}-\bar x^{k}}$, solving the second equation for ${\bar y^{k+1}}$ and plugging the result in the first gives

$\displaystyle (I+t^{2}H^{T}H)\bar x^{k+1} +tK^{T}(y^{k+1}+tKd^{k+1}) = x^{k+1}+tK(\bar y^{k}-y^{k+1}) + t^{2}H^{T}H\bar x^{k}.$

To eliminate ${\bar x^{k+1}}$ we add ${(I+t^{2}H^{T}H)(x^{k+1}-\bar x^{k})}$ on both sides and get

$\displaystyle (I+t^{2}(H^{T}H+K^{T}K))d^{k+1} = 2x^{k+1}-\bar x^{k} -tK(y^{k+1}+\bar y^{k+1}-\bar y^{k}) + t^{2}H^{T}Hx^{k+1}.$

In total we obtain the following iteration:

$\displaystyle \begin{array}{rcl} x^{k+1} &=& R_{t\partial F}(\bar x^{k})\\ y^{k+1} &=& R_{t\partial G}(\bar y^{k})\\ d^{k+1} &=& (I+t^{2}(H^{T}H + K^{T}K))^{-1}(2x^{k+1}-\bar x^{k} - tK(2y^{k+1}-\bar y^{k}) + t^{2}H^{T}Hx^{k+1})\\ \bar x^{k+1}&=& \bar x^{k}-x^{k+1}+d^{k+1}\\ \bar y^{k+1}&=& y^{k+1}+tKd^{k+1} \end{array}$

and note that only the third line changed.

Since the above works for any matrix ${H}$, we have a lot of freedom. Let us see, that it is even possible to avoid any inversion whatsoever: We would like to choose ${H}$ in a way that ${I+t^{2}(H^{T}H + K^{T}K) = \lambda I}$ for some positive ${\lambda}$. This is equivalent to

$\displaystyle H^{T}H = \tfrac{\lambda-1}{t^{2}}I - K^{T}K.$

As soon as the right hand side is positive definite, Cholesky decomposition shows that such an ${H}$ exists, and this happens if ${\lambda\geq 1+t^{2}\|K\|^{2}}$. Further note, that we do need ${H}$ in any way, but only ${H^{T}H}$, and we can perform the iteration without ever solving any linear system since the third row reads as

$\displaystyle d^{k+1} = \tfrac{1}{\lambda}\left(2x^{k+1}-\bar x^{k} - tK(2y^{k+1}-\bar y^{k}) + ((\lambda-1)I - t^{2}K^{T}K)x^{k+1})\right).$

This is a short note to self: Let $A$ be a symmetric positive semidefinite matrix with one-dimensional kernel spanned by $v$. How to solve $Ax=b$ (if you know that $b$ is in the range of $A$)? Just typing

x = A\b

should give a warning in a reasonable software (but also should produce some correct result, if it returns anything at all).

If you don’t want that warning and also want to get the solution that is orthogonal to the kernel you should do

x = (A+v*v')\b.

Note that $A + vv^T$ has full rank (and $v$ is still an eigenvector, but now for the eigenvalue $\|v\|^2$).

Surely, the solution of $Ax=b$ which is orthogonal to the kernel of $A$  also solves this $(A+vv^T)x = b$ since $(A+vv^T)x = Ax + vv^Tx = Ax = b$. Conversely, if $x$ solves $(A + vv^T)x = b$, then taking the inner product with $v$ gives $(Ax)^Tv + (v^Tx)^2 = b^Tv$ and since $b^Tv = 0$ and $(Ax)^T v = x^TAv = 0$ it follows that $v^T x = 0$ which shows that both $Ax=b$ and that $x$ is orthogonal to the kernel.

Also, if you want the solution which is orthogonal to some $z$ (and not to the kernel of $A$) you can solve $(A + zz^T)x=b$. By taking the inner product with $v$, you get that $v^T z\, x^T z=0$ and you get $x\bot z$ as soon as $v^Tz\neq 0$.

I have an open PhD position starting this fall. Here is the job ad:

The workgroup of Prof. Lorenz at the Institute of Analysis and Algebra at the TU Braunschweig is searching a Scientific Assistant (75\% TV-L EG 13). The position is available from 01.10.2017 and is initially limited to three years.

We are looking for candidates
– with a degree (Masters or Diploma) in mathematics above average,
– with a focus on at least one of the areas optimization, numerical mathematics or functional analysis,
– good knowledge of German and
– strong interest in applied mathematics. We suppose that the candidate brings a high commitment for scientific research.

The responsibilities include
– participation in teaching and
– independent research in one of the areas of the work group.

Equally qualified severely challenged persons will be given preference. The TU Braunschweig especially encourages women to apply for the position. Please send your application including CV, copies of certificates and letters of recommendation (if any) in electronic form via e-mail to Dirk Lorenz.

Contact: Prof. Dirk Lorenz, Tel. 0531 391 7423, d.lorenz@tu-braunschweig.de

An here is the job ad as a pdf.

Here is a lemma that I find myself googling regularly since I always forget it’s exact form.

Lemma 1 Let ${A}$ be a monotone operator, ${\lambda>0}$ and denote by ${R_{\lambda A} = (I+\lambda A)^{-1}}$ the resolvent of ${\lambda A}$. Then it holds that

$\displaystyle \begin{array}{rcl} R_{\lambda A^{-1}}(x) = x - \lambda R_{\lambda^{-1}A}(\lambda^{-1}x). \end{array}$

Proof: We start with the left hand side ${y = R_{\lambda A^{-1}}(x) = (I+\lambda A^{-1})^{-1} x}$ and deduce

$\displaystyle \begin{array}{rcl} x &\in& y + \lambda A^{-1}y\\ \iff \frac{x-y}{\lambda} &\in& A^{-1}y\\ \iff y &\in& A(\frac{x-y}{\lambda})\\ \iff x &\in& A(\frac{x-y}{\lambda}) + x-y\\ \iff \frac{x}{\lambda} &\in& \frac{1}{\lambda}A(\frac{x-y}{\lambda}) + \frac{x-y}{\lambda}\\ \iff \frac{x-y}{\lambda} & = &(I + \lambda^{-1}A)^{-1}(\lambda^{-1}x)\\ \iff x - \lambda (I+\lambda^{-1}A)^{-1}(\lambda^{-1}x) & = & y. \end{array}$

$\Box$

I do not know any official name of this, but would call it Moreau’s identity which is the name of the respective statement for proximal operators for convex functions ${f}$ and ${g}$:

$\displaystyle \begin{array}{rcl} \mathrm{prox}_{\lambda f^{*}}(x) = x - \lambda\mathrm{prox}_{\lambda^{-1}f}(\lambda^{-1}x). \end{array}$

The version for monotone operators is Proposition 23.18 in the first edition of Bauschke and Combette’s book “Convex Analysis and Monotone Operator Theory in Hilbert Spaces”.

I blogged about the Douglas-Rachford method before and in this post I’d like to dig a bit into the history of the method.

As the name suggests, the method has its roots in a paper by Douglas and Rachford and the paper is

Douglas, Jim, Jr., and Henry H. Rachford Jr., “On the numerical solution of heat conduction problems in two and three space variables.” Transactions of the American mathematical Society 82.2 (1956): 421-439.

At first glance, the title does not suggest that the paper may be related to monotone inclusions and if you read the paper you’ll not find any monotone operator mentioned. So let’s start and look at Douglas and Rachford’s paper.

1. Solving the heat equation numerically

So let us see, what they were after and how this is related to what is known as Douglas-Rachford splitting method today.

Indeed, Douglas and Rachford wanted to solve the instationary heat equation

$\displaystyle \begin{array}{rcl} \partial_{t}u &=& \partial_{xx}u + \partial_{yy}u \\ u(x,y,0) &=& f(x,y) \end{array}$

with Dirichlet boundary conditions (they also considered three dimensions, but let us skip that here). They considered a rectangular grid and a very simple finite difference approximation of the second derivatives, i.e.

$\displaystyle \begin{array}{rcl} \partial_{xx}u(x,y,t)&\approx& (u^{n}_{i+1,j}-2u^{n}_{i,j}+u^{n}_{i-1,j})/h^{2}\\ \partial_{yy}u(x,y,t)&\approx& (u^{n}_{i,j+1}-2u^{n}_{i,j}+u^{n}_{i,j-1})/h^{2} \end{array}$

(with modifications at the boundary to accomodate the boundary conditions). To ease notation, we abbreviate the difference quotients as operators (actually, also matrices) that act for a fixed time step

$\displaystyle \begin{array}{rcl} (Au^{n})_{i,j} &=& (u^{n}_{i+1,j}-2u^{n}_{i,j}+u^{n}_{i-1,j})/h^{2}\\ (Bu^{n})_{i,j} &=& (u^{n}_{i,j+1}-2u^{n}_{i,j}+u^{n}_{i,j+1})/h^{2}. \end{array}$

With this notation, our problem is to solve

$\displaystyle \begin{array}{rcl} \partial_{t}u &=& (A+B)u \end{array}$

in time.

Then they give the following iteration:

$\displaystyle Av^{n+1}+Bw^{n} = \frac{v^{n+1}-w^{n}}{\tau} \ \ \ \ \ (1)$

$\displaystyle Bw^{n+1} = Bw^{n} + \frac{w^{n+1}-v^{n+1}}{\tau} \ \ \ \ \ (2)$

(plus boundary conditions which I’d like to swipe under the rug here). If we eliminate ${v^{n+1}}$ from the first equation using the second we get

$\displaystyle (A+B)w^{n+1} = \frac{w^{n+1}-w^{n}}{\tau} + \tau AB(w^{n+1}-w^{n}). \ \ \ \ \ (3)$

This is a kind of implicit Euler method with an additional small term ${\tau AB(w^{n+1}-w^{n})}$. From a numerical point of it has one advantage over the implicit Euler method: As equations (1) and (2) show, one does not need to invert ${I-\tau(A+B)}$ in every iteration, but only ${I-\tau A}$ and ${I-\tau B}$. Remember, this was in 1950s, and solving large linear equations was a much bigger problem than it is today. In this specific case of the heat equation, the operators ${A}$ and ${B}$ are in fact tridiagonal, and hence, solving with ${I-\tau A}$ and ${I-\tau B}$ can be done by Gaussian elimination without any fill-in in linear time (read Thomas algorithm). This is a huge time saver when compared to solving with ${I-\tau(A+B)}$ which has a fairly large bandwidth (no matter how you reorder).

How do they prove convergence of the method? They don’t since they wanted to solve a parabolic PDE. They were after stability of the scheme, and this can be done by analyzing the eigenvalues of the iteration. Since the matrices ${A}$ and ${B}$ are well understood, they were able to write down the eigenfunctions of the operator associated to iteration (3) explicitly and since the finite difference approximation is well understood, they were able to prove approximation properties. Note that the method can also be seen, as a means to calculate the steady state of the heat equation.

We reformulate the iteration (3) further to see how ${w^{n+1}}$ is actually derived from ${w^{n}}$: We obtain

$\displaystyle (-I + \tau(A+B) - \tau^{2}AB)w^{n+1} = (-I-\tau^{2}AB)w^{n} \ \ \ \ \ (4)$

2. What about monotone inclusions?

What has the previous section to do with solving monotone inclusions? A monotone inclusion is

$\displaystyle \begin{array}{rcl} 0\in Tx \end{array}$

with a monotone operator, that is, a multivalued mapping ${T}$ from a Hilbert space ${X}$ to (subsets of) itself such that for all ${x,y\in X}$ and ${u\in Tx}$ and ${v\in Ty}$ it holds that

$\displaystyle \begin{array}{rcl} \langle u-v,x-y\rangle\geq 0. \end{array}$

We are going to restrict ourselves to real Hilbert spaces here. Note that linear operators are monotone if they are positive semi-definite and further note that monotone linear operators need not to be symmetric. A general approach to the solution of monotone inclusions are so-called splitting methods. There one splits ${T}$ additively ${T=A+B}$ as a sum of two other monotone operators. Then one tries to use the so-called resolvents of ${A}$ and ${B}$, namely

$\displaystyle \begin{array}{rcl} R_{A} = (I+A)^{-1},\qquad R_{B} = (I+B)^{-1} \end{array}$

to obtain a numerical method. By the way, the resolvent of a monotone operator always exists and is single valued (to be honest, one needs a regularity assumption here, namely one need maximal monotone operators, but we will not deal with this issue here).

The two operators ${A = \partial_{xx}}$ and ${B = \partial_{yy}}$ from the previous section are not monotone, but ${-A}$ and ${-B}$ are, so the equation ${-Au - Bu = 0}$ is a special case of a montone inclusion. To work with monotone operators we rename

$\displaystyle \begin{array}{rcl} A \leftarrow -A,\qquad B\leftarrow -B \end{array}$

and write the iteration~(4) in terms of monotone operators as

$\displaystyle \begin{array}{rcl} (I + \tau(A+B) + \tau^{2}AB)w^{n+1} = (I+\tau^{2}AB)w^{n}, \end{array}$

i.e.

$\displaystyle \begin{array}{rcl} w^{n+1} = (I+\tau A+\tau B+\tau^{2}AB)^{-1}(I+\tau AB)w^{n}. \end{array}$

Using ${I+\tau A+\tau B + \tau^{2}A = (I+\tau A)(I+\tau B)}$ and ${(I+\tau^{2}AB) = (I-\tau B) + (I + \tau A)\tau B}$ we rewrite this in terms of resolvents as

$\displaystyle \begin{array}{rcl} w^{n+1} & = &(I+\tau B)^{-1}[(I+\tau A)^{-1}(I-\tau B) + \tau B]w^{n}\\ & =& R_{\tau B}(R_{\tau A}(w^{n}-\tau Bw^{n}) + \tau Bw^{n}). \end{array}$

This is not really applicable to a general monotone inclusion since there ${A}$ and ${B}$ may be multi-valued, i.e. the term ${Bw^{n}}$ is not well defined (the iteration may be used as is for splittings where ${B}$ is monotone and single valued, though).

But what to do, when both and ${A}$ and ${B}$ are multivaled? The trick is, to introduce a new variable ${w^{n} = R_{\tau B}(u^{n})}$. Plugging this in throughout leads to

$\displaystyle \begin{array}{rcl} R_{\tau B} u^{n+1} & = & R_{\tau B}(R_{\tau A}(R_{\tau B}u^{n}-\tau B R_{\tau B}u^{n}) + \tau B R_{\tau B}u^{n}). \end{array}$

We cancel the outer ${R_{\tau B}}$ and use ${\tau B R_{\tau B}u^{n} = u^{n} - R_{\tau B}u^{n}}$ to get

$\displaystyle \begin{array}{rcl} u^{n+1} & = & R_{\tau A}(2R_{\tau B}u^{n} - u^{n}) + u^{n} - R_{\tau B}u^{n} \end{array}$

and here we go: This is exactly what is known as Douglas-Rachford method (see the last version of the iteration in my previous post). Note that it is not ${u^{n}}$ that converges to a solution, but ${w^{n} = R_{\tau B}u^{n}}$, so it is convenient to write the iteration in the two variables

$\displaystyle \begin{array}{rcl} w^{n} & = & R_{\tau B}u^{n}\\ u^{n+1} & = & R_{\tau A}(2w^{n} - u^{n}) + u^{n} - w^{n}. \end{array}$

The observation, that these splitting method that Douglas and Rachford devised for linear problems has a kind of much wider applicability is due to Lions and Mercier and the paper is

Lions, Pierre-Louis, and Bertrand Mercier. “Splitting algorithms for the sum of two nonlinear operators.” SIAM Journal on Numerical Analysis 16.6 (1979): 964-979.

Other, much older, splitting methods for linear systems, such as the Jacobi method, the Gauss-Seidel method used different properties of the matrices such as the diagonal of the matrix or the upper and lower triangluar parts and as such, do not generalize easily to the case of operators on a Hilbert space.

Assume that we are given a probability distribution ${p}$ on ${{\mathbb R}^{n}}$ and for simplicity we further assume that ${p}$ is continuous, i.e. ${p(x)}$ somehow indicates, how likely ${x\in{\mathbb R}^{n}}$ is. To be more precise, the probability, that ${x\in A}$ for some (measurable) set ${A}$ is, is ${\int_{A}dp}$. We do I start like this? There are cases, in which one known some probability distribution, then obtains some ${x}$ and wants to know, if this ${x}$ was particularly representative for this distribution. One example of this situation is a statistical inverse problem where the model is ${y = Ax + \epsilon}$ with a linear (known) map ${A}$ and some error ${\epsilon}$. From some observed ${y}$, one wants to know as much as possible about the underlying ${x}$. Assuming that the noise ${\epsilon}$ has a known distribution ${p(\epsilon)}$ and that one has prior information on ${x}$ (i.e. likely ${x}$ are ones where the prior distribution ${p(x)}$ is large) one can model the following probability distributions:

$\displaystyle \begin{array}{rcl} p(y\mid x) = p(y-Ax) = p(\epsilon) &&\text{prob. to see}\ y,\ \text{provided}\ x\\ p(x) && \text{prior distribution of}\ x\\ p(x\mid y) = \frac{p(y\mid x)p(x)}{p(y)} && \text{prob. of}\ x,\ \text{provided}\ y; \text{posterior}. \end{array}$

The latter distribution, the posterior, is what one is interested in, i.e. given that one has observed ${y}$, what is the probability that some ${x}$ (which we may generate in some way) is the true solution? So, in principle, all information we have is contained in the posterior, but how to get on grip on it?

What we know are the distribution ${p(y\mid x)}$ and ${p(x)}$ but note that ${p(y)}$ is not know and is usually not simple to compute. Another expression for ${p(y)}$ is ${\int p(y\mid x)p(x) dy}$ since the right hand side in the definition of the posterior has to be a probability distribution. So to calculate ${p(y)}$ we have to evaluate one integral, but note that this is in an integral over ${{\mathbb R}^{n}}$ and in the context of statistical inverse problems, this ${n}$ is usually not in the ten-thousands, but easily in the millions. So we see, that the calculation of ${p(y)}$ is usually out of reach (unless all distributions are very simple). So where are we: If we get some ${x}$ we can calculate ${p(y\mid x)}$ and ${p(x)}$, but what would it say if ${p(y\mid x)p(x) =0.01}$, for example? Not much, because we do not know the the normalization constant. While ${0.01}$ seems pretty small and hence ${x}$ to be somehow unlikely, it may be that ${p(y) = 0.011}$ and then ${ p(x\mid y) = 0.01/0.011 \approx 0.91}$ which would make ${x}$ to be rather likely. On the other hand we would really would like some more information about ${p(x\mid y)}$ to judge about that ${x}$ since ${p(x\mid y)}$ may have a large variance and a value of ${0.01}$ may be among the largest possible values, again rendering ${x}$ quite likely, but subject to large variance. To recap:

1. We have a probability distribution ${p}$ but we can only generate values ${Cp(x)}$ for some unknown ${C>0}$.
2. The domain of definition of ${p}$ has a huge dimension.
3. We want to know as much as possible about ${p}$, e.g. its expected value, its variance or its mode…

Note that some questions about ${p}$ can be answered without knowing the normalization constant, e.g. the mode ${x^{*}}$, i.e. the ${x}$ which maximizes ${p}$. That may one prime reason while MAP (maximum-a-posterior) estimators are so widely used… One approach, to get more information out of ${p}$ is to use sampling.

Before we come to methods that can sample from unnormalized distributions, we describe what we actually mean by sampling, and give the main building blocks.

1. Sampling from simple distribution

Sampling from a distribution ${p}$ means a method to generate values ${x}$ such that the values are distributed according to ${p}$. Expressed in formula: For every integrable function ${f}$ and sequence ${x_{k}}$ of generated samples, we want that

$\displaystyle \begin{array}{rcl} \lim_{n\rightarrow\infty}\frac1n\sum_{k=1}^{n}f(x_{k}) = \int f(x)dp(x). \end{array}$

In the following we want to describe a few methods to generate samples. All the methods will build on the ability to sample from some simple basic distributions, and these are

1. The uniform distribution on ${[0,1]}$ denoted by ${\mathcal{U}(0,1)}$. On a computer this is possible to a good approximation by pseudorandom number generators like the Mersenne twister (used, e.g., in MATLAB).
2. The standard normal distribution denoted by ${\mathcal{N}(0,1)}$. If you generate pseudorandom normally distributed numbers, then, under the hood, the machine generates uniformly distributed numbers and cleverly transforms these to be normally distributed, e.g. with the Box-Muller method.

Note that simple scaling allows to sample from ${\mathcal{U}(a,b)}$ and ${\mathcal{N}(\mu,\sigma^{2})}$.

If we sample ${x}$ from some distribution ${p}$ we will denote this by

$\displaystyle \begin{array}{rcl} x\sim p \end{array}$

so ${x\sim \mathcal{U}(0,1)}$ means that ${x}$ is drawn from a uniform distribution over ${[0,1]}$.

2. Rejection sampling

For rejection sampling from ${p}$ (having access to ${Cp}$ only) we need a so-called proposal distribution ${q}$ from which we can sample (i.e. a uniform one or a normal one) and we need to know some ${M>0}$ such that

$\displaystyle \begin{array}{rcl} Mq(x) \geq Cp(x) \end{array}$

for all ${x}$. The sampling scheme goes as follows: Draw proposals from the proposal distribution and accept them based on some rule that will ensure the we will end up with samples distributed according to ${p}$. In more detail:

1. generate ${x\sim q}$, i.e. sample ${x}$ from the proposal distribution,
2. calculate the acceptance rate

$\displaystyle \begin{array}{rcl} \alpha = \frac{Cp(x)}{Mq(x)}, \end{array}$

3. generate ${t\sim \mathcal{U}(0,1)}$,
4. accept ${x}$ if ${t\leq \alpha}$, otherwise reject it.

Proposition 1 The samples generated by rejection sampling are distributed according to ${p}$.

Proof: Let us denote by ${A}$ the event that a sample ${x}$, drawn from ${q}$ has been accepted. Further, we denote by ${\pi(x\mid A)}$ the distribution of ${x}$, provided it has been accepted. So we aim to show that ${\pi(x\mid A) = p(x)}$.

To do so, we employ Bayes’ theorem and write

$\displaystyle \begin{array}{rcl} \pi(x\mid A) = \frac{\pi(A\mid x)\pi(x)}{\pi(A)}. \end{array}$

For the three probabilities on the right hand side we know:

• ${\pi(x)}$ is the distribution of the sample, i.e the proposal distribution, ${\pi(x) = q(x)}$.
• The probability ${\pi(A\mid x)}$ is the probability of acceptance, provided that ${x}$ has been sampled. Looking at the acceptance step (step 2), we see that this probability is exactly ${\alpha}$, i.e.

$\displaystyle \begin{array}{rcl} \pi(A\mid x) & = & \alpha = \frac{Cp(x)}{Mq(x)}. \end{array}$

• The probability of the event ${A}$ is the probability of acceptance, and this is given by the integral of the joint distribution ${\pi(x,A)}$ over the values ${x}$. Since the joint distribution fulfills ${\pi(x,A) = \pi(A\mid x)\pi(x)}$ and ${\pi(x) = q(x)}$ we get

$\displaystyle \begin{array}{rcl} \pi(A) & = & \int \pi(x,A) dx = \int \pi(A\mid x)q(x)dx\\ & = & \int \frac{Cp(x)}{Mq(x)}q(x) dx\\ & = & \int \frac{C}{M}p(x) dx = \frac{M}{C} \int p(x)dx = \frac{M}{C} \end{array}$

since ${p}$ is a probability distribution.

Together we get

$\displaystyle \begin{array}{rcl} \pi(x\mid A) = \frac{\frac{Cp(x)}{Mq(x)}\, q(x)}{\frac{C}{M}} = p(x). \end{array}$

$\Box$

The crucial steps to apply rejection sampling is to find a good proposal distribution with small constant ${M}$ (and also, to calculate ${M}$ can be tricky). As an example, consider that you want to sample from the tail of a standard normal distribution, e.g. you want to obtain “normally distributed random numbers larger that ${2}$”. Rejection sampling is pretty straightforward: You choose ${q}$ as standard normal, get samples ${x}$ from that and only accept if ${x\geq 2}$. In this case you have ${M=1}$, but ${C}$ is small, namely ${C\approx 0.023}$ which means that less than 1 out of 43 samples are accepted…

3. (Non-adaptive) Metropolis sampling

Here comes another method. This method is from the class of Markov-Chain methods, i.e. we will generate a sequence ${x_{1},x_{2},\dots}$ such that they form a Markov chain with a given transition probability. This means, we have a distribution ${\kappa}$ such that ${x_{k+1}\sim \kappa(\cdot,x_{k})}$. Again, our goal is that the sequence is distributed according to ${p}$. A central result in the theory of Markov-chain methods is, that this is the case when the so-called (detailed) balance equation is fulfilled , i.e. if

$\displaystyle \begin{array}{rcl} p(x)\kappa(y,x) = p(y)\kappa(x,y). \end{array}$

Remember, that we only have access to ${Cp(x)}$ and we don’t know ${C}$. Here is the method which is called Metropolis-Hastings sampling: To begin, choose a symmetric proposal distribution ${q}$, (i.e. ${q(x,y)}$ is the distribution for “go from ${y}$ to ${x}$” and fulfills ${q(x,y) = q(y,x)}$), initialize with some ${x_{0}}$ and set ${k=1}$. Then:

1. generate ${x^{*}\sim q(\cdot,x_{k-1})}$, i.e. make a trial step from ${x_{k-1}}$,
2. set ${\alpha = \min(1,\tfrac{p(x^{*})}{p(x_{k-1})})}$,
3. generate ${t\sim\mathcal{U}(0,1)}$
4. accept ${x^{*}}$ with probability ${\alpha}$, i.e. set ${x_{k} = x^{*}}$ if ${t\leq \alpha}$, increase ${k}$ and repeat, otherwise do reject, i.e. do not increase ${k}$ and repeat.

Note that in step 3 we don’t really need ${p(x^{*})}$ and ${p(x_{k+1})}$ but only ${Cp(x^{*})}$ and ${Cp(x_{k-1})}$ since the unknown ${C}$‘s cancel out.

Proposition 2 The samples generated from Metropolis-Hastings sampling are distributed according to ${p}$.

Proof: We first calculate the transition probability of the method. The probability to go from ${x_{k-1}}$ to the proposed ${x^{*}}$ is “${q(x^{*},x_{k-1})}$ multiplied by the acceptance rate ${\alpha}$” and the probability to stay in ${x_{k-1}}$ is ${1-\alpha}$. In formula: We write the acceptance ratio as

$\displaystyle \begin{array}{rcl} \alpha(x_{k-1},x^{*}) = \min(1,\frac{p(x^{*})}{p(x_{k-1})}) = \begin{cases} 1 & p(x^{*})\geq p(x_{k-1})\\ \frac{p(x^{*})}{p(x_{k-1})} & p(x_{k-1})\geq p(x^{*}) \end{cases} \end{array}$

and can write the transition probability as

$\displaystyle \begin{array}{rcl} \kappa(y,x) = \alpha(x,y)q(y,x) + (1-\alpha^{*}(x))\delta_{x}(y) \end{array}$

with

$\displaystyle \begin{array}{rcl} \alpha^{*}(x) = \int \alpha(x,y)q(y,x) dy. \end{array}$

We aim to show the detailed balance equation ${p(x)\kappa(y,x) = p(y)\kappa(x,y)}$. First by a simple distinction of cases, we get

$\displaystyle \begin{array}{rcl} \frac{\alpha(x,y)}{\alpha(y,x)} = \frac{p(y)}{p(x)}. \end{array}$

This gives ${p(x)\alpha(x,y) = p(y)\alpha(y,x)}$ and since ${q}$ is symmetric, we get

$\displaystyle \begin{array}{rcl} p(x)\alpha(x,y)q(x,y) = p(y)\alpha(y,x)q(x,y). \end{array}$

Then we note that

$\displaystyle \begin{array}{rcl} (1-\alpha^{*}(x))\delta_{x}(y) p(x) = (1-\alpha^{*}(y))\delta_{y}(x)p(y) \end{array}$

(which can be seen by integration both sides against some ${f(x,y)}$). Putting things together, we get

$\displaystyle \begin{array}{rcl} p(x)\kappa(y,x) & = & p(x) \alpha(x,y)q(y,x) + p(x)(1-\alpha^{*}(x))\delta_{x}(y)\\ & = & p(y)\alpha(y,x)q(x,y) + p(y)(1-\alpha^{*}(y)\delta_{y}(x)\\ & = & p(y)\kappa(x,y) \end{array}$

as desired. $\Box$

Note that Metropolis-Hastings sampling is fairly easy to implement as soon as it is simple to get samples from the proposal distribution. A quick way is to use the standard normal distribution as proposal distribution. In this case, Metropolis-Hastings performs a “restricted random walk”, i.e. if we would accept every step, the sequence ${x_{k}}$ would be a random walk. However, we allow for the possibility to stop the walk in the cases where it would lead us to a region of lower probability — in the cases where it leads to a point of higher probability, we follow the random walk. Although this approach is simple to implement, it may take a lot of time for the chain to get something reasonable. The reasons are, that the random walk steps may be rejected quite often, that it is a long way to get to where the “most probability is” from the initialization or that the distribution ${p}$ has several modes, separated by valleys and that the random walk has difficulties to get from one mode to the other (again due to frequent rejection).

Also note that one can use Gaussian distributions with any variance as proposal distributions and that the variance acts like some stepsize. As often with stepsizes, there is a trade-off: smaller stepsizes mean, that the sampler will need more time to explore the distribution while larger stepsizes would allow faster exploration but also lead to more frequent rejection.