Assume that we are given a probability distribution {p} on {{\mathbb R}^{n}} and, for simplicity, assume further that {p} is continuous, i.e. {p(x)} indicates how likely {x\in{\mathbb R}^{n}} is. To be more precise, the probability that {x\in A} for some (measurable) set {A} is {\int_{A}dp}. Why do I start like this? There are cases in which one knows some probability distribution, then obtains some {x} and wants to know if this {x} is particularly representative for this distribution. One example of this situation is a statistical inverse problem where the model is {y = Ax + \epsilon} with a linear (known) map {A} and some error {\epsilon}. From some observed {y}, one wants to know as much as possible about the underlying {x}. Assuming that the noise {\epsilon} has a known distribution {p(\epsilon)} and that one has prior information on {x} (i.e. likely {x} are ones where the prior distribution {p(x)} is large), one can model the following probability distributions:

\displaystyle  \begin{array}{rcl}  p(y\mid x) = p(y-Ax) = p(\epsilon) &&\text{prob. to see}\ y,\ \text{provided}\ x\\ p(x) && \text{prior distribution of}\ x\\ p(x\mid y) = \frac{p(y\mid x)p(x)}{p(y)} && \text{prob. of}\ x,\ \text{provided}\ y; \text{posterior}. \end{array}

The latter distribution, the posterior, is what one is interested in: given that one has observed {y}, what is the probability that some {x} (which we may generate in some way) is the true solution? So, in principle, all information we have is contained in the posterior, but how to get a grip on it?

What we know are the distributions {p(y\mid x)} and {p(x)}, but note that {p(y)} is not known and is usually not simple to compute. Another expression for {p(y)} is {\int p(y\mid x)p(x)\, dx} since the right hand side in the definition of the posterior has to be a probability distribution. So to calculate {p(y)} we have to evaluate one integral, but note that this is an integral over {{\mathbb R}^{n}} and in the context of statistical inverse problems, this {n} is usually not in the ten-thousands, but easily in the millions. So we see that the calculation of {p(y)} is usually out of reach (unless all distributions are very simple). So where are we: If we get some {x} we can calculate {p(y\mid x)} and {p(x)}, but what would it say if {p(y\mid x)p(x) = 0.01}, for example? Not much, because we do not know the normalization constant. While {0.01} seems pretty small and hence {x} seems somehow unlikely, it may be that {p(y) = 0.011} and then {p(x\mid y) = 0.01/0.011 \approx 0.91}, which would make {x} rather likely. On the other hand, we would really like some more information about {p(x\mid y)} to judge this {x}, since {p(x\mid y)} may have a large variance and a value of {0.01} may be among the largest possible values, again rendering {x} quite likely, but subject to large variance. To recap:

  1. We have a probability distribution {p}, but we can only evaluate values {Cp(x)} for some unknown {C>0}.
  2. The domain of definition of {p} has a huge dimension.
  3. We want to know as much as possible about {p}, e.g. its expected value, its variance or its mode…

Note that some questions about {p} can be answered without knowing the normalization constant, e.g. finding the mode {x^{*}}, i.e. the {x} which maximizes {p}. That may be one prime reason why MAP (maximum-a-posteriori) estimators are so widely used… One approach to get more information out of {p} is to use sampling.

Before we come to methods that can sample from unnormalized distributions, we describe what we actually mean by sampling, and give the main building blocks.

1. Sampling from simple distributions

Sampling from a distribution {p} means having a method to generate values {x} such that these values are distributed according to {p}. Expressed in formulas: for every integrable function {f} and the sequence {x_{k}} of generated samples, we want that

\displaystyle  \begin{array}{rcl}  \lim_{n\rightarrow\infty}\frac1n\sum_{k=1}^{n}f(x_{k}) = \int f(x)dp(x). \end{array}
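To make this property concrete, here is a tiny numerical illustration (a minimal sketch in Python/NumPy; the choice {f(x)=x^{2}} and the distribution {\mathcal{N}(0,1)} are just placeholders): the empirical mean of {f} over the samples should approach the integral, which here is the variance, i.e. {1}.

```python
import numpy as np

# Empirical means over samples approximate integrals against the distribution.
# For f(x) = x^2 and p = N(0,1) the integral is the variance, i.e. 1.
rng = np.random.default_rng(0)
samples = rng.standard_normal(1_000_000)
print(np.mean(samples**2))  # approx. 1
```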

In the following we want to describe a few methods to generate samples. All the methods will build on the ability to sample from some simple basic distributions, and these are

  1. The uniform distribution on {[0,1]} denoted by {\mathcal{U}(0,1)}. On a computer this is possible to a good approximation by pseudorandom number generators like the Mersenne twister (used, e.g., in MATLAB).
  2. The standard normal distribution denoted by {\mathcal{N}(0,1)}. If you generate pseudorandom normally distributed numbers, then, under the hood, the machine generates uniformly distributed numbers and cleverly transforms these to be normally distributed, e.g. with the Box-Muller method.

Note that simple scaling and shifting allows one to sample from {\mathcal{U}(a,b)} and {\mathcal{N}(\mu,\sigma^{2})}.
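In code this could look as follows (a minimal sketch using NumPy's generators for the two basic distributions; the function names are of course just for illustration):

```python
import numpy as np

rng = np.random.default_rng()

def uniform_ab(a, b, size=None):
    # shift and scale U(0,1) to obtain U(a,b)
    return a + (b - a) * rng.random(size)

def normal_mu_sigma2(mu, sigma2, size=None):
    # shift and scale N(0,1) to obtain N(mu, sigma^2)
    return mu + np.sqrt(sigma2) * rng.standard_normal(size)
```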

If we sample {x} from some distribution {p} we will denote this by

\displaystyle  \begin{array}{rcl}  x\sim p \end{array}

so {x\sim \mathcal{U}(0,1)} means that {x} is drawn from a uniform distribution over {[0,1]}.

2. Rejection sampling

For rejection sampling from {p} (having access to {Cp} only) we need a so-called proposal distribution {q} from which we can sample (e.g. a uniform or a normal one) and we need to know some {M>0} such that

\displaystyle  \begin{array}{rcl}  Mq(x) \geq Cp(x) \end{array}

for all {x}. The sampling scheme goes as follows: Draw proposals from the proposal distribution and accept them based on a rule which ensures that we end up with samples distributed according to {p}. In more detail (a short code sketch follows the list):

  1. generate {x\sim q}, i.e. sample {x} from the proposal distribution,
  2. calculate the acceptance rate

    \displaystyle  \begin{array}{rcl}  \alpha = \frac{Cp(x)}{Mq(x)}, \end{array}

  3. generate {t\sim \mathcal{U}(0,1)},
  4. accept {x} if {t\leq \alpha}, otherwise reject it.
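In Python, the scheme could be sketched like this (a minimal sketch, not a definitive implementation: `Cp` evaluates the unnormalized target, `q` and `q_sample` are the proposal density and a sampler for it, and `M` is the constant from above; all these names are placeholders):

```python
import numpy as np

rng = np.random.default_rng()

def rejection_sampling(Cp, q, q_sample, M, n_samples):
    """Sample from p, given only Cp, a proposal density q with sampler
    q_sample, and M such that M*q(x) >= C*p(x) for all x."""
    samples = []
    while len(samples) < n_samples:
        x = q_sample()                 # step 1: draw a proposal
        alpha = Cp(x) / (M * q(x))     # step 2: acceptance rate
        t = rng.uniform()              # step 3: uniform random number
        if t <= alpha:                 # step 4: accept or reject
            samples.append(x)
    return np.array(samples)
```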

Proposition 1 The samples generated by rejection sampling are distributed according to {p}.

Proof: Let us denote by {A} the event that a sample {x} drawn from {q} has been accepted. Further, we denote by {\pi(x\mid A)} the distribution of {x} given that it has been accepted. So we aim to show that {\pi(x\mid A) = p(x)}.

To do so, we employ Bayes’ theorem and write

\displaystyle  \begin{array}{rcl}  \pi(x\mid A) = \frac{\pi(A\mid x)\pi(x)}{\pi(A)}. \end{array}

For the three probabilities on the right hand side we know:

  • {\pi(x)} is the distribution of the sample, i.e. the proposal distribution: {\pi(x) = q(x)}.
  • The probability {\pi(A\mid x)} is the probability of acceptance, provided that {x} has been sampled. Looking at the acceptance steps (steps 2–4), we see that this probability is exactly {\alpha}, i.e.

    \displaystyle  \begin{array}{rcl}  \pi(A\mid x) & = & \alpha = \frac{Cp(x)}{Mq(x)}. \end{array}

  • The probability of the event {A} is the probability of acceptance, and this is given by the integral of the joint distribution {\pi(x,A)} over the values {x}. Since the joint distribution fulfills {\pi(x,A) = \pi(A\mid x)\pi(x)} and {\pi(x) = q(x)} we get

    \displaystyle  \begin{array}{rcl}  \pi(A) & = & \int \pi(x,A)\, dx = \int \pi(A\mid x)q(x)\,dx\\ & = & \int \frac{Cp(x)}{Mq(x)}q(x)\, dx\\ & = & \frac{C}{M} \int p(x)\,dx = \frac{C}{M} \end{array}

    since {p} is a probability distribution.

Together we get

\displaystyle  \begin{array}{rcl}  \pi(x\mid A) = \frac{\frac{Cp(x)}{Mq(x)}\, q(x)}{\frac{C}{M}} = p(x). \end{array}

\Box

The crucial step in applying rejection sampling is to find a good proposal distribution with a small constant {M} (and calculating {M} can also be tricky). As an example, consider that you want to sample from the tail of a standard normal distribution, e.g. you want to obtain “normally distributed random numbers larger than {2}”. Rejection sampling is pretty straightforward: You choose {q} as standard normal, draw samples {x} from it and only accept if {x\geq 2}. In this case you have {M=1}, but {C} is small, namely {C\approx 0.023}, which means that fewer than 1 out of 43 samples are accepted…
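In this special case the scheme just amounts to keeping the proposals that land in the tail; the following sketch (plain NumPy, nothing else assumed) also estimates the acceptance rate, which should come out near {0.023}:

```python
import numpy as np

# Tail sampling by rejection: propose from N(0,1), accept iff x >= 2
# (there alpha = Cp(x)/(M*q(x)) equals 1, elsewhere it equals 0).
rng = np.random.default_rng()
n_proposals = 1_000_000
x = rng.standard_normal(n_proposals)
tail_samples = x[x >= 2]
print(tail_samples.size / n_proposals)  # acceptance rate, approx. 0.023
```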

3. (Non-adaptive) Metropolis sampling

Here comes another method. This method is from the class of Markov chain methods, i.e. we will generate a sequence {x_{1},x_{2},\dots} such that it forms a Markov chain with a given transition probability. This means we have a transition kernel {\kappa} such that {x_{k+1}\sim \kappa(\cdot,x_{k})}. Again, our goal is that the sequence is distributed according to {p}. A central result in the theory of Markov chain methods is that this is the case when the so-called (detailed) balance equation is fulfilled, i.e. if

\displaystyle  \begin{array}{rcl}  p(x)\kappa(y,x) = p(y)\kappa(x,y). \end{array}

Remember that we only have access to {Cp(x)} and we don’t know {C}. Here is the method, which is called Metropolis-Hastings sampling: To begin, choose a symmetric proposal distribution {q} (i.e. {q(x,y)} is the distribution for “go from {y} to {x}” and fulfills {q(x,y) = q(y,x)}), initialize with some {x_{0}} and set {k=1}. Then:

  1. generate {x^{*}\sim q(\cdot,x_{k-1})}, i.e. make a trial step from {x_{k-1}},
  2. set {\alpha = \min(1,\tfrac{p(x^{*})}{p(x_{k-1})})},
  3. generate {t\sim\mathcal{U}(0,1)}
  4. accept {x^{*}} with probability {\alpha}, i.e. set {x_{k} = x^{*}} if {t\leq \alpha}, otherwise reject the proposal, i.e. set {x_{k} = x_{k-1}}; in both cases increase {k} and repeat.

Note that in step 2 we don’t really need {p(x^{*})} and {p(x_{k-1})} but only {Cp(x^{*})} and {Cp(x_{k-1})}, since the unknown {C}‘s cancel out.
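Here is a minimal sketch of the method with a Gaussian random-walk proposal (again, `Cp` evaluates the unnormalized target; the function name and the stepsize parameter `sigma` are illustrative choices, not prescribed by the method):

```python
import numpy as np

rng = np.random.default_rng()

def metropolis(Cp, x0, n_samples, sigma=1.0):
    """Random-walk Metropolis-Hastings: Cp evaluates the unnormalized
    target, x0 is the initial state, sigma is the stepsize of the
    Gaussian proposal."""
    x = np.asarray(x0, dtype=float)
    samples = np.empty((n_samples,) + x.shape)
    for k in range(n_samples):
        x_star = x + sigma * rng.standard_normal(x.shape)  # step 1: trial step
        alpha = min(1.0, Cp(x_star) / Cp(x))               # step 2: C cancels here
        if rng.uniform() <= alpha:                         # steps 3-4: accept...
            x = x_star
        samples[k] = x                   # ...or record the old state on rejection
    return samples
```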

Proposition 2 The samples generated from Metropolis-Hastings sampling are distributed according to {p}.

Proof: We first calculate the transition probability of the method. The probability to go from {x_{k-1}} to the proposed {x^{*}} is “{q(x^{*},x_{k-1})} multiplied by the acceptance rate {\alpha}”, and the probability to stay at {x_{k-1}} is {1-\alpha}. In formulas: we write the acceptance ratio as

\displaystyle  \begin{array}{rcl}  \alpha(x_{k-1},x^{*}) = \min(1,\frac{p(x^{*})}{p(x_{k-1})}) = \begin{cases} 1 & p(x^{*})\geq p(x_{k-1})\\ \frac{p(x^{*})}{p(x_{k-1})} & p(x_{k-1})\geq p(x^{*}) \end{cases} \end{array}

and can write the transition probability as

\displaystyle  \begin{array}{rcl}  \kappa(y,x) = \alpha(x,y)q(y,x) + (1-\alpha^{*}(x))\delta_{x}(y) \end{array}

with

\displaystyle  \begin{array}{rcl}  \alpha^{*}(x) = \int \alpha(x,y)q(y,x) dy. \end{array}

We aim to show the detailed balance equation {p(x)\kappa(y,x) = p(y)\kappa(x,y)}. First, by a simple distinction of cases (if {p(y)\geq p(x)}, then {\alpha(x,y) = 1} and {\alpha(y,x) = p(x)/p(y)}, and similarly in the other case), we get

\displaystyle  \begin{array}{rcl}  \frac{\alpha(x,y)}{\alpha(y,x)} = \frac{p(y)}{p(x)}. \end{array}

This gives {p(x)\alpha(x,y) = p(y)\alpha(y,x)} and since {q} is symmetric, we get

\displaystyle  \begin{array}{rcl}  p(x)\alpha(x,y)q(y,x) = p(y)\alpha(y,x)q(x,y). \end{array}

Then we note that

\displaystyle  \begin{array}{rcl}  (1-\alpha^{*}(x))\delta_{x}(y) p(x) = (1-\alpha^{*}(y))\delta_{y}(x)p(y) \end{array}

(which can be seen by integrating both sides against some {f(x,y)}). Putting things together, we get

\displaystyle  \begin{array}{rcl}  p(x)\kappa(y,x) & = & p(x) \alpha(x,y)q(y,x) + p(x)(1-\alpha^{*}(x))\delta_{x}(y)\\ & = & p(y)\alpha(y,x)q(x,y) + p(y)(1-\alpha^{*}(y))\delta_{y}(x)\\ & = & p(y)\kappa(x,y) \end{array}

as desired. \Box

Note that Metropolis-Hastings sampling is fairly easy to implement as soon as it is simple to draw from the proposal distribution. A quick way is to use a standard normal distribution as proposal. In this case, Metropolis-Hastings performs a “restricted random walk”: if we accepted every step, the sequence {x_{k}} would be a random walk. However, we allow for the possibility to stop the walk in the cases where it would lead us to a region of lower probability; in the cases where it leads to a point of higher probability, we always follow the random walk. Although this approach is simple to implement, it may take a long time for the chain to produce something reasonable. The reasons are that the random walk steps may be rejected quite often, that it may be a long way from the initialization to where “most of the probability” sits, or that the distribution {p} has several modes separated by valleys, so that the random walk has difficulties getting from one mode to another (again due to frequent rejection).

Also note that one can use Gaussian proposal distributions with any variance, and that the variance acts like a stepsize. As is often the case with stepsizes, there is a trade-off: a smaller stepsize means that the sampler needs more time to explore the distribution, while a larger stepsize allows faster exploration but also leads to more frequent rejection.
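With the sketch from above one can observe this trade-off directly, e.g. for a standard normal target (the concrete values of `sigma` are arbitrary):

```python
# Unnormalized standard normal target, explored with two different stepsizes.
Cp = lambda x: np.exp(-0.5 * x**2)
small = metropolis(Cp, x0=0.0, n_samples=10_000, sigma=0.05)
large = metropolis(Cp, x0=0.0, n_samples=10_000, sigma=5.0)
# Fraction of accepted moves: high for the small, low for the large stepsize.
print(np.mean(small[1:] != small[:-1]), np.mean(large[1:] != large[:-1]))
```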

A few days ago I’ve been to IFIP TC7 and participated in the minisymposium “Optimization in Banach spaces with sparsity constraints” organized by Kristian Bredies and Christian Clason. My session was very interesting, consisting of talks by Masha Zhariy (Optimization algorithms for sparse reconstruction in inverse problems with adaptive stopping rules), Caroline Verhoeven (On a generalization of the iterative soft-thresholding algorithm for the case of non-separable penalty) and Kristian (Inverse problems in spaces of measures). Unfortunately, I could not take part in the second half of the minisymposium.

Although the minisymposium was interesting, I’d like to talk about the plenary talk “Signal and systems representation; signal spaces, sampling, and quantization effects” by Holger Boche. The title sounded as if I should be able to follow, although I had never heard the name before. And indeed, Holger Boche gave a fresh view onto the sampling and reconstruction problem, i.e. the digital-to-analog conversion and its inverse.

It was somehow refreshing to hear a talk about sampling which was totally free of the words “compressive” and “compressed”. Holger Boche modelled continuous-time signals in the well-known Bernstein and Paley-Wiener spaces and started with fairly old but not too well known results on exact reconstruction from non-uniform sampling.

Definition 1 For {1\leq p\leq \infty} and {\sigma>0}, the Paley-Wiener space {\mathcal{PW}_\sigma^p} is

\displaystyle  \mathcal{PW}_\sigma^p = \{\hat g\ :\ g\in L^p([-\sigma,\sigma])\}.

Definition 2 A function {\phi} is of sine-type if it is of exponential type {\pi} and

  1. its zeros are separated, and
  2. there exist real {A,B,H>0} such that for all {x\in\mathbb{R}} and {|y|\geq H} it holds that

    \displaystyle  Ae^{\pi|y|}\leq |\phi(x+iy)| \leq Be^{\pi|y|}.

Then, one reconstruction theorem for non-uniform sampling points goes as follows:

Theorem 3 Let {\phi} be a function of sine-type, whose zeros {(t_k)} are all real and define

\displaystyle  \phi_k(t) = \frac{\phi(t)}{\phi'(t_k)(t-t_k)}.

Then, for every {f\in\mathcal{PW}_\pi^1} and all {T>0} it holds that

\displaystyle  \lim_{N\rightarrow\infty}\|f - \sum_{k=-N}^N f(t_k)\phi_k\|_{L^\infty([-T,T])} = 0.

This theorem is a generalization of a result of J.L. Brown on local uniform convergence of the Shannon-sampling series.

These results are related to a sampling theorem due to Logan which I learned some years ago. First we recall that the Hilbert transform {\mathcal{H}f} of a function {f} can be defined via the Fourier transform {\mathcal{F}} as

\displaystyle  \mathcal{F}(\mathcal{H}(u))(\omega) = (-i\text{sign}(\omega))\mathcal{F}(u)(\omega).
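Since sign conventions for the Hilbert transform differ across references, a quick numerical sanity check may help; SciPy’s `scipy.signal.hilbert` returns the analytic signal {u + i\mathcal{H}(u)}, which matches the convention above (a small sketch, assuming exactly one period is sampled so that the FFT-based computation is exact):

```python
import numpy as np
from scipy.signal import hilbert

# With F(Hu)(w) = -i*sign(w)*F(u)(w) one has H(cos) = sin.
t = np.linspace(0, 2 * np.pi, 1024, endpoint=False)
Hcos = np.imag(hilbert(np.cos(t)))   # imaginary part of the analytic signal
print(np.allclose(Hcos, np.sin(t)))  # True
```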

Then, Logan’s Theorem is as follows:

Theorem 4 (Logan’s Sampling Theorem) Let {f_1,f_2:\mathbb{R}\rightarrow\mathbb{R}} be such that the supports of both {\hat f_1} and {\hat f_2} are contained in {[-b,-a]\cup[a,b]} for some numbers {a} and {b} with {0<a<b<2a}. Then, if {f_1} and {f_2} do not have a root in common with their Hilbert transforms, {\text{sign} f_1 = \text{sign} f_2} implies that {f_2} is equal to {f_1} up to a constant factor.

Proof: (Informal) Since modulation shifts the Fourier transform, we can write a function {f} with {\text{supp}\hat f\subset [-b,-a]\cup[a,b]} as a modulated sum of functions {h_1} and {h_2} with bandwidth {(b-a)/2}, using {\mu=(a+b)/2}, as follows:

\displaystyle  f(t) = -i h_1(t)e^{i\mu t} + ih_2(t)e^{-i\mu t}.

Expressing {\exp} in terms of {\sin} and {\cos} gives

\displaystyle  \begin{array}{rcl}  f(t) &=& (-ih_1(t) + ih_2(t))\cos(\mu t) + (h_1(t)+h_2(t))\sin(\mu t)\\ &=& p(t)\cos(\mu t) + q(t)\sin(\mu t). \end{array}

Note that {p} and {q} are real valued since {f} is assumed to be real valued. Using {\mathcal{H}(\cos(\mu\cdot)) = \sin(\mu\cdot)} and {\mathcal{H}(\sin(\mu\cdot)) = -\cos(\mu\cdot)} (for {\mu>0}, by the above convention) together with the fact that {p} and {q} have low bandwidth compared to the carrier frequency {\mu} (Bedrosian’s theorem), we conclude that

\displaystyle  \mathcal{H}(f)(t) = p(t)\sin(\mu t) - q(t)\cos(\mu t).

With the obvious notation we obtain that

\displaystyle  \mathcal{H}(f_1)(t)f_2(t) - \mathcal{H}(f_2)(t)f_1(t) = p_1(t)q_2(t) - p_2(t)q_1(t) = g(t).
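To see the middle equality, one can simply plug in the decompositions of {f_1}, {f_2} and their Hilbert transforms and use {\sin^2+\cos^2 = 1}; the mixed terms cancel (arguments {t} and {\mu t} suppressed):

\displaystyle  \begin{array}{rcl}  \mathcal{H}(f_1)f_2 - \mathcal{H}(f_2)f_1 & = & (p_1\sin - q_1\cos)(p_2\cos + q_2\sin)\\ && - (p_2\sin - q_2\cos)(p_1\cos + q_1\sin)\\ & = & p_1q_2(\sin^2+\cos^2) - p_2q_1(\sin^2+\cos^2)\\ & = & p_1q_2 - p_2q_1. \end{array}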

Now, the zeros of {g} include the common zeros of {f_1} and {f_2}. Moreover, the bandwidth of {g} is at most {b-a} (since the bandwidth of {p_j} and {q_j} is at most {(b-a)/2}). We can conclude that {g} is identically zero by an argument which uses that the upper density of the zeros of {f_1} (resp. {f_2}) is high enough to force {g} to zero. Hence, we know that

\displaystyle  \frac{f_1(t)}{\mathcal{H}(f_1)(t)} = \frac{f_2(t)}{\mathcal{H}(f_2)(t)}.

To finish the proof one needs some argument from complex analysis (using that both sides of the above equation extend to meromorphic functions of exponential type) to conclude that {f_1} equals {f_2} up to a constant factor. \Box

When I asked about the relation between Logan’s theorem and his results, he replied that Logan’s theorem was the starting point, since he was curious why there was no practical implementation of this sampling procedure available. More precisely, he was interested in designing a digital implementation, based on sampling points, which allows one to digitize all linear time-invariant filters. One of his negative results was the following.

Definition 5 A sequence {(t_k)} is called a complete interpolation sequence for {\mathcal{PW}_\pi^2} and {\ell^2} if for every {c\in\ell^2} there is exactly one {f\in \mathcal{PW}_\pi^2} such that {f(t_k) = c_k}.

Theorem 6 For a complete interpolation sequence {(t_k)} for {\mathcal{PW}_\pi^2} and {\ell^2}, the corresponding reconstruction functions {\phi_k}, and a given {t\in\mathbb{R}}, there always exist a stable linear time-invariant filter {T} and a signal {f\in \mathcal{PW}_\pi^1} such that

\displaystyle  \limsup_{N\rightarrow\infty} | (Tf)(t) - \sum_{k=-N}^N f(t_k)(T\phi_k)(t)| = \infty.

He conjectured (Conjecture 2.1 in his talk) that this phenomenon of point-wise divergence of the approximation of a linear time-invariant filter persists even if oversampling is applied, i.e. that there always is a function {f\in\mathcal{PW}_{\beta\pi}^1} (with {0<\beta<1}) such that point-wise divergence happens. Moreover, he conjectured (Conjecture 2.2) that digitizing stable linear time-invariant filters should be possible if one replaces point-sampling by appropriate linear functionals, and he posed the open problem to both prove this conjecture and find such linear functionals.