February 2017
Monthly Archive
February 24, 2017
Here is a lemma that I find myself googling regularly since I always forget its exact form.
Lemma 1 Let $A$ be a monotone operator on a Hilbert space, $\lambda > 0$, and denote by $R_{\lambda A} = (I + \lambda A)^{-1}$ the resolvent of $\lambda A$. Then it holds that

$$R_{\lambda A}(x) + \lambda\, R_{\lambda^{-1}A^{-1}}\big(\tfrac{x}{\lambda}\big) = x.$$

Proof: We start with the left hand side: set $y = R_{\lambda A}(x)$, i.e. $x \in y + \lambda Ay$, and deduce

$$\begin{aligned} x \in y + \lambda Ay &\iff \tfrac{x-y}{\lambda} \in Ay \iff y \in A^{-1}\big(\tfrac{x-y}{\lambda}\big)\\ &\iff \tfrac{x}{\lambda} \in \tfrac{x-y}{\lambda} + \tfrac{1}{\lambda} A^{-1}\big(\tfrac{x-y}{\lambda}\big) \iff \tfrac{x-y}{\lambda} = R_{\lambda^{-1}A^{-1}}\big(\tfrac{x}{\lambda}\big), \end{aligned}$$

i.e. $x - R_{\lambda A}(x) = \lambda R_{\lambda^{-1}A^{-1}}(x/\lambda)$, which is the claim. $\Box$
I do not know any official name of this, but would call it Moreau's identity, which is the name of the respective statement for proximal operators of convex functions $f$ and $f^{*}$:

$$\mathrm{prox}_{\lambda f}(x) + \lambda\,\mathrm{prox}_{\lambda^{-1}f^{*}}\big(\tfrac{x}{\lambda}\big) = x.$$
The version for monotone operators is Proposition 23.18 in the first edition of Bauschke and Combettes' book "Convex Analysis and Monotone Operator Theory in Hilbert Spaces".
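To see the identity in action, here is a quick numerical check in Python (my own minimal sketch, not from the cited book): for $f = |\cdot|$ the resolvent $R_{\lambda\partial f}$ is soft thresholding, and since $(\partial f)^{-1} = \partial f^{*}$ with $f^{*} = I_{[-1,1]}$, the resolvent $R_{\lambda^{-1}\partial f^{*}}$ is the projection onto $[-1,1]$.

```python
import numpy as np

def prox_l1(x, lam):
    # soft thresholding = resolvent of lam * (subdifferential of |.|)
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0)

def proj_box(x):
    # projection onto [-1,1] = resolvent of any positive multiple of
    # the subdifferential of the indicator of [-1,1]
    return np.clip(x, -1.0, 1.0)

lam = 0.7  # made-up parameter
x = np.random.default_rng(0).standard_normal(5)
lhs = prox_l1(x, lam) + lam * proj_box(x / lam)
print(np.allclose(lhs, x))  # True, as the identity predicts
```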
February 24, 2017
I blogged about the Douglas-Rachford method before and in this post I’d like to dig a bit into the history of the method.
As the name suggests, the method has its roots in a paper by Douglas and Rachford and the paper is
Douglas, Jim, Jr., and Henry H. Rachford Jr., "On the numerical solution of heat conduction problems in two and three space variables." Transactions of the American Mathematical Society 82.2 (1956): 421-439.
At first glance, the title does not suggest that the paper may be related to monotone inclusions and if you read the paper you’ll not find any monotone operator mentioned. So let’s start and look at Douglas and Rachford’s paper.
1. Solving the heat equation numerically
So let us see what they were after and how this is related to what is known as the Douglas-Rachford splitting method today.
Indeed, Douglas and Rachford wanted to solve the instationary heat equation

$$\partial_{t}u = \partial_{xx}u + \partial_{yy}u$$

with Dirichlet boundary conditions (they also considered three dimensions, but let us skip that here). They considered a rectangular grid and a very simple finite difference approximation of the second derivatives, i.e.

$$\partial_{xx}u(x,y) \approx \frac{u(x+h,y) - 2u(x,y) + u(x-h,y)}{h^{2}},\qquad \partial_{yy}u(x,y) \approx \frac{u(x,y+h) - 2u(x,y) + u(x,y-h)}{h^{2}}$$

(with modifications at the boundary to accommodate the boundary conditions). To ease notation, we abbreviate the difference quotients by operators (actually, also matrices) $U$ and $V$ that act on the grid values for a fixed time, i.e.

$$(Uu)(x,y) = \frac{u(x+h,y) - 2u(x,y) + u(x-h,y)}{h^{2}},\qquad (Vu)(x,y) = \frac{u(x,y+h) - 2u(x,y) + u(x,y-h)}{h^{2}}.$$

With this notation, our problem is to solve

$$\partial_{t}u = Uu + Vu$$

in time.
Then they give the following iteration:

$$\frac{u^{n+1/2} - u^{n}}{\tau} = Uu^{n+1/2} + Vu^{n} \qquad (1)$$

$$\frac{u^{n+1} - u^{n}}{\tau} = Uu^{n+1/2} + Vu^{n+1} \qquad (2)$$

(plus boundary conditions which I'd like to sweep under the rug here). If we eliminate $u^{n+1/2}$ from the first equation using the second we get

$$\frac{u^{n+1} - u^{n}}{\tau} = (U + V)u^{n+1} - \tau UV(u^{n+1} - u^{n}). \qquad (3)$$

This is a kind of implicit Euler method with an additional small term $-\tau UV(u^{n+1} - u^{n})$.
From a numerical point of view it has one advantage over the implicit Euler method: As equations (1) and (2) show, one does not need to invert $I - \tau(U+V)$ in every iteration, but only $I - \tau U$ and $I - \tau V$. Remember, this was in the 1950s, and solving large linear systems was a much bigger problem than it is today. In this specific case of the heat equation, the operators $I - \tau U$ and $I - \tau V$ are in fact tridiagonal (with a suitable ordering of the grid points), and hence, solving with $I - \tau U$ and $I - \tau V$ can be done by Gaussian elimination without any fill-in in linear time (read: Thomas algorithm). This is a huge time saver when compared to solving with $I - \tau(U+V)$, which has a fairly large bandwidth (no matter how you reorder).
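To make the computational advantage concrete, here is a minimal Python sketch of one step of iteration (1)-(2) (my own reconstruction, not code from the paper; grid size, time step and initial value are made up), using a banded solver for the tridiagonal systems:

```python
import numpy as np
from scipy.linalg import solve_banded

m, tau = 50, 1e-3    # interior grid points per direction and time step (made up)
h = 1.0 / (m + 1)
c = tau / h**2

# banded storage of the tridiagonal matrix I - tau*U (identical for I - tau*V):
# diagonal 1 + 2*tau/h^2, off-diagonals -tau/h^2
band = np.zeros((3, m))
band[0, 1:] = -c
band[1, :] = 1 + 2 * c
band[2, :-1] = -c

def diff2(u, axis):
    # second difference quotient with zero Dirichlet boundary values
    up = np.pad(u, 1)
    if axis == 0:   # the operator U (x-direction)
        d = up[2:, 1:-1] - 2 * up[1:-1, 1:-1] + up[:-2, 1:-1]
    else:           # the operator V (y-direction)
        d = up[1:-1, 2:] - 2 * up[1:-1, 1:-1] + up[1:-1, :-2]
    return d / h**2

def dr_step(u):
    # (1): (I - tau*U) u_half = u + tau*V u   (tridiagonal solves along axis 0)
    u_half = solve_banded((1, 1), band, u + tau * diff2(u, axis=1))
    # (2): (I - tau*V) u_new = u + tau*U u_half  (solves along axis 1)
    return solve_banded((1, 1), band, (u + tau * diff2(u_half, axis=0)).T).T

x = np.linspace(h, 1 - h, m)
u = np.outer(np.sin(np.pi * x), np.sin(np.pi * x))  # made-up initial heat profile
for _ in range(100):
    u = dr_step(u)
```

Each time step costs only tridiagonal solves (one family per direction) instead of one solve with the full two-dimensional operator.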
How do they prove convergence of the method? They don't, since they wanted to solve a parabolic PDE. They were after stability of the scheme, and this can be done by analyzing the eigenvalues of the iteration. Since the matrices $U$ and $V$ are well understood, they were able to write down the eigenfunctions of the operator associated with iteration (3) explicitly, and since the finite difference approximation is well understood, they were able to prove approximation properties. Note that the method can also be seen as a means to calculate the steady state of the heat equation.
We reformulate the iteration (3) further to see how $u^{n+1}$ is actually derived from $u^{n}$: We obtain

$$u^{n+1} = (I - \tau V)^{-1}\big[(I - \tau U)^{-1}(I + \tau V) - \tau V\big]u^{n}. \qquad (4)$$
2. What about monotone inclusions?
What has the previous section to do with solving monotone inclusions? A monotone inclusion is

$$0 \in Tx$$

with a monotone operator, that is, a multivalued mapping $T$ from a Hilbert space $H$ to (subsets of) itself such that for all $x, y \in H$ and $u \in Tx$ and $v \in Ty$ it holds that

$$\langle u - v, x - y\rangle \ge 0.$$
We are going to restrict ourselves to real Hilbert spaces here. Note that linear operators are monotone if they are positive semi-definite, and further note that monotone linear operators need not be symmetric. A general approach to the solution of monotone inclusions are so-called splitting methods. There one splits $T$ additively as $T = A + B$ into a sum of two other monotone operators. Then one tries to use the so-called resolvents of $A$ and $B$, namely

$$R_{A} = (I + A)^{-1},\qquad R_{B} = (I + B)^{-1},$$

to obtain a numerical method. By the way, the resolvent of a monotone operator always exists and is single valued (to be honest, one needs a regularity assumption here, namely maximal monotonicity, but we will not deal with this issue here).
The two operators $U$ and $V$ from the previous section are not monotone, but $-U$ and $-V$ are, so the steady state equation $0 = Uu + Vu$ is a special case of a monotone inclusion. To work with monotone operators we rename

$$A = -U,\qquad B = -V,\qquad w^{n} = u^{n},$$

and write the iteration (4) in terms of monotone operators as

$$\frac{w^{n+1/2} - w^{n}}{\tau} = -Aw^{n+1/2} - Bw^{n},\qquad \frac{w^{n+1} - w^{n}}{\tau} = -Aw^{n+1/2} - Bw^{n+1},$$

i.e.

$$w^{n+1} = (I + \tau B)^{-1}\big[(I + \tau A)^{-1}(I - \tau B) + \tau B\big]w^{n}.$$

Using $R_{\tau A} = (I + \tau A)^{-1}$ and $R_{\tau B} = (I + \tau B)^{-1}$ we rewrite this in terms of resolvents as

$$\begin{aligned} w^{n+1} & = (I + \tau B)^{-1}\big[(I + \tau A)^{-1}(I - \tau B) + \tau B\big]w^{n}\\ & = R_{\tau B}\big(R_{\tau A}(w^{n} - \tau Bw^{n}) + \tau Bw^{n}\big). \end{aligned}$$
This is not really applicable to a general monotone inclusion since there $A$ and $B$ may be multivalued, i.e. the term $Bw^{n}$ is not well defined (the iteration may be used as is for splittings where $B$ is monotone and single valued, though).

But what to do when both $A$ and $B$ are multivalued? The trick is to introduce a new variable $u^{n} = w^{n} + \tau Bw^{n}$, i.e. $w^{n} = R_{\tau B}(u^{n})$ and $\tau Bw^{n} = u^{n} - w^{n}$. Plugging this in throughout leads to

$$w^{n+1} = R_{\tau B}\big(R_{\tau A}(2w^{n} - u^{n}) + u^{n} - w^{n}\big).$$

We cancel the outer $R_{\tau B}$ (i.e. we pass to the variable $u^{n+1} = w^{n+1} + \tau Bw^{n+1}$) and use $w^{n} = R_{\tau B}(u^{n})$ to get

$$u^{n+1} = u^{n} + R_{\tau A}\big(2R_{\tau B}(u^{n}) - u^{n}\big) - R_{\tau B}(u^{n})$$

and here we go: This is exactly what is known as the Douglas-Rachford method (see the last version of the iteration in my previous post). Note that it is not $u^{n}$ that converges to a solution, but $w^{n} = R_{\tau B}(u^{n})$, so it is convenient to write the iteration in the two variables

$$w^{n} = R_{\tau B}(u^{n}),\qquad u^{n+1} = u^{n} + R_{\tau A}(2w^{n} - u^{n}) - w^{n}.$$
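To illustrate, here is a minimal Python sketch of this two-variable iteration for a made-up toy inclusion: $A = \partial\|\cdot\|_{1}$, whose resolvent is soft thresholding, and $Bw = w - b$, i.e. we solve $\min_{w}\|w\|_{1} + \tfrac12\|w - b\|^{2}$.

```python
import numpy as np

def R_tauA(u, tau):
    # resolvent of tau*A for A = subdifferential of ||.||_1: soft thresholding
    return np.sign(u) * np.maximum(np.abs(u) - tau, 0)

def R_tauB(u, tau, b):
    # resolvent of tau*B for Bw = w - b: solve y + tau*(y - b) = u
    return (u + tau * b) / (1 + tau)

b = np.array([2.0, 0.3, -1.5])  # made-up data
tau = 1.0
u = np.zeros_like(b)
for _ in range(100):
    w = R_tauB(u, tau, b)               # w^n = R_{tau B}(u^n)
    u = u + R_tauA(2 * w - u, tau) - w  # u^{n+1}
print(w)  # approx. [1., 0., -0.5], the solution of 0 in Aw + Bw
```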
The observation that this splitting method, which Douglas and Rachford devised for linear problems, has a much wider range of applicability is due to Lions and Mercier, and the paper is
Lions, Pierre-Louis, and Bertrand Mercier. “Splitting algorithms for the sum of two nonlinear operators.” SIAM Journal on Numerical Analysis 16.6 (1979): 964-979.
Other, much older splitting methods for linear systems, such as the Jacobi method or the Gauss-Seidel method, use different properties of the matrix, such as its diagonal or its upper and lower triangular parts, and as such do not generalize easily to the case of operators on a Hilbert space.
February 13, 2017
Assume that we are given a probability distribution $p$ on $\mathbb{R}^{d}$ and for simplicity we further assume that $p$ is continuous, i.e. $p(x)$ somehow indicates how likely $x$ is. To be more precise: the probability that $x \in E$ for some (measurable) set $E$ is $\int_{E} p(x)\,dx$. Why do I start like this? There are cases in which one knows some probability distribution, then obtains some $x$ and wants to know if this $x$ is particularly representative for this distribution. One example of this situation is a statistical inverse problem where the model is

$$y = Ax + \epsilon$$

with a linear (known) map $A$ and some error $\epsilon$. From some observed $y$, one wants to know as much as possible about the underlying $x$. Assuming that the noise $\epsilon$ has a known distribution $p_{\epsilon}$ and that one has prior information on $x$ (i.e. likely $x$ are ones where the prior distribution $p_{\mathrm{prior}}$ is large) one can model the following probability distributions:

$$p(y\mid x) = p_{\epsilon}(y - Ax),\qquad p(x\mid y) = \frac{p(y\mid x)\,p_{\mathrm{prior}}(x)}{p(y)}.$$

The latter distribution, the posterior, is what one is interested in, i.e. given that one has observed $y$, what is the probability that some $x$ (which we may generate in some way) is the true solution? So, in principle, all information we have is contained in the posterior, but how to get a grip on it?
What we know are the distributions $p(y\mid x)$ and $p_{\mathrm{prior}}(x)$, but note that $p(y)$ is not known and is usually not simple to compute. Another expression for $p(y)$ is

$$p(y) = \int p(y\mid x)\,p_{\mathrm{prior}}(x)\,dx,$$

since the right hand side in the definition of the posterior has to be a probability distribution. So to calculate $p(y)$ we have to evaluate one integral, but note that this is an integral over $\mathbb{R}^{d}$, and in the context of statistical inverse problems, this $d$ is usually not in the ten-thousands, but easily in the millions. So we see that the calculation of $p(y)$ is usually out of reach (unless all distributions are very simple). So where are we: If we get some $x$ we can calculate $p(y\mid x)$ and $p_{\mathrm{prior}}(x)$, but what would it say if, for example, $p(y\mid x)\,p_{\mathrm{prior}}(x) = 0.01$? Not much, because we do not know the normalization constant. While $0.01$ seems pretty small and hence $x$ seems somehow unlikely, it may be that $p(y) = 0.01$ as well, and then $p(x\mid y) = 1$, which would make $x$ rather likely. On the other hand, we would really like some more information about the posterior to judge about that $x$, since the posterior may have a large variance and a value of $0.01$ may be among the largest possible values, again rendering $x$ quite likely, but subject to large variance. To recap:
- We have a probability distribution $p$, but we can only evaluate values $\tilde p(x) = Z\,p(x)$ for some unknown constant $Z > 0$.
- The domain of definition of $p$ has a huge dimension.
- We want to know as much as possible about $p$, e.g. its expected value, its variance or its mode…
Note that some questions about $p$ can be answered without knowing the normalization constant, e.g. the mode $\arg\max_{x} p(x)$, i.e. the $x$ which maximizes $p$ (and hence also $\tilde p$). That may be one prime reason why MAP (maximum-a-posteriori) estimators are so widely used… One approach to get more information out of $\tilde p$ is to use sampling.
Before we come to methods that can sample from unnormalized distributions, we describe what we actually mean by sampling, and give the main building blocks.
1. Sampling from simple distributions
Sampling from a distribution $p$ means a method to generate values $x^{k}$ such that the values are distributed according to $p$. Expressed in formula: For every integrable function $f$ and sequence $(x^{k})$ of generated samples, we want that

$$\frac{1}{N}\sum_{k=1}^{N} f(x^{k}) \to \int f(x)\,p(x)\,dx\quad\text{for}\quad N\to\infty.$$
In the following we want to describe a few methods to generate samples. All the methods will build on the ability to sample from some simple basic distributions, and these are
- The uniform distribution on $[0,1]$, denoted by $\mathcal{U}([0,1])$. On a computer this is possible to a good approximation by pseudorandom number generators like the Mersenne Twister (used, e.g., in MATLAB).
- The standard normal distribution, denoted by $\mathcal{N}(0,1)$. If you generate pseudorandom normally distributed numbers, then, under the hood, the machine generates uniformly distributed numbers and cleverly transforms these to be normally distributed, e.g. with the Box-Muller method.

Note that simple scaling allows to sample from $\mathcal{U}([a,b])$ and $\mathcal{N}(\mu,\sigma^{2})$.
If we sample $x$ from some distribution $p$ we will denote this by

$$x \sim p,$$

so $x \sim \mathcal{U}([0,1])$ means that $x$ is drawn from a uniform distribution over $[0,1]$.
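In code, the scaling remark looks like this (a trivial sketch with made-up parameters):

```python
import numpy as np

rng = np.random.default_rng()

u = rng.random()              # u ~ U([0,1])
a, b = -2.0, 5.0              # made-up interval
x = a + (b - a) * u           # x ~ U([a,b])

z = rng.standard_normal()     # z ~ N(0,1)
mu, sigma = 1.0, 0.5          # made-up mean and standard deviation
y = mu + sigma * z            # y ~ N(mu, sigma^2)
```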
2. Rejection sampling
For rejection sampling from $p = \tilde p/Z$ (having access to $\tilde p$ only) we need a so-called proposal distribution $q$ from which we can sample (e.g. a uniform or a normal one) and we need to know some $c > 0$ such that

$$\tilde p(x) \le c\,q(x)$$

for all $x$. The sampling scheme goes as follows: Draw proposals from the proposal distribution and accept them based on some rule that will ensure that we end up with samples distributed according to $p$. In more detail:
- generate $x \sim q$, i.e. sample $x$ from the proposal distribution,
- calculate the acceptance rate

$$\alpha = \frac{\tilde p(x)}{c\,q(x)},$$

- generate $u \sim \mathcal{U}([0,1])$,
- accept $x$ if $u \le \alpha$, otherwise reject it.
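Here is a minimal Python sketch of this scheme; the unnormalized target $\tilde p$, the proposal $q$ and the constant $c$ are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def rejection_sample(p_tilde, q, sample_q, c, n):
    # draw n samples from p = p_tilde/Z, given p_tilde <= c*q
    samples = []
    while len(samples) < n:
        x = sample_q()                   # step 1: draw a proposal x ~ q
        alpha = p_tilde(x) / (c * q(x))  # step 2: acceptance rate
        u = rng.random()                 # step 3: u ~ U([0,1])
        if u <= alpha:                   # step 4: accept or reject
            samples.append(x)
    return np.array(samples)

# made-up example: unnormalized N(1, 0.5^2) target with an N(0, 2^2) proposal
p_tilde = lambda x: np.exp(-2.0 * (x - 1.0)**2)
q = lambda x: np.exp(-x**2 / 8.0) / (2.0 * np.sqrt(2.0 * np.pi))
sample_q = lambda: 2.0 * rng.standard_normal()
c = 6.0  # valid bound: the maximum of p_tilde/q is about 5.73

xs = rejection_sample(p_tilde, q, sample_q, c, 10000)
print(xs.mean(), xs.std())  # should be close to 1 and 0.5
```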
Proposition 1 The samples generated by rejection sampling are distributed according to $p$.

Proof: Let us denote by $\mathrm{acc}$ the event that a sample $x$, drawn from $q$, has been accepted. Further, we denote by $p(x\mid\mathrm{acc})$ the distribution of $x$, provided it has been accepted. So we aim to show that $p(x\mid\mathrm{acc}) = p(x)$.

To do so, we employ Bayes' theorem and write

$$p(x\mid\mathrm{acc}) = \frac{p(\mathrm{acc}\mid x)\,q(x)}{p(\mathrm{acc})}.$$

For the three probabilities on the right hand side we know:

- $p(\mathrm{acc}\mid x) = \tilde p(x)/(c\,q(x))$, since this is the probability that $u \le \alpha$,
- $q(x)$ is the proposal density, from which $x$ was drawn,
- $p(\mathrm{acc}) = \int p(\mathrm{acc}\mid x)\,q(x)\,dx = \int \tilde p(x)/c\,dx = Z/c$.

Together we get

$$p(x\mid\mathrm{acc}) = \frac{\tilde p(x)}{c\,q(x)}\cdot q(x)\cdot\frac{c}{Z} = \frac{\tilde p(x)}{Z} = p(x). \qquad\Box$$
The crucial step in applying rejection sampling is to find a good proposal distribution with a small constant $c$ (and also, calculating $c$ can be tricky). As an example, consider that you want to sample from the tail of a standard normal distribution, e.g. you want to obtain "normally distributed random numbers larger than $2$". Rejection sampling is pretty straightforward: You choose $q$ as standard normal, get samples $x$ from that and only accept them if $x \ge 2$. In this case you have $c = 1$, but the acceptance probability is small, namely $p(\mathrm{acc}) \approx 0.023$, which means that less than 1 out of 43 samples are accepted…
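A quick numerical check of this acceptance rate (a sketch; the exact number varies with the seed):

```python
import numpy as np

rng = np.random.default_rng(1)

# propose standard normal samples and keep only those >= 2
proposals = rng.standard_normal(1_000_000)
accepted = proposals[proposals >= 2.0]
print(accepted.size / proposals.size)  # about 0.023, i.e. roughly 1 in 43
```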
3. (Non-adaptive) Metropolis sampling
Here comes another method. This method is from the class of Markov chain methods, i.e. we will generate a sequence $(x^{k})$ that forms a Markov chain with a given transition probability. This means, we have a conditional distribution $T(x'\mid x)$ such that $x^{k+1} \sim T(\,\cdot\mid x^{k})$. Again, our goal is that the sequence is distributed according to $p$. A central result in the theory of Markov chain methods is that this is the case when the so-called (detailed) balance equation is fulfilled, i.e. if

$$T(x'\mid x)\,p(x) = T(x\mid x')\,p(x').$$
Remember that we only have access to $\tilde p$ and we don't know $Z$. Here is the method, which is called Metropolis-Hastings sampling: To begin, choose a symmetric proposal distribution $q(x'\mid x)$ (i.e. $q(x'\mid x)$ is the distribution for "go from $x$ to $x'$" and fulfills $q(x'\mid x) = q(x\mid x')$), initialize with some $x^{0}$ and set $k = 0$. Then:
- generate $x' \sim q(\,\cdot\mid x^{k})$, i.e. make a trial step from $x^{k}$,
- set

$$\alpha = \min\Big(1, \frac{\tilde p(x')}{\tilde p(x^{k})}\Big),$$

- generate $u \sim \mathcal{U}([0,1])$,
- accept $x'$ with probability $\alpha$, i.e. set $x^{k+1} = x'$ if $u \le \alpha$, increase $k$ and repeat; otherwise reject, i.e. set $x^{k+1} = x^{k}$, increase $k$ and repeat.
Note that for the acceptance probability in the second step we don't really need $p(x')$ and $p(x^{k})$, but only $\tilde p(x')$ and $\tilde p(x^{k})$, since the unknown $Z$'s cancel out in the ratio.
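Here is a minimal Python sketch of the method with a Gaussian random walk proposal; the unnormalized target and the stepsize are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def metropolis(p_tilde, x0, n, step=1.0):
    # Metropolis-Hastings with a symmetric Gaussian random walk proposal;
    # only the unnormalized density p_tilde is needed
    x, chain = x0, [x0]
    for _ in range(n):
        x_prop = x + step * rng.standard_normal()       # trial step
        alpha = min(1.0, p_tilde(x_prop) / p_tilde(x))  # acceptance probability
        if rng.random() <= alpha:
            x = x_prop                                  # accept; otherwise stay
        chain.append(x)
    return np.array(chain)

# made-up target: unnormalized mixture of two Gaussians on the real line
p_tilde = lambda x: np.exp(-(x - 2.0)**2 / 2) + np.exp(-(x + 2.0)**2 / 2)
chain = metropolis(p_tilde, x0=0.0, n=50_000, step=1.0)
print(chain.mean())  # close to 0 by symmetry of the target
```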
Proposition 2 The samples generated by Metropolis-Hastings sampling are distributed according to $p$.

Proof: We first calculate the transition probability of the method. The probability to go from $x$ to the proposed $x'$ is "$q(x'\mid x)$ multiplied by the acceptance rate $\alpha(x',x)$", and the probability to stay at $x$ is one minus the probability to move anywhere at all. In formula: We write the acceptance rate as

$$\alpha(x',x) = \min\Big(1, \frac{\tilde p(x')}{\tilde p(x)}\Big)$$

and can write the transition probability as

$$T(x'\mid x) = q(x'\mid x)\,\alpha(x',x) + \delta_{x}(x')\,r(x)$$

with

$$r(x) = 1 - \int q(x'\mid x)\,\alpha(x',x)\,dx'.$$

We aim to show the detailed balance equation $T(x'\mid x)\,p(x) = T(x\mid x')\,p(x')$. First, by a simple distinction of cases, we get

$$\alpha(x',x)\,\tilde p(x) = \min\big(\tilde p(x), \tilde p(x')\big) = \alpha(x,x')\,\tilde p(x').$$

This gives $\alpha(x',x)\,p(x) = \alpha(x,x')\,p(x')$, and since $q$ is symmetric, we get

$$q(x'\mid x)\,\alpha(x',x)\,p(x) = q(x\mid x')\,\alpha(x,x')\,p(x').$$

Then we note that

$$\delta_{x}(x')\,r(x)\,p(x) = \delta_{x'}(x)\,r(x')\,p(x')$$

(which can be seen by integrating both sides against some test function in $x$ and $x'$). Putting things together, we get

$$T(x'\mid x)\,p(x) = T(x\mid x')\,p(x')$$

as desired. $\Box$
Note that Metropolis-Hastings sampling is fairly easy to implement as soon as it is simple to get samples from the proposal distribution. A quick way is to use the standard normal distribution as proposal distribution. In this case, Metropolis-Hastings performs a "restricted random walk": if we would accept every step, the sequence would be a random walk; however, we allow for the possibility to stop the walk in the cases where it would lead us to a region of lower probability, while in the cases where it leads to a point of higher probability, we always follow the random walk. Although this approach is simple to implement, it may take a lot of time for the chain to produce something reasonable. The reasons are that the random walk steps may be rejected quite often, that it may be a long way from the initialization to where "most of the probability is", or that the distribution $p$ has several modes separated by valleys, so that the random walk has difficulties getting from one mode to the other (again due to frequent rejection).

Also note that one can use Gaussian distributions with any variance as proposal distributions and that the variance acts like some stepsize. As often with stepsizes, there is a trade-off: smaller stepsizes mean that the sampler will need more time to explore the distribution, while larger stepsizes allow faster exploration but also lead to more frequent rejection.
February 10, 2017
Consider a convex optimization problem of the form

$$\min_{x}\; F(x) + G(Ax)$$

with convex $F$ and $G$ and a matrix $A$. (We formulate everything quite loosely, skipping over details like continuity and such, as they are irrelevant for the subject matter.) Optimization problems of this type have a specific type of dual problem, namely the Fenchel-Rockafellar dual, which is

$$\max_{y}\; -F^{*}(-A^{T}y) - G^{*}(y),$$

and under certain regularity conditions it holds that the optimal value of the dual equals the optimal value of the primal and, moreover, that a pair $(x^{*},y^{*})$ is both primal and dual optimal if and only if the primal dual gap is zero, i.e. if and only if

$$F(x^{*}) + G(Ax^{*}) + F^{*}(-A^{T}y^{*}) + G^{*}(y^{*}) = 0.$$
Hence, it is quite handy to use the primal dual gap as a stopping criterion for iterative methods to solve these problems. So, if one runs an algorithm which produces primal iterates $x^{k}$ and dual iterates $y^{k}$, one can monitor

$$\mathcal{G}(x^{k},y^{k}) = F(x^{k}) + G(Ax^{k}) + F^{*}(-A^{T}y^{k}) + G^{*}(y^{k})$$

and stop if the value falls below a desired tolerance.
There is a problem with this approach which appears if the method produces infeasible iterates, in the sense that one of the four terms in $\mathcal{G}(x^{k},y^{k})$ is actually $+\infty$. This may be the case if $F$ or $G$ are not everywhere finite or, loosely speaking, have linear growth in some directions (since then the respective conjugate will not be finite everywhere). In the rest of the post, I'll sketch a general method that can often solve this particular problem.
For the sake of simplicity, consider the following primal dual algorithm

$$x^{k+1} = \mathrm{prox}_{\tau F}\big(x^{k} - \tau A^{T}y^{k}\big),\qquad y^{k+1} = \mathrm{prox}_{\sigma G^{*}}\big(y^{k} + \sigma A(2x^{k+1} - x^{k})\big)$$

(also known as primal dual hybrid gradient method or Chambolle-Pock's algorithm). It converges as soon as $\sigma\tau\|A\|^{2} < 1$.
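For illustration, here is a minimal Python sketch of this iteration on a made-up one-dimensional toy instance with $F(x) = \|x - b\|_{1}$, $G = \lambda\|\cdot\|_{1}$ and $A$ a finite difference matrix, so that $\mathrm{prox}_{\tau F}$ is a shifted soft thresholding and $\mathrm{prox}_{\sigma G^{*}}$ is the projection onto $\{\|y\|_{\infty}\le\lambda\}$:

```python
import numpy as np

def soft(x, t):
    # soft thresholding, the prox of t*||.||_1
    return np.sign(x) * np.maximum(np.abs(x) - t, 0)

rng = np.random.default_rng(0)
n = 100
b = np.concatenate([np.zeros(n // 2), np.ones(n // 2)])
b += 0.1 * rng.standard_normal(n)   # made-up noisy step signal
lam = 0.5

D = np.diff(np.eye(n), axis=0)      # difference matrix, ||D||^2 <= 4
tau = sigma = 0.49                  # so that sigma*tau*||D||^2 < 1

x, y = b.copy(), np.zeros(n - 1)
for _ in range(500):
    x_new = b + soft(x - tau * (D.T @ y) - b, tau)             # prox_{tau F}
    y = np.clip(y + sigma * (D @ (2 * x_new - x)), -lam, lam)  # prox_{sigma G*}
    x = x_new
# x is now approximately a piecewise constant, denoised version of b
```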
While the structure of the algorithm ensures that $F(x^{k})$ and $G^{*}(y^{k})$ are always finite (since the output of a prox operator always lies in the domain of the respective function), it may be that $G(Ax^{k})$ or $F^{*}(-A^{T}y^{k})$ are indeed infinite, rendering the primal dual gap useless.
Let us assume that the problematic term is $F^{*}(-A^{T}y^{k})$. Here is a way out in the case where one can deduce some a-priori bounds on the solution $x^{*}$, i.e. a bounded and convex set $B$ with $x^{*} \in B$. In fact, this is often the case (e.g. one may know a-priori that there exist lower bounds $l$ and upper bounds $u$, i.e. it holds that $l \le x^{*} \le u$ componentwise). Then, adding these constraints to the problem will not change the solution.
Let us see how this changes the primal dual gap: We set $\tilde F = F + I_{B}$, where $B$ is the set which models the bound constraints and $I_{B}$ is its indicator function. Since $B$ is a bounded convex set and $F$ is finite on $B$, it is clear that

$$\tilde F^{*}(\xi) = \sup_{x\in B}\,\langle \xi, x\rangle - F(x)$$

is finite for every $\xi$. This leads to a finite duality gap.
. This leads to a finite duality gap. However, one should also adapt the prox operator. But this is also simple in the case where the constraint
and the function
are separable, i.e.
encodes bound constraints as above (in other words
) and

Here it holds that
![\displaystyle \begin{array}{rcl} \mathrm{prox}_{\sigma \tilde F}(x)_{i} = \mathrm{prox}_{\sigma f_{i} + I_{[l_{i},u_{i}]}}(x_{i}) \end{array} \displaystyle \begin{array}{rcl} \mathrm{prox}_{\sigma \tilde F}(x)_{i} = \mathrm{prox}_{\sigma f_{i} + I_{[l_{i},u_{i}]}}(x_{i}) \end{array}](https://s0.wp.com/latex.php?latex=%5Cdisplaystyle+%5Cbegin%7Barray%7D%7Brcl%7D+%5Cmathrm%7Bprox%7D_%7B%5Csigma+%5Ctilde+F%7D%28x%29_%7Bi%7D+%3D+%5Cmathrm%7Bprox%7D_%7B%5Csigma+f_%7Bi%7D+%2B+I_%7B%5Bl_%7Bi%7D%2Cu_%7Bi%7D%5D%7D%7D%28x_%7Bi%7D%29+%5Cend%7Barray%7D+&bg=ffffff&fg=000000&s=0&c=20201002)
and it is simple to see that
![\displaystyle \begin{array}{rcl} \mathrm{prox}_{\sigma f_{i} + I_{[l_{i},u_{i}]}}(x_{i}) = \mathrm{proj}_{[l_{i},u_{i}]}\mathrm{prox}_{\tau f_{i}}(x_{i}), \end{array} \displaystyle \begin{array}{rcl} \mathrm{prox}_{\sigma f_{i} + I_{[l_{i},u_{i}]}}(x_{i}) = \mathrm{proj}_{[l_{i},u_{i}]}\mathrm{prox}_{\tau f_{i}}(x_{i}), \end{array}](https://s0.wp.com/latex.php?latex=%5Cdisplaystyle+%5Cbegin%7Barray%7D%7Brcl%7D+%5Cmathrm%7Bprox%7D_%7B%5Csigma+f_%7Bi%7D+%2B+I_%7B%5Bl_%7Bi%7D%2Cu_%7Bi%7D%5D%7D%7D%28x_%7Bi%7D%29+%3D+%5Cmathrm%7Bproj%7D_%7B%5Bl_%7Bi%7D%2Cu_%7Bi%7D%5D%7D%5Cmathrm%7Bprox%7D_%7B%5Ctau+f_%7Bi%7D%7D%28x_%7Bi%7D%29%2C+%5Cend%7Barray%7D+&bg=ffffff&fg=000000&s=0&c=20201002)
i.e., only uses the proximal operator of
and project onto the constraints. For general
, this step may be more complicated.
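In code, the adapted prox is a one-liner on top of the unconstrained one (a sketch; soft thresholding stands in as a made-up example for $\mathrm{prox}_{\sigma f_{i}}$):

```python
import numpy as np

# prox of sigma*f_i + indicator of [l_i, u_i] for separable F:
# apply the unconstrained prox componentwise, then project onto the box
def prox_box(x, sigma, prox_f, l, u):
    return np.clip(prox_f(x, sigma), l, u)

# made-up example: f_i = |.|, whose prox is componentwise soft thresholding
prox_f = lambda x, s: np.sign(x) * np.maximum(np.abs(x) - s, 0)
x = np.array([-3.0, 0.2, 2.5])
print(prox_box(x, 0.5, prox_f, l=-1.0, u=1.0))  # [-1.  0.  1.]
```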
One example where this makes sense is $TV$-$L^{1}$ denoising, which can be written as

$$\min_{u}\; \|u - u^{0}\|_{1} + \lambda\|\nabla u\|_{1}.$$

Here we have

$$F(u) = \|u - u^{0}\|_{1},\qquad G(\phi) = \lambda\|\phi\|_{1},\qquad A = \nabla.$$

The guy that causes problems here is

$$F^{*}(\xi) = \langle \xi, u^{0}\rangle + I_{\{\|\cdot\|_{\infty}\le 1\}}(\xi),$$

which contains an indicator functional, and indeed $-A^{T}y^{k}$ will usually be dual infeasible. But since $u^{0}$ is an image with a known range of gray values, one can simply add the constraints $0 \le u \le 1$ to the problem and obtains a finite dual while still keeping a simple proximal operator. It is quite instructive to compute $\tilde F^{*}$ in this case.