Signal and image processing


ISMP LogoISMP is over now and I’m already home. I do not have many things to report on from the last day. This is not due the lower quality of the talks but due to the fact that I was a little bit exhausted, as usual at the end of a five-day conference. However, I collect a few things for the record:

  • In the morning I visited the semi-planary by Xiaojun Chenon non-convex and non-smooth minimization with smoothing methods. Not surprisingly, she treated the problem

    \displaystyle \min_x f(x) + \|x\|_p^p

    with convex and smooth {f:{\mathbb R}^n\rightarrow{\mathbb R}} and {0<p<1}. She proposed and analyzed smoothing methods, that is, to smooth the problem a bit to obtain a Lipschitz-continuous objective function {\phi_\epsilon}, minimizing this and then gradually decreasing {\epsilon}. This works, as she showed. If I remember correctly, she also treated “iteratively reweighted least squares” as I described in my previous post. Unfortunately, she did not include the generalized forward-backward methods based on {\text{prox}}-functions for non-convex functions. Kristian and I pursued this approach in our paper Minimization of non-smooth, non-convex functionals by iterative thresholding and some special features of our analysis include:

    • A condition which excludes some (but not all) local minimizers from being global.
    • An algorithm which avoids this non-global minimizers by carefully adjusting the steplength of the method.
    • A result that the number of local minimizers is still finite, even if the problem is posed in {\ell^2({\mathbb N})} and not in {{\mathbb R}^n}.

    Most of our results hold true, if the {p}-quasi-norm is replaced by functions of the form

    \displaystyle \sum_n \phi_n(|x_n|)

    with special non-convex {\phi}, namely fulfilling a list of assumptions like

    • {\phi'(x) \rightarrow \infty} for {x\rightarrow 0} (infinite slope at {0}) and {\phi(x)\rightarrow\infty} for {x\rightarrow\infty} (mild coercivity),
    • {\phi'} strictly convex on {]0,\infty[} and {\phi'(x)/x\rightarrow 0} for {x\rightarrow\infty},
    • for each {b>0} there is {a>0} such that for {x<b} it holds that {\phi(x)>ax^2}, and
    • local integrability of some section of {\partial\phi'(x) x}.

    As one easily sees, {p}-quasi-norms fulfill the assumptions and some other interesting functions as well (e.g. some with very steep slope at {0} like {x\mapsto \log(x^{1/3}+1)}).

  • Jorge Nocedalgave a talk on second-order methods for non-smooth problems and his main example was a functional like

    \displaystyle \min_x f(x) + \|x\|_1

    with a convex and smooth {f}, but different from Xiaojun Chen, he only considered the {1}-norm. His talked is among the best planary talks I have ever attended and it was a great pleasure to listen to him. He carefully explained things and put them in perspective. In the case he skipped slides, he made me feel that I either did not miss an important thing, or understood them even though he didn’t show them He argued that it is not necessarily more expensive to use second order information in contrast to first order methods. Indeed, the {1}-norm can be used to reduce the number of degrees of freedom for a second order step. What was pretty interesting is, that he advocated semismooth Newton methods for this problem. Roland and I pursued this approach some time ago in our paper A Semismooth Newton Method for Tikhonov Functionals with Sparsity Constraints and, if I remember correctly (my notes are not complete at this point), his family of methods included our ssn-method. The method Roland and I proposed worked amazingly well in the cases in which it converged but the method suffered from non-global convergence. We had some preliminary ideas for globalization, which we could not tune enough to retain the speed of the method, and abandoned the topic. Now, that the topic will most probably be revived by the community, I am looking forward to fresh ideas here.

Today there are several things I could blog on. The first is the planary by Rich Baraniuk on Compressed Sensing. However, I don’t think that I could reflect the content in a way which would be helpful for a potential reader. Just for the record: If you have the chance to visit one of Rich’s talk: Do it!

The second thing is the talk by Bernd Hofmann on source conditions, smoothness and variational inequalities and their use in regularization of inverse problems. However, this would be too technical for now and I just did not take enough notes to write a meaningful post.

As a third thing I have the talk by Christian Clason on inverse problems with uniformly distributed noise. He argued that for uniform noise it is much better to use an {L^\infty} discrepancy term instead of the usual {L^2}-one. He presented a path-following semismooth Newton method to solve the problem

\displaystyle \min_x \frac{1}{p}\|Kx-y^\delta\|_\infty^p + \frac{\alpha}{2}\|x\|_2^2

and showed examples with different kinds of noise. Indeed the examples showed that {L^\infty} works much better than {L^2} here. But in fact it works even better, if the noise is not uniformly distributed but “impulsive” i.e. it attains bounds {\pm\delta} almost everywhere. It seems to me that uniform noise would need a slightly different penalty but I don’t know which one – probably you do? Moreover, Christian presented the balancing principle to choose the regularization parameter (without knowledge about the noise level) and this was the first time I really got what it’s about. What one does here is, to choose {\alpha} such that (for some {\sigma>0} which only depends on {K}, but not on the noise)

\displaystyle \sigma\|Kx_\alpha^\delta-y^\delta\|_\infty = \frac{\alpha}{2}\|x_\alpha^\delta\|_2^2.

The rational behind this is, that the left hand side is monotonically non-decreasing in {\alpha}, while the right hand side is monotonically non-increasing. Hence, there should be some {\alpha} “in the middle” which make both somewhat equally large. Of course, we do neither want to “over-regularize” (which would usually “smooth too much”) nor to “under-regularize” (which would not eliminate noise). Hence, balancing seems to be a valid choice. From a practical point of view the balancing is also nice because one can use the fixed-point iteration

\displaystyle \alpha^{n+1} = 2\sigma\frac{\|Kx_{\alpha^n}^\delta - y^\delta\|_\infty}{\|x_{\alpha_n}^\delta\|_2^2}

which converges in a few number of iterations.

Then there was the talk by Esther Klann, but unfortunately, I was late so only heard the last half…

Last but not least we have the talk by Christiane Pöschl. If you are interested in Total-Variation-Denoising (TV denoising), then you probably have heard many times that “TV denoising preserves edges” (have a look at the Wikipedia page – it claims this twice). What Christiane showed (in a work with Vicent Caselles and M. Novaga) that this claim is not true in general but only for very special cases. In case of characteristic functions, the only functions for which the TV minimizer has sharp edges are these so-called calibrated sets, introduced by Caselles et el. Building on earlier works by Caselles and co-workers she calculated exact minimizers for TV denoising in the case that the image consists of characteristic functions of two convex sets or of a single star shaped domain, that is, for a given set B she calculated the solution of

\displaystyle \min_u\int (u - \chi_B)^2dx + \lambda \int|Du|.

This is not is as easy as it may sound. Even for the minimizer for a single convex set one has to make some effort. She presented a nice connection of the shape of the obtained level-sets with the morphological operators of closing and opening. With the help of this link she derived a methodology to obtain the exact TV denoising minimizer for all parameters. I do not have the images right now but be assured that most of the time, the minimizers do not have sharp edges all over the place. Even for simple geometries (like two rectangles touching in a corner) strange things happen and only very few sharp edges appear. I’ll keep you posted in case the paper comes out (or appears as a preprint).

Christiane has some nice images which make this much more clear:

For two circles edges are preserved if they are far enough away from each other. If they are close, the area “in between” them is filled and, moreover, obey this fuzzy boundary. I remember myself seeing effects like this in the output of TV-solvers and thinking “well, it seems that the algorithm is either not good or not converged yet – TV should output sharp edges!”.

 

For a star-shaped shape (well, actually a star) the output looks like this. The corners are not only rounded but also blurred and this is true both for the “outer” corners and the “inner” corners.

 

So, if you have any TV-minimizing code, go ahead and check if your code actually does the right things on images like this!
Moreover, I would love to see similar results for more complicated extensions of TV like Total Generalized Variation, I treated here.

 

 

 

 

The scientific program at ISMP started today and I planned to write a small personal summary of each day. However, it is a very intense meeting. Lot’s of excellent talks, lot’s of people to meet and little spare time. So I’m afraid that I have to deviate from my plan a little bit. Instead of a summary of every day I just pick out a few events. I remark that these picks do not reflect quality, significance or something like this in any way. I just pick things for which I have something to record for personal reasons.

My day started after the first plenary which the session Testing environments for machine learning and compressed sensing in which my own talk was located. The session started with the talk by Michael Friedlander of the SPOT toolbox. Haven’t heard of SPOT yet? Take a look! In a nutshell its a toolbox which turns MATLAB into “OPLAB”, i.e. it allows to treat abstract linear operators like matrices. By the way, the code is on github.

The second talk was by Katya Scheinberg (who is giving a semi-planary talk on derivative free optimization at the moment…). She talked about speeding up FISTA by cleverly adjusting step-sizes and over-relaxation parameters and generalizing these ideas to other methods like alternating direction methods. Notably, she used the “SPEAR test instances” from our project homepage! (And credited them as “surprisingly hard sparsity problems”.)

My own talk was the third and last one in that session. I talked about the issue of constructing test instance for Basis Pursuit Denoising. I argued that the naive approach (which takes a matrix {A}, a right hand side {b} and a parameter {\lambda} and let some great solver run for a while to obtain a solution {x^*}) may suffer from “trusted method bias”. I proposed to use “reverse instance construction” which is: First choose {A}, {\lambda} and the solution {x^*} and the construct the right hand side {b} (I blogged on this before here).

Last but not least, I’d like to mention the talk by Thomas Pock: He talked about parameter selection on variational models (think of the regularization parameter in Tikhonov, for example). In a paper with Karl Kunisch titled A bilevel optimization approach for parameter learning in variational models they formulated this as a bi-level optimization problem. An approach which seemed to have been overdue! Although they treat somehow simple inverse problems (well, denoising) (but with not so easy regularizers) it is a promising first step in this direction.

In this post I just collect a few papers that caught my attention in the last moth.

I begin with Estimating Unknown Sparsity in Compressed Sensing by Miles E. Lopes. The abstract reads

Within the framework of compressed sensing, many theoretical guarantees for signal reconstruction require that the number of linear measurements {n} exceed the sparsity {\|x\|_0} of the unknown signal {x\in\mathbb{R}^p}. However, if the sparsity {\|x\|_0} is unknown, the choice of {n} remains problematic. This paper considers the problem of estimating the unknown degree of sparsity of {x} with only a small number of linear measurements. Although we show that estimation of {\|x\|_0} is generally intractable in this framework, we consider an alternative measure of sparsity {s(x):=\frac{\|x\|_1^2}{\|x\|_2^2}}, which is a sharp lower bound on {\|x\|_0}, and is more amenable to estimation. When {x} is a non-negative vector, we propose a computationally efficient estimator {\hat{s}(x)}, and use non-asymptotic methods to bound the relative error of {\hat{s}(x)} in terms of a finite number of measurements. Remarkably, the quality of estimation is dimension-free, which ensures that {\hat{s}(x)} is well-suited to the high-dimensional regime where {n<<p}. These results also extend naturally to the problem of using linear measurements to estimate the rank of a positive semi-definite matrix, or the sparsity of a non-negative matrix. Finally, we show that if no structural assumption (such as non-negativity) is made on the signal {x}, then the quantity {s(x)} cannot generally be estimated when {n<<p}.

It’s a nice combination of the observation that the quotient {s(x)} is a sharp lower bound for {\|x\|_0} and that it is possible to estimate the one-norm and the two norm of a vector {x} (with additional properties) from carefully chosen measurements. For a non-negative vector {x} you just measure with the constant-one vector which (in a noisy environment) gives you an estimate of {\|x\|_1}. Similarly, measuring with Gaussian random vector you can obtain an estimate of {\|x\|_2}.

Then there is the dissertation of Dustin Mixon on the arxiv: Sparse Signal Processing with Frame Theory which is well worth reading but too long to provide a short overview. Here is the abstract:

Many emerging applications involve sparse signals, and their processing is a subject of active research. We desire a large class of sensing matrices which allow the user to discern important properties of the measured sparse signal. Of particular interest are matrices with the restricted isometry property (RIP). RIP matrices are known to enable efficient and stable reconstruction of sfficiently sparse signals, but the deterministic construction of such matrices has proven very dfficult. In this thesis, we discuss this matrix design problem in the context of a growing field of study known as frame theory. In the first two chapters, we build large families of equiangular tight frames and full spark frames, and we discuss their relationship to RIP matrices as well as their utility in other aspects of sparse signal processing. In Chapter 3, we pave the road to deterministic RIP matrices, evaluating various techniques to demonstrate RIP, and making interesting connections with graph theory and number theory. We conclude in Chapter 4 with a coherence-based alternative to RIP, which provides near-optimal probabilistic guarantees for various aspects of sparse signal processing while at the same time admitting a whole host of deterministic constructions.

By the way, the thesis is dedicated “To all those who never dedicated a dissertation to themselves.”

Further we have Proximal Newton-type Methods for Minimizing Convex Objective Functions in Composite Form by Jason D Lee, Yuekai Sun, Michael A. Saunders. This paper extends the well explored first order methods for problem of the type {\min g(x) + h(x)} with Lipschitz-differentiable {g} or simple {\mathrm{prox}_h} to second order Newton-type methods. The abstract reads

We consider minimizing convex objective functions in composite form

\displaystyle \min_{x\in\mathbb{R}^n} f(x) := g(x) + h(x)

where {g} is convex and twice-continuously differentiable and {h:\mathbb{R}^n\rightarrow\mathbb{R}} is a convex but not necessarily differentiable function whose proximal mapping can be evaluated efficiently. We derive a generalization of Newton-type methods to handle such convex but nonsmooth objective functions. Many problems of relevance in high-dimensional statistics, machine learning, and signal processing can be formulated in composite form. We prove such methods are globally convergent to a minimizer and achieve quadratic rates of convergence in the vicinity of a unique minimizer. We also demonstrate the performance of such methods using problems of relevance in machine learning and high-dimensional statistics.

With this post I say goodbye for a few weeks of holiday.

Today I would like to comment on two arxiv-preprints I stumbled upon:

1. “Augmented L1 and Nuclear-Norm Models with a Globally Linearly Convergent Algorithm” – The Elastic Net rediscovered

The paper “Augmented L1 and Nuclear-Norm Models with a Globally Linearly Convergent Algorithm” by Ming-Jun Lai and Wotao Yin is another contribution to a field which is (or was?) probably the fastest growing field in applied mathematics: Algorithms for convex problems with non-smooth {\ell^1}-like terms. The “mother problem” here is as follows: Consider a matrix {A\in{\mathbb R}^{m\times n}}, {b\in{\mathbb R}^m} try to find a solution of

\displaystyle  \min_{x\in{\mathbb R}^n}\|x\|_1\quad\text{s.t.}\quad Ax=b

or, for {\sigma>0}

\displaystyle  \min_{x\in{\mathbb R}^n}\|x\|_1\quad\text{s.t.}\quad \|Ax-b\|\leq\sigma

which appeared here on this blog previously. Although this is a convex problem and even has a reformulation as linear program, some instances of this problem are notoriously hard to solve and gained a lot of attention (because their applicability in sparse recovery and compressed sensing). Very roughly speaking, a part of its hardness comes from the fact that the problem is neither smooth nor strictly convex.

The contribution of Lai and Yin is that they analyze a slight perturbation of the problem which makes its solution much easier: They add another term in the objective; for {\alpha>0} they consider

\displaystyle  \min_{x\in{\mathbb R}^n}\|x\|_1 + \frac{1}{2\alpha}\|x\|_2^2\quad\text{s.t.}\quad Ax=b

or

\displaystyle  \min_{x\in{\mathbb R}^n}\|x\|_1 + \frac{1}{2\alpha}\|x\|_2^2\quad\text{s.t.}\quad \|Ax-b\|\leq\sigma.

This perturbation does not make the problem smooth but renders it strongly convex (which usually makes the dual more smooth). It turns out that this perturbation makes life with this problem (and related ones) much easier – recovery guarantees still exists and algorithms behave better.

I think it is important to note that the “augmentation” of the {\ell^1} objective with an additional squared {\ell^2}-term goes back to Zou and Hastie from the statistics community. There, the motivation was as follows: They observed that the pure {\ell^1} objective tends to “overpromote” sparsity in the sense that if there are two columns in {A} which are almost equally good in explaining some component of {b} then only one of them is used. The “augmented problem”, however, tends to use both of them. They coined the method as “elastic net” (for reasons which I never really got).

I also worked on elastic-net problems for problems in the form

\displaystyle  \min_x \frac{1}{2}\|Ax-b\|^2 + \alpha\|x\|_1 + \frac{\beta}{2}\|x\|_2^2

in this paper (doi-link). Here it also turns out that the problem gets much easier algorithmically. I found it very convenient to rewrite the elastic-net problem as

\displaystyle  \min_x \frac{1}{2}\|\begin{bmatrix}A\\ \sqrt{\beta} I\end{bmatrix}x-\begin{bmatrix}b\\ 0\end{bmatrix}\|^2 + \alpha\|x\|_1

which turns the elastic-net problem into just another {\ell^1}-penalized problem with a special matrix and right hand side. Quite convenient for analysis and also somehow algorithmically.

2. Towards a Mathematical Theory of Super-Resolution

The second preprint is “Towards a Mathematical Theory of Super-Resolution” by Emmanuel Candes and Carlos Fernandez-Granda.

The idea of super-resolution seems to pretty old and, very roughly speaking, is to extract a higher resolution of a measured quantity (e.g. an image) than the measured data allows. Of course, in this formulation this is impossible. But often one can gain something by additional knowledge of the image. Basically, this also is the idea behind compressed sensing and hence, it does not come as a surprise that the results in compressed sensing are used to try to explain when super-resolution is possible.

The paper by Candes and Fernandez-Granada seems to be pretty close in spirit to Exact Reconstruction using Support Pursuit on which I blogged earlier. They model the sparse signal as a Radon measure, especially as a sum of Diracs. However, different from the support-pursuit-paper they use complex exponentials (in contrast to real polynomials). Their reconstruction method is basically the same as support pursuit: The try to solve

\displaystyle   \min_{x\in\mathcal{M}} \|x\|\quad\text{s.t.}\quad Fx=y, \ \ \ \ \ (1)

i.e. they minimize over the set of Radon measures {\mathcal{M}} under the constraint that certain measurements {Fx\in{\mathbb R}^n} result in certain given values {y}. Moreover, they make a thorough analysis of what is “reconstructable” by their ansatz and obtain a lower bound on the distance of two Diracs (in other words, a lower bound in the Prokhorov distance). I have to admit that I do not share one of their claims from the abstract: “We show that one can super-resolve these point sources with infinite precision—i.e. recover the exact locations and amplitudes—by solving a simple convex program.” My point is that I can not see to what extend the problem (1) is a simple one. Well, it is convex, but it does not seem to be simple.

I want to add that the idea of “continuous sparse modelling” in the space of signed measures is very appealing to me and appeared first in Inverse problems in spaces of measures by Kristian Bredies and Hanna Pikkarainen.

Recently the recipients of Sloan Fellowships for 2012 has been announced. This is a kind grant/price awarded to young scientists, usually people who are assistant professors on the tenure track, from the U.S. and Canada. While the actual award is not exorbitant (but still large) the Sloan Fellowships seem to be a good indicator for further success in the career and, more importantly, for awaited breakthroughs by the fellows.

This year there are 20 mathematicians among the recipients and it appeared that I knew three of them:

  • Rachel Ward (UTexas at Austin), a student of Ingrid Daubechies, has written a very nice PhD Thesis “Freedom through imperfection” on signal processing based on redundancy. Among several interesting works she has a very recent preprint Stable image reconstruction using total variation minimization which I plan to cover in a future post (according to the abstract the paper provides rigorous theory for exact recovery with discrete total variation which is cool).
  • Ben Recht (University of Wisconsin, Madison) also works in the field of signal processing (among other fields) and is especially known for his work on exact matrix completion with compressed sensing techniques (together with Emmanuel Candes); I also liked his paper “The Convex Geometry of Linear Inverse Problems” (with Venkat Chandrasekaran, Pablo A. Parrilo, and Alan Willsky) with elaborated on the generalization of {\ell^1}-minimization and nuclear norm minimization.
  • Greg Blekherman (Georgia Tech) works (among other things) on algebraic geometry and especially on the problem which non-negative polynomials (in more the one variable) can be written as sums of squares of polynomials, a question which is related to one of Hilbert’s 23 problems, namely Hilbert’s seventeenth problem. An interesting thing about non-negative polynomials and sums of squares is that: 1. Checking if a polynomial is non-negative is NP-hard. 2. Checking is a polynomial is a sum of squares can be done fast. 3. It seem that “most” non-negative polynomials are actually sums of squares but it unclear “how many of them”, see Greg’s chapter of the forthcoming book “Semidefinite Optimization and Convex Algebraic Geometry” here.

Congratulations!

Today I write about sparse recovery of “multidimensional” signal. With “multidimensional” I mean something like this: A one-dimensional signal is a vector {x\in{\mathbb R}^n} while a two-dimensional signal is a matrix {x\in{\mathbb R}^{n_1\times n_2}}. Similarly, {d}-dimensional signal is {x\in{\mathbb R}^{n_1\times\cdots\times n_d}}. Of course, images as two-dimensional signals, come time mind. Moreover, movies are three-dimensional, a hyperspectral 2D image (which has a whole spectrum attached to any pixel) is also three-dimensional), and time-dependent volume-data is four-dimensional.

Multidimensional data is often a challenge due the large amount of data. While the size of the signals is usually not the problem it is more the size of the measurement matrices. In the context of compressed sensing or sparse recovery the signal is measured with a linear operator, i.e. one applies a number {m} of linear functionals to the signal. In the {d}-dimensional case this can be encoded as a matrix {A\in {\mathbb R}^{m\times \prod_1^d n_i}} and this is where the trouble with the data comes in: If you have megapixel image (which is still quite small) the matrix has a million of columns and if you have a dense matrix, storage becomes an issue.

One approach (which is indeed quite old) to tackle this problem is, to consider special measurement matrices: If the signal has a sparse structure is every slice, i.e. every vectors of the form {x(i_1,\dots,i_{k-1},:,i_{k+1},\dots,i_d)} where we fix all but the {k}-th component, then the Kronecker product of measurement matrices for each dimension is the right thing.

The Kronecker product of two matrices {A\in{\mathbb R}^{m\times n}} and {B\in{\mathbb R}^{k\times j}} is the {mk\times nj} matrix

\displaystyle  A\otimes B = \begin{bmatrix} a_{11}B & \dots & a_{1n}B\\ \vdots & & \vdots\\ a_{n1}B & \dots & a_{nn}B \end{bmatrix}.

This has a lot to do with the tensor product and you should read the Wikipedia entry. Moreover, it is numerically advantageous not to build the Kronecker product of dense matrices if you only want to apply it to a given signal. To see this, we introduce the vectorization operator {\text{vec}:{\mathbb R}^{m\times n}\rightarrow{\mathbb R}^{nm}} which takes a matrix {X} and stacks its columns into a tall column vector. For matrices {A} and {B} (of fitting sizes) it holds that

\displaystyle  (B^T\otimes A)\text{vec}(X) = \text{vec}(AXB).

So, multiplying {X} from the left and from the right gives the application of the Kronecker product.

The use of Kronecker products in numerical linear algebra is fairly old (for example they are helpful for multidimensional finite difference schemes where you can build Kronecker products of sparse difference operators in respective dimensions). Recently, they have been discovered for compressed sensing and sparse recovery in these two papers: Sparse solutions to underdetermined Kronecker product systems by Sadegh Jokar and Volker Mehrmann and the more recent Kronecker Compressed Sensing by Marco Duarte and Rich Baraniuk.

From these papers you can extract some interestingly simple and nice theorems:

Theorem 1 For matrices {A_1,\dots A_d} with restricted isometry constant {\delta_K(A_1),\dots,\delta_K(A_d)} of order {K} it holds that the restricted isometry constant of the Kronecker product fulfills

\displaystyle  \max_i \delta_K(A_i) \leq \delta_K(A_1\otimes\cdots\otimes A_d) \leq \prod_1^d (1+\delta_K(A_i))-1.

Basically, the RIP constant of a Kronecker product is not better than the worst one but still not too large.

Theorem 2 For matrices {A_1,\dots, A_d} with columns normalized to one, it hold that the spark of their Kronecker product fulfills

\displaystyle  \text{spark}(A\otimes \dots\otimes A_d) = = \min_i\text{spark}(A_i).

Theorem 3 For matrices {A_1,\dots, A_d} with columns normalized to one, it hold that the mutual coherence of their Kronecker product fulfills

\displaystyle  \mu(A_1\otimes\dots\otimes A_d) = \max_i \mu(A_i).

In a recent post I wrote about time varying channels. These were operators, described by the spreading function {s:{\mathbb R}\times{\mathbb R}\rightarrow{\mathbb C}}, mapping an input signal {x} to an output signal {y} via

\displaystyle y(t) = H_s x(t) = \int_{{\mathbb R}}\int_{{\mathbb R}} s(\tau,\nu)x(t-\tau)\mathrm{e}^{\mathrm{i}\nu t}d\nu d\tau.

1. The problem statement

In this post I write about the problem of identifying the channels from input-output measurements. More precisely, the question is as follows:

Is it possible to choose a single input signal such that the corresponding output signal allows the computation of the spreading function?

At first sight this seems hopeless: The degrees of freedom we have is a single function {x:{\mathbb R}\rightarrow{\mathbb C}} and we look for a function {s:{\mathbb R}\times{\mathbb R}\rightarrow{\mathbb C}}. However, additional assumptions on {s} may help and indeed there is the following identification theorem due to Kailath:

Theorem 1 (Kailath) The channel {H_s} can be identified if the support of the spreading function {s} is contained in a rectangle with area smaller than {2\pi}.

Probably this looks a bit weird: The size of the support shall allow for identification? However, after turning the problem a bit around one sees that there is an intuitive explanation for this theorem which also explains the constant {2\pi}.

2. Translation to integral operators

The first step is, to translate the problem into one with integral operators of the form

\displaystyle F_k x(t) = \int_{\mathbb R} k(t,\tau)x(\tau)d\tau.

We observe that {k} is linked to {s} via

\displaystyle k(t,\tau) = \int s(t-\tau,\nu)e^{i\nu t}d\nu = (2\pi)^{1/2}\mathcal{F}_2^{-1}(s(t-\tau,\cdot))(t). \ \ \ \ \ (1)

 

and hence, identifying {s} is equivalent to identifying {k} (by the way: I use the same normalization of the Fourier transform as in this post).

From my point of view, the formulation of the identification problem is a bit clearer in the form of the integral operator {F_k}. It very much resembles “matrix-vector multiplication”: First you multiply {k(t,\tau)} with you input signal {x} in {\tau} (forming “pointwise products of the rows and the input vector” as in matrix-vector multiplication) and then integrate with respect to {t} (“summing up” the result).

3. Sending Diracs through the channel

Now, let’s see, if we can identify {k} from a single input-output measurement. We start, and send a single Dirac impulse through the channel: {x = \delta_0}. This formally gives

\displaystyle y(t) = \int_{\mathbb R} k(t,\tau)\delta_0(\tau)d\tau = k(t,0).

Ok, this is not too much. From the knowledge of {y} we can directly infer the values {k(t,0)} but that’s it – nothing more.

But we could also send a Dirac at a different position {\tau_0}, {x = \delta_{\tau_0}} and obtain

\displaystyle y(t) = \int_{\mathbb R} k(t,\tau)\delta_{\tau_0}(\tau)d\tau = k(t,\tau_0).

Ok, different Diracs give us information about {k} for all {t} but only for a single {\tau_0}. Let’s send a whole spike train (or Dirac comb) of “width” {c>0}: {x = \Delta_c = \sum_{n\in{\mathbb Z}} \delta_{cn}}. We obtain

\displaystyle y(t) = \int_{\mathbb R} k(t,\tau)\sum_{n\in{\mathbb Z}}\delta_{nc}(\tau)d\tau = \sum_{n\in{\mathbb Z}}k(t,cn).

Oh – this does not reveal a single value of {k} but “aggregated” values…

4. Choosing the right spike train

Now let’s translate the one condition on {s} we have to a condition on {k}: The support of {s} is contained in a rectangle with area less then {2\pi}. Let’s assume that this rectangle in centered around zero (which is no loss of generality) and denote it by {[-a/2,a/2]\times [-b/2,b/2]}, i.e. {s(\tau,\nu) = 0} if {|\tau|>a/2} or {|\nu|>b/2} (and of course {ab<2\pi}). From (1) we conclude that

\displaystyle k(t,\tau) = 0\ \text{ for }\ |t-\tau|>a/2.

In the {(t,\tau)}-plane, the support looks like this:

If we now send this spike train in the channel, we can visualize the output as follows:

We observe that in this case we can infer the values of {k(t,nc)} at some points {t} but at other points we have an overlap and only observe a sum of {k(t,nc)} and {k(t,(n+1)c)}. But if we make {c} large enough, namely larger than {a}, we identify the values of {k(t,nc)} exactly in the output signal!

5. Is that enough?

Ok, now we know something about {k} but how can we infer the rest of the values? Don’t forget, that we know that {ab < 2\pi} and that we know that {s(\tau,\nu)} is supported in {[-a/2,a/2]\times[-b/2,b/2]}. Up to now we only used the value of {a}. How does the support limitation if {\nu} translates to {k}? We look again at (1) and see: The function {\tilde k_t:\tau\mapsto k(t,t-\tau)} is bandlimited with bandwidth {b/2} for every {t}. And what do we know about these functions {\tilde k_t}? We know the samples {k(t,nc) = \tilde k_t(t-nc)}. In other words, we have samples of {\tilde k_t} with the sampling rate {c}. But there is the famous Nyquist-Shannon sampling theorem:

Theorem 2 (Nyquist-Shannon) A function {f:{\mathbb R}\rightarrow{\mathbb C}} which is bandlimited with bandwidth {B} is totally determined by its discrete time samples {f(n\pi/B)}, {n\in{\mathbb Z}} and it holds that

\displaystyle f(t) = \sum_{n\in{\mathbb Z}} f(n\pi/N)\textup{sinc}\Big(\tfrac{B}{\pi}(x - \tfrac{n\pi}{B})\Big).

We are happy with the first assertion: The samples totally determine the function if they are dense enough. In our case the bandwidth of {\tilde k_t} is {b/2} and hence, we need the sampling rate {2\pi/b}. What we have, are samples with rate {c}.

We collect our conditions on {c}: We need {c}

  • larger than {a} to separate the values of {k(t,nc)} in the output.
  • smaller than {2\pi/b} to determine the full functions {\tilde k_t(\tau) = k(t,t-\tau)}.

Both together say

\displaystyle a < c < \frac{2\pi}{b}.

Such a {c} exists, if {ab < 2\pi} and we have proven Theorem 1.

6. Concluding remarks

The proof of Kailath’s theorem reveal the role of the constraint {ab <2\pi}: We want {a} not to be too large to be able to somehow separate the values of {k(t,nc)} in the output measurement and we need {b} not too large to ensure that we can interpolate the values {k(t,nc)} exactly.

However, a severe drawback of this results is that one needs to know {a} to construct the “sensing signal” which was the spike train {\Delta_c}. Moreover, this sensing signal has itself an infinite bandwidth which is practically not desirable. It seems to be an open question whether there are more practical sensing signals.

There is more known about sensing of time varying channels: The support of {s} does not have to be in a rectangle, it is enough if the measure of the support is smaller than {2\pi} (a result usually attributed to Bello). Moreover, there are converse results which say that linear and stable identification is only possible under this restriction on the support size (see the work of Götz Pfander and coauthors).

I stumbled upon the notion of “time varying channels” in signal processing and after reading a bit I realized that these are really interesting objects and appear in many different parts of mathematics. In this post I collect of few of their realizations and relations.

In signal processing a “time varying channel” is a mapping which maps a signal {x:{\mathbb R}^d\rightarrow {\mathbb C}} to an output {y:{\mathbb R}^d\rightarrow{\mathbb C}} via

\displaystyle y(t) = \int_{{\mathbb R}^d}\int_{{\mathbb R}^d} s(\tau,\nu)x(t-\tau)\mathrm{e}^{\mathrm{i}\nu t}d\nu d\tau.

Before we explain the name “time varying channel” we fix the notation for the Fourier transform I am going to use in this post:

Definition 1 For {f:{\mathbb R}^d\rightarrow{\mathbb C}} we define the Fourier transform by

\displaystyle \mathcal{F}(f)(\omega) = \hat f(\omega) = (2\pi)^{-d/2}\int f(t)\mathrm{e}^{-\mathrm{i}\omega t}dt

and denote the inverse Fourier transform by {\mathcal{F}^{-1}f} or {\check f}. For a function {f:{\mathbb R}^d\times{\mathbb R}^d\rightarrow{\mathbb C}} we denote with {\mathcal{F}_1} and {\mathcal{F}_2} the Fourier transform with respect to the first and second {{\mathbb R}^d}-component, respectively.

Remark 1 In all what follows we do formal calculations with integral not caring about integrability. All calculation are justified in the case of Schwartz-functions and often hold in a much broader context of tempered distributions (this for example happens if the integrals represent Fourier transforms of functions).

The name “time varying channel” can be explained as follows: Consider a pure frequency as input: {x(t) = \mathrm{e}^{-\mathrm{i}\omega t}}. A usual linear channel gives as output a damped signal and the damping depends on the frequency {\omega}: {y(t) = h(\omega) \mathrm{e}^{-\mathrm{i}\omega t}}. If we send the pure frequency in our time varying channel we get

\displaystyle \begin{array}{rcl} y(t) & = &\int\int s(\tau,\nu) \mathrm{e}^{-\mathrm{i}\omega(t-\tau)}e^{\mathrm{i}\nu t}d\nu dt\\ & =& \int\int s(\tau,\nu) \mathrm{e}^{\mathrm{i}(\omega\tau +\nu t)}d\nu dt\, \mathrm{e}^{-\mathrm{i}\omega t}\\ & =& (2\pi)^d \hat s(-\omega,-t) \mathrm{e}^{-\mathrm{i}\omega t}. \end{array}

Hence, the time varying channel also damps the pure frequencies but with a time dependent factor.

Let’s start quite far away from signal processing:

1. Pseudo-differential operators

A general linear differential operator of order {N} on functions on {{\mathbb R}^d} is defined with multiindex notation as

\displaystyle Af(t) = \sum_{|\alpha|\leq N} \sigma_\alpha(t) D^\alpha f(t)

with coefficient functions {\sigma_\alpha}. Using Fourier inversion we get

\displaystyle D^\alpha f(t) - (2\pi)^{-d/2} \int \hat f(\omega) (\mathrm{i} \omega)\mathrm{e}^{\mathrm{i}\omega t}d\omega

and hence

\displaystyle \begin{array}{rcl} Af(t) & =& \int \Big((2\pi)^{-d/2} \sum_{|\alpha|\leq N} \sigma_\alpha(t)(\mathrm{i} \omega)^\alpha) \hat f(\omega)\mathrm{e}^{\mathrm{i}\omega t}d\omega\\ &=& \int \sigma(\omega,t) \hat f(\omega)\mathrm{e}^{\mathrm{i}\omega t}d\omega \\ & = & K_\sigma f(t). \end{array}

For a general function (usually obeying some restrictions) {\sigma:{\mathbb R}^d\times{\mathbb R}^d\rightarrow{\mathbb C}} we call the corresponding {K_\sigma} a pseudo-differential operator with symbol {\sigma}.

2. As integral operators

By integral operator I mean something like

\displaystyle F_k f(t) = \int k(t,\tau)f(\tau)d\tau.

We plug the definition of the Fourier transform into {K_\sigma} and obtain

\displaystyle \begin{array}{rcl} K_\sigma f(t) &=& \int \sigma(\omega,t) (2\pi)^{-d/2} \int f(\tau)\mathrm{e}^{-\mathrm{i} \omega \tau}d\tau \mathrm{e}^{\mathrm{i} \omega t}d\omega\\ & = & \int \underbrace{(2\pi)^{-d/2} \int \sigma(\omega,t)\mathrm{e}^{-\mathrm{i}\omega(\tau-t)}d\omega}_{=k(t,\tau)} f(\tau)d\tau\\ &=& \int k(t,\tau)f(\tau)d\tau. \end{array}

Using the Fourier transform we can express the relation between {\sigma} and {k} as

\displaystyle k(t,\tau) = (2\pi)^{-d/2}\int \sigma(\omega,\tau)\mathrm{e}^{-\mathrm{i}\omega(\tau-t)}d\omega = \mathcal{F}_1(\sigma(\cdot,t))(\tau-t). \ \ \ \ \ (1)

3. As “time varying convolution”

The convolution of two functions {f} and {g} is defined as

\displaystyle (f*g)(t) = \int f(\tau) g(t-\tau)d\tau

and we write “the convolution with {g}” as an operator {C_g f = f * g}.

Defining

\displaystyle h_t(\tau) = (2\pi)^{-d/2}\int \sigma(\omega,t)\mathrm{e}^{\mathrm{i}\omega \tau}d\omega

we deduce from (1)

\displaystyle K_\sigma f(t) = \int f(\tau) h_t(t-\tau)d\tau= (f * h_t)(t) = C_{h_t}f(t).

4. As superposition of time-frequency shifts

Using that iterated Fourier transforms with respect to components give the Fourier transform, i.e. {\mathcal{F}_2\mathcal{F}_1 = \mathcal{F}}, we obtain

\displaystyle \mathcal{F}_1\sigma = \mathcal{F}_2^{-1}\mathcal{F}_2\mathcal{F}_1\sigma = \mathcal{F}_2^{-1}\hat\sigma.

From (1) we get

\displaystyle \begin{array}{rcl} k(t,\tau) &=& \mathcal{F}_2^{-1}\hat\sigma(\tau-t,t)\\ &=& (2\pi)^{-d/2} \int\hat\sigma(\tau-t,\nu)\mathrm{e}^{\mathrm{i} t\nu}d\nu. \end{array}

Before plugging this into {K_\sigma} we define time shifts {T_u} and frequency shifts (or modulations) {M_\nu} as

\displaystyle T_u f(t) = f(t-u),\quad M_\nu f(t) = \mathrm{e}^{\mathrm{i} t\nu}.

With this we get

\displaystyle \begin{array}{rcl} K_\sigma f(t) &=& (2\pi)^{-d/2}\int\int \hat\sigma(\tau-t,\nu)\mathrm{e}^{\mathrm{i} t\nu}f(\tau)d\nu d\tau\\ &=& (2\pi)^{-d/2}\int\int \hat\sigma(u,\nu)\mathrm{e}^{\mathrm{i} t\nu}f(t+u)d\nu du\\ &=& (2\pi)^{-d/2}\int\int\hat\sigma(u,\nu) M_\nu T_{-u} f(t) d\nu du\\ &=& \int\int w(u,\nu) M_\nu T_{-u} f(t)d\nu du = S_wf(t). \end{array}

Hence, {K_\sigma} is also a weighted superposition of {M_\nu T_{-u} f} (time-frequency shifts) with weight {(2\pi)^{-d/2} \hat\sigma(u,\nu)}.

5. Back to time varying channels

Simple substitution brings us back the situation of a time varying channel

\displaystyle \begin{array}{rcl} K_\sigma f(t) &=& (2\pi)^{-d/2} \int\int \hat\sigma(u,\nu)\mathrm{e}^{\mathrm{i} t\nu}f(t+u)d\nu du\\ &=& \int \int s(\tau,\nu) f(t-\tau) \mathrm{e}^{\mathrm{i} t\nu} d\nu d\tau\\ &=& H_s f(t) \end{array}

with {s(\tau,\nu) = (2\pi)^{-d/2} \hat\sigma(-\tau,\nu)}.

6. As superposition of product-convolution operators

Finally, I’d like to illustrate that this kind of operators can be seen as superposition of product convolution operators.

Introducing product operators {P_g f(t) = g(t)f(t)} we define a product-convolution operator as a convolution with a function {h} followed by a multiplication with another function {g}: {P_g C_h f}.

To express {K_\sigma} with product-convolution operators we choose an orthonormal basis which consists of tensor-products of functions {(\phi_n\otimes\psi_k)} and develop {\sigma} into this basis as

\displaystyle \sigma(\omega,t) = \sum_{n,k} a_{n,k} \phi_n(\omega)\psi_k(t).

Then

\displaystyle \begin{array}{rcl} K_\sigma f(t) &=& \sum_{n,k} a_{n,k}\psi_k(t) \int \phi_n(\omega) \hat f(\omega) \mathrm{e}^{\mathrm{i} \omega t}d\omega\\ & = & \sum_{n,k} a_{n,k}\psi_k(t) (2\pi)^{d/2}\mathcal{F}^{-1}(\phi_n\, \hat f)(t)\\ &=& \sum_{n,k} a_{n,k}\psi_k(t) (\check \phi_n * f)(t) \end{array}

and we obtain

\displaystyle K_\sigma f(t) = \sum_{n,k} a_{n,k}P_{\psi_k} C_{\check\phi_n} f (t).

Remark 2 The integral operators of the form {F_k} are indeed general objects as can be seen from the Schwartz kernel theorem. Every reasonable operator mapping Schwartz functions linearly onto the tempered distributions is itself a generalized integral operator.

I tried to capture the various relation between the first five representations time varying channels in a diagram (where I went out of motivation before filling in all fields…):

If you want to have a sparse solution to a linear system of equation and have heard of compressed sensing or sparse reconstruction than you probably know what to do: Get one of the many solvers for Basis Pursuit and be happy.

Basis Pursuit was designed as a convex approximation of the generally intractable problem of finding the sparsest solution (that is, the solution with the smallest number of non-zero entries). By abuse of notation, we define for {x\in\mathbb{R}^n}

\displaystyle \|x\|_0 = \#\{i\ : x_i\neq 0\}.

(Because of {\|x\|_0 = \lim_{p\rightarrow 0}\|x\|_p^p} some people prefer the, probably more correct but also more confusing, notation {\|x\|_0^0}…).

Then, the sparsest solution of {Ax=b} is the solution of

\displaystyle \min_x \|x\|_0,\quad \text{s.t.}\ Ax=b

and Basis Pursuit replaces {\|x\|_0} with “the closest convex proxy”, i.e.

\displaystyle \min_x \|x\|_1,\quad\text{s.t.}\ Ax=b.

The good thing about Basis Pursuit suit is, that is really gives the sparsest solution under appropriate conditions as is widely known nowadays. Here I’d like to present two simple examples in which the Basis Pursuit solution is

  • not even close to the sparsest solution (by norm).
  • not sparse at all.

1. A small bad matrix

We can build a bad matrix for Basis Pursuit, even in the case {2\times 3}: For a small {\epsilon>0} define

\displaystyle A = \begin{bmatrix} \epsilon & 1 & 0\\ \epsilon & 0 & 1 \end{bmatrix}, \quad b = \begin{bmatrix} 1\\1 \end{bmatrix}.

Of course, the sparsest solution is

\displaystyle x_0 = \begin{bmatrix} 1/\epsilon\\ 0\\ 0\end{bmatrix}

while the solution of Basis Pursuit is

\displaystyle x_1 = \begin{bmatrix} 0\\1\\1 \end{bmatrix}.

The summarize: For {\epsilon<1/2}

\displaystyle \|x_0\|_0 = 1 < 2 = \|x_1\|_0,\quad \|x_0\|_1 = 1/\epsilon > 2 = \|x_1\|_1.

(There is also a least squares solution that has three non-zero entries and a one-norm slightly larger than 2.)

Granted, this matrix is stupid. Especially, its first column has a very small norm compared to the others. Ok, let’s construct a matrix with normalized columns.

2. A small bad matrix with normalized columns

Fix an integer {n} and a small {\epsilon>0}. We define a {n\times(n+2)}-matrix

\displaystyle \begin{bmatrix} 1+\epsilon/2 & -1+\epsilon/2 & 1 & 0 & \dots & \dots &0\\ -1+\epsilon/2 & 1+\epsilon/2 & 0 & 1 & \ddots & & 0\\ \epsilon/2 & \epsilon/2 & \vdots & \ddots& \ddots & \ddots& \vdots\\ \vdots & \vdots & \vdots & & \ddots & \ddots& 0\\ \epsilon/2 & \epsilon/2 & 0 & \dots& \dots& 0 & 1 \end{bmatrix}.

Ok, the first two columns do not have norm 1 yet, so we normalize them by multiplying with the right constant

\displaystyle c = \frac{1}{\sqrt{2 + \tfrac{n\epsilon^2}{4}}}

(which is close to {1/\sqrt{2}}) to get

\displaystyle A = \begin{bmatrix} c(1+\epsilon/2) & c(-1+\epsilon/2) & 1 & 0 & \dots & \dots &0\\ c(-1+\epsilon/2) & c(1+\epsilon/2) & 0 & 1 & \ddots & & 0\\ c\epsilon/2 & c\epsilon/2 & \vdots & \ddots& \ddots & \ddots& \vdots\\ \vdots & \vdots & \vdots & & \ddots & \ddots& 0\\ c\epsilon/2 & c\epsilon/2 & 0 & \dots& \dots& 0 & 1 \end{bmatrix}.

Now we take the right hand side

\displaystyle b = \begin{bmatrix} 1\\\vdots\\1 \end{bmatrix}

and see what solutions to {Ax=b} are there.

First, there is the least squares solution {x_{\text{ls}} = A^\dagger b}. This has only non-zero entries, the last {n} entries are slightly smaller than {1} and the first two are between {0} and {1}, hence, {\|x_{\text{ls}}\|_1 \approx n} (in fact, slightly larger).

Second, there is a very sparse solution

\displaystyle x_0 = \frac{1}{\epsilon c} \begin{bmatrix} 1\\ 1\\ 0\\ \vdots\\ 0 \end{bmatrix}.

This has two non-zero entries and a pretty large one-norm: {\|x_0\|_1 = 2/(\epsilon c)}.

Third there is a solution with small one-norm:

\displaystyle x_1 = \begin{bmatrix} 0\\ 0\\ 1\\ \vdots\\ 1 \end{bmatrix}.

We have {n} non-zero entries and {\|x_1\|_1 = n}. You can check that this {x_1} is also the unique Basis Pursuit solution (e.g. by observing that {A^T[1,\dots,1]^T} is an element of {\partial\|x_1\|_1} and that the first two entries in {A^T[1,\dots,1]^T} are strictly smaller than 1 and positive – put differently, the vector {[1,\dots,1]^T} is dual certificate for {x_1}).

To summarize, for {\epsilon < \sqrt{\frac{8}{n^2-n}}} it holds that

\displaystyle \|x_0\|_0 = 2 < n = \|x_1\|_0,\quad \|x_0\|_1 = 2/(c\epsilon) > n = \|x_1\|_1.

The geometric idea behind this matrix is as follows: We take {n} simple normalized columns (the identity part in {A}) which sum up to the right hand side {b}. Then we take two normalized vectors which are almost orthogonal to {b} but have {b} in their span (but one needs huge factors here to obtain {b}).

Well, this matrix looks very artificial and indeed it’s constructed for one special purpose: To show that minimal {\ell^1}-norm solution are not always sparse (even when a sparse solution exists). It’s some kind of a hobby for me to construct instances for sparse reconstruction with extreme properties and I am thinking about a kind of “gallery” of these instances (probably extending the “gallery” command in Matlab).
By the way: if you want to play around with this matrix, here is the code

n = 100;
epsilon = sqrt(8/(n^2-n))+0.1;
c = 1/sqrt(2+n*epsilon^2/4);
A = zeros(n,n+2);
A(1:2,1:2) = ([1 -1;-1,1]+epsilon/2)*c;
A(3:n,1:2) = epsilon/2*c;
A(1:n,3:n+2) = eye(n);
b = ones(n,1);

Next Page »

Follow

Get every new post delivered to your Inbox.

Join 30 other followers