June 2011

In this post I would like to comment on two papers I “stumbled upon”, one in regularization theory and one in image processing.

The first one is A regularization parameter for nonsmooth Tikhonov regularization by Kazufumi Ito, Bangti Jin and Tomoya Takeuchi. As the title announces, the paper addresses the problem of determining a suitable regularization parameter for some kind of Tikhonov regularization. In particular, the authors propose a new heuristic method, i.e. a method which does not use any estimate of the noise level in the data. This is an interesting and important topic for several reasons:

  1. Practically, estimates of the noise level are rarely available, and if they are, they are not too reliable.
  2. Strictly speaking, these kinds of rules are “bad” because of the “Bakushinskii veto”: such rules only provide regularizations (e.g. in the sense of Engl, Hanke, Neubauer) for problems which are already well-posed (as a great service, the authors state and prove this statement as Theorem 3.2).
  3. Despite this veto, several heuristic rules produce excellent results in practice.

Note that the last two points are not in contradiction. They merely say that the notion of “regularization” may be too strict. Usually it relies on a worst-case estimate which may practically never be observed.

The paper contributes a new rule and states that it is applicable to a broad range of problems. The authors use the very general Tikhonov functional:

\displaystyle  \phi(x,y^\delta) + \eta\psi(x)

and do not assume that {\phi} or {\psi} are smooth. They use the value function

\displaystyle  F(\eta) = \min_x \phi(x,y^\delta) + \eta\psi(x)

and propose the following rule for {\eta}: For some {\gamma} choose {\eta} such that

\displaystyle  \Phi_\gamma(\eta) = \frac{F(\eta)^{1+\gamma}}{\eta}

is minimal. I do not have any intuition for this rule (however, from their proofs you see that it works, at least for “partially smooth cases”, see below). Lacking a name for this rule, one may use the term “weighted value function rule”.
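For intuition, here is a minimal numerical sketch of the rule (not from the paper): I take the smooth penalty {\psi(x)=\tfrac12\|x\|^2}, for which the minimizer and hence {F(\eta)} are available in closed form; the toy operator, the grid and {\gamma=1/2} are arbitrary choices of mine, and the paper of course treats nonsmooth {\phi} and {\psi} and provides a proper algorithm.

```python
import numpy as np

def value_function(K, y, eta):
    """F(eta) = min_x 1/2 ||K x - y||^2 + eta * psi(x) for psi(x) = 1/2 ||x||^2,
    evaluated via the closed-form Tikhonov minimizer."""
    n = K.shape[1]
    x = np.linalg.solve(K.T @ K + eta * np.eye(n), K.T @ y)
    return 0.5 * np.linalg.norm(K @ x - y) ** 2 + 0.5 * eta * np.linalg.norm(x) ** 2

def phi_gamma(K, y, eta, gamma=0.5):
    """Objective of the weighted value function rule: F(eta)^(1+gamma) / eta."""
    return value_function(K, y, eta) ** (1.0 + gamma) / eta

# ill-conditioned toy problem with a little noise (all choices illustrative)
rng = np.random.default_rng(1)
K = np.vander(np.linspace(0.0, 1.0, 50), 8, increasing=True)
y = K @ rng.standard_normal(8) + 1e-3 * rng.standard_normal(50)

etas = np.logspace(-8, 0, 100)
F = np.array([value_function(K, y, e) for e in etas])
# the structural properties proven in the paper (continuity, monotonicity,
# concavity of F) are visible numerically: F is nondecreasing in eta
eta_star = etas[np.argmin([phi_gamma(K, y, e) for e in etas])]
print(eta_star)
```

Note that on all of {(0,\infty)} the quotient {F(\eta)^{1+\gamma}/\eta} tends to zero for large {\eta} since {F} is bounded; hence, in a sketch like this one restricts the search to a fixed grid.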

They prove several nice properties of the value function (continuity, monotonicity and concavity) under loose assumptions on {\phi} and {\psi} (in particular, they do not even need existence of minimizers for {\phi(x,y^\delta) + \eta\psi(x)}, only that the infimum exists). However, when it comes to error estimates, they only obtain results for a specific discrepancy measure, namely a squared Hilbert space norm:

\displaystyle  \phi(x,y^\delta) = \tfrac12\|Kx-y^\delta\|^2.

It seems that, for general convex and lower semicontinuous penalties {\psi}, they build upon results from my paper with Bangti Jin on the Hanke-Raus rule and the quasi-optimality principle.

Another contribution of the paper is an algorithm that realizes the weighted value function rule (a thing which I omitted in my paper with Bangti). Their numerical experiments demonstrate that the weighted value function rule and the proposed algorithm work well for academic examples.

The next paper I want to discuss is the preprint Properties of {L^1-\text{TGV}^2}: The one-dimensional case by Kristian Bredies, Karl Kunisch and Tuomo Valkonen. There the authors analyze the somewhat recent generalization “total generalized variation” {\text{TGV}} of the omnipresent total variation. The TGV has recently been proposed by Bredies, Kunisch and Pock in this paper, and Kristian and I also briefly described it in our book on mathematical image processing. Loosely speaking, TGV is intended to be a generalization of the usual total variation which does not lead to “staircasing”. While one may see from the construction of the TGV functional that staircasing is not to be expected, the authors of this paper give precise statements. By restricting themselves to the one-dimensional case they prove several interesting properties of the TGV functional, most notably that it leads to an equivalent norm on the space {BV}.

Maybe I should state the definitions here: The total variation of a function {u\in L^1(\Omega)} is

\displaystyle  \text{TV}(u) = \sup\{\int_\Omega u v'\ |\ v\in C^1_c(\Omega),\ \|v\|_\infty\leq 1\}

leading to the {BV}-norm

\displaystyle  \|u\|_{BV} = \|u\|_{L^1} + \text{TV}(u).
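To make the definition concrete: on a grid, the total variation of a sampled function reduces to the sum of absolute differences of neighboring values. This discretization is standard, but the code and the test signals below are my own illustration, not from the paper.

```python
import numpy as np

def tv(u):
    """Discrete total variation: sum of absolute neighbor differences."""
    return float(np.abs(np.diff(u)).sum())

def bv_norm(u, h=1.0):
    """Discrete BV-norm: (weighted) L1-norm plus total variation."""
    return h * float(np.abs(u).sum()) + tv(u)

u_step = np.concatenate([np.zeros(50), np.ones(50)])  # unit jump
u_ramp = np.linspace(0.0, 1.0, 100)                   # affine, same range

print(tv(u_step))  # TV only measures the total jump (about 1.0)
print(tv(u_ramp))  # same variation, although there is no jump
```

Note that the step and the ramp have the same total variation; the {\text{TGV}^2} seminorm defined next, in contrast, vanishes on affine functions and hence distinguishes the two.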

The {\text{TGV}^2} seminorm for a parameter tuple {(\alpha,\beta)} is

\displaystyle  \text{TGV}^2_{(\alpha,\beta)}(u) = \sup\{\int_\Omega u v''\ |\ v\in C^2_c(\Omega),\ \|v\|_\infty\leq\beta,\ \|v'\|_\infty\leq\alpha\}

and the associated norm is

\displaystyle  \|u\|_{BGV^2} = \|u\|_{L^1} + \text{TGV}^2(u).

In Lemma 3.3 they prove that {\|\cdot\|_{BV}} and {\|\cdot\|_{BGV^2}} are equivalent norms on {\text{BV}}. In Section 4 they show that minimizers of

\displaystyle  \|u-f\|_{L^1} + \alpha\text{TV}(u)

obey staircasing of degree 0, i.e. the solution {u} is piecewise constant where it is not equal to {f}. For the minimizers of

\displaystyle  \|u-f\|_{L^1} + \text{TGV}^2_{(\alpha,\beta)}(u)

one has staircasing of degree 1: {u} is affine linear where it is not equal to {f}.

These two facts combined (the norm equivalence of {\text{BV}} and {\text{BGV}^2} and the staircasing of degree 1) seem quite remarkable to me. They show that staircasing is not related to the space {\text{BV}} of functions of bounded variation but only to the specific {\text{TV}} semi-norm. This is somehow satisfying since I still remember the thorough motivation of L. Rudin in his 1987 thesis for the usage of the space {\text{BV}} in image processing: If there were images which are not in {\text{BV}}, we could not observe them. (He even draws an analogy to the question: How many angels can dance on the point of a needle?) Moreover, he further argues that {\text{BV}} is not too large in the sense that its elements are still accessible to analysis (e.g. one can define a weak notion of curvature although the functions may be discontinuous). The {\text{BGV}^2}-model shows that it is possible to overcome the undesired effect of staircasing while staying in the well-founded, mathematically sound and appealing framework of {\text{BV}}.

The paper contains several more interesting results (e.g. on preservation of continuity and “affinity”, and on convergence with respect to {(\alpha,\beta)}) which I do not collect here.

My colleague K.-J. Wirths came up with another Rule of Sarrus for {4\times 4} matrices. His suggestion is somehow close to the original (at least graphically) and is easier to memorize. One has to use the “original” Rule of Sarrus for the {4\times 4} case, but now three times: for the first pass use the original matrix, and for the next two passes one has to permute two columns. Graphically this gives the following pictures:

In principle this generalizes to larger {n\times n} matrices. But beware: {n!} is large! For the {5\times 5} case one has a sum of 120 products, but each “standard Sarrus” only gives 10 of them. Hence, one has to figure out 12 different permutations. In the {n\times n} case one even needs to memorize {\frac{n!}{2n}} permutations, let alone all the computations…
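The counting above can be checked in a few lines (a small sketch; the function name is mine): the Leibniz formula has {n!} products, and each Sarrus-style pass over the matrix contributes {2n} of them.

```python
from math import factorial

def sarrus_passes(n):
    """Passes of a Sarrus-style scheme needed for an n x n determinant:
    the Leibniz formula has n! products, each pass yields 2n of them."""
    return factorial(n) // (2 * n)

for n in range(3, 7):
    # n = 3 needs 1 pass (the classical rule), n = 4 needs 3, n = 5 needs 12
    print(n, factorial(n), sarrus_passes(n))
```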

I am sure that somebody with stronger background in algebra and more knowledge about permutation groups could easily figure out what is going on here, and to visualize the determinants better.

Update: Indeed! Somebody with more background in algebra already explored how to generalize the Sarrus rule to larger matrices. Again it was my colleague K.-J. Wirths who found the reference and here it is:

  • Обобщенное правило Саррюса, by С. Аршон, Матем. сб., 42:1 (1935), 121–128

and it is from 1935 already! If you don’t speak Russian, in German it is

  • “Verallgemeinerte Sarrussche Regel”, S. Arschon, Mat. Sb., 42:1 (1935), 121–128

and if you don’t speak German either, you can visit the page on mathnet.ru or the page in the Zentralblatt (but it seems that there is no English version of the paper or the abstract available…). Anyway, you need {(n-1)!/2} permutations of the columns and apply the plain Rule of Sarrus to all of them (and end up, of course, with {2n(n-1)!/2=n!} summands, each of which has {n} factors – way more than using an LU or QR factorization).

There is this famous Rule of Sarrus for calculating the determinant of a {3\times 3} matrix. The nice thing is that one can remember it quite quickly with the help of a simple picture which I cite from the Wikipedia page:

Being at a university which has a lot of engineers who need to learn mathematics, there are plenty of students who learn about determinants and also learn the Rule of Sarrus. A quite amusing thing which happens frequently in exams is the following (at least colleagues told me so): If you ask the students to calculate the determinant of a {4\times 4} matrix, there will be a number of people applying the Rule of Sarrus without thinking, ending up with a sum of eight products. You will also find interesting discussions if you google “Sarrus 4×4”. Of course, the Rule of Sarrus cannot work in this simple manner since, according to the Leibniz formula for determinants, you need a sum of 24 products.

Well, I thought it could be nice to have an illustration of the 24 “paths” along which you should take products in a {4\times 4} matrix. My first idea was to use Laplace’s formula for the first column. That is, I use the formula

\displaystyle \det(A) = \sum_{i=1}^4 (-1)^{i+1}a_{i,1}M_{i,1}

in which {M_{i,1}} is the determinant of the {3\times 3} sub-matrix obtained by removing the first column and the {i}th row of {A}. For the entry {a_{1,1}} this gives this picture:

Here, the green lines indicate products which get the sign {+} and red lines indicate products which get the sign {-}. The blue dots are just for better orientation.

The entries {a_{2,1}}, {a_{3,1}} and {a_{4,1}} lead to similar pictures:

Putting all graphics together we obtain the nice and intuitive picture

for the 4×4 Rule of Sarrus. Ok – have fun memorizing this…

Since this picture looks so ugly, one may be tempted to call the corresponding rule the “Rule of Sauron”…
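Of course, nobody has to memorize the picture: the first-column Laplace expansion which generates it is only a few lines of code. A minimal recursive sketch (the test matrix is an arbitrary example of mine):

```python
def det(A):
    """Determinant via Laplace expansion along the first column:
    det(A) = sum_i (-1)^i * A[i][0] * det(minor_i), i zero-based."""
    n = len(A)
    if n == 1:
        return A[0][0]
    total = 0
    for i in range(n):
        # minor: delete row i and the first column
        minor = [row[1:] for k, row in enumerate(A) if k != i]
        total += (-1) ** i * A[i][0] * det(minor)
    return total

A = [[2, 0, 1, 3],
     [1, 4, 0, 2],
     [3, 1, 2, 0],
     [0, 2, 1, 1]]
print(det(A))  # -> 50
```

The recursion expands each {3\times 3} minor again, so it enumerates exactly the 24 signed products of the Leibniz formula – the “paths” in the picture above.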

P.S.: Probably there is some graph theorist somewhere who could produce a nicer picture, e.g. one which minimizes the number of crossings. Moreover, there are probably other thoughts about the Rule of Sarrus and its interpretations for larger matrices – I would be glad to learn about them.

In a follow-up post, I show a simpler visualization.

I remember that I read a blog post somewhere which tried to make a point about multiple choice questions for exams. I have the impression that mathematicians at German universities usually hold the opinion that multiple choice questions are not an appropriate way to build an exam and to determine grades. A frequent argument against multiple choice questions is that it is important to check whether the student has acquired techniques and is able to use them to solve problems, and that this can certainly not be achieved by multiple choice questions. However, I think that there are certain points at which multiple choice questions could be the first choice even for this purpose.

In most exercise classes in the first courses on Analysis and Linear Algebra the following thing happens at least once: There is an exercise of the form “Prove that this and that holds.” or “Show that this object has that property.” Then some students hand in solutions in which a wrong argument is hidden somewhere in the middle (while the general approach is ok). The teaching assistant then tries to explain that the solution is not acceptable since the line of argument contains an error. And at this point the discussion often gets complicated: it mixes aspects of the fact to be proven with the arguments used. For example, the teaching assistant tries to show that the argument is not valid by a counterexample. However, this counterexample has to be considerably different from the fact itself (since the fact is true). The student responds that this is not a valid counterexample since it does not fulfill some assumption of the exercise. And several more problems can occur during a discussion like that.

I think that multiple choice questions are particularly well suited to turn these problems into exercises. As an example, one may pose the following questions:

Exercise 1 Which of the following arguments is not valid to show that the function {f:[0,1]\rightarrow{\mathbb R}} defined by {f(x) = x^2} is continuous?

  1. For {\epsilon=3/4} and {\delta=1/2} it holds that {|x-y|\leq \delta} implies that {|f(x)-f(y)|\leq \epsilon}.
  2. For {x_n\rightarrow 1} it holds that {f(x_n) = x_n^2 \rightarrow 1}.
  3. The preimage of the open interval {]a,b[} is the open interval {]\sqrt{a},\sqrt{b}[}.
  4. For every {\epsilon>0} one may set {\delta=\epsilon/2} and then it holds that

    \displaystyle  |x-y| < \delta \implies |f(x)-f(y)|<\epsilon.

Exercise 2 Which of the following arguments is not valid to show that the unit step function {f:{\mathbb R}\rightarrow{\mathbb R}} defined by

\displaystyle  f(x) = \begin{cases} 0 & \text{if }\ x\leq 0\\ 1 &\text{if }\ x> 0\\ \end{cases}

is not continuous?

  1. The preimage of the closed set {\{1\}} is the open set {]0,\infty[}.
  2. The preimage of the open set {]-1/2,1/2[} is the closed set {]-\infty,0]}.
  3. For {x_n = 1/n} it holds that {x_n\rightarrow 0} but {f(x_n)\rightarrow 1\neq 0 = f(0)}.

If this is given as a homework question which has to be handed in, one may add the sentence “Indicate why the argument is not valid and propose a correction if you think there is one.”
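Incidentally, the valid argument in Exercise 1 (the choice {\delta=\epsilon/2}) can also be probed numerically, since {|x^2-y^2| = |x+y|\,|x-y| \leq 2|x-y|} on {[0,1]}. A small randomized sanity check (the helper name and the sampling scheme are my own):

```python
import random

def holds(eps, delta, trials=10_000, seed=0):
    """Randomly search [0,1]^2 for a counterexample to
    |x - y| < delta  =>  |x^2 - y^2| < eps."""
    rng = random.Random(seed)
    for _ in range(trials):
        x, y = rng.random(), rng.random()
        if abs(x - y) < delta and abs(x * x - y * y) >= eps:
            return False  # counterexample found
    return True

# delta = eps/2 suffices because |x^2 - y^2| = |x + y||x - y| <= 2|x - y| on [0, 1]
for eps in (0.5, 0.1, 0.01):
    print(eps, holds(eps, eps / 2))  # True for each eps
```

Such a check cannot replace a proof, but it makes the distinction tangible: a single fixed pair {(\epsilon,\delta)} as in the first item of Exercise 1 proves nothing about the other values of {\epsilon}.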

Although I have not tested this kind of question yet, I think it could lead to interesting discussions in an exercise class since it somehow swaps the roles in the situation described above.

While this is just one example which has come to my mind, I think that there are more situations in which multiple choice questions may be helpful. I still have to think more about situations in more advanced courses…