The prototypical example of optimization is to solve problems

\displaystyle  \min_{x\in C} f(x)

for functions {f:{\mathbb R}^n\rightarrow{\mathbb R}} and sets {C\subseteq{\mathbb R}^n}. One further classifies problems according to additional properties of {f} and {C}: if {C={\mathbb R}^n} one speaks of unconstrained optimization, if {f} is smooth one speaks of smooth optimization, if {f} and {C} are convex one speaks of convex optimization, and so on.

1. Classification, goals and accuracy

Usually, optimization problems do not have a closed form solution. Consequently, optimization is not primarily concerned with calculating solutions to optimization problems, but with algorithms to solve them. However, having a convergent or terminating algorithm is not fully satisfactory without knowing an upper bound on the runtime. There are several concepts one can work with in this respect and one is the iteration complexity. Here, one gives an upper bound on the number of iterations (which are only allowed to use certain operations such as evaluations of the function {f}, its gradient {\nabla f}, its Hessian, solving linear systems of dimension {n}, projecting onto {C}, calculating halfspaces which contain {C}, or others) needed to reach a certain accuracy. But there are also several definitions of what “accuracy” means:

  • For general problems one can of course desire to be within a certain distance of the optimal point {x^*}, i.e. {\|x-x^*\|\leq \epsilon} for the solution {x^*} and a given point {x}.
  • One could also demand to be at a point whose function value is close to the optimal one {f^*}, i.e. {f(x) - f^*\leq \epsilon}. Note that for this and for the first point one could also desire relative accuracy.
  • For convex and unconstrained problems, one knows that the inclusion {0\in\partial f(x^*)} (with the subgradient {\partial f(x)}) characterizes the minimizers and hence, accuracy can be defined by requiring that {\min\{\|\xi\|\ :\ \xi\in\partial f(x)\}\leq \epsilon}.

It turns out that the first two definitions of accuracy are much too hard to obtain for general problems, and even for smooth and unconstrained ones. The main issue is that for general functions one cannot decide if a local minimizer is also a solution (i.e. a global minimizer) by only considering local quantities. Hence, one resorts to different notions of accuracy, e.g.

  • For smooth, unconstrained problems, aim at stationary points, i.e. find {x} such that {\|\nabla f(x)\|\leq \epsilon}.
  • For smoothly constrained smooth problems, aim at “approximate KKT points”, i.e. points that satisfy the Karush-Kuhn-Tucker conditions approximately.

(There are adaptations to the nonsmooth case in the same spirit.) Hence, it would be more honest not to write {\min_x f(x)} in these cases since this is often not really the problem one is interested in. However, people write “solve {\min_x f(x)}” all the time even if they only want to find “approximately stationary points”.

2. The gradient method for smooth, unconstrained optimization

Consider a smooth function {f:{\mathbb R}^n\rightarrow {\mathbb R}} (we’ll say more precisely how smooth in a minute). We make no assumption on convexity and hence, we are only interested in finding stationary points. From multivariate calculus it is known that {-\nabla f(x)} is a direction of descent at the point {x}, i.e. if {\nabla f(x)\neq 0} there is a value {h>0} such that {f(x - h\nabla f(x))< f(x)}. Hence, it seems like moving in the direction of the negative gradient is a good idea. We arrive at what is known as the gradient method:

\displaystyle  x_{k+1} = x_k - h_k \nabla f(x_k).

Now let’s be more specific about the smoothness of {f}. Of course we need {f} to be differentiable and we also want the gradient to be continuous (to make the evaluation of {\nabla f} stable). It turns out that a bit more smoothness makes the gradient method more efficient, namely we require the gradient of {f} to be Lipschitz continuous with a known Lipschitz constant {L}. The Lipschitz constant can be used to produce efficient stepsizes {h_k}: for {h_k = 1/L} one has the estimate

\displaystyle  f(x_k) - f(x_{k+1})\geq \frac{1}{2L}\|\nabla f(x_k)\|^2.

This inequality is really great because one can use telescoping to arrive at

\displaystyle  \frac{1}{2L}\sum_{k=0}^N \|\nabla f(x_k)\|^2 \leq f(x_0) - f(x_{N+1}) \leq f(x_0) - f^*

with the optimal value {f^*} (note that we do not need to know {f^*} for the following, only that {f} is bounded from below). We immediately arrive at

\displaystyle  \min_{0\leq k\leq N} \|\nabla f(x_k)\| \leq \frac{1}{\sqrt{N+1}}\sqrt{2L(f(x_0)-f^*)}.

That’s already a result on the iteration complexity! Among the first {N+1} iterates there is one which has a gradient norm of order {N^{-1/2}}.
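As a minimal illustration, here is a sketch of the gradient method in MATLAB for a simple quadratic objective (the objective, its gradient, and the constant L are made up for this example):

    % gradient method with constant stepsize h_k = 1/L
    f     = @(x) 0.5*(x(1)^2 + 10*x(2)^2);  % example objective
    gradf = @(x) [x(1); 10*x(2)];           % its gradient
    L     = 10;                             % Lipschitz constant of gradf
    x     = [1; 1];                         % starting point x_0
    while norm(gradf(x)) > 1e-6             % stop at an approximate stationary point
        x = x - (1/L)*gradf(x);             % gradient step
    end
    fprintf('f(x) = %g, ||gradf(x)|| = %g\n', f(x), norm(gradf(x)));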

However, from here on it gets complicated: We cannot say anything about the function values {f(x_k)} or about convergence of the iterates {x_k}. And even for convex functions {f} (which allow for more estimates from above and below) one needs some more effort to prove convergence of the function values to the globally minimal one.

But how about convergence of the iterates of the gradient method if convexity is not given? It turns out that this is a hard problem. As an illustration, consider the continuous case, i.e. a trajectory of the dynamical system

\displaystyle  \dot x = -\nabla f(x)

(which is a continuous limit of the gradient method as the stepsize goes to zero). A physical intuition for this dynamical system in {{\mathbb R}^2} is as follows: the function {f} describes a landscape and {x} are the coordinates of an object. If the landscape is slippery, the object slides down the landscape, and if we omit friction and inertia, it always slides in the direction of the negative gradient. Consider now a favorable situation: {f} is smooth, bounded from below, and the level sets {\{f\leq t\}} are compact. What can one say about the trajectories of {\dot x = -\nabla f(x)}? Well, it seems clear that one will arrive at a local minimum after some time. But with a little imagination one can see that the trajectory of {x} does not even have to be of finite length! To see this, consider a landscape {f} that is a kind of bowl-shaped valley with a path which goes down the hillside in a spiral such that it winds around the minimum infinitely often.

This situation seems somewhat pathological and one usually does not expect something like this in practice. If you have tried to prove convergence of the iterates of gradient or subgradient descent, you may have wondered why the proof turns out to be so complicated. The reason lies in the fact that such pathological functions are not excluded. But what functions should be excluded in order to avoid this pathological behavior without restricting to overly simple functions?

3. The Kurdyka-Łojasiewicz inequality

Here the so-called Kurdyka-Łojasiewicz inequality comes into play. I do not know its history well, but if you want a pointer, you could start with the paper “On gradients of functions definable in o-minimal structures” by Kurdyka.

The inequality provides a way to turn a complexity estimate for the gradient of a function into a complexity estimate for the function values. Hence, one would like to control the difference in function value by the gradient. One way to do so is the following:

Definition 1 Let {f} be a real-valued function and assume (without loss of generality) that {f} has a unique minimum at {0} and that {f(0)=0}. Then {f} satisfies a Kurdyka-Łojasiewicz inequality if there exists a differentiable function {\kappa:[0,r]\rightarrow {\mathbb R}} on some interval {[0,r]} with {\kappa'>0} and {\kappa(0)=0} such that

\displaystyle  \|\nabla(\kappa\circ f)(x)\|\geq 1

for all {x} such that {0<f(x)<r}.

Informally, this definition ensures that one can “reparameterize the range of the function such that the resulting function has a kink in the minimum and is steep around that minimum”. This definition is due to the above paper by Kurdyka from 1998. In fact it is a slight generalization of the Łojasiewicz inequality (which dates back to a note of Łojasiewicz from 1963) which states that there are some {C>0} and some exponent {\theta} such that in the above situation it holds that

\displaystyle  \|\nabla f(x)\|\geq C|f(x)|^\theta.

To see that, take {\kappa(s) = s^{1-\theta}} and evaluate the gradient to {\nabla(\kappa\circ f)(x) = (1-\theta)f(x)^{-\theta}\nabla f(x)} to obtain {1\leq (1-\theta)|f(x)|^{-\theta}\|\nabla f(x)\|}, i.e. the Łojasiewicz inequality with {C = 1/(1-\theta)}. This also makes clear that whenever the inequality is fulfilled, the gradient provides control over the function values.
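For a concrete one-dimensional example, take {f(x) = x^2}: one has

\displaystyle  \|\nabla f(x)\| = 2|x| = 2|f(x)|^{1/2},

i.e. the Łojasiewicz inequality holds with {C=2} and {\theta=1/2}, and correspondingly {\kappa(s)=\sqrt{s}} gives {\|\nabla(\kappa\circ f)(x)\|=1} for all {x\neq 0} in Definition 1.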

The works of Łojasiewicz and Kurdyka show that a large class of functions {f} fulfill the respective inequalities, e.g. piecewise analytic functions and even a larger class (functions definable in so-called o-minimal structures) which I haven’t fully understood yet. Since the Kurdyka-Łojasiewicz inequality allows one to turn estimates on {\|\nabla f(x_k)\|} into estimates on {|f(x_k)|}, it plays a key role in the analysis of descent methods. It somehow explains why one basically never sees pathological behavior such as infinite minimization paths in practice. Lately there have been several works on further generalizations of the Kurdyka-Łojasiewicz inequality to the non-smooth case, see e.g. “Characterizations of Łojasiewicz inequalities: subgradient flows, talweg, convexity” by Bolte, Daniilidis, Ley and Mazet, and “Convergence of non-smooth descent methods using the Kurdyka-Łojasiewicz inequality” by Noll (however, I do not try to give an overview of the latest developments here). In particular, here at the French-German-Polish Conference on Optimization, which takes place these days in Krakow, the Kurdyka-Łojasiewicz inequality has popped up several times.


This is the last post in a series of posts (which was not intended to be a series). The series started with this post, in which I reported on some experience with job interviews I had in spring last year, and continued with this post, which told what happened afterwards. The story ended with the paragraph “Now the final steps will be: Preparation of the next list of wishes, negotiations to stay and finally, the most difficult part, forming a well-founded decision.” That was about eleven months ago and was not totally wrong. Here is what happened then:

1. The Bleibeverhandlung

The Bleibeverhandlung (the negotiation to stay) is in principle similar to the negotiations with other universities. But there are important differences: The first one is due to the fact that no official committee of the department has been involved so far and hence, the department has to form an opinion on how much they want to keep you. As far as I know this usually happens in an informal way and can be very different in different places; also the dean (or, in my case, the “speaker of the department”) may handle this in his own way. Moreover, you will not be involved in this step in any way (I think). The amount of support you get is crucial for the next steps. The department has to report its opinion to the dean (who is basically in control of the money here) and the dean has to decide on a strategy for how to keep you (if your support is strong enough). Again, this is very different from case to case. I do not know too much about this process either, but at least there will be some communication between the dean and the president (or their offices). After this procedure, the next steps of the negotiations are basically the same as before: first negotiations with the dean, then with the president. Again, the first negotiation is somehow more important, as many things are handled by the dean. In my case there was the question of on which kind of position the department could keep me and how it could be arranged to fill this position with me. I had been a Juniorprofessor (basically equal to an assistant professor) and according to German law, there is no way of promotion to the “next level”. The university had to find an “open position for a professor”. There was such a position (which I knew before, since I had a tenure track position). The next obstruction was that, again according to German law, there usually has to be an official public announcement if such a position is to be filled. Consequently, anyone who is qualified could apply and should have a positive probability of getting the job. However, I learned that my state has the possibility to fill a position without public announcement, and it was part of my offer that my university “offered to start the initiation of this procedure”. This is somewhat difficult to translate, but the university could not even promise to carry out this “filling without public announcement” because it is something on which several committees have to agree.

2. The decision

Well, I had the official offer of my university pretty quickly after the negotiations. Basically, it was on par with the other offers (slightly better in some respects, slightly worse in others – no clear winner). The only caveat was that there was no guarantee that I would get a permanent position, because this depended on the decisions of more committees. However, I had a formal statement that the president and the dean would initiate and support the procedure. Moreover, colleagues told me that my university had done a great job in keeping its promises in similar respects.

So, the decision was not easy. However, I decided not to play “ping-pong” with the other universities (which would have been possible – but I cannot tell you how that works) and to decide on the basis of the facts I had after one round of negotiations. It was a tough and close decision on which I will not comment in more detail here. But I decided to stay at TU Braunschweig.

3. Another application

Correct: I had to write another application! Although the process to “fill the position without public announcement” was granted by all committees (which took quite some time), that still meant that there had to be a normal procedure for an appointment: A committee had to be formed which had to agree on criteria for the position and had to ask me to apply for it. Then I applied and was even invited to give a talk, a test lecture and a job interview (all at my own university). Usually, there would also have to be external referee reports, but somebody found that there was the possibility to have enough external committee members present at the presentation and the job interview, who then had their opinions recorded. After that, the committee had to form a decision and a one-person list which went the usual way through the administration. Then I got the offer in the official way and was even asked for another round of negotiations. But since the university had really carried out the whole procedure exactly as promised, I did not see any reason to insist on further negotiations. They had done everything as we had agreed, and so I just said that I agreed to take the position on the basis of the offer they had formulated. This went smoothly and I accepted (officially by mail). The next step was that I had to go to the Amtsarzt (translates to “public health officer”) who had to check if I was healthy enough for a lifetime professorship. (I had been at the Amtsarzt before I started my Juniorprofessorship, but then he only checked if I was healthy enough for six years…) Also this went smoothly and a few days ago there was my “appointment for life”. (Last remark: I was surprised that I did not have to repeat the official oath (Amtseid), but I had taken it when I started the Juniorprofessorship, and it remains valid for a lifetime.)

End of the story – indeed a happy end for me. The procedure was quite slow – but as far as I’ve heard, all the people involved did their best to make it as quick as possible, and I am very thankful for the effort and support of many people. It is just an awfully complicated procedure to appoint a professor in Germany; it consists of many steps, and many people and committees are involved…

I found this draft of a post in my backlog and decided that it would be better to finish it than to leave it unpublished. Actually, the story happened already over a year ago.

Some months ago I stumbled upon this piece of shadow art

by Fred Eerdekens, and a few days later I received the information that my university was going to celebrate its yearly “open house event” this year as “TU Night”. The theme of the TU Night was “Night, Light, Energy” and all members were invited to submit ideas for talks, experiments and exhibits.

The piece of shadow art prompted the question: “If this weird piece of metal can cast this sentence as a shadow, wouldn’t it be possible to produce another piece of metal that casts two different sentences as shadows when illuminated from different directions?” Together with the announcement of the upcoming TU Night, I wondered if one could even produce an exhibit like this.

Since I am by no means an artist, I looked around at my university and found that there is a Department of Architecture. Since architects are much closer to being artists than I am, I contacted the department and proposed a collaboration, and, well, Henri Greil suggested having a joint seminar on this topic. Hence, this summer term I had the experience of working with students of architecture.

In the end, the students produced very nice pieces of shadow art:

(Three photos of the students’ shadow art exhibits.)

Although the exhibits produced interesting and unexpected shadows, no group of students managed to produce two different shadows from the same object.

However, some nice effects can be produced pretty easily:

The basic idea is that moving the object around will move both shadows around rather independently. Well, this is not totally true, but what you can do is “zoom” one shadow while moving the other sideways (just move the object straight towards one light source). See this movie for a small illustration:

I also did my best to produce a more complex object. While it is theoretically not very difficult to see that some given shadows are possible in some given projection geometry, it is not at all straightforward to construct the object theoretically (not to speak of the real-world problems in actually building the piece). I tried hard but could not do better than this:

In this post I will explore a bit the question of how to calculate the discrete gradient of an image and the discrete divergence of a vector field.

Let {u_0\in{\mathbb R}^{N\times M}} be a discrete image, i.e. {u_0(i,j)} denotes the gray value at the {i,j}-th pixel. The famous total variation denoising amounts to minimizing the functional

\displaystyle \Phi(u) = \tfrac12 \int (u-u_0)^2 + \lambda\int|\nabla u|

where the integrals shall be understood as summation, {\nabla} stands for the gradient, i.e. the vector of the partial derivatives, and {|\cdot|} stands for the Euclidean absolute value.
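In MATLAB notation, the discrete functional is just a few characters (a sketch, assuming that {u_0} and {\lambda} are given as u0 and lambda and that grad is one of the gradient operations constructed below, returning an {N\times M\times 2} array as in methods 1–3):

    % discrete TV denoising functional
    Phi = @(u) 0.5*sum((u(:)-u0(:)).^2) ...
          + lambda*sum(sum(sqrt(sum(grad(u).^2,3))));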

When using primal-dual methods, it is of crucial importance that the operators used for the gradient and the divergence are numerically adjoint (up to the minus sign). That is, the numerical operations are adjoint in the following sense: If grad is the operation for the gradient and div is the operation for the divergence, then it should hold for any variables {u} and {v} of suitable size and with gradu = grad(u), divv = div(v) that sum(gradu(:).*v(:)) and -sum(u(:).*divv(:)) are equal up to numerical precision. Due to the boundary treatment, the built-in MATLAB operations gradient and divergence do not fulfill this requirement.
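This property is easy to check numerically; here is a small sketch (with grad and div standing for any of the implementations below):

    % check adjointness up to the minus sign with random inputs
    N = 100; M = 120;
    u = randn(N,M);
    v = randn(N,M,2);
    gradu = grad(u);
    divv  = div(v);
    fprintf('adjointness error: %g\n', ...
        abs(sum(gradu(:).*v(:)) + sum(u(:).*divv(:))));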

The most common discretization of the gradient uses discrete forward differences and constant padding at the boundary (which means that Neumann boundary values are applied). In formulas, this reads as

\displaystyle (D_xu)_{i,j} = \begin{cases} u_{i+1,j} - u_{i,j} & i<N\\ 0 & i=N \end{cases} \qquad (D_yu)_{i,j} = \begin{cases} u_{i,j+1} - u_{i,j} & j<M\\ 0 & j=M \end{cases}

The respective adjoints are, up to the minus sign, backward differences with zero boundary treatment (check it!). There are many different ways to implement these routines (and the respective adjoints) in MATLAB. Here are four of them:

  1. For loops: Run through all pixels in two for-loops and assign the difference to the output. Of course, one should preallocate the output prior to the loop. But you probably know the first two rules of MATLAB coding? If not, here they are: 1. Avoid for-loops. 2. Avoid for-loops, seriously. I put the routines into extra functions (a sketch of such functions appears after this list) and created anonymous functions to call the gradient and the divergence as
    grad = @(u) cat(3,dxp(u),dyp(u));
    div = @(V) dxm(V(:,:,1)) + dym(V(:,:,2));
    
  2. Shift and subtract: MATLAB is great at using vectors and matrices, and to avoid the for-loop one could also implement the forward difference in {x}-direction (which runs along the first index) by shifting the matrix and subtracting the original one, i.e. [u(2:end,:); u(end,:)] - u (and similarly for the other differences). Again, I wrote extra functions and used anonymous functions as above.
  3. Small sparse matrices from the left and from the right: MATLAB is also pretty good with sparse matrices. Since the derivative in {x}-direction only involves the subtraction of two elements in the same column, one can realize it by multiplying the image from the left with a sparse matrix with just two non-zero diagonals. Similarly, the derivative in {y}-direction can be realized by multiplying from the right with a suitable matrix. More precisely, this approach is realized by
    Dx = spdiags([-ones(N,1) ones(N,1)],[0 1],N,N);
    Dx(N,:) = 0;
    Dy = spdiags([-ones(M,1) ones(M,1)],[0 1],M,M);
    Dy(M,:) = 0;
    Dxu = Dx*u;
    Dyu = u*Dy';
    

    (check it). Note that the adjoint of the {x}-derivative is simply the operation Dx'*u and the adjoint of the {y}-derivative is u*Dy; the divergence is minus the sum of these adjoints applied to the two components. Together, the calculation of the gradient and the divergence was done by the anonymous functions

    grad = @(u) cat(3,Dx*u,u*Dy');
    div = @(V) -(Dx'*V(:,:,1) + V(:,:,2)*Dy);
    
  4. Large sparse matrices: One could think about the following: vectorize the image by U = u(:) (which amounts to stacking the columns above each other). Then assemble large {NM\times NM} sparse matrices which have just two non-zero diagonals to do the forward (and other) differences. More precisely, this can be done (with Dx and Dy from above) by

    DX = kron(speye(M),Dx);
    DY = kron(Dy,speye(N));
    DxU = DX*U;
    DyU = DY*U;
    

    Here it is clear that the respective adjoints are just multiplications with the transposed matrices (and the divergence again carries the minus sign). Here, the anonymous functions are

    grad = @(U) [DX*U DY*U];
    div = @(V) -(DX'*V(:,1) + DY'*V(:,2));
    
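The extra functions used in methods 1 and 2 are not shown above; as an illustration, here is a sketch of how the forward difference dxp and the backward difference dxm (which is the negative adjoint of dxp, as the test above confirms) could look in the shift-and-subtract style; dyp and dym work analogously along the second index:

    function d = dxp(u)
    % forward difference in x-direction with constant padding
    % (the last row of differences is zero)
    d = [u(2:end,:); u(end,:)] - u;
    end

    function d = dxm(v)
    % backward difference in x-direction with zero boundary treatment
    z = zeros(1,size(v,2));
    d = [v(1:end-1,:); z] - [z; v(1:end-1,:)];
    end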

The different approaches have different pros and cons. Well, the for-loop only has cons: It is presumably slow and uses a lot of indexing which easily leads to bugs. The shift-and-subtract method should go into an extra function to make it easy to use – but this is not necessarily a drawback. For the multiplication with the large matrices, one has to vectorize the image first and, every time one wants to look at the result, one needs to do reshape(U,N,M). But let’s focus on speed: I implemented all methods, let them run on square images of different sizes 50 times each (after a warmup) and measured the execution times with tic and toc. The assembly of the matrices did not enter the timing. Moreover, no parallel computation was used – just a single core. Finally, memory was not an issue since the largest arrays (of size {2500\times 2500\times 2}) only need roughly 100MB of memory.
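For completeness, the timing loop looked roughly like this (a sketch; grad stands for the variant being timed):

    % average runtime over 50 runs after a warmup
    N = 500; u = randn(N,N); reps = 50;
    grad(u);                    % warmup
    tic;
    for r = 1:reps
        g = grad(u);
    end
    fprintf('N = %d: %.4f s per gradient\n', N, toc/reps);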

Here is the table with the average times for one calculation of the gradient (in seconds):

{N}    For-loop   Shift-subtract   Small matrices (left/right)   Large matrix (left)
100    0.0004     0.0002           0.0001                        0.0003
200    0.0018     0.0010           0.0005                        0.0015
300    0.0057     0.0011           0.0014                        0.0020
400    0.0096     0.0031           0.0022                        0.0035
500    0.0178     0.0035           0.0030                        0.0054
750    0.0449     0.0114           0.0097                        0.0123
1000   0.0737     0.0189           0.0128                        0.0212
1500   0.2055     0.0576           0.0379                        0.0601
2000   0.3942     0.0915           0.0671                        0.1136
2500   0.6719     0.1571           0.1068                        0.1788

and here is the one for the divergences:

{N}    For-loop   Shift-subtract   Small matrices (left/right)   Large matrix (left)
100    0.0004     0.0003           0.0002                        0.0002
200    0.0018     0.0015           0.0005                        0.0008
300    0.0048     0.0016           0.0015                        0.0012
400    0.0090     0.0036           0.0020                        0.0022
500    0.0158     0.0057           0.0027                        0.0035
750    0.0409     0.0132           0.0073                        0.0069
1000   0.0708     0.0238           0.0130                        0.0125
1500   0.2008     0.0654           0.0344                        0.0370
2000   0.3886     0.1285           0.0622                        0.0671
2500   0.6627     0.2512           0.1084                        0.1361

As expected, the for-loop is clearly slower and, also as expected, all methods basically scale quadratically (doubling {N} amounts to multiplying the running time by four) since the work per pixel is constant. A little surprisingly, the multiplication from the left and from the right with the small matrices is fastest, consistently a little faster than the multiplication from the left with the large sparse matrix. I don’t know why the results are that different for the gradient and for the divergence. Maybe this is related to my use of anonymous functions or the allocation of memory?

The second day of SSVM started with an invited lecture by Tony Lindeberg, who has written a very influential and very early book about scale space theory. His talk was both a tour through scale space and a recap of the recent developments in the field. In particular, he showed how the time aspect can be incorporated into scale space analysis by a close inspection of how receptive fields work. There were more talks, but I only took notes on the talk of Jan Lellmann, who spoke about the problem of generating an elevation map from a few given level lines. One application of this could be to observe coastlines at different tides and then try to reconstruct the full height map at the coast. One specific feature here is that the surface one looks for may have ridges which stem from kinks in the level lines, and these ridges are important features of the surface. He argued that a pure convex regularization will not work and proposed to use more input, namely a vector field which is derived from the contour lines such that the vectors somehow “follow the ridges”, i.e. connect the level lines in a correct way.

Finally, another observation from today: well, this is not a trend, but a notion which I heard here for the first time and which sounds very natural: the informal classification of data terms in variational models as “weak” or “strong”. For example, a denoising data term {\|u-u^0\|^2_{L^2(\Omega)}} is a strong data term because it gives tight information on the whole set {\Omega}. On the other hand, an inpainting data term {u|_{\Omega\setminus\Omega'} = u^0|_{\Omega\setminus\Omega'}} is a weak data term because it basically tells nothing within the region {\Omega'}.

In the afternoon the whole conference went on a tour to three amazing places:

  • the Riegersburg, which is not only an impressive castle but also features interesting exhibitions about old arms and witches,
  • the Zotter chocolate factory where they make amazing chocolate in mind-boggling varieties,
  • and to Schloss Kronberg for the conference dinner (although it was pretty tough to start eating the dinner after visiting Zotter…).

I am at SSVM 13 and the first day is over. The conference is in a very cozy place in Austria, namely Schloss Seggau, which is located on a small hill near the small town of Leibnitz in Styria (which is in Austria but totally unaffected by the floods). The conference is also not very large, but there is a good crowd of interesting people here and the program looks interesting.

Schloss Seggau

The time schedule is tight and I don’t know if I will manage to post something every day. In particular, I will not be able to blog about every talk.

From the first day I have the impression that two things have appeared frequently at different places:

  • Differential geometry: Tools from differential geometry, like the notion of geodesics or Riemannian metrics, have been used not only by Martin Rumpf to model different types of spaces of shapes, but also to interpolate between textures.
  • Primal dual methods: Variational models should be convex these days and if they are not you should make them convex somehow. Then you can use primal dual methods. Somehow the magic of primal dual methods is that they are incredibly flexible. Also they work robustly and reasonably fast.

Probably, there are more trends to see in the next days.
