I am not an optimizer by training. My road to optimization went through convex analysis. I started with variational methods for inverse problems and mathematical imaging with the goal of deriving properties of minimizers of convex functions. Hence, I studied a lot of convex analysis. Later I got interested in how to actually solve convex optimization problems and started to read books about (convex) optimization. At first I was always distracted by the way optimizers treated constraints. To me, a convex optimization problem always looks like

$$\min_x F(x)$$

with a convex function $F:\mathbb{R}^n \to \mathbb{R}\cup\{\infty\}$.
Everything can be packed into the convex objective. If you have a convex objective $F$ and a constraint $g(x)\le 0$ with a convex function $g$, just take $F + i_{\{g\le 0\}}$, i.e., add the indicator function

$$i_{\{g\le 0\}}(x) = \begin{cases} 0, & g(x)\le 0,\\ \infty, & \text{else,}\end{cases}$$

of the constraint to the objective (for some strange reason, Wikipedia has the name and notation for indicator and characteristic function the other way round than I, and many others, do…). Similarly for multiple constraints or linear equality constraints and such.
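To make this concrete, here is a small sketch in Python with a toy objective and constraint of my own choosing (the function names `indicator`, `packed` and the specific problem are just for illustration): packing the constraint $x \le 1$ into the objective $(x-2)^2$ via an indicator function, then minimizing the packed function over a grid.

```python
import numpy as np

def indicator(feasible):
    """i_C(x): 0 if x is in C, +inf otherwise."""
    return lambda x: 0.0 if feasible(x) else np.inf

# toy problem: minimize F(x) = (x - 2)^2 subject to x <= 1,
# packed into a single unconstrained objective F + i_{x <= 1}
F = lambda x: (x - 2.0)**2
i_C = indicator(lambda x: x <= 1.0)
packed = lambda x: F(x) + i_C(x)

# crude grid search; the constrained minimizer sits on the boundary near x = 1
xs = np.linspace(-2.0, 3.0, 5001)
x_star = xs[np.argmin([packed(x) for x in xs])]
print(x_star)
```

The infeasible grid points evaluate to $+\infty$ and are automatically ignored by the minimization, which is exactly the point of the indicator-function reformulation.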
In this simple world it is particularly easy to characterize all solutions of convex minimization problems: They are just those $x$ for which

$$0 \in \partial F(x).$$
Simple as that. Just take the subgradient of the objective and that’s it.
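A minimal numerical check of this optimality condition, on a toy function of my own (not from the post): $F(x) = x^2 + |x|$ has subdifferential $\partial F(0) = [-1,1]$ at its minimizer $x = 0$, so indeed $0 \in \partial F(0)$. The sketch verifies the subgradient inequality $F(x) \ge F(0) + g\,x$ on a grid for a sample of $g \in [-1,1]$:

```python
import numpy as np

# F(x) = x**2 + |x| is minimized at x = 0, where dF(0) = [-1, 1] contains 0.
F = lambda x: x**2 + np.abs(x)

xs = np.linspace(-2.0, 2.0, 2001)
# every g in [-1, 1] must satisfy F(x) >= F(0) + g * (x - 0) for all x
ok = all(np.all(F(xs) >= F(0.0) + g * xs - 1e-9)
         for g in np.linspace(-1.0, 1.0, 21))
print(ok)   # True: every sampled g in [-1, 1] is a subgradient at 0
```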
When reading the optimization books and seeing how difficult the treatment of constraints is there, I was especially puzzled by how complicated optimality conditions such as the KKT conditions look in contrast to $0\in\partial F(x)$, and also by the notion of constraint qualifications.
These constraint qualifications are additional assumptions that are needed to ensure that a minimizer fulfills the KKT conditions. For example, if one has constraints $g_i(x)\le 0$ ($i=1,\dots,m$), then the linear independence constraint qualification (LICQ) states that the gradients $\nabla g_i(x)$ of all constraints that are “active” at $x$ (i.e. $g_i(x)=0$) have to be linearly independent.
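As a sketch of what checking LICQ at a point amounts to (toy constraints of my own, not from the post): collect the gradients of the active constraints and test their linear independence via the rank of the stacked gradient matrix.

```python
import numpy as np

# LICQ at x: gradients of the active constraints (g_i(x) = 0) must be
# linearly independent.
def licq_holds(gs, grads, x, tol=1e-9):
    A = [dg(x) for g, dg in zip(gs, grads) if abs(g(x)) <= tol]
    return len(A) == 0 or np.linalg.matrix_rank(np.array(A)) == len(A)

# toy setup 1: unit disc and right half-plane -- independent gradients
gs1    = [lambda x: x[0]**2 + x[1]**2 - 1.0, lambda x: -x[0]]
grads1 = [lambda x: np.array([2*x[0], 2*x[1]]), lambda x: np.array([-1.0, 0.0])]
print(licq_holds(gs1, grads1, np.array([0.0, 1.0])))   # True: (0,2), (-1,0) independent

# toy setup 2: the same disc written down twice -- LICQ fails on the boundary
gs2    = [gs1[0], lambda x: 2*(x[0]**2 + x[1]**2 - 1.0)]
grads2 = [grads1[0], lambda x: np.array([4*x[0], 4*x[1]])]
print(licq_holds(gs2, grads2, np.array([0.0, 1.0])))   # False: (0,2), (0,4) dependent
```

The second setup shows that even a harmless-looking redundant constraint breaks LICQ, since it duplicates a gradient direction.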
It took me a while to realize that there is a similar issue in my simple “convex analysis view” on optimization: When passing from the gradient of a function to the subgradient, many things stay as they are. But not everything. One thing that does change is the simple sum rule. If $F$ and $G$ are differentiable, then $\nabla(F+G)(x) = \nabla F(x) + \nabla G(x)$, always. That’s not true for subgradients! You always have that $\partial F(x) + \partial G(x) \subset \partial(F+G)(x)$. The reverse inclusion is not always true but holds, e.g., if there is some point at which $F$ is finite and $G$ is continuous. At first glance this sounds like a very weak assumption. But in fact, this is precisely in the spirit of constraint qualifications!
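The easy inclusion can be checked numerically on a toy pair of functions (my own choice, not from the post): $F(x) = |x|$, $G(x) = |x-1|$. At $x_0 = 0$ we have $\partial F(0) = [-1,1]$ and $\partial G(0) = \{-1\}$, and every sum $g_1 + g_2$ from these sets must satisfy the subgradient inequality for $F + G$ at $0$:

```python
import numpy as np

F = lambda x: np.abs(x)
G = lambda x: np.abs(x - 1.0)

xs = np.linspace(-3.0, 3.0, 601)
# check F(x) + G(x) >= F(0) + G(0) + (g1 + g2) * x for sampled g1 in dF(0),
# with g2 = -1 the only element of dG(0)
ok = all(np.all(F(xs) + G(xs) >= F(0.0) + G(0.0) + (g1 - 1.0) * xs - 1e-9)
         for g1 in np.linspace(-1.0, 1.0, 21))
print(ok)   # True: each sum g1 + g2 is a subgradient of F + G at 0
```

Here both functions are finite and continuous everywhere, so the reverse inclusion holds too; the interesting failures happen with indicator functions, as below.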
Take two constraints $g_1(x)\le 0$ and $g_2(x)\le 0$ with convex and differentiable $g_1$, $g_2$. We can express these by the indicator functions $i_{\{g_i\le 0\}}$ ($i=1,2$). Then it is equivalent to write

$$\min_x F(x)\quad\text{s.t.}\quad g_1(x)\le 0,\ g_2(x)\le 0$$

as the unconstrained problem

$$\min_x F(x) + i_{\{g_1\le 0\}}(x) + i_{\{g_2\le 0\}}(x).$$
So characterizing solutions to either of these is just saying that $0 \in \partial\big(F + i_{\{g_1\le 0\}} + i_{\{g_2\le 0\}}\big)(x)$. Oh, there we are: Are we allowed to pull the subgradient apart? We need to apply the sum rule twice, and at some point we need that there is a point at which one summand is finite and the other one is continuous (or vice versa)! But an indicator function is only continuous in the interior of the set where it is finite. So the simplest form of the sum rule only holds in the case where only one of the two constraints is active! Actually, the sum rule holds in many more cases, but it is not always simple to find out if it really holds in some particular case.
So, constraint qualifications are indeed similar to rules that guarantee that a sum rule for subgradients holds.
Geometrically speaking, both shall guarantee that if one “looks at the constraints individually” one can still see what is going on at points of optimality. It may well be that the sum of the individual subgradients is too small to get any points with $0 \in \partial F(x) + \partial i_{\{g_1\le 0\}}(x) + \partial i_{\{g_2\le 0\}}(x)$, but still there are solutions to the optimization problem!
As a very simple illustration take the constraints $x_2 \le 0$ and $x_1^2 - x_2 \le 0$ in two dimensions. The first constraint says “be in the lower half-plane” while the second says “be above the parabola $x_2 = x_1^2$”. Now take the point $x = (0,0)$ which is on the boundary of both sets $C_1 = \{x : x_2 \le 0\}$ and $C_2 = \{x : x_1^2 \le x_2\}$. It’s simple to see (geometrically and algebraically) that $\partial i_{C_1}(0) = \{0\}\times[0,\infty)$ and $\partial i_{C_2}(0) = \{0\}\times(-\infty,0]$, so treating the constraints individually gives $\partial i_{C_1}(0) + \partial i_{C_2}(0) = \{0\}\times\mathbb{R}$. But the full story is that $C_1\cap C_2 = \{(0,0)\}$, thus $\partial i_{C_1\cap C_2}(0) = \mathbb{R}^2$ and consequently, the subgradient is much bigger.
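A small numerical sketch of this example (the sampling and the helper `in_normal_cone` are my own scaffolding): subgradients of an indicator $i_C$ at $0$ are exactly the normals $v$ with $\langle v, x\rangle \le 0$ for all $x \in C$. Every sum of a normal to $C_1$ and a normal to $C_2$ at $0$ has first component $0$, so $v = (1,0)$ cannot be such a sum, even though it is (trivially) normal to the singleton intersection. The check below confirms on sampled points that $v = (1,0)$ is normal to neither set individually:

```python
import numpy as np

# C1 = {x : x2 <= 0}, C2 = {x : x2 >= x1**2}; C1 ∩ C2 = {(0, 0)}
t = np.linspace(-1.0, 1.0, 201)
C1_pts = np.stack([t, -np.abs(t)], axis=1)   # some points of C1
C2_pts = np.stack([t, t**2], axis=1)         # boundary points of C2

def in_normal_cone(v, pts):
    # necessary condition, checked on sampled points only:
    # <v, x - 0> <= 0 for all x in the set
    return bool(np.all(pts @ v <= 1e-12))

v = np.array([1.0, 0.0])
print(in_normal_cone(v, C1_pts))   # False: v is not normal to C1 at 0
print(in_normal_cone(v, C2_pts))   # False: v is not normal to C2 at 0
# ...yet v IS a subgradient of the indicator of C1 ∩ C2 = {(0,0)} at 0,
# since the subgradient inequality is vacuous on a single point.
```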