I am at IFIP TC7 and today I talked about the inertial primal-dual forward-backward method Tom Pock and I developed in this paper (find my slides here). I got a few interesting questions, and one was about the heavy-ball method.

I used the heavy-ball method by Polyak as a motivation for the inertial primal-dual forward-backward method: To minimize a convex function ${F}$, Polyak proposed the heavy-ball method $\displaystyle y_k = x_k + \alpha_k(x_k-x_{k-1}),\qquad x_{k+1} = y_k - \lambda_k \nabla F(x_k) \ \ \ \ \ (1)$

with appropriate step sizes ${\lambda_k}$ and extrapolation factors ${\alpha_k}$. Polyak's motivation was as follows: The usual gradient descent ${x_{k+1} = x_k - \lambda_k \nabla F(x_k)}$ can be seen as a discretization of the ODE ${\dot x = -\nabla F(x)}$, and its comparatively slow convergence comes from the fact that, after discretization, the iterates start to “zig-zag” in directions that do not point straight towards the minimizer. Adding “inertia” to the iteration should help to keep the method on track towards the solution. So he proposed to start from the ODE ${\gamma\ddot x + \dot x = -\nabla F(x)}$ instead, which leads to his heavy-ball method.

After the talk, Peter Maaß asked me whether the heavy-ball method has an interpretation in which you do the usual gradient descent but change the function in each iteration (somehow in the spirit of the conjugate gradient method). Indeed, one can do the following: Write the iteration as $\displaystyle x_{k+1} = x_k - \lambda_k\Big[\tfrac{\alpha_k}{\lambda_k}(x_{k-1}-x_k) + \nabla F(x_k)\Big]$

and then observe that this is $\displaystyle x_{k+1} = x_k - \lambda_k \nabla G_k(x_k)$

with $\displaystyle G_k(x) = - \tfrac{\alpha_k}{2\lambda_k}\|x-x_{k-1}\|^2 + F(x).$

Hence, you indeed have a perturbed gradient descent, and the perturbation acts in such a way that it moves the minimizer of the objective a bit so that it lies more in the direction in which you were heading anyway and, moreover, pushes you away from the previous iterate ${x_{k-1}}$. This nicely contrasts with the original interpretation from (1), in which one takes the direction coming from the current gradient, but before going in this direction one moves a bit further in the direction in which one was already moving.
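The equivalence of the two interpretations is easy to check numerically. The following is a minimal sketch, not a tuned implementation: the quadratic test function ${F(x) = \tfrac12 x^TAx - b^Tx}$, the matrix `A`, the vector `b`, and the constant values of ${\lambda_k}$ and ${\alpha_k}$ are assumptions chosen purely for illustration. It runs the heavy-ball iteration (1) and the gradient descent on the changing function ${G_k}$ side by side.

```python
import numpy as np


def grad_F(x, A, b):
    """Gradient of the illustrative quadratic F(x) = 0.5 * x^T A x - b^T x."""
    return A @ x - b


def heavy_ball_step(x, x_prev, lam, alpha, A, b):
    """One step of (1): extrapolate, then step along the gradient at x_k."""
    y = x + alpha * (x - x_prev)
    return y - lam * grad_F(x, A, b)


def perturbed_gradient_step(x, x_prev, lam, alpha, A, b):
    """One gradient step on G_k(x) = -alpha/(2*lam) * ||x - x_prev||^2 + F(x)."""
    grad_G = -(alpha / lam) * (x - x_prev) + grad_F(x, A, b)
    return x - lam * grad_G


def run(step, A, b, x0, lam, alpha, n_iter):
    """Iterate a two-term recursion, starting with x_{-1} = x_0."""
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(n_iter):
        x_prev, x = x, step(x, x_prev, lam, alpha, A, b)
    return x


# Assumed test problem: a mildly ill-conditioned quadratic.
A = np.array([[10.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, 1.0])
x0 = np.zeros(2)
lam, alpha = 0.1, 0.5  # constant step size and extrapolation factor (assumed)

x_hb = run(heavy_ball_step, A, b, x0, lam, alpha, 200)
x_pg = run(perturbed_gradient_step, A, b, x0, lam, alpha, 200)
x_star = np.linalg.solve(A, b)  # exact minimizer of the quadratic

print("heavy ball:          ", x_hb)
print("perturbed gradient:  ", x_pg)
print("distance to minimizer:", np.linalg.norm(x_hb - x_star))
```

Both runs produce the same iterates up to floating-point rounding, since the two update rules differ only in how the terms are grouped.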