Optimal transport in a discrete setting goes as follows: Consider two vectors $p\in\mathbb{R}^m$ and $q\in\mathbb{R}^n$ with non-negative entries. You may also say that $p$ and $q$ are two probability distributions or discrete measures with the same total mass. A transport plan for $p$ and $q$ is a matrix $\pi\in\mathbb{R}^{m\times n}$ with non-negative entries such that

$$\sum_j \pi_{ij} = p_i\quad\text{and}\quad \sum_i \pi_{ij} = q_j.$$

The interpretation of $\pi$ is that $\pi_{ij}$ indicates how much of the mass sitting in the $i$-th entry of $p$ is transported to the $j$-th entry of $q$. To speak about *optimal* transport we add an objective function to the constraints, namely the cost $c_{ij}$ that says how much it costs to transport one unit from the $i$-th entry in $p$ to the $j$-th entry in $q$. Then the optimal transport problem is

$$\min_{\pi\geq 0}\ \sum_{i,j} c_{ij}\pi_{ij}\quad\text{s.t.}\quad \sum_j \pi_{ij} = p_i,\quad \sum_i \pi_{ij} = q_j.$$
The resulting $\pi$ is an optimal transport plan and the resulting objective value is the minimal cost at which $p$ can be transported to $q$. In fact, the minimization problem is a linear program, and not only that: it’s one of the best studied linear programs, and I am sure there is a lot that I don’t know about its structure (you may have a look at these slides by Jesus De Loera to get an impression of what is known about the structure of this linear program).

So it looks like discrete optimal transport is a fairly standard problem with standard solvers available. But all solvers have one severe drawback when it comes to large $m$ and $n$: The optimization variable $\pi$ has $mn$ entries. If $mn$ is too large to store, or to keep in memory, it seems that there is not much one can do. This is the memory bottleneck for optimal transport problems.
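To see the bottleneck concretely, here is a small sketch (not from the original post) that solves the optimal transport linear program with an off-the-shelf LP solver; NumPy and SciPy are assumed. Even for modest $m$ and $n$, the solver has to handle $mn$ unknowns and $m+n$ equality constraints:

```python
import numpy as np
from scipy.optimize import linprog

m, n = 5, 5
p = np.full(m, 1.0 / m)                  # source marginal
q = np.full(n, 1.0 / n)                  # target marginal
x = np.linspace(0, 1, m)
y = np.linspace(0, 1, n)
c = (x[:, None] - y[None, :])**2         # cost c_ij

# equality constraints on the flattened plan: row sums = p, column sums = q
A_rows = np.kron(np.eye(m), np.ones(n))  # sum_j pi_ij = p_i
A_cols = np.kron(np.ones(m), np.eye(n))  # sum_i pi_ij = q_j
res = linprog(c.ravel(),
              A_eq=np.vstack([A_rows, A_cols]),
              b_eq=np.concatenate([p, q]),
              bounds=(0, None))          # mn non-negative unknowns
pi = res.x.reshape(m, n)                 # an optimal transport plan
```

Here the plan has only $25$ entries, but the number of unknowns grows as $mn$, which is exactly what breaks down for large problems.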

**1. Kantorovich-Rubinstein duality**

In the case when the cost has the special form $c_{ij} = |x_i - x_j|$ (with $p$ and $q$ living on the same set of locations $x_i$) one can reduce the memory burden. This special cost makes sense if the indices $i$ and $j$ correspond to spatial locations, since then the cost is just the distance from location $i$ to location $j$. It turns out that in this case there is a simple dual optimal transport problem, namely

$$\max_f\ \sum_i f_i(p_i - q_i)\quad\text{s.t.}\quad |f_i - f_j| \leq |x_i - x_j|.$$

(This is a simple form of the Kantorovich-Rubinstein duality and works similarly if the cost is any other metric on the set of indices.) The new optimization problem is still linear, but the memory requirement is only $O(n)$ and not $O(n^2)$ anymore, and moreover the Lipschitz constraint only has to be imposed for neighboring indices, so there are only $O(n)$ constraints. This idea is behind the method from the paper Imaging with Kantorovich-Rubinstein discrepancy by Jan Lellmann, myself, Carola Schönlieb and Tuomo Valkonen.
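As an illustration (a sketch, not taken from the paper), the dual can be solved with a generic LP solver using only the $n$ unknowns $f_i$ and the neighboring Lipschitz constraints. For two discretized bumps shifted by $0.4$ the optimal value comes out close to $0.4$, as expected for the 1-Wasserstein distance. NumPy and SciPy are assumed:

```python
import numpy as np
from scipy.optimize import linprog

n = 50
x = np.linspace(0, 1, n)
p = np.exp(-(x - 0.3)**2 * 100); p /= p.sum()
q = np.exp(-(x - 0.7)**2 * 100); q /= q.sum()

# |f_{i+1} - f_i| <= |x_{i+1} - x_i|, two inequalities per neighbor pair
D = np.diff(np.eye(n), axis=0)           # forward difference matrix, shape (n-1, n)
h = np.diff(x)                           # neighbor distances
A_ub = np.vstack([D, -D])
b_ub = np.concatenate([h, h])

# maximize f.(p - q)  <=>  minimize -f.(p - q); fix f_0 = 0 (f is unique up to a constant)
bounds = [(0, 0)] + [(None, None)] * (n - 1)
res = linprog(-(p - q), A_ub=A_ub, b_ub=b_ub, bounds=bounds)
W1 = -res.fun                            # the optimal transport cost
```

Only an $n$-vector is optimized here, instead of an $n\times n$ plan.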

**2. Entropic regularization**

In this post I’d like to describe another method to break through the memory bottleneck of optimal transport. This method works for basically any cost but involves a little bit of smoothing/regularization.

We go from the linear program to a non-linear but still convex one by adding the negative entropy of the transport plan to the objective, i.e. we consider the objective

$$\sum_{i,j} c_{ij}\pi_{ij} + \gamma\sum_{i,j} \pi_{ij}(\log \pi_{ij} - 1)$$

for some $\gamma > 0$.

What’s the point of doing so? Let’s look at the Lagrangian: For the constraints $\sum_j \pi_{ij} = p_i$ and $\sum_i \pi_{ij} = q_j$ we introduce the ones vector $\mathbb{1}$, write them as $\pi\mathbb{1} = p$ and $\pi^T\mathbb{1} = q$, add Lagrange multipliers $\alpha$ and $\beta$ and get

$$L(\pi,\alpha,\beta) = \sum_{i,j} c_{ij}\pi_{ij} + \gamma\sum_{i,j}\pi_{ij}(\log\pi_{ij} - 1) + \alpha^T(\pi\mathbb{1} - p) + \beta^T(\pi^T\mathbb{1} - q).$$
The cool thing happens with the optimality condition when deriving with respect to $\pi_{ij}$:

$$0 = c_{ij} + \gamma\log\pi_{ij} + \alpha_i + \beta_j.$$
We can solve for $\pi_{ij}$ and get

$$\pi_{ij} = \exp\Bigl(-\tfrac{\alpha_i}{\gamma}\Bigr)\,\exp\Bigl(-\tfrac{c_{ij}}{\gamma}\Bigr)\,\exp\Bigl(-\tfrac{\beta_j}{\gamma}\Bigr).$$
What does that say? It says that the optimal $\pi$ is obtained from the matrix

$$M_{ij} = \exp\Bigl(-\tfrac{c_{ij}}{\gamma}\Bigr)$$
with rows and columns rescaled by vectors $u$ and $v$, respectively, i.e.

$$\pi = \operatorname{diag}(u)\, M\, \operatorname{diag}(v)\quad\text{with}\quad u_i = \exp(-\alpha_i/\gamma),\quad v_j = \exp(-\beta_j/\gamma).$$
This reduces the memory requirement for the optimization variable from $O(mn)$ to $O(m+n)$! The cost for doing so is the regularization by the entropy.

Actually, the existence of the vectors $u$ and $v$ also follows from Sinkhorn’s theorem, which states that for every square matrix $A$ with positive entries there exist diagonal matrices $D_1$ and $D_2$ with positive diagonals such that $D_1 A D_2 = S$ for some doubly stochastic matrix $S$ (i.e. one with non-negative entries and unit row and column sums). For our setting, replace “doubly stochastic” with “predefined positive row and column sums”. Also, the entropic regularization ensures that all entries of the transport plan are positive (especially non-zero).

But there is even more to the story: to calculate the vectors $u$ and $v$ you simply have to do the following iteration: starting from some positive $v^0$, set

$$u^{k+1} = \frac{p}{M v^k},\qquad v^{k+1} = \frac{q}{M^T u^{k+1}},$$

where the fraction means element-wise division. Pretty simple.

What the iteration does in the first step is simply to take the current $v$ and calculate a scaling $u$ such that the row sums of $\operatorname{diag}(u)\,M\,\operatorname{diag}(v)$ match $p$. In the second step it calculates the scaling $v$ such that the column sums match $q$. This iteration is also known as the Sinkhorn-Knopp algorithm.
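As a quick cross-check (not from the original post; a NumPy sketch using an $x$-based quadratic cost, so the scaling of $\gamma$ differs from the MATLAB snippet below), a couple of thousand of these scaling steps indeed produce a plan $\pi = \operatorname{diag}(u)\,M\,\operatorname{diag}(v)$ with the prescribed marginals:

```python
import numpy as np

n = 100
gamma = 0.1
x = np.linspace(0, 1, n)
p = np.exp(-(x - 0.2)**2 * 100); p /= p.sum()   # prescribed row sums
q = np.exp(-(x - 0.8)**2 * 100); q /= q.sum()   # prescribed column sums
C = (x[:, None] - x[None, :])**2
M = np.exp(-C / gamma)                           # exp(-cost/gamma)

u = np.ones(n); v = np.ones(n)
for _ in range(2000):
    u = p / (M @ v)      # scale so that row sums of diag(u) M diag(v) match p
    v = q / (M.T @ u)    # scale so that column sums match q

pi = u[:, None] * M * v[None, :]                 # pi = diag(u) M diag(v)
print(np.abs(pi.sum(axis=1) - p).max())          # row-sum error, tiny
print(np.abs(pi.sum(axis=0) - q).max())          # column-sum error, tiny
```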

This is pretty simple to code in MATLAB. Here is a simple code that does the above iteration (using the cost $c_{ij} = (i-j)^2$):

```matlab
% Parameters
gamma = 10;             % regularization strength for the entropy
maxiter = 100;          % maximum number of iterations
map = colormap(gray);
N = 100;                % size
x = linspace(0,1,N)';   % spatial coordinate
% marginals
p = exp(-(x-0.2).^2*10^2) + exp(-abs(x-0.4)*20); p = p./sum(p); % for column sums
q = exp(-(x-0.8).^2*10^2); q = q./sum(q);                       % for row sums
[i,j] = meshgrid(1:N);
M = exp(-(i-j).^2/gamma); % exp(-cost/gamma)
% initialize u and v
u = ones(N,1); v = ones(N,1);
% Sinkhorn-Knopp: iteratively scale rows and columns
for k = 1:maxiter
    % update u and v
    u = p./(M*v);
    v = q./(M'*u);
    % assemble pi (only for illustration purposes)
    pi = diag(v)*M*diag(u);
    % display pi (with marginals on top and to the left)
    imagesc([p'/max(p) 0; pi/max(pi(:)) q/max(q)])
    colormap(1-map)
    drawnow
end
```

Here are some results:

(From the left to the right: different values of $\gamma$. The first row of pixels is $p$, the last column is $q$ and in between there is $\pi$, all things normalized such that black is the maximal value in $p$, $\pi$ and $q$, respectively.)

You see that for large $\gamma$ the plan $\pi$ is much smoother and not as well localized as it should be for an optimal plan.

Oh, and here is an animation of 100 iterations of Sinkhorn-Knopp showing the result after both $u$ and $v$ have been updated:

There is a catch with this regularization: for small $\gamma$ the method runs into problems with under- and overflow: the entries in $u$ and $v$ become *very* small (or *very* large). One can fight this effect a bit, but I don’t have a convincing numerical method to deal with this yet. It seems that the entries of the optimal $u$ and $v$ really have to be incredibly small and large, and I mean *really* small and large.
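One remedy that has since become standard in the literature (so this is not the method of this post, just a hedged sketch): carry out the iteration in the log domain, i.e. store the dual variables $\gamma\log u$ and $\gamma\log v$ instead of $u$ and $v$, and replace the matrix-vector products by stabilized log-sum-exps. The helper name `sinkhorn_log` is made up for illustration; NumPy and SciPy are assumed:

```python
import numpy as np
from scipy.special import logsumexp

def sinkhorn_log(p, q, C, gamma, iters):
    """Sinkhorn-Knopp in the log domain: a = gamma*log(u), b = gamma*log(v)."""
    a = np.zeros(len(p))
    b = np.zeros(len(q))
    for _ in range(iters):
        # the same two scaling steps as before, but as stabilized log-sum-exps
        a = gamma * np.log(p) - gamma * logsumexp((b[None, :] - C) / gamma, axis=1)
        b = gamma * np.log(q) - gamma * logsumexp((a[:, None] - C) / gamma, axis=0)
    # pi_ij = u_i * exp(-C_ij/gamma) * v_j, assembled without ever forming u or v
    return np.exp((a[:, None] + b[None, :] - C) / gamma)

n = 50
x = np.linspace(0, 1, n)
p = np.exp(-(x - 0.4)**2 * 100); p /= p.sum()
q = np.exp(-(x - 0.6)**2 * 100); q /= q.sum()
C = (x[:, None] - x[None, :])**2

# gamma small enough that exp(-C/gamma) has entries down to about 1e-22
pi = sinkhorn_log(p, q, C, gamma=0.02, iters=5000)
print(np.abs(pi.sum(axis=1) - p).max())   # the marginals still match
print(np.abs(pi.sum(axis=0) - q).max())
```

The point is that all exponentials are taken after subtracting the maximum, so nothing under- or overflows even when $u$ and $v$ themselves would be far outside the range of double precision.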

While the Sinkhorn-Knopp algorithm is already from the 1960s, its application to optimal transport seems fairly new – I learned about it from talks by Gabriel Peyré and Bernhard Schmitzer, and the reference is Sinkhorn Distances: Lightspeed Computation of Optimal Transport (presented at NIPS 2013) by Marco Cuturi.

September 17, 2015 at 4:13 pm

Superb!

Naive question: this matrix M still has to be stored somewhere, so there is still a quadratic memory cost, right? And a super-quadratic computational cost?

September 18, 2015 at 8:45 am

Actually not. To iterate for $u$ and $v$ you only need to evaluate matrix-vector products. Since usually $M$ is given by some formula (e.g. $M_{ij} = \exp(-c_{ij}/\gamma)$), there is no need to build and store the matrix. It is even better: if the cost only depends on $i-j$, the application of $M$ is a convolution and hence methods with complexity $O(n\log n)$ (using the FFT) can be used. The computational cost per iteration is then also $O(n\log n)$.
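To illustrate the convolution remark (a sketch, not from the original discussion; `apply_M` is an illustrative helper, NumPy and SciPy assumed): for the cost $(i-j)^2$, the product $Mv$ is a convolution with the kernel $k \mapsto \exp(-k^2/\gamma)$, so it can be applied by FFT without ever forming the $n\times n$ matrix:

```python
import numpy as np
from scipy.signal import fftconvolve

n = 256
gamma = 10.0
offsets = np.arange(-(n - 1), n, dtype=float)   # all possible values of i - j
kernel = np.exp(-offsets**2 / gamma)

def apply_M(w):
    # (M w)_i = sum_j exp(-(i-j)^2/gamma) w_j, via FFT-based convolution
    return fftconvolve(kernel, w)[n - 1:2 * n - 1]

# compare against the explicit matrix on a random vector
rng = np.random.default_rng(0)
w = rng.random(n)
i, j = np.meshgrid(np.arange(n), np.arange(n), indexing='ij')
M = np.exp(-(i - j)**2 / gamma)
err = np.abs(apply_M(w) - M @ w).max()          # agreement up to rounding
```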

September 18, 2015 at 11:51 am

Thanks for the clarification, very nice indeed 🙂

March 1, 2016 at 4:30 pm

It seems the main iteration of the Sinkhorn-Knopp algorithm should be written in the alternating fashion:

\begin{align}
u^{n+1} &= \frac{p}{Mv^n}\\
v^{n+1} &= \frac{q}{M^T u^{n+1}}
\end{align}

March 1, 2016 at 4:37 pm

Oops, corrected, thanks! I think that the simultaneous update also works but is *much* slower…

March 1, 2016 at 5:08 pm

I am afraid it doesn’t, at least in my own implementation; that’s why I noticed the misprint. And I think there are no theoretical guarantees for its convergence.

March 3, 2016 at 12:30 pm

I’m not totally sure what is going on here. In my experiments the “simultaneous update” does converge to something that looks like a constant multiple of the true solution…

March 4, 2016 at 6:24 pm

First of all, the solutions $u$ and $v$ are not unique; they are determined precisely up to a multiplicative factor. Secondly, I observed that the simultaneous algorithm does not converge if it starts from an arbitrary $(u^0, v^0)$, but it does if it starts from $(u^0, Mu^0)$.

In fact, the alternating algorithm is just a coordinate descent method applied to a convex function of $u$ and $v$. But it is unclear how one can interpret the simultaneous update for this particular problem from the optimization point of view. Anyway, you are right: even if it converges, its speed of convergence is slower.

Sorry, it seems, I do not know how to write math formulas here.

April 26, 2016 at 4:53 pm

Nice blog post! Concerning overflow/underflow: have you tried to renormalize $u$ and $v$ after each iteration? In fact, as far as I can see, the stationary points of the dual seem to satisfy the fixed point equation with renormalization. There might still be numerical issues with small numbers, but at least they don’t add up.

April 27, 2016 at 4:15 pm

Not sure what you mean, actually. Indeed, the fixed point equation is invariant under that renormalization, but it does not help. There should be some other way to renormalize, using that one can add and subtract the same function from the dual variables $\alpha$ and $\beta$…

June 21, 2018 at 9:31 pm

Great blog post! However, I think you got your logic backwards when you discuss the application of Sinkhorn’s theorem. $M$ is obviously not a doubly stochastic matrix ($\sum_i M_{ij} > 1$ since $M_{ii} = 1$). However, $M$ is strictly positive in all its components. From Sinkhorn’s theorem it follows that there exists a decomposition $\pi = U M V$ with diagonal $U$ and $V$, where the row and column sums of $\pi$ are $p$ and $q$.

June 22, 2018 at 8:21 am

It’s quite unfortunate that $M$ changes its role in this paragraph. What I meant was, as you noted, that $M = \exp(-c/\gamma)$ has positive entries and hence can be rescaled by rows and columns to get $\pi$ (which is not doubly stochastic, but has predefined row and column sums). Gonna rewrite to make this more clear!