Optimal transport in a discrete setting goes as follows: Consider two vectors $p, q \in \mathbb{R}^N$ with non-negative entries and $\sum_j p_j = \sum_i q_i$. You may also say that $p$ and $q$ are two probability distributions or discrete measures with the same total mass. A transport plan for $p$ and $q$ is a matrix $\pi \in \mathbb{R}^{N\times N}$ with non-negative entries such that

$$\sum_i \pi_{ij} = p_j \quad\text{and}\quad \sum_j \pi_{ij} = q_i.$$
The interpretation of $\pi$ is that $\pi_{ij}$ indicates how much of the mass sitting in the $j$-th entry of $p$ is transported to the $i$-th entry of $q$. To speak about *optimal* transport we add an objective function to the constraints, namely a cost $c_{ij}$ that says how much it costs to transport one unit from the $j$-th entry in $p$ to the $i$-th entry in $q$. Then the optimal transport problem is

$$\min_{\pi \ge 0}\ \sum_{i,j} c_{ij}\,\pi_{ij} \quad\text{s.t.}\quad \sum_i \pi_{ij} = p_j,\ \ \sum_j \pi_{ij} = q_i.$$
The resulting $\pi$ is an optimal transport plan and the resulting objective value is the minimal cost at which $p$ can be transported to $q$. In fact, the minimization problem is a linear program and, not only that, it's one of the best studied linear programs; I am sure there is a lot that I don't know about the structure of this linear program (you may have a look at these slides by Jesus De Loera to get an impression of what is known about its structure).
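To make the linear-programming structure concrete, here is a tiny instance solved with a generic LP solver (a sketch of my own, not from the post; the marginals are made up and any LP solver would do):

```python
import numpy as np
from scipy.optimize import linprog

# tiny instance with cost c_ij = |i - j| and made-up marginals
N = 4
p = np.array([0.5, 0.3, 0.2, 0.0])   # column sums of pi
q = np.array([0.0, 0.2, 0.3, 0.5])   # row sums of pi
c = np.abs(np.subtract.outer(np.arange(N), np.arange(N))).astype(float)

# flatten pi row-major; each marginal condition is one equality constraint
A_eq = np.vstack([np.kron(np.eye(N), np.ones(N)),    # row sums of pi
                  np.kron(np.ones(N), np.eye(N))])   # column sums of pi
b_eq = np.concatenate([q, p])
res = linprog(c.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
pi = res.x.reshape(N, N)
print(round(res.fun, 6))  # minimal transport cost, here 1.6
```

Note that the variable has $N^2$ entries while there are only $2N$ constraints, which already hints at the memory issue discussed next.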

So it looks like discrete optimal transport is a fairly standard problem with standard solvers available. But all solvers have one severe drawback when it comes to large $N$: the optimization variable $\pi$ has $N^2$ entries. If $\pi$ is too large to store or keep in memory, it seems that there is not much one can do. This is the memory bottleneck for optimal transport problems.

**1. Kantorovich-Rubinstein duality**

In the case when the cost has the special form $c_{ij} = |i - j|$ one can reduce the memory burden. This special cost makes sense if the indices $i$ and $j$ correspond to spatial locations, since then $c_{ij}$ is just the distance from location $i$ to location $j$. It turns out that in this case there is a simple dual optimal transport problem, namely

$$\max_{f}\ \sum_i f_i\,(p_i - q_i) \quad\text{s.t.}\quad |f_{i+1} - f_i| \le 1.$$

(This is a simple form of the Kantorovich-Rubinstein duality and works similarly if the cost is any other metric on the set of indices.) The new optimization problem is still linear, but the memory requirement is only $N$ and not $N^2$ anymore, and moreover there are only $N-1$ constraints for $f$. This idea is behind the method from the paper Imaging with Kantorovich-Rubinstein discrepancy by Jan Lellmann, myself, Carola Schönlieb and Tuomo Valkonen.
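The dual can be fed to a generic LP solver directly; here is a small sketch (my own, not from the paper) where the constraint $|f_{i+1} - f_i| \le 1$ is encoded with a difference matrix:

```python
import numpy as np
from scipy.optimize import linprog

def w1_dual(p, q):
    """Solve max_f <f, p - q> s.t. |f_{i+1} - f_i| <= 1 via an LP."""
    n = len(p)
    D = np.diff(np.eye(n), axis=0)   # rows of D compute f_{i+1} - f_i
    A = np.vstack([D, -D])           # |Df| <= 1 as Df <= 1 and -Df <= 1
    b = np.ones(2 * (n - 1))
    # f is a free variable, so override linprog's default bounds (0, inf)
    res = linprog(-(p - q), A_ub=A, b_ub=b, bounds=(None, None))
    return -res.fun                  # linprog minimizes, so negate back

p = np.array([0.5, 0.3, 0.2, 0.0])
q = np.array([0.0, 0.2, 0.3, 0.5])
# for this 1-D cost the value can be cross-checked against the
# cumulative-sum formula sum_k |P_k - Q_k|
print(w1_dual(p, q), np.abs(np.cumsum(p - q)).sum())  # the two agree
```

The LP has only $N$ variables and $2(N-1)$ inequality rows, in contrast to the $N^2$ variables of the primal.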

**2. Entropic regularization**

In this post I’d like to describe another method to break through the memory bottleneck of optimal transport. This method works for basically any cost but involves a little bit of smoothing/regularization.

We go from the linear program to a non-linear but still convex one by adding the negative entropy of the transport plan $\pi$ to the objective, i.e. we consider the objective

$$\sum_{i,j} c_{ij}\,\pi_{ij} + \gamma \sum_{i,j} \pi_{ij}\,(\log \pi_{ij} - 1)$$

for some $\gamma > 0$.

What's the point of doing so? Let's look at the Lagrangian: for the constraints $\sum_i \pi_{ij} = p_j$ and $\sum_j \pi_{ij} = q_i$ we introduce the ones vector $\mathbb{1}$, write them as $\pi^T\mathbb{1} - p = 0$ and $\pi\mathbb{1} - q = 0$, add Lagrange multipliers $\alpha$ and $\beta$ and get

$$L(\pi,\alpha,\beta) = \sum_{i,j}\Big(c_{ij}\,\pi_{ij} + \gamma\,\pi_{ij}(\log\pi_{ij}-1)\Big) + \alpha^T(\pi^T\mathbb{1} - p) + \beta^T(\pi\mathbb{1} - q).$$
The cool thing happens with the optimality condition when differentiating $L$ with respect to $\pi_{ij}$:

$$\frac{\partial L}{\partial \pi_{ij}} = c_{ij} + \gamma\log\pi_{ij} + \alpha_j + \beta_i = 0.$$

We can solve for $\pi_{ij}$ and get

$$\pi_{ij} = \exp\Big(\frac{-c_{ij} - \alpha_j - \beta_i}{\gamma}\Big) = e^{-\beta_i/\gamma}\; e^{-c_{ij}/\gamma}\; e^{-\alpha_j/\gamma}.$$
What does that say? It says that the optimal $\pi$ is obtained from the matrix

$$M = \big(e^{-c_{ij}/\gamma}\big)_{ij}$$

with rows and columns rescaled by the vectors $v = (e^{-\beta_i/\gamma})_i$ and $u = (e^{-\alpha_j/\gamma})_j$, respectively, i.e.

$$\pi = \operatorname{diag}(v)\, M \operatorname{diag}(u).$$
This reduces the memory requirement from $N^2$ to $2N$! The cost for doing so is the regularization by the entropy.
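A caveat: the matrix $M$ itself still has $N^2$ entries, so the reduction only pays off if multiplication with $M$ can be done without storing it. For the cost $c_{ij} = (i-j)^2$ used in the example below this is the case, since multiplying with $M$ is a truncated Gaussian convolution. A quick NumPy check of this claim (my own sketch, not from the post):

```python
import numpy as np

N, gamma = 100, 10.0
i, j = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
M = np.exp(-(i - j) ** 2 / gamma)    # full N x N matrix exp(-c/gamma)

v = np.random.default_rng(0).random(N)

# the same product via convolution with a truncated Gaussian kernel:
# exp(-K^2/gamma) is ~1e-27 for K = 25, so the truncation error is negligible
K = 25
g = np.exp(-np.arange(-K, K + 1) ** 2 / gamma)
Mv = np.convolve(v, g, mode="same")  # never forms M

print(np.allclose(M @ v, Mv))        # True
```

With a fast convolution, each multiplication with $M$ costs far less than the $N^2$ operations of a dense matrix-vector product.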

Actually, the existence of the vectors $u$ and $v$ also follows from Sinkhorn's theorem, which states that every square matrix $A$ with positive entries can be written as $A = D_1 S D_2$ with diagonal matrices $D_1$, $D_2$ and a doubly stochastic matrix $S$ (i.e. one with non-negative entries and unit row and column sums). The entropic regularization ensures that the entries of the transport plan are indeed positive (especially non-zero).
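Sinkhorn's theorem can be checked numerically with the very same alternating-scaling idea; here is a small NumPy sketch (my own) that scales a random positive matrix to a doubly stochastic one:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((5, 5)) + 0.1         # a matrix with positive entries
r = np.ones(5); s = np.ones(5)
for _ in range(1000):
    r = 1.0 / (A @ s)                # rescale rows towards unit sums
    s = 1.0 / (A.T @ r)              # rescale columns towards unit sums

S = np.diag(r) @ A @ np.diag(s)      # A = D1 S D2 with D1 = diag(1/r), D2 = diag(1/s)
print(np.allclose(S.sum(axis=1), 1), np.allclose(S.sum(axis=0), 1))  # True True
```

The diagonal scalings play exactly the role of the vectors $u$ and $v$ above, with the marginals both set to the ones vector.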

But there is even more to the story: to calculate the vectors $u$ and $v$ you simply have to do the following iteration: starting from $v^0 = \mathbb{1}$, set

$$u^{k+1} = \frac{p}{M^T v^k},\qquad v^{k+1} = \frac{q}{M u^{k+1}},$$

where the fraction means element-wise division. Pretty simple.

What the iteration does in the first step is simply to take the current $v$ and calculate a column scaling $u$ such that the column sums of $\operatorname{diag}(v) M \operatorname{diag}(u)$ match $p$. In the second step it calculates the row scaling $v$ such that the row sums match $q$. This iteration is also known as the Sinkhorn-Knopp algorithm.
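As a sanity check of the iteration (a NumPy sketch of my own, with a random cost matrix): after convergence the rescaled matrix should indeed have column sums $p$ and row sums $q$:

```python
import numpy as np

rng = np.random.default_rng(1)
N, gamma = 5, 1.0
c = rng.random((N, N))               # some cost matrix
p = rng.random(N); p /= p.sum()      # desired column sums
q = rng.random(N); q /= q.sum()      # desired row sums

M = np.exp(-c / gamma)
u = np.ones(N); v = np.ones(N)
for _ in range(500):
    u = p / (M.T @ v)                # match the column sums to p
    v = q / (M @ u)                  # match the row sums to q

pi = np.diag(v) @ M @ np.diag(u)
print(np.allclose(pi.sum(axis=0), p), np.allclose(pi.sum(axis=1), q))  # True True
```

Each iteration only needs two matrix-vector products with $M$ (or $M^T$), and the only quantities ever updated are the two $N$-vectors $u$ and $v$.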

This is pretty simple to code in MATLAB. Here is a simple code that does the above iteration (using the cost $c_{ij} = (i-j)^2$, i.e. $M = \exp(-(i-j)^2/\gamma)$):

```matlab
% Parameters
gamma = 10;             % regularization parameter for the entropy
maxiter = 100;          % number of iterations
map = colormap(gray);
N = 100;                % size
x = linspace(0,1,N)';   % spatial coordinate

% marginals
p = exp(-(x-0.2).^2*10^2) + exp(-abs(x-0.4)*20); p = p./sum(p); % for column sums
q = exp(-(x-0.8).^2*10^2); q = q./sum(q);                       % for row sums

[i,j] = meshgrid(1:N);
M = exp(-(i-j).^2/gamma); % exp(-cost/gamma); symmetric here, so M = M'

% initialize u and v
u = ones(N,1); v = ones(N,1);

% Sinkhorn-Knopp: iteratively scale columns and rows
for k = 1:maxiter
    % update u and v
    u = p./(M*v);
    v = q./(M'*u);
    % assemble pi (only for illustration purposes)
    pi = diag(v)*M*diag(u);
    % display pi (with the marginals on top and to the right)
    imagesc([p'/max(p) 0; pi/max(pi(:)) q/max(q)])
    colormap(1-map)
    drawnow
end
```

Here are some results:

(From the left to the right: different values of $\gamma$. The first row of pixels is $p$, the last column is $q$ and in between there is $\pi$, all normalized such that black is the maximal value in $\pi$, $p$ and $q$, respectively.)

You see that for large $\gamma$ the plan $\pi$ is much smoother and not as well localized as it should be for an optimal plan.

Oh, and here is an animation of 100 iterations of Sinkhorn-Knopp showing the result after both $u$ and $v$ have been updated:

There is a catch with this regularization: for small $\gamma$ the method runs into problems with under- and overflow: the entries of $M = \exp(-c/\gamma)$ and of the iterates $u$ and $v$ become *very* small and *very* large. One can fight this effect a bit, but I don't have a convincing numerical method to deal with this yet. It seems that the entries of the optimal $u$ and $v$ really have to be incredibly small and large, and I mean *really* small and large.

While the Sinkhorn-Knopp algorithm is already from the 1960s, its application to optimal transport seems fairly new – I learned about it in talks by Gabriel Peyré and Bernhard Schmitzer, and the reference is Sinkhorn Distances: Lightspeed Computation of Optimal Transport Distances (presented at NIPS 2013) by Marco Cuturi.