
[MRG] EMD and Wasserstein 1D #89


Merged · 14 commits · Jun 27, 2019

Conversation

rtavenar (Contributor):

Hi there,

I started coding a specific EMD for the mono-dimensional case (i.e. when sorting both arrays is enough).
Docs are missing for the moment (I will add them asap), but a basic implementation that covers the non-uniform weight case, and tests that check that the results are consistent with emd, are already there.
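For intuition, the sorted-supports algorithm can be sketched in plain NumPy. This is a hypothetical illustration under my own naming (`emd_1d_sketch`, the exhaustion tolerance), not the PR's Cython code:

```python
import numpy as np

def emd_1d_sketch(u, v, wu=None, wv=None, metric='sqeuclidean'):
    """1D EMD by merging the two sorted supports (illustrative only)."""
    u = np.asarray(u, dtype=float).ravel()
    v = np.asarray(v, dtype=float).ravel()
    n, m = len(u), len(v)
    # default to uniform weights, as ot.emd_1d does when passed []
    wu = np.full(n, 1. / n) if wu is None else np.asarray(wu, dtype=float)
    wv = np.full(m, 1. / m) if wv is None else np.asarray(wv, dtype=float)
    iu, iv = np.argsort(u), np.argsort(v)
    u, wu = u[iu], wu[iu].copy()
    v, wv = v[iv], wv[iv].copy()
    i = j = 0
    cost = 0.
    while i < n and j < m:
        d = u[i] - v[j]
        c = d * d if metric == 'sqeuclidean' else abs(d)
        w = min(wu[i], wv[j])      # mass exchanged between the two atoms
        cost += w * c
        wu[i] -= w
        wv[j] -= w
        if wu[i] <= 1e-16:         # atom exhausted: advance its index
            i += 1
        if wv[j] <= 1e-16:
            j += 1
    return cost
```

Each iteration exhausts at least one atom, so the merge loop runs at most n + m times and the sorts dominate at O(n log n + m log m).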

On my machine, I ran the following timing test:

>>> n = 20000
>>> m = 3000
>>> u = np.random.randn(n, 1)
>>> v = np.random.randn(m, 1)
>>> ot.tic(); ot.emd_1d([], [], u, v, metric='sqeuclidean'); ot.toc()
array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
Elapsed time : 2.3728668689727783 s
2.3728668689727783
>>> ot.tic(); M = ot.dist(u, v, metric='sqeuclidean'); ot.emd([], [], M); ot.toc()
RESULT MIGHT BE INACURATE
Max number of iteration reached, currently 100000. Sometimes iterations go on in cycle even though the solution has been reached, to check if it's the case here have a look at the minimal reduced cost. If it is very close to machine precision, you might actually have the correct solution, if not try setting the maximum number of iterations a bit higher
/Users/tavenard_r/Documents/costel/src/POT/ot/lp/__init__.py:104: UserWarning: numItermax reached before optimality. Try to increase numItermax.
  result_code_string = check_result(result_code)
array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
Elapsed time : 8.67806887626648 s

Romain

Review comment on the Cython signatures and this placeholder docstring:

    np.ndarray[double, ndim=2, mode="c"] M):
    np.ndarray[double, ndim=2, mode="c"] u,
    np.ndarray[double, ndim=2, mode="c"] v,
    str metric='sqeuclidean'):
    r"""
    Roro's stuff
Collaborator:

nice documentation indeed ;)

rtavenar (author):

:-P

@rflamary (Collaborator):

Thank you Romain, this is nice.

Is it me, or is Cython not particularly fast here (2 s for n = 20000)? It is probably due to the use of the dist function; you should probably implement it in Cython for the squared and absolute-value metrics and use dist only for weird stuff ;)

Rémi.

@rtavenar (author) commented Jun 20, 2019

If I change to the following:

        if metric == 'sqeuclidean':
            m_ij = (u[i, 0] - v[j, 0]) ** 2
        else:
            m_ij = dist(u[i].reshape((1, 1)), v[j].reshape((1, 1)),
                        metric=metric)[0, 0]

I get the same timings (on the order of 2 seconds)...

The slow part seems to be the handling of G. If I remove everything related to G, I get:

Elapsed time : 0.0061719417572021484 s

I will check if using a sparse representation for G helps.

EDIT: OK, when I remove the overhead for G, I see a 100x improvement in timings with this if...else, so I will keep it for the L1 and L2 norms and resort to dist for other distances.
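On the sparse idea: the plan produced by the sorted merge has at most n + m - 1 nonzero entries, so COO-style triplets are enough and the dense n × m matrix never needs to be allocated. A hypothetical sketch (the name `emd_1d_plan_coo` is mine, not the PR's code), taking the already-sorted weight vectors:

```python
import numpy as np

def emd_1d_plan_coo(wu, wv):
    """Optimal 1D plan between sorted supports, as (row, col, weight) triplets."""
    wu = [float(w) for w in wu]
    wv = [float(w) for w in wv]
    rows, cols, vals = [], [], []
    i = j = 0
    while i < len(wu) and j < len(wv):
        w = min(wu[i], wv[j])   # mass exchanged between atoms i and j
        rows.append(i)
        cols.append(j)
        vals.append(w)
        wu[i] -= w
        wv[j] -= w
        if wu[i] <= 1e-16:      # atom exhausted: advance
            i += 1
        if wv[j] <= 1e-16:
            j += 1
    return np.array(rows), np.array(cols), np.array(vals)
```

If a matrix object is needed, the triplets can be handed to scipy.sparse.coo_matrix, which is presumably what a dense=False option would return.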

Review comment on the inner loop, which calls dist once per pair:

    dtype=np.float64)
    while i < n and j < m:
        m_ij = dist(u[i].reshape((1, 1)), v[j].reshape((1, 1)),
                    metric=metric)[0, 0]
Collaborator:

Since you have a pure Python function call in the loop, I doubt that Cython brings you any speed gain.

my 2c

rtavenar (author):

I've tried something for the basic metrics (euclidean and sqeuclidean); I'm not sure how to do it otherwise.

@rtavenar (author) commented Jun 21, 2019

Also, the new timings (for a larger problem than above) are:

>>> import ot
>>> import numpy as np
>>> from scipy.stats import wasserstein_distance
>>> 
>>> n = 20000
>>> m = 30000
>>> u = np.random.randn(n)
>>> v = np.random.randn(m)
>>> 
>>> ot.tic(); _ = wasserstein_distance(u, v); _ = ot.toc()
Elapsed time : 0.012831926345825195 s
>>> ot.tic(); _ = ot.emd_1d([], [], u, v, metric='euclidean', dense=False); _ = ot.toc()
Elapsed time : 0.04144096374511719 s
>>> ot.tic(); M = ot.dist(u.reshape((-1, 1)), v.reshape((-1, 1)),
...                       metric='euclidean'); _ = ot.emd([], [], M); _ = ot.toc()

RESULT MIGHT BE INACURATE
Max number of iteration reached, currently 100000. Sometimes iterations go on in cycle even though the solution has been reached, to check if it's the case here have a look at the minimal reduced cost. If it is very close to machine precision, you might actually have the correct solution, if not try setting the maximum number of iterations a bit higher
/Users/tavenard_r/Documents/costel/src/POT/ot/lp/__init__.py:106: UserWarning: numItermax reached before optimality. Try to increase numItermax.
  result_code_string = check_result(result_code)
Elapsed time : 312.5033311843872 s

We are a bit slower than scipy's implementation; I am not sure whether this is due to Cython or to the fact that scipy does not build the transport plan G :/
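For what it's worth, scipy's wasserstein_distance only computes the scalar W1 by integrating the difference of the empirical CDFs and never materializes a plan, which would explain part of the gap. A rough sketch of that CDF approach (hypothetical code and naming, not scipy's actual implementation; uniform weights only):

```python
import numpy as np

def w1_cdf(u, v):
    """W1 between two empirical distributions via integrating |F_u - F_v|."""
    u = np.sort(np.asarray(u, dtype=float))
    v = np.sort(np.asarray(v, dtype=float))
    all_x = np.sort(np.concatenate([u, v]))
    deltas = np.diff(all_x)                      # widths of the CDF steps
    # empirical CDFs evaluated on each interval between merged points
    cdf_u = np.searchsorted(u, all_x[:-1], side='right') / len(u)
    cdf_v = np.searchsorted(v, all_x[:-1], side='right') / len(v)
    return np.sum(np.abs(cdf_u - cdf_v) * deltas)
```

Note that this only yields the euclidean (W1) cost, whereas emd_1d also supports other metrics and can return the plan itself.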

@agramfort (Collaborator):

@rtavenar did you run "cython -a" on the .pyx file to see whether it's all white (no yellow slow-Python lines)?

@rtavenar (author):

@agramfort

  1. I did not know this one; thanks for the tip.
  2. I've changed the np.abs; now the only yellow lines I get are the return, the np.zeros lines, and the cdist call, but I do not know how to remove those.

@agramfort (Collaborator):

You cannot remove the yellow lines for np.zeros or return.

For cdist, either call BLAS functions from scipy directly, or code the metrics in Cython.

@rtavenar (author):

I've had a look there; I could not find obvious matches for distances, but maybe I wasn't looking in the right place :/

Regarding coding the metrics in Cython, this is what I have done for the Euclidean and squared Euclidean distances so far. The question is: should I code all of them, even those unlikely to be used, or only a subset?

@rflamary (Collaborator):

Hello, I think those two are OK; just be clear in the documentation that the others are slower and use cdist (such a slow function, btw ;))

Rémi

@rtavenar (author):

OK, and I'll also have to make it clear that only strings are accepted as metrics for emd_1d.

@rtavenar (author):

OK, I have now added proper docstrings. Let me know if something is missing or should be changed.

@rflamary (Collaborator):

This is great, thank you @rtavenar for the code and optimization.

I will merge it now.

@rflamary rflamary changed the title [WIP] EMD 1d [MRG] EMD and Wassersyein 1D Jun 27, 2019
@rflamary rflamary changed the title [MRG] EMD and Wassersyein 1D [MRG] EMD and Wasserstein 1D Jun 27, 2019
@rflamary rflamary merged commit a9b8af1 into PythonOT:master Jun 27, 2019