
Multithreading and Sakoe-Chiba constraint for softDTW

Open RaffaLinux opened this issue 5 years ago • 5 comments

Hi there. I'm using KMeans with softDTW for my thesis on a dataset made of 115 time series that are up to 10,000 points long. I'm running my script on a Ryzen 3600X (pretty much a last-gen mainstream processor) and observing a CPU usage of around 10%, which is extremely low. I set KMeans to run with 12 jobs, which is equal to the number of threads my processor can handle.

`k = TimeSeriesKMeans(n_clusters=20, metric="softdtw", max_iter=3, verbose=1, n_jobs=12, max_iter_barycenter=3)`

This low CPU usage causes extremely long execution times: for example, processing 115 time series, each ~4000 points long, into 20 clusters takes 24+ hours. Running everything on a single thread didn't change much, my CPU is still used at 10%. Is this normal behaviour or am I doing something wrong?

I'm running tslearn 0.4.1 on Windows 10 with Python 3.8

Thank you for your help and for your time.

RaffaLinux · Dec 08 '20 18:12

After digging into the code a bit, I found out that _soft_dtw, which is where most of the time is spent, is run on a single thread. I don't know whether doing so would cause any issues, but I think this particular bit of code could be run in parallel:

```python
def cdist_soft_dtw(dataset1, dataset2=None, gamma=1.):
    dataset1 = to_time_series_dataset(dataset1, dtype=numpy.float64)
    self_similarity = False
    if dataset2 is None:
        dataset2 = dataset1
        self_similarity = True
    else:
        dataset2 = to_time_series_dataset(dataset2, dtype=numpy.float64)
    dists = numpy.empty((dataset1.shape[0], dataset2.shape[0]))
    equal_size_ds1 = check_equal_size(dataset1)
    equal_size_ds2 = check_equal_size(dataset2)
    for i, ts1 in enumerate(dataset1):
        if equal_size_ds1:
            ts1_short = ts1
        else:
            ts1_short = ts1[:ts_size(ts1)]
        for j, ts2 in enumerate(dataset2):
            if equal_size_ds2:
                ts2_short = ts2
            else:
                ts2_short = ts2[:ts_size(ts2)]
            if self_similarity and j < i:
                dists[i, j] = dists[j, i]
            else:
                dists[i, j] = soft_dtw(ts1_short, ts2_short, gamma=gamma)

    return dists
```

In particular, it should be possible to compute several softDTW distances at the same time, along the lines of the sketch below.
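To make the idea concrete, here is a rough, untested sketch of what I have in mind: hand the outer loop to joblib and let each worker return its row of distances instead of writing into a shared array. `cdist_soft_dtw_parallel` and `_soft_dtw_row` are just names I made up, not tslearn functions, and this version does not exploit the symmetry of the self-similarity case:

```python
from joblib import Parallel, delayed
import numpy

from tslearn.metrics import soft_dtw
from tslearn.utils import to_time_series_dataset, ts_size


def _soft_dtw_row(ts1, dataset2, gamma):
    # one worker computes the distances from a single series to all of dataset2
    ts1_short = ts1[:ts_size(ts1)]
    return [soft_dtw(ts1_short, ts2[:ts_size(ts2)], gamma=gamma)
            for ts2 in dataset2]


def cdist_soft_dtw_parallel(dataset1, dataset2=None, gamma=1., n_jobs=-1):
    dataset1 = to_time_series_dataset(dataset1, dtype=numpy.float64)
    if dataset2 is None:
        dataset2 = dataset1
    else:
        dataset2 = to_time_series_dataset(dataset2, dtype=numpy.float64)
    # each job computes one row of the cross-distance matrix and returns it
    rows = Parallel(n_jobs=n_jobs)(
        delayed(_soft_dtw_row)(ts1, dataset2, gamma) for ts1 in dataset1)
    return numpy.array(rows)
```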

I also did some searching about the Sakoe-Chiba band constraint, and this pull request on mblondel's softDTW GitHub repository provides a ready-to-use implementation for it. Unfortunately mblondel never merged that pull request and hasn't touched that code since its release 4 years ago. This library is really good for time series analysis, but I would not rely on abandonware (we are talking about only 115 lines of code). Also, NumPy uses (if I'm not mistaken) SIMD instructions now, so there should be a better way to implement softDTW by exploiting NumPy; a rough illustration of a banded soft-DTW in NumPy is sketched below. mblondel uses Cython, which is indeed good, but using NumPy (and SIMD) might be a better way now. Honestly I'm not a Python or NumPy expert, so I can only offer suggestions.
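Just to illustrate what a banded soft-DTW could look like in plain NumPy, here is my own sketch, assuming univariate, equal-length series; it is not the code from that pull request and not meant to be fast:

```python
import numpy as np


def _softmin3(a, b, c, gamma):
    # numerically stable soft-minimum of three values
    vals = -np.array([a, b, c]) / gamma
    vmax = vals.max()
    return -gamma * (vmax + np.log(np.exp(vals - vmax).sum()))


def soft_dtw_sakoe_chiba(x, y, gamma=1.0, radius=None):
    # x, y: 1-D series; radius: Sakoe-Chiba band half-width (None = unconstrained)
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    m, n = len(x), len(y)
    D = (x[:, None] - y[None, :]) ** 2          # squared Euclidean cost matrix
    R = np.full((m + 1, n + 1), np.inf)
    R[0, 0] = 0.0
    for i in range(1, m + 1):
        if radius is None:
            lo, hi = 1, n
        else:
            center = i * n / m                   # follow the diagonal when m != n
            lo = max(1, int(np.floor(center - radius)))
            hi = min(n, int(np.ceil(center + radius)))
        for j in range(lo, hi + 1):
            R[i, j] = D[i - 1, j - 1] + _softmin3(
                R[i - 1, j - 1], R[i - 1, j], R[i, j - 1], gamma)
    return R[m, n]
```

Cells outside the band are simply left at infinity, so they never contribute to the soft-minimum.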

Thank you again for your time and patience, keep up the good work :D

RaffaLinux · Dec 10 '20 19:12

Hi @RaffaLinux

You are right that soft-dtw cross distances are not parallelized at the moment. Would you be OK with running some tests on your side to check whether it would be likely to help? If so, I could cook up a quick and dirty benchmark for you to run.

rtavenar · Dec 10 '20 19:12

Hello @rtavenar. I tried a bit using this piece of code:

def cdist_soft_dtw(dataset1, dataset2=None, gamma=1.):

    ...
    ...
    Parallel(n_jobs=12)(
        delayed(soft_dtw_parallel)(self_similarity, equal_size_ds1, equal_size_ds2,
                                   dists, gamma, i, j, ts1, ts2)
        for i, ts1 in enumerate(dataset1) for j, ts2 in enumerate(dataset2))
    return dists

def soft_dtw_parallel(self_similarity, equal_size_ds1, equal_size_ds2,
                      dists, gamma, i, j, ts1, ts2):
    if equal_size_ds1:
        ts1_short = ts1
    else:
        ts1_short = ts1[:ts_size(ts1)]
    if equal_size_ds2:
        ts2_short = ts2
    else:
        ts2_short = ts2[:ts_size(ts2)]
    if self_similarity and j < i:
        dists[i, j] = dists[j, i]
    else:
        dists[i, j] = soft_dtw(ts1_short, ts2_short, gamma=gamma)

But it leads to overflows; as I expected, my Python coding skills are crap and honestly I can't spend much time trying stuff at the moment :(

tslearn\clustering.py:111: RuntimeWarning: overflow encountered in square
  distance_to_candidates = cdist_metric(X[candidate_ids], X) ** 2

In any case it does for sure use all the cores on my CPU, but because of my bad Python coding it actually slows things down (I tested with 115 time series, each 200 points long, divided into 5 clusters) and is worse than the single-threaded way, probably because of the overflows. I'm quite sure that calculating multiple softDTW instances for different pairs of time series at once should lead to a speedup roughly proportional to the number of threads the CPU can handle; with better code than mine it should be possible. As far as I can tell, each softDTW cross-distance can be calculated independently (right?), which would cut down a large portion of the computation time, since the _soft_dtw function from mblondel is by far the biggest name when profiling performance.
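One guess about the overflow (I'm really not sure): joblib's default backend runs the workers in separate processes, so the assignments to dists inside soft_dtw_parallel never make it back to the parent process, and the array created by numpy.empty is then used with whatever uninitialized values it happened to contain. Returning the values from the workers instead, roughly like this (reusing the names from my snippet above; soft_dtw_pair is just a made-up helper), might avoid that:

```python
def soft_dtw_pair(i, j, ts1, ts2, gamma):
    # return the distance instead of writing into a shared array inside the worker
    return i, j, soft_dtw(ts1[:ts_size(ts1)], ts2[:ts_size(ts2)], gamma=gamma)

results = Parallel(n_jobs=12)(
    delayed(soft_dtw_pair)(i, j, ts1, ts2, gamma)
    for i, ts1 in enumerate(dataset1) for j, ts2 in enumerate(dataset2))
for i, j, d in results:
    dists[i, j] = d
```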

On that note, the Sakoe-Chiba band constraint is much easier to implement using the pull request code that I mentioned in my previous comment, since it only needs an extra parameter added to all function calls down the line (I did it this way in my tests, but since the constraint is already mentioned in tslearn there is surely a nicer way).

PS: at this point I'm changing the title of this issue to "Multithreading and Sakoe-Chiba constraint for softDTW"; would you please also change the label to feature request or something similar? :)

RaffaLinux · Dec 10 '20 23:12

DTW is better than soft-DTW with respect to parallelization. In fact, the two produce essentially similar results. I hope it helps.
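If plain DTW is acceptable, something along these lines should already give you both multiprocessing and a Sakoe-Chiba band in tslearn (keyword names as I recall them from the docs, so please double-check them):

```python
from tslearn.clustering import TimeSeriesKMeans

# metric_params is forwarded to the DTW cross-distance computation
model = TimeSeriesKMeans(
    n_clusters=20,
    metric="dtw",
    n_jobs=-1,
    metric_params={"global_constraint": "sakoe_chiba",
                   "sakoe_chiba_radius": 100},
)
```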

PercyLau · Jan 04 '21 13:01

Hi, I tried to add parallel computing to softdtw TimeSeriesKMeans, but the results are disappointing. My changes are in: https://github.com/lekcyjna123/tslearn/tree/lekcyjna/parallelSoftDTW

Tests were made using a dataset of 50 time series of length 256:

%%timeit
model=TimeSeriesKMeans(n_jobs=-1, metric="softdtw")
r=model.fit(szeregi)

Master: 13.8 s ± 3.81 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

One run (using %%time):

CPU times: user 10.3 s, sys: 3.31 ms, total: 10.3 s
Wall time: 10.4 s

Using the changes from the linked branch (best result out of the 3 different approaches I tried): 8.94 s ± 1.96 s per loop (mean ± std. dev. of 7 runs, 1 loop each); 9.1 s ± 2 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

CPU times: user 6.38 s, sys: 71.2 ms, total: 6.45 s
Wall time: 9.27 s

But with these changes, calculations on 1 process are much slower: 19.4 s ± 7.65 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

So to summarize, multiprocessing computation is about 34% faster on a processor with 4 threads (i5-2400S), but on the other hand serial computation is 40% slower. This could be mitigated by adding an if statement that checks whether we run serially or in parallel, roughly as sketched below, but I don't know if that makes sense. Do you have any opinions?
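The dispatch I have in mind is roughly this (not the actual code from my branch, just the idea, assuming the surrounding names dists, dataset1, dataset2, gamma and n_jobs from cdist_soft_dtw are in scope):

```python
def _row(ts1, dataset2, gamma):
    # distances from one series to every series in dataset2
    return [soft_dtw(ts1, ts2, gamma=gamma) for ts2 in dataset2]

if n_jobs is None or n_jobs == 1:
    # plain serial loop: no joblib overhead for the default case
    for i, ts1 in enumerate(dataset1):
        dists[i] = _row(ts1, dataset2, gamma)
else:
    rows = Parallel(n_jobs=n_jobs)(
        delayed(_row)(ts1, dataset2, gamma) for ts1 in dataset1)
    dists = numpy.array(rows)
```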

lekcyjna123 · Jul 25 '21 17:07