
ValueError on TimeSeriesSVR when having input temporal length of more than 405

Open skasai5296 opened this issue 5 years ago • 11 comments

Describe the bug: A ValueError is raised when fitting on long time series input.

To Reproduce

import numpy as np
from tslearn.generators import random_walk_blobs
from tslearn.svm import TimeSeriesSVR

X, y = random_walk_blobs(n_ts_per_blob=10, sz=406, d=2, n_blobs=2)
y = y.astype(float) + np.random.randn(20) * .1  # noisy regression targets
reg = TimeSeriesSVR(kernel="gak", gamma="auto")
reg.fit(X, y)  # raises ValueError

Expected behavior: The model should fit without raising an error.

Environment (please complete the following information):

  • OS: macOS Mojave 10.14.6
  • tslearn version 0.3.0

Additional context: I'm not sure why this is happening, but the maximum temporal length that does not raise an error is 405.

skasai5296 • Feb 25 '20 07:02

The issue comes from tslearn.metrics.cdist_gak: the returned array contains NaN values, which is not expected.

import numpy as np
from tslearn.generators import random_walk_blobs
from tslearn.metrics import cdist_gak

X, y = random_walk_blobs(n_ts_per_blob=10, sz=406, d=2, n_blobs=2, random_state=42)
X_new = cdist_gak(X, sigma=1000)
np.isnan(X_new).any()

I can try to find why this is happening.

johannfaouzi • Feb 26 '20 08:02

cdist_gak returns the product of three matrices:

return (diagonal_left.dot(matrix)).dot(diagonal_right)

where matrix contains the pairwise values of unnormalized_gak.

unnormalized_gak calls _gak_gram followed by njit_gak.

If the value for sigma is really high, the values in gram

gram = _gak_gram(s1, s2, sigma=sigma)

will be close to 1.

Then, the value for gak_val

gak_val = njit_gak(s1, s2, gram)

will be really high (it grows exponentially, on the order of 3 ** n or more, where n is the number of time points, i.e. n = sz). This value overflows to np.inf.
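The overflow can be reproduced without tslearn. The following is a minimal sketch, assuming the accumulation sums the three neighbouring cells and multiplies by the local kernel value; `gak_recursion` and `all_ones` are hypothetical helpers written for illustration, not the library's code. With every local kernel value set to exactly 1 (the limit of a very large sigma), the result is finite for short series and becomes inf for long ones.

```python
import math


def gak_recursion(gram):
    """Accumulate a GAK-style value over a matrix of local kernel values.

    Assumed recursion (illustrative only): each cell is the sum of its
    upper, left and upper-left neighbours, times the local kernel value.
    """
    n_rows, n_cols = len(gram), len(gram[0])
    cum_sum = [[0.0] * (n_cols + 1) for _ in range(n_rows + 1)]
    cum_sum[0][0] = 1.0
    for i in range(n_rows):
        for j in range(n_cols):
            cum_sum[i + 1][j + 1] = (
                cum_sum[i][j + 1] + cum_sum[i + 1][j] + cum_sum[i][j]
            ) * gram[i][j]
    return cum_sum[n_rows][n_cols]


def all_ones(n):
    """Local kernel matrix for two length-n series when sigma is huge."""
    return [[1.0] * n for _ in range(n)]


small = gak_recursion(all_ones(3))    # finite for short series
large = gak_recursion(all_ones(500))  # exceeds the float64 range
print(small, math.isinf(large))
```

Plain Python floats are used here so that the overflow silently yields inf, which mirrors what ends up in the unnormalized kernel matrix.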

Then, when doing the dot product, 0 * np.inf will produce np.nan values.
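A minimal sketch of this step, assuming the normalization scales each entry by 1 / sqrt of the corresponding diagonal values (the elementwise equivalent of multiplying by diagonal matrices on the left and right). `normalize_kernel` is a hypothetical helper for illustration, not a tslearn function.

```python
import math


def normalize_kernel(matrix):
    """Scale entry (i, j) by 1 / sqrt(matrix[i][i] * matrix[j][j])."""
    n = len(matrix)
    # 1 / sqrt(inf) evaluates to 0.0, so overflowed diagonals give zero scales.
    scale = [1.0 / math.sqrt(matrix[i][i]) for i in range(n)]
    return [
        [scale[i] * matrix[i][j] * scale[j] for j in range(n)]
        for i in range(n)
    ]


inf = float("inf")
finite_result = normalize_kernel([[4.0, 2.0], [2.0, 1.0]])
overflowed_result = normalize_kernel([[inf, inf], [inf, inf]])

# 0.0 * inf is NaN, so every entry of the overflowed kernel becomes NaN.
all_nan = all(math.isnan(v) for row in overflowed_result for v in row)
print(finite_result, all_nan)
```

With finite inputs the diagonal is normalized to 1; once the unnormalized values have overflowed, the same formula can only produce NaN.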

Then sklearn.svm.SVR is not happy because the precomputed kernel contains NaN values.

johannfaouzi • Feb 26 '20 10:02

sigma is computed using:

  • self.gamma_ = gamma_soft_dtw(X)
  • sigma=numpy.sqrt(self.gamma_ / 2.)

which can be high for long time series.

To me, there are two possibilities:

  • either using a non-normalized version of GAK is not great because it leads to floating point overflow,
  • or there is a bug in the code.

But I don't know much about GAK, so I will let people who know more about it answer.

johannfaouzi • Feb 26 '20 10:02

This might explain why the results reported in the seglearn paper were not great. [Screenshot, 2020-02-26]

johannfaouzi • Feb 26 '20 11:02

The same problem arises when using TimeSeriesSVC() (maybe obvious).

emial637 • Mar 20 '20 14:03

Yes, that makes sense: both algorithms share the same preprocessing step, which is computing the kernel matrix. Then the SVR and SVC classes from scikit-learn are called. The issue arises when computing the kernel matrix.

johannfaouzi • Mar 20 '20 17:03

I am also experiencing this issue. Exact same symptoms (the unnormalized GAK matrix is all inf), and I do have time series with more than 500 steps.

ghost • Jun 02 '20 12:06

The automatic ('auto') computation of gamma can lead to this behavior. Could you try with a smaller value of gamma (or sigma if you are using the GAK functions directly)? It's an important hyper-parameter (like gamma for the RBF kernel), but the computation of the GAK kernel is a bit more complex and the current implementation can lead to overflow issues.

johannfaouzi • Jun 02 '20 13:06

Yep, gamma=0.1 works for my timeseries. Thanks I'll try to keep it low!

ghost • Jun 02 '20 14:06

I ran into a similar issue, and lowering gamma worked for me. However, an alternative I found was to use scipy.signal.resample to downsample. My timeseries had 800 values and downsampling to 400 (allowing gamma='auto') achieved higher classification performance, compared to setting a low gamma.

ryanhammonds • Aug 11 '21 23:08

The most confusing thing for me about the errors generated when 'gak' produces bad values is that the error message reports that the issue is with the data. I spent a long time checking and double-checking that my X data in fact did not contain any NaNs. Perhaps a warning should be generated in the pipeline that points to a poorly formed kernel instead?
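One way such a check could look is sketched below. It is only an illustration: `check_kernel_is_finite` is a hypothetical helper, not part of tslearn or scikit-learn, and it assumes the kernel matrix is available as nested lists of floats (for example, a precomputed kernel converted with `.tolist()`).

```python
import math


def check_kernel_is_finite(kernel):
    """Raise a ValueError that blames the kernel matrix, not the input data."""
    bad = [
        (i, j)
        for i, row in enumerate(kernel)
        for j, value in enumerate(row)
        if not math.isfinite(value)
    ]
    if bad:
        raise ValueError(
            f"The precomputed kernel contains {len(bad)} non-finite value(s), "
            f"first at index {bad[0]}. The input data may be fine: "
            "try a smaller gamma (or sigma) or shorter time series."
        )


check_kernel_is_finite([[1.0, 0.5], [0.5, 1.0]])  # passes silently

try:
    check_kernel_is_finite([[1.0, float("nan")], [float("nan"), 1.0]])
    message = ""
except ValueError as exc:
    message = str(exc)
print(message)
```

Running a check like this on the kernel matrix before handing it to the SVM would surface the overflow directly, instead of an error about NaN in the input.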

lomnes-atlast-food • Mar 24 '23 13:03