
np.roll in seq2seq model score function yields a non plausible match for X[-1,:] and y[-1,:]

Open robroe-tsi opened this issue 5 years ago • 6 comments

Hi,

thanks for the package. I've just been testing the seq2seq model on time series data and may have stumbled over a bug:

In https://github.com/SeldonIO/alibi-detect/blob/9249dcbce15f41db19482de4939015cf6480360f/alibi_detect/od/seq2seq.py#L124, np.roll is used to create the labels for teacher-forced learning. With np.roll(X, -1, axis=0), the first record of X moves to the last position in y, so it becomes the label for X[-1]. That way X[-1] gets X[0] as its label. That doesn't make sense to me, but maybe I'm overlooking something.
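A minimal reproduction of the wrap-around (a toy sketch, assuming numpy; the array here just stands in for the real training sequences):

```python
import numpy as np

X = np.arange(5).reshape(5, 1)   # stand-in for 5 training records
y = np.roll(X, -1, axis=0)       # labels built via np.roll as in seq2seq.py
# y is [1, 2, 3, 4, 0]: the first record wraps around to the end,
# so X[-1] (value 4) gets X[0] (value 0) as its target
```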

Best

robroe-tsi avatar Feb 08 '21 15:02 robroe-tsi

Hi @robroe-tsi , thanks for bringing this to my attention. This does indeed look like a bug, so I'll add a fix for it this afternoon. In practice it shouldn't affect the model much since it only impacts 1 data point from your whole dataset.

arnaudvl avatar Feb 17 '21 12:02 arnaudvl

This might be trickier than initially thought, because I can't just exclude that data point: doing so would break things, since the model takes input of shape (batch size, sequence length, nb features). I might have to change the trainer function to include optional masking of individual training points if no easier solution is possible.

arnaudvl avatar Feb 17 '21 18:02 arnaudvl

But couldn't you just exclude the last record of y and X?

robroe-tsi avatar Feb 18 '21 17:02 robroe-tsi

I could indeed provide an option to drop the last record. I would be inclined to make it optional, because dropping the record might actually have a bigger impact on training than including 1 faulty target label.

arnaudvl avatar Feb 18 '21 17:02 arnaudvl

Possible.

I would also like to point your attention to the reshaping done on the same line. If the data is 2D (which most often means the records represent time steps), the stacking is not efficient in terms of data usage. Given a time series 1 to 9 and a seq_len of 3, the reshaping results in the instances 1-3, 4-6, 7-9, which ignores possible instances such as 2-4, 3-5, 5-7, 6-8. It also raises an error if you pass in a dataset of length 10 with a seq_len of 3. Therefore I first transform my data using:

import numpy as np

def window_stack(a, lookup, stepsize=1):
    # np.hstack needs a sequence, not a generator, in recent NumPy
    return np.hstack([a[i:1 + i - lookup or None:stepsize] for i in range(0, lookup)])
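For illustration (repeating the helper for completeness, assuming numpy), the series 1 to 9 with a lookup of 3 yields all seven overlapping windows instead of only three:

```python
import numpy as np

def window_stack(a, lookup, stepsize=1):
    # stack shifted copies of `a` side by side: column i holds the
    # series offset by i steps, so each row is one window of length `lookup`
    return np.hstack([a[i:1 + i - lookup or None:stepsize] for i in range(0, lookup)])

a = np.arange(1, 10).reshape(-1, 1)   # series 1..9 as a (9, 1) array
windows = window_stack(a, lookup=3)
# windows has shape (7, 3): [1,2,3], [2,3,4], ..., [7,8,9]
```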

Maybe you could just put a note about this reshaping behaviour in the comments.

robroe-tsi avatar Feb 18 '21 17:02 robroe-tsi

Wrt the data usage: while it is true that this is not the most efficient use of the data when the reshaping has not been done in advance, it is hard to say in general what the best approach is, especially as the sequence length increases and the sequences largely overlap when all potential sequences are considered. I like your suggestion to make this behaviour more explicit in the docs/comments and to encourage reshaping the data as the user sees fit before feeding it to the detector.

So in summary: I'll add an option to drop the last sequence, given the faulty last target value, and make the reshaping behaviour more explicit.
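A hypothetical sketch of that option (the function name and the `drop_last` flag are illustrative only, not the actual alibi-detect API):

```python
import numpy as np

def roll_targets(X, drop_last=True):
    # Build teacher-forcing targets by shifting X one step back;
    # np.roll wraps X[0] around as the target for X[-1], so we
    # optionally drop that last, faulty (input, target) pair.
    y = np.roll(X, -1, axis=0)
    if drop_last:
        return X[:-1], y[:-1]
    return X, y
```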

arnaudvl avatar Feb 18 '21 19:02 arnaudvl