how to mitigate the impact of abnormal points in historical data on training?

Open lalafanchuan opened this issue 5 years ago • 2 comments

Hi, Zeyan Li. I have a question when applying the model to my datasets:

Bagel and Donut assume that the historical data follow a normal pattern. However, when the amount of historical data is not very large, the impact of abnormal points cannot be ignored. So I want to ask: how can I mitigate the impact of abnormal points in the historical data on training?

I have tried introducing the labels of the data into model training, but it did not improve the results much.

I would appreciate it if you could help me solve this problem. Thank you.

lalafanchuan avatar Jul 14 '20 06:07 lalafanchuan

Well, in our experiments there are abnormal points in the historical data too. We actually assume that abnormal points are much rarer than normal points, so that our model can learn the normal pattern from a contaminated dataset. If this assumption does not hold because there are too many abnormal points, Bagel and Donut would not work. We have not studied such cases, since in practice normal points are almost always much more prevalent than abnormal ones. Maybe we can give you more suggestions if you describe your data in more detail.
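When contamination is too high for this assumption, one common workaround is to pre-filter obviously abnormal points before training. Below is a minimal sketch (not part of Bagel or Donut; the function name and thresholds are hypothetical) that flags points far from a rolling median using the median absolute deviation (MAD):

```python
import numpy as np

def mad_filter(values, window=24, threshold=3.5):
    """Flag points far from a rolling median as suspected anomalies.

    Hypothetical preprocessing sketch, not part of Bagel/Donut.
    `window` is the half-width of the local neighbourhood in samples.
    """
    values = np.asarray(values, dtype=float)
    flags = np.zeros(len(values), dtype=bool)
    for i in range(len(values)):
        lo = max(0, i - window)
        hi = min(len(values), i + window + 1)
        local = values[lo:hi]
        med = np.median(local)
        mad = np.median(np.abs(local - med)) or 1e-8
        # 0.6745 rescales MAD to the standard deviation of a normal
        # distribution, giving a robust z-score for point i
        if 0.6745 * abs(values[i] - med) / mad > threshold:
            flags[i] = True
    return flags
```

Flagged points can then be excluded from training windows (or treated as missing, which Donut already supports).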

BTW, as for the influence of labels, you can refer to Fig. 7 in the Donut paper (Unsupervised Anomaly Detection via Variational Auto-Encoder for Seasonal KPIs in Web Applications).

lizeyan avatar Jul 14 '20 07:07 lizeyan

Hi Zeyan, thank you for your answer! Our dataset consists of business KPI data, and it differs from the dataset described in the Bagel paper in the following ways: (1) our data has an interval of 1 hour between two observations; (2) since the data varies a lot from one month ago to today, we use only one month of data to train the model, retraining every hour and using the trained model to detect anomalies in the following hour; (3) holidays and specific events like 618 affect our business a lot, so the data pattern during those days looks different from other days.

Therefore, our training dataset is not very large, and sometimes there exists a series of abnormal points in the historical data that hurts model performance. To address this, we have introduced a prediction model that replaces abnormal points with predicted values during training. This technique has mitigated the impact somewhat, but we do not think it is a perfect solution (the prediction model may itself introduce prediction error), so we are trying to solve the problem in another way.
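A simpler variant of this replace-before-training idea, avoiding a separate prediction model entirely, is to fill labeled abnormal points by linear interpolation between the nearest normal neighbours. A sketch (the function name is hypothetical, not from the Bagel codebase):

```python
import numpy as np

def fill_abnormal(values, abnormal_labels):
    """Replace labeled abnormal points with values linearly
    interpolated between the nearest normal neighbours.

    Hypothetical sketch of a model-free alternative to replacing
    abnormal points with a prediction model's output.
    """
    values = np.asarray(values, dtype=float)
    abnormal = np.asarray(abnormal_labels, dtype=bool)
    normal_idx = np.flatnonzero(~abnormal)
    filled = values.copy()
    # np.interp clamps to the nearest normal value when an abnormal
    # run touches the start or end of the series
    filled[abnormal] = np.interp(np.flatnonzero(abnormal),
                                 normal_idx, values[normal_idx])
    return filled
```

Interpolation introduces no model error of its own, though it flattens any genuine structure inside long abnormal runs, so it trades one bias for another.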

BTW, regarding my earlier remark that 'I have tried to introduce the labels of the data into the model training, but it has not improved much': I meant that I introduced labels into the training datasets to make M-ELBO work. However, training with labels and without labels (all labels set to zero) shows little difference on our datasets.
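For reference, the M-ELBO from the Donut paper can be sketched as follows for a single Monte Carlo sample of z (variable names here are ours, not from either codebase): labeled-abnormal and missing points are dropped from the reconstruction term, and the prior term is scaled by the fraction of normal points in the window.

```python
import numpy as np

def m_elbo(recon_log_prob, log_pz, log_qz, normal_mask):
    """Single-sample sketch of Donut's M-ELBO.

    recon_log_prob: per-point log p(x_t | z) over a window, shape (W,)
    log_pz, log_qz: log prior and log posterior density of the sampled z
    normal_mask:    1 for normal/unlabeled points, 0 for labeled
                    anomalies or missing points, shape (W,)
    """
    alpha = np.asarray(normal_mask, dtype=float)
    recon = np.asarray(recon_log_prob, dtype=float)
    beta = alpha.mean()  # fraction of normal points in the window
    # abnormal points contribute nothing to reconstruction, and the
    # prior term is down-weighted by beta to match
    return float((alpha * recon).sum() + beta * log_pz - log_qz)
```

With an all-ones mask this reduces to the ordinary ELBO, which is why an almost-all-zero label vector (few labeled anomalies) behaves nearly the same as training with no labels at all, consistent with the small difference you observed.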

lalafanchuan avatar Jul 15 '20 07:07 lalafanchuan