High frequency data sets anomaly detection

Open FlowQ opened this issue 11 years ago • 3 comments

I am trying to run anomaly detection on a very high-frequency data set (more than 5 to 10 rows per second) whose timestamps are not consecutive (sometimes there is no row at all for a given second).

Example: 09:23:59 2014-12-19 09:23:59 2014-12-19 09:24:00 2014-12-19 09:24:00 2014-12-19 09:24:02 2014-12-19 09:24:02 2014-12-19 09:24:02

I understand that I should use AnomalyDetectionTs to perform the detection on this type of set.

My set has 50K rows, but the function cannot compute the detection and crashes. Could this also be because the time series is not evenly spaced (consecutive rows are sometimes 1 second apart, sometimes 0, and sometimes even 2 seconds)?

What are your recommendations for working with this type of dataset?

Thanks,

Flow

FlowQ avatar Jan 28 '15 12:01 FlowQ

This is how I did it for my dataset with hourly frequency. It also sets the missing rows' count to 0 instead of NA.

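# Parse the first and last timestamps of the range
dates.min <- as.POSIXct(dates.min.text)
dates.max <- as.POSIXct(dates.max.text)

# Every hour in the range, whether or not the data has a row for it
dates.seq.all <- seq(dates.min, dates.max, by='hour')
dates.all <- data.frame(date=dates.seq.all)

# Outer join (assumes data.db has date and count columns), then count missing hours as 0
data <- merge(dates.all, data.db, all=TRUE)
data$count[is.na(data$count)] <- 0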
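
For the per-second data in the original question, the same idea might look like the sketch below. It is only an illustration: the `events` data frame, its `timestamp` column, and the UTC time zone are assumptions rather than anything from the reporter's actual data. Duplicate timestamps are first collapsed into one count per second, and seconds with no rows are then filled with 0.

```r
# Hypothetical raw events: several rows can share a second,
# and 09:24:01 has no row at all.
events <- data.frame(
  timestamp = as.POSIXct(
    c("2014-12-19 09:23:59", "2014-12-19 09:23:59",
      "2014-12-19 09:24:00", "2014-12-19 09:24:00",
      "2014-12-19 09:24:02", "2014-12-19 09:24:02",
      "2014-12-19 09:24:02"),
    tz = "UTC"
  )
)

# Collapse duplicates: one row per second with the number of events in it.
counts <- table(format(events$timestamp, "%Y-%m-%d %H:%M:%S"))
per.second <- data.frame(
  timestamp = as.POSIXct(names(counts), tz = "UTC"),
  count = as.integer(counts)
)

# Build the full one-second grid and left-join the counts onto it.
all.seconds <- data.frame(
  timestamp = seq(min(per.second$timestamp), max(per.second$timestamp), by = "sec")
)
regular <- merge(all.seconds, per.second, by = "timestamp", all.x = TRUE)

# Seconds without any event get a count of 0 instead of NA.
regular$count[is.na(regular$count)] <- 0
```

Filling with 0 is reasonable when the value is an event count. If the second column is a measurement rather than a count, 0 would be a fabricated reading, and leaving the gap as NA for later interpolation is probably the safer choice.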

pepijn avatar Jan 28 '15 14:01 pepijn

We're looking into more gracefully handling datasets with missing values. Patch soon...

jhochenbaum avatar Mar 16 '15 05:03 jhochenbaum

Owen and I looked into this tonight, and it's a tricky one: STL decomposition can't really handle datasets with NAs in them. Here is what we propose. We will handle the cases where there are leading and/or trailing NAs, but will throw an exception when we detect non-leading NAs.

In the latter case, we recommend you use interpolation to replace the NAs. The zoo package provides such a function (linear interpolation) called na.approx.
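
A minimal sketch of that interpolation step, assuming the zoo package is installed; the `series` data frame and its values are made up for illustration:

```r
library(zoo)  # provides na.approx()

# Hypothetical regularly spaced series with NAs in the middle only.
series <- data.frame(
  timestamp = seq(as.POSIXct("2014-12-19 09:24:00", tz = "UTC"),
                  by = "sec", length.out = 6),
  count = c(5, NA, NA, 8, NA, 10)
)

# Linearly interpolate between the nearest non-missing neighbours.
# na.rm = FALSE keeps the vector the same length: any leading or
# trailing NAs would stay NA instead of being dropped.
series$count <- na.approx(series$count, na.rm = FALSE)
```

Here the interpolation runs over row positions, which matches interpolation over time only because the timestamps are evenly spaced. For an irregular series, the grid would need to be regularized first.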

Let us know your thoughts, thanks.

jhochenbaum avatar Mar 16 '15 05:03 jhochenbaum