RecurrenceAnalysis.jl [FR] Estimations of `ε` [$100]

Would be great if we had a function with various methods to estimate ε , the threshold of a recurrence plot.

This could work similarly with estimate_delay which gives delay times for timeseries.

Different methods are described in the book from N. Marwan and citations within, see chapter 1 (we cite the book in most docs). Some of these methods are actually very easy and straightforward and would be a great issue for newcomer contributor.

There is a $100 open bounty on this issue. Add to the bounty at Bountysource.

Jan 26 '19 14:01 Datseris

Yes, we were also already discussing it in our group. We have some ideas and we can share them with you if you like.

Am 26.01.2019 um 15:53 schrieb George Datseris [email protected]:

Would be great if we had a function with various methods to estimate ε , the threshold of a recurrence plot.

This could work similarly with estimate_delay which gives delay times for timeseries.

Different methods are described in the book from N. Marwan and citations within, see chapter 1 (we cite the book in most docs). Some of these methods are actually very easy and straightforward and would be a great issue for newcomer contributor.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/JuliaDynamics/RecurrenceAnalysis.jl/issues/67, or mute the thread https://github.com/notifications/unsubscribe-auth/AA7WkNx9HQle89RxGN_GpVENH8dafiL0ks5vHGvwgaJpZM4aUNLp.

Jan 26 '19 16:01 pucicu

Awesome, please do so!

Jan 26 '19 18:01 Datseris

This issue now has a 100$ bounty on it. (Which means anyone who solves it gets the 100$)

Mar 16 '19 16:03 Datseris

related: https://aip.scitation.org/doi/10.1063/1.5024914

Apr 29 '19 16:04 Datseris

Note: The https://aip.scitation.org/doi/10.1063/1.5024914 paper is extremely wordy, and basically says that "counting the shortest N% of distances in the recurrence plot as recurring" is a good way to ensure some Recurrence Quantification Analysis less sensitive to embedding dimensions.

That is, keeping recurrence rate constant as embedding dimensions increase allows other RQA measures to be stabler.

Feb 11 '20 11:02 yuxi-liu-wired

I read chapter 1 of Webber and Marwan. It seems that this question is solved? Just pass kwarg to the integrator, and it's done. The only notable

Thus possibly this bounty is outdated.

Feb 11 '20 17:02 yuxi-liu-wired

@yuxiliu1995 can you please give more detail? What we are talking about here is having a method (a function) that you give in your dataset and then function returns an optimal value for ε, much like the functions estimate_delay which return an optimal value for e.g. τ.

Feb 11 '20 21:02 Datseris

I read Chapter 1 of Marwan, https://aip.scitation.org/doi/10.1063/1.5024914 , as well as many cited papers. It seems there is no good way to do this other than to try a few and see what works. It is just a collection of gut feelings, case studies, and appeal to numerical simulation.

Not only is it unmathematical, it is also simple. There are only a few methods for choosing ε:

ε = p * std, with 0.2 < p < 0.4
ε < 0.1 * mean.
ε = p * median with p ~ 0.2.
ε = p * diameter, with p < 0.1, or p ~ 0.05, or 0.2 < p < 0.4.
ε = p-quantile, with 0.01< p < 0.05.
ε > 5 * std of noise, in the case where the measurement device has a Gaussian noise distribution with a known std.
ε = argmin |1 − Np(ε)/Nn(ε)|, where
- Np is the number of "significant peaks" in the histogram, where each bin counts the total number of black pixels in a diagonal line in the recurrence plot. ("significant" is not well-defined)
- Nn is the average number of neighbours, that is, the average number of black pixels in each vertical column in the recurrence plot.
Draw RR vs ε on a loglog plot, and select ε in the middle of a linear region, since that's where a scaling law is working, and the dynamics is presumably found.

Relevant parts of Chapter 1 of Marwan:

The recurrence threshold is a crucial parameter in the RP analysis. Although several works have contributed to this discussion [9, 22, 24, 25], a general and systematic study on the recurrence threshold selection remains an open task for future work. Nevertheless, recurrence threshold selection is a trade-off of to have a threshold as small as possible but at the same time a sufficient number of recurrences and recurrence structures. In general, the optimal choice of σ depends on the application and the experimental conditions (e.g., noise), but in all cases it is desirable that the smallest threshold possible is chosen.

[9] is very similar to this chapter.
- if there's observational noise, then, citing [22], ε > 5 * σ(dataset).
- for (quasi-)periodic processes, citing [24], select ε = argmin |1 − Np(ε)/Nn(ε)|, where
  - Np is the number of "significant peaks" in the histogram, where each bin counts the total number of black pixels in a diagonal line in the recurrence plot, and "significant" means "more than average height of the histogram plus three times the standard deviation [of the height of the bars in the histogram? The authors did not specify]".
  - Nn is the average number of neighbours, that is, the average number of black pixels in each vertical column in the recurrence plot.
- select ε so that the number of neighbours for every point of the trajectory is constant (fixed amount of nearest neighbours). This however breaks the symmetry of recurrence plot, and creates a generalized recurrence plot instead. Section 3.2.4 lists more generalized recurrence plots which can also be read here.
[25] finds that setting ε = (0.2~0.4) * diameter(data) maximizes discrimination between chaotic signal and noise using recurrence plot. The authors also says in the conclusion that ε =0.05 * diameter(data) works well.

A “rule of thumb” for the choice of the threshold σ is to select it as a few per cent (not larger than 10%) of the maximum phase space diameter [26–28]. For classification purpose and signal detection, a better choice is to select σ between 20 and 40% of the signal’s standard deviation [25]. However, the influence of noise can necessitate a larger threshold, because noise would distort any existing structure in the RP. Higher threshold may preserve these structures [9]. Suggestions from literature show that this threshold should be a few per cent of the maximum phase space diameter [26] and should not exceed 10% of the mean or the maximum phase space diameter [27,28].

ε = p * diameter(dataset), with p < 0.1
ε = p * σ(dataset), with 0.2 < p < 0.4
ε < 0.1 * mean distance of dataset.
[26] simply stated that "The threshold e is fixed, typically at a few percent of the diameter of the attractor". It says "attractor" because this paper only concerns with reconstructing attractors in series that are generated near an attractor.
[27] contains no useful information on selecting ε.
[28] is an early (1992) paper that introduced RQA. It simply tested a few RQA on Lorenz system and a few others. It only said "we fix r values as small as possible (typically no greater than 10% of the normalized mean distance of the first embedding) relative to the noise level". Here "r value" is ε.

Using the recurrence point density of the RP, the threshold can be chosen from the analysis of this measure in respect to a changing threshold [7]. The threshold can then be found by looking for a scaling region in the recurrence point density. However, this may not work for nonstationary data. For this case Zbilut et al. [7] have suggested to choose σ so that the recurrence point density is approximately 1%. For noisy periodic processes, [24] have suggested to use the diagonal structures within the RP in order to determine an optimal threshold. Their criterion minimizes the fragmentation and thickness of the diagonal lines in respect to the threshold.

[7]
- calculate RR vs ε, plot on a loglog plot, and select ε in the middle of a linear region, since that's where a scaling law is working, and the dynamics is presumably found.
- If the data are extremely nonstationary, a scaling region may not be found. Then select ε so that RR ~ 0.01.
- Look at the recurrence plot: if it looks sparse, increase ε.
- Windowed RQA is especially informative. If a given window fails to achieve RR = 1%, ε should be increased, or the window enlarged.
[24] see above.

Recent studies about RPs in our group have revealed a more exact criterion for choosing this threshold. This criterion takes into account that a measurement of a process is a composition of the real signal and some observational noise with standard deviation. In order to get similar results by using RPs, a threshold has to be chosen which is five times larger than the standard deviation of the observational noise [22]. This criterion holds for a wide class of processes.

Nothing new here.

Feb 12 '20 07:02 yuxi-liu-wired

The first five methods are easily implemented in RecurrenceMatrix by using scale keyword, and there's no need for a separate function that selects ε upon receiving a dataset.

ε = p * std, with 0.2 < p < 0.4
ε < 0.1 * mean.
ε = p * median with p ~ 0.2.
ε = p * maximum, with p < 0.1, or p ~ 0.05, or 0.2 < p < 0.4. Note that what is often called diameter in the papers is just maximum.
ε = p-quantile, with 0.01< p < 0.05.

Of these, maximum and mean are already implemented by using keyword scale. quantile is implemented but with a weird fixedrate argument, rather than using the scale keyword.

The std and median are easy to implement like mean. My current implementation is in matrices.jl (untested, since there's a bug in the RecurrenceMatrix) is


function _computedistances(x, y, metric::Metric)
    if x===y
        distlist = Vector{Real}(undef, length(x)*(length(x)-1)/2)
        index = 1
        @inbounds for i in 1:length(x)-1, j=(i+1):length(x)
            distlist[index] = evaluate(metric, x[i], y[j])
            index += 1
        end
    else
        distlist = Vector{Real}(undef, length(x)*length(y))
        index = 1
        @inbounds for xi in x, yj in y
            distlist[index] += evaluate(metric, xi, yj)
            index += 1
        end
    end
    return distlist
end

function _computescale(scale::typeof(var), x, y, metric::Metric)
    return Statistics.var(_computedistances(x, y, metric))
end

function _computescale(scale::typeof(median), x, y, metric::Metric)
    return Statistics.mean(_computedistances(x, y, metric))
end

ε > 5 * std of noise, in the case where the measurement device has a Gaussian noise distribution with a known std.

This one requires the user to know how their measurement device is. It cannot be done in general. However, suppose the user can assure the program that the noise in measurement device is significantly bigger than the evolution of the system over a short timescale, the program can then use a high-pass filter to get what's presumably pure noise from the measurement error, then calculate 5 * std of noise, and advice the user to select ε bigger than that.

ε = argmin |1 − Np(ε)/Nn(ε)|, where
- Np is the number of "significant peaks" in the histogram, where each bin counts the total number of black pixels in a diagonal line in the recurrence plot.
- Nn is the average number of neighbours, that is, the average number of black pixels in each vertical column in the recurrence plot.

This one can be implemented in a standalone function.

The "significant peaks" might be found by this peak finding algorithm.

Nn = RR * n, where n is the size of the dataset.

Draw RR vs ε on a loglog plot, and select ε in the middle of a linear region, since that's where a scaling law is working, and the dynamics is presumably found.

This might be a bit risky to automate, but if the user wants to use this method, it would be useful to have a function that draws the RR vs ε loglog plot, and let the user decide whether there is a scaling law region, and which ε in the region to use.

linear_regions can be used here

Feb 12 '20 07:02 yuxi-liu-wired

I just implemented the last two, but they are very slow and don't improve (in fact, they give worse epsilons).

I don't think they are worth it, but if you want, I'll put them into the pull commit too.

Feb 12 '20 14:02 yuxi-liu-wired

@pucicu are there any other methods we should consider here, besides what @yuxiliu1995 already wrote?

@yuxiliu1995 yes, include everything and put a comment for methods that you think are not good enough after you've tested them.

Feb 12 '20 14:02 Datseris

Yuxi, thanks for summarizing all these methods.

First, I would like to emphasize that the selection of the threshold depends on the research question. There is no “perfect rule” valid for everything. For example, whereas the estimation of dynamical invariants requires a very small threshold (but also very long time series), the reconstruction of a time series from the recurrence matrix or the creation of twin surrogates require much larger thresholds.

@all, I suggest to not include all of these methods. The most suitable method for most of the research questions is using the quantile of the distance matrix. This is the same as pre-selecting the recurrence rate. In particular when working with sliding windows, this approach is the preferred one.

The next-to-last method in Yuxi’s list ("ε = argmin |1 − Np(ε)/Nn(ε)|,”) is also worth to be implemented.

The methods using the diameter or the loglog-plot of RR vs. ε are not so optimal. The first one is too sensitive to outliers. The loglog-plot method can not be really justified, why this criterion would make sense. A similar approach using the turning point of RR vs. ε was suggested, but critically reviewd by R. V. Donner et al, Ambiguities in recurrence-based complex network representations of time series, Physical Review E, 81, 015101(R)p. (2010). DOI:10.1103/PhysRevE.81.015101 http://doi.org/10.1103/PhysRevE.81.015101

An alternative and still missing method in Yuxi's list is the one we published here: D. Eroglu et al, Finding recurrence networks' threshold adaptively for a specific time series, Nonlinear Processes in Geophysics, 21, 1085–1092p. (2014). DOI:10.5194/npg-21-1085-2014 http://doi.org/10.5194/npg-21-1085-2014 Although the focus is on recurrence networks, this approach gives the best threshold for reconstruction of a time series from the recurrence matrix (because the reconstruction is based on a graph).

I hope this helps.

Am 12.02.2020 um 15:58 schrieb George Datseris [email protected]:

@pucicu https://github.com/pucicu are there any other methods we should consider here, besides what @yuxiliu1995 https://github.com/yuxiliu1995 already wrote?

@yuxiliu1995 https://github.com/yuxiliu1995 yes, include everything and put a comment for methods that you think are not good enough after you've tested them.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/JuliaDynamics/RecurrenceAnalysis.jl/issues/67?email_source=notifications&email_token=AAHNNEHQN5Y2CTXPZJSNANLRCQFBBA5CNFSM4GSQ2LU2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOELRCIJA#issuecomment-585245732, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAHNNEGKPSIAVH2IPVV2ROLRCQFBBANCNFSM4GSQ2LUQ.

Feb 12 '20 20:02 pucicu

I just implemented the last two, but they are very slow and don't improve (in fact, they give worse epsilons).

I don't think they are worth it, but if you want, I'll put them into the pull commit too.

The next-to-the-last one would be really helpful. Most of the others methods can be easily implemented by the user of the package, but this one would be more difficult. Therefore, if the package could provide this, it would be great.

Feb 12 '20 20:02 pucicu

Some work on this has started in https://github.com/JuliaDynamics/RecurrenceAnalysis.jl/pull/89 , anyone willing feel free to take over.

Mar 03 '20 15:03 Datseris