pycbc icon indicating copy to clipboard operation
pycbc copied to clipboard

Indexing bug in PyCBC Live's followup code

Open titodalcanton opened this issue 2 years ago • 14 comments

This happened on the O3 replay MDC, currently running the prepare-v215 branch, earlier today. I included the last few lines from the rank 0 process for context.

2024-02-18T00:44:44.519-08:00 node836 0 Analyzing from 1392281088.069336
2024-02-18T00:44:44.574-08:00 node836 0 Skipping recalculation of H1 PSD, 252.69764532256113-250.98143016119255
2024-02-18T00:44:44.632-08:00 node836 0 Skipping recalculation of L1 PSD, 302.19981974510756-299.56498426867745
2024-02-18T00:44:44.684-08:00 node836 0 Recalculating V1 PSD, 109.41863923048133
2024-02-18T00:44:44.685-08:00 node836 0 Gathering triggers
2024-02-18T00:44:51.226-08:00 node836 0 Checking H1's DQ vector
2024-02-18T00:44:51.228-08:00 node836 0 Keeping 4400/4400 H1 triggers after DQ flags
2024-02-18T00:44:51.228-08:00 node836 0 Checking H1's iDQ information
2024-02-18T00:44:51.229-08:00 node836 0 Keeping 4400/4400 H1 triggers after iDQ
2024-02-18T00:44:51.229-08:00 node836 0 Checking L1's DQ vector
2024-02-18T00:44:51.230-08:00 node836 0 Keeping 4410/4410 L1 triggers after DQ flags
2024-02-18T00:44:51.230-08:00 node836 0 Checking L1's iDQ information
2024-02-18T00:44:51.231-08:00 node836 0 Keeping 4410/4410 L1 triggers after iDQ
2024-02-18T00:44:51.237-08:00 node836 0 H1-L1: 11080958 coincs, 201326592 bytes
2024-02-18T00:44:51.237-08:00 node836 0 adding singles to the background estimate...
2024-02-18T00:44:53.233-08:00 node836 0 H1-L1: 27477 background and zerolag coincs
2024-02-18T00:44:53.258-08:00 node836 0 Clustering H1-L1 coincs
2024-02-18T00:44:53.260-08:00 node836 0 Clustering events over 8.024025692304448 s window
2024-02-18T00:44:53.267-08:00 node836 0 11307 triggers remaining
2024-02-18T00:44:53.338-08:00 node836 0 Coincs 0: H1-L1: 11080363 in cbuffer
2024-02-18T00:44:53.341-08:00 node836 0 Found H1-L1 coinc with ifar 0.0007362098187409289
2024-02-18T00:44:53.342-08:00 node836 0 computing followup data for coinc
2024-02-18T00:44:53.343-08:00 node836 0 Generating SPAtmplt, 200.0s, 4057, starting from 23.1 Hz
2024-02-18T00:44:53.676-08:00 node836 0 Generating SPAtmplt, 528.0s, 4057, starting from 23.1 Hz
Traceback (most recent call last):
  File "/home/pycbc.live/.conda/envs/o4-test-env/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/pycbc.live/.conda/envs/o4-test-env/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/pycbc.live/.conda/envs/o4-test-env/lib/python3.9/site-packages/mpi4py/__main__.py", line 7, in <module>
    main()
  File "/home/pycbc.live/.conda/envs/o4-test-env/lib/python3.9/site-packages/mpi4py/run.py", line 198, in main
    run_command_line(args)
  File "/home/pycbc.live/.conda/envs/o4-test-env/lib/python3.9/site-packages/mpi4py/run.py", line 47, in run_command_line
    run_path(sys.argv[0], run_name='__main__')
  File "/home/pycbc.live/.conda/envs/o4-test-env/lib/python3.9/runpy.py", line 288, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/home/pycbc.live/.conda/envs/o4-test-env/lib/python3.9/runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/home/pycbc.live/.conda/envs/o4-test-env/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/pycbc.live/.conda/envs/o4-test-env/bin/pycbc_live", line 1289, in <module>
    evnt.check_coincs(list(results.keys()), best_coinc, psds)
  File "/home/pycbc.live/.conda/envs/o4-test-env/bin/pycbc_live", line 512, in check_coincs
    sld = self.compute_followup_data(
  File "/home/pycbc.live/.conda/envs/o4-test-env/bin/pycbc_live", line 213, in compute_followup_data
    pvalue_info = followup_event_significance(
  File "/home/pycbc.live/.conda/envs/o4-test-env/lib/python3.9/site-packages/pycbc/filter/matchedfilter.py", line 1856, in followup_event_significance
    stilde = data_reader.overwhitened_data(htilde.delta_f)
  File "/home/pycbc.live/.conda/envs/o4-test-env/lib/python3.9/site-packages/pycbc/strain/strain.py", line 1755, in overwhitened_data
    fseries = execute_cached_fft(self.strain[s:e],
  File "/home/pycbc.live/.conda/envs/o4-test-env/lib/python3.9/site-packages/pycbc/types/array.py", line 65, in convert
    return func(self, *args, **kwargs)
  File "/home/pycbc.live/.conda/envs/o4-test-env/lib/python3.9/site-packages/pycbc/types/array.py", line 819, in __getitem__
    return self._getslice(index)
  File "/home/pycbc.live/.conda/envs/o4-test-env/lib/python3.9/site-packages/pycbc/types/timeseries.py", line 131, in _getslice
    raise ValueError(('Negative start index ({})'
ValueError: Negative start index (-46820) not supported

I doubt this is due to recent work on PyCBC Live, it is probably a rare pre-existing edge case.

titodalcanton avatar Feb 18 '24 16:02 titodalcanton

Another occurrence earlier today.

2024-02-18T16:45:45.204-08:00 node836 0 Analyzing from 1392338752.069336
2024-02-18T16:45:46.363-08:00 node836 0 Skipping recalculation of L1 PSD, 296.9423530646091-294.87994008534906
2024-02-18T16:45:46.511-08:00 node836 0 Skipping recalculation of V1 PSD, 107.66368933729835-107.88926682682732
2024-02-18T16:45:46.565-08:00 node836 0 Skipping recalculation of H1 PSD, 250.65431577749732-251.63033187290597
2024-02-18T16:45:46.565-08:00 node836 0 Gathering triggers
2024-02-18T16:45:50.185-08:00 node836 0 Checking L1's DQ vector
2024-02-18T16:45:50.187-08:00 node836 0 Keeping 4410/4410 L1 triggers after DQ flags
2024-02-18T16:45:50.187-08:00 node836 0 Checking L1's iDQ information
2024-02-18T16:45:50.188-08:00 node836 0 Keeping 4410/4410 L1 triggers after iDQ
2024-02-18T16:45:50.188-08:00 node836 0 Checking H1's DQ vector
2024-02-18T16:45:50.189-08:00 node836 0 Keeping 4410/4410 H1 triggers after DQ flags
2024-02-18T16:45:50.189-08:00 node836 0 Checking H1's iDQ information
2024-02-18T16:45:50.190-08:00 node836 0 Keeping 4410/4410 H1 triggers after iDQ
2024-02-18T16:45:50.193-08:00 node836 0 H1-L1: 8817556 coincs, 201326592 bytes
2024-02-18T16:45:50.194-08:00 node836 0 adding singles to the background estimate...
2024-02-18T16:45:51.930-08:00 node836 0 H1-L1: 24165 background and zerolag coincs
2024-02-18T16:45:51.948-08:00 node836 0 Clustering H1-L1 coincs
2024-02-18T16:45:51.949-08:00 node836 0 Clustering events over 8.024025692304448 s window
2024-02-18T16:45:51.954-08:00 node836 0 10754 triggers remaining
2024-02-18T16:45:52.006-08:00 node836 0 Coincs 0: L1-H1: 8817932 in cbuffer
2024-02-18T16:45:52.008-08:00 node836 0 Found L1-H1 coinc with ifar 0.0024406516679990027
2024-02-18T16:45:52.008-08:00 node836 0 computing followup data for coinc
2024-02-18T16:45:52.009-08:00 node836 0 Generating SPAtmplt, 192.0s, 18895, starting from 23.1 Hz
2024-02-18T16:45:52.275-08:00 node836 0 Generating SPAtmplt, 520.0s, 18895, starting from 23.1 Hz
Traceback (most recent call last):
  File "/home/pycbc.live/.conda/envs/o4-test-env/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/pycbc.live/.conda/envs/o4-test-env/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/pycbc.live/.conda/envs/o4-test-env/lib/python3.9/site-packages/mpi4py/__main__.py", line 7, in <module>
    main()
  File "/home/pycbc.live/.conda/envs/o4-test-env/lib/python3.9/site-packages/mpi4py/run.py", line 198, in main
    run_command_line(args)
  File "/home/pycbc.live/.conda/envs/o4-test-env/lib/python3.9/site-packages/mpi4py/run.py", line 47, in run_command_line
    run_path(sys.argv[0], run_name='__main__')
  File "/home/pycbc.live/.conda/envs/o4-test-env/lib/python3.9/runpy.py", line 288, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/home/pycbc.live/.conda/envs/o4-test-env/lib/python3.9/runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/home/pycbc.live/.conda/envs/o4-test-env/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/pycbc.live/.conda/envs/o4-test-env/bin/pycbc_live", line 1289, in <module>
    evnt.check_coincs(list(results.keys()), best_coinc, psds)
  File "/home/pycbc.live/.conda/envs/o4-test-env/bin/pycbc_live", line 512, in check_coincs
    sld = self.compute_followup_data(
  File "/home/pycbc.live/.conda/envs/o4-test-env/bin/pycbc_live", line 213, in compute_followup_data
    pvalue_info = followup_event_significance(
  File "/home/pycbc.live/.conda/envs/o4-test-env/lib/python3.9/site-packages/pycbc/filter/matchedfilter.py", line 1856, in followup_event_significance
    stilde = data_reader.overwhitened_data(htilde.delta_f)
  File "/home/pycbc.live/.conda/envs/o4-test-env/lib/python3.9/site-packages/pycbc/strain/strain.py", line 1755, in overwhitened_data
    fseries = execute_cached_fft(self.strain[s:e],
  File "/home/pycbc.live/.conda/envs/o4-test-env/lib/python3.9/site-packages/pycbc/types/array.py", line 65, in convert
    return func(self, *args, **kwargs)
  File "/home/pycbc.live/.conda/envs/o4-test-env/lib/python3.9/site-packages/pycbc/types/array.py", line 819, in __getitem__
    return self._getslice(index)
  File "/home/pycbc.live/.conda/envs/o4-test-env/lib/python3.9/site-packages/pycbc/types/timeseries.py", line 131, in _getslice
    raise ValueError(('Negative start index ({})'
ValueError: Negative start index (-30436) not supported

I am now wondering if this is triggered (or rather, made more frequent) by switching to the O4 bank in the MDC. Note how in both cases the template is a pretty long SPAtmplt waveform.

titodalcanton avatar Feb 19 '24 11:02 titodalcanton

-46820 samples is about 23 s, -30436 about 15 s. This might simply mean we have templates that are a bit too long for --max-length.

We should add checks to the code to make sure these situations are caught immediately at startup.

titodalcanton avatar Feb 19 '24 11:02 titodalcanton

Another case yesterday:

2024-03-10T00:11:52.228-08:00 node836 0 Analyzing from 1394093520.069336
2024-03-10T00:11:54.321-08:00 node836 0 Skipping recalculation of V1 PSD, 127.21759560980665-127.03162850396691
2024-03-10T00:11:54.396-08:00 node836 0 Skipping recalculation of L1 PSD, 291.86297202552765-291.61464774707844
2024-03-10T00:11:54.463-08:00 node836 0 Skipping recalculation of H1 PSD, 250.96582180920075-253.43235303526205
2024-03-10T00:11:54.463-08:00 node836 0 Gathering triggers
2024-03-10T00:11:58.779-08:00 node836 0 Checking L1's DQ vector
2024-03-10T00:11:58.781-08:00 node836 0 Keeping 4410/4410 L1 triggers after DQ flags
2024-03-10T00:11:58.781-08:00 node836 0 Checking L1's iDQ information
2024-03-10T00:11:58.782-08:00 node836 0 Keeping 4410/4410 L1 triggers after iDQ
2024-03-10T00:11:58.783-08:00 node836 0 Checking H1's DQ vector
2024-03-10T00:11:58.783-08:00 node836 0 Keeping 4409/4409 H1 triggers after DQ flags
2024-03-10T00:11:58.784-08:00 node836 0 Checking H1's iDQ information
2024-03-10T00:11:58.785-08:00 node836 0 Keeping 4409/4409 H1 triggers after iDQ
2024-03-10T00:11:58.789-08:00 node836 0 H1-L1: 12087287 coincs, 201326592 bytes
2024-03-10T00:11:58.789-08:00 node836 0 adding singles to the background estimate...
2024-03-10T00:12:01.109-08:00 node836 0 H1-L1: 31134 background and zerolag coincs
2024-03-10T00:12:01.150-08:00 node836 0 Clustering H1-L1 coincs
2024-03-10T00:12:01.151-08:00 node836 0 Clustering events over 8.024025692304448 s window
2024-03-10T00:12:01.160-08:00 node836 0 12491 triggers remaining
2024-03-10T00:12:01.255-08:00 node836 0 Coincs 0: L1-H1: 12094344 in cbuffer
2024-03-10T00:12:01.260-08:00 node836 0 Found L1-H1 coinc with ifar 0.0003649531813773357
2024-03-10T00:12:01.261-08:00 node836 0 computing followup data for coinc
2024-03-10T00:12:01.261-08:00 node836 0 Generating SPAtmplt, 200.0s, 15195, starting from 23.1 Hz
2024-03-10T00:12:01.610-08:00 node836 0 Generating SPAtmplt, 520.0s, 15195, starting from 23.1 Hz
Traceback (most recent call last):
  File "/home/pycbc.live/.conda/envs/o4-test-env/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/pycbc.live/.conda/envs/o4-test-env/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/pycbc.live/.conda/envs/o4-test-env/lib/python3.9/site-packages/mpi4py/__main__.py", line 7, in <module>
    main()
  File "/home/pycbc.live/.conda/envs/o4-test-env/lib/python3.9/site-packages/mpi4py/run.py", line 198, in main
    run_command_line(args)
  File "/home/pycbc.live/.conda/envs/o4-test-env/lib/python3.9/site-packages/mpi4py/run.py", line 47, in run_command_line
    run_path(sys.argv[0], run_name='__main__')
  File "/home/pycbc.live/.conda/envs/o4-test-env/lib/python3.9/runpy.py", line 288, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/home/pycbc.live/.conda/envs/o4-test-env/lib/python3.9/runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/home/pycbc.live/.conda/envs/o4-test-env/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/pycbc.live/.conda/envs/o4-test-env/bin/pycbc_live", line 1319, in <module>
    evnt.check_coincs(list(results.keys()), best_coinc, psds)
  File "/home/pycbc.live/.conda/envs/o4-test-env/bin/pycbc_live", line 518, in check_coincs
    sld = self.compute_followup_data(
  File "/home/pycbc.live/.conda/envs/o4-test-env/bin/pycbc_live", line 219, in compute_followup_data
    pvalue_info = followup_event_significance(
  File "/home/pycbc.live/.conda/envs/o4-test-env/lib/python3.9/site-packages/pycbc/filter/matchedfilter.py", line 1856, in followup_event_significance
    stilde = data_reader.overwhitened_data(htilde.delta_f)
  File "/home/pycbc.live/.conda/envs/o4-test-env/lib/python3.9/site-packages/pycbc/strain/strain.py", line 1755, in overwhitened_data
    fseries = execute_cached_fft(self.strain[s:e],
  File "/home/pycbc.live/.conda/envs/o4-test-env/lib/python3.9/site-packages/pycbc/types/array.py", line 65, in convert
    return func(self, *args, **kwargs)
  File "/home/pycbc.live/.conda/envs/o4-test-env/lib/python3.9/site-packages/pycbc/types/array.py", line 819, in __getitem__
    return self._getslice(index)
  File "/home/pycbc.live/.conda/envs/o4-test-env/lib/python3.9/site-packages/pycbc/types/timeseries.py", line 131, in _getslice
    raise ValueError(('Negative start index ({})'
ValueError: Negative start index (-30436) not supported

titodalcanton avatar Mar 11 '24 15:03 titodalcanton

A quick observation from me:

2024-03-10T00:12:01.261-08:00 node836 0 Generating SPAtmplt, 200.0s, 15195, starting from 23.1 Hz
2024-03-10T00:12:01.610-08:00 node836 0 Generating SPAtmplt, 520.0s, 15195, starting from 23.1 Hz

Why is it generating the template twice (apparently with different delta_f)? I can't see anywhere in here that would cause the code to loop and run twice:

https://github.com/gwastro/pycbc/blob/b95b2604178172f839796540081ae30e5477b2d0/pycbc/waveform/bank.py#L585

and pycbc_live:

https://github.com/gwastro/pycbc/blob/master/bin/pycbc_live#L195

should only call this once (per new coinc) ???

spxiwh avatar Mar 26 '24 14:03 spxiwh

Oh wait, it's called a second time here (??):

https://github.com/gwastro/pycbc/blob/master/pycbc/filter/matchedfilter.py#L1810

Except here the delta_f is different (no idea why). I think this is a possible cause of this, as this is then being directly fed into the place where the failure occurs.

spxiwh avatar Mar 26 '24 14:03 spxiwh

After chatting with Tito, the issue is here:

https://github.com/gwastro/pycbc/blob/master/pycbc/filter/matchedfilter.py#L1855

and arises because the code asks for additional look back time for [reasons I haven't understood]. However, as we are only storing 256s of data in the buffer, this is looking for data that doesn't exist, and we get the failure above.

I'm not sure what the right thing (TM) to do here is, but think it should be easily possible to stop the code failing when this occurs.

spxiwh avatar Mar 26 '24 15:03 spxiwh

One thing I do not immediately understand is why this check does not prevent the problem from happening.

titodalcanton avatar Mar 26 '24 15:03 titodalcanton

@titodalcanton Actually, I think I spotted a bug in the logic here.

In the template code, it uses "min_buffer" as something in addition to the duration of the signal itself. https://github.com/gwastro/pycbc/blob/b95b2604178172f839796540081ae30e5477b2d0/pycbc/waveform/bank.py#L603

So the delta_f will be longer than dur calculated in the followup code. In the followup code, it assumes that this 'min_buffer' is the total duration of the signal.

@titodalcanton That would explain why it bypasses the check you pointed to as the bank then increases the duration by the signal length.

ahnitz avatar Mar 26 '24 15:03 ahnitz

@spxiwh The first template generation is from this line here. https://github.com/gwastro/pycbc/blob/master/bin/pycbc_live#L195

And that's used for the snr time series snippet (not the background estimate which needs to be longer).

ahnitz avatar Mar 26 '24 16:03 ahnitz

@titodalcanton found an example of an event where we think the followup code should have hit this bug but did not.

2024-03-19T15:49:34.486-07:00 node836 0 Checking H1's DQ vector
2024-03-19T15:49:34.487-07:00 node836 0 Keeping 4410/4410 H1 triggers after DQ flags
2024-03-19T15:49:34.488-07:00 node836 0 Checking H1's iDQ information
2024-03-19T15:49:34.489-07:00 node836 0 Keeping 4410/4410 H1 triggers after iDQ
2024-03-19T15:49:34.489-07:00 node836 0 Checking L1's DQ vector
2024-03-19T15:49:34.490-07:00 node836 0 Keeping 4410/4410 L1 triggers after DQ flags
2024-03-19T15:49:34.490-07:00 node836 0 Checking L1's iDQ information
2024-03-19T15:49:34.491-07:00 node836 0 Keeping 4410/4410 L1 triggers after iDQ
2024-03-19T15:49:34.495-07:00 node836 0 H1-L1: 12166700 coincs, 201326592 bytes
2024-03-19T15:49:34.495-07:00 node836 0 adding singles to the background estimate...
2024-03-19T15:49:36.138-07:00 node836 0 H1-L1: 32764 background and zerolag coincs
2024-03-19T15:49:36.157-07:00 node836 0 Clustering H1-L1 coincs
2024-03-19T15:49:36.159-07:00 node836 0 Clustering events over 8.024025692304448 s window
2024-03-19T15:49:36.166-07:00 node836 0 11227 triggers remaining
2024-03-19T15:49:36.228-07:00 node836 0 Coincs 0: H1-L1: 12165403 in cbuffer
2024-03-19T15:49:36.231-07:00 node836 0 Found H1-L1 coinc with ifar 4.168327967483797
2024-03-19T15:49:36.231-07:00 node836 0 computing followup data for coinc
2024-03-19T15:49:36.232-07:00 node836 0 Generating SPAtmplt, 112.0s, 306285, starting from 23.1 Hz
2024-03-19T15:49:36.374-07:00 node836 0 Generating SPAtmplt, 352.0s, 306285, starting from 23.1 Hz
2024-03-19T15:49:36.852-07:00 node836 0 Adding V1 to candidate, pvalue 0.06404907975460122, 4074 samples
2024-03-19T15:49:36.878-07:00 node836 0 Computing p_astro
2024-03-19T15:49:36.946-07:00 node836 0 Coincident candidate! Saving as /local/pycbc.live/pre_O4_mdc/full_bandwidth/triggers/2024_03_19/candidate_1394923778.509_yhjq/coinc-1394923778.509.xml.gz
2024-03-19T15:49:37.156-07:00 node836 0 Calculating IFAR

So the 2nd time the waveform was generated, it generated with a length of 352 s which is longer than the current 256s strain buffer, but somehow the followup didn't crash.

maxtrevor avatar Apr 03 '24 19:04 maxtrevor

Following up on what @maxtrevor just wrote, the puzzling thing at the moment is that here I would expect s to be negative (because buffer_length is about 352), causing self.strain[s:e] to trigger the exception we see further up. But for reasons I do not understand, that happens only sometimes.

titodalcanton avatar Apr 03 '24 19:04 titodalcanton

So looking more closely at the code, there's a line here https://github.com/gwastro/pycbc/blob/master/pycbc/strain/strain.py#L1991 which doubles max_length. So when we ask for 256, we get 512.

This would then mean that rather than being something that happens very commonly, it only happens at the very edge of parameter space (note that the original failure is asking for 528 seconds).

Hopefully this is consistent with what has been observed .... What I'm not seeing though is any check in this followup code that the data being read in is actually valid. What if it asks for 528 seconds, but the first 200 seconds is not locked (for e.g.)?

spxiwh avatar Apr 03 '24 20:04 spxiwh

Aha, that explains a lot @spxiwh, thanks for catching that.

@ahnitz what motivates the doubling? To avoid repeating the same confusion in a while, I propose dropping it, though that will require changing --max-length in a few places. I also noticed that max_buffer is given different defaults in different places (including docstrings), so we may as well clear that up for good (probably not giving any default at all).

titodalcanton avatar Apr 03 '24 21:04 titodalcanton

This is nominally addressed by #4676, will keep open for a bit longer until we are convinced everything works well on the MDC.

titodalcanton avatar Apr 11 '24 08:04 titodalcanton

Looks like this is no longer a problem. Closing, will reopen if something similar pops up again.

titodalcanton avatar Jun 11 '24 11:06 titodalcanton