
Mishandled sorting provenance during WaveformExtractor to SortingAnalyzer conversion

Open · grahamfindlay opened this issue 6 months ago · 1 comment

(I had a call with @samuelgarcia this morning about this issue -- filing it here for tracking purposes.)

I am revisiting some of my old unit data, which was sorted and postprocessed using, I think, spikeinterface == 0.98.0.dev0, or thereabouts. I am trying to convert some old WaveformExtractor folders to the newer SortingAnalyzer folders. I can load a MockWaveformExtractor using:

import spikeinterface as si
import spikeinterface.extractors

waveform_sorting = spikeinterface.extractors.read_kilosort(sorter_output_dir)
we = si.load_waveforms(waveform_output_dir, with_recording=False, sorting=waveform_sorting)  # Takes ~15 min.

Then I can write the SortingAnalyzer to disk like so:

sa = we.sorting_analyzer
analyzer_output_dir = waveform_output_dir.parent / "si_sorting_analyzer"
sa.save_as(folder=analyzer_output_dir, format="binary_folder")  # Takes ~2 min.

If I try to write the SortingAnalyzer to disk using format="zarr", I get an error: ValueError: Codec does not support buffers of > 2147483647 bytes. I think this is because the SortingAnalyzer is trying to write with chunks > 2 GB, which is not supported by numcodecs.Pickle() in the call to zarr_root.create_dataset("sorting_provenance", ...) in SortingAnalyzer.create_zarr() (line 614). So I just wrote a binary folder instead, which succeeds, but I suspect this zarr failure is related to the main issue, described next.

The main issue: the analyzer_output_dir that gets created is 22.85 GB, whereas the waveform_output_dir was only 6.01 GB. Here is where the disk usage is coming from (tallied with the quick sketch after the list):

- analyzer_output_dir = 22.85 GB
    - /extensions = 6.01 GB # same size as original waveform_output_dir
    - /sorting_provenance.pickle = 5.61 GB
    - /sorting/provenance.pkl = 5.61 GB (seems duplicated?)
    - /sorting/spikes.npy = 5.61 GB (triple duplication? sus)
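
(The tally above came from a quick recursive walk; a throwaway sketch, nothing spikeinterface-specific:)

from pathlib import Path

# Total size of all files at or under `path`, in GB
def du_gb(path: Path) -> float:
    if path.is_file():
        return path.stat().st_size / 1e9
    return sum(f.stat().st_size for f in path.rglob("*") if f.is_file()) / 1e9

for child in sorted(Path(analyzer_output_dir).iterdir()):
    print(f"{child.name}: {du_gb(child):.2f} GB")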

My guess is that these 5.61 GB files hold the same data that zarr_root.create_dataset("sorting_provenance", ...) was trying to write, probably as a single unchunked buffer, when format="zarr" failed above.
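
If that guess is right, the codec error is easy to provoke in isolation. A minimal sketch (mine, not spikeinterface code; assumes zarr v2 + numcodecs and a few GB of free RAM):

import numpy as np
import zarr
from numcodecs import Pickle

# One object whose pickled form exceeds 2147483647 bytes, stored as a single chunk
payload = np.empty(1, dtype=object)
payload[0] = bytes(2_200_000_000)  # ~2.2 GB of zeros

root = zarr.open_group("demo.zarr", mode="w")
root.create_dataset("sorting_provenance", data=payload, object_codec=Pickle())
# ValueError: Codec does not support buffers of > 2147483647 bytes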

For reference, the original sorter_output_dir is 131.19 GB, of which template_features.npy makes up the largest share (29.94 GB), followed by amplitudes.npy (1.87 GB) and spike_times.npy (1.87 GB).

Here is what I think is happening when loading (i.e. creating the MockWaveformExtractor):

  1. si.load_waveforms receives an extractor object, which is passed on line 425 to _read_old_waveform_extractor_binary.
  2. _read_old_waveform_extractor_binary passes this extractor as the first argument to SortingAnalyzer.create_memory on line 498.
  3. SortingAnalyzer.create_memory converts this extractor to a NumpySorting on line 391. Even though with_metadata=True is passed, the sorting provenance is lost.
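
To see the effect of step 3 in isolation (my sketch; NumpySorting.from_sorting with with_metadata=True is, I believe, the conversion create_memory performs):

from spikeinterface.core import NumpySorting

np_sorting = NumpySorting.from_sorting(waveform_sorting, with_metadata=True)

print(type(waveform_sorting).__name__)  # KiloSortSortingExtractor -- knows its sorter_output_dir
print(type(np_sorting).__name__)        # NumpySorting -- spike vector only, provenance gone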

Here is what I think is happening when saving the SortingAnalyzer:

  1. sa.save_as calls sa._save_or_select_or_merge, which tries to ascertain sorting provenance on line 965 using sa.get_sorting_provenance().
  2. sa.get_sorting_provenance() checks sa.format, finds that sa.format == "memory", and therefore returns None. Apparently an in-memory SortingAnalyzer cannot have sorting provenance.
  3. Because the provenance is None, SortingAnalyzer._save_or_select_or_merge sets it to sa.sorting (the NumpySorting) on line 968.
  4. SortingAnalyzer._save_or_select_or_merge passes the NumpySorting to SortingAnalyzer.create_binary_folder on line 1002.
  5. The NumpySorting gets written to disk twice from a single call to sorting.save on line 422 of SortingAnalyzer.create_binary_folder.
    • BaseExtractor.save_to_folder tests self.check_serializability("pickle") on line 963, which passes, so self.dump_to_pickle writes the first copy of the sorting, provenance.pkl, on line 965.
    • BaseExtractor.save_to_folder also calls self._save without a format argument on line 972; BaseSorting._save supplies the default format="numpy_folder", which triggers NumpyFolderSorting.write_sorting on line 257 and writes the second copy of the sorting, spikes.npy.
  6. The NumpySorting gets written to disk a third time on line 439 of SortingAnalyzer.create_binary_folder by sorting.dump, after sorting.check_serializability("pickle") passes. This writes sorting_provenance.pickle.
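
The reason each of those writes costs the full ~5.6 GB is that a NumpySorting carries its entire spike vector through pickle. A sketch of the gates that route the writes in steps 5-6, using np_sorting from the earlier sketch (the json result is my assumption for an in-memory sorting; the pickle result is stated in step 5):

print(np_sorting.check_serializability("json"))    # presumably False: numpy spike arrays aren't JSON-able
print(np_sorting.check_serializability("pickle"))  # True, per step 5 -- so provenance.pkl and
                                                   # sorting_provenance.pickle each embed the full spike vector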

😵 !

grahamfindlay avatar Jun 30 '25 16:06 grahamfindlay

Thank you Graham for this summary. I will try to fix this.

samuelgarcia avatar Jul 02 '25 05:07 samuelgarcia