Mishandled sorting provenance during WaveformExtractor to SortingAnalyzer conversion
(I had a call with @samuelgarcia this morning about this issue -- filing it here for tracking purposes.)
I am revisiting some old unit data that was sorted and postprocessed using, I think, `spikeinterface == 0.98.0.dev0`, or thereabouts. I am trying to convert some old `WaveformExtractor` folders to the newer `SortingAnalyzer` format. I can load a `MockWaveformExtractor` using:
```python
import spikeinterface as si
import spikeinterface.extractors

waveform_sorting = spikeinterface.extractors.read_kilosort(sorter_output_dir)
we = si.load_waveforms(waveform_output_dir, with_recording=False, sorting=waveform_sorting)  # takes ~15 min
```
Then I can write the SortingAnalyzer to disk like so:
```python
sa = we.sorting_analyzer
analyzer_output_dir = waveform_output_dir.parent / "si_sorting_analyzer"
sa.save_as(folder=analyzer_output_dir, format="binary_folder")  # takes ~2 min
```
If I try to write the `SortingAnalyzer` to disk using `format="zarr"`, I get an error: `ValueError: Codec does not support buffers of > 2147483647 bytes`. I think this is because the `SortingAnalyzer` is trying to write chunks larger than 2 GB, which is not supported by `numcodecs.Pickle()` in the call to `zarr_root.create_dataset("sorting_provenance", ...)` in `SortingAnalyzer.create_zarr()` (line 614). So I fell back to writing a binary folder, which succeeds, but I suspect this error is related to another issue (the main issue) that I am about to describe.
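For context, the `2147483647` in that `ValueError` is `2**31 - 1`: object codecs choke when asked to encode a single unchunked buffer past 2 GiB. A minimal sketch of a pre-flight check (`fits_in_codec_buffer` is a hypothetical helper of mine, not part of spikeinterface or numcodecs):

```python
import pickle

# Object codecs such as numcodecs.Pickle refuse single buffers larger than
# 2**31 - 1 bytes -- the 2147483647 figure in the ValueError above.
CODEC_BUFFER_LIMIT = 2**31 - 1

def fits_in_codec_buffer(obj) -> bool:
    """Hypothetical pre-flight check: would pickling `obj` into one
    unchunked object dataset stay under the 2 GiB codec buffer limit?"""
    return len(pickle.dumps(obj)) <= CODEC_BUFFER_LIMIT

# A small object fits comfortably; a 5.61 GB sorting pickle would not.
print(fits_in_codec_buffer(list(range(1000))))  # True
```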
The main issue: the `analyzer_output_dir` that gets created is 22.85 GB, whereas the `waveform_output_dir` was only 6.01 GB.
This is where the disk usage is coming from:
- `analyzer_output_dir` = 22.85 GB
  - `/extensions` = 6.01 GB (same size as the original `waveform_output_dir`)
  - `/sorting_provenance.pickle` = 5.61 GB
  - `/sorting/provenance.pkl` = 5.61 GB (seems duplicated?)
  - `/sorting/spikes.npy` = 5.61 GB (triple duplication? sus)
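For anyone who wants to reproduce this kind of breakdown on their own folders, a short stdlib sketch (the helper names are mine, purely illustrative):

```python
from pathlib import Path

def dir_size_gb(path) -> float:
    """Total size of all files under `path`, in GB (10**9 bytes)."""
    return sum(f.stat().st_size for f in Path(path).rglob("*") if f.is_file()) / 1e9

def size_breakdown(root):
    """Per-child disk usage for a folder, largest first: [(name, GB), ...]."""
    rows = []
    for child in Path(root).iterdir():
        size = dir_size_gb(child) if child.is_dir() else child.stat().st_size / 1e9
        rows.append((child.name, size))
    return sorted(rows, key=lambda r: r[1], reverse=True)
```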
My guess is that these 5.61 GB files hold the same data that `zarr_root.create_dataset("sorting_provenance", ...)` was trying to write to disk, probably unchunked.
For reference, the original `sorter_output_dir` is 131.19 GB, of which `template_features.npy` makes up the largest share (29.94 GB), followed by `amplitudes.npy` (1.87 GB) and `spike_times.npy` (1.87 GB).
Here is what I think is happening when loading (i.e. creating the `MockWaveformExtractor`):

- `si.load_waveforms` receives an extractor object, which is passed on line 425 to `_read_old_waveform_extractor_binary`.
- `_read_old_waveform_extractor_binary` passes this extractor as the first argument to `SortingAnalyzer.create_memory` on line 498.
- `SortingAnalyzer.create_memory` converts this to a `NumpySorting` on line 391. Even though `with_metadata=True`, provenance is lost.
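The provenance-loss step above can be sketched with illustrative mocks (these are NOT the real spikeinterface classes; they only mirror the pattern I'm describing, where conversion copies spikes and metadata but never the link back to the sorter output):

```python
class KilosortSortingMock:
    """Mock of a sorting loaded from a kilosort folder (illustrative only)."""
    def __init__(self, sorter_output_dir, spikes, metadata):
        self.spikes = spikes
        self.metadata = metadata
        # provenance: where this sorting came from on disk
        self.provenance = {"sorter_output_dir": sorter_output_dir}

class NumpySortingMock:
    """Mock of an in-memory numpy-backed sorting (illustrative only)."""
    def __init__(self, spikes, metadata):
        self.spikes = spikes
        self.metadata = metadata
        self.provenance = None  # the conversion never copies provenance

    @classmethod
    def from_sorting(cls, sorting, with_metadata=True):
        # mirrors the create_memory conversion: even with with_metadata=True,
        # only spikes + metadata survive; provenance is dropped
        return cls(sorting.spikes, sorting.metadata if with_metadata else {})

ks = KilosortSortingMock("/data/sorter_output", spikes=[1, 2, 3], metadata={"fs": 30000})
np_sorting = NumpySortingMock.from_sorting(ks, with_metadata=True)
print(np_sorting.provenance)  # None: the link to /data/sorter_output is gone
```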
Here is what I think is happening when saving the `SortingAnalyzer`:

- `sa.save_as` calls `sa._save_or_select_or_merge`, which tries to ascertain sorting provenance on line 965 using `sa.get_sorting_provenance()`.
- `sa.get_sorting_provenance()` checks `sa.format`, finds that `sa.format == "memory"`, and therefore returns `None`. Apparently an in-memory `SortingAnalyzer` cannot have sorting provenance.
- Because `sa.sorting_provenance == None`, it gets set in `SortingAnalyzer._save_or_select_or_merge` to `sa.sorting` (the `NumpySorting`) on line 968.
- `SortingAnalyzer._save_or_select_or_merge` passes the `NumpySorting` to `SortingAnalyzer.create_binary_folder` on line 1002.
- The `NumpySorting` gets written to disk twice from a single call to `sorting.save` on line 422 of `SortingAnalyzer.create_binary_folder`:
  - `BaseExtractor.save_to_folder` tests `self.check_serializability("pickle")` on line 963, which passes, so `self.dump_to_pickle` writes the first copy of the sorting, `provenance.pkl`, on line 965.
  - `BaseExtractor.save_to_folder` also calls `self._save` without a `format` argument on line 972; this is handled by `BaseSorting._save`, whose default `format="numpy_folder"` kwarg triggers `NumpyFolderSorting.write_sorting` on line 257, which writes the second copy of the sorting, `spikes.npy`.
- The `NumpySorting` gets written to disk a third time on line 439 of `SortingAnalyzer.create_binary_folder` by `sorting.dump`, after `sorting.check_serializability("pickle")` passes. This writes `sorting_provenance.pickle`.
😵 !
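The net effect of the save path above can be sketched as a mock (NOT the real spikeinterface code; file names are from the breakdown above, and the `.npy` copy is just pickled bytes here for illustration): one save call, three on-disk copies of the same spike data.

```python
import pickle
import tempfile
from pathlib import Path

def mock_save_analyzer(folder, spikes):
    """Illustrative mock of the save path: the same sorting is serialized
    to disk three times under the analyzer folder."""
    folder = Path(folder)
    (folder / "sorting").mkdir(parents=True, exist_ok=True)
    blob = pickle.dumps(spikes)

    # copy 1: BaseExtractor.save_to_folder -> dump_to_pickle (provenance.pkl)
    (folder / "sorting" / "provenance.pkl").write_bytes(blob)

    # copy 2: BaseSorting._save with format="numpy_folder" (spikes.npy)
    # (mocked as the same pickled bytes rather than a real .npy file)
    (folder / "sorting" / "spikes.npy").write_bytes(blob)

    # copy 3: create_binary_folder -> sorting.dump (sorting_provenance.pickle)
    (folder / "sorting_provenance.pickle").write_bytes(blob)

with tempfile.TemporaryDirectory() as tmp:
    mock_save_analyzer(tmp, list(range(10_000)))
    sizes = {p.relative_to(tmp).as_posix(): p.stat().st_size
             for p in Path(tmp).rglob("*") if p.is_file()}
    print(sizes)  # three files, all the same size
```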
Thank you Graham for this summary. I will try to fix this.