
About the issue of recording saving.

yyyaaaaaaa opened this issue 1 year ago • 15 comments

When I try to save preprocessed data after extracting the recordings, my terminal always becomes unresponsive, as if the entire script has stopped running. However, when I check the save folder, it already exists. But when I try to load it using si.load_extractor(), it throws the following error. My SpikeInterface version is 0.100.6.

    Traceback (most recent call last):
      File "E:\y\python_files\sort\test.py", line 101, in <module>
        recording_rec = si.load_extractor(DATA_DIRECTORY / preprocessed)
      File "D:\software\Anaconda3\envs\kilosort4\lib\site-packages\spikeinterface\core\base.py", line 1146, in load_extractor
        return BaseExtractor.load(file_or_folder_or_dict, base_folder=base_folder)
      File "D:\software\Anaconda3\envs\kilosort4\lib\site-packages\spikeinterface\core\base.py", line 781, in load
        raise ValueError(f"This folder is not a cached folder {file_path}")
    ValueError: This folder is not a cached folder H:\MEA_DATA_binary\yy\20240130\20240130_19531_D13\240130\19531\Network\000015\binary_for_ks4

Here's the script I'm using.

    recording_f = bandpass_filter(recording=recording, freq_min=300, freq_max=6000)
    recording_cmr = common_reference(recording=recording_f, operator="median")
    recording_sub = recording_cmr

    preprocessed = "binary_for_ks4"
    job_kwargs = dict(n_jobs=30, chunk_duration='1s', progress_bar=True)
    rec_saved = recording_sub.save(folder=DATA_DIRECTORY / preprocessed,
                                   overwrite=True, format='binary', **job_kwargs)

yyyaaaaaaa avatar May 10 '24 07:05 yyyaaaaaaa

Hi @yyyaaaaaaa

What spikeinterface version are you using? How large is your recording? If the save function is not printing anything, it means it didn't run successfully, so it's expected that you are not able to reload the extractor.

alejoe91 avatar May 10 '24 08:05 alejoe91

I'm using version 0.100.6, and this is information about my preprocessed recording.

    CommonReferenceRecording: 1012 channels - 20.0kHz - 1 segments - 6,000,200 samples -
                              300.01s (5.00 minutes) - int16 dtype - 11.31 GiB

yyyaaaaaaa avatar May 10 '24 08:05 yyyaaaaaaa

Can you try with n_jobs=1? Just to see if it runs :)
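
In the script above, that would be, e.g. (reusing the same variables):

    # same save call as before, but forced to a single worker
    rec_saved = recording_sub.save(folder=DATA_DIRECTORY / preprocessed,
                                   overwrite=True, format='binary',
                                   n_jobs=1, chunk_duration='1s', progress_bar=True)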

alejoe91 avatar May 10 '24 08:05 alejoe91

After waiting for a while, it started working normally. However, it seems a bit slow. Is this normal?

    write_binary_recording with n_jobs = 1 and chunk_size = 20000
    write_binary_recording:  26%|##5       | 77/301 [04:00<08:53, 2.38s/it]

yyyaaaaaaa avatar May 10 '24 08:05 yyyaaaaaaa

With 1 job it's supposed to be slow. Can you try to gradually increase it? Does it work with 2?

alejoe91 avatar May 10 '24 09:05 alejoe91

It's not working. So far, no relevant information has been printed out.

yyyaaaaaaa avatar May 10 '24 09:05 yyyaaaaaaa

Just for our provenance: this is now the 4th case of n_jobs > 1 not working on save to binary, 3 of them on Windows; #2820 is another example. I don't understand the deeper levels of the chunking well enough to troubleshoot this, but I think it has to do with some nitty-gritty environment issues on specific computers. For example, on my labmate's computer it works in IPython in the terminal but not in an IDE.

zm711 avatar May 10 '24 11:05 zm711

Yes, I'm using a Windows system and running my script through PyCharm. Fortunately, setting n_jobs = 1 allows me to work normally, although it's a bit slower :)

yyyaaaaaaa avatar May 10 '24 11:05 yyyaaaaaaa

One last question that will be useful for us: what format is your original data? That is, what format is your original recording? Also, I think your chunks are too small for writing.

This is a deep issue, but I suggest trying two things.

  1. First, when you run parallel code on Windows, your script should be protected (see the sketch after this list, and this comment in the previous issue: https://github.com/SpikeInterface/spikeinterface/issues/2122).
  2. Can you show us your system resources as you run the program with this branch? https://github.com/SpikeInterface/spikeinterface/pull/2796

I find the latter unlikely because your process should be killed at some point, but maybe it is over-swapping and that's why it becomes so slow.
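
For reference, the protection in point 1 is the standard `if __name__ == "__main__":` guard that the spawn start method on Windows requires. A minimal runnable sketch (using a generated recording as a stand-in, so the file-reading part of the original script is elided):

    import spikeinterface.full as si

    def main():
        # stand-in for the user's MEA recording; the real script loads it from disk
        recording = si.generate_recording(num_channels=16, durations=[10.0])
        recording_f = si.bandpass_filter(recording, freq_min=300, freq_max=6000)
        recording_cmr = si.common_reference(recording_f, operator="median")
        job_kwargs = dict(n_jobs=2, chunk_duration='1s', progress_bar=True)
        recording_cmr.save(folder="binary_for_ks4", overwrite=True,
                           format='binary', **job_kwargs)

    if __name__ == "__main__":
        # on Windows, "spawn" re-imports this module in every worker process;
        # without the guard, each worker would re-run the whole script
        main()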

h-mayorquin avatar May 10 '24 14:05 h-mayorquin

Suggestion: we should do the CI testing with the spawn method, to avoid the bias that has crept in because Alessio, Sam, and I are Linux users.
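
If we go that route, a minimal sketch of the idea (the `set_start_method` call is standard library; the recording and folder name are illustrative):

    import multiprocessing

    import spikeinterface.full as si

    if __name__ == "__main__":
        # force the Windows-style "spawn" context even on Linux/macOS,
        # so the test exercises the code path Windows users actually hit
        multiprocessing.set_start_method("spawn", force=True)

        recording = si.generate_recording(num_channels=4, durations=[5.0])
        recording.save(folder="spawn_ci_test", overwrite=True,
                       format='binary', n_jobs=2, chunk_duration='1s')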

h-mayorquin avatar May 10 '24 14:05 h-mayorquin

Also, I think your chunks are too small for writing.

That's our default, so if that is the case then it's really our fault for choosing it :)
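
For anyone who wants to experiment with that in the meantime, the chunking is controlled through the same job kwargs as in the script above, e.g. (the '10s' value is just an illustration):

    # bigger chunks mean fewer, larger writes; with 1012 int16 channels at
    # 20 kHz, '1s' is already ~40 MB per chunk, so scale with care
    job_kwargs = dict(n_jobs=1, chunk_duration='10s', progress_bar=True)
    rec_saved = recording_sub.save(folder=DATA_DIRECTORY / preprocessed,
                                   overwrite=True, format='binary', **job_kwargs)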

@h-mayorquin, the one Linux person who had this fail had an issue with NumPy backend stuff; see this comment here.

zm711 avatar May 10 '24 14:05 zm711

I am linking the NumPy issue since the link is broken in the other thread: https://github.com/numpy/numpy/issues/11734

Unfortunately, I don't think we can safeguard against bugs at the NumPy/Intel compiler level.

h-mayorquin avatar May 10 '24 15:05 h-mayorquin

But in this case I think we need to make sure that people have the option to use n_jobs=1 at each stage. @alejoe91 and I had previously talked about adding a troubleshooting section to the docs to let people know they could look into this on their own computer if they want to try multiprocessing.

zm711 avatar May 10 '24 15:05 zm711

I was talking about the CI. I think that testing for that specific case is too granular even for my preferences, especially if it just hangs.

I think that an n_jobs=1 default is a good idea. You will have to convince @samuelgarcia about it, though.

h-mayorquin avatar May 10 '24 15:05 h-mayorquin

Here's another saving issue on Windows I had forgotten about: #1922.

When I have time I might open a separate global issue so I can try to summarize the state of problems with multiprocessing in the repo.

zm711 avatar May 10 '24 17:05 zm711

For more info on this issue: one of my labmates can do multiprocessing from the terminal, but not from an IDE. Since this user is using an IDE, maybe there is a problem there... Not sure how to solve that, though.

zm711 avatar Jun 19 '24 16:06 zm711

Which IDE does your colleague use? So what we have now is:

  • Windows
  • Pycharm?
  • Maybe "spawn" as context?

h-mayorquin avatar Jun 19 '24 18:06 h-mayorquin

Windows 11 (so spawn is required). Miniconda, Python 3.11, Spyder (which incidentally also breaks tqdm). Even n_jobs=1 can be a little buggy for writing binary files, but it might be because the files are large and we just aren't waiting long enough.

Using an IPython REPL in the same conda env works fine for both tqdm and multiprocessing.
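
One quick way to check whether plain multiprocessing works at all from a given console is a tiny stdlib-only test, independent of spikeinterface (a diagnostic sketch):

    import multiprocessing

    def square(x):
        return x * x

    if __name__ == "__main__":
        print("start method:", multiprocessing.get_start_method())
        # if this hangs in the IDE but works in a plain terminal, the problem
        # is the console's handling of spawned processes, not spikeinterface
        with multiprocessing.Pool(2) as pool:
            print(pool.map(square, range(4)))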

zm711 avatar Jun 19 '24 20:06 zm711

Is the error still present if you try to write a generate_recording()? (This would let us rule out a reading problem so we can focus on the core functions instead of the input format.)
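
For concreteness, a sketch of that round-trip test (the folder name is arbitrary):

    import spikeinterface.full as si

    if __name__ == "__main__":
        # synthetic recording: no file reading involved, so a failure here
        # would point at the core save machinery rather than the input format
        recording = si.generate_recording(num_channels=16, durations=[10.0])
        recording.save(folder="generated_save_test", overwrite=True,
                       format='binary', n_jobs=2, chunk_duration='1s')
        print(si.load_extractor("generated_save_test"))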

Tomorrow I will allocate some time to work on the Plexon issue, as it was a request from my boss. While on Windows I can try to see if I can reproduce this issue, but I need to know the input format.

h-mayorquin avatar Jun 19 '24 21:06 h-mayorquin

Will test later today. It is only on a specific computer so I have to wait for the person to come in...

zm711 avatar Jun 20 '24 12:06 zm711

I got an update. Using generate_recording, the save worked fine with n_jobs > 1. She also told me that saving a binary file with mountainsort5 worked fine, but the binary save was always failing from Spyder when running any of the Kilosorts, so maybe the run_sorter function was hanging. She ended up having to reinstall her OS (IT took care of it) due to some issues lately, so maybe her latest updates got rid of whatever the problem was.

zm711 avatar Jun 20 '24 15:06 zm711

OK, and the input previously was Intan? (Assuming this from your lab.)

h-mayorquin avatar Jun 20 '24 15:06 h-mayorquin

Yep.

zm711 avatar Jun 20 '24 16:06 zm711

OK, once this is done: https://github.com/NeuralEnsemble/python-neo/pull/1491

We should give the Intan version you guys are using the same treatment as this, to make it smoother:

https://github.com/SpikeInterface/spikeinterface/pull/1781

h-mayorquin avatar Jun 20 '24 16:06 h-mayorquin

Let's close this. This discussion is too far in the past and many things have changed. If the problem appears again, we can focus on it then.

h-mayorquin avatar Jun 20 '24 16:06 h-mayorquin

Yep.

This short answer is under a license I deposited in 1857. You can use it, but it is not free.

samuelgarcia avatar Jun 21 '24 06:06 samuelgarcia

What? xD OK, that's the last thing, I need to go to sleep :)

h-mayorquin avatar Jun 21 '24 06:06 h-mayorquin