
[Feature request] Add return_scaled option for spikeinterface.core.write_binary_recording

Open jnjnnjzch opened this issue 1 year ago • 10 comments

Hello, I'm using spikeinterface to convert my data from Open Ephys to a binary file for further analysis. I found that the spikeinterface.core.write_binary_recording() function writes the original data directly to binary without applying any scale factors, such as gain_to_uV or offset_to_uV, even though these can be read from the recording.

From the project's point of view, it is necessary not to change the raw recording data, and I completely understand why spikeinterface keeps the raw data as it was. However, from the perspective of a user, when I call write_binary_recording() and don't see a return_scaled option, it makes me feel like the function has done it automatically, especially since I can use the sorters naturally without considering whether the data is scaled or not.

It confused me for a couple of days while I figured out why the converted binary data was "wrong". Now I'm using a code snippet to deal with it, which is not very elegant.

info = recording.to_dict()['properties']
gain_to_uV = info['gain_to_uV'][0]
offset_to_uV = info['offset_to_uV'][0]
recording = spre.scale(recording, gain=gain_to_uV, offset=offset_to_uV)
si.write_binary_recording(recording, file_paths=file_path, dtype=dtype, **job_kwargs)

For the sake of user experience, would you consider adding a return_scaled option for the write_binary_recording() function? If it's inconvenient for you, I can learn to submit a PR since I'm new to GitHub.

jnjnnjzch avatar Jun 02 '24 09:06 jnjnnjzch

I want to keep write_binary_recording with as few arguments as we can, because that makes it easier to test and optimize. The reason is that it is a very central function in the library, and it is better if the core is very compact.

One extra thing you should know is that recordings are a large amount of data. The reason the default is to write them unscaled is to keep this economical representation of the data (the gains and channel information are written alongside).
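To make the storage argument concrete, here is a minimal NumPy sketch; the array shape and the 0.195 uV/bit gain are made-up example values, not taken from any particular recording:

```python
import numpy as np

# Made-up example: 1000 samples x 4 channels of raw ADC counts, stored as int16
raw = np.random.randint(-2000, 2000, size=(1000, 4)).astype(np.int16)

# Typical metadata kept alongside the binary file (example values only)
gain_to_uV = 0.195
offset_to_uV = 0.0

# Scaling to microvolts promotes the data to float64
scaled = raw.astype(np.float64) * gain_to_uV + offset_to_uV

# The unscaled int16 array is 4x smaller than its float64 counterpart
print(raw.nbytes, scaled.nbytes)  # 8000 vs 32000
```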

As a suggestion, your script can be written like this:


gain_to_uV = recording.get_channel_gains()
offset_to_uV = recording.get_channel_offsets()
recording = spre.scale(recording, gain=gain_to_uV, offset=offset_to_uV)
si.write_binary_recording(recording, file_paths=file_path, dtype=dtype, **job_kwargs)

Btw, what do you consider inelegant about this? You are making the operation very explicit. It is an interesting debate, because I prefer being explicit through operations instead of function arguments. I think the former can be combined in many ways without the disadvantage of eventually having functions with 20 arguments.

What do you think about it?

h-mayorquin avatar Jun 05 '24 16:06 h-mayorquin

Now that I think about it myself, what I would like is a preprocessing function called scale_to_uV that allows me to write:

recording_in_uV = spre.scale_to_uV(recording=raw_recording)
si.write_binary_recording(recording_in_uV, file_paths=file_path, dtype=dtype, **job_kwargs)
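Conceptually, such a scale_to_uV operation is just a per-channel affine transform. Here is a self-contained NumPy sketch of what it would compute; the gains and offsets are invented for the example and stand in for what get_channel_gains() / get_channel_offsets() would return:

```python
import numpy as np

# Made-up unscaled traces: 5 samples x 3 channels (int16 ADC counts)
traces = np.array([[100, 200, 300],
                   [110, 210, 310],
                   [120, 220, 320],
                   [130, 230, 330],
                   [140, 240, 340]], dtype=np.int16)

# Hypothetical per-channel gains and offsets (example values only)
gains = np.array([0.195, 0.195, 0.5])
offsets = np.array([0.0, -10.0, 5.0])

# Broadcasting applies each channel's gain and offset down its column
traces_uV = traces.astype(np.float64) * gains[np.newaxis, :] + offsets[np.newaxis, :]
```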

I think this is concise while at the same time generalizing to situations beyond write_binary_recording. What do you think, @jnjnnjzch: do you feel this would make the API more in line with your concept of elegance?

I would actually like your take on this @JoeZiminski, as you have been thinking about preprocessing. My thinking here is that it is better to combine simple functions and apply them flexibly depending on the situation at hand, rather than adding a new parameter to each of the core functions for every situation. The latter also has the downside of making the core functions more difficult to optimize and document. See here for another instance where I think this question arises: https://github.com/SpikeInterface/spikeinterface/issues/2985. I recognize there might be performance limitations with this approach, but I would rather take it as a baseline.

h-mayorquin avatar Jun 07 '24 14:06 h-mayorquin

Hi, @h-mayorquin, sorry for my late reply. Here are my thoughts.

It is quite hard to "define" an elegant code style, because such styles can be defined from many perspectives. I agree with your view that one should minimize the number of arguments of the core functions and build API-facing functions from combinations of simple functions.

However, I also think that if a function is that "core", then it should not be explicitly called by the user. For example, I think a better way is to rename the current write_binary_recording to _write_binary_recording (in Python, such a naming convention implies the function should not be called directly) and rebuild write_binary_recording with the necessary parameters like return_scaled. Since the core function is "hidden", any modifications to the new function will not affect the whole project, which is good for maintenance.

The words above are from a developer's perspective. From the user's perspective, I think "elegant" code should be:

  • uniform (across different functions, the arguments remain consistent if the operation is the same)
  • highly readable (it can be understood directly, even without looking up the documentation)
  • compatible with processing-pipeline perceptions (the code does the same as what neuroscientists were doing before, without introducing extra steps)

Let's look at your suggested code, if I may...

# Your first suggestion
gain_to_uV = recording.get_channel_gains()
offset_to_uV = recording.get_channel_offsets()
recording = spre.scale(recording, gain=gain_to_uV, offset=offset_to_uV)
si.write_binary_recording(recording, file_paths=file_path, dtype=dtype, **job_kwargs)

# Your second suggestion
recording_in_uV = spre.scale_to_uV(recording=raw_recording)
si.write_binary_recording(recording_in_uV, file_paths=file_path, dtype=dtype, **job_kwargs)

Your second suggestion simplifies the first one by removing the separate retrieval of properties. I think it improves readability, because one may not think to read some extra properties just for a spre.scale operation if they have not carefully read the documentation. As for perception compatibility, both suggestions introduce an extra step to "scale" the recording. This operation might sound strange to a neuroscientist, as they do not typically perform such a step separately; in fact, very few papers describe a "scale" step, whereas steps such as band-pass filtering and common-mode removal are more "traditional". Besides, the code is not uniform with other functions like sw.plot_traces(recording, return_scaled, **args): those functions implement the scaling with an argument, while write_binary_recording() requires another line of code...

I can personally accept both of your suggestions, as I have some knowledge of coding. I write this for those who don't have the necessary coding knowledge; I hope my thoughts are helpful to you.

jnjnnjzch avatar Jun 09 '24 08:06 jnjnnjzch

Very interesting discussion; I am not 100% sure what the best approach is in this case. I have a clarifying question to make sure I understand the problem. At the moment, write_binary_recording() always writes in the format of the original recording (e.g. int16), plus the metadata on scaling factors. The return_scaled argument would make it write directly in scaled format (e.g. float64)? Assuming the above is correct, out of interest, what is the use case for saving in scaled format rather than saving as int and rescaling to float later? (just considering the larger file size)

I agree the scaling is confusing in general; it took a long time for me to get my head around, and it still trips me up. At the very minimum it would be good to 1) extend the docstring of write_binary_recording to state the format in which it writes data, with an example of how to get it to write scaled data, and 2) add a short how-to or another explanatory article in the docs covering data types, data scaling, and how these are represented in SI.

JoeZiminski avatar Jun 10 '24 10:06 JoeZiminski

Well, I'm not sure whether write_binary_recording() writes the metadata of the scaling factors, but from my observation (comparing the binary data output by self-written code with that of write_binary_recording()), it seems it does not.

Also, it seems we are talking about the scale of the values rather than the data format. The data format can perhaps be converted via the dtype argument?
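The distinction between dtype and scale can be shown with a small NumPy sketch (the gain value is made up): casting changes only the container the values sit in, while scaling changes the values themselves.

```python
import numpy as np

raw = np.array([1000, -2000, 3000], dtype=np.int16)

# Casting only changes the container (dtype), not the physical scale:
as_float = raw.astype(np.float32)   # still unscaled ADC counts, just stored as float

# Scaling changes the values themselves (example gain, made up):
gain_to_uV = 0.195
in_uV = raw * gain_to_uV            # now expressed in microvolts
```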

jnjnnjzch avatar Jun 10 '24 15:06 jnjnnjzch

Yeah the write_binary_recording only writes the binary file

alejoe91 avatar Jun 10 '24 15:06 alejoe91

@jnjnnjzch thanks for the detailed answer.

  • uniform (across different functions, the arguments remain consistent if the operation is the same)
  • highly readable (it can be understood directly, even without looking up the documentation)
  • compatible with processing-pipeline perceptions (the code does the same as what neuroscientists were doing before, without introducing extra steps)

I think those three are desirable properties of the code, agreed. Plus, you bring up the important point that we do use return_scaled in some other places, so adding it here would make things more uniform. I appreciate a lot that you point out that scale_to_uV is not a common operation for neuroscientists and might confuse some users (that said, I don't think return_scaled as an argument is that clear either, while scale_to_uV is at least more transparent about what it does).

One final thing: why and in what context are you using write_binary_recording? You could use recording.save_to_folder instead, and there I think it might make more sense to add a return_scaled, because that function is indeed intended to be user-facing. It will save properties like the scale and anything else. But I am asking about your use case because I don't want to make assumptions about your workflow; it is just good for me as a developer to know how you are using the software.

You would be pleased to know that your suggestion of having a friendly user-facing API and a more unfriendly core is a common recommendation in good-practice guides: https://learn.scientific-python.org/development/principles/design/#permissiveness-isnt-always-convenient

h-mayorquin avatar Jun 10 '24 16:06 h-mayorquin

Thanks @jnjnnjzch. If I understand correctly: if the recording is originally in int, then saving it scaled will implicitly also cast it to float32 (and increase the space on disk, assuming the original recording is in the commonly used int16 dtype)?

Another related question, which @h-mayorquin might also be able to help me out with: if the recording is already in float64, I assume it is already scaled? Or might there be some case where, for some reason, the data is in float64 but still needs scaling to uV? (I think this is slightly related to our discussion in #2871.) Actually, looking again at #2871, it seems it is possible for data to be in float but unscaled. @jnjnnjzch, is this the case for your data?

I think I lean towards a return_scaled argument, just because it is nice to abstract away the scaling and datatype issues as much as possible, since these are quite confusing topics. Related to this, in general it would make sense to add some documentation on this and to ensure all get_traces() functions behave the same way with respect to datatype and scaling (possibly according to #2871).

JoeZiminski avatar Jun 10 '24 18:06 JoeZiminski

Hi @h-mayorquin, there are several reasons for using write_binary_recording.

  1. We built our workflow around the binary file, as we were using Kilosort3 before, while recording.save_to_folder seems to write too many things for us, and we are not fully prepared to use them yet.
  2. Another reason (my fault) is that I didn't see recording.save_to_folder in the documentation before (even with searching?)... whereas the usage of write_binary_recording can be found easily.
  3. Considering the data size, we also want to save it as int16, and this can be done easily in write_binary_recording via the dtype argument.

I agree that recording.save_to_folder might be a more user-friendly solution, provided there are tutorials for it. However, as we just need the binary file, write_binary_recording sounds more straightforward for our purpose. And thanks for pointing out the design principles; I will review them carefully. :D

For @JoeZiminski ,

The scaling indeed casts the original data into float, and we convert it back to int16 for further analysis considering the disk space. There is some precision loss due to the conversion, and we just ignore it. Another thing is that the data format has no direct relationship with whether the data is scaled or not, so yes, it is possible for data to be in float but unscaled. I think the data format is determined by the upstream reading function, or by the data acquisition software or hardware, which seems hard to predict. One extra piece of information (might be helpful): as far as I know, Intan hardware converts the raw voltage signal from the brain into int16, so any software/function reading Intan data as float32 does not really increase the precision.
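The round-trip precision loss described above can be sketched in pure NumPy; the raw counts and the 0.195 uV/bit gain are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
raw = rng.integers(-30000, 30000, size=10000).astype(np.int16)  # made-up raw counts

gain_to_uV = 0.195  # example gain; with it, +/-30000 counts stays within int16 in uV

# Scale to microvolts (float), then cast back to int16 to save disk space
scaled_uV = raw * gain_to_uV
back_to_int16 = scaled_uV.astype(np.int16)   # truncates the fractional microvolts

# The round trip is lossy: int16 microvolts cannot represent 0.195-uV steps,
# so recovering the original counts leaves a small residual error
recovered = back_to_int16 / gain_to_uV
max_error_counts = np.max(np.abs(recovered - raw))
```

The residual error here is bounded by one microvolt of truncation, i.e. at most 1/gain counts per sample, which is the loss the comment above says can be ignored for their analysis.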

jnjnnjzch avatar Jun 12 '24 07:06 jnjnnjzch

Sorry for the slowdown in the discussion, @jnjnnjzch. This is still on my radar. Adding some previous discussion regarding return_scaled: https://github.com/SpikeInterface/spikeinterface/pull/2320#pullrequestreview-1777998536

h-mayorquin avatar Jun 19 '24 21:06 h-mayorquin

Closing this issue since the feature is implemented; the discussion about an I/O tutorial has moved to #3098.

jnjnnjzch avatar Sep 29 '24 10:09 jnjnnjzch

Thank you for your patience. It is still on my radar to write the tutorial, but I have not gotten contract work for SpikeInterface recently, so things are slower.

h-mayorquin avatar Oct 09 '24 20:10 h-mayorquin