spikeinterface analyzer.get_extension using "|"

@DradeAW : this is for you.

Mar 13 '24 08:03 samuelgarcia

Yeah, I also think this will introduce bugs such as the space one mentioned by @zm711 .

Is this format used anywhere else? If yes, maybe the docstring should reference it. If not, it is usually considered a bad idea to invent specifications for parsing arguments.

Plus, the behavior of returning None at failure to fetch will cause silent errors with mispelling.

Why not just use a list or a tuple?

Mar 13 '24 17:03 h-mayorquin

For me a tuple/list would mean return me two extensions. Here it is return the first and if not the second. The "|" is also used in the dependency class chain.

Mar 13 '24 18:03 samuelgarcia

For me a tuple/list would mean return me two extensions

Do I read you correctly that you are trying to disambiguate that by using an analogy short-circuit operation of booleans to strings and the fact that | is a boolean operator?

I think you are trying to be too clever, Sam! It is not obvious at all that your proposed notation means "fetch the first that is available".

I think that in both cases you will need documentation or a better variable name. The processing problems caused by string parsing under the hood can be avoided though.

The "|" is also used in the dependency class chain.

You mean somewhere else in the code base or outside of it? I am not familiar with it.

Anyway, just warnings from my side, if you are aware of the possible problems I enumerated and you made your mind about it, that's fine.

Mar 13 '24 18:03 h-mayorquin

I mean related to Heberto's point with the semantics of the operation. What does it really mean to do

sorting_analyzer.get_extension('waveforms|spike_locations')

It seems to me like there are special cases like could a user want to do ('waveforms|templates|fast_templates'), but then any typo could trigger an error, no?

I think users may over-interpret this | operation. Although I agree that a list or tuple would imply return all extensions I request rather than the first one found. I'm more worried like @h-mayorquin about bugs silent or otherwise that may creep in.

Mar 13 '24 21:03 zm711

I think the '|' operator would almost never be used by the users,

Isn't it mostly for internal stuff (mainly the templates) to avoid having if loops everywhere?

I think users usually know what extensions they computed and would get what they want no?

Mar 14 '24 12:03 DradeAW

I use the get_extension a lot, so even though the expectation is that users shouldn't really use that feature and just give a string, they might start using it and it isn't super consistent. For me if this is really just an internal aid then couldn't we hard code a check for 'templates|fast_templates' and then do the split logic so that no one can misuse it and then complain? Or is there another use case?

Mar 14 '24 14:03 zm711

I'm not saying get_extension isn't used by users, I'm saying I don't think the subcase with "|" will occur a lot for users

Mar 14 '24 15:03 DradeAW

I'm not saying get_extension isn't used by users, I'm saying I don't think the subcase with "|" will occur a lot for users

I don't understand this argument. Are you arguing we should not be that concerned about something because it will not be used by a lot of users? I think that if we could make something better and less error prone without cost we should.

Summary:

We want to have a mechanism to "fetch the first available extension from a set of ordered options".
That functionality can be implemented in different ways. The proposal here is a string parsing. Another is a list. Maybe there is something better out there.
Sam does not like to use a list because he argues that for him the semantics of that mean return all the extensions.
I don't think this is the case and I would rather disambiguate with good docstring or variable name instead.
Me and Zach are pointing out some common errors of string parsing specifications such as whitespace.
I also think that the string parsing solution is not only more error prone for users but harder to mantain internally and harder to debug.

I think the problem of semantics comes from wanting to implicitly overload the function get_extension and trying to make it do too many things. Why not just make a new function that is called fetch_first_available_extension that calls get_extension under the hood and handles existence? with that name, there is no ambiguity about what the function is doing.

There are two further advantages of this approach:

The function fetch_first_available_extension can contain good error messages and be oriented to be user friendly. See the problems discussed in https://github.com/SpikeInterface/spikeinterface/issues/2560
The original function get_extension then can be specialized in being performant for internal use and optimization without the burden of user friendliness in mind.

Mar 14 '24 15:03 h-mayorquin

To justify the approach proposed above I would like to cite the guidelines of design of scientific python that we have discused before:

Overly permissive code can lead to very confusing bugs. If you need a flexible user-facing interface that tries to “do the right thing” by guessing what the users wants, separate it into two layers: a thin “friendly” layer on top of a “cranky” layer that takes in only exactly what it needs and does the actual work. The cranky layer should be easy to test; it should be constrained about what it accepts and what it returns. This layered design makes it possible to write many friendly layers with different opinions and different defaults.

https://learn.scientific-python.org/development/principles/design/#permissiveness-isnt-always-convenient

Mar 14 '24 15:03 h-mayorquin

I'm fine with a solution get_first_available_extension, I find it less pretty, but it might be a better scenario for SpikeInterface.

I don't understand however how '|' is not user friendly. It's used by numpy, C, Java, PHP .... in the same context (as an 'or')

Mar 15 '24 09:03 DradeAW

The only real application for this is fast_templates|templates (or viceversa). Maybe it would be better to only support this two with a specific function.

Mar 15 '24 09:03 alejoe91

The only real application for this is fast_templates|templates (or viceversa)

This is true, we could just implement a function analyzer.get_templates_extension()

Mar 15 '24 09:03 DradeAW

That sounds good to me. It's a specific case so it has a specific function and we leave the other function to do its job without being overly permissive (to quote @h-mayorquin ).

Mar 15 '24 10:03 zm711

I do not like the idea that we have SortingAnalyzer.get_template() I perfer to keep it neutral and not inject back extension semantic.

Having one extensions or another similar extensions one is really usefull. At the moment it is only templates but could be done for some others (one extension is an approximation the other is is the full computation for instance on all spikes). I like a lot the semantic with the pipe.

For me the choice is:

SortingAnalyzer.get_extension("templates|fast_templates)"
SortingAnalyzer.get_one_or_other_extension(["templates", "fast_templates"])

And because it will be only internal stuff I strongly vote the 1. which was Aurelien proposal. There are maybe already ~8 places in the entire that could benefit this.

Also note that I have in mind to do soon this SortingAnalyzer.get_extension_data("templates)" which internally do SortingAnalyzer.get_extension("templates).get_data()" but internally check that extension is existing. Then solution 2 would lead to a specific get_one_or_other_extension_data which is becaming quite ugly instead of SortingAnalyzer.get_extension_data("fast_templates|templates")

What do you think ?

Mar 15 '24 12:03 samuelgarcia

I would love to hear how you plan to deal with the string parsing issues that Heberto and I are worried about. I get the idea in general, but how do we protect end-users and not developers.

Sometimes explicit can be beautiful too even if the naming is not ideal. If we follow Aurelien's argument (I can't do accents on my work keyboard sorry!) that this is mainly for developers then even if it is an ugly name the end users won't need to worry about it they can keep the elegant outward facing function. The ugly function will only be used by people that need to worry about using it.

If you think that it will be outward facing in general then I think we really need to worry about how end-users will misuse the function. Again I bring up two examples: ('fast_templates '|'templates') and ('waveforms'|'spike_locations'). what is the solution for those?

Mar 15 '24 12:03 zm711

Oh yes a public debate on a detail! I propose to remove the space what do you think ? And then if we have a typo the get_extension class should triger the appropriate bug when #2571 merged. what do you think ?

Mar 15 '24 13:03 samuelgarcia

Okay, I think after #2571 is merged the error messaging will be clearer and will deal with typos. Stripping spaces is also easy enough. I'll leave it to @h-mayorquin to keep the fight going if he wants.. I still stand by him for the design choice reasoning and debugging in general for the PR. But it seems you really want it this way :) .

Mar 15 '24 13:03 zm711

@samuelgarcia

For me the choice is:

SortingAnalyzer.get_extension("templates|fast_templates)" SortingAnalyzer.get_one_or_other_extension(["templates", "fast_templates"]) And because it will be only internal stuff I strongly vote the 1

I still don't understand why. You can avoid errors and more code by using a different implementation that would make no difference, as you say, it is a detail. Why not go for the solution with less maintenance burden: having a function get_one_or_other_extension or fetch_first_available_extension:

It does not require string parsing, knowledge of short-circuit boolean property or white space managment.
If it is to be used internally then you avoid adding more code and complexity to a function that is user facing (get_extension).

Is there an argumen against against it other than aesthethics? I disagree with you about what looks bad and good but I think is your prerogative to make this type of calls (I really mean this!).

@DradeAW

I don't understand however how '|' is not user friendly. It's used by numpy, C, Java, PHP .... in the same context (as an 'or')

The argument is that string parsing is harder to maintain and has some obvious errors with white space. As a user of pd.query myself I am at odds about user friendliness. I kind of like it but it has make me lost a lot of time.

About the use in C. I think that's a stretch. I know it is used for short-cirtcuit boolean evaluation but this is not what we are doing here. Maybe I am wrong, can you point out to what you mean?

Plus, even short-circuit booleans is not a widely know property in python. I remember that Alessio was not aware of it until recently and another developer that I work with was not aware of that. Both Alessio and my friend have way way more software knowledge that the average user of the library: a scientific user who uses code to do data analysis. I think that relying on the implicit knowledge that | is used to short-circuit booleans and that we are using it here for strings in an analogy where True means "Found" is just not ... that straighfoward. You guys are smart and I think you forget how is not to know.

@alejoe91

This is true, we could just implement a function analyzer.get_templates_extension()

This would be even better to me actually @zm711. I like the heuristic of avoiding early generalization.

Mar 15 '24 18:03 h-mayorquin

@zm711 @h-mayorquin : thanks a lot for the string warning signals against this. The debate is interesting. I think I will use my jocker card "I am a capricious child". I have the deep feeling that this will be usefull. I still need to make some proposal with simpler syntax to access extension data and this will help a lot.

Mar 20 '24 08:03 samuelgarcia

That makes sense to me, @samuelgarcia.

You know possible future problems but you still like this design more. I think that's fair.

There is not perfect decision, only trade-offs.

Mar 20 '24 14:03 h-mayorquin

Guys, I'm reading this again..

So everything boils down to get_extension("templates|fast_tamplates"). I'm not seeing other use cases at the moment. My suggestion then is to handle this downstream, when needed.

The syntax is not that bad either. We can do:

ext = analyzer.get_extension("templates") or analyzer.get_extension("fast_templates")

This will do exactly what we want. No?

Mar 25 '24 11:03 alejoe91

I'm not seeing other use cases at the moment.

For example, let's say in the future someone want to implement an extension select_spikes (similar to random_spikes, but you can choose the indices you want to take). Then you might want to get one of both extensions :)

Mar 25 '24 12:03 DradeAW

I'm not seeing other use cases at the moment.

For example, let's say in the future someone want to implement an extension select_spikes (similar to random_spikes, but you can choose the indices you want to take). Then you might want to get one of both extensions :)

Sure, and you could do:

ext = anayzer.get_extension("random_spikes") or anayzer.get_extension("my_other_extension")

My point is that this notation is super clear, it relies on the Python or operator behavior, and it doesn't require any string parsing

Mar 25 '24 12:03 alejoe91

ext = analyzer.get_extension("templates") or analyzer.get_extension("fast_templates")

This is more explicit to me. I understand it is more verbose, but is very clear. I'm still firmly on this side :)

Mar 25 '24 12:03 zm711

I still prefer the option with | because it doesn't ask the user to retype the variable name and function (the variable can have a long name in some cases) and I find it easier to read,

But I won't cry if you decide to go with the other option :)

Mar 25 '24 13:03 DradeAW

@alessio this will not work on analyzer.get_extension_data("templates|fast_templates") which is what I have in mind after this PR.

Mar 25 '24 13:03 samuelgarcia

analyzer.get_extension_data("templates|fast_templates")

analyzer.get_extension_data("templates") or analyzer.get_extension_data("fast_templates")? Not a lot of extra typing

Mar 25 '24 14:03 alejoe91

Not a lot of extra typing

What if your variable is called sorting_analyzer_merged[index]

In my case when dealing with multiple analyses, the variables have quite long names :)

Mar 25 '24 15:03 DradeAW

@DradeAW you can copy paste :P

Mar 25 '24 15:03 alejoe91

But lines exceed the number of characters that the PEP recommends :p

I can understand your point of view as I said, if you decide that it's the best way forward I won't fight it and use it that way, It's just my opinion that the other option is better :)

Mar 25 '24 15:03 DradeAW