Improving dataselection with strings
Is your feature request related to a problem? Please describe.
When you want to select events from a database that all have a non-Nan value for a truth variable. Ideally, one would have a way to use the selection argument of the graphnet.data.dataset like this:
"energy.notna()"
Where energy is a truth variable and .notna() is a pandas.DataFrame.query-compatible syntax.
Until now, this has not worked since the handling of attributes/functions that are called as attributes is not handled in
https://github.com/graphnet-team/graphnet/blob/631b1a69d70c9e10a71fd906e8b8a74fcc9875d4/src/graphnet/data/utilities/string_selection_resolver.py#L18
more specifically, in
https://github.com/graphnet-team/graphnet/blob/631b1a69d70c9e10a71fd906e8b8a74fcc9875d4/src/graphnet/data/utilities/string_selection_resolver.py#L123-L130
Describe the solution you'd like
One needs to expand the parsing of the string with ast to adapt to this case.
It's quite easy, and I will pull up a PR with a proposed solution.
There is a related PR from @Aske-Rosted in #791 that introduces SQL-query on the SQLite datasets for string-selections.
I think that the fact that both of you have found the need for expanding the options in the string-selections in the dataset config files is a good indicator that we should revise the implementation.
However, the two approaches (#827 from @sevmag ) and (#791 from @Aske-Rosted ) are diverging.
Here's the overarching question, as I see it: Should string-selections in the DatasetConfig files be able to produce any selection, or are there scenarios where we instead opt for the usual list of integers as selection argument? If so, where do we draw the line?
In their current implementation, string-selections are expected to accept any pandas.dataframe.query-accepted syntax, which suggests that cases where pandas.dataframe.query falls short can be handled using a user-provided list of event ids. This issue have found a case where such a pandas.dataframe.query-syntax is not supported and a fix is proposed in #827. The contribution in #791 is different - it extends the functionality of the string-selections to SQL queries - which suggests that a second language can be used in the string-selections when pandas.dataframe.query falls short.
For a revision of the string-selections, I think we should insist on the following high-level requirements:
- DatasetConfig are file format independent. I.e. they should not assume a particular Dataset class or file format in the string-selections.
- Should (ideally) not leave existing/older DatasetConfig files obsolete. I.e. previous files should be compatible with a future overhaul
- We want to keep the manual parsing of entries in the string-selection field at a minimum.
- Consistent user experience
#827 ticks all the boxes and is essentially a bug-fix - and on those grounds I would suggest that we accept this contribution.
If there are good examples of pandas.dataframe.query-syntax being too inflexible for your use-case @Aske-Rosted, we could knit together a future overhaul that fully replaces pandas.dataframe.query-syntax with SQL and that also ticks the boxes above. This could be done using file-format agnostic SQL libraries like duckdb . The potential benefit of this would be to greatly simplify the graphnet code for parsing the string-selections. The StringSelectionResolver is quite complex and merging #827 will only increase the complexity. By adopting duckdb I think we could remove quite a bit of the code.
I agree with this. My proposed solution in #827 was simply an easy fix to the current code, making it more flexible for the case described above. As I found that the addition would be useful for other users as well, I decided to open #827.
Your approach to this and your criteria for longer-term evolution of the database selections are definitely important considerations. Beyond that, I am not sure how easy it is to be backwards compatible when switching to SQL-style queries.
I do not think that #827 adds too much complexity to the code, and we would be fine with accepting it as a fast solution to this problem, but we should probably investigate the switch to other query formats.
I agree with these points. 👍