LAPIS icon indicating copy to clipboard operation
LAPIS copied to clipboard

[DO NOT MERGE] feature branch regex

Open Taepper opened this issue 1 year ago • 2 comments

resolves #750

This feature branch maps string fields from the API to SILO's StringSearch instead of StringEquals expressions

This is a breaking change as any search for a string field will now also assume the value to be a regex string, whereas before it was just a string that was compared against the column values with equality.

The problem with a permanent solution is that the current api just expects a pair of "column": "value", where not specification of whether the filtering should be done with equality or regex matching is possible.

Possible solutions:

  • Upon short consideration, it is expected that string columns are going to need not both types, but only one type of the filter expressions. E.g. author lists will generally be filtered with regexes, the primary key with equality. This could therefore be configured in the column declarations (i.e. in the database_config)
  • Make LAPIS parse the "value". If it is of the form "regex({})" / "LIKE {}", we will regex-match it, if "= {}" we will compare with equality.
  • Make LAPIS parse the "column". If it is of the form "{column}Match", we will regex-match, otherwise we will compare with equality.
  • Change the API to take objects instead of one "value"

PR Checklist

  • [ ] All necessary documentation has been adapted.
  • [ ] The implemented feature is covered by an appropriate test.

Taepper avatar Aug 07 '24 10:08 Taepper

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
lapis ✅ Ready (Inspect) Visit Preview 💬 Add feedback Aug 7, 2024 10:11am

vercel[bot] avatar Aug 07 '24 10:08 vercel[bot]

This is a preview of the changelog of the next release:

0.2.9 (2024-08-07)

Features

  • map string fields from the API to SILO's StringSearch instead of StringEquals expressions (62f3959)

github-actions[bot] avatar Aug 07 '24 10:08 github-actions[bot]

This is a breaking change as any search for a string field will now also assume the value to be a regex string, whereas before it was just a string that was compared against the column values with equality.

I don't think it's good to force regex. There should be two types of string comparisons:

  • exact
  • regex

IIUC, you are proposing to get rid of exact?

I see that you list possible workarounds though. I would say it would be ok to start with expecting each field to be of exactly one of the two string types, configured in the database config. Is there a reason not to just start off with this? It'd probably be better to do the two things in one PR: don't remove exact search in one release, and then add it back in a later release. Instead: add regex search and don't make a breaking change.

Also, if this is a breaking change, this shouldn't be 0.2.9 but 0.3.0, right?

corneliusroemer avatar Aug 20 '24 12:08 corneliusroemer

This PR will not be merged as such. We are thinking about how to handle it best. I would suggest adding new parameters for string values which have a suffix that indicates that it is a regex search. The question related to our other discussion is then whether the suffix should be Regex or _regex.

chaoran-chen avatar Aug 20 '24 14:08 chaoran-chen

This PR will not be merged as such.

Ah I see, it sad "this is a breaking change" - which sounded like something planned to be released as a breaking change ;)

I would suggest adding new parameters for string values which have a suffix that indicates that it is a regex search. The question related to our other discussion is then whether the suffix should be Regex or _regex.

It might be safer not to rely on these magic parameter names - you could put the type also in the query value, e.g. authors=regex:^[^\t]*$&country=exact:Germany

Rather than authors_regex=^[^\t]*$&country_exact=Germany

corneliusroemer avatar Aug 20 '24 15:08 corneliusroemer

It might be safer not to rely on these magic parameter names - you could put the type also in the query value, e.g. authors=regex:^[^\t]*$&country=exact:Germany

This would we a major breaking change, wouldn't it? I think that there should continue to be a simple way for people to filter for a exact value without prepending something.

chaoran-chen avatar Aug 20 '24 21:08 chaoran-chen

The new field type could be opt in. Old field type would have simple exact string semantics and be backwards compatible.

You can also make the search type gate able via a second parameter: authors=[^\t]*&authors_type=regex or something like that

But yeah then one can just modify the name directly. I see your points 😀

corneliusroemer avatar Aug 20 '24 22:08 corneliusroemer

superseded by #902 (#897 for overview)

fengelniederhammer avatar Aug 21 '24 13:08 fengelniederhammer