[DO NOT MERGE] feature branch regex
resolves #750
This feature branch maps string fields from the API to SILO's StringSearch instead of StringEquals expressions
This is a breaking change as any search for a string field will now also assume the value to be a regex string, whereas before it was just a string that was compared against the column values with equality.
The problem with a permanent solution is that the current api just expects a pair of "column": "value", where not specification of whether the filtering should be done with equality or regex matching is possible.
Possible solutions:
- Upon short consideration, it is expected that string columns are going to need not both types, but only one type of the filter expressions. E.g. author lists will generally be filtered with regexes, the primary key with equality. This could therefore be configured in the column declarations (i.e. in the database_config)
- Make LAPIS parse the
"value". If it is of the form"regex({})"/"LIKE {}", we will regex-match it, if"= {}"we will compare with equality. - Make LAPIS parse the
"column". If it is of the form"{column}Match", we will regex-match, otherwise we will compare with equality. - Change the API to take objects instead of one
"value"
PR Checklist
- [ ] All necessary documentation has been adapted.
- [ ] The implemented feature is covered by an appropriate test.
The latest updates on your projects. Learn more about Vercel for Git ↗︎
| Name | Status | Preview | Comments | Updated (UTC) |
|---|---|---|---|---|
| lapis | ✅ Ready (Inspect) | Visit Preview | 💬 Add feedback | Aug 7, 2024 10:11am |
This is a preview of the changelog of the next release:
0.2.9 (2024-08-07)
Features
- map string fields from the API to SILO's StringSearch instead of StringEquals expressions (62f3959)
This is a breaking change as any search for a string field will now also assume the value to be a regex string, whereas before it was just a string that was compared against the column values with equality.
I don't think it's good to force regex. There should be two types of string comparisons:
- exact
- regex
IIUC, you are proposing to get rid of exact?
I see that you list possible workarounds though. I would say it would be ok to start with expecting each field to be of exactly one of the two string types, configured in the database config. Is there a reason not to just start off with this? It'd probably be better to do the two things in one PR: don't remove exact search in one release, and then add it back in a later release. Instead: add regex search and don't make a breaking change.
Also, if this is a breaking change, this shouldn't be 0.2.9 but 0.3.0, right?
This PR will not be merged as such. We are thinking about how to handle it best. I would suggest adding new parameters for string values which have a suffix that indicates that it is a regex search. The question related to our other discussion is then whether the suffix should be Regex or _regex.
This PR will not be merged as such.
Ah I see, it sad "this is a breaking change" - which sounded like something planned to be released as a breaking change ;)
I would suggest adding new parameters for string values which have a suffix that indicates that it is a regex search. The question related to our other discussion is then whether the suffix should be Regex or _regex.
It might be safer not to rely on these magic parameter names - you could put the type also in the query value, e.g. authors=regex:^[^\t]*$&country=exact:Germany
Rather than authors_regex=^[^\t]*$&country_exact=Germany
It might be safer not to rely on these magic parameter names - you could put the type also in the query value, e.g. authors=regex:^[^\t]*$&country=exact:Germany
This would we a major breaking change, wouldn't it? I think that there should continue to be a simple way for people to filter for a exact value without prepending something.
The new field type could be opt in. Old field type would have simple exact string semantics and be backwards compatible.
You can also make the search type gate able via a second parameter: authors=[^\t]*&authors_type=regex or something like that
But yeah then one can just modify the name directly. I see your points 😀
superseded by #902 (#897 for overview)