Fulltextsearch: CSV data is not indexed correctly; semicolons within numeric sequences are not recognized as separators
Background
While using Nextcloud's Fulltextsearch functionality, I observed unexpected search results. Specifically, I attempted to search for a string within Nextcloud and did not obtain the desired outcome. I later manually executed a wildcard search and got the results I was expecting, suggesting that the search might not be utilizing a full wildcard search.
My search string within the Nextcloud UI is 00340435081879685627, but unfortunately I do not get the expected results.
If I search for the larger string 263200002;30.05.23;00340435081879685627, I get the expected results, but not if I search for 00340435081879685627 alone. There also seems to be no option to perform a wildcard search within the Nextcloud UI.
Issue
The underlying problem seems to be how Elasticsearch's standard analyzer, in the backend, deals with semicolons (;) inside numeric sequences. To illustrate, the string "263200002;30.05.23;00340435081879685627" is kept as one single token ("263200002;30.05.23;00340435081879685627") rather than being split into three separate tokens: "263200002", "30.05.23" and "00340435081879685627". This is presumably because the standard tokenizer follows the Unicode UAX #29 word-boundary rules, in which a semicolon between digits is treated as a potential numeric separator and therefore does not break the token.
One line of the CSV: 50784682326201;GERMANY;31.05.23;1549911022;DE;Dummy Text;City;70374;;DE;263200002;30.05.23;00340435081879685627;AU-840488-2023;1;;1,000;1;2,61;;0,49;3,10;;"
If I run this string through the analyzer, I get the following tokens.
root@ubuntu2204:~# curl -X POST "localhost:9200/astek_index/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
"text": "50784682326201;GERMANY;31.05.23;1549911022;DE;Dummy Text;City;70374;;DE;263200002;30.05.23;00340435081879685627;AU-840488-2023;1;;1,000;1;2,61;;0,49;3,10;;"
}'
Tokens:
50784682326201
germany
31.05.23;1549911022
de
Dummy
Text
City
70374
de
263200002;30.05.23;00340435081879685627
au
840488
2023;1
1,000;1;2,61
0,49;3,10
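To see why the search fails, the inverted index can be thought of as a set of these tokens: an exact term lookup for the inner number misses, while the full semicolon-joined string hits. The following is a deliberately simplified model (not Elasticsearch's actual matching, and using only a subset of the tokens above):

```python
# Hypothetical subset of the tokens the standard analyzer produced
# for the CSV line above.
indexed_tokens = {
    "50784682326201",
    "germany",
    "31.05.23;1549911022",
    "263200002;30.05.23;00340435081879685627",
}

def term_match(query: str) -> bool:
    # An exact term query only hits whole tokens in the index.
    return query in indexed_tokens

print(term_match("00340435081879685627"))
# False: the number is buried inside a larger token
print(term_match("263200002;30.05.23;00340435081879685627"))
# True: the query matches the whole joined token
```

This mirrors the observed behavior: only the full semicolon-joined string finds the document, because that is the token that was actually indexed.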
Expected Behavior
The standard analyzer should consistently recognize semicolons as separators, irrespective of whether they are amidst textual or numeric strings.
Proposed Solution
A possible approach could be to modify Elasticsearch's standard tokenizer or develop a custom tokenizer that accurately identifies semicolons as separators. This should, in turn, improve the precision of searches within Nextcloud.
Steps to Reproduce
- Set up and customize the Fulltextsearch plugin with Elasticsearch.
- Index documents containing semicolons interspersed within numeric sequences. For example, use a file whose content is the string "263200002;30.05.23;00340435081879685627".
- Use Elasticsearch's _analyze API to scrutinize how the text is tokenized.
- Observe that numbers delineated by semicolons aren't correctly fragmented into distinct tokens.
- In Nextcloud, execute a search for a numeric string with semicolons and observe the unexpected search results.
- Manually conduct a wildcard search, confirming that the desired results can be achieved through this method.
I think I found a solution for my issue.
In Elasticsearch, I will use a different tokenizer instead of the standard tokenizer.
For that I will use a custom pattern tokenizer. At the moment, the following regular expression is working fine for me. It does almost the same as the standard tokenizer and, in addition, processes CSV data correctly.
Once I add this tokenizer to Elasticsearch, I should be able to reference it in Nextcloud's Fulltextsearch settings tab.
# Analysis settings are static, so the index must be closed before they can
# be updated, and reopened afterwards.
curl -X POST "localhost:9200/my_index/_close"

curl -X PUT "localhost:9200/my_index/_settings" -H 'Content-Type: application/json' -d '{
  "analysis": {
    "tokenizer": {
      "my_new_tokenizer": {
        "type": "pattern",
        "pattern": "(?<!\\d)[.,](?!\\d)|\\s|;"
      }
    }
  }
}'

curl -X POST "localhost:9200/my_index/_open"

# Documents indexed before this change still carry the old tokens and need to
# be reindexed to pick up the new tokenizer.
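The regex can be sanity-checked outside Elasticsearch, since Java and Python regex syntax agree for this pattern. The pattern tokenizer splits on every match: a '.' or ',' only when it is not between two digits (so "30.05.23" and "1,000" survive), any whitespace, and every ';'. A minimal sketch:

```python
import re

# Same pattern as in the tokenizer definition above (single-backslash form,
# since this is not embedded in JSON).
PATTERN = r"(?<!\d)[.,](?!\d)|\s|;"

def tokenize(text: str) -> list[str]:
    # Elasticsearch's pattern tokenizer drops the empty tokens that adjacent
    # separators (";;") produce, so we filter them out here as well.
    return [tok for tok in re.split(PATTERN, text) if tok]

print(tokenize("263200002;30.05.23;00340435081879685627"))
# ['263200002', '30.05.23', '00340435081879685627']
print(tokenize("1;;1,000;1;2,61;;0,49;3,10"))
# ['1', '1,000', '1', '2,61', '0,49', '3,10']
```

One difference from the standard tokenizer worth noting: this pattern does not split on hyphens or other punctuation, so "AU-840488-2023" stays a single token instead of becoming "au", "840488", "2023". Whether that is desirable depends on the data.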