OpenRefine icon indicating copy to clipboard operation
OpenRefine copied to clipboard

non-breakable space and preview

Open fabiolinus opened this issue 3 years ago • 6 comments

Hi. I attached an example of csv file; it has only three rows. Row 1 and 3 have a non-breakable space at the end, Row 2 has a normal space instead (notepad++ screenshot):

image

I checked the option 'Trim leading & trailing whitespace from strings' before to create a project. It fails because I think it can't recognize non-breakable spaces. Once the project is created, the trim() command deletes spaces without any problems. Can you confirm that? Thanks a lot,

Fabio

ExampleData.csv

fabiolinus avatar Sep 11 '22 21:09 fabiolinus

Hi @fabiolinus, perhaps it would be easier to reproduce this if you could provide an example file (as a file, not a string in the issue, because GitHub might be sanitizing things).

wetneb avatar Sep 11 '22 21:09 wetneb

I had a look at the code after @fabiolinus posted this on the mailing list and found (if I'm reading it correctly) that 'Trim leading & trailing whitespace from strings' on import and the GREL trim() command work differently.

  • Trim leading & trailing whitespace from strings -> defined in TabularImportingParserBase https://github.com/OpenRefine/OpenRefine/blob/cb55cdfdf6f9ca916839778dc847cce803688998/main/src/com/google/refine/importers/TabularImportingParserBase.java#L187 which uses the Java String 'trim()' function )defines whitespace as any character with a code point less than or equal to the space code point (\u0020) )
  • GREL 'trim()' -> uses com.google.common.base.CharMatcher.whitespace() which defines whitespace as any unicode whitespace character (see https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5Cp%7Bwhitespace%7D)

The trim() command originally used Java String 'trim()' but was changed to the CharMatcher some time ago (12 years!) exactly because Java String trim() doesn't include non-breaking spaces (https://github.com/OpenRefine/OpenRefine/commit/f29f77e8f8a0024deb50b3a13dec2874c859a7fb)

ostephens avatar Sep 12 '22 09:09 ostephens

Brilliant, thanks for the investigation! Then this is probably a good first issue.

wetneb avatar Sep 12 '22 15:09 wetneb

I think #5105 is a superset of this.

tfmorris avatar Sep 12 '22 18:09 tfmorris

And please, please remember me #4425

isaomatsunami avatar Sep 12 '22 18:09 isaomatsunami

Sorry Antonin. I updated my post in github. Fabio Il domenica 11 settembre 2022 23:22:04 CEST, Antonin Delpeuch @.***> ha scritto:

Hi @fabiolinus, perhaps it would be easier to reproduce this if you could provide an example file (as a file, not a string in the issue, because GitHub might be sanitizing things).

— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.Message ID: @.***>

fabiolinus avatar Sep 12 '22 21:09 fabiolinus

Hey @wetneb If no one is working on this, can I pick this up? I see this is a good first issue and would like to use this opportunity to work on this.

sritejap avatar Sep 24 '22 05:09 sritejap

Okay! As @tfmorris mentioned above this would probably be solved by #5105 so it is worth considering tackling that instead :)

wetneb avatar Sep 24 '22 06:09 wetneb

Thanks! seems like someone is already working on this.

sritejap avatar Sep 24 '22 06:09 sritejap

Yes, sure. Sorry again, Fabio Il domenica 11 settembre 2022 23:22:04 CEST, Antonin Delpeuch @.***> ha scritto:

Hi @fabiolinus, perhaps it would be easier to reproduce this if you could provide an example file (as a file, not a string in the issue, because GitHub might be sanitizing things).

— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.Message ID: @.***>

fabiolinus avatar Oct 11 '22 07:10 fabiolinus

Closed by #5336

wetneb avatar Oct 25 '22 19:10 wetneb