Named Entity Recognition - Incorrect spans shown after labelling
Describe the bug In a Named Entity Recognition project, an incorrect span is highlighted after labelling in some cases (for example, when the text contains an emoji).
To Reproduce Steps to reproduce the behavior:
- Create a "Named Entity Recognition" project and import the following as a .txt ("Treat as list of tasks"):
👨🏻🚒 firemen drive firetrucks at work
- Click "Label All Tasks"
- Select "firetrucks" to be labelled
- Note "ve firetru" is selected as the label and the end of the text is cut off (see screenshot), but "firetrucks" is correctly marked in the exported JSON.
Expected behavior The selected word "firetrucks" is highlighted as the labelled span.
Screenshots
Environment (please complete the following information):
- OS: macOS Ventura 13.6
- Label Studio Version 1.9.1 (Docker)
Additional context I assume this is related to the label indices (start and end) being interpreted in one place as positions in a sequence of 16-bit Unicode (UTF-16) code units (as they are in TypeScript/JavaScript) and in another as positions in a sequence of Unicode code points (as they are in Python).
Take text = "👨🏻🚒 firemen drive firetrucks at work" as an example. Suppose we label the word "firetrucks":
- In TypeScript/JavaScript, the word has start, end = 22, 32
- In Python, the word has start, end = 19, 29, i.e.,
text[19:29] == "firetrucks"
See this Better Programming article for further explanation.
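To make the two sets of indices easy to check, here is a minimal Python sketch that computes both. It assumes the emoji is the ZWJ sequence U+1F468 U+1F3FB U+200D U+1F692, which is the only reading I found consistent with the indices above; `utf16_len` is a helper name made up for this example.

```python
# Assumption: the emoji is the ZWJ sequence U+1F468 U+1F3FB U+200D U+1F692,
# written with escapes so that the invisible joiner cannot get lost.
text = "\U0001F468\U0001F3FB\u200D\U0001F692 firemen drive firetrucks at work"
word = "firetrucks"


def utf16_len(s: str) -> int:
    """Number of UTF-16 code units in s (what JavaScript's .length counts)."""
    return len(s.encode("utf-16-le")) // 2


# Code point offsets, i.e. ordinary Python string indices.
cp_start = text.index(word)
cp_end = cp_start + len(word)

# Code unit offsets, i.e. JavaScript string indices: measure the prefix in UTF-16.
cu_start = utf16_len(text[:cp_start])
cu_end = cu_start + utf16_len(word)

print((cp_start, cp_end))  # (19, 29)
print((cu_start, cu_end))  # (22, 32)
```

The difference of 3 comes from the three emoji code points above U+FFFF, each of which takes two UTF-16 code units.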
I note that the code unit span (19, 29) (the correct code unit span is (22, 32)) corresponds to the code point span (16, 26), for which text[16:26] == "ve firetru", as is displayed.
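The suspected mix-up can be reproduced outside Label Studio with a short Python sketch. This is only an illustration of the hypothesis, not Label Studio's actual code: `to_code_point_offset` is a hypothetical helper, and the emoji is assumed to be the ZWJ sequence U+1F468 U+1F3FB U+200D U+1F692.

```python
# Assumption: the emoji is the ZWJ sequence U+1F468 U+1F3FB U+200D U+1F692.
text = "\U0001F468\U0001F3FB\u200D\U0001F692 firemen drive firetrucks at work"


def to_code_point_offset(s: str, cu_offset: int) -> int:
    """Convert a UTF-16 code unit offset into a code point offset.

    An offset that falls inside a surrogate pair is rounded up to the
    next code point boundary.
    """
    units = 0
    for cp_index, ch in enumerate(s):
        if units >= cu_offset:
            return cp_index
        # Code points above U+FFFF occupy two UTF-16 code units.
        units += 2 if ord(ch) > 0xFFFF else 1
    return len(s)


# Expected path: the JavaScript offsets (22, 32) converted to code points.
correct = tuple(to_code_point_offset(text, i) for i in (22, 32))

# Suspected bug path: the code point offsets (19, 29) read as code units.
shifted = tuple(to_code_point_offset(text, i) for i in (19, 29))

print(correct, repr(text[correct[0]:correct[1]]))  # (19, 29) 'firetrucks'
print(shifted, repr(text[shifted[0]:shifted[1]]))  # (16, 26) 've firetru'
```

If this hypothesis is right, any text containing characters outside the Basic Multilingual Plane before the labelled span would show the same kind of shift, by one position per such character.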
Possibly related issues:
- https://github.com/HumanSignal/label-studio/issues/4929
- https://github.com/HumanSignal/label-studio/issues/4843
- https://github.com/HumanSignal/label-studio/issues/2777
Hi @pdhall99 - this bug may be related to the # of bytes in the emoji. We'll reproduce it on our side + file a ticket to get this addressed. Thank you for the detailed report!
Some extra info:
After selecting the first hand:
After selecting the 1:
After selecting the 1 and the hand:
I think this is related and will probably be fixed together, but let me know if not and I will file a separate bug!