Named Entity Recognition - Incorrect spans shown after labelling

Open pdhall99 opened this issue 2 years ago • 2 comments

Describe the bug In a Named Entity Recognition project the incorrect span is shown after labelling in some cases.

To Reproduce Steps to reproduce the behavior:

  1. Create a "Named Entity Recognition" project and import the following as a .txt ("Treat as list of tasks"):
👨🏻‍🚒 firemen drive firetrucks at work
  2. Click "Label All Tasks"
  3. Select "firetrucks" to be labelled
  4. Note that "ve firetru" is shown as the labelled span and the end of the text is cut off (see screenshot), but "firetrucks" is correctly marked in the exported JSON.

Expected behavior The selected word "firetrucks" is highlighted as the labelled span.

Screenshots Screenshot 2023-10-31 at 05 55 32 Screenshot 2023-10-31 at 05 55 46

Environment (please complete the following information):

  • OS: macOS Ventura 13.6
  • Label Studio Version 1.9.1 (Docker)

Additional context I assume this is something related to label indices (start and end) being positions in either a sequence of 16-bit Unicode code units (as they are in TypeScript/JavaScript) or in a sequence of Unicode code points (as they are in Python).

Take text = "👨🏻‍🚒 firemen drive firetrucks at work" as an example. Suppose we label the word "firetrucks":

  • In TypeScript/JavaScript, the word has start, end = 22, 32
  • In Python, the word has start, end = 19, 29, i.e., text[19:29] == "firetrucks"
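These offsets can be checked with a short Python sketch. It is only an illustration of the two counting schemes, not Label Studio code: the helper `utf16_len` is a hypothetical name, and the emoji is written with escapes (man, skin tone modifier, zero-width joiner, fire engine) so the code point sequence is explicit.

```python
# Emoji spelled out as escapes: U+1F468, U+1F3FB, U+200D (ZWJ), U+1F692.
text = "\U0001F468\U0001F3FB\u200D\U0001F692 firemen drive firetrucks at work"
word = "firetrucks"


def utf16_len(s: str) -> int:
    """Count UTF-16 code units in s (hypothetical helper).

    This matches what JavaScript's String.length reports: every code point
    above U+FFFF is stored as a surrogate pair, i.e. two code units.
    """
    return len(s.encode("utf-16-le")) // 2


# Python indexes strings by code point.
cp_start = text.index(word)
cp_end = cp_start + len(word)

# JavaScript indexes strings by UTF-16 code unit.
cu_start = utf16_len(text[:cp_start])
cu_end = utf16_len(text[:cp_end])

print((cp_start, cp_end))  # (19, 29)
print((cu_start, cu_end))  # (22, 32)
```

The difference of 3 comes from the three emoji code points above U+FFFF, each of which takes two code units; the zero-width joiner takes one in both schemes.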

See this Better Programming article for further explanation.

I note that if the correct code point span (19, 29) is misread as a code unit span, it corresponds to the code point span (16, 26), for which text[16:26] == "ve firetru", which is exactly what is displayed.
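A minimal sketch of that misinterpretation, assuming the only problem is that a code point offset gets treated as a UTF-16 code unit offset somewhere in the pipeline. The function `code_unit_to_code_point` is a hypothetical helper written for this illustration, not part of Label Studio.

```python
# Emoji spelled out as escapes: U+1F468, U+1F3FB, U+200D (ZWJ), U+1F692.
text = "\U0001F468\U0001F3FB\u200D\U0001F692 firemen drive firetrucks at work"


def code_unit_to_code_point(s: str, cu_offset: int) -> int:
    """Map a UTF-16 code unit offset to a code point offset (hypothetical helper).

    An offset that falls inside a surrogate pair is rounded up to the next
    code point boundary.
    """
    units = 0
    for i, ch in enumerate(s):
        if units >= cu_offset:
            return i
        # Code points above U+FFFF take two UTF-16 code units.
        units += 2 if ord(ch) > 0xFFFF else 1
    return len(s)


# The correct code point span, wrongly treated as code unit offsets.
wrong_start = code_unit_to_code_point(text, 19)
wrong_end = code_unit_to_code_point(text, 29)
print((wrong_start, wrong_end))       # (16, 26)
print(text[wrong_start:wrong_end])    # ve firetru

# The real code unit span maps back to the intended word.
right_start = code_unit_to_code_point(text, 22)
right_end = code_unit_to_code_point(text, 32)
print(text[right_start:right_end])    # firetrucks
```

This reproduces the displayed span, which is consistent with the offset mix-up described above but does not by itself show where in Label Studio the mix-up happens.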

Possibly related issues:

  • https://github.com/HumanSignal/label-studio/issues/4929
  • https://github.com/HumanSignal/label-studio/issues/4843
  • https://github.com/HumanSignal/label-studio/issues/2777

pdhall99, Oct 31 '23 06:10

Hi @pdhall99 - this bug may be related to the # of bytes in the emoji. We'll reproduce it on our side + file a ticket to get this addressed. Thank you for the detailed report!

jombooth, Nov 02 '23 19:11

Some extra info:

Screenshot from 2024-06-25 10-43-53

After selecting the first hand: Screenshot from 2024-06-25 10-44-01

After selecting the 1: Screenshot from 2024-06-25 10-44-14

After selecting the 1 and the hand: Screenshot from 2024-06-25 10-44-28

I thought this was related and probably will be fixed together, but let me know and I will file a separate bug!

mlumingu-ugent, Jun 25 '24 08:06