NormalizedString.clear() broken?

Open lkurlandski opened this issue 1 year ago • 3 comments

Hello. I think there are some problems with NormalizedString (tokenizers 0.15.2).

In the following example, append() works as expected.

from tokenizers import NormalizedString

s = NormalizedString("Hi.")  # NormalizedString(original="Hi.", normalized="Hi.")
s.append("Hello.") # NormalizedString(original="Hi.", normalized="Hi. Hello.")

After using clear(), append() no longer modifies the normalized attribute.

from tokenizers import NormalizedString

s = NormalizedString("Hi.")  # NormalizedString(original="Hi.", normalized="Hi.")
s.clear()  # NormalizedString(original="Hi.", normalized="")
s.append("Hello.")  # NormalizedString(original="Hi.", normalized="")

This is also a problem with prepend.
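
A minimal sketch for checking both cases side by side. The helper name `normalized_after_clear` is hypothetical, and the script assumes the Python bindings expose `clear()`, `append()`, `prepend()` and the `normalized` attribute as used above. It prints what it observes rather than assuming a result, since the outcome depends on the installed version.

```python
from tokenizers import NormalizedString


def normalized_after_clear(method: str, text: str = "Hello.") -> str:
    """Clear a NormalizedString, call `method` ("append" or "prepend"),
    and return the resulting `normalized` value.

    Hypothetical helper for illustration; not part of tokenizers.
    """
    s = NormalizedString("Hi.")
    s.clear()                    # normalized becomes ""
    getattr(s, method)(text)     # s.append(text) or s.prepend(text)
    return s.normalized


if __name__ == "__main__":
    for method in ("append", "prepend"):
        # An empty string here means the behavior described in this report.
        print(method, repr(normalized_after_clear(method)))
```

An empty string printed for either method matches the behavior reported here on 0.15.2.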

lkurlandski avatar Sep 25 '24 15:09 lkurlandski

Indeed, would you like to have a go at it and open a PR? 🤗

ArthurZucker avatar Sep 26 '24 16:09 ArthurZucker

Has there been any update about this? I just encountered this as well :)

shaltielshmid avatar Nov 30 '24 23:11 shaltielshmid

Update: This issue was fixed in #1717

olp-cs avatar Nov 28 '25 11:11 olp-cs