language_tool_python

Offset position "longer" than text

Open Kailegh opened this issue 2 years ago • 1 comments

I have a match that looks like this:

Match({'ruleId': 'MORFOLOGIK_RULE_ES', 'message': 'Se ha encontrado un posible error ortográfico.', 'replacements': ['teléfonos', 'teléfono', 'telefotos'], 'offsetInContext': 43, 'context': '...𝒚 𝒛𝒂𝒑𝒊𝒐𝒍𝒂 podemos compartir tus telefonos con el conductor 𝑺𝒊', 'offset': 307, 'errorLength': 9, 'category': 'TYPOS', 'ruleIssueType': 'misspelling', 'sentence': 'Rider > Lost Items > Standard lost item > Driver found riders itemdescripcion del articulo perdido 𝑴𝒆 𝒐𝒍𝒗𝒊𝒅𝒆 𝒖𝒏𝒂 𝒎𝒐𝒄𝒋𝒊𝒍𝒂 𝒏𝒆𝒈𝒓𝒔 ingresa un numero de telefono alternativo incluye el codigo de tu pais informacion sobre el viaje 𝒂𝒏𝒅𝒓𝒆𝒔𝒊𝒕𝒐 𝒚 𝒛𝒂𝒑𝒊𝒐𝒍𝒂 podemos compartir tus telefonos con el conductor 𝑺𝒊'})

The original sentence is the one shown in the example: Rider > Lost Items > Standard lost item > Driver found riders itemdescripcion del articulo perdido 𝑴𝒆 𝒐𝒍𝒗𝒊𝒅𝒆 𝒖𝒏𝒂 𝒎𝒐𝒄𝒋𝒊𝒍𝒂 𝒏𝒆𝒈𝒓𝒔 ingresa un numero de telefono alternativo incluye el codigo de tu pais informacion sobre el viaje 𝒂𝒏𝒅𝒓𝒆𝒔𝒊𝒕𝒐 𝒚 𝒛𝒂𝒑𝒊𝒐𝒍𝒂 podemos compartir tus telefonos con el conductor 𝑺𝒊
The problem is that the reported offset is 307, while the sentence is only 296 characters long.
I think the cause is that the text contains some characters that internally take more than one position in the Unicode encoding (they are composed of multiple chars). As a result, when I try to map the detection back to the original text I get an error, because the reported position is wrong and does not point to the true position in the text.
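The numbers in this report are consistent with the offset being counted in UTF-16 code units rather than in Python characters: "telefonos" starts at character index 267, 40 styled letters precede it, and each of those would occupy two UTF-16 code units, which gives 267 + 40 = 307. Assuming that is what happens, a minimal sketch of a conversion back to a Python string index could look like this (the helper name `utf16_offset_to_index` is hypothetical and not part of language_tool_python):

```python
def utf16_offset_to_index(text: str, utf16_offset: int) -> int:
    """Map an offset counted in UTF-16 code units to a Python str index.

    Assumption: the reported offset counts UTF-16 code units.
    """
    units = 0
    for index, char in enumerate(text):
        if units >= utf16_offset:
            return index
        # Characters above U+FFFF need two UTF-16 code units (a surrogate pair).
        units += 2 if ord(char) > 0xFFFF else 1
    return len(text)


# Shortened example: two styled letters followed by plain text.
styled = "\U0001D474\U0001D486 tus telefonos"
reported_offset = 9  # where "telefonos" starts when counted in UTF-16 code units
start = utf16_offset_to_index(styled, reported_offset)
print(start, styled[start:start + 9])
```

The same function could be applied to `offset` before slicing the original text, with `errorLength` converted the same way if the flagged span itself contains such characters.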

Kailegh, Nov 13 '23 10:11

@Kailegh

Your problem is indeed reproducible with the following code:

import language_tool_python

language = "ES"
tool = language_tool_python.LanguageTool(language)
text = 'Rider > Lost Items > Standard lost item > Driver found riders itemdescripcion del articulo perdido 𝑴𝒆 𝒐𝒍𝒗𝒊𝒅𝒆 𝒖𝒏𝒂 𝒎𝒐𝒄𝒋𝒊𝒍𝒂 𝒏𝒆𝒈𝒓𝒔 ingresa un numero de telefono alternativo incluye el codigo de tu pais informacion sobre el viaje 𝒂𝒏𝒅𝒓𝒆𝒔𝒊𝒕𝒐 𝒚 𝒛𝒂𝒑𝒊𝒐𝒍𝒂 podemos compartir tus telefonos con el conductor 𝑺𝒊'
# len() counts Unicode code points, so each styled letter counts once.
print("len(text)", len(text))
matches = tool.check(text)
for match in matches:
    print(match)
# The module itself must be imported for utils.correct to be reachable.
corrected_text = language_tool_python.utils.correct(text, matches)
print(corrected_text)

tool.close()

Note that some words in your text use special formatting (styled Unicode letters rather than plain ones). If you clean your text as follows, it seems to work normally:

text_cleaned = 'Rider > Lost Items > Standard lost item > Driver found riders itemdescripcion del articulo perdido Me olvide una mocjila negrs ingresa un numero de telefono alternativo incluye el codigo de tu pais informacion sobre el viaje andresito y zapiola podemos compartir tus telefonos con el conductor Si'
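Rather than retyping the text by hand, the styled letters can probably be folded to plain ones with Unicode compatibility normalization from the standard library. This is a sketch under the assumption that the styled words are Mathematical Bold Italic letters, which NFKC maps to their ordinary counterparts; the function name `strip_styling` is hypothetical:

```python
import unicodedata


def strip_styling(text: str) -> str:
    """Fold compatibility characters (e.g. Mathematical Bold Italic) to plain letters."""
    return unicodedata.normalize("NFKC", text)


styled = "𝑴𝒆 𝒖𝒏𝒂"
cleaned = strip_styling(styled)
print(cleaned)
```

Note that NFKC also rewrites other compatibility characters (ligatures, full-width forms, superscripts) and can change the length of the string, so any offsets returned afterwards refer to the normalized text, not the original.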

pidefrem, Apr 11 '24 14:04

Seems resolved in #94. Updating shortly.

jxmorris12, Aug 22 '24 18:08