
0th is tokenized instead of 4th, 5th, 6th, etc.

Open tbrodbeck opened this issue 5 years ago • 0 comments

Here is an example where 5th is tokenized as 0th (from the 2nd line of tifu_all_tokenized_and_filtered.json):

"selftext_html": "[...] Confuse a 5th grade girl for a boy in front of half of her class. Kids are mean. Sorry Sandra.</strong></p>\n</div><!-- SC_ON -->",
"tldr_tokenized": [
    "confuse",
    "a",
    "0th",
    "grade",
    "girl",
    "for",
    "a",
    "boy",
    "in",
    "front",
    "of",
    "half",
    "of",
    "her",
    "class",
    "kids",
    "are",
    "mean",
    "sorry",
    "sandra",
    "*"
  ],

I guess this is an error, or is it intended for some reason?

PS: Additionally, I just realized that the trailing * token is erroneous as well, isn't it? It is probably caused by the bold text in the original post (see https://www.reddit.com/r/tifu/comments/1ggydk/tifu_by_genderstereotyping/).
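
To illustrate what I suspect is happening, here is a minimal sketch. I have not seen the dataset's actual preprocessing code, so both functions below are hypothetical: `tokenize_with_digit_normalization` assumes every run of digits is replaced by `0` (which would turn 4th, 5th and 6th all into `0th`) and that `*` from Markdown bold survives as a token, while `tokenize_keeping_ordinals` shows one possible alternative. It does not try to reproduce the exact token list above.

```python
import re


def tokenize_with_digit_normalization(text):
    """Hypothetical tokenizer that reproduces the symptoms in this issue.

    Assumption: digits are normalized to "0" and "*" is kept as a token.
    """
    text = text.lower()
    text = re.sub(r"\d+", "0", text)          # "5th" becomes "0th"
    return re.findall(r"[a-z0-9]+|\*", text)  # words, plus stray "*" characters


def tokenize_keeping_ordinals(text):
    """Possible alternative: keep digits and drop Markdown emphasis markers."""
    text = text.lower()
    text = re.sub(r"\*+", " ", text)          # remove "*" / "**" emphasis markers
    return re.findall(r"[a-z0-9]+", text)     # words only, digits left untouched


# Shortened sample modeled on the TL;DR above, with bold markers at the end.
sample = "Confuse a 5th grade girl for a boy. Sorry Sandra.**"

normalized_tokens = tokenize_with_digit_normalization(sample)
kept_tokens = tokenize_keeping_ordinals(sample)

print(normalized_tokens)
print(kept_tokens)
```

If digit normalization is intended (for example to shrink the vocabulary), then `0th` is expected behavior and only the `*` token would need fixing.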

tbrodbeck, Jul 13 '20 09:07