Tests for username preceded and surrounded by Japanese characters
In the unit test module, I noticed that both test_username_preceded_japanese and test_username_surrounded_japanese expect the parser to generate an HTML link element, although the @username is preceded by あ. On the other hand, the reply regular expression states that usernames should be preceded by whitespace.
Is there a reason I missed for this behaviour?
If I'm reading this right, a tweet only counts as a reply when the username sits at the very beginning of the tweet, with no other characters before it except perhaps whitespace.
The Japanese tests are looking for @ mentions, which can be preceded or surrounded by other characters, and thus would not be considered a reply.
Maybe I wasn't so clear. Let's say I write a@user, then the regexp will not consider the @user substring a username, and thus the parser won't create a link tag.
My question is why this would be different if I have あ@user instead, even though "あ" is also a character?
I came across this while trying to port the code to Python 3. Since Python 3's Unicode-aware matching treats "あ" as a word character, it is captured the same way "a" would be.
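To illustrate the point, here is a minimal check, assuming only the standard re module and Python 3 (the variable names are made up for this example and are not from the library):

```python
import re

# On Python 3, str patterns use Unicode matching by default,
# so \w matches the hiragana character just like an ASCII letter.
ascii_letter_is_word = bool(re.match(r"\w", "a"))
hiragana_is_word = bool(re.match(r"\w", "あ"))

# With re.ASCII, \w is restricted to [a-zA-Z0-9_], so "あ" no longer matches.
hiragana_is_ascii_word = bool(re.match(r"\w", "あ", re.ASCII))

print(ascii_letter_is_word, hiragana_is_word, hiragana_is_ascii_word)
```

So any pattern that uses \w (or \W, \b) to decide what may come before the @ will treat "あ" and "a" alike on Python 3 unless re.ASCII is set.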
Another contributor has committed a Python 3 port. Are you aware of this? I went ahead and merged it in today.
https://github.com/edburnett/twitter-text-python/pull/6
The same Japanese tests are failing.
I'm the new maintainer so I probably need to study this a bit more and get back to you on your question.
Yes, I saw his pull request, as well as his struggle with this very problem. I'm looking forward to hearing from you once you've had a look at this issue.
Unless I'm thinking about this wrong, can you not simply use re.UNICODE along with re.IGNORECASE?
re.UNICODE
Make \w, \W, \b, \B, \d, \D, \s and \S dependent on the Unicode character properties database.
New in version 2.0.
It does cause the あ tests to be reported as failing.
It also allows the code to work on Python 3.
re.A
re.ASCII
Make \w, \W, \b, \B, \d, \D, \s and \S perform ASCII-only matching instead of full Unicode matching. This is only meaningful for Unicode patterns, and is ignored for byte patterns.
Note that for backward compatibility, the re.U flag still exists (as well as its synonym re.UNICODE and its embedded counterpart (?u)), but these are redundant in Python 3 since matches are Unicode by default for strings (and Unicode matching isn’t allowed for bytes).
It looks like Twitter itself links "あ@username", but not "a@username"...
The twitter-text conformance tests also show this: https://github.com/twitter/twitter-text/blob/master/conformance/autolink.yml#L29
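A rough sketch of how that behaviour could be reproduced, assuming the rule is simply "the @ must not directly follow an ASCII letter, digit, or underscore". MENTION_RE and link_mentions are hypothetical names, the pattern is much simpler than the real twitter-text one, and the link markup is only illustrative:

```python
import re

# Hypothetical simplified pattern, not the library's actual regex:
# the '@' must not directly follow an ASCII word character.
# Explicit ASCII classes are used so the result is the same on
# Python 2 and Python 3, regardless of re.UNICODE / re.ASCII.
MENTION_RE = re.compile(r"(?<![A-Za-z0-9_])@([A-Za-z0-9_]{1,20})")


def link_mentions(text):
    """Wrap each matched @username in an anchor tag (illustrative markup)."""
    return MENTION_RE.sub(r'<a href="https://twitter.com/\1">@\1</a>', text)


print(link_mentions("a@user"))   # left unchanged: '@' follows an ASCII letter
print(link_mentions("あ@user"))  # linked: 'あ' is not an ASCII word character
```

Spelling out the ASCII classes avoids relying on what \w means under each Python version, which seems to be the source of the differing test results.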