URL extraction does not capture entire path and query if "www" is omitted.
Describe the bug In URL extraction, I believe URLs missing "www." and with at least one path segment do not correctly recognise the full path and query string. As far as I can tell only the first path segment is matched, and none of the query parameters.
To Reproduce
JS examples
The following examples were checked in the chrome devtools console, running a Rails stack. It uses the twitter-text-js-rails wrapper gem at v1.9.1, so that claims to be using v1.9.1 of twitter-text. I don't have a convenient environment that can upgrade the JS version.
- Does not behave as expected:
twttr.txt.extractUrls("https://t.co/a?amp=1")
> ["https://t.co/a"]
twttr.txt.extractUrls("https://t.co/a/b?amp=1")
> ["https://t.co/a"]
twttr.txt.extractUrls("https://t.co/a/b")
> ["https://t.co/a"]
- Behaves as expected:
twttr.txt.extractUrls("https://www.t.co?amp=1")
> ["https://www.t.co?amp=1"]
twttr.txt.extractUrls("https://www.t.co/a/b?amp=1")
> ["https://www.t.co/a/b?amp=1"]
twttr.txt.extractUrls("https://www.t.co/a/b")
> ["https://www.t.co/a/b"]
twttr.txt.extractUrls("https://t.co?amp=1")
> ["https://t.co?amp=1"]
Ruby examples
I also reproduced this on the 3.1.0 version of the twitter-text gem, which I think is the latest version.
- Does not behave as expected:
irb(main):002:0> Twitter::TwitterText::Extractor.extract_urls("https://t.co/a?amp=1")
=> ["https://t.co/a"]
irb(main):004:0> Twitter::TwitterText::Extractor.extract_urls("https://t.co/a/b?amp=1")
=> ["https://t.co/a"]
irb(main):005:0> Twitter::TwitterText::Extractor.extract_urls("https://t.co/a/b")
=> ["https://t.co/a"]
Expected behavior I think the string argument is in all cases a complete URL, and I would have expected the return value to be the whole string (in an array).
Environment Apart from the details above, I'm running Ubuntu.
I took a closer look into the source code for extractUrls. I didn't realise that t.co URLs were a special case! In particular, this comment seems to imply that the behaviour I'm describing is done intentionally.
// In the case of t.co URLs, don't allow additional path characters.
if (url.match(validTcoUrl)) {
...
Indeed,
twttr.txt.extractUrls("https://t.co/abc?amp=1")
["https://t.co/abc"]
twttr.txt.extractUrls("https://g.co/abc?amp=1")
["https://g.co/abc?amp=1"]
The issue arose from users copying links from twitter and pasting them onto our platform. The link has an additional ?amp=1, so the effect is that somebody pastes https://t.co/LY1EMFy7TW?amp=1 to us, and the resulting HTML looks like <a>https://t.co/LY1EMFy7TW</a>?amp=1.
This is not a very nice solution, but I found that if I set
(ruby)
Twitter::TwitterText::Regex::REGEXEN[:valid_tco_url] = /$^/
(js)
twttr.txt.regexen.validTcoUrl = /$^/;
then I get the behaviour I expect.