bot Adjust rich_embeds filter so it doesn't catch edge case

Right now, discord converts twitter links when applying the standard twitter embed treatment. Our check for rich embeds determines whether an embed is autogenerated or not by checking if the embed link is present within the source message here https://github.com/python-discord/bot/blob/fa02180246e6cf1245d5705b36a2056ed449ef6b/bot/exts/filters/filtering.py#L520, but this does not account for discords transformation of mobile.twitter.com links to twitter.com links, and thus a message with mobile.twitter.com in it is falsely detected as a rich embed for the watchlist.

This is probably fixable by detecting if the link goes to twitter.com and there is a link to mobile.twitter.com with the same parameters. That may get a little tricky, but twitter links aren't complex links so perhaps simple string operations will suffice.

We may also want to consider writing an "apply_link_transformations" function that will extract and transform relative links within a message to compare against the embed.

Jan 23 '21 22:01 bast0006

Superseded by #1530

Apr 17 '21 14:04 Akarys42

Considering this being an ongoing issue, it's unnecessary to block this bugfix until the new system. This is now open to implementing a fix for by anyone interested in doing so.

Jun 23 '21 21:06 scragly

Discussed in discord with @mbaruh.

I'll reproduce the issue, and when I understand that, I'll drop the proposition here and we can discuss it.

Sep 25 '22 21:09 shtlrs

@bast0006 Can you please provide the examples you've tried ? Because I'm not sure I understand correctly here since you said

and thus a message with mobile.twitter.com in it is falsely detected as a rich embed for the watchlist.

When I use this link https://twitter.com/Nadallica86/status/1574145915048398854 or https://mobile.twitter.com/nedbat/status/1504062673809338370, _has_rich_embed works just fine.

But as soon as I use twitter's Lite version (with the 'mobile' prefix), e.g. https://mobile.twitter.com/Nadallica86/status/1574145915048398854 or https://mobile.twitter.com/nedbat/status/1504062673809338370, the if msg.embeds: doesn't even run, because it's currently detected as a message with no embeds, when in fact it does.

Sep 25 '22 22:09 shtlrs

https://mobile.twitter.com/nedbat/status/1504062673809338370 is rendered as the following:

{'author': {'icon_url': 'https://pbs.twimg.com/profile_images/1559004270170628096/AaSLxb_F_400x400.jpg',
            'name': 'Ned Batchelder (@nedbat)',
            'proxy_icon_url': 'https://images-ext-1.discordapp.net/external/KcDnbosWojYPkPQ4Ummp1Ad-f2wJ8Ukq3PXnU7IK-ZU/https/pbs.twimg.com/profile_images/1559004270170628096/AaSLxb_F_400x400.jpg',
            'url': 'https://twitter.com/nedbat'},
 'color': 1942002,
 'description': 'A list in #python can contain itself as an element!',
 'fields': [{'inline': True, 'name': 'Likes', 'value': '686'}],
 'footer': {'icon_url': 'https://abs.twimg.com/icons/apple-touch-icon-192x192.png',
            'proxy_icon_url': 'snip',
            'text': 'Twitter'},
 'image': {'height': 452,
           'proxy_url': 'snip',
           'url': 'https://pbs.twimg.com/media/FN9skDtXoAMLWKI.jpg',
           'width': 912},
 'timestamp': '2022-03-16T11:51:00.226000+00:00',
 'type': 'rich',
 'url': 'https://twitter.com/nedbat/status/1504062673809338370'}

note how the url field here is for https://twitter.com/nedbat/status/1504062673809338370, but the url inside the message is https://mobile.twitter.com/nedbat/status/1504062673809338370, and the embed is of type rich.

I just tested with that link, and it does trigger the alert, as you can see:

Sep 27 '22 06:09 bast0006

There are three twitter embed cases here.

The first is https://twitter.com/nedbat/status/1504062673809338370, which embeds as type: rich with a 'url' of https://twitter.com/nedbat/status/1504062673809338370, which works as expected.

The second is https://m.twitter.com/nedbat/status/1504062673809338370, which embeds as type: link and thus is ignored appropriately.

The third is https://mobile.twitter.com/nedbat/status/1504062673809338370, which does embed as type: rich, but also has the 'url' edited to https://twitter.com/nedbat/status/1504062673809338370, which breaks the filter.

If you're having issues reproducing, check to make sure the bot doesn't think you have a staff account, it won't run this filter on staff users.

Sep 27 '22 06:09 bast0006

Thanks for the explanation, I turned out that I understood something else before, but I do see what you meant now.

I was thinking maybe we can eliminate the subdomain and only keep the 2nd level domain & TLD along with the same path & query parameters of course for comparison, but I'm wondering what could the edge cases be here.

The path would be the same and the domain as well (eventually), what's making me wonder if whether query parameters can be altered upon rendering that URL. ( on discord's side)

Like you said, it can take place using simple string manipulation, for example:

from urllib.parse import urlparse

def apply_link_transformations(url):
    "Transforms potential relative urls to absolute ones."
    parsed_url = urlparse(url)
    netloc_components = parsed_url.netloc.split(".")
    # Eliminate sublevel domain and use the second level domain and top level domain only 
    netloc_components[:] = netloc_components[-2:]
    netloc = ".".join(netloc_components)
    parsed_url = parsed_url._replace(netloc=netloc)
    return parsed_url.geturl()

And we'll only apply this on rich embeds of course.

Later, upon running our checks, we can either:

Check if the transformed url exists in the the content OR
Check if one of the two urls that we now have, e.g. the original and transformed, exists in the message content.

Let me know what you think

Sep 27 '22 08:09 shtlrs