osf.io icon indicating copy to clipboard operation
osf.io copied to clipboard

[ENG-5862] SPAM - Fix Wiki Spamming

Open antkryt opened this issue 7 months ago • 2 comments

Purpose

verify if spammy domains are detected

Changes

  • merge check_resource_for_domains_postcommit and check_resource_with_spam_services tasks to avoid race condition
  • compare note to value not enum
  • log detected domains to sentry

QA Notes

You can test it with domain xakw1.com on staging3. Currently project won't be banned with this domain, regardless of whether it's public or not.

Documentation

Side Effects

Ticket

https://openscience.atlassian.net/browse/ENG-5862

antkryt avatar Jun 05 '25 15:06 antkryt

The issue described in the ticket is not entirely accurate. I found the domain "xawk1.com" on staging, and content containing this domain will not be banned even if the project is public (here is an example). And on the other side, if you repeat the steps from the ticket using some other domains, everything works as expected.

However, the spam check will always be triggered because the DomainReference is always created and you can verify it in django admin (the only place it’s created is in the function _check_resource_for_domains). Therefore, only two possibilities remain:

if notable_domain.note == NotableDomain.Note.EXCLUDE_FROM_ACCOUNT_CREATION_AND_CONTENT is not being triggered or resource.confirm_spam(save=True, domains=list(spammy_domains)) silently fails (changes are not saved in the database or racing condition and changes are overwritten with some other process).

Probably changes made by check_resource_for_domains_postcommit are overwritten by check_resource_with_spam_services. Both of them start almost at the same time and load same (non-spam) version of the resource at the beginning. Which task is completed last, such changes will be saved in the database (typically check_resource_with_spam_services ends last because we make request to external service)

antkryt avatar Jun 05 '25 15:06 antkryt

@Johnetordoff I'm not sure that it's possible to write a test case to illustrate what I'm changing here, so I'll try to explain better.

Both check_resource_for_domains_postcommit and check_resource_with_spam_services are doing the same thing: start after response is sent to client (run_postcommit decorator) -> load node with spam_status=UNKNOWN -> process -> save changes to db.

As you can see, tasks don't know anything about each other and changes that are made (it's two different parallel processes). So, if check_resource_with_spam_services finishes after check_resource_for_domains_postcommit, then changes made by check_resource_for_domains_postcommit will be overwritten (and vice versa). Locally everything works because SPAM_SERVICES_ENABLED is false and only check_resource_for_domains_postcommit is running.

Also, I've changed signature of the def _check_resource_for_domains(guid, content) to def _check_resource_for_domains(resource, content) and removed confirm_spam() call from it, that's why I've updated some tests to make them work as before

As for checking spam during privacy change tests, we have bunch of those (see osf_tests/test_node.py::TestNodeSpam). I'll add a few tests to check if new check_resource_for_spam_postcommit function works properly

antkryt avatar Jun 10 '25 12:06 antkryt