python-wordpress-xmlrpc

Implement throttling

Open gregturn opened this issue 9 years ago • 3 comments

My provider has strict brute-force protections in place that this library easily runs into. Is there any way to implement some built-in throttling? As a comparison, Pywikipediabot automatically slows things down.

Here is the type of policy I'm bumping into (which may be common across many providers):

  • An IP gets blocked for ten minutes if it accesses wp-login.php or xmlrpc.php more than 20 times in one minute.
  • If there are three such blocks in a three-hour period, the IP is permanently blocked.

It appears that every call() makes a new connection to xmlrpc.php, and hence counts as a mark against the threshold. Would there be a way to count the calls and, once the count hits 10, automatically sleep for a minute?
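
The counting idea could be sketched roughly like this. It is only an illustration, not something the library provides: `CallCounter`, `max_calls`, `pause`, and `tick` are made-up names, and the defaults simply mirror the "10 calls, then sleep for a minute" suggestion.

```python
import time


class CallCounter(object):
    """Hypothetical helper: pause once `max_calls` calls have been made."""

    def __init__(self, max_calls=10, pause=60, sleep=time.sleep):
        self.max_calls = max_calls
        self.pause = pause
        self.sleep = sleep  # injectable, so the logic can be tested without waiting
        self.count = 0

    def tick(self):
        """Call once before each request; sleeps when the budget is used up."""
        if self.count >= self.max_calls:
            self.sleep(self.pause)
            self.count = 0
        self.count += 1
```

A script would call `counter.tick()` right before each `client.call(...)`. This is deliberately crude: it always pauses for the full minute, even if the previous ten calls were already spread out.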

Thoughts?

gregturn avatar Jan 24 '16 17:01 gregturn

This could probably be implemented as a custom Transport implementation, but we should start by figuring out how it would work at a conceptual level.
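
As a starting point for discussion, a throttling transport might look something like the sketch below (Python 3 only). It assumes the standard library's `xmlrpc.client.Transport` is the base class in play; `ThrottledTransport`, `min_interval`, and `_wait` are hypothetical names, and the 6-second default is just 60 seconds divided by 10 calls.

```python
import time
import xmlrpc.client


class ThrottledTransport(xmlrpc.client.Transport):
    """Hypothetical transport that keeps requests `min_interval` seconds apart.

    For https:// endpoints the same idea would apply to
    xmlrpc.client.SafeTransport instead.
    """

    def __init__(self, min_interval=6.0, clock=time.monotonic,
                 sleep=time.sleep, **kwargs):
        super().__init__(**kwargs)
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def _wait(self):
        """Sleep just long enough to respect min_interval."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now

    def request(self, host, handler, request_body, verbose=False):
        # Throttle first, then delegate to the normal transport logic.
        self._wait()
        return super().request(host, handler, request_body, verbose)
```

Assuming `Client` passes a `transport` argument through to the underlying `ServerProxy`, consuming code would only need to construct the client with `transport=ThrottledTransport()`; that wiring is an assumption and has not been verified here.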

From code that is consuming this library, what would the expected API or behavior be? That the whole thread gets put to sleep for minutes/hours after you hit a configurable throttle level? Or should it throw an exception and let the consumer figure out how it wants to wait (e.g., put a followup task in a message queue, or spawn another thread/process, or ...)?
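
To make the two options concrete, here is a rough sketch of a sliding-window throttle that can either block or raise. None of these names exist in the library: `WindowThrottle`, `ThrottleExceeded`, `block`, and `retry_after` are placeholders for whatever API we settle on.

```python
import time
from collections import deque


class ThrottleExceeded(Exception):
    """Hypothetical exception telling the caller how long to wait."""

    def __init__(self, retry_after):
        super().__init__("throttle limit reached; retry in %.1f s" % retry_after)
        self.retry_after = retry_after


class WindowThrottle(object):
    """Allow at most `limit` calls per `window` seconds (sliding window)."""

    def __init__(self, limit=10, window=60.0, block=True,
                 clock=time.monotonic, sleep=time.sleep):
        self.limit = limit
        self.window = window
        self.block = block  # True: sleep in place; False: raise ThrottleExceeded
        self._clock = clock
        self._sleep = sleep
        self._stamps = deque()  # times of the most recent calls

    def acquire(self):
        now = self._clock()
        # Forget calls that have fallen out of the window.
        while self._stamps and now - self._stamps[0] >= self.window:
            self._stamps.popleft()
        if len(self._stamps) >= self.limit:
            wait = self.window - (now - self._stamps[0])
            if not self.block:
                raise ThrottleExceeded(wait)
            self._sleep(wait)
            now = self._clock()
            self._stamps.popleft()  # the oldest call has now aged out
        self._stamps.append(now)
```

With `block=True` the calling thread simply stalls inside `acquire()`. With `block=False` the consumer catches `ThrottleExceeded` and decides what to do with `retry_after`, for example re-queueing the task.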

Do you have links to documentation for pywikipediabot or other libraries that implement similar behavior?

maxcutler avatar Jan 24 '16 18:01 maxcutler

This is what I wrote to implement a find-and-replace (slightly redacted):

#!/usr/bin/env python

import re

from wordpress_xmlrpc import Client
from wordpress_xmlrpc.methods import posts

from difflib import Differ
from pprint import pprint

import time

def process(client, post):
    # print("Analyzing %s..." % post)
    if '<old>' in post.content:
        print("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!")
        print("!!! '%s' contains the pattern you seek!" % post.title)
        update = re.sub('<old>', '<new>', post.content)
        print("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!")
        print("BEFORE: %s" % post.content)
        print("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!")
        print("AFTER: %s" % update)
        print("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!")
        post.content = update
        # EditPost is one more hit on xmlrpc.php
        if client.call(posts.EditPost(post.id, post)):
            print("!!!! SUCCESS!!!")
        print("")
    else:
        print("___ '%s' does NOT contain the pattern you seek" % post.title)
        print("")


client = Client('<website>/xmlrpc.php', 'username', 'password')

offset = 0
increment = 10
while True:
    articles = client.call(posts.GetPosts({'number': increment, 'offset': offset}))
    if len(articles) == 0:
        break  # no more posts returned
    for article in articles:
        process(client, article)
    offset += increment
    time.sleep(2)  # <--- turns out, this wasn't enough throttling!

gregturn avatar Jan 24 '16 22:01 gregturn

Here is a link to the throttle code in the up-to-date pywikibot (formerly pywikipediabot):

https://github.com/wikimedia/pywikibot-core/blob/master/pywikibot/throttle.py

Essentially, it encapsulates a lock: the lock is acquired, a sleep timer runs, and then the lock is released. I think routing all calls to ServerProxy through such a throttle would let you plug in different strategies. Perhaps the default at first is NOT to throttle (for backwards compatibility), but config options would let you plug in a policy, such as counting and timing, to avoid exceeding the provider's effective SLA.
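
Roughly what I have in mind, as a sketch only (Python 3). This is not pywikibot's actual code, and `NoThrottle`, `FixedInterval`, `ThrottledCaller`, and the `delay()` method are all names invented for illustration.

```python
import threading
import time


class NoThrottle(object):
    """Default policy: never wait, so existing behavior is unchanged."""

    def delay(self, now):
        return 0.0


class FixedInterval(object):
    """Example policy: keep calls at least `seconds` apart."""

    def __init__(self, seconds):
        self.seconds = seconds
        self._last = None

    def delay(self, now):
        if self._last is None:
            wait = 0.0
        else:
            wait = max(0.0, self.seconds - (now - self._last))
        self._last = now + wait  # when this call will actually go out
        return wait


class ThrottledCaller(object):
    """Route every call through a lock and a pluggable policy."""

    def __init__(self, call, policy=None, clock=time.monotonic, sleep=time.sleep):
        self._call = call
        self._policy = policy or NoThrottle()
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()

    def __call__(self, *args, **kwargs):
        # Acquire the lock, sleep if the policy says so, then release it.
        with self._lock:
            wait = self._policy.delay(self._clock())
            if wait > 0:
                self._sleep(wait)
        return self._call(*args, **kwargs)
```

In a script this could wrap the existing entry point, e.g. `call = ThrottledCaller(client.call, FixedInterval(6))`, and then `call(posts.GetPosts(...))` everywhere instead of `client.call(...)`.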

gregturn avatar Jan 24 '16 22:01 gregturn