Implement throttling
My provider has stiff brute-force protections in place that this library easily bumps into. Is there any way to implement some built-in throttling? As a comparison, Pywikipediabot automatically slows things down.
Here is the type of policy I'm bumping into (which may be common across many providers):
- An IP gets blocked for ten minutes if it accesses wp-login.php or xmlrpc.php more than 20 times in one minute.
- If there are three such blocks in a three hour period, the IP is permanently blocked.
It appears that every call() opens a new connection to xmlrpc.php, and hence counts against the threshold. It would be great if the library could count calls and, once it hits 10, automatically sleep for a minute.
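A minimal sketch of that counting idea, assuming a hypothetical helper class (none of these names exist in the library):

```python
import time


class CallCounter:
    """Hypothetical helper: sleep once a call budget within a
    time window is exhausted, then start a fresh window."""

    def __init__(self, max_calls=10, window=60.0):
        self.max_calls = max_calls
        self.window = window
        self.count = 0
        self.window_start = time.time()

    def tick(self):
        """Record one call; sleep out the window if the budget is hit."""
        now = time.time()
        if now - self.window_start >= self.window:
            # The previous window has elapsed: reset the counter.
            self.window_start = now
            self.count = 0
        self.count += 1
        if self.count >= self.max_calls:
            # Budget exhausted: sleep through the rest of the window.
            time.sleep(self.window - (now - self.window_start))
            self.window_start = time.time()
            self.count = 0
```

The consumer would call `tick()` before each `client.call(...)`; with `max_calls=10, window=60.0` that keeps you under the 20-per-minute limit described above with some margin.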
Thoughts?
This could probably be implemented as a custom Transport implementation, but we should start by figuring out how it would work at a conceptual level.
From code that is consuming this library, what would the expected API or behavior be? That the whole thread gets put to sleep for minutes/hours after you hit a configurable throttle level? Or should it throw an exception and let the consumer figure out how it wants to wait (e.g., put a followup task in a message queue, or spawn another thread/process, or ...)?
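To make that tradeoff concrete, here is a sketch supporting both behaviors behind a flag — blocking sleep by default, or raising so the consumer can schedule its own retry. All names here are hypothetical, not existing library API:

```python
import time


class ThrottleLimitExceeded(Exception):
    """Raised instead of sleeping when raise_on_limit is set."""


class Throttle:
    """Hypothetical sliding-window throttle: at most max_calls
    within any window of `window` seconds."""

    def __init__(self, max_calls, window, raise_on_limit=False):
        self.max_calls = max_calls
        self.window = window
        self.raise_on_limit = raise_on_limit
        self.calls = []  # timestamps of recent calls

    def check(self):
        now = time.time()
        # Drop timestamps that have aged out of the window.
        self.calls = [t for t in self.calls if now - t < self.window]
        if len(self.calls) >= self.max_calls:
            wait = self.window - (now - self.calls[0])
            if self.raise_on_limit:
                # Let the consumer decide how to wait
                # (message queue, another thread/process, ...).
                raise ThrottleLimitExceeded(wait)
            time.sleep(wait)  # block the whole thread
            now = time.time()
            self.calls = [t for t in self.calls if now - t < self.window]
        self.calls.append(now)
```

Either way the counting logic is the same; the flag only decides who pays the waiting cost, the library or the consumer.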
Do you have links to documentation for pywikipediabot or other libraries that implement similar behavior?
This is what I wrote to implement a find-and-replace (slightly redacted):
#!/usr/bin/env python
import re
import time

from wordpress_xmlrpc import Client
from wordpress_xmlrpc.methods import posts


def process(client, post):
    # print("Analyzing %s..." % post)
    if '<old>' in post.content:
        print("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!")
        print("!!! '%s' contains the pattern you seek!" % post.title)
        update = re.sub('<old>', '<new>', post.content)
        print("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!")
        print("BEFORE: %s" % post.content)
        print("!!!!!!!!!!!!!!!!!!!!!!!!!!!")
        print("AFTER: %s" % update)
        print("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!")
        post.content = update
        if client.call(posts.EditPost(post.id, post)):
            print("!!!! SUCCESS!!!")
            print()
    else:
        print("___ '%s' does NOT contain the pattern you seek" % post.title)
        print()


client = Client('<website>/xmlrpc.php', 'username', 'password')

offset = 0
increment = 10
while True:
    # Fetch the next page of posts.
    articles = client.call(posts.GetPosts({'number': increment, 'offset': offset}))
    if len(articles) == 0:
        break  # no more posts returned
    for article in articles:
        process(client, article)
    offset += increment
    time.sleep(2)  # <--- turns out, this wasn't enough throttling!
A link to the up-to-date pywikibot (formerly pywikipediabot) throttle is here:
https://github.com/wikimedia/pywikibot-core/blob/master/pywikibot/throttle.py
Essentially, it encapsulates a lock. The lock is acquired, a sleep timer runs, and then the lock is released. I think routing any/all calls to ServerProxy through the throttle would let you plug in different strategies. Perhaps the default would be to NOT throttle at first (for backwards compatibility), but config options would let you plug in a policy, like counting and timing, to avoid exceeding an effective SLA.
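A rough sketch of that routing idea — every call goes through a lock, pywikibot-style (acquire, sleep as the policy dictates, call, release), with a no-op default policy for backwards compatibility. All class names here are hypothetical, not existing library API:

```python
import threading
import time


class NoThrottle:
    """Default policy: never wait (backwards compatible)."""

    def wait_time(self):
        return 0.0

    def record_call(self):
        pass


class FixedDelay:
    """Example pluggable policy: enforce a minimum gap between calls."""

    def __init__(self, delay):
        self.delay = delay
        self.last_call = 0.0

    def wait_time(self):
        return max(0.0, self.last_call + self.delay - time.time())

    def record_call(self):
        self.last_call = time.time()


class ThrottledCaller:
    """Serialize calls through a lock and defer the waiting
    decision to whatever policy object is plugged in."""

    def __init__(self, call, policy=None):
        self.call = call
        self.policy = policy or NoThrottle()
        self.lock = threading.Lock()

    def __call__(self, *args, **kwargs):
        with self.lock:
            time.sleep(self.policy.wait_time())
            self.policy.record_call()
            return self.call(*args, **kwargs)
```

A counting-and-timing policy like the provider's 20-per-minute rule would just be another policy class with the same two methods; the lock also makes the throttle safe if multiple threads share one client.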