
Delay between requests

Open cjohnsonuk opened this issue 7 years ago • 10 comments

Where a URL list contains many pages on the same site, would it be possible to add an arbitrary (possibly random) delay between consecutive requests? Our own web page has a software firewall that blocks addresses making more than, for example, 20 requests per minute. If my watch process takes 10 minutes rather than 10 seconds, that is not an issue, as long as it checks all the pages and doesn't get blocked. Something like PauseBetweenRequests: 10 (seconds) in the config file would allow checking sites that have countermeasures against scripted requests.
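
A hypothetical sketch of what that could look like in urlwatch.yaml (neither the key name nor the behaviour exists in urlwatch today; both are taken from the suggestion above):

# Hypothetical option, not an existing urlwatch setting
PauseBetweenRequests: 10   # seconds to wait between consecutive requests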

cjohnsonuk avatar Dec 06 '18 10:12 cjohnsonuk

This seems difficult to implement within the current structure, and possibly a bit out of scope.

I believe it's in urlwatch's design to finish its jobs as quickly as possible and leave long-term scheduling issues to the scheduler and other tools.

cfbao avatar Dec 14 '18 18:12 cfbao

How about a delay in the jobs? For example, set an interval for each job, and with each urlwatch run exclude all jobs that haven't yet reached their timeframe:

interval: 86400         # Check only after 24 hours
interval: 86400-172800  # Check after 24 hours, up to 48 hours

So if you run urlwatch multiple times, jobs that have already been checked would be excluded.

It would not fix the problem in general, but you could increase the number of urlwatch runs and decrease the number of requests per run.

nille02 avatar Apr 24 '19 12:04 nille02

How about a delay in the jobs? For example, set an interval for each job, and with each urlwatch run exclude all jobs that haven't yet reached their timeframe.

This is the same idea as #148, #171, and can be easily implemented once the cache database redesign #360 is merged.

However, this issue is different. To resolve this, we need to space out different jobs in the same run.

cfbao avatar Apr 24 '19 15:04 cfbao

But it would lessen the requests because you can spread them out over some time. I have a similar website that blocks requests after X requests per second. My current workaround is to split the urls file into files of around 40 to 80 jobs each, and each block is called with some time between them.

nille02 avatar Apr 24 '19 18:04 nille02

If you've already split your jobs into multiple urls files, then there's not really a need to set an interval in urlwatch. You can handle the interval from the scheduler, just like your workaround already does.
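
For illustration, a minimal scheduler-side sketch, assuming the jobs have been split into two urls files and staggered via cron (the file names, paths, and the 30-minute offset are just placeholders):

# crontab entries: run each urls file hourly, offset by 30 minutes
0  * * * *  urlwatch --urls ~/.config/urlwatch/urls-a.yaml
30 * * * *  urlwatch --urls ~/.config/urlwatch/urls-b.yaml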

I believe this issue is about spacing out requests even if you've put everything into one urls file.

cfbao avatar Apr 24 '19 18:04 cfbao

How about an additional option? I would make a copy of list_urls in command.py and, instead of printing, just run one job and call sleep() after each job. It's a small hack, but it would work.

nille02 avatar Apr 28 '19 13:04 nille02

@cjohnsonuk you can try an auto-match filter from the hooks (CustomRegexMatchUrlFilter) and add a sleep in the filter. With this there are at most 10 requests running at the same time.

import re
import random
import time
from urlwatch import filters

class CustomRegexMatchSleepUrlFilter(filters.RegexMatchFilter):
    # Similar to AutoMatchFilter: applied automatically to every job whose
    # URL matches the MATCH pattern.
    MATCH = {'url': re.compile(r'https?://(www\.)?example\.org/.*')}

    def filter(self, data):
        time.sleep(random.randint(10, 20))  # Sleep between 10 and 20 seconds
        return data  # Pass the data through unchanged

But be aware that this has a drawback: if the server responds with an HTTP 304 status code, the filter is not executed and the next job runs without a delay.

Edit: I tested this and it worked fine for me. But be aware of the caching issue I mentioned. I altered the regex a bit so it matches both http and https and an optional www. before the hostname.
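
For anyone wondering where this goes: a class like the one above is meant to live in urlwatch's hooks.py, which by default sits in the urlwatch configuration directory (e.g. ~/.config/urlwatch/hooks.py on Linux) and is loaded automatically. A minimal invocation sketch, with the paths being assumptions for a typical setup:

urlwatch --hooks ~/.config/urlwatch/hooks.py --urls ~/.config/urlwatch/urls.yaml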

nille02 avatar May 02 '19 12:05 nille02

Can someone please explain how nille02's code snippet posted on May 2, 2019 should be used? Where should it be added? How does it get called into execution? Many thanks.

thomas2net avatar Oct 14 '20 10:10 thomas2net

Until this is supported, you can do something like this as a workaround (note that jobs will still run in parallel, and potentially execute at the same time, it's just that from the start of the job to the actual retrieval there's some delay, in this case from 1 to 10 seconds):

name: "Example job"
command: |
  sleep $[ ( $RANDOM % 10 )  + 1 ]s
  curl --silent http://example.org

thp avatar Oct 14 '20 14:10 thp

Another hacky solution is to patch worker.py and handler.py like this:

diff --git a/lib/urlwatch/handler.py b/lib/urlwatch/handler.py
index 585d4b5..aa4258a 100644
--- a/lib/urlwatch/handler.py
+++ b/lib/urlwatch/handler.py
@@ -38,6 +38,7 @@ import os
 import shlex
 import subprocess
 import email.utils
+import random
 
 from .filters import FilterBase
 from .jobs import NotModifiedError
@@ -97,6 +98,7 @@ class JobState(object):
 
     def process(self):
         logger.info('Processing: %s', self.job)
+        time.sleep(random.randint(0, 10))
 
         if self.exception:
             return self
diff --git a/lib/urlwatch/worker.py b/lib/urlwatch/worker.py
index 70006dd..a0a0374 100644
--- a/lib/urlwatch/worker.py
+++ b/lib/urlwatch/worker.py
@@ -38,7 +38,7 @@ from .jobs import NotModifiedError
 
 logger = logging.getLogger(__name__)
 
-MAX_WORKERS = 10
+MAX_WORKERS = 1
 
 
 def run_parallel(func, items):
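
A minimal sketch of applying this, assuming the diff above is saved as delay.patch inside a checkout of the urlwatch source tree (the file name is arbitrary):

git apply delay.patch
pip install --user .   # reinstall the patched copy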

thp avatar Oct 14 '20 14:10 thp