Support for patching mode.
Support for a standalone patching mode, e.g. a /patch/ endpoint that records only content missing from the current collection, complementing /record/, which records everything.
Current work on the patch mode can be found on this branch: https://github.com/webrecorder/pywb/tree/patch-work
A requirement for patching is support for realtime deduplication via Redis, also available as a standalone PR here: #597
Current work for patching:
- [x] Dedup via redis
- [ ] API for managing redis based dedup
- [ ] Documentation for patching mode
- [ ] Tests
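
To make the difference between /record/ and /patch/ concrete, here is a minimal sketch of the patching decision. It is not pywb code: `should_record`, `patch`, and the in-memory set standing in for the Redis-backed index are hypothetical names used only for illustration, and lookups are by exact URL for simplicity.

```python
def should_record(url, collection_index):
    """Return True when the collection has no capture of `url` yet."""
    return url not in collection_index


def patch(urls, collection_index):
    """Record only the URLs missing from the collection (patch mode).

    `collection_index` is a plain set here; in a real setup this role
    would be played by the Redis-based index (assumption).
    """
    recorded = []
    for url in urls:
        if should_record(url, collection_index):
            # Hypothetical stand-in for writing a WARC record.
            collection_index.add(url)
            recorded.append(url)
    return recorded


if __name__ == "__main__":
    index = {"http://example.com/"}
    print(patch(["http://example.com/", "http://example.com/style.css"], index))
```

A record mode in the same sketch would simply skip the `should_record` check and write every URL.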
With my dedup patch I also use the Redis DB directly as the index. This way I don't need auto-index, which uses 100% CPU for a few seconds every time it runs. My config now looks like this:

    # pywb config file
    # ========================================
    #
    collections:
        all: $all
        live: $live
        rec_play:
            index:
                type: redis
                redis_url: 'redis://localhost/0/dedup:rec'
            archive_paths: './'

    # Settings for each collection
    use_js_obj_proxy: true

    # Memento support, enable
    enable_memento: false

    # Replay content in an iframe
    framed_replay: true

    proxy:
        coll: rec
        recording: true
        enable_content_rewrite: false

    recorder:
        source_coll: live
        rollover_size: 10737418240
        rollover_idle_secs: 86400
        filename_template: rec-{timestamp}.warc.gz
        source_filter: live
        dedup_index:
            type: redis
            dupe_policy: revisit
            redis_url: 'redis://localhost/0/dedup:{coll}'

    certificates:
        cert_reqs: 'CERT_REQUIRED'

Maybe a better way to configure this would be to allow recording into a collection that has a Redis index, and to remove redis_url (or dedup_index in your patch) from the recorder section.
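
The two redis_url values in the config above look different but appear to point at the same data. The sketch below shows why, assuming that the last path segment of the URL is the Redis key and that `{coll}` is filled in with the collection name. `split_redis_url` is a hypothetical helper written for this illustration, not a pywb function.

```python
def split_redis_url(redis_url, coll):
    """Split a 'redis://host/db/key' style URL into (connection URL, key).

    Assumption: everything after the last '/' is a key template in which
    '{coll}' stands for the collection name.
    """
    conn_url, key_template = redis_url.rsplit("/", 1)
    return conn_url, key_template.format(coll=coll)


if __name__ == "__main__":
    # Index URL from the collection and dedup URL from the recorder section.
    print(split_redis_url("redis://localhost/0/dedup:rec", "rec"))
    print(split_redis_url("redis://localhost/0/dedup:{coll}", "rec"))
```

Under that assumption both calls resolve to the key `dedup:rec` in database 0 when recording into `rec`, which is why the replay index and the dedup index can share one Redis key.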
@Lukey3332 I merged your PR, and I'm also building on it with the existing work from the patch-work branch. PR #611 simplifies the configuration, so all you need is:

    recorder:
        dedup_policy: revisit

The Redis URL is set by default. Also, the replay automatically checks the Redis index, so you don't need a custom index config in collections at all to replay from this index.
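
For readers unfamiliar with the `revisit` policy, here is a rough sketch of the idea: when a payload has already been captured, a small revisit record is written instead of a second full response. This is a simplification and not the pywb implementation; `choose_record_type` is a hypothetical name, a Python set stands in for the Redis index, and matching on the payload digest alone is an assumption made to keep the example short.

```python
import hashlib


def choose_record_type(payload, seen_digests, dedup_policy="revisit"):
    """Return (record_type, digest) for a captured payload.

    With dedup_policy 'revisit', a payload whose digest is already in
    `seen_digests` yields a 'revisit' record; anything else is stored
    as a full 'response' record and its digest is remembered.
    """
    digest = "sha1:" + hashlib.sha1(payload).hexdigest()
    if dedup_policy == "revisit" and digest in seen_digests:
        return "revisit", digest
    seen_digests.add(digest)
    return "response", digest


if __name__ == "__main__":
    seen = set()
    print(choose_record_type(b"<html>hello</html>", seen)[0])  # first capture
    print(choose_record_type(b"<html>hello</html>", seen)[0])  # same payload again
```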