
Support for patching mode.

Open ikreymer opened this issue 5 years ago • 2 comments

Support for a standalone patching mode, e.g. a /patch/ endpoint that records only content missing from the current collection, to complement /record/, which records everything.
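To make the difference between the two modes concrete, here is a minimal sketch of the decision each one makes. It is not pywb code: `should_record`, the `mode` strings, and the plain set of URLs standing in for the collection index are all hypothetical names chosen for illustration.

```python
# Illustrative sketch only: these names are hypothetical, not pywb's API.

def should_record(url, collection_urls, mode):
    """Decide whether a fetched URL gets written to the collection.

    mode "record": write every capture.
    mode "patch":  write only URLs the collection does not hold yet.
    """
    if mode == "record":
        return True
    if mode == "patch":
        return url not in collection_urls
    raise ValueError(f"unknown mode: {mode!r}")


existing = {"https://example.com/"}  # stand-in for the collection index
print(should_record("https://example.com/", existing, "record"))         # True
print(should_record("https://example.com/", existing, "patch"))          # False
print(should_record("https://example.com/new.css", existing, "patch"))   # True
```

A real implementation would have to answer the "is it already in the collection" question against a live index while recording, which is why the realtime dedup index matters.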

Current work on the patch mode can be found on this branch: https://github.com/webrecorder/pywb/tree/patch-work

A requirement for patching is support for realtime deduplication via Redis, also available as a standalone PR here: #597
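As a rough illustration of what realtime deduplication involves, the sketch below checks each capture against an index before writing it and falls back to a lightweight WARC revisit record when the same URL and payload were already captured. The `InMemoryDedupIndex` class is a hypothetical stand-in for the Redis-backed index, and `record_type` is an assumed helper, not the implementation in #597.

```python
import hashlib


class InMemoryDedupIndex:
    """Hypothetical stand-in for a Redis-backed index, keyed by (url, digest)."""

    def __init__(self):
        self._seen = {}

    def lookup(self, url, digest):
        # Timestamp of the first capture with this payload, or None.
        return self._seen.get((url, digest))

    def add(self, url, digest, timestamp):
        # Keep the earliest capture as the one later revisits point to.
        self._seen.setdefault((url, digest), timestamp)


def record_type(index, url, payload, timestamp):
    """Return (WARC record type to write, timestamp of the original capture)."""
    digest = hashlib.sha1(payload).hexdigest()
    original_ts = index.lookup(url, digest)
    if original_ts is not None:
        return "revisit", original_ts
    index.add(url, digest, timestamp)
    return "response", timestamp


index = InMemoryDedupIndex()
url = "https://example.com/"
print(record_type(index, url, b"hello", "20201215181200"))    # ('response', '20201215181200')
print(record_type(index, url, b"hello", "20201216170000"))    # ('revisit', '20201215181200')
print(record_type(index, url, b"changed", "20201216170000"))  # ('response', '20201216170000')
```

The lookup has to happen at write time, so the index must be updated as each record is written rather than by a periodic indexing pass.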

Current work for patching:

  • [x] Dedup via redis
  • [ ] API for managing redis based dedup
  • [ ] Documentation for patching mode
  • [ ] Tests

ikreymer avatar Dec 15 '20 18:12 ikreymer

With my dedup patch, I also use the Redis db directly as the index. This way I don't need auto-index, which uses 100% CPU for a few seconds every time it runs. My config now looks like this:

# pywb config file
# ========================================
#

collections:
  all: $all
  live: $live
  rec_play:
    index:
      type: redis
      redis_url: 'redis://localhost/0/dedup:rec'
    archive_paths: './'

# Settings for each collection
use_js_obj_proxy: true

# Memento support, enable
enable_memento: false

# Replay content in an iframe
framed_replay: true

proxy:
  coll: rec
  recording: true
  enable_content_rewrite: false

recorder:
  source_coll: live
  rollover_size: 10737418240
  rollover_idle_secs: 86400
  filename_template: rec-{timestamp}.warc.gz
  source_filter: live
  dedup_index:
    type: redis
    dupe_policy: revisit
    redis_url: 'redis://localhost/0/dedup:{coll}'

certificates:
  cert_reqs: 'CERT_REQUIRED'

Maybe a better way to configure this would be to allow recording into a collection that has a Redis index, and to remove redis_url (or dedup_index in your patch) from the recorder section.
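The config above works because the recorder and the rec_play collection end up on the same Redis key. A small sketch of that, assuming the last path segment of redis_url is the key and everything before it is the connection URL (this is how the config reads; `split_redis_url` is a hypothetical helper, not a pywb function):

```python
# Illustrative sketch only: split_redis_url is a hypothetical helper.

def split_redis_url(redis_url, coll):
    """Fill in the {coll} placeholder, then split off the Redis key."""
    filled = redis_url.format(coll=coll)
    conn_url, key = filled.rsplit("/", 1)
    return conn_url, key


# Recorder side: the template from recorder.dedup_index, recording into "rec".
print(split_redis_url("redis://localhost/0/dedup:{coll}", "rec"))
# ('redis://localhost/0', 'dedup:rec')

# Replay side: the fixed URL from collections.rec_play.index (no placeholder).
print(split_redis_url("redis://localhost/0/dedup:rec", "rec"))
# ('redis://localhost/0', 'dedup:rec')
```

Both sides resolve to db 0 and the key dedup:rec, so whatever the recorder writes is immediately visible to replay without a separate indexing step.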

lukasstraub2 avatar Dec 16 '20 17:12 lukasstraub2

@Lukey3332 I merged your PR, and I'm also adding to it based on the existing work from the patch-work branch. PR #611 simplifies the configuration, so all you need is:

recorder:
  dedup_policy: revisit

The Redis URL is set by default. Also, the replay automatically checks the Redis index, so you don't need a custom index config in collections at all to replay from this index.

ikreymer avatar Jan 27 '21 02:01 ikreymer