# Feature Request: Scheduling Archival from the UI
## Type
- [ ] General question or discussion
- [x] Propose a brand new feature
- [ ] Request modification of existing behavior or design
## What is the problem that your feature request solves?
Currently, scheduling ingestion of new URLs requires writing a cron job outside the web UI (and, in my case, outside the Docker container), which isn't ideal in a Docker/self-contained setup. This would be a nice convenience feature for users who want to manage the entire operation of ArchiveBox from within the web UI.
## Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes
This feature would add a way to set up scheduled pulls from various data sources via the web UI rather than only externally via cron. Specifically, I imagine at least a way to subscribe to an RSS feed and watch it for new content (something like Wallabag in my particular use case). Technically, I think this would involve a new menu/button in the UI and should dovetail with the internal scheduling processes already available.
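As a rough illustration of the feed-watching half: the poller mostly just needs to diff a feed's item links against what's already archived. Here's a minimal stdlib sketch (the function name and the `seen` set are my own invention, not any ArchiveBox API):

```python
import xml.etree.ElementTree as ET


def extract_new_links(rss_xml, seen):
    """Return item links from an RSS document that aren't in `seen`.

    `seen` stands in for the set of URLs already in the archive;
    how that set is obtained is an implementation detail left out here.
    """
    root = ET.fromstring(rss_xml)
    links = (item.findtext("link") for item in root.iter("item"))
    return [url for url in links if url and url not in seen]
```

A scheduled task could run this against each subscribed feed and hand any new links to the normal ingestion pipeline.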
## How badly do you want this new feature?
- [ ] It's an urgent deal-breaker, I can't live without it
- [x] It's important to add it in the near-mid term future
- [ ] It would be nice to have eventually
- [x] I'm willing to contribute dev time / money to fix this issue
- [x] I like ArchiveBox so far / would recommend it to a friend
- [ ] I've had a lot of difficulty getting ArchiveBox set up
Yeah, this is definitely on our minds. It probably won't be added for a couple of versions, but it's something I've been planning.
It's blocked by adding a background queue system like Huey or dramatiq: https://github.com/ArchiveBox/ArchiveBox/issues/91
In the meantime I recommend using docker-compose instead of docker alone, as it lets you declaratively define your scheduled imports all in one place (see the commented-out section in docker-compose.yml for an example of how to do that).
Gotcha, I saw the future queuing system and that makes sense! And yes, currently using compose, so I'll look into doing that. Thanks!
Here's my proposed implementation of a new model to track scheduled imports: https://github.com/ArchiveBox/ArchiveBox/pull/707/files
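For a rough sense of the shape of such a model, here's a plain-Python sketch of the fields a scheduled-import record might track. All field names here are my illustrative assumptions, not the actual schema in the PR:

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class ScheduledImport:
    """Hypothetical sketch of a record tracking one scheduled import.

    In the real implementation this would be a Django model; a dataclass
    is used here only to illustrate the fields, which are assumptions.
    """
    url: str                         # feed or page to pull from
    schedule: str = "@daily"         # cron-style or interval expression
    depth: int = 0                   # how many hops to archive from each URL
    enabled: bool = True
    created_at: datetime = field(default_factory=datetime.utcnow)
    last_run_at: Optional[datetime] = None
```

The scheduler would iterate over enabled records, fire any whose schedule is due, and stamp `last_run_at` afterwards.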
Remaining TODOs:
- figure out which python scheduler to use
  - huey + django-huey-monitor (my current favorite)
  - celery (ugh...)
  - APScheduler (will require lots of manual models and concurrency-control code)
  - yacron (not sure if it can be configured dynamically)
  - dramatiq (doesn't support sqlite)
- decide whether to continue supporting system crontab at all, or tear it out (imo we should just tear it out and move to using an internal scheduler)
- fork the scheduled task worker off the server process automatically on startup, so there's no need to run a separate `archivebox schedule --foreground` process manually
- figure out how to enforce "at least once" or "at most once" concurrency model for scheduled tasks
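On the last point: one simple way to get "at most once" semantics on a single host is an advisory file lock held for the duration of the task. A minimal sketch (the lock-file approach and path are my own suggestion, not anything ArchiveBox-specific, and `fcntl` limits this to POSIX systems):

```python
import fcntl


def try_acquire(lock_path):
    """Try to become the sole runner of a scheduled task on this host.

    Returns the open lock file on success (keep it open for the task's
    duration; closing it releases the lock), or None if another worker
    already holds the lock, in which case this run should be skipped.
    """
    f = open(lock_path, "w")
    try:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return f
    except OSError:
        f.close()
        return None
```

An advisory lock like this is also crash-safe: the kernel drops it when the holding process exits, so a worker that dies mid-task never wedges the schedule. It doesn't cover multi-host deployments, which is where a queue like huey would take over.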
Follow that PR for more updates as work progresses. https://github.com/ArchiveBox/ArchiveBox/pull/707
See this thread here for my WIP design that moves us towards a message-passing / async job worker structure internally: https://github.com/ArchiveBox/ArchiveBox/issues/91#issuecomment-871343428