# Feature Request: Scheduling Archival from the UI
## Type
- [ ] General question or discussion
- [x] Propose a brand new feature
- [ ] Request modification of existing behavior or design
## What is the problem that your feature request solves?
Currently, scheduling ingestion of new URLs requires writing a cron job outside the web UI (and, in my case, outside the Docker container), which isn't ideal in a Docker/self-contained setup. This would be a nice convenience feature for users who want to manage the entire operation of ArchiveBox from within the web UI.
## Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes
This feature would add a way to set up scheduled pulls from various data sources via the web UI rather than only externally via cron. Specifically, I imagine at least a way to subscribe to an RSS feed and watch it for new content (something like Wallabag in my particular use case). Technically, I think this would involve a new menu/button in the UI and should dovetail with the internal scheduling processes already available.
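As a rough illustration of the feed-watching half: the poller mostly just needs to diff a feed's item links against what's already archived. Here's a minimal stdlib sketch (the function name and the `seen` set are my own invention, not any ArchiveBox API):

```python
import xml.etree.ElementTree as ET


def extract_new_links(rss_xml, seen):
    """Return item links from an RSS document that aren't in `seen`.

    `seen` stands in for the set of URLs already in the archive;
    how that set is obtained is an implementation detail left out here.
    """
    root = ET.fromstring(rss_xml)
    links = (item.findtext("link") for item in root.iter("item"))
    return [url for url in links if url and url not in seen]
```

A scheduled task could run this against each subscribed feed and hand any new links to the normal ingestion pipeline.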
## How badly do you want this new feature?
- [ ] It's an urgent deal-breaker, I can't live without it
- [x] It's important to add it in the near-mid term future
- [ ] It would be nice to have eventually
- [x] I'm willing to contribute dev time / money to fix this issue
- [x] I like ArchiveBox so far / would recommend it to a friend
- [ ] I've had a lot of difficulty getting ArchiveBox set up
Yeah, this is definitely on our minds. It probably won't be added for a couple of versions, but it's something I've been planning.
It's blocked by adding a background queue system like Huey or dramatiq: https://github.com/ArchiveBox/ArchiveBox/issues/91
In the meantime I recommend using docker-compose instead of docker alone, as it lets you declaratively define your scheduled imports all in one place (see the commented-out section in docker-compose.yml for an example of how to do that).
Gotcha, I saw the future queuing system and that makes sense! And yes, currently using compose, so I'll look into doing that. Thanks!
Here's my proposed implementation of a new model to track scheduled imports: https://github.com/ArchiveBox/ArchiveBox/pull/707/files
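For a rough sense of the shape of such a model, here's a plain-Python sketch of the fields a scheduled-import record might track. All field names here are my illustrative assumptions, not the actual schema in the PR:

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class ScheduledImport:
    """Hypothetical sketch of a record tracking one scheduled import.

    In the real implementation this would be a Django model; a dataclass
    is used here only to illustrate the fields, which are assumptions.
    """
    url: str                         # feed or page to pull from
    schedule: str = "@daily"         # cron-style or interval expression
    depth: int = 0                   # how many hops to archive from each URL
    enabled: bool = True
    created_at: datetime = field(default_factory=datetime.utcnow)
    last_run_at: Optional[datetime] = None
```

The scheduler would iterate over enabled records, fire any whose schedule is due, and stamp `last_run_at` afterwards.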
Remaining TODOs:
- figure out which python scheduler to use
  - huey + django-huey-monitor (my current favorite)
  - celery (ugh...)
  - APScheduler (will require lots of manual models and concurrency-control code)
  - yacron (not sure if it can be configured dynamically)
  - dramatiq (doesn't support sqlite)
- decide whether to continue supporting system crontab at all, or tear it out (imo we should just tear it out and move to using an internal scheduler)
- fork the scheduled task worker off the server process automatically on startup, so there's no need to run a separate `archivebox schedule --foreground` process manually
- figure out how to enforce "at least once" or "at most once" concurrency model for scheduled tasks
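On the last point: one simple way to get "at most once" semantics on a single host is an advisory file lock held for the duration of the task. A minimal sketch (the lock-file approach and path are my own suggestion, not anything ArchiveBox-specific, and `fcntl` limits this to POSIX systems):

```python
import fcntl


def try_acquire(lock_path):
    """Try to become the sole runner of a scheduled task on this host.

    Returns the open lock file on success (keep it open for the task's
    duration; closing it releases the lock), or None if another worker
    already holds the lock, in which case this run should be skipped.
    """
    f = open(lock_path, "w")
    try:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return f
    except OSError:
        f.close()
        return None
```

An advisory lock like this is also crash-safe: the kernel drops it when the holding process exits, so a worker that dies mid-task never wedges the schedule. It doesn't cover multi-host deployments, which is where a queue like huey would take over.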
Follow that PR for more updates as work progresses. https://github.com/ArchiveBox/ArchiveBox/pull/707
See this thread here for my WIP design that moves us towards a message-passing / async job worker structure internally: https://github.com/ArchiveBox/ArchiveBox/issues/91#issuecomment-871343428