Decreasing size of database
We are using SQLite, and the size of the database has grown to about 4.7GB.
Looking at the structure, given the key/value-with-serial design, would it be possible to run an external process that removes changelog entries for serials that no longer exist?
Or would it somehow be possible to trim the amount of changelog entries?
Yes, but ...
- I haven't tested it at all
- you can't replicate anymore (it might be possible to work around that with some code changes)
- it might be better to do an export/import cycle: it can be tested up front, it is the official way, and while there will be downtime, it can be minimized
If you decide to remove old data, I would first try emptying the binary data cell instead of deleting the whole db row; a rough sketch of that idea is below.
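Something along these lines (completely untested; the table and column names are my assumption about the internal sqlite schema, so verify them against your file, stop the server first, and only ever run this on a backup copy):

```python
# Completely untested sketch: empty the binary payload of old changelog
# rows in devpi-server's sqlite file, then reclaim the space with VACUUM.
# The table/column names ("changelog", "serial", "data") are assumptions
# about the internal schema -- verify them against your .sqlite file,
# stop the server first, and only run this on a backup copy.
import sqlite3

DB_PATH = "/path/to/serverdir/.sqlite"  # adjust to your server directory
KEEP_LAST = 10_000                      # keep the most recent N serials intact

con = sqlite3.connect(DB_PATH)
try:
    (max_serial,) = con.execute("SELECT MAX(serial) FROM changelog").fetchone()
    if max_serial is not None:
        # Overwrite the blob with an empty value instead of deleting the
        # row, so the serial sequence itself stays intact.
        con.execute("UPDATE changelog SET data = x'' WHERE serial < ?",
                    (max_serial - KEEP_LAST,))
        con.commit()
    # VACUUM rewrites the whole file; this is what shrinks it on disk.
    con.execute("VACUUM")
finally:
    con.close()
```

Note that the VACUUM at the end is what actually shrinks the file on disk; without it sqlite only marks the freed pages as reusable.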
- I will make a backup
- We currently don't use replication and have no plans to do so in the future, but good to know
- Will take that under advisement
Thanks!
Any results?
Nope, I've just been letting it grow instead. I haven't had time to work through it, and it's an integral part of our build system that is in use pretty much 24/7.
https://mail.python.org/mm3/archives/list/[email protected]/thread/VLFAGR4PS3TVFCU6EM4B5RB3NCH2GSVT/
With server 4.7.0 deleting from a mirror is supported. This only reduces the usage for mirrored files though, not the database size.
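For reference, removing from a mirror index would look something like this with devpi-client (a hedged example; the index URL and package name are placeholders for your setup):

```
devpi use http://localhost:3141/root/pypi
devpi login root
devpi remove example-package==1.0.0
```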
Depending on your use case of devpi this PR might help: #674
It adds mirror_use_external_urls, which removes the file caching by serving the original URL directly. It will still store each installed project and any changes for now. I might expand on that PR later, not sure yet.
In regard to database size, I have plans to change the structure to be actually relational but still with serials. That is a bigger refactoring though and there is no funding for it (yet).
A vacuum tool might still be easier in the short term.
With the PostgreSQL backend, files are also stored in the database. At least here, it would be quite useful to have an option to remove cached files after they have not been downloaded for x days. Usually users are only interested in the latest versions of a package. For packages that often receive updates, the cache grows and grows, containing a lot of files that are no longer used.
@rettenbs we can't trivially store last access times, so automatic cache cleanup is currently not possible. Do you need the local cache? If not, set mirror_use_external_urls on the mirror index to true (example below).
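Setting that option with devpi-client would look something like this (the index name is a placeholder for your mirror index):

```
devpi index root/pypi mirror_use_external_urls=True
```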
Yes, we need the cache. Some of our clients do not have internet access.
For a standalone server it should not be too difficult to track the last access. For replicas it is probably not easy. Maybe each server could store its own last-access times and clean the cache based on them (a rough sketch of this idea is below). It is not ideal, but as long as it is configurable it is a reasonable tradeoff between storage and bandwidth.
On the other hand I am not sure what the common use case for devpi is and whether this is useful for anybody else.
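To make the idea concrete, here is a purely hypothetical sketch of such a per-server cleanup; devpi-server has no such hook today, and all names and paths in it are made up for illustration:

```python
# Purely hypothetical sketch (devpi has no such hook today): track
# last-access times per server in a sidecar sqlite db, then prune cached
# release files that nobody has requested for MAX_AGE_DAYS. All names
# and paths here are made up for illustration.
import os
import sqlite3
import time

ACCESS_DB = "/path/to/serverdir/access-times.sqlite"  # one per server
MAX_AGE_DAYS = 90

def _ensure_table(con: sqlite3.Connection) -> None:
    con.execute("CREATE TABLE IF NOT EXISTS access"
                " (path TEXT PRIMARY KEY, last REAL)")

def record_access(path: str) -> None:
    # Would be called whenever a cached release file is served.
    with sqlite3.connect(ACCESS_DB) as con:
        _ensure_table(con)
        con.execute("INSERT OR REPLACE INTO access (path, last) VALUES (?, ?)",
                    (path, time.time()))

def prune_stale_files() -> None:
    # Delete files whose last recorded access is older than the cutoff.
    cutoff = time.time() - MAX_AGE_DAYS * 86400
    with sqlite3.connect(ACCESS_DB) as con:
        _ensure_table(con)
        stale = [row[0] for row in
                 con.execute("SELECT path FROM access WHERE last < ?",
                             (cutoff,))]
        for path in stale:
            if os.path.exists(path):
                os.remove(path)
        con.execute("DELETE FROM access WHERE last < ?", (cutoff,))
```

With the PostgreSQL backend the files live inside the database, so the deletion step would have to be a database operation rather than os.remove, but the bookkeeping idea stays the same.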