pkgcache

Tools to prune cache

Open hadley opened this issue 7 years ago • 9 comments

I don't think we need it in this version, but I think the next version should have some way to automatically prune the cache to keep it below a user-specified threshold (maybe controlled via an environment variable, and set to, say, 5 GB by default?)

hadley, Sep 15 '18 16:09

Good point. Agreed. I did not include it, because it requires some thinking....

gaborcsardi, Sep 15 '18 16:09

@wch just thought this through for shiny caching, so is likely to have good ideas.

hadley, Sep 16 '18 12:09

One difficulty is that we would probably need to store the last-access timestamp, since we definitely don't want to remove packages that were used recently. For this we would need to lock the cache with an exclusive lock, which is not ideal.

gaborcsardi, Sep 16 '18 13:09

Do you think it's too unreliable to use the file system last access time?

hadley, Sep 16 '18 13:09

Yeah, we can try that as well, but it is still not always reliable. In particular, AFAICT it is usually not available in Docker containers. I think it is also not always enabled on Windows.

But we can figure something out, probably, e.g. have a separate lock for the access times.

gaborcsardi, Sep 16 '18 13:09

Although maybe you don't want to prune in Docker containers anyway. But Windows is still an issue, and in general it is just too platform-dependent to rely on.

gaborcsardi, Sep 16 '18 13:09

The atime attribute can't be relied on in general. In Linux, it's not unusual to mount a filesystem with noatime. On some filesystems, the time resolution is poor (for FAT, the time resolution is one day, and for HFS+, I think it's ). On NTFS in Windows, atime has a 100 ns resolution, but it is updated only once per hour.

I ran a bunch of tests on mtime, ctime, and atime here: https://gist.github.com/wch/9bc615c70219c7ac15f7b339ddd7a30d

The solution I ended up using was to use mtime, which seems to work reliably across platforms, and call Sys.setFileTime() each time I accessed the file: https://github.com/rstudio/shiny/blob/8c9ce19/R/cache-disk.R#L290

Note that Sys.setFileTime() apparently updates mtime, ctime, and atime on some platforms, while on others (Windows with NTFS in my testing) it updates only mtime.
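For illustration, the "touch on access" idea could be sketched roughly as follows. This is not the code from the linked shiny source; `cache_read()` is a hypothetical helper name, and entries are assumed to be stored as `.rds` files.

```r
# Hypothetical helper: read a cached file and mark it as recently used.
cache_read <- function(path) {
  if (!file.exists(path)) return(NULL)   # cache miss
  # Bump mtime to "now", so that mtime doubles as a last-access time.
  Sys.setFileTime(path, Sys.time())
  # Another process may prune the file between the check and the read;
  # treat that as a miss as well.
  tryCatch(suppressWarnings(readRDS(path)), error = function(e) NULL)
}

# Usage: write an entry, backdate it by an hour, then read it back.
dir <- tempfile("cache-")
dir.create(dir)
path <- file.path(dir, "entry.rds")
entry <- list(name = "example", version = "1.0.0")
saveRDS(entry, path)
Sys.setFileTime(path, Sys.time() - 3600)
old_mtime <- file.mtime(path)
value <- cache_read(path)
new_mtime <- file.mtime(path)            # later than old_mtime after the read
```

Because only mtime is consulted later, it does not matter whether the platform also updates ctime or atime.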

The disk caching and pruning code in the link above is designed to work when multiple processes are using the same directory to store objects, so no locking is required (there are some potential races, but all are recoverable, since it's just a cache). All the relevant state for the objects (name, time, size, and the content) is stored on the filesystem, so you can stop an R process that uses the directory for a cache, then start another one and point it to the directory, and it will continue to work fine.
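As a rough sketch of lock-free pruning along these lines (again, not the shiny implementation), the function below keeps the most recently used files that fit under a size limit and deletes the rest. The names `cache_prune()`, `max_cache_size()` and the `CACHE_MAX_BYTES` environment variable are all hypothetical; the 5 GB default simply mirrors the figure suggested at the top of this thread.

```r
# Hypothetical: size limit in bytes, from an environment variable if set.
max_cache_size <- function() {
  value <- Sys.getenv("CACHE_MAX_BYTES", unset = NA)
  if (is.na(value)) 5 * 1024^3 else as.numeric(value)
}

# Hypothetical helper: delete the least recently used files until the
# cache directory holds at most `max_size` bytes.
cache_prune <- function(dir, max_size = max_cache_size()) {
  files <- list.files(dir, full.names = TRUE, recursive = TRUE)
  info <- file.info(files, extra_cols = FALSE)
  # Files can vanish while we scan (another process pruning); skip those.
  info <- info[!is.na(info$size) & !info$isdir, , drop = FALSE]
  # Newest first, so the running total shows which files still fit.
  info <- info[order(info$mtime, decreasing = TRUE), , drop = FALSE]
  evict <- rownames(info)[cumsum(info$size) > max_size]
  # unlink() does not raise an error if a file is already gone.
  unlink(evict)
  invisible(evict)
}

# Usage: three 1000-byte files, with pkg1.bin the least recently used.
dir <- tempfile("cache-")
dir.create(dir)
for (i in 1:3) {
  path <- file.path(dir, paste0("pkg", i, ".bin"))
  writeBin(raw(1000), path)
  Sys.setFileTime(path, Sys.time() - 3600 * (4 - i))
}
removed <- cache_prune(dir, max_size = 2500)
```

All state comes from the directory listing on each call, so two processes pruning at once can at worst try to delete the same file, which is harmless for a cache.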

wch, Sep 17 '18 04:09

@wch Thanks!

gaborcsardi, Sep 17 '18 07:09

Note: I am going to postpone this until we have a database backend, to avoid having to rewrite it then.

gaborcsardi, Jun 14 '23 13:06