Tools to prune cache
I don't think we need it in this version, but I think the next version should have some way to automatically prune the cache to keep it below a user-specified threshold (maybe controlled via an environment variable, and set to, say, 5 GB by default?).
Good point, agreed. I did not include it because it requires some thinking.
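To make the proposal concrete, here is a rough sketch of how such a threshold could be read. It is written in Python purely for illustration. The variable name `PKG_CACHE_MAX_BYTES`, the function name, and the choice of binary units for the 5 GB default are all assumptions, not an agreed design.

```python
import os

# Default suggested in the thread; using 1024**3 bytes per GB is an assumption.
DEFAULT_MAX_BYTES = 5 * 1024**3


def cache_size_limit(environ=None):
    """Return the cache size limit in bytes, falling back to the default."""
    env = os.environ if environ is None else environ
    # Hypothetical variable name, used here only for illustration.
    raw = env.get("PKG_CACHE_MAX_BYTES", "")
    try:
        value = int(raw)
    except ValueError:
        return DEFAULT_MAX_BYTES  # unset or not an integer
    # Treat zero and negative values as invalid rather than "no cache".
    return value if value > 0 else DEFAULT_MAX_BYTES
```

Passing the environment as an argument keeps the function easy to test without touching the real process environment.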
@wch just thought this through for shiny caching, so is likely to have good ideas.
One difficulty is that we would probably need to store the last-access timestamp, since we definitely don't want to remove packages that were used recently. For this we would need to take an exclusive lock on the cache, which is not ideal.
Do you think it's too unreliable to use the file system last access time?
Yeah, we can try that as well, but it is still not always reliable. In particular, AFAICT it is usually not available in Docker containers. I think it is also not always enabled on Windows.
But we can figure something out, probably, e.g. have a separate lock for the access times.
Although maybe you don't want to prune in Docker containers anyway. But Windows is still an issue, and in general it is just too platform-dependent to rely on.
The atime attribute can't be relied on in general. In Linux, it's not unusual to mount a filesystem with noatime. On some filesystems, the time resolution is poor (for FAT, the time resolution is one day, and for HFS+, I think it's ). On NTFS in Windows, atime has a 100 ns resolution, but it is updated only once per hour.
I ran a bunch of tests on mtime, ctime, and atime here: https://gist.github.com/wch/9bc615c70219c7ac15f7b339ddd7a30d
The solution I ended up using was to use mtime, which seems to work reliably across platforms, and call Sys.setFileTime() each time I accessed the file: https://github.com/rstudio/shiny/blob/8c9ce19/R/cache-disk.R#L290
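For readers who want the idea without opening the link, here is a minimal sketch of the "touch mtime on every access" approach. It is written in Python for illustration, with `os.utime` playing the role that `Sys.setFileTime()` plays in R. The function name `read_cached` and the `now` parameter are hypothetical, and this is not the shiny implementation.

```python
import os
import time


def read_cached(path, now=None):
    """Read a cached file and mark it as recently used by bumping its mtime."""
    with open(path, "rb") as f:
        data = f.read()
    # Record the access explicitly instead of trusting the filesystem's atime.
    stamp = time.time() if now is None else now
    os.utime(path, (stamp, stamp))  # arguments are (atime, mtime)
    return data
```

A pruner can then treat mtime as "last used" and delete the files with the oldest mtime first.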
Note that Sys.setFileTime() apparently updates mtime, ctime, and atime on some platforms, while on others (Windows with NTFS in my testing) it only updates mtime.
The disk caching and pruning code in the link above is designed to work when multiple processes are using the same directory to store objects, so no locking is required (there are some potential races, but all are recoverable, since it's just a cache). All the relevant state for the objects (name, time, size, and the content) is stored on the filesystem, so you can stop an R process that uses the directory for a cache, then start another one and point it to the directory, and it will continue to work fine.
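As a rough illustration of lock-free pruning where all state lives on the filesystem, here is a Python sketch that deletes the least recently used files until the directory fits under a size limit. It assumes a flat cache directory and that mtime is bumped on each access. The name `prune_cache` is hypothetical, and the linked shiny code may differ in its details.

```python
import os
import stat


def prune_cache(cache_dir, max_bytes):
    """Remove oldest-mtime files until the total size is <= max_bytes.

    Returns the list of paths this call removed.
    """
    entries = []
    for name in os.listdir(cache_dir):
        path = os.path.join(cache_dir, name)
        try:
            st = os.stat(path)
        except FileNotFoundError:
            continue  # another process removed it first; recoverable
        if not stat.S_ISREG(st.st_mode):
            continue  # skip subdirectories and other non-files
        entries.append((st.st_mtime, st.st_size, path))

    total = sum(size for _, size, _ in entries)
    removed = []
    for _, size, path in sorted(entries):  # oldest mtime first
        if total <= max_bytes:
            break
        try:
            os.remove(path)
            removed.append(path)
        except FileNotFoundError:
            pass  # lost a race with another pruner; the space is freed anyway
        total -= size
    return removed
```

The races are benign because the worst outcomes are a file that another process already deleted, or a recently touched file being removed and fetched again later.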
@wch Thanks!
Note: I am going to postpone this until we have a database backend, to avoid having to rewrite it then.