Consider changing the default number of cores used
Dear pak developers. Thank you for this cool project!
I just want to bring up a little pet peeve I often find with how software handles multi-core utilization.
Spending a lot of time in shared environments makes me a bit wary of tools where utilizing all cores on the machine is the opt-out behavior. I can't even count the number of times that caused issues for either me or my colleagues.
One might argue that defining a global Ncpus value is good practice in such environments in the first place. But in R I have almost never needed it (most packages treat parallelization as opt-in), and it remains a rake many people might step on, especially since, as far as I can tell, this behavior is explained only on the website and just a bit deeper than a first-time user would want to dig.
Would you consider changing the default behavior to NOT run in parallel unless Ncpus is explicitly set?
Considering that after many years this is the first request for this, I think leaving it as opt-out is the right decision.
All machines are multicore nowadays, and 99.99% of the people are not using supercomputers.
I see your point. Just thought I'd bring it up. Feel free to close it if it's not planned.
As for the lack of requests over many years: I don't know that most people in the ecosystem are even aware of pak yet. It feels like something that could become much more ubiquitous in the future, so the "demographic" could shift a bit as well.
P.S.: Love the renv compatibility. Its lack was what stopped me from trying pak before. I hope the smaller issues can get ironed out.
@gaborcsardi Note that using all cores is bad practice: it can freeze the entire computer and prevent moving the mouse, responding to key presses, or even updating the desktop clock. I have run into these limitations myself.
The better way is to leave at least 1 or 2 cores available for system tasks.
Maybe consider giving this a low priority if you want, but closing it is not ideal.
Just got bitten by this again. It took way too long to find `options(Ncpus = 8)`; installing packages with 256 processes on a shared system is not very reasonable 😅. Please consider Henrik's take on the topic: https://www.jottr.org/2022/12/05/avoid-detectcores/
How do you suggest we detect a shared system?
I suggest we don’t try to “detect shared systems” inside pak — that’s ultimately the user’s responsibility.
Pak is a bit of a special case: it is usually invoked directly by the user (often outside an HPC batch script) and aims to minimize dependencies. In this context, pulling in parallelly::availableCores() feels too heavy, while defaulting to a single worker would be unnecessarily restrictive. A more balanced middle ground seems more appropriate here, rather than strictly following the futureverse convention of making parallelism fully opt-in. Nevertheless, two concrete proposals:
1. Make it discoverable
The key issue in this thread is that it’s hard to discover how to set the number of workers. A trivial fix is to add it to the example on the landing page and in ?pak::pkg_install like this:
```r
options(Ncpus = 8)          # define the number of workers
pak::pkg_install("tibble")  # install with dependencies, in parallel
```
Optionally, expose a num_workers argument to pkg_install(), defaulting to getOption("Ncpus"). That would make the knob much more obvious to first-time users.
2. Pick a more reasonable default
Using ps::ps_cpu_count(logical = TRUE) as default is aggressive, even on a laptop. A softer heuristic like
```r
floor(1.5 * sqrt(ps::ps_cpu_count(logical = TRUE)))
```
scales fine on desktops while avoiding pathological behavior on 128–256 core servers.
If package installations themselves are parallelized (as they are with install.packages() via make -j $Ncpus, and as they should be since in practice a few heavy packages like hdf5r dominate installation time), then such a default prevents double parallelization from clogging the system.
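A minimal sketch of that heuristic as a helper function (assuming pak's existing `ps` dependency; `default_workers` is a hypothetical name, not an existing pak function):

```r
# Hypothetical default: grow the worker count sublinearly with the core count,
# so laptops still parallelize but 256-core servers are not saturated.
default_workers <- function(n_cores = ps::ps_cpu_count(logical = TRUE)) {
  max(1L, floor(1.5 * sqrt(n_cores)))
}

default_workers(4)    # 3
default_workers(16)   # 6
default_workers(256)  # 24
```

So a 16-core laptop would still get 6 parallel workers, while a 256-core shared server would get 24 rather than 256.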
99.99% of the people are not using supercomputers.
Maybe — but 99.99% of paying Posit customers are ;)
For illustration, here’s what parallelly::availableCores(which = "all") returns on different systems I use:
- HPC login node: `system=256, cgroups.cpuset=256, cgroups.cpuquota=26`
- RStudio Server: `system=128, cgroups.cpuset=128, nproc=128, BiocParallel=8`
- Laptop: `system=16`
- HPC compute job: `system=128`, but `nproc=1, LSF=1`
So while pak probably shouldn’t depend on parallelly, copying that function would give much safer defaults.
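For illustration, here is a rough sketch (not parallelly's actual implementation) of the cgroup v2 quota check such a copied helper would need; the file path and format are Linux-specific assumptions:

```r
# Hedged sketch: read the cgroup v2 CPU quota the way parallelly-style
# helpers do. Returns NA when no quota applies (or on non-Linux systems).
cgroup_cpu_quota <- function(path = "/sys/fs/cgroup/cpu.max") {
  if (!file.exists(path)) return(NA_integer_)
  # File contains "<quota> <period>" in microseconds, or "max <period>".
  parts <- strsplit(readLines(path, n = 1L), " ", fixed = TRUE)[[1]]
  if (parts[[1]] == "max") return(NA_integer_)  # unlimited
  as.integer(ceiling(as.numeric(parts[[1]]) / as.numeric(parts[[2]])))
}
```

On the login node above, a check like this is what would recover the 26-core quota that `availableCores()` reports, instead of the raw 256.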
PS: Consider a dedicated pak option that falls back to Ncpus only when it is unset.
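That fallback chain could look like the following (`pak.num_workers` is an assumed option name, not an existing pak option):

```r
# Hypothetical resolution order: dedicated pak option, then Ncpus, then 1.
resolve_workers <- function() {
  getOption("pak.num_workers", default = getOption("Ncpus", default = 1L))
}
```

This keeps existing `options(Ncpus = ...)` setups working while letting users tune pak independently of `install.packages()`.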