scr
SCR caches checkpoint data in storage on the compute nodes of a Linux cluster to provide a fast, scalable checkpoint / restart capability for MPI codes.
SCR can direct the application to write dataset files to subdirectories within a cache directory. SCR also stores its redundancy data in these subdirectories. **Question**: Should it be considered an...
The SCR library communicates with the nnf-dm server via a socket file name. We need to create a settable `SCR_CONFIG_AXL_NNFDM`/`AXL_KEY_CONFIG_NNFDM` parameter that can be provided via the environment and/or an SCR_Config call.
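A minimal sketch of how such a setting might be resolved, written in Python for illustration only. The helper `resolve_nnfdm_socket` and the dictionary standing in for values set through an SCR_Config call are hypothetical, and the precedence shown (environment first, then the programmatic value) is an assumption rather than confirmed SCR behavior.

```python
import os

# Parameter name proposed in this issue; it may not exist in SCR yet.
PARAM = "SCR_CONFIG_AXL_NNFDM"


def resolve_nnfdm_socket(config=None, environ=None):
    """Return the nnf-dm socket file name, or None if it is not set.

    `config` is a plain dict standing in for values supplied through an
    SCR_Config call (hypothetical stand-in, not the real SCR API).
    `environ` defaults to the process environment.
    Assumed precedence: environment variable first, then `config`.
    """
    environ = os.environ if environ is None else environ
    config = {} if config is None else config

    value = environ.get(PARAM)
    if value:
        return value
    return config.get(PARAM)


if __name__ == "__main__":
    # Example: the path below is made up for illustration.
    programmatic = {PARAM: "/tmp/example-nnf-dm.sock"}
    print(resolve_nnfdm_socket(programmatic))
```

Keeping the lookup in one place would let the AXL key (`AXL_KEY_CONFIG_NNFDM`) be populated from a single resolved value, whichever source provided it.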
We should consider https://exaworks.org/psi-j-python/, a portable submission interface for job schedulers written in Python. It may be helpful in SCR.
Work to address #489
SCR is getting large enough that importing is getting cumbersome. For example, updating to v0.3.0 just required this entire file: https://github.com/LLNL/axom/blob/develop/src%2Fcmake%2Fthirdparty%2FFindSCR.cmake As well as a lot of extra directories in...
The scavenge operation assumes files are cached in node local storage. After the run stops, the scavenge script launches the scr_copy executable on each compute node via pdsh. On each...
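The launch step described above can be sketched as follows. This only builds the `pdsh` command line (using pdsh's `-w` option for a comma-separated host list) and does not run it. The function name `build_scavenge_command` is hypothetical, and the arguments actually passed to `scr_copy` are not shown in this summary, so they are left as a caller-supplied placeholder.

```python
import shlex


def build_scavenge_command(nodes, copy_exe="scr_copy", copy_args=()):
    """Build an argv list that runs `copy_exe` on every node via pdsh.

    `nodes` is a list of compute node host names.
    `copy_args` is a placeholder for whatever arguments scr_copy needs;
    none are assumed here.
    """
    if not nodes:
        raise ValueError("at least one compute node is required")
    # pdsh -w takes a comma-separated list of target hosts.
    return ["pdsh", "-w", ",".join(nodes), copy_exe, *copy_args]


if __name__ == "__main__":
    # Host names are made up for illustration.
    cmd = build_scavenge_command(["node1", "node2"])
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell quoting problems if it is later passed to `subprocess.run`.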
From Adam: With the current version of AXL, we'd have a single thread on a single compute node copy the entire file from cache to the parallel file system. However,...
When using a global file system as cache, we need to examine the number of metadata files that SCR creates for each dataset. SCR stores a filemap for each process...