Optimise Memory Consumption during Setup
Introduction
Historically, our setup phase has been quite cavalier about performance and memory, which has given rise to various concerns for larger networks. As we are preparing for large deployments in the next HPC generation, we need to address this.
⚠️ Important ⚠️
The MPI tests for domain decomposition are still failing due to a cyclic shift of groups, which should not affect the test or the functionality. I have not yet managed to make the test pass.
Label Resolution
Our label resolution scheme concatenates a global vector of all source labels across all MPI tasks. In my recent experiments, this exhausts the node-local memory (512GB) at around 150M cells, with only a few --- $\mathcal{O}(10)$ --- labels per cell.
Fix
In each cell group, build a lazy map of the GIDs we are connected to and the associated source labels. This reduces the required memory to the actual high-water-mark, but incurs the cost of instantiating cells multiple times.
If this is still not enough, we can reduce the memory consumption even further by
- ~~(simple) evict labels from the cache if we are close to the limit. Costs more instantiations, but saves memory~~ Done
- (optimal) observe which cells connect to a given source GID and keep each source cell instantiated only for as long as it is needed. Memory is then $\mathcal{O}(1)$, but we have to fiddle with the construction order while keeping the connectivity stable, which might be complicated.
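A minimal sketch of the lazy per-group cache described above, including the simple eviction strategy; `label_cache`, `resolve`, and `source_labels_of` are illustrative names, not Arbor's actual interface:

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

using gid_type = unsigned long long;

// Per-group lazy map from a connected source gid to that cell's source labels.
struct label_cache {
    // Called on a cache miss; expected to instantiate the source cell and
    // extract its labels (supplied by the caller, hypothetical).
    std::function<std::vector<std::string>(gid_type)> source_labels_of;
    std::size_t limit = 1 << 20;    // evict once we approach this many entries

    const std::vector<std::string>& resolve(gid_type gid) {
        auto it = cache_.find(gid);
        if (it != cache_.end()) return it->second;
        if (cache_.size() >= limit) cache_.clear();   // simple eviction: drop everything
        return cache_[gid] = source_labels_of(gid);   // pay the re-instantiation cost later
    }

private:
    std::unordered_map<gid_type, std::vector<std::string>> cache_;
};
```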
Thingification during Cable Cell Construction
Cable cells constantly invoke thingify(region, provider) during construction. Mostly, however, we paint the same region in succession, i.e.:
# ...
.paint('(tag 1)', ...)
.paint('(tag 1)', ...)
.paint('(tag 1)', ...)
#...
This costs a lot of temporary allocations.
Fix
Simply cache the last region, following the idea that this is the 90% case.
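A minimal sketch of the single-entry cache, assuming that repeated paints pass region expressions that compare equal; `cached_thingify` and `concrete_region` are illustrative stand-ins, not Arbor's actual types:

```cpp
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Stand-in for the result of resolving a region against the morphology.
struct concrete_region { std::vector<int> cables; };

// Stand-in for the expensive resolution step.
concrete_region thingify(const std::string& region) {
    return concrete_region{/* ... resolve `region` ... */};
}

// Keep the last (region, result) pair around; repeated paints of the same
// region -- the 90% case -- skip the re-computation and its allocations.
const concrete_region& cached_thingify(const std::string& region) {
    static std::optional<std::pair<std::string, concrete_region>> last;
    if (!last || last->first != region) last.emplace(region, thingify(region));
    return last->second;
}
```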
Temporary data structures during Setup
embed_pwlin is particularly bad at managing allocations, so much so that it shows up on time profiles as a major cost centre. This originates from structures which are nested vectors under the hood and which must return values for reasons of genericity.
Fix
Return references where possible, and convert the rest to continuation-passing style, so we never actually return a value but instead take a callback which is invoked with the references.
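A sketch of the two styles, assuming the data is a nested vector owned by the embedding; `embedding` and `with_segments` are illustrative names:

```cpp
#include <cstddef>
#include <vector>

struct embedding {
    std::vector<std::vector<double>> segments_per_branch;

    // Before: genericity forces the nested vector to be copied out.
    std::vector<std::vector<double>> segments_by_value() const {
        return segments_per_branch;
    }

    // After: continuation-passing style; no value is returned, the callback
    // is invoked with a const reference to the internal structure.
    template <typename F>
    void with_segments(F&& f) const {
        f(segments_per_branch);
    }
};

// Usage: count elements without materialising a temporary copy.
std::size_t count(const embedding& e) {
    std::size_t n = 0;
    e.with_segments([&](const std::vector<std::vector<double>>& segs) {
        for (const auto& s: segs) n += s.size();
    });
    return n;
}
```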
mechanism_catalogue allocates temporaries
operator[] returns a value, where a const reference would be sufficient.
Fix
Obvious, no?
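A sketch of the change with a simplified stand-in for the catalogue; `mechanism_info` and the map layout are assumptions for illustration:

```cpp
#include <stdexcept>
#include <string>
#include <unordered_map>

struct mechanism_info { /* fields elided */ };

struct catalogue {
    // Before: `mechanism_info operator[](...) const` returned a copy,
    // allocating a temporary on every lookup.

    // After: a const reference is sufficient for read-only access.
    const mechanism_info& operator[](const std::string& name) const {
        auto it = info_.find(name);
        if (it == info_.end()) throw std::out_of_range(name);
        return it->second;
    }

private:
    std::unordered_map<std::string, mechanism_info> info_;
};
```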
General
Split large functions -- especially in fvm_layout -- into smaller ones. This fences in temporary memory allocations, as the sketch below illustrates.
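Illustrative only, with made-up stage names: each helper's scratch storage is destroyed on return, so the peak held by the split version is one stage's working set rather than their sum:

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

namespace {
// Helper: its scratch vector is freed as soon as the function returns.
double stage_sum(std::size_t n) {
    std::vector<double> scratch(n, 1.0);   // temporary working set
    return std::accumulate(scratch.begin(), scratch.end(), 0.0);
}
}

void layout_split() {
    double a = stage_sum(1'000'000);   // stage-one scratch is already gone here
    double b = stage_sum(1'000'000);   // peak RSS covers one working set, not two
    (void)(a + b);
}
```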
Clean-up and Extras
Rename mc_cell_* to the proper cable_cell_*
This also renames -- finally! -- mc_cell_* to cable_cell_*
Add RSS field to Profiler
To check the actual memory consumption, recording of the high-water-mark (RSS) was added to the profiler. It now prints the memory in use at the first hit of each cost centre and the associated maximum. Thus, we can attribute allocations to both setup and runtime.
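How the two RSS figures could be sampled on a Linux system is sketched below; this is an assumption about the mechanism, not necessarily what the profiler actually does:

```cpp
#include <cstdio>
#include <sys/resource.h>
#include <unistd.h>

// Peak resident set size; getrusage reports ru_maxrss in kB on Linux.
long max_rss_kB() {
    rusage u{};
    getrusage(RUSAGE_SELF, &u);
    return u.ru_maxrss;
}

// Current resident set size, from /proc/self/statm (given in pages).
long current_rss_kB() {
    long size = 0, resident = 0;
    if (std::FILE* f = std::fopen("/proc/self/statm", "r")) {
        std::fscanf(f, "%ld %ld", &size, &resident);
        std::fclose(f);
    }
    return resident * sysconf(_SC_PAGESIZE) / 1024;
}
```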
Example (not very helpful, but illustrative)
REGION     CALLS  WALL/s  THREAD/s      %  MAX_RSS/kB  1st_RSS/kB
------     -----  ------  --------  -----  ----------  ----------
root           -   0.686     5.485  100.0           -           -
advance        -   0.685     5.481   99.9           -           -
integrate      -   0.685     5.476   99.8           -           -
current        -   0.281     2.246   40.9           -           -
pas        40200   0.189     1.515   27.6       29248       29232
zero       40200   0.035     0.283    5.2       29248       29232
hh         40200   0.031     0.249    4.5       29248       29232
expsyn     40200   0.025     0.198    3.6       29248       29232
Add logging operator new
This will -- if enabled at compile time -- print the number of bytes allocated and the call stack of the allocation site to stderr at runtime. ⚠️ This is hilariously slow, but it really helps pinpoint allocations. Use script/memory-log-to-profile.py to convert the log files into a report like this:
Size/kB Count
4628 88274 root
4628 88274 start
4627 88269 main
4487 86748 arb::simulation::simulation(arb::recipe const&, std::__1::shared_ptr<arb::execution_context>, arb::domain_decomposition const&, unsigned long long)
4487 86748 arb::simulation::simulation(arb::recipe const&, std::__1::shared_ptr<arb::execution_context>, arb::domain_decomposition const&, unsigned long long)
4487 86747 arb::simulation_state::simulation_state(arb::recipe const&, arb::domain_decomposition const&, std::__1::shared_ptr<arb::execution_context>, unsigned long long)
4487 86747 arb::simulation_state::simulation_state(arb::recipe const&, arb::domain_decomposition const&, std::__1::shared_ptr<arb::execution_context>, unsigned long long)
3106 69279 void arb::simulation_state::foreach_group_index<arb::simulation_state::foreach_group_index(arb::recipe const&, arb::domain_decomposition const&, std::__1::shared_ptr<arb::execution_context>, unsigned long long)::$_0>(arb::simulation_state::foreach_group_index(arb::recipe const&, arb::domain_decomposition const&, std::__1::shared_ptr<arb::execution_context>, unsigned long long)::$_0&&)
3102 69276 arb::simulation_state::foreach_group_index<arb::simulation_state::foreach_group_index(arb::recipe const&, arb::domain_decomposition const&, std::__1::shared_ptr<arb::execution_context>, unsigned long long)::$_0>(arb::simulation_state::foreach_group_index(arb::recipe const&, arb::domain_decomposition const&, std::__1::shared_ptr<arb::execution_context>, unsigned long long)::$_0&&)::{lambda(int)#1}::operator()(int) const
3102 69276 arb::simulation_state::simulation_state(arb::recipe const&, arb::domain_decomposition const&, std::__1::shared_ptr<arb::execution_context>, unsigned long long)::$_0::operator()(std::__1::unique_ptr<arb::cell_group, std::__1::default_delete<arb::cell_group> >&, int) const
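For reference, the logging allocator itself can be roughly this small; a sketch assuming a glibc-style platform with `<execinfo.h>`, not the PR's actual implementation:

```cpp
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <new>
#include <execinfo.h>

// Replacement for the global operator new: log the size and a short backtrace
// of the allocation site to stderr (fd 2), then allocate via malloc so the
// logging itself does not recurse through operator new.
void* operator new(std::size_t size) {
    void* p = std::malloc(size);
    if (!p) throw std::bad_alloc{};
    std::fprintf(stderr, "alloc %zu B at %p\n", size, p);
    void* frames[16];
    int depth = backtrace(frames, 16);
    backtrace_symbols_fd(frames, depth, 2);
    return p;
}

void operator delete(void* p) noexcept {
    std::free(p);
}
```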
Linked
Fixes #1969
Do we still need the struct arb::connectivity?
No, and it should no longer be here.
Looks like this'll be essential for large simulations, the use-case we promote Arbor for. Is it possible to provide either strategy as a user option? Give me fast init, or give me low memory usage? Or even auto-activate the low mem path upon ncells > threshold (e.g. 100M?).
If not that, maybe a troubleshooting checklist for when people run out of memory or hit other failure modes.