
Optimise Memory Consumption during Setup

Open · thorstenhater opened this issue 3 years ago · 3 comments

Introduction

Historically, our setup phase has been quite cavalier with performance and memory, which has given rise to various concerns for larger networks. As we are preparing for large deployments on the next HPC generation, we need to address this.

⚠️ Important ⚠️

Tests on MPI are still failing for domain decomposition due to a cyclic shift of groups, which should not influence the test or the functionality; I have not yet managed to make the test pass.

Label Resolution

Our label resolution scheme concatenates the source labels of all cells across all MPI tasks into one global vector. In my recent experiments, this exhausts the node-local memory (512 GB) at around 150M cells, with only a few --- $\mathcal{O}(10)$ --- labels per cell.

Fix

In each cell group, build a lazy map from the GIDs we are connected to to the associated source labels (see the sketch after the list below). This reduces the required memory to the actual high-water-mark, but incurs the cost of instantiating cells multiple times.

If this is still not enough, we can reduce the memory consumption even further by

  • ~~(simple) evict labels from the cache if we are close to the limit. Costs more instantiations, but saves memory~~ Done
  • (optimal) observing which cells connect to a given source GID and instantiating each cell only for as long as needed. Memory then becomes $\mathcal{O}(1)$, but we need to fiddle with the construction order while keeping the connectivity stable, which might be complicated.
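
To make the lazy-map idea concrete, here is a minimal sketch in C++. The types and the `rec.source_labels(gid)` call are assumptions for illustration only, not Arbor's actual recipe interface; the real implementation lives in the label resolution code.

  #include <cstddef>
  #include <string>
  #include <unordered_map>
  #include <vector>

  // Hypothetical per-cell-group cache: resolve source labels on demand
  // instead of gathering one global vector of all labels across all ranks.
  struct source_label { std::string tag; unsigned index; };

  template <typename RecipeLike>
  struct lazy_label_cache {
      const RecipeLike& rec;
      std::unordered_map<std::size_t, std::vector<source_label>> cache;

      // The first request for `gid` instantiates the source cell (assumed to
      // happen inside `rec.source_labels`); later look-ups hit the cache.
      const std::vector<source_label>& labels_of(std::size_t gid) {
          auto it = cache.find(gid);
          if (it == cache.end()) {
              it = cache.emplace(gid, rec.source_labels(gid)).first;
          }
          return it->second;
      }

      // Hook for the eviction strategy from the list above.
      void evict(std::size_t gid) { cache.erase(gid); }
  };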

Thingification during Cable Cell Construction

Cable cells constantly invoke thingify(region, provider) during construction. Mostly, however, we paint the same region in succession, i.e.:

  # ...
  .paint('(tag 1)', ...)
  .paint('(tag 1)', ...)
  .paint('(tag 1)', ...)
  #...

This costs a lot of temporary allocations.

Fix

Simply cache the last region, following the idea that this is the 90% case.
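
A single-entry cache along these lines might look as follows. `Region`, `Provider`, and `Concrete` are stand-ins for Arbor's region/provider/extent types, and the callable parameter stands for thingify; this is a sketch of the idea, not the actual change.

  #include <optional>
  #include <utility>

  template <typename Region, typename Provider, typename Concrete>
  struct last_region_cache {
      std::optional<std::pair<Region, Concrete>> last;

      // Re-run thingify only when the painted region changes; repeated paints
      // of the same region -- the common case -- reuse the cached result.
      // Requires Region to be equality-comparable.
      template <typename Thingify>
      const Concrete& resolve(const Region& reg, const Provider& prov, Thingify&& thingify) {
          if (!last || !(last->first == reg)) {
              last.emplace(reg, thingify(reg, prov));
          }
          return last->second;
      }
  };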

Temporary data structures during Setup

embed_pwlin is particularly bad at managing allocations, so much so that it shows up in time profiles as a major cost centre. This originates from structures which are nested vectors under the hood and must return values for reasons of genericity.

Fix

Return references where possible; convert the rest to continuation passing style, so we never actually return a value but instead take a callback which is invoked with the references.
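
A hedged sketch of both variants on a made-up `embed_like` structure (not the real embed_pwlin interface):

  #include <cstddef>
  #include <vector>

  // `embed_like` is a stand-in for a structure with nested-vector storage.
  struct embed_like {
      std::vector<std::vector<double>> per_branch;

      // (a) Reference-returning accessor: no copy; the reference must not
      //     outlive *this.
      const std::vector<double>& branch_data(std::size_t b) const {
          return per_branch[b];
      }

      // (b) Continuation passing: instead of building and returning a value,
      //     invoke the callback with a reference to the data the caller
      //     needs, so no temporary is allocated.
      template <typename F>
      void with_branch_data(std::size_t b, F&& f) const {
          f(per_branch[b]);
      }
  };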

mechanism_catalogue allocates temporaries

operator[] returns a value, where a const reference would be sufficient.

Fix

Obvious, no?
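
For illustration (simplified, not the actual Arbor declaration; `mechanism_info` here is merely a placeholder for the catalogue's stored metadata type), the change amounts to returning a const reference instead of a value:

  #include <string>
  #include <unordered_map>

  struct mechanism_info { /* metadata fields elided */ };

  struct catalogue_like {
      std::unordered_map<std::string, mechanism_info> info_;

      // Before: `mechanism_info operator[](...) const;` copies on every look-up.
      // After: a const reference hands out the stored object without allocating.
      const mechanism_info& operator[](const std::string& name) const {
          return info_.at(name);
      }
  };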

General

Split large functions -- especially in fvm_layout -- into smaller ones. This fences in temporary memory allocations, since scratch data allocated in a helper is released when the helper returns.
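
Purely illustrative example of the effect: each helper's scratch buffer is released on return, so the high-water-mark holds one buffer at a time rather than all of them at once. Names are made up.

  #include <cstddef>
  #include <vector>

  static std::vector<double> scratch_for_step(std::size_t n) {
      return std::vector<double>(n, 0.0);
  }

  static double step_a() { auto tmp = scratch_for_step(1 << 20); return tmp.front(); } // tmp freed here
  static double step_b() { auto tmp = scratch_for_step(1 << 20); return tmp.back();  } // tmp freed here

  // Peak RSS now covers one scratch buffer, not the sum of all of them.
  double layout_like() { return step_a() + step_b(); }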

Clean-up and Extras

Rename mc_cell_* to the proper cable_cell_*

This also renames -- finally! -- mc_cell_* to cable_cell_*

Add RSS field to Profiler

To check the actual memory consumption, recording of the high-water-mark memory was added to the profiler. It now prints the memory used at the first hit of each cost centre and the associated maximum. Thus, we can determine the allocations made during setup and during runtime.

Example (not very helpful, but illustrative)

REGION             CALLS  WALL/s  THREAD/s      %  MAX_RSS/kB  1st_RSS/kB
------             -----  ------  --------  -----  ----------  ----------
root                   -   0.686     5.485  100.0           -           -
  advance              -   0.685     5.481   99.9           -           -
    integrate          -   0.685     5.476   99.8           -           -
      current          -   0.281     2.246   40.9           -           -
        pas        40200   0.189     1.515   27.6       29248       29232
        zero       40200   0.035     0.283    5.2       29248       29232
        hh         40200   0.031     0.249    4.5       29248       29232
        expsyn     40200   0.025     0.198    3.6       29248       29232

Add logging operator new

This will -- if enabled at compile time -- print the number of bytes allocated and the call stack of the allocation site to stderr at runtime. ⚠️ This is hilariously slow, but really helps pinpointing allocations. Use script/memory-log-to-profile.py to convert log files to a report like the one below; a rough sketch of such a logging operator new follows the report.

Size/kB   Count
4628      88274     root
4628      88274       start
4627      88269         main
4487      86748           arb::simulation::simulation(arb::recipe const&, std::__1::shared_ptr<arb::execution_context>, arb::domain_decomposition const&, unsigned long long)
4487      86748             arb::simulation::simulation(arb::recipe const&, std::__1::shared_ptr<arb::execution_context>, arb::domain_decomposition const&, unsigned long long)
4487      86747               arb::simulation_state::simulation_state(arb::recipe const&, arb::domain_decomposition const&, std::__1::shared_ptr<arb::execution_context>, unsigned long long)
4487      86747                 arb::simulation_state::simulation_state(arb::recipe const&, arb::domain_decomposition const&, std::__1::shared_ptr<arb::execution_context>, unsigned long long)
3106      69279                   void arb::simulation_state::foreach_group_index<arb::simulation_state::foreach_group_index(arb::recipe const&, arb::domain_decomposition const&, std::__1::shared_ptr<arb::execution_context>, unsigned long long)::$_0>(arb::simulation_state::foreach_group_index(arb::recipe const&, arb::domain_decomposition const&, std::__1::shared_ptr<arb::execution_context>, unsigned long long)::$_0&&)
3102      69276                     arb::simulation_state::foreach_group_index<arb::simulation_state::foreach_group_index(arb::recipe const&, arb::domain_decomposition const&, std::__1::shared_ptr<arb::execution_context>, unsigned long long)::$_0>(arb::simulation_state::foreach_group_index(arb::recipe const&, arb::domain_decomposition const&, std::__1::shared_ptr<arb::execution_context>, unsigned long long)::$_0&&)::{lambda(int)#1}::operator()(int) const
3102      69276                       arb::simulation_state::simulation_state(arb::recipe const&, arb::domain_decomposition const&, std::__1::shared_ptr<arb::execution_context>, unsigned long long)::$_0::operator()(std::__1::unique_ptr<arb::cell_group, std::__1::default_delete<arb::cell_group> >&, int) const

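For orientation, here is a rough sketch of a compile-time-gated logging operator new. The macro name is hypothetical and the actual implementation in this PR (log format consumed by script/memory-log-to-profile.py, call-stack capture) differs; this only shows the shape.

  #include <cstdio>
  #include <cstdlib>
  #include <new>

  #ifdef LOG_ALLOCATIONS // hypothetical compile-time switch
  void* operator new(std::size_t size) {
      void* p = std::malloc(size);
      if (!p) throw std::bad_alloc{};
      // The real version also records the call stack of the allocation site.
      std::fprintf(stderr, "alloc %zu B at %p\n", size, p);
      return p;
  }

  void operator delete(void* p) noexcept { std::free(p); }
  #endif
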
Linked

Fixes #1969

thorstenhater avatar Oct 19 '22 08:10 thorstenhater

Do we still need the struct arb::connectivity?

boeschf avatar Jan 31 '23 10:01 boeschf

No, and it should no longer be here.

thorstenhater avatar Jan 31 '23 14:01 thorstenhater

Looks like this'll be essential for large simulations, the use case we promote Arbor for. Is it possible to provide either strategy as a user option? Give me fast init, or give me low memory usage? Or even auto-activate the low-mem path when ncells exceeds a threshold (e.g. 100M)?

If not that, maybe a troubleshooting checklist for when e.g. people run out of memory or other fail situations.

brenthuisman avatar Jul 04 '23 11:07 brenthuisman