
OOM crash with 16 GB RAM during LogIndex stage

Open battlmonstr opened this issue 1 year ago • 10 comments

System information

Erigon version: ./erigon --version

v2.57.1

OS & Version: Windows/Linux/OSX

Linux

Commit hash:

9f1cd651f0b1b443b4bd96eaed84502c149fdca2

Erigon Command (with flags/config):

--chain=mainnet
--prune=htrc
--batchSize=128M
--db.size.limit=1TB
--internalcl
--metrics
--pprof
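
Putting the flags above together, the full invocation looks roughly like this (a sketch only; the binary path is an assumption and the flags are copied from the list above):

./erigon --chain=mainnet --prune=htrc --batchSize=128M --db.size.limit=1TB --internalcl --metrics --pprof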

Consensus Layer:

caplin

Consensus Layer Command (with flags/config):

--internalcl

Chain/Network:

mainnet

Expected behaviour

No crash.

Actual behaviour

Crash.

Steps to reproduce the behaviour

Sync from scratch until stage 10/12 LogIndex.

Backtrace

Latest DEBUG log lines before the crash:

[INFO] [02-26|15:56:07.925] [10/12 LogIndex] Progress                number=18779613 alloc=6.7GB sys=14.2GB
[INFO] [02-26|15:56:09.176] [10/12 LogIndex] Flushed buffer file     name=erigon-sortable-buf-55986105
[INFO] [02-26|15:56:38.275] [10/12 LogIndex] Progress                number=18789219 alloc=9.0GB sys=14.2GB
[INFO] [02-26|15:57:08.765] [10/12 LogIndex] Progress                number=18799392 alloc=11.2GB sys=14.2GB
[INFO] [02-26|15:57:15.497] [10/12 LogIndex] Flushed buffer file     name=erigon-sortable-buf-4208647017
[INFO] [02-26|15:57:17.390] [10/12 LogIndex] Flushed buffer file     name=erigon-sortable-buf-3571994502

battlmonstr avatar Feb 27 '24 14:02 battlmonstr

@AskAlexSharov is there something like --batchSize for LogIndex?

battlmonstr avatar Feb 27 '24 15:02 battlmonstr

One more crash around block 12.5M:

[INFO] [02-27|15:22:31.915] [10/12 LogIndex] Progress                number=12568622 alloc=10.7GB sys=13.9GB
[INFO] [02-27|15:23:01.912] [10/12 LogIndex] Progress                number=12577585 alloc=10.3GB sys=13.9GB

battlmonstr avatar Feb 27 '24 15:02 battlmonstr

Heap dump before the crash: (heap profile screenshot attached)

battlmonstr avatar Feb 27 '24 15:02 battlmonstr

This is a dump from 5 minutes before the crash, for comparison: (heap profile screenshot attached)

They look very similar. Maybe the problem is on the mdbx side, not in the Go heap?

battlmonstr avatar Feb 27 '24 15:02 battlmonstr
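
For context, heap profiles like the ones above can be captured from the pprof endpoint enabled by the --pprof flag in the command line; a minimal sketch (localhost:6060 is an assumption about the default pprof listen address):

# sketch: capture a heap profile from a running erigon node with --pprof enabled
# (localhost:6060 is an assumed default; adjust to your --pprof.addr/--pprof.port)
go tool pprof -png http://localhost:6060/debug/pprof/heap > heap.png
# or inspect interactively:
go tool pprof http://localhost:6060/debug/pprof/heap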

--internalcl - I see SpawnHistoryDownload in your picture; it seems to be running in the background and eating ~1GB. I guess it could eat less, improve its mem-limit, or adapt to the total RAM on the machine.

AskAlexSharov avatar Feb 27 '24 21:02 AskAlexSharov

you can prove it by running stage_log_index without the other erigon parts: integration stage_log_index

AskAlexSharov avatar Feb 27 '24 21:02 AskAlexSharov
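
The integration tool referenced above is built from the erigon repository; a minimal sketch of running the stage in isolation (the build target, datadir path, and chain flag are assumptions):

# sketch: run the LogIndex stage standalone, outside the full erigon process
# (the datadir path is an assumption; point it at the node's data directory)
make integration
./build/bin/integration stage_log_index --datadir=/data/erigon --chain=mainnet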

@AskAlexSharov Yeah, at the time of the crash I saw something in the logs about the history downloading. I ran the integration stage offline successfully. After erigon restarted, it went to 12/12 Finish 🎉 .

battlmonstr avatar Feb 29 '24 10:02 battlmonstr

@Giulio2002 hi, please take a look; if possible, put a stricter RAM limit on the history download.

AskAlexSharov avatar Mar 01 '24 02:03 AskAlexSharov

Also seeing an OOM kill during the LogIndex stage with 16GB memory and GOMEMLIMIT=13GiB.

From journalctl:

Mar 18 03:50:01 ethnode kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/supervisor.service,task=erigon,pid=952888,uid=1001
Mar 18 03:50:01 ethnode kernel: Out of memory: Killed process 952888 (erigon) total-vm:17215695868kB, anon-rss:11155356kB, file-rss:0kB, shmem-rss:0kB, UID:1001 pgtables:3872040kB oom_score_adj:0
Mar 18 03:50:01 ethnode systemd[1]: supervisor.service: A process of this unit has been killed by the OOM killer.

@Giulio2002 hi, please take a look; if possible, put a stricter RAM limit on the history download.

What is the command option for this? I couldn't find it in the manual.

pngwerks avatar Mar 18 '24 17:03 pngwerks
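
For reference, GOMEMLIMIT mentioned above is the Go runtime's soft memory limit, set via an environment variable; it is a soft limit on memory managed by the Go runtime, so the process can still be OOM-killed by the kernel. A minimal sketch of setting it (the value and flags are illustrative, not a recommendation):

# sketch: start erigon with a Go soft memory limit below physical RAM
# (13GiB on a 16GB host, matching the report above; flags are illustrative)
GOMEMLIMIT=13GiB ./erigon --chain=mainnet --batchSize=128M --internalcl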

this PR may help: https://github.com/ledgerwatch/erigon/pull/9814

AskAlexSharov avatar Mar 27 '24 02:03 AskAlexSharov