
OOM crash with 16 GB RAM during LogIndex stage

Open battlmonstr opened this issue 1 year ago • 10 comments

System information

Erigon version: ./erigon --version

v2.57.1

OS & Version: Windows/Linux/OSX

Linux

Commit hash:

9f1cd651f0b1b443b4bd96eaed84502c149fdca2

Erigon Command (with flags/config):

--chain=mainnet
--prune=htrc
--batchSize=128M
--db.size.limit=1TB
--internalcl
--metrics
--pprof
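
Putting the flags above together, the full invocation looks roughly like this (a sketch only; the binary path is an assumption and the flags are copied from the list above):

./erigon --chain=mainnet --prune=htrc --batchSize=128M --db.size.limit=1TB --internalcl --metrics --pprof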

Consensus Layer:

caplin

Consensus Layer Command (with flags/config):

--internalcl

Chain/Network:

mainnet

Expected behaviour

No crash.

Actual behaviour

Crash.

Steps to reproduce the behaviour

Sync from scratch until stage 10/12 LogIndex.

Backtrace

Latest DEBUG log lines before the crash:

[INFO] [02-26|15:56:07.925] [10/12 LogIndex] Progress                number=18779613 alloc=6.7GB sys=14.2GB
[INFO] [02-26|15:56:09.176] [10/12 LogIndex] Flushed buffer file     name=erigon-sortable-buf-55986105
[INFO] [02-26|15:56:38.275] [10/12 LogIndex] Progress                number=18789219 alloc=9.0GB sys=14.2GB
[INFO] [02-26|15:57:08.765] [10/12 LogIndex] Progress                number=18799392 alloc=11.2GB sys=14.2GB
[INFO] [02-26|15:57:15.497] [10/12 LogIndex] Flushed buffer file     name=erigon-sortable-buf-4208647017
[INFO] [02-26|15:57:17.390] [10/12 LogIndex] Flushed buffer file     name=erigon-sortable-buf-3571994502

battlmonstr avatar Feb 27 '24 14:02 battlmonstr

@AskAlexSharov is there something like --batchSize for LogIndex?

battlmonstr avatar Feb 27 '24 15:02 battlmonstr

One more crash around block 12.5M:

[INFO] [02-27|15:22:31.915] [10/12 LogIndex] Progress                number=12568622 alloc=10.7GB sys=13.9GB
[INFO] [02-27|15:23:01.912] [10/12 LogIndex] Progress                number=12577585 alloc=10.3GB sys=13.9GB

battlmonstr avatar Feb 27 '24 15:02 battlmonstr

Heap dump before the crash: (heap profile screenshot attached)

battlmonstr avatar Feb 27 '24 15:02 battlmonstr

This is a dump from 5 minutes before the crash, for comparison: (heap profile screenshot attached)

They look very similar. Maybe the problem is on the mdbx side, not in the Go heap?

battlmonstr avatar Feb 27 '24 15:02 battlmonstr
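
For context, heap profiles like the ones above can be captured from the pprof endpoint enabled by the --pprof flag in the command line; a minimal sketch (localhost:6060 is an assumption about the default pprof listen address):

# sketch: capture a heap profile from a running erigon node with --pprof enabled
# (localhost:6060 is an assumed default; adjust to your --pprof.addr/--pprof.port)
go tool pprof -png http://localhost:6060/debug/pprof/heap > heap.png
# or inspect interactively:
go tool pprof http://localhost:6060/debug/pprof/heap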

--internalcl - I see SpawnHistoryDownload in your picture; it seems to be running in the background and eating ~1GB. I guess it could eat less, improve its mem-limit, or adapt to the total RAM on the machine.

AskAlexSharov avatar Feb 27 '24 21:02 AskAlexSharov

you can prove it by running stage_log_index without the other erigon parts: integration stage_log_index

AskAlexSharov avatar Feb 27 '24 21:02 AskAlexSharov
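
The integration tool referenced above is built from the erigon repository; a minimal sketch of running the stage in isolation (the build target, datadir path, and chain flag are assumptions):

# sketch: run the LogIndex stage standalone, outside the full erigon process
# (the datadir path is an assumption; point it at the node's data directory)
make integration
./build/bin/integration stage_log_index --datadir=/data/erigon --chain=mainnet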

@AskAlexSharov Yeah, at the time of the crash I saw something in the logs about the history downloading. I ran the integration stage offline successfully. After erigon restarted, it went to 12/12 Finish 🎉 .

battlmonstr avatar Feb 29 '24 10:02 battlmonstr

@Giulio2002 hi, please take a look; if possible, put a stricter RAM limit on the history download.

AskAlexSharov avatar Mar 01 '24 02:03 AskAlexSharov

Also seeing an OOM kill during the LogIndex stage with 16GB memory and GOMEMLIMIT=13GiB.

From journalctl:

Mar 18 03:50:01 ethnode kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/system.slice/supervisor.service,task=erigon,pid=952888,uid=1001
Mar 18 03:50:01 ethnode kernel: Out of memory: Killed process 952888 (erigon) total-vm:17215695868kB, anon-rss:11155356kB, file-rss:0kB, shmem-rss:0kB, UID:1001 pgtables:3872040kB oom_score_adj:0
Mar 18 03:50:01 ethnode systemd[1]: supervisor.service: A process of this unit has been killed by the OOM killer.

@Giulio2002 hi, please take a look; if possible, put a stricter RAM limit on the history download.

What is the command option for this? I couldn't find it in the manual.

pngwerks avatar Mar 18 '24 17:03 pngwerks
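
For reference, GOMEMLIMIT mentioned above is the Go runtime's soft memory limit, set via an environment variable; it is a soft limit on memory managed by the Go runtime, so the process can still be OOM-killed by the kernel. A minimal sketch of setting it (the value and flags are illustrative, not a recommendation):

# sketch: start erigon with a Go soft memory limit below physical RAM
# (13GiB on a 16GB host, matching the report above; flags are illustrative)
GOMEMLIMIT=13GiB ./erigon --chain=mainnet --batchSize=128M --internalcl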

this PR may help: https://github.com/ledgerwatch/erigon/pull/9814

AskAlexSharov avatar Mar 27 '24 02:03 AskAlexSharov