storage/sources/postgres: memory leak/OOM kill
What version of Materialize are you using?
0a8d7bb4d84898a0b1352fe06085c5c6de8c6a4d
How did you install Materialize?
Docker image
What is the issue?
A test that inserts 16Mb worth of postgres data and then repeatedly updates it is unable to complete:
- on an 8Gb machine, because the OOM killer repeatedly kills the `storaged` process
- on a 64Gb machine, where the `RES` of the `storaged` process looks reasonable but the `VIRT` grows into the tens of gigabytes. jemalloc reports the following:
Allocated: 235.43 MiB
In active pages: 239.81 MiB
Allocated for allocator metadata: 48.86 MiB
Bytes unused, but retained by allocator: 19.24 GiB
If the container is restricted to 2Gb of non-swap RAM, iotop reports a lot of swap activity and things grind to a halt. perf also reports that nearly all CPU time goes to servicing page faults.
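The 19.24 GiB retained by the allocator suggests that part of the growth is jemalloc holding on to address space it has already freed, rather than a plain leak. One way to separate the two would be to rerun the test with more aggressive jemalloc decay settings and compare `VIRT` and the retained figure against the default run. A minimal sketch via mzcompose follows; the `MALLOC_CONF` options are standard jemalloc knobs, while the `environment_extra` parameter, the import path, and whether the bundled jemalloc honors `MALLOC_CONF` in this build are assumptions:

```python
# Sketch only: run Materialize with jemalloc told to return dirty/muzzy pages
# to the OS immediately, then rerun the bounded-memory workload and compare
# VIRT and the "retained" figure against the default run.
# Assumptions: the `environment_extra` parameter and the import path below,
# and that the bundled jemalloc reads MALLOC_CONF at startup.
from materialize.mzcompose.services import Materialized

SERVICES = [
    Materialized(
        environment_extra=[
            # Standard jemalloc options: decay dirty and muzzy pages without delay.
            "MALLOC_CONF=dirty_decay_ms:0,muzzy_decay_ms:0",
        ],
    ),
    # ...the rest of test/bounded-memory/mzcompose.py stays unchanged.
]
```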
To reproduce:
- (if you want to see OOMs) Create an 8Gb scratch machine using the following JSON definition:
diff --git a/misc/scratch/8gb.json b/misc/scratch/8gb.json
new file mode 100644
index 000000000..6f3341d0a
--- /dev/null
+++ b/misc/scratch/8gb.json
@@ -0,0 +1,8 @@
+{
+ "name": "c5.large with 8Gb of ram",
+ "launch_script": "true",
+ "instance_type": "c5.xlarge",
+ "ami": "ami-09d56f8956ab235b3",
+ "size_gb": 50,
+ "tags": {}
+}
- Clone the http://github.com/philip-stoev/materialize tree, branch gh15044
- Run:
cd test/bounded-memory
./mzcompose down -v
./mzcompose run default
- Observe repeated OOMs recorded in `sudo dmesg`
This creates a 32Mb postgres table, establishes replication on it, and then repeatedly updates it.
The issue does not seem reproducible without the CREATE VIEWS FROM statement.
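For orientation, the postgres-side workload can be sketched as a standalone script; the real test is a testdrive .td file (see the diff further down), and psycopg2 plus the connection parameters here are assumptions for illustration only:

```python
# Rough sketch of the workload driven against postgres. The table definition
# and the INSERT/UPDATE statements mirror the testdrive snippet quoted later
# in this issue; psycopg2 and the connection settings are illustrative
# assumptions. The real test also creates a publication and a Materialize
# source/views on top of this table.
import psycopg2

conn = psycopg2.connect(
    host="localhost", port=5432, user="postgres", password="postgres", dbname="postgres"
)
conn.autocommit = True
cur = conn.cursor()

cur.execute("CREATE TABLE t1 (f1 SERIAL PRIMARY KEY, f2 INTEGER DEFAULT 0, f3 TEXT)")
cur.execute("ALTER TABLE t1 REPLICA IDENTITY FULL")

# 16 rows of 1 MiB each, as in the original test.
cur.execute("INSERT INTO t1 (f3) SELECT REPEAT('a', 1024 * 1024) FROM generate_series(1, 16)")

# Each UPDATE rewrites every row; with REPLICA IDENTITY FULL the full old and
# new tuples flow through logical replication, which is what storaged ingests.
for _ in range(100):
    cur.execute("UPDATE t1 SET f2 = f2 + 1")
```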
Relevant log output
OOM report:
[ 1053.457633] systemd invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
[ 1053.457641] CPU: 2 PID: 1 Comm: systemd Not tainted 5.15.0-1004-aws #6-Ubuntu
[ 1053.457644] Hardware name: Amazon EC2 c5.xlarge/, BIOS 1.0 10/16/2017
[ 1053.457646] Call Trace:
[ 1053.457648] <TASK>
[ 1053.457651] show_stack+0x52/0x58
[ 1053.457657] dump_stack_lvl+0x4a/0x5f
[ 1053.457661] dump_stack+0x10/0x12
[ 1053.457663] dump_header+0x53/0x224
[ 1053.457668] oom_kill_process.cold+0xb/0x10
[ 1053.457671] out_of_memory+0x106/0x2e0
[ 1053.457676] __alloc_pages_slowpath.constprop.0+0x97a/0xa40
[ 1053.457680] __alloc_pages+0x30d/0x320
[ 1053.457683] alloc_pages+0x9e/0x1d0
[ 1053.457687] __page_cache_alloc+0x7a/0x90
[ 1053.457691] pagecache_get_page+0x152/0x590
[ 1053.457694] ? page_cache_ra_unbounded+0x166/0x210
[ 1053.457698] filemap_fault+0x484/0xab0
[ 1053.457700] ? filemap_map_pages+0x309/0x400
[ 1053.457703] __do_fault+0x3c/0xe0
[ 1053.457706] do_read_fault+0xeb/0x160
[ 1053.457709] do_fault+0xa0/0x2e0
[ 1053.457711] handle_pte_fault+0x1c5/0x230
[ 1053.457713] __handle_mm_fault+0x3c7/0x700
[ 1053.457716] handle_mm_fault+0xd8/0x2c0
[ 1053.457719] do_user_addr_fault+0x1c5/0x670
[ 1053.457723] exc_page_fault+0x77/0x160
[ 1053.457725] ? asm_exc_page_fault+0x8/0x30
[ 1053.457728] asm_exc_page_fault+0x1e/0x30
[ 1053.457731] RIP: 0033:0x7f473c21d170
[ 1053.457738] Code: Unable to access opcode bytes at RIP 0x7f473c21d146.
[ 1053.457739] RSP: 002b:00007ffd56f51098 EFLAGS: 00010202
[ 1053.457742] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 000055b8bdfcf170
[ 1053.457743] RDX: 00000000ffffffff RSI: ffffffffffffffff RDI: 000055b8bde2cab0
[ 1053.457745] RBP: 000055b8bde2cab0 R08: 0000000000000000 R09: add4143c6b2c9b64
[ 1053.457746] R10: 916374641153c335 R11: 00007ffd56ff3090 R12: 0000000000000001
[ 1053.457748] R13: 000055b8bde2cab0 R14: ffffffffffffffff R15: 0000000000000000
[ 1053.457751] </TASK>
[ 1053.457752] Mem-Info:
[ 1053.457754] active_anon:4666 inactive_anon:1863435 isolated_anon:0
active_file:2 inactive_file:80 isolated_file:0
unevictable:8058 dirty:0 writeback:0
slab_reclaimable:8113 slab_unreclaimable:12282
mapped:48575 shmem:47108 pagetables:9444 bounce:0
kernel_misc_reclaimable:0
free:24829 free_pcp:0 free_cma:0
[ 1053.457759] Node 0 active_anon:18664kB inactive_anon:7453740kB active_file:8kB inactive_file:320kB unevictable:32232kB isolated(anon):0kB isolated(file):0kB mapped:194300kB dirty:0kB writeback:0kB shmem:188432kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 110592kB writeback_tmp:0kB kernel_stack:6208kB pagetables:37776kB all_unreclaimable? yes
[ 1053.457765] Node 0 DMA free:14336kB min:132kB low:164kB high:196kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 1053.457770] lowmem_reserve[]: 0 2987 7571 7571 7571
[ 1053.457775] Node 0 DMA32 free:44344kB min:26612kB low:33264kB high:39916kB reserved_highatomic:0KB active_anon:12kB inactive_anon:2982508kB active_file:0kB inactive_file:348kB unevictable:0kB writepending:0kB present:3129256kB managed:3063720kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 1053.457780] lowmem_reserve[]: 0 0 4584 4584 4584
[ 1053.457783] Node 0 Normal free:40636kB min:40836kB low:51044kB high:61252kB reserved_highatomic:0KB active_anon:18652kB inactive_anon:4471208kB active_file:228kB inactive_file:112kB unevictable:32232kB writepending:0kB present:4878336kB managed:4702528kB mlocked:27696kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 1053.457789] lowmem_reserve[]: 0 0 0 0 0
[ 1053.457793] Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 1*2048kB (M) 3*4096kB (M) = 14336kB
[ 1053.457805] Node 0 DMA32: 85*4kB (UME) 1009*8kB (UME) 667*16kB (UME) 124*32kB (UE) 78*64kB (UE) 38*128kB (UME) 12*256kB (UE) 5*512kB (UE) 2*1024kB (ME) 0*2048kB 1*4096kB (M) = 44684kB
[ 1053.457822] Node 0 Normal: 392*4kB (UME) 247*8kB (UME) 153*16kB (UME) 121*32kB (UME) 87*64kB (UME) 97*128kB (UME) 31*256kB (UME) 8*512kB (UM) 2*1024kB (M) 0*2048kB 0*4096kB = 41928kB
[ 1053.457837] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[ 1053.457840] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 1053.457841] 50702 total pagecache pages
[ 1053.457842] 0 pages in swap cache
[ 1053.457843] Swap cache stats: add 0, delete 0, find 0/0
[ 1053.457845] Free swap = 0kB
[ 1053.457846] Total swap = 0kB
[ 1053.457847] 2005896 pages RAM
[ 1053.457847] 0 pages HighMem/MovableOnly
[ 1053.457848] 60494 pages reserved
[ 1053.457849] 0 pages hwpoisoned
[ 1053.457850] Tasks state (memory values in pages):
[ 1053.457851] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[ 1053.457856] [ 181] 0 181 14112 957 122880 0 -250 systemd-journal
[ 1053.457861] [ 218] 0 218 72328 6775 106496 0 -1000 multipathd
[ 1053.457864] [ 221] 0 221 5850 920 69632 0 -1000 systemd-udevd
[ 1053.457867] [ 433] 100 433 4050 871 69632 0 0 systemd-network
[ 1053.457870] [ 435] 101 435 6338 1368 90112 0 0 systemd-resolve
[ 1053.457872] [ 531] 0 531 703 274 40960 0 0 acpid
[ 1053.457875] [ 534] 0 534 1821 559 53248 0 0 cron
[ 1053.457877] [ 535] 102 535 2223 734 53248 0 -900 dbus-daemon
[ 1053.457880] [ 544] 0 544 20691 577 61440 0 0 irqbalance
[ 1053.457883] [ 545] 0 545 7995 2616 102400 0 0 networkd-dispat
[ 1053.457886] [ 546] 0 546 58995 1094 90112 0 0 polkitd
[ 1053.457888] [ 548] 104 548 55600 1004 77824 0 0 rsyslogd
[ 1053.457891] [ 556] 0 556 384920 2761 270336 0 -900 snapd
[ 1053.457894] [ 558] 0 558 3867 710 73728 0 0 systemd-logind
[ 1053.457897] [ 560] 113 560 4722 624 61440 0 0 chronyd
[ 1053.457899] [ 566] 0 566 98114 1019 131072 0 0 udisksd
[ 1053.457901] [ 571] 113 571 2640 132 61440 0 0 chronyd
[ 1053.457904] [ 620] 0 620 60799 1067 118784 0 0 ModemManager
[ 1053.457906] [ 631] 0 631 27521 2855 110592 0 0 unattended-upgr
[ 1053.457909] [ 653] 0 653 1554 213 45056 0 0 agetty
[ 1053.457912] [ 661] 0 661 1543 212 49152 0 0 agetty
[ 1053.457914] [ 773] 0 773 3854 1049 61440 0 -1000 sshd
[ 1053.457916] [ 1216] 0 1216 345776 1735 200704 0 0 amazon-ssm-agen
[ 1053.457919] [ 1242] 0 1242 367309 3256 237568 0 0 ssm-agent-worke
[ 1053.457921] [ 1601] 0 1601 74006 864 159744 0 0 packagekitd
[ 1053.457924] [ 5472] 114 5472 54670 3381 159744 0 -900 postgres
[ 1053.457927] [ 5474] 114 5474 54702 1244 155648 0 0 postgres
[ 1053.457929] [ 5475] 114 5475 54670 1115 155648 0 0 postgres
[ 1053.457932] [ 5476] 114 5476 54670 1786 143360 0 0 postgres
[ 1053.457934] [ 5477] 114 5477 54812 1008 155648 0 0 postgres
[ 1053.457937] [ 5478] 114 5478 18352 711 118784 0 0 postgres
[ 1053.457939] [ 5479] 114 5479 54777 894 143360 0 0 postgres
[ 1053.457941] [ 8946] 0 8946 357503 5121 315392 0 -999 containerd
[ 1053.457944] [ 9077] 0 9077 435709 6308 421888 0 -500 dockerd
[ 1053.457947] [ 19656] 0 19656 4229 753 69632 0 0 sshd
[ 1053.457949] [ 19832] 1000 19832 4245 402 73728 0 0 systemd
[ 1053.457952] [ 19833] 1000 19833 42604 1247 94208 0 0 (sd-pam)
[ 1053.457954] [ 19894] 1000 19894 4326 602 69632 0 0 sshd
[ 1053.457957] [ 19895] 1000 19895 2308 501 57344 0 0 bash
[ 1053.457960] [ 21552] 0 21552 287182 190 143360 0 -500 docker-proxy
[ 1053.457963] [ 21559] 0 21559 268749 192 131072 0 -500 docker-proxy
[ 1053.457965] [ 21574] 0 21574 268749 188 135168 0 -500 docker-proxy
[ 1053.457968] [ 21582] 0 21582 305615 188 139264 0 -500 docker-proxy
[ 1053.457970] [ 21601] 0 21601 287182 188 139264 0 -500 docker-proxy
[ 1053.457972] [ 21608] 0 21608 287182 191 143360 0 -500 docker-proxy
[ 1053.457975] [ 21621] 0 21621 287182 191 139264 0 -500 docker-proxy
[ 1053.457977] [ 21627] 0 21627 268749 188 139264 0 -500 docker-proxy
[ 1053.457979] [ 21639] 0 21639 268749 189 126976 0 -500 docker-proxy
[ 1053.457982] [ 21645] 0 21645 268749 164 135168 0 -500 docker-proxy
[ 1053.457984] [ 21663] 0 21663 178164 587 102400 0 -998 containerd-shim
[ 1053.457986] [ 21665] 0 21665 305743 250 151552 0 -500 docker-proxy
[ 1053.457989] [ 21675] 0 21675 305551 188 147456 0 -500 docker-proxy
[ 1053.457991] [ 21700] 0 21700 305615 189 159744 0 -500 docker-proxy
[ 1053.457994] [ 21708] 0 21708 268749 189 139264 0 -500 docker-proxy
[ 1053.457996] [ 21725] 0 21725 178100 664 98304 0 -998 containerd-shim
[ 1053.457998] [ 21744] 999 21744 112561 43424 495616 0 0 postgres
[ 1053.458001] [ 21753] 999 21753 696 23 49152 0 0 tini
[ 1053.458004] [ 21850] 999 21850 69223 5848 585728 0 0 environmentd
[ 1053.458006] [ 21919] 999 21919 293834 59869 929792 0 0 cockroach
[ 1053.458008] [ 21969] 999 21969 112595 1142 176128 0 0 postgres
[ 1053.458010] [ 21970] 999 21970 112880 1926 192512 0 0 postgres
[ 1053.458013] [ 21971] 999 21971 112561 1591 167936 0 0 postgres
[ 1053.458015] [ 21972] 999 21972 113569 860 208896 0 0 postgres
[ 1053.458018] [ 21973] 999 21973 16821 504 135168 0 0 postgres
[ 1053.458020] [ 21974] 999 21974 113558 730 200704 0 0 postgres
[ 1053.458023] [ 22014] 999 22014 256399 3799 1445888 0 0 computed
[ 1053.458025] [ 23168] 999 23168 3781020 1708974 27389952 0 0 storaged
[ 1053.458028] [ 23186] 999 23186 115378 3349 290816 0 0 postgres
[ 1053.458030] [ 23201] 999 23201 113696 1240 233472 0 0 postgres
[ 1053.458032] [ 23202] 114 23202 54670 689 114688 0 0 postgres
[ 1053.458035] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=init.scope,mems_allowed=0,global_oom,task_memcg=/system.slice/docker-729680011fd9ecb84547a7bc3c05b5f059503e6a206a14cdfac48e68a2a2c175.scope,task=storaged,pid=23168,uid=999
[ 1053.458078] Out of memory: Killed process 23168 (storaged) total-vm:15124080kB, anon-rss:6835896kB, file-rss:0kB, shmem-rss:0kB, UID:999 pgtables:26748kB oom_score_adj:0
[ 1053.693645] oom_reaper: reaped process 23168 (storaged), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
The test as pushed in the branch uses 16 1-megabyte rows. Unfortunately, the same problem exists with 1024 16-kilobyte rows:
diff --git a/test/bounded-memory/bounded-memory-before-restart.td b/test/bounded-memory/bounded-memory-before-restart.td
index 002c8fa3e..af07cf8e0 100644
--- a/test/bounded-memory/bounded-memory-before-restart.td
+++ b/test/bounded-memory/bounded-memory-before-restart.td
@@ -29,7 +29,7 @@ CREATE SCHEMA public;
CREATE TABLE t1 (f1 SERIAL PRIMARY KEY, f2 INTEGER DEFAULT 0, f3 TEXT);
ALTER TABLE t1 REPLICA IDENTITY FULL;
-INSERT INTO t1 (f3) SELECT REPEAT('a', 1024 * 1024) FROM generate_series(1, 16);
+INSERT INTO t1 (f3) SELECT REPEAT('a', 1024 * 16) FROM generate_series(1, 1024);
CREATE PUBLICATION mz_source FOR ALL TABLES;
@@ -354,4 +354,4 @@ UPDATE t1 SET f2 = f2 + 1;
INSERT INTO t1 (f3) VALUES ('eof');
> SELECT * FROM v2;
-129
+1025
@philip-stoev is there a guide on how to get jemalloc stats (I am just praying it works on macOS)?
@guswynn apologies for the late response. storaged exposes an HTTP port that you can open in a web browser, which lets you enable stats collection (this needs to happen towards the start of the test) and then view the report towards the end of the test.
The port is available inside the container only, so you may need to expose it to the outside:
diff --git a/src/orchestrator-process/src/lib.rs b/src/orchestrator-process/src/lib.rs
index fdb614d63..259bfd173 100644
--- a/src/orchestrator-process/src/lib.rs
+++ b/src/orchestrator-process/src/lib.rs
@@ -259,7 +259,7 @@ impl NamespacedOrchestrator for NamespacedProcessOrchestrator {
for i in 0..scale_in.get() {
if !processes_exist[i] {
let mut args = args(&ServiceAssignments {
- listen_host: IpAddr::V4(Ipv4Addr::LOCALHOST),
+ listen_host: IpAddr::V4(Ipv4Addr::UNSPECIFIED),
ports: &peers[i].1,
index: Some(i),
peers: &peers,
and then modify the Materialized(...) section of the mzcompose.py file to include:
extra_ports=["2104:2104"],
allow_host_ports=True
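Put together, the Materialized(...) entry in mzcompose.py ends up looking roughly like this (other service options elided; the import path is an assumption about the harness layout at the time):

```python
# Sketch of the mzcompose.py change described above: publish storaged's
# internal HTTP port 2104 on the host so the stats page is reachable from
# outside the container. Other Materialized options are elided.
from materialize.mzcompose.services import Materialized

SERVICES = [
    Materialized(
        extra_ports=["2104:2104"],  # container port 2104 -> host port 2104
        allow_host_ports=True,
    ),
]
```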
I do not know whether all of that works on a Mac, but it should work on a scratch machine. Once the port is exposed on the scratch machine, you can use SSH port forwarding to bring it to your local machine, where you can view it in the browser:
ssh -L 2104:127.0.0.1:2104 ...
As you can see, there are quite a few hoops to jump through, not to mention that none of this will be possible in the cloud.
This is still reproducible:
4003748 systemd+ 20 0 27.9g 1.1g 28072 S 181.1 1.8 4:39.51 storaged
@philip-stoev doing some digging on old issues. Before I dive in, do you know offhand if this issue still applies?
@nrainer-materialize @def-, throwing this in the QA bucket so we can double-check if this is still an issue, or if we can close this out.
@morsapaes: Yes, the OOM can still be reproduced! I migrated and added this check to the bounded-memory framework in https://github.com/MaterializeInc/materialize/pull/26845. Since it fails (even with a higher memory specification), I will disable it for now.
I think some kind of flow control is the real fix here!