
storage/sources/postgres: memory leak/OOM kill

Open philip-stoev opened this issue 3 years ago • 10 comments

What version of Materialize are you using?

0a8d7bb4d84898a0b1352fe06085c5c6de8c6a4d

How did you install Materialize?

Docker image

What is the issue?

A test that inserts 16Mb worth of postgres data and then repeatedly updates it is unable to complete:

  • on an 8Gb machine, because the OOM killer repeatedly kills the storaged process
  • on a 64Gb machine: while the RES of the storaged process looks reasonable, the VIRT grows into tens of gigabytes. jemalloc reports the following:
Allocated: 235.43 MiB
In active pages: 239.81 MiB
Allocated for allocator metadata: 48.86 MiB
Bytes unused, but retained by allocator: 19.24 GiB
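These jemalloc figures point at allocator retention rather than live data. A quick arithmetic check (plain Python on the numbers above) shows the retained memory dwarfs the live allocation, which matches the huge VIRT alongside a modest RES:

```python
# Plain arithmetic on the jemalloc stats quoted above: retained-but-unused
# memory is roughly two orders of magnitude larger than live allocations.
allocated_mib = 235.43            # "Allocated"
retained_mib = 19.24 * 1024       # "retained by allocator", GiB -> MiB
print(f"retained is ~{retained_mib / allocated_mib:.0f}x the live allocation")
```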

If the container is restricted to 2Gb of non-swap RAM, iotop reports a lot of swap activity and things grind to a halt. perf likewise shows that almost all CPU time goes to servicing page faults.

To reproduce:

  1. (if you want to see OOMs) Create an 8Gb scratch machine using the following JSON definition:
diff --git a/misc/scratch/8gb.json b/misc/scratch/8gb.json
new file mode 100644
index 000000000..6f3341d0a
--- /dev/null
+++ b/misc/scratch/8gb.json
@@ -0,0 +1,8 @@
+{
+    "name": "c5.large with 8Gb of ram",
+    "launch_script": "true",
+    "instance_type": "c5.xlarge",
+    "ami": "ami-09d56f8956ab235b3",
+    "size_gb": 50,
+    "tags": {}
+}
  2. Clone the http://github.com/philip-stoev/materialize tree, branch gh15044
  3. Run:
cd test/bounded-memory
./mzcompose down -v
./mzcompose run default
  4. Observe the repeated OOMs recorded in sudo dmesg

This creates a 32Mb postgres table, establishes replication on it, and then repeatedly updates it.
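One reason the update loop is so memory-hungry, as a back-of-the-envelope sketch (the number of update passes below is an illustrative assumption, not the test's exact figure): with REPLICA IDENTITY FULL, postgres logs the complete old row image alongside the new one for every UPDATE, so each full-table update pass pushes roughly twice the table size through logical decoding.

```python
# Rough replication-traffic estimate for the workload shape described above.
# REPLICA IDENTITY FULL means every UPDATE record carries old + new row
# images, i.e. ~2x the table size per full-table update pass.
table_mib = 16        # payload size inserted by the test
passes = 100          # hypothetical number of full-table UPDATE passes
traffic_gib = passes * 2 * table_mib / 1024
print(f"~{traffic_gib:.1f} GiB of row images decoded")
```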

The issue does not seem reproducible without the CREATE VIEWS FROM statement.

Relevant log output

No response

philip-stoev, Sep 28 '22 11:09

OOM report:

[ 1053.457633] systemd invoked oom-killer: gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
[ 1053.457641] CPU: 2 PID: 1 Comm: systemd Not tainted 5.15.0-1004-aws #6-Ubuntu
[ 1053.457644] Hardware name: Amazon EC2 c5.xlarge/, BIOS 1.0 10/16/2017
[ 1053.457646] Call Trace:
[ 1053.457648]  <TASK>
[ 1053.457651]  show_stack+0x52/0x58
[ 1053.457657]  dump_stack_lvl+0x4a/0x5f
[ 1053.457661]  dump_stack+0x10/0x12
[ 1053.457663]  dump_header+0x53/0x224
[ 1053.457668]  oom_kill_process.cold+0xb/0x10
[ 1053.457671]  out_of_memory+0x106/0x2e0
[ 1053.457676]  __alloc_pages_slowpath.constprop.0+0x97a/0xa40
[ 1053.457680]  __alloc_pages+0x30d/0x320
[ 1053.457683]  alloc_pages+0x9e/0x1d0
[ 1053.457687]  __page_cache_alloc+0x7a/0x90
[ 1053.457691]  pagecache_get_page+0x152/0x590
[ 1053.457694]  ? page_cache_ra_unbounded+0x166/0x210
[ 1053.457698]  filemap_fault+0x484/0xab0
[ 1053.457700]  ? filemap_map_pages+0x309/0x400
[ 1053.457703]  __do_fault+0x3c/0xe0
[ 1053.457706]  do_read_fault+0xeb/0x160
[ 1053.457709]  do_fault+0xa0/0x2e0
[ 1053.457711]  handle_pte_fault+0x1c5/0x230
[ 1053.457713]  __handle_mm_fault+0x3c7/0x700
[ 1053.457716]  handle_mm_fault+0xd8/0x2c0
[ 1053.457719]  do_user_addr_fault+0x1c5/0x670
[ 1053.457723]  exc_page_fault+0x77/0x160
[ 1053.457725]  ? asm_exc_page_fault+0x8/0x30
[ 1053.457728]  asm_exc_page_fault+0x1e/0x30
[ 1053.457731] RIP: 0033:0x7f473c21d170
[ 1053.457738] Code: Unable to access opcode bytes at RIP 0x7f473c21d146.
[ 1053.457739] RSP: 002b:00007ffd56f51098 EFLAGS: 00010202
[ 1053.457742] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 000055b8bdfcf170
[ 1053.457743] RDX: 00000000ffffffff RSI: ffffffffffffffff RDI: 000055b8bde2cab0
[ 1053.457745] RBP: 000055b8bde2cab0 R08: 0000000000000000 R09: add4143c6b2c9b64
[ 1053.457746] R10: 916374641153c335 R11: 00007ffd56ff3090 R12: 0000000000000001
[ 1053.457748] R13: 000055b8bde2cab0 R14: ffffffffffffffff R15: 0000000000000000
[ 1053.457751]  </TASK>
[ 1053.457752] Mem-Info:
[ 1053.457754] active_anon:4666 inactive_anon:1863435 isolated_anon:0
                active_file:2 inactive_file:80 isolated_file:0
                unevictable:8058 dirty:0 writeback:0
                slab_reclaimable:8113 slab_unreclaimable:12282
                mapped:48575 shmem:47108 pagetables:9444 bounce:0
                kernel_misc_reclaimable:0
                free:24829 free_pcp:0 free_cma:0
[ 1053.457759] Node 0 active_anon:18664kB inactive_anon:7453740kB active_file:8kB inactive_file:320kB unevictable:32232kB isolated(anon):0kB isolated(file):0kB mapped:194300kB dirty:0kB writeback:0kB shmem:188432kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 110592kB writeback_tmp:0kB kernel_stack:6208kB pagetables:37776kB all_unreclaimable? yes
[ 1053.457765] Node 0 DMA free:14336kB min:132kB low:164kB high:196kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 1053.457770] lowmem_reserve[]: 0 2987 7571 7571 7571
[ 1053.457775] Node 0 DMA32 free:44344kB min:26612kB low:33264kB high:39916kB reserved_highatomic:0KB active_anon:12kB inactive_anon:2982508kB active_file:0kB inactive_file:348kB unevictable:0kB writepending:0kB present:3129256kB managed:3063720kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 1053.457780] lowmem_reserve[]: 0 0 4584 4584 4584
[ 1053.457783] Node 0 Normal free:40636kB min:40836kB low:51044kB high:61252kB reserved_highatomic:0KB active_anon:18652kB inactive_anon:4471208kB active_file:228kB inactive_file:112kB unevictable:32232kB writepending:0kB present:4878336kB managed:4702528kB mlocked:27696kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 1053.457789] lowmem_reserve[]: 0 0 0 0 0
[ 1053.457793] Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 1*2048kB (M) 3*4096kB (M) = 14336kB
[ 1053.457805] Node 0 DMA32: 85*4kB (UME) 1009*8kB (UME) 667*16kB (UME) 124*32kB (UE) 78*64kB (UE) 38*128kB (UME) 12*256kB (UE) 5*512kB (UE) 2*1024kB (ME) 0*2048kB 1*4096kB (M) = 44684kB
[ 1053.457822] Node 0 Normal: 392*4kB (UME) 247*8kB (UME) 153*16kB (UME) 121*32kB (UME) 87*64kB (UME) 97*128kB (UME) 31*256kB (UME) 8*512kB (UM) 2*1024kB (M) 0*2048kB 0*4096kB = 41928kB
[ 1053.457837] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[ 1053.457840] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 1053.457841] 50702 total pagecache pages
[ 1053.457842] 0 pages in swap cache
[ 1053.457843] Swap cache stats: add 0, delete 0, find 0/0
[ 1053.457845] Free swap  = 0kB
[ 1053.457846] Total swap = 0kB
[ 1053.457847] 2005896 pages RAM
[ 1053.457847] 0 pages HighMem/MovableOnly
[ 1053.457848] 60494 pages reserved
[ 1053.457849] 0 pages hwpoisoned
[ 1053.457850] Tasks state (memory values in pages):
[ 1053.457851] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[ 1053.457856] [    181]     0   181    14112      957   122880        0          -250 systemd-journal
[ 1053.457861] [    218]     0   218    72328     6775   106496        0         -1000 multipathd
[ 1053.457864] [    221]     0   221     5850      920    69632        0         -1000 systemd-udevd
[ 1053.457867] [    433]   100   433     4050      871    69632        0             0 systemd-network
[ 1053.457870] [    435]   101   435     6338     1368    90112        0             0 systemd-resolve
[ 1053.457872] [    531]     0   531      703      274    40960        0             0 acpid
[ 1053.457875] [    534]     0   534     1821      559    53248        0             0 cron
[ 1053.457877] [    535]   102   535     2223      734    53248        0          -900 dbus-daemon
[ 1053.457880] [    544]     0   544    20691      577    61440        0             0 irqbalance
[ 1053.457883] [    545]     0   545     7995     2616   102400        0             0 networkd-dispat
[ 1053.457886] [    546]     0   546    58995     1094    90112        0             0 polkitd
[ 1053.457888] [    548]   104   548    55600     1004    77824        0             0 rsyslogd
[ 1053.457891] [    556]     0   556   384920     2761   270336        0          -900 snapd
[ 1053.457894] [    558]     0   558     3867      710    73728        0             0 systemd-logind
[ 1053.457897] [    560]   113   560     4722      624    61440        0             0 chronyd
[ 1053.457899] [    566]     0   566    98114     1019   131072        0             0 udisksd
[ 1053.457901] [    571]   113   571     2640      132    61440        0             0 chronyd
[ 1053.457904] [    620]     0   620    60799     1067   118784        0             0 ModemManager
[ 1053.457906] [    631]     0   631    27521     2855   110592        0             0 unattended-upgr
[ 1053.457909] [    653]     0   653     1554      213    45056        0             0 agetty
[ 1053.457912] [    661]     0   661     1543      212    49152        0             0 agetty
[ 1053.457914] [    773]     0   773     3854     1049    61440        0         -1000 sshd
[ 1053.457916] [   1216]     0  1216   345776     1735   200704        0             0 amazon-ssm-agen
[ 1053.457919] [   1242]     0  1242   367309     3256   237568        0             0 ssm-agent-worke
[ 1053.457921] [   1601]     0  1601    74006      864   159744        0             0 packagekitd
[ 1053.457924] [   5472]   114  5472    54670     3381   159744        0          -900 postgres
[ 1053.457927] [   5474]   114  5474    54702     1244   155648        0             0 postgres
[ 1053.457929] [   5475]   114  5475    54670     1115   155648        0             0 postgres
[ 1053.457932] [   5476]   114  5476    54670     1786   143360        0             0 postgres
[ 1053.457934] [   5477]   114  5477    54812     1008   155648        0             0 postgres
[ 1053.457937] [   5478]   114  5478    18352      711   118784        0             0 postgres
[ 1053.457939] [   5479]   114  5479    54777      894   143360        0             0 postgres
[ 1053.457941] [   8946]     0  8946   357503     5121   315392        0          -999 containerd
[ 1053.457944] [   9077]     0  9077   435709     6308   421888        0          -500 dockerd
[ 1053.457947] [  19656]     0 19656     4229      753    69632        0             0 sshd
[ 1053.457949] [  19832]  1000 19832     4245      402    73728        0             0 systemd
[ 1053.457952] [  19833]  1000 19833    42604     1247    94208        0             0 (sd-pam)
[ 1053.457954] [  19894]  1000 19894     4326      602    69632        0             0 sshd
[ 1053.457957] [  19895]  1000 19895     2308      501    57344        0             0 bash
[ 1053.457960] [  21552]     0 21552   287182      190   143360        0          -500 docker-proxy
[ 1053.457963] [  21559]     0 21559   268749      192   131072        0          -500 docker-proxy
[ 1053.457965] [  21574]     0 21574   268749      188   135168        0          -500 docker-proxy
[ 1053.457968] [  21582]     0 21582   305615      188   139264        0          -500 docker-proxy
[ 1053.457970] [  21601]     0 21601   287182      188   139264        0          -500 docker-proxy
[ 1053.457972] [  21608]     0 21608   287182      191   143360        0          -500 docker-proxy
[ 1053.457975] [  21621]     0 21621   287182      191   139264        0          -500 docker-proxy
[ 1053.457977] [  21627]     0 21627   268749      188   139264        0          -500 docker-proxy
[ 1053.457979] [  21639]     0 21639   268749      189   126976        0          -500 docker-proxy
[ 1053.457982] [  21645]     0 21645   268749      164   135168        0          -500 docker-proxy
[ 1053.457984] [  21663]     0 21663   178164      587   102400        0          -998 containerd-shim
[ 1053.457986] [  21665]     0 21665   305743      250   151552        0          -500 docker-proxy
[ 1053.457989] [  21675]     0 21675   305551      188   147456        0          -500 docker-proxy
[ 1053.457991] [  21700]     0 21700   305615      189   159744        0          -500 docker-proxy
[ 1053.457994] [  21708]     0 21708   268749      189   139264        0          -500 docker-proxy
[ 1053.457996] [  21725]     0 21725   178100      664    98304        0          -998 containerd-shim
[ 1053.457998] [  21744]   999 21744   112561    43424   495616        0             0 postgres
[ 1053.458001] [  21753]   999 21753      696       23    49152        0             0 tini
[ 1053.458004] [  21850]   999 21850    69223     5848   585728        0             0 environmentd
[ 1053.458006] [  21919]   999 21919   293834    59869   929792        0             0 cockroach
[ 1053.458008] [  21969]   999 21969   112595     1142   176128        0             0 postgres
[ 1053.458010] [  21970]   999 21970   112880     1926   192512        0             0 postgres
[ 1053.458013] [  21971]   999 21971   112561     1591   167936        0             0 postgres
[ 1053.458015] [  21972]   999 21972   113569      860   208896        0             0 postgres
[ 1053.458018] [  21973]   999 21973    16821      504   135168        0             0 postgres
[ 1053.458020] [  21974]   999 21974   113558      730   200704        0             0 postgres
[ 1053.458023] [  22014]   999 22014   256399     3799  1445888        0             0 computed
[ 1053.458025] [  23168]   999 23168  3781020  1708974 27389952        0             0 storaged
[ 1053.458028] [  23186]   999 23186   115378     3349   290816        0             0 postgres
[ 1053.458030] [  23201]   999 23201   113696     1240   233472        0             0 postgres
[ 1053.458032] [  23202]   114 23202    54670      689   114688        0             0 postgres
[ 1053.458035] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=init.scope,mems_allowed=0,global_oom,task_memcg=/system.slice/docker-729680011fd9ecb84547a7bc3c05b5f059503e6a206a14cdfac48e68a2a2c175.scope,task=storaged,pid=23168,uid=999
[ 1053.458078] Out of memory: Killed process 23168 (storaged) total-vm:15124080kB, anon-rss:6835896kB, file-rss:0kB, shmem-rss:0kB, UID:999 pgtables:26748kB oom_score_adj:0
[ 1053.693645] oom_reaper: reaped process 23168 (storaged), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
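The kill message's numbers check out against the task dump (memory values there are in 4 KiB pages):

```python
# storaged's row in the task dump above, converted from 4 KiB pages to kB;
# the results match the "Out of memory: Killed process" line exactly.
PAGE_KB = 4
rss_pages = 1_708_974       # rss column for pid 23168 (storaged)
total_vm_pages = 3_781_020  # total_vm column for the same row
print(rss_pages * PAGE_KB)       # 6835896 (anon-rss kB)
print(total_vm_pages * PAGE_KB)  # 15124080 (total-vm kB)
```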

philip-stoev, Sep 28 '22 11:09

The test as pushed in the branch uses 16 1-megabyte rows. Unfortunately, the same problem exists with 1024 16-kilobyte rows:

diff --git a/test/bounded-memory/bounded-memory-before-restart.td b/test/bounded-memory/bounded-memory-before-restart.td
index 002c8fa3e..af07cf8e0 100644
--- a/test/bounded-memory/bounded-memory-before-restart.td
+++ b/test/bounded-memory/bounded-memory-before-restart.td
@@ -29,7 +29,7 @@ CREATE SCHEMA public;
 CREATE TABLE t1 (f1 SERIAL PRIMARY KEY, f2 INTEGER DEFAULT 0, f3 TEXT);
 ALTER TABLE t1 REPLICA IDENTITY FULL;
 
-INSERT INTO t1 (f3) SELECT REPEAT('a', 1024 * 1024) FROM generate_series(1, 16);
+INSERT INTO t1 (f3) SELECT REPEAT('a', 1024 * 16) FROM generate_series(1, 1024);
 
 CREATE PUBLICATION mz_source FOR ALL TABLES;
 
@@ -354,4 +354,4 @@ UPDATE t1 SET f2 = f2 + 1;
 INSERT INTO t1 (f3) VALUES ('eof');
 
 > SELECT * FROM v2;
-129
+1025
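
Note that the two INSERT variants carry exactly the same total payload, so the reproduction is not sensitive to individual row size:

```python
# Both shapes insert 16 MiB of TEXT in total; only the row count differs.
original = 16 * (1024 * 1024)   # 16 rows of 1 MiB
variant = 1024 * (1024 * 16)    # 1024 rows of 16 KiB
print(original == variant, original // (1024 * 1024))  # True 16
```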

philip-stoev, Sep 28 '22 11:09

@philip-stoev is there a guide on how to get jemalloc stats? (I am just praying it works on macOS.)

guswynn, Sep 30 '22 16:09

@guswynn apologies for the late response. storaged exposes an HTTP port, viewable in a web browser, that lets you enable stats collection (which needs to happen toward the start of the test) and then view the report toward the end of the test.

The port is available inside the container only, so you may need to expose it to the outside:

diff --git a/src/orchestrator-process/src/lib.rs b/src/orchestrator-process/src/lib.rs
index fdb614d63..259bfd173 100644
--- a/src/orchestrator-process/src/lib.rs
+++ b/src/orchestrator-process/src/lib.rs
@@ -259,7 +259,7 @@ impl NamespacedOrchestrator for NamespacedProcessOrchestrator {
         for i in 0..scale_in.get() {
             if !processes_exist[i] {
                 let mut args = args(&ServiceAssignments {
-                    listen_host: IpAddr::V4(Ipv4Addr::LOCALHOST),
+                    listen_host: IpAddr::V4(Ipv4Addr::UNSPECIFIED),
                     ports: &peers[i].1,
                     index: Some(i),
                     peers: &peers,

and then modify the mzcompose.py file's Materialized(...) section to include:

extra_ports=["2104:2104"],
allow_host_ports=True

I do not know whether all of that works on a Mac, but it should work on a scratch machine. Once the port is exposed on the scratch machine, you can use SSH port forwarding to bring it to your local machine, where you can view it in the browser:

ssh -L 2104:127.0.0.1:2104 ...
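Before opening the browser, it can help to confirm the tunnel is actually up. A small sketch (the host and port here just mirror the ssh -L example above):

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# True once the SSH tunnel is forwarding the storaged HTTP port.
print(port_open("127.0.0.1", 2104))
```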

As you can see, there are quite a few hoops to jump through, not to mention that none of this will be possible in the cloud.

philip-stoev, Oct 04 '22 06:10

This is still reproducible:

4003748 systemd+  20   0   27.9g   1.1g  28072 S 181.1   1.8   4:39.51 storaged                                                                     

philip-stoev, Dec 14 '22 16:12

@philip-stoev doing some digging on old issues. Before I dive in, do you know offhand if this issue still applies?

dseisun-materialize, Aug 07 '23 21:08

@nrainer-materialize @def-, throwing this in the QA bucket so we can double-check if this is still an issue, or if we can close this out.

morsapaes, Apr 30 '24 11:04

@morsapaes: Yes, the OOM can still be reproduced! I migrated and added this check to the bounded-memory framework in https://github.com/MaterializeInc/materialize/pull/26845. Since it fails (even with a higher memory specification), I will disable it for now.

nrainer-materialize, May 06 '24 11:05

I think some kind of flow control is the real fix here!
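For context on what flow control buys here, a minimal sketch (plain Python, not Materialize code): a bounded buffer makes the producer (here, standing in for the replication stream) block instead of letting decoded data pile up ahead of a slow consumer.

```python
import queue
import threading

buf = queue.Queue(maxsize=4)   # the bound caps in-flight batches

def producer() -> None:
    # Simulates the replication stream: put() blocks when the queue is
    # full, so memory use stays bounded no matter how fast the source is.
    for i in range(16):
        buf.put(f"batch-{i}")
    buf.put(None)              # end-of-stream marker

consumed = []
t = threading.Thread(target=producer)
t.start()
while (item := buf.get()) is not None:
    consumed.append(item)
t.join()
print(len(consumed))  # 16
```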

guswynn, May 06 '22 22:05