libyang Memory leaks

Hello! We ran stability tests for our system, using netopeer2 (version 2.1.71), sysrepo (version 2.2.132) and libyang (version 2.1.137). And it seems, that there are memory leaks (as shown on screenshots). Memory_usage CPU_usage Briefly:

per process for 24 hrs:
/usr/bin/netopeer2-server - from 45MB to 76MB (+31)
/usr/bin/sysrepo-plugind - from 60MB to 98MB (+38)

After this i checked both netopeer2 and sysrepo via valgrind. The result of checking is presented below: For sysrepo-plugind:

valgrind /usr/bin/sysrepo-plugind -d
584 bytes in 1 blocks are possibly lost in loss record 87 of 120 <----- repeat many times
==2849==    at 0x484DE4C: realloc (in /usr/lib/valgrind/vgpreload_memcheck-mips32-linux.so)
==2849==    by 0x4A83218: parse_ext (parser_yang.c:937)
==2849==    by 0x4A82854: get_keyword (parser_yang.c:837)
==2849==    by 0x4AA5BAC: parse_module (parser_yang.c:4413)
==2849==    by 0x4AA8768: yang_parse_module (parser_yang.c:4805)
==2849==    by 0x4A43798: lys_parse_in (tree_schema.c:1771)
==2849==    by 0x49A9F34: ly_ctx_new (context.c:312)
==2849==    by 0x4889188: sr_ly_ctx_init (common.c:224)
==2849==    by 0x486BD28: sr_conn_new (sysrepo.c:80)
==2849==    by 0x486C5A8: sr_connect (sysrepo.c:216)
==2849==    by 0x406B68: main (sysrepo-plugind.c:554)
----------------------------------------------------
468 bytes in 13 blocks are possibly lost in loss record 84 of 120
==2849==    at 0x484A8F0: malloc (in /usr/lib/valgrind/vgpreload_memcheck-mips32-linux.so)
==2849==    by 0x4A8ECFC: parse_type_pattern (parser_yang.c:2125)
==2849==    by 0x4A82288: get_keyword (parser_yang.c:772)
----------------------------------------------------
==2849== LEAK SUMMARY:
==2849==    definitely lost: 0 bytes in 0 blocks
==2849==    indirectly lost: 0 bytes in 0 blocks
==2849==      possibly lost: 9,084 bytes in 121 blocks
==2849==    still reachable: 111,191 bytes in 373 blocks
==2849==         suppressed: 0 bytes in 0 blocks
==2849== Reachable blocks (those to which a pointer was found) are not shown.
==2849== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==2849==
==2849== For lists of detected and suppressed errors, rerun with: -s
==2849== ERROR SUMMARY: 17 errors from 17 contexts (suppressed: 1 from 1)

For netopeer2:

valgrind /usr/bin/netopeer2-server -d
584 bytes in 1 blocks are possibly lost in loss record 55 of 75
==11434==    at 0x484DE4C: realloc (in /usr/lib/valgrind/vgpreload_memcheck-mips32-linux.so)
==11434==    by 0x4AE4218: parse_ext (parser_yang.c:937)
==11434==    by 0x4AE3854: get_keyword (parser_yang.c:837)
==11434==    by 0x4B06BAC: parse_module (parser_yang.c:4413)
==11434==    by 0x4B09768: yang_parse_module (parser_yang.c:4805)
==11434==    by 0x4AA4798: lys_parse_in (tree_schema.c:1771)
==11434==    by 0x4A0AF34: ly_ctx_new (context.c:312)
==11434==    by 0x4C0E188: sr_ly_ctx_init (common.c:224)
==11434==    by 0x4BF0D28: sr_conn_new (sysrepo.c:80)
==11434==    by 0x4BF15A8: sr_connect (sysrepo.c:216)
==11434==    by 0x406650: main (in /usr/bin/netopeer2-server)
-------------------------------------------------------------
240 bytes in 3 blocks are possibly lost in loss record 34 of 75
==11434==    at 0x484DE4C: realloc (in /usr/lib/valgrind/vgpreload_memcheck-mips32-linux.so)
==11434==    by 0x4AE7DE4: parse_revision (parser_yang.c:1306)
==11434==    by 0x4AE3288: get_keyword (parser_yang.c:772)
==11434==    by 0x4B06554: parse_module (parser_yang.c:4356)
==11434==    by 0x4B09768: yang_parse_module (parser_yang.c:4805)
==11434==    by 0x4AA4798: lys_parse_in (tree_schema.c:1771)
==11434==    by 0x4A0AF34: ly_ctx_new (context.c:312)
==11434==    by 0x4C0E188: sr_ly_ctx_init (common.c:224)
==11434==    by 0x4BF0D28: sr_conn_new (sysrepo.c:80)
==11434==    by 0x4BF15A8: sr_connect (sysrepo.c:216)
==11434==    by 0x406650: main (in /usr/bin/netopeer2-server)
-------------------------------------------------------------
==11434== LEAK SUMMARY:
==11434==    definitely lost: 0 bytes in 0 blocks
==11434==    indirectly lost: 0 bytes in 0 blocks
==11434==      possibly lost: 7,244 bytes in 105 blocks
==11434==    still reachable: 56,369 bytes in 203 blocks
==11434==         suppressed: 0 bytes in 0 blocks
==11434== Reachable blocks (those to which a pointer was found) are not shown.
==11434== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==11434==
==11434== For lists of detected and suppressed errors, rerun with: -s
==11434== ERROR SUMMARY: 14 errors from 14 contexts (suppressed: 1 from 1)

As i can conclude, there are possibly memory leaks in libyang, due valgrind shows for both netopeer2 and sysrepo stacktrace into libyang part. Can you, please, investigate this problem and suggest any fix? Or maybe correct me, if my conclusion is wrong. Ask me any additional information, if it's needed. It's important problem for us, which blocks further moving to new netopeer2 and sysrepo versions.

Dec 11 '23 06:12 optimden

The valgrind report says that about 100kB were not explicitly freed, but it doesn't say where that memory comes from. Anyway, the test suite of all these projects runs (a few different) memory leak detectors, so it's unlikely a "leak", it might be "just" that the memory is only freed at the end of the execution. Either way, without a real heap trace this will be tough to debug.

Dec 11 '23 13:12 jktjkt

I attach valgrind massif output files for both sysrepo-plugind and netopeer2-server. Is it acceptable and enough? It is for first time after both programs started. massif_sysrepo_plugind.txt massif_netopeer2_server.txt

Dec 13 '23 09:12 optimden

What exactly are the binaries doing? Generally, the possibly lost valgrind memory is due to type sharing in libyang which is hard to track.

Dec 14 '23 10:12 michalvasko

What exactly are the binaries doing? Generally, the possibly lost valgrind memory is due to type sharing in libyang which is hard to track.

I'll run again stability tests but with sysrepo-plugind and netopeer2-server running via valgrind --tool=massif. Then I'll attach new massif reports. If answer on your question, we send from time to time different netconf requests such as get, get-config and other in these tests.

Dec 14 '23 10:12 optimden

@michalvasko , I've got new heap traces for two cases:

Periodically sending get and get-config request for different yang modules during 40 minutes: massif.out.sysrepo_1.txt massif.out.netopeer_1.txt
Performed two get requests for the whole yang tree:

[massif.out.netopeer_2.txt](https://github.com/CESNET/libyang/files/13712055/massif.out.netopeer_2.txt) [massif.out.sysrepo_2.txt](https://github.com/CESNET/libyang/files/13712056/massif.out.sysrepo_2.txt)

So, I see possible problems with accumulating allocated memory in heap in the next functions (in the order of decreasing allocated memory): 754 KiB 0x4976A08: ly_realloc (common.c:45) 256.1 KiB 0x4980704: dict_insert (dict.c:209) 136.1 KiB 0x497DE24: lyht_init_hlists_and_records (hash_table.c:75) In genearally, there are more problems in sysrepo, than in netopeer2. Please, answer, is this information enough to start investigate this bug or you need some additional info?

Dec 19 '23 06:12 optimden

The massif tool is not a good one for finding memory leaks, try valgrind itself. As for some explanation of the numbers you measured, I can explain the rising memory consumption of netopeer2-server when executing get for different YANG modules. The server uses data caching to speed up retrieving configuration data so I think you reach the maximum memory when all the configuration data is loaded. If this is a problem for you, you can disable the cache by removing the flag SR_CONN_CACHE_RUNNING from sr_connect() in main.c:server_init().

So, unless you provide valgrind output with some definitely lost memory, there are no memory leaks for me to fix.

Dec 19 '23 07:12 michalvasko

@michalvasko , I have now valgrind report with a few definitely losts. But there are a lot of question marks in stacktrace. It seems it's not easy to fix it. I used environment with all binaries and libs non-stripped with debug info for got this report, and have question marks yet. So, there are all places with definitely losts from report:

==21673== Source and destination overlap in memcpy(0x51637ec, 0x51637ee, 24)
==21673==    at 0x4850F40: memcpy (vg_replace_strmem.c:1036)
==21673==    by 0x52E30DC: mmxnc_eppath_to_template (mmx-system.c:271)
==21673==    by 0x52E33C8: mmx_system_partial_update (mmx-system.c:315)
==21673==    by 0x5305E18: mmxnc_plugin_generic_mmxep_update (plugin.c:659)
==21673==    by 0x5305B10: mmxnc_plugin_worker_thread (plugin.c:621)
==21673==    by 0x4BA3714: start_thread (pthread_create.c:463)
==21673==    by 0x4CDF368: __thread_start (clone.S:135)

==21673== 205 bytes in 1 blocks are definitely lost in loss record 77 of 145
==21673==    at 0x484DE4C: realloc (vg_replace_malloc.c:836)
==21673==    by 0x4C5D720: _IO_mem_finish (memstream.c:131)
==21673==    by 0x4C511B8: fclose@@GLIBC_2.2 (iofclose.c:57)
==21673==    by 0x4CD9F2C: __vsyslog_chk (syslog.c:226)
==21673==    by 0x55447B0: ???
==21673==    by 0x52EC578: ???
==21673==    by 0x52EE530: ???
==21673==    by 0x52EEF04: ???
==21673==    by 0x52F080C: ???
==21673==    by 0x52F0350: ???
==21673==    by 0x52F085C: ???
==21673==    by 0x52F08F8: ???

==21673== 1,024 bytes in 1 blocks are definitely lost in loss record 100 of 145
==21673==    at 0x484DB40: calloc (vg_replace_malloc.c:762)
==21673==    by 0x52EE2C0: ???
==21673==    by 0x52EEF04: ???
==21673==    by 0x52F080C: ???
==21673==    by 0x52F0350: ???
==21673==    by 0x52F085C: ???
==21673==    by 0x52F08F8: ???
==21673==    by 0x52F4AC4: ???
==21673==    by 0x5306444: ???
==21673==    by 0x497086C: timer_sigev_thread (timer_routines.c:62)
==21673==    by 0x4BA3714: start_thread (pthread_create.c:463)
==21673==    by 0x4CDF368: __thread_start (clone.S:135)
==21673== 
==21673== 10,304 bytes in 1 blocks are definitely lost in loss record 127 of 145
==21673==    at 0x484A8F0: malloc (vg_replace_malloc.c:309)
==21673==    by 0x52FC470: ???
==21673==    by 0x52F4A08: ???
==21673==    by 0x5306444: ???
==21673==    by 0x497086C: timer_sigev_thread (timer_routines.c:62)
==21673==    by 0x4BA3714: start_thread (pthread_create.c:463)
==21673==    by 0x4CDF368: __thread_start (clone.S:135)
==21673== 
==21673== 14,850 bytes in 1 blocks are definitely lost in loss record 132 of 145
==21673==    at 0x484A8F0: malloc (vg_replace_malloc.c:309)
==21673==    by 0x52ECA50: ???
==21673==    by 0x52ED6E4: ???
==21673==    by 0x52EDAB4: ???
==21673==    by 0x52EEC18: ???
==21673==    by 0x52F080C: ???
==21673==    by 0x52F0350: ???
==21673==    by 0x52F085C: ???
==21673==    by 0x52F08F8: ???
==21673==    by 0x52F4AC4: ???
==21673==    by 0x5306444: ???
==21673==    by 0x497086C: timer_sigev_thread (timer_routines.c:62)
==21673== 
==21673== 18,688 bytes in 1 blocks are definitely lost in loss record 133 of 145
==21673==    at 0x484A8F0: malloc (vg_replace_malloc.c:309)
==21673==    by 0x56CC65C: ???
==21673==    by 0x4011198: call_init.part.0 (dl-init.c:72)
==21673==    by 0x40112F8: call_init (dl-init.c:30)
==21673==    by 0x40112F8: _dl_init (dl-init.c:119)
==21673==    by 0x40160C4: dl_open_worker (dl-open.c:511)
==21673==    by 0x4D298A8: _dl_catch_exception (dl-error-skeleton.c:196)
==21673==    by 0x4015770: _dl_open (dl-open.c:594)
==21673==    by 0x4983CA8: dlopen_doit (dlopen.c:66)
==21673==    by 0x4D298A8: _dl_catch_exception (dl-error-skeleton.c:196)
==21673==    by 0x4D29944: _dl_catch_error (dl-error-skeleton.c:215)
==21673==    by 0x4984580: _dlerror_run (dlerror.c:162)
==21673==    by 0x4983D50: dlopen@@GLIBC_2.2 (dlopen.c:87)
==21673== 
==21673== 171,983 (64 direct, 171,919 indirect) bytes in 1 blocks are definitely lost in loss record 143 of 145

Is it acceptable by you for start work on this problem?

Dec 24 '23 22:12 optimden

Is it acceptable by you for start work on this problem?

I believe all these leaks are in your code. There are no mmx-system.c or plugin.c files in our projects.

Jan 02 '24 10:01 michalvasko

@michalvasko Hello, sorry for a long delay for this issue. I got valgrind report with a lot of "still reachable" losts in libyang code (not from our project files). Please, looks through and say, do you need any additional information?

sysrepo-plugind-valgrind.log

Jan 15 '24 07:01 optimden

It may still be caused by problem in your plugins and most likely is, there are no such problems reported normally. You can try to get more callers in the output by specifying --num-callers=30 and it should give you exact information where the problem is.

Jan 15 '24 08:01 michalvasko

It may still be caused by problem in your plugins and most likely is, there are no such problems reported normally. You can try to get more callers in the output by specifying --num-callers=30 and it should give you exact information where the problem is.

Thank you for advice.

Jan 15 '24 08:01 optimden

@michalvasko , i investigated, that memory leaks problem was due accumulate error messages in libyang logging system. massif_devel_fin Problem function here is "vasprintf". Due massif showed me stacktrace for this function including only this function itself, i checked libyang source code and found three potentially problem functions, where vasprinf is invoked. vasprinf allocated memory for "char *msg". And this pointer is used in all these three functions (assigned to variables which are passed into these functions). So, i allocated memory for char * in these functions, strcpy msg to this variable and free "msg". I investigated, that problem function is "log_vprintf": massif_devel_log_vprintf I changed this part of libyang log.c:

diff --git a/src/log.c b/src/log.c
index 41c8e90e2..adaab0007 100644
--- a/src/log.c
+++ b/src/log.c
@@ -569,21 +569,11 @@ log_store(const struct ly_ctx *ctx, LY_LOG_LEVEL level, LY_ERR no, LY_VECODE vec
         } while (e->prev->next);
         /* last error was not found */
         assert(0);
-    } else if ((temp_ly_log_opts && ((*temp_ly_log_opts & LY_LOSTORE_LAST) == LY_LOSTORE_LAST)) ||
-            (!temp_ly_log_opts && ((ATOMIC_LOAD_RELAXED(ly_log_opts) & LY_LOSTORE_LAST) == LY_LOSTORE_LAST))) {
+    } else {
         /* overwrite last message */
         free(e->msg);
         free(e->path);
         free(e->apptag);
-    } else {
-        /* store new message */
-        last = e->prev;
-        e->prev = malloc(sizeof *e);
-        LY_CHECK_GOTO(!e->prev, mem_fail);
-        e = e->prev;
-        e->prev = last;
-        e->next = NULL;
-        last->next = e;
     }
 
     /* fill in the information */

to always overwrite last error message. It's quick solution for my case (no memory growth in heap): massif_fix_log_store I have a few questions:

What is purpose of storing a few (not only last one) error messages in libyang context?
Will be these messages used futher for only libyang internal logging (and maybe printing them into syslog) or they can be passed to sysrepo or netopeer, for example?
In common, my stability test passed without failures (with my fix above). But from your point of view, as maintainer, can my changes break some important functionality?

Feb 12 '24 11:02 optimden

Some functions can generate several errors and if not all are stored, only the last one is.
All sysrepo or netopeer2 libyang API calls should process and clear the errors to avoid similar problems.
That is a workaround at best, not a proper solution.

I assume the issue arises because you are calling libyang functions as well and not clearing the errors meaning they accumulate in the context (if sysrepo/netopeer2 is responsible for this, let me know which functions are you calling). In latest sysrepo devel version the logging options are not modified globally so if you use that, the default logging options should be set and only the last error stored.

TLDR; If you update sysrepo to the devel branch version, the problem should be fixed.

Feb 12 '24 12:02 michalvasko

Okey, thank you!

Feb 12 '24 12:02 optimden