Valgrind reports some still reachable bytes in MPI_File_open

Open dqwu opened this issue 3 years ago • 3 comments

Reproduced on Ubuntu 18 with default GCC 7.4.0

[Build MPICH 4.0.2 with -g flag]

wget https://www.mpich.org/static/downloads/4.0.2/mpich-4.0.2.tar.gz
tar zxf mpich-4.0.2.tar.gz
cd mpich-4.0.2
CFLAGS="-g" ./configure --prefix=/path/to/mpich/installation
make -j4
make install

[Run a simple test with Valgrind]

export PATH=/path/to/mpich/installation/bin:$PATH
mpicc -g test_mpiio.c
mpiexec -n 2 valgrind --leak-check=full --show-leak-kinds=all ./a.out 

[Sample output]

...
==24456== 32,768 bytes in 1 blocks are still reachable in loss record 7 of 8
==24456==    at 0x4C31B25: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==24456==    by 0x52990E3: MPIR_Handle_indirect_init (mpir_handlemem.h:188)
==24456==    by 0x52990E3: MPIR_Handle_obj_alloc_unsafe (mpir_handlemem.h:318)
==24456==    by 0x52990E3: MPIR_Info_handle_obj_alloc (mpir_handlemem.c:191)
==24456==    by 0x521849F: MPIR_Info_alloc (infoutil.c:57)
==24456==    by 0x521824F: MPIR_Info_set_impl (info_impl.c:206)
==24456==    by 0x4F4FD55: internal_Info_set (info_set.c:69)
==24456==    by 0x4F4FD55: PMPI_Info_set (info_set.c:142)
==24456==    by 0x728BAB8: ADIOI_GEN_SetInfo (ad_hints.c:81)
==24456==    by 0x7296FE0: ADIO_Open (ad_open.c:123)
==24456==    by 0x7271FCB: PMPI_File_open (open.c:143)
==24456==    by 0x1088DC: main (test_mpiio.c:8)
==24456== 
==24456== 65,536 bytes in 1 blocks are still reachable in loss record 8 of 8
==24456==    at 0x4C31B25: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==24456==    by 0x529918E: MPIR_Handle_indirect_init (mpir_handlemem.h:169)
==24456==    by 0x529918E: MPIR_Handle_obj_alloc_unsafe (mpir_handlemem.h:318)
==24456==    by 0x529918E: MPIR_Info_handle_obj_alloc (mpir_handlemem.c:191)
==24456==    by 0x521849F: MPIR_Info_alloc (infoutil.c:57)
==24456==    by 0x521824F: MPIR_Info_set_impl (info_impl.c:206)
==24456==    by 0x4F4FD55: internal_Info_set (info_set.c:69)
==24456==    by 0x4F4FD55: PMPI_Info_set (info_set.c:142)
==24456==    by 0x728BAB8: ADIOI_GEN_SetInfo (ad_hints.c:81)
==24456==    by 0x7296FE0: ADIO_Open (ad_open.c:123)
==24456==    by 0x7271FCB: PMPI_File_open (open.c:143)
==24456==    by 0x1088DC: main (test_mpiio.c:8)
==24456== 
==24456== LEAK SUMMARY:
==24456==    definitely lost: 0 bytes in 0 blocks
==24456==    indirectly lost: 0 bytes in 0 blocks
==24456==      possibly lost: 0 bytes in 0 blocks
==24456==    still reachable: 98,414 bytes in 8 blocks
==24456==         suppressed: 0 bytes in 0 blocks

[Test program (test_mpiio.c)]

#include <mpi.h>

int main(int argc, char* argv[])
{
  MPI_Init(&argc, &argv);

  MPI_File fh;
  MPI_File_open(MPI_COMM_WORLD, "test_file", MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
  MPI_File_close(&fh);

  MPI_Finalize();

  return 0;
}

dqwu avatar Sep 21 '22 15:09 dqwu

@hzhou Is this a known issue?

dqwu avatar Sep 21 '22 15:09 dqwu

Not really. Could you try the main branch of mpich? https://github.com/pmodels/mpich/blob/main/doc/wiki/source_code/Github.md

hzhou avatar Sep 21 '22 15:09 hzhou

It seems that the leaks from MPI_File_open have been fixed on the latest main branch. Below are the remaining possible leaks reported by Valgrind.

==24466== 2 bytes in 1 blocks are still reachable in loss record 1 of 10
==24466==    at 0x4C2FB0F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==24466==    by 0x7B339B9: strdup (strdup.c:42)
==24466==    by 0x5216083: MPIR_Info_push (infoutil.c:90)
==24466==    by 0x5215B5E: MPIR_Info_set_impl (info_impl.c:153)
==24466==    by 0x5324E17: setup_single_nic (ofi_nic.c:164)
==24466==    by 0x53253C8: MPIDI_OFI_init_multi_nic (ofi_nic.c:128)
==24466==    by 0x5303947: MPIDI_OFI_init_local (ofi_init.c:568)
==24466==    by 0x52AD058: MPID_Init (ch4_init.c:508)
==24466==    by 0x521709B: MPII_Init_thread (mpir_init.c:230)
==24466==    by 0x5217864: MPIR_Init_impl (mpir_init.c:102)
==24466==    by 0x4F66814: internal_Init (init.c:53)
==24466==    by 0x4F66814: PMPI_Init (init.c:105)
==24466==    by 0x1088BA: main (test_mpiio.c:5)
==24466== 
==24466== 2 bytes in 1 blocks are still reachable in loss record 2 of 10
==24466==    at 0x4C2FB0F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==24466==    by 0x7B339B9: strdup (strdup.c:42)
==24466==    by 0x5216083: MPIR_Info_push (infoutil.c:90)
==24466==    by 0x5215B5E: MPIR_Info_set_impl (info_impl.c:153)
==24466==    by 0x5324E50: setup_single_nic (ofi_nic.c:166)
==24466==    by 0x53253C8: MPIDI_OFI_init_multi_nic (ofi_nic.c:128)
==24466==    by 0x5303947: MPIDI_OFI_init_local (ofi_init.c:568)
==24466==    by 0x52AD058: MPID_Init (ch4_init.c:508)
==24466==    by 0x521709B: MPII_Init_thread (mpir_init.c:230)
==24466==    by 0x5217864: MPIR_Init_impl (mpir_init.c:102)
==24466==    by 0x4F66814: internal_Init (init.c:53)
==24466==    by 0x4F66814: PMPI_Init (init.c:105)
==24466==    by 0x1088BA: main (test_mpiio.c:5)
==24466== 
==24466== 5 bytes in 1 blocks are indirectly lost in loss record 3 of 10
==24466==    at 0x4C2FB0F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==24466==    by 0x7B339B9: strdup (strdup.c:42)
==24466==    by 0x5362506: hwloc__add_info (topology.c:464)
==24466==    by 0x53901CF: hwloc__xml_import_cpukind (topology-xml.c:1815)
==24466==    by 0x5390EC3: hwloc_look_xml (topology-xml.c:2123)
==24466==    by 0x53691D3: hwloc_discover (topology.c:3356)
==24466==    by 0x536A8E2: hwloc_topology_load (topology.c:4033)
==24466==    by 0x52A9F59: MPII_hwtopo_init (mpir_hwtopo.c:216)
==24466==    by 0x5216C9B: MPII_Init_thread (mpir_init.c:169)
==24466==    by 0x5217864: MPIR_Init_impl (mpir_init.c:102)
==24466==    by 0x4F66814: internal_Init (init.c:53)
==24466==    by 0x4F66814: PMPI_Init (init.c:105)
==24466==    by 0x1088BA: main (test_mpiio.c:5)
==24466== 
==24466== 9 bytes in 1 blocks are still reachable in loss record 4 of 10
==24466==    at 0x4C2FB0F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==24466==    by 0x7B339B9: strdup (strdup.c:42)
==24466==    by 0x5216077: MPIR_Info_push (infoutil.c:89)
==24466==    by 0x5215B5E: MPIR_Info_set_impl (info_impl.c:153)
==24466==    by 0x5324E17: setup_single_nic (ofi_nic.c:164)
==24466==    by 0x53253C8: MPIDI_OFI_init_multi_nic (ofi_nic.c:128)
==24466==    by 0x5303947: MPIDI_OFI_init_local (ofi_init.c:568)
==24466==    by 0x52AD058: MPID_Init (ch4_init.c:508)
==24466==    by 0x521709B: MPII_Init_thread (mpir_init.c:230)
==24466==    by 0x5217864: MPIR_Init_impl (mpir_init.c:102)
==24466==    by 0x4F66814: internal_Init (init.c:53)
==24466==    by 0x4F66814: PMPI_Init (init.c:105)
==24466==    by 0x1088BA: main (test_mpiio.c:5)
==24466== 
==24466== 15 bytes in 1 blocks are still reachable in loss record 5 of 10
==24466==    at 0x4C2FB0F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==24466==    by 0x7B339B9: strdup (strdup.c:42)
==24466==    by 0x5216077: MPIR_Info_push (infoutil.c:89)
==24466==    by 0x5215B5E: MPIR_Info_set_impl (info_impl.c:153)
==24466==    by 0x5324E50: setup_single_nic (ofi_nic.c:166)
==24466==    by 0x53253C8: MPIDI_OFI_init_multi_nic (ofi_nic.c:128)
==24466==    by 0x5303947: MPIDI_OFI_init_local (ofi_init.c:568)
==24466==    by 0x52AD058: MPID_Init (ch4_init.c:508)
==24466==    by 0x521709B: MPII_Init_thread (mpir_init.c:230)
==24466==    by 0x5217864: MPIR_Init_impl (mpir_init.c:102)
==24466==    by 0x4F66814: internal_Init (init.c:53)
==24466==    by 0x4F66814: PMPI_Init (init.c:105)
==24466==    by 0x1088BA: main (test_mpiio.c:5)
==24466== 
==24466== 16 bytes in 1 blocks are indirectly lost in loss record 6 of 10
==24466==    at 0x4C2FB0F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==24466==    by 0x7B339B9: strdup (strdup.c:42)
==24466==    by 0x53624CC: hwloc__add_info (topology.c:461)
==24466==    by 0x53901CF: hwloc__xml_import_cpukind (topology-xml.c:1815)
==24466==    by 0x5390EC3: hwloc_look_xml (topology-xml.c:2123)
==24466==    by 0x53691D3: hwloc_discover (topology.c:3356)
==24466==    by 0x536A8E2: hwloc_topology_load (topology.c:4033)
==24466==    by 0x52A9F59: MPII_hwtopo_init (mpir_hwtopo.c:216)
==24466==    by 0x5216C9B: MPII_Init_thread (mpir_init.c:169)
==24466==    by 0x5217864: MPIR_Init_impl (mpir_init.c:102)
==24466==    by 0x4F66814: internal_Init (init.c:53)
==24466==    by 0x4F66814: PMPI_Init (init.c:105)
==24466==    by 0x1088BA: main (test_mpiio.c:5)
==24466== 
==24466== 32 bytes in 1 blocks are still reachable in loss record 7 of 10
==24466==    at 0x4C31B25: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==24466==    by 0x864D7E4: _dlerror_run (dlerror.c:140)
==24466==    by 0x864D050: dlopen@@GLIBC_2.2.5 (dlopen.c:87)
==24466==    by 0x71F92F9: ofi_load_dl_prov (fabric.c:692)
==24466==    by 0x71F92F9: fi_ini (fabric.c:841)
==24466==    by 0x71FA1CA: fi_getinfo (fabric.c:1094)
==24466==    by 0x53259F2: find_provider (init_provider.c:115)
==24466==    by 0x53259F2: MPIDI_OFI_find_provider (init_provider.c:71)
==24466==    by 0x5303935: MPIDI_OFI_init_local (ofi_init.c:564)
==24466==    by 0x52AD058: MPID_Init (ch4_init.c:508)
==24466==    by 0x521709B: MPII_Init_thread (mpir_init.c:230)
==24466==    by 0x5217864: MPIR_Init_impl (mpir_init.c:102)
==24466==    by 0x4F66814: internal_Init (init.c:53)
==24466==    by 0x4F66814: PMPI_Init (init.c:105)
==24466==    by 0x1088BA: main (test_mpiio.c:5)
==24466== 
==24466== 61 bytes in 1 blocks are still reachable in loss record 8 of 10
==24466==    at 0x4C2FB0F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==24466==    by 0x4017880: _dl_exception_create (dl-exception.c:77)
==24466==    by 0x7BFD250: _dl_signal_error (dl-error-skeleton.c:117)
==24466==    by 0x4009812: _dl_map_object (dl-load.c:2384)
==24466==    by 0x4014EE3: dl_open_worker (dl-open.c:235)
==24466==    by 0x7BFD2DE: _dl_catch_exception (dl-error-skeleton.c:196)
==24466==    by 0x40147C9: _dl_open (dl-open.c:605)
==24466==    by 0x864CF95: dlopen_doit (dlopen.c:66)
==24466==    by 0x7BFD2DE: _dl_catch_exception (dl-error-skeleton.c:196)
==24466==    by 0x7BFD36E: _dl_catch_error (dl-error-skeleton.c:215)
==24466==    by 0x864D734: _dlerror_run (dlerror.c:162)
==24466==    by 0x864D050: dlopen@@GLIBC_2.2.5 (dlopen.c:87)
==24466== 
==24466== 149 (128 direct, 21 indirect) bytes in 1 blocks are definitely lost in loss record 9 of 10
==24466==    at 0x4C2FA3F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==24466==    by 0x4C31D84: realloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==24466==    by 0x536248C: hwloc__add_info (topology.c:455)
==24466==    by 0x53901CF: hwloc__xml_import_cpukind (topology-xml.c:1815)
==24466==    by 0x5390EC3: hwloc_look_xml (topology-xml.c:2123)
==24466==    by 0x53691D3: hwloc_discover (topology.c:3356)
==24466==    by 0x536A8E2: hwloc_topology_load (topology.c:4033)
==24466==    by 0x52A9F59: MPII_hwtopo_init (mpir_hwtopo.c:216)
==24466==    by 0x5216C9B: MPII_Init_thread (mpir_init.c:169)
==24466==    by 0x5217864: MPIR_Init_impl (mpir_init.c:102)
==24466==    by 0x4F66814: internal_Init (init.c:53)
==24466==    by 0x4F66814: PMPI_Init (init.c:105)
==24466==    by 0x1088BA: main (test_mpiio.c:5)
==24466== 
==24466== 160 bytes in 1 blocks are still reachable in loss record 10 of 10
==24466==    at 0x4C2FB0F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==24466==    by 0x52160A9: MPIR_Info_push (infoutil.c:78)
==24466==    by 0x5215B5E: MPIR_Info_set_impl (info_impl.c:153)
==24466==    by 0x5324E17: setup_single_nic (ofi_nic.c:164)
==24466==    by 0x53253C8: MPIDI_OFI_init_multi_nic (ofi_nic.c:128)
==24466==    by 0x5303947: MPIDI_OFI_init_local (ofi_init.c:568)
==24466==    by 0x52AD058: MPID_Init (ch4_init.c:508)
==24466==    by 0x521709B: MPII_Init_thread (mpir_init.c:230)
==24466==    by 0x5217864: MPIR_Init_impl (mpir_init.c:102)
==24466==    by 0x4F66814: internal_Init (init.c:53)
==24466==    by 0x4F66814: PMPI_Init (init.c:105)
==24466==    by 0x1088BA: main (test_mpiio.c:5)
==24466== 
==24466== LEAK SUMMARY:
==24466==    definitely lost: 128 bytes in 1 blocks
==24466==    indirectly lost: 21 bytes in 2 blocks
==24466==      possibly lost: 0 bytes in 0 blocks
==24466==    still reachable: 281 bytes in 7 blocks

dqwu avatar Sep 21 '22 16:09 dqwu

The hwloc leak is tracked here - https://github.com/open-mpi/hwloc/pull/547

hzhou avatar Oct 11 '22 18:10 hzhou

I don't see the libfabric _dl_open leak -- maybe because I am building with the embedded libfabric -- but I do see some leaks from prov/opx. Tracked here - https://github.com/ofiwg/libfabric/issues/8091

hzhou avatar Oct 11 '22 21:10 hzhou