Revamp of the binding layer.
Rework the bindings. The main idea is to inherit the binding from the batch scheduler (or launcher), and refine it from there.
I'm trying it on leconte.
mpirun -np 2 --map-by socket hwloc-info --restrict binding package:0
Package L#0
[...]
cpuset = 0x0000ffff,0xf00000ff,0xfff00000
complete cpuset = 0x0000ffff,0xf00000ff,0xfff00000
allowed cpuset = 0x0000ffff,0xf00000ff,0xfff00000
nodeset = 0x00000002
complete nodeset = 0x00000002
allowed nodeset = 0x00000002
[...]
Package L#0
[...]
cpuset = 0x0fffff00,0x000fffff
complete cpuset = 0x0fffff00,0x000fffff
allowed cpuset = 0x0fffff00,0x000fffff
nodeset = 0x00000001
complete nodeset = 0x00000001
allowed nodeset = 0x00000001
[...]
I interpret this as follows: when running `mpirun -np 2 --map-by socket` on this machine (with `module load openmpi`), there are two processes, and rank 0 is given a different set of cores than rank 1 (the cpusets look complicated, but they do appear to be mutually exclusive).
Now, I run a parsec test with this PR:
mpirun -np 2 --map-by socket ./tests/apps/stencil/testing_stencil_1D -M 40960 -N 40960 -t 16 -T 16 -P 2
Process binding [rank 0]: cpuset [ALLOWED ]: 0x0fffff00,0x000fffff
Process binding [rank 0]: cpuset [USED ]: 0x000fffff
Process binding [rank 0]: cpuset [FREE ]: 0x0fffff00,0x0
W@00000 parsec_hwloc: couldn't bind to mask cpuset 0x0
Process binding [rank 0]: cpuset [ALLOWED ]: 0x0000ffff,0xf00000ff,0xfff00000
Process binding [rank 0]: cpuset [USED ]: 0x000000ff,0xfff00000
Process binding [rank 0]: cpuset [FREE ]: 0x0000ffff,0xf0000000,0x0
W@00001 parsec_hwloc: couldn't bind to mask cpuset 0x0
i@00000 CPU Device: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
Parsec Streams : 20
clockRate (GHz) : 2.20
peak Gflops : double 176.0000, single 352.0000
i@00001 CPU Device: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
Parsec Streams : 20
clockRate (GHz) : 2.20
peak Gflops : double 176.0000, single 352.0000
i@00000 Virtual Process Map with 1 VPs...
i@00000 Virtual Process of index 0 has 20 threads and cpuset 0x000fffff
i@00000 Thread 0 of VP 0 can be bound on cores 0x00000001
i@00000 Thread 1 of VP 0 can be bound on cores 0x00000002
i@00000 Thread 2 of VP 0 can be bound on cores 0x00000004
i@00000 Thread 3 of VP 0 can be bound on cores 0x00000008
i@00000 Thread 4 of VP 0 can be bound on cores 0x00000010
i@00000 Thread 5 of VP 0 can be bound on cores 0x00000020
i@00000 Thread 6 of VP 0 can be bound on cores 0x00000040
i@00000 Thread 7 of VP 0 can be bound on cores 0x00000080
i@00000 Thread 8 of VP 0 can be bound on cores 0x00000100
i@00000 Thread 9 of VP 0 can be bound on cores 0x00000200
i@00000 Thread 10 of VP 0 can be bound on cores 0x00000400
i@00000 Thread 11 of VP 0 can be bound on cores 0x00000800
i@00000 Thread 12 of VP 0 can be bound on cores 0x00001000
i@00000 Thread 13 of VP 0 can be bound on cores 0x00002000
i@00000 Thread 14 of VP 0 can be bound on cores 0x00004000
i@00000 Thread 15 of VP 0 can be bound on cores 0x00008000
i@00000 Thread 16 of VP 0 can be bound on cores 0x00010000
i@00000 Thread 17 of VP 0 can be bound on cores 0x00020000
i@00000 Thread 18 of VP 0 can be bound on cores 0x00040000
i@00000 Thread 19 of VP 0 can be bound on cores 0x00080000
i@00001 Virtual Process Map with 1 VPs...
i@00001 Virtual Process of index 0 has 20 threads and cpuset 0x000fffff
i@00001 Thread 0 of VP 0 can be bound on cores 0x00000001
i@00001 Thread 1 of VP 0 can be bound on cores 0x00000002
i@00001 Thread 2 of VP 0 can be bound on cores 0x00000004
i@00001 Thread 3 of VP 0 can be bound on cores 0x00000008
i@00001 Thread 4 of VP 0 can be bound on cores 0x00000010
i@00001 Thread 5 of VP 0 can be bound on cores 0x00000020
i@00001 Thread 6 of VP 0 can be bound on cores 0x00000040
i@00001 Thread 7 of VP 0 can be bound on cores 0x00000080
i@00001 Thread 8 of VP 0 can be bound on cores 0x00000100
i@00001 Thread 9 of VP 0 can be bound on cores 0x00000200
i@00001 Thread 10 of VP 0 can be bound on cores 0x00000400
i@00001 Thread 11 of VP 0 can be bound on cores 0x00000800
i@00001 Thread 12 of VP 0 can be bound on cores 0x00001000
i@00001 Thread 13 of VP 0 can be bound on cores 0x00002000
i@00001 Thread 14 of VP 0 can be bound on cores 0x00004000
i@00001 Thread 15 of VP 0 can be bound on cores 0x00008000
i@00001 Thread 16 of VP 0 can be bound on cores 0x00010000
i@00001 Thread 17 of VP 0 can be bound on cores 0x00020000
i@00001 Thread 18 of VP 0 can be bound on cores 0x00040000
i@00001 Thread 19 of VP 0 can be bound on cores 0x00080000
From this output, it looks like the two ranks are sharing the same cores?
A run of htop in a parallel terminal shows that only cores 0, 1, 13, 20, 21, 40, 41 and 65 have work to do (plus occasionally a little on 77 and 46); definitely not all cores are active, and the run is pretty slow.
Also, I tried to rebase the PR on the current master, but there are conflicts I'm not sure how to solve.
I was not able to find a way to translate between relative and absolute core numbering, so the reported bindings are relative to the allowed procs, and not absolute (as one would expect). Let me look again at the documentation to see if there is a way.
@bosilca ping ... need it badly :)
@bosilca ping again .. real showstopper
doing the merge now
The following command produces the correct binding for testing (as witnessed from hwloc-ls, the binding is correctly restricted by mpiexec to the correct sockets).
PARSEC_MCA_device_cuda_memory_use=10 salloc -N1 -whexane "/apps/spacks/2023-08-14/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/openmpi-4.1.5-2rgaqk2wseegpmbdbbygvwrljccjaqsk/bin/mpiexec" --tag-output --oversubscribe -n 2 --bind-to socket --map-by socket --report-bindings hwloc-ls --restrict binding -c --no-io
salloc: Granted job allocation 5788
[1,0]<stderr>:[hexane:3980861] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
[1,1]<stderr>:[hexane:3980861] MCW rank 1 bound to socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
[1,0]<stdout>:Machine (62GB total) cpuset=0x55555555
[1,0]<stdout>: Package L#0 cpuset=0x55555555
[1,0]<stdout>: NUMANode L#0 (P#0 31GB) cpuset=0x55555555
[1,0]<stdout>: L3 L#0 (12MB) cpuset=0x55555555
[1,0]<stdout>: L2 L#0 (1280KB) cpuset=0x00010001
[1,0]<stdout>: L1d L#0 (48KB) cpuset=0x00010001
[1,0]<stdout>: L1i L#0 (32KB) cpuset=0x00010001
[1,0]<stdout>: Core L#0 cpuset=0x00010001
[1,0]<stdout>: PU L#0 (P#0) cpuset=0x00000001
[1,0]<stdout>: PU L#1 (P#16) cpuset=0x00010000
[1,0]<stdout>: L2 L#1 (1280KB) cpuset=0x00040004
[1,0]<stdout>: L1d L#1 (48KB) cpuset=0x00040004
[1,0]<stdout>: L1i L#1 (32KB) cpuset=0x00040004
[1,0]<stdout>: Core L#1 cpuset=0x00040004
[1,0]<stdout>: PU L#2 (P#2) cpuset=0x00000004
[1,0]<stdout>: PU L#3 (P#18) cpuset=0x00040000
[1,0]<stdout>: L2 L#2 (1280KB) cpuset=0x00100010
[1,0]<stdout>: L1d L#2 (48KB) cpuset=0x00100010
[1,0]<stdout>: L1i L#2 (32KB) cpuset=0x00100010
[1,0]<stdout>: Core L#2 cpuset=0x00100010
[1,0]<stdout>: PU L#4 (P#4) cpuset=0x00000010
[1,0]<stdout>: PU L#5 (P#20) cpuset=0x00100000
[1,0]<stdout>: L2 L#3 (1280KB) cpuset=0x00400040
[1,0]<stdout>: L1d L#3 (48KB) cpuset=0x00400040
[1,0]<stdout>: L1i L#3 (32KB) cpuset=0x00400040
[1,0]<stdout>: Core L#3 cpuset=0x00400040
[1,0]<stdout>: PU L#6 (P#6) cpuset=0x00000040
[1,0]<stdout>: PU L#7 (P#22) cpuset=0x00400000
[1,0]<stdout>: L2 L#4 (1280KB) cpuset=0x01000100
[1,0]<stdout>: L1d L#4 (48KB) cpuset=0x01000100
[1,0]<stdout>: L1i L#4 (32KB) cpuset=0x01000100
[1,0]<stdout>: Core L#4 cpuset=0x01000100
[1,0]<stdout>: PU L#8 (P#8) cpuset=0x00000100
[1,0]<stdout>: PU L#9 (P#24) cpuset=0x01000000
[1,0]<stdout>: L2 L#5 (1280KB) cpuset=0x04000400
[1,0]<stdout>: L1d L#5 (48KB) cpuset=0x04000400
[1,0]<stdout>: L1i L#5 (32KB) cpuset=0x04000400
[1,0]<stdout>: Core L#5 cpuset=0x04000400
[1,0]<stdout>: PU L#10 (P#10) cpuset=0x00000400
[1,0]<stdout>: PU L#11 (P#26) cpuset=0x04000000
[1,0]<stdout>: L2 L#6 (1280KB) cpuset=0x10001000
[1,0]<stdout>: L1d L#6 (48KB) cpuset=0x10001000
[1,0]<stdout>: L1i L#6 (32KB) cpuset=0x10001000
[1,0]<stdout>: Core L#6 cpuset=0x10001000
[1,0]<stdout>: PU L#12 (P#12) cpuset=0x00001000
[1,0]<stdout>: PU L#13 (P#28) cpuset=0x10000000
[1,0]<stdout>: L2 L#7 (1280KB) cpuset=0x40004000
[1,0]<stdout>: L1d L#7 (48KB) cpuset=0x40004000
[1,0]<stdout>: L1i L#7 (32KB) cpuset=0x40004000
[1,0]<stdout>: Core L#7 cpuset=0x40004000
[1,0]<stdout>: PU L#14 (P#14) cpuset=0x00004000
[1,0]<stdout>: PU L#15 (P#30) cpuset=0x40000000
[1,0]<stdout>: Package L#1 cpuset=0x0
[1,0]<stdout>: NUMANode L#1 (P#1 31GB) cpuset=0x0
[1,1]<stdout>:Machine (62GB total) cpuset=0xaaaaaaaa
[1,1]<stdout>: Package L#0 cpuset=0xaaaaaaaa
[1,1]<stdout>: NUMANode L#0 (P#1 31GB) cpuset=0xaaaaaaaa
[1,1]<stdout>: L3 L#0 (12MB) cpuset=0xaaaaaaaa
[1,1]<stdout>: L2 L#0 (1280KB) cpuset=0x00020002
[1,1]<stdout>: L1d L#0 (48KB) cpuset=0x00020002
[1,1]<stdout>: L1i L#0 (32KB) cpuset=0x00020002
[1,1]<stdout>: Core L#0 cpuset=0x00020002
[1,1]<stdout>: PU L#0 (P#1) cpuset=0x00000002
[1,1]<stdout>: PU L#1 (P#17) cpuset=0x00020000
[1,1]<stdout>: L2 L#1 (1280KB) cpuset=0x00080008
[1,1]<stdout>: L1d L#1 (48KB) cpuset=0x00080008
[1,1]<stdout>: L1i L#1 (32KB) cpuset=0x00080008
[1,1]<stdout>: Core L#1 cpuset=0x00080008
[1,1]<stdout>: PU L#2 (P#3) cpuset=0x00000008
[1,1]<stdout>: PU L#3 (P#19) cpuset=0x00080000
[1,1]<stdout>: L2 L#2 (1280KB) cpuset=0x00200020
[1,1]<stdout>: L1d L#2 (48KB) cpuset=0x00200020
[1,1]<stdout>: L1i L#2 (32KB) cpuset=0x00200020
[1,1]<stdout>: Core L#2 cpuset=0x00200020
[1,1]<stdout>: PU L#4 (P#5) cpuset=0x00000020
[1,1]<stdout>: PU L#5 (P#21) cpuset=0x00200000
[1,1]<stdout>: L2 L#3 (1280KB) cpuset=0x00800080
[1,1]<stdout>: L1d L#3 (48KB) cpuset=0x00800080
[1,1]<stdout>: L1i L#3 (32KB) cpuset=0x00800080
[1,1]<stdout>: Core L#3 cpuset=0x00800080
[1,1]<stdout>: PU L#6 (P#7) cpuset=0x00000080
[1,1]<stdout>: PU L#7 (P#23) cpuset=0x00800000
[1,1]<stdout>: L2 L#4 (1280KB) cpuset=0x02000200
[1,1]<stdout>: L1d L#4 (48KB) cpuset=0x02000200
[1,1]<stdout>: L1i L#4 (32KB) cpuset=0x02000200
[1,1]<stdout>: Core L#4 cpuset=0x02000200
[1,1]<stdout>: PU L#8 (P#9) cpuset=0x00000200
[1,1]<stdout>: PU L#9 (P#25) cpuset=0x02000000
[1,1]<stdout>: L2 L#5 (1280KB) cpuset=0x08000800
[1,1]<stdout>: L1d L#5 (48KB) cpuset=0x08000800
[1,1]<stdout>: L1i L#5 (32KB) cpuset=0x08000800
[1,1]<stdout>: Core L#5 cpuset=0x08000800
[1,1]<stdout>: PU L#10 (P#11) cpuset=0x00000800
[1,1]<stdout>: PU L#11 (P#27) cpuset=0x08000000
[1,1]<stdout>: L2 L#6 (1280KB) cpuset=0x20002000
[1,1]<stdout>: L1d L#6 (48KB) cpuset=0x20002000
[1,1]<stdout>: L1i L#6 (32KB) cpuset=0x20002000
[1,1]<stdout>: Core L#6 cpuset=0x20002000
[1,1]<stdout>: PU L#12 (P#13) cpuset=0x00002000
[1,1]<stdout>: PU L#13 (P#29) cpuset=0x20000000
[1,1]<stdout>: L2 L#7 (1280KB) cpuset=0x80008000
[1,1]<stdout>: L1d L#7 (48KB) cpuset=0x80008000
[1,1]<stdout>: L1i L#7 (32KB) cpuset=0x80008000
[1,1]<stdout>: Core L#7 cpuset=0x80008000
[1,1]<stdout>: PU L#14 (P#15) cpuset=0x00008000
[1,1]<stdout>: PU L#15 (P#31) cpuset=0x80000000
[1,1]<stdout>: Package L#1 cpuset=0x0
[1,1]<stdout>: NUMANode L#1 (P#0 31GB) cpuset=0x0
salloc: Relinquishing job allocation 5788
The resulting binding in PaRSEC does not appear correct:
PARSEC_MCA_device_cuda_memory_use=10 salloc -N1 -whexane "/apps/spacks/2023-08-14/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/openmpi-4.1.5-2rgaqk2wseegpmbdbbygvwrljccjaqsk/bin/mpiexec" --tag-output --oversubscribe -n 2 --bind-to socket --map-by socket --report-bindings build.cuda/parsec/tests/api/init_fini --mca runtime_report_bindings 1
salloc: Granted job allocation 5790
[1,0]<stderr>:[hexane:3981005] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
[1,1]<stderr>:[hexane:3981005] MCW rank 1 bound to socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
[1,1]<stdout>:Process binding [rank 0]: cpuset [ALLOWED ]: 0xaaaaaaaa
[1,1]<stdout>:Process binding [rank 0]: cpuset [USED ]: 0x0000aaaa
[1,1]<stdout>:Process binding [rank 0]: cpuset [FREE ]: 0xaaaa0000
[1,0]<stdout>:Process binding [rank 0]: cpuset [ALLOWED ]: 0x55555555
[1,0]<stdout>:Process binding [rank 0]: cpuset [USED ]: 0x00005555
[1,0]<stdout>:Process binding [rank 0]: cpuset [FREE ]: 0x55550000
[1,0]<stderr>:W@00000 /!\ DEBUG LEVEL WILL PROBABLY REDUCE THE PERFORMANCE OF THIS RUN /!\.
[1,0]<stderr>:i@00000 Virtual Process Map with 1 VPs...
[1,0]<stderr>:i@00000 Virtual Process of index 0 has 8 threads and logical cpuset 0x000000ff
[1,0]<stderr>: physical cpuset 0x55555555
[1,0]<stderr>:i@00000 Thread 0 of VP 0 can be bound on logical cores 0x00000001 (physical cores 0x00010001)
[1,0]<stderr>:i@00000 Thread 1 of VP 0 can be bound on logical cores 0x00000002 (physical cores 0x00040004)
[1,0]<stderr>:i@00000 Thread 2 of VP 0 can be bound on logical cores 0x00000004 (physical cores 0x00100010)
[1,0]<stderr>:i@00000 Thread 3 of VP 0 can be bound on logical cores 0x00000008 (physical cores 0x00400040)
[1,0]<stderr>:i@00000 Thread 4 of VP 0 can be bound on logical cores 0x00000010 (physical cores 0x01000100)
[1,0]<stderr>:i@00000 Thread 5 of VP 0 can be bound on logical cores 0x00000020 (physical cores 0x04000400)
[1,0]<stderr>:i@00000 Thread 6 of VP 0 can be bound on logical cores 0x00000040 (physical cores 0x10001000)
[1,0]<stderr>:i@00000 Thread 7 of VP 0 can be bound on logical cores 0x00000080 (physical cores 0x40004000)
[1,1]<stderr>:i@00001 Virtual Process Map with 1 VPs...
[1,1]<stderr>:i@00001 Virtual Process of index 0 has 8 threads and logical cpuset 0x000000ff
[1,1]<stderr>: physical cpuset 0x55555555
[1,1]<stderr>:i@00001 Thread 0 of VP 0 can be bound on logical cores 0x00000001 (physical cores 0x00010001)
[1,1]<stderr>:i@00001 Thread 1 of VP 0 can be bound on logical cores 0x00000002 (physical cores 0x00040004)
[1,1]<stderr>:i@00001 Thread 2 of VP 0 can be bound on logical cores 0x00000004 (physical cores 0x00100010)
[1,1]<stderr>:i@00001 Thread 3 of VP 0 can be bound on logical cores 0x00000008 (physical cores 0x00400040)
[1,1]<stderr>:i@00001 Thread 4 of VP 0 can be bound on logical cores 0x00000010 (physical cores 0x01000100)
[1,1]<stderr>:i@00001 Thread 5 of VP 0 can be bound on logical cores 0x00000020 (physical cores 0x04000400)
[1,1]<stderr>:i@00001 Thread 6 of VP 0 can be bound on logical cores 0x00000040 (physical cores 0x10001000)
[1,1]<stderr>:i@00001 Thread 7 of VP 0 can be bound on logical cores 0x00000080 (physical cores 0x40004000)
salloc: Relinquishing job allocation 5790
It looks like the message above is partially misleading: we initialize the vpmap before we extract the ALLOWED mask, so vpmap_from_flat initializes something that does not match the real final binding. This is a problem because using VPs will behave differently from not using VPs, and the output is misleading; however, the actual binding produced should be correct when VPs are not used.
A potential solution is to move the initialization of the ALLOWED mask earlier in parsec_init, and use it the same way we use it in the parse_binding_parameters function to restrict vpmap creation; however, that may be too restrictive for vpmaps read from a file or built otherwise. Thoughts?
Looks like the inherited binding is correct (aside from the vpmap problem above): comparing `--bind-to none -c 8` vs `--bind-to socket`, the second case delivers double the performance, uses 16 hardware cores, and does not emit the oversubscription warning.
PARSEC_MCA_device_cuda_memory_use=10 salloc -N1 --ntasks-per-node=2 -whexane "/apps/spacks/2023-08-14/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/openmpi-4.1.5-2rgaqk2wseegpmbdbbygvwrljccjaqsk/bin/mpiexec" --tag-output -n 2 --bind-to none --map-by socket --report-bindings build.cuda/tests/testing_dpotrf -g 0 -N 30000 -x -v=4 -t 384 -c 8
[1,1]<stderr>:W@00001 /!\ PERFORMANCE MIGHT BE REDUCED /!\: Multiple PaRSEC processes on the same node may share the same physical core(s);
[1,1]<stderr>: This is often unintentional, and will perform poorly.
[1,1]<stderr>: Note that in managed environments (e.g., ALPS, jsrun), the launcher may set `cgroups`
[1,1]<stderr>: and hide the real binding from PaRSEC; if you verified that the binding is correct,
[1,1]<stderr>: this message can be silenced using the MCA argument `runtime_warn_slow_binding`.
[1,0]<stderr>:#+++++ cores detected : 8
[1,0]<stderr>:#+++++ nodes x cores + gpu : 2 x 8 + 0 (16+0)
[1,0]<stdout>:[****] TIME(s) 31.77802 : dpotrf PxQxg= 2 1 0 NB= 384 N= 30000 : 283.228825 gflops - ENQ&PROG&DEST 31.80433 : 282.994460 gflops - ENQ 0.02591 - DEST 0.00040
...
PARSEC_MCA_device_cuda_memory_use=10 salloc -N1 --ntasks-per-node=2 -whexane "/apps/spacks/2023-08-14/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/openmpi-4.1.5-2rgaqk2wseegpmbdbbygvwrljccjaqsk/bin/mpiexec" --tag-output -n 2 --bind-to socket --map-by socket --report-bindings build.cuda/tests/testing_dpotrf -g 0 -N 30000 -x -v=4 -t 384
[1,0]<stderr>:#+++++ cores detected : 8
[1,0]<stderr>:#+++++ nodes x cores + gpu : 2 x 8 + 0 (16+0)
[1,0]<stdout>:[****] TIME(s) 19.26431 : dpotrf PxQxg= 2 1 0 NB= 384 N= 30000 : 467.208496 gflops - ENQ&PROG&DEST 19.26601 : 467.167396 gflops - ENQ 0.00008 - DEST 0.00161
The vpmap initialization allocates and fills `parsec_vpmap[vp].threads[t+ht].cpuset = HWLOC_ALLOC();` with all sorts of intricate values (which do not abide by the restricted mask), but these end up being write-only variables.
~~At this point I propose we merge this PR with the broken vpmap, and create a tracking issue to fix them later in v4.1. Poll below.~~
At this point I will excise the rework on the vpmap, merge the rework that is effective in the flat case, and defer completion of complex vpmap process binding to 4.1.