Revamp of the binding layer.
Rework the bindings. The main idea is to inherit the binding from the batch scheduler (or launcher), and refine it from there.
I'm trying it on leconte.
mpirun -np 2 --map-by socket hwloc-info --restrict binding package:0
Package L#0
[...]
cpuset = 0x0000ffff,0xf00000ff,0xfff00000
complete cpuset = 0x0000ffff,0xf00000ff,0xfff00000
allowed cpuset = 0x0000ffff,0xf00000ff,0xfff00000
nodeset = 0x00000002
complete nodeset = 0x00000002
allowed nodeset = 0x00000002
[...]
Package L#0
[...]
cpuset = 0x0fffff00,0x000fffff
complete cpuset = 0x0fffff00,0x000fffff
allowed cpuset = 0x0fffff00,0x000fffff
nodeset = 0x00000001
complete nodeset = 0x00000001
allowed nodeset = 0x00000001
[...]
I interpret this as follows: when running `mpirun -np 2 --map-by socket` on this machine (with `module load openmpi`), there are two processes, and rank 0 is given a different set of cores than rank 1 (the cpusets look complicated, but they do appear to be mutually exclusive).
Now, I run a parsec test with this PR:
mpirun -np 2 --map-by socket ./tests/apps/stencil/testing_stencil_1D -M 40960 -N 40960 -t 16 -T 16 -P 2
Process binding [rank 0]: cpuset [ALLOWED ]: 0x0fffff00,0x000fffff
Process binding [rank 0]: cpuset [USED ]: 0x000fffff
Process binding [rank 0]: cpuset [FREE ]: 0x0fffff00,0x0
W@00000 parsec_hwloc: couldn't bind to mask cpuset 0x0
Process binding [rank 0]: cpuset [ALLOWED ]: 0x0000ffff,0xf00000ff,0xfff00000
Process binding [rank 0]: cpuset [USED ]: 0x000000ff,0xfff00000
Process binding [rank 0]: cpuset [FREE ]: 0x0000ffff,0xf0000000,0x0
W@00001 parsec_hwloc: couldn't bind to mask cpuset 0x0
i@00000 CPU Device: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
Parsec Streams : 20
clockRate (GHz) : 2.20
peak Gflops : double 176.0000, single 352.0000
i@00001 CPU Device: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
Parsec Streams : 20
clockRate (GHz) : 2.20
peak Gflops : double 176.0000, single 352.0000
i@00000 Virtual Process Map with 1 VPs...
i@00000 Virtual Process of index 0 has 20 threads and cpuset 0x000fffff
i@00000 Thread 0 of VP 0 can be bound on cores 0x00000001
i@00000 Thread 1 of VP 0 can be bound on cores 0x00000002
i@00000 Thread 2 of VP 0 can be bound on cores 0x00000004
i@00000 Thread 3 of VP 0 can be bound on cores 0x00000008
i@00000 Thread 4 of VP 0 can be bound on cores 0x00000010
i@00000 Thread 5 of VP 0 can be bound on cores 0x00000020
i@00000 Thread 6 of VP 0 can be bound on cores 0x00000040
i@00000 Thread 7 of VP 0 can be bound on cores 0x00000080
i@00000 Thread 8 of VP 0 can be bound on cores 0x00000100
i@00000 Thread 9 of VP 0 can be bound on cores 0x00000200
i@00000 Thread 10 of VP 0 can be bound on cores 0x00000400
i@00000 Thread 11 of VP 0 can be bound on cores 0x00000800
i@00000 Thread 12 of VP 0 can be bound on cores 0x00001000
i@00000 Thread 13 of VP 0 can be bound on cores 0x00002000
i@00000 Thread 14 of VP 0 can be bound on cores 0x00004000
i@00000 Thread 15 of VP 0 can be bound on cores 0x00008000
i@00000 Thread 16 of VP 0 can be bound on cores 0x00010000
i@00000 Thread 17 of VP 0 can be bound on cores 0x00020000
i@00000 Thread 18 of VP 0 can be bound on cores 0x00040000
i@00000 Thread 19 of VP 0 can be bound on cores 0x00080000
i@00001 Virtual Process Map with 1 VPs...
i@00001 Virtual Process of index 0 has 20 threads and cpuset 0x000fffff
i@00001 Thread 0 of VP 0 can be bound on cores 0x00000001
i@00001 Thread 1 of VP 0 can be bound on cores 0x00000002
i@00001 Thread 2 of VP 0 can be bound on cores 0x00000004
i@00001 Thread 3 of VP 0 can be bound on cores 0x00000008
i@00001 Thread 4 of VP 0 can be bound on cores 0x00000010
i@00001 Thread 5 of VP 0 can be bound on cores 0x00000020
i@00001 Thread 6 of VP 0 can be bound on cores 0x00000040
i@00001 Thread 7 of VP 0 can be bound on cores 0x00000080
i@00001 Thread 8 of VP 0 can be bound on cores 0x00000100
i@00001 Thread 9 of VP 0 can be bound on cores 0x00000200
i@00001 Thread 10 of VP 0 can be bound on cores 0x00000400
i@00001 Thread 11 of VP 0 can be bound on cores 0x00000800
i@00001 Thread 12 of VP 0 can be bound on cores 0x00001000
i@00001 Thread 13 of VP 0 can be bound on cores 0x00002000
i@00001 Thread 14 of VP 0 can be bound on cores 0x00004000
i@00001 Thread 15 of VP 0 can be bound on cores 0x00008000
i@00001 Thread 16 of VP 0 can be bound on cores 0x00010000
i@00001 Thread 17 of VP 0 can be bound on cores 0x00020000
i@00001 Thread 18 of VP 0 can be bound on cores 0x00040000
i@00001 Thread 19 of VP 0 can be bound on cores 0x00080000
From this output, it looks like the two ranks are sharing the same cores?
A run of htop in a parallel terminal shows that only cores 0, 1, 13, 20, 21, 40, 41 and 65 have work to do (plus occasionally a little on 77 and 46); definitely not all cores are active, and the run is pretty slow.
Also, I tried to rebase the PR on the current master, but there are conflicts I'm not sure how to solve.
I was not able to find a way to translate between relative and absolute core numbering, so the reported bindings are relative to the allowed procs, and not absolute (as one would expect). Let me look again at the documentation to see if there is a way.
@bosilca ping ... need it badly :)
@bosilca ping again .. real showstopper
doing the merge now
The following command produces the correct binding for testing (as witnessed from hwloc-ls, the binding is correctly restricted by mpiexec to the correct sockets).
PARSEC_MCA_device_cuda_memory_use=10 salloc -N1 -whexane "/apps/spacks/2023-08-14/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/openmpi-4.1.5-2rgaqk2wseegpmbdbbygvwrljccjaqsk/bin/mpiexec" --tag-output --oversubscribe -n 2 --bind-to socket --map-by socket --report-bindings hwloc-ls --restrict binding -c --no-io
salloc: Granted job allocation 5788
[1,0]<stderr>:[hexane:3980861] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
[1,1]<stderr>:[hexane:3980861] MCW rank 1 bound to socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
[1,0]<stdout>:Machine (62GB total) cpuset=0x55555555
[1,0]<stdout>: Package L#0 cpuset=0x55555555
[1,0]<stdout>: NUMANode L#0 (P#0 31GB) cpuset=0x55555555
[1,0]<stdout>: L3 L#0 (12MB) cpuset=0x55555555
[1,0]<stdout>: L2 L#0 (1280KB) cpuset=0x00010001
[1,0]<stdout>: L1d L#0 (48KB) cpuset=0x00010001
[1,0]<stdout>: L1i L#0 (32KB) cpuset=0x00010001
[1,0]<stdout>: Core L#0 cpuset=0x00010001
[1,0]<stdout>: PU L#0 (P#0) cpuset=0x00000001
[1,0]<stdout>: PU L#1 (P#16) cpuset=0x00010000
[1,0]<stdout>: L2 L#1 (1280KB) cpuset=0x00040004
[1,0]<stdout>: L1d L#1 (48KB) cpuset=0x00040004
[1,0]<stdout>: L1i L#1 (32KB) cpuset=0x00040004
[1,0]<stdout>: Core L#1 cpuset=0x00040004
[1,0]<stdout>: PU L#2 (P#2) cpuset=0x00000004
[1,0]<stdout>: PU L#3 (P#18) cpuset=0x00040000
[1,0]<stdout>: L2 L#2 (1280KB) cpuset=0x00100010
[1,0]<stdout>: L1d L#2 (48KB) cpuset=0x00100010
[1,0]<stdout>: L1i L#2 (32KB) cpuset=0x00100010
[1,0]<stdout>: Core L#2 cpuset=0x00100010
[1,0]<stdout>: PU L#4 (P#4) cpuset=0x00000010
[1,0]<stdout>: PU L#5 (P#20) cpuset=0x00100000
[1,0]<stdout>: L2 L#3 (1280KB) cpuset=0x00400040
[1,0]<stdout>: L1d L#3 (48KB) cpuset=0x00400040
[1,0]<stdout>: L1i L#3 (32KB) cpuset=0x00400040
[1,0]<stdout>: Core L#3 cpuset=0x00400040
[1,0]<stdout>: PU L#6 (P#6) cpuset=0x00000040
[1,0]<stdout>: PU L#7 (P#22) cpuset=0x00400000
[1,0]<stdout>: L2 L#4 (1280KB) cpuset=0x01000100
[1,0]<stdout>: L1d L#4 (48KB) cpuset=0x01000100
[1,0]<stdout>: L1i L#4 (32KB) cpuset=0x01000100
[1,0]<stdout>: Core L#4 cpuset=0x01000100
[1,0]<stdout>: PU L#8 (P#8) cpuset=0x00000100
[1,0]<stdout>: PU L#9 (P#24) cpuset=0x01000000
[1,0]<stdout>: L2 L#5 (1280KB) cpuset=0x04000400
[1,0]<stdout>: L1d L#5 (48KB) cpuset=0x04000400
[1,0]<stdout>: L1i L#5 (32KB) cpuset=0x04000400
[1,0]<stdout>: Core L#5 cpuset=0x04000400
[1,0]<stdout>: PU L#10 (P#10) cpuset=0x00000400
[1,0]<stdout>: PU L#11 (P#26) cpuset=0x04000000
[1,0]<stdout>: L2 L#6 (1280KB) cpuset=0x10001000
[1,0]<stdout>: L1d L#6 (48KB) cpuset=0x10001000
[1,0]<stdout>: L1i L#6 (32KB) cpuset=0x10001000
[1,0]<stdout>: Core L#6 cpuset=0x10001000
[1,0]<stdout>: PU L#12 (P#12) cpuset=0x00001000
[1,0]<stdout>: PU L#13 (P#28) cpuset=0x10000000
[1,0]<stdout>: L2 L#7 (1280KB) cpuset=0x40004000
[1,0]<stdout>: L1d L#7 (48KB) cpuset=0x40004000
[1,0]<stdout>: L1i L#7 (32KB) cpuset=0x40004000
[1,0]<stdout>: Core L#7 cpuset=0x40004000
[1,0]<stdout>: PU L#14 (P#14) cpuset=0x00004000
[1,0]<stdout>: PU L#15 (P#30) cpuset=0x40000000
[1,0]<stdout>: Package L#1 cpuset=0x0
[1,0]<stdout>: NUMANode L#1 (P#1 31GB) cpuset=0x0
[1,1]<stdout>:Machine (62GB total) cpuset=0xaaaaaaaa
[1,1]<stdout>: Package L#0 cpuset=0xaaaaaaaa
[1,1]<stdout>: NUMANode L#0 (P#1 31GB) cpuset=0xaaaaaaaa
[1,1]<stdout>: L3 L#0 (12MB) cpuset=0xaaaaaaaa
[1,1]<stdout>: L2 L#0 (1280KB) cpuset=0x00020002
[1,1]<stdout>: L1d L#0 (48KB) cpuset=0x00020002
[1,1]<stdout>: L1i L#0 (32KB) cpuset=0x00020002
[1,1]<stdout>: Core L#0 cpuset=0x00020002
[1,1]<stdout>: PU L#0 (P#1) cpuset=0x00000002
[1,1]<stdout>: PU L#1 (P#17) cpuset=0x00020000
[1,1]<stdout>: L2 L#1 (1280KB) cpuset=0x00080008
[1,1]<stdout>: L1d L#1 (48KB) cpuset=0x00080008
[1,1]<stdout>: L1i L#1 (32KB) cpuset=0x00080008
[1,1]<stdout>: Core L#1 cpuset=0x00080008
[1,1]<stdout>: PU L#2 (P#3) cpuset=0x00000008
[1,1]<stdout>: PU L#3 (P#19) cpuset=0x00080000
[1,1]<stdout>: L2 L#2 (1280KB) cpuset=0x00200020
[1,1]<stdout>: L1d L#2 (48KB) cpuset=0x00200020
[1,1]<stdout>: L1i L#2 (32KB) cpuset=0x00200020
[1,1]<stdout>: Core L#2 cpuset=0x00200020
[1,1]<stdout>: PU L#4 (P#5) cpuset=0x00000020
[1,1]<stdout>: PU L#5 (P#21) cpuset=0x00200000
[1,1]<stdout>: L2 L#3 (1280KB) cpuset=0x00800080
[1,1]<stdout>: L1d L#3 (48KB) cpuset=0x00800080
[1,1]<stdout>: L1i L#3 (32KB) cpuset=0x00800080
[1,1]<stdout>: Core L#3 cpuset=0x00800080
[1,1]<stdout>: PU L#6 (P#7) cpuset=0x00000080
[1,1]<stdout>: PU L#7 (P#23) cpuset=0x00800000
[1,1]<stdout>: L2 L#4 (1280KB) cpuset=0x02000200
[1,1]<stdout>: L1d L#4 (48KB) cpuset=0x02000200
[1,1]<stdout>: L1i L#4 (32KB) cpuset=0x02000200
[1,1]<stdout>: Core L#4 cpuset=0x02000200
[1,1]<stdout>: PU L#8 (P#9) cpuset=0x00000200
[1,1]<stdout>: PU L#9 (P#25) cpuset=0x02000000
[1,1]<stdout>: L2 L#5 (1280KB) cpuset=0x08000800
[1,1]<stdout>: L1d L#5 (48KB) cpuset=0x08000800
[1,1]<stdout>: L1i L#5 (32KB) cpuset=0x08000800
[1,1]<stdout>: Core L#5 cpuset=0x08000800
[1,1]<stdout>: PU L#10 (P#11) cpuset=0x00000800
[1,1]<stdout>: PU L#11 (P#27) cpuset=0x08000000
[1,1]<stdout>: L2 L#6 (1280KB) cpuset=0x20002000
[1,1]<stdout>: L1d L#6 (48KB) cpuset=0x20002000
[1,1]<stdout>: L1i L#6 (32KB) cpuset=0x20002000
[1,1]<stdout>: Core L#6 cpuset=0x20002000
[1,1]<stdout>: PU L#12 (P#13) cpuset=0x00002000
[1,1]<stdout>: PU L#13 (P#29) cpuset=0x20000000
[1,1]<stdout>: L2 L#7 (1280KB) cpuset=0x80008000
[1,1]<stdout>: L1d L#7 (48KB) cpuset=0x80008000
[1,1]<stdout>: L1i L#7 (32KB) cpuset=0x80008000
[1,1]<stdout>: Core L#7 cpuset=0x80008000
[1,1]<stdout>: PU L#14 (P#15) cpuset=0x00008000
[1,1]<stdout>: PU L#15 (P#31) cpuset=0x80000000
[1,1]<stdout>: Package L#1 cpuset=0x0
[1,1]<stdout>: NUMANode L#1 (P#0 31GB) cpuset=0x0
salloc: Relinquishing job allocation 5788
The resulting binding in PaRSEC does not appear correct:
PARSEC_MCA_device_cuda_memory_use=10 salloc -N1 -whexane "/apps/spacks/2023-08-14/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/openmpi-4.1.5-2rgaqk2wseegpmbdbbygvwrljccjaqsk/bin/mpiexec" --tag-output --oversubscribe -n 2 --bind-to socket --map-by socket --report-bindings build.cuda/parsec/tests/api/init_fini --mca runtime_report_bindings 1
salloc: Granted job allocation 5790
[1,0]<stderr>:[hexane:3981005] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]], socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]], socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]], socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]: [BB/BB/BB/BB/BB/BB/BB/BB][../../../../../../../..]
[1,1]<stderr>:[hexane:3981005] MCW rank 1 bound to socket 1[core 8[hwt 0-1]], socket 1[core 9[hwt 0-1]], socket 1[core 10[hwt 0-1]], socket 1[core 11[hwt 0-1]], socket 1[core 12[hwt 0-1]], socket 1[core 13[hwt 0-1]], socket 1[core 14[hwt 0-1]], socket 1[core 15[hwt 0-1]]: [../../../../../../../..][BB/BB/BB/BB/BB/BB/BB/BB]
[1,1]<stdout>:Process binding [rank 0]: cpuset [ALLOWED ]: 0xaaaaaaaa
[1,1]<stdout>:Process binding [rank 0]: cpuset [USED ]: 0x0000aaaa
[1,1]<stdout>:Process binding [rank 0]: cpuset [FREE ]: 0xaaaa0000
[1,0]<stdout>:Process binding [rank 0]: cpuset [ALLOWED ]: 0x55555555
[1,0]<stdout>:Process binding [rank 0]: cpuset [USED ]: 0x00005555
[1,0]<stdout>:Process binding [rank 0]: cpuset [FREE ]: 0x55550000
[1,0]<stderr>:W@00000 /!\ DEBUG LEVEL WILL PROBABLY REDUCE THE PERFORMANCE OF THIS RUN /!\.
[1,0]<stderr>:i@00000 Virtual Process Map with 1 VPs...
[1,0]<stderr>:i@00000 Virtual Process of index 0 has 8 threads and logical cpuset 0x000000ff
[1,0]<stderr>: physical cpuset 0x55555555
[1,0]<stderr>:i@00000 Thread 0 of VP 0 can be bound on logical cores 0x00000001 (physical cores 0x00010001)
[1,0]<stderr>:i@00000 Thread 1 of VP 0 can be bound on logical cores 0x00000002 (physical cores 0x00040004)
[1,0]<stderr>:i@00000 Thread 2 of VP 0 can be bound on logical cores 0x00000004 (physical cores 0x00100010)
[1,0]<stderr>:i@00000 Thread 3 of VP 0 can be bound on logical cores 0x00000008 (physical cores 0x00400040)
[1,0]<stderr>:i@00000 Thread 4 of VP 0 can be bound on logical cores 0x00000010 (physical cores 0x01000100)
[1,0]<stderr>:i@00000 Thread 5 of VP 0 can be bound on logical cores 0x00000020 (physical cores 0x04000400)
[1,0]<stderr>:i@00000 Thread 6 of VP 0 can be bound on logical cores 0x00000040 (physical cores 0x10001000)
[1,0]<stderr>:i@00000 Thread 7 of VP 0 can be bound on logical cores 0x00000080 (physical cores 0x40004000)
[1,1]<stderr>:i@00001 Virtual Process Map with 1 VPs...
[1,1]<stderr>:i@00001 Virtual Process of index 0 has 8 threads and logical cpuset 0x000000ff
[1,1]<stderr>: physical cpuset 0x55555555
[1,1]<stderr>:i@00001 Thread 0 of VP 0 can be bound on logical cores 0x00000001 (physical cores 0x00010001)
[1,1]<stderr>:i@00001 Thread 1 of VP 0 can be bound on logical cores 0x00000002 (physical cores 0x00040004)
[1,1]<stderr>:i@00001 Thread 2 of VP 0 can be bound on logical cores 0x00000004 (physical cores 0x00100010)
[1,1]<stderr>:i@00001 Thread 3 of VP 0 can be bound on logical cores 0x00000008 (physical cores 0x00400040)
[1,1]<stderr>:i@00001 Thread 4 of VP 0 can be bound on logical cores 0x00000010 (physical cores 0x01000100)
[1,1]<stderr>:i@00001 Thread 5 of VP 0 can be bound on logical cores 0x00000020 (physical cores 0x04000400)
[1,1]<stderr>:i@00001 Thread 6 of VP 0 can be bound on logical cores 0x00000040 (physical cores 0x10001000)
[1,1]<stderr>:i@00001 Thread 7 of VP 0 can be bound on logical cores 0x00000080 (physical cores 0x40004000)
salloc: Relinquishing job allocation 5790
It looks like the message above is partially misleading: we initialize the vpmap before we extract the ALLOWED mask, so vpmap_from_flat initializes something that does not match the real final binding. This is a problem because using VPs will behave differently from not using VPs, and the output is misleading; however, the actual binding produced should be correct when VPs are not used.
A potential solution is to move the initialization of the ALLOWED mask earlier in parsec_init, and use it the same way we use it in the parse_binding_parameters function to restrict vpmap creation; however, that may be too restrictive for vpmaps read from a file or built otherwise. Thoughts?
Looks like the inherited binding is correct (aside from the vpmap problem above): comparing `--bind-to none -c 8` vs `--bind-to socket`, the second case delivers double the performance, uses 16 hardware cores, and does not emit the oversubscription warning.
PARSEC_MCA_device_cuda_memory_use=10 salloc -N1 --ntasks-per-node=2 -whexane "/apps/spacks/2023-08-14/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/openmpi-4.1.5-2rgaqk2wseegpmbdbbygvwrljccjaqsk/bin/mpiexec" --tag-output -n 2 --bind-to none --map-by socket --report-bindings build.cuda/tests/testing_dpotrf -g 0 -N 30000 -x -v=4 -t 384 -c 8
[1,1]<stderr>:W@00001 /!\ PERFORMANCE MIGHT BE REDUCED /!\: Multiple PaRSEC processes on the same node may share the same physical core(s);
[1,1]<stderr>: This is often unintentional, and will perform poorly.
[1,1]<stderr>: Note that in managed environments (e.g., ALPS, jsrun), the launcher may set `cgroups`
[1,1]<stderr>: and hide the real binding from PaRSEC; if you verified that the binding is correct,
[1,1]<stderr>: this message can be silenced using the MCA argument `runtime_warn_slow_binding`.
[1,0]<stderr>:#+++++ cores detected : 8
[1,0]<stderr>:#+++++ nodes x cores + gpu : 2 x 8 + 0 (16+0)
[1,0]<stdout>:[****] TIME(s) 31.77802 : dpotrf PxQxg= 2 1 0 NB= 384 N= 30000 : 283.228825 gflops - ENQ&PROG&DEST 31.80433 : 282.994460 gflops - ENQ 0.02591 - DEST 0.00040
...
PARSEC_MCA_device_cuda_memory_use=10 salloc -N1 --ntasks-per-node=2 -whexane "/apps/spacks/2023-08-14/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/openmpi-4.1.5-2rgaqk2wseegpmbdbbygvwrljccjaqsk/bin/mpiexec" --tag-output -n 2 --bind-to socket --map-by socket --report-bindings build.cuda/tests/testing_dpotrf -g 0 -N 30000 -x -v=4 -t 384
[1,0]<stderr>:#+++++ cores detected : 8
[1,0]<stderr>:#+++++ nodes x cores + gpu : 2 x 8 + 0 (16+0)
[1,0]<stdout>:[****] TIME(s) 19.26431 : dpotrf PxQxg= 2 1 0 NB= 384 N= 30000 : 467.208496 gflops - ENQ&PROG&DEST 19.26601 : 467.167396 gflops - ENQ 0.00008 - DEST 0.00161
The vpmap initialization allocates and fills `parsec_vpmap[vp].threads[t+ht].cpuset = HWLOC_ALLOC();` with all sorts of intricate values (which do not abide by the restricted mask), but these end up being write-only variables.
~~At this point I propose we merge this PR with the broken vpmap, and create a tracking issue to fix them later in v4.1. Poll below.~~
At this point I will excise the rework on the vpmap, merge the rework that is effective in the flat case, and defer completion of complex vpmap process binding to 4.1.