Weird thread-to-CPU pinning performance
Hi @clamchowder,
I want to pin threads to CPUs when measuring bandwidth, but there seems to be no such facility in non-NUMA mode. So I borrowed this part from CoherenceLatency:
```c
void *ReadBandwidthTestThread(void *param) {
    BandwidthTestThreadData *bwTestData = (BandwidthTestThreadData *)param;
    if (hardaffinity) {
        sched_setaffinity(gettid(), sizeof(cpu_set_t), &global_cpuset);
    } else {
        // I added the following lines:
        cpu_set_t cpuset;
        CPU_ZERO(&cpuset);
        CPU_SET(bwTestData->processorIndex, &cpuset);
        sched_setaffinity(gettid(), sizeof(cpu_set_t), &cpuset);
        fprintf(stderr, "thread %ld set affinity %d\n", (long)gettid(), bwTestData->processorIndex);
    }
    ...
}
```
Besides, the `processorIndex` is calculated as `thread_idx % nprocs`, using the processor-to-core-id mapping from `/proc/cpuinfo`.
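For reference, here is a rough sketch of how that mapping can be read (not my exact code; it just assumes the usual `processor` / `core id` fields in `/proc/cpuinfo`):

```c
#define _GNU_SOURCE
#include <stdio.h>

#define MAX_CPUS 1024

/* Parse /proc/cpuinfo and record, for each logical processor,
 * which physical core id it belongs to. */
static int read_core_ids(int core_id_of[], int max_cpus) {
    FILE *f = fopen("/proc/cpuinfo", "r");
    if (!f) { perror("fopen"); return -1; }

    char line[256];
    int processor = -1, nprocs = 0;
    while (fgets(line, sizeof(line), f)) {
        if (sscanf(line, "processor : %d", &processor) == 1) {
            if (processor >= 0 && processor < max_cpus && processor >= nprocs)
                nprocs = processor + 1;
        } else if (processor >= 0 && processor < max_cpus) {
            int core;
            if (sscanf(line, "core id : %d", &core) == 1)
                core_id_of[processor] = core;
        }
    }
    fclose(f);
    return nprocs;
}

int main(void) {
    int core_id_of[MAX_CPUS] = {0};
    int nprocs = read_core_ids(core_id_of, MAX_CPUS);
    if (nprocs <= 0) return 1;

    /* With more threads than logical CPUs, thread i wraps around via
     * (i % nprocs); the core id shows which physical core that maps to. */
    int nthreads = 16;                      /* e.g. the 16-thread run */
    for (int i = 0; i < nthreads; i++) {
        int proc = i % nprocs;
        printf("thread %2d -> processor %2d (core id %d)\n",
               i, proc, core_id_of[proc]);
    }
    return 0;
}
```

Printing the core id next to each processor index makes it easy to see whether consecutive threads land on distinct physical cores.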
I tested on an AMD Ryzen 7 5800X, which has only one NUMA node (8 physical cores, 16 logical cores), so I didn't enable NUMA mode.
I got the following results:
In the figure above, "auto" means I ran the original MemoryBandwidth code, while "manual" means I added the `CPU_SET` and `sched_setaffinity` calls as in the code snippet above. The left and right figures show the 8-thread and 16-thread results, respectively.
My question is: why are the "manual" bandwidth results lower than the "auto" results with 8 threads, while "manual" catches up with 16 threads?
Thanks, troore
There's no facility to pin threads for memory bandwidth testing in non-NUMA mode because it is not needed. You can use other utilities to set affinity, like `taskset` on Linux or `start /b /affinity <mask>` on Windows, to ensure the test only runs on certain physical cores.
- Why is pinning threads not needed in non-NUMA mode?
- How can `taskset` guarantee precise affinity? For example, if we pin 2 threads to 2 physical cores (SMT2, so 4 logical cores) with `taskset`, can we guarantee that the 2 threads are scheduled on physical cores 0 and 1, rather than both landing on physical core 0 and giving different L1/L2 bandwidth results? (See the sketch below for what I mean by precise pinning.)
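To make the second question concrete, by "precise" pinning I mean something like the following sketch: each thread gets its own one-CPU mask, and the two logical CPU numbers are assumed (e.g. from `lscpu -e` or `/proc/cpuinfo`) to sit on different physical cores:

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Logical CPUs assumed to be on two different physical cores;
 * adjust these for the actual topology. */
static const int target_cpu[2] = {0, 1};

static void *worker(void *arg) {
    int idx = *(int *)arg;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(target_cpu[idx], &set);
    /* Pin only this thread (by tid), not the whole process. */
    if (sched_setaffinity((pid_t)syscall(SYS_gettid), sizeof(set), &set) != 0)
        perror("sched_setaffinity");
    fprintf(stderr, "thread %d pinned to logical cpu %d\n", idx, target_cpu[idx]);
    /* ... bandwidth kernel would run here ... */
    return NULL;
}

int main(void) {
    pthread_t t[2];
    int idx[2] = {0, 1};
    for (int i = 0; i < 2; i++) pthread_create(&t[i], NULL, worker, &idx[i]);
    for (int i = 0; i < 2; i++) pthread_join(t[i], NULL);
    return 0;
}
```

With `taskset`, by contrast, every thread inherits the same process-wide mask and the scheduler decides which logical CPU each one actually runs on.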
At one point I had an option to put the first thread on core 0, second thread on core 1, and so on, but found that it made no difference compared to setting affinity through `taskset` or `start /b /affinity` for the whole process. Operating systems today are SMT-aware and are good at preferring to load separate physical cores before loading SMT threads.
If you have a problem with the operating system not being SMT-aware, you can use `taskset` or `start /b /affinity` to exclude SMT sibling threads. I haven't seen it be a problem on any recent Windows or Linux install.
NUMA gets special handling because each thread allocates memory from a designated pool of memory, and has to be pinned to a core close to that pool.
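Roughly, the per-thread handling in NUMA mode amounts to something like this sketch (using libnuma; not the exact code in the repo): allocate the test buffer on a designated node, then restrict the thread to CPUs on that node.

```c
#define _GNU_SOURCE
#include <numa.h>      /* link with -lnuma */
#include <sched.h>
#include <stdio.h>

/* Sketch: give one thread a buffer on `node` and pin it to a CPU on that node. */
static float *alloc_on_node_and_pin(size_t bytes, int node) {
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma not available\n");
        return NULL;
    }

    /* Memory comes from the designated node's pool. */
    float *buf = numa_alloc_onnode(bytes, node);
    if (!buf) return NULL;

    /* Restrict the calling thread to CPUs belonging to that node,
     * so its accesses stay local. */
    struct bitmask *cpus = numa_allocate_cpumask();
    if (numa_node_to_cpus(node, cpus) == 0)
        numa_sched_setaffinity(0, cpus);   /* 0 = calling task */
    numa_free_cpumask(cpus);

    return buf;   /* free later with numa_free(buf, bytes) */
}
```

Without that pairing, a thread could end up reading a buffer that lives on a remote node, and the bandwidth number would reflect the interconnect rather than the local memory controllers.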
Make sense, thanks. I think this issue is solved.
Hi @clamchowder,
I've just reopened this issue because I still can't explain the left figure in the original post (the comparison between auto and manual thread binding): I'd think 8 threads are enough to fully utilize L1 bandwidth before the first slope.
I tried both `taskset -c 0-7` and `sched_setaffinity`, but got similar results. The affinity masks for the auto and manual runs are `ffff` and `ff__`. I can't explain why the manual thread binding is lower than auto.
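To double-check, each test thread's effective mask can be dumped with something like the following (a sketch, called at the top of the bandwidth thread):

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Print the effective affinity mask of the calling thread; calling this in the
 * bandwidth thread shows what "auto" vs "manual" actually allow. */
static void print_thread_affinity(void) {
    cpu_set_t set;
    if (sched_getaffinity(0, sizeof(set), &set) != 0) {   /* 0 = calling thread */
        perror("sched_getaffinity");
        return;
    }
    long tid = syscall(SYS_gettid);
    fprintf(stderr, "tid %ld allowed cpus:", tid);
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set)) fprintf(stderr, " %d", cpu);
    fprintf(stderr, "\n");
}
```

Logging `sched_getcpu()` during the run would additionally show whether threads migrate or double up on a physical core.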
Could you try to reproduce the results and help explain them?
Thanks, troore
Please don't do any affinity setting unless you're willing to investigate and debug the effects on your own time. If you choose to do that, tools like perf and performance counters can help you understand what's going on.
Affinity setting is not supported in general, and was only done to work around issues on certain platforms.