Regressions from #1604

ronawho opened this issue 3 years ago • 9 comments

#1604 ("arkouda_server_refactor with metrics integration") has caused some correctness and performance issues.

16-node-cs-hdr performance is down significantly, with argsort dropping from 15 GiB/s to 10 GiB/s.

16-node-xc hasn't completed in a few days. It looks like the server isn't actually shutting down, so we're left with orphaned servers taking up all the compute nodes.

ronawho · Aug 19 '22 12:08

@ronawho determined the 16-node-xc test failures are due to a Chapel bug. I am submitting a patch soon that will prevent invocation of exit(0) unless two or more ServerDaemons are configured, meaning that the default arkouda_server configuration will not call exit(0) (see the sketch below). Fixing the root cause will be the subject of another issue.
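
For concreteness, here's a minimal Chapel sketch of the kind of guard described, assuming a hypothetical numServerDaemons config; arkouda's actual ServerDaemon code differs:

```chapel
// Hypothetical sketch, not arkouda's actual code: only call exit(0) when
// more than one ServerDaemon is configured, so the default single-daemon
// server sidesteps the Chapel shutdown bug.
config const numServerDaemons = 1;  // assumed name, for illustration only

proc shutdownServer() {
  if numServerDaemons > 1 {
    exit(0);  // multi-daemon configurations still exit explicitly
  }
  // single-daemon default: return and let the server wind down normally
}
```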

hokiegeek2 · Aug 25 '22 15:08

Thanks to some digging from @bmcdonald3, we believe the perf regression is caused by an existing false-sharing issue; this PR just happens to change allocations in a way that exposed it.

We've seen things like this before in https://github.com/Bears-R-Us/arkouda/issues/982 and https://github.com/Bears-R-Us/arkouda/pull/783/commits/c668ef5f331e170c87a844fc4124f8424fc301a3.

I'll start digging into the root cause of the false sharing. In the meantime I can probably come up with a patch that gives us the current functionality but changes allocation patterns to avoid it in nightly; a toy sketch of the failure mode is below.
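
To spell out the failure mode, here's a toy Chapel sketch of false sharing and the padding-style mitigation that allocation-pattern changes amount to. All names and the 64-byte line size are illustrative assumptions, not arkouda code:

```chapel
use CTypes;  // for c_array, used below to pad a record

config const nTasks = here.maxTaskPar;

// False-sharing version: adjacent 8-byte counters share cache lines, so
// concurrent writes from different tasks bounce those lines between
// cores even though no data is logically shared.
var counters: [0..#nTasks] int;
coforall tid in 0..#nTasks do
  for 1..1000000 do
    counters[tid] += 1;

// Mitigated version: pad each counter to 64 bytes so the count fields
// of consecutive elements are a full cache line apart.
record PaddedCounter {
  var count: int;                // 8 bytes of payload
  var pad: c_array(int(8), 56);  // pad the record out to 64 bytes
}
var padded: [0..#nTasks] PaddedCounter;
coforall tid in 0..#nTasks do
  for 1..1000000 do
    padded[tid].count += 1;      // each task now writes its own line
```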

ronawho · Aug 26 '22 14:08

@ronawho, @hokiegeek2 - is this issue still needed, or has it been addressed through other work?

Ethan-DeBandi99 · Feb 09 '23 16:02

Still needed. This is likely caused by false sharing and requires Chapel changes to address (still on my TODO list, but pretty far back).

ronawho · Feb 09 '23 16:02

@ronawho the performance regression you reported is pretty significant, isn't it? Will the false-sharing fix resolve it?

hokiegeek2 · Feb 09 '23 16:02

Yeah, I believe so

Not captured in this issue, but we discussed this a bit at the in-person meetup. The basic problem is that small changes in memory layout can cause unrelated variables/bytes on the same cache line to "conflict" with each other, which can lead to false sharing and this kind of performance variability (a toy illustration follows below). Note that nightly runs are now hovering around 14-16 GiB/s rather than the 10 GiB/s we saw right after this issue was opened, so thanks to subsequent layout changes the perf hit is no longer that large. In general I'm also not worried about this for production workflows; it mostly impacts very simple benchmarks.
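
To make the layout sensitivity concrete, a toy sketch (the 64-byte line size and the addresses are assumptions): whether two variables conflict depends only on which cache line their addresses fall in, so shifting an allocation by a few bytes can create or remove a conflict.

```chapel
// Toy arithmetic, assuming 64-byte cache lines: two addresses "conflict"
// exactly when they fall in the same line, i.e. addr / 64 matches.
proc sameLine(addrA: int, addrB: int): bool {
  const lineSize = 64;  // assumed cache-line size
  return addrA / lineSize == addrB / lineSize;
}

writeln(sameLine(120, 124));  // true: both in line 1 (bytes 64..127)
writeln(sameLine(124, 128));  // false: a 4-byte shift crosses a boundary
```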

ronawho · Feb 09 '23 18:02

@ronawho gotcha, cool, thanks for the explanation. If you're not worried about it, that works for me.

hokiegeek2 · Feb 09 '23 18:02

@ronawho did you ever fix the Chapel false-sharing bug that is the root cause of the regressions you reported?

hokiegeek2 · Aug 02 '23 09:08

No, still on the queue

ronawho · Aug 02 '23 12:08