[Intel-SIG] Fix sched domain build error for GNR, CWF in SNC-3 mode

Open aubreyli opened this issue 2 months ago • 0 comments

While testing Granite Rapids (GNR) and Clearwater Forest (CWF) systems in SNC-3 mode, we encountered sched domain build errors in dmesg. The scheduler domain code did not expect asymmetric node distances from a local node to multiple nodes in a remote package. As a result, remote nodes were partially grouped with local nodes in asymmetric groupings, which created too many levels in the NUMA sched domain hierarchy.

To address this, we simplify remote node distances for the purpose of sched domain construction on GNR and CWF. Specifically, we replace the individual distances to nodes within the same remote package with their average distance. This resolves the domain build errors and reduces the number of NUMA sched domain levels.
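The averaging step can be illustrated with a small sketch. This is not the kernel code from the patches. The distance matrix `SLIT`, the `PACKAGE_OF` mapping, and both function names are hypothetical, chosen only to model a two-package SNC-3 layout, and the integer rounding is an assumption. The count of distinct non-local distances is used here as a rough proxy for the number of NUMA sched domain levels.

```python
from collections import defaultdict

# Hypothetical SLIT-style distance matrix for 2 packages x 3 nodes (SNC-3).
# These values are illustrative only; they are not taken from the issue.
SLIT = [
    [10, 12, 14, 21, 22, 23],
    [12, 10, 12, 22, 23, 21],
    [14, 12, 10, 23, 21, 22],
    [21, 22, 23, 10, 12, 14],
    [22, 23, 21, 12, 10, 12],
    [23, 21, 22, 14, 12, 10],
]
PACKAGE_OF = [0, 0, 0, 1, 1, 1]  # node index -> package id (assumed layout)


def simplify_remote_distances(dist, package_of):
    """Return a copy of dist in which, for every source node, the distances
    to all nodes of one remote package are replaced by their average
    (integer division is an assumption made for this sketch)."""
    out = [row[:] for row in dist]  # leave the true distances untouched
    for src in range(len(dist)):
        remote_nodes = defaultdict(list)
        for dst in range(len(dist)):
            if package_of[dst] != package_of[src]:
                remote_nodes[package_of[dst]].append(dst)
        for nodes in remote_nodes.values():
            avg = sum(dist[src][d] for d in nodes) // len(nodes)
            for d in nodes:
                out[src][d] = avg
    return out


def distinct_nonlocal_distances(dist):
    """Count distinct off-diagonal distance values (proxy for NUMA levels)."""
    return len({dist[i][j]
                for i in range(len(dist))
                for j in range(len(dist)) if i != j})


simplified = simplify_remote_distances(SLIT, PACKAGE_OF)
print(distinct_nonlocal_distances(SLIT))        # 5
print(distinct_nonlocal_distances(simplified))  # 3
```

With these made-up values, node 0 sees remote distances 21, 22 and 23 before averaging and a single distance of 22 afterwards, while the original matrix is kept unchanged for any consumer that needs the true distances.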

The actual SLIT NUMA node distances are still preserved separately, in case they are needed when building sched domains. NUMA balancing continues to use the true distances when selecting a closer remote node for a task’s numa_group.

The following two commits were backported:

  • 0002-sched-Create-architecture-specific-sched-domain-dist.patch
  • 0003-sched-topology-Fix-sched-domain-build-error-for-GNR-.patch

as well as their necessary dependency:

  • 0001-bitmap-Define-a-cleanup-function-for-bitmaps.patch

Testing result w/o fixes:

[ 8.260954] CPU0 attaching sched-domain(s):
[ 8.261112] domain-0: span=0,192 level=SMT
[ 8.262111] groups: 0:{ span=0 cap=976 }, 192:{ span=192 cap=1022 }
[ 8.263111] domain-1: span=0-31,192-223 level=MC
[ 8.264110] groups: 0:{ span=0,192 cap=1998 }, 1:{ span=1,193 cap=2046 }, 2:{ span=2,194 cap=2045 }, 3:{ span=3,195 cap=2046 }, 4:{ span=4,196 cap=2044 }, 5:{ span=5,197 cap=2045 }, 6:{ span=6,198 cap=2046 }, 7:{ span=7,199 cap=2045 }, 8:{ span=8,200 cap=2045 }, 9:{ span=9,201 cap=2047 }, 10:{ span=10,202 cap=2045 }, 11:{ span=11,203 cap=2047 }, 12:{ span=12,204 cap=2044 }, 13:{ span=13,205 cap=2045 }, 14:{ span=14,206 cap=2045 }, 15:{ span=15,207 cap=2045 }, 16:{ span=16,208 cap=2045 }, 17:{ span=17,209 cap=2048 }, 18:{ span=18,210 cap=2047 }, 19:{ span=19,211 cap=2045 }, 20:{ span=20,212 cap=2045 }, 21:{ span=21,213 cap=2046 }, 22:{ span=22,214 cap=2048 }, 23:{ span=23,215 cap=2045 }, 24:{ span=24,216 cap=2047 }, 25:{ span=25,217 cap=2046 }, 26:{ span=26,218 cap=2046 }, 27:{ span=27,219 cap=2045 }, 28:{ span=28,220 cap=2046 }, 29:{ span=29,221 cap=2046 }, 30:{ span=30,222 cap=2044 }, 31:{ span=31,223 cap=2046 }
[ 8.265119] domain-2: span=0-63,192-255 level=NUMA
[ 8.266110] groups: 0:{ span=0-31,192-223 cap=65413 }, 32:{ span=32-63,224-255 cap=65457 }
[ 8.267111] domain-3: span=0-95,192-287 level=NUMA
[ 8.268110] groups: 0:{ span=0-63,192-255 mask=0-31,192-223 cap=130870 }, 64:{ span=32-95,224-287 mask=64-95,256-287 cap=131001 }
[ 8.269111] domain-4: span=0-127,192-319 level=NUMA
[ 8.270110] groups: 0:{ span=0-95,192-287 cap=196381 }, 96:{ span=96-127,288-319 cap=65451 }
[ 8.271111] domain-5: span=0-127,160-319,352-383 level=NUMA
[ 8.272110] groups: 0:{ span=0-127,192-319 mask=0-31,192-223 cap=261832 }, 160:{ span=160-191,352-383 cap=65475 }
[ 8.273112] domain-6: span=0-383 level=NUMA
[ 8.274110] groups: 0:{ span=0-127,160-319,352-383 mask=0-31,192-223 cap=327307 }
[ 8.275111] ERROR: groups don't span domain->span

Testing result w/ fixes:

[ 8.187368] CPU0 attaching sched-domain(s):
[ 8.188143] domain-0: span=0,192 level=SMT
[ 8.189142] groups: 0:{ span=0 cap=887 }, 192:{ span=192 }
[ 8.190141] domain-1: span=0-31,192-223 level=MC
[ 8.191141] groups: 0:{ span=0,192 cap=1911 }, 1:{ span=1,193 cap=2021 }, 2:{ span=2,194 cap=2038 }, 3:{ span=3,195 cap=2040 }, 4:{ span=4,196 cap=2039 }, 5:{ span=5,197 cap=2045 }, 6:{ span=6,198 cap=2041 }, 7:{ span=7,199 cap=2041 }, 8:{ span=8,200 cap=2042 }, 9:{ span=9,201 cap=2033 }, 10:{ span=10,202 cap=2033 }, 11:{ span=11,203 cap=2033 }, 12:{ span=12,204 cap=2045 }, 13:{ span=13,205 cap=2027 }, 14:{ span=14,206 cap=2038 }, 15:{ span=15,207 cap=2035 }, 16:{ span=16,208 cap=2044 }, 17:{ span=17,209 cap=2044 }, 18:{ span=18,210 cap=2039 }, 19:{ span=19,211 cap=2042 }, 20:{ span=20,212 cap=2041 }, 21:{ span=21,213 cap=2048 }, 22:{ span=22,214 cap=2036 }, 23:{ span=23,215 cap=2048 }, 24:{ span=24,216 cap=2021 }, 25:{ span=25,217 cap=2043 }, 26:{ span=26,218 cap=2044 }, 27:{ span=27,219 cap=2041 }, 28:{ span=28,220 cap=2041 }, 29:{ span=29,221 cap=2037 }, 30:{ span=30,222 cap=2036 }, 31:{ span=31,223 cap=2048 }
[ 8.192149] domain-2: span=0-63,192-255 level=NUMA
[ 8.193141] groups: 0:{ span=0-31,192-223 cap=65115 }, 32:{ span=32-63,224-255 cap=65201 }
[ 8.194142] domain-3: span=0-95,192-287 level=NUMA
[ 8.195141] groups: 0:{ span=0-63,192-255 mask=0-31,192-223 cap=130316 }, 64:{ span=32-95,224-287 mask=64-95,256-287 cap=130714 }
[ 8.196142] domain-4: span=0-383 level=NUMA
[ 8.197141] groups: 0:{ span=0-95,192-287 cap=195692 }, 96:{ span=96-191,288-383 cap=195639 }
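To read the two logs side by side, it can help to check whether the group spans of the top NUMA domain actually cover that domain's span, which is the condition behind the "groups don't span domain->span" error. The sketch below is a simplified model of that check written for this issue, not the kernel's debug code; `parse_cpulist` and `groups_cover_domain` are hypothetical helper names, and the span strings are copied from the logs above.

```python
def parse_cpulist(text):
    """Expand a CPU list such as '0-127,160-319' into a set of CPU ids."""
    cpus = set()
    for part in text.split(","):
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus


def groups_cover_domain(domain_span, group_spans):
    """Return True when the union of the group spans equals the domain span."""
    covered = set()
    for span in group_spans:
        covered |= parse_cpulist(span)
    return covered == parse_cpulist(domain_span)


# Top-level domain without the fixes (domain-6 in the first log).
before_ok = groups_cover_domain("0-383", ["0-127,160-319,352-383"])

# Top-level domain with the fixes (domain-4 in the second log).
after_ok = groups_cover_domain("0-383", ["0-95,192-287", "96-191,288-383"])

# CPUs that no group reaches in the failing case.
missing = sorted(parse_cpulist("0-383") - parse_cpulist("0-127,160-319,352-383"))

print(before_ok, after_ok)      # False True
print(missing[0], missing[-1])  # 128 351
```

In the failing log the 64 CPUs in 128-159 and 320-351 are not reached by any group of domain-6, whereas the two groups of domain-4 in the fixed log together cover all of 0-383.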

aubreyli, Dec 16 '25 15:12