MSCCL multithreaded regression: alternative state management
Details
Work item: Internal
What were the changes?
Address the multithreaded (MT) MSCCL issue and re-enable MSCCL in MT mode by enabling MSCCL single-process mode.
This is a change to the original implementation: it hands off the state from thread-local storage to the rank as appropriate, and it uses a vector instead of an unordered_map.
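The hand-off described above can be pictured with a minimal sketch. Everything in it is hypothetical and for illustration only: `RankState`, `tlsPending`, and `RankStateTable` are not names from the RCCL or MSCCL sources, and the sketch assumes ranks are dense integers starting at 0, which is what makes a vector a natural replacement for an unordered_map.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical per-rank state; the fields are illustrative only.
struct RankState {
  int deviceId = -1;         // device that owns this rank's buffers
  bool initialized = false;  // set once the state has been handed off
};

// Thread-local staging area used while a thread is still setting up a rank.
thread_local RankState tlsPending;

// Per-rank storage indexed directly by rank (assumed dense, 0..nRanks-1).
class RankStateTable {
 public:
  explicit RankStateTable(std::size_t nRanks) : states_(nRanks) {}

  // Copy the calling thread's pending state into the slot for `rank`,
  // then reset the staging area so the next rank on this thread starts clean.
  void handOff(std::size_t rank) {
    std::lock_guard<std::mutex> lock(mu_);
    assert(rank < states_.size());
    states_[rank] = tlsPending;
    states_[rank].initialized = true;
    tlsPending = RankState{};
  }

  // Return a copy of the state stored for `rank` (throws if out of range).
  RankState get(std::size_t rank) const {
    std::lock_guard<std::mutex> lock(mu_);
    return states_.at(rank);
  }

 private:
  mutable std::mutex mu_;
  std::vector<RankState> states_;
};
```

With this layout, a thread that drives several ranks fills `tlsPending` for one rank, calls `handOff(rank)`, and repeats for the next rank, so no rank's state is left behind in thread-local storage.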
Why were the changes made?
Severe regression when multiple devices are used per thread in rccl-tests allreduce.
How was the outcome achieved?
Set the device ID correctly so that scratch memory and sync buffers are allocated on the right device.
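As a rough illustration of why the active device matters when one thread drives several devices, the sketch below uses a scoped guard that selects a device before allocating and restores the previous one afterwards. It is a plain C++ simulation under stated assumptions: `g_currentDevice`, `fakeAlloc`, `DeviceGuard`, and `allocScratchPerRank` are hypothetical stand-ins, not HIP or RCCL APIs, and the real change may be structured differently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for the runtime's per-thread "current device" (hypothetical).
thread_local int g_currentDevice = 0;

// Records which device an allocation landed on.
struct Allocation {
  int device;
  std::size_t bytes;
};

// Stand-in allocator: like a GPU runtime, it allocates on the current device.
Allocation fakeAlloc(std::size_t bytes) { return {g_currentDevice, bytes}; }

// RAII guard: select `device` for this scope, restore the previous one on exit.
class DeviceGuard {
 public:
  explicit DeviceGuard(int device) : saved_(g_currentDevice) {
    g_currentDevice = device;
  }
  ~DeviceGuard() { g_currentDevice = saved_; }
  DeviceGuard(const DeviceGuard&) = delete;
  DeviceGuard& operator=(const DeviceGuard&) = delete;

 private:
  int saved_;
};

// Allocate one scratch buffer per rank, each on that rank's own device.
std::vector<Allocation> allocScratchPerRank(const std::vector<int>& rankDevices,
                                            std::size_t bytes) {
  std::vector<Allocation> out;
  out.reserve(rankDevices.size());
  for (int dev : rankDevices) {
    DeviceGuard guard(dev);  // without this, every buffer lands on one device
    out.push_back(fakeAlloc(bytes));
  }
  return out;
}
```

Calling `allocScratchPerRank({0, 1, 2, 3}, 1024)` from a single thread yields one allocation per device and leaves the thread's current device unchanged.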
Additional Details:
Single-threaded (original | fix)
# out-of-place in-place | # out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong | # size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s) | # (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
1048576 262144 float sum -1 74.21 14.13 24.73 0 73.38 14.29 25.01 0 | 1048576 262144 float sum -1 72.71 14.42 25.24 0 72.84 14.40 25.19 0
2097152 524288 float sum -1 73.45 28.55 49.97 0 73.24 28.64 50.11 0 | 2097152 524288 float sum -1 72.86 28.78 50.37 0 73.29 28.62 50.08 0
3145728 786432 float sum -1 75.96 41.41 72.47 0 74.08 42.46 74.31 0 | 3145728 786432 float sum -1 76.17 41.30 72.27 0 74.21 42.39 74.18 0
4194304 1048576 float sum -1 73.74 56.88 99.55 0 73.60 56.99 99.73 0 | 4194304 1048576 float sum -1 74.15 56.57 98.99 0 73.64 56.96 99.67 0
5242880 1310720 float sum -1 74.04 70.82 123.93 0 76.57 68.47 119.83 0 | 5242880 1310720 float sum -1 73.98 70.87 124.02 0 75.36 69.57 121.75 0
6291456 1572864 float sum -1 74.68 84.24 147.43 0 74.58 84.35 147.62 0 | 6291456 1572864 float sum -1 73.39 85.73 150.03 0 73.49 85.61 149.82 0
7340032 1835008 float sum -1 79.56 92.26 161.46 0 82.47 89.00 155.76 0 | 7340032 1835008 float sum -1 79.33 92.53 161.92 0 82.20 89.30 156.27 0
8388608 2097152 float sum -1 87.23 96.16 168.29 0 88.17 95.14 166.49 0 | 8388608 2097152 float sum -1 87.09 96.32 168.57 0 87.85 95.49 167.10 0
9437184 2359296 float sum -1 94.98 99.36 173.88 0 97.67 96.62 169.08 0 | 9437184 2359296 float sum -1 94.96 99.38 173.92 0 97.63 96.66 169.16 0
10485760 2621440 float sum -1 100.9 103.92 181.87 0 104.1 100.75 176.30 0 | 10485760 2621440 float sum -1 100.9 103.94 181.90 0 104.0 100.87 176.52 0
11534336 2883584 float sum -1 111.3 103.68 181.43 0 113.9 101.25 177.18 0 | 11534336 2883584 float sum -1 111.2 103.75 181.56 0 113.8 101.38 177.42 0
12582912 3145728 float sum -1 117.0 107.57 188.24 0 129.5 97.20 170.10 0 | 12582912 3145728 float sum -1 117.0 107.53 188.19 0 129.6 97.09 169.91 0
13631488 3407872 float sum -1 127.4 107.00 187.26 0 132.6 102.82 179.93 0 | 13631488 3407872 float sum -1 127.3 107.11 187.45 0 132.6 102.83 179.95 0
14680064 3670016 float sum -1 133.3 110.16 192.78 0 136.9 107.26 187.70 0 | 14680064 3670016 float sum -1 133.2 110.22 192.89 0 136.9 107.26 187.70 0
15728640 3932160 float sum -1 143.3 109.74 192.05 0 149.1 105.50 184.63 0 | 15728640 3932160 float sum -1 143.4 109.70 191.97 0 148.9 105.60 184.80 0
16777216 4194304 float sum -1 149.8 112.00 196.00 0 155.9 107.63 188.35 0 | 16777216 4194304 float sum -1 149.7 112.04 196.07 0 156.0 107.52 188.16 0
# Errors with asterisks indicate errors that have exceeded the maximum threshold. | # Errors with asterisks indicate errors that have exceeded the maximum threshold.
# Out of bounds values : 0 OK | # Out of bounds values : 0 OK
# Avg bus bandwidth : 144.171 | # Avg bus bandwidth : 144.471
Multi-threaded (original | fix)
# out-of-place in-place | # out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong | # size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s) | # (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
1048576 262144 float sum -1 28.15 37.25 65.19 0 27.52 38.10 66.67 0 | 1048576 262144 float sum -1 27.84 37.66 65.91 0 27.41 38.26 66.95 0
2097152 524288 float sum -1 37.03 56.64 99.11 0 35.22 59.55 104.21 0 | 2097152 524288 float sum -1 34.81 60.25 105.43 0 35.08 59.79 104.63 0
3145728 786432 float sum -1 44.72 70.34 123.09 0 45.77 68.73 120.27 0 | 3145728 786432 float sum -1 44.53 70.64 123.62 0 45.51 69.13 120.97 0
4194304 1048576 float sum -1 52.56 79.79 139.64 0 54.18 77.42 135.48 0 | 4194304 1048576 float sum -1 52.31 80.18 140.31 0 54.21 77.38 135.41 0
5242880 1310720 float sum -1 62.37 84.06 147.10 0 64.35 81.47 142.58 0 | 5242880 1310720 float sum -1 62.10 84.42 147.74 0 64.26 81.59 142.78 0
6291456 1572864 float sum -1 68.41 91.97 160.95 0 70.77 88.90 155.58 0 | 6291456 1572864 float sum -1 68.37 92.03 161.04 0 70.56 89.16 156.03 0
7340032 1835008 float sum -1 78.57 93.42 163.49 0 81.28 90.31 158.04 0 | 7340032 1835008 float sum -1 78.33 93.70 163.98 0 81.07 90.54 158.45 0
8388608 2097152 float sum -1 84.57 99.19 173.59 0 87.37 96.01 168.02 0 | 8388608 2097152 float sum -1 84.42 99.37 173.89 0 87.33 96.05 168.09 0
9437184 2359296 float sum -1 94.91 99.43 174.01 0 97.37 96.92 169.60 0 | 9437184 2359296 float sum -1 94.73 99.63 174.35 0 97.13 97.16 170.04 0
10485760 2621440 float sum -1 100.7 104.14 182.25 0 103.6 101.22 177.13 0 | 10485760 2621440 float sum -1 100.7 104.15 182.26 0 103.4 101.46 177.55 0
11534336 2883584 float sum -1 111.1 103.82 181.69 0 113.7 101.47 177.57 0 | 11534336 2883584 float sum -1 110.9 103.96 181.93 0 113.6 101.55 177.71 0
12582912 3145728 float sum -1 117.0 107.59 188.28 0 130.4 96.49 168.86 0 | 12582912 3145728 float sum -1 116.8 107.69 188.46 0 130.4 96.51 168.89 0
13631488 3407872 float sum -1 127.2 107.12 187.47 0 133.7 101.96 178.43 0 | 13631488 3407872 float sum -1 127.1 107.23 187.66 0 133.9 101.80 178.15 0
14680064 3670016 float sum -1 133.1 110.31 193.04 0 138.5 105.96 185.44 0 | 14680064 3670016 float sum -1 133.0 110.40 193.21 0 138.1 106.27 185.98 0
15728640 3932160 float sum -1 143.0 110.00 192.49 0 148.6 105.83 185.21 0 | 15728640 3932160 float sum -1 143.0 110.03 192.54 0 148.6 105.81 185.17 0
16777216 4194304 float sum -1 149.5 112.23 196.40 0 155.8 107.67 188.42 0 | 16777216 4194304 float sum -1 149.3 112.36 196.63 0 155.9 107.63 188.36 0
# Errors with asterisks indicate errors that have exceeded the maximum threshold. | # Errors with asterisks indicate errors that have exceeded the maximum threshold.
# Out of bounds values : 0 OK | # Out of bounds values : 0 OK
# Avg bus bandwidth : 157.791 | # Avg bus bandwidth : 158.254
Multi-process (original | fix)
# out-of-place in-place | # out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong | # size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s) | # (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
1048576 262144 float sum -1 25.84 40.58 71.01 0 25.47 41.17 72.05 0 | 1048576 262144 float sum -1 25.88 40.52 70.91 0 25.62 40.93 71.62 0
2097152 524288 float sum -1 32.79 63.95 111.92 0 33.11 63.35 110.86 0 | 2097152 524288 float sum -1 32.73 64.07 112.12 0 33.08 63.39 110.94 0
3145728 786432 float sum -1 42.67 73.72 129.01 0 43.88 71.69 125.47 0 | 3145728 786432 float sum -1 42.66 73.75 129.06 0 43.86 71.72 125.52 0
4194304 1048576 float sum -1 50.19 83.57 146.25 0 51.95 80.74 141.29 0 | 4194304 1048576 float sum -1 50.18 83.58 146.26 0 52.05 80.58 141.01 0
5242880 1310720 float sum -1 60.03 87.34 152.85 0 62.49 83.91 146.84 0 | 5242880 1310720 float sum -1 60.04 87.32 152.82 0 62.38 84.04 147.08 0
6291456 1572864 float sum -1 66.15 95.10 166.43 0 68.45 91.91 160.85 0 | 6291456 1572864 float sum -1 66.13 95.13 166.49 0 68.59 91.73 160.53 0
7340032 1835008 float sum -1 76.43 96.03 168.05 0 79.18 92.71 162.24 0 | 7340032 1835008 float sum -1 76.36 96.12 168.21 0 79.16 92.73 162.27 0
8388608 2097152 float sum -1 82.29 101.93 178.39 0 85.47 98.15 171.76 0 | 8388608 2097152 float sum -1 82.25 101.99 178.48 0 85.57 98.03 171.56 0
9437184 2359296 float sum -1 92.68 101.82 178.19 0 95.21 99.12 173.45 0 | 9437184 2359296 float sum -1 92.59 101.93 178.37 0 95.19 99.14 173.49 0
10485760 2621440 float sum -1 98.28 106.69 186.70 0 101.5 103.31 180.80 0 | 10485760 2621440 float sum -1 98.29 106.69 186.70 0 101.4 103.46 181.05 0
11534336 2883584 float sum -1 108.8 105.99 185.48 0 111.3 103.67 181.42 0 | 11534336 2883584 float sum -1 108.8 106.01 185.52 0 111.5 103.47 181.07 0
12582912 3145728 float sum -1 114.5 109.86 192.25 0 128.7 97.75 171.06 0 | 12582912 3145728 float sum -1 114.6 109.82 192.18 0 128.8 97.70 170.98 0
13631488 3407872 float sum -1 125.0 109.02 190.78 0 131.7 103.49 181.10 0 | 13631488 3407872 float sum -1 125.1 108.99 190.73 0 131.8 103.45 181.04 0
14680064 3670016 float sum -1 130.9 112.18 196.32 0 135.8 108.11 189.20 0 | 14680064 3670016 float sum -1 130.9 112.16 196.28 0 135.7 108.17 189.30 0
15728640 3932160 float sum -1 141.0 111.56 195.23 0 146.7 107.23 187.65 0 | 15728640 3932160 float sum -1 141.1 111.47 195.07 0 146.4 107.42 187.99 0
16777216 4194304 float sum -1 147.4 113.83 199.20 0 153.9 109.02 190.79 0 | 16777216 4194304 float sum -1 147.3 113.89 199.30 0 153.7 109.17 191.04 0
# Errors with asterisks indicate errors that have exceeded the maximum threshold. | # Errors with asterisks indicate errors that have exceeded the maximum threshold.
# Out of bounds values : 0 OK | # Out of bounds values : 0 OK
# Avg bus bandwidth : 162.34 | # Avg bus bandwidth : 162.344
Approval Checklist
Do not approve until these items are satisfied.
- [ ] Verify the CHANGELOG has been updated, if
  - there are any NCCL API version changes,
  - any changes impact library users, and/or
  - any changes impact any other ROCm library.