logging Why are we using olympic scores stdev for RCP interpolation?

The point of stdev is to capture the spread of the whole population, but Olympic scoring drops the extreme values before the stdev is computed. So we should consider using the stdev of the entire population for interpolation.
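To make the difference concrete, here is a minimal sketch comparing the two quantities. It assumes Olympic scoring means dropping the single lowest and single highest sample before computing the statistic; the function names, the `n_drop` parameter, and the sample epoch counts are all hypothetical and are not taken from `rcp_checker.py`.

```python
import statistics


def olympic_trim(values, n_drop=1):
    """Return the sorted values with the n_drop lowest and highest removed.

    Assumption: this is what "Olympic scoring" means here. n_drop is a
    hypothetical parameter, not the checker's actual setting.
    """
    ordered = sorted(values)
    if len(ordered) <= 2 * n_drop:
        raise ValueError("not enough samples to trim")
    return ordered[n_drop:len(ordered) - n_drop]


def stdev_full(values):
    """Population stdev over every sample."""
    return statistics.pstdev(values)


def stdev_olympic(values, n_drop=1):
    """Population stdev over the Olympic-trimmed samples only."""
    return statistics.pstdev(olympic_trim(values, n_drop))


# Made-up epochs-to-converge samples with one slow outlier run.
epochs = [10, 11, 11, 12, 12, 13, 18]

print("full stdev:   ", stdev_full(epochs))
print("olympic stdev:", stdev_olympic(epochs))
```

On this made-up sample the trimmed stdev comes out smaller than the full one, because the outlier run at 18 epochs is discarded before the spread is measured.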

Need to analyze and fix for v4.1

Jul 11 '24 19:07 ShriyaRishab

Stdev_olympic_prunning

Jul 31 '24 20:07 pgmpablo157321

RCPs_pruned_varying_Stdev

Jul 31 '24 20:07 pgmpablo157321

Can we see how this affects the v4.0 scores?

Aug 01 '24 15:08 hiwotadese

@ShriyaPalsamudram I am getting the same results with and without the Olympic stdev. This seems to be because the RCP Stdev is only used to compute min_epochs:

https://github.com/mlcommons/logging/blob/369260bf8326f36f644d34a1996b05ec51ad9717/mlperf_logging/rcp_checker/rcp_checker.py#L272-L275

Since we are no longer pruning based on min_epochs, the change doesn't seem to affect the results. min_epochs does later affect the Max Speedup, but that value is only used later in a condition to check whether the RCP passed:

https://github.com/mlcommons/logging/blob/369260bf8326f36f644d34a1996b05ec51ad9717/mlperf_logging/rcp_checker/rcp_checker.py#L276

https://github.com/mlcommons/logging/blob/369260bf8326f36f644d34a1996b05ec51ad9717/mlperf_logging/rcp_checker/rcp_checker.py#L438

@ShriyaPalsamudram What changes were expected when changing the Stdev?

Aug 02 '24 21:08 pgmpablo157321

Since this impacts max speedup, can we compare max speedup before and after the change for all RCP points?

Aug 08 '24 15:08 ShriyaRishab

RCPs_MaxSpeedUP

Aug 14 '24 15:08 pgmpablo157321

Additionally, here is an example of the max_speed_up values for the last training results, for HPE-Cray-XD670-Gen11-H100-SXM5-80GB_n1_mxnet_24.04.

With olympic score:

[1.018748075108887, 1.0601459916687128]

Without olympic score:

[1.0262860923600325, 1.0699036853548622]
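As a quick way to size the difference, the relative change between the two lists above can be computed directly. This is only arithmetic on the values quoted in this comment; the variable and function names are made up for illustration.

```python
# max_speed_up values quoted above for
# HPE-Cray-XD670-Gen11-H100-SXM5-80GB_n1_mxnet_24.04
with_olympic = [1.018748075108887, 1.0601459916687128]
without_olympic = [1.0262860923600325, 1.0699036853548622]


def relative_change(before, after):
    """Element-wise (after - before) / before."""
    return [(a - b) / b for b, a in zip(before, after)]


changes = relative_change(with_olympic, without_olympic)
for change in changes:
    print(f"{change:.2%}")
```

Both values increase by less than 1% when the Olympic pruning is removed from the stdev.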

Aug 14 '24 16:08 pgmpablo157321