Performance testing for the release

Open atalman opened this issue 3 years ago • 0 comments

Problems details

We have frequent performance regressions during the releases. Perf testing happens late in the release, and we are missing visibility into performance regressions during the release. Following are the performance results for releases 1.9, 1.10, 1.11 and 1.12.

The following issues are identified with this performance testing approach:
  • Performance testing on release candidates is performed after the final rc cut.
  • Only 1 (A100) machine is available for performance tests, and each performance test must be executed manually and takes 4-5 hours to execute.
  • After a performance test is completed, we manually collect the result and fill in a spreadsheet.
  • We test with only 1 version of CUDA/Python 3.x at a time.

This situation leads to the following issues:
  • Detecting release regressions only after the final rc cut.
  • Not detecting release regressions in some CUDA configurations, due to lack of testing.
  • Manually running tests and populating results can be time consuming and error prone.

Proposal

In collaboration with PyTorch Perf Infra, we should create and implement a testing pipeline specific to nightly binaries.

  • Create a smaller performance test suite to run on smaller and more available GPU machines (linux.4xlarge.nvidia.gpu).
  • Use GitHub Actions to execute the performance test on a nightly build (we probably don't need to cover all the binaries, but at least a few key binaries should be covered).
  • Surface the results of the perf test via Rockset and the HUD, so that performance runs can be monitored on a daily basis during the release.
  • Set up alerts in case a perf regression is discovered.
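The alerting step could be as small as a comparison between the nightly numbers and a baseline built from recent runs. The sketch below is only an illustration of that idea: the function name `find_regressions`, the 10% threshold, the use of a median baseline, and the benchmark names are all assumptions, not part of any existing PyTorch tooling.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Regression:
    benchmark: str
    baseline_s: float
    nightly_s: float
    slowdown: float  # fractional, e.g. 0.25 means 25% slower


def find_regressions(history, nightly, threshold=0.10):
    """Flag benchmarks whose nightly runtime exceeds the baseline by more than `threshold`.

    history: dict of benchmark name -> list of past runtimes in seconds
    nightly: dict of benchmark name -> runtime in seconds for the latest nightly
    (both shapes are hypothetical, chosen for this sketch)
    """
    regressions = []
    for name, runtime in sorted(nightly.items()):
        past = history.get(name)
        if not past:
            continue  # no baseline yet, nothing to compare against
        baseline = median(past)  # median damps one-off noisy runs
        slowdown = (runtime - baseline) / baseline
        if slowdown > threshold:
            regressions.append(Regression(name, baseline, runtime, slowdown))
    return regressions


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    history = {"resnet50_train": [10.0, 10.2, 9.8], "bert_infer": [4.0, 4.1, 3.9]}
    nightly = {"resnet50_train": 12.5, "bert_infer": 4.05}
    for r in find_regressions(history, nightly):
        print(f"ALERT {r.benchmark}: {r.slowdown:.0%} slower than baseline")
```

A relative threshold keeps the check independent of each benchmark's absolute runtime; the right value would depend on how noisy the chosen runner type turns out to be.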

Execution

We are planning to execute in the following waves:

Prep (~Q3 2022) (Perf Infra and Dev Infra):
  • Finalize requirements for the performance testing.
  • Establish:
      • Design of the test suite to be executed on a daily basis
      • Runner type to execute perf tests on
      • Design of the perf test results format to send to S3/Rockset
  • Create a POC for visualizing the results based on the output format.
  • Finalize resources needed.

Plan and execution (~Q4 2022):
  • Do pre-required eng work on the DIRE side.
  • Create GitHub Actions workflows to execute perf testing.
  • Create a POC to execute performance tests for each CUDA version.

Tasks list (draft):

  • [ ] #1098
  • [ ] #1099
  • [ ] #1100
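
As a starting point for the results-format and per-CUDA-version items, one possible shape is a flat record per benchmark run, serialized as JSON lines. This is a sketch under assumptions: the `PerfResult` fields, the `build_matrix` helper, and the example CUDA/Python versions are hypothetical, and the actual upload to S3/Rockset is deliberately left out.

```python
import json
from dataclasses import asdict, dataclass
from itertools import product


@dataclass
class PerfResult:
    # Hypothetical schema; field names are assumptions for this sketch.
    benchmark: str
    build_date: str       # nightly build date, ISO 8601
    cuda_version: str
    python_version: str
    runner: str           # e.g. "linux.4xlarge.nvidia.gpu"
    runtime_s: float


def build_matrix(cuda_versions, python_versions):
    """Return every (CUDA, Python) combination to be covered by a nightly run."""
    return list(product(cuda_versions, python_versions))


def to_json_lines(results):
    """Serialize results as one JSON object per line, ready to be written to a file."""
    return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in results)


if __name__ == "__main__":
    # Example versions only; the real matrix would come from the release config.
    for cuda, py in build_matrix(["11.6", "11.7"], ["3.8", "3.10"]):
        result = PerfResult("resnet50_train", "2022-08-09", cuda, py,
                            "linux.4xlarge.nvidia.gpu", 10.0)
        print(to_json_lines([result]))
```

Keeping the CUDA and Python versions as explicit fields would let the HUD slice results per configuration, which addresses the single-configuration gap described above.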

atalman, Aug 09 '22 17:08