[AutoScheduler] Add Dynamic Gradient Descent Search Algorithm for Auto-Tuning
This PR introduces the Dynamic Gradient Descent (DGD) Search algorithm for accelerating the auto-tuning process of GPU kernels within the Ansor/AutoScheduler framework. The DGD algorithm is designed to explore the search space more efficiently than the existing Genetic Algorithm-based approach. The following changes are included:
- **Dynamic Gradient Descent Search:**
  - Implements a new search strategy that performs gradient descent in a multi-dimensional tile space (see the sketch after this list).
  - Uses online measurements and a proxy model to guide the search process.
- **Record Processor:**
  - A new class that handles the processing and modification of measure records.
  - Includes methods to extract and modify SP (split) node coordinates.
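For intuition, the core search loop can be viewed as a discrete, measurement-driven descent over tile-size vectors. The following is a minimal, self-contained sketch, not the PR's actual code: `measure_cost` is a hypothetical stand-in for the online hardware measurements and proxy-model estimates, and `neighbors` for the SP-node coordinate mutations produced via the record processor.

```python
def neighbors(coords, step=1):
    """Yield tile-size vectors differing from `coords` in exactly one dimension."""
    for dim in range(len(coords)):
        for delta in (-step, step):
            cand = list(coords)
            cand[dim] += delta
            if cand[dim] > 0:  # tile sizes must stay positive
                yield tuple(cand)


def dynamic_gradient_descent(init_coords, measure_cost, max_steps=50):
    """Move to the best-measured neighbor until none improves, i.e. a
    local minimum in tile space -- a discrete analogue of gradient descent."""
    best = tuple(init_coords)
    best_cost = measure_cost(best)
    for _ in range(max_steps):
        cost, cand = min((measure_cost(c), c) for c in neighbors(best))
        if cost >= best_cost:
            break
        best, best_cost = cand, cost
    return best, best_cost


# Toy cost surface for illustration: pretend the optimal tile vector is (8, 4, 2).
opt = (8, 4, 2)
cost = lambda c: sum((a - b) ** 2 for a, b in zip(c, opt))
print(dynamic_gradient_descent((1, 1, 1), cost))  # -> ((8, 4, 2), 0)
```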
This implementation is based on the algorithm described in the paper "Accelerated Auto-Tuning of GPU Kernels for Tensor Computations" presented at ICS'24.
Experimental evaluation on a number of matrix-matrix multiplication and convolution kernels shows that the DGD algorithm achieves an order-of-magnitude improvement in auto-tuning time while maintaining comparable code performance.
Usage:
To use the DGD Search algorithm, instantiate the `DynamicGradientSearchTuner` class with the desired parameters and call its `dynamic_gradient_search` method.
Example:
```python
tuner = auto_scheduler.dynamic_gradient_search.DynamicGradientSearchTuner(task, log_file, tune_option)
tuner.dynamic_gradient_search()
```
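For completeness, here is a hedged end-to-end sketch of driving the tuner, assuming the standard Ansor task-definition flow; the matmul workload, log-file name, and trial count are illustrative, while `SearchTask`, `TuningOptions`, and `RecordToFile` are the existing auto-scheduler APIs.

```python
import tvm
from tvm import te, auto_scheduler


@auto_scheduler.register_workload
def matmul(N, L, M, dtype):
    A = te.placeholder((N, L), name="A", dtype=dtype)
    B = te.placeholder((L, M), name="B", dtype=dtype)
    k = te.reduce_axis((0, L), name="k")
    C = te.compute((N, M), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")
    return [A, B, C]


target = tvm.target.Target("cuda")
task = auto_scheduler.SearchTask(
    func=matmul, args=(1024, 1024, 1024, "float32"), target=target
)

log_file = "matmul_dgd.json"  # illustrative file name
tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=1000,  # illustrative budget
    measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
    verbose=2,
)

tuner = auto_scheduler.dynamic_gradient_search.DynamicGradientSearchTuner(
    task, log_file, tune_option
)
tuner.dynamic_gradient_search()
```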
Experiment setup:
The experiments ran the DGD Search algorithm under two budgets: a 1-hour time budget and the full duration used by Ansor, comparing the resulting code performance against what Ansor achieves after its suggested number of trials. The models used for the evaluation were BERT, ResNet-50, and MobileNetV2, with the following configurations based on the Apache blog post "Introducing TVM Auto-scheduler (a.k.a. Ansor)" (the baseline Ansor flow is sketched after the list):
- BERT: 12000 trials, running on an Nvidia RTX 4090 for 6 hours.
- ResNet-50: 20000 trials, running on an Nvidia RTX 4090 for 10 hours.
- MobileNetV2: 16000 trials, running on an Nvidia RTX 4090 for 7 hours.
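For reference, the Ansor baselines follow the standard network-tuning flow from that blog post. The sketch below is illustrative, using the built-in `relay.testing` ResNet-50 workload as a stand-in for the actual model import, and shows how a 20000-trial baseline run would be configured.

```python
import tvm
from tvm import auto_scheduler, relay
import tvm.relay.testing

# Built-in ResNet-50 workload; the reported runs import the real models,
# but this keeps the sketch self-contained.
mod, params = relay.testing.resnet.get_workload(num_layers=50, batch_size=1)
target = tvm.target.Target("cuda")

# Extract tuning tasks from the network and schedule them jointly.
tasks, task_weights = auto_scheduler.extract_tasks(mod["main"], params, target)
task_scheduler = auto_scheduler.TaskScheduler(tasks, task_weights)

log_file = "resnet50_ansor.json"
tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=20000,  # total trials across all tasks, per the setup above
    measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
)
task_scheduler.tune(tune_option)
```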
Relative performance of the DGD Search algorithm at a 1-hour budget and at the full duration used by Ansor:
| Network | Relative perf. (1 hour) | Relative perf. (full duration) |
|---|---|---|
| BERT | 93.71% | 100.15% |
| ResNet-50 | 90.46% | 96.73% |
| MobileNetV2 | 95.08% | 101.75% |
This table reports the performance of code found by DGD Search relative to Ansor's final result, both under a 1-hour time budget and when given the full duration Ansor used. The ratios show that the Dynamic Gradient Descent Search algorithm reaches over 90% of Ansor's performance within an hour, and comparable or slightly better performance when given the full budget.
Thank you @Lurkrazy for this contribution!
Cc'ing relevant folks here: @comaniac @jcf94 @merrymercy @FrozenGene @minminsun @jinhongyii
Given that we are migrating toward meta-schedule and may phase out the auto-scheduler, I would suggest bringing new changes to that path.