
[AutoScheduler] Add Dynamic Gradient Descent Search Algorithm for Auto-Tuning


This PR introduces the Dynamic Gradient Descent (DGD) Search algorithm for accelerating the auto-tuning process of GPU kernels within the Ansor/AutoScheduler framework. The DGD algorithm is designed to explore the search space more efficiently than the existing Genetic Algorithm-based approach. The following changes are included:

  1. Dynamic Gradient Descent Search:

    • Implements a new search strategy that performs gradient descent in a multi-dimensional tile space (a minimal sketch follows this list).
    • Uses online hardware measurements together with a proxy model to guide the search.
  2. Record Processor:

    • A new class that handles the processing and modification of measure records.
    • Includes methods to extract and modify SP (split) node coordinates (a hypothetical sketch appears after the paper reference below).
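
As a rough illustration of item 1 (not the PR's actual code), here is a minimal sketch of greedy descent over a discrete tile-size space: measure the current point's neighbors along each tile dimension and move to the best one until nothing improves. All names are hypothetical, the toy cost function stands in for real on-device GPU measurements, and the proxy-model guidance described above is omitted.

# Hypothetical sketch of gradient descent in a discrete tile space.
def tile_neighbors(point):
    """Neighbors differ by one power-of-two step in a single dimension."""
    result = []
    for i, v in enumerate(point):
        for nv in (v // 2, v * 2):
            if 1 <= nv <= 256:
                result.append(point[:i] + (nv,) + point[i + 1:])
    return result

def dgd_search(seed, neighbors, measure, max_steps=50):
    """Greedy descent: repeatedly move to the best-measured neighbor."""
    best, best_cost = seed, measure(seed)
    for _ in range(max_steps):
        scored = [(measure(c), c) for c in neighbors(best)]
        if not scored:
            break
        cost, candidate = min(scored)
        if cost >= best_cost:  # no neighbor improves: local optimum reached
            break
        best, best_cost = candidate, cost
    return best

# Toy run: pretend (32, 8) is the ideal tile shape; cost is distance to it.
print(dgd_search((4, 64), tile_neighbors,
                 lambda p: sum(abs(a - b) for a, b in zip(p, (32, 8)))))

In the actual algorithm, as described above, online measurements and a proxy model take the place of the toy cost function.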

This implementation is based on the algorithm described in the paper "Accelerated Auto-Tuning of GPU Kernels for Tensor Computations" presented at ICS'24.
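
To make the record-processing component concrete, here is a hypothetical sketch (none of these names come from the PR) of pulling SP node coordinates out of Ansor's JSON measure records. It assumes the v0.6 log format, where the "i" field holds the serialized task and program state and split steps are tagged "SP"; the exact nesting of the state is not assumed, so the scan is recursive.

# Hypothetical record handling in the spirit of the PR's RecordProcessor.
import json

def find_sp_steps(obj):
    """Recursively collect list entries tagged "SP" (Ansor split steps)."""
    found = []
    if isinstance(obj, list):
        if obj and obj[0] == "SP":
            found.append(obj)
        else:
            for item in obj:
                found.extend(find_sp_steps(item))
    return found

def sp_coordinates(record_line):
    """Return the tile-size lists from every SP step in one JSON log line."""
    record = json.loads(record_line)
    # Assumed layout of a serialized split step:
    # ["SP", stage_id, iter_id, extent, [tile_sizes], inner_to_outer]
    return [step[4] for step in find_sp_steps(record["i"])]

Modifying a coordinate is then a matter of editing a tile-size list in place and serializing the record back with json.dumps, so that neighboring points in the tile space can be replayed through the normal measurement pipeline.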

Experimental evaluation on a set of matrix-matrix multiplication and convolution kernels shows that the DGD algorithm achieves an order-of-magnitude reduction in auto-tuning time while producing code of comparable performance.

Usage:

To use the DGD Search algorithm, instantiate the DynamicGradientSearchTuner class with the desired parameters and call the dynamic_gradient_search method.

Example:

# Assumes `task`, `log_file`, and `tune_option` are already defined (see below).
from tvm import auto_scheduler

tuner = auto_scheduler.dynamic_gradient_search.DynamicGradientSearchTuner(task, log_file, tune_option)
tuner.dynamic_gradient_search()
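
For context, a hypothetical end-to-end setup might look as follows. Only the last two lines come from this PR; the workload, target, and tuning options are standard auto_scheduler usage chosen for illustration.

import tvm
from tvm import te, auto_scheduler

# Illustrative workload: a dense matmul registered with the auto-scheduler.
@auto_scheduler.register_workload
def matmul(M, N, K, dtype):
    A = te.placeholder((M, K), name="A", dtype=dtype)
    B = te.placeholder((K, N), name="B", dtype=dtype)
    k = te.reduce_axis((0, K), name="k")
    C = te.compute((M, N), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")
    return [A, B, C]

target = tvm.target.Target("cuda")
task = auto_scheduler.SearchTask(func=matmul, args=(1024, 1024, 1024, "float32"), target=target)
log_file = "matmul_dgd.json"
tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=1000,
    measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
    verbose=2,
)

tuner = auto_scheduler.dynamic_gradient_search.DynamicGradientSearchTuner(task, log_file, tune_option)
tuner.dynamic_gradient_search()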

Experiments setup:

The experiments ran the DGD Search algorithm under two budgets, a 1-hour limit and the full duration used by Ansor, and compared the achieved performance against Ansor after its suggested number of trials. The networks evaluated were BERT, ResNet-50, and MobileNetV2, configured following the Apache blog post Introducing TVM Auto-scheduler (a.k.a. Ansor):

  • BERT: 12000 trials, running on an NVIDIA RTX 4090 for 6 hours.
  • ResNet-50: 20000 trials, running on an NVIDIA RTX 4090 for 10 hours.
  • MobileNetV2: 16000 trials, running on an NVIDIA RTX 4090 for 7 hours.

Relative performance of DGD Search within a 1-hour budget and within the full duration used by Ansor:

Network       Ratio (1 hour)   Ratio (full)
BERT          93.71%           100.15%
ResNet-50     90.46%           96.73%
MobileNetV2   95.08%           101.75%

This table reports the performance of code found by DGD Search relative to the code Ansor finds over its full tuning duration; a ratio above 100% means the DGD-tuned code runs faster than Ansor's. The ratios show that Dynamic Gradient Descent Search reaches comparable performance within a significantly reduced time budget.

Lurkrazy · Jun 28 '24 23:06

Thank you @Lurkrazy for this contribution!

Cc'ing relevant folks here: @comaniac @jcf94 @merrymercy @FrozenGene @minminsun @jinhongyii

cbalint13 · Aug 12 '24 09:08

Given that we are migrating toward meta-schedule and may phase out the auto-scheduler, I would suggest we bring new changes to that path.

tqchen · Aug 12 '24 12:08