
[AutoScheduler] Add Dynamic Gradient Descent Search Algorithm for Auto-Tuning


This PR introduces the Dynamic Gradient Descent (DGD) Search algorithm for accelerating the auto-tuning process of GPU kernels within the Ansor/AutoScheduler framework. The DGD algorithm is designed to explore the search space more efficiently than the existing Genetic Algorithm-based approach. The following changes are included:

  1. Dynamic Gradient Descent Search:

    • Implements a new search strategy that performs gradient descent in a multi-dimensional tile space (a minimal sketch follows this list).
    • Uses online hardware measurements together with a proxy model to guide the search.
  2. Record Processor:

    • A new class that handles the processing and modification of measure records.
    • Includes methods to extract and modify SP (split) node coordinates (a hypothetical sketch appears after the paper reference below).
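
As a rough illustration of item 1 (not the PR's actual code), here is a minimal sketch of greedy descent over a discrete tile-size space: measure the current point's neighbors along each tile dimension and move to the best one until nothing improves. All names are hypothetical, the toy cost function stands in for real on-device GPU measurements, and the proxy-model guidance described above is omitted.

# Hypothetical sketch of gradient descent in a discrete tile space.
def tile_neighbors(point):
    """Neighbors differ by one power-of-two step in a single dimension."""
    result = []
    for i, v in enumerate(point):
        for nv in (v // 2, v * 2):
            if 1 <= nv <= 256:
                result.append(point[:i] + (nv,) + point[i + 1:])
    return result

def dgd_search(seed, neighbors, measure, max_steps=50):
    """Greedy descent: repeatedly move to the best-measured neighbor."""
    best, best_cost = seed, measure(seed)
    for _ in range(max_steps):
        scored = [(measure(c), c) for c in neighbors(best)]
        if not scored:
            break
        cost, candidate = min(scored)
        if cost >= best_cost:  # no neighbor improves: local optimum reached
            break
        best, best_cost = candidate, cost
    return best

# Toy run: pretend (32, 8) is the ideal tile shape; cost is distance to it.
print(dgd_search((4, 64), tile_neighbors,
                 lambda p: sum(abs(a - b) for a, b in zip(p, (32, 8)))))

In the actual algorithm, as described above, online measurements and a proxy model take the place of the toy cost function.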

This implementation is based on the algorithm described in the paper "Accelerated Auto-Tuning of GPU Kernels for Tensor Computations" presented at ICS'24.
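
To make the record-processing component concrete, here is a hypothetical sketch (none of these names come from the PR) of pulling SP node coordinates out of Ansor's JSON measure records. It assumes the v0.6 log format, where the "i" field holds the serialized task and program state and split steps are tagged "SP"; the exact nesting of the state is not assumed, so the scan is recursive.

# Hypothetical record handling in the spirit of the PR's RecordProcessor.
import json

def find_sp_steps(obj):
    """Recursively collect list entries tagged "SP" (Ansor split steps)."""
    found = []
    if isinstance(obj, list):
        if obj and obj[0] == "SP":
            found.append(obj)
        else:
            for item in obj:
                found.extend(find_sp_steps(item))
    return found

def sp_coordinates(record_line):
    """Return the tile-size lists from every SP step in one JSON log line."""
    record = json.loads(record_line)
    # Assumed layout of a serialized split step:
    # ["SP", stage_id, iter_id, extent, [tile_sizes], inner_to_outer]
    return [step[4] for step in find_sp_steps(record["i"])]

Modifying a coordinate is then a matter of editing a tile-size list in place and serializing the record back with json.dumps, so that neighboring points in the tile space can be replayed through the normal measurement pipeline.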

Experimental evaluation on a set of matrix-matrix multiplication and convolution kernels shows that the DGD algorithm achieves an order-of-magnitude reduction in auto-tuning time while producing code of comparable performance.

Usage:

To use the DGD Search algorithm, instantiate the DynamicGradientSearchTuner class with the desired parameters and call the dynamic_gradient_search method.

Example:

# Assumes `task`, `log_file`, and `tune_option` are already defined (see below).
from tvm import auto_scheduler

tuner = auto_scheduler.dynamic_gradient_search.DynamicGradientSearchTuner(task, log_file, tune_option)
tuner.dynamic_gradient_search()
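
For context, a hypothetical end-to-end setup might look as follows. Only the last two lines come from this PR; the workload, target, and tuning options are standard auto_scheduler usage chosen for illustration.

import tvm
from tvm import te, auto_scheduler

# Illustrative workload: a dense matmul registered with the auto-scheduler.
@auto_scheduler.register_workload
def matmul(M, N, K, dtype):
    A = te.placeholder((M, K), name="A", dtype=dtype)
    B = te.placeholder((K, N), name="B", dtype=dtype)
    k = te.reduce_axis((0, K), name="k")
    C = te.compute((M, N), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")
    return [A, B, C]

target = tvm.target.Target("cuda")
task = auto_scheduler.SearchTask(func=matmul, args=(1024, 1024, 1024, "float32"), target=target)
log_file = "matmul_dgd.json"
tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=1000,
    measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
    verbose=2,
)

tuner = auto_scheduler.dynamic_gradient_search.DynamicGradientSearchTuner(task, log_file, tune_option)
tuner.dynamic_gradient_search()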

Experiments setup:

The experiments ran the DGD Search algorithm under two budgets, a 1-hour limit and the full duration used by Ansor, and compared the achieved performance against Ansor after its suggested number of trials. The networks evaluated were BERT, ResNet-50, and MobileNetV2, configured following the Apache blog post Introducing TVM Auto-scheduler (a.k.a. Ansor):

  • BERT: 12000 trials, running on an NVIDIA RTX 4090 for 6 hours.
  • ResNet-50: 20000 trials, running on an NVIDIA RTX 4090 for 10 hours.
  • MobileNetV2: 16000 trials, running on an NVIDIA RTX 4090 for 7 hours.

Relative performance of DGD Search within a 1-hour budget and within the full duration used by Ansor:

Network       Ratio (1 hour)   Ratio (full)
BERT          93.71%           100.15%
ResNet-50     90.46%           96.73%
MobileNetV2   95.08%           101.75%

This table reports the performance of code found by DGD Search relative to the code Ansor finds over its full tuning duration; a ratio above 100% means the DGD-tuned code runs faster than Ansor's. The ratios show that Dynamic Gradient Descent Search reaches comparable performance within a significantly reduced time budget.

Lurkrazy · Jun 28 '24 23:06

Thank you @Lurkrazy for this contribution!

Cc'ing relevant folks here: @comaniac @jcf94 @merrymercy @FrozenGene @minminsun @jinhongyii

cbalint13 · Aug 12 '24 09:08

Given that we are migrating toward meta-schedule and may phase out the auto-scheduler, I would suggest we bring new changes to that path.

tqchen · Aug 12 '24 12:08