How to use and benchmark the Halide autoscheduler?
Hi Halide developers,
I am trying to use the Halide autoscheduler to generate a schedule for matmul by following the old tutorials. (By the way, they are really old: the arguments last_level_cache_size and balance no longer seem to be in use.)
I found that the schedule produced by the autoscheduler does not perform well, so I would like to check with you whether I am using the autoscheduler correctly.
My generator (matmul_generator.cpp) looks like this:
`class MatMulGenerator : public Halide::Generator<MatMulGenerator> {
public:
Input<Buffer
void generate() {
Var x("x"), y("y"), k("k");
Func result("result");
RDom r(0, A.dim(1).extent());
result(x, y) = Halide::Expr(0.0);
result(x, y) += A(x, r.x) * B(r.x, y);
C(x, y) = result(x, y);
}
void schedule() {
if (using_autoscheduler()) {
A.set_estimates({{0, 4096}, {0, 4096}});
B.set_estimates({{0, 4096}, {0, 4096}});
C.set_estimates({{0, 4096}, {0, 4096}});
} else {
C.compute_root();
}
}
};
HALIDE_REGISTER_GENERATOR(MatMulGenerator, matmul_generator) `
Then I use these commands to build the generator and produce the schedule, following the tutorial.
g++ matmul_generator.cpp /path/to/GenGen.cpp -g -std=c++17 -fno-rtti -I/path/to/halide/include -L/path/to/halide/lib -lHalide -lpthread -ldl -o matmul_generator
./matmul_generator -o . -g matmul_generator -f matmul_autoschedule_true -e static_library,h,schedule -p /path/to/halide/lib/libautoschedule_adams2019.so target=host autoscheduler=Adams2019 autoscheduler.parallelism=8
In another cpp file, I call the scheduled matrix multiplication with this line:
matmul_autoschedule_true(A.raw_buffer(), B.raw_buffer(), C.raw_buffer());
I also have questions about how to benchmark the autoscheduler's performance on a given kernel. As I understand it, in test/performance/matrix_multiplication.cpp, out.realize(output); is called twice because the first call includes the overhead of the code generation phase, so Halide's performance has to be measured with the second call.
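The warm-up-then-measure pattern described above can be sketched without Halide at all. This is a minimal sketch under assumptions: `best_time_ms` and `standin_matmul` are hypothetical names, and `standin_matmul` is a naive placeholder for whatever kernel is being timed, not the generated `matmul_autoschedule_true`.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <limits>
#include <vector>

// Runs `kernel` `warmup` times without timing it (so one-time costs are paid
// up front), then returns the best wall-clock time in milliseconds over
// `iters` timed runs.
template <typename F>
double best_time_ms(F &&kernel, int warmup, int iters) {
    assert(warmup >= 0 && iters > 0);
    for (int i = 0; i < warmup; ++i) {
        kernel();  // untimed warm-up call
    }
    double best = std::numeric_limits<double>::infinity();
    for (int i = 0; i < iters; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        kernel();
        auto t1 = std::chrono::steady_clock::now();
        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        best = std::min(best, ms);
    }
    return best;
}

// Hypothetical stand-in for the kernel under test: a naive n x n matmul on
// row-major buffers. A real benchmark would call the generated function here.
void standin_matmul(const std::vector<float> &a, const std::vector<float> &b,
                    std::vector<float> &c, int n) {
    for (int y = 0; y < n; ++y) {
        for (int x = 0; x < n; ++x) {
            float acc = 0.0f;
            for (int k = 0; k < n; ++k) {
                acc += a[static_cast<std::size_t>(y) * n + k] *
                       b[static_cast<std::size_t>(k) * n + x];
            }
            c[static_cast<std::size_t>(y) * n + x] = acc;
        }
    }
}
```

To time a kernel, wrap the call in a lambda, for example `best_time_ms([&] { standin_matmul(a, b, c, n); }, 1, 10)`. Halide also ships a comparable helper, `Halide::Tools::benchmark`, in `tools/halide_benchmark.h`.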
To summarize, my questions are:
- Is my way of using the autoscheduler correct?
- I have a minor concern: when benchmarking Halide with the second realize call, the cache is no longer cold, which may lead to overestimating performance.
- When using the autoscheduler and calling the kernel like this: `matmul_autoschedule_true(A.raw_buffer(), B.raw_buffer(), C.raw_buffer());`, does the call include a code generation phase that could lead to underestimating performance?
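One common way to probe the cold-cache concern is to evict the caches before each timed run by streaming through a scratch buffer larger than the last-level cache. The following is a rough sketch under assumptions: `evict_caches` and `cold_time_ms` are hypothetical helper names, the scratch size needed depends on the machine (something like 64 MiB or more is a guess, not a measured value), and the kernel is assumed to have been called once beforehand so that one-time startup costs are not counted.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Reads and writes every byte of `scratch` so that data cached by earlier
// work is likely evicted. Returns a checksum so the compiler cannot drop
// the loop as dead code.
unsigned long long evict_caches(std::vector<unsigned char> &scratch) {
    unsigned long long sum = 0;
    for (std::size_t i = 0; i < scratch.size(); ++i) {
        scratch[i] = static_cast<unsigned char>(scratch[i] + 1);
        sum += scratch[i];
    }
    return sum;
}

// Returns the mean wall-clock time in milliseconds of `kernel` over `iters`
// runs, evicting the caches (untimed) before every run.
template <typename F>
double cold_time_ms(F &&kernel, std::vector<unsigned char> &scratch, int iters) {
    assert(iters > 0);
    volatile unsigned long long sink = 0;  // keeps the checksum observable
    double total = 0.0;
    for (int i = 0; i < iters; ++i) {
        sink = evict_caches(scratch);  // untimed cache eviction
        auto t0 = std::chrono::steady_clock::now();
        kernel();
        auto t1 = std::chrono::steady_clock::now();
        total += std::chrono::duration<double, std::milli>(t1 - t0).count();
    }
    (void)sink;
    return total / iters;
}
```

Comparing this number with a warm-cache measurement of the same kernel gives a sense of how much the cache state matters for a given problem size. For inputs much larger than the last-level cache, the two may turn out to be close.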
Thanks a lot!