Question about performance results for BCSR

Open thepalbi opened this issue 4 months ago • 3 comments

Hi team, I'm running some benchmarks for a sparse matrix-matrix multiplication implementation of mine, and comparing everything against Eigen CSR (that's the baseline). I'm getting these results for TACO in the BCSR configuration @stephenchouca suggested here:

Image

taco is being compiled as follows: cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_POLICY_VERSION_MINIMUM=3.5 ... So no OpenMP or anything else.

The size and density are those of the block sparse matrices (with uniform block sizes), and the speedup is calculated against Eigen.

Is it possible that, in these cases, taco is supposed to work best with OpenMP? Or am I missing something?

Also, I'm running pack on the input matrices, and compile and assemble beforehand, and the main benchmark (with Google Benchmark) looks as follows:

  for (const auto& prepared : preparedCases) {
    const std::string benchName = makeBenchmarkName(prepared->config);
    benchmark::RegisterBenchmark(
        benchName.c_str(),
        [prepared](benchmark::State& state) {
          for (auto _ : state) {
            prepared->result.compute();
            benchmark::DoNotOptimize(prepared->result);
            benchmark::ClobberMemory();
          }
          const double total_mults = static_cast<double>(state.iterations());
          state.counters["mults"] = benchmark::Counter(total_mults);
          state.counters["mults_per_sec"] =
              benchmark::Counter(total_mults, benchmark::Counter::kIsRate);
        });
  }
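The separation described above (setup done once, only the kernel call timed) can be sketched without any benchmarking library. This is a minimal illustration, not part of the original benchmark: `meanSecondsPerCall` and `speedup` are hypothetical helper names, the `setup` callable stands in for pack/compile/assemble, and `speedup` assumes the usual definition of baseline time divided by candidate time, so values below 1 mean the candidate is slower than the baseline.

```cpp
#include <chrono>
#include <functional>

// Hypothetical helper: runs `setup` once outside the timed region, then
// returns the mean wall-clock seconds per call of `kernel`.
double meanSecondsPerCall(const std::function<void()>& setup,
                          const std::function<void()>& kernel,
                          int iterations) {
  if (iterations <= 0) return 0.0;  // nothing to measure
  setup();  // stands in for pack/compile/assemble; deliberately not timed
  const auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iterations; ++i) {
    kernel();  // only this call is inside the timed region
  }
  const auto stop = std::chrono::steady_clock::now();
  const std::chrono::duration<double> elapsed = stop - start;
  return elapsed.count() / iterations;
}

// Assumed definition: speedup of a candidate relative to a baseline.
// A value below 1 means the candidate is slower than the baseline.
double speedup(double baselineSeconds, double candidateSeconds) {
  return baselineSeconds / candidateSeconds;
}
```

With this definition, a baseline taking 1 s and a candidate taking 4 s gives a speedup of 0.25.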

thepalbi, Dec 09 '25 09:12

I'm assuming the speedup numbers indicate that TACO isn't doing very well. The numbers will improve somewhat with OpenMP, as Eigen is using multiple cores internally as well. I'm not sure how much tuning has gone into the generated BCSR matmul in taco. Is the CSR SpGEMM performing reasonably?

rohany, Dec 09 '25 17:12

> The numbers will improve somewhat with OpenMP, as Eigen is using multiple cores internally as well

For these experiments I'm restricting everything to a single core with OMP_NUM_THREADS=1. I checked Eigen's CSR code, and it uses OpenMP for parallelization, so that should make everything single-threaded.
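One way to double-check that the setting took effect is to query the OpenMP runtime from inside the benchmark binary. The following is a small sketch under stated assumptions: `effectiveMaxThreads` is a hypothetical helper name, and it falls back to 1 when the binary was built without OpenMP, in which case `OMP_NUM_THREADS` has no effect anyway.

```cpp
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: upper bound on the number of threads OpenMP would
// use for a parallel region, or 1 when built without OpenMP support.
int effectiveMaxThreads() {
#ifdef _OPENMP
  return omp_get_max_threads();  // reflects OMP_NUM_THREADS unless overridden in code
#else
  return 1;  // no OpenMP runtime linked in, so everything is sequential
#endif
}
```

Printing this value once at benchmark start-up (it should be 1 when running with `OMP_NUM_THREADS=1`) makes the single-threaded assumption explicit in the logs.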

I'll get the numbers for CSR.

Also, does the benchmark code look good? That is, should one measure only the time spent in .compute()?

One final question, ha. When using or measuring taco's performance, is multi-threaded the intended use? Or single-threaded as well?

thepalbi, Dec 10 '25 09:12

> For these experiments I'm restricting everything to a single core with OMP_NUM_THREADS=1. I checked Eigen's CSR code, and it uses OpenMP for parallelization, so that should make everything single-threaded.

Yeah, sounds right.

> Also, does the benchmark code look good? That is, should one measure only the time spent in .compute()?

I haven't looked at this in a long time, but it seems right based on my memory.

> One final question, ha. When using or measuring taco's performance, is multi-threaded the intended use? Or single-threaded as well?

If you have multiple cores it makes sense to use them, but TACO generates reasonable single-threaded CPU code too (most of the multi-threaded generated code is just the same sequential code with parallel for annotations).
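To illustrate what "the same sequential code with parallel for annotations" can look like, here is a hand-written CSR matrix-vector product. It is an illustrative sketch only, not code emitted by TACO; the function name `csrSpmv` and its parameters are made up for this example. Built without OpenMP, the pragma is skipped and the loop runs sequentially; built with OpenMP, the outer row loop is split across threads.

```cpp
#include <cstddef>
#include <vector>

// Hand-written illustration (NOT TACO output): y = A * x for a CSR matrix A
// described by row pointers, column indices, and nonzero values.
std::vector<double> csrSpmv(const std::vector<int>& rowPtr,
                            const std::vector<int>& colIdx,
                            const std::vector<double>& vals,
                            const std::vector<double>& x) {
  if (rowPtr.empty()) return {};
  const int numRows = static_cast<int>(rowPtr.size()) - 1;
  std::vector<double> y(static_cast<std::size_t>(numRows), 0.0);
  // The only difference between the sequential and the multi-threaded
  // version is this annotation; each row writes a distinct y[i].
#ifdef _OPENMP
#pragma omp parallel for
#endif
  for (int i = 0; i < numRows; ++i) {
    double sum = 0.0;
    for (int p = rowPtr[i]; p < rowPtr[i + 1]; ++p) {
      sum += vals[p] * x[colIdx[p]];
    }
    y[i] = sum;
  }
  return y;
}
```

Because the loop body is identical in both builds, single-threaded timings of such a kernel reflect the quality of the sequential code itself.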

rohany, Dec 10 '25 17:12