[fix] fix dispatch table copy
Hi,
while comparing QPL to the related functionality in TUM's Umbra, I noticed that the former has a large overhead for small numbers of tuples. After investigating with VTune and the following microbenchmark, I think I found a typo that leads to an accidental copy of the function pointer table in input_stream_t::initialize_sw_kernels. Here's the benchmark:
#include "qpl/qpl.h"
#include <benchmark/benchmark.h>
#include "../DataDistribution.hpp"
#include "../Utils.hpp"
int main() {
    // pointer wrapper with destructor
    ExecutionContext fixture(qpl_path_software);
    qpl_job* job = reinterpret_cast<qpl_job*>(fixture.getExecutionContext());

    uint32_t numTuples = 1;
    double selectivity = 0.5;
    uint32_t inBitWidth = 8;
    qpl_out_format outFormat = qpl_out_format::qpl_ow_32;

    // private utilities, do what you would expect
    LittleEndianBufferBuilder builder(inBitWidth, numTuples);
    Dataset dataset = UniformDistribution::generateDataset(
        builder,
        inBitWidth,
        (1ul << inBitWidth) - 1,
        numTuples,
        umbra::VectorizedFunctions::Mode::Eq,
        selectivity);

    std::vector<uint8_t> destination;
    destination.resize(divceil(32 * numTuples, 8));

    for (size_t i = 0; i < 1'000'000'000; i++) {
        if (i % 1'000'000 == 0) {
            std::cout << i << std::endl;
        }

        // Parameterize jobs.
        {
            job->parser = builder.getQPLParser();
            job->next_in_ptr = dataset.data.data();
            job->available_in = dataset.data.size();
            job->next_out_ptr = destination.data();
            job->available_out = static_cast<uint32_t>(destination.size());
            job->op = map_umbra_to_qpl_op(umbra::VectorizedFunctions::Mode::Eq);
            job->src1_bit_width = inBitWidth;
            job->num_input_elements = numTuples;
            job->out_bit_width = outFormat;
            auto [param_low, param_high] = dataset.predicateArguments->template getValues<uint32_t, 2>();
            job->param_low = param_low;
            job->param_high = param_high;
            job->flags = static_cast<uint32_t>(QPL_FLAG_OMIT_CHECKSUMS | QPL_FLAG_OMIT_AGGREGATES);
        }

        qpl_status status = qpl_execute_job(job);
        if (status != QPL_STS_OK) {
            ERROR("An error occurred during job execution: " << status);
        }

        const auto indicesByteSize = job->total_out;
        const auto bytesPerHit = divceil(32, 8);
        const auto qplHits = indicesByteSize / bytesPerHit;
        if (outFormat != qpl_ow_nom && qplHits != dataset.predicateHits) {
            ERROR("Result does not fit expectations: " << qplHits << " != " << dataset.predicateHits);
        } else if (outFormat == qpl_ow_nom && job->total_out != divceil(numTuples, 8)) {
            ERROR("Result does not fit expectations: " << job->total_out << " != " << divceil(numTuples, 8));
        }

        benchmark::DoNotOptimize(job->total_out);
        benchmark::DoNotOptimize(status);
        benchmark::DoNotOptimize(destination);
        benchmark::ClobberMemory();
    }
}
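For context, divceil and map_umbra_to_qpl_op are private utilities from our benchmark harness and are not part of this PR. A rough, hypothetical sketch of what they do (the qpl_op_scan_eq mapping for the Eq case is my assumption, and umbra::VectorizedFunctions::Mode comes from our own headers):

// Rough, hypothetical sketch of the private helpers used above; the real
// implementations live in ../Utils.hpp and ../DataDistribution.hpp.
#include <cstdint>
#include <stdexcept>

#include "qpl/qpl.h"

// Integer division that rounds up instead of down.
constexpr uint32_t divceil(uint32_t numerator, uint32_t denominator) {
    return (numerator + denominator - 1) / denominator;
}

// Assumed mapping from our predicate mode to the QPL scan operation;
// only the equality case is exercised by this benchmark.
qpl_operation map_umbra_to_qpl_op(umbra::VectorizedFunctions::Mode mode) {
    if (mode == umbra::VectorizedFunctions::Mode::Eq) {
        return qpl_op_scan_eq;
    }
    throw std::runtime_error("unsupported predicate mode");
}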
The benchmark was run for 15 s on an i9-13900K. Screenshot of the VTune summary before the PR:
After the PR:
After further investigation, I noticed that the same pattern is present in other locations as well. The second force push now covers those, too (found by searching for core_sw::dispatcher::kernels_dispatcher::get_instance()).
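To make the pattern concrete, here is a minimal sketch of the kind of typo I mean (names simplified, stand-ins for the actual QPL sources): binding the table returned by the dispatcher singleton to a plain auto local deduces a value type and copies the whole array of function pointers on every call, while binding it by reference does not.

#include <array>

// Minimal stand-in for the kernel dispatcher singleton; the real one is
// core_sw::dispatcher::kernels_dispatcher with considerably larger tables.
struct kernels_dispatcher {
    using kernel_t = void (*)();
    std::array<kernel_t, 64> scan_table{};

    static kernels_dispatcher& get_instance() {
        static kernels_dispatcher instance;
        return instance;
    }

    const std::array<kernel_t, 64>& get_scan_table() const { return scan_table; }
};

void initialize_sw_kernels_buggy() {
    // Typo: plain `auto` deduces std::array by value, so the whole table of
    // function pointers is copied even though a reference was returned.
    auto table = kernels_dispatcher::get_instance().get_scan_table();
    (void)table;
}

void initialize_sw_kernels_fixed() {
    // Fix: bind the returned reference; no copy is made.
    const auto& table = kernels_dispatcher::get_instance().get_scan_table();
    (void)table;
}

With small inputs the copy dominates the runtime, which matches the VTune profile above.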
Hi @Jonas-Heinrich, thank you for the contribution! The team will review and do thorough testing on our side in the upcoming weeks.