Add AMX support to speed up Faiss Inner-Product
Use Intel AMX to accelerate the inner-product path of knowhere::BruteForce::Search(), which can bring a more than 10x performance boost.
Build parameter: use "-o with_dnnl=True/False" to enable or disable the AMX feature. This feature depends on libdnnl.so.3, which you can install by running scripts/install_deps.sh.
Runtime parameter: to use the AMX feature, you must first set the environment variable "DNNL_ENABLE=1"; otherwise the AMX path will not be used.
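Put together, the build and runtime setup described above could look like the following sketch (the exact build invocation may differ per environment; treat the `conan build` line as an assumption):

```shell
# install the oneDNN runtime dependency (provides libdnnl.so.3)
./scripts/install_deps.sh

# build with the AMX/oneDNN code path compiled in (use with_dnnl=False to disable)
conan build . -o with_dnnl=True

# opt in to the AMX path at runtime; without this the regular SIMD path is used
export DNNL_ENABLE=1
```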
@mellonyou 🔍 Important: PR Classification Needed!
For efficient project management and a seamless review process, it's essential to classify your PR correctly. Here's how:
- If you're fixing a bug, label it as kind/bug.
- For small tweaks (less than 20 lines without altering any functionality), please use kind/improvement.
- Significant changes that don't modify existing functionalities should be tagged as kind/enhancement.
- Adjusting APIs or changing functionality? Go with kind/feature.
For any PR outside the kind/improvement category, ensure you link to the associated issue using the format: “issue: #
Thanks for your efforts and contribution to the community!
issue: #541
I can't edit the labels. Do I need any access permissions?
/kind enhancement
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 71.59%. Comparing base (3c46f4c) to head (7b6f49a). Report is 179 commits behind head on main.
:exclamation: Current head 7b6f49a differs from pull request most recent head ff5c7cd
Please upload reports for the commit ff5c7cd to get more accurate results.
Additional details and impacted files
@@ Coverage Diff @@
## main #535 +/- ##
=========================================
+ Coverage 0 71.59% +71.59%
=========================================
Files 0 67 +67
Lines 0 4446 +4446
=========================================
+ Hits 0 3183 +3183
- Misses 0 1263 +1263
- Ported the code to knowhere to follow the dynamic hook interface.
- About the filter: I wrote a simple benchmark comparing the no-filter AMX inner product against the SIMD inner product with filters (0.1f, 0.5f, 0.9f). It can be seen that AMX still has a perf. advantage even when the filter percentage reaches 0.9:

| filter percentage | 0.1 | 0.5 | 0.9 |
| --- | --- | --- | --- |
| simd result (s) | 0.432 | 0.208 | 0.043 |
| amx result (s) | 0.033 | 0.033 | 0.033 |
| dnnl perf. boost | 13.1x | 6.3x | 1.3x |
For the code, the AMX inner-product interface is better suited to processing batched vectors, and it doesn't support a filter interface. I have two ideas:
- The AMX inner product handles only the no-filter scenario.
- Add a percentage parameter to the interface; when the filtered percentage is less than 0.9, choose the AMX inner product.
@alexanderguzhva Looking forward to your suggestions.
@mellonyou Could you please include a benchmark or, at least, its details? The numbers you've provided cannot be interpreted properly without knowing:
- the exact number of samples
- the dimensionality
- whether these are single-query or batched-query requests
- whether this is a test of this particular function or of a whole index
- etc.
The results are potentially interesting and are definitely worth checking on my end.
```cpp
#include <sys/time.h>

#include "simd/distances_onednn.h"

#define MAX_LOOP 20

TEST_CASE("Test Brute Force", "[float vector]") {
    using Catch::Approx;

    const int64_t nb = 2000000;
    const int64_t nq = 10;
    const int64_t dim = 512;
    const int64_t k = 100;

    auto metric = GENERATE(as<std::string>{}, knowhere::metric::IP);
    const auto train_ds = GenDataSet(nb, dim);
    const auto query_ds = CopyDataSet(train_ds, nq);

    const knowhere::Json conf = {
        {knowhere::meta::DIM, dim},
        {knowhere::meta::METRIC_TYPE, metric},
        {knowhere::meta::TOPK, k},
        {knowhere::meta::RADIUS, knowhere::IsMetricType(metric, knowhere::metric::IP) ? 10.0 : 0.99},
    };

    SECTION("Test Search Batch") {
        faiss::BaseData::getState().store(faiss::BASE_DATA_STATE::MODIFIED);
        struct timeval t1, t2;
        double timeuse;

        std::vector<std::function<std::vector<uint8_t>(size_t, size_t)>> gen_bitset_funcs = {
            GenerateBitsetWithFirstTbitsSet, GenerateBitsetWithRandomTbitsSet};
        const auto bitset_percentages = {0.1f, 0.5f, 0.9f};
        for (const float percentage : bitset_percentages) {
            for (const auto& gen_func : gen_bitset_funcs) {
                auto bitset_data = gen_func(nb, percentage * nb);
                knowhere::BitsetView bitset(bitset_data.data(), nb);
                for (int i = 0; i < MAX_LOOP; i++) {
                    // time a single brute-force search round
                    gettimeofday(&t1, NULL);
                    auto res = knowhere::BruteForce::Search<knowhere::fp32>(train_ds, query_ds, conf, bitset);
                    gettimeofday(&t2, NULL);
                    timeuse = (t2.tv_sec - t1.tv_sec) + (double)(t2.tv_usec - t1.tv_usec) / 1000000.0;
                    std::cout << "elapsed: " << timeuse << std::endl;
                }
            }
        }
        std::cout << "All rounds finished." << std::endl;
    }
}
```
@alexanderguzhva I just added this code to the UT as a temporary benchmark, built it with "-o with_dnnl=True", and then ran the test: DNNL_ENABLE=0/1 ./Release/tests/ut/knowhere_tests. The test runs 20 rounds, and the results above are the average after discarding the best 20% and the worst 20%. I ran the test on an Intel SPR (Sapphire Rapids) platform running Ubuntu 22.04.
@mellonyou I'll take a look. Thanks!
Added searchwithbuf and rangesearch interface implementations with AMX oneDNN. I will submit the related build config to milvus later.
I am trying to do a manual filter with multithreading before the AMX IP. @liliu-z @alexanderguzhva @godchen0212 Do you have any other opinions on the current interface implementation?
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: mellonyou
To complete the pull request process, please assign zhengbuqian after the PR has been reviewed.
You can assign the PR to them by writing /assign @zhengbuqian in a comment when ready.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
I tried doing a manual filter with multithreading before the AMX IP, but it has a significant impact on performance. So we only filter the results afterward to ensure their accuracy, which has a relatively small impact on performance.