
Add AMX support to speed up Faiss Inner-Product

Open mellonyou opened this issue 1 year ago • 15 comments

Use Intel AMX to speed up the Inner-Product algorithm in knowhere::BruteForce::Search(), which can bring a more-than-10x performance boost.

Build parameter: use "-o with_dnnl=True/False" to enable or disable the AMX feature. This feature depends on libdnnl.so.3, which you can install by running scripts/install_deps.sh.

Runtime parameter: to use the AMX feature, you must first set the environment variable "DNNL_ENABLE=1"; otherwise the AMX feature will not work.
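For reference, a minimal sketch (hypothetical helper, not the actual knowhere code) of how such a runtime gate on DNNL_ENABLE can be implemented:

// Minimal sketch, assuming the AMX/oneDNN path is opt-in and should only be
// taken when DNNL_ENABLE=1 is set in the environment at runtime.
#include <cstdlib>
#include <cstring>

static bool
amx_ip_enabled() {
    const char* env = std::getenv("DNNL_ENABLE");
    return env != nullptr && std::strcmp(env, "1") == 0;
}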

mellonyou avatar Apr 28 '24 09:04 mellonyou

@mellonyou 🔍 Important: PR Classification Needed!

For efficient project management and a seamless review process, it's essential to classify your PR correctly. Here's how:

  1. If you're fixing a bug, label it as kind/bug.
  2. For small tweaks (less than 20 lines without altering any functionality), please use kind/improvement.
  3. Significant changes that don't modify existing functionalities should be tagged as kind/enhancement.
  4. Adjusting APIs or changing functionality? Go with kind/feature.

For any PR outside the kind/improvement category, ensure you link to the associated issue using the format: “issue: #”.

Thanks for your efforts and contribution to the community!

mergify[bot] avatar Apr 28 '24 09:04 mergify[bot]

issue: #541

mellonyou avatar May 06 '24 02:05 mellonyou

I can't edit the labels, need any access permissions?

mellonyou avatar May 06 '24 03:05 mellonyou

/kind enhancement

liliu-z avatar May 06 '24 03:05 liliu-z

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 71.59%. Comparing base (3c46f4c) to head (7b6f49a). Report is 179 commits behind head on main.

:exclamation: Current head 7b6f49a differs from pull request most recent head ff5c7cd

Please upload reports for the commit ff5c7cd to get more accurate results.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff            @@
##           main     #535       +/-   ##
=========================================
+ Coverage      0   71.59%   +71.59%     
=========================================
  Files         0       67       +67     
  Lines         0     4446     +4446     
=========================================
+ Hits          0     3183     +3183     
- Misses        0     1263     +1263     

see 67 files with indirect coverage changes

codecov[bot] avatar May 06 '24 04:05 codecov[bot]

  1. Ported the code into knowhere so that it follows the dynamic hook interface.
  2. About the filter: I wrote a simple benchmark comparing the no-filter AMX inner product against the SIMD inner product with a filter (0.1f, 0.5f, 0.9f). AMX still has a performance advantage even when the filter percentage reaches 0.9:

     filter percentage          0.1      0.5      0.9
     SIMD IP with filter (s)    0.432    0.208    0.043
     dnnl perf. boost           13.1x    6.3x     1.3x

     (AMX IP without filter: 0.033 s)

For the code, the AMX inner-product interface is better suited to processing vectors in batches, and it does not provide a filter interface. I have two ideas:

  1. The AMX inner product handles only the no-filter scenario.
  2. Add a filter-percentage parameter to the interface; when it is less than 0.9, choose the AMX inner product (see the dispatch sketch below).
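A hypothetical sketch of idea 2 (IPKernel, choose_ip_kernel, and kAmxFilterThreshold are illustrative names, not part of this PR): pick the AMX kernel unless the filtered-out fraction reaches 0.9, where the benchmark above shows the advantage shrinking to ~1.3x.

#include <cstddef>

enum class IPKernel { AMX_ONEDNN, SIMD };

// filtered_out / total = fraction of base vectors excluded by the bitset.
inline IPKernel
choose_ip_kernel(std::size_t filtered_out, std::size_t total) {
    constexpr double kAmxFilterThreshold = 0.9;
    if (total == 0 ||
        static_cast<double>(filtered_out) / static_cast<double>(total) < kAmxFilterThreshold) {
        // Batch AMX/oneDNN inner product over all vectors; drop filtered-out
        // ids afterwards when collecting top-k.
        return IPKernel::AMX_ONEDNN;
    }
    // Heavy filtering: a per-vector SIMD kernel that honors the bitset directly.
    return IPKernel::SIMD;
}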

@alexanderguzhva Looking forward to your suggestions.

mellonyou avatar May 15 '24 11:05 mellonyou

@mellonyou Could you please include a benchmark or, at least, its details? The numbers that you've provided cannot be interpreted properly without knowing

  • the exact number of samples
  • the dimensionality
  • whether it is a single-query or a batched-query request
  • whether it is a test for this particular function or for the whole index
  • etc.

The results are potentially interesting and are definitely worth checking on my end.

alexanderguzhva avatar May 15 '24 16:05 alexanderguzhva

#include "simd/distances_onednn.h"

#define MAX_LOOP 20 TEST_CASE("Test Brute Force", "[float vector]") { using Catch::Approx;

const int64_t nb = 2000000;
const int64_t nq = 10;
const int64_t dim = 512;
const int64_t k = 100;

auto metric = GENERATE(as<std::string>{}, knowhere::metric::IP );

const auto train_ds = GenDataSet(nb, dim);
const auto query_ds = CopyDataSet(train_ds, nq);

const knowhere::Json conf = {
    {knowhere::meta::DIM, dim},
    {knowhere::meta::METRIC_TYPE, metric},
    {knowhere::meta::TOPK, k},
    {knowhere::meta::RADIUS, knowhere::IsMetricType(metric, knowhere::metric::IP) ? 10.0 : 0.99},
};

SECTION("Test Search Batch") {
 faiss::BaseData::getState().store(faiss::BASE_DATA_STATE::MODIFIED);
 struct timeval t1,t2;
 double timeuse;
 gettimeofday(&t1,NULL);

     std::vector<std::function<std::vector<uint8_t>(size_t, size_t)>> gen_bitset_funcs = {
             GenerateBitsetWithFirstTbitsSet, GenerateBitsetWithRandomTbitsSet};
     const auto bitset_percentages = {0.1f, 0.5f, 0.9f};
     for (const float percentage : bitset_percentages) {
             for (const auto& gen_func : gen_bitset_funcs) {
                     auto bitset_data = gen_func(nb, percentage * nb);
                     knowhere::BitsetView bitset(bitset_data.data(), nb);

                     for (int i = 0; i < MAX_LOOP; i++)
                     {
                             gettimeofday(&t1,NULL);

                             //    threads.emplace_back(WrapSearch, queryvar1);
                             auto res = knowhere::BruteForce::Search<knowhere::fp32>(train_ds, query_ds, conf, bitset);
                             gettimeofday(&t2,NULL);
                             timeuse = (t2.tv_sec - t1.tv_sec) + (double)(t2.tv_usec - t1.tv_usec)/1000000.0;
                             std::cout << "elpased: " << timeuse << std::endl;
                     }

             }
     }

     gettimeofday(&t2,NULL);
     timeuse = (t2.tv_sec - t1.tv_sec) + (double)(t2.tv_usec - t1.tv_usec)/1000000.0;

     std::cout << "All thread finished." << std::endl;

    }

}

mellonyou avatar May 16 '24 01:05 mellonyou

@alexanderguzhva I just added this code to the unit tests as a temporary benchmark, built it with "-o with_dnnl=True", and then ran the test: DNNL_ENABLE=0/1 ./Release/tests/ut/knowhere_tests. The test runs 20 rounds, and the results above are the averages after discarding the best 20% and the worst 20% of rounds. I ran the test on an Intel SPR (Sapphire Rapids) platform running Ubuntu 22.04.
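For clarity, a small sketch (not from the PR) of that averaging: sort the 20 per-round timings, drop the best 20% and the worst 20%, and average the remainder.

#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Trimmed mean of per-round timings: drop `trim` of the samples at each end.
inline double
trimmed_mean(std::vector<double> timings, double trim = 0.2) {
    std::sort(timings.begin(), timings.end());
    const std::size_t drop = static_cast<std::size_t>(timings.size() * trim);
    if (timings.size() <= 2 * drop) {
        return 0.0;  // not enough samples to trim
    }
    const auto first = timings.begin() + drop;
    const auto last = timings.end() - drop;
    return std::accumulate(first, last, 0.0) / static_cast<double>(last - first);
}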

mellonyou avatar May 16 '24 01:05 mellonyou

@mellonyou I'll take a look. Thanks!

alexanderguzhva avatar May 21 '24 16:05 alexanderguzhva

Added SearchWithBuf and RangeSearch interface implementations with AMX oneDNN. The related build config will be submitted to Milvus later.

mellonyou avatar Jun 05 '24 03:06 mellonyou

I am trying to apply the filter manually with multiple threads before the AMX inner product. @liliu-z @alexanderguzhva @godchen0212 Do you have any other opinions on the current interface implementation?

mellonyou avatar Jun 18 '24 03:06 mellonyou

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mellonyou. To complete the pull request process, please assign zhengbuqian after the PR has been reviewed. You can assign the PR to them by writing /assign @zhengbuqian in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.

sre-ci-robot avatar Jul 01 '24 08:07 sre-ci-robot

I have tried applying the filter manually with multiple threads before the AMX inner product, and it has a significant impact on performance. So we only filter the results afterwards to ensure their accuracy, which has a relatively small impact on performance.
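In other words, the kernel scores everything and the bitset is applied to the results. A hypothetical illustration (illustrative names and types, not the PR's code), using a plain std::vector<bool> as a stand-in for knowhere::BitsetView:

#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using Candidate = std::pair<float, int64_t>;  // (score, id), sorted best-first

// Keep only candidates whose ids are not filtered out, up to k results.
inline std::vector<Candidate>
post_filter_topk(const std::vector<Candidate>& candidates,
                 const std::vector<bool>& filtered_out, std::size_t k) {
    std::vector<Candidate> result;
    result.reserve(k);
    for (const auto& c : candidates) {
        if (!filtered_out[static_cast<std::size_t>(c.second)]) {
            result.push_back(c);
            if (result.size() == k) {
                break;
            }
        }
    }
    return result;
}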

mellonyou avatar Jul 01 '24 08:07 mellonyou