[Perf] Reduction over rows of a multi-dimensional array takes a while
Basically I'm trying to find the norm of a 3D array over the 2nd and 3rd dimensions, and it's taking much longer than expected.
Description
I'm trying to port code that was previously written in CuPy over to ArrayFire, keeping it as fast as possible.
I'm using CUDA.
Doing this with CuPy takes just 0.9 seconds for the whole function to complete.
I used official installers.
And yes it can be produced reliably - meaning it happens every time.
Reproducible Code
inline af::array findDistances(af::array &X, af::array &A, af::array &B, float alpha = 1.2) {
    int k = A.dims(1) / 2;
    int m = B.dims(1);
    int n = X.dims(0);
    int d = X.dims(1);
    int D = B.dims(0) / 2;
    int batchSize = findDistanceBatchSize(alpha, n, d, k, m); // Comes out to 20

    af::array distances(n, 2 * k * m, af::dtype::f32);
    af::array ABatch(batchSize, 2 * k, A.type());
    af::array BBatch(batchSize, m, B.type());
    af::array XBatch(batchSize, 2 * k, m, d, X.type());
    af::array XBatchAdj(batchSize, 2 * k * m, d,
                        X.type()); // This is very large, around 7 GB. Possible to do this without explicitly allocating the memory?
    af::array XSubset(batchSize, d, X.type());
    af::array XSubsetReshaped = af::constant(0, XBatchAdj.dims(), XBatchAdj.type());
    af::array YBatch = af::constant(0, XBatchAdj.dims(), XBatchAdj.type());

    for (int i = 0; i < n; i += batchSize) {
        int maxBatchIdx = i + batchSize - 1;
        ABatch = A(af::seq(i, maxBatchIdx), af::span);
        BBatch = B(ABatch, af::span);
        BBatch = af::moddims(BBatch, batchSize, 2 * k, m);
        XBatch = X(BBatch, af::span);
        XBatchAdj = af::moddims(XBatch, batchSize, 2 * k * m, d);
        XSubset = X(af::seq(i, maxBatchIdx), af::span);
        XSubsetReshaped = moddims(XSubset, batchSize, 1, d); // Insert new dim
        YBatch = XBatchAdj - XSubsetReshaped;
        // distances(af::seq(i, maxBatchIdx), af::span) =
        af::sqrt(af::sum(af::pow(YBatch, 2), 1)); // It gets hung up on this line. The assignment above breaks the code, so to get an idea of runtime I just put it on a new line.
    }
    return distances;
}
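One thing worth double-checking when timing a loop like this: ArrayFire evaluates expressions lazily and launches kernels asynchronously, so a single line can appear to "hang" while it is actually paying for all the work queued up before it. A minimal sketch of forcing evaluation before reading a timer (assuming the standard `af::eval`/`af::sync` API; not a complete program):

```
// ArrayFire builds a JIT graph lazily; nothing may have executed yet.
af::array result = af::sqrt(af::sum(YBatch * YBatch, 1));
result.eval(); // force the kernel(s) for this expression to launch
af::sync();    // block until the GPU has actually finished
```

Without the eval/sync pair, a wall-clock measurement attributes all accumulated work to whichever line first forces evaluation.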
System Information
ArrayFire Version: 3.9.0
Device: RTX 3090
CUDA Version: 12.6
Operating System: Ubuntu 20.04
NVIDIA Driver: 560.28.03
Checklist
- [x] I have read [timing ArrayFire](http://arrayfire.org/docs/timing.htm). Yep
Multiple lines in your code use af::seq for indexing. There is an error with that kind of indexing, tracked in #3525. We'll be fixing this soon, but as a workaround try assigning the sequence to an array first and then using that array as the index, i.e.:
// ...
af::array index_sequence = af::seq(i, maxBatchIdx);
ABatch = A(index_sequence, af::span);
//...
Try making this replacement in the code and let us know if you still see the error and the poor performance.
The pow instruction is affected by the CUDA compiler flag -use_fast_math (used when ArrayFire is built with AF_WITH_FAST_MATH:BOOL=ON). Try replacing "af::pow(YBatch, 2)" with "YBatch * YBatch". The latter always gives the best performance and accuracy, independent of the compiler settings.
When fast_math is enabled, the single-precision pow is converted to a faster but imprecise instruction, which fails many tests in ArrayFire. To avoid this, ArrayFire executes the double-precision pow instead, since the accurate single-precision version is no longer available; that double-precision path is not optimized by the CUDA compiler, resulting in poor performance. The proposed multiplication is independent of the compiler setting and always gives the best performance and accuracy.
@edwinsolisf @willyborn I added both of these changes and the perf issue still remains.
As a side note, it's taking 30 seconds for nothing to happen, so I think something is especially amiss.
If it helps, my specs are: CUDA 12.6, g++ 9.4.0, ArrayFire 3.9.0.
Could you post the dimensions of your inputs?
@edwinsolisf sure
X has shape (70_000, 784), A has shape (70_000, 10), B has shape (2048, 50).
The distances array has shape (70_000, 500).
YBatch comes out to (20, 500, 784)
The parameters at the top are therefore:
n=70000
k=5
m=50
D=1024
d=784
Thx
I have not been able to replicate the hang you mention. Could you run your program with the AF_TRACE=all environment variable set and post the output?