[Perf] Reduction over rows of a multi-dimensional array takes a while
Basically I'm trying to find the norm of a 3D array over the 2nd and 3rd dimensions, and it's taking much longer than expected.
Description
I'm trying to port code that was previously written in CuPy over to ArrayFire, keeping it as fast as possible.
I'm using CUDA.
Doing this with CuPy takes just 0.9 seconds for the whole function to complete.
I used official installers.
And yes it can be produced reliably - meaning it happens every time.
Reproducible Code
inline af::array findDistances(af::array &X, af::array &A, af::array &B, float alpha = 1.2) {
    int k = A.dims(1) / 2;
    int m = B.dims(1);
    int n = X.dims(0);
    int d = X.dims(1);
    int D = B.dims(0) / 2;
    int batchSize = findDistanceBatchSize(alpha, n, d, k, m); // Comes out to 20

    af::array distances(n, 2 * k * m, af::dtype::f32);
    af::array ABatch(batchSize, 2 * k, A.type());
    af::array BBatch(batchSize, m, B.type());
    af::array XBatch(batchSize, 2 * k, m, d, X.type());
    af::array XBatchAdj(batchSize, 2 * k * m, d,
                        X.type()); // This is very large, around 7 GB. Possible to do this without explicitly allocating the memory?
    af::array XSubset(batchSize, d, X.type());
    af::array XSubsetReshaped = af::constant(0, XBatchAdj.dims(), XBatchAdj.type());
    af::array YBatch = af::constant(0, XBatchAdj.dims(), XBatchAdj.type());

    for (int i = 0; i < n; i += batchSize) {
        int maxBatchIdx = i + batchSize - 1;
        ABatch = A(af::seq(i, maxBatchIdx), af::span);
        BBatch = B(ABatch, af::span);
        BBatch = af::moddims(BBatch, batchSize, 2 * k, m);
        XBatch = X(BBatch, af::span);
        XBatchAdj = af::moddims(XBatch, batchSize, 2 * k * m, d);
        XSubset = X(af::seq(i, maxBatchIdx), af::span);
        XSubsetReshaped = moddims(XSubset, batchSize, 1, d); // Insert new dim
        YBatch = XBatchAdj - XSubsetReshaped;
        // distances(af::seq(i, maxBatchIdx), af::span) =
        af::sqrt(af::sum(af::pow(YBatch, 2), 1)); // It gets hung up on this line. The assignment above breaks the code, so to get an idea of runtime I just put it on a new line.
    }
    return distances;
}
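One thing worth double-checking when timing a loop like this: ArrayFire evaluates expressions lazily and launches kernels asynchronously, so a single line can appear to "hang" while it is actually paying for all the work queued up before it. A minimal sketch of forcing evaluation before reading a timer (assuming the standard `af::eval`/`af::sync` API; not a complete program):

```
// ArrayFire builds a JIT graph lazily; nothing may have executed yet.
af::array result = af::sqrt(af::sum(YBatch * YBatch, 1));
result.eval(); // force the kernel(s) for this expression to launch
af::sync();    // block until the GPU has actually finished
```

Without the eval/sync pair, a wall-clock measurement attributes all accumulated work to whichever line first forces evaluation.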
System Information
ArrayFire Version: 3.9.0
Device: RTX 3090
CUDA Version: 12.6
Operating System: Ubuntu 20.04
NVIDIA Driver: 560.28.03
Checklist
- [x] I have read [timing ArrayFire](http://arrayfire.org/docs/timing.htm). Yep
Multiple lines in your code use af::seq for indexing. There is an error with that kind of indexing, tracked in #3525. We'll be fixing this soon, but as a workaround try assigning the sequence to an array first and then using that array as the index, i.e.:
// ...
af::array index_sequence = af::seq(i, maxBatchIdx);
ABatch = A(index_sequence, af::span);
//...
Try making this replacement in the code and let us know if you still see the error and the poor performance.
The pow instruction is affected by the CUDA compiler flag -use_fast_math (used when ArrayFire is built with AF_WITH_FAST_MATH:BOOL=ON). Try replacing "af::pow(YBatch, 2)" with "YBatch * YBatch". The latter always gives the best performance and accuracy, independent of the compiler settings.
When fast_math is enabled, the single-precision pow is converted to a faster but imprecise instruction, which fails many tests in ArrayFire. To avoid this, ArrayFire executes the double-precision pow instead, since the accurate single-precision version is no longer available; that double-precision path is not optimized by the CUDA compiler, resulting in poor performance. The proposed multiplication is independent of the compiler setting and always gives the best performance and accuracy.
@edwinsolisf @willyborn I added both of these changes and the perf issue still remains.
As a side note, it's taking 30 seconds for nothing to happen, so I think something is especially amiss.
If it helps, my specs are: CUDA 12.6, g++ 9.4.0, ArrayFire 3.9.0.
Could you post the dimensions of your inputs?
@edwinsolisf sure
X has shape (70_000, 784), A has shape (70_000, 10), B has shape (2048, 50).
The distances array has shape (70_000, 500).
YBatch comes out to (20, 500, 784)
The parameters at the top are therefore:
n=70000
k=5
m=50
D=1024
d=784
Thx
I have not been able to replicate the hang you mention. Could you run your program with the AF_TRACE=all environment variable set and post the output?