Bend GPU slower than Multi-threaded CPU on WSL / Windows 11 RTX 2050

Describe the bug

I ran the examples successfully with an 11.x version of CUDA yesterday, but I am unable to upgrade to 12.4 or 12.5.

I recorded a screen-capture video of running the same code three ways and shared it in my repo, along with the steps I took: bend_demo_parallelism.mp4

WSL2 on Windows 11, followed install instructions as documented in the readme here: https://github.com/hoopdad/bendlang

Errors: Performance on GPU is slower than on multi-threaded CPU.

To Reproduce

Please see the README at the link above. The video also shows my Windows resource monitor, where the NVIDIA memory fills up, so you can see that it is hitting the GPU.

Expected behavior

I expected the GPU to be faster than the single- or multi-threaded CPU.

Desktop (please complete the following information):

OS: Windows 11 with WSL2
CPU: 13th Gen Intel(R) Core(TM) i5-13420H 2.10 GHz
GPU: NVIDIA GeForce RTX-2050
Cuda Version: Build cuda_11.5.r11.5/compiler.30672275_0

Additional context

May 21 '24 19:05 hoopdad

UPDATE: I got it running by following the steps outlined in my readme. I was missing the "nvidia-" prefix and the nsight* packages. But my question remains: why is the GPU slower than the multi-threaded CPU? Results are also dumped into the readme.

May 22 '24 05:05 hoopdad

Isn't it more a WSL2 problem rather than a Bend problem?

I mean, I had a lot of issues when trying to use my RTX 3060 Ti for AI with WSL2 due to virtualization.

May 22 '24 19:05 nahharris

@OJarrisonn It may be, but I am not sure where the problem lies. I am an "it's not you, it's me" kind of guy by default, so I am assuming I have something misconfigured or hardware that is just plain too low-end.

I have it running on Cuda 12.5 now. Very similar results. I updated my readme with procedures and results (see above for link to it).

Is the RTX-2050 capable enough to be used for computations like this? Do I need to set some environment variables to tune it? For example, does using 7+ GB of shared memory (on top of the 4 GB on the card) hurt performance?

I was reading about the various architectures and ran nvcc --list-gpu-arch to see what I have. I then set the variable below to just my highest architecture, though I'm not sure highest = best.

export CUDA_ARCHITECTURES="compute_90"

Note: just ran my example code after running the above with no difference in time.
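One way to check which architecture the card itself reports, rather than what the compiler can target, is to ask the driver. The sketch below assumes a driver recent enough that nvidia-smi supports the compute_cap query field. cap_to_arch is a hypothetical helper that only reformats the reported value, and nothing in this thread establishes that Bend or nvcc reads CUDA_ARCHITECTURES at all.

```shell
#!/usr/bin/env bash
# cap_to_arch: hypothetical helper, turns a compute capability such as "8.6"
# into the matching virtual architecture name "compute_86".
cap_to_arch() {
  local cap=$1
  # Drop the dot: "8.6" becomes "86".
  echo "compute_${cap//./}"
}

# With an NVIDIA driver installed, the capability can be read from nvidia-smi
# (assumption: the compute_cap query field exists on this driver version):
#   cap=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1)
#   cap_to_arch "$cap"
```

If the value reported for the card is lower than 9.0, compute_90 would not correspond to this GPU, which may be why setting it made no difference.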

May 23 '24 19:05 hoopdad

@hoopdad Can you run everything with the -s flag and post the results?

May 23 '24 19:05 kings177

Here it is:

mike@bluewarrior:~/bend-lang$ ./bendrun.sh
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Apr_17_19:19:55_PDT_2024
Cuda compilation tools, release 12.5, V12.5.40
Build cuda_12.5.r12.5/compiler.34177558_0
starting CPU single thread run
Result: 16515072
- ITRS: 1259339749
- TIME: 42.14s
- MIPS: 29.88

time to run: 42163
starting CPU multi thread run
Result: 16515072
- ITRS: 1259339749
- TIME: 45.99s
- MIPS: 27.38

time to run: 46007
starting GPU multi thread run
Result: 16515072
- ITRS: 1259323365
- LEAK: 59858943
- TIME: 26.65s
- MIPS: 47.25

time to run: 29051

Here's the new shell script that I ran, FYI

nvcc --version

# date +%s%3N prints milliseconds since the epoch (GNU date),
# so each "time to run" value below is in milliseconds.

echo "starting CPU single thread run"
START_MS=$(date +%s%3N)
bend run -s main.bend
END_MS=$(date +%s%3N)
echo "time to run: $((END_MS - START_MS))"

echo "starting CPU multi thread run"
START_MS=$(date +%s%3N)
bend run-c -s main.bend
END_MS=$(date +%s%3N)
echo "time to run: $((END_MS - START_MS))"

echo "starting GPU multi thread run"
START_MS=$(date +%s%3N)
bend run-cu -s main.bend
END_MS=$(date +%s%3N)
echo "time to run: $((END_MS - START_MS))"
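
The three near-identical blocks above could be folded into one helper that also repeats each run, which helps smooth out interference from whatever else is running on the machine. This is a minimal sketch, assuming GNU date (for %3N) and that bend is on PATH. time_avg_ms is a hypothetical name, not something Bend provides.

```shell
#!/usr/bin/env bash
# time_avg_ms RUNS CMD [ARGS...]
# Hypothetical helper: runs CMD the given number of times and prints the
# mean wall-clock time in whole milliseconds.
time_avg_ms() {
  local runs=$1
  shift
  local total=0 start end i
  for ((i = 0; i < runs; i++)); do
    start=$(date +%s%3N)   # milliseconds since the epoch (GNU date)
    "$@" > /dev/null       # discard the program's own output
    end=$(date +%s%3N)
    total=$((total + end - start))
  done
  echo $((total / runs))
}

# Example usage (assumes main.bend is in the current directory):
#   time_avg_ms 10 bend run -s main.bend
#   time_avg_ms 10 bend run-c -s main.bend
#   time_avg_ms 10 bend run-cu -s main.bend
```

Wall-clock time measured this way includes process start-up and, for run-c and run-cu, any compilation step, so it will not match the TIME value that -s prints.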

May 23 '24 20:05 hoopdad

From what I can see, the GPU run finished in roughly 60% of the time the CPU runs took (26.65s versus 42.14s and 45.99s), no?

What really concerns me here is that the single-core Rust run is faster than the multi-core gcc run (run-c). That doesn't make sense and could be a bug.

May 24 '24 14:05 kings177

That run does show GPU as the fastest, agreed. Prior runs were showing multi-threaded CPU as much faster, single-threaded as slowest and GPU in the middle. I'll try some more runs and gather some statistics. Maybe with my successful upgrade to CUDA 12.5 it is working as expected and my first run(s) had something else running concurrently. It's just my personal laptop, and Windows, so...

Is the program that I used a good one for a basic benchmark? I got it from the readme but also see many in the examples folder. I'll try to run them by the end of the day today so we can hopefully close this issue.

May 24 '24 14:05 hoopdad

I ran the example above, 100 times for each of Single Threaded CPU, Multi Threaded CPU and GPU. The raw results are attached. Same code/methodology as before.

100runs.xlsx

averages:

CPU single thread run | 40.9678
CPU multi thread run | 23.0878
GPU multi thread run | 25.8416
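
To compare these averages directly, it can help to express them as speedups over the single-threaded run. A small sketch using awk for the floating-point division; speedup is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# speedup BASE OTHER
# Hypothetical helper: prints BASE / OTHER rounded to two decimals.
speedup() {
  awk -v base="$1" -v other="$2" 'BEGIN { printf "%.2f\n", base / other }'
}

speedup 40.9678 23.0878   # single thread vs. multi thread average
speedup 40.9678 25.8416   # single thread vs. GPU average
```

With the averages above, the multi-threaded CPU run comes out at about 1.77x and the GPU run at about 1.59x the speed of the single-threaded run.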

May 24 '24 22:05 hoopdad

I am having a similar issue with this code:

# Collatz conjecture search in Bend
#Author - Elijah Bare

search_until = 10000

def collatz(n, count):
  if n == 1:
    return count + 1
  else:
    if n % 2 == 0:
      return collatz(n / 2, count + 1)
    else:
      return collatz(3*n + 1, count + 1)


def loop(highscore, high_start_val, i):
  iters = collatz(i, 0)
  if i < search_until:
    if iters > highscore:
      return loop(iters, i, i+1)
    else:
      return loop(highscore, high_start_val, i+1)
  else:
    return [highscore, high_start_val, i+1]


def main():
  return loop(0,0,1) #start with 0 as best scores

It runs well using the run-c command, but when I run it with CUDA it takes much longer, which doesn't make sense given it's an M1 compared to a 4090 (on a VPS, obviously).

Is anyone else having this issue?

May 30 '24 17:05 ElijahBare

I have the same problem running the parallel_sum.bend benchmark on Windows 11 through WSL 2.0.

Results:

Hardware:

Aug 06 '24 12:08 TomasMonkevic