
Perf issue: TorchSharp is slower than PyTorch on CUDA for some operators

Open LittleLittleCloud opened this issue 1 year ago • 5 comments

I ran some benchmarks to compare the performance of TorchSharp and PyTorch, both using libtorch 2.2.1 + CUDA 12.1, and I noticed that TorchSharp is slower than PyTorch for most operators. The benchmark results are below.

TorchSharp:

(benchmark results screenshot)

PyTorch:

(benchmark results screenshot)

Observation

I can achieve comparable results between TorchSharp and PyTorch if I replace each operator with its in-place version. Performance also becomes much better if I explicitly dispose of the tensors created during each test.

For example, in the add benchmark, TorchSharp runs nearly as fast as PyTorch if I use tensor.add_ instead of tensor.add.

Considering that the major difference between an operator and its in-place version is that the in-place version doesn't create a new Tensor object, the main overhead likely happens in the Tensor constructor.
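
A minimal sketch of the in-place variant (hypothetical; it reuses the device, repeatTime, and startTime setup from the source code below):

// In-place add: mutates `a` instead of allocating a new Tensor
// (and its managed wrapper object) on every iteration.
var a = torch.randn(new long[] { 1000, 1000 }, device: device);
var b = torch.randn(new long[] { 1000, 1000 }, device: device);

startTime = DateTime.Now;
for (int i = 0; i < repeatTime; i++)
{
    a.add_(b);
}
Console.WriteLine("Time taken for add_: " + (DateTime.Now - startTime).TotalSeconds);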

Source code

C#:

using TorchSharp;

// Initialize CUDA device
var device = torch.CUDA;

var repeatTime = 10000;
// Test randn
var startTime = DateTime.Now;
for (int i = 0; i < repeatTime; i++)
{
    var _ = torch.randn(new long[] { 1000, 1000 }, device: device);
}

Console.WriteLine("Time taken for randn: " + (DateTime.Now - startTime).TotalSeconds);

// Test matmul
startTime = DateTime.Now;
var a = torch.randn(new long[] { 1000, 1000 }, device: device);
var b = torch.randn(new long[] { 1000, 1000 }, device: device);

for (int i = 0; i < repeatTime; i++)
{
    var c = torch.matmul(a, b);
}

Console.WriteLine("Time taken for matmul: " + (DateTime.Now - startTime).TotalSeconds);

// Test concat
startTime = DateTime.Now;
a = torch.randn(new long[] { 1000, 1000 }, device: device);
b = torch.randn(new long[] { 1000, 1000 }, device: device);

for (int i = 0; i < repeatTime; i++)
{
    var c = torch.cat(new[] { a, b }, 0);
}

Console.WriteLine("Time taken for concat: " + (DateTime.Now - startTime).TotalSeconds);

// Test slice
startTime = DateTime.Now;
a = torch.randn(new long[] { 1000, 1000 }, device: device);

for (int i = 0; i < repeatTime; i++)
{
    var c = a[.., 0..500];
}

Console.WriteLine("Time taken for slice: " + (DateTime.Now - startTime).TotalSeconds);

// Test add
startTime = DateTime.Now;
a = torch.randn(new long[] { 1000, 1000 }, device: device);
b = torch.randn(new long[] { 1000, 1000 }, device: device);

for (int i = 0; i < repeatTime; i++)
{
    var c = a + b;
}

Console.WriteLine("Time taken for add: " + (DateTime.Now - startTime).TotalSeconds);
Python:

# benchmark the same PyTorch operators on CUDA

import torch
import time
repeat = 10000
start_time = time.time()
for _ in range(repeat):
    a = torch.randn(1000, 1000).cuda()  # created on the CPU, then copied to the GPU
print("Time taken for randn: " , time.time()-start_time)

start_time = time.time()
# test matmul
a = torch.randn(1000, 1000).cuda()
b = torch.randn(1000, 1000).cuda()
for _ in range(repeat):
    c = torch.matmul(a, b)
    

print("Time taken for matmul: ", time.time()-start_time)

start_time = time.time()

# test concat   
a = torch.randn(1000, 1000).cuda()
b = torch.randn(1000, 1000).cuda()

for _ in range(repeat):
    c = torch.cat((a, b), 0)

print("Time taken for concat: ", time.time()-start_time)

start_time = time.time()
# test slice
a = torch.randn(1000, 1000).cuda()

for _ in range(repeat):
    c = a[:, 0:500]

print("Time taken for slice: ", time.time()-start_time)

start_time = time.time()
# test add
a = torch.randn(1000, 1000).cuda()
b = torch.randn(1000, 1000).cuda()

for _ in range(repeat):
    c = a + b

print("Time taken for add: ", time.time()-start_time)

LittleLittleCloud · Feb 07 '25 22:02

For something like this, you should use BenchmarkDotNet. It handles all the edge cases with .NET benchmarking (like JIT warmup).
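
A minimal sketch of what that could look like (hypothetical; assumes the BenchmarkDotNet NuGet package and a CUDA-enabled TorchSharp package are referenced):

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using TorchSharp;

public class MatmulBenchmark
{
    private torch.Tensor a, b;

    [GlobalSetup]
    public void Setup()
    {
        a = torch.randn(new long[] { 1000, 1000 }, device: torch.CUDA);
        b = torch.randn(new long[] { 1000, 1000 }, device: torch.CUDA);
    }

    // Returning the result keeps BenchmarkDotNet from eliminating the call.
    [Benchmark]
    public torch.Tensor Matmul() => torch.matmul(a, b);
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<MatmulBenchmark>();
}

Run in Release; BenchmarkDotNet then handles warmup, iteration counts, and statistical reporting automatically.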

ds5678 · Feb 08 '25 06:02

Hey @ds5678,

Thanks for raising the issue. I noticed that you used libtorch 2.2.1.

The latest version of TorchSharp (0.105.0) uses libtorch 2.5.1, and it includes some performance improvements around native calls and garbage collection.

Could you please check your results again with the latest version, 0.105.0?

ghost · Feb 11 '25 09:02

@ozanMSFT I think you meant to ping @LittleLittleCloud

ds5678 · Feb 11 '25 10:02

The matmul performance you report is interesting, so I replicated the tests in Release mode with TorchSharp and LibTorch 2.8.0 cu128.

As always, the first run may take longer because of caching; all times are reported in milliseconds.

(benchmark results screenshot)

The source code I used:

C#:

using System;
using System.Diagnostics;
using TorchSharp;

var device = torch.CUDA;
Console.WriteLine($"IS BF16 SUPPORTED: {torch.cuda.is_bf16_supported()}");
var repeatTime = 1000;
long[] dims = new long[] { 100, 100 };

torch.backends.cudnn.allow_tf32 = true;
torch.backends.cuda.matmul.allow_tf32 = true;
torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = true;
torch.ScalarType[] scalars = new torch.ScalarType[]
{
    torch.ScalarType.Float32, torch.ScalarType.BFloat16, torch.ScalarType.Float16
};
for (int n = 0; n < 10; n++)
{
    Console.WriteLine($"{n+1}/10");

    var startTime = Stopwatch.GetTimestamp();
    
    foreach (var sca in scalars)
    {
        var a = torch.randn(dims, device: device, dtype: torch.ScalarType.Float32);
        var b = torch.randn(dims, device: device, dtype: torch.ScalarType.Float32);
        Console.WriteLine($"Now test with {sca}");
        a = a.to(sca);
        b = b.to(sca);
        startTime = Stopwatch.GetTimestamp();
        for (int i = 0; i < repeatTime; i++)
        {
            var c = torch.matmul(a, b);
            c.Dispose();
        }
        Console.WriteLine($"Time taken for matmul {sca}: {new TimeSpan(Stopwatch.GetTimestamp() - startTime).TotalMilliseconds}ms");
        a.Dispose();
        b.Dispose();
        GC.Collect();
        
    }
}

Pytorch

import torch
from datetime import datetime
import gc

repeat = 1000
dims = 100

def to_ms(difftime):
    return difftime.total_seconds()*1000

scalars = [torch.float32, torch.bfloat16, torch.float16]

for n in range(10):
    print(f"{n+1}/10")

    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True
    for sca in scalars:
        a = torch.randn(dims,dims).cuda()
        b = torch.randn(dims,dims).cuda()
        print(f"Now test with {sca}")
        a = a.to(sca)
        b = b.to(sca)
        start_time = datetime.now()
        for _ in range(repeat):
            c = torch.matmul(a, b)
        print(f"Time taken for matmul: {sca} {to_ms(datetime.now()-start_time)}ms")
        gc.collect()

FAQ: Why dispose tensors in C#? Because I have a theory that PyTorch automatically frees a tensor once it goes out of scope (warning: I'm not sure; I haven't researched it).
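
(For what it's worth, CPython's reference counting does free the previous tensor as soon as c is rebound, while .NET finalization is non-deterministic, so explicit disposal in C# is reasonable.) TorchSharp also has dispose scopes, which dispose every tensor created inside them; a hedged sketch of how the matmul loop above could use one instead of calling c.Dispose() by hand:

for (int i = 0; i < repeatTime; i++)
{
    // Tensors created inside the scope are disposed when it closes
    // at the end of each iteration.
    using var scope = torch.NewDisposeScope();
    var c = torch.matmul(a, b);
}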

Why is BFloat16 sometimes slower than Float32 in TorchSharp? I don't know; I'm trying to figure it out. I have some theories, but I'm not sure yet. It needs more testing.

Specifications?

  • i7-9700KF
  • RTX 3070
  • 32 GB RAM

Which torch versions were used for TorchSharp and PyTorch?

  • C#: LibTorch 2.8.0 cu128
  • Python: 2.7.1+cu126

haytham2597 · Sep 15 '25 01:09

I did more matmul tests and noticed that the difference between TorchSharp and PyTorch is around ~22 microseconds (0.022 ms) per call in favor of PyTorch, and at least I know the reason: DllImport overhead, since the system takes some time to call into a function in a native DLL.

Against LibTorch C++, the difference is around ~5 microseconds (0.005 ms) in favor of C++.

In summary, the speed difference between TorchSharp and PyTorch is imperceptible.

Soon I will make a table for a better comparison between TorchSharp, PyTorch, and LibTorch C++ across different .NET environments.

All of these tests were run with BFloat16 on .NET Framework 4.7.2.

haytham2597 · Sep 16 '25 18:09