Perf issue: TorchSharp is slower than PyTorch on CUDA for some operators
I ran some benchmark tests to compare the performance of TorchSharp and PyTorch, both using libtorch 2.2.1 + CUDA 12.1, and I noticed that TorchSharp is slower than PyTorch for most operators. Below are the benchmark results.
TorchSharp
PyTorch
Observation
I can achieve comparable results between TorchSharp and PyTorch if I replace each operator with its in-place version. Performance also becomes much better if I explicitly dispose tensors during each test.
For example, in the add benchmark, TorchSharp runs nearly as fast as PyTorch if I use tensor.add_ instead of tensor.add.
Considering that the major difference between an operator and its in-place counterpart is that the in-place version doesn't create a new Tensor object, the main overhead likely happens in the Tensor constructor.
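For reference, a minimal sketch of the two variants being compared (the explicit Dispose in the out-of-place loop is added here so native memory doesn't accumulate and skew the comparison; timings are illustrative, not from the original report):

using System;
using System.Diagnostics;
using TorchSharp;

var device = torch.CUDA;
var a = torch.randn(new long[] { 1000, 1000 }, device: device);
var b = torch.randn(new long[] { 1000, 1000 }, device: device);

// Out-of-place add: every iteration allocates a new managed Tensor wrapper
// plus a new native tensor handle.
var sw = Stopwatch.StartNew();
for (int i = 0; i < 10000; i++)
{
    var c = a.add(b);
    c.Dispose(); // dispose eagerly so native memory doesn't pile up
}
Console.WriteLine($"add : {sw.Elapsed.TotalSeconds:F3}s");

// In-place add: writes into 'a'; no new Tensor object per iteration.
sw.Restart();
for (int i = 0; i < 10000; i++)
{
    a.add_(b);
}
Console.WriteLine($"add_: {sw.Elapsed.TotalSeconds:F3}s");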
Source code
using System;
using TorchSharp;

// Initialize CUDA device
var device = torch.CUDA;
var repeatTime = 10000;

// Test randn
var startTime = DateTime.Now;
for (int i = 0; i < repeatTime; i++)
{
    var _ = torch.randn(new long[] { 1000, 1000 }, device: device);
}
Console.WriteLine("Time taken for randn: " + (DateTime.Now - startTime).TotalSeconds);

// Test matmul
startTime = DateTime.Now;
var a = torch.randn(new long[] { 1000, 1000 }, device: device);
var b = torch.randn(new long[] { 1000, 1000 }, device: device);
for (int i = 0; i < repeatTime; i++)
{
    var c = torch.matmul(a, b);
}
Console.WriteLine("Time taken for matmul: " + (DateTime.Now - startTime).TotalSeconds);

// Test concat
startTime = DateTime.Now;
a = torch.randn(new long[] { 1000, 1000 }, device: device);
b = torch.randn(new long[] { 1000, 1000 }, device: device);
for (int i = 0; i < repeatTime; i++)
{
    var c = torch.cat(new[] { a, b }, 0);
}
Console.WriteLine("Time taken for concat: " + (DateTime.Now - startTime).TotalSeconds);

// Test slice
startTime = DateTime.Now;
a = torch.randn(new long[] { 1000, 1000 }, device: device);
for (int i = 0; i < repeatTime; i++)
{
    var c = a[.., 0..500];
}
Console.WriteLine("Time taken for slice: " + (DateTime.Now - startTime).TotalSeconds);

// Test add
startTime = DateTime.Now;
a = torch.randn(new long[] { 1000, 1000 }, device: device);
b = torch.randn(new long[] { 1000, 1000 }, device: device);
for (int i = 0; i < repeatTime; i++)
{
    var c = a + b;
}
Console.WriteLine("Time taken for add: " + (DateTime.Now - startTime).TotalSeconds);
# Benchmarks for PyTorch on CUDA
import torch
import time

repeat = 10000
total_time = 0

# test randn
start_time = time.time()
for _ in range(repeat):
    a = torch.randn(1000, 1000).cuda()
print("Time taken for randn: ", time.time() - start_time)

start_time = time.time()
# test matmul
a = torch.randn(1000, 1000).cuda()
b = torch.randn(1000, 1000).cuda()
for _ in range(repeat):
    c = torch.matmul(a, b)
print("Time taken for matmul: ", time.time() - start_time)

start_time = time.time()
# test concat
a = torch.randn(1000, 1000).cuda()
b = torch.randn(1000, 1000).cuda()
for _ in range(repeat):
    c = torch.cat((a, b), 0)
print("Time taken for concat: ", time.time() - start_time)

start_time = time.time()
# test slice
a = torch.randn(1000, 1000).cuda()
for _ in range(repeat):
    c = a[:, 0:500]
print("Time taken for slice: ", time.time() - start_time)

start_time = time.time()
# test add
a = torch.randn(1000, 1000).cuda()
b = torch.randn(1000, 1000).cuda()
for _ in range(repeat):
    c = a + b
print("Time taken for add: ", time.time() - start_time)
For something like this, you should use BenchmarkDotNet. It handles all the edge cases of .NET benchmarking (like JIT warmup).
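A minimal sketch of what that could look like for the add benchmark (the class and method names here are hypothetical; it assumes the BenchmarkDotNet NuGet package):

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using TorchSharp;

public class TensorAddBenchmarks
{
    private torch.Tensor a, b;

    [GlobalSetup]
    public void Setup()
    {
        a = torch.randn(new long[] { 1000, 1000 }, device: torch.CUDA);
        b = torch.randn(new long[] { 1000, 1000 }, device: torch.CUDA);
    }

    [Benchmark(Baseline = true)]
    public void AddInPlace() => a.add_(b); // mutates 'a' in place

    [Benchmark]
    public void AddOutOfPlace()
    {
        using var c = a.add(b); // dispose the result eagerly
    }

    [GlobalCleanup]
    public void Cleanup()
    {
        a.Dispose();
        b.Dispose();
    }
}

public static class Program
{
    // BenchmarkDotNet handles warmup, iteration counts, and statistics.
    public static void Main() => BenchmarkRunner.Run<TensorAddBenchmarks>();
}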
Hey @ds5678,
Thanks for raising this issue. I noticed that you used libtorch 2.2.1.
The latest version of TorchSharp (0.105.0) uses libtorch 2.5.1, and it includes some performance improvements around native calls and garbage collection.
Could you please check your results again with the latest version, 0.105.0?
@ozanMSFT I think you meant to ping @LittleLittleCloud
The matmul performance you observed is interesting, so I replicated it in Release mode with TorchSharp and LibTorch 2.8.0 cu128.
As always, the first run may take some time because of caching; all times are reported in milliseconds.
The source code I used is:
C#:
using System;
using System.Diagnostics;
using TorchSharp;

var device = torch.CUDA;
Console.WriteLine($"IS BF16 SUPPORTED: {torch.cuda.is_bf16_supported()}");
var repeatTime = 1000;
long[] dims = new long[] { 100, 100 };
torch.backends.cudnn.allow_tf32 = true;
torch.backends.cuda.matmul.allow_tf32 = true;
torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = true;

torch.ScalarType[] scalars = new torch.ScalarType[]
{
    torch.ScalarType.Float32, torch.ScalarType.BFloat16, torch.ScalarType.Float16
};

for (int n = 0; n < 10; n++)
{
    Console.WriteLine($"{n + 1}/10");
    foreach (var sca in scalars)
    {
        var a = torch.randn(dims, device: device, dtype: torch.ScalarType.Float32);
        var b = torch.randn(dims, device: device, dtype: torch.ScalarType.Float32);
        Console.WriteLine($"Now test with {sca}");
        a = a.to(sca);
        b = b.to(sca);
        var startTime = Stopwatch.GetTimestamp();
        for (int i = 0; i < repeatTime; i++)
        {
            var c = torch.matmul(a, b);
            c.Dispose();
        }
        // Convert Stopwatch ticks to milliseconds via Stopwatch.Frequency;
        // new TimeSpan(ticks) assumes 100 ns ticks and can misreport elapsed time.
        var elapsedMs = (Stopwatch.GetTimestamp() - startTime) * 1000.0 / Stopwatch.Frequency;
        Console.WriteLine($"Time taken for matmul {sca}: {elapsedMs:F3}ms");
        a.Dispose();
        b.Dispose();
        GC.Collect();
    }
}
PyTorch:
import torch
import time
from datetime import datetime
import gc

repeat = 1000
dims = 100
total_time = 0

def to_ms(difftime):
    return difftime.total_seconds() * 1000

scalars = [torch.float32, torch.bfloat16, torch.float16]

for n in range(10):
    print(f"{n+1}/10")
    start_time = datetime.now()
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True
    for sca in scalars:
        a = torch.randn(dims, dims).cuda()
        b = torch.randn(dims, dims).cuda()
        print(f"Now test with {sca}")
        a = a.to(sca)
        b = b.to(sca)
        start_time = datetime.now()
        for _ in range(repeat):
            c = torch.matmul(a, b)
        print(f"Time taken for matmul: {sca} {to_ms(datetime.now() - start_time)}ms")
        gc.collect()
FAQ: Why dispose tensors in C#? Because I have a theory that PyTorch automatically disposes tensors once they go out of scope (warning: I'm not sure; I haven't researched it).
Why is BFloat16 in TorchSharp sometimes slower than Float32? I don't know; I'm trying to figure it out. I have some theories, but I'm not sure yet. It needs more testing.
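The disposal theory is plausible: CPython uses reference counting, so rebinding c on each iteration frees the previous result immediately, whereas in .NET the native tensor is only released when the GC eventually finalizes the wrapper (or when Dispose is called). A minimal sketch of an alternative to calling Dispose by hand, using TorchSharp's torch.NewDisposeScope(), which disposes every tensor created inside the scope:

using TorchSharp;

var device = torch.CUDA;
var a = torch.randn(new long[] { 100, 100 }, device: device);
var b = torch.randn(new long[] { 100, 100 }, device: device);

for (int i = 0; i < 1000; i++)
{
    // All tensors created inside the scope (here, the matmul result) are
    // disposed when the scope closes, mimicking Python's eager cleanup.
    using var scope = torch.NewDisposeScope();
    var c = torch.matmul(a, b);
}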
Specifications:
- i7 9700KF
- RTX 3070
- 32 GB RAM
Which torch versions were used for TorchSharp and PyTorch?
| C# | Python |
|---|---|
| LibTorch 2.8.0 Cu128 | 2.7.1+cu126 |
I did more testing on matmul and noticed that the difference between TorchSharp and PyTorch is around ~22 microseconds (0.022 milliseconds) in favor of PyTorch, and at least I know the reason: DllImport interop, where the runtime takes time to call a function in the native DLL.
In LibTorch C++ the difference is around ~5 microseconds (0.005 milliseconds) in favor of C++.
In summary, the speed difference between TorchSharp and PyTorch is imperceptible.
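A minimal sketch of how that per-call interop cost could be estimated (a hypothetical micro-benchmark; it assumes torch.cuda.is_available() is a thin wrapper over a single native call, so the average loop time approximates the managed-to-native overhead):

using System;
using System.Diagnostics;
using TorchSharp;

const int calls = 100_000;

// Warm up the JIT and the native library before timing.
_ = torch.cuda.is_available();

var sw = Stopwatch.StartNew();
for (int i = 0; i < calls; i++)
{
    _ = torch.cuda.is_available(); // one managed -> native transition (assumption)
}
sw.Stop();

var usPerCall = sw.Elapsed.TotalMilliseconds * 1000.0 / calls;
Console.WriteLine($"~{usPerCall:F2} us per native call");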
Soon I will put together a table for a better comparison between TorchSharp, PyTorch, and LibTorch C++, across different .NET environments.
All of these tests were run with BFloat16 on .NET Framework 4.7.2.