
Prebuilt binary has very very very poor performance

Open baryluk opened this issue 1 year ago • 2 comments

Nice project. I was going to write something like this myself this weekend (I already have a script to copy many files in parallel, but I also needed one to send a single big file in parallel, since my network now scales to 25 Gbps+), but then did a quick Google search and quickly found the PERC papers.

Tested it, and the results are not good:

mscp v0.2.1

$ ~/mscp.linux.x86_64.static /tmp/usr-share.tar localhost:/tmp/usr-share.tar2
Password: 
[=========>                                                  ]  19%  2.9GB/14.6GB  282.4MB/s

$ rm -f /tmp/usr-share.tar2

$ ~/mscp.linux.x86_64.static -v /tmp/usr-share.tar localhost:/tmp/usr-share.tar2
bitrate limit: 0 bps
Password: 
thread[0]: connecting to localhost
thread[1]: connecting to localhost
thread[2]: connecting to localhost
thread[3]: connecting to localhost
thread[4]: connecting to localhost
thread[5]: connecting to localhost
thread[6]: connecting to localhost
[=====================================>          ]  83% 12.1GB/14.6GB  313.5MB/s

Using just normal scp over localhost (IPv4) I am getting about 430MB/s (scp sending, sshd receiving), or 413MB/s (ssh sending, scp receiving).

All files on tmpfs in memory.

Does not look like a bottleneck on the sshd side (see attached image).

Same results with forcing -o [email protected], ca. 300MB/s.

AMD Threadripper 2950X (Zen+), 16 core (32 threads) CPU, ca. 3.2-4.2GHz

OpenSSH 1:9.6p1-3

OpenSSL 3.2.1-3

Netcat loopback over ::1 (/dev/zero |nc; nc>/dev/null), 1.1GB/s

iperf3 over a single TCP stream on ::1: 22-39 Gbps (without and with the -Z option)

Then I tested the deb 0.2.1-1~noble for Ubuntu, and got 2.3GB/s easily with the default 7 threads, and about 2.8GB/s with a manual -n 10 (could do more to a remote system, but that is above 25 Gbps already, and the other machine with a 100Gbps NIC on my network is currently offline).

So the issue clearly looks to be a problem with the prebuilt binary. Yes, there is a warning in the README, but I was not expecting 10× worse performance.

baryluk avatar May 12 '24 18:05 baryluk

Thanks for the report.

but I was not expecting 10× worse performance.

Neither was I. In my environment with a Ryzen 9 7950X 16-core CPU, the throughput of the single-binary mscp with one connection is about 430MB/s, while the throughput of a normal build is over 1GB/s.

ryzen1 ~/w/m/build > ldd ~/mscp.linux.x86_64.static
	not a dynamic executable
ryzen1 ~/w/m/build > ~/mscp.linux.x86_64.static -n 1 ~/5g.img localhost:tmp/
[===============================================] 100%  5.0GB/5.0GB  428.2MB/s  00:13 
ryzen1 ~/w/m/build > ldd ./mscp
	linux-vdso.so.1 (0x00007fffa1957000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fe6df326000)
	libcrypto.so.3 => /lib/x86_64-linux-gnu/libcrypto.so.3 (0x00007fe6deee2000)
	libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007fe6deec6000)
	libgssapi_krb5.so.2 => /lib/x86_64-linux-gnu/libgssapi_krb5.so.2 (0x00007fe6dee72000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fe6dec49000)
	/lib64/ld-linux-x86-64.so.2 (0x00007fe6df49b000)
	libkrb5.so.3 => /lib/x86_64-linux-gnu/libkrb5.so.3 (0x00007fe6deb7c000)
	libk5crypto.so.3 => /lib/x86_64-linux-gnu/libk5crypto.so.3 (0x00007fe6deb4d000)
	libcom_err.so.2 => /lib/x86_64-linux-gnu/libcom_err.so.2 (0x00007fe6deb47000)
	libkrb5support.so.0 => /lib/x86_64-linux-gnu/libkrb5support.so.0 (0x00007fe6deb39000)
	libkeyutils.so.1 => /lib/x86_64-linux-gnu/libkeyutils.so.1 (0x00007fe6deb32000)
	libresolv.so.2 => /lib/x86_64-linux-gnu/libresolv.so.2 (0x00007fe6deb1e000)

ryzen1 ~/w/m/build > ./mscp -n 1 ~/5g.img localhost:tmp/ 
[===============================================] 100%  5.0GB/5.0GB    1.1GB/s  00:05 

Does the 10x performance degradation happen on other machines? I suspect the Threadripper could be a factor, but I cannot confirm that because I don't have one.

The single-binary version of mscp uses musl libc for portability, and it is known that musl libc's memory handling causes performance degradation compared with glibc (ref1, ref2).

upa avatar May 13 '24 07:05 upa

@upa I will test on some other systems soon.

I will also build it locally against glibc and musl (either on Debian or in a Docker container), with the same compiler and flags, and see whether that is the cause.

It could be that the musl memory allocator or pthread support is subpar (glibc probably scales a bit better to more threads and cores), but I would not expect that to matter with fewer than 10 threads.

That said, the fact that the binary is not showing any thread at 100% does suggest some lock contention (possibly in the allocator).

I will do some profiling with perf later.

baryluk avatar May 13 '24 13:05 baryluk