Testing performance
I have pushed a test release of the C++ generator.
You can build it yourself (instructions here) or download precompiled binaries from here. The precompiled binaries require the AVX instruction set, so if your CPU doesn't support it, you have to build them yourself.
It consists of two binaries: randomjs (the generator) and xst (the JavaScript engine).
It generates and executes 1000 programs and calculates Blake2b hashes of all outputs.
Post your performance numbers and CPU specs.
Xeon E3-1245 @3.7 GHz (Debian 9)
> ./randomjs
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 100.232 programs per second
Xeon (Skylake, IBRS) @ 2.1 GHz (Ubuntu 18.04 LTS)
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 62.0579 programs per second
Intel Core i7-7820X @ 3.6 GHz (Windows 10). Built all binaries (release build) myself.
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 129.94 programs per second
From precompiled binaries.
Ryzen 1700x @ 3.4GHz Windows 10
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 86.3199 programs per second
Intel Core i7-7820X @ 3.6 GHz (Windows 10). Rebuilt all binaries using profile guided optimizations: 6.4% faster.
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 138.256 programs per second
Edit: here are the precompiled binaries on the same PC
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 130.962 programs per second
Raspberry Pi 3 @ 1.2 GHz
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 13.9537 programs per second
(Note: To build for ARM, the currently used SSE2 version of Blake2b must be replaced by the reference or NEON optimized version.)
Core i7-4700HQ @ 2.4 GHz Windows 8.1 with 16gigs RAM
..\Wownero-Test>randomjs.exe
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 96.2629 programs per second
..\Wownero-Test>randomjs.exe
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 95.4948 programs per second
..\Wownero-Test>randomjs.exe
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 96.447 programs per second
..\Wownero-Test>randomjs.exe
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 94.1722 programs per second
..\Wownero-Test>randomjs.exe
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 99.5576 programs per second
@tevador Your precompiled binaries crash with exception code 0xC000001D (illegal instruction) on Pentium G5400 (Coffee Lake) because it doesn't support AVX. I had to recompile it with SSE2.
Intel Core i5-3210M @ 2.9 GHz, precompiled binaries:
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 75.0907 programs per second
Intel Core i5-3210M @ 2.9 GHz, my binaries with PGO:
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 84.095 programs per second
Intel Pentium G5400 @ 3.7 GHz, my binaries without PGO:
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 109.548 programs per second
Intel Pentium G5400 @ 3.7 GHz, my binaries with PGO:
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 116.807 programs per second
@SChernykh Yes, the precompiled binaries require AVX. I added a note to the original comment. I didn't realize there were still modern CPUs where Intel disables it.
Do the profile guided optimizations work in general or just for these particular 1000 programs? You can run a different set of programs by modifying the block header template in main.cpp. It would be best to read the block template and nonce count from the command line, but I didn't have time to implement it yet.
@tevador 1000 programs (~10 seconds of CPU time) is a big enough sample for PGO to work well. It optimizes the C/C++ code at a low level and only needs execution statistics: which branches are executed more often, which if statements are taken and which are not, and so on. In my experience, PGO almost always gives some improvement.
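As a rough illustration of the branch statistics mentioned above, here is a minimal sketch with one hot and one cold branch. The function name sum_clamped and the GCC commands in the comments are illustrative assumptions, not the build steps actually used for randomjs or xst; other compilers have their own PGO switches.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative only: sums values, clamping rare outliers to a limit.
// With typical input the clamp branch is almost never taken, which is the
// kind of statistic a PGO training run records.
//
// Possible GCC workflow (assumed, not the project's build script):
//   g++ -O2 -fprofile-generate demo.cpp -o demo   # instrumented build
//   ./demo                                        # training run writes profile data
//   g++ -O2 -fprofile-use demo.cpp -o demo        # rebuild using the profile
std::int64_t sum_clamped(const std::vector<std::int64_t>& values, std::int64_t limit) {
    assert(limit >= 0);
    std::int64_t total = 0;
    for (std::int64_t v : values) {
        if (v > limit) {   // cold branch for typical data
            total += limit;
        } else {           // hot branch
            total += v;
        }
    }
    return total;
}
```

The training run only has to be representative of the real workload, which is why a fixed set of 1000 programs can be enough.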
Ryzen 1600 @ 3.6 GHz (Linux)
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 115.622 programs per second
Raspberry Pi 3 @ 1.2 GHz
@tevador Was it 64 bit or 32 bit build? Are there any differences between 32 and 64 bit?
@SChernykh Ubuntu 16.04 armv7l (32bit). Haven't tested a 64bit build since the software support is still a bit lacking.
Ryzen 1700 @ 3.6 GHz, Windows 10 (original testing binaries)
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 92.7852 programs per second
Finally found a way to test it on 64-bit ARM - there are a number of cloud hosting providers that have ARM servers.
OS: Ubuntu 18.04 LTS
Compiler: g++ (Ubuntu 8-20180414-1ubuntu2) 8.0.1 20180414 (experimental) [trunk revision 259383]
Processor: Cavium ThunderX 88XX @ 2GHz (aarch64)
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 17.4324 programs per second
Interesting. The performance per clock seems to be significantly lower than Raspberry Pi 3 in 32-bit mode. Perhaps the CPU is an older model?
Can you also test a 32-bit build of the executables? I think the compiler flag is -march=armv7.
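The per-clock comparison above can be checked with a small helper. This is only a sketch: per_ghz is a hypothetical name, and it assumes the nominal clock speeds quoted in the posts, which may not match what a shared cloud core actually runs at.

```cpp
#include <cassert>

// Programs per second normalized by clock frequency in GHz.
double per_ghz(double programs_per_second, double clock_ghz) {
    assert(clock_ghz > 0.0);
    return programs_per_second / clock_ghz;
}
```

With the numbers posted in this thread, per_ghz(13.9537, 1.2) is about 11.6 for the Raspberry Pi 3 and per_ghz(17.4324, 2.0) is about 8.7 for the ThunderX, so the ThunderX result is roughly 25% lower per GHz.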
It's a cloud server - very unstable performance because it's 4 virtual cores on a 96-core server:
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 13.4565 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 12.1426 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 11.2156 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 13.0792 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 12.42 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 10.9402 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 12.3119 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 10.4637 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 10.3319 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 12.4766 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 11.9622 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 10.7777 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 11.278 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 12.9119 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 17.1197 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 16.4346 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 16.774 programs per second
There are also providers that offer bare-metal ARM servers; I'll try them tomorrow. I'll also try to test a 32-bit build, but recompiling Boost will take a few hours.
You can specify --with-libraries=system,filesystem when compiling boost. It speeds up compilation considerably (these are the only 2 libraries required by randomjs at the moment).
No luck in compiling 32-bit code
cc1plus: error: unknown value ‘armv7’ for -march
cc1plus: note: valid arguments are: armv8-a armv8.1-a armv8.2-a armv8.3-a armv8.4-a native; did you mean ‘armv8-a’?
From what I found online, not all armv8 (64bit) CPUs are backwards compatible with armv7. Extra silicon is needed for this, so backwards compatibility is optional.
It seems that the Cavium ThunderX CPU doesn't support the armv7 instruction set, so it cannot run in 32bit mode.
https://github.com/scaleway/image-debian/issues/86
Ubuntu 16.04
Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 157.207 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 156.94 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 157.211 programs per second
Intel(R) Core(TM) i5-3450 CPU @ 3.10GHz
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 98.8595 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 99.4327 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 99.3007 programs per second
wow. A PoW where the GHz actually matter. These are with the precompiled binaries.
Ubuntu 18.04; Threadripper 1950x
root@TR4:/usr/local/src/RandomJS/src-cpp/bin# for i in {1..10}; do ./randomjs; done
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 123.49 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 120.954 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 123.555 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 123.58 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 122.938 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 122.905 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 123.402 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 122.556 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 122.872 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 123.415 programs per second
let's see if I can get some more performant binaries done
Speaking of optimized version
C:\Users\User\Downloads\RandomJS\src-cpp\x64\Release>randomjs.exe
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 175.757 programs per second
C:\Users\User\Downloads\RandomJS\src-cpp\x64\Release>randomjs.exe
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 176.123 programs per second
C:\Users\User\Downloads\RandomJS\src-cpp\x64\Release>randomjs.exe
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 176.713 programs per second
C:\Users\User\Downloads\RandomJS\src-cpp\x64\Release>randomjs.exe
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 175.023 programs per second
C:\Users\User\Downloads\RandomJS\src-cpp\x64\Release>randomjs.exe
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 176.187 programs per second
Stock binaries do 92.7852 programs per second on the same PC (Ryzen 7 1700 @ 3.6 GHz). It took only a couple of evenings to get about 1.9x the performance, and I have a few more changes in mind to reach 2x or more.
@tevador Is this PoW completely finalized? I could spend some more time optimizing it then.
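The 1.9x figure quoted above can be reproduced from the posted runs. A minimal sketch, assuming the five optimized runs are averaged and compared with the single stock result; the helper names mean and speedup are hypothetical.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Arithmetic mean of a set of benchmark runs (programs per second).
double mean(const std::vector<double>& runs) {
    assert(!runs.empty());
    return std::accumulate(runs.begin(), runs.end(), 0.0) / runs.size();
}

// Ratio of the mean optimized throughput to the baseline throughput.
double speedup(const std::vector<double>& optimized_runs, double baseline) {
    assert(baseline > 0.0);
    return mean(optimized_runs) / baseline;
}
```

For the five runs listed (175.757, 176.123, 176.713, 175.023, 176.187) the mean is about 175.96, and dividing by the stock 92.7852 gives a speedup of roughly 1.90.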
@SChernykh , are those optimizations to the code that will increase performance on any CPU? Or are they compile-time optimizations for the specs of the Ryzen?
Also, I assume these current binaries are for 1 thread. Is there any per-thread requirement (e.g., CPU cache size), or will it be able to run one instance per thread regardless of cache size?
@Gingeropolous These are algorithmic and C++ coding optimizations only. No assembly or compiler flags magic.
Edit: as I understood, it's supposed to run 1 process per CPU core.
Edit2: yes, my optimizations will work on any CPU.
@Gingeropolous The limited testing I've done shows basically linear scaling with the number of cores. I haven't tested the impact of SMT, so I'm not sure if Ryzen will mine faster with 8 or 16 threads.
@SChernykh Are you optimizing the XS engine or the JS generator? The generator takes only about 6% of the time of one hash, so I'm not sure what optimizations can be done there. Can you push the code changes?
The PoW is not final yet. I'm planning some changes to the EvalExpression to increase FPGA resistance, but I have a lot of work this summer, so I don't have time to continue at the moment.
@tevador Most optimizations are in the XS engine. The only thing I changed in the RandomJS binary is the IPC communication with the XS engine. I'm not sure I can make a pull request right now, since some of my optimizations are Windows-only for now (the memory allocator, for example).
I added #ifdef guards to my Windows-only pieces of code and could compile and run it in Ubuntu:
osboxes@osboxes:~/RandomJS/src-cpp/bin$ ./randomjs
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 177.983 programs per second
It's even faster on a virtual Linux machine without profile-guided optimizations than on real Windows with PGO. The GCC compiler is superior! I'll prepare pull requests later today.
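A minimal sketch of the kind of #ifdef guard described above, assuming the Windows-only piece is a page allocator. The names alloc_block and free_block are hypothetical, and the code in the actual pull requests may look quite different.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#ifdef _WIN32
#include <windows.h>
#endif

// Allocates a block of memory: VirtualAlloc on Windows, malloc elsewhere.
void* alloc_block(std::size_t bytes) {
    assert(bytes > 0);
#ifdef _WIN32
    return VirtualAlloc(nullptr, bytes, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
#else
    return std::malloc(bytes);
#endif
}

// Releases a block obtained from alloc_block.
void free_block(void* block) {
#ifdef _WIN32
    if (block != nullptr) {
        VirtualFree(block, 0, MEM_RELEASE);  // size must be 0 with MEM_RELEASE
    }
#else
    std::free(block);
#endif
}
```

Keeping both paths behind one pair of functions means the rest of the code compiles unchanged on either platform.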
XS pull request: https://github.com/tevador/moddable/pull/9
RandomJS pull request: https://github.com/tevador/RandomJS/pull/7
@tevador,
> The PoW is not final yet. I'm planning some changes to the EvalExpression to increase FPGA resistance, but I have a lot of work this summer, so I don't have time to continue at the moment.
Sounds awesome. Could you possibly write out a rough sketch of the idea so someone else could pick up where you left off?
@SChernykh I have merged your code. Btw, I think we could avoid the call to atoi by prepending the program size as a 4-byte integer at the beginning rather than writing it in textual form.
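A rough sketch of the length-prefix idea above. The function names frame_program and read_program_size are hypothetical, and the little-endian byte order is an assumption; the actual IPC format between randomjs and xst is not shown in this thread.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Prepends the program length as a 4-byte little-endian integer (assumed order).
std::string frame_program(const std::string& program) {
    assert(program.size() <= UINT32_MAX);
    const std::uint32_t size = static_cast<std::uint32_t>(program.size());
    std::string framed;
    framed.reserve(4 + program.size());
    for (int i = 0; i < 4; ++i) {
        framed.push_back(static_cast<char>((size >> (8 * i)) & 0xFF));
    }
    framed += program;
    return framed;
}

// Decodes the 4-byte prefix directly; no text parsing (atoi) is involved.
std::uint32_t read_program_size(const std::string& framed) {
    assert(framed.size() >= 4);
    std::uint32_t size = 0;
    for (int i = 0; i < 4; ++i) {
        size |= static_cast<std::uint32_t>(static_cast<unsigned char>(framed[i])) << (8 * i);
    }
    return size;
}
```

The receiver reads exactly four bytes, decodes the size, and then reads that many bytes of program text, so the size never has to be converted from a decimal string.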
@Gingeropolous It still requires a lot of research. The problem is that currently, there is a large variation in the number of EvalExpressions executed per program (IIRC it varies from as little as 8 to hundreds). The goal is to have a narrow range.
As I'm planning to remove the = from the eval chars, the low number of EvalExpressions would enable some theoretical attacks by assuming all EvalExpressions throw a SyntaxError and thus avoiding eval altogether.
The programs with a high number of EvalExpressions create long-running outliers, which is also undesirable. Throw/catch is slow and could be a potential optimization target.