Potential optimization in the RISC-V example
I noticed that the getFields function of the RISC-V example often gets called just to get the value of a single field (with calls along the line of let funct3 := get(getFields(inst), funct3)). Since fetching the values of all the fields just to extract the value of one of them looked a bit overkill, I tried replacing getFields with individual getXxx functions (e.g. getFunct3). This provides a decent boost to simulation performance:
-- Running tests with Cuttlesim --
_objects/rv32i.v/rvcore.cuttlesim.opt tests/_build/rv32i/integ/rvbench_qsort.rv32 -1
[before] real: 0m0.018s user: 0m0.017s sys: 0m0.001s
[after ] real: 0m0.013s user: 0m0.013s sys: 0m0.000s
_objects/rv32i.v/rvcore.cuttlesim.opt tests/_build/rv32i/integ/rvbench_median.rv32 -1
[before] real: 0m0.012s user: 0m0.012s sys: 0m0.000s
[after ] real: 0m0.008s user: 0m0.007s sys: 0m0.001s
_objects/rv32i.v/rvcore.cuttlesim.opt tests/_build/rv32i/integ/img.rv32 -1
[before] real: 0m0.216s user: 0m0.195s sys: 0m0.020s
[after ] real: 0m0.160s user: 0m0.142s sys: 0m0.018s
_objects/rv32i.v/rvcore.cuttlesim.opt tests/_build/rv32i/integ/primes.rv32 -1
[before] real: 0m4.421s user: 0m4.417s sys: 0m0.001s
[after ] real: 0m3.090s user: 0m3.084s sys: 0m0.001s
_objects/rv32i.v/rvcore.cuttlesim.opt tests/_build/rv32i/integ/morse.rv32 -1
[before] real: 0m1.584s user: 0m1.583s sys: 0m0.000s
[after ] real: 0m1.086s user: 0m1.085s sys: 0m0.000s
-- Running tests with Verilator --
_objects/rv32i.v/obj_dir.opt/Vtop +VMH=tests/_build/rv32i/integ/rvbench_qsort.vmh -1
[before] real: 0m0.033s user: 0m0.032s sys: 0m0.001s
[after ] real: 0m0.030s user: 0m0.030s sys: 0m0.000s
_objects/rv32i.v/obj_dir.opt/Vtop +VMH=tests/_build/rv32i/integ/morse.vmh -1
[before] real: 0m2.689s user: 0m2.687s sys: 0m0.000s
[after ] real: 0m2.186s user: 0m2.184s sys: 0m0.000s
_objects/rv32i.v/obj_dir.opt/Vtop +VMH=tests/_build/rv32i/integ/rvbench_median.vmh -1
[before] real: 0m0.022s user: 0m0.022s sys: 0m0.000s
[after ] real: 0m0.018s user: 0m0.017s sys: 0m0.001s
_objects/rv32i.v/obj_dir.opt/Vtop +VMH=tests/_build/rv32i/integ/primes.vmh -1
[before] real: 0m8.358s user: 0m8.349s sys: 0m0.001s
[after ] real: 0m6.778s user: 0m6.767s sys: 0m0.001s
_objects/rv32i.v/obj_dir.opt/Vtop +VMH=tests/_build/rv32i/integ/img.vmh -1
[before] real: 0m0.399s user: 0m0.385s sys: 0m0.014s
[after ] real: 0m0.327s user: 0m0.310s sys: 0m0.017s
This results in a speedup of roughly 1.43 for Cuttlesim and of 1.23 for Verilator compared to master. I did not yet check the effects on synthesis but I might do it soon. I assume that this kind of optimization is applied automatically when synthesizing (isn't it?), so I don't expect changes on this side (alas). See commit 04cfdf891 for a demonstration.
Nice sleuthing!
Indeed, this won't speed up actual circuits, because constructing a structure in Verilog doesn't cost anything (it just means remembering that a bunch of wires are bundled together).
On the simulation side it's a nice speedup, but we don't really want to "cheat" by rewriting code to make the simulator happier (especially because these sort of code changes then make it harder to say anything sinceere about Cuttlesim vs. other simulators, since we would have modified the code to fit Cuttlesim better). Instead, we should improve the simulator so that it handles this pattern better (It's a common pattern in the bluespec world).
I thought I did some experiments with this in the past, but I can't find my notes about them. What happens to performance if you force inlining of that function call at the C++ level (IIRC __attribute__((always_inline)) should do it)? Ideally inlining + dead code elimination should get rid of the unnecessary computation.
If that doesn't work, or if it's unreliable, then this would be a really interesting optimization pass to implement in Cuttlesim! We would check whether a function is pure and returns a struct, and if it does we would inline just the relevant computations. Sounds fun, actually :)
I just took a glance at how the C++ code for getFields is generated. As it is not a rule but a function, it returns by modifying its ret argument (initially passed by reference). This argument refers to a temporary variable created by the CALL_FN macro in the surrounding statement expression. Just a quick heads up, I guess that naming something tmp in a Kôika function (which should not be too rare) could lead to frustration since it would be hidden by this variable when used as an argument. This is especially true in the case where an element of type T called tmp is passed to a function returning T, since then the compiler wouldn't even complain about types not matching.
Great catch, thanks! Fortunately, there's at least an inkling of the issue thanks to the compiler printing a warning (cuttlesim.hpp:253:47: warning: ‘tmp’ may be used uninitialized in this function [-Wmaybe-uninitialized]).
I pushed a fix just now.
An occurrence of tmp was left unchanged in the READ macro, which leads to errors when compiling anything that uses it (e.g. examples/conflicts_modular.v: cuttlesim.hpp:1403:27: error: ‘tmp’ was not declared in this scope; did you mean ‘tm’?).
Ugh, thanks a lot!