
Performance of Renode/Verilator cosim

tcal-x opened this issue 4 years ago • 6 comments

We should collect some statistics on Renode/Verilator cosimulation to check whether it is close to what would be expected.

  • If the CFU was much "smaller" than the CPU, then we would expect Verilator simulation of just the CFU to require much less host time per simulated cycle than for CPU-only or CPU+CFU (PLATFORM=sim) Verilator simulation. However, in some cases such as hps_accel, the CFU is larger than the CPU, so we wouldn't expect much speedup (less than a factor of 2).

    • I say 'smaller' in quotes since we are simulating the pre-implementation Verilog, so the work of simulating a particular design might not be proportional to the implementation LUT count.
  • To get a handle on this, can we measure:

    • Host time per simulation cycle for proj_template Verilated CFU (using cosim)
    • Host time per simulation cycle for hps_accel Verilated CFU (using cosim)
    • Host time per simulation cycle for proj_template CPU+CFU (full SoC simulation using PLATFORM=sim)
    • Host time per simulation cycle for hps_accel CPU+CFU (full SoC simulation using PLATFORM=sim)
  • Also, if the CFU is active for only a small percentage of overall execution cycles, then we would expect the Renode/Verilator cosim to be much faster than full-SoC Verilator sim. But if the CFU is active for the majority of cycles, then we wouldn't expect much speedup from this factor.
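
One way to collect the per-cycle numbers above (a sketch, not from the thread: it assumes we can wrap a fixed-cycle simulation run in a callable and already know how many cycles it simulates; `run_simulation` is a hypothetical stand-in for something like a `make PLATFORM=sim` invocation):

```python
import time

def host_time_per_cycle(run_simulation, simulated_cycles):
    """Measure host seconds per simulated cycle for one simulation run.

    run_simulation: callable that runs the Verilated simulation for a
        known number of cycles (hypothetical wrapper around the real run).
    simulated_cycles: how many cycles that run simulates.
    """
    start = time.perf_counter()
    run_simulation()
    elapsed = time.perf_counter() - start
    return elapsed / simulated_cycles

if __name__ == "__main__":
    # Dummy "simulation" (a sleep) standing in for a real Verilator run.
    per_cycle = host_time_per_cycle(lambda: time.sleep(0.1), 1_000_000)
    print(f"{per_cycle * 1e9:.1f} ns of host time per simulated cycle")
```

Running this for each of the four configurations listed above would let us compare them on the same host-time-per-cycle axis.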

Can we print out the total number of cycles that Verilator simulates during Renode/Verilator cosim? That count could then be compared directly against the total execution cycle count from an actual run on the board, giving us an idea of the fraction of the original execution cycles for which the CFU needs to be simulated.
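
The reasoning here can be sketched as a back-of-envelope model (my own illustration, not something from the thread): if the CFU is active for a fraction of total cycles, cosim pays the slow Verilator per-cycle cost only for those cycles and the much cheaper Renode per-cycle cost for the rest. The per-cycle times below are made-up parameters, not measurements.

```python
def cfu_active_fraction(verilator_cycles, board_total_cycles):
    """Fraction of the original execution cycles that involve the CFU,
    comparing the cosim's Verilator cycle count against the total
    cycle count measured on the board."""
    return verilator_cycles / board_total_cycles

def expected_cosim_speedup(active_fraction, t_verilator, t_renode):
    """Amdahl-style estimate of cosim speedup over full-SoC Verilator sim.

    t_verilator, t_renode: host time per simulated cycle in Verilator
    and in Renode respectively (illustrative values only).
    """
    full_soc = t_verilator  # full-SoC sim pays the Verilator rate every cycle
    cosim = active_fraction * t_verilator + (1 - active_fraction) * t_renode
    return full_soc / cosim

# CFU active 10% of cycles, Verilator 100x slower per cycle than Renode:
f = cfu_active_fraction(verilator_cycles=1_000_000,
                        board_total_cycles=10_000_000)
print(expected_cosim_speedup(f, t_verilator=1e-5, t_renode=1e-7))  # ~9.2
```

As the active fraction approaches 1, the estimate collapses toward no speedup, matching the hps_accel caveat above.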

Finally, is the Verilator-generated C++ always built with waveform-dumping support? We should measure how much faster simulation runs if waveform generation is disabled in the generated code.

FYI @alanvgreen

tcal-x avatar Oct 04 '21 21:10 tcal-x

I think wall time to run tests is probably a much simpler thing to examine?

With Renode+CFU you should also be able to use multiple host CPUs at once (at least one for Renode and one for Verilator).

mithro avatar Oct 04 '21 21:10 mithro

In chat, @PiotrZierhoffer suggested reducing the value on this line: https://github.com/google/CFU-Playground/blob/main/scripts/generate_renode_scripts.py#L91.

I tried it, measuring wall clock time for one hps_accel inference. Baseline time was 1:22. Reducing the value by 1000x reduced time to 1:19. Reducing it by another 1000x reduced the time to 1:17.

tcal-x avatar Oct 05 '21 03:10 tcal-x

We are also about to release changes that get rid of the ticking completely. They passed the internal review already, so we're getting there soon.

PiotrZierhoffer avatar Oct 05 '21 11:10 PiotrZierhoffer

@mithro the execution is not really parallel here: when we execute a single instruction, we let the CFU calculate everything and then return control to the main Renode thread.

PiotrZierhoffer avatar Oct 05 '21 11:10 PiotrZierhoffer

@tcal-x @mithro the latest changes by @robertszczepanski, pulled in with #301, should already improve the performance here.

PiotrZierhoffer avatar Oct 18 '21 13:10 PiotrZierhoffer

I'm finding the cosim very slow with hps_accel and GATEWARE_GEN=2. I think this is because there is quite a bit of free-running logic in the gen2 gateware.

alanvgreen avatar Dec 21 '21 21:12 alanvgreen