Refactoring of RTL MVAU

Open mmrahorovic opened this issue 1 year ago • 0 comments

(This PR combines the previously closed PRs: PR #976, PR #975 and PR #794.)

Adds support for multi-packed DSP48s and DSP58s in the 'MatrixVectorActivation' layer, and for DSP58s in the 'VectorVectorActivation' layer. For weights and activations between 4 and 8 bits wide (activations may be up to 9 bits wide on the DSP58), the custom layer packs 2, 3 or 4 elements onto the input datapath of the DSP, achieving 2, 3 or 4 MACs per cycle per DSP48/DSP58 depending on bit-width and board.
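The packing idea can be illustrated in software. The following is a minimal sketch, assuming a simple two-lane layout with signed 4-bit weights and an unsigned 4-bit activation. The names `LANE_SHIFT`, `pack_weights`, `unpack_products` and `packed_mac2` are hypothetical, and the lane offsets and correction logic in the actual RTL (e.g. mvu_4sx4u.sv) may well differ.

```python
# Illustrative only: a two-lane packing scheme, not the layout used in the RTL.
# Assumption: a signed 4-bit weight (-8..7) times an unsigned 4-bit activation
# (0..15) lies in -120..105, so each product fits in an 8-bit signed lane.
LANE_SHIFT = 8


def pack_weights(w0, w1, shift=LANE_SHIFT):
    """Place w1 `shift` bits above w0 in one wide multiplier operand."""
    return w0 + (w1 << shift)


def unpack_products(p, shift=LANE_SHIFT):
    """Recover (w0 * a, w1 * a) from the single wide product p."""
    lo = p & ((1 << shift) - 1)   # low lane as an unsigned field
    if lo >= 1 << (shift - 1):    # sign-extend the low lane
        lo -= 1 << shift
    hi = (p - lo) >> shift        # undo the low lane's borrow, then shift down
    return lo, hi


def packed_mac2(w0, w1, a):
    """Two multiplications using a single wide multiply."""
    return unpack_products(pack_weights(w0, w1) * a)


print(packed_mac2(-8, 7, 15))  # (-120, 105)
```

Subtracting the sign-extended low lane before shifting is what keeps the upper product exact when the lower product is negative. A hardware implementation would apply an equivalent correction to the upper lanes.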

Important: point the PyVerilator commit hash to ce0a08c (https://github.com/maltanar/pyverilator/tree/refactor/drive_rising_edge) so that the RTL simulation tests pass. This change is not included in the PR, since additional testing might lead to further fixes on that PyVerilator branch.


Functionalities to be added for the MVU

  • [x] rtllib: RTL implementation for the DSP58-based MVU
    • [x] 4-bit weights x 4-bit activations DSP48 & DSP58: mvu_4sx4u.sv
    • [x] >4-bit weights x >4-bit activations DSP48: mvu_8sx8_dsp48.sv
    • [x] (>)4-bit weights x (>)4-bit activations DSP58: mvu_vvu_8sx9_dsp58.sv
    • [x] Flow control and AXI wrapper: mvu_vvu_axi.sv and mvu_vvu_axi_wrapper.v, respectively
  • [x] Custom-op for the new RTL component: see matrixvectoractivation_rtl.py
    • [x] Code generation
    • [x] IP-stitching
    • [x] Resource estimations
    • [x] Cycle estimations
    • [x] CPPsim & RTLsim
  • [x] Transformation to instantiate the newly created custom-op: see additions in specialize_layers.py

Tests

  • [x] FINN unit test -- test for the MVU custom-op & transformation (node-by-node CPPsim, node-by-node RTLsim, stitched-ip RTLsim): see test_fpgadataflow_rtl_mvau under test_fpgadataflow_mvau.py
  • [x] RTL Testbench: mvu_axi_tb.sv

Outstanding bugs & features

  • [x] PyVerilator rework to simulate the newly introduced RTL-MVU designs correctly; see https://github.com/maltanar/pyverilator/pull/6 for a more detailed description.
  • [x] PyVerilator bug when simulating arrays with a loop-carried dependency: use packed arrays instead of unpacked arrays, and explicitly enforce signed arithmetic wherever it is expected.
  • [x] Support for DSP48E1.
  • [x] 4-bit weights x 4-bit activations DSP48 & DSP58: support for unsigned activations.
  • [x] Relaxing SIMD constraint (SIMD being a multiple of 3) for DSP58-based implementation.

mmrahorovic, Mar 04 '24 11:03