Henry Ho
solve conflict
1. more DSPs, 2. higher Fmax, 3. better algorithm like Winograd
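To make point 3 concrete, here is a minimal sketch of the 1-D Winograd F(2,3) transform. It produces two outputs of a 3-tap filter with 4 multiplications instead of the 6 a direct computation needs, which is how Winograd gets more throughput from the same number of DSPs. The function names `winograd_f23` and `direct_f23` are made up for this illustration and do not come from the design being discussed.

```python
def winograd_f23(d, g):
    """Two outputs of a 3-tap filter g over 4 inputs d, using 4 multiplies."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    # Filter-side terms; in a real design these are precomputed once per filter.
    gs = (g0 + g1 + g2) / 2
    gd = (g0 - g1 + g2) / 2
    # The only 4 data-dependent multiplications.
    m1 = (d0 - d2) * g0
    m2 = (d1 + d2) * gs
    m3 = (d2 - d1) * gd
    m4 = (d1 - d3) * g2
    return [m1 + m2 + m3, m2 - m3 - m4]


def direct_f23(d, g):
    """Reference: the same two outputs computed with 6 multiplies."""
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]


if __name__ == "__main__":
    print(winograd_f23([1, 2, 3, 4], [1, 2, 3]))
    print(direct_f23([1, 2, 3, 4], [1, 2, 3]))
```

The 2-D F(2x2, 3x3) variant used for CNN layers nests this transform and needs 16 multiplies instead of 36, at the cost of extra adders and some numerical care with fixed-point data.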
You only use 16x32 = 512 DSPs, while an Arria 10 can have 3036 DSPs, and the remaining ALMs can also be used as multipliers. So someone who uses all the DSPs would be about 6 times faster...
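A quick back-of-the-envelope check of that ratio, as a rough sketch. The DSP counts are the ones quoted in this thread. The helper names and the 200 MHz clock are placeholders chosen for illustration, not measured values, and the estimate assumes one multiply-accumulate per DSP per cycle with perfect utilization.

```python
USED_DSPS = 16 * 32        # DSPs used by the design under discussion
AVAILABLE_DSPS = 3036      # Arria 10 figure quoted in this thread


def dsp_speedup(used, available):
    """Ideal speedup from using every DSP at the same clock frequency."""
    return available / used


def peak_gops(dsps, fmax_mhz):
    """Peak throughput in GOPS, counting one MAC as 2 operations."""
    return 2 * dsps * fmax_mhz / 1000.0


if __name__ == "__main__":
    fmax_mhz = 200.0  # hypothetical clock, only to put a scale on the numbers
    print(f"ideal speedup: {dsp_speedup(USED_DSPS, AVAILABLE_DSPS):.2f}x")
    print(f"peak with used DSPs: {peak_gops(USED_DSPS, fmax_mhz):.1f} GOPS")
    print(f"peak with all DSPs:  {peak_gops(AVAILABLE_DSPS, fmax_mhz):.1f} GOPS")
```

The ratio does not depend on the clock, so the "about 6 times" figure holds for any Fmax as long as the larger design still closes timing at the same frequency.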
You can also check the DLA on Arria 10 with OpenVINO; its throughput is even higher than the 2017 paper's. And I don't think local memory is the limitation if someone has carefully optimized...
The low DSP utilization may be owing to high fanout, which increases the effort required of the fitter. You can use a systolic array to reduce fanout.
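To show why a systolic array helps with fanout, here is a small cycle-by-cycle software model of an output-stationary array for matrix multiplication. It is only a behavioral sketch in Python, not the kernel from this design, and `systolic_matmul` is a made-up name. The point is that each processing element (PE) reads its operands only from the register of its left and top neighbor, so no signal is broadcast to a whole row or column of DSPs.

```python
def systolic_matmul(A, B):
    """Model an M x N grid of PEs computing C = A @ B (A is M x K, B is K x N)."""
    M, K, N = len(A), len(B), len(B[0])
    acc = [[0] * N for _ in range(M)]      # one accumulator per PE
    a_reg = [[0] * N for _ in range(M)]    # operand registers flowing left to right
    b_reg = [[0] * N for _ in range(M)]    # operand registers flowing top to bottom
    for t in range(K + M + N - 2):         # cycles until the last PE has finished
        new_a = [[0] * N for _ in range(M)]
        new_b = [[0] * N for _ in range(M)]
        for i in range(M):
            for j in range(N):
                if j == 0:
                    k = t - i              # row i enters the left edge skewed by i cycles
                    new_a[i][j] = A[i][k] if 0 <= k < K else 0
                else:
                    new_a[i][j] = a_reg[i][j - 1]   # left neighbor only
                if i == 0:
                    k = t - j              # column j enters the top edge skewed by j cycles
                    new_b[i][j] = B[k][j] if 0 <= k < K else 0
                else:
                    new_b[i][j] = b_reg[i - 1][j]   # top neighbor only
                acc[i][j] += new_a[i][j] * new_b[i][j]
        a_reg, b_reg = new_a, new_b
    return acc


if __name__ == "__main__":
    print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

In hardware terms, every register in this model drives exactly one neighboring PE, so routing stays local and short. The trade-off is a pipeline fill latency of roughly M + N cycles before all PEs are busy.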