[CPU] Optimize some kernels from CPU backend

Open mciprian13 opened this issue 5 years ago • 15 comments

Summary

  • Optimize the Insert/Extract/Transpose kernels from the CPU backend by replacing the address arithmetic performed at run-time with a simple access pattern based only on offsets, which are generated at compile-time by a tensor utility class named TensorAccessPattern.
  • Minor pointer-arithmetic optimizations for other kernels, e.g. resize and softmax.

Note: The use of macro definitions like libjit_getXYZW should be discouraged since it results in poor performance. Instead, the kernels should avoid these macros as much as possible and rely on more linear access patterns or on offsets pre-computed at compile-time.
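To illustrate the idea of pre-computed offsets, here is a minimal sketch of a transpose driven by an offset table. It is not the TensorAccessPattern class from this PR: the names makeTransposeOffsets and transposeWithOffsets are hypothetical, and the sketch assumes a dense row-major float tensor. The table is built once (the step that would happen at compile-time), so the kernel itself only performs a linear walk with one indexed load per element.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not Glow's TensorAccessPattern): for each output
// element of a transpose, taken in linear order, compute the flat offset of
// the matching input element. Intended to run once, ahead of kernel execution.
std::vector<std::size_t>
makeTransposeOffsets(const std::vector<std::size_t> &inDims,
                     const std::vector<std::size_t> &perm) {
  assert(inDims.size() == perm.size());
  const std::size_t rank = inDims.size();

  // Row-major strides of the input tensor.
  std::vector<std::size_t> inStrides(rank, 1);
  for (std::size_t i = 1; i < rank; ++i) {
    const std::size_t d = rank - 1 - i;
    inStrides[d] = inStrides[d + 1] * inDims[d + 1];
  }

  // Output dimension i is input dimension perm[i].
  std::vector<std::size_t> outDims(rank);
  std::size_t total = 1;
  for (std::size_t i = 0; i < rank; ++i) {
    outDims[i] = inDims[perm[i]];
    total *= outDims[i];
  }

  std::vector<std::size_t> offsets;
  offsets.reserve(total);
  std::vector<std::size_t> idx(rank, 0); // current output multi-index
  for (std::size_t n = 0; n < total; ++n) {
    std::size_t off = 0;
    for (std::size_t i = 0; i < rank; ++i) {
      off += idx[i] * inStrides[perm[i]];
    }
    offsets.push_back(off);
    // Advance the multi-index like an odometer, last dimension fastest.
    for (std::size_t i = rank; i-- > 0;) {
      if (++idx[i] < outDims[i]) {
        break;
      }
      idx[i] = 0;
    }
  }
  return offsets;
}

// Run-time kernel: no multiplications or index decoding, only a table lookup.
void transposeWithOffsets(const float *src, float *dst,
                          const std::vector<std::size_t> &offsets) {
  for (std::size_t i = 0; i < offsets.size(); ++i) {
    dst[i] = src[offsets[i]];
  }
}
```

The trade-off in this sketch is memory: the table holds one offset per element, which may matter on small targets, so whether it is a net win would have to be measured per kernel and per architecture.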

Test Plan

Current unit tests are passing. Added extra unit tests in TensorsTest.cpp for the new utilities.

mciprian13 avatar Feb 20 '21 13:02 mciprian13

@opti-mix Can you take a look at this? Thanks!

mciprian13 avatar Feb 23 '21 17:02 mciprian13

@jackm321 Do you have time to review this PR? Thanks!

mciprian13 avatar Apr 09 '21 16:04 mciprian13

@yinghai Could you take a look at this? Thanks!

mciprian13 avatar Apr 16 '21 18:04 mciprian13

@mciprian13 Could you report how much of a perf win you observe with the approach in this PR?

opti-mix avatar Apr 16 '21 20:04 opti-mix

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jun 09 '21 03:06 stale[bot]

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jul 01 '21 21:07 stale[bot]

Ping.

mciprian13 avatar Jul 02 '21 07:07 mciprian13

@mciprian13 Did you address comments from @opti-mix here?

jfix71 avatar Jul 17 '21 03:07 jfix71

@opti-mix @jfix71 I still need to provide some performance numbers to show that this is actually an optimization. Note that we are mainly interested in microcontrollers, so the results might be biased towards these architectures. I will come back with numbers to provide the requested info.

mciprian13 avatar Jul 19 '21 14:07 mciprian13

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Aug 21 '21 04:08 stale[bot]

Ping.

mciprian13 avatar Aug 23 '21 09:08 mciprian13

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Mar 02 '22 10:03 stale[bot]

Ping.

mciprian13 avatar Mar 02 '22 13:03 mciprian13

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Apr 16 '22 14:04 stale[bot]

ping

mciprian13 avatar Apr 18 '22 09:04 mciprian13