ONE [luci-interpreter] Benchmark modes from tflite-micro

This is continuation of #5080, but with more specific goals.

Goal

To measure luci-interpreter performance and memory consumption for models delivered with tflite-micro: https://github.com/tensorflow/tflite-micro/tree/main/tensorflow/lite/micro/benchmarks

Currently it contains two models:

keyword_scrambled
person_detect

Notes

We are in the process of review and merge of build infrastructure, PAL and memory managers in luci-interpreter:

Memory managers activity #7572
Build infrastructure and PAL #7288

To achieve best results I propose to use memory managers and MCU related stuff.

Sep 01 '21 18:09 binarman

+cc @chunseoklee

Sep 01 '21 18:09 binarman

@binarman @SlavikMIPT Is it possible to investigate similar ( https://github.com/Samsung/ONE/issues/5080#issuecomment-743022914) performance/memory comparison with the latest tf-micro as of now ?

Sep 08 '21 05:09 chunseoklee

@chunseoklee Yes, we can do it. The only thing you should know is that this ( model measured in #5080) is a "toy" example, so it is not actually very descriptive...

Sep 08 '21 08:09 binarman

"toy" example

means sine generator model on tf-micro site, right ?

Sep 08 '21 08:09 chunseoklee

means sine generator model on tf-micro site, right ?

Yes, this is about sin(x) model. It is a normal tflite model (you can cut it from examples, save as tflite file, etc.), but it is too simple compared to real model.

Sep 08 '21 08:09 binarman

Keyword model has some issues: it does not have some fields and does not verify. In order to extract it from char array I've used some hacks on tensorflow code:

code

diff --git a/tensorflow/lite/micro/benchmarks/keyword_scrambled_model_data.cc b/tensorflow/lite/micro/benchmarks/keyword_scrambled_model_data.cc
index 56fed4e..97ca9a8 100644
--- a/tensorflow/lite/micro/benchmarks/keyword_scrambled_model_data.cc
+++ b/tensorflow/lite/micro/benchmarks/keyword_scrambled_model_data.cc
@@ -13,7 +13,8 @@ See the License for the specific language governing permissions and
 limitations under the License.
 ==============================================================================*/
 
-#include "tensorflow/lite/micro/benchmarks/keyword_scrambled_model_data.h"
+//#include "tensorflow/lite/micro/benchmarks/keyword_scrambled_model_data.h"
+
 
 // Keep model aligned to 8 bytes to guarantee aligned 64-bit accesses.
 alignas(8) const unsigned char g_keyword_scrambled_model_data[] = {
@@ -2901,3 +2902,21 @@ alignas(8) const unsigned char g_keyword_scrambled_model_data[] = {
     0x1b, 0x00, 0x00, 0x00};
 
 const unsigned int g_keyword_scrambled_model_data_length = 34576;
+
+#include <fstream>
+#include "tensorflow/lite/schema/schema_generated.h"
+
+int main()
+{
+  auto model = tflite::GetModel(g_keyword_scrambled_model_data);
+  flatbuffers::FlatBufferBuilder fbb;
+  auto offset = tflite::Model::Pack(fbb, model->UnPack());
+  tflite::FinishModelBuffer(fbb, offset);
+  uint8_t *buf = fbb.GetBufferPointer();
+  int size = fbb.GetSize();
+
+  std::ofstream ofs("model.tflite", std::ostream::out | std::ostream::binary);
+  for (int i = 0; i < size; ++i)
+    ofs << buf[i];
+}
+

So I've got the model: model.zip

It has some features we do not support currently:

signed int8 fullyconnected layer
SVDF layer
Quantize layers (that is used for requantization)
Softmax with mixed precision

I plan to resolve quantization and softmax problem with just cutting them out from the model, support SVDF and int8 FC.

Sep 09 '21 19:09 binarman

Model looks like this: keyword_model

Sep 09 '21 19:09 binarman

cut model: model_1-12.tflite.zip cut model with softmax: model_1-13.tflite.zip

Another issue I noticed is SVDF operator takes additional tensor as input and modifies it in-place. This is neither constant nor input tensor, looks like it is internal state of operation, this is kind of breaks pure functional model of our computations, need to think how to handle this...

Also tensorflow documentation states that this is experimental operation, so I assume it could change.

Sep 09 '21 23:09 binarman

I have run three NN tflite models: The first one is simple model from tflite-micro hello_word example (sin(x) model), the second one is keyword model (mentioned and attached above) from tflite-micro benchmarks, and the third is person detection model also from tflite-micro benchmarks (attached below). person_detect.tflite.zip

I have run on stm32F413ZH microcontroller with 320 KB of SRAM in a single thread of MbedOS.

The memory costs were measured using tflite::RecordingMicroInterpreter (https://github.com/tensorflow/tflite-micro/blob/main/tensorflow/lite/micro/docs/memory_management.md)

Results:

sin(x) model

Finished in 128us
[RecordingMicroAllocator] Arena allocation total 676 bytes
[RecordingMicroAllocator] Arena allocation head 40 bytes
[RecordingMicroAllocator] Arena allocation tail 636 bytes
[RecordingMicroAllocator] 'TfLiteEvalTensor data' used 120 bytes with alignment overhead (requested 120 bytes for 10 allocations)
[RecordingMicroAllocator] 'Persistent TfLiteTensor data' used 64 bytes with alignment overhead (requested 64 bytes for 2 tensors)
[RecordingMicroAllocator] 'Persistent TfLiteTensor quantization data' used 40 bytes with alignment overhead (requested 40 bytes for 4 allocations)
[RecordingMicroAllocator] 'Persistent buffer data' used 128 bytes with alignment overhead (requested 104 bytes for 5 allocations)
[RecordingMicroAllocator] 'NodeAndRegistration struct' used 96 bytes with alignment overhead (requested 96 bytes for 3 NodeAndRegistration structs)

keyword model

Finished in 5382us
[RecordingMicroAllocator] Arena allocation total 13020 bytes
[RecordingMicroAllocator] Arena allocation head 680 bytes
[RecordingMicroAllocator] Arena allocation tail 12340 bytes
[RecordingMicroAllocator] 'TfLiteEvalTensor data' used 648 bytes with alignment overhead (requested 648 bytes for 54 allocations)
[RecordingMicroAllocator] 'Persistent TfLiteTensor data' used 64 bytes with alignment overhead (requested 64 bytes for 2 tensors)
[RecordingMicroAllocator] 'Persistent TfLiteTensor quantization data' used 40 bytes with alignment overhead (requested 40 bytes for 4 allocations)
[RecordingMicroAllocator] 'Persistent buffer data' used 548 bytes with alignment overhead (requested 512 bytes for 17 allocations)
[RecordingMicroAllocator] 'TfLiteTensor variable buffer data' used 10248 bytes with alignment overhead (requested 10240 bytes for 7 allocations)
[RecordingMicroAllocator] 'NodeAndRegistration struct' used 480 bytes with alignment overhead (requested 480 bytes for 15 NodeAndRegistration structs)

person detection model

Finished in 12416052us
[RecordingMicroAllocator] Arena allocation total 82276 bytes
[RecordingMicroAllocator] Arena allocation head 55304 bytes
[RecordingMicroAllocator] Arena allocation tail 26972 bytes
[RecordingMicroAllocator] 'TfLiteEvalTensor data' used 1068 bytes with alignment overhead (requested 1068 bytes for 89 allocations)
[RecordingMicroAllocator] 'Persistent TfLiteTensor data' used 64 bytes with alignment overhead (requested 64 bytes for 2 tensors)
[RecordingMicroAllocator] 'Persistent TfLiteTensor quantization data' used 40 bytes with alignment overhead (requested 40 bytes for 4 allocations)
[RecordingMicroAllocator] 'Persistent buffer data' used 23824 bytes with alignment overhead (requested 23456 bytes for 88 allocations)
[RecordingMicroAllocator] 'NodeAndRegistration struct' used 992 bytes with alignment overhead (requested 992 bytes for 31 NodeAndRegistration structs)

Sep 10 '21 09:09 BalyshevArtem

@BalyshevArtem Do you have benchmark result with luci-interpreter? (The result above is the one with tf-micro?)

Sep 12 '21 00:09 chunseoklee

@chunseoklee Sorry for the delayed status =_=

@SlavikMIPT is benchmarking luci interpreter. We faced some issues with both target models:

keyword model uses experimental SVDF operation that is not easy to support right now
person detection model uses too much RAM, so we have some issues fitting it in hardware we have in SRR (we actually have one board with large enough external RAM, but it requires extra efforts to utilize it)

Sep 13 '21 08:09 binarman

@chunseoklee On stm32f767 216MHz with external SDRAM person detection: tflite micro: Finished in 6747161us luci: Finished in 4766142us I had to make some modifications to make the interpreter use external memory

Sep 13 '21 08:09 SlavikMIPT

On stm32f767 216MHz with external SDRAM person detection: tflite micro: Finished in 6747161us luci: Finished in 4766142us

This is a bit interesting. We're still better even with tf-micro's new allocator.

@SlavikMIPT Do you have comparison with sin and kerword model ?

Sep 13 '21 09:09 chunseoklee

@chunseoklee Here is data for sin(x): https://github.com/Samsung/ONE/issues/5080#issuecomment-743022914

Sep 13 '21 10:09 SlavikMIPT

@SlavikMIPT Could you measure it again with new tflite?

Just to be sure data is fresh =)

Sep 13 '21 10:09 binarman

@SlavikMIPT

tflite micro: Finished in 6747161us luci: Finished in 4766142us

This is really interesting results, thank you!

Sep 13 '21 10:09 binarman

@chunseoklee for sin(x) it's almost the same ~50us for tflite micro on internal SRAM, for keyword - the model is broken - so I was not able to convert it to circle

Sep 14 '21 12:09 SlavikMIPT

@chunseoklee I tried to represent https://github.com/Samsung/ONE/issues/7598#issuecomment-916753214 results for person_detection model with board NUCLEO_F413ZH and CMSIS_NN kernels and got result much more better.

With CMSIS_NN:

invokation time: 386595us

Without CMSIS_NN:

invokation time: 13005446us

Sorry for typo in log msg! =) Anybody can represent my measurements with my local branch using README.md.

Sep 15 '21 16:09 m-bronnikov

The results for STM32F767ZI 216MHz, 512kB on-chip SRAM and 128Mbit SDRAM 187MHz on FSMC

tflite micro with CMSIS_NN on external SDRAM:

invokation time: 393753us

luci-micro with CMSIS_NN on external SDRAM:

invokation time: 313689us

At this moment I can't fit luci in 512kB on-chip SRAM to compare, but working on it

Nov 07 '21 20:11 SlavikMIPT

I have run some models on tflite-micro/luci-micro with cmsis-nn from https://github.com/ARM-software/ML-zoo and my own phrase recognition model (MFCC based) on STM32H743ZI Cortex-M7 MCU with 2MBytes of Flash memory, 1MB RAM, 480 MHz CPU(SystemCoreClock is 216MHz for all tests):

Model              | tflite-micro |  luci-micro   |
-------------------------------------------------------
ad_small_int8      |    182721us  |   2628189us
-------------------------------------------------------
kws_micronet_s     |     97284us  |   1771989us 
-------------------------------------------------------
speech_recognition |      3409us  |     13723us 
-------------------------------------------------------

It seems that reference kernels have been used for some operations in luci-micro.

I also ran the benchmarks with model placed in RAM not in FLASH on tflite-micro - inference time has increased on ~10%

Used models and models that can be potentially used on MCU are attached modelzoo.zip

Jan 11 '22 08:01 SlavikMIPT

I also ran the benchmarks with model placed in RAM not in FLASH on tflite-micro - inference time has increased on ~10%

This is extremely unexpected results for me =O

480 MHz CPU(SystemCoreClock is 216MHz for all tests)

Does System core clock affect performance? If I understood documentation right, this clock is used for debugging, tracing or measuring time.

Jan 11 '22 17:01 binarman

Model              | tflite-micro |  luci-micro   |
-------------------------------------------------------
ad_small_int8      |    182721us  |   2628189us
-------------------------------------------------------
kws_micronet_s     |     97284us  |   1771989us 
-------------------------------------------------------
speech_recognition |      3409us  |     13723us 
-------------------------------------------------------

It seems that reference kernels have been used for some operations in luci-micro.

Сould it be that one of the reasons for such a difference in speed between tflite-micro and luci-micro is that the first allocates memory statically for all tensors, and in the case of the second dynamically? Or am I wrong, and tflite-micro in your experiments also uses dynamic allocation type?

Jan 12 '22 11:01 BalyshevArtem

Model              | tflite-micro |  luci-micro   |
-------------------------------------------------------
ad_small_int8      |    182721us  |   2628189us
-------------------------------------------------------
kws_micronet_s     |     97284us  |   1771989us 
-------------------------------------------------------
speech_recognition |      3409us  |     13723us 
-------------------------------------------------------
It seems that reference kernels have been used for some operations in luci-micro.
Сould it be that one of the reasons for such a difference in speed between tflite-micro and luci-micro is that the first allocates memory statically for all tensors, and in the case of the second dynamically? Or am I wrong, and tflite-micro in your experiments also uses dynamic allocation type?

I have done some benchmarks and tracing - the problem (as I said above) is calling reference kernels instead optimized for following operations: Conv2D DepthwiseConv2D AveragePool2D Reshape I will make a pull request with tracing utility compatible with arm soon - so you can debug this

Jan 20 '22 10:01 SlavikMIPT

I made a mistake - in the last test I did not notice that reference kernels were used (PAL has been changed after rebase )

I made a comparison on the kws_micronet_s again with correct PAL:

tflite-micro

arm_convolve_s8 		31936us
arm_depthwise_conv_wrapper_s8 	 3558us
arm_convolve_1x1_s8_fast 	 9059us
arm_depthwise_conv_wrapper_s8 	 4604us
arm_convolve_1x1_s8_fast 	 7608us
arm_depthwise_conv_wrapper_s8 	 3471us
arm_convolve_1x1_s8_fast 	 6799us
arm_depthwise_conv_wrapper_s8 	 3468us
arm_convolve_1x1_s8_fast 	 6800us
arm_depthwise_conv_wrapper_s8 	 3465us
arm_convolve_1x1_s8_fast 	15851us
arm_depthwise_conv_wrapper_s8 	  614us
arm_convolve_1x1_s8_fast           71us


luci-micro

arm_convolve_s8 		323831us
arm_depthwise_conv_wrapper_s8 	 15876us
arm_convolve_1x1_s8_fast 	258151us
arm_depthwise_conv_wrapper_s8 	 20417us
arm_convolve_1x1_s8_fast 	256459us
arm_depthwise_conv_wrapper_s8 	 15298us
arm_convolve_1x1_s8_fast 	193568us
arm_depthwise_conv_wrapper_s8 	 15298us
arm_convolve_1x1_s8_fast 	193585us
arm_depthwise_conv_wrapper_s8 	 15288us
arm_convolve_1x1_s8_fast 	451642us
arm_avgpool_s8 			  4294us
arm_convolve_1x1_s8_fast 	   723us

The execution time of the same functions from the CMSIS-NN library is different - so I think the problem is different definitions used during compilation

Feb 03 '22 12:02 SlavikMIPT

I have compiled with correct flags, so now results are:

Model              | tflite-micro |  luci-micro   |
-------------------------------------------------------
kws_micronet_s     |     97284us  |   92275us 
-------------------------------------------------------
speech_recognition |      3409us  |     2701us 
-------------------------------------------------------

Feb 17 '22 07:02 SlavikMIPT

@SlavikMIPT @binarman @BalyshevArtem Could you please help to measure the result like https://github.com/Samsung/ONE/issues/7598#issuecomment-1042640533 for new arrived internal models ?

We tried to do it by ourselves, but we did not succeed to compile it like https://github.com/Samsung/ONE/issues/8754

Mar 30 '22 01:03 chunseoklee

@chunseoklee Can we use python notebooks with the internal models which @lemmaa provided with me yesterday or you already have converted to tf-micro models? If so, could share them in some internal repo?

Mar 30 '22 08:03 lucenticus

@chunseoklee Can we use python notebooks with the internal models which @lemmaa provided with me yesterday or you already have converted to tf-micro models? If so, could share them in some internal repo?

Okay, I will after @lemmaa 's approval.

Note that I have converted as follows after model.fit :

# Convert the model.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()

# Save the model.
with open('model.tflite', 'wb') as f:
  f.write(tflite_model)

Mar 30 '22 09:03 chunseoklee

If so, could share them in some internal repo?

@lucenticus , PTAL RS7-RuntimeNTools/model_zoo_da for tflite models converted by @chunseoklee . :)

Mar 30 '22 15:03 lemmaa

@lucenticus Note that the model by https://github.com/Samsung/ONE/issues/7598#issuecomment-1082816468 might have non -static input shape.

Mar 31 '22 05:03 chunseoklee