Core ML perf regression observed on non-genai models
🐛 Describe the bug
We are exercising the newly launched benchmarking infra with the models enabled in-tree under examples/models. You can follow the instructions to extract and inspect the benchmark_results.json. (Connecting the results to a dashboard is still WIP.)
The preliminary data points collected from the benchmarking infra can be found here: https://github.com/pytorch/executorch/tree/main/extension/benchmark#preliminary-benchmark-results. They show that some previously enabled models, e.g. MobileNetV2/3 and InceptionV3/4, are running slower than expected with the Core ML delegate.
The models are exported using the coreml export scripts (code pointer), and the app that loads and runs the exported model is located under extension/apple/benchmark.
Versions
From latest executorch main
cc: @cbilgin @kimishpatel @shoumikhin
Something weird is going on; the numbers are too high. Investigating.
Here is the profiling output for mv3 running via the executor_runner binary:
...
| 186 | Execute | DELEGATE_CALL | 86.8285 | 86.8285 | 86.8285 | 86.8285 | 86.8285 | 86.8285 | ['aten.convolution.default', 'aten._native_batch_norm_legit_no_training.default' ... 'aten.addmm.default'] (317 total) | False | CoreMLBackend | nan | nan | nan | nan |
| 187 | Execute | Method::execute | 86.8298 | 86.8298 | 86.8298 | 86.8298 | 86.8298 | 86.8298 | [] | False | | nan | nan | nan | nan |
The avg of Method::execute is 86.829 ms, far from the numbers we get from the benchmark app. @shoumikhin Is this the same as how test_forward is measured in the app?
Did you see any graph breaks?
@YifanShenSZ There are no graph breaks.
@guangy10 the actual number is close to 1 ms. The 86.829 ms you are seeing is because the delegate execution time reported here is the total time taken by the execute call. When profiling, there is some cost associated with collecting the profile data and writing it out.
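As a rough back-of-envelope check (assuming the overhead is spread more or less evenly across the profiled ops): 86.83 ms reported minus ~1 ms of actual execution leaves roughly 85.8 ms of profiling cost over the 317 delegated ops in the DELEGATE_CALL row, i.e. on the order of 0.27 ms per op.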
@Olivia-liu Is there a way for a delegate to log the time directly instead of relying on the execute duration?
@shoumikhin are the numbers still low?
@cymbalrush @guangy10
Not exactly logging the time directly, but this should help remove most of the overhead:
I found EventTracerProfilingLevel::kProfileMethodOnly https://github.com/pytorch/executorch/blob/1052e3b17ee0750c89ae78dd60c61dd6b54ebb89/runtime/core/event_tracer.h#L77
So basically it's a known issue that the overhead of per-op profiling can be high. This option disables per-op profiling and only profiles the important events, such as the overall inference execution time.
In this case, since @guangy10 is using the executor_runner: in the runner implementation, find the line ETDumpGen etdump_gen; and add etdump_gen.set_event_tracer_profiling_level(EventTracerProfilingLevel::kProfileMethodOnly);, as sketched below.
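A minimal sketch of that change, assuming an ETDump-enabled build of the runner (the include paths and namespaces below are assumptions and may need adjusting to your ExecuTorch revision):

```cpp
// Excerpt from the executor_runner setup; only the
// set_event_tracer_profiling_level() call is the addition suggested above.
// NOTE: include paths and namespaces are assumptions and vary by revision.
#include <executorch/devtools/etdump/etdump_flatcc.h>
#include <executorch/runtime/core/event_tracer.h>

using torch::executor::ETDumpGen;
using torch::executor::EventTracerProfilingLevel;

ETDumpGen etdump_gen;
// Only profile method-level events (e.g. Method::execute); skip per-op
// profiling so its overhead no longer inflates the DELEGATE_CALL duration.
etdump_gen.set_event_tracer_profiling_level(
    EventTracerProfilingLevel::kProfileMethodOnly);

// ... etdump_gen is then passed as the event tracer when loading the method,
// as the runner already does today.
```

With this level set, the ETDump should only record method-level events such as Method::execute, so the reported duration should land much closer to the actual delegate execution time.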
Thanks @Olivia-liu for the pointer! But is this something that the tooling can address? Otherwise this will keep bugging us :)
@cymbalrush Yes, will do. Is the request that, while per-op profiling stays enabled inside the delegate calls, the duration of the delegate call itself should be measured or calculated in such a way that the profiling overhead is removed?
Hmm @cymbalrush @guangy10 something seems off here. Can you share the etdump here? Profiling has overhead but it shouldn't be that much.
The root cause was how the numbers are reported. We have re-calibrated the new iOS benchmark app and verified the perf numbers on-device. The numbers are updated here: https://github.com/pytorch/executorch/tree/main/extension/benchmark#preliminary-benchmark-results.
For the more prominent overhead (small models like resnet18), it seems like the devtools delegate execute time is 0.98 ms, while the trace shows 0.358 ms. Any chance you know the overhead of the Core ML layer? Since the delegate call is pretty much a one-line call, the overhead ideally shouldn't be that much.
The issue reported in the title, i.e. the 86 ms overhead, looks like a benchmark infra issue. We may file a separate issue to track the delegate profiling overhead.
Should we close this one?
Thanks @cccclai for flagging that this can be closed. I am going to close it for now.