Core ML perf regression observed on non-genai models
🐛 Describe the bug
We are exercising the newly launched benchmarking infra with the models enabled in-tree under examples/models. You can follow the instructions to extract and inspect the benchmark_results.json. (Connecting the results to a dashboard is still WIP.)
The preliminary data points collected from the benchmarking infra can be found here: https://github.com/pytorch/executorch/tree/main/extension/benchmark#preliminary-benchmark-results. They show that some previously enabled models, e.g. MobileNetV2/3 and InceptionV3/4, are running slower than expected with the Core ML delegate.
The models are exported using the coreml export scripts (code pointer), and the app that loads and runs the exported model is located under extension/apple/benchmark.
Versions
From latest executorch main
cc: @cbilgin @kimishpatel @shoumikhin
Something weird is going on; the numbers are too high. Investigating.
Here is the profiling output for mv3 running via the executor_runner binary:
...
| 186 | Execute | DELEGATE_CALL | 86.8285 | 86.8285 | 86.8285 | 86.8285 | 86.8285 | 86.8285 | ['aten.convolution.default', 'aten._native_batch_norm_legit_no_training.default' ... 'aten.addmm.default'] (317 total) | False | CoreMLBackend | nan | nan | nan | nan |
| 187 | Execute | Method::execute | 86.8298 | 86.8298 | 86.8298 | 86.8298 | 86.8298 | 86.8298 | [] | False | | nan | nan | nan | nan |
The avg of Method::execute is 86.829 ms, far from the numbers we get from the benchmark app. @shoumikhin Is this the same as how test_forward is measured in the app?
Did you see any graph breaks?
@YifanShenSZ There are no graph breaks.
@guangy10 the actual number is close to 1 ms. The 86.829 ms you are seeing is because the delegate execution time reported here is the total time taken by the execute call. When profiling, there is some cost associated with collecting the profile data and writing it out.
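As a rough back-of-envelope check (assuming the overhead is spread more or less evenly across the profiled ops): 86.83 ms reported minus ~1 ms of actual execution leaves roughly 85.8 ms of profiling cost over the 317 delegated ops in the DELEGATE_CALL row, i.e. on the order of 0.27 ms per op.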
@Olivia-liu Is there a way for a delegate to log the time directly instead of relying on the execute duration?
@shoumikhin are the numbers still low?
@cymbalrush @guangy10
Not exactly logging the time directly, but this should help remove most of the overhead:
I found EventTracerProfilingLevel::kProfileMethodOnly https://github.com/pytorch/executorch/blob/1052e3b17ee0750c89ae78dd60c61dd6b54ebb89/runtime/core/event_tracer.h#L77
So basically it's a known issue that the overhead of per-op profiling can be high. This option disables per-op profiling and only profiles the important events, such as the overall inference execution time.
In this case, since @guangy10 is using the executor_runner: in the runner implementation, find the line ETDumpGen etdump_gen; and add etdump_gen.set_event_tracer_profiling_level(EventTracerProfilingLevel::kProfileMethodOnly);, as sketched below.
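A minimal sketch of that change, assuming an ETDump-enabled build of the runner (the include paths and namespaces below are assumptions and may need adjusting to your ExecuTorch revision):

```cpp
// Excerpt from the executor_runner setup; only the
// set_event_tracer_profiling_level() call is the addition suggested above.
// NOTE: include paths and namespaces are assumptions and vary by revision.
#include <executorch/devtools/etdump/etdump_flatcc.h>
#include <executorch/runtime/core/event_tracer.h>

using torch::executor::ETDumpGen;
using torch::executor::EventTracerProfilingLevel;

ETDumpGen etdump_gen;
// Only profile method-level events (e.g. Method::execute); skip per-op
// profiling so its overhead no longer inflates the DELEGATE_CALL duration.
etdump_gen.set_event_tracer_profiling_level(
    EventTracerProfilingLevel::kProfileMethodOnly);

// ... etdump_gen is then passed as the event tracer when loading the method,
// as the runner already does today.
```

With this level set, the ETDump should only record method-level events such as Method::execute, so the reported duration should land much closer to the actual delegate execution time.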
Thanks @Olivia-liu for the pointer! But is this something that the tooling can address? Otherwise this will keep bugging us :)
@cymbalrush Yes, will do. Is the request that, while per-op profiling stays enabled inside the delegate calls, the duration of the delegate call itself should be measured or calculated in such a way that the profiling overhead is removed?
Hmm @cymbalrush @guangy10 something seems off here. Can you share the etdump here? Profiling has overhead but it shouldn't be that much.
The root cause was how the numbers are reported. We have re-calibrated the new iOS benchmark app and verified the perf numbers on-device. The numbers are updated here: https://github.com/pytorch/executorch/tree/main/extension/benchmark#preliminary-benchmark-results.
For the more prominent overhead (small models like resnet18), it seems like the devtools delegate execute time is 0.98 ms, while the trace shows 0.358 ms. Any chance you know the overhead of the Core ML layer? Since the delegate call is pretty much a one-line call, the overhead ideally shouldn't be that much.
The issue reported in the title, i.e. the 86 ms overhead, looks like a benchmark infra issue. We may file a separate issue to track the delegate profiling overhead.
Should we close this one?
Thanks @cccclai for flagging that this can be closed. I am going to close it for now.