
Strange issue: Converting RealESRGAN to CoreML model, ML Program format performs significantly slower than Neural Network format

Open itcook opened this issue 8 months ago • 1 comment

Hi everyone,

I'm currently learning about machine learning in the Apple ecosystem. I tried converting the RealESRGAN super-resolution model to a CoreML model and found that the ML Program format performs much slower than the Neural Network format, especially with load times being over 10 times longer.

[Screenshots: Core ML performance reports comparing the two model formats]

In theory, the newer ML Program format should perform better, but that's not the case here. I'm not sure where the issue lies. Here's the script I used for the conversion; I'd really appreciate it if someone could share any insights.

Additionally, after setting the input tensor's width and height to fixed dimensions, the model's tracing time became extremely long, whereas previously, using RangeDim to set flexible shapes didn't exhibit this issue (the sample inputs were almost identical). This also puzzles me...

Furthermore, does a model with fixed-shape inputs generally perform better than one with flexible shapes? My understanding is that, in terms of performance: fixed-shape input > enumerated fixed-shapes input > flexible-shape input. Is this correct?

itcook avatar Jun 17 '25 00:06 itcook

Hi @itcook --

The newer ML Program format should generally perform better than the older Neural Network format. One thing that may be contributing to the high load times is that the runtime for ML Program models does a lot of work on the first load, which is then cached for future runs. It's hard to tell from the screenshots whether that's what's happening here, but the model traces in an Instruments profile might have more details ("Open in Instruments" at the top right of the performance view).

> Additionally, after setting the input tensor's width and height to fixed dimensions, the model's tracing time became extremely long, whereas previously, using RangeDim to set flexible shapes didn't exhibit this issue (the sample inputs were almost identical). This also puzzles me...

Are you seeing this additional latency in torch.jit.trace or in ct.convert? In general, a statically shaped model can be further optimized compared to a dynamically shaped one, so there may be additional time spent running those optimizations.

> Furthermore, does a model with fixed-shape inputs generally perform better than one with flexible shapes? My understanding is that, in terms of performance: fixed-shape input > enumerated fixed-shapes input > flexible-shape input. Is this correct?

Yep, this is correct!

nikalra avatar Jun 17 '25 22:06 nikalra