
Running FLUX inference twice gets slower on second run.

Open · louen opened this issue 7 months ago · 1 comment

I have observed cases where running a large image-generation model such as FLUX yields counter-intuitive behavior: when running the same inference twice, the second run is much slower than the first, even though the model is already loaded in memory.

Using a slightly modified version of MFLUX (https://github.com/louen/mflux/tree/val/timings) that reports per-step timings, I get the following typical results:

First run:
Loading time: 0.027 s
Step time 1/4: 15.44 s
Step time 2/4: 13.13 s
Step time 3/4: 12.97 s
Step time 4/4: 12.95 s
Decode time: 0.001 s
Total time: 54.53 s

Second run:
Loading time: 0.009 s
Step time 1/4: 58.97 s
Step time 2/4: 13.17 s
Step time 3/4: 12.89 s
Step time 4/4: 17.07 s
Decode time: 0.010 s
Total time: 102.13 s

Third run:
Loading time: 0.030 s
Step time 1/4: 75.19 s
Step time 2/4: 15.67 s
Step time 3/4: 13.39 s
Step time 4/4: 16.48 s
Decode time: 0.009 s
Total time: 120.78 s

These timings were obtained by running mflux with multiple seeds, e.g. mflux-generate -m schnell --prompt "a cute cat" --seed 0 1 2 3 4. Notice how the first step takes about 15 s in the first run but 60 to 75 s in the second and third.
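For reference, per-step timings like the ones above can be collected with a simple wrapper around the denoising loop. This is only a minimal sketch, not the actual MFLUX patch; `step_fn` is a hypothetical stand-in for one denoising iteration (in real MLX code the loop body would also need to call `mx.eval()` so the lazy computation is forced inside the timed region):

```python
import time

def timed_steps(step_fn, num_steps):
    """Run `step_fn` once per step and record wall-clock time for each.

    `step_fn` is a placeholder for one denoising iteration; with MLX,
    mx.eval() would have to run inside the loop, otherwise the lazy
    graph is only evaluated later and the timings are meaningless.
    """
    timings = []
    for i in range(num_steps):
        start = time.perf_counter()
        step_fn(i)
        timings.append(time.perf_counter() - start)
        print(f"Step time {i + 1}/{num_steps}: {timings[-1]:.3f} s")
    return timings

# Example with a dummy workload standing in for a denoising step:
times = timed_steps(lambda i: sum(range(100_000)), 4)
print(f"Total time: {sum(times):.3f} s")
```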

This is running on the following hardware configuration:

Model Name: Mac Studio
Model Identifier: Mac13,2
Chip: Apple M1 Ultra
Total Number of Cores: 20 (16 performance and 4 efficiency)
Memory: 64 GB

with Python 3.12 and mlx 0.26.1.

From our investigation with @awni, we have uncovered the following:

  • This appears to be triggered reliably only on 64 GB machines. On 32 GB, running FLUX is possible but very slow (the model weights alone are about 31 GB), and on 128 GB the issue does not appear.
  • Similarly, the issue seems easier to trigger on older Apple Silicon (M1 generation); faster processors somehow avoid it.
  • The issue is not 100% reproducible, and is possibly only triggered when other apps are running on the system.
  • Reducing the cache limit to a smaller amount (e.g. 1 GB) avoids the issue.
  • Setting the wired limit to ~40 GB also avoids the issue.
  • Increasing the default working set size does not seem to have an effect.
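The cache and wired limits mentioned above can be adjusted programmatically. A hedged sketch, assuming the `mlx.core.set_cache_limit` and `mlx.core.set_wired_limit` setters available in recent MLX releases; the values are the ones from the observations above, not recommended defaults:

```python
# Guard the import so the byte arithmetic below also runs on
# machines without MLX installed.
try:
    import mlx.core as mx
except ImportError:
    mx = None

def gib(n):
    """Convert GiB to bytes, the unit MLX's limit setters expect."""
    return n * 1024 ** 3

if mx is not None:
    # Shrink the buffer cache so freed buffers are returned to the OS
    # sooner (the 1 GB value that avoided the issue in testing).
    mx.set_cache_limit(gib(1))
    # Alternatively, wire ~40 GB so model memory cannot be swapped out
    # (assumes a recent macOS that supports wired Metal memory):
    # mx.set_wired_limit(gib(40))
```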

This circumstantial evidence, along with readings from Activity Monitor, suggests that FLUX's memory footprint trips the OS swap mechanism, leading to this counter-intuitive result.

This might warrant more investigation into MLX's default memory and cache limits and into what triggers a cache cleanup.

louen commented Jul 09 '25

The slowdown happens because macOS's unified memory system begins swapping parts of the FLUX model's memory to disk after the first run on 64 GB machines. The model (~31 GB) plus runtime and OS overhead push total usage near the physical limit, so macOS evicts MLX's cached GPU tensors; the second run then has to reload the swapped-out pages from disk, which sharply increases latency. The issue mainly affects 64 GB M1 systems, where the working set barely fits: 32 GB systems are consistently slow, and 128 GB systems avoid it entirely. A fix could involve tuning MLX's cache and wired memory limits, or adding a mechanism to keep model weights pinned in memory.
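When reproducing this, MLX's own memory counters can help confirm whether the active set plus cache is approaching physical RAM before the second run. A minimal sketch, assuming the `mlx.core.get_active_memory`, `get_cache_memory`, and `get_peak_memory` accessors; the import is guarded so the formatting helper runs anywhere:

```python
try:
    import mlx.core as mx
except ImportError:
    mx = None

def fmt_gib(nbytes):
    """Format a byte count as GiB for readable logging."""
    return f"{nbytes / 1024 ** 3:.2f} GiB"

def report_memory():
    # Print MLX's view of memory between runs; if active + cache
    # approaches physical RAM, the OS may start swapping.
    if mx is None:
        return
    print("active:", fmt_gib(mx.get_active_memory()))
    print("cache: ", fmt_gib(mx.get_cache_memory()))
    print("peak:  ", fmt_gib(mx.get_peak_memory()))

report_memory()
```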

raghav-567 commented Oct 28 '25