[bug]: MPS Memory Usage on Invoke 2.3.x
Is there an existing issue for this?
- [X] I have searched the existing issues
OS
macOS
GPU
mps
VRAM
64GB
What version did you experience this issue on?
2.3.5
What happened?
I've collected extensive data across several combinations of Invoke, PyTorch, and OS versions demonstrating that there is a major issue with memory use in Invoke 2.3.x on MPS.
I am using a 16" M2 Max MBP with 64GB unified memory, currently running macOS 13.4b4. I am unable to generate at more than 1024x1024 using Invoke 2.3.5, and even that requires manually updating to PyTorch 2. With the default PyTorch 1.13.1, 2.3.5 starts out at 512x512 using a similar amount of memory to Invoke 2.2.5 on the same PyTorch version, but it quickly eats up more memory as the resolution increases. This indicates to me that there is an issue with the new inference pipeline, possibly due to diffusers (though I have no proof that that component is the cause).
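If it helps with reproduction, the PyTorch version and MPS availability inside the Invoke venv can be confirmed with a quick snippet like this (just a convenience check, not part of Invoke itself):

```python
# Convenience check of which PyTorch build the Invoke venv is actually using.
import torch

print(torch.__version__)                  # e.g. 1.13.1 (default) vs 2.0.x (manual upgrade)
print(torch.backends.mps.is_built())      # PyTorch compiled with MPS support
print(torch.backends.mps.is_available())  # MPS backend usable at runtime
```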
I generated the following data by starting a fresh session for each generation and using the same prompt, seed, scheduler (DDIM), number of iterations (50), and weights (vanilla Stable Diffusion 1.5). Except for the first section of the table, the memory stats were collected by polling Python's memory usage at regular intervals with the following one-liner:
`footprint --noCategories --sample 0.1 -f bytes -p <PID> | awk '/<PNAME>/ {print $5; fflush()}' >> /<pathtofile>/memout.txt`
where `<PID>` is the process ID of Python and `<PNAME>` is the process name, or a representative portion thereof, as displayed in Activity Monitor (Invoke uses "Python" while A1111 uses "python3.10" for me).
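If the macOS footprint tool isn't available, roughly the same polling can be done from Python with psutil -- note that psutil reports RSS rather than the phys_footprint metric that footprint/Activity Monitor use, so the absolute numbers won't match the table exactly. A minimal sketch (the output path and sample interval are just placeholders matching my one-liner):

```python
# Minimal cross-check of the footprint one-liner using psutil (RSS, not
# phys_footprint, so absolute numbers will differ somewhat from the table).
import sys
import time

import psutil  # pip install psutil

proc = psutil.Process(int(sys.argv[1]))   # PID of the Invoke Python process

with open("memout.txt", "a") as f:
    while proc.is_running():
        f.write(f"{time.time():.2f}\t{proc.memory_info().rss}\n")  # bytes
        f.flush()
        time.sleep(0.1)                   # match footprint's 0.1 s sample rate
```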
The first data section is the oldest; it was generated by watching Activity Monitor's reported memory use for Python and manually recording the largest number displayed. That section is therefore probably the least accurate, but it can be treated as a lower bound.
Note that in a number of configurations, Python shows a distinct spike in memory usage after the 50th iteration, when the latent representation is decoded into pixel space, so I have recorded this separately. Sometimes there is no distinct increase in memory use -- I have no idea why the spike is absent in those cases, but its absence is consistent for a given configuration -- and so some generations have "--" in the Decode cell.
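For context on where that spike likely comes from: in a stock diffusers pipeline the 50 denoising steps operate on a small latent tensor (4 x H/8 x W/8), and only the final VAE decode allocates full-resolution pixel tensors. A rough sketch, assuming the standard StableDiffusionPipeline and the usual SD 1.5 weights (Invoke wraps its own pipeline, which may differ, and the prompt here is just a placeholder):

```python
# Rough sketch only -- plain diffusers, not Invoke's internal pipeline.
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # vanilla SD 1.5, as in my tests
    torch_dtype=torch.float32,
)
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("mps")

# The 50 scheduler steps work on a 4 x 128 x 128 latent (for 1024x1024);
# the vae.decode() at the very end of this call expands it to full-resolution
# pixels, which is where the post-iteration memory spike shows up.
image = pipe(
    "placeholder prompt",
    height=1024,
    width=1024,
    num_inference_steps=50,
).images[0]
```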
I also noted the reported iteration speed and total generation time, the memory retained by Python after the first generation finished, the average and standard deviation of Python's initial memory use before generation (for some reason this varies from one startup to another), and other observations as relevant (crashes, non-crashing errors, image corruption, etc.).
I've also included some data from @brkirch's fork of A1111 to show what is possible. Note that this fork appears to run at least partially in fp16. I ran a single generation with the `--no-half` argument to provide more of an apples-to-apples comparison with Invoke, and you'll see that A1111 is still better than Invoke in both memory use and speed. I will fill out this section more completely if time permits and if it's found useful. Note that A1111, whether in fp16 or fp32, uses a different seeding/noise-generation method than Invoke, so the result is always different even though I used the same seed. It is also possible that some of the Invoke prompt syntax is incompatible with A1111; I haven't made an effort to fix this yet.
This is the original spreadsheet, if it is helpful: Stable Diffusion benchmarks.xlsx
@gogurtenjoyer @hipsterusername
Screenshots
All data:
Summary graph of the maximum memory usage observed for each SD configuration vs image resolution. Note that 2.2.5 and A1111 appear roughly linear while 2.3.5 looks to be quadratic:
Example graphs of Invoke memory use over time. Note that you can see repeating features corresponding to each iteration of the generation, plus the spike of memory use during decoding. Some generations on certain configurations do not have a distinct spike.
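Regarding the quadratic-looking 2.3.5 curve in the summary graph: one speculative back-of-the-envelope calculation (my guess only, not verified against the Invoke or diffusers code) is that if the UNet's self-attention runs without attention slicing, the attention score matrix alone grows quadratically with the number of latent pixels:

```python
# Speculative arithmetic only -- not confirmed against what 2.3.5 actually does.
# If self-attention runs un-sliced at the highest-resolution UNet block, the
# score matrix holds (H/8 * W/8)^2 entries per head.
def attention_scores_bytes(height, width, heads=8, bytes_per_element=4):
    tokens = (height // 8) * (width // 8)   # latent spatial positions
    return heads * tokens * tokens * bytes_per_element

for side in (512, 768, 1024):
    gib = attention_scores_bytes(side, side) / 2**30
    print(f"{side}x{side}: ~{gib:.1f} GiB for one full set of attention scores")
```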
Additional context
Generations on macOS 13.2 were severely broken due to some sort of OS-level issue, causing crashes and "gray blob" output at higher resolutions. This has all been fixed in 13.3 and later.
My term "gray blob" refers to the appearance of these generations, most clear when you are doing img2img and so know what the output should look like. The upper section of the image becomes blurry and desaturated, losing all high-resolution detail but still retaining the general low-resolution features of the input image. The lower section of the image becomes a gray field with a slight noise texture to it. As resolution is increased, the dividing line between the two sections moves from the bottom to the top of the image and, in the resolution one tick higher from the one where the image is completely gray, it suddenly becomes coherent again. Weird stuff.
There is an incompatibility between Invoke 2.2.5 and PyTorch 2 that causes all schedulers except DDIM and PLMS to output only noise. I had only been using DDIM, so I didn't notice at first. I don't know enough, or have enough time, to try to identify the cause.
Contact Details
Adreitz on Discord, [email protected]