[CI] add a big GPU marker to run memory-intensive tests separately on CI

Open sayakpaul opened this issue 1 year ago • 6 comments

What does this PR do?

I have only applied the new marker to a handful of tests so far. I think we may need to change the slices based on the CI machine and infra. @a-r-r-o-w, should we consider marking the Cog tests similarly as well?

@DN6 would love to get your thoughts on the design.
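For reference, the general shape of such a marker can be sketched as below. This is a minimal illustration using only `unittest` and an optional `torch` import, not the code in this PR: the names `BIG_GPU_MEMORY_GB`, `detect_gpu_memory_gb`, and `require_big_gpu`, as well as the 40 GB default, are assumptions made up for the example.

```python
import os
import unittest

# Hypothetical threshold in GB; the variable name and the default of 40 are
# assumptions made for this sketch, not values taken from the PR.
BIG_GPU_MEMORY_GB = int(os.getenv("BIG_GPU_MEMORY_GB", "40"))


def detect_gpu_memory_gb():
    """Return the total memory of CUDA device 0 in GB, or 0.0 without CUDA."""
    try:
        import torch
    except ImportError:
        return 0.0
    if not torch.cuda.is_available():
        return 0.0
    return torch.cuda.get_device_properties(0).total_memory / 1024**3


def require_big_gpu(min_gb=BIG_GPU_MEMORY_GB, probe=detect_gpu_memory_gb):
    """Skip the decorated test unless the GPU has at least `min_gb` GB."""
    available = probe()
    return unittest.skipUnless(
        available >= min_gb,
        f"needs a GPU with >= {min_gb} GB, found {available:.1f} GB",
    )


class MemoryIntensiveTests(unittest.TestCase):
    @require_big_gpu()
    def test_large_pipeline(self):
        # Placeholder body; a real test would load and run a large pipeline.
        self.assertTrue(True)
```

With pytest, the same idea is usually paired with a custom marker so that a dedicated CI job can select only these tests with `-m`, while the regular jobs skip them on smaller machines.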

Oct 16 '24 07:10 sayakpaul

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Oct 16 '24 07:10 HuggingFaceDocBuilderDev

> should consider marking the Cog tests similarly as well?

With model CPU offload and VAE tiling, it should be < 16 GB, and I think we documented it here. Are we seeing Cog test failures due to memory? I see that they are passing here.
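As a rough illustration of the two memory savers mentioned above, the sketch below wraps the calls in a small helper. `enable_model_cpu_offload()` and `vae.enable_tiling()` are existing diffusers pipeline methods; the helper name `apply_memory_savers` is hypothetical, and the checkpoint in the usage comment is only an example, not necessarily the one the Cog tests use.

```python
def apply_memory_savers(pipe):
    """Enable model CPU offload and VAE tiling on a diffusers-style pipeline."""
    # Keeps sub-models on the CPU and moves each to the GPU only while it runs.
    pipe.enable_model_cpu_offload()
    # Decodes latents tile by tile to lower peak VAE memory.
    pipe.vae.enable_tiling()
    return pipe


# Usage (requires diffusers, torch, a CUDA GPU, and a model download):
# import torch
# from diffusers import CogVideoXPipeline
# pipe = CogVideoXPipeline.from_pretrained(
#     "THUDM/CogVideoX-2b", torch_dtype=torch.float16
# )
# pipe = apply_memory_savers(pipe)
```

The actual peak memory still depends on resolution, frame count, and dtype, so the < 16 GB figure should be checked on the CI machine rather than assumed.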

Oct 16 '24 11:10 a-r-r-o-w

Ah okay then. No issues.

Oct 16 '24 11:10 sayakpaul

@DN6 is it okay if I modify the failing tests to account for the machine change?

Oct 16 '24 13:10 sayakpaul

@DN6 can you give this a look? I think the test failures should go away once the CI Bot has access to Flux.

Once approved, I will revert the changes that I have marked as temporary (like this).

Oct 17 '24 10:10 sayakpaul

@DN6 regarding https://github.com/huggingface/diffusers/actions/runs/11398910357/job/31716739483?pr=9691#step:7:67, my hunch is that there's some kind of leakage happening which is causing the worker to crash. When I SSH'd into the runner and manually ran the test, it passed.

Oct 18 '24 07:10 sayakpaul

In a follow-up I will introduce the quantization tests.

Oct 31 '24 13:10 sayakpaul