[Ray Data] [stable diffusion batch inference] CPU resources in the cluster cannot be fully utilized when running a stable diffusion batch inference task.
What happened + What you expected to happen
Hi, I want to use the cluster's CPU resources to run the stable diffusion inference demo. I do not have GPUs. I assumed that through the Ray framework, the CPUs could also be used to execute inference tasks.
I set up two WSL instances as a Ray cluster: WSL A has 12 CPUs and acts as the head node, and WSL B has 12 CPUs and acts as the worker node. Running the 'ray status' command shows:

======== Autoscaler status: 2024-03-22 00:59:14.244899 ========
Node status
Active:
 1 node_88349db0fa0ccd3086db2f5a4c79ab9a527acb4aca4c023cb8120c8b
 1 node_5cb133607c13b47fa48631b86114996f49a7ced083a5bcbeafbc20b8
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
Usage:
 0.0/24.0 CPU
 0B/43.54GiB memory
 0B/21.04GiB object_store_memory

Demands:
 (no resource demands)
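For context, the two nodes were joined roughly like this (the port and head IP below are placeholders, not the exact values I used):

# On WSL A (head node, 12 CPUs)
ray start --head --port=6379

# On WSL B (worker node, 12 CPUs), pointing at the head node
ray start --address='<head-node-ip>:6379'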
Then I ran the stable diffusion batch inference demo and set the pipe and device parameters to 'cpu', as the script below shows. I also set num_cpus=16, expecting the Ray cluster to use 16 of the 24 CPUs to run the task. However, it raised the following error:
(autoscaler +6s) Error: No available node types can fulfill resource request {'CPU': 16.0}. Add suitable node types to this cluster to resolve this issue.
Only when I set num_cpus <= 12 (WSL A's total CPU count) does it work, and then only one of the two nodes executes the task.
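As a sanity check (this snippet is just how I inspected the cluster, not part of the repro), the cluster reports 24 CPUs in total, but each individual node only has 12, which seems to be why a single request for 16 CPUs cannot be placed:

import ray

ray.init(address="auto")

print(ray.cluster_resources())  # total: 24.0 CPU across both nodes
for node in ray.nodes():        # per node: only 12 CPUs each
    print(node["NodeManagerAddress"], node["Resources"].get("CPU"))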
The documentation says that num_cpus is the number of CPUs to reserve for each parallel map worker, and that concurrency is the number of Ray workers to use concurrently. So I tried setting concurrency=2 and num_cpus=8, thinking that 2*8=16 CPUs might work. However, the error occurred again during inference.
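Concretely, that second attempt only changed the map_batches arguments; everything else is identical to the reproduction script below:

# Second attempt: 2 workers x 8 CPUs each, hoping to use 16 CPUs in total
preds = ds.map_batches(
    PredictCallable,
    fn_constructor_kwargs=dict(model_id=model_id),
    concurrency=2,
    num_cpus=8,
    batch_size=1,
    batch_format='pandas',
)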
So my question is: how can I make full use of the CPU resources in the cluster to execute a single inference task?
Versions / Dependencies
Ray 2.9.3, Python 3.10.12, WSL2
Reproduction script
model_id = "stabilityai/stable-diffusion-2-1"
prompt = "a photo of an astronaut riding a horse on mars"

import ray
import ray.data
import pandas as pd

ds = ray.data.from_pandas(pd.DataFrame([prompt], columns=['prompt']))

class PredictCallable:
    def __init__(self, model_id: str, revision: str = None):
        from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler
        import torch

        self.pipe = StableDiffusionPipeline.from_pretrained(
            model_id, torch_dtype=torch.float
        )
        self.pipe.scheduler = DPMSolverMultistepScheduler.from_config(
            self.pipe.scheduler.config
        )
        self.pipe = self.pipe.to("cpu")  # run the pipeline on CPU

    def __call__(self, batch: pd.DataFrame) -> pd.DataFrame:
        import torch
        import numpy as np

        # Set a different seed for every image in batch
        self.pipe.generator = [
            torch.Generator(device="cpu").manual_seed(i) for i in range(len(batch))
        ]
        images = self.pipe(list(batch["prompt"])).images
        return {"images": np.array(images, dtype=object)}

preds = ds.map_batches(
    PredictCallable,
    fn_constructor_kwargs=dict(model_id=model_id),
    concurrency=1,
    num_cpus=16,
    batch_size=1,
    batch_format='pandas',
)
results = preds.take_all()
Issue Severity
High: It blocks me from completing my task.