Xiang Zhang
thanks! The official diffusers documentation loads LCMScheduler this way. Can we have an extra arg like `device` so it initializes on the passed-in device? Just curious, if we...
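For context, diffusers schedulers are device-agnostic at construction time; the timesteps tensor gets a device later via `set_timesteps(..., device=...)`. The sketch below mimics that pattern with a minimal stand-in scheduler (`TinyScheduler` is hypothetical, not a diffusers class) so it runs without the library installed.

```python
import torch

# Hypothetical stand-in illustrating the diffusers pattern: the scheduler is
# created without a device, and timesteps are placed on a device afterwards.
class TinyScheduler:
    def __init__(self, num_train_timesteps=1000):
        self.num_train_timesteps = num_train_timesteps
        self.timesteps = None

    def set_timesteps(self, num_inference_steps, device=None):
        # Evenly spaced timesteps, created directly on the requested device,
        # so no host->device copy is needed inside the denoising loop.
        self.timesteps = torch.linspace(
            self.num_train_timesteps - 1, 0, num_inference_steps, device=device
        ).long()

device = "cuda" if torch.cuda.is_available() else "cpu"
sched = TinyScheduler()
sched.set_timesteps(4, device=device)
```

With the real `LCMScheduler`, the equivalent would be calling `scheduler.set_timesteps(num_inference_steps, device=...)` rather than passing a device at init.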
Hi @a-r-r-o-w, I saw your investigation, thanks for doing that!! I'll also try moving everything onto cpu/gpu on my end to see the performance gain.
> Hi @xiang9156. I shared my investigation in [this](https://github.com/huggingface/diffusers/pull/9475#issuecomment-2367852119) comment. Do your observations align with it? We're thinking about how to remove all cuda stream synchronization from the pipeline (which...
@a-r-r-o-w from tracing, I did see an extra sync in the controlnet forward and unet forward after I set the timestep tensor to CPU.
> I think it's better to maintain two copies of timesteps in the scheduler for this (one on cuda and one on cpu). To the controlnet/unet/transformer, you can pass the...
@a-r-r-o-w I think I know why my cudaMemcpyAsync takes a long time: it's still related to the underlying async CUDA execution. cudaMemcpyAsync synchronizes until the underlying async CUDA execution finishes....
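This matches the documented cudaMemcpyAsync behavior: the copy is stream-ordered, so it waits for earlier kernels on the stream, and a device-to-host copy into pageable host memory additionally blocks the calling thread. In PyTorch terms, a truly asynchronous device-to-host copy needs a pinned (page-locked) destination plus `non_blocking=True`; a minimal sketch, with a CPU-only fallback so it runs anywhere:

```python
import torch

if torch.cuda.is_available():
    src = torch.randn(1 << 20, device="cuda")
    # Pinned destination: the D2H copy can overlap with GPU work and the
    # host thread is not forced to wait inside the copy call itself.
    pinned = torch.empty(src.shape, device="cpu").pin_memory()
    pinned.copy_(src, non_blocking=True)  # returns immediately
    torch.cuda.synchronize()              # explicit sync before reading `pinned`
else:
    # CPU-only fallback so the sketch stays runnable without a GPU.
    src = torch.randn(1 << 20)
    pinned = src.clone()
```

The observed stall, then, is the copy waiting on earlier async kernels in the stream rather than the memcpy itself being slow.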