Xiang Zhang

Results: 7 comments by Xiang Zhang

Thanks! The official diffusers documentation loads LCMScheduler this way: ![image](https://github.com/user-attachments/assets/ea9f127b-c750-410f-869d-68a9f6e738e1) Could we have an extra argument such as `device` so that it initializes on the passed-in device? Just curious, if we...

Hi @a-r-r-o-w, I saw your investigation, thanks for doing that! I'll also try moving everything onto CPU/GPU on my end to see the performance gain.

> Hi @xiang9156. I shared my investigation in [this](https://github.com/huggingface/diffusers/pull/9475#issuecomment-2367852119) comment. Do your observations align with it? We're thinking about how to remove all cuda stream synchronization from the pipeline (which...

> > Hi @xiang9156. I shared my investigation in [this](https://github.com/huggingface/diffusers/pull/9475#issuecomment-2367852119) comment. Do your observations align with it? We're thinking about how to remove all cuda stream synchronization from the pipeline...

@a-r-r-o-w From tracing, I did see an extra sync in the controlnet forward and unet forward after I set the timestep tensor to CPU: ![image](https://github.com/user-attachments/assets/a6d6a663-fb3b-4a39-9e0d-df43e368f0f7) ![image](https://github.com/user-attachments/assets/7b68b04e-077d-4a61-9f7b-c74f5d1332e7)

> I think it's better to maintain two copies of timesteps in the scheduler for this (one on cuda and one on cpu). To the controlnet/unet/transformer, you can pass the...
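For readers following the thread, the two-copy idea in the quote above might look roughly like the sketch below. `DualTimesteps` and its attributes are hypothetical names for illustration and are not part of diffusers. The quote is cut off before it says which copy goes to the model, so this sketch assumes the device copy is passed to the controlnet/unet/transformer and the CPU copy is used for host-side lookups and control flow.

```python
import torch


class DualTimesteps:
    """Hypothetical helper (not a diffusers API): one schedule, two copies."""

    def __init__(self, num_train_timesteps: int, num_inference_steps: int, device: str = "cpu"):
        step = num_train_timesteps // num_inference_steps
        # Descending, evenly spaced schedule built on the CPU.
        self.timesteps_cpu = torch.arange(num_inference_steps - 1, -1, -1, dtype=torch.int64) * step
        # Second copy on the compute device, transferred once up front.
        self.timesteps_device = self.timesteps_cpu.to(device)

    def pairs(self):
        # Python ints come from the CPU copy, so reading them never touches the device copy.
        # Iterating the device copy yields 0-d tensors that stay on the device.
        return zip(self.timesteps_cpu.tolist(), self.timesteps_device)


device = "cuda" if torch.cuda.is_available() else "cpu"
schedule = DualTimesteps(num_train_timesteps=1000, num_inference_steps=4, device=device)
for t_host, t_dev in schedule.pairs():
    # t_host: plain int for indexing and branching on the host (assumed use).
    # t_dev: 0-d tensor that would be handed to the model (assumed use).
    pass
```

The point of the design is that host-side code never has to read a value back from the device tensor, which is the kind of read that forces a synchronization.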

@a-r-r-o-w I think I know why my cudaMemcpyAsync takes a long time: it's still related to the underlying async CUDA execution. cudaMemcpyAsync is synchronizing until the underlying async CUDA execution finishes....