Segmentation running slow all of a sudden

Open dapa5900 opened this issue 1 year ago • 13 comments

Recently, segmentation has been running very slowly. Before Christmas the same image used to take under 1 sec on my 4090; now it takes at least 2-3 sec and sometimes even over 15 sec (it seems to fall back to CPU mode even though the VRAM is not full).

dapa5900 avatar Jan 08 '25 07:01 dapa5900

I didn't notice the problem you mentioned. If it took that long, the model has probably been cleared from the cache; please check whether you are using other very memory-intensive models.

petercham avatar Jan 08 '25 13:01 petercham

It still persists here, but only on the Rmbg and Rmbg Advanced nodes. With the GetMask node and the Portrait model it is still the fastest segmentation of all the ones I tested, at 0.5 sec max.

dapa5900 avatar Jan 09 '25 07:01 dapa5900

image

These three processes are equivalent. The difference from the previous version is that it now uses fast-foreground-estimation to composite the image. You can use other nodes to mix the image and the mask instead.
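
For reference, a minimal sketch of what "mixing the image and the mask with another node" amounts to. The helper name is hypothetical and not part of this repo; it assumes the usual ComfyUI layout of image (b, h, w, c) and mask (b, h, w) tensors, and it skips any foreground colour refinement:

import torch

def mix_image_and_mask(image: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # image: (b, h, w, c) float in [0, 1]; mask: (b, h, w) float in [0, 1]
    rgb = image[..., :3]                     # drop any existing alpha channel
    alpha = mask.unsqueeze(-1)               # (b, h, w) -> (b, h, w, 1)
    return torch.cat([rgb, alpha], dim=-1)   # RGBA image, no foreground colour refinement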

lldacing avatar Jan 11 '25 11:01 lldacing

Thanks. Maybe that's the reason it's running so slowly then...? Can you recommend a solution that yields results similar to yours? I use "ImageRemoveAlpha" from the "LayerStyle" custom nodes, but it gives this halo around the edge (right side of the image) compared to what I get from using your "RmbgByBiRefNet" node directly (left side). Thank you.

halo

dapa5900 avatar Jan 14 '25 07:01 dapa5900

@dapa5900 I wrote a new node to reproduce the effect of the original version. You can add the node to the file birefnetNode.py and test it. I am not sure it will work for you; if it does, I will merge it.

class GetForegroundImageSimple:

    @classmethod
    def INPUT_TYPES(cls):
        return {
            "required": {
                "image": ("IMAGE",),
                "mask": ("MASK", ),
            }
        }

    RETURN_TYPES = ("IMAGE",)
    RETURN_NAMES = ("image",)
    FUNCTION = "get_image"
    CATEGORY = "rembg/BiRefNet"

    def get_image(self, image, mask):
        # image.shape => (b, h, w, c)
        # mask.shape => (b, h, w)

        # Uncomment the line below to check whether normalizing the mask makes any difference
        # mask = normalize_mask(mask)

        # Attach the mask to the image as an alpha channel (existing helper in this repo)
        image = add_mask_as_alpha(image, mask)

        return image,


NODE_CLASS_MAPPINGS = {
    "AutoDownloadBiRefNetModel": AutoDownloadBiRefNetModel,
    "LoadRembgByBiRefNetModel": LoadRembgByBiRefNetModel,
    "RembgByBiRefNet": RembgByBiRefNet,
    "RembgByBiRefNetAdvanced": RembgByBiRefNetAdvanced,
    "GetMaskByBiRefNet": GetMaskByBiRefNet,
    "BlurFusionForegroundEstimation": BlurFusionForegroundEstimation,
    "GetForegroundImageSimple": GetForegroundImageSimple,
}

NODE_DISPLAY_NAME_MAPPINGS = {
    "AutoDownloadBiRefNetModel": "AutoDownloadBiRefNetModel",
    "LoadRembgByBiRefNetModel": "LoadRembgByBiRefNetModel",
    "RembgByBiRefNet": "RembgByBiRefNet",
    "RembgByBiRefNetAdvanced": "RembgByBiRefNetAdvanced",
    "GetMaskByBiRefNet": "GetMaskByBiRefNet",
    "BlurFusionForegroundEstimation": "BlurFusionForegroundEstimation",
    "GetForegroundImageSimple": "GetForegroundImageSimple",
}

image

lldacing avatar Jan 14 '25 09:01 lldacing

Thank you. It works, but on a black background it gives the same halo result as the above-mentioned approach (although it's faster on my side, which is good :-)). Please find attached a workflow that compares these three approaches for an example input image:

ModelBase_Girl_Wide_00027_ compare.json :

dapa5900 avatar Jan 14 '25 09:01 dapa5900

It looks like there is no difference from LayerStyle; I guess this is exactly what fast-foreground-estimation solves.
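
For context, a heavily simplified, single-pass sketch of the blur-fusion ("fast foreground estimation") idea and why it suppresses the halo: near the mask edge the observed pixel colour is a mix of foreground and background, so the background bleeds into the result when the mask is simply applied as alpha; blur-fusion replaces those edge pixels with a locally averaged foreground colour. The repo's refine_foreground appears to apply two blur passes (blur_size and blur_size_two) and also tracks a background estimate; this sketch drops both for brevity and is not the actual implementation:

import cv2
import numpy as np

def estimate_foreground(image: np.ndarray, alpha: np.ndarray, r: int = 91) -> np.ndarray:
    # image: (h, w, 3) float32 in [0, 1]; alpha: (h, w) float32 in [0, 1]
    a = alpha[..., None]                                # (h, w, 1)
    blurred_alpha = cv2.blur(alpha, (r, r))[..., None]  # local alpha coverage
    blurred_fa = cv2.blur(image * a, (r, r))            # local alpha-weighted colour
    fg = blurred_fa / (blurred_alpha + 1e-5)            # local foreground colour estimate
    # Keep the observed colour where alpha is high, and the smoothed foreground
    # colour where alpha is low (the edge band that otherwise shows a halo).
    fg = fg + a * (image - a * fg)
    return np.clip(fg, 0.0, 1.0)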

lldacing avatar Jan 14 '25 10:01 lldacing

Alright, thanks for your efforts!

dapa5900 avatar Jan 14 '25 13:01 dapa5900

This node is slow because it runs on the CPU. If you fix it and move the tensors to the device, it will run faster.

...
class BlurFusionForegroundEstimation:
    ...
    def get_foreground(self, images, masks, blur_size=91, blur_size_two=7, fill_color=False, color=None):
        ...
        # (b, c, h, w)
        _image_masked = refine_foreground(image_bchw.to(deviceType), out_masks.to(deviceType), r1=blur_size, r2=blur_size_two)
        ...

Image
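
A hedged, self-contained sketch of that suggestion, assuming refine_foreground is the repo's existing helper shown above (the import path below is a guess) and that it accepts tensors on either device. It drops the fill_color handling of the real get_foreground for brevity:

import torch
from .util import refine_foreground  # assumption: where this repo defines the helper

deviceType = "cuda" if torch.cuda.is_available() else "cpu"

def get_foreground_on_device(images, masks, blur_size=91, blur_size_two=7):
    image_bchw = images.permute(0, 3, 1, 2)  # (b, h, w, c) -> (b, c, h, w)
    out_masks = masks.unsqueeze(1)           # (b, h, w)   -> (b, 1, h, w)
    # Run the blur-fusion refinement on the GPU instead of the CPU.
    refined = refine_foreground(image_bchw.to(deviceType), out_masks.to(deviceType),
                                r1=blur_size, r2=blur_size_two)
    # Return to (b, h, w, c) on the CPU so downstream nodes behave as before.
    return refined.permute(0, 2, 3, 1).cpu(),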

yrsolo avatar Feb 26 '25 16:02 yrsolo

To save VRAM, the image computation is moved to the CPU. You can use the 🔧 Image To Device node from cubiq/ComfyUI_essentials to move it back to the GPU.
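
For anyone wiring this up: roughly what the 🔧 Image To Device step amounts to (a sketch, not the actual ComfyUI_essentials code), assuming it runs inside a ComfyUI environment where comfy.model_management is importable:

import torch
import comfy.model_management as mm

def image_to_device(image: torch.Tensor) -> torch.Tensor:
    # Move the (b, h, w, c) image tensor to ComfyUI's main torch device (e.g. cuda:0)
    # right before BlurFusionForegroundEstimation, so the refinement runs on the GPU.
    return image.to(mm.get_torch_device())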

lldacing avatar Feb 27 '25 09:02 lldacing

Image

Maybe I should delete this (the part that moves the computation to the CPU), but the impact needs to be evaluated first.

lldacing avatar Feb 27 '25 09:02 lldacing

@lldacing

> To save VRAM, the image computation is moved to the CPU. You can use the 🔧 Image To Device node from cubiq/ComfyUI_essentials to move it back to the GPU.

Image

Using this node, the following error occurs (see the attached image).

Woukim avatar Apr 22 '25 12:04 Woukim

Can I suggest a more explicit device selection?

I noticed that the current implementation sometimes defaults to CPU when AUTO is selected; I suspect it somehow conflicts with ComfyUI-MultiGPU.

I kept the AUTO and CPU entries for backwards compatibility and added an entry for the default ComfyUI torch device; this also allows selecting a GPU in a multi-GPU setup (CUDA only).

Unfortunately I have only tested it on a CUDA setup; I have no access to other hardware.

https://github.com/lldacing/ComfyUI_BiRefNet_ll/compare/main...fAIseh00d:ComfyUI_BiRefNet_ll:main
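
A rough sketch of what such an explicit device dropdown could look like; the function names are illustrative and the actual change is in the branch linked above:

import torch
import comfy.model_management as mm

def device_choices():
    # AUTO / CPU kept for backwards compatibility; the other entries name real devices.
    choices = ["AUTO", "CPU", str(mm.get_torch_device())]
    choices += [f"cuda:{i}" for i in range(torch.cuda.device_count())]
    return list(dict.fromkeys(choices))      # de-duplicate while keeping order

def resolve_device(choice: str) -> torch.device:
    if choice == "AUTO":
        return mm.get_torch_device()
    if choice == "CPU":
        return torch.device("cpu")
    return torch.device(choice)              # explicit entry such as "cuda:1"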

fAIseh00d avatar Apr 23 '25 18:04 fAIseh00d