
Feature Request/Question: Obtaining more granular attributes (e.g., gender) from Person Detection

Open suwadaimyojin opened this issue 8 months ago • 2 comments

Dear imgutils maintainers (@deepghs and other contributors),

Firstly, thank you for creating and maintaining the imgutils library! It's a very helpful tool, and I've been particularly impressed with the detect_person functionality for anime-style images.

I am currently working on a project that involves intelligent image cropping, aiming to automatically focus on and frame characters within images. The detect_person function excellently provides bounding boxes for characters, which I'm using as a primary guide for a subsequent smart-cropping process (leveraging a library similar in effect to smartcrop.js).

My Use Case & Why I Need More Granular Information:

My application processes a variety of images, often slide shows or collections, where multiple characters might be present. To enhance the smart-cropping logic and provide a better user experience, I'm looking for ways to prioritize or differentiate between detected persons based on more specific attributes, most notably gender.

For example, if an image contains multiple characters, my system could be configured to:

- Preferentially crop around female characters.
- Apply different framing or compositional rules based on the gender of the main subject.
- Allow users to filter or select characters based on these attributes.

Currently, detect_person returns the label 'person' and a confidence score. While this is great for general person detection, it doesn't provide attributes like gender.
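
For reference, this is roughly how I'm calling it; as far as I understand the current API, it returns a list of ((x0, y0, x1, y1), label, score) tuples, with the label always being 'person' (please correct me if I've got that wrong). The image path here is just a placeholder:

```python
# Rough sketch of my current usage; 'sample.png' is a placeholder path.
from imgutils.detect import detect_person

detections = detect_person('sample.png')
for (x0, y0, x1, y1), label, score in detections:
    # label is always 'person' for this model; score is the confidence
    print(label, round(score, 4), (x0, y0, x1, y1))
```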

My Question/Feature Request:

1. Is there any existing mechanism or planned feature within imgutils (specifically for the anime person detection models like deepghs/anime_person_detection) to extract or infer more granular attributes such as gender for the detected persons?

2. If not directly available, would you have any recommendations or insights on how one might extend or combine imgutils's person detection with other models or techniques to achieve gender classification for the detected bounding boxes, particularly in the context of anime-style images? (A rough sketch of one idea I've been considering is shown after this list.)

3. Are the underlying YOLOv8 models used (from deepghs/anime_person_detection) trained purely for "person" class detection, or do they potentially contain any multi-label/attribute information that is not currently exposed through the detect_person API?
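
The idea mentioned in question 2 is just a sketch of my own, not an existing imgutils feature: crop each detected box and run a general tagger on the crop (here I'm assuming imgutils.tagging.get_wd14_tags accepts a PIL image and returns rating, general-tag, and character-tag dictionaries), then use tags like '1girl'/'1boy' as a crude gender signal. I'm not sure how reliable this would be on partial crops:

```python
# Rough sketch only: crop each detected person, run a general tagger on
# the crop, and compare gender-related tag confidences as a heuristic.
from PIL import Image
from imgutils.detect import detect_person
from imgutils.tagging import get_wd14_tags

image = Image.open('sample.png')  # placeholder path
for (x0, y0, x1, y1), label, score in detect_person(image):
    crop = image.crop((x0, y0, x1, y1))
    rating, general_tags, character_tags = get_wd14_tags(crop)
    # Crude heuristic: compare the '1girl' and '1boy' tag confidences.
    girl = general_tags.get('1girl', 0.0)
    boy = general_tags.get('1boy', 0.0)
    print((x0, y0, x1, y1), score, 'female?' if girl > boy else 'male/unknown')
```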

What I'm currently doing:

I'm successfully using detect_person to get bounding boxes. These boxes are then passed as boost regions to a smart cropping algorithm. The current limitation is that if multiple 'person' boxes are returned, my logic defaults to heuristics like picking the largest box or the first one, which isn't always optimal for my specific needs that involve prioritizing based on gender.
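
Roughly, that fallback looks like this (a simplified sketch of my own selection code, not part of imgutils):

```python
# Simplified sketch of my current fallback: when several 'person' boxes
# come back, just take the one covering the largest area.
def pick_primary(detections):
    """detections: list of ((x0, y0, x1, y1), label, score) tuples."""
    if not detections:
        return None
    return max(
        detections,
        key=lambda det: (det[0][2] - det[0][0]) * (det[0][3] - det[0][1]),
    )
```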

Any guidance, suggestions, or information on potential future developments in this area would be greatly appreciated. I believe having access to such attributes would significantly enhance the capabilities of applications built with imgutils.

Thank you for your time and for this fantastic library!

Best regards,

suwadaimyojin commented on Jun 03 '25

(Two example images attached.)

I've encountered a situation regarding confidence scores and bounding box (BBox) selection that I'd like to get your insights on. When testing with an image containing multiple characters, the API returned two "person" detections with the following (simplified) data structure:

    [
      { "height": 329,  "weight": 0.7799009680747986, "width": 185,  "x": 144, "y": 278 },
      { "height": 1023, "weight": 0.6733676791191101, "width": 1015, "x": 323, "y": 51 }
    ]

Visually, the second detected person (with a BBox width of 1015 and height of 1023) clearly contains significantly more information about the character and appears to be the "main" subject compared to the first detection (width 185, height 329). However, the confidence score for the first, less informative region (score: 0.7799...) is higher than that of the second, more prominent region (score: 0.6733...). My current application logic defaults to selecting the detection with the highest confidence score as the primary target for further processing. In this scenario, it would incorrectly prioritize the less informative character. My goal is to reliably select the character region that is more visually dominant or contains more complete information.

Are there ways to assess the "saliency" or "completeness" of an object within its BBox?

Based solely on the current output (BBox coordinates and confidence score), do you have any recommended strategies—other than simply sorting by confidence—to better distinguish and select the more informative and likely primary character region?

suwadaimyojin commented on Jun 04 '25

@suwadaimyojin Hi, sorry for the late reply; I've been extremely busy these past few months.

> Preferentially crop around female characters.

There's currently no reliable gender prediction model for anime images yet; we need datasets for that, so let me know if you know where to get them.

> Are there ways to assess the "saliency" or "completeness" of an object within its BBox?

These values aren't natively supported by YOLO models, but maybe some visual entropy-based methods would work. I have no concrete idea at the moment; let me think about that.
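
Just to illustrate what I mean by an entropy-based method (a rough sketch, not something imgutils provides; the function names are made up for illustration): compute the grayscale histogram entropy of each cropped region and weight it by the box size when ranking detections.

```python
# Rough sketch of an entropy-based ranking (not an imgutils API).
import numpy as np
from PIL import Image

def region_entropy(image: Image.Image, bbox) -> float:
    """Shannon entropy of the grayscale histogram inside bbox."""
    x0, y0, x1, y1 = bbox
    crop = np.asarray(image.crop((x0, y0, x1, y1)).convert('L'))
    hist, _ = np.histogram(crop, bins=256, range=(0, 256))
    p = hist[hist > 0] / hist.sum()
    return float(-(p * np.log2(p)).sum())

def rank_by_saliency(image: Image.Image, detections):
    """Sort ((x0, y0, x1, y1), label, score) detections, most salient first."""
    def key(det):
        (x0, y0, x1, y1), _label, _score = det
        area = max(x1 - x0, 1) * max(y1 - y0, 1)
        return region_entropy(image, (x0, y0, x1, y1)) * (area ** 0.5)
    return sorted(detections, key=key, reverse=True)
```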

narugo1992 commented on Aug 06 '25