[WIP] Improve multimodal processors - rely less on kwargs

Open molbap opened this issue 2 years ago • 1 comments

What does this PR do?

This PR aims to give better control over the logic flow through Processor classes, in particular those that combine an ImageProcessor with a Tokenizer. Linked with #27768.

Compared to Nougat (taken as a reference point), the ImageProcessors have different signatures in their preprocess method. The arguments that differ are listed here, per class:

TvltImageProcessor:
videos, patch_size, crop_size, do_center_crop, is_mixed, num_frames

IdeficsImageProcessor:
transform, image_num_channels, image_size

ViTImageProcessor:
No difference in args

Mask2FormerImageProcessor:
segmentation_maps, ignore_index, size_divisor, reduce_labels, instance_id_to_semantic_id

MaskFormerImageProcessor:
segmentation_maps, ignore_index, size_divisor, do_reduce_labels, instance_id_to_semantic_id

YolosImageProcessor:
format, return_segmentation_masks, annotations, masks_path

MobileNetV1ImageProcessor:
do_center_crop, crop_size

DeiTImageProcessor:
do_center_crop, crop_size

EfficientNetImageProcessor:
include_top, do_center_crop, rescale_offset, crop_size

BeitImageProcessor:
do_reduce_labels, do_center_crop, segmentation_maps, crop_size

MobileViTImageProcessor:
do_flip_channel_order, do_center_crop, segmentation_maps, crop_size

PerceiverImageProcessor:
do_center_crop, crop_size

DeformableDetrImageProcessor:
format, return_segmentation_masks, annotations, masks_path

EfficientFormerImageProcessor:
do_center_crop, crop_size

SegformerImageProcessor:
do_reduce_labels, segmentation_maps

LayoutLMv2ImageProcessor:
apply_ocr, ocr_lang, tesseract_config

BridgeTowerImageProcessor:
do_center_crop, size_divisor

SamImageProcessor:
segmentation_maps, pad_size, do_convert_rgb, mask_pad_size, mask_size

BlipImageProcessor:
do_convert_rgb

Owlv2ImageProcessor:
No difference in args

LayoutLMv3ImageProcessor:
apply_ocr, ocr_lang, tesseract_config

DetaImageProcessor:
format, return_segmentation_masks, annotations, masks_path

BitImageProcessor:
do_center_crop, do_convert_rgb, crop_size

ViTHybridImageProcessor:
do_center_crop, do_convert_rgb, crop_size

FuyuImageProcessor:
patch_size, padding_mode, padding_value

PvtImageProcessor:
No difference in args

Pix2StructImageProcessor:
max_patches, header_text, do_convert_rgb, patch_size

VitMatteImageProcessor:
trimaps, size_divisibility

VideoMAEImageProcessor:
videos, do_center_crop, crop_size

MobileNetV2ImageProcessor:
do_center_crop, crop_size

OneFormerImageProcessor:
segmentation_maps, ignore_index, task_inputs, do_reduce_labels, instance_id_to_semantic_id

FlavaImageProcessor:
crop_size, codebook_crop_size, codebook_rescale_factor, mask_group_max_patches, mask_group_min_patches, mask_group_max_aspect_ratio, codebook_image_mean, codebook_do_resize, return_image_mask, input_size_patches, codebook_do_center_crop, codebook_resample, mask_group_min_aspect_ratio, codebook_do_normalize, codebook_do_map_pixels, return_codebook_pixels, codebook_image_std, do_center_crop, codebook_size, codebook_do_rescale, total_mask_patches

DonutImageProcessor:
random_padding

TvpImageProcessor:
videos, crop_size, constant_values, do_flip_channel_order, do_center_crop, pad_size, pad_mode

GLPNImageProcessor:
size_divisor

PoolFormerImageProcessor:
crop_pct, do_center_crop, crop_size

CLIPImageProcessor:
do_center_crop, do_convert_rgb, crop_size

DPTImageProcessor:
ensure_multiple_of, keep_aspect_ratio, size_divisor

ViltImageProcessor:
size_divisor

Swin2SRImageProcessor:
pad_size

ImageGPTImageProcessor:
clusters, do_color_quantize

SiglipImageProcessor:
No difference in args

VivitImageProcessor:
videos, do_center_crop, offset, crop_size

ConvNextImageProcessor:
crop_pct

OwlViTImageProcessor:
do_center_crop, crop_size

ChineseCLIPImageProcessor:
do_center_crop, do_convert_rgb, crop_size

LevitImageProcessor:
do_center_crop, crop_size

ConditionalDetrImageProcessor:
format, return_segmentation_masks, annotations, masks_path

DetrImageProcessor:
format, return_segmentation_masks, annotations, masks_path

This listing helps standardize the signatures as a first step, and will then allow making the Processors uniform.
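
For reference, a listing like the one above could be generated by comparing preprocess signatures with the standard inspect module. The following is only a minimal sketch: ReferenceImageProcessor and CropImageProcessor are hypothetical stand-ins (not the real transformers classes), and it assumes the listing shows arguments that a class accepts but the reference does not.

```python
import inspect


class ReferenceImageProcessor:
    """Hypothetical stand-in for the reference processor (not the real Nougat class)."""

    def preprocess(self, images, do_resize=None, size=None, return_tensors=None, **kwargs):
        return images


class CropImageProcessor:
    """Hypothetical processor whose preprocess adds center-crop arguments."""

    def preprocess(self, images, do_resize=None, size=None, do_center_crop=None,
                   crop_size=None, return_tensors=None, **kwargs):
        return images


def named_parameters(cls):
    """Return the explicitly named parameters of cls.preprocess.

    `self`, *args and **kwargs are skipped, since they say nothing about
    which arguments the method really supports.
    """
    params = inspect.signature(cls.preprocess).parameters.values()
    skipped = (inspect.Parameter.VAR_POSITIONAL, inspect.Parameter.VAR_KEYWORD)
    return {p.name for p in params if p.name != "self" and p.kind not in skipped}


def extra_arguments(cls, reference):
    """Arguments that cls.preprocess names but the reference's preprocess does not."""
    return sorted(named_parameters(cls) - named_parameters(reference))


def format_report(classes, reference):
    """Render the differences in the same layout as the listing in this PR."""
    lines = []
    for cls in classes:
        extra = extra_arguments(cls, reference)
        lines.append(f"{cls.__name__}:")
        lines.append(", ".join(extra) if extra else "No difference in args")
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_report([CropImageProcessor, ReferenceImageProcessor], ReferenceImageProcessor))
```

Against the real library, the same helper would be pointed at the actual image processor classes; arguments that are only reachable through **kwargs would not show up in such a comparison, which is part of the motivation for relying less on kwargs.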

Fixes # (issue)

Before submitting

[ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
[ ] Did you read the contributor guideline, Pull Request section?
[ ] Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
[ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
[ ] Did you write any new necessary tests?

Who can review?

Models:

text models: @ArthurZucker and @younesbelkada
vision models: @amyeroberts

Jan 25 '24 17:01 molbap

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Jan 25 '24 18:01 HuggingFaceDocBuilderDev