[WIP] Improve multimodal processors - rely less on kwargs

Open molbap opened this issue 2 years ago • 1 comments

What does this PR do?

This PR aims to give better control over the logic flow through Processor classes, in particular those that combine an ImageProcessor with a Tokenizer. Linked with #27768.

Compared to Nougat (taken as a reference point), the ImageProcessors have different signatures in their preprocess methods. The arguments that differ are listed below for each class:

TvltImageProcessor:
videos, patch_size, crop_size, do_center_crop, is_mixed, num_frames

IdeficsImageProcessor:
transform, image_num_channels, image_size

ViTImageProcessor:
No difference in args

Mask2FormerImageProcessor:
segmentation_maps, ignore_index, size_divisor, reduce_labels, instance_id_to_semantic_id

MaskFormerImageProcessor:
segmentation_maps, ignore_index, size_divisor, do_reduce_labels, instance_id_to_semantic_id

YolosImageProcessor:
format, return_segmentation_masks, annotations, masks_path

MobileNetV1ImageProcessor:
do_center_crop, crop_size

DeiTImageProcessor:
do_center_crop, crop_size

EfficientNetImageProcessor:
include_top, do_center_crop, rescale_offset, crop_size

BeitImageProcessor:
do_reduce_labels, do_center_crop, segmentation_maps, crop_size

MobileViTImageProcessor:
do_flip_channel_order, do_center_crop, segmentation_maps, crop_size

PerceiverImageProcessor:
do_center_crop, crop_size

DeformableDetrImageProcessor:
format, return_segmentation_masks, annotations, masks_path

EfficientFormerImageProcessor:
do_center_crop, crop_size

SegformerImageProcessor:
do_reduce_labels, segmentation_maps

LayoutLMv2ImageProcessor:
apply_ocr, ocr_lang, tesseract_config

BridgeTowerImageProcessor:
do_center_crop, size_divisor

SamImageProcessor:
segmentation_maps, pad_size, do_convert_rgb, mask_pad_size, mask_size

BlipImageProcessor:
do_convert_rgb

Owlv2ImageProcessor:
No difference in args

LayoutLMv3ImageProcessor:
apply_ocr, ocr_lang, tesseract_config

DetaImageProcessor:
format, return_segmentation_masks, annotations, masks_path

BitImageProcessor:
do_center_crop, do_convert_rgb, crop_size

ViTHybridImageProcessor:
do_center_crop, do_convert_rgb, crop_size

FuyuImageProcessor:
patch_size, padding_mode, padding_value

PvtImageProcessor:
No difference in args

Pix2StructImageProcessor:
max_patches, header_text, do_convert_rgb, patch_size

VitMatteImageProcessor:
trimaps, size_divisibility

VideoMAEImageProcessor:
videos, do_center_crop, crop_size

MobileNetV2ImageProcessor:
do_center_crop, crop_size

OneFormerImageProcessor:
segmentation_maps, ignore_index, task_inputs, do_reduce_labels, instance_id_to_semantic_id

FlavaImageProcessor:
crop_size, codebook_crop_size, codebook_rescale_factor, mask_group_max_patches, mask_group_min_patches, mask_group_max_aspect_ratio, codebook_image_mean, codebook_do_resize, return_image_mask, input_size_patches, codebook_do_center_crop, codebook_resample, mask_group_min_aspect_ratio, codebook_do_normalize, codebook_do_map_pixels, return_codebook_pixels, codebook_image_std, do_center_crop, codebook_size, codebook_do_rescale, total_mask_patches

DonutImageProcessor:
random_padding

TvpImageProcessor:
videos, crop_size, constant_values, do_flip_channel_order, do_center_crop, pad_size, pad_mode

GLPNImageProcessor:
size_divisor

PoolFormerImageProcessor:
crop_pct, do_center_crop, crop_size

CLIPImageProcessor:
do_center_crop, do_convert_rgb, crop_size

DPTImageProcessor:
ensure_multiple_of, keep_aspect_ratio, size_divisor

ViltImageProcessor:
size_divisor

Swin2SRImageProcessor:
pad_size

ImageGPTImageProcessor:
clusters, do_color_quantize

SiglipImageProcessor:
No difference in args

VivitImageProcessor:
videos, do_center_crop, offset, crop_size

ConvNextImageProcessor:
crop_pct

OwlViTImageProcessor:
do_center_crop, crop_size

ChineseCLIPImageProcessor:
do_center_crop, do_convert_rgb, crop_size

LevitImageProcessor:
do_center_crop, crop_size

ConditionalDetrImageProcessor:
format, return_segmentation_masks, annotations, masks_path

DetrImageProcessor:
format, return_segmentation_masks, annotations, masks_path

Listing these differences is a first step toward standardizing the ImageProcessor signatures, which in turn will allow the Processors to be made uniform.
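
For readers who want to reproduce a comparison like the one above, here is a minimal sketch of one way to diff two preprocess signatures using the standard inspect module. It is not necessarily how the list in this PR was generated. The helper name extra_preprocess_args and the two toy classes are hypothetical stand-ins, chosen so the snippet runs without transformers installed.

```python
import inspect


def extra_preprocess_args(cls, reference_cls, method="preprocess"):
    """Return parameter names of cls.<method> that reference_cls.<method> lacks.

    Hypothetical helper for illustration; not part of transformers.
    """
    own = inspect.signature(getattr(cls, method)).parameters
    ref = inspect.signature(getattr(reference_cls, method)).parameters
    return sorted(name for name in own if name not in ref)


# Toy stand-ins (assumptions, NOT the real transformers classes).
class ReferenceProcessor:
    def preprocess(self, images, do_resize=None, size=None, **kwargs):
        ...


class CroppingProcessor:
    def preprocess(self, images, do_resize=None, size=None,
                   do_center_crop=None, crop_size=None, **kwargs):
        ...


print(extra_preprocess_args(CroppingProcessor, ReferenceProcessor))
# → ['crop_size', 'do_center_crop']
```

With transformers installed, the same helper could presumably be called with a real class and NougatImageProcessor as the reference. Note that it only reports arguments present in the first class and absent from the reference, not the reverse.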

Fixes # (issue)

Before submitting

  • [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • [ ] Did you read the contributor guideline, Pull Request section?
  • [ ] Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • [ ] Did you write any new necessary tests?

Who can review?

Models:

  • text models: @ArthurZucker and @younesbelkada
  • vision models: @amyeroberts

molbap, Jan 25 '24 17:01

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.