transformers
[WIP] Improve multimodal processors - rely less on kwargs
What does this PR do?
This PR aims at better control of the logic flow through Processor classes, in particular those combining an ImageProcessor with a Tokenizer. Linked with #27768.
Compared to Nougat (taken as a reference point), ImageProcessors have differing signatures in their `preprocess` methods. The extra arguments each one accepts are listed below:
TvltImageProcessor:
videos, patch_size, crop_size, do_center_crop, is_mixed, num_frames
IdeficsImageProcessor:
transform, image_num_channels, image_size
ViTImageProcessor:
No difference in args
Mask2FormerImageProcessor:
segmentation_maps, ignore_index, size_divisor, reduce_labels, instance_id_to_semantic_id
MaskFormerImageProcessor:
segmentation_maps, ignore_index, size_divisor, do_reduce_labels, instance_id_to_semantic_id
YolosImageProcessor:
format, return_segmentation_masks, annotations, masks_path
MobileNetV1ImageProcessor:
do_center_crop, crop_size
DeiTImageProcessor:
do_center_crop, crop_size
EfficientNetImageProcessor:
include_top, do_center_crop, rescale_offset, crop_size
BeitImageProcessor:
do_reduce_labels, do_center_crop, segmentation_maps, crop_size
MobileViTImageProcessor:
do_flip_channel_order, do_center_crop, segmentation_maps, crop_size
PerceiverImageProcessor:
do_center_crop, crop_size
DeformableDetrImageProcessor:
format, return_segmentation_masks, annotations, masks_path
EfficientFormerImageProcessor:
do_center_crop, crop_size
SegformerImageProcessor:
do_reduce_labels, segmentation_maps
LayoutLMv2ImageProcessor:
apply_ocr, ocr_lang, tesseract_config
BridgeTowerImageProcessor:
do_center_crop, size_divisor
SamImageProcessor:
segmentation_maps, pad_size, do_convert_rgb, mask_pad_size, mask_size
BlipImageProcessor:
do_convert_rgb
Owlv2ImageProcessor:
No difference in args
LayoutLMv3ImageProcessor:
apply_ocr, ocr_lang, tesseract_config
DetaImageProcessor:
format, return_segmentation_masks, annotations, masks_path
BitImageProcessor:
do_center_crop, do_convert_rgb, crop_size
ViTHybridImageProcessor:
do_center_crop, do_convert_rgb, crop_size
FuyuImageProcessor:
patch_size, padding_mode, padding_value
PvtImageProcessor:
No difference in args
Pix2StructImageProcessor:
max_patches, header_text, do_convert_rgb, patch_size
VitMatteImageProcessor:
trimaps, size_divisibility
VideoMAEImageProcessor:
videos, do_center_crop, crop_size
MobileNetV2ImageProcessor:
do_center_crop, crop_size
OneFormerImageProcessor:
segmentation_maps, ignore_index, task_inputs, do_reduce_labels, instance_id_to_semantic_id
FlavaImageProcessor:
crop_size, codebook_crop_size, codebook_rescale_factor, mask_group_max_patches, mask_group_min_patches, mask_group_max_aspect_ratio, codebook_image_mean, codebook_do_resize, return_image_mask, input_size_patches, codebook_do_center_crop, codebook_resample, mask_group_min_aspect_ratio, codebook_do_normalize, codebook_do_map_pixels, return_codebook_pixels, codebook_image_std, do_center_crop, codebook_size, codebook_do_rescale, total_mask_patches
DonutImageProcessor:
random_padding
TvpImageProcessor:
videos, crop_size, constant_values, do_flip_channel_order, do_center_crop, pad_size, pad_mode
GLPNImageProcessor:
size_divisor
PoolFormerImageProcessor:
crop_pct, do_center_crop, crop_size
CLIPImageProcessor:
do_center_crop, do_convert_rgb, crop_size
DPTImageProcessor:
ensure_multiple_of, keep_aspect_ratio, size_divisor
ViltImageProcessor:
size_divisor
Swin2SRImageProcessor:
pad_size
ImageGPTImageProcessor:
clusters, do_color_quantize
SiglipImageProcessor:
No difference in args
VivitImageProcessor:
videos, do_center_crop, offset, crop_size
ConvNextImageProcessor:
crop_pct
OwlViTImageProcessor:
do_center_crop, crop_size
ChineseCLIPImageProcessor:
do_center_crop, do_convert_rgb, crop_size
LevitImageProcessor:
do_center_crop, crop_size
ConditionalDetrImageProcessor:
format, return_segmentation_masks, annotations, masks_path
DetrImageProcessor:
format, return_segmentation_masks, annotations, masks_path
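A diff like the one above can be computed by comparing `preprocess` signatures against the reference with `inspect`. The toy classes below are hypothetical stand-ins, not the actual transformers implementations; they only mirror the pattern (CLIP adding `do_center_crop`, `crop_size`, `do_convert_rgb` on top of a Nougat-like baseline):

```python
import inspect

class ToyNougatImageProcessor:
    """Stand-in for the reference point's preprocess signature."""
    def preprocess(self, images, do_resize=True, size=None, return_tensors=None):
        ...

class ToyCLIPImageProcessor:
    """Stand-in for a processor with extra center-crop/RGB-conversion args."""
    def preprocess(self, images, do_resize=True, size=None,
                   do_center_crop=True, crop_size=None, do_convert_rgb=True,
                   return_tensors=None):
        ...

def extra_preprocess_args(cls, reference=ToyNougatImageProcessor):
    # Parameters present in cls.preprocess but absent from the reference.
    ref = set(inspect.signature(reference.preprocess).parameters)
    return sorted(set(inspect.signature(cls.preprocess).parameters) - ref)

print(extra_preprocess_args(ToyCLIPImageProcessor))
# → ['crop_size', 'do_center_crop', 'do_convert_rgb']
```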
This helps standardize things as a first step, and will later allow uniformizing the Processor classes themselves.
Fixes # (issue)
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [ ] Did you read the contributor guideline, Pull Request section?
- [ ] Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [ ] Did you write any new necessary tests?
Who can review?
Models:
- text models: @ArthurZucker and @younesbelkada
- vision models: @amyeroberts