Add Video Mask2Former
What does this PR do?
This PR adds the Video Mask2Former model.
Original repo: https://github.com/facebookresearch/Mask2Former/
Paper: Mask2Former for Video Instance Segmentation, https://arxiv.org/abs/2112.10764
Co-authored with: @alaradirik
- [x] Update model checkpoints
- [x] Update model cards
- [ ] Transfer model checkpoints to the facebook organization
Who can review?
@alaradirik
I suggest that instead of adapting the logic within the current Mask2Former layers and image processor, a new model Mask2FormerVideo is added, with its own image processor and modeling file. This would enable simpler logic in the forward passes for the respective models, and an image processor which can take and return videos directly. This PR is in a good state, so it should be fairly simple to make this change, but of course let us know if you have any questions about this implementation.
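The trade-off behind this suggestion can be sketched in a few lines. The sketch below is purely illustrative (the class names, shapes, and the mean-pooling "backbone" are invented here, not taken from the transformers implementation): a shared model needs an `is_video` flag and shape juggling in its forward pass, while a dedicated video model can assume video-shaped inputs throughout.

```python
import numpy as np

class SharedModelSketch:
    """Hypothetical shared image/video model with an is_video flag."""

    def forward(self, pixel_values, is_video=False):
        if is_video:
            # collapse (batch, frames, C, H, W) into one image batch and
            # restore the frame axis afterwards
            b, t = pixel_values.shape[:2]
            flat = pixel_values.reshape(b * t, *pixel_values.shape[2:])
            feats = flat.mean(axis=(2, 3))  # stand-in for the real backbone
            return feats.reshape(b, t, -1)
        return pixel_values.mean(axis=(2, 3))  # image-only path

class VideoModelSketch:
    """Hypothetical dedicated video model: no flag, no branching."""

    def forward(self, pixel_values):
        # inputs are always (batch, frames, C, H, W)
        b, t = pixel_values.shape[:2]
        flat = pixel_values.reshape(b * t, *pixel_values.shape[2:])
        return flat.mean(axis=(2, 3)).reshape(b, t, -1)

clip = np.ones((2, 4, 3, 8, 8), dtype=np.float32)
print(SharedModelSketch().forward(clip, is_video=True).shape)  # (2, 4, 3)
print(VideoModelSketch().forward(clip).shape)                  # (2, 4, 3)
```

Both paths compute the same thing, but only the dedicated model keeps its forward pass free of input-type branching, which is the simplification the review is pointing at.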
Hey @amyeroberts , thanks a lot for the review :)
Regarding the structure: Alara and I actually discussed earlier whether to add this as a separate model or not, but we felt that since Video Mask2Former is not very different from the original Mask2Former, it would be alright to modify the existing implementation to handle both video and image inputs.
But I'd be happy to convert this into a separate model. I just want to quickly double-check with @alaradirik too before proceeding, as this relates to our earlier discussion.
@shivalikasingh95 I'm fine with going either way since the video segmentation model only has minor differences, but perhaps we could keep most of the changes to the existing subclasses and add a new head class, Mask2FormerForVideoSegmentation, and add it to the model docs. I think this would boost the visibility and usage of the model as well, what do you think?
Just asking for future models @amyeroberts, would we have a video_processing_xxx.py file and a VideoProcessorXXX class to process videos? We have talked about creating video processing utilities with @shivalikasingh95 before, but I wasn't sure about the best way to handle it.
> @shivalikasingh95 I'm fine with going either way since the video segmentation model only has minor differences but perhaps we could keep most of changes to the existing sub classes and add a new head class - Mask2FormerForVideoSegmentation and add it to the model docs. I think this would boost the visibility and usage of the model as well, what do you think?
@alaradirik I think adding a new head class - Mask2FormerForVideoSegmentation is a really good idea!
> Just asking for future models @amyeroberts, would we have a video_processing_xxx.py file and VideoProcessorXXX class to process videos?
If we can add something like Mask2FormerVideoProcessor which can handle videos directly then that would be perfect.
I'm not fully sure it would make sense to add a separate modeling file altogether for Video Mask2Former, since the authors wanted to show how easily Mask2Former can be used for video segmentation too.
Quote from the paper: "We find Mask2Former also achieves state-of-the-art performance on video instance segmentation without modifying the architecture, the loss or even the training pipeline"
Hence, implementation-wise there isn't really much difference.
I think adding a new head and video processor class should help in making the implementation cleaner.
But again, I'm not sure if a model can have both an ImageProcessor and a VideoProcessor class. If you feel this may not be the best approach, then maybe we can go for turning this into a separate model.
> I'm not fully sure if it would make sense to add a separate modeling file altogether for video-mask2former
I agree that having a separate model implementation isn't in line with the spirit of the model. However, at the moment, supporting video inputs does require a modification of the architecture, as shown by the need for the is_video flag throughout modeling_mask2former.py.
> If we can add something like Mask2FormerVideoProcessor which can handle videos directly then that would be perfect.
I don't think a VideoProcessor class is necessary at the moment. We already have models with video inputs e.g. VideoMAE which use image processors.
If we want to use the same processing class as the image model, adding a method such as preprocess_video is a possibility. This does mean the processor won't be compatible with the usual API, i.e. it could not be called directly as image_processor(video_inputs). However, the current method post_process_video_instance_segmentation also breaks this. Having a separate modeling file resolves the issue of one class handling both images and videos.
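To make the API point concrete, here is a minimal hypothetical sketch (not the real Mask2Former processor; everything below is invented for illustration): with a shared processing class, videos would reach the model through an extra `preprocess_video` method rather than through the standard `processor(inputs)` call.

```python
import numpy as np

class SharedImageProcessorSketch:
    """Hypothetical shared processor: only images use the standard call."""

    def __call__(self, images):
        # standard API: a batch of images (N, H, W, C) -> (N, C, H, W)
        images = np.asarray(images, dtype=np.float32) / 255.0
        return {"pixel_values": images.transpose(0, 3, 1, 2)}

    def preprocess_video(self, video):
        # extra, non-standard entry point for videos: treat the frames as a
        # batch of images and keep the frame axis in the output
        frames = np.asarray(video)
        return self(frames)

proc = SharedImageProcessorSketch()
video = np.zeros((8, 64, 64, 3), dtype=np.uint8)
# videos only work through the extra method; a plain proc(video) call would
# treat the frame axis as a batch axis, which is the API mismatch above
out = proc.preprocess_video(video)
print(out["pixel_values"].shape)  # (8, 3, 64, 64)
```

The extra method works, but callers now need to know which entry point to use per input type, which is the incompatibility being discussed.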
> Having a separate modeling file resolves the issue of one class handling both images and videos.
Sure @amyeroberts. I understand now why adding a separate modeling file would make more sense. Had a discussion with @alaradirik too regarding this change on Friday. I'll take care of adding the new modeling file.
> I don't think a VideoProcessor class is necessary at the moment. We already have models with video inputs e.g. VideoMAE which use image processors.
Again makes sense, @amyeroberts. I only meant to suggest adding a VideoProcessor class for Mask2Former in case you were planning on introducing VideoProcessor classes in general as part of the transformers library.
> If we want to use the same processing class as the image model, adding a method such as preprocess_video is a possibility. This does mean the processor won't be compatible with the usual API, i.e. it could not be directly called image_processor(video_inputs)
Would adding a separate image processor class, VideoMask2FormerImageProcessor, make sense if we don't want the image and video models to share the same image processing class?
This way the usual API behaviour would not be broken, and we could directly call image_processor(video_inputs).
> Would adding a separate image processor class, VideoMask2FormerImageProcessor, make sense if we don't want to have the same image processing class for image and video models? This way the usual API behaviour would not be broken. In this case, we can directly use image_processor(video_inputs).
@shivalikasingh95 - yep, I think that works!
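A rough sketch of what the agreed-upon design could look like. This is illustrative only: the class is a mock, the nearest-neighbour resize is a naive stand-in for real preprocessing, and nothing here is the actual transformers code. The point is that a dedicated video processor keeps the usual `processor(inputs)` calling convention by accepting a whole clip in `__call__`.

```python
import numpy as np

class VideoMask2FormerImageProcessorSketch:
    """Hypothetical video processor that keeps the standard __call__ API."""

    def __init__(self, size=(384, 384)):
        self.size = size

    def __call__(self, video):
        # video: (num_frames, H, W, C) uint8 frames
        video = np.asarray(video, dtype=np.float32) / 255.0
        t, h, w, c = video.shape
        # naive nearest-neighbour resize as a stand-in for real preprocessing
        rows = np.arange(self.size[0]) * h // self.size[0]
        cols = np.arange(self.size[1]) * w // self.size[1]
        resized = video[:, rows][:, :, cols]
        # channels-first pixel values, one entry per frame
        return {"pixel_values": resized.transpose(0, 3, 1, 2)}

processor = VideoMask2FormerImageProcessorSketch()
clip = np.random.randint(0, 256, (4, 240, 320, 3), dtype=np.uint8)
inputs = processor(clip)  # the usual API, no extra preprocess_video method
print(inputs["pixel_values"].shape)  # (4, 3, 384, 384)
```

Because the clip goes through `__call__` directly, downstream code can treat this processor exactly like any other image processor, which is the compatibility argument made above.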
@alaradirik and @amyeroberts please feel free to review this PR.
I'm just getting a few failing CI checks due to this error. It would be great if I could get some help on how to fix it.
@amyeroberts could you do a final review? Video Mask2Former is a separate model now and the PR is in good shape :)
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.