Add Video Mask2Former
What does this PR do?
This PR adds the Video Mask2Former model.
Original repo: https://github.com/facebookresearch/Mask2Former/
Paper: Mask2Former for Video Instance Segmentation, https://arxiv.org/abs/2112.10764
Co-authored with: @alaradirik
- [x] Update model checkpoints
- [x] Update model cards
- [ ] Transfer model checkpoints to the facebook organization
Who can review?
@alaradirik
I suggest that instead of adapting the logic within the current Mask2Former layers and image processor, a new model Mask2FormerVideo is added, with its own image processor and modeling file. This would enable simpler logic in the forward passes for the respective models, and an image processor which can take and return videos directly. This PR is in a good state, so it should be fairly simple to make this change, but of course let us know if you have any questions about this implementation.
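The trade-off behind this suggestion can be sketched in a few lines. The sketch below is purely illustrative (the class names, shapes, and the mean-pooling "backbone" are invented here, not taken from the transformers implementation): a shared model needs an `is_video` flag and shape juggling in its forward pass, while a dedicated video model can assume video-shaped inputs throughout.

```python
import numpy as np

class SharedModelSketch:
    """Hypothetical shared image/video model with an is_video flag."""

    def forward(self, pixel_values, is_video=False):
        if is_video:
            # collapse (batch, frames, C, H, W) into one image batch and
            # restore the frame axis afterwards
            b, t = pixel_values.shape[:2]
            flat = pixel_values.reshape(b * t, *pixel_values.shape[2:])
            feats = flat.mean(axis=(2, 3))  # stand-in for the real backbone
            return feats.reshape(b, t, -1)
        return pixel_values.mean(axis=(2, 3))  # image-only path

class VideoModelSketch:
    """Hypothetical dedicated video model: no flag, no branching."""

    def forward(self, pixel_values):
        # inputs are always (batch, frames, C, H, W)
        b, t = pixel_values.shape[:2]
        flat = pixel_values.reshape(b * t, *pixel_values.shape[2:])
        return flat.mean(axis=(2, 3)).reshape(b, t, -1)

clip = np.ones((2, 4, 3, 8, 8), dtype=np.float32)
print(SharedModelSketch().forward(clip, is_video=True).shape)  # (2, 4, 3)
print(VideoModelSketch().forward(clip).shape)                  # (2, 4, 3)
```

Both paths compute the same thing, but only the dedicated model keeps its forward pass free of input-type branching, which is the simplification the review is pointing at.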
Hey @amyeroberts , thanks a lot for the review :)
Regarding the structure: Alara and I actually discussed earlier whether to add this as a separate model or not, but we felt that since Video Mask2Former is not very different from the original Mask2Former, it would be alright to modify the existing implementation to handle both video and image inputs.
But I'd be happy to convert this into a separate model. I just want to quickly double-check with @alaradirik too before proceeding, as this relates to our earlier discussion.
@shivalikasingh95 I'm fine with going either way since the video segmentation model only has minor differences, but perhaps we could keep most of the changes to the existing subclasses and add a new head class, Mask2FormerForVideoSegmentation, and add it to the model docs. I think this would boost the visibility and usage of the model as well, what do you think?
Just asking for future models @amyeroberts, would we have a video_processing_xxx.py file and a VideoProcessorXXX class to process videos? We have talked about creating video processing utilities with @shivalikasingh95 before, but I wasn't sure about the best way to handle it.
> @shivalikasingh95 I'm fine with going either way since the video segmentation model only has minor differences but perhaps we could keep most of changes to the existing sub classes and add a new head class - Mask2FormerForVideoSegmentation and add it to the model docs. I think this would boost the visibility and usage of the model as well, what do you think?
@alaradirik I think adding a new head class - Mask2FormerForVideoSegmentation is a really good idea!
> Just asking for future models @amyeroberts, would we have a video_processing_xxx.py file and VideoProcessorXXX class to process videos?
If we can add something like Mask2FormerVideoProcessor which can handle videos directly then that would be perfect.
I'm not fully sure it would make sense to add a separate modeling file altogether for Video Mask2Former, since the authors wanted to show how easily Mask2Former can be used for video segmentation too.
Quote from the paper: "We find Mask2Former also achieves state-of-the-art performance on video instance segmentation without modifying the architecture, the loss or even the training pipeline"
Hence, implementation-wise there isn't really much difference.
I think adding a new head and video processor class should help in making the implementation cleaner.
But again, I'm not sure if a model can have both an ImageProcessor and a VideoProcessor class. If you feel this may not be the best approach, then maybe we can go for turning this into a separate model.
> I'm not fully sure if it would make sense to add a separate modeling file altogether for video-mask2former
I agree that having a separate model implementation isn't in line with the spirit of the model. However, at the moment, supporting video inputs does require a modification of the architecture, as shown by the need for the is_video flag throughout modeling_mask2former.py.
> If we can add something like Mask2FormerVideoProcessor which can handle videos directly then that would be perfect.
I don't think a VideoProcessor class is necessary at the moment. We already have models with video inputs e.g. VideoMAE which use image processors.
If we want to use the same processing class as the image model, adding a method such as preprocess_video is a possibility. This does mean the processor won't be compatible with the usual API, i.e. it could not be called directly as image_processor(video_inputs). However, the current method post_process_video_instance_segmentation also breaks this. Having a separate modeling file resolves the issue of one class handling both images and videos.
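To make the API point concrete, here is a minimal hypothetical sketch (not the real Mask2Former processor; everything below is invented for illustration): with a shared processing class, videos would reach the model through an extra `preprocess_video` method rather than through the standard `processor(inputs)` call.

```python
import numpy as np

class SharedImageProcessorSketch:
    """Hypothetical shared processor: only images use the standard call."""

    def __call__(self, images):
        # standard API: a batch of images (N, H, W, C) -> (N, C, H, W)
        images = np.asarray(images, dtype=np.float32) / 255.0
        return {"pixel_values": images.transpose(0, 3, 1, 2)}

    def preprocess_video(self, video):
        # extra, non-standard entry point for videos: treat the frames as a
        # batch of images and keep the frame axis in the output
        frames = np.asarray(video)
        return self(frames)

proc = SharedImageProcessorSketch()
video = np.zeros((8, 64, 64, 3), dtype=np.uint8)
# videos only work through the extra method; a plain proc(video) call would
# treat the frame axis as a batch axis, which is the API mismatch above
out = proc.preprocess_video(video)
print(out["pixel_values"].shape)  # (8, 3, 64, 64)
```

The extra method works, but callers now need to know which entry point to use per input type, which is the incompatibility being discussed.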
> Having a separate modeling file resolves the issue of one class handling both images and videos.
Sure @amyeroberts. I understand now why adding a separate modeling file would make more sense. Had a discussion with @alaradirik too regarding this change on Friday. I'll take care of adding the new modeling file.
> I don't think a VideoProcessor class is necessary at the moment. We already have models with video inputs e.g. VideoMAE which use image processors.
Again makes sense, @amyeroberts. I only meant to suggest adding a VideoProcessor class for Mask2Former in case you were planning on introducing VideoProcessor classes in general as part of the transformers library.
> If we want to use the same processing class as the image model, adding a method such as preprocess_video is a possibility. This does mean the processor won't be compatible with the usual API, i.e. it could not be directly called image_processor(video_inputs)
Would adding a separate image processor class, VideoMask2FormerImageProcessor, make sense if we don't want the image and video models to share the same image processing class?
This way the usual API behaviour would not be broken, and we could directly call image_processor(video_inputs).
> Would adding a separate image processor class, VideoMask2FormerImageProcessor, make sense if we don't want to have the same image processing class for image and video models? This way the usual API behaviour would not be broken. In this case, we can directly use image_processor(video_inputs).
@shivalikasingh95 - yep, I think that works!
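A rough sketch of what the agreed-upon design could look like. This is illustrative only: the class is a mock, the nearest-neighbour resize is a naive stand-in for real preprocessing, and nothing here is the actual transformers code. The point is that a dedicated video processor keeps the usual `processor(inputs)` calling convention by accepting a whole clip in `__call__`.

```python
import numpy as np

class VideoMask2FormerImageProcessorSketch:
    """Hypothetical video processor that keeps the standard __call__ API."""

    def __init__(self, size=(384, 384)):
        self.size = size

    def __call__(self, video):
        # video: (num_frames, H, W, C) uint8 frames
        video = np.asarray(video, dtype=np.float32) / 255.0
        t, h, w, c = video.shape
        # naive nearest-neighbour resize as a stand-in for real preprocessing
        rows = np.arange(self.size[0]) * h // self.size[0]
        cols = np.arange(self.size[1]) * w // self.size[1]
        resized = video[:, rows][:, :, cols]
        # channels-first pixel values, one entry per frame
        return {"pixel_values": resized.transpose(0, 3, 1, 2)}

processor = VideoMask2FormerImageProcessorSketch()
clip = np.random.randint(0, 256, (4, 240, 320, 3), dtype=np.uint8)
inputs = processor(clip)  # the usual API, no extra preprocess_video method
print(inputs["pixel_values"].shape)  # (4, 3, 384, 384)
```

Because the clip goes through `__call__` directly, downstream code can treat this processor exactly like any other image processor, which is the compatibility argument made above.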
@alaradirik and @amyeroberts please feel free to review this PR.
I'm just getting a few failing CI checks due to this error. It would be great if I could get some help on how to fix it.
@amyeroberts could you do a final review? Video Mask2Former is a separate model now and the PR is in good shape :)
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.