Cluster2Former: Semisupervised Clustering Transformers for Video Instance Segmentation (Sensors 2024)
Áron Fóthi, Adrián Szlatincsán, Ellák Somfai
A novel approach for video instance segmentation is presented using semisupervised learning. Our Cluster2Former model leverages scribble-based annotations for training, significantly reducing the need for comprehensive pixel-level masks. We augment a video instance segmenter, for example, the Mask2Former architecture, with similarity-based constraint loss to handle partial annotations efficiently. We demonstrate that despite using lightweight annotations (using only 0.5% of the annotated pixels), Cluster2Former achieves competitive performance on standard benchmarks. The approach offers a cost-effective and computationally efficient solution for video instance segmentation, especially in scenarios with limited annotation resources.
Keywords: transformers; video processing; instance segmentation; semisupervised learning
- A single architecture for panoptic, instance and semantic segmentation.
- Based on Mask2Former, no change in the architecture of the model
- With the use of scribble like annotations and Similarity-based Constraint loss, you can achive competitive performance, but with much less annotation effort compared to the full mask annotation.
- Tensorboard visualization support during training and evaluation
- Support major VIS datasets and scribble version of them: YouTubeVIS 2019/2021, OVIS.
See installation instructions.
See Preparing Datasets for Mask2Former and Cluster2Former.
See Getting Started with Mask2Former and Cluster2Former.
See more in Mask2Former
See Advanced Usage of Mask2Former.
We also provide a set of baseline results and trained models available for download in addition to the Model Zoo of the Mask2Fomer in the Mask2Former and Cluster2Former Model Zoo.
If you use Cluster2Former in your research or wish to refer to the baseline results published in the Model Zoo, please use the following BibTeX entry.
@Article{s24030997,
AUTHOR = {Fóthi, Áron and Szlatincsán, Adrián and Somfai, Ellák},
TITLE = {Cluster2Former: Semisupervised Clustering Transformers for Video Instance Segmentation},
JOURNAL = {Sensors},
YEAR = {2024},
}
Code is based on Mask2Former (https://github.com/facebookresearch/Mask2Former).