Porting SegFormer to HuggingFace Transformers
Hi guys,
First of all thanks for this impressive (and simple) model!
I'd like to port this model to HuggingFace Transformers, which, as you might know, is a library that includes a lot of Transformer-based models (mostly NLP models like BERT and RoBERTa). Recently I've added the Vision Transformer (ViT), DeiT and DETR to the library, so I think SegFormer definitely deserves its place there too!
The API I had in mind could look something like this (very similar to ViT):
from transformers import SegFormerFeatureExtractor, SegFormerForImageSegmentation
from PIL import Image
import requests
feature_extractor = SegFormerFeatureExtractor.from_pretrained("nvidia/segformer-b0-fine-tuned-ade-512-512")
model = SegFormerForImageSegmentation.from_pretrained("nvidia/segformer-b0-fine-tuned-ade-512-512")
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)
inputs = feature_extractor(images=image, return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits # shape (batch_size, num_labels, height/4, width/4)
The main advantage would be that people could train the SegFormer model within a Colab notebook with ease just using a native PyTorch training loop or with frameworks like PyTorch Lighting, HuggingFace Accelerate, etc., and also perform inference very easily as shown above. No scripts required!
The feature extractor should not be a fully-fledged preprocessor; it would probably just need to resize + normalize images so that they can be fed to the model. I guess resizing to 512x512 is a good default option. I would perhaps include a post_process method that can be used to convert the logits of the model to an actual image of the semantic segmentation.
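To make that idea concrete, here's a minimal sketch of what such a post_process step could do (the standalone function below is my own assumption, not the final API): upsample the logits back to the image size and take the per-pixel argmax.
import torch

def post_process(logits, target_size):
    # logits: (batch_size, num_labels, height/4, width/4)
    # target_size: (height, width) of the original image
    upsampled = torch.nn.functional.interpolate(
        logits, size=target_size, mode="bilinear", align_corners=False
    )
    # per-pixel argmax over the class dimension gives the segmentation map
    return upsampled.argmax(dim=1)  # (batch_size, height, width)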
All model checkpoints can be hosted for free on the hub, under the NVIDIA namespace (which currently includes models like Megatron-GPT-2).
Are you interested in helping me finish up this model? My main questions would be:
- what are the most basic image + mask transformations that would work in order to perform inference + fine-tune on a custom dataset? What should the values of the image size be for each of the checkpoints? It seems that for the 512x512 ADE model, the shortest side is 512?
- I guess that if the feature extractor resizes (rescales) images to 512x512, the corresponding masks also need to be resized. But as the model predicts masks at resolution 128x128, does the feature extractor need to resize them to this resolution?
- how is the loss defined? Is this just the CrossEntropyLoss between the predicted mask and the ground truth mask?
Hi, thanks! It would be great to have SegFormer in HuggingFace!
For your questions:
(0) Basic image + mask transformations for inference and fine-tuning on custom datasets
First, let me take lines 7-31 in the ade20k config as an example.
For inference on a new image, you only need three transformations:
For AlignedResize you can refer to (1).
For fine-tuning on other custom datasets, I believe random scale, crop and flip are necessary for augmentation.

Tip: the definitions of these transformations can be found here.
(1) For inference on a custom dataset, we have two modes:
- whole image test. For example, given an image with size 256x300 from ADE20K, we first scale the image's short side to 512 (256x300 --> 512x600), then align the shape so that it is divisible by 32 (512x600 --> 512x608); this step we call AlignedResize.
- slide window test. If the image size is too large, e.g. cityscapes with 1024x2048 image resolution, we can use an overlapping slide window on images, e.g. window_size=1024x1024 and stride=768. Also make sure the window shape is divisible by 32.
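As a minimal sketch of the AlignedResize logic described above (my own code, using PIL and ignoring the long-side cap discussed further down): scale the short side to 512, then round both sides up to a multiple of 32.
import math
from PIL import Image

def aligned_resize(image, short_side=512, align=32):
    # scale so the short side equals `short_side`, keeping the aspect ratio
    w, h = image.size
    scale = short_side / min(w, h)
    new_w, new_h = round(w * scale), round(h * scale)
    # round both sides up so they are divisible by `align` (32 here)
    new_w = math.ceil(new_w / align) * align
    new_h = math.ceil(new_h / align) * align
    return image.resize((new_w, new_h), resample=Image.BILINEAR)
For the example above, a 300x256 image (width x height in PIL terms) would be scaled to 600x512 and then aligned to 608x512.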
(2) The mask shape:
The output feature of the network is 1/4 the resolution of the original image. We then need to upsample it to the input image's size and calculate the per-pixel classification loss.
(3) loss function:
Yes, we only use the CrossEntropyLoss defined here.
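For illustration, a hedged sketch of this training-time computation in plain PyTorch (not the mmseg code itself): upsample the 1/4-resolution logits to the label resolution and apply per-pixel cross-entropy.
import torch.nn.functional as F

def segmentation_loss(logits, labels):
    # logits: (batch_size, num_classes, height/4, width/4)
    # labels: (batch_size, height, width) with integer class ids
    upsampled = F.interpolate(logits, size=labels.shape[-2:], mode="bilinear", align_corners=False)
    return F.cross_entropy(upsampled, labels)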
If you have other questions, feel free to let me know. Thanks!
Ok, thanks for the detailed answer! Each model in HuggingFace Transformers requires 3 files to be implemented:
- configuration_segformer.py, which defines the hyperparameters
- modeling_segformer.py, which implements the model
- feature_extraction_segformer.py, which implements the feature extractor
Actually, I've finished the modeling part (modeling_segformer.py). I've defined two models, SegFormerModel (which is the hierarchical Transformer encoder only) and SegFormerForImageSegmentation (which is SegFormerModel with the all-MLP decoder + classifier on top). This is to streamline the API with BERT for example, which has BertModel (Transformer encoder without any head on top) and BertForSequenceClassification (which is BertModel with a linear layer on top of the [CLS] token) - among other head models.
I'm now working on SegFormerFeatureExtractor (which can be used to prepare images + segmentation maps for the model). I'm not going to include random scale, crop and flip (if people want to use those, they can use torchvision's transforms for example). It will only define two necessary transformations, namely AlignedResize and normalize. I've replaced mmcv.rescale and mmcv.resize by self.resize within the feature extractor, as the SegFormerFeatureExtractor inherits from a class called ImageFeatureExtractionMixin that has a resize method implemented. I guess I can also remove the random_scale from the AlignedResize class, as each of the checkpoints has image scales defined.
However, I need to know what the image scale is for each of the fine-tuned checkpoints. Is it correct that the image scale is (2048, 512) for the AlignedResize for each of the ade20k released checkpoints, and (2048, 1024) for the Resize for each of the cityscapes checkpoints?
So what I basically need now is a dummy image (for example the one from the demo), to resize + normalize it, forward it through the original implementation and through my implementation, and verify whether the logits are exactly equal. Is there an easy way to do this with the original implementation? I'd like to use demo.py, but that also seems to include a RandomFlip transformation (however flip is set to False?).
Hi,
Although random flip is defined in the config's test_pipeline, it is not used for inference unless you set aug-test=True (meaning multi-scale + flip test) in tools/test.py to evaluate the dataset.
But for image_demo.py, it does not use Flip. It only contains AlignedResize+Normalize. No other steps are needed.
So, you can directly compare the results of your implementation and the original one (image_demo.py).
About image_scale=(2048, 512): it means to scale the short side to 512 in most cases, but the long side should be <= 2048 to avoid GPU OOM (because there are a few images with an extremely large aspect ratio). The scale_factor is defined as min(512/short_side, 2048/long_side).
Example1: the original image is (256,304), the scale_factor is min(512/256, 2048/304)=2, after AlignedResize, it is (512,608).
Example2: the original image is (256,2048), the scale_factor is min(512/256, 2048/2048)=1, after AlignedResize, it should be (256,2048)
But it is fine to simply scale the short_side=512. Example2's case is very rare.
For cityscapes, all the images have the same shape (1024x2048), and the image_scale is also set to (1024, 2048). In this case the images will not be resized because scale_factor=1.
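A small sketch reproducing that scale_factor arithmetic for the two ADE20K examples and the cityscapes case (my own snippet, just mirroring the numbers above):
def scale_factor(short_side, long_side, short=512, long=2048):
    # scale_factor = min(short / short_side, long / long_side)
    return min(short / short_side, long / long_side)

print(scale_factor(256, 304))                           # 2.0 -> (256, 304) becomes (512, 608) after aligning to 32
print(scale_factor(256, 2048))                          # 1.0 -> (256, 2048) stays as is
print(scale_factor(1024, 2048, short=1024, long=2048))  # 1.0 -> cityscapes images are not resized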
Ok great, thanks for the response. I've just finished the conversion script (which lets me convert the original checkpoints to their HuggingFace counterpart). Currently, it only complains about the following parameters which are not converted:
RuntimeError: Error(s) in loading state_dict for SegFormerForImageSegmentation:
Unexpected key(s) in state_dict: "conv_seg.weight", "conv_seg.bias".
I see that the SegFormer decoder head inherits from BaseDecodeHead, which defines the conv_seg linear layer. But does it actually use this layer? I see the linear layer for getting the logits is called linear_pred.
The documentation page will look like this:

Next step now is to compare the implementations on the same image.
Hi,
The layer conv_seg is not used, you can remove it.
The documentation page looks very nice!
Ok thanks!
I'm currently testing my implementation and the original one on the same image (one from ADE20k). However, when comparing the pixel values prepared by SegFormerFeatureExtractor to the ones that are created in mmseg, it turns out these are not exactly equal.
The shapes are equal, as well as the initial values (printing pixel_values[0,:3,:3,:3])
# my implementation
tensor([[[-0.7993, -0.7993, -0.8164],
[-0.7993, -0.7993, -0.8164],
[-0.7993, -0.7993, -0.8164]],
[[-0.1975, -0.1975, -0.2150],
[-0.1975, -0.1975, -0.2150],
[-0.1975, -0.1975, -0.2150]],
[[ 0.6705, 0.6705, 0.6531],
[ 0.6705, 0.6705, 0.6531],
[ 0.6705, 0.6705, 0.6531]]])
# original implementation
tensor([[[-0.7993, -0.7993, -0.8164],
[-0.7993, -0.7993, -0.8164],
[-0.7993, -0.8164, -0.8164]],
[[-0.1975, -0.1975, -0.2150],
[-0.1975, -0.1975, -0.2150],
[-0.1975, -0.2150, -0.2150]],
[[ 0.6705, 0.6705, 0.6531],
[ 0.6705, 0.6705, 0.6531],
[ 0.6705, 0.6531, 0.6531]]], device='cuda:0')
And I've also checked the final values (pixel_values[0,-3:,-3:,-3:]), these are also equal. But comparing the sum of the pixel values (pixel_values.sum()), this is tensor(92383.2031) for my implementation (which uses PIL as backend) and tensor(89296.2734) for the original implementation. I see cv2 is used as a backend, which might explain the difference. Will this have a big impact on performance?
Hi,
I am not sure whether using cv2 or PIL to read images will cause a slight difference.
I think you can visualize the result and compare them.
Also, you can calculate the IoU between your implementation and the original one.
If they are almost the same, I believe there is no problem.
Ok thanks!
Hi @NielsRogge , I've once compared PIL vs. cv2 on another semantic segmentation project. They don't seem to introduce major differences (either single scale inference or multi-scale inference).
But I do notice some differences between cv2 and PIL in image resizing as recently pointed out by Jun-Yan Zhu and Richard Zhang et al. In particular PIL introduces anti-aliasing when downsampling while cv2 does not: https://twitter.com/junyanz89/status/1385654389872934926?s=20 https://github.com/GaParmar/clean-fid Not sure if this is partly related to your question.
In my case, switching to PIL gives slight improvements on multi-scale testing (maybe it helps when images are resized to scale 0.5/0.75). But the improvements are marginal and I'm not sure they're statistically significant.
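For anyone who wants to see the effect, a small hedged sketch comparing the two backends on a bilinear downsample (my own snippet; it only illustrates that the pixel values differ slightly, since PIL anti-aliases when downsampling and cv2.resize does not):
import numpy as np
import cv2
from PIL import Image

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(600, 608, 3), dtype=np.uint8)

# PIL bilinear downsample (anti-aliased) vs. cv2 bilinear downsample (no anti-aliasing)
pil_out = np.array(Image.fromarray(img).resize((304, 300), resample=Image.BILINEAR))
cv2_out = cv2.resize(img, (304, 300), interpolation=cv2.INTER_LINEAR)

print(np.abs(pil_out.astype(int) - cv2_out.astype(int)).mean())  # small but non-zero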
Ok, thanks for the information. Yeah in HuggingFace Transformers, all feature extractors (ViTFeatureExtractor, DeiTFeatureExtractor, DetrFeatureExtractor) currently rely on PIL, and they are not meant to be fully-fledged preprocessors, for now they just support some basic operations (resizing, center cropping, normalizing images). I guess the results will not be significantly different, so it's safe to use PIL.
In terms of progress, my current implementation is giving me the same logits as the original implementation! Here's a notebook that performs inference on an image from the ADE20k dataset:
https://colab.research.google.com/drive/17i0XkXKYWgRGUd8J72jwPs_IDRmk0EIa?usp=sharing
Can you help me fix the visualization part? I've set the notebook to be editable.
Note: the colab is with random weights for now, once the visualization part works I'll upload the first weights to HuggingFace's hub, and we'll get a nice segmentation map :)
I'm perhaps planning to add the visualization part to the feature extractor, such that people can simply do feature_extractor.show_results(image, logits). However, this will create an additional dependency on matplotlib, which I'm not sure the authors of HuggingFace are going to like.
Also an additional question: so during training, the model outputs logits of shape (batch_size, num_labels, height/4, width/4), and these need to be upscaled again to the original image size before computing the loss. So this is upscaling to the crop_size, right? Since all images are cropped and padded up to the same crop_size?
I will probably also add random cropping to the feature extractor, such that it can also be used to fine-tune on a custom dataset.
I also plan to add palettes to SegFormerConfig.
Hi @NielsRogge I have finished the vis code in colab, please check it.
If you do not want to involve matplotlib, you can save the image using PIL instead of showing it.
During training, yes, the output feature map will be upsampled from (B,C,H/4,W/4) to (B,C,H,W) before calculating the loss.
Ok great, thanks for looking into it!
Inference now works: https://colab.research.google.com/drive/1Aq2uelaRNubW1iduc2oh0kkUIYamgZkY?usp=sharing
I've uploaded weights of the b0 model to the hub as can be seen here. If the project is finished, I can upload all model variants to the NVIDIA namespace.
I think the main thing to work on to finish this is the feature extractor. So if I understand it correctly:
- during training, images are resized and then randomly cropped to a certain crop_size (omitting the other augmentations). Next, images are padded up to the same crop_size, and then training is performed using cross-entropy loss. So effectively a loss is also incurred for padded pixels? The model outputs logits of shape (batch_size, num_classes, height of the crop_size / 4, width of the crop_size / 4). These are then upscaled to the crop_size and cross-entropy is calculated.
If you can confirm this, then I'll let the feature extractor support:
- resizing + aligned resizing (the latter is only required for ADE20k)
- normalizing
- padding
Hi @NielsRogge
Your understanding is mostly correct; only for pad pixels we ignore the loss, i.e. we only calculate the loss on valid pixels.
So the labels are set to -100 for pad pixels? Can you point me to where this happens in the code?
No, from the config we can see that seg_pad_val=255.
And from decode_head.py the ignore_index=255
So we pad the seg_map with 255 and set ignore_index=255 instead of -100.
But you can set any value, just make sure that seg_pad_val == ignore_index.
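To make the padding behaviour concrete, a hedged sketch in plain PyTorch (not the mmseg code): the segmentation map is padded with the same value that is passed as ignore_index, so the padded pixels simply drop out of the cross-entropy.
import torch
import torch.nn.functional as F

# toy example: 2 classes, labels padded from 3x3 to 4x4 with seg_pad_val = 255
labels = torch.randint(0, 2, (1, 3, 3))
padded_labels = F.pad(labels, (0, 1, 0, 1), value=255)

logits = torch.randn(1, 2, 4, 4)  # already at label resolution here
loss = F.cross_entropy(logits, padded_labels, ignore_index=255)  # padded pixels contribute nothing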
Another question: when calculating the loss, the logits need to be upsampled again as shown here:
https://github.com/NVlabs/SegFormer/blob/93301b33d7b7634b018386681be3a640f5979957/mmseg/models/decode_heads/decode_head.py#L220-L224
Why are we taking seg_label.shape[2:]?
If I understand correctly, the input to the model is of shape (batch_size, num_channels, height, width) and the corresponding labels (ground truth segmentation maps of a batch of images) of shape (batch_size, height, width). So I would assume .shape[1:] instead of .shape[2:].
Hi,
shape[2:] indicates [h, w], which means the seg_logit is upsampled to the shape of seg_label.
Yes, I understand that. But why not shape[1:] instead of shape[2:]? The seg_label has shape (batch_size, height, width), right, or not?
I am not sure, maybe you need to check the size of seg_label. But the last two dimensions should be (h, w) anyway.
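For what it's worth, a hedged illustration of why shape[2:] can still give (h, w): if the ground-truth maps carry a singleton channel dimension inside mmseg (an assumption on my part, not confirmed above), seg_label would be (batch_size, 1, height, width).
import torch

seg_label = torch.zeros(4, 1, 512, 512, dtype=torch.long)  # assumed shape (B, 1, H, W)
print(seg_label.shape[2:])                                 # torch.Size([512, 512])

seg_label_squeezed = seg_label.squeeze(1)                  # (B, H, W)
print(seg_label_squeezed.shape[1:])                        # torch.Size([512, 512])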
Hi,
I'm also defining a SegFormerForImageClassification, as you can also use the SegFormer encoder to classify images. I see here that the classification head projects from the hidden size of the last block to num_labels.
However, the hidden_states of the last block are of shape (batch_size, embed_dim, height // 4, width // 4). So how are the classification logits computed from the last hidden states?
Hi, it would be great if you can support image classification! Our method has a strong relationship with PVTv2. You can refer to lines 288-298 in the PVTv2 classification code.
In detail, for classification we only use the last-stage feature map (with shape h/32 x w/32) and add a layer_norm -> global_pool -> linear classification head on top of it.
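A minimal sketch of such a head, assuming a last-stage feature map of shape (batch_size, embed_dim, h/32, w/32) (module and variable names are mine, not the PVTv2 code):
import torch.nn as nn

class ClassificationHead(nn.Module):
    def __init__(self, embed_dim, num_labels):
        super().__init__()
        self.norm = nn.LayerNorm(embed_dim)
        self.classifier = nn.Linear(embed_dim, num_labels)

    def forward(self, last_feature_map):
        # (B, C, H/32, W/32) -> (B, H*W/1024, C) so LayerNorm acts on the channel dim
        x = last_feature_map.flatten(2).transpose(1, 2)
        x = self.norm(x)
        x = x.mean(dim=1)            # global average pool over spatial positions
        return self.classifier(x)    # (B, num_labels)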
By the way, our PVTv2 is also a very strong vision transformer backbone, does HuggingFace consider supporting it?
If you can support SegFormerForImageClassification, it is technically super easy to support PVTv2.
Yes, that should be possible. However, I will first finish SegFormer. Another question: is the model trained with targets between 1 and 150? And does the model output labels between 1 and 150 (i.e. not between 0 and 149) for ADE20k? Update: it seems the labels are reduced by one during training, so the model outputs labels between 0 and 149. Can you point me to where this happens in the code? Update v2: found it, here.
To do:
- add reduce_zero_label option to SegFormerFeatureExtractor: done
- add padding of images + segmentation maps.
@xieenze this is a notebook illustrating fine-tuning SegFormer on custom data: https://colab.research.google.com/drive/15JeOp3KxEjeTxG74DZc1cGZVaT5gPQjj?usp=sharing
However, I'm not sure whether it works properly already (loss is going down nicely, but inference results don't look good). Could you review my notebook?
Also, regarding the segmentation maps of the ADE20k dataset: am I reading these maps in the correct way?
Thanks!
Hi,
I find that during training, the label values are in [0, 255], which is unreasonable. They should be in [0, 150] since ADE20K has 150 classes.
I think there are some bugs in self.feature_extractor, because the values are correct when reading with PIL from local disk, but after self.feature_extractor they are incorrect.
Thanks for looking into it.
Is this part wrong?
if self.reduce_zero_label:
    if segmentation_maps is not None:
        for idx, map in enumerate(segmentation_maps):
            if not isinstance(map, np.ndarray):
                map = np.array(map)
            # avoid using underflow conversion
            map[map == 0] = 255
            map = map - 1
            map[map == 254] = 255
            segmentation_maps[idx] = Image.fromarray(map.astype(np.uint8))
So what I do is: convert each segmentation map to a NumPy array, and then reduce the labels as was done in the original code.
I don't know how you calculate the loss when the value is 255. If you set 255 as ignore_index when calculating the loss, I think it is fine.
Yes that's the case, as can be seen here. So then I'm mostly ready.
However, it's weird that the inference part doesn't look as expected, after fine-tuning the model. Or is this expected given the number of epochs?
The "seg_pad_val" is used in class Pad, but the "ignore_index" is not used in loss "CrossEntropyLoss" during training, so does it really ignore the ignore_index during training?
I'm not sure what you mean. I'm setting ignore_index equal to 255 when defining the CrossEntropyLoss, so labels having a value of 255 will be ignored. For now, I have not added padding of images (only rescaling - no AlignedResize - plus random cropping and normalization). It seems to work fine without padding, as all images have a size that's at least as big as the crop size. I will add padding later.
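As an aside, here's a hedged sketch of the kind of joint image/mask pipeline I'm describing (rescale + random crop + normalize, written with torchvision functionals; this is only an illustration, not the actual SegFormerFeatureExtractor code):
import torchvision.transforms.functional as TF
from torchvision import transforms

def train_transforms(image, mask, crop_size=(512, 512)):
    # rescale so the short side matches the crop size, keeping the aspect ratio
    image = TF.resize(image, min(crop_size))
    mask = TF.resize(mask, min(crop_size), interpolation=transforms.InterpolationMode.NEAREST)
    # random crop with identical parameters for image and mask
    i, j, h, w = transforms.RandomCrop.get_params(image, output_size=crop_size)
    image, mask = TF.crop(image, i, j, h, w), TF.crop(mask, i, j, h, w)
    # normalize the image only (ImageNet statistics as an assumption)
    image = TF.normalize(TF.to_tensor(image), mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    return image, TF.pil_to_tensor(mask).squeeze(0).long()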
This part of the code:
# avoid using underflow conversion
map[map == 0] = 255
map = map - 1
map[map == 254] = 255
makes sure that background labels are ignored, right? All background labels, which are 0 in the official dataset, are replaced by 255.
I will debug the fine-tuning notebook today. Feel free to help me out :)
For example, I will add metrics like pixel-wise accuracy and mIoU to the training loop, taking into account the ignore_index. Do you know an easy way to calculate those?
Also, at inference time I'm using AlignedResize (but the model is trained with Resize + keep_ratio=True); could this be the issue?
I have my own implementation, I'm not using the mmseg code base.
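Regarding the metrics question above, a hedged sketch of how pixel-wise accuracy and mIoU could be computed while respecting the ignore_index (my own snippet, not taken from either code base):
import numpy as np

def pixel_metrics(pred, label, num_classes, ignore_index=255):
    # pred, label: integer arrays of shape (H, W); pixels equal to ignore_index are skipped
    valid = label != ignore_index
    pred, label = pred[valid], label[valid]
    pixel_acc = (pred == label).mean()

    ious = []
    for cls in range(num_classes):
        inter = np.logical_and(pred == cls, label == cls).sum()
        union = np.logical_or(pred == cls, label == cls).sum()
        if union > 0:
            ious.append(inter / union)
    return pixel_acc, float(np.mean(ious))  # mean IoU over classes that appear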
@NielsRogge Thank you for your work on porting this to HuggingFace. How are you progressing with this? I have tested your implementation (based on your repo) on a custom dataset. It is doing a decent job (the use case is building rooftop extraction), but I am finding the edges of the segmented buildings could benefit from a better pre-trained model and optimization/configuration. I am working on the optimization now. Will you be uploading the larger pre-trained models (b1-b5) to HuggingFace soon? As an example, here is an image illustrating inference based on the model I have generated. Again, thank you!! As you can see, not bad, but needs improvement.

Hi,
Yeah I've tested it on a small dataset and it seems to work well. I'm currently working on another model, but once I have the time I'll add SegFormer to the Huggingface repo.
Really nice to see it works!! Thanks for trying it out