
segformer model implementation != original arch design

Open pure-rgb opened this issue 7 months ago • 6 comments

In the segformer paper, the diagram looks like this

Image

But in this repo, the code is written as below. Why does it have an `encoder_name` attribute? There is no separate CNN feature extractor in the original design.

    @supports_config_loading
    def __init__(
        self,
        encoder_name: str = "resnet34",
        encoder_depth: int = 5,
        encoder_weights: Optional[str] = "imagenet",
        decoder_segmentation_channels: int = 256,
        in_channels: int = 3,
        classes: int = 1,
        activation: Optional[Union[str, Callable]] = None,
        upsampling: int = 4,
        aux_params: Optional[dict] = None,
        **kwargs: dict[str, Any],
    ):
        super().__init__()

        self.encoder = get_encoder(
            encoder_name,
            in_channels=in_channels,
            depth=encoder_depth,
            weights=encoder_weights,
            **kwargs,
        )

        self.decoder = SegformerDecoder(
            encoder_channels=self.encoder.out_channels,
            encoder_depth=encoder_depth,
            segmentation_channels=decoder_segmentation_channels,
        )

        self.segmentation_head = SegmentationHead(
            in_channels=decoder_segmentation_channels,
            out_channels=classes,
            activation=activation,
            kernel_size=1,
            upsampling=upsampling,
        )

        if aux_params is not None:
            self.classification_head = ClassificationHead(
                in_channels=self.encoder.out_channels[-1], **aux_params
            )
        else:
            self.classification_head = None

        self.name = "segformer-{}".format(encoder_name)
        self.initialize()

pure-rgb avatar Sep 15 '25 07:09 pure-rgb

Hey, to instantiate the original architecture, see the params in the pretrained SegFormer models:

https://huggingface.co/smp-hub/segformer-b5-640x640-ade-160k/blob/main/config.json

This config initializes the original transformer encoder + MLP decoder, loads pretrained weights, and reproduces the original inference results.

qubvel avatar Sep 15 '25 10:09 qubvel

Understood.

But then what is this implementation? Do you think it follows the original architecture? The model is named Segformer — isn't that misleading?

pure-rgb avatar Sep 20 '25 07:09 pure-rgb

Hi @pure-rgb ,

It follows the transformers library's implementation, with the only change being SMP's convention of using ResNet as the default backbone. You can easily swap in any timm backbone if needed.

If you expect a strict, paper-exact SegFormer, then SMP is probably not the library you’re looking for.🤗

brianhou0208 avatar Sep 25 '25 18:09 brianhou0208

@brianhou0208

If you expect a strict, paper-exact SegFormer, then SMP is probably not the library you’re looking for.🤗

Understood.

In that case, please add this info to the repo; it's completely misleading otherwise. SegFormer proposes its own architecture design, which is not faithfully followed in this repo, IMHO, and users may be confused when using it in their research or projects.

  • original encoder: Mix Vision Transformer (MiT)
  • this repo: none by default, just an old ImageNet CNN (ResNet-34)
  • original decoder: lightweight MLP-based decoder
  • this repo: adds a convnet

pure-rgb avatar Oct 14 '25 09:10 pure-rgb

Every researcher should understand the code they’re using — that’s just basic research practice.🤗

The SMP library is a high-level convenience wrapper, not a shrine for paper replication. If someone’s goal is to perfectly reproduce the SegFormer paper, the official repository is right there — no one’s stopping them.

By the way, the Mix Vision Transformer (MiT) implementation already exists; the ResNet34 mentioned here is simply a default placeholder, not some profound architectural decision.

brianhou0208 avatar Oct 14 '25 14:10 brianhou0208

Every researcher should understand the code they’re using — that’s just basic research practice

There are researchers, engineers, and ML practitioners. I can say with certainty that this library is a great fit for engineers and ML practitioners, but it doesn't fit the researcher.

The SMP library is a high-level convenience wrapper, not a shrine for paper replication. If someone’s goal is to perfectly reproduce the SegFormer paper, the official repository is right there — no one’s stopping them.

I think I didn’t explain it clearly before. The goal of this library isn’t necessarily to replicate or reproduce results from research papers. However, it should at least adhere to the exact architecture proposed in the original paper.

If there’s an official implementation that specifies a particular combination, for example, encoder_X + decoder_X, that should be explicitly followed. For instance, in this library, the SegFormer model uses a ResNet encoder by default, whereas it should be using MiT.

It’s perfectly fine if the library allows flexibility with different backbones (e.g., ResNet, DenseNet, etc.), but such changes should be clearly documented or indicated in the implementation. The current setup doesn’t make this distinction, which is concerning. If someone imports SegFormer from this library as-is and assumes it’s the actual SegFormer model, that would be misleading.

By the way, the Mix Vision Transformer (MiT) implementation already exists; the ResNet34 mentioned here is simply a default placeholder, not some profound architectural decision.

That’s fine, modifications are acceptable. However, such changes should be clearly stated, either in the code or documentation, which unfortunately isn’t the case here.

Faithful implementation should always come first. At the same time, providing flexibility to modify the architecture with different encoders or decoders is absolutely fine. But when the core architecture is altered, it shouldn’t be referred to by its original name, at least not without clarification. 🤗

If you expect a strict, paper-exact SegFormer, then SMP is probably not the library you’re looking for.

    class Segformer(SegmentationModel):     # NO

    class SMPSegformer(SegmentationModel):  # Makes sense

pure-rgb avatar Oct 14 '25 17:10 pure-rgb

Follow-up question on this. Since the architectures diverge, how does this affect the licensing of SegFormer in this config:

    smp.Segformer(
        encoder_name="mit_b4",
        encoder_weights="imagenet",
        other_params...)

Moritz-Langer avatar Dec 20 '25 20:12 Moritz-Langer

@Moritz-Langer, I'm not a license expert, but I believe that while you are using a "mit_" encoder you are under the NVIDIA license; otherwise, it's the library's MIT license.

qubvel avatar Dec 23 '25 12:12 qubvel