From which model did you fine-tune?
Hi, you said you use ResNet-101 with dilated convolutions, but I can't find any pre-trained ResNet with dilated convolutions on the Internet. I would like to know more details.
Hi
TL;DR: just use a ResNet-101 trained on ImageNet without dilated convolutions.
In my opinion, if one considers a ResNet-101 with dilated convolutions as described by Yu and Koltun [1] (without the context network), the receptive field of every neuron is exactly the same as in the plain ResNet-101, because increasing the dilation factor emulates the downsampling effect of max pooling. So you can take a ResNet-101 trained on ImageNet and output dense predictions (over the 1000 classes) using dilated convolutions, without having to train a specific model with dilated convolutions.
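To make this concrete, here is a minimal sketch, assuming PyTorch/torchvision rather than the Caffe setup used in this repo: torchvision's `replace_stride_with_dilation` option turns the stride-2 convolutions of the last two stages into dilated convolutions while reusing the ImageNet weights, and the ImageNet classifier can then be applied densely as a 1x1 convolution.

```python
import torch
from torchvision.models import resnet101

# replace_stride_with_dilation converts the stride-2 convs of layer3/layer4 into
# stride-1 convs with dilation 2/4, so the pretrained weights stay valid and the
# backbone outputs features at 1/8 resolution instead of 1/32.
backbone = resnet101(weights="IMAGENET1K_V1",
                     replace_stride_with_dilation=[False, True, True])
backbone.eval()

x = torch.randn(1, 3, 512, 512)
with torch.no_grad():
    # run up to the last residual stage to get a dense 1/8-resolution feature map
    feats = backbone.conv1(x)
    feats = backbone.bn1(feats)
    feats = backbone.relu(feats)
    feats = backbone.maxpool(feats)
    feats = backbone.layer1(feats)
    feats = backbone.layer2(feats)
    feats = backbone.layer3(feats)
    feats = backbone.layer4(feats)
    print(feats.shape)  # torch.Size([1, 2048, 64, 64]) -> 512 / 8 = 64

    # apply the ImageNet classifier densely by using its weights as a 1x1 convolution
    w = backbone.fc.weight[:, :, None, None]   # (1000, 2048, 1, 1)
    logits = torch.nn.functional.conv2d(feats, w, bias=backbone.fc.bias)
    print(logits.shape)  # torch.Size([1, 1000, 64, 64]): dense 1000-class scores
```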
This can be very useful. For instance, in DeepLabv3 [2], Chen et al. train the model at 1/16 resolution due to memory limitations and produce dense predictions at 1/8 resolution at test time.
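As a hedged illustration of that train/test trick (again assuming torchvision rather than the papers' original code): dilation does not change any weight shapes, so the same checkpoint can be loaded into a backbone built with a different output stride.

```python
from torchvision.models import resnet101

# train with output stride 16: only the last stage is dilated
train_backbone = resnet101(replace_stride_with_dilation=[False, False, True])
# ... training happens here ...

# at test time, rebuild with output stride 8 and load the very same weights
test_backbone = resnet101(replace_stride_with_dilation=[False, True, True])
test_backbone.load_state_dict(train_backbone.state_dict())  # identical weights, denser output
```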
However, one point still remains: the modification of the first layers (three 3x3 convolutions in PSPNet instead of the single 7x7 convolution in ResNet).
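For reference, here is a sketch of the two stems being compared; the layer widths are my assumption based on the ResNet and PSPNet papers, not taken from this repository's prototxt.

```python
import torch.nn as nn

# original ResNet stem: a single 7x7, stride-2 convolution
resnet_stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
)

# PSPNet-style "deep" stem: three 3x3 convolutions with the same overall stride
pspnet_stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1, bias=False),
    nn.BatchNorm2d(128),
    nn.ReLU(inplace=True),
)
```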
[1] Yu, Fisher, and Vladlen Koltun. "Multi-scale context aggregation by dilated convolutions." arXiv preprint arXiv:1511.07122 (2015).
[2] Chen, Liang-Chieh, et al. "Rethinking Atrous Convolution for Semantic Image Segmentation." arXiv preprint arXiv:1706.05587 (2017).
@howard-mahe May I ask why
one point still remains: the modification of the first layers (three 3x3 convolutions in PSPNet instead of the single 7x7 convolution in ResNet)
What's the problem with using the 7x7 conv?
I don't know, only the authors could answer. Anyway, the main contributions of PSPNet are (1) the PSP module and (2) the observation that a large crop size, a large batch size, and fine-tuning the BN parameters matter, but require multi-GPU training.
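For completeness, here is a minimal sketch of a pyramid pooling (PSP) module along the lines of the paper; the bin sizes (1, 2, 3, 6) and the per-branch channel reduction are assumptions, not this repository's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PSPModule(nn.Module):
    def __init__(self, in_channels=2048, bins=(1, 2, 3, 6)):
        super().__init__()
        reduced = in_channels // len(bins)  # e.g. 2048 // 4 = 512 channels per branch
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.AdaptiveAvgPool2d(bin_size),                 # pool to bin_size x bin_size
                nn.Conv2d(in_channels, reduced, 1, bias=False),  # reduce channels
                nn.BatchNorm2d(reduced),
                nn.ReLU(inplace=True),
            )
            for bin_size in bins
        )

    def forward(self, x):
        h, w = x.shape[2:]
        # upsample each pooled branch back to the input resolution and concatenate
        pooled = [F.interpolate(b(x), size=(h, w), mode="bilinear", align_corners=False)
                  for b in self.branches]
        return torch.cat([x] + pooled, dim=1)  # 2048 + 4 * 512 = 4096 channels

feats = torch.randn(1, 2048, 64, 64)
print(PSPModule()(feats).shape)  # torch.Size([1, 4096, 64, 64])
```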