
Questions on accuracy and annotations

simo23 opened this issue Nov 17, 2017 · 5 comments

Hi,

Thanks for this very useful code. I'd like to ask you a few questions:

  • What final test accuracy do you achieve (with the method explained in the tutorial, Inception-v3)?
  • Are bounding-box annotations used? In your code it seems you use them in both training and testing.
  • Do you know whether the paper by Krause et al. that you cite at the beginning uses bounding boxes to get the 84.4%?

Thanks, Andrea

simo23 avatar Nov 17 '17 16:11 simo23

My final accuracy using only the image is around 82–83%. I haven't been able to reproduce Krause's results, which actually use the Inception-v2 model. I've talked with Krause about his hyperparameters and everything seems fine, so I'm not sure where the difference is coming from.

If bounding boxes are used, then I can get accuracy up to 84% (or higher). The 84.4% from Krause does not use bounding boxes.

gvanhorn38 avatar Nov 21 '17 22:11 gvanhorn38

Thanks for the answer! If you don't mind, I have some further questions:

  • Does Krause use Inception-v3 or Inception-v2? I think it's v3, because they state this in the paper. Another paper, "Spatial Transformer Networks" by Jaderberg et al. (https://arxiv.org/abs/1506.02025, Google DeepMind), uses Inception-v2 and reports up to 82.3% without boxes or anything else.

  • Another thing is the number of iterations: in "Spatial Transformer Networks" they train for 10k iterations with a batch size of 256 before decreasing the learning rate, which works out to 10,000 × 256 / 5,994 ≈ 427 epochs over CUB just for the first phase of training! That is a lot of time. Do you think this could be an issue?

  • The thing I'm most doubtful about is the color augmentation. I know the code is from the official TensorFlow Inception preprocessing, but if you look at the color-augmented images you can sometimes spot very strange colors: the sky turns pink or red, and the birds' colors are completely altered (a rough sketch of the kind of jitter I mean is below this list). I don't know whether CNNs can absorb all of this without repercussions on accuracy. Do you think this could be an issue?

  • Another thing is the batch normalization decay rate, which you set to 0.9997. I know that with a batch size of 32 a high decay rate can help stabilize training, but with a decay of 0.9997 the moving averages have an effective horizon of roughly 1/(1 − 0.9997) ≈ 3,300 steps, so a lot of iterations are needed for them to move from the ImageNet values to the CUB values, right?
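
For reference, the kind of color jitter I mean in the third point is roughly the following (a minimal sketch of the Inception-style preprocessing in TF 1.x; the exact jitter ranges are from memory and only indicative):

```python
import tensorflow as tf  # TF 1.x style, matching the era of this repo

def distort_color(image):
    """Sketch of Inception-style color augmentation.

    `image` is a float tensor in [0, 1]. The jitter ranges below are
    approximately the official Inception preprocessing defaults.
    """
    image = tf.image.random_brightness(image, max_delta=32.0 / 255.0)
    image = tf.image.random_saturation(image, lower=0.5, upper=1.5)
    image = tf.image.random_hue(image, max_delta=0.2)   # this is what turns skies pink
    image = tf.image.random_contrast(image, lower=0.5, upper=1.5)
    return tf.clip_by_value(image, 0.0, 1.0)
```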

Thanks, Andrea

simo23 avatar Nov 23 '17 10:11 simo23

You are right, it was the Spatial Transformer paper that used inception-v2. Krause used inception-v3.

Here are some notes from an email exchange with Krause:

From my notes: an initial learning rate of 0.0045, decayed by 0.94 every 4 epochs, with the checkpoint (selected based on 10% of the training data) taken after ~23k minibatches of size 32. There was also a fair amount of data augmentation, though I think for that and everything else we used the Inception-v3 defaults.

I don't have easily accessible records for the batch norm, dropout, or exponential moving average parameters, since I just used the default values there.

Other things that might make a difference:

  • Data augmentation params
  • Using a central crop at test time (was a ~1.1% difference for us)
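
For concreteness, I read that learning-rate schedule as roughly the following (a sketch only; the steps-per-epoch arithmetic assumes CUB's 5,994 training images at batch size 32, and the RMSProp settings are the usual Inception-v3 defaults, not something Krause confirmed):

```python
import tensorflow as tf  # TF 1.x style sketch

# 5,994 training images at batch size 32 is ~187 minibatches per epoch,
# so "decay by 0.94 every 4 epochs" is a decay step of ~750 minibatches.
steps_per_epoch = 5994 // 32        # ~187
decay_steps = 4 * steps_per_epoch   # ~748

global_step = tf.train.get_or_create_global_step()
learning_rate = tf.train.exponential_decay(
    learning_rate=0.0045,           # initial LR from the notes above
    global_step=global_step,
    decay_steps=decay_steps,
    decay_rate=0.94,
    staircase=True)
# RMSProp settings below are the usual Inception defaults, assumed here.
optimizer = tf.train.RMSPropOptimizer(learning_rate, decay=0.9,
                                      momentum=0.9, epsilon=1.0)
```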

I've found that more conservative data augmentation (bigger crops, less color augmentation) leads to a decrease in performance on the CUB dataset. The dataset is so small that I don't think we can afford to train for too long. Good point regarding the decay rate (and the same for the moving averages). I think I tried lowering those values at one point, but I can't find any notes about the effect.

gvanhorn38 avatar Nov 27 '17 18:11 gvanhorn38

Thank you very much for sharing these details; hopefully they will help solve this mystery.

I mean, we are using:

  • the same framework
  • the same Inception code, written by Google, so there should not be anything wrong there
  • the same images
  • almost the same parameters
  • almost the same augmentation
  • a random crop at training time, 299 px
  • a central crop at test time (because we cannot do anything else), 299 px; see the sketch after this list
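
By "central crop at test time" I mean something like this (a sketch; the 0.875 central fraction is the usual Inception evaluation default, assumed here rather than confirmed for Krause's setup):

```python
import tensorflow as tf  # TF 1.x style sketch

def eval_preprocess(image, height=299, width=299):
    """Central-crop evaluation preprocessing, Inception style.

    `image` is a float tensor in [0, 1]; the 0.875 central fraction is
    the usual Inception default, not taken from the paper.
    """
    image = tf.image.central_crop(image, central_fraction=0.875)
    image = tf.expand_dims(image, 0)
    image = tf.image.resize_bilinear(image, [height, width], align_corners=False)
    return tf.squeeze(image, [0])
```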

And still the accuracy is 2% lower; there must be something quite important missing, don't you think? I would accept a 0.5% difference if it were just down to a parameter setting, but not a huge 2%. And the same 2% drop happens with Inception-v2.

I noticed that you randomly distort the colors only 30% of the time, whereas in the original implementation this is done 100% of the time (roughly as in the sketch below). In the end, the "normal" colors are included in the random color augmentation anyway. Was this a personal choice?
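
What I mean by distorting only 30% of the time is roughly this (a sketch; `distort_fn` stands for whatever color augmentation is actually used):

```python
import tensorflow as tf  # TF 1.x style sketch

def maybe_distort_color(image, distort_fn, distort_prob=0.3):
    """Apply `distort_fn` to `image` only `distort_prob` of the time.

    With distort_prob=1.0 this matches the original Inception behaviour
    of always distorting the colors.
    """
    do_distort = tf.less(tf.random_uniform([], 0.0, 1.0), distort_prob)
    return tf.cond(do_distort,
                   lambda: distort_fn(image),
                   lambda: tf.identity(image))
```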

I also checked the PyTorch implementation, and the only significant difference seems to be that they normalize the data with the per-channel mean and standard deviation of the ImageNet dataset. Here we simply subtract 0.5 and multiply by 2 (see the sketch below), which sounds like a pretty strong approximation, don't you think? (But Krause et al. probably used this too... so the problem is somewhere else.)
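
Side by side, the two normalizations I'm comparing look like this (a sketch; the ImageNet statistics are the standard torchvision values):

```python
import tensorflow as tf  # TF 1.x style sketch; `image` is a float tensor in [0, 1]

def inception_normalize(image):
    # What this repo does: map [0, 1] to [-1, 1].
    return (image - 0.5) * 2.0

def imagenet_normalize(image):
    # What the PyTorch pipelines do: per-channel ImageNet mean/std.
    mean = tf.constant([0.485, 0.456, 0.406], dtype=tf.float32)
    std = tf.constant([0.229, 0.224, 0.225], dtype=tf.float32)
    return (image - mean) / std
```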

Thanks again for the info, I will keep trying!

simo23 avatar Nov 28 '17 08:11 simo23

Yes, I have tried, but the numbers were still off.

A few of my lab mates have also tried, unsuccessfully, to reproduce those numbers.

On Tue, Dec 5, 2017 at 4:13 AM, Andrea Simonelli [email protected] wrote:

Hi @gvanhorn38,

Sorry to bother you again. While checking the code I found that you take 10% of the training set (600 of 5,994 images) to use as a validation set. Did you also try training with those images included, to see how much the accuracy increases? I think the papers use the whole training set for training.

Thanks, Andrea

gvanhorn38 avatar Dec 11 '17 16:12 gvanhorn38