It is slow when the number of categories is large
@winstywang Thanks for this great work! I tried it on 4 Titan X GPUs with Inception-BN.conf. With 4490 categories, training 40 batches takes 43 sec; with 44900 categories, it takes 200 sec.
This makes sense, because the last fully connected layer has to be much larger with 44900 categories.
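For a rough sense of the scaling, here is a back-of-the-envelope sketch (the 1024-dimensional feature before the classifier, as in Inception-BN, and the batch size are assumptions). With 4 GPUs, that much larger weight matrix likely also adds synchronization overhead on every update.

```python
# Back-of-the-envelope cost of the final fully connected layer.
# Assumption: a 1024-dimensional feature before the classifier
# (as in Inception-BN) and a placeholder batch size of 32.
feature_dim = 1024
batch_size = 32

for num_classes in (4490, 44900):
    params = feature_dim * num_classes                  # weight matrix entries
    flops = 2 * batch_size * feature_dim * num_classes  # one forward GEMM
    print("%6d classes: %5.1fM weights, %7.1f MFLOPs per batch"
          % (num_classes, params / 1e6, flops / 1e6))
```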
@winstywang I also tried it in Caffe with GoogLeNet (without batch normalization); 4490 categories and 44900 categories take 20 sec and 32 sec respectively to train 40 batches. There is little difference between the two cases.
It is hard to tell from your description alone which part causes this issue. Could you pass me a minimal set of samples that reproduces it?
Although my CUDA GPU is a GTX 970, training on 40*40 color images with batch size 32 is very slow; it takes much longer than expected and stopped in round 0 with "round 0:[ 2000]246 sec escaped". My conf file is below.
```
# training iterator
data = train
iter = img
  image_list = "./image_list_train.txt"
  image_root = "./data2/train/"
  input_flat = 0
  divideby = 256
  shuffle = 0
iter = end

# evaluation iterator
eval = test
iter = img
  input_flat = 0
  image_list = "./image_list_test.txt"
  image_root = "./data2/test/"
  divideby = 256
  shuffle = 0
iter = end

# global parameters
label_width = 10
label_vec[0,10) = landmarks

netconfig=start
# 3_40_40
layer[0->1] = conv:cv1
  kernel_size = 5
  nchannel = 30
  stride = 2
layer[1->2] = relu:relu1
layer[2->3] = max_pooling:mp1
  kernel_size = 2
  stride = 2
# 30_18_18
layer[3->4] = conv:cv2
  kernel_size = 3
  nchannel = 30
  no_bias = 0
layer[4->5] = relu:relu2
layer[5->6] = max_pooling:mp2
  kernel_size = 2
  stride = 2
layer[6->7] = flatten
#layer[7->7] = dropout
  threshold = 0.5
layer[7->8] = fullc:fc1
  nhidden = 100
  init_sigma = 0.01
layer[8->9] = sigmoid:se1
layer[9->10] = fullc:fc2
  nhidden = 10
  init_sigma = 0.01
layer[10->10] = l2_loss
  target = landmarks
netconfig=end

# input shape not including batch
input_shape = 3,40,40
batch_size = 32

# global parameters
dev = gpu
save_model = 100
max_round = 15
num_round = 15
train_eval = 1
random_type = xavier
#random_type = gaussian

# learning parameters
eta = 0.1
momentum = 0.9
wd = 0.0

# evaluation metric
metric = error
eval_train = 1
# end of config
```
I wish you could provide more details for us to reproduce this issue...
I changed the conf and updated the details in the last post. Now I get:

```
round 0:[ 2400]300 sec escaped [1], train-error : 1 test-error:1
round 1:[ 2400]586 sec escaped [2], train-error : 1 test-error:1
```
Something must be wrong. My train_image_list looks like this:

```
1 0.0851194 0.150562 0.130048 0.10518 0.160999 0.0888869 0.0784231 0.135541 0.167709 0.159056 Aaron_Eckhart_0001_0000000.jpg
2 0.149247 0.0820331 0.105284 0.131967 0.0746481 0.0844444 0.0764162 0.133879 0.165702 0.159134 Aaron_Eckhart_0001_0000001.jpg
3 0.174822 0.106676 0.131367 0.159006 0.100897 0.0902925 0.0835475 0.141192 0.172833 0.167361 Aaron_Eckhart_0001_0000002.jpg
4
```
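As a quick sanity check on the list file, here is a minimal sketch (the paths and label_width mirror the conf above, and the expected field layout of index + label_width labels + file name is an assumption based on the sample lines; adjust as needed):

```python
# Verify each line has the expected number of fields, numeric labels,
# and an image file that actually exists on disk.
import os

image_list = "./image_list_train.txt"   # same list as in the conf above
image_root = "./data2/train/"           # same root as in the conf above
label_width = 10

with open(image_list) as f:
    for lineno, line in enumerate(f, 1):
        fields = line.split()
        if not fields:
            continue
        if len(fields) != 2 + label_width:
            print("line %d: expected %d fields, got %d"
                  % (lineno, 2 + label_width, len(fields)))
            continue
        try:
            [float(x) for x in fields[1:1 + label_width]]
        except ValueError:
            print("line %d: non-numeric label" % lineno)
        path = os.path.join(image_root, fields[-1])
        if not os.path.exists(path):
            print("line %d: missing image %s" % (lineno, path))
```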
Is your problem related to @chengchengowen's problem? Seems you only have 10 classes in total.
It may not be related to @chengchengowen's problem; I posted here just because training is slow and does not converge. My problem is multi-label regression. I am tired and will go to sleep now.
First, I am not sure about the speed of a GTX 970; 260 pics/s seems reasonable to me. To diagnose the problem, first check that I/O is not the bottleneck, since you are using the img list iterator. You can check GPU usage with nvidia-smi to see whether the GPU is fully occupied.
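For example, one rough way to check whether raw image reading and decoding can keep up is the following sketch (it assumes Pillow is installed and reuses the list/root paths from the conf above):

```python
# If decode throughput alone is not well above the observed training
# speed (~260 images/s here), the img list iterator is the likely bottleneck.
import time
from PIL import Image

image_root = "./data2/train/"
image_list = "./image_list_train.txt"

paths = []
with open(image_list) as f:
    for line in f:
        fields = line.split()
        if fields:
            paths.append(image_root + fields[-1])   # file name is the last field

start = time.time()
count = 0
for p in paths[:2000]:
    with Image.open(p) as img:
        img.convert("RGB").load()                   # force the actual decode
    count += 1
elapsed = time.time() - start
print("%d images in %.1f s -> %.0f images/s" % (count, elapsed, count / elapsed))
```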
Second, if the network does not converge, first try a smaller learning rate and see whether it helps. Since I cannot access your data, I cannot tell what the exact problem is.
I was using your old version of CXXNET from the time you released the multi-label training doc. With the new version of CXXNET, training crashes in round 0 at the line "net_trainer->Update(itr_train->Value());". I used nvidia-smi to check whether CUDA is being used, and it seems it is not: while cxxnet is training, the reported GPU utilization is all N/A:

```
.....
Utilization:
    GPU      N/A
    Memory   N/A
    Encoder  N/A
    Decoder  N/A
....
```

What is the problem?
@winstywang I am sorry, our database is not public. I suggest randomly generating some labels for ImageNet images to reproduce this issue.
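For example, a minimal sketch of building such a random-label image list (the ImageNet directory path, output file name, and single-label line format are assumptions; match the format to your img iterator settings):

```python
# Walk an image directory and write "index label filename" lines with
# random labels, so the many-category slowdown can be reproduced
# without the private dataset.
import os
import random

image_root = "./imagenet/train/"     # hypothetical path to ImageNet images
num_classes = 44900                  # the large setting from the report

with open("image_list_random.txt", "w") as out:
    idx = 0
    for dirpath, _, files in os.walk(image_root):
        for name in sorted(files):
            if not name.lower().endswith((".jpg", ".jpeg")):
                continue
            idx += 1
            label = random.randrange(num_classes)
            rel = os.path.relpath(os.path.join(dirpath, name), image_root)
            out.write("%d %d %s\n" % (idx, label, rel))
```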
@chengchengowen Stay tuned. We will try full ImageNet in the coming month.