
Error with kvstore while training SSD example with multiple GPUs on different computers / distributed training

Open encodingwaddles opened this issue 8 years ago • 2 comments

I have successfully used the latest commit (67ba1c9) to train the model and evaluate it by following the example in the README. I am now trying to train SSD across distributed GPUs, but I am encountering some errors and would like your advice.

I used the train_cifar10 example provided in the mxnet repo (/example/image-classification/train_cifar10.py) as the guide for adding kv_store support to the current SSD training script.

Starting with train.py, I added "--kv-store" to the parse_args() function and then passed it into train_net(...,args.kv_store,...). This argument defaults to "device" but can be specified as desired (e.g., "dist_sync").
parser.add_argument('--kv-store', dest='kv_store', help='key-value store type', default='device',type=str)

Inside the function train_net() in train_net.py, I created kv = mx.kvstore.create(kv_store)

Next, based on the kv_store argument, I obtained the corresponding rank / num_workers and fed them as arguments into DetRecordIter for train_iter and val_iter through the cfg.train dictionary.

    if "dist" in kv_store:
        (rank, nworker) = (kv.rank, kv.num_workers)
    else:
        (rank, nworker) = (0, 1)

    cfg.train["rank"] = rank
    cfg.train["nworker"] = nworker

    cfg.valid["rank"] = rank
    cfg.valid["nworker"] = nworker
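For context, here is a minimal sketch of how a rank/nworker pair is commonly used to shard a dataset: each worker reads every nworker-th record, starting at its own rank. The function name is hypothetical and not from the SSD code.

```python
def shard_indices(num_records, rank, nworker):
    """Return the record indices that worker `rank` out of `nworker` reads."""
    # Strided split: worker k takes records k, k + nworker, k + 2*nworker, ...
    return list(range(rank, num_records, nworker))

# With 10 records and 2 workers, the shards are disjoint and cover everything:
print(shard_indices(10, 0, 2))  # [0, 2, 4, 6, 8]
print(shard_indices(10, 1, 2))  # [1, 3, 5, 7, 9]
```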

For the learning rate, in get_lr_scheduler(), I scaled the epoch size as follows when kv_store is a distributed type:

        if 'dist' in kv_store:
            epoch_size /= nworker
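The intuition behind that division, sketched below (this is not the actual get_lr_scheduler code): each worker sees only 1/nworker of the data, so it needs that many fewer batches per epoch, and the lr-decay steps should be scaled accordingly. The function name is illustrative; ceiling division is used so the schedule never falls short of a full epoch.

```python
def per_worker_epoch_size(num_examples, batch_size, nworker):
    # Batches needed to see the full dataset once, on a single worker.
    epoch_size = max(num_examples // batch_size, 1)
    # Each worker processes 1/nworker of those batches; round up.
    return (epoch_size + nworker - 1) // nworker

# e.g. VOC-sized dataset, batch 32, 2 workers:
print(per_worker_epoch_size(16551, 32, 2))
```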

Lastly, in mod.fit(), I added kvstore=kv and tried to run as follows:

python ../../tools/launch.py -n 2 --launcher ssh -H hosts python train.py --batch-size 2 --lr 0.0005 --gpus 0 --kv-store dist_sync

The hosts file contains the IPs of the 2 machines, which can already SSH to each other without passwords and share access to the code through an NFS filesystem.
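For reference, the hosts file expected by launch.py is plain text with one reachable hostname or IP per line; the addresses below are placeholders, not from the original setup:

```
# hosts -- one machine per line (placeholder IPs)
192.168.1.10
192.168.1.11
```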

The errors I'm getting:

(1)

Traceback (most recent call last):
  File "train.py", line 141, in <module>
    voc07_metric=args.use_voc07_metric)
  File "/mxnet/example/ssd/train/train_net.py", line 223, in train_net
    label_pad_width=label_pad_width, path_imglist=train_list, **cfg.train)
  File "/mxnet/example/ssd/dataset/iterator.py", line 55, in __init__
    self.rec = mx.io.ImageDetRecordIter(
AttributeError: 'module' object has no attribute 'ImageDetRecordIte

(2)

[16:19:44] src/io/iter_image_det_recordio.cc:262: ImageDetRecordIOParser: /mxnet/example/ssd/data/train.rec, use 6 threads for decoding..
[16:19:45] src/io/iter_image_det_recordio.cc:315: ImageDetRecordIOParser: /mxnet/example/ssd/data/train.rec, label padding width: 350
INFO:root:Start training with (gpu(0)) from pretrained model /mxnet/example/ssd/model/vgg16_reduced
[16:19:46] src/nnvm/legacy_json_util.cc:190: Loading symbol saved by previous version v0.8.0. Attempting to upgrade...
[16:19:46] src/nnvm/legacy_json_util.cc:198: Symbol successfully upgraded!
INFO:root:Freezed parameters: [conv1_1_weight,conv1_1_bias,conv1_2_weight,conv1_2_bias,conv2_1_weight,conv2_1_bias,conv2_2_weight,conv2_2_bias]
[16:19:50] src/operator/././cudnn_algoreg-inl.h:57: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
[16:19:12] mxnet/dmlc-core/include/dmlc/logging.h:300: [16:19:12] src/kvstore/././kvstore_dist_server.h:211: Check failed: !stored.is_none() init 0 first

Stack trace returned 7 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet-0.9.4-py2.7.egg/mxnet/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7f6c48e54c9c]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet-0.9.4-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet7kvstore17KVStoreDistServer10DataHandleERKN2ps6KVMetaERKNS2_7KVPairsIfEEPNS2_8KVServerIfEE+0xb60) [0x7f6c497e6b20]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet-0.9.4-py2.7.egg/mxnet/libmxnet.so(_ZN2ps8KVServerIfE7ProcessERKNS_7MessageE+0x107) [0x7f6c497db287]
[bt] (3) /usr/local/lib/python2.7/dist-packages/mxnet-0.9.4-py2.7.egg/mxnet/libmxnet.so(_ZN2ps8Customer9ReceivingEv+0x592) [0x7f6c49b04be2]
[bt] (4) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb1a60) [0x7f6c38676a60]
[bt] (5) /lib/x86_64-linux-gnu/libpthread.so.0(+0x8184) [0x7f6c69c3b184]
[bt] (6) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f6c6996837d]

terminate called after throwing an instance of 'dmlc::Error'
  what():  [16:19:12] src/kvstore/././kvstore_dist_server.h:211: Check failed: !stored.is_none() init 0 first

Has anyone tried to train SSD with multiple GPUs on different machines? If so, have you encountered these errors, and how did you fix them? Is there a better way to implement kv_store?

Thanks in advance!

encodingwaddles avatar Apr 27 '17 00:04 encodingwaddles

For (1), update the submodule; the error indicates you don't have the correct python module. For (2), use num_parts and part_index instead of rank and nworker in cfg.train/valid to split the dataset. The second error seems like the kv_store has not been successfully initialized, but I'm not familiar with it.
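Applying that suggestion, the cfg block from the question would set MXNet's standard record-iterator partition keys instead of the custom rank/nworker names. A hedged sketch, with plain dicts standing in for cfg.train / cfg.valid and a helper name of my own invention:

```python
def set_partition(cfg_section, rank, nworker):
    """Write the standard MXNet record-iterator sharding keys into a cfg dict."""
    cfg_section["part_index"] = rank    # this worker's shard
    cfg_section["num_parts"] = nworker  # total number of shards
    return cfg_section

train_cfg = set_partition({}, rank=0, nworker=2)
print(train_cfg)  # {'part_index': 0, 'num_parts': 2}
```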

zhreshold avatar Apr 27 '17 15:04 zhreshold

Did you solve it? I also encountered this problem.

picctree avatar Sep 04 '17 12:09 picctree