
Training and test mAP@50 and mAP@50:95 are showing very strange results in fedcv object detection


I am training in Cross-Silo Horizontal distributed training mode with the following configuration settings:

common_args:
  training_type: "cross_silo"
  random_seed: 0
  scenario: "horizontal"
  using_mlops: false
  config_version: release
  name: "exp" # yolo
  project: "runs/train" # yolo
  exist_ok: true # yolo

environment_args:
  bootstrap: /home/gpuadmin/Project/FedML/python/app/fedcv/object_detection/config/bootstrap.sh

data_args:
  dataset: "bdd"
  data_cache_dir: ~/fedcv_data
  partition_method: "homo"
  partition_alpha: 0.5
  data_conf: "/home/gpuadmin/Project/FedML/python/app/fedcv/object_detection/data/bdd.yaml" # yolo
  img_size: [640, 640] # [640, 640]

model_args:
  model: "yolov5" # "yolov5"
  class_num: 13
  yolo_cfg: "/home/gpuadmin/Project/FedML/python/app/fedcv/object_detection/model/yolov5/models/yolov5s.yaml" # "./model/yolov6/configs/yolov6s.py" # yolo
  yolo_hyp: "/home/gpuadmin/Project/FedML/python/app/fedcv/object_detection/config/hyps/hyp.scratch.yaml" # yolo
  weights: "none" # "best.pt" # yolo
  single_cls: false # yolo
  conf_thres: 0.001 # yolo
  iou_thres: 0.6 # for yolo NMS
  yolo_verbose: true # yolo

train_args:
  federated_optimizer: "FedAvg"
  client_id_list:
  client_num_in_total: 2
  client_num_per_round: 2
  comm_round: 10
  epochs: 4
  batch_size: 64
  client_optimizer: sgd
  lr: 0.01
  weight_decay: 0.001
  checkpoint_interval: 1
  server_checkpoint_interval: 1

validation_args:
  frequency_of_the_test: 2

device_args:
  worker_num: 2
  using_gpu: true
  gpu_mapping_file: /home/gpuadmin/Project/FedML/python/app/fedcv/object_detection/config/gpu_mapping.yaml
  gpu_mapping_key: mapping_config5_2
  gpu_ids: [0,1,2,3,4,5,6,7]

comm_args:
  backend: "MQTT_S3"
  mqtt_config_path: /home/gpuadmin/Project/FedML/python/app/fedcv/object_detection/config/mqtt_config.yaml
  s3_config_path: /home/gpuadmin/Project/FedML/python/app/fedcv/object_detection/config/s3_config.yaml

tracking_args:
  log_file_dir: /home/gpuadmin/Project/FedML/python/app/fedcv/object_detection/log
  enable_wandb: true
  wandb_key: ee0b5f53d949c84cee7decbe7a6
  wandb_project: fedml
  wandb_name: fedml_torch_object_detection
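
For reference, this YAML is consumed through FedML's standard entry point; a minimal sketch of the usual pattern (the calls below are the generic FedML API, and the exact fedcv runner script may differ):

import fedml

# fedml.init() parses the YAML passed via --cf into an args namespace,
# so the sections above become attributes such as args.batch_size.
args = fedml.init()

# Map this process to a GPU according to device_args / gpu_mapping_file.
device = fedml.device.get_device(args)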

During training I get very high mAP@50 and mAP@50:95 for almost every epoch, from the very beginning to the end. Normally mAP should be small in the early epochs and grow slowly in later ones, but in my case it just fluctuates in the range 0.985 ~ 0.9885 for both clients. I have checked the metric calculation functions borrowed from the original YOLOv5 PyTorch implementation, and they work fine. If anybody can share their preliminary results for distributed object detection on any dataset (COCO or PASCAL VOC), I would like to verify my results against theirs. For solo YOLOv5s model training, mAP is much smaller and grows epoch by epoch.
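
If it helps others sanity-check the numbers, below is a minimal, self-contained sketch of how mAP could be verified independently of the YOLOv5 metric code, using torchmetrics (an assumption on my side: torchmetrics is installed, and model / val_loader are hypothetical placeholders for a detector and dataloader in torchvision-style format, not FedML or fedcv APIs):

import torch
from torchmetrics.detection.mean_ap import MeanAveragePrecision

# Default IoU thresholds cover the COCO range 0.5:0.95.
metric = MeanAveragePrecision()

model.eval()  # hypothetical placeholder for the (aggregated) detector
with torch.no_grad():
    for images, targets in val_loader:  # hypothetical validation loader
        # preds/targets: lists of dicts with "boxes" (xyxy), "scores", "labels"
        preds = model(images)
        metric.update(preds, targets)

results = metric.compute()
print("mAP@50:   ", results["map_50"].item())
print("mAP@50:95:", results["map"].item())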

Any clue from the authors would be very much appreciated.

P.S. For mAP calculation, I used the default val(train_data, device, args) function inside the YOLOv5Trainer class in yolov5_trainer.py.
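
As a further cross-check, a round's aggregated weights could also be scored with the stock YOLOv5 validation entry point, e.g. (a sketch: it assumes the snippet runs from inside a YOLOv5 checkout and that the checkpoint is saved in YOLOv5's native .pt format; aggregated_model.pt is a hypothetical path):

import val  # yolov5/val.py from the stock YOLOv5 repository

val.run(
    data="data/bdd.yaml",           # same dataset yaml as data_conf above
    weights="aggregated_model.pt",  # hypothetical checkpoint path
    imgsz=640,                      # matches img_size above
    conf_thres=0.001,               # matches model_args.conf_thres
    iou_thres=0.6,                  # matches model_args.iou_thres
    task="val",
)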

Adeelbek, Oct 12 '22 02:10