
Question of NUM_CLASSES

Open bbangbin2780 opened this issue 5 years ago • 7 comments

I have a question about training on a Korean dataset.

I followed the steps below:

  1. write the config file
  2. register the dataset (my dataset name is AISL dataset)
  3. start training with the command below:
$ python tools/train_net.py --num-gpus 4 --config-file

Below is my config file (I only changed the dataset name from the Total-Text config file):

_BASE_: "./Base-RCNN-FPN.yaml"
MODEL:
  MASK_ON: True
  TEXTFUSENET_MUTIL_PATH_FUSE_ON: True
  WEIGHTS: "./out_dir_r101/totaltext_model/model_tt_r101.pth"
  PIXEL_STD: [57.375, 57.120, 58.395]
  RESNETS:
    STRIDE_IN_1X1: False  # this is a C2 model
    NUM_GROUPS: 32
    WIDTH_PER_GROUP: 8
    DEPTH: 101
  ROI_HEADS:
    NMS_THRESH_TEST: 0.4
  TEXTFUSENET_SEG_HEAD:
    FPN_FEATURES_FUSED_LEVEL: 1
    POOLER_SCALES: (0.125,)

DATASETS:
  TRAIN: ("AISLText",)
  TEST: ("AISLText",)
SOLVER:
  IMS_PER_BATCH: 8
  BASE_LR: 0.001
  STEPS: (40000,80000,)
  MAX_ITER: 120000
  CHECKPOINT_PERIOD: 2500

INPUT:
  MIN_SIZE_TRAIN: (800,1000,1200)
  MAX_SIZE_TRAIN: 1500
  MIN_SIZE_TEST: 800
  MAX_SIZE_TEST: 1333


OUTPUT_DIR: "./out_dir_r101/at_model/"
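Note that the yaml above does not override MODEL.ROI_HEADS.NUM_CLASSES, so the effective value comes from the base config and the TextFuseNet defaults. A minimal sketch for printing the value that actually gets used, assuming TextFuseNet's bundled detectron2 fork already defines the TEXTFUSENET_* keys in get_cfg(), and using a hypothetical path for the yaml above:

# Sketch: print the class count the ROI heads will be built with.
from detectron2.config import get_cfg  # TextFuseNet's modified detectron2

cfg = get_cfg()
cfg.merge_from_file("configs/ocr/aisl_101_FPN.yaml")  # hypothetical path to the yaml above
print(cfg.MODEL.ROI_HEADS.NUM_CLASSES)                # should match the dataset's class set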

I register the dataset with register_coco_instances in detectron2/data/datasets/builtin.py:

image_path = "/home/ensa/JYB/TextFuseNet/datasets/AISLText/train_images"
json_path = "/home/ensa/JYB/TextFuseNet/datasets/AISLText/trainval.json"
register_coco_instances("AISLText", {}, json_path, image_path)
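A small sanity check, assuming the standard Detectron2 dataset APIs, that prints the category list Detectron2 derives from trainval.json so it can be compared with the class set the pretrained TextFuseNet heads expect:

from detectron2.data import DatasetCatalog, MetadataCatalog

DatasetCatalog.get("AISLText")          # forces trainval.json to be parsed
meta = MetadataCatalog.get("AISLText")
print(len(meta.thing_classes), meta.thing_classes)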

An error occurs during training:

[01/19 18:35:50 d2.data.datasets.coco]: Loaded 3 images in COCO format from /home/ensa/JYB/TextFuseNet/datasets/AISLText/trainval.json
[01/19 18:35:50 d2.data.build]: Removed 0 images with no usable annotations. 3 images left.
[01/19 18:35:50 d2.data.build]: Distribution of training instances among all 31 categories:
|  category  | #instances   |  category  | #instances   |  category  | #instances   |
|:----------:|:-------------|:----------:|:-------------|:----------:|:-------------|
|     -      | 2            |     0      | 2            |     1      | 2            |
|     3      | 3            |     5      | 1            |     7      | 2            |
|     A      | 2            |     B      | 2            |     E      | 4            |
|     K      | 2            |     L      | 2            |     R      | 1            |
|     a      | 1            |     b      | 1            |     c      | 1            |
|     e      | 2            |     i      | 1            |     m      | 1            |
|     o      | 2            |     r      | 3            |     t      | 1            |
|    text    | 7            |     u      | 1            |     y      | 1            |
|     강      | 1            |     료      | 1            |     실      | 3            |
|     의      | 1            |     자      | 1            |     장      | 1            |
|     화      | 1            |            |              |            |              |
|   total    | 56           |            |              |            |              |
[01/19 18:35:50 d2.data.detection_utils]: TransformGens used in training: [ResizeShortestEdge(short_edge_length=(800, 1000, 1200), max_size=1500, sample_style='choice'), RandomFlip(), RandomContrast(intensity_min=0.5, intensity_max=1.5), RandomBrightness(intensity_min=0.5, intensity_max=1.5), RandomSaturation(intensity_min=0.5, intensity_max=1.5), RandomLighting(scale=1.1931034212737668)]
[01/19 18:35:50 d2.data.build]: Using training sampler TrainingSampler
[01/19 18:35:51 fvcore.common.checkpoint]: Loading checkpoint from ./out_dir_r101/totaltext_model/model_tt_r101.pth
[01/19 18:35:51 d2.engine.train_loop]: Starting training from iteration 0
[01/19 18:35:53 d2.engine.hooks]: Total training time: 0:00:01 (0:00:00 on hooks)
Traceback (most recent call last):
  File "tools/train_net.py", line 161, in <module>
    args=(args,),
  File "/home/ensa/JYB/TextFuseNet/detectron2/engine/launch.py", line 49, in launch
    daemon=False,
  File "/home/ensa/anaconda3/envs/textfusenet2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/home/ensa/anaconda3/envs/textfusenet2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/ensa/anaconda3/envs/textfusenet2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/ensa/JYB/TextFuseNet/detectron2/engine/launch.py", line 84, in _distributed_worker
    main_func(*args)
  File "/home/ensa/JYB/TextFuseNet/tools/train_net.py", line 149, in main
    return trainer.train()
  File "/home/ensa/JYB/TextFuseNet/detectron2/engine/defaults.py", line 356, in train
    super().train(self.start_iter, self.max_iter)
  File "/home/ensa/JYB/TextFuseNet/detectron2/engine/train_loop.py", line 132, in train
    self.run_step()
  File "/home/ensa/JYB/TextFuseNet/detectron2/engine/train_loop.py", line 212, in run_step
    loss_dict = self.model(data)
  File "/home/ensa/anaconda3/envs/textfusenet2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ensa/anaconda3/envs/textfusenet2/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 442, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/ensa/anaconda3/envs/textfusenet2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ensa/JYB/TextFuseNet/detectron2/modeling/meta_arch/rcnn.py", line 88, in forward
    _, detector_losses = self.roi_heads(images, features, proposals, gt_instances)
  File "/home/ensa/anaconda3/envs/textfusenet2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ensa/JYB/TextFuseNet/detectron2/modeling/roi_heads/roi_heads.py", line 584, in forward
    losses.update(self._forward_mask(features_list, proposals, targets))
  File "/home/ensa/JYB/TextFuseNet/detectron2/modeling/roi_heads/roi_heads.py", line 684, in _forward_mask
    mask_features = self.mutil_path_fuse_module(mask_features, global_context, proposals)
  File "/home/ensa/anaconda3/envs/textfusenet2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ensa/JYB/TextFuseNet/detectron2/modeling/roi_heads/mutil_path_fuse_module.py", line 110, in forward
    feature_fuse = char_context + x + global_context
RuntimeError: The size of tensor a (19) must match the size of tensor b (145) at non-singleton dimension 0

To test whether training works at all, I used just 3 images, and this error occurred.

I compared your sample COCO format with my COCO format, and they appear to be the same.

I need to train on at least 1,000 characters. Is this error related to the number of characters, or to the input size?

Thank you for reading. Please help.

bbangbin2780 avatar Jan 19 '21 09:01 bbangbin2780

@bbangbin2780 It seems that the numbers of instances in char_context, x, and global_context are not equal. This implementation is only trained with batch size 4 on 4 GPUs. Our 64 classes are text, 0-9, a-z, A-Z, and background.
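For illustration only (shapes are assumed, not taken from the code): the fuse step adds three per-ROI tensors elementwise, so all three must hold the same number of instances along dimension 0, which is exactly what the traceback complains about:

import torch

x = torch.randn(145, 256, 14, 14)               # per-proposal mask features (assumed shape)
global_context = torch.randn(145, 256, 14, 14)  # global context, one slice per proposal
char_context = torch.randn(19, 256, 14, 14)     # character-level context with fewer instances

try:
    feature_fuse = char_context + x + global_context  # broadcasting cannot reconcile dim 0
except RuntimeError as err:
    print(err)  # "The size of tensor a (19) must match the size of tensor b (145) ..."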

Real-YeJ avatar Jan 19 '21 12:01 Real-YeJ

I appreciate your answer, thanks.

I modified my config file (batch size 4).

Then the error below occurred:

Traceback (most recent call last):
  File "tools/train_net.py", line 161, in <module>
    args=(args,),
  File "/home/ensa/JYB/TextFuseNet/detectron2/engine/launch.py", line 49, in launch
    daemon=False,
  File "/home/ensa/anaconda3/envs/textfusenet2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/home/ensa/anaconda3/envs/textfusenet2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/ensa/anaconda3/envs/textfusenet2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/ensa/JYB/TextFuseNet/detectron2/engine/launch.py", line 84, in _distributed_worker
    main_func(*args)
  File "/home/ensa/JYB/TextFuseNet/tools/train_net.py", line 149, in main
    return trainer.train()
  File "/home/ensa/JYB/TextFuseNet/detectron2/engine/defaults.py", line 356, in train
    super().train(self.start_iter, self.max_iter)
  File "/home/ensa/JYB/TextFuseNet/detectron2/engine/train_loop.py", line 132, in train
    self.run_step()
  File "/home/ensa/JYB/TextFuseNet/detectron2/engine/train_loop.py", line 212, in run_step
    loss_dict = self.model(data)
  File "/home/ensa/anaconda3/envs/textfusenet2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ensa/anaconda3/envs/textfusenet2/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 442, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/ensa/anaconda3/envs/textfusenet2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ensa/JYB/TextFuseNet/detectron2/modeling/meta_arch/rcnn.py", line 88, in forward
    _, detector_losses = self.roi_heads(images, features, proposals, gt_instances)
  File "/home/ensa/anaconda3/envs/textfusenet2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ensa/JYB/TextFuseNet/detectron2/modeling/roi_heads/roi_heads.py", line 581, in forward
    losses = self._forward_box(features_list, proposals)
  File "/home/ensa/JYB/TextFuseNet/detectron2/modeling/roi_heads/roi_heads.py", line 650, in _forward_box
    return outputs.losses()
  File "/home/ensa/JYB/TextFuseNet/detectron2/modeling/roi_heads/fast_rcnn.py", line 267, in losses
    "loss_box_reg": self.smooth_l1_loss(),
  File "/home/ensa/JYB/TextFuseNet/detectron2/modeling/roi_heads/fast_rcnn.py", line 209, in smooth_l1_loss
    self.proposals.tensor, self.gt_boxes.tensor
  File "/home/ensa/JYB/TextFuseNet/detectron2/modeling/box_regression.py", line 66, in get_deltas
    assert (src_widths > 0).all().item(), "Input boxes to Box2BoxTransform are not valid!"
RuntimeError: CUDA error: device-side assert triggered

Does this error occur if the number of classes is greater than 64?

I want to train on at least 1,000 characters.
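One way to narrow this down, sketched under the assumption that the annotations (not the model) are at fault: device-side asserts surface asynchronously, so the reported Box2BoxTransform line may not be the real failure point. Two things worth checking in the training json are degenerate boxes and category ids outside the expected range:

import json

with open("/home/ensa/JYB/TextFuseNet/datasets/AISLText/trainval.json") as f:
    coco = json.load(f)

# COCO bboxes are [x, y, width, height]; non-positive w/h trips the Box2BoxTransform assert.
bad_boxes = [a["id"] for a in coco["annotations"] if a["bbox"][2] <= 0 or a["bbox"][3] <= 0]
cat_ids = sorted({a["category_id"] for a in coco["annotations"]})
print("annotations with non-positive width/height:", bad_boxes)
print("number of distinct category ids:", len(cat_ids))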

Thanks

bbangbin2780 avatar Jan 20 '21 06:01 bbangbin2780

@bbangbin2780 IMS_PER_BATCH should be set to 4 when using 4 GPUs. If you set more classes, the pred_branches in our model will be skipped when training on your custom dataset.
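For context (standard Detectron2 behaviour, not specific to this repo): SOLVER.IMS_PER_BATCH is the total batch size across all GPUs, so 4 images with --num-gpus 4 means one image per GPU. A trivial check:

num_gpus = 4
ims_per_batch = 4  # SOLVER.IMS_PER_BATCH from the yaml
assert ims_per_batch % num_gpus == 0, "IMS_PER_BATCH must divide evenly across GPUs"
print(ims_per_batch // num_gpus, "image(s) per GPU")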

Real-YeJ avatar Jan 20 '21 07:01 Real-YeJ

I have four GPUs (4 TITAN RTX).

If the number of classes is over 64, that error occurs.

bbangbin2780 avatar Jan 20 '21 08:01 bbangbin2780

@bbangbin2780 If you change the number of classes, several configs in detectron2/data/datasets/builtin.py should be modified as well.

Real-YeJ avatar Jan 20 '21 14:01 Real-YeJ

> @bbangbin2780 If you change the number of classes, several configs in detectron2/data/datasets/builtin.py should be modified as well.

Why are pred_branches in your model skipped if I set the number of classes to more than 63?

I read your paper again, but I still don't understand why pred_branches are skipped.

bbangbin2780 avatar Jan 27 '21 01:01 bbangbin2780

@Real-YeJ I have the same problem. I have updated the file detectron2/data/datasets/builtin.py, and this is my config:

_BASE_: "./Base-RCNN-FPN.yaml"
MODEL:
  MASK_ON: True
  TEXTFUSENET_MUTIL_PATH_FUSE_ON: True
  WEIGHTS: ""
  PIXEL_STD: [57.375, 57.120, 58.395]
  RESNETS:
    STRIDE_IN_1X1: False  # this is a C2 model
    NUM_GROUPS: 32
    WIDTH_PER_GROUP: 8
    DEPTH: 50
  ROI_HEADS:
    NMS_THRESH_TEST: 0.3
  TEXTFUSENET_SEG_HEAD:
    FPN_FEATURES_FUSED_LEVEL: 2
    POOLER_SCALES: (0.0625,)

DATASETS:
  TRAIN: ("mydataset",)
  TEST: ("mydataset",)
SOLVER:
  IMS_PER_BATCH: 1
  BASE_LR: 0.001
  STEPS: (40000,80000,)
  MAX_ITER: 120000
  CHECKPOINT_PERIOD: 2500

INPUT:
  MIN_SIZE_TRAIN: (800,1000,1200)
  MAX_SIZE_TRAIN: 1500
  MIN_SIZE_TEST: 800
  MAX_SIZE_TEST: 1500

OUTPUT_DIR: "./out_dir_r101/icdar2013_model/"

And my command line is:

python train_net.py --num-gpus 1 --config-file configs/ocr/icdar2013_101_FPN.yaml

In the file detectron2/data/datasets/builtin.py I added one more key to the dict PREDEFINED_SPLITS_COCO["coco"]:

"mydataset":("F:/project_2/New_folder/data/downloads", "F:/project_2/New_folder/data/downloads/train.json")

But I still get the error below:

File "/home/ensa/JYB/TextFuseNet/detectron2/modeling/box_regression.py", line 66, in get_deltas assert (src_widths > 0).all().item(), "Input boxes to Box2BoxTransform are not valid!" RuntimeError: CUDA error: device-side assert triggered

ducthinh14091999 avatar Apr 28 '22 02:04 ducthinh14091999