Training gets stuck in the generator

Open luhgit opened this issue 6 years ago • 26 comments

Hi,

I am training the EAST model using the following command on my own images:

python multigpu_train.py --gpu_list=0 --input_size=512 --batch_size_per_gpu=14 --checkpoint_path=tmp/east_icdar2015_resnet_v1_50_rbox/ --text_scale=512 --training_data_path=data/train/ --geometry=RBOX --learning_rate=0.0001 --num_readers=24 --pretrained_model_path=tmp/resnet_v1_50.ckpt

The problem is that training never starts: it gets stuck in the generator, or more precisely in the get_batch() function.

Here is the output from console:

Use standard file APIs to check for files with this prefix.
step 0
Generator use 10 batches for buffering, this may take a while, you can tune this yourself.
3 training images in data/train/
3 training images in data/train/
3 training images in data/train/
3 training images in data/train/
3 training images in data/train/
3 training images in data/train/
3 training images in data/train/
3 training images in data/train/
3 training images in data/train/
3 training images in data/train/
3 training images in data/train/
3 training images in data/train/
3 training images in data/train/
3 training images in data/train/
3 training images in data/train/
3 training images in data/train/
3 training images in data/train/
3 training images in data/train/
3 training images in data/train/
3 training images in data/train/
3 training images in data/train/
3 training images in data/train/
3 training images in data/train/
3 training images in data/train/

It does not move forward from here. I then checked the code of the get_batch() function and found that it stays in the else branch (marked with a comment in the code below) forever.

def get_batch(num_workers, **kwargs):
    try:
        enqueuer = GeneratorEnqueuer(generator(**kwargs), use_multiprocessing=True)
        print('Generator use 10 batches for buffering, this may take a while, you can tune this yourself.')
        enqueuer.start(max_queue_size=10, workers=num_workers)
        generator_output = None
        while True:
            while enqueuer.is_running():
                if not enqueuer.queue.empty():
                    generator_output = enqueuer.queue.get()
                    break
                else:
                    # Control reaches this branch but never gets out of it!
                    time.sleep(0.01)
            yield generator_output
            generator_output = None
    finally:
        if enqueuer is not None:
            enqueuer.stop()
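
One way to narrow this down would be to bound the wait instead of sleeping forever, so that an empty queue produces an error rather than a silent hang. Below is a minimal sketch using only the standard library. The helper name get_with_timeout and its parameters are hypothetical and not part of EAST, and a thread stands in for the real worker processes. It assumes the enqueuer's queue raises queue.Empty from get_nowait(), as the standard queue and multiprocessing queues do.

```python
import queue
import threading
import time


def get_with_timeout(q, is_running, timeout_s=5.0, poll_s=0.01):
    """Poll q until an item arrives; raise if nothing shows up within timeout_s."""
    deadline = time.monotonic() + timeout_s
    while is_running():
        try:
            return q.get_nowait()
        except queue.Empty:
            if time.monotonic() > deadline:
                raise TimeoutError(
                    "no batch produced within %.1f s; the workers may be "
                    "failing or skipping every sample" % timeout_s)
            time.sleep(poll_s)
    return None


# Toy demonstration: a thread stands in for the real generator workers.
q = queue.Queue(maxsize=10)
producer = threading.Thread(target=lambda: q.put("batch-0"))
producer.start()
first = get_with_timeout(q, is_running=lambda: True, timeout_s=2.0)
producer.join()

# A queue that never receives anything now fails loudly instead of hanging.
try:
    get_with_timeout(queue.Queue(), is_running=lambda: True, timeout_s=0.2)
    stalled = False
except TimeoutError:
    stalled = True
```

If the timeout fires, the next thing to check would be whether the generator itself ever yields, for example by calling it directly in the main process without the enqueuer.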

My CPU is almost idle:

Processes: 496 total, 3 running, 1 stuck, 492 sleeping, 2512 threads                            17:24:51
Load Avg: 1.49, 1.96, 2.31  CPU usage: 8.81% user, 9.74% sys, 81.43% idle

I am using TensorFlow 1.13.2 and OpenCV 4 on a MacBook Pro.

Has anyone else faced the same problem? If so, how did you fix it?

Thanks!

luhgit avatar Aug 30 '19 14:08 luhgit

3 training images are not enough. Use 10+ images, because the generator buffers 10 batches.

IgorZorkov avatar Aug 30 '19 15:08 IgorZorkov

I am now using 16 images and the problem still persists.

Generator use 10 batches for buffering, this may take a while, you can tune this yourself.
16 training images in data/train/
16 training images in data/train/
16 training images in data/train/
16 training images in data/train/
16 training images in data/train/
16 training images in data/train/
16 training images in data/train/
16 training images in data/train/
16 training images in data/train/
16 training images in data/train/
16 training images in data/train/
16 training images in data/train/
16 training images in data/train/
16 training images in data/train/
16 training images in data/train/
16 training images in data/train/
16 training images in data/train/
16 training images in data/train/
16 training images in data/train/
16 training images in data/train/
16 training images in data/train/
16 training images in data/train/
16 training images in data/train/
16 training images in data/train/

luhgit avatar Aug 30 '19 15:08 luhgit

Are you using a GPU or the CPU?

IgorZorkov avatar Aug 30 '19 15:08 IgorZorkov

I am using only the CPU because I have a built-in Intel graphics card, which I guess is not supported by TensorFlow?

luhgit avatar Aug 30 '19 15:08 luhgit

Training even one epoch on a CPU will take a very long time. Use Google Colab with a GPU.

IgorZorkov avatar Aug 30 '19 15:08 IgorZorkov

How do you suggest running this GitHub project in Google Colab? Right now I am running it from the terminal, since it takes command-line arguments!

luhgit avatar Sep 02 '19 07:09 luhgit

%cd /content
!git clone https://github.com/argman/EAST

%cd /content/EAST
!python eval.py --test_data_path=/training_samples/ --gpu_list=0 --checkpoint_path=MYPATH --output_dir=/tmp/

IgorZorkov avatar Sep 02 '19 08:09 IgorZorkov

There's no problem running this code in Colab.

IgorZorkov avatar Sep 02 '19 08:09 IgorZorkov

After training, when you want to freeze the model to a .pb file, ask me and I will explain how to do it in Colab.

IgorZorkov avatar Sep 02 '19 08:09 IgorZorkov

Oh, perfect! Thank you very much for the hint. I will try to train it there and hope it will not have the problem I had on my local machine. I will come back to you after I run it there!

luhgit avatar Sep 02 '19 08:09 luhgit

OK, don't forget to change the runtime type to GPU in the menu.
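
To confirm the runtime change took effect, something like the following could be run in a cell. It is a rough sketch that assumes the nvidia-smi tool exits with status 0 only when a GPU is reachable. The helper name gpu_visible is hypothetical.

```python
import subprocess


def gpu_visible(timeout_s=10):
    """Return True if `nvidia-smi` runs and exits with status 0, else False."""
    try:
        result = subprocess.run(
            ["nvidia-smi"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout_s,
        )
    except (OSError, subprocess.TimeoutExpired):
        # The tool is missing, cannot be executed, or did not answer in time.
        return False
    return result.returncode == 0


has_gpu = gpu_visible()
print("GPU visible:", has_gpu)
```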

IgorZorkov avatar Sep 02 '19 09:09 IgorZorkov

Oh yeah, I almost forgot! I am now running the training in Colab and the problem has disappeared! You were right, it was an issue of CPU vs. GPU speed. Once the training is complete, how do I preserve the model for later prediction?

Screenshot 2019-09-02 at 11 52 52

luhgit avatar Sep 02 '19 09:09 luhgit

1. First you need to save your trained checkpoint files.

To do this, replace your eval.py with this file, https://yadi.sk/d/B2qL9iYpDvDoBA, and change line 154 of the new eval.py as needed.

1.1 Run something like this in Colab:

!python eval.py --test_data_path="/PATH TO .jpg IMAGES" --gpu_list=0 --checkpoint_path="/PATH TO CHECKPOINT FILES/" --output_dir="/content/EAST/test_result"

2. To freeze the saved files (see line 154 in eval.py), use this file, https://yadi.sk/d/FAALJEEk6tQWpQ, like this:

!python "/FULL PATH TO FILE.freeze.py" --model_dir="/content/EAST/saved" --output_node_names="feature_fusion/Conv_7/Sigmoid,feature_fusion/concat_3"

IgorZorkov avatar Sep 02 '19 10:09 IgorZorkov

3. See https://github.com/spmallick/learnopencv/tree/master/TextDetectionEAST

IgorZorkov avatar Sep 02 '19 10:09 IgorZorkov

Or download TextDetection.py (https://yadi.sk/d/72iA8zmoX8Ffvw) and run:

%cd /content/
!python "/content/drive/My Drive/TextDetection.py" --input "/content/test.jpg" --thr=0.5 --nms=0.5 --model "/content/EAST/saved/frozen_model.pb" --width=512 --height=512

IgorZorkov avatar Sep 02 '19 10:09 IgorZorkov

You'll get out.jpg in the same folder as test.jpg; just press refresh.

IgorZorkov avatar Sep 02 '19 10:09 IgorZorkov

And lastly, some of my training images:

https://yadi.sk/i/4iLMlOMXonW9Pg https://yadi.sk/i/vK2-03MOx2iuYQ https://yadi.sk/i/uwCIjoQZF2HgkQ

https://yadi.sk/d/o38Voy7qNAxMkw (154 MB). Good luck.

IgorZorkov avatar Sep 02 '19 11:09 IgorZorkov

Thank you very much for helping me with this. I am looking forward to the end of the training and to implementing your suggestions!

luhgit avatar Sep 02 '19 11:09 luhgit

You're welcome

IgorZorkov avatar Sep 02 '19 11:09 IgorZorkov

Oh my God I can't stop posting, somebody kill me

IgorZorkov avatar Sep 02 '19 11:09 IgorZorkov

System information (version)
OpenCV => 4.12
Operating System / Platform => Windows 64 Bit
Compiler => PyCharm 2018 CE

Detailed description
I tried to run text detection.py with my own EAST model, but it failed at "outs = net.forward(outNames)":

cv2.error: OpenCV(4.1.1) .\opencv-python\opencv\modules\dnn\src\dnn.cpp:525: error: (-2:Unspecified error) Can't create layer "resnet_v1_50/conv1/BatchNorm/FusedBatchNormV3" of type "FusedBatchNormV3" in function 'cv::dnn::dnn4_v20190621::LayerData::getLayerInstance'

I saved my model like this:

output_graph = "frozen_east_model_02.pb"
output_graph_def = tf.graph_util.convert_variables_to_constants(sess, sess.graph_def, ["feature_fusion/Conv_7/Sigmoid", "feature_fusion/concat_3"])
tf.train.write_graph(output_graph_def, ".", output_graph, as_text=False)

I have tried to modify model.py, but it did not work. https://github.com/argman/EAST/blob/master/model.py Line 150

c1_1 = slim.conv2d(tf.concat([g[i-1], f[i]], axis=3), num_outputs[i], 1)
pi2 = 0.5 * np.pi
angle_map = (slim.conv2d(g[3], 1, 1, activation_fn=tf.nn.sigmoid, normalizer_fn=None) - 0.5) * pi2 # angle is between [-45, 45]
F_geometry = tf.concat([geo_map, angle_map], axis=3)

@SmallDonkey

SpringRainLu avatar Nov 24 '19 16:11 SpringRainLu

Use tensorflow==1.14

IgorZorkov avatar Feb 10 '20 07:02 IgorZorkov

I followed step 2, but this error happened: "AssertionError: feature_fusion/Conv_7/Sigmoid is not in graph". How can I solve the problem? Thank you! @SmallDonkey (tensorflow==1.14.0)

zzcqinag avatar Sep 23 '20 14:09 zzcqinag

@SpringRainLu I ran into the same problem (the "FusedBatchNormV3" error at net.forward(outNames) described above). Did you solve it?

zzcqinag avatar Sep 23 '20 15:09 zzcqinag

@zzcqinag For the "feature_fusion/Conv_7/Sigmoid is not in graph" error, see https://github.com/argman/EAST/issues/277#issuecomment-507749717
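
Independently of the linked comment, a quick way to investigate an "is not in graph" assertion is to compare the requested output names against the names the graph actually contains. Below is a minimal, standard-library sketch. The helper check_output_nodes and the node list are hypothetical and for illustration only. With TensorFlow 1.x, the real list could presumably be built with [node.name for node in sess.graph_def.node].

```python
import difflib


def check_output_nodes(requested, available, n=3):
    """Map each requested name missing from `available` to its closest matches."""
    available = list(available)
    known = set(available)
    report = {}
    for name in requested:
        if name not in known:
            report[name] = difflib.get_close_matches(
                name, available, n=n, cutoff=0.6)
    return report


# Hypothetical node names; a real graph would contain many more.
available = [
    "feature_fusion/Conv_6/Sigmoid",
    "feature_fusion/concat_3",
    "resnet_v1_50/conv1/Relu",
]
requested = ["feature_fusion/Conv_7/Sigmoid", "feature_fusion/concat_3"]

report = check_output_nodes(requested, available)
for name, candidates in report.items():
    print(name, "is not in graph; closest names:", candidates)
```

If a close match turns up, passing that name to --output_node_names instead may be enough, but whether it is the right tensor still has to be checked against the model definition.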

hmen97 avatar Dec 07 '20 11:12 hmen97