mlsh

--continue_iter is buggy

Open TheCrazyT opened this issue 8 years ago • 6 comments

I used the following statement: python3 main.py --task AntBandits-v1 --num_subs 2 --macro_duration 1000 --num_rollouts 2000 --warmup_time 20 --train_time 50 --continue_iter 00615 --replay True AntAgent

One thing I noticed is that you need to write "0" in front of the iteration number. Another is that I had to copy the files from "savedir" to the folder "AntAgent" to make it work. I guess the checkpoint algorithm uses the wrong directory for storing checkpoints.
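The leading zeros can be produced programmatically rather than typed by hand. A minimal sketch, assuming the iteration number is padded to five digits (inferred only from the values 00615 and 00015 used in this thread, not confirmed against the mlsh source); `pad_iteration` is a hypothetical helper, not part of the repository:

```python
def pad_iteration(iteration, width=5):
    """Return the iteration number as a zero-padded string.

    The default width of 5 is an assumption based on the values
    seen in this thread (00615, 00015).
    """
    if iteration < 0:
        raise ValueError("iteration must be non-negative")
    # zfill pads on the left with zeros and never truncates longer numbers
    return str(iteration).zfill(width)


print(pad_iteration(615))
```

The result could then be passed as the value of --continue_iter.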

The first output also shows "It is Iteration 0 so i'm changing [...]". But I wanted to continue the learning process, not start from the beginning.

TheCrazyT avatar Nov 03 '17 16:11 TheCrazyT

Thanks for the report, I'll look into it. For now: although the output shows "iteration 0", it is continuing the learning process from the checkpoint.

kvfrans avatar Dec 16 '17 19:12 kvfrans

Hi! Do you run this code on a GPU? My computer has two TITAN XP GPUs. When I run this code, the utilization of the first one is only 5%, and the second one is at zero. Do you know why my GPU utilization is so low? Do I need to modify the code to match the configuration of my computer? Thanks!

Muguangfeng avatar Apr 16 '19 12:04 Muguangfeng

@Muguangfeng

Well, you kind of hijacked this topic, but I will answer you. It probably depends on which variant of TensorFlow is installed. It sounds to me like you have installed the default variant of TensorFlow. If I remember right, the default only uses the CPU (also take a look at https://www.tensorflow.org/install/gpu).

TheCrazyT avatar Apr 17 '19 13:04 TheCrazyT

So sorry! Do you mean that this code can only run on the CPU? To save time, I want to use the GPU to accelerate training. I have seen this tutorial for installing TensorFlow-gpu before, and I installed TensorFlow-gpu 1.8.0. It runs, but it is no faster because GPU utilization is low.

In addition, after training, I tried to view the training process by running: python main.py --task AntBandits-v1 --num_subs 2 --macro_duration 1000 --num_rollouts 2000 --warmup_time 20 --train_time 30 --replay True --continue_iter 00015 AntAgent. It cannot find the file. The folder containing the file is savedir/Antagent/checkpoints/. Is that wrong?
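One way to narrow down a "cannot find the file" error is to list what is actually in the checkpoint folder. A minimal sketch, assuming checkpoint file names contain the zero-padded iteration string (the actual naming scheme used by mlsh is not shown in this thread); `find_checkpoint_files` is a hypothetical helper:

```python
from pathlib import Path


def find_checkpoint_files(checkpoint_dir, iteration):
    """Return sorted names of files in checkpoint_dir whose name contains iteration.

    Returns an empty list if the directory does not exist.
    """
    directory = Path(checkpoint_dir)
    if not directory.is_dir():
        return []
    return sorted(p.name for p in directory.iterdir() if iteration in p.name)


# Folder and iteration taken from the post above
print(find_checkpoint_files("savedir/Antagent/checkpoints", "00015"))
```

An empty list means either the folder is missing or no file name matches. On a case-sensitive filesystem, Antagent and AntAgent are different directories, so it may be worth trying both spellings.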

Muguangfeng avatar Apr 17 '19 14:04 Muguangfeng

does:

import tensorflow as tf
print(tf.test.is_gpu_available())

also return True for you?
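A slightly fuller version of that check, as a sketch: it reports the installed TensorFlow version, whether the build has CUDA support, and which GPU devices TensorFlow can see. `gpu_report` is a hypothetical helper; it relies on `tf.test.is_built_with_cuda()` and `device_lib.list_local_devices()`, and the import is guarded so the snippet also runs when TensorFlow is not installed.

```python
def gpu_report():
    """Summarize whether TensorFlow is installed, CUDA-enabled, and sees a GPU."""
    try:
        import tensorflow as tf
        from tensorflow.python.client import device_lib
    except ImportError:
        # TensorFlow is not installed in this environment
        return {"tensorflow": None, "built_with_cuda": False, "gpus": []}
    # Keep only the devices that TensorFlow reports as GPUs
    gpus = [d.name for d in device_lib.list_local_devices() if d.device_type == "GPU"]
    return {
        "tensorflow": tf.__version__,
        "built_with_cuda": tf.test.is_built_with_cuda(),
        "gpus": gpus,
    }


print(gpu_report())
```

If "built_with_cuda" is False, the CPU-only package is probably installed. If it is True but "gpus" is empty, the driver or CUDA setup is the more likely place to look.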

TheCrazyT avatar Apr 18 '19 06:04 TheCrazyT

OK, it looks like the current code does not set the GPU device count. See email for more details.

TheCrazyT avatar Apr 19 '19 09:04 TheCrazyT