Some errors when testing Fluid on MNIST tuning
I am trying to write code that tunes MNIST to observe the performance boost provided by Fluid, but I got some errors.
My env
An 8-GPU local server. Python 3.7.11. ray 0.8.5.
I am not using the pip package; I am using the Fluid implementation from the latest repo, i.e. the repo as of this commit:
commit bc59400c61da7e6fde3cac29ddfe40a718795a58
Author: Peifeng Yu <[email protected]>
Date: Fri Jan 7 19:55:44 2022 -0500
Log for debugging CI
RUN and ERROR info
I am in Fluid/workloads. I ran cp -r ../fluid ./rfluid to avoid ambiguity when importing.
I ran tune_fluid_minist.py with python tune_fluid_minist.py -l
(This tune_fluid_minist.py file is based on Fluid/workloads/tune_fluid_minist.py from this repo, but with the imports and the Executor changed.)
I got errors like this: all_error_output_info.txt. The Traceback parts of the output are below:
[2022-01-08 21:32:01,130][rfluid.fluid_executor][WARNING] Cloud not find running trial: TorchTrainable_44eb4392, currently running ones are []
[2022-01-08 21:32:01,138][rfluid.fluid_executor][ERROR] Trial TorchTrainable_44eb4392: Unexpected error starting runner.
Traceback (most recent call last):
File "/home/data/qiqi/GNN-Fluid/Fluid/workloads/rfluid/fluid_executor.py", line 356, in _kickoff
runner = self._setup_remote_runner(trial, res, reuse_allowed)
File "/home/data/qiqi/GNN-Fluid/Fluid/workloads/rfluid/fluid_executor.py", line 338, in _setup_remote_runner
return cls.remote(**kwargs)
File "/export/data/qiqi/miniconda3/envs/graphgym/lib/python3.7/site-packages/ray/actor.py", line 378, in remote
return self._remote(args=args, kwargs=kwargs)
File "/export/data/qiqi/miniconda3/envs/graphgym/lib/python3.7/site-packages/ray/actor.py", line 556, in _remote
extension_data=str(actor_method_cpu))
File "python/ray/_raylet.pyx", line 918, in ray._raylet.CoreWorker.create_actor
File "python/ray/_raylet.pyx", line 919, in ray._raylet.CoreWorker.create_actor
File "python/ray/_raylet.pyx", line 257, in ray._raylet.prepare_resources
ValueError: Resource quantities >1 must be whole numbers.
[2022-01-08 21:32:03,241][rfluid.fluid_executor][WARNING] Cloud not find running trial: TorchTrainable_44ea46f4, currently running ones are []
[2022-01-08 21:32:03,251][rfluid.fluid_executor][ERROR] Trial TorchTrainable_44ea46f4: Unexpected error starting runner.
Traceback (most recent call last):
File "/home/data/qiqi/GNN-Fluid/Fluid/workloads/rfluid/fluid_executor.py", line 356, in _kickoff
runner = self._setup_remote_runner(trial, res, reuse_allowed)
File "/home/data/qiqi/GNN-Fluid/Fluid/workloads/rfluid/fluid_executor.py", line 338, in _setup_remote_runner
return cls.remote(**kwargs)
File "/export/data/qiqi/miniconda3/envs/graphgym/lib/python3.7/site-packages/ray/actor.py", line 378, in remote
return self._remote(args=args, kwargs=kwargs)
File "/export/data/qiqi/miniconda3/envs/graphgym/lib/python3.7/site-packages/ray/actor.py", line 556, in _remote
extension_data=str(actor_method_cpu))
File "python/ray/_raylet.pyx", line 918, in ray._raylet.CoreWorker.create_actor
File "python/ray/_raylet.pyx", line 919, in ray._raylet.CoreWorker.create_actor
File "python/ray/_raylet.pyx", line 257, in ray._raylet.prepare_resources
ValueError: Resource quantities >1 must be whole numbers.
[2022-01-08 21:32:05,259][rfluid.fluid_executor][WARNING] Cloud not find running trial: TorchTrainable_44ebbf8e, currently running ones are []
[2022-01-08 21:32:05,265][rfluid.fluid_executor][ERROR] Trial TorchTrainable_44ebbf8e: Unexpected error starting runner.
Traceback (most recent call last):
File "/home/data/qiqi/GNN-Fluid/Fluid/workloads/rfluid/fluid_executor.py", line 356, in _kickoff
runner = self._setup_remote_runner(trial, res, reuse_allowed)
File "/home/data/qiqi/GNN-Fluid/Fluid/workloads/rfluid/fluid_executor.py", line 338, in _setup_remote_runner
return cls.remote(**kwargs)
File "/export/data/qiqi/miniconda3/envs/graphgym/lib/python3.7/site-packages/ray/actor.py", line 378, in remote
return self._remote(args=args, kwargs=kwargs)
File "/export/data/qiqi/miniconda3/envs/graphgym/lib/python3.7/site-packages/ray/actor.py", line 556, in _remote
extension_data=str(actor_method_cpu))
File "python/ray/_raylet.pyx", line 918, in ray._raylet.CoreWorker.create_actor
File "python/ray/_raylet.pyx", line 919, in ray._raylet.CoreWorker.create_actor
File "python/ray/_raylet.pyx", line 257, in ray._raylet.prepare_resources
ValueError: Resource quantities >1 must be whole numbers.
Traceback (most recent call last):
File "tune_fluid_mnist.py", line 80, in <module>
main()
File "tune_fluid_mnist.py", line 71, in main
analysis = tune.run(MyTrainable, **params)
File "/export/data/qiqi/miniconda3/envs/graphgym/lib/python3.7/site-packages/ray/tune/tune.py", line 326, in run
runner.step()
File "/export/data/qiqi/miniconda3/envs/graphgym/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 333, in step
self.trial_executor.on_step_begin(self)
File "/home/data/qiqi/GNN-Fluid/Fluid/workloads/rfluid/fluid_executor.py", line 693, in on_step_begin
self._update_avail_resources()
File "/home/data/qiqi/GNN-Fluid/Fluid/workloads/rfluid/fluid_executor.py", line 753, in _update_avail_resources
), "Cluster removed resources from running trials!"
AssertionError: Cluster removed resources from running trials!
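My current reading of the ValueError, going only by its message, is that a resource quantity may be fractional when it is at most 1 but must be a whole number above 1. The sketch below restates that rule in plain Python. Both function names are hypothetical and are not part of ray or Fluid, and I do not know whether rounding down is the right fix inside fluid_executor.py.

```python
import math


def check_resource_quantity(name, quantity):
    """Hypothetical check mirroring the rule stated in the ValueError message."""
    if quantity < 0:
        raise ValueError(f"Resource quantity for {name} must be non-negative.")
    # Fractions such as 0.5 are accepted, but anything above 1 must be whole.
    if quantity > 1 and not float(quantity).is_integer():
        raise ValueError(
            f"Resource quantities >1 must be whole numbers (got {name}={quantity})."
        )
    return quantity


def round_down_resources(resources):
    """Hypothetical workaround: floor quantities above 1, keep the rest unchanged."""
    return {
        name: float(math.floor(q)) if q > 1 else q
        for name, q in resources.items()
    }


if __name__ == "__main__":
    requested = {"CPU": 2.5, "GPU": 0.5}  # made-up request, not taken from my run
    fixed = round_down_resources(requested)
    for name, quantity in fixed.items():
        check_resource_quantity(name, quantity)
    print(fixed)
```

If this reading is correct, the request passed to cls.remote(**kwargs) in _setup_remote_runner would contain a fractional value above 1, such as 2.5 CPUs, but I have not confirmed that from the logs.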
Thank you very much for your reply! I am also trying to understand these errors.