Some errors when testing Fluid on MNIST tuning

Open xyzustc opened this issue 4 years ago • 0 comments

I am trying to write code that tunes MNIST to observe the performance boost provided by Fluid, but I got some errors.

My env

An 8-GPU local server. Python 3.7.11. ray 0.8.5. I am not using the pip package, but the Fluid implementation from the latest repo, i.e. the repo as of this commit:

commit bc59400c61da7e6fde3cac29ddfe40a718795a58
Author: Peifeng Yu <[email protected]>
Date:   Fri Jan 7 19:55:44 2022 -0500

    Log for debugging CI

RUN and ERROR info

I am working in Fluid/workloads. I ran cp -r ../fluid ./rfluid to avoid ambiguity when importing, then ran my script with python tune_fluid_mnist.py -l. (This tune_fluid_mnist.py file is based on Fluid/workloads/tune_fluid_mnist.py from this repo, but with the imports and the Executor changed.)

I got errors like this: all_error_output_info.txt. Here are the traceback parts of the output:

[2022-01-08 21:32:01,130][rfluid.fluid_executor][WARNING] Cloud not find running trial: TorchTrainable_44eb4392, currently running ones are []
[2022-01-08 21:32:01,138][rfluid.fluid_executor][ERROR] Trial TorchTrainable_44eb4392: Unexpected error starting runner.
Traceback (most recent call last):
  File "/home/data/qiqi/GNN-Fluid/Fluid/workloads/rfluid/fluid_executor.py", line 356, in _kickoff
    runner = self._setup_remote_runner(trial, res, reuse_allowed)
  File "/home/data/qiqi/GNN-Fluid/Fluid/workloads/rfluid/fluid_executor.py", line 338, in _setup_remote_runner
    return cls.remote(**kwargs)
  File "/export/data/qiqi/miniconda3/envs/graphgym/lib/python3.7/site-packages/ray/actor.py", line 378, in remote
    return self._remote(args=args, kwargs=kwargs)
  File "/export/data/qiqi/miniconda3/envs/graphgym/lib/python3.7/site-packages/ray/actor.py", line 556, in _remote
    extension_data=str(actor_method_cpu))
  File "python/ray/_raylet.pyx", line 918, in ray._raylet.CoreWorker.create_actor
  File "python/ray/_raylet.pyx", line 919, in ray._raylet.CoreWorker.create_actor
  File "python/ray/_raylet.pyx", line 257, in ray._raylet.prepare_resources
ValueError: Resource quantities >1 must be whole numbers.
[2022-01-08 21:32:03,241][rfluid.fluid_executor][WARNING] Cloud not find running trial: TorchTrainable_44ea46f4, currently running ones are []
[2022-01-08 21:32:03,251][rfluid.fluid_executor][ERROR] Trial TorchTrainable_44ea46f4: Unexpected error starting runner.
Traceback (most recent call last):
  File "/home/data/qiqi/GNN-Fluid/Fluid/workloads/rfluid/fluid_executor.py", line 356, in _kickoff
    runner = self._setup_remote_runner(trial, res, reuse_allowed)
  File "/home/data/qiqi/GNN-Fluid/Fluid/workloads/rfluid/fluid_executor.py", line 338, in _setup_remote_runner
    return cls.remote(**kwargs)
  File "/export/data/qiqi/miniconda3/envs/graphgym/lib/python3.7/site-packages/ray/actor.py", line 378, in remote
    return self._remote(args=args, kwargs=kwargs)
  File "/export/data/qiqi/miniconda3/envs/graphgym/lib/python3.7/site-packages/ray/actor.py", line 556, in _remote
    extension_data=str(actor_method_cpu))
  File "python/ray/_raylet.pyx", line 918, in ray._raylet.CoreWorker.create_actor
  File "python/ray/_raylet.pyx", line 919, in ray._raylet.CoreWorker.create_actor
  File "python/ray/_raylet.pyx", line 257, in ray._raylet.prepare_resources
ValueError: Resource quantities >1 must be whole numbers.
[2022-01-08 21:32:05,259][rfluid.fluid_executor][WARNING] Cloud not find running trial: TorchTrainable_44ebbf8e, currently running ones are []
[2022-01-08 21:32:05,265][rfluid.fluid_executor][ERROR] Trial TorchTrainable_44ebbf8e: Unexpected error starting runner.
Traceback (most recent call last):
  File "/home/data/qiqi/GNN-Fluid/Fluid/workloads/rfluid/fluid_executor.py", line 356, in _kickoff
    runner = self._setup_remote_runner(trial, res, reuse_allowed)
  File "/home/data/qiqi/GNN-Fluid/Fluid/workloads/rfluid/fluid_executor.py", line 338, in _setup_remote_runner
    return cls.remote(**kwargs)
  File "/export/data/qiqi/miniconda3/envs/graphgym/lib/python3.7/site-packages/ray/actor.py", line 378, in remote
    return self._remote(args=args, kwargs=kwargs)
  File "/export/data/qiqi/miniconda3/envs/graphgym/lib/python3.7/site-packages/ray/actor.py", line 556, in _remote
    extension_data=str(actor_method_cpu))
  File "python/ray/_raylet.pyx", line 918, in ray._raylet.CoreWorker.create_actor
  File "python/ray/_raylet.pyx", line 919, in ray._raylet.CoreWorker.create_actor
  File "python/ray/_raylet.pyx", line 257, in ray._raylet.prepare_resources
ValueError: Resource quantities >1 must be whole numbers.
Traceback (most recent call last):
  File "tune_fluid_mnist.py", line 80, in <module>
    main()
  File "tune_fluid_mnist.py", line 71, in main
    analysis = tune.run(MyTrainable, **params)
  File "/export/data/qiqi/miniconda3/envs/graphgym/lib/python3.7/site-packages/ray/tune/tune.py", line 326, in run
    runner.step()
  File "/export/data/qiqi/miniconda3/envs/graphgym/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 333, in step
    self.trial_executor.on_step_begin(self)
  File "/home/data/qiqi/GNN-Fluid/Fluid/workloads/rfluid/fluid_executor.py", line 693, in on_step_begin
    self._update_avail_resources()
  File "/home/data/qiqi/GNN-Fluid/Fluid/workloads/rfluid/fluid_executor.py", line 753, in _update_avail_resources
    ), "Cluster removed resources from running trials!"
AssertionError: Cluster removed resources from running trials!
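
Going only by the message in the first traceback, the check that fails in ray._raylet.prepare_resources seems to reject non-integer resource amounts of 1 or more. Below is a minimal sketch of what such a check might look like. It is my assumption inferred from the error text, not code copied from Ray, and check_resource_quantity is a hypothetical name.

```python
def check_resource_quantity(name, value):
    """Hypothetical stand-in for the validation the traceback points at.

    Assumption: amounts below 1 may be fractional, amounts of 1 or more
    must be whole numbers. Inferred from the error message only.
    """
    if not isinstance(value, (int, float)):
        raise ValueError(f"{name}: resource quantities may only be ints or floats.")
    if value < 0:
        raise ValueError(f"{name}: resource quantities may not be negative.")
    if value >= 1 and isinstance(value, float) and not value.is_integer():
        # Same wording as the ValueError in the log above.
        raise ValueError("Resource quantities >1 must be whole numbers.")
    return float(value)


if __name__ == "__main__":
    for amount in (0.5, 2, 2.0, 1.5):
        try:
            print(amount, "->", check_resource_quantity("GPU", amount))
        except ValueError as err:
            print(amount, "-> rejected:", err)
```

If this reading is right, a trial that ends up with something like 1.5 GPUs would be rejected when the actor is created, while 0.5 or 2 would pass, so the per-trial resource amounts computed by the executor might be worth printing before cls.remote(**kwargs) is called.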

Thank you very much for your reply! I am also trying to understand these errors.

xyzustc · Jan 08 '22 13:01