[2.0rc1][Nightly] chaos_dataset_shuffle_sort_1tb failed

Open scv119 opened this issue 3 years ago • 0 comments

What happened + What you expected to happen

Traceback (most recent call last):
  File "dataset/sort.py", line 140, in <module>
    print(ds.stats())
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/dataset.py", line 3374, in stats
    return self._plan.stats().summary_string()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/_internal/stats.py", line 218, in summary_string
    self.stats_actor.get.remote(self.stats_uuid)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/worker.py", line 2247, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
        class_name: _StatsActor
        actor_id: 50cae1920894cf834b6049ac03000000
        pid: 222
        name: datasets_stats_actor
        namespace: a6746e48-ced2-4cf0-9539-eea5bb5709b3
        ip: 172.31.94.96
The actor is dead because its node has died. Node Id: c95969bb80d34bfe6d426623b276d6e175536faeef7a4bc60c9e6a6b

https://buildkite.com/ray-project/release-tests-branch/builds/882#018286ad-1c05-4cc7-9622-fbb5f0df3104

This looks like something new. Is the problem that StatsActor is not configured to be fault-tolerant?

Versions / Dependencies

releases/2.0.0rc1

Reproduction script

N/A

Issue Severity

High: It blocks me from completing my task.

scv119, Aug 10 '22 17:08

Just FYI: this is not a regression; the stats actor has never been fault tolerant.

jjyao, Aug 10 '22 21:08