[2.0rc1][Nightly] chaos_dataset_shuffle_sort_1tb failed
What happened + What you expected to happen
Traceback (most recent call last):
  File "dataset/sort.py", line 140, in <module>
    print(ds.stats())
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/dataset.py", line 3374, in stats
    return self._plan.stats().summary_string()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/_internal/stats.py", line 218, in summary_string
    self.stats_actor.get.remote(self.stats_uuid)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/worker.py", line 2247, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
class_name: _StatsActor
actor_id: 50cae1920894cf834b6049ac03000000
pid: 222
name: datasets_stats_actor
namespace: a6746e48-ced2-4cf0-9539-eea5bb5709b3
ip: 172.31.94.96
The actor is dead because its node has died. Node Id: c95969bb80d34bfe6d426623b276d6e175536faeef7a4bc60c9e6a6b
https://buildkite.com/ray-project/release-tests-branch/builds/882#018286ad-1c05-4cc7-9622-fbb5f0df3104
This looks like something new. Is the problem that the _StatsActor is not configured to be fault-tolerant?
Versions / Dependencies
releases/2.0.0rc1
Reproduction script
N/A
Issue Severity
High: It blocks me from completing my task.
Just FYI: this is not a regression. The stats actor has never been fault tolerant.
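Until the stats actor itself tolerates node failures, one possible stopgap on the test side is to treat the final `ds.stats()` call as best-effort, so that a dead stats actor does not fail an otherwise successful sort. The sketch below is only an illustration: `stats_or_placeholder` and the placeholder string are hypothetical names, not part of Ray, and the fallback exception class exists only so the snippet runs where Ray is not installed. It assumes the failure surfaces as `ray.exceptions.RayActorError`, as in the traceback above.

```python
try:
    # Real exception type raised in the traceback above.
    from ray.exceptions import RayActorError
except ImportError:
    # Stand-in so this sketch still runs without Ray installed (assumption).
    class RayActorError(Exception):
        pass


def stats_or_placeholder(ds, placeholder="<stats unavailable: stats actor died>"):
    """Return ds.stats(), or a placeholder if the stats actor is dead.

    Hypothetical helper: it only swallows RayActorError, so any other
    failure in the pipeline still propagates and fails the test.
    """
    try:
        return ds.stats()
    except RayActorError:
        return placeholder
```

In `dataset/sort.py` this would replace the bare `print(ds.stats())` with `print(stats_or_placeholder(ds))`. It hides the symptom in the chaos test only; it does not make the stats actor fault tolerant.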