Jian Xiao
Seems like another lineage reconstruction:
```
Traceback (most recent call last):
  File "dask_on_ray/dask_on_ray_sort.py", line 230, in <module>
    file_path=args.file_path,
  File "dask_on_ray/dask_on_ray_sort.py", line 145, in trial
    10, npartitions=-1
  File "/home/ray/anaconda3/lib/python3.7/site-packages/dask/dataframe/core.py", line 1140, in head...
```
Succeeded on retry: https://buildkite.com/ray-project/release-tests-branch/builds/878#018283ca-d7f5-4ba4-9e24-3eb3f75f2dcb
Thanks for confirming @stephanie-wang, closing since it worked as intended.
@scv119 Any new findings about this test? It has been passing for the past 2 days.
Re-ran it 10x and got 2 failures. One of them showed slightly more informative output, from which it looks like the node was just busy and didn't respond to the GCS heartbeat,...
For debugging, the above log was from this test: https://console.anyscale.com/o/anyscale-internal/projects/prj_2xR6uT6t7jJuu1aCwWMsle/clusters/ses_WzX46Q5r2RwHi5rQbbKBQDWU Its head node was 10.0.3.218. The node that got killed was 10.0.3.192.
Ran it 6x today (mostly to observe the live CPU load) and all runs passed, so nothing suspicious was observed; the failure rate shouldn't be that high :) Now...
Compared to the data loading of a read task, one RPC seems like a small cost? Do we have a test to measure the impact of this?
Tried a simple test like this:
```
total_time = 0
for _ in range(16):
    start_time = time.time()
    ds = ray.data.range(100000, parallelism=10000)
    ds.map_batches(lambda x: x)
    # Accumulate across trials so a mean can be reported afterwards.
    total_time += time.time() - start_time
print("mean...
```
Microbenchmark:
```
start_time = time.time()
for _ in range(1000):
    ah = ray.data.impl.stats._get_or_create_stats_actor()
print("mean time to get:", (time.time() - start_time) / 1000)
```
Before: 1.4783143997192383e-05 (sec)
After: 0.0005355322360992432 (sec)
Diff: 36x...
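For anyone wanting to re-check the ratio, here is a minimal sketch of the same timing pattern as a reusable helper, plus the arithmetic on the two figures quoted above. `mean_call_time` and the stand-in workload are hypothetical names for illustration, not part of Ray; `time.perf_counter()` is swapped in for `time.time()` because it is the clock Python recommends for short intervals.

```python
import time


def mean_call_time(fn, repeats=1000):
    """Return the mean wall-clock seconds per call of fn over `repeats` calls."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats


# Figures quoted in the comment above (seconds per call).
before = 1.4783143997192383e-05
after = 0.0005355322360992432
slowdown = after / before  # roughly 36x, matching the quoted diff

if __name__ == "__main__":
    # Hypothetical stand-in workload; swap in the call under test.
    sample = mean_call_time(lambda: sum(range(100)), repeats=1000)
    print(f"mean time per call: {sample:.3e} s")
    print(f"slowdown: {slowdown:.1f}x")
```

Passing the call under test as `fn` keeps the loop overhead identical between the before and after runs, so the ratio stays comparable.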