
tests/integration: set `skip_wait_for_gossip_to_settle=0`

Open fruch opened this issue 1 year ago • 9 comments

To speed up the boot sequence of Scylla nodes, we are using `skip_wait_for_gossip_to_settle=0`, the same setting we have been using for quite a while in dtest on almost all tests.

Also introduced `wait_other_notice=True` in the places where the cluster is started, because without it a test can start while the cluster isn't fully up and ready.

This change shaves 1h off the integration test run, which now finishes in 28min.
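
For readers who want to see the shape of the two settings together, here is a minimal sketch. It assumes a ccm-style cluster object that exposes `set_configuration_options()` and `start()` the way ccmlib's `Cluster` does, and that the gossip option is passed as a node configuration option. The helper name `start_fast` is hypothetical and is not taken from the PR diff.

```python
def start_fast(cluster):
    """Start a ccm-style cluster without the gossip settle wait.

    `cluster` is assumed to behave like ccmlib's Cluster; this helper is
    illustrative only and is not part of the PR.
    """
    # Skip the gossip settle wait so each node boots faster.
    cluster.set_configuration_options({"skip_wait_for_gossip_to_settle": 0})
    # Wait until nodes accept CQL connections and the other nodes have
    # noticed them, so a test never starts against a half-ready cluster.
    cluster.start(wait_for_binary_proto=True, wait_other_notice=True)
    return cluster
```

The explicit waits in `start()` are what compensate for dropping the settle wait: boot is faster, but the test still begins only once the nodes report each other as up.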

fruch avatar Feb 22 '24 15:02 fruch

Interesting, I remember that I did try to do this at one point, but got a lot of failures. Maybe I just made some mistake when running the tests.

Lorak-mmk avatar Feb 23 '24 21:02 Lorak-mmk

> Interesting, I remember that I did try to do this at one point, but got a lot of failures. Maybe I just made some mistake when running the tests.

It depends on when you tried it. We (mostly @nyh) did a lot of fine-tuning in ccm to support this case correctly. While trying to figure out why that UDT test is failing, it was annoying to wait that long for cluster creation.

fruch avatar Feb 24 '24 19:02 fruch

I think we can merge it after CI passes

Lorak-mmk avatar Feb 27 '24 17:02 Lorak-mmk

> I think we can merge it after CI passes

One of the integration suites was stuck for 5h, I'm running it all again:

tests/integration/standard/test_metadata.py ss...s.............x...s.s.. [ 15%]
s...s.ss.s...x.s.x.....sssssssssss...ss.s....s.s...ss                    [ 20%]
Error: The operation was canceled.

I'm not sure whether it's connected to this change or not. We'll need more reruns, and maybe more debug output enabled in CI, to figure this one out.

fruch avatar Feb 28 '24 06:02 fruch

> > I think we can merge it after CI passes
>
> One of the integration suites was stuck for 5h, I'm running it all again:
>
> tests/integration/standard/test_metadata.py ss...s.............x...s.s.. [ 15%]
> s...s.ss.s...x.s.x.....sssssssssss...ss.s....s.s...ss                    [ 20%]
> Error: The operation was canceled.
>
> I'm not sure whether it's connected to this change or not. We'll need more reruns, and maybe more debug output enabled in CI, to figure this one out.

It's also getting stuck in other places that are not this PR: https://github.com/scylladb/python-driver/actions/runs/8076169015/job/22064206623

tests/integration/standard/test_metadata.py ss...s.............x...s.s.. [ 15%]
s...s.ss.s...x.s.x.....sssssssssss...ss.s....s.s...ss                    [ 20%]
Error: The operation was canceled.

fruch avatar Feb 28 '24 13:02 fruch

Clearly from the logs, test_connection_error is the one getting stuck; it's still not clear why.

Also seen that test_connection_honor_cluster_port leaves a trail of sessions behind, which keep trying to reconnect to a cluster that isn't there anymore.
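
One common way to keep a test from leaving reconnecting sessions behind is to guarantee that the driver's cluster object is shut down when the test ends. The sketch below is not the fix adopted in this thread (none is stated); it assumes an object with `connect()` and `shutdown()` methods like the driver's `Cluster`, and the helper name `connected_session` is hypothetical.

```python
import contextlib


@contextlib.contextmanager
def connected_session(cluster):
    """Yield a session and always shut the cluster object down afterwards.

    `cluster` is assumed to expose connect() and shutdown(), as the
    driver's Cluster does; this helper is illustrative only.
    """
    try:
        yield cluster.connect()
    finally:
        # shutdown() runs whether the test body passes, fails, or the
        # connect() call itself raises, so no session is left behind
        # trying to reconnect to a cluster that is already gone.
        cluster.shutdown()
```

A test would then wrap its body in `with connected_session(cluster) as session:` instead of calling `connect()` directly.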

fruch avatar Feb 28 '24 22:02 fruch

> Clearly from the logs, test_connection_error is the one getting stuck; it's still not clear why.
>
> Also seen that test_connection_honor_cluster_port leaves a trail of sessions behind, which keep trying to reconnect to a cluster that isn't there anymore.

Are the problems in those tests caused by this PR? If not, then I think we can merge this.

Lorak-mmk avatar Apr 29 '24 13:04 Lorak-mmk

> > Clearly from the logs, test_connection_error is the one getting stuck; it's still not clear why.
> >
> > Also seen that test_connection_honor_cluster_port leaves a trail of sessions behind, which keep trying to reconnect to a cluster that isn't there anymore.
>
> Are the problems in those tests caused by this PR? If not, then I think we can merge this.

I didn't find any connection to this change

fruch avatar Apr 30 '24 06:04 fruch

Looks like all tests are passing now, aren't they?

roydahan avatar Apr 30 '24 14:04 roydahan