"AWS (Real)" is flaky

Open nrainer-materialize opened this issue 1 year ago • 2 comments

Buildkite link

https://buildkite.com/materialize/nightlies/builds/6448#018daa2c-f08f-405c-95d0-1fc0a0f6d7bc

Relevant log output

: {'S': 'ERROR', 'C': '58000', 'M': 'role trust policy does not require an external ID', 'D': "The trust policy for the connection's role (arn:aws:iam::400121260767:role/testdrive-3254404661-Customer) is insecure and allows any Materialize customer to assume the role.", 'H': 'See: https://materialize.com/s/aws-connection-role-trust-policy'}

Additional thoughts

For tracking if more occurrences happen.

nrainer-materialize avatar Feb 15 '24 08:02 nrainer-materialize

Thanks! I'll keep this in my background queue. Please ping if you see this happen again. One easy fix here is to bump the IAM sleep from 10s to 30s. That will slow the test down quite a bit though. The slightly less easy fix is to add retry loops, so that we wait as long as necessary to see what we expect, rather than blanket waiting 10 or 30s.
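
For illustration, the retry-loop idea could look roughly like the sketch below. This is not the actual test code: `wait_until` and `check` are hypothetical names, and `check` stands in for whatever probe would confirm that the IAM change has propagated. The point is only that the wait ends as soon as the condition holds, with the timeout acting as an upper bound rather than a fixed cost.

```python
import time
from typing import Callable


def wait_until(
    check: Callable[[], bool],
    timeout: float = 30.0,
    interval: float = 1.0,
) -> bool:
    """Poll `check` until it returns True or `timeout` seconds elapse.

    Hypothetical helper: `check` is a placeholder for a probe that
    reports whether the expected state is visible yet.
    """
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True  # condition observed, stop waiting immediately
        if time.monotonic() >= deadline:
            return False  # gave up after `timeout` seconds
        time.sleep(interval)
```

With a loop like this, the fast case stays fast and the slow case gets the full 30s budget, instead of every run paying a blanket 10s or 30s.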

benesch avatar Feb 17 '24 23:02 benesch

Another simple option: maybe just tell Buildkite to retry this one up to three times when it fails?

benesch avatar Feb 17 '24 23:02 benesch

Seen again in https://buildkite.com/materialize/nightlies/builds/6603#018de042-0791-4d1e-8b29-e7a3a942d71d.

nrainer-materialize avatar Feb 26 '24 09:02 nrainer-materialize

Have we had trouble with this since #25553 landed?

benesch avatar Apr 03 '24 02:04 benesch

> Have we had trouble with this since #25553 landed?

That change landed on Feb 26. Since then, it has failed three times (twice due to this error in builds/6761 and builds/7159, and once in builds/6755 in which most jobs failed) but it recovered each time.

We can consider this resolved.

nrainer-materialize avatar Apr 03 '24 06:04 nrainer-materialize

Great!

benesch avatar Apr 03 '24 12:04 benesch