gh-ost Ensure replication running after replication restart

A Pull Request should be associated with an Issue.

We wish to have discussions in Issues. A single issue may be targeted by multiple PRs. If you're offering a new feature or fixing anything, we'd like to know beforehand in Issues, and potentially we'll be able to point development in a particular direction.

Related issue:

Further notes in https://github.com/github/gh-ost/blob/master/.github/CONTRIBUTING.md Thank you! We are open to PRs, but please understand if for technical reasons we are unable to accept each and any PR

Description

This PR adds checks to function restartReplication that ensures that replication has started before continuing. Before adding this check, there as a hard coded 500ms wait time and then the program assumed that the replication threads were started and running (added in https://github.com/github/gh-ost/pull/337).

We encountered situations in different environments that this wait time wasn't sufficient. As an experiment, we doubled this wait time and deployed it to our live environments to see if this resolves the issue. This did help solve the problem, so now we are coming back to find a better permanent fix.

example output from our logs:

2024-06-11 08:17:41 FATAL Replication on <replica-hostname>:3306 is broken: Slave_IO_Running: Connecting, Slave_SQL_Running: Yes. Please make sure replication runs before using gh-ost.

old description

This PR increases the `startSlavePostWaitMilliseconds` as we are seeing an error when running `gh-ost` in some cloud environments that the `Slave_IO_Running` is `Connecting` rather than `Yes` as expected.

We found this old PR that described the issue we're having https://github.com/github/gh-ost/pull/337 - as a first step we are increasing by doubling the value. If this test is successful, then we'll look into making this something that could be configured.

In case this PR introduced Go code changes:

[ ] contributed code is using same conventions as original code
[ ] script/cibuild returns with no formatting errors, build errors or unit test errors.

Jun 11 '24 19:06 andyedison

@andyedison I wonder if this fix is good enough for all cases. The sleep that exists now is a bit hacky

in some cloud environments that the Slave_IO_Running is Connecting rather than Yes as expected.

What should we be waiting for? I think you're saying the IO thread running. Whatever the answer, it would be safer if gh-ost waited + checked that what we want is achieved vs a time.Sleep()

Aug 15 '24 22:08 timvaillancourt

No I agree, I doubt this is good enough for all cases. This was a bit of an experiment to see if increasing this time would prevent the errors we were seeing in a particular environment from happening over a period of time. I believe it has, we just haven't had time to swing back to this and dig into it to find a more permanent solution

Aug 19 '24 15:08 andyedison

I've updated the PR and title+description to better represent what this change is. Instead of assuming that replication has resumed successfully after the timeout, I made the change to instead check if it is running, and if not, wait an interval before trying to check again, erroring if we exceed a maximum wait time

Oct 17 '24 19:10 andyedison

Internal testing results look good, feel free to merge when ready 👍

Dec 18 '24 19:12 meiji163