
Migrating 100%, but shows State: postponing cut-over; ETA: due

cool-firer opened this issue 6 years ago • 4 comments

Hi, I'm new to this. I'm trying to use this great tool to modify a table's structure. The migration has reached 100%, but gh-ost keeps printing:

Copy: 22102848/22102848 100.0%; Applied: 110; Backlog: 0/1000; Time: 3h22m0s(total), 3h6m12s(copy); streamer: mysql-bin.000018:912494008; State: postponing cut-over; ETA: due

and my command is this:

./gh-ost \
  --max-load=Threads_running=25 \
  --critical-load=Threads_running=1000 \
  --chunk-size=1000 \
  --max-lag-millis=1500 \
  --user="xxxx" \
  --password="xxxx" \
  --host=xxxx \
  --port=4223 \
  --database="test" \
  --table="t1" \
  --verbose \
  --alter="ADD COLUMN isBookmarkedThree TINYINT(1) NOT NULL DEFAULT 0" \
  --allow-master-master \
  --cut-over=default \
  --exact-rowcount \
  --concurrent-rowcount \
  --default-retries=120 \
  --panic-flag-file=./ghost.panic.flag \
  --postpone-cut-over-flag-file=./ghost.postpone.flag \
  --aliyun-rds \
  --allow-on-master \
  --execute

I tried deleting the ghost.postpone.flag file, but then it kept printing:

2019-12-06 19:25:39 INFO Grabbing voluntary lock: gh-ost.319743.lock
2019-12-06 19:25:39 INFO Setting LOCK timeout as 6 seconds
2019-12-06 19:25:39 INFO Looking for magic cut-over table
2019-12-06 19:25:39 INFO Creating magic cut-over table test._t1_del
2019-12-06 19:25:39 INFO Magic cut-over table created
2019-12-06 19:25:39 INFO Locking test.t1, test._t1_del
2019-12-06 19:25:39 INFO Tables locked
2019-12-06 19:25:39 INFO Session locking original & magic tables is 319743
2019-12-06 19:25:39 INFO Writing changelog state: AllEventsUpToLockProcessed:1575631539838703726
2019-12-06 19:25:39 INFO Waiting for events up to lock
2019-12-06 19:25:42 ERROR Timeout while waiting for events up to lock
2019-12-06 19:25:42 ERROR
2019-12-06 19:25:42 ERROR Timeout while waiting for events up to lock
2019-12-06 19:25:42 INFO Looking for magic cut-over table
2019-12-06 19:25:42 INFO Will now proceed to drop magic table and unlock tables
2019-12-06 19:25:42 INFO Dropping magic cut-over table
2019-12-06 19:25:42 INFO Releasing lock from test.t1, test._t1_del
2019-12-06 19:25:42 INFO Tables unlocked

Several minutes later, it ends like this:

2019-12-06 19:27:39 ERROR Timeout while waiting for events up to lock
2019-12-06 19:27:39 ERROR
2019-12-06 19:27:39 ERROR Timeout while waiting for events up to lock
2019-12-06 19:27:39 INFO Looking for magic cut-over table
2019-12-06 19:27:39 INFO Will now proceed to drop magic table and unlock tables
2019-12-06 19:27:39 INFO Dropping magic cut-over table
2019-12-06 19:27:39 INFO Removing socket file: /tmp/gh-ost.test.t1.sock
2019-12-06 19:27:39 FATAL 2019-12-06 19:27:39 ERROR Timeout while waiting for events up to lock

Is there anything I missed? Thanks

cool-firer · Dec 06 '19 11:12

So the "100%, State: postponing cut-over; ETA: due" is gh-ost working as intended. Because you specified the --postpone-cut-over-flag-file, gh-ost will wait to cut-over until either (1) you instruct it to via an interactive command (unpostpone) or (2) you remove the file. As you point out, once you removed the file, gh-ost then proceeded to attempt to cut-over, and this is when things went sideways.

When gh-ost attempts the cutover, it creates that magic _del table, and locks both the original table and that one. During this time, any new DML is blocked on the original table, which is deliberate. Once all DMLs before the point when gh-ost locked the tables have finished, gh-ost will perform the swap, and unlock things. Keeping writes blocked for too long, though, is not good, so gh-ost will bail if it can't detect that all previous DMLs finished within 6 seconds.

gh-ost does this detection by writing an event to its changelog table and waiting for it to appear in the binary log of the replica it is streaming from. Based on that, one of two things is probably happening (rough checks for both are sketched after this list):

  1. You have (possibly many) large or slow writes on your database.
  2. You have network latency between the replica and gh-ost.
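A couple of rough checks for these two causes (sketches only; <host> and <replica-host> are placeholders for your own setup, and with --allow-on-master the "replica" may be the master itself):

# Cause 1: long-running transactions/DML that keep events from draining
mysql -h <host> -P 4223 -e "SELECT trx_id, trx_started, trx_query FROM information_schema.innodb_trx ORDER BY trx_started LIMIT 10\G"

# Cause 2: lag on the server gh-ost streams binlogs from
mysql -h <replica-host> -e "SHOW SLAVE STATUS\G" | grep -E "Seconds_Behind_Master|Slave_SQL_Running"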

ajm188 · May 09 '20 20:05

Is there any way to configure that 6-second timeout? @ajm188 we are facing a similar issue on a very large table with active writes, and we couldn't find any setting to control it. The closest cut-over-related option, --cut-over-lock-timeout-seconds, doesn't seem to be helping.

From the code, the lock timeout seems to be derived as 2 * CutOverLockTimeoutSeconds, so cutOverLockTimeoutSeconds looks like the relevant setting:

cutOverLockTimeoutSeconds := flag.Int64("cut-over-lock-timeout-seconds", 3, "Max number of seconds to hold locks on tables while attempting to cut-over (retry attempted when lock exceeds timeout)")

tableLockTimeoutSeconds := this.migrationContext.CutOverLockTimeoutSeconds * 2
this.migrationContext.Log.Infof("Setting LOCK timeout as %d seconds", tableLockTimeoutSeconds)
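If that derivation is right, something like the following should at least raise the derived LOCK timeout (illustrative value only, appended to the rest of the existing flags), though as noted above it did not seem to help in our case:

./gh-ost \
  --cut-over-lock-timeout-seconds=10 \
  ...

i.e. the cut-over session should then log "Setting LOCK timeout as 20 seconds" instead of the default 2 * 3 = 6.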

manikuramboyina · Apr 04 '21 22:04