
Support table splitting process control when reading table snapshots

Open JustinLeesin opened this issue 3 years ago • 3 comments

This resolves a problem we encountered in our production environment: when there are too many tables to read during the snapshot phase and each one takes a long time, the data-reading process falls too far behind the chunk-splitting process.

To be more specific, say there are 100 tables to read. The current process is:

1. Split all the tables into chunks, one by one. This may last several hours, as it is a lightweight job.
2. Read the chunks produced in step 1. This may last several days, as it is a heavyweight job.
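The two phases described above can be sketched as follows. This is a minimal illustrative model, not the actual Flink CDC source code; the table names and chunk counts are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the current two-phase flow: all tables are split
// into chunks first, and only then are the chunks read.
public class TwoPhaseSketch {
    public static void main(String[] args) {
        String[] tables = {"t1", "t2", "t3"}; // stand-ins for the real tables
        List<String> chunks = new ArrayList<>();

        // Phase 1: split every table into chunks (fast, lightweight).
        for (String table : tables) {
            for (int i = 0; i < 2; i++) {
                chunks.add(table + "#chunk" + i);
            }
        }

        // Phase 2: read every chunk (slow, heavyweight). By the time the
        // last chunk is read, its split boundaries may be hours or days old.
        int read = 0;
        for (String chunk : chunks) {
            read++; // stand-in for the actual snapshot read of the chunk
        }
        System.out.println(chunks.size() + " " + read);
    }
}
```

Because phase 2 only begins consuming chunks that phase 1 produced much earlier, the boundaries of late chunks can become stale, which is the failure described below.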

When reading starts on the last split of the last table, executing 'select from table where primary key > ***', too much data has accumulated for that chunk, because a long time has passed since it was split, and the job fails.

Someone else has also encountered this problem: #1375

As a solution, we can control the speed of the chunk-splitting thread to prevent it from getting too far ahead. When it reaches a threshold, it will wait for the data-reading thread.
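One common way to implement this kind of throttling is a bounded queue between the splitter (producer) and the reader (consumer). The sketch below is a hypothetical illustration of the idea, not the PR's actual implementation; the class and variable names are invented, and the threshold value is arbitrary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: a bounded queue between the chunk-splitting thread
// and the chunk-reading thread. Once the number of pending (split but
// unread) chunks reaches the threshold, put() blocks, so the splitter
// waits for the reader to catch up instead of racing ahead.
public class SplitBackpressureSketch {
    public static void main(String[] args) throws InterruptedException {
        final int threshold = 4; // max chunks allowed to wait unread
        final int totalChunks = 10;
        BlockingQueue<String> pendingChunks = new ArrayBlockingQueue<>(threshold);
        List<String> processed = new ArrayList<>();

        Thread splitter = new Thread(() -> {
            for (int i = 0; i < totalChunks; i++) {
                try {
                    // Blocks when the queue already holds `threshold` chunks.
                    pendingChunks.put("chunk-" + i);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        });

        Thread reader = new Thread(() -> {
            for (int i = 0; i < totalChunks; i++) {
                try {
                    // Stand-in for the heavyweight snapshot read of a chunk.
                    processed.add(pendingChunks.take());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        });

        splitter.start();
        reader.start();
        splitter.join();
        reader.join();

        System.out.println(processed.size());
    }
}
```

With this shape, a chunk is never split long before it is read, so the boundary query stays close to the current state of the table.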

JustinLeesin, Oct 18 '22 11:10

Hi @JustinLeesin, sorry for the delay on this PR. Could you please rebase it onto the latest master branch, since there have been lots of changes in the Flink CDC repo since your original commit? Kind reminder that the com.ververica.cdc.connectors.mysql package has been moved to org.apache.flink.cdc.connectors.mysql.

cc @leonardBang

yuxiqian, Apr 25 '24 07:04

This pull request has been automatically marked as stale because it has not had recent activity for 60 days. It will be closed in 30 days if no further activity occurs.

github-actions[bot], Jul 17 '24 00:07