br restore fails when PD is unavailable for 3-5s during split range, which is not expected
Please answer these questions before submitting your issue. Thanks!
- What did you do? If possible, provide a recipe for reproducing the error.
  - br restore full to S3
  - tiup cluster restart xxx -R pd. The TiDB cluster has only one PD, so PD was unavailable for only 3-5s.
  - br restore failed.
- What did you expect to see?
  br restore should tolerate PD being unavailable for 1-3 minutes during split range.
- What did you see instead?
  br log:
[2021/07/01 14:11:26.245 +08:00] [INFO] [base_client.go:296] ["[pd] cannot update member from this address"] [address=http://172.16.6.6:12379] [error="[PD:client:ErrClientGetMember]error:rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.16.6.6:12379: connect: connection refused" target:172.16.6.6:12379 status:TRANSIENT_FAILURE"]
[2021/07/01 14:11:26.245 +08:00] [ERROR] [base_client.go:166] ["[pd] failed updateMember"] [error="[PD:client:ErrClientGetLeader]get leader from [http://172.16.6.6:12379] error"] [stack="github.com/tikv/pd/client.(*baseClient).memberLoop\n\tgithub.com/tikv/[email protected]/client/base_client.go:166"]
...
[2021/07/01 14:11:26.855 +08:00] [ERROR] [base_client.go:166] ["[pd] failed updateMember"] [error="[PD:client:ErrClientGetLeader]get leader from [http://172.16.6.6:12379] error"] [stack="github.com/tikv/pd/client.(*baseClient).memberLoop\n\tgithub.com/tikv/[email protected]. 0.20210323121136-78679e5e209d/client/base_client.go:166"]
[2021/07/01 14:11:26.855 +08:00] [ERROR] [pipeline_items.go:236] ["failed on split range"] [ranges="{total=178,ranges="[\"[7480000000000014855F69800000000000000300, 7480000000000014855F698000000000000003FB)\",\"(skip 176)\",\"[74800000000000F6075F720000000000000000, 74800000000000F6075F72FFFFFFFFFFFFFFFF00)\"]",totalFiles=205,totalKVs=5309510,totalBytes=737344350,totalSize=737344350}"] [error="rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.16.6.6:12379: connect: connection refused""] [errorVerbose="rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.16.6.6:12379: connect: connection refused"\ngithub.com/tikv/pd/client.(*client).ScanRegions\n\tgithub.com/tikv/[email protected] .20210323121136-78679e5e209d/client/client.go:1100\ngithub.com/pingcap/br/pkg/restore.(*pdClient).ScanRegions\n\tgithub.com/pingcap/br/pkg/restore/split_client.go:385\ngithub.com/pingcap/br/pkg/restore.PaginateScanRegion\n\tgithub.com/pingcap/br/pkg/restore/split.go:298\ngithub.com/pingcap/br/pkg/restore.(*RegionSplitter).Split\n\tgithub.com/pingcap/br/pkg/restore/split.go:113\ngithub.com/pingcap/br/pkg/restore.SplitRanges\n\tgithub.com/pingcap/br/pkg/restore/util.go:390\ngithub.com/pingcap/br/pkg/restore.(*tikvSender).splitWorker\n\tgithub.com/pingcap/br/pkg/restore/pipeline_items.go:235\nruntime.goexit\n\truntime/asm_amd64.s:1371"] [stack="github.com/pingcap/br/pkg/restore.(*tikvSender).splitWorker\n\tgithub.com/pingcap/br/pkg/restore/pipeline_items.go:236"]
...
[2021/07/01 14:11:29.487 +08:00] [ERROR] [restore.go:35] ["failed to restore"] [error="rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.16.6.6:12379: connect: connection refused""] [errorVerbose="rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.16.6.6:12379: connect: connection refused\
- What version of BR and TiDB/TiKV/PD are you using?
- Operation logs
  - Please upload br.log for BR if possible
  - Please upload tidb-lightning.log for TiDB-Lightning if possible
  - Please upload tikv-importer.log from TiKV-Importer if possible
  - Other interesting logs
- Configuration of the cluster and the task
  - tidb-lightning.toml for TiDB-Lightning if possible
  - tikv-importer.toml for TiKV-Importer if possible
  - topology.yml if deployed by TiUP
- Screenshot/exported-PDF of Grafana dashboard or metrics' graph in Prometheus if possible