Backup failed when cloud storage reported `reset by peer`

Open shuijing198799 opened this issue 5 years ago • 1 comments

Please answer these questions before submitting your issue. Thanks!

  1. What did you do? If possible, provide a recipe for reproducing the error.

When the network connection to GCS is unstable, TiKV receives a `Connection reset by peer` error. This error does not trigger TiKV's retry mechanism, so the backup exits with an error.

TiKV log

[2020/11/04 06:29:58.792 +00:00] [ERROR] [endpoint.rs:292] ["backup save file failed"] [err_code=KV:Unknown] [err="Io(Custom { kind: InvalidInput, error: \"invalid HTTP request: connection error: Connection reset by peer (os error 104)\" })"]
[2020/11/04 06:29:58.792 +00:00] [ERROR] [endpoint.rs:694] ["backup region failed"] [err_code=KV:Unknown] [err="Io(Custom { kind: InvalidInput, error: \"invalid HTTP request: connection error: Connection reset by peer (os error 104)\" })"] [end_key=7480000000000000E75F72800000000EE03CD9] [start_key=7480000000000000E75F72800000000EB0EE2D] [region="id: 110409 start_key: 7480000000000000FFE75F72800000000EFFB0EE2D0000000000FA end_key: 7480000000000000FFE75F72800000000EFFE03CD90000000000FA region_epoch { conf_ver: 5 version: 10797 } peers { id: 110410 store_id: 1 } peers { id: 110411 store_id: 4 } peers { id: 110412 store_id: 5 }"]

[2020/11/04 06:29:58.825 +00:00] [ERROR] [client.rs:438] ["failed to send heartbeat"] [err_code=KV:PD:gRPC] [err="Grpc(RpcFinished(Some(RpcStatus { status: 14-UNAVAILABLE, details: Some(\"Connection reset by peer\") })))"]
[2020/11/04 06:29:58.834 +00:00] [ERROR] [util.rs:301] ["request failed, retry"] [err_code=KV:Unknown] [err="Grpc(RpcFinished(Some(RpcStatus { status: 14-UNAVAILABLE, details: Some(\"Connection reset by peer\") })))"]
[2020/11/04 06:29:58.856 +00:00] [ERROR] [util.rs:301] ["request failed, retry"] [err_code=KV:Unknown] [err="Other(SendError(\"...\"))"]
[2020/11/04 06:29:58.856 +00:00] [ERROR] [util.rs:301] ["request failed, retry"] [err_code=KV:Unknown] [err="Other(SendError(\"...\"))"]
[2020/11/04 06:29:58.857 +00:00] [INFO] [util.rs:419] ["connecting to PD endpoint"] [endpoints=https://db-pd-0.db-pd-peer.tidb1323802126710738944.svc:2379]
[2020/11/04 06:29:58.879 +00:00] [INFO] [<unknown>] ["New connected subchannel at 0x7fb975e55ed0 for subchannel 0x7fb975ea1700"]
[2020/11/04 06:29:58.895 +00:00] [INFO] [util.rs:419] ["connecting to PD endpoint"] [endpoints=https://db-pd-1.db-pd-peer.tidb1323802126710738944.svc:2379]
[2020/11/04 06:29:58.905 +00:00] [INFO] [<unknown>] ["New connected subchannel at 0x7fb975e56b90 for subchannel 0x7fb975ea1c40"]
[2020/11/04 06:29:58.913 +00:00] [INFO] [util.rs:484] ["connected to PD leader"] [endpoints=https://db-pd-1.db-pd-peer.tidb1323802126710738944.svc:2379]
[2020/11/04 06:29:58.913 +00:00] [INFO] [util.rs:190] ["heartbeat sender and receiver are stale, refreshing ..."]
[2020/11/04 06:29:58.936 +00:00] [WARN] [util.rs:209] ["updating PD client done"] [spend=79.041063ms]

Injecting corrupted or duplicated packets in a chaos-testing environment may reproduce this problem.
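For illustration only, here is a minimal sketch of how a transient connection reset could be classified as retryable. It is not TiKV's actual code: `is_retryable` and `retry_with_backoff` are hypothetical names, and the attempt count and delays are arbitrary. Because the log above shows the reset wrapped in an `InvalidInput` error, the sketch assumes that checking the error kind alone is not enough and falls back to inspecting the message text.

```rust
use std::io::{self, ErrorKind};
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical helper: decide whether an I/O error looks transient.
/// The kind is checked first; the message text is a fallback for resets
/// that arrive wrapped in another kind (e.g. `InvalidInput` in the log).
fn is_retryable(err: &io::Error) -> bool {
    match err.kind() {
        ErrorKind::ConnectionReset
        | ErrorKind::ConnectionAborted
        | ErrorKind::BrokenPipe
        | ErrorKind::TimedOut => true,
        _ => err.to_string().contains("Connection reset by peer"),
    }
}

/// Hypothetical helper: run `op` up to `max_attempts` times, sleeping with
/// exponential backoff between attempts, but only for retryable errors.
fn retry_with_backoff<T, F>(max_attempts: u32, base_delay: Duration, mut op: F) -> io::Result<T>
where
    F: FnMut() -> io::Result<T>,
{
    let mut attempt = 1;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(e) if attempt < max_attempts && is_retryable(&e) => {
                // Delay grows as base, 2 * base, 4 * base, ...
                sleep(base_delay * 2u32.pow(attempt - 1));
                attempt += 1;
            }
            // Non-retryable error, or attempts exhausted: give up.
            Err(e) => return Err(e),
        }
    }
}

fn main() {
    // Simulated upload that fails twice with the error text from the log.
    let mut calls = 0;
    let result = retry_with_backoff(3, Duration::from_millis(10), || {
        calls += 1;
        if calls < 3 {
            Err(io::Error::new(
                ErrorKind::InvalidInput,
                "invalid HTTP request: connection error: Connection reset by peer (os error 104)",
            ))
        } else {
            Ok("uploaded")
        }
    });
    println!("{:?} after {} calls", result, calls);
}
```

Matching on the message text is fragile. A real fix would more likely preserve the original error kind where the HTTP error is converted, so that the retry logic can rely on the kind alone.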

  2. What did you expect to see? Backup successful

  3. What did you see instead? Backup Failed

  4. What version of BR and TiDB/TiKV/PD are you using?

br -V

Release Version: v4.0.8
Git Commit Hash: c2ed897feadaae1ae27a4111cd44b1840941e9be
Git Branch: heads/refs/tags/v4.0.8
Go Version: go1.13
UTC Build Time: 2020-10-30 08:14:21
Race Enabled: false

tidb-server -V

Release Version: v4.0.8
Edition: Community
Git Commit Hash: 66ac9fc31f1733e5eb8d11891ec1b38f9c422817
Git Branch: heads/refs/tags/v4.0.8
UTC Build Time: 2020-10-30 08:21:16
GoVersion: go1.13
Race Enabled: false
TiKV Min Version: v3.0.0-60965b006877ca7234adaced7890d7b029ed1306
Check Table Before Drop: false

tikv-server -V

/ # /tikv-server -V
TiKV 
Release Version:   4.0.8
Edition:           Community
Git Commit Hash:   83091173e960e5a0f5f417e921a0801d2f6635ae
Git Commit Branch: heads/refs/tags/v4.0.8
UTC Build Time:    2020-10-30 08:40:33
Rust Version:      rustc 1.42.0-nightly (0de96d37f 2019-12-19)
Enable Features:   jemalloc mem-profiling portable sse protobuf-codec
Profile:           dist_release

pd-server -V

Release Version: v4.0.8
Edition: Community
Git Commit Hash: 775b6a5ef517f8ab2f43fef6418bbfc7d6c9c9dc
Git Branch: heads/refs/tags/v4.0.8
UTC Build Time:  2020-10-30 08:15:09

shuijing198799, Nov 04 '20 08:11

Some related issues: https://github.com/googleapis/google-cloud-go/issues/108 https://stackoverflow.com/questions/51624788/google-cloud-storage-batch-move-file-failure-connection-reset-by-peer

tennix, Nov 04 '20 08:11