Relayer error handling specification
Crate
docs
Summary
Currently the relayer performs retries for any error that occurs while building messages and transactions.
Problem Definition
Errors can originate from the RPC layer, the light client, the chain itself (on-chain errors), or the relayer's own internals. The most subtle are the on-chain errors (leaving aside bugs in the relayer code): they are not typed, so we only receive an error message with the reason embedded in a string. A retry is appropriate in some cases; in others we should not retry (e.g. the transaction has already been submitted by another relayer); and in yet other cases we should rebuild the messages or transactions.
Proposal
Analyze the code, document all errors, and specify the action to be taken in each case. Where applicable, also consider exponential backoff, as per @romac's https://github.com/informalsystems/ibc-rs/pull/709#discussion_r582852413.
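To make the backoff idea concrete, here is a minimal sketch of retrying with exponentially increasing delays. This is not the actual Hermes implementation; the helper name and parameters are made up for illustration.

```rust
use std::time::Duration;

/// Hypothetical retry helper: run `op` up to `max_attempts` times,
/// doubling the delay after each failure (base, 2*base, 4*base, ...).
fn retry_with_backoff<T, E>(
    mut op: impl FnMut() -> Result<T, E>,
    max_attempts: u32,
    base_delay: Duration,
) -> Result<T, E> {
    let mut attempt = 0u32;
    loop {
        match op() {
            Ok(v) => return Ok(v),
            // Out of attempts: surface the last error to the caller.
            Err(e) if attempt + 1 >= max_attempts => return Err(e),
            Err(_) => {
                // Exponential backoff; for a real implementation this
                // would also be capped to avoid overflow / huge waits.
                let delay = base_delay * 2u32.pow(attempt);
                std::thread::sleep(delay);
                attempt += 1;
            }
        }
    }
}

fn main() {
    let mut calls = 0;
    let result = retry_with_backoff(
        || {
            calls += 1;
            if calls < 3 { Err("transient") } else { Ok(42) }
        },
        5,
        Duration::from_millis(1),
    );
    assert_eq!(result, Ok(42));
    assert_eq!(calls, 3);
    println!("succeeded after {calls} attempts");
}
```

The key design point from the linked discussion is that the backoff should only apply to errors classified as retriable; permanent errors should fail fast.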
For Admin Use
- [x] Not duplicate issue
- [x] Appropriate labels applied
- [x] Appropriate milestone (priority) applied
- [ ] Appropriate contributors tagged
- [ ] Contributor assigned/self-assigned
> also they are not typed (we just get an error message and the reason is embedded in a string)

May be worth trying to get these typed upstream in ibc.
I tried to do a full trace of an example error message to see how it is propagated. There is a complicated path from the Hermes relayer to the IBC module, and there are many potential errors that can occur in each step.
On the Go side, error messages are formatted using helpers such as sdkerrors.Wrapf and then converted into strings. It would be challenging to modify the Go code to use more structured error reporting, so we can only pattern-match on the string to identify the errors of interest.
Here is an example error message:
```
Error: tx error: failed with underlying cause: tx response error: deliver_tx reports error: log=Log("failed to execute message; message index: 1: connection handshake open try failed: failed connection state verification for client (07-tendermint-4): chained membership proof failed to verify membership of value: 0A0F... in subroot C9FE...at index 0. Please ensure the path and value are both correct.: invalid proof")
```
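As an illustration of the pattern-matching approach, classifying such a log line can only be done by substring matching today. A sketch (the enum and function names are hypothetical, and brittle by design, since the Go side only gives us a flattened string):

```rust
/// Hypothetical error kinds recovered from a raw log string.
#[derive(Debug, PartialEq)]
enum ChainErrorKind {
    InvalidProof,
    InvalidConnectionState,
    Unknown,
}

/// Classify by substring matching on the log. The substrings come from
/// the error messages quoted in this thread; anything else falls through.
fn classify_log(log: &str) -> ChainErrorKind {
    if log.contains("invalid proof") {
        ChainErrorKind::InvalidProof
    } else if log.contains("invalid connection state") {
        ChainErrorKind::InvalidConnectionState
    } else {
        ChainErrorKind::Unknown
    }
}

fn main() {
    let log = "chained membership proof failed to verify membership: invalid proof";
    assert_eq!(classify_log(log), ChainErrorKind::InvalidProof);
    println!("{:?}", classify_log(log));
}
```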
And here is the manual stack trace:
- `tx error`: ibc-rs/relayer-cli/src/error.rs:33
- `failed with underlying cause`: ibc-rs/relayer/src/connection.rs:31
- `tx response error`: ibc-rs/relayer/src/connection.rs:614
- `deliver_tx reports error`: ibc-rs/relayer/src/chain/cosmos.rs:1472
- `failed to execute message`: cosmos-sdk/baseapp/baseapp.go:735
- `connection handshake open try failed`: ibc-go/modules/core/keeper/msg_server.go:174
- `failed connection state verification`: ibc-go/modules/core/03-connection/keeper/verify.go
- `chained membership proof failed to verify membership`: ibc-go/modules/core/23-commitment/types/merkle.go:249
- `invalid proof`: ibc-go/modules/core/23-commitment/types/errors.go:12
I'm not sure there is a good way to document all errors on the Go side, other than searching the code for keywords such as status.Errorf, sdkerrors.Wrap, sdkerrors.Wrapf, and fmt.Errorf, and looking for the standard errors in ibc-go/modules/*/errors.go. If we do document the errors, we could contribute the documentation to ibc-go as well.
We get the response from the tx here: https://github.com/informalsystems/ibc-rs/blob/2592440b0c599304a1032746ce975b2cf7baa673/relayer/src/chain/cosmos.rs#L1479

The response has a check_tx and a deliver_tx result of type TxResult, defined here: https://github.com/informalsystems/tendermint-rs/blob/89b3fd0a236b54fcbd149aa9917a52f7d0a6de26/rpc/src/endpoint/broadcast/tx_commit.rs#L61-L91

We currently take the log and embed it in an error. I think we are interested in returning typed errors based on the code and letting the caller decide whether the error is recoverable.
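A sketch of what "typed errors based on the code" could look like. The names and the recoverability mapping below are illustrative, not the actual relayer API; only the connection/6 example comes from this thread.

```rust
/// Illustrative typed error for a failed deliver_tx.
#[derive(Debug)]
enum DeliverTxError {
    /// Codespace "connection", code 6 in ibc-go: invalid connection state.
    InvalidConnectionState { log: String },
    /// Any (codespace, code) pair we have not mapped yet.
    Unmapped { codespace: String, code: u32, log: String },
}

impl DeliverTxError {
    fn from_response(codespace: &str, code: u32, log: &str) -> Self {
        match (codespace, code) {
            ("connection", 6) => DeliverTxError::InvalidConnectionState {
                log: log.to_string(),
            },
            _ => DeliverTxError::Unmapped {
                codespace: codespace.to_string(),
                code,
                log: log.to_string(),
            },
        }
    }

    /// The caller decides: a state error will not succeed on resubmission,
    /// so it is not recoverable by a plain retry.
    fn is_recoverable(&self) -> bool {
        !matches!(self, DeliverTxError::InvalidConnectionState { .. })
    }
}

fn main() {
    let err = DeliverTxError::from_response("connection", 6, "invalid connection state");
    assert!(!err.is_recoverable());
    println!("{:?}", err);
}
```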
For IBC the errors (codes) are defined in the errors.go file in each sub-protocol. One example: https://github.com/cosmos/ibc-go/blob/main/modules/core/03-connection/types/errors.go
If I try to send a chan-open-try for a connection that is not open, an error is issued from the chain here (ErrInvalidConnectionState, which is code 6):
https://github.com/cosmos/ibc-go/blob/f4123347754d814759c9208273fe6772880063cb/modules/core/04-channel/keeper/handshake.go#L154-L158
And we get this response (println from our tx_result_to_event):
```
Response {
    check_tx: TxResult {
        code: Ok,
        data: None,
        log: Log(
            "[]",
        ),
        info: Info(
            "",
        ),
        gas_wanted: Gas(
            0,
        ),
        gas_used: Gas(
            0,
        ),
        events: [],
        codespace: Codespace(
            "",
        ),
    },
    deliver_tx: TxResult {
        code: Err(
            6,
        ),
        data: None,
        log: Log(
            "failed to execute message; message index: 1: channel handshake open try failed: connection state is not OPEN (got STATE_INIT): invalid connection state",
        ),
        info: Info(
            "",
        ),
        gas_wanted: Gas(
            0,
        ),
        gas_used: Gas(
            0,
        ),
        events: [],
        codespace: Codespace(
            "connection",
        ),
    },
    hash: transaction::Hash(7D51A00CC530011768600E1BF64816D3CB99BE0EBBC1408E7DCC09CC4C9972B7),
    height: block::Height(39444),
}
```
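Based on the response above, here is a sketch of surfacing the (codespace, code) pair instead of only the log string. The structs are simplified stand-ins for the tendermint-rs types linked earlier, not the real definitions.

```rust
/// Simplified stand-ins for tendermint-rs's TxResult / Response
/// (the real types live in rpc/src/endpoint/broadcast/tx_commit.rs).
struct TxResult {
    code: u32, // 0 = Ok, non-zero = Err(code)
    codespace: String,
}

struct Response {
    check_tx: TxResult,
    deliver_tx: TxResult,
}

/// Return the first failing phase's (codespace, code), if any:
/// check_tx failures take precedence, then deliver_tx.
fn tx_error_code(resp: &Response) -> Option<(&str, u32)> {
    if resp.check_tx.code != 0 {
        Some((resp.check_tx.codespace.as_str(), resp.check_tx.code))
    } else if resp.deliver_tx.code != 0 {
        Some((resp.deliver_tx.codespace.as_str(), resp.deliver_tx.code))
    } else {
        None
    }
}

fn main() {
    // Mirrors the printed response: check_tx Ok, deliver_tx Err(6) in "connection".
    let resp = Response {
        check_tx: TxResult { code: 0, codespace: String::new() },
        deliver_tx: TxResult { code: 6, codespace: "connection".into() },
    };
    assert_eq!(tx_error_code(&resp), Some(("connection", 6)));
    println!("{:?}", tx_error_code(&resp));
}
```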
I think we need to look at all the IBC errors.go files and map the error codes to Hermes typed errors. Not sure if a one-to-one mapping is needed, but it is a start.
Yes, that is my impression as well. However, the error codes are just integers specific to each module, and I am not sure they are guaranteed not to overlap. My initial thought was to do string pattern matching to determine the exact error. Also, the Go code does not just use sdkerrors.Wrapf, but also other more general functions such as fmt.Errorf. We could perhaps classify the more prominent errors in the errors.go files first, though I don't think they are exhaustive.
So do you mean the main choke point we need to handle is tx_result_to_event? This was not specified in the issue description, and I ended up hunting through the entire code base without knowing where the errors of interest originated. :sweat_smile:
> and I am not sure if they are guaranteed to not overlap when making a call.

They don't within the codespace.
> Also, the Go code does not just use sdkerrors.Wrapf, but also other more general functions such as fmt.Errorf

Yes, but on the relayer side we are interested in that code integer and translating it into a typed error.
> So do you mean the main choke point we need to handle is tx_result_to_event? This was not specified in the issue description,

Sorry, I should have added more details to "...most complex ones are the on-chain errors". But looking at the traces you posted, it seemed to me that you were on the right track.
> they don't within the codespace

Aha! So that's the devil in the detail! Then I'm confident we can interpret the error code directly.
Btw, I'm still quite confused that when an error happens, tx_result_to_event still succeeds with the error event:

```rust
if response.check_tx.code.is_err() {
    return Ok(vec![IbcEvent::ChainError(format!(
        "check_tx reports error: log={:?}",
        response.check_tx.log
    ))]);
}
```
Is that intentional? If not, I will just change it to return them as newly classified errors in relayer::errors::Error.
That's our current way to distinguish chain IBC errors from everything else. There are very few cases here for which we should retry, but we need to go over the full list of IBC SdkErrors first. And yes, we can work with typed errors here as well.
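Once the SdkError list is catalogued, the retry decision could key on the (codespace, code) pair, since codes are only unique within their codespace. A sketch with illustrative entries (the entries shown are examples, not a vetted list):

```rust
/// Illustrative retry policy keyed on (codespace, code). Codes are only
/// unique within a codespace, so both parts are needed to identify an error.
fn should_retry(codespace: &str, code: u32) -> bool {
    match (codespace, code) {
        // Example of a transient error that may succeed on resubmission
        // after refreshing local state (illustrative entry).
        ("sdk", 32) => true,
        // connection/6 (invalid connection state, as in the response quoted
        // earlier) will not fix itself by resubmitting the same tx.
        ("connection", 6) => false,
        // Default to not retrying unknown chain errors.
        _ => false,
    }
}

fn main() {
    assert!(should_retry("sdk", 32));
    assert!(!should_retry("connection", 6));
    println!("policy checks passed");
}
```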
Seems like we have some dead code capturing various errors that come from IBC-go. Ref: https://github.com/informalsystems/ibc-rs/pull/2266#issuecomment-1145951834
Closing, as most of the work has been done in https://github.com/informalsystems/hermes/pull/988 and other PRs over the years. Let's reopen if needed.