hermes icon indicating copy to clipboard operation
hermes copied to clipboard

Relayer error handling specification

Open ancazamfir opened this issue 5 years ago • 8 comments

Crate

docs

Summary

Currently the relayer performs retries for any error that occurs while building messages and transactions.

Problem Definition

There are RPC, light client, on-chain, internal errors, etc. The most complex ones are the on-chain errors that are a bit more subtle (leaving aside errors in the relayer code) and also they are not typed (we just get an error message and the reason is embedded in a string). The retry may apply for some cases, in others we should not retry (e.g. a transaction has already been submitted by another relayer), and yet in other cases we should rebuild stuff, etc.

Proposal

Analyze the code, document all errors and the actions to be taken in each case. Consider also the exponential backoff as per @romac's https://github.com/informalsystems/ibc-rs/pull/709#discussion_r582852413, where applicable.


For Admin Use

  • [x] Not duplicate issue
  • [x] Appropriate labels applied
  • [x] Appropriate milestone (priority) applied
  • [ ] Appropriate contributors tagged
  • [ ] Contributor assigned/self-assigned

ancazamfir avatar Feb 25 '21 16:02 ancazamfir

also they are not typed (we just get an error message and the reason is embedded in a string)

May be worth trying to get these typed upstream in ibc

ebuchman avatar May 05 '21 00:05 ebuchman

I tried to do a full trace of an example error message to see how it is propagated. There is a complicated path from the Hermes relayer to the IBC module, and there are many potential errors that can occur in each step.

On the Go side, the error messages are formatted using helpers such as sdkerrors.Wrapf that are converted into strings. It would be challenging to modify the Go code to use more structured logging, so we can only do pattern matching on the string to find the type of error that are of our interest.

Here is an example error message:

Error: tx error: failed with underlying cause: tx response error: deliver_tx reports error: log=Log("failed to execute message; message index: 1: connection handshake open try failed: failed connection state verification for client (07-tendermint-4): chained membership proof failed to verify membership of value: 0A0F... in subroot C9FE...at index 0. Please ensure the path and value are both correct.: invalid proof")

And here is the manual stack trace:

I'm not sure if there are good way to document all errors on the Go code, other than trying to search through the code with keywords such as: status.Errorf, sdkerrors.Wrap, sdkerrors.Wrapf, fmt.Errorf, and look for standard errors in ibc-go/modules/*/errors.go. If we do document the errors, we could submit the documentation to ibc-go as well.

soareschen avatar May 07 '21 14:05 soareschen

we get the response from the Tx here: https://github.com/informalsystems/ibc-rs/blob/2592440b0c599304a1032746ce975b2cf7baa673/relayer/src/chain/cosmos.rs#L1479

The response has a check_tx and a deliver_tx result with type TxResult defined here https://github.com/informalsystems/tendermint-rs/blob/89b3fd0a236b54fcbd149aa9917a52f7d0a6de26/rpc/src/endpoint/broadcast/tx_commit.rs#L61-L91

We currently take the log and embed it in an error. I think we are interested in returning some typed errors based on the code and let the caller decide if the error is recoverable or not.

For IBC the errors (codes) are defined in the errors.go file in each sub-protocol. One example: https://github.com/cosmos/ibc-go/blob/main/modules/core/03-connection/types/errors.go

If I try to send a chan-open-try for a connection that is not opened, an error is issued from the chain here (ErrInvalidConnectionState which is code 6): https://github.com/cosmos/ibc-go/blob/f4123347754d814759c9208273fe6772880063cb/modules/core/04-channel/keeper/handshake.go#L154-L158

And we get this response (prinln from our tx_result_to_event)

Response {
    check_tx: TxResult {
        code: Ok,
        data: None,
        log: Log(
            "[]",
        ),
        info: Info(
            "",
        ),
        gas_wanted: Gas(
            0,
        ),
        gas_used: Gas(
            0,
        ),
        events: [],
        codespace: Codespace(
            "",
        ),
    },
    deliver_tx: TxResult {
        code: Err(
            6,
        ),
        data: None,
        log: Log(
            "failed to execute message; message index: 1: channel handshake open try failed: connection state is not OPEN (got STATE_INIT): invalid connection state",
        ),
        info: Info(
            "",
        ),
        gas_wanted: Gas(
            0,
        ),
        gas_used: Gas(
            0,
        ),
        events: [],
        codespace: Codespace(
            "connection",
        ),
    },
    hash: transaction::Hash(7D51A00CC530011768600E1BF64816D3CB99BE0EBBC1408E7DCC09CC4C9972B7),
    height: block::Height(39444),
}

I think we need to look at all ibc error.rs files and map the errror code -s to hermes typed errors. Not sure if a one-to-one mapping is needed but it is a start.

ancazamfir avatar May 18 '21 11:05 ancazamfir

Yes, that is my impression as well. However the error code are just integers specific to the module, and I am not sure if they are guaranteed to not overlap when making a call. My initial thought was to do string pattern matching to determine the exact error. Also, the Go code do not just use sdkerrors.Wrapf, but also other more general functions such as fmt.Errorf. We could perhaps try to classify the more prominent errors in the errors.go files first, though I don't think they are exhaustive.

So do you mean the main choke point we need to handle is tx_result_to_event? This was not specified in the issue description, and I ended up hunting the entire code base and not knowing where the errors of interest are originated from. :sweat_smile:

soareschen avatar May 18 '21 12:05 soareschen

and I am not sure if they are guaranteed to not overlap when making a call.

they don't within the codespace

Also, the Go code do not just use sdkerrors.Wrapf, but also other more general functions such as fmt.Errorf

yes, but on the relayer side we are interested in that code integer and translate it into a typed error.

So do you mean the main choke point we need to handle is tx_result_to_event? This was not specified in the issue description,

Sorry, I should have added more details to .. most complex ones are the on-chain errors But looking at the traces you posted it looked to me that you were on the right track.

ancazamfir avatar May 18 '21 12:05 ancazamfir

they don't within the codespace

Aha! So that's the devil in the detail! Then I'm confident to interpret the error code directly.

Btw I'm still quite confused that when an error happens, tx_result_to_event still succeeds with the error event:

    if response.check_tx.code.is_err() {
        return Ok(vec![IbcEvent::ChainError(format!(
            "check_tx reports error: log={:?}",
            response.check_tx.log
        ))]);
    }

Is that intentional? If not I will just change it to return them as newly classified errors in relayer::errors::Error.

soareschen avatar May 18 '21 20:05 soareschen

Is that intentional? If not I will just change it to return them as newly classified errors in relayer::errors::Error.

that's our current way to distinguish between chain IBC errors and all the rest. There are very few for which we should retry here but we need to go over the full list of IBC SdkErrors first. And yes we can work with typed errors here also.

ancazamfir avatar May 25 '21 11:05 ancazamfir

Seems like we have some dead code capturing various errors that come from IBC-go. Ref: https://github.com/informalsystems/ibc-rs/pull/2266#issuecomment-1145951834

adizere avatar Jun 03 '22 13:06 adizere

Closing as most of the work has been done in https://github.com/informalsystems/hermes/pull/988 and other PRs over the years. Let' reopen if needed.

ancazamfir avatar Jan 11 '24 18:01 ancazamfir