CometBFT crashes with panic in fendermint during BeginBlock when fendermint is catching up (replaying) from CometBFT #1196

Description:

We encountered an issue where cometbft crashes after a panic in fendermint. In BeginBlock, fendermint resolves the CometBFT validator ID to a public key, which requires a connection to the cometbft RPC API. When fendermint’s data folder is deleted and fendermint is restarted, cometbft starts block replay but is not yet ready to accept the RPC API connection that fendermint requires, so the lookup fails.
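The "not ready for the RPC API connection" condition can be illustrated with a plain TCP probe. This is only a sketch using the Rust standard library, not code from fendermint: the function name `rpc_is_reachable` is hypothetical, and a successful TCP connect only shows that something is listening on the port, not that the RPC API is able to answer requests.

```rust
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// Hypothetical helper: returns true if a TCP connection to `addr`
/// succeeds within `timeout`. A refused connection (the failure seen
/// in the fendermint panic in this report) yields false.
pub fn rpc_is_reachable(addr: SocketAddr, timeout: Duration) -> bool {
    TcpStream::connect_timeout(&addr, timeout).is_ok()
}
```

Assuming cometbft's RPC listens on its default address, a caller could parse `127.0.0.1:26657` into a `SocketAddr` and probe it before issuing the validator lookup; the actual address depends on the node's configuration.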

Steps to Reproduce:

  1. Run both cometbft and fendermint.
  2. Wait until a few blocks have been produced.
  3. Stop fendermint and delete its data folder.
  4. Restart fendermint.

Observed Errors:

cometbft Logs Before Crash:

I[2024-11-06|15:13:38.920] ABCI Replay Blocks module=consensus appHeight=0 storeHeight=5 stateHeight=5
I[2024-11-06|15:13:51.760] Applying block module=consensus height=1
E[2024-11-06|15:13:51.762] Stopping abci.socketClient for error: read message: EOF module=abci-client connection=consensus
I[2024-11-06|15:13:51.762] service stop module=abci-client connection=consensus msg="Stopping socketClient service" impl=socketClient
E[2024-11-06|15:13:51.762] consensus connection terminated. Did the application crash? Please restart CometBFT module=proxy err="read message: EOF"

fendermint Panic:

2024-11-06T14:13:51.762219Z ERROR fendermint/abci/src/application.rs:212: failed to execute ABCI request: Error { msg: "HTTP error", source: "error trying to connect: tcp connect error: Connection refused (os error 61)", }
thread 'tokio-runtime-worker' panicked at /Users/alexei/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tower-abci-0.7.0/src/v037/server.rs:145:70:
called Result::unwrap() on an Err value: HTTP error

Caused by: error trying to connect: tcp connect error: Connection refused (os error 61)

Location: /Users/alexei/.cargo/registry/src/index.crates.io-6f17d22bba15001f/flex-error-0.4.4/src/tracer_impl/eyre.rs:10:9

Caused by:
   0: HTTP error
   1: error trying to connect: tcp connect error: Connection refused (os error 61)
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace
2024-11-06T14:13:51.995565Z ERROR fendermint/app/src/main.rs:24: panicking stacktrace=" 0: std::backtrace_rs::backtrace::libunwind::trace\n

Cause:

The issue seems to be due to this line in validators.rs, where fendermint tries to resolve the validator ID to a public key by connecting to the cometbft RPC API during BeginBlock. If cometbft is not fully ready (due to replay or a fresh start with deleted data), this connection fails, causing fendermint to panic and terminate.
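One way to tolerate a temporarily unavailable RPC endpoint is to retry the lookup with a growing delay and return the final error to the caller instead of unwrapping it. The following is a minimal sketch under stated assumptions, not the fix adopted by the project: `retry_with_backoff` is a hypothetical helper, it uses only the standard library, and it blocks the current thread with `std::thread::sleep`. Inside an async handler such as the one in the panic trace, an async timer would be needed instead, and whether BeginBlock should wait on the RPC at all is a separate design question.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical helper: calls `op` until it succeeds or `max_attempts`
/// attempts have been made, doubling the delay after each failure.
/// At least one attempt is always made. The last error is returned to
/// the caller rather than unwrapped, so a refused connection does not
/// have to become a panic.
pub fn retry_with_backoff<T, E, F>(
    max_attempts: u32,
    initial_delay: Duration,
    mut op: F,
) -> Result<T, E>
where
    F: FnMut() -> Result<T, E>,
{
    let mut delay = initial_delay;
    let mut attempt = 1;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Out of attempts: hand the error back to the caller.
            Err(err) if attempt >= max_attempts => return Err(err),
            Err(_) => {
                sleep(delay);
                delay = delay.saturating_mul(2);
                attempt += 1;
            }
        }
    }
}
```

The closure passed as `op` would wrap whatever call performs the validator lookup. The attempt count and initial delay are illustrative parameters and would need tuning against how long cometbft takes to open its RPC listener during replay.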

Labels: bug
