
[Bug] The sync server returns 403 Forbidden for some block requests

Open damons opened this issue 3 years ago • 8 comments

🐛 Bug Report

Sync improved dramatically after #1896; however, the snarkOS main thread panics again and the process dies while syncing. This happens consistently. After a restart it will sync several more blocks, if it's lucky, but it eventually hits this issue again.

2022-09-27T13:04:30.857069Z TRACE Requesting block 20123 of 25922
2022-09-27T13:04:31.585474Z  INFO Synced up to block 20024 of 25922 - 77% complete (est. 17 minutes remaining)
2022-09-27T13:04:31.585510Z TRACE Requesting block 20124 of 25922
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Block parse error', /jihad/snarkos/snarkos/ledger/mod.rs:328:39

Steps to Reproduce

  1. Run snarkos.
  2. Watch it sync.
  3. It panics and dies.

Expected Behavior

Sync completes.

Your Environment

  • testnet3
  • Rust 1.64.0
  • Ubuntu 20.04.5 LTS
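
To illustrate the failure mode, here is a minimal sketch of the pattern behind the panic above; it is not the actual snarkOS ledger code, and the function names and the reqwest/anyhow dependencies are assumptions. The point is that unwrapping the block request result kills the main thread on any fetch or parse failure, whereas propagating the error would let the sync loop retry.

use anyhow::{bail, Context, Result};

// Hypothetical helper, not snarkOS internals; assumes reqwest (with the "blocking"
// feature) and anyhow as dependencies. The URL layout matches the curl command
// shown later in this thread.
fn request_block(height: u32) -> Result<Vec<u8>> {
    let url = format!("https://vm.aleo.org/testnet3/block/testnet3/{height}.block");
    let response = reqwest::blocking::get(&url)
        .with_context(|| format!("failed to request block {height}"))?;
    if !response.status().is_success() {
        bail!("block {height} request returned {}", response.status());
    }
    Ok(response.bytes()?.to_vec())
}

fn sync_step(height: u32) -> Result<()> {
    // The panicking code effectively does `request_block(height).unwrap()`.
    // Returning the error instead keeps the main thread alive so the caller can retry.
    let _bytes = request_block(height)?;
    // ... deserialize the block and add it to the ledger here ...
    Ok(())
}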

damons avatar Sep 27 '22 13:09 damons

It's due to the server not being able to handle that many requests. I don't have access to its configuration, but I've reported it upstream.
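
As a client-side mitigation while the server is overloaded, one option is to pace the block requests instead of issuing them back-to-back. This is a hedged sketch, not snarkOS code; the function name, URL layout, and reqwest dependency are assumptions:

use std::{thread, time::Duration};

// Hypothetical loop, not part of snarkOS: fetch a range of blocks with a small
// fixed delay between requests so the sync server isn't hit with a burst.
fn fetch_range_paced(start: u32, end: u32) {
    for height in start..end {
        let url = format!("https://vm.aleo.org/testnet3/block/testnet3/{height}.block");
        match reqwest::blocking::get(&url) {
            Ok(response) => println!("block {height}: HTTP {}", response.status()),
            Err(error) => eprintln!("block {height}: {error}"),
        }
        // The delay is arbitrary; tune it to whatever the server tolerates.
        thread::sleep(Duration::from_millis(200));
    }
}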

ljedrz avatar Sep 27 '22 19:09 ljedrz

Now that rate limiting is turned off, it seems that some blocks might be missing; I encountered it with block 25976, but others have been reported before.

We need to find out why this happens; for reference, this is the output of `curl -I https://vm.aleo.org/testnet3/block/testnet3/25976.block` that I'm currently getting:

HTTP/1.1 403 Forbidden
Server: Cowboy
Connection: keep-alive
X-Powered-By: Express
Access-Control-Allow-Origin: *
Surrogate-Control: no-store
Cache-Control: max-age=60
Pragma: no-cache
Expires: 0
Date: Fri, 30 Sep 2022 09:06:35 GMT
Content-Length: 231
Content-Type: application/xml
Accept-Ranges: bytes
X-Amz-Request-Id: tx0000000000000ccc49635-006336b18c-3f19a6f7-nyc3c
Strict-Transport-Security: max-age=15552000; includeSubDomains; preload
Vary: Access-Control-Request-Headers,Access-Control-Request-Method,Origin, Accept-Encoding
X-Hw: 1664528795.dop089.dc2.t,1664528795.cds063.dc2.shn,1664528795.dop089.dc2.t,1664528795.cds078.dc2.c
Via: 1.1 vegur
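
That 403 would also explain the "Block parse error" in the original report: the body that comes back is a small XML error document (note the Content-Type: application/xml and Content-Length: 231), and feeding it straight to the block deserializer would fail. Here is a quick hedged probe, not snarkOS code and assuming reqwest, to see how many heights in a range are affected:

use reqwest::blocking::Client;

// Hypothetical probe: HEAD each .block URL in a range and report the ones the
// sync server refuses, mirroring the `curl -I` check above.
fn main() -> reqwest::Result<()> {
    let client = Client::new();
    for height in 25_970u32..=25_980 {
        let url = format!("https://vm.aleo.org/testnet3/block/testnet3/{height}.block");
        let status = client.head(&url).send()?.status();
        if !status.is_success() {
            println!("block {height}: {status}");
        }
    }
    Ok(())
}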

ljedrz avatar Sep 30 '22 09:09 ljedrz

This is likely the same issue as https://github.com/AleoHQ/snarkOS/issues/1934.

ljedrz avatar Sep 30 '22 09:09 ljedrz

I frequently get a similar but different error:

0|cargo  | thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: error sending request for url (https://vm.aleo.org/testnet3/block/testnet3/26580.block): error trying to connect: Connection reset by peer (os error 104)
0|cargo  | Caused by:
0|cargo  |     0: error trying to connect: Connection reset by peer (os error 104)
0|cargo  |     1: Connection reset by peer (os error 104)
0|cargo  |     2: Connection reset by peer (os error 104)', /root/snarkOS/snarkos/ledger/mod.rs:328:39
0|cargo  | note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
PM2      | App [cargo:0] exited with code [101] via signal [SIGINT]

It's always os error 104 (connection reset) in my case, and I need to restart the process constantly to keep attempting a sync. Same issue across multiple instances.
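
A hedged client-side workaround sketch (not the snarkOS code path; the function name and the reqwest/anyhow dependencies are assumptions): retry the block request a few times with exponential backoff, so a transient connection reset doesn't take the whole process down.

use std::{thread, time::Duration};
use anyhow::{bail, Result};

// Hypothetical helper, not snarkOS internals: retry a block request with backoff.
fn request_block_with_retry(height: u32, attempts: u32) -> Result<Vec<u8>> {
    let url = format!("https://vm.aleo.org/testnet3/block/testnet3/{height}.block");
    for attempt in 1..=attempts {
        match reqwest::blocking::get(&url) {
            Ok(response) if response.status().is_success() => {
                return Ok(response.bytes()?.to_vec());
            }
            Ok(response) => eprintln!("block {height}: HTTP {} (attempt {attempt})", response.status()),
            Err(error) => eprintln!("block {height}: {error} (attempt {attempt})"),
        }
        // Exponential backoff: 1 s, 2 s, 4 s, ...
        thread::sleep(Duration::from_secs(1u64 << (attempt - 1)));
    }
    bail!("block {height}: giving up after {attempts} attempts")
}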

emkman avatar Oct 14 '22 16:10 emkman

This seems like a transient sync server issue; how long did it take for the sync process to resume without issues? In any case, as long as the sync process could be resumed, it's not this issue, but rather https://github.com/AleoHQ/snarkOS/issues/1973.

ljedrz avatar Oct 17 '22 13:10 ljedrz

It happened constantly, a few times an hour at least. Syncing to completion wasn't even possible until I wrapped the process in PM2 with autorestart. Since the sync server seems to just be proxying an S3 bucket, the failure rate seems really high and is worth looking into. I agree that the linked issue #1973 looks like it may have fixed it, at least client-side. Thanks.

emkman avatar Oct 25 '22 19:10 emkman

Hi @emkman, I hit the same issue with error 104. Can you provide a step-by-step resolution? Thanks.

erickingxu avatar Oct 27 '22 23:10 erickingxu

@erickingxu I would git pull and make sure you are up to date, as this may already be fixed. Regardless, if you want to ensure automatic restarts, pm2 makes this very easy. Install nvm if you don't have node on your system.

# If npm isn't on the system yet, install nvm and a Node release first (in a new shell):
# curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.2/install.sh | bash
# nvm install --lts
npm install pm2 -g
pm2 start "cargo run --release"

This will ensure the Aleo node automatically restarts on any failure.

emkman avatar Nov 03 '22 20:11 emkman

I don't think we've seen it happen in a while now, so I'm assuming it's fixed. Please feel free to reopen if it ever recurs.

ljedrz avatar Dec 19 '22 11:12 ljedrz