[Bug] The sync server returns 403 Forbidden for some block requests
🐛 Bug Report
Sync improved dramatically after #1896; however, the snarkOS main thread panics again and the process dies while syncing. This happens consistently. After a restart it will sync several more blocks, if it's lucky, but it eventually hits this issue again.
2022-09-27T13:04:30.857069Z TRACE Requesting block 20123 of 25922
2022-09-27T13:04:31.585474Z INFO Synced up to block 20024 of 25922 - 77% complete (est. 17 minutes remaining)
2022-09-27T13:04:31.585510Z TRACE Requesting block 20124 of 25922
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Block parse error', /jihad/snarkos/snarkos/ledger/mod.rs:328:39
Steps to Reproduce
- Step 1. Run snarkos.
- Step 2. Watch it sync.
- Step 3. It panics and dies.
Expected Behavior
Sync completes.
Your Environment
- testnet3
- Rust 1.64.0
- Ubuntu 20.04.5 LTS
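For context, the panic comes from an `unwrap()` on the result of the block request/parse in `ledger/mod.rs:328`, so any bad response (such as a 403 error page) takes the whole process down. Below is a minimal sketch of a more defensive fetch, assuming a reqwest-style blocking client; it is not the actual snarkOS code, just an illustration of checking the HTTP status before handing the body to the block parser:

```rust
// Minimal sketch (not the actual snarkOS code path at ledger/mod.rs:328):
// fetch one block and return an error instead of panicking, assuming the
// reqwest blocking client.
use std::time::Duration;

fn fetch_block_bytes(height: u32) -> Result<Vec<u8>, String> {
    let url = format!("https://vm.aleo.org/testnet3/block/testnet3/{height}.block");
    let client = reqwest::blocking::Client::builder()
        .timeout(Duration::from_secs(30))
        .build()
        .map_err(|e| e.to_string())?;

    let response = client.get(url).send().map_err(|e| e.to_string())?;

    // A 403 from the sync server carries an XML error body; feeding that
    // straight into the block parser is what surfaces as "Block parse error".
    if !response.status().is_success() {
        return Err(format!("block {height}: HTTP {}", response.status()));
    }

    response
        .bytes()
        .map(|bytes| bytes.to_vec())
        .map_err(|e| e.to_string())
}
```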
This is caused by the sync server not being able to handle that many requests. I don't have access to its configuration, but I've reported it upstream.
Now that rate limiting is turned off, it seems that some blocks might be missing; I encountered this with block 25976, but others have been reported before.
We need to find out why this happens; for reference, this is the output I'm currently getting from `curl -I https://vm.aleo.org/testnet3/block/testnet3/25976.block`:
HTTP/1.1 403 Forbidden
Server: Cowboy
Connection: keep-alive
X-Powered-By: Express
Access-Control-Allow-Origin: *
Surrogate-Control: no-store
Cache-Control: max-age=60
Pragma: no-cache
Expires: 0
Date: Fri, 30 Sep 2022 09:06:35 GMT
Content-Length: 231
Content-Type: application/xml
Accept-Ranges: bytes
X-Amz-Request-Id: tx0000000000000ccc49635-006336b18c-3f19a6f7-nyc3c
Strict-Transport-Security: max-age=15552000; includeSubDomains; preload
Vary: Access-Control-Request-Headers,Access-Control-Request-Method,Origin, Accept-Encoding
X-Hw: 1664528795.dop089.dc2.t,1664528795.cds063.dc2.shn,1664528795.dop089.dc2.t,1664528795.cds078.dc2.c
Via: 1.1 vegur
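To see how widespread the missing blocks are, something like the sketch below could probe a range of heights and report any non-200 responses. This assumes the reqwest blocking client and the same `vm.aleo.org` URL scheme as the curl command above; it's only a diagnostic helper, not part of snarkOS:

```rust
// Diagnostic sketch: probe block heights with HEAD requests and report any
// that do not return 200, e.g. blocks that 403 like 25976 above.
fn find_unavailable_blocks(start: u32, end: u32) -> Vec<u32> {
    let client = reqwest::blocking::Client::new();
    let mut unavailable = Vec::new();
    for height in start..=end {
        let url = format!("https://vm.aleo.org/testnet3/block/testnet3/{height}.block");
        match client.head(url).send() {
            Ok(resp) if resp.status().is_success() => {}
            Ok(resp) => {
                println!("block {height}: HTTP {}", resp.status());
                unavailable.push(height);
            }
            Err(e) => {
                println!("block {height}: request failed: {e}");
                unavailable.push(height);
            }
        }
    }
    unavailable
}
```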
This is likely the same issue as https://github.com/AleoHQ/snarkOS/issues/1934.
I frequently get a similar but distinct error:
0|cargo | thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: error sending request for url (https://vm.aleo.org/testnet3/block/testnet3/26580.block): error trying to connect: Connection reset by peer (os error 104)
0|cargo | Caused by:
0|cargo | 0: error trying to connect: Connection reset by peer (os error 104)
0|cargo | 1: Connection reset by peer (os error 104)
0|cargo | 2: Connection reset by peer (os error 104)', /root/snarkOS/snarkos/ledger/mod.rs:328:39
0|cargo | note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
PM2 | App [cargo:0] exited with code [101] via signal [SIGINT]
In my case it's always os error 104 (connection reset by peer), and I need to restart the process constantly to attempt a sync. It's the same issue across multiple instances.
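For what it's worth, transient resets like this can usually be absorbed with a bounded retry instead of an external restart loop. A minimal sketch, assuming reqwest and an illustrative backoff policy (not what snarkOS actually does):

```rust
// Sketch of a retry with exponential backoff for transient errors such as
// "connection reset by peer"; the policy here is illustrative only.
use std::{thread, time::Duration};

fn fetch_with_retry(url: &str, max_attempts: u32) -> Result<Vec<u8>, reqwest::Error> {
    let client = reqwest::blocking::Client::new();
    let mut delay = Duration::from_secs(1);
    let mut attempt = 1;
    loop {
        // Treat connection errors and non-2xx statuses the same way: retry
        // until the attempt budget is exhausted.
        match client.get(url).send().and_then(|r| r.error_for_status()) {
            Ok(response) => return response.bytes().map(|b| b.to_vec()),
            Err(e) if attempt < max_attempts => {
                eprintln!("attempt {attempt} failed ({e}); retrying in {delay:?}");
                thread::sleep(delay);
                delay *= 2;
                attempt += 1;
            }
            Err(e) => return Err(e),
        }
    }
}
```

With something like this in the sync loop, an occasional reset costs a few seconds rather than a full process restart.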
This seems like a transient sync server issue; how long did it take for the sync process to resume without issues? In any case, as long as the sync process can be resumed, this isn't the issue described here but rather https://github.com/AleoHQ/snarkOS/issues/1973.
It happened constantly, a few times an hour at least. Syncing to completion wasn't even possible until I wrapped the process in PM2 with autorestart. Since the sync server seems to just be proxying an S3 bucket, the failure rate seems unusually high and worth looking into. I agree that the linked issue #1973 looks like it may have fixed it, at least client-side. Thanks.
Hi @emkman, I hit the same issue with error 104. Can you share the resolution steps? Thanks.
@erickingxu I would git pull and make sure you are up to date as this may be fixed. Regardless, if you want to ensure automatic restarts, pm2 makes this very easy. Install nvm if you don't have node on your system.
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.2/install.sh | bash
nvm install node   # the script above only installs nvm; node (and npm) still need to be installed
npm install pm2 -g
pm2 start "cargo run --release"
This will ensure the Aleo node automatically restarts on any failure.
I don't think we've seen this happen in a while, so I'm assuming it's fixed. Please feel free to reopen if it happens again.