[BUG]: after power outage alphas don't start anymore
What version of Dgraph are you using?
happens in: v22.0.2 and 23-beta-1
Tell us a little more about your go-environment?
Go version : go1.18.5 jemalloc enabled : true
Have you tried reproducing the issue with the latest release?
Yes
What is the hardware spec (RAM, CPU, OS)?
Docker, 4 vCPU, 8 GB Ram
What steps will reproduce the bug?
Start a local cluster with 3 alphas and 1 zero in docker. add a schema and data. plug the power of while the system is running.
sometimes, 1 alpha sometimes 2 alphas or it also happens that all 3 alphas can't start now:
2023/02/21 18:36:36 3 state.commit 6886 is out of range [150, 6885] panic: 3 state.commit 6886 is out of range [150, 6885]
Expected behavior and actual result.
if an alpha is out of sync it should sync itself
Additional information
more details are discussed in the chat here: https://discuss.dgraph.io/t/power-outage-and-now-the-cluster-didnt-start-again/18314are
Hey, do you also get the message "Was the raft log corrupted, truncated, or lost?" until you restart all three alpha nodes like I described in #8751?
hi,
no, that seems to be correct
but I see there is an error when subscribing for updates:
E0316 13:04:05.600424 15 groups.go:1229] Error during SubscribeForUpdates for prefix "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x15dgraph.graphql.schema\x00": unable to find any servers for group: 1. closer err: <nil>
complete log is here:
Dgraph version : v23.0.0-beta1
Dgraph codename : dgraph
Dgraph SHA-256 : 6ea18d0c006fef96cff36b62202b6670f43f50ec486629f9e0fef8a73d2adb93
Commit SHA-1 : 1a182e6
Commit timestamp : 2023-03-01 22:25:03 +0000
Branch : HEAD
Go version : go1.19.6
jemalloc enabled : true
For Dgraph official documentation, visit https://dgraph.io/docs.
For discussions about Dgraph , visit https://discuss.dgraph.io.
For fully-managed Dgraph Cloud , visit https://dgraph.io/cloud.
Licensed variously under the Apache Public License 2.0 and Dgraph Community License.
Copyright 2015-2023 Dgraph Labs, Inc.
I0316 13:04:05.574108 15 run.go:742] x.Config: {PortOffset:0 Limit:mutations=allow; max-retries=10; max-pending-queries=10000; shared-instance=false; txn-abort-after=5m; query-edge=1000000; normalize-node=10000; mutations-nquad=1000000; disallow-drop=false; query-timeout=0ms LimitMutationsNquad:1000000 LimitQueryEdge:1000000 BlockClusterWideDrop:false LimitNormalizeNode:10000 QueryTimeout:0s MaxRetries:10 SharedInstance:false GraphQL:introspection=true; debug=false; extensions=true; poll-interval=1s; lambda-url= GraphQLDebug:false}
I0316 13:04:05.574137 15 run.go:743] x.WorkerConfig: {TmpDir:t ExportPath:export Trace:datadog=; ratio=0.01; jaeger= MyAddr:alpha2:7080 ZeroAddr:[zero:5080] TLSClientConfig:<nil> TLSServerConfig:<nil> Raft:learner=false; snapshot-after-entries=10000; snapshot-after-duration=30m; pending-proposals=256; idx=; group= Badger:{Dir: ValueDir: SyncWrites:false NumVersionsToKeep:1 ReadOnly:false Logger:0xc00011e320 Compression:1 InMemory:false MetricsEnabled:true NumGoroutines:8 MemTableSize:67108864 BaseTableSize:2097152 BaseLevelSize:10485760 LevelSizeMultiplier:10 TableSizeMultiplier:2 MaxLevels:7 VLogPercentile:0 ValueThreshold:1048576 NumMemtables:5 BlockSize:4096 BloomFalsePositive:0.01 BlockCacheSize:697932185 IndexCacheSize:375809638 NumLevelZeroTables:5 NumLevelZeroTablesStall:15 ValueLogFileSize:1073741823 ValueLogMaxEntries:1000000 NumCompactors:4 CompactL0OnClose:false LmaxCompaction:false ZSTDCompressionLevel:0 VerifyValueChecksum:false EncryptionKey:[] EncryptionKeyRotationDuration:240h0m0s BypassLockGuard:false ChecksumVerificationMode:0 DetectConflicts:true NamespaceOffset:-1 ExternalMagicVersion:0 managedTxns:false maxBatchCount:0 maxBatchSize:0 maxValueThreshold:0} WhiteListedIPRanges:[{Lower:10.0.0.0 Upper:10.255.255.255} {Lower:172.0.0.0 Upper:172.255.255.255} {Lower:192.168.0.0 Upper:192.168.255.255}] StrictMutations:false AclEnabled:false HmacSecret:**** AbortOlderThan:5m0s ProposedGroupId:0 StartTime:2023-03-16 13:04:05.154453593 +0000 UTC m=+0.179041634 Security:whitelist=10.0.0.0/8,172.0.0.0/8,192.168.0.0/16; token= EncryptionKey:**** LogDQLRequest:0 HardSync:false Audit:false}
I0316 13:04:05.574195 15 run.go:744] worker.Config: {PostingDir:p WALDir:w MutationsMode:0 AuthToken: HmacSecret:**** AccessJwtTtl:0s RefreshJwtTtl:0s CachePercentage:0,65,35 CacheMb:1024 Audit:<nil> ChangeDataConf:file=; kafka=; sasl_user=; sasl_password=; ca_cert=; client_cert=; client_key=; sasl-mechanism=PLAIN; tls=false;}
I0316 13:04:05.576768 15 log.go:296] Found file: 2 First Index: 30001
I0316 13:04:05.577463 15 storage.go:124] Init Raft Storage with snap: 30224, first: 30225, last: 31959
I0316 13:04:05.578027 15 server_state.go:140] Opening postings BadgerDB with options: {Dir:p ValueDir:p SyncWrites:false NumVersionsToKeep:2147483647 ReadOnly:false Logger:0x320f398 Compression:1 InMemory:false MetricsEnabled:true NumGoroutines:8 MemTableSize:67108864 BaseTableSize:2097152 BaseLevelSize:10485760 LevelSizeMultiplier:10 TableSizeMultiplier:2 MaxLevels:7 VLogPercentile:0 ValueThreshold:1048576 NumMemtables:5 BlockSize:4096 BloomFalsePositive:0.01 BlockCacheSize:697932185 IndexCacheSize:375809638 NumLevelZeroTables:5 NumLevelZeroTablesStall:15 ValueLogFileSize:1073741823 ValueLogMaxEntries:1000000 NumCompactors:4 CompactL0OnClose:false LmaxCompaction:false ZSTDCompressionLevel:0 VerifyValueChecksum:false EncryptionKey:[] EncryptionKeyRotationDuration:240h0m0s BypassLockGuard:false ChecksumVerificationMode:0 DetectConflicts:false NamespaceOffset:1 ExternalMagicVersion:0 managedTxns:false maxBatchCount:0 maxBatchSize:0 maxValueThreshold:0}
I0316 13:04:05.593597 15 log.go:34] All 3 tables opened in 1ms
I0316 13:04:05.594630 15 log.go:34] Discard stats nextEmptySlot: 0
I0316 13:04:05.594673 15 log.go:34] Set nextTxnTs to 231515
I0316 13:04:05.595718 15 log.go:34] Deleting empty file: p/000363.vlog
I0316 13:04:05.599243 15 groups.go:101] Current Raft Id: 0x1
I0316 13:04:05.599281 15 groups.go:117] Sending member request to Zero: id:1 addr:"alpha2:7080"
I0316 13:04:05.599727 15 worker.go:114] Worker listening at address: [::]:7080
I0316 13:04:05.600403 15 run.go:559] Bringing up GraphQL HTTP API at 0.0.0.0:8080/graphql
E0316 13:04:05.600424 15 groups.go:1229] Error during SubscribeForUpdates for prefix "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x15dgraph.graphql.schema\x00": unable to find any servers for group: 1. closer err: <nil>
I0316 13:04:05.600430 15 run.go:560] Bringing up GraphQL HTTP admin API at 0.0.0.0:8080/admin
I0316 13:04:05.601032 15 run.go:587] gRPC server started. Listening on port 9080
I0316 13:04:05.601038 15 run.go:588] HTTP server started. Listening on port 8080
I0316 13:04:05.700604 15 pool.go:165] CONN: Connecting to zero:5080
I0316 13:04:05.704869 15 groups.go:136] Connected to group zero. Assigned group: 0
I0316 13:04:05.704895 15 groups.go:138] Raft Id after connection to Zero: 0x1
I0316 13:04:05.704974 15 pool.go:165] CONN: Connecting to alpha1:7080
I0316 13:04:05.705009 15 pool.go:165] CONN: Connecting to alpha3:7080
I0316 13:04:05.705048 15 draft.go:268] Node ID: 0x1 with GroupID: 1
I0316 13:04:05.705052 15 draft.go:277] RaftContext: id:1 group:1 addr:"alpha2:7080"
I0316 13:04:05.705162 15 node.go:153] Setting raft.Config to: &{ID:1 peers:[] learners:[] ElectionTick:20 HeartbeatTick:1 Storage:0xc000134880 Applied:30224 MaxSizePerMsg:262144 MaxCommittedSizePerReady:67108864 MaxUncommittedEntriesSize:0 MaxInflightMsgs:256 CheckQuorum:false PreVote:true ReadOnlyOption:0 Logger:0x320f398 DisableProposalForwarding:false}
I0316 13:04:05.705291 15 log.go:34] LoadFromDb Schema Streaming about 0 B of uncompressed data (0 B on disk)
I0316 13:04:05.705689 15 log.go:34] Number of ranges found: 3
I0316 13:04:05.705747 15 log.go:34] Sent range 0 for iteration: [010000000000000000000776657273696f6effffffffffff8891, 010000000000000000000d706c616e696e675f7461636b73fffffffffffffff6) of size: 887 KiB
I0316 13:04:05.705752 15 log.go:34] Sent range 1 for iteration: [, 010000000000000000000776657273696f6effffffffffff8891) of size: 0 B
I0316 13:04:05.705754 15 log.go:34] Sent range 2 for iteration: [010000000000000000000d706c616e696e675f7461636b73fffffffffffffff6, ) of size: 0 B
I0316 13:04:05.707572 15 log.go:34] LoadFromDb Schema Sent data of size 0 B
I0316 13:04:05.707618 15 log.go:34] LoadFromDb Type Streaming about 0 B of uncompressed data (0 B on disk)
I0316 13:04:05.707633 15 log.go:34] Number of ranges found: 3
I0316 13:04:05.707638 15 log.go:34] Sent range 0 for iteration: [020000000000000000000455736572fffffffffffe29a7, 0200000000000000000009546f70537065656473ffffffffffff888b) of size: 887 KiB
I0316 13:04:05.707640 15 log.go:34] Sent range 1 for iteration: [, 020000000000000000000455736572fffffffffffe29a7) of size: 0 B
I0316 13:04:05.707642 15 log.go:34] Sent range 2 for iteration: [0200000000000000000009546f70537065656473ffffffffffff888b, ) of size: 0 B
I0316 13:04:05.708438 15 log.go:34] LoadFromDb Type Sent data of size 0 B
I0316 13:04:05.708469 15 groups.go:155] Load schema from DB: OK
I0316 13:04:05.708478 15 node.go:312] Found Snapshot.Metadata: {ConfState:{Nodes:[1 2 3] Learners:[] XXX_unrecognized:[]} Index:30224 Term:41 XXX_unrecognized:[]}
I0316 13:04:05.708486 15 node.go:323] Found hardstate: {Term:43 Vote:2 Commit:31961 XXX_unrecognized:[]}
I0316 13:04:05.708992 15 node.go:328] Group 1 found 1959 entries
I0316 13:04:05.709003 15 draft.go:1782] Restarting node for group: 1
I0316 13:04:05.709007 15 node.go:190] Setting conf state to nodes:1 nodes:2 nodes:3
2023/03/16 13:04:05 1 state.commit 31961 is out of range [30224, 31959]
panic: 1 state.commit 31961 is out of range [30224, 31959]
goroutine 183 [running]:
log.Panicf({0x1ff75f1?, 0xb17d85?}, {0xc0011c4280?, 0x1d20e40?, 0x7fd1665b7f00?})
/opt/hostedtoolcache/go/1.19.6/x64/src/log/log.go:395 +0x67
github.com/dgraph-io/dgraph/x.(*ToGlog).Panicf(0x7cd7?, {0x1ff75f1?, 0x90?}, {0xc0011c4280?, 0xb1a487?, 0xb62e866b135e68af?})
/home/runner/work/dgraph/dgraph/x/log.go:40 +0x32
go.etcd.io/etcd/raft.(*raft).loadState(0xc000366000, {0x2b, 0x2, 0x7cd9, {0x0, 0x0, 0x0}})
/home/runner/go/pkg/mod/go.etcd.io/[email protected]/raft/raft.go:1475 +0x1a4
go.etcd.io/etcd/raft.newRaft(0xc000296370)
/home/runner/go/pkg/mod/go.etcd.io/[email protected]/raft/raft.go:377 +0x791
go.etcd.io/etcd/raft.RestartNode(0xc000296370)
/home/runner/go/pkg/mod/go.etcd.io/[email protected]/raft/node.go:242 +0x33
github.com/dgraph-io/dgraph/worker.(*node).InitAndStartNode(0xc0002d6600)
/home/runner/work/dgraph/dgraph/worker/draft.go:1808 +0x6f6
github.com/dgraph-io/dgraph/worker.StartRaftNodes(0xc000134880, 0x1?)
/home/runner/work/dgraph/dgraph/worker/groups.go:157 +0x65f
github.com/dgraph-io/dgraph/dgraph/cmd/alpha.run.func4()
/home/runner/work/dgraph/dgraph/dgraph/cmd/alpha/run.go:802 +0x34
created by github.com/dgraph-io/dgraph/dgraph/cmd/alpha.run
/home/runner/work/dgraph/dgraph/dgraph/cmd/alpha/run.go:801 +0x1491
W0316 13:04:05.717159 1 sentry_integration.go:139] unable to read CID from file /tmp/dgraph-alpha-cid-sentry open /tmp/dgraph-alpha-cid-sentry: no such file or directory. Skip
[Sentry] 2023/03/16 13:04:05 Sending fatal event [ed6991f768fe4fc78865ecc67f28e72f] to o318308.ingest.sentry.io project: 1805390
[Sentry] 2023/03/16 13:04:05 Buffer flushed successfully.
This is still happening @mangalaman93 @harshil-goel @rderbier
After a power loss.
I1115 16:42:16.953188 1 init.go:85]
Dgraph version : v23.1.0
Dgraph codename : dgraph
Dgraph SHA-256 : 2b0d2fb977807f9d681c3a8e5f67a6fb133c99c772009158107aa6b1ac4cbd10
Commit SHA-1 : 2b18d19
Commit timestamp : 2023-08-17 13:27:10 -0500
Branch : HEAD
Go version : go1.19.12
jemalloc enabled : true
For Dgraph official documentation, visit https://dgraph.io/docs.
For discussions about Dgraph , visit https://discuss.dgraph.io.
For fully-managed Dgraph Cloud , visit https://dgraph.io/cloud.
Licensed variously under the Apache Public License 2.0 and Dgraph Community License.
Copyright 2015-2023 Dgraph Labs, Inc.I1115 16:42:16.953188 1 init.go:85]
Dgraph version : v23.1.0
Dgraph codename : dgraph
Dgraph SHA-256 : 2b0d2fb977807f9d681c3a8e5f67a6fb133c99c772009158107aa6b1ac4cbd10
Commit SHA-1 : 2b18d19
Commit timestamp : 2023-08-17 13:27:10 -0500
Branch : HEAD
Go version : go1.19.12
jemalloc enabled : true
For Dgraph official documentation, visit https://dgraph.io/docs.
For discussions about Dgraph , visit https://discuss.dgraph.io.
For fully-managed Dgraph Cloud , visit https://dgraph.io/cloud.
Licensed variously under the Apache Public License 2.0 and Dgraph Community License.
Copyright 2015-2023 Dgraph Labs, Inc.
I1115 16:42:17.063427 1 node.go:153] Setting raft.Config to: &{ID:1 peers:[] learners:[] ElectionTick:20 HeartbeatTick:1 Storage:0xc00014bc40 Applied:10229909 MaxSizePerMsg:262144 MaxCommittedSizePerReady:67108864 MaxUncommittedEntriesSize:0 MaxInflightMsgs:256 CheckQuorum:false PreVote:true ReadOnlyOption:0 Logger:0x332ac68 DisableProposalForwarding:false}
I1115 16:42:17.063503 1 log.go:33] LoadFromDb Schema Streaming about 0 B of uncompressed data (0 B on disk)
I1115 16:42:17.063573 1 log.go:33] Number of ranges found: 1
I1115 16:42:17.063586 1 log.go:33] Sent range 0 for iteration: [, ) of size: 0 B
I1115 16:42:17.064167 1 log.go:33] LoadFromDb Schema Sent data of size 0 B
I1115 16:42:17.064219 1 log.go:33] LoadFromDb Type Streaming about 0 B of uncompressed data (0 B on disk)
I1115 16:42:17.064237 1 log.go:33] Number of ranges found: 3
I1115 16:42:17.064249 1 log.go:33] Sent range 0 for iteration: [0200000000000000000003616c6cffffffffff74ab33, 020000000000000000000564756d6d79ffffffffffffce7b) of size: 40 MiB
I1115 16:42:17.064253 1 log.go:33] Sent range 1 for iteration: [, 0200000000000000000003616c6cffffffffff74ab33) of size: 0 B
I1115 16:42:17.064257 1 log.go:33] Sent range 2 for iteration: [020000000000000000000564756d6d79ffffffffffffce7b, ) of size: 0 B
I1115 16:42:17.064970 1 log.go:33] LoadFromDb Type Sent data of size 0 B
I1115 16:42:17.064985 1 groups.go:155] Load schema from DB: OK
I1115 16:42:17.064995 1 node.go:312] Found Snapshot.Metadata: {ConfState:{Nodes:[1] Learners:[] XXX_unrecognized:[]} Index:10229909 Term:22 XXX_unrecognized:[]}
I1115 16:42:17.065010 1 node.go:323] Found hardstate: {Term:22 Vote:1 Commit:10239793 XXX_unrecognized:[]}
I1115 16:42:17.069922 1 node.go:328] Group 1 found 39740 entries
I1115 16:42:17.069931 1 draft.go:1803] Restarting node for group: 1
I1115 16:42:17.069935 1 node.go:190] Setting conf state to nodes:1
2023/11/15 16:42:17 1 state.commit 10239793 is out of range [10229909, 10239740]
panic: 1 state.commit 10239793 is out of range [10229909, 10239740]
goroutine 164 [running]:
log.Panicf({0x20af210?, 0xb41d85?}, {0xc000f82f40?, 0x1dc7a40?, 0xc000fc2500?})
/opt/hostedtoolcache/go/1.19.12/x64/src/log/log.go:395 +0x67
github.com/dgraph-io/dgraph/x.(*ToGlog).Panicf(0x9c3efc?, {0x20af210?, 0x90?}, {0xc000f82f40?, 0xb44401?, 0x5e43e693afc59ac6?})
/home/runner/work/dgraph/dgraph/x/log.go:39 +0x32
go.etcd.io/etcd/raft.(*raft).loadState(0xc000378b40, {0x16, 0x1, 0x9c3f31, {0x0, 0x0, 0x0}})
/home/runner/go/pkg/mod/go.etcd.io/[email protected]/raft/raft.go:1475 +0x1a4
go.etcd.io/etcd/raft.newRaft(0xc000f56e70)
/home/runner/go/pkg/mod/go.etcd.io/[email protected]/raft/raft.go:377 +0x791
go.etcd.io/etcd/raft.RestartNode(0xc000f56e70)
/home/runner/go/pkg/mod/go.etcd.io/[email protected]/raft/node.go:242 +0x33
github.com/dgraph-io/dgraph/worker.(*node).InitAndStartNode(0xc00047ce80)
/home/runner/work/dgraph/dgraph/worker/draft.go:1829 +0x6f6
github.com/dgraph-io/dgraph/worker.StartRaftNodes(0xc00014bc40, 0x1?)
/home/runner/work/dgraph/dgraph/worker/groups.go:157 +0x65f
github.com/dgraph-io/dgraph/dgraph/cmd/alpha.run.func4()
/home/runner/work/dgraph/dgraph/dgraph/cmd/alpha/run.go:813 +0x34
created by github.com/dgraph-io/dgraph/dgraph/cmd/alpha.run
/home/runner/work/dgraph/dgraph/dgraph/cmd/alpha/run.go:812 +0x14f1
This issue has been stale for 60 days and will be closed automatically in 7 days. Comment to keep it open.