In a 3-node cluster (with q=1), when any one node fails and is re-added as blank, the _security object is reset to the default
Your Environment
{
"couchdb": "Welcome",
"version": "3.1.1",
"git_sha": "ce596c65d",
"uuid": "8d406054df5edac06ee4906f3259e62f",
"features": [
"access-ready",
"partitioned",
"pluggable-storage-engines",
"reshard",
"scheduler"
],
"vendor": {
"name": "The Apache Software Foundation"
}
}
Description
I have a 3-node CouchDB 3.1.1 cluster with the following configuration:
[cluster]
q=1
n=2
There is a non-partitioned database named test2, whose shards reside on node1 and node2. The test2 database has the following cluster settings:
"cluster":{"q":1,"n":2,"w":2,"r":2}
The test2 database has a few documents and its _security is:
{"admins":{"names":["superuser"],"roles":["admins"]},"members":{"names":["user1","user2"],"roles":["developers"]}}
I'm now running a scenario where one of the nodes' disks crashes. Let's say node2's disk crashes.
I have performed the following steps (a sketch of the corresponding requests follows the list):
- Remove node2 from the cluster
- Replace node2's disk with a new blank disk
- Start node2
- Add node2 into the cluster; it would resync the shards as it's blank
- After a while, the resync is completed
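One way to do this removal/re-addition is through the _nodes system database on the clustered interface; the sketch below uses placeholder admin credentials and an illustrative revision value, so it is not necessarily the exact procedure I used:
# Look up the node document for node2 to get its current revision
curl -s http://admin:password@node1:5984/_node/_local/_nodes/couchdb@node2
# Remove node2 from the cluster (replace the rev with the one returned above)
curl -s -X DELETE 'http://admin:password@node1:5984/_node/_local/_nodes/couchdb@node2?rev=1-967a00dff5e02add41819138abb3284d'
# After replacing the disk and starting node2 again, add it back
curl -s -X PUT http://admin:password@node1:5984/_node/_local/_nodes/couchdb@node2 -d '{}'
# Verify cluster membership
curl -s http://admin:password@node1:5984/_membership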
At this stage, the test2 database's shard file (test2.1628258896.couch) can be seen on node2.
1: Now, when I retrieve _security for the test2 database from node2, it is reset to the default:
{"members":{"roles":["_admin"]},"admins":{"roles":["_admin"]}}
If I retrieve _security from node1 or node3, it responds with the correct object (which I set earlier, before the node2 crash):
{"admins":{"names":["superuser"],"roles":["admins"]},"members":{"names":["user1","user2"],"roles":["developers"]}}
2: When I restart all nodes, the following error logs are shown:
node2 | [error] 2021-08-06T12:59:21.842335Z couchdb@node2 <0.4465.0> -------- Bad security object in <<"test2">>: [{{[{<<"members">>,{[{<<"roles">>,[<<"_admin">>]}]}},{<<"admins">>,{[{<<"roles">>,[<<"_admin">>]}]}}]},1},{{[{<<"admins">>,{[{<<"names">>,[<<"superuser">>]},{<<"roles">>,[<<"admins">>]}]}},{<<"members">>,{[{<<"names">>,[<<"user1">>,<<"user2">>]},{<<"roles">>,[<<"developers">>]}]}}]},1}]
3: Not relevant to the crash, but relevant to the same _security behavior.
When 1 node is down, PUT _security fails with: {"error":"error","reason":"no_majority"}
- node2 is down
- Create a new DB test3, whose shards would reside on node1 and node2
- PUT _security for test3; it would fail with:
{"error":"error","reason":"no_majority"}
Other observations
- _sync_shards throws the same Bad security object error
- It works for q=2
- Updating _security of the test2 DB again fixes the issue
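The fix in the last point is simply re-writing the intended security object once all nodes are up, for example (placeholder credentials):
# Re-apply the intended security object; afterwards every node returns it consistently
curl -s -X PUT http://admin:password@node1:5984/test2/_security \
     -H 'Content-Type: application/json' \
     -d '{"admins":{"names":["superuser"],"roles":["admins"]},"members":{"names":["user1","user2"],"roles":["developers"]}}'
# Verify on the previously-blank node
curl -s http://admin:password@node2:5984/test2/_security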
_sync_shards logs
node1 | [notice] 2021-08-06T13:38:55.064067Z couchdb@node1 <0.10379.1> c89bd8ac41 localhost:5984 172.20.0.1 admin POST /test2/_sync_shards 202 ok 2
node2 | [error] 2021-08-06T13:38:55.080064Z couchdb@node2 <0.7232.0> -------- Bad security object in <<"test2">>: [{{[{<<"members">>,{[{<<"roles">>,[<<"_admin">>]}]}},{<<"admins">>,{[{<<"roles">>,[<<"_admin">>]}]}}]},1},{{[{<<"admins">>,{[{<<"names">>,[<<"superuser">>]},{<<"roles">>,[<<"admins">>]}]}},{<<"members">>,{[{<<"names">>,[<<"user1">>,<<"user2">>]},{<<"roles">>,[<<"developers">>]}]}}]},1}]
node1 | [error] 2021-08-06T13:38:55.080503Z couchdb@node1 <0.10417.1> -------- Bad security object in <<"test2">>: [{{[{<<"members">>,{[{<<"roles">>,[<<"_admin">>]}]}},{<<"admins">>,{[{<<"roles">>,[<<"_admin">>]}]}}]},1},{{[{<<"admins">>,{[{<<"names">>,[<<"superuser">>]},{<<"roles">>,[<<"admins">>]}]}},{<<"members">>,{[{<<"names">>,[<<"user1">>,<<"user2">>]},{<<"roles">>,[<<"developers">>]}]}}]},1}]
This was discussed with @janl and @rnewson on Slack at https://couchdb.slack.com/archives/C49LEE7NW/p1628257123045300, which may be helpful.