In a 3-node cluster (with q=1), when any one node fails and is re-added as blank, the _security object is reset to the default
Your Environment
{
"couchdb": "Welcome",
"version": "3.1.1",
"git_sha": "ce596c65d",
"uuid": "8d406054df5edac06ee4906f3259e62f",
"features": [
"access-ready",
"partitioned",
"pluggable-storage-engines",
"reshard",
"scheduler"
],
"vendor": {
"name": "The Apache Software Foundation"
}
}
Description
I have a 3-node CouchDB 3.1.1 cluster with the following configuration:
[cluster]
q=1
n=2
There is a non-partitioned database named test2, whose shards reside on node1 and node2. The test2 database has the following cluster settings:
"cluster":{"q":1,"n":2,"w":2,"r":2}
The test2 database has a few documents and its _security is:
{"admins":{"names":["superuser"],"roles":["admins"]},"members":{"names":["user1","user2"],"roles":["developers"]}}
I'm now running a scenario where one of the nodes' disks crashes. Let's say node2's disk crashes.
I have performed the following steps (a sketch of the corresponding requests follows the list):
- Remove node2 from the cluster
- Replace node2's disk with a new blank disk
- Start node2
- Add node2 into the cluster; it would resync the shards as it's blank
- After a while, the resync is completed
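One way to do this removal/re-addition is through the _nodes system database on the clustered interface; the sketch below uses placeholder admin credentials and an illustrative revision value, so it is not necessarily the exact procedure I used:
# Look up the node document for node2 to get its current revision
curl -s http://admin:password@node1:5984/_node/_local/_nodes/couchdb@node2
# Remove node2 from the cluster (replace the rev with the one returned above)
curl -s -X DELETE 'http://admin:password@node1:5984/_node/_local/_nodes/couchdb@node2?rev=1-967a00dff5e02add41819138abb3284d'
# After replacing the disk and starting node2 again, add it back
curl -s -X PUT http://admin:password@node1:5984/_node/_local/_nodes/couchdb@node2 -d '{}'
# Verify cluster membership
curl -s http://admin:password@node1:5984/_membership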
At this stage, the test2 database's shard file (test2.1628258896.couch) can be seen on node2.
1: Now, when I retrieve _security for the test2 database from node2, it is reset to the default:
{"members":{"roles":["_admin"]},"admins":{"roles":["_admin"]}}
If I retrieve _security from node1 or node3, it responds with the correct object (which I set earlier, before the node2 crash):
{"admins":{"names":["superuser"],"roles":["admins"]},"members":{"names":["user1","user2"],"roles":["developers"]}}
2: When I restart all nodes, the following error logs are shown:
node2 | [error] 2021-08-06T12:59:21.842335Z couchdb@node2 <0.4465.0> -------- Bad security object in <<"test2">>: [{{[{<<"members">>,{[{<<"roles">>,[<<"_admin">>]}]}},{<<"admins">>,{[{<<"roles">>,[<<"_admin">>]}]}}]},1},{{[{<<"admins">>,{[{<<"names">>,[<<"superuser">>]},{<<"roles">>,[<<"admins">>]}]}},{<<"members">>,{[{<<"names">>,[<<"user1">>,<<"user2">>]},{<<"roles">>,[<<"developers">>]}]}}]},1}]
3: Not relevant to the crash, but relevant to the same _security behavior.
When 1 node is down, PUT _security fails with: {"error":"error","reason":"no_majority"}
- node2 is down
- Create a new DB test3, whose shards would reside on node1 and node2
- PUT _security for test3; it would fail with:
{"error":"error","reason":"no_majority"}
Other observations
- _sync_shards throws the same Bad security object error
- It works for q=2
- Updating _security of the test2 DB again fixes the issue
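The fix in the last point is simply re-writing the intended security object once all nodes are up, for example (placeholder credentials):
# Re-apply the intended security object; afterwards every node returns it consistently
curl -s -X PUT http://admin:password@node1:5984/test2/_security \
     -H 'Content-Type: application/json' \
     -d '{"admins":{"names":["superuser"],"roles":["admins"]},"members":{"names":["user1","user2"],"roles":["developers"]}}'
# Verify on the previously-blank node
curl -s http://admin:password@node2:5984/test2/_security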
_sync_shards logs
node1 | [notice] 2021-08-06T13:38:55.064067Z couchdb@node1 <0.10379.1> c89bd8ac41 localhost:5984 172.20.0.1 admin POST /test2/_sync_shards 202 ok 2
node2 | [error] 2021-08-06T13:38:55.080064Z couchdb@node2 <0.7232.0> -------- Bad security object in <<"test2">>: [{{[{<<"members">>,{[{<<"roles">>,[<<"_admin">>]}]}},{<<"admins">>,{[{<<"roles">>,[<<"_admin">>]}]}}]},1},{{[{<<"admins">>,{[{<<"names">>,[<<"superuser">>]},{<<"roles">>,[<<"admins">>]}]}},{<<"members">>,{[{<<"names">>,[<<"user1">>,<<"user2">>]},{<<"roles">>,[<<"developers">>]}]}}]},1}]
node1 | [error] 2021-08-06T13:38:55.080503Z couchdb@node1 <0.10417.1> -------- Bad security object in <<"test2">>: [{{[{<<"members">>,{[{<<"roles">>,[<<"_admin">>]}]}},{<<"admins">>,{[{<<"roles">>,[<<"_admin">>]}]}}]},1},{{[{<<"admins">>,{[{<<"names">>,[<<"superuser">>]},{<<"roles">>,[<<"admins">>]}]}},{<<"members">>,{[{<<"names">>,[<<"user1">>,<<"user2">>]},{<<"roles">>,[<<"developers">>]}]}}]},1}]
This was discussed with @janl and @rnewson on Slack at https://couchdb.slack.com/archives/C49LEE7NW/p1628257123045300, which may be helpful.