[BUG] Garbage Collector removing relation still valid
Describe the bug: When a permission tuple is present twice, one valid and one expired, the garbage collector invalidates the relation even though a valid copy still exists.
To Reproduce
Steps to reproduce the behavior:
Set up Permify with a PostgreSQL database and configure the garbage collector as follows:
```yaml
database:
  engine: postgres
  uri:
  garbage_collection:
    enabled: true
    interval: 10m
    window: 5m
    timeout: 5m
```
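For reference, here is my reading of those garbage-collection settings; the comments are my interpretation of the knobs, not official documentation:

```yaml
garbage_collection:
  enabled: true   # run the background garbage collector
  interval: 10m   # how often a GC pass runs (assumed)
  window: 5m      # retention window: only data older than this should be eligible for collection (assumed)
  timeout: 5m     # maximum duration of a single GC pass (assumed)
```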
Add a relation, then add the same relation (for the same tenant) a bit later.
For example:
The permission check works until the garbage collector runs; after that, the permission is considered invalid.
Expected behavior The expired relation should be removed from the database, but the permission check should still be allowed since there is a valid relation.
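To make the scenario concrete, here is a toy Python model of the state described above (illustrative field names and ids only, not Permify's actual storage layout): the same relation exists twice, one copy expired and one still active, and a check should keep passing as long as any active copy remains.

```python
# Toy model of the duplicate-tuple scenario (illustrative only,
# not Permify's storage layout). expired_tx_id == 0 marks a tuple
# that is still active.
tuples = [
    # the same logical relation written twice for the same tenant
    {"entity": "organisation:1", "relation": "member",
     "subject": "user:42", "created_tx_id": 10, "expired_tx_id": 15},  # expired copy
    {"entity": "organisation:1", "relation": "member",
     "subject": "user:42", "created_tx_id": 15, "expired_tx_id": 0},   # valid copy
]

def check(rows, entity, relation, subject):
    """A permission check passes while at least one active copy exists."""
    return any(
        r["entity"] == entity and r["relation"] == relation
        and r["subject"] == subject and r["expired_tx_id"] == 0
        for r in rows
    )

# A correct GC removes only the expired copy, so the check result is unchanged.
after_correct_gc = [r for r in tuples if r["expired_tx_id"] == 0]
```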
Additional context: The Permify instance runs via the Helm chart, as does the Postgres instance.
Environment:
- Permify: Helm chart permify v0.3.5
- Postgres: Bitnami postgresql 12.5
Hi @gign0766, I tried to reproduce the issue but couldn't replicate it. Everything seems to be working as expected on my end. I'll update the database tests shortly to include this scenario. Also, can you provide the specific permission check requests you used to ensure I don't miss any details?
Hi, I couldn't reproduce the problem outside of the Helm version. When I tried to recreate it in a Docker environment, everything worked as expected.
The permission check was: `POST http://localhost:13476/v1/tenants/4849b21a-df59-45b2-ab1e-e05bb1db8f28/permissions/check`

```json
{
  "metadata": {
    "schema_version": "",
    "snap_token": "",
    "depth": 20
  },
  "entity": {
    "type": "organisation",
    "id": "1"
  },
  "permission": "is_member",
  "subject": {
    "type": "user",
    "id": "b56661f8-7be6-4342-a4c0-918ee04e5983"
  }
}
```
The test was made through the Go SDK:

```go
cr, err := client.Permission.Check(ctx, &permifyPayload.PermissionCheckRequest{
    TenantId: tenantId,
    Metadata: &permifyPayload.PermissionCheckRequestMetadata{
        SnapToken:     "",
        SchemaVersion: "",
        Depth:         20,
    },
    Entity:     tuple.Entity,
    Permission: tuple.Relation,
    Subject:    tuple.Subject,
})
```
and with the following schema:

```
entity user {}

entity organisation {
    relation admin @user
    relation member @user
    relation billing @user
    relation guest @user

    permission is_member = admin or member or billing or guest
}

// Billing
entity billing {
    relation admin @organisation#admin
    relation billing @organisation#billing

    permission is_member = admin or billing
    permission full_access = is_member
    permission read_only = is_member
}

// Project
entity project {
    relation admin @organisation#admin
    relation member @organisation#member

    permission is_member = admin or member
    permission full_access = admin
    permission read_only = is_member
}

entity instance {
    relation project @project

    permission full_access = project.full_access
    permission read_only = project.read_only
}
```
To add some information on the environment: it's a Kubernetes cluster with 3 worker nodes. The Permify Helm chart has the default replica count (3), so each worker node runs one Permify instance.
Hello,
I just noticed there is a distributed configuration for the cache system. Could this be the cause of the problem?
I'll try to configure it on the cluster and keep you informed of the result.
Hi, thank you @gign0766. I couldn't test with your setup. This week I will be updating helm charts, related migration scripts and issues about k8s. I will test both cases!
Hi @gign0766 , the issue might be related to the version of Postgres you're using. Permify supports Postgres 13.8 and above.
cc: @ucatbas
Hi @tolgaOzen. My bad, I forgot to mention that 12.5 was the Bitnami Helm chart version. The Postgres version is 15.3. Image: docker.io/bitnami/postgresql:15.3.0-debian-11-r17
So far, since I've activated the distributed system, the problem hasn't shown up again.
Hi,
I'm adding information to the ticket.
I'm working on a test environment; Permify is unchanged, with the same Helm chart versions. Last night everything was working when checking permissions, and today the schema no longer seems to be available.
In my application I use permify-go over gRPC to make the requests, and I also reproduced the error with a gRPC request in Insomnia.
Extract from Permify db - table relation_tuples
Extract from Permify db - table schema_definitions
Organization schema entity
```
entity organization {
    relation admin @user
    relation member @user
    relation billing @user
    relation guest @user

    permission OrganizationMember = (((admin or member) or billing) or guest)
    permission OrganizationIAMFullAccess = admin
    permission OrganizationIAMReadOnly = ((admin or member) or billing)
    permission OrganizationSettingsFullAccess = admin
    permission OrganizationSettingsReadOnly = ((admin or member) or billing)
    permission OrganizationBillingFullAccess = (admin or billing)
    permission OrganizationBillingReadOnly = (admin or billing)
    permission OrganizationProjectManagerFullAccess = admin
    permission OrganizationProjectManagerReadOnly = ((admin or member) or guest)
}
```
Extract from Permify db - table tenants
Screenshot of Insomnia gRPC call on Permission Check route
Hi @gign0766 , could you share your Permify config file while keeping sensitive information hidden?
Hi @tolgaOzen
Of course, here is the config :
```yaml
# The logger section sets the logging level for the service.
logger:
  level: debug

# The database section specifies the database engine and connection settings,
# including the URI for the database, whether or not to auto-migrate the database,
# and connection pool settings.
database:
  engine: postgres
  uri: postgres_url
  garbage_collection:
    enabled: true
    interval: 10m
    window: 5m
    timeout: 5m

distributed:
  # Indicates whether the distributed mode is enabled or not
  enabled: true
  # The address of the distributed service.
  # Using a Kubernetes DNS name suggests this service runs in a Kubernetes cluster
  # under the 'default' namespace and is named 'permify'
  address: service_address
  # The port on which the service is exposed
  port: "5000"
```
Can you share any logs? The configuration looks correct. Also, can I get some information about the Postgres you're using? Are you deploying it within Helm?
Could the issue be related to Bitnami Postgres? Are you creating a volume for persistence?
We don't have any logs from Permify; it doesn't log correctly.
For Postgres we use the Bitnami postgresql chart in version 12.5.9 (corresponding to Postgres 15.3). The database is deployed with a volume to persist the data.
It's probably not related to Postgres, because when we disable garbage collection, the problem no longer appears.
I'm trying to understand the issue, but I don't think it's related to the garbage collector because the garbage collector removes old data but doesn't delete the schema. The schema not found error you're receiving typically occurs when the schema doesn't exist. I'm looking into it in more detail to better understand the problem. I'll update you. In the meantime, could you also try with Permify v1.1.1?
@tolgaOzen This is also what I understood from reading the permify code. But the bug only seems to appear when garbage collection is enabled.
I will update the Permify chart to the latest version.
There was an error in the latest version of the Helm chart; I have submitted a PR to fix the problem on the permify helm chart repo.
I've updated to version 0.3.7 of the Helm chart. Now the logs work properly.
I'll update you if I notice the bug again!
Hi @gign0766 , is the issue you encountered still ongoing?
Hi @tolgaOzen,
Since last time I have made some changes to the permission schema (I'm now using custom roles), but the bug still seems to occur.
Here is all the info :
The tuple relations in database :
The tenant in database :
The new schema
```
entity user {}

entity group {
    relation assignee @user
}

entity organization {
    relation member @group#assignee
    relation IAMFullAccess @group#assignee
    relation IAMReadOnly @group#assignee
    relation SettingsEdition @group#assignee
    relation BillingFullAccess @group#assignee
    relation BillingReadOnly @group#assignee
    relation ProjectManagement @group#assignee

    // Permissions
    // - Organization Global
    permission OrganizationRead = member
    permission OrganizationWrite = SettingsEdition
    // - IAM Management
    permission OrganizationIAMRead = IAMReadOnly or IAMFullAccess
    permission OrganizationIAMWrite = IAMFullAccess
    // - Billing
    permission OrganizationBillingRead = BillingFullAccess or BillingReadOnly
    permission OrganizationBillingWrite = BillingFullAccess
    // - Organization Project
    permission OrganizationProjectView = member or ProjectManagement
    permission OrganizationProjectCreation = ProjectManagement
}

// Project
entity project {
    relation parent @organization
    relation ProjectInstanceFullAccess @group#assignee
    relation ProjectInstanceTerminalAccess @group#assignee
    relation ProjectInstanceReadOnly @group#assignee

    permission ProjectInstanceRead = ProjectInstanceReadOnly or ProjectInstanceFullAccess or ProjectInstanceTerminalAccess
    permission ProjectInstanceTerminal = ProjectInstanceFullAccess or ProjectInstanceTerminalAccess
    permission ProjectInstanceCreate = ProjectInstanceFullAccess
    permission ProjectInstanceDelete = ProjectInstanceFullAccess
    permission ProjectInstanceStart = ProjectInstanceFullAccess
    permission ProjectInstanceStop = ProjectInstanceFullAccess
}
```
The grpc request :
I also tried with /Permission/SubjectPermission, but every relation and permission returns 2, as if there were no permissions.
And here are the logs :
Hi @tolgaOzen, I'm adding new information.
I seem to lose the rights when I add new ones (adding a new tenant without touching the previous one), after a long time.
Will there be a relationship lifetime parameter?
Here is an extract of the relation_tuples :
And here the creation and update timestamp for each tenant :
This morning, the 46a1fad0-a909-4fa9-a387-b6c491e7a11c organization worked perfectly well, but a few minutes after the second one was created, the rights stopped working.
Hey @gign0766,
Hope all is well. We replicated your case multiple times in multiple different environments. Would you like to hop on a quick call this week to figure out your case? Here's my calendar.
Hi,
I've just seen that you've closed the issue. Were the transactions the cause of the problem?
Hi @firatcand,
I would like to reopen the issue, as after some testing this week the problem is still there. I've updated the Helm chart to the latest version (0.3.9) and restarted my Permify deployment on Tuesday to get the latest Permify image.
Here is the database transaction before the garbage collector
And right after that (causing every permission check on the tenant cc7284ef-29f9-45a5-8167-7cdca69b2c63 to fail)
If I add line 2 back into the transaction table, the permission check works again.
Hey, I was just looking at the code; this line may be the culprit:

```go
squirrel.Expr("expired_tx_id = '0'::xid8"),
```
https://github.com/Permify/permify/blob/391f4d194b23da52e8f87f97d6b526fc34b6a3aa/internal/repositories/postgres/utils/common.go#L34
It allows still-active tuples to be deleted, CMIIW.
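If that reading is right, the failure mode can be sketched with a small Python model (a hypothetical simplification of the delete predicate, not Permify's actual query): a predicate that also matches rows with `expired_tx_id = 0` sweeps still-active tuples once they fall before the collection window.

```python
# Hypothetical simplification of the suspected GC predicate (not the real query).
# expired_tx_id == 0 means the tuple is still active (never expired).

def buggy_should_delete(row, cutoff_tx):
    # Suspected behavior: rows created before the window cutoff are deleted
    # even when expired_tx_id == 0, i.e. still-active tuples get swept too.
    return row["created_tx_id"] < cutoff_tx and (
        row["expired_tx_id"] == 0 or row["expired_tx_id"] < cutoff_tx
    )

def fixed_should_delete(row, cutoff_tx):
    # Expected behavior: only rows that actually expired before the cutoff go.
    return row["expired_tx_id"] != 0 and row["expired_tx_id"] < cutoff_tx

rows = [
    {"created_tx_id": 10, "expired_tx_id": 15},  # expired copy: safe to collect
    {"created_tx_id": 15, "expired_tx_id": 0},   # valid copy: must survive GC
]
cutoff = 20

survivors_buggy = [r for r in rows if not buggy_should_delete(r, cutoff)]
survivors_fixed = [r for r in rows if not fixed_should_delete(r, cutoff)]
# With the buggy predicate nothing survives; the fixed one keeps the valid copy.
```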