
[BUG] Garbage Collector removing a relation that is still valid

Open gign0766 opened this issue 1 year ago • 21 comments

Describe the bug When a relation is present twice, one valid and one expired, the Garbage Collector invalidates it even though a still-valid copy exists.

To Reproduce Steps to reproduce the behavior:

Set up Permify with a PostgreSQL database and configure the garbage collector as follows:

database:
  engine: postgres
  uri: 
  garbage_collection:
     enabled: true
     interval: 10m
     window: 5m
     timeout: 5m

Add a relation, then add the same relation (for the same tenant) a bit later (see the sketch after these steps)

For example: (screenshot attached)

The permission check works until the garbage collector runs; after that, the permission is considered invalid.
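
For illustration, here is a minimal permify-go sketch of the reproduction, writing the same tuple twice a few minutes apart. The import paths, the client constructor, and the Data.Write call are assumptions based on the SDK README and may differ between Permify and SDK versions.

package main

import (
	"context"
	"log"
	"time"

	// NOTE: these import paths are assumptions taken from the permify-go README;
	// adjust them to the SDK version you actually use.
	permifyPayload "buf.build/gen/go/permifyco/permify/protocolbuffers/go/base/v1"
	permifyGrpc "github.com/Permify/permify-go/grpc"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	client, err := permifyGrpc.NewClient(
		permifyGrpc.Config{Endpoint: "localhost:3478"},
		grpc.WithTransportCredentials(insecure.NewCredentials()),
	)
	if err != nil {
		log.Fatal(err)
	}

	tuple := &permifyPayload.Tuple{
		Entity:   &permifyPayload.Entity{Type: "organisation", Id: "1"},
		Relation: "member",
		Subject:  &permifyPayload.Subject{Type: "user", Id: "b56661f8-7be6-4342-a4c0-918ee04e5983"},
	}

	// Write the same tuple, wait a bit, then write it again, so the database ends
	// up holding one expired and one still-valid copy of the relation.
	write := func() {
		if _, err := client.Data.Write(context.Background(), &permifyPayload.DataWriteRequest{
			TenantId: "4849b21a-df59-45b2-ab1e-e05bb1db8f28",
			Metadata: &permifyPayload.DataWriteRequestMetadata{SchemaVersion: ""},
			Tuples:   []*permifyPayload.Tuple{tuple},
		}); err != nil {
			log.Fatal(err)
		}
	}

	write()
	time.Sleep(2 * time.Minute) // "a bit later"
	write()
}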

Expected behavior The expired relation should be removed from the database, but the permission check should still be allowed since there is a valid relation.

Additional context The Permify instance runs via the Helm chart, as does the Postgres instance.

Environment:

  • Permify: Permify Helm chart v0.3.5
  • Postgres: Bitnami postgresql chart 12.5

gign0766 avatar Aug 19 '24 11:08 gign0766

Hi @gign0766, I tried to reproduce the issue but couldn't replicate it. Everything seems to be working as expected on my end. I'll update the database tests shortly to include this scenario. Also, can you provide the specific permission check requests you used to ensure I don't miss any details?

ucatbas avatar Aug 22 '24 11:08 ucatbas

Hi, I couldn't reproduce the problem outside of the Helm deployment. When I tried to recreate it in a Docker environment, everything worked as expected.

The permission check was: POST http://localhost:13476/v1/tenants/4849b21a-df59-45b2-ab1e-e05bb1db8f28/permissions/check

{
  "metadata": {
    "schema_version": "",
    "snap_token": "",
    "depth": 20
  },
  "entity": {
    "type": "organisation",
    "id": "1"
  },
  "permission": "is_member",
  "subject": {
    "type": "user",
    "id": "b56661f8-7be6-4342-a4c0-918ee04e5983"
  }
}

The test was made through the Go SDK:

cr, err := client.Permission.Check(ctx, &permifyPayload.PermissionCheckRequest{
	TenantId: tenantId,
	Metadata: &permifyPayload.PermissionCheckRequestMetadata{
		SnapToken:     "",
		SchemaVersion: "",
		Depth:         20,
	},
	Entity:     tuple.Entity,
	Permission: tuple.Relation,
	Subject:    tuple.Subject,
})
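
As a follow-up, a minimal sketch of how the result above can be interpreted, continuing the snippet and reusing its cr value and permifyPayload alias (the CheckResult enum names follow the generated base/v1 package and are an assumption; log is assumed imported):

if err != nil {
	log.Fatalf("permission check failed: %v", err)
}
// CheckResult_CHECK_RESULT_ALLOWED / _DENIED come from the generated base/v1 enum.
if cr.GetCan() == permifyPayload.CheckResult_CHECK_RESULT_ALLOWED {
	log.Println("allowed")
} else {
	// This is the outcome observed after the garbage collector runs.
	log.Println("denied")
}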

and with the following schema:

entity user {}

entity organisation {
    relation admin @user
    relation member @user
    relation billing @user
    relation guest @user

    permission is_member = admin or member or billing or guest
}

// Billing
entity billing {
    relation admin @organisation#admin
    relation billing @organisation#billing

    permission is_member = admin or billing

    permission full_access = is_member
    permission read_only = is_member
}

// Project
entity project {
    relation admin @organisation#admin
    relation member @organisation#member

    permission is_member = admin or member


    permission full_access = admin
    permission read_only = is_member
}

entity instance {
    relation project @project

    permission full_access = project.full_access
    permission read_only = project.read_only
}

gign0766 avatar Aug 22 '24 13:08 gign0766

To add some information about the environment: it's a Kubernetes cluster with 3 worker nodes. The Permify Helm release uses the default replica count (3), so each worker runs one Permify instance.

gign0766 avatar Aug 22 '24 13:08 gign0766

Hello,

I just noticed there is a distributed configuration for the cache system. Could this be the cause of the problem?

I'll try to configure it on the cluster and keep you informed of the result.

gign0766 avatar Aug 27 '24 14:08 gign0766

Hi, thank you @gign0766. I couldn't test with your setup. This week I will be updating helm charts, related migration scripts and issues about k8s. I will test both cases!

ucatbas avatar Aug 27 '24 14:08 ucatbas

Hi @gign0766 , the issue might be related to the version of Postgres you're using. Permify supports Postgres 13.8 and above.

cc: @ucatbas

tolgaozen avatar Aug 30 '24 09:08 tolgaozen

Hi @tolgaOzen, my bad, I forgot to mention that 12.5 was the Bitnami Helm chart version. The Postgres version is 15.3 (image: docker.io/bitnami/postgresql:15.3.0-debian-11-r17).

So far, since I've activated the distributed system, the problem hasn't shown up again.

gign0766 avatar Aug 30 '24 10:08 gign0766

Hi,

I'm adding information to the ticket.

I'm working on a test environment; the Permify setup is still the same, with the same Helm chart versions. Last night everything was working when checking permissions, and today the schema no longer seems to be available.

In my application I use permify-go over gRPC to make the requests, and I also reproduced the error with a gRPC request from Insomnia.

Extract from the Permify DB, table relation_tuples: (screenshot attached)

Extract from the Permify DB, table schema_definitions: (screenshot attached)

Organization schema entity

entity organization {
	relation admin @user 
	relation member @user 
	relation billing @user 
	relation guest @user 


	permission OrganizationMember = (((admin or member) or billing) or guest)
	permission OrganizationIAMFullAccess = admin
	permission OrganizationIAMReadOnly = ((admin or member) or billing)
	permission OrganizationSettingsFullAccess = admin
	permission OrganizationSettingsReadOnly = ((admin or member) or billing)
	permission OrganizationBillingFullAccess = (admin or billing)
	permission OrganizationBillingReadOnly = (admin or billing)
	permission OrganizationProjectManagerFullAccess = admin
	permission OrganizationProjectManagerReadOnly = ((admin or member) or guest)
} 

Extract from the Permify DB, table tenants: (screenshot attached)

Insomnia gRPC call to the Permission Check route: (screenshot attached)

gign0766 avatar Sep 18 '24 13:09 gign0766

Hi @gign0766 , could you share your Permify config file while keeping sensitive information hidden?

tolgaozen avatar Sep 23 '24 09:09 tolgaozen

Hi @tolgaOzen

Of course, here is the config:

# The logger section sets the logging level for the service.
logger:
  level: debug

#  The database section specifies the database engine and connection settings,
# including the URI for the database, whether or not to auto-migrate the database,
# and connection pool settings.
database:
  engine: postgres
  uri: postgres_url
  garbage_collection:
    enabled: true
    interval: 10m
    window: 5m
    timeout: 5m

distributed:
  # Indicates whether the distributed mode is enabled or not
  enabled: true

  # The address of the distributed service.
  # Using a Kubernetes DNS name suggests this service runs in a Kubernetes cluster
  # under the 'default' namespace and is named 'permify'
  address:  service_address

  # The port on which the service is exposed
  port: "5000"

gign0766 avatar Sep 23 '24 09:09 gign0766

Can you share any logs? The configuration looks correct. Also, can I get some information about the Postgres you're using? Are you deploying it within Helm?

tolgaozen avatar Sep 23 '24 09:09 tolgaozen

Could the issue be related to Bitnami Postgres? Are you creating a volume for persistency?

tolgaozen avatar Sep 23 '24 09:09 tolgaozen

We don't have any logs from Permify; logging doesn't work correctly. (screenshot attached)

For Postgres we use the Bitnami postgresql chart in version 12.5.9 (corresponding to Postgres 15.3). The database is deployed with a volume to persist the data.

It's probably not related to Postgres, because when we disable garbage collection, the problem no longer appears.

gign0766 avatar Sep 23 '24 09:09 gign0766

I'm trying to understand the issue, but I don't think it's related to the garbage collector because the garbage collector removes old data but doesn't delete the schema. The schema not found error you're receiving typically occurs when the schema doesn't exist. I'm looking into it in more detail to better understand the problem. I'll update you. In the meantime, could you also try with Permify v1.1.1?

tolgaozen avatar Sep 23 '24 18:09 tolgaozen

@tolgaOzen This is also what I understood from reading the permify code. But the bug only seems to appear when garbage collection is enabled.

I will update the Permify chart to the latest version.

gign0766 avatar Sep 24 '24 08:09 gign0766

There was an error in the latest version of the Helm chart; I have submitted a PR to fix the problem on the Permify helm chart repo.

gign0766 avatar Sep 24 '24 08:09 gign0766

I've updated to version 0.3.7 of the Helm chart. Now the logs work properly.

I'll update you if I notice the bug again!

gign0766 avatar Sep 24 '24 09:09 gign0766

Hi @gign0766 , is the issue you encountered still ongoing?

tolgaozen avatar Oct 09 '24 09:10 tolgaozen

Hi @tolgaOzen,

Since last time, I've made some changes to the permission schema (I'm now using custom roles), but the bug still seems to occur.

Here is all the info. The relation tuples in the database: (screenshot attached)

The tenant in the database: (screenshot attached)

The new schema

entity user {}

entity group {
	relation assignee @user
}

entity organization {
	relation member @group#assignee

	relation IAMFullAccess @group#assignee
	relation IAMReadOnly @group#assignee

	relation SettingsEdition @group#assignee

	relation BillingFullAccess @group#assignee
	relation BillingReadOnly @group#assignee

	relation ProjectManagement @group#assignee

// Permissions
// - Organization Global
	permission OrganizationRead = member
	permission OrganizationWrite = SettingsEdition

// - IAM Management
	permission OrganizationIAMRead = IAMReadOnly or IAMFullAccess
	permission OrganizationIAMWrite = IAMFullAccess

// - Billing
	permission OrganizationBillingRead = BillingFullAccess or BillingReadOnly
	permission OrganizationBillingWrite = BillingFullAccess

// - Organization Project
	permission OrganizationProjectView = member or ProjectManagement
	permission OrganizationProjectCreation = ProjectManagement

}


// Project
entity project {
    relation parent @organization

    relation ProjectInstanceFullAccess @group#assignee
    relation ProjectInstanceTerminalAccess @group#assignee
    relation ProjectInstanceReadOnly @group#assignee

    permission ProjectInstanceRead = ProjectInstanceReadOnly or ProjectInstanceFullAccess or ProjectInstanceTerminalAccess
    permission ProjectInstanceTerminal = ProjectInstanceFullAccess or ProjectInstanceTerminalAccess
    permission ProjectInstanceCreate = ProjectInstanceFullAccess
    permission ProjectInstanceDelete = ProjectInstanceFullAccess
    permission ProjectInstanceStart = ProjectInstanceFullAccess
    permission ProjectInstanceStop = ProjectInstanceFullAccess
}

The gRPC request: (screenshot attached)

I also tried the /Permission/SubjectPermission route, but every relation and permission returns 2, as if there were no permissions at all.
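
For reference, a hedged permify-go sketch of reading the SubjectPermission results, reusing the client and aliases from the earlier Go snippet (the request and response field names are assumptions based on the base/v1 proto; there, CHECK_RESULT_DENIED has enum value 2, which would match the 2s observed above):

// Hypothetical SubjectPermission call; names may differ between Permify versions.
resp, err := client.Permission.SubjectPermission(ctx, &permifyPayload.PermissionSubjectPermissionRequest{
	TenantId: tenantId,
	Metadata: &permifyPayload.PermissionSubjectPermissionRequestMetadata{Depth: 20},
	Entity:   &permifyPayload.Entity{Type: "organization", Id: "1"},
	Subject:  &permifyPayload.Subject{Type: "user", Id: "some-user-id"},
})
if err != nil {
	log.Fatal(err)
}
for permission, result := range resp.GetResults() {
	// In the failing case every entry prints CHECK_RESULT_DENIED (enum value 2).
	fmt.Printf("%s => %s\n", permission, result)
}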

And here are the logs: (screenshot attached)

gign0766 avatar Oct 10 '24 07:10 gign0766

Hi @tolgaOzen, I'm adding new information.
I seem to lose existing rights when I add new ones (adding a new tenant without touching the previous one), after a long time.

Will there be a relationship lifetime parameter?

Here is an extract of the relation_tuples table: (screenshot attached)

And here are the creation and update timestamps for each tenant: (screenshot attached)

This morning, the 46a1fad0-a909-4fa9-a387-b6c491e7a11c organization worked perfectly well, but a few minutes after the second one was created, the rights stopped working.

gign0766 avatar Oct 14 '24 10:10 gign0766

Hey @gign0766,

Hope all is well. We replicated your case multiple times in multiple different environments. Would you like to hop on a quick call this week to figure out your case? Here's my calendar.

firatcand avatar Oct 14 '24 10:10 firatcand

Hi,

I've just seen that you've closed the issue. Were the transactions the cause of the problem?

gign0766 avatar Oct 21 '24 13:10 gign0766

Hi @firatcand,

I would like to reopen the issue, as after some testing this week the problem is still there. I've updated the Helm chart to the latest version (0.3.9) and restarted my Permify deployment on Tuesday to get the latest Permify image.

Here is the database transactions table before the garbage collector ran: (screenshot attached)

And right after it ran (causing every permission check on the tenant cc7284ef-29f9-45a5-8167-7cdca69b2c63 to fail): (screenshot attached)

If I add line 2 back into the transaction table, the permission check works again.

gign0766 avatar Feb 20 '25 15:02 gign0766

Hey, I was just looking at the code; this line may be the culprit:

squirrel.Expr("expired_tx_id = '0'::xid8"),

https://github.com/Permify/permify/blob/391f4d194b23da52e8f87f97d6b526fc34b6a3aa/internal/repositories/postgres/utils/common.go#L34

It allows for still active tuples to be deleted, CMIIW.
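
To make that concern concrete, here is an illustrative squirrel snippet (not the actual Permify code) showing how a delete filter that ORs in an expired_tx_id = '0'::xid8 branch would also match tuples that were never expired, i.e. tuples that are still valid:

package main

import (
	"fmt"

	sq "github.com/Masterminds/squirrel"
)

func main() {
	// Hypothetical GC delete: the first branch targets tuples whose expiration
	// predates the GC window; the OR'd second branch also matches rows with
	// expired_tx_id = 0, i.e. tuples that were never expired and are still live.
	query, args, err := sq.Delete("relation_tuples").
		Where(sq.Eq{"tenant_id": "t1"}).
		Where(sq.Or{
			sq.Expr("expired_tx_id < ?::xid8", "1234"), // genuinely expired before the window
			sq.Expr("expired_tx_id = '0'::xid8"),       // never expired -> still-valid tuples
		}).
		PlaceholderFormat(sq.Dollar).
		ToSql()
	if err != nil {
		panic(err)
	}
	fmt.Println(query, args)
}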

fredmajor avatar Sep 09 '25 11:09 fredmajor