Permission denied creating the data directory

Open michel-zimmer opened this issue 2 years ago • 4 comments

I'm stuck with the following error when trying to create any kind of CockroachDB cluster using the operator:

E240215 20:18:40.885312 1 1@cli/clierror/check.go:35  [-] 1  ERROR: connection lost.
E240215 20:18:40.885312 1 1@cli/clierror/check.go:35  [-] 1 +creating data directory: mkdir /cockroach/cockroach-data/auxiliary: permission denied
ERROR: connection lost.

creating data directory: mkdir /cockroach/cockroach-data/auxiliary: permission denied
Failed running "start"

The cluster manifest might look like this:

apiVersion: crdb.cockroachlabs.com/v1alpha1
kind: CrdbCluster
metadata:
  name: primary-crdb
spec:
  cockroachDBVersion: v23.1.11
  dataStore:
    pvc:
      spec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: "1Gi"
        storageClassName: primary-nfs
        volumeMode: Filesystem
  nodes: 3
  resources:
    limits:
      cpu: 2
      memory: 2Gi
    requests:
      cpu: 500m
      memory: 2Gi
  tlsEnabled: true

The storage class is for csi-driver-nfs and leads to the following directory tree:

$ ls -lahF /<nfs-csi-dir>/*
/<nfs-csi-dir>/pvc-40b518b5-bccc-4610-b804-0bd2175f5eed:
total 18K
drwxrwsr-x 2 root 1000581000 2 Feb 11 16:26 ./
drwxr-xr-x 6 root root       6 Feb 11 17:07 ../

The CockroachDB pod manifest (kubectl get pods primary-crdb-0 --output yaml) has the following security context:

securityContext:
  fsGroup: 1000581000
  runAsUser: 1000581000

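That would explain why the permissions don't add up: the directory is owned by root, so a non-root process can write to it only through group 1000581000, and the pod sets fsGroup but no runAsGroup.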

For comparison, using this storage setup, it is possible to create a working mount like this:

...
containers:
  - name: busybox
    image: busybox:1.28
    command: [ "sh", "-c", "sleep 1h" ]
    volumeMounts:
      - name: data
        mountPath: "/test"
securityContext:
  runAsUser: 2000
  runAsGroup: 2000
  fsGroup: 2000
volumes:
  - name: data
    persistentVolumeClaim:
      claimName: test

When creating a file (touch /test/file) from inside the container the directory tree looks like this:

$ ls -lahF /<nfs-csi-dir>/*
/<nfs-csi-dir>/pvc-730e175e-af46-4e48-b4e4-5a1dd568307d:
total 19K
drwxrwsr-x 2 root 2000 3 Feb 11 17:14 ./
drwxr-xr-x 6 root root 6 Feb 11 17:07 ../
-rw-rw-r-- 1 2000 2000 0 Feb 11 17:14 file

It works because the user, the primary group, and the directory's group all match.
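To make the comparison concrete, here is a minimal sketch of the classic POSIX write check applied to the two directory listings above. The function name can_write and its argument order are made up for illustration. It only models a local mode-bit check. It assumes that a pod without runAsGroup ends up with primary group 0, and that the NFS server may or may not honor the supplementary group that fsGroup adds. Neither assumption is confirmed for this setup.

```shell
#!/bin/sh
# can_write PROC_UID "PROC_GIDS" DIR_UID DIR_GID DIR_MODE
# PROC_GIDS is a space-separated list: primary group first, then any
# supplementary groups. DIR_MODE is the octal mode, e.g. 2775 for drwxrwsr-x.
# Root's bypass of mode bits is deliberately not modelled.
can_write() {
    proc_uid=$1 proc_gids=$2 dir_uid=$3 dir_gid=$4 dir_mode=$5

    # Keep only the rwx bits (drops setuid/setgid/sticky).
    perms=$(( 0$dir_mode & 0777 ))

    if [ "$proc_uid" -eq "$dir_uid" ]; then
        # Owner class applies.
        bit=$(( perms & 0200 ))
    else
        # Default to the "other" class, upgrade to "group" on a match.
        bit=$(( perms & 0002 ))
        for g in $proc_gids; do
            if [ "$g" -eq "$dir_gid" ]; then
                bit=$(( perms & 0020 ))
                break
            fi
        done
    fi

    if [ "$bit" -ne 0 ]; then echo yes; else echo no; fi
}

# busybox pod: runAsUser/runAsGroup/fsGroup are all 2000.
can_write 2000 "2000" 0 2000 2775
# CockroachDB pod, assuming primary group 0 and the fsGroup honored as a
# supplementary group.
can_write 1000581000 "0 1000581000" 0 1000581000 2775
# Same pod, assuming the supplementary group is not honored.
can_write 1000581000 "0" 0 1000581000 2775
```

Under these assumptions, only the last case is denied, which is the case a runAsGroup matching the directory's group would avoid.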

I'm wondering whether the operator should specify runAsGroup, or whether something is unusual about my setup and this should not be necessary at all.

The locations in the code would be the following:

  • https://github.com/cockroachdb/cockroach-operator/blob/v2.12.0/pkg/resource/statefulset.go#L208
  • https://github.com/cockroachdb/cockroach-operator/blob/v2.12.0/pkg/resource/job.go#L95

Even though I don't have much experience in self-hosting storage for Kubernetes, I would say adding runAsGroup is the right idea and I'm happy to create a PR if wanted.
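For illustration, this is roughly what the rendered pod security context could look like if the operator also set runAsGroup. It is only a sketch of the proposal. Reusing the fsGroup value for runAsGroup is an assumption, not confirmed operator behavior.

```yaml
# Hypothetical pod securityContext with runAsGroup added.
# The runAsGroup value mirrors fsGroup purely for illustration.
securityContext:
  fsGroup: 1000581000
  runAsUser: 1000581000
  runAsGroup: 1000581000
```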

michel-zimmer avatar Feb 15 '24 21:02 michel-zimmer

same issue

Fred-Ko avatar Apr 18 '24 16:04 Fred-Ko

We have the same issue. For us it shows up when we try to trigger a backup via something like:

kubectl exec \
    --namespace cockroachdb \
    --stdin \
    --tty \
    db-0 \
    --container=db \
    -- ./cockroach sql \
    --certs-dir=/cockroach/cockroach-certs \
    --host=localhost \
    --execute "BACKUP INTO 'nodelocal://1/backups/' as of system time '-10s'"

Which fails with an error like:

ERROR: opening object for writing: creating target local directory "/cockroach/cockroach-data/extern/backups/2025/03/31-132059.48": mkdir /cockroach/cockroach-data/extern/backups/2025: permission denied

dwt avatar Mar 31 '25 13:03 dwt

Some thoughts on this problem:

The cockroach containers currently run as a user that is not declared in the container. This makes the container unnecessarily hard to debug, as it is not immediately obvious that it is intended to run like this. Instead, it looks like an error or a process going haywire.

What I would have expected:

  • The container being built with an application user cockroach
  • That user being used as the standard user of the container
  • all files in /cockroach being owned by that user

To be frank, the fact that this is not the case makes it look like you don't know what you are doing, which is quite terrifying.

dwt avatar Mar 31 '25 19:03 dwt

Same issue on GCP with local-ssd; it works with pd-ssd.

dbabiak avatar May 20 '25 13:05 dbabiak