"unable to open block device" random error
This error seems to be random: some of my PVCs work, but eventually one pod will hit this error. It can happen either on a fresh Helm deployment or on one that has been running for a few days.
In the pod using the volume/PVC:
Warning FailedMount 34s kubelet, k8s-node-2 Unable to mount volumes for pod "registry-6786df68c4-gxgp2_services(653d5612-56c5-11e8-bd87-de2b040ce014)": timeout expired waiting for volumes to attach or mount for pod "services"/"registry-6786df68c4-gxgp2". list of unmounted volumes=[registry-storage]. list of unattached volumes=[registry-storage credentials-volume default-token-bsjcf]
Warning FailedMount 26s (x9 over 2m) kubelet, k8s-node-2 MountVolume.SetUp failed for volume "pvc-652a3868-56c5-11e8-bd87-de2b040ce014" : stat /var/lib/storageos/volumes/04cc0bb7-0934-3f78-96f4-b1882888abcc: no such file or directory
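(For anyone comparing notes: the path in that error can be checked directly on the node named in the event, and the PV it maps to can be inspected from the cluster side. These are just standard checks using the names from the events above.)
ls -l /var/lib/storageos/volumes/          # run on k8s-node-2; the volume should appear here
kubectl -n services get pvc                # the claim backing registry-storage
kubectl describe pv pvc-652a3868-56c5-11e8-bd87-de2b040ce014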
Log from the storageos pod:
time="2018-05-13T15:50:55Z" level=info msg="[[email protected]:5703]:: attempt to establish channel" category=clconmon module=supervisor
time="2018-05-13T15:50:56Z" level=warning msg="unable to open block device /var/lib/storageos/volumes/04cc0bb7-0934-3f78-96f4-b1882888abcc-sys: No such device or address" category=clconmon module=supervisor
time="2018-05-13T15:50:56Z" level=info msg="[[email protected]:5703]: connection established (fd=6)" category=clconn module=supervisor
I'm not sure how to debug this further to give you more information, but let me know if I can help, or if you suspect I misconfigured something.
I have also witnessed this. Three nodes, same hardware: two of them have the volume path, the third does not. This is also accompanied by an increase in system load (one system reached a load average of 85) without ever creating the required folder structure. It occurs semi-regularly when auto-provisioning, but the node that fails is random. Ubuntu 18.04, Docker 17.12.1-0ubuntu1, Kubernetes v1.10.4, latest Helm, overlay2 storage driver, Weave CNI.
I don't know exactly, but I suspect it has something to do with this:
Jun 18 15:23:20 k8s2 kernel: [763686.657279] sd 6:0:1:8: [sdc] 10485760 512-byte logical blocks: (5.37 GB/5.00 GiB)
Jun 18 15:23:20 k8s2 kernel: [763686.657290] sd 6:0:1:8: [sdc] Write Protect is off
Jun 18 15:23:20 k8s2 kernel: [763686.657292] sd 6:0:1:8: [sdc] Mode Sense: 43 00 00 08
Jun 18 15:23:20 k8s2 kernel: [763686.657311] sd 6:0:1:8: [sdc] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
Jun 18 15:23:20 k8s2 kernel: [763686.657315] loopback/naa.6001405005050001: Unsupported SCSI Opcode 0xa3, sending CHECK_CONDITION.
Jun 18 15:23:50 k8s2 kernel: [763717.002495] ABORT_TASK: Found referenced loopback task_tag: 0
Jun 18 15:24:44 k8s2 kernel: [763770.697823] fd_do_rw() returned -5 for non S_ISBLK
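If it helps, the LIO loopback state those messages refer to can be inspected with generic kernel/SCSI tooling (nothing StorageOS-specific, just standard configfs and dmesg checks):
ls /sys/kernel/config/target/loopback/                      # loopback targets configured via configfs
lsblk -S                                                    # SCSI disks the kernel currently exposes
dmesg | grep -iE 'tcm_loop|loopback|abort_task' | tail -20  # recent target-related kernel messages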
I've had to roll back the install for the moment, but I'd be happy to re-test this in the future. It almost works for me, and I'm optimistic.
Thanks for the report @greenaar, we'll look into this. Do you have repro steps?
Can you run lsmod | grep tcm and check whether the tcm_loop module is loaded?
Can you run docker run -it --rm -v /mnt:/mnt:shared busybox sh -c /bin/date to check if mount propagation is enabled in Docker?
Can you share the full logs from storageos?
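(For reference, one way to grab the full logs is shown below; the namespace and pod names are placeholders, to be adjusted to however the chart deployed StorageOS on the affected node.)
kubectl get pods --all-namespaces -o wide | grep -i storageos
kubectl -n <namespace> logs <storageos-pod-on-affected-node> > storageos.log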
I will re-deploy it in the next day or so, and I should be able to get the dumps out for you.
My repro steps were really easy. Install via Helm (the only change being an entry for join:).
Create this volume claim:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shell-pvc
  annotations:
    volume.beta.kubernetes.io/storage-class: fast
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
and attach it to this deployment:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: shell
  namespace: default
spec:
  replicas: 1
  strategy:
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        io.kompose.service: shell
    spec:
      containers:
        - image: ubuntu:16.04
          command:
            - tail
            - "-f"
            - "/etc/hostname"
          name: shell
          volumeMounts:
            - name: shell
              mountPath: '/data'
      restartPolicy: Always
      volumes:
        - name: shell
          persistentVolumeClaim:
            claimName: shell-pvc
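For completeness, the repro is driven roughly as follows (the file names are hypothetical, just whatever the manifests above were saved as):
kubectl apply -f shell-pvc.yaml -f shell-deployment.yaml
kubectl get pvc shell-pvc                                 # should reach Bound once provisioned
kubectl get pods -l io.kompose.service=shell -o wide      # note which node the pod lands on
kubectl exec -it <shell-pod> -- df -h /data               # confirm the volume actually mounted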
Result: the mount point is created on 2 of my 3 nodes, but not on the node the pod is scheduled on. That one gets the fd_do_rw() log message and the ABORT_TASK, and its load gradually climbs to near 100. Note that the host in question varied, but failure was almost assured.
I will add the other items you asked for as soon as I can. Oh, I should mention: these machines are bare metal, not virtual.
Hi @greenaar,
would it be feasible for you to provide us with the logs of the pod running on the instance that has issues?
Sorry for the delay:
root@c1:/share/projects/kubernetes/personal/testing/storageos-charts# lsmod | grep tcm
tcm_loop               24576  3
target_core_mod       360448  8 target_core_iblock,tcm_loop,target_core_user,target_core_file,target_core_pscsi
Module is loaded
root@c1:/share/projects/kubernetes/personal/testing/storageos-charts# docker run -it --rm -v /mnt:/mnt:shared busybox sh -c /bin/date
Unable to find image 'busybox:latest' locally
latest: Pulling from library/busybox
07a152489297: Already exists
Digest: sha256:141c253bc4c3fd0a201d32dc1f493bcf3fff003b6df416dea4f41046e0f37d47
Status: Downloaded newer image for busybox:latest
Sat Jun 23 01:34:20 UTC 2018
Propagation is enabled.
The moment before this system went off the rails, this appeared in syslog:
Jun 22 18:31:37 c1 kernel: [ 1745.323064] scsi host4: TCM_Loopback
Jun 22 18:31:37 c1 kernel: [ 1745.817046] scsi 4:0:1:1: Direct-Access LIO-ORG FILEIO 4.0 PQ: 0 ANSI: 5
Jun 22 18:31:37 c1 kernel: [ 1745.817182] sd 4:0:1:1: [sdb] 2097152 512-byte logical blocks: (1.07 GB/1.00 GiB)
Jun 22 18:31:37 c1 kernel: [ 1745.817198] sd 4:0:1:1: [sdb] Write Protect is off
Jun 22 18:31:37 c1 kernel: [ 1745.817199] sd 4:0:1:1: [sdb] Mode Sense: 43 00 00 08
Jun 22 18:31:37 c1 kernel: [ 1745.817211] sd 4:0:1:1: [sdb] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
Jun 22 18:31:37 c1 kernel: [ 1745.817214] loopback/naa.6001405005050001: Unsupported SCSI Opcode 0xa3, sending CHECK_CONDITION.
Jun 22 18:31:37 c1 kernel: [ 1745.817332] sd 4:0:1:1: Attached scsi generic sg2 type 0
Jun 22 18:31:37 c1 kernel: [ 1745.817714] sd 4:0:1:1: [sdb] Attached SCSI disk
Jun 22 18:31:37 c1 kernel: [ 1745.914684] scsi host3: TCM_Loopback
Jun 22 18:31:44 c1 systemd-udevd[23920]: inotify_add_watch(9, /dev/sdb, 10) failed: No such file or directory
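If it happens again, it may be worth capturing what udev sees for the device right after those messages (standard block-device/udev commands, nothing StorageOS-specific):
ls -l /dev/sdb
udevadm info --query=all --name=/dev/sdb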
The load average climbed to about 15 or so, and this interrupted both Weave networking and the connection to the API.
Interestingly, I then deleted the PVC and re-applied it: the load average only reached about 5, and the volume was created on all 3 nodes.
tl;dr: it works... sometimes. When it doesn't, it blocks on a node, and that includes blocking on removal when using helm delete.
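For the record, the delete/re-apply cycle mentioned above is just the obvious kubectl sequence (the file name is hypothetical, matching whatever the PVC manifest was saved as):
kubectl delete pvc shell-pvc       # this is the step that can hang when a node is wedged
kubectl apply -f shell-pvc.yaml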
I realize this is not a lot of useful information, but I hope it helps in some way. I look forward to your successes.
Hi @greenaar, thank you very much for all this information. We are working on troubleshooting this issue. We'll keep you posted!