"unable to open block device" random error

tscolari opened this issue on May 13 '18 · 6 comments

This error seems to be random: some of my PVCs work, but eventually one pod will hit this error. It can happen either on a fresh helm deployment or on one that has been running for a few days.

Events from the pod using the volume/PVC:

Warning  FailedMount            34s               kubelet, k8s-node-2  Unable to mount volumes for pod "registry-6786df68c4-gxgp2_services(653d5612-56c5-11e8-bd87-de2b040ce014)": timeout expired waiting for volumes to attach or mount for pod "services"/"registry-6786df68c4-gxgp2". list of unmounted volumes=[registry-storage]. list of unattached volumes=[registry-storage credentials-volume default-token-bsjcf]
  Warning  FailedMount            26s (x9 over 2m)  kubelet, k8s-node-2  MountVolume.SetUp failed for volume "pvc-652a3868-56c5-11e8-bd87-de2b040ce014" : stat /var/lib/storageos/volumes/04cc0bb7-0934-3f78-96f4-b1882888abcc: no such file or directory
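
A quick way to confirm this on the node itself (a sketch; the volume ID is the one from the kubelet event above, and the path is the StorageOS default):

# run on the node the kubelet reported (k8s-node-2 here)
ls -l /var/lib/storageos/volumes/
stat /var/lib/storageos/volumes/04cc0bb7-0934-3f78-96f4-b1882888abcc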

Log from the storageos pod:

time="2018-05-13T15:50:55Z" level=info msg="[[email protected]:5703]:: attempt to establish channel" category=clconmon module=supervisor
time="2018-05-13T15:50:56Z" level=warning msg="unable to open block device /var/lib/storageos/volumes/04cc0bb7-0934-3f78-96f4-b1882888abcc-sys: No such device or address" category=clconmon module=supervisor
time="2018-05-13T15:50:56Z" level=info msg="[[email protected]:5703]: connection established (fd=6)" category=clconn module=supervisor

I'm not sure how to debug this further to give you more information, but let me know if I can help, or if you suspect I misconfigured something.

tscolari · May 13 '18 15:05

I have also seen this. 3 nodes, same hardware: 2 of them have the volume path, the third does not. It is also accompanied by a spike in system load (one system reached a load average of 85 without ever creating the required folder structure). This occurs semi-regularly when auto-provisioning, but the node that fails is random. Ubuntu 18.04, Docker 17.12.1-0ubuntu1, Kubernetes v1.10.4, latest helm, overlay2 storage driver, Weave CNI.

I don't know exactly, but I suspect it has something to do with this:

Jun 18 15:23:20 k8s2 kernel: [763686.657279] sd 6:0:1:8: [sdc] 10485760 512-byte logical blocks: (5.37 GB/5.00 GiB)
Jun 18 15:23:20 k8s2 kernel: [763686.657290] sd 6:0:1:8: [sdc] Write Protect is off
Jun 18 15:23:20 k8s2 kernel: [763686.657292] sd 6:0:1:8: [sdc] Mode Sense: 43 00 00 08
Jun 18 15:23:20 k8s2 kernel: [763686.657311] sd 6:0:1:8: [sdc] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
Jun 18 15:23:20 k8s2 kernel: [763686.657315] loopback/naa.6001405005050001: Unsupported SCSI Opcode 0xa3, sending CHECK_CONDITION.
Jun 18 15:23:50 k8s2 kernel: [763717.002495] ABORT_TASK: Found referenced loopback task_tag: 0
Jun 18 15:24:44 k8s2 kernel: [763770.697823] fd_do_rw() returned -5 for non S_ISBLK

I had to roll back the install for the moment, but I'd be happy to re-test this in the future. It almost works for me, and I'm optimistic.

greenaar · Jun 19 '18 03:06

Thanks for the report @greenaar, we'll look into this. Do you have repro steps?

Can you run lsmod | grep tcm and see if you have the tcm_loop module loaded? Can you run docker run -it --rm -v /mnt:/mnt:shared busybox sh -c /bin/date to check that mount propagation is enabled in Docker? And can you share the full logs from storageos?
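
For reference, those checks as commands (the pod name and namespace below are placeholders; adjust to your install):

lsmod | grep tcm
docker run -it --rm -v /mnt:/mnt:shared busybox sh -c /bin/date
# <storageos-pod> and the storageos namespace are placeholders
kubectl -n storageos logs <storageos-pod>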

oicheryl · Jun 20 '18 15:06

I will re-deploy it in the next day or so, and I should be able to get the dumps out for you.

My repro steps were really easy. Install via helm (the only change from the defaults was an entry for join:, roughly as sketched below).
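
A sketch of that install, run from a checkout of this chart (helm 2 syntax; the release name and address are placeholders, and the exact values key for join: is an assumption that may differ by chart version):

# placeholder address; the cluster.join key is an assumption
helm install --name storageos . --set cluster.join=10.1.1.1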

Create this volume claim:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shell-pvc
  annotations:
    volume.beta.kubernetes.io/storage-class: fast
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi

and attach it to this deployment:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: shell
  namespace: default
spec:
  replicas: 1
  strategy:
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        io.kompose.service: shell
    spec:
      containers:
        - image: ubuntu:16.04
          command:
            - tail
            - "-f"
            - "/etc/hostname"
          name: shell
          volumeMounts:
            - name: shell
              mountPath: '/data'
      restartPolicy: Always
      volumes:
        - name: shell
          persistentVolumeClaim:
            claimName: shell-pvc

Result: the mount point is created on 2 of my 3 nodes, but not on the node the pod is being provisioned on. That node gets the fd_do_rw() log message and the ABORT_TASK, and its load gradually climbs to near 100. Note that the host in question varied, but failure was almost assured.
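
A quick way to see which nodes got the volume path (a sketch; node names are placeholders):

# list the StorageOS volumes directory on each node
for n in node1 node2 node3; do echo "== $n"; ssh "$n" ls /var/lib/storageos/volumes/; done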

I will add the other items you asked for as soon as I can. I should also mention that these machines are bare metal, not virtual.

greenaar · Jun 22 '18 03:06

Hi @greenaar,

Could you provide us with the logs of the pod running on the instance that has the issue?

Arau · Jun 22 '18 15:06

Sorry for the delay:

root@c1:/share/projects/kubernetes/personal/testing/storageos-charts# lsmod | grep tcm
tcm_loop               24576  3
target_core_mod       360448  8 target_core_iblock,tcm_loop,target_core_user,target_core_file,target_core_pscsi

The module is loaded.

root@c1:/share/projects/kubernetes/personal/testing/storageos-charts# docker run -it --rm -v /mnt:/mnt:shared busybox sh -c /bin/date
Unable to find image 'busybox:latest' locally
latest: Pulling from library/busybox
07a152489297: Already exists
Digest: sha256:141c253bc4c3fd0a201d32dc1f493bcf3fff003b6df416dea4f41046e0f37d47
Status: Downloaded newer image for busybox:latest
Sat Jun 23 01:34:20 UTC 2018

Propagation is enabled.

The moment before this system went off the rails, this appeared in syslog:

Jun 22 18:31:37 c1 kernel: [ 1745.323064] scsi host4: TCM_Loopback
Jun 22 18:31:37 c1 kernel: [ 1745.817046] scsi 4:0:1:1: Direct-Access LIO-ORG FILEIO 4.0 PQ: 0 ANSI: 5
Jun 22 18:31:37 c1 kernel: [ 1745.817182] sd 4:0:1:1: [sdb] 2097152 512-byte logical blocks: (1.07 GB/1.00 GiB)
Jun 22 18:31:37 c1 kernel: [ 1745.817198] sd 4:0:1:1: [sdb] Write Protect is off
Jun 22 18:31:37 c1 kernel: [ 1745.817199] sd 4:0:1:1: [sdb] Mode Sense: 43 00 00 08
Jun 22 18:31:37 c1 kernel: [ 1745.817211] sd 4:0:1:1: [sdb] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
Jun 22 18:31:37 c1 kernel: [ 1745.817214] loopback/naa.6001405005050001: Unsupported SCSI Opcode 0xa3, sending CHECK_CONDITION.
Jun 22 18:31:37 c1 kernel: [ 1745.817332] sd 4:0:1:1: Attached scsi generic sg2 type 0
Jun 22 18:31:37 c1 kernel: [ 1745.817714] sd 4:0:1:1: [sdb] Attached SCSI disk
Jun 22 18:31:37 c1 kernel: [ 1745.914684] scsi host3: TCM_Loopback
Jun 22 18:31:44 c1 systemd-udevd[23920]: inotify_add_watch(9, /dev/sdb, 10) failed: No such file or directory

The load average climbed to about 15 and disrupted both Weave networking and the connection to the API server.

Interestingly, I then deleted the PVC and re-applied it: a load average of 5, and the volume was created on all 3 nodes.

tl;dr: it works, sometimes. When it doesn't, it blocks on a node, and that includes blocking on removal when using helm delete --purge, requiring a system reboot to clear.

I realize this is not a lot of useful information, but I hope it helps in some way. I look forward to your successes.

greenaar · Jun 23 '18 01:06

Hi @greenaar, thank you very much for all this information. We are working on troubleshooting this issue. We'll keep you posted!

Arau · Jun 25 '18 09:06