
CWAgent fails to resolve linux mount point device to EBS VolumeId on nitro instances

Open montaguethomas opened this issue 7 months ago • 8 comments

Describe the bug Linux allows mounting disks using a device alias (a symlink), but the CWAgent is not able to resolve the EBS VolumeId for the device behind the alias.
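For context, on Nitro instances the xvd* name is itself a symlink (created by udev rules) pointing at the real nvme device node, so collapsing the alias is a single realpath call. A minimal, self-contained sketch of that resolution step, using a temporary directory to stand in for /dev (the names below are illustrative, not taken from the agent source):

```python
import os
import tempfile

# Stand-in for /dev: create "nvme1n1" and an "xvdz" symlink pointing at it,
# mirroring what the udev rules do on a Nitro instance.
devdir = tempfile.mkdtemp()
real = os.path.join(devdir, "nvme1n1")
open(real, "w").close()
alias = os.path.join(devdir, "xvdz")
os.symlink(real, alias)

# os.path.realpath collapses the alias to the underlying device node.
resolved = os.path.realpath(alias)
print(resolved.endswith("nvme1n1"))
```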

Steps to reproduce

  1. Launch Linux t3 instance (with required instance profile)

  2. Install, configure, and start the CloudWatch agent

yum install -y amazon-cloudwatch-agent
cat <<'EOF' > /tmp/amazon-cloudwatch-agent-config.json
{
  "agent": {
    "metrics_collection_interval": 60
  },
  "metrics": {
    "aggregation_dimensions": [
      ["VolumeId"]
    ],
    "append_dimensions": {
      "InstanceId": "${aws:InstanceId}"
    },
    "metrics_collected": {
      "disk": {
        "append_dimensions": {
          "VolumeId": "${aws:VolumeId}"
        },
        "ignore_file_system_types": ["devtmpfs", "overlay", "shm", "sysfs", "tmpfs"],
        "measurement": ["used_percent"],
        "metrics_collection_interval": 60,
        "resources": ["*"]
      }
    }
  }
}
EOF
/opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -s -c file:/tmp/amazon-cloudwatch-agent-config.json
  3. Confirm the base metrics are reporting and have VolumeId populated

  4. Create a new EBS volume and attach it to the instance as /dev/xvdz

  5. Format the EBS volume: mkfs.xfs /dev/xvdz

  6. Mount the EBS volume using the /dev/xvdz source device via a direct syscall:

cat <<'EOF' > ~/mount.py
#!/usr/bin/env python3
import ctypes
import ctypes.util
import os

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
# int mount(const char *source, const char *target, const char *filesystemtype,
#           unsigned long mountflags, const void *data);
libc.mount.argtypes = (ctypes.c_char_p, ctypes.c_char_p, ctypes.c_char_p, ctypes.c_ulong, ctypes.c_char_p)

def mount(source, target, fs, options=""):
  ret = libc.mount(source.encode(), target.encode(), fs.encode(), 0, options.encode())
  if ret < 0:
    errno = ctypes.get_errno()
    raise OSError(errno, f"Error mounting {source} ({fs}) on {target} with options '{options}': {os.strerror(errno)}")

mount("/dev/xvdz", "/mnt/data-xvdz", "xfs", "")
EOF

mkdir -p /mnt/data-xvdz
python3 ~/mount.py
  7. The mounted volume shows up as /dev/xvdz in df -h and cat /proc/mounts; running the mount command instead shows the resolved symlink target.

  8. Check for metrics for the newly mounted EBS volume and whether VolumeId is populated
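The agent-side view of the problem can be reproduced in isolation: the first column of /proc/mounts carries the literal source string that was passed to mount(2), so the agent sees the xvd* alias verbatim. A hedged sketch of the parsing step, run against sample text rather than the live /proc/mounts (a real resolver would additionally pass the device field through os.path.realpath before the VolumeId lookup):

```python
# Parse the device/mountpoint/fstype columns of /proc/mounts.
# Sample data matches the reproduction above; the second entry is the
# unresolved alias that the agent currently fails to map to a VolumeId.
sample = """\
/dev/nvme0n1p1 / xfs rw,noatime 0 0
/dev/xvdz /mnt/data-xvdz xfs rw,relatime 0 0
"""

mounts = []
for line in sample.splitlines():
    device, mountpoint, fstype = line.split()[:3]
    mounts.append((device, mountpoint, fstype))

print(mounts[1][0])  # the alias as recorded by the kernel
```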

What did you expect to see? Expected to see VolumeId populated for all disk mount points.

What did you see instead? The VolumeId is not populated.

What version did you use? Version: CWAgent/1.300054.1 (go1.23.8; linux; amd64)

What config did you use?

{
  "agent": {
    "metrics_collection_interval": 60
  },
  "metrics": {
    "aggregation_dimensions": [
      ["VolumeId"]
    ],
    "append_dimensions": {
      "InstanceId": "${aws:InstanceId}"
    },
    "metrics_collected": {
      "disk": {
        "append_dimensions": {
          "VolumeId": "${aws:VolumeId}"
        },
        "ignore_file_system_types": ["devtmpfs", "overlay", "shm", "sysfs", "tmpfs"],
        "measurement": ["used_percent"],
        "metrics_collection_interval": 60,
        "resources": ["*"]
      }
    }
  }
}

Environment OS: Amazon Linux 2 (amazon/amzn2-ami-ecs-hvm-2.0.20250610-x86_64-ebs)

Additional context I use the Rexray EBS plugin to handle creation and mounting of EBS volumes for ECS services. It turns out the plugin calls the mount syscall without resolving the symlink that the nvme driver creates, so the kernel records the mount source as the xvd* alias rather than the underlying nvme device.

[root@ip-10-0-91-150 ~]# df -h
Filesystem      Size  Used  Avail Use% Mounted on
/dev/nvme0n1p1  100T  9.7G  90.3T  10% /
/dev/xvdp        50G   25G    25G  50% /var/lib/docker/plugins/cfbcd2009d193760d0b441f622a2385bde857b3f4e1b66c827467e6b47fae543/propagated-mount/volumes/my-app-data


[root@ip-10-0-91-150 ~]# cat /proc/mounts
/dev/nvme0n1p1 / xfs rw,noatime,attr2,inode64,noquota 0 0
/dev/nvme0n1p1 /var/lib/docker/plugins/cfbcd2009d193760d0b441f622a2385bde857b3f4e1b66c827467e6b47fae543/propagated-mount xfs rw,noatime,attr2,inode64,noquota 0 0
/dev/nvme0n1p1 /var/lib/docker/plugins/399504751ea4753b38a6931240b4f1ae63be57bf6edaa50bf3535e11aae9ee34/propagated-mount xfs rw,noatime,attr2,inode64,noquota 0 0
/dev/xvdp /var/lib/docker/plugins/cfbcd2009d193760d0b441f622a2385bde857b3f4e1b66c827467e6b47fae543/propagated-mount/volumes/my-app-data xfs rw,relatime,nouuid,attr2,inode64,noquota 0 0
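For EBS on Nitro, the NVMe serial number of the device is the volume ID without its dash (e.g. vol07741b30adb5cbb31 for vol-07741b30adb5cbb31), and it is exposed via sysfs, e.g. /sys/block/nvme1n1/device/serial (that path is an assumption on my part, not something taken from the agent source). The normalization itself is trivial; a sketch:

```python
def serial_to_volume_id(serial: str) -> str:
    """Convert an EBS NVMe serial ("vol0...") to the EC2 form ("vol-0...")."""
    serial = serial.strip()
    if serial.startswith("vol") and not serial.startswith("vol-"):
        return "vol-" + serial[len("vol"):]
    return serial

print(serial_to_volume_id("vol07741b30adb5cbb31"))  # vol-07741b30adb5cbb31
```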

To verify what Telegraf is actually reporting, I adapted the config that CWAgent generates and ran the latest Telegraf directly:

cat <<'EOF' > ~/telegraf-config.toml
[agent]
  collection_jitter = "0s"
  debug = false
  flush_interval = "1s"
  flush_jitter = "0s"
  hostname = ""
  interval = "60s"
  logtarget = "stderr"
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  omit_hostname = false
  precision = ""
  quiet = false
  round_interval = false

[inputs]

  [[inputs.disk]]
    fieldpass = ["used_percent"]
    ignore_fs = ["devtmpfs", "overlay", "shm", "sysfs", "tmpfs"]
    interval = "60s"
    tagexclude = ["mode"]
    [inputs.disk.tags]

[outputs]

  [[outputs.file]]
    files = ["stdout"]
EOF

curl -LO https://dl.influxdata.com/telegraf/releases/telegraf-1.34.4_linux_amd64.tar.gz
tar -xzf telegraf-1.34.4_linux_amd64.tar.gz
./telegraf-1.34.4/usr/bin/telegraf -config ~/telegraf-config.toml

Telegraf Results:

disk,device=nvme0n1p1,fstype=xfs,host=ip-10-0-91-150.us-east-2.compute.internal,label=/,path=/ used_percent=9.79178633890271363 1749864322000000000
disk,device=nvme0n1p1,fstype=xfs,host=ip-10-0-91-150.us-east-2.compute.internal,label=/,path=/var/lib/docker/plugins/cfbcd2009d193760d0b441f622a2385bde857b3f4e1b66c827467e6b47fae543/propagated-mount used_percent=9.79178633890271363 1749864322000000000
disk,device=nvme0n1p1,fstype=xfs,host=ip-10-0-91-150.us-east-2.compute.internal,label=/,path=/var/lib/docker/plugins/399504751ea4753b38a6931240b4f1ae63be57bf6edaa50bf3535e11aae9ee34/propagated-mount used_percent=9.79178633890271363 1749864322000000000
disk,device=xvdp,fstype=xfs,host=ip-10-0-91-150.us-east-2.compute.internal,path=/var/lib/docker/plugins/cfbcd2009d193760d0b441f622a2385bde857b3f4e1b66c827467e6b47fae543/propagated-mount/volumes/my-app-data used_percent=49.956185744611764 1749864322000000000

montaguethomas avatar Jun 15 '25 02:06 montaguethomas

Hmm, just digging around and looks like rexray used to create the symlink but removed it due to lack of support in some operating systems. Relevant issue: https://github.com/rexray/rexray/pull/1293

duhminick avatar Jun 18 '25 02:06 duhminick

That PR was closed and never merged. The current active code is:

https://github.com/rexray/rexray/blob/362035816046e87f7bc5a6ca745760d09a69a40c/libstorage/drivers/storage/ebs/executor/ebs_executor.go#L188-L219

I did a full walkthrough of the LocalDevices() method below, but to shortcut to the main point: its results seem to only be used when selecting the next device name to use when mounting a volume. They have nothing to do with the actual mounting of the device. I'm happy to test running the Rexray EBS plugin with the nvme CLI installed on the OS to confirm the problem still exists.

Stepping through what the function does: it loops through the devices listed in the following output:

[root@ip-10-0-95-22 ~]# cat /proc/partitions
major minor  #blocks  name

 259        0 5242880000 nvme0n1
 259        1 5242877935 nvme0n1p1
 259        2       1024 nvme0n1p128
 259        3 1048576000 nvme1n1
[root@ip-10-0-95-22 ~]#
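Parsing that listing is just a matter of skipping the two header lines and taking the fourth column; a minimal sketch of the device list LocalDevices() would start from (sample text copied from above, not read live):

```python
# Sample /proc/partitions content from the instance above.
sample = """\
major minor  #blocks  name

 259        0 5242880000 nvme0n1
 259        1 5242877935 nvme0n1p1
 259        2       1024 nvme0n1p128
 259        3 1048576000 nvme1n1
"""

devices = []
for line in sample.splitlines()[2:]:  # skip the header and blank line
    if line.strip():
        devices.append("/dev/" + line.split()[3])

print(devices)
```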

On Amazon Linux, the nvme CLI is not installed by default. After installing it (yum install nvme-cli), the command the code would execute produces:

[root@ip-10-0-95-22 ~]# /usr/sbin/nvme id-ctrl --raw-binary /dev/nvme0n1
vol0bdc18860d80139eaAmazon Elastic Block Store              1.0      ��?WfD@B@Bxvda

[root@ip-10-0-95-22 ~]# /usr/sbin/nvme id-ctrl --raw-binary /dev/nvme0n1p1
vol0bdc18860d80139eaAmazon Elastic Block Store              1.0      ��?WfD@B@Bxvda

[root@ip-10-0-95-22 ~]# /usr/sbin/nvme id-ctrl --raw-binary /dev/nvme0n1p128
vol0bdc18860d80139eaAmazon Elastic Block Store              1.0      ��?WfD@B@Bxvda

[root@ip-10-0-95-22 ~]# /usr/sbin/nvme id-ctrl --raw-binary /dev/nvme1n1
vol07741b30adb5cbb31Amazon Elastic Block Store              1.0      ��?WfD@B@B/dev/xvdm
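The raw id-ctrl output above is the NVMe Identify Controller structure: the serial number sits at bytes 4-23, the model number at 24-63, and on EBS devices the attachment device name is written into the vendor-specific area starting at byte 3072 (the same layout the ebsnvme-id tool from ec2-utils assumes; treat these offsets as my assumption, not something from the CWAgent source). A sketch over a synthetic buffer:

```python
# Build a synthetic 4096-byte Identify Controller buffer mimicking the
# raw output above. Offsets are assumptions per the NVMe spec / ebsnvme-id.
buf = bytearray(4096)
buf[4:24] = b"vol07741b30adb5cbb31"                    # serial number (bytes 4-23)
buf[24:64] = b"Amazon Elastic Block Store".ljust(40)   # model number (bytes 24-63)
buf[3072:3072 + 9] = b"/dev/xvdm"                      # vendor-specific: device name

serial = bytes(buf[4:24]).decode().strip()
device = bytes(buf[3072:3104]).split(b"\x00", 1)[0].decode().strip()
print(serial, device)
```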

If I'm stepping through the code correctly, the return of LocalDevices() would be the following if the nvme cli is installed on the OS:

&types.LocalDevices{
    Driver: d.Name(),
    DeviceMap: map[string]string{
        "/dev/xvdm": "/dev/nvme1n1",
    },
}

Of note, the nvme0n1 devices would be dropped because they fail to match the allowed device-range regex.
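That filtering can be illustrated with a stand-in pattern (the actual range regex lives in the rexray EBS executor linked above; the pattern below is a hypothetical example, not the real one):

```python
import re

# Hypothetical stand-in for rexray's allowed device-range check:
# xvdb..xvdz pass, while the root alias (xvda) and bare nvme names do not.
allowed = re.compile(r"^(/dev/)?xvd[b-z]$")

names = ["xvda", "/dev/xvdm", "nvme0n1"]
kept = [n for n in names if allowed.match(n)]
print(kept)  # only the in-range alias survives
```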

montaguethomas avatar Jun 18 '25 15:06 montaguethomas

Just completed testing: there was no change with the nvme CLI installed before the rexray/ebs plugin started; EBS volumes were still mounted via their attachment-point alias.

montaguethomas avatar Jun 19 '25 04:06 montaguethomas

This issue was marked stale due to lack of activity.

github-actions[bot] avatar Sep 20 '25 00:09 github-actions[bot]

bump

montaguethomas avatar Sep 22 '25 15:09 montaguethomas

ping

montaguethomas avatar Nov 05 '25 15:11 montaguethomas

ping

montaguethomas avatar Nov 13 '25 00:11 montaguethomas