CWAgent fails to resolve Linux mount point device to EBS VolumeId on Nitro instances
Describe the bug
Linux allows mounting disks using a device alias (symlink), but the CWAgent is not able to resolve the EBS VolumeId for a device mounted that way.
Steps to reproduce
- Launch Linux t3 instance (with required instance profile)
- Install, configure, and start the CloudWatch agent
yum install -y amazon-cloudwatch-agent
cat <<'EOF' > /tmp/amazon-cloudwatch-agent-config.json
{
  "agent": {
    "metrics_collection_interval": 60
  },
  "metrics": {
    "aggregation_dimensions": [
      ["VolumeId"]
    ],
    "append_dimensions": {
      "InstanceId": "${aws:InstanceId}"
    },
    "metrics_collected": {
      "disk": {
        "append_dimensions": {
          "VolumeId": "${aws:VolumeId}"
        },
        "ignore_file_system_types": ["devtmpfs", "overlay", "shm", "sysfs", "tmpfs"],
        "measurement": ["used_percent"],
        "metrics_collection_interval": 60,
        "resources": ["*"]
      }
    }
  }
}
EOF
/opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -s -c file:/tmp/amazon-cloudwatch-agent-config.json
- Confirm the base metrics are reporting and have VolumeId populated
- Create new EBS volume and attach to the instance as /dev/xvdz
- Format the EBS volume: mkfs.xfs /dev/xvdz
- Mount the EBS volume using /dev/xvdz source device via a direct syscall:
cat <<'EOF' > ~/mount.py
#!/usr/bin/env python3
import ctypes
import ctypes.util
import os

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
libc.mount.argtypes = (ctypes.c_char_p, ctypes.c_char_p, ctypes.c_char_p, ctypes.c_ulong, ctypes.c_char_p)

def mount(source, target, fs, options=""):
    ret = libc.mount(source.encode(), target.encode(), fs.encode(), 0, options.encode())
    if ret < 0:
        errno = ctypes.get_errno()
        raise OSError(errno, f"Error mounting {source} ({fs}) on {target} with options '{options}': {os.strerror(errno)}")

mount("/dev/xvdz", "/mnt/data-xvdz", "xfs", "")
EOF
mkdir -p /mnt/data-xvdz
python3 ~/mount.py
- The mounted volume will show up as /dev/xvdz when running df -h and cat /proc/mounts. Running the mount command will show the resolved device symlink name.
- Check for metrics for the newly mounted EBS volume and if VolumeId is populated
What did you expect to see?
Expected to see VolumeId populated for all disk mount points.
What did you see instead?
The VolumeId is not populated.
What version did you use?
Version: CWAgent/1.300054.1 (go1.23.8; linux; amd64)
What config did you use?
{
  "agent": {
    "metrics_collection_interval": 60
  },
  "metrics": {
    "aggregation_dimensions": [
      ["VolumeId"]
    ],
    "append_dimensions": {
      "InstanceId": "${aws:InstanceId}"
    },
    "metrics_collected": {
      "disk": {
        "append_dimensions": {
          "VolumeId": "${aws:VolumeId}"
        },
        "ignore_file_system_types": ["devtmpfs", "overlay", "shm", "sysfs", "tmpfs"],
        "measurement": ["used_percent"],
        "metrics_collection_interval": 60,
        "resources": ["*"]
      }
    }
  }
}
Environment
OS: Amazon Linux 2 (amazon/amzn2-ami-ecs-hvm-2.0.20250610-x86_64-ebs)
Additional context
I use the Rexray EBS plugin to handle creation and mounting of EBS volumes for ECS services. It turns out that the Rexray EBS plugin calls the mount syscall without resolving the symlink that the nvme driver creates. This results in the kernel mounting the block device under its xvd* alias:
[root@ip-10-0-91-150 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/nvme0n1p1 100T 9.7G 90.3T 10% /
/dev/xvdp 50G 25G 25G 50% /var/lib/docker/plugins/cfbcd2009d193760d0b441f622a2385bde857b3f4e1b66c827467e6b47fae543/propagated-mount/volumes/my-app-data
[root@ip-10-0-91-150 ~]# cat /proc/mounts
/dev/nvme0n1p1 / xfs rw,noatime,attr2,inode64,noquota 0 0
/dev/nvme0n1p1 /var/lib/docker/plugins/cfbcd2009d193760d0b441f622a2385bde857b3f4e1b66c827467e6b47fae543/propagated-mount xfs rw,noatime,attr2,inode64,noquota 0 0
/dev/nvme0n1p1 /var/lib/docker/plugins/399504751ea4753b38a6931240b4f1ae63be57bf6edaa50bf3535e11aae9ee34/propagated-mount xfs rw,noatime,attr2,inode64,noquota 0 0
/dev/xvdp /var/lib/docker/plugins/cfbcd2009d193760d0b441f622a2385bde857b3f4e1b66c827467e6b47fae543/propagated-mount/volumes/my-app-data xfs rw,relatime,nouuid,attr2,inode64,noquota 0 0
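To make the alias problem concrete, here is a minimal sketch of how a mount source taken from /proc/mounts could be resolved to the device it actually points at. The helper names (parse_mounts, resolve_sources) are hypothetical and are not part of CWAgent or Rexray; the sketch also ignores the octal escapes (such as \040 for a space) that /proc/mounts uses in paths.

```python
import os


def parse_mounts(text):
    """Parse /proc/mounts-style text into (source, mountpoint, fstype) tuples."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3:  # skip blank or malformed lines
            entries.append((fields[0], fields[1], fields[2]))
    return entries


def resolve_sources(entries, prefix="/dev/"):
    """Map each mount source under `prefix` to its symlink-resolved path.

    os.path.realpath returns the path unchanged when it is not a symlink,
    so real devices map to themselves and aliases map to their target.
    """
    resolved = {}
    for source, _mountpoint, _fstype in entries:
        if source.startswith(prefix):
            resolved[source] = os.path.realpath(source)
    return resolved
```

On an affected instance, feeding the contents of /proc/mounts through these two functions should show an xvd* source mapping to an nvme* device, assuming the alias symlink exists under /dev.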
To verify what Telegraf itself reports, I adjusted the config that CWAgent generates and ran the latest Telegraf directly:
cat <<'EOF' > ~/telegraf-config.toml
[agent]
collection_jitter = "0s"
debug = false
flush_interval = "1s"
flush_jitter = "0s"
hostname = ""
interval = "60s"
logtarget = "stderr"
metric_batch_size = 1000
metric_buffer_limit = 10000
omit_hostname = false
precision = ""
quiet = false
round_interval = false
[inputs]
[[inputs.disk]]
fieldpass = ["used_percent"]
ignore_fs = ["devtmpfs", "overlay", "shm", "sysfs", "tmpfs"]
interval = "60s"
tagexclude = ["mode"]
[inputs.disk.tags]
[outputs]
[[outputs.file]]
files = ["stdout"]
EOF
curl -LO https://dl.influxdata.com/telegraf/releases/telegraf-1.34.4_linux_amd64.tar.gz
tar -xzf telegraf-1.34.4_linux_amd64.tar.gz
./telegraf-1.34.4/usr/bin/telegraf -config ~/telegraf-config.toml
Telegraf Results:
disk,device=nvme0n1p1,fstype=xfs,host=ip-10-0-91-150.us-east-2.compute.internal,label=/,path=/ used_percent=9.79178633890271363 1749864322000000000
disk,device=nvme0n1p1,fstype=xfs,host=ip-10-0-91-150.us-east-2.compute.internal,label=/,path=/var/lib/docker/plugins/cfbcd2009d193760d0b441f622a2385bde857b3f4e1b66c827467e6b47fae543/propagated-mount used_percent=9.79178633890271363 1749864322000000000
disk,device=nvme0n1p1,fstype=xfs,host=ip-10-0-91-150.us-east-2.compute.internal,label=/,path=/var/lib/docker/plugins/399504751ea4753b38a6931240b4f1ae63be57bf6edaa50bf3535e11aae9ee34/propagated-mount used_percent=9.79178633890271363 1749864322000000000
disk,device=xvdp,fstype=xfs,host=ip-10-0-91-150.us-east-2.compute.internal,path=/var/lib/docker/plugins/cfbcd2009d193760d0b441f622a2385bde857b3f4e1b66c827467e6b47fae543/propagated-mount/volumes/my-app-data used_percent=49.956185744611764 1749864322000000000
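The device tag in these results is the value that would have to be mapped to a VolumeId. As a rough sketch, the lines above can be parsed like this to pick out devices that are not reported under an nvme name. The parser is deliberately simplified: it assumes float field values and no escaped spaces or commas, which holds for the output shown here but not for line protocol in general. The function names are made up for illustration.

```python
def parse_line(line):
    """Split one line-protocol record into (measurement, tags, fields, timestamp)."""
    head, field_part, timestamp = line.rsplit(" ", 2)
    measurement, *tag_pairs = head.split(",")
    tags = dict(pair.split("=", 1) for pair in tag_pairs)
    fields = {}
    for pair in field_part.split(","):
        key, value = pair.split("=", 1)
        fields[key] = float(value)  # assumption: only float fields are present
    return measurement, tags, fields, int(timestamp)


def devices_not_named_nvme(lines):
    """Return the sorted, de-duplicated device tags that do not start with 'nvme'."""
    devices = set()
    for line in lines:
        if not line.strip():
            continue
        _measurement, tags, _fields, _ts = parse_line(line.strip())
        device = tags.get("device", "")
        if device and not device.startswith("nvme"):
            devices.add(device)
    return sorted(devices)
```

Run against the four result lines above, the only device this should flag is xvdp.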
Hmm, just digging around, and it looks like rexray used to create the symlink but removed it due to lack of support in some operating systems. Relevant issue: https://github.com/rexray/rexray/pull/1293
That PR was closed and never merged. The current active code is:
https://github.com/rexray/rexray/blob/362035816046e87f7bc5a6ca745760d09a69a40c/libstorage/drivers/storage/ebs/executor/ebs_executor.go#L188-L219
I did a full walkthrough of the LocalDevices() method below. To shortcut to the main point, though: the results from LocalDevices() seem to be used only when selecting the next device name to use when mounting a volume. They have nothing to do with the actual mounting of the device. I'm happy to test running Rexray EBS with the nvme cli installed for the OS to confirm the problem still exists.
Stepping through what the function does: it loops through the devices listed in the following output:
[root@ip-10-0-95-22 ~]# cat /proc/partitions
major minor #blocks name
259 0 5242880000 nvme0n1
259 1 5242877935 nvme0n1p1
259 2 1024 nvme0n1p128
259 3 1048576000 nvme1n1
[root@ip-10-0-95-22 ~]#
On Amazon Linux, the nvme cli is not installed by default. After installing it (yum install nvme-cli), running the command the code would execute results in:
[root@ip-10-0-95-22 ~]# /usr/sbin/nvme id-ctrl --raw-binary /dev/nvme0n1
vol0bdc18860d80139eaAmazon Elastic Block Store 1.0 ��?WfD@B@Bxvda
[root@ip-10-0-95-22 ~]# /usr/sbin/nvme id-ctrl --raw-binary /dev/nvme0n1p1
vol0bdc18860d80139eaAmazon Elastic Block Store 1.0 ��?WfD@B@Bxvda
[root@ip-10-0-95-22 ~]# /usr/sbin/nvme id-ctrl --raw-binary /dev/nvme0n1p128
vol0bdc18860d80139eaAmazon Elastic Block Store 1.0 ��?WfD@B@Bxvda
[root@ip-10-0-95-22 ~]# /usr/sbin/nvme id-ctrl --raw-binary /dev/nvme1n1
vol07741b30adb5cbb31Amazon Elastic Block Store 1.0 ��?WfD@B@B/dev/xvdm
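For anyone who wants to read that raw output programmatically, here is a minimal sketch. The serial number (bytes 4-23) and model number (bytes 24-63) offsets come from the NVMe Identify Controller layout. Reading the attachment device name from the first 32 bytes of the vendor-specific area at offset 3072 is an assumption on my part, based on how EBS appears to populate that region. The function name is hypothetical.

```python
def parse_ebs_id_ctrl(raw):
    """Extract the EBS volume ID and attachment device from raw id-ctrl bytes."""
    if len(raw) < 3104:
        raise ValueError("expected at least 3104 bytes of Identify Controller data")

    serial = raw[4:24].decode("ascii", "replace").strip(" \x00")
    model = raw[24:64].decode("ascii", "replace").strip(" \x00")
    # Assumption: EBS stores the attachment device name at the start of the
    # vendor-specific region (offset 3072), padded to 32 bytes.
    device = raw[3072:3104].decode("ascii", "replace").strip(" \x00")

    # The serial is reported without the hyphen (e.g. vol0bdc...), so add it back.
    if serial.startswith("vol") and not serial.startswith("vol-"):
        serial = "vol-" + serial[3:]
    # The name shows up both as "xvda" and as "/dev/xvdm" above; normalize it.
    if device and not device.startswith("/dev/"):
        device = "/dev/" + device

    return {"volume_id": serial, "model": model, "device": device}
```

The input would be the bytes captured from nvme id-ctrl --raw-binary, for example via subprocess.check_output.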
If I'm stepping through the code correctly, the return of LocalDevices() would be the following if the nvme cli is installed on the OS:
&types.LocalDevices{
Driver: d.Name(),
DeviceMap: map[string]string{
"/dev/xvdm": "/dev/nvme1n1",
},
}
Of note, the nvme0n1 devices would be dropped due to failing to match on the allowed device range regex.
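The filtering step can be sketched as follows. This is my reading of the logic, not a port of the rexray code: the ALLOWED pattern is a hypothetical stand-in for rexray's allowed device range (I have not copied the real regex), and build_device_map is an illustrative name.

```python
import re

# Hypothetical stand-in for the allowed device range; NOT the actual rexray regex.
ALLOWED = re.compile(r"^/dev/xvd[f-p]$")


def build_device_map(reported, allowed=ALLOWED):
    """Build {attachment name: nvme device}, keeping only names in the allowed range.

    `reported` maps each nvme device path to the name returned by id-ctrl.
    """
    device_map = {}
    for nvme_path, name in reported.items():
        if allowed.match(name):
            device_map[name] = nvme_path
    return device_map


# Names as reported by the id-ctrl output above.
reported = {
    "/dev/nvme0n1": "xvda",
    "/dev/nvme0n1p1": "xvda",
    "/dev/nvme0n1p128": "xvda",
    "/dev/nvme1n1": "/dev/xvdm",
}
print(build_device_map(reported))
```

With the stand-in pattern, only the /dev/xvdm entry survives, which matches the DeviceMap shown above.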
Just completed testing and found no change with the nvme cli installed before the rexray/ebs plugin starts; EBS volumes were still being mounted by their attachment point alias.