
Add neighbor state

Open eugercek opened this issue 1 year ago • 6 comments

Implements part of #535

Some things we can discuss:

  • We could omit labels like state="stale" and expose the raw state value in the gauge instead (IMO the current implementation is simpler). Although CONTRIBUTING.md says the following, I didn't see many magic numbers in the existing metrics and labels, so I added human-readable names for the states:

For example a proc file contains the magic number 42 as some identifier, the Node Exporter should expose it as it is and not keep a mapping in code to make this human readable.

  • For code quality, I could split the Update function to make it more readable (again, IMO it's simple enough to keep in one function).
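
To illustrate the human-readable naming discussed above, the mapping from NUD state values to label names could look like the sketch below. The constant values mirror Linux's include/uapi/linux/neighbour.h (the same values golang.org/x/sys/unix exposes as NUD_*); the function name `neighborStateName` is hypothetical, not necessarily what this PR uses.

```go
package main

import "fmt"

// NUD neighbor-cache states as defined in Linux's
// include/uapi/linux/neighbour.h (same values as unix.NUD_*).
const (
	NUD_INCOMPLETE = 0x01
	NUD_REACHABLE  = 0x02
	NUD_STALE      = 0x04
	NUD_DELAY      = 0x08
	NUD_PROBE      = 0x10
	NUD_FAILED     = 0x20
	NUD_NOARP      = 0x40
	NUD_PERMANENT  = 0x80
)

// neighborStateName maps a single NUD state value to a human-readable
// label value; unrecognized values fall back to "unknown".
func neighborStateName(state uint16) string {
	names := map[uint16]string{
		NUD_INCOMPLETE: "incomplete",
		NUD_REACHABLE:  "reachable",
		NUD_STALE:      "stale",
		NUD_DELAY:      "delay",
		NUD_PROBE:      "probe",
		NUD_FAILED:     "failed",
		NUD_NOARP:      "noarp",
		NUD_PERMANENT:  "permanent",
	}
	if name, ok := names[state]; ok {
		return name
	}
	return "unknown"
}

func main() {
	fmt.Println(neighborStateName(NUD_STALE)) // stale
}
```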

This PR adds the following metrics:

# HELP node_arp_states ARP states by device
# TYPE node_arp_states gauge
node_arp_states{device="eth0",state="stale"} 1
node_arp_states{device="eth0",state="failed"} 2
# ...

Old comments

I changed the design in the second commit to avoid exploding cardinality, but I'm keeping the comments below in case the maintainers prefer that design.

If you want metrics like the ones below, read on:

node_arp_states{device="eth0",state="stale", ip="1.1.1.1"} 1
node_arp_states{device="eth0",state="stale", ip="1.1.1.2"} 1
node_arp_states{device="eth0",state="failed", ip="1.1.1.3"} 1

Some things we can discuss:

  • We could remove the neighborState struct and just use a string slice, but this is easier to follow IMO.
  • This change may create many metrics. To make an estimate: the default ARP cache size limits are min 128 and max 1024, so we might expect up to 1024 * count(net_int) metrics in the worst case. In some "fine tuning" guides I've seen people increase this limit further, and although it's bounded by the size of your subnet, one can have many machines in a subnet. Maybe I should emit a counter per state instead?

@SuperQ it's ready for your review, when you have time 🙏🏻

eugercek avatar Nov 23 '24 19:11 eugercek

Also fixed the merge conflict.

eugercek avatar Jan 13 '25 18:01 eugercek

LGTM in general, but I think this needs some tests.

discordianfish avatar Feb 02 '25 10:02 discordianfish

Also see lint error:

  Error: collector/arp_linux.go:54:6: type `neighborState` is unused (unused)
  type neighborState struct {
       ^

discordianfish avatar Feb 02 '25 10:02 discordianfish

Hi, there was a bug in the n.State&unix.NUD_NOARP == unix.NUD_NOARP check; I fixed that and also fixed the linter error.

On the testing side, I'm not sure how to test RTNL via fixtures. Maybe we could test /proc/net/arp via fixtures, but this PR doesn't touch that part.

To test it, I wrote a reproducible but hacky Bash script. I don't recommend adding it as a test to the repository, but it demonstrates that the functionality is implemented correctly. I would appreciate guidance on how to add proper tests for the RTNL part. (You can run the script on your machine; it only needs Docker and a cd to the node_exporter repo root.)

Script
#!/bin/bash

set -eu

export DOCKER_CLI_HINTS=false

docker rm -f arp-test nginx1 &> /dev/null || true

tempFile=$(mktemp)
cat <<EOT > "$tempFile"
FROM ubuntu

COPY node_exporter /opt/node_exporter

RUN apt update 1>/dev/null && apt install -y curl iproute2
USER nobody
ENTRYPOINT ["/opt/node_exporter"]
EOT

GOOS=linux GOARCH=amd64 go build .

docker build -t arp-test -f "$tempFile" . &>/dev/null

docker run --name arp-test \
   -p 9100:9100 \
   --rm -d \
   arp-test &>/dev/null

docker exec arp-test ip neigh

docker run -d --name nginx1 nginx  &>/dev/null

nginxIp=$(docker inspect \
 -f '{{range.NetworkSettings.Networks}}{{.IPAddress}}{{end}}'  nginx1)

cur_state() {
   echo "====="
   echo "From node_exporter:"
   curl -s localhost:9100/metrics | grep -E ^node_arp
   echo "From ip neigh:"
   docker exec arp-test ip neigh
   echo "====="
}

echo "Initial state"
cur_state

echo "After curl nginx1 (NUD_DELAY or NUD_REACHABLE)"
docker exec arp-test curl -s -o /dev/null "$nginxIp"
cur_state
sleep 2
echo "Wait arp cache (NUD_REACHABLE)"
cur_state

echo "After curl unknown ip (NUD_INCOMPLETE or NUD_FAILED)"
docker exec arp-test curl -m 1 -s -o /dev/null 172.17.0.10 || true
cur_state

echo "Sleep 30 second for net.ipv4.neigh.default.base_reachable_time_ms (NUD_FAILED)"
sleep 30
cur_state
Output
Initial state
=====
From node_exporter:
node_arp_entries{device="eth0"} 1
node_arp_states{device="eth0",state="reachable"} 1
From ip neigh:
172.17.0.1 dev eth0 lladdr 02:42:6a:74:ba:4a REACHABLE
=====
After curl nginx1 (NUD_DELAY or NUD_REACHABLE)
=====
From node_exporter:
node_arp_entries{device="eth0"} 2
node_arp_states{device="eth0",state="reachable"} 2
From ip neigh:
172.17.0.3 dev eth0 lladdr 02:42:ac:11:00:03 REACHABLE
172.17.0.1 dev eth0 lladdr 02:42:6a:74:ba:4a REACHABLE
=====
Wait arp cache (NUD_REACHABLE)
=====
From node_exporter:
node_arp_entries{device="eth0"} 2
node_arp_states{device="eth0",state="reachable"} 2
From ip neigh:
172.17.0.3 dev eth0 lladdr 02:42:ac:11:00:03 REACHABLE
172.17.0.1 dev eth0 lladdr 02:42:6a:74:ba:4a REACHABLE
=====
After curl unknown ip (NUD_INCOMPLETE or NUD_FAILED)
=====
From node_exporter:
node_arp_entries{device="eth0"} 3
node_arp_states{device="eth0",state="incomplete"} 1
node_arp_states{device="eth0",state="reachable"} 2
From ip neigh:
172.17.0.10 dev eth0  INCOMPLETE
172.17.0.3 dev eth0 lladdr 02:42:ac:11:00:03 REACHABLE
172.17.0.1 dev eth0 lladdr 02:42:6a:74:ba:4a REACHABLE
=====
Sleep 30 second for net.ipv4.neigh.default.base_reachable_time_ms (NUD_FAILED)
=====
From node_exporter:
node_arp_entries{device="eth0"} 3
node_arp_states{device="eth0",state="failed"} 1
node_arp_states{device="eth0",state="reachable"} 2
From ip neigh:
172.17.0.10 dev eth0  FAILED
172.17.0.3 dev eth0 lladdr 02:42:ac:11:00:03 REACHABLE
172.17.0.1 dev eth0 lladdr 02:42:6a:74:ba:4a REACHABLE
=====
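
As a dependency-free starting point (separate from the RTNL plumbing, which I still don't know how to fixture-test), the state-name mapping itself could be covered by a table-driven unit test. The identifiers below (`neighborStates`, the NUD_* constants) are stand-ins for whatever the PR actually names them:

```go
package main

import "fmt"

// Hypothetical stand-ins for unix.NUD_* and the collector's
// state-name lookup; replace with the PR's actual identifiers.
const (
	NUD_INCOMPLETE = 0x01
	NUD_REACHABLE  = 0x02
	NUD_FAILED     = 0x20
)

var neighborStates = map[uint16]string{
	NUD_INCOMPLETE: "incomplete",
	NUD_REACHABLE:  "reachable",
	NUD_FAILED:     "failed",
}

func main() {
	// Table-driven check of the mapping, mirroring what a
	// collector/arp_linux_test.go table test could assert.
	cases := []struct {
		state uint16
		want  string
	}{
		{NUD_REACHABLE, "reachable"},
		{NUD_INCOMPLETE, "incomplete"},
		{NUD_FAILED, "failed"},
	}
	for _, c := range cases {
		if got := neighborStates[c.state]; got != c.want {
			panic(fmt.Sprintf("state %#x: got %q, want %q", c.state, got, c.want))
		}
	}
	fmt.Println("ok")
}
```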

eugercek avatar Feb 02 '25 13:02 eugercek

I have a few concerns remaining about this PR.

Firstly, this has the potential to inflate cardinality quite considerably on hosts with a lot of interfaces. The simple ARP entry count is easy to predict, as it won't be more than one metric per interface. However, adding enum states where there could potentially be three or four different non-zero state counts per interface might be a concern for some users. I know for a fact that some users run node_exporter on hosts with hundreds of interfaces. Perhaps this new feature should be gated by a command line flag.

Secondly, as with any metric involving enum states, if only non-zero enum counters are exposed, this will lead to stale metrics when a state changes. For example, if node_arp_states{device="eth0",state="incomplete"} 1 becomes node_arp_states{device="eth0",state="failed"} 1, the state="incomplete" metric will still be returned by queries up to the Prometheus query.lookback-delta (default 5 mins). If that metric changes state again within those 5 minutes (which is quite possible, since Linux neighbor cache gc_stale_time is 60s by default), this will effectively make this metric unreliable and useless. Most enum state metrics strive to expose all the states, even when the counters are zero, to avoid this phenomenon. However this comes at a cardinality price.
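
One common way to avoid that stale-series problem is to pre-seed every known state with a zero count per device before counting entries. A rough sketch of the idea (the type and function names are illustrative, not this PR's actual code):

```go
package main

import "fmt"

// All human-readable neighbor states the collector would export;
// seeding each with zero keeps every series present on every scrape.
var allStates = []string{
	"incomplete", "reachable", "stale", "delay",
	"probe", "failed", "noarp", "permanent",
}

type neighbor struct{ device, state string }

// countStates returns per-device state counts, with every known state
// initialized to 0 so zero-valued series are still emitted.
func countStates(neighbors []neighbor) map[string]map[string]int {
	counts := map[string]map[string]int{}
	seed := func(dev string) {
		if _, ok := counts[dev]; !ok {
			counts[dev] = map[string]int{}
			for _, s := range allStates {
				counts[dev][s] = 0
			}
		}
	}
	for _, n := range neighbors {
		seed(n.device)
		counts[n.device][n.state]++
	}
	return counts
}

func main() {
	c := countStates([]neighbor{{"eth0", "reachable"}, {"eth0", "reachable"}})
	fmt.Println(c["eth0"]["reachable"], c["eth0"]["failed"]) // 2 0
}
```

This trades cardinality (a fixed 8 series per device on every scrape) for query reliability, which is exactly the tension described in the comment.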

Finally, as I already mentioned, the neighbor state is a bit mask. This PR in its current state assumes that the neighbor state will only ever be a single NUD state, and that a simple map lookup for that value is acceptable. Are you sure that there will never be entries that are a combination of state flags? If this occurs, the neighborStatesMap lookup would return an empty string due to no matching key. And if there are multiple neighbor entries with different state flag combinations, this would result in the collector attempting to emit multiple metrics with identical label sets, i.e. node_arp_states{device="eth0",state=""} - which would cause a panic.
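
Since the state field is a bit mask, a more defensive lookup would iterate over the individual flag bits rather than doing a single map hit, so a combination of flags never yields an empty name. A sketch of that approach (constant values mirror Linux's NUD_* flags; the function name is made up):

```go
package main

import (
	"fmt"
	"strings"
)

// Linux NUD_* neighbor-state flag values
// (include/uapi/linux/neighbour.h).
const (
	NUD_INCOMPLETE = 0x01
	NUD_REACHABLE  = 0x02
	NUD_STALE      = 0x04
	NUD_DELAY      = 0x08
	NUD_PROBE      = 0x10
	NUD_FAILED     = 0x20
	NUD_NOARP      = 0x40
	NUD_PERMANENT  = 0x80
)

var flagNames = []struct {
	bit  uint16
	name string
}{
	{NUD_INCOMPLETE, "incomplete"},
	{NUD_REACHABLE, "reachable"},
	{NUD_STALE, "stale"},
	{NUD_DELAY, "delay"},
	{NUD_PROBE, "probe"},
	{NUD_FAILED, "failed"},
	{NUD_NOARP, "noarp"},
	{NUD_PERMANENT, "permanent"},
}

// stateNames decomposes a (possibly combined) state bit mask into the
// names of every flag that is set, so a combination never maps to "".
func stateNames(state uint16) []string {
	var names []string
	for _, f := range flagNames {
		if state&f.bit != 0 {
			names = append(names, f.name)
		}
	}
	if len(names) == 0 {
		names = []string{"none"} // NUD_NONE (0x00)
	}
	return names
}

func main() {
	fmt.Println(strings.Join(stateNames(NUD_STALE|NUD_PERMANENT), ","))
}
```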

dswarbrick avatar Feb 02 '25 18:02 dswarbrick

Sorry for the late response, I'm a little busy. Will look at it tomorrow.

eugercek avatar Feb 09 '25 14:02 eugercek