DNS resolution on Azure Compute hosts running Ubuntu stops working once calico-vpp-node pods are up and running
Environment
- Calico/VPP version: tigera-operator v3.26.3 / Calico VPP v3.26.0; also tried tigera-operator v3.27.2 / Calico VPP v3.27.0
- Kubernetes version: v1.28.8
- Deployment type: kubeadm cluster on Azure Compute instances
- Network configuration: Calico default with VXLAN enabled
- Pod CIDR: 192.168.0.0/16
- Service CIDR: 10.96.0.0/12
- CRI: containerd 1.6.28 (docker is not installed)
- OS: Ubuntu 22.04
- kernel:
Linux master 5.15.0-1042-azure #49-Ubuntu SMP Tue Jul 11 17:28:46 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Issue description
The calico-vpp-node pods somehow break DNS resolution on the hosts once those pods get fully initialized and running. The /etc/resolv.conf file on the hosts gets edited while the calico-vpp-node pod is running. DNS resolution from within the calico-vpp-node pods works fine; it is the host's DNS resolution that is affected, which prevents all Calico VPP components from being configured correctly, as some pods get stuck in the ImagePullBackOff state.
To Reproduce
Steps to reproduce the behavior:
- provision Azure Compute instances (e.g. control-plane1, worker1). Used Standard_D4s_v3 instances
- deploy a kubeadm cluster. Used kubeadm v1.28.8
- install Calico VPP. Used calico-vpp-nohuge.yaml
- edited CALICOVPP_INTERFACES to use interfaceName: eth0 instead of the default eth1, as shown below:
CALICOVPP_INTERFACES: |-
  {
    "maxPodIfSpec": {
      "rx": 10, "tx": 10, "rxqsz": 1024, "txqsz": 1024
    },
    "defaultPodIfSpec": {
      "rx": 1, "tx": 1, "isl3": true
    },
    "vppHostTapSpec": {
      "rx": 1, "tx": 1, "rxqsz": 1024, "txqsz": 1024, "isl3": false
    },
    "uplinkInterfaces": [
      {
        "interfaceName": "eth0",
        "vppDriver": "af_packet"
      }
    ]
  }
- installation-default.yaml was edited as follows:
kind: Installation
metadata:
  name: default
spec:
  # Configures Calico networking.
  calicoNetwork:
    linuxDataplane: VPP
    ipPools:
      - cidr: 192.168.0.0/16
        encapsulation: VXLAN
Expected behavior
Installation of Calico VPP should not disrupt the host's DNS resolution.
Additional context
- the order of manifest installation:
kubectl apply --server-side --force-conflicts -f tigera-operator.yaml
kubectl apply -f installation-default.yaml
kubectl apply -f calico-vpp-nohuge.yaml
- while the calico-vpp-node pods are initializing, DNS resolution on the host works as expected. However, once the calico-vpp-dataplane/calico-vpp-node pods reach the Running state, DNS resolution stops working on the host and the /etc/resolv.conf file gets modified.
- example of /etc/resolv.conf on the host before Calico VPP is installed:
nameserver 127.0.0.53
options edns0 trust-ad
search abkhse5g3e5ebd4v3jenyazk4h.xx.internal.cloudapp.net
- example of /etc/resolv.conf on the host after the calico-vpp-node pod reaches the Running state:
nameserver 127.0.0.53
options edns0 trust-ad
search .
- example of /etc/resolv.conf inside the calico-vpp-node pods:
search abkhse5g3e5ebd4v3jenyazk4h.xx.internal.cloudapp.net
nameserver 168.63.129.16
- I have no issue getting a response when running curl google.com from within the calico-vpp-node pod, but the same query fails on the host with the message curl: (6) Could not resolve host: google.com
- I noticed that Calico VPP seems to add the service CIDR to the routing table on the host. I'm not sure whether this has any impact on the host's DNS resolution, but the programming of that route seems to correlate with the moment DNS resolution on the host stops working.
- example of programmed routes on the host before calico-vpp-node is up, or right after you manually kill the pod and before it comes back up:
default via 172.10.1.1 dev eth0 proto dhcp src 172.10.1.4 metric 100
168.63.129.16 via 172.10.1.1 dev eth0 proto dhcp src 172.10.1.4 metric 100
169.254.169.254 via 172.10.1.1 dev eth0 proto dhcp src 172.10.1.4 metric 100
172.10.1.0/24 dev eth0 proto kernel scope link src 172.10.1.4 metric 100
172.10.1.1 dev eth0 proto dhcp scope link src 172.10.1.4 metric 100
- example of programmed routes on the host after the calico-vpp-node pod is up:
default via 172.10.1.1 dev eth0 proto dhcp src 172.10.1.4 metric 100
10.96.0.0/12 via 172.10.1.254 dev eth0 proto static mtu 1440
168.63.129.16 via 172.10.1.1 dev eth0 proto dhcp src 172.10.1.4 metric 100
169.254.169.254 via 172.10.1.1 dev eth0 proto dhcp src 172.10.1.4 metric 100
172.10.1.0/24 dev eth0 proto kernel scope link src 172.10.1.4
172.10.1.0/24 dev eth0 proto kernel scope link src 172.10.1.4 metric 100
172.10.1.1 dev eth0 proto dhcp scope link src 172.10.1.4 metric 100
192.168.0.0/16 via 172.10.1.254 dev eth0 proto static mtu 1440
- one way I can get pods to pull the necessary images after the calico-vpp-node pods are up and running is to manually kill the calico-vpp-node pods and force-restart the pods that are failing to pull images. Since it takes the calico-vpp-node pods a few moments to reach the Running state, the cycled workload pods usually get a chance to start pulling their images before DNS resolution is broken again.
- I get a somewhat better workaround if I manually edit the /etc/resolv.conf file on the host to match the one I fetch from within the calico-vpp-node pods; a sketch of that edit is below. DNS then works until the calico-vpp-node pod gets restarted, as the restart of that pod seems to overwrite the /etc/resolv.conf file once again.
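For reference, the manual edit boils down to something like this (contents copied from the pod's /etc/resolv.conf shown above; calico-vpp-node overwrites the file again on restart):
sudo tee /etc/resolv.conf > /dev/null <<'EOF'
search abkhse5g3e5ebd4v3jenyazk4h.xx.internal.cloudapp.net
nameserver 168.63.129.16
EOF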
I would like to understand what breaks DNS resolution on the hosts when the Calico VPP dataplane is installed on the cluster.
Hi @ivansharamok, could you share the vpp-manager logs:
kubectl logs -n calico-vpp-dataplane calico-vpp-node-XYZ -c vpp
Also, any specific reasons for using v3.26 instead of the latest v3.27? if possible, could you switch to v3.27?
Are the nodes using NetworkManager or systemd.networkd? Could you please share the appropriate logs (NM or systemd.networkd) when this issue happens?
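For example:
journalctl -u systemd-networkd
# or, if the node runs NetworkManager:
journalctl -u NetworkManager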
> Also, any specific reasons for using v3.26 instead of the latest v3.27? if possible, could you switch to v3.27?
I tried v3.27.0, but the calicovpp/install-whereabouts image wasn't published to Docker Hub, which prompted me to switch to v3.26.0. I see that it was published a few days ago. I'll give it a try and update this ticket.
Installed Calico VPP v3.27.0. Hit the same issue. Below is the info collected from the cluster using Calico VPP v3.27.0.
Looks like Ubuntu 22.04 uses systemd-networkd by default.
# checking if NetworkManager is used
azureuser@master:~$ systemctl status NetworkManager
Unit NetworkManager.service could not be found.
azureuser@master:~$ systemctl status network-manager
Unit network-manager.service could not be found.
# checking if systemd-networkd is used
azureuser@master:~$ systemctl status /etc/network/interfaces
Unit etc-network-interfaces.mount could not be found.
azureuser@master:~$ systemctl status systemd-networkd
● systemd-networkd.service - Network Configuration
Loaded: loaded (/lib/systemd/system/systemd-networkd.service; enabled; vendor preset: enabled)
Active: active (running) since Thu 2024-04-04 17:10:06 UTC; 15min ago
TriggeredBy: ● systemd-networkd.socket
Docs: man:systemd-networkd.service(8)
Main PID: 7977 (systemd-network)
Status: "Processing requests..."
Tasks: 1 (limit: 19179)
Memory: 1.3M
CPU: 136ms
CGroup: /system.slice/systemd-networkd.service
└─7977 /lib/systemd/systemd-networkd
Apr 04 17:10:06 master systemd[1]: Starting Network Configuration...
Apr 04 17:10:06 master systemd-networkd[7977]: eth0: Link UP
Here's the log for systemd-networkd (journalctl -u systemd-networkd).
Apr 04 16:32:09 master systemd[1]: Starting Network Configuration...
Apr 04 16:32:09 master systemd-networkd[539]: lo: Link UP
Apr 04 16:32:09 master systemd-networkd[539]: lo: Gained carrier
Apr 04 16:32:09 master systemd-networkd[539]: Enumeration completed
Apr 04 16:32:09 master systemd[1]: Started Network Configuration.
Apr 04 16:32:09 master systemd-networkd[539]: eth0: Link UP
Apr 04 16:32:09 master systemd-networkd[539]: eth0: Gained carrier
Apr 04 16:32:09 master systemd-networkd[539]: eth0: Link DOWN
Apr 04 16:32:09 master systemd-networkd[539]: eth0: Lost carrier
Apr 04 16:32:09 master systemd-networkd[539]: eth0: Link UP
Apr 04 16:32:09 master systemd-networkd[539]: eth0: Gained carrier
Apr 04 16:32:09 master systemd-networkd[539]: eth0: DHCPv4 address 172.10.1.5/24 via 172.10.1.1
Apr 04 16:32:11 master systemd-networkd[539]: eth0: Gained IPv6LL
Apr 04 17:10:06 master systemd-networkd[539]: eth0: Link DOWN
Apr 04 17:10:06 master systemd-networkd[539]: eth0: Lost carrier
Apr 04 17:10:06 master systemd-networkd[539]: eth0: DHCP lease lost
Apr 04 17:10:06 master systemd-networkd[539]: eth0: DHCPv6 lease lost
Apr 04 17:10:06 master systemd-networkd[539]: eth0: Link UP
Apr 04 17:10:06 master systemd-networkd[539]: eth0: Gained carrier
Apr 04 17:10:06 master systemd[1]: Stopping Network Configuration...
Apr 04 17:10:06 master systemd[1]: systemd-networkd.service: Deactivated successfully.
Apr 04 17:10:06 master systemd[1]: Stopped Network Configuration.
Apr 04 17:10:06 master systemd[1]: Starting Network Configuration...
Apr 04 17:10:06 master systemd-networkd[7977]: eth0: Link UP
Apr 04 17:10:06 master systemd-networkd[7977]: eth0: Gained carrier
Apr 04 17:10:06 master systemd-networkd[7977]: lo: Link UP
Apr 04 17:10:06 master systemd-networkd[7977]: lo: Gained carrier
Apr 04 17:10:06 master systemd-networkd[7977]: Enumeration completed
Apr 04 17:10:06 master systemd[1]: Started Network Configuration.
Apr 04 17:10:07 master systemd-networkd[7977]: eth0: Gained IPv6LL
Apr 04 17:29:43 master systemd[1]: Stopping Network Configuration...
Apr 04 17:29:43 master systemd[1]: systemd-networkd.service: Deactivated successfully.
Apr 04 17:29:43 master systemd[1]: Stopped Network Configuration.
Apr 04 17:29:43 master systemd[1]: Starting Network Configuration...
Apr 04 17:29:43 master systemd-networkd[17212]: eth0: Link UP
Apr 04 17:29:43 master systemd-networkd[17212]: eth0: Gained carrier
Apr 04 17:29:43 master systemd-networkd[17212]: lo: Link UP
Apr 04 17:29:43 master systemd-networkd[17212]: lo: Gained carrier
Apr 04 17:29:43 master systemd-networkd[17212]: eth0: Gained IPv6LL
Apr 04 17:29:43 master systemd-networkd[17212]: Enumeration completed
Apr 04 17:29:43 master systemd[1]: Started Network Configuration.
- in the log, the time Apr 04 17:10:06 corresponds to when I installed Calico VPP in my cluster
- the time Apr 04 17:29:43 corresponds to the sudo systemctl restart systemd-networkd command, as I tried to see whether restarting the networking service could fix the problem. It didn't.
Logs for one of the calico-vpp-node pods:
time="2024-04-04T17:10:03Z" level=info msg="Version info\nImage tag : ab81a775fbdeba932888690c68ddf7e9f4bd8d2b\nVPP-dataplane version : ab81a77 Release v3.27.0\nVPP Version : 24.02-rc0~8-g9db45f6ae\nBinapi-generator version : v0.8.0\nVPP Base commit : 06efd532e gerrit:34726/3 interface: add buffer stats api\n------------------ Cherry picked commits --------------------\ncapo: Calico Policies plugin\nacl: acl-plugin custom policies\ncnat: [WIP] no k8s maglev from pods\npbl: Port based balancer\ngerrit:40078/3 vnet: allow format deleted swifidx\ngerrit:40090/3 cnat: undo fib_entry_contribute_forwarding\ngerrit:39507/13 cnat: add flow hash config to cnat translation\ngerrit:34726/3 interface: add buffer stats api\n-------------------------------------------------------------\n"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_SWAP_DRIVER="
time="2024-04-04T17:10:03Z" level=info msg="Config:SERVICE_PREFIX=[10.96.0.0/12]"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_GRACEFUL_SHUTDOWN_TIMEOUT=10s"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_INTERFACES={\n \"defaultPodIfSpec\": {\n \"rx\": 1,\n \"tx\": 1,\n \"rxqsz\": 0,\n \"txqsz\": 0,\n \"isl3\": true,\n \"rxMode\": 0\n },\n \"maxPodIfSpec\": {\n \"rx\": 10,\n \"tx\": 10,\n \"rxqsz\": 1024,\n \"txqsz\": 1024,\n \"isl3\": null,\n \"rxMode\": 0\n },\n \"vppHostTapSpec\": {\n \"rx\": 1,\n \"tx\": 1,\n \"rxqsz\": 1024,\n \"txqsz\": 1024,\n \"isl3\": false,\n \"rxMode\": 0\n },\n \"uplinkInterfaces\": [\n {\n \"rx\": 0,\n \"tx\": 0,\n \"rxqsz\": 0,\n \"txqsz\": 0,\n \"isl3\": null,\n \"rxMode\": 0,\n \"isMain\": false,\n \"physicalNetworkName\": \"\",\n \"interfaceName\": \"eth0\",\n \"vppDriver\": \"af_packet\",\n \"newDriver\": \"\",\n \"annotations\": null,\n \"mtu\": 0\n }\n ]\n}"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_FEATURE_GATES={}"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_IPSEC={\n \"nbAsyncCryptoThreads\": 0,\n \"extraAddresses\": 0\n}"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_INITIAL_CONFIG={\n \"vppStartupSleepSeconds\": 1,\n \"corePattern\": \"/var/lib/vpp/vppcore.%e.%p\",\n \"extraAddrCount\": 0,\n \"ifConfigSavePath\": \"\",\n \"defaultGWs\": \"\",\n \"redirectToHostRules\": null\n}"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_CONFIG_TEMPLATE=unix {\n nodaemon\n full-coredump\n cli-listen /var/run/vpp/cli.sock\n pidfile /run/vpp/vpp.pid\n exec /etc/vpp/startup.exec\n}\napi-trace { on }\ncpu {\n workers 0\n}\nsocksvr {\n socket-name /var/run/vpp/vpp-api.sock\n}\nplugins {\n plugin default { enable }\n plugin dpdk_plugin.so { disable }\n plugin calico_plugin.so { enable }\n plugin ping_plugin.so { disable }\n plugin dispatch_trace_plugin.so { enable }\n}\nbuffers {\n buffers-per-numa 131072\n}"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_HOOK_BEFORE_IF_READ=#!/bin/sh\n\nHOOK=\"$0\"\nchroot /host /bin/sh <<EOSCRIPT\n\nfix_dns () {\n if systemctl status NetworkManager > /dev/null 2>&1; then\n echo \"default_hook: system is using NetworkManager; fixing dns...\"\n sed -i \"s/\\[main\\]/\\[main\\]\\ndns=none/\" /etc/NetworkManager/NetworkManager.conf\n systemctl daemon-reload\n systemctl restart NetworkManager\n fi\n}\n\nundo_dns_fix () {\n if systemctl status NetworkManager > /dev/null 2>&1; then\n echo \"default_hook: system is using NetworkManager; undoing dns fix...\"\n sed -i \"0,/dns=none/{/dns=none/d;}\" /etc/NetworkManager/NetworkManager.conf\n systemctl daemon-reload\n systemctl restart NetworkManager\n fi\n}\n\nrestart_network () {\n if systemctl status systemd-networkd > /dev/null 2>&1; then\n echo \"default_hook: system is using systemd-networkd; restarting...\"\n systemctl restart systemd-networkd\n elif systemctl status NetworkManager > /dev/null 2>&1; then\n echo \"default_hook: system is using NetworkManager; restarting...\"\n systemctl restart NetworkManager\n elif systemctl status networking > /dev/null 2>&1; then\n echo \"default_hook: system is using networking service; restarting...\"\n systemctl restart networking\n elif systemctl status network > /dev/null 2>&1; then\n echo \"default_hook: system is using network service; restarting...\"\n systemctl restart network\n else\n echo \"default_hook: Networking backend not detected, network configuration may fail\"\n fi\n}\n\nif which systemctl > /dev/null; then\n echo \"default_hook: using systemctl...\"\nelse\n echo \"default_hook: Init system not supported, network configuration may fail\"\n exit 1\nfi\n\nif [ \"$HOOK\" = \"BEFORE_VPP_RUN\" ]; then\n fix_dns\nelif [ \"$HOOK\" = \"VPP_RUNNING\" ]; then\n restart_network\nelif [ \"$HOOK\" = \"VPP_DONE_OK\" ]; then\n undo_dns_fix\n restart_network\nelif [ \"$HOOK\" = \"VPP_ERRORED\" ]; then\n undo_dns_fix\n restart_network\nfi\n\nEOSCRIPT\n"
time="2024-04-04T17:10:03Z" level=info msg="Config:NODENAME=master"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_BGP_LOG_LEVEL=INFO"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_HOOK_BEFORE_VPP_RUN=#!/bin/sh\n\nHOOK=\"$0\"\nchroot /host /bin/sh <<EOSCRIPT\n\nfix_dns () {\n if systemctl status NetworkManager > /dev/null 2>&1; then\n echo \"default_hook: system is using NetworkManager; fixing dns...\"\n sed -i \"s/\\[main\\]/\\[main\\]\\ndns=none/\" /etc/NetworkManager/NetworkManager.conf\n systemctl daemon-reload\n systemctl restart NetworkManager\n fi\n}\n\nundo_dns_fix () {\n if systemctl status NetworkManager > /dev/null 2>&1; then\n echo \"default_hook: system is using NetworkManager; undoing dns fix...\"\n sed -i \"0,/dns=none/{/dns=none/d;}\" /etc/NetworkManager/NetworkManager.conf\n systemctl daemon-reload\n systemctl restart NetworkManager\n fi\n}\n\nrestart_network () {\n if systemctl status systemd-networkd > /dev/null 2>&1; then\n echo \"default_hook: system is using systemd-networkd; restarting...\"\n systemctl restart systemd-networkd\n elif systemctl status NetworkManager > /dev/null 2>&1; then\n echo \"default_hook: system is using NetworkManager; restarting...\"\n systemctl restart NetworkManager\n elif systemctl status networking > /dev/null 2>&1; then\n echo \"default_hook: system is using networking service; restarting...\"\n systemctl restart networking\n elif systemctl status network > /dev/null 2>&1; then\n echo \"default_hook: system is using network service; restarting...\"\n systemctl restart network\n else\n echo \"default_hook: Networking backend not detected, network configuration may fail\"\n fi\n}\n\nif which systemctl > /dev/null; then\n echo \"default_hook: using systemctl...\"\nelse\n echo \"default_hook: Init system not supported, network configuration may fail\"\n exit 1\nfi\n\nif [ \"$HOOK\" = \"BEFORE_VPP_RUN\" ]; then\n fix_dns\nelif [ \"$HOOK\" = \"VPP_RUNNING\" ]; then\n restart_network\nelif [ \"$HOOK\" = \"VPP_DONE_OK\" ]; then\n undo_dns_fix\n restart_network\nelif [ \"$HOOK\" = \"VPP_ERRORED\" ]; then\n undo_dns_fix\n restart_network\nfi\n\nEOSCRIPT\n"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_HOOK_VPP_RUNNING=#!/bin/sh\n\nHOOK=\"$0\"\nchroot /host /bin/sh <<EOSCRIPT\n\nfix_dns () {\n if systemctl status NetworkManager > /dev/null 2>&1; then\n echo \"default_hook: system is using NetworkManager; fixing dns...\"\n sed -i \"s/\\[main\\]/\\[main\\]\\ndns=none/\" /etc/NetworkManager/NetworkManager.conf\n systemctl daemon-reload\n systemctl restart NetworkManager\n fi\n}\n\nundo_dns_fix () {\n if systemctl status NetworkManager > /dev/null 2>&1; then\n echo \"default_hook: system is using NetworkManager; undoing dns fix...\"\n sed -i \"0,/dns=none/{/dns=none/d;}\" /etc/NetworkManager/NetworkManager.conf\n systemctl daemon-reload\n systemctl restart NetworkManager\n fi\n}\n\nrestart_network () {\n if systemctl status systemd-networkd > /dev/null 2>&1; then\n echo \"default_hook: system is using systemd-networkd; restarting...\"\n systemctl restart systemd-networkd\n elif systemctl status NetworkManager > /dev/null 2>&1; then\n echo \"default_hook: system is using NetworkManager; restarting...\"\n systemctl restart NetworkManager\n elif systemctl status networking > /dev/null 2>&1; then\n echo \"default_hook: system is using networking service; restarting...\"\n systemctl restart networking\n elif systemctl status network > /dev/null 2>&1; then\n echo \"default_hook: system is using network service; restarting...\"\n systemctl restart network\n else\n echo \"default_hook: Networking backend not detected, network configuration may fail\"\n fi\n}\n\nif which systemctl > /dev/null; then\n echo \"default_hook: using systemctl...\"\nelse\n echo \"default_hook: Init system not supported, network configuration may fail\"\n exit 1\nfi\n\nif [ \"$HOOK\" = \"BEFORE_VPP_RUN\" ]; then\n fix_dns\nelif [ \"$HOOK\" = \"VPP_RUNNING\" ]; then\n restart_network\nelif [ \"$HOOK\" = \"VPP_DONE_OK\" ]; then\n undo_dns_fix\n restart_network\nelif [ \"$HOOK\" = \"VPP_ERRORED\" ]; then\n undo_dns_fix\n restart_network\nfi\n\nEOSCRIPT\n"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_SRV6={\n \"localsidPool\": \"\",\n \"policyPool\": \"\"\n}"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_LOG_FORMAT="
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_INTERFACE="
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_INIT_SCRIPT_TEMPLATE="
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_CONFIG_EXEC_TEMPLATE="
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_HOOK_VPP_DONE_OK=#!/bin/sh\n\nHOOK=\"$0\"\nchroot /host /bin/sh <<EOSCRIPT\n\nfix_dns () {\n if systemctl status NetworkManager > /dev/null 2>&1; then\n echo \"default_hook: system is using NetworkManager; fixing dns...\"\n sed -i \"s/\\[main\\]/\\[main\\]\\ndns=none/\" /etc/NetworkManager/NetworkManager.conf\n systemctl daemon-reload\n systemctl restart NetworkManager\n fi\n}\n\nundo_dns_fix () {\n if systemctl status NetworkManager > /dev/null 2>&1; then\n echo \"default_hook: system is using NetworkManager; undoing dns fix...\"\n sed -i \"0,/dns=none/{/dns=none/d;}\" /etc/NetworkManager/NetworkManager.conf\n systemctl daemon-reload\n systemctl restart NetworkManager\n fi\n}\n\nrestart_network () {\n if systemctl status systemd-networkd > /dev/null 2>&1; then\n echo \"default_hook: system is using systemd-networkd; restarting...\"\n systemctl restart systemd-networkd\n elif systemctl status NetworkManager > /dev/null 2>&1; then\n echo \"default_hook: system is using NetworkManager; restarting...\"\n systemctl restart NetworkManager\n elif systemctl status networking > /dev/null 2>&1; then\n echo \"default_hook: system is using networking service; restarting...\"\n systemctl restart networking\n elif systemctl status network > /dev/null 2>&1; then\n echo \"default_hook: system is using network service; restarting...\"\n systemctl restart network\n else\n echo \"default_hook: Networking backend not detected, network configuration may fail\"\n fi\n}\n\nif which systemctl > /dev/null; then\n echo \"default_hook: using systemctl...\"\nelse\n echo \"default_hook: Init system not supported, network configuration may fail\"\n exit 1\nfi\n\nif [ \"$HOOK\" = \"BEFORE_VPP_RUN\" ]; then\n fix_dns\nelif [ \"$HOOK\" = \"VPP_RUNNING\" ]; then\n restart_network\nelif [ \"$HOOK\" = \"VPP_DONE_OK\" ]; then\n undo_dns_fix\n restart_network\nelif [ \"$HOOK\" = \"VPP_ERRORED\" ]; then\n undo_dns_fix\n restart_network\nfi\n\nEOSCRIPT\n"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_LOG_LEVEL=info"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_DEBUG={}"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_HOOK_VPP_ERRORED=#!/bin/sh\n\nHOOK=\"$0\"\nchroot /host /bin/sh <<EOSCRIPT\n\nfix_dns () {\n if systemctl status NetworkManager > /dev/null 2>&1; then\n echo \"default_hook: system is using NetworkManager; fixing dns...\"\n sed -i \"s/\\[main\\]/\\[main\\]\\ndns=none/\" /etc/NetworkManager/NetworkManager.conf\n systemctl daemon-reload\n systemctl restart NetworkManager\n fi\n}\n\nundo_dns_fix () {\n if systemctl status NetworkManager > /dev/null 2>&1; then\n echo \"default_hook: system is using NetworkManager; undoing dns fix...\"\n sed -i \"0,/dns=none/{/dns=none/d;}\" /etc/NetworkManager/NetworkManager.conf\n systemctl daemon-reload\n systemctl restart NetworkManager\n fi\n}\n\nrestart_network () {\n if systemctl status systemd-networkd > /dev/null 2>&1; then\n echo \"default_hook: system is using systemd-networkd; restarting...\"\n systemctl restart systemd-networkd\n elif systemctl status NetworkManager > /dev/null 2>&1; then\n echo \"default_hook: system is using NetworkManager; restarting...\"\n systemctl restart NetworkManager\n elif systemctl status networking > /dev/null 2>&1; then\n echo \"default_hook: system is using networking service; restarting...\"\n systemctl restart networking\n elif systemctl status network > /dev/null 2>&1; then\n echo \"default_hook: system is using network service; restarting...\"\n systemctl restart network\n else\n echo \"default_hook: Networking backend not detected, network configuration may fail\"\n fi\n}\n\nif which systemctl > /dev/null; then\n echo \"default_hook: using systemctl...\"\nelse\n echo \"default_hook: Init system not supported, network configuration may fail\"\n exit 1\nfi\n\nif [ \"$HOOK\" = \"BEFORE_VPP_RUN\" ]; then\n fix_dns\nelif [ \"$HOOK\" = \"VPP_RUNNING\" ]; then\n restart_network\nelif [ \"$HOOK\" = \"VPP_DONE_OK\" ]; then\n undo_dns_fix\n restart_network\nelif [ \"$HOOK\" = \"VPP_ERRORED\" ]; then\n undo_dns_fix\n restart_network\nfi\n\nEOSCRIPT\n"
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_IPSEC_IKEV2_PSK="
time="2024-04-04T17:10:03Z" level=info msg="Config:CALICOVPP_NATIVE_DRIVER="
default_hook: using systemctl...
time="2024-04-04T17:10:03Z" level=info msg="No pci device for interface eth0"
time="2024-04-04T17:10:03Z" level=info msg="-- Environment --"
time="2024-04-04T17:10:03Z" level=info msg="Hugepages 0"
time="2024-04-04T17:10:03Z" level=info msg="KernelVersion 5.15.0-1042"
time="2024-04-04T17:10:03Z" level=info msg="Drivers map[uio_pci_generic:false vfio-pci:true]"
time="2024-04-04T17:10:03Z" level=info msg="initial iommu status N"
time="2024-04-04T17:10:03Z" level=info msg="-- Interface Spec --"
time="2024-04-04T17:10:03Z" level=info msg="Interface Name: eth0"
time="2024-04-04T17:10:03Z" level=info msg="Native Driver: af_packet"
time="2024-04-04T17:10:03Z" level=info msg="New Drive Name: "
time="2024-04-04T17:10:03Z" level=info msg="PHY target #Queues rx:0 tx:0"
time="2024-04-04T17:10:03Z" level=info msg="Tap MTU: 0"
time="2024-04-04T17:10:03Z" level=info msg="-- Interface config --"
time="2024-04-04T17:10:03Z" level=info msg="Node IP4: 172.10.1.5/24"
time="2024-04-04T17:10:03Z" level=info msg="Node IP6: "
time="2024-04-04T17:10:03Z" level=info msg="PciId: "
time="2024-04-04T17:10:03Z" level=info msg="Driver: "
time="2024-04-04T17:10:03Z" level=info msg="Linux IF was up ? true"
time="2024-04-04T17:10:03Z" level=info msg="Promisc was on ? false"
time="2024-04-04T17:10:03Z" level=info msg="DoSwapDriver: false"
time="2024-04-04T17:10:03Z" level=info msg="Mac: 00:22:48:c0:5e:e6"
time="2024-04-04T17:10:03Z" level=info msg="Addresses: [172.10.1.5/24 eth0,fe80::222:48ff:fec0:5ee6/64]"
time="2024-04-04T17:10:03Z" level=info msg="Routes: [{Ifindex: 2 Dst: fe80::/64 Src: <nil> Gw: <nil> Flags: [] Table: 254 Realm: 0}, {Ifindex: 2 Dst: 172.10.1.1/32 Src: 172.10.1.5 Gw: <nil> Flags: [] Table: 254 Realm: 0}, {Ifindex: 2 Dst: 172.10.1.0/24 Src: 172.10.1.5 Gw: <nil> Flags: [] Table: 254 Realm: 0}, {Ifindex: 2 Dst: 168.63.129.16/32 Src: 172.10.1.5 Gw: 172.10.1.1 Flags: [] Table: 254 Realm: 0}, {Ifindex: 2 Dst: 169.254.169.254/32 Src: 172.10.1.5 Gw: 172.10.1.1 Flags: [] Table: 254 Realm: 0}, <Dst: nil (default), Ifindex: 2, Gw: 172.10.1.1, Src: 172.10.1.5, >]"
time="2024-04-04T17:10:03Z" level=info msg="PHY original #Queues rx:64 tx:64"
time="2024-04-04T17:10:03Z" level=info msg="MTU 1500"
time="2024-04-04T17:10:03Z" level=info msg="isTunTap false"
time="2024-04-04T17:10:03Z" level=info msg="isVeth false"
time="2024-04-04T17:10:03Z" level=info msg="Running with uplink af_packet"
default_hook: using systemctl...
time="2024-04-04T17:10:03Z" level=info msg="VPP started [PID 7918]"
vpp[7918]: clib_sysfs_prealloc_hugepages:236: pre-allocating 149 additional 2048K hugepages on numa node 0
vpp[7918]: buffer: numa[0] falling back to non-hugepage backed buffer pool (vlib_physmem_shared_map_create: pmalloc_map_pages: Unable to lock pages: Cannot allocate memory)
time="2024-04-04T17:10:04Z" level=info msg="Waiting for VPP... [0/10]"
vpp[7918]: perfmon: skipping source 'intel-uncore' - intel_uncore_init: no uncore units found
vpp[7918]: tls_init_ca_chain:1086: Could not initialize TLS CA certificates
vpp[7918]: tls_openssl_init:1209: failed to initialize TLS CA chain
vpp[7918]: vat-plug/load: vat_plugin_register: idpf plugin not loaded...
vpp[7918]: vat-plug/load: vat_plugin_register: oddbuf plugin not loaded...
time="2024-04-04T17:10:06Z" level=info msg="Created AF_PACKET interface 1"
time="2024-04-04T17:10:06Z" level=info msg="tagging interface [1] with: main-eth0"
time="2024-04-04T17:10:06Z" level=info msg="Adding address 172.10.1.5/24 eth0 to uplink interface"
time="2024-04-04T17:10:06Z" level=info msg="Not adding address fe80::222:48ff:fec0:5ee6/64 to uplink interface (vpp requires /128 link-local)"
time="2024-04-04T17:10:06Z" level=info msg="Creating Linux side interface"
time="2024-04-04T17:10:06Z" level=info msg="Adding address 172.10.1.5/24 eth0 to tap interface"
time="2024-04-04T17:10:06Z" level=info msg="Not adding address fe80::222:48ff:fec0:5ee6/64 to data interface (vpp requires /128 link-local)"
time="2024-04-04T17:10:06Z" level=info msg="Adding ND proxy for address fe80::222:48ff:fec0:5ee6"
time="2024-04-04T17:10:06Z" level=info msg="Adding address 172.10.1.5/24 eth0 to tap interface"
time="2024-04-04T17:10:06Z" level=info msg="Adding address fe80::222:48ff:fec0:5ee6/64 to tap interface"
time="2024-04-04T17:10:06Z" level=warning msg="add addr fe80::222:48ff:fec0:5ee6/64 via vpp EEXIST, file exists"
time="2024-04-04T17:10:06Z" level=info msg="Adding route {Ifindex: 3 Dst: fe80::/64 Src: <nil> Gw: <nil> Flags: [] Table: 254 Realm: 0} via VPP"
time="2024-04-04T17:10:06Z" level=info msg="add route via vpp : {Ifindex: 3 Dst: fe80::/64 Src: <nil> Gw: <nil> Flags: [] Table: 254 Realm: 0} already exists"
time="2024-04-04T17:10:06Z" level=info msg="Adding route {Ifindex: 3 Dst: 172.10.1.1/32 Src: 172.10.1.5 Gw: <nil> Flags: [] Table: 254 Realm: 0} via VPP"
time="2024-04-04T17:10:06Z" level=info msg="Adding route {Ifindex: 3 Dst: 172.10.1.0/24 Src: 172.10.1.5 Gw: <nil> Flags: [] Table: 254 Realm: 0} via VPP"
time="2024-04-04T17:10:06Z" level=info msg="Adding route {Ifindex: 3 Dst: 168.63.129.16/32 Src: 172.10.1.5 Gw: 172.10.1.1 Flags: [] Table: 254 Realm: 0} via VPP"
time="2024-04-04T17:10:06Z" level=info msg="Adding route {Ifindex: 3 Dst: 169.254.169.254/32 Src: 172.10.1.5 Gw: 172.10.1.1 Flags: [] Table: 254 Realm: 0} via VPP"
time="2024-04-04T17:10:06Z" level=info msg="Adding route {Ifindex: 3 Dst: <nil> Src: 172.10.1.5 Gw: 172.10.1.1 Flags: [] Table: 254 Realm: 0} via VPP"
time="2024-04-04T17:10:06Z" level=info msg="Using 172.10.1.254 as next hop for cluster IPv4 routes"
time="2024-04-04T17:10:06Z" level=info msg="Setting BGP nodeIP 172.10.1.5/24"
time="2024-04-04T17:10:06Z" level=info msg="Updating node, version = 1741, metaversion = 1741"
default_hook: using systemctl...
default_hook: system is using systemd-networkd; restarting...
time="2024-04-04T17:10:06Z" level=info msg="Received signal child exited, vpp index 1"
time="2024-04-04T17:10:06Z" level=info msg="Ignoring SIGCHLD for pid 0"
time="2024-04-04T17:10:06Z" level=info msg="Done with signal child exited"
I just tried switching from Ubuntu 22.04 to CentOS 8, and I didn't run into the DNS resolution issue when using CentOS hosts. I noticed that CentOS uses NetworkManager by default. At this point I'm not sure what the exact root cause is, but it might be related to networking being managed by systemd-networkd, or perhaps some other default network management configuration bundled with Ubuntu.
On CentOS hosts, the /etc/resolv.conf file doesn't get edited when the calico-vpp-node pods come up.
Thanks for the details, and sorry about the missing whereabouts image - tagging it somehow got missed during the release :)
What happens is that when calico-vpp-node starts, it takes over the uplink interface and replaces it with a tap. systemd-networkd does not like this disappearance act: it triggers a reset that involves expiry of the DHCP lease (as can be seen in the logs) and in some cases also wipes out the DNS config.
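If you want to double-check what systemd-networkd thinks of the link while this happens, plain systemd tooling should be enough, e.g.:
networkctl status eth0   # shows whether systemd-networkd manages the link and its state
resolvectl status eth0   # shows the per-link DNS servers and search domains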
We have faced this issue in the past, and usually a restart of systemd-networkd has done the trick. Somehow the restart trick doesn't seem to be effective in your case; this will require some further digging. But for a quick fix, I can think of the following:
NetworkManager has a config option, dns=none, which tells it not to meddle with the DNS config at all, which means the DNS config remains intact when calico-vpp-node gets running. So if switching to NM is OK with you, you could try that.
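For reference, this is the same edit our default hook script performs via sed (see the logs above), i.e. roughly:
# /etc/NetworkManager/NetworkManager.conf
[main]
dns=none
followed by sudo systemctl restart NetworkManager.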
After the Azure instances are up and running, modify netplan to make the network config static instead of DHCP, and then start the kubeadm steps to install the cluster.
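A rough sketch of such a netplan file, with the file name hypothetical and the addresses taken from your route output above (adjust per node):
# /etc/netplan/99-static-eth0.yaml
network:
  version: 2
  ethernets:
    eth0:
      dhcp4: false
      addresses: [172.10.1.4/24]
      routes:
        - to: default
          via: 172.10.1.1
      nameservers:
        search: [abkhse5g3e5ebd4v3jenyazk4h.xx.internal.cloudapp.net]
        addresses: [168.63.129.16]
then apply it with sudo netplan apply before running kubeadm.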
Try the systemd-networkd option Unmanaged=true for the uplink interface. It seems similar to the NM dns=none, but I'm not really sure. Refer to this link: https://github.com/systemd/systemd/issues/28626
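Something along these lines should work (drop-in file name hypothetical; Unmanaged= goes in the [Link] section of a .network file, per systemd.network(5)):
# /etc/systemd/network/10-eth0-unmanaged.network
[Match]
Name=eth0

[Link]
Unmanaged=true
followed by sudo systemctl restart systemd-networkd. Note this stops systemd-networkd from configuring eth0 at all, so addressing would have to come from elsewhere.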