K8s Troubleshooting Insights: Looking into CoreDNS Issues
Introduction
Welcome to the first post of the brand new Kubernetes Troubleshooting Insights section! This series of blog posts will share helpful information and troubleshooting tips for issues that might appear in a Kubernetes environment. The posts focus on real-life scenarios from test, staging, or production environments.
In today’s blog post, we’ll explore an issue with the CoreDNS setup on RKE2 clusters. The Cilium CNI with Hubble was enabled for this setup. Let’s jump right in!
Environment Setup
+-----------------+-------------------+--------------------------+----------------+
| Cluster Name | Type | Version | OS |
+-----------------+-------------------+--------------------------+----------------+
| cluster01 | Managed Cluster | v1.28.14+rke2r1 | Ubuntu 22.04 |
+-----------------+-------------------+--------------------------+----------------+
+-------------+---------------------+
| Deployment | Version |
+-------------+---------------------+
| Cilium | v1.16.1 |
+-------------+---------------------+
Scenario
DNS issues are broad and can be caused by many different factors. Unfortunately, there is no one-size-fits-all approach when it comes to troubleshooting. In our scenario, we started with the virtual machines, ensured routing and DNS worked, and then moved to the Kubernetes layer. This blog post is an attempt to provide readers with ideas for troubleshooting DNS issues.
Background: We performed a major migration to a new instance with new DNS servers and domains. We had to test whether everything worked with the new setup. Everything appeared fine apart from synchronising ArgoCD with internal Git repositories, where an error referencing an internal Kubernetes IP address on port 53 appeared. Weird, right? We were confident the underlying virtual machines were using the correct DNS servers, and the configuration of the DHCP server had been updated.
The troubleshooting session was performed on an Ubuntu 22.04 environment. If another operating system is used, the troubleshooting methodology remains the same, although the Linux commands will differ.
Troubleshooting Steps
Step 1: Underlying Infrastructure
The steps below were performed to double-check the underlying infrastructure.
- DHCP Server (if used): Ensure the configuration points to the new DNS servers.
  $ cat /etc/dhcp/dhcpd.conf
  Check the domain-name-servers configuration (an illustrative snippet is shown right after this list).
- Cluster Node: Perform an SSH connection to one of the cluster nodes.
  - Check the DNS configuration
    $ cat /etc/resolv.conf   # Check the local DNS configuration
    $ sudo resolvectl status # Check the global and per-link DNS settings currently in effect
    From the commands above, we would like to see how the Ethernet network interface on the virtual machine resolves domains. The resolvectl status command will reveal whether the new DNS servers are in use.
  - Check routing
    $ ping test-site.example-domain.com # Check whether we can reach the custom domain
    $ ping 8.8.8.8                      # Check whether we can reach the Internet
    If one of the commands above fails, this might be an indication that routing is broken or that something else is blocking traffic.
If the commands above are successful, we continue with the next step and dive into the Kubernetes world.
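For reference, below is a minimal, hypothetical /etc/dhcp/dhcpd.conf excerpt illustrating the domain-name-servers option mentioned in the DHCP step above. The IP addresses and domain are placeholders; adjust them to your new DNS servers.

# /etc/dhcp/dhcpd.conf (illustrative excerpt, hypothetical values)
option domain-name "example-domain.com";
option domain-name-servers 192.0.2.10, 192.0.2.11;  # the new DNS servers

subnet 192.0.2.0 netmask 255.255.255.0 {
  range 192.0.2.100 192.0.2.200;
  option routers 192.0.2.1;
}

If you change this file, remember to restart the DHCP service (on Ubuntu, typically isc-dhcp-server) so clients pick up the new servers on lease renewal.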
Step 2: Kubernetes Troubleshooting
Depending on the Kubernetes environment in place, identify how DNS queries are resolved from a Kubernetes cluster point of view. In our case, the RKE2 clusters use CoreDNS.
Even from a Kubernetes point of view, we will use the well-known ping and curl commands to see what is going on. For that reason, we will deploy a dnsutils pod to perform network calls.
- Deploy the dnsutils pod to check DNS queries
$ kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: dnsutils
  namespace: default
spec:
  containers:
  - name: dnsutils
    image: registry.k8s.io/e2e-test-images/agnhost:2.39
    imagePullPolicy: IfNotPresent
  restartPolicy: Always
EOF
- Exec into the dnsutils pod and perform ping and dig to a well-known website
$ kubectl exec -it dnsutils -- /bin/sh
/ # ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8): 56 data bytes
64 bytes from 8.8.8.8: icmp_seq=0 ttl=119 time=41.526 ms
/ # dig SOA www.google.com
;; AUTHORITY SECTION:
google.com. 30 IN SOA <YOUR DNS Server>.google.com. dns-admin.google.com. 689372973 900 900 1800 60
If the above commands are successful, we know that routing and resolution work for well-known websites.
Tip: Learn more about SOA records.
- Exec into the dnsutils pod and perform curl and dig to the custom domain
$ kubectl exec -it dnsutils -- /bin/sh
/ # curl -vvv example-domain.com
* Could not resolve host: example-domain.com
* Closing connection 0
curl: (6) Could not resolve host: example-domain.com
/ # dig SOA example-domain.com
...
;; AUTHORITY SECTION:
example-domain.com. 20 IN SOA ns.dns.example-domain.com. hostmaster.example-domain.com. 1729344685 7200 1800 86400 30
/ # dig @<YOUR DNS Server> SOA example-domain.com
...
;; AUTHORITY SECTION:
example-domain.com. 3600 IN SOA <YOUR DNS SERVER NAME>.example-domain.com. mail.example-domain.com. 1729687405 604800 86400 2419200 604800
From the above, we can see that we can resolve the custom domain when we explicitly define our DNS server. However, when we do not define it, the entity responsible for answering DNS queries for the custom zone is ns.dns.example-domain.com. What is going on here? We would expect the DNS queries to be resolved by the DNS server defined on the virtual machines.
Note: It is important to check the AUTHORITY SECTION of the dig command output. SOA ns.dns.example-domain.com. signifies this is an SOA record, which provides essential information about a DNS zone. The ns.dns.example-domain.com. entry defines the primary DNS server for that domain. But why? We would expect to see one of the new DNS servers instead.
- Identify the CoreDNS deployment
$ kubectl get deploy -n kube-system
rke2-coredns-rke2-coredns              2/2     2            2           23h
rke2-coredns-rke2-coredns-autoscaler   1/1     1            1           23h
- Check the CoreDNS ConfigMap
$ kubectl get cm -n kube-system
chart-content-rke2-coredns             1      23h
rke2-coredns-rke2-coredns              1      23h
rke2-coredns-rke2-coredns-autoscaler   1      23h
$ kubectl get cm rke2-coredns-rke2-coredns -n kube-system -o jsonpath='{.data.Corefile}'
.:53 {
    errors
    health {
        lameduck 5s
    }
    ready
    kubernetes example-domain.com cluster.local cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
        ttl 30
    }
    prometheus 0.0.0.0:9153
    forward . /etc/resolv.conf
    cache 30
    loop
    reload
    loadbalance
}
- Exec into the CoreDNS deployment and cat the /etc/resolv.conf
$ kubectl exec -it deploy/rke2-coredns-rke2-coredns -n kube-system -- cat /etc/resolv.conf
nameserver <DNS01 IP>
nameserver <DNS02 IP>
If the above output returns the expected nameservers, continue with the next step.
- Analyse the CoreDNS config
The line kubernetes example-domain.com cluster.local cluster.local in-addr.arpa ip6.arpa indicates that CoreDNS is responsible for responding to DNS queries for the specified zones, including example-domain.com. Should CoreDNS be responsible, or is this just a misconfiguration?
In our case, the custom domain was included in the CoreDNS configuration by mistake. DNS queries related to the custom domain should instead be forwarded to the defined DNS servers.
If this is the case for your environment, edit the ConfigMap and remove the custom domain from the configuration. Use the commands below.
$ kubectl patch cm rke2-coredns-rke2-coredns -n kube-system --type='json' -p='[{"op": "replace", "path": "/data/Corefile", "value": ".:53 {\n errors \n health {\n lameduck 5s\n }\n ready \n kubernetes cluster.local cluster.local in-addr.arpa ip6.arpa {\n pods insecure\n fallthrough in-addr.arpa ip6.arpa\n ttl 30\n }\n prometheus 0.0.0.0:9153\n forward . /etc/resolv.conf\n cache 30\n loop \n reload \n loadbalance \n}"}]'
configmap/rke2-coredns-rke2-coredns patched
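For readability, the Corefile applied by the patch above looks as follows once unescaped. The custom domain is no longer part of the kubernetes plugin zone list, so queries for it now fall through to the forward . /etc/resolv.conf block instead:

.:53 {
    errors
    health {
        lameduck 5s
    }
    ready
    kubernetes cluster.local cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
        ttl 30
    }
    prometheus 0.0.0.0:9153
    forward . /etc/resolv.conf
    cache 30
    loop
    reload
    loadbalance
}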
$ kubectl get cm rke2-coredns-rke2-coredns -n kube-system -o jsonpath='{.data.Corefile}' # Ensure the custom domain is removed
$ kubectl exec -it dnsutils -- /bin/sh
/ # curl <domain>:<port> # You should be able to resolve the domain now
Optional: Kubernetes Troubleshooting with Cilium Hubble
Another option to troubleshoot network issues is with Hubble. If it is available as part of your installation, you can exec into the Cilium daemonset and start using the Hubble CLI to observe traffic. For example, you can use something like the commands below.
$ kubectl exec -it ds/cilium -n kube-system -- /bin/sh
# hubble observe --pod rke2-coredns-rke2-coredns-84b9cb946c-b7l9k --namespace kube-system --protocol UDP -f
The command above will display the UDP packets between the dnsutils pod and CoreDNS. The Hubble cheat sheet can be found here.
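You can also filter on the client side instead. The sketch below is one possible variation, assuming the dnsutils pod lives in the default namespace; flag behaviour may differ slightly between Hubble CLI versions:

# hubble observe --pod default/dnsutils --protocol UDP --port 53 -f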
Optional: Kubernetes Troubleshooting with a Netshoot Pod and tcpdump
If we want to see what happens with the UDP packets when we perform a curl request to a custom domain, it might be easier to run a tcpdump capture using the netshoot pod. Follow the commands below.
$ kubectl run -it --rm debug --image=nicolaka/netshoot -- /bin/bash
debug:~# tcpdump -i eth0 -n udp port 53
Once tcpdump is running, open a second terminal, exec into the same debug pod, and perform curl requests.
$ kubectl exec -it debug -- /bin/bash
debug:~# curl example-domain.com
tcpdump Output
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
16:31:52.197604 IP 10.42.2.184.49109 > 10.43.0.10.53: 59274+ [1au] A? example-domain.com.default.svc.cluster.local. (86)
16:31:52.197677 IP 10.42.2.184.49109 > 10.43.0.10.53: 134+ [1au] AAAA? example-domain.com.default.svc.cluster.local. (86)
16:31:52.198333 IP 10.43.0.10.53 > 10.42.2.184.49109: 59274 NXDomain*- 0/1/1 (179)
16:31:52.198553 IP 10.43.0.10.53 > 10.42.2.184.49109: 134 NXDomain*- 0/1/1 (179)
The output above shows the client pod (10.42.2.184) querying the cluster DNS service (10.43.0.10), and the server responding that the domain does not exist (NXDomain). This would be a hint to check the CoreDNS configuration! :)
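As a quick cross-check (a sketch, assuming 10.43.0.10 is the cluster DNS service IP seen in the capture above and <YOUR DNS SERVER IP> is the upstream DNS server), you can query both servers directly from the debug pod and compare the answers:

debug:~# dig @10.43.0.10 SOA example-domain.com           # answered by CoreDNS (cluster DNS service)
debug:~# dig @<YOUR DNS SERVER IP> SOA example-domain.com # answered by the upstream DNS server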
Resources
✉️ Contact
If you have any questions, feel free to get in touch! You can use the Discussions option found here or reach out to me on any of the social media platforms provided. 😊
We look forward to hearing from you!
Conclusions
Is it DNS in the end? That is something you will have to find out! Hopefully, this post gave you some ideas to troubleshoot DNS issues in a Kubernetes environment with confidence.
It's a wrap for this post! 🎉 Thanks for reading! Stay tuned for more exciting updates!