
K8s Troubleshooting Insights: Looking into CoreDNS Issues

· 8 min read
Eleni Grosdouli
DevOps Consulting Engineer at Cisco Systems

Introduction

Welcome to the first post of the brand new Kubernetes Troubleshooting Insights section! This series of blog posts will share helpful information and troubleshooting tips for issues that might appear in a Kubernetes environment. The posts focus on real-life scenarios from test, staging and production environments.

In today’s blog post, we’ll explore an issue with the CoreDNS setup on RKE2 clusters. Cilium CNI with Hubble enabled was used for this setup. Let’s jump right in!

title image reading "It's not DNS"

Environment Setup

+-----------------+-------------------+--------------------------+----------------+
| Cluster Name | Type | Version | OS |
+-----------------+-------------------+--------------------------+----------------+
| cluster01 | Managed Cluster | v1.28.14+rke2r1 | Ubuntu 22.04 |
+-----------------+-------------------+--------------------------+----------------+


+-------------+---------------------+
| Deployment | Version |
+-------------+---------------------+
| Cilium | v1.16.1 |
+-------------+---------------------+

Scenario

DNS issues are broad and can be caused by many things. Unfortunately, there is no one-size-fits-all approach when it comes to troubleshooting. In our scenario, we started with the virtual machines, ensured routing and DNS worked, and then moved to the Kubernetes layer. This blog post is an attempt to give readers ideas for troubleshooting DNS issues.

Background: We performed a major migration to a new instance with new DNS servers and domains, and we had to verify that everything worked with the new setup. Everything appeared fine apart from ArgoCD failing to synchronise with internal Git repositories: an error pointing to an internal Kubernetes IP address on port 53 appeared. Weird, right? We were confident the underlying virtual machines were using the correct DNS servers and that the DHCP server configuration had been updated.

note

The troubleshooting session was performed on an Ubuntu 22.04 environment. If another operating system is used, the troubleshooting methodology remains the same; only the Linux commands may differ.

Troubleshooting Steps

Step 1: Underlying Infrastructure

The steps below were performed to double-check the underlying infrastructure.

  1. DHCP Server (if used): Ensure the configuration points to the new DNS servers

    $ cat /etc/dhcp/dhcpd.conf

    Check the domain-name-servers configuration.
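    For illustration, the relevant options in dhcpd.conf typically look like the snippet below; the IP addresses and domain are placeholders for your environment (they correspond to the <DNS01 IP> and <DNS02 IP> placeholders used later in this post).

    # /etc/dhcp/dhcpd.conf - hand out the new DNS servers and search domain to clients
    option domain-name-servers <DNS01 IP>, <DNS02 IP>;
    option domain-name "example-domain.com";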

  2. Cluster Node: Perform an SSH connection to one of the cluster nodes

    1. Check the DNS configuration

      $ cat /etc/resolv.conf # Check the local DNS configuration

      $ sudo resolvectl status # Check the global and per-link DNS settings currently in effect

      From the commands above, we would like to see how the Ethernet network interface on the virtual machine resolves domains. On Ubuntu 22.04, /etc/resolv.conf typically points to the local systemd-resolved stub resolver (127.0.0.53), so resolvectl status is what reveals whether the new DNS servers are actually in use.

    2. Check routing

      $ ping test-site.example-domain.com # Check whether we can reach the custom domain

      $ ping 8.8.8.8 # Check whether we can reach the Internet

      If one of the commands above fails, this might be an indication that routing is broken or that something else is blocking traffic.

      If the commands above are successful, we continue with the next step and dive into the Kubernetes world.
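Before moving on, it can also help to query the new DNS servers directly from the node and confirm they answer for the custom domain. A minimal check, assuming <DNS01 IP> is one of the new DNS servers:

$ dig @<DNS01 IP> SOA example-domain.com # The SOA record should point at the new DNS infrastructure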

Step 2: Kubernetes Troubleshooting

Depending on the Kubernetes environment in place, identify how DNS queries are resolved from a Kubernetes cluster point of view. In our case, the RKE2 clusters use CoreDNS.

From the Kubernetes point of view as well, we will use the familiar ping and curl commands to see what is going on. For that reason, we will deploy the dnsutils pod to perform network calls.
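To see which in-cluster DNS service the pods talk to, you can list the DNS-related Services in the kube-system namespace; the ClusterIP shown there is the nameserver that ends up in every pod's /etc/resolv.conf. The grep below is just a convenience and assumes the default RKE2 naming:

$ kubectl get svc -n kube-system | grep -i dns # Note the ClusterIP of the CoreDNS Service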

  1. Deploy the dnsutils pod to check DNS queries

      $ kubectl apply -f - <<EOF
      apiVersion: v1
      kind: Pod
      metadata:
        name: dnsutils
        namespace: default
      spec:
        containers:
        - name: dnsutils
          image: registry.k8s.io/e2e-test-images/agnhost:2.39
          command:
            - sleep
            - "infinity"
          imagePullPolicy: IfNotPresent
        restartPolicy: Always
      EOF
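    Before exec'ing into the pod, wait until it is up and Running, for example:

    $ kubectl wait --for=condition=Ready pod/dnsutils --timeout=60s # Block until the dnsutils pod reports Ready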
  2. Exec into the dnsutils pod - Perform ping and dig to a well-known website

      $ kubectl exec -it dnsutils -- /bin/sh

    / # ping 8.8.8.8
    PING 8.8.8.8 (8.8.8.8): 56 data bytes
    64 bytes from 8.8.8.8: icmp_seq=0 ttl=119 time=41.526 ms

    / # dig SOA www.google.com
    ;; AUTHORITY SECTION:
    google.com. 30 IN SOA <YOUR DNS Server>.google.com. dns-admin.google.com. 689372973 900 900 1800 60

    If the above commands are successful, we know that routing and resolution work for well-known websites.

    tip

    Learn more about SOA.
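    It is also worth checking which nameserver the pod itself uses. Inside the pod, /etc/resolv.conf points at the cluster DNS Service ClusterIP together with the cluster search domains. The values below are illustrative and match the IPs seen in the tcpdump output later in this post:

    / # cat /etc/resolv.conf
    search default.svc.cluster.local svc.cluster.local cluster.local
    nameserver 10.43.0.10
    options ndots:5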

  3. Exec into the dnsutils pod - Perform curl and dig to the custom domain

      $ kubectl exec -it dnsutils -- /bin/sh
    / # curl -vvv example-domain.com
    * Could not resolve host: example-domain.com
    * Closing connection 0
    curl: (6) Could not resolve host: example-domain.com

    / # dig SOA example-domain.com
    ...
    ;; AUTHORITY SECTION:
    example-domain.com. 20 IN SOA ns.dns.example-domain.com. hostmaster.example-domain.com. 1729344685 7200 1800 86400 30

    / # dig @<YOUR DNS SERVER IP> SOA example-domain.com
    ...
    ;; AUTHORITY SECTION:
    example-domain.com. 3600 IN SOA <YOUR DNS SERVER NAME>.example-domain.com. mail.example-domain.com. 1729687405 604800 86400 2419200 604800

    From the above, we can see that we can resolve the custom domain when we explicitly specify our DNS server. But when we do not specify it, the entity responsible for answering DNS queries for the custom zone is reported as ns.dns.example-domain.com. What is going on here? We would expect the DNS queries to be resolved by the DNS servers defined on the virtual machines.

    note

    It is important to check the AUTHORITY SECTION of the dig command output. The SOA record provides essential information about a DNS zone, and ns.dns.example-domain.com. is listed as the primary DNS server for that domain. But why? We would expect to see one of the new DNS servers instead. The ns.dns.<zone> naming is a strong hint that CoreDNS itself is answering authoritatively for the zone, as we confirm below.
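    A quick way to confirm who produced that answer is the SERVER line at the end of the dig output: without an explicit @server, the query goes to the pod's default nameserver, i.e. the cluster DNS Service. Illustrative output, assuming the ClusterIP from this environment:

    / # dig SOA example-domain.com | grep SERVER
    ;; SERVER: 10.43.0.10#53(10.43.0.10)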

  4. Identify the CoreDNS deployment

    $ kubectl get deploy -n kube-system
    rke2-coredns-rke2-coredns 2/2 2 2 23h
    rke2-coredns-rke2-coredns-autoscaler 1/1 1 1 23h
  5. Check the CoreDNS ConfigMap

    $ kubectl get cm -n kube-system
    chart-content-rke2-coredns 1 23h
    rke2-coredns-rke2-coredns 1 23h
    rke2-coredns-rke2-coredns-autoscaler 1 23h

    $ kubectl get cm rke2-coredns-rke2-coredns -n kube-system -o jsonpath='{.data.Corefile}'
    .:53 {
        errors
        health {
            lameduck 5s
        }
        ready
        kubernetes example-domain.com cluster.local cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
            ttl 30
        }
        prometheus 0.0.0.0:9153
        forward . /etc/resolv.conf
        cache 30
        loop
        reload
        loadbalance
    }
  6. Exec into the CoreDNS deployment and cat /etc/resolv.conf

    $ kubectl exec -it deploy/rke2-coredns-rke2-coredns -n kube-system -- cat /etc/resolv.conf
    nameserver <DNS01 IP>
    nameserver <DNS02 IP>

    If the above output returns the expected nameservers, continue with the next step.

  7. Analyse the CoreDNS config

    The line kubernetes example-domain.com cluster.local cluster.local in-addr.arpa ip6.arpa indicates that CoreDNS is responsible for responding to DNS queries for the specified zones, including example-domain.com. Should CoreDNS be responsible for that domain, or is this just a misconfiguration?

In our case, the custom domain was included in the CoreDNS configuration by mistake. DNS queries for the custom domain should instead be forwarded to the defined DNS servers (via forward . /etc/resolv.conf).

If this is the case for your environment, edit the ConfigMap and remove the custom domain from the configuration. Use the commands below.

$ kubectl patch cm rke2-coredns-rke2-coredns -n kube-system --type='json' -p='[{"op": "replace", "path": "/data/Corefile", "value": ".:53 {\n    errors \n    health  {\n        lameduck 5s\n    }\n    ready \n    kubernetes cluster.local cluster.local in-addr.arpa ip6.arpa {\n        pods insecure\n        fallthrough in-addr.arpa ip6.arpa\n        ttl 30\n    }\n    prometheus 0.0.0.0:9153\n    forward . /etc/resolv.conf\n    cache 30\n    loop \n    reload \n    loadbalance \n}"}]'

configmap/rke2-coredns-rke2-coredns patched

$ kubectl get cm rke2-coredns-rke2-coredns -n kube-system -o jsonpath='{.data.Corefile}' # Ensure the custom domain is removed

$ kubectl exec -it dnsutils -- /bin/sh
/ # curl <domain>:<port> # You should be able to resolve the domain now
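Since the Corefile above includes the reload plugin, CoreDNS should pick up the ConfigMap change automatically after a short interval. If the old behaviour persists, you can force a restart of the deployment; the name below is the RKE2 deployment identified earlier.

$ kubectl rollout restart deploy/rke2-coredns-rke2-coredns -n kube-system # Restart CoreDNS to reload the Corefile immediately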

Optional: Kubernetes Troubleshooting with Cilium Hubble

Another option to troubleshoot network issues is with Hubble. If it is available as part of your installation, you can exec into the Cilium daemonset and start using the Hubble CLI to observe traffic. For example, you can use something like the below.

$ kubectl exec -it ds/cilium -n kube-system -- /bin/sh
# hubble observe --pod rke2-coredns-rke2-coredns-84b9cb946c-b7l9k --namespace kube-system --protocol UDP -f

The command above will display the UDP packets flowing to and from the CoreDNS pod, including the DNS queries from the dnsutils pod. The Hubble cheat sheet can be found here.
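If you prefer to filter from the client side instead, something like the command below should also work; it assumes the dnsutils pod created earlier in the default namespace.

# hubble observe --from-pod default/dnsutils --protocol UDP -f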

Optional: Kubernetes Troubleshooting with a Netshoot Pod and tcpdump

If we want to see what happens to the UDP packets when we perform a curl request against the custom domain, it might be easier to run tcpdump from a netshoot pod. Follow the commands below.

$ kubectl run -it --rm debug --image=nicolaka/netshoot -- /bin/bash
debug:~# tcpdump -i eth0 -n udp port 53

Once tcpdump is running, open a second terminal, exec into the same netshoot pod and perform curl requests. (The capture runs inside the netshoot pod's network namespace, so the requests need to originate from that pod.)

$ kubectl exec -it debug -- /bin/bash
debug:~# curl example-domain.com

tcpdump Output

tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
16:31:52.197604 IP 10.42.2.184.49109 > 10.43.0.10.53: 59274+ [1au] A? example-domain.com.default.svc.cluster.local. (86)
16:31:52.197677 IP 10.42.2.184.49109 > 10.43.0.10.53: 134+ [1au] AAAA? example-domain.com.default.svc.cluster.local. (86)
16:31:52.198333 IP 10.43.0.10.53 > 10.42.2.184.49109: 59274 NXDomain*- 0/1/1 (179)
16:31:52.198553 IP 10.43.0.10.53 > 10.42.2.184.49109: 134 NXDomain*- 0/1/1 (179)

The output above shows the client (10.42.2.184) querying the cluster DNS Service (10.43.0.10), which responds with NXDomain, i.e. the name does not exist. Note that the queried names carry the pod's search suffix (example-domain.com.default.svc.cluster.local) because of the resolver's ndots setting; either way, the answers come from CoreDNS rather than from the upstream DNS servers. This would be a hint to check the CoreDNS configuration! :)


✉️ Contact

If you have any questions, feel free to get in touch! You can use the Discussions option found here or reach out to me on any of the social media platforms provided. 😊

We look forward to hearing from you!

Conclusions

Is it DNS in the end? That is something you will have to find out! Hopefully, this post gave you some ideas to troubleshoot DNS issues in a Kubernetes environment with confidence.

It's a wrap for this post! 🎉 Thanks for reading! Stay tuned for more exciting updates!