
I have two EKS clusters in two different AWS accounts, presumably behind different firewalls (which I don't have access to). The first one (Dev) works fine; however, with the same configuration, pods in the UAT cluster struggle to resolve DNS. The nodes themselves can resolve names and seem to be fine.

1) ping 8.8.8.8 works

--- 8.8.8.8 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3003ms

2) I can ping the IP addresses of Google (and others), but not the actual DNS names.

Our configuration:

  1. configured with Terraform.
  2. The worker node and control plane SGs are the same as the Dev ones. I believe those are fine.
  3. Added 53 TCP and 53 UDP on the inbound + outbound NACL (just to be sure 53 was really open...). Added 53 TCP and 53 UDP outbound from the worker nodes.
  4. We are using ami-059c6874350e63ca9 with Kubernetes version 1.14.

I am unsure whether the problem is a firewall somewhere, CoreDNS, my configuration that needs to be updated, or a "stupid mistake". Any help would be appreciated.

shrimpy
  • There are a lot of variables in your case; do you mind sharing your Terraform script? Don't forget to remove sensitive data. I'd also need to read the YAML of your services; if you don't have it, please run `kubectl get services -o yaml` to export it and paste it into your question. – Will R.O.F. Jan 10 '20 at 10:03

2 Answers


After days of debugging, here is what the problem was: I had allowed all traffic between the nodes, but that "all traffic" rule covered TCP only, not UDP.

The fix was basically one line in AWS: in the worker nodes SG, add an inbound rule from/to the worker nodes for port 53, protocol DNS (UDP).

If you use Terraform, it should look like this:

# Allow DNS queries over UDP between worker nodes (pods -> CoreDNS pods on other nodes)
resource "aws_security_group_rule" "eks-node-ingress-cluster-dns" {
  description              = "Allow pods DNS"
  from_port                = 53
  protocol                 = 17 # IP protocol number 17 = UDP
  security_group_id        = "${aws_security_group.SG-eks-WorkerNodes.id}"
  source_security_group_id = "${aws_security_group.SG-eks-WorkerNodes.id}"
  to_port                  = 53
  type                     = "ingress"
}
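
If the original intent was "all traffic between the nodes", a broader alternative is a single self-referencing rule that covers every protocol rather than TCP alone. This is only a sketch: the resource name `eks-node-ingress-self-all` is hypothetical, and it assumes the same `SG-eks-WorkerNodes` security group as above. It is more permissive than the single DNS rule, so whether it fits depends on your security requirements.

```hcl
# Sketch (hypothetical resource name): allow all protocols between worker nodes.
resource "aws_security_group_rule" "eks-node-ingress-self-all" {
  description       = "Allow worker nodes to reach each other on all protocols"
  type              = "ingress"
  protocol          = "-1" # all protocols (TCP, UDP, ICMP, ...)
  from_port         = 0    # port range is not used when protocol is -1
  to_port           = 0
  security_group_id = "${aws_security_group.SG-eks-WorkerNodes.id}"
  self              = true # source is this same security group
}
```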
shrimpy

Note that this issue may present itself in many forms (DNS not resolving is just one possible case). The terraform-aws-eks module exposes a Terraform input that creates the security group rules needed for this communication between worker groups/node groups: worker_create_cluster_primary_security_group_rules. More information in this terraform-aws-eks issue: https://github.com/terraform-aws-modules/terraform-aws-eks/issues/1089
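
As a rough sketch, enabling the input could look like the module block below. Only `worker_create_cluster_primary_security_group_rules` comes from this answer; the other input names (`cluster_name`, `vpc_id`, `subnets`) and all their values are assumptions based on module releases from around the time of the linked issue, so check them against the version you actually pin.

```hcl
# Sketch only: every value except the last input is a hypothetical placeholder.
module "eks" {
  source = "terraform-aws-modules/eks/aws"
  # Pin a module version that exposes the input below (see the module changelog).

  cluster_name = "uat-cluster"                  # hypothetical
  vpc_id       = "vpc-00000000000000000"        # hypothetical
  subnets      = ["subnet-aaaa", "subnet-bbbb"] # hypothetical

  # Creates the ingress rules between the worker security group
  # and the cluster primary security group.
  worker_create_cluster_primary_security_group_rules = true
}
```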

When the input is enabled, Terraform plans the following security group rules:

  # module.eks.module.eks.aws_security_group_rule.cluster_primary_ingress_workers[0] will be created
  + resource "aws_security_group_rule" "cluster_primary_ingress_workers" {
      + description              = "Allow pods running on workers to send communication to cluster primary security group (e.g. Fargate pods)."
      + from_port                = 0
      + id                       = (known after apply)
      + protocol                 = "-1"
      + security_group_id        = "sg-03bb33d3318e4aa03"
      + self                     = false
      + source_security_group_id = "sg-0fffc4d49a499a1d8"
      + to_port                  = 65535
      + type                     = "ingress"
    }

  # module.eks.module.eks.aws_security_group_rule.workers_ingress_cluster_primary[0] will be created
  + resource "aws_security_group_rule" "workers_ingress_cluster_primary" {
      + description              = "Allow pods running on workers to receive communication from cluster primary security group (e.g. Fargate pods)."
      + from_port                = 0
      + id                       = (known after apply)
      + protocol                 = "-1"
      + security_group_id        = "sg-0fffc4d49a499a1d8"
      + self                     = false
      + source_security_group_id = "sg-03bb33d3318e4aa03"
      + to_port                  = 65535
      + type                     = "ingress"
    }
fvdnabee
  • Great to see an answer using the Terraform EKS module. Interesting that `worker_create_cluster_primary_security_group_rules` is disabled by default. That and `enable_irsa` both need to be enabled if you wish to use the AWS ingress controller and connect to an RDS endpoint. – Wilhelm May 27 '21 at 01:24