I have a test kubernetes cluster with 1 master and 2 nodes, running on coreos. I have Kubernetes DNS service running in the cluster and also some test pods in the cluster. I faced a strange problem, in some containers DNS was getting resolved and in some containers it was not. Checking the containers where DNS was not getting resolved, I found that these containers unable to connect the DNS pod. While checking the IPs of the pods using kubectl command, I found that even two pods are having the same IPs but they were running on different nodes, which is quite not possible in a Kubernetes cluster. So definitely there was network misconfiguration in the cluster. The pods were getting IPs which are not from the flannel service IP range, but IPs were from local docker IP range 172.17.0.0. So clearly docker is not picking up the IP range from flannel service.
Investigating the issue, I found that etcd master was not in proper start sequence and was listening to the localhost interface not on the Ethernet interface. Because of that the client etcd services on each node were unable to connect to the master etcd. As a result flannel service also error out and not started on each node.
When flannel service starts properly it creates a file /run/flannel/flannel_docker_opts.env. This file contains the host system’s docker0 network interface IP (bridge IP)
DOCKER_OPT_BIP="--bip=10.244.10.1/24"
When docker starts it reads the file
EnvironmentFile=-/run/flannel/flannel_docker_opts.env
and loads the environment variables of the file /run/flannel/flannel_docker_opts.env and configures itself accordingly.
When flannel service does not start properly, the environment variables required to start docker are not added to the file /run/flannel/flannel_docker_opts.env. Because of that the docker service was getting with the default bip 172.17.0.1.
To resolve the issue, changes were made on the master node, so that etcd master starts properly. In my cloud config file /var/lib/coreos-install/user_data I added one unit to restart the etcd service once my static network interface is configured.
- name: etcd2.service
command: restart
Now after booting, etcd was properly listening on my static IP.
Also I have edited docker service to start after the flannel service
systemctl edit docker.service
After=containerd.service docker.socket network-online.target flanneld.service flannel-docker-opts.service
Requires=containerd.service docker.socket flanneld.service
Next on each node of kubernetes cluster, I have changed the cloud config file /var/lib/coreos-install/user_data.
First I have added a script to check whether a port is open, if the port is not open then check again after 1 second. I will use this script to check whether the etcd master service is up.
write-files:
- path: /opt/bin/checkport
permissions: '0755'
content: |
#!/bin/bash
# This script waits till the port is accessible
[ -n "$1" ] && [ -n "$2" ] && while ! curl -s http://${1}:${2} > /dev/null; \
do sleep 1 && echo -n .; done;
exit $?
- name: etcd2.service
command: restart
drop-ins:
- name: 30-wait-for-server.conf
content: |
[Service]
# wait for kubernetes master to be up and ready
ExecStartPre=/opt/bin/checkport 192.168.10.75 2380
The above part is checking whether our etcd master (192.168.10.75) is up and listening to port 2380. If master etcd service is not up, then it waits for the master service.
Again edited docker service in each node to start after the flannel service
systemctl edit docker.service
After=containerd.service docker.socket network-online.target flanneld.service flannel-docker-opts.service
Requires=containerd.service docker.socket flanneld.service
After restarting everything, docker service starts picking up the bridge IP from flannel service.
Also the pods are getting the correct IP from the flannel IP range
Checklist:
1. Use ifconfig and check if flannel and docker interface IPs are in sync.
2. Check the IP subnet ranges in etcd and whether each flannel node using the correct subnet
curl http://127.0.0.1:2379/v2/keys/coreos.com/network/subnets.
3. Check the IPs of the pods.
Investigating the issue, I found that etcd master was not in proper start sequence and was listening to the localhost interface not on the Ethernet interface. Because of that the client etcd services on each node were unable to connect to the master etcd. As a result flannel service also error out and not started on each node.
When flannel service starts properly it creates a file /run/flannel/flannel_docker_opts.env. This file contains the host system’s docker0 network interface IP (bridge IP)
DOCKER_OPT_BIP="--bip=10.244.10.1/24"
When docker starts it reads the file
EnvironmentFile=-/run/flannel/flannel_docker_opts.env
and loads the environment variables of the file /run/flannel/flannel_docker_opts.env and configures itself accordingly.
When flannel service does not start properly, the environment variables required to start docker are not added to the file /run/flannel/flannel_docker_opts.env. Because of that the docker service was getting with the default bip 172.17.0.1.
To resolve the issue, changes were made on the master node, so that etcd master starts properly. In my cloud config file /var/lib/coreos-install/user_data I added one unit to restart the etcd service once my static network interface is configured.
- name: etcd2.service
command: restart
Now after booting, etcd was properly listening on my static IP.
Also I have edited docker service to start after the flannel service
systemctl edit docker.service
After=containerd.service docker.socket network-online.target flanneld.service flannel-docker-opts.service
Requires=containerd.service docker.socket flanneld.service
Next on each node of kubernetes cluster, I have changed the cloud config file /var/lib/coreos-install/user_data.
First I have added a script to check whether a port is open, if the port is not open then check again after 1 second. I will use this script to check whether the etcd master service is up.
write-files:
- path: /opt/bin/checkport
permissions: '0755'
content: |
#!/bin/bash
# This script waits till the port is accessible
[ -n "$1" ] && [ -n "$2" ] && while ! curl -s http://${1}:${2} > /dev/null; \
do sleep 1 && echo -n .; done;
exit $?
- name: etcd2.service
command: restart
drop-ins:
- name: 30-wait-for-server.conf
content: |
[Service]
# wait for kubernetes master to be up and ready
ExecStartPre=/opt/bin/checkport 192.168.10.75 2380
The above part is checking whether our etcd master (192.168.10.75) is up and listening to port 2380. If master etcd service is not up, then it waits for the master service.
Again edited docker service in each node to start after the flannel service
systemctl edit docker.service
After=containerd.service docker.socket network-online.target flanneld.service flannel-docker-opts.service
Requires=containerd.service docker.socket flanneld.service
After restarting everything, docker service starts picking up the bridge IP from flannel service.
Also the pods are getting the correct IP from the flannel IP range
Checklist:
1. Use ifconfig and check if flannel and docker interface IPs are in sync.
2. Check the IP subnet ranges in etcd and whether each flannel node using the correct subnet
curl http://127.0.0.1:2379/v2/keys/coreos.com/network/subnets.
3. Check the IPs of the pods.