Calico: a single pod not Ready

Problem description:

On a self-managed Kubernetes cluster, checking the Calico pod status showed that the calico pod on node1 was not Ready:

[root@master ec2-user]# kubectl -n kube-system  get pod -o wide| grep cali
calico-kube-controllers-677cd97c8d-2rxtd   1/1     Running   0             18h   10.244.135.1     node3               
calico-node-jw575                          1/1     Running   0             18h   10.0.2.50        node4               
calico-node-mkcwg                          1/1     Running   0             18h   172.31.24.53     node3               
calico-node-v8zph                          1/1     Running   0             18h   10.0.0.180       master              
calico-node-xrq8m                          1/1     Running   0             18h   10.0.0.106       node2               
calico-node-xtg8k                          0/1     Running   0             16h   10.0.0.27        node1               

Analysis:

All of these pods come from the same YAML, so in theory there should be no difference between them. Recreating the pod didn't help, so the next step was to look at its events.

  Normal   Created    5m19s                  kubelet            Created container calico-node
  Warning  Unhealthy  5m17s                  kubelet            Readiness probe failed: calico/node is not ready: felix is not ready: readiness probe reporting 503
  Warning  Unhealthy  5m16s (x2 over 5m18s)  kubelet            Readiness probe failed: calico/node is not ready: BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/calico/bird.ctl: connect: connection refused
  Warning  Unhealthy  5m11s                  kubelet            Readiness probe failed: 2025-07-22 11:56:14.088 [INFO][210] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.0.0.180,10.0.0.106,172.31.24.53,10.0.2.50

  Warning  Unhealthy  5m1s  kubelet  Readiness probe failed: 2025-07-22 11:56:24.104 [INFO][230] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.0.0.180,10.0.0.106,172.31.24.53,10.0.2.50

So the core problem is that the BGP peerings are not getting established. The calico logs showed nothing obviously wrong; bird is the process that speaks BGP.

bird: Mesh_10_0_0_180: Starting
bird: Mesh_10_0_0_180: State changed to start
bird: Mesh_10_0_0_106: Starting
bird: Mesh_10_0_0_106: State changed to start
bird: Mesh_172_31_24_53: Starting
bird: Mesh_172_31_24_53: State changed to start
bird: Mesh_10_0_2_50: Starting
bird: Mesh_10_0_2_50: State changed to start

Capturing port-179 traffic on the node (e.g. with tcpdump -i eth0 port 179) showed that the TCP three-way handshake completes, but the connection is reset right after the OPEN message is sent. Notably, it is node1's calico that sends the OPEN to the other nodes, and the other nodes that reply with RST, which suggested a BGP parameter mismatch. Time to log in and take a look.
Log into the calico containers on node1 and on another node and compare the BGP state.
calico on node1:

[root@node1 /]# birdcl
BIRD v0.3.3+birdv1.6.8 ready.
bird> show protocols
name     proto    table    state  since       info
Mesh_10_0_0_180 BGP      master   down     13:15:09    active   
Mesh_10_0_0_106 BGP      master   down     13:15:09    active   
Mesh_172_31_24_53 BGP      master   down     13:15:07    active   
Mesh_10_0_2_50 BGP      master   down     13:15:08    active   

calico on another node:

Mesh_10_0_0_180 BGP      master   up     11:56:12    Established   
Mesh_172_31_24_53 BGP      master   up     11:56:13    Established   
Mesh_172_21_0_1 BGP      master   down     11:56:14    active   
Mesh_10_0_0_27 BGP      master   up     13:15:09    Established 

There is the problem: the other nodes are trying to establish BGP with 172.21.0.1. All the other peer addresses are node IPs, but node1's IP is not 172.21.0.1. The next question is where that address comes from.
Back in node1's calico logs, the answer appears: the address was autodetected from a network interface.

[INFO][8] startup/startup.go 768: Using autodetected IPv4 address on interface br-18f903dae19f: 172.21.0.1/16

node1 indeed has that interface:

[root@node1 ec2-user]# ip add show
1: lo:  mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: eth0:  mtu 9001 qdisc mq state UP group default qlen 1000
    link/ether 0a:7d:ab:6f:86:6a brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.27/24 brd 10.0.0.255 scope global dynamic eth0
       valid_lft 2040sec preferred_lft 2040sec
3: docker0:  mtu 1500 qdisc noqueue state DOWN group default 
    link/ether 02:42:9e:e6:d3:9d brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
4: tunl0@NONE:  mtu 8981 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ipip 0.0.0.0 brd 0.0.0.0
    inet 10.244.166.128/32 scope global tunl0
       valid_lft forever preferred_lft forever
388: br-cf273aa4db36:  mtu 1500 qdisc noop state DOWN group default 
    link/ether 02:42:68:c5:20:f6 brd ff:ff:ff:ff:ff:ff
    inet 172.18.0.1/16 brd 172.18.255.255 scope global br-cf273aa4db36
       valid_lft forever preferred_lft forever
389: br-3e57ff82d79a:  mtu 1500 qdisc noop state DOWN group default 
    link/ether 02:42:e3:34:ee:59 brd ff:ff:ff:ff:ff:ff
    inet 172.19.0.1/16 brd 172.19.255.255 scope global br-3e57ff82d79a
       valid_lft forever preferred_lft forever
390: br-fbeba96c8bcb:  mtu 1500 qdisc noop state DOWN group default 
    link/ether 02:42:86:31:fe:73 brd ff:ff:ff:ff:ff:ff
    inet 172.20.0.1/16 brd 172.20.255.255 scope global br-fbeba96c8bcb
       valid_lft forever preferred_lft forever
391: br-18f903dae19f:  mtu 1500 qdisc noop state DOWN group default 
    link/ether 02:42:3a:7e:04:1d brd ff:ff:ff:ff:ff:ff
    inet 172.21.0.1/16 brd 172.21.255.255 scope global br-18f903dae19f
       valid_lft forever preferred_lft forever

Solution

Reviewing the Calico manifest, no interface had ever been specified explicitly. Comparing with the other nodes, none of them have these br- interfaces; they are most likely leftovers from an earlier, incomplete flannel removal. After I deleted these interfaces (e.g. ip link delete br-18f903dae19f), calico autodetected eth0's address, the BGP sessions came up, and the pod became Ready.

But I still wanted to know why autodetection did not pick eth0, so I went through the Calico source.
The function FilteredEnumeration picks the first usable interface, and its IP, from the interface list:
https://github.com/projectcalico/calico/blob/master/node/pkg/lifecycle/startup/autodetection/filtered.go#L30

    // Find the first interface with a valid matching IP address and network.
    // We initialise the IP with the first valid IP that we find just in
    // case we don't find an IP *and* network.
    for _, i := range interfaces {
        log.WithField("Name", i.Name).Debug("Check interface")
        for _, c := range i.Cidrs {
            log.WithField("CIDR", c).Debug("Check address")
            if c.IP.IsGlobalUnicast() && matchCIDRs(c.IP, cidrs) {
                return &i, &c, nil
            }
        }
    }

So where does that interface list come from? There is a GetInterfaces function that builds the list of all interfaces:
https://github.com/projectcalico/calico/blob/master/node/pkg/lifecycle/startup/autodetection/interfaces.go

    // Loop through interfaces filtering on the regexes.  Loop in reverse
    // order to maintain behavior with older versions.
    for idx := len(netIfaces) - 1; idx >= 0; idx-- {
        iface := netIfaces[idx]
        include := (includeRegexp == nil) || includeRegexp.MatchString(iface.Name)
        exclude := (excludeRegexp != nil) && excludeRegexp.MatchString(iface.Name)
        if include && !exclude {
            if i, err := convertInterface(&iface, version); err == nil {
                filteredIfaces = append(filteredIfaces, *i)
            }
        }
    }

And there is the cause: this loop walks the interface list in reverse order, so eth0 is not the first candidate. The highest-numbered interface, br-18f903dae19f, ends up at the front of the list and its 172.21.0.1 wins.
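The combined effect of the two loops can be reproduced with a tiny self-contained sketch (my own code, not Calico's): reverse the list as interfaces.go does, then take the first global-unicast address as filtered.go does.

```go
package main

import (
	"fmt"
	"net"
)

// iface mirrors just what we need here: an interface name and one address.
type iface struct{ name, ip string }

// node1Ifaces lists node1's interfaces in the order `ip addr` reported them.
var node1Ifaces = []iface{
	{"lo", "127.0.0.1"},
	{"eth0", "10.0.0.27"},
	{"docker0", "172.17.0.1"},
	{"br-18f903dae19f", "172.21.0.1"},
}

// autodetect re-creates the two loops: reverse the list (interfaces.go),
// then return the first interface with a global-unicast IP (filtered.go).
func autodetect(ifaces []iface) iface {
	var filtered []iface
	for i := len(ifaces) - 1; i >= 0; i-- { // reverse order
		filtered = append(filtered, ifaces[i])
	}
	for _, it := range filtered {
		if net.ParseIP(it.ip).IsGlobalUnicast() {
			return it
		}
	}
	return iface{}
}

func main() {
	got := autodetect(node1Ifaces)
	// prints: autodetected: br-18f903dae19f 172.21.0.1
	fmt.Printf("autodetected: %s %s\n", got.name, got.ip)
}
```

With node1's interface order, the bridge is selected; without the br- interfaces, eth0 would have been the first match.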

If you need to pin the interface yourself, add the following environment variable to the calico-node container in the Calico manifest:

# IP automatic detection
- name: IP_AUTODETECTION_METHOD
  value: "interface=eth0"

