My Kubernetes Networking Journey – Part 2

Service Networking

Service Proxy and Load Balancing, enabled by kube-proxy.

Kubernetes handles network traffic from Pods in different ways depending on the destination. A core concept in Kubernetes is the “Service”, which acts as a Layer 4 (L4) load balancer for Pods. Kubernetes supports multiple Service types, but the most basic is ClusterIP. This type provides a unique Virtual IP (VIP) that is routable only within the cluster. The component responsible for managing this is kube-proxy, which runs on every node and configures complex Netfilter rules to handle filtering and Network Address Translation (NAT) between Pods and Services.
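For orientation, here is what a minimal ClusterIP Service manifest might look like. The names and ports are hypothetical, loosely mirroring the octopus/octopus-web rules that appear later in this post:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: octopus-web        # hypothetical name for this example
  namespace: octopus
spec:
  type: ClusterIP          # the default: a cluster-internal VIP
  selector:
    app: octopus-web       # Pods matching this label become the endpoints
  ports:
    - name: web
      port: 80             # the Service (VIP) port
      targetPort: 8080     # the container port on each backing Pod
```

kube-proxy watches Services and Endpoints like these and translates them into the Netfilter rules described below.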

Netfilter Tables and Chains

Key points:

  • Netfilter is a packet filtering and processing framework within the Linux kernel.
  • All host traffic passes through the Netfilter framework.
  • Netfilter provides five hooking points: PRE_ROUTING, INPUT, FORWARD, OUTPUT, and POST_ROUTING.
  • These hooking points work in conjunction with other kernel networking facilities, like the kernel routing subsystem.
  • The iptables command-line tool can dynamically insert filtering rules into these hooking points.
  • With Netfilter, packets can be manipulated (accepted, redirected, dropped, modified, etc.) by combining various rules.
  • Rules in each hooking point are organized into chains, which are further grouped into tables based on their function. There are five main tables:
    • filter: For common filtering (accept, reject/drop, jump).
    • nat: For Network Address Translation (SNAT and DNAT).
    • mangle: For modifying packet attributes (e.g., TTL).
    • raw: For early processing before kernel connection tracking (conntrack).
    • security: Used by the Mandatory Access Control subsystem (e.g., SELinux).

Custom Netfilter Chains

Kubernetes creates several custom chains to manage packet filtering and NAT operations. These chains are critical for routing and managing traffic to and from Pods.

  1. KUBE-POSTROUTING (nat table):
    • The POSTROUTING chain is responsible for performing SNAT (Source Network Address Translation) just before the packet is sent out the network interface. Kubernetes hooks into it to MASQUERADE service traffic that was previously marked (0x4000) as requiring SNAT, so replies return through the node that performed the translation.
    Example:
    -A POSTROUTING -m comment --comment "Kubernetes postrouting rules" -j KUBE-POSTROUTING
    -A KUBE-POSTROUTING -m mark ! --mark 0x4000/0x4000 -j RETURN
    -A KUBE-POSTROUTING -j MARK --set-xmark 0x4000/0x0
    -A KUBE-POSTROUTING -m comment --comment "Kubernetes service traffic requiring SNAT" -j MASQUERADE
  2. KUBE-SERVICES (nat table):
    • The PREROUTING chain handles inbound traffic from both the external network and Pod network.

    • The OUTPUT chain handles outbound traffic to both the external network and Pod network.

    • Rules are added to redirect traffic to the KUBE-SERVICES chain.
    Example:
    -A PREROUTING -m comment --comment "Kubernetes service portals" -j KUBE-SERVICES
    -A OUTPUT -m comment --comment "Kubernetes service portals" -j KUBE-SERVICES
  3. KUBE-MARK-MASQ (nat table):
    • This chain adds a Netfilter mark (0x4000) to packets that require SNAT, such as traffic arriving from outside the cluster network or hairpin traffic from a Pod to itself. Marked packets are then altered in the POSTROUTING chain to use SNAT with the node’s IP as the source IP.
    Example:
    -A KUBE-MARK-MASQ -j MARK --set-xmark 0x4000/0x4000
  4. KUBE-FORWARD (filter table):
    • This chain handles forwarded packets. Only forwarded traffic (not traffic destined for or originating from the host itself) passes through here. It accepts service traffic that has been marked as requiring SNAT.
    Example:
    -A FORWARD -j KUBE-FORWARD
    -A KUBE-FORWARD -m comment --comment "Kubernetes forwarding rules" -m mark --mark 0x4000/0x4000 -j ACCEPT
  5. KUBE-MARK-DROP (nat table):
    • This chain marks packets (0x8000) for which destination NAT could not be applied. Marked packets are dropped in the KUBE-FIREWALL chain if no endpoint is available to service the request.
    Example:
    -A KUBE-MARK-DROP -j MARK --set-xmark 0x8000/0x8000
  6. KUBE-FIREWALL (filter table):
    • The INPUT chain filters incoming traffic destined for the local host, and the OUTPUT chain filters outgoing traffic. Packets marked by KUBE-MARK-DROP are discarded here if there are no available endpoints to service the request.
    Example:
    -A INPUT -j KUBE-FIREWALL
    -A OUTPUT -j KUBE-FIREWALL
    -A KUBE-FIREWALL -m comment --comment "Kubernetes firewall for dropping marked packets" -m mark --mark 0x8000/0x8000 -j DROP
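The mark manipulations in these chains follow iptables’ `--set-xmark value/mask` semantics: the bits selected by mask are zeroed, then value is XORed in. A small Python sketch (my own illustration, not kube-proxy code) shows how the 0x4000 SNAT mark is set by KUBE-MARK-MASQ and cleared again in KUBE-POSTROUTING:

```python
def set_xmark(mark: int, value: int, mask: int) -> int:
    """Apply iptables MARK --set-xmark value/mask semantics:
    zero the bits selected by mask, then XOR in value."""
    return (mark & ~mask) ^ value

SNAT_MARK = 0x4000  # the mark KUBE-MARK-MASQ applies
DROP_MARK = 0x8000  # the mark KUBE-MARK-DROP applies

# KUBE-MARK-MASQ: -j MARK --set-xmark 0x4000/0x4000  (sets the bit)
mark = set_xmark(0x0, SNAT_MARK, SNAT_MARK)
assert mark & SNAT_MARK  # packet now matches -m mark --mark 0x4000/0x4000

# KUBE-POSTROUTING: -j MARK --set-xmark 0x4000/0x0  (mask 0x0 means nothing
# is zeroed, so the XOR toggles the already-set bit off before MASQUERADE)
mark = set_xmark(mark, SNAT_MARK, 0x0)
assert mark & SNAT_MARK == 0
```

This also explains the slightly odd-looking `--set-xmark 0x4000/0x0` rule: it is how kube-proxy clears the mark so it does not leak past the NAT step.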

Service Packet Netfilter Path

The most important chains in the nat table are KUBE-SERVICES, KUBE-NODEPORTS, KUBE-SVC-*, and KUBE-SEP-* (where * is a hash-based suffix unique to each Service or endpoint).

  • KUBE-SERVICES is the entry point for service packets and matches the destination IP:port. It dispatches packets to the corresponding KUBE-SVC-* chain.
  • The KUBE-SVC-* chain acts as a load balancer and distributes packets equally across the KUBE-SEP-* chains. The number of KUBE-SEP-* chains corresponds to the number of endpoints behind the service.
  • KUBE-SEP-* chains represent Service EndPoints. They perform DNAT, replacing the Service IP:port with the Pod’s IP:port.

For example, for a ClusterIP service:

KUBE-SERVICES -> KUBE-SVC-X -> KUBE-SEP-X

For a NodePort service:

KUBE-NODEPORTS -> KUBE-SVC-X -> KUBE-SEP-X

In both cases, conntrack tracks the connection state, ensuring the destination address is correctly remembered and applied to the returning packet.
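To make the DNAT-plus-conntrack round trip concrete, here is a toy Python model (entirely my own sketch, with made-up addresses): the KUBE-SEP-* step rewrites the destination on the way out, and a conntrack-style table restores the Service address on the reply.

```python
# Toy model of DNAT + conntrack for a ClusterIP service (illustrative only).
conntrack = {}  # (client, translated dst) -> original dst

SERVICE_VIP = ("172.20.0.115", 80)   # hypothetical ClusterIP:port
ENDPOINT = ("10.179.18.8", 8080)     # hypothetical Pod endpoint

def dnat_outbound(src, dst):
    """KUBE-SEP-* step: rewrite the Service VIP to a Pod endpoint,
    remembering the original destination the way conntrack does."""
    if dst == SERVICE_VIP:
        conntrack[(src, ENDPOINT)] = dst
        return src, ENDPOINT
    return src, dst

def unnat_reply(src, dst):
    """Reply path: conntrack swaps the Pod address back to the VIP,
    so the client sees the address it originally talked to."""
    orig = conntrack.get((dst, src))
    if orig is not None:
        return orig, dst
    return src, dst

client = ("10.179.18.29", 53211)
_, dst = dnat_outbound(client, SERVICE_VIP)   # packet now targets the Pod
reply_src, _ = unnat_reply(ENDPOINT, client)  # reply appears to come from the VIP
assert reply_src == SERVICE_VIP
```

The real state machine lives in the kernel’s conntrack subsystem, of course; the point is only that the reverse translation is automatic once the first packet has been DNATed.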

Service Netfilter Rules

These rules, scoped to the nat table, are used for service types like ClusterIP and NodePort/LoadBalancer.

Example:

-A KUBE-SERVICES -d 172.20.0.115/32 -p tcp -m comment --comment "octopus/octopus-web:web cluster IP" -m tcp --dport 80 -j KUBE-SVC-A5DEC2RXT22GYINZ
-A KUBE-SERVICES -m comment --comment "Kubernetes service nodeports" -m addrtype --dst-type LOCAL -j KUBE-NODEPORTS

Netfilter’s statistic module is used for load balancing, selecting one of the available backend endpoints. For example:

-A KUBE-SVC-A5DEC2RXT22GYINZ -m statistic --mode random --probability 0.5 -j KUBE-SEP-F73MGWMXW3Q05EYA
-A KUBE-SVC-A5DEC2RXT22GYINZ -j KUBE-SEP-F6GC0KQ4JCB6DUYR

For DNAT, packets are forwarded to either endpoint 10.179.18.8:8080 or 10.179.18.29:8080.
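The random mode above still produces a uniform split: with n endpoints, rule i (0-based) is given match probability 1/(n-i), and the last rule matches unconditionally. A quick Python check (my own illustration) confirms each endpoint ends up with an equal share:

```python
from fractions import Fraction

def rule_probabilities(n: int) -> list[Fraction]:
    """Per-rule match probabilities for n endpoints:
    rule i matches with probability 1/(n-i); the last rule always matches."""
    return [Fraction(1, n - i) for i in range(n)]

def effective_share(probs: list[Fraction]) -> list[Fraction]:
    """Probability that each endpoint actually receives a packet,
    accounting for earlier rules having already matched."""
    shares, remaining = [], Fraction(1)
    for p in probs:
        shares.append(remaining * p)
        remaining *= (1 - p)
    return shares

# Two endpoints, as in the rules above: 0.5, then an unconditional rule.
print(effective_share(rule_probabilities(2)))  # [1/2, 1/2]
# Three endpoints: 1/3, then 1/2, then 1.0 -> each gets exactly 1/3.
print(effective_share(rule_probabilities(3)))  # [1/3, 1/3, 1/3]
```

That is why only the first KUBE-SVC rule above carries `--probability 0.5` while the second has no statistic match at all: it catches everything the first rule let through.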

Bonus: Hairpin NAT (Reflective Relay Mode)

A Pod can connect to its own Service IP and be load-balanced back to itself. By default, the Linux bridge implementation refuses to transmit a frame back out the port it arrived on, so this looped-back packet would never be delivered. To enable this path, also known as hairpin NAT, kube-proxy adds per-endpoint rules that masquerade traffic whose source IP matches the very endpoint it is being DNATed to.

Example:

-A KUBE-SEP-F73MGWMXW3Q05EYA -s 10.179.18.29/32 -m comment --comment "octopus/octopus-web:web" -m tcp -j KUBE-MARK-MASQ
-A KUBE-SEP-F6GC0KQ4JCB6DUYR -s 10.179.18.8/32 -m comment --comment "octopus/octopus-web:web" -m tcp -j KUBE-MARK-MASQ
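The two rules above can be read as: if the packet’s source already equals the endpoint the service chain selected, mark it for masquerade. A tiny Python sketch of that decision (my own illustration, not kube-proxy code):

```python
SNAT_MARK = 0x4000

def needs_hairpin_masq(src_ip: str, endpoint_ip: str, mark: int = 0) -> int:
    """KUBE-SEP-* hairpin rule: a packet whose source is the very endpoint
    it is being DNATed to must be SNATed; otherwise the Pod would see a
    packet arriving from its own address and the reply would never leave."""
    if src_ip == endpoint_ip:
        return mark | SNAT_MARK  # KUBE-MARK-MASQ sets the 0x4000 bit
    return mark

# Pod 10.179.18.29 talks to the service and is load-balanced back to itself:
print(hex(needs_hairpin_masq("10.179.18.29", "10.179.18.29")))  # 0x4000
# A different client hitting the same endpoint is left unmarked:
print(hex(needs_hairpin_masq("10.179.18.8", "10.179.18.29")))   # 0x0
```

After masquerading, the looped-back packet carries the node’s IP as its source, so the reply takes the normal routed path instead of tripping over the bridge’s same-port restriction.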


About the author

Simon Shakya is an Information Technology graduate with a passion for exploring the dynamic fields of software engineering, cloud infrastructure (AWS), and cybersecurity. With a strong foundation in building and automating software tools, deploying cloud-based solutions, and utilizing data analytics, Simon is dedicated to enhancing system reliability and driving innovation in the tech world. When not coding or optimizing cloud environments, you can find Simon experimenting with the latest technologies or exploring new ways to push the boundaries of what’s possible in the digital landscape.