How To Cordon & Drain ECP Server
What is cordon and drain?
"Cordon/Drain ECMP Servers" refers to a process used in system administration, particularly in clustered or load-balanced environments using Equal-Cost Multi-Path (ECMP) routing, to safely take servers out of service for maintenance or upgrades.
Here's a breakdown of the process and what each term means:
- Cordon: This step marks a server as unavailable for new incoming connections. The server is effectively "roped off." Existing connections on the server are allowed to complete gracefully, but no new requests are routed to it. This prevents disruption to ongoing operations. Think of it like putting up a "Closed" sign at a store, allowing current customers to finish shopping but not letting any new customers enter.
- Drain: This step waits for all existing connections on the cordoned server to finish and terminate. This ensures that no active sessions are interrupted when the server is taken offline. It's like waiting for all the customers to leave the store before locking up.
- ECMP Servers: These are the servers participating in an ECMP routing setup. ECMP distributes traffic across multiple servers with equal-cost paths. Cordoning and draining are crucial in ECMP environments because they prevent abrupt service interruptions when a server is removed from the pool.
The combined process (Cordon and Drain):
- Identify the server to be taken offline.
- Cordon the server. This is often done using tools specific to the load balancing or orchestration system being used (e.g., kubectl cordon in Kubernetes, or commands within a cloud provider's console). The load balancer/router stops sending new traffic to the cordoned server.
- Drain the server. This might involve waiting for existing sessions to time out or using other mechanisms to gracefully terminate connections.
- Perform maintenance or upgrades on the server. Once the server is drained, it's safe to take it offline completely.
- Uncordon the server (after maintenance). This brings the server back into the pool, making it available for new connections again.
Why Cordon and Drain Are Important:
- Prevents service disruptions: By gracefully removing a server from service, you avoid dropping active connections.
- Ensures data consistency: Draining ensures all in-flight transactions are completed before the server goes offline.
- Facilitates zero-downtime maintenance: Combined with other techniques like rolling upgrades, cordoning and draining enable maintenance and upgrades without impacting users.
Example in Kubernetes
kubectl cordon <node_name> # Cordon the node (server)
kubectl drain <node_name> # Drain the node
# ... Perform maintenance ...
kubectl uncordon <node_name> # Uncordon the node
How to Cordon/Drain ECP Servers
- Identify the Kubernetes master node that manages the target ECP server IP.
- Go to the dev-ops repository and open the "/ansible/inventory/live/idc/cis" file. It has a [cis-master] section that contains a list of IPs.
- smc-toc into each of those IPs in turn and look for the targeted machine by running:
kubectl get node -o wide | grep <IP>
- Take the hostname of the targeted ECP server from the first column of the output and keep it for later use.
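The lookup steps above can be sketched as two small shell helpers. This is only an illustration under assumptions: the function names list_master_ips and node_name_for_ip are hypothetical, the inventory is assumed to be an INI-style Ansible file with one host per line and the IP as the first field, and the node's IP is assumed to appear as its own column in the kubectl get node -o wide output.

```shell
# Hypothetical helper: print the IPs listed under [cis-master] in an
# INI-style inventory file (path given as the first argument).
list_master_ips() {
  awk '
    /^\[/ { in_section = ($0 == "[cis-master]"); next }  # track current section
    in_section && NF && $1 !~ /^[#;]/ { print $1 }         # skip blanks/comments
  ' "$1"
}

# Hypothetical helper: read `kubectl get node -o wide` output on stdin and
# print the node name (first column) of the row that has the given IP as a
# whole field.
node_name_for_ip() {
  awk -v ip="$1" '{ for (i = 2; i <= NF; i++) if ($i == ip) { print $1; exit } }'
}

# Example usage on a master node (not executed here):
#   hostname=$(kubectl get node -o wide | node_name_for_ip "<IP>")
```

Matching on a whole field rather than with a plain grep avoids a partial match, for example a search for 10.0.0.1 also hitting the row for 10.0.0.12.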
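- Cordon: Run
kubectl cordon <hostname>
to cordon off the server (i.e. prevent any more pods from being scheduled there).
- Drain: Run
kubectl drain <hostname> --force --disable-eviction --ignore-daemonsets
to drain the server (i.e. kick out the remaining pods there). Note that --force also removes pods that are not managed by a controller, --disable-eviction deletes pods directly and so bypasses PodDisruptionBudget checks, and --ignore-daemonsets leaves DaemonSet-managed pods in place.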
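After the drain, it can be useful to confirm the state of the node before starting maintenance. The following is a minimal sketch, not part of the original procedure: the helper names is_cordoned and pods_on_node are hypothetical, and it assumes kubectl is run on a master node with access to the cluster.

```shell
# Hypothetical helper: succeed if the node is marked unschedulable (cordoned).
is_cordoned() {
  [ "$(kubectl get node "$1" -o jsonpath='{.spec.unschedulable}')" = "true" ]
}

# Hypothetical helper: list pods still scheduled on the node. After a drain
# with --ignore-daemonsets, DaemonSet-managed pods are expected to remain.
pods_on_node() {
  kubectl get pods --all-namespaces -o wide --field-selector "spec.nodeName=$1"
}

# Example usage (not executed here):
#   is_cordoned "<hostname>" && pods_on_node "<hostname>"
#   # ... perform maintenance ...
#   kubectl uncordon "<hostname>"
```

Once maintenance is finished, kubectl uncordon <hostname> makes the node schedulable again, as in the Kubernetes example earlier on this page.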