Monday, December 31, 2018

Kubernetes: Container Orchestration at work

In the previous post we discussed the benefits of containerization over virtualization, using the Docker containerization platform and Docker Swarm orchestration of containers. Containers package the application and isolate it from the host, making it more reliable and scalable. However, after scaling up to, say, 1000 containers or 500 services, container deployment, management, load balancing and peer-to-peer communication become a daunting task. Container orchestration automates the deployment, management, scaling, networking and availability of the containers, and hence becomes a necessity when operating at such a scale. It automates the arrangement, coordination and management of software containers. Kubernetes is currently the most widely adopted platform for container orchestration.

Kubernetes is an open-source platform for automating deployment, scaling, and operations of application containers across clusters of hosts, providing container-centric infrastructure. It manages containerized workloads and services, and facilitates both declarative configuration and automation. Kubernetes uses a declarative description of the desired state, in the form of configuration, to manage scheduling, (re)provisioning of resources and running of containers. It is very portable, configurable and modular, and provides features such as auto-scaling, auto-placement, auto-restart, auto-replication, load balancing and auto-healing of containers. Kubernetes can group a number of containers into one logical unit for managing and deploying an application or service. Kubernetes is in principle container-runtime agnostic, but it mostly runs Docker containers. It was developed by Google, which later donated it to the Cloud Native Computing Foundation, which currently maintains it. Kubernetes has a large and growing online community, and KubeCon conferences are held around the world.

Some of the key features of Kubernetes are as below:
  • Automatic Bin Packing: It packages the application and automatically places containers based on their resource requirements and the resources available on each node.
  • Service Discovery and Load Balancing: It provides the ability to discover services and distribute traffic across the worker nodes.
  • Storage Orchestration: It automatically mounts external storage systems or volumes.
  • Self-Healing: Whenever a container fails, it automatically creates a new container in its place. When a node fails, Kubernetes recreates and runs all the containers from the failed node on different nodes.
  • Secret and Configuration Management: It deploys or updates secrets and application configuration without rebuilding the entire image or restarting the running containers.
  • Batch Execution: It manages batch and CI workloads, replacing containers that fail.
  • Horizontal Scaling: It provides simple CLI commands to scale applications up or down based on load (see the example after this list).
  • Automatic Rollouts and Rollbacks: It progressively rolls out updates or changes to the application or its configuration, ensuring that individual instances are updated one after the other. When things go wrong, Kubernetes rolls back the change, returning the running containers to their previous state.
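
For example, a deployment can be scaled manually or have an autoscaler attached to it. A minimal sketch, assuming a deployment named my-app already exists (the name is hypothetical):

$ kubectl scale deployment my-app --replicas=10
$ kubectl autoscale deployment my-app --min=2 --max=10 --cpu-percent=80

The second command creates a HorizontalPodAutoscaler that keeps the replica count between 2 and 10 based on CPU utilization.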

Comparison with Docker Swarm

Docker Swarm is easy to set up and requires only a few commands to configure and run a cluster. Kubernetes setup is more complex and requires more commands to be executed to set up a cluster, but the resulting cluster is more customizable, stable and reliable. Scaling up using Docker Swarm is faster than with Kubernetes, as Docker Swarm is the native orchestrator for running Docker containers. Docker Swarm also has built-in automated load balancing, whereas Kubernetes requires manual service configuration. In Kubernetes, data volumes can only be shared among containers within the same Pod, while Docker Swarm allows volumes to be shared with any other Docker container. Kubernetes provides built-in tools for logging and monitoring, while Docker Swarm relies on external third-party tools. Kubernetes also provides a GUI dashboard for deploying applications, along with the command line interface. Like Docker Swarm, Kubernetes schedules pods so as to keep services available while they are being updated, but it also provides rollback functionality in case of failure during the update.

Kubernetes Concepts


CLUSTER: A cluster is a set of nodes with at least one master node and several worker nodes.

NODE: A node is a worker machine (physical or virtual) in the cluster.

PODS: Pods are the basic scheduling unit and consist of one or more containers guaranteed to be co-located on the host machine and able to share resources. The co-located group of containers within a pod shares an IP address, port space, namespaces, cgroups and storage volumes. The containers within a pod are always scheduled together and share the same context and lifecycle (start or stop). Each pod is assigned a unique IP address within the cluster, allowing the application to use ports without conflict. The desired state of the containers within a pod is described through a YAML or JSON object called a PodSpec. These objects are passed to the kubelet through the API server. Pods are mortal, i.e. they are not resurrected after their death. Like docker containers, they are also ephemeral, i.e. when a pod dies, another pod comes back up, possibly on another host. The logical collection of containers within a Pod interact with each other to execute a service. A pod corresponds to a single instance of the service. A sketch of a minimal pod manifest is below.
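
A minimal PodSpec sketch; the names and image here are hypothetical:

apiVersion: v1
kind: Pod
metadata:
  name: my-app
  labels:
    app: my-app
spec:
  containers:
  - name: my-app-container
    image: nginx:1.15        # hypothetical image
    ports:
    - containerPort: 80      # port the container listens on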

SERVICE: A Service is an abstraction which defines a logical set of Pods and a policy by which to access them. It is a REST object, similar to a Pod, and requires a service definition to be POSTed to the API server in order to create a new instance. It provides access to dynamic pods using labels and load balances traffic across the nodes. The Label Selector determines the set of Pods targeted by a Service. Each Service is assigned a unique IP address, also known as the clusterIP, which is tied to the lifespan of the Service. A Service provides a stable endpoint for referencing the pods. Since Pods are created and destroyed dynamically, their IP addresses cannot be relied upon to be stable over time. Hence external clients rely on the service abstraction to provide reliable access, decoupled from the actual Pods. All communication to the service is automatically load balanced across the member pods of the service. Kubernetes offers native applications a simple Endpoints API that is updated whenever the set of Pods in a Service changes. For non-native applications, Kubernetes offers a virtual-IP-based bridge to Services which redirects to the backend Pods. There are 4 types of service, as below (a sketch of a Service manifest follows the list):
  1. ClusterIP: Exposes the Service on an internal IP in the cluster. This type of service is only reachable from within the cluster. This is the default type.
  2. NodePort: Exposes the Service on each Node's IP at a static port. It uses the same port on each selected Node in the cluster using NAT. It is accessible from outside the cluster using <NodeIP>:<NodePort>. A superset of ClusterIP.
  3. LoadBalancer: It creates an external load balancer in the cloud and assigns a fixed, external IP to the Service. It is a superset of the NodePort service type.
  4. ExternalName: Exposes the Service using an arbitrary name by returning a CNAME record with that name. It does not use any proxy.
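
A minimal sketch of a ClusterIP Service selecting the hypothetical app: my-app pods from the earlier example:

apiVersion: v1
kind: Service
metadata:
  name: my-app-service
spec:
  type: ClusterIP      # the default; could be NodePort, LoadBalancer or ExternalName
  selector:
    app: my-app        # label selector determining the member pods
  ports:
  - port: 80           # port exposed on the clusterIP
    targetPort: 80     # port on the backing pods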

NAMESPACE: Namespaces are a way to divide cluster resources between multiple users spread across multiple teams. Names of resources need to be unique within a namespace, but not across namespaces. Namespaces provide logical separation between the teams and their environments, acting as virtual clusters.
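
For example, a namespace can be created and then used to scope commands; team-a is a hypothetical name:

$ kubectl create namespace team-a
$ kubectl get pods -n team-a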

LABELS: Labels are key-value pairs used to identify objects such as pods and deployments (services) with specific attributes. They can be used to organize and to select subsets of objects. Each label key must be unique for a given object. Labels can be attached to an object at creation time and added or modified at run time. They make it possible to distinguish resources within the same namespace.

LABEL SELECTORS: Labels are not unique; multiple objects can carry the same label within the same namespace. The label selector is the core grouping primitive in Kubernetes and allows users to identify/select a set of objects. A selector can be made up of multiple comma-separated requirements, which act as an AND operator. The Kubernetes API currently supports two types of selectors (see the example queries after this list).
  • Equality-based Selectors: They allow filtering by key and value. Matching objects must satisfy all the specified label constraints; three kinds of operators are allowed: =, ==, !=.
  • Set-based Selectors: They allow filtering of keys according to a set of values. Three kinds of operators are supported: in, notin and exists (applied to the key identifier only).
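
A few example selector queries using kubectl; the label keys and values are hypothetical:

$ kubectl get pods -l environment=production               # equality-based
$ kubectl get pods -l 'environment in (production, qa)'    # set-based
$ kubectl get pods -l 'tier,environment notin (qa)'        # exists combined with notin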

REPLICA SET: A ReplicaSet manages the lifecycle of pods and ensures that the specified number are running. It creates and destroys Pods dynamically, especially while scaling out or in. The ReplicaSet is the next-generation replication controller and also supports set-based selectors, whereas the replication controller only supports equality-based selectors. It is similar to services in docker swarm. A sketch of a ReplicaSet manifest is below.
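
A sketch of a ReplicaSet using both equality-based (matchLabels) and set-based (matchExpressions) selectors; the names, labels and image are hypothetical:

apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: my-app-rs
spec:
  replicas: 3                  # desired number of pods
  selector:
    matchLabels:
      app: my-app              # equality-based requirement
    matchExpressions:
    - {key: environment, operator: In, values: [production, qa]}   # set-based requirement
  template:
    metadata:
      labels:
        app: my-app
        environment: production
    spec:
      containers:
      - name: my-app-container
        image: nginx:1.15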

DEPLOYMENT: Deployments control the number of pod instances running, using ReplicaSets, and upgrade them in a controlled manner. They have the capability to update the ReplicaSet and to roll back to a previous version. The deployment controller allows an ongoing deployment to be updated, paused or cancelled before completion, or rolled back entirely midway. The Deployment controller drains and terminates a given number of replicas, creates replicas from the new deployment definition, and continues the process until all replicas in the deployment are updated. A Deployment is represented and defined by a YAML file. A deployment can be carried out either using Recreate, i.e. killing all the existing pods and then bringing up new ones, or as a Rolling Update, gradually bringing down the old pods and bringing up the new ones. By default, deployments are performed as a rolling update. Deployments facilitate scaling (by updating the number of replicas), rolling updates, rollbacks, version updates (image updates), Pod health checks and healing. A sketch of a Deployment manifest is below.
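
A sketch of a Deployment with an explicit rolling update strategy; the names and image are hypothetical:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app-deployment
spec:
  replicas: 3
  strategy:
    type: RollingUpdate        # the default; Recreate is the alternative
    rollingUpdate:
      maxUnavailable: 1        # at most one pod down during the update
      maxSurge: 1              # at most one extra pod during the update
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app-container
        image: nginx:1.15      # bumping this tag triggers a rolling update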

VOLUME: A volume is a directory which is accessible by the containers within a pod and provides persistent storage for the pod. The lifespan of a volume in Kubernetes is the same as that of the Pod enclosing it. A volume outlives any containers that run within the Pod, and data is preserved across container restarts. A pod can use any number of volumes simultaneously. The Pod specifies the volumes to use in the .spec.volumes field and mounts them into its containers. Every container in the Pod independently specifies the path at which to mount each volume. A few of the volume types are: the hostPath volume, which mounts a file or directory from the node's filesystem into the Pod; the emptyDir volume, which is created when a Pod is assigned to a node and exists as long as the Pod runs on that node; and the secret volume, which is used to pass sensitive information to the pods. The gcePersistentDisk volume mounts a Google Compute Engine (GCE) Persistent Disk on the pod with pre-populated data, and its contents are preserved even after the pod is removed from the node.

In contrast to regular volumes, a Kubernetes Persistent Volume is used to retain data even after the pod dies. Persistent volumes are administrator-provisioned volumes created with a particular filesystem, size, and identifying characteristics such as volume IDs and names. Using a persistent volume involves first provisioning network storage, then creating a persistent volume claim for a storage volume, and finally using the claimed persistent volume. The persistent volume claim is referenced in the spec of a pod in order for the containers in the pod to use the volume, as in the sketch below.
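
A sketch of a persistent volume claim and a pod mounting it; the names and size are hypothetical, and a persistent volume (or a dynamic provisioner) satisfying the claim is assumed to exist:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-app-claim
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi              # requested volume size
---
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
  - name: my-app-container
    image: nginx:1.15
    volumeMounts:
    - mountPath: /data          # path inside the container
      name: my-volume
  volumes:
  - name: my-volume
    persistentVolumeClaim:
      claimName: my-app-claim   # references the claim above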

MASTER: The API Server, etcd, Controller Manager and Scheduler processes together make up the central control plane of the cluster, which runs on the master node. The control plane provides a unified view of the cluster. By default, the master node does not run application containers.

WORKER: A worker is a Docker host running the kubelet (node agent) and proxy services. It runs pods and their containers.


Kubernetes Architecture


Master Components

Master components provide the cluster's control plane and are responsible for global decisions about the cluster, such as scheduling, as well as for detecting and responding to cluster events.

API Server: The API server exposes the Kubernetes API and is the entry point for all the REST commands used to control the cluster. It processes REST requests, validates them, and executes the bound business logic. It relies on etcd to store the resulting state. kubectl (the Kubernetes CLI) makes requests to the Kubernetes API server.

etcd Storage: etcd is a simple, distributed, consistent key-value store for all cluster data. It is mainly used for shared configuration and service discovery. It provides a REST API for CRUD operations as well as an interface to register watchers on specific nodes, which enables a reliable way to notify the rest of the cluster about configuration changes. Examples of data stored by Kubernetes in etcd are the jobs being scheduled, created and deployed, pod/service details and state, namespaces and replication information, etc.

Scheduler: The scheduler is responsible for the deployment of configured pods and services onto the nodes. It selects a node for the execution of newly created pods which have no assigned node. It has information about the resources available on the members of the cluster, i.e. the nodes, as well as those required by the configured service, and hence is able to decide where to deploy a specific service. Factors taken into account for scheduling decisions include individual and collective resource requirements, hardware/software/policy constraints, affinity and anti-affinity specifications, data locality, inter-workload interference and deadlines.

Controller Manager: The controller manager is a daemon which runs the different embedded controllers on the master node. Each controller is logically a separate process, even though all controllers run as part of a single process. It uses the API server to watch the shared state of the cluster and makes corrective changes to move the current state towards the desired one. Below are some of the controllers executed by the controller manager:
  • Node Controller: Responsible for noticing and responding when nodes go down.
  • Replication Controller: It maintains the correct number of running pods based on the configured replication factor, recreating any failed pods and removing extra-scheduled ones.
  • Endpoints Controller: It populates the Endpoints object. For headless services (no load balancer) with selectors, it creates Endpoints records in the API and modifies the DNS configuration to return addresses that point directly to the pods backing the service.
  • Token Controller: Creates default accounts and API access tokens for new namespaces. It manages the creation and deletion of ServiceAccounts and their associated ServiceAccountToken Secrets that allow API access.

Node Components

Node components run on every node, maintain running pods and provide them the Kubernetes runtime environment.

Kubelet: Each worker node runs an agent process called the kubelet, which is responsible for managing the state of the node: starting, stopping, and maintaining application containers based on instructions from the API server. It gets the configuration of a pod from the API server and ensures that the described containers are up and running. It is the worker service which communicates with the API server to get information about services and to write details about newly created ones. All containers on the host node are run through the kubelet. It monitors the health of its worker node and reports it to the API server.

Kube Proxy: kube-proxy is a network proxy and load balancer for services on a single worker node. It handles requests from outside the cluster by enabling network routing for TCP and UDP packets. kube-proxy enables the Kubernetes service abstraction by maintaining network rules on the host and performing connection forwarding.

Kubectl: It is the default command line tool for communicating with the API server and sending commands to the master node. It is also used to enable the kube dashboard, which allows containers to be deployed and run through a GUI.


Monitoring Kubernetes

Kubernetes has built-in monitoring tools to check the health of individual nodes: cAdvisor and Heapster. cAdvisor is an open source container resource usage collector. It operates at the node level in kubernetes. It auto-discovers all containers on the given node and collects CPU, memory, filesystem, and network usage statistics. cAdvisor also provides the overall machine usage by analyzing the 'root' container on the machine. cAdvisor provides the basic resource utilization for the node, but is unable to determine the resource utilization of the individual applications running within the containers. Heapster is another tool, which aggregates monitoring data from the cAdvisors across all nodes in the Kubernetes cluster. It runs as a pod in the cluster, like any other application. The Heapster pod discovers all the nodes in the cluster and then pulls metrics by querying usage information from the kubelet of each node (which in turn queries cAdvisor), aggregates the metrics by pod and label, and reports them to a monitoring service or storage backend. Heapster makes it easy to collect data from the kubernetes cluster via cAdvisor, but does not provide built-in storage. It can integrate with InfluxDB or Google Cloud for storage, and UI tools such as Grafana can be used for data visualization.

Kubernetes Networking Internals

A kubernetes pod consists of one or more containers that are collocated on the same host and are configured to share a network, where all containers in the pod can reach each other using localhost. A typical docker container has a virtual network interface veth0 which is attached to the bridge docker0 using a pair of linked virtual ethernet devices. The bridge docker0 is in turn attached to a physical network interface eth0. The veth0 interface lives in the container's network namespace, while docker0 and eth0 live in the root network namespace. When a second container starts with the first container's network namespace shared, it uses the existing interface veth0 instead of getting a new virtual network interface. Now both containers are reachable at veth0's IP address (172.17.0.2), and each can hit ports opened by the other on localhost. Kubernetes implements this by creating a special container (started with the pause command) for each pod, whose main purpose is to provide the virtual network interface that all the other containers use to communicate with each other and with the outside world. The pause command suspends the current process until a signal is received, so these containers do nothing at all except sleep until kubernetes sends them SIGTERM. The local routing rules set up during bridge creation allow any packet arriving at eth0 with a destination address of veth0 (172.17.0.2) to be forwarded to the bridge, which then sends it on to veth0. This works well for a single host, but causes issues when multiple hosts in the cluster each have their own private address space assigned to their bridge, with no knowledge of the address space assigned to the other hosts, creating potential conflicts.

Pods in kubernetes are able to communicate with other pods whether they are running on the same host or on different hosts or nodes. A kubernetes cluster consists of one or more nodes, typically connected by a gateway router on a cloud platform such as GCP or AWS. Kubernetes assigns an overall address space for the bridges on the nodes and then assigns each node's bridge an address within that space, based on the node the bridge is built on. It also adds routing rules to the gateway router telling it how packets destined for each bridge should be routed, i.e. through which node's eth0 the bridge can be reached. This combination of virtual network interfaces, bridges and routing rules is called an overlay network.


Pods in kubernetes are ephemeral, hence there is no guarantee that the IP address of a pod won't change when the pod is recreated. The general solution to such a problem is to run the traffic through a reverse proxy/load balancer which is highly durable, failure resistant, and maintains a list of healthy servers to forward requests to. A kubernetes service enables load balancing across a set of server pods, allowing client pods to operate independently and durably. A service causes a proxy to be configured to forward requests to a set of pods, usually determined by a selector which matches labels assigned to the pods. Kubernetes provides an internal cluster DNS which resolves the service name to the corresponding service IP; for example, a service named my-service in namespace my-ns is resolvable inside the cluster as my-service.my-ns.svc.cluster.local (assuming the default cluster domain). The service is assigned an IP address on the service network, which is distinct from the pod network. The service network address range, like that of the pod network, is not exposed via kubectl and requires provider-specific commands to retrieve from the cluster properties. Both the service and pod networks are virtual networks.

Every ClusterIP service is assigned an IP address on the service network which is reachable from any pod within the cluster. The service network has no routes, connected bridges or interfaces on the hosts of the nodes making up the cluster. Typically, IP networks are configured with routes such that, when an interface cannot deliver a packet to its destination because no device with the specified address exists locally, it forwards the packet to its upstream gateway. So when the virtual ethernet interface sees packets addressed to a service IP address, it forwards them to the bridge cbr0, as it cannot find any device with the service IP on its pod network. The bridge, being dumb, passes the traffic on to the host/node ethernet interface. In theory, if the host ethernet interface also cannot find any device with the service IP address, it forwards the packet to this interface's gateway, the top-level router. But kube-proxy redirects the packets mid-flight to the address of a server pod.

Proxies usually run in user space, where packets are marshalled into user space and back to kernel space on every trip through the proxy, which can be expensive. Since both pods and nodes are ephemeral entities in the cluster, kubernetes uses a virtual network for service addressing to provide a stable and non-conflicting network address space. The virtual service network has no actual devices, i.e. no ports to listen on or interfaces to open a connection on; kubernetes instead uses the netfilter feature of the linux kernel and a user space interface called iptables to route the packets. Netfilter is a rules-based packet processing engine which runs in kernel space and is able to look at every packet at various points in its life cycle. It matches packets against its rules and takes a specific action, such as redirecting the packet to another destination, when a rule matches. The kube-proxy opens a port and inserts the correct netfilter rules for the service in response to notifications from the master API server about changes in the cluster, including changes to services and endpoints. The kube-proxy can run in iptables mode, in which it mostly ceases to be a proxy for inter-cluster connections and instead delegates to netfilter the work of detecting packets bound for service IPs in kernel space and redirecting them to pods. Kube-proxy's main job is then to keep the netfilter rules in sync, using iptables, based on updates received from the master API server. Kube-proxy is very reliable: it runs as a systemd unit by default, where it restarts on failure, whereas on Google Container Engine it runs as a pod controlled by a daemonset. Health checks against the endpoints are performed by the kubelet running on every node. The kubelet notifies the kube-proxy via the API server when unhealthy endpoints are found, and the kube-proxy then removes the endpoint from the netfilter rules until the endpoint becomes healthy again. This works well for requests that originate inside the cluster, from one pod to another, but for requests from outside the cluster the netfilter rules obfuscate the origin IP.




Connections and requests operate at OSI layer 4 (TCP) or layer 7 (HTTP, RPC, etc.). Netfilter routing rules operate on IP packets at layer 3. All routers, including netfilter, make routing decisions based solely on information contained in the packet, generally where it is from and where it is going. Each packet that arrives at a node's eth0 interface and is destined for the cluster IP address of a service is processed by netfilter, which matches the rules established for the service and forwards the packet to the IP address of a healthy pod. The cluster IP of a service is only reachable from a node's ethernet interface. However, the netfilter rules for the service are not scoped to a particular origin network, i.e. any packet from anywhere that arrives on the node's ethernet interface with a destination of the service's cluster IP will match and get routed to a pod. Hence clients can essentially call the cluster IP: the packets follow a route down to a node and get forwarded to a pod.

The problem with this approach is that nodes are, to some extent, ephemeral, like pods; e.g. nodes can be migrated to new VMs, and clusters can be scaled up and down. Since the routers operate on layer 3 packets, they are unable to distinguish healthy services from unhealthy ones. They expect the next hop in the route to be stable and available. If the node becomes unreachable, the route will break and stay broken for a significant time in most cases. And even if the route were durable, having all external traffic pass through a single node is not optimal. Kubernetes ingress uses load balancers to distribute client traffic across the nodes of the cluster to solve this problem. Instead of a static node address, the gateway router routes packets sent from the load balancer using the addresses of the ethernet interfaces connected to the nodes. With this approach, when a client tries to connect to the service on a particular port, e.g. 80, it fails, as there is no process listening on the service IP address at that port. The node's ethernet interface cannot accept a connection on the specified port, and the netfilter rules which intercept requests and redirect them to a pod don't match the destination address, which is the cluster IP address on the service network. The service network that netfilter is set up to forward packets for is not easily routable from the gateway to the nodes, and the pod network that is easily routable is not the one netfilter is forwarding for.

NodePort services create a bridge between the pod and service networks. A NodePort service is similar to a ClusterIP service, with the additional capability of being reachable at the IP address of the node as well as at the assigned cluster IP on the service network. When kubernetes creates a NodePort service, kube-proxy allocates a port in the range 30000-32767 and opens this port on the eth0 interface of every node. Connections to this port are forwarded to the service's cluster IP. Since a NodePort exposes the service to clients on a non-standard port, a load balancer is usually configured in front of the cluster to expose the usual port, masking the NodePort from end users. NodePorts are the fundamental mechanism by which all external traffic gets into a kubernetes cluster. A sketch of a NodePort service manifest is below.
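
A sketch of a NodePort service; the names and the chosen node port are hypothetical:

apiVersion: v1
kind: Service
metadata:
  name: my-app-nodeport
spec:
  type: NodePort
  selector:
    app: my-app
  ports:
  - port: 80          # port on the cluster IP inside the cluster
    targetPort: 80    # port on the backing pods
    nodePort: 30080   # opened on every node; must fall in the range 30000-32767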


The LoadBalancer service type has all the capabilities of a NodePort service, plus the ability to build out a complete ingress path, but only when running in an environment like GCP or AWS that supports API-driven configuration of networking resources. An external IP is allocated for the LoadBalancer service type, thus extending a single service to support external clients. The load balancer has a few limitations: it cannot be configured to terminate https traffic, and it cannot do virtual hosts or path-based routing, so a single load balancer cannot be used to proxy to multiple services. To overcome these limitations, a new Ingress resource for configuring load balancers was added in version 1.2.

Ingress is a separate resource that configures a load balancer much more flexibly. The Ingress API supports TLS termination, virtual hosts, and path-based routing. It can easily set up a load balancer to handle multiple backend services. The ingress controller is responsible for satisfying the configured requests by driving resources in the environment to the necessary state. When services of type NodePort are created and exposed through an Ingress, the Ingress controller manages the traffic to the nodes. There are ingress controller implementations for GCE load balancers, AWS elastic load balancers, and popular proxies such as nginx and haproxy. Mixing Ingress resources with LoadBalancer services can cause subtle issues in some environments. A sketch of an Ingress manifest is below.
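
A sketch of an Ingress doing virtual-host and path-based routing to two hypothetical backend services, using the extensions/v1beta1 API that was current at the time of writing:

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: my-app-ingress
spec:
  rules:
  - host: my-app.example.com           # virtual host
    http:
      paths:
      - path: /api
        backend:
          serviceName: my-api-service  # hypothetical backend service
          servicePort: 80
      - path: /web
        backend:
          serviceName: my-web-service  # hypothetical backend service
          servicePort: 80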


Helm Package Manager

In order to manage and deploy complex kubernetes applications, there are third-party tools available, such as Helm. Helm is a package manager and templating engine for Kubernetes. It makes it easy to package, configure, and deploy applications and services onto Kubernetes clusters. It can create multiple interdependent kubernetes resources, such as pods, services, deployments, and replicasets, by generating YAML manifest files from its packaging format. It is a convenient way to package YAML files and distribute them in public and private repositories. Such a bundle of YAML files is called a Helm chart. Installing Helm provides the helm command line tool.

Helm allows a common blueprint to be defined, for example for similar microservices, with the dynamic values replaced through placeholders. Values are defined in a YAML file or with the --set flag on the command line.
apiVersion: v1
kind: Pod
metadata:
  name: {{ .Values.name }}
spec:
  containers:
  - name: {{ .Values.container.name }}
    image: {{ .Values.container.image }}
    ports:
    - containerPort: {{ .Values.container.port }}

values.yaml

name: my-app
container:
  name: my-app-container
  image: my-app-image
  port: 9001

A Helm chart consists of Chart.yaml (containing meta information about the chart: name, version, dependencies) and values.yaml (containing all the values used in the template files). The charts directory holds chart dependencies, while the templates directory contains the template files. A sketch of a minimal Chart.yaml is below.
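
A minimal Chart.yaml sketch for a Helm 2 chart; the name and description here are hypothetical:

apiVersion: v1                  # chart API version used by Helm 2
name: my-app
version: 0.1.0                  # version of the chart itself
description: A chart for the hypothetical my-app service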

$ helm install <chart-name>

$ helm upgrade <release-name> <chart-name>

$ helm install --values=my-values.yaml <chart-name>

$ helm install --set version=2.0.0 <chart-name>

Search for Helm charts on the Helm Hub and in the Helm charts GitHub project.

$ helm search <keyword>

Helm 2 utilizes Tiller, a server-side component installed inside the Kubernetes cluster, to manage Helm chart installations. Tiller runs on the Kubernetes cluster and performs the configuration and deployment of software releases on the cluster in response to helm commands. The helm command line tool sends commands to the Tiller server component, which listens for them. Tiller stores a local copy of all the helm installations, which can be used for release management via release names.

Helm 3 removes Tiller entirely and uses the Helm CLI to interact directly with the Kubernetes API, which simplifies deployments and addresses the security concerns around Tiller being so powerful and complex. This change also means that Helm 3 relies more on the existing security configuration of the Kubernetes cluster itself. Helm 3 release names are also scoped to their namespace, allowing the same release name to be used in different namespaces.


Installation on Ubuntu

Install apt-transport-https

$ sudo apt-get update && sudo apt-get install -y apt-transport-https

Add docker signing key and repository URL

$ curl -s https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
$ sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu bionic stable"

Install docker on every node

$ sudo apt update && sudo apt install -qy docker-ce

Start and enable the Docker service

$ sudo systemctl start docker
$ sudo systemctl enable docker

Installing Kubernetes involves installing kubeadm, which bootstraps a Kubernetes cluster; kubelet, which configures containers to run on a host; and kubectl, which deploys and manages apps on Kubernetes.

Add the Kubernetes signing key

$ sudo curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -

Create the file /etc/apt/sources.list.d/kubernetes.list and add the kubernetes repository URL as below.

$ sudo touch /etc/apt/sources.list.d/kubernetes.list
$ sudo vi /etc/apt/sources.list.d/kubernetes.list

Add the line "deb http://apt.kubernetes.io/ kubernetes-xenial main" in vi and save with ":wq".

$ sudo apt-get update
$ sudo apt-get install -y kubelet kubeadm kubectl kubernetes-cni

Cgroup drivers: When systemd is chosen as the init system for a Linux distribution, the init process generates and consumes a root control group (cgroup) and acts as a cgroup manager. Systemd has a tight integration with cgroups and allocates a cgroup per process. Control groups are used to constrain the resources allocated to processes. Using cgroupfs (which the container runtime and kubelet can also use) alongside systemd means that there are two different cgroup managers. A single cgroup manager simplifies the view of what resources are being allocated and gives a more consistent view of the available and in-use resources. With two managers we end up with two views of those resources, causing instability under resource pressure. Hence the cgroup driver is configured to systemd, which is the recommended cgroup driver for Docker.

Set up the daemon configuration for docker.

$ cat > /etc/docker/daemon.json <<EOF
{
  "exec-opts": ["native.cgroupdriver=systemd"],
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m"
  },
  "storage-driver": "overlay2"
}
EOF

$ sudo mkdir -p /etc/systemd/system/docker.service.d

Restart docker.

$ sudo systemctl daemon-reload
$ sudo systemctl restart docker


Kubernetes Master/Worker Node Setup

Initialize the Master Node. The --pod-network-cidr option specifies the address range for the pod network, which is used by the Container Network Interface (CNI) add-on, also called the Pod Network. There are various third-party pod network add-ons available, each expecting a particular CIDR: for example, to start a Calico CNI we specify 192.168.0.0/16, and to start a Flannel CNI we use 10.244.0.0/16. It is recommended that the master host have at least 2 CPU cores and 4GB of RAM. If set, the control plane will automatically allocate CIDRs (Classless Inter-Domain Routing blocks, i.e. subnets) to every node. A pod network add-on must be installed so that the pods can communicate with each other.

Kubeadm uses the network interface associated with the default gateway to advertise the master node's IP address, which it will listen on. The --apiserver-advertise-address option allows selecting a different network interface on the master node machine. Specify '0.0.0.0' to use the address of the default network interface.

$ sudo kubeadm init --pod-network-cidr=192.168.0.0/16 --apiserver-advertise-address=<master-ip-address>

The master node can also be initialized using the default options, without specifying a pod network CIDR.

$ sudo kubeadm init

Issue the following commands as a regular user before joining any nodes:

$ mkdir -p $HOME/.kube
$ sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
$ sudo chown $(id -u):$(id -g) $HOME/.kube/config

Kubeadm sets up a secure cluster by default and enforces use of RBAC.

The kubectl apply command is part of Declarative Management, where changes applied directly to a live object are retained, even if they are not merged back into the configuration files. kubectl automatically detects the create, update, and delete operations for each object. The below command installs a pod network add-on. Only one pod network can be installed per cluster.

$ kubectl apply -f <add-on.yaml>

The below command installs the Calico pod network using the calico.yaml file from the specified 3.6 release.

$ kubectl apply -f https://docs.projectcalico.org/v3.6/getting-started/kubernetes/installation/hosted/kubernetes-datastore/calico-networking/1.7/calico.yaml

We can also download the calico.yaml file first and then pass it to the kubectl apply command.

$ wget "https://docs.projectcalico.org/v3.6/getting-started/kubernetes/installation/hosted/kubernetes-datastore/calico-networking/1.7/calico.yaml" --no-check-certificate

$ kubectl apply -f calico.yaml

The kubeadm join command is run on the worker nodes to allow them to join the cluster, using the <worker-token> returned by the kubeadm init command on the master node. It is recommended that each worker host have at least 1 CPU core and 4GB of RAM.

$ sudo kubeadm join <master-ip-address>:6443 --token <worker-token> --discovery-token-ca-cert-hash sha256:<worker-token-hash>

To generate the worker token again, run the below command with the --print-join-command option on the master node. If the kubeadm join command fails with "couldn't validate the identity of the API Server", use the same command to regenerate the token for the join command.

$ sudo kubeadm token create --print-join-command


Kubernetes Dashboard Setup

Set up the dashboard on the master node before any worker nodes join, to avoid issues.

Note that Kubernetes Dashboard (1.7.x and above) cannot be signed in to over plain HTTP from any domain other than localhost: nothing happens after clicking the Sign in button on the login page.

Use the below command to create the kubernetes dashboard.

$ kubectl create -f https://raw.githubusercontent.com/kubernetes/dashboard/v1.10.1/src/deploy/recommended/kubernetes-dashboard.yaml

To start the dashboard server on the default port 8001 as a blocking (foreground) process, use the below command.

$ kubectl proxy

To access the dashboard from outside the cluster from any host, on a custom address and port, use the below command.

$ kubectl proxy --address="<master-node-address>" -p 8080 --accept-hosts='^*$' &

To create a service account for the dashboard in the "default" namespace:

$ kubectl create serviceaccount dashboard -n default

To bind the cluster-admin role to the dashboard service account:

$ kubectl create clusterrolebinding dashboard-admin -n default \
--clusterrole=cluster-admin \
--serviceaccount=default:dashboard

To get the secret token to paste into the dashboard's token field (copy the resulting secret token):

$ kubectl get secret $(kubectl get serviceaccount dashboard -o jsonpath="{.secrets[0].name}") -o jsonpath="{.data.token}" | base64 --decode

Go to http://<master-node-address>:8001/api/v1/namespaces/kube-system/services/https:kubernetes-dashboard:/proxy/#!/login , which displays the Kubernetes Dashboard login page. Select the Token option and paste the secret token obtained from the previous command to access the Kubernetes dashboard.


Kubernetes Commands

Create a deployment

$ kubectl create deployment <deployment-name> --image=<image-name>

Verify the deployment

$ kubectl get deployments

Get more details about the deployment

$ kubectl describe deployment <deployment-name>

Create a deployment with the specified <deployment-name> and an associated ReplicaSet object. The --replicas option specifies the number of pods that will run the specified image.

$ kubectl run <deployment-name> --replicas=5 --labels=<label-key>=<label-value> --image=<image-name>:<image-version> --port=8080

Get the information about the ReplicaSets

$ kubectl get replicasets
$ kubectl describe replicasets

Create a NodePort service for the deployment

$ kubectl create service nodeport <deployment-name> --tcp=80:80

List the services along with their cluster IPs and exposed ports

$ kubectl get svc

Delete the deployment

$ kubectl delete deployment <deployment-name>

Get status of all the nodes

$ kubectl get nodes

Get status of all the Pods

$ kubectl get pods --all-namespaces -o wide

Get the status of all the pods in the specified namespace using the -n parameter.

$ kubectl get pods -n <namespace> -o wide

Get the status of the pods with the specified label, using the -l parameter, in the given namespace.

$ kubectl get pods -n <namespace> -l <label-key>=<label-value> -o wide

Get the list of current namespaces in a cluster

$ kubectl get namespaces

Delete the Pod with the specified name

$ kubectl delete pod <pod-name>

Delete the Pods and Services with the specified Pod and Service names respectively.

$ kubectl delete pod,service <pod-name> <service-name>

Delete the pods and services with the specified label name, including uninitialized ones

$ kubectl delete pods,services -l name=<label-name> --include-uninitialized

The kubectl create command is part of Imperative Object Management, where the Kubernetes API is used to create, replace or delete objects by specifying the operation directly. It creates a resource from a file or stdin. The below example creates a deployment using an Nginx YAML configuration file

$ kubectl create -f nginx.yaml

Delete resources by file name, stdin, or resource and names.

$ kubectl delete -f pod.yml

Create a service which exposes the specified deployment at port 8080, using --type to specify the service type as LoadBalancer instead of the default ClusterIP.

$ kubectl expose deployment <deployment-name> --type=LoadBalancer --port=8080 --name=<service-name>

Get the information of the specified service.

$ kubectl get services <service-name>

Get the detailed information of the specified service

$ kubectl describe services <service-name>

Update the image for the existing container in the pod template.

$ kubectl set image deployments/<deployment-name> <container-name>=<image-name>

The scale command scales the specified deployment up or down with the specified number of replicas.

$ kubectl scale deployments/<deployment-name> --replicas=3

Drain a particular node, removing it from service. This safely evicts all the pods from the specified node, e.g. in order to perform maintenance on the node.

$ kubectl drain <node-name> --delete-local-data --force --ignore-daemonsets

Delete a node with specified name

$ kubectl delete node <node-name>

Delete the headless service using the kubectl delete command. Multiple service names can be passed to kubectl delete command.

$ kubectl delete service <service-name>
$ kubectl delete service <service1-name> <service2-name> <service3-name>

Delete pods and services with label name=myLabel

$ kubectl delete pods,services -l name=myLabel

Delete the deployments passing multiple deployment names.

$ kubectl delete deployments <deployment1-name> <deployment2-name>

Revert any changes made by the kubeadm init or kubeadm join commands. The --ignore-preflight-errors option allows errors from any of the checks to be ignored.

$ sudo kubeadm reset --ignore-preflight-errors=all

The below kubectl delete commands delete a serviceaccount and a clusterrole by namespace and name.

$ kubectl delete serviceaccount -n kube-system admin-user
$ kubectl delete clusterrole cluster-admin

To view the logs of a particular container inside a pod running multiple containers. The -n (namespace) option selects the namespace of the pod.

$ kubectl logs -f <pod-name> -c <container-name>

$ kubectl -n kube-system logs <pod-name> -c <container-name>

Note: If you face a certificate error "Unable to connect to the server: x509: certificate signed by unknown authority", append the --insecure-skip-tls-verify=true argument to kubectl commands