Monday, December 31, 2018

Kubernetes: Container Orchestration at work

In the previous post we discussed the benefits of containerization over virtualization using the Docker containerization platform and Docker Swarm orchestration of containers. Containers package the application and isolate it from the host, making them more reliable and scalable. However, after scaling up to, say, 1000 containers or 500 services, container deployment, management, load balancing and peer-to-peer communication become a daunting task. Container orchestration automates the deployment, management, scaling, networking, and availability of containers, and hence becomes a necessity while operating at such a scale. It automates the arrangement, coordination, and management of software containers. Kubernetes is currently the best available platform for container orchestration.

Kubernetes is an open-source platform for automating deployment, scaling, and operations of application containers across clusters of hosts, providing container-centric infrastructure. It enables managing containerized workloads and services, facilitating both declarative configuration and automation. Kubernetes uses a declarative description of the desired state, in the form of configuration, to manage scheduling, (re)provisioning of resources and running the containers. It is very portable, configurable and modular, and provides features like auto-scaling, auto-placement, auto-restart, auto-replication, load balancing and auto-healing of containers. Kubernetes can group a number of containers into one logical unit for managing and deploying an application or service. Kubernetes is container-runtime agnostic in principle, but it mostly runs Docker containers. It was developed by Google and later donated to the Cloud Native Computing Foundation, which currently maintains it. Kubernetes has a large, growing online community, and many KubeCon conferences are held around the world.

Some of the key features of Kubernetes are as below:
  • Automatic Binpacking: It packages the application and automatically places containers based on system requirements and available resources.
  • Service Discovery and Load balancing: It provides the ability to discover services and distribute traffic across the worker nodes.
  • Storage Orchestration: It automatically mounts external storage system or volumes.
  • Self Healing: Whenever a container fails it automatically creates a new container in its place. When a node fails, Kubernetes recreates and runs all the containers from the failed node on different nodes.
  • Secret and Configuration Management: It deploys or updates secrets and application configuration without rebuilding the image or restarting the running containers.
  • Batch Execution: It manages batch and CI workloads, replacing containers that fail.
  • Horizontal Scaling: It provides simple CLI commands in order to scale applications up or down based on network load.
  • Automatic Rollouts and Rollbacks: It progressively rolls out updates or changes to the application or its configuration, ensuring that individual instances are updated one after the other. When things go wrong, Kubernetes rolls back the corresponding change, returning to the previous state of the running containers.

Comparison with Docker Swarm

Docker Swarm is easy to set up and requires only a few commands to configure and run a swarm cluster. Kubernetes setup is more complex and requires more commands to bring up a cluster, but the resulting cluster is more customizable, stable and reliable. Scaling up using Docker Swarm is faster compared to Kubernetes, as Docker Swarm is the native orchestrator for running Docker containers. Docker Swarm also has built-in automated load balancing, whereas Kubernetes requires manual service configuration. In Kubernetes, data volumes can only be shared with containers within the same Pod, while Docker Swarm allows volumes to be shared with any other Docker container. Kubernetes provides built-in tools for logging and monitoring, while Docker Swarm relies on external third-party tools. Kubernetes also provides a GUI dashboard for deploying applications in addition to the command line interface. Like Docker Swarm, Kubernetes schedules pods to keep services available while updating them, but it also provides rollback functionality in case of failure during the update.

Kubernetes Concepts


CLUSTER: A cluster is a set of nodes with at least one master node and several worker nodes.

NODE: A node is a worker machine (or virtual machine) in the cluster.

PODS: Pods are the basic scheduling unit, consisting of one or more containers that are guaranteed to be co-located on the host machine and able to share resources. This co-located group of containers within a pod shares an IP address, port space, namespaces, cgroups and storage volumes. The containers within a pod are always scheduled together, sharing the same context and lifecycle (start or stop). Each pod is assigned a unique IP address within the cluster, allowing the application to use ports without conflict. The desired state of the containers within a pod is described through a YAML or JSON object called a PodSpec. These objects are passed to the kubelet through the API server. Pods are mortal, i.e. they are not resurrected after their death. Similar to Docker containers they are ephemeral, i.e. when a pod dies another pod comes up, possibly on another host. The logical collection of containers within a pod interact with each other to provide a service, and a pod corresponds to a single instance of that service.
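Below is a minimal, illustrative PodSpec sketch for a two-container pod; the names, image and sidecar are hypothetical and only meant to show the shared pod structure described above.

apiVersion: v1
kind: Pod
metadata:
  name: my-app-pod              # illustrative name
  labels:
    app: my-app
spec:
  containers:
  - name: my-app-container      # main application container
    image: my-app-image:1.0     # illustrative image
    ports:
    - containerPort: 8080
  - name: log-sidecar           # co-located container sharing the pod's network and volumes
    image: busybox
    command: ["sh", "-c", "tail -f /dev/null"]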

SERVICE: A Service is an abstraction which defines a logical set of Pods and a policy by which to access them. It is a REST object, similar to a Pod, and requires a service definition to be POSTed to the API server in order to create a new instance. It provides access to dynamic pods using labels and load balances traffic across the nodes. The label selector determines the set of Pods targeted by a Service. Each Service is assigned a unique IP address, also known as the clusterIP, which is tied to the lifespan of the Service. A Service provides a stable endpoint for referencing the pods. Since Pods are created and destroyed dynamically, their IP addresses cannot be relied upon to be stable over time; hence external clients rely on the Service abstraction, which decouples them from the actual Pods. All communication to the Service is automatically load balanced across the member pods of the Service. Kubernetes offers native applications a simple Endpoints API that is updated whenever the set of Pods in a Service changes. For non-native applications, Kubernetes offers a virtual-IP-based bridge to Services which redirects to the backend Pods. There are four types of Service, described below; a minimal Service definition is sketched after the list:
  1. ClusterIP:  Exposes the Service on an internal IP in the cluster. This type of service is only reachable from within the cluster. This is the default Type.
  2. NodePort: Exposes the Service on each Node’s IP at a static port. It uses the same port on every selected Node in the cluster using NAT, and is accessible from outside the cluster using <NodeIP>:<NodePort>. It is a superset of ClusterIP.
  3. LoadBalancer: It creates an external load balancer in the cloud and assigns a fixed, external IP to the Service. It is a superset of NodePort service type.
  4. ExternalName: Maps the Service to an arbitrary external DNS name by returning a CNAME record with that name. It does not use any proxy.
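The following is a minimal sketch of a ClusterIP Service definition; the service name, label and ports are illustrative only.

apiVersion: v1
kind: Service
metadata:
  name: my-app-service          # illustrative name
spec:
  type: ClusterIP               # default type; may be omitted
  selector:
    app: my-app                 # label selector targeting the backing pods
  ports:
  - port: 80                    # port exposed on the service's cluster IP
    targetPort: 8080            # container port traffic is forwarded to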

NAMESPACE: Namespaces are a way to divide cluster resources between multiple users, where users are spread across multiple teams. Names of resources need to be unique within a namespace, but not across namespaces. Namespaces provide logical separation between teams and their environments, acting as virtual clusters.

LABELS: Labels are key-value pairs used to identify objects such as pods, services and deployments with specific attributes. They can be used to organize and to select subsets of objects. Each label key is unique for a given object. Labels can be attached to an object at creation time and added or modified at runtime. They make it possible to distinguish resources within the same namespace.

LABEL SELECTORS: Labels are not unique; multiple objects within the same namespace can carry the same label. The label selector is the core grouping primitive in Kubernetes which allows users to identify/select a set of objects. A selector can be made of multiple comma-separated requirements, which act as a logical AND. The Kubernetes API currently supports two types of selectors (a selector fragment is sketched after the list):
  • Equality-based Selectors: They allow filtering by key and value. Matching objects must satisfy all the specified labels; three operators are supported: =, == and !=.
  • Set-based Selectors: They allow filtering of keys according to a set of values. It supports three kinds of operators; in, notin and exists (only the key identifier).
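The sketch below shows how both selector styles appear inside a ReplicaSet or Deployment spec; the keys and values are illustrative.

selector:
  matchLabels:                  # equality-based: each key must equal its value
    app: my-app
  matchExpressions:             # set-based: the key's value must be in the listed set
  - key: tier
    operator: In
    values:
    - frontend
    - backend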

REPLICA SET: A ReplicaSet manages the lifecycle of pods and ensures that the specified number of replicas is running. It typically creates and destroys Pods dynamically, especially while scaling out or in. The ReplicaSet is the next-generation replication controller; it supports set-based selectors, whereas the replication controller supports only equality-based selectors. It is similar to a service in Docker Swarm.

DEPLOYMENT: Deployments control the number of pod instances running using ReplicaSets and upgrade them in a controlled manner. They have the capability to update the ReplicaSet and to roll back to a previous version. The deployment controller allows updating, pausing or cancelling an ongoing deployment before completion, or rolling the deployment back entirely midway. The deployment controller drains and terminates a given number of replicas, creates replicas from the new deployment definition, and continues the process until all replicas in the deployment are updated. A Deployment is defined using a YAML file. A deployment can be performed either as a Recreate, i.e. killing all the existing pods and then bringing up new ones, or as a Rolling Update, gradually bringing down the old pods while bringing up the new ones. By default deployments are performed as a rolling update. Deployments facilitate scaling (by updating the number of replicas), rolling updates, rollbacks, version updates (image updates), pod health checks and healing.
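Here is a minimal, illustrative Deployment sketch with an explicit rolling-update strategy; the names, image and counts are hypothetical.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app-deployment       # illustrative name
spec:
  replicas: 3                   # desired number of pod instances
  strategy:
    type: RollingUpdate         # default strategy; Recreate kills all old pods first
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  selector:
    matchLabels:
      app: my-app
  template:                     # pod template used by the underlying ReplicaSet
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app-container
        image: my-app-image:2.0   # changing this image triggers a rolling update
        ports:
        - containerPort: 8080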

VOLUME: A volume is a directory which is accessible to the containers within a pod and provides persistent storage for the pod. The lifespan of a volume in Kubernetes is the same as that of the Pod enclosing it: a volume outlives any individual containers that run within the Pod, and data is preserved across container restarts. A pod can use any number of volumes simultaneously. The Pod specifies the volumes to be used via the .spec.volumes field and mounts them into its containers; every container in the Pod independently specifies the path at which to mount each volume. A few of the volume types are: the hostPath volume, which mounts a file or directory from the node’s filesystem into the Pod; the emptyDir volume, which is created when the pod is assigned to a node and exists as long as the Pod is running on that node; and the secret volume, which is used to pass sensitive information to the pods. The gcePersistentDisk volume mounts a Google Compute Engine (GCE) Persistent Disk into the pod with pre-populated data and has its contents preserved even after the pod is removed from the node. Unlike regular volumes, a Kubernetes Persistent Volume is used to retain data even after the pod dies. Persistent volumes are administrator-provisioned volumes created with a particular filesystem, size, and identifying characteristics such as volume IDs and names. Using a persistent volume involves first provisioning the network storage, then creating a PersistentVolumeClaim to request the storage, and finally using the claimed volume. The persistent volume claim is referenced in the spec of a pod so that the containers in the pod can use the volume.
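The sketch below shows a hypothetical PersistentVolumeClaim and a pod that mounts the claimed volume; the names, size and mount path are illustrative.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-app-claim            # illustrative claim name
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi              # requested size, matched against available persistent volumes
---
apiVersion: v1
kind: Pod
metadata:
  name: my-app-pod
spec:
  containers:
  - name: my-app-container
    image: my-app-image
    volumeMounts:
    - name: data                # mounts the claimed volume into the container
      mountPath: /var/data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: my-app-claim   # references the claim defined above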

MASTER: The API server, etcd, controller manager and scheduler processes together make up the central control plane of the cluster, which runs on the master node. It provides a unified view of the cluster. The master does not run application containers in Kubernetes.

WORKER: A worker is a Docker host running the kubelet (node agent) and proxy services. It runs the pods and containers.


Kubernetes Architecture


Master Components

Master components provide the cluster’s control plane and are responsible for global activities across the cluster, such as scheduling, and detecting and responding to cluster events.

API Server: The API server exposes the Kubernetes API and is the entry point for all the REST commands used to control the cluster. It processes the REST requests, validates them and executes the bound business logic. It relies on etcd to store the resulting state. Kubectl (the Kubernetes CLI) makes requests to the Kubernetes API server.

etcd Storage: etcd is a simple, distributed, consistent key-value store for all cluster data. It is mainly used for shared configuration and service discovery. It provides a REST API for CRUD operations as well as an interface to register watchers on specific nodes, which enables a reliable way to notify the rest of the cluster about configuration changes. Examples of data stored by Kubernetes in etcd include the jobs being scheduled, created and deployed, pod and service details and state, namespaces and replication information.

Scheduler: Scheduler is responsible for deployment of configured pods and services onto the nodes. It selects a node for execution of newly created pods which have no assigned nodes. It has the information regarding resources available on the members of the cluster i.e. nodes, as well as the ones required for the configured service to run and hence is able to decide where to deploy a specific service. Factors taken into account for scheduling decisions include individual and collective resource requirements, hardware/software/policy constraints, affinity and anti-affinity specifications, data locality, inter-workload interference and deadlines.

Controller Manager: The controller manager is a daemon which runs different embedded controllers on the master node. Each controller is logically a separate process, even though all controllers run as part of a single process. It uses the API server to watch the shared state of the cluster and makes corrective changes to move the current state toward the desired one. Below are some of the controllers executed by the controller manager:
  • Node Controller: Responsible for noticing and responding when nodes go down.
  • Replication Controller: It ensures the correct number of pods is always running based on the configured replication factor, recreating failed pods and removing extra-scheduled ones.
  • Endpoints Controller: It populates the Endpoints object. For headless services (services without a cluster IP) with selectors, it creates Endpoints records in the API and modifies the DNS configuration to return addresses that point directly to the pods backing the service.
  • Token Controller: It creates default service accounts and API access tokens for new namespaces. It manages the creation and deletion of ServiceAccounts and their associated ServiceAccountToken Secrets that allow API access.

Node Components

Node components run on every node, maintain running pods and provide them the Kubernetes runtime environment.

Kubelet: Each worker node runs an agent process called the kubelet, which is responsible for managing the state of the node, i.e. starting, stopping, and maintaining application containers based on instructions from the API server. It gets the configuration of a pod from the API server and ensures that the described containers are up and running. It is the worker service which communicates with the API server to get information about services and to write the details about newly created ones. All containers on the node are run through the kubelet. It also monitors the health of its worker node and reports it to the API server.

Kube Proxy: It is a network proxy and a load balancer for a service on a single worker node. It handles the requests from the internet by enabling the network routing for TCP and UDP packets. kube-proxy enables the Kubernetes service abstraction by maintaining network rules on the host and performing connection forwarding.

Kubectl: It is the default command line tool for communicating with the API server and sending commands to the master node. It can also enable the kube dashboard, which allows deploying and running containers through a GUI.


Monitoring Kubernetes

Kubernetes has built-in monitoring tools to check the health of individual nodes using cAdvisor and Heapster. cAdvisor is an open-source container resource usage collector. It operates at the node level in Kubernetes, auto-discovering all containers on the given node and collecting CPU, memory, filesystem, and network usage statistics. cAdvisor also provides overall machine usage by analyzing the ‘root’ container on the machine. cAdvisor provides the basic resource utilization for the node but is unable to determine the resource utilization of individual applications running within the containers. Heapster is another tool which aggregates monitoring data from the cAdvisors across all nodes in the Kubernetes cluster. It runs as a pod in the cluster, similar to any application. The Heapster pod discovers all the nodes in the cluster and then pulls metrics by querying usage information from the kubelet of each node (which in turn queries cAdvisor), aggregates them by pod and label, and reports the metrics to a monitoring service or storage backend. Heapster makes it easy to collect data from the Kubernetes cluster using cAdvisor, but it does not provide built-in storage. It can integrate with InfluxDB or Google Cloud for storage and use UI tools such as Grafana for data visualization.

Kubernetes Networking Internals

A Kubernetes pod consists of one or more containers that are collocated on the same host and are configured to share a network namespace, where all containers in the pod can reach each other using localhost. A typical Docker container has a virtual network interface veth0 which is attached to the bridge docker0 using a pair of linked virtual ethernet devices. The bridge docker0 is in turn attached to the physical network interface eth0, which is in the root network namespace; veth0 lives in the container’s network namespace. When a second container is started in the same network namespace, it shares the existing interface veth0 instead of getting a new virtual network interface. Both containers are then reachable at veth0’s IP address (172.17.0.2) and both can hit ports opened by the other on localhost. Kubernetes implements this by creating a special container (started with the pause command) for each pod, whose main purpose is to provide a virtual network interface for all the other containers to communicate with each other and with the outside world. The pause command suspends the current process until a signal is received, so these containers do nothing at all except sleep until Kubernetes sends them SIGTERM. The local routing rules set up during bridge creation allow any packet arriving at eth0 with a destination address of veth0 (172.17.0.2) to be forwarded to the bridge, which then sends it on to veth0. This works well for a single host, but causes issues when multiple hosts in the cluster each assign their own private address space to their bridge without any knowledge of the address space assigned to other hosts, creating potential conflicts.

Pods in Kubernetes are able to communicate with other pods whether they are running on the same host or on different hosts/nodes. A Kubernetes cluster consists of one or more nodes, typically connected through a gateway router on a cloud platform such as GCP or AWS. Kubernetes assigns an overall address space for the bridges on each node and then assigns each bridge an address within that space, based on the node the bridge is built on. It also adds routing rules to the gateway router telling it how packets destined for each bridge should be routed, i.e. which node's eth0 the bridge can be reached through. This combination of virtual network interfaces, bridges and routing rules is called an overlay network.


Pods in Kubernetes are ephemeral, so there is no guarantee that the IP address of a pod won't change when the pod is recreated. The general solution to such a problem is to run the traffic through a reverse proxy/load balancer which is highly durable, failure resistant and maintains the list of healthy servers to forward requests to. A Kubernetes Service enables load balancing across a set of server pods and allows client pods to operate independently and durably. A Service causes a proxy to be configured to forward requests to a set of pods, usually determined by a selector which matches labels assigned to the pods. Kubernetes provides an internal cluster DNS which resolves the service name to the corresponding service IP. The service is assigned an IP address on the service network, which is different from the pod network. Like the pod network, the service network address range is not exposed via kubectl; provider-specific commands are required to retrieve cluster properties. Both the service and pod networks are virtual networks.

Every ClusterIP service is assigned an IP address on the service network which is reachable from any pod within the cluster. However, the service network has no routes, connected bridges or interfaces on the hosts making up the cluster. Typically IP networks are configured with routes such that when an interface cannot deliver a packet to its destination, because no device with the specified address exists locally, it forwards the packet on to its upstream gateway. So when the virtual ethernet interface sees packets addressed to a service IP address, it forwards them to the bridge cbr0, as it cannot find any device with the service IP on its pod network. The bridge, being dumb, passes the traffic on to the host/node ethernet interface. In theory, if the host ethernet interface also could not find any device with the service IP address, it would forward the packet to that interface's gateway, the top-level router. But kube-proxy redirects the packets mid-flight to the address of a server pod.

Proxies usually run in user space, where packets are marshaled into user space and back to kernel space on every trip through the proxy, which can be expensive. Since both pods and nodes are ephemeral entities in the cluster, Kubernetes uses a virtual network for service addressing to provide a stable and non-conflicting network address space. The virtual service network has no actual devices, i.e. no ports to listen on or interfaces on which to open a connection; instead, Kubernetes uses the netfilter feature of the Linux kernel and a user-space interface called iptables to route the packets. Netfilter is a rules-based packet processing engine which runs in kernel space and is able to look at every packet at various points in its life cycle. It matches packets against the rules and takes a specific action, such as redirecting the packet to another destination, when the corresponding rule matches. The kube-proxy opens a port and inserts the correct netfilter rules for the service in response to notifications from the master API server about changes in the cluster, which include changes to services and endpoints. The kube-proxy can run in iptables mode, in which it mostly ceases to be a proxy for inter-cluster connections and instead delegates to netfilter the work of detecting packets bound for service IPs in kernel space and redirecting them to pods. Kube-proxy's main job is then to keep the netfilter rules in sync, using iptables, based on updates received from the master API server. Kube-proxy is very reliable: by default it runs as a systemd unit, where it restarts on failure, whereas on Google Container Engine it runs as a pod controlled by a DaemonSet. Health checks against the endpoints are performed by the kubelet running on every node. The kubelet notifies the kube-proxy via the API server when unhealthy endpoints are found, and the kube-proxy then removes the endpoint from the netfilter rules until it becomes healthy again. This works well for requests that originate inside the cluster, from one pod to another, but for requests from outside the cluster the netfilter rules obfuscate the origin IP.




Connections and requests operate at OSI layer 4 (TCP) or layer 7 (HTTP, RPC, etc.). Netfilter routing rules operate on IP packets at layer 3. All routers, including netfilter, make routing decisions based solely on information contained in the packet; generally where it is from and where it is going. Each packet that arrives at a node's eth0 interface and is destined for the cluster IP address of a service is processed by netfilter, which matches the rules established for the service and forwards the packet to the IP address of a healthy pod. The cluster IP of a service is only reachable through a node's ethernet interface. The netfilter rules for the service are not scoped to a particular origin network, i.e. any packet from anywhere that arrives on the node's ethernet interface with a destination of the service's cluster IP will match and get routed to a pod. Hence clients can essentially call the cluster IP: the packets follow a route down to a node and get forwarded to a pod.

The problem with this approach is that nodes are, to some extent, as ephemeral as pods; for example, nodes can be migrated to a new VM, or clusters can be scaled up and down. Since the routers operate on layer 3 packets, they are unable to tell healthy services from unhealthy ones. They expect the next hop in the route to be stable and available. If the node becomes unreachable, the route will break and stay broken for a significant time in most cases. Even if the route were durable, having all external traffic pass through a single node is not optimal. Kubernetes solves this problem by using load balancers to distribute client traffic across the nodes within the cluster: the gateway router routes packets sent from the load balancer to the ethernet interfaces of the nodes rather than to any single static address. With this approach, when a client tries to connect to the service on a particular port, e.g. 80, it fails, as there is no process listening on the service IP address on the specified port. The node's ethernet interface cannot accept connections on the specified port, and the netfilter rules which intercept requests and redirect them to a pod do not match, because the packet's destination is not the cluster IP address on the service network. The service network that netfilter is set up to forward packets for is not easily routable from the gateway to the nodes, and the pod network that is easily routable is not the one netfilter is forwarding for.

The NodePort service type creates a bridge between the pod and service networks. A NodePort service is similar to a ClusterIP service, with the additional capability of being reachable via the IP address of the node as well as via the assigned cluster IP on the service network. When Kubernetes creates a NodePort service, kube-proxy allocates a port in the range 30000–32767 and opens this port on the eth0 interface of every node. Connections to this port are forwarded to the service's cluster IP. Since a NodePort exposes the service to clients on a non-standard port, a load balancer is usually configured in front of the cluster which exposes the usual port, masking the NodePort from end users. NodePorts are the fundamental mechanism by which all external traffic gets into a Kubernetes cluster.
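A minimal NodePort service sketch is shown below; the name, labels and ports are illustrative, and the nodePort field can be omitted to let Kubernetes pick a port from the 30000-32767 range.

apiVersion: v1
kind: Service
metadata:
  name: my-app-nodeport         # illustrative name
spec:
  type: NodePort
  selector:
    app: my-app
  ports:
  - port: 80                    # port on the service's cluster IP
    targetPort: 8080            # container port on the backing pods
    nodePort: 30080             # port opened on every node; must be within 30000-32767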


The LoadBalancer service type has all the capabilities of a NodePort service plus the ability to build out a complete ingress path, but only when running in an environment like GCP or AWS that supports API-driven configuration of networking resources. An external IP is allocated for a LoadBalancer service, thus exposing a single service to external clients. The load balancer has a few limitations: it cannot be configured to terminate HTTPS traffic, and it cannot do virtual hosts or path-based routing, so a single load balancer cannot proxy to multiple services. To overcome these limitations, a new Ingress resource for configuring load balancers was added in version 1.2.

Ingress is a separate resource that configures a load balancer with much more flexibility. The Ingress API supports TLS termination, virtual hosts, and path-based routing, and can easily set up a load balancer to handle multiple backend services. The ingress controller is responsible for satisfying the configured requests by driving resources in the environment to the necessary state. When services of type NodePort are created and exposed using an Ingress, the Ingress controller manages the traffic to the nodes. There are ingress controller implementations for GCE load balancers, AWS elastic load balancers, and for popular proxies such as nginx and haproxy. Mixing Ingress resources with LoadBalancer services can cause subtle issues in some environments.
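The sketch below shows a hypothetical Ingress (using the extensions/v1beta1 API of that era) with TLS termination, a virtual host and path-based routing to two backend services; all names and hosts are illustrative.

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: my-app-ingress          # illustrative name
spec:
  tls:
  - hosts:
    - my-app.example.com
    secretName: my-app-tls      # hypothetical secret holding the TLS certificate
  rules:
  - host: my-app.example.com    # virtual-host routing
    http:
      paths:
      - path: /api              # path-based routing to a backend service
        backend:
          serviceName: my-api-service
          servicePort: 80
      - path: /
        backend:
          serviceName: my-web-service
          servicePort: 80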


Helm Package Manager

To manage and deploy complex Kubernetes applications, third-party tools such as Helm are available. Helm is a package manager and templating engine for Kubernetes. It makes it easy to package, configure, and deploy applications and services onto Kubernetes clusters. It allows creating multiple interdependent Kubernetes resources such as pods, services, deployments and replica sets by generating YAML manifest files from its packaging format. It is a convenient way to package YAML files and distribute them in public and private repositories; such a bundle of YAML files is called a Helm chart. Installing Helm provides the helm command line tool.

Helm allows defining a common blueprint, for example for a set of similar microservices, and filling in the dynamic values through placeholders. Values are defined in a YAML file or using the --set flag on the command line.
apiVersion: v1
kind: Pod
metadata:
  name: {{ .Values.name }}
spec:
  containers:
  - name: {{ .Values.container.name }}
    image: {{ .Values.container.image }}
    ports:
    - containerPort: {{ .Values.container.port }}

values.yaml

name: my-app
container:
  name: my-app-container
  image: my-app-image
  port: 9001

A Helm chart consists of a Chart.yaml file (containing meta information about the chart: name, version, dependencies), a values.yaml file containing all the values used by the template files, a charts directory holding chart dependencies, and a templates directory containing the template files.
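A sketch of a minimal chart layout and Chart.yaml (Helm 2 style) is shown below; the chart name, version and description are illustrative.

my-app-chart/
  Chart.yaml          # chart metadata
  values.yaml         # default configuration values
  charts/             # chart dependencies
  templates/          # templated Kubernetes manifests

Chart.yaml

apiVersion: v1
name: my-app
version: 0.1.0
description: Illustrative chart for my-app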

$ helm install <chart-name>

$ helm upgrade <chart-name>

$ helm install --values=my-values.yaml <chart-name>

$ helm install --set version=2.0.0 <chart-name>

Search for Helm charts on Helm Hub and in the Helm charts GitHub project.

$ helm search <keyword>

Helm 2 utilizes Tiller, a server-side component installed inside the Kubernetes cluster, to manage Helm chart installations. Tiller runs on the Kubernetes cluster and performs the configuration and deployment of software releases on the cluster in response to helm commands. The helm command line tool sends commands which are received by the Tiller server component. Tiller stores a local copy of all the Helm installations, which can be used for release management via release names.

Helm 3 removes Tiller entirely and uses the Helm CLI to interact directly with the Kubernetes API, which simplifies deployments and addresses the security concerns around Tiller, which was very powerful and complex. This change also means that Helm 3 relies more on the existing security configuration of your Kubernetes cluster. Helm 3 release names are now scoped to their namespace, allowing the same release name to be used in different namespaces.


Installation on Ubuntu

Install apt-transport-https

$ sudo apt-get update && sudo apt-get install -y apt-transport-https

Add docker signing key and repository URL

$ curl -s https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
$ sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu bionic stable"

Install docker on every node

$ sudo apt update && sudo apt install -qy docker-ce

Start and enable the Docker service

$ sudo systemctl start docker
$ sudo systemctl enable docker

Installing Kubernetes involves installing kubeadm, which bootstraps a Kubernetes cluster; kubelet, which configures containers to run on a host; and kubectl, which deploys and manages apps on Kubernetes.

Add the Kubernetes signing key

$ sudo curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -

Create the file /etc/apt/sources.list.d/kubernetes.list and add kubernetes repository URL as below.

$ sudo touch /etc/apt/sources.list.d/kubernetes.list
$ sudo vi /etc/apt/sources.list.d/kubernetes.list

Add "deb http://apt.kubernetes.io/ kubernetes-xenial main" to vi and save with "!wq".

$ sudo apt-get update
$ sudo apt-get install -y kubelet kubeadm kubectl kubernetes-cni

Cgroup drivers: When systemd is chosen as the init system for a Linux distribution, the init process generates and consumes a root control group (cgroup) and acts as a cgroup manager. Systemd has a tight integration with cgroups and will allocate cgroups per process. Using cgroupfs (which the container runtime and kubelet can also be configured to use) alongside systemd means that there would be two different cgroup managers. Control groups are used to constrain the resources that are allocated to processes. A single cgroup manager simplifies the view of what resources are being allocated and will by default have a more consistent view of the available and in-use resources. With two managers we end up with two views of those resources, causing instability under resource pressure. Hence the cgroup driver is set to systemd, which is the recommended cgroup driver for Docker.

Set up the Docker daemon.

$ cat > /etc/docker/daemon.json <<EOF
{
  "exec-opts": ["native.cgroupdriver=systemd"],
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m"
  },
  "storage-driver": "overlay2"
}
EOF

$ sudo mkdir -p /etc/systemd/system/docker.service.d

Restart docker.

$ sudo systemctl daemon-reload
$ sudo systemctl restart docker


Kubernetes Master/Worker Node Setup

Initialize the master node. The --pod-network-cidr option specifies the address range for the pod network used by the Container Network Interface (CNI) add-on, also called the pod network. Various third-party pod network add-ons are available, and each expects a particular CIDR: for example, Calico typically uses 192.168.0.0/16 and Flannel uses 10.244.0.0/16. It is recommended that the master host have at least 2 CPU cores and 4GB of RAM. If set, the control plane will automatically allocate CIDRs (Classless Inter-Domain Routing blocks, i.e. subnets) for every node. A pod network add-on must be installed so that the pods can communicate with each other.

Kubeadm uses the network interface associated with the default gateway to advertise the master node's IP address, which the API server will listen on. The --apiserver-advertise-address option allows selecting a different network interface on the master node machine. Specify '0.0.0.0' to use the address of the default network interface.

$ sudo kubeadm init --pod-network-cidr=192.168.0.0/16 --apiserver-advertise-address=<master-ip-address>

The master node can also be initialized using the default options, in which case no pod network CIDR is specified.

sudo kubeadm init

Issue the following commands as a regular user before joining any nodes

mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

Kubeadm sets up a secure cluster by default and enforces use of RBAC.

The kubectl apply command is part of declarative management, where changes applied directly to a live object are retained even if they are not merged back into the configuration files. kubectl automatically detects the create, update, and delete operations for every object. The command below installs a pod network add-on; only one pod network can be installed per cluster.

$ kubectl apply -f <add-on.yaml>

The command below installs the Calico pod network using the calico.yaml file from release 3.6.

kubectl apply -f https://docs.projectcalico.org/v3.6/getting-started/kubernetes/installation/hosted/kubernetes-datastore/calico-networking/1.7/calico.yaml

We can also download the calico.yaml file and then pass it to kubectl apply command.

$ wget "https://docs.projectcalico.org/v3.6/getting-started/kubernetes/installation/hosted/kubernetes-datastore/calico-networking/1.7/calico.yaml" --no-check-certificate

kubectl apply -f calico.yaml

The kubeadm join command is run on the worker nodes to allow them to join the cluster using the <worker-token> returned by the kubeadm init command on the master node. It is recommended that each worker host have at least 1 CPU core and 4GB of RAM.

$ sudo kubeadm join <master-ip-address>:6443 --token <worker-token> --discovery-token-ca-cert-hash sha256:<worker-token-hash>

To generate the worker token again, run the command below with the --print-join-command option on the master node. If the kubeadm join command fails with "couldn't validate the identity of the API Server", use this command to regenerate the token for the join command.

$ sudo kubeadm token create --print-join-command


Kubernetes Dashboard Setup

Set up the dashboard on the master node before any worker nodes join, to avoid issues.

Kubernetes Dashboard (1.7.x) does not allow signing in when accessed over plain HTTP from any domain other than localhost; nothing happens after clicking the Sign in button on the login page.

Use the below command to create the kubernetes dashboard.

kubectl create -f https://raw.githubusercontent.com/kubernetes/dashboard/v1.10.1/src/deploy/recommended/kubernetes-dashboard.yaml

To start the dashboard server on default port 8001 with blocking process use the below command.

kubectl proxy

To access the dashboard from outside the cluster, from any host and on a custom address and port, use the command below.

kubectl proxy --address="<master-node-address>" -p 8080 --accept-hosts='^*$' &

To create a service account for your dashboard, using "default" namespace

kubectl create serviceaccount dashboard -n default

To add cluster binding rules for your roles on dashboard

kubectl create clusterrolebinding dashboard-admin -n default \
--clusterrole=cluster-admin \
--serviceaccount=default:dashboard

Get the secret token to be pasted into the dashboard Token field, and copy the resulting key.

kubectl get secret $(kubectl get serviceaccount dashboard -o jsonpath="{.secrets[0].name}") -o jsonpath="{.data.token}" | base64 --decode

Go to http://<master-node-address>:8001/api/v1/namespaces/kube-system/services/https:kubernetes-dashboard:/proxy/#!/login , which displays the Kubernetes Dashboard. Select the Token option and paste the secret key obtained from the previous command to access the Kubernetes dashboard.


Kubernetes Commands

Create a deployment

$ kubectl create deployment <deployment-name> --image=<image-name>

Verify the deployment

$ kubectl get deployments

Get more details about the deployment

$ kubectl describe deployment <deployment-name>

Create a deployment with specified <deployment-name> and an associated ReplicaSet object. The --replicas option specifies the number of pods which would run the specified image.

kubectl run <deployment-name> --replicas=5 --labels=<label-name> --image=<image-name>:<image-version> --port=8080

Get the information about the ReplicaSets

$ kubectl get replicasets
$ kubectl describe replicasets

Create the service on the nodes

$ kubectl create service nodeport <deployment-name> --tcp=80:80

List the services in the cluster along with their types, cluster IPs and ports

$ kubectl get svc

Delete the deployment

$ kubectl delete deployment <deployment-name>

Get status of all the nodes

kubectl get nodes

Get status of all the pods

kubectl get pods --all-namespaces -o wide

Get the status of all the pods with specified namespace using the -n parameter.

$ kubectl get pods -n <namespace> -o wide

Get the status of the pods with specified label using -l parameter and namespace.

$ kubectl get pods -n <namespace> -l <label-key>=<label-value> -o wide

Get detailed status of the pods

kubectl get -o wide pods --all-namespaces

Get the list of current namespaces in a cluster

$ kubectl get namespaces

Delete the Pod with the specified name

kubectl delete pod <pod-name>

Delete the Pods and Services with the specified Pod and Service names respectively.

kubectl delete pod,service <pod-name> <service-name>

Delete the pods and services with the specified label name, including uninitialized ones

$ kubectl delete pods,services -l name=<label-name> --include-uninitialized

The kubectl create command is part of imperative object management, where the Kubernetes API is used to create, replace or delete objects by specifying the operation directly. It creates a resource from a file or from stdin. The example below creates a deployment using an Nginx YAML configuration file; a sketch of such a file follows the command.

$ kubectl create -f nginx.yaml
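A hypothetical nginx.yaml for the command above might look like the following sketch; the deployment name, replica count and image tag are illustrative.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.15       # illustrative version
        ports:
        - containerPort: 80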

Delete resources by file name, stdin, or resource type and names.

$ kubectl delete -f pod.yml

Create a service which exposes the specified deployment on port 8080, using --type to specify the service type as LoadBalancer instead of the default ClusterIP.

kubectl expose deployment <deployment-name> --type=LoadBalancer --port=8080 --name=<service-name>

Get the information of the specified service.

kubectl get services <service-name>

Get the detailed information of the specified service

$ kubectl describe services <service-name>

Update image for the existing container or pod template.

$ kubectl set image deployments/<deployment-name> <container-name>=<image-name>

The scale command scales the specified deployment up or down with the specified number of replicas.

$ kubectl scale deployments/<deployment-name> --replicas=3

Drain a particular node, i.e. remove it from service. It safely evicts all pods from the specified node, for example in order to perform maintenance on the node.

$ kubectl drain <node-name> --delete-local-data --force --ignore-daemonsets

Delete a node with specified name

$ kubectl delete node <node-name>

Delete the headless service using the kubectl delete command. Multiple service names can be passed to kubectl delete command.

$ kubectl delete service <service-name>
$ kubectl delete service <service1-name> <service2-name> <service3-name>

Delete pods and services with label name=myLabel

$ kubectl delete pods,services -l name=myLabel

Delete the deployments passing multiple deployment names.

$ kubectl delete deployments <deployment1-name> <deployment2-name>

The kubeadm reset command reverts any changes made by the kubeadm init or kubeadm join commands. The --ignore-preflight-errors option allows ignoring errors from the specified checks.

$ sudo kubeadm reset --ignore-preflight-errors=<check-list>

The below kubectl delete command deletes serviceaccount and clusterrole by namespace and name.

$ kubectl delete serviceaccount -n kube-system admin-user
$ kubectl delete clusterrole cluster-admin

To view the logs of a particular container inside a pod running multiple containers. The -n (namespace) option allows filtering the pod by namespace.

$ kubectl logs -f <pod-name> -c <container-name>

$ kubectl -n kube-system logs <pod-name> -c <container-name>

Note: If you face a certificate error such as "Unable to connect to the server: x509: certificate signed by unknown authority", append the --insecure-skip-tls-verify=true argument to kubectl commands.



Thursday, December 27, 2018

Docker: Platform for Microservices


With the advent of the microservices architectural style, where applications are a collection of loosely coupled services which can be independently deployed, upgraded and scaled, many organizations are switching to microservices design in order to achieve greater scalability and availability. In order to run individual services on different instances and scale efficiently, self-contained units such as virtual machines or Docker containers can be used.

Virtualization is a technique of running a guest operating system on a host operating system. It allows multiple operating systems to run on a single machine, which enables easy recovery on failure. A virtual machine comprises some level of virtualized hardware and kernel, on which a guest operating system and guest kernel run and talk to this virtual hardware. A virtual machine emulates a real machine and runs on top of either a hosted hypervisor or a bare-metal hypervisor, which in turn runs on the host machine. A hosted hypervisor runs on the operating system of the host machine and hence is almost hardware independent, while a bare-metal hypervisor runs directly on the host machine’s hardware, providing better performance. The hypervisor drives virtualization by allowing the physical host machine to operate multiple virtual machines as guests, helping maximize the effective use of computing resources such as memory, network bandwidth and CPU cycles; it also allows sharing of resources amongst multiple virtual machines. Either way, the hypervisor approach is considered heavyweight, as it requires virtualizing multiple parts, if not all, of the hardware and kernel. A virtual machine packages up the virtual hardware, a kernel (i.e. OS) and user space for each new instance, thus requiring a lot of hardware resources. Running multiple VMs on the same host machine degrades system performance, as each virtual OS runs its own kernel and libraries/dependencies, taking a considerable chunk of the host system resources. Virtual machines are also slower to boot up, which becomes critical for real-time production applications. Once a virtual machine is allocated memory, the memory cannot be reclaimed later even if the VM uses only a fraction of its allocation. Virtualization thus often requires adding extra hardware to achieve the desired performance, and is tedious and costly to maintain.





Containerization is a lightweight alternative to full machine virtualization that involves encapsulating an application in a container with its own operating environment. Containers run on the same host operating system and kernel, requiring significantly fewer resources and making booting up a container much faster than booting a virtual machine.

Docker is a containerization platform which packages the application and all its dependencies together in the form of containers, ensuring that the application works seamlessly in any environment, be it development, testing or production. Docker containers, like VMs, have the goal of isolating an application and its dependencies into a self-contained unit that can run anywhere. Each container runs independently of the other containers with process-level isolation. Compared to virtual machines, Docker containers require very little space, start up faster and can be easily integrated with many DevOps tools for automation. A Docker container is allocated only the amount of memory needed to run it, avoiding unused memory being held by any container. Unlike virtual machines, which require hardware virtualization for machine-level isolation, Docker containers operate through isolation within the same operating system. The overhead difference between VMs and containers becomes really apparent as the number of isolated spaces increases. Further, since Docker containers run on the host system kernel, they are very lightweight and fast to start.

A Docker container is an isolated application platform which contains everything needed to run the application. Containers are built from a base Docker image, and dependencies are installed on top of it as image layers. A Docker image is equivalent to an executable which runs specific services in a particular environment; in other words, a Docker container is a live running instance of a Docker image. A Docker registry is a storage component for Docker images. The registry can be the user's local repository or a public repository like Docker Hub, enabling collaboration on building an application.
The Docker engine is the heart of the Docker system; it creates and runs Docker containers. It works as a client-server application, with the server being the Docker daemon process, which the Docker CLI communicates with using REST APIs and socket I/O to create and run Docker containers. The Docker daemon builds an image from its inputs or pulls an image from a Docker registry after receiving the corresponding docker build or docker pull command from the Docker CLI. When a docker run command is received from the Docker CLI, the Docker daemon creates a running instance of the Docker image by creating and starting a Docker container. For Windows and Mac OS X there is an additional Docker Toolbox, which acts as an installer to quickly set up a Docker environment and includes the Docker client, Kitematic, Docker Machine and VirtualBox.
Docker provides various restart policies to allow containers to start automatically when they exit, or when Docker restarts. It is usually preferable to restart a container if it stops, particularly in case of failures.





Docker Machine

Docker Machine is a tool for creating (and managing) virtual hosts with the Docker engine installed, either on a local machine using VirtualBox or on cloud providers such as DigitalOcean, AWS and Azure. The docker-machine commands allow starting, inspecting, stopping, and restarting a managed host, upgrading the Docker client and daemon, and configuring a Docker client to talk to the corresponding host. Docker Machine makes it possible to provision multiple remote Docker hosts on various flavors of Linux and to run Docker on older Windows or Mac operating systems.


Docker Networking

By default docker creates three networks automatically on install: bridge, none, and host.

BRIDGE: All Docker installations represent the docker0 network as bridge; Docker connects containers to the bridge driver by default. Docker also automatically creates a subnet and gateway for the bridge network, and docker run automatically adds containers to it. Containers running on the same network can communicate with one another via IP addresses. Docker does not support automatic service discovery on the bridge network. To connect containers to a particular network, use the "--network" option of the docker run command.

NONE: The none network offers a container-specific network stack that lacks an external network interface. A container on the none network has only a local loopback interface.

HOST: Host enables a container to attach to your host’s network (meaning the configuration inside the container matches the configuration outside the container).

Containers can communicate within networks but not across networks. A container with attachments to multiple networks can connect with all of the containers on all of those networks. The docker network create command allows creating custom isolated networks. Any container created on such a network can immediately connect to any other container on that network. The network isolates containers from other (including external) networks. However, we can expose and publish container ports on the network, allowing portions of the bridge access to an outside network.

Overlay network provides native multi-host networking and requires a valid key-value store service, such as Consul, Etcd, or ZooKeeper. A key-value store service should be installed and configured before creating the network. Multiple docker hosts within overlay network must communicate with the key-value store service. Hosts can be provisioned by docker machine. Once we connect, every container on the network has access to all the other containers on the network, regardless of the Docker host serving the container.

Docker Compose

When a Docker application includes more than one container, building, running, and connecting the containers from separate Dockerfiles is cumbersome and time-consuming. Docker Compose solves this by allowing a multi-container application to be defined in a single YAML file and spun up with a single command. It allows building images, scaling containers, linking containers in a network and defining volumes for data storage. Docker Compose is essentially a time-saving wrapper around the Docker CLI. A docker-compose.yml file is organized into four sections (a minimal example follows the list):

version: It specifies the docker compose file syntax version
services: A service is the name for a container as it runs in production. This section defines the containers that will be started as part of the Docker Compose instance.
networks: This section is used to configure networking for the application. It enables to change the settings of the default network, connect to an external network, or define app-specific networks.
volumes: It enables mounting a linked path on the host machine to be used by the container for persistent storage.
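Below is a minimal, hypothetical docker-compose.yml sketch covering these sections; the service names, image, ports and volume are illustrative.

version: "3"
services:
  web:                          # hypothetical web service built from the local Dockerfile
    build: .
    ports:
    - "8080:80"
    networks:
    - app-net
    depends_on:
    - db
  db:                           # hypothetical database service
    image: postgres:10
    volumes:
    - db-data:/var/lib/postgresql/data
    networks:
    - app-net
networks:
  app-net:
volumes:
  db-data: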


Installing Docker on Ubuntu

$ sudo apt update

$ sudo apt install apt-transport-https ca-certificates curl software-properties-common

$ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -

$ sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu bionic stable"

$ sudo apt update

$ sudo apt install docker-ce

$ sudo apt install docker-compose


Docker Commands

Log in to a Docker registry

docker login -u <docker-username> -p <docker-password>

Pull an image or a repository from a registry

docker pull elkozmon/zoonavigator-api:0.2.3

Docker command to clean up any resources (images, containers, volumes, and networks) which are dangling and not associated with a container.

$  docker system prune

To remove any stopped containers and all unused images (not just dangling images), add the -a flag to the command.

$  docker system prune -a

Remove all stopped containers, unused volumes and unused images. The --force option skips the confirmation prompt during removal.

$  docker system prune --all --force --volumes

Remove all dangling images, i.e. those with no container associated with them, skipping confirmation for removal.

$ docker image prune -f

Remove all unused images, not just dangling ones

$ docker image prune -a

Remove all stopped containers.

docker container prune

Remove all unused local volumes

docker volume prune

To delete all dangling volumes, use the below command

$ docker volume rm `docker volume ls -q -f dangling=true`

Below docker ps command with the -a flag gives the details of the container including its name, container id and ports on which they are running.

$ docker ps -a

The docker ps -a command can also be used to locate the containers and filter them using -f flag by their status: created, restarting, running, paused, or exited.

$ docker ps -a -f status=exited

Build a Docker image using the docker build command. The -t option allows tagging the image.

docker build .
docker build -t username/repository-name .

Remove the container by container name or id using rm command.

$ docker rm <container-id> or <container-name>

Remove (and un-tag) one or more images. The -f option forces removal of an image even when it is referenced by a container.

docker rmi <image-name>

To stop all the docker containers

$ docker stop $(docker ps -a -q)

Then to remove all the stopped containers, pass the docker container ids from docker ps to docker rm command

$ docker rm $(docker ps -a -q)

Create a volume with the specified volume driver using the --driver (-d) option. The --opt (-o) option allows setting driver-specific options.

$ docker volume create -d local-persist -o mountpoint=/mnt/ --name=<volume-name>

Display detailed information of the specified volume

$ docker volume inspect <volume-name>

The docker volume ls command is used to locate the volume name or names to be deleted. Remove one or more volumes using the docker volume rm command as below.

$ docker volume ls
$ docker volume rm <volume_name> <volume_name>

Using the --filter (-f) option, list volumes by filtering only those which are dangling.

docker volume ls -f dangling=true

Get the assigned address for specified docker container

$ docker inspect <container-name>

To get the process id i.e. PID of the specified docker container we use the below command.

$ docker inspect -f '{{.State.Pid}}' <container-name>

Find IP addresses of the container specified by container name. The argument ultimately passed to the docker inspect command is the container id.

$ docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' <container-name>

Display the health check status of a Docker container.

$ docker inspect --format='{{json .State.Health}}' <container-name>
{
  "Status": "healthy",
  "FailingStreak": 0,
  "Log": [
    {
      "Start": "2017-07-21T06:10:51.809087707Z",
      "End": "2017-07-21T06:10:51.868940223Z",
      "ExitCode": 0,
      "Output": "Hello world"
    }
  ]
}
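
The health status above is reported only when the image (or the run command) defines a health check. A minimal sketch of such a definition in a Dockerfile, assuming a hypothetical HTTP endpoint on port 8080, could look as below.

# Hypothetical health check probing an HTTP endpoint every 30 seconds
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
  CMD curl -f http://localhost:8080/health || exit 1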

Execute the specified command in a running docker container. The command is not restarted if the container gets restarted.

$ docker exec <container-name> ps 
$ docker exec -it <container-name> /bin/bash

The exec command comes in handy for all kinds of debugging purposes, e.g. to verify that UDP ports are being listened on, we use the below netstat command.

$ docker exec -it <container-name> netstat -au

Run a one-time command in a new container. The -e option allows to set an environment variable while running the command. The -t option allocates a pseudo-TTY, while -i keeps the STDIN open even if not attached. The docker run command first creates a writeable container layer over the specified image, and then starts it using the specified command.

$ docker run -it -e "ENV=dev" <docker-image-name>

By default a container's file system persists even after the container exits. The --rm flag avoids persisting the container file system for short-term processes and automatically cleans up the container and removes its file system when the container exits. The --rm parameter is ignored in detached mode with the -d (--detach) parameter. By default all containers are connected to a bridge interface to make any outgoing connections, but a custom network can be specified with the --network option.

$ docker run -it --rm --network net postgres-service psql -h postgres-service -U appuser

Get a list of all containers, displaying only their container ids.

$ docker container ls -aq

Display the list of all images along with their repository, tags, and size. Passing a name or tag lists only the matching images.

docker images
docker images java

Create a network with the specified driver using --driver (-d) option.

docker network create -d bridge <network-name>
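
As a quick sketch (the network, container and image names below are hypothetical), containers attached to the same user-defined bridge network can reach each other by container name.

$ docker network create -d bridge app-net
$ docker run -d --name db --network app-net postgres:10
$ docker run -it --rm --network app-net alpine ping db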

Use the docker run command with the --net flag to specify the network to which you wish to connect the container.

docker run --net=<network-name> <image-name>

Get the list of Docker networks

$ docker network ls

Network inspect provides further details on the specified network.

$ docker network inspect <network-name>

To get the details of the default bridge network in JSON format we use below command.

$ docker network inspect bridge

Create a docker machine with a --driver flag indicating the provider on which the machine should be created on e.g. VirtualBox, DigitalOcean, AWS, etc.

$ docker-machine create --driver virtualbox <docker-machine-name>

The docker logs command shows information logged by a running container. The -f option follows the log output and the -t option displays timestamps.

docker logs -t -f <container-name>

Builds, (re)creates, starts, and attaches to containers for a service. Unless they are already running, docker-compose up also starts any linked services. When the command exits it stops all containers.

docker-compose up

Start all docker containers in the background. With the -d (--detach) option specified, docker-compose starts all the services in the background and leaves them running. With the --no-recreate option, containers that already exist are not recreated.

$ docker-compose up -d

Build and start only the specified service.

docker-compose up <service-name>

The --scale flag allows to scale the number of instances of the specified service.

docker-compose up --scale <service-name>=3

The --no-deps argument for the docker-compose up command doesn't start linked services.

$ docker-compose up -d --no-deps --build <container-name>

The --file or -f option allows specifying an alternate compose file instead of the default "docker-compose.yml". Multiple configuration files can be supplied using the -f option; Compose combines them into a single configuration in the order the files were supplied, with subsequent files overriding and adding to their predecessors, as in the example below.

docker-compose -f docker-compose.yml -f docker-compose.dev.yml up
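
As a sketch of how the override works, assume a hypothetical web service; the base file defines the image while the dev file adds development-only settings.

docker-compose.yml (base, hypothetical):

version: '3'
services:
  web:
    image: myorg/web:latest

docker-compose.dev.yml (override, hypothetical):

version: '3'
services:
  web:
    ports:
      - "8080:8080"
    environment:
      - ENV=dev

Running the combined command above starts the web service with the dev port mapping and environment applied on top of the base definition.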

The --build option allows to build images before starting containers.

docker-compose -f docker-compose-dev.yml up --build

The docker-compose up command with the --force-recreate option stops and recreates containers from fresh images every time, even if their configuration and image haven't changed.

$ docker-compose up -d --build --force-recreate <service-name>

The --no-recreate option does not recreate containers if they already exist.

docker-compose up -d --build --no-recreate <service-name>

The --remove-orphans option removes containers for services not defined in the compose file.

docker-compose up -d --build --remove-orphans <service-name>

Starts existing containers for a service.

$ docker-compose start

Stop all the docker containers

$ docker-compose stop

The down command stops containers and removes the containers and networks created by up; with the -v (--volumes) and --rmi options it also removes volumes and images.

docker-compose down

The run command runs a one-time command against a service. It starts a new container and runs the command in it; with the --rm flag the container is removed once the command exits.

$ docker-compose run <container-name> bash

The exec command allows to run arbitrary commands in the services. It is similar to run, but allows to attach to a running container and run commands inside it e.g. for debugging. By default it allocates a TTY to get an interactive prompt.

$ docker-compose exec <container-name> sh

Removes the stopped service containers without asking for any confirmation.

$ docker-compose rm -f

Stop the containers, if required, and then remove all the stopped service containers.

$ docker-compose rm -s

Remove the anonymous volumes that are attached to the containers

$ docker-compose rm -v

Rebuild the service images and tag them. It helps to rebuild whenever there is a change in the Dockerfile or the contents of its build directory.

$ docker-compose build

Get status of docker containers

$ docker-compose ps

The docker-compose logs command displays output logs from all running services. The -f option follows the log output and the -t option shows the timestamps.

$ docker-compose logs -f -t

Validate and view the docker compose file.

docker-compose config

It also allows testing the resultant docker compose file where variables, e.g. $SERVICE_PASSWORD, need to be populated by passing them on the command line before the docker command, as below. It is important to note that docker-compose detects if environment variables have changed in a dependent container compared to the existing running container and recreates the container.

$ SERVICE_PASSWORD=secret docker-compose config
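
A hypothetical snippet from such a compose file using variable substitution is sketched below; the service and image are assumptions for illustration only.

services:
  db:
    image: postgres:10
    environment:
      - POSTGRES_PASSWORD=${SERVICE_PASSWORD}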


Docker Swarm

Swarmkit is a separate project which implements Docker's orchestration layer and cluster management and is embedded in the Docker Engine. Docker swarm is a technique to create and maintain a cluster of Docker Engines. Many docker engines connected to each other form a network, which is called a docker swarm cluster.

A docker manager initializes the swarm in a docker swarm cluster and, along with many other nodes, executes the services. A node is an instance of the Docker engine participating in the swarm. Though the docker manager can execute services, its primary role is to maintain and manage the docker nodes running the services. The docker manager also performs cluster and orchestration management by electing a single leader to conduct orchestration tasks. The manager node uses the submitted service definition to dispatch units of work called tasks to worker nodes. A task is an instance of a running container which is part of a swarm cluster managed by the docker manager. The docker manager assigns tasks to worker nodes according to the number of replicas set in the service scale. Once a task is assigned to a node, it cannot move to another node. The worker nodes receive and execute the corresponding tasks dispatched from the manager node. The docker manager maintains the desired state of each worker node using the current state of assigned tasks reported by the agent running on each worker node. The docker manager has two kinds of tokens, a manager token and a worker token. Worker nodes use the worker token to join the swarm as workers, while another node can join as a docker manager by obtaining the manager token, creating a multi-manager docker cluster. A multi-manager cluster has a single primary docker manager and multiple secondary docker managers. While a request to deploy the application (start a service) can be made to either the primary or a secondary manager, any request to a secondary manager is automatically routed to the primary manager, which is responsible for scheduling/starting containers on the hosts. All the docker managers in a multi-manager cluster form a Raft consensus group. The Raft consensus algorithm enables designing a fault-tolerant distributed system where multiple servers agree on the same information. It allows the election of a leader, and each subsequent request to the leader is appended to its log, with the log of every follower replicated with the same information. It is highly recommended to have an odd number of docker managers (typically 1, 3 or 5) to avoid the split-brain issue where more than one candidate gets an equal majority, i.e. a tie. The worker nodes communicate with each other using a gossip network.

There are two modes in which services are executed in docker swarm, namely replicated or global. The replicated mode allows multiple instances (tasks) of the service to be executed on the same docker host, depending on its load and capacity. It also allows having no instance of the service running on an already loaded docker node. The global mode however ensures that one instance of the service is running on every node of the docker cluster, so that unless all the nodes fail the service would still be up and running across the remaining nodes. It is used for critical services which are required to be up all the time, e.g. a Eureka service.






A service is a higher abstraction which helps to run an image in a swarm cluster, while the swarm manages the individual instances, aka tasks. It is a docker image which docker swarm manages and executes. When a desired service state is declared by creating or updating a service, the orchestrator realizes the desired state by scheduling tasks. Each task is a slot that the scheduler fills by spawning a container. The container is the instantiation of the task. Service creation requires specifying which docker image to use and which commands to execute inside the running containers.

The services requested are divided and executed across multiple docker nodes as tasks to achieve load balancing. Multiple tasks belonging to a single service or to different services can be executed within a docker node. At any point in time when a node goes down in the docker swarm cluster, the docker manager starts the tasks for the services that were running on the stopped docker node on other nodes to balance the load, thus providing high availability of services. Automatic load balancing ensures that during any node downtime the docker manager runs the corresponding down services on other nodes, and also scales the services onto multiple nodes during periods of high load. The docker manager uses an internal DNS server for load balancing, which connects all the nodes in the docker cluster. Decentralized access allows a service deployed on any node to be accessed from other nodes in the same cluster. Docker swarm also allows seamless rolling updates for each service with a delay between individual nodes. Docker Swarm manages the individual containers on the node for us.

A stack is a group of interrelated services that share dependencies, and can be orchestrated and scaled together. A single stack is capable of defining and coordinating the functionality of an entire application. The stack abstraction goes beyond the individual services and deals with the entirety of application services, which are closely interlinked or interdependent. Stacks allow for multiple services, which are containers distributed across a swarm, to be deployed and grouped logically. The services running in a Stack can be configured to run several replicas, which are clones of the underlying container. The stack is configured using a docker-compose file and it takes one command to deploy the stack across an entire swarm of Docker nodes. Stacks are very similar to docker-compose except they define services while docker-compose defines containers. Docker stack simplifies deployment and maintenance of multiple inter-communicating microservices and is ideal for running stateless processing applications.
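
As a minimal sketch (the service name and image are hypothetical), a compose file used with docker stack deploy can carry a deploy section, honored by the swarm but ignored by plain docker-compose, to control replicas and restart behavior. Such a file is then deployed using the docker stack deploy command covered in the next section.

version: '3'
services:
  api:
    image: myorg/api:latest
    ports:
      - "8080:8080"
    deploy:
      replicas: 3
      restart_policy:
        condition: on-failure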




Docker Swarm Commands

Initialize Docker Swarm using the swarm init command below. The specified <ip-addr> is the docker manager node's ip address, which ideally should be the address of the current machine.

$ docker swarm init --advertise-addr <ip-addr>

The swarm init command's --listen-addr option allows the current node to listen for inbound swarm manager traffic on the specified IP address.

$ docker swarm init --listen-addr <ip-addr>:2377

The swarm join command allows the current node to join the swarm cluster at the specified manager address, as a worker or a manager depending on the token supplied.

$ docker swarm join --token <token> <ip-addr>:2377
docker swarm join --token <worker-token> <manager>:2377

Create a multi-manager docker cluster by joining the swarm using the manager token. Below we add a second docker manager to the docker cluster.

$ docker swarm join --token <manager-token> --listen-addr <master2-addr>:2377 <master1-addr>:2377

The join-token command below manages the swarm join tokens. It is usually used to print the docker swarm join command (with the token) required to add a new node to the current swarm as a worker or a manager.

docker swarm join-token (worker|manager)

Leave the current swarm cluster. When the command is run on a worker, the worker node leaves the swarm. The --force option is required on a docker manager to make it leave the swarm cluster.

docker swarm leave --force

All the service commands below can only be run on a docker manager node.

Below command lists all the services running inside docker swarm.
$ docker service ls

Below command lists the tasks of the specified service and the nodes they run on.
$ docker service ps <name>

Create new services, publishing the container port on the specified host port.

$ docker service create --name <name> -p <host-port>:<container-port> <image-name>
$ docker service create --name <name> --publish <host-port>:<container-port> --replicas 2 <image-name>
$ docker service create --name <name> alpine ping <host-name>

The --replicas option specifies the number of tasks (instances) the newly created service will run.

$ docker service create --replicas 3 --name <name> <image-name>

With the mode set to global for the service create command, the specified image is downloaded and the corresponding service is started on every single node of the cluster.

$ docker service create --mode=global --name=<name> <image-name>

Remove service running in docker swarm.

$ docker service rm <name>

Scale one or more replicated services

$ docker service scale <name>=5

Display details regarding the specified service

$ docker service inspect <name> --pretty

Update the service by increasing the number of replicated services.

$ docker service update --replicas 10 <name>

The service logs command shows information logged by all containers participating in a service or task.

docker service logs <name>

List all the nodes present in the swarm cluster

$ docker node ls

Lists all the tasks running on the current node (by default).

$ docker node ps

Removes one or more nodes specified by id from the swarm cluster

$ docker node rm <id>

Stop allocating services to Manager-1 node

$ docker node update --availability drain <Manager-1>

Start allocating services to Manager-1 node

$ docker node update --availability active <Manager-1>

Deploy a new stack or update an existing stack. The --compose-file (-c) option allows to provide path to the docker compose file.

$ docker stack deploy <stack-name>
$ docker stack deploy -c docker-compose.yml <stack-name>

List all the stacks

$ docker stack ls

List all the services in the specified stack

$ docker stack services <stack-name>

List the tasks in the specified stack

$ docker stack ps <stack-name>