Large Language Models (LLMs) - Empowering Generative AI<p><a href="https://chat.openai.com/" target="_blank">ChatGPT</a> from <a href="https://openai.com/" target="_blank">OpenAI</a> is the <a href="https://www.businessofapps.com/data/chatgpt-statistics/" target="_blank">fastest growing</a> consumer application in human history. ChatGPT is the chatbot variant of GPT 3.5 (Generative Pre-trained Transformer), the Large Language Model from OpenAI. The GPT model was trained on 570 GB of text data from the internet, digital books, Wikipedia and many other sources. It builds billions of connections between the words in this training data, which it uses to answer any given question. The GPT model was further trained using reinforcement learning from human feedback to align its responses to be truthful, helpful and harmless. GPT 3.5 has 175 billion parameters spread across 96 layers in a neural network.</p><div><br /></div><h3 style="text-align: left;"><b>Artificial Neural Networks</b></h3><div>Artificial Neural Networks (ANNs) are modeled after the neurons in the human brain. ANNs contain artificial neurons, called units, arranged in a series of layers that together constitute the whole network. A layer can have anywhere from a dozen units to millions of units, depending on how complex a network is required to learn the hidden patterns in the dataset. An Artificial Neural Network commonly has an input layer, an output layer and one or more hidden layers. The input layer receives data from the outside world which the neural network needs to analyze or learn about. This data then passes through one or more hidden layers that transform the input into data that is valuable for the output layer. Finally, the output layer provides an output in the form of the network's response to the input data. In the majority of neural networks, units are interconnected from one layer to another, and each of these connections has a weight that determines the influence of one unit on another. As the data transfers from unit to unit, the neural network learns more and more about the data, which eventually results in an output from the output layer. In the hidden layers, each neuron receives input from the neurons of the previous layer, computes the weighted sum, and sends it to the neurons in the next layer. Because these connections are weighted, the effect of each input from the previous layer is scaled by its weight, and the weights are adjusted during the training process to improve model performance. 
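<br /><br />As a minimal illustration of the weighted-sum computation described above, the forward pass of a tiny network can be sketched in Python (the layer sizes, random weights and sigmoid activation below are purely illustrative assumptions, not part of any particular model):
<pre class="brush: plain">import numpy as np

def sigmoid(z):
    # squash the weighted sum into the (0, 1) range
    return 1.0 / (1.0 + np.exp(-z))

# arbitrary example dimensions: 3 inputs, 4 hidden units, 2 outputs
rng = np.random.default_rng(0)
x = rng.random(3)                    # input layer values
w_hidden = rng.random((4, 3))        # weights: input layer to hidden layer
w_output = rng.random((2, 4))        # weights: hidden layer to output layer

hidden = sigmoid(w_hidden @ x)       # each hidden unit computes a weighted sum of its inputs
output = sigmoid(w_output @ hidden)  # the output layer produces the network's response
print(output)
</pre>
During training, backpropagation (described below) would adjust w_hidden and w_output to reduce the error of the output.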
</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjoXfu23z5ksWKixLy7ts6ql4vQ59gwI2zhBZbOMIIIQuA-E1coS0TqptMzale2EUvZQlW-Mh2gHmG9Mz8d1YvKHhRboB9QKI2GsSR84hmBG6k6dwBzQbXd0r-gE0Ie-__bZe8U-Xv04U1MUGVapGPjOM_6s-iN09WnaLgtQxk6F3jTBQYr6KPC9IQYDng/s602/The-Architecture-of-a-Neural-Network.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="293" data-original-width="602" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjoXfu23z5ksWKixLy7ts6ql4vQ59gwI2zhBZbOMIIIQuA-E1coS0TqptMzale2EUvZQlW-Mh2gHmG9Mz8d1YvKHhRboB9QKI2GsSR84hmBG6k6dwBzQbXd0r-gE0Ie-__bZe8U-Xv04U1MUGVapGPjOM_6s-iN09WnaLgtQxk6F3jTBQYr6KPC9IQYDng/s16000/The-Architecture-of-a-Neural-Network.png" /></a></div><br /><div><div>Artificial neural networks are trained using a training set. The output produced by the ANN is compared against the human-provided expected output. Backpropagation is used to make adjustments by fine-tuning the weights of the connections in the ANN units based on the error rate obtained. The depth, number of hidden layers, and I/O capabilities of each node are a few criteria used to classify neural networks. Types of neural network models are:</div></div><div><div><ul style="text-align: left;"><li>Feedforward artificial neural networks.</li><li>Perceptron and Multilayer Perceptron neural networks.</li><li>Radial basis functions artificial neural networks.</li><li>Recurrent neural networks.</li><li>Modular neural networks.</li></ul></div></div><div><br /></div><div><h3 style="text-align: left;"><b>Supervised and Unsupervised Learning</b></h3><div>Supervised machine learning is the type of machine learning in which machines are trained using well "labelled" training data, on the basis of which machines can predict the output. Labelled data is input data that is already tagged with the correct output. In supervised learning, the training data provided to the machines works as a supervisor (teacher) that teaches the machines to predict the output correctly. The aim of supervised learning, by providing both input data and the correct output data to the machine learning model, is to find a mapping function that maps the input variable (x) to the output variable (y). Supervised learning can be further divided into Regression and Classification problems. Unsupervised machine learning uses machine learning algorithms to analyze and cluster unlabeled datasets. These algorithms discover hidden patterns or data groupings (based on similarities or differences) without the need for human intervention. The role of an unsupervised learning algorithm is to discover the underlying structure of an unlabeled dataset by itself. Clustering methods involve grouping untagged data based on their similarities and differences. When two instances appear in different groups, we can infer they have dissimilar properties. The goal of supervised learning is to build a model that performs well on new data, and <a href="https://builtin.com/data-science/train-test-split" target="_blank">train test split</a> is a model validation procedure that allows you to simulate how a model would perform on new/unseen data, as sketched below. 
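<br /><br />A minimal sketch of a train/test split using scikit-learn (the dataset, split ratio and random seed here are illustrative choices):
<pre class="brush: plain">from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# load a small labelled dataset and hold out 20% of it for evaluation
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)  # the model is fit on X_train and evaluated on X_test
</pre>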
Below are some of the prominent machine learning techniques/algorithms.</div><div><ul style="text-align: left;"><li><a href="https://www.analyticsvidhya.com/blog/2022/01/different-types-of-regression-models/" target="_blank">Regression Analysis</a> is used in machine learning for prediction and forecasting. Linear Regression is used to handle regression problems whereas Logistic Regression is used to handle classification problems.</li><li><a href="https://www.analyticsvidhya.com/blog/2018/03/introduction-k-neighbours-algorithm-clustering/" target="_blank">K-nearest neighbors (kNN)</a> is a non-parametric, supervised learning classifier, which uses proximity to make classifications or predictions about the grouping of an individual data point.</li><li><a href="https://www.analyticsvidhya.com/blog/2021/10/support-vector-machinessvm-a-complete-guide-for-beginners/" target="_blank">Support vector machines (SVMs)</a> are a set of supervised learning methods used for classification, regression and outlier detection.</li><li><a href="https://www.ibm.com/topics/decision-trees" target="_blank">Decision Tree (DT)</a> is a decision support hierarchical model that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility.</li><li><a href="https://www.analyticsvidhya.com/blog/2021/06/understanding-random-forest/" target="_blank">Random Forest (RF)</a> combines the output of multiple decision trees to reach a single result and handles both classification as well as regression problems.</li><li><a href="https://www.analyticsvidhya.com/blog/2021/06/tune-hyperparameters-with-gridsearchcv/" target="_blank">GridSearchCV (Grid Search Cross-Validation)</a> is a technique used in machine learning to search and find the optimal combination of hyperparameters for a given model.</li></ul></div><div><br /></div><div>Next Token Prediction is the training objective used by GPT-style models, where the model learns to predict the next token (word) in a sequence given all the preceding tokens.</div><div><br /></div><div>Masked Language Modeling is the training objective used by models such as BERT, where random tokens in the input are masked and the model learns to predict the masked tokens from the surrounding context.</div></div><div><br /></div><h3 style="text-align: left;"><b>GPT Model</b></h3><div><div><div>Sequence-to-sequence (seq2seq) models are used to convert one type of sequence to another. They are based on Recurrent Neural Networks (RNNs). RNNs are good at processing sequences because they can remember the previous inputs in the sequence. Adding an attention mechanism to seq2seq models allows them to focus on specific parts of the input sequence when generating the output sequence. However, seq2seq models are unable to handle long-range dependencies in the input (paragraphs of text or essays) and cannot be parallelized, hence they take a long time to process.</div></div><div><div> </div><div>The Transformer is an architecture that was proposed in the paper <a href="https://arxiv.org/abs/1706.03762" target="_blank">Attention Is All You Need</a> in June 2017. A Transformer consists of an encoder which works on the input and a decoder which works on the target output. The Transformer takes a sequence of tokens (words) and predicts the next word in the output sequence. It iterates through the encoder layers to generate encodings.</div></div></div><div><br /></div><div><div>The Transformer relies on self-attention to compute representations of its input and output sequences. Self-attention allows the Transformer to attend to all positions in the input sequence, regardless of their distance. 
This makes it possible for the Transformer to handle long-range dependencies more effectively than RNNs.</div><div><br /></div><div>The architecture of the Transformer consists of two main parts: the encoder and the decoder. The encoder reads the input sequence and creates a representation of it. The decoder uses the representation created by the encoder to generate the output sequence. Both the encoder and decoder are made up of a stack of self-attention layers. The self-attention layers allow the Transformer to attend to all positions in the input or output sequence, regardless of their distance.</div><div><br /></div><div>In the below image, the Encoder block has a Multi-Head Attention layer followed by a Feed Forward Neural Network layer. The decoder, on the other hand, has an extra Masked Multi-Head Attention layer. The encoder and decoder blocks are actually multiple identical encoders and decoders stacked on top of each other. Both the encoder stack and the decoder stack have the same number of units, and the number of encoder and decoder units is a hyperparameter.</div></div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijPa_k46wpcj8jPMAV61taSdu2uQR03DJ_DB_5HF6ElnqzO55dpjs9QCrphZaAyQ8Ira4syb08b6oLYtsEOHGLRzF4CGH_ACRe5Cya0izYad7MS7q2KbGC9tvhu5my8b5K996TmzVlYr8fENCsn5oDZ-sooELtSJdAq_ipJBWsA2Z0A2YhGq2i7_Q2SLY/s1860/attention_research_1.webp" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1860" data-original-width="1320" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijPa_k46wpcj8jPMAV61taSdu2uQR03DJ_DB_5HF6ElnqzO55dpjs9QCrphZaAyQ8Ira4syb08b6oLYtsEOHGLRzF4CGH_ACRe5Cya0izYad7MS7q2KbGC9tvhu5my8b5K996TmzVlYr8fENCsn5oDZ-sooELtSJdAq_ipJBWsA2Z0A2YhGq2i7_Q2SLY/w454-h640/attention_research_1.webp" width="454" /></a></div><br /><div><div>Positional Encoding tags each word in the input with an ordering number before it is sent for processing, storing information about the word order in the data itself rather than within the structure of the network. As the model is trained with a lot of text data, it learns how to interpret those position encodings.</div><div><div><br /></div><div>The word embeddings of the input sequence are passed to the first encoder. These are then transformed and propagated to the next encoder. 
The output from the last encoder in the encoder stack is passed to all the decoders in the decoder stack as shown in the figure below.</div></div><div><br /></div><div><br /></div></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgOjUahpvsxvo8iCGhVObfeCkEMqo3ED-VjD8zzjYh7ZJY2wBxSRGk0JZR7ilqr83KeMU_5mTI1NAAKmXi0KXgNOeNcBtUG95c1x57KQdXTHx-kfpmbLJxdmJd3VHbVyu5rBULvfN8tOABxlmPX3G-5cpwwJ1klhvQiZaR4wkzXqGpcbfs3sRzkOlH7dU4/s630/0_VFZ6MxNBThxpNC8k.webp" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="344" data-original-width="630" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgOjUahpvsxvo8iCGhVObfeCkEMqo3ED-VjD8zzjYh7ZJY2wBxSRGk0JZR7ilqr83KeMU_5mTI1NAAKmXi0KXgNOeNcBtUG95c1x57KQdXTHx-kfpmbLJxdmJd3VHbVyu5rBULvfN8tOABxlmPX3G-5cpwwJ1klhvQiZaR4wkzXqGpcbfs3sRzkOlH7dU4/s16000/0_VFZ6MxNBThxpNC8k.webp" /></a></div>In addition to the self-attention and feed-forward layers, the decoders also have an additional Encoder-Decoder Attention layer, which helps the decoder focus on the appropriate parts of the input sequence.<br /><div><br /></div><div>Self-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. The self-attention layer calculates a score for each pair of words in the sentence, and then uses these scores to determine how much attention to pay to each word when computing the representation of the word.</div><div><br /></div><div>The LLaMA 2 model is a family of open foundation models released by Meta, available in multiple parameter sizes and widely used as a base for fine-tuned open source models.</div><div><br /></div><div>The Jurassic models are large language models from AI21 Labs.</div><div><br /></div><div>Amazon Titan is Amazon's family of foundation models, offered through Amazon Bedrock.</div><div><br /></div><h3 style="text-align: left;"><b>OpenSource Models</b></h3><div>There are over 325,000 open source models on <a href="https://huggingface.co/models" target="_blank">HuggingFace</a>. These models are smaller compared to the proprietary models and have considerably fewer parameters, but they can be fine-tuned or modified based on anyone's requirements. Many open source models are variations of the LLaMA 2 model provided by Meta. The <a href="https://lmsys.org/blog/2023-03-30-vicuna/" target="_blank">Vicuna</a> model is an open source model created on top of LLaMA 2. The <a href="https://bigscience.huggingface.co/blog/bloom" target="_blank">Bloom</a> model is a multi-lingual language model created by BigScience. IBM's <a href="https://www.ibm.com/products/watsonx-ai/foundation-models">watsonx.ai</a> also offers many versions of Llama 2 and other foundation models.</div><div><br /></div><div><h3 style="text-align: left;"><b>Disadvantages of LLMs</b></h3><div><ul style="text-align: left;"><li>Hallucinations result when LLMs are trained on incomplete, contradictory or inaccurate data, or when they misunderstand context.</li><li>Bias happens when the source of training data is not diverse or not representative.</li><li>Security issues occur when private personal information (PPI) contained in the training data is exposed.</li></ul></div></div><div><br /></div><div><br /></div><div><br /></div><br /><div><br /></div>Prometheus: Metric Collection and Analysis<div>Prometheus is an open-source monitoring and alerting tool. It collects data from applications, enables visualizing that data, and issues alerts based on it. 
Prometheus collects metrics as time series data, and processes, filters, aggregates, and represents them in a human-readable format. It supports a multi-dimensional data model with time series data, where data can be sliced and diced along dimensions like instance, service, endpoint, and method. The metric dimensions (labels) are in the form of key-value pairs.</div><div><br /></div><div>Prometheus scrapes metrics from instrumented jobs, either directly or via an intermediary push gateway for short-lived jobs. It stores all scraped samples locally and runs rules over this data to either aggregate and record new time series from existing data or generate alerts. Grafana or other API consumers can be used to visualize the collected data. Prometheus stores all data as time series i.e. streams of timestamped values belonging to the same metric and the same set of labeled dimensions. Timestamps are stored in milliseconds, while values are always 64-bit floats.</div><div><br /></div><div>Prometheus provides PromQL, a flexible query language which supports the multi-dimensional data model and allows filtering and aggregation based on these dimensions. It supports autonomous single server nodes. It enables discovering targets via service discovery or static configuration. Prometheus supports templating in the annotations and labels of alerts. A Prometheus workspace is limited to a single region.</div><div><br /></div><div><b><span style="font-size: large;">Architecture</span></b></div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhi52YJAPfgMXMTj06aT9oUdPiX5vyv7Dw1fhtsJW57nErK5uqXC9YMj1FxYYbTyMQHw7g8lms3jWk80fbBXPnUMdpa_dnoF8EjqySzUR6iieYmDk1uObhMu2tC9ZA92RLftmnie8xh8QKJDy1YFjvwgEp94KSxzkFApEu4puSUInHlHznkTiouGPSW/s1351/architecture.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="811" data-original-width="1351" height="384" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhi52YJAPfgMXMTj06aT9oUdPiX5vyv7Dw1fhtsJW57nErK5uqXC9YMj1FxYYbTyMQHw7g8lms3jWk80fbBXPnUMdpa_dnoF8EjqySzUR6iieYmDk1uObhMu2tC9ZA92RLftmnie8xh8QKJDy1YFjvwgEp94KSxzkFApEu4puSUInHlHznkTiouGPSW/w640-h384/architecture.png" width="640" /></a></div><br /><div><br /></div><div><br /></div><div>The Prometheus Server collects multi-dimensional data as time series and then analyzes and aggregates the collected data. The process of collecting metrics is called scraping. Time series format means data is collected at successive, fixed time intervals.</div><div><br /></div><div>The Prometheus server automatically pulls metrics from the targets; hence the user does not need to push metrics for analysis. The client needs to create an HTTP endpoint with /metrics, which returns the complete set of metrics accessible to Prometheus. Prometheus supports four core metric types, namely counter which is a cumulative metric, gauge which can arbitrarily go up and down, histogram which samples observations into configurable buckets, and summary which is a client-side calculated distribution of observations. The Prometheus Pushgateway is an intermediary used for metrics from those jobs which cannot be scraped by the usual methods, albeit it comes with some drawbacks if not used properly.</div><div><br /></div><div>The Prometheus server uses a pull method to collect metrics by reaching out to exporters to pull data. An exporter is any application that exposes data in a format Prometheus can read. 
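<br /><br />As a rough sketch of such an exporter (using the official Python client library; the metric names and port below are illustrative), an application can expose its own /metrics endpoint for Prometheus to scrape:
<pre class="brush: plain">import random
import time
from prometheus_client import Counter, Gauge, start_http_server

# metrics exposed by this application
REQUESTS = Counter("app_requests_total", "Total requests handled")
IN_PROGRESS = Gauge("app_requests_in_progress", "Requests currently being handled")

if __name__ == "__main__":
    start_http_server(8000)   # serves the /metrics endpoint on port 8000
    while True:
        IN_PROGRESS.inc()
        REQUESTS.inc()        # a counter only ever goes up
        time.sleep(random.random())
        IN_PROGRESS.dec()     # a gauge can go up and down
</pre>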
The scrape_config in prometheus.yml configures the Prometheus server to regularly connect to each exporter and collect metric data. Exporters do not reach out to Prometheus. Such a pull-based metric system also helps in scraping metrics remotely. However, there are some use cases where a push method is necessary, such as monitoring batch job processes. The Prometheus Pushgateway serves as a middle-man for such use cases, where the client application pushes metric data to the Pushgateway. The Prometheus server pulls metrics from the Pushgateway, just like any other exporter.</div><div><br /></div><div>Recording rules allow pre-computing the values of expressions and queries, and saving the results as their own separate set of time-series data. Recording rules are evaluated on a schedule, executing an expression and saving the result as a new metric. Recording rules are especially useful when we have complex or expensive queries that are executed frequently. For example, by saving pre-computed results using a recording rule, the expression does not need to be re-evaluated every time someone opens a dashboard. Recording rules are configured using YAML. Create them by placing YAML files in the location specified by rule_files in prometheus.yml. When creating or changing recording rules, reload their configuration the same way as when changing prometheus.yml.</div><div><br /></div><div>Alertmanager is an application that runs in a separate process from the Prometheus server. It is responsible for handling alerts sent to it by clients such as the Prometheus server. Alerts are notifications that are triggered automatically by metric data. Alertmanager does not create alerts or determine when alerts need to be sent based on metric data. Prometheus handles that step and forwards the resulting alerts to Alertmanager. Alertmanager does the following:</div><div><ul style="text-align: left;"><li>Deduplicating alerts when multiple clients send the same alert.</li><li>Suppressing or muting notifications for a particular time frame based on a label set.</li><li>Grouping multiple alerts together when they happen around the same time.</li><li>Routing alerts to the proper destination such as email, or another external alerting service such as PagerDuty or OpsGenie.</li></ul></div><div><br /></div><div><br /></div>Terraform - Infrastructure As Code<div>Since the emergence of cloud services such as AWS, Azure and Google Cloud, more and more corporations are moving away from on-premises infrastructure, saving investment and maintenance costs and relieving themselves of infrastructure management. As the size and complexity of an application grows, manual setup becomes time consuming and prone to errors. Large applications require multiple cloud resources, custom configuration and role/user based permissions. 
In order to automate the provisioning of these resources and replace error-prone manual processes, the concept of infrastructure as code, similar to programming scripts, has become more prevalent.</div><div><br /></div><div><br /></div><div><b><span style="font-size: large;">Infrastructure as Code (IaC)</span></b></div><div><br /></div><div><div><a href="https://searchitoperations.techtarget.com/definition/Infrastructure-as-Code-IAC" target="_blank">Infrastructure as code</a> codifies and manages underlying IT infrastructure as software. It allows easily setting up multiple environments to develop, test and pilot the application. All the environments are consistent with the production environment, given the same code is used to set up all of them. With the infrastructure setup written as code, it can go through the same version control, automated testing and other steps of a continuous integration and continuous delivery (CI/CD) pipeline as application code. Infrastructure as code does require additional tools, such as a configuration management and automation/orchestration system, which could introduce learning curves and room for errors. </div><div><br /></div><div>Immutable infrastructure is preferred for highly scalable cloud and microservices environments, where a set of components and resources are assembled to create a full service or application. When a change is required for any component, the components are not changed or reconfigured in place, but are updated and effectively redeployed as new instances. Mutable infrastructure, on the other hand, is preferred in legacy systems, where the infrastructure components are changed in production while the overall service or application continues to operate as normal.</div><div><br /></div><div>Infrastructure-as-code can be declarative or imperative. A declarative approach outlines the desired, intended state of the infrastructure, but does not explicitly list the steps to reach that state, e.g. <a href="https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/Welcome.html" target="_blank">AWS CloudFormation</a> templates. An imperative approach defines commands that enable the infrastructure to reach the desired state, for example a <a href="https://docs.chef.io/resources/script/" target="_blank">Chef script</a> or <a href="https://www.ansible.com/" target="_blank">Ansible</a>. In both approaches we have a template which specifies the resources to be configured on each server and allows verifying or setting up the corresponding infrastructure.</div><div><br /></div><div>Infrastructure-as-code tools configure and automate the provisioning of infrastructure. These tools can automatically execute the deployment of infrastructure, such as servers, along with orchestration functionality. They can also configure and monitor previously provisioned systems. These tools enforce the setup from the template via push or pull methods. They also enable rolling back changes to the code, as in the event of unexpected problems from an update.</div></div>
<div><br /></div>
<div><br /></div><div><b><span style="font-size: large;">Terraform</span></b></div><div><br /></div><div>Terraform is an open-source Infrastructure as Code tool developed by <a href="https://www.hashicorp.com/" target="_blank">HashiCorp</a>. It is used to define and provision the complete infrastructure using its declarative <a href="https://www.terraform.io/docs/language/syntax/configuration.html" target="_blank">HashiCorp Configuration Language</a> (HCL). It also supports JSON configuration. It enables storing the cloud infrastructure setup as code. Terraform manages the life cycle of resources from provisioning and configuration to decommissioning. Compared to CloudFormation, which only allows automating AWS infrastructure, Terraform works with multiple cloud platforms such as <a href="https://aws.amazon.com" target="_blank">AWS</a>, <a href="https://azure.microsoft.com/" target="_blank">Azure</a>, <a href="https://cloud.google.com" target="_blank">GCP</a>, <a href="https://www.digitalocean.com/" target="_blank">DigitalOcean</a> and many more. It supports multiple tiers such as <a href="https://en.wikipedia.org/wiki/Software_as_a_service" target="_blank">SaaS</a>, <a href="https://en.wikipedia.org/wiki/Platform_as_a_service" target="_blank">PaaS</a> and <a href="https://en.wikipedia.org/wiki/Infrastructure_as_a_service" target="_blank">IaaS</a>. Terraform provides both configuration management and orchestration. Terraform can be used for creating or provisioning new infrastructure, managing existing infrastructure, and replicating infrastructure. Terraform allows defining infrastructure in a config file, which can be used to track infrastructure changes using source control, and to build and deploy the infrastructure.</div><div><br /></div><div>To install Terraform, download the <a href="https://www.terraform.io/downloads.html" target="_blank">latest available binary package</a>, unzip the archive into a directory and include the path to the binary (bin) directory in the system PATH environment variable. Set the Terraform plugin cache by creating a ~/.terraformrc file and adding:</div><div><br /></div><div><b>plugin_cache_dir = "$HOME/.terraform.d/plugin-cache"</b></div><div><br /></div><div><br /></div><div><b><span style="font-size: large;">Terraform Architecture</span></b></div><div><br /></div><div>Terraform has three main components: the Core, the Plugins and the Upstream APIs. </div><div><br /></div><div>The Core is responsible for reading configuration and building the dependency graph. It creates a plan of what resources need to be created/changed/removed. </div><div><br /></div><div><div>Terraform Plugins are external single static binaries, with the Terraform Provider Plugin as the most common type of plugin. During the planning and applying phases, Terraform's Core communicates with the plugins via an RPC interface. Terraform Provider Plugins implement resources with a basic CRUD (create, read, update, and delete) API for communicating with third party services. The providers make it possible to create infrastructure across all these platforms.</div><div><br /></div><div>Upstream APIs are third party, external services with which Terraform interacts. 
The Terraform Core asks the Terraform Provider Plugin to perform an operation, and the plugin communicates with the upstream API.</div></div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-8MT_9bw8FjE/YOPmJK52huI/AAAAAAAAltM/kmXz14MJmNspur8SxerbcF145cvOQb1wgCLcBGAsYHQ/s2667/Terraform_Architecture.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1179" data-original-width="2667" height="283" src="https://1.bp.blogspot.com/-8MT_9bw8FjE/YOPmJK52huI/AAAAAAAAltM/kmXz14MJmNspur8SxerbcF145cvOQb1wgCLcBGAsYHQ/w640-h283/Terraform_Architecture.png" width="640" /></a></div><br /><div><br /></div><div><br /></div><div><span style="font-size: large;"><b>HashiCorp Configuration Language (HCL)</b></span></div><div><br /></div><div><div>HCL syntax is defined using blocks which contains arguments and configuration data represented as key/value pairs . Block syntax usually has the block name e.g. resource, followed by the Resource type which is dependent on the provider e.g. "local_file" and finally the resource name in order to identify the resource. The resource type contains the provider type (e.g. local, AWS etc) before the underscore, followed by the actual resource type (e.g. file, EC2 etc). The arguments of the resource depends on the type of the provider and type of the resource being created. Each resource type expects specific arguments to be provided.</div></div><div><br /></div><div>Syntax:</div><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><div style="text-align: left;"><div><b><block> <parameters> {</b></div></div><b> key1 = value1</b><div style="text-align: left;"><div><b> key2 = value2</b></div></div><b>}</b></blockquote><div><br /></div>
<pre class="brush: plain">resource "local_file" "config" {
filename = "/root/config.txt"
content = "This is system config."
}
</pre>
<div><div><br /></div><div><br /></div><div><b><span style="font-size: large;">Terraform Core Commands</span></b></div><div><br /></div><div><span style="font-size: medium;"><b>Init Command </b></span></div><div><br /></div><div></div><div><a href="https://www.terraform.io/docs/cli/commands/init.html" target="_blank">Init command</a> is the first command to run before we can start using Terraform. The terraform binary contains the basic functionality for Terraform and does not come with the code for any of the providers (e.g., AWS provider, Azure provider, GCP provider, etc). The init command checks the terraform configuration file and initializes the working directory containing the .tf file. Terraform executes the code in the files with the .tf extension. Based on the provider type declared in the configuration, it downloads the plugins to work with the corresponding resources declared in the tf file. By default, the provider code will be downloaded into a .terraform folder. It also downloads the additional modules referenced by the terraform code. The init command also sets up the backend for storing the Terraform state file, which allows Terraform to track resources. The init command can be run multiple times as it is idempotent.</div><div><div><div><br />$ <b><span style="color: #2b00fe;">terraform init</span></b></div><div><br /></div></div><div><div>In order to initialize the Terraform working directory without accessing any configured remote backend, we use the below command.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">terraform init -backend=false</span></b></div></div><div><br /></div><div>The -upgrade flag upgrades all modules and providers to the latest version.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">terraform init -upgrade</span></b></div><div><br /></div><div><br /></div><div><b><span style="font-size: medium;">Plan Command</span></b></div><div><br /></div><div><a href="https://www.terraform.io/docs/cli/commands/plan.html" target="_blank">Plan command</a> reads the terraform code and shows the actions that would be carried out by terraform to create the resources. The output of the <a href="https://www.terraform.io/docs/cli/commands/plan.html" target="_blank">plan command</a> shows the differences, where resources with a plus sign (+) are going to be created, resources with a minus sign (-) are going to be deleted, and resources with a tilde sign (~) are going to be modified in-place. It allows users to review the action plan before execution and ensure that all the actions performed by the execution/deployment plan are as desired. The plan command does not deploy anything and is considered a read-only command. 
Terraform uses the authentication credentials to connect to the cloud platform where the infrastructure is to be deployed.</div><div><br /></div><div><div>$ <b><span style="color: #2b00fe;">terraform plan</span></b></div><div><br /></div></div><div><div>The -out=FILE option allows saving the generated plan into a file.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">terraform plan -out <file-name></span></b></div><div><br /></div></div><div>The -destroy option outputs the destroy plan.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">terraform plan -destroy</span></b></div><div><br /></div><div><div>To plan only a specific target module</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">terraform plan -target module.<module-name></span></b></div></div><div><br /></div><div><br /></div><div><b><span style="font-size: medium;">Apply Command</span></b></div><div><br /></div><div><a href="https://www.terraform.io/docs/cli/commands/apply.html" target="_blank">Apply command</a> executes the actions proposed in a Terraform plan and applies the infrastructure changes to the cloud platform. Terraform executes the code in the files with the .tf extension. It displays the execution plan once again and asks for confirmation to create the resources. Once confirmed, it creates the specified resources. It also records the deployment changes in the state file which keeps track of all infrastructure updates. The state file can be stored locally or in a remote location and is usually named "terraform.tfstate" by default.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">terraform apply</span></b></div><div><br /></div><div><div>The --auto-approve flag skips the interactive approval prompt to enter "yes" before applying the plan.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">terraform apply --auto-approve</span></b></div></div><div><br /></div><div>The apply command can also take the filename of a saved plan file created earlier using the terraform plan command with -out=... and directly apply the changes without any confirmation prompt.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">terraform apply <saved-plan-file></span></b></div><div><br /></div><div>When the terraform apply command is run without a saved plan file, terraform apply supports all of terraform plan's <a href="https://www.terraform.io/docs/cli/commands/plan.html#planning-modes" target="_blank">planning modes</a> and <a href="https://www.terraform.io/docs/cli/commands/plan.html#planning-options" target="_blank">planning options</a>.</div><div><br /></div><div>The below command applies/deploys changes only to the targeted resource. 
The <a href="https://www.terraform.io/docs/cli/state/resource-addressing.html" target="_blank">resource address syntax</a> is used to specify the target resource.</div><div><br /></div><div><div>$ <b><span style="color: #2b00fe;">terraform apply -target=<resource-address></span></b></div></div><div>$ <b><span style="color: #2b00fe;">terraform apply -target=aws_instance.my_ec2 </span></b></div><div><br /></div><div><div>The -lock option (enabled by default) holds the lock to the state file so that others cannot concurrently run commands against the same workspace and modify the state file.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">terraform apply -lock=true</span></b></div></div><div><br /></div><div><div>The terraform apply command can pass the -lock-timeout=<time> argument which tells Terraform to wait up to the specified time for the lock to be released. Another user cannot execute the apply command with the same terraform state file until the lock is released. In the below example Terraform will wait up to 10 minutes for the lock to be released.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">terraform apply -lock-timeout=10m</span></b></div></div><div><br /></div><div><div>The -refresh=false option is used to skip reconciling the state file with real-world resources.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">terraform apply -refresh=false</span></b></div><div><br /></div><div>The -parallelism option limits the number of concurrent (resource) operations as Terraform walks the graph. By default its value is 10.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">terraform apply --parallelism=5</span></b></div></div><div><br /></div><div><div>The -refresh-only option only updates the Terraform state file and any module output values to match changes made to the managed remote objects outside of Terraform. It reconciles the Terraform state file with real-world resources. It replaces the old deprecated <a href="https://www.terraform.io/docs/cli/commands/refresh.html" target="_blank">terraform refresh</a> command.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">terraform apply -refresh-only</span></b></div></div><div><br /></div><div><div>The -var option is used in both terraform plan and terraform apply commands to pass input variables.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">terraform apply -var="<variable-name>=<value>"</span></b></div><div>$ <b><span style="color: #2b00fe;">terraform apply -var="image_id=ami-abc123" -var="instance_type=t2.micro"</span></b></div></div><div><br /></div><div><div>In order to set lots of variables, the variables and their values are specified in a variable definitions file (<b>.tfvars</b> or .tfvars.json) and passed with the -var-file option to the apply command. </div><div><br /></div><div>$ <b><span style="color: #2b00fe;">terraform apply -var-file=<your file></span></b></div></div><div><br /></div><div><br /></div><div><b><span style="font-size: medium;">Destroy Command</span></b></div><div><b><br /></b></div><div><a href="https://www.terraform.io/docs/cli/commands/destroy.html" target="_blank">Destroy command</a> looks at the recorded state file created during deployment and destroys all the resources which are being tracked by the Terraform state file. This command is non-reversible and hence should be used with caution. It is good to take backups and ensure that we really want to delete all the infrastructure. 
It is mainly used for cleaning up the resources which are created and tracked using Terraform. </div></div><div><div><br />$ <b><span style="color: #2b00fe;">terraform destroy</span></b></div><div><br /></div></div></div><div><div><br /></div><div><div><b><span style="font-size: large;">Terraform Providers</span></b></div><div><br /></div><div>Terraform abstracts the integration with the API control layer of infrastructure vendors using Providers. Every cloud vendor has its own provider. Terraform by default looks for providers in the Terraform <a href="https://registry.terraform.io/browse/providers" target="_blank">providers registry</a>.</div><div><br /></div><div>Terraform configurations must declare which providers they require so that Terraform can install and use them. Additionally, some providers require configuration (like endpoint URLs or cloud regions) before they can be used. Each provider adds a set of resource types and/or data sources that Terraform can manage. Every resource type is implemented by a provider, which enables Terraform to manage the corresponding cloud infrastructure.</div><div><br /></div><div><div>The below command shows information about the <a href="https://www.terraform.io/docs/language/providers/requirements.html" target="_blank">provider requirements</a> for the configuration in the current working directory.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">terraform providers</span></b></div></div><div><br /></div><div>Providers can also be sourced locally or internally and referenced within the Terraform code. Providers are plugins which are distributed separately from Terraform itself, and each provider has its own series of version numbers. It is recommended to use a specific version of the terraform providers in the terraform code. We can also write a custom provider. Terraform finds and installs the providers when initializing the working directory using the terraform init command. The terraform providers are downloaded into a hidden directory named .terraform.</div></div><div><div><br /></div><div><b><span style="font-size: medium;">Configuring the Provider</span></b></div>
<pre class="brush: plain">provider "aws" {
version = "3.7.0"
region = "us-west-1"
assume_role {
role_arn = local.provider_role
session_name = "Terraform"
}
}
provider "google" {
version = "2.20.0"
credentials = file("credentials.json")
project = "my-gcp-project"
region = "us-west-1"
}
</pre>
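<div><br /></div><div>In newer Terraform versions the provider requirements themselves are typically declared in a required_providers block; a minimal sketch (the version constraints shown here are only illustrative):</div>
<pre class="brush: plain">terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 3.0"
    }
  }
}
</pre>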
<div><br /></div></div><div><br /></div><div><b><span style="font-size: large;">Resource</span></b></div><div><div><div><br /></div><div><div><div>A resource is an object managed by Terraform. Terraform manages the life cycle of the resources from its provisioning, configuration and decommissioning. The resource block creates a new resource from scratch. The resource block has a number of required or optional arguments that are needed to create the resource. Any missing optional configuration in the configuration parameters uses default value.</div><div><br /></div></div><div>Syntax:</div></div></div><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><b>resource "<provider>_<type>" "<name>" {</b></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><div><b> <config> ...</b></div><div><b>}</b></div></blockquote><div><div><div><br /></div><div><b><provider></b> is the name of a provider (e.g., aws)</div><div><b><type></b> is the type of resources to create in that provider (e.g., instance, security_group)</div><div><b><name></b> is an identifier or name for the resource</div><div><b><config></b> consists of one or more arguments that are specific to that resource (e.g., ami = "ami-0c550")</div></div><div><div><br /></div><div><div>We can also specify <a href="https://www.terraform.io/docs/language/attr-as-blocks.html" target="_blank">inline blocks</a> as an argument to attribute set within a resource.</div><div><br /></div><div>Syntax:</div></div></div></div></div></div><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><div><div><div><div><div><div><div style="text-align: left;"><div style="text-align: left;"><b>resource "<provider>_<type>" "<name>" {</b></div></div></div></div></div></div></div></div></blockquote><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><div><div><div><div><div style="text-align: left;"><div><b> <name> {</b></div></div></div></div></div></div><div><div><div><div><div style="text-align: left;"><div><b> <config> ...</b></div></div></div></div></div></div><div><div><div><div><div style="text-align: left;"><div><b> }</b></div></div></div></div></div></div><div><div><div><div><div style="text-align: left;"><div><b>}</b></div></div></div></div></div></div></blockquote><div><div><div><div><br /></div><div>where <b><name></b> is the name of the inline block (e.g. tag) and <b><config></b> consists of one or more arguments that are specific to that inline block (e.g., key and value).</div>
<pre class="brush: plain">resource "aws_instance" "web" {
ami = "ami-a1b2c3d4"
instance_type = "t2.micro"
vpc_security_group_ids = [aws_security_group.instance.id]
tag {
key = "Name"
value = var.cluster_name
propagate_at_launch = true
}
}
</pre>
<div>Each provider has a unique set of resources that can be created on the specified platform. Below is an example of creating a docker container using terraform resource.</div>
<pre class="brush: plain">#Image used by the container
resource "docker_image" "terraform-centos" {
name = "centos:7"
keep_locally = true
}
# Create a container
resource "docker_container" "centos" {
image = docker_image.terraform-centos.latest
name = "terraform-centos"
start = true
command = ["/bin/sleep", "500"]
}
</pre>
<div>The syntax for resource reference is <b><provider>_<type>.<resource-name></b>, for example aws_instance.web. To access a resource attribute, i.e. one of the arguments of the resource (e.g., name) or one of the attributes exported by the resource, we use <b><provider>_<type>.<resource-name>.<attribute></b>, e.g. aws_instance.web.instance_type.</div></div><div><br /></div><div>It is important to note that changing the identifiers in Terraform requires a state change. The parameters of many resources are immutable, hence Terraform will delete the old resource and create a new one to replace it. Most cloud provider APIs, such as AWS, are asynchronous and eventually consistent. Eventually consistent means it takes time for a change to propagate throughout the entire system, so for some period of time, we may get inconsistent responses. Usually a wait-and-retry approach is used until the action is completed and the changes have propagated.</div><div><br /></div><div><a href="https://www.terraform.io/docs/language/resources/behavior.html#local-only-resources" target="_blank">Local-only Resources</a> operate only within Terraform itself. The behavior of local-only resources is the same as all other resources, but their result data exists only within the Terraform state. Local-only resource types exist for generating private keys, issuing self-signed TLS certificates, and even generating random ids.</div><div><br /></div><div><div>When we add a reference from one resource to another, we create an <a href="https://blog.gruntwork.io/an-introduction-to-terraform-f17df9c6d180" target="_blank">implicit dependency</a>. Terraform parses these dependencies, builds a dependency graph from them, and uses it to automatically figure out the order in which to create resources. For example, when creating the EC2 instance above, Terraform would know it needs to create the security group before the EC2 Instance, since the EC2 Instance references the ID of the security group. Terraform walks through the dependency tree, creating as many resources in parallel as it can.</div></div></div><div><br /></div><div><br /></div><div><b><span style="font-size: large;">Meta-Arguments</span></b></div><div><br /></div><div><b><span style="font-size: medium;">Count Meta-Argument</span></b></div><div><br /></div><div>Every Terraform resource has a meta-argument called count which helps iterate over the resource to create multiple copies. In the below example the count parameter creates three copies of the IAM users. The count.index expression gives the index of each iteration in the count loop.</div>
<pre class="brush: plain">resource "aws_iam_user" "example" {
count = 3
name = "neo.${count.index}"
}
</pre>
<div>Another example of creating resources with different names from an array variable.</div>
<pre class="brush: plain">resource "aws_iam_user" "example" {
count = length(var.user_names)
name = var.user_names[count.index]
}
</pre>
<div><div>When the count argument is set to a value in a resource block, it becomes a list of resources, rather than just one resource. If the count parameter is set to 1 on a resource, we get a single copy of that resource; if the count is set to 0, the resource is not created. In order to read an attribute from the resource list, we need to specify the index in the resource list. The syntax to read a particular resource in the resource list is as below.</div><div><br /></div><div><b><provider>_<type>.<name>[index].<attribute></b></div><div><br /></div><div>Although the count argument allows looping over an entire resource, it cannot be used to loop over inline blocks. Also, since terraform identifies each resource within the array by its index position, if an item is removed from the array, all the items after it shift back by one position, incorrectly causing terraform to modify or recreate the shifted resources rather than simply deleting the removed one. Count is used when the instances are almost identical and can be directly derived from an integer. </div></div><div><br /></div><div><b><span style="font-size: medium;">For-Each Meta-Argument</span></b></div><div><br /></div><div>The for_each expression allows looping over lists, sets, and maps to create multiple copies of either an entire resource or an inline block within a resource. It can be used with modules and with every resource type. The syntax of for_each is as below:</div>
<pre class="brush: plain">resource "<provider>_<type>" "<name>" {
for_each = <collection>
[config ...]
}
</pre>
<div><div>where <provider> is the name of a provider (e.g. aws), <type> is the type of resource to create in the provider (e.g. instance), <name> is an identifier of the resource, <collection> can be a set or map to loop over (lists are not supported when using for_each on a resource) and <config> is one or more arguments that are specific to that resource. In the for_each block, an additional each object is available which allows to modify the configuration of each instance. We can use each.key and each.value to access the key and value of the current item in <collection>, within the <config> parameters.</div><div><br /></div><div>In the below example, for_each loops over the set (converted from list) and makes each value of user_names set available in each.value. When looping a map, each.key is used to get each key, while each.value gets each value. Terraform transforms the resource with for_each into a map of resources.</div></div>
<pre class="brush: plain">resource "aws_iam_user" "example_accounts" {
for_each = toset( ["Todd", "James", "Alice", "Dottie"] )
name = each.value
}
resource "azurerm_resource_group" "rg" {
for_each = {
a_group = "eastus"
another_group = "westus2"
}
name = each.key
location = each.value
}
</pre>
<div><br /></div><div><div>for_each is preferred over count most of the time, as it reflects changes (add/delete) in the collection in the terraform resource plan. for_each also allows creating multiple inline blocks within a resource. Below is the syntax of for_each to dynamically generate inline blocks.</div></div>
<pre class="brush: plain">dynamic "<variable_name>" {
for_each = <collection>
content {
[<config>...]
}
}
</pre>
<div>where <variable_name> is the variable name which stores the value of each iteration, <collection> is a list or map to iterate over, and the content block is what to generate from each iteration. We can use <variable_name>.key and <variable_name>.value within the content block to access the key and value, respectively, for the current item in the <collection>.</div><div><br /></div><div>When using for_each with a list, the key will be the index and the value will be the item in the list at that index, and when using for_each with a map, the key and value will be one of the key-value pairs in the map.</div>
<pre class="brush: plain">variable "custom_tags" {
description = "Custom tags to set on the Instances in the ASG"
type = map(string)
default = {}
}
resource "aws_autoscaling_group" "example_asg" {
dynamic "tag" {
for_each = var.custom_tags
content {
key = tag.key
value = tag.value
propagate_at_launch = true
}
}
}
</pre>
<div><div>When an empty collection is passed to a for_each expression, it produces no resources or inline blocks. For a non-empty collection, it creates one or more resources or inline blocks. Terraform requires that it can compute count and for_each during the plan phase, before any resources are created or modified. Hence count and for_each cannot contain references to any resource outputs. Older Terraform versions (before 0.13) also did not support count and for_each within a module block.</div></div><div><br /></div><div><br /></div><div><b><span style="font-size: medium;">Depends-On Meta-Argument</span></b></div><div><br /></div><div><div>Most of the resources in a terraform configuration don't have any relationship with each other, hence Terraform can make changes to such unrelated resources in parallel. But some resources are dependent on other resources and require information generated by another resource. Terraform handles most of the resource dependencies automatically. Terraform analyses any expressions within a resource block to find references to other objects, and treats those references as implicit ordering requirements when creating, updating, or destroying resources. However in some cases, dependencies cannot be recognized implicitly in configuration, e.g. resource creation can have a hidden dependency on an access policy. In such rare cases the depends_on meta-argument can explicitly specify a dependency.</div><div><br /></div><div>The depends_on meta-argument is used to handle the hidden resource or module dependencies which cannot be inferred by Terraform automatically. It specifies that the current resource or module relies on the other dependent resources for its behavior, without accessing the dependent resource's data in its arguments. The depends_on meta-argument is available in module blocks and in all resource blocks. The depends_on meta-argument is a list of references to other resources or child modules in the same calling module. The depends_on meta-argument is used as a last resort to explicitly specify a dependency.</div></div>
<pre class="brush: plain">resource "aws_rds_cluster" "this" {
depends_on = [
aws_db_subnet_group.this,
aws_security_group.this
]
}
</pre>
<div><br /></div><div><div><b><span style="font-size: medium;">Lifecycle Meta-Argument</span></b></div><div><br /></div><div>The lifecycle meta-argument is a nested block within a resource block and enables customizing the lifecycle of the resource. Below are the arguments used within a lifecycle block.</div><div><br /></div><div><b>create_before_destroy (bool)</b>: By default, when Terraform needs to replace a resource that cannot be updated in-place, it first destroys the existing object and then creates a new replacement object. The create_before_destroy meta-argument changes this behavior by first creating the new replacement object, and only once the replacement is created, destroying the previous object. It does require that the remote object accommodates a unique name and other constraints (for example by appending a random suffix), so that both the new and old objects can exist concurrently.</div><div><br /></div><div><b>prevent_destroy (bool)</b>: When enabled, this meta-argument causes Terraform to reject with an error any plan that would destroy the infrastructure object associated with the resource, as long as the argument and resource block remain present in the configuration. It provides safety against accidental replacement of objects which can be costly to reproduce. </div><div><br /></div><div><b>ignore_changes (list of attribute names)</b>: Terraform by default detects any differences in the current real-world infrastructure objects and plans to update the remote object to match the configuration. The ignore_changes argument allows ignoring certain changes to the resource after its creation. It specifies resource attributes as a list (or the special keyword all), which Terraform will ignore when planning updates to the associated remote object.</div></div><div><br /></div>
<pre class="brush: plain">resource "azurerm_resource_group" "example" {
lifecycle {
create_before_destroy = true
ignore_changes = [
# Ignore changes to tags
tags
]
}
}
</pre>
<div><br /></div><div><br /></div><div><b><span style="font-size: large;">Input Variable</span></b></div><div><br /></div><div>Terraform input variables act as input parameters passed at runtime to customize the terraform deployments. Below is the syntax for declaring an input variable.</div><div><br /></div><div>Syntax:</div></div><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><div><div><b>variable <name> {</b></div></div><div><div><b> <config> ...</b></div></div><div><div><b>} </b></div></div></blockquote><div><div><br /></div>
<pre class="brush: plain">variable "my-var" {
  description = "Example Variable"
  type        = string
  default     = "Hello World"
}
</pre>
<div><br /></div><div>The body of the variable declaration can contain three parameters, description, type and default, which are all optional. The value of the variable can be provided using the -var option from the command line, the -var-file option via a file, or via an environment variable using the name TF_VAR_<variable_name>. If no value is passed in, the variable will fall back to the default value. If there is no default value, Terraform will interactively prompt the user for one. Terraform supports a number of type constraints, including string, number, bool, list, map, set, object, tuple, and any (which is the default).</div><div><br /></div><div>Since all the variable configuration parameters are optional, a variable can also be defined as below. The value for the below variable needs to be passed using an environment variable or command line arguments to avoid a runtime prompt or error.</div>
<pre class="brush: plain">variable "my-var" {}
</pre>
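<div><br /></div><div>A minimal sketch of the different ways such a value can be supplied; the variable name my-var is the one declared above, while the value and the example.tfvars file name are illustrative.</div>
<pre class="brush: plain"># Via a command line flag
$ terraform apply -var="my-var=Hello World"

# Via a variable definitions file (example.tfvars is assumed to contain: my-var = "Hello World")
$ terraform apply -var-file="example.tfvars"

# Via an environment variable prefixed with TF_VAR_ (env is used here since the variable name contains a hyphen)
$ env TF_VAR_my-var="Hello World" terraform apply
</pre>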
<div>A variable reference is used to get the value from an input variable in Terraform code. Variables can be referenced using the notation <b>var.<variable-name></b>. Variable values are usually assigned in the terraform.tfvars file, which Terraform loads automatically.</div><div><br /></div><div><div>Terraform provides variable validation, which allows setting criteria for the allowed values of a variable. Terraform checks if the value of the variable meets the validation criteria before deploying any infrastructure changes. </div>
<pre class="brush: plain">variable "my-var" {
  description = "Example Variable"
  type        = string
  default     = "Hello World"
  validation {
    condition     = length(var.my-var) > 4
    error_message = "The string must be more than 4 characters"
  }
}
</pre>
<div>The sensitive configuration parameter prevents the value of the variable from being displayed in the output during Terraform execution. </div>
<pre class="brush: plain">variable "my-var" {
  description = "Example Variable"
  type        = string
  default     = "Hello World"
  sensitive   = true
}
</pre>
<div><br /></div><div><b style="font-size: large;">Variable Type Constraints</b></div><div><br /></div><div><b><u>Base Types</u></b>: </div><div><ul style="text-align: left;"><li>string</li><li>number</li><li>bool</li></ul></div><div><b><u>Complex Types</u></b>: </div><div> - list, set, map, object, tuple</div><div><br /></div><div>1) Example of a list type variable.</div>
<pre class="brush: plain">variable "availability_zones" {
  type    = list(string)
  default = ["us-west-1a"]
}
</pre>
<div>2) Example of list of object type variable.</div>
<pre class="brush: plain">variable "docker_ports" {
  type = list(object({
    internal = number
    external = number
    protocol = string
  }))
  default = [
    {
      internal = 8300
      external = 8300
      protocol = "tcp"
    }
  ]
}
</pre>
<div><br /></div><div>Terraform reads variables from multiple sources with different precedence: operating system environment variables (TF_VAR_*) have the lowest precedence, the values defined in the terraform.tfvars file override them, and values passed on the command line with -var or -var-file have the highest precedence.</div></div><div><br /></div><div><br /></div><div><b><span style="font-size: large;">Output Variable</span></b></div><div><br /></div><div>Output variables allow defining values in the Terraform configuration which can be shared with other resources or modules. Terraform defines output variables with the following syntax:</div><div><br /></div><div>Syntax:</div><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><div></div></blockquote></div><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><div><div><b>output "<name>" {</b></div></div><div><div><b> value = <value></b></div></div><div><div><b> <config> ...</b></div></div><div><div><b>}</b></div></div></blockquote><div><div><br /></div><div><b><name></b> is the name of the output variable, </div><div><b><value></b> can be any Terraform expression and </div><div><config> can contain two additional parameters, both optional: description and sensitive (set to true to avoid logging this output at the end, e.g. for passwords).</div><div><br /></div><div><div>The value is a mandatory config argument which can be assigned any value or reference values of other Terraform resources or variables. Output also provides the sensitive configuration argument to hide sensitive values.</div></div>
<pre class="brush: plain">output "instance_ip" {
  description = "Private IP of VM"
  value       = aws_instance.my-vm.private_ip
}
</pre>
<div><br /></div><div>The terraform apply command not only applies the changes, but also shows the output values on the console. We can also use the terraform output command to list all outputs without applying any changes. The -json option formats the output as a JSON object.</div><div><div>$ <b><span style="color: #2b00fe;">terraform output</span></b></div><div>$ <b><span style="color: #2b00fe;">terraform output -json</span></b></div></div><div><br /></div><div>To check the value of a specific output called <output-variable>, we can run the below command.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">terraform output <output-variable></span></b></div><div><br /></div><div>The create_before_destroy, prevent_destroy and ignore_changes lifecycle settings described earlier under the lifecycle meta-argument customize how resources are created and destroyed.</div><div><br /></div><div><br /></div><div><b><span style="font-size: large;">Data Sources</span></b></div><div><br /></div><div><a href="https://www.terraform.io/docs/configuration-0-11/data-sources.html" target="_blank">Data sources</a> allow fetching and tracking details of already existing resources in a Terraform configuration. A data source represents a piece of read-only information that is fetched from the provider (e.g. AWS) every time we run Terraform. It's a way to query the provider's APIs for data and to make that data available to the rest of the Terraform code. Each Terraform provider exposes a variety of data sources; for example, the AWS provider includes data sources to look up VPC data, subnet data, AMI IDs, IP address ranges, the current user's identity, and much more.</div><div>While a resource causes Terraform to create and manage a new infrastructure component, a data source on the other hand provides a read-only view into pre-existing data, or computes new values on the fly within Terraform itself. Providers are responsible in Terraform for defining and implementing data sources. The syntax for using a data source is very similar to the syntax of a resource:</div><div><br /></div></div><div>Syntax:</div><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px; text-align: left;"><div><div><b>data "<provider>_<type>" "<name>" {</b></div></div><div><div><b> <config> ...</b></div></div><div><div><b>}</b></div></div></blockquote><div><div><br /></div><div><b><provider></b> is the name of a provider (e.g., aws), </div><div><b><type></b> is the type of data source (e.g., vpc), </div><div><b><name></b> is an identifier to refer to the data source, and </div><div><b><config></b> consists of one or more arguments that are specific to this data source.</div>
<pre class="brush: plain">data "aws_instance" "my-vm" {
  instance_id = "i-4352435234dsfs0"
}
</pre>
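<div><br /></div><div>The attributes fetched by this data source can then be referenced elsewhere in the configuration; a minimal sketch, here simply exposing the instance's private IP as an output (the output name is illustrative):</div>
<pre class="brush: plain">output "my_vm_private_ip" {
  value = data.aws_instance.my-vm.private_ip
}
</pre>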
<div><div><br /></div><div>In order to access the above data source, we use the syntax <b>data.<provider>_<type>.<name></b>, for example data.aws_instance.my-vm. An attribute of the data source can in turn be accessed as <b>data.<provider>_<type>.<name>.<attribute></b>.</div></div><div><br /></div><div><br /></div><div><div><b><span style="font-size: large;">Local Values</span></b></div><div><br /></div><div>A <a href="https://www.terraform.io/docs/language/values/locals.html" target="_blank">local value</a> assigns a name to an expression, so you can use it multiple times within a module without repeating it. A set of related local values can be declared together in a single locals block. Local values can be literal constants or can reference other values such as variables, resource attributes, or other local values in the module. Local values can refer to other local values in the same block as long as they don't introduce any circular dependencies. A local value can be referenced using <b>local.<name></b>. Local values help to avoid repeating the same values or expressions multiple times in the Terraform configuration.</div></div>
<pre class="brush: plain">locals {
  instance_ids = concat(aws_instance.blue.*.id, aws_instance.green.*.id)
}

locals {
  common_tags = {
    Service = local.service_name
    Owner   = local.owner
  }
}

resource "aws_instance" "my_instance" {
  ...
  tags = local.common_tags
}
</pre>
<div><br /></div><div><div><div><b><span style="font-size: large;">Modules</span></b></div><div><br /></div><div>A module is a collection of Terraform code files within a directory, whose output can be referenced in other parts of the project. It is a container for multiple resources that are used together, grouping them so they can be managed as one unit within a project. Modularization makes the code reusable.</div></div><div><div><div><br /></div><div>The main working directory which holds the Terraform code is called the root module. The modules which are referenced from the root module are called child modules, which can be passed input parameters and from which output values can be fetched. Modules can be downloaded or referenced from either the Terraform Public Registry, a Private Registry or the local file system. The syntax for a module is as below:</div><div><div><div><br /></div><div>Syntax:</div></div><blockquote style="border: none; margin: 0px 0px 0px 40px; padding: 0px;"><div><b>module "<name>" {</b></div><div><b> source = "<source>"</b></div><div><b> <config> ...</b></div><div><b>}</b></div></blockquote><div><br /></div></div><div>Modules are referenced using the module block which is shown in the below example.</div><pre class="brush: plain">module "my-vpc-module" {
  source  = "./modules/vpc"   # Module source (mandatory)
  version = "0.0.5"           # Module version (applies to registry modules)
  region  = var.region        # Input parameter for the module
}
</pre><div>Other parameters allowed inside the module block are as below:</div><div><ul><li> <b>count</b> allows spawning multiple separate instances of the module's resources.</li><li> <b>for_each</b> allows iterating over complex variables. </li><li> <b>providers</b> allows passing specific provider configurations to the module. </li><li> <b>depends_on</b> allows setting dependencies for the module.</li></ul></div><div>Note that whenever we add a module to the Terraform configuration or modify the source parameter of a module, we need to run the init command before we run the plan or apply command. The init command downloads providers and modules, and configures the backends.</div><div><br /></div><div>Modules can optionally take an arbitrary number of inputs and return outputs to plug back into the main code. Terraform module inputs are arbitrarily named parameters that are passed inside the module block. These inputs can be used as variables inside the module code. Below is an example where the `server-name` input parameter is passed to the module, and can be referenced as 'var.server-name' inside the module.</div><pre class="brush: plain">module "my-vpc-module" {
  source      = "git::ssh://git@github.com/emprovise/terraform-aws-vpc?ref=v1.0.5"
  name        = "monitoring-vpc"
  server-name = "us-east-1"   # Input parameter for the module
}
</pre><div><br /></div><div>The outputs declared inside the module code can be fed back into the root module or the main code. The syntax to read a module's output from outside the module code is <b>module.<module-name>.<output-name></b>.</div><div><br /></div><div>The value of the below output block can be accessed using module.my-vpc-module.subnet_id from outside the module.</div></div><pre class="brush: plain">output "subnet_id" {
  value = aws_instance.my-vm.subnet_id         # resource inside the module (resource name is illustrative)
}

# In the calling (root) module, the module output is consumed as below
resource "aws_instance" "my-vpc-module" {
  ....                                         # other arguments
  subnet_id = module.my-vpc-module.subnet_id
}
</pre><div><br /></div><div><div>Terraform configuration files naming conventions:</div><div><ul><li><b>variables.tf</b>: Input variables.</li><li><b>outputs.tf</b>: Output variables.</li><li><b>main.tf</b>: The actual resources.</li></ul></div></div><div><br /></div></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-u8vmKssa-Ug/YUbPdYfLCxI/AAAAAAAAly4/c5UIVdXhhvI3MJtXmUPyzNT-vjgfJ42WQCLcBGAsYHQ/s2560/Terraform_Workflow.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1229" data-original-width="2560" height="348" src="https://1.bp.blogspot.com/-u8vmKssa-Ug/YUbPdYfLCxI/AAAAAAAAly4/c5UIVdXhhvI3MJtXmUPyzNT-vjgfJ42WQCLcBGAsYHQ/w724-h348/Terraform_Workflow.png" width="724" /></a></div><div><br /></div><div><br /></div><div><br /></div><div><b><span style="font-size: large;">Terraform Configuration Block</span></b></div></div><div><div><br /></div><div><a href="https://www.terraform.io/docs/language/settings/index.html" target="_blank">Terraform configuration</a> block is a specific configuration block for controlling Terraform's own behavior. The terraform block only allows constant values, with no reference to named objects such as resources, input variables, etc, or any usage of the Terraform built-in functions. It allows configuring various aspects of the Terraform workflow such as:</div><div><ul><li>Configuring backend with nested <b>backend</b> block for storing state files.</li><li>Specifying a required Terraform version (<b>required_version</b>), against which the Terraform code is executed.</li><li>Specifying a required Terraform Provider version with <b>required_providers</b> block.</li><li>Enabling and testing Terraform experimental features using <b>experiments</b> argument.</li><li>Passing <a href="https://www.terraform.io/docs/internals/provider-meta.html" target="_blank">provider metadata</a> (<b>provider_meta</b>) for each provider of the module.</li></ul></div><div><br /></div><div>Below is an example of a terraform configuration block which, using version constraint expressions, ensures that Terraform only runs when the Terraform binary version is 0.13.0 or above and the Terraform AWS provider version is 3.0.0 or above.</div><pre class="brush: plain">terraform {
  required_version = ">=0.13.0"
  required_providers {
    aws = ">=3.0.0"
  }
}
</pre>
<pre class="brush: plain"># Configure Docker provider
terraform {
  required_providers {
    docker = {
      source = "terraform-providers/docker"
    }
  }
  required_version = ">=0.13"
}
</pre>
<div><br /></div><div><div><div><b><span style="font-size: large;">Terraform Backend</span></b></div><div><br /></div><div><a href="https://www.terraform.io/docs/language/settings/backends/index.html" target="_blank">Terraform Backend</a> defines the state snapshot storage and the operations to create, read, update, or destroy resources. Terraform supports multiple built-in backend types, each with its own set of configuration arguments. Terraform backends are divided into two main types: Enhanced backends, which can both store state and perform operations, e.g. local and remote, and Standard backends, which only store state and rely on the local backend for performing operations, e.g. <a href="https://www.consul.io/" target="_blank">consul</a>, <a href="https://registry.terraform.io/providers/jfrog/artifactory/latest/docs" target="_blank">artifactory</a> etc. </div><div><br /></div><div>Terraform only allows a single backend block within the configuration and does not allow references to input variables, locals, or data source attributes in it. If a configuration includes no backend block, Terraform defaults to using the local backend, which performs operations on the local system and stores state as a plain file in the current working directory. After updating the backend configuration, terraform init should be run to validate and configure the backend before performing any plans, applies, or state operations. Terraform allows omitting certain required arguments which can be passed later by a Terraform automation script, although at least an empty backend configuration must be specified in the root Terraform configuration file. The omitted required arguments can be passed using a configuration file with the <b>-backend-config=PATH</b> option, as key/value pairs with the <b>-backend-config="KEY=VALUE"</b> option, or using the interactive prompt while running the terraform init command. Terraform allows changing the backend configuration along with the backend type, or removing the backend altogether. In order for the configuration changes to take effect, Terraform needs to be reinitialized using terraform init.</div><div><div><br /></div><div>Below is an example of the Backend configuration.</div></div><pre class="brush: plain">terraform {
  backend "remote" {
    organization = "emprovise"

    workspaces {
      name = "terraform-prod-app"
    }
  }
}
</pre>
<div>In the below example of backend configuration, we create an S3 bucket by using the aws_s3_bucket resource, with "sse_algorithm" as "AES256" to securely store state files. Then a DynamoDB table is created which has a primary key called LockID to enable locking on the state files in the S3 bucket. The below backend configuration uses the S3 bucket to store the Terraform state file and the DynamoDB table for state locking, so that multiple users cannot modify the state file at the same time.</div>
<pre class="brush: plain">terraform {
  backend "s3" {
    bucket         = "terraform-up-and-running-state"
    key            = "global/s3/terraform.tfstate"
    region         = "us-east-2"
    dynamodb_table = "terraform-up-and-running-locks"
    encrypt        = true
  }
}
</pre>
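<div><br /></div><div>When the required backend arguments are omitted from the configuration (a partial configuration), they can be supplied at initialization time instead. A minimal sketch, reusing the same S3 bucket and DynamoDB table as the example above:</div>
<pre class="brush: plain"># Only an empty/partial backend block is kept in the configuration
terraform {
  backend "s3" {}
}
</pre>
<div>The omitted arguments are then passed as key/value pairs while initializing the backend.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">terraform init \</span></b></div><div><b><span style="color: #2b00fe;"> -backend-config="bucket=terraform-up-and-running-state" \</span></b></div><div><b><span style="color: #2b00fe;"> -backend-config="key=global/s3/terraform.tfstate" \</span></b></div><div><b><span style="color: #2b00fe;"> -backend-config="region=us-east-2" \</span></b></div><div><b><span style="color: #2b00fe;"> -backend-config="dynamodb_table=terraform-up-and-running-locks" \</span></b></div><div><b><span style="color: #2b00fe;"> -backend-config="encrypt=true"</span></b></div>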
</div><div><br /></div></div></div><div><br /></div><div><b><span style="font-size: large;">Terraform State</span></b></div><div><br /></div><div><div>Terraform State is the mechanism by which Terraform keeps track of the deployed resources, and it is used to determine the actions to be taken to update the corresponding platform. Terraform state is the blueprint of the infrastructure deployed by Terraform. Terraform compares the state of the deployed resources with the configuration to decide whether resources need to be created from scratch, modified or even destroyed. Terraform records information about the real world infrastructure it creates into the <a href="https://blog.gruntwork.io/how-to-manage-terraform-state-28f5697e68fa" target="_blank">Terraform state</a> file. By default, Terraform creates the JSON state file named <b>terraform.tfstate</b> in the current directory. The <a href="https://www.terraform.io/docs/language/state/index.html" target="_blank">state file</a> contains a custom JSON format that records a mapping from the Terraform resources in our templates to the representation of those resources in the real world. The state file helps Terraform calculate the deployment delta and create new deployment plans. Before Terraform modifies any infrastructure, it checks and refreshes the state file, ensuring that it is up to date with the real world infrastructure. The Terraform state file also tracks the dependencies between the deployed resources, i.e. resource dependency metadata. Terraform ensures that the entire infrastructure is always in the defined state at all times. It helps to boost deployment performance by caching resource attributes for subsequent use. Terraform backs up the last known state locally after a successful terraform apply. The state file is critical to Terraform's functionality: losing it means losing the reference to the deployed cloud infrastructure, in which case the terraform <a href="https://www.terraform.io/docs/cli/commands/import.html" target="_blank">import command</a> is the only option to bring the configuration of the existing cloud resources back under Terraform. Also, since all the infrastructure deployed with Terraform ends up in plain text in the state file, the state file should always be encrypted, both in transit and on disk.</div></div><div><br /></div><div><div><b><span style="font-size: medium;">Terraform State Command</span></b></div><div><br /></div><div>The Terraform state command is a utility for reading and manipulating (modifying) the Terraform state file, used for advanced state management. It allows manually removing resources from the state file so they are no longer managed by Terraform, and listing out the tracked resources and their details (via the list and show subcommands).</div><div><br /></div><div>The below state command lists out all the resources that are tracked by the Terraform state file.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">terraform state list</span></b></div><div><br /></div><div>The below state command shows the details of a resource and its attributes tracked in the Terraform state file.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">terraform state show <resource-address></span></b></div><div><br /></div><div><div>The below command allows renaming a resource or moving it to a different module, while retaining the existing remote object.
It updates the state to track the resource under a different resource instance address.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">terraform state mv <source-resource-address> <destination-resource-address></span></b></div></div><div><br /></div><div><div>The below command manually downloads and outputs the state file from remote state, or even local state.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">terraform state pull</span></b></div></div><div><br /></div><div><div>The push command manually uploads a local state file to remote state. It also works with local state.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">terraform state push</span></b></div></div><div><br /></div><div><div>The below command manually unlocks the locked state file for the defined configuration, using the LOCK_ID reported when the state file was locked beforehand.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">terraform force-unlock LOCK_ID</span></b></div></div><div><br /></div><div>The below state command deletes a resource from the Terraform state file, thereby un-tracking it from Terraform.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">terraform state rm <resource-address></span></b></div><div><br /></div><div>Terraform enables storing the Terraform state file in remote state storage such as AWS S3, Google Cloud Storage and other supported platforms. Remote state storage allows sharing the Terraform state file between distributed teams and provides better security, availability and consistent backups in the cloud. Cloud providers can provide granular security policies to access and modify the Terraform state file. Terraform also supports state locking so that two users cannot execute conflicting Terraform deployments against the same state in parallel. State locking is a common feature across both local and remote state storage. In local state storage, state locking is enabled by default when the terraform apply command is issued. State locking is only supported by a few remote state storage platforms such as AWS S3, Google Cloud Storage, Azure Storage and HashiCorp Consul. The state file also contains the output values from the Terraform code. Terraform enables sharing these output values with other Terraform configurations or code when the state file is stored remotely. This helps distributed teams working remotely on data pipelines which require the successful execution and outputs of a previous Terraform deployment.</div></div></div><div><div><b><span style="font-size: medium;">Terraform Remote State</span></b></div><div><br /></div><div>Terraform remote state retrieves the state data from a Terraform backend. It allows using the root-level outputs of one or more Terraform configurations as input data for another configuration. It is referenced by the terraform_remote_state type and, because it is a data source, it provides read-only access, so there is no danger of accidentally interfering with the state file. The output variables from terraform_remote_state are read using the syntax: <b>data.terraform_remote_state.<name>.outputs.<attribute-name></b>.</div><pre class="brush: plain">data "terraform_remote_state" "vpc" {
  backend = "s3"
  config = {
    bucket = "networking-terraform-state-files"
    key    = "vpc-prod01.terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_instance" "remote_instance" {
  subnet_id = data.terraform_remote_state.vpc.outputs.subnet_id
}
</pre><div><br /></div></div><div><b><span style="font-size: large;">Expressions</span></b></div><div><br /></div><div>Terraform supports simple literals as well as complex expressions which help to evaluate values within the Terraform configuration.</div><div><br /></div><div><div><b><span style="font-size: medium;">Interpolation</span></b></div><div><br /></div><div>The ${ ... } sequence is an interpolation, which evaluates the expression between the markers, converts the result to a string if required, and inserts it into the final string.</div></div>
<pre class="brush: plain">"Hello, ${var.name}!"</pre>
<div><br /></div><div><div><b><span style="font-size: medium;">Directives</span></b></div><div><br /></div><div>Terraform supports directives using the sequence %{ ... }, which allows for conditional evaluation and iteration over collections. The conditional evaluation uses the <b>%{if <BOOL>}/%{else}/%{endif}</b> directive, with the below example.</div></div>
<pre class="brush: plain">"Hello, %{ if var.name != "" }${var.name}%{ else }unnamed%{ endif }!"</pre>
<div><div>The iteration over collections is done using <b>%{for <NAME> in <COLLECTION>} / %{endfor}</b> directive. The template is evaluated for each element within the collection and the result for each element is concatenated together.</div><div><br /></div><div>The template directives can be formatted for readability without adding unwanted spaces or newlines, by adding optional strip markers (~), immediately after the opening characters or immediately before the end of template sequence. The template sequence consumes all of the literal whitespace either at the beginning or end, based on the added strip marker (~).</div></div>
<pre class="brush: plain"><<EOT
%{ for ip in aws_instance.example.*.private_ip ~}
server ${ip}
%{ endfor ~}
EOT
</pre>
<div><br /></div><div><div><b><span style="font-size: medium;">Operators</span></b></div><div><br /></div><div>Terraform supports Arithmetic Operators (+, -, *, /, %), Equality Operators (==, !=), Comparison Operators (<, <=, >, >=) and Logical Operators (||, &&, !) in the expressions.</div><div><br /></div><div><b><span style="font-size: medium;">Filesystem Path Variables</span></b></div><div><br /></div><div>Terraform provides special path variables to reference filesystem paths related to Terraform modules.</div><div><ul style="text-align: left;"><li><b>path.module</b> - Filesystem path of the module where the expression is placed.</li><li><b>path.root</b> - Filesystem path of the root module of the configuration.</li><li><b>path.cwd</b> - Filesystem path of the current working directory.</li></ul></div><div><br /></div><div><b><span style="font-size: medium;">Conditional Expressions</span></b></div><div><br /></div><div>A conditional expression uses the value of a bool expression to select one of two values. The two result values can be of any type, but they both must be of the same type for Terraform to determine the type of the whole conditional expression to return.</div><div><br /></div><div>Syntax: <b><condition> ? <true-value> : <false-value></b></div>
<pre class="brush: plain">count = var.enable_syn_alarms ? 1 : 0</pre>
<div><br /></div></div><div><div><b><span style="font-size: medium;">Splat Expression</span></b></div><div><br /></div><div>A splat expression provides a more concise way to express a common operation that could otherwise be performed with a for expression. The splat expression uses the special [*] symbol which iterates over all of the elements of the list given to its left and accesses from each one the attribute name given on its right. </div><div><br /></div><div>If var.list is a list of objects, all of which have an attribute id, then a list of the ids could be produced with the following for expression:</div>
<pre class="brush: plain">[for o in var.list : o.id]</pre>
</div><div><br /></div><div>The corresponding splat expression is :</div>
<pre class="brush: plain">var.list[*].id</pre>
<div><br /></div><div>A splat expression can also be used to access attributes and indexes from lists of complex types by extending the sequence of operations to the right of the symbol, as shown in below example.</div>
<pre class="brush: plain">var.list[*].interfaces[0].name</pre>
<div><br /></div><div>The above expression is equivalent to the following for expression:</div>
<pre class="brush: plain">[for o in var.list : o.interfaces[0].name]</pre>
<div><br /></div><div><div>Terraform provides an array lookup syntax to look up elements in an array at a given index. For example, to look up the element at index 1 within the array var.user_names, we use var.user_names[1].</div></div><div><br /></div><div><br /></div><div><b>Expanding Function Arguments</b></div><div><br /></div><div>Terraform allows expanding a list or tuple value into separate arguments of a function call. The list value is passed in as an argument, followed by the ... symbol. The general syntax of a function call is: <b>function_name(arg1, arg2, ...)</b>.</div>
<pre class="brush: plain">min([55, 2453, 2]...)
</pre>
<div><br /></div><div><br /></div><div><b><span style="font-size: large;">For Expression</span></b></div><div><br /></div><div><div>The for expression allows iterating over a collection (list, map) and updating, filtering or creating new items within the collection. The basic syntax of a for expression is:</div></div><div><br /></div><div><div><b>[ for <item> in <list> : <output> ]</b></div><div><br /></div><div><b>[ for <key>, <value> in <map> : <output_key> => <output_value> ]</b></div></div><div><br /></div><div>The <list> is the list being iterated and <item> is the local variable assigned to each item during iteration. For a <map> being iterated, we have the <key> and <value> local variables for each entry during the iteration. The <output> is an expression that transforms <item> for a list, or <key> & <value> for a map, in some way. Below is an example of a for expression converting a list of names to upper case and keeping only the names with fewer than 5 characters.</div>
<pre class="brush: plain">variable "names" {
  description = "A list of names"
  type        = list(string)
  default     = ["neo", "trinity", "morpheus"]
}

output "upper_names" {
  value = [for name in var.names : upper(name) if length(name) < 5]
}
</pre>
<div><br /></div><div>We can also use the expression to output a map rather than a list using the below syntax, where curly brackets are used to wrap the expression instead of square brackets, and a key and value separated by an arrow are output instead of a single value.</div>
<pre class="brush: plain">variable "hero_thousand_faces" {
  description = "map"
  type        = map(string)
  default     = {
    neo      = "hero"
    trinity  = "love interest"
    morpheus = "mentor"
  }
}

output "upper_roles" {
  value = {for name, role in var.hero_thousand_faces : upper(name) => upper(role)}
}
</pre>
<div><br /></div><div><br /></div><div><div><b><span style="font-size: large;">Built-In Functions</span></b></div><div><br /></div><div>Terraform comes pre-packaged with built-in functions to help transform and combine values. Terraform does not allow user-defined functions; it only provides an extensive list of built-in functions. The built-in functions can be used in Terraform code within resources, data sources, provisioners, variables etc.</div><div><br /></div>
<pre class="brush: plain">variable "project-name" {
  type    = string
  default = "prod"
}

resource "aws_vpc" "my-vpc" {
  cidr_block = "10.0.0.0/16"   # example CIDR range (illustrative)
  tags = {
    Name = join("-", ["terraform", var.project-name])   // terraform-prod
  }
}
</pre>
<div><br /></div><div>Below are a few Terraform <a href="https://www.terraform.io/docs/language/functions/index.html" target="_blank">built-in functions</a>:</div><div><br /></div><div><b>max(num1, num2, ...)</b>: It takes one or more numbers and returns the maximum value from the set.</div><div><br /></div><div><b>flatten([["a", "b"], [], ["c"]])</b>: It takes a list and replaces any elements that are lists with a flattened sequence of the list contents, e.g. ["a", "b", "c"]. Hence it creates a single list from a provided set of lists.</div><div><br /></div><div><div><b>contains(list, value)</b>: It determines whether the given value is present in the provided list or set.</div><div><br /></div><div><b>matchkeys(valueslist, keyslist, searchset)</b>: It constructs a new list by taking the elements of valueslist whose corresponding elements in keyslist appear in searchset.</div><div><br /></div><div><b>values({a=3, c=2, d=1})</b>: It takes a map and returns a list containing the values of the elements in that map.</div><div><br /></div><div><b>distinct([list])</b>: It takes a list and returns a new list with any duplicate elements removed.</div></div><div><br /></div><div><b>lookup(map, key, default)</b>: It retrieves the value for the given key from the map. If the given key does not exist, the given default value is returned instead.</div><div><div><br /></div><div><b>merge({a="b", c="d"}, {e="f"})</b>: It takes an arbitrary number of maps or objects, and returns a single map or object that contains a merged set of elements from all arguments.</div><div><br /></div><div><b>slice(list, startindex, endindex)</b>: It extracts elements from startindex inclusive to endindex exclusive from the list.</div><div><br /></div><div><b>cidrsubnet(prefix, newbits, netnum)</b>: It calculates a subnet address within a given IP network address prefix. The prefix is in CIDR notation, newbits is the number of additional bits by which to extend the prefix, and netnum is a whole number used to populate the additional bits added to the prefix.</div>
<pre class="brush: plain">cidrsubnet("10.1.2.0/24", 4, 15)    # returns "10.1.2.240/28"</pre>
<div><b>length()</b>: It determines the length by returning the number of elements/chars in a given list, map, or string.</div><div><br /></div><div><b>substr(string, offset, length)</b>: It extracts a substring from a given string by offset and length.</div><div><br /></div><div><b>keys({a=1, c=2, d=3})</b>: It takes a map and returns a list containing the keys from that map, e.g. ["a", "c", "d"].</div><div><br /></div><div><b>toset(["a", "b", "c"])</b>: It converts its argument (a list) to a set, removing any duplicate elements and discarding the ordering of the elements.</div><div><br /></div><div><b>file(path)</b>: It reads the contents of a file at the given path and returns them as a string.</div><div><br /></div><div><b>concat(["a", ""], ["b", "c"])</b>: It takes two or more lists and combines them into a single list.</div></div><div><br /></div><div><div><b>urlencode(string)</b>: It applies URL encoding to a given string.</div><div><br /></div></div><div><div><b>jsonencode(value)</b>: It encodes a given value to a string using JSON syntax.</div></div>
<pre class="brush: plain">policy = jsonencode(
  {
    "Version": "2012-10-17",
    "Statement": [
      {
        "Action": [
          "s3:ListAllMyBuckets"
        ],
        "Effect": "Allow",
        "Resource": "*"
      }
    ]
  })
</pre>
<div><br /></div><div><br /></div><div><b><span style="font-size: medium;">Terraform Console</span></b></div><div><br /></div><div>The terraform console command provides an interactive console for evaluating expressions. If the current state of the deployment is empty or has not yet been created, the console can still be used for running built-in functions and expression syntax.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">terraform console</span></b></div><div>> <span style="color: #800180;"><b>max(4, 5, 7, 9)</b></span></div><div>> <b><span style="color: #800180;">timestamp()</span></b></div><div>> <b><span style="color: #800180;">join("_", ["james", "bond"])</span></b></div><div>> <b><span style="color: #800180;">contains(["john", "wick", 2, 4, 6], "wick")</span></b></div><div><br /></div><div><br /></div><div><b><span style="font-size: large;">Type Constraints</span></b></div><div><br /></div><div>Type constraints control the type of variable values that can be passed to the Terraform code.</div><div><br /></div><div><b><u>Primitive Types</u></b>, which hold a single value of one type.</div><div><ul style="text-align: left;"><li>number - e.g. replicas = 3</li><li>string - e.g. name = "cluster2"</li><li>bool - e.g. backup = true</li></ul></div><div><br /></div><div><b><u>Complex Types</u></b>, which can combine multiple values in a single variable, e.g. list, tuple, map, object.</div><div><br /></div><div>Complex types can be grouped into Collection types and Structural types.</div><div><br /></div><div><b><u>Collection types</u></b> allow multiple values of one primitive type to be grouped together in a variable.</div><div>Constructors for these Collections include:</div><div><ul style="text-align: left;"><li>list(<type>)</li><li>map(<type>)</li><li>set(<type>)</li></ul></div>
<pre class="brush: plain">variable "training" {
  type    = list(string)    // Variable is a list of several strings
  default = ["ACG", "GQ"]   // Two separate strings in one variable
}
</pre>
<div><br /></div><div><b><u>Structural types</u></b> allow multiple values of different types to be grouped together. A structural type allows more than one type of value to be assigned within a variable, as opposed to a collection type which allows only a single type of value within a variable.</div><div><br /></div><div>Constructors for these structural types include (see the tuple sketch after the object example below):</div><div><ul style="text-align: left;"><li>object({<attribute_name> = <type>, ... })</li><li>tuple([<type>, ...])</li></ul></div>
<pre class="brush: plain">variable "instructor" {
  type = object({
    name = string     // Primitive Types
    age  = number     // Several named attributes
  })
}
</pre>
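<div><br /></div><div>A tuple groups a fixed sequence of values where each position can have a different type. A minimal sketch (the variable name and values are illustrative):</div>
<pre class="brush: plain">variable "instance_settings" {
  type    = tuple([string, number, bool])   // Instance type, count and monitoring flag
  default = ["t3.micro", 2, true]
}
</pre>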
<div><br /></div><div><br /></div><div><b><span style="font-size: large;">Any Constraint</span></b></div><div><br /></div><div>Any is a placeholder for a type that is yet to be decided. Terraform allows leaving out the exact type of the variable while defining it, since the type is an optional field. The actual type of a variable using the any constraint is determined at runtime from the value assigned to it. In the below example, the variable is a list type with the any constraint. Terraform recognizes all the values passed in the default value of the variable as numbers, and assigns the type of the list as list(number).</div>
<pre class="brush: plain">variable "data" {
  type    = list(any)
  default = [1, 42, 7]
}
</pre>
<div><br /></div><div><br /></div><div><b><span style="font-size: large;">Dynamic Blocks</span></b></div><div><br /></div><div>Dynamic blocks enable dynamically constructing repeatable nested configuration blocks inside Terraform resources. They can be used within resource blocks, data blocks, provider blocks and provisioners.</div><div><br /></div><div>Below is an example of Terraform code which creates an AWS security group with several rules, each represented by an ingress block that in turn takes several inputs.</div>
<pre class="brush: plain">resource "aws_security_group" "my-sg" {
  name   = "my-aws-security-group"
  vpc_id = aws_vpc.my-vpc.id

  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["1.2.3.4/32"]
  }

  ingress {
    ... # more ingress rules
  }
}
</pre>
<div>The above code with many ingress rule blocks can be streamlined as below. The nested content block defines the body of each generated block using the complex variable provided (var.rules in the below example). The ingress variable inside the content block is the iterator argument. The iterator argument can be given a custom name, but by default it uses the name of the dynamic block, hence it is the ingress variable in the below example. </div>
<pre class="brush: plain">resource "aws_security_group" "my-sg" {
  name   = "my-aws-security-group"
  vpc_id = aws_vpc.my-vpc.id

  dynamic "ingress" {          // using dynamic block for the config block to replicate
    for_each = var.rules       // for_each loop uses a complex variable to iterate over
    content {
      from_port   = ingress.value["port"]
      to_port     = ingress.value["port"]
      protocol    = ingress.value["proto"]
      cidr_blocks = ingress.value["cidr_blocks"]
    }
  }
}
</pre>
<div>The complex variable rules passed in the above dynamic block is defined as below.</div>
<pre class="brush: plain">variable "rules" {
  default = [
    {
      port        = 80
      proto       = "tcp"
      cidr_blocks = ["0.0.0.0/0"]
    },
    {
      port        = 22
      proto       = "tcp"
      cidr_blocks = ["1.2.3.4/32"]
    }
  ]
}
</pre>
<div><br /></div><div>Dynamic blocks expect a complex variable type to iterate over. They act like a for loop, outputting a nested block for each element in the (complex) variable passed to them. Dynamic blocks help to make the code cleaner by cutting down on writing repetitive chunks of nested blocks. Overuse of dynamic blocks in the code will make it hard to read and maintain. Dynamic blocks are mostly used to hide detail in order to build a cleaner user interface when writing reusable modules.</div><div><br /></div><div><br /></div><div><b><span style="font-size: large;">Terraform Format</span></b></div><div><br /></div><div>The terraform fmt command formats the Terraform code for readability. It makes the code consistent and easy to maintain. Terraform format is safe to run at any time as it just changes the code formatting. It looks for all the files ending with the .tf extension and formats them. It is mostly used before pushing the code to version control and after upgrading Terraform or its modules.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">terraform fmt</span></b></div><div><br /></div><div><br /></div><div><b><span style="font-size: large;">Terraform Taint</span></b></div><div><br /></div><div>The taint command marks/taints an existing Terraform resource, forcing it to be destroyed and recreated. It only modifies the state file which tracks the created resources. The state file is marked with the resources to be tainted, which triggers the recreation workflow. After tainting the resource, which causes it to be destroyed, the next terraform apply causes it to be recreated. Tainting a resource may cause other dependent resources to be modified as well, e.g. tainting a virtual machine with an ephemeral public IP address will cause the public IP address to change when the resource is recreated.</div><div><br /></div><div>The terraform taint command takes the resource address within the Terraform code.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">terraform taint <resource-address></span></b></div><div><br /></div><div>The taint command is generally used to trigger the execution of provisioners (by resource create/destroy), to forcefully replace bad resources, and to replicate the side effects of the recreation process.</div><div><br /></div><div><div>The untaint command is used to remove the taint from a resource.</div><div><br /></div><div>$ <span style="color: #2b00fe;"><b>terraform untaint <span><resource-address></span></b></span></div></div><div><br /></div><div><br /></div><div><b><span style="font-size: large;">Terraform Import</span></b></div><div><br /></div><div>The terraform import command takes an existing resource which is not managed by Terraform and maps it to a resource within the Terraform code using an ID. The ID depends on the underlying vendor infrastructure from which the resource is imported, e.g. to import an AWS EC2 instance we need to provide its instance ID. Importing the same resource into multiple Terraform resources can cause unknown behavior and is not recommended. Terraform ensures that there is a one-to-one mapping between its resources and real world resources, but cannot prevent the same resource being added twice using Terraform import. </div><div><br /></div><div>The import command syntax is as below, which takes the resource address, i.e. 
the Terraform resource name to be mapped to the real world resource, and the ID of the real world resource.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">terraform import <resource-address> ID</span></b></div><div><br /></div><div>Terraform import is helpful in working with existing resources, and enables them to be managed using Terraform. A user can import resources even if they do not have access to create new resources. If there are a lot of existing resources to be imported into Terraform, writing Terraform code for all of them would be a time-consuming process. In such a case the <a href="https://github.com/dtan4/terraforming" target="_blank">Terraforming</a> tool allows importing both code and state from an AWS account automatically. <a href="https://github.com/GoogleCloudPlatform/terraformer" target="_blank">Terraformer</a> is a CLI tool that generates tf/json and tfstate files based on existing infrastructure, reverse converting real world infrastructure into Terraform.</div><div><br /></div><div><br /></div><div><b><span style="font-size: large;">Terraform Show</span></b></div><div><br /></div><div><div>The <a href="https://www.terraform.io/docs/cli/commands/show.html" target="_blank">terraform show</a> command provides human-readable output from a state or plan file.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">terraform show <path-to-state-or-plan-file></span></b></div><div><div><br /></div></div><div>The <b>-json</b> command line flag prints the state or plan file in JSON format. </div><div><br /></div><div>$ <b><span style="color: #2b00fe;">terraform show -json <state-plan-file></span></b></div><div><div>$ <b><span style="color: #2b00fe;">terraform show -json tfplan.binary > tfplan.json</span></b></div><div><br /></div></div></div><div><br /></div><div><b><span style="font-size: large;">Miscellaneous Terraform Commands</span></b></div><div><br /></div><div>In order to provide Terraform command <a href="https://www.terraform.io/docs/cli/commands/index.html#shell-tab-completion" target="_blank">completion using tab</a> for either the bash or zsh shell, we run the below command.</div><div><div><br /></div><div>$ <b><span style="color: #2b00fe;">terraform -install-autocomplete</span></b></div><div><br /></div><div>The terraform validate command validates the configuration files in the current directory without accessing any remote services such as remote state, provider APIs etc. It checks whether the configuration code syntax is valid and consistent, and ensures the correctness of attribute names and value types. It supports the -json flag to output validation results in JSON format.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">terraform validate</span></b></div></div><div><br /></div><div><div>The get command downloads and updates the modules mentioned in the root module. The -update option checks for updates and updates the already downloaded modules.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">terraform get -update=true</span></b></div><div><br /></div><div>The terraform graph command is used to generate a graph in DOT format of either a configuration or an execution plan. It shows the relationships and dependencies between Terraform resources in the configuration code. 
The dot utility, used to generate images from the DOT output, can be downloaded from <a href="http://www.graphviz.org/download/#executable-packages" target="_blank">Graphviz</a>.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">sudo apt install graphviz</span></b></div><div>$ <b><span style="color: #2b00fe;">terraform graph | dot -Tsvg > graph.svg</span></b></div><div><br /></div></div><div><br /></div><div><div><div><b><span style="font-size: large;">Provisioners</span></b></div><div><br /></div><div>Provisioners allow users to execute custom scripts, commands and actions. Such scripts can be run either locally or remotely on resources spun up through the Terraform deployment. A provisioner is attached to a Terraform resource and allows custom connection parameters to be passed, in order to connect to the remote resource using SSH or WinRM and carry out commands against it. Within Terraform code, each individual resource can have its own provisioner defining the connection method and the actions/commands or scripts to execute.</div><div><br /></div><div>There are two types of provisioners which cover two types of events in a resource's lifecycle:</div><div>the create-time provisioner and the destroy-time provisioner, which can be set to run when a resource is being created or destroyed respectively. HashiCorp recommends using provisioners as a last resort and using the inherent mechanisms provided by cloud vendors within the infrastructure deployment to carry out custom tasks where possible. Provisioners should be used only when the underlying vendor such as AWS does not provide a built-in mechanism for bootstrapping via custom commands or scripts. Terraform cannot track changes made by provisioners, as they can take arbitrary independent actions via scripts or commands; hence they are not tracked by the Terraform state file, and their changes are not provided as part of the output of the terraform plan/apply commands.</div><div><br /></div><div>Provisioners are recommended for use when we want to invoke actions that are not covered by Terraform's declarative model or by the inherent options for the resources in the available providers. Provisioners expect any custom script or command to exit with a return code of zero. If the command within a provisioner returns a non-zero return code, it is considered failed and the underlying resource is tainted (marking the resource against which the provisioner was run, to be recreated during the next run).</div><div>Below is an example of provisioners. By default a provisioner is a create-time provisioner, i.e. it executes once the resource is created; the local-exec provisioner runs its command locally in the working directory.</div><pre class="brush: plain">resource "null_resource" "dummy_resource" {
  provisioner "local-exec" {
    command = "echo '0' > status.txt"
  }

  provisioner "local-exec" {
    when    = destroy
    command = "echo '1' > status.txt"
  }
}
</pre><div><br /></div><div>We can use multiple provisioners against the same resource and they are executed in the order they appear in the code.</div><div><br /></div><div>Referencing the resource by its own address inside its provisioner would create a cyclic dependency, and could cause the provisioner to run a command against a resource which has not been created yet. Instead, the self object can be used, which provides access to any attribute available on the resource the provisioner is attached to.</div></div><pre class="brush: plain">resource "aws_instance" "ec2-virtual-instance" {
  ami                         = "ami-12345"
  instance_type               = "t2.micro"
  key_name                    = aws_key_pair.master-key.key_name
  associate_public_ip_address = true
  vpc_security_group_ids      = [aws_security_group.jenkins-sg.id]
  subnet_id                   = aws_subnet.subnet.id

  provisioner "local-exec" {
    command = "aws ec2 wait instance-status-ok --region us-west-1 --instance-ids ${self.id}"
  }
}
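</pre>
<div><br /></div><div>For remote execution, a connection block supplies the SSH (or WinRM) parameters the provisioner uses to reach the resource. Below is a minimal sketch assuming an SSH-reachable Amazon Linux instance; the AMI, key pair, key path and commands are illustrative.</div>
<pre class="brush: plain">resource "aws_instance" "ec2-remote-provisioned" {
  ami           = "ami-12345"                        # illustrative AMI
  instance_type = "t2.micro"
  key_name      = aws_key_pair.master-key.key_name

  # Connection details used by the remote provisioner below
  connection {
    type        = "ssh"
    user        = "ec2-user"
    private_key = file(pathexpand("~/.ssh/id_rsa"))  # illustrative key path
    host        = self.public_ip
  }

  provisioner "remote-exec" {
    inline = [
      "sudo yum -y install nginx",
      "sudo systemctl start nginx"
    ]
  }
}
</pre>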
<div><br /></div><div><br /></div><div></div></div><div><b><span style="font-size: large;">Workspace</span></b></div><div><br /></div><div><div>Terraform CLI Workspaces are alternate state files within the same working directory. By keeping alternate state files for the same code or configuration, distinct environments can be spun up. Terraform starts with a single default workspace which cannot be deleted. Each workspace tracks a separate, independent copy of the state file against the Terraform code in that directory. Below are the terraform workspace subcommands.</div></div><div><br /></div><div>Create a new workspace:</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">terraform workspace new <workspace-name></span></b></div><div><br /></div><div>List all the available terraform workspaces, highlighting the current workspace using an asterisk (*):</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">terraform workspace list</span></b></div><div><br /></div><div><div>The terraform workspace show command displays the current workspace.</div><div><br /></div><div>$ <span style="color: #2b00fe;"><b>terraform workspace show</b></span></div></div><div><span style="color: #2b00fe;"><b><br /></b></span></div><div>Select and switch to an existing terraform workspace:</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">terraform workspace select <workspace-name></span></b></div><div><br /></div><div>Terraform workspaces are used to test changes using a parallel, distinct copy of the infrastructure. Since each workspace tracks an independent copy of the state file, Terraform can deploy a new environment for each workspace using the common Terraform code. Workspaces can be modeled against branches, by committing Terraform state files in version control such as Git.</div><div><br /></div><div>Workspaces are meant for sharing resources and enabling collaboration between teams. Each team can test the same common code using different workspaces. Terraform code can access the workspace name using the <b>${terraform.workspace}</b> variable. The workspace name can be used within the resources to associate them to the workspace or to perform certain unique actions on the resources based on the workspace name. </div><div><br /></div><div>In the below example we spin up 5 EC2 instances if the current workspace is the default workspace, or else only a single EC2 instance in the AWS cloud, using the <b>${terraform.workspace}</b> variable.</div>
<pre class="brush: plain">resource "aws_instance" "example" {
  count         = terraform.workspace == "default" ? 5 : 1
  ami           = data.aws_ami.ubuntu.id
  instance_type = "t3.micro"
  tags = {
    Name = "${terraform.workspace}-instance"
  }
}
</pre>
<div>Another example below is to switch the region based on the current workspace in AWS provider config.</div>
<pre class="brush: plain">provider "aws" {
  region = terraform.workspace == "default" ? "us-east-2" : "us-west-1"
}
</pre>
<div>The default workspace state file is the <b>terraform.tfstate</b> file in the root directory. Terraform state files for non-default workspaces are stored in the <b>terraform.tfstate.d</b> directory.</div><div><br /></div><div><br /></div><div><b><span style="font-size: large;">Managing Terraform Secrets</span></b></div><div><br /></div><div><div><a href="https://blog.gruntwork.io/a-comprehensive-guide-to-managing-secrets-in-your-terraform-code-1d586955ace1" target="_blank">Secrets</a> such as passwords, API keys, and other sensitive data should not be stored directly in Terraform code in plain text. The Terraform state file, which contains all the secrets, should be stored with encryption. There are various techniques for managing secrets in Terraform, as below.</div><div><br /></div><div>We can pass the secrets to the Terraform code using environment variables. Secrets can be passed as environment variables (prefixed with TF_VAR_) to set Terraform variable values, which are then referenced in resources as credentials. Passwords and secure data passed to environment variables can be stored and managed using a password manager such as <a href="https://1password.com/" target="_blank">1Password</a>, <a href="https://www.lastpass.com/" target="_blank">LastPass</a> or <a href="https://www.passwordstore.org/" target="_blank">pass</a>. Passwords can be stored using tools like <a href="https://www.passwordstore.org" target="_blank">pass</a>, a Unix CLI tool which takes input and output via stdin and stdout, storing passwords in PGP-encrypted files.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">pass insert db_username</span></b></div><div><b>Enter password for db_username: admin</b></div><div><br /></div><div>$ <b><span style="color: #2b00fe;">pass db_username</span></b></div><div><b>admin</b></div></div><div><br /></div><div>Another technique relies on encrypting the secrets, storing the cipher text in a file, and checking that file into version control. A cloud key service like AWS KMS, GCP KMS or Azure Key Vault is used to store the key and encrypt the credentials file.</div><div><div><br /></div><div>We create a new <a href="https://www.humankode.com/security/how-to-encrypt-secrets-with-the-aws-key-management-service-kms" target="_blank">AWS KMS key</a> using the below command.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">aws kms create-key --description "KMS Demo Application"</span></b></div><div><br /></div><div>We add the secret credentials to the credentials.yml file and use the below AWS KMS command to encrypt it using the KMS key.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">aws kms encrypt \</span></b></div><div><b><span style="color: #2b00fe;"> --key-id <AWS_KMS_key> \</span></b></div><div><b><span style="color: #2b00fe;"> --region <AWS_Region> \</span></b></div><div><b><span style="color: #2b00fe;"> --plaintext fileb://credentials.yml \</span></b></div><div><b><span style="color: #2b00fe;"> --output text \</span></b></div><div><b><span style="color: #2b00fe;"> --query CiphertextBlob \</span></b></div><div><b><span style="color: #2b00fe;"> > credentials.yml.encrypted</span></b></div><div><br /></div><div>Then the aws_kms_secrets data source for AWS (google_kms_secret for GCP KMS or azurerm_key_vault_secret for Azure Key Vault) is used to decrypt the secret credentials by reading the credentials.yml.encrypted file.</div></div>
<pre class="brush: plain">data "aws_kms_secrets" "creds" {
secret {
name = "db"
payload = file("${path.module}/credentials.yml.encrypted")
}
}
locals {
db_creds = yamldecode(data.aws_kms_secrets.creds.plaintext["db"])
}
resource "aws_db_instance" "mysql_test_instance" {
engine = "mysql"
engine_version = "5.7"
instance_class = "db.t2.micro"
name = "mysql_test_instance"
username = local.db_creds.username
password = local.db_creds.password
}
</pre>
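<div>Circling back to the environment-variable technique described at the start of this section, below is a minimal sketch (the db_password variable and the resource names are hypothetical, not taken from the examples above). The secret is declared as a sensitive input variable and its value is supplied at runtime via TF_VAR_db_password, for example from the pass store, so it never appears in the code or in version control.</div>
<pre class="brush: plain"># variables.tf - the secret is only declared, never assigned a value in code
variable "db_password" {
  description = "Database admin password, supplied via the TF_VAR_db_password environment variable"
  type        = string
  sensitive   = true
}

# main.tf - the variable is referenced wherever the credential is needed
resource "aws_db_instance" "example" {
  engine   = "mysql"
  username = "admin"
  password = var.db_password
}
</pre>
<div>$ <b><span style="color: #2b00fe;">export TF_VAR_db_password=$(pass db_password)</span></b></div>
<div><br /></div>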
<div><br /></div><div><a href="https://github.com/mozilla/sops" target="_blank">Sops</a> is an open source tool designed to make it easier to edit and work with files that are encrypted via AWS KMS, GCP KMS, Azure Key Vault, or PGP. sops can automatically decrypt a file when you open it in your text editor, so you can edit the file in plain text, and when you go to save those files, it automatically encrypts the contents again. This removes the need to run long aws kms commands to encrypt or decrypt data or worry about accidentally checking plain text secrets into version control. Terragrunt has native built in support for sops. The terragrunt.hcl can use the sops_decrypt_file function built into Terragrunt to decrypt that file and yamldecode to parse it as YAML.</div>
<pre class="brush: plain">locals {
db_creds = yamldecode(sops_decrypt_file(("db-creds.yml")))
}
</pre>
<div><div><br /></div><div>We can also directly store the terraform secrets in a dedicated cloud secret store, which is a database, designed specifically for securely storing sensitive data and tightly controlling access to it. The Cloud Secret Stores such as HashiCorp Vault, AWS Secrets Manager, AWS Param Store, GCP Secret Manager etc can be used to store secrets. The AWS Secrets Manager uses aws_secretsmanager_secret_version, HashiCorp Vault uses vault_generic_secret, AWS SSM Param Store uses aws_ssm_parameter and GCP Secret Store uses google_secret_manager_secret_version data source to read the secrets stored in respective cloud secret store.</div></div><div><br /></div>
<pre class="brush: plain">data "aws_secretsmanager_secret_version" "creds" {
secret_id = "db-creds"
}
locals {
db_creds = jsondecode(
data.aws_secretsmanager_secret_version.creds.secret_string
)
}
</pre>
<div><br /></div><div><br /></div><div><b><span style="font-size: large;">Debugging Terraform</span></b></div><div><br /></div><div>The TF_LOG is an environment variable which enables verbose logging in Terraform. By default, it sends logs to the standard error output (stderr) displayed on the console. The Terraform logs have 5 levels of verbosity, with following levels: TRACE, DEBUG, INFO, WARN AND ERROR. TRACE is the most verbose level of logging and both terraform internal logs along with backend API calls made to cloud providers. We can redirect the output logs to persist into a file by setting the TF_LOG_PATH environment variable to a file name. By default TF_LOG_PATH is disabled, but can be enabled by setting a value to the environment variable as below.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">export TF_LOG=TRACE</span></b></div><div>$ <b><span style="color: #2b00fe;">export TF_LOG_PATH=./terraform.log</span></b></div><div><div><br /></div></div><div><br /></div><div><b><span style="font-size: large;">HashiCorp Sentinel</span></b></div><div><br /></div><div>HashiCorp Sentinel is a feature provided in Terraform Enterprise version. HashiCorp Sentinel enforces policies (restrictions) on the Terraform code. Sentinel has its own policy language called Sentinel language. It prevents dangerous and malicious code is stopped even before it gets executed or applied using the terraform apply command. The sentinel integration runs in enterprise terraform after terraform plan and before terraform apply command. The sentinel's policies have access to the data in the created plan and the state of resources & configuration at the time of the plan. Sentinel codifies the security policies in terraform code which can also be version controlled. It provides guardrails for automation and sandboxing. Below is the example of <a href="https://www.terraform.io/docs/cloud/sentinel/examples.html" target="_blank">sentinel policy code</a> which only allows to create EC2 instances with instance types t2.small, t2.medium or t2.large.</div><div><br /></div>
<pre class="brush: plain">import "tfplan-functions" as plan
allowed_types = ["t2.small", "t2.medium", "t2.large"]
allEC2Instances = plan.find_resources("aws_instance")
violatingEC2Instances = plan.filter_attribute_not_in_list(allEC2Instances,
"instance_type", allowed_types, true)
violations = length(violatingEC2Instances["messages"])
main = rule {
violations is 0
}
</pre>
<div><br /></div><div><br /></div><div><b><span style="font-size: large;">HashiCorp Vault</span></b></div><div><br /></div><div>HashiCorp Vault is a secrets management software which stores sensitive data securely and provides short lived temporary credentials to users. Vault handles the rotating these temporarily credentials as per an expiration schedule which is configurable. It generates cryptographic keys which are used to encrypt sensitive data at rest or in transit, and provides fine-grained access to secrets using ACLs.</div><div><br /></div><div>The Vault admin stores the ACLs and long lived credentials in Vault and configures permissions for temporary generated credentials using Vault's integration with AWS or GCP's IAM service or Azure RBAC. The Terraform Vault Provider allows to integrate Vault into Terraform code, and allows to access temporarily short-lived credentials with appropriate IAM permissions. Terraforms uses these credentials for deployment with the terraform apply command. Vault allows fine grained ACLs for access to temporary credentials.</div><div><br /></div><div><br /></div><div><b><span style="font-size: large;">Terraform Cloud</span></b></div><div><br /></div><div><a href="https://app.terraform.io/" target="_blank">Terraform Cloud</a> is HashiCorp's <a href="https://opsguru.io/post/terraform-cloud" target="_blank">cloud solution</a> which allows to execute Terraform code on cloud hosted system. It simplifies environment management, code execution, state file management, as well as permissions management. <a href="https://www.digitalocean.com/community/tutorials/how-to-use-terraform-within-your-team" target="_blank">Terraform Cloud</a> allows to create remote workspace, maintaining a separate directory for each workspace host in the cloud. Terraform Cloud manages storage & security of Terraform State files, variables and secrets within the cloud workspace. It stores older versions of state files by default. Terraform Cloud Workspace maintains records of all execution activity within the workspace. All Terraform commands can be executed using Terraform CLI, Terraform Workspace APIs, Github Actions or Terraform Cloud GUI, within the Terraform Cloud managed VMs. Terraform Cloud integrates with various version control systems e.g. Github, Bitbucket to fetch latest terraform code. Terraform Cloud also provides cost estimation for the terraform deployment.</div><div><br /></div></div><div><br /></div>
<div><b><span style="font-size: x-large;">Terragrunt</span></b></div>
<div></div>
<div><br /></div><div><a href="https://terragrunt.gruntwork.io/" target="_blank">Terragrunt</a> is a thin wrapper, command line interface tool to make Terraform better or build a better infrastructure as code pipeline. It provides extra tools for keeping the configurations DRY (Don't Repeat Yourself), working with multiple Terraform modules, and managing remote state. Terragrunt helps with code structuring were we can write the Terraform code once and apply the same code with different variables and different remote state locations for each environment. it also provides before and after hooks, which make it possible to define custom actions that will be called either before or after execution of the terraform command.</div><div><br /></div><div>Terragrunt can be <a href="https://terragrunt.gruntwork.io/docs/getting-started/install/#install-terragrunt" target="_blank">installed</a> by downloading the <a href="https://github.com/gruntwork-io/terragrunt/releases" target="_blank">binary (windows/linux/mac)</a> and adding its location to the system environment path variable.</div><div><br /></div><div><span style="font-size: medium;"><b>Terragrunt Configuration</b></span></div><div><br /></div><div>Terragrunt configuration is defined in a <b>terragrunt.hcl</b> file. This uses the same HCL syntax as Terraform itself. Terragrunt also supports JSON-serialized HCL defined in a terragrunt.hcl.json. Terragrunt by default checks for terragrunt.hcl or terragrunt.hcl.json file in the current working directory. We can also pass command line argument --terragrunt-config or set TERRAGRUNT_CONFIG environment variable specifying the config path, overriding the default behavior.</div>
<pre class="brush: plain"># vim: set syntax=terraform:
skip = local.toplevel.inputs.stage == "rd"
include {
path = find_in_parent_folders()
}
dependencies {
paths = [
"../base",
"../pub"
]
}
locals {
topLevel = read_terragrunt_config(find_in_parent_folders())
stack = get_env("MY_STACK")
local.region = "us-west-1"
}
remote_state {
backend = "s3"
config = {
bucket = "${local.toplevel.inputs.account_name}-terraform-state"
key = "${path_relative_to_include()}/terraform.tfstate"
region = local.region
encrypt = local.toplevel.remote_state.config.encrypt
dynamodb_table = local.toplevel.remote_state.config.dynamodb_table
}
}
inputs = {
cft_generator_image = get_env("MY_IMAGE_cft-generator", "IMAGE_NA")
}
skip = !contains(["dev", "cert"], local.stack)
</pre>
<div><br /></div><div><b><span style="font-size: medium;">Terragrunt Features</span></b></div><div><br /></div><div><div>In software development, we often need to setup up multiple environments (quality, stage, production) for different phases of software development. The terraform contents for each environment would be more or less identical, except perhaps for a few settings. Although terraform modules provides a solution to reduce duplicate code, it requires to set up input variables, output variables, providers, and remote state, adding more maintenance overhead.</div><div><br /></div><div>Terragrunt provides the ability to download remote Terraform configurations. We define Terraform infrastructure code once with any configuration which is different across environments being exposed as terraform input variable. We then define terragrunt.hcl file with a terraform { ... } block that specifies from where to download the Terraform code, as well as the environment-specific values for the input variables in that Terraform code. Below is an example of env-name/application/terragrunt.hcl file. The below source parameter specifies the location of module to deploy and the inputs specifies the input values for the current environment.</div>
<pre class="brush: plain">terraform {
source = "git::git@github.com:foo/modules.git//app?ref=v0.0.3"
}
inputs = {
instance_count = 3
instance_type = "t2.micro"
}
</pre>
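<div>These inputs map onto ordinary Terraform input variables declared in the referenced module. Below is a minimal sketch of what the module's variables.tf might contain; the variable names instance_count and instance_type are taken from the inputs above, while the descriptions and defaults are illustrative assumptions.</div>
<pre class="brush: plain"># variables.tf inside the module referenced by the source URL (illustrative)
variable "instance_count" {
  description = "Number of EC2 instances to launch"
  type        = number
  default     = 1
}

variable "instance_type" {
  description = "EC2 instance type"
  type        = string
  default     = "t2.micro"
}
</pre>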
<div><br /></div><div>Terragrunt downloads the configuration code specified via the source parameter, along with its modules and providers, into the .terragrunt-cache directory (in the current directory) by default. Once downloaded, it reuses the config/module files for subsequent commands. It then copies all files in the current directory into the temp directory and executes the Terraform command there. Terragrunt passes any variables defined in the inputs block as environment variables (prefixed with <b>TF_VAR_</b>) to the Terraform code. Terragrunt downloads the code into the temp directory only the first time, and skips the download step on subsequent runs unless the source URL changes. The <b>--terragrunt-source</b> command-line option or the <b>TERRAGRUNT_SOURCE</b> environment variable is used to override the source parameter in .hcl files. If the source parameter refers to a local file path, then Terragrunt copies the local files into the temp directory on every execution.</div></div><div><br /></div><div><div>Although Terraform supports remote state storage in various cloud backends, the backend configuration does not support expressions, variables, or functions. Hence the remote state config would in most cases have to be duplicated across multiple terraform modules. With Terragrunt, although the terraform backend is specified in main.tf for each module, it is left blank and the entire remote state configuration is defined in the remote_state block within the single terragrunt.hcl at the root level. Each child terragrunt.hcl file automatically includes all the settings from the root terragrunt.hcl file using the include block. The include block tells Terragrunt to use the exact same Terragrunt configuration from the terragrunt.hcl file specified via the path parameter (with a <b>find_in_parent_folders()</b> or <b>path_relative_to_include()</b> path value).</div>
<pre class="brush: plain">terraform {
# The configuration for this backend will be filled in by Terragrunt
backend "s3" {}
}
</pre>
<div>Terragrunt enables passing specific CLI arguments to specific commands using an extra_arguments block in the terragrunt.hcl file. There can be more than one extra_arguments block in the hcl file. Terragrunt's run-all command enables deploying multiple Terraform modules using a single command. Terragrunt allows adding before and after hooks in order to execute pre-defined custom actions before or after execution of the terraform command. </div><div><br /></div><div>Terragrunt auto-initializes by default and automatically calls terraform init during other terragrunt commands (e.g. terragrunt plan) if it detects that terraform init has never been called, sources are not downloaded, or modules/remote state were recently updated. It also automatically re-runs the terraform command when Terraform fails with transient errors. Terragrunt provides a way to configure the logging level through the <b>--terragrunt-log-level</b> command flag. It also provides <b>--terragrunt-debug</b>, which can be used to generate terragrunt-debug.tfvars.json.</div><div><br /></div><div>Terragrunt allows assuming an AWS IAM role using the <b>--terragrunt-iam-role</b> command line argument or the <b>TERRAGRUNT_IAM_ROLE</b> environment variable. It can also automatically load credentials using the <a href="https://aws.amazon.com/blogs/security/a-new-and-standardized-way-to-manage-credentials-in-the-aws-sdks/" target="_blank">Standard AWS approach</a>.</div></div><div><br /></div><div><br /></div><div><div><b><span style="font-size: large;">Terragrunt Configuration Blocks</span></b></div><div><br /></div><div><div>Terragrunt has the config blocks below, which are used to configure and interact with Terraform. Terragrunt parses the config blocks in the following order of precedence:</div><div><ol><li>include block</li><li>locals block</li><li>terraform block (of all configurations when -all flavored commands are used)</li><li>dependencies block</li><li>dependency block (for -all flavored commands, dependency block is executed before terraform block)</li><li>everything else</li><li>config referenced by include</li></ol></div><div>Blocks that are parsed earlier in the process are made available for use when parsing later blocks, but blocks that are parsed later cannot be referenced in earlier blocks.</div></div><div><br /></div><div><span style="font-size: medium;"><b>terraform block</b></span></div><div><br /></div><div>The <a href="https://terragrunt.gruntwork.io/docs/reference/config-blocks-and-attributes/#terraform" target="_blank">terraform block</a> is used to configure how Terragrunt interacts with Terraform. It specifies where to find the Terraform configuration files and whether there are any hooks to run before or after calling Terraform. Below are the arguments supported by the terraform block.</div><div><br /></div><div><b>source (attribute)</b>: It specifies the location of the Terraform configuration files. It supports the exact same syntax as the <a href="https://www.terraform.io/docs/language/modules/sources.html" target="_blank">module source</a> parameter, allowing local file paths, Git URLs, and Git URLs with ref parameters.</div><div><br /></div><div><b>extra_arguments (nested block)</b>: It allows passing extra arguments to the terraform CLI. We can also specify the subcommands it applies to, a map of environment variables and a list of file paths to Terraform vars files (.tfvars).</div>
<pre class="brush: plain">terraform {
extra_arguments "retry_lock" {
commands = [
"init",
"apply"
]
arguments = [
"-lock-timeout=20m"
]
env_vars = {
TF_VAR_var_from_environment = "value"
}
}
}
</pre>
<div><br /></div><div><b>before_hook (nested block)</b>: It specifies command hooks that run before terraform is called. Hooks run from the directory with the terraform module or where terragrunt.hcl lives. It takes the list of terraform subcommands before which the hook should run, the command (and its arguments) to execute as the hook, the working directory for the hook, and a flag indicating whether to run the hook even if a previous hook hit an error.</div><div><br /></div><div><b>after_hook (nested block)</b>: It specifies command hooks that run after terraform is called. Hooks run from the directory where terragrunt.hcl lives. It supports the same arguments as before_hook.</div>
<pre class="brush: plain">terraform {
source = "git::git@github.com:acme/infrastructure-modules.git//networking/vpc?ref=v0.0.1"
extra_arguments "retry_lock" {
commands = get_terraform_commands_that_need_locking()
arguments = ["-lock-timeout=20m"]
}
before_hook "before_hook_2" {
commands = ["apply"]
execute = ["echo", "Bar"]
run_on_error = true
}
}
</pre>
<div><br /></div><div><b><span style="font-size: medium;">remote_state block</span></b></div><div><br /></div><div><div>The <a href="https://terragrunt.gruntwork.io/docs/reference/config-blocks-and-attributes/#remote_state" target="_blank">remote_state block</a> is used to configure remote state configuration for Terraform code using Terragrunt. It supports the following arguments:</div><div><br /></div><div><b>backend (attribute)</b>: It specifies which remote state backend <a href="https://www.terraform.io/docs/language/settings/backends/index.html" target="_blank">supported by terraform</a> will be configured.</div><div><br /></div><div><b>disable_init (attribute)</b>: It allows to skip automatic initialization of the backend by Terragrunt. The s3 and gcs backends have support in Terragrunt for automatic creation if the storage does not exist. By default it is set to false.</div><div><br /></div><div><b>disable_dependency_optimization (attribute)</b>: It disables optimized dependency fetching for terragrunt modules using this remote_state block.</div><div><br /></div><div><b>generate (attribute)</b>: Configure Terragrunt to automatically generate a .tf file that configures the remote state backend. This is a map that expects two properties</div><div><br /></div><div><b>config (attribute)</b>: An arbitrary map that is used to fill in the backend configuration in Terraform. All the properties will automatically be included in the Terraform backend block (with a few exceptions: see below). Terragrunt does special processing of the config attribute for the <b>s3</b> and <b>gcs</b> remote state backends, and supports additional keys that are used to configure the automatic initialization feature of Terragrunt.</div></div>
<pre class="brush: plain">remote_state {
backend = "s3"
config = {
bucket = "testbucket"
key = "path/to/test/key"
region = "us-west-1"
}
}
</pre>
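<div>As a sketch of the generate attribute mentioned above, the same remote_state block can also ask Terragrunt to write the backend configuration into a file, removing the need for a hand-written backend stub in each module (the bucket and file names here are illustrative):</div>
<pre class="brush: plain">remote_state {
  backend = "s3"

  # Write a backend.tf into the working directory; overwrite it only if it was
  # previously generated by Terragrunt
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }

  config = {
    bucket = "testbucket"
    key    = "${path_relative_to_include()}/terraform.tfstate"
    region = "us-west-1"
  }
}
</pre>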
<div><br /></div><div><div><b><span style="font-size: medium;">include block</span></b></div><div><br /></div><div><div>The <a href="https://terragrunt.gruntwork.io/docs/reference/config-blocks-and-attributes/#include" target="_blank">include block</a> is used to specify inheritance of Terragrunt configuration files. The included config (called as parent) will be merged with the current configuration (called the child) before processing. It support below arguments:</div><div><br /></div><div><b>path (attribute)</b>: It specifies the path of the Terragrunt configuration file (parent config) that should be merged with the current (child) configuration.</div><div><br /></div><div><b>expose (attribute, optional)</b>: It specifies whether or not the included config should be parsed and exposed as a variable. The data of the included config can be referenced using the variable under include.</div><div><br /></div><div><b>merge_strategy (attribute, optional)</b>: It determines how the included config would be merged. Its values include <b>no_merge</b> (do not merge the included config), <b>shallow</b> (do a shallow merge - default), <b>deep</b> (do a deep merge of the included config).</div></div>
<pre class="brush: plain">include {
path = find_in_parent_folders()
expose = true
}
inputs = {
remote_state_config = include.remote_state
}
</pre>
<div><br /></div>
<div><b><span style="font-size: medium;">locals block</span></b></div><div><br /></div><div>The <a href="https://terragrunt.gruntwork.io/docs/reference/config-blocks-and-attributes/#locals" target="_blank">locals block</a> is used to define aliases for Terragrunt expressions that can be referenced within the configuration. The locals block does not have a defined set of arguments that are supported. Instead, all the arguments passed into locals are available under the reference local.ARG_NAME throughout the Terragrunt configuration.</div>
<pre class="brush: plain">locals {
aws_region = "us-east-1"
}
inputs = {
region = local.aws_region
name = "${local.aws_region}-bucket"
}
</pre>
<div><br /></div><div><b><span style="font-size: medium;">dependency block</span></b></div><div><br /></div><div><div>The <a href="https://terragrunt.gruntwork.io/docs/reference/config-blocks-and-attributes/#dependency" target="_blank">dependency block</a> is used to configure module dependencies. Each dependency block exports the outputs of the target module as block attributes which can be referenced throughout the configuration. We can define more than one dependency block, with each block identified using a label. The dependency blocks are fetched in parallel at each source level, but each recursive dependency will be parsed serially. The dependency block supports the below arguments:</div><div><br /></div><div><b>name (label)</b>: It is used by each dependency block to differentiate from other dependency blocks in the terragrunt configuration. It is used to reference the specific dependency by name.</div><div><br /></div><div><b>config_path (attribute)</b>: It specifies the path to a Terragrunt module (directory containing terragrunt.hcl) to be included as a dependency in the current configuration.</div><div><br /></div><div><b>skip_outputs (attribute)</b>: It skips calling terragrunt output when processing this dependency, when set true.</div><div><br /></div><div><b>mock_outputs (attribute)</b>: A map of arbitrary key value pairs to be used as the outputs attribute when no outputs are available from the target module, or when skip_outputs is set to true.</div><div><br /></div><div><b>mock_outputs_allowed_terraform_commands (attribute)</b>: A list of Terraform commands for which mock_outputs are allowed. If a command is used where mock_outputs is not allowed, and no outputs are available in the target module, Terragrunt will throw an error when processing this dependency.</div><div><br /></div><div><b>mock_outputs_merge_with_state (attribute)</b>: It merges the mock_outputs with the state outputs when enabled.</div></div>
<pre class="brush: plain">dependency "vpc" {
config_path = "../vpc"
mock_outputs_allowed_terraform_commands = ["validate"]
mock_outputs = {
vpc_id = "fake-vpc-id"
}
}
dependency "rds" {
config_path = "../rds"
}
inputs = {
vpc_id = dependency.vpc.outputs.vpc_id
db_url = dependency.rds.outputs.db_url
}
</pre>
<div><br /></div>
<div><b><span style="font-size: medium;">dependencies block</span></b></div><div><br /></div><div><div>The <a href="https://terragrunt.gruntwork.io/docs/reference/config-blocks-and-attributes/#dependencies" target="_blank">dependencies block</a> enumerates all the Terragrunt modules that need to be applied in order for current module to be able to apply. It supports the paths attribute as below:</div>
<div><b>paths (attribute)</b>: A list of paths to modules that should be marked as a dependency.</div></div>
<pre class="brush: plain">dependencies {
paths = ["../vpc", "../rds"]
}
</pre>
<div><br /></div>
<div><b><span style="font-size: medium;">generate block</span></b></div></div>
<div><div><br /></div><div>The <a href="https://terragrunt.gruntwork.io/docs/reference/config-blocks-and-attributes/#generate" target="_blank">generate block</a> is used to arbitrarily generate a file in the terragrunt working directory (where terraform is called). This enables to generate common terraform configurations that are shared across multiple terraform modules. The generate block supports the below arguments:</div><div><br /></div><div><b>name (label)</b>: The name to identify the generate block as multiple generate blocks can be defined in a terragrunt config.</div><div><br /></div><div><b>path (attribute)</b>: The path where the generated file should be written. If a relative path, it’ll be relative to the Terragrunt working dir.</div><div><br /></div><div><b>if_exists (attribute)</b>: It specifies whether to always overwrite the existing file (overwrite), or overwrite only on error (overwrite_terragrunt) or skip file write (skip) or exit with error (error), when a file already exists in the specified path attribute.</div><div><br /></div><div><b>comment_prefix (attribute)</b>: A prefix (default #) which can be used to indicate comments in the generated file.</div><div><br /></div><div><b>disable_signature (attribute)</b>: It disables including a signature in the generated file, when set as true.</div><div><br /></div><div><b>contents (attribute)</b>: The contents of the generated file.</div></div>
<pre class="brush: plain">generate "provider" {
path = "provider.tf"
if_exists = "overwrite"
contents = <<EOF
provider "aws" {
region = "us-east-1"
version = "= 2.3.1"
allowed_account_ids = ["1234567890"]
}
EOF
}
</pre>
<div><br /></div>
<div><div><div><b><span style="font-size: large;">Terragrunt Configuration Attributes</span></b></div><div><br /></div><div style="text-align: left;"><b>inputs</b>: The <a href="https://terragrunt.gruntwork.io/docs/reference/config-blocks-and-attributes/#inputs" target="_blank">input</a> attribute is a map that is used to specify the input variables and their values to pass in to Terraform. Each entry is passed to terraform using the form TF_VAR_variablename, with the value in json encoded format.</div><div style="text-align: left;"><br /><b>download_dir</b>: The <a href="https://terragrunt.gruntwork.io/docs/reference/config-blocks-and-attributes/#download_dir" target="_blank">download_dir</a> string option is used to override the default download directory.</div><div style="text-align: left;"><br /><b>prevent_destroy</b>: The <a href="https://terragrunt.gruntwork.io/docs/reference/config-blocks-and-attributes/#prevent_destroy" target="_blank">prevent_destroy</a> boolean flag prevents destroy or destroy-all command to actually destroy resources of the selected Terraform module, thus protecting the module.</div><div style="text-align: left;"><br /><b>skip</b>: The <a href="https://terragrunt.gruntwork.io/docs/reference/config-blocks-and-attributes/#skip" target="_blank">skip</a> boolean flag skips the selected terragrunt module, protecting it from any changes or ignoring them if they do not define any infrastructure by themselves.</div><div style="text-align: left;"><br /><b>iam_role</b>: The <a href="https://terragrunt.gruntwork.io/docs/reference/config-blocks-and-attributes/#iam_role" target="_blank">iam_role</a> attribute can be used to specify an IAM role that Terragrunt should assume prior to invoking Terraform.</div><div style="text-align: left;"><br /><b>iam_assume_role_duration</b>: The <a href="https://terragrunt.gruntwork.io/docs/reference/config-blocks-and-attributes/#iam_assume_role_duration" target="_blank">iam_assume_role_duration</a> attribute specifies the STS (<a href="https://docs.aws.amazon.com/STS/latest/APIReference/welcome.html" target="_blank">Security Token Service</a>) session duration, in seconds, for the IAM role that Terragrunt should assume prior to invoking Terraform.</div><div style="text-align: left;"><br /><b>terraform_binary</b>: The <a href="https://terragrunt.gruntwork.io/docs/reference/config-blocks-and-attributes/#terraform_binary" target="_blank">terraform_binary</a> string option overrides the default terraform binary path.</div><div style="text-align: left;"><br /><b>terraform_version_constraint</b>: The <a href="https://terragrunt.gruntwork.io/docs/reference/config-blocks-and-attributes/#terraform_version_constraint" target="_blank">terraform_version_constraint</a> string overrides the default minimum supported version of terraform. 
Terragrunt supports the latest version of terraform by default.</div><div style="text-align: left;"><br /><b>terragrunt_version_constraint</b>: The <a href="https://terragrunt.gruntwork.io/docs/reference/config-blocks-and-attributes/#terragrunt_version_constraint" target="_blank">terragrunt_version_constraint</a> string specifies the Terragrunt CLI version to be used with the configuration.</div><div style="text-align: left;"><br /><b>retryable_errors</b>: The <a href="https://terragrunt.gruntwork.io/docs/reference/config-blocks-and-attributes/#retryable_errors" target="_blank">retryable_errors</a> list overrides the default list of retryable errors with this custom list.</div></div><div><br /></div><div><br /></div><div><div><b><span style="font-size: large;">Terragrunt Built-in functions</span></b></div><div><br /></div><div>All Terraform built-in functions are supported in Terragrunt config files. Terragrunt also has the following built-in functions which can be used in terragrunt.hcl.</div><div><br /></div><div><b>find_in_parent_folders()</b>: It searches up the directory tree from the current terragrunt.hcl file and returns the absolute path to the first terragrunt.hcl in a parent folder or exit with an error if no such file is found. It also takes optional name parameter to search with different filename and fallback value to return if filename not found.</div><div><br /></div><div><b>path_relative_to_include()</b>: It returns the relative path between the current terragrunt.hcl file and the path specified in its include block.</div><div><br /></div><div><b>path_relative_from_include()</b>: It returns the relative path between the path specified in its include block and the current terragrunt.hcl file.</div><div><br /></div><div><b>get_env(NAME, DEFAULT)</b>: It returns the value of variable named NAME. If variable not set then it returns the DEFAULT value if set, or else it throws an exception.</div><div><br /></div><div><b>get_platform()</b>: It returns the current Operating System.</div><div><br /></div><div><b>get_terragrunt_dir()</b>: It returns the directory where the Terragrunt configuration file (by default terragrunt.hcl) lives.</div><div><br /></div><div><b>get_parent_terragrunt_dir()</b>: It returns the absolute directory where the Terragrunt parent configuration file (by default terragrunt.hcl) lives.</div><div><br /></div><div><b>get_terraform_commands_that_need_vars()</b>: It returns the list of terraform commands that accept -var and -var-file parameters.</div><div><br /></div><div><b>get_terraform_commands_that_need_input()</b>: Returns the list of terraform commands that accept the -input=(true or false) parameter. Similar functions are available to return terraform commands accepting -locl-timeout and -parallelism parameter s.</div><div><br /></div><div><b>get_terraform_command()</b>: It returns the current terraform command in execution.</div><div><br /></div><div><b>get_terraform_cli_args()</b>: It returns cli args for the current terraform command in execution.</div><div><br /></div><div><b>get_aws_account_id()</b>: It returns the AWS account id associated with the current set of credentials.</div><div><br /></div><div><b>get_aws_caller_identity_arn()</b>: It returns the ARN of the AWS identity associated with the current set of credentials.</div><div><br /></div><div><b>run_cmd(command, arg1, ..)</b>: It runs the specified shell command and returns the stdout as the result of the interpolation. 
The --terragrunt-quiet argument prevents terragrunt from displaying the output in the terminal, which can be used to redact sensitive values. The invocations of run_cmd are cached based on directory and executed command.</div><div><br /></div><div><b>read_terragrunt_config(config_path, [default_val])</b>: It parses the terragrunt config in the specified path and serializes the result into a map which can be used to reference the values of the parsed config. It exposes all the attributes and blocks in the terragrunt configuration, along with outputs from dependency blocks. The optional default value is returned if the file does not exist.</div><div><br /></div><div><b>sops_decrypt_file()</b>: It decrypts a yaml or json file encrypted with <a href="https://github.com/mozilla/sops" target="_blank">sops</a>.</div><div><br /></div><div><b>get_terragrunt_source_cli_flag()</b>: It returns the value passed in via the CLI --terragrunt-source option or the TERRAGRUNT_SOURCE environment variable. It returns an empty string when neither of those values is provided.</div></div></div><div><br /></div><div><br /></div><div><b><span style="font-size: large;">Terragrunt Commands</span></b></div><div><br /></div><div><div>Since Terragrunt is a thin wrapper for Terraform, with a few exceptions for special commands, Terragrunt forwards all other commands to Terraform. Hence when we run terragrunt apply, Terragrunt executes terraform apply.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">terragrunt <terraform-command></span></b></div><div><br /></div><div>$ <b><span style="color: #2b00fe;">terragrunt init</span></b></div><div>$ <b><span style="color: #2b00fe;">terragrunt plan</span></b></div><div><div>$ <b><span style="color: #2b00fe;">terragrunt apply</span></b></div><div>$ <b><span style="color: #2b00fe;">terragrunt destroy</span></b></div><div><br /></div></div></div><div><div>The Terragrunt run-all command runs the provided terraform command against a stack, which is a tree of terragrunt modules. The command will recursively find terragrunt modules in the current directory tree and run the terraform command in each module, in the defined dependency ordering.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">terragrunt run-all <terraform-command></span></b></div><div>$ <b><span style="color: #2b00fe;">terragrunt run-all apply</span></b></div><div><br /></div><div>The below command outputs the current terragrunt state (in limited form) in JSON format.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">terragrunt terragrunt-info</span></b></div><div><br /></div><div>The validate-inputs command outputs information about the input variables configured with the given terragrunt configuration. It specifically prints out unused inputs that are not defined in the corresponding Terraform module, and required Terraform input variables which are not being passed. By default it runs in relaxed mode and can run in strict mode using the --terragrunt-strict-validate flag.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">terragrunt validate-inputs</span></b></div><div><br /></div><div>The below command prints the terragrunt dependency graph in DOT format.
It recursively searches the current directory for Terragrunt modules and builds the dependency graph based on dependency and dependencies blocks.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">terragrunt graph-dependencies</span></b></div><div><br /></div><div>The HCL format command recursively searches for hcl files under a given directory tree and formats them. The --terragrunt-check flag allows only verifying the file formats without rewriting them.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">terragrunt hclfmt</span></b></div><div><br /></div><div>$ <b><span style="color: #2b00fe;">terragrunt hclfmt --terragrunt-check</span></b></div></div><div><br /></div><div><br /></div><div><div><b><span style="font-size: large;">tfsec Scanner</span></b></div><div><br /></div><div><a href="https://tfsec.dev/" target="_blank">tfsec</a> is a static analysis security scanner for Terraform code. It uses deep integration with the official HCL parser to ensure security issues can be detected before the infrastructure changes take effect.</div><div><br /></div><div>It is <a href="https://tfsec.dev/docs/installation/" target="_blank">installed</a> using <a href="https://docs.brew.sh/Homebrew-on-Linux" target="_blank">HomeBrew</a> on Linux and the <a href="https://chocolatey.org/" target="_blank">Chocolatey</a> package manager on Windows. Alternatively we can also download the latest binary from the <a href="https://github.com/aquasecurity/tfsec/releases" target="_blank">release page</a>.</div><div><br /></div><div>The tfsec config file is a file in the .tfsec folder in the root, named config.json or config.yml. It is automatically loaded if it exists. The config file can also be set with the --config-file option:</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">tfsec --config-file tfsec.yml</span></b></div><div><br /></div><div>The tfsec config file contains the severity overrides and check exclusions. The "<b>severity_overrides</b>" section increases or decreases the severity for any check identifier. The "<b>exclude</b>" section allows specifying the list of check identifiers to be excluded from the scan. The check identifiers can be found in the left menu, under <a href="https://tfsec.dev/docs/aws/home/" target="_blank">provider checks</a>.</div><div><br /></div><div>The below tfsec command scans the specified directory, or the current directory if no directory is specified.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">tfsec <directory-path></span></b></div></div><div><br /></div><div><br /></div><div><b><span style="font-size: large;">TerraTest</span></b></div><div><br /></div><div><div>Terraform supports <a href="https://www.terraform.io/docs/extend/testing/unit-testing.html" target="_blank">unit testing</a>, <a href="https://www.terraform.io/docs/extend/testing/acceptance-tests/index.html" target="_blank">acceptance testing</a> and <a href="https://www.hashicorp.com/blog/testing-hashicorp-terraform" target="_blank">end to end testing</a>. <a href="https://terratest.gruntwork.io/" target="_blank">Terratest</a> is a unit testing framework which uses Go’s unit testing framework and <a href="https://pkg.go.dev/std" target="_blank">Go libraries</a>. It provides a variety of helper functions and patterns for common infrastructure testing tasks.
<a href="https://caylent.com/testing-your-code-on-terraform-terratest" target="_blank">Tests</a> are written using Go with Terraform file passed as argument and is invoked using "<b>go test</b>" command which initializes and applies the infrastructure. Once the tests are complete it destroys the infrastructure. <a href="https://blog.octo.com/en/test-your-infrastructure-code-with-terratest/" target="_blank">Terratest</a> provides a whole range of <a href="https://github.com/gruntwork-io/terratest/tree/master/modules" target="_blank">Go modules</a> to help with the testing.</div></div><div><br /></div>Pranav Patilhttp://www.blogger.com/profile/10125896717744035325noreply@blogger.com0tag:blogger.com,1999:blog-7610061084095478663.post-79612746209373016212021-08-09T17:44:00.102-07:002021-09-18T23:37:01.001-07:00Modern Javascript - A Brief Tutorial<div>Javascript is a light weight object oriented language which dominates most of the Web development. It has a simple syntax, supports Event-based programming and allowed to dynamically generate content on web pages. All the major browsers slowly started providing support to run JavaScript efficiently by developing their own JavaScript engines. Different browsers use different JS engine to execute Javascript code, for example Microsoft Edge uses <a href="https://github.com/chakra-core/ChakraCore" target="_blank">Chakra</a>, Firefox uses <a href="https://firefox-source-docs.mozilla.org/js/index.html" target="_blank">SpiderMonkey</a> and Google Chrome uses <a href="https://v8.dev/" target="_blank">V8</a> JS Engine. Hence the same Javascript code could behave differently on different browsers. The release of <a href="https://nodejs.org/">Node JS</a>, a runtime environment for JavaScript, allowed to run Javascript outside the browser on the server side and supported asynchronous programming, which has truly revolutionized Javascript. Node JS was developed in 2009 using Chrome <a href="https://v8.dev/" target="_blank">V8 engine</a> of the <a href="https://www.chromium.org/Home" target="_blank">Chromium Project</a>. It allowed JavaScript to run everywhere, unifying all of web application development around a single programming language, rather having a different language for server-side and client-side scripts. This allowed companies to develop applications entirely with JavaScript as a full-stack single programming language which made it easier to debug and save costs in the overall development process. The popularity of JavaScript led to the creation of several libraries and frameworks that have made application development much more efficient with better performance. Libraries like <a href="https://reactjs.org/" target="_blank">React JS</a>, <a href="https://jquery.com/" target="_blank">jQuery</a>, <a href="https://d3js.org/">D3.js</a> etc and Frameworks like <a href="https://angular.io/" target="_blank">Angular</a>, <a href="https://emberjs.com/" target="_blank">Ember JS</a>, and <a href="https://vuejs.org/" target="_blank">Vue JS</a> provide optimal performance and features to build large applications.</div><div><br /></div><div><a href="https://www.ecma-international.org/publications-and-standards/standards/" target="_blank">ECMAScript</a> is the Javascript standard meant to ensure the interoperability of web pages across different web browsers. It is commonly used for client-side scripting on WebPages and it is increasingly being used for server side applications using Node JS. 
The first edition of ECMA-262 was published in 1997 and many editions followed. ECMAScript 2015, also known as <a href="http://es6-features.org/" target="_blank">ECMAScript 6</a> (ES6), is the second major revision of the JavaScript language, after ES5 which was standardized in 2009. More recently the <a href="https://techaffinity.com/blog/what-is-new-for-developers-in-javascript-ecmascript/" target="_blank">ECMAScript 2020</a> (ES11) features were released in 2020.</div><div><br /></div><div>Javascript uses the <a href="https://www.codecademy.com/resources/blog/your-guide-to-semicolons-in-javascript/" target="_blank">semicolon</a> to separate statements. The semicolon can be omitted if the statement is followed by a line break or there is only one statement in a {block}; it is only obligatory when there are two or more statements on the same line.</div><div><br /></div><div><br /></div><div><b><span style="font-size: large;">Data Types</span></b></div><div><br /></div><div>There are six basic data types in JavaScript which can be divided into three main categories: primitive (or primary), composite (or reference), and special data types. String, Number, and Boolean are primitive data types. Object, Array, and Function (which are all types of objects) are composite data types. Whereas Undefined and Null are special data types.</div><div><br /></div><div>Primitive data types can hold only one value at a time. The string data type is used to represent textual data. The number data type is used to represent positive or negative numbers with or without a decimal place. The Boolean data type can hold only two values: true or false. If a variable has been declared, but has not been assigned any value, then its value is undefined. A null value means that there is no value for the variable.</div><div><div><br /></div><div>The composite data types can hold collections of values and more complex entities. The object is a complex data type which allows storing collections of data. An object contains properties, defined as key-value pairs. A property key (name) is always a string, but the value can be any data type, like strings, numbers, booleans, or complex data types like arrays, functions and other objects.</div><div><br /></div><div>An array is a type of object used for storing multiple values in a single variable. An array can contain data of any data type, for example numbers, strings, booleans, functions, objects, and other arrays. In Javascript a trailing comma is allowed at the end of a collection literal, as below.</div>
<pre class="brush: javascript">namelist = ['John', 'Tyler', 'Mike',]</pre>
<div>A function is a callable object that executes a block of code. The <b>typeof</b> operator can be used to find out what type of data a variable or operand contains. It can be used with or without parentheses (typeof(x) or typeof x).</div>
<pre class="brush: javascript">var greeting = function(){
return "Hello World!";
}
console.log(typeof greeting)
console.log(greeting());
</pre>
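<div>For reference, typeof applied to the other data types discussed above gives the following results (the sample values are illustrative); note the well-known quirk that typeof null reports "object":</div>
<pre class="brush: javascript">typeof "hello"    // "string"
typeof 42         // "number"
typeof true       // "boolean"
typeof undefined  // "undefined"
typeof null       // "object" - a long-standing quirk of the language
typeof [1, 2, 3]  // "object" - arrays are objects
typeof {a: 1}     // "object"
</pre>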
<div>JavaScript is a loosely typed language and does not require a data type to be declared. We can assign any literal values to a variable, e.g., string, integer, float, boolean, etc.</div><div><br /></div><div><br /></div>
<div><b><span style="font-size: large;">Variable Types: Var, Let Const</span></b></div>
<div><br /></div>
<div><b>Var</b> variable scope is either global or local in ES6. When a var variable is declared outside a function, its scope is global, while when it is declared inside a function, the scope is local. When a var variable is declared outside a function, it is added as a property of the global object. The global object is window in the web browser and global on Node JS. When a var variable is defined it is immediately initialized to undefined. A var variable can be redeclared without any issue. Variables can be declared and initialized to a value <a href="https://www.tutorialsteacher.com/codeeditor?cid=js-143" target="_blank">without the var keyword</a>, but this is not recommended.</div><div><br /></div><div><b>Let</b> variables are declared with a block scope, where the variable is only accessible within the block. <a href="https://www.javascripttutorial.net/es6/difference-between-var-and-let/" target="_blank">Let</a> variables are not initialized to any value on declaration, and are not attached to the global object. Redeclaring a variable using the let keyword will cause an error. The temporal dead zone of a variable declared using the let keyword starts from the beginning of the block until its initialization is evaluated: storage space is allocated for the variable, but it is not initialized. Referencing an uninitialized let variable causes a ReferenceError. Let variables are mutable, hence their value can be updated.</div><div><br /></div><div><b>Const</b> variables, similar to let, do not allow redefining the same variable name again and cannot be accessed outside their declared (block/function) scope. Unlike let, a const binding cannot be reassigned: the const keyword creates a read-only reference to a value. The referenced value itself (for example an object's properties) can still be changed. In order to prevent any changes to the object itself, Object.freeze() is used for a shallow freeze.</div><div><br /></div><div>It is recommended to avoid using var as much as possible, and prefer const or let if possible. It is important to note that variable names are case-sensitive in JavaScript. Javascript allows using a variable or a function before declaring it. The Javascript compiler moves all the declarations of variables and functions to the top so that there will not be any error; this is called <a href="https://www.tutorialsteacher.com/javascript/javascript-hoisting" target="_blank">Hoisting</a>. These differences are illustrated in the snippet below.</div>
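<div>A brief sketch of the scoping and reassignment differences described above (the variable names are illustrative):</div>
<pre class="brush: javascript">// var is function-scoped and hoisted; it leaks out of blocks
if (true) {
  var a = 1;
  let b = 2;
}
console.log(a);      // 1 - var is visible outside the block
// console.log(b);   // ReferenceError - let is block-scoped

let counter = 0;
counter = 1;         // OK - let variables are mutable

const config = { region: "us-east-1" };
// config = {};      // TypeError - a const binding cannot be reassigned
config.region = "us-west-1";   // allowed - the referenced object is still mutable
Object.freeze(config);         // shallow freeze prevents further changes
config.region = "eu-west-1";   // silently ignored (throws in strict mode)
console.log(config.region);    // "us-west-1"
</pre>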
<div><br /></div><div><br /></div><div><div><b><span style="font-size: large;">Strict Mode (ES 5)</span></b></div><div><br /></div><div>Strict mode eliminates some JavaScript silent errors by changing them to throw errors. For example accessing global object or updating non-existing property etc. Strict mode applies to entire scripts or to individual functions. It does not apply to block statements enclosed in {} braces and applying it in such way does not do anything. In order to invoke strict mode for an entire script, the statement <b>"use strict";</b> is added before any other statements. Similarly to invoke strict mode for a function (and any inner functions), the statement <b>"use strict";</b> is added within the function before any other statements.</div></div>
<pre class="brush: javascript">// Entire script strict mode
"use strict";
x = 3.14; // throws error as x is not declared
// Function script strict mode
function strictMode() {
'use strict';
function nested() { return 'Nested function also in strict mode !'; }
return "This is a strict mode function! " + nested();
}
</pre>
<div><br /></div>
<div><b><span style="font-size: large;">Operators</span></b></div><div><br /></div><div><div>Javascript supports standard Logical operators i.e. AND (&&), OR (||) and NOT (!), standard Arithmetic (+, -, *, **, /, %, ++, --), Assignment (=, +=, -=, *=, /=, %=, <<=, >>=, >>>=, &=, ^=, |=, **=) and Ternary (:?) Operator. Javascript supports standard Comparison operators !=, >, <, >= and <=. </div><div><br /></div><div>The == comparison operator compares the equality of two operands without considering type, while the === operator compares equality of two operands including their type. The == operator compares for equality after doing any necessary type conversions. The === operator does not do any conversion, hence if two values are not the same type === will simply return false. <span style="color: red;">It is recommended to always use === and !== operators instead of their evil twins == and != operators.</span></div></div>
<pre class="brush: javascript">'' == '0' // false
0 == '' // true
0 == '0' // true
0 == false // true
0 === false // false
"" == false // true
"" === false // false
"" == 0 // true
"" === 0 // false
'0' == 0 // true
'0' === 0 // false
'17' == 17 // true
'17' === 17 // false
[1,2] == '1,2' // true
[1,2] === '1,2' // false
null == undefined // true
null === undefined // false
"abc" == new String("abc") // true, == operator evaluates String object using its toString() or valueOf() to primitive type
"abc" === new String("abc") // false, as primitive string and String object reference are of different types
</pre>
<div><br /></div>
<div><b><span style="font-size: large;">String</span></b></div>
<div><br /></div>
<div>String is a primitive data type in JavaScript. Any string must be enclosed in single or double quotation marks. A string can also be treated like a zero-index based character array. A string is immutable in JavaScript; it can be concatenated using the plus (+) operator.</div>
<pre class="brush: javascript">var str = 'Hello World';
str[0] // H
str.length // 11
</pre>
<div>The characters of the string can also be iterated over, for example using a for...of loop.</div>
<pre class="brush: javascript">for(var ch of str) {
console.log(ch);
}
</pre>
<div>The <b>eval()</b> is a global function in JavaScript that evaluates a specified string as JavaScript code and executes it.
</div>
<pre class="brush: javascript">eval("console.log('this is executed by eval()')");
</pre>
<div>The <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String" target="_blank">String object</a> is used to represent and manipulate a sequence of characters. It allows to check the <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/length" target="_blank">length</a>, and methods to get location of substring using <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/indexOf" target="_blank">indexOf()</a> method, get character at a location <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/charAt" target="_blank">charAt()</a>, extracting substrings with the <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/substring" target="_blank">substring()</a> method and allow many other string operations. The <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/replace" target="_blank">replace()</a> method is used to replace part of the string with another string. The <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/search" target="_blank">search()</a> method takes a regular expression and returns the first index on which the expression was found. The String() method can be used to convert any other data type value to string values.</div>
<pre class="brush: javascript">String(25) // 25 is converted to string "25"
String([]) // [] is converted to empty string ""
String(null) // null is converted to string "null"
String(true) // true is converted to string "true"
String({}) // {} is converted to the string "[object Object]" (similar to calling toString() on an object)
</pre>
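<div>Putting together a few of the String methods mentioned above (the sample string is illustrative):</div>
<pre class="brush: javascript">var msg = 'Hello World';
console.log(msg.length);                 // 11
console.log(msg.indexOf('World'));       // 6
console.log(msg.charAt(0));              // "H"
console.log(msg.substring(0, 5));        // "Hello"
console.log(msg.replace('World', 'JS')); // "Hello JS"
console.log(msg.search(/o.l/));          // 7 - index of the first regex match ("orl")
</pre>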
<div><br /></div>
<div><b><span style="font-size: large;">Number</span></b></div>
<div><br /></div>
<div>The Number is a primitive data type used for positive or negative integer, float, binary, octal, hexadecimal, and exponential values in JavaScript. The Number type in JavaScript is a double-precision 64 bit binary format, like double in Java. Integer values are accurate up to 15 digits in Javascript, and float values can keep precision up to 17 decimal places. Arithmetic operations on floating-point numbers in JavaScript are not always accurate.</div>
<div><br /></div>
<div><a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Number" target="_blank">Number</a> is a primitive wrapper object used to represent and manipulate numbers. The Number() is a constructor function in JavaScript that converts values of other types to numbers. The Number constructor contains constants and methods for working with numbers. The new operator with Number() returns an object which contains constants and methods for working with numbers. JavaScript treats primitive values as objects, so all the properties and methods are applicable to both literal numbers and number objects.</div>
<pre class="brush: javascript">var i = Number('100');
typeof(i); // returns number
var f = new Number('10.5');
typeof(f); // returns object
</pre>
<div>Below are examples of using the Number() constructor to convert any other data type value to numeric values.</div>
<pre class="brush: javascript">Number("25") // "25" string is converted to number 25
Number("") // "" string is converted to 0
Number([]) // [] is converted to 0
Number(null) // null is converted to 0
Number(true) // true is converted to 1
Number(false) // false is converted to 0
Number("Test") // "Test" could not be converted to number
</pre>
<div>It is very important to note that new Number() creates an object instance, while the Number() function returns a number type, which is the same as any primitive number.</div>
<pre class="brush: javascript">3 === new Number(3) // false, new keyword creates a Number object instance. Value does not equal to object reference.
3 === Number(3) // true, Number function returns a value type
typeof new Number(3) // object
typeof Number(3) // number
typeof 3 // number
</pre>
<div>The <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/parseInt" target="_blank">parseInt()</a>, parseFloat() functions parses a string argument and returns an integer or floating point number respectively.</div>
<div><br /></div>
<div>The global <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/NaN" target="_blank">NaN</a> property is a value representing Not-A-Number. NaN is a special numeric value that is not equal to itself.</div>
<pre class="brush: javascript">NaN === NaN // false
const notANumber = 3 * "a" // NaN
notANumber == notANumber // false
notANumber === notANumber // false
</pre>
<div>The corresponding global <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/isNaN" target="_blank">isNaN()</a> function is used to determine whether a value is NaN or not. The isNaN() global function attempts to coerce its argument to a number before checking whether it is NaN. It is recommended to avoid the global isNaN() function.</div>
<pre class="brush: javascript">isNaN("name") // true
isNaN("1") // false
</pre>
<div>ECMAScript 6 introduced a method for checking NaN, Number.isNaN.</div>
<pre class="brush: javascript">Number.isNaN(NaN) // true
Number.isNaN("name") // false
</pre>
<div><br /></div>
<div><b><span style="font-size: large;">Boolean</span></b></div>
<div><br /></div>
<div>The <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Boolean" target="_blank">Boolean object</a> is an object wrapper for a boolean value. The Boolean converts the passed value to boolean value. If the value is omitted or is 0, -0, null, false, NaN, undefined, or the empty string (""), the object has an initial value of false. All other values, including any object, an empty array ([]), or the string "false", create an object with an initial value of true.</div>
<pre class="brush: javascript">Boolean(null) // false
Boolean(0) // false
Boolean(25) // true
Boolean([]) // true
Boolean({}) // true
Boolean("Yeah !") // true
</pre>
<div><br /></div>
<div><b><span style="font-size: large;">Math</span></b></div>
<div><br /></div>
<div><a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Math" target="_blank">Math</a> is a built-in object that has properties and methods for mathematical constants and functions. It’s not a function object. There are various Math built in functions available in Javascript, few of them below.</div>
<div><br /></div>
<div>The <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Math/random" target="_blank">Math.random()</a> function returns a floating-point, pseudo-random number in the range 0 to less than 1 with approximately uniform distribution over that range.</div>
<pre class="brush: javascript">console.log(Math.random()); // number from 0 to <1
</pre>
<div>The <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Math/floor" target="_blank">Math.floor()</a> function returns the largest integer less than or equal to a given number.</div>
<pre class="brush: javascript">console.log(Math.floor(5.95)); // output: 5
console.log(Math.floor(5.05)); // output: 5
</pre>
<div>The <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Math/max" target="_blank">Math.max()</a> function returns the largest of the zero or more numbers given as input parameters, or NaN if parameter is not a number.</div>
<pre class="brush: javascript">console.log(Math.max(1, 3, 2));
const array1 = [1, 3, 2];
console.log(Math.max(...array1));
</pre>
<div><br /></div><div><br /></div>
<div><b><span style="font-size: large;">Object</span></b></div>
<div><br /></div>
<div>Objects are similar to variables in JavaScript, except that an object holds multiple values in the form of properties and methods. Objects are collections of key-value pairs in Javascript. An object can be created either using the Object Literal/Initializer syntax or using the Object() constructor with the new keyword. An object can be created with curly braces {…} with an optional list of properties. A property is a “key: value” pair, where key is a string (also called a “property name”), and value can be anything. Properties and methods can be accessed or added using the dot notation .propertyName or using square brackets ["property name"], as shown below. Objects in JavaScript are passed by reference from one function to another.</div>
<pre class="brush: javascript">const person = { // object literal syntax
name: 'Jarvis',
walk: function() {},
talk() { console.log("talking.."); }
}
person.talk();
person['name'] = 'John';
person.name = '';
var newperson = new Object(); // Object() constructor
newperson.firstName = "James";
newperson["lastName"] = "Bond";
newperson.age = 25;
newperson.getFullName = function () {
return this.firstName + ' ' + this.lastName;
};
</pre>
<div>The <b>for in</b> loop allows enumerating the properties of an object.</div>
<pre class="brush: javascript">for(var prop in person){
console.log(prop); // access property name
console.log(person[prop]); // access property value
};
</pre>
<div><br /></div>
<div>The Object class stores various keyed collections and more complex entities. The Object() constructor is used to create Objects. The Object class has a number of <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Object#static_methods" target="_blank">static methods</a> which allow getting or setting properties on any Object.</div>
<div><br /></div>
<div><b><a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Object/assign" target="_blank">Object.assign(target, source)</a></b>: It <a href="https://www.javascripttutorial.net/es6/javascript-object-assign/" target="_blank">copies</a> the values of all enumerable properties from one or more source objects to a target object, and returns the modified target object.</div>
<div><br /></div>
<div><b><a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Object/fromEntries" target="_blank">Object.fromEntries(value)</a></b>: It transforms a list of key-value pairs into an object.</div>
<div><br /></div>
<div><b><a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Object/freeze" target="_blank">Object.freeze()</a></b>: It freezes the object, were new properties cannot be added, existing properties cannot be removed, existing properties configurability, or writability cannot be altered and values of existing properties cannot be changed. It also prevents its prototype from being changed.</div>
<div><br /></div>
<div><b><a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Object/entries" target="_blank">Object.entries(value)</a></b>: It returns an array containing all of the [key, value] pairs of a given object's own enumerable string properties.</div>
<div><br /></div>
<div><b><a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Object/is" target="_blank">Object.is(value1, value2)</a></b>: <a href="https://www.javascripttutorial.net/es6/javascript-object-is/" target="_blank">Compares</a> if two values are the same value.</div>
<div><br /></div>
<div><b><a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Object/seal" target="_blank">Object.seal(object1)</a></b>: It seals the object, preventing new properties from being added to it and marking all existing properties as non-configurable.</div>
<div><br /></div>
<div><b><a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Object/values" target="_blank">Object.values(object1)</a></b>: It returns an array of a specified object's property values in order in which they are defined.</div><div><br /></div><div><div><b><a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Object/keys" target="_blank">Object.keys(object1)</a></b>: It returns an array containing all the property names for the specified object.</div><div><br /></div><div><b><a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Object/getOwnPropertyDescriptor" target="_blank">Object.getOwnPropertyDescriptor(object1, 'property1')</a></b>: It returns an object describing the configuration of a specific property on a given object. The configuration for specified property of an object includes its value, boolean config such as configurable, enumerable and writable.</div><div><br /></div><div><b><a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Object/defineProperty" target="_blank">Object.defineProperty(object1, 'property1', { .. })</a></b>: It defines a new property directly on an object, or modifies an existing property on an object, and returns the object.</div>
<pre class="brush: javascript">const object1 = {};
Object.defineProperty(object1, 'property1', {
value: 87,
writable: false
});
object1.property1 = 909;
console.log(object1.property1); // value still remains 87, throws error in strict mode
</pre>
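<div>A small example combining the enumeration methods described above (Object.keys(), Object.values(), Object.entries() and Object.fromEntries()):</div>
<pre class="brush: javascript">const config = { host: 'localhost', port: 8080 };
Object.keys(config); // ["host", "port"]
Object.values(config); // ["localhost", 8080]
Object.entries(config); // [["host", "localhost"], ["port", 8080]]
Object.fromEntries(Object.entries(config)); // shallow copy of config
</pre>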
<div><br /></div><div><b><a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Object/getPrototypeOf" target="_blank">Object.getPrototypeOf(object1)</a></b>: It returns the value of internal prototype of the specified object.</div><div><br /></div><div><b><a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Object/setPrototypeOf" target="_blank">Object.setPrototypeOf(object, prototype)</a></b>: It sets the internal of a specified object to another object or null.</div></div>
<div><br /></div>
<div><a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Memory_Management" target="_blank">Memory management</a> in JavaScript is performed automatically and invisibly to us. JavaScript automatically allocates memory when objects are created and frees it when they are not used anymore (<a href="https://javascript.info/garbage-collection" target="_blank">garbage collection</a>). Objects are retained in memory while they are reachable.</div>
<div><br /></div>
<div><b><span style="font-size: large;">Date</span></b></div>
<div><br /></div>
<div>JavaScript provides the Date object to work with date & time, including days, months, years, hours, minutes, seconds and milliseconds.</div>
<pre class="brush: javascript">// current date as function
Date();
// current date as an object
var currentDate = new Date();
// Date specifying milliseconds
var date2 = new Date(1000);
var date3 = new Date("3 august 2021");
var date4 = new Date("2021 3 August");
var date5 = new Date("3 august 2021 20:21:44");
// Date using valid separator
var date6 = new Date("August-2021-3");
currentDate.toDateString();
currentDate.toLocaleDateString();
currentDate.toGMTString();
currentDate.toISOString();
currentDate.toLocaleString();
currentDate.toString();
if(currentDate > date2) {
console.log(currentDate + ' is greater than ' + date2);
}
// convert valid date string into unix epoch milliseconds
Date.parse("8/2/2021");
</pre>
<div>The <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Date/toJSON" target="_blank">toJSON()</a> method converts a Date object into a string, formatted as a JSON date.</div>
<pre class="brush: javascript">const date = new Date('August 13, 1973 23:15:30 UTC');
console.log(date.toJSON());
</pre>
<div><br /></div>
<div><b><span style="font-size: large;">Arrays</span></b></div>
<div><br /></div>
<div>An array is a special type of variable in Javascript which can store multiple values. Every value is associated with a numeric index starting at 0. A JavaScript array can store elements of different data types; it is not required to store values of the same data type in an array. An array can also be initialized using the Array constructor with the new keyword. Arrays have several methods such as forEach(), join(), map(), push(), reduce(), slice() etc., a few of which are shown after the creation examples below.</div>
<pre class="brush: javascript">var stringArray1 = ["one", "two", "three"];
var numericArray1 = [1, 2, 3, 4];
var mixedArray1 = [1, "two", "three", 4];
var stringArray = new Array();
stringArray[0] = "one";
var numericArray = new Array(1);
numericArray[0] = 1;
var mixedArray = new Array(1, "two", 3, "four");
</pre>
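<div>A few of the array methods mentioned above, for illustration:</div>
<pre class="brush: javascript">const nums = [1, 2, 3];
nums.push(4); // appends an element, nums is now [1, 2, 3, 4]
nums.slice(1, 3); // [2, 3], returns a new array without modifying nums
nums.join('-'); // "1-2-3-4"
nums.reduce((sum, n) => sum + n, 0); // 10
</pre>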
<div><a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array" target="_blank">Array class</a> is a global object that is used in the construction of arrays. It has following handy static methods.</div>
<div><br /></div><div><b><a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/from" target="_blank">Array.from()</a></b>: It creates a new, shallow-copied Array instance from an array-like or iterable object.</div><div><br /></div><div><b><a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/isArray" target="_blank">Array.isArray()</a></b>: It determines whether the passed value is an Array.</div><div><br /></div><div><b><a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/of" target="_blank">Array.of()</a></b>: It creates a new Array instance from a variable number of arguments, regardless of number or type of the arguments.</div>
<div><br /></div><div><div><br /></div><div><b><span style="font-size: large;">JSON</span></b></div><div><br /></div><div>JSON stands for Javascript Object Notation. It is derived from JavaScript object syntax, but it is entirely text-based. JSON is a natural choice for data format in Javascript. JSON data is normally accessed in Javascript through dot notation. We can also use the square bracket syntax to access data from JSON.</div><div><br /></div><div>Javascript provides a JSON object which contains methods for parsing JavaScript Object Notation (JSON) and converting values into JSON.</div><div><br /></div><div><b><a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/JSON/stringify" target="_blank">JSON.stringify(original, replacer)</a></b>: It converts a JavaScript object or value to a JSON string. By default, all instances of undefined are replaced with null. The optional replacer function allows to replace specific values. On other hand when the optional replacer array is specified then the JSON string only includes the specified properties from the replacer array.</div></div>
<pre class="brush: javascript">console.log(JSON.stringify([new Number(3), new String('false'), new Boolean(false)]));
// expected output: "[3,"false",false]"
console.log(JSON.stringify(new Date(Date.UTC(2021, 0, 2, 15, 4, 5))));
// expected output: ""2021-01-02T15:04:05.000Z""
function replacer(key, value) {
if (typeof value === 'string') { // Filter properties
return undefined;
}
return value;
}
var foo = {company: 'SpaceX', model: 'box', week: 45, transport: 'jet', month: 7};
// Replacer Function
JSON.stringify(foo, replacer);
// Replacer Array
JSON.stringify(foo, ['week', 'month']);
</pre>
<div><br /></div>
<div><div><b><a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/JSON/parse" target="_blank">JSON.parse(text, reviver)</a></b>: It parses a JSON string to construct the JavaScript object or value. An optional reviver function can be provided to perform a transformation on the resulting object before it is returned. JSON.parse() does not allow trailing commas and single quotes.</div></div>
<pre class="brush: javascript">JSON.parse('{}'); // {}
JSON.parse('true'); // true
JSON.parse('"foo"'); // "foo"
JSON.parse('[1, 5, "false"]'); // [1, 5, "false"]
JSON.parse('null'); // null
const jsonString = '[{"name":"John","score":51},{"name":"Jack","score":17}]';
var data = JSON.parse(jsonString, function reviver(key, value) {
return key === 'name' ? value.toUpperCase() : value;
});
console.log(data);
</pre>
<div><br /></div>
<div><b><span style="font-size: large;">Regular Expressions</span></b></div>
<div><br /></div>
<div>A <a href="https://eloquentjavascript.net/09_regexp.html" target="_blank">regular expression</a> in Javascript is a type of object. The regular expression can be either constructed using the RegExp constructor or written as a literal value by enclosing a pattern in forward slash (/) characters. The RegExp constructor uses the pattern as a normal string, so the regular rules apply for backslashes. The pattern between slash characters, requires a backslash be added before any forward slash in order to be the part of the pattern. Also backslashes that aren’t part of special character codes (like \n) will be preserved, rather than ignored as they are in strings, and change the meaning of the pattern. Some characters, such as question marks and plus signs, have special meanings in regular expressions and must be preceded by a backslash if they are meant to represent the character itself. In the below example, the pattern matches for "abc" and "eighteen+".</div>
<pre class="brush: javascript">let reg1 = new RegExp("abc");
let reg2 = /abc/;
let eighteenPlus = /eighteen\+/;
</pre>
<div>The syntax of a regular expression is <b>/pattern/modifiers</b>. Modifiers are used to change the match criteria. The modifier (i) performs case-insensitive matching, the (g) modifier finds all the matches rather than stopping after the first match, and the (m) modifier performs multiline matching. Modifiers are optional in a regular expression. The pattern is used to determine the match in the string.</div>
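<div>For example:</div>
<pre class="brush: javascript">/hello/i.test("Hello World"); // true, case-insensitive match
"car cat can".match(/ca./g); // ["car", "cat", "can"], g finds all matches
/^b/m.test("a\nb"); // true, m makes ^ match the start of each line
</pre>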
<div><br /></div>
<div>In a regular expression pattern, putting a set of characters between square brackets makes that part of the expression match any of the characters between the brackets. Within square brackets, a hyphen (-) between two characters can be used to indicate a range of characters, e.g. [0-9] covers all of them and matches any digit. Below are common character groups with their own built-in shortcuts.</div>
<div><br /></div>
<div>
<table border="1" cellpadding="0" cellspacing="0" style="border-collapse: collapse; border: 1px solid rgb(221, 238, 238); font-family: arial, sans-serif; text-align: left; text-shadow: rgb(255, 255, 255) 1px 1px 1px; width: 60%;">
<tbody>
<tr>
<th style="background-color: #ddefef; border: 1px solid rgb(221, 238, 238); color: #336b6b; padding: 8px;" width="30%">Character Group</th>
<th style="background-color: #ddefef; border: 1px solid rgb(221, 238, 238); color: #336b6b; padding: 8px;" width="70%">Description</th>
</tr>
<tr>
<td style="padding: 8px;">\d</td>
<td style="padding: 8px;">Any digit character</td>
</tr>
<tr>
<td style="padding: 8px;">\w</td>
<td style="padding: 8px;">An alphanumeric character (“word character”)</td>
</tr>
<tr>
<td style="padding: 8px;">\s</td>
<td style="padding: 8px;">Any whitespace character (space, tab, newline, and similar)</td>
</tr>
<tr>
<td style="padding: 8px;">\D</td>
<td style="padding: 8px;">A character that is not a digit</td>
</tr>
<tr>
<td style="padding: 8px;">\W</td>
<td style="padding: 8px;">A nonalphanumeric character</td>
</tr>
<tr>
<td style="padding: 8px;">\S</td>
<td style="padding: 8px;">A nonwhitespace character</td>
</tr>
<tr>
<td style="padding: 8px;">.</td>
<td style="padding: 8px;">Any character except for newline</td>
</tr>
</tbody></table></div>
<div><br /></div>
<div>In order to invert the set of characters, i.e. to match any character except the ones in the set, a caret (^) character is placed after the opening bracket, e.g. /[^01]/. To match an element which may be repeated one or more times, a plus sign (+) is placed after the element in the regular expression, e.g. /\d+/ matches one or more digit characters. Similarly, an element with a star (*) after it allows the pattern to match the element zero or more times. A question mark (?) in a pattern indicates the part of the pattern may occur zero times or one time. To indicate that a pattern should occur a precise number of times, braces are used with the exact number of occurrences, e.g. {4}. It is also possible to specify a range this way, e.g. {2,4} means the element must occur at least twice and at most four times. Open-ended ranges can also be specified using braces by omitting the number after the comma, e.g. {5,} means five or more times. In order to use an operator such as * or + on more than one element at a time, that part of the regular expression is enclosed in parentheses so it counts as a single element, e.g. /boo+(hoo+)+/i.</div>
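<div>A few examples of the repetition operators described above:</div>
<pre class="brush: javascript">/[^01]/.test("1100100"); // false, only 0s and 1s are present
/\d+/.test("abc123"); // true, one or more digits
/colou?r/.test("color"); // true, the u is optional
/\d{4}/.test("year 2024"); // true, exactly four digits
/\d{2,4}/.test("42"); // true, between two and four digits
/boo+(hoo+)+/i.test("Boohoooohoohooo"); // true
</pre>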
<div><br /></div>
<div>The RegExp object is a regular expression object with predefined properties and methods. The exec() and test() methods of RegExp object are used to match the regular expression patterns. The test() method takes the string and returns a Boolean indicating whether the string contains a match of the pattern in the expression.</div>
<pre class="brush: javascript">console.log(/abc/.test("abcde")); // true
console.log(/[01239]/.test("in 1992")); // true
console.log(/[0-9]/.test("in 1992")); // true
</pre>
<div>The exec() method returns an object containing the details of the match, or null if there is no match. The matched object has an index property which tells us where in the string the successful match begins. Apart from that, the matched object is an array of strings, whose first element is the string that was matched.</div>
<pre class="brush: javascript">let match = /\d+/.exec("one two 100");
console.log(match); // ["100"]
console.log(match.index); // 8
</pre>
<div>Regular expression objects have properties source and lastIndex. The source property contains the string that expression was created from, while the lastIndex property controls the position from where the next match will start. The regular expression must have the global (g) or sticky (y) option enabled, and the match must happen through the exec method. If the match was successful, the call to exec automatically updates the lastIndex property to point after the match. If no match was found, lastIndex is set back to zero, which is also the value it has in a newly constructed regular expression object.</div>
<pre class="brush: javascript">let pattern = /y/g;
pattern.lastIndex = 3;
let match = pattern.exec("xyzzy");
console.log(match.index); // 4
console.log(pattern.lastIndex); // 5
let global = /abc/g;
console.log(global.exec("xyz abc")); // ["abc"]
let sticky = /abc/y;
console.log(sticky.exec("xyz abc")); // null
</pre>
<div><br /></div>
<div><b><span style="font-size: large;">Implicit Coercion</span></b></div>
<div><br /></div>
<div>Javascript's <a href="https://dev.to/promhize/what-you-need-to-know-about-javascripts-implicit-coercion-e23" target="_blank">implicit coercion</a> simply refers to Javascript attempting to coerce an unexpected value type to the expected type. So we can pass a string where it expects a number, an object where it expects a string etc, and it will try to convert it to the right type. <span style="color: red;">It is recommended to avoid this feature</span>.</div>
<div><br /></div>
<div>Whenever we pass a string as an operand in a numeric expression involving any of the arithmetic operators (-, *, /, %), the string's conversion process is similar to calling the built-in Number() function on the value. Hence any string containing only numeric characters will be converted to its number equivalent, but a string containing a non-numeric character converts to NaN.</div>
<pre class="brush: javascript">3 * "3" // 3 * 3
3 * Number("3") // 3 * 3
</pre>
<div>The + operator unlike other mathematical operators, performs both Mathematical addition as well as String concatenation. When a string is an operand of the + operator, Javascript instead of converting the string to a Number, converts the number to a string.</div>
<pre class="brush: javascript">1 + 2 // 3
1 + "2" // "12"
1 + 2 + "1" // "31"
1 + "2" + 1 // "121"
</pre>
<div>Every javascript Object inherits a toString method, that is called whenever an Object is to be converted to a string. The return value of the toString method is used for such operations as string concatenation and mathematical expressions.</div>
<pre class="brush: javascript">const foo = {}
foo.toString() // [object Object]
const baz = {
toString: () => "I'm object baz"
}
baz + "!" // "I'm object baz!"
</pre>
<div>When it is a mathematical expression, Javascript will attempt to convert the return value to a number if it is not one already.</div>
<pre class="brush: javascript">const bar = {
toString: () => "2"
}
2 + bar // "22"
2 * bar // 4
</pre>
<div>The toString method inherited from Object works similar to the join method without arguments for arrays.</div>
<pre class="brush: javascript">[1,2,3].toString() // "1,2,3"
[1,2,3].join() // "1,2,3"
"me" + [1,2,3] // "me1,2,3"
4 + [1,2,3] // "41,2,3"
4 * [1,2,3] // NaN
</pre>
<div>When we pass an array where it expects a string, Javascript concatenates the return value of the toString method with the second operand. If it expects a number, it attempts to convert the return value to a number.</div>
<pre class="brush: javascript">4 * [] // 0
4 * Number([].toString())
4 * Number("")
4 * 0
4 / [2] // 2
4 / Number([2].toString())
4 / Number("2")
4 / 2
</pre>
<div>Every Javascript value can be coerced into either true or false. Coercion into boolean true means the value is truthy, while coercion into boolean false means the value is falsy. Javascript treats a value as falsy when it is 0, -0, null, undefined, "" (empty string), NaN and of course false. In all other cases Javascript evaluates the boolean expression value as true.</div>
<pre class="brush: javascript">if (-1) // truthy
if ("0") // truthy
if ({}) // truthy
const counter = 2
if (counter) // truthy
</pre>
<div>Javascript converts boolean true as 1 and boolean false as 0 when it is used in mathematical expression. It converts empty string to a 0 when used in mathematical expression.</div>
<pre class="brush: javascript">Number(true) // 1
Number(false) // 0
Number("") // 0
true + true // 2
4 + true // 5
3 * false // 0
3 * "" // 0
3 + "" // "3"
</pre>
<div>The valueOf method defined in the Object is used by Javascript whenever we pass an Object where it expects a string or numeric value. The valueOf method defined on an Object takes precedence (for both mathematical/string expressions) and is used even if the Object has defined the toString method. The Number object has a valueOf() method defined which returns a numeric value.</div>
<pre class="brush: javascript">const foo = {
valueOf: () => 3
}
3 + foo // 6
3 * foo // 9
const bar = {
toString: () => 2,
valueOf: () => 5
}
"sa" + bar // "sa5"
3 * bar // 15
2 + bar // 7
</pre>
<div><br /></div>
<div><b><span style="font-size: large;">Global Object</span></b></div>
<div><br /></div>
<div>Historically the global object in the web browser is window or frames. The Web Workers API which has no browsing context, uses self as a global object. Node JS uses the global keyword to reference the global object. In order to standardize this and make it available across all environments, ES2020 introduced the <a href="https://www.javascripttutorial.net/es-next/javascript-globalthis/" target="_blank">globalThis object</a>.</div><div><br /></div><div><br /></div><div><div><b><span style="font-size: large;">Execution Context</span></b></div><div><br /></div><div>When a JavaScript engine executes a script, it creates global and function execution contexts. Each <a href="https://www.javascripttutorial.net/javascript-execution-context/" target="_blank">execution context</a> has <a href="https://levelup.gitconnected.com/learn-javascript-fundamentals-scope-context-execution-context-9fe8673b3164" target="_blank">two phases</a>: the creation phase and the execution phase. When a script executes for the first time, the JavaScript engine creates a Global Execution Context. During the creation phase of Global context, it performs the following tasks:</div><div><ul style="text-align: left;"><li>Create a global object i.e., window in the web browser or global in Node.js.</li><li>Create a this object binding which points to the global object above.</li><li>Setup a memory heap for storing variables and function references.</li><li>Store the function declarations in the memory heap and variables within the global execution context with the initial values as undefined.</li></ul></div><div>During the execution phase, the JavaScript engine executes the code line by line. In this phase, it assigns values to variables and executes the function calls. For every function call, the JavaScript engine creates a new Function Execution Context. The Function Execution Context is similar to the Global Execution Context, but instead of creating the global object, it creates the arguments object that contains a reference to all the parameters passed into the function:</div><div><br /></div><div>Javascript uses <a href="https://www.javascripttutorial.net/javascript-call-stack/" target="_blank">call stack</a> to keep track of all the execution contexts including the Global Execution Context and Function Execution Contexts.</div></div>
<div><br /></div>
<div><b><span style="font-size: large;">Functions</span></b></div>
<div><br /></div>
<div>Javascript allows passing fewer or more arguments than declared while calling a function. If we pass fewer arguments, the remaining parameters will be undefined. If we pass more arguments, the additional arguments are ignored. Javascript provides an arguments object which holds the value of each argument passed to the function. The values of the arguments object can be accessed by index, similar to an array. The arguments object is still available even if the function does not declare any parameters.</div>
<pre class="brush: javascript">function ShowMessage(firstName, lastName) {
alert("Hello " + arguments[0] + " " + arguments[1]);
}
function ShowMessage() {
console.log("Hello " + arguments[0] + " " + arguments[1]);
}
ShowMessage("Steve", "Jobs"); // displays Hello Steve Jobs
</pre>
<div>In JavaScript, a function can have one or more inner functions. These nested functions are in the scope of the outer function. The inner function always has access to the variables and parameters of its outer function, even after the outer function has returned, which is known as a <a href="https://www.tutorialsteacher.com/javascript/closure-in-javascript" target="_blank">Closure</a> (see the closure sketch after the example below). However, the outer function cannot access variables defined inside inner functions. A function can also return another function in JavaScript. A function can be assigned to a variable, and the variable can then be used as a function; this is called a function expression. JavaScript also allows defining a function without a name, called an anonymous function; an anonymous function must be assigned to a variable or passed as an argument.</div>
<pre class="brush: javascript">var showMessage = function (){
console.log("Hello World!");
};
showMessage();
</pre>
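<div>Below is a small example of a closure, where the inner function retains access to the outer function's variable even after the outer function has returned:</div>
<pre class="brush: javascript">function createCounter() {
let count = 0; // captured by the closure
return function() {
count++;
return count;
};
}
const counter = createCounter();
console.log(counter()); // 1
console.log(counter()); // 2
</pre>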
<div><br /></div>
<div>All functions in JavaScript are objects. They are the instances of the <a href="https://www.javascripttutorial.net/javascript-function-type/" target="_blank">Function type</a>. Since functions are objects, they have properties and methods like other objects. Each function has two properties, the length property which determines the number of named arguments specified in the function declaration and the prototype property references the actual function object.</div>
<pre class="brush: javascript">function swap(x, y) {
let tmp = x;
x = y;
y = tmp;
}
console.log(swap.length);
console.log(swap.prototype);
</pre>
<div><br /></div>
<div>The function can be called as a constructor to create a new object, e.g. as new function_name(). ES6 has added a new meta property called new.target which allows detecting whether a function is called as a normal function (new.target is undefined) or as a constructor using the new operator (new.target references the constructor function). The apply() and call() methods as discussed below call a function with a given this value and arguments. The bind() method creates a new function instance whose this value is bound to the object that we provide. The below example changes the this value inside the start() method of the car object to the aircraft object and the bind() method returns a new function that is assigned to the taxiing variable.</div>
<pre class="brush: javascript">let car = {
speed: 5,
start: function() {
console.log('Start with ' + this.speed + ' km/h');
}
};
let aircraft = {
speed: 10,
fly: function() {
console.log('Flying');
}
};
let taxiing = car.start.bind(aircraft);
</pre>
<div>This also can be achieved directly by using the call method as below.</div>
<pre class="brush: javascript">car.start.call(aircraft);</pre>
<div><br /></div>
<div><b><span style="font-size: large;">Template Literals</span></b></div>
<div><br /></div>
<div>A back tick is used for a template literal, where '${..}' specifies an expression.</div>
<pre class="brush: javascript">const price = 12.34;
console.log("The price is $" + price);
console.log(`The price is $${price}`);
</pre>
<div><br /></div>
<div><b><span style="font-size: large;">Object Literals</span></b></div>
<div><br /></div>
<div>Prior to ES6, an object literal is a collection of name-value pairs.</div>
<pre class="brush: javascript">const name = 'Samuel';
const age = 23;
const person = {
name: name,
age: age
};
</pre>
<div>ES6 allows to eliminate the duplication when a property of an object is the same as the local variable name by including the name without a colon and value as below. Internally, when a property of an object literal only has a name, the JavaScript engine searches for a variable with the same name in the surrounding scope. If it can find one, it assigns the property the value of the variable.</div>
<pre class="brush: javascript">const samuel = { name, age };
console.log(samuel);
</pre>
<div><br /></div><div><b><span style="font-size: large;">Error Handling</span></b></div><div><br /></div><div>Javascript uses try-catch-finally block for error handling. In order to throw a custom error, the throw keyword is used with the Error object. The <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Error" target="_blank">Error</a> object is a base object for user-defined exceptions. The Error object has name, message and other properties. JavaScript allows to use throw with any argument, so technically the <a href="https://javascript.info/custom-errors" target="_blank">custom error</a> classes need not inherit from Error. Although it becomes possible to use <b>object instanceof Error</b> to identify the error objects.</div>
<pre class="brush: javascript">try {
someMethod();
} catch (e) {
if (e instanceof RangeError) {
console.log(e);
throw "Range Error occurred";
} else {
throw e;
}
} finally {
console.log("Always execute finally");
}
</pre>
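<div>A minimal sketch of a custom error class that extends Error, so that instanceof checks keep working:</div>
<pre class="brush: javascript">class ValidationError extends Error {
constructor(message) {
super(message);
this.name = 'ValidationError';
}
}
try {
throw new ValidationError('Age must be a number');
} catch (e) {
console.log(e instanceof Error); // true
console.log(e.name + ': ' + e.message); // ValidationError: Age must be a number
}
</pre>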
<div><br /></div>
<div><b><span style="font-size: large;">Computed property name</span></b></div>
<div><br /></div>
<div>The square brackets allows to use the string literals and variables as the property names.</div>
<pre class="brush: javascript">let name = 'machine name';
let machine = {
[name]: 'server',
'machine hours': 10000
};
console.log(machine[name]); // server
</pre>
<div><br /></div>
<div>In ES6, the <a href="https://www.javascripttutorial.net/es6/object-literal-extensions/" target="_blank">computed property name</a> is a part of the object literal syntax, and it uses the square bracket notation. When a property name is placed inside the square brackets, the JavaScript engine evaluates it as a string. Hence we can use an expression as a property name as below.</div>
<pre class="brush: javascript">let prefix = 'machine';
let machine = {
[prefix + ' name']: 'server',
[prefix + ' hours']: 10000
};
console.log(machine['machine name']); // server
</pre>
<div><br /></div>
<div><b><span style="font-size: large;">Concise method syntax</span></b></div>
<div><br /></div>
<div>Prior to ES6, when defining a method for an object literal, we specify the name and full function definition as shown below.</div>
<pre class="brush: javascript">let server = {
name: "Server",
restart: function () {
console.log("The" + this.name + " is restarting...");
}
};
</pre>
<div><br /></div>
<div>ES6 makes the syntax for making a method of the object literal more succinct by removing the colon (:) and the function keyword. It’s valid to have spaces in the property name. The method is called using <b>object_name['property name']();</b></div>
<pre class="brush: javascript">let server = {
name: 'Server',
restart() {
console.log("The " + this.name + " is restarting...");
},
'starting up'() {
console.log("The " + this.name + " is starting up!");
}
};
server['starting up']();
</pre>
<div><br /></div>
<div><b><span style="font-size: large;">for...of loop (ES6)</span></b></div>
<div><br /></div>
<div>The for...of loop iterates over iterable objects such as Array, String, Map, Set, arguments or NodeList and any object implementing the iterator protocol. Each iteration assigns a property of the iterable object to the variable which can be declared as var, let, or const. The for...of loop also iterates over characters of a string. The for...of iterates over elements of any collection that has the [Symbol.iterator] property.</div>
<pre class="brush: javascript">let scores = [80, 90, 70];
for (let score of scores) {
score = score + 5;
console.log(score);
}
</pre>
<div>In order to access the index of the array elements inside the for loop, the entries() method of array is used, which returns a pair of [index, element] for each iteration of the array.</div>
<pre class="brush: javascript">let colors = ['Red', 'Green', 'Blue'];
for(const entry of colors.entries()) {
console.log(`${entry[0]}--${entry[1]}`);
}
for (const [index, color] of colors.entries()) {
console.log(`${color} is at index ${index}`);
}
</pre>
<div><br /></div>
<div><b><span style="font-size: large;">for...in loop</span></b></div>
<div><br /></div>
<div>The for...in iterates over all <a href="https://www.javascripttutorial.net/javascript-enumerable-properties/" target="_blank">enumerable properties</a> of an object but does not iterate over a collection such as Array, Map or Set.</div>
<pre class="brush: javascript">var person = {
firstName: 'John',
lastName: 'Doe',
ssn: '299-24-2351'
};
for(var prop in person) {
console.log(prop + ':' + person[prop]);
}
</pre>
<div><br /></div>
<div><br /></div>
<div><b><span style="font-size: large;">for each loop</span></b></div>
<div><br /></div>
<div>The forEach method is used to loop through arrays. It calls a callback function for each element of an array with the parameters: current value, optional index and optional array object.</div>
<pre class="brush: javascript">const numbers = [1, 2, 3, 4, 5];
numbers.forEach(number => {
console.log(number);
});
numbers.forEach((number, index) => {
console.log('Index: ' + index + ' Value: ' + number);
});
numbers.forEach((number, index, array) => {
console.log(array);
});
</pre>
<div><br /></div>
<div><b><span style="font-size: large;">Collections</span></b></div>
<div><br /></div>
<div>The Set object stores unique values of any type, whether primitive values or object references. The elements of a set are stored in order of insertion.</div>
<pre class="brush: javascript">let nums = new Set([1, 2, 3]);
</pre>
<div><br /></div>
<div>The Map object holds key-value pairs and retains the original insertion order of the keys. Both primitive data types and Objects can be used as keys or values.</div>
<pre class="brush: javascript">let colors = new Map();
colors.set('red', '#ff0000');
colors.set('green', '#00ff00');
colors.set('blue', '#0000ff');
</pre>
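<div>Both collections provide methods to add, look up and remove entries, and both are iterable:</div>
<pre class="brush: javascript">colors.get('red'); // '#ff0000'
colors.has('green'); // true
colors.size; // 3
colors.delete('blue');
for (const [name, hex] of colors) {
console.log(name + ' -> ' + hex);
}
nums.add(4); // the nums Set from above is now {1, 2, 3, 4}
nums.has(2); // true
</pre>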
<div><br /></div>
<div><b><span style="font-size: large;">Decorators, func.call and func.apply</span></b></div><div><br /></div><div>A <a href="https://javascript.info/call-apply-decorators" target="_blank">Decorator</a> function is a higher order function that modifies the behavior of the function or method passed to it by returning a new function. Below is an example of decorator function.</div><pre class="brush: javascript">function amount(x) {
console.log(`Amount is ${x}`);
return x;
}
function dollarDecorator(func) {
return function(x) {
let result = func(x + "$");
return result;
};
}
amount = dollarDecorator(amount);
console.log( amount(45.67) );
</pre><div><br /></div><div>The above decorator is not suitable for object methods, because when the decorator's wrapper calls the original method, the method loses access to its object instance (this). This is resolved using Javascript's func.call() function, as shown in the sketch after the examples below. The func.call(context, ...args) is a special built-in function method that allows calling a function while explicitly setting this. The syntax is: func.call(context, arg1, arg2, ...)</div><div><br /></div><div>The <a href="https://javascript.info/call-apply-decorators#using-func-call-for-the-context" target="_blank">func.call</a> executes func by providing the first argument as this, and the rest as the arguments. The below two calls are exactly the same.</div><pre class="brush: javascript">func(1, 2, 3);
func.call(obj, 1, 2, 3)
</pre><div>The <a href="https://javascript.info/call-apply-decorators#func-apply" target="_blank">func.apply</a> is same as func.call, only that the call expects a list of arguments, while apply takes an array-like object with them.</div><pre class="brush: javascript">func.call(this, ...arguments)
func.apply(this, arguments)
</pre>
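<div>Below is a sketch of a decorator that preserves the object context using func.call(); the worker object and loggingDecorator name are just illustrative:</div>
<pre class="brush: javascript">let worker = {
rate: 2,
slow(x) {
return x * this.rate; // relies on this
}
};
function loggingDecorator(func) {
return function(x) {
console.log('calling with ' + x);
return func.call(this, x); // pass the current this along to the original method
};
}
worker.slow = loggingDecorator(worker.slow);
console.log(worker.slow(5)); // logs "calling with 5", then prints 10
</pre>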
<div><br /></div>
<div><b><span style="font-size: large;">Generators</span></b></div><div><br /></div><div>The <a href="https://www.javascripttutorial.net/es6/javascript-yield/" target="_blank">yield</a> keyword allows to pause and resume a generator function (function*) asynchronously. A generator function similar to a normal function but whenever it is returning any value, it does that using the ‘yield’ keyword instead of returning it directly. Yield can’t be called from nested functions or from callbacks.</div><div><br /></div><div>The yield expression returns an object with two properties, the “value” which is the actual value and “done” which is a boolean value, it returns true when generator function is full completed else it returns false. If we pause the yield expression, the generator function will also get paused and resumes only when we call the next() method. When the next() method is encountered the function keeps on working until it faces another yield or returns expression.</div><div><br /></div><div>The syntax of yield is <b>[variable_name] = yield [expression];</b>, were the expression specifies the value to return from a generator function via the iteration protocol, and variable_name stores the optional value passed to the next() method of the iterator object.</div><pre class="brush: javascript">function* showValues(index) {
while (index < 3) {
yield index;
index++;
}
}
const iterator = showValues(1);
console.log(iterator.next().value); // output: 1
console.log(iterator.next().value); // output: 2
for(let value of showValues(1)) {
console.log(value); // 1, then 2
}
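// a value passed to next() becomes the result of the paused yield expression
function* echo() {
const received = yield 'ready';
console.log(received); // logs the value passed to the second next() call
}
const it = echo();
console.log(it.next().value); // "ready"
it.next('hello'); // logs "hello"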
</pre><div><br /></div><div><b><span style="font-size: large;">Arrow Functions</span></b></div>
<div><br /></div>
<div>Arrow function is an anonymous function which is similar to lambda expressions in other languages. An arrow function expression is a compact alternative to a traditional function expression, but is limited and can't be used in all situations. Arrow function is represented as <b>(parameter list) => body</b>. Arrow functions support rest and default parameters, and de-structuring within parameters.</div>
<pre class="brush: javascript">// original function
const square = function(number) {
return number * number;
}
// arrow function
const square = (number) => {
return number * number;
}
</pre>
<div>Arrow function with a single parameter</div>
<pre class="brush: javascript">const square = number => {
return number * number;
}
</pre>
<div>If body of function contains single line and returns a value then we can write as</div>
<pre class="brush: javascript">const square = number => number * number; </pre>
<div><br /></div><div>Scoping defines the range of functionality of a variable so that it may only be called (referenced) from within the block of code in which it is defined. There are two main types of scoping in programming languages.</div><div><ol style="text-align: left;"><li><b>Lexical Scoping</b>: An unbounded variable is bounded to a definition in the defining scope.</li><li><b>Dynamic Scoping</b>: An unbounded variable is bounded to a variable passed in by the caller of the function.</li></ol></div><div>In regular traditional function, all variables are lexically scoped except this and arguments which are dynamically scoped. In an arrow function, all variables (including this and arguments) are lexically scoped. Hence arrow functions should not be used as a method of a class, due to lexical binding of this. The arrow function does not have its own bindings to this and super. It does not have arguments, or new.target keywords. It is not suitable for call, apply and bind methods, which generally rely on establishing a scope. It cannot use yield, within its body.</div>
<pre class="brush: javascript">const somedata = 34;
this.newdata = 12;
const example = function(n) {
console.log(n);
console.log(somedata); // lexically scoped
console.log(this.newdata); // dynamic for regular function, lexical for arrow function
}
example(10);
example.call({ newdata: 23 }, 10);
</pre>
<div><br /></div><div>When a normal function is called as a standalone function outside an object, by default the this keyword refers to the window object. When we change the function() to an arrow function, it does not get its own 'this' value; instead it inherits 'this' from the context in which the code is defined.</div>
<pre class="brush: javascript">const stdfunc = {
show() {
setTimeout(function() {
// this returns reference to window object.
console.log("this", this);
}, 1000);
}
}
const arrowfunc = {
show() {
setTimeout(() => {
// this refers to arrowfunc, inherited from the enclosing show() method
console.log("this", this);
}, 1000);
}
}
stdfunc.show();
arrowfunc.show();
</pre>
<div><br /></div>
<div><b><span style="font-size: large;">Filter function</span></b></div>
<div><br /></div>
<div>The filter() method creates an array filled with all array elements that passes the test provided by argument function.</div>
<pre class="brush: javascript">const jobs = [
{ id: 1, isActive: true },
{ id: 2, isActive: true },
{ id: 3, isActive: false }
];
// the filter method takes the function as a predicate to decide whether to keep or filter out each element of the array.
const activeJobs = jobs.filter(function(job) { return job.isActive; } )
const activeJobs = jobs.filter(job => job.isActive)
</pre>
<div><br /></div>
<div><b><span style="font-size: large;">Map function</span></b></div>
<div><br /></div>
<div>The Array.map() function is used to transform a list of items.</div>
<pre class="brush: javascript">const colors = ['red', 'green', 'blue'];
// the map() function transforms each element in the array to a html <li/> list element
const items = colors.map(function(color) {
return '<li>' + color + '</li>';
});
const items = colors.map(color => '<li>' + color + '</li>');
// use template literals, we define template for string
const items = colors.map(color => `<li>${color}</li>`);
console.log(items);
</pre>
<div>The map can be used to convert the array of one type of object into array of another object type. It can also convert array of objects to primitive arrays with values from object properties.</div>
<pre class="brush: javascript">const records = [
{name: 'Week 1', owner: 0, others: 4, amt: 4},
{name: 'Week 2', owner: 0, others: 7, amt: 7}
];
var result = records.map(record => ({ value: record.amt, text: record.name }));
// arrays supplying new owner/others values for each record by index
let arrayOwner = [1, 2];
let arrayOthers = [5, 8];
const res = records.map((e,i) => {
e.owner = arrayOwner[i];
e.others = arrayOthers[i];
return e;
});
</pre>
<div><br /></div>
<div><b><span style="font-size: large;">Curry function</span></b></div>
<div><br /></div><div>A <a href="https://medium.com/javascript-scene/curry-and-function-composition-2c208d774983" target="_blank">curried function</a> takes multiple arguments one at a time, were for each curried argument, it returns a function that takes the next argument, with the last function returning the result of applying the function to all of its arguments.</div>
<pre class="brush: javascript">// Normal function
const sum = (x, y, z) => x + y + z;
console.log(sum(1, 5, 6));
// Curried function
var sum = function sum(x) {
return function (y) {
return function (z) {
return x + y + z;
};
};
};
console.log(sum(1)(5)(6));
</pre>
<div>We have three nested functions; each one takes one argument and returns the next function that takes the next argument, except the last function, which performs the calculation. The same can be written concisely with arrow functions.</div>
<pre class="brush: javascript">const sum = x => y => z => x + y + z;
console.log(sum(1)(5)(6));
</pre>
<div><br /></div>
<div><b><span style="font-size: large;">DeStructuring</span></b></div>
<div><br /></div>
<div>The <a href="https://dmitripavlutin.com/javascript-object-destructuring/" target="_blank">object destructuring</a> is a useful JavaScript feature to extract properties from objects and bind them to variables. Object destructuring can extract multiple properties in single statement, access properties from nested objects, and can set a default value if the property doesn't exist.</div>
<pre class="brush: javascript">const address = {
street: 'Downing Street',
city: 'London',
country: 'UK'
};
const street = address.street;
const city = address.city;
const country = address.country;
const { street, city, country } = address;
const { street } = address;
const { street: st } = address;
</pre>
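<div>A default value can be given for a property that may be missing, and nested objects can be destructured as well (the company object below is just illustrative):</div>
<pre class="brush: javascript">const { zipcode = 'N/A' } = address;
console.log(zipcode); // 'N/A', since address has no zipcode property
const company = { name: 'Acme', location: { city: 'Berlin' } };
const { location: { city: companyCity } } = company;
console.log(companyCity); // 'Berlin'
</pre>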
<div><br /></div>
<div>Object destructuring also works on the returned object from the function by extracting the required object attributes into variables.</div>
<pre class="brush: javascript">const getPerson = function() {
return {
first: 'John',
middle: 'Fitzgerald',
last: 'Kennedy'
}
};
// Here first is the object property, while firstName is the corresponding local variable
const {first: firstName, last: lastName} = getPerson();
console.log(`${firstName} ${lastName}`);
const {first, last} = getPerson();
console.log(`${first} ${last}`);
</pre>
<div><br /></div>
<div>Below is example of object destructuring in the function parameter.</div>
<pre class="brush: javascript">const john = {
name: 'John F Kennedy',
age: 37,
address: { street: '1600 Pennsylvania Avenue' },
mailing: { street: 'Boston' }
}
const printName = function({ name, age, address: {street}, mailing: { street: mailstreet } }) {
console.log(`${name} is ${age} years old`);
console.log(`${name} address street is ${street} and mailing street is ${mailstreet}`);
}
printName(john);
</pre>
<div><br /></div>
<div>Destructuring can also be used in a for...of loop, defining a variable which extracts a property from each object.</div>
<pre class="brush: javascript">const ratings = [
{user: 'John',score: 3},
{user: 'Jane',score: 4},
{user: 'David',score: 5},
{user: 'Peter',score: 2},
];
let sum = 0;
for (const {score} of ratings) {
sum += score;
}
console.log(`Total scores: ${sum}`);
</pre>
<div><br /></div>
<div>Destructuring can also be used with arrays, where individual values of the array are extracted into variables.</div>
<pre class="brush: javascript">const getPerson = function() {
return ['John', 'Fitzgerald', 'Kennedy'];
}
const [first, , last] = getPerson();
console.log(`${first} ${last}`);
const [firstname, ...allelse] = getPerson();
console.log(`${firstname} ${allelse}`);
</pre>
<div><br /></div><div><b><span style="font-size: large;">Spread Operator</span></b></div>
<div><br /></div><div>The spread operator (...) allows spreading out the elements of an iterable object such as an array, map, or set. It unpacks the elements of the array, map or set. The spread operator comes in handy for constructing an array using the literal form, concatenating/combining arrays, or copying an array.</div>
<div>
<pre class="brush: javascript">const first = [1, 2, 3];
const second = [4, 5, 6];
const combined = first.concat(second);
// spread all items in first array (i.e. get all items in first array) and then spread second array together to form new array
const combined = [...first, ...second]
const combined = [...first, 'a', ...second, 'b']
// clone array
const clone = [...first];
</pre>
<div>The spread operator can also be used to create a new object from existing objects.</div>
<pre class="brush: javascript">const first = { name: 'Jarvis' };
const second = { job: 'Instructor'};
const combined = {...first, ...second, location: 'USA'};
const clone = {...first};
const user = Object.freeze({ name: 'James', age: 23 });
const updatedUser = { ...user, age: user.age + 1 }; // the frozen user object is not modified, a new object is created
</pre>
</div>
<div><br /></div>
<div><b><span style="font-size: large;">Rest Parameter</span></b></div>
<div><br /></div><div>The rest parameter (...) collects all remaining arguments of a function into an array. The rest parameters must be the last arguments of a function. The rest parameter packs elements into an array, while the spread operator unpacks elements.</div>
<pre class="brush: javascript">const max = function(...numbers) {
numbers.reduce((large, e) => e > large ? e : large);
}
const values = [1, 12, 7];
console.log(max(values[]));
</pre>
<div><br /></div>
<div><b><span style="font-size: large;">Default Parameter</span></b></div>
<div><br /></div><div>Default function parameters allow initializing named parameters with default values if no value or undefined is passed into the function. They make it possible to extend existing functions by adding new parameters. Default parameters can also use the values of parameters to their left. When null is passed as a value for a parameter with a default, its value becomes null. On the other hand, if undefined is passed, it is replaced by the default value when one is specified, as shown after the example below.</div>
<pre class="brush: javascript">const greet = function(name, msg = `Hi${name.length}`) {
console.log(`${msg} ${name}`);
}
greet('Jerry');
</pre>
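<div>Passing undefined triggers the default value, while passing null does not:</div>
<pre class="brush: javascript">const count = function(items = []) {
return items.length;
}
console.log(count()); // 0, no value passed, so the default [] is used
console.log(count(undefined)); // 0, undefined also triggers the default
// count(null); // TypeError, null is passed as-is and has no length property
</pre>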
<div><br /></div><div><div><b><span style="font-size: large;">Object Properties</span></b></div><div><br /></div><div>In JavaScript, an object is an unordered list of key-value pairs. The key is usually a string or a symbol. The value can be a value of any primitive type (string, boolean, number, undefined, or null), an object, or a function. The following example creates a new object using the object literal syntax:</div>
<pre class="brush: javascript">const person = {
firstName: 'John',
lastName: 'Doe'
};
</pre>
<div><br /></div><div>JavaScript specifies characteristics of properties of objects via internal attributes surrounded by two pairs of square brackets. There are two types of object properties: data properties and accessor properties.</div><div><br /></div><div>A data property contains a single location for a data value. A data property has four attributes:</div><div><ul style="text-align: left;"><li> <b>[[Configurable]]</b> – determines whether a property can be redefined or removed via the delete operator.</li><li> <b>[[Enumerable]]</b> – indicates whether a property will be returned in the for...in loop.</li><li> <b>[[Writable]]</b> – specifies that the value of a property can be changed.</li><li> <b>[[Value]]</b> – contains the actual value of a property.</li></ul></div><div><br /></div><div>By default, the [[Configurable]], [[Enumerable]], and [[Writable]] attributes are set to true for all properties defined directly on an object. The default value of the [[Value]] attribute is undefined.</div></div><div><br /></div><div>The <a href="https://www.javascripttutorial.net/javascript-enumerable-properties/" target="_blank">Enumerable attribute</a> determines whether or not a property is accessible when the object’s properties are enumerated using the for...in loop or Object.keys() method. By default, all properties created via a simple assignment or via a property initializer are enumerable.</div><div><br /></div><div><a href="https://www.javascripttutorial.net/javascript-object-properties/" target="_blank">Object attributes</a> of a property can be changed using the Object.defineProperty() method, which takes the object, property name and the property descriptor for four properties (configurable, enumerable, writable, and value). The below example creates a person object and adds the ssn property to it using the Object.defineProperty() method.</div>
<pre class="brush: javascript">let person = {};
Object.defineProperty(person, 'ssn', {
configurable: false,
value: '012-38-9119'
});
delete person.ssn; // fails (throws in strict mode) because configurable is false
</pre>
<div><br /></div><div>Accessor properties, similar to data properties also have [[Configurable]] and [[Enumerable]] attributes. They have the [[Get]] and [[Set]] attributes instead of [[Value]] and [[Writable]].
When you read data from an accessor property, the [[Get]] function is called automatically to return a value, which is undefined by default. When a value is assigned to an accessor property, the [[Set]] function is called automatically. An accessor property can be defined using the Object.defineProperty() method as below.</div>
<pre class="brush: javascript">let person = {
firstName: 'John',
lastName: 'Doe'
}
Object.defineProperty(person, 'fullName', {
get: function () {
return this.firstName + ' ' + this.lastName;
},
set: function (value) {
let parts = value.split(' ');
if (parts.length == 2) {
this.firstName = parts[0];
this.lastName = parts[1];
} else {
throw 'Invalid name format';
}
}
});</pre>
<div><br /></div><div>The Object.getOwnPropertyDescriptor() method allows getting the descriptor object of a property. It returns a descriptor object describing the four attributes: configurable, enumerable, writable, and value.</div><div><br /></div><div><div><b><span style="font-size: large;">Classes</span></b></div></div>
<div><br /></div><div>Modern Javascript has introduced the class construct which provides the core features of object oriented programming. The object of the class is initialized by the new keyword which automatically calls the constructor(). The value of this inside the constructor equals the newly created instance. If a constructor is not defined for the class, a default constructor is created. The default constructor is an empty function, which doesn’t modify the instance.</div>
<pre class="brush: javascript">class Person {
region = "Europe";
constructor(name) {
this.name = name;
}
walk() {
console.log(this.name + " walks");
}
}
const person = new Person('Randy');
person.walk();
</pre>
<div>The Javascript class and constructor are syntactic sugar which use functions and prototypes to provide a class-like entity. Internally javascript creates a function with the class name, which becomes the result of the class declaration. The function code is taken from the constructor method. The class methods are set on the prototype of the function. The class field syntax allows adding properties to the class.</div>
<pre class="brush: javascript">// create constructor function
function Person(name) {
this.name = name;
}
// add method to prototype
Person.prototype.walk = function() {
console.log(this.name + " walks");
};
Person.prototype.region = "Europe";
let person = new Person("John");
person.walk();
</pre>
<div><br /></div>
<div>Classes in Javascript can be used as an expression by assigning to a variable.</div>
<pre class="brush: javascript">const example = class Car {}
</pre>
<div><br /></div>
<div>Just like functions, classes can be defined inside another expression, passed around, returned, assigned, etc. Class can have a computed method name using brackets [...].</div>
<pre class="brush: javascript">class User {
['say' + 'Hi']() {
console.log("Hello");
}
}
new User().sayHi();
</pre>
<div><br /></div>
<div>Javascript provides two levels of field accessibility, public fields (by default) and private fields. A field can be made private by adding the prefix # to its name. A getter and setter mimic a regular field, but allow more control over how the field is accessed and changed. The getter is executed on an attempt to get the field value, while the setter on an attempt to set a value.</div>
<pre class="brush: javascript">class User {
#nameValue;
constructor(name) {
this.name = name;
}
get name() {
return this.#nameValue;
}
set name(name) {
if (name === '') {
throw new Error(`name field of User cannot be empty`);
}
this.#nameValue = name;
}
}
const user = new User('Jon Snow');
console.log(user.name); // The getter is invoked, returns 'Jon Snow'
user.name = 'Jon White'; // The setter is invoked
user.name = ''; // The setter throws an Error
</pre>
<div><br /></div>
<div>Every class in javascript is internally a function, which however cannot be invoked like a plain function. The class constructor cannot be called without the new keyword.</div>
<pre class="brush: javascript">class Car {}
console.log(typeof(Car)); // returns function
Car(); // TypeError: Class constructor Car cannot be invoked without 'new'
</pre>
<div>The new.target is a special value, which is available when a function is invoked as a constructor, in other invocations it is undefined.</div>
<pre class="brush: javascript">const Car = function() {
console.log(new.target);
}
new Car(); // prints [Function: Car]
Car(); // prints undefined
</pre>
<div><br /></div>
<div><div>Internally the <a href="https://www.tutorialsteacher.com/javascript/new-keyword-in-javascript" target="_blank">new keyword</a> performs the following operations:</div><div><ul style="text-align: left;"><li>It creates a new empty object, e.g. obj = { };</li><li>It sets the new empty object's invisible <b>prototype</b> property to be the constructor function's visible and accessible <b>prototype</b> property. Every function has a visible prototype property, whereas every object includes an invisible prototype property.</li><li>It binds the properties and functions declared with the this keyword to the new object.</li><li>It returns the newly created object, unless the constructor function returns a non-primitive value such as a custom JavaScript object. If the constructor function does not include a return statement then 'return this;' is implicitly inserted at the end of the function. If the constructor function returns a primitive value then it is ignored.</li></ul></div></div><div><br /></div><div>Fields can be initialized in the constructor of the class rather than only declared at the top of the class as in Java, although the class field syntax shown earlier also allows declaring them at the top.</div>
<pre class="brush: javascript">class Car {
constructor(year) {
this.year = year;
this.mileage = 0;
this.modelName = 'None';
}
drive(distance) {
this.mileage += distance;
}
get model() {
return this.modelName;
}
set model(value) {
this.modelName = value;
}
}
const car = new Car(2021);
console.log(car);
car.drive(100);
console.log(car);
</pre>
<div><br /></div>
<div>Javascript allows adding properties to a class by defining a method as a getter with the "get" keyword. The getter method should not take any arguments. The property cannot be changed when the class only has a getter (and no setter) for the property.</div>
<pre class="brush: javascript">console.log(car.model);
car.model = 'Tesla R100'
console.log(car.model);
</pre>
<div><br /></div>
<div>Classes can be dynamically created in javascript as shown in the below example.</div>
<pre class="brush: javascript">const classFactory = function(...properties) {
return class {
constructor(...values) {
for(const [index, property] of properties.entries()) {
this[property] = values[index];
}
}
};
}
const Book = classFactory('title', 'pages');
const book1 = new Book('Lord of the Rings', 9800);
console.log(book1);
</pre>
<div><br /></div>
<div><b><span style="font-size: large;">Static Methods</span></b></div>
<div><br /></div>
<div>The static methods are directly attached to the class and not to any instances of the class. Before ES6, static methods were defined by adding the method directly to the constructor. In ES6, the <b>static</b> keyword is used to define static methods. The static methods are called via the class name, not the instances of the class.</div>
<pre class="brush: javascript">class Car {
constructor(name) {
this.name = name;
}
static createCar(fuel) {
let name = fuel == "electric" ? "Tesla RX100" : "Toyota Corolla";
return new Car(name);
}
static info() {
console.log('This is Car info');
}
static get model() {
return 'Tesla RX100';
}
}
Car.info();
console.log(Car.model);
let anonymous = Car.createCar("electric");
</pre>
<div><br /></div>
<div>Similar to instance fields/methods, static fields and methods can be made private by adding the special symbol # as a prefix.</div>
<pre class="brush: javascript">class User {
static #instances = 0;
name;
constructor(name) {
User.#instances++;
if (User.#instances > 2) {
throw new Error('Unable to create User instance');
}
this.name = name;
}
}
const user1 = new User('user1');
</pre>
<div><br /></div>
<div>Static fields can also be assigned outside the class body by setting a property directly on the class, as below.</div>
<pre class="brush: javascript">class Car {
constructor() {
Car.rating++;
}
}
Car.rating = 10;
new Car();
console.log(Car.rating);
</pre>
<div><br /></div>
<div>In order to call a static method from a class constructor or an instance method, the class name followed by a dot and the static method name is used, i.e. <b>className.staticMethodName()</b>. Alternatively we could also use the <b>this.constructor.staticMethodName()</b> notation, as in the sketch below.</div>
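<div>A small sketch of both notations; the Vehicle class and defaultColor method are illustrative names.</div>
<pre class="brush: javascript">class Vehicle {
static defaultColor() {
return 'white';
}
describe() {
console.log(Vehicle.defaultColor());          // via the class name
console.log(this.constructor.defaultColor()); // via this.constructor
}
}
new Vehicle().describe();
</pre>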
<div><br /></div><div><br /></div>
<div><b><span style="font-size: large;">this Keyword</span></b></div>
<div><br /></div>
<div>The this keyword refers to the object instance of the class and allows access to the class instance fields and methods. The this keyword can be used in both global and function contexts, but its behavior changes between strict and non-strict mode. The this keyword references the object on which the function is currently being invoked. In the global context, this references the global object, which is the window object on the web browser or the global object on Node JS.</div>
<div><br /></div>
<div>When a function is invoked in non-strict mode, then this keyword references the global object, which is the window on the web browser and global on Node JS. But in the strict mode, the this keyword in the function is set to undefined.</div>
<div><br /></div>
<div>When a method of an object is invoked using the object dot notation then the this keyword is set to the object that owns the method. But when the method is called directly (e.g. using a function variable) without specifying its object then the this keyword is set to the global object in non-strict mode and undefined in strict mode. To fix this issue, the bind() method of the Function.prototype object is used. The bind() method creates a new function whose this keyword is set to a specified value. The below bind method returns a new instance of the walk function where the this reference points to the person object passed to the bind() function.</div>
<pre class="brush: javascript">const person = {
name: 'Jarvis',
walk() {
console.log(this);
}
}
person.walk();
const walk = person.walk;
walk(); // 'this' is undefined in strict mode, or the global object otherwise
const boundWalk = person.walk.bind(person);
boundWalk(); // 'this' points to the person object
</pre>
<div><br /></div>
<div>When the new keyword is used to create an instance of a function object, the function is used as a constructor. The function as a constructor creates a new object and sets this to the newly created object. To ensure that the function is always invoked using new keyword constructor invocation, the meta-property named <a href="https://theanubhav.com/2017/12/17/understanding-newTarget/" target="_blank">new.target</a> is used which detects whether a function is invoked as a simple invocation or as a constructor.</div>
<pre class="brush: javascript">function Animal(type) {
if (!new.target) {
throw new Error("Animal() must be called with new.");
}
this.type = type;
}
const herbivore = new Animal("herbivore");
console.log(herbivore.type);
// Throws error as the constructor function is invoked without the new keyword
const carnivore = Animal("carnivore");
</pre>
<div><br /></div>
<div><span style="font-size: large;"><b>Inheritance</b></span></div>
<div><br /></div><div>Traditional languages such as Java and CSharp have class based inheritance where one class inherits from another class which is statically defined in the code, called classical inheritance. Javascript on the other hand uses <a href="https://www.javascripttutorial.net/javascript-prototypal-inheritance/" target="_blank">Prototypal inheritance</a> which is dynamic in nature. In prototypal inheritance, an object inherits properties from another object via the prototype linkage. It allows changing the behavior of the object on the fly without changing its class. </div><div><br /></div><div>The prototype is an object that is associated with every function and object by default in JavaScript, where the function's prototype property is accessible and modifiable and the object's prototype property (aka attribute) is not visible. Every function includes a prototype object by default. The prototype object is a special type of enumerable object to which additional properties can be attached, which are shared across all the instances of the constructor function.</div>
<pre class="brush: javascript">function Student() {
this.name = 'John';
this.gender = 'M';
}
Student.prototype.age = 15;
var studObj1 = new Student();
console.log(studObj1.age); // 15
var studObj2 = new Student();
console.log(studObj2.age); // 15
Student.prototype = { age : 20 };
var studObj3 = new Student();
console.log(studObj3.age); // 20
console.log(Student.prototype); // object
console.log(studObj1.prototype); // undefined
console.log(studObj1.__proto__); // object
console.log(typeof Student.prototype); // object
console.log(typeof studObj1.__proto__); // object
</pre>
<div><br /></div><div>Every object which is created using literal syntax or constructor syntax with the new keyword includes a __proto__ property that points to the prototype object of the function that created the object. The Object.getPrototypeOf(obj) method is used instead of __proto__ to access the prototype object of an object. Each object's prototype is linked to the function's prototype object. If we change the function's prototype then only new objects will be linked to the changed prototype. All other existing objects will still link to the old prototype of the function.</div>
<pre class="brush: javascript">function Student() {
this.name = 'John';
this.gender = 'M';
}
var studObj = new Student();
Student.prototype.sayHi = function(){
console.log("Hi");
};
var studObj1 = new Student();
var proto = Object.getPrototypeOf(studObj1); // returns Student's prototype object
console.log(proto.constructor); // returns Student function
</pre>
<div><br /></div>
<div>The prototype object is used to implement inheritance in Javascript. In the below example, we set the Student.prototype to newly created person object. The new keyword creates an object of Person class and also assigns Person.prototype to new object's prototype object and then finally assigns newly created object to Student.prototype object. Optionally, we can also assign Person.prototype to Student.prototype object.</div>
<pre class="brush: javascript">function Person(firstName, lastName) {
this.FirstName = firstName;
this.LastName = lastName;
};
Person.prototype.getFullName = function () {
return this.FirstName + " " + this.LastName;
}
function Student(firstName, lastName, schoolName, grade) {
Person.call(this, firstName, lastName);
this.SchoolName = schoolName || "unknown";
this.Grade = grade || 0;
}
//Student.prototype = Person.prototype;
Student.prototype = new Person();
Student.prototype.constructor = Student;
var std = new Student("James","Bond", "XYZ", 10);
console.log(std.getFullName()); // James Bond
console.log(std instanceof Student); // true
console.log(std instanceof Person); // true
</pre>
<div><br /></div>
<div>An object has a prototype object, which in turn can have its own prototype object, forming a chain where each prototype refers to the next until the last prototype which does not have any prototype set. When a method is called on such an object, it checks for the corresponding method through the prototype chain across all prototypes until it either finds a match or it reaches the end of the prototype chain with no match.</div>
<pre class="brush: javascript">const use = function(person) {
try {
person.work();
} catch(ex) {
console.log('Not Found');
}
}
const john = { name: 'John Corner' };
const employment = { work : function() { console.log('Working...'); } }
Object.setPrototypeOf(john, employment);
use(john);
</pre>
<div><br /></div>
<div>As mentioned above, the prototype can also be set on the class (constructor function) instead of an individual object. Property gets in javascript are deep (reads traverse the prototype chain), while sets are shallow (assignments create an own property directly on the instance), as the below example illustrates.</div>
<pre class="brush: javascript">const Car = function() {
this.drive = function(distance) {
this.mileage += distance;
}
}
Car.prototype.mileage = 0;
const car1 = new Car();
console.log(Object.getPrototypeOf(car1))
console.log(car1.mileage); // 0, read from the prototype
const car2 = new Car();
car1.drive(100);
console.log(car1.mileage); // 100, an own property created on car1
console.log(car2.mileage); // 0, still read from the prototype
</pre>
<div><br /></div>
<div>Javascript refined the syntax providing the extends keyword, but under the hood it works as above. The constructor is optional for the child class, where if a constructor is not provided, Javascript internally generates a default constructor for the child class. Below is the example using the modern syntax which inherits all the methods (including the constructor) defined in the parent class.</div>
<pre class="brush: javascript">class Person {
constructor(firstName, lastName) {
this.firstName = firstName;
this.lastName = lastName;
}
toString() {
return `${this.firstName} ${this.lastName}`;
}
}
class Employee extends Person {
constructor(firstName, lastName, jobTitle) {
super(firstName, lastName);
this.jobTitle = jobTitle;
}
toString() {
return `${super.toString()} ${this.jobTitle}`;
}
}
const james = new Employee('James', 'Bond', 'Spy');
console.log(james.toString());
console.log(Object.getPrototypeOf(james));
</pre>
<div><br /></div>
<div>Javascript allows built-in classes like Array, Map etc to also be extended. The special keyword super is introduced to call the parent constructor from the child class, and to access the parent fields and methods from the child class. The <a href="https://javascript.info/instanceof" target="_blank">instanceof</a> operator with syntax <b>object instanceof Class</b>, determines if the object is an instance of Class, taking inheritance into account.</div><div><br /></div>
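<div>A short sketch extending the built-in Array class, with instanceof taking the inheritance into account (PowerArray is an illustrative name).</div>
<pre class="brush: javascript">class PowerArray extends Array {
isEmpty() {
return this.length === 0;
}
}
const arr = new PowerArray(1, 5, 10);
console.log(arr.isEmpty());             // false
console.log(arr instanceof PowerArray); // true
console.log(arr instanceof Array);      // true, inheritance is taken into account
</pre>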
<div><br /></div>
<div><b><span style="font-size: large;">Symbol</span></b></div>
<div><br /></div><div>Javascript does not have the concept of Interfaces like Java. Symbol is a new primitive type in JavaScript intended for limited specialized use.</div><div><div><ul style="text-align: left;"><li>To define properties for objects in such a way they don’t appear during normal iteration—these properties are not private; they’re just not easily discovered like other properties.</li><li>To easily define a global registry or dictionary of objects.</li><li>To define some special well-known methods in objects; this feature, which fills the void of interfaces, is arguably one of the most important purposes of Symbol.</li></ul></div></div><div><br /></div>
<pre class="brush: javascript">const s1 = Symbol.for('hi');
const s2 = Symbol.for('hi');
console.log(s1 === s2); // true, both refer to the same symbol from the global registry
</pre>
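<div>By contrast, calling Symbol() directly always creates a fresh, unique symbol; only Symbol.for() consults the global registry, as the sketch below illustrates.</div>
<pre class="brush: javascript">const unique1 = Symbol('hi');
const unique2 = Symbol('hi');
console.log(unique1 === unique2);             // false, each Symbol() call creates a fresh symbol
console.log(Symbol.keyFor(Symbol.for('hi'))); // 'hi', registered in the global registry
console.log(Symbol.keyFor(unique1));          // undefined, not in the global registry
</pre>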
<div>Symbol can become a method name which gives us a unique method name to avoid method name collision.</div>
<pre class="brush: javascript">class Person {
[Symbol.for('play')]() {
console.log('playing...');
}
}
const john = new Person();
const playMethod = Symbol.for('play');
john[playMethod](); </pre>
<div><br /></div><div><b><span style="font-size: medium;">Iterators</span></b></div><div><br /></div><div>Javascript has two types of iteration protocols, the iterable protocol and the iterator protocol.</div><div><br /></div><div>An object is an iterator when it has a next() method which returns an object with the current element (value) and a flag (done) indicating whether any more elements remain to be iterated.</div><div><br /></div><div>An object is an <a href="https://www.javascripttutorial.net/es6/javascript-iterator/" target="_blank">iterable</a> when it contains a method called [Symbol.iterator] that takes no argument and returns an object which conforms to the iterator protocol. Javascript has predefined Symbols such as Symbol.iterator and Symbol.match which allow adding these methods to existing classes.</div>
<pre class="brush: javascript">class Records {
constructor() {
this.names = ['Ted', 'Jim', 'Tim'];
}
[Symbol.iterator]() {
let index = 0;
const self = this;
return {
next: function() {
return { done: index >= self.names.length, value: self.names[index++] };
}
};
}
}
</pre>
<div>Example of a Generator using Symbol.</div>
<pre class="brush: javascript">class Records {
constructor() {
this.names = ['Ted', 'Jim', 'Tim'];
}
*[Symbol.iterator]() {
// the loop below could also be written as the shorthand: yield* this.names;
for(const name of this.names) {
yield name;
}
}
}
</pre>
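<div>Either version of Records can then be consumed by any construct that understands the iterable protocol, for example for...of or the spread operator.</div>
<pre class="brush: javascript">const records = new Records();
for (const name of records) {
console.log(name); // Ted, Jim, Tim
}
console.log([...records]); // ['Ted', 'Jim', 'Tim']
</pre>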
<div><br /></div><div><b><span style="font-size: large;">Proxies</span></b></div><div><br /></div><div>A <a href="https://www.javascripttutorial.net/es6/javascript-proxy/" target="_blank">Proxy</a> is an object that wraps another object (target) and intercepts the fundamental operations of the target object. The fundamental operations can be the property lookup, assignment, enumeration, and function invocations, etc. The Proxy object takes the target object to wrap and the handler object containing methods to control the behaviors of the target. The methods inside the handler object are called traps.</div><div><br /></div>
<pre class="brush: javascript">const user = {
firstName: 'John',
lastName: 'Dillenger'
}
const handler = {
get(target, property) {
return property === 'fullName' ? `${target.firstName} ${target.lastName}` : target[property];
}
};
const proxyUser = new Proxy(user, handler);
console.log(proxyUser.fullName);
</pre>
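<div>Assignments can be intercepted in the same way by defining a set trap in the handler; the sketch below (illustrative names) validates a value before writing it to the target.</div>
<pre class="brush: javascript">const validator = {
set(target, property, value) {
if (property === 'age' && !Number.isInteger(value)) {
throw new TypeError('age must be an integer');
}
target[property] = value;
return true; // indicate the assignment succeeded
}
};
const profile = new Proxy({}, validator);
profile.age = 30;        // stored on the target object
// profile.age = 'old';  // would throw a TypeError
console.log(profile.age);
</pre>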
<div><br /></div>
<div><b><span style="font-size: large;">Modules</span></b></div>
<div><br /></div><div><a href="https://www.javascripttutorial.net/es6/es6-modules/" target="_blank">Modules</a> allow structuring the project by moving each class into a separate file.</div>
<pre class="brush: javascript">// File: person.js
class Person {}
// File: teacher.js
class Teacher {}
</pre>
<div>Objects defined in a module are private by default and won’t be added automatically to the global scope. The export statement (where the export keyword is placed in front) exposes the specified variable, function, or class from the current file into other modules. The export keyword requires the name of the function or class to be exported, called named exports, hence an anonymous function or class cannot be exported. One or more objects can be exported from a given module using names. Variables can be defined first and exported in later statements.</div>
<pre class="brush: javascript">export class Person {}
export class Teacher extends Person {} </pre>
<div>The import statement imports the exported variables, functions or class from the original module file. In order to import multiple bindings, they are explicitly listed inside the curly braces. After the first import statement, the specified module is executed and loaded into the memory, and it is reused whenever it is referenced by the subsequent import statement.</div>
<pre class="brush: javascript">import { Person } from './person';
</pre>
<div>JavaScript allows you to create aliases for variables, functions, or classes when you export and import.</div>
<pre class="brush: javascript">export { add as sum };
import {sum as total} from './math.js';
</pre>
<div>It is possible to export bindings that have been imported, called re-exporting.</div>
<pre class="brush: javascript">export {sum} from './math.js';
</pre>
<div>Everything can be imported as a single object from a module, using the asterisk (*) pattern as below, which is called a namespace import.</div>
<pre class="brush: javascript">import * as cal from './cal.js';
</pre>
<div>Default export is where a single (main) object is exported from a module. A module can have one and only one default export. The default for a module can be a variable, a function, or a class. The default export is easier to import as we only need to specify a name for the binding because the module itself represents the default export. The below example exports the Teacher class as the default export from the module.</div>
<pre class="brush: javascript">export default class Teacher extends Person { ... }</pre>
<div>While importing the default object from the module, we don't need curly braces, as default object can be imported directly from the module.</div>
<pre class="brush: javascript">import Teacher from './teacher';</pre>
<div>Both default and non-default bindings (exports) can be imported together, where we specify a list of bindings after the import keyword; the default binding comes first and the non-default bindings are surrounded by curly braces.</div>
<pre class="brush: javascript">// sort.js
export default function(arr) { .. }
export function heapSort(arr) { .. }
import sort, {heapSort} from './sort.js';
import Teacher, { promote } from './teacher';</pre>
<div><br /></div>
<div><b><span style="font-size: large;">Asynchronous functions</span></b></div>
<div><br /></div>
<div>Javascript maintains a call stack, where it adds the function which is being invoked and pops it out once the function execution returns. The call stack has the main() function which is the default function when the javascript is executed. </div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-AhjtuL1rZgc/YQZE3Jo04vI/AAAAAAAAlvg/JffbhKFRIH8Mxc9td_weuy5FJBUzH7VwwCLcBGAsYHQ/s1920/Javascript-event-loop.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1080" data-original-width="1920" height="360" src="https://1.bp.blogspot.com/-AhjtuL1rZgc/YQZE3Jo04vI/AAAAAAAAlvg/JffbhKFRIH8Mxc9td_weuy5FJBUzH7VwwCLcBGAsYHQ/w640-h360/Javascript-event-loop.png" width="640" /></a></div><br /><div><br /></div>
<div><br /></div>
<div>When a function simply accepts another function as an argument, this contained function is known as a callback function. The callback function is not run unless called by its containing function; it is "called back". Callback functions can be named or anonymous functions. Multiple functions can be created independently and used as callback functions, creating multi-level functions. When such a function tree becomes too large, the code becomes incomprehensible and difficult to refactor, known as <a href="https://scotch.io/courses/10-need-to-know-javascript-concepts/callbacks-promises-and-async" target="_blank">callback hell</a>.</div><div><br /></div><div>Javascript provides setTimeout(), setInterval() and requestAnimationFrame() as asynchronous functions which take a callback function and allow it to execute asynchronously after a certain time interval. The <a href="https://developer.mozilla.org/en-US/docs/Learn/JavaScript/Asynchronous/Timeouts_and_intervals" target="_blank">setTimeout</a> allows delaying the execution of a function until after the delay time interval (in milliseconds). The setInterval allows us to run a function repeatedly, starting after the interval of time, then repeating continuously at that interval. The callback function is executed asynchronously after the specified delay time period. The asynchronous functions are handed to the Web API (instead of the call stack), which keeps track of the timer; once the time expires it places the callback on the callback queue, from where the event loop moves it onto the call stack.</div>
<pre class="brush: javascript">function login(email, password, callback) {
setTimeout(() => {
callback({ userEmail: email });
}, 5000);
}
function getUserDetails(email, callback) {
setTimeout(() => {
callback({ age: 12, address: 'California' });
}, 3000);
}
login('test@email.com', 12345, user => {
console.log(user.userEmail);
getUserDetails(user.userEmail, details => {
console.log(details.address);
});
});
</pre>
<div><br /></div>
<div><b><span style="font-size: large;">Promises</span></b></div>
<div><br /></div>
<div>In order to avoid the nested structure of callbacks, where the output of one function is needed to call the next function, Promises were introduced. A promise is defined as a proxy for a value that will eventually become available. A promise is an object that gives either the result or a failure for an asynchronous operation. </div><div><br /></div><div>The <a href="https://www.javascripttutorial.net/es6/javascript-promises/" target="_blank">Promise</a> constructor accepts a function as an argument which is called the executor. The executor accepts two functions named resolve() and reject(). When we call new Promise(executor), the executor is called automatically. Inside the executor, the resolve() function is called manually if the executor is completed successfully, or else the reject() function is invoked in case of an error. Once a promise has been called, it will start in a pending state. The calling function continues its execution while the promise is in the pending state. Once the promise goes into the resolved or rejected state the corresponding callback function is called. The then() method is used to schedule a callback to be executed when the promise is successfully resolved. The <a href="https://www.javascripttutorial.net/es6/promise-error-handling/" target="_blank">catch()</a> method is used to schedule a callback to be executed when the promise is rejected, and the finally() method executes the callback whether the promise is fulfilled or rejected.</div>
<pre class="brush: javascript">const promise = new Promise((resolve, reject) => {
setTimeout(() => {
resolve({ user: 'Ed' });
}, 2000);
// For error case use reject()
// setTimeout(() => {
// reject(new Error('User access denied.'));
// }, 2000);
});
promise.then(user => {
console.log(user);
})
.catch(err => console.log(err.message));
</pre>
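<div>The finally() callback mentioned above runs regardless of whether the promise is fulfilled or rejected, which makes it a convenient place for cleanup; a minimal sketch reusing the promise defined above.</div>
<pre class="brush: javascript">promise
.then(user => console.log(user))
.catch(err => console.log(err.message))
.finally(() => console.log('Done processing the promise'));
</pre>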
<div><br /></div><div>A promise can be returned to another promise, creating a chain of promises.</div>
<pre class="brush: javascript">function login(email, password) {
return new Promise((resolve, reject) => {
setTimeout(() => {
resolve({ userEmail: email });
}, 5000);
});
}
function getUserDetails(email) {
return new Promise((resolve, reject) => {
setTimeout(() => {
resolve({ age: 12, address: 'California' });
}, 3000);
});
}
login('test@email.com', 12345)
.then(user => getUserDetails(user.userEmail))
.then(details => console.log(details.address));
</pre>
<div><br /></div><div>The Promise.all() allows synchronizing different promises (in parallel) by defining a list of promises and waiting until all are resolved. The Promise.all() accepts a list of Promises and returns a Promise that either resolves when every input Promise resolves, or is rejected when any of the input Promises is rejected.</div>
<pre class="brush: javascript">const service1 = new Promise(resolve => {
setTimeout(() => {
resolve({ values: [1, 2, 3, 4] });
}, 2000);
});
const service2 = new Promise(resolve => {
setTimeout(() => {
resolve({ quote: 'Wholesale' });
}, 3000);
});
Promise.all([service1, service2])
.then(result => console.log(result));</pre>
<div><br /></div><div>The <a href="https://www.javascripttutorial.net/es6/javascript-promise-race/" target="_blank">Promise.race() </a>returns a promise that settles as soon as the first of the input promises is resolved or rejected. The race method returns a promise that fulfills or rejects as soon as there is one promise that fulfills or rejects, with the value or reason from that promise. It is used to run the attached callback only when the first promise is resolved. The Promise.all() returns a promise that resolves to an array of values from the input promises, while the Promise.race() returns a promise that resolves to the value from the first settled promise.</div><div><br /></div><div><br /></div><div><b><span style="font-size: large;">Fetch API</span></b></div><div><br /></div><div>The <a href="https://dev.to/attacomsian/introduction-to-javascript-fetch-api-4f4c" target="_blank">Fetch API</a> is a promise-based API for making asynchronous HTTP requests in the browser similar to XMLHttpRequest (XHR). Unlike XHR, it is a simple and clean API that uses promises to provide a more powerful and flexible feature set to fetch resources from the server.</div><div><br /></div><div><div>The fetch() method is used to send a request to a URL which is passed as a required parameter. It returns a promise that passes the response to then() when it is fulfilled. The catch() method intercepts errors if the request fails to complete for any reason. The fetch <a href="https://fetch.spec.whatwg.org/#concept-response" target="_blank">response</a> has status, statusText properties and a json() method, which returns a promise that will resolve with the content of the body processed and transformed into JSON. In order to post a request, an additional options parameter is passed which contains the method, body and headers objects as attributes.</div>
<pre class="brush: javascript">const user = {
first_name: 'John',
last_name: 'Lilly',
job_title: 'Software Engineer'
};
const options = {
method: 'POST',
body: JSON.stringify(user),
headers: {
'Content-Type': 'application/json'
}
}
fetch('https://server.com/api/users', options)
.then(res => res.json())
.then(res => console.log(res))
.catch(error => {
console.log('Request failed', error)
})
</pre>
<div><br /></div></div>
<div><b><span style="font-size: large;">Async and Await</span></b></div>
<div><br /></div><div>Async functions are a combination of promises and generators, and basically, they are a higher level abstraction over promises. In other words we can say that async/await are built on top of Promises.</div>
<pre class="brush: javascript">async function displayUser() {
try {
const user = await login('test@email.com', 12345);
const details = await getUserDetails(user.userEmail);
console.log(details);
} catch(err) {
console.log(err);
}
}
</pre>
<div><br /></div>
<div>Prepending the async keyword to any function means that the function will return a promise. An async function always returns a promise. The keyword await before a function makes the function wait for a promise. The await keyword can only be used inside an async function.</div>
<pre class="brush: javascript">const doSomethingAsync = () => {
return new Promise(resolve => {
setTimeout(() => resolve('I did something'), 3000)
})
}
const doSomething = async () => {
console.log(await doSomethingAsync())
}
</pre>
<div><br /></div>
<div>Async functions can be chained very easily, and the syntax is much more readable than with plain promises, as shown in the below example.</div>
<pre class="brush: javascript">const getFirstUserData = async () => {
const response = await fetch('/users.json') // get users list
const users = await response.json() // parse JSON
const user = users[0] // pick first user
const userResponse = await fetch(`/users/${user.name}`) // get user data
const userData = await userResponse.json() // parse JSON
return userData
}
getFirstUserData();
</pre>
<div>The same example written using plain Promises, as below, requires more chaining of then() callbacks and arrow functions.</div>
<pre class="brush: javascript">const getFirstUserData = () => {
return fetch('/users.json') // get users list
.then(response => response.json()) // parse JSON
.then(users => users[0]) // pick first user
.then(user => fetch(`/users/${user.name}`)) // get user data
.then(userResponse => userResponse.json()) // parse JSON
}
getFirstUserData();
</pre>
<div>A <a href="https://javascript.info/task/async-from-regular" target="_blank">regular function</a> can call the async function as below.</div>
<pre class="brush: javascript">async function wait() {
await new Promise(resolve => setTimeout(resolve, 1000));
return 10;
}
function myfunction() {
// shows 10 after 1 second
wait().then(result => console.log(result));
}
myfunction();
</pre>
<div><br /></div>
<div><div><b><span style="font-size: large;">Node JS</span></b></div><div><br /></div><div>Node JS provides an open-source, cross-platform runtime environment to execute Javascript code outside of the browser context. The Node JS application runs in a single process, without creating a new thread for every request. Node JS provides a set of asynchronous I/O primitives in its standard library that prevents JavaScript code from being blocked.</div><div><br /></div><div>Contrary to traditional HTTP servers, which allocate a single thread for each request and wait until, for example, the resource is loaded from the database or an external service, Node uses a non-blocking or asynchronous model where a single thread is used to serve multiple requests. Node uses an event queue where it monitors for received event messages and processes the events using the single thread. This makes the Node runtime environment ideal for I/O intensive applications, where it can serve more client requests without requiring additional thread allocation. Node on the other hand should not be used for CPU-intensive applications which involve a high amount of data processing. Also, single threaded Node applications only utilize a single CPU core, rendering other CPU cores unused. Although since Node JS v10.5.0, the worker_threads module enables the use of threads to execute JavaScript in parallel.</div><div><br /></div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-QhD0e2VUcSM/YPOC5e5U7yI/AAAAAAAAluI/4i8Jas3ZcrkmjIXsqKkim6w3o0rD5SQjgCLcBGAsYHQ/s1098/Node.js-Event-Loop-1%2B%25281%2529.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="372" data-original-width="1098" height="362" src="https://1.bp.blogspot.com/-QhD0e2VUcSM/YPOC5e5U7yI/AAAAAAAAluI/4i8Jas3ZcrkmjIXsqKkim6w3o0rD5SQjgCLcBGAsYHQ/w1067-h362/Node.js-Event-Loop-1%2B%25281%2529.png" width="1067" /></a></div><br /><div><br /></div><div>Node JS internally maintains a limited Thread Pool which has a pre-allocated set of threads to service client requests. The <a href="https://nodejs.org/api/cli.html" target="_blank">UV_THREADPOOL_SIZE</a> environment variable sets the number of threads used in libuv's threadpool. </div><div><pre class="brush: javascript">process.env.UV_THREADPOOL_SIZE = 6;</pre><div><br /></div></div><div>When Node JS receives client requests, it places them into the event queue. Node JS internally has a component called the “<a href="https://nodejs.org/en/docs/guides/event-loop-timers-and-nexttick/" target="_blank">Event Loop</a>” which uses an indefinite loop for polling the event queue for received events and processes them. If the client request does not require any blocking I/O operations, then everything is processed, the response prepared and sent back to the client. On the other hand, if the client request requires some blocking I/O operations such as calling a database, external services etc., then the event loop checks for thread availability in the internal Thread Pool, and assigns the client request to a selected thread from the Thread Pool. This thread is responsible for taking that request, processing it, performing the blocking I/O operations, preparing the response and sending it back to the Event Loop. The <a href="https://www.journaldev.com/7462/node-js-architecture-single-threaded-event-loop" target="_blank">Event Loop</a> in turn sends back the corresponding response to the respective client. 
The event loop allows Node JS to perform non-blocking I/O operations, even though JavaScript is single-threaded, by offloading operations to the system kernel whenever possible. The event loop model is provided by the libuv library in Node JS. </div></div><div><br /></div><div><div>Node JS can be installed by directly downloading the <a href="https://nodejs.org/en/download/" target="_blank">Node JS binary</a> for Windows and Mac OS. Mac users can also install using <a href="https://brew.sh/" target="_blank">HomeBrew</a> with <b>brew install node</b>. Node JS and npm can be installed in Ubuntu (and other linux flavors) using the apt package manager as below. Alternatively <a href="https://github.com/nvm-sh/nvm" target="_blank">node version manager</a> can be used to install multiple versions of node and switch between them.</div><div><br /></div><div><div>$ <b><span style="color: #2b00fe;">sudo apt update</span></b></div><div>$ <b><span style="color: #2b00fe;">sudo apt install nodejs</span></b></div><div>$ <b><span style="color: #2b00fe;">sudo apt install npm</span></b></div></div><div><br /></div><div>The below command executes a javascript file using Node JS.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">node app.js</span></b></div></div><div><br /></div><div><div>Node JS uses Google's V8 JavaScript engine for language implementation, hence new language features are first implemented in V8, and then incorporated into Node JS. To determine which features are supported for a Node JS release, we refer to <a href="https://kangax.github.io/compat-table/es6/">kangax.github.io</a>, a dynamically generated feature list for all major JavaScript engines. For a Node specific list, <a href="http://node.green/">node.green</a> is used, which leverages the same data as kangax. We can also print the V8 version included in a <a href="https://nodejs.org/en/docs/es6/" target="_blank">Node</a> version using the below command. The <a href="https://nodejs.org/api/process.html#process_process_versions" target="_blank">process.versions</a> property returns an object listing the version strings of Node.js and its dependencies.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">node -p process.versions.v8</span></b></div><div><div><br /></div><div>We can also list all the in-progress features available on each Node JS <a href="https://nodejs.org/en/about/releases/" target="_blank">release</a> by grepping through the --v8-options argument.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">node --v8-options | grep "in progress"</span></b></div></div><div><br /></div></div>
<div><br /></div>
<div><b><span style="font-size: large;">Node Modules</span></b></div>
<div><br /></div>
<div>Node.js has a set of built-in modules which can be used without any further installation. Node defines a <a href="https://nodejs.org/api/globals.html" target="_blank">global object</a> which provides various objects/methods such as <a href="https://nodejs.org/api/console.html#console_console_log_data_args" target="_blank">console.log()</a>, <a href="https://nodejs.org/api/timers.html#timers_settimeout_callback_delay_args" target="_blank">setTimeout()</a>, <a href="https://nodejs.org/api/timers.html#timers_cleartimeout_timeout" target="_blank">clearTimeout()</a>, <a href="https://nodejs.org/api/timers.html#timers_setinterval_callback_delay_args" target="_blank">setInterval()</a> etc. Node JS provides <a href="https://www.w3schools.com/nodejs/ref_modules.asp" target="_blank">many modules</a>, such as process, http, fs, events etc. which are discussed below. Node JS has a built-in stream module providing the foundation upon which all streaming APIs are built.</div>
<div><br /></div>
<div><b><span style="font-size: medium;">Require and Export</span></b></div><div><br /></div><div>The <a href="https://nodejs.org/api/modules.html#modules_require_id" target="_blank">require(id)</a> function is used to import node modules, JSON, and local files. Modules (both built-in and third party) can be imported from node_modules. Local modules and JSON files can be imported using a relative path e.g. (../dir/module).</div>
<pre class="brush: javascript">// Importing a JSON file
const jsonData = require('./path/filename.json');
// Importing a module from node_modules or Node.js built-in module
const crypto = require('crypto');
</pre>
<div>The exports keyword is used to make properties and methods available outside the module file. In the below example we create custom module in user.js and export it to be used outside.</div>
<pre class="brush: javascript">const getName = () => {
return 'James Bond';
};
exports.getName = getName;
</pre>
<div>Now the above user.js module can be imported using require function as below.</div>
<pre class="brush: javascript">const user = require('./user');
console.log(`User: ${user.getName()}`);
</pre>
<div><br /></div>
<div><br /></div>
<div><b><span style="font-size: medium;">Process Module</span></b></div><div><br /></div><div>The process core module of Node JS provides the env property which hosts all the <a href="https://nodejs.dev/learn/how-to-read-environment-variables-from-nodejs" target="_blank">environment variables</a> that were set at the moment the process was started. The environment variables can be passed as below while executing a Node JS application.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">USER_ID=239482 REGION=us-east-1 node server.js</span></b></div><div><br /></div><div>The environment variables can be accessed using the below code.</div>
<pre class="brush: javascript">process.env.USER_ID // "239482"</pre>
<div>If there are multiple environment variables in the node project, we can create an .env file in the root directory of the project, and then use the dotenv package to load them during runtime as shown in the below example.</div>
<pre class="brush: javascript">require('dotenv').config();
process.env.USER_ID // "239482"</pre>
<div><br /></div>
<div>The process module also provides the <a href="https://nodejs.dev/learn/understanding-process-nexttick" target="_blank">process.nextTick()</a> function. When one iteration of the event loop is completed, it is known as a tick. The process.nextTick() function accepts a callback which is executed after the currently running operation completes and before the event loop continues. It adds the callback to the nextTickQueue, which processes all the callbacks after completing the current iteration and before starting the next iteration of the event loop. The process.nextTick() is used for resource cleanup, handling errors, or to run a request before the next iteration.</div>
<pre class="brush: javascript">const process = require('process')
process.nextTick(() => {
//do something
})
</pre>
<div>In order to execute a piece of code asynchronously, the <a href="https://nodejs.dev/learn/understanding-setimmediate" target="_blank">setImmediate()</a> function of Node JS is used.</div>
<pre class="brush: javascript">setImmediate(() => {
//run something
})
</pre>
<div>A function passed to process.nextTick() is executed on the current iteration of the event loop, before callbacks scheduled with setTimeout(() => {}, 0) and setImmediate().</div>
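<div>A small sketch illustrating that ordering; note that the relative order of setTimeout(0) and setImmediate can vary when not inside an I/O callback, but process.nextTick always runs first.</div>
<pre class="brush: javascript">setImmediate(() => console.log('setImmediate'));
setTimeout(() => console.log('setTimeout 0'), 0);
process.nextTick(() => console.log('nextTick'));
console.log('synchronous code');
// Prints: synchronous code, nextTick, then the timer and immediate callbacks
</pre>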
<div><br /></div><div><br /></div>
<div><b><span style="font-size: medium;">HTTP Module</span></b></div>
<div><br /></div>
<div>Node JS provides the http module which has libraries for networking, such as setting up servers or providing a client to invoke services. The below example creates a new <a href="https://nodejs.dev/learn/build-an-http-server" target="_blank">HTTP server</a> using the createServer() method.</div>
<pre class="brush: javascript">const http = require('http');
const hostname = '127.0.0.1'
const port = process.env.PORT
const server = http.createServer((req, res) => {
res.writeHead(200, {'Content-Type': 'text/plain'});
res.write('Hello World!');
res.end();
})
server.listen(port, hostname, () => {
console.log(`Server running at http://${hostname}:${port}/`)
})
</pre>
<div>The http/https modules can also be used to call HTTP services as shown in the below example.</div>
<pre class="brush: javascript">const https = require('https')
const payload = JSON.stringify({ "id": 1, "name": "Tim Scott" });
const options = {
hostname: 'emprovise.com',
port: 443,
path: '/api/user',
method: 'POST',
headers: { 'Content-Type': 'application/json',
'Content-Length': payload.length },
timeout: 5000
}
// Wrapping the request in a Promise so that resolve and reject are defined
const response = new Promise((resolve, reject) => {
let data = '';
const req = https.request(options, res => {
if(res.statusCode > 299) {
let error = new Error(res.statusMessage);
error.code = res.statusCode;
reject(error);
}
res.on('data', d => {
data += d;
});
res.on('end', () => {
resolve(data);
});
});
req.on('error', error => {
reject(error);
});
req.on('timeout', () => {
req.destroy(new Error('Request timed out')); // destroys the request and emits 'error'
});
req.write(payload);
req.end();
});
response.then(data => console.log(data)).catch(err => console.log(err.message));
</pre>
<div>The net module allows creating custom socket clients and connecting with the server. The net.connect() method allows connecting directly with the server without explicitly creating a Socket object.</div>
<pre class="brush: javascript">
let net = require('net');
let client = new net.Socket();
client.connect(8080, '127.0.0.1', function() {
console.log('Connected');
client.write('Hello from Client');
});
client.setEncoding('utf8');
client.on('data', function(data) {
console.log('Received: ' + data);
client.destroy();
});
client.on('close', function() {
console.log('Connection closed');
});
setTimeout(function(){
client.end('Bye bye server');
},5000);
</pre>
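<div>For completeness, a minimal sketch of a matching server built with net.createServer, which the above client could connect to (the port and messages are illustrative).</div>
<pre class="brush: javascript">const net = require('net');
const server = net.createServer(socket => {
console.log('Client connected');
socket.write('Hello from Server');
socket.on('data', data => console.log('Received: ' + data));
socket.on('end', () => console.log('Client disconnected'));
});
server.listen(8080, '127.0.0.1');
</pre>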
<div><br /></div>
<div><b><span style="font-size: medium;">FS Module</span></b></div>
<div><br /></div>
<div>The <a href="https://nodejs.dev/learn/the-nodejs-fs-module" target="_blank">FS module</a> provides a lot of very useful functionality to access and interact with the file system. It provides methods to read and write files in the local system. Both read and write have corresponding synchronous method versions, fs.readFileSync() and fs.writeFileSync() respectively.</div>
<pre class="brush: javascript">var fs = require('fs');
// Use fs.readFile() method to read the file
fs.readFile('demo.txt', "utf-8", (err, data) => {
if (err) console.log(err);
console.log(data.toString());
})
var data = "New File Contents";
fs.writeFile("temp.txt", data, (err) => {
if (err) console.log(err);
console.log("Successfully Written to File.");
});
</pre>
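<div>The synchronous variants mentioned above block until the operation completes and either return the result or throw, so they are typically wrapped in try/catch; a minimal sketch with illustrative file names.</div>
<pre class="brush: javascript">const fs = require('fs');
try {
fs.writeFileSync('temp.txt', 'New File Contents');
const contents = fs.readFileSync('temp.txt', 'utf-8');
console.log(contents);
} catch (err) {
console.log(err);
}
</pre>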
<div><br /></div>
<div>The FS module also allows opening a file using the open() method which takes various flags for different file modes. The file modes include r (read), r+ (read+write, don't create if not exist), w+ (read+write, create if not exist), a (append at end, don't create if not exist), a+ (append, create if not exist). The file can also be opened using the fs.openSync method, which returns the file descriptor instead of providing it in a callback. There is also the fs.stat() method which enables getting the details about a file.</div>
<pre class="brush: javascript">const fs = require('fs')
fs.open('test.txt', 'r', (err, fd) => {
//do something
})
</pre>
<div>Node JS has a path module which provides methods such as dirname() which gets a file's parent directory, extname() which gives the file's extension, path.join() which joins two or more parts of a path, and path.resolve() which returns an absolute path.</div>
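<div>A short sketch of the path methods mentioned above (the file path is illustrative).</div>
<pre class="brush: javascript">const path = require('path');
const file = '/users/docs/notes.txt';
console.log(path.dirname(file));  // /users/docs
console.log(path.extname(file));  // .txt
console.log(path.join('/users', 'docs', 'notes.txt')); // /users/docs/notes.txt
console.log(path.resolve('docs', 'notes.txt')); // absolute path from the current working directory
</pre>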
<div><br /></div>
<div><b><span style="font-size: medium;">Event Module</span></b></div>
<div><br /></div>
<div>Node's event module provides the <a href="https://stackabuse.com/handling-events-in-node-js-with-evenemitter/" target="_blank">EventEmitter</a> class which allows creating and triggering events. It has two methods: emit() to trigger the event and on() to add a callback function to be executed when the event is triggered.</div>
<pre class="brush: javascript">const EventEmitter = require('events')
const eventEmitter = new EventEmitter()
eventEmitter.on('start', number => {
console.log(`started ${number}`)
})
eventEmitter.emit('start', 23)
</pre>
<div>The once() method of EventEmitter is used to subscribe so that the callback is executed only the first time the event is triggered.</div>
<pre class="brush: javascript">eventEmitter.once('start', (time) => {
console.log('Message Received from publisher');
console.log(`${time} seconds passed since the program started`);
});
</pre>
<div><br /></div>
<div><b><span style="font-size: medium;">Axios (Third Party Library)</span></b></div>
<div><br /></div>
<div><a href="https://axios-http.com/docs/intro" target="_blank">Axios</a> is a Promise based HTTP client for the browser as well as node.js. Using Promises is a great advantage when dealing with code that requires a more complicated chain of events.</div>
<pre class="brush: javascript">var axios = require('axios');
axios.all([
axios.get('https://api.nasa.gov/planetary/apod?api_key=DEMO_KEY&date=2017-08-03'),
axios.get('https://api.nasa.gov/planetary/apod?api_key=DEMO_KEY&date=2017-08-02')
]).then(axios.spread((response1, response2) => {
console.log(response1.data.url);
console.log(response2.data.url);
})).catch(error => {
console.log(error);
});
</pre>
<div><br /></div>
<div><b><span style="font-size: medium;">Express Framework</span></b></div>
<div><br /></div>
<div><a href="https://expressjs.com/" target="_blank">Express</a> is a minimal and flexible Node JS web <a href="https://www.tutorialspoint.com/nodejs/nodejs_express_framework.htm" target="_blank">application framework</a> that provides a robust set of features to develop web and mobile applications. It facilitates the rapid development of Node based Web applications. It allows to set up middlewares to respond to HTTP Requests, enables to define routing table to perform different actions based on HTTP Method and URL and can dynamically render HTML Pages based on passing arguments to templates.</div>
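<div>A minimal Express sketch, assuming the express package has been installed with npm install express; the route and port are illustrative.</div>
<pre class="brush: javascript">const express = require('express');
const app = express();
// define a route for HTTP GET on the root path
app.get('/', (req, res) => {
res.send('Hello World');
});
app.listen(3000, () => console.log('Express server listening on port 3000'));
</pre>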
<div><br /></div>
<div><br /></div>
<div><b><span style="font-size: large;">Node Package Manager (npm)</span></b></div><div><br /></div>
<div>The Node Package Manager (npm) is the default package manager for Node JS and is written entirely in Javascript. The npm manages all the packages and modules for Node JS and consists of command–line client npm. It gets installed into the system with the installation of Node JS. The required packages and modules in the Node project are installed using npm.</div><div><br /></div><div><div>The <a href="https://preview-docs.npmjs.com/cli-commands/npm-init" target="_blank">npm init</a> command is used to setup a new or existing npm package. It prompts to enter the project's name, version, description, entry point (main file), test command, git repository, keywords and license.</div><div><br /></div><div>The --yes option automatically populate all options with the default npm init values.</div><div>$ <b><span style="color: #2b00fe;">npm init --yes</span></b></div><div><br /></div><div><div>The npm init command generates or updates the package.json file in the current directory. The package.json contains metadata about the project and its dependencies. It has dependencies which are used for production and devDependencies for development environment. The "engines" field in package.json specifies which versions of npm are capable of properly installing the application. The package.json also supports a scripts property that can be defined to run command-line tools that are installed in the project's local context.</div></div><div><br /></div><div>The npm install or npm i command installs all the dependencies (javascript packages) listed in the package.json into the project directory. We can also pass the modules/packages to be installed to the npm install command. All the modules and dependencies are installed into the node_modules directory.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">npm install</span></b></div><div>$ <b><span style="color: #2b00fe;">npm i</span></b></div><div>$ <b><span style="color: #2b00fe;">npm install <package></span></b></div><div>$ <b><span style="color: #2b00fe;">npm install eslint babel-eslint</span></b></div><div><br /></div><div><div>When the package is installed it is saved as a property of the dependencies field, and becomes the default in the latest version of npm. </div><div><br /></div><div><div>Below are the signs that come before the <a href="https://semver.org/" target="_blank">semantic</a> versions with <b>major.minor.patch</b> model in package.json</div><div><ul style="text-align: left;"><li>^: The latest minor release. For example, a ^1.0.4 specification might install version 1.3.0 if that's the latest minor version in the 1 major series.</li><li>~: latest patch release. In the same way as ^ for minor releases, ~1.0.4 specification might install version 1.0.7 if that's the latest minor version in the 1.0 minor series.</li></ul></div></div><div>By default, npm install <package> will install the latest version of a package with the ^ version sign. An npm install within the context of an npm project will download packages into the project's node_modules folder according to package.json specifications, upgrading the package version (and in turn regenerating package-lock.json) wherever it can based on ^ and ~ version matching.</div><div><br /></div><div>All of the exact package versions are documented in a generated package-lock.json file. The package-lock.json describes the exact versions of the dependencies used in an npm JavaScript project. 
It is usually generated by the npm install command.</div></div><div><br /></div><div>The optional --save flag to the npm install command adds the package as a dependency entry into the project's package.json. Similarly, the --save-dev flag adds the package as a devDependency to the package.json, which indicates that the package is used for development purposes. In order to install a package without saving it in package.json, the --no-save argument is used.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">npm install <package> --save</span></b></div><div><br /></div><div><div>The npm install supports installation of packages from local directories.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">npm install ../some-local-package</span></b></div></div><div><br /></div><div>npm can install packages in local or global mode. In local mode, it installs the package in a node_modules folder in the parent working directory. In global mode, the packages are installed in a system or user directory, e.g. the "/usr/local/lib/node_modules/" directory. By default, npm installs global modules to the system directory, which requires authenticating as a privileged user to install global modules. It is recommended to change the default installation location from a system directory to a user directory.</div><div><br /></div><div>Install a package in global mode.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">npm install <package> --global</span></b></div><div>$ <b><span style="color: #2b00fe;">npm install <package> -g</span></b></div><div><br /></div><div>List all globally installed packages</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">npm list --global</span></b></div><div><br /></div><div>We need to add ".node_modules_global/bin" to the $PATH environment variable in order to run the global packages from the command line.</div><div><br /></div><div>The npm config command lists all the configuration settings using the environment variables, npmrc files, and package.json.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">npm config set <key> <value> [-g|--global]</span></b></div><div>$ <b><span style="color: #2b00fe;">npm config get <key></span></b></div><div>$ <span style="color: #2b00fe;"><b>npm config delete <key></b></span></div><div>$ <b><span style="color: #2b00fe;">npm config list [-l] [--json]</span></b></div><div>$ <b><span style="color: #2b00fe;">npm config edit</span></b></div><div><br /></div></div><div><div>The npm uninstall command removes a package.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">npm uninstall underscore</span></b></div></div><div><br /></div><div><div>In order to install a specific version of a package, the @ sign with the version number is appended to the package name as below.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">npm install underscore@1.9.1</span></b></div><div><br /></div><div>The npm outdated command displays the Current package version installed locally, the Latest available version of the package and the Wanted version of the package that we can upgrade to without breaking our existing code.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">npm outdated</span></b></div><div>Package Current Wanted Latest Location</div><div>underscore 1.9.1 1.9.2 1.9.2 project</div></div><div><br /></div><div>The npm update command updates the outdated modules to newer versions.</div><div><br /></div><div><div>When <a href="https://www.sitepoint.com/npm-guide/" target="_blank">npm installs</a> a package, it keeps a copy of the installed package cached within the local .npm directory, to avoid downloading it again next time. When the .npm directory gets cluttered, it can be cleaned up using the npm cache clean command as below.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">ls ~/.npm</span></b></div><div>$ <b><span style="color: #2b00fe;">npm cache clean --force</span></b></div></div><div><br /></div><div><div>npm allows developers to scan the dependencies for known security vulnerabilities using the npm audit command. The npm audit fix command automatically installs any compatible updates to vulnerable dependencies, fixing all the security vulnerabilities. The --force argument allows upgrading packages with breaking changes.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">npm audit</span></b></div><div>$ <b><span style="color: #2b00fe;">npm audit fix</span></b></div></div><div><br /></div><div><div>The <a href="https://docs.npmjs.com/cli/v7/commands/npm-shrinkwrap" target="_blank">npm shrinkwrap</a> command is used to lock the dependency versions in a project. It is run after installing all npm packages and creates a new npm-shrinkwrap.json file with information about all packages being used. It repurposes package-lock.json into a publishable npm-shrinkwrap.json or simply creates a new one. The file created and updated by this command will then take precedence over any other existing or future package-lock.json files.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">npm shrinkwrap</span></b></div><div><br /></div><div>The <a href="https://docs.npmjs.com/cli/v7/commands/npm-exec" target="_blank">npm exec</a> command allows running an arbitrary command from an npm package (either one installed locally, or fetched remotely), in a similar context as running it via npm run. When running via npm exec, a double-hyphen -- flag can be used to suppress npm's parsing of switches and options that should be sent to the executed command. The below npm exec command runs the "<span face="SFMono-Regular, Consolas, "Liberation Mono", Menlo, Courier, monospace" style="background-color: #f6f8fa; color: #393a34; font-size: 14px; white-space: pre;">foo bar --package=@npmcli/foo</span>" command.</div></div><div><br /></div><div>$ <b><span style="color: #2b00fe;">npm exec -- foo@latest bar --package=@npmcli/foo</span></b></div><div><br /></div><div>The <a href="https://docs.npmjs.com/cli/v7/using-npm/scripts" target="_blank">"scripts"</a> property in the package.json file supports a number of built-in scripts and their preset life cycle events as well as arbitrary scripts. These can all be executed by running npm run-script <stage> or npm run <stage> as shorthand. Similar to npm exec, the npm run command will run an arbitrary command from the "scripts" object of package.json. If no command is provided then it will list the available scripts.</div><div><br /></div><div><b>"scripts": { "lint": "eslint ./src --fix" }</b></div><div><br /></div><div>$ <b><span style="color: #2b00fe;">npm run lint</span></b></div><div><br /></div><div>The <a href="https://www.deadcoderising.com/how-to-smoothly-develop-node-modules-locally-using-npm-link/" target="_blank">npm link</a> command creates a global symbolic link for a dependency, which points to another directory or file on the system. 
In order to make our custom <module_name> available to the local project, we execute the following.</div><div><div><br /></div><div>$ <b><span style="color: #2b00fe;">cd <module_name> </span></b></div><div>$ <span style="color: #2b00fe;"><b>npm link</b></span></div><div><br /></div><div>This will create a symbolic link from the global node_modules directory to the <module_name> directory. Now to use the <module_name> module in any project, we go to the project directory, then link it to the module using npm link.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">cd <project_name></span></b></div><div>$ <b><span style="color: #2b00fe;">npm link <module_name></span></b></div><div><br /></div><div>The npm publish command publishes a package to the npm registry, i.e. <a href="http://npmjs.com" target="_blank">npmjs.com</a>.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">npm publish</span></b></div><div><br /></div><div>By default, scoped packages are published with private visibility. To publish a scoped package with public visibility, use npm publish --access public. The public package can be viewed by visiting https://npmjs.com/package/<package-name>.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">npm publish --access public</span></b></div><div><br /></div></div><div><div>The <a href="https://docs.npmjs.com/cli/v7/commands/npm-test" target="_blank">npm test</a> command executes predefined commands specified in the "test" property of the "scripts" object in package.json.</div><div><br /></div><div><b>"scripts": {</b></div><div><b> "test": "mocha"</b></div><div><b> }</b></div><div><br /></div><div>$ <b><span style="color: #2b00fe;">npm test</span></b></div></div><div><br /></div><div>An alternative method to install packages without updating the package.json is <a href="https://docs.npmjs.com/cli/v7/commands/npm-ci" target="_blank">npm ci</a>. It installs dependencies directly from package-lock.json and uses package.json only to validate that there are no mismatched versions. If any dependencies are missing or have incompatible versions, it will throw an error. It is meant to be used in automated environments such as test platforms, continuous integration, and deployment.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">npm ci</span></b></div><div><br /></div><div><br /></div><div><b><span style="font-size: large;">Node Package Execute (npx)</span></b></div><div><br /></div><div><div>npx is a tool for executing packages and comes bundled with npm from version 5.2 onwards. npx is typically used for executing one-off commands, e.g. to spin up a simple HTTP server. It allows testing and running a package without installing anything globally. npx executes the specified command either from a local node_modules/.bin, or from a central cache, installing any packages needed in order to run the command.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">npx http-server</span></b></div></div>
<div><br /></div>
<div>The npx binary was rewritten in npm version 7.0.0, and the standalone npx package was deprecated. npx now uses the npm exec command under the hood for backward compatibility.</div><div><br /></div><div><br /></div><div><div><b><span style="font-size: large;">Node Version Manager</span></b></div><div><br /></div><div><a href="https://www.sitepoint.com/quick-tip-multiple-versions-node-nvm/" target="_blank">Node Version Manager</a> (NVM) is a bash script used to manage multiple Node.js versions. It allows us to install and uninstall Node.js versions, and switch from one version to another. nvm supports both Linux and macOS, and has a second project named nvm-windows for Windows users.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.38.0/install.sh | bash</span></b></div><div><br /></div><div>The above command will clone the <a href="https://github.com/nvm-sh/nvm#install--update-script" target="_blank">nvm</a> repository to <b>~/.nvm</b> and add the source line to your profile (<b>~/.bash_profile</b>, <b>~/.zshrc</b>, <b>~/.profile</b>, or <b>~/.bashrc</b>). Restart the terminal before using nvm.</div><div><br /></div><div>The below command lists all the available Node JS versions.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">nvm ls-remote</span></b></div><div><br /></div><div>To install the most recent Node JS version, we use the nvm install command.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">nvm install node</span></b></div><div><br /></div><div>We can also install a specific version of Node JS as below.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">nvm install v9.3.0</span></b></div><div><br /></div><div>nvm also allows installing a specific Node JS version while reinstalling the global npm packages from another installed version.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">nvm install v12.14.1 --reinstall-packages-from=10.18.1</span></b></div><div><br /></div><div>We can uninstall a previously installed Node JS version as below.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">nvm uninstall 13.6.0</span></b></div><div><br /></div><div>To view the list of installed Node JS versions, run:</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">nvm ls</span></b></div><div><br /></div><div>To list the available remote versions on Windows (nvm-windows), the command is as below.</div><div><br /></div><div><div>$ <b><span style="color: #2b00fe;">nvm list available</span></b></div></div><div><b><span style="color: #2b00fe;"><br /></span></b></div><div>To switch through installed versions, <a href="https://ostechnix.com/install-node-js-linux/" target="_blank">nvm</a> provides the nvm use command. 
This works similarly to the install command.</div><div><br /></div><div>$ <span style="color: #2b00fe;"><b>nvm use 13.6.0</b></span></div><div><br /></div><div>Switch to the latest Node JS version:</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">nvm use node</span></b></div><div><br /></div><div>To switch between different versions of Node JS</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">nvm run node v9.3.0</span></b></div><div><br /></div><div>We can create custom aliases to identify a specific node version.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">nvm alias awesome-version 13.6.0</span></b></div><div>$ <b><span style="color: #2b00fe;">nvm use awesome-version</span></b></div><div><br /></div><div>We can check the current node version in use with the below nvm command.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">nvm current</span></b></div><div><br /></div><div>We can run a command directly for an installed version without switching the node variable.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">nvm run 13.6.0 --version</span></b></div><div><br /></div><div>We can run a command on a sub-shell, targeting a specific version:</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">nvm exec 13.6.0 node --version</span></b></div><div><br /></div><div>To get the path to Node JS executable for a specific installed version, below command is used.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">nvm which 13.6.0</span></b></div></div><div><br /></div>
<div><div><b><span style="font-size: large;"><br class="Apple-interchange-newline" />Babel</span></b></div><div><br /></div><div>Babel is a JavaScript transpiler that converts converts edge JavaScript(ES6) into plain old ES5 JavaScript that can run on any browser. Transpilers, or source-to-source compilers, are tools that read source code written in one programming language, and produce the equivalent code in another language which is in the same level, e.g. from typescript to javascript. As browsers evolve, new APIs and ECMAScript features are added. Since different browsers evolve at different speeds and prioritize different features, Babel helps to support them all and still use the modern features.</div><div><br /></div><div>The core functionality of <a href="https://www.codecademy.com/learn/introduction-to-javascript/modules/learn-javascript-transpilation/cheatsheet" target="_blank">Babel</a> resides at the @babel/core module. The babel-cli JavaScript package contains Babel command line tools. Install the babel-cli (babel-core can be optional) in development mode.</div><div><br /></div><div><div>$ <b><span style="color: #2b00fe;">npm install babel-cli babel-core --save-dev</span></b></div></div><div><br /></div><div><div>Install babel-preset-es2015, which is an array of plugins which allow babel to transpile ES6 JavaScript code to ES5, in devlopment mode in corresponding project.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">npm install –-save-dev babel-cli babel-preset-es2015</span></b></div></div><div><br /></div><div><div>Create the .babelrc file in the root directory and properties are plugins & presets. Plugin property is used to transpile specific features e.g. plugins like arrow function, classes, instanceof etc. In order to transpile all the features of ES6, presets property is used. Preset are a simple collection of babel plugins.</div><div><br /></div><div>// projectname/.babelrc</div><div>{</div><div> "presets": ["es2015"]</div><div>}</div><div><br /></div><div>Convert the ES6 code into ES5 by executing babel command.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">babel src -d build</span></b></div></div><div><br /></div></div><div><br /></div><div><b><span style="font-size: large;">ESLint</span></b></div><div><br /></div><div><div>Javascript does not deprecate old code syntax, since a lot of legacy code would be impacted which are running on JS Engines of different browsers. The new syntax is added on top of the existing syntax, were both the old and new syntax works. In order to ensure that the project does not use the old syntax and uses corresponding ECMAScript version, tools such as ESLint come in handy. <a href="https://eslint.org/docs/user-guide/getting-started" target="_blank">ESLint</a> is a tool for identifying and reporting on patterns found in ECMAScript/JavaScript code, with the goal of making code more consistent and avoiding bugs. Below is the setup process for ESLint. ESLint statically analyzes the code to find problems and can automatically fix many problems. 
ESLint is written in JavaScript on Node.js and supports React’s JSX format and ES6 features.</div></div><div><br /></div><div>$ <b><span style="color: #2b00fe;">npm install eslint --save-dev</span></b></div><div><br /></div><div>$ <b><span style="color: #2b00fe;">npx eslint --init</span></b></div><div><br /></div><div>$ <b><span style="color: #2b00fe;">npx eslint check-this-file.js</span></b></div><div><br /></div><div>After running eslint --init, we get a .eslintrc.{js,yml,json} file in the directory where the rules can be <a href="https://eslint.org/docs/user-guide/configuring/" target="_blank">configured</a>.</div><div><br /></div><div><br /></div><div><div><b><span style="font-size: large;">Yarn</span></b></div><div><br /></div><div>Yarn is a new JavaScript package manager built to solve the consistency, security and speed issues of installing packages using npm, most of which are <a href="https://x-team.com/blog/yarn-vs-npm/" target="_blank">now resolved</a>. Yarn is only a new CLI client that fetches modules from the npm registry. Yarn can execute package installation tasks in parallel, thus increasing performance, compared to npm which executes the tasks sequentially per package. The npm output is verbose as it recursively lists all installed packages when running npm install <package>. Yarn lists significantly less information with appropriate emojis.</div></div><div><br /></div><div>Yarn can be installed using npm, which comes bundled with Node JS. Yarn can also be installed directly using the corresponding system-specific installation methods, e.g. <a href="https://linuxize.com/post/how-to-install-yarn-on-ubuntu-18-04/" target="_blank">Ubuntu</a>, <a href="https://tecadmin.net/install-yarn-macos/" target="_blank">Mac OS</a> and <a href="https://www.liquidweb.com/kb/how-to-install-yarn-on-windows/" target="_blank">Windows</a>.</div><div><br /></div><div><div>$ <b><span style="color: #2b00fe;">npm install --global yarn</span></b></div><div>$ <b><span style="color: #2b00fe;">yarn --version</span></b></div></div><div><br /></div><div><div><a href="https://flaviocopes.com/yarn/" target="_blank">Yarn</a> writes its dependencies to package.json and stores the dependencies in the node_modules directory, similar to npm. Yarn also auto-generates a <a href="https://classic.yarnpkg.com/en/docs/yarn-lock/" target="_blank">yarn.lock</a> file in the current directory, which is handled entirely by Yarn. As packages/dependencies are added/upgraded/removed using the Yarn CLI, it will automatically update the yarn.lock file. The yarn.lock file includes everything Yarn needs to lock the versions for all packages in the entire dependency tree. 
It is recommended that the yarn.lock file be checked into source control.</div><div><br /></div><div>The yarn init command is used to initialize a new project.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">yarn init</span></b></div><div><br /></div><div>The yarn or yarn install commands are used to install all the dependencies from the package.json file.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">yarn</span></b></div><div>$ <b><span style="color: #2b00fe;">yarn install</span></b></div><div><br /></div><div>The add command is used to add a dependency to the project.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">yarn add <package-name></span></b></div><div><br /></div><div>Install a package globally</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">yarn global add <package-name></span></b></div><div><br /></div><div>The --dev (or alias -D) flag is used to add a package as a development dependency.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">yarn add --dev <package-name></span></b></div><div> </div><div>Upgrade a single package or dependency.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">yarn upgrade <package-name></span></b></div><div> </div><div>Upgrade all the dependencies.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">yarn upgrade</span></b></div><div><br /></div><div>For selectively upgrading packages within the project, the interactive upgrade command is used.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">yarn upgrade-interactive</span></b></div><div> </div><div>Remove a dependency.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">yarn remove <package-name></span></b></div><div> </div><div>Add a global dependency.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">yarn global add <package-name></span></b></div><div><br /></div><div>To check for missing packages to be installed, use the <a href="https://classic.yarnpkg.com/en/docs/cli/check/" target="_blank">yarn check</a> command.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">yarn check</span></b></div><div><br /></div><div>Yarn provides a handy tool that prints the license of any dependency of the project.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">yarn licenses ls</span></b></div><div>$ <b><span style="color: #2b00fe;">yarn licenses generate-disclaimer</span></b></div><div><br /></div><div>Yarn helps to inspect the packages and determine why a specific package was installed.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">yarn why <package-name></span></b></div><div><br /></div><div><div>The yarn generate-lock-entry command generates a yarn.lock file based on the dependencies set in package.json. This command should be used with caution as it changes the yarn.lock file.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">yarn generate-lock-entry</span></b></div></div><div><br /></div><div>Yarn uses a global offline cache to store packages once installed, to be used as a cache for new installations. The below command determines Yarn's cache directory.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">yarn cache dir</span></b></div><div><br /></div><div>Yarn can work with <a href="https://code.tutsplus.com/tutorials/6-things-that-make-yarn-the-best-javascript-package-manager--cms-29465" target="_blank">multiple registry types</a>. 
By default, it uses the npm registry, but it can also add packages from files, remote tarballs, or remote git repositories. The below command shows the currently configured npm registry.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">yarn config get registry</span></b></div><div><br /></div><div>Install a local package using a file path.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">yarn add file:/<path to local package directory></span></b></div><div><br /></div><div>Install a package using a remote tarball URL.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">yarn add https://<path to compressed tarball>.tgz</span></b></div><div><br /></div><div>Install a package using a remote git repository URL.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">yarn add <git remote-url></span></b></div></div><div><br /></div><div><br /></div><div><br /></div>Pranav Patilhttp://www.blogger.com/profile/10125896717744035325noreply@blogger.com1tag:blogger.com,1999:blog-7610061084095478663.post-46760208231313540022020-11-03T20:03:00.161-08:002020-11-13T18:37:22.889-08:00Pandas - Python Data Analysis Library<div>Pandas is a high-level data manipulation library built on top of the NumPy package, hence a lot of the structure of NumPy is used or replicated in Pandas. It provides more flexibility in working with large datasets and helpful methods to carry out various data operations for data analysis. Pandas allows us to load, prepare, manipulate, model, and analyze data, regardless of the origin of the data. Pandas provides tools for loading data from various file formats into in-memory data objects. It allows us to reshape & pivot the data sets, aggregate data using group by, and slice, index and subset large data sets. We can join and merge data sets with high performance using Pandas. Data in pandas is often used to feed statistical analysis in <a href="https://www.scipy.org/" target="_blank">SciPy</a>, plotting functions from <a href="https://matplotlib.org/" target="_blank">Matplotlib</a>, and machine learning algorithms in <a href="https://scikit-learn.org/" target="_blank">Scikit-learn</a>. Pandas has become a popular tool for effective data manipulation and analysis. Python with Pandas is used in a wide range of academic and commercial domains, including finance, economics, statistics, analytics, etc.</div><div><br /></div><div><a href="https://jupyter.org/install" target="_blank">Jupyter Notebooks</a> provide a good environment for using pandas to do data exploration and modeling. Jupyter Notebook enables executing a piece of code in a particular cell as opposed to running the entire file. This saves a lot of time when working with large datasets and complex transformations. Notebooks also provide an easy way to visualize pandas' DataFrames and plots. </div><div><br /></div><div><b><span style="font-size: large;">Series</span></b></div><div><br /></div><div>The two primary components of pandas are the Series and the DataFrame. </div><div><br /></div><div>A Series is a one-dimensional labeled array which can contain any type of data, including mixed types. The axis labels are collectively called the index. The labels need not be unique but must be a hashable type. The object supports both integer and label-based indexing and provides a host of methods for performing operations involving the index.</div><div><br /></div><div>
A Series is created from a Python list as below. The default numerical index can be overridden by passing a custom index array as below.</div>
<pre class="brush: python">series1 = pd.Series([1, 2, 3, 4])
series2 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
type(series1)
</pre>
<div>Series elements can be accessed by using the index operator [] and an index number. Multiple elements can be accessed using the slice operation. Elements can also be accessed using the custom index label passed during initialization of the series. The .loc and .iloc indexers can also be used instead of the indexing operator to make selections. The .loc indexer retrieves data by index label and can also select subsets of data compared to the indexing operator. The .iloc indexer is very similar to .loc but only uses integer locations to make its selections.</div>
<pre class="brush: python"># Get all elements from start to third
print(series1[:3])
# Uses index labels to fetch values
print(series2['b'])
# access the element of series using .loc[] function.
print(series1.loc[2:4])
# using .iloc[] function
print(series1.iloc[2:5])
</pre>
<div>Series also provides functions for binary operations such as addition, subtraction, multiplication etc as below.</div>
<pre class="brush: python">series_a = pd.Series([5, 2, 3,7], index=['a', 'b', 'c', 'd'])
series_b = pd.Series([1, 6, 4, 9], index=['a', 'b', 'd', 'e'])
# adding two series using add function
series_a.add(series_b, fill_value=0)
# subtract one series from another using sub function
series_a.sub(series_b, fill_value=0)
</pre>
<div>Series can be converted to a list or other types with the help of conversion functions such as .astype(), .tolist() etc.</div>
<pre class="brush: python"># importing pandas module
import pandas as pd
# reading csv file from url
data = pd.read_csv("nba.csv")
# dropping null value columns to avoid errors
data.dropna(inplace = True)
# storing dtype before converting
before = data.dtypes
# converting dtypes using astype
data["Salary"]= data["Salary"].astype(int)
data["Number"]= data["Number"].astype(str)
salary_list = data["Salary"].tolist()
# storing dtype after converting
after = data.dtypes
</pre>
<div>The Series <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.to_frame.html" target="_blank">.to_frame()</a> method is used to convert a Series object into a DataFrame.</div>
<pre class="brush: python">import pandas as pd
series = pd.Series([100, 200, 300, 400, 500])
# converting the series into the dataframe
dataframe = series.to_frame()
</pre>
<div>The Pandas Series unique() function extracts the unique values from the dataset. The unique() method does not take any parameter and returns a numpy array of unique values in that particular column. It can only be applied to a 1-dimensional array. The unique() method works only on Series and not on DataFrames. The pd.unique() method includes a NULL or None or NaN value as a unique value.</div>
<pre class="brush: python">dataset = {
'Name': ['Jack', 'Rock', 'Tom', 'Jack', 'Mike'],
'serial_no': ['01', '02', '03', '04', '05']}
df = pd.DataFrame(dataset)
group = df["Name"].unique()
# unique function also takes an array of tuples
print(pd.unique([('x', 'y'), ('y', 'x'), ('x', 'z'), ('y', 'x')]))
# nan and None are part of the unique values in the result
print(pd.unique([('x', 'y'), ('y', 'x'), ('x', 'z'), np.nan, None, np.nan]))
</pre>
<div><br /></div><div><b><span style="font-size: large;">DataFrame</span></b></div>
<div><br /></div><div><a href="https://www.practicaldatascience.org/html/pandas_dataframes.html" target="_blank">DataFrame</a> is a 2-dimensional (labeled) data structure with columns of potentially different Python Data types. A Series is essentially a column, and a DataFrame is a multi-dimensional table made up of a collection of Series. DataFrames are used to store and manipulate tabular data in rows of observations and columns of variables.</div>
<div><br /></div>
<div>
The simplest way to create a DataFrame is to use a Python dictionary object. Each (key, value) item in the data corresponds to a column in the resulting DataFrame. The default index of the DataFrame is numbers starting from 0, but we could also provide our own index when we initialize the DataFrame as below.
<pre class="brush: python">data_dictionary = {
'apples': [3, 2, 0, 1],
'oranges': [0, 3, 7, 2]
}
purchases = pd.DataFrame(data_dictionary)
purchases = pd.DataFrame(data_dictionary, index=['John', 'Tony', 'Michael', 'Chris'])
</pre>
</div>
<div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-l6_QTPOtEZQ/X3p26BbYX-I/AAAAAAAAh-w/CIMK08akP2wwaPlxis9NVyr1nAuFPFA1ACLcBGAsYHQ/s1000/series-and-dataframe.width-1200.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="383" data-original-width="1000" height="246" src="https://1.bp.blogspot.com/-l6_QTPOtEZQ/X3p26BbYX-I/AAAAAAAAh-w/CIMK08akP2wwaPlxis9NVyr1nAuFPFA1ACLcBGAsYHQ/w640-h246/series-and-dataframe.width-1200.png" width="640" /></a></div><br />
<div><br /></div>
<div>We will be focusing on DataFrames in the subsequent sections below.</div>
<div><br /></div>
<div><b><span style="font-size: large;">Reading data from files</span></b></div><div><br /></div><div>Pandas allows to load data from various file formats such as csv, excel and json. It can also data from remote URL. While reading CSV data into DataFrame, the index_col parameter indicates the column which will become the index of the DataFrame. If we pass index_col = 0, then the first column of the DataFrame will be converted into the index.</div>
<div>
<pre class="brush: python">import pandas as pd
df = pd.read_csv('sales.csv') # returns as data frame object
df = pd.read_csv('sales.csv', index_col=0) # making zero-th column as index
# load files in tab separated format
df_tab = pd.read_csv('sales_tab_format.txt', delimiter='\t')
# load files in excel format
df_xlsx = pd.read_excel('sales.xlsx')
# read json file data, which can also take a URL
df_json = pd.read_json("https://datasource.com/master/data/records.json")
</pre>
</div>
<div><br /></div>
<div>Although Pandas does not have any built-in function to read multiple files within a directory path into a single DataFrame, this can be achieved using basic Python file operations as shown in the below example.<pre class="brush: python">import pandas as pd
import os
files = [file for file in os.listdir('/path/directory')]
all_data = pd.DataFrame()
for file in files:
    df = pd.read_csv('/path/directory/'+file)
    all_data = pd.concat([all_data, df])
</pre>
</div>
<div><br /></div>
<div><b><span style="font-size: large;">Reading data from Database</span></b></div><div><br /></div>
<div>
Pandas can also work with a SQL database by first establishing a connection using an appropriate Python library, and then passing a SQL query to fetch data into pandas. Pandas loads data from a SQL database using the read_sql() function, which is a wrapper around read_sql_table and read_sql_query. Below is an example of accessing data from a MySQL database.
<pre class="brush: python">from sqlalchemy import create_engine
import pymysql
mysql_connection_str = 'mysql+pymysql://mysql_user:mysql_password@mysql_host/mysql_db'
mysql_connection = create_engine(mysql_connection_str)
df = pd.read_sql('SELECT * FROM table_name', con=mysql_connection)
</pre>
<br />
Another example of loading data from Oracle Database using <a href="https://oracle.github.io/python-cx_Oracle/" target="_blank">cx_Oracle driver</a>.
<pre class="brush: python">import cx_Oracle
import pandas as pd
class Connection(cx_Oracle.Connection):
    def cursor(self):
        cursor = super(Connection, self).cursor()
        cursor.arraysize = 5000
        return cursor
ora_connection = Connection("oracle_user", "oracle_password", "ora12c")
df = pd.read_sql_query("select * from table_name", ora_connection)
</pre>
</div>
<div><br /></div>
<div><b><span style="font-size: large;">Indexing</span></b></div>
<div><br /></div>
<div>Indexes in Pandas are similar to an address which enables access to any data point across the DataFrame or Series. A DataFrame has an index that labels every row with its initial row number similar to a Series, and a second set of labels which are the column names. An Index in Pandas is an immutable array which enables access to a row or column using its label. The index of the Series or DataFrame can be accessed using the index attribute or the Index() function. It also allows changing the index column title.
<pre class="brush: python">df = pd.DataFrame({"Letters": ["a", "b", "c"], "Numbers": [1, 2, 3]})
# returns the DataFrame index as an attribute
df_index = df.index
# update index column name
df_index.name = "Index Title"
series = [11, 21, 21, 19, 11]
# returns index of list object
seriObj = pd.Index(series)
</pre>
</div><div><br />
The <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.set_index.html" target="_blank">set_index()</a> function is used to set the List, Series or DataFrame as an index of the Data Frame. It sets the DataFrame index (row labels) using one or more existing columns or arrays (of the correct length). It takes <b>keys</b> parameter with list of column names to set as DataFrame index, <b>drop</b> parameter which if True removes the column used for index and <b>append</b> parameter which appends the column to the existing index column if True.</div>
<div><pre class="brush: python"># set the Rank column as index to the current DataFrame
data.set_index('rank',inplace=True)
# create a MultiIndex using columns 'year' and 'month'
df.set_index(['year', 'month'])
# create the MultiIndex using an Index and a column:
df.set_index([pd.Index([1, 2, 3, 4]), 'year'])
# create a MultiIndex using two Series
s = pd.Series([1, 2, 3, 4])
df.set_index([s, s**2])
# set the Timestamp column as index
df.set_index('Timestamp', inplace=True, drop=True)
# drop the passed columns and append them to the already existing index column
df.set_index(["Month", "Year"], inplace = True, append = True, drop = False)
</pre></div>
<div><br />
Index of a DataFrame can be reset to a list of integers ranging from 0 to length of data as the index using the <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html" target="_blank">reset_index()</a> method. It reassigns the index i.e. the row label of DataFrame and Series to the sequence numbers starting from 0. It is mostly used to remove the current index or use row name (string) as the index. The reset_index() takes level parameter which can be column name or number, or list of columns to remove the index. The drop parameter adds the replaced index column to the data if the value is false. When the drop parameter is set to True, the original index is deleted. The col_level selects the column level to insert the labels, while col_fill determines the naming of other levels. By default, the reset_index() does not change the original object and returns the new object, but if the argument inplace is set to True, an original object is changed.<pre class="brush: python">dataset = {
'Name': ['Rohit', 'Mohit', 'Sohit', 'Arun', 'Shubh'],
'Roll no': ['01', '03', '04', '05', '09'],
'Marks in maths': ['93', '63', '74', '94', '83'],
'Marks in science': ['88', '55', '66', '94', '35'],
'Marks in english': ['93', '74', '84', '92', '87']}
df = pd.DataFrame(dataset)
# Setting index on the name column
df.set_index(["Name"], inplace=True, append=True, drop=True)
# Resetting index to level 1, back to original form
df.reset_index(level=1, inplace=True, col_level=1)
# Setting index on MultiIndex which is name and Roll no
df.set_index(["Name", "Roll no"], inplace=True, append=True, drop=True)
# convert the 1st index to a column in the DataFrame
df.reset_index(level=0, inplace=True)
# Resetting index to level 2, remove the index on Roll no and index only the Name column
df.reset_index(level=2, inplace=True, col_level=1)
# reset old index to new index, saving old index as new column named 'index'
new_df = new_df.reset_index()
# reset to new index without saving existing index column
new_df = new_df.reset_index(drop=True)
# perform operation in place rather than creating new data frame altogether
new_df.reset_index(drop=True, inplace=True)
</pre>
</div>
<div><br /></div>
<div><b><span style="font-size: large;">Printing data within DataFrame</span></b></div><div><br /></div>
<div>
The head(), tail() and slice operations allow getting the first or last n rows of a DataFrame. The head() function outputs the first five rows of the DataFrame by default. On passing a number argument to the head() function, it outputs the corresponding number of rows. Similarly the tail() function also accepts a number argument.<pre class="brush: python">print(df.head(3)) # get the top 3 rows in the data frame
print(df.tail(3)) # get the bottom 3 rows in the data frame
print(df[50:55]) # get the rows by specifying the row numbers
</pre>
</div>
<div><br /></div>
<div>Indexing (Subset Selection) in pandas means selecting all the rows and some of the columns, some of the rows and all of the columns, or some of each of the rows and columns. The indexing operator [] is used to select rows or columns in the dataframe. Pandas allows to read data columns using column headers (column names). The indexing operator [] also enables to take a subset of the data in number of ways when the DataFrames are too large to work with as below.</div>
<div>
<pre class="brush: python">data = {'Name':['John', 'Jimmy', 'Jackie', 'Joseph'],
'Age':[27, 24, 22, 32],
'Address':['Dublin', 'London', 'Vancover', 'Tokyo'],
'Qualification':['MS', 'MD', 'MBA', 'Phd']}
df = pd.DataFrame(data)
## Read Headers of data columns
print(df.columns)
## Read single column
print(df['Name'])
# Read multiple columns by name at once using the double brackets syntax
print(df[['Name', 'Age', 'Qualification']] )
# Read the top 5 rows in the 'Name' column
print(df['Name'][0:5])
# select all whose age is 24 and qualification is MD
df[(df.Age == 24) & (df.Qualification == 'MD')]
</pre>
</div>
<div><br /></div><div>Pandas provides multiple <a href="https://pandas.pydata.org/pandas-docs/stable/user_guide/options.html" target="_blank">options</a> to display the data in a more customized format such as decimal precision, maximum number of rows displayed, row wrapping etc.</div><div><br /></div><div><b><span style="font-size: large;">Indexer to select rows and columns</span></b></div><div><br /></div><div>Apart from the indexing operator [], Pandas provides the <a href="https://www.shanelynn.ie/select-pandas-dataframe-rows-and-columns-using-iloc-loc-and-ix" target="_blank">loc and iloc</a> indexers to perform just about any data selection operation. The loc indexer is label-based, where rows and columns are fetched by their (row and column) labels. It can select subsets of rows or columns and even select rows/columns simultaneously. The loc indexer is used for selecting rows by label/index or by a boolean condition within the DataFrame. The loc indexer uses the syntax, <b>df.loc[<row selection>, <column selection>]</b>. The selection using the loc method is based on the index values of any rows of the DataFrame. The index is set on a DataFrame using the <b>df.set_index() </b>method. Below are the examples of selecting single/multiple rows and columns from the DataFrame. When selecting columns using the loc indexer, columns are referred to by names using lists of strings, or “:” slices. We can also select ranges of index labels which return all rows in the data frame between the specified index entries.</div>
<pre class="brush: python">data = {'first_name':['Johnny', 'Jimmy', 'Joe', 'Joseph'],
'last_name':['Depp', 'Harrison', 'Liberman', 'Nash'],
'company_name':['Apple', 'Google', 'Tesla', 'Microsoft'],
'address':['14 Taylor Street', '15 Binney Street', '8 Moor Place', '5396 Forth Street'],
'city':['London', 'Berlin', 'Los Angeles', 'Dubai'],
'phone':['0343-6454565', '023-4345345', '012-54645646', '09-23423424']}
df = pd.DataFrame(data)
df.set_index("last_name", inplace=True)
# selecting single row
df.loc['Depp']
# selecting multiple rows
df.loc[[ 'Depp', 'Harrison']]
# select rows and columns using names of the columns
df.loc[['Harrison', 'Liberman'], ['first_name', 'address', 'phone']]
# select rows with index values 'Liberman' and 'Nash', with all columns between 'city' and 'phone'
df.loc[['Liberman', 'Nash'], 'city':'phone']
# Select same rows, with just 'first_name', 'address' and 'city' columns
df.loc['Harrison':'Nash', ['first_name', 'address', 'city']]
</pre>
<div><br /></div><div>The iloc is integer based indexer and allows to specify rows and columns by their integer index. The iloc indexer is used for integer-location based indexing or selection by position. The iloc indexer syntax is <b>df.iloc[<row selection>, <column selection>]</b>. The iloc indexer returns a Pandas Series when one row is selected, and a Pandas DataFrame when multiple rows are selected, or if any column in full is selected.</div><div>
<pre class="brush: python"># Single selection of Rows:
df.iloc[0] # first row of data frame
df.iloc[1] # second row of data frame
df.iloc[-1] # last row of data frame
# Single selection of Columns:
df.iloc[:,0] # first column of data frame
df.iloc[:,1] # second column of data frame
df.iloc[:,-1] # last column of data frame
## Reading a specific location (R,C) using integer location function passing row and column number
df.iloc[2,1]
# Multiple row and column selections using iloc and DataFrame
df.iloc[0:5] # first five rows of dataframe
df.iloc[:, 0:2] # first two columns of data frame with all rows
df.iloc[[0,3,6,24], [0,5,6]] # 1st, 4th, 7th, 25th row + 1st 6th 7th columns.
df.iloc[0:5, 5:8] # first 5 rows and 5th, 6th, 7th columns of data frame
# finding specific row based on text rather than index
df.loc[df['Type 1'] == "Fire"] # find all rows where column 'Type 1' value is 'Fire'
</pre>
</div>
<div><br /></div><div><b><span style="font-size: large;">Logical Boolean indexing</span></b></div><div><br /></div><div>The conditional selections with boolean arrays using <b>data.loc[<selection>]</b> enables to fetch values which match the specified condition. We pass an array or Series of True/False values to the loc indexer to select the rows where the Series has True values. For example, the statement df[‘first_name’] == ‘Jimmy’] produces a Pandas Series with a True/False value for every row in the DataFrame, where there are “True” values for the rows where the first_name is “Jimmy”. Further, the second argument to the loc method takes the column names to fetch, which can be a single string, a list of columns or slice ":" operation. Passing multiple column names to the second argument of loc[] enables to select multiple columns. Selection of single column returns a Series, while selecting a list of columns returns a DataFrame.</div><div>
<pre class="brush: python"># Select rows with first name Jimmy
df.loc[df['first_name'] == 'Jimmy']
# Select rows with first name Jimmy and columns 'company_name' 'city' and 'phone'
df.loc[df['first_name'] == 'Jimmy', ['company_name', 'city', 'phone']]
# Select rows with first name Jimmy and all columns between 'address' and 'phone'
df.loc[df['first_name'] == 'Jimmy', 'address':'phone']
# Select rows with last_name equal to some values, all columns
df.loc[df['first_name'].isin(['Joe', 'Johnny', 'Joseph'])]
# Select rows with first name Johnny AND 'Street' addresses
df.loc[df['address'].str.endswith("Street") & (df['first_name'] == 'Johnny')]
# select rows with id column between 100 and 200, and just return 'address' and 'city' columns
df.loc[(df['id'] > 100) & (df['id'] <= 200), ['address', 'city']]
# A lambda function that yields True/False values can also be used. E.g. Select rows where the company name has 4 words in it.
df.loc[df['company_name'].apply(lambda x: len(x.split(' ')) == 4)]
# Selections can be achieved outside of the main .loc for clarity
idx = df['company_name'].apply(lambda x: len(x.split(' ')) == 4)
# Select only the True values in 'idx' and only the 3 columns specified:
df.loc[idx, ['city', 'first_name', 'company_name']]
</pre></div>
<div><br /></div>
<div>Few more examples of filtering data using <a href="https://www.shanelynn.ie/select-pandas-dataframe-rows-and-columns-using-iloc-loc-and-ix/" target="_blank">loc method</a></div>
<pre class="brush: python">df.loc[df['Type 1'] == 'Grass']
# inside pandas data frame '&' is used instead of 'and'.
df.loc[(df['Type 1'] == 'Grass') & (df['Type 2'] == 'Poisen')]
df.loc[(df['Type 1'] == 'Grass') | (df['Type 2'] == 'Poisen')]
# create a new data frame based on filtering results
new_df = df.loc[(df['Type 1'] == 'Grass') & (df['Type 2'] == 'Poisen') & (df['HP'] > 70)]
# filter to get all the names containing the text 'Mega'
df.loc[df['Name'].str.contains('Mega')]
# remove all the names containing the text 'Mega', negation example in loc
df.loc[~df['Name'].str.contains('Mega')]
# The str.contains() function can also accept regular expressions
import re
df.loc[df['Type 1'].str.contains('Fire|Grass', regex=True)]
# ignore case for str.contains()
df.loc[df['Type 1'].str.contains('fire|grass', flags=re.I, regex=True)]
# fetch all the names in 'Name' column starting with 'Pi'
df.loc[df['Name'].str.contains('^pi[a-z]*', flags=re.I, regex=True)]
</pre>
<div><br /></div>
<div><b><span style="font-size: large;">Setting values in DataFrames using loc</span></b></div><div><br /></div><div>We can update the DataFrame in the same statement as the select and filter using loc indexer. This particular pattern allows to update values in columns depending on different conditions. The setting operation does not make a copy of the data frame, but edits the original data.</div>
<div>
<pre class="brush: python"># change the first name of all rows with an ID greater than 2000 to "Robert"
df.loc[df['id'] > 2000, "first_name"] = "Robert"
# update all the places were value of 'Type 1' is 'Fire' to new value 'Flamer'
df.loc[df['Type 1'] == 'Fire', 'Type 1'] = 'Flamer'
# update all the values of column 'Legendary' to True, were the value of column 'Type 1' is 'Fire'
df.loc[df['Type 1'] == 'Fire', 'Legendary'] = True
# change multiple column values based on the value condition on another column
df.loc[df['Total'] > 500, ['Generation','Legendary']] = 'Test Value'
df.loc[df['Total'] > 500, ['Generation','Legendary']] = ['Test1 Value', 'Test2 Value']
</pre>
</div>
<div><br /></div>
<div><b><span style="font-size: large;">Iteration</span></b></div><div><br /></div><div>Dataframe contains rows and columns, and is iterated similar like a dictionary. Rows are iterated using the functions, iteritems(), iterrows(), itertuples(). Columns of DataFrame can be iterated by first creating a list of dataframe columns and then by iterating through that list to pull out the dataframe columns. Iterate through each row in the Dataframe, to read row data.
<pre class="brush: python">## Read each row
for index, row in df.iterrows():
    print(index, row)
# Read each row for the specified column name
for index, row in df.iterrows():
    print(index, row['first_name'])
# creating a list of dataframe columns
columns = list(df)
# iterating columns and print values of 3rd row
for i in columns:
    print(df[i][2])
# iterate specified column names in the DataFrame
for column in df[['first_name', 'address']]:
    print('Column Name : ', column)
# iterate over columns using an index
for index in range(df.shape[1]):
    print('Column Number : ', index)
    # Select column by index position using iloc[]
    columnsObj = df.iloc[:, index]
    print('Column Contents : ', columnsObj.values)
</pre>
</div>
<div><br /></div>
<div>The iteritems() function returns an iterator over the columns of the DataFrame, yielding a tuple with the column name and the column content as a Series for each column (in newer pandas versions this method is named items()).</div>
<div>
<pre class="brush: python">for (columnName, columnData) in df.iteritems():
# prints column name
print('Colunm Name : ', columnName)
# prints all the values in DataFrame
print('Column Values : ', columnData.values)
</pre></div>
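<div><br /></div><div>The itertuples() function mentioned above iterates over the DataFrame rows as named tuples, which is generally faster than iterrows(). Below is a minimal sketch, assuming the same df with 'first_name' and 'address' columns used in the earlier examples.</div>
<pre class="brush: python"># iterate rows as named tuples; the row label is available as the Index attribute
for row in df.itertuples():
    print(row.Index, row.first_name, row.address)
# itertuples(index=False) omits the index from each tuple
for row in df.itertuples(index=False):
    print(row.first_name, row.address)
</pre>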
<div><br /></div><div><b><span style="font-size: large;">DataFrame Properties</span></b></div><div><br /></div><div>
The info() function provides the concise summary of the dataset, such as the number of rows and columns, the number of non-null values, type of data in each column, and memory usage. The number of non-null values and the column data type helps to detect data issues before performing data operations. The describe() method enables to get statistics details about the data within the Dataframe. The shape property returns a tuple representing the dimensionality of the Dataframe. The values property returns all the rows in the DataFrame.
<pre class="brush: python"># summary information of the dataframe
df.info()
# read basic statistics information of each column in the data set, e.g. count, mean, min and std deviation
df.describe()
# include all option adds all the columns including non-numeric columns with NAN values
df.describe(include='all')
df.shape # returns (16598,11) i.e. 16598 records i.e. rows and 11 columns
df.shape[0] # returns 16598, the row count
df.values # returns 2 dimensional array, were inner array represent the first row in the data set.
</pre>
</div>
<div><br /></div><div><b><span style="font-size: large;">Cleaning Data</span></b></div><div><br /></div>
<div>The isnull() and notnull() functions are used to check for missing values in Pandas. Both functions help in checking whether a value is NaN or not. They can also be used on a Pandas Series in order to find null values in the series.</div>
<pre class="brush: python"># using isnull() function
df.isnull()
</pre>
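<div><br /></div><div>As a complement, below is a minimal sketch using notnull() and chaining isnull() with sum() to count the missing values per column, assuming the same df as above.</div>
<pre class="brush: python"># using notnull() function, True where a value is present
df.notnull()
# count the missing values in each column by chaining isnull() with sum()
df.isnull().sum()
</pre>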
<div>Pandas provides various built-in functions such as fillna(), replace() and interpolate() which allow cleaning the data by replacing all the NaN values, for example with zeros. The fillna() function fills the NA/NaN values with the specified value in a Pandas DataFrame. The replace() method on the other hand replaces the NaN values with zeros by using the <a href="https://numpy.org/devdocs/reference/constants.html?highlight=nan#numpy.nan" target="_blank">numpy.nan</a> property. It replaces the specified value dynamically and differs from the loc or iloc functions, which require specifying a location to update the value. The interpolate() function is also used to fill NA values in the DataFrame, but it uses various interpolation techniques to fill the missing values rather than hard-coding the value (a brief sketch follows the example below).</div>
<pre class="brush: python">dict = {'price': ['100', 'KDL100', 400, 'ADL100'],
'discount': ['50', '50%', '30%', '20']}
df = pd.DataFrame(dict)
# convert string to numeric values using the pd.to_numeric() function
df['price'] = pd.to_numeric(df['price'], errors='coerce')
df['price'] = df['price'].fillna(0)
# replacing the NaN values with zeros by using numpy's nan property
df['price'] = df['price'].replace(np.nan, 0)
# alternatively use the apply() function to convert the entire DataFrame values
df = df.apply(pd.to_numeric, errors='coerce')
# function to replace with zeros in entire dataframe
df = df.fillna(0)
# replace the NaN values with zeros in entire DataFrame
df = df.replace(np.nan, 0)
</pre>
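<div><br /></div><div>Below is a brief sketch of the interpolate() function referenced above, assuming a numeric column with missing values similar to the price column in the previous example; linear interpolation is the default method.</div>
<pre class="brush: python"># fill the NaN values in the numeric price column using linear interpolation
df['price'] = df['price'].interpolate(method='linear')
# interpolation can also fill gaps in both directions, including leading/trailing NaN values
df['price'] = df['price'].interpolate(method='linear', limit_direction='both')
</pre>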
<div><br /></div>
<div>In order to drop empty values from a DataFrame, the dropna() function is used, which removes rows and columns with Null/None/NA values from the DataFrame. It returns a new DataFrame without changing the original DataFrame. The dropna() function takes the arguments axis, how, thresh, subset and inplace. The axis argument value of 0 results in dropping null values from the rows while axis value 1 drops columns with missing values. The how argument determines whether to drop the row/column when any of the values are null (default) or when all the values are missing. The thresh argument specifies the minimum number of non-null values required to keep a row/column. The subset argument takes specific rows/columns to drop null values from.</div>
<pre class="brush: python">series = [('Stranger Things', 3, 'Millie'),
('Game of Thrones', np.nan, 'Emilia'),
('Westworld', pd.NaT, 'Evan Rachel'), ('La Casa De Papel', 4, None)]
dfObj = pd.DataFrame(series, columns=['Name', 'Seasons', 'Actor'])
# drop rows with any missing value which is default behavior
removedNone = dfObj.dropna()
# when axis = 1, drop all columns with any missing value
removedNoneColumns = dfObj.dropna(axis=1)
# drop the rows only if all the values are either None, NaN, or NaT.
removedNoneColumns = dfObj.dropna(how='all')
# when thresh = 2, keep only those rows which have at least 2 non-null values
removedNoneColumns = dfObj.dropna(thresh=2)
# drop the null values only in the subset of defined labels
removeDefinedColumns = dfObj.dropna(subset=['Name', 'Actor'])
</pre>
<div><br /></div>
<div><b><span style="font-size: large;">Find Duplicates</span></b></div><div><br /></div>
<div>The Pandas DataFrame.duplicated() function is used to find duplicate rows based on all columns or some specific columns. The duplicated() function returns a Boolean Series with a True value for each duplicated row. The duplicated() function takes two parameters, subset and keep. The subset parameter, if passed a single column or multiple columns, finds the duplicate rows in the corresponding columns only. The keep parameter denotes which occurrence should be marked as duplicate: 'first' ignores the first occurrence, 'last' ignores the last occurrence, and False marks all occurrences as duplicates.</div>
<div>
<pre class="brush: python">series = [('Stranger Things', 3, 'Millie'),
('Game of Thrones', 8, 'Emilia'), ('La Casa De Papel', 4, 'Sergio'),
('Westworld', 3, 'Evan Rachel'), ('Stranger Things', 3, 'Millie'),
('La Casa De Papel', 4, 'Sergio')]
dfObj = pd.DataFrame(series, columns=['Name', 'Seasons', 'Actor'])
# finds all duplicate rows
duplicateDFRow = dfObj[dfObj.duplicated()]
# finds duplicate rows, ignoring the last duplicate occurrence
duplicateDFRow = dfObj[dfObj.duplicated(keep='last')]
# find duplicates in the Name column only
duplicateDFRow = dfObj[dfObj.duplicated(['Name'])]
# find duplicates in both Name and Seasons columns
duplicateDFRow = dfObj[dfObj.duplicated(['Name', 'Seasons'])]
</pre>
</div>
<div><br /></div>
<div>Pandas also has a drop_duplicates() method which helps in removing duplicates from the DataFrame. It also takes the subset parameter, which takes the column names from which the duplicates are dropped. Similar to the duplicated() function, it takes the keep parameter to determine which duplicate values to keep and the inplace parameter to modify the existing DataFrame (a brief sketch follows the example below).</div>
<div><pre class="brush: python"># remove duplicate values for orderId
df = df[['Order Id', 'Grouped']].drop_duplicates()
</pre>
</div>
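<div><br /></div><div>A minimal sketch of the keep and inplace parameters mentioned above, assuming the dfObj DataFrame from the duplicated() example.</div>
<pre class="brush: python"># keep the last occurrence of each duplicate (based on Name and Seasons) and modify dfObj in place
dfObj.drop_duplicates(subset=['Name', 'Seasons'], keep='last', inplace=True)
</pre>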
<div><br /></div>
<div><b><span style="font-size: large;">Value Count</span></b></div><div><br /></div>
<div>The value_counts() function returns the Series containing counts of unique values in sorted order. The resulting object is in descending order with the first element as the most frequently-occurring element. It excludes NA values by default. The value_counts() function applies only to Series, which can be a Series object or a selected column from the DataFrame. The Series.index.tolist() or Series.to_frame() functions can be used to extract values from resultant Series of value_counts() function for further usage.</div>
<div>
<pre class="brush: python"># read csv file by skipping first 4 lines in the file
df = pd.read_csv('data.csv', skiprows=4)
# find counts of the column city in the dataframe, sorted in descending order by default
df.City.value_counts()
# find counts of the column sport in the dataframe without sorting
df.Sport.value_counts(sort=False)
# find counts and relative frequency by dividing all values by the sum of values
df.Sport.value_counts(normalize=True)
# find counts including the NaN values for Sport column
df.Sport.value_counts(dropna=False)
</pre>
</div>
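<div><br /></div><div>A small sketch of extracting values from the resulting Series of value_counts() using index.tolist() and to_frame(), as mentioned above; it assumes the same df with a City column.</div>
<pre class="brush: python">counts = df.City.value_counts()
# list of the unique city names, ordered by frequency
city_names = counts.index.tolist()
# convert the counts Series into a DataFrame for further processing
counts_df = counts.to_frame()
</pre>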
<div><br /></div>
<div><div><b><span style="font-size: large;">Sorting data in Dataframe</span></b></div><div><br /></div><div><div>There are two methods of sorting in Pandas, namely by label and by actual value.</div><div><br /></div><div>The sort_index() method sorts the DataFrame by taking the axis arguments and the order of sorting. Passing the axis argument with a value 0 or 1, the sorting can be done on the column labels. By default, axis=0, sort by row. By default, sorting is done on row labels in ascending order, which can be changes by passing boolean value to the ascending parameter. </div></div>
<pre class="brush: python">import pandas as pd
import numpy as np
unsorted_df = pd.DataFrame(np.random.randn(10,2), index=[1,4,6,2,3,5,9,8,0,7],columns=['col2','col1'])
# sort by rows in ascending order by default
sorted_df = unsorted_df.sort_index()
# sort by columns
sorted_df = unsorted_df.sort_index(axis=1)
# sort by descending order
sorted_df = unsorted_df.sort_index(ascending=False)
</pre>
<div><br /></div>
<div>The sort_values() method is used for sorting columns by values. It accepts a list of column names of the DataFrame with which the values are to be sorted. The sort_values() method also provides a provision to choose the algorithm from mergesort (stable), heapsort and quicksort.</div>
<div><pre class="brush: python"># Sort column values
df.sort_values('Name')
df.sort_values('Name', ascending=False)
# Sort multiple columns
df.sort_values(['Type 1', 'HP'])
df.sort_values(['Type 1', 'HP'], ascending=False) # Multiple column sorting with descending
df.sort_values(['Type 1', 'HP'], ascending=[1,0]) # Multiple column sorting, with different order for each column
</pre></div>
<div>It is recommended to use <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html" target="_blank">reindex</a> if the row numbers are already used as index and the order of rows is changed by sorting or deleting rows.</div><div><br /></div><div><br /></div>
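<div>For example, a minimal sketch (assuming the df sorted above) of renumbering the rows after sorting is shown below.</div>
<div><pre class="brush: python"># renumber rows 0..n-1 after sorting, dropping the old index
df = df.sort_values('Name').reset_index(drop=True)
# alternatively, reindex explicitly with the row labels in the desired order
df = df.reindex(sorted(df.index))
</pre>
</div>
<div><br /></div>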
<div><b><span style="font-size: large;">Assign New Columns</span></b></div>
<div><br /></div>
<div>New columns can be assigned to a DataFrame using Pandas's assign() method. The assign() method returns the new object with all original columns in addition to new ones. Existing columns that are re-assigned will be overwritten. The length of the newly assigned column must match the number of rows in the DataFrame.</div>
<div><pre class="brush: python">dict = {'price': [520, 500]}
df1 = pd.DataFrame(data=dict)
# assign new column called revised price
df2 = df1.assign(revised_price=lambda x: x.price + x.price * 0.05)
# assign new column using the values of other columns
df2 = df1.assign(revised_price=df1['price'] + df1['price'] * 0.05)
# assigning multiple columns
df2 = df1.assign(revised_price=df1['price'] + df1['price'] * 0.05,
changed_price=df1['price'] + df1['price'] * 0.10)
# add new column with values specified in the list
df1['items'] = ['Apple Watch', 'Air Pod']
# add new column with same default value
df1['category'] = 'Electronics'
</pre></div>
<div><br /></div>
<div><b><span style="font-size: large;">Transpose Rows and Columns</span></b></div><div><br /></div><div>The transpose() method or T attribute in Pandas swaps the rows and columns of the DataFrame. Neither method changes an original object but returns the new transposed object with the rows and columns swapped. The transpose() function transposes index and column, by writing rows as columns and vice-versa.</div>
<div><pre class="brush: python">dt = {
'Stranger Things': ['Mike', 'Eleven'],
'Money Heist': ['Professor', 'Tokyo']
}
df1 = pd.DataFrame(data=dt)
# transpose dataframe using T attribute
transposed_df1 = df1.T
# transpose dataframe using use the transpose() method.
transposed_df2 = df1.transpose()
</pre></div>
<div><br /></div>
<div><b><span style="font-size: large;">Updating the data in Dataframe</span></b></div><div><br /></div><div>Some of the example of updating the DataFrame by adding column values directly or using the sum() function of iloc method.</div>
<div>
<pre class="brush: python"># Adding a new total column by taking sum of values from other columns in a given row
df['Total'] = df['HP'] + df['Attack'] + df['Defense'] + df['Sp.Atk'] + df['Sp.Def']
# another method using iloc function
df['Total'] = df.iloc[:, 4:9].sum(axis=1)
# reorder columns by selecting them in the desired order
df = df[ ['Total', 'HP', 'Attack'] ]
# alternatively, move the last column (Total) to become the 5th column from the left
cols = list(df.columns.values)
# take the first 4 columns, then the last column using a reverse index, then concatenate the remaining columns
df = df[cols[0:4] + [cols[-1]] + cols[4:11]]
</pre></div>
<div>
Pandas provides methods to add new rows, add new columns, delete rows and delete columns as shown below.
<pre class="brush: python"># adding a new Ratings Column with corresponding values to the DataFrame
df['Ratings'] = ['A', 'D', 'C', 'B', 'C']
# adding a new row at the end (the older append() function is deprecated in favour of pd.concat)
data = {'Name': 'Mario Brothers', 'Platform': 'Nintendo', 'HP': 22, 'Attack': 45, 'Defense': 67, 'Total': 132}
df = pd.concat([df, pd.DataFrame(data, index=[4])])
# delete the entire column from the data frame
df = df.drop(columns=['Total'])
# delete specified column from the DataFrame
del df['Total']
# delete the row using an index label from a DataFrame
df = df.drop(2)
</pre>
</div>
<div>The rename() method can be used to rename certain or all columns via a dictionary. Also list comprehension can be leveraged to rename columns as below.</div>
<pre class="brush: python">df.rename(columns={
'Runtime (Minutes)': 'Runtime',
'Revenue (Millions)': 'Revenue in millions'
}, inplace=True)
# use list comprehension for renaming columns to lower case
df.columns = [col.lower() for col in df]
</pre>
<div><br /></div>
<div><b><span style="font-size: large;">Aggregate using Groupby</span></b></div>
<div><br /></div>
<div>The groupby() function takes the name of the column in the data frame which is to be grouped on, and aggregations are then performed on the grouped result. The groupby() function can also take a list of multiple column names, a dict or Pandas Series, a NumPy array or Pandas Index, or an array-like iterable of these.<pre class="brush: python"># find average for all the 'Type 1' column values
df.groupby(['Type 1']).mean()
df.groupby(['Type 1']).sum()
df.groupby(['Type 1']).mean().sort_values('Defense', ascending=False)
df.groupby(['Type 1']).count()
# to accurately count the items
df['count'] = 1
df.groupby(['Type 1', 'Type 2']).count()['count']
</pre>
</div><div><br /></div>
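<div>The agg() method can also be combined with groupby() to apply several aggregations in one pass; the snippet below is a small sketch assuming the same df with 'Type 1', 'HP' and 'Attack' columns.</div>
<div><pre class="brush: python"># apply a different aggregation to each column
df.groupby(['Type 1']).agg({'HP': 'mean', 'Attack': ['min', 'max']})
# apply multiple aggregations to a single grouped column
df.groupby(['Type 1'])['HP'].agg(['mean', 'sum', 'count'])
</pre>
</div>
<div><br /></div>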
<div><b><span style="font-size: large;">Date Time Conversion</span></b></div>
<div><br /></div>
<div>Pandas <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html" target="_blank">to_datetime()</a> method helps us to convert string Date time into Python Date time object. It comes in handy while working on datasets involving time. The to_datetime() function has the following parameters.</div><div><ul style="text-align: left;"><li><b>arg</b>: It is the object which is to be converted to a datetime, which can be int, float, string, datetime, list, tuple, 1D array, Series.</li><li><b>errors</b>: It can have 3 values 1st is ‘ignore’ 2nd ‘raise’ 3rd is ‘coerce’. By default, its value is ‘raise. If the value of this parameter is set to ‘ignore’ then invalid parsing will return the input. If ‘raise’, then invalid parsing will raise the exception. If ‘coerce,’ then the invalid parsing will be set to NaT.</li><li><b>dayfirst</b>: It accepts a Boolean value; it places day first if the value of this parameter is true.</li><li><b>yearfirst</b>: It agrees with a Boolean value; it places year first if the value of this parameter is true.</li><li><b>UTC</b>: It is also a Boolean value, it returns time in UTC format if the value is set to true.</li><li><b>format</b>: It is a string input that tells the position of the day, month, and year.</li><li><b>exact</b>: This is also Boolean value if true requires an exact format match otherwise allows the format to match anywhere in the target string.</li><li><b>unit</b>: It is a string with units of arguments ( D,s, ms, us, ns) denote the unit, which is the integer or float number. By default, it is ‘ns’.</li><li><b>infer_datetime_format</b>: This is a boolean value. When it is true and no format of the data is given, then it attempts to infer the format of the datetime strings. </li><li><b>Origin</b>: It is used to define a reference date.</li><li><b>cache</b>: It uses a cache of unique, converted dates to apply the datetime conversions, when cache is True.</li></ul></div>
<div><br /></div><div>The return value depends on the type of input with DatetimeIndex for list input, Series of datetime64 dtype for Series input and Timestamp for Scalar input. When the date does not meet the timestamp limitations, passing errors=’ignore’ will return an original input instead of raising an exception.</div>
<pre class="brush: python">import pandas as pd
data = pd.DataFrame({'year': [2015, 2016, 2017, 2018, 2019, 2020],
'month': [2, 3, 4, 5, 6, 7],
'day': [4, 5, 6, 7, 8, 9]})
x = pd.to_datetime(data)
</pre>
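<div><br /></div>
<div>The sketch below (with made-up date strings) illustrates how the format, errors and dayfirst parameters described above can be used when parsing strings.</div>
<div><pre class="brush: python">import pandas as pd
s = pd.Series(['2020-01-05', '05/02/2020', 'not a date'])
# parse with an explicit format; values that do not match become NaT instead of raising
pd.to_datetime(s, format='%Y-%m-%d', errors='coerce')
# treat the first number as the day when parsing an ambiguous date like 05/02/2020
pd.to_datetime('05/02/2020', dayfirst=True)
</pre>
</div>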
<div><br /></div>
<div><b><span style="font-size: large;">
Reading large data in Chunks
</span></b></div>
<div><br /></div><div>In some cases the size of the file is too large to fit in the memory of the local machine. In such cases pandas provides the ability to read such large data in chunks. The chunksize parameter is the number of rows to be read into a dataframe at any single time in order to fit into the local memory.<pre class="brush: python"># returns an iterator which yields a dataframe of 5 rows at a time
chunk_iter = pd.read_csv('data.csv', chunksize=5)
for df in pd.read_csv('data.csv', chunksize=5):
    print("CHUNK DATA FRAME")
    print(df)
</pre>
<br />
We can also read the large data in chunks, and use it to group by values based on certain columns in order to get corresponding aggregation results.
<pre class="brush: python"># create a new empty data frame with same columns as the original data frame
new_df = pd.DataFrame(columns=df.columns)
# use groupby to get counts from original data frame and concatenate the result to new empty data frame
for df in pd.read_csv('data.csv', chunksize=5):
    results = df.groupby(['Type 1']).count()
    new_df = pd.concat([new_df, results])
</pre>
</div>
<div><br /></div>
<div><b><span style="font-size: large;">Saving data into files</span></b></div><div><br /></div><div>Similar to the ways data is read, pandas provides intuitive commands to save it. For each file format i.e. JSON, Excel and CSV files, the corresponding functions take the desired filename with the appropriate file extension as below.</div>
<div>
<pre class="brush: python">df.to_csv('modified.csv')
# to ignore/remove the index column from the output csv file
df.to_csv('modified.csv', index=False)
df.to_excel('modified.xlsx', index=False)
df.to_csv('modified.txt', index=False, sep='\t')
df.to_json('E:/datasets/data.json')
</pre>
</div><div><br /></div><div><b><span style="font-size: large;">Inserting Dataframe into Database</span></b></div><div><br /></div><div>The to_sql() function enables to <a href="https://www.dataquest.io/blog/sql-insert-tutorial/" target="_blank">insert a DataFrame</a> into an SQL database. The to_sql() is replacement for the deprecated <a href="https://pandas.pydata.org/pandas-docs/version/0.15.2/generated/pandas.io.sql.write_frame.html" target="_blank">write_frame()</a>. The to_sql() function takes the parameters table name, engine name, if_exists, and chunksize. The chunksize writes records in batches of a given size at a time. By default, all rows will be written at once. </div><div>The argument if_exists tells pandas how to deal if the table already exists. When if_exists=fail and if table exists, then to_sql() does nothing. When if_exists=replace and if table exists then to_sql() drops the existing table, recreate it, and then inserts data. When if_exists=append and if table exists then to_sql() inserts data into the existing table. If the table does not exists, to_sql() function creates a new table.</div><div><pre class="brush: python">from sqlalchemy import create_engine
# create sqlalchemy engine
engine = create_engine("mysql+pymysql://{user}:{pw}@localhost/{db}"
.format(user="root",pw="password", db="db_name"))
df.to_sql('table_name', con = engine, if_exists = 'append', chunksize = 1000)
</pre>
</div>
<div><br /></div></div><div><b><span style="font-size: large;">Plotting data</span></b></div><div><br /></div><div><div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-Eq6p-HsQqBk/X1u0nMUe31I/AAAAAAAAh70/yejnlLahL9sZdBhp3WLkmCnsbTFKdyXbQCLcBGAsYHQ/s1080/1_CTSJwPplO5GkC80r05OMIw.png" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" data-original-height="694" data-original-width="1080" src="https://1.bp.blogspot.com/-Eq6p-HsQqBk/X1u0nMUe31I/AAAAAAAAh70/yejnlLahL9sZdBhp3WLkmCnsbTFKdyXbQCLcBGAsYHQ/s320/1_CTSJwPplO5GkC80r05OMIw.png" width="320" /></a></div>Pandas has a built in plot() function as part of the DataFrame class which is useful to just get a quick visualization of the data in Dataframe. The plot function uses matplotlib for visualizations. It has several key parameters:</div><div><br /></div><div><b>kind</b> — accepts ‘bar’,’barh’,’pie’,’scatter’,’kde’ etc which can be found in the <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html" target="_blank">docs</a>.</div><div><b>color</b> — accepts an array of hex codes corresponding sequential to each data series / column.</div><div><b>linestyle</b> — accepts ‘solid’, ‘dotted’, ‘dashed’ (applies to line graphs only)</div><div><b>xlim, ylim</b> — specify a tuple (lower limit, upper limit) for which the plot will be drawn</div><div><b>legend</b>— a boolean value to display or hide the legend</div><div><b>labels</b> — a list corresponding to the number of columns in the dataframe, a descriptive name can be provided here for the legend</div><div><b>title</b> — The string title of the plot</div>
<div>
<br />
<pre class="brush: python">df.plot(x='age', y='fare', kind='scatter')</pre></div><div><br /></div><div>Pandas also has DataFrame hist() method is a wrapper for the matplotlib pyplot API. The hist() method is called on each Series in the DataFrame, resulting in one histogram per column.</div>
<pre class="brush: python">import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({
'length': [2.5, 3.6, 4.6, 4.8, 5.0],
'width': [2.7, 3.7, 6.4, 0.22, 4.7]
})
# histogram on length column
df.hist(bins=3, column="length")
# alternatively use the plot function to plot histogram for column length
df[['length']].plot(kind='hist',bins=3,rwidth=0.8)
plt.show()
</pre>
</div>
<div><br /></div><div><b><span style="font-size: large;">Pivot Table</span></b></div><div><br /></div><div><a href="https://pbpython.com/pandas-pivot-table-explained.html" target="_blank">Pivot table</a> is a table of statistics that helps summarize the data of a larger table by “pivoting” that data. The pivot table takes simple column-wise data as input, and groups the entries into a two-dimensional table that provides a multidimensional summarization of the data. A pivot table requires a Pandas dataframe and an at least one index parameter. Index is the feature that allows to group the data in dataframe and appear as an index in the resultant table. We can use more than one feature as an index to group the data. <a href="https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html" target="_blank">Multiple index</a> values can also be passed as a list, creating further groupings. The pivot table function also takes values parameter which is the list of the columns of the dataframe to be aggregated. If the values parameter is blank then all the numerical values of all columns will be aggregated. The index, columns and values parameters takes the column name in the original table as a value. The pivot function creates a new table, whose row and column indices are the unique values of the respective parameters. The cell values of the new table are taken from column given as the values parameter.</div>
<pre class="brush: python">import pandas as pd
# pivot table function
pd.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None,
margins=False, dropna=True, margins_name='All', observed=False)
# basic pivot table with one index
pd.pivot_table(df,index=["Name"])
# pivot table with multiple index values
pd.pivot_table(df,index=["Name","Rep","Manager"])
pd.pivot_table(df,index=["Manager","Rep"],values=["Price"])
</pre>
<div>The aggregation function is the aggfunc parameter which can take a list of functions such as numpy mean function, len function to get total count, numpy sum function to get the sum of all values etc. The specified list of aggregation functions are applied to the values specified.</div>
<pre class="brush: python">pd.pivot_table(df,index=["Manager","Rep"],values=["Price"],aggfunc=np.sum)
pd.pivot_table(df,index=["Manager","Rep"],values=["Price"],aggfunc=[np.mean,len])
</pre>
<div>The fill_value parameter is used to specify the value to be used to replace missing values i.e. NaN values within the data.</div>
<pre class="brush: python">pd.pivot_table(df,index=["Manager","Rep"],values=["Price"], columns=["Product"],aggfunc=[np.sum],fill_value=0)
</pre>
<div>The margins=True parameter adds a row and a column for totals. The dropna=True parameter can be used to exclude columns where all entries are NaN. The margins_name parameter provides the name for the total rows/columns.</div>
<pre class="brush: python">pd.pivot_table(df, index=["Manager","Rep","Product"], values=["Price","Quantity"],
aggfunc=[np.sum,np.mean],fill_value=0,margins=True)</pre>
<div>Adding columns to a pivot table in Pandas can add another dimension to the tables. The Columns parameter enables to add a key (column) to aggregate by as below. The columns parameter is optional and displays the values horizontally on the top of the resultant table.</div>
<pre class="brush: python">p_table = pd.pivot_table(df, index = 'Type', columns = 'Region', values = 'Units', aggfunc = 'sum')
</pre>
<div>The pivot table can use the standard data frame functions such as <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html" target="_blank">query</a> function to filter the data.</div>
<pre class="brush: python">p_table.query('Manager == ["Debra Henley"]')
p_table.query('Status == ["pending","won"]')
</pre>
<div><br /></div>
<div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-8ngd-1Y2sm8/X6EAuGRaCyI/AAAAAAAAiAU/WlkErA2I9gEeG-iEmdvSHVlxA6UdizK6QCLcBGAsYHQ/s767/rsz_1rsz_pivot-table-datasheet.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="767" data-original-width="643" src="https://1.bp.blogspot.com/-8ngd-1Y2sm8/X6EAuGRaCyI/AAAAAAAAiAU/WlkErA2I9gEeG-iEmdvSHVlxA6UdizK6QCLcBGAsYHQ/s16000/rsz_1rsz_pivot-table-datasheet.png" /></a></div><br /><div><br /></div><div><b><span style="font-size: large;">Conclusion </span></b></div><div><br /></div><div>In conclusion, Pandas has many uses and enables for data by cleaning, transforming, and analyzing it. It helps to calculate statistics and answer questions about the data, like the average, median, max, or min of each column. It enables to correlate data in one column with another column, and determine the distribution of data in column. Data can be cleaned efficiently by removing missing values and filtering rows or columns by some criteria. Pandas helps to visualize the data with help from Matplotlib by Plotting bar chart, line chart, histograms and much more. Pandas can connect with most of the SQL databases and file systems to either fetch or store data.</div><div><br /></div>Pranav Patilhttp://www.blogger.com/profile/10125896717744035325noreply@blogger.com0tag:blogger.com,1999:blog-7610061084095478663.post-17497098762657057062020-09-22T22:45:00.001-07:002020-09-22T22:47:54.890-07:00NumPy - Python Library for Numerical Computing<div><div>NumPy is primarily used to store and process multi-dimensional array. NumPy is preferred instead of Python List because its performance is better while working on large arrays. NumPy uses fixed (data) types and hence there is no type checking when iterating through objects. NumPy also uses less memory bytes to represent the array in memory and utilizes a contiguous memory, which makes it more efficient to access and process large arrays. NumPy allows insertion, deletion, appending and concatenation, similar to the Python List, but also provides a lot more additional functionality. For an example, NumPy allows to multiple each element of two arrays using a*b were a, b are arrays. NumPy array allows SIMD Vector Processing and Effective Cache Utilization. NumPy is used as a replacement for MatLab, plotting with Matplotlib, images storage and machine learning. NumPy also forms the backend core component for Pandas library.</div></div><div><br /></div>
<div>NumPy is installed using "<b>pip install numpy</b>" command.</div><div><br /></div><div><b><span style="font-size: large;">NumPy Data Types</span></b></div><div><br /></div><div><div>NumPy has additional data types compared to the regular Python data types, i.e. strings, integer, float, boolean, complex. The data types is referred using one character, like i for integers, u for unsigned integers etc. Below is a list of all data types in NumPy and the characters used to represent them.</div><div><ul style="text-align: left;"><li>i - integer</li><li>b - boolean</li><li>u - unsigned integer</li><li>f - float</li><li>c - complex float</li><li>m - timedelta</li><li>M - datetime</li><li>O - object</li><li>S - string</li><li>U - unicode string</li><li>V - fixed chunk of memory for other type ( void )</li></ul></div></div><div><b><span style="font-size: large;">Arrays</span></b></div><div><br /></div><div><div>NumPy allows to represent multi-dimensional arrays compared to the built in array module of python which only supports single dimensional arrays. A NumPy array is a grid of values, all of the same type, and is indexed by a tuple of nonnegative integers. The array object in NumPy is called ndarray, it provides a lot of supporting functions that make working with ndarray very easy. Array can be created by passing a list, tuple or any array-like object into the array() method. Methods of creating arrays are array(), linspace(), logspace(), arange(), zeros(), ones(). Arrays can be initialized using nested python lists, and elements can be accessed using square brackets.<br />
Once the array is created, it has many attributes which describe the NumPy array. The shape of an array is the number of elements in each dimension. The shape attribute is represented by a tuple with each index having the number of corresponding elements, i.e. the size of the array along each dimension. The ndim attribute provides the number of dimensions i.e. the rank of the array.</div></div><div><pre class="brush: python">from numpy import *
# Create a rank 1 array i.e. single dimensional array
arr1 = array([1, 2, 4, 5, 6])
# Passing a type while creating an array
arr2 = array([6, 8, 9, 4, 4], int)
arr3 = array([[9.0,8.0,7.0],[6.0,5.0,4.0]])
print(arr1.shape) # Prints "(5,)", means array has 1 dimension which has 5 elements.
print(arr3.shape) # Prints "(2, 3)", means array has 2 dimensions, each having 3 elements.
print(type(arr1)) # Prints "<class 'numpy.ndarray'>"
print(arr1.ndim) # Prints number of dimensions in the array
arr1[0] = 5 # Change an element of the array
</pre></div>
<div><br /></div>
<div><b><span style="font-size: medium;">Functions to Create Arrays</span></b></div><div><br /></div><div>
Numpy also provides many functions to create arrays such as zeros(), ones(), full(), random() which initialize the new array with zeros, ones, other numbers, random values respectively.
<pre class="brush: python"># Create an array/matrix of all zeros
np.zeros((2,3)) # 2-Dimensional matrix with all zeros
np.zeros((2,3,3)) # 3-Dimensional matrix with all zeros
np.zeros((2,3,3,2)) # 4-Dimensional matrix with all zeros
# Create an array/matrix of all ones
np.ones((4,2,2))
np.ones((4,2,2), dtype='int32')
# Create an array/matrix with any other constant value
np.full((2,2), 99) # 2-Dimensional matrix with all 99 number
np.full((2,2), 99, dtype='float32') # float numbers
# Any other number matrix with full-like shape method
d = np.array([[1,2,3,4,5,6,7],[8,9,10,11,12,13,14]])
np.full_like(d, 4) # returns array([[4,4,4,4,4,4,4],[4,4,4,4,4,4,4]]) where all values are 4
# alternatively we can use
np.full(d.shape, 4)
# Initialize a matrix of random decimal numbers
np.random.rand(4,2,3)
np.random.random_sample(d.shape)
# Initialize a matrix of random integer numbers, with range 0 to 7
np.random.randint(7, size=(3,3))
# Initialize random integer numbers, with range -4 to 8
np.random.randint(-4,8, size=(3,3))
# Identity matrix
np.identity(3) # returns array([[1, 0, 0],
# [0, 1, 0],
# [0, 0, 1]])
arr1 = np.array([1,2,3])
# repeat the array
r1 = np.repeat(arr1, 3, axis=0) # returns [1 1 1 2 2 2 3 3 3]
arr2 = np.array([[1,2,3]])
r2 = np.repeat(arr2, 3, axis=0) # returns [[1 2 3]
# [1 2 3]
# [1 2 3]]
</pre>
</div>
<div><br /></div>
<div>
The <a href="https://realpython.com/how-to-use-numpy-arange/" target="_blank">arange() method</a> allows to create an array based on numerical ranges. It creates an instance of ndarray with evenly spaced values and returns the reference to it. It takes the start number which defines the first value of the array, and the stop value which defines the end of the array and which isn't included in the array. arrange() method also takes the step argument which defines spacing between two consecutive values, and the dtype which is the type of elements of the output array.
<pre class="brush: python">import numpy as np
arr = np.arange(start=1, stop=10, step=3)
print(arr) # array([1, 4, 7])
arr = np.arange(start=1, stop=10)
print(arr) # array([1, 2, 3, 4, 5, 6, 7, 8, 9])
# starts array from zero and increment each step by one
arr = np.arange(5)
print(arr) # array([0, 1, 2, 3, 4])
arr = np.arange(5, 1, -1) # counting backwards
print(arr) # array([5, 4, 3, 2])
</pre>
</div>
<div><br /></div>
<div><b><span style="font-size: medium;">Datatypes</span></b></div><div><br /></div><div>
Every numpy array is a grid of elements of the same type. Numpy provides a large set of numeric datatypes as discussed above that can be used to construct arrays. Numpy tries to guess a datatype when we create an array. The array() function also provides an optional argument '<b>dtype</b>' to explicitly specify the datatype of the elements. The NumPy array object has a property called dtype that returns the data type of the array.<pre class="brush: python">import numpy as np
a = np.array([1,2,3])
print(a.dtype) # Prints "int64" i.e. datatype of the array
c = np.array([1,2,3], dtype='S') # Create array with elements as string data type
c = np.array([1,2,3], dtype='i4') # Create array with elements as integer with 4 bytes
c = np.array([1,2,3], dtype='int16') # Create array with elements as integer with 2 bytes
# Get Size
print(a.itemsize) # prints 8 for int64 element size
print(c.itemsize) # prints 2 for int16 element size
# Get total size
a.size * a.itemsize # 1st method
a.nbytes # 2nd method
</pre>
</div>
<div><br /></div>
<div>
<b><span style="font-size: medium;">Iterating Arrays</span></b></div>
<div><br /></div>
<div>
Arrays are iterated using regular for loops regardless of their dimensions. NumPy also provides a special nditer() function which helps with everything from very basic to very advanced iterations. It enables changing the datatype of elements while iterating by using the op_dtypes argument and passing it the expected datatype. An additional argument flags=['buffered'] is passed to provide extra buffer space, as the data-type change does not occur in place. nditer() also supports filtering and changing the step size. To enumerate the sequence numbers of the elements of the array during iteration, a special ndenumerate() method can be used.
<pre class="brush: python">import numpy as np
arr = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
for x in arr:
    for y in x:
        for z in y:
            print(z)
# iterating using nditer() function
for x in np.nditer(arr):
    print(x)
for idx, x in np.ndenumerate(arr):
    print(idx, x)
</pre>
</div>
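<div><br /></div>
<div>A small sketch of the op_dtypes, buffered flag and step-based iteration mentioned above (the string dtype 'S' is just one possible choice):</div>
<div><pre class="brush: python">import numpy as np
arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
# change the element type to string while iterating; buffering is required for the conversion
for x in np.nditer(arr, flags=['buffered'], op_dtypes=['S']):
    print(x)
# change the step size by slicing the array before iterating (every other column here)
for x in np.nditer(arr[:, ::2]):
    print(x)
</pre>
</div>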
<div><br /></div>
<div><b><span style="font-size: medium;">Array Math</span></b></div><div><br /></div><div>
Basic mathematical functions operate elementwise on arrays, and are available both as operator overloads and as functions in the numpy module. NumPy allows <a href="https://numpy.org/doc/stable/reference/routines.math.html" target="_blank">arithmetic operations</a> on each element of the array.
<pre class="brush: python">arr1 = array{[1, 2, 4, 5, 6]}
arr2 = array{[6, 8, 9, 4, 4]}
arr1 = arr1 + 5 # Add 5 to all elements of an array
arr1 += 2 # Add 2 to all elements of an array
arr1 ** 2 # Multiply 2 to all elements of an array
# Add elements of two arrays in order. Also called as Vector Operations
arr3 = arr1 + arr2 # returns array { 7, 10, 13, 9, 10 }
print(sqrt(arr1)) # Find square root of each element of the array
print(sin(arr1)) # Find sin value of each element of the array
# Find sum of the array
print(sum(arr1))
print(sort(arr1))
</pre>
</div>
<div><br /></div>
<div><b><span style="font-size: medium;">Copy / Clone Array</span></b></div><div><br /></div><div>NumPy allows to copy/clone arrays using the view() function for shallow copy, and the copy() function for deep copy of the array. The copy() function creates a new array while the view() function creates just a view of the original array. The copy owns the data and any changes made to the copy will not affect original array, and any changes made to the original array will not affect the copy. The view does not own the data and any changes made to the view will affect the original array, and any changes made to the original array will affect the view.
<pre class="brush: python">a = np.array([1,2,3,4])
# Copy an array arr1 to arr2. The address of both the arrays is same, as both arr1 and arr2 are pointing to same array.
arr2 = arr1
# Clone the array into another array. But its a shallow copy, were elements are still having same address
arr2 = arr1.view()
# Clone the array into another array using deep copy
arr2 = arr1.copy()
# since array arr2 is copy of array arr1, changing arr2 will not change any elements in arr1
arr2[0] = 100
</pre>
</div>
<div><br /></div>
<div>
The data type of the existing array can be changed only by making a copy of the array using the astype() method. The astype() function creates a copy of the array, and allows to specify the data type using a string like 'f' for float, 'i' for integer etc as a parameter.
<pre class="brush: python">import numpy as np
arr = np.array([1.1, 2.1, 3.1])
newarr1 = arr.astype('i')
newarr2 = arr.astype('int32')
</pre>
</div>
<div><br /></div>
<div><b><span style="font-size: medium;">Reorganizing Arrays</span></b></div><div><br /></div><div>
The shape of an array is the number of elements in each dimension. NumPy allows to reshape an existing array by allowing to add or remove dimensions or change number of elements in each dimension. NumPy allows to flatten the array i.e. convert a multidimensional array into a 1D array using flatten() function. Alternatively reshape(-1) can also be used to flatten the array. Further Numpy's vstack() function is used to stack the sequence of input arrays vertically to make a single array.
<pre class="brush: python">from numpy import *
arr1 = array([
[1,2,3],
[4,5,6]
])
arr1 = arr1.flatten() # flatten from multi dimensional to single dimensional array
print(arr1) # array([1, 2, 3, 4, 5, 6])
# reshape single dimensional array to multi dimensional array
arr = array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
newarr = arr.reshape(2, 3, 2) # The outermost dimension has 2 arrays which contain 3 arrays, each with 2 elements
print(newarr) # array([[[ 1  2], [ 3  4], [ 5  6]], [[ 7  8], [ 9 10], [11 12]]])
before = np.array([[1,2,3,4], [5,6,7,8]])
after = before.reshape((4, 2)) # returns [[1 2]
# [3 4]
# [5 6]
# [7 8]]
# Vertically stacking vectors
v1 = np.array([1,2,3,4])
v2 = np.array([5,6,7,8])
np.vstack([v1,v2,v1,v2]) # returns array([[1,2,3,4]
# [5,6,7,8]
# [1,2,3,4]
# [5,6,7,8]])
# Horizontal stacking vectors
h1 = np.ones((2,4))
h2 = np.zeros((2,2))
np.hstack([h1,h2]) # returns array([[1, 1, 1, 1, 0, 0],
# [1, 1, 1, 1, 0, 0]])
</pre>
</div>
<div><br /></div>
<div><b><span style="font-size: medium;">Concatenate Arrays</span></b></div><div><br /></div><div>
The <a href="https://www.sharpsightlabs.com/blog/numpy-concatenate/" target="_blank">concatenate()</a> function is used to join the arrays which are joined based on the axis to concatenate along. The arrays are passed to the concatenate() function are as a tuple, which can alternatively be also passed as a Python List. The arrays passed to concatenate() function requires to be of the same data type. Arrays in NumPy have axes which are directions, e.g. axis 0 is the direction running vertically down the rows and axis 1 is the direction running horizontally across the columns. The concatenate() function can operate both vertically and horizontally based on the axis argument specified. If we set axis = 0, the concatenate function will concatenate the NumPy arrays vertically which is also the default behavior if no axis is specified. On the other hand, if we manually set axis = 1, the concatenate function will concatenate the NumPy arrays horizontally.</div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-iBbkF34KNyQ/X2jCymKj_hI/AAAAAAAAh8w/AFbxfKd-7fI7QZuX9-YdwI6DsBSzCHhywCLcBGAsYHQ/s1254/explanation_numpy-concatenate-axis-0.png" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><br /></a><a href="https://1.bp.blogspot.com/-iBbkF34KNyQ/X2jCymKj_hI/AAAAAAAAh8w/AFbxfKd-7fI7QZuX9-YdwI6DsBSzCHhywCLcBGAsYHQ/s1254/explanation_numpy-concatenate-axis-0.png" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="756" data-original-width="1254" height="241" src="https://1.bp.blogspot.com/-iBbkF34KNyQ/X2jCymKj_hI/AAAAAAAAh8w/AFbxfKd-7fI7QZuX9-YdwI6DsBSzCHhywCLcBGAsYHQ/w400-h241/explanation_numpy-concatenate-axis-0.png" width="400" /></a><a href="https://1.bp.blogspot.com/-_68_9F-lEb0/X2jDhqkWZgI/AAAAAAAAh9E/iZf3IYJHvLEz0nE59ES71257Y0iqfJX4ACLcBGAsYHQ/s1058/explanation_numpy-concatenate-axis-1.png" style="clear: right; display: inline; margin-bottom: 1em; margin-left: 1em;"><img border="0" data-original-height="782" data-original-width="1058" height="296" src="https://1.bp.blogspot.com/-_68_9F-lEb0/X2jDhqkWZgI/AAAAAAAAh9E/iZf3IYJHvLEz0nE59ES71257Y0iqfJX4ACLcBGAsYHQ/w400-h296/explanation_numpy-concatenate-axis-1.png" width="400" /></a></div>
<div>
<pre class="brush: python">import numpy as np
arr1 = np.array([[1, 2], [3, 4]])
arr2 = np.array([[5, 6], [7, 8]])
arr = np.concatenate((arr1, arr2), axis=0)
print(arr) # array([[1,2], [3,4], [5,6], [7,8]])
arr = np.concatenate((arr1, arr2), axis=1)
print(arr) # array([[1,2,5,6], [3,4,7,8]])
</pre>
</div>
<div><br /></div>
<div><b><span style="font-size: medium;">Stacking</span></b></div><div><br /></div><div>Stacking is similar as concatenation, the only difference is that stacking is done along a new axis. A sequence of arrays to be joined are passed to the stack() method along with the axis. If axis is not explicitly passed it is taken as 0. NumPy provides helper functions such as hstack() to stack along rows, vstack() to stack along columns and dstack() to stack along height i.e. depth.
<pre class="brush: python">import numpy as np
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
arr = np.stack((arr1, arr2), axis=1)
print(arr)
arr = np.hstack((arr1, arr2))
print(arr)
arr = np.vstack((arr1, arr2))
print(arr)
arr = np.dstack((arr1, arr2))
print(arr)
</pre>
</div>
<div><br /></div>
<div><b><span style="font-size: medium;">Splitting Array</span></b></div>
<div><br /></div>
<div>The array_split() function takes an array and number of split as arguments to break the array into multiple parts.
<pre class="brush: python">import numpy as np
arr = np.array([1, 2, 3, 4, 5, 6])
newarr = np.array_split(arr, 4)
print(newarr)
arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15], [16, 17, 18]])
newarr = np.array_split(arr, 3)
print(newarr)
newarr = np.array_split(arr, 3, axis=1)
print(newarr)
</pre>
</div>
<div><br /></div>
<div><b><span style="font-size: medium;">Searching Arrays</span></b></div>
<div><br /></div>
<div>The where() method allows to search an array for a certain value, and return the indexes for the matched elements. Another method called searchsorted() performs a binary search in the array, and returns the index where the specified value would be inserted to maintain the search order. The searchsorted() method starts the search from the left by default and returns the first index where the argument number is no longer larger than the next value. By specifying the argument side='right' it allows to return the right most index instead.
<pre class="brush: python">import numpy as np
arr = np.array([1, 2, 3, 4, 5, 4, 4])
x = np.where(arr == 4)
print(x)
x = np.where(arr%2 == 0)
print(x)
arr = np.array([1, 3, 5, 7])
x = np.searchsorted(arr, 3)
print(x)
x = np.searchsorted(arr, [2, 4, 6])
print(x)
</pre>
</div>
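<div><br /></div>
<div>A minimal sketch of the side='right' behaviour described above, using a small made-up array:</div>
<div><pre class="brush: python">import numpy as np
arr = np.array([6, 7, 8, 9])
x = np.searchsorted(arr, 7) # searches from the left, returns 1
x = np.searchsorted(arr, 7, side='right') # returns the right most insertion index instead, i.e. 2
</pre>
</div>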
<div><br /></div>
<div>
NumPy's all() method tests whether all the array elements along a given axis evaluate to True. The any() method tests whether any array element along a given axis evaluates to True. In other words, the numpy.any() method returns True if at least one element in an array evaluates to True, while the numpy.all() method returns True only if all elements in a NumPy array evaluate to True. NumPy also allows conditional operators to check a condition on each element of the array, and allows combining the resulting boolean arrays with 'and'/'or' to get a cumulative result.
<pre class="brush: python">import numpy as np
arry = np.array([1, 2, 73, 4, 5, 89, 54, 34, 102])
mat = arry.reshape(3, 3)    # 2-D view of the same data, used for the axis based checks below
# Find the columns in which any value is greater than 50
x = np.any(mat > 50, axis=0)
print(x) # [ True False  True]
# Find the columns in which all the values are greater than 50
x = np.all(mat > 50, axis=0)
print(x) # [False False  True]
# Find the rows in which all the values are greater than 50
x = np.all(mat > 50, axis=1)
print(x) # [False False False]
# Find values in array greater than 50 and less than 100
x = ((arry > 50) & (arry < 100))
print(x) # [False, False, True, False, False, True, True, False, False]
# negation of above condition
x = (~((arry > 50) & (arry < 100)))
print(x) # [True, True, False, True, True, False, False, True, True]
</pre>
</div>
<div><br /></div>
<div><b><span style="font-size: medium;">Sorting Arrays</span></b></div>
<div><br /></div>
<div>Arrays can be sorted in numeric or alphabetical order with ascending or descending order.
<pre class="brush: python">import numpy as np
arr = np.array([[3, 2, 4], [5, 0, 1]])
print(np.sort(arr))
arr = np.array(['banana', 'cherry', 'apple'])
print(np.sort(arr))
</pre>
</div>
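<div><br /></div>
<div>np.sort() always returns an ascending copy; as a small sketch, a descending order can be obtained by reversing the sorted result.</div>
<div><pre class="brush: python">import numpy as np
arr = np.array([3, 2, 0, 1])
print(np.sort(arr)[::-1]) # [3 2 1 0]
arr2 = np.array([[3, 2, 4], [5, 0, 1]])
# sort each row, then reverse along the last axis for a row-wise descending order
print(np.sort(arr2)[:, ::-1]) # [[4 3 2] [5 1 0]]
</pre>
</div>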
<div><br /></div>
<div><b><span style="font-size: medium;">Filtering Arrays</span></b></div>
<div><br /></div>
<div>NumPy allows to filter an array using a list of booleans corresponding to indexes in the array. If the value at an index is True that element is contained in the filtered array, otherwise when False it is excluded from the filtered array. The filtered array can be created by hardcoding True/False values, or using a filter variable as a substitute for the filter array.
<pre class="brush: python">import numpy as np
arr = np.array([41, 42, 43, 44])
x = [True, False, True, False]
newarr = arr[x]
print(newarr)
filter_arr = arr > 42
newarr = arr[filter_arr]
print(newarr)
</pre>
</div>
<div><br /></div>
<div><b><span style="font-size: medium;">NumPy ufuncs</span></b></div>
<div><br /></div>
<div>Computation on NumPy arrays is enhanced when using vectorized operations, generally implemented through NumPy's universal functions (ufuncs). A vectorized operation simply performs an operation on the array, which is then applied to each element. Such a vectorized approach is designed to push the loop that processes each array element into the compiled layer of NumPy, which leads to much faster execution. Vectorized operations in NumPy are implemented via ufuncs, whose main purpose is to quickly execute repeated operations on values in NumPy arrays. Computations using vectorization through ufuncs are nearly always more efficient than their counterparts implemented using Python loops, especially as the arrays grow in size.
<br /><br />Ufuncs exist in two flavors, unary ufuncs which operate on a single input, and binary ufuncs which operate on two inputs. ufuncs usually take array under operation along with additional arguments such as '<b>where</b>' which is a boolean array or condition, '<b>dtype</b>' which defines the return type of elements and '<b>out</b>' which is the output array where the return value could be copied. Below are wide range of examples of ufuncs.
<pre class="brush: python"># Array Arithmetic Operations
x = np.arange(4)
print("x =", x)
print("x + 5 =", x + 5)
print("x - 5 =", x - 5)
print("x * 2 =", x * 2)
print("x / 2 =", x / 2)
print("x // 2 =", x // 2) # floor division
print("-x = ", -x)
print("x ** 2 = ", x ** 2)
print("x % 2 = ", x % 2)
# Absolute Value
x = np.array([-2, -1, 0, 1, 2])
abs(x)
# Trigonometric functions
theta = np.linspace(0, np.pi, 3)
print("theta = ", theta)
print("sin(theta) = ", np.sin(theta))
print("cos(theta) = ", np.cos(theta))
print("tan(theta) = ", np.tan(theta))
# Exponents and Logarithms
x = [1, 2, 3]
print("x =", x)
print("e^x =", np.exp(x))
print("2^x =", np.exp2(x))
print("3^x =", np.power(3, x))
print("ln(x) =", np.log(x))
print("log2(x) =", np.log2(x))
print("log10(x) =", np.log10(x))
# Aggregates
x = np.arange(1, 6)
np.add.reduce(x) # reduce repeatedly applies a given operation to the elements of an array until only a single result remains.
np.multiply.reduce(x)
np.add.accumulate(x) # stores all the intermediate results of the computation
</pre>
</div>
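<div><br /></div>
<div>The out and where arguments mentioned above can be used roughly as in the sketch below (the array values are made up for illustration).</div>
<div><pre class="brush: python">import numpy as np
x = np.arange(5)
y = np.empty(5)
# write the result of the multiplication directly into the pre-allocated array y
np.multiply(x, 10, out=y)
print(y) # [ 0. 10. 20. 30. 40.]
# compute the division only where the condition holds, leaving the other slots untouched
out = np.zeros(5)
np.divide(x, 2, out=out, where=(x % 2 == 0))
print(out) # [0. 0. 1. 0. 2.]
</pre>
</div>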
<div><br /></div>
<div><b><span style="font-size: medium;">
Boolean Masking and Advanced Indexing
</span></b><br /><br /></div><div>Masking in python and data science is when you want to manipulate data in a collection based on some criteria. Masking allows extracting, modifying, counting, or otherwise manipulating values in an array based on some criterion. NumPy enables boolean masking to create a special type of array called a Masked Array. A masked array is created by applying a scalar condition (conditional operator) to a NumPy array.<pre class="brush: python"># Load Data from File containing (1,2,73,4,5,89...)
filedata = np.genfromtxt('data.txt', delimiter=',')
filedata > 50
# returns array([[False, False, True, False, False, True]])
# Index using the condition, i.e. grab the value if its greater than 50
filedata[filedata > 50]
# returns array([73, 89])
# We can pass index as a list to fetch values in NumPy
g = np.array([1,2,3,4,5,6,7,8,9])
g[[1,2,8]] # returns array([2, 3, 9])
# Pass multiple array index for each dimension
k = np.array([[1,2,3], [4,5,6], [7,8,9]])
k[[1,2],[2,2]]
# returns array([6, 9])
# another example combining index lists with slicing
k[[1,2], 1:] # returns array([[5, 6], [8, 9]])
</pre>
</div>
<div><br /></div>
<div><b><span style="font-size: medium;">Matrix</span></b></div><div><br /></div><div>
Matrix module contains the functions which return matrices instead of arrays. It contains functions which represent arrays in matrix format. A matrix is a specialized 2-D array in NumPy that retains its 2-D nature through operations. It has certain special operators, such as * (matrix multiplication) and ** (matrix power). The matrix() function returns a matrix from an array of objects or from a string of data. The asmatrix() function interprets the input as a matrix. The methods zeros(), ones(), rand(), identity() and eye() return a matrix with zeros, ones, random values, a square identity matrix and a matrix with ones on the diagonal respectively.</div><div><pre class="brush: python">arr1 = array([[1,2,3], [4,5,6], [7,8,9]])
m1 = matrix(arr1) # convert 2D array to a matrix
m2 = matrix('1 3 6; 4 6 7; 5 8 9') # create matrix from a string, rows separated by semicolons
m2.diagonal() # returns diagonal elements in the matrix
m3 = m1 * m2 # multiply matrices
</pre>
</div>
<div><br /></div>
<div><b><span style="font-size: medium;">Slicing</span></b></div><div><br /></div><div>
Slicing in python means taking elements from one given index to another given index. Similar to Python lists, numpy arrays can be sliced. Since arrays may be multidimensional, you must specify a slice for each dimension of the array. A slice is specified in place of an index like [start:end:step]. The default value for start is 0, end is the length of the array in that dimension and step is 1. Negative slicing is achieved by using the minus operator to refer to an index from the end.<pre class="brush: python">d = np.array([[1,2,3,4,5,6,7],[8,9,10,11,12,13,14]])
# Get a specific element [r, c]
d[1, 5] # returns 13
d[1, -2] # returns 13
# Get a specific row
d[0, :]
# Get a specific column
d[:, 2]
# Getting [startindex:endindex:stepsize]
d[0, 1:6:2] # get elements between 2nd and 7th with alternate numbers
d[0, 1:-1:2]
# Change value of the element
d[1,5] = 20
# Set value of 3rd column to 5
d[:,2] = 5
d[:,2] = [1,2]
e = np.array([[[1,2],[3,4]],[[5,6],[7,8]]])
# Get specific element (work outside in)
e[0,1,1] # returns 4
e[:,1,:] # returns array([[3,4], [7,8]])
# Replace values in 3-dimensional array
e[:,1,:] = [[9,9],[8,8]]
# Returns array([[[1,2],[9,9]],[[5,6],[8,8]]])
</pre>
</div>
<div><br /></div>
<div><span style="font-size: medium;"><b>Generate Random Number</b></span></div>
<div><br /></div><div>NumPy offers the random module to work with random numbers. The randint() function when passed the integer will generate a random number from 0 until the integer argument. It also takes a size parameter which specifies the shape of an array, in order for randint() to return a multi-dimensional array of random integers. NumPy also has the choice() method which allows to generate a random value based on an array of values.
<pre class="brush: python">from numpy import random
print(random.randint(100)) # random integer between 0 to 100
print(random.rand()) # random float between 0 and 1
x=random.randint(100, size=(5))
print(x) # prints an array containing 5 random integers from 0 to 100
x = random.randint(100, size=(3, 5))
print(x) # prints 2-D array with 3 rows, each row containing 5 random integers from 0 to 100
x = random.choice([3, 5, 7, 9])
print(x) # prints one of the values randomly from the passed array
x = random.choice([3, 5, 7, 9], size=(3, 5))
print(x) # prints 2-D array with 3 rows, each row containing 5 values from passed array
</pre>
</div>
<div><br /></div>
<div><span style="font-size: medium;"><b>Linear Algebra</b></span></div>
<div><br /></div>
<div>
NumPy package contains numpy.linalg module that provides all the functionality required for <a href="https://numpy.org/doc/stable/reference/routines.linalg.html" target="_blank">linear algebra</a>. Below are some of the important functions in this module.
<br /><br />
<b>dot</b>: It returns the dot product of two arrays. For 2-D vectors, it is the equivalent to matrix multiplication. For 1-D arrays, it is the inner product of the vectors. For N-dimensional arrays, it is a sum product over the last axis of a and the second-last axis of b.
<pre class="brush: python">import numpy.matlib
import numpy as np
a = np.array([[1,2],[3,4]])
b = np.array([[11,12],[13,14]])
np.dot(a,b) # returns [[37,40], [85,92]]
</pre>
<br />
<b>vdot</b>: It returns the dot product of the two vectors. If the first argument is complex, then its conjugate is used for calculation. If an argument is a multi-dimensional array, it is flattened.
<pre class="brush: python">import numpy as np
a = np.array([[1,2],[3,4]])
b = np.array([[11,12],[13,14]])
print(np.vdot(a,b)) # prints 130
</pre>
<br />
<b>inner</b>: It returns the inner product of vectors for 1-D arrays. For higher dimensions, it returns the sum product over the last axes.
<pre class="brush: python">import numpy as np
a = np.array([[1,2], [3,4]])
b = np.array([[11, 12], [13, 14]])
print(np.inner(a,b)) # returns [[35 41]
# [81 95]]
</pre>
<br />
<b>outer</b>: Returns the outer product of two vectors.
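A small sketch of the outer product with two made-up vectors:
<pre class="brush: python">import numpy as np
a = np.array([1, 2, 3])
b = np.array([0, 1, 2])
print(np.outer(a, b)) # returns [[0 1 2]
                      #          [0 2 4]
                      #          [0 3 6]]
</pre>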
<br /><br />
<b>matmul</b>: It returns the matrix product of two 2-D arrays. For arrays with dimensions above 2-D, it is treated as a stack of matrices residing in the last two indexes and is broadcast accordingly.
<pre class="brush: python">import numpy.matlib
import numpy as np
a = [[1,0],[0,1]]
b = [[4,1],[2,2]]
print(np.matmul(a,b)) # returns [[4 1]
# [2 2]]
</pre>
<br />
<b>determinant</b>: It calculates the determinant from the diagonal elements of a square matrix. For a 2x2 matrix [[a,b], [c,d]], the determinant is computed as ‘ad-bc’. The larger square matrices are considered to be a combination of 2x2 matrices.
<pre class="brush: python">import numpy as np
a = np.array([[1,2], [3,4]])
print(np.linalg.det(a)) # returns -2.0
</pre>
<br />
<b>solve</b>: It gives the solution of linear equations in the matrix form.
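For example, the sketch below solves the made-up system of equations x + 2y = 5 and 3x + 4y = 11.
<pre class="brush: python">import numpy as np
a = np.array([[1, 2], [3, 4]]) # coefficient matrix
b = np.array([5, 11]) # constants on the right hand side
x = np.linalg.solve(a, b)
print(x) # returns [1. 2.]
</pre>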
<br /><br />
<b>inv</b>: It calculates the inverse of a matrix. The inverse of a matrix is such that if it is multiplied by the original matrix, it results in identity matrix.
<pre class="brush: python">import numpy as np
a = np.array([[1,1,1],[0,2,5],[2,5,-1]])
ainv = np.linalg.inv(a)
print(ainv)
</pre>
</div>
<div><br /></div>
<div><b><span style="font-size: medium;">Statistics</span></b></div>
<div><br /></div>
<div>
Numpy supports various statistical calculations using the various functions provided in the library, like order statistics, averages and variances, correlations and histograms. NumPy has a lot of in-built statistical functions such as mean, median, standard deviation and variance.<pre class="brush: python">x = [32.32, 56.98, 21.52, 44.32, 55.63, 13.75, 43.47, 43.34]
# Functions to Find Mean, Median, SD and Variance
mean = np.mean(x)
print("Mean", mean) # 38.91625
median = np.median(x)
print("Median", median) # 43.405
sd = np.std(x)
print("Standard Deviation", sd) # 14.3815654029
variance = np.var(x)
print("Variance", variance) # 206.829423437
# Functions to Find Min, Max and Sum
stats = np.array([[1,2,3], [4,5,6]])
np.min(stats) # returns 1
np.min(stats, axis=0) # returns [1, 2, 3] as all the minimum values in top row
np.max(stats, axis=1) # returns [3, 6]
np.sum(stats) # adds all elements and returns 21
np.sum(stats, axis=0) # adds columns and returns array([5, 7, 9])
</pre>
</div>
<div><br /></div>
<div><span style="font-size: medium;"><b>Histograms</b></span></div>
<div><br /></div><div>NumPy has a numpy.histogram() function that computes a numerical representation of the frequency distribution of data. The <a href="https://www.i2tutorials.com/numpy-tutorial/numpy-statistical-functions/" target="_blank">histogram()</a> function mainly works with bins i.e. class intervals and the set of data given as input. The numpy.histogram() function takes the input array and bins as two parameters. The successive elements in the bin array act as the boundaries of each bin. The numpy histogram function gives the computed result as the occurrences of input data which fall in each particular range of bins. That determines the area of each bar when plotted using matplotlib. Matplotlib enables converting the numeric representation of the histogram into a graph. The hist() function of the pyplot submodule takes the array containing the data and the bin array as parameters and converts them into a histogram.
<pre class="brush: python">import numpy as np
from matplotlib import pyplot as plt
arr = np.array([22,87,5,43,56,73,55,54,11,20,51,5,79,31,27])
# returns (array([3, 4, 5, 2, 1]), array([ 0, 20, 40, 60, 80, 100]))
hist = np.histogram(arr, bins=[0,20,40,60,80,100])
plt.hist(arr, bins=[0,20,40,60,80,100])
plt.title("histogram")
plt.show()
</pre>
</div>
<div><br /></div>
<div><br /></div>Pranav Patilhttp://www.blogger.com/profile/10125896717744035325noreply@blogger.com0tag:blogger.com,1999:blog-7610061084095478663.post-89132870220042042142020-09-13T15:05:00.061-07:002021-07-10T19:16:56.209-07:00Python - A Brief Tutorial<div>Python is a programming language which was first released in 1991. It has stood the test of time and after nearly 30 years it is still widely used in software industry. The simplicity and concise syntax of python with growing number of libraries and framework still makes it one of the language of choice. Today python is used in web scraping, data science, machine learning, image processing, NLP, data processing and many more areas.</div><div><br /></div><div>Python language is a specification and has multiple implementations of it e.g. <a href="https://github.com/python/cpython">CPython</a> (widely used standard implementation), <a href="https://www.pypy.org/">PyPy</a>, <a href="https://ironpython.net/">IronPython</a>, <a href="https://www.jython.org/">Jython</a>. IronPython is python implementation written in C# targeting Microsoft’s .NET framework and uses .Net Virtual Machine for execution. PyPy is python implementation which is faster than Python as it uses Just-in-Time compiler to translate Python code directly into machine-native assembly language. Jython is an implementation of the Python programming language that can run on the Java platform. Jython programs use Java classes instead of Python modules. Jython compiles into Java byte code, which can then be run by Java virtual machine. Jython enables the use of Java class library functions from the Python program.</div><div><br /></div><div>Python code is first compiled by compiler to byte code which is later interpreted by Python virtual machine to the machine language. The .py source code is first compiled to byte code as .pyc. The byte-code are the instructions similar in spirit to CPU instructions, but instead of being executed by the CPU, they are executed by the <a href="https://leanpub.com/insidethepythonvirtualmachine/read" target="_blank">Python Virtual Machine</a> (PVM).</div><div><div><br /></div></div><div><b><span style="font-size: large;">Statement, Indentation and Comments</span></b></div><div><br /></div><div>In Python, the end of a statement is marked by a newline character. But we can make a statement extend over multiple lines with the line continuation character (\) as below.</div><div>
<pre class="brush: python">a = 1 + 2 + 3 + \
4 + 5 + 6 + \
7 + 8 + 9
</pre>
</div><div>Line continuation is implied inside parentheses ( ), brackets [ ], and braces { }. We can also put multiple statements in a single line using semicolons, as follows:</div><div>
<pre class="brush: python">a = 1; b = 2; c = 3</pre>
</div><div>Most of the programming languages like C, C++, and Java use braces { } to define a block of code. Python, however, uses indentation. Indentation refers to the spaces at the beginning of a code line. A code block (body of a function, loop, etc.) starts with indentation and ends with the first unindented line. The amount of indentation is up to the developer, but it must be consistent throughout the block. Generally, four whitespaces are used for indentation and are preferred over tabs. <a href="https://www.python.org/dev/peps/pep-0008/" target="_blank">PEP 8</a> is the documentation which has all the best practices for formatting the code.</div><div><br /></div><div>In Python, the hash (#) symbol is used to start writing a comment which extends until the newline character. For multi-line comments which extend up to multiple lines, the triple quotes either ''' or """ is used as below.</div><div>
<pre class="brush: python"># this is a python comment
"""
This is Documentation comment for python
"""
</pre>
</div><div><br /></div><div><span style="font-size: large;"><b>Data Types</b></span></div><div><br /></div><div>Python supports Dynamic Typing of the variable were type depends on the value assigned to the variable. When a variable is assigned a value, actually it's a label pointing to the memory location with the value. If the variable is assigned to a new value, the type of the variable changes depending on the type of the value it's pointing to. Python is case sensitive language, hence uppercase variable is different from lowercase variable. Generally lower case is used for variable names, and "_" (underscore) is used to separate multiple words in variable or function name instead of camel case. </div>
<pre class="brush: python">x = 5 # type of variable x is integer
x = 'Example' # type of variable x is now string
</pre>
<div>In Python every value is an object. A variable is a label pointing to the particular object (with memory location). When a variable is assigned a value 100, the label of the variable is pointing to the object containing value 100. If another variable is assigned same value 100, it will also point to the same object with 100. As more and more variable have the same value, the reference count to the object 100 increases. The reference count to the object decreases when the variables are re-assigned a different value, the variable goes out of scope or when del keyword is used to remove the variable (label) as a reference to the object. Once all the references to the object are removed, it can then be safely removed from the memory. Internally the python object holds the object type, its reference count and its value. If no variables is pointing to any given value or address location, then python makes it ready for garbage collection. Python uses both <a href="https://www.educative.io/courses/a-quick-primer-on-garbage-collection-algorithms/jR8ml" target="_blank">reference counting</a> as well as <a href="https://en.wikipedia.org/wiki/Garbage_collection_(computer_science)">generational</a> garbage collection (variation of <a href="https://www.educative.io/courses/a-quick-primer-on-garbage-collection-algorithms/jy6v">mark and sweep</a>).</div><div><br /></div><div>There is no concept of constant value in python. It depends on programmer to not change the value and treat it as a constant.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-8bC7Zrc7oGM/X1sPkRFRQWI/AAAAAAAAh7g/SHDMjQJWy_EvO1m3KCKWJJ3V7WdsokorwCLcBGAsYHQ/s1788/1_QfI8H_8HplGa1v9IrrWjBA.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="817" data-original-width="1788" height="292" src="https://1.bp.blogspot.com/-8bC7Zrc7oGM/X1sPkRFRQWI/AAAAAAAAh7g/SHDMjQJWy_EvO1m3KCKWJJ3V7WdsokorwCLcBGAsYHQ/w640-h292/1_QfI8H_8HplGa1v9IrrWjBA.png" width="640" /></a></div><br /><div><br /></div><div><div><b>None</b>: It represents no value assigned to variable. It is similar to null defined in other languages.</div><div><br /></div><div><b>Numeric</b>: It has 4 types namely: Int is integer, Float is floating point, Bool (True/False) and Complex. The boolean value of True (int(True) = 1) is 1 and boolean value for False is 0. Complex numbers have a real and imaginary part e.g. complex_number = 6 + 9j. Complex number can also be created using complex(b, k), were b is integer and k is floating point number.</div><div><br /></div><div>Python has functions to create integer, boolean, float variables from string as below.</div><div>
<pre class="brush: python">int_val = int("45")
float_val = float("6.78")
bool_val = bool("true")  # any non-empty string converts to True; only bool("") is False
</pre>
</div><div><br /></div><div>Built-in collection types include List, Tuple, Set, String and Range, which are discussed below. Apart from these python also has a Dictionary similar to a HashMap in Java.</div></div><div><br /></div>
<div>Python provides the built-in type() function, which when passed one argument, returns the type of an object. It is used to get the type of the variable i.e. Integer, Float, String or Boolean.</div>
<pre class="brush: python">type(var_name)</pre><div><br /></div>
<div><b><span style="font-size: large;">Standard Functions</span></b></div><div><br /></div><div>Python has a set of built in functions.</div><div>The print() function is a standard function to print variables. It also takes expressions to print the results.</div><div>
<pre class="brush: python">print('*' * 10) # using expression within print function
</pre>
</div><div><br /></div><div><div>By default the print function ends its output with a newline. To keep subsequent output on the same line, we pass the second argument end="" as below:</div></div><div>
<pre class="brush: python">print('*' * 10, end="")
</pre>
</div><div><br /></div><div>The input() function is used to get the user input in python.</div><div>
<pre class="brush: python">field = input("Enter the value: ")
</pre>
</div><div><br /></div><div>The eval() function is used to evaluate an expression or execute a function and return the corresponding result.</div><div>
<pre class="brush: python">eval(input("Enter an expression: "))
# When entered the expression (2 + 6 - 1), it prints the result 7
</pre>
</div><div><br /></div><div>The argument list sys.argv is used to get the arguments passed while running the python program; sys.argv[0] is the script name and the remaining entries are the command-line arguments.</div><div>
<pre class="brush: python">import sys
x = int(sys.argv[1])
y = int(sys.argv[2])
z = x + y
print(z)
</pre>
</div>
<div>The id() function is used to get the address of the value pointed by the variable.</div><div>
<pre class="brush: python">var_name = "Example"
id(var_name)
</pre>
In CPython, small integers and certain other immutable values are cached, so if multiple variables hold the same small value then all the variables point to the same single object and hence the same address. In the below example the address pointed to by a, b and 100 is the same, i.e. id(a) == id(b) == id(100). Further, as we change the value of a to 200, a is simply re-pointed to a different object, so the value of b remains the same i.e. 100, even though we stated (b = a) above.
<pre class="brush: python">a = 100
b = a
cond1 = (id(a) == id(b)) # returns true
print(a is b) # prints true, as both variable a and b point to same value or object
cond2 = (id(b) == id(100)) # returns true
a = 200
cond3 = (b == 100) # returns true, even though we declared b=a and changed a to 200
</pre>
</div>
<br />
<div><b><span style="font-size: large;">String</span></b></div>
<div><br /></div><div>Strings are amongst the most popular types in Python. We can create them simply by enclosing characters in quotes. Python treats single quotes the same as double quotes. Creating strings is as simple as assigning a value to a variable. Python support below special operators for string.</div><div><br /></div><div>
<table border="1" cellpadding="0" cellspacing="0" style="border-collapse: collapse; border: 1px solid rgb(221, 238, 238); font-family: arial, sans-serif; text-align: left; text-shadow: rgb(255, 255, 255) 1px 1px 1px; width: 80%;">
<tbody>
<tr>
<th style="background-color: #ddefef; border: 1px solid rgb(221, 238, 238); color: #336b6b; padding: 8px;" width="10%">Operator</th>
<th style="background-color: #ddefef; border: 1px solid rgb(221, 238, 238); color: #336b6b; padding: 8px;" width="60%">Description</th>
<th style="background-color: #ddefef; border: 1px solid rgb(221, 238, 238); color: #336b6b; padding: 8px;" width="30%">Example</th>
</tr>
<tr>
<td style="padding: 8px;">+</td>
<td style="padding: 8px;">Concatenation - Adds values on either side of the operator</td>
<td style="padding: 8px;">a + b will give HelloPython</td>
</tr>
<tr>
<td style="padding: 8px;">*</td>
<td style="padding: 8px;">Repetition - Creates new strings, concatenating multiple copies of the same string</td>
<td style="padding: 8px;">a*2 will give -HelloHello</td>
</tr>
<tr>
<td style="padding: 8px;">[]</td>
<td style="padding: 8px;">Slice - Gives the character from the given index</td>
<td style="padding: 8px;">a[1] will give e</td>
</tr>
<tr>
<td style="padding: 8px;">[ : ]</td>
<td style="padding: 8px;">Range Slice - Gives the characters from the given range</td>
<td style="padding: 8px;">a[1:4] will give ell</td>
</tr>
<tr>
<td style="padding: 8px;">in</td>
<td style="padding: 8px;">Membership - Returns true if a character exists in the given string</td>
<td style="padding: 8px;">H in a will give 1</td>
</tr>
<tr>
<td style="padding: 8px;">not in</td>
<td style="padding: 8px;">Membership - Returns true if a character does not exist in the given string</td>
<td style="padding: 8px;">M not in a will give 1</td>
</tr>
<tr>
<td style="padding: 8px;">r/R</td>
<td style="padding: 8px;">Raw String - Suppresses actual meaning of Escape characters. The syntax for raw strings is exactly the same as for normal strings with the exception of the raw string operator, the letter "r," which precedes the quotation marks. The "r" can be lowercase (r) or uppercase (R) and must be placed immediately preceding the first quote mark.</td>
<td style="padding: 8px;">print r'\n' prints \n and print R'\n'prints \n</td>
</tr>
<tr>
<td style="padding: 8px;">%</td>
<td style="padding: 8px;">Format - Performs String formatting</td>
<td style="padding: 8px;">print "My name is %s and weight is %d kg!" % ('Zara', 21)</td>
</tr>
</tbody></table>
</div><div><br /></div>
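<div>Assuming a = 'Hello' and b = 'Python' as in the table above, a quick sketch of these operators:</div><div>
<pre class="brush: python">a = 'Hello'
b = 'Python'
print(a + b)         # HelloPython
print(a * 2)         # HelloHello
print(a[1])          # e
print(a[1:4])        # ell
print('H' in a)      # True
print('M' not in a)  # True
print(r'\n')         # \n, escape sequence is not interpreted
print("My name is %s and weight is %d kg!" % ('Zara', 21))
</pre>
</div><div><br /></div>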
<div>
<pre class="brush: python">str = "Python"
new_str = "I Like" + str
print( str[-1] ) # 1st character from the end
print( str[0:3] ) # Returns all chars from 0 till 3, not including the 3rd char
print( str[2:] ) # Returns all chars from 2 till (default) end of the string
print( str[:5] ) # Returns all chars from 0 (default) till 5 (excluding 5th char)
another = str[:] # Returns all the characters from the string to create a clone of string
another = str[1:-1] # Returns all chars from index 1 till 1st char from the end (excluding the last char)
# Trying to modify a char in a string throws a TypeError as strings are immutable.
# str[0] = 'R'   # TypeError: 'str' object does not support item assignment
# example of formatted string with prefix 'f' to use formatted strings
sentiment = 'good'
msg = f'{str} is a {sentiment} language'
# example of an indented (triple-quoted) string paragraph
para = '''
used for paragraph. The text is printed with indentation as it is in the source code.
'''
print(para)
</pre>
</div>
<div><br /></div><span style="font-size: medium;"><b>
Various String Methods</b></span> <div><br /></div><div>The len() and print() are general purpose functions in python which work with other types as well. The rest of the below functions such as find(), replace(), upper() etc are string specific functions.<pre class="brush: python">len(str) # returns the no of characters in the string.
str.isupper() # checks if string is in upper case
# Here upper() is a method, while len(), input() are general purpose functions. It creates a new string without modifying the existing string.
upper_str = str.upper()
# Returns index of the first occurrence of the character or string. It is case sensitive.
str.find('y')
paragraph = "I like programming in Python"
paragraph.find('Python') # returns the index of the word Python in the string paragraph
str.replace('Python', 'Java') # replaces the word with the passed word in the string
str.replace('P', 'J') # replace a given character with another character
paragraph.split(' ') # returns a list of items (words)
</pre><div><br /></div>
To check if the string contains a given word or not, we use the 'in' operator which returns a boolean value, below is the example expression.</div><div>
<pre class="brush: python">text = "Its fun to program in Python."
'Python' in text
</pre>
</div><div><br /></div><div>The title() method returns a string where the first character in every word is upper case. Like a header, or a title. If the word contains a number or a symbol, the first letter after that will be converted to upper case.<br /><br /></div><div><span style="font-size: large;"><b>Binary and Decimal Numbers</b></span></div><div><br /></div><div>The bin() function converts decimal number to binary format. The '0b' prefix symbolizes that the number is in binary format.</div><div>
<pre class="brush: python">bin(25) = 0b11001
</pre>
</div><div><br /></div><div>For octal format we use oct() function which returns result with prefix '0o'. Similarly for hexadecimal we use hex() function which returns the results with prefix '0x'.<br /><br /></div><div>Below are the bitwise operators in Python:</div><div>
<pre class="brush: python"># compliment of number
~12 = -13
# bitwise and operator
12 & 13 = 12
# bitwise or operator
12 | 13 = 13
# bitwise ex-or operator
12 ^ 13 = 1
# bitwise left shift operator, which shift 2 bits to left
10 << 2 = 14
</pre>
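<br />The results above are easier to verify by looking at the binary form of each operand and result with bin(); a quick sketch:
<pre class="brush: python">bin(12)       # '0b1100'
bin(13)       # '0b1101'
bin(12 & 13)  # '0b1100'   -> 12
bin(12 | 13)  # '0b1101'   -> 13
bin(12 ^ 13)  # '0b1'      -> 1
bin(10 << 2)  # '0b101000' -> 40
</pre>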
<br /><div><b><span style="font-size: large;">Arithmetic Operators</span></b></div><div><br /></div><div>Python supports all the standard arithmetic operations (+,-, *). Although it has two types of division operators as below.</div><div>
<pre class="brush: python"># Normal division which returns floating point number
10 / 3 = 3.33333333
# Whole division which return an integer result.
10 // 3 = 3
</pre>
</div><div>Exponent operator</div><div>
<pre class="brush: python">10 ** 3 = 1000 # means 10 to the power 3
</pre>
</div><div>Augmented assignment operators</div><div>
<pre class="brush: python">x += 3
x -= 3
</pre>
</div><div><div>Operator Precedence: Below is the order of precedence for operators in Python</div><div><ol style="text-align: left;"><li>Parentheses</li><li>Exponentiation (12 ** 3)</li><li>Multiplication or Division</li><li>Addition or Subtraction</li></ol></div></div><div>Python also allows assignment of multiple variables to different values.</div><div>
<pre class="brush: python">a, b = 1, 2
</pre>
</div><div>It also allows swapping two variables as below: Python uses ROT_TWO() to swap the two top-most stack items for this operation.</div><div>
<pre class="brush: python">a, b = b, a
</pre>
</div><div><br /></div><div><br /></div><div><b><span style="font-size: large;">Math Functions</span></b></div><div><br /></div><div>Python support basic math operations by default as below.</div><div>
<pre class="brush: python">round(2.9) = 2 # rounds a float value to integer
abs(-2.9) = 2.9 # returns positive representation of the value
</pre>
</div><div><br /></div><div>It supports mathematical functions after importing the Math module.</div><div>
<pre class="brush: python">import math
math.ceil(2.9)   # 3, ceiling of the number
math.floor(2.9)  # 2, floor of the number
math.sqrt(25)    # 5.0, square root of the number
math.pow(3, 2)   # 9.0, power function (returns a float)
math.pi          # value of PI
# import module and define module alias
import math as m
m.sqrt(25)
</pre>
</div>
<div>
The random module is used to generate random values. The random() function returns a random float between 0 and 1, while randint() returns a random integer within the specified range.
<pre class="brush: python">import random
random.random()        # random float in the range [0.0, 1.0)
random.randint(10, 20) # random integer between 10 and 20 (both inclusive)
</pre>
It also allows to select a random item from a list.
<pre class="brush: python">members = ["John", "Sam", "Michael", "Arthur", "George"]
random_member = random.choice(members)
</pre>
</div><div><br /></div><div><b><span style="font-size: large;">If Else Statement</span></b></div><div>
The if and else statements are used for conditional execution.
<pre class="brush: python">if a == true:
print("Boolean value is true")
print("Second line")
elif b == true:
print("Another Boolean value is true")
else:
print("Boolean value is false")
# if condition can also we written as a single line conditional assignment statement
fruit = "Apple"
is_apple = True if fruit == 'Apple' else False
</pre>
</div>
<div><br /></div><div>Python supports all the basic <b>comparison operators</b> including <, >, <=, >=, ==, !=. It also provides the following <b>logical operators</b> to combine multiple conditions, as in the sketch below.</div><div><ul style="text-align: left;"><li><b>and</b>: if a and b</li><li><b>or</b>: if a or b</li><li><b>not</b>: if a and not b</li></ul></div>
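<div><br /></div><div>A small sketch combining both kinds of operators (the variables age and has_ticket are made up for illustration):</div><div>
<pre class="brush: python">age = 25
has_ticket = True
# comparison operators produce booleans, which logical operators combine
if age >= 18 and has_ticket:
    print("Allowed to enter")
if not (age < 18 or not has_ticket):
    print("Same check, expressed with 'or' and 'not'")
</pre>
</div>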
<div><br /></div><div><b><span style="font-size: large;">While Loop</span></b></div><div><br /></div><div>The <a href="https://www.programiz.com/python-programming/while-loop" target="_blank">while statement</a> is used for repeated execution as long as an expression is true.</div><div><pre class="brush: python">i = 1
while i < 6:
print(i)
i += 1
</pre>
While statements in Python can optionally also have an else block, which is executed only when the while loop completes without any breaks. If the while loop is terminated with a break statement, the else part is skipped. Hence, a while loop's else part runs when the condition becomes false and no break occurred.
<pre class="brush: python">while counter < 3:
print("Inside loop")
counter = counter + 1
else:
print("Inside else")
</pre>
</div><div><br /></div><div><b><span style="font-size: large;">For Loop</span></b></div><div><br /></div><div>For loop allows to iterate over the elements of a sequence (such as a string, tuple or list) or other iterable object.</div>
<div>
<pre class="brush: python">for item in "Sample String":
print(item)
for item in ["Toyota", "Honda", "Ferrari"]:
print(item)
</pre>
A for loop in python can also have an else block, which executes after the loop finishes only when no break occurred, i.e. nothing matched. The break statement is important here; without it, the else block would execute every time the loop completes.
<pre class="brush: python">for num in [12, 34, 23, 67]
if num % 5 == 0:
print(num)
break
else:
print("Not Found")
</pre>
The range function creates objects based on the from/to parameters passed to it. Range object can also be converted to a list by passing it to list() function e.g. <b>list(range(10))</b>. Below is the example of range functions.
<pre class="brush: python">range(10) # returns numbers from 0 to 9
range(5, 10) # returns numbers from 5 to 9
range(5, 10, 2) # returns numbers from 5 to 9 with steps of +2
range(20, 10, -1) # returns numbers from 20 down to 11 in descending order
# for loop can be used to loop through the values within the range
for item in range(10):
    print(item)
</pre>
</div>
<div>An underscore '_' can also be used as a variable in both for and while loops.</div>
<div>
<pre class="brush: python">_ = 5
while _ < 10:
    print(_, end = ' ') # default value of 'end' is '\n' in python; we're changing it to a space
    _ += 1
for _ in range(5):
print(_)
</pre>
</div>
<div>Python supports the keywords break, continue and pass. The break statement exits the loop, while continue skips the remaining code of the current iteration and continues with the loop. The pass keyword indicates that a block (an if/else block, loop block, class/method block) is intentionally empty, as in the sketch below.</div>
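<div><br /></div><div>A minimal sketch of all three keywords:</div><div>
<pre class="brush: python">for n in range(10):
    if n == 7:
        break        # terminate the loop when n reaches 7
    if n % 2 == 0:
        continue     # skip the rest of the body for even numbers
    print(n)         # prints 1, 3, 5

def todo():
    pass             # placeholder for an empty function body
</pre>
</div>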
<div><br /></div><div><b><span style="font-size: large;">Arrays</span></b></div><div><br /></div>
<div>An array is similar to a list where all the items are of the same type. Arrays in python don't have a specific fixed size. Arrays can be created using the array(typecode, list of values) constructor. The type code value is taken from the <a href="https://docs.python.org/3/library/array.html" target="_blank">python type codes</a>.<pre class="brush: python">from array import *
vals = array('i', [5,9,8,4,2]) # type value 'i' indicates signed int
print(vals)
vals.buffer_info() # returns a tuple (address, size), with address and size of the array.
vals.typecode # returns i (signed integer)
vals.reverse() # reverse the values of the array
vals.index(2) # returns the index of the first occurrence of element 2 (0 here, since the array was reversed above)
# append() method is used to add values to the array.
vals.append(10)
empty_array = array('i', []) # empty array of integers
newarray = array(vals.typecode, (a for a in vals))
for e in vals:
print(e)
</pre>
</div>
<div><br /></div><div><b><span style="font-size: large;">Lists</span></b></div><div><br /></div><div>List can contain items of different types as opposed to an array.
<pre class="brush: python">names = ["John", "Bill", "Tony"]
names[-1] # returns Tony
names[1:] # returns a new list from 1st element (0th being the first) to the end of the list
names[1] = "Billy" # update 2nd element
matrix = [ [1,2,3], [5,6,7], [3,7,8] ]
for row in matrix:
for item in row:
print(item)
numbers = [4,6,8,10]
numbers.append(56) # adds item to the list
numbers.insert(index, item)
numbers.extend([16,37,68]) # add multiple elements to the list
numbers.remove(item)
numbers.clear()
numbers.pop() # remove last item
numbers.pop(index) # remove indexed item
numbers.index(item) # returns index of the first occurrence of the item
numbers.count(item) # number of times the item occurs within the list
min(numbers) # get minimum value in the list
sum(numbers) # get sum of all items in the list
50 in numbers # existence of an item within the list
50 not in numbers # True if 50 is not present in the list
numbers.sort()
numbers.reverse() # reverses the list
numbers_copy = numbers.copy() # create a copy of the list
</pre>
</div><div><br /></div>
<div><b><a href="https://www.python.org/dev/peps/pep-0202/" target="_blank">List Comprehensions</a></b> allows conditional construction of list literals using for and if clauses. They provide a more concise way to create lists.
<pre class="brush: python">print [i for i in range(10)]
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print([i for i in range(20) if i%2 == 0])
# [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
</pre>
</div><div><br /></div><div><div><b><span style="font-size: medium;">Enumerate Function</span></b></div><div><br /></div><div>The <a href="https://www.bitdegree.org/learn/python-enumerate" target="_blank">enumerate() function</a> assigns an index to each item in an iterable object that can be used to reference the item later. The enumerate() function takes an iterable type object like a list, tuple, or a string, and an optional start parameter (0 by default) which tells which index to use for the first item of the iterable object. By default, the enumerate() function returns an enumerated object which can be converted to a tuple or a list by using tuple(<enumerate>) and list(<enumerate>), respectively.</div></div><div>
<pre class="brush: python">cities = ['Amsterdam', 'New York', 'Paris', 'London']
for i, city in enumerate(cities):
    print(i, city)
</pre>
</div><div><br /></div><div><div>The enumerate() function can be used instead of maintaining a separate counter in the for loop, because enumerate() yields the index of each item as well as the item itself.</div></div><div>
<pre class="brush: python">cars = ['kia', 'audi', 'bmw']
print(list(enumerate(cars, start = 1)))
</pre>
</div><div><br /></div>
<div><b><span style="font-size: large;">Tuples</span></b></div><div><br /></div>
<div>
Tuples are the same as lists but, unlike lists, are immutable.
<pre class="brush: python">numbers = (1, 2, 3)</pre>
Tuples do not allow changing their items once assigned, as they do not support item assignment (hence they are immutable).
Tuples have only two methods, index() and count(), which lists also have.
To read a specific item in the tuple use :
<pre class="brush: python">print(numbers[0])</pre>
When returning tuples from a function, we don't need to specify the brackets, e.g.
<pre class="brush: python">return a, b # Python automatically interprets this as a tuple
</pre>
Unpacking feature: Used to assign items of lists or tuples to variables by unpacking them.
<pre class="brush: python">coordinates = (1, 2, 3)
x = coordinates[0]
y = coordinates[1]
z = coordinates[2]
x, y, z = coordinates
values = [1, 2, 3]
a, b, c = values
</pre>
If we don't want to use specific values while unpacking, just assign that value to special underscore '_' variable.
<pre class="brush: python"># ignoring a value
a, _, b = (1, 2, 3) # a = 1, b = 3
# ignoring multiple values
# *(variable) used to assign multiple value to a variable as list while unpacking
# it's called "Extended Unpacking", only available in Python 3.x
a, *_, b = (7, 6, 5, 4, 3, 2, 1)
</pre>
</div>
<div><br /></div><div><b><span style="font-size: large;">Sets</span></b></div><div><br /></div><div>Set does not retain the order of elements as it uses hash. Hence indexes are not supported in set. It does supports add(), remove(), pop() methods to update the set. Also there are no duplicates allowed in a set.
<pre class="brush: python">s = { 3, 4, 5, 7, 8, 18 }
</pre>
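<br />Continuing with the set s above, a short sketch of the update methods (duplicates are silently ignored):
<pre class="brush: python">s.add(25)    # adds 25 to the set
s.add(4)     # 4 is already present, so the set is unchanged
s.remove(7)  # removes 7; raises KeyError if the element is absent
s.pop()      # removes and returns an arbitrary element
print(s)
</pre>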
</div><div><br /></div><div><b><span style="font-size: large;">Dictionary</span></b></div><div><br /></div><div>Dictionary is of mapping data type. Dictionaries are used to store information as key value pairs. Every key is unique and should be immutable within a given dictionary.</div>
<div>
<pre class="brush: python">customer = {
"name": "John",
"age": 67,
"is_retired": False,,
"cars": ["Honda", "Toyota"]
}
# returns value "John" associated with the "name" key in customer dictionary
# If the specified key does not exist then it throws a KeyError.
customer["name"]
# get() method also fetches the value of the key from the dictionary
# if the key is not present in dictionary then it returns None without throwing an error
customer.get("name")
# Also allows to specify default value which will be returned when no key exists within the dictionary
customer.get("name", "default_value")
customer["name"] = "Jack" # updates the value of the key "name" in the dictionary
customer["birth_date"] = "10 Jan 1980" # Adds new key-value to the dictionary
customer.clear() # empty the dictionary
customer.keys() # returns all keys in the dictionary
customer.values() # returns all values in the dictionary
</pre>
</div>
<div><br /></div><div>The del keyword is used to delete objects. In Python everything is an object, so the del keyword can also be used to delete variables, lists, or parts of a list etc.</div>
<div><pre class="brush: python">del customer["birth_date"]</pre></div>
<div><br /></div><div><b><span style="font-size: large;">Zip Function</span></b></div><div><br /></div><div>The zip() function creates an iterator of tuples which aggregate elements from two or more iterables. It pairs the first item in each passed iterator together, followed by paring the the second item in each passed iterator together and so forth.</div>
<div><pre class="brush: python">keys = ["Texas", "Ohio", "California"]
values = ["Austin", "Columbus", "Sacramento"]
capitals_dictionary = dict(zip(keys, values)) # zip converts two lists in key-value pairs
names = ("Tony", "Jack", "Tom", "Merlyn")
companies = ("Google", "Apple", "Facebook", "Microsoft")
zipped = zip(names, companies)
print(list(zipped)) # returns each element from the two tuples paired together in a new list
# [ ('Tony', 'Google'), ('Jack', 'Apple'), ('Tom', 'Facebook'), ('Merlyn','Microsoft') ]
# zip returns an iterator which was consumed by list() above, so it is re-created before looping
zipped = zip(names, companies)
for (a,b) in zipped:
    print(a,b)
</pre>
</div><div>The zip function also allows to iterate through multiple lists in parallel and access the corresponding element.
<div><pre class="brush: python">x_coordinates = [34, 56, 67, 89]
y_coordinates = [78, 76, 43, 12]
for x, y in zip(x_coordinates, y_coordinates):
print(x, y)
</pre>
</div><div><br /></div><div><b><span style="font-size: large;">Functions</span></b></div><div><br /></div><div>Functions are always defined first and then called in Python. It is recommended to have 2 blank lines after the function definition. The immutable arguments (int, string) passed to the function are passed by value, while mutable values (list, set etc) are passed by reference in python.
<pre class="brush: python"> def greet_user(first_name, last_name, location):
print(f"Hello {first_name} {last_name} ! How is it in {location}")
</pre>
</div><div><br /></div><div>The actual arguments passed to a function can be of the below types.</div><div><br /></div><div><b>Positional Argument</b>: Positional parameters are passed in the order of their definition in the function.
<pre class="brush: python"> greet_user("John", "Smith", "Canada")
</pre>
</div><div><br /></div><div><b>Keyword Argument</b>: Keyword arguments are passed by keyword or name of argument in any order. The positional arguments and keyword arguments can be mixed within the function call. But keyword arguments should always come after positional arguments.
<pre class="brush: python"> greet_user(location="London", last_name="Mayer", first_name="John")
greet_user("John", location="London", last_name="Mayer")
</pre>
</div><div><br /></div><div><b>Default Argument</b>: Set default value to the arguments of the functions so that those values can be skipped while calling the function.
<pre class="brush: python"> def greet_user(first_name, last_name, location='USA'):
print(f"Hello {first_name} {last_name} ! How is it in {location}")
greet_user("John", "Smith")
</pre>
</div><div><br /></div><div><b>Variable Length Argument</b>: The number of arguments passed in a function is not fixed for variable length arguments. A '*' is added before the argument name in function definition. The variable length argument is of type tuple in the function.</div><div><br /></div><div>
<pre class="brush: python"> def sum(*a):
r = 0
for i in a:
r += i
sum(4, 7, 9, 2)
</pre>
</div>
<div><br /></div><div><div>Keyword variable length arguments is same as Variable length arguments, but it allows keyword arguments to be included in variable length. It is indicated by adding '**' before the argument name.</div></div><div><br /></div>
<div>
<pre class="brush: python"> def person(name, **keyword_args):
print(name)
for i, j in keyword_args.items():
print(i, j)
person('Jack', age=56, city='London', contact=441711231233)
</pre>
</div>
<div><br /></div><div>If there is no return statement in a function, then by default all the functions in python return None. A function returning None in Python is similar to void return in Java. Function can return multiple values instead of a single value as below.
<pre class="brush: python"> def count(l):
odd, even = 0, 0
for i in l:
if i%2 == 0:
even += 1
else:
odd += 1
return odd, even
lst = [3,5,7,8]
odd, even = count(lst)
</pre>
<br />
Python allows a maximum recursion depth of 1000 by default. The sys.getrecursionlimit() function returns this recursion limit. The sys.setrecursionlimit(9999) call overrides the default value and sets a custom recursion limit.
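<br />For example, inspecting and overriding the limit:
<pre class="brush: python">import sys
print(sys.getrecursionlimit())  # typically 1000 by default
sys.setrecursionlimit(9999)     # raise the limit (use with care, very deep recursion can crash the interpreter)
</pre>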
</div><div><br /></div><div><b><span style="font-size: medium;">First-class functions</span></b></div><div><br /></div><div>In Python, functions are first-class objects. This means that functions can be passed around and used as arguments, just like any other object. Also functions can be returned as values.</div><div>
<pre class="brush: python">def say_hello(name):
return f"Hello {name}"
def be_awesome(name):
return f"Yo {name}, together we are the awesomest!"
def greet_bob(greeter_func):
return greeter_func("Bob")
greet_bob(say_hello)
greet_bob(be_awesome)
</pre>
</div><div><br /></div><div><b><span style="font-size: medium;">Inner Functions</span></b></div><div><br /></div><div>Python allows to define a functions inside other functions. Such functions are called inner functions with below example. The inner functions are not defined until the parent function is called. They are locally scoped and only exist inside the parent() function.</div><div>
<pre class="brush: python">def parent():
print("Parent function")
def first_child():
print("first child function")
def second_child():
print("second child function")
second_child()
first_child()
</pre>
</div><div><br /></div><div><div><b><span style="font-size: large;">Function Annotations</span></b></div><div><br /></div><div><a href="https://www.python.org/dev/peps/pep-3107" target="_blank">Function annotations</a> is a Python 3 feature that allows to add arbitrary metadata to function arguments and return value. The annotations only provides a nice syntactic support for associating metadata without any semantics (meaning) and is totally optional. In the below example function sum() which takes 3 arguments, a, b and c. The first argument a is not annotated, the second argument b is annotated with the string ‘annotating b’, and the third argument c is annotated with type int. The return value is annotated with the type float. Note the "->" syntax for annotating the return value. The annotations have no impact whatsoever on the execution of the function. Annotations have no standard meaning or semantics and is mainly used for documentation.</div></div>
<pre class="brush: python">def sum(a, b: 'annotating b', c: int) -> float:
print(a + b + c)
</pre>
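<div><br /></div><div>The collected annotations can be inspected at runtime through the function's __annotations__ dictionary; a quick sketch:</div><div>
<pre class="brush: python">print(sum.__annotations__)
# prints the mapping of argument names to their annotations: 'annotating b' for b, int for c and float for 'return'
</pre>
</div>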
<div><br /></div><div><div><b><span style="font-size: large;">Modules</span></b></div><div><br /></div><div>A module is a python file which contains all the related functions and classes. The module can be imported into another python file in order to execute functions. The "<b>import module-name</b>" statement imports entire module in the python file. It also requires to specify the "<b>module-name.function()</b>" to call the function. </div><div><br /></div><div>In the below example we have the Calc.py file which has add() function. Then another demo.py file is trying to call the add() function from Calc.py.<pre class="brush: python">import Calc
a = 4
b = 7
c = Calc.add(a, b)
</pre></div><div><br /></div><div>In order to import only selected functions from a python module we use "<b>from module-name import function</b>". It allows calling the function() directly without using the prefix '<b>modulename.</b>'.<pre class="brush: python">from math import sqrt, pow
pow(3,2)
</pre></div><div><div><br /></div><div>All the related modules are organized into a package which is a directory containing module files. A package is the container for multiple modules. A special file called "__init__.py" is added to the package directory to make the directory a package. When the python interpreter sees the "__init__.py" file in a directory then it treats the directory as a package. The modules within the package can be imported using "<b>import package-name.module-name</b>" or "<b>from package-name.module-name import function</b>".</div></div><div><br /></div><div>Python has many standard built-in modules for general functionality. The complete list is available in the python 3 module index documentation. These modules can be imported directly without specifying the package name. The built-in modules are located in the python 3.x directory (library root) within the base python directory.</div><div><br /></div><div><b><span style="font-size: medium;">__name__ global variable</span></b></div><div><br /></div><div>In Python <a href="https://docs.python.org/3/library/inspect.html" target="_blank">Introspection</a> is the ability of an object to know about its own attributes at runtime. For instance, a function knows its own name and documentation. The __name__ is one such special global variable in python whose value depends on the place it is fetched. The value of __name__ is __main__ in the file where the python code is being executed, while its value is the module name when it is fetched inside a module imported into the main execution file. When a module is imported in python, it executes all the statements in the file. To avoid executing the main method in the module file when imported as a module, but to allow execution of the main method when run in standalone mode, the __name__ variable is checked for __main__ as below.<pre class="brush: python">def main():
print("Hello")
if __name__ == "__main__"
main()
</pre></div><div><br /></div></div><div><span style="font-size: large;"><b>Global Keyword</b></span></div><div><br /></div>
<div>
Global variables defined outside a function are accessible inside the function in python. If a global variable and a local variable in the function have the same name, the local variable always takes precedence within the function. If a variable is assigned a value within a function, it is interpreted as a local variable. Hence, to modify a global variable from within the function, the global keyword is used before the variable name to explicitly specify access to the global variable.<pre class="brush: python">a = 90
def test():
global a
a = 34
print("value of a is ", a)
test()
print("Outside test value of a is ", a)
</pre>
Also globals() provides access to all the global variables. To access a particular global variable 'a', we use globals()['a'].
<pre class="brush: python"> a = 10
def test():
globals()['a'] = 15
</pre>
</div>
<div><br /></div><div><b><span style="font-size: large;">Lambda Functions</span></b></div><div><br /></div><div>Functions are objects in Python and they can be passed as parameters into other functions.</div>
<div>
<pre class="brush: python">function = lambda parameters: body
f = lambda a,b : a+b
numbers = [2,3,7,9,6,4,5,8]
# filter() example
even_nums = list(filter(lambda n: n%2 == 0, numbers))
# map() example
squares = list(map(lambda n: n*n, numbers))
# reduce() example
from functools import reduce
total = reduce(lambda a, b: a + b, numbers)
</pre>
</div><div><br /></div><div><div><b><span style="font-size: large;">Iterators</span></b></div><div><br /></div>
<div>The for loop uses iterators behind the scenes for looping items.
The iter() is the function which converts a list to an iterator, which is used to iterate the list one value at a time. The iterator has __next__() method which gives the current value and points to the next value. The iterator preserves the state of the last value returned by it.
<pre class="brush: python">numbers = [ 2, 15, 8, 6, 3, 4]
iterator = iter(numbers)
# Both below method prints the current value in the iterator and points to the next value
print(iterator.__next__())
print(next(iterator))
# Example of using an iterator in a while loop.
def loop(iterable):
    oIter = iterable.__iter__()
    while True:
        try:
            print(next(oIter))
        except StopIteration:
            break
loop([1,2,3])
</pre>
A custom iterator can be created by implementing the __next__() and __iter__() methods.
<pre class="brush: python">class NumGenerator
def __init__(self, start, limit):
self.num = start
self.limit = limit
def __iter__():
return self
def __next__():
if self.num <= self.limit:
val = self.num
self.num += 1
return val
else:
raise StopIteration
generator = NumGenerator(1, 50)
for i in generator
print(i)
</pre>
</div><div><br /></div><div><b><span style="font-size: large;">Generator</span></b></div></div><div><br /></div><div>Generators are iterators, a kind of iterable we can only iterate over once. Generators do not store all the values in memory, they generate the values on the fly. The yield keyword is used to produce a sequence of values. It is used to iterate over a sequence without storing the entire sequence in memory. If the body of a def contains yield, the function automatically becomes a generator function.
<pre class="brush: python">def numgenerator()
yield 1
yield 2
yield 3
yield 4
yield 5
gen = numgenerator()
print(gen.__next__())
# Below for loop prints 2 to 5, as 1 is printed by above print statement
for i in gen:
    print(i)
</pre>
</div>
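<div><br /></div><div>Because values are produced lazily, a generator can describe a very large (or even unbounded) sequence without building it in memory; a minimal sketch:</div><div>
<pre class="brush: python">def squares(limit):
    n = 1
    while n <= limit:
        yield n * n   # produced one at a time, only when requested
        n += 1

for sq in squares(1000000):
    if sq > 50:
        break
    print(sq)          # prints 1, 4, 9, 16, 25, 36, 49
</pre>
</div>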
<div><br /></div><div><b><span style="font-size: large;">Exceptions</span></b></div><div><br /></div><div>In python, Exception is a generic error which includes all errors/exceptions. The try statement specifies exception handlers and/or cleanup code for a group of statements as below.
<pre class="brush: python">try:
age = int(input('Age: '))
income = 20000
average = income / age
print(f"Age is {age}")
except ZeroDivisionError:
print('Age cannot be zero')
except ValueError:
print('Invalid value')
except Exception as e:
print("Something went wrong...", e)
finally:
print("execution completed")
</pre>
</div>
<div>
The try statement also supports else block after the except block, which is executed when there is no exception.
<pre class="brush: python">value = '9X'
try:
print(int(value))
except:
print('Conversion failed !')
else:
print('Conversion successful !')
</pre>
</div>
<div><br /></div><div><b><span style="font-size: large;">With Statement</span></b></div><div><br /></div><div>The <a href="https://docs.python.org/3/reference/compound_stmts.html#the-with-statement" target="_blank">with statement</a> wraps the execution of a block with methods defined by a <a href="https://docs.python.org/3/reference/datamodel.html#context-managers" target="_blank">context manager.</a> This allows common try…except…finally usage patterns to be encapsulated for convenient reuse. In the below example, the with statement automatically closes the file after the nested block of code, no matter how the block exits. If an exception occurs before the end of the block, it will close the file before the exception is caught by an outer exception handler. If the nested block were to contain a return statement, or a continue or break statement, the with statement would automatically close the file in those cases as well.</div><div>
<pre class="brush: python">with open('output.txt', 'w') as f:
f.write('Hi there!')
</pre>
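<br />Without the with statement, roughly the same behavior would require an explicit try…finally block; a sketch of the equivalent code:
<pre class="brush: python">f = open('output.txt', 'w')
try:
    f.write('Hi there!')
finally:
    f.close()   # runs whether or not the write raised an exception
</pre>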
</div><div><br /></div><div><b><span style="font-size: large;">Classes</span></b></div><div><br /></div><div>Everything in Python is an Object. Classes are used to define new type or objects. The class can have methods in its body and they can also have attributes which can be set anywhere in the program.
<pre class="brush: python"># In Python we use camel case naming to name the class
class Point:
def draw(self):
print("draw")
point1 = Point()
point1.x = 10 # Creates an attribute x in point1 object and assigns the value 10
class Person:
country = "USA"
def __init__(self, name, age):
self.name = name # self is reference to current object. It adds new attribute 'name' and assigns the parameter value
self.age = age
def greeting(self):
print(f"hello, {self.name}")
def compare(self, other):
if self.age == other.age:
            return True
else:
            return False
employee = Person("John", 90)
# Two ways to call greeting() method
Person.greeting(employee)
# Here the object on which the method is called internally passes itself as an argument to self.
employee.greeting()
# Update the attribute of the object externally
employee.name = "Tim"
employee.greeting()
manager = Person("Michael", 90)
# Compare takes two parameters, who is calling it and whom to compare with.
if manager.compare(employee):
print("Employee and Manager have same age")
# class variables are accessed same as instance variables.
print(manager.name, manager.age, manager.country)
# update class variable
Person.country = "Canada"
</pre>
</div>
<div><br /></div><div><b><span style="font-size: medium;">Init and New Methods</span></b></div><div><br /></div><div>In Python, __init__() method is responsible for instantiating the class instance. It acts as the constructor of the class and takes parameters to set attribute values. The __init__() constructor is optional for a class. The __new__ method is similar to the __init__ method, and is called when the class is ready to instantiate itself. The major difference between these two methods is that __new__ handles object creation and __init__ handles object initialization. The __new__() method is defined as a static method in the base class and it needs to pass a class (cls) parameter. The class (cls) parameter represents the class that needs to be instantiated, and this parameter is provided automatically by python parser at instantiation time. The __new__ method is called first when an object is created and __init__ method is later called to initialize the object. If both __init__ method and __new__ method exists in the class, then the __new__ method is executed first and decides whether to use __init__ method or not. The reason being that the new() method can call other class constructors or simply return other objects as instances of this class. </div><div><br /></div><div>The self parameter represents the current (object) instance of the class. The self keyword allows to access the attributes and methods of the class in the python. It binds the attributes with the given arguments.</div>
<div><pre class="brush: python">class Employee(object):
def __init__(self, name, salary):
self.name = name
self.salary = salary
def __new__(cls, name, salary):
if 0 < salary < 10000:
return object.__new__(cls)
else:
return None
def __str__(self):
return '{0}({1})'.format(self.__class__.__name__, self.__dict__)
emp = Employee("James", 4500)
print(emp)
</pre>
</div>
<div><br /></div>
<div><div><br /></div><div>Two types of variables in Python class, an instance variable and a class (static) variable which is common for all the objects. The variables defined within __init__() are instance variables, while the variables defined outside __init__() in the class are called class variables. Python has namespace which is an area where an object/variable is created and stored. Class variables are stored in class namespace, while instance variables are stored in Object/instance namespace.</div><div><br /></div><div>There are two types of methods in Python:</div><div><br /></div><div><b>Instance methods</b>: The methods which take the "self" parameter are called instance methods. Instance methods have two types, accessor methods which fetch values and mutator methods which modify values.</div><div><br /></div><div><b>Class methods</b>: Class methods are common to all objects and are used to work with class variables. All class methods have the parameter "cls" in their methods.</div></div><div><br /></div>
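<div>As an illustration of the two kinds of variables (the Car class here is hypothetical):</div><div>
<pre class="brush: python">class Car:
    wheels = 4                  # class variable, shared by all Car objects
    def __init__(self, color):
        self.color = color      # instance variable, stored per object

c1 = Car("red")
c2 = Car("blue")
print(c1.color, c2.color, Car.wheels)   # red blue 4
</pre>
</div><div><br /></div>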
<div>
<pre class="brush: python">class Person:
country = "USA"
@classmethod
def country(cls):
return cls.country
p1 = Person()
print(p1.country())
print(Person.country())
</pre>
</div>
<div><br /></div><div>The cls parameter does not need to be passed explicitly; the @classmethod decorator makes python pass the class automatically when the class method is called.</div><div><br /></div><div><b>Static methods</b>: A method which neither uses instance variables nor class variables and provides independent functionality is called a static method. It is mainly used for utility methods. For a static method we use the @staticmethod decorator.</div><div><br /></div>
<div>
<pre class="brush: python">class Person:
def info():
print("Information about the class Person")
Person.info()
</pre>
</div>
<div><br /></div>
<div><b><span style="font-size: medium;">Meta-Classes</span></b></div><div><br /></div><div>In Python the object.__class__ designates the name of the class for the object. From Python 3 an object’s type and its class is referred interchangeably, as type(object) is the same as object.__class__. Since everything is an object in Python, all the classes also have a type, which is the type class itself. The type class is a metaclass, of which all the classes are instances. </div><div><br /></div><div>The built-in type() function enables to create new class dynamically. It takes the name of the class, tuple of base classes which the class inherits and namespace dictionary containing the definitions for class body. When all these parameters are passed the type() function dynamically defines a class.</div>
<div>
<pre class="brush: python"># create new class Person dynamically. Here attr_val can also be assigned to an external function name
Person = type('Person', (), { 'attr': 100, 'attr_val': lambda x : x.attr })
p = Person()
print(p.attr_val()) # prints value 100 of attr
# create new class Employee extending Person class dynamically
Employee = type('Employee', (Person,), dict(attr=100))
e = Employee()
print(e.attr) # prints 100
print(e.__class__) # prints class '__main__.Employee'
print(e.__class__.__bases__) # prints tuple with single element, class '__main__.Person'
</pre>
</div><div><br /></div>
<div>A class in Python can be instantiated using the expression e.g. Person() which creates a new instance of class Person. When the interpreter encounters Person(), it first calls the __call__() method of Person’s parent class. Since Person is a standard new-style class, its parent class is the type metaclass, so type’s __call__() method is invoked. This __call__() method in turn invokes the __new__() and __init__() methods. If Person does not define __new__() and __init__(), default methods are inherited from Person’s ancestry. But if Person does define these methods, they override those from the ancestry, which allows for customized behavior when instantiating Person.</div>
<div>
<pre class="brush: python">def new(cls):
x = object.__new__(cls)
x.attr = 500
return x
# modify instantiation behavior of class Person by initializing an attribute attr to 500
Person.__new__ = new
g = Person()
print(g.attr) # prints 500
</pre>
</div>
<div>Python does not allow to reassign the __new__() method of the type metaclass. To customize the instantiation of the class, a custom meta class can be created by extending the type meta class and overriding the __new__() method. While defining a new class we specify that its metaclass is a custom metaclass using the metaclass keyword in the class definition, rather than the standard metaclass type. Such custom meta-class serve as a template for creating classes and referred to as class factories.</div><div>
<pre class="brush: python">class Meta(type):
def __new__(cls, name, bases, dct):
x = super().__new__(cls, name, bases, dct)
x.attr = 100
return x
def __init__(cls, name, bases, dct):
cls.attr = 100
class Foo(metaclass=Meta):
pass
print(Foo.attr)
</pre>
</div><div><br /></div><div><b><span style="font-size: medium;">Inner Classes</span></b></div><div><br /></div><div>Python also allows to have inner classes as shown in below example. We can create object of inner class inside the outer class or outside the outer class provided the outer class name is used to call it.</div>
<div>
<pre class="brush: python">class Person:
def __init__(self, name, age):
self.name = name
self.age = age
self.address = self.Address()
def show(self):
print(self.name, self.age)
self.address.show()
class Address:
def __init__(self):
self.street = "Main Street"
self.city = "Boston"
self.country = "USA"
def show(self):
print(self.street, self.city, self.country)
p1 = Person("Jimmy", 2)
p1.show()
# access attributes of inner class
print(p1.address.street)
new_address = Person.Address()
</pre>
</div>
<div><br /></div><div><b><span style="font-size: large;">Inheritance</span></b></div><div><br /></div>
<div>
Python allows a class to inherit all the methods from its parent class. It also allows multiple inheritance. Every class in Python is derived from the object class, which is the base type in Python.<pre class="brush: python">class Mammal:
def __init__(self):
print("Mammal Init")
def breathe(self):
print("breathe oxygen")
class Fish:
def __init__(self):
print("Fish Init")
def swim(self):
print("swims")
# Python does not allow an empty class body, so a 'pass' statement is added as a placeholder.
class Dog(Mammal):
    pass
class Cat(Mammal):
    def __init__(self):
        super().__init__()
        print("Cat Init")
    def runs(self):
        print("runs 30mph")
class Whale(Mammal, Fish):
    def __init__(self):
        super().__init__() # By default calls the Mammal's init method (first base class listed)
        print("Whale Init")
    def color(self):
        print("color of whale is blue") </pre>
</div>
<div><br /></div><div>Python always executes the __init__() method of the object's class. If it cannot find the __init__() method in the sub class then it calls the init method of the super class. To explicitly call the init() method, or any other method, of the super class from the subclass, the super() keyword is used. As python supports multiple inheritance, when a sub class inheriting from multiple super classes calls super().__init__() from its own __init__() method, by default it calls the init method of the first super class mentioned in the inheritance list. Python has a method resolution order which goes from left to right over the super classes in multiple inheritance. Hence the first super class mentioned in the multiple inheritance is the one called for init() or any other method invoked using super().method_name().</div>
<div><br /></div><div><br /></div><div><b><span style="font-size: large;">Method Resolution Order</span></b></div><div><br /></div><div>Method Resolution Order (MRO) determines the order in which base class methods should be inherited in the case of multiple inheritance. It defines the order in which the base classes are searched when executing a method. First the specified method or attribute is searched within the current class. If not found, the search continues into parent classes in depth-first, left-right fashion, in the order specified while inheriting the classes, without searching the same class twice. MRO ensures that a class always appears before its parents. In case of multiple parents, the order is the same as tuples of base classes. In the below example of diamond inheritance Python follows a depth-first lookup order i.e. Class D -> Class B -> Class C -> Class A, which ends up calling the method from class B.</div>
<pre class="brush: python">
class A:
def hello(self):
print(" In class A")
class B(A):
def hello(self):
print(" In class B")
class C(A):
def hello(self):
print("In class C")
# multiple inheritance
class D(B, C):
pass
r = D()
r.hello()
</pre>
<div>MRO of a class can be viewed as the __mro__ attribute or the mro() method. The former returns a tuple while the latter returns a list.</div>
<pre class="brush: python">
class X:
pass
class Y:
pass
class Z:
pass
class A(X, Y):
pass
class B(Y, Z):
pass
class M(B, A, Z):
pass
print(M.mro())
</pre>
<div><br /></div><div><b><span style="font-size: large;">Polymorphism</span></b></div><div><br /></div><div><b>Duck Typing</b>: </div><div><br /></div>
<div>
<pre class="brush: python">class PyCharm
def execute(self)
print("Compiling")
print("Running")
class Sublime
def execute(self)
print("Spell Check")
print("Compiling")
print("Running")
class Computer
# Here ide variable takes any type as long as it has the execute() method
def code(self, ide)
ide.execute()
comp = Computer()
pycharm = PyCharm()
comp.code(pycharm)
sublime = Sublime()
comp.code(sublime)
</pre>
</div>
<div><br /></div><div><b>Operator Overloading</b>: </div><div><br /></div>
<div>
<pre class="brush: python">class Business
def __init__(self, expense, sales):
self.expense = expense
self.sales = sales
def __add__(self, other):
expense = self.expense + other.expense
sales = self.sales + other.sales
business_obj = Business(expense, sales)
return business_obj
def __str__(self):
return '{} {}'.format(self.sales, self.expense)
b1 = Business(1300, 2100)
b2 = Business(9800, 7300)
# Python internally converts the + operator expression to method call Business.__add__(b1, b2)
b3 = b1 + b2
</pre>
Similar to how the '+' operator is converted to the predefined __add__() method call, other operators are converted to python methods as <a href="https://docs.python.org/3/library/operator.html" target="_blank">mentioned</a> in the operator documentation. Also, python invokes the __str__() method behind the scenes when we try to print the value of any object.
<pre class="brush: python"># Internally python calls print(b1.__str__())
print(b1)
</pre>
</div>
<div><br /></div><div><b>Method Overloading</b>: Python does not support method overloading, where we could have two methods with the same name but a different number of arguments. But python does allow setting default values for method parameters, making them optional while invoking the method, as below.</div><div><br /></div>
<div>
<pre class="brush: python">class Sample
def sum(self, a=None, b=None, c=None):
s = 0
if a != None and b != None and c != None:
s = a + b + c
elif a != None and b != None:
s = a + b
else:
s = a
return s
s1 = Sample()
print(s1.sum(4,7,9))
print(s1.sum(2,7))
</pre>
</div>
<div><br /></div><div><b>Method Overriding</b>: Python supports Method Overriding, and invokes the method in the current class rather than the super class method with same name.</div><div><br /></div>
<div>
<pre class="brush: python">class A
def show(self)
print("In Class A")
class B(A)
def show(self)
print("In Class B")
b1 = B()
b1.show()
</pre>
</div>
<div><br /></div><div><div><b><span style="font-size: large;">Abstract Class</span></b></div></div><div><br /></div><div>By default python does not support abstract classes. The ABC (Abstract Base Classes) module is used to implement abstract class in python.
<pre class="brush: python">from abc import ABC, abstractmethod
class Computer(ABC): # The abstract class with abstract methods should inherit from class ABC
@abstractmethod # Abstract methods are annotated with abstractmethod annotation
def process(self):
pass
class Laptop(Computer):
def process(self):
print("Running")
# com = Computer()   # raises TypeError, as Computer has an abstract method process() and cannot be instantiated
lap = Laptop()
lap.process()
</pre>
</div><div><br /></div><div><div><b><span style="font-size: large;">Decorators</span></b></div><div><br /></div><div>Decorators allow to add extra features to the existing functions without modifying the actual function. In other words the behavior of the existing function is changed using decorator by adding a new function which invokes the original function. It modifies the behavior of a function without permanently modifying it, by wrapping the original function with another function. Python allows to pass a function as a parameter to another function.</div><div><pre class="brush: python">def msg_decorator(function):
def wrapper():
print("Hello")
function()
print("Welcome to Python Tutorial !")
return wrapper
def say_whee():
print("Whee !")
say_whee = msg_decorator(say_whee)
say_whee()
</pre><br />Python allows to use decorators in a simpler way using the @ symbol. In the below modified say_whee() method, we use @msg_decorator which is just an easier way of saying say_whee = msg_decorator(say_whee).<pre class="brush: python">@msg_decorator
def say_whee():
print("Whee !")
</pre></div><div>The *args and **kwargs are used in the inner wrapper function so that it accepts an arbitrary number of positional and keyword arguments. Similarly, the wrapper function should return the value of the decorated function so that return values from decorated functions are preserved. Further, the name of the original function (reported by .__name__) after being decorated can be fixed by using the <a href="https://docs.python.org/library/functools.html#functools.wraps" target="_blank">@functools.wraps</a> decorator.<pre class="brush: python">import functools
def decorator(func):
@functools.wraps(func)
def wrapper_decorator(*args, **kwargs):
# Do something before
value = func(*args, **kwargs)
# Do something after
return value
return wrapper_decorator
</pre></div><div><br /></div><div><div><b><span style="font-size: medium;">Decorating Classes</span></b></div><div><br /></div><div>The methods of a class can be decorated similar as the functions. Some commonly used decorators that are built into Python are @classmethod, @staticmethod, and @property. The @classmethod and @staticmethod decorators are used to define methods inside a class namespace that’s not connected to a particular instance of that class. The @property decorator is used to customize getters and setters for class attributes.</div><div><pre class="brush: python">class Circle:
def __init__(self, radius):
self._radius = radius
@property
def radius(self):
return self._radius
# define a setter method
@radius.setter
def radius(self, value):
if value >= 0:
self._radius = value
else:
raise ValueError("Radius must be positive")
@property
def area(self):
return self.pi() * self.radius**2
def cylinder_volume(self, height):
return self.area * height
@classmethod
def unit_circle(cls):
return cls(1)
@staticmethod
def pi():
return 3.1415926535
</pre></div><div><br /></div><div>Decorators can be used on an entire class as a simpler alternative to metaclasses. The only difference is that the decorator will receive a class, and not a function, as an argument. A class can also be used as a decorator: the decorator class takes a function as an argument in its .__init__() method and stores a reference to the function. The class instance is made callable by implementing the special .__call__() method, so that it can stand in for the decorated function. The .__call__() method will be called instead of the decorated function. The functools.update_wrapper() function is used instead of @functools.wraps.</div></div><div><pre class="brush: python">import functools

class CountCalls:
    def __init__(self, func):
        functools.update_wrapper(self, func)
        self.func = func
        self.num_calls = 0

    def __call__(self, *args, **kwargs):
        self.num_calls += 1
        print(f"Call {self.num_calls} of {self.func.__name__!r}")
        return self.func(*args, **kwargs)

@CountCalls
def say_whee():
    print("Whee!")
</pre></div><div><br /></div></div><div><div><b><span style="font-size: large;">File Handling</span></b></div></div><div><br /></div>
<div>
Files can be accessed in Python using the open() function. A file can be opened in read (r), write (w) or append (a) mode, with text (character) format as the default.<pre class="brush: python">file = open("file.txt", "a+")  # 'a+' opens for appending and reading ('a' alone is write-only)
if file.readable():
    print(file.read())      # read entire file
    print(file.readline())  # read single line
file.write("something")
file.close()
</pre>There are additional modes, namely read (rb), write (wb) and append (ab), to access/write binary files. Refer to the <a href="https://www.tutorialsteacher.com/python/python-read-write-file" target="_blank">complete list of file access modes</a> for details. The below example copies image data from one file to another file.<pre class="brush: python">f1 = open('image1.jpg', 'rb')
f2 = open('image2.jpg', 'wb')
for data in f1:
    f2.write(data)
f1.close()
f2.close()

# The glob() method of a Path object searches for files or directories under that path.
from pathlib import Path
path = Path('.')
for file in path.glob('*.py'):
    print(file)
</pre>
The pathlib module provides object-oriented filesystem paths, i.e. it provides classes to work with files and directories.<pre class="brush: python">from pathlib import Path

path = Path("tempdirectory")
path.exists()
path.mkdir()
path.rmdir()
</pre>
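<div><br /></div><div>A related idiom worth noting is the with statement, which closes the file automatically even when an exception occurs while reading or writing. Below is a minimal sketch; the file name notes.txt is a hypothetical example.<pre class="brush: python"># The with statement acts as a context manager and closes the file automatically.
with open("notes.txt", "w") as f:    # hypothetical file name
    f.write("first line\n")

with open("notes.txt") as f:
    for line in f:                   # iterate over the file lazily, line by line
        print(line.strip())
</pre></div>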
</div><div><br /></div>
<div><div><b><span style="font-size: large;">Multi Threading</span></b></div></div><div><br /></div>
<div>
<pre class="brush: python">from time import sleep
from threading import *
class Greeting(Thread):
# run() is a method in Thread class which needs to be overridden to implement thread.
def run(self):
for i in range(500)
print("Hello")
sleep(1) # sleep takes number of seconds for the execution to be suspended
class Message(Thread):
def run(self):
for i in range(500)
print("Welcome")
sleep(1)
g = Greeting()
m = Message()
# The start() method of Thread class internally invokes the run() method
g.start()
m.start()
# Ask main thread to wait until thread m and thread g completes
g.join()
m.join()
print("Thread program completed")
</pre>
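<div><br /></div><div>As an alternative to subclassing Thread, a plain function can be passed as the thread's target. The sketch below is an added illustration (not from the original post) showing this style together with a Lock that protects a shared counter.<pre class="brush: python">from threading import Thread, Lock

counter = 0
lock = Lock()

def increment(times):
    global counter
    for _ in range(times):
        with lock:           # the lock keeps updates to the shared counter consistent
            counter += 1

threads = [Thread(target=increment, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)               # 4000
</pre></div>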
</div>
<div><br /></div><div><div><b><span style="font-size: large;">PIP and PyPI</span></b></div></div><div><div><br /></div><div>Pip is the standard package environment system to install and manage software package in Python. It allows you to install and manage additional packages that are not part of the <a href="https://docs.python.org/3/py-modindex.html" target="_blank">Python standard library</a>. Pip has been included with the Python installer since versions 3.4 for Python 3, as package management is important for application development.</div><div><br /></div><div>Python has a very active community that contributes an even bigger set of packages than the <a href="https://docs.python.org/3/py-modindex.html" target="_blank">Python standard library</a>, which helps with our development needs. These packages are published to the <a href="https://pypi.org/" target="_blank">Python Package Index</a>, also known as <a href="https://pypi.org/" target="_blank">PyPI</a>. PyPI hosts an extensive collection of packages that include development frameworks, tools, and libraries. For example PyPI hosts a very popular library to perform HTTP requests called <a href="https://pypi.org/project/requests/" target="_blank">requests</a>. To install any package, the command "<b>pip install <package></b>" is used. Pip then looks for the package in PyPI, calculates its dependencies, and finally installs the package. The package is installed in python 3 directory under site packages folder. The <a href="https://pip.pypa.io/en/stable/reference/pip_install/" target="_blank">pip install</a> command always installs the latest published version of a package.</div></div><div><br /></div><div>$ <b><span style="color: #2b00fe;">pip install requests</span></b></div><div><div><br /></div></div><div><div>The list command shows the packages installed in the environment.</div></div><div><div>$ <b><span style="color: #2b00fe;">pip list</span></b></div><div><br /></div><div>To view the package metadata the show command in pip is used.</div><div>$ <span style="color: #2b00fe;"><b>pip show requests</b></span></div></div><div><br /></div><div><div><div>To Install the specific version of a package we use.</div><div>$ <b><span style="color: #2b00fe;">pip install "SomeProject==1.4"</span></b></div></div><div><br /></div></div><div><div>To install greater than or equal to one version and less than another:</div><div>$ <b><span style="color: #2b00fe;">pip install "SomeProject>=1,<2"</span></b></div><div><br /></div><div>To install a version that’s “compatible” with a certain version: 4</div><div>$ <span style="color: #2b00fe;"><b>pip install "SomeProject~=1.4.2"</b></span></div><div><span style="color: #2b00fe;"><b><br /></b></span></div><div>To upgrade an already installed package to the latest from PyPI.</div><div>$ <span style="color: #2b00fe;"><b>pip install --upgrade SomeProject</b></span></div></div><div><br /></div><div><div>The --upgrade option upgrades all specified packages to the newest available version. The --force-reinstall option reinstalls all packages even if they are already up-to-date. The --ignore-installed option ignores whether the package and its deps are already installed, overwriting installed files. 
The --no-deps option doesn't install package dependencies.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">pip install --upgrade --force-reinstall --ignore-installed --no-deps <package></span></b></div></div><div><br /></div><div><div>We can create a specification of the dependencies and versions which would be used to develop and test the application. Requirement files allow us to specify exactly which packages and versions should be installed. The pip freeze command outputs all the packages installed and their versions to the standard output. This output from the freeze command, which is already in requirements format, can be redirected to generate a requirements file.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">pip freeze > requirements.txt</span></b></div><div><br /></div><div>The requirements.txt would contain output in the below format.</div><div>
<pre>scikit-learn==0.22.1
matplotlib==3.1.3
numpy==1.11.2
</pre>
</div><div>To later replicate the environment in another system, we can run pip install specifying the requirements file (or any other text file name) using the -r switch as below.</div><div>$ <b><span style="color: #2b00fe;">pip install -r requirements.txt</span></b></div><div><br /></div><div>The pip search command allows searching PyPI for any libraries using multiple search keywords.</div><div>$ <b><span style="color: #2b00fe;">pip search requests oauth</span></b></div><div><br /></div><div>A package can be uninstalled using the uninstall command.</div><div>$ <b><span style="color: #2b00fe;">pip uninstall requests</span></b></div></div><div><br /></div><div>Pip by default gets the latest version of Python packages, including the sub-dependencies, which could cause issues especially in production if a version is not backward compatible. Typically this is resolved by executing the 'pip freeze' command which pins all dependencies and freezes everything in development to the requirements.txt file. However, each specified version of the third-party packages, including their sub-dependencies, then needs to be manually updated while ensuring inter-compatibility, which can become cumbersome to manage.</div><div><br /></div><div><b><span style="font-size: large;">Virtual Environment</span></b></div><div><br /></div><div><a href="https://realpython.com/python-virtual-environments-a-primer/" target="_blank">Virtual environment</a> is an indispensable part of Python programming. A virtual environment is an isolated container containing all the software dependencies for a given project. It has its own Python executable and third-party package storage. It is important because by default software (packages) such as Python, NumPy and Django are installed in the same directory. Python stores all the system packages in a child directory of the path from "sys.prefix" and all the third party packages are placed in one of the directories pointed to by "site.getsitepackages()". This causes a problem when we want to work on multiple projects on the same computer. If a project uses one version of NumPy while some other project uses a different NumPy version, then the virtual environment provides an isolated environment to separate the two project setups effectively. There are no limits to the number of virtual environments we can have since they’re just directories containing a few scripts. The <a href="https://docs.python.org/3/library/venv.html#module-venv" target="_blank">venv</a> module, which is part of the standard library in Python 3, enables us to create the lightweight virtual environments mentioned before.</div><div><div><br /></div><div>The virtualenv tool can be installed with pip using the below command.</div><div>$ <b><span style="color: #2b00fe;">pip install virtualenv</span></b></div><div><br /></div><div>Create a new virtual environment inside a new directory 'virtual-env'. The below command works only for Python 3.</div><div>$ <b><span style="color: #2b00fe;">mkdir virtual-env && cd virtual-env</span></b></div><div>$ <b><span style="color: #2b00fe;">python3 -m venv envname</span></b></div><div><br /></div><div><div><div><div>Alternatively we can use the below command to create a virtual environment named 'envname', using the default Python version. It creates the virtual environment within the current directory.</div><div>$ <b><span style="color: #2b00fe;">virtualenv envname</span></b></div><div><br /></div><div>Create a virtual environment using a specified Python version. 
</div><div>$ <b><span style="color: #2b00fe;">virtualenv envname -p python3</span></b></div><div>$ <b><span style="color: #2b00fe;">virtualenv envname -p /usr/local/bin/python3</span></b></div></div><div><br /></div></div></div><div>The above command creates a env directory which contains bin directory with all python binaries, include directory with python packages and finally the lib directory were all third party dependencies are installed in site-packages. The activate scripts in the bin directory allow to to set up the shell to use the environment’s Python executable and its site-packages by default. In order to use this environment's packages/resources in isolation, we need to “activate” it using the below command.</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">source envname/bin/activate</span></b></div><div><br /></div><div>For windows the command is '.\Scripts\activate'. With the above command the prompt is prefixed with the name of the environment (envname) indicating that the 'envname' environment is active. To exit from the environment use the deactivate command.</div><div><b>(envname)</b> $ <span style="color: #2b00fe;"><b>deactivate</b></span></div><div><br /></div><div>On activation of the environment, the virtual environment’s bin directory is the first directory searched when running an executable on the command line. Thus, the shell uses our virtual environment’s instance of Python instead of the system-wide version.</div><div><br /></div><div>The <a href="https://virtualenvwrapper.readthedocs.org/en/latest/" target="_blank">virtualenvwrapper</a> tool is a wrapper scripts around the main virtualenv tool, which helps in organizing all the virtual environments in one location. It also provides methods to easily create, delete, and copy environments, and switch between multiple environments.</div></div><div><br /></div><div><br /></div><div><div><b><span style="font-size: large;">Pipenv</span></b></div><div><br /></div><div><a href="https://realpython.com/pipenv-guide/" target="_blank">Pipenv</a> is a packaging tool which simplifies the dependency management in Python-based projects. It brings together Pip, Pipfile and Virtualenv to provide a straightforward and powerful command line tool. Pipenv has virtual environment management built in which makes it a single tool for both package and environment management. The virtual environment for the project is created and managed by Pipenv when packages are installed via Pipenv’s command-line interface. Dependencies are tracked and locked, with development and production dependencies managed separately. Pipenv is installed using below pip command</div><div><br /></div><div>$ <b><span style="color: #2b00fe;">pip install pipenv</span></b></div><div><br /></div><div>The pipenv install command is used to install (all or specific) packages within the project. It also creates two files, Pipfile and Pipfile.lock, and a new virtual environment in the project directory. If no python version is specified, it uses the default version of Python. The Pipfile contains dependency information about the project and is used to track the project dependencies. The Pipfile supercedes the requirements.txt file that is typically used in Python projects. 
The pipenv install command installs all the packages specified within the Pipfile.</div><div><br /></div><div><div>$ <b><span style="color: #2b00fe;">pipenv install</span></b></div><div><br /></div><div><div>A specific package can be installed or removed using pipenv's install and uninstall commands as below.</div><div>$ <b><span style="color: #2b00fe;">pipenv install requests</span></b></div><div>$ <b><span style="color: #2b00fe;">pipenv uninstall requests</span></b></div><div><br /></div><div><div>Pipenv enables us to keep the two environments separate using the --dev flag, for example installing pytest only as a development dependency.</div><div>$ <b><span style="color: #2b00fe;">pipenv install pytest --dev</span></b></div></div><div><br /></div><div><div>To completely wipe all the installed packages from your virtual environment, we use the below command.</div><div>$ <b><span style="color: #2b00fe;">pipenv uninstall --all</span></b></div></div><div><br /></div><div>The package name along with its version and its dependencies can be frozen by updating the Pipfile.lock. This is done using the below lock command.</div><div>$ <b><span style="color: #2b00fe;">pipenv lock</span></b></div></div><div><br /></div><div><div>In order to activate or create the virtual environment associated with the Python project, the below pipenv shell command is used.</div><div>$ <b><span style="color: #2b00fe;">pipenv shell</span></b></div><div><br /></div><div>To invoke shell commands in the virtual environment, without explicitly activating it first, the run command is used.</div><div>$ <b><span style="color: #2b00fe;">pipenv run <insert command here></span></b></div></div><div><br /></div><div><div>Pipenv's graph command displays a dependency graph to understand the top-level dependencies and their sub-dependencies.</div><div>$ <b><span style="color: #2b00fe;">pipenv graph</span></b></div></div><div><br /></div><div><a href="https://realpython.com/pipenv-guide/" target="_blank">Pipenv</a> supports the automatic loading of environmental variables when a .env file (containing key-value pairs) exists in the top-level directory. In such a case, when the pipenv shell command opens the virtual environment, it loads the environmental variables from the file. The default behavior of Pipenv can be changed using some <a href="https://docs.pipenv.org/advanced/#configuration-with-environment-variables" target="_blank">environmental variables for configuration</a>.</div><div><br /></div><div>Pipenv has many advanced features such as specifying a package index, detection of security vulnerabilities, easily handling environment variables, and playing nicely with Windows. Some of the drawbacks of pipenv include the generation of many miscellaneous files in the project root directory, performance issues with dependency resolution and complex commands/options. Also pipenv does not manage the scaffolding (internal structure) of the project unlike <a href="https://www.infoworld.com/article/3527850/better-python-project-management-with-poetry.html" target="_blank">Poetry</a>.</div><div><br /></div></div></div><div><br /></div><div><b><span style="font-size: large;">Poetry</span></b></div><div><br /></div><div><a href="https://python-poetry.org/docs/cli/" target="_blank">Poetry</a> is a modern tool which simplifies dependency management and packaging in Python. It manages project dependencies, creation and activation of virtualenv, build and publishing of packages, ensures package integrity and allows converting Python functions to command line programs. 
It also provides a directory structure for the project including tests.</div><div><br /></div><div>Install poetry using the below pip command.</div><div>$ <b><span style="color: #2b00fe;">pip install poetry</span></b></div><div><br /></div><div>Below are some of the <a href="https://python-poetry.org/docs/cli/" target="_blank">poetry commands</a> to set up a project, fetch dependencies and execute the project.</div><div><br /></div><div>It is recommended to configure poetry to create the project's virtual environment in the .venv folder inside the project directory, before using poetry to create a project. This is very handy when using IDEs like <a href="https://code.visualstudio.com/" target="_blank">VS Code</a> and <a href="https://www.jetbrains.com/pycharm/" target="_blank">PyCharm</a> as they immediately recognize the .venv folder and pick up the correct interpreter.</div><div>$ <b><span style="color: #2b00fe;">poetry config virtualenvs.in-project true</span></b></div><div><br /></div><div>Create a new project by creating a directory structure for the project.</div><div>$ <b><span style="color: #2b00fe;">poetry new poetry-demo</span></b></div><div><br /></div><div>The above command creates the project's directory structure containing the package and tests directories, and the <a href="http://blog.networktocode.com/post/upgrade-your-python-project-with-poetry/" target="_blank">pyproject.toml</a> file. The .toml file defines the project metadata, all project build dependencies, development dependencies to perform other actions like testing, building, documentation, etc., and finally the build system. Poetry also automatically creates the virtual environment for the project, if it detects no virtual environment already associated with the project. The environment info command in poetry displays the path of the current virtual environment with other details.</div><div>$ <b><span style="color: #2b00fe;">poetry env info</span></b></div><div><br /></div><div>The poetry init command creates a pyproject.toml file interactively by prompting to provide basic information about the package.</div><div>$ <b><span style="color: #2b00fe;">poetry init</span></b></div><div><br /></div><div>The install command reads the pyproject.toml file from the current project, resolves all the dependencies, and installs them.</div><div>$ <b><span style="color: #2b00fe;">poetry install</span></b></div><div><br /></div><div>During the installation, poetry automatically generates the poetry.lock file to track the exact versions of the dependencies that have been installed on the system. If the poetry.lock file is already present, it would install the exact version of packages defined in the lock file, instead of trying to install the latest one from PyPI. It is recommended to track poetry.lock along with the pyproject.toml file in source control.</div><div><br /></div><div><div>In order to get the latest versions of the dependencies and to update the poetry.lock file, you should use the update command.</div><div>$ <b><span style="color: #2b00fe;">poetry update</span></b></div></div><div><br /></div><div>The add command adds required packages to your pyproject.toml and installs them. 
The --dev flag is used to install a development dependency which is not directly related to the project.</div><div>$ <b><span style="color: #2b00fe;">poetry add pandas</span></b></div><div>$ <b><span style="color: #2b00fe;">poetry add flake8 --dev</span></b></div><div><br /></div><div>The lock command locks (without installing) the dependencies specified in pyproject.toml.</div><div>$ <b><span style="color: #2b00fe;">poetry lock</span></b></div><div><br /></div><div>The show command lists all of the available packages. The -t or --tree option lists all the dependencies as well as the sub-dependencies in tree format. The --latest option allows checking whether a newer version is available for the package dependency.</div><div>$ <b><span style="color: #2b00fe;">poetry show</span></b></div><div>$ <b><span style="color: #2b00fe;">poetry show --tree</span></b></div><div>$ <b><span style="color: #2b00fe;">poetry show --latest</span></b></div><div><br /></div><div>The run command executes the given command inside the project's virtual environment. It can also execute one of the scripts defined in pyproject.toml.</div><div>$ <b><span style="color: #2b00fe;">poetry run pytest</span></b></div><div>$ <b><span style="color: #2b00fe;">poetry run python main.py</span></b></div><div><br /></div><div><div>Poetry also provides a shell command that spawns a new shell directly inside the virtual environment. It enables us to execute commands in the virtual environment without prefixing them with 'poetry run'.</div><div>$ <b><span style="color: #2b00fe;">poetry shell</span></b></div><div>> <b><span style="color: #2b00fe;">pytest</span></b></div></div><div><br /></div><div><div>The update command can be used to update all or specific package dependencies.</div><div>$ <b><span style="color: #2b00fe;">poetry update</span></b></div><div>$ <b><span style="color: #2b00fe;">poetry update pytest</span></b></div><div><br /></div><div>The remove command is used to remove the package from the project. To remove a development package the -D or --dev option must be specified.</div><div>$ <b><span style="color: #2b00fe;">poetry remove requests</span></b></div><div>$ <b><span style="color: #2b00fe;">poetry remove -D pytest</span></b></div></div><div><br /></div><div>The build command builds the source and wheel archives of the current project.</div><div>$ <b><span style="color: #2b00fe;">poetry build</span></b></div><div><br /></div></div></div><div><div>The publish command deploys the package built previously to either a public or private repository. Specifying the --build option allows building the package before publishing. By default the publish command publishes the package to the <a href="https://pypi.org/" target="_blank">PyPI</a> public repository. To switch to a private repository, the -r option is used along with the name of the private repository configured using the config command.</div><div>$ <b><span style="color: #2b00fe;">poetry publish</span></b></div><div>$ <b><span style="color: #2b00fe;">poetry publish -r privrepo -u username -p password</span></b></div><div><br /></div><div>The config command enables configuring a private repository name with the repository URL. 
</div><div>$ <b><span style="color: #2b00fe;">poetry config repositories.privrepo https://private.repository</span></b></div><div><br /></div><div>We can also store credentials for the private repository using the config command.</div><div>$ <b><span style="color: #2b00fe;">poetry config http-basic.privrepo username password</span></b></div><div><br /></div><div>The credentials for pypi can also be configured using the previous command and replacing 'privrepo' by 'pypi' but it is now recommended to use <a href="https://pypi.org/help/#apitoken" target="_blank">API tokens</a> to authenticate with pypi as below.</div><div>$ <b><span style="color: #2b00fe;">poetry config pypi-token.pypi my-token</span></b></div></div><div><br /></div>Pranav Patilhttp://www.blogger.com/profile/10125896717744035325noreply@blogger.com0tag:blogger.com,1999:blog-7610061084095478663.post-77250786624980816672020-08-02T11:32:00.016-07:002021-08-09T23:57:06.414-07:00Flink - Stream Processing in Real Time<div>A decade ago most of the data processing and analysis within software industry was carried on by batch systems with some lag time. Now as new technologies and platforms evolve, many organizations are gradually shifting towards a stream-based approach to process their data on the fly, since most of the data is being streamed. The concept isn't new, and was explored within Dataflow programming in the 1980s. Today majority of the large-scale data processing systems handle the data, which is produced continuously over time. These continuous streams of data come from variety of sources, for example web logs, application logs, sensors, or as changes to application state in databases (transaction log records) etc. Stream processing continuously incorporates new data to compute a result, with the input data being unbounded. It can work with a lot less hardware than batch processing as data is processed as it comes, spreading the processing over time. </div><div><br /></div><div>Apache Storm was the first widely used framework for stream processing were processing programs ran continuously on data and produce outcomes in real-time, while the data is generated. <a href="https://emprovisetech.blogspot.com/2020/03/spark-inmemory-processing.html" target="_blank">Apache Spark</a> became the most popular, matured and widely adopted streaming platform with many features such as structured streaming, custom memory management, watermarks, event time processing support etc. Spark is essentially a batch with Spark streaming as micro-batching and special case of Spark Batch. It is not truly real time processing and has some latency. It is also stateless by design and tuning Spark becomes challenging with many parameters. </div><div><br /></div><div>Apache Flink is an open source platform for distributed stream and batch data processing. It is essentially a true streaming engine with a special batch processing mode of streaming with bounded data. The core of Apache Flink is a streaming dataflow engine, which supports communication, distribution and fault tolerance for distributed stream data processing. Apache Flink is a hybrid platform for supporting both batch and stream processing. It supports different use cases based on real-time processing, machine learning projects, batch processing, graph analysis and others. Flink does not implement the allocation and management of compute resources in a cluster, process coordination, highly available durable data storage, and failure recovery. 
Instead, it focuses on its core function, distributed data stream processing and leverages existing cluster infrastructure and services. Flink is well integrated with cluster resource managers, such as Apache Mesos, YARN, and Kubernetes, but can also be configured to run as a stand-alone cluster. Flink does not provide durable, distributed storage. Instead, it takes advantage of distributed filesystems like HDFS or object stores such as S3. Flink depends on Apache ZooKeeper for leader election in highly available setups.</div><div><div><br /></div></div><div><b><font size="5">Streaming Basics</font></b></div><div><br /></div><div>Dataflow program describes how data flows between operations. Dataflow programs are commonly represented as directed graphs, where nodes are called operators and represent computations and edges represent data dependencies. Operators are basic functional units in the data flow which consume the data from inputs, perform a computation on them, and produce data to outputs for further processing. A dataflow graph is a directed acyclic graph (DAG) that consists of stateful operators and data streams which represent the data produced by an operator and available for consumption by operators. A data stream is a potentially unbounded sequence of events. Operators without input ports are called data sources and operators without output ports are called data sinks. A dataflow graph must have at least one data source and one data sink. The operators are parallelized into one or more parallel instances called subtasks and streams are split into one or more stream partitions, with one partition per subtask. Each operator might have several parallel tasks running on different physical machines. Ideally any streaming application should have low latency with few milliseconds for processing an event and high throughput i.e. rate of processing events which is measured in events or operations per time unit. The peak throughput is the performance limit when the system is at its maximum load. Streaming operations can be stateless or stateful were they maintain information about the events they have received before. Stateful stream processing applications are more challenging to parallelize and operate in a fault-tolerant manner because state needs to be efficiently partitioned and reliably recovered in the case of failures. Data ingestion is the operation of fetching raw data from external sources and converting it into a format suitable for processing. Data egress on the other hand is the operation of producing output in a form suitable for consumption by external systems. Transformation operations are single-pass operations that process each event independently. These operations consume one event after the other and apply some transformation to the event data, producing a new output stream.</div><div><br /></div><div><a href="https://en.wikipedia.org/wiki/Lambda_architecture" target="_blank">Lambda architecture</a> is a pattern which combine batch and stream processing systems to implement multiple paths of computation. A streaming fast path for timely approximate results, and a batch offline path for late accurate results. These approaches suffer from high latency (imposed by batches), high complexity, as well as arbitrary inaccuracy since the time is not explicitly managed by the application code. Batch programs are special cases of streaming programs, where the stream is finite, and the order and time of records does not matter. 
Flink has a specialized API for processing static data sets and uses specialized data structures and algorithms for the batch versioned operators such as join or grouping, and uses dedicated scheduling strategies. The result is that Flink presents itself as a full-fledged and efficient batch processor on top of a streaming runtime, including libraries for graph analysis and machine learning.</div><div><div><br /></div><div><b><font size="5">Flink Components</font></b></div><div><br /></div><div>The core of Flink is the distributed dataflow engine, which executes dataflow programs. A Flink runtime program is a DAG of stateful operators connected with data streams. Flink provides two main core APIs: the DataStream API for both bounded and unbounded streams, and the DataSet API for the bounded data sets. A DataSet is treated internally as a stream of data. They provide common building blocks for data processing, such as various forms of transformations, joins, aggregations, windows, state, etc. Both the DataSet API and DataStream APIs create runtime programs executable by the dataflow engine. Flink bundles domain-specific libraries and APIs that generate DataSet and DataStream API programs, namely, FlinkML for machine learning, <a href="https://ci.apache.org/projects/flink/flink-docs-stable/dev/libs/gelly/graph_algorithms.html" target="_blank">Gelly</a> for graph processing and Table API for SQL-like operations. <a href="https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/libs/gelly/" target="_blank">Gelly</a> is a Graph API for Flink which simplifies graph analysis within Fink applications. Gelly allows to transform and modify the graphs using high-level functions, by providing methods to create, transform and modify graphs. <a href="https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/libs/ml/index.html" target="_blank">FlinkML</a> was the legacy Machine Learning (ML) library for Flink. It is <a href="http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/SURVEY-Usage-of-flink-ml-and-DISCUSS-Delete-flink-ml-td29057.html" target="_blank">deprecated/removed</a> in Flink 1.9, with the new Flink-ML interface being developed the umbrella of <a href="https://cwiki.apache.org/confluence/display/FLINK/FLIP-39+Flink+ML+pipeline+and+ML+libs" target="_blank">FLIP-39</a>, and is being actively extended under <a href="https://issues.apache.org/jira/browse/FLINK-12470" target="_blank">FLINK-12470</a>. The <a href="https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/" target="_blank">Table API </a>is an extension of relational model were Tables have a schema attached with the API offering comparable operations, such as select, project, join, group-by, aggregate, etc. The Table API programs goes through an optimizer that applies optimization rules before execution. Flink also offers support for SQL query expressions, which is similar to the Table API both in semantics and expressiveness. The SQL abstraction closely interacts with the Table API, and SQL queries can be executed over tables defined in the Table API. 
<a href="https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/libs/cep.html" target="_blank">FlinkCEP</a> is the Complex Event Processing (CEP) library implemented on top of Flink which allows to detect event patterns in an endless stream of events.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-RLlTtNLIsV4/XyZhWDHAdSI/AAAAAAAAh3s/eCFqU8WeQgICHrwyjhlOFoNpYW7QlOCJgCLcBGAsYHQ/s781/flink-stack-frontpage.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="462" data-original-width="781" height="379" src="https://1.bp.blogspot.com/-RLlTtNLIsV4/XyZhWDHAdSI/AAAAAAAAh3s/eCFqU8WeQgICHrwyjhlOFoNpYW7QlOCJgCLcBGAsYHQ/w640-h379/flink-stack-frontpage.png" width="640" /></a></div><div><br /></div><div><br /></div><div><font size="5"><b>Flink Architecture</b></font></div><div><br /></div><div>The <a href="https://ci.apache.org/projects/flink/flink-docs-release-1.11/concepts/flink-architecture.html" target="_blank">Flink runtime</a> consists of two types of processes: a JobManager and one or more TaskManagers high are responsible for executing the applications and failure recovery. The JobManagers and TaskManagers can be started either directly on the machines as a standalone cluster, or in containers, or managed by resource frameworks like YARN or Mesos. Once started the TaskManagers connect to the JobManager making themselves available to be assigned work.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-qI32HqHCNeI/Xxp_wbaibZI/AAAAAAAAh24/ozXJeNw-OTogrvPwYjOPFpRzF71_3SujACLcBGAsYHQ/s2048/processes.gif" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1475" data-original-width="2048" height="460" src="https://1.bp.blogspot.com/-qI32HqHCNeI/Xxp_wbaibZI/AAAAAAAAh24/ozXJeNw-OTogrvPwYjOPFpRzF71_3SujACLcBGAsYHQ/w640-h460/processes.gif" width="640" /></a></div><div><br /></div><div><br /></div><div><b>JobManager</b></div><div><br /></div><div>The JobManager is the master process that controls the execution of a single application. Each application is controlled by its own separate JobManager. The JobManager receives an application consisting of JobGraph (logical dataflow graph) and JAR file containing required classes and libraries for execution. The JobManager converts the JobGraph into a physical dataflow graph called the ExecutionGraph, which consists of tasks that can be executed in parallel. The JobManager requests the necessary resources (TaskManager slots) to execute the tasks from the ResourceManager. Once it receives enough TaskManager slots, it distributes the tasks of the ExecutionGraph to the TaskManagers that execute them. The JobManager coordinates the distributed execution of the dataflow. It tracks the state and progress of each task (operator), schedules new tasks, and coordinates checkpoints, savepoints and recovery.</div><div><br /></div><div><div><b>TaskManager</b></div><div><br /></div><div>A TaskManager is a JVM (worker) process, and executes one or more subtasks in separate threads. It executes the tasks of a dataflow, and buffer and exchange the data streams. Each TaskManager provides a certain number of task slots (unit of resource scheduling) which limit the number of tasks the TaskManager can execute. In other words, it represents the number of concurrent processing tasks executed by TaskManager. 
The task slot ensures that subtask will not compete with other subtasks from different jobs for managed memory, but instead has an amount of reserved managed memory. By default, Flink allows subtasks to share task slots even if they are subtasks of different tasks, so long as they are from the same job. Hence one task slot may be able to hold an entire job pipeline thus providing better resource utilization. The TaskManager once started registers its slots to the ResourceManager. When instructed by the ResourceManager, the TaskManager offers one or more of its slots to a JobManager. The JobManager can then assign tasks to the slots to execute them. The TaskManager reports the status of the tasks to the JobManager. During execution, a TaskManager exchanges data with other TaskManagers running tasks of the same application. There must always be at least one TaskManager.</div></div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-tL12iU3fzm0/XycJTJILGcI/AAAAAAAAh4E/P3OMORkEYDIAnchh4AAaDwwBvWIulDj2ACLcBGAsYHQ/s2595/slot.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1212" data-original-width="2595" height="299" src="https://1.bp.blogspot.com/-tL12iU3fzm0/XycJTJILGcI/AAAAAAAAh4E/P3OMORkEYDIAnchh4AAaDwwBvWIulDj2ACLcBGAsYHQ/w640-h299/slot.png" width="640" /></a></div><div><br /></div><div><br /></div><div><b>ResourceManager</b></div><div><br /></div><div>The ResourceManager is responsible for allocation/deallocation of resources and provisioning them in a Flink cluster. It manages TaskManager slots, which is the unit of resource scheduling in a Flink cluster. When the ResourceManager receives TaskManager slot request from JobManager, it instructs a TaskManager with idle slots to offer them to the JobManager. If the ResourceManager does not have enough slots to fulfill the JobManager’s request, the ResourceManager can talk to a resource provider to provision containers in which TaskManager processes are started. The ResourceManager also takes care of terminating idle TaskManagers to free compute resources. Flink implements multiple ResourceManagers for different environments and resource providers such as YARN, Mesos, Kubernetes and standalone deployments. In a standalone setup, the ResourceManager can only distribute the slots of available TaskManagers and cannot start new TaskManagers on its own.</div><div><br /></div><div><b>Dispatcher</b></div><div><br /></div><div>The Dispatcher provides a REST interface to submit Flink applications for execution and starts a new JobMaster for each submitted job. It also runs the Flink WebUI to provide information about job executions.The REST interface enables the dispatcher to serve as an HTTP entry point to clusters that are behind a firewall. The dispatcher also runs a web dashboard to provide information about job executions.</div><div><br /></div><div><b>JobMaster</b></div><div><br /></div><div>JobMaster is one of the components running in the JobManager. It is responsible for supervising the execution of the Tasks of a single job and thus manages the execution of a single JobGraph. JobMaster is responsible for execution of multiple jobs in parallel within a Flink cluster, each having its own JobMaster. There is always at least one JobManager. 
A high-availability setup might have multiple JobManagers, one of which is always the leader, and the others are standby.</div><div><br /></div><div><b>Operator Chaining</b></div><div><br /></div><div>Flink chains operator (e.g. two subsequent map transformations) subtasks together into tasks by default for distributed execution. Each task is executed by one thread. Chaining two subsequent transformations means co-locating them within the same thread for better performance. Chaining operators together into tasks reduces the overhead of thread-to-thread handover and buffering, and increases overall throughput while decreasing latency. The StreamExecutionEnvironment.disableOperatorChaining() method is used to disable chaining in the whole job.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-tcZdD8WZmGg/XycH9_KDxFI/AAAAAAAAh34/34ODBPpoN5sJxYIcFqWmWbknG3ObzVTIQCLcBGAsYHQ/s2048/chains.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1378" data-original-width="2048" height="431" src="https://1.bp.blogspot.com/-tcZdD8WZmGg/XycH9_KDxFI/AAAAAAAAh34/34ODBPpoN5sJxYIcFqWmWbknG3ObzVTIQCLcBGAsYHQ/w640-h431/chains.png" width="640" /></a></div><div><br /></div><div><br /></div><div><b>Flink Application Execution</b></div><div><div>Flink Application is a user program which spawns one or multiple Flink jobs, either on local JVM or remote clusters, from its main() method. The ExecutionEnvironment which can be either LocalEnvironment or RemoteEnvironment, provides methods to control the job execution (by setting parallelism) and to interact with the outside resources. The jobs of a Flink Application can either be submitted to a long-running Flink Session Cluster, a dedicated Flink Job Cluster, or a Flink Application Cluster.</div><div><br /></div><div><b>Flink Session Cluster</b> is long running pre-existing cluster which can accept multiple job submissions. The cluster continues to run even after all the jobs are finished, and it life is not bounded by any Flink job. All the jobs share same cluster and compete with each other for cluster resources (like network bandwidth). Such pre-existing cluster saves time for acquiring the resources and starting the TaskManagers, but the shared setup means a crash of one TaskManager will fail all the jobs running over it.</div><div><br /></div><div><b>Flink Job Cluster</b> the cluster manager such as YARN or Kubernetes is used to spin up a cluster for each submitted job and is only available for that job. When the client requests resources from the cluster manager and submits the job to the Dispatcher, TaskManagers are allocated lazily based on job's resource requirements. Once the job is finished, the Flink Job Cluster is torn down. Since the cluster is restricted to a single job, failure does not affect's other jobs. Since allocation of resources takes time, Flink Job Clusters are generally used for long running and time sensitive requirements.</div><div><br /></div><div><b>Flink Application Cluster</b> is a dedicated Flink cluster that only executes jobs from one Flink Application and where the main() method runs on the cluster rather than the client. The job submission is done by packaging the application classes & libraries into a JAR and the cluster entrypoint (ApplicationClusterEntryPoint) is responsible for calling the main() method to extract the JobGraph. 
Flink Application Cluster continues to run until the lifetime of the Flink Application.</div></div><div><br /></div><div><div><font size="5"><b>Stateful Stream Processing</b></font></div><div><br /></div><div>The <a href="https://ci.apache.org/projects/flink/flink-docs-release-1.11/concepts/stateful-stream-processing.html" target="_blank">stateful operations</a> are the one were they remember information across multiple events, by storing data across the processing of individual elements/events. Every function and operator in Flink can be stateful. State enables Flink to be fault tolerant using checkpoints and savepoints. Flink redistributes the state across multiple parallel instances, thus rescaling its applications. Keyed state is a type of state maintained in an embedded key/value store. Key state is partitioned and distributed strictly together with the streams that are read by the stateful operators. Hence, access to the key/value state is only possible on keyed streams and is restricted to the values associated with the current event’s key. Aligning the keys of streams and state makes ensures that all state updates are local operations, thus guaranteeing consistency without transaction overhead. This also allows Flink to redistribute the state and adjust the stream partitioning transparently. Keyed State is organized into Key Groups which are an atomic unit by which Flink can redistribute Keyed State.</div><div><br /></div><div><b><font size="5">Checkpoints</font></b></div><div><br /></div><div>Flink implements <a href="https://ci.apache.org/projects/flink/flink-docs-release-1.11/learn-flink/fault_tolerance.html" target="_blank">fault tolerance</a> using a combination of stream replay and checkpointing. A checkpoint marks a specific point in each of the input streams along with the corresponding state, for each of the operators. Flink's fault tolerance mechanism continuously draws <a href="https://arxiv.org/abs/1506.0860" target="_blank">consistent snapshots</a> of the distributed data stream and operator state asynchronously. These snapshots act as consistent checkpoints to which the system can fall back. The stream flow is resumed from the checkpoint while maintaining consistency by restoring the state of the operators and replaying the records from the point of the checkpoint. These snapshots are very light-weight and can be drawn frequently without much impact on performance. The state of the streaming applications is usually stored at a configurable distributed file system. In case of failure, Flink stops the distributed streaming dataflow. The system then restarts the operators and resets them to the latest successful checkpoint. The input streams are reset to the point of the state snapshot. Any records that are processed as part of the restarted parallel data-flow are guaranteed to not have affected the previously checkpointed state. Batch programs in Flink does not use any checkpointing and recovers by fully replaying the streams since its inputs are bounded.</div></div><div><br /></div><div>By default <a href="https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/stream/state/checkpointing.html" target="_blank">checkpoints</a> are disabled in Flink and can be enabled by calling enableCheckpointing(interval) on the StreamExecutionEnvironment, where the checkpoint interval is passed in milliseconds. 
Some of the other checkpoint parameters include guarantee level (exactly-once or at-least-once), checkpoint timeout, minimum time between checkpoints (ensuring some progress between checkpoints), number of concurrent checkpoints (by default is 1), externalized checkpoints (which write their meta-data to a persistent storage to resume on failure), fail/continue task on checkpoint errors, prefer checkpoint for recovery and unaligned checkpoints (reduces checkpointing times under backpressure). The checkpoints are stored in memory, file system or database based on the State backend configuration. By default, state is kept in memory in the TaskManagers and checkpoints are stored in memory in the JobManager. The state backend can be configured via StreamExecutionEnvironment.setStateBackend(..). Flink currently only provides processing guarantees for jobs without iterations. Hence enabling checkpoint on an iterative job causes an exception, which still can be forced though by using env.enableCheckpointing(interval, CheckpointingMode.EXACTLY_ONCE, force = true) flag.</div><div><br /></div><div><b>Stream Barriers</b></div><div><br /></div><div><div>Stream Barriers is a core element in Flink’s distributed snapshotting as they are injected into the data stream and flow with the records as part of the data stream. A barrier separates the records in the data stream into the set of records that goes into the current snapshot, and the records that go into the next snapshot. Every barrier carries the ID of the snapshot whose records it pushed in front of it. Barriers are very lightweight. Multiple barriers from different snapshots can be in the stream at the same time. Stream barriers are injected into the parallel data flow at the stream sources. Barriers for snapshot are injected in the source stream at the position up to which the snapshot covers the data and it flows downstream. When an intermediate operator receives a barrier for snapshot from all of its input streams, it emits a barrier for snapshot into all of its outgoing streams. The operator cannot process any further records from the incoming stream after receiving snapshot barrier, until it has receives the same barrier from the other inputs as well. When the sink operator receives the barrier from all of its input streams, it acknowledges that snapshot to the checkpoint coordinator. After all sinks have acknowledged a snapshot, it is considered completed. When the last stream receives the barrier, the operator emits all pending outgoing records, and then emits snapshot's barriers itself. The operator snapshots the state, resumes processing records from all the input streams, and finally writes the state asynchronously to the state backend. Any state present within the operators becomes the part of the snapshots. Operators snapshot their state at the point in time when they have received all snapshot barriers from their input streams, and before emitting the barriers to their output streams. Only the updates to the state from the records before the barriers are applied and the state is stored in a preconfigured reliable distributed storage. In case of a failure Flink selects the latest completed checkpoint, with the system redeploying the entire distributed dataflow and gives each operator the state that was snapshotted as part of checkpoint. 
Such checkpointing is called aligned checkpointing, were the sources are set to start reading the stream from checkpoint's snapshot position.</div><div><br /></div><div>Starting with Flink 1.11, the checkpoints can overtake all in-flight data as long as the in-flight data becomes part of the operator state, also known as unaligned checkpointing. In unaligned checkpointing, operators first recover the in-flight data before starting processing any data from upstream operators in unaligned checkpointing, performing same recovery steps as aligned checkpoints. It ensures that barriers are arriving at the sink as fast as possible. The alignment step could add latency to the streaming program which is usually a few milliseconds or in some cases noticeably higher. The alignments can be skipped when causing high latency, in which case the operator continues processing all inputs, even after some checkpoint barriers for checkpoint has arrived. Alignment happens only for operators with multiple predecessors (joins) as well as operators with multiple senders (after a stream repartitioning/shuffle). </div><div><br /></div><div><b>Savepoints</b></div><div><br /></div><div>All programs that use checkpointing can resume execution from a savepoint. They allow updating both the programs and the Flink cluster without losing any state. Savepoints are manually triggered checkpoints, which take a snapshot of the program and write it out to a state backend, using the checkpointing mechanism. Savepoints are similar to checkpoints except that they are triggered by the user and don’t automatically expire when newer checkpoints are completed.</div><div><br /></div></div><div><br /></div><div><div><font size="5"><b>Timely Stream Processing</b></font></div><div><br /></div><div><a href="https://ci.apache.org/projects/flink/flink-docs-release-1.11/concepts/timely-stream-processing.html" target="_blank">Timely stream processing</a> is an extension of stateful stream processing in which time plays crucial role in the computation, for example time series analysis such as aggregation by time periods. There are different notions of time in streaming, namely, Event Time and Processing Time. </div><div><br /></div><div><b>Processing Time</b></div><div><br /></div><div>Processing time refers to using the system time (clock) of the machine that is executing the respective operation. Processing time requires no coordination between streams and machines. It provides the best performance and the lowest latency. Processing time cannot be determined always in a distributed and asynchronous environments, as its susceptible to the speed at which records arrive and flow between operators within the system, as well as other outages. </div><div><br /></div><div><b>Event Time</b></div><div><br /></div><div>Event time is the time that each individual event occurred on its producing device and is embedded within the records before they enter Flink. In event time, the progress of time depends on the data, not on any wall clocks. Event time programs specifies how to generate Event Time Watermarks, which indicates the progress in event time. Ideally event time yields consistent and deterministic results, regardless of the arrival order by timestamp of the events. But the latency incurred by waiting for out-of-order events which is a finite time period limits its deterministic nature. 
When all the data arrives, the event time produces consistent and correct results after working through the out-of-order late events.</div><div><br /></div><div><b>Watermarks</b></div><div><br /></div><div>Flink stream processor uses the watermarks mechanism to measure the progress in event time and for example, to get notified when to close the window in progress after the event time passes beyond an hour for an hourly window operator. Watermarks are part of the data stream and carry a timestamp to determine that the event time has reached a specified time within that stream. It indicates that there should be no more elements in the stream with a timestamp lower than the watermark timestamp, after the specified watermark point. In other words watermark is a declaration that by specified point in the stream, all events up to a certain timestamp should have arrived. This makes watermarks crucial for processing out-of-order streams were events are not ordered by their timestamps. The operator also advances its internal event time clock to the value of the watermark after reaching the watermark. Watermarks are generated at, or directly after source functions, defining the event time at a particular parallel source. Each parallel subtask of a source function usually generates its watermarks independently.As the watermarks flow through the streaming program, they advance the event time at the operators where they arrive. Whenever an operator advances its event time, it generates a new watermark downstream for its successor operators. It is possible that certain elements will violate the watermark condition, by arriving with timestamp even after the Watermark for corresponding timestamp has already occurred. In real world it becomes difficult to specify the time by which all the elements of a certain event timestamp will have occurred. Delaying watermarks causes delays in evaluation of event time windows.</div><div><br /></div></div><div><div><b><font size="5">DataStream API</font></b></div></div><div><br /></div><div><a href="https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/datastream_api.html" target="_blank">DataStream API</a> provides APIs to implement transformation on data streams, e.g. filtering, updating state, defining windows, aggregating.</div><div><br /></div><div>The first step of the Flink program is to create a context for execution using <a href="https://ci.apache.org/projects/flink/flink-docs-release-1.2/api/java/org/apache/flink/streaming/api/environment/StreamExecutionEnvironment.html" target="_blank">StreamExecutionEnvironment</a>, which can be <a href="https://ci.apache.org/projects/flink/flink-docs-release-1.2/api/java/org/apache/flink/streaming/api/environment/LocalStreamEnvironment.html">LocalStreamEnvironment</a> or <a href="https://ci.apache.org/projects/flink/flink-docs-release-1.2/api/java/org/apache/flink/streaming/api/environment/RemoteStreamEnvironment.html" target="_blank">RemoteStreamEnvironment</a>. It is an entry class that can be used to set parameters, create data sources, and submit tasks. The instance of StreamExecutionEnvironment is obtained by using its static methods, getExecutionEnvironment(), createLocalEnvironment() and createRemoteEnvironment(String host, int port, String... jarFiles). Ideally the getExecutionEnvironment() method is used to provide the corresponding environment depending on the context. 
When executing the program within an IDE or as a regular Java program, it creates a local environment to execute the program on the local machine. On the other hand, when the program is bundled into a JAR file and invoked using the command line, the Flink cluster manager executes the main method with getExecutionEnvironment() returning an execution environment of a cluster. The StreamExecutionEnvironment contains the ExecutionConfig which allows setting <a href="https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/execution_configuration.html" target="_blank">job specific configuration</a> values for the runtime. The throughput and latency trade-off can be configured in Flink using environment.setBufferTimeout(timeoutMillis) on the execution environment, which sets a maximum wait time for the data element buffers (which avoid the network traffic of one-by-one element transfer) to fill up. Setting the buffer timeout to -1 removes the timeout, flushing the buffer only after it is full.</div><div><br /></div><div>Next, either the initial data is generated, or data is read from an external data source such as a socket connection. This creates a DataStream of a specific type. DataStream is the core API for stream processing in Flink. The DataStream class is used to represent an immutable collection of data, which can be either finite or unbounded. Since the data is immutable, no elements can be added to or removed from the data collection. DataStream defines many common operations such as filtering, conversion, aggregation, windowing, and association. Once the output DataStream containing the final results is obtained, it is written to an outside system by creating a sink. Methods such as writeAsText(String path) and print() are used to create a sink. Finally, the <a href="https://ci.apache.org/projects/flink/flink-docs-master/api/java/org/apache/flink/api/java/ExecutionEnvironment.html#execute--" target="_blank">execute()</a> method on the StreamExecutionEnvironment is called to trigger the Flink program execution. All Flink programs are executed lazily. Operations such as source creation, transformation, aggregation and printing only build a graph of internal operations by adding to a dataflow graph. Only after calling the <a href="https://ci.apache.org/projects/flink/flink-docs-master/api/java/org/apache/flink/api/java/ExecutionEnvironment.html#execute--">execute()</a> method is the dataflow graph submitted to the cluster or the local computer for execution. Depending on the type of the ExecutionEnvironment, the execution is triggered on the local machine or the program is submitted for execution on a cluster. The execute() method waits for the job to finish and then returns a JobExecutionResult, which contains execution times and accumulator results. Flink also allows triggering asynchronous job execution by calling executeAsync() on the StreamExecutionEnvironment, which returns a JobClient to communicate with the submitted job.</div><div><br /></div><div>The below program counts the words coming from a web socket in 5 second windows. The program gets its input stream from netcat running on the terminal using the command "<b>nc -lk 9999</b>".</div>
<pre class="brush: scala">import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

object WindowWordCount {
  def main(args: Array[String]) {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // read lines from the socket opened by netcat on port 9999
    val text = env.socketTextStream("localhost", 9999)
    val counts = text.flatMap { _.toLowerCase.split("\\W+") filter { _.nonEmpty } }
      .map { (_, 1) }
      .keyBy(0)
      .timeWindow(Time.seconds(5))
      .sum(1)
    counts.print()
    // trigger the lazily built dataflow graph
    env.execute("Window Stream WordCount")
  }
}
</pre>
<div><br /></div>
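<div>The example above relies on the default processing-time behavior; to window by event time as described in the Timely Stream Processing section, timestamps and watermarks first have to be assigned to the stream. Below is a minimal sketch using the WatermarkStrategy API introduced in Flink 1.11, where the event type and its fields are assumptions made purely for illustration.</div>
<pre class="brush: scala">import java.time.Duration
import org.apache.flink.api.common.eventtime.{SerializableTimestampAssigner, WatermarkStrategy}
import org.apache.flink.streaming.api.scala._

// hypothetical event type carrying its own event-time timestamp in milliseconds
case class SensorReading(sensorId: String, timestamp: Long, value: Double)

// env is the StreamExecutionEnvironment created as shown above
val readings: DataStream[SensorReading] = env.fromElements(
  SensorReading("sensor-1", 1000L, 21.5),
  SensorReading("sensor-1", 3000L, 22.0))

val withEventTime = readings.assignTimestampsAndWatermarks(
  WatermarkStrategy
    // watermarks trail the highest seen timestamp by 5 seconds to tolerate out-of-order events
    .forBoundedOutOfOrderness[SensorReading](Duration.ofSeconds(5))
    .withTimestampAssigner(new SerializableTimestampAssigner[SensorReading] {
      override def extractTimestamp(reading: SensorReading, recordTimestamp: Long): Long =
        reading.timestamp
    }))
</pre>
<div><br /></div>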
<div>The Flink program reads data by attaching a source using StreamExecutionEnvironment.addSource(sourceFunction). Flink provides many pre-implemented source functions and allows to write custom sources by implementing the SourceFunction for non-parallel sources, or implementing the ParallelSourceFunction interface or extending the RichParallelSourceFunction for parallel sources. The StreamExecutionEnvironment has several predefined stream sources such as File based sources (readTextFile, readFile), Socket based source (socketTextStream), Collection-based sources (fromCollection, fromElements, fromParallelCollection, generateSequence) and Custom source (addSource). </div><div><br /></div><div>Data sinks consume DataStreams and forward them to files, sockets, external systems, or print them. Flink comes with a variety of built-in output formats that are encapsulated behind operations on the DataStreams, for example writeAsText(), writeAsCsv(), writeUsingOutputFormat() for custom file formats, writeToSocket() and addSink() which invokes a custom sink function. It is recommended to use the flink-connector-filesystem for reliable exactly-once delivery of a stream into a file system.</div><div><br /></div><div>An iterative operation is carried out by embedding the iteration in a single operator or by using a sequence of operators, one for each iteration. Flink supports iterative streaming by implementing a step function and embed it into an IterativeStream. Since there is no maximum number of iterations, the stream should eventually be forwarded to a downstream using either a split transformation or a filter. The below program continuously subtracts 1 from a series of integers until they reach zero.</div></div>
<pre class="brush: scala">val someIntegers: DataStream[Long] = env.generateSequence(0, 1000)
val iteratedStream = someIntegers.iterate(
  iteration => {
    val minusOne = iteration.map(v => v - 1)
    // elements still greater than zero are fed back into the iteration
    val stillGreaterThanZero = minusOne.filter(_ > 0)
    // elements that have reached zero (or below) leave the loop as the output stream
    val lessThanZero = minusOne.filter(_ <= 0)
    (stillGreaterThanZero, lessThanZero)
  }
)</pre>
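<div>For the custom sources mentioned above, a minimal sketch of a non-parallel SourceFunction could look as follows; the counter source and its one-second emission rate are purely illustrative.</div>
<pre class="brush: scala">import org.apache.flink.streaming.api.functions.source.SourceFunction

// illustrative non-parallel source that emits an incrementing counter once per second
class CounterSource extends SourceFunction[Long] {
  @volatile private var running = true

  override def run(ctx: SourceFunction.SourceContext[Long]): Unit = {
    var counter = 0L
    while (running) {
      // emit under the checkpoint lock so record emission and checkpointed state stay consistent
      ctx.getCheckpointLock.synchronized {
        ctx.collect(counter)
        counter += 1
      }
      Thread.sleep(1000)
    }
  }

  override def cancel(): Unit = running = false
}

// attached to a job with: val numbers = env.addSource(new CounterSource)
</pre>
<div><br /></div>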
<div>Operators in Flink transform one or more DataStreams into a new DataStream. Flink Programs can combine multiple transformations using such operators to create a sophisticated dataflow topologies. Examples of transfromation operators are: map(), flatMap(), filter(), reduce(), split(), select(), iterate(), keyBy() which partitions a stream into disjoint partitions by keys, fold() which combines current value with last folded value, variosu aggregations, window() and windowAll() which group stream events according to some characteristic.</div><div><br /></div><div><div>Flink provides special data sources e.g. env.fromElements(), env.fromCollection(collection) which are backed by Java collections to ease testing. Similarly Flink provides a sink to collect DataStream results for testing, for example, "DataStreamUtils.collect(myDataStream.javaStream).asScala".</div></div><div><br /></div><div><b>Windows</b></div><div><a href="https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/stream/operators/windows.html" target="_blank">Windows</a> split the stream into buckets of finite size, over which computations can be applied. A window is created when the first element belonging to window arrives and removed the time passes its end timestamp. Flink guarantees removal only for time-based windows. The keyBy(...) splits the infinite stream into logical keyed streams were any attribute of the event can be used as a key. Keyed streams allow windowed computation to be performed in parallel by multiple tasks processing independently. All elements referring to the same key will be sent to the same parallel task. In non-keyed streams the original stream is never split into multiple logical streams and all the windowing logic is performed by a single task. The window(...) or windowAll(...) is called for non-keyed streams in Non-Keyed Windows. </div><div><br /></div><div>Each window has a Trigger and a function to be applied to window data, attached to it. A trigger can also decide to purge a window’s contents any time between its creation and removal. The WindowAssigner defines how the elements are assigned to the windows and is passed to passed to window(...) (for keyed streams) or the windowAll() (for non-keyed streams) call. Flink comes with pre-defined window assigners for the most common use cases, namely tumbling windows, sliding windows, session windows and global windows. The window function defines the computation to be performed on the windows. The window function can be one of ReduceFunction, AggregateFunction, FoldFunction or ProcessWindowFunction. A Trigger determines when a window is ready to be processed by the window function. Each WindowAssigner comes with a default Trigger. A custom trigger can be specified using trigger(...). The trigger interface has five methods namely, onElement(), onEventTime(), onProcessingTime(), onMerge() and clear(), which allows the Trigger to react to different events. The onElement(), onEventTime() and onProcessingTime() methods decide how to act on their invocation event by returning a TriggerResult. TriggerResult enum has values, CONTINUE which does nothing, FIRE which triggers computation, PURGE which clears elements in window, and FIRE_AND_PURGE which both triggers computation and clearing window afterwards. Flink’s windowing model allows specifying an optional Evictor in addition to the WindowAssigner and the Trigger, using the evictor(...) method. 
The evictor has the ability to remove elements from a window after the trigger fires and before and/or after the window function is applied. The Evictor interface has two methods, evictBefore() and evictAfter(), which contain the eviction logic to be applied before or after the window function. The result of a windowed operation is a DataStream, and no information about the windowed operations is retained in the result elements. The only relevant information that is set on the result elements is the element timestamp.</div><div><br /></div><div><div><font size="5"><b>DataSet API</b></font></div><div><br /></div><div>A data set is a finite, bounded collection of data. The <a href="https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/batch/" target="_blank">Dataset API</a> performs batch operations on data collected over a period of time. It provides different kinds of transformations on the datasets like filtering, mapping, aggregating, joining and grouping. Datasets are created from sources like local files or by reading a file from a particular source, and the result data can be written to different sinks like distributed files or the command line terminal. DataSet programs are similar to DataStream programs with the difference that the transformations are applied on data sets. Below is the WordCount example on a DataSet.</div></div><div><br /></div>
<pre class="brush: scala">import org.apache.flink.api.scala._

object WordCount {
  def main(args: Array[String]) {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val text = env.fromElements(
      "Who's there?",
      "I think I hear them. Stand, ho! Who's there?")
    // split lines into words, then count occurrences per word
    val counts = text.flatMap { _.toLowerCase.split("\\W+") filter { _.nonEmpty } }
      .map { (_, 1) }
      .groupBy(0)
      .sum(1)
    counts.print()
  }
}
</pre>
<div>Data transformations transform one or more DataSets into a new DataSet. Some transformations (join, coGroup, groupBy) require that a key be defined on a collection of elements. Other transformations (Reduce, GroupReduce, Aggregate) allow data being grouped on a key before they are applied. The parallelism of a transformation can be defined by setParallelism(int). DataSets are created using the abstractions behind InputFormat, either from files or from Java collections. Flink comes with several built-in formats to create data sets from common file formats using ExecutionEnvironment methods. For example, the methods readTextFile(path), readCsvFile(path), readSequenceFile(key, value, path) read data from specified files, while methods fromCollection(Iterable), fromElements(elements: _*), fromParallelCollection(SplittableIterator) create data set from specified iterator/elements, with generateSequence(from, to) generating sequence from the specified interval. Data sinks consume the DataSets and store or return them to external destination. Data sink operations are described using an OutputFormat. Flink comes with a variety of built-in output formats that are encapsulated behind operations on the DataSet, for example writeAsText(), writeAsCsv(...), print(), write() and output(). Iteration operators in Flink encapsulate a part of the program and execute it repeatedly, feeding back the result of one iteration (the partial solution) into the next iteration. There are two types of iterations in Flink, BulkIteration and DeltaIteration.</div><div><div><br /></div><div>Parameters can be passed to flink functions using either the constructor or the withParameters(Configuration) method. The parameters are serialized as part of the function object and shipped to all parallel task instances. Flink also allows to pass custom configuration values to the ExecutionConfig interface of the environment. Since the execution config is accessible in all (rich) user functions, the custom configuration will be available globally in all functions. Objects in the global job parameters are accessible in many places in the system. All user functions implementing a RichFunction interface have access through the runtime context.</div><div><br /></div><div>Semantic annotations are used to give hints to Flink the behavior of a function. They tell the system which fields of a function's input does the function reads and evaluates and which fields it unmodified forwards from its input to its output. Semantic annotations are a powerful means to speed up execution, because they allow the system to reason about reusing sort orders or partitions across multiple operations. Semantic annotations may eventually save the program from unnecessary data shuffling or unnecessary sorts and significantly improve the performance of a program. Incorrect semantic annotations would cause Flink to make incorrect assumptions about the program, leading to incorrect results. Hence if the behavior of an operator is not clearly predictable, it is better to not provide any annotation. @ForwardedFields (or @ForwardedFieldsFirst) annotation declares input fields which are forwarded unmodified by a function to the same position or to another position in the output. This information is used by the optimizer to infer whether a data property such as sorting or partitioning is preserved by a function. @NonForwardedFields (or @NonForwardedFieldsFirst) annotation declares all the fields which are not preserved on the same position in a function's output. 
The values of all other fields are considered to be preserved at the same position in the output. Hence, the @NonForwardedFields annotation is the opposite of the @ForwardedFields annotation. The @ReadFields (or @ReadFieldsFirst) annotation declares all fields that are accessed and evaluated by a function, i.e., all fields that are used by the function to compute its result. For example, fields which are evaluated in conditional statements or used for computations must be marked as read, while fields which are forwarded unmodified to the output without evaluation are considered not read.</div><div><br /></div><div>Broadcast variables allow making a data set available to all parallel instances of an operation, in addition to the regular input of the operation. They make the data set accessible at the operator as a Collection. The broadcast sets are registered by name via withBroadcastSet(DataSet, String), and accessible via getRuntimeContext().getBroadcastVariable(String) at the target operator. Flink also offers a distributed cache which makes files locally accessible to parallel instances of user functions. A program registers a file or directory of a local or remote filesystem such as HDFS or S3 under a specific name in its ExecutionEnvironment as a cached file. When the program is executed, Flink automatically copies the file or directory to the local filesystem of all workers. A user function can look up the file or directory under the specified name and access it from the worker’s local filesystem.</div><div><br /></div><div><div><font size="5"><b>Table API and SQL</b></font></div><div><br /></div><div>Flink provides two relational APIs, the <a href="https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/common.html" target="_blank">Table API and SQL</a>, for unified stream and batch processing. The Table API is a language-integrated query API for Scala and Java that allows the composition of queries from relational operators such as selection, filter, and join in a very intuitive way. Flink’s SQL support is based on <a href="https://calcite.apache.org/" target="_blank">Apache Calcite</a> which implements the SQL standard. Queries specified in either interface have the same semantics and produce the same result regardless of whether the input is a batch input (DataSet) or a stream input (DataStream). The Table API and the SQL interfaces (still actively being developed) are tightly integrated with each other as well as with Flink’s DataStream and DataSet APIs. This makes it possible to switch seamlessly between all APIs and libraries. The central concept of the Table API and SQL is that the Table serves as the input and output of queries.</div><div><br /></div><div>Flink provides two planners, namely the Blink planner (introduced in Flink 1.9) and the old planner, which are responsible for translating relational operators into an executable, optimized Flink job. Both planners come with different optimization rules and runtime classes. The Blink planner treats batch jobs as a special case of streaming, where batch jobs are translated into DataStream programs (not DataSet) just like streaming jobs. The Blink planner also does not support BatchTableSource and uses a bounded StreamTableSource instead.</div><div><br /></div><div>The <a href="https://ci.apache.org/projects/flink/flink-docs-release-1.9/api/java/org/apache/flink/table/api/TableEnvironment.html" target="_blank">TableEnvironment</a> is a central concept of the Table API and SQL integration. 
It is responsible for registering a Table in the internal catalog and registering catalogs. It also enables executing SQL queries and converting a DataStream or DataSet into a Table. A TableEnvironment is created by calling BatchTableEnvironment.create() or StreamTableEnvironment.create() with a StreamExecutionEnvironment or an ExecutionEnvironment and an optional TableConfig. The TableConfig can be used to configure the TableEnvironment or to customize the query optimization and translation process. A Table is always bound to a specific TableEnvironment, and it is not possible to combine tables (using join or union) of different TableEnvironments in the same query. The TableEnvironment maintains a map of catalogs of tables which are created with an identifier. Each identifier consists of a catalog name, database name and object name. Tables can be either regular TABLES describing external data such as a file or database table, or virtual VIEWS created from an existing Table object. The TableEnvironment allows setting the current catalog and current database, thus making them optional parameters while creating tables using createTemporaryTable(..) and views using createTemporaryView(..). </div><div><br /></div><div>A Table can be temporary, tied to a particular Flink session, or permanent, visible across multiple Flink sessions and clusters. A temporary table is stored in memory and can have the same identifier as an existing permanent table, in which case it shadows the permanent table, making it inaccessible. A Table API object corresponds to a VIEW (virtual table) in relational database systems. It encapsulates a logical query plan. The query defining the Table is not optimized, and subsequent queries referencing the registered table are inlined. The Table API is an integrated query API which can be applied on the Table class to perform relational operations. These query API methods, for example filter(), groupBy() and select(), return a new Table object representing the result of applying the relational operation on the input Table. Some relational operations are composed of multiple method calls such as table.groupBy(...).select(), where groupBy(...) specifies a grouping of the table, and select(...) the projection on the grouped table. </div><div><br /></div><div>Flink’s SQL queries, as opposed to the query API, are specified as regular Strings. A Flink query is internally represented as a logical query plan which is optimized and translated into a DataStream program. Table API and SQL queries can be easily mixed because both return Table objects. A Table is emitted by writing it to a TableSink. TableSink is a generic interface to support a wide variety of file formats (e.g. CSV, Apache Parquet, Apache Avro), storage systems (e.g., JDBC, Apache HBase, Apache Cassandra), or messaging systems (e.g., Apache Kafka). A batch Table can only be written to a BatchTableSink, while a streaming Table requires either an AppendStreamTableSink, a RetractStreamTableSink, or an UpsertStreamTableSink. The Table.executeInsert(String tableName) method looks up the TableSink from the catalog by name, validates the table schema against the TableSink schema and finally emits the Table to the specified TableSink.</div><div><br /></div><div>Some of the frequently used Table API and SQL methods are listed below.</div><div><ul style="text-align: left;"><li>The TableEnvironment.executeSql() method is used for executing a given statement. 
</li><li>The Table.executeInsert() method is used for inserting the table content into the given sink path.</li><li>The Table.execute() method is used for collecting the table content to the local client.</li><li>A Table is first buffered in a StatementSet when it is emitted to a sink through StatementSet.addInsert() or through an INSERT statement using StatementSet.addInsertSql(). StatementSet.execute() finally emits the buffered data to the sink.</li></ul></div><div><br /></div><div>Below is an example of the common structure of a Table API and SQL program.</div>
<pre class="brush: scala">// create a TableEnvironment for blink planner streaming
val bsEnv = StreamExecutionEnvironment.getExecutionEnvironment
val bsSettings = EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build()
val tableEnv = StreamTableEnvironment.create(bsEnv, bsSettings)
// create a Table
tableEnv.connect(...).createTemporaryTable("table1")
// register an output Table
tableEnv.connect(...).createTemporaryTable("outputTable")
// create a Table from a Table API query
val tapiResult = tableEnv.from("table1").select(...)
// create a Table from a SQL query
val sqlResult = tableEnv.sqlQuery("SELECT ... FROM table1 ...")
// emit a Table API result Table to a TableSink, same for SQL result
val tableResult = tapiResult.executeInsert("outputTable")
tableResult...
</pre>
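<div><br /></div><div>As an illustration of composing relational operations on a Table and mixing them with SQL, the sketch below assumes a table named "Orders" with columns (user, product, amount) has already been registered in the tableEnv created above; the table, column and sink names are assumptions made for the example.</div>
<pre class="brush: scala">import org.apache.flink.table.api._

val orders = tableEnv.from("Orders")

// Table API query: filter, group and aggregate, returning a new Table
val revenue = orders
  .filter('amount > 0)
  .groupBy('user)
  .select('user, 'amount.sum as 'total)

// the equivalent SQL query on the same registered table also returns a Table
val revenueSql = tableEnv.sqlQuery(
  "SELECT user, SUM(amount) AS total FROM Orders WHERE amount > 0 GROUP BY user")

// emit the Table API result to a previously registered sink table
revenue.executeInsert("RevenueSink")
</pre>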
<div><br /></div><div><div>Table API and SQL queries can be easily integrated with and embedded into DataStream and DataSet programs. Inversely, a Table API or SQL query can also be applied on the result of a DataStream or DataSet program. This can be achieved by converting a DataStream or DataSet into a Table and vice versa. The Scala Table API features implicit conversions for the DataSet, DataStream, and Table classes. A DataStream or DataSet can also be registered in a TableEnvironment as a View. The schema of the resulting temporary view depends on the data type of the registered DataStream or DataSet. A DataStream or DataSet can also be converted directly into a Table, and the inverse is possible as well, where a Table can be converted into a DataStream or DataSet. While converting a Table into a DataStream or DataSet, the data type for the resulting DataStream or DataSet (from table rows) needs to be specified. A Table which is the result of a streaming query is updated dynamically, such that as new records arrive on the query’s input streams, the table changes. To convert such a table into a DataStream, these dynamic updates to the table need to be encoded in the resulting DataStream. The two modes to convert a Table into a DataStream are Append Mode, which can be used only when the dynamic Table is modified by INSERT changes, and Retract Mode, which can always be used.</div><div><br /></div><div>Apache Flink leverages Apache Calcite to perform sophisticated query optimization. The optimizer makes decisions based on the query plan, data source statistics and fine-grained costs (memory, CPU, etc.) for each operator. The Table API provides a mechanism to explain the logical and optimized query plans used to compute a Table using the Table.explain() or StatementSet.explain() methods. Table.explain() returns the plan of a Table, while StatementSet.explain() returns the plan of multiple sinks.</div><div><br /></div></div><div><b><font size="5">Conclusion</font></b></div><div><br /></div><div><div>Apache Flink is a true stream processing framework, built from the ground up to process streaming data. It processes events one at a time and treats batch processing as a special case. In contrast, Apache Spark treats a data stream as multiple tiny batches, making streaming a special case of batch. Stream imperfections like out-of-order events can be easily handled using Flink's event time processing support. Flink provides additional operations that allow implementing cycles within a streaming application and performing several iterations on batch data. Flink provides a vast and powerful set of operators to apply functions to a finite group of elements in a stream. Hence complex data semantics can be easily implemented using Flink’s rich programming model.</div><div><br /></div><div>Flink implements <a href="https://arxiv.org/pdf/1506.08603.pdf" target="_blank">lightweight distributed snapshots</a> to provide low-overhead, exactly-once processing guarantees in stream processing. Flink does not rely entirely on the JVM garbage collector and implements a <a href="https://flink.apache.org/news/2015/09/16/off-heap-memory.html" target="_blank">custom memory manager</a> that stores the data to process in <a href="https://flink.apache.org/news/2015/05/11/Juggling-with-Bits-and-Bytes.html" target="_blank">byte arrays</a>. This reduces the load on the garbage collector and increases performance. 
This enables Flink to process event streams at <a href="https://arxiv.org/pdf/1802.08496.pdf" target="_blank">high throughputs</a> with consistent <a href="https://engineering.zalando.com/posts/2016/03/apache-showdown-flink-vs.-spark.html?gh_src=4n3gxh1?gh_src=4n3gxh1" target="_blank">low latencies</a>. Spark Streaming is trying to catch up with Flink with its Structured Streaming release, and it seems to be a tough fight ahead. There are multiple benchmarks comparing the two after newer releases, with <a href="https://databricks.com/blog/2017/10/11/benchmarking-structured-streaming-on-databricks-runtime-against-state-of-the-art-streaming-systems.html" target="_blank">Spark claiming</a> to have fared well, which was then <a href="https://data-artisans.com/blog/curious-case-broken-benchmark-revisiting-apache-flink-vs-databricks-runtime" target="_blank">answered by Flink</a>. Apache Flink is relatively new in the streaming arena and has a smaller community than Apache Spark. It is also primarily used in streaming applications, as there is no known adoption of Flink Batch in production. Flink does lack robustness on node failures when compared to Spark, and it also cannot handle skewed data as well as Spark. With all that said, Apache Flink is the best framework out there for real-time processing and has a community that is growing fast each day.</div></div><div><br /></div>Pranav Patilhttp://www.blogger.com/profile/10125896717744035325noreply@blogger.com0tag:blogger.com,1999:blog-7610061084095478663.post-90021191455126448722020-06-29T16:47:00.132-07:002020-08-02T11:34:02.033-07:00Spark - An In-Memory Cluster Computing Framework<div>In the modern information age where data is collected at a staggering scale from countless interconnected devices, the need to be able to stream, process and analyze this data, often in real time, has become crucial for many companies. In the Internet of Things (IoT), objects and devices are embedded with tiny sensors that communicate with each other and the user, creating an interconnected system generating massive amounts of data. Such large volumes of data need to be processed in real time to drive intelligent solutions and revolutionary features of modern applications. Massive distributed parallel processing of vast amounts and varieties of data is tough to manage with the current analytics capabilities in the cloud. Apache Spark with its stack components enables processing such decentralized data from a fog computing solution. Apache Spark is a powerful cluster computing engine that stands out for its ability to process large volumes of data significantly faster than MapReduce, because data is persisted in-memory in Spark’s own processing framework. Spark can be used for both batch processing as well as real-time processing. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Apache Spark also provides a suite of <a href="https://spark.apache.org/docs/3.0.0-preview2/web-ui.html">web user interfaces</a> (UIs) which can be used to monitor the status and resource consumption of the Spark cluster. 
The Spark framework is comprised of Spark Core and four libraries, namely Spark SQL, Spark Streaming, MLlib and GraphX, which are optimized to address different use cases.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-UTi-GdQJj5Y/Xu1e4hA6XVI/AAAAAAAAhgk/oiwyHGUK6B8zu2gdLkAxTPuXgcAnnTEjwCK4BGAsYHg/s1120/dawh_0401.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="643" data-original-width="1120" height="368" src="https://1.bp.blogspot.com/-UTi-GdQJj5Y/Xu1e4hA6XVI/AAAAAAAAhgk/oiwyHGUK6B8zu2gdLkAxTPuXgcAnnTEjwCK4BGAsYHg/w640-h368/dawh_0401.png" width="640" /></a></div><div><br /></div><div><br /></div><div><b><font size="5">Hadoop and Spark</font></b></div><div><div>Hadoop was the only major player in Big Data processing until Apache Spark was released in 2014. With Spark providing convenient APIs and speeds up to 100 times faster than Hadoop MapReduce, it is dominating the Big Data landscape. Spark is not entirely a replacement for Hadoop; however, it provides a <a href="https://mapr.com/ebooks/spark/04-hadoop-and-spark-benefits.html" target="_blank">promising alternative</a> to Hadoop MapReduce.</div><div><br /></div><div>Hadoop MapReduce manages scheduling and task allocation processes within the cluster along with workloads which are suited for batch processing. Multiple MapReduce jobs could be strung together to create a data pipeline. In between every stage of that pipeline, the MapReduce code would read data from the disk, and when completed, would write the data back to the disk. This process was inefficient because it had to read all the data from disk at the beginning of each stage of the process. Hadoop MapReduce persists data back to the disk after every map or reduce action. It also kills its processes as soon as a job is complete to reduce the memory footprint.</div><div><br /></div><div>Spark, on the other hand, does not write the data onto the disk; instead it performs all its activities (transformations) within memory (RAM), thus increasing the processing performance. When data is too large to fit within memory, Spark can also use the disk to process data, degrading the performance. Even without using the in-memory cache, it still outperforms Hadoop MapReduce. For example, Spark set the record for the <a href="http://spark.apache.org/news/spark-wins-daytona-gray-sort-100tb-benchmark.html" target="_blank">GraySort benchmark in 2014</a> by sorting 100 TB of data in 23 minutes, compared to 72 minutes using Hadoop MapReduce on a cluster of 2100 nodes in 2013. Spark also overcomes the disk I/O limitations by caching data in memory as much as possible using RDDs to reduce disk I/O. On average, Spark outperforms Hadoop by up to 20 times in iterative algorithms (machine learning), because of Spark's efficient reuse of intermediate results. Spark offers a faster approach to processing data without passing it through MapReduce processes in Hadoop. Spark is not designed to deal with the data management and cluster administration tasks associated with running data processing and analysis workloads at scale. It leverages Hadoop YARN or Apache Mesos which offer functionality around distributed cluster management. Spark can also run on top of Hadoop, benefiting from Hadoop's cluster manager (YARN) and underlying storage such as HDFS and HBase. 
Spark without Hadoop, integrates with alternative cluster managers like Mesos and storage platforms like Cassandra and Amazon S3. Spark allows to better manage a wide range of data processing tasks, from batch processing to streaming data and graph analysis.</div></div><div><br /></div><div><b><font size="5">Spark Architecture</font></b></div><div><div>Apache Spark has a well-defined and layered architecture where all the spark components and layers are loosely coupled and integrated with various extensions and libraries. Apache Spark Architecture is based on two main abstractions.</div><div><div><br /></div><div><b><font size="5">Resilient Distributed Dataset (RDD)</font></b></div><div><a href="https://cs.stanford.edu/~matei/papers/2012/nsdi_spark.pdf" target="_blank">RDD</a> is an immutable, fundamental collection of data elements or datasets that are processed in parallel. RDD is logically partitioned across many servers so that they can be computed on different nodes of the cluster. RDD hides the data partitioning and distribution which allows Spark to design parallel computational framework with a higher-level API using Scala, Python and R. The dataset in RDD (records of data) can be loaded from any datasource, e.g. text/JSON files, a database via JDBC, etc. Each dataset in an RDD can be divided into logical partitions, which are then executed on different worker nodes of a cluster. Spark supports two types of RDD’s – Hadoop Datasets which are created from the files stored on HDFS and the parallelized collections which are based on existing Scala collections. RDD is mostly stored in memory for as much time as possible, although it can also be stored on hard drive if required. Since RDDs are immutable, no changes take place in them once they are created, which allows to maintain consistency over the cluster. RDD can define placement preferences of RDD records in order to compute partitions as close to the records as possible. RDD can be cached either in memory using persist method (or on disk) for faster access, when the RDD is used several times during the processing. RDDs can also be partitioned manually to correctly balance partitions and distribute them across the nodes within a cluster. Generally, smaller partitions allow distributing RDD data more equally among more executors, while it's easy to work with fewer partitions. The number of partitions of a RDD can be controlled by using <a href="https://spark.apache.org/docs/latest/rdd-programming-guide.html#RepartitionLink" target="_blank">repartition</a> or <a href="https://spark.apache.org/docs/latest/rdd-programming-guide.html#CoalesceLink" target="_blank">coalesce</a> transformations. RDDs live in a single <a href="http://spark.apache.org/docs/latest/api/scala/org/apache/spark/SparkContext.html" target="_blank">SparkContext</a> that creates a logical boundary. An RDD is a named (by name) and uniquely identified (by id) entity in a SparkContext. RDDs are resilient i.e. fault tolerant and able to recompute missing or damaged partitions caused by node failures with the help of <a href="https://books.japila.pl/apache-spark-internals/apache-spark-internals/rdd/spark-rdd-lineage.html" target="_blank">RDD lineage graph</a>.</div><div><br /></div><div>Spark RDD’s support two different types of operations – Transformations and Actions, which can be performed on RDD as well as on data storage to form another RDDs. 
Transformations create a new dataset (new RDD) from an existing one, while Actions return a value to the driver program after running a computation on the dataset. Some examples of transformations include map, filter, select, and aggregate (groupBy), while examples of actions are count, show, reduce or writing data out to file systems. When transformation is applied on RDD it creates a DAG (Directed Acyclic Graph) using applied operation, source RDD and function used for transformation. It will build this DAG graph using the references until any Action operation is applied on the last lined up RDD, were the DAG is submitted to DAG scheduler for execution. The result values of action which is the actual dataset are stored to driver program or to the external storage system. Hence spark transformations is lazy, as actual computations is only triggered when an action is invoked. At high level, there are two transformations that can be applied onto the RDDs, namely Narrow transformation and Wide transformation.</div><div><a href="https://1.bp.blogspot.com/-zBCpPGp2nPY/XvFIYUiwSLI/AAAAAAAAhm0/b9a97_dml2oo7ydVTHJWbHesKppxrJdwwCK4BGAsYHg/s422/1_s_6tQjkLWkhuRJpl527P5A.png" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em; text-align: center;"><img alt="Narrow Transformation" border="0" data-original-height="422" data-original-width="324" height="200" src="https://1.bp.blogspot.com/-zBCpPGp2nPY/XvFIYUiwSLI/AAAAAAAAhm0/b9a97_dml2oo7ydVTHJWbHesKppxrJdwwCK4BGAsYHg/w154-h200/1_s_6tQjkLWkhuRJpl527P5A.png" title="Narrow Transformation" width="154" /></a></div><div><span style="text-align: center;"><br /></span></div><div><b>Narrow transformation</b> — In Narrow transformation, all the elements that are required to compute the records in single partition live in the single partition of parent RDD. It doesn't require the data to be shuffled across the partitions. Hence the input & output data stays in the same partition, i.e. it is self-sufficient. Only a limited subset of partitions used to calculate the result. Spark groups narrow transformations as a single stage known as pipelining. Common Narrow transformations are map, flatMap, MapPartitions, filter, sample & union.</div><div><span style="text-align: center;"> </span></div><div><span style="text-align: center;"><br /></span></div><div><span style="text-align: center;"><br /></span></div><div><span style="text-align: center;"><br /></span></div><div><br /></div><div><span style="text-align: center;"><br /></span></div><div><a href="https://1.bp.blogspot.com/-NYZKL7PA8no/XvFIfWDapwI/AAAAAAAAhnA/3qxt8wIMkK0Qb_2Vax_4Nn75D7gxwdmtwCK4BGAsYHg/s430/1_jp8Cn8xcE6CQvrYb0H8yuw.png" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em; text-align: center;"><img alt="Wide Transformation" border="0" data-original-height="430" data-original-width="321" height="200" src="https://1.bp.blogspot.com/-NYZKL7PA8no/XvFIfWDapwI/AAAAAAAAhnA/3qxt8wIMkK0Qb_2Vax_4Nn75D7gxwdmtwCK4BGAsYHg/w149-h200/1_jp8Cn8xcE6CQvrYb0H8yuw.png" title="Wide Transformation" width="149" /></a></div><div><span style="text-align: center;"><br /></span></div><div><b>Wide transformation</b> — In wide transformation, all the elements that are required to compute the records in the single partition may live in many partitions of parent RDD. 
Wide transformations are also known as shuffle transformations because a data shuffle is required to process the data (a short sketch contrasting narrow and wide transformations appears after the RDD creation examples below).</div><div><div>Common wide transformations are groupByKey, reduceByKey, coalesce, repartition, join and intersection.</div></div><div><br /></div><div>RDDs can be created using the following methods. </div><div><ul style="text-align: left;"><li><b>Parallelized Collections</b>: The simplest way to create RDDs is by using an existing collection from the driver program and passing it to SparkContext's parallelize() method. The elements of the collection are copied to form a distributed dataset that can be operated on in parallel. The parallelize method also takes a number of partitions, represented by the slices parameter below, which cuts the dataset into the specified number of partitions. The entire dataset needs to be on one machine in order to use the parallelize method. Due to this property, this process is rarely used outside of testing and prototyping.<br /><br />Parallelized collections are created by calling the below methods of SparkContext.
<pre class="brush: scala">sparkContext.parallelize(coll, slices)
sparkContext.makeRDD(coll, slices)
sparkContext.range(start, end, step, slices)
</pre>
</li><li><b>External Datasets</b>: Spark supports creating RDDs on any storage source supported by Hadoop, including the local file system, HDFS, Cassandra, HBase, Amazon S3, etc. Spark supports text files, SequenceFiles, and any other Hadoop InputFormat. Text file RDDs can be created using SparkContext's<b> textFile(name, partitions)</b> method. This method takes a URI for the file (either a local path, or a URI like hdfs://, s3n://, etc) and reads it as a collection of lines. When using the local file system, the file is either copied to all workers or shared using a network-mounted shared file system. An RDD of pairs of a file and its content from a directory can be created using SparkContext's <b>wholeTextFiles(name, partitions)</b> method. This method reads a directory containing multiple small text files, and returns each of them as (filename, content) pairs. All of Spark’s file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example, textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz"). The textFile method also takes an optional second argument for controlling the number of partitions of the file. By default, Spark creates one partition for each block of the file and does not allow having fewer partitions than blocks. The SparkContext’s <b>sequenceFile[K, V]</b> method reads SequenceFiles, where K and V are the key and value types in the file respectively. The SparkContext's <b>hadoopRDD</b> method is used for Hadoop InputFormats and takes a JobConf, input format class, key class and value class. The <b>RDD.saveAsObjectFile</b> and <b>SparkContext.objectFile</b> methods support saving an RDD in a simple format consisting of serialized Java objects.</li><li><b>Existing Spark RDD</b>: A new RDD can also be created from existing RDDs with a different dataset using the transformation process, which is carried out frequently as RDDs are immutable. The new RDD created from an existing Spark RDD also carries a pointer to the parent RDD in Spark. All such dependencies between the RDDs are logged in a lineage graph.</li></ul></div><div>Below are examples of creating new RDDs using the above methods.</div>
<pre class="brush: scala">// creating an RDD using a parallelized collection
val rdd1 = spark.sparkContext.parallelize(Array("sun", "mon", "tue", "wed", "thu", "fri"), 4)
val result = rdd1.coalesce(3)
result.foreach(println)

// creating an RDD using external storage
val dataRDD = spark.read.textFile("path/of/text/file").rdd

// creating an RDD from existing RDDs i.e. transformations
val words = spark.sparkContext.parallelize(Seq("sun", "rises", "in", "the", "east", "and", "sets", "in", "the", "west"))
val wordPair = words.map(w => (w.charAt(0), w))
wordPair.foreach(println)
</pre>
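<div><br /></div><div>Below is a short sketch contrasting the narrow and wide transformations described earlier, reusing the spark session from the examples above.</div>
<pre class="brush: scala">val nums = spark.sparkContext.parallelize(1 to 100, 4)

// narrow transformation: each output partition depends on exactly one parent partition, no shuffle
val doubled = nums.map(_ * 2)

// wide transformation: grouping by key requires shuffling data across partitions
val sumsByRemainder = doubled.map(n => (n % 3, n)).reduceByKey(_ + _)

sumsByRemainder.collect().foreach(println)
</pre>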
<div><br /></div><div><div>A RDD Lineage (RDD operator or RDD dependency graph) is a graph of all the parent RDDs of a RDD. It is built as a result of applying transformations to the RDD and creates a logical execution plan. The lineage graph determines what transformations need to be executed after an action has been called. The logical execution plan starts with the earliest RDDs (those with no dependencies on other RDDs or reference cached data) and ends with the RDD that produces the result of the action that has been called to execute. The RDD's toDebugString() method is used to get details of RDD lineage graph.</div></div><div><br /></div>
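<div>A small sketch of building a short lineage and printing it with toDebugString, again reusing the spark session from the earlier examples.</div>
<pre class="brush: scala">val lines = spark.sparkContext.parallelize(Seq("spark flink", "spark"))
val counts = lines
  .flatMap(_.split(" "))     // narrow transformation
  .map(word => (word, 1))    // narrow transformation
  .reduceByKey(_ + _)        // wide transformation, introduces a shuffle dependency

// prints the chain of parent RDDs (the lineage) used to recompute lost partitions
println(counts.toDebugString)
</pre>
<div><br /></div>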
<div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-kqpfUXaSdzA/Xu6jMYfqaqI/AAAAAAAAhhE/6pV-uBcCM4QsPZrdb8VovneXkSUB-vsHwCK4BGAsYHg/s2000/56752e45ebb9d232e1bd07161d8433eb.png" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="1064" data-original-width="2000" height="340" src="https://1.bp.blogspot.com/-kqpfUXaSdzA/Xu6jMYfqaqI/AAAAAAAAhhE/6pV-uBcCM4QsPZrdb8VovneXkSUB-vsHwCK4BGAsYHg/w640-h340/56752e45ebb9d232e1bd07161d8433eb.png" width="640" /></a></div><div></div><div><br /></div><div><div>Intrinsic properties of <a href="http://spark.apache.org/docs/latest/api/scala/org/apache/spark/rdd/RDD.html" target="_blank">RDD Object</a>:</div><div><ul style="text-align: left;"><li>A list of parent RDD’s that are the dependencies of the RDD.</li><li>A list of partitions that a dataset is divided into.</li><li>A compute function for computing each split on partitions.</li><li>An optional Partitioner for key-value RDDs, that defines the hashing of keys, and the pairs partitioned.</li><li>An optional list of preferred locations to computer each split on, i.e. hosts for a partition where the records live or are the closest to read from.</li></ul></div></div><div><div>The goal of RDD is to reuse intermediate in-memory results across multiple data-intensive workloads with no need for copying large amounts of data over the network. Some of the limitations of RDD include, RDD is not optimized to work with structural data, it cannot infer schema of the ingested data and RDDs degrade performance when there is not enough memory to store them.<br /></div></div><div><b><br /></b></div><div><b><font size="5">Directed Acyclic Graph (DAG)</font></b></div><div>A <a href="https://hazelcast.com/glossary/directed-acyclic-graph/" target="_blank">Directed Acyclic Graph</a> (DAG) is a graph that has no cycles and flows in one direction. DAGs are useful for representing many different types of flows, including data processing flows. The DAG in Spark are used to perform sequence of computations on the data (RDDs). The DAG consists of nodes representing RDD partitions, while the directed edges representing the transitions (transformation) from one data partition state to another. DAG is the scheduling layer of the Apache Spark architecture that implements stage-oriented scheduling. Compared to MapReduce that creates a graph in two stages (Map and Reduce), Spark creates the DAGs which may contain multiple stages forming a tree-like structure. The DAG scheduler divides the operators into stages of tasks. A stage is comprised of tasks which are based on partitions of the input data (block size). The DAG scheduler pipelines operators together. For e.g. Many map operators can be scheduled in a single stage. The final result of the DAG scheduler is a set of stages. The Stages are passed on to the Task Scheduler. The task scheduler launches tasks via cluster manager (Spark Standalone/Yarn/Mesos). The task scheduler is unaware about the dependencies between the stages of tasks. 
The worker nodes execute the tasks on the Slave.</div></div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-XuKtrAEzwX4/Xu604Rj2pKI/AAAAAAAAhh0/bJaasM6DSFcr8suYsvf59KDA0qXS5WDXwCK4BGAsYHg/s1024/Logical-view.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="440" data-original-width="1024" src="https://1.bp.blogspot.com/-XuKtrAEzwX4/Xu604Rj2pKI/AAAAAAAAhh0/bJaasM6DSFcr8suYsvf59KDA0qXS5WDXwCK4BGAsYHg/d/Logical-view.png" /></a></div><div><br /></div><div><br /></div><div><div>The Apache Spark framework uses a master–slave architecture that consists of a driver, which runs as a master node, and many executors that run across as worker nodes in the cluster. Executors are agents that are responsible for executing a task. The Driver Program calls the main program of an application and creates SparkContext. A SparkContext consists of all the basic functionalities. The Spark driver is a JVM process that coordinates workers and execution of the task. The driver contains various other components such as DAG Scheduler, Task Scheduler, Backend Scheduler, and Block Manager, which are responsible for translating the user-written code into jobs that are actually executed on the cluster. Spark Driver and SparkContext collectively watch over the job execution within the cluster. Spark Driver works with the Cluster Manager (YARN, Mesos) which allocates resources to manage various jobs. Each job is split into multiple smaller tasks which are further distributed to worker nodes. An action is one of the ways of sending data from Executer to the driver.</div></div><div><br /></div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-repIBoTEk9k/XuxB8dk1aZI/AAAAAAAAhgI/hRuYxq9C0pc5QHVi5esCOOb-G1sjwaQ_gCK4BGAsYHg/s1233/internals.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="843" data-original-width="1233" height="438" src="https://1.bp.blogspot.com/-repIBoTEk9k/XuxB8dk1aZI/AAAAAAAAhgI/hRuYxq9C0pc5QHVi5esCOOb-G1sjwaQ_gCK4BGAsYHg/w640-h438/internals.png" width="640" /></a></div><div><br /></div></div><div><br /></div><div><div><b>Broadcast Variables and Accumulators</b></div><div><br /></div><div>It is important to note that Spark breaks the transformation and action computations into tasks to run on separate machines, and each machine runs both its part of the transformation (e.g. map) and a local reduction, returning only its answer to the driver program. Spark breaks up the processing of RDD operations into tasks which are individually executed by an executor. Prior to execution, Spark computes the closure of the task by identifying the variables and methods which must be visible for the executor to perform its computations on the RDD. The closure is then serialized and sent to each executor as copies to be executed individually. Hence regular variables should not be referenced in such transformation/action operations as they are executed separately on different machines.</div><div> Sometimes though, a variable needs to be shared across multiple tasks, or between tasks and the driver program. 
Spark supports two types of shared variables, broadcast variables which can be used to cache a value in memory on all nodes, and accumulators which are variables that are only added to, such as counters and sums.</div><div><br /></div><div><div><b>Broadcast Variables: </b>A broadcast variable is a read-only variable cached on each machine, as opposed to variable copies sent with tasks. Broadcast variables are used to give every node a copy of a large input dataset in an efficient manner. Spark distributes the broadcast variables using actions that are executed through a set of stages, separated by distributed “shuffle” operations. Spark automatically broadcasts the common data needed by tasks within each stage. The data broadcast this way is cached in serialized form and deserialized before running each task. Hence explicitly created broadcast variables are useful only when tasks across multiple stages need the same data. Broadcast variables are created from a variable v by calling SparkContext.broadcast(v), and their value can be accessed by calling the value() method. The broadcast variable is used instead of the value v in any functions run on the cluster so that v is not shipped to the nodes more than once. Also, the object v should not be modified after it is broadcast in order to ensure that all nodes get the same value of the broadcast variable. The unpersist() method is used to temporarily release, and the destroy() method to permanently release, all the resources of the broadcast variable copied onto the executors.</div><div><br /></div><div><b>Accumulators: </b>Accumulators are variables that are only added to through associated operations. They are used for aggregating information across the executors and to implement counters or sums. Accumulators provide a mechanism for safely updating a variable when execution is split up across worker nodes in a cluster. Spark supports accumulators of only numeric types out of the box but allows adding support for new types. Accumulators can be unnamed or named, where a named accumulator is displayed in Spark's web UI for the stage that modifies it, along with its value. A numeric accumulator can be created by calling SparkContext.longAccumulator() or SparkContext.doubleAccumulator() to accumulate values of type Long or Double, respectively. Tasks running on a cluster can add to the accumulator using the add method. However, only the driver program can read the accumulator’s value, using its value method. Spark guarantees that each task’s update to the accumulator inside actions will only be applied once, regardless of any restarted tasks. For RDD transformations, each task’s update may be applied more than once if tasks or job stages are re-executed. Accumulators do not change the lazy evaluation model of Spark. The accumulator value is only updated once the RDD is computed as part of an action. As a result, accumulator updates inside functions like map() or filter() won't get executed unless some action happens on the RDD.</div></div><div><br /></div>
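<div>A minimal sketch showing both kinds of shared variables together; the lookup map and input values are assumptions made for illustration, and spark refers to the session used in the earlier examples.</div>
<pre class="brush: scala">// broadcast a small lookup table once to every executor instead of shipping it with each task
val lookup = spark.sparkContext.broadcast(Map("a" -> 1, "b" -> 2))

// named accumulator, visible in the web UI, counting records missing from the lookup table
val missingKeys = spark.sparkContext.longAccumulator("missingKeys")

val data = spark.sparkContext.parallelize(Seq("a", "b", "c", "a"))
val mapped = data.map { key =>
  if (!lookup.value.contains(key)) missingKeys.add(1L)
  lookup.value.getOrElse(key, 0)
}

mapped.count()                  // an action triggers the computation and the accumulator updates
println(missingKeys.value)      // only the driver reads the accumulator value
</pre>
<div><br /></div>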
<div><div><b>Shuffle operations</b></div><div><br /></div><div>Shuffle operations re-distribute the data across partitions such that it is grouped differently. This involves copying data across executors on different machines, making the shuffle a complex and costly operation (it involves disk and network I/O). Prime examples of shuffles are the reduceByKey and aggregateByKey operations. The result of executing a reduce function against all the values associated with a single key is a single value combining all values associated with that key. Not all values for a single key necessarily reside on the same partition, or even the same machine, which poses a challenge as they must be co-located to compute the result. During computation, a single task operates on a single partition to organize all the data to execute a single reduce function task. Spark then performs an all-to-all operation where it reads from all partitions to find all the values for all keys, and then brings together values across partitions to compute the final result for each key. Internally Spark generates sets of map tasks to organize the data, and a set of reduce tasks to aggregate them. The map task results reside in memory as long as they fit (and are written to disk when the size exceeds memory) and are sorted based on the target partition to be written into a single file. The reduce tasks read the relevant sorted blocks to aggregate the result. The newly shuffled data has no ordering of partitions or elements. The methods <b>mapPartitions</b>, <b>repartitionAndSortWithinPartitions</b> and <b>sortBy</b> are used to order the resultant shuffled data. Repartition operations like repartition and coalesce also cause a shuffle.</div></div><div><br /></div><div>There are two main <a href="https://0x0fff.com/spark-architecture-shuffle/">shuffle implementations</a> available in Spark, namely Sort Shuffle and Tungsten Sort. Before Spark 1.2.0, Hash Shuffle was used as the default, which created a separate file for each mapper task and for each reducer, resulting in a large number of open files in the filesystem, which impacted the performance of large operations. The Sort Shuffle implementation outputs a single file ordered and indexed by reducer id, which allows easily fetching the chunk of data related to a given reducer id by using the position of the related data block in the file and doing a single file seek before reading the file. It also falls back to having separate files for the reducers when the number of reducers is smaller than <b>spark.shuffle.sort.bypassMergeThreshold</b>. In Sort Shuffle, the sorted results of the map operation are not reused for the reduce operation; instead the results are sorted again on the reduce side using <a href="https://en.wikipedia.org/wiki/Timsort">TimSort</a>. The intermediate results are written to disk when not enough memory is available. The Sort Shuffle implementation creates fewer files for the map operation, but sorting of results is slow compared to Hash Shuffle.</div><div><br /></div><div><div><a href="https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html" target="_blank">Project Tungsten</a> was the largest change to Spark’s execution engine since the project’s inception. It focuses on substantially improving the efficiency of memory and CPU for Spark applications, to push performance closer to the limits of modern hardware. The cache-aware computation allows Spark applications to spend less CPU time waiting to fetch data from main memory and more time doing useful work. It also used <a href="https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html" target="_blank">code generation</a> to exploit modern compilers and CPUs. As Spark applications pushed the boundary of performance, the overhead of JVM objects and GC had become non-negligible. 
Java objects have a large inherent memory overhead: a simple 4-character string, which would take only 4 bytes in UTF-8, takes around 48 bytes in the JVM object model due to UTF-16 encoding, object headers and hash code data. Further, JVM garbage collection exploits the transient nature of young-generation objects, which have a high rate of allocation/deallocation, and this works well only when the GC can reliably estimate the life cycle of objects. As Spark knows much more than the JVM garbage collector about the life cycle of its memory blocks, it can manage memory more efficiently than the JVM. Hence an explicit memory manager was introduced to convert most Spark operations to operate directly against binary data rather than Java objects. It builds on sun.misc.Unsafe, an advanced functionality provided by the JVM that exposes C-style memory access, e.g. explicit allocation, deallocation and pointer arithmetic, whose method calls are compiled by the JIT into single machine instructions. </div><div> Tungsten Sort is a shuffle implementation that is part of Project Tungsten, which operates directly on serialized binary data (without deserializing it) and uses the memory copy functions of sun.misc.Unsafe to directly copy serialized data. It also uses a special cache-efficient sorter, ShuffleExternalSorter, that sorts arrays of compressed record pointers and partition ids. By using only 8 bytes of space per record in the sorting array, it works more efficiently with the CPU cache. The serialized data spilled from memory is stored directly on disk. The extra spill-merging optimizations are automatically applied when the shuffle compression codec supports concatenation of serialized stream output, which is enabled by the <b>spark.shuffle.unsafe.fastMergeEnabled</b> parameter. Tungsten sort also uses an <a href="https://issues.apache.org/jira/browse/SPARK-7542">off-heap storage</a> array, <a href="https://medium.com/@wx.london.cun/learn-from-facebook-60-tb-spark-use-case-56a7fb752517">LongArray</a>, to save 64-bit long pointers which are used to access all memory blocks (on heap/off heap). Tungsten sort works only for non-aggregation operations, as aggregation requires deserialization to aggregate new incoming values.</div></div><div><br /></div><div><div><b>RDD Persistence and Caching</b></div><div><br /></div><div>When an RDD is cached or persisted, each node stores the RDD partitions it computes in memory in order to reuse them in other actions on the same or a derived dataset. Caching and reusing RDDs often speeds up future actions by more than 10 times. An RDD can be marked for persistence or caching by using the persist() or cache() methods on it. Spark automatically monitors cache usage on each node and drops out old data partitions in a least-recently-used (LRU) fashion. To remove an RDD manually from the cache instead of waiting for it to fall out, the RDD.unpersist() method is used. Spark’s cache is fault-tolerant, as any lost RDD partition is automatically recomputed using the transformations that originally created it. Also, each persisted RDD can be stored using a different storage level based on storage location and format. These storage levels are set by passing a StorageLevel object to persist(). The cache() method is a shorthand for using the default storage level, which is StorageLevel.MEMORY_ONLY. Below are the available storage levels.</div></div></div><div><br /></div>
<table border="1" cellpadding="0" cellspacing="0" style="border-collapse: collapse; border: 1px solid rgb(221, 238, 238); font-family: arial, sans-serif; margin-left: auto; margin-right: auto; text-align: left; text-shadow: rgb(255, 255, 255) 1px 1px 1px; width: 80%;">
<tbody>
<tr>
<th style="background-color: #ddefef; border: 1px solid rgb(221, 238, 238); color: #336b6b; padding: 8px;" width="25%">Storage Level</th>
<th style="background-color: #ddefef; border: 1px solid rgb(221, 238, 238); color: #336b6b; padding: 8px;" width="75%">Description</th>
</tr>
<tr>
<td style="padding: 8px;"><b>MEMORY_ONLY</b></td>
<td style="padding: 8px;">RDDs are stored as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.</td>
</tr>
<tr>
<td style="padding: 8px;"><b>MEMORY_AND_DISK</b></td>
<td style="padding: 8px;">RDDs are stored as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.</td>
</tr>
<tr>
<td style="padding: 8px;"><b>MEMORY_ONLY_SER</b></td>
<td style="padding: 8px;">RDDs are stored as serialized Java objects. It is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.</td>
</tr>
<tr>
<td style="padding: 8px;"><b>MEMORY_AND_DISK_SER</b></td>
<td style="padding: 8px;">Similar to MEMORY_ONLY_SER, but spill partitions that don't fit in memory to disk instead of recomputing them on the fly each time they're needed.</td>
</tr>
<tr>
<td style="padding: 8px;"><b>DISK_ONLY</b></td>
<td style="padding: 8px;">Store the RDD partitions only on disk.</td>
</tr>
<tr>
<td style="padding: 8px;"><b>MEMORY_ONLY_2, MEMORY_AND_DISK_2</b></td>
<td style="padding: 8px;">Same as the levels above, but replicate each partition on two cluster nodes.</td>
</tr>
</tbody></table>
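<div><br /></div><div>Below is a minimal sketch (assuming an existing SparkContext <b>sc</b> and a hypothetical input path) showing persist() with an explicit storage level, cache(), and manual unpersist().</div>
<pre class="brush: scala">import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("hdfs:///data/app-logs")             // hypothetical input path
val errors = logs.filter(_.contains("ERROR"))

// Keep the filtered RDD around for several subsequent actions,
// serialized in memory and spilling to disk when memory is insufficient
errors.persist(StorageLevel.MEMORY_AND_DISK_SER)
// errors.cache() would be equivalent to errors.persist(StorageLevel.MEMORY_ONLY)

println(errors.count())                                     // first action materializes the cache
println(errors.filter(_.contains("timeout")).count())       // reuses the cached partitions

errors.unpersist()                                          // remove it from the cache manually
</pre>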
<div><br /></div><div><br /></div><div><div><font size="5"><b>Spark Core</b></font></div><div>Spark Core contains the basic functionality of Spark, including components for task scheduling, memory management, fault recovery, interacting with storage systems etc. Spark Core also contains the API that defines RDDs (Resilient Distributed Datasets), which are Spark’s main programming abstraction. RDDs represent a collection of items distributed across many compute nodes that can be manipulated in parallel. Spark Core provides many APIs for building and manipulating these collections.</div><div><br /></div><div><font size="5"><b>Spark SQL</b></font></div><div>Spark SQL is Spark’s package for processing structured and semi-structured data. It allows querying data via SQL as well as Hive Query Language (HQL) and supports many sources of data, including Hive tables, Parquet, and JSON. Spark SQL sits on top of Spark Core to introduce a new data abstraction called SchemaRDD, which provides support for structured and semi-structured data. Spark SQL provides a <a href="https://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html" target="_blank">natural syntax</a> for querying JSON data along with automatic inference of JSON schemas for both reading and writing data. It understands the nested fields in JSON data and allows users to directly access these fields without any explicit transformations. Spark SQL can also act as a <a href="http://spark.apache.org/docs/latest/sql-distributed-sql-engine.html" target="_blank">distributed query engine</a> using its JDBC/ODBC or command-line interface. In this mode, end-users or applications can interact with Spark SQL directly to run SQL queries, without the need to write any code.</div><div><br /></div><div><div><b>DataFrames</b></div><div><b><br /></b></div><div>A DataFrame is a distributed collection of data organized into rows and named columns. It organizes data into rows, where each row consists of a set of columns, and each column has a name and an associated type. It is conceptually equivalent to a table in a relational database, but with richer optimizations under the hood. A <a href="https://medium.com/analytics-vidhya/datasets-vs-dataframes-vs-rdds-d3c2dba2d0b4" target="_blank">DataFrame</a> works only on structured and semi-structured data, organizing the data into named columns. Cached data is stored in an in-memory columnar format in batches of rows, with the batch size set by spark.sql.inMemoryColumnarStorage.batchSize. Each column in each partition stores min-max values for partition pruning. This allows a better compression ratio than a standard RDD and delivers faster performance for queries that touch only a small subset of columns. A DataFrame serializes the data into off-heap storage (in memory) in binary format and then performs many transformations directly on it. The Tungsten physical execution explicitly manages memory and dynamically generates byte code for expression evaluation. DataFrames can be constructed from a wide array of sources such as structured data files, tables in Hive, external databases, or existing RDDs. The DataFrame API provides a higher-level abstraction which allows a query language to be used to manipulate data. It is available in Scala, Java, Python, and R. In Scala and Java, a DataFrame is represented by a Dataset of Rows, i.e. Dataset[Row] in Scala. DataFrame operations are referred to as untyped transformations, in contrast to the typed transformations of Datasets.</div></div>
<pre class="brush: scala">// create a basic SparkSession using SparkSession.builder()
val spark = SparkSession.builder()
.appName("Spark SQL basic example")
.config("spark.some.config.option", "some-value")
.getOrCreate()
// For implicit conversions like converting RDDs to DataFrames
import spark.implicits._
val rdd = spark.sparkContext.parallelize(1 to 10).map(x => (x, x * x))
val dataframe = spark.createDataFrame(rdd).toDF("key", "square")
dataframe.show()
</pre>
<pre class="brush: scala">// creates a DataFrame based on the content of a JSON file
val df = spark.read.json("examples/src/main/resources/people.json")
// Displays the content of the DataFrame to stdout
df.show()
// Select only the "name" column
df.select("name").show()
// Select everybody, but increment the age by 1
df.select($"name", $"age" + 1).show()
// Select people older than 21
df.filter($"age" > 21).show()
// Count people by age
df.groupBy("age").count().show()
</pre>
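<div><br /></div><div>Below is a small sketch (file paths are hypothetical) of reading and writing a few of the data sources mentioned above using the generic DataFrameReader/DataFrameWriter API.</div>
<pre class="brush: scala">// Parquet is the default format for load()/save()
val usersDF = spark.read.load("examples/src/main/resources/users.parquet")
usersDF.select("name", "favorite_color").write.save("/tmp/namesAndFavColors.parquet")
// An explicit format and options can also be specified
val peopleCsvDF = spark.read.option("header", "true").csv("examples/src/main/resources/people.csv")
peopleCsvDF.write.mode("overwrite").parquet("/tmp/people_parquet")
</pre>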
<div><br /></div><div>Temporary views in Spark SQL are session-scoped and disappear once the session terminates. A global temporary view is used for sharing the view among all sessions and keeping it alive until the Spark application terminates. A global temporary view is tied to the system-preserved database global_temp, and a qualified name must be used to refer to it, e.g. SELECT * FROM global_temp.view1.</div>
<pre class="brush: scala">// Register the DataFrame as a SQL temporary view which is only available within the session
df.createOrReplaceTempView("people")
// SQL function on SparkSession enables to run SQL queries programmatically and return DataFrame as a result
val sqlDF = spark.sql("SELECT * FROM people")
sqlDF.show()
// Register the DataFrame as a global temporary view which is shared among all sessions and is alive until Spark application terminates
df.createGlobalTempView("people")
// Global temporary view is tied to a system preserved database `global_temp`
spark.sql("SELECT * FROM global_temp.people").show()
// Global temporary view is available across sessions
spark.newSession().sql("SELECT * FROM global_temp.people").show()
</pre>
<div><br /></div>
<div><b>DataSets</b></div><div><br /></div><div><div>A <a href="https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/sql/Dataset.html" target="_blank">Dataset</a> is a <b>strongly-typed</b>, immutable distributed collection of objects that are mapped to a relational schema. Datasets efficiently process structured and unstructured data. A <a href="https://databricks.com/blog/2016/01/04/introducing-apache-spark-datasets.html" target="_blank">Dataset</a> can be constructed from JVM objects and then manipulated using functional transformations (map, flatMap, filter, etc.), similar to an RDD. The <a href="http://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html" target="_blank">Spark Dataset API</a> is an extension of the DataFrame API and provides both type safety and an object-oriented programming interface. It is available in Scala and Java. At the core of the Dataset API is a new concept called an encoder, which is responsible for converting between JVM objects and the tabular representation. The tabular representation is stored using Spark’s internal Tungsten binary format, allowing for operations on serialized data and improved memory utilization. Encoders serialize the objects into binary format for efficient processing or transmission of data over the network. Encoders are code-generated dynamically and use a format that allows Spark to perform many operations like filtering, sorting and hashing without deserializing the bytes back into an object. Spark supports automatically generating encoders for a wide variety of types, including primitive types (e.g. String, Integer, Long), Scala case classes, and Java Beans. The highly optimized encoders use runtime code generation to build custom bytecode for serialization, speeding up the serialization process and significantly reducing the size of encoded data. The encoders also serve as a powerful bridge between semi-structured formats (e.g. JSON) and type-safe languages like Java and Scala. A Dataset also provides compile-time type safety, which enables errors to be caught before running production applications. Aggregate operations on Datasets run much faster than the corresponding naive RDD implementation. The Dataset API also reduces memory usage by creating a more optimal layout in memory when caching Datasets.</div></div><div><br /></div>
<pre class="brush: scala">// Define the Spark context and create an instance of SQLContext
val conf = new SparkConf().setAppName("SQL Application").setMaster("local[*]")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// Required to ensure that the toDF() and toDS() methods work as expected
import sqlContext.implicits._
// Read JSON file to initialize as University DataSet
case class University(name: String, numStudents: Long, yearFounded: Long)
val schools = sqlContext.read.json("/schools.json").as[University]
schools.map(s => s"${s.name} is ${2015 - s.yearFounded} years old")
// create a Dataset using createDataset() or the toDS() method
case class Movie(title: String, review: String, year: Long)
val movies = Seq(Movie("Avengers", "Awesome", 2019L), Movie("Justice League", "Nice", 2018L))
val moviesDS = sqlContext.createDataset(movies)
moviesDS.show()
val moviesDS1 = movies.toDS()
moviesDS1.show()
// Encoders are created for case classes
case class Employee(name: String, age: Long)
val caseClassDS = Seq(Employee("Amy", 32)).toDS
caseClassDS.show()
// Encoders for most common types are automatically provided by importing spark.implicits._
val primitiveDS = Seq(1, 2, 3).toDS()
primitiveDS.map(_ + 1).collect() // Returns: Array(2, 3, 4)
// convert a DataFrame to a strongly typed Dataset (column names must match the case class fields)
case class ActorMovie(actor_name: String, movie_title: String, produced_year: Long)
val actorMovies = Seq(("Damon, Matt", "The Bourne Ultimatum", 2007L), ("Damon, Matt", "Good Will Hunting", 1997L))
val actorMoviesDS = actorMovies.toDF("actor_name", "movie_title", "produced_year").as[ActorMovie]
</pre><div><br /></div>Spark SQL can load data from <a href="https://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html" target="_blank">JSON data</a> source and execute Spark SQL query. The schema of dataset is automatically inferred and natively available without any user specification. This can be achieved using SQLContext's jsonFile and jsonRDD methods. We can create a SchemaRDD for a given JSON dataset and then can register the SchemaRDD as a table as shown in below example. SchemaRDDs can be created from many types of data sources, such as Apache Hive tables, Parquet files, JDBC, Avro file, or as the result of queries on existing SchemaRDDs. Since JSON is semi-structured and different elements might have different schemas, Spark SQL resolves conflicts on data types of a field. It is not required to know all fields appearing in the JSON dataset. The specified schema can either be with a subset of the fields appearing in the dataset or can have field that does not exist. It is easy to write SQL queries on the JSON dataset and the result of a query is represented by another SchemaRDD.<pre class="brush: scala">// Create a SQLContext using an existing SparkContext (sc)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// Create a SchemaRDD for the JSON dataset.
val employee = sqlContext.jsonFile("/path/to/employee.json")
// Register the created SchemaRDD as a temporary table.
employee.registerTempTable("employee")
// Visualize the employee schema
employee.printSchema()
// Example content of the employee JSON file is {"name":"Jackson", "address":{"city":"Los Angeles","state":"California"}}
val nameAndAddress = sqlContext.sql("SELECT name, address.city, address.state FROM employee")
nameAndAddress.collect.foreach(println)
</pre>
<div>Spark SQL supports two different methods for converting existing RDDs into Datasets. The first method uses reflection to infer the schema of an RDD that contains specific types of objects. It is more concise and works well when the schema is already known while writing the Spark application. The case class, which can contain complex types, defines the schema of the table. The names of the arguments to the case class are read using reflection and become the names of the columns. The second method for creating Datasets is through a programmatic interface which allows to construct a schema and then apply it to an existing RDD. This method is verbose but it allows to construct Datasets when the columns and their types are unknown until runtime. First an RDD of Rows is created from the original RDD, then schema is created matching the structure of Rows in the newly created RDD, and finally the schema is applied to the RDD of Rows via createDataFrame method provided by SparkSession.</div><div><div><br /></div><div>Spark SQL supports <a href="http://spark.apache.org/docs/latest/sql-ref-functions.html#scalar-functions" target="_blank">Built-in</a> and <a href="http://spark.apache.org/docs/latest/sql-ref-functions-udf-scalar.html" target="_blank">User Defined</a> Scalar Functions, that return a single value per row. Similarly Spark SQL also supports <a href="http://spark.apache.org/docs/latest/sql-ref-functions-builtin.html#aggregate-functions" target="_blank">Built-in</a> and <a href="http://spark.apache.org/docs/latest/sql-ref-functions-udf-aggregate.html" target="_blank">User Defined</a> Aggregate functions such as count(), countDistinct(), avg(), max(), min(), etc that return a single value on a group of rows.</div></div><div><br /></div><div><b>Catalyst optimizer</b></div><div><br /></div><div><a href="https://databricks.com/glossary/catalyst-optimizer" target="_blank">Catalyst optimizer</a> is the core of <a href="http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf" target="_blank">Spark SQL</a> which builds an extensible query optimizer. <a href="https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html">Catalyst optimizer</a> contains general library for representing trees and applying rules to manipulate them. On top that it has libraries specific to relational query processing (e.g., expressions, logical query plans), and sets of rules which handle different phases of query execution; analysis, logical optimization, physical planning, and code generation to compile parts of queries to Java bytecode. Catalyst also offers several public extension points, including external data sources and user-defined types. Catalyst also supports both rule-based and cost-based optimization. The data type in Catalyst is an immutable tree composed of node objects, which can be either a Literal, an Attribute or an Operation. Trees are manipulated by rules, which apply a <a href="http://docs.scala-lang.org/tutorials/tour/pattern-matching.html" target="_blank">pattern matching</a> function recursively on all the nodes of the tree, transforming the nodes that match each pattern to a result. Rules are executed multiple times in order to fully transform a tree. Catalyst groups rules into batches, and executes each batch until it reaches a fixed point were the tree stops changing after applying its rules. 
The Catalyst’s tree transformation is used four phases, (1) analyzing a logical plan to resolve references, (2) logical plan optimization, (3) physical planning, and (4) code generation to compile parts of the query to Java bytecode. In the physical planning phase, Catalyst generates one or more physical plans and selects a plan using a cost model. All other phases are purely rule-based optimizations such as pipelining projections or filters into one Spark map operation performed in physical planner. Each phase uses different types of tree nodes. Catalyst relies on a special feature of the Scala language, <a href="https://docs.scala-lang.org/overviews/quasiquotes/intro.html" target="_blank">quasi quotes</a>, to make code generation simpler.</div><div><br /></div><div><b>Performance Tuning</b></div><div><br /></div><div><div>Spark SQL can cache tables using an in-memory columnar format by calling spark.catalog.cacheTable("tableName") or dataFrame.cache(). Then Spark SQL will scan only required columns and will automatically tune compression to minimize memory usage and GC pressure. The spark.catalog.uncacheTable("tableName") method is called to remove the table from memory. <a href="http://spark.apache.org/docs/latest/sql-performance-tuning.html#caching-data-in-memory" target="_blank">Configuration</a> of in-memory caching can be done using the setConf method on SparkSession or by running SET key=value commands using SQL. Spark SQL supports providing join strategy hints, namely BROADCAST, MERGE, SHUFFLE_HASH and SHUFFLE_REPLICATE_NL, which instructs Spark to use the hinted strategy on each specified relation when joining them with another relation. Spark prioritizes the BROADCAST hint over the MERGE hint over the SHUFFLE_HASH hint over the SHUFFLE_REPLICATE_NL hint, when both sides of the join have hints. Spark SQL also provides <a href="http://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-hints.html#partitioning-hints" target="_blank">coalesce hints</a> to control the number of output files similar to the coalesce, repartition and repartitionByRange in Dataset API. <a href=" http://spark.apache.org/docs/latest/sql-performance-tuning.html#adaptive-query-execution" target="_blank">Adaptive Query Execution</a> (AQE) is an optimization technique in Spark SQL that makes use of the runtime statistics to choose the most efficient query execution plan. AQE is disabled by default and can be turned on using the umbrella configuration of spark.sql.adaptive.enabled. As of Spark 3.0, AQE has three major features, including coalescing post-shuffle partitions, converting sort-merge join to broadcast join, and skew join optimization.</div></div><div><br /></div><div><b>Structured Streaming</b></div><div><br /></div><div><div><a href="https://1.bp.blogspot.com/-SKNCORmbPgY/XxD0NBusVkI/AAAAAAAAhzg/tF2zOWM_fxUtBApm01a4TMEUP-P5lQXZwCLcBGAsYHQ/s526/spark-2-structured-streaming-OG%2B%25281%2529.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em; text-align: center;"><img border="0" data-original-height="526" data-original-width="482" height="400" src="https://1.bp.blogspot.com/-SKNCORmbPgY/XxD0NBusVkI/AAAAAAAAhzg/tF2zOWM_fxUtBApm01a4TMEUP-P5lQXZwCLcBGAsYHQ/w366-h400/spark-2-structured-streaming-OG%2B%25281%2529.png" width="366" /></a></div><div><a href="http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html" target="_blank">Structured Streaming</a> is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. 
It takes care of running the streaming computation incrementally and continuously, and updating the final result as streaming data continues to arrive. The Spark SQL engine also ensures fault tolerance through checkpointing and Write-Ahead Logs. Spark SQL's Dataset/DataFrame API can also be used for stream aggregations, event-time windows, stream-to-batch joins etc. Structured Streaming queries are processed by default using a micro-batch processing engine, which processes data streams as a series of small batch jobs, thereby achieving end-to-end latencies as low as 100 ms and exactly-once fault-tolerance guarantees. A new low-latency processing mode called Continuous Processing was added in Spark 2.3, which achieves end-to-end latencies as low as 1 millisecond with at-least-once fault-tolerance guarantees. Structured Streaming treats a live data stream as a table that is being continuously appended to, and runs the standard batch-like query as an incremental query on the unbounded table. For every data item arriving on the stream, a new row is appended to the Input Table at each trigger interval (say, every second), which eventually updates the Result Table. The source data item is then discarded after updating the result. Whenever the Result Table gets updated, the changed result rows are written to an external sink. There are a few types of built-in output sinks, e.g. the File sink, Kafka sink, Foreach sink, and the Console and Memory sinks for debugging. Which result rows are written to external storage depends on the defined output mode. In Complete Mode the entire updated Result Table is written to the external storage, while for Append Mode and Update Mode only new or updated rows, respectively, of the Result Table are written. The foreach and foreachBatch operations allow arbitrary operations and writing logic to be applied on the output of a streaming query. Dataset.writeStream() also provides trigger settings for a streaming query, which define the timing of streaming data processing, i.e. whether the query is going to be executed as a micro-batch query with a fixed batch interval or as a continuous processing query.</div></div>
<pre class="brush: scala">val spark = SparkSession.builder
.appName("NetworkWordCount")
.getOrCreate()
import spark.implicits._
// lines is the DataFrame representing an unbounded table containing the streaming text data
val lines = spark.readStream
.format("socket")
.option("host", "localhost")
.option("port", 9999)
.load()
// First the .as[String] converts DataFrame to a Dataset of String, then the lines are split into words using flatMap()
val words = lines.as[String].flatMap(_.split(" "))
// Dataset are grouped by unique values to generate running word count Dataframe
val wordCounts = words.groupBy("value").count()
// Start running the query in background with complete output mode, that prints running counts on console
val query = wordCounts.writeStream
.outputMode("complete")
.format("console")
.start()
// using the query object handle, wait for the termination of the query
query.awaitTermination()
</pre>
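<div><br /></div><div>Below is a minimal sketch (assuming the SparkSession <b>spark</b> from the example above; the directory paths and schema are hypothetical) of a file source feeding a Parquet file sink in append mode with an explicit processing-time trigger, illustrating the built-in sources, sinks and trigger settings described earlier.</div>
<pre class="brush: scala">import org.apache.spark.sql.functions.col
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types.{StringType, StructType}

// A file source requires the schema to be declared up front
val eventSchema = new StructType().add("user", StringType).add("action", StringType)
val events = spark.readStream
  .schema(eventSchema)
  .json("/tmp/events-in")                 // picks up new JSON files dropped into this directory

val logins = events.filter(col("action") === "login")

// Append the new rows to a Parquet file sink every 30 seconds
val fileQuery = logins.writeStream
  .outputMode("append")
  .format("parquet")
  .option("path", "/tmp/events-out")
  .option("checkpointLocation", "/tmp/events-checkpoint")
  .trigger(Trigger.ProcessingTime("30 seconds"))
  .start()
</pre>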
<div><br /></div>Event-time is the time when the data was generated and is embedded in the data itself, which makes it possible to perform <a href="http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#window-operations-on-event-time" target="_blank">window-based aggregations</a> on it. Since Spark is updating the Result Table, it has full control over updating old aggregates when late data arrives (based on event-time), as well as cleaning up old aggregates to limit the size of intermediate state data. Every streaming source is assumed to have offsets (similar to Kafka offsets, or Kinesis sequence numbers) to track the read position in the stream. The engine uses checkpointing and write-ahead logs to record the offset range of the data being processed in each trigger (interval). Structured Streaming ensures end-to-end exactly-once semantics under any failure, using replayable sources and idempotent sinks.</div><div><br /></div><div>Structured Streaming allows all kinds of operations to be applied on streaming DataFrames/Datasets, ranging from untyped, SQL-like operations (e.g. select, where, groupBy), to typed RDD-like operations (e.g. map, filter, flatMap). Structured Streaming supports joining a streaming Dataset/DataFrame with a static Dataset/DataFrame as well as with another streaming Dataset/DataFrame. The result of the streaming join is generated incrementally and will be exactly the same as if it were computed against a static Dataset/DataFrame. Structured Streaming supports inner joins and some types of outer joins between a streaming and a static DataFrame/Dataset. With Spark 2.3, support for stream-stream joins was added, which enables joining two streaming Datasets/DataFrames. Since any row received from one input stream can match with any future, yet-to-be-received row from the other input stream, the past input is buffered as streaming state in order to match every future input with past input and accordingly generate joined results. All late and out-of-order data in stream-stream joins is handled using watermarks. Structured Streaming provides a StreamingQuery object, created when a query is started, to monitor (progress and errors) and manage the query.</div><div><br /></div><div><b>Continuous Processing</b></div><div><br /></div><div><a href="http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#continuous-processing" target="_blank">Continuous processing</a> is a new, experimental streaming execution mode introduced in Spark 2.3 that enables low (~1 ms) end-to-end latency with at-least-once fault-tolerance guarantees. In contrast, the default micro-batch processing engine achieves latencies of ~100 ms at best. The Continuous Processing engine launches multiple long-running tasks that continuously read data from sources, process it and continuously write to sinks. The number of tasks required by the query depends on how many partitions the query can read from the sources in parallel, which in turn depends on the number of cores within the cluster. Since it is an experimental mode, currently there is no automatic retry mechanism for failed tasks, and they must be restarted manually from the checkpoint. 
Further stopping a continuous processing stream may produce spurious task termination warnings which could safely be ignored.</div><div><br /></div><div><b><font size="5">Spark Streaming</font></b></div><div><div><div><a href="http://spark.apache.org/docs/latest/streaming-programming-guide.html" target="_blank">Spark Streaming</a> is a Spark component that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. It leverages Spark Core's fast scheduling capability to perform streaming analytics, by ingesting data in mini-batches and performing RDD transformations on those mini-batches of data.</div></div><div><br /></div><div>The traditional stream processing systems are designed with a continuous operator model to process the data. In such systems, there is a set of worker nodes, each running one or more continuous operators. Each continuous operator processes the streaming data one record at a time and forwards the records to other operators in the pipeline. There are “source” operators for receiving data from ingestion systems and “sink” operators which output data to downstream systems. When such system is scaled to handle large volumes of real time data, node failures and unpredictable traffic causes system failures and uneven resource allocation. In case of node failure, the system has to restart the failed continuous operator on another node and replay some part of the data stream to recompute the lost information, thus halting the pipeline. Spark Streaming solves these issues by using a new architecture called <a href="http://people.csail.mit.edu/matei/papers/2013/sosp_spark_streaming.pdf" target="_blank">discretized streams</a> leveraging the fault tolerance and other features from the Spark engine.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-zFIxY_tCrJM/XvREX9YVScI/AAAAAAAAhoc/2NDL6eeDPPgH-aMonPhc6MuAysAExrkCQCK4BGAsYHg/s1024/image21.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="734" data-original-width="1024" height="458" src="https://1.bp.blogspot.com/-zFIxY_tCrJM/XvREX9YVScI/AAAAAAAAhoc/2NDL6eeDPPgH-aMonPhc6MuAysAExrkCQCK4BGAsYHg/w640-h458/image21.png" width="640" /></a></div><div><br /></div><div><br /></div><div><a href="https://databricks.com/blog/2015/07/30/diving-into-apache-spark-streamings-execution-model.html" target="_blank">Spark Streaming</a> discretizes the streaming data into tiny, sub-second micro-batches. Such splitting of stream data into micro-batches allows for fine-grained allocation of computations to resources, thus load balancing tasks across the workers. Initially Spark Streaming’s Receivers receive data in parallel from the source and separates them into blocks. The source is polled after every batch interval defined in the application to create a batch for each incoming record. Then the receiver replicates the block of data among the executors (worker nodes), buffering it within their memory. The Receivers are long running process in one of the Executors with its life span same as the driver program. Then the latency-optimized Spark engine runs short tasks (tens of milliseconds) to process the batches and output the results to other systems. Internally the Driver launches tasks on every batch interval to process the blocks of data and the subsequent results are sinked to the destination location. 
Spark tasks are assigned dynamically to the workers based on the locality of the data and available resources. This enables both better load balancing and faster fault recovery. Spark can handle node failures by relaunching the failed tasks in parallel into other nodes within the cluster, evenly distributing all the recomputations across many nodes. Each batch of streaming data is represented by an RDD, a fault-tolerant distributed dataset in Spark. A series of such RDDs is called a DStream. The common representation of RDD allows batch and streaming workloads to interoperate seamlessly. It allows to introduce new operators dynamically for ad-hoc queries which combine streaming data with static datasets or support interactive queries. Spark interoperability also extends to rich libraries like MLlib (machine learning), SQL, DataFrames, and GraphX. This allows Spark to support advanced analytics like machine learning and SQL queries were workloads are complex and require continuously updates to data models. Since Spark worker/executor is a long-running task, it occupies one of the cores allocated to the Spark Streaming application. Hence the Spark Streaming application needs to allocate enough cores (or threads, if running locally) to process the received data, as well as to run the receiver(s). In other words, the number of cores allocated to the Spark Streaming application must be more than the number of receivers. </div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-9yqnril5S-0/XvVOXbBtOvI/AAAAAAAAhro/t4hvX9hD3Jc_yhRuKKH5_mtNYgCMKlUfwCK4BGAsYHg/s2528/SparkReceiver12.png" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="1146" data-original-width="2528" height="290" src="https://1.bp.blogspot.com/-9yqnril5S-0/XvVOXbBtOvI/AAAAAAAAhro/t4hvX9hD3Jc_yhRuKKH5_mtNYgCMKlUfwCK4BGAsYHg/w640-h290/SparkReceiver12.png" width="640" /></a><a href="https://1.bp.blogspot.com/-S5i-r-5eUZw/XvY8zJlgYyI/AAAAAAAAhtk/mFMviDx0UeUjnCppgAcghNIpRlOsIoQmQCK4BGAsYHg/s496/imag14%2B%25281%2529.png" style="clear: right; display: inline; margin-bottom: 1em; margin-left: 1em;"><img border="0" data-original-height="348" data-original-width="496" height="281" src="https://1.bp.blogspot.com/-S5i-r-5eUZw/XvY8zJlgYyI/AAAAAAAAhtk/mFMviDx0UeUjnCppgAcghNIpRlOsIoQmQCK4BGAsYHg/w400-h281/imag14%2B%25281%2529.png" width="400" /></a></div><div><br /></div><div><div class="separator" style="clear: both; text-align: center;"><br /></div></div><div><b>DStreams</b></div><div><div><b><br /></b></div><div>Discretized Stream or DStream is the basic abstraction provided by Spark Streaming. It represents a continuous stream of data, either the input data stream received from source, or the processed data stream generated by transforming the input stream. Internally, a DStream is represented by a continuous series of RDDs, which is Spark’s abstraction of an immutable, distributed dataset. Each RDD in a DStream contains data from a certain interval. Any operation applied on a DStream translates to operations on the underlying RDDs. There are two types of operations performed on DStreams i.e transformations and output operations. Every Spark Streaming application processes the DStream RDDs using Spark transformations which create new RDDs. Any operation applied on a DStream translates to operations on the underlying RDDs, which in turn, applies the transformation to the elements of the RDD. 
Output operations, like saving to HDFS or calling an external API, produce output in batches. Similar to RDDs, DStreams allow the stream’s data to be persisted in memory using the persist() method.</div><div><br /></div><div><b>Input DStreams</b></div><div><br /></div><div>Input DStreams are DStreams representing the stream of input data received from streaming sources. Every input DStream (except the file stream) is associated with a Receiver object which receives the data from a source and stores it in Spark’s memory for processing. Spark Streaming provides two categories of built-in streaming sources.</div><div><ul style="text-align: left;"><li><b>Basic sources</b>: Sources directly available in the StreamingContext API. For example: file systems, socket connections, and Akka actors.</li><li><b>Advanced sources</b>: Sources like Kafka, Flume, Kinesis, Twitter, etc. are available through extra utility classes which require adding extra dependencies.</li></ul></div><div>Multiple streams of data can be received in parallel in a streaming application by creating multiple input DStreams. This will create multiple receivers which will simultaneously receive multiple data streams. </div><div><br /></div><div><b>Failure Recovery and Checkpoints</b></div><div><br /></div><div>When there is a failure in an Executor, the Receiver and the stored memory blocks are lost; the Driver then starts a new receiver and the tasks are resumed using the replicated memory blocks. On the other hand, when the Driver program itself fails, the corresponding Executors as well as their computations and all the stored memory blocks are lost. For such a scenario, Spark provides a feature called DStream Checkpointing. Checkpointing enables periodic storage of the DAG of DStreams to fault-tolerant storage, e.g. HDFS. So when the Driver, Receiver and Executors are restarted, the new active Driver program can make use of this persisted checkpoint state to resume processing. Even if we are able to restore the checkpoint state and start processing from the previous state with a new active Executor, Driver program and Receiver, we still need a mechanism to recover the memory blocks at that state. To achieve this, Spark comes with a feature called the Write Ahead Log (WAL), which synchronously saves memory blocks into fault-tolerant storage.</div><div><br /></div><div><div>Spark Streaming provides checkpointing of both metadata and the actual data. Metadata includes the configuration used to create the streaming application, the set of DStream operations that define the streaming application, and the jobs that are queued but have not yet completed. Metadata checkpoints are used to recover from a failure of the node running the driver of the streaming application. After the RDD transformations, the generated RDDs depend on RDDs of previous batches, which causes the length of the dependency chain to keep increasing with time. To avoid such unbounded increases in recovery time, intermediate RDDs of stateful transformations are periodically checkpointed to reliable storage (e.g. HDFS) to cut off the dependency chains. Checkpoints are generally enabled for stateful transformations when either updateStateByKey or reduceByKeyAndWindow is used in the application, or for recovering from driver failure using metadata checkpoints. Checkpoints can be enabled by setting a directory in a fault-tolerant and reliable file system (e.g. HDFS, S3) that is used to store the checkpoint data. 
It is achieved by using streamingContext.checkpoint(checkpointDirectory). Further to recover the application from driver failures, the application creates a new StreamingContext when started for first time, but it re-creates the StreamingContext from the checkpoint data when it is restarted after failure. Since checkpointing of RDDs by saving into a reliable storage increases the processing time, it is recommended that the default batch interval for checkpoint should be 10 seconds. <a href="http://spark.apache.org/docs/latest/rdd-programming-guide.html#accumulators" target="_blank">Accumulators</a> and <a href="http://spark.apache.org/docs/latest/rdd-programming-guide.html#broadcast-variables" target="_blank">Broadcast variables</a> cannot be recovered from checkpoint in Spark Streaming</div></div><div><br /></div><div><b>Example of Spark Streaming</b></div><div><br /></div><div>Below is the example of Spark Streaming dataset from netcat server as a source and computation of word count on the incoming data. The <a href="https://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkConf.html" target="_blank">Spark configuration</a> is defined and Master URL for spark cluster is configured (using args(0)). The setMaster() method of SparkConf allows to configure a Spark, Mesos, Kubernetes or YARN cluster URL, or a special “local[*]” string to run in local mode (detects the number of cores in the local system). The * in “local[*]” corresponds to number of threads configured to run the tasks locally. When using input DStream based on receiver (e.g. sockets, Kafka, etc.), it is recommended to have a thread for processing the received data and additional threads equal to the number of receivers running in the system. Spark internally creates a SparkContext which can be accessed as ssc.sparkContext. Then a new <a href="https://spark.apache.org/docs/latest/api/scala/org/apache/spark/streaming/StreamingContext.html">StreamingContext</a> is initialized with batch interval as 1 second. The streaming context's, ssc.socketTextStream(…) creates a DStream from text data received over a TCP socket connection. Besides sockets, the StreamingContext API provides methods for creating DStreams from files and Akka actors as input sources.</div></div><div><br /></div><div>The streaming context connects with the receiver and defines micro batches using the time duration specified.</div>
<pre class="brush: scala"> def main(args: Array[String]) {
if (args.length < 5) {
System.err.println("Usage: NetworkWordCount <master> <hostname> <port> <duration> <checkpoint directory>")
System.exit(1)
}
StreamingExamples.setStreamingLogLevels()
val ssc = StreamingContext.getOrCreate(args(4), () => createContext(args))
// Start receiving data and processing
ssc.start()
// Wait for the processing to be stopped (manually or due to any error)
ssc.awaitTermination()
}
def createContext(args: Array[String]) = {
// Create a local StreamingContext with two working thread and batch interval of 1 second.
// The master requires 2 cores to prevent a starvation scenario. Hence args(0) is local[2].
val sparkConf = new SparkConf().setMaster(args(0)).setAppName("NetworkWordCount")
sparkConf.set("spark.streaming.receiver.writeAheadLog.enable", "true")
val ssc = new StreamingContext(sparkConf, Seconds(args(3).toInt))
// Create a socket stream(ReceiverInputDStream) on target ip:port
val lines = ssc.socketTextStream(args(1), args(2).toInt, StorageLevel.MEMORY_AND_DISK_SER)
// Split words by space to form DStream[String]
val words = lines.flatMap(_.split(" "))
// count the words to form DStream[(String, Int)]
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
// print the word count in the batch
wordCounts.print()
ssc.checkpoint(args(4))
ssc
}
</pre>
<div><br /></div><div><b>File Streams</b></div><div>DStream can be created for reading data from files on any file system compatible with the HDFS API (e.g. HDFS, S3, NFS, etc.), using <b>streamingContext.fileStream[KeyClass, ValueClass, InputFormatClass](dataDirectory)</b>. Spark streaming provides monitoring of special directory called dataDirectory and process any files created in the dataDirectory except files in nested directories or files with different formats. Once files are moved into dataDirectory they cannot be changed, as files updated with new data will not be read. The simple text files can be read using <b>streamingContext.textFileStream(dataDirectory)</b>. Reading of file streams does not require running a receiver, hence no allocation of CPU cores.</div></div><div><br /></div><div>It is very important to state that once a streaming context has been started, no new streaming computations can be set up or added to it. Also once a context has been stopped, it cannot be restarted. Further only a single StreamingContext can be active in a JVM at any given time. The stop() method on StreamingContext stops both the StreamingContext and the internal SparkContext. To stop only the StreamingContext, the optional parameter stopSparkContext of stop() is passed as false. A SparkContext can be re-used to create multiple StreamingContexts, as long as the previous StreamingContext is stopped, without stopping the original SparkContext, before the next StreamingContext is created.</div><div><br /></div><div><b>Receiver Reliability</b></div><div><div>When the Spark system is receiving data from reliable sources e.g. Kafka or Flume, acknowledging that the data received is correct ensures that there is no data loss in case of any failure. To enable this Spark provides two kinds of receivers:</div><div><br /></div><div><b>Reliable Receiver</b>: Sends acknowledgment to a reliable source that the data has been received correctly and stored in Spark with Replication.</div><div><br /></div><div><b>Unreliable Receiver</b>: It doesn’t send any acknowledgement to a source upon receiving data. It is used for sources that do not support acknowledgement, or don't require additional complexity of acknowledgement.</div><div><br /></div><div><b>Transformations on DStreams</b></div><div><br /></div>
<table border="1" cellpadding="0" cellspacing="0" style="border-collapse: collapse; border: 1px solid rgb(221, 238, 238); font-family: arial, sans-serif; text-align: left; text-shadow: rgb(255, 255, 255) 1px 1px 1px; width: 100%;">
<tbody>
<tr>
<th style="background-color: #ddefef; border: 1px solid rgb(221, 238, 238); color: #336b6b; padding: 8px;" width="20%">Transformation</th>
<th style="background-color: #ddefef; border: 1px solid rgb(221, 238, 238); color: #336b6b; padding: 8px;" width="80%">Description</th>
</tr>
<tr>
<td style="padding: 8px;">map(func)</td>
<td style="padding: 8px;">return a new DStream by passing each element of the source DStream through a function func</td>
</tr>
<tr>
<td style="padding: 8px;">flatMap(func)</td>
<td style="padding: 8px;">each input item can be mapped to 0 or more output items</td>
</tr>
<tr>
<td style="padding: 8px;">filter(func)</td>
<td style="padding: 8px;">return only the records of the source DStream on which the predicate func returns true</td>
</tr>
<tr>
<td style="padding: 8px;">repartition(numPartitions)</td>
<td style="padding: 8px;">Changes the level of parallelism in this DStream by creating more or fewer partitions</td>
</tr>
<tr>
<td style="padding: 8px;">count()</td>
<td style="padding: 8px;">counting the number of elements in each RDD of the source DStream</td>
</tr>
<tr>
<td style="padding: 8px;">reduce(func)</td>
<td style="padding: 8px;">Return a new DStream of single-element RDDs by aggregating the elements in each RDD of the source DStream using a function func</td>
</tr>
<tr>
<td style="padding: 8px;">reduceByKey(func, [numTasks])</td>
<td style="padding: 8px;">return a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function</td>
</tr>
</tbody></table>
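<div><br /></div><div>Below is a minimal sketch (reusing the <b>lines</b> DStream from the NetworkWordCount example above; the field positions are illustrative) chaining a few of the transformations listed in the table.</div>
<pre class="brush: scala">// Keep only the records of interest
val errorLines = lines.filter(_.contains("ERROR"))
// Per-batch count of matching records (a DStream of single-element RDDs)
val errorCounts = errorLines.count()
// Change the level of parallelism of the DStream
val repartitioned = errorLines.repartition(4)
// Build (key, 1) pairs from the first token and aggregate them per batch
val perKeyCounts = errorLines.map(line => (line.split(" ")(0), 1)).reduceByKey(_ + _)
perKeyCounts.print()
</pre>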
</div><div><div><br /></div><div><b>Window Operations</b></div><div><br /></div><div>Spark Streaming also provides windowed computations, which allow you to apply transformations over a sliding window of data. Every time the window slides over a source DStream, the source RDDs that fall within the window are combined and operated upon to produce the RDDs of the windowed DStream. The below diagram shows that the window operation is applied over the last 3 time units of data, and slides by 2 time units. Any window operation takes parameters window length, the duration of the window and the sliding interval, interval at which the window operation is performed. Both the window length and sliding interval should be multiples of the batch interval of the source DStream.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-_1iEfrp0ya4/XvbiYng08vI/AAAAAAAAhuI/0HYUgwGiQNQjOU1STvA3eAWILVty3HBIQCK4BGAsYHg/s994/streaming-dstream-window.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="388" data-original-width="994" height="250" src="https://1.bp.blogspot.com/-_1iEfrp0ya4/XvbiYng08vI/AAAAAAAAhuI/0HYUgwGiQNQjOU1STvA3eAWILVty3HBIQCK4BGAsYHg/w640-h250/streaming-dstream-window.png" width="640" /></a></div><div><br /></div><div><br /></div><div>The below example applies reduceByKey operation on the pairs of DStream of (word, 1) pairs over the last 30 seconds of data, using the operation reduceByKeyAndWindow.</div></div>
<pre class="brush: scala">// pairs is a DStream of (word, 1) tuples, e.g. derived from the words DStream in the earlier example
val pairs = words.map(word => (word, 1))
// Reduce the last 30 seconds of data, every 10 seconds
val windowedWordCounts = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))</pre>
<div><div><br /></div><div><b>updateStateByKey</b></div><div><br /></div><div>The updateStateByKey operation allows to maintain arbitrary state while continuously updating it with new information. This can be achieved by following steps.</div><div><br /></div><div>Define the state - The state can be an arbitrary data type.</div><div>Define the state update function - Specify with a function how to update the state using the previous state and the new values from an input stream.</div><div><br /></div><div>Spark applies the state update function for all existing keys, regardless of them having new data in a batch or not. If the update function returns None then the key-value pair will be eliminated.</div>
<pre class="brush: scala">def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
val newCount = newValues.sum + runningCount.getOrElse(0) // add the new values to the previous running count to get the new count
Some(newCount)
}
val runningCounts = pairs.updateStateByKey[Int](updateFunction _)
</pre>
<div><br /></div><div><div><b>Data Serialization</b></div><div><br /></div><div>Spark Streaming serializes the two types of data below, mostly in order to persist the data.</div><div><br /></div><div><b>Input data</b>: By default, the input data received through Receivers is stored in the executors’ memory with StorageLevel.MEMORY_AND_DISK_SER_2. The data is serialized into bytes to reduce GC overheads, and replicated for tolerating executor failures. Also, the data is kept first in memory, and spilled over to disk only if the memory is insufficient to hold all of the input data necessary for the streaming computation. This serialization is an overhead, as the receiver must deserialize the received input data and re-serialize it using Spark’s serialization format.</div><div><br /></div><div><b>Persisted RDDs generated by Streaming Operations</b>: RDDs generated by streaming computations may be persisted in memory. For example, window operations persist data in memory as it will be processed multiple times. However, unlike the Spark Core default of StorageLevel.MEMORY_ONLY, persisted RDDs generated by streaming computations are persisted with StorageLevel.MEMORY_ONLY_SER, i.e. serialized by default to minimize GC overheads.</div></div><div><br /></div><div><div><b>Transform Operation</b></div><div><br /></div><div>The transform operation, along with variations like transformWith, allows arbitrary RDD-to-RDD functions to be applied on a DStream. It can be used to apply any RDD operation that is not directly exposed in the DStream API. This makes it possible, for example, to do real-time data cleaning by joining the input data stream with precomputed spam information and then filtering based on it.</div></div>
<pre class="brush: scala">val spamInfoRDD = ssc.sparkContext.newAPIHadoopRDD(...) // RDD containing spam information
val cleanedDStream = wordCounts.transform { rdd =>
rdd.join(spamInfoRDD).filter(...) // join data stream with spam information to do data cleaning
...
}
</pre>
<div><b><br /></b></div><div><b>Join Operations</b></div><div><br /></div><div>Streams can be very easily joined with other streams using different kinds of joins in Spark Streaming. For each batch interval, the RDD generated by stream1 will be joined with the RDD generated by stream2. It also supports the leftOuterJoin, rightOuterJoin, fullOuterJoin operations. Stream can be joined with a dataset using DStream.transform operation. The function provided to transform is evaluated for every batch interval and therefore it will use the current dataset that dataset reference points to.</div><div><br /></div><div><br /></div><div><b><font size="5">Spark ML</font></b></div><div><div>ML is a distributed machine learning framework above Spark because of the distributed memory-based Spark architecture. Spark MLlib is nine times as fast as the Hadoop disk-based version of Apache Mahout (before Mahout gained a Spark interface). Mahout machine learning library running on MapReduce is slow. MLlib is Apache Spark's scalable machine learning library which substitutes Mahout. It implements a set of commonly used machine learning and statistical algorithms which include correlations and hypothesis testing, classification and regression, clustering, and principal component analysis.</div><div><br /></div><div><font size="5"><b>GraphX</b></font></div><div><a href="https://spark.apache.org/docs/latest/graphx-programming-guide.html" target="_blank">GraphX</a> is a distributed graph-processing framework on top of Spark. It provides an API for expressing graph computation that can model the user-defined graphs by using Pregel abstraction API. GraphX extends the Spark RDD by introducing a new Graph abstraction: a directed multigraph with properties attached to each vertex and edge. GraphX exposes a set of fundamental operators (e.g., subgraph, joinVertices, and aggregateMessages) as well as an optimized variant of the Pregel API to support graph computation.</div></div><div><br /></div><div><font size="5"><b>Limitations of Spark</b></font></div><div>Spark was not designed as a <a href="https://www.qubole.com/blog/apache-spark-use-cases/">multi-user environment</a>. Spark users are required to know whether the memory they have access to is sufficient for a dataset. Adding more users further complicates this since the users will have to coordinate memory usage to run projects concurrently. Due to this inability to handle this type of concurrency, users will want to consider an alternate engine, such as Apache Hive, for large, batch projects. Spark Streaming works with the ingestion (input) timestamp of the data rather than the event-time. Due to this it puts the data in a batch even if the event was generated earlier and belonged to the earlier batch, which may result in less accurate information as it is equal to the data loss.</div><div><br /></div><div><br /></div></div>Pranav Patilhttp://www.blogger.com/profile/10125896717744035325noreply@blogger.com0tag:blogger.com,1999:blog-7610061084095478663.post-63343946710206720092020-05-18T13:22:00.153-07:002021-07-10T19:14:28.524-07:00Akka - Evolution of Multithreading<div>Over the past decade with rise of mobile applications, distributed micro-services and cloud-infrastructure services, the applications need to be highly responsive, provide high throughput and efficiently utilize system resources. Many real-time systems ranging from financial applications to games cannot wait for single-threaded process to complete. 
Also, resource-intensive systems would take a lot of time to finish tasks without parallelization. Multithreading takes advantage of the underlying hardware by running multiple threads, making the application more responsive and efficient.</div><div><br /></div><div><a href="https://www.educative.io/blog/multithreading-and-concurrency-fundamentals" target="_blank">Multithreading</a>, achieved through context switching, gives rise to many problems such as thread management, accessing shared resources, race conditions and deadlocks. Several multithreading concepts were developed to resolve these problems. The thread pool allowed task submission and execution to be decoupled, and relieved developers from the manual creation of threads. It consists of homogeneous worker threads (the pool size) that are assigned to execute tasks and returned back to the pool once the task is finished. The synchronization technique of locks is used to limit access to a resource by multiple threads. A mutex, on the other hand, is used to guard shared data, only allowing a single thread to access a resource. Such locking seriously limits concurrency, with blocked caller threads not performing any work while the CPU does the heavy lifting of suspending and restoring them later. The <a href="https://winterbe.com/posts/2015/04/07/java8-concurrency-tutorial-thread-executor-examples/" target="_blank">ExecutorService</a> was introduced as part of the Concurrency API in Java, which provides a higher abstraction over threads. It manages thread creation and maintains a thread pool under the hood, and is capable of running multiple asynchronous tasks. Callable tasks were added to allow results to be returned to the caller within a Future object. These and many other improvements help simplify concurrency code but are not enough to avoid complex thread synchronization.</div><div><br /></div><div><div>Concurrency means that multiple tasks run in overlapping time periods, while Parallelism means the tasks are split up into smaller subtasks to be processed in parallel. Concurrent tasks are stateful, often with complex interactions, while parallelizable tasks are stateless, do not interact, and are composable. A <a href="http://tutorials.jenkov.com/java-concurrency/concurrency-models.html#concurrency-models-and-distributed-system-similarities" target="_blank">concurrency model</a> specifies how threads in the system collaborate to complete the tasks they are given. The important aspect of a concurrency model is whether the threads are designed to share a common state or each thread has its own state isolated from other threads. When threads share state by using and accessing the shared object, problems like race conditions and deadlocks may occur. On the other hand, when threads have separate state, they need to communicate either by exchanging immutable objects among them, or by sending copies of objects (or data) among them. This ensures that no two threads write to the same object, thus avoiding the concurrency problems faced with shared state. The <b>Parallel worker model</b>, which is most common, has a delegator that distributes the incoming jobs to different workers. Each worker completes the entire job, working in parallel and running in different threads. The parallel worker model works well for isolated jobs but becomes complex when workers need to access shared data. The <b>Event driven model</b> (Reactive model) is where workers perform a partial job and delegate the remaining job to another worker. 
Each worker is running in its own thread, and shares no state with other workers. Jobs may even be forwarded to more than one worker for concurrent processing. The workers react to events occurring in the system, either received from the outside world or emitted by other workers. Since workers know that no other threads modify their data, the workers can be stateful, making them faster than stateless workers. Akka is one such event driven concurrency model.</div></div><div><br /></div><div class="separator" style="clear: both; text-align: left;"><a href="https://1.bp.blogspot.com/-oP9NyIkSKns/XsTDlldxFjI/AAAAAAAAhOo/WcTdm4972QM6f0vWK6FsPl1Nje7Iyq7tACK4BGAsYHg/concurrency-models-1.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="278" data-original-width="327" height="170" src="https://1.bp.blogspot.com/-oP9NyIkSKns/XsTDlldxFjI/AAAAAAAAhOo/WcTdm4972QM6f0vWK6FsPl1Nje7Iyq7tACK4BGAsYHg/w200-h170/concurrency-models-1.png" width="200" /></a><a href="https://1.bp.blogspot.com/-L0v-6QqFCwM/XsTDzjy3SLI/AAAAAAAAhO0/M206IlGYJ5ckFcrC4wQ48Un51AXuFpZ3gCK4BGAsYHg/concurrency-models-6.png" style="margin-left: 1em; margin-right: 1em; text-align: right;"><img border="0" data-original-height="165" data-original-width="359" src="https://1.bp.blogspot.com/-L0v-6QqFCwM/XsTDzjy3SLI/AAAAAAAAhO0/M206IlGYJ5ckFcrC4wQ48Un51AXuFpZ3gCK4BGAsYHg/s320/concurrency-models-6.png" width="320" /></a></div><div><br /></div><div><br /></div><div><div>Akka is a toolkit and runtime for building highly concurrent, distributed, and fault tolerant applications on the JVM. Akka creates a layer between the actor and the underlying system so that the actor's only job is to process the messages. All the complexity of creating and scheduling threads, receiving and dispatching messages, and handling race conditions and synchronization, is relegated to the framework to handle transparently. The Akka model is a concurrency model which avoids concurrent access to mutable state and uses asynchronous communication mechanisms to provide concurrency. Akka encourages the push model using messages and a single shared queue, rather than the pull model where each worker processes from its own queue.</div><div><br /></div><div><div>The following characteristics of Akka allow you to solve difficult concurrency and scalability challenges in an intuitive way:</div><div><ul style="text-align: left;"><li><b>Event-driven model</b>: Actors perform work in response to messages. Communication between Actors is asynchronous, allowing Actors to send messages and continue their own work without blocking to wait for a reply.</li><li><b>Strong isolation principles</b>: Actors don't have any public API methods which can be invoked. Instead, an actor's public API is defined through messages that the actor handles. This prevents any sharing of state between Actors, with the only way to observe another actor's state being to send it a message asking for it.</li><li><b>Location transparency</b>: The system constructs Actors from a factory and returns references to the instances. Because location doesn’t matter, Actor instances can start, stop, move, and restart to scale up and down as well as recover from unexpected failures. It enables the actors to know the origin of the messages they receive.
The sender of the message may exist in the same JVM or another JVM, thus allowing Akka actors to run in a distributed environment without any special code.</li><li><b>Lightweight</b>: Each instance consumes only a few hundred bytes, which realistically allows millions of concurrent Actors to exist in a single application.</li></ul><div><br /></div></div></div><div><b><font color="#d52c1f" style="background-color: #fce8b2;">Note</font></b>: <a href="https://www.lightbend.com/blog/six-things-architects-should-know-about-akka-2.6" target="_blank">Lightbend announced</a> the release of <a href="https://akka.io/blog/news/2019/11/06/akka-2.6.0-released" target="_blank">Akka 2.6</a> in Nov 2019 with a new <a href="https://doc.akka.io/docs/akka/current/typed/index.html" target="_blank">Typed Actor API</a> ("Akka Typed"). In the typed API, each actor needs to declare which message type it is able to handle and the type system enforces that only messages of this type can be sent to the actor. The <a href="https://doc.akka.io/docs/akka/current/index-classic.html" target="_blank">classic Akka APIs</a> are still supported even though it is recommended to use the new Akka Typed API for new projects. Currently only classic Akka APIs are used in <a href="https://emprovisetech.blogspot.com/2020/03/akka-evolution-of-multithreading.html">this post</a> but as the new typed APIs is adopted it would be updated. </div></div><div><br /></div>
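<div>As a contrast to the classic API used in the examples throughout this post, below is a minimal sketch in the typed style mentioned in the note above, assuming Akka 2.6+ on the classpath; the Greeter behavior and messages are illustrative.</div>
<pre class="brush: scala">import akka.actor.typed.{ActorSystem, Behavior}
import akka.actor.typed.scaladsl.Behaviors

object Greeter {
  // The behavior declares the message type (String) this actor is able to handle
  def apply(): Behavior[String] = Behaviors.receiveMessage { message =>
    println(s"Received: $message")
    Behaviors.same
  }
}

object TypedDemo extends App {
  // The guardian behavior is supplied when the actor system is created
  val system = ActorSystem(Greeter(), "typed-demo")
  system ! "hello" // only String messages compile against this ActorRef[String]
  system.terminate()
}
</pre>
<div><br /></div>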
<b><span style="font-size: large;">Actors and Actor System</span></b><br />
<br /><div>An <a href="https://doc.akka.io/docs/akka/current/typed/actors.html" target="_blank">actor</a> is a container for State, Behavior, a Mailbox, Child Actors and a Supervisor Strategy. An actor object needs to be shielded from the outside in order to benefit from the actor model; hence actors are represented to the outside using actor references, which are objects that can be passed around freely and without restriction.</div><div><div><br /></div><div><b>ActorSystem</b></div><div><div><div>All actors form part of a hierarchy known as the actor system. The actor system is a logical organization of actors into a hierarchical structure. It provides the infrastructure through which actors interact with one another. It is a heavyweight structure which allocates a number of threads, hence it is recommended to create one ActorSystem per application. The ActorSystem manages the life cycle of the actors within the system and supervises them. The creator actor becomes the parent of the newly created child actor. The list of children is maintained within the actor's context. Akka creates two built-in guardian actors in the system before creating any other actor. The first actor created by the actor system is called the <b>root guardian</b>. It is the parent of all actors in the system, and the last one to stop when the actor system is terminated. The second actor created by the actor system is called the <b>system guardian</b> with the system namespace. Akka or other libraries built on top of Akka may create actors in the system namespace. The <b>user guardian</b> is the top level actor that we can provide to start all other actors in the application. Actors created using system.actorOf() are children of the user guardian actor. The user guardian actor determines how top-level user-created actors are supervised. The root guardian is the parent and supervisor of both the user guardian and system guardian.</div><div><br /></div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-A1EPJNM8z2M/XsUt1FDPVjI/AAAAAAAAhPU/7Pijao0VFvEAbfXxoJ_QLJjFsY-2agf5wCK4BGAsYHg/actor_top_tree.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="373" data-original-width="658" src="https://1.bp.blogspot.com/-A1EPJNM8z2M/XsUt1FDPVjI/AAAAAAAAhPU/7Pijao0VFvEAbfXxoJ_QLJjFsY-2agf5wCK4BGAsYHg/d/actor_top_tree.png" /></a></div><div><br /></div>
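<div>A tiny sketch (names are illustrative) confirming that actors created via system.actorOf() become children of the user guardian, i.e. their path lives under /user:</div>
<pre class="brush: scala">import akka.actor.{Actor, ActorSystem, Props}

class PathDemoActor extends Actor {
  def receive = { case _ => () } // ignores every message
}

object GuardianDemo extends App {
  val system = ActorSystem("demo-system")
  val ref = system.actorOf(Props[PathDemoActor], "worker")
  println(ref.path) // prints akka://demo-system/user/worker
  system.terminate()
}
</pre>
<div><br /></div>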
<div><br /></div><b>Messaging and Mailbox</b><br /><div>Actors communicate exclusively by exchanging messages. Every actor has an exclusive address and a mailbox through which it can receive messages from other actors. Mailbox messages are processed by the actor in consecutive order. There are multiple mailbox implementations with the default implementation being FIFO.</div><div><br /></div><div>Akka ensures that each instance of an actor runs in its own lightweight thread, shielding it from rest of the system and that messages are processed one at a time. The messages are required to be immutable since actors can potentially access the same messages concurrently, in order to avoid race conditions and unexpected behaviors. Hence actors can be implemented without explicitly worrying about concurrency and synchronized access using locks. When a message is processed it is matched with the current behavior of the actor which is the function which defines the actions to be taken in reaction to the message at that point in time. When a message is sent to an actor which does not exist or is not already running then it goes to a special <a href="https://doc.akka.io/docs/akka/current/general/message-delivery-reliability.html#dead-letters">Dead Letter</a> actor (/deadLetters).</div><div><br /></div><div><b>State</b></div><div>An actor contains many instance variables to maintain state while processing multiple messages. Each actor is provided with the following useful information for performing its tasks via the Akka Actor API:</div><div><ul style="text-align: left;"><li><b>sender</b>: an <a href="https://doc.akka.io/api/akka/current/akka/actor/ActorRef.html" target="_blank">ActorRef</a> to the sender of the message currently being processed</li><li><b>context</b>: information and methods relating to the context within which the actor is running</li><li><b>supervisionStrategy</b>: defines the strategy to be used for recovering from errors</li><li><b>self</b>: the ActorRef for the actor itself</li></ul></div></div><div>Behind the scenes Akka will run sets of actors on sets of real threads, where typically many actors share one thread, and subsequent invocations of one actor may end up being processed on different threads. Akka ensures that this implementation detail does not affect the single-threadedness of handling the actor’s state. Since the internal state is vital to an actor's operations, when the actor fails and is restarted by its supervisor, the state will be created from scratch.</div><div><br /></div><div><b><font size="4">Actor Lifecycle</font></b></div><div><br /></div><div>An actor is a stateful resource that has to be explicitly started and stopped. An actor can create, or spawn, an arbitrary number of child actors, which in turn can spawn children of their own, thus forming an actor hierarchy. <a href="https://doc.akka.io/docs/akka/current/general/actor-systems.html" target="_blank">ActorSystem</a> hosts the hierarchy and there can be only one root actor, an actor at the top of the hierarchy of the ActorSystem. <a href="https://doc.akka.io/api/akka/current/akka/actor/ActorContext.html" target="_blank">ActorContext</a> has the contextual information for the actor and the current message. It is used for spawning child actors and supervision, watching other actors to receive a Terminated(otherActor) events, logging and request-response interactions using ask with other actors. 
ActorContext is also used for accessing self <a href="https://doc.akka.io/api/akka/current/akka/actor/ActorRef.html" target="_blank">ActorRef</a> using context.getSelf().</div><div><br /></div><div>Every Actor that is created must also explicitly be destroyed even when it's no longer referenced. The lifecycle of a child actor is tied to the parent, a child can stop itself or be stopped by parent at any time but it can never outlive its parent. In other words, whenever an actor is stopped, all of its children are recursively stopped. Actor can be stopped using the stop method of the ActorSystem. Stopping a child actor is done by calling context.stop(childRef) from the parent.</div><div><div><br /></div></div><div><div>An Actor has the following life-cycle methods:</div><div><ul style="text-align: left;"><li><b>Actor's constructor</b>: An actor’s constructor is called just like any other Scala class constructor, when an instance of the class is first created.</li><li><b>preStart</b>: It is only called once directly during the initialization of the first instance, i.e. at creation of its ActorRef. In the case of restarts, preStart() is called from postRestart(), therefore if not overridden, preStart() is called on every restart.</li><li><b>postStop</b>: It is sent just after the actor has been stopped. No messages are processed after this point. It can be used to perform any needed cleanup work. Akka guarantees postStop to run after message queuing has been disabled for the actor. All PostStop signals of the children are processed before the PostStop signal of the parent is processed.</li><li><b>preRestart</b>: According to the Akka documentation, when an actor is restarted, the old actor is informed of the process when preRestart is called with the exception that caused the restart, and the message that triggered the exception. The message may be None if the restart was not caused by processing a message.</li><li><b>postRestart</b>: The postRestart method of the new actor is invoked with the exception that caused the restart. In the default implementation, the preStart method is called.</li></ul><div>If initialization needs to occur every time an actor is instantiated, then <b>constructor</b> initialization is used. On contrary, if initialization needs to occur only the first instance of the actor which is created, then initialization is added to <b>preStart</b> and <b>postRestart</b> is overridden to not call the preStart method. Below is the example of implementing all the actor lifecycle methods.</div>
<pre class="brush: scala">import akka.actor.{Actor,ActorSystem, Props}
class RootActor extends Actor{
def receive = {
case msg => println("Message received: "+msg);
10/0;
}
override def preStart(){
super.preStart();
println("preStart method is called");
}
override def postStop(){
super.postStop();
println("postStop method is called");
}
override def preRestart(reason:Throwable, message: Option[Any]){
super.preRestart(reason, message);
println("preRestart method is called");
println("Reason: "+reason);
}
override def postRestart(reason:Throwable){
super.postRestart(reason);
println("postRestart is called");
println("Reason: "+reason);
}
}
</pre>
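<div><br /></div><div>A hypothetical driver for the RootActor above: the deliberate division by zero in receive throws an ArithmeticException, so under the classic default supervision strategy (restart for most exceptions) the preRestart and postRestart hooks are invoked.</div>
<pre class="brush: scala">import akka.actor.{ActorSystem, Props}

object LifecycleDemo extends App {
  val system = ActorSystem("lifecycle-demo")
  val root = system.actorOf(Props[RootActor], "RootActor")
  root ! "trigger failure" // constructor and preStart run first, then the failure causes a restart
}
</pre>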
<div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-2yItk_XdHpQ/XsbBblyhEPI/AAAAAAAAhP4/t8ZsDLeFlW07duwx3CgzOYhuUTBRJjY1ACK4BGAsYHg/actor_lifecycle.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="650" data-original-width="680" src="https://1.bp.blogspot.com/-2yItk_XdHpQ/XsbBblyhEPI/AAAAAAAAhP4/t8ZsDLeFlW07duwx3CgzOYhuUTBRJjY1ACK4BGAsYHg/d/actor_lifecycle.png" /></a></div></div></div><div><div class="separator" style="clear: both; text-align: left;"><b style="font-size: large;">Supervision and Monitoring</b></div><div><br /></div><div>In an actor system, each actor is the supervisor of its children. When an actor creates children for delegating its sub-tasks, it will automatically supervise them. If an actor fails to handle a message, it suspends itself and all of its children and sends a message, usually in the form of an exception, to its supervisor. Once an actor terminates, i.e. fails in a way which is not handled by a restart, stops itself or is stopped by its supervisor, it will free up its resources, draining all remaining messages from its mailbox into the system's <i>dead letter mailbox</i>. The mailbox is then replaced within the actor reference with a system mailbox, redirecting all new messages <i>into the drain</i>. Actors cannot be orphaned or attached to supervisors from the outside, which might otherwise catch them unawares.</div>
<div><br /></div><div><div>When a parent receives the failure signal from its child, depending on the nature of the failure, the parent decides among the following options:</div><div><ul style="text-align: left;"><li><b>Resume</b>: The parent resumes the child actor, keeping its accumulated internal state.</li><li><b>Restart</b>: The parent restarts the child actor, clearing out its internal state.</li><li><b>Stop</b>: Stop the child actor permanently.</li><li><b>Escalate</b>: Escalate the failure by failing itself and propagating the failure to its parent.</li></ul><div>Restart of the actor is carried out by creating a new instance of the underlying Actor class and replacing the failed instance with the fresh one inside the child's ActorRef. The new actor then resumes processing its mailbox, without processing the message during which the failure occurred.</div></div></div><div><br /></div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-nYrsGWNsQgQ/XsbHyMXlLYI/AAAAAAAAhQU/lnqJ8P-Ja_YxRpyBmPNk3v6kbLJlOfmHgCK4BGAsYHg/akka-life-cycle%2B%25281%2529.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="530" data-original-width="590" src="https://1.bp.blogspot.com/-nYrsGWNsQgQ/XsbHyMXlLYI/AAAAAAAAhQU/lnqJ8P-Ja_YxRpyBmPNk3v6kbLJlOfmHgCK4BGAsYHg/d/akka-life-cycle%2B%25281%2529.png" /></a></div><div><br /></div><div><br /></div><div>The <a href="https://dzone.com/articles/supervision-and-monitoring-in-akka" target="_blank">supervision strategy</a> is typically defined by the parent actor when it spawns a child actor. In the new typed API the default supervisor strategy is to stop the child in case of failure, while the classic API restarts the child for most exceptions by default. There are <a href="https://doc.akka.io/docs/akka/2.5/general/supervision.html#one-for-one-strategy-vs-all-for-one-strategy" target="_blank">two types</a> of supervision strategies to supervise any actor, namely the one-for-one strategy and the all-for-one strategy. The one-for-one strategy applies the supervision directive only to the failed child, while the all-for-one strategy applies it to all of its siblings as well. The <a href="https://doc.akka.io/docs/akka/current/fault-tolerance.html" target="_blank">one-for-one strategy</a> is the default supervision strategy. Below is an example One-for-One Strategy implementation.</div></div>
<pre class="brush: scala">import akka.actor.SupervisorStrategy.{Escalate, Restart, Stop}
import akka.actor.{Actor, OneForOneStrategy}
class Supervisor extends Actor {
override val supervisorStrategy =
OneForOneStrategy(maxNrOfRetries = 10, withinTimeRange = 1 minute) {
case _: ArithmeticException => Resume
case _: NullPointerException => Restart
case _: IllegalArgumentException => Stop
case _: Exception => Escalate
}
def receive = {
case p: Props => sender() ! context.actorOf(p)
}
}
</pre>
<div><br /></div><div><div>Monitoring is used to tie one actor to another so that it may react to the other actor’s termination, in contrast to supervision which reacts to failure. Since actor creation and restarts are not visible outside their supervisors, the only state change available for monitoring is the transition from alive to dead. Lifecycle monitoring is implemented using a Terminated message to be received by the monitoring actor, where the default behavior is to throw a special DeathPactException if not otherwise handled. Terminated messages can be listened for by invoking ActorContext.watch(targetActorRef). Termination messages are delivered regardless of the order of termination of the actors. Monitoring enables recreating actors, or scheduling their creation at a later time as a retry mechanism, thus providing an alternative to restarting the actor by the supervisor. The <a href="https://dzone.com/articles/akka-notes-actor-supervision" target="_blank">example</a> below shows the parent actor monitoring the child actor for termination using the watch() method.</div>
<pre class="brush: scala">import akka.actor.Actor
import akka.actor.ActorLogging
import akka.actor.PoisonPill
import akka.actor.Terminated
import akka.actor.SupervisorStrategy.Escalate
class DeathPactParentActor extends Actor with ActorLogging {
override val supervisorStrategy = OneForOneStrategy() {
case _: Exception => {
log.info("The exception is ducked by the Parent Actor. Escalating to TopLevel Actor")
Escalate
}
}
def receive={
case "start"=> {
val child=context.actorOf(Props[DeathPactChildActor])
context.watch(child) //Watches but doesnt handle terminated message. Throwing DeathPactException here.
child ! "stop"
}
// case Terminated(_) => log.error(s"$actor died"") // Terminated message not handled
}
}
class DeathPactChildActor extends Actor with ActorLogging {
def receive = {
case "stop"=> {
log.info ("Actor going to be terminated")
self ! PoisonPill
}
}
}
</pre>
<div><br /></div><div><b><font size="4">Creating Actors</font></b></div><div><br /></div><div><div>Akka creates Actor instances using a factory spawn methods which return reference to the actor's instance. <a href="https://doc.akka.io/api/akka/current/akka/actor/Props.html" target="_blank">Props</a> is a configuration class to specify immutable options for the creation of an Actor, which contains information about a creator, routing, deploy etc. Actors are created by passing a Props configuration object into the actorOf factory method which is available on <a href="https://doc.akka.io/api/akka/current/akka/actor/ActorSystem.html" target="_blank">ActorSystem</a> and <a href="https://doc.akka.io/api/akka/current/akka/actor/ActorContext.html" target="_blank">ActorContext</a>. The call to actorOf returns an instance of ActorRef which is a handle to the actor instance and the only way to interact with it. It is recommended to provide factory methods on the companion object of each Actor which helps keeping the creation of suitable Props close to the actor definition. Actors are automatically started asynchronously when created. It is also recommended to declare one actor within another as it breaks actor encapsulation.</div></div><div><div><br /></div></div><div><div>Actors are implemented by extending the Actor base trait and implementing the receive method. </div><div>The receive method defines a series of case statements determining which messages the Actor handles using standard Scala pattern matching. The Akka Actor receive message loop is exhaustive, with all the messages which the actor can accept is pattern matched, otherwise an unknown message is sent to akka.actor.UnhandledMessage of the ActorSystem's EventStream. The result of the receive method is a partial function object, which is stored within the actor as its "initial behavior".</div></div>
<pre class="brush: scala">import akka.actor.Actor
import akka.event.Logging
class StatusActor extends Actor {
val log = Logging(context.system, this)
def receive = {
case "complete" => log.info("Task Completed!")
case "pending" => log.info("Task Pending!")
case _ => log.info("Task Status Unknown")
}
}
</pre>
<div><br /></div><b>Messages</b></div><div><div>Messages are mostly defined using Scala case classes, which are immutable by default and integrate easily with pattern matching. Messages are sent to an Actor through one of the following Scala methods.</div><div><ul style="text-align: left;"><li>! means "fire-and-forget", i.e. it sends a message asynchronously and returns immediately. Also known as tell.</li><li>? sends a message asynchronously and returns a Future representing a possible reply. Also known as ask.</li></ul><div>Since ask has performance implications it is recommended to use tell unless really required. Message ordering is guaranteed on a per-sender basis. Below is an example of sending a tell message, followed by a sketch of the ask pattern.</div></div></div><div>
<pre class="brush: scala">import akka.actor.{Props, ActorSystem}
object ActorDemo {
def main(args: Array[String]): Unit = {
val system = ActorSystem("demo-system")
val props = Props[StatusActor]
val statusActor = system.actorOf(props, "statusActor-1")
statusActor ! "pending"
statusActor ! "blocked"
statusActor ! "complete"
system.terminate()
}
}
</pre>
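<div><br /></div><div>Since the ask (?) method was mentioned above, below is a minimal, self-contained sketch of that pattern; the EchoActor name and messages are illustrative. Ask requires an implicit Timeout, returns a Future, and the target actor must reply via sender() for that Future to complete.</div>
<pre class="brush: scala">import akka.actor.{Actor, ActorSystem, Props}
import akka.pattern.ask
import akka.util.Timeout
import scala.concurrent.Await
import scala.concurrent.duration._

class EchoActor extends Actor {
  def receive = {
    case msg => sender() ! s"echo: $msg" // reply so the caller's Future completes
  }
}

object AskDemo extends App {
  implicit val timeout: Timeout = Timeout(5.seconds) // required by the ask pattern
  val system = ActorSystem("ask-demo")
  val echo = system.actorOf(Props[EchoActor], "echo")
  val future = echo ? "hello"                       // Future[Any]
  println(Await.result(future, timeout.duration))   // blocking here only for demonstration
  system.terminate()
}
</pre>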
<div><br /></div><div><div>A message can be <a href="https://www.javatpoint.com/akka-actor-reply-messages" target="_blank">replied</a> to using the sender() method, which provides the ActorRef of the sending Actor. To reply to a message we just need to use <i>sender() ! ReplyMsg</i>. If there is no sender, i.e. a message was sent without an actor or future context, the default sender is a 'dead-letter' actor reference. Messages can also be forwarded from one actor to another using the forward method. In such a case, the address/reference of the original sender is maintained even though the message goes through a mediator. It is helpful when writing actors that work as routers, load-balancers, replicators etc.</div></div>
<pre class="brush: scala">import akka.actor.{Actor,ActorSystem, Props};
class ParentActor extends Actor{
def receive = {
case message:String => println("Message recieved from "+sender.path.name+" massage: "+message);
val child = context.actorOf(Props[ChildActor],"ChildActor");
child ! message // Message forwarded to child actor
}
}
class ChildActor extends Actor{
def receive ={
case message:String => println("Message recieved from "+sender.path.name+" massage: "+message);
println("Replying to "+sender().path.name);
sender()! "I got your message"; // Child Actor replying to parent actor
}
}
object ActorExample{
def main(args:Array[String]){
val actorSystem = ActorSystem("ActorSystem");
val actor = actorSystem.actorOf(Props[ParentActor], "ParentActor");
actor ! "Hello";
}
}
</pre>
<div><br /></div><div><b>ActorContext</b></div><div><div><a href="https://doc.akka.io/api/akka/current/akka/actor/ActorContext.html" target="_blank">ActorContext</a> has the contextual information for the actor and the current message. It is used for spawning child actors and supervision, watching other actors to receive Terminated(otherActor) events, logging and request-response interactions using ask with other actors. ActorContext is also used for accessing the self ActorRef using context.getSelf().</div></div><div><div><br /></div></div><div><div><b>ActorRef</b></div><div><a href="https://doc.akka.io/api/akka/current/akka/actor/ActorRef.html" target="_blank">ActorRef</a> is a reference or immutable handle to an actor within the ActorSystem. It is shared between actors as it allows other actors to send messages to the referenced actor. Each actor has access to its canonical (local) reference through the ActorContext.self field. An ActorRef always represents an incarnation (path and UID), not just a given path; once an actor is stopped and a new actor reuses the same name, the original ActorRef will not point to the new actor. There are special types of actor references which behave like local actor references, e.g. PromiseActorRef, DeadLetterActorRef and EmptyLocalActorRef.</div><div><br /></div><div><b>ActorPath</b></div><div>Actors are created in a strictly hierarchical fashion. Hence there exists a unique sequence of actor names given by recursively following the supervision links between child and parent down towards the root of the actor system. <a href="https://doc.akka.io/api/akka/current/akka/actor/ActorPath.html" target="_blank">ActorPath</a> is a unique path to an actor that shows the creation path up through the actor tree to the root actor. An actor path consists of an anchor, which identifies the actor system, followed by the concatenation of the path elements, from root guardian to the designated actor; the path elements are the names of the traversed actors and are separated by slashes. An ActorPath can be created without creating the actor itself, unlike ActorRef which requires a corresponding actor. The ActorPath from a terminated actor can be reused for a new incarnation of the actor. An ActorPath is similar to a file path, for example, "akka://my-sys/user/service-a/worker1".</div><div><br /></div><div><b>ActorSelection</b></div><div><a href="https://doc.akka.io/api/akka/current/akka/actor/ActorSelection.html" target="_blank">ActorSelection</a> is another way to represent an actor, similar to ActorRef. It is a logical view of a section of an ActorSystem's tree of Actors, allowing for broadcasting of messages to that section. ActorSelection points to the path (or multiple paths using wildcards) and is completely oblivious to which actor's incarnation is currently occupying it.</div></div>
<pre class="brush: scala">val selection: ActorSelection = context.actorSelection("/user/a/*/c/*")
val actorRef = system.actorSelection("/user/myActorName/").resolveOne()
</pre>
<div><br /></div><div><b>Stopping Actor</b></div><div><div>Actors can be stopped by invoking the stop() method of either the ActorContext or the ActorSystem class. ActorContext is used to stop a child actor and ActorSystem is used to stop the top level Actor. The actual termination of the actor is performed asynchronously. Some other methods to stop an actor are PoisonPill, terminate() and gracefulStop().</div></div><div><br /></div><div>In the example below, the stop() method of ActorSystem is passed the ActorRef in order to stop the top level Actor.</div>
<pre class="brush: scala">object ActorExample{
def main(args:Array[String]){
val actorSystem = ActorSystem("ActorSystem");
val actor = actorSystem.actorOf(Props[ActorExample], "RootActor");
actor ! "Hello"
actorSystem.stop(actor);
}
}
</pre>
<div><div>A child actor can be stopped by the parent actor using the stop() method of its ActorContext and passing the child actor's ActorRef, as shown in the example below.</div>
<pre class="brush: scala">class ParentActor extends Actor{
def receive = {
case message:String => println("Message received " + message);
val childactor = context.actorOf(Props[ChildActor], "ChildActor");
context.stop(childactor);
case _ => println("Unknown message");
}
}
</pre>
<div>The terminate method stops the guardian actor, which in turn recursively stops all its child actors.</div></div><div><br /></div><div><b>Exceptions</b></div><div><div>Exceptions can occur while an actor is processing a message. When an exception occurs while the actor is processing a message taken out of its mailbox, the corresponding message is lost. The mailbox and the remaining messages in it remain unaffected. In order to retry processing the message, the exception must be caught and handled explicitly. If code within an actor throws an exception, that actor is suspended and the supervision process is started. Depending on the supervisor's decision the actor is resumed, restarted (wiping out its internal state and starting from scratch) or terminated.</div></div><div><br /></div><div><div><b>Scheduler</b></div><div><a href="https://doc.akka.io/docs/akka/current/scheduler.html" target="_blank">Scheduler </a>is a trait that extends AnyRef. It is used to handle scheduled tasks and provides the facility to schedule messages, so both the sending of messages and the execution of tasks can be scheduled. A new instance is created for each ActorSystem for scheduling tasks to happen at a specific time. It returns a cancellable reference to the scheduled operation, which can be cancelled by calling the cancel method on the reference object. The Scheduler can be used by importing akka.actor.Scheduler (a sketch of scheduling a message follows the Logging section below).</div></div><div><br /></div><div><b>Logging</b></div><div>Akka also comes built-in with a <a href="https://doc.akka.io/docs/akka/current/logging.html" target="_blank">logging</a> utility for actors, and an actor can access it by simply mixing in the <a href="https://doc.akka.io/api/akka/current/akka/actor/ActorLogging.html" target="_blank">ActorLogging</a> trait, i.e. akka.actor.ActorLogging, to become a logging member. The <a href="https://doc.akka.io/api/akka/current/akka/event/Logging$.html" target="_blank">Logging</a> object provides error, warning, info, or debug methods for logging messages. </div><div>The Logging constructor's first parameter is any LoggingBus, specifically system.eventStream, while the second parameter is the source of the logging channel. Logging is performed asynchronously to ensure that logging has minimal performance impact. Log events are processed by an event handler actor that receives the log events in the same order they were emitted. By default log messages are printed to STDOUT, but Akka can also plug in to an SLF4J logger or any custom logger. By default, messages sent to dead letters are logged at INFO level. Akka provides additional configuration options for very low level debugging and customized logging.</div><div><br /></div><div>
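<div>As referenced in the Scheduler section above, below is a minimal sketch (with illustrative actor and message names) of scheduling a one-off message using the classic scheduler; scheduleOnce needs an implicit ExecutionContext and returns a Cancellable.</div>
<pre class="brush: scala">import akka.actor.{Actor, ActorSystem, Props}
import scala.concurrent.ExecutionContext
import scala.concurrent.duration._

class ReminderActor extends Actor {
  def receive = {
    case msg => println("Reminder received: " + msg)
  }
}

object SchedulerDemo extends App {
  val system = ActorSystem("scheduler-demo")
  val reminder = system.actorOf(Props[ReminderActor], "reminder")

  // The system dispatcher serves as the ExecutionContext required by the scheduler
  implicit val ec: ExecutionContext = system.dispatcher

  // Delivers the message once after 5 seconds; the returned Cancellable can cancel it
  val cancellable = system.scheduler.scheduleOnce(5.seconds, reminder, "time to flush metrics")
  // cancellable.cancel() // would cancel the scheduled delivery if invoked before it fires
}
</pre>
<div><br /></div>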
<b><span style="font-size: large;">When to use Akka</span></b><br />
<br /><div><div><a href="https://doc.akka.io/docs/akka/current/typed/guide/actors-intro.html" target="_blank">Akka</a> is ideal for the concurrency models with a shared mutable state accessed by multiple threads which are required to be synchronized. It provides an elegant solution were multiple threads needs to communicate with each other asynchronously in order to provide concurrency. Akka actors use java threads internally and assign threads to only on-demand actors which has work to do and don't have any occupied threads. This prevents the application from running thousands of threads simultaneously adding overhead on most machines especially at peak loads. It enforces encapsulation of behavior and state within individual actors without resorting to locks. Akka actor passing messages between each other avoids blocking and the usage of inter-thread communication Java methods like wait(), notify() and notifyAll(). Akka handles errors more gracefully with the supervisor actor when informed of the child actor's failure, it could carry out retries by creating new child actors. Akka encourages push mindset, were entities react to signals, changing state, and send signals to each other to drive the system forward. Akka provides <a href="https://blog.knoldus.com/performance-benchmarking-akka-actors-vs-java-threads/" target="_blank">better performance</a> than using native Java threads, even with <a href="https://www.baeldung.com/java-executor-service-tutorial" target="_blank">Executors</a> for standard implementation and fixed amount of threads.</div><div><br /></div></div></div>
</div></div></div>Pranav Patilhttp://www.blogger.com/profile/10125896717744035325noreply@blogger.com0tag:blogger.com,1999:blog-7610061084095478663.post-18960262427181337242019-12-23T08:01:00.001-08:002020-06-29T16:51:36.052-07:00Cassandra - Scaling BigData<br />
In the age of cloud and IoT (Internet of Things), where more and more devices are connected to the internet, the amount of data produced each day is truly mind-boggling. To put it into perspective, <a href="https://www.forbes.com/sites/bernardmarr/2018/05/21/how-much-data-do-we-create-every-day-the-mind-blowing-stats-everyone-should-read/#2f70620060ba">90 percent</a> of all the data in the world was generated within the past two years. The constantly growing amount of data and high volume of transactions pose challenges for next-generation cloud applications and question the reliance on traditional SQL databases. Achieving scalability and elasticity is a huge challenge for relational databases. Relational databases are designed to run the whole dataset on a single server in order to maintain the integrity of the table mappings and avoid the problems of distributed computing. In such a <a href="https://www.marklogic.com/blog/relational-databases-scale/">design</a>, scaling requires upgrading to bigger, more complex proprietary hardware with more processing power, memory and storage. NoSQL databases, on the contrary, are designed to scale on distributed systems and can operate across hundreds of servers, petabytes of data, and billions of documents. Further, there is no single point of failure, and new nodes can be easily added or removed based on performance needs.<br />
<br />
The <a href="http://cassandra.apache.org/">Apache Cassandra</a> database provides high scalability and availability without compromising performance for mission-critical applications. <a href="https://devops.com/top-10-reasons-use-cassandra/">Cassandra</a> is masterless with all its nodes in the ring being part of a homogenous system. It is highly fault tolerent and supports temporary loss of multiple nodes (depending on cluster size) with negligible impact to the overall performance of the cluster as downed cassandra nodes can be replaced easily. Cassandra supports data replication across multiple data centers providing lower latency and disaster recovery as multiple copies of data are stored in multiple locations. It provides support for custom tuning configuration based on system's data processing (read/write heavy systems or read/write heavy data centers). It can be easily configured to work in a multi data center environment to facilitate fail over and disaster recovery. Cassandra accommodates structured, semi-structured and unstructured data formats and can dynamically accommodate changes to the data formats. It gracefully handles data complexity were for e.g. data write patterns, locations, and frequencies can vary. It has picked up support for Hadoop, text search integration through Solr, CQL, zero-downtime upgrades, virtual nodes (vnodes), and self-tuning caches, just to name a few of the major features.<br />
<br />
<b><span style="font-size: large;">Origins</span></b><br />
<br />
In 2004 Amazon was growing rapidly and was starting to hit the upper scaling limits of its Oracle database. It hence started to consider building its own in-house database, which usually is a bad idea. Out of this experiment, the engineers created the Amazon Dynamo database which backed major internal infrastructure including the shopping cart on the Amazon.com website. A group of engineers behind Amazon Dynamo published a <a href="https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf">paper</a> in 2007 which described the design and implementation of a tunable, highly scalable and highly available distributed key-value store to suit the heterogeneous applications on Amazon’s platform. It provided the characteristics of partitioning, high availability for writes, temporary failure handling, permanent failure recovery, membership management, and failure detection. On the other end, Google developed BigTable in 2004 and released the <a href="https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf">paper in 2006</a> which introduced a rich data model, a multi-valued map and fast sequential access. BigTable is a <a href="https://en.wikipedia.org/wiki/Wide_column_store">wide column store</a> where the names and formats of the columns can vary for each row in the same table. The row and column key along with the timestamp are associated to create a sparse, distributed multi-dimensional sorted map. Tables are split into multiple segments (which could be compressed) based on certain row keys to limit the segment size to a few gigabytes.<br />
<br />
Cassandra was developed as a blend of the Dynamo paper's distributed system design and the BigTable paper's data model. It was originally developed at Facebook in 2008 to power their search feature. Cassandra entered the Apache Incubator stage with its <a href="https://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf">implementation paper</a> released in 2009 and graduated as an Apache Top-level Project in February 2010.<br />
<br />
<b><span style="font-size: large;">Architecture</span></b><br />
<br />
Cassandra has a master-less, ring-based <a href="https://dzone.com/articles/introduction-apache-cassandras">distributed architecture</a> where all nodes are connected peer-to-peer, and data is distributed among all the nodes in a cluster. Each node is independent and interconnected with other nodes within the cluster. Every Cassandra cluster is assigned a name, which is the same for all the nodes participating in that cluster. Nodes can be added or removed as needed to <a href="https://medium.com/@abhinavkorpal/scaling-horizontally-and-vertically-for-databases-a2aef778610c">scale horizontally</a>, simultaneously increasing the read and write throughput. Cassandra has robust support for clusters spanning across multiple data centers.<br />
<br />
Cassandra places replicas of data on different nodes, hence there is no single point of failure. A consistent hashing algorithm is used to map Cassandra row keys to physical nodes. The range of values from a consistent hashing algorithm is a fixed circular space which can be visualized as a ring. Consistent hashing also minimizes the key movements when nodes join or leave the cluster. At start up each node is assigned a token range which determines its position in the cluster and the range of data stored by the node. Each node receives a proportionate range of the token ranges to ensure that data is spread evenly across the ring. The below diagram illustrates dividing 0 to 100 token range evenly among a four node cluster in a data center. Each node is assigned a token and is responsible for token values from the previous token (exclusive) to the node's token (inclusive). Each node in a Cassandra cluster is responsible for a certain set of data which is determined by the partitioner. A partitioner is a hash function for computing the resultant token for a particular row key. This token is then used to determine the node which will store the first replica. Currently Cassandra offers a Murmur3Partitioner (default), RandomPartitioner and a ByteOrderedPartitioner.<br />
<div>
<br /></div>
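To make the token assignment described above concrete, below is a tiny illustrative sketch (not Cassandra's actual implementation) of mapping a row key onto the 0 to 100 token ring of the four-node example; the node names and the hash stand-in are hypothetical.<br />
<pre class="brush: scala">object TokenRingDemo extends App {
  // node -> highest token it owns; each node owns the range (previous token, own token]
  val ring = Vector(("node1", 25), ("node2", 50), ("node3", 75), ("node4", 100))

  def tokenFor(rowKey: String): Int =
    (rowKey.hashCode &amp; 0x7fffffff) % 100 + 1 // stand-in for the partitioner's hash (e.g. Murmur3)

  def nodeFor(rowKey: String): String = {
    val token = tokenFor(rowKey)
    ring.find { case (_, upper) => token &lt;= upper }.map(_._1).getOrElse(ring.head._1)
  }

  println(nodeFor("user:42")) // node holding the first replica for this row key
}
</pre>
<br />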
All the nodes exchange information with each other using <a href="https://www.linkedin.com/pulse/gossip-protocol-inside-apache-cassandra-soham-saha">Gossip protocol</a>. Gossip protocol is used for peer discovery and metadata propagation. It enables discovering the state of all nodes within the cluster by exchanging state information about themselves and other nodes they know about. The gossip process runs every second for every node and exchanges state messages with up to three other nodes in the cluster. The state message contains information about the node itself and all other known nodes. Each node independently selects a live peer (if any) in the cluster, which would probabilistically select a seed node or even an unavailable node. When a message directed to a destination node is sent via a peer node, the peer node routes the message to the appropriate peers and updates its metadata before sending back the response it received from the destination node. Seed nodes are used during start up to help discover all participating nodes. Seed nodes have no special purpose other than helping bootstrap the cluster using the gossip protocol. When a new node starts up it looks to its seed list to obtain information about the other nodes in the cluster and starts gossiping. After the first round of gossip, the new node will possess cluster membership information about other nodes in the cluster and can then gossip with the rest of them. Nodes do not exchange information with every other node in the cluster in order to reduce network load. They just exchange information with a few nodes and over a period of time state information about every node propagates throughout the cluster. This enables each node to learn about every other node in the cluster even though it is communicating with a small subset of nodes. The gossip protocol also facilitates failure detection. Each node acts as a heartbeat listener, recording the timestamps and intervals at which it receives heartbeat updates from each peer. If a node does not get an acknowledgment for a write to a peer, it simply stores it up as a hint. The nodes will stop sending read requests to a peer in DOWN state and gossiping with it is only retried probabilistically while the node is unavailable.<br />
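<br />
A conceptual sketch (not Cassandra's implementation) of the state merge that happens during a gossip exchange: each node keeps a heartbeat/version per known endpoint and, when gossiping, keeps whichever version is newer for every endpoint it hears about.<br />
<pre class="brush: scala">object GossipMergeDemo extends App {
  type EndpointState = Map[String, Long] // node address -> heartbeat/version number

  // Merge two views of the cluster, keeping the freshest entry for every endpoint
  def merge(local: EndpointState, remote: EndpointState): EndpointState =
    (local.keySet ++ remote.keySet).map { node =>
      node -> math.max(local.getOrElse(node, 0L), remote.getOrElse(node, 0L))
    }.toMap

  val nodeA = Map("10.0.0.1" -> 42L, "10.0.0.2" -> 17L)
  val nodeB = Map("10.0.0.2" -> 19L, "10.0.0.3" -> 5L)
  println(merge(nodeA, nodeB)) // nodeA learns about 10.0.0.3 and the newer state of 10.0.0.2
}
</pre>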
<div>
<br />
<b><span style="font-size: large;">Components of Cassandra</span></b><br />
<br /></div>
<div>
<div>
<b>Node</b>: Node is the place where data is stored. It is the basic component of Cassandra.</div>
<div>
<br /></div>
<div>
<b>Data Center</b>: It is the collection of related nodes. Many nodes are categorized as a data center.</div>
<div>
<br /></div>
<div>
<b>Cluster</b>: The cluster is the collection of one or more data centers.</div>
<div>
<br /></div>
<div>
<b>Commit Log</b>: Every write operation made to the node is written to Commit Log. It is used for crash recovery in Cassandra. It ensures that all the writes are durable and survive permanently even in case of power failure on the node.</div>
<div>
<br /></div>
<div>
<b>Mem-table</b>: A mem-table is a memory-resident write back cache of data partitions that Cassandra looks up by key. A write back cache is where the write operation is only directed to the cache and completion is immediately confirmed. The memtable stores writes in sorted order until reaching a configurable limit, and then is flushed.</div>
<div>
<br /></div>
<div>
<b>SSTable</b>: The Sorted String Table (SSTable) is an ordered, immutable key-value map and an efficient way to store large sorted data segments in a disk file. Data is flushed into an SSTable from the Mem-table when its contents reach a certain threshold value.</div>
<div>
<br />
<b>Bloom Filters</b>: They are an extremely fast, probabilistic way to test the existence of an element in a set. A bloom filter does not guarantee that the SSTable contains the partition, only that it might contain it, which helps to avoid expensive I/O operations. Bloom filters are stored in files alongside the SSTable data files and also loaded into memory (a small sketch of this behaviour follows this list of components). </div>
</div>
<div>
<br /></div>
<div>
<div>
<b>Merkle Tree</b>: It is a hash tree which provides an efficient way to find differences in data blocks. The leaves of the Merkle tree contain hashes of individual data blocks and parent nodes contain hashes of their respective children, which enables finding the differences between nodes efficiently.</div>
</div>
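Below is a toy illustration (not Cassandra's implementation) of the bloom filter behaviour referenced in the component list above: membership tests may return false positives but never false negatives.<br />
<pre class="brush: scala">import scala.util.hashing.MurmurHash3

class ToyBloomFilter(bits: Int, hashes: Int) {
  private val bitset = new scala.collection.mutable.BitSet(bits)

  // Derive several bit positions for a key by hashing it with different seeds
  private def positions(key: String): Seq[Int] =
    (0 until hashes).map(seed => (MurmurHash3.stringHash(key, seed) &amp; 0x7fffffff) % bits)

  def add(key: String): Unit = positions(key).foreach(bitset += _)

  // false => definitely absent; true => might be present (could be a false positive)
  def mightContain(key: String): Boolean = positions(key).forall(bitset.contains)
}

object BloomDemo extends App {
  val filter = new ToyBloomFilter(bits = 1024, hashes = 3)
  filter.add("partition-key-1")
  println(filter.mightContain("partition-key-1")) // true
  println(filter.mightContain("missing-key"))     // very likely false
}
</pre>
<br />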
<div>
<br />
<br />
<b><span style="font-size: large;">Data Replication</span></b><br />
<br />
<a href="https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/architecture/archDataDistributeReplication.html">Cassandra replicates</a> data into multiple nodes to ensure fault tolerance and reliability. The replicated data is synchronized across replicas to ensure data consistency which often takes microseconds. Cassandra replicates data based on the chosen <b>replication strategy</b> which determines the nodes where replicas are placed and the <b>replication factor</b> which determines the total number of replicas to be placed on different nodes within the cluster.<br />
<br />
A replication factor of one means that there is only one copy of each row in the cluster, in which case if the node containing the row goes down the row cannot be retrieved. A replication factor of at least three ensures that there is no single point of failure, as three copies of each row are present on three different nodes. There is no primary or master replica and all the replicas are equally important. Generally the replication factor should not exceed the total number of nodes in the cluster.<br />
<br />
The first replica for the data is determined by the partitioner. The placement of the subsequent replicas is determined by the replication strategy. There are two kinds of replication strategies in Cassandra.<br />
<br />
<b>SimpleStrategy </b>is used for single data center and one rack. SimpleStrategy places the first replica on the node determined by the partitioner. Additional replicas are placed on the next nodes in clockwise manner in the ring without considering the topology.<br />
<br />
<b>NetworkTopologyStrategy </b>is used when the Cassandra cluster is deployed across multiple data centers. It allows defining the number of replicas to be placed in each datacenter, hence making it suitable for multi-datacenter deployments. The nodes in the network topology strategy are data center aware. Network topology strategy places replicas in the same datacenter by walking the ring clockwise until reaching the first node in another rack. Replicas are usually placed on distinct racks since nodes in the same rack often fail together. The total number of replicas for each datacenter is determined by the threshold of cross data-center latency impacting local reads and the possible failure scenarios. The number of replicas can vary across data centers based on application requirements. With two replicas in each datacenter, the failure of a single node per replication group still allows local reads at a consistency level of ONE. The higher the number of replicas in each data center, the more resilience there is against failures of multiple nodes within the data center while maintaining a strong consistency level of LOCAL_QUORUM. Cassandra uses snitches to discover the overall network topology, which is used to route inter-node requests efficiently.<br />
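<br />
As a concrete sketch of the strategies above, the snippet below creates a keyspace with NetworkTopologyStrategy using the DataStax Java driver (3.x API assumed) from Scala; the contact point, keyspace name and datacenter names ('DC1', 'DC2') are illustrative and must match the names reported by the cluster's snitch.<br />
<pre class="brush: scala">import com.datastax.driver.core.Cluster

object KeyspaceSetup extends App {
  val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
  val session = cluster.connect()

  // Three replicas in DC1 and two in DC2; SimpleStrategy would instead take a single
  // 'replication_factor' option and ignore the datacenter topology.
  session.execute(
    """CREATE KEYSPACE IF NOT EXISTS example_ks
      |WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 2}""".stripMargin)

  session.close()
  cluster.close()
}
</pre>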
<div>
<br /></div>
<div>
<a href="https://myadventuresincoding.wordpress.com/2019/01/29/cassandra-switching-from-simplestrategy-to-networktopologystrategy/">Switching</a> the replication strategy does not cause any data to be moved between nodes automatically. This can be achieved using a utility called <a href="http://allaboutscala.com/big-data/cassandra/">NodeTool</a>. NodeTool provides various stats and hints about the health of a particular Cassandra cluster. It also has the ability to perform some mission critical actions, such as removal of a particular node within a ring. The nodetool repair command is used to update the data between the nodes after changing the replication strategy. However, if we are just switching an existing cluster with a single rack and datacenter from SimpleStrategy to NetworkTopologyStrategy then it should not require to move any data. The below command runs the node repair tool using the option `<b>pr</b> – primary range only` which means repair will only repair the keys that are known to the current node where repair is being run, and on other nodes where those keys are replicated. Ensure to run repair on each node, but only with ONE node at a time.<br />
<br />
$ <b><span style="color: blue;">nodetool repair -pr examplekeyspace</span></b><br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-M0yzhOI7Rpg/Xf3cN1L9x9I/AAAAAAAAY90/YHTCzLNP70I9IZvW-4nOBrN8C1RVXTfGwCLcBGAsYHQ/s1600/Network-topology-strategy.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="628" data-original-width="1200" height="334" src="https://1.bp.blogspot.com/-M0yzhOI7Rpg/Xf3cN1L9x9I/AAAAAAAAY90/YHTCzLNP70I9IZvW-4nOBrN8C1RVXTfGwCLcBGAsYHQ/s640/Network-topology-strategy.png" width="640" /></a></div>
<br />
<br />
<b><span style="font-size: large;">Write Operation</span></b><br />
<br />
Clients can interface with any Cassandra node using either the Thrift protocol or CQL, since Cassandra is masterless. The node that a client connects to is designated as the coordinator which is responsible for serving the client. The consistency level determines the number of nodes that the coordinator needs to hear from in order to notify the client of a successful mutation. All inter-node requests are sent through a messaging service and in an asynchronous manner. Based on the partition key and the replication strategy used, the coordinator forwards the mutation to all applicable nodes. QUORUM is the majority of nodes for which the coordinator should wait (default 10 seconds) for a response to satisfy the consistency level. It is calculated using the formula (n/2 + 1) where n is the replication factor. Every node first writes the mutation to the commit log, ensuring durability of the writes. The node then writes the mutation to the (in-memory) memtable. The memtable is <a href="http://abiasforaction.net/apache-cassandra-memtable-flush/">flushed</a> to the disk when it reaches its maximum allocated size in memory, or when the configured flush period elapses. When the memtable is flushed it writes to an immutable structure called an SSTable (Sorted String Table). SSTables are immutable, as they are never written to again after the memtable is flushed. The commit log is used for playback purposes in case data from the memtable is lost due to node failure. Once data in the memtable is flushed to an SSTable on the disk, the corresponding data in the commit log is purged. Every SSTable creates three files on disk which include a bloom filter, a key index aka partition index and a data file. Over a period of time when the number of SSTables increases it impacts the read requests, as multiple SSTables need to be read. To increase the read performance, the SSTables with related data are combined into a single SSTable as part of the compaction process. Memtables and SSTables are maintained per table while the commit log is shared among tables.<br />
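<br />
A quick check of the quorum formula above (integer division), using illustrative replication factors:<br />
<pre class="brush: scala">object QuorumDemo extends App {
  def quorum(replicationFactor: Int): Int = replicationFactor / 2 + 1
  println(quorum(3)) // 2 replicas must acknowledge
  println(quorum(5)) // 3 replicas must acknowledge
}
</pre>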
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-TSGqz1o0uhM/Xf3bYwecVKI/AAAAAAAAY9k/S4RRAaqgnooMQBY-bBSHjNj8BJb3CGj0gCLcBGAsYHQ/s1600/dmlWriteProcess.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="706" data-original-width="1200" height="376" src="https://1.bp.blogspot.com/-TSGqz1o0uhM/Xf3bYwecVKI/AAAAAAAAY9k/S4RRAaqgnooMQBY-bBSHjNj8BJb3CGj0gCLcBGAsYHQ/s640/dmlWriteProcess.png" width="640" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
<br />
<b><span style="font-size: large;">Read Operation</span></b><br />
<br />
Similar to the write process, the client can connect with any node within the cluster using either the Thrift protocol or CQL to read data. The chosen coordinator node is responsible for returning the requested data. A row key must be supplied for every read operation. The coordinator uses the row key to determine the first replica. The replication strategy in conjunction with the replication factor is used to determine all other applicable replicas. As with the write path, the consistency level determines the number of replicas that must respond to the coordinator node before it successfully returns the results. If the contacted replicas have different versions of the data, the coordinator returns the latest version to the client and issues a read repair command to push the newer version of the data to the nodes with older versions.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-Ll5v_fRU19A/Xf3b40C56dI/AAAAAAAAY9s/mfiEl9cyCiwPe1C7VMCTAJy_Wm_eMJUFgCLcBGAsYHQ/s1600/dml_caching-reads_12.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1045" data-original-width="1200" height="556" src="https://1.bp.blogspot.com/-Ll5v_fRU19A/Xf3b40C56dI/AAAAAAAAY9s/mfiEl9cyCiwPe1C7VMCTAJy_Wm_eMJUFgCLcBGAsYHQ/s640/dml_caching-reads_12.png" width="640" /></a></div>
<br />
Cassandra combines read results from the active memtable and potentially multiple SSTables. It processes data at several stages on the read path to discover the location of stored data. Below are the <a href="https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/dml/dmlAboutReads.html">stages to fetch data</a> from Cassandra node.</div>
<div>
<div>
<ul>
<li>Checks the memtable</li>
<li>If enabled checks the row cache</li>
<li>Checks Bloom filter</li>
<li>If enabled checks the partition key cache</li>
<li>If a partition key is found in the partition key cache then goes directly to compression offset map, or else checks the partition summary</li>
<li>If the partition summary is checked, then the partition index is accessed</li>
<li>Locates the data on disk using the compression offset map</li>
<li>Fetches the data from the SSTable on disk</li>
</ul>
</div>
</div>
A partition is typically stored across multiple SSTable files. If the memtable has the desired partition data, then the data is read and then merged with the data from the SSTables. A number of other SSTable structures exist to assist read operations. The row cache improves the performance for very read-intensive operations but is contra-indicated for write-intensive operations. If enabled, row cache stores a subset of the partition data (frequently accessed) stored on disk in the SSTables in memory. The row cache size is configurable, as is the number of rows to store. If a write comes in for the row, the cache for that row is invalidated and is not cached again until the row is read. Similarly, if a partition is updated, the entire partition is evicted from the cache. When the desired partition data is not found in the row cache, then the Bloom filter is checked to discover which SSTables are likely to have the request partition data. Since Bloom filter is a probabilistic function, it find the likelihood that partition data is stored in a SSTable. If the Bloom filter does not rule out an SSTable, Cassandra checks the partition key cache which stores a cache of the partition index. If a partition key is found in the key cache, it can directly go to the compression offset map to find the compressed block on disk that has the data. If a partition key is not found in the key cache, then the partition summary which stores a sampling of the partition index is searched. The partition summary samples every X keys and maps the location of every Xth key's location in the index file. After finding the range of possible partition key values, the partition index is searched. The <a href="https://blog.knoldus.com/the-curious-case-of-cassandra-reads/">partition index</a> resides on the disk and stores an index of all partition keys mapped to their offset. The partition index now goes to the compression offset map to find the compressed block on disk that has the data. The compression offset map stores pointers to the exact location on disk that the desired partition data will be found. The desired compressed partition data is fetched from the correct SSTable(s) once the compression offset map identifies the disk location. The query receives the result set.<br />
<br />
<b><span style="font-size: large;">Consistency Levels</span></b><br />
<br />
Cassandra is a highly Available and Partition-tolerant (AP) database in terms of the <a href="https://en.wikipedia.org/wiki/CAP_theorem">CAP Theorem</a> (Consistency, Availability and Partition Tolerance), as it uses eventually consistent replication where nodes with older versions of data are updated in the background. Cassandra operations follow the <a href="https://queue.acm.org/detail.cfm?id=1394128">BASE</a> (Basically Available, Soft state, Eventual consistency) paradigm to maintain reliability even with the loss of consistency. A BASE system guarantees the availability of the data as per the CAP Theorem, but the state of the system could change over time without any input due to eventual consistency. This approach is the <a href="https://www.dataversity.net/acid-vs-base-the-shifting-ph-of-database-transaction-processing/#">opposite</a> of ACID transactions that provide strong guarantees for data atomicity, consistency and isolation. Cassandra’s tunable consistency allows a per-operation tradeoff between consistency and availability through consistency levels. An operation’s <a href="https://blog.yugabyte.com/apache-cassandra-lightweight-transactions-secondary-indexes-tunable-consistency/">consistency level</a> specifies how many of the replicas need to respond to the coordinator (the node that receives the client’s read/write request) in order to consider the operation a success. The below <a href="https://medium.com/@foundev/cassandra-how-many-nodes-are-talked-to-with-quorum-also-should-i-use-it-98074e75d7d5">consistency levels</a> are available in Cassandra; they determine the number of replicas that must respond to the coordinator within the request timeout (default 10 seconds) to satisfy the consistency level (a cqlsh sketch for setting the level follows the list):<br />
<ul>
<li><b>ONE</b> – Only a single replica must respond. It provides the highest chance for writes to succeed.</li>
<li><b>TWO</b> – Two replicas must respond.</li>
<li><b>THREE </b>– Three replicas must respond.</li>
<li><b>QUORUM </b>– A majority (n/2 + 1) of the replicas must respond, where n is the replication factor.</li>
<li><b>ALL </b>– All of the replicas must respond.</li>
<li><b>LOCAL_QUORUM </b>– A majority of the replicas in the local datacenter (whichever datacenter the coordinator is in) must respond. It is recommended when temporary inconsistencies of a few milliseconds are acceptable.</li>
<li><b>EACH_QUORUM </b>– A majority of the replicas in each datacenter must respond.</li>
<li><b>LOCAL_ONE </b>– Only a single replica must respond. In a multi-datacenter cluster, this also guarantees that read requests are not sent to replicas in a remote datacenter. It is used to maintain consistent latencies but with higher throughput.</li>
<li><b>ANY </b>– A single replica may respond, or the coordinator may store a hint. If a hint is stored, the coordinator will later attempt to replay the hint and deliver the mutation to the replicas. This consistency level is only accepted for write operations.</li>
<li><b>SERIAL </b>– This consistency level is only used with lightweight transactions. It is recommended only for high-consistency, lower-availability global applications. It runs the <a href="https://www.beyondthelines.net/databases/cassandra-lightweight-transactions/">Paxos protocol</a> across all the data centers, as opposed to LOCAL_SERIAL which runs it only in the local data center. It allows reading the current (and possibly uncommitted) state of data without proposing a new addition or update. If a SERIAL read finds an uncommitted transaction in progress, it will commit it as part of the read.</li>
<li><b>LOCAL_SERIAL </b>– Same as SERIAL but used to maintain consistency locally (within the single datacenter). Equivalent to LOCAL_QUORUM. It is recommended when data is partitioned by DataCenter, any inconsistency is unacceptable but lower availability is tolerable.</li>
</ul>
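<br />
The consistency level is set per operation by the client driver, or interactively in the cqlsh shell using the CONSISTENCY command. Below is a minimal cqlsh sketch: the first statement shows the level currently in use, and the next two switch the session to QUORUM and LOCAL_QUORUM respectively.<br />
<pre class="brush: sql"> CONSISTENCY;
 CONSISTENCY QUORUM;
 CONSISTENCY LOCAL_QUORUM;
</pre>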
<br />
<b><span style="font-size: large;">Partitioner</span></b><br />
<br />
A <a href="https://teddyma.gitbooks.io/learncassandra/content/replication/partitioners.html">partitioner</a> is a hash function for computing the hash (token) of the partition key. Each row of data is uniquely identified by a partition key and distributed across the cluster by the value of its token. Hence the partitioner determines the distribution of data across the nodes in the cluster. The partitioner is configured globally for the entire cluster. The Murmur3Partitioner is the default partitioning strategy for any new Cassandra cluster and is suitable for the majority of use cases. Cassandra offers the following partitioners:<br />
<ul>
<li><b>Murmur3Partitioner </b>(default): It uniformly distributes data across the cluster based on MurmurHash hash values.</li>
<li><b>RandomPartitioner</b>: It uniformly distributes data across the cluster based on MD5 hash values.</li>
<li><b>ByteOrderedPartitioner</b>: It is used for ordered partitioning and orders rows lexically by key bytes. It allows ordered row scan by partition key similar to traditional primary index in RDBMS.</li>
</ul>
<br />
Both the Murmur3Partitioner and RandomPartitioner use tokens to help assign equal portions of data to each node and evenly distribute data from all the column families (tables) throughout the ring. Using an ordered partitioner is not recommended as it causes hot spots due to sequential writes and uneven load balancing across multiple tables. The token computed by the partitioner can be inspected from CQL, as in the sketch below.<br />
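The token() CQL function returns the token the configured partitioner computes for a row's partition key. As a minimal sketch (run against the Employee table created later in this post, whose partition key is EmpId), the query below shows the token that determines which node owns each row.<br />
<pre class="brush: sql"> SELECT token(EmpId), EmpId, Emp_FirstName FROM Employee;
</pre>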
<div>
<br /></div>
<br />
<b><span style="font-size: large;">Installation of Apache Cassandra</span></b><br />
<div>
<br /></div>
<div>
<b>Windows:</b></div>
<ul>
<li>Download latest <a href="http://cassandra.apache.org/download/">Apache Cassandra release</a> or <a href="http://archive.apache.org/dist/cassandra/">archive release</a> and extract the .gz or .tar file.</li>
<li>Find the <span style="background-color: #f3f3f3;">conf/cassandra.yaml</span> file in the CASSANDRA_HOME directory, where CASSANDRA_HOME is the path to the unzipped apache-cassandra directory.</li>
<li>Replace all the paths starting with <span style="background-color: #f3f3f3;">/var/lib/cassandra/</span> with <span style="background-color: #f3f3f3;">../var/lib/cassandra/</span>. This should update the <span style="background-color: #f3f3f3;">properties data_file_directories</span>, <span style="background-color: #f3f3f3;">commitlog_directory</span> and <span style="background-color: #f3f3f3;">saved_caches_directory</span>.</li>
<li>Go to <span style="background-color: #f3f3f3;">CASSANDRA_HOME\bin</span> directory and execute the command <span style="background-color: #f3f3f3;">cassandra.bat</span> in windows command prompt to run cassandra. Cassandra by default runs on port 9042.</li>
<li>To enable authentication for the Cassandra database, open the <span style="background-color: #f3f3f3;">conf/cassandra.yaml</span> configuration file and update the below properties with these values.</li>
<b>authenticator: PasswordAuthenticator</b><br />
<b>authorizer: CassandraAuthorizer</b><br />
<li>Cassandra can also run as a Windows service using the <a href="http://archive.apache.org/dist/commons/daemon/binaries/windows/">Apache commons daemon</a> by following this <a href="https://dzone.com/articles/running-cassandra-as-a-windows-service">tutorial</a>.</li>
<li>To connect with the Cassandra service and create a new keyspace (database) we use the CQL (Cassandra Query Language) shell, cqlsh.</li>
<li>The cqlsh shell is compatible with Python 2.7 and requires it to be installed in the system to execute.</li>
<li>In order to setup python and cqlsh shell without admin privileges, download <a href="http://www.orbitals.com/programs/pyexe.html">Standalone Python 2.7.9</a> for windows and extract it to C drive.</li>
<li>Then edit the <span style="background-color: #f3f3f3;">CASSANDRA_HOME\bin\cqlsh.bat</span> file to add the below set path command; the path should point to the directory where Python was extracted.</li>
<li>setlocal SET PATH=%PATH%;c:\python 2.7.6</li>
<li>Go to <span style="background-color: #f3f3f3;">CASSANDRA_HOME\bin</span> directory and execute the command <span style="background-color: #f3f3f3;">cqlsh.bat -u cassandra -p cassandra</span> to connect to cassandra database.</li>
<li><a href="https://razorsql.com/docs/cassandra_database_browser.html">RazorSQL</a> is a Cassandra database browser which allows viewing Keyspaces, Tables and Views.</li>
</ul>
<b><br /></b>
<b>Ubuntu:</b><br />
<br />
Install <a href="https://openjdk.java.net/install/">OpenJDK 8</a> into the system which is required to run Apache Cassandra.<br />
<br />
$ <b><span style="color: blue;">sudo apt install openjdk-8-jdk</span></b><br />
<br />
Install the apt-transport-https package which is necessary to access a repository over HTTPS.<br />
<br />
$ <b><span style="color: blue;">sudo apt install apt-transport-https</span></b><br />
<br />
Import the repository's GPG key using the following wget command. curl can also be used, as in the second command.<br />
<br />
$ <b><span style="color: blue;">wget -q -O - https://www.apache.org/dist/cassandra/KEYS | sudo apt-key add -</span></b><br />
<b><span style="color: blue;"><br /></span></b>
$ <b><span style="color: blue;">curl https://www.apache.org/dist/cassandra/KEYS | sudo apt-key add -</span></b><br />
<div>
<br /></div>
Add the Cassandra repository to the system's repository list.<br />
<br />
$ <span style="color: blue;"><b>sudo sh -c 'echo "deb http://www.apache.org/dist/cassandra/debian 311x main" > /etc/apt/sources.list.d/cassandra.list'</b></span><br />
<div>
<br />
Once the repository is added, update the apt package list and <a href="https://linuxize.com/post/how-to-install-apache-cassandra-on-ubuntu-18-04/">install</a> the latest version of Apache Cassandra.<br />
<br />
$ <b><span style="color: blue;">sudo apt update</span></b><br />
$ <b><span style="color: blue;">sudo apt install cassandra</span></b></div>
<div>
<br /></div>
<div>
<div>
Cassandra service will automatically start after the installation process is complete. Check the service status using <a href="http://manpages.ubuntu.com/manpages/bionic/man1/systemctl.1.html">systemctl </a>or use the <a href="https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/tools/toolsNodetool.html">nodetool utility</a> to check the status of Cassandra cluster.</div>
<div>
<br /></div>
<div>
$ <b><span style="color: blue;">sudo systemctl status cassandra</span></b></div>
<div>
$ <b><span style="color: blue;">nodetool status</span></b></div>
</div>
<div>
<br /></div>
Enable the Cassandra service to start automatically when the system boots. In addition, the service can be manually started using the start command below.<br />
<br />
$ <b><span style="color: blue;">sudo systemctl enable cassandra</span></b><br />
$ <b><span style="color: blue;">sudo systemctl start cassandra</span></b><br />
<div>
<br />
Cassandra's default configuration is valid for running it on a single node. In order to <a href="https://www.hostinger.com/tutorials/set-up-and-install-cassandra-ubuntu/">use Cassandra in a cluster</a> or across several nodes simultaneously, the Cassandra configuration file needs to be updated. Open the Cassandra configuration file located at <span style="background-color: #f3f3f3;">/etc/cassandra/cassandra.yaml</span> in the nano editor.<br />
<br />
$ <b><span style="color: blue;">sudo nano /etc/cassandra/cassandra.yaml</span></b><br />
<br />
Update the cluster_name parameter by assigning a new name.<br />
<b>cluster_name: [cluster_name]</b><br />
<br />
Update the data storage port using the storage_port parameter. The port should be open in the <a href="https://www.hostinger.com/tutorials/how-to-configure-firewall-on-ubuntu-using-ufw/">firewall</a>.<br />
<b>storage_port: [port]</b><br />
<br />
Check the seed_provider parameter in the seeds section and add the IP addresses of the nodes that make up the cluster, separated by commas.<br />
<b>seeds: "[node_ip],[node_ip],...,[node_ip]"</b><br />
<br />
In Cassandra, by default <a href="https://www.guru99.com/cassandra-security.html">authentication</a> and authorization options are disabled. The below configuration enables the authentication and authorization in Cassandra.<br />
<b>authenticator: PasswordAuthenticator</b><br />
<b>authorizer: CassandraAuthorizer</b><br />
<br />
Save the cassandra.yaml configuration file and run the following nodetool command to flush the system keyspace to disk.<br />
<br />
$ <b><span style="color: blue;">nodetool flush system</span></b><br />
<div>
<br /></div>
<div>
Finally reload the Cassandra server to make the changes take effect as below.<br />
<br /></div>
$ <b><span style="color: blue;">sudo systemctl reload cassandra</span></b></div>
<div>
<br />
To interact with Cassandra through CQL (the Cassandra Query Language), the cqlsh command line utility is used.</div>
<br />
<b><span style="font-size: large;">Cassandra Data Model</span></b><br />
<br />
The Cassandra data model is a schema-optional, column-oriented data model. Unlike a typical RDBMS, Cassandra does not require all the columns used by the application to be modeled up front, nor does it require each row to have the same set of columns. The Cassandra data model consists of keyspaces (analogous to databases), column families (analogous to tables in the relational model), keys and columns.<br />
<br />
A column family is referred to as a table in CQL (Cassandra Query Language) but is actually a map of sorted maps. A row in the map provides access to a set of columns which is represented by a sorted map, e.g. <b>Map<RowKey, SortedMap<ColumnKey, ColumnValue>></b>. A map provides efficient key lookup, and the sorted nature enables efficient scans. In Cassandra, row keys and column keys can be used to do efficient lookups and range scans. A row key, also known as the partition key, has a number of columns associated with it, i.e. a sorted map. The row key is responsible for determining data distribution across the cluster. A column key can itself hold a value as part of the key name; in other words, we can have valueless columns in the rows.<br />
The number of column keys in Cassandra is unbounded, thus supporting wide rows. The maximum number of cells (rows x columns) in a single partition is 2 billion.<br />
<div>
<br /></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/-zRZfv4uiy_c/Xf_XuwNlcJI/AAAAAAAAY-Y/UF9N6MukSGQ8Im7As1bgwJejfxh21B0zwCLcBGAsYHQ/s1600/CassandraRow.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="327" data-original-width="1156" height="181" src="https://1.bp.blogspot.com/-zRZfv4uiy_c/Xf_XuwNlcJI/AAAAAAAAY-Y/UF9N6MukSGQ8Im7As1bgwJejfxh21B0zwCLcBGAsYHQ/s640/CassandraRow.png" width="640" /></a></div>
<br />
<br />
Cassandra doesn't support RDBMS operations such as JOINs, GROUP BY, the OR clause, aggregations etc., hence the data in Cassandra should be denormalized and stored in the same form in which it will be retrieved. The queries which need to be supported should be determined beforehand, and the corresponding tables created according to those queries. Tables should be created in such a way that a minimum number of partitions needs to be read. Cassandra is optimized for high write performance, hence there is a tradeoff between write and read operations in Cassandra. For this reason <b>writes should be maximized for better read performance and data availability</b>. Cassandra is a distributed database and promotes <b>maximizing data duplication to provide instant availability</b> without a single point of failure. Disk space is also much cheaper than memory, CPU processing and IO operations.<br />
<br />
Cassandra spreads data across nodes based on the partition key, which is the first part of the primary key. It does this by hashing the partition key and assigning the hashed values (called tokens) to specific nodes in the cluster. A partition is a group of records with the same partition key. Ideally all partitions would be roughly the same size. A poor choice of partition key causes the data to be distributed unevenly, with either too-large or unbounded partitions and cluster hotspots. Data should preferably be retrieved by a single read within a single partition as opposed to collecting data from different partitions on different nodes. Good partition key selection minimizes the number of partitions read while querying data. Proper partitioning and clustering keys also allow the data to be sorted and queried in the desired order. It should be noted that the order in which partitioned rows are returned depends on the order of the hashed token values and not on the key values themselves. These are some of the rules which must be kept in mind while modelling data in Cassandra, as illustrated by the sketch below.<br />
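As a hedged illustration of these rules, the hypothetical table below uses a composite partition key to keep each partition bounded in size, and a clustering column to return rows pre-sorted for the expected query. The sensor_readings table and its columns are made up for this sketch and are not part of Cassandra itself.<br />
<pre class="brush: sql"> CREATE TABLE sensor_readings (
 sensor_id text,
 reading_date date,
 reading_time timestamp,
 value double,
 PRIMARY KEY ((sensor_id, reading_date), reading_time)
 ) WITH CLUSTERING ORDER BY (reading_time DESC);
 -- reads target exactly one bounded partition and come back newest first
 SELECT * FROM sensor_readings WHERE sensor_id = 'sensor-42' AND reading_date = '2020-01-05';
</pre>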
<div>
<br />
<b><span style="font-size: large;">Security</span></b><br />
<br />
By default, Cassandra has a user with username as `<b>cassandra</b>` and password as `<b>cassandra</b>` which can be used to create a new user using `<b>cqlsh</b>` tool. Login into `cqlsh` tool using the below command.<br />
<br />
$ <b><span style="color: blue;">cqlsh localhost -u cassandra -p cassandra</span></b><br />
<br />
The `USER` commands, namely CREATE/ALTER/DROP USER, are deprecated after the introduction of roles in Cassandra 2.2. <a href="http://cassandra.apache.org/doc/latest/cql/security.html">Roles</a> are used in Cassandra to represent users and groups of users. To create a <a href="http://cassandra.apache.org/doc/latest/operating/security.html#enabling-password-authentication">new superuser</a>, we create an admin role with superuser enabled as below.<br />
<pre class="brush: sql"> CREATE ROLE admin WITH SUPERUSER = true AND LOGIN = true AND PASSWORD = 'admin123';</pre>
<br />
Before creating a role with superuser status, Cassandra authentication must be enabled to avoid the unauthorized error "<span style="color: red;">Only superusers can create a role with superuser status</span>". To enable authentication, the cassandra.yaml configuration must be updated by changing the <b>authenticator</b> property from <b>AllowAllAuthenticator </b>to <b>PasswordAuthenticator </b>and the <b>authorizer </b>property from <b>AllowAllAuthorizer </b>to <b>CassandraAuthorizer</b>, followed by a restart of the Cassandra service.<br />
<br />
Disable the default `cassandra` superuser.<br />
<pre class="brush: sql"> ALTER ROLE cassandra WITH SUPERUSER = false AND LOGIN = false;</pre>
<br />
Create an application role with login enabled, and grant it permissions on a keyspace. The GRANT statements below give the <b>appuser</b> role all permissions, including SELECT, on every table in the <b>testdb</b> keyspace. The system_auth keyspace, which stores the roles and permissions, should also be replicated to every datacenter, as in the last statement.<br />
<pre class="brush: sql"> CREATE ROLE appuser WITH SUPERUSER = false AND LOGIN = true AND PASSWORD = 'test123';
CREATE KEYSPACE testdb WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
GRANT ALL ON KEYSPACE testdb TO appuser;
GRANT SELECT ON KEYSPACE testdb TO appuser;
ALTER KEYSPACE system_auth WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 3};
</pre>
<br />
Verify the default data center name<br />
<pre class="brush: sql"> SELECT data_center FROM system.local;</pre>
<br />
Alter the keyspace to use NetworkTopologyStrategy, specifying the data center name and the number of replica nodes in that data center.<br />
<pre class="brush: sql"> ALTER KEYSPACE ExampleKeyspace WITH replication = {'class': 'NetworkTopologyStrategy', 'datacenter1': '3'};</pre>
<br />
Check the keyspaces on each node in the cluster and you will see that the replication strategy is now NetworkTopologyStrategy.<br />
<pre class="brush: sql"> SELECT * FROM system_schema.keyspaces;</pre>
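<br />
The roles and the permissions granted above can be verified with the LIST commands; a minimal sketch:<br />
<pre class="brush: sql"> LIST ROLES;
 LIST ALL PERMISSIONS OF appuser;
</pre>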
<br /></div>
</div>
<div>
<b><span style="font-size: large;">Keyspace</span></b></div>
<br />
A Cassandra keyspace is a container for all the data of a given application. While defining a keyspace, a replication strategy and a replication factor, i.e. the number of nodes that the data must be replicated to, should be specified. A keyspace is a container for a list of one or more column families, while a column family is a container of a collection of rows. Each row contains ordered columns. Column families represent the structure of the data. Each keyspace has at least one and often many column families.<br />
<br />
<a href="https://docs.datastax.com/en/dse/5.1/cql/cql/cql_reference/cql_commands/cqlCreateKeyspace.html">Create Keyspace</a> using the simple replication strategy with a replication factor of 1.<br />
<pre class="brush: sql"> CREATE KEYSPACE companySimpleDetail
WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1};</pre>
<br />
Create a keyspace using the network topology replication strategy with a replication factor for the datacenter named '<b>London_Center</b>'. The data center name must match the name configured in the snitch. The default datacenter name in Cassandra is '<b>datacenter1</b>'; to display the datacenter name, use the <b>nodetool status</b> command. The example also disables the write commit log for the companyNetworkDetail keyspace, which increases performance but also the risk of data loss. The commit log should never be disabled when using SimpleStrategy.<br />
<pre class="brush: sql"> CREATE KEYSPACE companyNetworkDetail
WITH REPLICATION = {'class': 'NetworkTopologyStrategy', 'London_Center': 1 }
AND DURABLE_WRITES = false;
USE companyNetworkDetail;
</pre>
<br />
The existing keyspace can be updated using <a href="https://docs.datastax.com/en/dse/5.1/cql/cql/cql_reference/cql_commands/cqlAlterKeyspace.html">ALTER KEYSPACE</a> to change the replication factor, replication strategy or durable writes properties. The <a href="https://docs.datastax.com/en/dse/5.1/cql/cql/cql_reference/cql_commands/cqlDropKeyspace.html">DROP KEYSPACE</a> command drops the keyspace, including all the data, column families, user defined types and indexes, from Cassandra. Before dropping the keyspace, Cassandra takes a snapshot of the keyspace. If the keyspace does not exist in Cassandra, the command returns an error unless IF EXISTS is used. A short sketch of both statements follows.<br />
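For example, the keyspace created above could be altered and then dropped as below (the replication factor of 3 is an arbitrary illustration):<br />
<pre class="brush: sql"> ALTER KEYSPACE companySimpleDetail
 WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 3};
 DROP KEYSPACE IF EXISTS companySimpleDetail;
</pre>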
<br />
<b><span style="font-size: large;">Cassandra Query Language</span></b><br />
<br />
Cassandra Query Language is similar to SQL and allows changing data, looking up data, storing data, or changing the way data is stored. It mainly consists of CQL statements which end in a semicolon (;). Keyspace, column, and table names created using CQL are case-insensitive unless enclosed in double quotation marks. CQL defines many <a href="https://docs.datastax.com/en/archived/cql/3.3/cql/cql_reference/cql_data_types_c.html">built-in data types</a> for columns. For a complete CQL command reference, see <a href="https://docs.datastax.com/en/archived/cql/3.3/cql/cql_reference/cqlCommandsTOC.html">CQL commands</a>.<br />
<br />
Create the column family <b>Employee</b> with a primary key consisting of employee id, department no and salary. A column family (table) cannot be created in Cassandra without a primary key. The primary key should have at least one partition key. If there is no inner bracket within the primary key columns, i.e. (EmpId, Emp_deptNo, Emp_salary) is equivalent to ((EmpId), Emp_deptNo, Emp_salary), then the first column is used as the partition key, which identifies the node where the data is written. The rest of the columns, namely Emp_deptNo and Emp_salary, are clustering keys (clustering columns) which determine the on-disk sort order of data written within the partition. Hence data within the partition is sorted first by Emp_deptNo and then by the Emp_salary column.<br />
<pre class="brush: sql"> CREATE TABLE Employee (
EmpId int,
Emp_FirstName text,
Emp_LastName varchar,
Emp_salary double,
Emp_comm float,
Emp_deptNo int,
Emp_DOB date,
PRIMARY KEY (EmpID, Emp_deptNo, Emp_salary)
);
DESCRIBE Employee;
</pre>
In a variation, the WITH clause indicates that the data should be clustered in ascending order first by the Emp_FirstName column and then by the Emp_DOB column. Note that CLUSTERING ORDER BY can only reference the clustering columns of the primary key, so those columns are declared as the clustering columns here.<br />
<pre class="brush: sql"> CREATE TABLE Employee (
EmpId int,
Emp_FirstName text,
Emp_LastName varchar,
Emp_salary double,
Emp_comm float,
Emp_deptNo int,
Emp_DOB date,
PRIMARY KEY (EmpId, Emp_FirstName, Emp_DOB)
) WITH CLUSTERING ORDER BY (Emp_FirstName ASC, Emp_DOB ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'};
</pre>
<br />
In another variation we can create the <b>Employee</b> column family with a composite partition key of employee id and salary, as below.<br />
<pre class="brush: sql"> CREATE TABLE Employee (
EmpId int,
Emp_FirstName text,
Emp_LastName varchar,
Emp_salary double,
Emp_comm float,
Emp_deptNo int,
Emp_DOB date,
PRIMARY KEY ((EmpId, Emp_salary), Emp_FirstName, Emp_DOB)
);
</pre>
<br />
Cassandra supports <a href="https://docs.datastax.com/en/dse/5.1/cql/cql/cql_reference/cql_commands/cqlAlterTable.html">ALTER TABLE</a> to add or drop columns, rename columns, or change the properties of the table. The <a href="https://docs.datastax.com/en/dse/5.1/cql/cql/cql_reference/cql_commands/cqlDropTable.html">DROP TABLE</a> statement drops the specified table, including all of its data, from the keyspace, while <a href="https://docs.datastax.com/en/dse/5.1/cql/cql/cql_reference/cql_commands/cqlTruncate.html">TRUNCATE TABLE</a> removes all the data from the specified table. In both cases Cassandra takes a snapshot of the data (not the schema) as a backup, as in the sketch below.<br />
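A minimal sketch of these statements against the Employee table defined above (the Emp_email column is a made-up example; this is for illustration only, as the later examples in this post assume the original schema):<br />
<pre class="brush: sql"> ALTER TABLE Employee ADD Emp_email text;
 ALTER TABLE Employee DROP Emp_comm;
 TRUNCATE TABLE Employee;
</pre>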
<br />
The INSERT INTO statement below writes data into Cassandra columns as a row. The primary key columns need to be specified along with the other optional columns.<br />
<pre class="brush: sql"> INSERT INTO Employee(EmpId, Emp_firstName, Emp_LastName, Emp_salary, Emp_comm, Emp_deptNo, Emp_DOB)
VALUES (1001, 'John', 'Wilkinson', 135000, 110, 10, '1983-08-14');</pre>
<br />
As described earlier, Cassandra stores the data in a nested sorted map data structure, with each RowKey mapping to multiple column names and values along with a timestamp. The value of the CQL partition key is used internally as the row key. The names of the non-primary key CQL fields are used internally as column names, and the values of the non-primary key CQL fields are stored as the corresponding column values. The columns of each RowKey are sorted by column name. Rows can contain columns with no column name or no column value.<br />
<br />
Cassandra allows upserts, where an INSERT creates a row only if the primary key does not already exist, otherwise it updates that row. The UPDATE statement is used to update the data in a Cassandra table. Column values are changed in the SET clause while data is filtered with the WHERE clause. <b>An UPDATE cannot set any column which is part of the primary key</b>.<br />
<pre class="brush: sql"> UPDATE companyNetworkDetail.Employee SET Emp_salary=145000, Emp_deptNo=15 WHERE EmpId=1001;
</pre>
<br />
The DELETE command removes an entire row or some columns from the Employee table. When data is deleted, it is not removed from the table immediately. Instead the deleted data is marked with a tombstone and removed after compaction.
<br />
<pre class="brush: sql"> Delete from companyNetworkDetail.Employee Where EmpId=1001;</pre>
The above syntax will delete one or more rows depending on the data filtered by the WHERE clause. The below syntax, on the other hand, will delete only some columns from the table.
<br />
<pre class="brush: sql"> Delete Emp_comm from companyNetworkDetail.Employee Where EmpId=1001;</pre>
<br />
<div>
Below are some of SELECT query rules for fetching the data from Cassandra cluster. For the below examples we will be using partition key as EmpId and Emp_salary, while cluster columns (key) as Emp_FirstName and Emp_DOB.</div>
<ol>
<li> All partition key columns should be used with = operator in the WHERE clause.
<br />
<pre class="brush: sql"> SELECT * FROM Employee WHERE EmpId = 1001; // NOT ALLOWED
</pre>
The above query gives an error saying it cannot execute as it might involve data filtering and thus may have unpredictable performance. Cassandra knows that it might not be able to execute the query efficiently and hence gives this warning as an error message. The only way Cassandra can execute this query is by retrieving all the rows from the Employee table and then filtering out the ones which do not have the requested value for the EmpId column. If the Employee table contains, say, a million rows and the majority of them have the requested value for the EmpId column, the query will still be relatively efficient, in which case <a href="https://www.datastax.com/blog/2014/12/allow-filtering-explained">ALLOW FILTERING</a> can be used as below.
<pre class="brush: sql"> SELECT * FROM Employee WHERE EmpId = 1001 ALLOW FILTERING;
</pre>
Hence, <a href="https://www.instaclustr.com/apache-cassandra-scalability-allow-filtering-partition-keys/">ideally</a>, all the columns which are part of the partition key should be used in the WHERE clause for better performance.
<pre class="brush: sql"> SELECT * FROM Employee WHERE EmpId = 1001 AND Emp_salary = 145000;
</pre>
</li>
<li> Use the clustering columns in the same order as they are defined in the table.
<br />
<pre class="brush: sql"> SELECT * FROM Employee WHERE EmpId = 1001 AND Emp_salary = 145000 AND Emp_DOB = '1983-08-14'; // NOT ALLOWED
</pre>
The above query fails with the message "<span style="color: red;">PRIMARY KEY column "emp_dob" cannot be restricted as preceding column "emp_firstname" is not restricted</span>". Cassandra stores the data on the disk which is partitioned by EmpId and Emp_salary and then sorted by Emp_FirstName and Emp_DOB. Hence we cannot skip one of the cluster columns in the middle and try to fetch the remaining columns.
<pre class="brush: sql"> SELECT * FROM Employee WHERE EmpId = 1001 AND Emp_salary = 145000 AND Emp_DOB = '1983-08-14' AND Emp_FirstName = 'John';
</pre>
However we can skip the clustering columns which come after the queried clustering column in the WHERE clause. For example, the below query uses all the partition key columns and only the "Emp_FirstName" clustering key, skipping the last clustering key "Emp_DOB".
<pre class="brush: sql"> SELECT * FROM Employee WHERE EmpId = 1001 AND Emp_salary = 145000 AND Emp_FirstName = 'John';
</pre>
</li>
<li> The = operator cannot be used on a non-primary-key column alone, without the partition key columns; to do so, "ALLOW FILTERING" is mandatory.
<br />
<pre class="brush: sql"> SELECT * FROM Employee WHERE Emp_comm = 110; // NOT ALLOWED
SELECT * FROM Employee WHERE Emp_comm = 110 ALLOW FILTERING;
</pre>
</li>
<li> The = operator cannot be used on only the clustering key columns without specifying the full partition key.
<br />
<pre class="brush: sql"> SELECT * FROM Employee WHERE Emp_DOB = '1983-08-14' AND Emp_FirstName = 'John'; // NOT ALLOWED
SELECT * FROM Employee WHERE EmpId = 1001 AND Emp_salary = 145000 AND Emp_DOB = '1983-08-14' AND Emp_FirstName = 'John';
</pre>
</li>
<li> The IN operator is allowed on all the columns of the partition key, but it slows down performance.
<br />
<pre class="brush: sql"> SELECT * FROM Employee WHERE EmpId IN (1001) AND Emp_salary IN (100000, 200000);
</pre>
</li>
<li> >, >=, <= and < operators are not allowed on the partition key columns.
<br />
<pre class="brush: sql"> SELECT * FROM Employee WHERE EmpId <= 1001 AND Emp_salary = 145000; // NOT ALLOWED
</pre>
</li>
<li> >, >=, <= and < operators can be used only on clustering key columns, and only when all the partition key columns and the preceding clustering columns are restricted with =.
<br />
<pre class="brush: sql"> SELECT * FROM Employee WHERE EmpId = 1001 AND Emp_salary = 145000 AND Emp_FirstName = 'John' AND
Emp_DOB = '1983-08-14' AND Emp_comm > 100; // NOT ALLOWED
SELECT * FROM Employee WHERE EmpId = 1001 AND Emp_salary = 145000 AND Emp_FirstName = 'John' AND
Emp_DOB > '1983-08-14' AND Emp_comm > 100;
</pre>
Here, Cassandra rejects the query that attempts to <a href="https://docs.datastax.com/en/dse/5.1/cql/cql/cql_using/whereClustering.html">return ranges</a> without identifying any of the higher level segments.
</li>
</ol>
<br />
A table contains a timestamp representing the date and time that a write occurred to a column. Using CQL's <a href="https://docs.datastax.com/en/dse/5.1/cql/cql/cql_using/useWritetime.html">WRITETIME function</a> in the SELECT statement allows fetching the timestamp at which the column was written to the database. The output of the function is in microseconds.
<br />
<pre class="brush: sql"> SELECT WRITETIME(Emp_comm) from Employee;
</pre>
<br />
<br />
<b><span style="font-size: large;">Time To Live (TTL)</span></b><br />
<br />
Cassandra provides a functionality for Automatic Data Expiration using <a href="https://docs.datastax.com/en/archived/cql/3.3/cql/cql_using/useExpire.html">Time to Live</a> (TTL) values during data insertion. The TTL value is specified in seconds. Once the Data exceeds the TTL period, it expires and is marked with a tombstone. Expired data continues to be available for read requests during the <a href="https://docs.datastax.com/en/archived/cql/3.3/cql/cql_reference/cqlCreateTable.html#tabProp__cqlTableGc_grace_seconds">grace period</a>. Normal compaction and repair processes automatically remove the tombstone data. TTL is not supported on <a href="https://docs.datastax.com/en/archived/cql/3.3/cql/cql_using/useCountersConcept.html">counter columns</a>.<br />
<div>
<br /></div>
<div>
The USING TTL syntax during data insertion allows specifying the TTL value in seconds as below.</div>
<pre class="brush: sql"> INSERT INTO companyNetworkDetail.Employee(EmpId, Emp_firstName, Emp_LastName, Emp_salary, Emp_comm, Emp_deptNo, Emp_DOB)
VALUES (1002, 'Kevin', 'Gordan', 129000, 350, 20, '1971-05-21')
USING TTL 100;
</pre>
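<br />
The remaining TTL of a column can be read back with the TTL() function, and a TTL can also be applied with an UPDATE. Below is a minimal sketch against the row inserted above (the values simply mirror that insert).<br />
<pre class="brush: sql"> SELECT TTL(Emp_comm) FROM companyNetworkDetail.Employee WHERE EmpId = 1002;
 UPDATE companyNetworkDetail.Employee USING TTL 300
 SET Emp_comm = 400
 WHERE EmpId = 1002 AND Emp_deptNo = 20 AND Emp_salary = 129000;
</pre>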
<br />
<b><span style="font-size: large;">Cassandra Batch</span></b><br />
<br />
<a href="https://docs.datastax.com/en/dse/6.0/cql/cql/cql_reference/cql_commands/cqlBatch.html">Cassandra BATCH</a> is used to execute multiple modification statements (insert, update, delete) as a single unit. Either all or none of the batch operations will succeed, ensuring atomicity. Batch isolation occurs only when the batch writes to a single partition, in which case all the DML statements are applied before the data becomes visible. Batches also help to reduce client-server traffic and to update a table more efficiently, with a single row mutation, when the batch targets a single partition. No rollbacks are supported for a Cassandra batch. Below is the syntax for the batch operation, followed by a concrete example.<br />
<pre class="brush: sql"> BEGIN BATCH
DML_statement1 ;
DML_statement2 USING TIMESTAMP [ epoch_microseconds ] ;
DML_statement3 ;
APPLY BATCH ;
</pre>
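<br />
A minimal sketch of a single-partition batch against the Employee table used earlier in this post (the values are made up for illustration):<br />
<pre class="brush: sql"> BEGIN BATCH
 INSERT INTO companyNetworkDetail.Employee (EmpId, Emp_deptNo, Emp_salary, Emp_FirstName, Emp_LastName)
 VALUES (1003, 30, 99000, 'Alice', 'Turner');
 UPDATE companyNetworkDetail.Employee SET Emp_comm = 120
 WHERE EmpId = 1003 AND Emp_deptNo = 30 AND Emp_salary = 99000;
 APPLY BATCH;
</pre>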
<br />
<span style="font-size: large;"><b>Cassandra Collections</b></span><br />
<br />
Cassandra allows storing multiple values in a single column using collection data types. Collections should be kept small, because a collection cannot be queried partially: the entire collection has to be traversed, so large collections add query overhead. Hence Cassandra limits a collection to 64KB of data. Cassandra supports three collection types, as below.<br />
<br />
<b>Set</b>: Set is a data type that is used to store a group of elements. The elements of a set will be returned in a sorted order. While inserting data into the elements in a set, enter all the values separated by comma within curly braces { } as shown below.<br />
<pre class="brush: sql"> CREATE TABLE data2 (name text PRIMARY KEY, phone set<varint>);
INSERT INTO data2(name, phone)VALUES ('rahman', {9848022338,9848022339});
UPDATE data2 SET phone = phone + {9848022330} where name = 'rahman';
</pre>
<br />
<b>List</b>: The list data type is used when the order of elements matters. While inserting data into the elements in a list, enter all the values separated by comma within square braces [ ] as shown below.<br />
<pre class="brush: sql"> CREATE TABLE data(name text PRIMARY KEY, email list<text>);
INSERT INTO data(name, email) VALUES ('ramu', ['abc@gmail.com','cba@yahoo.com']);
UPDATE data SET email = email +['xyz@tutorialspoint.com'] WHERE name = 'johnny';
</pre>
<br />
<b>Map</b>: The map is a collection type that is used to store key value pairs. While inserting data into the elements in a map, enter all the key : value pairs separated by comma within curly braces { } as shown below.<br />
<pre class="brush: sql"> CREATE TABLE data3 (name text PRIMARY KEY, address map<text, text>);
INSERT INTO data3 (name, address) VALUES ('robin', {'home' : 'hyderabad' , 'office' : 'Delhi' } );
UPDATE data3 SET address = address+{'office':'mumbai'} WHERE name = 'robin';
</pre>
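<br />
Elements can also be removed from collections with the same UPDATE/DELETE syntax; a minimal sketch against the tables created above:<br />
<pre class="brush: sql"> UPDATE data2 SET phone = phone - {9848022338} WHERE name = 'rahman';   -- remove one element from the set
 UPDATE data SET email = email - ['cba@yahoo.com'] WHERE name = 'ramu';  -- remove a value from the list
 DELETE address['office'] FROM data3 WHERE name = 'robin';               -- remove a single map entry
</pre>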
<br />
Under the hood, each item within a collection (list, set or map) becomes a column with a different naming pattern. For a map, the column name is the combination of the map column name and the key of the item, while the value is the value of the item. For a list, the column name is the combination of the list column name and a UUID representing the item's position in the list, and the value is the value of the item. For a set, the column name is the combination of the set column name and the item value, and the value is always empty, as shown in the example below.
<br />
<pre class="brush: sql"> CREATE TABLE example (
key1 text PRIMARY KEY,
map1 map<text,text>,
list1 list<text>,
set1 set<text>
);
INSERT INTO example (key1, map1, list1, set1)
VALUES ( 'john', {'patricia':'555-4326','doug':'555-1579'}, ['doug','scott'], {'patricia','scott'} )
</pre>
<br />
Below is the internal storage representation of the row inserted in table <b>example</b> from above.
<br />
<pre class="plain">RowKey: john
=> (column=, value=, timestamp=1374683971220000)
=> (column=map1:doug, value='555-1579', timestamp=1374683971220000)
=> (column=map1:patricia, value='555-4326', timestamp=1374683971220000)
=> (column=list1:26017c10f48711e2801fdf9895e5d0f8, value='doug', timestamp=1374683971220000)
=> (column=list1:26017c12f48711e2801fdf9895e5d0f8, value='scott', timestamp=1374683971220000)
=> (column=set1:'patricia', value=, timestamp=1374683971220000)
=> (column=set1:'scott', value=, timestamp=1374683971220000)
</pre>
<br />
<b><span style="font-size: large;">Indexing</span></b><br />
<br />
<a href="https://teddyma.gitbooks.io/learncassandra/content/model/indexing.html">An index</a>, also called a <a href="https://docs.datastax.com/en/archived/cql/3.3/cql/cql_using/usePrimaryIndex.html">secondary index</a>, allows access to data in Cassandra using non-primary key fields other than the partition key. It enables fast and efficient lookup of data matching a given condition. Cassandra does not allow conditionally querying by a normal column which has no index. Cassandra stores the indexed column values in a hidden column family (table), separate from the table whose column is being indexed. The index data is stored locally on each node and is not replicated to other nodes. Consequently, a query by an indexed column needs to be forwarded to all the nodes, and the responses from all the nodes must be merged before the results are returned. Hence indexed column queries slow down as more machines are added to the cluster. Indexes can be used for <a href="https://docs.datastax.com/en/archived/cql/3.3/cql/cql_using/useIndexColl.html">collections</a>, collection columns, and any other columns except counter columns and static columns. Currently Apache Cassandra 3.1 only supports equality comparison conditions for indexed column queries, with no support for range or ORDER BY queries. Cassandra's built-in indexes work best on tables which have many rows that contain the indexed value. The <a href="https://pantheon.io/blog/cassandra-scale-problem-secondary-indexes">overhead</a> involved in querying and maintaining the index increases as the number of unique values in the indexed column increases. Indexes should be avoided for high-cardinality columns, counter columns, and frequently updated or deleted columns.<br />
<br />
The <b><a href="http://cassandra.apache.org/doc/latest/cql/indexes.html">Create index</a></b> statement creates an index on the column specified by the user. If data already exists for the column to be indexed, then Cassandra indexes the existing data during the 'create index' statement execution. After creating an index, Cassandra indexes new data automatically when data is inserted. An index cannot be created on the primary key, as the primary key is already indexed. Cassandra needs indexes on the columns in order to apply filtering within queries; a concrete sketch follows the syntax below.<br />
<pre class="brush: sql"> Create index IndexName on KeyspaceName.TableName(ColumnName);</pre>
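<br />
A minimal sketch: index the regular Emp_FirstName column of the Employee table and then filter on it without specifying the partition key (the index name is a made-up example).<br />
<pre class="brush: sql"> CREATE INDEX emp_firstname_idx ON companyNetworkDetail.Employee (Emp_FirstName);
 SELECT * FROM companyNetworkDetail.Employee WHERE Emp_FirstName = 'John';
</pre>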
<br />
The <b>Drop index</b> statement drops the specified index. If an index name was not specified during index creation, then the index name defaults to <b>TableName_ColumnName_idx</b>. If the index does not exist, the drop index statement returns an error unless IF EXISTS is used, in which case it is a no-op. When dropping an index, the keyspace name should be specified along with the index name, or else the index will be dropped from the current keyspace.<br />
<pre class="brush: sql"> Drop index IF EXISTS KeyspaceName.IndexName</pre>
<br />
<b><span style="font-size: large;">Materialized views</span></b><br />
<br />
Cassandra provides <a href="https://www.datastax.com/blog/2015/06/new-cassandra-30-materialized-views">Materialized views</a> to handle automated server-side denormalization, removing the need for client-side denormalization when querying a column without specifying the partition key, and without relying on secondary indexes which add latency to each request. Materialized Views are essentially standard CQL tables that are maintained automatically by the Cassandra server. Cassandra ensures eventual consistency between the base and view data and enables very fast lookups of data in each view using the normal Cassandra read path. Materialized views do not have the same write performance characteristics as normal tables. The views require an additional read-before-write, as well as data consistency checks on each replica before creating the view updates, which adds overhead and latency to writes. Since materialized views create one CQL row in the view for each CQL row in the base table, they don't support combining multiple base rows into a single view row. Currently, materialized views only support simple SELECT statements in the view definition, with no support for complex WHERE clauses, ORDER BY, or functions.<br />
<br />
When a <a href="https://opencredo.com/blogs/everything-need-know-cassandra-materialized-views/">materialized view</a> is created against a table which already has data, a build process is kicked off to populate the materialized view. During this period, queries against the materialized view may not return all results. On completion of the build process, the system.built_materializedviews table on each node is updated with the view's name. When the base table is altered by adding new columns or deleting/altering existing columns, the materialized view is updated as well. If the base table is dropped, any associated views are also dropped. When rows are deleted from the base table, the materialized view queries all of the deleted values and generates tombstones for each of the corresponding materialized view rows, because the values to be tombstoned in the view are not included in the base table's tombstone. Hence the performance of materialized views suffers when there are a large number of partition tombstones.<br />
<br />
The <a href="http://cassandra.apache.org/doc/latest/cql/mvs.html">CREATE MATERIALIZED VIEW</a> CQL command allows to create a materialized view on the specified <b>cc_transactions</b> base table as below. The PRIMARY KEY clause defines the partition key and clustering columns for the Materialized View's backing table.<br />
<pre class="brush: sql"> CREATE TABLE cc_transactions (
userid text,
year int,
month int,
day int,
id int,
amount int,
card text,
status text,
PRIMARY KEY ((userid, year), month, day, id)
);
CREATE MATERIALIZED VIEW transactions_by_day AS
SELECT year, month, day, userid, id, amount, card, status
FROM mvdemo.cc_transactions
WHERE userid IS NOT NULL AND year IS NOT NULL AND month IS NOT NULL AND day IS NOT NULL AND id IS NOT NULL AND card IS NOT NULL
PRIMARY KEY ((year, month, day), userid, id);
SELECT * FROM transactions_by_day where year = 2017 and month = 2 and day = 6;
</pre>
<br />
Some of the limitations on defining Materialized Views are as below:<br />
<ul>
<li><b>A primary key of a Materialized View must contain all the columns from the primary key of the base table</b>. As each CQL Row in the view is mapped with corresponding CQL Row in the base, all the columns of the original primary key (partition key and clustering columns) must be represented in the materialized view.</li>
<li><b>A primary key of a Materialized View can contain at most one other additional column which is not part of the original primary key</b>. This <a href="https://issues.apache.org/jira/browse/CASSANDRA-9928">restriction</a> is <a href="https://issues.apache.org/jira/browse/CASSANDRA-10226">added</a> to guarantee that no data (or deletions) are lost and the Materialized Views are consistent with the base table.</li>
</ul>
<div>
<br /></div>
<br />
<b><span style="font-size: large;">Lightweight Transactions</span></b><br />
<br />
Cassandra does not support ACID transactions with rollback or locking mechanisms, but instead offers atomic, isolated, and durable transactions with eventual/tunable consistency. Sometimes insert or update operations are required to be atomic, where a consensus must be reached between all the replicas, which requires a read-before-write. Such read-before-write is provided by CQL using <a href="https://www.datastax.com/blog/2013/07/lightweight-transactions-cassandra-20">Lightweight Transactions</a> (LWT), also known as compare and set, which use an IF clause on inserts and updates. Compare And Set (CAS) operations require a single key to be read first before updating it with a new value, with the goal of ensuring the update leads to a unique value. For lightweight transactions, Apache Cassandra automatically upgrades its consistency management protocol to the <a href="https://www.beyondthelines.net/algorithm/basic-paxos/">Paxos algorithm</a>.<br />
<br />
In Paxos, any node can act as a leader or proposer which picks a proposal number and sends it to the participating replicas (determined by the replication factor). Many nodes can attempt to act as leaders simultaneously. If the proposal number is the highest the replica has seen, the replica promises to not accept any earlier proposal (with a smaller number). If a majority promise to accept the proposal, the leader may proceed with its proposal. However, if a majority of replicas included an earlier proposal with their promise, then that is the value the leader must propose. Conceptually, if a leader interrupts an earlier leader, it must first finish that leader's proposal before proceeding with its own, thus giving us the desired linearizable behavior. After a proposal has been accepted, it will be returned to future leaders in the promise, and the new leader will have to re-propose it again. Cassandra adds the commit/acknowledge phase to move the accepted value into Cassandra storage, and the propose/accept phase to read the current value of the row to match with the expected value for the compare-and-set operation. The overall cost involves four round trips to provide linearizability, which is very high, hence it should be used for only a very small minority of operations. Lightweight transactions are restricted to a single partition. The <b>SERIAL</b> ConsistencyLevel allows reading the current (possibly uncommitted) Paxos state without having to propose a new update. If a SERIAL read finds an uncommitted update in progress, it will commit it as part of the read.<br />
<br />
Lightweight transactions can be used for both INSERT and UPDATE statements, using the IF clause as below.<br />
<pre class="brush: sql"> INSERT INTO USERS (login, email, name, login_count)
VALUES ('jbellis', 'jbellis@datastax.com', 'Jonathan Ellis', 1)
IF NOT EXISTS
</pre>
We could use IF EXISTS or IF NOT EXISTS or any other IF <CONDITION> as below:<br />
<pre class="brush: sql"> UPDATE users SET reset_token = null, password = 'newpassword' WHERE login = 'jbellis'
IF reset_token = 'some-generated-reset-token';
</pre>
<br />
<b><span style="font-size: large;">Cassandra Counters</span></b><br />
<br />
Cassandra supports counter columns which implement a distributed count. A <a href="https://docs.datastax.com/en/ddaccql/doc/cql/cql_using/refCounterType.html">counter</a> is a special column used to store an integer that is changed in increments.
The counter column value is a 64-bit signed integer. A counter column cannot coexist with non-counter data columns in Cassandra, hence every other column in a counter table must be part of the primary key (partition or clustering key). A counter column itself should never be used as part of the primary key or the partition key. Below is an example of a counter table.
<br />
<pre class="brush: sql"> CREATE TABLE WebLogs (
page_id uuid,
page_name Text,
insertion_time timestamp,
page_count counter,
PRIMARY KEY ((page_id, page_name), insertion_time)
);
</pre>
Normal INSERT statements are not allowed to insert data into counter tables; UPDATE statements are used instead, as below. Note that the WHERE clause should reference the concrete primary key values of an existing row (as in the increment example further below), since calling uuid() or now() in the WHERE clause addresses a brand new row on every call.<br />
<pre class="brush: sql"> INSERT INTO WebLogs (page_id , page_name , insertion_time , page_count ) VALUES (uuid(),'test.com',dateof(now()),0); // NOT ALLOWED
UPDATE WebLogs SET page_count = page_count + 1 WHERE page_id = uuid() AND page_name ='test.com' AND insertion_time =dateof(now());
SELECT * from WebLogs;
</pre>
We cannot set an arbitrary value to a counter, as it only supports two operations: increment and decrement. The counter column can be incremented or decremented using the UPDATE statement as below.
<br />
<pre class="brush: sql"> UPDATE WebLogs SET page_count = page_count + 1
WHERE page_id =8372cee6-1d04-41f7-a70d-98fdd9036448 AND page_name ='test.com' AND insertion_time ='2020-01-05 05:19:31+0000';
</pre>
<br />
Cassandra rejects <a href="https://docs.datastax.com/en/developer/java-driver/3.7/manual/query_timestamps/">USING TIMESTAMP</a> or <a href="https://docs.datastax.com/en/dse/5.1/cql/cql/cql_using/useExpireExample.html">USING TTL</a> when updating a counter column. A counter column also cannot be indexed or deleted.
<br />
<br />
<br />
<b><span style="font-size: large;">Limitations of Cassandra</span></b><br />
<ul>
<li><a href="https://blog.pythian.com/cassandra-use-cases/">Cassandra</a> supports fast targeted reads by primary key but has sub-optimal support for alternative access paths. Hence it does not work well with tables that have lots of secondary indexes and multiple access paths. Secondary indexes should not be used as an alternative access path into a table.</li>
<li>Cassandra, being a distributed system, does not support a globally unique sequence of numbers. Hence it will not work for applications relying on identifying rows with sequential values.</li>
<li>Cassandra does not support ACID principles.</li>
<li>Cassandra does not support JOINS, GROUP BY, OR clause, aggregation etc. So data should be stored in a way that it could be retrieved directly.</li>
<li>The data model should be de-normalized for fast access. Cassandra can handle a very large number of columns in a table.</li>
<li>Cassandra does not support row level locking since locking is a complex problem for distributed systems and usually leads to slow operations. </li>
<li>Cassandra is very good at writes but okay with reads. Updates and deletes are implemented as special cases of writes and that has consequences that are not immediately obvious.</li>
<li>CQL has no transaction support with begin/commit transaction syntax.</li>
<li>Cassandra can only support (McFadin 2014) two billion cells (rows x columns) per partition, so all of the data should not be inserted into the same partition. Data should be distributed evenly using the partition key value.</li>
</ul>
<div>
<br />
<div>
<b><span style="font-size: large;">Ideal Cassandra Use Cases</span></b></div>
<div>
<ul>
<li>Writes exceed reads by a large margin.</li>
<li>Data is rarely updated and when updates are made they are idempotent (mostly no changes).</li>
<li>Read Access is by a known primary key.</li>
<li>Data can be partitioned via a key that allows the database to be spread evenly across multiple nodes.</li>
<li>There is no need for joins or aggregates.</li>
</ul>
</div>
</div>
<div>
<br /></div>
<b><span style="font-size: x-large;">Kubernetes: Container Orchestration at work</span></b><br />
<br />
In the <a href="https://emprovisetech.blogspot.com/2018/12/docker-platform-for-microservices.html">previous post</a> we discussed the benefits of containerization over virtualization using the Docker containerization platform and Docker Swarm orchestration of containers. Containers package the application and isolate it from the host, making them more reliable and scalable. However, after scaling up to say 1000 containers or 500 services, container deployment, management, load balancing and peer-to-peer communication become a daunting task. Container orchestration automates the deployment, management, scaling, networking, and availability of containers, and hence becomes a necessity while operating at such a scale. It automates the arrangement, coordination, and management of software containers. Kubernetes is currently the best available platform for container orchestration.<br />
<br />
<a href="https://kubernetes.io/">Kubernetes</a> is an open-source platform for automating deployment, scaling, and operations of application containers across clusters of hosts, providing container-centric infrastructure. It enables managing containerized workloads and services, and facilitates both declarative configuration and automation. Kubernetes uses a declarative description of the desired state, in the form of configuration, to manage the scheduling and (re)provisioning of resources and to run the containers. It is very portable, configurable and modular, and provides features like auto-scaling, auto-placement, auto-restart, auto-replication, load balancing and auto-healing of containers. Kubernetes can group a number of containers into one logical unit for managing and deploying an application or service. Kubernetes is ideally container agnostic but it mostly runs Docker containers. It was developed by <a href="https://cloud.google.com/kubernetes-engine/">Google</a>, which later donated it to the <a href="https://www.cncf.io/">Cloud Native Computing Foundation</a> that currently maintains it. Kubernetes has a large and growing online community and many <a href="https://www.cncf.io/community/kubecon-cloudnativecon-events/">KubeCon</a> conferences are held around the world.<br />
<br />
Some of the key features of Kubernetes are as below:<br />
<ul>
<li><b>Automatic Binpacking</b>: It packages the application and automatically places containers based on system requirements and available resources.</li>
<li><b>Service Discovery and Load balancing</b>: It provides ability to discover services and distribute traffic across the worker nodes.</li>
<li><b>Storage Orchestration</b>: It automatically mounts external storage system or volumes.</li>
<li><b>Self Healing</b>: Whenever a container fails it automatically creates a new container in its place. When a node fails, kubernetes will create and run all the containers from the failed node into different nodes.</li>
<li><b>Secret and Configuration Management</b>: It deploys or updates secret and application configuration without having to restart or rebuild the entire image on the running container.</li>
<li><b>Batch Execution</b>: It manages batch and CI work loads which replaces failed containers.</li>
<li><b>Horizontal Scaling</b>: It provides simple CLI commands in order to scale applications up or down based on network load.</li>
<li><b>Automatic Rollbacks and Rollouts</b>: It progressively rollouts updates/changes for the application or its configuration ensuring that individual instances are updated one after the other. When things go wrong kubernetes rollbacks the corresponding change, backing out to the previous state of running containers.</li>
</ul>
<div>
<br /></div>
<span style="font-size: large;"><b>Comparison with Docker Swarm</b></span><br />
<br />
Docker Swarm is easy to set up and requires only a few commands to configure and run a swarm cluster. While the Kubernetes setup is more complex and requires more commands to set up a cluster, the resulting cluster is more customizable, stable and reliable. Scaling up using Docker Swarm is faster compared to Kubernetes, as Docker Swarm is the native orchestration for running Docker containers. Docker Swarm also has in-built automated load balancing, whereas Kubernetes requires manual service configuration. In Kubernetes, data volumes can only be shared between containers within the same Pod, while Docker Swarm allows volumes to be shared with any other Docker container. Kubernetes provides in-built tools for logging and monitoring while Docker Swarm relies on external 3rd party tools. Kubernetes also provides a GUI dashboard for deployment of applications along with the command line interface. Kubernetes does process scheduling to maintain services while updating them, similar to Docker Swarm, but also provides rollback functionality in case of failure during an update.<br />
<br />
<b><span style="font-size: x-large;">Kubernetes Concepts</span></b><br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://2.bp.blogspot.com/-DEPqexnhv4A/XC5mDR-UvqI/AAAAAAAAX8I/A_xID27-D3wkiDBKQQz05AIP8C0EujSPgCLcBGAs/s1600/10561745-kubernetes-figure3.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="519" data-original-width="1081" src="https://2.bp.blogspot.com/-DEPqexnhv4A/XC5mDR-UvqI/AAAAAAAAX8I/A_xID27-D3wkiDBKQQz05AIP8C0EujSPgCLcBGAs/s1600/10561745-kubernetes-figure3.jpg" /></a></div>
<br />
<b>CLUSTER</b>: A cluster is a set of nodes with at least one master node and several worker nodes.<br />
<br />
<b>NODE</b>: <a href="https://kubernetes.io/docs/concepts/architecture/nodes/">Node</a> is a worker machine (or virtual machine) in the cluster.<br />
<b><br /></b><b>PODS</b>: <a href="https://kubernetes.io/docs/concepts/workloads/pods/pod/">Pods</a> are the basic scheduling unit, consisting of one or more containers guaranteed to be co-located on the host machine and able to share resources. This co-located group of containers within a pod shares an IP address, port space, namespaces, cgroups and storage volumes. The containers within a pod are always scheduled together, sharing the same context and lifecycle (start or stop). Each pod is assigned a unique IP address within the cluster, allowing the application to use ports without conflict. The desired state of the containers within a pod is described through a YAML or JSON object called a PodSpec. These objects are passed to the kubelet through the API server. Pods are mortal, i.e. they are not resurrected after their death. Also, similar to docker containers, they are ephemeral, i.e. when a pod dies another pod comes up on another host. The logical collection of containers within a Pod interact with each other for the execution of a service. A pod corresponds to a single instance of the service.<br />
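<br />
Below is a minimal PodSpec as a sketch of the YAML description mentioned above; the pod name, label and the nginx image are illustrative assumptions, not values from this post.<br />
<br />
<b><span style="color: #741b47;">apiVersion: v1</span></b><br />
<b><span style="color: #741b47;">kind: Pod</span></b><br />
<b><span style="color: #741b47;">metadata:</span></b><br />
<b><span style="color: #741b47;">  name: web-pod        # hypothetical pod name</span></b><br />
<b><span style="color: #741b47;">  labels:</span></b><br />
<b><span style="color: #741b47;">    app: web           # illustrative label</span></b><br />
<b><span style="color: #741b47;">spec:</span></b><br />
<b><span style="color: #741b47;">  containers:</span></b><br />
<b><span style="color: #741b47;">  - name: web</span></b><br />
<b><span style="color: #741b47;">    image: nginx:1.15  # any container image works here</span></b><br />
<b><span style="color: #741b47;">    ports:</span></b><br />
<b><span style="color: #741b47;">    - containerPort: 80</span></b><br />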
<br />
<b>SERVICE</b>: <a href="https://kubernetes.io/docs/concepts/services-networking/service/">Service</a> is an abstraction which defines a logical set of Pods and a policy by which to access them. It is a REST object, similar to a Pod, and requires a service definition to be POSTed to the API server in order to create a new instance. It provides access to dynamic pods using labels and load balances traffic across the nodes. The Label Selector determines the set of Pods targeted by a Service. Each Service is assigned a unique IP address, also known as the clusterIP, which is tied to the lifespan of the Service until its death. A Service provides a stable endpoint for clients to reference. Since Pods are created and destroyed dynamically, their IP addresses cannot be relied upon to be stable over time. Hence external clients rely on the service abstraction to provide reliable access by decoupling from the actual Pods. All communication to the service is automatically load balanced across the member pods of the service. Kubernetes offers native applications a simple Endpoints API that is updated whenever the set of Pods in a Service changes. For non-native applications, Kubernetes offers a virtual-IP-based bridge to Services which redirects to the backend Pods. There are <a href="https://kubernetes.io/docs/tutorials/kubernetes-basics/expose/expose-intro/">4 types</a> of service, as below (a sample service definition follows the list):<br />
<ol>
<li><b>ClusterIP</b>: Exposes the Service on an internal IP in the cluster. This type of service is only reachable from within the cluster. This is the default Type.</li>
<li><b>NodePort</b>: Exposes the Service on each Node’s IP at a static port. It uses the same port on each selected Node in the cluster, using NAT. It is accessible from outside the cluster using <NodeIP>:<NodePort>. It is a superset of ClusterIP.</li>
<li><b>LoadBalancer</b>: It creates an external load balancer in the cloud and assigns a fixed, external IP to the Service. It is a superset of NodePort service type.</li>
<li><b>ExternalName</b>: Exposes the Service using an arbitrary name by returning a CNAME record with the name. It does not use any proxy.</li>
</ol>
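<br />
As referenced above, a minimal default (ClusterIP) service definition might look like the following sketch; the service name, selector label and ports are illustrative assumptions.<br />
<br />
<b><span style="color: #741b47;">apiVersion: v1</span></b><br />
<b><span style="color: #741b47;">kind: Service</span></b><br />
<b><span style="color: #741b47;">metadata:</span></b><br />
<b><span style="color: #741b47;">  name: web-service    # hypothetical service name</span></b><br />
<b><span style="color: #741b47;">spec:</span></b><br />
<b><span style="color: #741b47;">  selector:</span></b><br />
<b><span style="color: #741b47;">    app: web           # targets pods carrying this label</span></b><br />
<b><span style="color: #741b47;">  ports:</span></b><br />
<b><span style="color: #741b47;">  - protocol: TCP</span></b><br />
<b><span style="color: #741b47;">    port: 80           # port exposed on the clusterIP</span></b><br />
<b><span style="color: #741b47;">    targetPort: 8080   # port the container listens on</span></b><br />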
<b><br /></b>
<b>NAMESPACE</b>: <a href="https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/">Namespaces</a> are a way to divide cluster resources between multiple users, where the users are spread across multiple teams. Names of resources need to be unique within a namespace, but not across namespaces. Namespaces provide logical separation between teams and their environments, acting as virtual clusters.<br />
<br />
<b>LABELS</b>: <a href="https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/">Labels</a> are key-value pairs which are used to identify objects such as pods and deployments (services) with specific attributes. They can be used to organize and to select subsets of objects. Each label key is unique for a given object. Labels can be attached to an object at creation time and can be added or modified at run time. Labels make it possible to distinguish resources within the same namespace.<br />
<br />
<b>LABEL SELECTORS</b>: Labels are not unique; multiple objects can carry the same label within the same namespace. The label selector is the core grouping primitive in Kubernetes which allows users to identify/select a set of objects. A selector can be made of multiple requirements which are comma-separated, acting as an AND operator. The Kubernetes API currently supports two types of selectors.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://3.bp.blogspot.com/-sNkCzAjnIXw/XDL1t1DyPqI/AAAAAAAAX84/u0rLz4Pbr1Y_pXPJc93_IJgYgdStOBrGQCLcBGAs/s1600/07751442-deployment.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" data-original-height="796" data-original-width="709" height="400" src="https://3.bp.blogspot.com/-sNkCzAjnIXw/XDL1t1DyPqI/AAAAAAAAX84/u0rLz4Pbr1Y_pXPJc93_IJgYgdStOBrGQCLcBGAs/s400/07751442-deployment.png" width="356" /></a></div>
<ul>
<li><b>Equality-based Selectors</b>: They allow filtering by label key and value. Matching objects must satisfy all the specified label constraints; three kinds of operators are allowed: =, == and !=.</li>
<li><b>Set-based Selectors</b>: They allow filtering of keys according to a set of values. Three kinds of operators are supported: in, notin and exists (on the key identifier only).</li>
</ul>
<br />
<b>REPLICA SET</b>: <a href="https://kubernetes.io/docs/concepts/workloads/controllers/replicaset/">ReplicaSet</a> manages the lifecycle of pods and ensures that the specified number of replicas is running. It typically creates and destroys Pods dynamically, especially while scaling out or in. The ReplicaSet is the next generation replication controller and also supports set-based selectors, whereas the replication controller only supports equality-based selectors. It is similar to services in docker swarm.<br />
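<br />
A ReplicaSet using both equality-based and set-based selector requirements can be sketched as below; all names, labels and the image are illustrative assumptions.<br />
<br />
<b><span style="color: #741b47;">apiVersion: apps/v1</span></b><br />
<b><span style="color: #741b47;">kind: ReplicaSet</span></b><br />
<b><span style="color: #741b47;">metadata:</span></b><br />
<b><span style="color: #741b47;">  name: web-rs                   # hypothetical name</span></b><br />
<b><span style="color: #741b47;">spec:</span></b><br />
<b><span style="color: #741b47;">  replicas: 3                    # desired number of pods</span></b><br />
<b><span style="color: #741b47;">  selector:</span></b><br />
<b><span style="color: #741b47;">    matchLabels:                 # equality-based requirement</span></b><br />
<b><span style="color: #741b47;">      app: web</span></b><br />
<b><span style="color: #741b47;">    matchExpressions:            # set-based requirement</span></b><br />
<b><span style="color: #741b47;">    - {key: tier, operator: In, values: [frontend]}</span></b><br />
<b><span style="color: #741b47;">  template:</span></b><br />
<b><span style="color: #741b47;">    metadata:</span></b><br />
<b><span style="color: #741b47;">      labels:</span></b><br />
<b><span style="color: #741b47;">        app: web</span></b><br />
<b><span style="color: #741b47;">        tier: frontend</span></b><br />
<b><span style="color: #741b47;">    spec:</span></b><br />
<b><span style="color: #741b47;">      containers:</span></b><br />
<b><span style="color: #741b47;">      - name: web</span></b><br />
<b><span style="color: #741b47;">        image: nginx:1.15        # illustrative image</span></b><br />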
<br />
<b>DEPLOYMENT</b>: <a href="https://kubernetes.io/docs/concepts/workloads/controllers/deployment/">Deployments</a> control the number of pod instances running using ReplicaSets and upgrade them in a controlled manner. They have the capability to update the replica set and to roll back to a previous version. The deployment controller allows updating, pausing or cancelling an ongoing deployment before completion, or rolling back the deployment entirely midway. The Deployment Controller drains and terminates a given number of replicas, creates replicas from the new deployment definition, and continues the process until all replicas in the deployment are updated. A Deployment is defined using a YAML file. Deployment can be done either using <b>Recreate</b>, i.e. killing all the existing pods and then bringing up new ones, or <b>Rolling Update</b>, i.e. gradually bringing down the old pods while bringing up the new ones. By default deployments are performed as a rolling update. A Deployment facilitates scaling (by updating the number of replicas), rolling updates, rollbacks, version updates (image updates), Pod health checks and healing.<br />
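<br />
A minimal deployment definition with an explicit rolling update strategy can be sketched as below; the names, replica count and image are illustrative assumptions.<br />
<br />
<b><span style="color: #741b47;">apiVersion: apps/v1</span></b><br />
<b><span style="color: #741b47;">kind: Deployment</span></b><br />
<b><span style="color: #741b47;">metadata:</span></b><br />
<b><span style="color: #741b47;">  name: web-deployment           # hypothetical name</span></b><br />
<b><span style="color: #741b47;">spec:</span></b><br />
<b><span style="color: #741b47;">  replicas: 3</span></b><br />
<b><span style="color: #741b47;">  strategy:</span></b><br />
<b><span style="color: #741b47;">    type: RollingUpdate          # the default strategy</span></b><br />
<b><span style="color: #741b47;">    rollingUpdate:</span></b><br />
<b><span style="color: #741b47;">      maxUnavailable: 1          # at most one pod down during the update</span></b><br />
<b><span style="color: #741b47;">      maxSurge: 1                # at most one extra pod during the update</span></b><br />
<b><span style="color: #741b47;">  selector:</span></b><br />
<b><span style="color: #741b47;">    matchLabels:</span></b><br />
<b><span style="color: #741b47;">      app: web</span></b><br />
<b><span style="color: #741b47;">  template:</span></b><br />
<b><span style="color: #741b47;">    metadata:</span></b><br />
<b><span style="color: #741b47;">      labels:</span></b><br />
<b><span style="color: #741b47;">        app: web</span></b><br />
<b><span style="color: #741b47;">    spec:</span></b><br />
<b><span style="color: #741b47;">      containers:</span></b><br />
<b><span style="color: #741b47;">      - name: web</span></b><br />
<b><span style="color: #741b47;">        image: nginx:1.15        # changing this tag triggers a rolling update</span></b><br />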
<br />
<b>VOLUME</b>: A <a href="https://kubernetes.io/docs/concepts/storage/volumes/">volume</a> is a directory which is accessible by the containers within a pod and provides persistent storage for the pods. The lifespan of a volume in Kubernetes is the same as that of the Pod enclosing it. A volume outlives any containers that run within the Pod, and data is preserved across container restarts. A pod can use any number of volumes simultaneously. The Pod specifies the volumes to be used via the <b>.spec.volumes</b> field and mounts them into its Containers. Every container in the Pod independently specifies the path on which to mount each volume. A few of the volume types include the <b>hostPath</b> volume which mounts a file or directory from the node’s filesystem into the Pod, the <b>emptyDir</b> volume which is created when a Pod is assigned to a node and exists as long as the Pod is running on that node, and the <b>secret</b> volume which is used to pass sensitive information to the pods. The <b>gcePersistentDisk</b> volume mounts a Google Compute Engine (GCE) Persistent Disk on the pod with pre-populated data and has its contents preserved even after the pod is removed from the node. A <b><a href="https://portworx.com/tutorial-kubernetes-persistent-volumes/">Kubernetes Persistent Volume</a></b> is used to retain data even after the pod dies, unlike regular volumes. Kubernetes persistent volumes are administrator provisioned volumes which are created with a particular filesystem, size, and identifying characteristics such as volume IDs and names. Using a kubernetes persistent volume involves first provisioning the network storage, then requesting storage through a persistent volume claim, and finally consuming the claimed persistent volume. The claim for the persistent volume is referenced in the spec of a pod in order for the containers in the pod to use the volume.<br />
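<br />
The claim-and-consume flow for persistent storage can be sketched as below; the claim name, requested size, image and mount path are illustrative assumptions.<br />
<br />
<b><span style="color: #741b47;">apiVersion: v1</span></b><br />
<b><span style="color: #741b47;">kind: PersistentVolumeClaim</span></b><br />
<b><span style="color: #741b47;">metadata:</span></b><br />
<b><span style="color: #741b47;">  name: data-claim               # hypothetical claim name</span></b><br />
<b><span style="color: #741b47;">spec:</span></b><br />
<b><span style="color: #741b47;">  accessModes: [ReadWriteOnce]</span></b><br />
<b><span style="color: #741b47;">  resources:</span></b><br />
<b><span style="color: #741b47;">    requests:</span></b><br />
<b><span style="color: #741b47;">      storage: 1Gi               # requested size</span></b><br />
<b><span style="color: #741b47;">---</span></b><br />
<b><span style="color: #741b47;">apiVersion: v1</span></b><br />
<b><span style="color: #741b47;">kind: Pod</span></b><br />
<b><span style="color: #741b47;">metadata:</span></b><br />
<b><span style="color: #741b47;">  name: data-pod</span></b><br />
<b><span style="color: #741b47;">spec:</span></b><br />
<b><span style="color: #741b47;">  containers:</span></b><br />
<b><span style="color: #741b47;">  - name: app</span></b><br />
<b><span style="color: #741b47;">    image: nginx:1.15            # illustrative image</span></b><br />
<b><span style="color: #741b47;">    volumeMounts:</span></b><br />
<b><span style="color: #741b47;">    - name: data</span></b><br />
<b><span style="color: #741b47;">      mountPath: /usr/share/nginx/html   # path inside the container</span></b><br />
<b><span style="color: #741b47;">  volumes:</span></b><br />
<b><span style="color: #741b47;">  - name: data</span></b><br />
<b><span style="color: #741b47;">    persistentVolumeClaim:</span></b><br />
<b><span style="color: #741b47;">      claimName: data-claim      # binds the pod to the claim above</span></b><br />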
<br />
<b>MASTER</b>: The API Server, etcd, Controller Manager and Scheduler processes together make up the central control plane of the cluster, which runs on the Master node. It provides a unified view of the cluster. By default, the master node does not run any application containers.<br />
<div>
<br /></div>
<div>
<div>
<b>WORKER</b>: It is a Docker host running the kubelet (node agent) and proxy services. It runs pods and containers.</div>
</div>
<div>
<br /></div>
<div>
<br />
<b><span style="font-size: x-large;">Kubernetes Architecture</span></b></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://2.bp.blogspot.com/-_rXV-p2fdiw/XC711FmCtwI/AAAAAAAAX8U/ctgnv_Nt0Pkn4Z1NV-mPtjMHjPOJ7oITwCLcBGAs/s1600/kuberetes.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="594" data-original-width="1081" src="https://2.bp.blogspot.com/-_rXV-p2fdiw/XC711FmCtwI/AAAAAAAAX8U/ctgnv_Nt0Pkn4Z1NV-mPtjMHjPOJ7oITwCLcBGAs/s1600/kuberetes.png" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<span style="font-size: large;"><b>Master Components</b></span><br />
<br />
Master components provide the cluster’s control plane and are responsible for global activities of the cluster such as scheduling, and detecting and responding to cluster events.<br />
<br />
<b>API Server:</b> The API server exposes the Kubernetes API and is the entry point for all the REST commands used to control the cluster. It processes REST requests, validates them and executes the bound business logic. It relies on etcd for storage of the resulting state. Kubectl (the Kubernetes CLI) makes requests to the Kubernetes API server.<br />
<br />
<b>etcd Storage:</b> etcd is a simple, distributed, consistent key-value store for all cluster data. It’s mainly used for shared configuration and service discovery. It provides a REST API for CRUD operations as well as an interface to register watchers on specific nodes, which enables a reliable way to notify the rest of the cluster about configuration changes. Examples of data stored by Kubernetes in etcd are the jobs being scheduled, created and deployed, pod/service details and state, namespaces and replication information, etc.<br />
<br />
<b>Scheduler:</b> The Scheduler is responsible for the deployment of configured pods and services onto the nodes. It selects a node for the execution of newly created pods which have no assigned node. It has information about the resources available on the members of the cluster, i.e. the nodes, as well as the resources required for the configured service to run, and hence is able to decide where to deploy a specific service. Factors taken into account for scheduling decisions include individual and collective resource requirements, hardware/software/policy constraints, affinity and anti-affinity specifications, data locality, inter-workload interference and deadlines.<br />
<br />
<b>Controller Manager</b>: The Controller manager is a daemon which runs the different embedded controllers inside the master node. Each controller is logically a separate process, even though all controllers run as part of a single process. It uses the API server to watch the shared state of the cluster and makes corrective changes to move the current state towards the desired one. Below are some of the controllers executed by the controller manager:<br />
<ul>
<li><b>Node Controller</b>: Responsible for noticing and responding when nodes go down.</li>
<li><b>Replication Controller</b>: It ensures that the correct number of pods is always running based on the configured replication factor, recreating any failed pods and removing extra-scheduled pods.</li>
<li><b>Endpoints Controller</b>: It populates the Endpoints object. For headless services (no load balancer) with selectors, it creates Endpoints records in the API and modifies the DNS configuration to return addresses that point directly to the pods backing the service.</li>
<li><b>Token Controller</b>: Creates default accounts and API access tokens for new namespaces. It manages the creation and deletion of ServiceAccounts and their associated ServiceAccountToken Secrets that allow API access.</li>
</ul>
<br />
<b><span style="font-size: large;">Node Components</span></b><br />
<br />
Node components run on every node, maintain running pods and provide them the Kubernetes runtime environment.<br />
<br />
<b>Kubelet</b>: Each worker node runs an agent process called the kubelet which is responsible for managing the state of the node, i.e. starting, stopping, and maintaining application containers based on instructions from the API server. It gets the configuration of a pod from the API server and ensures that the described containers are up and running. It is the worker service which communicates with the API server to get information about services and to write the details about newly created ones. All containers on the host node are run through the kubelet. It also monitors the health of the respective worker node and reports it to the API server.<br />
<br />
<b>Kube Proxy</b>: It is a network proxy and a load balancer for a service on a single worker node. It handles the requests from the internet by enabling the network routing for TCP and UDP packets. kube-proxy enables the Kubernetes service abstraction by maintaining network rules on the host and performing connection forwarding.<br />
<br />
<a href="https://4.bp.blogspot.com/-t8aMFArtDQc/XD6PYIYHD2I/AAAAAAAAX9Y/Ls5uWbxIDnEp-49Jxfpq86KlHbXqJQZ-QCLcBGAs/s1600/kubernetes-heapster.png" imageanchor="1" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" data-original-height="1236" data-original-width="1289" height="382" src="https://4.bp.blogspot.com/-t8aMFArtDQc/XD6PYIYHD2I/AAAAAAAAX9Y/Ls5uWbxIDnEp-49Jxfpq86KlHbXqJQZ-QCLcBGAs/s400/kubernetes-heapster.png" width="400" /></a><b>Kubectl</b>: It is default command line tool to communicate with the API service and send commands to the master node. It also allows to enable the kube dashboard which allows to deploy and run containers using GUI.<br />
<br />
<br />
<span style="font-size: large;"><b>Monitoring Kubernetes</b></span><br />
<br />
Kubernetes has built-in <a href="https://sysdig.com/blog/monitoring-kubernetes-with-sysdig-cloud/">monitoring tools</a> to <a href="https://kubernetes.io/docs/tasks/debug-application-cluster/resource-usage-monitoring/#resource-metrics-pipeline">check the health</a> of individual nodes using cAdvisor and Heapster. <a href="https://github.com/google/cadvisor"><b>cAdvisor</b></a> is an open source container resource usage collector. It operates at the node level in kubernetes. It auto-discovers all containers on the given node and collects CPU, memory, filesystem, and network usage statistics. cAdvisor also provides the overall machine usage by analyzing the ‘root’ container on the machine. cAdvisor provides the basic resource utilization for the node but is unable to determine the resource utilization of individual applications running within the containers. <a href="https://github.com/kubernetes/heapster">Heapster</a> is another tool which aggregates monitoring data from the cAdvisors across all nodes in the Kubernetes cluster. It runs as a pod in the cluster similar to any application. The Heapster pod discovers all the nodes in the cluster and then pulls metrics by querying usage information from the kubelet of each node (which in turn queries cAdvisor), aggregates them by pod and label, and reports the metrics to a monitoring service or storage backend. Heapster makes it easy to collect data from the kubernetes cluster using cAdvisor but does not provide built-in storage. It can integrate with InfluxDB or Google Cloud for storage and use UI tools such as <a href="https://grafana.com/">Grafana</a> for data visualization.<br />
<br />
<span style="font-size: large;"><b>Kubernetes Networking Internals</b></span><br />
<br />
A kubernetes pod consists of one or more containers that are collocated on the same host and are configured to share a network namespace, where all containers in the pod can reach each other using localhost. A typical docker container has a virtual network interface veth0 which is attached to the bridge docker0 using a pair of linked virtual ethernet devices. The bridge docker0 is in turn attached to a physical network interface eth0 which is in the root network namespace. The docker0 bridge and eth0 live in the root network namespace, while veth0 lives in the container's network namespace. When a second container starts, docker can share the existing interface veth0 (by joining the first container's network namespace) instead of creating a new virtual network interface. Both containers are then reachable at the veth0 IP address (172.17.0.2) and both can hit ports opened by the other on localhost. Kubernetes implements this by creating a special container (started with the pause command) for each pod, whose main purpose is to provide the virtual network interface that all other containers in the pod use to communicate with each other and the outside world. The pause command suspends the current process until a signal is received, so that these containers do nothing at all except sleep until kubernetes sends them SIGTERM. The local routing rules set up during bridge creation allow any packet arriving at eth0 with a destination address of veth0 (172.17.0.2) to be forwarded to the bridge, which will then send it on to veth0. Although this works well for a single host, it causes issues when multiple hosts in the cluster each assign their own private address space to their bridge without any idea about the address space assigned to other hosts, thus causing potential conflicts.<br />
<br />
<b><a href="https://medium.com/google-cloud/understanding-kubernetes-networking-pods-7117dd28727">Pods</a></b> in kubernetes are able to communicate with other pods whether they are running on the same local host or different hosts or nodes. A kubernetes cluster consists or one or more nodes typically connected with router gateway on a cloud platforms such as GCP or AWS. Kubernetes assigns an overall address space for the bridges on each node and then assigns the bridges addresses within that space, based on the node the bridge is built on. It also adds routing rules to the gateway router telling it how the packets destined for each bridge should be routed, i.e. which node's eth0 the bridge can be reached through. This combination of virtual network interfaces, bridges and routing rules is called an overlay network.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://3.bp.blogspot.com/-BsMYasRouTg/XNEKhhngckI/AAAAAAAAYKQ/RA09kGJCSMo1Ml0G2MDDPuXIjjucvYY1gCLcBGAs/s1600/pod_networking.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1279" data-original-width="1600" height="511" src="https://3.bp.blogspot.com/-BsMYasRouTg/XNEKhhngckI/AAAAAAAAYKQ/RA09kGJCSMo1Ml0G2MDDPuXIjjucvYY1gCLcBGAs/s640/pod_networking.png" width="640" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
Pods in kubernetes are ephemeral, hence there is no guarantee that the IP address of a pod won't change when the pod is recreated. The general solution to such a problem is to run the traffic through a reverse proxy/load balancer which is highly durable, failure resistant and maintains the list of healthy servers to forward requests to. A Kubernetes service enables load balancing across a set of server pods and allows client pods to operate independently and durably. A service causes a proxy to be configured to forward requests to a set of pods, usually determined by a selector which matches labels assigned to the pods. Kubernetes provides an internal cluster DNS which resolves the service name to the corresponding service IP. The service is assigned an IP address on the service network, which is different from the pod network. The service network address range, similar to the pod network, is not exposed via kubectl and requires provider specific commands to retrieve the cluster properties. Both the service and pod networks are virtual networks.<br />
<br />
Every <b><a href="https://medium.com/google-cloud/understanding-kubernetes-networking-services-f0cb48e4cc82">ClusterIP</a></b> service is assigned an IP address on the service network which is reachable from any pod within the cluster. The service network does not have any routes, connected bridges or interfaces on the hosts of the nodes making up the cluster. Typically IP networks are configured with routes such that when an interface cannot deliver a packet to its destination, because no device with the specified address exists locally, it forwards the packet on to its upstream gateway. When the virtual ethernet interface sees packets addressed to a service IP address, it forwards the packets to the bridge cbr0, as it cannot find any device with the service IP on its pod network. The bridge, being dumb, passes the traffic to the host/node ethernet interface. In theory, if the host ethernet interface also cannot find any device with the service IP address, it forwards the packet to this interface's gateway, the top level router. But kube-proxy redirects the packets mid-flight to the addressed server pod.<br />
<br />
Proxies usually run in user space, where packets are marshaled into user space and back to kernel space on every trip through the proxy, which can be expensive. Since both pods and nodes are ephemeral entities in the cluster, kubernetes uses a virtual network for service addressing to provide a stable and non-conflicting network address space. The virtual service network has no actual devices, i.e. no ports to listen on or interfaces to open a connection; Kubernetes uses the netfilter feature of the linux kernel and a user space interface called iptables to route the packets. Netfilter is a rules-based packet processing engine which runs in kernel space and is able to look at every packet at various points in its life cycle. It matches the packets against the rules and takes a specific action, such as redirecting the packet to another destination, when the corresponding rule matches. The kube-proxy opens a port and inserts the correct netfilter rules for the service in response to notifications from the master API server about changes in the cluster, which include changes to services and endpoints. The kube-proxy can run in iptables mode, in which it mostly ceases to be a proxy for inter-cluster connections and instead delegates to netfilter the work of detecting packets bound for service IPs in kernel space and redirecting them to pods. Kube-proxy's main job is then to keep the netfilter rules in sync, using iptables, based on the updates received from the master API server. Kube-proxy is very reliable and by default runs as a systemd unit where it restarts on failure, whereas in Google Container Engine it runs in a pod controlled by a daemonset. Health checks against the endpoints are performed by the kubelet running on every node. The kubelet notifies the kube-proxy via the API server when unhealthy endpoints are found, and the kube-proxy then removes the endpoint from the netfilter rules until the endpoint becomes healthy again. This works well for requests that originate inside the cluster from one pod to another, but for requests from outside the cluster the netfilter rules obfuscate the origin IP.<br />
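<br />
The netfilter rules that kube-proxy maintains in iptables mode can be inspected directly from a node, as a rough sanity check of the behaviour described above; the service name below is just a placeholder.<br />
<br />
List the NAT rules programmed for services and grep for a particular service<br />
<br />
$ <b><span style="color: blue;">sudo iptables-save -t nat | grep <service-name></span></b><br />
<br />
Check which pod endpoints currently back the service<br />
<br />
$ <b><span style="color: blue;">kubectl get endpoints <service-name></span></b><br />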
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://3.bp.blogspot.com/-cVMcxLab5Hs/XNUcDY9iFFI/AAAAAAAAYK8/Ym3ao8ANna0DHoaUaZoGDihwVUy-bBCjACLcBGAs/s1600/2221_UetnYP8uE05GAqQD0tbtBQ.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="625" data-original-width="800" src="https://3.bp.blogspot.com/-cVMcxLab5Hs/XNUcDY9iFFI/AAAAAAAAYK8/Ym3ao8ANna0DHoaUaZoGDihwVUy-bBCjACLcBGAs/s1600/2221_UetnYP8uE05GAqQD0tbtBQ.png" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://1.bp.blogspot.com/--boJYbM-XX8/XNdI8PWN55I/AAAAAAAAYLw/7otTQJr5Ud027KeCSUWnJA_WWkkFnrnmgCLcBGAs/s1600/1_4XyIJ5Tdvs8f6zhLwdVt3Q.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="597" data-original-width="800" src="https://1.bp.blogspot.com/--boJYbM-XX8/XNdI8PWN55I/AAAAAAAAYLw/7otTQJr5Ud027KeCSUWnJA_WWkkFnrnmgCLcBGAs/s1600/1_4XyIJ5Tdvs8f6zhLwdVt3Q.png" /></a></div>
<br />
<br />
Connections and requests operate at OSI layer 4 (tcp) or layer 7 (http, rpc, etc). Netfilter routing rules operate on IP packets at layer 3. All routers, including netfilter, make routing decisions based solely on information contained in the packet, generally where it is from and where it is going. Each packet that arrives at a node's eth0 interface and is destined for the cluster IP address of a service is processed by netfilter, which matches the rules established for the service and forwards the packet to the IP address of a healthy pod. <b><span style="color: red;">The cluster IP of a service is only reachable from a node's ethernet interface</span></b>. The netfilter rules for the service are not scoped to a particular origin network, i.e. any packet from anywhere that arrives on the node's ethernet interface with a destination of the service's cluster IP will match and get routed to a pod. Hence clients can essentially call the cluster IP: the packets follow a route down to a node and get forwarded to a pod.<br />
<br />
The problem with this approach is that nodes are ephemeral to some extent, similar to pods; e.g. nodes can be migrated to a new VM or clusters can be scaled up and down. Since the routers operate on layer 3 packets, they are unable to distinguish healthy services from unhealthy ones. They expect the next hop in the route to be stable and available. If the node becomes unreachable, the route will break and stay broken for a significant time in most cases. Even if the route were durable, having all external traffic pass through a single node is not optimal. Kubernetes ingress uses load balancers to distribute client traffic across the nodes within the cluster to solve this problem. Instead of using the static addresses of the nodes, the addresses of the ethernet interfaces connected to the nodes are used by the gateway router to route the packets sent from the load balancer. With this approach, when the client tries to connect to the service using a particular port, e.g. 80, it fails as there is no process listening on the service IP address on the specified port. The node's ethernet interface cannot be connected to on the specified port, and the netfilter rules which intercept requests and redirect them to a pod don't match the destination address, which is the cluster IP address on the service network. The service network that netfilter is set up to forward packets for is not easily routable from the gateway to the nodes, and the pod network that is easily routable is not the one netfilter is forwarding for.<br />
<br />
<b><a href="https://medium.com/google-cloud/understanding-kubernetes-networking-ingress-1bc341c84078">NodePort</a></b> services creates a bridge between the pod and service network. NodePort service is similar to clusterIP service with an additional capability to reach the IP address of the node as well as the assigned cluster IP on the services network. When kubernetes creates a NodePort service, kube-proxy allocates a port in the range 30000–32767 and opens this port on the eth0 interface of every node. Connections to this port are forwarded to the service's cluster IP. Since NodePorts exposes the service to clients on a non-standard port, a LoadBalancer is usually configured in front of the cluster which exposes the usual port, masking the NodePort from end users. <b>NodePorts are the fundamental mechanism by which all external traffic gets into a kubernetes cluster</b>.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://4.bp.blogspot.com/-8xQdI7nXAYY/XNc0SScccDI/AAAAAAAAYLY/fGHgkKgPnyEtxJzo4fUmpCScm9wDEPQcQCLcBGAs/s1600/8888888888881_Uo_wGCIlFopJZbf6THu0OQ.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="625" data-original-width="800" src="https://4.bp.blogspot.com/-8xQdI7nXAYY/XNc0SScccDI/AAAAAAAAYLY/fGHgkKgPnyEtxJzo4fUmpCScm9wDEPQcQCLcBGAs/s1600/8888888888881_Uo_wGCIlFopJZbf6THu0OQ.png" /></a></div>
<br />
<b>LoadBalancer </b>service type has all the capabilities of a NodePort service plus the ability to build out a complete ingress path, but only when running in an environment like GCP or AWS that supports API-driven configuration of networking resources. An external IP is allocated for the LoadBalancer service type, thus extending a single service to support external clients. The load balancer has a few limitations: it cannot be configured to terminate https traffic, and it cannot do virtual hosts or path-based routing, hence a single load balancer cannot be used to proxy to multiple services. To overcome these limitations a new Ingress resource for configuring load balancers was added in version 1.2.<br />
<br />
<b>Ingress </b>is a separate resource that configures a load balancer much more flexibly. The Ingress API supports TLS termination, virtual hosts, and path-based routing. It can easily set up a load balancer to handle multiple backend services. The ingress controller is responsible for satisfying the configured requests by driving resources in the environment to the necessary state. When services of type NodePort are exposed using an Ingress, the Ingress controller manages the traffic to the nodes. There are ingress controller implementations for GCE load balancers, AWS elastic load balancers, and for popular proxies such as nginx and haproxy. Mixing Ingress resources with LoadBalancer services can cause subtle issues in some environments.<br />
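<br />
A minimal Ingress doing host and path based routing can be sketched as below, using the extensions/v1beta1 API of that era; the host, paths and backend service names are illustrative assumptions.<br />
<br />
<b><span style="color: #741b47;">apiVersion: extensions/v1beta1</span></b><br />
<b><span style="color: #741b47;">kind: Ingress</span></b><br />
<b><span style="color: #741b47;">metadata:</span></b><br />
<b><span style="color: #741b47;">  name: web-ingress              # hypothetical name</span></b><br />
<b><span style="color: #741b47;">spec:</span></b><br />
<b><span style="color: #741b47;">  rules:</span></b><br />
<b><span style="color: #741b47;">  - host: example.com            # virtual host (assumption)</span></b><br />
<b><span style="color: #741b47;">    http:</span></b><br />
<b><span style="color: #741b47;">      paths:</span></b><br />
<b><span style="color: #741b47;">      - path: /api               # path-based routing to a backend service</span></b><br />
<b><span style="color: #741b47;">        backend:</span></b><br />
<b><span style="color: #741b47;">          serviceName: api-service</span></b><br />
<b><span style="color: #741b47;">          servicePort: 80</span></b><br />
<b><span style="color: #741b47;">      - path: /</span></b><br />
<b><span style="color: #741b47;">        backend:</span></b><br />
<b><span style="color: #741b47;">          serviceName: web-service</span></b><br />
<b><span style="color: #741b47;">          servicePort: 80</span></b><br />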
<div>
<br /></div>
<br />
<span style="font-size: large;"><b>Helm Package Manager</b></span><br />
<br />
In order to manage and deploy complex kubernetes applications, there are third party tools available such as Helm. <a href="https://helm.sh/">Helm</a> is a Kubernetes package manager. It makes it easy to package, configure, and deploy applications and services onto Kubernetes clusters. It allows easily creating multiple interdependent kubernetes resources such as pods, services, deployments, and replicasets by generating YAML manifest files using its packaging format called charts. Charts consist of a few YAML configuration files and some templates that are rendered into Kubernetes manifest files. The helm command line tool sends commands to a companion server component called tiller. Tiller runs on the Kubernetes cluster and performs configuration and deployment of software releases on the cluster based on the helm commands.<br />
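<br />
As a sketch of the typical Helm v2 (tiller-based) workflow described above; the chart and release names are illustrative examples from the public stable repository, not values from this post.<br />
<br />
Install tiller into the cluster and verify the client/server versions<br />
<br />
$ <b><span style="color: blue;">helm init</span></b><br />
$ <b><span style="color: blue;">helm version</span></b><br />
<br />
Search for a chart and install it as a named release<br />
<br />
$ <b><span style="color: blue;">helm search mysql</span></b><br />
$ <b><span style="color: blue;">helm install stable/mysql --name my-db</span></b><br />
<br />
List the deployed releases and delete one when no longer needed<br />
<br />
$ <b><span style="color: blue;">helm list</span></b><br />
$ <b><span style="color: blue;">helm delete my-db</span></b><br />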
<br />
<br />
<div>
<span style="font-size: x-large;"><b>Installation on Ubuntu</b></span><br />
<br />
Install apt-transport-https<br />
<br />
$ <b><span style="color: blue;">sudo apt-get update && sudo apt-get install -y apt-transport-https</span></b><br />
<br />
Add docker signing key and repository URL<br />
<br />
$ <b><span style="color: blue;">curl -s https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -</span></b><br />
$ <b><span style="color: blue;">sudo </span><span style="color: blue;">add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu bionic stable"</span></b><br />
<br />
Install docker on every node<br />
<br />
$ <b><span style="color: blue;">sudo apt update && sudo apt install -qy docker-ce</span></b><br />
<br />
Start and enable the Docker service<br />
<br />
$ <b><span style="color: blue;">sudo systemctl start docker</span></b><br />
$ <b><span style="color: blue;">sudo systemctl enable docker</span></b><br />
<br />
<a href="https://mherman.org/blog/setting-up-a-kubernetes-cluster-on-ubuntu/?_sm_au_=iVVtj4FJrM8sn043">Install Kubernetes</a> involves installing <b>kubeadm</b> which bootstraps a Kubernetes cluster, <b>kubelet</b> which configures containers to run on a host and <b>kubectl</b> which deploys and manages apps on Kubernetes.<br />
<br />
Add the Kubernetes in signing key<br />
<br />
$ <b><span style="color: blue;">sudo curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add </span></b><br />
<br />
Create the file /etc/apt/sources.list.d/kubernetes.list and add kubernetes repository URL as below.<br />
<br />
$ <b><span style="color: blue;">sudo touch /etc/apt/sources.list.d/kubernetes.list</span></b><br />
$ <b><span style="color: blue;">sudo vi /etc/apt/sources.list.d/kubernetes.list</span></b><br />
<b><span style="color: blue;"><br /></span></b>
Add<span style="color: blue; font-weight: bold;"> "deb http://apt.kubernetes.io/ kubernetes-xenial main"</span> to the file and save with ":wq".<br />
<br />
$ <b><span style="color: blue;">sudo apt-get update</span></b><br />
$ <b><span style="color: blue;">sudo apt-get install -y kubelet kubeadm kubectl kubernetes-cni</span></b><br />
<br />
<b><a href="https://kubernetes.io/docs/setup/cri/#cgroup-drivers">Cgroup drivers</a></b>: When systemd is chosen as the init system for a Linux distribution, the init process generates and consumes a root control group (cgroup) and acts as a cgroup manager. Systemd has a tight integration with cgroups and will allocate cgroups per process. Using cgroupfs which also can configure container/kubelets, alongside systemd means that there will then be two different cgroup managers. Control groups are used to constrain resources that are allocated to processes. A single cgroup manager will simplify the view of what resources are being allocated and will by default have a more consistent view of the available and in-use resources. When we have two managers we end up with two views of those resources causing unstability under resource pressure. Hence cgroup driver is configured to <b>systemd </b>which is recommended driver for Docker cgroup driver.<br />
<br />
Setup daemon for docker.<br />
<br />
$ <span style="color: blue; font-weight: bold;">cat > /etc/docker/daemon.json </span><b><span style="color: #741b47;"><<EOF</span></b><br />
<b><span style="color: #741b47;">{</span></b><br />
<b><span style="color: #741b47;"> "exec-opts": ["native.cgroupdriver=systemd"],</span></b><br />
<b><span style="color: #741b47;"> "log-driver": "json-file",</span></b><br />
<b><span style="color: #741b47;"> "log-opts": {</span></b><br />
<b><span style="color: #741b47;"> "max-size": "100m"</span></b><br />
<b><span style="color: #741b47;"> },</span></b><br />
<b><span style="color: #741b47;"> "storage-driver": "overlay2"</span></b><br />
<b><span style="color: #741b47;">}</span></b><br />
<b><span style="color: #741b47;">EOF</span></b><br />
<br />
$ <b><span style="color: blue;">sudo mkdir -p /etc/systemd/system/docker.service.d</span></b><br />
<br />
Restart docker.<br />
<br />
$ <b><span style="color: blue;">sudo systemctl daemon-reload</span></b><br />
$ <b><span style="color: blue;">sudo systemctl restart docker</span></b><br />
<div>
<br /></div>
<br />
<span style="font-size: x-large;"><b>Kubernetes Master/Worker Node Setup</b></span><br />
<br />
<a href="https://kubernetes.io/docs/setup/independent/create-cluster-kubeadm/#initializing-your-master">Initialize Master Node</a>. The <b>--pod-network-cidr</b> add-on option allows to specify the Container Network Interface (CNI) also called as Pod Network. There are various <a href="https://kubernetes.io/docs/setup/independent/create-cluster-kubeadm/#pod-network">third party</a> pod network interfaces available which can be selected using --pod-network-cidr option. For example to start a <a href="https://docs.projectcalico.org/v3.4/getting-started/kubernetes/installation/">Calico CNI</a> we specify 192.168.0.0/16 and to start a <a href="https://coreos.com/flannel/docs/latest/kubernetes.html">Flannel CNI</a> we use 10.244.0.0/16. It is recommended that the master host have at least 2 core CPUs and 4GB of RAM. If set, the control plane will automatically allocate CIDRs (Classless Inter-Domain Routing or Subnet) for every node. <b><span style="color: red;">A pod network add-on must be installed so that the pods can communicate with each other.</span></b><br />
<br />
Kubeadm uses the network interface associated with the default gateway to advertise master node's IP address which it would be listening on. The <b>--apiserver-advertise-address</b> option allows to select a different network interface on master node machine. Specify '0.0.0.0' to use the address of the default network interface.<br />
<br />
$ <span style="color: blue;"><b>sudo kubeadm init --pod-network-cidr=192.168.1.0/16 --apiserver-advertise-address=<master-ip-address></b></span><br />
<br />
The master node can also be initialized using the default options, in which case pods remain isolated until a pod network add-on is installed.<br />
<br />
$ <b><span style="color: blue;">sudo kubeadm init</span></b><br />
<br />
Issue following commands as regular user before joining a node<br />
<br />
$ <b><span style="color: blue;">mkdir -p $HOME/.kube</span></b><br />
$ <span style="color: blue;"><b>sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config</b></span><br />
$ <span style="color: blue;"><b>sudo chown $(id -u):$(id -g) $HOME/.kube/config</b></span><br />
<div style="margin: 0px; text-indent: 0px;">
<div style="color: black; font-family: "times new roman"; font-size: medium; font-style: normal; font-weight: 400; letter-spacing: normal; text-transform: none; white-space: normal; word-spacing: 0px;">
<br /></div>
<div>
<div style="color: black; font-family: "times new roman"; font-size: medium; font-style: normal; font-weight: 400; letter-spacing: normal; text-transform: none; white-space: normal; word-spacing: 0px;">
Kubeadm sets up a secure cluster by default and enforces use of <a href="https://kubernetes.io/docs/reference/access-authn-authz/rbac/">RBAC</a>.</div>
<div style="color: black; font-family: "times new roman"; font-size: medium; font-style: normal; font-weight: 400; letter-spacing: normal; text-transform: none; white-space: normal; word-spacing: 0px;">
<br /></div>
<div style="color: black; font-family: "times new roman"; font-size: medium; font-style: normal; font-weight: 400; letter-spacing: normal; text-transform: none; white-space: normal; word-spacing: 0px;">
The <b>kubectl apply</b> command is part of <a href="https://kubernetes.io/docs/tutorials/object-management-kubectl/declarative-object-management-configuration/">Declarative Management</a>, where changes applied directly to a live object are retained even if they are not merged back into the configuration files. kubectl automatically detects the create, update, and delete operations for every object. The below command installs a pod network add-on. Only one pod network can be installed per cluster.</div>
<div style="color: black; font-family: "times new roman"; font-size: medium; font-style: normal; font-weight: 400; letter-spacing: normal; text-transform: none; white-space: normal; word-spacing: 0px;">
<br /></div>
$ <b><span style="color: blue;">kubectl apply -f <add-on.yaml></span></b></div>
</div>
</div>
<br />
<div style="color: black; font-family: "times new roman"; font-size: medium; font-style: normal; font-weight: 400; letter-spacing: normal; text-transform: none; white-space: normal; word-spacing: 0px;">
The below command installs Calico (the Calico Pod Network) using the specified <a href="https://docs.projectcalico.org/v3.6/getting-started/kubernetes/installation/calico">release 3.6 calico.yaml</a> file.</div>
<div style="color: black; font-family: "times new roman"; font-size: medium; font-style: normal; font-weight: 400; letter-spacing: normal; text-transform: none; white-space: normal; word-spacing: 0px;">
<br /></div>
<div>
<span style="font-family: "times new roman";">$ </span><b><span style="color: blue;">kubectl apply -f https://docs.projectcalico.org/v3.6/getting-started/kubernetes/installation/hosted/kubernetes-datastore/calico-networking/1.7/calico.yaml</span></b></div>
<div>
<div style="color: black; font-family: "times new roman"; font-size: medium; font-style: normal; font-weight: 400; letter-spacing: normal; text-transform: none; white-space: normal; word-spacing: 0px;">
<span style="color: blue;"><b><br /></b></span></div>
<div style="font-family: "times new roman"; font-size: medium; font-style: normal; letter-spacing: normal; text-transform: none; white-space: normal; word-spacing: 0px;">
We can also download the calico.yaml file and then pass it to kubectl apply command.</div>
<div style="font-family: "times new roman"; font-size: medium; font-style: normal; letter-spacing: normal; text-transform: none; white-space: normal; word-spacing: 0px;">
<br /></div>
<div style="font-family: "times new roman"; font-size: medium; font-style: normal; letter-spacing: normal; text-transform: none; white-space: normal; word-spacing: 0px;">
</div>
$ <b><span style="color: blue;">wget "https://docs.projectcalico.org/v3.6/getting-started/kubernetes/installation/hosted/kubernetes-datastore/calico-networking/1.7/calico.yaml" --no-check-certificate</span></b><br />
<div style="font-family: "times new roman"; font-size: medium; font-style: normal; letter-spacing: normal; text-transform: none; white-space: normal; word-spacing: 0px;">
<br /></div>
<div style="font-family: "times new roman"; font-size: medium; font-style: normal; letter-spacing: normal; text-transform: none; white-space: normal; word-spacing: 0px;">
$ <b><span style="color: blue;">kubectl apply -f </span></b><b><span style="color: blue;">calico.yaml</span></b></div>
<div style="color: black; font-family: "times new roman"; font-size: medium; font-style: normal; font-weight: 400; letter-spacing: normal; text-transform: none; white-space: normal; word-spacing: 0px;">
<span style="color: blue;"><b><br /></b></span></div>
</div>
The <b>kubeadm join</b> command is run on the worker nodes to allow them to join the cluster, using the <worker-token> returned by the <b>kubeadm init</b> command on the master node. It is recommended that each worker host have at least 1 CPU core and 4GB of RAM.<br />
<br />
$ <span style="color: blue;"><b>sudo kubeadm join <master-ip-address>:6443 --token <worker-token> --discovery-token-ca-cert-hash sha256:<worker-token-hash></b></span><br />
<br />
To generate the worker token again, run the below <a href="https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands">command</a> with the --print-join-command option on the master node. In case the kubeadm join command fails with "<b>couldn't validate the identity of the API Server</b>", use the below command to regenerate the token for the join command.<br />
<br />
$ <b><span style="color: blue;">sudo kubeadm token create --print-join-command</span></b><br />
<br />
<br />
<b style="font-size: x-large;">Kubernetes Dashboard Setup</b><br />
<br />
<b><span style="color: red;">Setup the dashboard on master node before any worker nodes join the master to avoid issues.</span></b><br />
<b><span style="color: red;"></span></b><br />
<b><span style="color: red;">All the domains accessing </span><a href="https://github.com/kubernetes/dashboard/wiki/Accessing-Dashboard---1.7.X-and-above"><span style="color: #990000;">Kubernetes Dashboard (1.7.x)</span></a><span style="color: red;"> over HTTP it will not be able to sign in. Nothing will happen after clicking Sign in button on login page.</span></b><br />
<br />
Use the below command to <a href="https://github.com/kubernetes/dashboard/wiki/Installation">create</a> the kubernetes dashboard.<br />
<br />
$ <b><span style="color: blue;">kubectl create -f </span></b><span style="color: blue;"><b>https://raw.githubusercontent.com/kubernetes/dashboard/v1.10.1/src/deploy/recommended/kubernetes-dashboard.yaml</b></span><br />
<br />
To start the proxy to the <a href="https://github.com/kubernetes/dashboard">dashboard</a> on the default port 8001 as a blocking (foreground) process, use the below command.<br />
<br />
$ <b><span style="color: blue;">kubectl proxy</span></b><br />
<br />
To access the dashboard from outside the cluster from any host, with a custom port and address, use the below command.<br />
<br />
$ <b><span style="color: blue;">kubectl proxy --address="<master-node-address>" -p 8080 --accept-hosts='^*$' &</span></b><br />
<br />
Create a service account for the dashboard in the "default" namespace<br />
<br />
$ <b><span style="color: blue;">kubectl create serviceaccount dashboard -n default</span></b><br />
<br />
Add a cluster role binding granting the cluster-admin role to the dashboard service account<br />
<br />
$ <b><span style="color: blue;">kubectl create clusterrolebinding dashboard-admin -n default \</span></b><br />
<b><span style="color: blue;"><span style="white-space: pre;"> </span>--clusterrole=cluster-admin \</span></b><br />
<b><span style="color: blue;"><span style="white-space: pre;"> </span>--serviceaccount=default:dashboard</span></b><br />
<br />
Get the secret token to be pasted into the dashboard Token field, and copy the resulting secret key.<br />
<br />
$ <b><span style="color: blue;">kubectl get secret $(kubectl get serviceaccount dashboard -o jsonpath="{.secrets[0].name}") -o jsonpath="{.data.token}" | base64 --decode</span></b><br />
<div>
<br /></div>
<div>
Go to <b><a href="http://localhost:8001/api/v1/namespaces/kube-system/services/https:kubernetes-dashboard:/proxy/#!/login">http://<master-node-address>:8001/api/v1/namespaces/kube-system/services/https:kubernetes-dashboard:/proxy/#!/login</a></b>, which displays the Kubernetes Dashboard. Select the Token option and paste the secret key obtained from the previous command to access the Kubernetes dashboard.</div>
<div>
<br /></div>
<div>
<br /></div>
<span style="font-size: x-large;"><b>Kubernetes Commands</b></span><br />
<br />
Create a deployment<br />
<br />
$ <b><span style="color: blue;">kubectl create deployment <deployment-name> --image=<image-name></span></b><br />
<br />
Verify the deployment<br />
<br />
$ <b><span style="color: blue;">kubectl get deployments</span></b><br />
<br />
Get more details about the deployment<br />
<br />
$ <b><span style="color: blue;">kubectl describe deployment <deployment-name></span></b><br />
<br />
Create a deployment with specified <deployment-name> and an associated ReplicaSet object. The --replicas option specifies the number of pods which would run the specified image.<br />
<br />
$ <b><span style="color: blue;">kubectl run <deployment-name> --replicas=5 --labels=<label-name> --image=<image-name>:<image-version> --port=8080</span></b><br />
<br />
Get the information about the ReplicaSets<br />
<br />
$ <b><span style="color: blue;">kubectl get replicasets</span></b><br />
$ <b><span style="color: blue;">kubectl describe replicasets</span></b><br />
<br />
Create a NodePort service for the deployment, exposing TCP port 80<br />
<br />
$ <b><span style="color: blue;">kubectl create service nodeport <deployment-name> --tcp=80:80</span></b><br />
<br />
List the services in the cluster along with their cluster IPs and exposed ports<br />
<br />
$ <b><span style="color: blue;">kubectl get svc</span></b><br />
<br />
Delete the deployment<br />
<br />
$ <b><span style="color: blue;">kubectl delete deployment <deployment-name></span></b><br />
<br />
Get status of all the nodes<br />
<br />
<div>
$ <span style="color: blue;"><b>kubectl get nodes</b></span><br />
<br />
Get status of all the Pods<br />
<br />
$ <span style="color: blue;"><b>kubectl get pods --all-namespaces -o wide</b></span><br />
<br />
Get the status of all the pods with specified namespace using the -n parameter.<br />
<br />
$ <b><span style="color: blue;">kubectl get pods -n <namespace> -o wide</span></b><br />
<div>
<br />
Get the status of the pods with specified label using -l parameter and namespace.<br />
<br />
$ <b><span style="color: blue;">kubectl get pods -n <namespace> -l <label-key>=<label-value> -o wide</span></b></div>
<div>
<br /></div>
Get detailed status of the Pods across all namespaces<br />
<br />
$ <span style="color: blue;"><b>kubectl get -o wide pods --all-namespaces</b></span></div>
<div>
<br /></div>
<div>
Get the list of current namespaces in a cluster<br />
<br />
$ <b><span style="color: blue;">kubectl get namespaces</span></b><br />
<div>
<br />
Delete the Pod with the specified name<br />
<br /></div>
$ <b><span style="color: blue;">kubectl delete pod <pod-name></span></b><br />
<br />
Delete the Pods and Services with the specified Pod and Service names respectively.<br />
<br />
$ <b><span style="color: blue;">kubectl delete pod,service <pod-name> <service-name></span></b><br />
<br />
Delete the pods and services with the specified label name, including uninitialized ones<br />
<br />
$ <b><span style="color: blue;">kubectl delete pods,services -l name=<label-name> --include-uninitialized</span></b><br />
<br />
The <b>kubectl create</b> command is part of <a href="https://kubernetes.io/docs/concepts/overview/object-management-kubectl/imperative-config/">Imperative Object Management</a>, where the Kubernetes API is used to create, replace or delete objects by specifying the operation directly. It creates a resource from a file or from stdin. The below example creates a deployment using an Nginx YAML configuration file<br />
<br />
$ <b><span style="color: blue;">kubectl create -f nginx.yaml</span></b><br />
<br />
Delete resources by file name, stdin, or by resource and names.<br />
<br />
$ <b><span style="color: blue;">kubectl delete -f pod.yml</span></b><br />
<br />
Create a service which exposes the specified deployment on port 8080, using --type to specify the service type as LoadBalancer instead of the default ClusterIP.<br />
<br />
$ <b><span style="color: blue;">kubectl expose deployment <deployment-name> --type=LoadBalancer --port=8080 --name=<service-name></span></b><br />
<br />
Get the information of the specified service.<br />
<br />
$ <b><span style="color: blue;">kubectl get services <service-name></span></b><br />
<br />
Get the detailed information of the specified service<br />
<br />
$ <b><span style="color: blue;">kubectl describe services <service-name></span></b><br />
<br />
Update the image of a container in the deployment's pod template.<br />
<br />
$ <b><span style="color: blue;">kubectl set image deployments/<deployment-name> <container-name>=<image-name></span></b><br />
<br />
The scale command scales the specified deployment up or down with the specified number of replicas.<br />
<br />
$ <b><span style="color: blue;">kubectl scale deployments/<deployment-name> --replicas=3</span></b><br />
<br />
<a href="https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/">Drain</a> particular node or remove it from service. It would safely evict all of pods from the specified node in case to perform maintenance on the node.<br />
<br />
$ <span style="color: blue;"><b>kubectl drain <node-name> --delete-local-data --force --ignore-daemonsets</b></span><br />
<br />
Delete a node with specified name<br />
<br />
$ <span style="color: blue;"><b>kubectl delete node <node-name></b></span><br />
<br />
Delete a service using the kubectl delete command. Multiple service names can be passed to the kubectl delete command.<br />
<br />
$ <b><span style="color: blue;">kubectl delete service <service-name></span></b><br />
$ <b><span style="color: blue;">kubectl delete service <service1-name> <service2-name> <service3-name></span></b><br />
<br />
Delete pods and services with label <b>name=myLabel</b><br />
<br />
$ <b><span style="color: blue;">kubectl delete pods,services -l name=myLabel</span></b><br />
<br />
Delete the deployments passing multiple deployment names.<br />
<br />
$ <b><span style="color: blue;">kubectl delete deployments <deployment1-name> <deployment2-name></span></b><br />
<div>
<br /></div>
<a href="https://kubernetes.io/docs/reference/setup-tools/kubeadm/kubeadm-reset/">Reverts</a> any changes made by <b>kubeadm init</b> or <b>kubeadm join</b> commands. The --ignore-preflight-errors option allows to ignore errors from all the checks.<br />
<br /></div>
<div>
$ <b><span style="color: blue;">kubectl reset --ignore-preflight-errors stringSlice</span></b></div>
<br />
The below kubectl delete commands delete a serviceaccount and a clusterrole by namespace and name.<br />
<br />
$ <b><span style="color: blue;">kubectl delete serviceaccount -n kube-system admin-user</span></b><br />
$ <span style="color: blue;"><b>kubectl delete clusterrole cluster-admin</b></span><br />
<br />
View the logs of a particular container inside a pod running multiple containers. The -n (namespace) option allows filtering the pod by namespace.<br />
<br />
$ <b><span style="color: blue;">kubectl logs -f <pod-name> -c <container-name></span></b><br />
<br />
$ <b><span style="color: blue;">kubectl -n kube-system logs <pod-name> -c <container-name></span></b><br />
<br />
<b>Note</b>: If you face a <a href="https://kubernetes.io/docs/setup/independent/troubleshooting-kubeadm/#tls-certificate-errors">certificate error</a> "Unable to connect to the server: x509: certificate signed by unknown authority", append the <b>--insecure-skip-tls-verify=true</b> argument to the kubectl commands<br />
<div>
<br /></div>
<br />Pranav Patilhttp://www.blogger.com/profile/10125896717744035325noreply@blogger.com4tag:blogger.com,1999:blog-7610061084095478663.post-21401742002373489122018-12-27T23:39:00.003-08:002021-06-22T17:31:09.195-07:00Docker: Platform for Microservices<br />
With the advent of the microservices architectural style, where applications are a collection of loosely coupled services which can be independently deployed, upgraded and scaled, many organizations are switching to a microservices design in order to achieve greater scalability and availability. In order to run individual services on different instances and scale efficiently, a self-contained unit such as a virtual machine or a docker container can be used.<br />
<br />
<b>Virtualization </b>is a technique of running a guest operating system on a host operating system. It allows multiple operating systems to run on a single machine, which allows easy recovery on failure. A virtual machine is comprised of some level of virtualized hardware and kernel, on which runs a guest operating system with a guest kernel that talks to this virtual hardware. A virtual machine emulates a real machine and runs on top of either a hosted hypervisor or a bare-metal hypervisor, which in turn runs on the host machine. A hosted virtualization hypervisor runs on the operating system of the host machine and hence is almost hardware independent, while a bare metal hypervisor runs directly on the host machine’s hardware, providing better performance. The hypervisor drives virtualization by allowing the physical host machine to operate multiple virtual machines as guests, helping to maximize the effective use of computing resources such as memory, network bandwidth and CPU cycles. It also allows sharing of resources amongst multiple virtual machines. Either way, the hypervisor approach is considered heavy weight as it requires virtualizing multiple parts, if not all, of the hardware and kernel. The virtual machine packages up the virtual hardware, a kernel (i.e. OS) and user space for each new instance, thus requiring a lot of hardware resources. Running multiple VMs on the same host machine degrades the system performance, as each virtual OS runs its own kernel and libraries/dependencies, taking a considerable chunk of the host system resources. Virtual machines are also slower to boot up, which becomes critical for real time processing production applications. Once a virtual machine is allocated memory, it cannot be taken back later even though the VM only uses a fraction of its allocated memory. Virtualization thus involves adding extra hardware to achieve the desired performance, and is a tedious and costly affair to maintain.<br />
<div>
<br /></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://3.bp.blogspot.com/-7Aib0ms7q0o/XCBKEjOokSI/AAAAAAAAX5s/QV5__ZBZDfcTqTY-t3kPsh7P4pEhSs9iQCLcBGAs/s1600/1_RKPXdVaqHRzmQ5RPBH_d-g.png" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="816" data-original-width="760" height="400" src="https://3.bp.blogspot.com/-7Aib0ms7q0o/XCBKEjOokSI/AAAAAAAAX5s/QV5__ZBZDfcTqTY-t3kPsh7P4pEhSs9iQCLcBGAs/s400/1_RKPXdVaqHRzmQ5RPBH_d-g.png" width="330" /></a><a href="https://3.bp.blogspot.com/-nEMUaLfGVVw/XCBKEqqdIlI/AAAAAAAAX5w/XU9thEfj-Cg1b78S2yL--sNqya9hmtSlwCLcBGAs/s1600/1_V5N9gJdnToIrgAgVJTtl_w.png" style="clear: right; display: inline; margin-bottom: 1em; margin-left: 1em;"><img border="0" data-original-height="594" data-original-width="820" height="289" src="https://3.bp.blogspot.com/-nEMUaLfGVVw/XCBKEqqdIlI/AAAAAAAAX5w/XU9thEfj-Cg1b78S2yL--sNqya9hmtSlwCLcBGAs/s400/1_V5N9gJdnToIrgAgVJTtl_w.png" width="400" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<br />
<b>Containerization </b>is a lightweight alternative to full machine virtualization that involves encapsulating an application in a container with its own operating environment. Containers run on the same host operating system and share the host kernel, requiring significantly fewer resources and making a container much faster to boot than a virtual machine.<br />
<b><br /></b>
<b>Docker</b> is a containerization platform which packages the application and all its dependencies together in the form of containers, ensuring that the application works seamlessly in any environment, be it Development, Testing or Production. Docker containers, similar to VMs, share the goal of isolating an application and its dependencies into a self-contained unit that can run anywhere. Each container runs independently of the other containers with process-level isolation. Docker containers require very little space, start up faster and can be easily integrated with many DevOps tools for automation compared to virtual machines. A Docker container gets allocated only the memory needed to run it, thus avoiding unused memory being reserved for any container. Unlike virtual machines, which require hardware virtualization for machine-level isolation, docker containers achieve isolation within the same operating system. The overhead difference between VMs and containers becomes really apparent as the number of isolated spaces increases. Further, since docker containers run on the host system kernel, they are very lightweight and fast to execute.<br />
<br />
A Docker container is an isolated application platform which contains everything needed to run the application. Containers are built from a base docker image, and dependencies are installed on top of that image as "image layers". A Docker image is equivalent to an executable which runs specific services in a particular environment; in other words, a Docker container is a live running instance of a Docker image. A Docker registry is a storage component for docker images. The registry can be the user's local repository or a public repository like <a href="https://hub.docker.com/">DockerHub</a>, enabling collaboration while building an application.<br />
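<br />
As a minimal sketch of how image layers are declared, the hypothetical Dockerfile below starts from a base image and adds dependencies and application code as separate layers; the base image, file names and start command are illustrative assumptions, not taken from this post.<br />
<pre>
# Base image layer (an assumed python:3 base, purely for illustration)
FROM python:3

# Each instruction below adds a new layer on top of the base image
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .

# Command executed when a container is started from this image
CMD ["python", "app.py"]
</pre>
Such an image is then built and tagged with the docker build command shown later in this post.<br />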
The Docker engine is the heart of the docker system; it creates and runs Docker containers. It works as a client-server application, with the server being the Docker daemon process, which the Docker CLI communicates with using REST APIs and socket I/O to create and run docker containers. The Docker daemon builds an image based on the inputs, or pulls an image from a docker registry, after receiving the corresponding docker build or docker pull command from the docker CLI. When a docker run command is received from the docker CLI, the docker daemon creates a running instance of the docker image by creating and running a docker container. For Windows and Mac OS X there is an additional Docker Toolbox, an installer to quickly set up a docker environment, which includes the Docker client, Kitematic, Docker Machine and VirtualBox.<br />
Docker provides various restart policies to allow containers to start automatically when they exit, or when Docker restarts. It is generally preferred to restart a container automatically if it stops, mostly to recover from failures; an example is sketched below.<br />
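<br />
As an illustrative example (the image name is a placeholder), a restart policy can be supplied with the --restart flag of docker run; the supported policies are no, on-failure[:max-retries], always and unless-stopped.<br />
<br />
$ <b><span style="color: blue;">docker run -d --restart unless-stopped <image-name></span></b><br />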
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://4.bp.blogspot.com/-he7OHW6gyZ8/XB--vfe95uI/AAAAAAAAX5Q/gKViOGOJFjQgyWoY0R5TR7EUqe-NWAkiwCLcBGAs/s1600/architecture.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="527" data-original-width="1008" src="https://4.bp.blogspot.com/-he7OHW6gyZ8/XB--vfe95uI/AAAAAAAAX5Q/gKViOGOJFjQgyWoY0R5TR7EUqe-NWAkiwCLcBGAs/s1600/architecture.png" /></a></div>
<br />
<br />
<br />
<div style="-webkit-text-stroke-width: 0px; color: black; font-family: "Times New Roman"; font-size: medium; font-style: normal; font-variant-caps: normal; font-variant-ligatures: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-decoration-color: initial; text-decoration-style: initial; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px;">
</div>
<br />
<b><span style="font-size: large;">Docker Machine</span></b><br />
<br />
Docker Machine is a tool which allows creating (and managing) virtual hosts with the docker engine installed, either on a local machine using VirtualBox or on cloud providers such as DigitalOcean, AWS and Azure. The docker-machine commands allow starting, inspecting, stopping, and restarting a managed host, upgrading the Docker client and daemon, and configuring a Docker client to talk to the corresponding host. Docker Machine enables provisioning multiple remote Docker hosts on various flavors of Linux and allows running docker on older Windows or Mac operating systems.<br />
<br />
<br />
<b><span style="font-size: large;">Docker Networking</span></b><br />
<br />
By default docker creates <a href="https://runnable.com/docker/basic-docker-networking">three networks</a> automatically on install: bridge, none, and host.<br />
<br />
<b>BRIDGE</b>: All Docker installations represent the docker0 network with bridge, since docker connects containers to the bridge driver by default. Docker also automatically creates a subnet and gateway for the bridge network, and <b>docker run</b> automatically adds containers to it. Containers running on the same network can communicate with one another via IP addresses. Docker does not support automatic service discovery on the default bridge network. To connect containers to a specific network, use the "--network" option of the docker run command.<br />
<br />
<b>NONE</b>: The None network offers a container-specific network stack that lacks a network interface.<br />
The container for none network only has a local loopback interface (i.e., no external network interface).<br />
<br />
<b>HOST</b>: Host enables a container to attach to your host’s network (meaning the configuration inside the container matches the configuration outside the container).<br />
<br />
Containers can communicate within networks but not across networks. A container attached to multiple networks can connect with all of the containers on all of those networks. The <b>docker network create</b> command allows creating custom isolated networks. Any container created on such a network can immediately connect to any other container on the same network. The network isolates its containers from other (including external) networks. However, we can expose and publish container ports on the network, allowing portions of our bridge access to an outside network, as in the sketch below.<br />
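<br />
As a small sketch (the network and image names here are placeholders, not from this post), a user-defined bridge network can be created and a container started on it with one port published to the host:<br />
<br />
$ <b><span style="color: blue;">docker network create my-bridge</span></b><br />
$ <b><span style="color: blue;">docker run -d --network my-bridge -p 8080:80 nginx</span></b><br />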
<br />
Overlay network provides native multi-host networking and requires a valid key-value store service, such as Consul, Etcd, or ZooKeeper. A key-value store service should be installed and configured before creating the network. Multiple docker hosts within overlay network must communicate with the key-value store service. Hosts can be provisioned by docker machine. Once we connect, every container on the network has access to all the other containers on the network, regardless of the Docker host serving the container.<br />
<br />
<span style="font-size: large;"><b>Docker Compose</b></span><br />
<div>
<span style="font-size: large;"><b><br /></b></span></div>
When a docker application includes more than one container, building, running, and connecting the containers from separate Dockerfiles is cumbersome and time-consuming. Docker compose solves this by allowing a multi-container application to be defined in a single YAML file and spun up with a single command. It allows building images, scaling containers, linking containers in a network and defining volumes for data storage. Docker compose is essentially a wrapper around the docker CLI that saves time. A docker-compose.yml file is organized into four sections (a small sketch follows the list below):<br />
<br />
<b><u>version</u></b>: It specifies the docker compose file syntax version<br />
<u><b>services</b></u>: A service is the name for the docker container in production. This section defines the containers that will be started as a part of the Docker Compose instance.<br />
<b><u>networks</u></b>: This section is used to configure networking for the application. It enables to change the settings of the default network, connect to an external network, or define app-specific networks.<br />
<b><u>volumes</u></b>: It enables mounting a linked path on the host machine which is used by the container for persistent storage.<br />
<br />
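The following is a minimal, illustrative docker-compose.yml showing the four sections. The service, image, network and volume names below are assumptions made for this sketch, not taken from the original post.<br />
<pre>
version: "3"

services:
  web:
    image: nginx                      # image for the "web" service container
    ports:
      - "8080:80"                     # publish container port 80 on host port 8080
    networks:
      - app-net
    volumes:
      - web-data:/usr/share/nginx/html

networks:
  app-net:                            # app-specific network joined by the services

volumes:
  web-data:                           # named volume used for persistent storage
</pre>
Running docker-compose up in the directory containing such a file starts the defined services on the app-specific network with the named volume attached.<br />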
<br />
<span style="font-size: large;"><b>Installing Docker on Ubuntu</b></span><br />
<br />
$ <b><span style="color: blue;">sudo apt update</span></b><br />
<br />
$ <b><span style="color: blue;">sudo apt install apt-transport-https ca-certificates curl software-properties-common</span></b><br />
<br />
$ <b><span style="color: blue;">curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -</span></b><br />
<br />
$ <b><span style="color: blue;">sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu bionic stable"</span></b><br />
<br />
$ <b><span style="color: blue;">sudo apt update</span></b><br />
<br />
$ <b><span style="color: blue;">sudo apt install docker-ce</span></b><br />
<br />
$ <b><span style="color: blue;">sudo apt install docker-compose</span></b><br />
<br />
<br />
<span style="font-size: large;"><b>Docker Commands</b></span><br />
<br />
Log in to a Docker registry<br />
<br />
$ <b><span style="color: blue;">docker login -u <docker-username> -p <docker-password></span></b><br />
<br />
Pull an image or a repository from a registry<br />
<br />
$ <b><span style="color: blue;">docker pull elkozmon/zoonavigator-api:0.2.3</span></b><br />
<br />
Docker command to clean up resources: images, containers, volumes, and networks that are dangling and not associated with a container.<br />
<br />
$ <b><span style="color: blue;">docker system prune</span></b><br />
<br />
To remove any stopped containers and all unused images (not just dangling images), add the -a flag to the command.<br />
<br />
$ <b><span style="color: blue;">docker system prune -a</span></b><br />
<div>
<b><span style="color: blue;"><br /></span></b>
Remove all stopped containers, unused volumes and unused images. The --force option does not ask for confirmation during removal.<br />
<br />
$ <span style="color: blue;"><b>docker system prune --all --force --volumes</b></span><br />
<div>
<br />
Remove all dangling images where no container is associated with them, skipping confirmation for removal.<br />
<br />
$ <b><span style="color: blue;">docker image prune -f</span></b><br />
<br />
Remove all unused images, not just dangling ones<br />
<br />
$<b><span style="color: blue;"> docker image prune -a</span></b><br />
<br />
Remove all stopped containers.<br />
<br />
$ <b><span style="color: blue;">docker container prune</span></b><br />
<br />
Remove all unused local volumes<br />
<br />
$ <b><span style="color: blue;">docker volume prune</span></b><br />
<br />
To delete all dangling volumes, use the below command<br />
<br />
$ <b><span style="color: blue;">docker volume rm `docker volume ls -q -f dangling=true`</span></b><br />
<br />
The docker ps command with the -a flag gives the details of all containers, including their names, container ids and the ports on which they are running.</div>
<br />
$ <b><span style="color: blue;">docker ps -a</span></b><br />
<br />
The docker ps -a command can also be used to locate the containers and filter them using -f flag by their status: created, restarting, running, paused, or exited.<br />
<br />
$ <b><span style="color: blue;">docker ps -a -f status=exited</span></b><br />
<br />Build a docker image using the docker build command. The -t option allows tagging the image.</div><div><br /></div><div>$ <b><span style="color: blue;">docker build .</span></b></div><div>$ <b><span style="color: blue;">docker build -t username/repository-name .</span></b></div><div><br /></div><div>
Remove the container by container name or id using rm command.<br />
<br />
$ <b><span style="color: blue;">docker rm <container-id> </span><span style="color: red;">or </span><span style="color: blue;"><container-name></span></b><br />
<br />Removes (and un-tags) one or more images. The -f option forces removal of an image even when it is used by a running container.</div><div><br /></div><div>$ <b><span style="color: blue;">docker rmi <image-name></span></b><br /><br />
To stop all the docker containers<br />
<br />
$ <b><span style="color: blue;">docker stop $(docker ps -a -q)</span></b><br />
<br />
Then to remove all the stopped containers, pass the docker container ids from docker ps to docker rm command<br />
<br />
$ <b><span style="color: blue;">docker rm $(docker ps -a -q)</span></b><br />
<br />
Create a volume with the specified volume driver using the --driver (-d) option. The --options (-o) flag allows setting driver-specific options.<br />
<br />
$ <b><span style="color: blue;">docker volume create -d local-persist -o mountpoint=/mnt/ --name=<volume-name></span></b><br />
<br />
Display detailed information of the specified volume<br />
<br />
$ <b><span style="color: blue;">docker volume inspect <volume-name></span></b><br />
<br />
The docker volume ls command is used to locate the volume name or names to be deleted. Remove one or more volumes using the docker volume rm command as below.<br />
<br />
$ <b><span style="color: blue;">docker volume ls</span></b><br />
$ <b><span style="color: blue;">docker volume rm <volume_name> <volume_name></span></b><br />
<div>
<br />
Using the --filter (-f) option, list volumes by filtering only those which are dangling.<br />
<br />
$ <b><span style="color: blue;">docker volume ls -f dangling=true</span></b><br />
<br /></div>
Display detailed information (including the assigned network address) for the specified docker container.<br />
<br />
$ <b><span style="color: blue;">docker inspect <container-name></span></b><br />
<br />
To get the process id i.e. PID of the specified docker container we use the below command.<br />
<br />
$ <b><span style="color: blue;">docker inspect -f '{{.State.Pid}}' <span style="color: blue;"><container-name></span></span></b><br />
<br />
Find the IP addresses of the container specified by the container name or id passed to the docker inspect command.<br />
<br />
$ <b><span style="color: blue;">docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' <container-name></span></b><br />
<br />
Display the health check status of a docker container.<br />
<br />
$ <b><span style="color: blue;">docker inspect --format='{{json .State.Health}}' <container-name></span></b>
<br />
<pre>{
  "Status": "healthy",
  "FailingStreak": 0,
  "Log": [
    {
      "Start": "2017-07-21T06:10:51.809087707Z",
      "End": "2017-07-21T06:10:51.868940223Z",
      "ExitCode": 0,
      "Output": "Hello world"
    }
  ]
}
</pre>
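<br />
For reference, a health check like the one inspected above can also be defined when the container is started; the curl command and image name below are illustrative assumptions.<br />
<br />
$ <b><span style="color: blue;">docker run -d --health-cmd="curl -f http://localhost/ || exit 1" --health-interval=30s --health-timeout=5s --health-retries=3 <image-name></span></b><br />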
<br />
Execute the specified command in a running docker container. The command is not restarted if the container gets restarted.<br />
<br />
$ <span style="color: blue;"><b>docker exec </b><b><container-name></b><b> ps </b></span><br />
$ <b><span style="color: blue;">docker exec -it <container-name> /bin/bash</span></b><br />
<br />
The exec command comes in handy for all kinds of debugging purposes, e.g. to check that UDP ports are being listened on we use the below netstat command.<br />
<br />
$ <b><span style="color: blue;">docker exec -it <span style="color: blue;"><container-name></span> netstat -au</span></b><br />
<br />
Run a one-time command in a new container. The -e option allows setting an environment variable while running the command. The -t option allocates a pseudo-TTY, while -i keeps STDIN open even if not attached. The docker run command first creates a writeable container layer over the specified image, and then starts it using the specified command.<br />
<br />
$ <b><span style="color: blue;">docker run -it -e "ENV=dev" <docker-image-name></span></b><br />
<br />
By default a container’s file system persists even after the container exits. The --rm flag avoids persisting container file systems for short-term processes by automatically cleaning up the container and removing its file system when the container exits. The --rm parameter is ignored in detached mode with the -d (--detach) parameter. By default all containers are connected to the bridge interface to make any outgoing connections, but a custom network can be provided with the --network option.<br />
<br />
$ <b><span style="color: blue;">docker run -it --rm --network net postgres-service psql -h postgres-service -U appuser</span></b><br />
<br />
Get a list of all container IDs, only displaying numeric container ids.<br />
<br />
$ <b><span style="color: blue;">docker container ls -aq</span></b><br />
<br />
Display the list of all images along with their repository, tags, and size. Passing a name or tag lists only the matching images.<br />
<br />
$ <b><span style="color: blue;">docker images</span></b><br />
$ <b><span style="color: blue;">docker images java</span></b><br />
<br />
<div>
Create a network with the specified driver using --driver (-d) option.<br />
<br /></div>
$ <b><span style="color: blue;">docker network create -d bridge <network-name></span></b><br />
<br />
Use the docker run command with the --net flag to specify the network to which you wish to connect your specified container.<br />
<br />
$ <b><span style="color: blue;">docker run <container-name> --net=<network-name></span></b><br />
<br />
Get the list of Docker networks<br />
<br />
$ <b><span style="color: blue;">docker network ls</span></b><br />
<br />
Network inspect provides further details on a network.<br />
<br />
$ <b><span style="color: blue;">docker network inspect <network-name></span></b><br />
<br />
To get the details of the default bridge network in JSON format we use below command.<br />
<br />
$ <b><span style="color: blue;">docker network inspect bridge</span></b><br />
<br />
Create a docker machine with a --driver flag indicating the provider on which the machine should be created on e.g. VirtualBox, DigitalOcean, AWS, etc.<br />
<br />
$ <b><span style="color: blue;">docker-machine create --driver virtualbox <docker-machine-name></span></b><br />
<div>
<br /></div>
The docker logs command shows information logged by a running container. The -f option allows following the log output and the -t option displays timestamps.<br />
<br />
$ <b><span style="color: blue;">docker logs -t -f <container-name></span></b><br />
<br />
Builds, (re)creates, starts, and attaches to containers for a service. Unless they are already running, <b>docker-compose up</b> also starts any linked services. When the command exits it stops all containers.<br />
<br />
$ <b><span style="color: blue;">docker-compose up</span></b><br />
<br />
Start all docker containers. With the -d (--detach) option specified, docker-compose starts all the services in the background and leaves them running. With the --no-recreate option, if a container already exists, it is not recreated.<br />
<br />
$ <b><span style="color: blue;">docker-compose up -d</span></b><br />
<br />
Build and start only the specified docker container.<br />
<br />
$ <b><span style="color: blue;">docker-compose up <container-name></span></b><br />
<br />
The --scale flag allows to scale the number of instances of the specified service.<br />
<br />
$ <b><span style="color: blue;">docker-compose up --scale <service-name>=3</span></b><br />
<br />
The --no-deps argument for docker-compose up command doesn't start linked services.<br />
<br />
$ <b><span style="color: blue;">docker-compose up -d --no-deps --build <span style="color: blue;"><container-name></span></span></b><br />
<br />
The --file or -f option allows to specify alternate compose file from the default "docker-compose.yml". Multiple configuration files can be supplied using -f option. The compose combines and builds the configuration in the order compose files were supplied. Subsequent files override and add to their predecessors.<br />
<br />
$ <b><span style="color: blue;">docker-compose -f docker-compose.yml -f docker-compose.dev.yml up</span></b><br />
<br />
The --build option allows to build images before starting containers.<br />
<br />
$ <b><span style="color: blue;">docker-compose -f docker-compose-dev.yml up --build</span></b><br />
<br />
The docker-compose up command with the --force-recreate option stops and recreates containers from fresh images every time, even if their configuration and image haven't changed.<br />
<br />
$ <b><span style="color: blue;">docker-compose up -d <container-name> --build --force-recreate</span></b><br />
<br />
The --no-recreate option does not recreate containers if they already exist.<br />
<br />
$ <b><span style="color: blue;">docker-compose up -d <container-name> --build --no-recreate</span></b><br />
<br />
The --remove-orphans removes containers for services not defined in the compose file.<br />
<br />
$ <b><span style="color: blue;">docker-compose up -d <container-name> --build --remove-orphans</span></b><br />
<br />
Starts existing containers for a service.<br />
<br />
$ <b><span style="color: blue;">docker-compose start</span></b><br />
<br />
Stop all the docker containers<br />
<br />
$ <b><span style="color: blue;">docker-compose stop</span></b><br />
<br />
The down command below stops containers and removes the containers, networks, volumes, and images created by up.<br />
<br />
$ <b><span style="color: blue;">docker-compose down</span></b><br />
<br />
The Run command runs a one-time command against a service. It starts a container, runs the command and discards the container.<br />
<br />
$ <b><span style="color: blue;">docker-compose run <container-name> bash</span></b><br />
<br />
The exec command allows to run arbitrary commands in the services. It is similar to run, but allows to attach to a running container and run commands inside it e.g. for debugging. By default it allocates a TTY to get an interactive prompt.<br />
<br />
$ <b><span style="color: blue;">docker-compose exec </span></b><b><span style="color: blue;"><container-name> sh</span></b><br />
<br />
Removes the stopped service containers without asking for any confirmation.<br />
<br />
$ <b><span style="color: blue;">docker-compose rm -f</span></b><br />
<br />
Stop the containers, if required, and then remove all the stopped service containers.<br />
<br />
$ <b><span style="color: blue;">docker-compose rm -s</span></b><br />
<br />
Remove the volumes that are attached to the containers<br />
<br />
$ <b><span style="color: blue;">docker-compose rm -v</span></b><br />
<br />
Rebuild the docker images for the services and tag them. It helps to rebuild whenever there is a change in a Dockerfile or in the contents of its build directory.<br />
<br />
$ <b><span style="color: blue;">docker-compose build</span></b><br />
<br />
Get status of docker containers<br />
<br />
$ <b><span style="color: blue;">docker-compose ps</span></b><br />
<br />
The docker-compose logs command displays output logs from all running services. The -f option means to follow the log output and the -t option gives the timestamps.<br />
<div>
<br /></div>
<div>
$ <b><span style="color: blue;">docker-compose logs -f -t</span></b></div>
<div>
<br /></div>
Validate and view the docker compose file.<br />
<br />
$ <b><span style="color: blue;">docker-compose config</span></b><br />
<b><span style="color: blue;"><br /></span></b>
It also allows testing the resultant docker compose file where variables e.g. <b>$SERVICE_PASSWORD</b> need to be populated, by passing them on the command line before the docker command as below. It is important to note that docker detects if environment variables have changed for a dependent container compared to the existing running container and recreates the container again.<br />
<br />
$ <b><span style="color: blue;">SERVICE_PASSWORD=secret docker-compose config</span></b><br />
<br />
<br /></div>
<div>
<span style="font-size: x-large;"><b>Docker Swarm</b></span><br />
<br />
Swarmkit is a separate project which implements Docker’s orchestration layer and cluster management and is embedded in the Docker Engine. Docker swarm is a technique to create and maintain a cluster of Docker Engines. Many docker engines connected to each other form a network, which is called a docker swarm cluster.<br />
<br />
A docker manager initializes the swarm in a docker swarm cluster and, along with many other nodes, executes the services. A node is an instance of the Docker engine participating in the swarm. Though a docker manager can execute services, its primary role is to maintain and manage the docker nodes running the services. The docker manager also performs cluster and orchestration management by electing a single leader to conduct orchestration tasks. The manager node uses the submitted service definition to dispatch units of work called tasks to worker nodes. A task is an instance of a running container which is part of a swarm cluster managed by the docker manager. The docker manager assigns tasks to worker nodes according to the number of replicas set in the service scale. Once a task is assigned to a node, it cannot move to another node. The worker nodes receive and execute the corresponding tasks dispatched from the manager node. The docker manager maintains the desired state of each worker node using the current state of assigned tasks reported by the agent running on each worker node. The docker manager has two kinds of tokens, a manager token and a worker token. Worker nodes use the worker token to join the swarm as worker nodes, while another node can join as a docker manager by using the manager token, creating a multi-manager docker cluster. A multi-manager cluster has a single primary docker manager and multiple secondary docker managers. While a request to deploy the application (start a service) can be made to either the primary or a secondary manager, any request to a secondary manager is automatically routed to the primary manager, which is responsible for scheduling/starting containers on the hosts. All the docker managers in a multi-manager cluster form a <a href="https://www.geeksforgeeks.org/raft-consensus-algorithm/">Raft consensus group</a>. The <a href="https://www.brianstorti.com/raft/">Raft consensus algorithm</a> enables designing a fault-tolerant distributed system where multiple servers agree on the same information. It allows the election of a leader, and for each subsequent request to the leader, which is appended to its log, the log of every follower is replicated with the same information. It is highly recommended to have an odd number of docker managers (typically 1, 3 or 5) to avoid the split-brain issue where more than one candidate gets an equal majority, i.e. a tie. The worker nodes communicate with each other using a gossip network.<br />
<br />
There are two modes in which services are executed in docker swarm, namely replicated and global. The replicated mode allows multiple instances (tasks) of a service to be executed on the same docker host, depending on its load and capacity. It also allows having no instance of the service running on an already loaded docker node. The global mode, however, ensures that one instance of the service is running on every node of the docker cluster. It ensures that unless all the nodes fail, the service is still up and running on the remaining nodes. It is used for critical services which are required to be up all the time, e.g. a Eureka service.<br />
<div>
<br /></div>
<div>
<br /></div>
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://3.bp.blogspot.com/-ADU1fiXqIhk/XCLXDHeFyuI/AAAAAAAAX6k/4-jLcPPHvwgUV7RJ6GcXD9EIjWfAl_LmgCLcBGAs/s1600/services-diagram.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="552" data-original-width="800" height="440" src="https://3.bp.blogspot.com/-ADU1fiXqIhk/XCLXDHeFyuI/AAAAAAAAX6k/4-jLcPPHvwgUV7RJ6GcXD9EIjWfAl_LmgCLcBGAs/s640/services-diagram.png" width="640" /></a></div>
<br />
<br />
A service is a higher abstraction which helps to run an image in a swarm cluster, while swarm manages the individual instances aka tasks. It is a docker image which docker swarm manages and executes. When a desired service state is declared by creating or updating a service, the orchestrator realizes the desired state by scheduling tasks. Each task is a slot that the scheduler fills by spawning a container. The container is the instantiation of the task. Service creation requires specifying which docker image to use and which commands to execute inside the running containers.<br />
<br />
The requested services are divided and executed across multiple docker nodes as tasks to achieve load balancing. Multiple tasks belonging to a single service or to different services can be executed within a docker node. At any point of time, when a node goes down in the docker swarm cluster, the docker manager starts the tasks for the services running on the stopped docker node on other nodes to balance the load, thus providing high availability of services. Auto load balancing ensures that during any node downtime the docker manager will execute the corresponding down services on other nodes, and also scale the services across multiple nodes during periods of high load. The docker manager uses an internal DNS server for load balancing, which connects all the nodes in the docker cluster. Decentralized access allows a service deployed on any node to be accessed from other nodes in the same cluster. Docker swarm also allows seamless rolling updates for each service with a delay between individual nodes. Docker Swarm manages the individual containers on the node for us.<br />
<br />
A stack is a group of interrelated services that share dependencies, and can be orchestrated and scaled together. A single stack is capable of defining and coordinating the functionality of an entire application. The stack abstraction goes beyond the individual services and deals with the entirety of application services, which are closely interlinked or interdependent. Stacks allow for multiple services, which are containers distributed across a swarm, to be deployed and grouped logically. The services running in a Stack can be configured to run several replicas, which are clones of the underlying container. The stack is configured using a docker-compose file and it takes one command to deploy the stack across an entire swarm of Docker nodes. Stacks are very similar to docker-compose except they define services while docker-compose defines containers. Docker stack simplifies deployment and maintenance of multiple inter-communicating microservices and is ideal for running stateless processing applications.<br />
<div>
<br /></div>
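<br />
As an illustrative sketch (the service names, images and counts are assumptions for this example, not from the post), a stack is usually described in a compose file whose services carry a deploy section that swarm honours when the stack is deployed:<br />
<pre>
version: "3"

services:
  web:
    image: nginx
    ports:
      - "80:80"
    deploy:
      replicas: 3                 # run three task replicas across the swarm
      restart_policy:
        condition: on-failure     # restart a task only when it fails

  agent:
    image: busybox
    command: sleep 3600
    deploy:
      mode: global                # one task on every node of the cluster
</pre>
Such a file can then be deployed with the docker stack deploy command shown in the commands section below.<br />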
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://4.bp.blogspot.com/-CHJ28YfKpZw/XB_4shBiyqI/AAAAAAAAX5c/gSdw5jJNCaA4cvyEbIKCFXTLS42ocXenwCLcBGAs/s1600/1_dvIZd0ZN-KcjaDDAKTWcLA.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="835" data-original-width="1600" height="334" src="https://4.bp.blogspot.com/-CHJ28YfKpZw/XB_4shBiyqI/AAAAAAAAX5c/gSdw5jJNCaA4cvyEbIKCFXTLS42ocXenwCLcBGAs/s640/1_dvIZd0ZN-KcjaDDAKTWcLA.png" width="640" /></a></div>
<br />
<br />
<span style="font-size: large;"><b>Docker Swarm Commands</b></span><br />
<br />
Initialize Docker Swarm using the below swarm init command. The specified <ip-addr> is the docker manager node's IP address, i.e. ideally the address of the machine the command is run on.<br />
<br />
$ <b><span style="color: blue;">docker swarm init --advertise-addr <ip-addr></span></b><br />
<br />
The swarm init command's listen-addr option allows the current node to listen for inbound swarm manager traffic on the specified IP address.<br />
<div>
<br /></div>
$ <b><span style="color: blue;">docker swarm init --listen-addr <ip-addr>:2377</span></b><br />
<div>
<br /></div>
The swarm join command allows the current node to join the swarm cluster as a worker or manager (depending on the token) by connecting to the manager at the specified IP address.<br />
<br />
$ <b><span style="color: blue;">docker swarm join --token <token> <ip-addr>:2377</span></b><br />
$ <b><span style="color: blue;">docker swarm join --token <</span></b><b><span style="color: blue;">worker-token</span></b><b><span style="color: blue;">> <manager>:2377</span></b><br />
<br />
Create a multi-manager docker cluster by joining the swarm as a docker manager using the manager token. Below we have 2 docker managers in the docker cluster.<br />
<br />
$ <b><span style="color: blue;">docker swarm join --manager --token <manager_token> --listen-addr <master2-</span></b><b><span style="color: blue;">addr</span></b><b><span style="color: blue;">>:2377 <master1</span></b><b><span style="color: blue;">-</span></b><b><span style="color: blue;">addr</span></b><b><span style="color: blue;">>:2377</span></b><br />
<div>
<br /></div>
The join-token command manages the swarm's join tokens. It is usually used to print the token (and the full swarm join --token command) needed to add a new node to the current swarm cluster as a worker or a manager.<br />
<br />
$ <b><span style="color: blue;">docker swarm join-token (worker|manager)</span></b><br />
<br />
Leave the current swarm cluster. When the command is run on a worker, the worker node leaves the swarm. The --force option is required on a docker manager to remove it from the swarm cluster.<br />
<br />
$ <b><span style="color: blue;">docker swarm leave --force</span></b><br />
<div>
<b><span style="color: blue;"><br /></span></b></div>
All the service commands below can only be run on a docker manager.<br />
<div>
<br /></div>
Below command lists all the services running inside docker swarm.<br />
$ <b><span style="color: blue;">docker service ls</span></b><br />
<br />
Below command lists the tasks of the specified service, along with the nodes they are running on.<br />
$ <b><span style="color: blue;">docker service ps <name></span></b><br />
<br />
Create new services published on specified node port.<br />
<br />
$ <b><span style="color: blue;">docker service create <name> -p host-port:container-port <image-name></span></b><br />
$ <b><span style="color: blue;">docker service create <name> --publish host-port:container-port --replicas 2 <image-name></span></b><br />
$ <b><span style="color: blue;">docker service create --name <name> alpine ping <host-name></span></b><br />
<br />
The --replicas option specifies the number of tasks (instances) that the newly created service will run.<br />
<br />
$ <span style="color: blue;"><b>docker service create --replicas 3 --name <name> <image-name></b></span><br />
<br />
With the mode set as global for the service create command, docker downloads the specified image and starts the corresponding service on every single node of the cluster.<br />
<br />
$ <b><span style="color: blue;">docker service create --mode=global --name=<name> <span style="color: blue;"><image-name></span></span></b><br />
<br />
Remove service running in docker swarm.<br />
<br />
$ <b><span style="color: blue;">docker service rm <name></span></b><br />
<br />
Scale one or more replicated services<br />
<br />
$ <b><span style="color: blue;">docker service scale <name>=5</span></b><br />
<br />
Display details regarding the specified service<br />
<br />
$ <b><span style="color: blue;">docker service inspect <name> --pretty</span></b><br />
<br />
Update the service by changing the number of replicas.<br />
<br />
$ <b><span style="color: blue;">docker service update --replicas 10 <name></span></b><br />
<br />
The service logs command shows information logged by all containers participating in a service or task.<br />
<br />
$ <b><span style="color: blue;">docker service logs</span></b><br />
<br />
List all the nodes present in the swarm cluster<br />
<br />
$ <b><span style="color: blue;">docker node ls</span></b><br />
<br />
Lists all the tasks running on the current node (by default).<br />
<br />
$ <b><span style="color: blue;">docker node ps</span></b><br />
<br />
Removes one or more nodes specified by id from the swarm cluster<br />
<br />
$ <b><span style="color: blue;">docker node rm <id></span></b><br />
<br />
Stop allocating services to Manager-1 node<br />
<br />
$ <b><span style="color: blue;">docker node update --availability drain <Manager-1></span></b><br />
<br />
Start allocating services to Manager-1 node<br />
<br />
$ <b><span style="color: blue;">docker node update --availability active <Manager-1></span></b></div>
<div>
<br /></div>
<div>
Deploy a new stack or update an existing stack. The --compose-file (-c) option provides the path to the docker compose file.<br />
<br />
$ <b><span style="color: blue;">docker stack deploy <stack-name></span></b></div>
<div>
<div>
$ <b><span style="color: blue;">docker stack deploy -c docker-compose.yml <stack-name></span></b></div>
<div>
<br /></div>
<div>
List all the stacks</div>
<div>
<br /></div>
<div>
$ <b><span style="color: blue;">docker stack ls</span></b></div>
<div>
<br /></div>
<div>
List all the services in the specified stack</div>
<div>
<br /></div>
<div>
$ <b><span style="color: blue;">docker stack services <stack-name></span></b></div>
<div>
<br /></div>
<div>
List the tasks in the specified stack</div>
<div>
<br /></div>
<div>
$ <b><span style="color: blue;">docker stack ps <stack-name></span></b></div>
</div>
<div>
<br /></div>
<div>
<br /></div>
Pranav Patilhttp://www.blogger.com/profile/10125896717744035325noreply@blogger.com3tag:blogger.com,1999:blog-7610061084095478663.post-31382549576937696132017-12-31T16:26:00.004-08:002020-05-02T15:27:48.964-07:00Machine Learning with TensorFlowOver the past decade, there has been an exponential rise in the data we collect as the cost of hardware decreases and processing power surges. Data, whether in the form of a collection of pictures, social media messages, scientific readings, stock exchange feeds, GPS tracking data, activity monitoring or news, contains seemingly valuable information which can tell us about public trends, entity relationships such as cause and effect, pattern repetitions and various other insights. This helps us gain a deep understanding of the actors and the environment in which the data is generated. Remarkably, this data can also be used by computers to train themselves by determining patterns and generating models, which can later be used to query the data, but also to make future predictions. During the 2017 <a href="https://www.youtube.com/watch?v=czVeSFH4dWc">Google I/O Conference</a>, where a few improvements were announced, smart computers using machine learning and other AI techniques were envisioned as the next phase in computer technology.<br />
Deep learning is one of the broader methods of machine learning, which learns high-level features from the data itself. In deep learning, multiple artificial neurons stacked up as layers perform individual functions and serve as input to the next layer.<br />
<br />
<b><span style="font-size: large;">TensorFlow</span></b><br />
<br />
TensorFlow is a software library for dataflow programming released by Google in 2015, which is used for implementing deep learning models. TensorFlow first defines an abstract model of the computations, called the Computational Graph. The computational graph is then run within a session to make the model a reality. A Computational Graph is a series of TensorFlow operations arranged into a graph of nodes. When the Computational Graph is defined, all the operations are created without holding any values or running any calculations. Below is an example of a computational graph.<br />
<pre class="brush: python">import tensorflow as tf
node1 = tf.constant(3.0, tf.float32)
node2 = tf.constant(4.0)
node3 = node1 * node2
print(node1, node2, node3)
</pre>
<br />
A TensorFlow session allows executing the computational graph, or a part of it, and producing actual results as shown below. The session encapsulates the control and state of the TensorFlow runtime.<br />
<pre class="brush: python">session = tf.Session()
print(session.run([node1, node2, node3]))
session.close()
</pre>
<br />
Data is represented in the form of tensors in TensorFlow. A tensor is a multi-dimensional array or list; for example, an array is a 1-dimensional tensor, a matrix is a 2-dimensional tensor, and a three-dimensional matrix is a 3-dimensional tensor. Tensors are described by a unit of dimensionality called the <a href="https://www.tensorflow.org/versions/r0.12/resources/dims_types">rank</a>.
<br />
<br />
<table border="1" cellpadding="0" cellspacing="0" style="border-collapse: collapse; border: solid 1px #DDEEEE; font-family: arial, sans-serif; text-align: left; width: 80%;text-shadow: 1px 1px 1px #fff;">
<tbody>
<tr>
<th style="background-color: #DDEFEF; border: solid 1px #DDEEEE; color: #336B6B; padding: 8px;" width="10%">Rank</th>
<th style="background-color: #DDEFEF; border: solid 1px #DDEEEE; color: #336B6B; padding: 8px;" width="30%">Math entity</th>
<th style="background-color: #DDEFEF; border: solid 1px #DDEEEE; color: #336B6B; padding: 8px;" width="60%">Python example</th>
</tr>
<tr>
<td style="padding: 8px;">0</td>
<td style="padding: 8px;">Scalar (magnitude only)</td>
<td style="padding: 8px;">s = 483</td>
</tr>
<tr>
<td style="padding: 8px;">1</td>
<td style="padding: 8px;">Vector (magnitude and direction)</td>
<td style="padding: 8px;">v = [1.1, 2.2, 3.3]</td>
</tr>
<tr>
<td style="padding: 8px;">2</td>
<td style="padding: 8px;">Matrix (table of numbers)</td>
<td style="padding: 8px;">m = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]</td>
</tr>
<tr>
<td style="padding: 8px;">3</td>
<td style="padding: 8px;">3-Tensor (cube of numbers)</td>
<td style="padding: 8px;">t = [[[2], [4], [6]], [[8], [10], [12]], [[14], [16], [18]]]</td>
</tr>
<tr>
<td style="padding: 8px;">n</td>
<td style="padding: 8px;">n-Tensor</td>
<td style="padding: 8px;">....</td>
</tr>
</tbody></table>
<br />
<br />
TensorBoard is a suite of web applications for visualizing and understanding TensorFlow graphs. To visualize a TensorFlow graph, a FileWriter is used to output the graph to a log directory.<br />
<pre class="brush: python">session = tf.Session()
File = tf.summary.FileWriter('log_simple_graph', session.graph)
session.close()
</pre>
TensorBoard runs as a local web app, on default port 6006 on executing the command <b>tensorboard --logdir="path_to_the_graph"</b>.<br />
<br />
<span style="font-size: large;"><b>DataTypes in Tensorflow</b></span><br />
<b><br /></b>
<b>Constant</b> nodes take no inputs and output the value they store internally.<br />
<b>Placeholder</b> is a parameter of the graph that can accept external inputs. It is a promise to provide a value later. Below is an example.<br />
<pre class="brush: python">import tensorflow as tf
a = tf.placeholder(tf.float32)
b = tf.placeholder(tf.float32)
adder_node = a + b
session = tf.Session()
print(session.run(adder_node,{a: [1,3], b: [2,4]}))
session.close()
</pre>
<br />
A Variable allows adding trainable parameters to a graph. Variables are used to hold and update parameters while training a model. Unlike constants and placeholders, variables must be initialized before using them, as below.<br />
<pre class="brush: python">import tensorflow as tf
W = tf.Variable([.3], tf.float32)
b = tf.Variable([-.3], tf.float32)
x = tf.placeholder(tf.float32)
linear_model = W * x + b
init = tf.global_variables_initializer()
session = tf.Session()
session.run(init)
print(session.run(linear_model, {x:[1,2,3,4]}))
session.close()
</pre>
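<br />
To complete the picture, below is a short sketch (not part of the original post) of how such a linear model could be trained with the TensorFlow 1.x API: a placeholder y holds the target values, a squared-error loss is defined over the model output, and a gradient descent optimizer adjusts W and b.<br />
<pre class="brush: python">import tensorflow as tf

# Trainable parameters and inputs of the linear model
W = tf.Variable([.3], tf.float32)
b = tf.Variable([-.3], tf.float32)
x = tf.placeholder(tf.float32)
y = tf.placeholder(tf.float32)   # target values
linear_model = W * x + b

# Sum of squared errors between predictions and targets
loss = tf.reduce_sum(tf.square(linear_model - y))

# Gradient descent updates W and b to minimize the loss
optimizer = tf.train.GradientDescentOptimizer(0.01)
train = optimizer.minimize(loss)

session = tf.Session()
session.run(tf.global_variables_initializer())
for i in range(1000):
    session.run(train, {x: [1, 2, 3, 4], y: [0, -1, -2, -3]})
print(session.run([W, b]))
session.close()
</pre>
For this toy data the parameters converge towards roughly W = -1 and b = 1, i.e. the line y = -x + 1.<br />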
<br />
<br />Pranav Patilhttp://www.blogger.com/profile/10125896717744035325noreply@blogger.com0tag:blogger.com,1999:blog-7610061084095478663.post-51890043194458146162016-12-19T15:09:00.000-08:002017-01-04T08:38:01.532-08:00Single Sign On with SAML 2.0Security Assertion Markup Language or SAML is the secure XML based communication standard for communicating identities, exchanging authentication and authorization data between parties. SAML is a specification which defines messages and their format, message encoding methods, message exchange protocols, and other recommendations. SAML addresses the primary use case of internet single sign on (SSO) which authenticates the user using a single login into the system and allows access to other affiliated systems without additional authentication. SAML thus eliminates multiple authentication credentials in multiple locations and reduces the number of the users being authenticated. It separates the security framework from platform architecture and specific vendor implementation. SAML involves three entities, the user, identity provider and server provider. The Identity Provider maintains a directory of users and an authentication mechanism to authenticate them. The Service Provider is the target application that a user tries to use.<br />
<br />
SAML consists of six components as follows: assertions, protocols, bindings, profiles, metadata, authentication context. The components mainly enable to transfer secure information like identity, authentication, and authorization information between trusted entities.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://2.bp.blogspot.com/-QHnH-wOqY0w/V6l3CWvAYVI/AAAAAAAAFD0/9v4y5HDKoEo_qAtdde7OJIN0B3k8kcnuQCLcB/s1600/fig2.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://2.bp.blogspot.com/-QHnH-wOqY0w/V6l3CWvAYVI/AAAAAAAAFD0/9v4y5HDKoEo_qAtdde7OJIN0B3k8kcnuQCLcB/s1600/fig2.jpg" /></a></div>
<br />
<b>SAML assertions</b> contain identifying information made by a SAML authority. In SAML, there are three assertions: authentication, attribute, and authorization. Authentication assertion validates that the specified subject is authenticated by a particular means at a particular time and is made by a SAML authority called an identity provider. Attribute assertion contains specific information about the specified subject. And authorization assertion identifies what the specified subject is authorized to do.<br />
<b>SAML protocols</b> define how SAML asks for and receives assertions and the structure and contents of SAML protocols are defined by the SAML-defined protocol XML schema.<br />
<b>SAML bindings</b> define how SAML request-response message exchanges are mapped to communication protocols like Simple Object Access Protocol (SOAP). SAML works with multiple protocols including Hypertext Transfer Protocol (HTTP), Simple Mail Transfer Protocol (SMTP), File Transfer Protocol (FTP) and so on.<br />
<b>SAML profiles</b> define constraints and/or extensions to satisfy the specific use case of SAML. For example, the Web SSO Profile details how SAML authentication assertions are exchanged between entities and what the constraints of SAML protocols and binding are. An attribute profile on the other hand establishes specific rules for interpretation of attributes in SAML attribute assertions. For instance, X.500/LDAP profile details how to carry X.500/LDAP attributes within SAML attribute assertions.<br />
<b>SAML metadata</b> defines a way to express and share configuration information between SAML entities. For instance, an entity's supported SAML bindings, operational roles (IDP, SP, etc), identifier information, supporting identity attributes, and key information for encryption and signing can be expressed using SAML metadata XML documents. SAML Metadata is defined by its own XML schema. In a number of situations, a service provider may need to have detailed information regarding the type and strength of authentication that a user employed when they authenticated at an identity provider.<br />
<b>SAML authentication context</b> is used in (or referred to from) an assertion's authentication statement to carry this information. A service provider can also include an authentication context in a request to an identity provider to request that the user be authenticated using a specific set of authentication requirements, such as a multi-factor authentication.<br />
<div>
<br /></div>
<b>SAML Assertion</b><br />
An assertion is a package of information that supplies zero or more statements made by a SAML authority, usually about a subject such as a user. SAML assertions are issued by the Identity Provider (also called the Asserting Party) to the Service Provider (also called the Relying Party). When the user has authenticated with the Identity Provider, a SAML Assertion is sent to the Service Provider carrying the Identity Provider's information about that user; the subject is represented by the <Subject> element. The SAML specification defines three different kinds of assertion statements that can be created by a SAML authority. All SAML-defined statements are associated with a subject. The three kinds of statement defined in the specification are:<br />
<ul>
<li><b>Authentication</b>: The subject was authenticated by a particular means at a particular time.</li>
<li><b>Attribute</b>: The subject is associated with the supplied attributes with values mostly using LDAP.</li>
<li><b>Authorization Decision</b>: A request to allow the subject to access the specified resource has been granted or denied using the given evidence.</li>
</ul>
The SAML authentication request protocol enables third-party authentication of a subject. It is useful in cases where the scope within which an identifier is used needs to be limited to a small set of system entities.<br />
<br />
<b>Name Identifiers</b><br />
Name Identifiers are identifiers for subjects and for the issuers of assertions and protocol messages. They establish a means by which parties may be associated with identifiers that are meaningful to each of the parties, and they help to limit the scope within which an identifier is used to a small set of system entities. Two or more system entities may use the same name identifier value when referring to different identities. SAML provides name qualifiers to disambiguate a name identifier by effectively placing it in a federated namespace related to the name qualifiers. The <BaseID> element is an extension point that allows applications to add new kinds of identifiers. The NameIDType complex type is used when an element serves to represent an entity by a string-valued name. It is a more restricted form of identifier than the <BaseID> element and is the type underlying both the <NameID> and <Issuer> elements. The <NameID> element is of type NameIDType, and is used in various SAML assertion constructs such as the <Subject> and <SubjectConfirmation> elements, and in various protocol messages. The <EncryptedID> element is of type EncryptedElementType, and carries the content of an unencrypted identifier element in encrypted fashion. The <Issuer> element, with complex type NameIDType, provides information (name etc.) about the issuer of a SAML assertion or protocol message.<br />
<br />
Assertions have the following elements:<br />
<ul>
<li>The <<b>AssertionIDRef</b>> element makes a reference to a SAML assertion by its unique identifier. The specific authority who issued the assertion or from whom the assertion can be obtained is not specified as part of the reference. </li>
<li>The <<b>AssertionURIRef</b>> element makes a reference to a SAML assertion by URI reference.</li>
<li>The <<b>Assertion</b>> element is of the AssertionType complex type and specifies the basic information common to all assertions, such as version, issue time, identifier, issuer, signature, subject, authentication etc.</li>
<li>The SAML assertion MAY be signed by adding the <<b>ds:Signature</b>> element, which provides both authentication of the issuer and integrity protection.</li>
<li>The <<b>EncryptedAssertion</b>> element represents an assertion in encrypted fashion.</li>
</ul>
<br />
The Subjects section defines the SAML constructs used to describe the subject of an assertion. The optional <Subject> element specifies the principal that is the subject of all of the (zero or more) statements in the assertion. It identifies the subject using <BaseID>, <NameID>, or <EncryptedID> and confirms it using <SubjectConfirmation> element. A <Subject> element can contain both an identifier and zero or more subject confirmations which a relying party (service provider) can verify when processing an assertion. A <Subject> element SHOULD NOT identify more than one principal.<br />
<ul>
<li>The <<b>SubjectConfirmation</b>> element provides the means for a relying party (service provider) to verify the correspondence of the subject of the assertion with the party with whom the relying party is communicating. It has a Method attribute which identifies a protocol or mechanism to be used to confirm the subject.</li>
<li>The <<b>SubjectConfirmationData</b>> element specifies additional data that allows the subject to be confirmed or constrains the circumstances under which the act of subject confirmation can take place. The KeyInfoConfirmationDataType complex type constrains a <SubjectConfirmationData> element to contain one or more <ds:KeyInfo> elements that identify cryptographic keys that are used in some way to authenticate an attesting entity.</li>
</ul>
<br />
The <Conditions> element places constraints on the acceptable use of SAML assertions, such as Validity, Audience Restriction, Usage and Proxy Restrictions.<br />
The <Advice> element contains any additional information that the SAML authority wishes to provide to a relying party (service provider).<br />
The <Statement> element is an extension point that allows other assertion-based applications to reuse the SAML assertion framework.<br />
The <AuthnStatement> element describes a statement by the SAML authority asserting that the assertion subject was authenticated by a particular means at a particular time.<br />
The <SubjectLocality> element specifies the DNS domain name and IP address for the system from which the assertion subject was authenticated.<br />
The <AuthnContext> element specifies the context of an authentication event and can contain an authentication context class reference, an authentication context declaration or declaration reference, or both.<br />
The <AttributeStatement> element describes a statement by the SAML authority asserting that the assertion subject is associated with the specified attributes. Assertions containing <AttributeStatement> elements MUST contain a <Subject> element, and the statement carries <Attribute>, <AttributeValue> and <EncryptedAttribute> elements. The related <AuthzDecisionStatement> element similarly describes an authorization decision made for the subject.<br />
<br />
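To make the structure concrete, below is a small hand-written assertion skeleton (an illustration, not an excerpt from the SAML specification) showing the issuer, a subject with bearer confirmation, conditions with an audience restriction, and an authentication statement; the entity URLs, IDs and timestamps are placeholder assumptions.<br />
<pre>
<saml:Assertion xmlns:saml="urn:oasis:names:tc:SAML:2.0:assertion"
                ID="_assertion-id-placeholder" Version="2.0"
                IssueInstant="2016-12-19T15:09:00Z">
  <saml:Issuer>https://idp.example.com/metadata</saml:Issuer>
  <saml:Subject>
    <saml:NameID Format="urn:oasis:names:tc:SAML:2.0:nameid-format:persistent">user-1234</saml:NameID>
    <saml:SubjectConfirmation Method="urn:oasis:names:tc:SAML:2.0:cm:bearer">
      <saml:SubjectConfirmationData NotOnOrAfter="2016-12-19T15:14:00Z"
                                    Recipient="https://sp.example.com/acs"/>
    </saml:SubjectConfirmation>
  </saml:Subject>
  <saml:Conditions NotBefore="2016-12-19T15:09:00Z" NotOnOrAfter="2016-12-19T15:14:00Z">
    <saml:AudienceRestriction>
      <saml:Audience>https://sp.example.com/metadata</saml:Audience>
    </saml:AudienceRestriction>
  </saml:Conditions>
  <saml:AuthnStatement AuthnInstant="2016-12-19T15:09:00Z">
    <saml:AuthnContext>
      <saml:AuthnContextClassRef>urn:oasis:names:tc:SAML:2.0:ac:classes:PasswordProtectedTransport</saml:AuthnContextClassRef>
    </saml:AuthnContext>
  </saml:AuthnStatement>
</saml:Assertion>
</pre>
<br />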
SAML protocol messages can be generated and exchanged using a variety of protocols. The protocols defined by SAML achieve the following actions:<br />
<ul>
<li>Returning one or more requested assertions. This can occur in response to either a direct request for specific assertions or a query for assertions that meet particular criteria.</li>
<li>Performing authentication on request and returning the corresponding assertion.</li>
<li>Registering a name identifier or terminating a name registration on request.</li>
<li>Retrieving a protocol message that has been requested by means of an artifact.</li>
<li>Performing a near-simultaneous logout of a collection of related sessions (“single logout”) on request.</li>
<li>Providing a name identifier mapping on request.</li>
</ul>
<br />
<b>RequestAbstractType</b><br />
All SAML requests are of types that are derived from the abstract RequestAbstractType complex type. It has following attributes:<br />
<ul>
<li><b>ID</b>: An unique identifier required for the request.</li>
<li><b>Version</b>: The version of this request.</li>
<li><b>IssueInstant</b>: The time instant of issue of the request encoded in UTC.</li>
<li><b>Destination</b>: An optional URI reference indicating the address to which this request has been sent in order to prevent malicious forwarding of requests to unintended recipients.</li>
<li><b>Consent</b>: An optional indicator of whether consent was obtained from a principal in the sending of this request.</li>
<li><b>Issuer</b>: An optional <saml:Issuer> identifies the entity that generated the request message.</li>
<li><b>Signature</b>: An optional <ds:Signature> that authenticates the requester and provides message integrity.</li>
<li><b>Extensions</b>: The extension point contains optional protocol message extension elements that are agreed on between the communicating parties.</li>
</ul>
<br />
<b>StatusResponseType</b><br />
All SAML responses are of types that are derived from the StatusResponseType complex type. It has the following attributes and elements:<br />
<ul>
<li><b>ID</b>: A unique identifier required for the response.</li>
<li><b>InResponseTo</b>: A reference to the identifier of the request to which the response corresponds.</li>
<li><b>Version</b>: The version of this response.</li>
<li><b>IssueInstant</b>: The time instant of issue of the response encoded in UTC.</li>
<li><b>Destination</b>: An optional URI reference indicating the address to which this response has been sent in order to prevent malicious forwarding of responses to unintended recipients.</li>
<li><b>Consent</b>: An optional indicator of whether consent was obtained from a principal in the sending of this response.</li>
<li><b>Issuer</b>: An optional <saml:Issuer> identifies the entity that generated the response message.</li>
<li><b>Signature</b>: An optional <ds:Signature> that authenticates the responder and provides message integrity.</li>
<li><b>Extensions</b>: The extension point contains optional protocol message extension elements that are agreed on between the communicating parties.</li>
<li><b>Status</b>: The <Status> is the required code representing the status of the corresponding request. It contains <StatusCode> representing the status of the activity, <StatusMessage> and <StatusDetail> having additional information concerning the status of the request. The status code can be successful with code as "urn:oasis:names:tc:SAML:2.0:status:Success" or varying failure codes as in the specification.</li>
</ul>
The existing assertions can be requested by uniquely identified reference using <AssertionIDRequest> or queried for assertions by subject or statement type using the <SubjectQuery>, <AuthnQuery>, <RequestedAuthnContext>, <AttributeQuery> and <AuthzDecisionQuery> elements.<br />
<br />
The <Response> message element which is an extension of StatusResponseType is used when a response consists of a list of zero or more assertions that satisfy the request. It has additional <saml:Assertion> or <saml:EncryptedAssertion> which specifies an assertion or encrypted assertion by value. In response to a SAML-defined query message, every assertion returned by a SAML authority must contain a <saml:Subject> element that matches the <saml:Subject> element found in the query. The identifier element (<BaseID>, <NameID>, or <EncryptedID>) or at least one <saml:SubjectConfirmation> element must match between the <saml:Subject> elements of the query and its response.<br />
<br />
<b>Authentication Request Protocol</b><br />
When a principal wants to obtain assertions containing authentication statements to establish a security context at one or more relying parties, it uses the Authentication Request Protocol to send an <AuthnRequest> message element to a SAML authority. The returned <Response> message contains one or more assertions, at least one of which contains at least one authentication statement. Initially the requester creates the authentication request. The presenter then sends the <AuthnRequest> message, carrying the properties required for the resulting assertion, to the identity provider and either authenticates itself to the identity provider or relies on an existing security context to establish its identity. The authentication of the presenter may take place before, during, or after the initial delivery of the <AuthnRequest> message. The <AuthnRequest> message is usually signed or otherwise authenticated by the protocol binding used to deliver the message. An identity provider provides identifiers for users looking to interact with a system, and issues an assertion along with an authentication statement. The presenter of the request is ideally the attesting entity which satisfies the subject confirmation requirements within the <SubjectConfirmation> elements of the resulting assertion. The responder replies to an <AuthnRequest> with a <Response> message that either contains one or more assertions meeting the specifications defined by the request or a <Status> describing the error that occurred. The responder may also direct the presenter to another identity provider by issuing its own <AuthnRequest> message, so that the resulting assertion can be used to authenticate the presenter to the original responder. The returned assertion(s) contain a <saml:Subject> element which represents the presenter; the identifier type and format are determined by the identity provider. At least one statement in at least one assertion is a <saml:AuthnStatement> which describes the authentication performed by the responder or an authentication service associated with it. The resulting assertion(s) also contain a <saml:AudienceRestriction> element referencing the requester as an acceptable relying party (service provider). The relying party consumes the assertion(s) to establish a security context and to authenticate or authorize the requested subject in order to provide a service. When authenticating the same presenter for a second requester, the identity provider may skip creating a new <AuthnRequest> for the authenticating identity provider.<br />
<br />
The <AuthnRequest> element extends RequestAbstractType and adds the additional attributes and elements below; a minimal sketch of constructing such a request follows the list.<br />
<ul>
<li><b><saml:Subject></b>: The optional subject attribute specifies the requested subject of the resulting assertion. It may include one or more <saml:SubjectConfirmation> elements to indicate how and/or by whom the resulting assertions can be confirmed. When subject attribute is absent the presenter of the message is presumed to be the requested subject. When no <saml:SubjectConfirmation> elements are included, then the presenter is presumed to be the only attesting entity required.</li>
<li><b><NameIDPolicy></b>: It specifies the constraints on the name identifier (e.g. the name identifier format for the URI reference) to be used to represent the requested subject. If omitted, any type of identifier supported by the identity provider for the requested subject can be used.</li>
<li><b><saml:Conditions></b>: Specifies the SAML conditions the requester expects to limit the validity and/or use of the resulting assertion(s).</li>
<li><b><RequestedAuthnContext></b>: Specifies the requirements the requester places on the authentication context that applies to the responding provider's authentication of the presenter.</li>
<li><b><Scoping></b>: It specifies a set of identity providers trusted by the requester to authenticate the presenter, as well as limitations and context related to proxying of the <AuthnRequest> message to subsequent identity providers by the responder.</li>
<li><b>ForceAuthn</b>: When "true" the identity provider must authenticate the presenter directly rather than rely on a previous security context. Default value is false.</li>
<li><b>IsPassive</b>: When "true" the identity provider and the user agent itself must not visibly take control of the user interface from the requester and interact with the presenter. Default value is false.</li>
<li><b>AssertionConsumerServiceIndex</b>: Indirectly identifies the location to which the <Response> message should be returned to the requester. It applies only to profiles in which the requester is different from the presenter. When omitted the identity provider returns the <Response> message to the default location associated with the requester for the profile of use. It is mutually exclusive with the AssertionConsumerServiceURL and ProtocolBinding attributes.</li>
<li><b>AssertionConsumerServiceURL</b>: Specifies by value the location to which the <Response> message should be returned to the requester. The responder must ensure that the value specified is actually associated with the requester, for example by requiring the enclosing <AuthnRequest> message to be signed.</li>
<li><b>ProtocolBinding</b>: A URI reference that identifies a SAML protocol binding to be used when returning the <Response> message.</li>
<li><b>AttributeConsumingServiceIndex</b>: Indirectly identifies information associated with the requester describing the SAML attributes the requester desires or requires to be supplied by the identity provider in the <Response> message.</li>
<li><b>ProviderName</b>: Human-readable name of the requester used by the presenter's user agent or the identity provider.</li>
</ul>
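As a rough, hypothetical illustration of the fields above, the sketch below assembles a minimal <AuthnRequest> as an XML string. The issuer, destination and assertion consumer service URLs are made-up placeholders, and a real deployment would use a SAML library rather than string templates.<br />
<pre class="brush: scala">import java.time.Instant
import java.util.UUID

// Minimal, hypothetical builder for an AuthnRequest; the URLs passed in are placeholders.
object AuthnRequestSketch {
  def build(issuer: String, acsUrl: String, destination: String): String = {
    val id = "_" + UUID.randomUUID().toString   // the ID attribute must not start with a digit
    val issueInstant = Instant.now().toString   // IssueInstant encoded in UTC
    s"""<samlp:AuthnRequest xmlns:samlp="urn:oasis:names:tc:SAML:2.0:protocol"
       |    xmlns:saml="urn:oasis:names:tc:SAML:2.0:assertion"
       |    ID="$id" Version="2.0" IssueInstant="$issueInstant"
       |    Destination="$destination"
       |    AssertionConsumerServiceURL="$acsUrl"
       |    ProtocolBinding="urn:oasis:names:tc:SAML:2.0:bindings:HTTP-POST">
       |  <saml:Issuer>$issuer</saml:Issuer>
       |  <samlp:NameIDPolicy Format="urn:oasis:names:tc:SAML:2.0:nameid-format:persistent" AllowCreate="true"/>
       |</samlp:AuthnRequest>""".stripMargin
  }

  def main(args: Array[String]): Unit =
    println(build("https://sp.example.org/metadata", "https://sp.example.org/acs", "https://idp.example.org/sso"))
}
</pre>
<br />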
<b><br /></b>
<b>Artifact Resolution Protocol</b><br />
The Artifact Resolution Protocol provides the mechanism to transport SAML protocol messages in a SAML binding by reference instead of by value. Both requests and responses can be obtained by reference using this protocol. A message sender sends a small piece of data called an artifact using one binding instead of binding the full message to a transport protocol. It is mainly used when a binding cannot carry the message due to size constraints, or when the parties wish to use a secure channel without signatures. The <ArtifactResolve> message is used to request that a SAML protocol message be returned in an <ArtifactResponse> message by specifying an artifact that represents the SAML protocol message. The <ArtifactResolve> message is either signed or protected by the protocol binding used to deliver the message. The ArtifactResolve element extends RequestAbstractType and adds the <Artifact> value that the requester received and now wishes to translate into the protocol message it represents. If the responder recognizes the artifact as valid, it responds with the associated protocol message in an <ArtifactResponse> message element; otherwise the response has no embedded message.<br />
<br />
<b>Protocol Bindings</b><br />
Mappings of SAML request-response message exchanges onto standard messaging or communication protocols are called SAML protocol bindings. A binding is a mapping of SAML messages to a representation that can be transmitted by an HTTP client over the network. All bindings must use HTTP with Secure Sockets Layer (SSL) or Transport Layer Security (TLS). SAML also offers mechanisms for parties to authenticate to one another, but in addition SAML may use other authentication mechanisms to provide security for SAML itself, especially when a message passes through intermediaries.<br />
<br />
<b>RelayState</b><br />
Some bindings define a "RelayState" mechanism for preserving and conveying state information. The RelayState parameter is used to restore the original application URL so that the user can return to the application along with a SAML assertion. Because exposing the application URL in SAML messages can be a security risk, for service provider-initiated SSO the service provider saves the URL (for example in a cookie) and places the name of the cookie in the relay state. For identity provider-initiated SSO this option is not available; instead the identity provider places an alias for the application in the relay state and the service provider maps the alias back to the application. In the identity provider-initiated single sign-on scenario, RelayState thus identifies the specific resource at the service provider to which the user should be sent.<br />
<div>
<br /></div>
<div>
<div>
<b>SAML SOAP Binding</b></div>
<div>
SOAP is a lightweight protocol which uses XML technologies to define an extensible messaging framework, providing a message construct that can be exchanged over a variety of underlying protocols. A SOAP message is fundamentally a one-way transmission between SOAP nodes from a SOAP sender to a SOAP receiver, possibly routed through one or more SOAP intermediaries. SOAP defines an XML message envelope that includes header and message body sections, allowing data and control information to be transmitted. SAML request-response protocol elements are enclosed within the SOAP message body. SAML messages can be transported using SOAP without re-encoding from the standard SAML schema to one based on the optional SOAP encoding system. A single SOAP message must not contain more than one SAML request or response element, nor any additional XML elements in the SOAP body. A SAML SOAP request may have arbitrary headers but does not require any headers to process SAML messages. Caching of SAML messages by HTTP proxies is prevented using the Cache-Control and Pragma headers. A SOAP processing error is returned with a <SOAP-ENV:fault> element, while a SAML error is returned with the <samlp:Status> element within the SOAP body.</div>
<div>
<br /></div>
<div>
Example SOAP Request</div>
<pre class="brush: xml"> POST /SamlService HTTP/1.1
Host: www.example.com
Content-Type: text/xml
Content-Length: nnn
SOAPAction: http://www.oasis-open.org/committees/security
 <soap-env:Envelope xmlns:soap-env="http://schemas.xmlsoap.org/soap/envelope/">
   <soap-env:Body>
     <samlp:AttributeQuery ID="8b9" IssueInstant="2004-03-27T08:41:00Z" Version="2.0"
         xmlns:ds="..." xmlns:saml="..." xmlns:samlp="...">
       <ds:Signature> ... </ds:Signature>
       <saml:Subject>
         ...
       </saml:Subject>
     </samlp:AttributeQuery>
   </soap-env:Body>
 </soap-env:Envelope>
</pre>
<br />
<div>
Example SOAP Response</div>
<pre class="brush: xml"> HTTP/1.1 200 OK
Content-Type: text/xml
Content-Length: nnnn
 <soap-env:Envelope xmlns:soap-env="http://schemas.xmlsoap.org/soap/envelope/">
   <soap-env:Body>
     <samlp:Response ID="8b9" IssueInstant="2004-03-27T08:42:00Z" Version="2.0"
         xmlns:ds="..." xmlns:saml="..." xmlns:samlp="...">
       <saml:Issuer>https://www.example.com/SAML</saml:Issuer>
       <ds:Signature> ... </ds:Signature>
       <samlp:Status>
         <samlp:StatusCode Value="..."/>
       </samlp:Status>
       <saml:Assertion>
         <saml:Subject>
           ...
         </saml:Subject>
         <saml:AttributeStatement>
           ...
         </saml:AttributeStatement>
       </saml:Assertion>
     </samlp:Response>
   </soap-env:Body>
 </soap-env:Envelope>
</pre>
<br />
<div>
<b>HTTP Redirect Binding</b></div>
<div>
SAML protocol messages are transmitted within URL parameters using the HTTP Redirect binding. The XML messages are encoded into the URL using specialized URL encodings and transmitted using the HTTP GET method, while messages with more complex content are sent using the HTTP POST or Artifact bindings. Binding endpoints indicate the encodings they support in their metadata. A URL encoding places the message entirely within the URL query string, and reserves the rest of the URL for the endpoint of the message recipient. A query string parameter named SAMLEncoding is used to identify the encoding mechanism used. When the SAMLEncoding parameter is omitted, the default value is urn:oasis:names:tc:SAML:2.0:bindings:URL-Encoding:DEFLATE, i.e. DEFLATE encoding, which is supported by all endpoints.<br />
Before the DEFLATE compression mechanism is applied to the entire XML content of the original SAML protocol message, any signature on the SAML protocol message, including the <ds:Signature> XML element, is removed. The compressed data is then base64-encoded with linefeeds and whitespace removed. The base64-encoded data is then URL-encoded and added to the URL as a SAMLRequest or SAMLResponse query string parameter, depending on whether the message is a SAML request or response. RelayState may be included with a SAML protocol message transmitted using the HTTP Redirect binding; the RelayState data is URL-encoded and placed in an additional RelayState query string parameter. The RelayState value must not exceed 80 bytes in length and should be integrity protected by the sender, for example by using a checksum or a reference to a pseudo-random value. The responder sends the RelayState parameter back in the SAML protocol response. If the original SAML protocol message was signed with an XML digital signature, then the URL-encoded form of the message is signed instead. An additional query string parameter SigAlg is included, identifying the signature algorithm used to sign the URL-encoded SAML protocol message. The signature is constructed by concatenating the SAMLRequest (or SAMLResponse), RelayState (if present) and SigAlg query parameters in the order SAMLRequest=value&RelayState=value&SigAlg=value; the resulting octet string is fed into the signature algorithm. The signature is then encoded using base64 encoding with any whitespace removed, and included as a query string parameter named Signature. The supported signature algorithms are <a href="http://www.w3.org/2000/09/xmldsig#dsa-sha1">DSAwithSHA1</a> and <a href="http://www.w3.org/2000/09/xmldsig#rsa-sha1">RSAwithSHA1</a>, identified by their URI representations. The order of the query string parameters on the resulting URL varies depending upon the implementation, and the URL encoding is not canonical (there are multiple encodings for a given value), hence the relying party performs the verification step using the original URL-encoded values it received on the query string. A sample signed SAML URL is <u>https://idp.com?SAMLResponse=xxxx&SigAlg=xxxx&Signature=xxxx</u><br />
When the message is signed, the Destination XML attribute in the root SAML element of the message contains the URL to which the sender has instructed the user agent to deliver the message. The recipient verifies that the value matches with the location at which the message has been received.<br />
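The following sketch applies the steps described above: raw DEFLATE, base64, URL-encoding, and a signature computed over the SAMLRequest, RelayState and SigAlg query parameters. It is illustrative only; the SAML message and relay-state values are placeholders, the RSA key pair is a throwaway generated on the spot, and the SHA1withRSA JCA algorithm name is assumed to correspond to the RSAwithSHA1 URI mentioned above.<br />
<pre class="brush: scala">import java.net.URLEncoder
import java.nio.charset.StandardCharsets.UTF_8
import java.security.{KeyPairGenerator, Signature}
import java.util.Base64
import java.util.zip.Deflater

object RedirectBindingSketch {
  // DEFLATE (raw, no zlib header) then base64-encode the SAML message, per the redirect binding.
  def deflateAndEncode(xml: String): String = {
    val deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, /* nowrap = */ true)
    deflater.setInput(xml.getBytes(UTF_8)); deflater.finish()
    val buf = new Array[Byte](8192)
    val out = new java.io.ByteArrayOutputStream()
    while (!deflater.finished()) out.write(buf, 0, deflater.deflate(buf))
    Base64.getEncoder.encodeToString(out.toByteArray)
  }

  def urlEncode(s: String): String = URLEncoder.encode(s, "UTF-8")

  def main(args: Array[String]): Unit = {
    val samlRequest = "<samlp:AuthnRequest .../>"   // placeholder SAML message
    val relayState  = "appAlias42"                  // placeholder relay state, at most 80 bytes
    val sigAlg      = "http://www.w3.org/2000/09/xmldsig#rsa-sha1"

    // Query string in the order SAMLRequest, RelayState, SigAlg; the signature is computed over it.
    val query = s"SAMLRequest=${urlEncode(deflateAndEncode(samlRequest))}" +
                s"&RelayState=${urlEncode(relayState)}&SigAlg=${urlEncode(sigAlg)}"

    val keyPair = KeyPairGenerator.getInstance("RSA").generateKeyPair() // throwaway key for the sketch
    val signer  = Signature.getInstance("SHA1withRSA")
    signer.initSign(keyPair.getPrivate)
    signer.update(query.getBytes(UTF_8))
    val signature = urlEncode(Base64.getEncoder.encodeToString(signer.sign()))

    println(s"https://idp.example.org/sso?$query&Signature=$signature")
  }
}
</pre>
<br />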
<div class="separator" style="clear: both; text-align: center;">
<a href="https://4.bp.blogspot.com/-bvQefq6SwtQ/V6lNckj3E4I/AAAAAAAAFDg/i2Td80YtdDcGQ7rz-i19i6B65NW5qvNmQCLcB/s1600/url_redirect.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://4.bp.blogspot.com/-bvQefq6SwtQ/V6lNckj3E4I/AAAAAAAAFDg/i2Td80YtdDcGQ7rz-i19i6B65NW5qvNmQCLcB/s1600/url_redirect.png" /></a></div>
</div>
</div>
<div>
<br />
Below are the HTTP request response message exchanges using the HTTP redirect binding.<br />
<ul>
<li>When the user agent first makes an arbitrary HTTP request to a system entity, the system entity decides to initiate a SAML protocol exchange in order to process the request.</li>
<li>The system entity acting as a SAML requester responds to the HTTP request from the user agent by returning a SAML request. The SAML request is returned encoded into the HTTP response's Location header with HTTP status 303 or 302. The user agent delivers the SAML request by issuing an HTTP GET request to the SAML responder.</li>
<li>The SAML responder may respond to the SAML request by immediately returning a SAML response or might return arbitrary content to facilitate subsequent interaction with the user agent necessary to fulfill the request.</li>
<li>The responder returns a SAML response to the user agent in a similar way as SAML requester responds to the HTTP request. The SAML response is then returned to the SAML requester.</li>
<li>Upon receiving the SAML response, the SAML requester returns an arbitrary HTTP response to the user agent.</li>
</ul>
<br />
<div>
If the signature and assertion are valid, the service provider establishes a session for the user and redirects the browser to the target resource.</div>
</div>
<br />
<b>HTTP POST Binding</b><br />
SAML protocol messages can be transmitted within the base64-encoded content of an HTML form control using the HTTP POST binding. It is used when the communicating parties do not share a direct path of communication and the SAML requester or responder needs to communicate through an HTTP user agent as an intermediary, and also when the responder requires interaction with the user agent to fulfill the request. XML messages with this binding are encoded into an HTML form control and are transmitted using the HTTP POST method. A SAML protocol message is form-encoded by applying the base64 encoding rules to the XML representation of the message and placing the result in a hidden form control within a form. Depending on whether the message is a SAML request or a SAML response, the form control is named SAMLRequest or SAMLResponse respectively. RelayState is optional and, when included, is placed in a hidden form control named RelayState with a maximum length of 80 bytes. The action attribute of the form is the recipient's HTTP endpoint to which the SAML message is delivered, with the method attribute being "POST". All the form control values are transformed so that they can be included in an HTML document.<br />
The intermediary user agent prevents the parties from relying on the transport layer for end-to-end authentication, hence SAML enables form-encoded messages to be signed before applying the base64 encoding. When the message is signed, the Destination XML attribute in the root SAML element contains the URL to which the sender instructed the user agent to deliver the message, and the recipient verifies that the value matches the location at which the message has been received. The individual RelayState and SAML message values can be integrity protected, but not the combination. A minimal sketch of the form encoding follows below.<br />
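The sketch below, assuming a hypothetical endpoint URL and relay-state alias, base64-encodes a placeholder SAML message and embeds it in a self-contained HTML form of the shape this binding requires; a real implementation would also HTML-escape the values and sign the message.<br />
<pre class="brush: scala">import java.nio.charset.StandardCharsets.UTF_8
import java.util.Base64

object PostBindingSketch {
  // Base64-encode the SAML message and place it in a hidden form control named SAMLRequest;
  // the form's action is the recipient endpoint and the method is POST, as the binding requires.
  def formFor(samlRequest: String, endpoint: String, relayState: String): String = {
    val encoded = Base64.getEncoder.encodeToString(samlRequest.getBytes(UTF_8))
    s"""<form method="post" action="$endpoint">
       |  <input type="hidden" name="SAMLRequest" value="$encoded"/>
       |  <input type="hidden" name="RelayState" value="$relayState"/>
       |  <input type="submit" value="Continue"/>
       |</form>""".stripMargin
  }

  def main(args: Array[String]): Unit =
    println(formFor("<samlp:AuthnRequest .../>", "https://idp.example.org/sso", "appAlias42"))
}
</pre>
<br />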
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://2.bp.blogspot.com/-HjAx6v1QbeM/V6lODWZbx2I/AAAAAAAAFDk/SCUmPgc4mK4_glGcD0hzqOAqSC4wa7UtACLcB/s1600/post_redirect.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://2.bp.blogspot.com/-HjAx6v1QbeM/V6lODWZbx2I/AAAAAAAAFDk/SCUmPgc4mK4_glGcD0hzqOAqSC4wa7UtACLcB/s1600/post_redirect.png" /></a></div>
<br />
<span style="font-family: "times new roman";">Below are the HTTP request response message exchanges using the HTTP POST binding.</span><br />
<ul>
<li>When the user agent makes an arbitrary HTTP request to a system entity, the system entity initiates a SAML protocol exchange.</li>
<li>The system entity acting as a SAML requester responds to an HTTP request from the user agent by returning a SAML request. The user agent delivers the SAML request by issuing an HTTP POST request to the SAML responder.</li>
<li>The SAML responder responds to the SAML request by immediately returning a SAML response or returns arbitrary content to facilitate subsequent interaction with the user agent necessary to fulfill the request.</li>
<li>The responder finally returns a SAML response to the user agent to be returned to the SAML requester.</li>
<li>Upon receiving the SAML response, the SAML requester returns an arbitrary HTTP response to the user agent.</li>
</ul>
<br />
<b>HTTP Artifact Binding</b><br />
In the HTTP Artifact binding, the SAML request, the SAML response, or both are transmitted by reference using a small stand-in called an artifact. A separate, synchronous binding, such as the SAML SOAP binding, is used to exchange the artifact for the actual protocol message using the artifact resolution protocol defined in the SAML assertions and protocols specification. The artifact binding can be composed with the HTTP Redirect binding to transmit request and response messages in a single protocol exchange using two different bindings. Since the artifact is resolved using another synchronous binding, a direct communication path must exist between the SAML message sender and recipient in the reverse direction of the artifact's transmission. Either URL parameter encoding or an HTML form control is used to encode the artifact. RelayState with a value up to 80 bytes can be included with a SAML artifact transmitted using the artifact binding. The general format of an artifact includes a mandatory two-byte artifact type code (TypeCode) and a two-byte index value identifying a specific endpoint of the artifact resolution service of the issuer (EndpointIndex). Each issuer is assigned an identifying URI, also known as the issuer's entity ID. The remaining content of the SAML-defined artifact type contains a 20-byte SourceID, derived from the issuer's entity ID, which is used by the artifact receiver to determine the artifact issuer's identity and the set of possible resolution endpoints maintained by the destination site. A hypothetical sketch of this layout appears below.<br />
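The following sketch assembles an artifact of the layout just described, assuming the SAML-defined type code 0x0004, a SourceID computed as the SHA-1 hash of the issuer's entity ID, and a random 20-byte message handle appended after the SourceID; the issuer URL is a placeholder.<br />
<pre class="brush: scala">import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets.UTF_8
import java.security.{MessageDigest, SecureRandom}
import java.util.Base64

object ArtifactSketch {
  // Sketch of the type 0x0004 artifact layout: 2-byte TypeCode, 2-byte EndpointIndex,
  // 20-byte SourceID (SHA-1 of the issuer's entity ID) and a 20-byte random message handle.
  def build(issuerEntityId: String, endpointIndex: Int): String = {
    val sourceId = MessageDigest.getInstance("SHA-1").digest(issuerEntityId.getBytes(UTF_8))
    val handle = new Array[Byte](20)
    new SecureRandom().nextBytes(handle)
    val buf = ByteBuffer.allocate(44)
    buf.putShort(0x0004.toShort).putShort(endpointIndex.toShort).put(sourceId).put(handle)
    Base64.getEncoder.encodeToString(buf.array())   // the base64 form travels in the SAMLart parameter
  }

  def main(args: Array[String]): Unit =
    println(build("https://idp.example.org/metadata", 0))
}
</pre>
<br />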
<br />
<span style="font-family: "times new roman";">Below are the HTTP request response message exchanges using the HTTP Artifact binding.</span><br />
<ul>
<li>When the user agent makes an arbitrary HTTP request to a system entity, the system entity decides to initiate a SAML protocol exchange.</li>
<li>The system entity acting as a SAML requester responds to an HTTP request from the user agent by returning an artifact representing a SAML request. If URL-encoded, the artifact is returned encoded into the HTTP response's Location header, and the HTTP status is either 303 or 302. If form-encoded, then the artifact is returned in an XHTML document containing the form and content. The user agent delivers the artifact by issuing either an HTTP GET or POST request to the SAML responder.</li>
<li>The SAML responder determines the SAML requester by examining the artifact and issues a <samlp:ArtifactResolve> request containing the artifact to the SAML requester using a direct SAML binding, thus temporarily reversing roles.</li>
<li>The SAML requester returns a <samlp:ArtifactResponse> containing the original SAML request message it wishes the SAML responder to process.</li>
<li>The SAML responder responds to the SAML request by either immediately returning a SAML artifact or returning arbitrary content to facilitate subsequent interaction with the user agent necessary to fulfill the request.</li>
<li>The responder finally returns a SAML artifact to the user agent to be returned to the SAML requester. The SAML requester determines the SAML responder by examining the artifact, and issues a <samlp:ArtifactResolve> request containing the artifact to the SAML responder using a direct SAML binding.</li>
<li>The SAML responder returns a <samlp:ArtifactResponse> containing the SAML response message it wishes the requester to process.</li>
<li>Upon receiving the SAML response, the SAML requester returns an arbitrary HTTP response to the user agent.</li>
</ul>
<br />
<b>SAML URI Binding</b><br />
<div>
The SAML URI Binding supports the encapsulation of a <samlp:AssertionIDRequest> message with a single <saml:AssertionIDRef> into the resolution of a URI. URI resolution can occur over multiple underlying transports mostly HTTP with SSL 3.0. A SAML URI reference identifies a specific SAML assertion. The result of resolving the URI is a message containing the assertion, or a transport-specific error. The specific format of the message depends on the underlying transport protocol. If the transport protocol permits the returned content to be described, then the assertion may be encoded in a custom format. When the same URI reference is resolved in the future, then either the same SAML assertion, or an error, is returned. The SAML reference should consistently reference the same SAML assertion.</div>
<div>
<br /></div>
Pranav Patilhttp://www.blogger.com/profile/10125896717744035325noreply@blogger.com0tag:blogger.com,1999:blog-7610061084095478663.post-49052678150047340892016-01-14T20:09:00.060-08:002021-07-10T19:16:34.448-07:00Functional Programming in ScalaFunctional programming (FP) is based on the simple premise that we build programs using only pure functions with no side effects, a constraint that has far-reaching implications. A pure function is one that lacks side effects. A side effect is anything a function does other than returning a result, such as modifying a variable or a field in an object, or writing to an external file. In functional programming, functions are first-class citizens: a function can be created within a function, passed as an argument to other functions, or returned from another function.<br />
<br />
A function f with input type A and output type B (written in Scala as a single type: A => B, pronounced “A to B” or “A arrow B”) is a computation that relates every value a of type A to exactly one value b of type B such that b is determined solely by the value of a. Any changing state of an internal or external process is irrelevant to computing the result f(a). Hence when a function has no observable effect on the execution of the program other than to compute a result given its inputs, then we say that it has no side effects. For example, a function intToString having type Int => String will take every integer to a corresponding string. A pure function is modular and composable as it separates the logic of the computation itself from “what to do with the result” and “how to obtain the input”. Such computation logic is reusable with no side effects.<br />
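As a tiny illustration (the function names here are arbitrary), a pure function and an impure variant of it might look like this:<br />
<pre class="brush: scala">// A pure function: the result depends only on the input, and nothing else is observed or modified.
def intToString(i: Int): String = i.toString

// A function with a side effect: besides returning a value it also writes to the console.
def intToStringNoisy(i: Int): String = { println(s"converting $i"); i.toString }
</pre>
<br />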
<br />
Referential transparency and purity: An expression e is referentially transparent (RT) if, for all programs p, all occurrences of e in p can be replaced by the result of evaluating e without affecting the meaning of p. A function f is pure if the expression f(x) is referentially transparent for all referentially transparent x. In other words, an expression is referentially transparent if, in any program, it can be replaced by its result without changing the meaning of the program. Referential transparency forces the invariant that everything a function does is represented by the value that it returns, according to the result type of the function. When expressions are referentially transparent, computation can proceed using the substitution model, where at each step we replace a term with an equivalent one.<br />
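A small sketch of the difference: reversing an immutable String is referentially transparent, while appending to a mutable StringBuilder is not, because the "same" expression produces different results on successive evaluations.<br />
<pre class="brush: scala">// Referentially transparent: x.reverse can be replaced by its value anywhere without changing the program.
val x  = "Hello, World"
val r1 = x.reverse        // "dlroW ,olleH"
val r2 = x.reverse        // "dlroW ,olleH" -- always the same result

// Not referentially transparent: append mutates the builder in place,
// so evaluating the same expression twice yields different results.
val sb = new StringBuilder("Hello")
val s1 = sb.append(", World").toString   // "Hello, World"
val s2 = sb.append(", World").toString   // "Hello, World, World"
</pre>
<br />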
<div>
<br />
<div>
<b><span style="font-size: large;">Data Types and Variables</span></b></div>
<div>
Scala is a pure object-oriented language in the sense that everything is an object, including numbers and functions, with no such thing as a primitive type. Each object may have zero or more members; a member is either a method declared with the def keyword, or another object declared with val or object.<br />
Scala uses the keyword <b>var</b> to declare a variable, and the keyword <b>val</b> to declare a constant. The value of a constant declared using <b>val</b> cannot be changed, hence it is called an immutable value. The type of a variable is specified after the variable name, with a colon in between and before the equals sign (e.g. var sum:Int = 0). A variable may or may not have an initial value during declaration. The Scala compiler can determine the type of the variable based on the initial value assigned to it, which is called type inference. In Scala, statements are separated by newlines or by semicolons, and newlines delimit statements in a block. It should be noted that the ++ operator on numeric variables, e.g. x++, is not allowed in Scala; a short example follows.<br />
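<pre class="brush: scala">var sum: Int = 0          // mutable variable with an explicit type annotation
sum = sum + 1             // reassignment is allowed for var
val name = "Scala"        // the type String is inferred from the initial value
// name = "Java"          // compile error: reassignment to val
// sum++                  // compile error: ++ is not defined for Int; use += instead
sum += 1
</pre>
<br />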
<br />
<div><span style="font-size: large;"><b>Scala Type Hierarchy</b></span></div><div>There are no primitive types in Scala (unlike Java). All data types in Scala are objects that have methods to operate on their data. All of Scala’s types exist as part of a type hierarchy, and every class defined automatically belongs to this hierarchy. <b>Any </b>is the superclass of all classes, also called the top type. It defines certain universal methods such as equals, hashCode, and toString. <b>Any </b>has two direct subclasses, <b>AnyVal </b>and <b>AnyRef</b>.</div><br /><b>AnyVal </b>represents predefined value classes corresponding to the primitive types in Java. There are nine predefined value types and they are non-nullable: Double, Float, Long, Int, Short, Byte, Char, Unit, and Boolean. Scala has both numeric (e.g., Int and Double) and non-numeric types (e.g., String) that can be used to define values and variables. Boolean variables can only be true or false. Char literals are written with single quotes.<br /><br /><b>AnyRef </b>represents reference classes. All non-value types are defined as reference types. User-defined classes define reference types by default as they are always (indirectly) subclasses of scala.AnyRef (similar to java.lang.Object in Java).<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://1.bp.blogspot.com/-uGfp3A4pyxA/Xfh0t9cd39I/AAAAAAAAY9I/V1989DQwFhkZmynw9VRZHVRa_zdSTltkACLcBGAsYHQ/s1600/ScalaTypeHierarchy.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="695" data-original-width="1500" height="296" src="https://1.bp.blogspot.com/-uGfp3A4pyxA/Xfh0t9cd39I/AAAAAAAAY9I/V1989DQwFhkZmynw9VRZHVRa_zdSTltkACLcBGAsYHQ/s640/ScalaTypeHierarchy.png" width="640" /></a></div><br />The empty values in Scala are represented by Null, null, Nil, Nothing, None, and Unit.<br /><br /><b>Nothing</b> is at the bottom of the hierarchy, a subtype of every other type, and is therefore called the bottom type. There is <a href="https://docs.scala-lang.org/tour/unified-types.html">no value</a> that has type Nothing. Nothing is the return type for methods which never return normally, such as a thrown exception, program exit, or an infinite loop. The Scala compiler treats throw expressions as having type Nothing; since a throw never produces a concrete value, the bottom type allows it to be used where a value of any type is expected.<br /><br /><b>Null </b>is the type (trait) of the null literal. It is a subtype of every type except those of value classes. Hence reference types can be assigned null but value types cannot. Null is provided mostly for interoperability with other JVM languages.<br /><br /><b>Unit</b>: Unit in Scala is analogous to void in Java; it is used as the return type of a function that does not return anything.<br /><br /><b>Nil</b>: Nil is the empty List, a List with zero elements. The type of Nil is List[Nothing]; since Nothing has no instances, the list is guaranteed to be empty. <br /><pre class="brush: scala"> println(null);
//println(none) // gives error : not found : value none
println(Nil)
</pre><br /><b>None</b>: It is one of sub-classes of Scala's Option class - Some and None. It is used to avoid null pointer exception by assigning null to the reference types. None signifies no result from the method. <br /><pre class="brush: scala"> //printing empty list
println(None.toList)
//checking whether None is empty or not
println(None.isEmpty)
//printing value of None as string
println(None.toString)
</pre><br /><div>
<b><span style="font-size: large;">Scala Strings</span></b></div>
<div>
Scala does not have its own String class and uses Java's java.lang.String with its methods for String operations. <b><u>Every Java class is available in Scala</u></b>. Since the String class is immutable, <a href="http://www.scala-lang.org/api/2.11.1/index.html#scala.collection.mutable.StringBuilder">StringBuilder</a> should be used while making frequent String modifications.<br />
<br />
A multi-line string literal is a sequence of characters enclosed in triple quotes <b>""" ... """</b>. The sequence of characters is arbitrary, except that it may contain three or more consecutive quote characters only at the very end.<br />
<br />
<a href="https://docs.scala-lang.org/overviews/core/string-interpolation.html">String Interpolation</a> allows to embed variable references directly in processed string literals. The string interpolator (s) when prepended to any string literal allows the usage of variables directly in the string. The s string interpolator can also take arbitrary expressions. The f interpolator when prepended to any string literal allows the creation of simple formatted strings, similar to printf in other languages. When using the f interpolator, all variable references should be followed by a printf-style format string, like %d. The raw interpolator is similar to the s interpolator except that it performs no escaping of literals within the string.<br />
<pre class="brush: scala">val name = "mark"
val age = 20
val amount = 345.67
println(name + " is " + age + " years old") // string concatenation using + method
println("(%d -- %f -- %s)".format(age, amount, name))
println(s"$name is $age years old") // S String Interpolation
println(f"$name%s is $age years old") // F String Interpolation for type safe variable
println(raw"Hello \n World") // Raw Interpolation, prints all strings literally with special string as it is.
</pre>
<br />
<div>
<b><span style="font-size: large;">For Loop</span></b></div>
The for loop in Scala uses ranges for iteration and does not need a variable declaration e.g. "var i" in the loop. The format of the for loop is "<b>for(i <- range)</b>", where the arrow <- is called the generator.
<br />
<pre class="brush: scala">for(i <- 1 to 5) {}
for(i <- 1.to(5)) {} // for loop using explicit to() function call
for(i <- 1.until(6)) {} // until is similar but excludes last value in the range
for(i <- 1 to 5; j <- 1 to 3) {} // multiple nested ranges
for(i <- 1 to 5; if i < 6) {} // for loop using filter or guard condition
val list = List(3,4,6,8,9,32)
val result = for{ i <- list; if i < 6} yield { // for loop as expression
i * i
}
</pre>
<br />
<b><span style="font-size: large;">Yield Keyword</span></b><br />
The <a href="https://www.jesperdj.com/2015/11/22/the-magic-of-for-yield-with-scala-collections/">yield keyword</a> returns a result after the loop iterations complete. The for loop uses a buffer internally to store the iterated results and, after finishing all iterations, it yields the final result from that buffer; it doesn’t work like an imperative loop. The type of the collection that is returned is the same type that was iterated over, hence a Map yields a Map, a List yields a List, and so on.<br />
<pre class="brush: scala">val xs = List(1, 2, 3, 4)
val x = for (x <- xs) yield x * 2 // List(2, 4, 6, 8)
val x = for (i <- 1 to 20 if i % 2 == 0) yield(i) // List(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
</pre>
<br />
<div>
<b><span style="font-size: large;">Match Expressions</span></b></div>
In contrast with the "exact matching" of Java's switch statements, pattern matching allows matching a pattern instead of an exact value. A match expression consists of the value to match, the match keyword, multiple case clauses with the code to execute when the pattern matches, and a default clause for when no other pattern has matched. If the target matches the pattern in a case, the result of that case becomes the result of the entire match expression. If multiple patterns match the target, Scala chooses the first matching case.
<br />
<pre class="brush: scala">age match {         // assuming age is an Int, match against Int literals
 case 20 => age
 case 30 | 40 | 50 => age
 case _ => "default"
 }
</pre>
The default clause consists of the underscore character (_) and is the last of the case clauses. The variable pattern '_' is generally used to match any expression, e.g. List(1,2,3) match { case Cons(h,_) => h } results in 1 as List(1,2,3) is equivalent to Cons(1, Cons(2, Cons(3, Nil))). A pattern matches the target if there exists an assignment of variables in the pattern to subexpressions of the target that make it structurally equivalent to the target. The resulting expression for a matching case will then have access to these variable assignments in its local scope.
<br />
<pre class="brush: scala">def sum(ints: List[Int]): Int = ints match {
case Nil => 0
case Cons(x,xs) => x + sum(xs)
}</pre>
Pattern guards are boolean expressions which are used to make cases more specific, by adding if expression after the pattern. The pattern can match not only Integers and Strings but any object type as shown in the example below.
<br />
<pre class="brush: scala">def getClassAsString(x: Any): String = x match {
case s: String => s + " is a String"
case i: Int => "Int"
case f: Float => "Float"
case l: List[_] => "List"
case p: Person if !p.name.isEmpty => "Person with non empty name"
case _ => "Unknown"
}</pre>
Pattern matching to handle the exceptions thrown in try-catch blocks as below.
<br />
<pre class="brush: scala">def catchBlocksPatternMatching(exception: Exception): String = {
try {
throw exception
} catch {
case ex: IllegalArgumentException => "It's an IllegalArgumentException"
case ex: RuntimeException => "It's a RuntimeException"
case _ => "It's an unknown kind of exception"
}
}
</pre>
<br />
<b><span style="font-size: large;">Classes and Constructors</span></b>
</div>
<div>
The <b>class</b> keyword introduces a class and contains the body within the curly braces.<br />
Scala has a single primary constructor and may have many auxiliary constructors. The entire body of a Scala class is the primary constructor, apart from the members it defines. The argument list of the primary constructor comes after the class name, and all the argument fields become class attributes. By default all attributes are public and immutable (val), and can be accessed directly. The var attributes can be reassigned, while the val attributes, being immutable, cannot be modified. Getter and setter methods are created automatically for var attributes, while val attributes have only getter methods created in the class. Class attribute variables can be declared public or private. The primary constructor can only call the base constructor, i.e. the super class constructor.<br />
<br />
<b>Auxiliary Constructor</b>: A class can have many auxiliary constructors, but each must have a different signature. An auxiliary constructor is required to call the primary constructor directly or indirectly; in other words, an auxiliary constructor must call either a previously defined auxiliary constructor or the primary constructor in the first line of its body. Auxiliary constructors are used for constructor overloading and are defined as methods named <b>this</b>.<br />
<pre class="brush: scala"> class Car(val year: Int, var miles: Int) { // primary constructor
println("car created") // This println is part of primary constructor
   def this(year: Int) {
       this(year, 0) // Auxiliary constructors must call the primary constructor (directly or indirectly)
}
def drive(dist: Int) {
miles += dist
}
}
</pre>
The new keyword is used to create an object instance by calling the class's constructors.<br />
<pre class="brush: scala"> val car = new Car(2010, 0)
println(car.year)
</pre>
In Scala, a getter and setter are implicitly defined for every non-private var in a class. The getter name is the same as the variable name, while the setter name is the variable name with "_=" appended.
<br />
<pre class="brush: scala"> class Test {
private var a = 0
def age = a
def age_=(n:Int) = {
require(n>0)
a = n
}
}
val t = new Test
t.age = 5
</pre>
<br />
A pair of values can be returned by a method, indicated by a tuple type enclosed within parentheses. A pair can be created by putting the values in parentheses separated by a comma.<br />
<pre class="brush: scala">def buyCoffee(cc: CreditCard): (Coffee, Charge) = { .. }</pre>
<br />
<b><span style="font-size: large;">Scala Functions</span></b><br />
<div>
A function in Scala is defined using the <b>def</b> keyword, followed by the name, the parameter list in parentheses and the return type. The body of the function comes after a single equals sign. In Scala, a parameter to a function is immutable by default. Scala allows functions to be named as operators, e.g. +(), ++(), *() etc. Scala functions can be stored in a variable.<br />
<pre class="brush: scala">def functionName ([list of parameters]) : [return type] = { }</pre>
The last statement within the block is automatically returned, without specifying the "<b>return</b>" keyword. Every method has to return some value as long as it doesn’t crash or hang. The value returned from a method is simply whatever value results from evaluating the right-hand side. The Scala compiler can infer the return type of a method from its last statement, but relying on this for public methods is considered bad style. A function which does not return anything (called a procedure) returns Unit, which is equivalent to void in Java. The literal syntax for unit is (), i.e. a pair of empty parentheses. Scala looks for a main method with a specific signature, which takes an Array of Strings as its argument and has the return type Unit, in order to begin the execution of the program. The <a href="https://www.scala-lang.org/api/current/scala/App.html" target="_blank">App trait </a>can also be used to quickly turn objects into executable programs, as an object inheriting from App also inherits the main method.<br />
<br />
Function parameters can have default values. The argument for such a parameter can optionally be omitted from a function call, in which case the corresponding argument will be filled in with the default.<br />
<pre class="brush: scala"> def main(args: Array[String]) {
println( "Value with no parameters : " + addInt() );
println( "Value with one parameter : " + addInt(5) );
}
 def addInt( a:Int=5, b:Int=7 ) : Int = {
   var sum:Int = 0
   sum = a + b
   sum   // the last expression is returned; an assignment alone has type Unit
 }</pre>
In a normal function call, the arguments in the call are matched one by one in the order of the parameters of the called function. <b>Named arguments</b> allow passing arguments to a function in a different order, where each argument is preceded by a parameter name and an equals sign as below.<br />
<pre class="brush: scala"> def main(args: Array[String]) {
printInt(b=5, a=7);
}
def printInt( a:Int, b:Int ) = {
println("Value of a : " + a );
println("Value of b : " + b );
}
</pre>
<div>
Any function name can be used infix, omitting the dot and parentheses when calling it with a single argument, e.g. instead of "<b>Math.abs(42)</b>" we can say "<b>Math abs 42</b>" and get the same result. Scala allows the last parameter of a function to be repeated, indicated by '*' following the type, e.g. "String*", which is seen inside the method as a Seq[String].</div>
<pre class="brush: scala"> def main(args: Array[String]) {
printStrings("Hello", "Scala", "Python");
}
def printStrings( args:String* ) = {
var i : Int = 0;
for( arg <- args ) {
println("Arg value[" + i + "] = " + arg );
i = i + 1;
}
}
</pre>
<div>
In Scala strangely the <span style="color: red;">curly brackets {} can be used in place of parentheses</span> or round brackets (), especially for enclosing the parameters to method calls or the body of the for loop. Generally, functions accepting a single argument may be called with braces instead of parentheses in Scala, hence "<b>Try { age.toInt }</b>" is equivalent to "<b>Try(age.toInt)</b>".<br />
<pre class="brush: scala"> flatMapExample {32}
val result = portal.flatMap {(a) => {a.toUpperCase}}
for {
n <- 1 to 100
c <- letters
} {
print(n, c)
}</pre>
</div>
Scala allows to define functions inside a function which are called <b>local functions</b> and are only visible inside the enclosing method.<br />
<br /></div>
<b><span style="font-size: large;">Class Methods</span></b><br />
A method is a function which is a member of an object (class). Private methods cannot be called from the code outside the owning object. All of an object’s non-private members can be brought into scope by using the underscore syntax, e.g. import "<b>MyModule._</b>". Method overriding during inheritance requires "<b>override</b>" keyword. An abstract class can be defined in Scala, which prevents from creating its instance. Methods are implicitly declared abstract if the equals sign and method body is missing from the method declaration. Scala allows to declare abstract fields similar to abstract methods which need to be inherited by subclasses.<br />
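A brief sketch of these ideas, using hypothetical Shape and Circle classes, shows an abstract field, an abstract method (no equals sign or body) and an overridden method requiring the override keyword:<br />
<pre class="brush: scala">abstract class Shape {
  val name: String                      // abstract field: no initial value, subclasses must define it
  def area: Double                      // abstract method: no '=' and no body
  def describe: String = s"$name with area $area"
}

class Circle(radius: Double) extends Shape {
  val name = "circle"
  def area: Double = math.Pi * radius * radius
  override def describe: String = s"circle of radius $radius"   // overriding requires the override keyword
}

// new Shape would not compile: an abstract class cannot be instantiated
val c = new Circle(2.0)
println(c.describe)
</pre>
<br />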
<br />
<div>
<b><span style="font-size: large;">Apply Method</span></b></div>
</div>
</div>
Scala allows objects that have a method with the special name "apply" to be called as if they were themselves functions.<br />
<pre class="brush: scala">object Car {
def apply(year: Int) = new Car(year, 0)
}
val car = Car.apply(2013)
val car = Car(2013) // same as above as apply can be dropped
</pre>
<br />
<div>
<div>
<b><span style="font-size: large;">Scala Extractors</span></b></div>
A Scala extractor is an object which has a method named unapply as one of its members. This method extracts an object and returns back its attributes, and is used in pattern matching and partial functions. An extractor object often also defines an apply method, which takes arguments and constructs an object, so it is helpful in constructing values. The unapply method reverses the construction procedure of the apply method.<br />
<br />
The return type of the unapply method can be selected like stated below:<br />
<ul>
<li>If it is a checking procedure then return a Boolean Type.</li>
<li>If the procedure is returning only one sub-value of type T, then return an Option[T].</li>
<li>If the procedure is returning various sub-values of type T1, T2, …, Tn then return an optional tuple i.e, Option[(T1, T2, …, Tn)].</li>
<li>If the procedure returns an unpredictable number of values, then the extractors can be defined with an unapplySeq that returns an Option[Seq[T]].</li>
</ul>
<pre class="brush: scala">import scala.util.Random

object CustomerID {
def apply(name: String) = s"$name--${Random.nextLong}"
def unapply(customerID: String): Option[String] = {
val stringArray: Array[String] = customerID.split("--")
if (stringArray.tail.nonEmpty) Some(stringArray.head) else None
}
}
val customer1ID = CustomerID("Sukyoung") // Sukyoung--23098234908
customer1ID match {
 case CustomerID(name) => println(name) // prints Sukyoung
 case _ => println("Could not extract a CustomerID")
}
</pre>
<br />
<div>
</div>
<b><span style="font-size: large;">Singleton and </span></b><b><span style="font-size: large;">Companion</span></b><b><span style="font-size: large;"> Object</span></b></div>
<div>
<div>
The <b>object</b> keyword creates a new singleton type, which is like a class that has only a single named instance, similar to an anonymous class in Java. An object is Scala's closest equivalent to Java's static members.<br />
<br />
When we have the class and object with the same name then the object is called companion object. The class holds details of the instance while the companion object can access the details of the class instance including its private members. Everything located inside a companion object is not a part of the class’s runtime objects but is available from a static context. The companion object should reside in same source file as the class.</div>
<pre class="brush: scala"> object Car { // Singleton in Scala, were only one instance of this Car class
def countOfInstances() = { // similar to static method in Java
}
}
class Foo { }
object Foo { // Foo is a Companion object of Class Foo
def apply() = new Foo
}
val foo1 = new Foo // Creates new instance of Foo by calling actual constructor of Foo class
val foo2 = Foo() // Creates instance of Foo by calling apply method with Foo Companion object
</pre>
</div>
<br />
A companion object in addition to the data type and its data constructors is an object with the same name as the data type where we put various convenience functions for creating or working with values of the data type. For example a function to fill the List data type with n copies of element a. When functions are in the companion object they are called as <b>fn(obj, arg1)</b>, while when inside the body of the trait they are called as<b> obj.fn(arg1)</b> or <b>obj fn arg1</b>.<br />
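As a small sketch of such a convenience function, the hypothetical MyList type below (the MyList, MyNil and Cons names are made up for illustration) defines fill in its companion object to build a list of n copies of an element:<br />
<pre class="brush: scala">// A tiny algebraic list with a companion object holding a convenience constructor.
sealed trait MyList[+A]
case object MyNil extends MyList[Nothing]
case class Cons[+A](head: A, tail: MyList[A]) extends MyList[A]

object MyList {
  // Build a list containing n copies of the element a.
  def fill[A](n: Int, a: A): MyList[A] =
    if (n <= 0) MyNil else Cons(a, fill(n - 1, a))
}

val threes = MyList.fill(3, "x")   // Cons(x,Cons(x,Cons(x,MyNil)))
</pre>
<br />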
<div><br /></div><div><font size="4"><b>Value Classes</b></font></div><div><a href="https://docs.scala-lang.org/overviews/core/value-classes.html" target="_blank">Value classes</a> allow the Scala compiler to inline the underlying value directly and avoid allocating runtime objects. Value classes are similar to wrapper classes in Java with autoboxing. A value class can only extend universal traits and cannot itself be extended. A universal trait is a trait that extends <a href="https://www.scala-lang.org/api/current/scala/Any.html" target="_blank">Any</a>, only has defs as members, and does no initialization. A value class must have only a primary constructor with exactly one public val parameter whose type is not a user-defined value class. It should not have specialized type parameters or nested or local classes, traits, or objects. It should not define equals or hashCode methods and should be a top-level class or a member of a statically accessible object. Value classes are immutable. There are nine predefined value types: Double, Float, Long, Int, Short, Byte, Char, Unit, and Boolean. Value classes can be combined with implicit classes for allocation-free extension methods.</div>
<pre class="brush: scala">class Wrapper(val underlying: Int) extends AnyVal {
def foo: Wrapper = new Wrapper(underlying * 19)
}
implicit class RichInt(val self: Int) extends AnyVal {
def toHexString: String = java.lang.Integer.toHexString(self)
}
</pre>
<div>
<br />
<b><span style="font-size: large;">Call By Name Parameters</span></b><br />
In Scala, parameters to the functions are passed by value, by default. Alternatively, Scala also provides <b><a href="https://docs.scala-lang.org/tour/by-name-parameters.html">call-by-name parameters</a></b>, which passes an expression to be evaluated within the called function. A call-by-name mechanism passes a code block to the callee (<a href="https://www.scala-lang.org/old/node/138">a nullary function</a> which encapsulates the computation of the corresponding parameter) and each time the callee accesses the parameter, the code block is executed and the value is calculated. The call by name parameter syntax is by simply prepending the => symbol to the variable type. The call by-name parameters are evaluated every time when they are used as opposed to call by-value parameters which are evaluated only once. Call by-name parameters won't be evaluated at all if they aren't used in the function body. They are similar to replacing the by-name parameters with the passed expressions.<br />
<pre class="brush: scala">def callByName(n: => Int): Unit = { // call-by-name parameter; a function literal cannot declare a by-name parameter
println("Method call by name")
println(n)
}
val add = (a: Int, b: Int) => {
println("Add")
a + b
}
callByName(add(5, 6)) // passing function to call-by-name parameter function
def performOperation1(op: => Unit) { // another example of call by name parameter
op
}
def performOperation2(op: () => Unit) {
op()
}
performOperation1{ println("Done") }
performOperation2(() => println("Hello!"))
def calculate(input: => Int) = input * 37 // call by name parameter in function
</pre>
<br />
<b><span style="font-size: large;">Case Class</span></b><br />
A case class is similar to a regular class, except that it is designed for modeling immutable data. It is also useful in pattern matching: such a class has a default apply() method which handles object construction. A case class also has all constructor parameters as vals, which means they are immutable by default. A companion object to the case class is created with apply and unapply methods added, hence we can create objects of the case class without using the “new” keyword. The Scala compiler also automatically adds default implementations of the toString, hashCode, equals and copy methods. The copy method is used to create a copy of the same instance, with or without modifying a few attributes. By default, case classes and case objects are Serializable.<br />
A case object is an object defined with the “case” modifier. It gets the same boilerplate-avoiding benefits, with added toString and hashCode methods, and is Serializable. A case class can extend another class, abstract class or trait, but a case class <a href="https://www.journaldev.com/12122/scala-caseclass-caseobject-part2">can NOT extend</a> another case class. A case class can override the variables and methods defined in a trait like other classes.
<br />
<pre class="brush: scala">case class Person(name:String)
object Person{
def unapply(p:Person):Option[String] = Some(p.name)
def apply(name:String):Person = new Person(name)
}
case class Person(name:String, age:Int)
val person1 = Person("Posa",30)
val person2 = Person("Posa",30)
val result = (person1==person2) // == operator is used to compare objects
</pre>
A deep copy is a copy to another object where any changes we make to it don’t reflect in the original object. A clone() method is used to create a deep copy of an object. A shallow copy, on the other hand is one where changes to the copy do reflect in the original. Scala uses the copy() method to carry out a shallow copy. Since case classes are immutable, a deep copy using clone() or shallow copy using copy() are used to make changes without changing the original.
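A minimal illustration of copy() on case classes follows; the Employee and Address names are hypothetical.<br />
<pre class="brush: scala">case class Address(city: String, zip: String)
case class Employee(name: String, address: Address)

val e1 = Employee("Posa", Address("Pune", "411001"))
val e2 = e1.copy(name = "Rahul")          // new instance with only the name changed
// copy is shallow: e2.address eq e1.address is true, both share the same Address instance
</pre>
<br />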
<br />
<br />
Case classes are especially useful for pattern matching. In the example below, determineType() takes a parameter of the Animal trait type and matches on the type of Animal. It matches the Dog and Cat case classes and the Woodpecker case object, which are different subtypes of the Animal trait. In the first case, Dog(name, _), the field name is used in the return value but the color field is ignored with _. If the Dog class is matched, its name is extracted and used in the string on the right side of the expression. When matching a Cat we want to ignore the name, so the syntax "_:Cat" is used to match any Cat instance. The anotherExample() method shows the default syntax of matching by class type, i.e. "c:Cat". Because Woodpecker is defined as a case object and has no name, it is matched by its object name.
<br />
<pre class="brush: scala">trait Animal
case class Dog(name: String, color: String) extends Animal
case class Cat(name: String) extends Animal
case object Woodpecker extends Animal
object CaseClassTest {
def determineType(x: Animal): String = x match {
case Dog(moniker, _) => s"Got a Dog, name = $name"
case _:Cat => "Got a Cat (ignoring the name)"
case Woodpecker => "That was a Woodpecker"
case _ => "That was something else"
}
def anotherExample(x: Animal) = x match {
case d: Dog => println(d.name)
case c: Cat => println(c.name)
}
println(determineType(new Dog("Rocky")))
println(determineType(new Cat("Rusty the Cat")))
println(determineType(Woodpecker))
}
</pre>
<br />
<div>
<b><span style="font-size: large;">Inner Functions</span></b></div>
<div>
In Scala, functions that are local to the body of another function are called inner functions, or local definitions. They are commonly used to write loops functionally, without mutating a loop variable, by using a recursive helper function.</div>
<pre class="brush: scala">def factorial(n: Int): Int = {
def go(n: Int, acc: Int): Int =
if (n <= 0) acc
else go(n-1, n*acc)
go(n, 1)
}</pre>
<br />
<div>
<b><span style="font-size: large;">Anonymous Functions</span></b></div>
<div>
Anonymous functions in Scala, also called function literals, have the arguments to the function declared to the left of the => arrow, while to the right of the arrow is the body of the function where the parameters can be used. The anonymous function (x,y) => x + y can be written as _ + _ in situations where the types of x and y can be inferred by Scala. Each underscore in an anonymous function expression like _ + _ introduces a new (unnamed) function parameter and references it. Anonymous functions can have multiple parameters or no parameter at all.<br />
<pre class="brush: scala"> var multiply = (x: Int, y: Int) => x*y
println(multiply(3, 4))
var userDir = () => { System.getProperty("user.dir") }
println( userDir )
</pre>
<br />
A function literal or anonymous function is, under the hood, an object with an apply method. Hence (a, b) => a < b roughly corresponds to the definition below, where calling lessThan(10, 20) actually calls the apply method:
<br />
<pre class="brush: scala">val lessThan = new Function2[Int, Int, Boolean] {
def apply(a: Int, b: Int) = a < b
}
</pre>
<br />
<div>
<b><span style="font-size: large;">Higher Order Functions</span></b></div>
<div>
In Scala, functions are values: they can be assigned to variables, stored in data structures, and passed as arguments to other functions. A function that accepts other functions as arguments (or returns a function) is called a higher-order function (HOF). Higher order functions enable passing or returning functions within a function.</div>
<pre class="brush: scala">def math(x: Double, y: Double, fn: (Double, Double) => Double) : Double = fn(x, y)
val result = math(50, 20, (x,y)=>x+y)
val result = math(50, 20, (x,y)=>x min y)
// HOF with more than two parameters, applied to binary function passed as argument
def math(x: Double, y: Double, z: Double, fn: (Double, Double) => Double) : Double = fn(fn(x, y),z)
val result = math(50, 20, 57, (x,y)=>x + y)
// Using Wildcard notations
val result = math(50, 20, 57, _ + _)
val result = math(50, 20, 57, _ max _)
</pre>
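Higher order functions can also return functions; below is a minimal sketch where a function builds and returns another function (the names multiplier and double are illustrative).<br />
<pre class="brush: scala">def multiplier(factor: Int): Int => Int = (x: Int) => x * factor // returns a function
val double = multiplier(2)
println(double(21)) // 42
</pre>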
<br />
<div>
<div>
<b><span style="font-size: large;">Partially Applied Functions</span></b></div>
<div>
Scala allows functions to be applied <b>partially</b>, to avoid passing redundant values when a method is invoked multiple times with the same value for a parameter. The constant parameter can be eliminated by partially applying the argument to the method, binding a value to the constant parameter and leaving the other parameters unbound by putting an underscore in their place. The resulting partially applied function is stored in a variable.<br />
<pre class="brush: scala"> def main(args: Array[String]) {
val date = new Date
val logWithDateBound = log(date, _ : String)
logWithDateBound("message1" )
}
def log(date: Date, message: String) = {
println(date + "----" + message)
}
</pre>
Functions can be fully applied, where all the arguments are supplied, or partially applied, where only some of the arguments are supplied; below are more examples.</div>
<pre class="brush: scala">val add = (x: Int, y: Int, z: Int) => x + y + z // fully applied function
add(10, 20, 30)
val add = (x: Int, y: Int, z: Int) => x + y + z // partially applied function, were one argument is applied partially
val f = add(10, 20, _ : Int)
f(30)
val fun = add(10, _ : Int, _ : Int) // partially applied function, were two arguments are applied partially
fun(100, 200)
</pre>
</div>
<br />
<div>
<div>
<b><span style="font-size: large;">Closures</span></b></div>
<div>
A closure is a function whose return value depends on the value of one or more variables declared outside the function (free variables). Changes made to a var free variable from inside the closure are visible outside the closure as well. When the free (outside dependent) variable is a var the closure is called impure, whereas when the free variable is a val it is a pure closure.
<br />
<pre class="brush: scala">// number is called free variable in closure
var number = 10
val add = (x : Int) => x + number
def main(args: Array[String]) {
println(add(10))
}
</pre>
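A small sketch of the impure behaviour described above: since number is a var, re-assigning it changes the result of the same closure.<br />
<pre class="brush: scala">var number = 10
val add = (x: Int) => x + number
println(add(10)) // 20
number = 100     // the closure captures the variable itself, not its value at definition time
println(add(10)) // 110
</pre>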
<br />
<b><span style="font-size: large;">Currying</span></b><br />
Function currying is a technique of transforming a function that takes multiple arguments into a chain of functions, each taking a single argument. <b>Curried functions</b> are defined with multiple parameter lists.<br />
<pre class="brush: scala"> def strcat(s1: String)(s2: String) = s1 + s2
// Alternate Syntax
def strcat(s1: String) = (s2: String) => s1 + s2
strcat("foo")("bar")
def add(x: Int, y: Int) = x + y
def add2(x: Int) = (y: Int) => x + y
// Scala provides simpler syntax for currying
def add3(x: Int)(y: Int) = x + y
def main(args: Array[String]) {
println(add(20, 10))
println(add2(20)(10))
val sum10 = add2(10)
println(sum10(20))
// val sum30 = add3(30) // Gives compilation error, as opposed to add2()
val sum30 = add3(30)_
println(sum30(300))
}
</pre>
<br />
In another example, dropWhile is called as 'dropWhile(xs)(f)', where dropWhile(xs) returns a function which is then called with the argument f, as below. More generally, when a function definition contains multiple argument groups, type information flows from left to right across these argument groups.<br />
<pre class="brush: scala">def dropWhile[A](as: List[A])(f: A => Boolean): List[A] =
as match {
case Cons(h,t) if f(h) => dropWhile(t)(f)
case _ => as
}
val xs: List[Int] = List(1,2,3,4,5)
val ex1 = dropWhile(xs)(x => x < 4)</pre>
<br />
<b><span style="font-size: large;">Arrays</span></b></div>
</div>
<div>
Array is a special kind of collection in Scala. It is a fixed-size data structure that stores elements of the same data type. The index of the first element of an array is zero and the index of the last element is the total number of elements minus one. Scala arrays support generics with Array[T], where T is a type parameter or abstract type. Arrays are compatible with Scala sequences: an Array[T] can be passed where a Seq[T] is required, and arrays also support all sequence operations.
<br />
<pre class="brush: scala">var arrayname = new Array[datatype](size) // Array declaration syntax
val array1 : Array[Int] = new Array[Int](4)
val array2 = new Array[Int](5)
val array3 = Array(1,2,3,4,5,6)
array1(0) = 20
// Print array
for(x <- array1)
println(x)
import Array._
concat(array1, array2) // concatenate Array
</pre>
<br />
<div>
<b><span style="font-size: large;">Lists</span></b></div>
<div>
A List is a collection which represents a linked list and holds a sequenced, linear list of items. In Scala, Lists are immutable and each element in the list is of the same type.<br />
In Scala, List has a cons operator <b>::</b>, which is short for "construct a new List object". It is used to add new elements at the beginning of the List; the cons operator cannot be used to add a new element at the end of the List. Also, the cons operator can only prepend elements to an existing list or to Nil. Nil is the List value that represents an empty list.<br />
<pre class="brush: scala">val list1 : List[Int] = List(2,3,4,5,6,7)
val list2 : List[String] = List("One", "Two", "Three")
println(list1(0)) // fetches 0th element from the list, internally uses List.apply() method to fetch element.
list1(0) // get value of list at index 0
list1(0) = 9 // gives an compilation error as lists are immutable in Scala
val newlist = 0 :: list1 // cons is used to prepend/append elements to list
val listA = 1 :: 5 :: 9 :: Nil // represents List(1,5,9)
val listB = 1 :: 5 :: (9 :: Nil) // represents List(1,5,9)
</pre>
A List has various methods such as add, prepend, max, min, etc. to perform various operations on the list. The head method returns the first element of the list, while the tail method returns the rest of the list without its first element. The reverse() method is used to reverse the list. List.foreach() takes a function and applies it to each element of the list. List.fill(n)(x) creates a List with n copies of x. The takeWhile() method takes elements from a list while the specified predicate is true. The dropWhile() method, on the other hand, drops elements from the list while the specified predicate is true and returns the remaining list.<br />
<pre class="brush: scala">val xs: List[Int] = List(1,2,3,4,5)
val ex1 = dropWhile(xs, (x: Int) => x < 4)
// ex1 == List(1,2,3)
val ex2 = dropWhile(xs, (x: Int) => x > 3)
// ex2 == List(4,5)
val ex3 = List.fill(5)(2) // List of 2s with 5 elements, result being List(2,2,2,2,2)
xs.foreach( println ) // Using foreach() method to print the list
var sum: Int = 0
xs.foreach(sum += _) // Using foreach() method to calculate the sum of list
</pre>
Below are few methods defined in List of standard library.<br />
<ul>
<li><b>def take(n: Int): List[A]</b> — Returns a list consisting of the first n elements of this.</li>
<li><b>def takeWhile(f: A => Boolean): List[A]</b> — Returns a list consisting of the longest valid prefix of this whose elements all pass the predicate f.</li>
<li><b>def forall(f: A => Boolean): Boolean</b> — Returns true if and only if all elements of this pass the predicate f.</li>
<li><b>def exists(f: A => Boolean): Boolean</b> — Returns true if any element of this passes the predicate f.</li>
<li><b>scanLeft and scanRight</b> — Like foldLeft and foldRight, but they return the List of partial results rather than just the final accumulated value.</li>
</ul>
The unzip method splits a list of pairs into a pair of lists. E.g. List[(Coffee, Charge)] is split by destructuring the pair to declare two values (coffees and charges) on one line.<br />
<div>
The reduce method reduces the entire list of values into a single value, using the combine method of the value class to combine values two at a time.</div>
<pre class="brush: scala"> val (coffees, charges) = purchases.unzip(coffees, charges.reduce((c1,c2) => c1.combine(c2)))</pre>
<br />
<div>
<div>
<b><span style="font-size: large;">Sets</span></b></div>
A Set is a collection where all the elements are unique, as defined by the == method of the element type. If a duplicate item is added to the set, the set quietly discards the add request. Sets can be both mutable and immutable; by default, sets in Scala are immutable. In order to use a mutable Set, the scala.collection.mutable.Set class should be imported explicitly. A Set has various methods such as add, remove, clear, size, etc. to enhance the usage of the set.
<br />
<pre class="brush: scala">val set1 : Set[Int] = Set(2,3,4,5,6,7, 7) // default Immutable set
val set2 : scala.collection.mutable.Set[Int] = scala.collection.mutable.Set(2,3,4,5,6,7, 7) // Mutable set
val set3 = scala.collection.mutable.Set(2,3,4,5,6,7, 7)
val ispresent = set1(8) // check if 8 is present in the set
println(set1 + 10) // returns a new set with 10 added; sets in Scala are unordered and cannot be indexed
println(set1 ++ set2) // combines sets and shows unique values of 2 sets
println(set1 & set2) // shows common values in 2 sets
println(set1.intersect(set2)) // shows common values in 2 sets
println(set1.min)
</pre>
<br />
<div>
<b><span style="font-size: large;">Maps</span></b></div>
<div>
A Map is a collection of key-value pairs. Keys are always unique while values may not be unique. Keys and values can be of any data type, but the types must be consistent throughout the map. Similar to Sets, Maps in Scala are classified into mutable and immutable maps. By default Scala uses the immutable Map. In order to use a mutable Map, the scala.collection.mutable.Map class should be imported explicitly.
<br />
<pre class="brush: scala">val map1 : Map[Int, String] = Map(801 -> "Max", 802 -> "Tom", 804 -> "June")
map1(802) // get value of map for the specified key
map1.keys
map1.values
map1.isEmpty
map1.contains(801) // check if key contains in the map
map1.keys.foreach{ key =>
println(key + " : " + map1(key))
}
println(map1 ++ map2) // combines maps
</pre>
<br />
<div>
<b><span style="font-size: large;">Tuple</span></b></div>
<div>
A Tuple is a collection of elements in Scala. Tuples are heterogeneous data structures, i.e. they can store elements of different data types. A tuple is immutable, unlike an array in Scala which is mutable. Tuples cannot contain more than 22 elements, i.e. up to Tuple22. Scala provides getter methods "_1" to "_22" to fetch the corresponding tuple element. A tuple of two elements can also be created using the arrow syntax, e.g. 1 -> "Tom".<br />
<pre class="brush: scala">val tupleA = (1, 2, "hello", true) // tuples are of fixed size and are immutable
val tupleB = new Tuple4(1, 2, "hello", true) // Tuple4 means the new Tuple contains 4 elements
val tupleC = (1, "hello", (2,3))
println(tupleA._4) // _1, _2, _3, _4 are created for Tuple4 tuple
tupleA.productIterator.foreach{
i => println(i)
}
</pre>
<br /></div>
</div>
<b><span style="font-size: large;">Seq Class</span></b></div>
<div>
</div>
Seq is the trait in Scala’s collections library implemented by sequence-like data structures such as lists, queues, and vectors. The special _* type annotation allows passing a Seq to a variadic (varargs) method, as shown in the sketch after the example below; variadic functions are just syntactic sugar for creating and passing a Seq of elements explicitly. The default Seq instances are immutable.
<br />
<pre class="brush: scala">val x = Seq(1, 1.0, 1F) // Seq[Double] = List(1.0, 1.0, 1.0)
val y: Seq[Number] = Seq(1, 1.0, 1F) // Seq[Number] = List(1, 1.0, 1.0)
case class Person(name: String)
val people = Seq(
Person("Emily"),
Person("Hannah"),
Person("Mercedes")
)
(1 to 5).toSeq // List(1, 2, 3, 4, 5)
Seq.range(1, 6, 2) // List(1, 3, 5)
Seq.fill(3)("foo") // List(foo, foo, foo)
</pre>
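A small sketch of the _* annotation mentioned above, passing a Seq to a variadic method (the sum method here is illustrative).<br />
<pre class="brush: scala">def sum(numbers: Int*): Int = numbers.sum // a variadic (varargs) method
val values = Seq(1, 2, 3, 4)
sum(values: _*) // _* expands the Seq into individual arguments, result is 10
</pre>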
<br />
<div>
<b><span style="font-size: large;">Map and Filter Functions</span></b></div>
<div>
The <b>map()</b> function is a higher order function available for every collection in Scala. It takes a function as a parameter, and applies that function to every element of the source collection. The map function returns a new collection of the same type as the source collection.<br />
<pre class="brush: scala">val listA = List(1, 2, 3)
val mapB = Map(1 -> "One", 2 -> "Two", 3 -> "Three")
println(listA.map(x => x * 2)) // double every element in the list
println(listA.map(_ * 2))
println(listA.map{ e => e * 2})
println(listA.map(x => "h1" + x))
println(mapB.mapValues(x => "hi" + x))
println("hello".map(_.toUpper)) // Map also can be used on string, to return HELLO
</pre>
<br />
The flatten() method collapses a nested Scala collection, constructing a single collection from the elements of the inner collections of the same type.<br />
The flatMap() method is identical to the map() method, with the only difference that in flatMap the inner grouping of items is removed and a single flat sequence is generated. It can be seen as a blend of the map method and the flatten method: the output obtained by running the map method followed by the flatten method is the same as that of flatMap().<br />
<pre class="brush: scala">val listOfList = List(List(1,2,3), List(4,5,6))
println(listOfList.flatten) // returns list with all elements from nested lists as part of single list
println(listOfList.flatMap(x => List(x, x+1)))
</pre>
<br />
Filter takes a predicate function, which returns a boolean value by evaluating an expression, and keeps only the elements for which the predicate is true.</div>
<pre class="brush: scala">println(lst.filter(x => x%2 == 0))
</pre>
</div>
</div>
</div>
<div>
<br />
The Option type itself also supports the map, flatMap, and filter functions. The map function can be used to transform the result inside an Option if it exists; if it is None, the remaining operations are skipped. The flatMap function is similar to the map method, except that the function provided to transform the result can itself fail (it returns an Option). The filter function is used to filter out the relevant values and is mostly used within a chain of operations.<br />
<br /></div>
<div><b style="font-size: large;">Option Class</b></div>
<div>
Exceptions thrown in functions make the method's return value not referentially transparent. They break referential transparency, introduce context dependence, and should be used only for error handling and not for control flow. Exceptions are also not type-safe: the compiler neither knows about nor can enforce the handling of exceptions, which won't be detected until runtime. Hence, instead of throwing an exception, Scala provides the Option data type as an explicit return type when the function may not have a value to return. Option has two cases: it can be defined, in which case it is a Some, or it can be undefined, in which case it is None.
<br />
<br />
The <a href="https://www.scala-lang.org/api/2.7.7/scala/Option.html">Option class</a> is used to represent a carrier of single or no element for a stated type. The Option class acts as a container which can give two values, Some or None. When a method returns a value which can even be null when Option is utilized i.e, the method defined returns an instance of an Option, instead of returning a single object or a null. The instance of an Option returned can be an instance of either Some class or None class which are subclasses of Option class. The Option[T] class serves as a container for zero or one element of a given type T. If the element exists, it is an instance of Some[T]. If the element does not exist, it is an instance of None. Some of the popular methods to unwrap optional values in case class Some() are to use pattern matching using case, getOrElse() method and foreach can be used to extract optional values since the Option[T] class is a collection of zero or one elements of type T. The IsDefined Option method returns true if the Option does not have a None value and false otherwise. The getOrElse method is used to access the value of the Option or return empty or error value for error handling. A common idiom is to do getOrElse(throw new Exception("FAIL")) to convert the None case of an Option back to an exception. The orElse() method is similar to getOrElse(), except that we return another Option if the first is undefined.</div>
<pre class="brush: scala"> //printing empty list
val sampleList = List(1, 2, 3)
sampleList.find(_ > 0) // This returns None
sampleList.find(_ > 2) // This returns Some(2)
//To extract value from Some instance, the get() method is used on Option
val result1 = sampleList.find(_ > 2).get // Get method will return 2 for Some(2), for None result it throws an exception
val result2 = sampleList.find(_ > 2).getOrElse(0) // GetOrElse method returns else value when result is None instead of exception
val option1 : Option[Int] = Some(5)
println(option1.isEmpty)) // Option class allows to check if it has any value using isEmpty method
println(option1.get)) // Return value 5
</pre>
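A minimal sketch of unwrapping an Option with pattern matching and of the orElse method described above.<br />
<pre class="brush: scala">val maybeName: Option[String] = Some("Tom")
maybeName match {
  case Some(name) => println(s"Found $name")
  case None => println("No name found")
}
val fallback = (None: Option[String]).orElse(Some("default")) // Some(default): orElse returns the alternative Option
</pre>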
<br />
<div>
The <b>Either data type</b> is an extension to Option which allows tracking the reason for a failure. Either has exactly two cases, where each case carries one value. The Right constructor is reserved for the success case and Left is used for failure.
<br />
<pre class="brush: scala">def safeDiv(x: Int, y: Int): Either[Exception, Int] =
try Right(x / y)
catch { case e: Exception => Left(e) }
</pre>
<br />
<span style="font-size: large;"><b>Reduce / Fold / Scan</b></span></div>
Fold, reduce and scan are a family of higher-order functions which use a given combining operation to recombine the results of recursively processing a collection's constituent parts, building up a return value. The reduce/fold/scan functions apply a binary operator to each element of a collection; the result of each step is passed on to the next step as one of the binary operator's arguments. The xLeft variations go forward (left to right) through the collection, while the xRight variations go backwards (right to left) through the collection.<br />
<div>
<br /></div>
Reduce (reduceLeft/reduceRight) takes an associative binary operator function as a parameter and applies it to each element of the collection to return a single cumulative result.
<br />
<pre class="brush: scala">val list1 = List(1, 2, 4, 6, 7, 9, 13, 16, 20)
val list2 = List("A", "B", "C", "F", "G")
println(list1.reduceLeft(_ + _)) // 78
println(list2.reduceLeft(_ + _)) // ABCFG
</pre>
Fold (foldLeft/foldRight) functions are similar to reduce, but an initial value can be passed into foldLeft or foldRight.
<br />
<pre class="brush: scala">println(list1.foldLeft(100)(_ + _)) // 178 which is 100 + 78 which is total of all elements of the list
println(list2.foldLeft("Z")(_ + _)) // ZABCFG
val result = list1.foldLeft(0){(c,e) => c + e}
</pre>
Scan (scanLeft/scanRight) functions are similar to the fold functions, except that scan functions produce a collection of intermediate result values, accumulating the intermediate cumulative results starting from a start value.<br />
<pre class="brush: scala">println(list1.scanLeft(100)(_ + _)) // 100, 101, 103, 107, 113, 120, 129, 142, 158, 178
println(list2.scanLeft("Z")(_ + _)) // Z, ZA, ZAB, ZABC, ZABCF, ZABCFG
</pre>
<br />
<div>
<div>
<b><span style="font-size: large;">Strictness and laziness</span></b></div>
Scala provides two modes of evaluation for expressions/functions, lazy and strict. Lazy mode delays the evaluation of an expression until its value is needed or used. Strict mode evaluates the expression or function arguments immediately, without delay. Scala uses strict evaluation of expressions by default, but allows lazy evaluation by explicitly using the lazy keyword.<br />
<pre class="brush: scala"> def square(i: Int): Int = i*i
lazy val l = square(15)/square(11)
println(l)
</pre>
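A small sketch showing that a lazy val is evaluated only on first use; the println inside square makes the evaluation visible.<br />
<pre class="brush: scala">def square(i: Int): Int = { println(s"evaluating square($i)"); i * i }
lazy val l = square(15) / square(11) // nothing is printed yet, evaluation is deferred
println("before first use")
println(l) // square is evaluated here, on first access, and the result is cached
println(l) // no re-evaluation on subsequent accesses
</pre>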
<br />
<b><span style="font-size: large;">Traits</span></b><br />
Scala does not support multiple inheritance of classes and provides traits to achieve the expected implementation. A trait is an abstract interface that may optionally contain implementations of some methods; traits may contain both abstract and non-abstract (implemented) methods. When sealed is added in front of a trait, it means that all implementations of the trait must be declared in the same file. Traits that are declared with no methods, functions, types or properties are called marker traits, e.g. <a href="https://www.scala-lang.org/api/2.12.3/scala/Immutable.html" target="_blank">scala.Immutable</a> is a marker trait which indicates the semantics of immutability. A trait can be mixed in at the class level as well as at the instance level, as shown in the example below.<br />
<pre class="brush: scala">trait Friend {
val name: String
def listen() = println("I am " + name + " listening")
}
class Animal(val name: String)
class Dog(override val name: String) extends Animal(name) with Friend
class Cat(override val name: String) extends Animal(name)
def main(args: Array[String]) {
val mycat = new Cat("mycat") with Friend
mycat.listen()
seekHelp(mycat)
}
</pre>
Traits can be used as Decorator Pattern, selectively layering of functions without using multiple inheritance as below.<br />
<pre class="brush: scala">abstract class Writer {
def write(msg: String)
}
class StringWriter extends Writer {
val target = new StringWriter
def write(msg: String) = target.append(msg)
override def toString() = target.toString()
}
trait UpperCaseFilter extends Writer {
abstract overrider def write(msg: String) = {
super.write(msg.toUpperCase()) // Modify the input string and pass it to the next available class in trait hierarchy
// i.e. StringWriter.write() method
}
}
def write(writer: Writer) = {
writer.write("This is Great")
}
def main(args: Array[String]) {
write(new StringWriter)
write(new StringWriter with UpperCaseFilter)
}
</pre>
Traits and classes can be marked sealed which means all subtypes must be declared in the same file. This assures that all subtypes are known.<br />
<br />
<span style="font-size: large;"><b>Implicits</b></span></div>
<div>
Scala provides implicit parameters and conversions which allow changing or extending the standard libraries. <a href="https://www.scala-lang.org/blog/2016/12/07/implicit-function-types.html">Implicit</a> allows omitting explicit method calls or variable references, instead relying on the compiler to make the connections. The compiler will call implicit methods or reference implicit variables if the code doesn't compile as written but would compile if the implicit function/variable were applied. Implicit definitions are those that the compiler is allowed to insert into a program in order to fix any of its type errors. An <a href="https://www.artima.com/pins1ed/implicit-conversions-and-parameters.html">implicit conversion</a> is only inserted if there is no other possible conversion to insert and the implicit conversion is within scope. The Scala compiler will only use one implicit conversion at a given time and will not change code that already compiles. There are three types of implicit definitions:<br />
<ul>
<li><b>Implicit parameters</b> (aka implicit values) will be automatically passed values that have been marked as implicit.
It means that if no value is supplied when the method is called, the compiler will look for an <a href="http://baddotrobot.com/blog/2015/07/03/scala-implicit-parameters/">implicit value</a> and pass it in for you. The compiler can fill an implicit parameter with a val, a var or even another def.<pre class="brush: scala">val value = 10 // a value used by the method below
def multiply(implicit by: Int) = value * by
implicit val multiplier = 2
multiply // the compiler supplies multiplier, result is 20
</pre>
Implicit can be used only once in a parameter list and all parameters following it will be implicit.
<pre class="brush: scala">def example1(implicit x: Int) // x is implicit
def example2(implicit x: Int, y: Int) // x and y are implicit
def example3(x: Int, implicit y: Int) // wont compile
def example4(x: Int)(implicit y: Int) // only y is implicit
def example5(implicit x: Int)(y: Int) // wont compile
def example6(implicit x: Int)(implicit y: Int) // wont compile
</pre>
</li>
<li><b>Implicit functions</b> are defs that will be called automatically if the code wouldn’t otherwise compile. They’re typically used to create implicit conversion functions; single argument functions to automatically convert from one type to another. The references to implicit functions get applied to implicit arguments in the same way as references to implicit methods. To avoid implicit ambiguity, nested occurrences of an implicit take precedence over outer ones<pre class="brush: scala">implicit def intToStr(num: Int): String = s"The value is $num"
42.toUpperCase() // evaluates to "THE VALUE IS 42"
def functionTakingString(str: String) = str
// note that we're passing int
functionTakingString(42) // evaluates to "The value is 42"
</pre>
</li>
<li><b>Implicit classes</b> extend behavior of existing classes you don’t otherwise control.
<pre class="brush: scala">implicit class StringImprovements(s: String) {
def increment = s.map(c => (c + 1).toChar)
}
val result = "HAL".increment
</pre>
</li>
</ul>
</div>
<br />
<div>
<b><font size="4">Tail Recursion</font></b></div><div>Scala detects self-recursion and compiles it to the same sort of bytecode as would be emitted for a while loop,as long as the recursive call is in tail position. A call is said to be in tail position if the caller does nothing other than return the value of the recursive call. If all recursive calls made by a function are in tail position, Scala automatically compiles the recursion to iterative loops that don’t cona function literalsume call stack frames for each iteration. we can tell the Scala compiler about tail call elimination using the tailrec annotation.
<pre class="brush: scala">def findFirst[A](as: Array[A], p: A => Boolean): Int = {
@annotation.tailrec
def loop(n: Int): Int =
if (n >= as.length) -1
else if (p(as(n))) n
else loop(n + 1)
loop(0)
}
</pre>
<br />
<b><span style="font-size: large;">Variance</span></b><br /><div>Variance defines Inheritance relationships of Parameterized Types. Type parameters in Scala are written in square brackets, e.g. [A]. For List[T], the typed lists List[Int], List[AnyVal], etc. are known as "Parameterized Types" while T is called Type Parameter. Variance makes Scala collections more Type-Safe. Scala supports three types of variance, namely Covariant, Invariant and Contravariant.</div><div><br /></div><div><b>Covariant</b>: If "S" is subtype of "T" then List[S] is is a subtype of List[T]. To represent Covariance relationship between two Parameterized Types, Scala uses prefixing type parameter with "+" symbol. For example, List[+T], Set[+T] and Ordered[+T], where T is a Type Parameter and "+" symbol defines Scala Covariance.</div><div><br /></div><div><b>Contravariant</b>: If "S" is subtype of "T" then List[T] is is a subtype of List[S]. To represent Contravariant relationship between two Parameterized Types, Scala uses prefixing type parameter with "-" symbol, for List[-T]. </div><div><br /></div><div><b>Invariant</b>: If "S" is subtype of "T" then List[S] and List[T] don’t have Inheritance Relationship or Sub-Typing. Such relationship between two Parameterized Types is known as "Invariant or Non-Variant". In Scala, by default Generic Types have Non-Variant (Invariant) relationship, were parameterized types are defined without using "+" or "-" symbols.</div></div><div><br /></div><div>
<table border="1" cellpadding="0" cellspacing="0" style="border-collapse: collapse; border: 1px solid rgb(221, 238, 238); font-family: arial, sans-serif; text-align: left; text-shadow: rgb(255, 255, 255) 1px 1px 1px; width: 70%;">
<tbody>
<tr>
<th style="background-color: #ddefef; border: 1px solid rgb(221, 238, 238); color: #336b6b; padding: 8px;" width="20%">Scala Variance Type</th>
<th style="background-color: #ddefef; border: 1px solid rgb(221, 238, 238); color: #336b6b; padding: 8px;" width="10%">Syntax</th>
<th style="background-color: #ddefef; border: 1px solid rgb(221, 238, 238); color: #336b6b; padding: 8px;" width="70%">Description</th>
</tr>
<tr>
<td style="padding: 8px;">Covariant</td>
<td style="padding: 8px;">[+T]</td>
<td style="padding: 8px;">If S is subtype of T, then List[S] is also subtype of List[T]</td>
</tr>
<tr>
<td style="padding: 8px;">Contravariant</td>
<td style="padding: 8px;">[-T]</td>
<td style="padding: 8px;">If S is subtype of T, then List[T] is also subtype of List[S]</td>
</tr>
<tr>
<td style="padding: 8px;">Invariant</td>
<td style="padding: 8px;">[T]</td>
<td style="padding: 8px;">If S is subtype of T, then List[S] and List[T] are unrelated</td>
</tr>
</tbody></table>
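<br />
A minimal sketch illustrating the table above, assuming a simple Animal/Dog hierarchy (the Box classes are illustrative).<br />
<pre class="brush: scala">class Animal
class Dog extends Animal
class CovariantBox[+T]
class ContravariantBox[-T]
class InvariantBox[T]
val a: CovariantBox[Animal] = new CovariantBox[Dog]         // allowed, Dog is a subtype of Animal
val b: ContravariantBox[Dog] = new ContravariantBox[Animal] // allowed, the subtyping direction is reversed
// val c: InvariantBox[Animal] = new InvariantBox[Dog]      // does not compile, invariant types are unrelated
</pre>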
<br /></div><br /><div><b><font size="4">Type Bounds</font></b></div><div>Type Bounds are restrictions on Type Parameters (taken by generic classes) or Abstract Type members (taken by traits or abstract classes). By using Type Bounds limits can be defined to a Type Variable. Scala supports Upper Bounds, Lower Bounds and View Bounds for Type Variables.</div><div><div><br /></div>
<div><b>Upper Bounds</b>: The syntax for Upper Bound in Scala is [T <: S]. Here T is a Type Parameter and S is a type. It indicates that the Type Parameter T must be either same as S or Sub-Type of S.</div><div><br /></div>
<pre class="brush: scala"> class Animal
class Dog extends Animal
class PitBull extends Dog
object ScalaUpperBounds {
def display [T <: Dog](d : T) {
println(d)
}
def main(args: Array[String]) {
display(new PitBull)
display(new Dog)
}
}
</pre>
<div>In the above example an upper bound [T <: Dog] is defined on the Type Parameter T. Hence T here can be either Dog or a subtype of Dog.</div><div><br /></div>
<div><b>Lower Bounds</b>: The syntax for a Lower Bound in Scala is [T >: S]. This indicates that the Type Parameter T must be either the same as Type S or a Super-Type of S.</div><div><br /></div>
<pre class="brush: scala"> class Animal
class Dog extends Animal
class PitBull extends Dog
class Labrador extends Dog
object ScalaLowerBounds {
def display [T >: PitBull](d : T) {
println(d)
}
def main(args: Array[String]) {
display(new PitBull)
display(new Dog)
display(new Animal)
}
}
</pre>
<div>In the above example a lower bound [T >: PitBull] is defined on the Type Parameter T. Hence T must be either PitBull or a super-type of PitBull.</div><div><br /></div>
<div><b>View Bounds</b>: The View Bound allows to use the existing Implicit Conversions automatically. The syntax for View Bound in Scala is [T <% S]. View bound enables the use of some type A as if it were some type B. In the below example, A should have an implicit conversion to B available, so that one can call B methods on an object of type A. View bounds are <a href="https://github.com/scala/scala/pull/2909" target="_blank">deprecated</a>.</div>
<pre class="brush: scala">def f[A <% B](a: A) = a.bMethod</pre>
<div><br /></div></div>
<div>
<b><span style="font-size: large;">Underscore Special Character</span></b></div>
The underscore is a special character in Scala. In a setter method name like age_=, the underscore effectively allows a space in the method name, making the name "age =". This is the convention Scala uses for setter methods, and it allows the method to be used in the same way as directly accessing a public property. In Scala, parentheses are usually optional, so the assignment could just as easily have been written as
<br />
<pre class="brush: scala">person.age =(99)
// Or
person.age_=(99)
// Or
person.age = 99
</pre>
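For context, below is a minimal sketch of a Person class that the calls above assume, with an age_= setter (the backing field _age is illustrative).<br />
<pre class="brush: scala">class Person {
  private var _age = 0
  def age = _age                          // getter
  def age_=(value: Int) { _age = value }  // setter, callable as person.age = 99
}
val person = new Person
person.age = 99
println(person.age) // 99
</pre>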
<div>
Scala is a functional language, so a function can be treated like a normal variable. However, if a parameterless method is assigned to a new variable, the method will be invoked and its result will be assigned to the variable. This confusion occurs due to the optional parentheses for method invocation. We should use _ after the method name to assign the method itself, as a function value, to another variable.</div>
<pre class="brush: scala">class Test {
def fun = {
// some code
}
val funLike = fun _
}
</pre>
<br />
<div>
<span style="font-size: large;"><b>Standard Library Functions</b></span><br />
Scala has Function1, Function2, Function3 and other traits provided by the standard Scala library, where the number in the name indicates the number of arguments the function takes. Scala's standard library provides compose as a method on Function1, where two functions f and g can be composed by calling "f compose g". Also, f andThen g is the same as g compose f. A functional data structure is (not surprisingly) operated on using only pure functions; functional data structures are by definition immutable.</div>
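<br />
A small sketch of compose and andThen on Function1, as described above.<br />
<pre class="brush: scala">val addOne: Int => Int = _ + 1
val double: Int => Int = _ * 2
val f = double compose addOne // f(x) = double(addOne(x))
val g = addOne andThen double // g(x) = double(addOne(x)), i.e. the same as double compose addOne
println(f(3)) // 8
println(g(3)) // 8
</pre>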
<br />
<b><span style="font-size: large;">Scala Streams</span></b><br />
A Stream is a lazy list where elements are evaluated only when they are needed; otherwise Streams have the same performance characteristics as Lists. Similar to the List cons operator <b>::</b>, Stream has a cons operator, the #:: operator method. Stream.empty is used at the end of the expression to terminate the stream.</div>
</div>
<pre class="brush: scala">val stream1: Stream[Int] = 1 #:: 2 #:: 3 #:: Stream.empty // using #:: operator
val stream2: Stream[Int] = cons(1, cons(2, cons(3, Stream.empty) ) ) // using Stream.cons method
val stream3: Stream[Int] = Stream.from(1) // create infinite stream
val emptyStream: Stream[Int] = Stream.empty[Int] // initialize empty stream
println(s"Elements of stream1 = $stream1") // Stream(1, ?)
stream2.take(3).print // prints, 1, 2, 3, empty
stream2.take(10).print // prints, 1, 2, 3, empty
</pre>
<br />
<div>
Only the first element of the stream is printed when the stream is printed using println, as the remaining elements are not yet evaluated. The stream's take() method evaluates only the first specified number of elements from the stream, which can then be used to perform operations.</div>
<div><br /></div><br /><div><font size="4"><b>Futures and Promises</b></font></div><div>A <a href="https://docs.scala-lang.org/overviews/core/futures.html" target="_blank">Future</a> is an object holding a value which may become available at some point. It other words it is a placeholder object for a value that may not yet exist. The value is usually the result of some other computation, which determines the state of the feature. Depending on success or failure of the computation, the future is either completed with a value or completes with an exception thrown by the computation. Once a Future object (<a href="https://www.scala-lang.org/api/2.9.3/scala/concurrent/Future.html" target="_blank">Future[T]</a>) is given a value or an exception, it becomes immutable and cannot be overwritten. The <b>Future.apply</b> method starts (or schedules) an asynchronous computation and returns a future object holding the result of that computation. The result becomes available once the future completes.</div><div><br /></div>
<pre class="brush: scala">import scala.concurrent._
import ExecutionContext.Implicits.global
val session = socialNetwork.createSessionFor("user", credentials)
val f: Future[List[Friend]] = Future {
session.getFriends()
}
val title = Future {
"hello" * 12 + "WORLD !!"
}
</pre>
Futures are generally asynchronous and do not block the underlying execution threads. However, futures also allow blocking the execution thread in certain cases: either by invoking arbitrary code that blocks the thread from within the future, or by blocking from outside the future, waiting until that future gets completed.
<pre class="brush: scala">val blockedForThisName = Future {
blocking {
"This is Blocked"
}
}
</pre>
<div><div>While the Future is a read-only container, a promise is a writable, single-assignment container that is used to complete a future. Futures and Promises can be seen as two different sides of a pipe: on the promise side, data is pushed in, while on the future side, data can be pulled out. A promise can be used to either successfully complete a future with a value using the success method, or to complete a future with an exception, by failing the promise using the failure method.</div>
<div><br /></div>
<pre class="brush: scala">val getNameFuture = Future { "Tom" }
val getNamePromise = Promise[String]()
getNamePromise completeWith getNameFuture
getNamePromise.future.onComplete {
case Success(name) => println(s"Got the name: $name")
case Failure(e) => e.printStackTrace()
}
</pre>
<div>By default, futures and promises are non-blocking, making use of callbacks instead of typical blocking operations. Future and Promises revolve around ExecutionContexts, responsible for executing computations.</div><div><br /></div><div><br /></div><b><font size="4">
ExecutionContext</font></b><div>An <a href="https://www.scala-lang.org/api/current/scala/concurrent/ExecutionContext$.html" target="_blank">ExecutionContext</a> is similar to an <a href="https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/Executor.html" target="_blank">Executor</a> where it executes computations in a new thread, in a pooled thread or in the current thread (the last being discouraged). Scala provides an inbuilt <b>scala.concurrent.ExecutionContext</b> implementation with a global static pool. Also, an <b>Executor</b> can be converted to an ExecutionContext using the <b>ExecutionContext.fromExecutor</b> method which wraps a Java Executor into an ExecutionContext. Execution contexts execute the tasks submitted to them, similar to thread pools. They are essential for the Future.apply method because they handle how and when the asynchronous computation is executed. We can either define our own execution contexts and use them with Future, or use the default execution context by importing ExecutionContext.Implicits.global. Below is an example where the execution of fatMatrix.inverse() is delegated to an ExecutionContext, and the result is provided to inverseFuture.</div>
<div><br /></div>
<pre class="brush: scala">val inverseFuture: Future[Matrix] = Future {
fatMatrix.inverse() // non-blocking long lasting computation
}(executionContext)
</pre>
<div><br /></div>
<div><div><b>Global Execution Context</b></div><div>ExecutionContext.global is an ExecutionContext backed by a ForkJoinPool which manages a limited number of threads. The maximum number of threads is referred to as the <b>parallelism level</b>. The number of concurrently blocking computations can exceed the parallelism level only if each blocking call is wrapped inside a blocking construct, otherwise the thread pool in the global execution context is starved. By default ExecutionContext.global sets the parallelism level of its underlying fork-join pool to the number of available processors using Runtime.availableProcessors. It can be overridden with the minThreads, numThreads and maxThreads properties of scala.concurrent.context. The Global ExecutionContext can be imported from ExecutionContext.Implicits.global. Since the ForkJoinPool is not designed for long lasting blocking operations, such operations are run on a dedicated ExecutionContext and wrapped in blocking, as below.</div></div>
<div><br /></div>
<pre class="brush: scala">import scala.concurrent._
import ExecutionContext.Implicits.global
val session = socialNetwork.createSessionFor("user", credentials)
val f: Future[List[Friend]] = Future {
session.getFriends()
}
</pre>
<div><br /></div><div><b>Callbacks</b></div>When the client requires the value of the computation carried out by a future, it could block its own computation and wait until the future is completed. While the Future API provides such a blocking call, it is recommended to work in a completely non-blocking way, by registering a callback on the future. The callback is then called asynchronously once the future is completed. The onComplete method, which takes a callback function of type Try[T] => U, is the most commonly used method to register a callback. Try[T] is a monad similar to Option[T], which can either hold a value or some throwable object: Try[T] is a Success[T] when it holds a value and otherwise a Failure[T], which holds an exception. The onComplete method allows the client to handle the result of both failed and successful future computations. To handle only successful results, the foreach callback is used. The onComplete and foreach methods both have result type Unit, which means invocations of these methods cannot be chained. <div><br /></div><div>The callback methods are executed either by the thread that completes the future or by the thread which created the callback. The order of execution of callbacks is not predefined, as they can be called sequentially one after the other or concurrently at the same time, although the ExecutionContext implementation mostly results in a well-defined order. The onComplete callback ensures that the corresponding method is invoked after the future is <u>eventually</u> completed, while the foreach callback only invokes the method if the future is completed <u>successfully</u>. If a callback is registered on a future which is already completed, the callback is still executed eventually. If one callback throws an exception, the other callbacks are executed regardless. If some callbacks never complete, e.g. due to an infinite loop, the other callbacks may not be executed at all, in which case the blocking construct should be used. Once executed, the callbacks are removed from the future object, thus becoming eligible for GC.<div><br /></div>
<pre class="brush: scala">import scala.util.{Success, Failure}
val f: Future[List[String]] = Future {
session.getRecentPosts
}
f onComplete {
case Success(posts) => for (post <- posts) println(post)
case Failure(t) => println("An error has occurred: " + t.getMessage)
}
f foreach { posts =>
for (post <- posts) println(post)
}
</pre>
<div><br /></div><div><b>Combinators</b></div>The foreach and onComplete methods often result in deeply nested and bulky code in real-life scenarios where futures are frequently nested. To avoid this, futures provide combinators which allow a more straightforward composition. Map is one of the basic combinators: given a future and a mapping function for the value of the future, it produces a new future that is completed with the mapped value once the original future is successfully completed. Hence futures can be mapped in the same way as collections. If the original future is completed successfully then the returned future is completed with the mapped value. If the mapping function throws an exception the future is completed with that exception. If the original future fails with an exception then the returned future also contains the same exception. Below is an example of the map combinator.<div><br /></div></div>
<pre class="brush: scala">val rateQuote = Future {
connection.getCurrentValue(USD)
}
val purchase = rateQuote map { quote =>
if (isProfitable(quote)) connection.buy(amount, quote)
else throw new Exception("not profitable")
}
purchase foreach { amount =>
println("Purchased " + amount + " USD")
} </pre><div><br /></div><div>Scala allows usage of futures in <a href="https://docs.scala-lang.org/tour/for-comprehensions.html" target="_blank">for-comprehensions,</a> e.g. <span style="background-color: #c6dafc;"><i>for (enumerators) yield e</i></span> where an enumerator is either a generator which introduces new variables, or a filter. Scala futures have the flatMap and withFilter combinators. The flatMap method takes a function that maps the value to a new future g, and then returns a future which is completed once g is completed. In other words, the flatMap operation maps its own value into some other future. Once this different future is completed, the resulting future is completed with its value. The filter combinator creates a new future which contains the value of the original future only if it satisfies some predicate. Otherwise, the new future is failed with a NoSuchElementException. The recover, recoverWith and fallbackTo combinators, in general, create a new future which holds the (same) result as the (original) future if it completed successfully.</div><div><br /></div></div>Pranav Patilhttp://www.blogger.com/profile/10125896717744035325noreply@blogger.com0tag:blogger.com,1999:blog-7610061084095478663.post-7264119843692396382015-12-31T17:18:00.001-08:002019-12-23T11:21:31.713-08:00The Performance ProblemNobody likes to wait. Nobody is willing to spend time on something if it's slow. Performance is the most important aspect to be considered in any software development process after ease of use. Google Search, Youtube, Facebook or WhatsApp wouldn't be popular if they were slow, no matter their content or user interface. Memory management is core to achieving the best performance. Despite the fact that memory and flash storage is getting cheaper by the passing day, badly implemented solutions can still run the system out of memory, thus degrading performance and ultimately crashing the system. The common misconception that performance should only be addressed in a later stage of development, or during the solution hardening process, is a recipe for failure. Performance should be accounted for during the initial design and development, which includes choosing the architecture, defining the database structure, and designing the data flow and algorithms.<br />
<br />
<b><span style="font-size: large;">Measure and Identify Problems</span></b><br />
Identify potential delays by sketching out the flow of data through the entire system. Chart where it enters, where it goes, and where it ends up. Mark the sections that can be tested directly and note the sections which are out of your control. This flowchart will not offer guaranteed answers, but it will be a road map for exploring. Creating mock data that mirrors real-world data, instead of testing only with random values, helps identify the areas impacted by high load.<br />
<br />
<b><span style="font-size: large;">Relational Database</span></b><br />
Every application today is more complex and performs many more functions than ever before. The database is critical to the functionality and performance of any application.<br />
<br />
<b><u>Excessive Querying</u></b>: The major problem with databases arises when excessive database queries are executed. When an application accesses the database far too often it results in longer response times and unacceptable overhead on the database. Hibernate and other JPA implementations to some extent provide fine-grained tuning of database access by providing eager and lazy fetching options. Eager fetching reduces the number of calls that are made to the database, but those calls are more complex and slower to execute and they load more data into memory. Lazy fetching increases the number of calls that are made to the database, but each individual call is simple and fast and it reduces the memory requirement to only those objects the application actually uses. The expected behavior of the application helps to decide how to configure the persistence engine. The correlation between the number of database calls and the number of executed business transactions helps to troubleshoot the excessive database query (N+1) problem.<br />
<br />
<b><u>Caching</u></b>: Database calls are very expensive from a performance standpoint, hence caching is the preferred approach to optimize the performance of applications, as it's much faster to read data from an in-memory cache than to make a database call across a network. Caching helps to minimize the database load when the application load increases. Caches are stateful, and it is important to retrieve the specific object requested. There are various levels, such as a level 2 cache which sits between the persistence layer and the database, or a stand-alone distributed cache that holds arbitrary business objects. Hibernate supports a level 2 cache which checks the cache for the existence of the object before making a database call and updates the cache afterwards. One of the major limitations of caching is that caches are of fixed size. Hence when the cache gets full, the least recently used objects are typically evicted, which can result in a "miss" if the removed object is requested again. The hit-miss ratio can be optimized by determining which objects to cache and configuring the cache size, in order to take advantage of the performance benefits of caching without exhausting all the memory on the machine. Distributed caching provides multiple caches on different servers, with all changes being propagated to all the members of the cache. Consistency is an overhead for distributed caching and should be balanced against the business requirements. The cache also needs to be refreshed frequently by expiring objects, in order to avoid reading stale values.<br />
<br />
<b><u>Connection Pool</u></b>: Database connections are relatively expensive to create, hence rather than creating connections on the fly, they should be created beforehand and used whenever needed to access the database. The database connection pool, which contains multiple database connections, enables executing concurrent queries against the database and limits the load on the database. When the number of connections in the pool is too small, business transactions are forced to wait for a connection to become available before continuing to process. On the other hand, when there are too many connections, they send a high number of requests to the database, causing high load and making all business transactions suffer from slow database performance. Hence the database pool size should be tuned carefully.<br />
<br />
<b><u>Normalization</u></b><br />
Normalization is the process of organizing the columns (attributes) and tables (relations) of a relational database to minimize data redundancy and eliminate inconsistent dependencies. <a href="http://www.ovaistariq.net/199/databases-normalization-or-denormalization-which-is-the-better-technique/">Normalized</a> databases fare very well under conditions where the applications are write-intensive and the write-load is more than the read-load. Normalized tables have a smaller footprint and are small enough to fit into the buffer. Updates are faster as there are no duplicates, and inserts are faster as data is inserted in a single place. Selects are faster on single tables whose size is small enough to fit in the buffer, and heavy-duty group-by or distinct queries can be avoided as there are no duplicates. However, fully normalized tables mean that more joins between tables are required to fetch the data. As a result, read operations suffer because indexing strategies do not go well with table joins. When a table is denormalized, select queries are faster as all the data is present in the same table, thus avoiding any joins. Also, indexes can be used more efficiently when querying a single table compared to join queries. When read queries are more common than updates and the table is relatively stable with infrequent data changes, normalization does not help much. <a href="http://www.selikoff.net/2008/11/19/why-too-much-database-normalization-can-be-a-bad-thing/">Normalization</a> is also used to save storage space, but as the price of storage hardware keeps falling it does not offer significant savings. The best approach is to mix normalized and denormalized approaches: normalize the tables where the number of update/insert operations is higher than select queries, and store all the columns which are read together very frequently in a single table. When mixing normalization and denormalization, focus on denormalizing tables that are read intensive, while keeping write-intensive tables normalized.<br />
<br />
<b><u>Indexing and </u></b><b><u>SQL Queries</u></b><br />
<b>Indexing</b> is an effective way to tune a SQL database. An index is a data structure that improves the speed of data retrieval operations on a database table by providing rapid random lookups and efficient access of ordered records. On the other hand, indexes decrease the performance of DML queries (Insert, Delete, Update) as all indexes need to be modified after these operations. When creating indexes, estimate the number of unique values the column(s) will have for a particular field. If a column can potentially return thousands of rows with the same value, which are then searched sequentially, an index seldom helps in speeding up the queries. A composite index contains more than one field and should be created if queries are expected to have multiple fields in the WHERE clause and all fields combined will give significantly fewer rows than the first field alone. A clustered index determines the physical order of data in a table and is particularly efficient on columns that are often searched for ranges of values.<br />
<br />
Below are few of the performance <a href="http://www.toptal.com/sql/sql-database-tuning-for-developers">guidelines</a> for SQL queries:<br />
<ul>
<li>A correlated subquery (nested query) is one which uses values from the parent query. Such query tends to run row-by-row, once for each row returned by the outer query, and thus decreases SQL query performance. It should be refactored as a join for better performance.</li>
<li>Select the specific columns which are required and "Select *" should be avoided, as it reduces the load on the resources to fetch the details of the columns.</li>
<li>Avoid using Temporary tables as it increases the query complexity. In case of a stored procedure with some data manipulation which cannot be handled by a single query, then temporary tables can be used as intermediaries in order to generate a final result.</li>
<li>Avoid Foreign keys constraints which ensure data integrity at the cost of performance. If performance is the primary goal then data integrity rules should be pushed to the application layer.</li>
<li>Many databases return the query execution plan for SELECT statements which is created by the optimizer. Such plan is very useful in fine tuning SQL queries. e.g. Query Execution Plan or Explain.</li>
<li>Avoid using functions or methods in the where clause.</li>
<li>Deleting or updating large amounts of data from huge tables, when run as a single transaction, might require killing or rolling back the entire transaction. It takes a long time to complete and also blocks other transactions for its duration, essentially bottle-necking the system. Deleting or updating in <a href="http://www.infoworld.com/article/2628420/database/database-7-performance-tips-for-faster-sql-queries.html">smaller batches</a> enhances concurrency, as small batches are committed to disk and leave only a small number of rows to roll back on failure. On the other hand, deleting or updating one row at a time increases the number of transactions and decreases efficiency.</li>
</ul>
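A minimal JDBC sketch of the batching guideline above; the table name <b>purchase_log</b>, the cutoff date, the connection URL, the MySQL-style LIMIT clause and the batch size of 10000 are assumptions for illustration only, not a definitive recipe.<br />
<pre>
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class BatchedDelete {

    public static void main(String[] args) throws SQLException {
        // Connection URL and credentials are placeholders for the example
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/sales", "user", "password")) {
            conn.setAutoCommit(true);
            // MySQL-style LIMIT on DELETE; other databases need an equivalent construct
            String sql = "DELETE FROM purchase_log WHERE created < '2015-01-01' LIMIT 10000";
            try (PreparedStatement stmt = conn.prepareStatement(sql)) {
                int deleted;
                do {
                    // Each execution commits a small batch, keeping locks and rollback work small
                    deleted = stmt.executeUpdate();
                } while (deleted > 0);
            }
        }
    }
}
</pre>
<br />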
<br />
<span style="font-size: large;"><b>Design</b></span><br />
Performance should be considered at every step in designing the system. Below are a few design considerations to be made while developing application services.<br />
<ul>
<li>Avoid making multiple calls to the database especially inside the loop. Instead try to fetch all the required records beforehand and loop through them to find a match.</li>
<li>The decision to make application services stateless instead of stateful does come with a performance price. In an effort to make the services stateless, common data (e.g. user roles) required by multiple services needs to be fetched every time, adding overhead to the database. Hence the design decision to make all services stateless should be carefully considered.</li>
<li>Database calls are expensive compared to network latency. Hence it is usually preferable to reuse the data fetched from the database, either by caching it or by sending it to the client to be sent back again for future processing. This depends, though, on the size of the data and the complexity of the queries fetching the data from the database.</li>
<li>When the size of the data is too large to fetch or process in a single call, pagination should be applied. If the services are stateless and fetching the data involves multiple joins or queries, then naive pagination fails as the service still needs to fetch all the rows and determine the chunk which the client has requested. In such cases, the database calls should be split into two. First fetch all the rows containing only the meta-data (especially unique ids) by which a chunk can be determined. Then another call is made using the meta-data (unique ids) to fetch the entire data for the requested chunk.</li>
<li>When fetching a large chunk of records from a database/service takes a substantial toll on performance, multiple threads can be spawned, each calling the database/service for a small chunk, and then merging all the results into a final result (see the sketch after this list).</li>
<li>When the number of records and size of data in the database increases exponentially, then no matter the amount of indexing and tuning, the performance of the system will deteriorate. In such cases highly scalable <a href="http://docs.oracle.com/cd/E29597_01/server.1111/e17157/architectures.htm">database architectures</a> such as clustered databases and <a href="https://docs.oracle.com/cd/B28359_01/server.111/b28310/ds_concepts001.htm">distributed</a> database processing frameworks such as <a href="https://www.safaribooksonline.com/library/view/hadoop-application-architectures/9781491910313/ch01.html">Hadoop</a> should be considered.</li>
</ul>
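A minimal sketch of the chunked parallel fetch idea from the list above, assuming a hypothetical fetchChunk(offset, limit) call that wraps the actual database/service query; the Record type, pool size and chunk size are illustrative only.<br />
<pre>
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ChunkedFetcher {

    private static final int CHUNK_SIZE = 500;

    public List<Record> fetchAll(int totalRecords) throws InterruptedException, ExecutionException {
        ExecutorService executor = Executors.newFixedThreadPool(4);
        List<Future<List<Record>>> futures = new ArrayList<>();
        try {
            // Submit one task per chunk; each task fetches a small slice of the data
            for (int offset = 0; offset < totalRecords; offset += CHUNK_SIZE) {
                final int start = offset;
                Callable<List<Record>> task = () -> fetchChunk(start, CHUNK_SIZE);
                futures.add(executor.submit(task));
            }
            // Merge the partial results into the final result, preserving chunk order
            List<Record> result = new ArrayList<>();
            for (Future<List<Record>> future : futures) {
                result.addAll(future.get());
            }
            return result;
        } finally {
            executor.shutdown();
        }
    }

    // Placeholder for the real database/service call fetching a single chunk
    private List<Record> fetchChunk(int offset, int limit) {
        return new ArrayList<>();
    }

    static class Record { }
}
</pre>
<br />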
<b><span style="font-size: large;">Algorithms</span></b><br />
Algorithms are core building blocks inside any application. The complexity of the algorithm affects the performance but not the other way around. The Big O notation is widely used in Computer Science to describe the performance or complexity of an algorithm, by specifically describing the worst-case scenario. It measures the efficiency of an algorithm based on the time it takes for the algorithm to run as a function of the input size. It is an expression of how the execution time of a program scales with the input data. In Big O notation O(N), N could be the actual input, or the size of the input. When <a href="https://www.interviewcake.com/article/java/big-o-notation-time-and-space-complexity">calculating the Big O</a> complexity, constants are eliminated as we're looking at what happens as n gets arbitrarily large. The less significant terms are also dropped, so O(n + n^2) becomes O(n^2), because they contribute less and less as n gets bigger. The statements, if-else cases, loops and nested loops in the code are <a href="http://web.mit.edu/16.070/www/lecture/big_o.pdf">analyzed</a> to determine the Big O value for the function.<br />
<a href="http://stackoverflow.com/questions/200384/constant-amortized-time">Amortized time</a> is often used when stating algorithm complexity and it looks at an algorithm from the viewpoint of total running time rather than individual operations. When an operation occurs over million times then the total time taken is considered as opposed to worst-case or the best-case of that operation. If the operation is slow on few cases it is ignored while computing Big O, as long as such cases are rare enough for the slowness to be diluted away within a large number of executions. Hence amortised time essentially means the average time taken per operation, when operation occurs many times. It can be a constant, linear or logarithmic.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-ucYlJdIIce8/VonkgqaMVfI/AAAAAAAAE1M/zA3DJfMZimA/s1600/big-o-complexity.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://3.bp.blogspot.com/-ucYlJdIIce8/VonkgqaMVfI/AAAAAAAAE1M/zA3DJfMZimA/s1600/big-o-complexity.png" /></a></div>
<br />
<br />
Below are a few <a href="http://www.javaworld.com/article/2077647/build-ci-sdlc/make-java-fast--optimize-.html">best practices</a> for designing better algorithms.<br />
<ul>
<li>Avoid execution or computation of an expression whose result is invariant inside the loop, by moving it outside the loop.</li>
<li>Prefer iteration over recursion, in order to avoid consuming a lot of stack frames, if the logic can be implemented using a few local variables.</li>
<li>Don’t call expensive methods in an algorithm's "leaf nodes", but cache the call instead, or avoid it if the method contract allows it.</li>
<li>Prefer a HashMap with O(1) element retrieval instead of a List with O(n) lookup when elements are looked up repeatedly (see the sketch after this list). Similarly, use the appropriate data structure (Collection type) based on the kind of operations used frequently.</li>
</ul>
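A small sketch combining two of the points above: hoisting invariant work out of the loop and using a HashMap for O(1) lookups instead of repeated list scans. The Customer type and the discount calculation are illustrative assumptions.<br />
<pre>
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LookupExample {

    static class Customer {
        String id;
        Customer(String id) { this.id = id; }
    }

    public double totalDiscount(List<String> customerIds, List<Customer> customers, double rate) {
        // Invariant work hoisted out of the loop: build the index once, O(n)
        Map<String, Customer> index = new HashMap<>();
        for (Customer c : customers) {
            index.put(c.id, c);
        }
        double total = 0;
        for (String id : customerIds) {
            // O(1) lookup per iteration instead of an O(n) scan over the list
            Customer c = index.get(id);
            if (c != null) {
                total += rate; // placeholder discount computation
            }
        }
        return total;
    }
}
</pre>
<br />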
<div>
<br /></div>
<span style="font-size: large;"><b>Java Coding Practices</b></span><br />
Proper <a href="http://javaperformancetuning.com/tips/rawtips.shtml">coding practices</a> help to avoid the performance problems faced when the system is put through high load either during performance testing or in production.<br />
<ul>
<li>Avoid using finalizers when possible in order to avoid delay in garbage collection.</li>
<li>Explicitly close resources (streams) without relying on the object finalizers.</li>
<li>Use StringBuffer/StringBuilder to build strings rather than concatenating multiple (immutable) Strings.</li>
<li>Use primitive types, which are faster than their boxed primitive counterparts (wrapper classes), and avoid unintentional autoboxing.</li>
<li>When data is constant, declare it as static and final; declaring it final signals the compiler that the value of the variable, or the object referred to by the variable, will never change, which can potentially allow performance optimizations.</li>
<li>Create an object once during initialization and reuse this throughout the run.</li>
<li>Prefer local variables instead of instance variables which have faster read access for better performance.</li>
<li>The <a href="http://programmers.stackexchange.com/questions/149563/should-we-avoid-object-creation-in-java">common misconception</a> is that object creation is expensive and should be avoided. On the contrary, the creation and reclamation of small objects whose constructors do little explicit work is cheap, especially on modern JVM implementations, even though its non-trivial and has a measurable cost. Although creation of extremely heavyweight objects such as database connection is expensive and such objects should be reused by maintaining an object pool. Further creating lightweight short lived objects in Java especially in a loop is cheap (apart from the hidden cost that the GC will run more often).</li>
<li>Cache the values in a variable instead of reading repetitively in the loop. For example caching the array length in a local variable is faster than reading the value for every single loop iteration.</li>
<li>Pre-sizing collections such as ArrayLists when the estimated collection size is known during creation, improves performance by avoiding frequent size reallocation.</li>
<li>Prefer to make the variable fields in the class <a href="http://www.devahead.com/blog/2011/12/coding-for-performance-and-avoiding-garbage-collection-in-android/">public for direct access</a> by outside classes rather than implementing getter and setter accessor methods, unless the class fields need to be encapsulated.</li>
<li>Declare a method as static if it doesn’t need to access other instance methods or fields.</li>
<li>Prefer manual (indexed) iteration instead of the built-in for-each loop for ArrayList, which uses an Iterator object for iteration. For arrays, however, for-each loops perform better than manual iteration.</li>
<li>Regular expressions should be avoided in nested loops. When regular expressions are used in computation-intensive code sections, the compiled Pattern reference should be cached instead of compiling it every time (see the sketch after this list).</li>
<li>Prefer <a href="http://www.graphics.stanford.edu/~seander/bithacks.html">bitwise</a> operations to arithmetic operations such as multiplication, division and modulus when there are extensive computations (e.g. cryptography). For <a href="http://jacksondunstan.com/articles/1946">example</a>, <b>i * 8</b> can be replaced by <b>i << 3</b>, <b>i / 16</b> by <b>i >> 4</b> and <b>i % 4</b> by <b>i & 3</b> (for non-negative values of i).</li>
</ul>
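Illustrating the regular-expression point above, the sketch below compiles the Pattern once as a constant instead of calling String.matches() (which recompiles the expression on every call) inside a loop; the simplified email pattern is an assumption for the example.<br />
<pre>
import java.util.List;
import java.util.regex.Pattern;

public class EmailFilter {

    // Compiled once and reused; String.matches() would recompile the pattern on every call
    private static final Pattern EMAIL = Pattern.compile("^[\\w.+-]+@[\\w-]+\\.[\\w.]+$");

    public long countValid(List<String> candidates) {
        return candidates.stream()
                .filter(s -> EMAIL.matcher(s).matches())
                .count();
    }
}
</pre>
<br />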
<div>
<br />
<span style="font-size: large;"><b>Memory</b></span><br />
<a href="http://info.appdynamics.com/rs/appdynamics/images/Top_10_Java_Performance_Problems_eBook.pdf">Garbage collection</a> occurs as either a minor or major mark-sweep collection. When eden section is full a minor mark-sweep garbage collection is performed and all the surviving objects are tenured or copied to the Tenured Space. During the major garbage collection, the garbage collector performs a mark-sweep collection across the entire heap, and then performs a compaction. It freezes all the running threads in the JVM, and results in the entire young generation to be free with all the live objects being compacted into the old generation space, shrinking its size. The longer it takes to complete or the more often it executes, the application performance is impacted. The amount of time taken by major garbage collection depends on the size of the heap with 2-3 GB of heap takes 3-5 seconds while 30 GB of heap takes 30 seconds. The java command option <b>–verbosegc</b> logs the full garbage collection entries for monitoring. The Concurrent Mark Sweep (CMS) garbage collection strategy allows an additional thread which is constantly marking and sweeping objects, which can reduce the pause times for major garbage collections. In order to mitigate the major garbage collections, the heap should be sized in such a way that short-lived objects are given enough time to die. The size of young generation space should be a little less than half the size of the heap and the survivor ratios should be anywhere between 1/6th and 1/8th the size of the young generation space.<br />
<br /></div>
<b><span style="font-size: large;">Memory Leaks</span></b><br />
A <a href="http://coderevisited.com/memory-leaks-in-java/">memory leak</a> occurs when memory acquired by a application for execution is never freed-up and the application inadvertently maintains a object references which it never intended to use again. The garbage collector finds the unused objects by traversing from root node through all nodes that are no longer being accessed or referenced and removes them freeing memory resources for the JVM. If the object is unintentionally referenced by other objects then it is excluded from garbage collection, as the garbage collector assumes that someone intended to use it at some point in the future. When this tends to occur in code that is frequently executed it causes the JVM to deplete its memory impacting performance and eventually exhaust its memory by throwing dreaded <a href="http://www.toptal.com/java/hunting-memory-leaks-in-java">OutOfMemory error</a>. Below are some of the best practices and common cases in order to avoid memory leaks.<br />
<ul>
<li>Each variable should be declared with the narrowest possible scope, thus eliminating the variable when it falls out of scope.</li>
<li>When maintaining your own memory using a set of references, the object references should be nulled out explicitly when they fall out of scope.</li>
<li>The key class used for a HashSet/HashMap should have proper equals() and hashCode() method implementations, or else repeatedly adding the same logical key (e.g. in an <a href="https://plumbr.eu/blog/memory-leaks/how-to-create-a-memory-leak">infinite loop</a>) keeps creating new entries, causing the collection to grow and leak.</li>
<li><a href="https://blogs.oracle.com/olaf/entry/memory_leaks_made_easy">Non-static inner</a>/anonymous classes, which have an implicit reference to their surrounding class, should be used carefully. When such an <a href="http://www.androiddesignpatterns.com/2013/01/inner-class-handler-memory-leak.html">inner class object</a> is passed to a method which stores the reference in a cache/external object, the local object (holding a reference to the enclosing class) is not garbage collected even when it is out of scope. Static inner classes should be used instead.</li>
<li>Check that unused entries in collections (especially static collections) are <a href="http://www.ibm.com/developerworks/library/j-leaks/">removed</a> to avoid ever increasing object counts. Prefer a <a href="http://www.appneta.com/blog/how-to-create-and-destroy-java-memory-leaks/">WeakHashMap</a> when entries are added to the collection without any clean up; entries in a WeakHashMap are removed automatically when the map key object is no longer referenced elsewhere (see the sketch after this list). Avoid using primitive wrappers and Strings as WeakHashMap keys as those objects are usually short lived and do not share the lifespan of the actual objects being tracked.</li>
<li>Check for event listeners (callbacks) that are registered but never unregistered after the registering class is no longer used.</li>
<li>When using <a href="http://www.javaworld.com/article/2071737/core-java/plug-memory-leaks-in-enterprise-java-applications.html">connection pools</a>, and when calling close() on the connection object, the connection returns back to the connection pool for reuse. It doesn't actually close the connection. Thus, the associated Statement and ResultSet objects remain in the memory. Hence, JDBC Statement and ResultSet objects must be explicitly closed in a finally block.</li>
<li>Usage of static classes should be minimized as they stay in memory for the lifetime of the application.</li>
<li>Avoid referencing objects from <a href="http://www.javaworld.com/article/2071737/core-java/plug-memory-leaks-in-enterprise-java-applications.html">long-lasting</a> (singleton) objects. If such usage cannot be avoided, use a weak reference, a type of object reference that does not prevent the object from being garbage collected.</li>
<li>Use of HttpSessions should be minimized and used only for state that cannot realistically be kept on the request object. Remove objects from HttpSession if they are no longer used.</li>
<li>ThreadLocal, backed by a member field in the Thread class, is useful to bind state to a thread. <a href="http://www.dynatrace.com/en/javabook/memory-leaks.html">Thread-local</a> variables will not be removed by the garbage collector as long as the thread itself is alive. As threads are often pooled and thus kept alive virtually forever, the object might never be removed by the garbage collector.</li>
<li>A DOM Node object always belongs to a DOM Document. Even when removed from the document the node object retains a reference to its owning document. As long as the child object reference exists, neither the document nor any of the nodes it refers to will be removed.</li>
</ul>
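As referenced in the WeakHashMap point above, the sketch below caches metadata keyed by the tracked object itself, so entries disappear once the key object is no longer referenced anywhere else; the Session and Metadata types are assumptions for the example.<br />
<pre>
import java.util.Map;
import java.util.WeakHashMap;

public class SessionMetadataCache {

    static class Session { }
    static class Metadata { }

    // Entries are removed automatically once the Session key is no longer referenced elsewhere
    private final Map<Session, Metadata> cache = new WeakHashMap<>();

    public void track(Session session, Metadata metadata) {
        cache.put(session, metadata);
    }

    public Metadata lookup(Session session) {
        return cache.get(session);
    }
}
</pre>
<br />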
<div>
<b><span style="font-size: large;">Concurrency</span></b><br />
Concurrency (aka multithreading) refers to executing several computations simultaneously and allows the application to accomplish more work in less time. In Java, threads coordinate access to shared state through synchronization. Synchronization requires a thread which is ready to execute a block of code to first acquire the object's lock, which can lead to many performance issues.<br />
<br />
Mutable shared objects are shared or accessible by multiple threads and can also be changed by multiple threads. Ideally any object that is shared between threads should be immutable, as immutable shared objects don't pose challenges to multithreaded code.<br />
<br />
<b>Deadlocks</b> occur when two or more threads need multiple shared resources to complete their task and they acquire those resources in a different order. When two or more threads each hold the lock for a resource that the other thread needs to complete its task, and neither thread gives up the lock it has already obtained, both remain blocked. Since in a synchronized block a thread must first obtain the lock for the code block before executing it, and no other thread is permitted to enter the block while it holds the lock, the JVM will eventually exhaust all or most of its threads and the application becomes slow even though the CPU appears underutilized. A thread dump helps to determine the root cause of deadlocks. Deadlocks can be avoided by making shared resources immutable or by always acquiring locks in a consistent order.<br />
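A classic sketch of the lock-ordering problem described above: two threads acquire the same two locks in opposite order and can deadlock; acquiring the locks in a consistent order (or sharing only immutable state) avoids it. The lock objects and sleep are purely illustrative.<br />
<pre>
public class DeadlockDemo {

    private final Object lockA = new Object();
    private final Object lockB = new Object();

    public void startThreads() {
        // Thread 1 takes lockA then lockB
        new Thread(() -> {
            synchronized (lockA) {
                pause();
                synchronized (lockB) {
                    System.out.println("Thread 1 acquired both locks");
                }
            }
        }).start();

        // Thread 2 takes lockB then lockA: the opposite order, which can deadlock
        new Thread(() -> {
            synchronized (lockB) {
                pause();
                synchronized (lockA) {
                    System.out.println("Thread 2 acquired both locks");
                }
            }
        }).start();
    }

    private void pause() {
        try {
            Thread.sleep(100);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
</pre>
<br />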
<br />
<b>Gridlock</b>: Thread synchronization is a powerful tool for protecting shared resources, but if major portions of the code are synchronized then the application may be inadvertently single-threaded. If the application has excessive synchronization, or synchronization around core functionality required by a large number of business transactions, then response times become slow with very low CPU utilization, as each of these threads reaching the synchronized code goes into a waiting state. The impact of synchronized blocks in the code should be analyzed and the code redesigned to minimize synchronization.<br />
<br />
A <b>thread pool</b> contains ready-to-execute threads which process the requests in the execution queue of the server. Creating and disposing of many threads is a common performance issue. If the application uses many threads for quicker response, it can be faster to create a pool of threads so they can be reused without being destroyed. This practice is best when the amount of computation done by each thread is small and the time spent creating and destroying a thread would be larger than the computation done inside it. The size of the thread pool directly impacts the performance of the application: if the pool is too small then requests wait, while if it is too large then many concurrently executing threads consume the server's resources. Too many threads also cause more time to be spent on context switching, causing threads to be starved. Hence the thread pool should be carefully tuned based on metrics of thread pool utilization and CPU utilization.</div>
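A minimal sketch of a reusable thread pool as described above, using a fixed-size ExecutorService; the pool size of 8 and the shutdown timeout are purely illustrative and should be tuned from thread-pool and CPU utilization metrics.<br />
<pre>
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class RequestProcessor {

    // Fixed pool of reusable worker threads; avoids creating/destroying a thread per request
    private final ExecutorService pool = Executors.newFixedThreadPool(8);

    public void submitRequest(Runnable request) {
        pool.submit(request);
    }

    public void shutdown() throws InterruptedException {
        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.SECONDS);
    }
}
</pre>
<br />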
<br />
<br />
<br />Pranav Patilhttp://www.blogger.com/profile/10125896717744035325noreply@blogger.com0tag:blogger.com,1999:blog-7610061084095478663.post-45717067946428747392014-10-12T18:52:00.011-07:002021-09-18T22:25:53.062-07:00Git: A Quick Guide<a href="http://git-scm.com/">Git</a> is one of the most popular source code management systems in the open source community. Developed by <a href="http://en.wikipedia.org/wiki/Linus_Torvalds">Linus Torvalds</a> for Linux kernel development, it has been adopted widely and is distributed as free software under the <a href="http://en.wikipedia.org/wiki/GNU_General_Public_License">GNU license</a>. There is plenty of <a href="http://git-scm.com/documentation">documentation</a> and many <a href="https://try.github.io/levels/1/challenges/1">tutorials</a> available for Git online as it is used by a large development community.<br />
<br />
<a href="http://aosabook.org/en/git.html">Git</a> supports distributed workflows, offers safeguards against content corruption and allows to access repository history when offline. Git uses <a href="http://en.wikipedia.org/wiki/Directed_acyclic_graph">directed acyclic graph</a> (DAG) to store content using different types of objects and also to store its history. The DAG allows git to efficiently determine common ancestors between two nodes especially during merges. Each git commit contains metadata about its ancestors were there can be zero or many (theoretically unlimited) parent commits. Git enables full branching capability using directed acyclic graphs to store content. The history of a file is linked all the way up its directory structure to the root directory, which is then linked to a commit node having one or more parents. Git merges the content of the two nodes in the DAG while merging two branches. Since git is a distributed system there is no absolute upstream or downstream repository in github. Branches in git are lightweight and similar to bookmarks in mercurial.<br />
<div>
<br /></div>
<div>
Below are the basic set of steps with commands to setup, commit and push using git.</div>
<div>
<br /></div>
Initializes a Git repository<br />
> <b><span style="color: blue;">git init</span></b><br />
<br />
<a href="https://git-scm.com/book/en/v2/Getting-Started-First-Time-Git-Setup">Setting up</a> user name and email address in one of the first steps to setup git. This is important since every Git commit uses this information. Using the --global option saves these settings and is used for every commit.<br />
> <b><span style="color: blue;">git config --global user.name "John Doe"</span></b><br />
> <b><span style="color: blue;">git config --global user.email johndoe@example.com</span></b><br />
<br />
The git status command shows the current state of the project.<br />
> <b><span style="color: blue;">git status</span></b><br />
<br />
Adds the specified file to the staging area in order to start tracking changes made to the corresponding file<br />
> <b><span style="color: blue;">git add newfile.txt</span></b><br />
<br />
Adds all the changes i.e. all the newly created text files using git add command with a wildcard.<br />
> <b><span style="color: blue;">git add '*.txt'</span></b><br />
<br />
Unstage the files in the staging area using the git reset command<br />
> <b><span style="color: blue;">git reset folder/file.txt</span></b><br />
<br />
Adds the staged changes into the repository by running the commit command with a message describing the changes.<br />
> <span style="color: blue;"><b>git commit -m "Add cute octocat story"</b></span><br />
<br />
The --all or -a option of git commit tells git to automatically stage the modified or deleted files, except newly added untracked files.<br />
> <b><span style="color: blue;">git commit -a -m "comments"</span></b><br /><br />
Git's log is a journal that remembers all the changes committed so far by order.<br />
> <b><span style="color: blue;">git log</span></b><br />
<br />
Git log can also be used to search for a particular commit based on the commit message.<br />
> <b><span style="color: blue;">git log --grep="search text"</span></b><br />
<div>
<br />
Git log also enables to search commits by author.<br />
> <b><span style="color: blue;">git log --author="John"</span></b><br />
<br />
Git log provides --after or --before flags for filtering commits by date as below<br />
> <b><span style="color: blue;">git log --after="2016-8-1"</span></b><br />
> <b><span style="color: blue;">git log --after="yesterday"</span></b><br />
> <b><span style="color: blue;">git log --after="2016-8-1" --before="2017-8-4"</span></b><br />
<div>
<br /></div>
Git log allows to display all the commits involving the specified file.<br />
> <b><span style="color: blue;">git log C:\code-folder\...\Code.java</span></b><br />
<br />
Git log also uses the -- parameter to indicate that subsequent arguments are file paths and not branch names. When multiple file names are passed, it returns all the commits that affect any of the passed file paths.<br />
> <b><span style="color: blue;">git log -- Controller.java Service.java</span></b><br />
<br />
Git log also provides -S<string> and -G<regex> flags to search the content of the commits by string or regular expressions respectively.<br />
> <b><span style="color: blue;">git log -S"Hello, World!"</span></b><br />
<br />
The --oneline flag condenses each commit to a single line. The <a href="https://www.atlassian.com/git/tutorials/git-log/formatting-log-output">--decorate option</a> is used to display all of the references (e.g., branches, tags, etc) that point to each commit. The --graph option draws an ASCII graph representing the branch structure of the commit history. The --oneline, --decorate and --graph commands enables to see which commit belongs to which branch as below.<br />
> <b><span style="color: blue;">git log </span></b><span style="color: blue;"><b>--graph --decorate </b></span><b><span style="color: blue;">--oneline</span></b><br />
<div>
<b><span style="color: blue;"><br /></span></b>
Display all commits (regardless of the branch checked out) by using the --all option as below.<br />
> <b><span style="color: blue;">git log --oneline --all</span></b></div>
<div>
<b><span style="color: blue;"><br /></span></b></div>
Displays a certain number of commits e.g. 10 in a single line<br />
> <b><span style="color: blue;">git log --pretty --oneline -10</span></b><br />
<br />
Display all the commits filtering all the merge commits using the --no-merges flag.<br />
> <b><span style="color: blue;">git log --no-merges</span></b><br />
<br />
To check the <a href="http://stackoverflow.com/questions/7057950/commit-differences-between-local-and-remote">commit difference</a> between local and the remote repository, we first fetch the latest changes and then compare the master with "origin/master" to get the variance between the two.<br />
> <b><span style="color: blue;">git fetch</span></b><br />
> <span style="color: blue;"><b>git log master..origin/master</b></span><br />
<br />
Navigation while viewing the git commits using the log command may be tricky sometimes. Below is a list of shortcuts used to navigate in git's pager:<br />
<br />
Next line: <b style="color: blue;">return</b><br />
Next page: <b style="color: blue;">space bar</b><br />
Previous page: <b style="color: blue;">w</b><br />
Quit viewing the diff: <b style="color: blue;">q</b><br />
Help: <b style="color: blue;">h</b><br />
<div>
<br />
Git maintains a reflog in the background, which is a log of where your HEAD and branch references have been for the last few months. The reflog is displayed as below.<br />
> <b><span style="color: blue;">git reflog</span></b><br />
<br />
Whenever the branch tip is updated, Git stores such information in its temporary history. Older commits can be <a href="https://git-scm.com/book/en/v2/Git-Tools-Revision-Selection">specified using</a> this data, by using @{n} reference from the reflog output. For example the below command displays the fifth prior value of the HEAD for the current repository.<br />
<div>
> <b><span style="color: blue;">git show HEAD@{5}</span></b></div>
<br />
The git-diff-tree is a lower level command which compares the content changes using two tree objects. It can be used to list all the files modified within a given commit as below.<br />
> <b><span style="color: blue;">git diff-tree --no-commit-id --name-only -r <span style="color: blue;"><sha1-commit-id></span></span></b><br />
<br />
<a href="http://stackoverflow.com/questions/424071/how-to-list-all-the-files-in-a-commit">Alternatively</a> the git show command can be used to display all the files modified within the commit. Here the --no-commit-id suppresses the commit ID output and the --pretty argument specifies an empty format string to avoid the cruft at the beginning. The --name-only argument shows only the file names that were affected and the -r argument is to recurse into sub-trees.<br />
> <b><span style="color: blue;">git show --pretty="" --name-only <span style="color: blue;"><sha1-commit-id></span></span></b><br />
<br />
The git ls-files command shows all the files in the index and the working tree.<br />
> <b><span style="color: blue;">git ls-files</span></b><br />
<br />
The git list files command also allows to list files in a specified branch<br />
> <b><span style="color: blue;">git ls-tree -r master --name-only</span></b><br />
<br />
The -r option allows to recurse into subdirectories and print each file currently under version control.<br />
We can also specify HEAD instead of master to get the list for the branch we are currently on, as below.<br />
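> <b><span style="color: blue;">git ls-tree -r HEAD --name-only</span></b><br />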
<br /></div>
</div>
Adds a remote repository using the <b>remote add</b> command with a remote repository name and repository URL as below.<br />
> <b><span style="color: blue;">git remote add origin https://github.com/try-git/try_git.git</span></b><br />
<br />
In order to update the repository URL of a remote that has already been added, the <a href="https://help.github.com/articles/changing-a-remote-s-url/">remote set-url</a> command can be used as below. It takes two arguments, namely the existing remote name and the new URL.<br />
> <b><span style="color: blue;">git remote set-url origin https://github.com/new/new.git</span></b><br />
<br />
Displays the information about the repository particularly the fetch and the push urls.<br />
> <b><span style="color: blue;">git remote show origin</span></b><br />
<br />
<div>
Cloning a git repository can be done using the git clone command, specifying the remote repository url and the local repository path.<br />
> <b><span style="color: blue;">git clone git@github.com:whatever folder-name</span></b></div>
<div>
<br /></div><div>The -b branch option along with single-branch option allows to clone only a single branch.</div><div>> <b><span style="color: #2b00fe;">git clone https://github.com/pranav-patil/spring-microservices.git -b release --single-branch</span></b></div><div><br /></div>
The push command pushes all the committed changes to the remote repository. The name of the remote repo, "origin", and the default local branch name, "master", need to be specified. The <b>-u</b> option tells Git to remember the repo and branch parameters, enabling to simply run git push without any parameters afterwards.<br />
> <b><span style="color: blue;">git push -u origin master</span></b><br />
<br />
When the repository parameter is missing, the current branch configuration is checked to determine the repository to be pushed else it is defaulted to origin.<br />
> <b><span style="color: blue;">git push</span></b><br />
<br />
The below push command explicitly specifies the remote repository and the remote branch.<br />
> <b><span style="color: blue;">git push origin master</span></b><br />
<br />
When git push refuses to update a remote ref that is not an ancestor of the local ref, or when the current value of remote ref does not match the expected value, then the <b><span style="color: red;">--force</span></b> option is used to disable these checks, and can cause the remote repository to lose commits. It is used to synchronize the remote repository with the local repository discarding the remote changes.<br />
> <b><span style="color: blue;">git push origin HEAD --force</span></b><br />
> <b><span style="color: blue;">git push origin master --force</span></b><br />
<br />
The <b>--force-with-lease</b> option avoids unintentional overwrites as it updates the remote reference only if it still has the same value as the remote-tracking branch we have locally, otherwise failing with a stale info message. For example:<br />
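> <b><span style="color: blue;">git push --force-with-lease origin master</span></b><br />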
<br />
The --set-upstream or -u option <a href="http://stackoverflow.com/questions/6089294/why-do-i-need-to-do-set-upstream-all-the-time">sets</a> the association between the local branch and the remote branch, so that we can use "git pull" without specifying the remote branch name every time. It can be used with the branch or push command as below.<br />
> <b><span style="color: blue;">git push --set-upstream origin <branchname></span></b><br />
> <b><span style="color: blue;">git branch --set-upstream <branchname> origin/<branchname></span></b><br />
<br />
The git fetch command simply gets all the new changes from the specified remote repository, by default from the origin repository.<br />
> <b><span style="color: blue;">git fetch upstream</span></b><br />
<br />
The --all option for git fetch (and also git pull) fetches from all the configured remote repositories.<br />
> <b><span style="color: blue;">git fetch --all</span></b><br />
<br />
The git pull command fetches the changes from the origin repository from the master branch and merges them with the current branch.<br />
> <b><span style="color: blue;">git pull origin master</span></b><br />
<br />
Git's pull command first fetches the latest changes from the remote repository, similar to executing the "<b>git fetch</b>" command. It then runs the "<b>git merge FETCH_HEAD</b>" command to merge the retrieved branch head into the current branch. To avoid merge commits, a rebase can be performed instead, which replays the local changes on top of the latest fetched changes, as below:<br />
> <b><span style="color: blue;">git pull --rebase</span></b><br />
<br />
The below command gets the changes from the master branch of the upstream repository into the current branch and rebases the local changes on top of it.<br />
<span style="color: blue;">></span><span style="color: blue; font-weight: bold;"> git pull -r upstream master</span><br />
<br />
In order to make the <a href="https://coderwall.com/p/yf5-0w">rebase option the default</a> for all pull commands on newly created branches, we set this in the configuration as below:<br />
> <b><span style="color: blue;">git config branch.autosetuprebase always</span></b><br />
<br />
Enabling pull rebase by default for existing branches can be done as below:<br />
> <b><span style="color: blue;">git config branch.YOUR_BRANCH_NAME.rebase true</span></b><br />
<br />
In order to carry out merge instead of rebase explicitly, we use<br />
> <b><span style="color: blue;">git pull --no-rebase</span></b><br />
<br />
Git uses the environment variables <b>http_proxy</b> and <b>https_proxy</b> for accessing remote repositories through a proxy configuration if required. The <b>no_proxy</b> environment variable is defined for accessing any internal repository or <a href="http://stackoverflow.com/questions/16067534/only-use-a-proxy-for-certain-git-urls-domains/16069911">excluding proxy settings</a> for particular repositories.<br />
<div>
<br /></div>
In order to check the difference between the working directory and the most recent commit, referred to by the HEAD pointer, the diff command is used.<br />
> <b><span style="color: blue;">git diff HEAD</span></b><br />
<br />
To view the changes within staged files a --staged option is added to diff command :<br />
> <b><span style="color: blue;">git diff --staged</span></b><br />
<br />
View all the differences, or the difference for a single file, against a previous commit hash<br />
> <b><span style="color: blue;">git diff CHANGE_HASH</span></b><br />
> <b><span style="color: blue;">git diff CHANGE_HASH -- repo/src/java/Sample.java</span></b><br />
<br />
Compares the content and mode of blobs between the index (cache) and the repository<br />
> <b><span style="color: blue;">git diff --cached</span></b><br />
<div>
<br />
To list all the files to be pushed to the master branch of the origin repository, the below diff command with the --cached option is used. The --stat option displays the ratio of added and removed lines.<br />
> <b><span style="color: blue;">git diff --stat --cached origin/master</span></b><br />
<br />
The --numstat option is similar to --stat and shows the number of added as well as deleted lines in decimal notation along with the pathname without abbreviation. Hence the below command displays the full file paths of the files that are changed.<br />
> <b><span style="color: blue;">git diff --numstat origin/master</span></b><br />
<br /></div>
Listing all the files created or modified for the specified commit hash:<br />
<br />
> <b><span style="color: blue;">git diff-tree --no-commit-id --name-only -r d9ce760</span></b><br />
> <b><span style="color: blue;">git show --pretty="format:" --name-only d9ce760</span></b><br />
<div>
<br />
<div>
To compare the same file between two different commits (not contiguous) on the same branch, we specify the start and end commit with git diff.<br />
> <b><span style="color: blue;">git diff HEAD^^ HEAD pom.xml</span></b><br />
> <b><span style="color: blue;">git diff HEAD^^..HEAD pom.xml</span></b></div>
</div>
<div>
<br /></div><div><div>To compare a specific file from two different branches, we use the below diff command.</div><div><br /></div><div>> <b><span style="color: #2b00fe;">git diff mybranch master -- myfile.java</span></b></div><div>> <b><span style="color: #2b00fe;">git diff mybranch..master -- myfile.java</span></b></div></div><div><br /></div>
To change the files back to the last commit, the git checkout command is used:<br />
> <b><span style="color: blue;">git checkout -- <target></span></b><br />
> <b><span style="color: blue;">git checkout -- octocat.txt</span></b><br />
<br />
Checkout the local repository to a particular changeset<br />
> <b><span style="color: blue;">git checkout <git-changeset></span></b><br />
<div>
<br />
Checkout the previous commit. ^n selects the nth parent of the commit. On the windows command prompt, we need two ^'s because ^ is the escape character.<br />
> <b><span style="color: blue;">git checkout {commit}^^1</span></b><br />
<br />
The <a href="http://www.gitguys.com/topics/head-where-are-we-where-were-we/">HEAD</a> is a default variable which is a reference to the current (most recent) commit in git. Many git commands, such as <b>git log</b> and <b>git show</b> use HEAD as the commit to report on. The ~ character (“tilde”) character is used to refer to the parent of the commit. The contents of the git HEAD variable is stored in a text file in the .git/HEAD. HEAD^ (which is short for HEAD^1) means in git that the first parent of the tip of the current branch. Git commits can have more than one parent. HEAD^ is short for HEAD^1, and one can also address HEAD^2 and so on as appropriate. We can get to parents of any commit, not just <a href="http://www.gitguys.com/topics/head-where-are-we-where-were-we/">HEAD</a>. Also we can move back through generations: for example, master~2 means the grandparent of the tip of the master branch, favoring the first parent in cases of ambiguity. These specifiers can be chained arbitrarily , e.g., topic~3^2.<br />
<div>
<br />
<b><a href="http://www.paulboxley.com/blog/2011/06/git-caret-and-tilde">Difference</a> between Git ^ (caret) and ~ (tilde) </b><br />
<div class="separator" style="clear: both; text-align: center;">
<b><a href="https://2.bp.blogspot.com/-2jnlgS30r8g/V4rwc5cpIWI/AAAAAAAAE7A/XY7nT4WIzGQa1wnYwweieGgE_FjBM9ooACLcB/s1600/git-graph_1.png" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="255" src="https://2.bp.blogspot.com/-2jnlgS30r8g/V4rwc5cpIWI/AAAAAAAAE7A/XY7nT4WIzGQa1wnYwweieGgE_FjBM9ooACLcB/s400/git-graph_1.png" width="400" /></a></b></div>
<br />
ref~ is shorthand for ref~1 and means the commit's first parent.<br />
<br />
ref~2 means the commit's first parent's first parent, i.e. the commit's grandparent.<br />
ref~3 means the commit's first parent's first parent's first parent. And so on.<br />
<br />
ref^ is shorthand for ref^1 and means the commit's first parent. Hence ref~ and ref^ are equivalent.<br />
But ref^2 means the commit's second parent as commits can have two parents when they are a merge.<br />
<br />
Below is the summary of all <a href="https://git-scm.com/book/en/v2/Git-Tools-Revision-Selection">different</a> commit references with their meaning:</div>
<ul>
<li>HEAD~2 : first parent of the first parent, or grandparent</li>
<li>HEAD^2 : second parent of HEAD, if HEAD was a merge, otherwise illegal</li>
<li>HEAD@{2} : refers to the 3rd listing in the overview of git reflog</li>
<li>HEAD~~ : 2 commits older than HEAD</li>
<li>HEAD^^^ : first parent of the first parent of the first parent</li>
</ul>
<br />
View all the local branches in the local repository.</div>
> <b><span style="color: blue;">git branch</span></b><br />
<div>
<b><span style="color: blue;"><br /></span></b></div>
A new branch can be created using branch command or using checkout command with option -b<br />
> <b><span style="color: blue;">git branch branchname</span></b><br />
> <b><span style="color: blue;">git checkout -b <span style="color: blue;">branchname</span></span></b><br />
<br />
Delete the specified git branch using the <b>-d</b> option with the git branch command:<br />
> <b><span style="color: blue;">git branch -d branchname</span></b><br />
<br />
To search for all the branches containing the particular commit, the <b>contains</b> option can be used.<br />
> <b><span style="color: blue;">git branch --contains <commit></span></b><br />
<br />
The deleted branch can be recovered using the git hash <sha1-commit-id> for the branch.<br />
> <b><span style="color: blue;">git branch branchName </span></b><b><span style="color: blue;"><sha1-commit-id></span></b><br />
<br />
Switch the branch in current repository<br />
> <b><span style="color: blue;">git checkout branchname</span></b><br />
<br />
Git allows to tag specific commits in the history. The tag command is used to list all the available tags.<br />
> <b><span style="color: blue;">git tag</span></b><br />
<br />
A tag is created by passing a name parameter to the tag command and it is deleted with the -d option as below.<br />
> <b><span style="color: blue;">git tag release01</span></b><br />
> <b><span style="color: blue;">git tag -d release01</span></b><br />
<br />
A branch can be tagged as below by adding the branch name parameter with the tag command. This can later be used to checkout the branch instead of specifying the branch name.<br />
> <b><span style="color: blue;">git tag archive/<branchname> <branchname></span></b><br />
<br />
Git also allows searching all the available tags, for example those starting with the "v1.8.5" series as below.<br />
> <b><span style="color: blue;">git tag -l "v1.8.5*"</span></b><br />
<br />
To display the commit details and file changes for the corresponding tag, the git show command is used.<br />
> <span style="color: blue;"><b>git show v.1.8.5</b></span><br />
<br />
The <b>git rm</b> command removes files from the working tree and from the index or staging area. It not only removes the actual files from disk, but will also stage the removal of the files.<br />
> <b><span style="color: blue;">git rm '*.txt'</span></b><br />
<br />
The git rm command with the --cached option only removes the files from the index and keeps the files in the working copy.<br />
> <b><span style="color: blue;">git rm --cached notes.txt</span></b><br />
<br />
Merges the changes from the specified branch name into the currently checked out branch (master in this case). The target branch should be checked out first in order to merge the specified branch into it.<br />
> <b><span style="color: blue;">git checkout current_branch_name</span></b><br />
> <b><span style="color: blue;">git merge branchname</span></b><br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-oNwXREzmkac/VEp14TBXJVI/AAAAAAAAB-4/8y_ohymJWmw/s1600/vRdkr.png" style="clear: right; display: inline; float: right; margin-bottom: 1em; margin-left: 1em; text-align: center;"><img border="0" height="286" src="https://1.bp.blogspot.com/-oNwXREzmkac/VEp14TBXJVI/AAAAAAAAB-4/8y_ohymJWmw/s1600/vRdkr.png" width="320" /></a></div>
<br />
<a href="http://nvie.com/posts/a-successful-git-branching-model/">Fast-forward merging</a> updates the HEAD (along with the index) to point at the named commit, without creating an extra merge commit. It is recommended for short-lived branches. Non-fast-forward merging enables to have plain history with straight branch, without complex branching making the history easier to understand and easier to revert a group of commits.<br />
<br />
By default fast-forward merging is enabled; it can be suppressed by specifying the <b>--no-ff</b> option as below.<br />
> <b><span style="color: blue;">git merge --no-ff</span></b><br />
<div>
<br />
In any non fast-forward merge, the <a href="http://git-scm.com/docs/git-merge">merged version</a> is committed, which reconciles the changes to be merged from all branches, and the HEAD, index, and working tree are updated to it. In case the reconciliation of the changes fails, the HEAD points to the same place while MERGE_HEAD points to the other branch head, and the paths which merged cleanly are updated both in the index file and in the working tree. In a conflicting scenario, the index file records up to three versions: common ancestor, HEAD changes and MERGE_HEAD changes. The working tree files contain the result of the "merge", i.e. 3-way merge results with conflict markers. The merge --abort command enables recovering from complex conflicts and starting the merge process over, as below.<br />
<br />
> <b><span style="color: blue;">git merge –abort</span></b><br />
<div>
<br /></div>
</div>
<a href="http://stackoverflow.com/questions/1338728/delete-commits-from-a-branch-in-git">Removes</a> all of the working changes along with the changes in the staging area without effecting any local commits. <span style="color: red;">By default git reset command resets to the last commit in the current branch or to the specified changeset</span>.<br />
<br />
> <b><span style="color: blue;">git reset --hard</span></b><br />
> <b><span style="color: blue;">git reset --hard HEAD~1</span></b><br />
> <b><span style="color: blue;">git reset --hard <sha1-commit-id></span></b><br />
<div>
<br />
Git reset along with master branch is used to remove all the working changes and local commits on the master branch. The below command discards all local commits and working changes.<br />
> <b><span style="color: blue;">git reset --hard origin/master</span></b><br />
<br /></div>
<div>
Resets the current branch pointer to a specific prior commit (e.g. the 6th prior position of HEAD from the reflog), discarding all changes in the working tree and staging area.</div>
<div>
> <b><span style="color: blue;">git reset --hard HEAD@{5}</span></b></div>
<div>
<br /></div>
The soft option for git reset discards neither the working tree changes nor the index (staged) changes. It does move the changes from the undone commits back into the staging area though.<br />
> <b><span style="color: blue;">git reset --soft</span></b><br />
<br />
Removes the last commit, keeping all of its file changes as staged changes.<br />
> <b><span style="color: blue;">git reset --soft HEAD~</span></b><br />
> <b><span style="color: blue;">git reset --soft HEAD~1</span></b><br />
<br />
The mixed option for git reset only discards the index or staged changes without affecting the working tree changes.<br />
> <b><span style="color: blue;">git reset --mixed</span></b><br />
<br />
To remove the most recently committed changes, the git reset command is used on HEAD^ as below:<br />
> <b><span style="color: blue;">git reset --hard HEAD^</span></b><br />
<br />
To remove changes for a particular file from the last commit, or to amend the last commit to un-add a file, the below command is used. HEAD^ refers to the commit before the last one, so the file is reset to its content in that commit.<br />
> <b><span style="color: blue;">git reset HEAD^ path/to/file/to/revert</span></b><br />
<br />
Further, using git force push can <a href="http://stackoverflow.com/questions/448919/how-can-i-remove-a-commit-on-github">remove commits</a> from remote repository, mostly helpful to <a href="https://help.github.com/articles/remove-sensitive-data/">remove any sensitive data</a>. <a href="http://www-cs-students.stanford.edu/~blynn/gitmagic/ch05.html#_8230_and_then_some">Read for more details</a>.<br />
> <b><span style="color: blue;">git push origin +master</span></b><br />
<br />
After a reset the removed commit goes into a <b>“dangling”</b> state and still resides in git’s datastore, waiting for the next garbage collection to clean it up. Hence the commit can still be restored unless <b>git gc</b> is run, which cleans all the dangling commits. The <b>git fsck</b> command is part of the <a href="http://git-scm.com/book/en/v2/Git-Internals-Maintenance-and-Data-Recovery">Maintenance and Data Recovery</a> utilities for git. The fsck command with the --lost-found option can be used to look up all the dangling commits.<br />
> <b><span style="color: blue;">git fsck --lost-found</span></b><br />
<br />
The <b>git reflog</b> command can also be used to check for dangling commits. The <b>git merge</b> command passing the SHA1 of the commit to recover can <a href="http://gitready.com/advanced/2009/01/17/restoring-lost-commits.html">recover</a> the specified commit with HEAD pointing to it.<br />
> <span style="color: blue;"><b>git merge </b></span><b><span style="color: blue;"><sha1-commit-id></span></b><br />
<div>
<br /></div>
Git allows creating a commit with the <a href="http://stackoverflow.com/questions/4114095/revert-to-a-previous-git-commit">reverse commit</a> to cancel out an earlier one, which makes it possible to undo changes which have already been published without rewriting any branch history.<br />
<br />
Create three separate revert commits:<br />
> <b><span style="color: blue;">git revert </span></b><b><span style="color: blue;"><sha1-commit-id-1> </span></b><b><span style="color: blue;"><sha1-commit-id-2> </span></b><b><span style="color: blue;"><sha1-commit-id-3></span></b><br />
<br />
Reverts the last two commits within the range:<br />
> <span style="color: blue;"><b>git revert HEAD~2..HEAD</b></span><br />
<br />
Reverts a merge commit<br />
> <span style="color: blue;"><b>git revert -m 1 <merge-</b><b style="color: black;"><span style="color: blue;">sha1-commit-id</span></b><b>></b></span><br />
<div>
<br />
Git allows <a href="http://stackoverflow.com/questions/61212/how-do-i-remove-local-untracked-files-from-my-current-git-branch">deleting the local untracked files</a> from the current branch using the clean command. The -n or --dry-run option enables <a href="http://stackoverflow.com/questions/61212/how-do-i-remove-local-untracked-files-from-my-current-git-branch">previewing the files</a> which will be deleted.<br />
<br />
Deletes the local un-tracked files, and also un-tracked directories when the -d option is added.<br />
> <b><span style="color: blue;">git clean -f -d</span></b><br />
<br />
Deletes only the files ignored by Git from the current branch with the -X option.<br />
> <b><span style="color: blue;">git clean -f -X</span></b><br />
<br />
Deletes both the ignored and non-ignored untracked files from the current branch with the -x option.<br />
> <b><span style="color: blue;">git clean -f -x</span></b><br />
<div>
<br /></div>
</div>
Takes a mailbox of commits formatted as email messages (output from <b>git format-patch</b>) and applies them to the current branch.<br />
> <b><span style="color: blue;">git am</span></b><br />
<br />
Finds the head commit of the branch before the rebase process was started:<br />
> <b><span style="color: blue;">git reflog</span></b><br />
<br />
Adding a tag for specific commit e.g. "7108c3c" is done using the tag command.<br />
> <b><span style="color: blue;">git tag -a v1.2 7108c3c -m "Message"</span></b><br />
<br />
<div>
Allows to view or browse the git documentation for the specified command online<br />
> <b><span style="color: blue;">git help push</span></b></div>
<div>
<b><span style="color: blue;"><br /></span></b>
Stashing enables saving all the modified, tracked files with unfinished changes onto a stack, in order to get back to a clean branch state, for example to pull new changes from the remote repository. The stashed changes can later be applied back to the working directory to return to the previous dirty state.<br />
<br />
Pushes a new stash on the stack with all the tracked working changes.<br />
> <b><span style="color: blue;">git stash</span></b><br />
<br />
To view all the stashes stored on the stack, we use<br />
> <b><span style="color: blue;">git stash list</span></b><br />
<br />
Shows the diff of the changes recorded in each stash<br />
> <span style="color: blue;"><b>git stash list -p </b></span><br />
<div>
<span style="color: blue;"><b><br /></b></span></div>
Retrieve all the working changes from the most recent stash pushed on the stack and apply to working copy<br />
> <b><span style="color: blue;">git stash apply</span></b><br />
<b><span style="color: blue;"><br /></span></b>
Shows the contents of one particular stash in the stack<br />
> <span style="color: blue;"><b>git stash show stash@{0} </b></span><br />
<div>
<span style="color: blue;"><b><br /></b></span></div>
<div>
Removes all the stashed states from the stack</div>
> <span style="color: blue;"><b>git stash clear</b></span><br />
<span style="color: blue;"><b><br /></b></span>
Applies a particular stash from the stack by naming it using its index in the stack<br />
> <span style="color: blue;"><b>git stash apply stash@{0}</b></span><br />
<span style="color: blue;"><b><br /></b></span>Stashing only un-staged changes in Git<br />
> <b><span style="color: blue;">git stash save --keep-index</span></b><br />
<div>
<br />
<a href="https://www.atlassian.com/git/tutorials/merging-vs-rebasing/conceptual-overview"><b>Rebasing</b></a> re-writes the project history by creating brand new commits for each commit in the original branch. Its major benefit is a much cleaner project history, eliminating the unnecessary merge commits by creating a perfectly linear project history. The <a href="https://medium.freecodecamp.com/git-rebase-and-the-golden-rule-explained-70715eccc372#.nq8xig3my">golden rule</a> of git rebase is to never use it on public branches. For example, never rebase master onto your feature branch as it will move all of the commits in master onto the tip of feature.<br />
<br />
The below rebase command moves the entire feature branch to begin on the tip of the master branch, effectively incorporating all of the new commits in master.<br />
> <b><span style="color: blue;">git checkout feature</span></b><br />
> <span style="color: blue;"><b>git rebase master</b></span><br />
<div>
<br /></div>
<div>
The local commits are applied on top of the upstream/master as below.</div>
> <b><span style="color: blue;">git rebase upstream/master</span></b><br />
<br />
The below command rebases the topic branch onto the base branch by replaying the topic branch's commits on top of it.<br />
> <b><span style="color: blue;">git rebase basebranch topicbranch</span></b><br />
<br />
Interactive rebasing enables altering the commits as they are moved to the new base and is used to clean up a messy history before merging a feature branch into master. It also allows modifying commits from the previous commit history which have not yet been pushed to the remote repository. From the list of commits being rebased, the text is changed from <b>pick</b> to <b>edit</b> next to the hash of the commit to be modified. Git then prompts to change the commit. This is mainly used to change the commit message, change the commit author, squash two commits into a single commit, and in many other cases.</div>
<br />
Consider <a href="http://stackoverflow.com/questions/3042437/change-commit-author-at-one-specific-commit">changing the author of the recent commit</a>, we invoke the git interactive rebase using -i option:<br />
> <b><span style="color: blue;">git rebase -i origin/master</span></b><br />
<br />
Git will rebase, stopping at each commit we marked as <b>edit</b>. Now for each such commit we execute:<br />
> <span style="color: blue;"><b>git commit --amend --author="Philip J Fry <someone@example.com>"</b></span><br />
> <b><span style="color: blue;">git rebase –continue</span></b><br />
<br />
> <b><span style="color: blue;">git commit --amend --reset-author</span></b><br />
<br />
The rebase command has the interactive mode "-i" which enables <a href="http://gitready.com/advanced/2009/02/10/squashing-commits-with-rebase.html">squashing</a> multiple smaller commits into a single larger commit. Only non-pushed local commits should be squashed within the interactive mode, to prevent conflicts and rewriting of public history. The below rebase command wraps the last 4 commits together into a single commit:<br />
> <b><span style="color: blue;">git rebase -i HEAD~4</span></b><br />
<br />
<div>
Then we mark the commits to be merged as <b>squash</b>, and save the changes using "<b>:wq</b>".<br />
Git then allows modifying the new commit's message based on the messages of the commits involved in the process. So we edit the commit messages into a single commit message and exit the editor using "<b>:wq</b>", saving the changes to finalize the rebase squashing into a single commit.</div>
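<br />
For illustration, the todo list presented by git during such an interactive rebase might look roughly like the below (the commit hashes and messages here are hypothetical); marking the later commits as <b>squash</b> folds them into the first picked commit:<br />
<span style="color: purple;">pick a1b2c3d Add feature skeleton</span><br />
<span style="color: purple;">squash d4e5f6a Fix typo in feature</span><br />
<span style="color: purple;">squash 9f8e7d6 Address review comments</span><br />
<span style="color: purple;">squash 0c1b2a3 Update documentation</span><br />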
<br />
Similarly, the last commit can be updated to add a new message as below:<br />
> <b><span style="color: blue;">git commit --amend -m "New commit message"</span></b><br />
<br />
By default, the amend option for git commit also adds all the staged file changes to the most recent local commit.<br />
> <b><span style="color: blue;">git commit --amend</span></b><br />
<br />
The last commit on the local repository can be removed by executing git rebase and deleting the second line within the editor window that pops up.<br />
> <b><span style="color: blue;">git rebase -i HEAD~2</span></b><br />
<br />
Sometimes a feature branch is created from another feature branch, rather than from the master branch. The --onto option of the rebase command enables <a href="http://stackoverflow.com/questions/12355904/split-a-git-branch-into-two-branches">fixing this</a> by transplanting a branch based on one branch onto another, as below.<br />
> <b><span style="color: blue;">git rebase --onto master oldparent newbranch</span></b><br />
<br />
In Git there are two places the configurations can be stored. In a global ~/.gitconfig and in a local per-repo .gitconfig which is inside the .git directory. Each local project repository can be configured to set the commit user name and email as below:</div>
<br />
> <b><span style="color: blue;">cd git-codebase</span></b><br />
> <span style="color: blue;"><b>git config user.name "John Adams"</b></span><br />
> <b><span style="color: blue;">git config user.email adamsjohn@emprovise.com</span></b><br />
<br />In order to set the name and email address in the global configuration we can run the above commands with the --global option, which saves the values in the global configuration file, <b>~/.gitconfig</b>. Every Git commit then uses this user name and email information from the global configuration (unless overridden by the local project configuration), and it is immutably baked into the commits.<div><br />We can also override the user name and email address by passing custom values to the commit command.<br /><br />> <b><span style="color: blue;"><span>git </span>-c "user.name=John Dalton" -c "user.email=johndalton@gmail.com" commit -m "Some message" --author="John Dalton<johndalton@gmail.com>"</span></b><br /><div><br /></div><div><div>The git config command shows the current configuration.</div><div>><b> <span style="color: #2b00fe;">git config --list</span></b></div></div><div><br /></div><div>
Cherry Picking in Git enables to cherry pick a specific commit from another branch. This could be useful if we already have a fix on the master branch that we need to move to the production branch. This is done as below:<br />
<br />
1) Checkout the commit we need to work from<br />
> <span style="color: blue;"><b>git checkout </b></span><b><span style="color: blue;"><sha1-commit-id></span></b><br />
<br />
2) Cherry pick the desired commit. It applies the specified commit on top of the current branch.<br />
> <b><span style="color: blue;">git cherry-pick </span></b><b><span style="color: blue;"><sha1-commit-id></span></b><br />
<br />
Git provides ability to <a href="https://ariejan.net/2009/10/26/how-to-create-and-apply-a-patch-with-git/">create patches</a> which could be essentially applied to another branch or repository.<br />
<br />
Creates a new file patch.diff with all changes from the current branch against master.<br />
> <b><span style="color: blue;">git format-patch master --stdout > C:\patch.diff</span></b><br />
<br />
Normally, git creates a separate patch file for each commit, but with the --stdout option it prints all commits to the standard output, creating a single patch file. In order to have separate patch files for each commit, the --stdout option can be omitted.<br />
<br />
Creates a <a href="http://makandracards.com/makandra/2521-git-how-to-create-and-apply-patches">patch file</a> for each commit, for all commits since the referenced commit (not including it)<br />
> <b><span style="color: blue;">git format-patch HEAD~~</span></b><br />
<br />
Applies the patch on current branch using a git am command ignoring any white space differences:<br />
> <b><span style="color: blue;">git am --ignore-space-change --ignore-whitespace C:\patch.diff</span></b><br />
<br />
In case of conflicts, the git am command <a href="http://www.pizzhacks.com/bugdrome/2011/10/deal-with-git-am-failures/">may fail</a>, and can be aborted using the "<b>--abort</b>" option or skipped using the "<b>--skip</b>" option. By default git apply fails the whole patch and does not touch the working tree when some of the hunks do not apply. The reject option in <a href="http://git-scm.com/docs/git-apply">git-apply</a> makes it apply the parts of the patch that are applicable, leaving the rejected hunks in corresponding *.rej files. The rejected files can then be compared with the conflicting files, edited, and the fixed files finally added to the index.<br />
> <b><span style="color: blue;">git apply PATCH --reject</span></b><br />
<i><span style="color: #6aa84f;">// compare rejected files with conflicting files and update</span></i><br />
> <span style="color: blue;"><b>git add FIXED_FILES</b></span><br />
> <span style="color: blue;"><b>git am --resolved</b></span><br />
<br />
A git repository stores commits as a transitive closure, reachable from the refs down through every commit, tree and blob that makes up the repository. Hence removing a commit does not immediately remove the trees and blobs that it references. When a branch is deleted, the pointer to the commit is deleted but the actual commit still persists. Further, the <b>git reflog</b> stores a list of the previous branch pointers even when the branch is deleted. The <b><a href="http://alblue.bandlem.com/2011/11/git-tip-of-week-gc-and-pruning-this.html">git gc</a></b> command can be used to repack the repository into a more efficient structure and to export non-referenced commits as loose objects when there is no existing pointer referencing the commit. The gc command cleans up the unnecessary files and optimizes the repository in order to maintain good disk space utilization and good operating performance:<br />
> <b><span style="color: blue;">git gc</span></b><br />
<br />
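As an aside, since the reflog keeps the previous pointers, a commit from a deleted branch can usually be recovered by inspecting the reflog and creating a new branch from the relevant entry (the branch name below is only an example):<br />
> <b><span style="color: blue;">git reflog</span></b><br />
> <b><span style="color: blue;">git checkout -b recovered-branch <sha1-commit-id></span></b><br />
<br />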
The <b>git fsck</b> command checks whether all objects are present as expected. Also,<br />
<b>git fsck --unreachable</b> shows the commits which are no longer reachable due to deleted branches or removed tags. Internally the gc command invokes <b>git prune</b>, which evicts the objects that are no longer referenced by any commit (it does not remove references). It deletes data that has accumulated in Git without being referenced by anything.<br />
> <b><span style="color: blue;">git fsck --unreachable</span></b><br />
> <b><span style="color: blue;">git prune</span></b><br />
<br />
The git remote prune command is used to delete/remove any stale remote-tracking branches which are still locally available, but have already been removed from the remote repository referenced by name (origin). It removes stale remote-tracking references under a particular remote without performing a fetch, unlike the git fetch --prune command.<br />
> <b><span style="color: blue;">git remote prune origin</span></b><br />
<br />
Similar to git remote prune, git fetch --prune deletes stale remote-tracking branches, but does so as part of fetching from the remote.<br />
> <b><span style="color: blue;">git fetch origin --prune</span></b><br />
<br />
The commands <b>git remote update --prune</b>, <b>git remote prune</b>, <b>git fetch --prune</b>, all operate on remote branches (refs/remotes/..).<br />
<br />
To make git stop watching/tracking a particular directory or file, the update-index command is used as below:<br />
> <b><span style="color: blue;">git update-index --assume-unchanged <file></span></b><br />
<br />
To undo ignoring the files and track them again, the --no-assume-unchanged option is used.<br />
> <b><span style="color: blue;">git update-index --no-assume-unchanged <file></span></b><br />
<br />
Verify the SSH key being used for the github host using the ssh command.<br />
> <b><span style="color: blue;">ssh -vT git@github.com</span></b><br />
<br />
The http.sslVerify <a href="http://git-scm.com/docs/git-config">configuration</a>, when set to false, disables the verification of the SSL certificate while fetching and pushing over HTTPS. The --global flag applies this configuration to all git repositories.</div><div><br />
> <b><span style="color: blue;">git config http.sslVerify false</span></b><br />
> <b><span style="color: blue;">git config --global http.sslVerify false</span></b><br />
<br /><div>The ssl backend can be configured to use with git, which can potentially be openssl or schannel. SChannel is the built-in Windows networking layer which will use the Windows certificate storage mechanism when configured.</div><div><br /></div><div>> <b><span style="color: #2b00fe;">git config --global http.sslbackend schannel</span></b></div><div><br /></div>
<div>
<a href="http://2.bp.blogspot.com/-sjJQogdxv7c/VGkVgKkXeBI/AAAAAAAACN4/SKLBU-28wu0/s1600/yPKXU.png" style="clear: right; float: right; margin-bottom: 1em; margin-left: 1em;"><img border="0" height="320" src="https://2.bp.blogspot.com/-sjJQogdxv7c/VGkVgKkXeBI/AAAAAAAACN4/SKLBU-28wu0/s1600/yPKXU.png" width="305" /></a><a href="https://help.github.com/articles/fork-a-repo/">Forking</a> in git is nothing more than a clone on the GitHub server side without the possibility to directly push back to original repository, hence allowing to make changes without impacting the original repository. Also the fork queue feature allows to manage the merge request to the original repository. The original repository which is forked is referred as upstream while the fork or clone of original repository is referred as origin. The <a href="http://www.eqqon.com/index.php/Collaborative_Github_Workflow">fork</a> can be synced with the original project by adding the original project as a remote, fetching regularly from the original project or rebasing the current changes on top of the branch with the updated changes. To clone the forked repository locally, clone the repository from the git user's copy as below: </div>
> <b><span style="color: blue;">git clone https://github.com/fork</span></b><b style="color: blue;">user</b><b><span style="color: blue;">/repo.git</span></b><br />
<br />
<div>
When a git repository is cloned, it has a default remote called origin that points to the fork on GitHub, and not to the original repository from which it was forked. In order to keep track of the original repository, add another remote named upstream as below:<br />
> <b><span style="color: blue;">git remote add upstream https://github.com/user/repo.git</span></b><br />
<br />
In case to <a href="http://stackoverflow.com/questions/9646167/clean-up-a-fork-and-restart-it-from-the-upstream">clean up the remote fork</a> and restart from the upstream we use the below set of commands.<br />
> <b><span style="color: blue;">git fetch upstream</span></b><br />
> <b><span style="color: blue;">git checkout master</span></b><br />
> <b><span style="color: blue;">git reset --hard upstream/master</span></b><br />
> <b><span style="color: blue;">git push origin master --force</span></b><br />
<br />
Git <a href="http://schacon.github.io/git/git-filter-branch.html">gilter-branch</a> command enables to rewrite branches by applying custom filters on each revision, modifying each tree and each commit based on the filters specified. It only rewrites the positive refs mentioned in the command line.<br />
<br />
><b><span style="color: blue;"> git filter-branch -f --env-filter</span></b> <span style="color: purple;">'</span><br />
<span style="color: purple;">if [ "$GIT_COMMITTER_NAME" = "oldname" ];</span><br />
<span style="color: purple;">then</span><br />
<span style="color: purple;"> GIT_COMMITTER_NAME="newname";</span><br />
<span style="color: purple;"> GIT_COMMITTER_EMAIL="newaddr";</span><br />
<span style="color: purple;"> GIT_AUTHOR_NAME="newname";</span><br />
<span style="color: purple;"> GIT_AUTHOR_EMAIL="newaddr";</span><br />
<span style="color: purple;">fi</span><br />
<span style="color: purple;"><br /></span>
<span style="color: purple;">if [ "$GIT_AUTHOR_NAME" = "oldname" ];</span><br />
<span style="color: purple;">then</span><br />
<span style="color: purple;"> GIT_COMMITTER_NAME="newname";</span><br />
<span style="color: purple;"> GIT_COMMITTER_EMAIL="newaddr";</span><br />
<span style="color: purple;"> GIT_AUTHOR_NAME="newname";</span><br />
<span style="color: purple;"> GIT_AUTHOR_EMAIL="newaddr";</span><br />
<span style="color: purple;">fi</span><br />
<span style="color: purple;">'</span> <b><span style="color: blue;">-- --all</span></b></div>
<div>
<br /></div>
<div>
The <a href="http://schacon.github.io/git/git-filter-branch.html">gilter-branch</a> command allows to update all past commits changing the author and committer without losing history.</div>
<div>
> <b><span style="color: blue;">git filter-branch -f --env-filter "</span></b><br />
<b><span style="color: blue;"> GIT_AUTHOR_NAME='Jack Dorcey'</span></b><br />
<b><span style="color: blue;"> GIT_AUTHOR_EMAIL='jack.dorcey@emprovise.com'</span></b><br />
<b><span style="color: blue;"> GIT_COMMITTER_NAME='Jack Dorcey'</span></b><br />
<b><span style="color: blue;"> GIT_</span></b><b><span style="color: blue;">COMMITTER</span></b><b><span style="color: blue;">_</span></b><b><span style="color: blue;">EMAIL</span></b><b><span style="color: blue;">='</span></b><b><span style="color: blue;">jack.dorcey@emprovise.com</span></b><b><span style="color: blue;">'</span></b><br />
<b><span style="color: blue;">" HEAD</span></b><br />
<br />
The <a href="http://schacon.github.io/git/git-filter-branch.html">gilter-branch</a> command also enables to prune specific/all directories, keeping the history of only specific sub-directories. The <b>prune-empty</b> option removes empty commits, if there are exactly one or zero non-pruned parents, generated by filters from filter-branch command. The <b>--subdirectory-filter</b> option rewrites the repository such that the /module_name had been its project root, and discard all other history as below.<br />
> <b><span style="color: blue;">git filter-branch --prune-empty -f --subdirectory-filter module_name</span></b></div>
<div>
<div style="-webkit-text-stroke-width: 0px; color: black; font-family: "Times New Roman"; font-size: medium; font-style: normal; font-variant-caps: normal; font-variant-ligatures: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-decoration-color: initial; text-decoration-style: initial; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px;">
</div>
<br />
<div style="-webkit-text-stroke-width: 0px; color: black; font-family: "Times New Roman"; font-size: medium; font-style: normal; font-variant-caps: normal; font-variant-ligatures: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-decoration-color: initial; text-decoration-style: initial; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px;">
<div style="margin: 0px;">
After using <a href="http://stackoverflow.com/questions/10908872/git-filter-branch-successfully-used-to-change-committer-author-but-changes-do">git filter-branch</a>, git still retains a backup copy of the history of the repo in refs/original in case the changes needed to be reverted. If assured that everything went smoothly, the backed up ref can be removed as below:</div>
<div style="margin: 0px;">
> <b><span style="color: blue;">git update-ref -d refs/original/refs/heads/master</span></b></div>
</div>
</div>
<div>
<br />
Git allows to <a href="http://githowto.com/aliases">configure aliases</a> for frequently used commands which is configured in the global git config file using below command (e.g. setting alias co for checkout).<br />
> <b><span style="color: blue;">git config --global alias.co checkout</span></b><br />
<br />
There are multiple merge tools available which can be integrated with git by configuring them in the global git configuration file. Below is a sample git configuration for the <a href="https://sourcegear.com/diffmerge/downloads.php"><b>DiffMerge tool</b></a> on Windows.<br />
<br />
<span style="color: #a64d79;">[diff]</span><br />
<span style="color: #a64d79;">tool = diffmerge</span><br />
<span style="color: #a64d79;"><br /></span>
<span style="color: #a64d79;">[difftool "diffmerge"]</span><br />
<span style="color: #a64d79;">cmd = 'C:/diffmerge/sgdm.exe' \"$LOCAL\" \"$REMOTE\"</span><br />
<span style="color: #a64d79;"><br /></span>
<span style="color: #a64d79;">[merge]</span><br />
<span style="color: #a64d79;">tool = diffmerge</span><br />
<span style="color: #a64d79;"><br /></span>
<span style="color: #a64d79;">[mergetool "diffmerge"]</span><br />
<span style="color: #a64d79;">trustExitCode = true</span><br />
<span style="color: #a64d79;">cmd = 'C:/diffmerge/sgdm.exe' --merge --result=\"$MERGED\" \"$LOCAL\" \"$BASE\" \"$REMOTE\"</span><br />
<span style="color: #a64d79;"><br /></span>
<br />
<br /></div>
</div></div>Pranav Patilhttp://www.blogger.com/profile/10125896717744035325noreply@blogger.com0tag:blogger.com,1999:blog-7610061084095478663.post-84132080197613646262014-02-19T21:18:00.003-08:002020-05-02T15:39:35.633-07:00MongoDB : An OverviewOver the past few years there have been many new NoSQL databases in the market. MongoDB is one such prominent and popular NoSQL database. MongoDB is a non-relational, JSON-based document store. The <a href="http://www.json.org/">json</a> documents mainly consist of arrays (i.e. lists) and dictionaries (i.e. key-value pairs). MongoDB supports dynamic schemas, unlike relational databases where the schema needs to be defined beforehand. An extra field can be added or ignored in any document within a collection. MongoDB provides a sufficient depth of functionality while maintaining better scalability and performance compared to traditional relational databases. To achieve better scalability MongoDB does not support joins or multi-document transactions, since documents are stored in a hierarchical structure where operations on a single document are atomic. The lack of transactions in mongodb can be overcome by restructuring the code to work with single documents, or by implementing locking in software using critical sections, semaphores etc. MongoDB does, however, support atomic operations on a single document. MongoDB encourages the schema design to be application driven, by studying the application data access patterns and by considering the data used together for either read-only or write operations. Hence most of the data in mongodb is encouraged to be prejoined or embedded. The factors to consider before embedding a document in mongodb are the frequency of data access, the size of the data (16MB document limit) and the requirement for atomicity of the data. Embedding is used for one-to-one relationships and for one-to-many relationships as long as the embedding is done from the many into the one. Embedding also helps to avoid round trips to the database and improves read performance by storing contiguous chunks of data on the disc. MongoDB supports rich documents including arrays, key-value pairs and nested documents, which enables it to prejoin/embed data for faster access. There are no constraints such as foreign keys in mongodb, which become less relevant when most of the documents are embedded. Also, true linking, which references the _id value of one document in another document as a single field value or an array of values, helps to achieve many-to-one or many-to-many relationships in schema design for mongodb. Linking and embedding work well in mongodb since it has multikey indexes, i.e. indexes over all the values of the arrays in all the documents. MongoDB enables storage of large files above 16 MB using GridFS. <a href="http://docs.mongodb.org/manual/core/gridfs/">GridFS</a> breaks the large file into smaller chunks and stores them as documents in the chunks collection, along with a document in a files collection which describes the file and the chunks added to the chunks collection. MongoDB provides <a href="http://api.mongodb.org/">drivers</a> for the majority of languages and also provides a variety of <a href="http://docs.mongodb.org/ecosystem/tools/administration-interfaces/">tools</a> for database management.<br />
<br />
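As a rough illustration of the embedding and linking approaches (the collection and field names below are hypothetical), an order document can either embed its customer details or reference a customer document by its _id:<br />
<b>db.orders.insert({ customer : { name : "Smith", city : "Boston" }, items : [ { sku : "A1", qty : 2 }, { sku : "B2", qty : 1 } ] })</b> <i><span style="color: #6aa84f;">// embedding</span></i><br />
<b>db.orders.insert({ customer_id : ObjectId("507f191e810c19729de860ea"), items : [ "A1", "B2" ] })</b> <i><span style="color: #6aa84f;">// linking by _id</span></i><br />
<br />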
Following are the steps for installation of MongoDB:<br />
<br />
1) Create directory C:\$HOME\mongodb\data\db<br />
<br />
2) <a href="http://docs.mongodb.org/manual/tutorial/manage-mongodb-processes/">Start the mongod instance</a> and check the log for errors:<br />
<b>mongod --auth --logpath C:\Temp\mongodb\mongodb.log --logappend --bind_ip localhost --dbpath C:\$HOME\mongodb\data\db</b><br />
<br />
3) In case needed to run mongod instance as a windows service, use the "<b>--install</b>" option and execute the below command:<br />
<b>mongod --auth --logpath C:\$HOME\mongodb\mongodb.log --logappend --bind_ip localhost --dbpath C:\$HOME\mongodb\data\db --install</b><br />
<br />
4) In order to change the parameters in future, use the <b>--reinstall</b> option instead of <b>--install</b>. To remove the service use the --remove option.<br />
<br />
5) To start the mongo service use the command: <b>net start mongodb</b>, similarly to stop the service, use the command: <b>net stop mongodb</b>.<br />
<br />
6) To stop the mongod instance using the mongo shell use the following commands:<br />
<b>use admin</b><br />
<b>db.shutdownServer()</b><br />
<div>
<br /></div>
<div>
The mongo shell is a command line tool to connect to the mongod instance of MongoDB. It is an interactive javascript interpreter and supports all the javascript programming constructs, such as below.</div>
<div>
> for (i = 0; i < 3; i++) print("hello !!");</div>
<div>
<br /></div>
<div>
The mongo shell has various shortcut keys, such as "ctrl + b" to move backward within the line and "ctrl + e" to jump to the end of the line. The commands help and help keys give details of the commands and shortcut keys respectively. The Tab key automatically completes mongo commands/queries. The mongo shell also enables the declaration, initialization and use of variables, such as below:</div>
<div>
<div>
x = 1</div>
<div>
y = "abc"</div>
<div>
z = { a : 1 } // the shell echoes { "a" : 1 }</div>
<div>
z.a // Here the property a of variable z is 1</div>
<div>
z["a"] // name of the property as a string "a"</div>
</div>
<div>
<br /></div>
<div>
<div>
The dot notation does not permit a variable property lookup (nor lookup of methods or instance variables within an object); a is treated as a literal even though z is treated as a variable. </div>
<div>
<br /></div>
<div>
By contrast, when a variable 'w' is assigned the string 'a', the square bracket syntax can be used to look up the property inside the object 'z' as below:</div>
</div>
<div>
<div>
w="a"</div>
<div>
z[w] // this gives 1</div>
<div>
<br /></div>
<div>
The square bracket notation treats the object as data, i.e. as a dictionary. </div>
<div>
Numbers are represented as floating point by default, with NumberInt() for 32-bit values and NumberLong() for 64-bit values; the strings are UTF-8 strings. Dates created with the javascript new Date() constructor are displayed by the shell using the ISODate() constructor. </div>
</div>
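<div>
A small sketch of these types in the shell (the collection name here is made up):</div>
<div>
<b>db.samples.insert({ i : NumberInt(42), l : NumberLong("9000000000"), score : 98.6, when : new Date() })</b></div>
<div>
When the document is queried back with find(), the when field is displayed wrapped in ISODate().</div>
<div>
<br /></div>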
<div>
<div>
MongoDB uses BSON which contains all the javascript variable types, such as below:</div>
<div>
doc = { "name" : "smith", "age": 30, "profesion" : "hacker" }</div>
<div>
<br /></div>
<div>
The db variable is used to communicate with the database and acts as a handle to the current database.</div>
<div>
Documents in the database live inside collections, which are sets of documents within a particular database. The collections are accessible as properties of the db handle.<br />
<br />
Mongodb also provides tools to backup and restore data from the database.<br />
<br />
The mongoexport command is used to back up a collection in JSON or CSV format. The -o option specifies the file to which the collection data is exported.<br />
<br />
<b>mongoexport --username "mongo-user" --password "mongo-password" --host mongo-host.com --db database-name --collection collection-name -o C:/file.json</b><br />
<br />
The <a href="http://docs.mongodb.org/manual/reference/program/mongoimport/">mongoimport</a> is used to import content from a JSON, CSV, or TSV export created by mongoexport.<br />
<b>mongoimport -d students -c grades < grades.js</b><br />
<br />
<b>mongoimport --host mongo-host.com --db database-name --collection collection-name --file C:/file.json</b><br />
<br />
The <a href="http://docs.mongodb.org/manual/reference/program/mongorestore/">mongorestore</a> is used to write data from a binary database dump created by mongodump to a MongoDB instance. It can create a new database or add data to an existing database<br />
<b>mongorestore --collection people --db accounts dump/accounts/people.bson</b><br />
<br /></div>
<div>
<h4>
<b><u><span style="font-size: large;">CRUD Operations on MongoDB</span></u></b></h4>
</div>
<div>
<div>
<br />
The <b>insert</b> method is used to insert a document in a collection,</div>
<div>
<b>db.people.insert(doc)</b> </div>
<div>
here "people" is the name of the collection interpreted by current database and the "insert" method on collections, takes an arg an javascript object "doc" which is a JSON document.</div>
<div>
<br /></div>
<div>
To retrieve the documents from the collection in mongodb the <b>find</b>() method is used.</div>
<div>
<b>db.people.find()</b></div>
<div>
The above find method gets all the documents in the specified collection.</div>
<div>
<div>
The _id field is a unique primary key field and must be immutable. Its default type, ObjectId, is constructed from the time, the identity of the current machine, the process id and a global counter, making it globally unique.</div>
</div>
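<div>
For example, a fresh ObjectId can be generated and its embedded timestamp inspected directly in the shell:</div>
<div>
<b>id = ObjectId()</b></div>
<div>
<b>id.getTimestamp()</b> <i><span style="color: #6aa84f;">// returns the creation time encoded in the ObjectId</span></i></div>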
<div>
<br /></div>
<div>
To retrieve a single document at random from the collection, the findOne() method is used.</div>
<div>
<div>
<b>db.people.findOne()</b></div>
<div>
It also takes arguments similar to the where clause of sql language.</div>
<div>
<b>db.people.findOne({"name" : "Jones"})</b></div>
<div>
The above query sends a BSON document to the server as the query.</div>
<div>
<br /></div>
<div>
The _id field by default is present in the findOne results. In order to give only the name field but no _id field in the results, the additional parameters are passed to the findOne method as another document.</div>
<div>
<b>db.people.findOne({"name" : "Jones"} , { "name" : true , "_id" : false })</b></div>
<div>
<br /></div>
<div>
By default the documents are returned in batches of 20, and we can view the remaining documents by typing "it" for more.</div>
<div>
<br /></div>
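<div>
The batch size displayed by the shell can be adjusted for the current session, for example:</div>
<div>
<b>DBQuery.shellBatchSize = 10</b></div>
<div>
<br /></div>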
<div>
Numeric values can also be used as query values for fields to retrieve documents, such as below which gets all the scores with student number 19 and type "essay".</div>
<div>
<b>db.scores.find( { student : 19, type : "essay" }, { "score" : true, "_id" : false } );</b></div>
<div>
<br /></div>
<div>
The query operators $gt (greater than), $lt (less than), $gte (greater than or equal to), and $lte (less than or equal to) are used to filter the documents to be retrieved.</div>
<div>
<b>db.scores.find( { score : { $gt : 95, $lte : 98 }, "type" : "essay" } )</b></div>
<div>
<b>db.scores.find( { score : { $gte : 95, $lt : 98 }, "type" : "essay" } )</b></div>
<div>
<br /></div>
<div>
Lexicographic search can be made using the UTF strings as below:</div>
<div>
<b>db.scores.find( { name : { $lt : "D", $gt : "B" } } );</b></div>
<div>
<br /></div>
<div>
None of the above queries are locale aware.</div>
</div>
<div>
MongoDB is a schemaless database, i.e. different documents in the same collection might have different value types for the same field, as in the below example collection mycollection where the name field can be an integer or a string:</div>
<div>
db.mycollection.insert({ "name" : "smith", "age": 30, "profession" : "hacker" });</div>
<div>
db.mycollection.insert({ "name" : 34, "age": 34, "profession" : "teacher" });</div>
<div>
<br /></div>
<div>
All comparison operations are strongly typed as well as dynamically typed. Hence in the above example, the less-than operator ($lt : "D") will not return the document with the name 34, as comparisons do not cross datatype boundaries. Also, all comparisons are case sensitive over ASCII characters.</div>
<div>
<br /></div>
<div>
The below example will retrieve all the documents where the profession field is absent:</div>
<div>
<div>
<b>db.people.find( { profession : { $exists : false } } );</b></div>
<div>
<br /></div>
<div>
In order to fetch those documents where the name field is a string, the $type is specified as 2, which is the <a href="http://docs.mongodb.org/manual/reference/bson-types/">type value</a> for string from the <a href="http://bsonspec.org/#/specification">BSON specification</a>.</div>
<div>
<b>db.people.find( { name : { $type : 2 } } );</b></div>
<div>
<br /></div>
<div>
MongoDB also supports regular expressions as query parameters, using the <a href="http://pcre.org/">libpcre library</a>. The below find method for example retrieves all documents with names containing "a".</div>
<div>
<b>db.people.find( { name : { $regex : "a" } } );</b></div>
<div>
The following queries fetch those documents with names ending with "e" and names starting with the capital letter "A" respectively:</div>
<div>
<b>db.people.find( { name : { $regex : "e$" } } );</b></div>
<div>
<b>db.people.find( { name : { $regex : "^A" } } );</b></div>
<div>
<br /></div>
</div>
<div>
The <b>$or</b> operator is a prefix operator and comes before the sub-queries it connects together. It takes its operands as an array whose elements are themselves queries which could be given separately. The below query for example fetches the documents which match any of the queries in the array.<br />
<b>db.people.find( { $or : [ { name : { $regex : "e$" } }, { age : { $exists : true } } ] } );</b><br />
<br />
<b><span style="color: red;">Note: When input query has wrong parenthesis then javascript returns "..." as output.</span></b><br />
<br />
The $and operator returns all the documents which match all the queries in the array.<br />
<b>db.people.find( { $and : [ { name : { $gt : "C" } }, { name : { $regex : "a" } } ] } );</b><br />
<br />
Multiple constraints can be added on the same field to achieve similar results as the above query.<br />
<b>db.people.find( { name: { $gt: "C" , $regex : "a" } } );</b><br />
<div>
<br /></div>
All the query operations in mongodb are polymorphic. Consider a document with an array "favorites" which contains various elements such as "cream", "coke", "beer". The below query fetches all the documents whose favorites array contains "beer".<br />
<b>db.accounts.find( { favorites: "beer" } );</b></div>
<div>
<br /></div>
<div>
In the above query no recursion occurs; if the field has nested content in it, then none of the nested contents will be matched. Only the top-level contents of the array are looked at for a match.<br />
<br />
The below example matches all the favorites containing "beer" and "cheese" using the <b>$all</b> operator.<br />
<b>db.accounts.find( { favorites: { $all : [ "beer", "cheese" ] } } );</b></div>
<div>
Order does not matter while using the $all operator. The operand should be a subset of the values in the documents.</div>
<div>
<br /></div>
<div>
<div>
The <b>$in</b> operator takes an array of values and returns the documents where the corresponding field has<br />
either the value "beer" or "cheese".</div>
</div>
<div>
<b>db.accounts.find( { favorites : { $in : [ "beer", "cheese" ] } } );</b></div>
<div>
<br /></div>
<div>
In order to find an embedded document, query with the exact value of the embedded document, with the field order preserved. If the order is reversed then mongodb will not be able to find the document, as a byte-by-byte comparison is done.<br />
<br />
Consider a document with email field which is another document with fields such as "work" and "personal".<br />
<div>
<b>db.users.find( { email : { work : "abc@gmail.com", personal : "xyz@gmail.com" } } );</b><br />
<b><br /></b></div>
<div>
</div>
<div>
Now in order to find a document using only the work email id, we use the dot notation as below:</div>
<b>db.users.find( { "email.work" : "abc@gmail.com" } );</b></div>
<div>
<div>
Dot notation reaches inside of nested documents looking for embedded information without knowledge of other content.</div>
<div>
<b>db.catalog.find({ "price" : { $gt : 10000 }, "reviews.rating" : { $gte : 5} });</b></div>
</div>
<div>
<br /></div>
<div>
A <a href="http://docs.mongodb.org/manual/core/cursors/">Cursor</a> is a pointer to the result set of a query. Clients can iterate through a cursor to retrieve results.<br />
When a cursor is constructed and returned in the shell, the shell is configured to iterate through all the elements of the cursor and print them out. We can get the cursor by:<br />
<br />
> <b>cur = db.people.find(); null;</b><br />
null<br />
> <b>cur.hasNext()</b><br />
true<br />
> <b>cur.next()</b><br />
returns next document.<br />
<div>
<br /></div>
The above hasNext() method returns true if there is a document to visit on the current cursor, while the next() method returns the next available document from the cursor. These cursor methods can be used to fetch all the documents using the while loop as below:<br />
<b>while( cur.hasNext()) printjson(cur.next());</b><br />
<br />
The mongo shell batches the documents in batches of 20. The limit() method can be used to limit the number of documents fetched through the cursor. Limits can be imposed on the cursor only as long as the hasNext() or next() methods have not yet been invoked on it. The below example instructs the server to return only 5 documents; the trailing null avoids printing the value returned by the limit method on the cursor.<br />
<b>cur.limit(5); null;</b><br />
<br />
The <b>sort()</b> method sorts the documents returned by the cursor according to the arguments specified. The below query sorts the documents lexically by the name field in the reverse order.<br />
<b>cur.sort( { name : -1 } );</b><br />
<br />
The sort() and limit() methods return the modified cursor.<br />
<b>cur.sort( { name : -1 } ).limit(5); null;</b><br />
<br />
The skip method is used to skip the specified number of documents. The below example sorts the documents, skips the first 2, and returns up to 5 documents. Here the sort, skip and limit are processed on the database server.<br />
<div>
<b>cur.sort( { name : -1 } ).limit(5).skip(2); null;</b></div>
<br />
We cannot apply the sort or limit methods after starting to retrieve documents from the database using the next or hasNext methods, because sort and limit need to be processed in the database.<br />
<br />
Similarly the following query retrieves the exam documents, sorted by score in descending order, skipping the first 50 and showing only the next 20 documents.<br />
<b>db.scores.find( { type : "exam" } ).sort( { score : -1 } ).skip(50).limit(20)</b><br />
<br />
The count method is used to count the number of documents retrieved.<br />
<b>db.scores.count( { type : "exam" } );</b><br />
<b>db.scores.count( { type : "essay" , score : { $gt : 90 } } );</b><br />
<br />
The update method takes multiple arguments, namely the query, the update document, upsert and multi. When a full replacement document is passed, the update method discards everything in the existing document except the _id field. It can be used to replace the database document with the application's version of the document.<br />
<b>db.people.update( { name : "Smith" } , { name : "Thompson" , salary : 50000 } );</b><br />
<br />
In order to modify a specific field using the update method, we use the $set operator passing the field name and the new value. For example the below query sets the age for Alice to 30 if the field exists; if the field does not exist, it creates the age field with the value 30.<br />
<div>
<b>db.people.update( { name : "Alice" } , { $set : { age : 30 } } );</b></div>
<br />
The $inc operator is used to increment a field if it already exists, or else create a new field with the specified value.<br />
<b>db.people.update( { name : "Alice" } , { $inc : { age : 1 } } );</b><br />
<br />
The below query updates the posts collection by incrementing the likes by 1 if it exists or adding a new likes field with the value 1. It selects the post with the specified permalink and the third comment from the array of comments in the post.<br />
<b>db.posts.update( { permalink : "23446836" } , { $inc : { "comments.2.likes" : 1 } } );</b><br />
<div>
<b><br /></b></div>
The $unset operator is used to remove a field from the document. The value given to the $unset operator is ignored; the field is removed regardless of the specified value.<br />
<b>db.people.update( { name : "Jones" }, { $unset : { profession : 1 } } );</b><br />
<br />
The dot notation is used along with the index of the array element to update it using the $set operator in the update method. The below example updates the element at index 2 (the third element) of the array a in the arrays collection.<br />
<b>db.arrays.insert( { _id : 0 , a : [ 1, 2, 3, 4 ] } );</b><br />
<b>db.arrays.update( { _id : 0 } , { $set : { "a.2" : 5 } } );</b><br />
<br />
The $push operator is used to append elements to the end of an array. The below example adds the element 6 to the end of the array a.<br />
<b>db.arrays.update( { _id : 0 } , { $push : { "a" : 6 } } );</b><br />
<br />
The $pop operator is used to remove the rightmost element from the array as shown below.<br />
<b>db.arrays.update( { _id : 0 } , { $pop : { "a" : 1 } } );</b><br />
<br />
In order to remove the leftmost element from the array, pass -1 as the value for the array a to the $pop operator.<br />
<b>db.arrays.update( { _id : 0 } , { $pop : { "a" : -1 } } );</b><br />
<br />
To add multiple elements to the array $pushAll operator is used.<br />
<b>db.arrays.update( { _id : 0 } , { $pushAll : { "a" : [ 7, 8 , 9 ] } } );</b><br />
<br />
The $pull operator is used to remove the element from the array regardless of its location in the array.<br />
<b>db.arrays.update( { _id : 0 } , { $pull : { a : 5 } } );</b><br />
<br />
Similarly $pullAll is used to remove a list of elements regardless of their location in the array.<br />
<b>db.arrays.update( { _id : 0 } , { $pullAll : { a : [ 2, 4 , 8 ] } } );</b><br />
<br />
To treat an array as a set, we use the $addToSet operator to avoid duplicates in the array. It adds the element only if it does not already exist, otherwise behaving like the $push operator.<br />
<b>db.arrays.update( { _id : 0 } , { $addToSet : { a : 5 } } );</b><br />
<br />
On the other hand there is no special operator to remove an element from a set, as deletion does not require duplicate checks and the $pop operator works fine. The <b>$each</b> operator is used with $push or $addToSet to add multiple values to an array field. The <b>$slice</b> operator, which must be used together with the $each operator, limits the number of elements kept in the array during a $push operation.<br />
<br />
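As a small sketch of these two operators used together, the below pushes several values onto the array while keeping only the last five elements:<br />
<b>db.arrays.update( { _id : 0 } , { $push : { a : { $each : [ 10, 11, 12 ], $slice : -5 } } } );</b><br />
<br />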
The upsert option of the update method is used in order to update an existing document, or else create a new document.<br />
<b>db.people.update( { name : "George" } , { $set : { age : 40 } } , { upsert : true } );</b><br />
<br />
When the upsert argument is true and no concrete value is specified for a field in the query (for example only a $gt condition), that field is left out of the newly created document. The below query will create a document containing only the name "William".<br />
<b>db.people.update( { age : { $gt : 40 } } , { $set : { name : "William" } } , { upsert : true } );</b><br />
<br />
The update operation can affect more than one document at a time. An empty document can be passed as the query to the update method; it acts as a selector matching every document in the collection. For example the following query selects every document to be given a new field:<br />
<b>db.people.update( {}, { $set : { title : "Dr" } } );</b><br />
<br />
By default the update operation only affects a single document, whichever is the first one it finds, which is unpredictable.<br />
In case we need to update multiple documents which match the query, we add the multi option as true. The following update method affects all documents, setting the title field to "Dr".<br />
<b>db.people.update( {}, { $set : { title : "Dr" } }, {multi: true } );</b><br />
<br />
In mongodb the write operations to multiple documents are not isolated transactions. Individual document manipulation is guaranteed to be atomic regardless of parallel readers/writers.<br />
<b>db.scores.update( { score : { $lt : 70 } }, { $inc : { score : 20 } }, {multi: true } );</b><br />
<br />
The remove method is used to delete the documents from the database. It takes query document as a parameter similar to update query.<br />
<b>db.people.remove( { name : "Alice" } );</b><br />
<b>db.people.remove( { name : { $gt : "M" } } );</b><br />
<br />
The below example removes all the documents in the collection.<br />
<b>db.people.remove();</b><br />
<div>
<br /></div>
To remove all the documents in the collection in one pass, the drop method can be used on the collection:<br />
<b>db.people.drop();</b><br />
<br />
Removing all documents in the collection requires a one-by-one update of internal state for each document in the collection, and it keeps the indexes of the collection intact. Dropping a collection on the other hand frees a much larger database structure in memory and also deletes all the indexes of the collection. Removal of multiple documents is not atomic.<br />
<br />
Mongodb provides a getLastError command (similar to the count command) to check whether the last operation succeeded or failed, by running the below command.<br />
<b>db.runCommand( { getLastError : 1 } );</b><br />
<br />
If the last operation is successful, the "err" value is null. Operations like update report their outcome using the "updatedExisting" field in the getLastError results; when the upsert flag was used it shows an "upserted" value. Also, the value of "n" in the getLastError results tells the number of documents updated or removed.<br />
<div>
<br />
<br /></div>
<h4>
<b><span style="font-size: large;"><u>Indexing</u></span></b></h4>
<br />
Indexing is an important aspect of improving the query performance of a database, and this holds true for mongodb. Indexes are used by the find, findOne, remove and update methods in mongodb.<br />
The ensureIndex method is used to add an index on particular fields of a collection. For each field it takes an integer, with 1 for ascending order and -1 for descending order.<br />
<b>db.students.ensureIndex({ student_id:1});</b><br />
<b>db.students.ensureIndex({ student_id:1, class:-1});</b><br />
<br />
The below command creates an index for a nested phones column in the addresses document.<br />
<b>db.students.ensureIndex({ 'addresses.phones':1});</b><br />
<br />
In order to find all the indexes in the current database we use the system.indexes collection.<br />
<b>db.system.indexes.find();</b><br />
<br />
The below command allows to view the details of the indexes in the collection.<br />
<b>db.students.getIndexes()</b><br />
<br />
The dropIndex command allows to drop an index from a collection.<br />
<b>db.students.dropIndex({ 'student_id':1 });</b><br />
<br />
When an index is created on a field which is an array, mongodb creates an index entry for each element of the array, known as a Multikey Index. If a compound index is created on multiple fields of the collection, then mongodb does not allow any document in the collection where both the indexed fields have array values. If we try to insert arrays for both the indexed fields, it throws the error "cannot index parallel arrays".<br />
<br />
A <u>Unique index</u> is one where each key can only appear once in the index. The indexes created earlier can have the same value for a field in multiple documents.<br />
<br />
Below command creates unique index on the collection "students":<br />
<b>db.students.ensureIndex({ 'thing': 1}, {unique:true});</b><br />
<br />
To remove the duplicate entries in the collection in order to set up a unique index, we can use the dropDups option.<br />
<b>db.things.ensureIndex({'thing':1}, {unique:true, dropDups:true});</b><br />
<b><br /></b>
<b><span style="color: red;">NOTE: This will delete any duplicate values in the collection on which the unique index is created. There is no guarantee as to which duplicate entries will be deleted by MongoDB using the dropDups option.</span></b><br />
<br />
In case the key has multiple null values in the collection, then mongodb can't create a unique index since the key is not unique.<br />
<br />
<u>Sparse Indexes</u>: Index only those documents which have the index key value not null.<br />
<b>db.products.ensureIndex({size:1}, {unique:true, sparse:true})</b><br />
When a sparse index is used for sorting, only those documents which have a non-null key value are considered, ignoring the null-valued documents in the sorted result. Hence in the below example mongodb sorts only the documents that have a size value, ignoring the ones where it is null:<br />
<b>db.products.find().sort({'size':1})</b><br />
<br />
Mongodb does not allow setting a null value explicitly for a key in the document. Indexes can be created in mongodb in the foreground or in the background. <u>Foreground Indexes</u> are the default; they are fast but block write operations (a per-database lock). <u>Background Indexes</u> on the other hand are slower, do not block write operations, and only one background index can be built at a time per database. A background index creation still blocks the mongo shell that we are using to create the index.<br />
<br />
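For instance, a background index build can be requested by passing the background option to ensureIndex:<br />
<b>db.students.ensureIndex({ class : 1 }, { background : true })</b><br />
<br />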
MongoDB also provides commands such as explain to inspect queries with their index usage. Explain command gives the details of what indexes were used and how they were used.<br />
<b>db.foo.find({c:1}).explain()</b><br />
<br />
From the output of the <a href="http://docs.mongodb.org/manual/reference/method/cursor.explain/">explain command</a>, we have<br />
<b>cursor:BasicCursor :</b> Specifies that no index was used for processing the query.<br />
<b>cursor:BtreeCursor a_1_b_1_c_1</b> : It means a compound index was used with the name "a_1_b_1_c_1".<br />
<b>isMultiKey : </b>It specifies whether the index is a multi key or not, i.e. are there any values inside the index which are arrays.<br />
<b>n:</b> It specifies the number of documents returned by the query.<br />
<b>nscannedObjects: </b>It means the number of documents scanned to answer the query.<br />
<b>nscanned:</b> It means the number of index entries or documents that were looked at.<br />
<b>nscannedAllPlans</b>: It means the number of index entries or documents scanned for all the query plans.<br />
<b>nscannedObjectsAllPlans</b>: It means the number of documents scanned for all the query plans.<br />
<b>scanAndOrder</b>: It indicates whether the query could use the order of documents in the index for returning sorted results. If true, then the query could not use the order of documents in the index and had to sort the results itself.<br />
<b>indexOnly</b>: It is true when the query is covered by the index indicated in the cursor field i.e. mongodb can both match the query conditions and return the results using only the index<br />
<b>millis:</b> It specifies the number of milliseconds required to execute the query.</div>
</div>
</div>
<div>
<b>indexBounds</b>: {<br />
"a" : [<br />
[ 500, 500 ]<br />
],<br />
"b": [<br />
{"$minElement" : 1}, {"$maxElement" : 1}<br />
],<br />
"c": [<br />
{"$minElement" : 1}, {"$maxElement" : 1}<br />
]<br />
}<br />
<br />
This shows the bounds that were used to lookup the index.<br />
<br />
<b>indexOnly</b>: It specifies whether or not the database query could be satisfied by the index alone (a covered index). If everything that the query is asking for can be satisfied with just the index, the documents need not be retrieved.<br />
<br />
Mongodb may or may not use indexes for the different phases of a query, such as the find or sort operations, independently of each other, depending on the fields being filtered or sorted. For example the below query uses the index on field a for sorting, but is unable to use any of the indexes (a=1), (a=1,b=1) or (a=1,b=1,c=1) to find the matching documents.<br />
<b>db.foo.find({ $and: [ {c:{$gt:250}}, {c:{$lte:500}} ] }).sort({a:1}).explain()</b><br />
<br />
For the above query the nscannedObjects or nscanned values reflect the documents scanned for the query, which will be higher, while the number of documents returned, n, will be lower, as the index is used only to sort the documents. If the values of nscanned and nscannedObjects are the same as the number of entries in the collection, then the query performed an entire collection scan. A covered query is a query in which all the fields in the query are part of an index, and all the fields returned in the results are in the same index.<br />
<div>
<br /></div>
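For illustration, assuming a compound index on (a, b) exists, a query of the following shape would be covered, since the filter and the projected fields all lie within that index and _id is suppressed; its explain() output would report indexOnly as true:<br />
<b>db.foo.find({ a : 100 }, { a : 1, b : 1, _id : 0 }).explain()</b><br />
<br />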
The database cannot use the index for sorting if the query sort order does not match the index sort order (which is ascending by default). For example, if the query sorts on multiple fields, i.e. (a, b) where a is descending but b is ascending, and the available indexes are (a=1), (a=1,b=1) and (a=1,b=1,c=1), then none of the indexes can be used for sorting.<br />
<b>db.foo.find({c:1}).sort({a:-1, b:1})</b><br />
<br />
The stats command shown below returns variety of collection statistics:<br />
<b>db.students.stats()</b><br />
<br />
<b>avgObjSize</b>: 232.0000016 - means the average size of a document is about 232 bytes.<br />
<b>storageSize</b>: 2897113088 - means the collection uses almost 3 GB of size in the disc.<br />
<br />
To determine the index size we use the totalIndexSize method as below which gives the size of the index in bytes.<br />
<b>db.students.totalIndexSize()</b><br />
<br />
<u>Index Cardinality</u> specifies the number of index entries relative to the number of documents in the collection.<br />
A regular index is 1:1, i.e. one index entry per document. A sparse index has fewer index entries than documents. A multikey index has more index entries than documents. When documents are moved from one part of the disc to another, all the corresponding index entries need to be updated.<br />
<div>
<br /></div>
MongoDB can be hinted to use a specific index using the hint method. The below example hints mongodb to use no index, i.e. it specifies $natural as 1:<br />
<b>db.foo.find({a:100, b:100, c:100}).hint({$natural:1}).explain()</b><br />
<br />
Mongodb can be hinted to use a specific index e.g. index c in ascending order<br />
<b>db.foo.find({a:100, b:100, c:100}).hint({c:1}).explain()</b><br />
<br />
Further, indexing does not always improve the performance of a query. Operators such as<br />
$gt (greater than), $lt (less than), $ne (not equals), non-existence checks, and regular expressions not anchored at the beginning (e.g. /^abcd/ can use the index, while an unanchored pattern will be slow in spite of indexes) are not efficient when using indexes.<br />
<br />
Geo-spatial indexes allow finding things based on location information such as x, y co-ordinates. For example, if we have a location with x, y co-ordinates, i.e. "location" : [ 41.232, -75.343 ], then we use ensureIndex to create a 2d index as below:<br />
<b>db.stores.ensureIndex({ location : '2d', type : 1 })</b><br />
<b>db.stores.find({ location: {$near: [50,50]}})</b><br />
<br />
For spherical models the location must be specified as longitude, latitude as in the below example. The spherical parameter set to true indicates that we are looking for the spherical model. The maxDistance indicates the distance around the point, in radians (about 6 radians is all the way around the earth).<br />
<div>
<b>db.runCommand({ geoNear : "stores", near : [50,50], spherical : true, maxDistance : 1 })</b></div>
<br />
MongoDB provides various logging tools in order to analyze query performance. MongoDB stores the data files by default in "/data/db" on unix environments and "c:/data/db" on windows environments. It also automatically logs slow queries above 100ms in the default mongod text log. Mongodb also provides a profiler which writes entries/documents to the system.profile collection for any query that takes longer than the specified time. The mongodb profiler has 3 levels, specified as below:<br />
level 0 : Off<br />
level 1: log slow queries<br />
level 2: log all queries<br />
<br />
In order to enable profiling, the --profile parameter is specified while starting an instance of mongod; the below example enables level 1 profiling with a slow query threshold of 2 milliseconds.<br />
<b>mongod --dbpath /usr/local/var/mongodb --profile 1 --slowms 2</b><br />
<br />
The profile output can be searched by querying the system.profile collection. The below query for example finds anything with the namespace "/test.foo/" and sorts it by timestamp.<br />
<b>db.system.profile.find({ns:/test.foo/}).sort({ts:1}).pretty()</b><br />
<br />
The below query finds the entries where milliseconds is greater than 1 and sorts the output by timestamp.<br />
<b>db.system.profile.find({millis: {$gt:1}}).sort({ts:1}).pretty()</b><br />
<br />
The below command finds the slowest query using mongodb system profiler.<br />
<b>db.system.profile.find().sort({millis: -1}).limit(1)</b><br />
<div>
<br /></div>
To determine the current profile level and profile status details of the database we use the following commands respectively.<br />
<b>db.getProfilingLevel()</b><br />
<b>db.getProfilingStatus()</b><br />
<br />
The below command sets the profiling level to level 1 and logs all the resulting documents that take longer than 4 milliseconds to fetch.<br />
<b>db.setProfilingLevel(1, 4)</b><br />
<br />
Profiling can be turned off in mongodb by setting the profiling level to 0 as below:<br />
<b>db.setProfilingLevel(0)</b><br />
<br />
The mongotop command helps to track the time spent reading and writing data by the mongod instance. In order to run mongotop less frequently, we specify the number of seconds between each execution as below:<br />
<b>mongotop 3</b><br />
<br />
The <a href="http://docs.mongodb.org/manual/reference/program/mongostat/">mongostat utility</a> provides a quick overview of the status of a currently running mongod or mongos instance and executed by the <b>mongostat</b> command. It provides the query per sec or update per sec, index miss percentage, or miss rate to the index in memory, i.e. whether an index is in memory (or it has to go to the disc) details.<br />
<br />
<br />
<h4>
<u><b><span style="font-size: large;">Aggregation Framework</span></b></u></h4>
<br />
The aggregation pipeline is a framework where documents enter a multi-stage pipeline that transforms them into an aggregated result. Below are the stages which refine the result.<br />
Collection --> $project --> $match --> $group --> $sort --> Result<br />
<br />
The aggregate() method calculates aggregate values for the data in a collection and is called on a collection object. The aggregate() method takes an array as a parameter. Each of the items in the array passed to the aggregate() method is a stage that transforms the documents. Each stage can appear more than once, and in any order, in the aggregation pipeline. The list of stages in aggregation is as follows:<br />
<br />
<b>$project</b> - It is used to select or reshape (change the form of) the results. The input to output ratio is 1:1, i.e. if it sees 10 documents then it produces 10 documents.<br />
<b>$match</b> - It is used to filter the documents based on the query or conditions. The input to output ratio is n : 1<br />
<b>$group</b> - It is used to aggregate the documents based on the common fields. The input to output ratio is n : 1<br />
<b>$sort</b> - It is used to sort the documents in the aggregation pipeline. The input to output ratio is 1 : 1<br />
<b>$skip</b> - It skips the specified number of documents. The input to output ratio is n : 1<br />
<b>$limit</b> - It limits the number of documents as a result. The input to output ratio is n : 1<br />
<b>$unwind</b> - Mongodb can have documents with subarrays i.e. prejoined data. This unjoins the data i.e. normalizes the data. The input to output ratio is 1 : n<br />
<br />
The <b>$group</b> phase is used to aggregate the documents in the collection based on the group id i.e. _id. We can group by an _id as null which essentially gives us all the documents.<br />
<br />
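As a minimal sketch (using the products collection from the examples that follow), grouping with _id as null counts and summarizes the entire collection in a single group:<br />
<br />
db.products.aggregate([<br />
{$group: { _id: null,<br />
           total_products: {$sum: 1},<br />
           max_price: {$max: "$price"} } }<br />
])<br />
<br />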
<u>Compound aggregation</u> aggregates on more than one key. For example, in the below aggregation, the _id key consists of a grouping of multiple keys. The _id key can be a complex key, but it has to be unique.<br />
<br />
db.products.aggregate([<br />
{$group: { _id: { "manufacturer": "$manufacturer",<br />
"category": "$category" },<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>num_products:{$sum:1} } }<br />
])<br />
<br />
Aggregation Expressions:<br />
<br />
1) The <b>$sum</b>, <b>$avg</b>, <b>$min</b>, <b>$max</b> are used to calculate sum, find average, find minimum or maximum value for the group of documents with matching group id.<br />
<br />
db.products.aggregate([<br />
<div>
{$group: { _id: "$category",<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>avg_price:{ $avg:"$price"} } }<br />
])</div>
<div>
<br /></div>
2) The <b>$push</b> and <b>$addToSet</b> are used to build arrays and to push values into an array of the result document. The $addToSet operator adds only unique values to the array, as opposed to $push which adds duplicates too.<br />
<br />
db.products.aggregate([<br />
{$group: { _id: { "maker":"$manufacturer" },<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span>categories: { $addToSet:"$category" } } }<br />
])<br />
<div>
<br /></div>
3) <b>$first</b> and <b>$last</b> are group operators which give us the first and last values in each group as the aggregation pipeline processes the documents. The documents need to be sorted in order to get meaningful first and last values. The $first operator finds the first value of the key in the documents while the $last operator finds the last value of the key in the documents.<br />
<br />
db.zips.aggregate([<br />
{$group: { _id: {state:"$state", city: "$city"},<br />
population: {$sum:"$pop"} } },<br />
{$sort: {"_id.state":1, "population":-1} },<br />
{$group: { _id:"$_id.state",<br />
city: {$first: "$_id.city"},<br />
population: {$first:"$population"} } },<br />
{$sort: {"_id":1} }<br />
])<br />
<div>
<br /></div>
We can run one aggregation stage more than once in the same aggregation query, for example using the $group operation in stages to find the average class grade in each class.<br />
<br />
db.grades.aggregate([<br />
{'$group': { _id:{ class_id: "$class_id",<br />
student_id: "$student_id"},<br />
'average': { "$avg": "$score" } } },<br />
{'$group': { _id:"$_id.class_id",<br />
'average': {"$avg":"$average"} } }<br />
])<br />
<br />
The <b>$project</b> phase allows to reshape the documents as they come through the pipeline. It has a 1:1 input to output documents ratio. We can remove keys, add keys or reshape keys (take a key and put it in a sub-document with another key). Various functions can be applied on the keys such as $toUpper, $toLower, $add, and $multiply. The project phase is mainly used to clean up the documents, eliminating or cherry-picking fields during the initial stages.<br />
<br />
db.products.aggregate([<br />
{$project: { _id: 0, <br />
'maker': { $toLower:"$manufacturer"},<br />
'details': { 'category': "$category",<br />
'price' : {"$multiply": ["$price", 10] } },<br />
'item':'$name' } }<br />
])<br />
<br />
In case the key is not mentioned, it is not included, except for _id, which must be explicitly suppressed. If you want to include a key exactly as it is named in the source document, you just write key:1, where key is the name of the key.<br />
<br />
The <b>$match</b> phase is used for filtering the documents as they pass through the pipeline; its input to output ratio is n:1. It lets the pipeline aggregate only a portion of the documents, or search for a particular subset of documents.<br />
<br />
db.zips.aggregate([<br />
{ $match: { state:"NY" } },<br />
{ $group: { _id: "$city",<br />
population: {$sum: "$pop"},<br />
zip_codes: {$addToSet: "$_id"} } },<br />
{ $project: { _id: 0,<br />
city: "$_id",<br />
population: "$population",<br />
zip_codes:1 } }<br />
])<br />
<br />
The <b>$sort</b> phase sorts all the documents from the previous phase in the aggregation pipeline. Sorting can be a memory hog, as it is done in memory. If the sort appears before the grouping phase and after a match, it can use an index, but it cannot use an index after a grouping phase. Sorting can be done multiple times in the aggregation pipeline.<br />
<br />
db.zips.aggregate([<br />
{ $match: { state:"NY" } },<br />
{ $group: { _id: "$city",<br />
population: {$sum: "$pop"},<br />
zip_codes: {$addToSet: "$_id"} } },<br />
{ $project: { _id: 0,<br />
city: "$_id",<br />
population:1, } },<br />
{ $sort: { population:-1 } },<br />
{ $skip: 10 },<br />
{ $limit: 5 }<br />
])<br />
<br />
The <b>$skip</b> and <b>$limit</b> operations must be added only after the $sort operation, otherwise the result would be undefined. Usually we use skip first and then limit, as the order of skip and limit matters for the aggregation framework, unlike the normal find operation.<br />
<br />
The <b>$unwind</b> operation unjoins prejoined data, i.e. flattens array data into separate elements, such that unwinding {a:1, b:2, c:['apple', 'pear', 'orange']} gives the results {a:1, b:2, c:'apple'}, {a:1, b:2, c:'pear'}, {a:1, b:2, c:'orange'}<br />
<br />
db.posts.aggregate([<br />
{"$unwind":"$tags"},<br />
{"$group": { "_id":"$tags",<br />
"count":{$sum:1} } },<br />
{"$sort":{"count":-1}},<br />
{"$limit": 10},<br />
{"$project": { _id:0,<br />
'tag':'$_id',<br />
'count': 1 } }<br />
])<br />
<br />
<b>Double unwind</b> is used when there is more than one array in the document; it creates a Cartesian product of the two arrays and the rest of the document. The effect of $unwind can be reversed by a $push operation; in the case of a double unwind the effect can be reversed using two consecutive $push operations (a sketch of the reversal follows the example below). In the below example we have two arrays, sizes[] and colors[], in the document, and we perform the double unwind operation:<br />
<br />
db.inventory.aggregate([<br />
{$unwind: "$sizes"},<br />
{$unwind: "$colors"},<br />
{$group: { '_id': {'size':'$sizes',<br />
'color':'$colors'},<br />
'count': {'$sum':1} } }<br />
])<br />
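<br />
Below is a minimal sketch of reversing the above double unwind with two consecutive $push operations; it assumes a hypothetical "name" field that identifies each inventory document:<br />
<br />
db.inventory.aggregate([<br />
{$unwind: "$sizes"},<br />
{$unwind: "$colors"},<br />
// the first $push rebuilds the colors array for each (name, size) pair<br />
{$group: { _id: {name: "$name", size: "$sizes"},<br />
           colors: {$push: "$colors"} } },<br />
// the second $push rebuilds the sizes array, restoring one document per item<br />
{$group: { _id: "$_id.name",<br />
           sizes: {$push: {size: "$_id.size", colors: "$colors"}} } }<br />
])<br />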
<br />
<br />
Below is the <a href="http://docs.mongodb.org/manual/reference/sql-aggregation-comparison/">mapping</a> between the SQL world operations with the MongoDB aggregation operators.<br />
<br />
<table border="1" cellpadding="0" cellspacing="0" style="border-collapse: collapse; border: solid 1px #DDEEEE; font-family: arial, sans-serif; text-align: left; width: 70%;text-shadow: 1px 1px 1px #fff;">
<tbody>
<tr>
<th style="background-color: #DDEFEF; border: solid 1px #DDEEEE; color: #336B6B; padding: 8px;" width="30%"><b><u>SQL Terms, Functions</u></b></th>
<th style="background-color: #DDEFEF; border: solid 1px #DDEEEE; color: #336B6B; padding: 8px;" width="70%"><b><u>MongoDB Aggregation Operators</u></b></th>
</tr>
<tr> <td style="padding: 8px;">WHERE</td><td style="padding: 8px;">$match</td> </tr>
<tr> <td style="padding: 8px;">GROUP BY</td><td style="padding: 8px;">$group</td> </tr>
<tr> <td style="padding: 8px;">HAVING</td><td style="padding: 8px;">$match</td> </tr>
<tr> <td style="padding: 8px;">SELECT</td><td style="padding: 8px;">$project</td> </tr>
<tr> <td style="padding: 8px;">ORDER BY</td><td style="padding: 8px;">$sort</td> </tr>
<tr> <td style="padding: 8px;">LIMIT</td><td style="padding: 8px;">$limit</td> </tr>
<tr> <td style="padding: 8px;">SUM()</td><td style="padding: 8px;">$sum</td> </tr>
<tr> <td style="padding: 8px;">COUNT()</td><td style="padding: 8px;">$sum</td> </tr>
<tr> <td style="padding: 8px;">join</td><td style="padding: 8px;">No direct corresponding operator; however, the $unwind operator allows for somewhat similar functionality, but with fields embedded within the document.</td> </tr>
</tbody></table>
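<br />
To make the mapping concrete, here is a sketch (using the products collection from the earlier examples) of a SQL GROUP BY / HAVING query and its aggregation pipeline equivalent:<br />
<br />
// SQL: SELECT manufacturer, COUNT(*) AS num_products<br />
//      FROM products GROUP BY manufacturer HAVING COUNT(*) &gt; 5<br />
db.products.aggregate([<br />
{$group: { _id: "$manufacturer", num_products: {$sum: 1} } },   // GROUP BY + COUNT()<br />
{$match: { num_products: {$gt: 5} } }                           // HAVING<br />
])<br />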
<br />
<br />
Some of the limitations of MongoDB aggregation framework are as follows:<br />
<ul>
<li>The result of an aggregation is limited to 16MB of memory, i.e. the maximum document size.</li>
<li>Aggregation can't use more than 10% of the memory on the machine.</li>
<li>Aggregation works in a sharded environment, but after the first group or the first sort phase, the aggregation has to be brought back to the mongos. The first group and sort can be split up to run on the different shards. Their partial results then need to be gathered at the mongos for the final result before being sent to the next stage of the pipeline. Hence the remaining calculations for the aggregation happen on the mongos (router) machine, which typically also hosts the application.</li>
</ul>
<br />
<h4>
<span style="font-size: large;"><b><u>ReplicaSet</u></b></span></h4>
<br />
In order to provide fault tolerance in mongodb we have the replica set. The replica set has a single primary node and multiple secondary nodes. The application mainly writes to and reads from the primary node. The secondary nodes sync their data with the primary node. If the primary goes down, an election is conducted among the secondary nodes to elect a new primary node. If the old primary comes back up, it will join the replica set as a secondary node. The minimum number of nodes needed to assure the election of a new primary if a node goes down is 3. Every node has one vote.<br />
<br />
Types of Replica Set Nodes:<br />
<ul>
<li><u>Regular</u> : It has data and can become a primary node. It is a node of normal type and can be primary/secondary.</li>
<li><u>Arbiter</u> : This node is just there for voting purposes, and maintains majority for voting.</li>
<li><u>Delayed/Regular</u> : It is the disaster recovery node and is set hours behind the other nodes. It can participate in voting but can't become the primary node, as its priority is set to zero.</li>
<li><u>Hidden</u> : It is used for analytics, can never become primary, and has its priority set to zero.</li>
</ul>
<br />
The application always writes to the primary. In case of a failover, when the primary goes down, the application is unable to perform any writes. The application can, on the other hand, read from the secondary nodes as well, but the data can be stale, since the data written to the primary node is asynchronously synced with the secondary nodes.<br />
<br />
A replica set, whose members should ideally run on different machines (or at least on different ports for testing), is created as below.<br />
<br />
<b>mkdir -p /data/rs1 /data/rs2 /data/rs3</b><br />
<b>mongod --replSet m101 --logpath "1.log" --dbpath /data/rs1 --port 27017 --oplogSize 64 --smallfiles --fork</b><br />
<b>mongod --replSet m101 --logpath "2.log" --dbpath /data/rs2 --port 27018 --oplogSize 64 --smallfiles --fork</b><br />
<b>mongod --replSet m101 --logpath "3.log" --dbpath /data/rs3 --port 27019 --oplogSize 64 --smallfiles --fork</b><br />
<br />
The <b>--replSet</b> option tells that the current mongod instance is part of the same replica set "m101". To tie these instances together, we create the configuration as below:<br />
<br />
config = { _id: "m101", members:[<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span> { _id : 0, host : "localhost:27017", priority:0, slaveDelay:5 }, <br />
<span class="Apple-tab-span" style="white-space: pre;"> </span> { _id : 1, host : "localhost:27018"},<br />
<span class="Apple-tab-span" style="white-space: pre;"> </span> { _id : 2, host : "localhost:27019"} ]<br />
};<br />
<br />
In the above configuration the "<b>slaveDelay</b>" option delays the data sync on the specified instance by 5 seconds. The "<b>priority</b>" option, when set to 0, prevents that instance from ever becoming the primary node. To initialize the above configuration we connect to one of the mongod instances using the mongo shell and then use the replica set initialization command as below:<br />
<b>mongo --port 27018</b><br />
<b>rs.initiate(config);</b><br />
<br />
To get the replicaset status we use the replica set status command: <b>rs.status();</b><br />
To read the data from the secondary node we run the command: <b>rs.slaveOk();</b><br />
To check if the current instance is primary or not we use the command: <b>rs.isMaster();</b><br />
To force the current replica set node to step down as primary node we use the command: <b>rs.stepDown();</b><br />
<div>
<br /></div>
The data is replicated to the other instances using a special capped collection with a limited size (it loops over once it fills up), called "oplog.rs". The secondary instances query the primary's oplog for all the entries since a particular timestamp in order to keep their data in sync. In case of a failure of the primary instance, it takes a very short time to elect a new primary node, depending on the number of instances in the replica set.<br />
<br />
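As a quick sketch, the oplog can be inspected from the mongo shell on any replica set member by reading oplog.rs in the local database in reverse natural order (the test.foo namespace below is just an example):<br />
<br />
db.getSiblingDB("local").oplog.rs.find({ns: "test.foo"}).sort({$natural: -1}).limit(3).pretty()<br />
<br />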
When the primary node goes down with some writes still pending to be synced to the secondary nodes, a secondary node becomes the primary node, unaware of the writes made to the old primary. Now when the old primary node comes back up as a secondary node, it finds its extra writes and rolls them back, writing them to a rollback log file. Also it should be noted that once the mongo shell is connected to the replica set, simulating a failover requires a manual shutdown of the server.<br />
<br />
In the replica set, the client application is usually connected to the primary and secondary of the replica set.<br />
When an insert message is sent to the primary, it is written to RAM, then added to the journal asynchronously, and then written to the data directory separately, providing durability and recoverability. Secondary nodes are updated from the oplog collection of the primary node in a similar fashion, starting with RAM, then the journal and then the data directory. To determine whether the write succeeded, we call getLastError. The getLastError call takes various parameters such as j, w, fsync, and wtimeout. By default, with w=1, getLastError waits only for the acknowledgement that the primary has written to its RAM. When journaling is requested, i.e. when j=true, the getLastError call will not return until the journal is written. The fsync=true option means the getLastError call won't return until the write reaches RAM and is also synced to the data directory. To ensure secondary nodes are updated too, we set w=2, which means the getLastError call returns only when the primary node and at least one secondary node have been updated. The wtimeout, infinity by default, indicates how long the getLastError call will wait before it returns. Hence the <a href="http://mongodb.github.io/node-mongodb-native/driver-articles/mongoclient.html">default values</a> of the getLastError parameters are as follows (a sketch of calling getLastError from the shell follows the list below):<br />
<br />
<ul>
<li>w=1 by default</li>
<li>wtimeout=no value by default</li>
<li>journal=false by default</li>
<li>fsync=false by default</li>
</ul>
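Below is a minimal sketch of overriding these defaults from the mongo shell by issuing getLastError explicitly after a write (the people collection and the values used are just examples):<br />
<br />
db.people.insert({name: "Ada"})<br />
// wait for the write to reach the primary plus at least one secondary,<br />
// with journaling, giving up after 8 seconds<br />
db.runCommand({getLastError: 1, w: 2, j: true, wtimeout: 8000})<br />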
<br />
By default all read requests are sent to the primary and the response comes from the primary alone. When we read from a secondary, there may be a replication lag, since all writes are done only on the primary; this is also known as eventual consistency. The read preference by default is primary. The secondary read preference will send reads to a randomly selected secondary. The secondary preferred option will send reads to a secondary if one is available, else it sends the read requests to the primary. Similarly, primary preferred will send all reads to the primary unless there is no primary. The nearest read preference will send reads to a secondary or the primary without distinguishing between them, only sending the requests to the replica set members within a certain window of the fastest ping time, which is dynamically calculated.<br />
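<br />
A minimal sketch of switching the read preference from the mongo shell (assuming the shell is connected to a replica set) looks as follows:<br />
<br />
db.getMongo().setReadPref("secondaryPreferred")<br />
db.foo.find({state: "NY"})   // this read may now be served by a secondary<br />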
<br />
<h4>
<span style="font-size: large;"><b><u>Sharding</u></b></span></h4>
<div>
<br /></div>
<a href="http://docs.mongodb.org/manual/core/sharding-introduction/">Sharding</a> is a process of storing the data records across multiple machines, providing horizontal scaling to MongoDB to handle data growth. In order to implement sharding in MongoDB we deploy multiple mongod servers, and have a mongos instance which acts as the router to all the mongod servers. The application talks to mongos, which then talks to the individual mongod instances. Each mongod server can be a single server or a set of servers called a replica set (where the data is kept in sync, forming logically one shard). Hence, with the help of multiple mongod instances, each known as a shard, the application can access the collections transparently in order to achieve scaling out. The shards split the data of the collections, and are themselves replica sets. All the queries are made through the router called mongos, which handles the sharding distribution. Sharding uses a range based approach with a shard key assigned to each shard. For example, a query to retrieve records based on order_ids (where order_id is the shard key) uses the range of order_ids assigned to each shard via the mapping of the chunks. So the order_id maps to a chunk which maps to a shard, and mongos routes the request for that order_id to the mapped shard. In case the shard key is not specified, mongos scatters the request to all the shards and gathers back the responses from all the servers. For an insert operation a shard key has to be specified in order to add the data to a particular shard in a sharded environment. The collections which are not sharded stay in the shard0 mongod instance. The shard key is determined by the database user and has to be a part of the document schema. Mongo breaks the collection into chunks based on the shard key and decides which shard each chunk resides on. The mongos instance is stateless and usually resides on the same machine as the application. Mongos also handles failures similar to a replica set, where if one instance goes down another comes back up. Generally both the application and the mongo shell connect to the mongos instance.<br />
An insert operation requires passing the shard key (even a multi-part shard key) to the mongos. On the other hand, for read operations, if the shard key is not specified then the request has to be broadcast across all the shards. For updates, if you cannot specify the shard key then you have to use a multi update (a sketch of both cases follows below).<br />
<div>
<br /></div>
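A minimal sketch of the update rules against the test.grades collection sharded on student_id (as set up further below; the fields type and reviewed are just examples) looks as follows:<br />
<br />
// the shard key (student_id) is specified, so mongos targets a single shard<br />
db.grades.update({student_id: 42}, {$set: {reviewed: true}})<br />
// no shard key in the query, so the update must be a multi update sent to all shards<br />
db.grades.update({type: "exam"}, {$set: {reviewed: true}}, {multi: true})<br />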
<br />
Below are the parameters specified with the mongod command to create instances of a replica set which act as a shard server, using the "--shardsvr" option:<br />
<b>mongod --replSet s0 --logpath "s0-r0.log" --dbpath /data/shard0/rs0 --port 37017 --fork --shardsvr --smallfiles</b><br />
<br />
Similar to the replica set creation, we create a config document and execute the rs.initiate command along with the config document. The config servers are created similarly by specifying the "--configsvr" option to the mongod command as below:<br />
<b>mongod --logpath "cfg-a.log" --dbpath /data/config/config-a --port 57040 --fork --configsvr</b><br />
<br />
In order to make the replica sets and the config servers to work together, we first start the mongos on the standard port (i.e. standard mongod port) specifying the config servers.<br />
<b>mongos --logpath "mongos-1.log" --configdb localhost:57040,localhost:57041,localhost:57042 --fork</b><br />
<br />
Secondly we execute the admin commands to add the shards (shard_id/instance_host, which is a seed list) to the mongos, enable sharding on the database and shard the collection by specifying the shard key. MongoDB requires an index on the starting prefix of the shard key, and creates such an index if the collection does not exist yet.<br />
<br />
<b>db.adminCommand( { addShard : "s0/"+"localhost:37017" } );</b><br />
<b>db.adminCommand( { addShard : "s1/"+"localhost:47017" } );</b><br />
<b>db.adminCommand( { addShard : "s2/"+"localhost:57017" } );</b><br />
<b>db.adminCommand({enableSharding: "test"})</b><br />
<b>db.adminCommand({shardCollection: "test.grades", key: {student_id:1}});</b><br />
<br />
The status of the shard can be checked by running the <b>sh.status()</b> command in the mongo terminal connected to mongos instance. It provides the partitioned status which is true for sharded databases, the chunk information and the range of the shard key for various shards. The <b>db.collectionname.stats()</b> gives the collection statistics with the sharded flag set as true.<br />
<br />
Implications of sharding on development:
<br />
<ol>
<li>Every document should include the shard key. The shard key should be chosen among the fields which will be used in most of the queries.</li>
<li>The shard key must be immutable.</li>
<li>There should be an index that starts with the shard key, i.e. the shard key should be its prefix, such as [student_id, class]. It cannot be a multikey index.</li>
<li>When doing an update, either the shard key must be specified or multi must be true (which sends the update to all the nodes).</li>
<li>No shard key means scatter gather, i.e. the request will be sent to all the shards (see the sketch after this list).</li>
<li>There should be <a href="http://docs.mongodb.org/manual/tutorial/enforce-unique-keys-for-sharded-collections/">no unique key</a> unless it is a part of or starts with the shard key.</li>
</ol>
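A sketch of checking this behaviour with explain() on the sharded test.grades collection (student_id being the shard key; the type field is just an example) follows:<br />
<br />
// targeted query: the shard key is present, so mongos routes it to a single shard<br />
db.grades.find({student_id: 42, type: "exam"}).explain()<br />
// scatter gather: no shard key, so mongos broadcasts the query to all the shards<br />
db.grades.find({type: "exam"}).explain()<br />
<br />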
A shard is always a replica set to ensure reliability. The mongos has connections to the primary and possibly the secondary nodes of each replica set, and is seedless. The write concern values, i.e. the j value, w value and wtimeout, are passed by the mongos to each shard node and are reflected in the final write. Usually the mongos is itself replicated, and the driver can be given multiple mongos instances by the application.<br />
<br />
The following points should be considered while choosing a shard key.
<br />
<ol>
<li>There should be sufficient cardinality, meaning that there should be enough distinct values of the shard key in the database. A secondary part can be added to the key to ensure sufficient cardinality.</li>
<li>Avoid hotspotting in writes, which occurs with anything that is monotonically increasing. For example, if the shard key is _id then its value keeps increasing past the max key within the existing shard ranges and always hits the last shard. This is bad especially when there are frequent inserts into the database. For example, in the case of an orders schema containing [order_id, date, vendor], we select [vendor, date] as the shard key, since it has sufficient cardinality and the vendor component avoids the monotonically increasing hotspot. If the problem is naturally parallel, such as a [username, album] schema, the shard key can simply be [username], as processing multiple users' albums in parallel is natural (see the sketch after this list).</li>
</ol>
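As a sketch, sharding a hypothetical test.orders collection on the compound [vendor, date] key discussed above can be done with the shell helpers:<br />
<br />
sh.enableSharding("test")<br />
sh.shardCollection("test.orders", {vendor: 1, date: 1})<br />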
<br />
<br /></div>
Pranav Patilhttp://www.blogger.com/profile/10125896717744035325noreply@blogger.com0tag:blogger.com,1999:blog-7610061084095478663.post-876462470234190442013-11-29T13:28:00.002-08:002021-09-23T14:17:50.567-07:00Singleton Pattern<div style="border-bottom: medium none; border-left: medium none; border-right: medium none; border-top: medium none;">
Singleton is one of the most basic and easy to use design patterns from the Gang of Four book of design patterns. The GOF describes the singleton pattern as "<b>Ensure a class has only one instance, and provide a global point of access to it</b>". The Singleton pattern requires ensuring that only one instance of a class is created in the Java Virtual Machine.<br />
<br /></div>
<div style="border-bottom: medium none; border-left: medium none; border-right: medium none; border-top: medium none;">
<pre class="brush: java">public class Singleton {
// Static member holds only one instance of the Singleton class
private static Singleton instance;
// Singleton prevents any other class from instantiating
private Singleton() {
}
// Providing Global point of access
public static Singleton getInstance() {
if (null == instance) {
instance = new Singleton();
}
return instance;
}
}</pre>
<br />
<a href="http://www.javaworld.com/javaworld/jw-04-2003/jw-0425-designpatterns.html">Singleton pattern</a> is generally implemented using a static factory method called getInstance, a static variable holding the class instance and a private constructor. This ensures that the singleton instance is created only when required, also known as lazy instantiation. The private constructor inhibits sub-classing of the singleton class; if sub-classing must be ruled out explicitly, the class should be made final or placed in an isolated singleton package.<br />
<br />
Multiple classloaders can still end up with multiple singleton instances, as classes loaded by different classloaders that access the singleton each get their own copy. Hence in a multiple JVM environment, each JVM will have its own copy of the singleton object, which can lead to many issues, especially in a clustered environment where access to the resource needs to be restricted and synchronized. A basic way to handle this case is to explicitly specify the classloader which loads the singleton class, as below.<br />
<br />
<pre class="brush: java">private static Class getClass(String classname) throws ClassNotFoundException {
    // Prefer the context classloader of the current thread to load the singleton class
    ClassLoader classLoader = Thread.currentThread().getContextClassLoader();
    if (classLoader == null)
        classLoader = Singleton.class.getClassLoader();
    return classLoader.loadClass(classname);
}</pre>
<br />
There are <a href="http://www.techspot.co.in/2009/07/singleton-in-cluster.html">multiple other techniques</a> such as JMS, DB, Custom API and 3rd party tools which handle such a scenario, but they also impact the business logic. Tools like Terracotta and Oracle Coherence work on the concept of providing an in-memory replication of objects across JVMs, in effect providing a singleton view, or one can make use of a cluster-aware cache provider like SwarmCache or JBoss TreeCache. Application servers provide some level of custom APIs to circumvent the problem, such as the JBoss HASingleton Service, Weblogic's concept of a Singleton Service and WebSphere's singleton across a cluster in the WebSphere XD version.<br />
<br />
Also, if the singleton implements java.io.Serializable and the object is serialized, then deserializing it multiple times would end up with multiple instances of the singleton class. The best way to handle this scenario is to make the singleton an enum, letting the underlying Java implementation handle the details. Another workaround is to <b>implement the readResolve method</b> (or the readObject() method) in the Singleton class to return the same singleton instance. The clone method is overridden to throw an exception, preventing creation of multiple instances of the singleton class via cloning.<br />
<br />
Also, the getInstance method is not thread safe, since multiple threads can call the getInstance method in parallel, creating multiple class instances. The multithreaded access can be handled by declaring the getInstance method as synchronized. But a synchronized method in turn adds extra overhead for every call to the getInstance method, since synchronized methods can run up to 100 times slower than unsynchronized methods. This can be avoided using <b>double checked locking</b> as below, which synchronizes only for the very first calls to the getInstance method, when the class instance is not yet initialized. The double null check is required for the case where a thread is preempted after entering the synchronized block, before it can assign the singleton member variable, and a subsequently waiting thread then enters the synchronized block. Also, when multiple threads try to read the singleton instance, there is a possibility that a stale or partially initialized value is read by a thread, causing it to create a new instance. The <a href="http://javarevisited.blogspot.sg/2014/05/double-checked-locking-on-singleton-in-java.html">volatile variable</a> from Java 5 guarantees that all the writes to the volatile variable happen before any reads, hence avoiding creation of another instance due to a race condition.<br />
<br />
Unfortunately, without the volatile keyword, double-checked locking is not guaranteed to work, because the compiler is free to assign a value to the singleton member variable before the singleton's constructor has completed.</div>
<br />
<pre class="brush: java">public class Singleton {
// Singleton instance should be volatile to avoid reading a stale value
private static volatile Singleton instance;
// Singleton prevents any other class from instantiating
private Singleton() {
}
public static Singleton getInstance() {
if (null == instance) {
synchronized (Singleton.class){
if (null == instance) {
instance = new Singleton();
}
}
}
return instance;
}
}
</pre>
<br />
To provide a fast, easy and thread safe singleton solution, we initialize the class instance in a static variable, whose initializer is guaranteed to be executed only once, the first time the class is accessed. The classloader guarantees that the singleton instance will not be visible until it is fully created. The below implementation, however, gives up the flexibility of later allowing multiple instances of the Singleton class. Since the below implementation fixes the singleton at compile time as opposed to runtime, a <a href="http://www.javaworld.com/javaworld/jw-04-2003/jw-0425-designpatterns.html?page=5">singleton registry</a> implementation can be used instead, which maintains a static HashMap containing the class name as the key and the class instance (instantiated using reflection) as the value.<br />
<br />
<pre class="brush: java">// Singleton Class
public class Singleton {
// declare a static instance of Singleton class to initialize only once during loading.
private static final Singleton instance = new Singleton();
// declare the constructor as private
private Singleton() {}
// get the instance of singleton class
public static Singleton getInstance() {
return instance;
}
}
</pre>
<br />
Another approach to implement a singleton, which is considered much easier, is using enums. Enums guarantee thread-safety by default, hence there is no need for double checked locking during instance creation of the singleton class. Another problem with the conventional implementation of a singleton is that once we implement the Serializable interface it no longer remains a singleton, because the readObject() method always returns a new instance, just like a constructor. This can be avoided by overriding the readResolve() method to return the singleton instance and discarding the newly created instance. This can become even more complex if the singleton class maintains state, where the state has to be made transient. With the enum approach, since enum instances are by default final and created only once, it provides safety against multiple instances due to serialization.<br />
<br />
<pre class="brush: java">// Implement a singleton using enum which is accessed as Singleton.INSTANCE
public enum Singleton{
INSTANCE;
}
</pre>
<br />
Some of the criticism towards singletons is that they increase dependencies by adding global state. They make the code tightly coupled, violate the single responsibility principle and also maintain their state for the lifetime of the application, which complicates unit testing (by requiring singletons to be initialized in a specific order for unit testing). On the other hand, singletons are used extensively to ensure that there are no duplicate instances of a class, thus avoiding OutOfMemory issues. Hence the <a href="http://forum.spring.io/forum/spring-projects/container/39639-why-spring-beans-are-singleton-by-default">default scope for spring beans is singleton</a>, since any stateless object may be a singleton without causing any concurrency issues.<br />
The most commonly accepted usage of Singletons where they yield the best results is in a situation where various parts of an application concurrently try to access a shared resource such as a Logger.Pranav Patilhttp://www.blogger.com/profile/10125896717744035325noreply@blogger.com0tag:blogger.com,1999:blog-7610061084095478663.post-46634903303780082752013-09-08T09:47:00.002-07:002016-02-23T10:29:20.936-08:00CXF WS Security<ol>
<li><a href="#heading1">Setup the WS Security in Weblogic</a></li>
<li><a href="#heading2">Test it using SOAP UI Client</a></li>
<li><a href="#heading3">Create CXF Client to Send Request with BST</a></li>
<li><a href="#heading4">Receive the Response from CXF Client with Security Confirmation</a></li>
</ol>
<div>
<span style="font-size: 15px; line-height: 17px;"><br /></span></div>
<div>
<h3 id="heading1">
<b><span style="font-family: "Calibri","sans-serif"; font-size: 16.0pt; line-height: 115%; mso-ansi-language: EN-US; mso-ascii-theme-font: minor-latin; mso-bidi-font-family: "Times New Roman"; mso-bidi-language: AR-SA; mso-bidi-theme-font: minor-bidi; mso-fareast-font-family: Calibri; mso-fareast-language: EN-US; mso-fareast-theme-font: minor-latin; mso-hansi-theme-font: minor-latin;">Setting up WS Security in Weblogic</span></b></h3>
</div>
<div>
<b><span style="font-family: "Calibri","sans-serif"; font-size: 16.0pt; line-height: 115%; mso-ansi-language: EN-US; mso-ascii-theme-font: minor-latin; mso-bidi-font-family: "Times New Roman"; mso-bidi-language: AR-SA; mso-bidi-theme-font: minor-bidi; mso-fareast-font-family: Calibri; mso-fareast-language: EN-US; mso-fareast-theme-font: minor-latin; mso-hansi-theme-font: minor-latin;"><br /></span></b></div>
<div>
Oracle Weblogic Server 12c was used to configure the client application. The client application is an EJB application with an EJB stateless bean. It uses the weblogic.jws.Policies and weblogic.jws.Policy classes to specify the location of the policy file.</div>
<div>
<span style="font-family: "Calibri","sans-serif"; font-size: 11.0pt; line-height: 115%; mso-ansi-language: EN-US; mso-ascii-theme-font: minor-latin; mso-bidi-font-family: "Times New Roman"; mso-bidi-language: AR-SA; mso-bidi-theme-font: minor-bidi; mso-fareast-font-family: Calibri; mso-fareast-language: EN-US; mso-fareast-theme-font: minor-latin; mso-hansi-theme-font: minor-latin;"><br /></span></div>
<pre class="brush: java">@Stateless
@WebService(targetNamespace="http://ws-connector.sp.ttp.tsm.com", name="TTPSPService",
portName="TTPSPPort", endpointInterface="com.tsm.ttp.sp.ws_connector.TTPSP")
@Policies( { @Policy(uri = "policy:TTPSP-Policy.xml") } )
public class TTPSP implements com.tsm.ttp.sp.ws_connector.TTPSP {
public TTPSP() { }
@Override
public CheckEligibilityResponseType checkEligibility(CheckEligibilityRequestType input) {
CheckEligibilityResponseType result= new CheckEligibilityResponseType();
return result;
}
...........
}
</pre>
<br />
The policy file is located in the “Project/ejbModule/META-INF/policies” folder. The policy.xml specifies that it requires a “WssX509V3Token11” token for the Recipient, which is believed to be the Binary Security Token. The preferred algorithm-suite was “Basic256”. A Timestamp must be included, and both the Headers and the Body must be signed entirely. It also specifies the requirement of signature confirmation using the “<sp:RequireSignatureConfirmation/>” element.<br />
<br />
To generate the build first we need to generate the “wlfullclient.jar” for the current weblogic server. The <a href="http://docs.oracle.com/cd/E11035_01/wls100/client/jarbuilder.html">JarBuilder</a> is used to create wlfullclient.jar using the following <a href="http://docs.oracle.com/cd/E12840_01/wls/docs103/client/jarbuilder.html">command</a>.<br />
<b>WL_HOME/server/lib> java -jar wljarbuilder.jar</b><br />
<br />
<div class="MsoNormal" style="margin-bottom: 0.0001pt;">
In some cases the "weblogic.jws.Policies" or other packages may be absent in the wlfullclient.jar just created. In the case of Weblogic 10.3.3, the packages weblogic.jws.Policies and weblogic.jws.Policy are not present in the “<b>wseeclient.jar</b>” and “<b>wls-api.jar</b>” jar files either. These packages (weblogic.jws.Policies) can be found in the jar file
“C:\oracle\Middleware\modules\*ws.api_1.0.0.0.jar*” for Weblogic 10.3.3 and in
“C:\Oracle\Middleware\modules\ws.api_2.0.0.0.jar” for Weblogic 12.1.</div>
<div>
<br /></div>
<span style="font-family: Calibri, sans-serif;"><span style="font-size: 15px; line-height: 17px;">In Eclipse, we create a “New EJB Project” and all the source packages are added in the “ejbModule” folder. Also the “ejbModule” contains the “META-INF” folder containing the “MANIFEST.MF” and the “policy.xml” files. Create a “lib” folder in the project and add the “wlfullclient-11.1.jar”, “ws.api_1.1.0.0.jar” and other jars required.</span></span><br />
<span style="font-family: Calibri, sans-serif;"><span style="font-size: 15px; line-height: 17px;"><br /></span></span>
<span style="font-family: Calibri, sans-serif;"><span style="font-size: 15px; line-height: 17px;">Now create a “New Enterprise Application Project” i.e. EAR Project naming it same as previous project with EAR appended at the end. During the creation configure the EAR settings to add J2EE module dependencies. The EAR Project can be created by right clicking the “Deployment Descriptor: Projectname” -> New -> Project -> EAR Project.</span></span><br />
<span style="font-family: "Calibri","sans-serif"; font-size: 11.0pt; line-height: 115%; mso-ansi-language: EN-US; mso-ascii-theme-font: minor-latin; mso-bidi-font-family: "Times New Roman"; mso-bidi-language: AR-SA; mso-bidi-theme-font: minor-bidi; mso-fareast-font-family: Calibri; mso-fareast-language: EN-US; mso-fareast-theme-font: minor-latin; mso-hansi-theme-font: minor-latin;"><br /></span>
<span style="font-family: "Calibri","sans-serif"; font-size: 11.0pt; line-height: 115%; mso-ansi-language: EN-US; mso-ascii-theme-font: minor-latin; mso-bidi-font-family: "Times New Roman"; mso-bidi-language: AR-SA; mso-bidi-theme-font: minor-bidi; mso-fareast-font-family: Calibri; mso-fareast-language: EN-US; mso-fareast-theme-font: minor-latin; mso-hansi-theme-font: minor-latin;">In
order to configure the weblogic server with WS Security, we need to generate a keystore
using java keytool as follows:</span><br />
<br />
<span style="font-family: Calibri, sans-serif;"><span style="font-size: 15px; line-height: 17px;">1) </span></span><b><span style="font-family: "Calibri","sans-serif"; font-size: 11.0pt; line-height: 115%; mso-ansi-language: EN-US; mso-ascii-theme-font: minor-latin; mso-bidi-font-family: "Times New Roman"; mso-bidi-language: AR-SA; mso-bidi-theme-font: minor-bidi; mso-fareast-font-family: Calibri; mso-fareast-language: EN-US; mso-fareast-theme-font: minor-latin; mso-hansi-theme-font: minor-latin;">Generate a new JKS Keystore with new Keypair</span></b><span style="font-family: Calibri, sans-serif; font-size: 11pt; line-height: 115%;">:</span><br />
<span style="font-family: Calibri, sans-serif; font-size: 11pt; line-height: 115%;"><br /></span>
<br />
<div class="MsoNormal" style="margin-bottom: 0.0001pt; text-indent: 0.5in;">
keytool -genkeypair -alias bank: BANK -keyalg RSA
-keysize 1024<o:p></o:p></div>
<div class="MsoNormal" style="margin: 0in 0in 0.0001pt 0.5in; text-indent: 0.5in;">
-validity 365 -keystore bank.jks<o:p></o:p></div>
<div class="MsoNormal" style="margin-bottom: 0.0001pt; text-indent: 0.5in;">
KeyStore Password: t1bank<o:p></o:p></div>
<span style="font-family: "Calibri","sans-serif"; font-size: 11.0pt; line-height: 115%; mso-ansi-language: EN-US; mso-ascii-theme-font: minor-latin; mso-bidi-font-family: "Times New Roman"; mso-bidi-language: AR-SA; mso-bidi-theme-font: minor-bidi; mso-fareast-font-family: Calibri; mso-fareast-language: EN-US; mso-fareast-theme-font: minor-latin; mso-hansi-theme-font: minor-latin;"> Enter key password for <bank: BANK>: t1bank</span><br />
<span style="font-family: Calibri, sans-serif; font-size: 11pt; line-height: 115%;"><br /></span>
<span style="font-family: Calibri, sans-serif; font-size: 11pt; line-height: 115%;">2) </span><b><span style="font-family: "Calibri","sans-serif"; font-size: 11.0pt; line-height: 115%; mso-ansi-language: EN-US; mso-ascii-theme-font: minor-latin; mso-bidi-font-family: "Times New Roman"; mso-bidi-language: AR-SA; mso-bidi-theme-font: minor-bidi; mso-fareast-font-family: Calibri; mso-fareast-language: EN-US; mso-fareast-theme-font: minor-latin; mso-hansi-theme-font: minor-latin;">Export a certificate from the generate keystore</span></b><span style="font-family: Calibri, sans-serif; font-size: 11pt; line-height: 115%;">:</span><br />
<span style="font-family: Calibri, sans-serif; font-size: 11pt; line-height: 115%;"><br /></span>
<br />
<div class="MsoNormal" style="margin-bottom: 0.0001pt; text-indent: 0.5in;">
keytool -exportcert -alias bank: BANK -file bank.cer
-keystore bank.jks</div>
<div class="MsoNormal" style="margin-bottom: 0.0001pt; text-indent: 0.5in;">
Enter keystore password:</div>
<div class="MsoNormal" style="margin-bottom: 0.0001pt; text-indent: 0.5in;">
Certificate stored in file <bank.cer></div>
<div class="MsoNormal" style="margin-bottom: 0.0001pt; text-indent: 0.5in;">
<br /></div>
<div class="MsoNormal" style="margin-bottom: 0.0001pt;">
Now to configure the keystore in weblogic we have two choices, one is
to add the keys from the bank.jks keystore to the DemoTrust.jks keystore. The
other option is to change the Keystore configuration to use the Custom
keystore.</div>
<div class="MsoNormal" style="margin-bottom: 0.0001pt;">
Initially, we tried to setup a custom keystore using the description from this <a href="http://www.entrust.net/knowledge-base/technote.cfm?tn=8461">link</a>. The process was as follows:</div>
<div class="MsoNormal" style="margin-bottom: 0.0001pt;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-811vdpP7ksk/UivgTjD07ZI/AAAAAAAAB1s/PDG58869xDA/s1600/ss.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://1.bp.blogspot.com/-811vdpP7ksk/UivgTjD07ZI/AAAAAAAAB1s/PDG58869xDA/s1600/ss.png" /></a></div>
<div class="MsoNormal" style="margin-bottom: 0.0001pt;">
<br /></div>
<div class="MsoNormal" style="margin-bottom: 0.0001pt;">
</div>
<ol>
<li>In Weblogic server administration, expand Servers and select the server you need to update.</li>
<li>Select Configuration -> Keystores -> SSL.</li>
<li>Click the Change link under Keystore Configuration.</li>
<li>Select Custom Identity and Java Standard Trust as the keystore configuration type and continue.</li>
<li>For the Custom Identity Keystore File Name, enter the path to your Java keystore. Select Keystore type as jks .</li>
<li>Enter your Custom Identity Keystore Passphrase as the password you used when you created the Java keystore</li>
<li>Confirm the password, click Continue and then Finish. </li>
<li>Go back under Servers and select the server that you are working with.</li>
<li>Select Configuration -> Keystores -> SSL.</li>
<li>Under Configure SSL, select Keystores as the method for storing identities.</li>
<li>Enter the server certificate key alias (in this example, myalias was used), and the keystore password</li>
<li>Click Finish to finalize the changes. You will need to reboot Weblogic for those changes to take effect.</li>
</ol>
<div>
After going with the above approach by changing the "Keystore Configuration" to "Custom Identity and Java Standard Trust" and setting all the JKS Keystores pointing to bank.jks, weblogic console gave the following error:</div>
<div>
"<b><span style="color: red;">weblogic.management.DeploymentException: Deployment could not be created. Deployment creator is null.</span></b>"</div>
<div>
<br /></div>
<div>
The reason for the above error turns out to be that the SSL configuration had not been updated to correspond to the Keystore configuration. Hence the “SSL Configuration" was changed to use a “Custom Trust Store”, and the Key Alias and Password to be used were specified. This also resulted in a failure, as no request was able to hit the Weblogic server.</div>
<div>
<br /></div>
<div>
After learning from the above failures, we switch to the first option, i.e. add the certificate to the DemoTrust.jks. The Demo keystores are the keystores configured in the Weblogic Console by default. The Demo keystores are configured under (Environment-> Servers-> AdminServer-> Configuration-> Keystores). The names of the Demo keystores and their default passwords are as <a href="http://kingsfleet.blogspot.com/2008/11/using-demoidentity-and-demotrust.html">follows</a>:</div>
<div>
<br /></div>
<div>
<div>
Keystore: DemoTrust.jks</div>
<div>
Password: DemoTrustKeyStorePassPhrase</div>
<div>
Path: C:\Oracle\Middleware\wlserver_10.3\server\lib</div>
<div>
<br /></div>
<div>
Keystore: DemoIdentity.jks</div>
<div>
Password: DemoIdentityKeyStorePassPhrase</div>
<div>
Path: C:\Oracle\Middleware\wlserver_10.3\server\lib</div>
<div>
<br /></div>
<div>
Keystore: cacerts</div>
<div>
Password: changeit</div>
<div>
Path: C:\Oracle\Middleware\jdk160_21\jre\lib\security</div>
</div>
<div>
<br /></div>
<div>
<div>
All the Demo keystores for the Weblogic server are located in the path “Oracle\Middleware\wlserver_10.3\server\lib”. One could find the "DemoTrust.jks" and "DemoIdentity.jks" files here. Here we add the bank.cer <b>ONLY TO “DemoTrust.jks”</b> keystore and NOT TO “DemoIdentity.jks”. We also DON'T ADD bank.cer to "cacerts" located in the “Oracle\Middleware\jdk160_21\jre\lib\security” folder. The process is as follows:</div>
</div>
<div>
<br /></div>
<div>
<span style="text-indent: -0.25in;">1)</span><span style="font-size: 7pt; text-indent: -0.25in;"> </span><span style="text-indent: -0.25in;">Add the bank.cer ONLY TO DemoTrust.jks keystore
using the following command:</span></div>
<div class="MsoListParagraphCxSpFirst" style="margin-bottom: 0.0001pt; text-indent: -0.25in;">
<o:p></o:p></div>
<div class="MsoListParagraphCxSpLast" style="margin-bottom: 0.0001pt;">
<b> keytool
-<span style="color: #0000cc;">importcert</span> -alias bank: BANK -file bank.cer
-keystore DemoTrust.jks</b></div>
<div class="MsoListParagraphCxSpLast" style="margin-bottom: 0.0001pt;">
Enter keystore password: DemoTrustKeyStorePassPhrase</div>
<div>
<span style="font-family: Calibri, sans-serif;"><span style="font-size: 15px; line-height: 17px;"><br /></span></span></div>
<div>
<span style="font-family: Calibri, sans-serif;"><span style="font-size: 15px; line-height: 17px;"> Trust this certificate? [no]: yes</span></span></div>
<div>
<span style="font-family: Calibri, sans-serif;"><span style="font-size: 15px; line-height: 17px;"> Certificate was added to keystore</span></span></div>
<div>
<span style="font-family: Calibri, sans-serif;"><span style="font-size: 15px; line-height: 17px;"><br /></span></span></div>
<div>
<span style="font-family: Calibri, sans-serif;"><span style="font-size: 15px; line-height: 17px;">2) We confirm if the keys are added into the DemoTrust.jks by the following command:</span></span></div>
<div>
<div>
<span style="font-family: Calibri, sans-serif;"><span style="font-size: 15px; line-height: 17px;"> </span></span><b>keytool -list -keystore DemoTrust.jks</b> </div>
<div>
<br /></div>
<div>
<div>
All the server logs can be found in the following log file:</div>
<div>
“Oracle\Middleware\user_projects\domains\base_domain\servers\AdminServer\logs\base_domain.log”.</div>
</div>
<br />
<h3 id="heading2">
<b>WS Security with BST Client using SOAP-UI</b></h3>
<div>
<b><br /></b></div>
Open the SOAP-UI and create a new project based on the WSDL or Endpoint provided. In order to set WS Security for the SOAP-UI client, right click on the project created and select “Show Project View” from the Menu.
<br />
<div>
<span style="font-family: "Calibri","sans-serif"; font-size: 11.0pt; line-height: 115%; mso-ansi-language: EN-US; mso-ascii-theme-font: minor-latin; mso-bidi-font-family: "Times New Roman"; mso-bidi-language: AR-SA; mso-bidi-theme-font: minor-bidi; mso-fareast-font-family: Calibri; mso-fareast-language: EN-US; mso-fareast-theme-font: minor-latin; mso-hansi-theme-font: minor-latin;"><br /></span></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-I7D2LVYF_6g/UivjEjhj9JI/AAAAAAAAB14/Oa1jMudB_9k/s1600/q.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="361" src="http://3.bp.blogspot.com/-I7D2LVYF_6g/UivjEjhj9JI/AAAAAAAAB14/Oa1jMudB_9k/s400/q.png" width="400" /></a></div>
<div>
<span style="font-family: "Calibri","sans-serif"; font-size: 11.0pt; line-height: 115%; mso-ansi-language: EN-US; mso-ascii-theme-font: minor-latin; mso-bidi-font-family: "Times New Roman"; mso-bidi-language: AR-SA; mso-bidi-theme-font: minor-bidi; mso-fareast-font-family: Calibri; mso-fareast-language: EN-US; mso-fareast-theme-font: minor-latin; mso-hansi-theme-font: minor-latin;"><br /></span></div>
<div>
<span style="font-family: Calibri, sans-serif;"><span style="font-size: 15px; line-height: 17px;">Select the “WS-Security Configurations” tab and select the “Keystores/Certificates” tab in the inner window. Then click on the “+” button to select the new Keystore and enter the Keystore password.</span></span></div>
<div>
<span style="font-family: Calibri, sans-serif;"><span style="font-size: 15px; line-height: 17px;"><br /></span></span></div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://2.bp.blogspot.com/-O_4iopD7GVA/Uivjvle80lI/AAAAAAAAB2A/rRnu4sqRyHQ/s1600/w.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="140" src="http://2.bp.blogspot.com/-O_4iopD7GVA/Uivjvle80lI/AAAAAAAAB2A/rRnu4sqRyHQ/s640/w.png" width="640" /></a></div>
<br />
<br />
Then Select the “Outgoing WS-Security Configurations” tab in the inner window. Click the add button from the top section to add a new Configuration in the outgoing WS-Security configurations. Fill in the Default Username/Alias and password to be used in all the WSS Actions. Now in the bottom section click the “+” button to add a Timestamp Entry. Fill the “Time to Live” as 1800000 and check the option to set the Millisecond Precision of the Timestamp.<br />
Moving forward, add the second WSS Entry “Signature”, which will create the Binary Security Token. Select the Keystore which was entered in the “Keystores/Certificates” section and enter the Alias name with the corresponding password. Select the <b>Key Identifier Type</b> as “<b>Binary Security Token</b>” in order to create the Binary Security Token first. Select the signature algorithm, canonicalization algorithm and the digest algorithm. Lastly, check “Use single certificate for signing” in order to use only the base certificate and not all the certificates in the chain. The “Parts” section is kept empty, but by default SOAP-UI will sign the “Body” element using the generated BinarySecurityToken.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-JIsNBbPTzE4/UiyK-5YSLVI/AAAAAAAAB2Q/6jvDwXh6VmQ/s1600/1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="300" src="http://3.bp.blogspot.com/-JIsNBbPTzE4/UiyK-5YSLVI/AAAAAAAAB2Q/6jvDwXh6VmQ/s640/1.png" width="640" /></a></div>
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-bv7jRWr_6kQ/UiyLRu7H3QI/AAAAAAAAB2Y/tncQ4AeWDWo/s1600/2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="302" src="http://1.bp.blogspot.com/-bv7jRWr_6kQ/UiyLRu7H3QI/AAAAAAAAB2Y/tncQ4AeWDWo/s640/2.png" width="640" /></a></div>
<br />
<br />
Moving forward, add another Signature WSS entry, the third one overall. Similar to the previous Signature configuration, select the keystore, enter the alias and password, and select the same entries for the algorithms as before. Most importantly, for the <b>Key Identifier Type</b> select “<b>Issuer Name and Serial Number</b>” in order to sign all the elements. The “Use single certificate for signing” option remains unchecked, as all the certificates in the chain should be used for signing. Unlike before, use the “+” button near the “<b>Parts</b>” section to add the “<b>Timestamp</b>”, “<b>Body</b>” and “<b>BinarySecurityToken</b>” elements with the namespace and encoding information (by default it is “<b>Content</b>”).<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://4.bp.blogspot.com/-tWCMu-bCAcI/UiyMJ2RygkI/AAAAAAAAB2k/pgkKcVdWVIA/s1600/3.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="340" src="http://4.bp.blogspot.com/-tWCMu-bCAcI/UiyMJ2RygkI/AAAAAAAAB2k/pgkKcVdWVIA/s640/3.png" width="640" /></a></div>
<br />
Now the project is WS Security enabled for the Requests. But before firing the individual requests, select the Request Method and under project properties make sure that “Strip whitespaces” Property is set to “true”.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-WNChrkhKyUA/UiyMTgEoiDI/AAAAAAAAB2s/RelAx5OWKk4/s1600/4.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://1.bp.blogspot.com/-WNChrkhKyUA/UiyMTgEoiDI/AAAAAAAAB2s/RelAx5OWKk4/s1600/4.png" /></a></div>
<br />
Then click the “Authentication and Security-related settings” for the Request at the bottom, which opens a new window. Select the Outgoing WSS as the same name given in the “Outgoing Security Configurations” section before.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://4.bp.blogspot.com/-PNIT_QuwYtI/UiyM4ngGQ7I/AAAAAAAAB20/b9IgKA_SB6Y/s1600/5.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="342" src="http://4.bp.blogspot.com/-PNIT_QuwYtI/UiyM4ngGQ7I/AAAAAAAAB20/b9IgKA_SB6Y/s640/5.png" width="640" /></a></div>
<br />
<br />
This can also be done by right clicking the request and using the menu to select the “Outgoing WSS” to the corresponding Outgoing WS Security configuration. Mostly the prior method is preferred.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://2.bp.blogspot.com/-cl-DAhrXHOw/UiyM9zyw-fI/AAAAAAAAB28/hjpE9UiKB4o/s1600/6.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="420" src="http://2.bp.blogspot.com/-cl-DAhrXHOw/UiyM9zyw-fI/AAAAAAAAB28/hjpE9UiKB4o/s640/6.png" width="640" /></a></div>
<br />
The Resulting SOAP-UI Request is as follows:<br />
<pre class="brush: xml"><soapenv:envelope xmlns:soapenv="http://..." xmlns:ws="http://...">
<soapenv:header>
<wsse:security xmlns:wsse="http://...">
<ds:signature id="Signature-11" xmlns:ds="http://...">
<ds:signedinfo>
<ds:canonicalizationmethod algorithm="http://.../xml-exc-c14n#">
<ds:signaturemethod algorithm="http://.../xmldsig#rsa-sha1">
<ds:reference uri="#Timestamp-8">
<ds:transforms>
<ds:transform algorithm="http://.../xml-exc-c14n#">
</ds:transform></ds:transforms>
<ds:digestmethod algorithm="http://.../xmldsig#sha1">
<ds:digestvalue>0up9O5yZ6wLnau/eTzPZtfz+IIM=</ds:digestvalue>
</ds:digestmethod></ds:reference>
<ds:reference uri="#id-10">
<ds:transforms>
<ds:transform algorithm="http://.../xml-exc-c14n#">
</ds:transform></ds:transforms>
<ds:digestmethod algorithm="http://.../xmldsig#sha1">
<ds:digestvalue>EAuvZTemCXTia8fPYXngIZOCPE0=</ds:digestvalue>
</ds:digestmethod></ds:reference>
<ds:reference uri="#CertId-2B6B2C4066C46E9954132989807937513">
<ds:transforms>
<ds:transform algorithm="http://www.w3.org/2001/10/xml-exc-c14n#">
</ds:transform></ds:transforms>
<ds:digestmethod algorithm="http://www.w3.org/2000/09/xmldsig#sha1">
<ds:digestvalue>kYMlR5YhU9CHpVaL0uCVnxINNF0=</ds:digestvalue>
</ds:digestmethod></ds:reference>
</ds:signaturemethod></ds:canonicalizationmethod></ds:signedinfo>
<ds:signaturevalue>FSSax.....CWtoxx0=</ds:signaturevalue>
<ds:keyinfo id="KeyId-2B6B2C4066C46E9954132989807940317">
<wsse:securitytokenreference wsu:id="STRId-0318" xmlns:wsu="http://...">
<ds:x509data>
<ds:x509issuerserial>
<ds:x509issuername>CN=BANK,OU=BANK,O=BANK,L=SG,ST=SG,C=SG </ds:x509issuername>
<ds:x509serialnumber>1329894156</ds:x509serialnumber>
</ds:x509issuerserial>
</ds:x509data>
</wsse:securitytokenreference>
</ds:keyinfo>
</ds:signature>
<wsse:binarysecuritytoken encodingtype="http://...#Base64Binary" valuetype="http://...#X509v3" wsu:id="CertId-7513" xmlns:wsu="http://...">
l4TLCUURhrJbRjXEIEGirTpg==
</wsse:binarysecuritytoken>
<ds:signature id="Signature-9" xmlns:ds="http://.../xmldsig#">
<ds:signedinfo>
<ds:canonicalizationmethod algorithm="http://.../xml-exc-c14n#">
<ds:signaturemethod algorithm="http://.../xmldsig#rsa-sha1">
<ds:reference uri="#id-10">
<ds:transforms>
<ds:transform algorithm="http://.../xml-exc-c14n#">
</ds:transform></ds:transforms>
<ds:digestmethod algorithm="http://.../xmldsig#sha1">
<ds:digestvalue>EAuvZTemCXTia8fPYXngIZOCPE0=</ds:digestvalue>
</ds:digestmethod></ds:reference>
</ds:signaturemethod></ds:canonicalizationmethod></ds:signedinfo>
<ds:signaturevalue>TmjGBLZJ69kHZNG8=</ds:signaturevalue>
<ds:keyinfo id="KeyId-2B6B2C4066C46E9954132989807937514">
<wsse:securitytokenreference wsu:id="STRId-515" xmlns:wsu="http://...">
<wsse:reference uri="#CertId-7513" valuetype="http://...#X509v3">
</wsse:reference></wsse:securitytokenreference>
</ds:keyinfo>
</ds:signature>
<wsu:timestamp wsu:id="Timestamp-8" xmlns:wsu="http://...">
<wsu:created>2012-02-22T08:07:59.352Z</wsu:created>
<wsu:expires>2012-03-14T04:07:59.352Z</wsu:expires>
</wsu:timestamp>
</wsse:security>
</soapenv:header>
<soapenv:body wsu:id="id-10" xmlns:wsu="http://...">
<ws:checkeligibilityrequest>
<msisdn>1222</msisdn>
<mnoid>232</mnoid>
<servicename>CheckEligibility</servicename>
</ws:checkeligibilityrequest>
</soapenv:body>
</soapenv:envelope>
</pre>
<br />
From the above request generated by SOAP-UI for the WS Security enabled server, we can point out some key observations. First, the Security header inside the SOAP Header contains the following elements:<br />
<ol>
<li>Signature 1</li>
<li>BinarySecurityToken</li>
<li>Signature 2</li>
<li>Timestamp</li>
</ol>
<div>
(highlighted in blue above), while each Signature element contains a <ds:KeyInfo> element.</div>
<div>
In Signature 1 we find the <b><ds:X509Data></b> element (highlighted in blue) inside the <b><wsse:SecurityTokenReference></b> element. The <b><ds:X509Data></b> element contains the <b><ds:X509IssuerSerial></b> element, whose name suggests that this signature uses the IssuerSerial KeyIdentifier. Also, in the Signature 1 element we find three <b><ds:Reference></b> elements, assumed to correspond to the signed parts (from the order of the Signature Parts specified in SOAP-UI) as follows:</div>
<div>
<ol>
<li>TIMESTAMP : <ds:Reference URI="#Timestamp-8"></li>
<li>BODY : <ds:Reference URI="#id-10"></li>
<li>BINARYSECURITYTOKEN: <ds:Reference URI="#CertId-2B6B2C4066C46E9954132989807937513"></li>
</ol>
<div>
In Signature 2, on the other hand, we see just the <b><wsse:Reference></b> element inside the <b><wsse:SecurityTokenReference></b> element. The <b><wsse:Reference></b> element has the ValueType “X509v3”, which suggests that this signature uses the BinarySecurityToken. Even though we didn’t specify any values in the “Parts” section of the first Signature configuration using BinarySecurityToken as the KeyIdentifier, we see one <b><ds:Reference URI="#id-10"></b> element assumed to be a signed part. Comparing the URI of this Reference element with the references in the Signature 1 element, we conclude that it is the signature of the Body element. <span style="color: red;"><b>Hence even if the Signature Parts list is empty, the Body element is signed by default using the specified KeyIdentifier</b></span>.</div>
</div>
<br />
<br />
<h3 id="heading3">
<b>Create CXF Client to Send Request with BST</b></h3>
<div>
<b><br /></b></div>
<div>
A senior developer, Xei Songwen, provided a WS Security implementation which signed just the Body element while sending the request. The implementation consisted of a Dispatcher, a Client, a customized WSS4JOutInterceptor implementation, a PasswordCallback, SigningCheck.properties and the Spring configuration, as described in the class diagram (a programmatic sketch of such a client follows the diagram).</div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-pc4EWrHW5ew/UiyPdbFPMUI/AAAAAAAAB3I/tGKdgkbPi0w/s1600/7.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-pc4EWrHW5ew/UiyPdbFPMUI/AAAAAAAAB3I/tGKdgkbPi0w/s1600/7.png" /></a></div>
<div>
<br /></div>
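<div>
For reference, the sketch below shows how such a CXF client with a WSS4JOutInterceptor could be wired up programmatically. It is only a minimal illustration under assumed names: the CheckEligibilityPort interface, the endpoint address, the key alias and the inline password callback are hypothetical placeholders, not the actual classes or values from the implementation described above.</div>
<br />
<pre class="brush: java">import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import javax.jws.WebMethod;
import javax.jws.WebService;
import javax.security.auth.callback.Callback;
import javax.security.auth.callback.CallbackHandler;
import javax.security.auth.callback.UnsupportedCallbackException;

import org.apache.cxf.endpoint.Client;
import org.apache.cxf.frontend.ClientProxy;
import org.apache.cxf.jaxws.JaxWsProxyFactoryBean;
import org.apache.cxf.ws.security.wss4j.WSS4JOutInterceptor;
import org.apache.ws.security.WSPasswordCallback;
import org.apache.ws.security.handler.WSHandlerConstants;

// Hypothetical service interface standing in for the SEI generated from the real WSDL.
@WebService
interface CheckEligibilityPort {
    @WebMethod
    String checkEligibility(String msisdn);
}

public class EligibilityClientFactory {

    public static CheckEligibilityPort createClient() {
        JaxWsProxyFactoryBean factory = new JaxWsProxyFactoryBean();
        factory.setServiceClass(CheckEligibilityPort.class);
        factory.setAddress("https://example.com/eligibility"); // assumed endpoint address
        CheckEligibilityPort port = (CheckEligibilityPort) factory.create();

        // Supplies the keystore/key password to WSS4J when it signs the request.
        CallbackHandler passwordCallback = new CallbackHandler() {
            public void handle(Callback[] callbacks) throws IOException, UnsupportedCallbackException {
                ((WSPasswordCallback) callbacks[0]).setPassword("changeit"); // assumed password
            }
        };

        // WSS4J properties mirroring the Spring configuration shown later in this post.
        Map<String, Object> props = new HashMap<String, Object>();
        props.put(WSHandlerConstants.ACTION, "Timestamp Signature");
        props.put(WSHandlerConstants.USER, "keyalias");
        props.put(WSHandlerConstants.SIG_PROP_FILE, "SignatureSigning.properties");
        props.put(WSHandlerConstants.PW_CALLBACK_REF, passwordCallback);
        props.put(WSHandlerConstants.SIG_KEY_ID, "IssuerSerial");

        // Attach the out interceptor so every outgoing request is timestamped and signed.
        Client client = ClientProxy.getClient(port);
        client.getOutInterceptors().add(new WSS4JOutInterceptor(props));
        return port;
    }
}
</pre>
<br />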
<div>
</div>
<div>
The following jar issues were encountered and resolved while initially testing the application:<br />
<br />
<b>ERROR:</b><br />
Caused by: <b><span style="color: red;">java.lang.NoClassDefFoundError: org.apache.axiom.soap.impl.dom.soap11.SOAP11Factory</span></b><br />
at org.apache.axis2.saaj.SOAPPartImpl.<init>(SOAPPartImpl.java:209)<br />
at org.apache.axis2.saaj.SOAPPartImpl.<init>(SOAPPartImpl.java:246)<br />
<br />
<b>ADDED: saaj-impl-1.3.2.jar</b><br />
<b>REMOVED: axis2-saaj-1.4.jar</b></div>
<div>
<br /></div>
<div>
<div>
<b>ERROR:</b></div>
<div>
Caused by: <b><span style="color: red;">java.lang.NoClassDefFoundError: com.sun.org.apache.xerces.internal.dom.DocumentImpl</span></b></div>
<div>
at java.lang.ClassLoader.defineClassImpl(Native Method)</div>
<div>
at java.lang.ClassLoader.defineClass(ClassLoader.java:223)</div>
</div>
<div>
<br /></div>
<div>
<div>
<b>ADDED: xercesImpl-sun-version.jar</b></div>
</div>
<div>
<br /></div>
<div>
<div>
<b>ERROR:</b></div>
<div>
Caused by: <b><span style="color: red;">java.lang.IncompatibleClassChangeError</span></b></div>
<div>
at org.apache.xalan.transformer.TransformerIdentityImpl.createResultContentHandler(TransformerIdentityImpl.java:207)</div>
<div>
at org.apache.xalan.transformer.TransformerIdentityImpl.transform(TransformerIdentityImpl.java:330)</div>
<div>
at com.sun.xml.messaging.saaj.util.transform.EfficientStreamingTransformer.transform(EfficientStreamingTransformer.java:423)</div>
<div>
at com.sun.xml.messaging.saaj.soap.EnvelopeFactory.createEnvelope(EnvelopeFactory.java:136)</div>
<div>
at com.sun.xml.messaging.saaj.soap.ver1_1.SOAPPart1_1Impl.createEnvelopeFromSource(SOAPPart1_1Impl.java:102)</div>
<div>
at com.sun.xml.messaging.saaj.soap.SOAPPartImpl.getEnvelope(SOAPPartImpl.java:156)</div>
<div>
at com.sun.xml.messaging.saaj.soap.MessageImpl.getSOAPBody(MessageImpl.java:1287)</div>
<div>
at</div>
</div>
<div>
<br /></div>
<div>
<b>ADDED: saaj-api-1.3.2.jar</b></div>
<div>
<br /></div>
<div>
The spring configuration for the WSS4JOutInterceptor is as follows:</div>
<br />
<pre class="brush: xml"> <bean class="com.bank.ebusiness….wssecurity.MSMWSS4JOutInterceptor" id="wss4jOutConfiguration">
<property name="encrptionKeyStoreName" value="keyStoreName"></property>
<property name="properties">
<map>
<entry key="action" value="Timestamp Signature">
<entry key="timeToLive" value="1800000">
<entry key="user" value="keyalias">
<entry key="signaturePropFile" value="SignatureSigning.properties">
<entry>
<key>
<value>passwordCallbackRef</value>
</key>
<ref bean="passwordCallback">
</ref></entry>
<entry key="useSingleCertificate" value="false">
<entry key="signatureKeyIdentifier" value="IssuerSerial"> <!—“DirectReference” -->
<entry key="signatureAlgorithm" value="http://www.w3.org/2000/09/xmldsig#rsa-sha1">
<entry key="signatureDigestAlgorithm" value="http://www.w3.org/2000/09/xmldsig#sha1">
<entry key="signatureCanonicalizationAlgorithm" value="http://www.w3.org/2001/10/xml-exc-c14n#">
<entry key="signatureParts" value="{Element}{http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd}Timestamp;
{Element}{http://schemas.xmlsoap.org/soap/envelope/}Body; {Element}{http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd}BinarySecurityToken;">
</entry></entry></entry></entry></entry></entry></entry></entry></entry></entry></map>
</property>
</bean>
</pre>
<br />
<br />
When we tried to use the KeyIdentifier as “<b>DirectReference</b>” or “<b>IssuerSerial</b>” in the single WSS4JOutInterceptor and specified the <b>BinarySecurityToken</b> element in the “signatureParts” as shown above, it gave the following error:<br />
<br />
Caused by: org.apache.ws.security.WSSecurityException: <b><span style="color: red;">General security error (WSEncryptBody/WSSignEnvelope: Element to encrypt/sign not found: http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd, BinarySecurityToken)</span></b><br />
at org.apache.ws.security.message.WSSecSignature.addReferencesToSign(WSSecSignature.java:588)<br />
at org.apache.ws.security.message.WSSecSignature.build(WSSecSignature.java:769)<br />
at org.apache.ws.security.action.SignatureAction.execute(SignatureAction.java:57)<br />
<br />
In order to tackle the problem of the BinarySecurityToken element missing from the Security header before the interceptor tries to sign the BST (BinarySecurityToken) element, the BST element is added before the request is passed to the invoke() method. The code added is as follows:<br />
<br />
<pre class="brush: java">final String XMLNS_WSU = "http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd";
final String XSD_WSSE = "http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-secext-1.0.xsd";
final SOAPFactory sf = SOAPFactory.newInstance();
final SOAPElement securityElement = sf.createElement("Security", "wsse", XSD_WSSE);
final SOAPElement authElement = sf.createElement("BinarySecurityToken", "wsse", XSD_WSSE);
authElement.setAttribute("EncodingType", " http://.....1.0#Base64Binary");
authElement.setAttribute("ValueType", "http://.....1.0#X509v3");
authElement.setAttribute("wsu:Id", "CertId-CA440EE13ADE87BAE5133044746778913");
authElement.addAttribute(new QName("xmlns:wsu"), XMLNS_WSU);
authElement.addTextNode("SMDFhdffIUSDFJL9090ddf213asdsKFHkfdfgjfs234gbhfg56icxdd24rgd"));
securityElement.addChildElement(authElement);
soapRequest.getSOAPHeader().addChildElement(securityHeader);
</pre>
<br />
<br />
But instead of detecting the BST element and trying to sign it, the WSS4JOutInterceptor throws the following exception:<br />
<br />
org.apache.xmlbeans.XmlException: <b><span style="color: red;">error: Attribute "Id" bound to namespace "http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd" was already specified for element "wsse:BinarySecurityToken"</span></b>.<br />
<br />
Considering the suggestion given on the <a href="http://cxf.547215.n5.nabble.com/signing-the-Binary-Security-Token-BST-td551879.html">CXF Forum</a>, two interceptors extending WSS4JOutInterceptor were configured. The first one was configured with the KeyIdentifier as “<b>DirectReference</b>”, while the second one was configured with “<b>IssuerSerial</b>”. Now the BinarySecurityToken (BST) was generated but nothing was signed. Also, the IssuerSerial signature along with the Timestamp element was absent from the Security header. On reversing the KeyIdentifier values across the two interceptors, the BinarySecurityToken vanished but all the previously missing elements reappeared. This led to a suspicion that only the first interceptor was being called while the second interceptor remained unexecuted.<br />
After running the debugger numerous times, the doubt was confirmed. It was mentioned by somebody on the <a href="http://cxf.547215.n5.nabble.com/Problem-having-multiple-interceptors-of-the-same-type-on-the-interceptor-chain-td554460.html">forum</a> that the instance names of the two interceptors, along with their class names, should be different in order for both of them to be executed. But success still remained far off, and one doubt still lingered: both interceptors derive all their functionality from WSS4JOutInterceptor, with only the class name being different.<br />
After looking at the source code of org.apache.cxf.ws.security.wss4j.WSS4JOutInterceptor below, it seems possible that the <b>getId()</b> method of the <b>WSS4JOutInterceptorInternal</b> class is called before the <b>handleMessage()</b> method of the inner class is invoked. This handleMessage() method (line 257) in turn calls the doSenderAction() method defined in the org.apache.ws.security.handler.WSHandler class.<br />
<br />
<br />
<pre class="brush: java">public class WSS4JOutInterceptor extends AbstractWSS4JInterceptor {
...................................
private WSS4JOutInterceptorInternal ending;
public WSS4JOutInterceptor() {
super();
setPhase(Phase.PRE_PROTOCOL);
getAfter().add(SAAJOutInterceptor.class.getName());
ending = createEndingInterceptor();
}
...................................
final class WSS4JOutInterceptorInternal implements PhaseInterceptor<SoapMessage> {
...................................
public void handleMessage(SoapMessage mc) throws Fault { ………….
doSenderAction(doAction, doc, reqData, actions, somebooleanvalue);
}
...................................
public String getId() {
return WSS4JOutInterceptorInternal.class.getName();
}
...................................
}
}
</pre>
<br />
<br />
If the getId() method of the WSS4JOutInterceptorInternal class is altered to return a different class name rather than the actual one, then the following exception is thrown:<br />
<br />
SystemErr R javax.xml.ws.soap.SOAPFaultException: <b><span style="color: red;">Unknown exception, internal system processing error</span></b>.<br />
SystemErr R at org.apache.cxf.jaxws.DispatchImpl.mapException(DispatchImpl.java:235)<br />
SystemErr R at org.apache.cxf.jaxws.DispatchImpl.invoke(DispatchImpl.java:264)<br />
SystemErr R at org.apache.cxf.jaxws.DispatchImpl.invoke(DispatchImpl.java:195)<br />
<br />
When a new interceptor (MSMBSTWSS4JOutInterceptor), imitating the code copied from WSS4JOutInterceptor, is added along with the old interceptor (MSMWSS4JOutInterceptor) extending WSS4JOutInterceptor, then both interceptors are invoked one after another. Hence the first interceptor creates the BinarySecurityToken while the second interceptor (extending WSS4JOutInterceptor) signs all the elements, including the BinarySecurityToken created before.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://1.bp.blogspot.com/-6iULLzjX_bo/UiyVQkMBY6I/AAAAAAAAB3Y/-ujlrP-Q4JQ/s1600/8.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://1.bp.blogspot.com/-6iULLzjX_bo/UiyVQkMBY6I/AAAAAAAAB3Y/-ujlrP-Q4JQ/s1600/8.png" /></a></div>
The same issue of signing the BinarySecurityToken can also be resolved by overriding the Apache CXF and WSS4J classes. We first override the doSenderAction() method of the WSHandler class in the BSTWSS4JOutInterceptor implementation.<br />
<br />
<pre class="brush: java">public class BSTWSS4JOutInterceptor extends WSS4JOutInterceptor {
private static final String msmActionClass = "org.example.BSTSignatureAction";
protected void doSenderAction(int doAction, Document doc, RequestData reqData,
Vector actions, boolean isRequest){
boolean mu = decodeMustUnderstand(reqData);
............
............
for (int i = 0; i < actions.size(); i++) {
int actionToDo = ((Integer) actions.get(i)).intValue();
............
switch (actionToDo) {
case WSConstants.UT:
case WSConstants.ENCR:
case WSConstants.SIGN:
case WSConstants.ST_SIGNED:
case WSConstants.ST_UNSIGNED:
case WSConstants.TS:
case WSConstants.UT_SIGN:
if (isBSTEnabled && actionToDo == WSConstants.SIGN) {
Action doit = null;
try {
doit = (Action) Loader.loadClass(msmActionClass).newInstance();
} catch (Throwable t) {
if (log.isDebugEnabled()) {
log.debug(t.getMessage(), t);
}
throw new WSSecurityException(WSSecurityException.FAILURE,
"unableToLoadClass", new Object[] { msmActionClass }, t);
}
if(doit != null) {
doit.execute(this, actionToDo, doc, reqData);
}
} else {
wssConfig.getAction(actionToDo).execute(this, actionToDo, doc, reqData);
}
break;
case WSConstants.NO_SERIALIZE:
reqData.setNoSerialization(true);
break;
default:
Action doit = null;
............
............
}
}
</pre>
<br />
<br />
Now in the overridden BSTSignatureAction class we override the implementation of the execute() method in order to replace the WSSecSignature class with the BSTWSSecSignature class as follows:
<br />
<br />
<pre class="brush: java">public class BSTSignatureAction implements Action {
public void execute(WSHandler handler, int actionToDo, Document doc,
RequestData reqData){
String password = handler.getPassword(...).getPassword();
BSTWSSecSignature wsSign = new BSTWSSecSignature();
............
............
............
try {
wsSign.build(doc, reqData.getSigCrypto(), reqData.getSecHeader());
reqData.getSignatureValues().add(wsSign.getSignatureValue());
}
catch (WSSecurityException e) { ... }
}
}
</pre>
<br />
<br />
At last, we implement the BSTWSSecSignature class, which customizes the build() method of the standard WSSecSignature implementation as follows:
<br />
<pre class="brush: java">public class BSTWSSecSignature extends WSSecBase {
............
public Document build(Document doc, Crypto cr, WSSecHeader secHeader)
throws WSSecurityException{
............
// call addBST() method, a duplicate of prepare() method where keyIdentifierType is
// considered only as BST_DIRECT_REFERENCE in its switch case.
addBST(doc, cr, secHeader);
// create an empty vector for signature parts
Vector<WSEncryptionPart> bstparts = new Vector<WSEncryptionPart>();
if(parts != null) {
for (WSEncryptionPart part : (Vector<WSEncryptionPart>) parts)
{
if(part.getName().equalsIgnoreCase("Body")) {
// add the BODY element as by default if signature parts is empty
// it signs the BODY element.
bstparts.add(part);
}
}
}
// add the empty signature to the Security Header
addReferencesToSign(bstparts, secHeader);
// prepend the signature at the top of the Security Header
prependToHeader(secHeader);
// compute the digest values for the BODY element signature using the
// BinarySecurityToken
computeSignature();
if (bstToken != null) {
// prepend the BinarySecurityToken element at the top of the signature in
// Security Header
prependBSTElementToHeader(secHeader);
}
// continue with the normal process of signing and adding IssuerSerial signatures
prepare(doc, cr, secHeader);
SOAPConstants soapConstants =
WSSecurityUtil.getSOAPConstants(doc.getDocumentElement());
if (parts == null) {
parts = new Vector();
WSEncryptionPart encP =
new WSEncryptionPart(
soapConstants.getBodyQName().getLocalPart(),
soapConstants.getEnvelopeURI(),
"Content"
);
parts.add(encP);
}
addReferencesToSign(parts, secHeader);
prependToHeader(secHeader);
// Eliminate call to prependBSTElementToHeader() as it is called beforehand
computeSignature();
return doc;
}
............
}
</pre>
<br />
<br />
<h3 id="heading4">
<b>Receiving Response with Security Confirmation:</b></h3>
<div>
<b><br /></b></div>
Initially “enableSignatureConfirmation” was set to “true” only in the wss4jInConfiguration.<br />
<br />
<pre class="brush: xml"> <bean class="com.bank.ebusiness.mobile.nfc.wssecurity.MSMWSS4JInInterceptor" id="wss4jInConfiguration">
......................
<property name="properties">
<map>
<entry key="action" value="Timestamp">
......................
<entry key="enableSignatureConfirmation" value="true">
</entry></entry></map>
</property>
......................
</bean>
</pre>
<br />
<br />
This caused the following error to pop up:<br />
<br />
0000001e SystemErr R Caused by: org.apache.ws.security.WSSecurityException:<br />
WSHandler: <b><span style="color: red;">Check Signature confirmation: got SC element, but no matching SV</span></b><br />
at org.apache.ws.security.handler.WSHandler.checkSignatureConfirmation(WSHandler.java:392)<br />
at org.apache.cxf.ws.security.wss4j.WSS4JInInterceptor.handleMessage(WSS4JInInterceptor.java:224)<br />
<br />
After repeated combinations and retries it became clear that “<b>enableSignatureConfirmation</b>” has to be set to “<b>true</b>” not only for the WSS4JInInterceptor but also for both WSS4JOutInterceptors. The predicted reason is that two “SecurityConfirmation” elements are added in the response from the Weblogic server. At the receiver end, when we enable the “<b>enableSignatureConfirmation</b>” entry in the WSS4JInInterceptor, it checks the signature Vector for two matching entries in order to verify the corresponding incoming elements. As both WSS4JOutInterceptors did not enable the “<b>enableSignatureConfirmation</b>” entry, there are no entries in the signature Vector to verify, hence the above exception.<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">......................</span><br />
<span style="font-family: Courier New, Courier, monospace;"><wsse11:SecurityConfirmation>sdfsa9er8sd9f8sd9fgds</wsse11:SecurityConfirmation></span><br />
<span style="font-family: Courier New, Courier, monospace;"><wsse11:SecurityConfirmation>sdfsa9er8sd9f8sd9fgds</wsse11:SecurityConfirmation></span><br />
<span style="font-family: Courier New, Courier, monospace;">......................</span></div>
<br />
<pre class="brush: xml"> <bean class="com.bank.ebusiness.wssecurity.MSMWSS4JOutInterceptor" id="wss4jOutConfiguration">
....................
<property name="properties">
<map>
<entry key="action" value="Timestamp">
....................
<entry key="enableSignatureConfirmation" value="true">
</entry></entry></map>
</property>
....................
</bean>
</pre>
<br />
<br />
Further, when the contents of even one of the <wsse11:SecurityConfirmation> elements were altered, the same error as below was thrown again, since the contents of the SC elements no longer match the entries stored in the signature Vector.<br />
<br />
<span style="color: red;"><b>org.apache.cxf.binding.soap.SoapFault: WSHandler: Check Signature confirmation: got a SC element, but no stored SV</b></span>.<br />
<br />
Going further, when the value of the “action” entry was “Timestamp Signature” it threw the following exception:<br />
<br />
<b><span style="color: red;">Security processing failed (actions mismatch)</span></b><br />
Caused by: org.apache.ws.security.WSSecurityException: An error was discovered processing the <wsse:Security> header<br />
at org.apache.cxf.ws.security.wss4j.WSS4JInInterceptor.handleMessage(WSS4JInInterceptor.java:290)<br />
<br />
After debugging the source, it was found that the exception originated from line number 290 of the class org.apache.cxf.ws.security.wss4j.WSS4JInInterceptor. The relevant piece of code is as follows:<br />
<br />
<pre class="brush: java">// now check the security actions: do they match, in any order?
if (!ignoreActions && !checkReceiverResultsAnyOrder(wsResult, actions)) {
LOG.warning("Security processing failed (actions mismatch)");
throw new WSSecurityException(WSSecurityException.INVALID_SECURITY);
}
</pre>
<br />
The call to the checkReceiverResultsAnyOrder() method returned false, causing it to throw the WSSecurityException. After a deeper look into the source code of the checkReceiverResultsAnyOrder() method in the org.apache.ws.security.handler.WSHandler class, it was found that it compares the elements in the response with the actions specified in the configuration entry of the WSS4JInInterceptor. It checks whether an action is specified corresponding to each element present in the <SecurityHeader> of the response. But as seen from the check for WSConstants.SC and WSConstants.BST in the code below, the <SecurityConfirmation> and <BinarySecurityToken> elements in the response do not need a corresponding action name in the configuration. This seems logical, as the possible values for the “action” entry in the configuration are { NoSecurity , UsernameToken , UsernameTokenNoPassword , SAMLTokenUnsigned , SAMLTokenSigned , Signature , Encrypt , Timestamp , UsernameTokenSignature }.
<br />
<br />
<pre class="brush: java">protected boolean checkReceiverResultsAnyOrder(Vector wsResult, Vector actions);
java.util.List recordedActions = new Vector(actions.size());
for (int i = 0; i < actions.size(); i++) {
Integer action = (Integer)actions.get(i);
recordedActions.add(action);
}
for (int i = 0; i < wsResult.size(); i++) {
final Integer actInt = (Integer) ((WSSecurityEngineResult) wsResult
.get(i)).get(WSSecurityEngineResult.TAG_ACTION);
int act = actInt.intValue();
if (act == WSConstants.SC || act == WSConstants.BST) {
continue;
}
if (!recordedActions.remove(actInt)) {
return false;
}
}
if (!recordedActions.isEmpty()) {
return false;
}
return true;
}
</pre>
<br />
<br />
Now, looking at the response below from the WS Security enabled Weblogic server, the values for the “action” entry should correspond to the <Timestamp>, <SecurityConfirmation> and <SecurityConfirmation> elements. But given the above information, there is no action corresponding to the SecurityConfirmation element (it is handled through the “enableSignatureConfirmation” flag instead), and hence we are left with only the “Timestamp” action in the WS Security configuration entry, which resolves the exception.<br />
<pre class="brush: xml"><s:envelope xmlns:s="http://schemas.xmlsoap.org/soap/envelope/">
<s:header>
<wsse:security s:mustunderstand="”1”" xmlns:wsse="”http://...”">
<wsse11:securityconfirmation value="”JYhr32…”" wsu:id="”sigconf_iXi…”" xmlns:wsu="”http://...”">
<wsse11:securityconfirmation value="”MK2krG2…”" wsu:id="”sigconf_iXi…”" xmlns:wsu="”http://...”">
<wsu:timestamp xmlns:wsu="”http://...”">
<wsu:created>2012-03-02T19:35:55Z</wsu:created>
<wsu:expires>2012-03-02T19:36:55Z </wsu:expires>
</wsu:timestamp>
</wsse11:securityconfirmation></wsse11:securityconfirmation></wsse:security>
</s:header>
<s:body>
<ns0:checkeligibilityresponse xmlns:ns0="http://...">
<resultcode>100</resultcode>
<resultmessage>SUCCESS</resultmessage>
</ns0:checkeligibilityresponse>
</s:body>
</s:envelope>
</pre>
<br />
<br />
When a request was finally sent to the Oracle Weblogic server, the following error was encountered:<br />
<div>
<b><span style="color: red;">WSDLException (at /con:soapui-project): faultCode=INVALID_WSDL: Expected element '{http://schemas.xmlsoap.org/wsdl/}definitions' when trying to load</span></b>.
<br />
<br />
<div>
On carefully comparing the request sent by the Client with the one sent by SOAP-UI, it was found that the request tag in the Client request was "<b><CheckEligibility></b>" while the request tag in the SOAP-UI request was "<b><CheckEligibilityRequest></b>".
</div>
</div>
Pranav Patilhttp://www.blogger.com/profile/10125896717744035325noreply@blogger.com0tag:blogger.com,1999:blog-7610061084095478663.post-63742209072149773782013-08-18T09:30:00.000-07:002014-05-23T11:17:29.572-07:00Windows Commands Over the recent years many new commands have been introduced in Windows operating systems besides the original DOS commands. These newly added commands enable us to carry out operations which are quite helpful and sophisticated. The full documentation of all the commands is available on Microsoft's <a href="http://msdn.microsoft.com/">msdn website</a>.<br />
<br />
<hr />
1) <a href="http://technet.microsoft.com/en-us/library/cc771254.aspx"><b>XCOPY</b></a>:<br />
The following MS-DOS command copies files and directories from source to destination: "/E" copies directories and subdirectories including empty ones, "/C" continues even if there is an error, "/H" includes hidden/system files, "/R" overwrites read-only files in the destination, "/K" retains file attributes, "/O" copies ownership and access control list information, and "/Y" suppresses prompting when overwriting files.<br />
<br />
<span style="color: purple;"> <b>xcopy source destination /E /C /H /R /K /O /Y</b></span><br />
<div>
<br /></div>
<div>
The following command copies files and directories from source to destination: "/C" continues even if there is an error, "/D" copies only files whose source time is newer than the destination time, "/S" copies files and subdirectories recursively except empty directories, and "/H" includes hidden/system files.</div>
<div>
<br /></div>
<div>
<b> <span style="color: purple;">xcopy source destination /C /D /S /H</span></b></div>
<div>
<br /></div>
<hr />
<div>
2) <a href="http://technet.microsoft.com/en-us/library/cc733145.aspx"><b>ROBOCOPY</b></a>: </div>
<div>
Robocopy is a very powerful external command to copy files in Windows. The following command mirrors the entire directory tree, including empty directories, from the given source location to the destination (note that "/MIR" also removes files from the destination that no longer exist in the source):</div>
<div>
<br /></div>
<div>
<span style="color: purple;"><b>robocopy </b><b>source destination </b><b> /MIR</b></span></div>
<div>
<br /></div>
<hr />
<div>
3) <a href="http://technet.microsoft.com/en-us/library/bb491009.aspx"><b>TASKKILL</b></a>:</div>
<div>
It is used to kill one or more tasks/processes using the process id or process name. The following command forcefully terminates a process by its image name.</div>
<div>
<b><span style="color: purple;"> taskkill /im processname /f</span></b></div>
<div>
<br /></div>
<div>
The following command, on the other hand, forcefully terminates all the processes running under the user name "john".</div>
<div>
<b><span style="color: purple;"> taskkill /F /FI "USERNAME eq john"</span></b></div>
<div>
<br /></div>
<hr />
<div>
4) <a href="http://technet.microsoft.com/en-us/library/bb490947.aspx"><b>NETSTAT</b></a>: </div>
<div>
It displays active TCP connections, ports on which the computer is listening, Ethernet statistics, the IP routing table, and IPv4 statistics. The following command displays the executable involved in creating each connection or listening port using the "-b" option.</div>
<div>
<br /></div>
<div>
<b><span style="color: purple;"> netstat -b</span></b></div>
<div>
<br /></div>
<hr />
<div>
5) <a href="http://technet.microsoft.com/en-us/library/bb491003.aspx"><b>SHUTDOWN</b></a>:</div>
<div>
The <a href="http://support.microsoft.com/kb/317371">remote shutdown tool</a> enables to shutdown the local or remote computer within the network.<br />
<br /></div>
<div>
The following command shuts down the computer by closing all the applications after the time delay specified with the "/t" option, displaying the given message.</div>
<div>
<div>
<span style="color: purple;"><b> shutdown \\computername /l /a /r /t:xx "msg" /y /c</b></span></div>
<div>
<b style="color: purple;"> shutdown /l /t:120 "The computer is shutting down" /y /c</b></div>
<div>
<br /></div>
<div>
The following command reboots ("/r") the remote machine specified with the "/m" option. It forces all applications to close after a one-minute delay ("/t") with the reason "Application: Maintenance (Planned)" ("/d p:4:1") and the comment ("/c") "Reconfiguring Applications":</div>
<div>
<br /></div>
<div>
<b><span style="color: purple;"> shutdown /r /m \\RemoteMachine /t 60 /c "Reconfiguring Applications" /f /d p:4:1</span></b></div>
<div>
<br /></div>
</div>
<hr />
<div>
6) <a href="http://msdn.microsoft.com/en-us/library/windows/desktop/bb736357"><b>SCHTASKS</b></a>:</div>
<div>
Schtasks command is used to query or execute the tasks inside the Task Scheduler.</div>
<div>
<br />
Following command <a href="http://jane.dallaway.com/nts-get-a-list-of-windows-scheduled-tasks">lists all the tasks</a> present on the remote machine.</div>
<div>
<div>
<b><span style="color: purple;"> schtasks /query /s \\RemoteMachine</span></b><br />
<br />
Following command lists all the tasks matching the name "MyTask" present on the remote machine.</div>
<div>
<b><span style="color: purple;"> schtasks /query /s \\RemoteMachine | findstr "MyTask"</span></b><br />
<br />
<div style="-webkit-text-stroke-width: 0px; color: black; font-family: 'Times New Roman'; font-size: medium; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px;">
<div style="margin: 0px;">
Following command runs the specified task name with the full path present on the specified remote machine.</div>
</div>
</div>
<div>
<b><span style="color: purple;"> schtasks /run /s \\</span></b><b><span style="color: purple;">RemoteMachine</span></b><b><span style="color: purple;"> /tn "\Microsoft\Windows\Tasks\MyTask"</span></b><br />
<div style="-webkit-text-stroke-width: 0px; color: black; font-family: 'Times New Roman'; font-size: medium; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px;">
</div>
<br />
Similarly following command ends the specified task on the remote machine.<br />
<b><span style="color: purple;">schtasks /end /s </span></b><b><span style="color: purple;">\\</span></b><b><span style="color: purple;">RemoteMachine</span></b><b><span style="color: purple;"> /tn "\Microsoft\Windows\Tasks\MyTask"</span></b><br />
<br />
<div style="-webkit-text-stroke-width: 0px; color: black; font-family: 'Times New Roman'; font-size: medium; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px;">
<div style="margin: 0px;">
The following command queries the task matching the name "<b><span style="color: purple;">\Microsoft\Windows\Tasks\MyTask</span></b>" present on the remote machine. It displays the advanced properties of the task in a list format.</div>
</div>
</div>
<div>
<b><span style="color: purple;"> schtasks /query /s \\</span></b><b><span style="color: purple;">RemoteMachine</span></b><b><span style="color: purple;"> /tn "\Microsoft\Windows\Tasks\MyTask" /fo LIST /v</span></b></div>
</div>
<br />
Also we can create a new task in the task scheduler using the following command:<br />
<b><span style="color: purple;"> schtasks /create /tn task_name</span></b><b><span style="color: purple;"> </span></b><b><span style="color: purple;">/tr "...\path\task.bat"</span></b><b><span style="color: purple;"> </span></b><b><span style="color: purple;">/sc daily /st 10:00:00</span></b><b><span style="color: purple;"> </span></b><b><span style="color: purple;">/s \\ComputerName</span></b><b><span style="color: purple;"> </span></b><b><span style="color: purple;">/u username /p password</span></b><br />
<div>
<br /></div>
<hr />
<div>
7) <a href="http://technet.microsoft.com/en-us/library/bb490995.aspx"><b>SC</b></a>:<br />
The SC command is used to communicate with the Service Controller to manage Windows services, which run as background processes. It helps to create, update and delete Windows services using various options. <b><span style="color: red;">Note that all the sc command options require a space between the equals sign and the value</span></b>.<br />
<br />
The following command creates a new Windows service with the specified name, running the executable specified with the <b>binpath</b> option.<br />
<b><span style="color: purple;"> </span></b><b><span style="color: purple;">sc create "servicename" binpath= "C:\Windows\System32\sample.exe" DisplayName= "Sample Service" start= auto</span></b><br />
<br />
The following command deletes the Windows service with the specified name.<br />
<b><span style="color: purple;"> </span></b><b><span style="color: purple;">sc delete servicename</span></b><br />
<br />
The below command lists all the Windows services on the command line.<br />
<b><span style="color: purple;"> </span></b><span style="color: purple;"><b>sc queryex type= service state= all | find "_NAME"</b></span><br />
<br />
Alternatively, the following <a href="http://technet.microsoft.com/en-us/library/cc736564.aspx">service commands</a> can be used to start/stop Windows services:<br />
Start a service: <b>net start servicename</b><br />
Stop a service: <b>net stop servicename</b><br />
Pause a service: <b>net pause servicename</b><br />
Resume a service: <b>net continue servicename</b><br />
<div>
<br /></div>
<hr />
8) <b><a href="http://msdn.microsoft.com/en-us/library/windows/desktop/aa394531.aspx">WMIC</a></b>:<br />
The WMIC command provides a command line interface to Windows Management Instrumentation (<a href="http://msdn.microsoft.com/en-us/library/windows/desktop/aa394582.aspx">WMI</a>). WMI is the infrastructure to handle data and operations of the Windows operating system and enables carrying out administrative tasks using WMI scripts.<br />
<br />
The following command gives the hardware architecture details of the CPU of the current machine:<br />
<b><span style="color: purple;"> wmic cpu get caption</span></b><br />
<br />
The below command provides information regarding the current Windows OS architecture, primarily whether it is a 32 or 64 bit system.<br />
<b><span style="color: purple;"> </span></b><span style="color: purple;"><b>wmic OS get OSArchitecture</b></span><br />
<br /></div>
<hr />
9) <a href="http://technet.microsoft.com/en-US/sysinternals/bb897553.aspx"><b>PSEXEC</b></a>:<br />
This is a utility tool which allows us to execute commands on remote machines, redirecting the remote console output to our local system. There are many other <a href="http://windowsitpro.com/systems-management/psexec">advanced usages</a> of the tool.<br />
<br />
<b><span style="color: purple;"> </span></b><b><span style="color: purple;">psexec \\ComputerName cmd</span></b><br />
<hr />
8) <b><a href="http://technet.microsoft.com/en-us/library/gg651155.aspx">NET USE</a></b>:<br />
The NET USE command enables to connect or disconnect a computer computer from a shared resource, or to display information about computer connections. The below command assigns the disk drive Z: to the shared directory on \\zdshare<br />
<br />
<b><span style="color: purple;"> </span></b><span style="color: purple;"><b>net use Z: \\zdshare\IT\deploy</b></span><br />
<br />
The below command disconnects the Z drive from the \\zdshare directory.<br />
<br />
<b><span style="color: purple;">net use Z: /delete</span></b><br />
<br />
<u>Help Option</u>: Use the "<b>/?</b>" option to display the help for the command<br />
<br />
<b><span style="color: purple;"> net use /?</span></b><br />
<hr />
8) <b><a href="http://technet.microsoft.com/en-us/library/bb490907.aspx">FINDSTR</a></b>:<br />
The FINDSTR command is used to search for patterns of text in files using regular expressions. Find the specified text "APC" with /c as a literal search string with non case-sensitive search. Also repeat the search for zero or more occurrences of previous character or class.<br />
<br />
<b><span style="color: purple;"> findstr /i /c:"APC" *</span></b><br />
<br />
<hr />
<br />Pranav Patilhttp://www.blogger.com/profile/10125896717744035325noreply@blogger.com0tag:blogger.com,1999:blog-7610061084095478663.post-42674466376941075312013-08-17T19:58:00.001-07:002019-12-23T11:06:01.831-08:00Test Driven DevelopmentTest Driven Development is famous software development process which relies on the developer to write an automated test case before writing any piece of functional code. It emphasizes series of unit tests and re-factoring to provide a simple design.<br />
<br />
Everyone is accustomed to the general practice of software development which looks as below:<br />
<ul>
<li><b>Design</b>: Figure out how you're going to accomplish all the functionality.</li>
<li><b>Code</b>: Type in the code that implements the design.</li>
<li><b>Test</b>: Run the code a couple of times to see if it works, then hand it over to QA.</li>
</ul>
<br />
On the other hand Test Driven Development modifies this approach as below:<br />
<ul>
<li><b>Test</b>: Figure out what the next chunk of function is all about.</li>
<li><b>Code</b>: Make it do that.</li>
<li><b>Design</b>: Make it do that excellently.</li>
</ul>
<br />
As described above TDD completely inverts the accepted ordering of 'design-code-test'. So, from one view, TDD just puts the design after the test and the code. Refactoring is considered as pure design in TDD.<br />
<br />
In the TDD world we are not allowed to figure out a complete or excellent design to get our test (and all existing tests) to pass before we start coding it, although there is sometimes a <a href="http://www.youtube.com/watch?v=KtHQGs3zFAM">debate</a> on whether there should be some kind of initial design phase where interfaces (along with method signatures) for the future classes need to be defined. Further, it is not allowed to reduce or skip the "refactor" step during TDD development. Hence after each iteration of a passing test, the code should be refactored, which indirectly contributes to the design. Also, once a test is written, TDD allows us to do any of the following during implementation to pass the test:<br />
<ol>
<li>Reuse some existing code</li>
<li>Introduce meaningful new class(es) and method(s)</li>
<li>Copy existing method(s) and change the copies</li>
</ol>
TDD helps in certain aspects of integration, as the entire process is divided into a series of small steps. The more often we check the code into the version control system, and the smaller our changes are, the lower the likelihood of getting any 'merge conflicts' with others. Also, every commit is a guaranteed fallback position, a piton in the rock that we can easily go back to if we slip and fall.<br />
<br />
Below is the Red-Green-Refactor Rule for Test Driven Development:<br />
<br />
<table><tbody>
<tr><td><span style="color: red;"><b>RED</b></span></td><td>When you write the test, you are <i class="italic">designing</i> the behavior you expect the code-under-test to perform.</td></tr>
<tr><td><span style="color: green;"><b>GREEN</b></span></td><td>When you write the code to pass the test, you are <i class="italic">designing</i> the internal implementation of that behavior.</td></tr>
<tr><td><b><span style="color: #3d85c6;">REFACTOR</span></b></td><td>Your micro-focus on getting to green probably 'un-designed' the code. When you refactor you are re-<i class="italic">designing</i>.</td></tr>
</tbody></table>
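<br />
<div>
As a minimal sketch of one pass through this cycle (the Money class and its test below are hypothetical examples, not taken from any particular codebase), the failing test is written first, the simplest code that makes it pass comes next, and the refactoring step then improves the design without changing the behaviour the test describes:</div>
<br />
<pre class="brush: java">import static org.junit.Assert.assertEquals;
import org.junit.Test;

// RED: this test is written first and fails to compile because Money does not exist yet.
public class MoneyTest {
    @Test
    public void addingTwoAmountsReturnsTheirSum() {
        Money five = new Money(5);
        assertEquals(10, five.plus(new Money(5)).getAmount());
    }
}

// GREEN: the simplest implementation that makes the test pass.
// REFACTOR: once the bar is green, the field is made final so Money is immutable,
// improving the design while the existing test keeps passing.
class Money {
    private final int amount;

    Money(int amount) {
        this.amount = amount;
    }

    Money plus(Money other) {
        return new Money(this.amount + other.amount);
    }

    int getAmount() {
        return amount;
    }
}
</pre>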
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://4.bp.blogspot.com/-vnEzXhkvJYg/UdCoKOfEFyI/AAAAAAAABrw/bbgI84d5vyo/s540/tddcyclewithdesign.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://4.bp.blogspot.com/-vnEzXhkvJYg/UdCoKOfEFyI/AAAAAAAABrw/bbgI84d5vyo/s1600/tddcyclewithdesign.gif" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="http://3.bp.blogspot.com/-daXXWZNQL4E/UdCoKProVdI/AAAAAAAABr0/15msTOZ7__o/s877/TDD_Life_Cycle.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://3.bp.blogspot.com/-daXXWZNQL4E/UdCoKProVdI/AAAAAAAABr0/15msTOZ7__o/s640/TDD_Life_Cycle.png" /></a></div>
<div class="separator" style="clear: both; text-align: left;">
<br /></div>
The Stepwise Premise for TDD goes as below:<br />
- Can gigantic complex architectures really be created using nothing other than red-green-refactor?<br />
- Consider these issues:<br />
<ul>
<li>All large solutions don't just materialize out of nowhere; they are ultimately created in modest steps anyway.</li>
<li>Even if we have analysis and design phases for large-scale architectural features, we can still develop using TDD.</li>
<li>Considerable data is available to support the idea that complex global design processes frequently don't work.</li>
<li>TDD has a serious track record: it is being used all over the world to create complex systems.</li>
</ul>
Below are the commonly used TDD patterns:<br />
<br />
<b>Specify It</b><br />
<ul>
<li><span class="Apple-tab-span" style="white-space: pre;"> </span><u>Essence First</u>: What is the most basic functionality needed, not including anything fancy</li>
<li><span class="Apple-tab-span" style="white-space: pre;"> </span><u>Test First</u>: What exactly will we be testing? Capture that in the test method name.</li>
<li><span class="Apple-tab-span" style="white-space: pre;"> </span><u>Assert First</u>: What behavior would you like to check? Writing the assert statement will lead us to produce the structure backwards by "backfilling the method" by declaring the objects and methods we need to create as well as the expected result of calling the new code.</li>
</ul>
<b>Frame It</b><br />
<ul>
<li><span class="Apple-tab-span" style="white-space: pre;"> </span><u>Frame First</u>: Create whatever class(es), constructor(s) and method(s) are needed by our assert statement.</li>
</ul>
<b>Evolve It</b><br />
<ul>
<li><span class="Apple-tab-span" style="white-space: pre;"> </span><u>Do The Simplest Thing That Could Possibly Work</u>: Focus on minimalism by asking oneself to program only what is absolutely necessary to pass a test.</li>
<li><span class="Apple-tab-span" style="white-space: pre;"> </span><u>Break It To Make It</u>: Write a new test code that we know will fail because as our production code isn't capable of handling the new test.</li>
<li><span class="Apple-tab-span" style="white-space: pre;"> </span><u>Refactor Mercilessly</u>: Make design improvements continuously, aggressively, mercilessly avoiding really bad code.</li>
<li><span class="Apple-tab-span" style="white-space: pre;"> </span><u>Test Driving</u>: In TDD, we don't want to stray too far from the Green Bar.</li>
</ul>
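<div>
The sketch below illustrates the Specify It and Frame It patterns under an assumed, purely illustrative DiscountCalculator example: the assertion is written first, and the class and method it needs are then "backfilled" with the simplest thing that could possibly work.</div>
<br />
<pre class="brush: java">import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class DiscountCalculatorTest {

    // Assert First: the assertEquals line is conceived before anything else exists;
    // the object and method it needs are then declared above it (Frame First).
    @Test
    public void tenPercentDiscountIsAppliedToTheTotal() {
        DiscountCalculator calculator = new DiscountCalculator(); // backfilled class
        double discounted = calculator.apply(200.0, 10);          // backfilled method
        assertEquals(180.0, discounted, 0.001);
    }
}

// Do The Simplest Thing That Could Possibly Work: just enough code to pass the test.
class DiscountCalculator {
    double apply(double total, int percent) {
        return total - (total * percent / 100.0);
    }
}
</pre>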
<br />
Finally, <a href="http://www.objectmentor.com/omTeam/martin_r.html">Robert Martin</a>, one of the fanatic devotee of Test Driven Development provides the three laws of TDD in his book <a href="http://www.amazon.com/Clean-Code-Handbook-Software-Craftsmanship/dp/0132350882">Clean Code</a> as below:<br />
<ul>
<li><b>First Law:</b> You may not write production code until you have written a failing unit test.</li>
<li><b>Second Law:</b> You may not write more of a unit test than is sufficient to fail, and not compiling is failing.</li>
<li><b>Third Law:</b> You may not write more production code than is sufficient to pass the currently failing test.</li>
</ul>
<br />
Refactoring generally involves taking an existing class that's too complex and breaking it into smaller classes, each of which takes part of the old class's responsibility, and all of which work together. There are numerous advantages of refactoring classes into smaller ones, some listed as follows:<br />
<br />
1) By making classes smaller, thus easier to grasp at one time.<br />
2) By aligning the smaller classes with a well-understood functional breakdown of the underlying problem.<br />
3) By making the couplings between classes mirror the couplings between functionality.<br />
4) By (ultimately) allowing complex systems to be built by composing many simpler objects.<br />
5) By making each smaller class easier to test.<br />
<br />
Refactoring also involves <a href="http://wiki.hsr.ch/PeterSommerlad/files/ACCU_SimpleCode.pdf">Decremental Development</a>, which means finding ways to shrink the code even as we continue to add new features. All the common functionality is moved into a library, while pre-existing libraries (core as well as external) with the required implementation are searched for, instead of re-inventing the wheel.<br />
<div>
<br /></div>
<br />
<b><span style="font-size: large;">GUI Applications</span></b><br />
<br />
In order to apply TDD to GUI applications, they need to have a clear separation between the user interface and the operational logic, most commonly achieved by the MVC pattern. Although the model/view split isn't the only technique for TDD'ing GUIs, it does represent the meta-pattern for all of them.<br />
The following can be achieved by splitting responsibilities (see the sketch after this list):<br />
<ul>
<li>We can test the Model by having our TestCase pretend to be the View.</li>
<li>The most important interactions are on the Model, enabling to test core functionality.</li>
<li>We can use fake domain objects for testing which are in turn are used by the Model.</li>
<li>We can test the View by creating a fake Model and driving it that way.</li>
<li>The View can be tested by driving the window's programmatically.</li>
</ul>
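<div>
Below is a minimal sketch of the first of these points, where the test case itself plays the role of the View against a hypothetical CounterModel with a listener interface (these class names are illustrative only, not taken from any specific framework):</div>
<br />
<pre class="brush: java">import static org.junit.Assert.assertEquals;
import org.junit.Test;

// Hypothetical listener interface through which the Model notifies its View.
interface CounterListener {
    void counterChanged(int newValue);
}

// Hypothetical Model holding the operational logic, free of any GUI code.
class CounterModel {
    private int value;
    private CounterListener listener;

    void setListener(CounterListener listener) {
        this.listener = listener;
    }

    void increment() {
        value++;
        if (listener != null) {
            listener.counterChanged(value); // notify whoever is acting as the View
        }
    }
}

public class CounterModelTest {
    private int observed = -1;

    @Test
    public void incrementNotifiesTheViewWithTheNewValue() {
        CounterModel model = new CounterModel();
        // The test registers itself as the View by supplying a listener.
        model.setListener(new CounterListener() {
            public void counterChanged(int newValue) {
                observed = newValue;
            }
        });
        model.increment();
        assertEquals(1, observed);
    }
}
</pre>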
<br />
Many further enhancements can be applied to the Model-View split, such as the following:<br />
- Add Publisher-Subscriber to allow multiple Views on the same Model.<br />
- Add a Controller class to translate View-gestures into Model-commands.<br />
- Add a Command system to isolate and manipulate individual commands.<br />
<br />
<br />
<span style="font-size: large;"><b><span id="goog_857436766"></span>Test Driven Development Shortcomings</b></span><br />
<br />
TDD is a development process which assures quality by enforcing unit tests. The quality of the code, however, mainly depends on the quality of the tests, not on when the tests are written during development or how many lines are covered. The essential purpose of writing unit tests is to reduce the possibility of defects in the development phase itself and to provide a set of automated tests to validate future changes without introducing new defects. Although such an approach is greatly beneficial, the question often raised is to what extent the tests should be written. When does this approach lose efficiency relative to the value of auto-tested code? Does it provide an optimal solution to the complex process of software development and unforeseen defects? Is the time and effort spent in writing unit tests to prevent and decrease defects the best approach?<br />
<br />
Most of the unit testing tutorials, TDD books and sites describe the approach with basic examples such as processing student grades, calculating wages etc. Although these do give us a perspective and seem to make the approach look like by far the best one, when applied in the corporate world such an approach has some inherent issues, listed below:<br />
<br />
1) Testing a piece of code completely may involve a huge number of scenarios to be considered. Even selecting the subset of critical cases and writing the test cases for them involves almost the same effort as writing the original functional code. But even after selecting a subset of critical cases, we still open ourselves to the possible defects arising from the ignored scenarios. How do we decide which cases are critical and which should be ignored? Some cases may be dismissed early on, but considered in the context of the entire system, such cases could lead to vital failures. Hypothetically, even if we painstakingly compiled all the critical cases and wrote unit tests for the entire application, we could not be sure that there wouldn't be any defects coming up from the unit tested code. Oftentimes the unit tests validate obvious scenarios (mostly by replicating the code/object in the unit test or verifying that a method gets called), thus providing us with a false sense of security. This is mostly caused when the same person writes both the test and the code.<br />
<br />
2) Compared to most of the unit testing examples in tutorials, books and articles, professional code is not that simple or straightforward to isolate. Many real world systems involve file handling, calling external services, databases, invoking external processes and multi-threaded operations. The outcome of these operations is hard to predict. We cannot always anticipate the possible values returned by external services or by the database. Some scenarios, such as concurrent operations, server timeouts, etc., are difficult to recreate in a unit test environment. Even if a unit test could be written to check the handling of possible service failures, it would require a substantial amount of effort compared to manual or integration testing.<br />
<br />
3) The basic premise of TDD is that the test drives the system design and implementation. Hence if a line of code <a href="http://www.methodsandtools.com/archive/archive.php?id=103">cannot be tested</a> then it shouldn't have been written at all. Sometimes, due to the limitations of unit testing tools such as JUnit, <a href="http://docs.mockito.googlecode.com/hg/org/mockito/Mockito.html">Mockito</a> and others, a unit test cannot test a certain piece of code in isolation. Static methods are one such case where, despite using PowerMock, there are <a href="http://googletesting.blogspot.com/2008/12/static-methods-are-death-to-testability.html">many questions raised</a> over the effectiveness of those tests. Also, private class fields/methods tend to be changed to weaker <a href="http://docs.oracle.com/javase/tutorial/java/javaOO/accesscontrol.html">access modifiers</a> to facilitate unit testing as far as JUnit is concerned. <a href="http://lkrnac.net/blog/2014/01/21/mock-autowired-fields/">Concerns</a> are also raised about the <a href="http://tedvinke.wordpress.com/2014/02/13/mockito-why-you-should-not-use-injectmocks-annotation-to-autowire-fields/">use</a> of Mockito's InjectMocks in unit tests, and it is recommended to use <a href="http://blog.schauderhaft.de/2012/06/05/repeat-after-me-setter-injection-is-a-symptom-of-design-problems/">constructor</a> based auto-wiring instead of <a href="http://steveschols.wordpress.com/2012/06/05/i-was-wrong-constructor-vs-setter-injection/">setter or field</a> based auto-wiring. This ultimately restricts the usage of some features of the programming language or the frameworks to within the boundaries of testability, often being tagged as bad design.<br />
<br />
4) As mentioned previously by Robert Martin, no production code should be written without a corresponding failing test. This totally ignores whether the unit test is effective, productive and valuable in catching issues. Further, it blurs the line between writing a unit test for the behavior/functionality of the code and mapping each line of production code to a corresponding unit test. For example, creating a new object, setting values on an object, non-conditional calls to a library's void methods, logging etc. surely add up to numerous lines of production code, but they hardly articulate any logic or behavior. Consider the following code below:<br />
<br />
<pre class="brush: java">Properties properties = new Properties();
properties.setProperty("key", "value");
properties.store(new FileOutputStream("C:/test.properties"), null);
</pre>
<br />
The above code creates a Properties object and uses the built-in store method of the API to create a properties file without any conditional logic. There could be many <b>what if</b> arguments made, such as what if the store method is not called, the file path is incorrect, or the properties are not set or are set incorrectly, which often is a slippery slope. But mandating the existence of a line of code or its order is not the purpose of a unit test; the purpose is to make sure an independent chunk of code behaves as intended. Any piece of code which has only a single logical flow and returns the same or similar results no matter the input has no concrete behavior. Further, if the code does not provide any behavior by itself or relies on external library methods for its behavior, then unit testing such code not only adds overhead and maintenance but also fails to provide any productive feedback to detect real problems.<br />
Further, mandating TDD during a proof of concept, or during trial and error to fix a known problem, not only increases the development overhead significantly but also distracts the developer from the core task/problem.<br />
<br />
5) Someone has said "<i>the line of code that is fastest to write, that never breaks, that doesn't need maintenance is the line you never have to write</i>". In Test Driven Development, as the unit tests drive the development (rather than us choosing the critical methods to unit test), there is a lot more test code involved. Multiple scenarios for a given piece of code may encourage duplicate code unless only a single person works on it. In corporate projects such big chunks of test code add to the maintenance cost of the system. Badly written unit tests, which often involve hardcoded error strings, further consume time and effort to maintain. Fragile tests which generate false failures mostly tend to be ignored, even in case of valid errors. Modifying existing functionality using TDD becomes quite challenging as we need to deal with a mesh of interconnected mock objects and a series of test cases.<br />
<br />
Finally, the root issue with TDD is not the effort or time required to write the tests, but their value compared to that effort, i.e. developer productivity. TDD is much easier to apply when the design documents dictate the classes/methods and their functionality beforehand. It also helps if all the possible test cases are listed (usually by testers) for the pre-designed classes.<br />
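Referring back to point 3 above, the following sketch (with hypothetical class names, purely for illustration) shows how constructor based injection keeps a dependency replaceable in a plain unit test, without Mockito's InjectMocks, field injection or reflection:<br />
<br />
<pre class="brush: java">import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class ReportGeneratorTest {

    // Hypothetical collaborator that would normally call an external service.
    interface SalesClient {
        double totalSalesFor(String region);
    }

    // Constructor injection: the dependency is explicit and can be substituted in tests.
    static class ReportGenerator {
        private final SalesClient salesClient;

        ReportGenerator(SalesClient salesClient) {
            this.salesClient = salesClient;
        }

        String summaryFor(String region) {
            return region + ": " + salesClient.totalSalesFor(region);
        }
    }

    @Test
    public void buildsSummaryFromSalesFigures() {
        // A hand-rolled stub is enough; no mocking framework or field injection required.
        ReportGenerator generator = new ReportGenerator(region -> 1250.0);

        assertEquals("EMEA: 1250.0", generator.summaryFor("EMEA"));
    }
}
</pre>
<br />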
<br />
<br />
<b><span style="font-size: large;">Was it really Behavior Driven Development ?</span></b><br />
<br />
Since writing <a href="https://emprovisetech.blogspot.com/2013/08/test-driven-development.html">this 2013 blog post</a>, many others have joined in questioning the effectiveness of TDD. <a href="https://dhh.dk/">David Heinemeier Hansson</a>, the creator of Ruby on Rails, has <a href="https://dhh.dk/2014/tdd-is-dead-long-live-testing.html">described TDD</a> as <b>"</b>Test-first fundamentalism is like abstinence-only sex ed: An unrealistic, ineffective morality campaign for self-loathing and shaming<b>"</b>. After the blog post, <a href="https://www.kentbeck.com/">Kent Beck</a> put forward his <a href="https://dhh.dk/2014/tdd-is-dead-long-live-testing.html">sarcastic defense of TDD</a>, which was later followed by a conversation with Martin Fowler on whether <a href="https://martinfowler.com/articles/is-tdd-dead/">TDD is Dead</a>. The conclusion of the conversation was that TDD is valuable in some contexts, but much disagreement prevailed over the number and type of contexts in which it should be applied. Then at the DevTernity 2017 conference <a href="https://twitter.com/ICooper">Ian Cooper</a> gave a talk titled "<a href="https://www.youtube.com/watch?v=EZ05e7EMOLM">TDD, Where Did It All Go Wrong</a>" which was <a href="https://twitter.com/unclebobmartin/status/1032405401009041409">promoted</a> by <a href="http://cleancoder.com/">Uncle Bob Martin</a>. In the talk Cooper pointed out that TDD is being practiced incorrectly, since we focus on testing the implementation details instead of testing the system behavior. Because of this we often write more test code than implementation code. Such implementation driven tests, with their spaghetti of mocks, make refactoring painful, maintenance a nightmare and decrease the overall development productivity. Developers too often don't understand the intent of such tests and are unable to deduce the system behavior by reading them. Enhancements and re-designs become difficult as changing the implementation also requires changing the tests, which is a long haul process.<br />
<br />
TDD is mainly practiced by using 'adding a new method to a class' as the trigger to write a test. Such a test-case-per-class approach fails to capture the true ethos of TDD. <b>Adding a new class or method is not the trigger for writing tests</b>. <b>The trigger is implementing a requirement</b>. <b><span style="color: red;">Write tests to cover the use cases or user stories, not the implementation classes or methods</span></b>. The system under test is not a class but the exports from a module or its facade. The 'unit' of 'unit testing' here really means module, not a class. A class by itself can be the facade, but many classes are implementation details of the module. Do not write tests for implementation details, these change. Write tests only against the stable contract of the (public) API (which can be within a module). Ian Cooper referenced the first book on TDD, "<b><a href="https://www.amazon.com/Test-Driven-Development-Kent-Beck/dp/0321146530">Test-driven Development: By Example</a></b>" by Kent Beck, and pointed out that Kent has explicitly stated that we need to be testing behavior, not the implementation. On page 4 of the book Kent writes "<span style="color: purple;">What behavior will we need to produce the revised report? Put another way, what set of tests, when passed, will demonstrate the presence of code we are confident will compute the report correctly ?</span>", which <a href="https://andrewshay.me/blog/tdd-vs-bdd-does-tdd-test-implementation-or-behavior/">clearly refers</a> to testing behavior, not implementation. Kent further states that "<span style="color: purple;">When we write a test, we imagine the perfect interface for our operation. We are telling ourselves a story about how the operation will look from the outside. Our story won't always come true, but its better to start from the best-possible application program interface (API) and work backward than to make things complicated, ugly, and 'realistic' from the get-go</span>", which affirms testing APIs, not implementation methods.<br />
<br />
<b><span style="color: red;">The tests should run in isolation from other tests, but not from the system under test</span></b>. The unit of isolation is not the class under test, but the tests themselves. Tests can and should exercise several classes working together if that is what is needed to test the behavior. We avoid the file system and the database simply because these shared fixture elements prevent the tests from running in isolation from each other, or make the tests slow. But if there is no shared fixture problem (one test does not affect another) then it is perfectly fine to talk to a database (though in-memory) or the file system in unit tests. Focusing on methods for testing creates tests which are hard to maintain and code which is difficult to refactor, because implementation details are exposed to the tests. Such tests do not capture the behavior we want to preserve and become difficult to understand.<br />
<br />
Refactoring is the process of changing a software system in such a way that it does not alter the external behavior of the code yet improves its internal structure. It is the step where we improve our design/implementation, produce clean code, remove duplication, sanitize code smells and apply design patterns. During refactoring to clean code we should not write new unit tests since we are not introducing new public APIs / classes. <b>Dependency</b> is the key problem in software development at all scales. <b>Dependency between the tests and the code should be eliminated by avoiding mocking</b>. 
<b><span style="color: red;">Tests should not depend on implementation details by using Mocks because changing the implementation breaks such tests</span></b>. Hence mocks should be avoided at all costs except to isolate the tests on the module boundaries (databases, external services, file systems).<br />
<br />Pranav Patilhttp://www.blogger.com/profile/10125896717744035325noreply@blogger.com2tag:blogger.com,1999:blog-7610061084095478663.post-46550350410429530772013-04-09T11:20:00.003-07:002016-07-26T14:55:50.416-07:00Logging FrameworksIn any complex application comprising several components working together, tracking failures effectively becomes challenging. Even though the application is separated into individual components, a trace of operations is required to investigate potential failures. In such circumstances, logging individual component activities comes in handy and provides a great depth of insight into periodic operations. Logging using System.out and FileWriter in Java was once prevalent, but with more sophisticated frameworks available, such techniques have become a thing of the past. Three major logging frameworks dominate the Java world, apart from countless others: Log4J, SLF4J and Logback.<br />
<br />
<b><span style="font-size: large;">Java Logging API</span></b><br />
The <a href="http://www.vogella.com/articles/Logging/article.html">java logging API</a> contains a basic set of logging capabilities in the java.util.logging package using the Logger class. The Logger actually is a hierarchy of Loggers, and a . (dot) in the hierarchy indicates a level in the hierarchy. If we get a Logger for the com.example then the logger is a child of the com Logger and the com Logger is child of the Logger for the empty String. We can configure the main logger which affects all its children. The log levels such as SEVERE, WARNING, INFO etc define the severity of a message. The Level class is used to define which messages should be written to the log. The levels OFF and ALL to turn the logging of or to log everything. Each logger can access several handlers which receives the log messages from the logger and exports it to a target file (FileHandler) or console (ConsoleHandler). Each handlers output can be configured with formatters such as SimpleFormatter to generate messages in text or XMLFormatter to generate messages in XML format. The log manager is responsible for creating and managing the logger and the maintenance of the configuration.<br />
<br />
The logging can be configured using the log.properties file with the below sample configuration.<br />
<br />
# Logging<br />
handlers = java.util.logging.FileHandler, java.util.logging.ConsoleHandler<br />
.level = ALL<br />
<br />
# File Logging<br />
java.util.logging.FileHandler.pattern = %h/myApp.log<br />
java.util.logging.FileHandler.formatter = java.util.logging.SimpleFormatter<br />
java.util.logging.FileHandler.level = INFO<br />
<br />
# Console Logging<br />
java.util.logging.ConsoleHandler.level = ALL<br />
<div>
<br />
<br />
The "-Djava.util.logging.config.file=/absolute-path/logging.properties" parameter is used to load a custom log.properties for java util logging. It works with following cases:<br />
<ul style="background-color: white; border: 0px; font-family: Arial, 'Liberation Sans', 'DejaVu Sans', sans-serif; font-size: 14px; line-height: 18px; list-style-image: initial; list-style-position: initial; margin: 0px 0px 1em 30px; padding: 0px; vertical-align: baseline;">
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; font-size: 14px; margin: 0px; padding: 0px; vertical-align: baseline; word-wrap: break-word;">Move the file <code style="background-color: #eeeeee; background-position: initial initial; background-repeat: initial initial; border: 0px; font-family: Consolas, Menlo, Monaco, 'Lucida Console', 'Liberation Mono', 'DejaVu Sans Mono', 'Bitstream Vera Sans Mono', 'Courier New', monospace, serif; font-size: 14px; margin: 0px; padding: 0px; vertical-align: baseline;">log.properties</code> to the default package (the root folder for your sources)</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; font-size: 14px; margin: 0px; padding: 0px; vertical-align: baseline; word-wrap: break-word;">add it directly to the classpath (just like a JAR)</li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; font-size: 14px; margin: 0px; padding: 0px; vertical-align: baseline; word-wrap: break-word;">You can specify the package in which the file is, replacing "." with "/": <code style="background-color: #eeeeee; background-position: initial initial; background-repeat: initial initial; border: 0px; font-family: Consolas, Menlo, Monaco, 'Lucida Console', 'Liberation Mono', 'DejaVu Sans Mono', 'Bitstream Vera Sans Mono', 'Courier New', monospace, serif; font-size: 14px; margin: 0px; padding: 0px; vertical-align: baseline;">-Djava.util.logging.config.file=com/company/package/log.properties</code></li>
<li style="background-color: transparent; background-position: initial initial; background-repeat: initial initial; border: 0px; font-size: 14px; margin: 0px; padding: 0px; vertical-align: baseline; word-wrap: break-word;">You can specify the absolute path</li>
</ul>
</div>
<br />
A well-known (if heavy-handed) way to disable all logging output for any framework is to redirect the standard output and error streams to the null device as follows:<br />
<pre class="brush: java"> static {
//Windows style
try {
PrintStream nps = new PrintStream(new FileOutputStream("NUL:"));
System.setErr(nps);
System.setOut(nps);
} catch (FileNotFoundException e) {
e.printStackTrace();
}
}
</pre>
<br />
<br />
<b><span style="font-size: large;">Log4J Framework</span></b><br />
Log4J is the oldest of the above frameworks, and is widely used due to its simplicity of usage. It defines various log levels and messages. Log4j is thread safe and optimized for speed. It is based on a named logger hierarchy. It supports multiple output appenders per logger and internationalization.<br />
Log4j is not restricted to a predefined set of facilities. Its logging behavior can be set at runtime using a configuration file. It is designed to handle Java Exceptions from the start. Log4j uses multiple levels, namely ALL, TRACE, DEBUG, INFO, WARN, ERROR and FATAL to denote log levels. The format of the log output can be easily changed by extending the Layout class. The target of the log output as well as the writing strategy can be altered by implementations of the Appender interface. Log4j is fail-stop but it does not guarantee that each log statement will be delivered to its destination.<br />
Below is a sample log4j property file (log4j.properties):<br />
<br />
# Suppress logging from spring and hibernate to WARN<br />
log4j.logger.org.hibernate=WARN<br />
log4j.logger.org.springframework=WARN<br />
<br />
# Set root logger level to INFO and attach Appender1 and Appender2 to it.<br />
log4j.rootLogger=INFO, Appender1, Appender2<br />
<br />
# Appender1 is a ConsoleAppender while Appender2 is a RollingFileAppender.<br />
log4j.appender.Appender1=org.apache.log4j.ConsoleAppender<br />
log4j.appender.Appender2=org.apache.log4j.RollingFileAppender<br />
log4j.appender.Appender2.File=sample.log<br />
<br />
# Both appenders use PatternLayout.<br />
log4j.appender.Appender1.layout=org.apache.log4j.PatternLayout<br />
log4j.appender.Appender1.layout.ConversionPattern=%-4r [%t] %-5p %c %x - %m%n<br />
<br />
log4j.appender.Appender2.layout=org.apache.log4j.PatternLayout<br />
log4j.appender.Appender2.layout.ConversionPattern=%-4r [%t] %-5p %c %x - %m%n<br />
<br />
Log4j sample code is as follows:
<br />
<pre class="brush: java"> try {
Properties props = new Properties();
props.load(TestHTTP.class.getResourceAsStream("/log4j.properties"));
System.out.println("props = " + props.toString());
PropertyConfigurator.configure(props);
} catch (IOException e) {
e.printStackTrace();
}
LogManager.getRootLogger().setLevel(Level.OFF);
// Pavan's Code
Logger log = Logger.getLogger("myApp");
log.setLevel(Level.ALL);
log.info("initializing - trying to load configuration file ...");
Properties preferences = new Properties();
try {
FileInputStream configFile = new FileInputStream("/path/to/app.properties");
preferences.load(configFile);
LogManager.getLogManager().readConfiguration(configFile);
} catch (IOException ex) {
System.out.println("WARNING: Could not open configuration file");
System.out.println("WARNING: Logging not configured (console output only)");
}
log.info("starting myApp");
</pre>
<br />
<b><span style="font-size: large;">Logback Framework</span></b><br />
The Logback framework is a successor to the log4j framework, providing a native implementation of the SLF4J API. Logging configuration can be provided either in XML or Groovy. It provides a SiftingAppender which makes it possible to maintain separate logfiles based on the user session instance, along with the ability to switch the log level for individual users. Logback can automatically reload its configuration upon changes and provides better I/O failover in case of server failure.<br />
<br />
Logback delegates the task of writing a logging event to components called appenders.<br />
Appenders must implement the ch.qos.logback.core.Appender interface, which contains the doAppend() method responsible for outputting the logging events in a suitable format to the appropriate output device.<br />
<br />
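As an illustrative sketch (not part of the original post), a minimal custom appender can extend the ch.qos.logback.core.AppenderBase helper class, which implements the Appender interface and leaves only the append() method to be provided:<br />
<br />
<pre class="brush: java">import ch.qos.logback.classic.spi.ILoggingEvent;
import ch.qos.logback.core.AppenderBase;

// Minimal custom appender sketch: AppenderBase implements the Appender interface
// (including doAppend(), which handles filters and status reporting) and delegates
// the actual output of each logging event to append().
public class ConsoleEchoAppender extends AppenderBase&lt;ILoggingEvent&gt; {

    @Override
    protected void append(ILoggingEvent event) {
        // Write the formatted message to standard output; a real appender would
        // typically use an encoder/layout and a proper output device instead.
        System.out.println(event.getLevel() + " [" + event.getThreadName() + "] "
                + event.getLoggerName() + " - " + event.getFormattedMessage());
    }
}
</pre>
<br />
Such a custom appender can then be referenced from the configuration file like any built-in appender, using its fully qualified class name in the appender's class attribute.<br />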
Sample configuration for logback framework is as follows:<br />
<pre class="brush: xml"><configuration debug="false" scan="false" scanperiod="60 seconds">
<statuslistener class="ch.qos.logback.core.status.OnConsoleStatusListener">
<property name="EMAIL_HOST" scope="context" value="mail.dx.myserver.com">
<logger additivity="false" level="${JAVA_PROJECT_LOGGER_LEVEL}" name="com.myserver.u90">
<appender-ref ref="DebugLogSiftAppender">
<appender-ref ref="ErrorLogSiftAppender"></appender-ref>
</appender-ref>
</logger>
<if condition=""Devl".equalsIgnoreCase(property("EnvironmentPrefix"))">
<then>
<property name="DOZER_LOGGER_LEVEL" scope="context" value="off">
<property name="SPRING_LOGGER_LEVEL" scope="context" value="off">
<property name="HIBERNATEJPA_LOGGER_LEVEL" scope="context" value="off">
<property name="FROM_EMAIL" scope="context" value="sample_services_devl@myserver.com">
<property name="TO_EMAIL" scope="context" value="admin@company.com">
<property name="JAVA_PROJECT_LOGGER_LEVEL" scope="context" value="trace">
<property name="PERF_LOGGING_LEVEL" scope="context" value="debug">
</property></property></property></property></property></property></property></then>
</if>
<turbofilter class="ch.qos.logback.classic.turbo.MarkerFilter">
<marker>PERFORMANCE</marker>
<onmatch>ALLOW</onmatch>
</turbofilter>
<appender class="ch.qos.logback.core.ConsoleAppender" name="STDOUT">
<encoder>
<pattern>%date [%thread] %mdc %-5level %logger %msg %n %ex</pattern>
</encoder>
</appender>
<appender class="ch.qos.logback.classic.sift.SiftingAppender" name="ErrorLogSiftAppender">
<discriminator class="ch.qos.logback.classic.sift.JNDIBasedContextDiscriminator">
<defaultvalue>unknown</defaultvalue>
</discriminator>
<sift>
<appender class="ch.qos.logback.core.rolling.RollingFileAppender" name="ErrorLogSiftAppender-${contextName}">
<filter class="ch.qos.logback.classic.filter.LevelFilter">
<level>ERROR</level>
<onmatch>ACCEPT</onmatch>
<onmismatch>DENY</onmismatch>
</filter>
<file>${logdir}/${contextName}Error.log</file>
<append>true</append>
<rollingpolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
<filenamepattern>${logdir}/${contextName}Error%d{yyyy-MM-dd}.%i.log</filenamepattern>
<timebasedfilenamingandtriggeringpolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedFNATP">
<!-- or whenever the file size reaches 100MB -->
<maxfilesize>10MB</maxfilesize>
</timebasedfilenamingandtriggeringpolicy>
<!-- keep 30 days' worth of history -->
<maxhistory>30</maxhistory>
<!-- keep 30 days' worth of history -->
<maxhistory>30</maxhistory>
</rollingpolicy>
<encoder class="ch.qos.logback.classic.encoder.PatternLayoutEncoder">
<pattern>%date [%thread] %mdc %-5level %logger %msg %n</pattern>
</encoder>
<!--
<layout class="ch.qos.logback.classic.PatternLayout">
<pattern>%d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg \(%file:%line\)%n</Pattern>
</layout>
<file>server.log</File>
-->
</appender>
</sift>
</appender>
<if condition=""Prod".equalsIgnoreCase(property("EnvironmentPrefix"))">
<then>
<logger level="${JAVA_PROJECT_LOGGER_LEVEL}" name="com.mycompany.application">
<appender-ref ref="EmailSiftAppender">
<appender-ref ref="ErrorLogSiftAppender">
<appender-ref ref="DebugLogSiftAppender">
</appender-ref></appender-ref></appender-ref></logger>
</then>
</if>
</property>
</statuslistener>
</configuration>
</pre>
<br />
Pranav Patilhttp://www.blogger.com/profile/10125896717744035325noreply@blogger.com0tag:blogger.com,1999:blog-7610061084095478663.post-53518378684288763302013-02-27T19:34:00.001-08:002021-09-23T14:17:21.046-07:00Maven Plugin Development<br />
Maven carries out all of its work using plugins, which makes them highly significant for its operation. But there are often times when a customized plugin implementation is needed in order to carry out some particular build-related tasks, especially tasks involving Jenkins build operations or command line operations, which can be better handled using maven than ant scripts. Further, plugins can call other plugins and create custom goals to carry out a large series of operations. Hence maven plugin development comes in handy for creating customized maven plugins.<br />
<br />
A maven plugin contains a series of Mojos (goals), with each Mojo being a single Java class containing a series of annotations which tell Maven how to generate the plugin descriptor. Every Mojo class must implement the Mojo interface, which requires the class to implement the getLog(), setLog() and execute() methods. The abstract class <a href="http://maven.apache.org/ref/2.2.1/maven-plugin-api/apidocs/org/apache/maven/plugin/AbstractMojo.html">AbstractMojo</a> provides a default implementation of getLog() and setLog(), thus leaving only the execute() method to be implemented. The <b>getLog()</b> method can be used to access the maven logger, which has the methods info(), debug() and error() to log at various levels. The <b>execute()</b> method is the entry point of the plugin execution and provides the customized build-process implementation for the maven plugin.<br />
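As a minimal illustrative sketch (the goal name and class below are hypothetical), a trivial mojo only needs to extend AbstractMojo, declare a goal in its class-level javadoc and implement execute():<br />
<br />
<pre class="brush: java">import org.apache.maven.plugin.AbstractMojo;
import org.apache.maven.plugin.MojoExecutionException;
import org.apache.maven.plugin.MojoFailureException;

/**
 * Prints a greeting message during the build.
 *
 * @goal greet
 */
public class GreetingMojo extends AbstractMojo {

    public void execute() throws MojoExecutionException, MojoFailureException {
        // getLog() is inherited from AbstractMojo and writes to the maven build output.
        getLog().info("Hello from the greeting mojo");
    }
}
</pre>
<br />
Assuming the goalPrefix "samples" configured later in this post, such a goal could then be invoked directly as mvn samples:greet.<br />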
An AbstractMojo implementation is required to have a <b>@goal</b> annotation in its class-level javadoc. The goal name specified with the javadoc <b>@goal</b> annotation defines the maven goal name to be used along with the goal prefix in order to execute the plugin. The mojo goal can be used directly on the command line or from the POM by specifying mojo-specific configuration. The <b>@phase</b> annotation, if specified, binds the Mojo to a particular phase of the standard build lifecycle, e.g. install. <span style="color: red;">It is to be noted that the phases in the maven lifecycle are not called in series just because a phase name is specified with the @phase annotation in the Maven Mojo</span>. The <b>@execute</b> annotation can be used to specify either a phase and lifecycle, or a goal, to be invoked before the execution of the plugin implementation. When the mojo goal is invoked, it will first invoke a parallel lifecycle, ending at the given phase. If a goal is provided instead of a phase, that goal will be executed in isolation. The execution of either will not affect the current project, but instead makes available the ${executedProject} expression if required. The <b>@requiresProject</b> annotation denotes whether the plugin executes inside a project, thus requiring a POM to execute, or else can be executed without a POM. By default @requiresProject is set to true, thus requiring the plugin to run inside a project. The <b>@requiresOnline</b> annotation mandates the plugin to be executed in online mode. The <b><a href="http://maven.apache.org/developers/mojo-api-specification.html#The_Descriptor_and_Annotations">Maven Mojo API Specification</a></b> describes all the available annotations in detail.<br />
A maven mojo class can also access maven specific objects such as MavenSession, MavenProject, Maven etc. using the maven parameter expressions "${project}", "${session}" or "${maven}". These maven model objects can be used to get the project details from the POM or to alter the session to execute another project. Below is a sample maven plugin mojo which reads another pom, creates a new maven project and alters the session to execute the new project. It also lists the plugins present in the maven project.<br />
<br />
<pre class="brush: java">/**
* @goal sample-task
* @requiresProject false
* @execute lifecycle="mvnsamplecycle" phase="generate-sources"
*/
public class SampleMojo extends AbstractMojo {
/**
* The Maven Session Object
* @parameter expression="${session}"
* @required
* @readonly
*/
private MavenSession session;
/**
* The maven project.
* @parameter expression="${project}"
* @readonly
*/
private MavenProject project;
public void execute() throws MojoExecutionException, MojoFailureException {
// Create a new MavenProject instance from the pom.xml and set it as current project.
MavenXpp3Reader mavenreader = new MavenXpp3Reader();
File file = new File("../../pom.xml");
FileReader reader = new FileReader(file);
Model model = mavenreader.read(reader);
model.setPomFile("../../pom.xml");
MavenProject newProject = new MavenProject(model);
project.setBuild(newProject.getBuild());
project.setExecutionProject(newProject);
project.setFile(file);
session.setCurrentProject(newProject);
session.setUsingPOMsFromFilesystem(true);
// Create a new MavenSession instance and set it to execute the new maven project.
ReactorManager reactorManager = new ReactorManager(session.getSortedProjects());
MavenSession newsession = new MavenSession( session.getContainer(), session.getSettings(), session.getLocalRepository(),
session.getEventDispatcher(), reactorManager, session.getGoals(),
session.getExecutionRootDirectory()+ "/" + app, session.getExecutionProperties(), session.getUserProperties(), new Date());
newsession.setUsingPOMsFromFilesystem(true);
session = newsession;
project.setParent(newProject);
project.addProjectReference(newProject);
project.setBasedir(new File(app));
// List all the plugins in the project pom.
List plugins = getProject().getBuildPlugins();
for (Iterator iterator = plugins.iterator(); iterator.hasNext();) {
Plugin plugin = (Plugin) iterator.next();
if(key.equalsIgnoreCase(plugin.getKey())) {
getLog().info("plugin = " + plugin);
}
}
}
}
</pre>
<br />
Below are the required dependencies for the maven plugin. Note that the last three dependencies (maven-invoker and the two plexus artifacts) are optional and are only needed to access the Maven Object Model objects such as MavenSession, MavenProject etc.
<br />
<pre class="brush: xml"> <dependencies>
<dependency>
<groupid>org.apache.maven</groupid>
<artifactid>maven-plugin-api</artifactid>
<version>2.0</version>
</dependency>
<dependency>
<groupid>commons-io</groupid>
<artifactid>commons-io</artifactid>
<version>2.1</version>
</dependency>
<!-- Dependencies for Maven Object Model -->
<dependency>
<groupid>org.apache.maven.shared</groupid>
<artifactid>maven-invoker</artifactid>
<version>2.1.1</version>
</dependency>
<dependency>
<groupid>org.codehaus.plexus</groupid>
<artifactid>plexus-component-annotations</artifactid>
<version>1.5.5</version>
</dependency>
<dependency>
<groupid>org.codehaus.plexus</groupid>
<artifactid>plexus-utils</artifactid>
<version>3.0.8</version>
</dependency>
</dependencies>
<build>
<plugins>
...................................
<plugin>
<artifactid>maven-plugin-plugin</artifactid>
<version>2.3</version>
<configuration>
<goalprefix>samples</goalprefix>
</configuration>
</plugin>
...................................
</plugins>
</build>
</pre>
<br />
<span style="font-size: large;"><b>Maven Lifecycle</b></span><br />
<br />
The process of building and distributing a particular artifact (project) is defined by the Maven build lifecycle. There are three built-in build lifecycles: default, clean and site. The default lifecycle handles the project deployment, the clean lifecycle handles project cleaning, while the site lifecycle handles the creation of the project's site documentation. Each of the build lifecycles is defined by a different list of build phases, wherein a build phase represents a stage in the lifecycle. The build phases listed in the lifecycle are executed sequentially to complete the build lifecycle. When a build phase is executed from the command line, Maven executes not only that build phase but also every build phase prior to it in the lifecycle. This works for the multi-module scenario too. A build phase carries out its operations through the goals bound to it.<br />
<br />
A goal represents a specific task (finer than a build phase) which contributes to the building and managing of a project. It may be bound to zero or more build phases. A goal not bound to any build phase could be executed outside of the build lifecycle by direct invocation. The order of execution depends on the order in which the goal(s) and the build phase(s) are invoked. Moreover, if a goal is bound to one or more build phases, that goal will be called in all those phases. Furthermore, a build phase can also have zero or more goals bound to it. If a build phase has no goals bound to it, that build phase will not execute. But if it has one or more goals bound to it, it will execute all those goals mostly in the same order of declaration as in the POM.<br />
Goals can be bound to a particular lifecycle phase by configuring a plugin in the project. The goals that are configured will be added to the goals already bound to the lifecycle from the selected phase. If more than one goal is bound to a particular phase, the order used is that those from the selected phase are executed first, followed by those configured in the POM. Note that the <executions> element can be used to gain more control over the order of particular goals. It can also run the same goal multiple times with different configuration if required. Separate executions can also be given an ID so that during inheritance or the application of profiles, it can be controlled whether the goal configuration is merged or turned into an additional execution. When multiple executions are given that match a particular phase, they are executed in the order specified in the POM, with inherited executions running first.
<br />
<pre class="brush: xml"><lifecycle>
<phase>
<id>process-classes</id>
<goals>
<goal>
<id>jcoverage:instrument</id>
</goal>
</goals>
</phase>
<!-- ... -->
<phase>
<id>test</id>
<goals>
<goal>
<id>surefire:test</id>
<configuration>
<!-- This assumes this is used instead of adding a runtime classpath element, which might be a good idea -->
<classesDirectory>${project.build.directory}/generated-classes/jcoverage</classesDirectory>
<ignoreFailures>true</ignoreFailures>
</configuration>
</goal>
</goals>
</phase>
</lifecycle>
</pre>
<br />
<b><span style="font-size: large;">Report Plugin</span></b><br />
<br />
Writing a <a href="http://docs.codehaus.org/display/MAVENUSER/Write+your+own+report+plugin">Report plugin</a> is similar to writing a Mojo plugin, except that we extend the AbstractMavenReport class instead of the AbstractMojo class. The report plugin can be added to the plugins of the reporting section to generate the report with the Maven site. The goal to be executed is specified in the report tag of the reportSet section, which controls the execution of the goals. The methods getProject(), getOutputDirectory(), getSiteRenderer(), getDescription(), getName(), getOutputName(), getBundle() and executeReport() are required to be overridden.<br />
<br />
<b>Note</b>: In order to create the report without using Doxia, e.g. via XSL transformation from some XML file, add the following method to the report Mojo:
<br />
<pre class="brush: java">public boolean isExternalReport() {
return true;
}
</pre>
<br />
The following dependencies are required for a maven report plugin:
<br />
<pre class="brush: xml"><dependency>
<groupid>org.apache.maven.reporting</groupid>
<artifactid>maven-reporting-api</artifactid>
<version>2.0.8</version>
</dependency>
<dependency>
<groupid>org.apache.maven.reporting</groupid>
<artifactid>maven-reporting-impl</artifactid>
<version>2.0.4.3</version>
</dependency>
<dependency>
<groupid>org.codehaus.plexus</groupid>
<artifactid>plexus-utils</artifactid>
<version>2.0.1</version>
</dependency>
</pre>
<br />
AbstractMavenReportRenderer is used to handle the basic operations with the Doxia sink to set up the head, title and body of the html report. The renderBody method is implemented to fill in the middle of the report by using the utilities for sections and tables in Doxia. To use the Doxia Sink API we import the <a href="http://svn.apache.org/repos/asf/maven/doxia/doxia/trunk/doxia-sink-api/src/main/java/org/apache/maven/doxia/sink/Sink.java">org.apache.maven.doxia.sink.Sink</a> class and call the getSink() method to get its instance. Then we use the doxia api as in the below example for the header, title and body. A starting tag is denoted by xxx() while the corresponding end tag is denoted by xxx_(), similar to html tags. The rawText() method outputs exactly the specified text while the text() method adds escaping characters. The sectioning is strict, which means that section level 2 must be nested in section level 1 and so forth. The sample report mojo below overrides the required methods and provides a sample usage of the Doxia API.<br />
<br />
<pre class="brush: java">public class ReportMojo extends AbstractMavenReport {
/**
* Report output directory.
* @parameter expression="${project.reporting.outputDirectory}"
* @required
* @readonly
*/
private String outputDirectory;
/**
* Maven Project Object.
* @parameter default-value="${project}"
* @required
* @readonly
*/
private MavenProject project;
/**
* Maven Report Renderer.
* @component
* @required
* @readonly
*/
private Renderer siteRenderer;
protected MavenProject getProject() {
return project;
}
protected String getOutputDirectory() {
return outputDirectory;
}
protected Renderer getSiteRenderer() {
return siteRenderer;
}
public String getDescription(Locale locale) {
return getBundle(locale).getString("report.description");
}
public String getName(Locale locale) {
return getBundle(locale).getString("report.title");
}
public String getOutputName() {
return "sample-report";
}
private ResourceBundle getBundle(Locale locale) {
return ResourceBundle.getBundle("sample-report", locale, this.getClass().getClassLoader());
}
@Override
protected void executeReport(Locale locale) throws MavenReportException {
Sink sink = getSink();
sink.head();
sink.title();
sink.text( getBundle(locale).getString("report.title") );
sink.title_();
sink.head_();
sink.body();
sink.section1();
sink.sectionTitle1();
sink.text( String.format(getBundle(locale).getString("report.header"), project.getVersion()) );
sink.sectionTitle1_();
sink.section1_();
sink.lineBreak();
sink.table();
sink.tableRow();
sink.tableHeaderCell( );
sink.bold();
sink.text( "Id" );
sink.bold_();
sink.tableHeaderCell_();
sink.tableRow_();
sink.tableRow();
sink.tableCell();
sink.link( "http://some_url" );
sink.text( "123" );
sink.link_();
sink.tableCell_();
sink.tableRow_();
sink.table_();
sink.body_();
sink.flush();
sink.close();
    }
}
</pre>
<br />
<b><span style="font-size: large;">MultiPage Report Plugin</span></b><br />
<br />
Oftentimes there is a need to create maven reports with multiple pages. But the maven report plugin only provides a single instance of the doxia sink to create an html page. If we copy the implementation of the execute() method from the AbstractMavenReport class and loop over it with different filenames, we do get the required multiple pages, but this only works when the report plugin is executed directly, without the maven site. The maven site plugin does not call the execute() method but calls the actual implementation of the executeReport(Locale) method. Hence such logic does not work for <u><b>mvn site</b></u> but works for direct execution of the plugin. The ReportDocumentRenderer from maven-site-plugin creates the SiteRendererSink and calls report.generate(sink, locale), which in turn calls the executeReport(Locale) method. Using the createSink() method fails in such a case. There is no way to create more SiteRendererSinks within the report, because those sinks are from a different classloader. Maven does provide the AbstractMavenMultiPageReport class to implement, but it also does not provide any way to create multiple sink instances. After upgrading the maven reporting API to 3.0 we have a new method in the AbstractMavenReport class called getSinkFactory(). It allows creating new sink instances when the executeReport method is called from the site-plugin, which initializes the factory instance. In case of direct execution of the multipage report plugin, the execute() method of the AbstractMavenReport class neither initializes the sink factory nor provides any setter to set it. Hence in such a case we resort to a dirty hack and copy the execute() method implementation into the executeReport() method of the multipage report class to create new sink instances. For accessing the getSinkFactory() method we upgrade the maven-reporting-api to 3.0 as follows:<br />
<pre class="brush: xml"><dependency>
<groupid>org.apache.maven.reporting</groupid>
<artifactid>maven-reporting-api</artifactid>
<version>3.0</version>
<exclusions>
<exclusion>
<groupid>org.apache.maven.doxia</groupid>
<artifactid>doxia-sink-api</artifactid>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupid>org.apache.maven.doxia</groupid>
<artifactid>doxia-sink-api</artifactid>
<version>1.3</version>
</dependency>
<dependency>
<groupid>org.apache.maven.reporting</groupid>
<artifactid>maven-reporting-impl</artifactid>
<version>2.2</version>
</dependency>
</pre>
<br />
The following code provides an overview of the implementation with an example of generating a multipage report:<br />
<pre class="brush: java">public class MultiPageReportMojo extends AbstractMavenReport {
.......................
/**
* Copied implementation from {@link AbstractMavenReport}. Generates the index page and
* report pages for all the environments. If the {@link SinkFactory} is null
* (when invoked directly) then creates a new {@link SiteRendererSink} object using
* {@link RenderingContext}. If the {@link SinkFactory} is not null (usually for mvn site)
* then uses its createSink() method to create a new {@link Sink} object.
* @see org.apache.maven.reporting.AbstractMavenReport#execute()
*/
@Override
protected void executeReport(Locale locale) throws MavenReportException {
List<String> envList = Arrays.asList("local", "devl", "qual", "cert", "prod");
// index method uses getSink() method from AbstractMavenReport class to directly access
// the sink and render the index page.
executeReportIndex(locale, envList);
for (String env : envList) {
File outputDirectory = new File( getOutputDirectory() );
Writer writer = null;
try {
String filename = outputPrefix + env + ".html";
SinkFactory factory = getSinkFactory();
if(factory == null) {
SiteRenderingContext siteContext = new SiteRenderingContext();
siteContext.setDecoration( new DecorationModel() );
siteContext.setTemplateName( "org/apache/maven/doxia/siterenderer/resources/default-site.vm" );
siteContext.setLocale( locale );
RenderingContext context = new RenderingContext( outputDirectory, filename );
SiteRendererSink renderSink = new SiteRendererSink( context );
// This method uses the sink instance passed for the environment to render the report page.
executeConfigReport(locale, renderSink);
renderSink.close();
if ( !isExternalReport() ) { // MSHARED-204: only render Doxia sink if not an external report
outputDirectory.mkdirs();
writer = new OutputStreamWriter( new FileOutputStream( new File( outputDirectory, filename ) ), "UTF-8" );
getSiteRenderer().generateDocument( writer, renderSink, siteContext );
}
}
else {
Sink renderSink = factory.createSink(outputDirectory, filename);
// This method uses the sink instance passed for the environment to render the report page.
executeConfigReport(locale, renderSink);
renderSink.close();
}
} catch (Exception e) {
getLog().error("Report, Failed to create server-config-env: " + e.getMessage(), e);
throw new MavenReportException(getName( Locale.ENGLISH ) + "Report, Failed to create server-config-env: "
+ e.getMessage(), e);
} finally {
IOUtil.close( writer );
}
}
.......................
/**
* Renders the table header cell with the specified width and text using the specified sink instance.
* @param sink
* {@link Sink} instance to render the table header cell.
* @param width
* {@link String} of the table header cell.
* @param text
* {@link String} in the table header cell.
*/
protected void sinkHeaderCellText(Sink sink, String width, String text) {
SinkEventAttributes attrs = new SinkEventAttributeSet();
attrs.addAttribute(SinkEventAttributes.WIDTH, width);
sink.tableHeaderCell(attrs);
sink.text(text);
sink.tableHeaderCell_();
}
}
</pre>
<br />Pranav Patilhttp://www.blogger.com/profile/10125896717744035325noreply@blogger.com0