Wednesday, December 31, 2025

Model Training and Fine-Tuning

 

Retrieval Augmented Generation (RAG): if you want to answer questions about a knowledge base that changes frequently, RAG is easier than fine-tuning the model.

Quality of RAG is very much dependent on the retrieval process.
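
A minimal sketch of the retrieval step (assuming an embedding model has already produced the query and document vectors; names are illustrative): rank passages by cosine similarity and place the top-k into the prompt.

import numpy as np

def retrieve(query_vec, doc_vecs, docs, k=3):
    # Cosine similarity between the query embedding and every document embedding.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    top = np.argsort(-scores)[:k]
    return [docs[i] for i in top]

# The retrieved passages are then pasted into the prompt before the question,
# e.g. "Answer using the context below.\n\n<passages>\n\nQuestion: <question>"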

--------

Data Collection

Data Cleaning, filtering


Deduplication:

Exact duplicates - Hashing (SHA-1, MD5) of full documents or lines - Identical web pages removed.


Near duplicates - MinHash, SimHash, Locality Sensitive Hashing (LSH) - Two web pages that differ by small edits


Semantic duplicates - Sentence embeddings + cosine similarity - Two sentences with the same meaning but different wording


Code deduplication - AST (Abstract Syntax Tree) hashing - Detecting identical code logic with different variable names.
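
A rough sketch of the first two categories (documents are plain strings here; at scale MinHash/LSH would replace the exact Jaccard computation):

import hashlib

def exact_dedup(docs):
    # Exact duplicates: keep the first document seen for each SHA-1 content hash.
    seen, kept = set(), []
    for doc in docs:
        h = hashlib.sha1(doc.encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(doc)
    return kept

def jaccard(a, b, n=5):
    # Near-duplicate signal: Jaccard similarity over character n-gram shingles.
    sa = {a[i:i + n] for i in range(max(1, len(a) - n + 1))}
    sb = {b[i:i + n] for i in range(max(1, len(b) - n + 1))}
    return len(sa & sb) / max(1, len(sa | sb))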


De-identification: remove personally identifiable or sensitive information. Add safety filters to make the data less toxic.

The dataset is collected via human annotation, where experts curate question-answer pairs and chat-style conversations and apply safety filters.

Scale.com provides such datasets (data annotation services).


Tokenization: split the sentence into tokens and then create an embedding for each token.

GPT uses Byte Pair Encoding (BPE).

Some modern designs skip conventional token splits and operate on raw bytes (e.g. UTF-8) so that no language-specific tokenizer is needed, improving multilingual and code support.
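
For illustration, a BPE tokenizer can be tried with the tiktoken package (assumed to be installed; cl100k_base is one of OpenAI's published BPE vocabularies):

import tiktoken  # assumption: the tiktoken package is available

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("Tokenization splits text into subword units.")
print(ids)                               # integer token ids
print([enc.decode([i]) for i in ids])    # the subword piece behind each id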

--------

Attention Mechanism

Flash Attention

Sparse Attention


Positional Embeddings

RoPE and ALiBi
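
A minimal sketch of the RoPE idea for a single token vector (split-half pairing convention; real implementations differ in details such as interleaving and batching):

import numpy as np

def rope(x, position, base=10000.0):
    # Rotate pairs of features by position-dependent angles.
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)   # one frequency per feature pair
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:half], x[half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])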


Scaling

Mixture of Experts (MoE) for Scaling - Used in DeepSeek


Activation Functions 

Traditional functions: ReLU, Sigmoid

New: GELU, SwiGLU
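
A sketch of the SwiGLU gate used in many modern feed-forward blocks (W and V are the two input projections; the output projection is omitted):

import numpy as np

def swiglu(x, W, V):
    a = x @ W
    silu = a / (1.0 + np.exp(-a))   # SiLU(a) = a * sigmoid(a)
    return silu * (x @ V)           # gated elementwise product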


Optimizer

Old: Adam

New: Muon


Use CUDA Programming for GPU coding

Llama 3 405B is trained on up to 16K H100 GPUs.


Model Pre-Training:

- Learn Statistical Regularities (Common sentences)

- Acquire Foundational Linguistic Structures (Grammar)

- Develop abstract representation of World Knowledge


Model learns to predict next token on unstructured data.

It gains language understanding and background knowledge.

After pre-training the model can only complete text; it is not yet able to answer questions.
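
A toy illustration of the pre-training objective, assuming a PyTorch model that maps token ids to next-token logits (random logits stand in for the model here):

import torch
import torch.nn.functional as F

token_ids = torch.tensor([[5, 11, 42, 7, 9]])          # one tokenized text
inputs, targets = token_ids[:, :-1], token_ids[:, 1:]  # predict token t+1 from tokens <= t

vocab_size = 100
logits = torch.randn(inputs.shape[0], inputs.shape[1], vocab_size)  # stand-in for model(inputs)

loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
print(loss.item())  # average negative log-likelihood of the next tokens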


https://arxiv.org/pdf/2302.13971

https://huggingface.co/datasets/agentlans/common-crawl-sample



Mid-Training: Fine Tunes the model's "thinking patterns" before it learns to follow instructions

This can include long-context documents or instruction-rich Q&A, designed to improve memory, coherence and strategic reasoning.


Supervised Fine-tuning: also known as instruction tuning. Give the model question-answer pairs to make it behave like ChatGPT.


Structured dataset that demonstrates intended behavior.

Common Use-cases:

 - learn to follow instructions

 - learn to answer questions about a domain

 - learn to have conversations.


Eg. Alpaca Dataset (instruction tuning)

https://huggingface.co/datasets/yahma/alpaca-cleaned

Eg. Guanaco (Conversation tuning)

https://huggingface.co/datasets/philschmid/guanaco-sharegpt-style

Eg. My Paul Graham Dataset (conversation tuning)

https://huggingface.co/datasets/pookie3000/pg_chat
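
For instruction tuning, each record is rendered into a single training string. A sketch for Alpaca-style records with instruction/input/output fields (the template wording is illustrative, not the exact Alpaca prompt):

def format_example(example):
    # Records with an "input" field get an extra section in the prompt.
    if example.get("input"):
        prompt = ("### Instruction:\n" + example["instruction"] +
                  "\n\n### Input:\n" + example["input"] + "\n\n### Response:\n")
    else:
        prompt = "### Instruction:\n" + example["instruction"] + "\n\n### Response:\n"
    return prompt + example["output"]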



Post Training: Model Alignment (reinforcement learning)


Preference Finetuning: It means aligning an LLM's behavior with human preferences - teaching it not just to be correct, but to be helpful, safe, polite, and aligned with what users actually like.


Use dataset of rejected and preferred assistant responses.

Goals:

 - Make the model better at following human preferences

 - Make model safer.


Techniques:

RLHF (Reinforcement Learning from Human Feedback) - human judgements can be inconsistent or biased, and collecting them is not scalable.


RL with Verifiable Rewards: verifiable rewards are rewards which can be computed automatically and unambiguously.

a) Task Definition

b) Generate Responses

c) Evaluate Automatically

d) Update the model using the computed rewards


DPO (Direct Preference Optimization)

RLOO
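
A sketch of the DPO loss, assuming the per-example summed log-probabilities of the chosen and rejected responses under the current policy and a frozen reference model have already been computed:

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit reward margins of the policy relative to the reference model.
    chosen = policy_chosen_logps - ref_chosen_logps
    rejected = policy_rejected_logps - ref_rejected_logps
    # Maximize the probability that the chosen response is preferred over the rejected one.
    return -F.logsigmoid(beta * (chosen - rejected)).mean()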


Eg. Descriptiveness-sentiment-trl instruction tuning

https://huggingface.co/datasets/trl-internal-testing/descriptiveness-sentiment-trl-style


Llama 3 models are produced by applying several rounds of post-training or aligning the model with human feedback on top of a pre-trained checkpoint. Each round of post-training involves supervised fine-tuning (SFT) followed by Direct Preference Optimization (DPO) on examples collected via human annotations or generated synthetically.


Post Training: Reasoning (reinforcement learning)

Dataset of prompts and expected answers. Promote certain model outputs by using reward functions.

Methods: GRPO

It is used to create reasoning models like DeepSeek-R1 and OpenAI o1.

Often done for quantitative domains (science, math, coding) as it's easy to define reward functions for those.

Eg. gsm8k (Grade school math)

https://huggingface.co/datasets/openai/gsm8k
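
For a dataset like GSM8K above, a verifiable reward can simply compare the final number in the model's output with the reference answer (GSM8K references end with "#### <answer>"); a rough sketch:

import re

def gsm8k_reward(model_output, reference_answer):
    target = reference_answer.split("####")[-1].strip().replace(",", "")
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output.replace(",", ""))
    # Reward 1.0 only if the last number produced matches the reference answer.
    return 1.0 if numbers and numbers[-1] == target else 0.0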

Evaluation: after building and fine-tuning the model, evaluation tells us how good the model is.


LoRA (Low Rank Adaptation)

Using LoRA for fine-tuning tasks allows us to decrease the number of trainable parameters; the adapters can be merged into the base weights, so inference speed is unaffected.

 The rank of a matrix is the number of linearly independent rows or columns in the matrix.

 If a (d x k) matrix can be decomposed into two matrices (d x r) and (r x k) with rank r << min(d, k), then LoRA can store d·r + r·k parameters instead of d·k.
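
For example, with d = k = 4096 and r = 8, the full weight matrix holds 16,777,216 parameters while the two LoRA factors hold only 4096·8 + 8·4096 = 65,536 (about 0.4% as many):

d, k, r = 4096, 4096, 8
print(d * k)          # 16,777,216 parameters in the full weight matrix
print(d * r + r * k)  # 65,536 parameters in the LoRA factors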

Parameter Efficient Fine Tuning

 During fine-tuning of a model such as BERT, we use back-propagation to update all the model parameters. BERT large has 345 million parameters, so we would need to store 345 million parameters for every fine-tuned task. Adapters are small units (like feed-forward layers) added to each transformer layer in the network to allow it to adapt to the fine-tuned task. During the backward pass (after the forward pass) we only update the parameters of the adapter layers, keeping all other weights in the network frozen. Hence only the adapter weights need to be stored per fine-tuning task, which is a fraction of all the weights in the model.

 It reduces the number of trainable parameters per task.

 It increases latency and inference time because the adapters are added to the network sequentially (increasing the time for the forward pass).

 

 In LoRA the adapters are added to the pre-trained model allowing it to adapt to a specific task. We have two feed-forward layers: an A (d x r) matrix and a B (r x d) matrix, where r is the rank (as low as 1 or 2) and d is the internal representation of a token (e.g. 512 or 1024 dimensions). The LoRA adapter parameters are added to every set of pre-trained weights.

 Each attention head in the BERT model contains the weight matrices Wq, Wk and Wv, across 24 layers in total. Each weight matrix is a (d x d) matrix.

 We attach matrices [Aq Bq], [Ak Bk], and [Av Bv] to each attention-head layer in parallel, where A is a (d x r) and B is a (r x d) matrix. During the backward pass, we update only the attached matrices' parameters without changing the original weights of the BERT model. We then add the original Wq, Wk and Wv parameters to the new [Aq Bq], [Ak Bk], and [Av Bv] parameters learned for every single attention-head layer, e.g. Wq = Wq + AqBq. No additional weight parameters are used during inference.
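
A minimal numpy sketch of a LoRA-augmented projection using the (d x r) A and (r x d) B convention above (the scaling factor alpha and the initialisation are illustrative):

import numpy as np

d, r, alpha = 1024, 8, 16
W = np.random.randn(d, d) * 0.02   # frozen pre-trained weight, e.g. Wq
A = np.random.randn(d, r) * 0.01   # trainable (d x r) factor
B = np.zeros((r, d))               # trainable (r x d) factor, initialised to zero

def lora_forward(x):
    # Base projection plus the low-rank update, scaled by alpha / r.
    return x @ W.T + ((x @ B.T) @ A.T) * (alpha / r)

# After training, the adapter can be merged so inference adds no extra parameters:
W_merged = W + (alpha / r) * (A @ B)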

 In a feed-forward layer, data is fed from one layer to the next. Each layer has one or more units. Each unit in a given layer receives input from every unit in the previous layer and sends output to every unit in the next layer. A unit in a layer does not communicate with any other unit in the same layer. The outputs of all units except those in the last layer are hidden from external viewers.


Proximal Policy Optimization 

Proximal Policy Optimization is used to learn a policy directly, a mapping from states (observations) to actions that maximizes expected cumulative reward in a reinforcement learning environment.

The PPO algorithm uses two main architectures, a Policy Network and a Value Function Network. Both are neural networks that take an input and return an output. The policy network takes a state as input and produces an action as output. The output layer of the policy network has a number of neurons equal to the number of possible actions, and each neuron gives the probability that its action is taken when the agent is in the input state.

The value function network takes a state as input and outputs a real number (a q-value) for every possible action, which quantifies how good that action is in the given state. The output layer of the value function network therefore has a number of neurons equal to the number of possible actions.

We initially start with some random state which is passed into the Policy Network. The policy network determines the probability of each action, giving a probability distribution. We sample from the distribution to determine the actual next action. We then take the chosen action and receive some reward. We store the quadruple of (state, action, reward, action probability) in a data store used for training the policy network and the value function network. We repeat this sequence of steps for the episode or for some fixed number of time steps within the episode. The data for an episode is stored as a batch.

For the batch of (state, action, reward, action probability), the state and action are used with the value function network to get the q-value, which quantifies how good we expect this action to be. We then determine the total future reward for every time step using the stored data, which quantifies how good we actually performed. The difference between the actual and expected values is called the Advantage, and it is used to compute the loss. The loss is then back-propagated through the value function network for learning.

We then use the same advantage and probabilities that we stored previously to determine the loss for the Policy Network. The loss is back-propagated through the Policy Network so its parameters are updated. We repeat this process for all batches of data effectively making the Policy Network and the Value Function Network better over time as they learn.

  • Compute the loss for Value function network
  • Compute the loss for Policy Network
  • Update both the Networks together
  • Repeat


PPO Loss Deep Dive

We take the batch of data for the episode that we stored earlier. For each time step we compute the actual future reward by taking the sum of discounted future rewards. We then compute the expected future reward by passing the state into the value function network, which produces a q-value for every action; we take the q-value of the specific action recorded in the tuple. So for every time step we have two numbers, and their difference is the advantage. We square the advantage for every time step and take the average across the batch, giving the loss.
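
A rough numpy sketch of this value-loss computation for one stored batch (the rewards, value estimates and discount factor are illustrative):

import numpy as np

rewards = np.array([1.0, 0.0, 0.5, 1.0])   # reward received at each time step
q_taken = np.array([1.8, 0.9, 1.2, 0.8])   # value-network estimate for the action taken
gamma = 0.99                               # discount factor

# Actual future reward: discounted sum of rewards from each time step onward.
returns = np.zeros_like(rewards)
running = 0.0
for t in reversed(range(len(rewards))):
    running = rewards[t] + gamma * running
    returns[t] = running

advantage = returns - q_taken              # actual minus expected
value_loss = np.mean(advantage ** 2)       # squared advantage averaged over the batch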


Policy Network Loss

We take the batch of data that we stored and pass the batch of states to the Policy Network to get the probabilities of actions. In each case we only consider the probability of the action that was actually taken when we gathered the data.

We then divide the probability for that specific action under the current policy by the probability we collected previously; this is the probability ratio. We multiply this ratio by the advantage computed for that time step, so for every time step in the episode we have a number.

We also clip the probability ratio to ensure that we are not changing the network too much, and multiply the clipped ratio by the advantage, so for every time step we now have two values.

We take the minimum of these two values and then average across the batch to get a single number, which is the loss. This loss is back-propagated through the policy network.
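
A matching sketch of the clipped policy loss (the advantages come from the value-loss step above; old probabilities were stored when the data was collected, new ones come from the current policy network):

import numpy as np

advantage = np.array([0.7, -0.4, 0.3, 0.2])     # from the value-loss sketch above
old_probs = np.array([0.60, 0.30, 0.50, 0.70])  # stored probabilities of the taken actions
new_probs = np.array([0.65, 0.25, 0.55, 0.75])  # current policy's probabilities for the same actions
eps = 0.2                                       # PPO clip range

ratio = new_probs / old_probs
clipped = np.clip(ratio, 1 - eps, 1 + eps)

# Elementwise minimum of the unclipped and clipped objectives, averaged over the batch;
# negated so that minimizing the loss maximizes the objective.
policy_loss = -np.mean(np.minimum(ratio * advantage, clipped * advantage))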

The loss function overall strikes a balance between 

  • making effective policy updates to improve performance and 
  • making cautious policy updates to improve stability

We approximate the value function using the value neural network, and the policy using the policy neural network. A deterministic policy does not explore the space, whereas a stochastic policy explores the space using probabilities even though it may not always pick the best-scoring action.


Tuesday, December 30, 2025

OpenShift Operator

Operators are controllers which implement and manage Custom Resources with custom reconciliation logic. Operators extend Kubernetes to manage custom resources similarly to Kubernetes built-in resources. Operators are of two types, Cluster Operators and OLM Operators. The Cluster Operators (Platform Operators) are used to manage the OpenShift platform itself, such as managing and upgrading the cluster. The OLM Operators are managed by the Operator Lifecycle Manager (OLM).

Operators can be developed using Helm, Go and Ansible. A Helm operator only supports maturity Level 2, with the capabilities Basic Install and Seamless Upgrades. The Helm-based operator wraps an existing Helm chart into an Operator for seamless installation/upgrades. It creates a simple CRD containing the desired parameters to be passed to the Helm Chart as part of the spec of the custom resource. We can configure the OpenShift platform versions supported by the Operator.

Operator Lifecycle Manager (OLM)

Operator Lifecycle Manager (OLM) is a Kubernetes native application that helps to install and manage the life cycle of operators. OLM allows us to package, distribute and manage Operators.
Kubernetes Operators are application-specific controllers that automate the deployment, configuration and management of complex applications on Kubernetes. OLM provides a declarative way to install and manage Operators using Kubernetes Custom Resource Definitions (CRDs). OLM is installed by default in an OpenShift cluster.



OLM is comprised of the catalog-operator, olm-operator and packageserver. The status of OLM pods can be checked using the command oc get pods -n openshift-operator-lifecycle-manager.

  • OLM Operator is responsible for deploying applications defined by CSV resources after the required resources specified in the CSV are present in the cluster. It watches for the ClusterServiceVersion (CSVs) to appear in the watched namespaces. It checks whether all the owned and required CRDs specified in CSV are available in the cluster and deploys the Operator once all CSV resources are available.
  • Catalog Operator is responsible for resolving and installing CSVs and the required resources they specify. It is also responsible for watching CatalogSources for updates to packages in channels and upgrading them (optionally automatically) to the latest available versions. Catalog Operator installs CSV and all the resources specified in CSV to the cluster, so that OLM Operator can then deploy the Operator. The catalog operator updates the catalog sources and watches the channels of subscribed operators. It uses subscriptions to query catalogs for the best version of the Operator. Catalog Operator creates all of the resources in the Install Plan to build the required resources for new and upgraded CSV.

  • PackageServer is an API provider which serves the OLM APIs (e.g. via OperatorHub). It shows all the packages available across catalogs. The below command lists all the Operators available to be installed from all the catalog sources configured in the cluster.

kubectl get packagemanifest

Custom Resource Definition

Custom Resource Definition (CRD) allows users to extend the Kubernetes API by defining new schema and structure for new custom resource types. This includes specifying its API group, kind, plural and singular names, and the fields within its specification (spec) and status. CRD provides the object definition of the parameters which will be passed to underlying Helm Chart of the Operator.

ClusterServiceVersion

ClusterServiceVersion (CSV) is a Kubernetes Custom Resource (CR) that defines the metadata, deployment details, and lifecycle management information of the Operator, which the Operator Lifecycle Manager (OLM) uses to install and manage the operator and its dependencies. The CSV is used to install the operator, manage its lifecycle (upgrade, uninstall, etc.) and display information about it in the OpenShift Console. OLM reads the CSV to determine the Operator version to install, the permissions required to install the Operator from a catalog and the CRDs it needs to watch for Operator state. The CSV contains the details of the CRDs, namely the owned CRDs of the Operator and the required CRDs for the Operator. The required CRDs need to be installed in the cluster before installing the operator. Each new version of the operator will be a new CSV.

The Install Strategy is another section of the CSV which tells the OLM operator what to provision to get the operator running, i.e. the deployment the OLM operator creates once all the required resources exist in the cluster.

The Operator Framework's suggested-namespace annotation in the CSV is used to ensure that the customer creates the specified namespace during operator installation.

The InstallMode in the CSV specifies the scope of the Operator, i.e. which namespaces it can watch and manage custom resources (CRs) in within the cluster. Install modes determine whether the Operator runs in a single namespace or its own namespace, watches multiple namespaces, or manages resources cluster-wide. The InstallModes allow the Operator to declare supported deployment scopes, ensuring security boundaries and letting OLM validate whether an installation is compatible with the chosen OperatorGroup.

The OpenShift operator supports the OwnNamespace install mode and only runs in a specific namespace. If another operand instance is created while an operand already exists, the second operand only gets installed after the current operand is uninstalled.

kubectl get csv -n emprovise-system
kubectl describe csv -n emprovise-system

Subscription

Subscription is an OLM resource which tells OpenShift which operator to install or update from a given catalog source and channel. It acts like a contract between the OperatorSource/CatalogSource (where Operator packages live) and the ClusterServiceVersion. OLM monitors the channels when a subscription is present, and new CSVs will be applied to keep the operator updated.

Once we register or subscribe to a channel, OLM watches the channel for new versions to be released. All subscriptions are configured for automatic approval by default. Update the installPlanApproval field in the Subscription to Manual for manual approval to install/upgrade the operator. To manually approve, update the field approved to true in the install plan for the operator installation to proceed.

A Subscription is the way to tell OLM that we want an operator from a specific channel in a package within a catalog to be installed and maintained on the cluster.
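
An illustrative Subscription for the kube-demo operator (the names reuse examples from this document; the field values are assumptions):

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: kube-demo-subscription
  namespace: emprovise-system
spec:
  channel: stable
  name: kube-demo
  source: kube-demo-catalog
  sourceNamespace: emprovise-system
  installPlanApproval: Automatic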

kubectl get subscription  -n emprovise-system kube-demo-subscription -o yaml
kubectl describe subscription -n emprovise-system

CatalogSource

CatalogSource (Catalog Registry) is a library of operators which contains information about the available operators, their versions and instructions to install them. It is a repository of metadata that OLM uses to discover and install an Operator along with its dependencies. It is designed to automate the tedious tasks of installing and updating operators. The catalog source is configured to pull an image every 30 minutes and pull down the latest catalog image when an update becomes available. This keeps the channel updated with the latest CSVs.

kubectl get catalogsource -n emprovise-system kube-demo-catalog -o yaml

OperatorGroup

OperatorGroup defines the scope of visibility for one or more Operators by providing multitenancy configuration. An OperatorGroup assigns target namespaces and generates the required RBAC access for member Operators. It ensures there is no conflict in the APIs supported by each member managed by the same group. There must be only a single OperatorGroup per namespace. An Operator must have an Install Mode configured in its CSV that supports the OperatorGroup's target namespaces in order to become a member of the OperatorGroup. Member CSVs of an OperatorGroup get the OLM group annotations in the generated CSV for the bundle.

oc get operatorgroups -n emprovise-system -o yaml

Channels

Channel represents a stream of versions within an Operator Catalog that groups one or more versions of an Operator’s ClusterServiceVersions (CSVs) to represent a logical upgrade path. Each channel defines a sequence (or upgrade path) of Operator versions that OLM can follow when installing or upgrading an Operator.

The KubeDemo Operator supports two channels in the Operator Bundle/Catalog: stable, which is the default, and alpha. OLM monitors the channels when a subscription is present, and new CSVs are applied to keep the operator up to date.
Add a catalog source to the cluster so that OLM can watch a specific channel in a package from the catalog (service) to detect and apply updates to our operator.

If new updates are available in the channel to which it is subscribed, OLM automatically creates an install plan which defines all the resources required to install/upgrade the new version of the operator. Install plans get created in the same namespace as the operator when a new version is available.

Install Plan

An install plan contains the details of all the resources which need to be created in order to automatically install or upgrade a CSV. Install plans get generated for each new CSV being applied to the cluster.

kubectl get installplan -n emprovise-system
kubectl describe installplan -n emprovise-system

Operator Controller

The Operator Controller is the core execution engine of the Operator deployed on the OpenShift cluster. It houses the operational logic extending the Kubernetes API to manage complex stateful applications. It executes the Kubernetes control loop (or "reconciliation loop"), which continuously monitors the custom resources, compares the desired state to the actual state and takes remediating action. The Operator Controller's reconciliation loop is a continuous, three-step cycle:

·       Watch: It watches for changes (events) on the Custom Resources (CRs) defined by the Operator's Custom Resource Definition (CRD).

·       Analyze: When a change is detected (e.g., a user creates, updates, or deletes a CR), the Operator Controller compares the new desired state (defined in the CR) with the current actual state of the application in the cluster (e.g., existing Pods, Services, ConfigMaps).

·       Act (Reconcile): If a difference is found, the Operator Controller executes the necessary steps to reconcile the state. This usually involves issuing standard Kubernetes API calls to create, update, or delete underlying resources (like Deployments, StatefulSets, or Secrets) to match the user's defined goal.

The Operator Controller is referred to as the Controller Manager in the OpenShift Operator repository.

Metrics Service

The Operator bundle resource exposes a metrics endpoint (on port 8443) as a service from the Controller Manager, which can be used by Prometheus or the Kubernetes Metrics Server to scrape metrics and monitor operator health.

RBAC Permissions

RBAC (Role-Based Access Control) resources determine the access for Operator’s controller and the end user access (via OLM-generated roles) to the KubeDemo custom resource. OLM installs and manages these RBAC objects automatically when the ClusterServiceVersion (CSV) is applied from your Operator bundle.

The Operator SDK auto-generates RBAC roles for end users to inspect KubeDemo CRs safely, providing minimal read-only access to observe but not modify the KubeDemo custom resources.

SQLite Based Catalog

The SQLite-based catalog is the legacy catalog storage for operators. It was part of the legacy implementation of catalog images and is deprecated. It is NOT recommended to enable the SQLite-based catalog by setting the environment variable CATALOG_MODE to sqlite.

File Based Catalog

File-Based Catalog (FBC) is the declarative YAML/JSON format replacing the SQLite database format in the catalog image, which was hard to maintain, version control, and review.

Each document defines a single catalog entry (like an Operator package, a version, or a channel). The catalog contains Operator version entries for each channel (alpha and stable), Operator bundle image URLs, and the Custom Resource Definitions (CRDs) description and CRD example. These entries are built into an image (the catalog image) that OLM uses to discover and install Operators. The KubeDemo Operator by default uses the file-based catalog, with the environment variable CATALOG_MODE set to fbc.

Catalog Template

Catalog templates are used to simplify the view of a catalog and allow easier manipulation of catalogs. They are reusable YAML "blueprints" that define how catalog entries (packages, channels, bundles) should be structured or generated. Catalog templates are of two types, the Basic template and the Semver template, identified by a top-level key schema: olm.template.basic or schema: olm.semver respectively. Templates use YAML parameters to simplify catalog authoring and reduce duplication. A template is similar to a "macro" that expands into actual catalog entries (packages, channels, bundles).

Basic Template

Basic template provides a simple and explicit way to define catalog entries (packages, channels, and bundles) — all in one file. It’s typically used for small or manually curated catalogs, or when you want direct control.

Semver Template

Semver template is an advanced version of the basic template that introduces semantic version awareness and automated upgrade graph generation. It’s designed for larger catalogs or Operators with multiple versions, where you want OLM to auto-generate channel graphs based on semantic version rules. This template supports channel names Candidate, Fast, and Stable, in order of increasing channel stability.

Since we currently don't anticipate frequent Helm chart and Operator releases and prefer manually defined and managed channels with operator versions, the KubeDemo Operator uses the Basic template to generate the FBC.

Catalog Basic Template Management

The Basic Template for File-based Catalog (FBC) is used to manage the upgrade graph of the Operator with version entries for supported channels alpha and stable.

The previous catalog/basic.yaml is either converted from the previous catalog/catalog.yaml (downloaded from the previous catalog image) to generate the new catalog FBC, or, if no previous catalog exists, an empty catalog file is copied from config/catalog/basic-template to catalog/basic.yaml. The new operator bundle version entry is added to the alpha channel (the default fast channel) along with the olm.bundle bundle image URL in the basic template. Finally the basic template is converted to the full FBC catalog/catalog.yaml which is packaged in the new catalog image.

The below custom scripts and Makefile commands are used to manage the Basic Template and to convert to full form of File-based catalog.

·       The convert catalog shell script converts the full File-based Catalog to the basic template catalog. The make extract-catalog command leverages the /scripts/convert_catalog.sh script for the basic template conversion.

./scripts/convert_catalog.sh catalog/catalog.yaml catalog/basic_template.yaml

·       Update the basic template catalog by adding new operator bundle version to the specified channel.

./scripts/update-catalog.sh catalog/basic.yaml quay.io/community-operator-pipeline-prod/kube-demo:0.0.6 alpha

·       Below command reads the basic catalog template file from input variable INPUT_CATALOG_TEMPLATE and finds the latest operator version for FAST_CHANNEL set to alpha channel.

make latest-version-from-catalog

·       The generate catalog command adds a new operator entry (using update-catalog.sh) to the existing catalog/basic.yaml basic template if present, or to an empty basic template catalog copied from config/catalog/basic-template.yaml. It then generates the full form of the file-based catalog in catalog/catalog.yaml and validates the catalog. Finally it generates the Dockerfile to build the catalog image.

make catalog-generate

Operator Images

The Operator image packages the KubeDemo Helm chart and the "watches.yaml" configuration in the opt/helm path to install the KubeDemo Helm chart.

The make bundle command generates the Operator Bundle files, which consist of a Manifests directory containing the ClusterServiceVersion, the Operator's CRD and RBAC permissions, a Metadata directory containing annotations.yaml used to generate the Bundle Docker image, and a Tests directory containing the scorecard test execution configuration. The Operator image and the Operator Bundle files are used to publish the Community Operator in production.

The Bundle and Catalog docker images are only used for local operator testing or integration testing. We use the command oc apply -k config/olm to install the KubeDemo operator on an OpenShift cluster. The Bundle image contains the Operator Bundle files generated from the make bundle command, added to the manifests/, metadata/ and tests/scorecard directories respectively in the container.

The Catalog docker image contains the generated full FBC catalog placed inside the /configs directory in the container.

Operator images only support Linux ARM64 and Linux AMD64 systems.

Operator SDK

The OpenShift Operator project was initialized using the operator-sdk utility's init command.
The Operator SDK is a toolkit, part of the Operator Framework, which provides the tools to build, test, and package Operators. It supports the below commands to create, delete and test Operators.

Create operator project directory and initialize the project.

operator-sdk init --domain emprovise.com --plugins helm

Create a simple Kube-Demo API using Helm’s built-in chart boilerplate (from helm create):

operator-sdk create api --group demo --version v1alpha1 --kind KubeDemo

The Operator SDK's run bundle command creates a temporary CatalogSource and Subscription to install the kube-demo operator, similar to the customized kubectl apply -k config/olm command. It is used for local testing to run the operator bundle.

operator-sdk run bundle "${V1CS_OPERATOR_BASE_IMG_URL}/kube-demo-bundle:v${OPERATOR_VERSION}" --namespace emprovise-system

The run bundle-upgrade command upgrades the installed operator by deleting the old subscription and creating the new operator's subscription:

operator-sdk run bundle-upgrade "${V1CS_OPERATOR_BASE_IMG_URL}/kube-demo-bundle:v${OPERATOR_VERSION}" --namespace emprovise-system

The Operator SDK's cleanup command deletes the Operator Subscription, CatalogSource and ClusterServiceVersion. It is similar to executing kubectl delete -k config/olm, which is currently used for a more customized install/uninstall.

operator-sdk cleanup kube-demo --namespace emprovise-system

Operator Package Manager (OPM)

Operator Package Manager (OPM) is a CLI tool used to build, manage, and inspect Operator catalog images, which OLM uses to list the available Operator versions and upgrade paths. OPM manages catalogs that list Operators and their bundles. OLM uses CatalogSources to discover Operators, which are served from index images built using OPM. OPM supports SQLite catalogs using the opm index add, opm index inspect and opm serve commands, and FBC catalogs using the below commands.

Render and view manifests inside a bundle image.

opm render docker.io/emprovise/kube-demo-bundle:v0.0.1

Convert full FBC to Basic template catalog.

opm alpha convert-template basic catalog/catalog.yaml -o yaml

Validate full FBC catalog.

opm validate ./catalog

Helm Idempotency

The Operator Helm chart should be idempotent, i.e. every repeated install/upgrade of the same Helm chart should produce exactly the same resources. This is required to prevent the Controller Manager from repeatedly reconciling the Helm chart installation. The controller manager runs a dry-run upgrade every few minutes, and if resources are found to have drifted (e.g. secrets are randomly regenerated), then the controller manager will repeatedly reinstall the Helm Chart.

Note: The mode to allow checking existing secrets using lookup uses Helm's --dry-run=server option internally within the Operator controller. This is enabled as dryRunOption: server in openshift-operator/watches.yaml.
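
For example, a chart template can reuse an already generated secret value instead of regenerating it on every reconcile (the secret name and key here are hypothetical):

{{- $existing := lookup "v1" "Secret" .Release.Namespace "kube-demo-secret" }}
apiVersion: v1
kind: Secret
metadata:
  name: kube-demo-secret
data:
  password: {{ if $existing }}{{ index $existing.data "password" }}{{ else }}{{ randAlphaNum 16 | b64enc }}{{ end }}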

 

OpenShift Operator Community Release Process

Community Operator release requires creating a fork of the Community Operators Production repository, which is used to create community operator pull requests.

Community Operator Release with Auto-Release FBC

·       Execute the OpenShift Operator Release workflow from the actions repository by manually triggering the workflow using Run workflow and specifying the new Operator release version. This creates a new Operator image (along with the Bundle and Catalog images which are used for internal testing) pushed into the kube-demo Integration ECR, and a downloadable Operator Bundle artifact (Manifests, Annotations and Tests directories) from the workflow.

·       Create a new directory named after the new operator version number, e.g. 0.0.2, in the forked Community Operators Production repository inside the operators/kube-demo directory and copy all the Operator Bundle artifact files as its contents. Also add the release-config.yaml file with the replaces field containing the previous (existing) operator version, e.g. v0.0.1, as below. The release-config.yaml automates the catalog update using the File-Based-Catalog auto-release and avoids manually creating a second PR with catalog changes.

---
catalog_templates:
  - template_name: basic.yaml
    channels: [stable, alpha]
    replaces: kube-demo.v0.0.1

·       The release-config.yaml for auto-release of the File-Based-Catalog using the basic template not only supports replacing a version but can also specify a list of versions or a range of versions to skip with the below fields.

o   replaces - the bundle that the new bundle replaces in the update graph.

o   skips - a list of bundles that should be skipped in the update graph.

o   skipRange - a range of bundles that should be skipped in the update graph.

·       Push the above new operator directory changes as a commit to the forked community operator prod repository on a new branch, e.g. kube-demo-0.0.2. Create a signed commit using the command git commit -S -m "release operator kube-demo (0.0.2)" --author=<name_email> for the DCO check. Create a new Pull request in the Community Operators Production repository using the forked community operator changes.

·       Once the Community Operator Prod pull request is created, it triggers the operator-hosted-pipeline which executes all validation checks such as pre-flight checks and other static tests. On successful completion of the validation checks/tests, the pull request automatically gets merged if it is authorized by the approved reviewers list in the CI yaml. This triggers the operator-release-pipeline workflow which creates the operator bundle image using the provided bundle files for the new operator version. The merged pull request will have the newly released bundle image URL in its Release Info.


·       This automatically triggers the creation of a new Pull Request for the Catalog update, which updates the basic.yaml file containing the operator's basic template catalog entries. The Community Operator release workflow then adds/syncs the index for the new Operator bundle version to the existing Community Operator Catalog.

NOTE: Once the Operator version (manifest files) is added to the Community-Operators-Prod repository, we cannot make any changes to an existing operator version. If you want to update bundle files for an existing Operator version, you first need to delete the operator and update the corresponding catalog entries using the below make catalogs command. Then add the same operator with updated bundle files, similar to adding a new operator version.

Community Operator Release with Manual-Release FBC

·       Manual-Release FBC for community operator is used when we have to:

o   Add existing operator version to another channel (alpha or stable).

o   The operator-hosted-pipeline workflow for the automated Catalog update pull request fails, in which case we need to manually update the basic template and full FBC catalogs (for v4.12 to v4.20).

o   Delete the released operator version which requires updating the catalogs.

·       Execute the OpenShift Operator Release workflow passing the new operator version, which publishes the operator image in the public ECR and outputs the bundle artifact from the workflow.

·       Create a new directory named after the new operator version number, e.g. 0.0.2, in the forked Community Operators Production repository inside the operators/kube-demo directory and copy all the Operator Bundle artifact files as its contents.

·       Push the above new operator directory changes as a commit to the forked community operator prod repository on a new branch, using a signed commit, e.g. git commit -S -m "release operator kube-demo (0.0.2)" --author=<name_email>.

·       Create a new Pull request in the Community Operators Production repository using the above forked community operator changes. After the pull request is merged, we get the newly released bundle image URL in the Release Info of the merged Operator pull request.

·       Manually add the catalog entry for the desired channel (namely alpha and/or stable) and olm.bundle schema entry with the bundle image URL from the previous pipeline execution in basic.yaml in catalog-templates. For example in the below snippet of catalog basic template, we add entry to alpha channel with replaces kube-demo.v0.0.1 for name kube-demo.v0.0.2. We also add newly released bundle image URL quay.io/community-operator-pipeline-prod/kube-demo:0.0.2 for olm.bundle schema entry as below.

---
schema: olm.template.basic
entries:
  - schema: olm.channel
    name: alpha
    package: kube-demo
    entries:
      - name: kube-demo.v0.0.1
      - name: kube-demo.v0.0.2
        replaces: kube-demo.v0.0.1
  - schema: olm.bundle
    image: quay.io/community-operator-pipeline-prod/kube-demo:0.0.1
  - schema: olm.bundle
    image: quay.io/community-operator-pipeline-prod/kube-demo:0.0.2

·       Update the full FBC catalogs for kube-demo in the catalogs directory by generating the full FBC for all the supported OpenShift versions (v4.12-v4.20) using the updated basic template (basic.yaml) with the below command. The command uses the kube-demo Makefile and render_catalogs.sh, which require the basic template name to be basic.yaml.

make catalogs

·       Create a new signed commit with the updated basic.yaml and the full FBC catalogs for all supported OpenShift versions of the kube-demo operator. Push the new commit to the forked Community Operators Production repository on a new branch, e.g. kube-demo-0.0.2-fbc.

·       Create a new pull request for the Catalog Update, which gets auto-merged (if the PR is created by someone on the approved reviewers list) after all the checks from the operator-hosted-pipeline workflow pass. The merged catalog update pull request executes the operator-release-pipeline workflow to release all the catalog updates to the community OperatorHub.

 

Tuesday, December 31, 2024

New age of AI Agents

 With the rise of large language models (LLMs), the software industry is going through a paradigm shift with Artificial Intelligence. AI agents are poised to revolutionize the use of LLMs and create a new age in software development. The advent of AI agents promises to unlock a new frontier of productivity and efficiency.

AI agents are systems designed to make decisions and take actions towards a specific goal without needing step-by-step instructions. They are powered by large language models (LLMs) and have the ability to act on their own, unlike traditional chatbots which can only respond to prompts.

AI agents are defined by specific roles (writer, image creator) and have access to other resources like the Internet to get information to carry out the assigned tasks. Agents are powered by large language models like GPT-4 which serve as their “brain” to understand queries, reason, and generate outputs. Multiple agents can collaborate, with one agent delegating tasks to another. Agents can iteratively refine their understanding, search for more information, and ultimately produce the desired output like a report or article summarizing their findings.

The key difference from traditional LLMs is that agents can break down higher-level goals into sub-tasks, make plans, leverage tools, and iteratively work towards solutions, exhibiting agency and autonomous behavior.

The true potential of AI agents lies not only in their individual capabilities but also in their ability to collaborate and leverage each other’s strengths. By combining multiple agents with specialized roles and goals, organizations can create powerful teams capable of tackling intricate challenges that would be insurmountable for a single entity.

Saturday, April 20, 2024

Advanced Kubernetes

Kubernetes is a container orchestration tool for automating the deployment, scaling and management of containerized applications. Kubernetes (K8s) has undoubtedly become the cornerstone of modern cloud infrastructure. In the previous Kubernetes post we went through the basics of Kubernetes; now let's dive into some advanced Kubernetes topics.

Kubernetes Operator

Kubernetes manages the complete life cycle of stateless applications in a fully automated way, without needing any extra knowledge to create/update/delete resources. Kubernetes uses the core control loop mechanism, where it observes the state, checks for differences between the current and desired state, and takes action to reach the desired state. For stateful applications, where storage resources such as a database are present, the plain Kubernetes control loop process doesn't work. A stateful application can, for example, have database replicas which have different state and identity, so these replicas need to be synchronized and sequentially updated for database consistency. This process varies between databases, from MySQL to Postgres. Kubernetes Operators are used by stateful applications to automate these complex stateful operations to update the resources.


Kubernetes Operators use the core control loop mechanism to watch for changes in the application state and make updates. An operator uses a Kubernetes Custom Resource Definition (CRD), which extends the existing K8s API, and uses app-specific knowledge to automate the lifecycle of the application it operates. Each resource in Kubernetes, such as Pods, Services, Deployments etc., is an endpoint on the Kubernetes API. This endpoint stores the collection of Kubernetes objects. The Kubernetes API is used to create, update, delete and get these Kubernetes objects. A Kubernetes CRD is a way of extending the Kubernetes API to develop custom resources and install them in any Kubernetes cluster. Once the resource is installed in a Kubernetes cluster, we can create its objects using the Kubernetes CLI (kubectl). A CRD only allows us to store and fetch data, but when combined with a custom controller we can have a custom declarative API, which allows us to keep the object's current state in sync with the object's desired state.

Custom operators are created for each application and can be found on OperatorHub.io. The Operator SDK allows developers to easily build Kubernetes native applications.

Operator consists of three parts, a controller, a state and a resource.

The controller manages the (internal or external) resources, and contains the logic to get from the current state to the desired state. It manages the resources using the Kubernetes loop mechanism: observe, check and adjust the state. The state is the state of the resource, described declaratively as a CRD in the form of YAML; it is stored in the Kubernetes control plane (etcd), and the controller uses it to ensure that the resource is at the requested state. A custom Kubernetes Operator is an application which defines an API for its state, stores the state in etcd (control plane) and uses the Kubernetes API to manage, update and delete the desired resource. The resource can be a Kubernetes native resource, e.g. a Pod, or any other external resource outside Kubernetes, e.g. a printer.

Create a new PDFDocument CustomResourceDefinition.
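
A sketch of what pdf-crd.yaml could contain, using the group, kind and short name referenced by the commands below (the spec fields are hypothetical):

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: pdfdocuments.k8s.startkubernetes.com
spec:
  group: k8s.startkubernetes.com
  scope: Namespaced
  names:
    kind: PdfDocument
    plural: pdfdocuments
    singular: pdfdocument
    shortNames: ["pdf"]
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                documentName:
                  type: string
                text:
                  type: string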

$ kubectl apply -f pdf-crd.yaml

$ kubectl get pdfdocument

$ kubectl get pdf

$ kubectl api-resources | grep pdf

$ kubectl proxy --port=8080

$ curl localhost:8080/apis/k8s.startkubernetes.com/v1/namespaces/default/pdfdocuments

$ kubectl get crd

Create a custom controller using KubeBuilder.

$ go mod init k8s.startkubernetes.com/v1

$ kubebuilder init

$ kubebuilder init --domain dev.emprovise --repo=github.com/pranav-patil/sample-oprator

$ kubebuilder create api --group k8s.startkubernetes.com --version v1 --kind PdfDocument


Admission Webhooks

In Kubernetes, any events, e.g. creating/deleting pods, scaling a deployment, etc., are requested through the API Server. Admission Webhooks allow us to intercept these requests at different stages. There are two types of admission webhooks, mutating webhooks and validating webhooks.

  1. Mutating webhooks intercept the request (object/YAML) before the object is persisted and allow us to inject changes into the request object.
  2. Validating webhooks allow us to accept or reject the request to the API server, for example for policy enforcement.

https://github.com/marcel-dempers/docker-development-youtube-series/tree/master/kubernetes/admissioncontrollers/introduction



Pod Security Admission

Pod security admission is an implementation of the Kubernetes pod security standards. Use pod security admission to restrict the behavior of pods. Pods that do not comply with the pod security admission defined globally or at the namespace level are not admitted to the cluster and do not run. Globally, the privileged profile is enforced, and the restricted profile is used for warnings and audits.

Kubernetes v1.21 and later use the Pod Security Standards (PSS) and Pod Security Admission (PSA) controls to manage Pod security, instead of the older PodSecurityPolicy. Pod Security Policy (PSP) was deprecated in Kubernetes v1.21, and removed from Kubernetes in v1.25. Pod Security Standards (PSS) and Pod Security Admission (PSA) were turned on by default in Kubernetes v1.23, and replace Pod Security Policies (PSP) in Kubernetes v1.25 and above. For Kubernetes v1.22 PodSecurity has to be enabled manually, add --feature-gates=PodSecurity=true to below configuration:
  • Kubelet level: /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
  • Kube-API level: /etc/Kubernetes/manifest/kube-apiserver.yaml
Use the below command to check the configuration in API server. There are multiple feature gates which can be enabled in API server config.

kubectl describe pod -n kube-system kube-apiserver-master-node

The PodSecurity restrictions can be applied at the Namespace, Label or Cluster level using an AdmissionConfig file. For example, a namespace labelled with a pod security mode and level:
apiVersion: v1
kind: Namespace
metadata:
 name: dev
 labels:
   pod-security.kubernetes.io/{{MODE}}: {{LEVEL}}

Below are the Pod Security Levels to be defined.
  • Privileged - Unrestricted policy with full admin permissions; wide open and allows known privilege escalations.
  • Baseline - Minimally restrictive policy that allows the default Pod configuration but blocks known privilege escalations (e.g. privileged containers).
  • Restricted - Heavily restricted policy following Pod hardening best practices; allowPrivilegeEscalation must be false.

All the Pod Security levels work at the Security context level under spec.containers[*].securityContext.

Below are the Pod Security Modes to be applied.
  • Enforce - It's a rule, if violated Pod creation is rejected.
  • Audit - If violated, it records event in audit.log, Pod creation allowed
  • Warn - If violated then warning message, Pod creation allowed

Policy application is controlled based on labels on the namespace. The following labels are supported:

  • pod-security.kubernetes.io/enforce: <policy level>
  • pod-security.kubernetes.io/enforce-version: <policy version>
  • pod-security.kubernetes.io/audit: <policy level>
  • pod-security.kubernetes.io/audit-version: <policy version>
  • pod-security.kubernetes.io/warn: <policy level>
  • pod-security.kubernetes.io/warn-version: <policy version>
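
For example, to enforce the restricted profile and warn on baseline violations in the dev namespace:

kubectl label namespace dev pod-security.kubernetes.io/enforce=restricted pod-security.kubernetes.io/warn=baseline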

The two controllers work independently using the following processes to enforce security policies:

  • The security context constraint controller may mutate some security context fields per the pod’s assigned SCC (Security context constraints). The security context constraint controller also validates the pod’s security context against the matching SCC.
  • The pod security admission controller validates the pod’s security context against the pod security standard assigned to the namespace.