Retrieval Augmented Generation (RAG): if you want to answer questions about a knowledge base that changes over time, RAG is easier than retraining the model.
The quality of RAG depends heavily on the retrieval step.
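A minimal retrieval sketch, assuming the sentence-transformers library and a toy in-memory document list: embed the documents, embed the query, rank by cosine similarity, and put the top hit into the prompt.
```python
# Minimal retrieval sketch: embed documents, embed the query,
# rank by cosine similarity, and stuff the top hit into a prompt.
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

docs = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available Monday to Friday, 9am to 5pm.",
    "Enterprise plans include a dedicated account manager.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
doc_embs = model.encode(docs, normalize_embeddings=True)

def retrieve(query: str, k: int = 1) -> list[str]:
    q_emb = model.encode([query], normalize_embeddings=True)
    scores = doc_embs @ q_emb[0]           # cosine similarity (embeddings are normalized)
    top = np.argsort(-scores)[:k]
    return [docs[i] for i in top]

context = retrieve("How long do I have to return an item?")
prompt = f"Answer using only this context:\n{context[0]}\n\nQuestion: How long do I have to return an item?"
print(prompt)
```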
--------
Data Collection
Data Cleaning, filtering
Deduplication:
Exact duplicates - hashing (SHA-1, MD5) of full documents or lines - identical web pages are removed.
Near duplicates - MinHash, SimHash, Locality-Sensitive Hashing (LSH) - two web pages that differ by small edits.
Semantic duplicates - sentence embeddings + cosine similarity - two sentences with the same meaning but different wording.
Code deduplication - AST (Abstract Syntax Tree) hashing - detecting identical code logic with different variable names.
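A small sketch of the first two ideas above, using SHA-1 hashing for exact duplicates and Jaccard similarity over character shingles as a simple stand-in for MinHash/LSH (which approximate the same measure at scale); the example documents are made up.
```python
# Deduplication sketch: exact duplicates via SHA-1 of the full document,
# near-duplicates via Jaccard similarity over character shingles.
import hashlib

def exact_dedup(docs):
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha1(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

def shingles(text, n=5):
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b):
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

docs = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps over the lazy dog.",   # exact duplicate
    "The quick brown fox jumped over the lazy dog!",  # near duplicate
]
unique = exact_dedup(docs)
print(len(unique), "documents after exact dedup")
print("near-duplicate score:", round(jaccard(unique[0], unique[1]), 2))
```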
De-identification: remove personally identifiable information (PII) and other sensitive data. Add safety filters to reduce toxic content.
Data is also collected via human annotation, where experts curate question-answer pairs and chat-style conversations and apply safety filters.
Companies such as Scale (scale.com) provide such annotated datasets.
Tokenization: split the text into tokens and then create an embedding for each token.
GPT uses Byte Pair Encoding (BPE).
Some modern designs skip conventional token splits and operate on raw bytes (e.g. UTF-8) so that no language-specific tokenizer is needed, improving multilingual and code support.
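For example, BPE with OpenAI's tiktoken library (the encoding name below is the one used by recent GPT models):
```python
# BPE tokenization with tiktoken (OpenAI's tokenizer library for GPT models).
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")     # BPE vocabulary used by recent GPT models
tokens = enc.encode("Tokenization splits text into subword units.")
print(tokens)                                  # a list of integer token ids
print([enc.decode([t]) for t in tokens])       # the corresponding subword pieces
print(enc.decode(tokens))                      # round-trips back to the original text
```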
--------
Attention Mechanism
Flash Attention
Sparse Attention
Positional Embeddings
RoPE and ALiBi
Scaling
Mixture of Experts (MoE) for Scaling - Used in DeepSeek
Activation Functions
Traditional functions: ReLU, Sigmoid
New: GELU, SwiGLU (sketched below)
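A minimal SwiGLU feed-forward block as used in LLaMA-style transformers; the dimensions are illustrative.
```python
# SwiGLU feed-forward block: down( SiLU(gate(x)) * up(x) ).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)  # gating branch
        self.up = nn.Linear(dim, hidden, bias=False)    # linear branch
        self.down = nn.Linear(hidden, dim, bias=False)  # project back to model dim

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))

x = torch.randn(2, 8, 512)          # (batch, sequence, model dim)
print(SwiGLU(512, 1024)(x).shape)   # torch.Size([2, 8, 512])
```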
Optimizer
Old: Adam
New: Muon
Use CUDA Programming for GPU coding
Llama 3 405B is trained on up to 16K H100 GPUs.
Model Pre-Training:
- Learn statistical regularities (common sentences)
- Acquire foundational linguistic structure (grammar)
- Develop abstract representations of world knowledge
The model learns to predict the next token on unstructured data.
It gains language understanding and background knowledge.
After pre-training the model can only complete text; it cannot yet answer questions.
https://arxiv.org/pdf/2302.13971
https://huggingface.co/datasets/agentlans/common-crawl-sample
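A minimal sketch of the next-token-prediction objective, with a toy embedding + linear head standing in for the transformer:
```python
# Next-token prediction: shift the sequence by one and score with cross-entropy.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, dim = 1000, 64
embed = nn.Embedding(vocab_size, dim)
lm_head = nn.Linear(dim, vocab_size)

tokens = torch.randint(0, vocab_size, (2, 16))       # (batch, sequence) of token ids
logits = lm_head(embed(tokens))                      # (batch, sequence, vocab)

inputs = logits[:, :-1, :]                           # predictions for positions 0..T-2
targets = tokens[:, 1:]                              # the "next token" at each position
loss = F.cross_entropy(inputs.reshape(-1, vocab_size), targets.reshape(-1))
print(loss.item())                                   # this is what pre-training minimizes
```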
Mid-Training: fine-tunes the model's "thinking patterns" before it learns to follow instructions.
This can include long-context documents or instruction-rich Q&A, designed to improve memory, coherence, and strategic reasoning.
Supervised Finetuning: also known as instruction tuning. Give the model question-answer pairs so that it behaves like ChatGPT.
Uses a structured dataset that demonstrates the intended behavior.
Common Use-cases:
- learn to follow instructions
- learn to answer questions about a domain
- learn to hold conversations.
Eg. Alpaca Dataset (instruction tuning)
https://huggingface.co/datasets/yahma/alpaca-cleaned
Eg. Guanaco (Conversation tuning)
https://huggingface.co/datasets/philschmid/guanaco-sharegpt-style
Eg. My Paul Graham Dataset (conversation tuning)
https://huggingface.co/datasets/pookie3000/pg_chat
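A sketch of turning Alpaca-style rows from the dataset above into SFT training text (the prompt template here is illustrative, not necessarily the exact Alpaca template):
```python
# Turn Alpaca-style (instruction, input, output) rows into SFT training text.
from datasets import load_dataset  # pip install datasets

ds = load_dataset("yahma/alpaca-cleaned", split="train")

def format_example(row):
    prompt = f"### Instruction:\n{row['instruction']}\n"
    if row["input"]:
        prompt += f"### Input:\n{row['input']}\n"
    prompt += "### Response:\n"
    return {"text": prompt + row["output"]}

ds = ds.map(format_example)      # adds a "text" column the trainer can consume
print(ds[0]["text"][:300])
```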
Post Training: Model Alignment (reinforcement learning)
Preference Finetuning: It means aligning an LLM's behavior with human preferences - teaching it not just to be correct, but to be helpful, safe, polite, and aligned with what users actually like.
Uses a dataset of preferred and rejected assistant responses.
Goals:
- Make the model better at following human preferences
- Make the model safer.
Techniques:
RLHF (Reinforcement Learning from Human Feedback) - human judgements can be inconsistent or biased, and collecting them does not scale.
RL with Verifiable Rewards: verifiable rewards are rewards that can be computed automatically and unambiguously.
a) Task Definition
b) Generate Responses
c) Evaluate responses automatically
d) Update the model using the computed rewards
DPO (Direct Preference Optimization) - a minimal loss sketch follows this list.
RLOO (REINFORCE Leave-One-Out)
Eg. descriptiveness-sentiment (TRL preference-style dataset)
https://huggingface.co/datasets/trl-internal-testing/descriptiveness-sentiment-trl-style
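A minimal sketch of the DPO loss itself, given summed log-probabilities of the chosen and rejected responses under the policy being trained and under a frozen reference model (beta and the numbers are toy values; libraries like TRL wrap this up for you):
```python
# DPO loss: push the policy to prefer the "chosen" response over the "rejected" one,
# relative to a frozen reference model, without training an explicit reward model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Implicit reward = beta * (log pi(y|x) - log pi_ref(y|x))
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Maximize the margin between chosen and rejected rewards.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Toy per-example summed log-probabilities of each full response.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss.item())
```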
Llama 3 models are produced by applying several rounds of post-training (aligning the model with human feedback) on top of a pre-trained checkpoint. Each round of post-training involves supervised fine-tuning (SFT) followed by Direct Preference Optimization (DPO) on examples collected via human annotation or generated synthetically.
Post Training: Reasoning (reinforcement learning)
Uses a dataset of prompts and expected answers; certain model outputs are promoted using reward functions.
Methods: GRPO
This is used to create reasoning models like DeepSeek-R1 and OpenAI o1.
Often done for quantitative domains (science, math, coding) because it is easy to define reward functions for them.
Eg. gsm8k (Grade school math)
https://huggingface.co/datasets/openai/gsm8k
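A sketch of a verifiable reward for GSM8K-style problems: gold answers end with "#### <number>", so the reward just compares the final number in the model output against it (the extraction regex is a simplistic assumption).
```python
# Verifiable reward for grade-school math: 1.0 if the model's final number
# matches the gold answer, else 0.0. GSM8K gold answers end with "#### <number>".
import re

def extract_final_number(text: str):
    matches = re.findall(r"-?\d[\d,]*\.?\d*", text.replace(",", ""))
    return matches[-1] if matches else None

def reward(model_output: str, gold_answer: str) -> float:
    gold = gold_answer.split("####")[-1].strip()
    pred = extract_final_number(model_output)
    return 1.0 if pred == gold else 0.0

gold = "Natalia sold 48 clips in April and half as many in May ... #### 72"
print(reward("She sold 48 + 24 = 72 clips in total.", gold))  # 1.0
print(reward("The answer is 70.", gold))                       # 0.0
```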
Evaluation: after building and fine-tuning the model, evaluation tells us how good the model is.
Proximal Policy Optimization
Proximal Policy Optimization (PPO) is used to learn a policy directly: a mapping from states (observations) to actions that maximizes the expected cumulative reward in a reinforcement learning environment.
The PPO algorithm uses two main architectures, a policy network and a value function network; both are neural networks that take an input and return an output. The policy network takes a state as input and produces an action as output: its output layer has one neuron per possible action, and each neuron gives the probability that the corresponding action is taken when the agent is in the input state.
The value function network takes a state as input and outputs a real number (a Q-value) for every possible action, quantifying how good that action is expected to be in that state. Its output layer therefore also has one neuron per possible action.
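A minimal sketch of the two networks as described here; the value head outputs one value per action to match the description above (many PPO implementations instead output a single state value V(s)), and the sizes are illustrative.
```python
# Policy and value networks for a small discrete-action environment.
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS = 4, 2   # e.g. a CartPole-like observation and action space

class PolicyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.Tanh(),
                                 nn.Linear(64, N_ACTIONS))

    def forward(self, state):
        # Probability of taking each action in this state.
        return torch.softmax(self.net(state), dim=-1)

class ValueNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.Tanh(),
                                 nn.Linear(64, N_ACTIONS))

    def forward(self, state):
        # One value per action: how good each action is expected to be here.
        return self.net(state)

state = torch.randn(1, STATE_DIM)
probs = PolicyNet()(state)                              # distribution over actions
action = torch.distributions.Categorical(probs).sample()
print(action.item(), ValueNet()(state)[0, action])
```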
We start with some initial state, which is passed into the policy network. The policy network produces a probability distribution over actions; we sample from this distribution to pick the actual next action. We take that action and receive a reward. We store the quadruple (state, action, reward, action probability) in a data store used for training the policy and value function networks. We repeat this sequence of steps for the episode, or for some fixed number of time steps within it, and store the episode's data as a batch.
For the batch of (state, action, reward, action probability) tuples, the state and action are fed to the value function network to get the Q-value, which quantifies how good we expect this action to be. We then compute the total future reward at every time step from the stored data, which quantifies how well we actually performed. The difference between the actual and expected values is called the advantage and is used to compute the loss, which is backpropagated through the value function network.
We then use the same advantage, together with the probabilities we stored earlier, to compute the loss for the policy network. This loss is backpropagated through the policy network so its parameters are updated. We repeat this process for all batches of data, making both the policy network and the value function network better over time.
- Compute the loss for Value function network
- Compute the loss for Policy Network
- Update both the Networks together
- Repeat
PPO Loss Deep Dive
We take the batch of data we stored for the episode. For each time step we compute the actual future reward as the sum of discounted future rewards from the stored data. We then compute the expected future reward by passing the state into the value function network, which produces a Q-value for every action; we look at the Q-value for the specific action in the stored tuple. So for every time step we have two numbers, and their difference is the advantage. We square the advantage at every time step and average across the batch to get the value loss.
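A sketch of that value loss on toy numbers: compute discounted returns, pick the value of the action actually taken, and average the squared difference.
```python
# Value loss for one stored episode: advantage = (discounted return) - (predicted
# value of the action actually taken); the loss is the mean squared advantage.
import torch

gamma = 0.99
rewards = torch.tensor([1.0, 1.0, 0.0, 1.0])   # reward received at each time step
actions = torch.tensor([0, 1, 1, 0])            # action taken at each time step
q_values = torch.tensor([[0.9, 0.2],            # value-network output per time step:
                         [0.1, 0.8],            # one expected value per possible action
                         [0.3, 0.6],
                         [0.7, 0.1]])

# "Actual" performance: sum of discounted future rewards from each time step onward.
returns = torch.zeros_like(rewards)
running = torch.tensor(0.0)
for t in reversed(range(len(rewards))):
    running = rewards[t] + gamma * running
    returns[t] = running

predicted = q_values[torch.arange(len(actions)), actions]  # value of the taken action
advantage = returns - predicted                            # actual minus expected
value_loss = (advantage ** 2).mean()                       # averaged over the batch
print(advantage, value_loss)
```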
Policy Network Loss
We take the batch of data we stored and pass the batch of states to the policy network to get the current probabilities of each action. In each case we only consider the probability of the action that was actually taken when we gathered the data.
We then divide the probability of that action under the current policy by the probability we recorded when collecting the data; this is the probability ratio. We multiply this ratio by the advantage computed for that time step, so for every time step in the episode we have a number.
We also clip the probability ratio to ensure we are not changing the policy too much, and multiply the clipped ratio by the advantage, so for every time step we now have two values.
We take the minimum of these two values at each time step and average across the batch; the negative of this average is the policy loss, which is backpropagated through the policy network.
The overall loss function strikes a balance between
- making effective policy updates to improve performance, and
- making cautious policy updates to maintain stability.
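A sketch of the clipped surrogate objective described above, on toy numbers (the clipping range of 0.2 is a common default):
```python
# PPO clipped surrogate: compare new vs old action probabilities, clip the ratio,
# take the element-wise minimum with the unclipped term, and negate to get a loss.
import torch

eps = 0.2                                             # clipping range
old_probs = torch.tensor([0.30, 0.60, 0.50])          # pi_old(a_t | s_t), stored during rollout
new_probs = torch.tensor([0.45, 0.55, 0.20])          # pi_new(a_t | s_t), from the current policy
advantage = torch.tensor([1.2, -0.5, 0.8])            # computed from returns and values

ratio = new_probs / old_probs                         # probability ratio per time step
unclipped = ratio * advantage
clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantage

# Maximize the cautious (minimum) objective, i.e. minimize its negative mean.
policy_loss = -torch.min(unclipped, clipped).mean()
print(ratio, policy_loss)
```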
We approximate the value function with the value neural network and the policy with the policy neural network. A deterministic policy does not explore the space; a stochastic policy explores it by sampling actions from probabilities, even though it will not always pick the highest-scoring action.