Retrieval Augmented Generation (RAG): if you want to answer questions about a knowledge base that changes over time, RAG is easier than retraining the model.
The quality of RAG depends heavily on the retrieval step.
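A minimal retrieval sketch, assuming the sentence-transformers library and a toy in-memory document list: embed the documents, embed the query, rank by cosine similarity, and put the top hit into the prompt.
```python
# Minimal retrieval sketch: embed documents, embed the query,
# rank by cosine similarity, and stuff the top hit into a prompt.
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

docs = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available Monday to Friday, 9am to 5pm.",
    "Enterprise plans include a dedicated account manager.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
doc_embs = model.encode(docs, normalize_embeddings=True)

def retrieve(query: str, k: int = 1) -> list[str]:
    q_emb = model.encode([query], normalize_embeddings=True)
    scores = doc_embs @ q_emb[0]           # cosine similarity (embeddings are normalized)
    top = np.argsort(-scores)[:k]
    return [docs[i] for i in top]

context = retrieve("How long do I have to return an item?")
prompt = f"Answer using only this context:\n{context[0]}\n\nQuestion: How long do I have to return an item?"
print(prompt)
```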
--------
Data Collection
Data Cleaning, filtering
Deduplication:
Exact duplicates - hashing (SHA-1, MD5) of full documents or lines - identical web pages are removed.
Near duplicates - MinHash, SimHash, Locality-Sensitive Hashing (LSH) - two web pages that differ by small edits.
Semantic duplicates - sentence embeddings + cosine similarity - two sentences with the same meaning but different wording.
Code deduplication - AST (Abstract Syntax Tree) hashing - detecting identical code logic with different variable names.
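A small sketch of the first two ideas above, using SHA-1 hashing for exact duplicates and Jaccard similarity over character shingles as a simple stand-in for MinHash/LSH (which approximate the same measure at scale); the example documents are made up.
```python
# Deduplication sketch: exact duplicates via SHA-1 of the full document,
# near-duplicates via Jaccard similarity over character shingles.
import hashlib

def exact_dedup(docs):
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.sha1(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

def shingles(text, n=5):
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b):
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

docs = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps over the lazy dog.",   # exact duplicate
    "The quick brown fox jumped over the lazy dog!",  # near duplicate
]
unique = exact_dedup(docs)
print(len(unique), "documents after exact dedup")
print("near-duplicate score:", round(jaccard(unique[0], unique[1]), 2))
```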
De-identification: remove personally identifiable information (PII) and other sensitive data. Add safety filters to reduce toxic content.
Data is also collected via human annotation, where experts curate question-answer pairs and chat-style conversations and apply safety filters.
Companies such as Scale (scale.com) provide such annotated datasets.
Tokenization: split the text into tokens and then create an embedding for each token.
GPT uses Byte Pair Encoding (BPE).
Some modern designs skip conventional token splits and operate on raw bytes (e.g. UTF-8) so that no language-specific tokenizer is needed, improving multilingual and code support.
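For example, BPE with OpenAI's tiktoken library (the encoding name below is the one used by recent GPT models):
```python
# BPE tokenization with tiktoken (OpenAI's tokenizer library for GPT models).
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")     # BPE vocabulary used by recent GPT models
tokens = enc.encode("Tokenization splits text into subword units.")
print(tokens)                                  # a list of integer token ids
print([enc.decode([t]) for t in tokens])       # the corresponding subword pieces
print(enc.decode(tokens))                      # round-trips back to the original text
```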
--------
Attention Mechanism
Flash Attention
Sparse Attention
Positional Embeddings
RoPE and ALiBi
Scaling
Mixture of Experts (MoE) for Scaling - Used in DeepSeek
Activation Functions
Traditional functions: ReLU, Sigmoid
New: GELU, SwiGLU (sketched below)
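A minimal SwiGLU feed-forward block as used in LLaMA-style transformers; the dimensions are illustrative.
```python
# SwiGLU feed-forward block: down( SiLU(gate(x)) * up(x) ).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)  # gating branch
        self.up = nn.Linear(dim, hidden, bias=False)    # linear branch
        self.down = nn.Linear(hidden, dim, bias=False)  # project back to model dim

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))

x = torch.randn(2, 8, 512)          # (batch, sequence, model dim)
print(SwiGLU(512, 1024)(x).shape)   # torch.Size([2, 8, 512])
```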
Optimizer
Old: Adam
New: Muon
Use CUDA Programming for GPU coding
Llama 3 405B is trained on up to 16K H100 GPUs.
Model Pre-Training:
- Learn statistical regularities (common sentences)
- Acquire foundational linguistic structure (grammar)
- Develop abstract representations of world knowledge
The model learns to predict the next token on unstructured data.
It gains language understanding and background knowledge.
After pre-training the model can only complete text; it cannot yet answer questions.
https://arxiv.org/pdf/2302.13971
https://huggingface.co/datasets/agentlans/common-crawl-sample
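A minimal sketch of the next-token-prediction objective, with a toy embedding + linear head standing in for the transformer:
```python
# Next-token prediction: shift the sequence by one and score with cross-entropy.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, dim = 1000, 64
embed = nn.Embedding(vocab_size, dim)
lm_head = nn.Linear(dim, vocab_size)

tokens = torch.randint(0, vocab_size, (2, 16))       # (batch, sequence) of token ids
logits = lm_head(embed(tokens))                      # (batch, sequence, vocab)

inputs = logits[:, :-1, :]                           # predictions for positions 0..T-2
targets = tokens[:, 1:]                              # the "next token" at each position
loss = F.cross_entropy(inputs.reshape(-1, vocab_size), targets.reshape(-1))
print(loss.item())                                   # this is what pre-training minimizes
```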
Mid-Training: fine-tunes the model's "thinking patterns" before it learns to follow instructions.
This can include long-context documents or instruction-rich Q&A, designed to improve memory, coherence, and strategic reasoning.
Supervised Finetuning: also known as instruction tuning. Give the model question-answer pairs so that it behaves like ChatGPT.
Uses a structured dataset that demonstrates the intended behavior.
Common Use-cases:
- learn to follow instructions
- learn to answer questions about a domain
- learn to hold conversations.
Eg. Alpaca Dataset (instruction tuning)
https://huggingface.co/datasets/yahma/alpaca-cleaned
Eg. Guanaco (Conversation tuning)
https://huggingface.co/datasets/philschmid/guanaco-sharegpt-style
Eg. My Paul Graham Dataset (conversation tuning)
https://huggingface.co/datasets/pookie3000/pg_chat
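A sketch of turning Alpaca-style rows from the dataset above into SFT training text (the prompt template here is illustrative, not necessarily the exact Alpaca template):
```python
# Turn Alpaca-style (instruction, input, output) rows into SFT training text.
from datasets import load_dataset  # pip install datasets

ds = load_dataset("yahma/alpaca-cleaned", split="train")

def format_example(row):
    prompt = f"### Instruction:\n{row['instruction']}\n"
    if row["input"]:
        prompt += f"### Input:\n{row['input']}\n"
    prompt += "### Response:\n"
    return {"text": prompt + row["output"]}

ds = ds.map(format_example)      # adds a "text" column the trainer can consume
print(ds[0]["text"][:300])
```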
Post Training: Model Alignment (reinforcement learning)
Preference Finetuning: It means aligning an LLM's behavior with human preferences - teaching it not just to be correct, but to be helpful, safe, polite, and aligned with what users actually like.
Uses a dataset of preferred and rejected assistant responses.
Goals:
- Make the model better at following human preferences
- Make the model safer.
Techniques:
RLHF (Reinforcement Learning from Human Feedback) - human judgements can be inconsistent or biased, and collecting them does not scale.
RL with Verifiable Rewards: verifiable rewards are rewards that can be computed automatically and unambiguously.
a) Task Definition
b) Generate Responses
c) Evaluate responses automatically
d) Update the model using the computed rewards
DPO (Direct Preference Optimization) - a minimal loss sketch follows this list.
RLOO (REINFORCE Leave-One-Out)
Eg. descriptiveness-sentiment (TRL preference-style dataset)
https://huggingface.co/datasets/trl-internal-testing/descriptiveness-sentiment-trl-style
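A minimal sketch of the DPO loss itself, given summed log-probabilities of the chosen and rejected responses under the policy being trained and under a frozen reference model (beta and the numbers are toy values; libraries like TRL wrap this up for you):
```python
# DPO loss: push the policy to prefer the "chosen" response over the "rejected" one,
# relative to a frozen reference model, without training an explicit reward model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Implicit reward = beta * (log pi(y|x) - log pi_ref(y|x))
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Maximize the margin between chosen and rejected rewards.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Toy per-example summed log-probabilities of each full response.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss.item())
```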
Llama 3 models are produced by applying several rounds of post-training (aligning the model with human feedback) on top of a pre-trained checkpoint. Each round of post-training involves supervised fine-tuning (SFT) followed by Direct Preference Optimization (DPO) on examples collected via human annotation or generated synthetically.
Post Training: Reasoning (reinforcement learning)
Uses a dataset of prompts and expected answers; certain model outputs are promoted using reward functions.
Methods: GRPO
This is used to create reasoning models like DeepSeek-R1 and OpenAI o1.
Often done for quantitative domains (science, math, coding) because it is easy to define reward functions for them.
Eg. gsm8k (Grade school math)
https://huggingface.co/datasets/openai/gsm8k
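A sketch of a verifiable reward for GSM8K-style problems: gold answers end with "#### <number>", so the reward just compares the final number in the model output against it (the extraction regex is a simplistic assumption).
```python
# Verifiable reward for grade-school math: 1.0 if the model's final number
# matches the gold answer, else 0.0. GSM8K gold answers end with "#### <number>".
import re

def extract_final_number(text: str):
    matches = re.findall(r"-?\d[\d,]*\.?\d*", text.replace(",", ""))
    return matches[-1] if matches else None

def reward(model_output: str, gold_answer: str) -> float:
    gold = gold_answer.split("####")[-1].strip()
    pred = extract_final_number(model_output)
    return 1.0 if pred == gold else 0.0

gold = "Natalia sold 48 clips in April and half as many in May ... #### 72"
print(reward("She sold 48 + 24 = 72 clips in total.", gold))  # 1.0
print(reward("The answer is 70.", gold))                       # 0.0
```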
Evaluation: after building and fine-tuning the model, evaluation tells us how good the model is.
Proximal Policy Optimization
Proximal Policy Optimization (PPO) is used to learn a policy directly: a mapping from states (observations) to actions that maximizes the expected cumulative reward in a reinforcement learning environment.
The PPO algorithm uses two main architectures, a policy network and a value function network; both are neural networks that take an input and return an output. The policy network takes a state as input and produces an action as output: its output layer has one neuron per possible action, and each neuron gives the probability that the corresponding action is taken when the agent is in the input state.
The value function network takes a state as input and outputs a real number (a Q-value) for every possible action, quantifying how good that action is expected to be in that state. Its output layer therefore also has one neuron per possible action.
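A minimal sketch of the two networks as described here; the value head outputs one value per action to match the description above (many PPO implementations instead output a single state value V(s)), and the sizes are illustrative.
```python
# Policy and value networks for a small discrete-action environment.
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS = 4, 2   # e.g. a CartPole-like observation and action space

class PolicyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.Tanh(),
                                 nn.Linear(64, N_ACTIONS))

    def forward(self, state):
        # Probability of taking each action in this state.
        return torch.softmax(self.net(state), dim=-1)

class ValueNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.Tanh(),
                                 nn.Linear(64, N_ACTIONS))

    def forward(self, state):
        # One value per action: how good each action is expected to be here.
        return self.net(state)

state = torch.randn(1, STATE_DIM)
probs = PolicyNet()(state)                              # distribution over actions
action = torch.distributions.Categorical(probs).sample()
print(action.item(), ValueNet()(state)[0, action])
```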
We start with some initial state, which is passed into the policy network. The policy network produces a probability distribution over actions; we sample from this distribution to pick the actual next action. We take that action and receive a reward. We store the quadruple (state, action, reward, action probability) in a data store used for training the policy and value function networks. We repeat this sequence of steps for the episode, or for some fixed number of time steps within it, and store the episode's data as a batch.
For the batch of (state, action, reward, action probability) tuples, the state and action are fed to the value function network to get the Q-value, which quantifies how good we expect this action to be. We then compute the total future reward at every time step from the stored data, which quantifies how well we actually performed. The difference between the actual and expected values is called the advantage and is used to compute the loss, which is backpropagated through the value function network.
We then use the same advantage, together with the probabilities we stored earlier, to compute the loss for the policy network. This loss is backpropagated through the policy network so its parameters are updated. We repeat this process for all batches of data, making both the policy network and the value function network better over time.
- Compute the loss for Value function network
- Compute the loss for Policy Network
- Update both the Networks together
- Repeat
PPO Loss Deep Dive
We take the batch of data we stored for the episode. For each time step we compute the actual future reward as the sum of discounted future rewards from the stored data. We then compute the expected future reward by passing the state into the value function network, which produces a Q-value for every action; we look at the Q-value for the specific action in the stored tuple. So for every time step we have two numbers, and their difference is the advantage. We square the advantage at every time step and average across the batch to get the value loss.
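A sketch of that value loss on toy numbers: compute discounted returns, pick the value of the action actually taken, and average the squared difference.
```python
# Value loss for one stored episode: advantage = (discounted return) - (predicted
# value of the action actually taken); the loss is the mean squared advantage.
import torch

gamma = 0.99
rewards = torch.tensor([1.0, 1.0, 0.0, 1.0])   # reward received at each time step
actions = torch.tensor([0, 1, 1, 0])            # action taken at each time step
q_values = torch.tensor([[0.9, 0.2],            # value-network output per time step:
                         [0.1, 0.8],            # one expected value per possible action
                         [0.3, 0.6],
                         [0.7, 0.1]])

# "Actual" performance: sum of discounted future rewards from each time step onward.
returns = torch.zeros_like(rewards)
running = torch.tensor(0.0)
for t in reversed(range(len(rewards))):
    running = rewards[t] + gamma * running
    returns[t] = running

predicted = q_values[torch.arange(len(actions)), actions]  # value of the taken action
advantage = returns - predicted                            # actual minus expected
value_loss = (advantage ** 2).mean()                       # averaged over the batch
print(advantage, value_loss)
```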
Policy Network Loss
We take the batch of data we stored and pass the batch of states to the policy network to get the current probabilities of each action. In each case we only consider the probability of the action that was actually taken when we gathered the data.
We then divide the probability of that action under the current policy by the probability we recorded when collecting the data; this is the probability ratio. We multiply this ratio by the advantage computed for that time step, so for every time step in the episode we have a number.
We also clip the probability ratio to ensure we are not changing the policy too much, and multiply the clipped ratio by the advantage, so for every time step we now have two values.
We take the minimum of these two values at each time step and average across the batch; the negative of this average is the policy loss, which is backpropagated through the policy network.
The overall loss function strikes a balance between
- making effective policy updates to improve performance, and
- making cautious policy updates to maintain stability.
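A sketch of the clipped surrogate objective described above, on toy numbers (the clipping range of 0.2 is a common default):
```python
# PPO clipped surrogate: compare new vs old action probabilities, clip the ratio,
# take the element-wise minimum with the unclipped term, and negate to get a loss.
import torch

eps = 0.2                                             # clipping range
old_probs = torch.tensor([0.30, 0.60, 0.50])          # pi_old(a_t | s_t), stored during rollout
new_probs = torch.tensor([0.45, 0.55, 0.20])          # pi_new(a_t | s_t), from the current policy
advantage = torch.tensor([1.2, -0.5, 0.8])            # computed from returns and values

ratio = new_probs / old_probs                         # probability ratio per time step
unclipped = ratio * advantage
clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantage

# Maximize the cautious (minimum) objective, i.e. minimize its negative mean.
policy_loss = -torch.min(unclipped, clipped).mean()
print(ratio, policy_loss)
```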
We approximate the value function with the value neural network and the policy with the policy neural network. A deterministic policy does not explore the space; a stochastic policy explores it by sampling actions from probabilities, even though it will not always pick the highest-scoring action.