Sunday, December 31, 2023

Large Language Models (LLMs) - Empowering Generative AI

ChatGPT from OpenAI is the fastest-growing consumer application in history. ChatGPT is the chatbot variant of GPT-3.5 (Generative Pre-trained Transformer), the Large Language Model from OpenAI. The GPT model was trained on about 570 GB of text data from the internet, digital books, Wikipedia and many other sources. From this training data it learns billions of statistical connections between words, which it uses to answer any given question. The model was further tuned using reinforcement learning from human feedback to align its responses to be truthful, helpful and harmless. GPT-3.5 has 175 billion parameters spread across 96 layers of a neural network.


Artificial Neural Networks

Artificial Neural Networks (ANNs) are modeled after the neurons in the human brain. ANNs contain artificial neurons, called units, arranged in a series of layers that together constitute the whole network. A layer can have a dozen units or millions of units, depending on how complex the network must be to learn the hidden patterns in the dataset. Commonly, an Artificial Neural Network has an input layer, an output layer and one or more hidden layers. The input layer receives data from the outside world which the neural network needs to analyze or learn about. This data then passes through the hidden layers, which transform it into a representation that is useful for the output layer. Finally, the output layer produces the network's response to the input data. In most neural networks, units in one layer are connected to units in the next, and each of these connections has a weight that determines the influence of one unit on another. In a hidden layer, each neuron receives the outputs of the neurons in the previous layer, computes a weighted sum, applies an activation function and sends the result to the neurons in the next layer. These weights control how strongly each input from the previous layer contributes, and they are adjusted during the training process to improve model performance.
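
As a rough illustration of the weighted-sum-plus-activation computation described above, the following NumPy sketch (the layer sizes and values are made up for the example, not taken from any real network) passes one input vector through a single hidden layer and an output layer:

import numpy as np

def relu(x):
    return np.maximum(0, x)

# made-up example sizes: 3 inputs, 4 hidden units, 2 outputs
x = np.array([0.5, -1.2, 3.0])        # input layer activations
W1 = np.random.randn(4, 3) * 0.1      # weights: input -> hidden
b1 = np.zeros(4)
W2 = np.random.randn(2, 4) * 0.1      # weights: hidden -> output
b2 = np.zeros(2)

h = relu(W1 @ x + b1)                 # each hidden unit: weighted sum + activation
y = W2 @ h + b2                       # output layer response
print(y)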


Artificial neural networks are trained using a training set: the output produced by the ANN is compared against the human-provided label. Backpropagation is then used to make adjustments by fine-tuning the weights of the connections between units based on the error obtained (a minimal training sketch follows the list below). The depth, the number of hidden layers, and the input/output capabilities of each node are a few criteria used to distinguish neural networks. Types of neural network models are:
  • Feedforward artificial neural networks.
  • Perceptron and Multilayer Perceptron neural networks.
  • Radial basis functions artificial neural networks.
  • Recurrent neural networks.
  • Modular neural networks.
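
To make the training procedure concrete, here is a minimal feedforward-network training sketch using PyTorch; the dataset, layer sizes, number of epochs and learning rate are arbitrary placeholders rather than values from the text:

import torch
import torch.nn as nn

# toy labelled dataset: 100 samples, 3 features, binary labels (made up)
X = torch.randn(100, 3)
y = (X.sum(dim=1) > 0).float().unsqueeze(1)

model = nn.Sequential(nn.Linear(3, 8), nn.ReLU(), nn.Linear(8, 1), nn.Sigmoid())
loss_fn = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(50):
    pred = model(X)               # forward pass through the layers
    loss = loss_fn(pred, y)       # compare prediction with the provided labels
    optimizer.zero_grad()
    loss.backward()               # backpropagation computes the gradients
    optimizer.step()              # fine-tune the connection weights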

Supervised and Unsupervised Learning

Supervised machine learning is the type of machine learning in which machines are trained using well-labelled training data, on the basis of which they can predict the output. Labelled data is input data that is already tagged with the correct output. In supervised learning, the training data acts as a supervisor (teacher) that teaches the machine to predict the output correctly. The aim of supervised learning, by providing input data along with the correct output data to the model, is to find a mapping function from the input variable (x) to the output variable (y). Supervised learning can be further divided into regression and classification problems. Unsupervised machine learning uses algorithms to analyze and cluster unlabeled datasets. These algorithms discover hidden patterns or data groupings (based on similarities or differences) without the need for human intervention; the role of an unsupervised learning algorithm is to discover the underlying structure of an unlabeled dataset by itself. Clustering methods group untagged data based on similarities and differences, so when two instances appear in different groups we can infer they have dissimilar properties. In either case, the goal is to build a model that performs well on new data, and a train/test split is a model validation procedure that simulates how a model would perform on new, unseen data. Below are some of the prominent supervised machine learning techniques and related tools (a short scikit-learn sketch follows the list).
  • Regression Analysis is used in machine learning for prediction and forecasting. Linear Regression is used to handle regression problems, whereas Logistic Regression is used to handle classification problems.
  • K-nearest neighbors (kNN) is a non-parametric, supervised learning classifier which uses proximity to make classifications or predictions about the grouping of an individual data point.
  • Support vector machines (SVMs) are a set of supervised learning methods used for classification, regression and outliers detection.
  • Decision Tree (DT) is a decision support hierarchical model that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility.
  • Random Forest (RF) combines the output of multiple decision trees to reach a single result and handles both classification as well as regression problems.
  • GridSearchCV (Grid Search Cross-Validation) is a technique used in machine learning to search and find the optimal combination of hyperparameters for a given model.
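
As a small illustration of a supervised workflow combining several of the items above, here is a scikit-learn sketch (the dataset and hyperparameter grid are arbitrary examples): it splits the data into train and test sets, fits a random forest, and uses GridSearchCV to search the hyperparameters:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV

X, y = load_iris(return_X_y=True)

# simulate performance on new/unseen data with a train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# search a small (made-up) hyperparameter grid with cross-validation
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 3, 5]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X_train, y_train)

print("best params:", search.best_params_)
print("test accuracy:", search.best_estimator_.score(X_test, y_test))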

Next Token Prediction

Masked Language Modeling

GPT Model

Sequence-to-sequence (seq2seq) models are used to convert one type of sequence into another. They are based on Recurrent Neural Networks (RNNs). RNNs are good at processing sequences because they can remember previous inputs in the sequence. Adding an attention mechanism to a seq2seq model allows it to focus on specific parts of the input sequence when generating the output sequence. However, seq2seq models struggle with long-range dependencies in the input (paragraphs of text or essays) and cannot be parallelized, so they take a long time to train and run.
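
The "memory" of an RNN can be sketched in a few lines: each step combines the current input with the previous hidden state, so earlier tokens influence later ones. A minimal NumPy sketch with made-up dimensions and random data:

import numpy as np

d_in, d_hidden = 8, 16                              # made-up dimensions
Wx = np.random.randn(d_hidden, d_in) * 0.1          # input -> hidden weights
Wh = np.random.randn(d_hidden, d_hidden) * 0.1      # hidden -> hidden weights (the "memory")

h = np.zeros(d_hidden)
sequence = [np.random.randn(d_in) for _ in range(5)]  # 5 input tokens
for x_t in sequence:
    # the new state depends on the current input and on everything seen so far
    h = np.tanh(Wx @ x_t + Wh @ h)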
 
Transformer Architecture

The Transformer is an architecture that was proposed in the paper Attention Is All You Need in June 2017. A Transformer consists of an encoder, which works on the input, and a decoder, which works on the target output. The Transformer takes a sequence of tokens (words) and predicts the next word in the output sequence, iterating through the encoder layers to generate encodings. It relies on self-attention to compute representations of its input and output sequences. Self-attention allows the Transformer to attend to all positions in the input sequence, regardless of their distance, which makes it possible to handle long-range dependencies more effectively than RNNs.

The architecture of the Transformer consists of two main parts: the encoder and the decoder. The encoder reads the input sequence and creates a representation of it. The decoder uses the representation created by the encoder to generate the output sequence. Both the encoder and decoder are made up of a stack of self-attention layers, which allow the Transformer to attend to all positions in the input or output sequence regardless of their distance. The input word vectors in the encoder are projected into key, query, and value vectors. The dot product of the queries and keys provides the attention weights, which are squashed using a softmax function so that the total weighting sums to 1. The value vectors corresponding to each element are then summed according to their attention weights before being fed into subsequent layers.
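
A minimal sketch of the key/query/value computation described above (NumPy, single head, made-up dimensions and data): the query-key dot products give the attention weights, a softmax normalizes them to sum to 1, and the values are summed according to those weights:

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d_model = 4, 512                      # 4 tokens, 512-dim vectors
X = np.random.randn(seq_len, d_model)          # word vectors from the embedding layer
Wq, Wk, Wv = (np.random.randn(d_model, d_model) * 0.01 for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv               # query, key, value vectors
scores = Q @ K.T / np.sqrt(d_model)            # dot products, scaled
weights = softmax(scores, axis=-1)             # each row sums to 1
context = weights @ V                          # values summed by attention weight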


Encoder

During training the encoder takes all the input English words of the sentence simultaneously and generates word vectors for them in parallel. These word vectors eventually become context aware through the attention mechanism. To transform a word into a vector that captures much of its meaning/semantic information, word embedding algorithms are used.
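
A word-embedding lookup can be sketched as follows (PyTorch, with a made-up vocabulary size and hypothetical token IDs); each token ID maps to a learned 512-dimensional vector:

import torch
import torch.nn as nn

vocab_size, d_model = 10000, 512              # made-up vocabulary size
embedding = nn.Embedding(vocab_size, d_model)

token_ids = torch.tensor([[12, 345, 7, 0]])   # one sentence of 4 (hypothetical) token IDs
word_vectors = embedding(token_ids)           # shape: (1, 4, 512)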

The model's loss is determined by comparing the prediction made with the softmax against the labelled sentence. The common loss function used for such problems is the cross-entropy loss. The cross-entropy loss is computed for every predicted word and these are combined into a single loss, which is back-propagated through the network to update its parameters.
A number of sentences are passed through the network at once (the batch size) in order to update the weights of the entire network in one step. The batch size is a training parameter, as is the maximum number of words any sentence can contain. Passing sentences in batches enables faster training. When training neural networks with gradient descent, we could pass a single input, generate a single output prediction, compare the prediction with the label, quantify this as a loss, and back-propagate the loss to update the parameters of the network. This works for small networks, but for large networks with millions of parameters the updates become too slow. Hence the weights are updated after passing through a batch of sentences instead of after every single sentence, making mini-batch gradient descent a commonly used practice (see the sketch below).
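
The per-token cross-entropy loss and the mini-batch updates described above can be sketched like this (PyTorch; the vocabulary size, batch size, data and learning rate are placeholders): the loss is computed for every predicted word, combined into a single value, and the weights are updated once per batch:

import torch
import torch.nn as nn

vocab_size, d_model, batch_size, max_words = 10000, 512, 32, 20   # made-up settings
model = nn.Linear(d_model, vocab_size)         # stand-in for the real network
loss_fn = nn.CrossEntropyLoss(ignore_index=0)  # index 0 = padding token, excluded from the loss
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for step in range(100):                        # each step processes one mini-batch
    hidden = torch.randn(batch_size, max_words, d_model)              # placeholder activations
    targets = torch.randint(0, vocab_size, (batch_size, max_words))   # labelled next words
    logits = model(hidden)                                            # predictions per position
    # cross entropy for every predicted word, combined into a single loss
    loss = loss_fn(logits.view(-1, vocab_size), targets.view(-1))
    optimizer.zero_grad()
    loss.backward()                            # back-propagate once per mini-batch
    optimizer.step()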

Each word is converted into a vector (a 512-dimensional vector), and the whole batch is stored in a tensor of shape (batch size, i.e. number of sentences per batch) x (max words per sentence) x 512. A positional encoding of the same shape, generated from sine/cosine functions (values between -1 and +1), is then added to the word tensor. The positional encoding defines the ordering of the words passed to the encoder (a small sketch follows).
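
The sine/cosine positional encoding can be sketched as below (NumPy, following the formula from the original paper; the sequence length is a made-up example). Values lie between -1 and +1 and the result has the same shape as the word tensor, so the two can simply be added:

import numpy as np

def positional_encoding(max_words, d_model=512):
    pos = np.arange(max_words)[:, None]              # word positions 0..max_words-1
    i = np.arange(d_model)[None, :]
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((max_words, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])             # even dimensions: sine
    pe[:, 1::2] = np.cos(angle[:, 1::2])             # odd dimensions: cosine
    return pe

word_tensor = np.random.randn(8, 512)                # 8 words x 512 dims (example)
encoded = word_tensor + positional_encoding(8)       # add ordering information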
The resulting tensor is then passed through feed-forward (linear) layers to obtain query, key and value vectors: every word vector is projected into a query vector, a key vector and a value vector, giving three tensors each of shape (batch size x max words x 512). This split is required to carry out multi-head attention; "multi-head" means the same operation is performed in parallel across multiple stacks, each with its own copy of the query, key and value projections. Self-attention analyzes and builds context within the same sentence by comparing every word in the sentence with every other word. Since many sentences will not reach the maximum word length, padding tokens are added; padding tokens are not considered when computing the loss and performing back-propagation, so they are masked out of those operations. The query, key and value vectors are broken into 8 smaller (64-dimensional) vectors, one per head. The query vectors are then multiplied by the key vectors, which compares every word in the English sentence (the queries) with every word in the same sentence (the keys); the result is called the self-attention matrix.

Some scaling is typically done on the attention matrix before adding the padding mask in order to stabilize training; it prevents the values after multiplication from exploding to very large or very small numbers. Scaling means dividing every value in the self-attention tensor by a constant, the square root of the key dimension for one head (as used in the paper). Once scaling is done, the padding mask is added to the attention matrix. The padding mask prevents the padding tokens from propagating values: zero is used for pass-through and a large negative number (about -10^9, standing in for negative infinity) is used for masking, since a softmax is applied afterwards.
The softmax operation uses exponents: the exponent of zero becomes 1 (pass through) and the exponent of negative infinity becomes 0 (mask or block information).
After adding the padding mask and applying the softmax we get the attention matrix, where each value quantifies how much attention each word should pay to every other word. The value vectors computed earlier are then weighted by the attention matrix to produce value tensors that are contextually aware. Finally, the value tensors from the different heads are concatenated (see the sketch below).
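
Putting these steps together, here is a sketch of masked multi-head self-attention (NumPy; 8 heads of 64 dimensions, with made-up data and a padding mask that marks the last two positions as padding):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

max_words, d_model, n_heads = 6, 512, 8
d_head = d_model // n_heads                      # 64 dimensions per head
Q = np.random.randn(max_words, d_model)
K = np.random.randn(max_words, d_model)
V = np.random.randn(max_words, d_model)

# padding mask: 0 = pass through, -1e9 (~negative infinity) = block
pad = np.array([0, 0, 0, 0, 1, 1])               # last two positions are padding
mask = np.where(pad[None, :] == 1, -1e9, 0.0)    # broadcast over every query row

heads = []
for h in range(n_heads):                         # split into 8 small 64-dim vectors
    q = Q[:, h*d_head:(h+1)*d_head]
    k = K[:, h*d_head:(h+1)*d_head]
    v = V[:, h*d_head:(h+1)*d_head]
    scores = q @ k.T / np.sqrt(d_head)           # scale by sqrt of key dim per head
    attn = softmax(scores + mask)                # masked positions get ~zero weight
    heads.append(attn @ v)                       # context-aware value vectors

context = np.concatenate(heads, axis=-1)         # concatenate heads -> (max_words, 512)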

Decoder

The decoder initially takes just the start token as input; in subsequent iterations it is fed the words generated so far, and an end token marks the end of the sequence. The word vectors generated by the encoder for the English sentence are also passed to the decoder. The decoder predicts the next word given the first word, then the next word given the first and second words, and so forth (a minimal decoding loop is sketched below).
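
The iterative behaviour of the decoder can be sketched with a greedy decoding loop. In this sketch, decoder_step is a hypothetical stand-in for the trained decoder plus softmax, and the special-token IDs are assumptions for illustration:

START, END, MAX_LEN = 1, 2, 20        # hypothetical special-token IDs and length limit

def greedy_decode(encoder_outputs, decoder_step):
    """decoder_step(encoder_outputs, tokens) returns probabilities over the
    vocabulary for the next token; it stands in for the trained decoder + softmax."""
    tokens = [START]                       # start with just the start token
    for _ in range(MAX_LEN):
        probs = decoder_step(encoder_outputs, tokens)
        next_token = int(probs.argmax())   # pick the most likely next word
        tokens.append(next_token)
        if next_token == END:              # stop once the end token is produced
            break
    return tokens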

Positional encoding tags each word in the input with an ordering number before it is sent for processing; the information about word order is stored in the data itself rather than in the structure of the network. As the model is trained on a lot of text data, it learns how to interpret those positional encodings.

The word embeddings of the input sequence are passed to the first encoder. These are then transformed and propagated to the next encoder. The output from the last encoder in the encoder stack is passed to all the decoders in the decoder stack.


In addition to the self-attention and feed-forward layers, the decoders also have an extra encoder-decoder attention layer, which helps the decoder focus on the appropriate parts of the input sequence.

Self attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. The self-attention layer calculates a score for each pair of words in the sentence, and then uses these scores to determine how much attention to pay to each word when computing the representation of the word.

LLaMA 2 model

Jurassic Model

Amazon Titan

Mistral Models

Mistral AI is a company that produces open-source AI models with fewer parameters than GPT-4.

OpenSource Models

There are over 325,000 open-source models on Hugging Face, along with a leaderboard of the top LLMs. These models are smaller than the proprietary models and have considerably fewer parameters, but they can be fine-tuned or modified to suit anyone's requirements. Many open-source models are variations of the LLaMA 2 model provided by Meta: Vicuna is an open-source model built on top of LLaMA 2, and Hermes-13B is another such model. Bloom is a multilingual language model created by BigScience. IBM's watsonx.ai also offers several versions of Llama 2 and other foundation models. Falcon 180B is an open-access large language model that builds on the previous releases in the "Falcon" family and has 180 billion parameters.

Disadvantages of LLMs

  • Hallucinations result when LLMs are trained on incomplete, contradictory or inaccurate data, or when they misunderstand context.
  • Bias happens when the source of training data is not diverse or not representative.
  • Security and privacy issues occur when personal information contained in the training data is exposed by the model.

AI Tools

There are various tools which can be leveraged for training and working with AI Models.

Ollama is a tool used to run LLMs on a local machine. For example, the command below runs the Llama 2 model.

ollama run llama2
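
Once the model is running, it can also be queried programmatically. The sketch below assumes Ollama's default local REST endpoint (http://localhost:11434/api/generate) and the llama2 model pulled above; adjust it if your installation differs:

import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama2", "prompt": "Explain self-attention in one sentence.", "stream": False},
)
print(resp.json()["response"])   # the generated text returned by the local server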