ChatGPT from OpenAI is the fastest-growing consumer application in history. ChatGPT is the chatbot variant of GPT-3.5 (Generative Pre-trained Transformer), the Large Language Model from OpenAI. The GPT model was trained on roughly 570 GB of text data from the internet, digital books, Wikipedia and more. It builds billions of connections between words from this training data, which it uses to answer any given question. The model was further trained with reinforcement learning from human feedback to align its responses to be truthful, helpful and harmless. GPT-3.5 has 175 billion parameters spread across 96 layers in a neural network.
Artificial Neural Networks
Artificial Neural Networks (ANNs) are modeled after the neurons in the human brain. ANNs contain artificial neurons, called units, arranged in a series of layers that together constitute the whole network. A layer can have a dozen units or millions of units, depending on how complex the network needs to be to learn the hidden patterns in the dataset. Commonly, an ANN has an input layer, an output layer, and one or more hidden layers. The input layer receives data from the outside world which the network needs to analyze or learn about. This data then passes through one or more hidden layers that transform the input into something valuable for the output layer. Finally, the output layer produces the network's response to the input data. In most neural networks, units are interconnected from one layer to the next, and each of these connections has a weight that determines the influence of one unit on another. In a hidden layer, each neuron receives input from the neurons in the previous layer, computes a weighted sum, and sends the result to the neurons in the next layer. Because the connections are weighted, the effect of each input from the previous layer can be amplified or diminished; these weights are adjusted during the training process to improve model performance.
Artificial neural networks are trained using a training set. The output produced by the ANN is compared against a human-provided label, and backpropagation is used to fine-tune the weights of the connections between units based on the resulting error. The depth, number of hidden layers, and I/O capabilities of each node are a few criteria used to categorize neural networks. Types of neural network models are:
- Feedforward artificial neural networks.
- Perceptron and Multilayer Perceptron neural networks.
- Radial basis functions artificial neural networks.
- Recurrent neural networks.
- Modular neural networks.
Supervised and Unsupervised Learning
Supervised machine learning is the type of machine learning in which machines are trained using well-labelled training data, on the basis of which they can predict the output. Labelled data is input data that is already tagged with the correct output. In supervised learning, the training data provided to the machine acts as a supervisor (teacher) that teaches the machine to predict the output correctly. The aim of supervised learning, by providing input data along with the correct output data to the model, is to find a mapping function from the input variable (x) to the output variable (y). Supervised learning problems can be further divided into regression and classification problems. Unsupervised machine learning, in contrast, uses algorithms to analyze and cluster unlabeled datasets. These algorithms discover hidden patterns or data groupings (based on similarities or differences) without the need for human intervention; the role of an unsupervised learning algorithm is to discover the underlying structure of an unlabeled dataset by itself. Clustering methods group untagged data based on their similarities and differences, so when two instances land in different groups we can infer they have dissimilar properties. In either case, the goal is to build a model that performs well on new data, and a train/test split is a model validation procedure that simulates how a model would perform on new/unseen data. Below are some of the prominent supervised machine learning techniques and related tools.
- Regression Analysis is used in machine learning for prediction and forecasting. Linear Regression is used to handle regression problems whereas Logistic regression is used to handle the classification problems.
- K-nearest neighbors (kNN), is a non-parametric, supervised learning classifier, which uses proximity to make classifications or predictions about the grouping of an individual data point.
- Support vector machines (SVMs) are a set of supervised learning methods used for classification, regression and outliers detection.
- Decision Tree (DT) is a decision support hierarchical model that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility.
- Random Forest (RF) combines the output of multiple decision trees to reach a single result and handles both classification as well as regression problems.
- GridSearchCV (Grid Search Cross-Validation) is a technique used in machine learning to search for the optimal combination of hyperparameters for a given model. A short scikit-learn sketch combining a train/test split with a grid search follows this list.
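As an illustration of the train/test split and grid search ideas above, here is a minimal scikit-learn sketch. The dataset and the parameter grid are hypothetical choices for demonstration, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Hold out 20% of the data to simulate performance on new/unseen data.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Search over a small, hypothetical hyperparameter grid using 5-fold cross-validation.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5, None]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)            # best hyperparameter combination found
print(search.score(X_test, y_test))   # accuracy on the held-out test data
```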
Next Token Prediction
Masked Language Modeling
GPT Model
Sequence-to-sequence (seq2seq) models are used to convert one type of sequence to another. They are based on Recurrent Neural Networks (RNNs), which are good at processing sequences because they can remember previous inputs in the sequence. Adding an attention mechanism to a seq2seq model allows it to focus on specific parts of the input sequence when generating the output sequence. However, seq2seq models struggle with long-range dependencies in the input (paragraphs of text or essays) and cannot be parallelized, so they take a long time to process.
Transformer Architecture
The Transformer is an architecture that was proposed in the paper "Attention Is All You Need" in June 2017. A Transformer consists of an encoder, which works on the input, and a decoder, which works on the target output. The transformer takes a sequence of tokens (words) and predicts the next word in the output sequence, iterating through the encoder layers to generate encodings. The Transformer relies on self-attention to compute representations of its input and output sequences. Self-attention allows the Transformer to attend to all positions in the input sequence, regardless of their distance, which makes it possible to handle long-range dependencies more effectively than RNNs.
The architecture of the Transformer consists of two main parts: the encoder and the decoder. The encoder reads the input sequence and creates a representation of it; the decoder uses that representation to generate the output sequence. Both the encoder and decoder are made up of a stack of self-attention layers, which allow the Transformer to attend to all positions in the input or output sequence, regardless of their distance. The input word vectors in the encoder are projected into key, query, and value vectors. The dot product of a key and a query provides an attention weight, which is normalized using a softmax function across all attention weights so that the total weighting sums to 1. The value vectors corresponding to each element are then summed according to their attention weights before being fed into subsequent layers.
Encoder
During training the encoder takes the input English words of the sentence simultaneously and generates word vectors simultaneously. These word vectors eventually become context aware through the attention mechanism. To transform a word into a vector that captures much of the word's meaning/semantic information, word embedding algorithms are used.
The model's loss is determined by comparing the prediction made after the softmax with the labelled sentence. The common loss function used for such problems is cross entropy loss. Cross entropy loss is computed for every predicted word and the losses are added up to get a single loss. This loss is back-propagated through the network to update its parameters.
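A minimal PyTorch-style sketch of this per-token cross entropy; the shapes and vocabulary size are made up purely for illustration.

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: 2 sentences per batch, 5 words per sentence, 10,000-word vocabulary.
batch_size, seq_len, vocab_size = 2, 5, 10_000
logits = torch.randn(batch_size, seq_len, vocab_size, requires_grad=True)  # predictions before softmax
labels = torch.randint(0, vocab_size, (batch_size, seq_len))               # indices of the correct words

# Cross entropy is computed for every predicted word and combined into a single loss,
# which is then back-propagated to update the network parameters.
loss = F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1))
loss.backward()
```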
A number of sentences is passed through the network at once (the batch size) in order to update the weights of the entire network in one step. The batch size is a training parameter, as is the maximum number of words any sentence can contain. Passing sentences in batches enables faster training. When training neural networks with gradient descent we could pass a single input, generate a single output prediction, compare the prediction with the label and quantify the difference as a loss; the loss is then back-propagated to update the parameters of the network. This works for small networks, but for large networks with millions of parameters the updates become slow. Hence the weights are updated after passing through a batch of sentences instead of after every single sentence, which makes mini-batch gradient descent a commonly used practice.
Each word is converted into a vector (a 512-dimensional vector), stored in a tensor of shape (batch size, i.e. number of sentences per batch) x (max words per sentence) x 512. A positional encoding of the same tensor shape, generated from sine/cosine functions (values between -1 and +1), is then added to the word tensor. The positional encoding defines the ordering of the words passed to the encoder.
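A NumPy sketch of sinusoidal positional encodings added to the word tensor; the batch size and sentence length are hypothetical, and this follows the common sine/cosine formulation rather than any specific implementation.

```python
import numpy as np

def positional_encoding(max_words: int, d_model: int = 512) -> np.ndarray:
    """Sine/cosine position encodings with values between -1 and +1 (one row per position)."""
    positions = np.arange(max_words)[:, None]                    # (max_words, 1)
    dims = np.arange(d_model)[None, :]                           # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    encoding = np.zeros((max_words, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])                  # even dimensions use sine
    encoding[:, 1::2] = np.cos(angles[:, 1::2])                  # odd dimensions use cosine
    return encoding

# Hypothetical batch: 32 sentences, up to 20 words each, 512-dimensional word vectors.
word_tensor = np.random.randn(32, 20, 512)
word_tensor = word_tensor + positional_encoding(20, 512)         # broadcast over the batch
```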
The resulting tensor is then passed through feed-forward layers to get query vectors, key vectors and value vectors: every word vector is projected into a query vector, a key vector and a value vector. Hence we have three tensors, each with the original (batch size x max words x 512) dimensions. This split is required to carry out multi-head attention; multi-head means the operations are performed in parallel in multiple stacks, each working with its own copy of the query, key and value vectors. Self-attention analyzes and builds context within the same sentence by comparing every word in the sentence with every other word. Since many sentences will not have exactly the max word count, padding tokens are added. Padding tokens are not considered when computing the loss and performing back-propagation, so they are masked out of those operations. The query, key and value vectors are broken into 8 smaller (64-dimensional) vectors, one per head. The query vectors are then multiplied with the key vectors; this compares every word in the English sentence (the query) with every word in the same English sentence (the key), and the result is called the self-attention matrix.
Some scaling is typically done on the attention matrix before adding the padding mask in order to stabilize training, preventing the values after multiplication from exploding to very high or very low numbers. Scaling means dividing every value in the self-attention tensor by a constant, the square root of the key dimension for one head (as used in the paper). Once scaling is complete, the padding mask is added to the attention matrix. The padding mask prevents the padding tokens from propagating values: zero is used for pass-through and a very large negative number approximating negative infinity (e.g. -10^9) is used for masking, since a softmax is performed afterwards.
The softmax operation uses exponents, where the exponent of zero becomes 1 (pass through) and the exponent of negative infinity becomes 0 (mask or block information).
After applying the padding mask and then the softmax operation we get the attention matrix. Each value in the attention matrix quantifies how much attention each word should pay to each other word. The value vectors computed earlier are weighted by the attention matrix to produce value tensors that are contextually aware. Finally, the value tensors from the different heads are concatenated.
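A minimal NumPy sketch of one head's scaling, padding mask, softmax and value weighting as described above; the token count, head size and toy mask are made-up values.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v, padding_mask=None):
    """q, k, v: (seq_len, d_head). padding_mask: (seq_len, seq_len) of 0 (pass) or -1e9 (block)."""
    d_head = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_head)       # compare every word with every other word, then scale
    if padding_mask is not None:
        scores = scores + padding_mask       # -1e9 entries become ~0 after the softmax
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax -> attention matrix
    return weights @ v                       # contextually aware value tensor

# Toy example: 4 tokens (the last one is padding), one 64-dimensional head.
q = np.random.randn(4, 64)
k = np.random.randn(4, 64)
v = np.random.randn(4, 64)
mask = np.zeros((4, 4))
mask[:, 3] = -1e9                            # block attention to the padding token
out = scaled_dot_product_attention(q, k, v, mask)
```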
Decoder
The decoder initially takes just the start token as input, followed by the resulting words in subsequent iterations, with an end token added at the end. The English word vectors generated by the encoder are also passed to the decoder. The decoder predicts the next word given the first word, then the next word given the first and second words, and so forth.
Positional encoding tags each word in the input with an ordering number before it is sent for processing, storing information about word order in the data itself rather than within the structure of the network. As the model is trained on a lot of text data, it learns how to interpret those position encodings.
The word embeddings of the input sequence are passed to the first encoder. These are then transformed and propagated to the next encoder. The output from the last encoder in the encoder stack is passed to all the decoders in the decoder stack as shown in the figure below.
Self attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. The self-attention layer calculates a score for each pair of words in the sentence, and then uses these scores to determine how much attention to pay to each word when computing the representation of the word.
Transformer Design
Transformers: Attention and Feed Forward neural network
GPT = Generative Pre-Trained Transformer
The input is broken into little pieces called tokens, which for text are usually words, common word combinations, or pieces of words. For image or sound inputs, tokens are little chunks of the image or sound. The input then has to be encoded as a vector, i.e. an array of real numbers.
Each of these tokens is associated with a high-dimensional vector (a long list of values) which encodes the meaning of that chunk/piece. If these vectors are represented in a high-dimensional coordinate space, words with similar meanings tend to land on vectors that are close to each other in that space. In a high-dimensional space different directions can encode different kinds of meaning, e.g. one particular direction encodes gender information while another encodes location information. Many other distinct directions in this very high-dimensional space could correspond to other features which the model could represent.
The sequence of vectors then passes through an attention block which allows the vectors to talk to each other and pass information back and forth to update their values. For example, the meaning of the word "model" in the phrase "a machine learning model" is different from its meaning in the phrase "a fashion model". The attention block determines which words in the context are relevant to updating the meanings of which other words and how those meanings should be updated. Word meanings are encoded in the entries of the vectors. The vectors then pass through a multi-layer perceptron or feed-forward layer in parallel; this is similar to asking a long list of questions about each vector and updating it based on the answers.
After multiple repetitions (iterations) of the attention and multi-layer perceptron steps, the essential meaning of the input text is baked into the last vector in the sequence. Each vector soaks up enough information, both from the context of all the other words in the input and from the general knowledge baked into the model weights during training. The final output layer is a list of numbers representing the probability distribution over all possible next tokens for the text.
These model parameters are almost always referred to as weights, since the only way they interact with the data being processed is through weighted sums. Matrices are filled with tunable parameters (i.e. weights) that transform vectors drawn from the data being processed, and each component of a matrix-vector product is like a weighted sum. The matrices are organized into different categories, e.g. Embedding, Key, Query, Value, Output, Up-Projection, Down-Projection, Unembedding.
The embedding matrix covers a predefined vocabulary listing all possible words (e.g. 50,000 words), where each word is represented by a single column. These columns determine what vector each word turns into in the first step. Words are embedded very geometrically as points in some high-dimensional space with many dimensions; GPT-3, for example, uses 12,288 dimensions. As a model tweaks and tunes its weights during training to determine how exactly words get embedded as vectors, it tends to settle on a set of embeddings where directions in the space have a kind of semantic meaning. For a simple word-to-vector model, if we search for all the words whose embeddings are closest to "tower", we get words with very similar tower-ish vibes such as "skyscraper" and "dome".
If we take the difference between the vectors for woman and man, it is very similar to the difference between queen and king. So to find a female monarch we can take king, add the (woman minus man) direction, and then search for the embedding closest to that point. Hence during training the model found it advantageous to choose embeddings such that one direction in this space encodes gender information.
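A toy NumPy sketch of this vector arithmetic. The 3-dimensional embeddings below are made up purely to show the king + (woman - man) idea; real embeddings have thousands of dimensions and are learned during training.

```python
import numpy as np

# Made-up toy embeddings; a real model learns these during training.
emb = {
    "man":   np.array([0.9, 0.1, 0.2]),
    "woman": np.array([0.9, 0.1, 0.8]),
    "king":  np.array([0.2, 0.9, 0.2]),
    "queen": np.array([0.2, 0.9, 0.8]),
}

def closest(vec, vocab):
    """Return the word whose embedding has the highest cosine similarity to vec."""
    cosine = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(vocab, key=lambda w: cosine(emb[w], vec))

target = emb["king"] + (emb["woman"] - emb["man"])    # move king in the "gender" direction
print(closest(target, ["man", "woman", "king", "queen"]))  # -> "queen" with these toy values
```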
The dot product of two vectors can be used as a measure of how well they align, since a dot product multiplies the vector components pairwise and adds the results, similar to a weighted sum.
In the very first step we create an array of vectors based on the input text; each one is simply taken from the embedding matrix, so each initially encodes the meaning of a single word without any input from its surroundings.
The network can only process a fixed number of vectors at a time, known as the context size. The context size of GPT-3 is 2048, for example, so the data is represented by an array of 2048 columns, each of which has 12,288 dimensions. The context size limits how much text the transformer can incorporate when it is making a prediction of the next word.
The entries of the unembedding matrix begin at random but are learned during the training process. The unembedding matrix has one row for each word in the vocabulary, and each row has the same number of elements as the embedding dimension. It is very similar to the embedding matrix, just with the order swapped, so it adds another 617 million parameters to the network.
Softmax is used to ensure that the probability assigned to each possible next word lies between 0 and 1, and that the whole sequence of probabilities adds up to 1.
Softmax is the standard way to turn an arbitrary list of numbers into a valid distribution, in such a way that the largest values end up closest to 1 and the smaller values end up very close to 0.
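A minimal NumPy softmax matching this description; the input numbers are arbitrary examples.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Turn an arbitrary list of numbers into a probability distribution that sums to 1."""
    exps = np.exp(logits - logits.max())   # subtract the max for numerical stability
    return exps / exps.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))        # largest value gets the highest probability
print(softmax(np.array([2.0, 1.0, 0.1])).sum())  # 1.0
```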
The first step in a transformer is to associate each token with a high-dimensional vector, called its embedding. The directions in this high-dimensional space of all possible embeddings can correspond with semantic meaning.
The aim of a transformer is to progressively adjust these embeddings so that they don't merely encode an individual word, but instead they bake in some much, much richer contextual meaning. There are multiple distinct directions in this embedding space encoding the multiple distinct meanings of the word, and that a well-trained attention block calculates what you need to add to the generic embedding to move it to one of these specific directions, as a function of the context.
The computation performed to produce a prediction of the next token is entirely a function of the last vector in the sequence. For the model to accurately predict the next word, that final vector, which began as the simple embedding of a single word, must have been updated by all of the attention blocks to represent much more than any individual word, somehow encoding all of the information from the full context window that is relevant to predicting the next word.
The initial embedding for each word is some high dimensional vector that only encodes the meaning of that particular word along with its position with no other context.
The goal is to have a series of computations produce a new refined set of embeddings where, for example, those corresponding to the nouns have ingested the meaning from their corresponding adjectives. The computations are preferred to be matrix-vector products, where the matrices are full of tuneable weights, things that the model will learn based on data.
In the first step, this process can be pictured as each noun asking the question "are there any adjectives in front of me?". This question is encoded as a query vector with 128 dimensions; computing the query vector amounts to taking a certain matrix and multiplying it by the embedding.
Multiply the Wq matrix by all of the embeddings in the context, producing one query vector for each token. The entries of Wq matrix are parameters of the model and its behavior is learnt from training data.
Simultaneously, we have the key matrix (Wk), which is also multiplied by every one of the embeddings to produce a second sequence of vectors called the keys. The keys can be thought of as answers to the queries, and a key matches a query whenever the two closely align.
Like the query matrix, the key matrix is full of tuneable parameters and maps the embedding vectors to that same smaller-dimensional space.
To measure how well each key matches each query, we compute a dot product between each possible key-query pair (K1.Q1, and so on). The larger the dot product, the more the key and query align, i.e. the more that key's embedding attends to the query's embedding. Hence we have a grid of values, each of which can be any real number from negative infinity to infinity, giving a score for how relevant each word is to updating the meaning of every other word.
In order to normalize the values in these columns to be between 0 and 1, and for each column to add up to 1 as a probability distribution, the softmax is computed along each one of these columns.
The grid is now called the attention pattern, where each column gives weights according to how relevant the word on the left is to the corresponding word at the top.
Attention(Q, K, V) = V · softmax(KᵀQ / sqrt(d_k))
The Q and K variables represent the full arrays of query and key vectors respectively, which we get by multiplying the embeddings by the query and key matrices. The expression in the numerator, KᵀQ, is a compact way to represent the grid of all possible dot products between pairs of keys and queries. We divide all of these values by the square root of the dimension of the key-query space for numerical stability, and the softmax is applied column by column. The resulting attention pattern is then multiplied by the array of value vectors V, as described below. (The original paper writes the equivalent expression as softmax(QKᵀ / sqrt(d_k))V, with the roles of rows and columns swapped.)
The whole training process is more efficient if the model simultaneously predicts every possible next token following each initial subsequence of tokens in the input passage; a single training example then acts as many training examples. Later words must never influence earlier words, since they would give away the answer for what comes next, so the entries where later tokens influence earlier ones need to be forced to zero. This is achieved by setting all of those entries to negative infinity before applying the softmax, so that the softmax turns them into zero while keeping the columns normalized. This process is called masking.
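A small NumPy sketch of this masking step on a toy 4-token score grid, following the column convention used above (rows are keys, columns are queries); the scores are random placeholders.

```python
import numpy as np

seq_len = 4
scores = np.random.randn(seq_len, seq_len)       # rows = keys, columns = queries (the KᵀQ grid)

# A later token (row) must not influence an earlier one (column),
# so entries below the diagonal are set to a very large negative number.
mask = np.tril(np.full((seq_len, seq_len), -1e9), k=-1)
masked = scores + mask

# Softmax applied column by column keeps each column a valid distribution,
# with the masked entries turning into zeros.
exps = np.exp(masked - masked.max(axis=0, keepdims=True))
attention_pattern = exps / exps.sum(axis=0, keepdims=True)
print(np.round(attention_pattern, 2))            # entries below the diagonal are 0 in every column
```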
The size of the attention pattern is equal to the square of the context size, which makes the context size a bottleneck for large language models and makes scaling them difficult. To extend context windows, many variations of the attention mechanism have been proposed to make attention more scalable, including:
- Sparse Attention Mechanism
- Blockwise Attention
- Linformer
- Reformer
- Ring Attention
- Longformer
- Adaptive Attention Span.
We then update the embeddings, allowing words to pass information to whichever other words they are relevant to. For example, the embedding of "fluffy" should cause a change to "creature", moving it to a different part of the embedding space so that it encodes a fluffy creature.
In the single-head attention approach we have a value matrix (Wv) which is multiplied by the embedding of the first word (e.g. "fluffy"), producing a value vector. The value vector, which lives in the same high-dimensional space as the embeddings, is then added to the embedding of the second word (e.g. "creature"). The value vector signifies the actual information we want to pass along. For the whole grid, the value matrix (Wv) is multiplied by every one of the embeddings to produce a sequence of value vectors, associated with the corresponding keys. Then, for each column, each value vector is multiplied by the corresponding weight in that column. For example, under the embedding of "creature" we add large proportions of the value vectors for "fluffy" and "blue", while all of the other value vectors are almost zeroed out. To update the embedding associated with a column, we add together all of the rescaled values in the column and then add that sum to the original embedding. This results in a more refined vector encoding the more contextually rich meaning, like that of a fluffy blue creature.
We apply the same weighted sum across all the columns in the grid and add all of those changes to the corresponding embeddings, producing a full sequence of more refined embeddings popping out of the attention block. This entire process is called a single head of attention, and it is parameterized by three distinct matrices filled with tunable parameters: the key, the query, and the value matrices. In GPT-3 the query and key matrices have 12,288 columns, matching the embedding dimension, and 128 rows, matching the key-query space; the value matrix is a square matrix with 12,288 rows and columns.
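As a rough back-of-the-envelope count (not an exact accounting of the released model), the quoted GPT-3 numbers imply roughly the following parameter budget per attention block; the factored value matrices anticipate the low-rank trick described in the next paragraph.

```python
d_model, d_head, n_heads = 12_288, 128, 96

# Per head: key and query matrices map the embedding down to the 128-dim key-query space.
key_params   = d_head * d_model            # 1,572,864
query_params = d_head * d_model            # 1,572,864

# A full-rank value map would be d_model x d_model, i.e. ~151 million parameters per head.
full_rank_value = d_model * d_model        # 150,994,944

# Factoring the value map into two smaller matrices (see the next paragraph)
# makes it the same size as key + query combined.
factored_value = 2 * d_head * d_model      # 3,145,728

per_head = key_params + query_params + factored_value
print(per_head * n_heads)                  # ~604 million attention parameters per block
```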
It is more efficient for the number of parameters devoted to the value map to be the same as the number devoted to the key and the query, so that multiple attention heads can be run in parallel. The overall value map is therefore constrained to be a low-rank transformation by factoring it as a product of two smaller matrices. Each of the smaller matrices has one side equal to the size of the key-query space and the other equal to the embedding dimension. This makes all four matrices (Query, Key, Value1, Value2) the same size within the attention head. What has been described so far is accurately called a self-attention head.
Cross-attention involves models that process two distinct types of data, like text in one language and text in another language that is part of an ongoing translation, or audio input of speech and an ongoing transcription. A cross-attention head differs from a self-attention head in that the key and query maps act on different data sets, and there is no masking. For example, in a translation model the keys might come from one language while the queries come from another, and the attention pattern could describe which words of one language correspond to which words of the other.
For every different type of contextual update, the parameters of the key and query matrices would be different in order to capture the different attention patterns, and the parameters of the value map would be different based on what should be added to the embeddings. For example, if the words "they crashed" precede the word "car", this has implications for the shape and structure of the car. In practice, the weights of these maps are whatever values let the model best accomplish its goal of predicting the next token. A full attention block inside a transformer consists of multi-headed attention, where many attention operations are run in parallel, each with its own distinct key, query, and value maps.
GPT-3 uses 96 attention heads inside each block, i.e. 96 distinct key and query matrices producing 96 distinct attention patterns. Each head also has its own distinct value matrices, used to produce 96 sequences of value vectors.
These are all added together using the corresponding attention patterns as weights.
Hence for each position in the context, i.e. for each token, every one of these heads produces a proposed change to be added to the embedding at that position. The proposed changes, one from each head, are summed together and the result is added to the original embedding at that position; this sum is the single slice output by the multi-headed attention block. By running many distinct heads in parallel, the model gains the capacity to learn many distinct ways that context changes meaning.
Multilayer Perceptron (MLP)
The Multilayer Perceptron (MLP) block has a majority of the model parameters and seems to use this capacity to store facts.
Consider the example fact that "Michael Jordan plays basketball" and how such a fact could be stored. We make assumptions about the high-dimensional space, such that one direction represents the idea of the first name Michael, another direction represents the idea of the last name Jordan, and a third direction represents the idea of basketball.
If the processed vector's dot product with the first name Michael direction is one, then the vector encodes the idea of a person with that first name. If the dot product is zero or negative, then the vector doesn't align with that direction.
Similarly, its dot product with other directions tells us whether it represents the last name Jordan or basketball. The vector represents the entire full name, Michael Jordan when its dot product with both the first name Michael and last name Jordan directions is one. Since the text Michael Jordan spans two different tokens, we can assume that the earlier attention block has successfully passed information to the second vector of these two vectors so as to ensure that it can encode both names.
Multilayer Perceptron (MLP) block
A sequence of vectors flows into the MLP block, where each vector is associated with one of the tokens from the input text. Each individual vector in that sequence goes through a short series of operations (Linear -> ReLU -> Linear); the output vector is then added to the original input vector, and the sum is the result that flows out. This sequence of operations is applied in parallel to every vector in the sequence, i.e. to every token in the input. When a vector flows in that encodes the first name Michael and the last name Jordan, this sequence of computations produces something that includes the basketball direction, which is what gets added onto the vector at that position.
In the first step of the process we multiply each vector (E) by a very big matrix called the up-projection matrix, filled with model parameters that are learned from data. The matrix multiplication can be seen as taking a bunch of dot products (R0.E, ..., Rn.E) between the rows (R0..n), treating each row of the matrix as its own vector, and the vector being processed (E). Each row of the matrix asks a different kind of question, e.g. whether the first name is Michael or whether the word is a European country, probing for various features of the vector being processed. The total number of rows in the matrix corresponds to the number of questions being asked. This step usually also adds a bias (B) vector to the output, which is likewise full of model parameters learned from data. Hence the first linear step computes (Wup.E + B).
If the entry we're measuring is high for Michael plus Jordan, it would also necessarily be somewhat triggered by Michael plus Phelps and by Alexis plus Jordan, despite those being conceptually unrelated. But we prefer a simple yes or no for the full name. Hence the vector goes through a Rectified Linear Unit (ReLU) step which takes all of the negative values and maps them to zero, leaving all of the positive values unchanged (imitating the behavior of an AND gate).
Models often use the Gaussian Error Linear Unit (GELU), a slightly modified function which is a bit smoother.
The next step is similar to the first: we multiply the vector by a large matrix called the down-projection matrix and add a bias term, bringing the number of dimensions back down to the size of the embedding space. This matrix multiplication can be imagined as taking each column of the matrix, multiplying it by the corresponding term in the vector being processed, and adding together all of those rescaled columns, e.g. (n0C0 + n1C1 + ... + nmCm). The columns have the same dimension as the embedding space, so we can think of them as directions in that space. For example, imagine the model has learned to make the first column the basketball direction; when the first neuron is active, that column is added to the final result, and when the neuron is inactive it contributes nothing. The model could also bake into this column many other features that it wants to associate with something that has the full name Michael Jordan. Likewise, every other column in this large matrix indicates what will be added to the final result if the corresponding neuron is active. The bias values are added to the computed (neuron) values every single time, regardless of those values. Hence the second linear step computes (Wdown.h + B), where h is the output of the ReLU step. The final result is added to the vector that flowed into the block at that position.
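A compact NumPy sketch of the MLP block's Linear -> ReLU -> Linear sequence with the residual addition. The sizes here are small toy values; GPT-3 reportedly uses 12,288-dimensional embeddings with a much wider hidden layer, and the random weights stand in for parameters learned during training.

```python
import numpy as np

# Toy sizes for the sketch; a real model's matrices are learned from data and far larger.
d_model, d_hidden = 512, 2048

W_up, b_up = np.random.randn(d_hidden, d_model) * 0.02, np.zeros(d_hidden)
W_down, b_down = np.random.randn(d_model, d_hidden) * 0.02, np.zeros(d_model)

def mlp_block(E):
    """Linear -> ReLU -> Linear, then add the result back to the incoming vector."""
    h = W_up @ E + b_up          # first linear step: Wup.E + B (each row acts like a question)
    h = np.maximum(h, 0.0)       # ReLU: negative values map to 0, positives pass through
    out = W_down @ h + b_down    # second linear step: back down to the embedding size
    return E + out               # residual: add to the vector that flowed into the block

E = np.random.randn(d_model)     # one embedding; the same block runs in parallel on every position
refined = mlp_block(E)
```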
Hence if the vector flowing in encoded both first name Michael and last name Jordan, then because this sequence of operations will trigger that AND gate, it will add on the basketball direction, so what pops out will encode all of those together.
Based on the evidence, however, individual neurons very rarely represent a single clean feature like "Michael Jordan". This is explained by the superposition hypothesis. In an n-dimensional space, if we want to represent different features using directions that are all exactly perpendicular to one another, so that a component added in one direction doesn't influence the others, then the maximum number of such vectors we can fit is n, the number of dimensions. The Johnson-Lindenstrauss lemma, however, implies that the number of vectors we can cram into a space that are merely nearly perpendicular (say, between 89 and 91 degrees apart) grows exponentially with the number of dimensions.
Large language models benefit from associating independent ideas with nearly perpendicular directions, since this lets them store many, many more ideas than there are dimensions in the space they are allotted. This partially explains why model performance seems to scale so well with size. It also means individual features are not going to be visible as a single neuron but as a specific combination of neurons instead, like a superposition.
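A quick NumPy check of the "nearly perpendicular" claim: random vectors in a high-dimensional space have pairwise angles clustered tightly around 90 degrees. The dimension and vector counts below are arbitrary illustration values.

```python
import numpy as np

rng = np.random.default_rng(0)
n_dims, n_vecs = 10_000, 200
vecs = rng.standard_normal((n_vecs, n_dims))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)   # unit vectors

# Angles between all distinct pairs of random vectors.
cosines = vecs @ vecs.T
pairs = cosines[np.triu_indices(n_vecs, k=1)]
angles = np.degrees(np.arccos(np.clip(pairs, -1, 1)))
print(angles.min(), angles.max())   # typically within a few degrees of 90
```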
LLM Models
Examples of LLMs include the LLaMA 2 model, the Jurassic model, and Amazon Titan.
Mistral AI is a company which produces open-source AI models with fewer parameters than GPT-4.
OpenSource Models
There are over 325,000 open-source models on Hugging Face, along with a leaderboard of top LLMs. These models are smaller than the proprietary models and have considerably fewer parameters, but they can be fine-tuned or modified to anyone's requirements. Many open-source models are variations of the LLaMA 2 model provided by Meta; the Vicuna model, for example, is an open-source model created on top of LLaMA 2, and Hermes-13b is another popular open-source model. The Bloom model is a multilingual language model created by BigScience. IBM's watsonx.ai also offers many versions of Llama 2 and other foundation models. Falcon 180B is an open-access large language model that builds on the previous releases in the "Falcon" family and has 180 billion parameters.
Disadvantages of LLMs
- Hallucinations result when LLMs are trained on incomplete, contradictory or inaccurate data, or when they misunderstand context.
- Bias happens when the source of training data is not diverse or not representative.
- Security issues occur when personally identifiable information (PII) is exposed because it was included in the data used to train the models.
AI Tools
There are various tools which can be leveraged for training and working with AI Models.
Ollama is a tool used to run LLMs on a local machine. For example, the commands below pull and run a Llama model locally.
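A typical invocation looks like the following, assuming Ollama is installed; the `llama2` model tag is one example of an available model.

```shell
ollama pull llama2   # download the model weights to the local machine
ollama run llama2    # start an interactive chat session with the model
```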