Understanding the LLM Development Cycle - Building, Training & Finetuning



This is a summary of the ACM TechTalk by Sebastian Raschka, written by Mahdee Kamal.

Ways to Use LLMs

  • Public/proprietary services (e.g., ChatGPT, Gemini).
  • Running LLMs locally.
  • Setting up private APIs.

LLM Development Stages

  • Building: Coding the architecture and preparing the dataset.
  • Pre-Training: Training the LLM as a deep neural network.
  • Fine-Tuning: Customizing the pre-trained model for specific tasks.

How LLMs Work

  • An LLM is simply (pre)trained to predict the next word/token; it generates text iteratively, one token at a time.
  • Inputs are prepared as sliding windows over the text, and batching is used to speed up training.
  • LLMs are trained on vast datasets (GPT-3: 0.5T tokens, LLaMA 1: 1.4T, LLaMA 3: 15T). The trend is toward both scaling up and improving data quality.
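The sliding-window preparation above can be sketched in a few lines of plain Python; the token IDs, context length, and stride below are made up for illustration:

```python
# Build (input, target) pairs for next-token prediction: the target
# sequence is the input sequence shifted one position to the right.
def sliding_windows(token_ids, context_length, stride):
    """Slide a fixed-size window over the token stream."""
    pairs = []
    for i in range(0, len(token_ids) - context_length, stride):
        inputs = token_ids[i:i + context_length]
        targets = token_ids[i + 1:i + context_length + 1]
        pairs.append((inputs, targets))
    return pairs

tokens = [10, 20, 30, 40, 50, 60]
for x, y in sliding_windows(tokens, context_length=3, stride=1):
    print(x, "->", y)
# [10, 20, 30] -> [20, 30, 40]
# [20, 30, 40] -> [30, 40, 50]
# [30, 40, 50] -> [40, 50, 60]
```

In a real pipeline these pairs would be stacked into batches of tensors for the GPU.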

LLM Architecture

  • LLMs typically use a Transformer architecture, with key components including tokenization, embedding layers, and masked multi-head attention modules.
  • Most LLMs share essentially the same architecture; they differ mainly in details such as dropout, activation functions, and normalization. E.g., RMSNorm, or rotary positional embeddings instead of absolute positional embeddings.
  • The main variables across GPT variants are:
    • Number of parameters
    • Number of times the transformer block is repeated (depth)
    • Number of heads in the multi-head attention module
    • Embedding dimension
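The GPT-2 family is a concrete example of how variants differ only in these few hyperparameters (the figures below are the published GPT-2 model sizes):

```python
# GPT-2 model variants: same architecture, different scale.
GPT2_CONFIGS = {
    "gpt2-small":  {"n_layers": 12, "n_heads": 12, "emb_dim": 768},   # ~124M params
    "gpt2-medium": {"n_layers": 24, "n_heads": 16, "emb_dim": 1024},  # ~355M params
    "gpt2-large":  {"n_layers": 36, "n_heads": 20, "emb_dim": 1280},  # ~774M params
    "gpt2-xl":     {"n_layers": 48, "n_heads": 25, "emb_dim": 1600},  # ~1.6B params
}

for name, cfg in GPT2_CONFIGS.items():
    print(name, cfg)
```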

Detailed Architecture:

LLM Detailed Architecture

Simplified Version: source

LLM Simplified Architecture

Pre-training Process

  • Pre-training is a standard deep learning training loop with cross-entropy loss and the Adam optimizer.
  • Key challenge: scaling the training across multiple GPUs and multiple machines, due to the sheer size of the model and dataset.
  • The labels are simply the inputs shifted by +1. Ex: [In the heart of] → [the heart of the]
  • Epochs: 1-2 epochs are usually the sweet spot.
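The shifted-label objective can be sketched without any framework; the toy "model" below just returns uniform logits and stands in for a real Transformer, while the loss mirrors what a library cross-entropy would compute:

```python
import math

def cross_entropy(logits, target_id):
    """Negative log-probability of the target token under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target_id]

vocab_size = 5
token_ids = [0, 3, 1, 4, 2]
inputs, targets = token_ids[:-1], token_ids[1:]  # targets = inputs shifted by +1

uniform_logits = [0.0] * vocab_size  # placeholder model output per position
loss = sum(cross_entropy(uniform_logits, t) for t in targets) / len(targets)
print(round(loss, 4))  # uniform over 5 tokens -> loss = ln(5) ≈ 1.6094
```

A real training loop would backpropagate this loss and update the weights with Adam.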

Due to the resource-intensive nature of pre-training, existing pre-trained models like LLaMA are often used as a base for further fine-tuning.

Fine-tuning Process

In practice, it is often sufficient to fine-tune only the last few layers rather than the entire model.

Use case: Classifier

  • Replacing the output layer with a smaller one that suits the specific task (e.g., spam vs. non-spam)
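The head swap can be sketched with plain matrix shapes; the dimensions below (GPT-2-like) and random weights are purely illustrative, and a real model would do this in a framework such as PyTorch:

```python
import numpy as np

# Classification fine-tuning: replace the pre-trained language-model head
# (emb_dim x vocab_size) with a much smaller task head (emb_dim x 2).
emb_dim, vocab_size, num_classes = 768, 50257, 2

rng = np.random.default_rng(0)
pretrained_head = rng.normal(size=(emb_dim, vocab_size))   # original LM head
classifier_head = rng.normal(size=(emb_dim, num_classes))  # new, randomly initialized

hidden = rng.normal(size=(emb_dim,))  # final hidden state of the last token
logits = hidden @ classifier_head     # 2 logits: spam vs. non-spam
print(logits.shape)  # (2,)
```

Only the new head (and optionally the last few transformer layers) is trained on the labeled task data.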

Use case: Personal Assistant

  • For building chatbots or personal assistants, instruction fine-tuning is used, where the model is trained on a dataset containing instructions and desired responses.
  • A prompt style template is applied to the dataset to guide the LLM in following instructions.
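Applying such a template can be sketched as follows; the wording follows the widely used Alpaca-style format, and the dataset entry is made up for illustration:

```python
# Format one instruction-dataset entry into the text the model is trained on.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

entry = {
    "instruction": "Rewrite the sentence in passive voice: The cat chased the mouse.",
    "output": "The mouse was chased by the cat.",
}

prompt = ALPACA_TEMPLATE.format(instruction=entry["instruction"])
training_text = prompt + entry["output"]  # instruction + desired response
print(training_text)
```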

Evaluating LLMs

  • MMLU (Massive Multitask Language Understanding) is a multiple-choice benchmark
  • HellaSwag and TruthfulQA
  • AlpacaEval
  • Chatbot Arena (based on user feedback)

Q&A

  • Favorite tool to run LLMs locally? - Ollama
  • How do start-ups fine-tune the weights and biases of an OpenAI GPT model? - Not possible; they use prompt engineering instead.
  • What is more effective: continuing training a foundational model or developing a Q&A dataset for fine-tuning? - Both approaches work; it’s an open research problem, but Q&A data can provide context-efficient fine-tuning.
  • Can you say a bit more about attention? - Left for listeners to read up on their own

Reference

  1. ACM TechTalk by Sebastian Raschka