Deep Learning for LLMs
A practitioner's guide to the deep learning concepts that power large language models. This isn't a comprehensive ML course — it's the specific knowledge you need to understand, use, and troubleshoot LLM systems.
Contents
- Why This Matters
- Core Concepts
- The Transformer Architecture
- Training & Optimization
- Tokenization & Embeddings
- Scaling & Emergence
- Alignment & Safety
- Inference & Deployment
- Fine-Tuning & Adaptation
Why This Matters
You don't need to train your own LLM to benefit from understanding how they work. This knowledge helps you:
- Debug unexpected behavior — Why is it generating nonsense? (temperature, tokenization)
- Optimize costs — Why are some prompts expensive? (token count, context length)
- Choose models wisely — What's the difference between Claude and GPT? (architecture, training)
- Design better prompts — Why does chain-of-thought work? (attention patterns, reasoning)
- Anticipate limitations — Why does it hallucinate? (training data, probability)
Core Concepts
Neural Network
A computational system inspired by biological neurons. Consists of layers of connected nodes that transform input data through weighted connections and activation functions.
LLM relevance: LLMs are massive neural networks (billions of parameters) that learn to predict the next token from vast amounts of text.
Input → [Layer 1] → [Layer 2] → ... → [Layer N] → Output
```
            ↓           ↓                  ↓
         weights     weights            weights
```
Parameters
The learnable values in a neural network — the weights and biases that are adjusted during training. Model size is often described by parameter count.
LLM relevance:
- GPT-3: 175B parameters
- GPT-4: ~1.8T parameters (estimated)
- Claude 3 Opus: ~200B+ parameters (estimated)
- Llama 3.1: 405B parameters
More parameters generally means more capacity to learn, but also more compute and cost.
Loss Function
A mathematical function that measures how wrong the model's predictions are. Training minimizes this loss.
LLM relevance: LLMs typically use cross-entropy loss — measuring how different the model's predicted probability distribution is from the actual next token. Lower loss = better predictions.
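A minimal NumPy sketch of token-level cross-entropy (shapes and values are illustrative, not any particular framework's API):

```python
import numpy as np

def cross_entropy(logits, target_id):
    """Cross-entropy loss for a single next-token prediction.

    logits: raw scores over the vocabulary, shape (vocab_size,)
    target_id: index of the actual next token
    """
    # Softmax turns logits into a probability distribution
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Loss is the negative log-probability assigned to the correct token
    return -np.log(probs[target_id])

logits = np.array([2.0, 0.5, -1.0, 3.0])   # toy 4-token vocabulary
print(cross_entropy(logits, target_id=3))  # low loss: model favored the right token
print(cross_entropy(logits, target_id=2))  # high loss: model gave it little probability
```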
Gradient Descent
The optimization algorithm that adjusts parameters to minimize loss. Computes the gradient (direction of steepest increase) and moves parameters in the opposite direction.
LLM relevance: Training LLMs requires distributed gradient descent across thousands of GPUs, with sophisticated techniques like gradient accumulation and mixed precision.
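A toy sketch of gradient descent on a single weight (purely illustrative; real LLM training uses optimizers like AdamW spread across many GPUs):

```python
# Minimize loss(w) = (w - 3)**2 with plain gradient descent
w = 0.0
learning_rate = 0.1

for step in range(50):
    grad = 2 * (w - 3)          # d/dw of (w - 3)^2
    w -= learning_rate * grad   # move opposite the gradient
print(w)  # converges toward 3.0
```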
Backpropagation
The algorithm for computing gradients efficiently by propagating error backwards through the network. Essential for training deep networks.
LLM relevance: Enables training of transformer models with around a hundred layers. It is the chain rule applied recursively through every attention and feed-forward layer.
Overfitting
When a model memorizes training data instead of learning generalizable patterns. Performs well on training data but poorly on new data.
LLM relevance: LLMs can memorize training text verbatim (copyright concerns). Regularization and diverse training data help prevent this.
The Transformer Architecture
The architecture behind all modern LLMs.
Transformer
The neural network architecture introduced in "Attention Is All You Need" (2017). Uses self-attention instead of recurrence to process sequences.
Key innovation: Processes all tokens in parallel, enabling efficient training on long sequences and massive parallelism on GPUs.
Input Tokens → [Embedding] → [Transformer Blocks × N] → [Output Head] → Predictions
```
                                        ↓
                               [Self-Attention]
                               [Feed-Forward]
                               [Layer Norm]
```
Self-Attention
The mechanism that allows each token to "attend to" (consider relationships with) every other token in the sequence.
How it works:
- Each token creates Query (Q), Key (K), and Value (V) vectors
- Attention scores = softmax(Q × K^T / √d)
- Output = weighted sum of Values based on attention scores
LLM relevance: This is why LLMs understand context — attention allows "it" to connect to "the cat" across many tokens.
"The cat sat on the mat. It was comfortable."
↑
Attention connects "It" to "cat"
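A minimal NumPy sketch of the scaled dot-product formula above (single head, no causal mask, no learned Q/K/V projections; inputs are random):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V):
    """Q, K, V: (seq_len, d) arrays, one row per token."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # (seq_len, seq_len) attention scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted sum of value vectors

seq_len, d = 5, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(seq_len, d)) for _ in range(3))
out = self_attention(Q, K, V)
print(out.shape)  # (5, 8): one contextualized vector per token
```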
Multi-Head Attention
Running multiple attention mechanisms in parallel, each learning different relationship types.
LLM relevance: Different heads might learn syntax, semantics, coreference, etc. GPT-3 has 96 attention heads per layer.
Context Window / Context Length
The maximum number of tokens the model can process in a single forward pass. Determines how much information the model can "see" at once.
Current limits:
- GPT-4 Turbo: 128K tokens
- Claude 3: 200K tokens
- Gemini 1.5 Pro: 1M tokens
LLM relevance: Longer context = more information available, but standard attention's compute and memory costs grow quadratically with sequence length.
Positional Encoding
Information added to token embeddings to convey position in the sequence. Without this, transformers wouldn't know word order.
Types:
- Absolute: Fixed position encodings (original transformer)
- Relative: Encode distance between tokens
- RoPE: Rotary position embeddings (used by Llama, modern models)
LLM relevance: RoPE enables better generalization to longer sequences than seen in training.
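For intuition, a short sketch of the original transformer's absolute sinusoidal encodings; RoPE differs in detail (it rotates query/key vectors by position-dependent angles) but serves the same purpose:

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Fixed sine/cosine position encodings from the original transformer paper."""
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]       # even feature dimensions
    angles = positions / (10000 ** (dims / d_model))
    enc = np.zeros((seq_len, d_model))
    enc[:, 0::2] = np.sin(angles)   # even dims get sine
    enc[:, 1::2] = np.cos(angles)   # odd dims get cosine
    return enc

pe = sinusoidal_positions(seq_len=128, d_model=64)
# These vectors are added to token embeddings so the model can tell positions apart
print(pe.shape)  # (128, 64)
```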
Feed-Forward Network (FFN)
Dense layers applied to each token position after attention. Where much of the "knowledge" is stored.
LLM relevance: Recent research suggests FFN layers store factual knowledge, while attention handles reasoning patterns.
Layer Normalization
Normalizes activations across features to stabilize training. Applied before or after attention and FFN.
LLM relevance: Essential for training very deep transformer models (100+ layers).
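A minimal sketch of layer normalization over the feature dimension (the learned scale and shift, gamma and beta, are left at their defaults here):

```python
import numpy as np

def layer_norm(x, eps=1e-5, gamma=1.0, beta=0.0):
    """Normalize each token's feature vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.random.randn(4, 16) * 10 + 5   # 4 tokens, 16 features, badly scaled
y = layer_norm(x)
print(y.mean(axis=-1))  # ~0 per token
print(y.std(axis=-1))   # ~1 per token
```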
Training & Optimization
Pre-training
Training a model on vast amounts of unlabeled text to learn general language patterns. The foundation of all modern LLMs.
Objective: Predict the next token (causal language modeling) or masked tokens (BERT-style).
Scale: GPT-4 reportedly trained on trillions of tokens from the internet, books, code.
Fine-tuning
Additional training on a smaller, task-specific dataset to specialize a pre-trained model.
Types:
- Full fine-tuning: Update all parameters
- LoRA/QLoRA: Update only small adapter layers
- Instruction tuning: Train on instruction-following examples
RLHF (Reinforcement Learning from Human Feedback)
Training technique that aligns models with human preferences. Uses human ratings to train a reward model, then optimizes the LLM to maximize that reward.
Process:
- Collect human comparisons of model outputs
- Train a reward model on these preferences
- Fine-tune LLM using PPO to maximize reward
LLM relevance: This is what makes ChatGPT/Claude helpful rather than just completing text. Critical for safety and alignment.
Constitutional AI
Anthropic's approach to alignment. Model critiques its own outputs against a set of principles and revises accordingly.
LLM relevance: How Claude is trained to be helpful, harmless, and honest without as much human labeling.
Learning Rate
How much to adjust parameters in response to gradients. Too high = unstable training; too low = slow learning.
LLM relevance: LLM training uses careful learning rate schedules — warmup, then decay. Critical hyperparameter.
Batch Size
Number of examples processed before updating parameters. Larger batches = more stable gradients, more memory.
LLM relevance: LLMs use huge effective batch sizes (millions of tokens) through gradient accumulation across many GPUs.
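A PyTorch-style sketch of gradient accumulation, which simulates a large batch by summing gradients over several small ones before each optimizer step (model, loader, and loss_fn are placeholders, not a specific API):

```python
import torch

accumulation_steps = 8  # effective batch = micro-batch size × 8

def train_epoch(model, loader, loss_fn, optimizer):
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(loader):
        loss = loss_fn(model(inputs), targets)
        # Scale so the accumulated gradient matches one big batch
        (loss / accumulation_steps).backward()
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()      # one parameter update per 8 micro-batches
            optimizer.zero_grad()
```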
Tokenization & Embeddings
Token
The basic unit of text that LLMs process. Not words — typically subword units that balance vocabulary size and sequence length.
Examples:
- "unhappiness" → ["un", "happiness"] or ["un", "hap", "pi", "ness"]
- "ChatGPT" → ["Chat", "G", "PT"]
- Spaces often included: " the" is one token
LLM relevance: Token count determines cost and context usage. Code/non-English text often uses more tokens per character.
Tokenizer
The algorithm that converts text to tokens and back. Different models use different tokenizers.
Common types:
- BPE (Byte Pair Encoding): GPT models
- SentencePiece: Llama, many open models
- Tiktoken: OpenAI's fast BPE implementation
LLM relevance: Same text = different token counts across models. Tokenizer determines vocabulary and edge cases.
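A quick way to see tokenization in practice with OpenAI's tiktoken library (cl100k_base is the encoding used by GPT-4-era models; other models' tokenizers will split the same text differently):

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Tokenization splits text into subword units."
token_ids = enc.encode(text)
print(len(token_ids))                         # number of tokens (drives cost)
print([enc.decode([t]) for t in token_ids])   # see how the text was split
```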
Embedding
Dense vector representation of a token. Maps discrete tokens to continuous space where similar meanings are nearby.
LLM relevance:
- Input: Tokens → Embeddings (lookup table)
- Output: Embeddings → Probabilities (linear layer)
Embeddings are how models represent meaning mathematically.
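A sketch of the input-side embedding lookup in PyTorch (vocabulary size, dimension, and token ids are arbitrary here):

```python
import torch
import torch.nn as nn

vocab_size, d_model = 50_000, 512
embedding = nn.Embedding(vocab_size, d_model)   # learnable (vocab_size, d_model) table

token_ids = torch.tensor([[15, 2043, 9, 731]])  # one sequence of 4 token ids
vectors = embedding(token_ids)                  # one dense vector per token
print(vectors.shape)                            # torch.Size([1, 4, 512])
```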
Vocabulary Size
The number of unique tokens the model knows. A larger vocabulary means bigger embedding tables, but the same text splits into fewer tokens (shorter sequences).
Typical sizes:
- GPT-4: ~100K tokens
- Claude: ~100K tokens
- Llama 3: 128K tokens
Scaling & Emergence
Scaling Laws
Empirical observations that model performance improves predictably with more compute, data, and parameters.
Key insight: Performance follows power laws. 10× more compute ≈ predictable improvement.
LLM relevance: This is why labs keep building bigger models — returns remain positive at massive scale.
Emergent Capabilities
Abilities that appear suddenly at certain scales, not present in smaller models.
Examples:
- Chain-of-thought reasoning
- In-context learning
- Code generation
- Multilingual transfer
LLM relevance: You can't predict what a larger model will be able to do from smaller model behavior.
In-Context Learning
The ability to learn new tasks from examples in the prompt without weight updates. One of GPT-3's breakthrough capabilities.
LLM relevance: This is why few-shot prompting works. The model "learns" from examples in context, not through training.
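For illustration, a minimal few-shot prompt that relies on in-context learning rather than any weight update (the task and examples are invented):

```python
prompt = """Classify the sentiment of each review as positive or negative.

Review: "The battery lasts all day." -> positive
Review: "Broke after one week." -> negative
Review: "Setup was painless and fast." ->"""
# Sent as-is to a completion or chat API, the model infers the pattern
# from the two examples and typically continues with " positive".
```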
Mixture of Experts (MoE)
Architecture where only a subset of parameters activates for each input. Enables larger models with lower compute.
Examples: GPT-4 (rumored), Mixtral, DBRX
LLM relevance: MoE models can have far more total parameters while keeping compute (and cost) per token close to that of a smaller dense model.
Alignment & Safety
Alignment
Ensuring AI systems pursue goals that match human intentions. The challenge of making AI do what we actually want.
Approaches:
- RLHF
- Constitutional AI
- Debate
- Interpretability
Hallucination
When models generate plausible-sounding but factually incorrect information. A fundamental limitation of current LLMs.
Why it happens: Models optimize for plausibility (matching training distribution), not truth. No built-in fact-checking.
Mitigations: RAG, grounding, verification, citations.
Jailbreaking
Techniques to bypass safety measures and elicit restricted outputs from aligned models.
Types:
- Prompt injection
- Many-shot attacks
- Persona manipulation
- Encoding tricks
Red Teaming
Systematic adversarial testing to find model vulnerabilities before deployment.
LLM relevance: Essential practice before launching LLM applications. Find failures before users do.
Inference & Deployment
Inference
Running a trained model to generate predictions. What happens when you send a prompt to an API.
Cost factors:
- Input tokens (processed in parallel)
- Output tokens (generated sequentially)
- Model size
- Hardware (GPU type)
Temperature
Parameter controlling randomness in token selection. Applied to logits before sampling.
Values:
- 0.0: Deterministic (always pick highest probability)
- 0.7: Balanced creativity
- 1.0+: More random, potentially incoherent
LLM relevance: Low temperature for factual tasks, higher for creative tasks.
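A sketch of how temperature reshapes the output distribution before sampling (toy logits; real vocabularies have ~100K entries):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

logits = np.array([4.0, 2.0, 1.0, 0.5])

for temperature in (0.2, 0.7, 1.5):
    probs = softmax(logits / temperature)
    print(temperature, probs.round(3))
# Low temperature sharpens the distribution toward the top token;
# high temperature flattens it, making unlikely tokens more probable.
```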
Top-p (Nucleus Sampling)
Sample from the smallest set of tokens whose cumulative probability exceeds p. Often used alongside, or instead of, temperature.
Example: top_p=0.9 means sample from tokens comprising top 90% of probability mass.
Top-k Sampling
Sample only from the k most likely tokens. Simple alternative to top-p.
LLM relevance: Often used with temperature for controlled randomness.
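A sketch combining top-k and top-p (nucleus) filtering before sampling; the probabilities and cutoffs are illustrative:

```python
import numpy as np

def sample(probs, top_k=50, top_p=0.9, rng=np.random.default_rng()):
    """Sample a token id after top-k and nucleus (top-p) filtering."""
    order = np.argsort(probs)[::-1]           # token ids, most likely first
    order = order[:top_k]                     # keep only the k most likely
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1
    order = order[:cutoff]                    # smallest set covering top_p mass
    kept = probs[order] / probs[order].sum()  # renormalize
    return rng.choice(order, p=kept)

probs = np.array([0.5, 0.25, 0.15, 0.06, 0.04])
print(sample(probs, top_k=4, top_p=0.9))
```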
Logits
Raw, unnormalized scores output by the model before softmax. Higher logit = higher probability after normalization.
LLM relevance: Temperature operates on logits. Some APIs expose log-probabilities for analysis.
Quantization
Reducing numerical precision of model weights (e.g., 32-bit → 8-bit or 4-bit) to reduce memory and increase speed.
Trade-off: Smaller/faster but slight quality loss.
LLM relevance: Enables running large models on consumer or single-GPU hardware. Llama 3 70B fits on a single high-memory GPU (roughly 48 GB) with 4-bit quantization.
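A toy sketch of symmetric 8-bit quantization of a weight tensor (real schemes such as GPTQ, AWQ, or 4-bit NF4 are more sophisticated, but the core idea is the same):

```python
import numpy as np

def quantize_int8(weights):
    """Map float weights to int8 plus one float scale per tensor."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
print(q.nbytes, "bytes vs", w.nbytes)          # 4× smaller storage
print(np.abs(w - dequantize(q, scale)).max())  # small rounding error
```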
KV Cache
Caching key-value pairs from previous tokens to avoid recomputation during autoregressive generation.
LLM relevance: Essential optimization for inference. It is why "prefill" (processing the prompt in parallel) is much faster per token than generation (producing one token at a time).
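A rough sketch of the idea: during decoding, each new token's query attends over keys/values of earlier tokens, which are computed once and cached rather than recomputed (shapes and the attend helper are illustrative):

```python
import numpy as np

def attend(q, K, V):
    """One query attends over all cached keys/values (single head)."""
    scores = K @ q / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

d = 8
k_cache, v_cache = [], []
rng = np.random.default_rng(0)

for step in range(5):                    # autoregressive decoding loop
    q = rng.normal(size=d)               # query for the newest token
    k_cache.append(rng.normal(size=d))   # its key/value are computed once...
    v_cache.append(rng.normal(size=d))   # ...and cached for all later steps
    out = attend(q, np.array(k_cache), np.array(v_cache))
print(out.shape)  # (8,) — earlier tokens' K/V were never recomputed
```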
Fine-Tuning & Adaptation
LoRA (Low-Rank Adaptation)
Efficient fine-tuning method that adds small trainable matrices to frozen model weights. Dramatically reduces compute and memory.
Interactive Code: 🚀 Run our LoRA Fine-tuning Tutorial in Google Colab
LLM relevance: Makes fine-tuning accessible without massive GPU clusters. Can create custom models affordably.
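A minimal PyTorch sketch of the LoRA idea: freeze the pretrained weight W and learn a low-rank update B·A added to its output (initialization, scaling, and dimensions are simplified relative to the paper and to libraries like peft):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)                  # frozen pretrained weight
        in_f, out_f = base.in_features, base.out_features
        self.A = nn.Parameter(torch.randn(rank, in_f) * 0.01)   # small trainable matrix
        self.B = nn.Parameter(torch.zeros(out_f, rank))         # zero init: no change at start

    def forward(self, x):
        return self.base(x) + x @ self.A.T @ self.B.T           # W·x + B·A·x

layer = LoRALinear(nn.Linear(4096, 4096, bias=False), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # ~65K trainable params vs ~16.8M frozen in the base layer
```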
QLoRA
Combines LoRA with quantization for even more efficient fine-tuning. Fine-tune 65B models on a single GPU.
Instruction Tuning
Fine-tuning on (instruction, response) pairs to make models follow instructions better.
Datasets: FLAN, Alpaca, OpenAssistant, Dolly
LLM relevance: Why "base" models just complete text while "instruct" models follow instructions.
Preference Tuning (DPO)
Direct Preference Optimization — alternative to RLHF that directly optimizes on preference data without a reward model.
LLM relevance: Simpler, more stable than RLHF. Used in recent open models.
Quick Reference
Model Size & Capability
| Parameters | Example Models | Typical Use |
|---|---|---|
| 1-7B | Llama 3.2, Phi-3 | Local deployment, edge |
| 7-13B | Llama 3.1 8B, Mistral 7B | Balanced cost/capability |
| 30-70B | Llama 3.1 70B, Mixtral 8x7B | High capability, self-hosted |
| 100B+ | GPT-4, Claude 3 Opus | Frontier capabilities |
Inference Parameters Cheat Sheet
| Parameter | Low Value | High Value | Use Case |
|---|---|---|---|
| Temperature | 0.0-0.3 | 0.7-1.0 | Factual → Creative |
| Top-p | 0.1-0.5 | 0.9-1.0 | Focused → Diverse |
| Max tokens | Task-dependent | — | Control output length |
Notes
This guide focuses on concepts relevant to LLM practitioners. For comprehensive deep learning education, see Resources.
Feedback and suggestions are welcome!
Last updated: January 2026