What is an LLM?
Large Language Models: what they are and how they work
A Large Language Model (LLM) is a neural network trained to predict the next token in a sequence of text. Despite this simple objective, when trained on enough data with enough parameters, these models develop sophisticated language understanding and generation capabilities.
The Core Idea
At its heart, an LLM does one thing: given some text, predict what comes next.
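More precisely, at each step the model produces a probability distribution over possible next tokens and picks one. A toy sketch in Python (the prompt and every probability below are made up for illustration):

```python
# Made-up next-token probabilities; a real model scores every token in
# its vocabulary (tens of thousands of entries) at every step.
prompt = "The cat sat on the"
next_token_probs = {" mat": 0.62, " floor": 0.11, " couch": 0.08, " table": 0.05}

# Greedy decoding: take the single most likely token.
best = max(next_token_probs, key=next_token_probs.get)
print(prompt + best)  # The cat sat on the mat
```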
This "next token prediction" is repeated to generate longer text:
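A minimal sketch of that loop, assuming a hypothetical `predict_next_token(text)` helper that wraps a trained model:

```python
def generate(prompt, max_new_tokens=50):
    # predict_next_token is a hypothetical stand-in for a real model call.
    text = prompt
    for _ in range(max_new_tokens):
        token = predict_next_token(text)
        if token == "<end>":   # models emit a special token to stop early
            break
        text += token          # the output becomes part of the next input
    return text
```

Calling `generate("Once upon a time")` would keep appending tokens until the model signals it is done or the token budget runs out.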
Each prediction considers all the previous tokens (up to the context window limit). This allows the model to maintain coherence over long passages.
Architecture: Transformers
Modern LLMs use the Transformer architecture, introduced in the 2017 paper "Attention Is All You Need." The key innovation is the attention mechanism, which lets each token draw on ("attend to") every earlier token in the sequence when making a prediction.
Why Attention Matters
In the sentence "The cat sat on the mat because it was tired," the word "it" needs to refer back to "cat." Attention lets the model weigh the relevance of every previous word when it processes "it," so the reference can be resolved correctly.
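A minimal NumPy sketch of the underlying operation, scaled dot-product attention (the vectors here are random stand-ins for real token representations):

```python
import numpy as np

def attention(Q, K, V):
    # Similarity of each query with every key, scaled by sqrt(d_k)
    # to keep the softmax well-behaved.
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax turns scores into weights that sum to 1 per token.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output is a weighted mix of the value vectors: "it" can pull
    # most of its information from "cat" if that weight is high.
    return weights @ V

# 5 tokens, each an 8-dimensional vector (random stand-ins for embeddings).
x = np.random.randn(5, 8)
out = attention(x, x, x)  # self-attention: Q, K, V all come from the same tokens
print(out.shape)          # (5, 8)
# (A real LLM also masks out future tokens; omitted here for brevity.)
```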
Simplified Transformer Structure
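In rough terms, an LLM is an embedding layer followed by a stack of identical blocks, each combining self-attention with a small feed-forward network. A minimal PyTorch sketch of one such block (all sizes are illustrative, not taken from any particular model):

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One block: self-attention, then a feed-forward MLP, each with a
    residual connection and layer norm. An LLM stacks dozens of these."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, causal_mask):
        # Tokens exchange information; the causal mask blocks attention
        # to future positions, so each token only sees its predecessors.
        a, _ = self.attn(x, x, x, attn_mask=causal_mask, need_weights=False)
        x = self.norm1(x + a)
        # Per-token processing of what attention gathered.
        return self.norm2(x + self.mlp(x))

seq_len = 10
x = torch.randn(1, seq_len, 512)  # (batch, tokens, features)
mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
print(TransformerBlock()(x, mask).shape)  # torch.Size([1, 10, 512])
```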
What Makes Them "Large"
The "large" in LLM refers to:
- **Parameter Count**: Billions of learned weights. More parameters generally mean more capability, but also more memory and compute. Modern LLMs range from 1B to 400B+ parameters.
- **Training Data**: Trained on trillions of tokens of text from the internet, books, code, and other sources. The diversity and scale of the training data are crucial for general capability.
- **Compute**: Training requires massive GPU clusters running for weeks or months. A single training run can cost millions of dollars in compute (see the rough estimate below).
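For a sense of scale, a common rule of thumb puts total training compute at roughly 6 × N × D floating-point operations, where N is the parameter count and D the number of training tokens. The figures below are illustrative, not any specific model's:

```python
# Back-of-envelope training compute using the ~6 * N * D rule of thumb.
N = 70e9            # 70B parameters (illustrative)
D = 2e12            # 2 trillion training tokens (illustrative)
flops = 6 * N * D
print(f"{flops:.1e} total FLOPs")  # 8.4e+23
```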
Training vs Inference
| Aspect | Training | Inference |
|---|---|---|
| Goal | Learn the weights | Use the weights |
| Compute | Massive (clusters) | Moderate (single machine possible) |
| Data | Trillions of tokens | Just your prompt |
| Who does it | Labs with $10M+ budgets | Anyone with hardware |
| Duration | Weeks to months | Seconds per response |
When you "run an LLM locally," you're doing inference — using a pre-trained model to generate text. You're not training it.
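For example, a minimal local-inference sketch with the Hugging Face `transformers` library (the model choice and generation settings here are just placeholders):

```python
from transformers import pipeline

# Downloads pre-trained weights once, then runs inference locally.
generator = pipeline("text-generation", model="gpt2")
result = generator("The capital of France is", max_new_tokens=20)
print(result[0]["generated_text"])
```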
Why They Need So Much Memory
During inference, the model weights must be loaded into memory. Each parameter is a number, typically stored as 16 bits (2 bytes) or less with quantization.
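A back-of-envelope example, assuming a hypothetical 7B-parameter model:

```python
params = 7e9                                   # 7B parameters (illustrative)
print(f"FP16:  {params * 2   / 1e9:.1f} GB")   # 16-bit weights -> 14.0 GB
print(f"4-bit: {params * 0.5 / 1e9:.1f} GB")   # quantized      ->  3.5 GB
```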
This is just for the weights. You also need memory for:
- KV Cache: Grows with context length (often 2-16 GB; see the sizing sketch after this list)
- Activations: Intermediate computation results
- Overhead: Framework, buffers, etc.
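To see why the KV cache grows with context, here is a rough sizing sketch; the layer counts and dimensions below are illustrative, not a specific model's configuration:

```python
# Rough KV-cache size: keys and values are stored for every layer and
# every token currently in context. All numbers here are illustrative.
n_layers  = 32
n_heads   = 32     # assumes full multi-head attention; grouped-query
head_dim  = 128    # attention in many recent models shrinks this cost
seq_len   = 8192   # tokens currently in context
bytes_per = 2      # 16-bit cache entries

kv_bytes = 2 * n_layers * n_heads * head_dim * seq_len * bytes_per
print(f"{kv_bytes / 1e9:.1f} GB")  # ~4.3 GB, growing linearly with seq_len
```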
What LLMs Can and Can't Do
Strengths
- Natural language understanding
- Text generation and summarization
- Translation
- Code generation
- Question answering
- Following complex instructions
Limitations
- No real-time knowledge (training cutoff)
- Can hallucinate facts confidently
- Limited context window
- No persistent memory between sessions
- Math and precise reasoning can be weak
- Can't learn from conversations (weights are frozen after training)
Key Terminology
- **Token**: The basic unit of text for an LLM. Not quite words: common words are single tokens, while rare words get split into pieces. Roughly 3-4 characters per token in English (see the tokenization sketch after this list).
- **Context Window**: The maximum number of tokens the model can process at once, covering both input and output. See Context Window.
- **Inference**: Running a trained model to generate predictions. This is what you do when you use an LLM.
- **Parameters**: The learned values (weights) in the neural network. More parameters generally mean more capability, but also more resource requirements.
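A quick way to see tokenization in practice, using the `tiktoken` library (the encoding name below is the one used by several OpenAI models; other models use different tokenizers):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("Tokenization splits text unpredictably")
print(ids)                             # one integer id per token
print([enc.decode([i]) for i in ids])  # the text piece behind each id
```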