A Large Language Model (LLM) is a neural network trained to predict the next token in a sequence of text. Despite this simple objective, when trained on enough data with enough parameters, these models develop sophisticated language understanding and generation capabilities.

The Core Idea

At its heart, an LLM does one thing: given some text, predict what comes next.

Input: "The capital of France is"
            │
            ▼
     ┌─────────────┐
     │     LLM     │
     └─────────────┘
            │
            ▼
Output: " Paris"  (with high probability)

This "next token prediction" is repeated to generate longer text:

"The capital of France is"           → " Paris"
"The capital of France is Paris"     → "."
"The capital of France is Paris."    → " It"
"The capital of France is Paris. It" → " is"
... and so on

Each prediction considers all the previous tokens (up to the context window limit). This allows the model to maintain coherence over long passages.
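The loop described above can be sketched in a few lines of Python. A real LLM would replace `predict_next` with a neural network that outputs a probability distribution over its whole vocabulary; here a lookup table stands in for it, and every name is illustrative rather than a real API.

```python
# Toy sketch of autoregressive generation. The lookup table below is a
# stand-in for a real model, which would return next-token probabilities.
def predict_next(context: str) -> str:
    table = {
        "The capital of France is": " Paris",
        "The capital of France is Paris": ".",
        "The capital of France is Paris.": " It",
    }
    return table.get(context, " is")

def generate(prompt: str, max_new_tokens: int = 3) -> str:
    text = prompt
    for _ in range(max_new_tokens):
        # Each step conditions on everything generated so far.
        text += predict_next(text)
    return text

print(generate("The capital of France is"))
# -> The capital of France is Paris. It
```

The key property to notice is that the model is called once per token, and each call sees the full text produced so far.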

Architecture: Transformers

Modern LLMs use the Transformer architecture, introduced in 2017. The key innovation is the attention mechanism, which allows every token to "attend to" every other token when making predictions.

Why Attention Matters

In the sentence "The cat sat on the mat because it was tired," the word "it" needs to refer back to "cat." Attention lets the model weigh the relevance of every previous word when processing "it," so it can make that association correctly.

Simplified Transformer Structure

Input Tokens
        │
        ▼
┌─────────────────┐
│    Embedding    │   Convert tokens to vectors
└─────────────────┘
        │
        ▼
┌─────────────────┐
│    Attention    │ ← Each token attends to all others
│      Layer      │
├─────────────────┤
│  Feed-Forward   │ ← Process each position
│      Layer      │
└─────────────────┘
        │  × N layers (32-80+ for large models)
        ▼
┌─────────────────┐
│   Output Head   │   Predict next-token probabilities
└─────────────────┘
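The attention layer in the diagram can be sketched in pure Python. This is single-head scaled dot-product attention over tiny toy vectors, with no batching, masking, or learned projections; it shows only the core idea that each output is a weighted mix of all the value vectors.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Score the query against every key, scaled by sqrt(d).
        scores = [dot(q, k) / math.sqrt(d) for k in keys]
        weights = softmax(scores)  # how strongly to attend to each token
        out = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Two toy token vectors; queries, keys, and values happen to coincide here.
q = k = v = [[1.0, 0.0], [0.0, 1.0]]
out = attention(q, k, v)
```

Because the softmax weights sum to 1, each output row is a convex combination of the value vectors, with the heaviest weight on the most similar token.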

What Makes Them "Large"

The "large" in LLM refers to:

Parameter Count
Billions of learned weights. More parameters generally mean more capability, but also higher memory and compute requirements. Modern LLMs range from 1B to 400B+ parameters.
Training Data
Trained on trillions of tokens of text from the internet, books, code, and other sources. The diversity and scale of the training data are crucial for general capability.
Compute
Training requires massive GPU clusters running for weeks or months. A single training run can cost millions of dollars in compute.

Training vs Inference

Aspect        Training                   Inference
Goal          Learn the weights          Use the weights
Compute       Massive (GPU clusters)     Moderate (a single machine can suffice)
Data          Trillions of tokens        Just your prompt
Who does it   Labs with $10M+ budgets    Anyone with capable hardware
Duration      Weeks to months            Seconds per response

When you "run an LLM locally," you're doing inference — using a pre-trained model to generate text. You're not training it.

Why They Need So Much Memory

During inference, the model weights must be loaded into memory. Each parameter is a number, typically stored as 16 bits (2 bytes) or less with quantization.

Model size calculation:
  7 billion parameters  × 2 bytes (FP16)           =  14 GB
  70 billion parameters × 2 bytes (FP16)           = 140 GB
  70 billion parameters × 0.5 bytes (4-bit quant.) =  35 GB
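That arithmetic is easy to reproduce. The helper below uses decimal gigabytes and an illustrative name; it computes only the weight footprint, not the extra memory inference needs.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory needed just to hold the weights, in decimal GB."""
    bytes_total = n_params * bits_per_param / 8
    return bytes_total / 1e9

print(weight_memory_gb(7e9, 16))   # FP16, 7B model   -> 14.0
print(weight_memory_gb(70e9, 16))  # FP16, 70B model  -> 140.0
print(weight_memory_gb(70e9, 4))   # 4-bit, 70B model -> 35.0
```

Halving the bits per parameter (quantization) halves the weight footprint, which is why 4-bit quantization makes large models fit on consumer hardware.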

This is just for the weights. You also need memory for:

  • The KV cache, which grows with context length
  • Activations computed during the forward pass
  • The operating system, the inference runtime, and anything else running on the machine

What LLMs Can and Can't Do

Strengths

  • Natural language understanding
  • Text generation and summarization
  • Translation
  • Code generation
  • Question answering
  • Following complex instructions

Limitations

  • No real-time knowledge (training cutoff)
  • Can hallucinate facts confidently
  • Limited context window
  • No persistent memory between sessions
  • Math and precise reasoning can be weak
  • Can't learn from conversations

Key Terminology

Token
The basic unit of text for an LLM. Not quite words — common words are single tokens, rare words get split. Roughly 3-4 characters per token in English.
Context Window
The maximum number of tokens the model can process at once. Includes both input and output. See Context Window.
Inference
Running a trained model to generate predictions. What you do when you use an LLM.
Parameters
The learned values (weights) in the neural network. More parameters generally mean more capability but also greater resource requirements.
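To make the context-window term concrete, here is a sketch of the budget check an inference setup has to make before generating. The 4-characters-per-token heuristic matches the rough figure given above for English, and all names here are illustrative, not a real library API.

```python
def estimate_tokens(text: str) -> int:
    # Rough English heuristic: about 4 characters per token.
    return max(1, len(text) // 4)

def fits_in_context(prompt: str, max_new_tokens: int,
                    context_window: int) -> bool:
    # Input and output share the same window, so both count
    # against the limit.
    return estimate_tokens(prompt) + max_new_tokens <= context_window

print(fits_in_context("The capital of France is", 100, 4096))  # -> True
```

A real tokenizer gives exact counts, but this kind of estimate is often enough to decide whether a prompt plus its expected response will fit.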