What is an LLM?
Large Language Models: what they are and how they work
A Large Language Model (LLM) is a neural network trained to predict the next token in a sequence of text. Despite this simple objective, when trained on enough data with enough parameters, these models develop sophisticated language understanding and generation capabilities.
The Core Idea
At its heart, an LLM does one thing: given some text, predict what comes next.
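More precisely, at each step the model produces a probability distribution over possible next tokens and picks one. A toy sketch in Python (the prompt and every probability below are made up for illustration):

```python
# Made-up next-token probabilities; a real model scores every token in
# its vocabulary (tens of thousands of entries) at every step.
prompt = "The cat sat on the"
next_token_probs = {" mat": 0.62, " floor": 0.11, " couch": 0.08, " table": 0.05}

# Greedy decoding: take the single most likely token.
best = max(next_token_probs, key=next_token_probs.get)
print(prompt + best)  # The cat sat on the mat
```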
This "next token prediction" is repeated to generate longer text:
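A minimal sketch of that loop, assuming a hypothetical `predict_next_token(text)` helper that wraps a trained model:

```python
def generate(prompt, max_new_tokens=50):
    # predict_next_token is a hypothetical stand-in for a real model call.
    text = prompt
    for _ in range(max_new_tokens):
        token = predict_next_token(text)
        if token == "<end>":   # models emit a special token to stop early
            break
        text += token          # the output becomes part of the next input
    return text
```

Calling `generate("Once upon a time")` would keep appending tokens until the model signals it is done or the token budget runs out.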
Each prediction considers all the previous tokens (up to the context window limit). This allows the model to maintain coherence over long passages.
Architecture: Transformers
Modern LLMs use the Transformer architecture, introduced in the 2017 paper "Attention Is All You Need." The key innovation is the attention mechanism, which lets each token draw on ("attend to") every earlier token in the sequence when making a prediction.
Why Attention Matters
In the sentence "The cat sat on the mat because it was tired," the word "it" needs to refer back to "cat." Attention lets the model weigh the relevance of every previous word when it processes "it," so the reference can be resolved correctly.
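A minimal NumPy sketch of the underlying operation, scaled dot-product attention (the vectors here are random stand-ins for real token representations):

```python
import numpy as np

def attention(Q, K, V):
    # Similarity of each query with every key, scaled by sqrt(d_k)
    # to keep the softmax well-behaved.
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax turns scores into weights that sum to 1 per token.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output is a weighted mix of the value vectors: "it" can pull
    # most of its information from "cat" if that weight is high.
    return weights @ V

# 5 tokens, each an 8-dimensional vector (random stand-ins for embeddings).
x = np.random.randn(5, 8)
out = attention(x, x, x)  # self-attention: Q, K, V all come from the same tokens
print(out.shape)          # (5, 8)
# (A real LLM also masks out future tokens; omitted here for brevity.)
```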
Simplified Transformer Structure
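In rough terms, an LLM is an embedding layer followed by a stack of identical blocks, each combining self-attention with a small feed-forward network. A minimal PyTorch sketch of one such block (all sizes are illustrative, not taken from any particular model):

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One block: self-attention, then a feed-forward MLP, each with a
    residual connection and layer norm. An LLM stacks dozens of these."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, causal_mask):
        # Tokens exchange information; the causal mask blocks attention
        # to future positions, so each token only sees its predecessors.
        a, _ = self.attn(x, x, x, attn_mask=causal_mask, need_weights=False)
        x = self.norm1(x + a)
        # Per-token processing of what attention gathered.
        return self.norm2(x + self.mlp(x))

seq_len = 10
x = torch.randn(1, seq_len, 512)  # (batch, tokens, features)
mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
print(TransformerBlock()(x, mask).shape)  # torch.Size([1, 10, 512])
```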
What Makes Them "Large"
The "large" in LLM refers to:
- **Parameter Count**: Billions of learned weights. More parameters generally mean more capability, but also more memory and compute. Modern LLMs range from 1B to 400B+ parameters.
- **Training Data**: Trained on trillions of tokens of text from the internet, books, code, and other sources. The diversity and scale of the training data are crucial for general capability.
- **Compute**: Training requires massive GPU clusters running for weeks or months. A single training run can cost millions of dollars in compute (see the rough estimate below).
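For a sense of scale, a common rule of thumb puts total training compute at roughly 6 × N × D floating-point operations, where N is the parameter count and D the number of training tokens. The figures below are illustrative, not any specific model's:

```python
# Back-of-envelope training compute using the ~6 * N * D rule of thumb.
N = 70e9            # 70B parameters (illustrative)
D = 2e12            # 2 trillion training tokens (illustrative)
flops = 6 * N * D
print(f"{flops:.1e} total FLOPs")  # 8.4e+23
```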
Training vs Inference
| Aspect | Training | Inference |
|---|---|---|
| Goal | Learn the weights | Use the weights |
| Compute | Massive (clusters) | Moderate (single machine possible) |
| Data | Trillions of tokens | Just your prompt |
| Who does it | Labs with $10M+ budgets | Anyone with hardware |
| Duration | Weeks to months | Seconds per response |
When you "run an LLM locally," you're doing inference — using a pre-trained model to generate text. You're not training it.
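For example, a minimal local-inference sketch with the Hugging Face `transformers` library (the model choice and generation settings here are just placeholders):

```python
from transformers import pipeline

# Downloads pre-trained weights once, then runs inference locally.
generator = pipeline("text-generation", model="gpt2")
result = generator("The capital of France is", max_new_tokens=20)
print(result[0]["generated_text"])
```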
Why They Need So Much Memory
During inference, the model weights must be loaded into memory. Each parameter is a number, typically stored as 16 bits (2 bytes) or less with quantization.
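A back-of-envelope example, assuming a hypothetical 7B-parameter model:

```python
params = 7e9                                   # 7B parameters (illustrative)
print(f"FP16:  {params * 2   / 1e9:.1f} GB")   # 16-bit weights -> 14.0 GB
print(f"4-bit: {params * 0.5 / 1e9:.1f} GB")   # quantized      ->  3.5 GB
```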
This is just for the weights. You also need memory for:
- KV Cache: Grows with context length (often 2-16 GB; see the sizing sketch after this list)
- Activations: Intermediate computation results
- Overhead: Framework, buffers, etc.
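To see why the KV cache grows with context, here is a rough sizing sketch; the layer counts and dimensions below are illustrative, not a specific model's configuration:

```python
# Rough KV-cache size: keys and values are stored for every layer and
# every token currently in context. All numbers here are illustrative.
n_layers  = 32
n_heads   = 32     # assumes full multi-head attention; grouped-query
head_dim  = 128    # attention in many recent models shrinks this cost
seq_len   = 8192   # tokens currently in context
bytes_per = 2      # 16-bit cache entries

kv_bytes = 2 * n_layers * n_heads * head_dim * seq_len * bytes_per
print(f"{kv_bytes / 1e9:.1f} GB")  # ~4.3 GB, growing linearly with seq_len
```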
What LLMs Can and Can't Do
Strengths
- Natural language understanding
- Text generation and summarization
- Translation
- Code generation
- Question answering
- Following complex instructions
Limitations
- No real-time knowledge (training cutoff)
- Can hallucinate facts confidently
- Limited context window
- No persistent memory between sessions
- Math and precise reasoning can be weak
- Can't learn from conversations (weights are frozen after training)
Key Terminology
- **Token**: The basic unit of text for an LLM. Not quite words: common words are single tokens, while rare words get split into pieces. Roughly 3-4 characters per token in English (see the tokenization sketch after this list).
- **Context Window**: The maximum number of tokens the model can process at once, covering both input and output. See Context Window.
- **Inference**: Running a trained model to generate predictions. This is what you do when you use an LLM.
- **Parameters**: The learned values (weights) in the neural network. More parameters generally mean more capability, but also more resource requirements.
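A quick way to see tokenization in practice, using the `tiktoken` library (the encoding name below is the one used by several OpenAI models; other models use different tokenizers):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("Tokenization splits text unpredictably")
print(ids)                             # one integer id per token
print([enc.decode([i]) for i in ids])  # the text piece behind each id
```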