Demystifying LLMs: An In-Depth Look at Karpathy's MicroGPT
For many developers, the inner workings of Large Language Models (LLMs) can feel like a black box. While powerful, the scale and complexity of production-grade LLMs often obscure their foundational principles. Andrej Karpathy, known for his relentless pursuit of simplification in machine learning, tackles this challenge head-on with MicroGPT—a remarkable "art project" designed to distill the essence of a GPT model into its absolute bare essentials.
MicroGPT is a single, self-contained Python file, spanning a mere 200 lines, with zero external dependencies. This concise script encapsulates the full algorithmic content required to train and infer a GPT-like model. It's the culmination of projects like micrograd, makemore, and nanogpt, representing a decade-long effort to simplify LLMs for pedagogical clarity. Karpathy describes it as a beautiful realization of what’s truly necessary for an LLM to function; everything else, he posits, is simply about efficiency.
What MicroGPT Encompasses
This compact script provides a complete, runnable example of a generative pre-trained transformer. Its components include:
- Dataset: The raw text fuel for the model.
- Tokenizer: Converts text into numerical tokens and vice-versa.
- Autograd Engine: A custom-built mechanism for automatic differentiation, crucial for training.
- GPT-2-like Neural Network Architecture: The core model structure.
- Adam Optimizer: The algorithm used to update model parameters during training.
- Training Loop: Orchestrates the learning process.
- Inference Loop: Generates new text based on the trained model.
Let’s delve into how MicroGPT achieves this astonishing feat of simplification.
The Dataset: Fueling the Model
LLMs learn from vast quantities of text data. While production models might use entire web pages, MicroGPT opts for a simpler, more focused dataset: a collection of approximately 32,000 names, one per line. The model's objective is to learn the statistical patterns within these names—like common letter sequences or structures—and then generate new, plausible-sounding names that adhere to these learned patterns.
For example, after training, MicroGPT can "hallucinate" names such as 'kamon', 'ann', 'karai', or 'jaire', demonstrating its grasp of the input distribution. This exercise beautifully illustrates the core function of an LLM: given a starting point (a prompt), it statistically completes a "document" (the response) in a way that aligns with its training data.
```python
# Let there be an input dataset docs: list[str] of documents (e.g. a dataset of names)
if not os.path.exists('input.txt'):
    import urllib.request
    names_url = 'https://raw.githubusercontent.com/karpathy/makemore/refs/heads/master/names.txt'
    urllib.request.urlretrieve(names_url, 'input.txt')
docs = [l.strip() for l in open('input.txt').read().strip().split('\n') if l.strip()]  # list[str] of documents
random.shuffle(docs)
print(f"num docs: {len(docs)}")
```
The Tokenizer: Bridging Text and Numbers
Neural networks operate on numbers, not raw characters. A tokenizer is the bridge, converting text into sequences of integer token IDs. While sophisticated production tokenizers like OpenAI's tiktoken process character chunks for efficiency, MicroGPT uses the simplest approach: assigning a unique integer to each unique character found in the dataset.
In this case, the unique characters are primarily the lowercase English alphabet (a-z). Each character gets an ID corresponding to its sorted index. Importantly, these integer values are arbitrary; they merely represent distinct symbols. MicroGPT also introduces a special BOS (Beginning of Sequence) token. This token acts as a delimiter, signaling the start and end of a document (e.g., a name), teaching the model when to begin and conclude a generation. With 26 letters and one BOS token, the vocabulary size is 27.
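The character-level scheme above can be sketched in a few lines. This is a minimal illustration, not MicroGPT's exact code; the toy document list and the `stoi`/`itos` names are assumptions for the example:

```python
# Toy character-level tokenizer with a BOS delimiter (illustrative names).
docs = ["emma", "olivia", "ava"]
chars = sorted(set("".join(docs)))            # unique characters, sorted
BOS = len(chars)                              # one extra id for the BOS token
stoi = {ch: i for i, ch in enumerate(chars)}  # char -> id
itos = {i: ch for ch, i in stoi.items()}      # id -> char

def encode(doc):
    # wrap each document in BOS tokens so the model learns where names start and end
    return [BOS] + [stoi[ch] for ch in doc] + [BOS]

def decode(tokens):
    return "".join(itos[t] for t in tokens if t != BOS)

tokens = encode("emma")
print(tokens)          # [7, 1, 4, 4, 0, 7] with this toy vocabulary
print(decode(tokens))  # 'emma'
```

With the full names dataset the vocabulary would be the 26 lowercase letters plus BOS, giving the 27-token vocabulary described above.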
Autograd: The Engine of Learning
Training neural networks fundamentally relies on gradients: knowing how much and in which direction to adjust each model parameter to reduce the prediction error (loss). MicroGPT implements its own autograd engine from scratch using a single Value class, mirroring the functionality of libraries like PyTorch but on scalar numbers rather than tensors.
Each Value object wraps a scalar data and tracks its computation history. When mathematical operations (e.g., add, multiply) are performed on Value objects, the result is a new Value that records its _children (inputs) and _local_grads (the derivative of the operation with respect to its inputs). For instance, in a * b, the local gradient with respect to a is b, and vice versa.
```python
class Value:
    __slots__ = ('data', 'grad', '_children', '_local_grads')

    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0
        self._children = children
        self._local_grads = local_grads

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1, 1))

    # ... (other operations like __mul__, __pow__, log, exp, relu)

    def backward(self):
        topo = []
        visited = set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1
        for v in reversed(topo):
            for child, local_grad in zip(v._children, v._local_grads):
                child.grad += local_grad * v.grad
```
The backward() method is where the magic happens. It traverses the computation graph in reverse topological order, starting from the final loss node (initialized with grad=1). At each step, it applies the chain rule from calculus: if a value v has a child c and a local derivative ∂v/∂c, then the gradient accumulated at c is updated by ∂v/∂c * ∂L/∂v (where ∂L/∂v is the gradient of the loss L with respect to v). Gradients from multiple paths are summed (via +=), accurately reflecting how a single parameter can influence the loss through various computations.
This process provides each Value with a grad attribute, indicating how the final loss changes if that specific value is nudged. For example, if L = a * b + a, with a=2 and b=3, L.backward() would yield a.grad = 4.0 and b.grad = 2.0. This means increasing a by 0.001 would increase L by approximately 0.004, and increasing b by 0.001 would increase L by 0.002. These gradients are then used by an optimizer to iteratively adjust the model's parameters.
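The L = a * b + a example can be checked end to end with a self-contained sketch. It mirrors the Value class shown earlier, with `__mul__` filled in here for completeness (the original elides it):

```python
# Tiny scalar autograd sketch reproducing the L = a*b + a example.
class Value:
    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0
        self._children = children
        self._local_grads = local_grads

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1, 1))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        # d(a*b)/da = b and d(a*b)/db = a
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        topo, visited = [], set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1
        for v in reversed(topo):
            for child, local_grad in zip(v._children, v._local_grads):
                child.grad += local_grad * v.grad

a, b = Value(2.0), Value(3.0)
L = a * b + a
L.backward()
print(a.grad, b.grad)  # 4.0 2.0
```

Note how a.grad sums two paths: 3.0 through the product plus 1.0 from the direct addition, exactly the multi-path accumulation described above.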
Parameters: The Model's Knowledge
Model parameters are the learned weights and biases that define the network's behavior. In MicroGPT, these are floating-point numbers, initially randomized (from a Gaussian distribution) and stored in a state_dict (similar to PyTorch). The parameters are organized into matrices for token embeddings (wte), position embeddings (wpe), attention mechanisms (attn_wq, wk, wv, wo), and Multi-Layer Perceptron (MLP) layers (mlp_fc1, mlp_fc2).
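A state_dict of Gaussian-initialized matrices might be sketched as follows. The shapes and the `matrix` helper are illustrative assumptions, not MicroGPT's actual dimensions:

```python
import random

# Hypothetical sketch of Gaussian-initialized parameters in a state_dict.
n_embd, vocab_size, block_size = 16, 27, 8

def matrix(rows, cols, std=0.02):
    # each parameter is drawn from a Gaussian distribution
    return [[random.gauss(0, std) for _ in range(cols)] for _ in range(rows)]

state_dict = {
    'wte': matrix(vocab_size, n_embd),   # token embeddings
    'wpe': matrix(block_size, n_embd),   # position embeddings
    'lm_head': matrix(vocab_size, n_embd),
}
num_params = sum(len(m) * len(m[0]) for m in state_dict.values())
print(f"num params: {num_params}")
```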
For this tiny model, there are 4,192 parameters. This is a stark contrast to modern LLMs, which can boast hundreds of billions of parameters, highlighting MicroGPT's focus on conceptual clarity over scale.
Architecture: A Simplified GPT-2
The core of MicroGPT is its neural network architecture, a simplified version of GPT-2. It processes one token at a time, considering its position and the context from previous tokens via a KV (Key-Value) cache. The architecture leverages three helper functions:
- `linear(x, w)`: Performs a matrix-vector multiplication, a fundamental linear transformation.
- `softmax(logits)`: Converts raw scores (logits) into a probability distribution over the vocabulary.
- `rmsnorm(x)`: Root Mean Square Normalization, stabilizing activations by rescaling vectors to have unit root-mean-square. It's a simpler alternative to LayerNorm.
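These three helpers can be written as plain-Python sketches over lists of floats, in the same scalar spirit as MicroGPT (the exact signatures here are illustrative):

```python
import math

def linear(x, w):
    # matrix-vector product: one dot product per row of w
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def softmax(logits):
    # subtract the max before exponentiating for numerical stability
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def rmsnorm(x, eps=1e-5):
    # rescale so the vector has unit root-mean-square
    ms = sum(xi * xi for xi in x) / len(x)
    return [xi / math.sqrt(ms + eps) for xi in x]

probs = softmax([1.0, 2.0, 3.0])
print(sum(probs))  # sums to 1.0: a valid probability distribution
```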
The gpt function combines these elements:
- Embeddings: The `token_id` and `pos_id` are looked up in their respective embedding tables (`wte`, `wpe`). These vectors are summed, creating a joint representation that encodes both the token's identity and its sequence position.
- Attention Block: This is where the model determines the relevance of past tokens. The current token is transformed into a query (Q), key (K), and value (V). The keys and values of previous tokens are stored in the KV cache. Each "attention head" calculates dot products between its query and all cached keys, scales them, applies softmax to get attention weights, and then takes a weighted sum of the cached values. This mechanism allows the model to selectively "pay attention" to relevant parts of the input sequence. The outputs from multiple heads are concatenated and projected through `attn_wo`.
- MLP Block: A simple feed-forward neural network that further processes the information from the attention block. It consists of two linear layers (`mlp_fc1`, `mlp_fc2`) separated by a ReLU activation function. Residual connections are used throughout to aid gradient flow.
Finally, the processed vector is passed through a lm_head linear layer to produce logits—raw scores for each possible next token in the vocabulary. These logits are then fed to the softmax function to yield probabilities.
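The attention step described above can be sketched for a single head with a KV cache. This is a hedged illustration of the mechanism, not MicroGPT's exact code; the `attend` name, the toy cache contents, and the head dimension are assumptions:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def attend(q, k_cache, v_cache):
    head_dim = len(q)
    # score the current query against every cached key, scaled by sqrt(d)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(head_dim)
              for k in k_cache]
    weights = softmax(scores)
    # weighted sum of the cached values
    return [sum(w * v[i] for w, v in zip(weights, v_cache))
            for i in range(head_dim)]

# toy example: two past tokens in the cache, head_dim = 2
k_cache = [[1.0, 0.0], [0.0, 1.0]]
v_cache = [[10.0, 0.0], [0.0, 10.0]]
out = attend([1.0, 0.0], k_cache, v_cache)
print(out)  # the output leans toward the first cached value, whose key matches the query
```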
Practical Takeaways
MicroGPT is a pedagogical masterpiece. It strips away the distributed computing, optimized kernels, and complex data pipelines common in production LLMs, revealing the core algorithms. For developers, it offers an unparalleled opportunity to:
- Understand from First Principles: Witness how autograd, a character-level tokenizer, and a simplified transformer architecture come together in a functional LLM.
- Demystify Complexity: Realize that even colossal models are built upon these fundamental, albeit scaled-up, building blocks.
- Appreciate Efficiency: See why production systems require advanced libraries and hardware. Scalar-based autograd is algorithmically identical to tensor-based PyTorch, which makes the performance gap between them all the more striking.
While not intended for production use, MicroGPT is an invaluable resource for anyone seeking a deep, hands-on understanding of what makes LLMs tick.
FAQ
Q: How does MicroGPT's custom autograd engine differ algorithmically from PyTorch's backward()?
A: Algorithmically, MicroGPT's Value class and its backward() method are identical to PyTorch's automatic differentiation. Both rely on constructing a computation graph and applying the chain rule in reverse topological order. The primary difference is in implementation: MicroGPT's Value objects handle single scalar numbers and their operations, whereas PyTorch's tensors operate on arrays of numbers, leveraging highly optimized C++ backends for vastly superior efficiency.
Q: What is the significance of the BOS (Beginning of Sequence) token in MicroGPT's tokenizer?
A: The BOS token serves as a crucial delimiter, marking the start and end of each document (name) in the dataset. By wrapping each name with BOS tokens (e.g., [BOS, e, m, m, a, BOS]), the model learns that BOS initiates a new sequence and signifies its conclusion. This helps the model generate complete, coherent names rather than just an endless stream of characters, as it learns the statistical likelihood of BOS appearing at certain points.
Q: Why does MicroGPT use RMSNorm instead of the LayerNorm found in the original GPT-2?
A: MicroGPT uses RMSNorm (Root Mean Square Normalization) primarily for simplification. RMSNorm is a less complex variant of LayerNorm, achieving a similar goal of stabilizing neural network activations by rescaling a vector so its values have a unit root-mean-square. It helps prevent activations from exploding or vanishing during training, contributing to a more stable learning process, while being simpler to implement than LayerNorm.