Autoregressive Transformer

I had been planning to make a YouTube video about this for quite some time. However, just as I was preparing to release it, Andrej Karpathy published his own excellent video on the topic, and it quickly went viral. After that, I decided to hold off on releasing my version and am now turning the material into a written guide on the transformer architecture. We will implement a GPT-like transformer from scratch in Python using only PyTorch as a dependency. I assume familiarity with PyTorch; for those new to it, Introduction to PyTorch Code Examples from Stanford provides a helpful starting point. Without further ado, let us get started!

Please note that the complete code for the GPT-like autoregressive transformer implemented in this article is available here: link. In 222 lines, it automatically downloads the dataset, tokenizes the text, pretrains the model, and generates sample text.

Given input text, our objective is to generate output text conditioned on that sequence. In traditional transformer models, excluding byte-latent transformers, the initial stage is always tokenization, regardless of whether the architecture is encoder-only (e.g., BERT), encoder-decoder (e.g., T5), or decoder-only (e.g., GPT-2). One straightforward method is character-level tokenization, where each individual character is mapped to a unique token ID. The following zero-dependency Python implementation is designed to handle both encoding and decoding for arbitrary input text data:

import string

class CharacterLevelTokenizer:
    """Represents a character-level tokenizer."""

    def __init__(self, text: str = string.printable) -> None:
        """Build the vocabulary from the given text."""

        self.vocab: list[str] = sorted(set(text))
        self.vocab_size: int = len(self.vocab)
        self.char_to_token: dict[str, int] = {char: idx for idx, char in enumerate(self.vocab)}

    def encode(self, text: str) -> list[int]:
        """Converts a string into a list of token IDs."""

        return [self.char_to_token[char] for char in text]

    def decode(self, token_ids: list[int]) -> str:
        """Converts a list of token IDs back into a string."""

        return "".join(self.vocab[token_id] for token_id in token_ids)

tokenizer = CharacterLevelTokenizer()
enc = tokenizer.encode("robot")    # -> [87, 84, 71, 84, 89]
dec = tokenizer.decode(enc)        # -> 'robot'

As shown above, CharacterLevelTokenizer treats each individual character as a separate token.

Note that while character-level tokenization is conceptually straightforward, most production-scale systems use subword tokenizers to strike a better balance between vocabulary size and representational capacity. By capturing frequent character sequences as single units, subword algorithms such as Byte-Pair Encoding (BPE) or WordPiece significantly improve computational efficiency compared to more granular methods. In this article, however, we will stick with CharacterLevelTokenizer.
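To build some intuition for subword merging, here is a minimal toy sketch of a single BPE merge step. The helper names are hypothetical and this is not the tokenizer used in this article; real implementations repeat the merge until a target vocabulary size is reached.

```python
from collections import Counter

def most_frequent_pair(tokens: list[str]) -> tuple[str, str]:
    """Count adjacent token pairs and return the most frequent one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(tokens: list[str], pair: tuple[str, str]) -> list[str]:
    """Replace every occurrence of `pair` with a single merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("low lower lowest")
pair = most_frequent_pair(tokens)  # The pair ('l', 'o') appears three times
tokens = merge_pair(tokens, pair)  # 'l' followed by 'o' becomes a single 'lo' token
```

After one merge, the frequent bigram "lo" is a single vocabulary unit; iterating this process is what lets BPE represent common words with very few tokens.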

At this stage, it may help to look at the transformer architecture as a whole to get a sense of its overall structure. After that, we can break it down and examine each component step by step to see how everything fits together. It may feel a bit overwhelming at first, which is natural, but do not be afraid, since in the end it is simply linear algebra arranged in a certain manner.

from dataclasses import dataclass
import torch
import torch.nn as nn
import torch.nn.functional as F

@dataclass(frozen=True)  # We like our dataclasses frozen!
class Config:
    """Transformer config."""

    vocab_size: int  # Tokenizer vocabulary size
    block_size: int  # Max sequence length (context window)
    n_layer: int     # Number of transformer layers
    n_head: int      # Attention heads per layer
    n_embd: int      # Embedding dimension (must be divisible by `n_head`)
    dropout: float   # Dropout probability
    bias: bool       # Whether `nn.Linear` and `nn.LayerNorm` use bias

    @property
    def head_size(self) -> int:
        """Returns the per-head dimension (embedding is split evenly across attention heads)."""

        return self.n_embd // self.n_head

class Transformer(nn.Module):
    """Autoregressive transformer language model."""

    def __init__(self, cfg: Config) -> None:
        """Initialize the building blocks of the transformer."""

        super().__init__()
        self.block_size = cfg.block_size

        self.tok_emb_table = nn.Embedding(cfg.vocab_size, cfg.n_embd)
        self.pos_emb_table = nn.Embedding(cfg.block_size, cfg.n_embd)

        self.blocks = nn.Sequential(*[Block(cfg) for _ in range(cfg.n_layer)])
        self.ln = nn.LayerNorm(cfg.n_embd, bias=cfg.bias)
        self.proj = nn.Linear(cfg.n_embd, cfg.vocab_size)

        # Weight tying: reduces the total number of parameters without degrading accuracy
        # Reference: https://arxiv.org/abs/1608.05859
        self.proj.weight = self.tok_emb_table.weight

    def forward(self, token_ids: torch.Tensor, targets: torch.Tensor | None = None) -> tuple[torch.Tensor, torch.Tensor | None]:
        """Compute logits and optional loss."""

        B, T = token_ids.shape
        tok_emb = self.tok_emb_table(token_ids)
        pos_emb = self.pos_emb_table(torch.arange(T, device=token_ids.device))
        emb = tok_emb + pos_emb

        out = self.ln(self.blocks(emb))
        if targets is None:
            logits = self.proj(out[:, [-1], :])  # Last-token projection with time dimension
            loss = None
        else:
            logits = self.proj(out)
            loss = F.cross_entropy(logits.view(B * T, -1), targets.view(-1))

        return logits, loss

    @torch.no_grad()
    def generate(
        self,
        token_ids: torch.Tensor,
        max_new_token_ids: int,
        temperature: float = 0.7,
        top_k: int | None = None,
    ) -> torch.Tensor:
        """Generates tokens IDs autoregressively."""

        self.eval()
        for _ in range(max_new_token_ids):
            logits, _ = self(token_ids[:, -self.block_size :])
            logits = logits[:, -1, :] / temperature

            if top_k is not None:
                k = min(top_k, logits.size(-1))
                threshold = torch.topk(logits, k).values[:, [-1]]
                logits[logits < threshold] = -float("inf")

            probs = F.softmax(logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            token_ids = torch.cat((token_ids, next_token), dim=1)
        return token_ids

Let us focus on the forward method, the core inference function that is also used during generation. After tokenizing the input text, the token embedding table maps each token ID to a vector, producing tok_emb, which represents each token's meaning as a numerical tensor. However, these embeddings do not encode token order. To incorporate positional information, we use learned positional embeddings, stored in pos_emb, rather than fixed sinusoidal encodings, as learned embeddings are more expressive and can adapt to task-specific positional patterns. The token and positional embeddings are then combined through simple addition to form a unified representation emb that encodes both meaning and position. This additive approach is sufficient: it breaks permutation symmetry and allows the attention mechanism to infer and model positional structure without requiring more complex operations.
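To make the shapes concrete, here is a small standalone sketch of the embedding addition with made-up dimensions (not tied to the model above):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, block_size, n_embd = 100, 32, 64

tok_emb_table = nn.Embedding(vocab_size, n_embd)  # One learned vector per token ID
pos_emb_table = nn.Embedding(block_size, n_embd)  # One learned vector per position

token_ids = torch.randint(vocab_size, (2, 8))             # Batch of 2 sequences, 8 tokens each
tok_emb = tok_emb_table(token_ids)                        # (2, 8, 64): what each token means
pos_emb = pos_emb_table(torch.arange(token_ids.size(1)))  # (8, 64): where each token sits
emb = tok_emb + pos_emb                                   # Broadcast over the batch -> (2, 8, 64)
```

Note that pos_emb has no batch dimension; PyTorch broadcasting adds the same positional vectors to every sequence in the batch.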

Learned positional embeddings are simple to implement and work well in practice, but they typically tie a model to the maximum sequence length used during training. Many modern architectures instead adopt Rotary Position Embedding (RoPE), which encodes position by rotating query and key vectors with position-dependent angles. This design allows attention to represent relative distances between tokens and often extrapolates more gracefully to longer contexts. For simplicity, however, this article uses learned positional embeddings.
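The rotation idea behind RoPE can be sketched in a few lines. This is an illustrative, simplified implementation (the function name and pairing scheme are my own, and it is not part of this article's model); it rotates pairs of channels by position-dependent angles, which leaves each position's vector norm unchanged:

```python
import torch

def rope_rotate(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate channel pairs of a (T, d) tensor by position-dependent angles (RoPE sketch)."""
    T, d = x.shape                                      # Sequence length, even channel count
    half = d // 2
    freqs = base ** (-torch.arange(half) / half)        # One frequency per channel pair
    angles = torch.arange(T)[:, None] * freqs[None, :]  # (T, half): position times frequency
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    # Standard 2D rotation applied to each (x1_i, x2_i) pair
    return torch.cat((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)

torch.manual_seed(0)
q = torch.randn(6, 8)
q_rot = rope_rotate(q)  # Same shape; rotations preserve each position's vector norm
```

Because the rotation angle grows linearly with position, the dot product between a rotated query and a rotated key depends on the positional difference of the two tokens, which is what gives RoPE its relative-position behavior.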

Although the line out = self.ln(self.blocks(emb)) appears compact, it encapsulates a substantial portion of the transformer's computational core. Here, self.blocks represents a stack of transformer blocks, each composed of Multi-Head Attention (MHA) mechanisms and Multilayer Perceptrons (MLPs) that progressively refine the token embeddings by modeling complex semantic and contextual relationships across the sequence. Following these deep transformations, self.ln applies layer normalization to stabilize the network and ensure well-behaved gradients. From there, the forward pass branches depending on the objective: during inference (when targets are omitted), the model efficiently isolates the last token's representation before projecting it into vocabulary-sized logits, since predicting the next word only requires this final aggregated context. Conversely, during training, the entire sequence is projected and the resulting logits and targets are flattened to combine the batch and time dimensions, satisfying PyTorch's F.cross_entropy loss requirements.
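The flattening step in the training branch can be verified against a manual log-softmax computation. The shapes below are made up for illustration:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, V = 2, 4, 10  # Batch size, sequence length, vocabulary size
logits = torch.randn(B, T, V)
targets = torch.randint(V, (B, T))

# `F.cross_entropy` expects (N, C) logits and (N,) targets, so batch and time are merged
loss = F.cross_entropy(logits.view(B * T, V), targets.view(-1))

# Equivalent per-token negative log-likelihood computed by hand
log_probs = F.log_softmax(logits, dim=-1)
manual = -log_probs.gather(-1, targets.unsqueeze(-1)).mean()
```

Both computations average the next-token negative log-likelihood over all B * T positions, which is exactly the quantity minimized during pretraining.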

While these final routing steps handle output formatting and the training objective, they are not where the model's main representational power resides. That capability comes from the repeated attention and feed-forward layers inside self.blocks. To understand how the model builds contextual meaning across a sequence, we will next unpack self.blocks(emb) and examine the Block class along with its core components, AttentionHead and MultiHeadAttention, to see how they interact under the hood.

class AttentionHead(nn.Module):
    """A single causal self-attention head."""

    def __init__(self, cfg: Config) -> None:
        """Initialize QKV projection and dropout, and cache causal mask to avoid recomputing it."""

        super().__init__()
        self.qkv = nn.Linear(cfg.n_embd, 3 * cfg.head_size, bias=cfg.bias)
        self.dropout = nn.Dropout(cfg.dropout)
        self.register_buffer("mask", torch.tril(torch.ones(cfg.block_size, cfg.block_size)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Compute masked single-head self-attention for the input tensor."""

        T = x.size(1)
        q, k, v = self.qkv(x).split(self.qkv.out_features // 3, dim=-1)

        attn_scores = q @ k.transpose(-2, -1)
        attn_scores = attn_scores * q.size(-1) ** -0.5  # Scale by 1/sqrt(head_size) to prevent softmax saturation
        attn_scores = attn_scores.masked_fill(self.mask[:T, :T] == 0, float("-inf"))  # Mask future tokens

        attn_weights = F.softmax(attn_scores, dim=-1)
        attn_weights = self.dropout(attn_weights)
        return attn_weights @ v

class MultiHeadAttention(nn.Module):
    """A Multi-Head Attention (MHA)."""

    def __init__(self, cfg: Config) -> None:
        """Initialize multi-head self-attention with output projection and dropout."""

        super().__init__()
        self.heads = nn.ModuleList([AttentionHead(cfg) for _ in range(cfg.n_head)])
        self.proj = nn.Linear(cfg.n_embd, cfg.n_embd)
        self.dropout = nn.Dropout(cfg.dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Compute masked multi-head self-attention for the input tensor."""

        out = torch.cat([head(x) for head in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

class Block(nn.Module):
    """Transformer block."""

    def __init__(self, cfg: Config) -> None:
        """Initializes a transformer block."""

        super().__init__()
        self.ln1 = nn.LayerNorm(cfg.n_embd, bias=cfg.bias)
        self.mha = MultiHeadAttention(cfg)
        self.ln2 = nn.LayerNorm(cfg.n_embd, bias=cfg.bias)
        self.mlp = nn.Sequential(
            nn.Linear(cfg.n_embd, 2 * cfg.n_embd, bias=cfg.bias),  # Flexible: can change to e.g. `4 * cfg.n_embd`
            nn.GELU(),
            nn.Linear(2 * cfg.n_embd, cfg.n_embd, bias=cfg.bias),
            nn.Dropout(cfg.dropout),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Forward pass of a Transformer block: attention + MLP with residuals and layer norms."""

        x = x + self.mha(self.ln1(x))
        x = x + self.mlp(self.ln2(x))
        return x

The AttentionHead class implements a single causal self-attention head. Attention, introduced in the Attention Is All You Need paper, is a directional communication mechanism that allows tokens in a sequence to exchange information, with each token gathering information from others. Each token produces three tensors: a query (what the token is looking for), a key (what the token offers for matching), and a value (the content the token carries).

The interaction between queries and keys determines how strongly tokens attend to one another, while values carry the aggregated content. In an autoregressive transformer, causal self-attention restricts each token to attend only to previous tokens, preventing future information leakage during training and inference. Furthermore, the queries, keys, and values all originate from the same input sequence. Formally, for a token sequence of length \(n\) and head dimension \(d\), the queries \(Q\), keys \(K\), and values \(V\) are projections \(Q, K, V \in \mathbb{R}^{n \times d}\), and attention is computed as:

$$\text{Attention}(Q, K, V) = \text{softmax}\big(\frac{Q K^\top}{\sqrt{d}}\big) V$$

where \(QK^\top / \sqrt{d}\) computes scaled similarity scores between queries and keys, with the scaling factor preventing softmax saturation. The softmax converts these scores into attention weights, which can be viewed as how much of each page to read to answer the question. The weighted sum of \(V\) then produces an output that emphasizes the most relevant information, similar to extracting summarized notes from the most useful pages. In the implementation, attention is computed in the forward method of the AttentionHead class.
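The formula can be traced numerically in a few lines. This sketch uses made-up sizes and a single head, with the causal mask applied the same way as in AttentionHead:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, d = 4, 8  # Sequence length and head dimension (illustrative only)
q, k, v = torch.randn(T, d), torch.randn(T, d), torch.randn(T, d)

scores = (q @ k.T) * d**-0.5                     # Scaled similarity between queries and keys
mask = torch.tril(torch.ones(T, T))              # Causal mask: no attending to the future
scores = scores.masked_fill(mask == 0, float("-inf"))
weights = F.softmax(scores, dim=-1)              # Each row is a probability distribution
out = weights @ v                                # Weighted sum of values, shape (T, d)
```

Because of the mask, the first token can only attend to itself, so the first row of weights is exactly [1, 0, 0, 0], and every row sums to 1.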

In MultiHeadAttention, we implement MHA. Conceptually, MHA consists of several parallel AttentionHead modules whose outputs are concatenated and passed through a final linear projection. This design allows the model to attend to multiple aspects of the input simultaneously, with each head learning distinct relational patterns such as syntactic structure, long-range dependencies, and subtle semantic cues, which enrich the overall representation. We then apply a dropout to the projected output to regularize the MHA module during training. Without this regularization, different heads may co-adapt and learn redundant patterns, and the projection layer after concatenation can become overly confident in certain features. Since transformers have large capacity, this helps reduce the risk of overfitting and encourages the model to learn more diverse and robust representations.
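A quick shape sketch (with hypothetical dimensions) confirms that concatenating the heads restores the embedding dimension before the final projection:

```python
import torch

torch.manual_seed(0)
n_head, head_size, T = 4, 16, 8
n_embd = n_head * head_size  # 64: the embedding splits evenly across heads

head_outputs = [torch.randn(T, head_size) for _ in range(n_head)]  # One (T, head_size) tensor per head
concat = torch.cat(head_outputs, dim=-1)                           # Back to (T, n_embd)
```

This is why Config requires n_embd to be divisible by n_head: each head works in an n_embd / n_head slice, and concatenation reassembles the full embedding for the output projection.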

The Block class implements a single transformer block, which serves as a core building unit of the model. The first layer normalization standardizes the input to stabilize training, and the MHA module lets the block focus on multiple relationships across the sequence, with the residual connection ensuring that the original information is preserved. The second layer normalization prepares the data for the feed-forward network, which expands and transforms each position independently to capture higher-level features, while the GELU activation introduces nonlinearity and the dropout regularizes the output. The second residual connection adds the transformed features back to the input, helping gradients flow more effectively and enabling the block to learn complex patterns without losing essential information. With this, we now understand the line out = self.ln(self.blocks(emb)) and have everything we need to train a GPT-like model.
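To see why the residual connections preserve information, consider this small standalone sketch (dimensions are made up): if a sublayer's final projection outputs zero, the pre-LN residual x + mlp(ln(x)) reduces to the identity, so the block can refine its input but never destroy it.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_embd, T = 8, 5
ln = nn.LayerNorm(n_embd)
mlp = nn.Sequential(nn.Linear(n_embd, 2 * n_embd), nn.GELU(), nn.Linear(2 * n_embd, n_embd))

# Silence the sublayer by zeroing its final projection
nn.init.zeros_(mlp[2].weight)
nn.init.zeros_(mlp[2].bias)

x = torch.randn(T, n_embd)
out = x + mlp(ln(x))  # The residual path passes x through unchanged when the sublayer outputs zero
```

The same argument applies to the attention sublayer, and it is also why gradients flow well through deep stacks: the identity path gives every layer a direct route back to the input.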

Finally, let us discuss the generate method. This method performs autoregressive text generation using a transformer-style language model. Starting from an initial sequence of token_ids, it generates one token at a time for up to max_new_token_ids steps. At each iteration, the model is given only the most recent block_size tokens, ensuring the input fits within the model’s context window. The logits corresponding to the last position are extracted and scaled by the temperature parameter to control the randomness of the output. Optionally, top-k filtering can be applied to restrict sampling to the k most likely tokens by masking all others. The filtered logits are then converted to probabilities using softmax, and the next token is sampled stochastically with torch.multinomial. This sampled token is appended to the sequence, and the process repeats until the specified number of tokens has been generated, producing the final extended token sequence.
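The temperature scaling and top-k filtering steps can be reproduced in isolation on made-up logits for a tiny five-token vocabulary:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 1.0, 0.5, -1.0, 0.0]])  # Hypothetical last-position logits
top_k, temperature = 2, 0.7

scaled = logits / temperature                                    # Temperature < 1 sharpens the distribution
threshold = torch.topk(scaled, top_k).values[:, [-1]]            # The k-th largest logit per row
scaled = scaled.masked_fill(scaled < threshold, float("-inf"))   # Mask everything below it

probs = F.softmax(scaled, dim=-1)  # Only the top-2 tokens keep nonzero probability
```

After filtering, torch.multinomial would sample only from the two surviving tokens, which is how top-k keeps generation away from the long tail of unlikely continuations.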

This completes the GPT-like implementation of the autoregressive transformer! We can now train the model and generate text!

import urllib.request

def sample_batch(data: torch.Tensor, batch_size: int, block_size: int) -> tuple:
    """Randomly samples training sequences for Next-Token Prediction (NTP)."""

    idxs = torch.randint(len(data) - block_size, (batch_size,))                    # Random starting positions
    token_ids = torch.stack([data[idx : idx + block_size] for idx in idxs])        # Input sequences
    targets = torch.stack([data[idx + 1 : idx + block_size + 1] for idx in idxs])  # Same sequences shifted by +1 (NTP)
    return token_ids, targets

with urllib.request.urlopen("https://www.gutenberg.org/cache/epub/84/pg84.txt") as f:  # Read Frankenstein
    text = f.read().decode("utf-8")
tokenizer = CharacterLevelTokenizer(text)

device = torch.device("cuda:0" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu")
cfg = Config(vocab_size=tokenizer.vocab_size, block_size=32, n_layer=4, n_head=4, n_embd=64, dropout=0.0, bias=False)
max_iters, log_interval, batch_size = 2_000, 100, 256  # Learning rates are set per optimizer below

data = torch.tensor(tokenizer.encode(text), device=device)
model = Transformer(cfg).to(device).train()

n_params, adamw_params, muon_params = 0, [], []
for param in model.parameters():
    n_params += param.numel()
    (adamw_params if param.ndim < 2 else muon_params).append(param)
print(f"Model parameters: {n_params:,}\n")

adamw = torch.optim.AdamW(adamw_params, lr=3e-4, betas=(0.9, 0.95), eps=1e-8, weight_decay=0.1)
muon = torch.optim.Muon(muon_params, lr=0.02, momentum=0.95, weight_decay=0.1)
for it in range(max_iters):
    token_ids, targets = sample_batch(data, batch_size, cfg.block_size)

    adamw.zero_grad()
    muon.zero_grad()
    _, loss = model(token_ids, targets)
    loss.backward()
    adamw.step()
    muon.step()

    if (it + 1) % log_interval == 0:
        print(f"[STEP {it + 1:04d}] Train loss: {loss.item():.4f}")

# Seed the model with the prompt "I am" to kick off text generation
token_ids = torch.tensor(tokenizer.encode("I am"), device=device).reshape(1, -1)
output = model.generate(token_ids, max_new_token_ids=512)[0]
print(f"\nOUTPUT:\n\n{tokenizer.decode(output.tolist())}")

Training logs from a run that takes under 40 seconds:

Model parameters: 140,127

[STEP 0100] Train loss: 2.5102
[STEP 0200] Train loss: 1.9054
[STEP 0300] Train loss: 1.6241
[STEP 0400] Train loss: 1.5220
[STEP 0500] Train loss: 1.4411
[STEP 0600] Train loss: 1.4589
[STEP 0700] Train loss: 1.4249
[STEP 0800] Train loss: 1.4338
[STEP 0900] Train loss: 1.4335
[STEP 1000] Train loss: 1.3954
[STEP 1100] Train loss: 1.4112
[STEP 1200] Train loss: 1.4028
[STEP 1300] Train loss: 1.3580
[STEP 1400] Train loss: 1.3984
[STEP 1500] Train loss: 1.3844
[STEP 1600] Train loss: 1.3988
[STEP 1700] Train loss: 1.3683
[STEP 1800] Train loss: 1.3777
[STEP 1900] Train loss: 1.3845
[STEP 2000] Train loss: 1.3517

OUTPUT:

I am for which I might on the story
contentinued how a more animation in
the toon, however at the protection of the employment of Creation; and I was ought to be folleques my forths of the words of not burned with the dreadful as the expections
of years and speast, and the loves met steps of the lives which I had could about the been was grats before I wandered the throat, change of the cultic of his journey, and the miser of their knowledge.

I had not listened to be beat the follow
for and the articulat