microAttention

scaled dot-product attention — built from scratch by Reita Williams

The complete implementation

The entire scaled dot-product attention mechanism in 23 lines of pure Python and NumPy. Every line is written to be understood.

Functions: softmax() · causal_mask() · scaled_dot_product_attention()
Dependencies: numpy, math
Total lines: 23
MICROATTENTION.PY
import numpy as np
import math

def softmax(x):
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / e_x.sum(axis=-1, keepdims=True)

def causal_mask(n):
    return np.triu(np.full((n, n), -1e9), k=1)

def scaled_dot_product_attention(Q, K, V, mask=None):
    scores = Q @ K.T / math.sqrt(Q.shape[-1])
    if mask is not None: scores = scores + mask
    weights = softmax(scores)
    return weights @ V, weights

sentence = ["the", "corpus", "was", "wrong"]
np.random.seed(42)
token_embeddings = np.random.randn(4, 8)
Q = token_embeddings @ np.random.randn(8, 8)
K = token_embeddings @ np.random.randn(8, 8)
V = token_embeddings @ np.random.randn(8, 8)
output, weights = scaled_dot_product_attention(Q, K, V, causal_mask(4))
Q / K / V: three projections, three purposes

Each token gets projected into three different vector spaces. Q asks: what am I looking for? K advertises: this is what I contain. V contributes: this is what I actually pass forward. The separation is elegant: the model learns searching and being-found as distinct operations.

Q = embeddings @ W_Q · K = embeddings @ W_K · V = embeddings @ W_V
Each matrix shape: (4 × 8)
d_k dimension: 8
Q / K / V PROJECTIONS
# each word is an 8-dimensional vector
# multiply by a weight matrix to project into Q, K, V spaces

Q = token_embeddings @ W_Q   # what am I looking for?
K = token_embeddings @ W_K   # what do I contain?
V = token_embeddings @ W_V   # what do I contribute?

# Q[3] is "wrong" asking its question
# K[1] is "corpus" presenting its key
# the dot product between them will be high
Three roles: Q (Query) looks for · K (Key) advertises · V (Value) contributes
Dot products: similarity as a number

The dot product between a query vector and a key vector is a single number measuring how similar they are: multiply element by element, then sum. A high score means the vectors point in the same direction, which the model treats as relevance. Q @ K.T computes all n² scores simultaneously in one matrix multiplication.

scores = Q @ K.T / √d_k
Score matrix shape: (4 × 4)
Scale factor: √d_k = √8 ≈ 2.83
SCORE COMPUTATION
# Q is (4,8), K.T is (8,4) → scores is (4,4)
# scores[i,j] = dot(Q[i], K[j]) / √8
scores = Q @ K.T / math.sqrt(Q.shape[-1])

# Q.shape[-1] reads d_k from the actual input:
# not hardcoded, works for any dimension

# why divide by √d_k?
# dot products grow with d_k, pushing softmax into saturation
# and reducing gradients to near zero
# scaling keeps the variance at 1 regardless of size
Result: S, the 4 × 4 score matrix. scores[3,1] ("wrong" → "corpus") will be the highest in row 3.
Softmax: turning scores into weights

Softmax converts each row of raw scores into a probability distribution: all positive, summing to 1.0. The exponential function amplifies differences, so a moderate lead in raw scores becomes a large lead after softmax. That amplification is what produces the concentration you see in the output.

softmax(x) = e^(x − max(x)) / Σ e^(x − max(x)) · weights sum to 1.0
Stability trick: x − max(x)
"wrong" → "corpus": 93.7%
NUMERICALLY STABLE SOFTMAX
def softmax(x):
    # subtract max before exponentiating:
    # prevents overflow; the max cancels in the division
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / e_x.sum(axis=-1, keepdims=True)

# axis=-1: each ROW gets its own softmax
# each attending token has weights summing to 1

# for "wrong" (row 3), after softmax:
#   the    → 0.020
#   corpus → 0.937  ← 93.7%
#   was    → 0.030
#   wrong  → 0.013
Effect: e^x amplifies · ÷ normalises
Value aggregation: the weighted blend

With attention weights in hand, multiply by V. Each token's output is a weighted blend across all value vectors, with each token contributing proportionally. For "wrong", 93.7% of the weight goes toward corpus's value vector. Output shape matches input shape.

output = weights @ V
Output shape: (4 × 8)
Returns: output + weights
VALUE AGGREGATION
    weights = softmax(scores)
    return weights @ V, weights

# weights @ V is (4,4) @ (4,8) → (4,8)
# same shape as token_embeddings

# returning the weights is a transparency choice:
# it lets you inspect what happened
# that is how we see the 93.7%

# "wrong" output =
#   2.0% × V[the] + 93.7% × V[corpus]
# + 3.0% × V[was] + 1.3% × V[wrong]
Output: O, context vectors (4 × 8) · W, weights (4 × 4)
Causal mask: enforcing the past

Each token can only see what came before it. The mask enforces this by adding −1e9 to all future positions before softmax. After the exponential, those positions become effectively zero: the upper triangle of the weight matrix vanishes, leaving a lower-triangular attention pattern in which each token attends to itself and to everything before it.

np.triu(np.full((n, n), -1e9), k=1) · scores + mask → softmax
Masked positions: 6 of 16 cells
After softmax: upper triangle = 0
CAUSAL MASK
def causal_mask(n):
    return np.triu(np.full((n, n), -1e9), k=1)

# np.triu keeps the upper triangle; k=1 starts
# one above the diagonal, so tokens can see themselves

# for n=4 produces:
# [   0  -1e9  -1e9  -1e9 ]
# [   0     0  -1e9  -1e9 ]
# [   0     0     0  -1e9 ]
# [   0     0     0     0 ]

# e^(-1e9) ≈ 0, so future tokens vanish after softmax
Mask: tril structure · added to the scores
"wrong" attended to "corpus" at 93.7%

The sentence was chosen deliberately because I had written an essay arguing the corpus was wrong. The model had random weight matrices and a mathematical operation for measuring the angle between vectors. It found the relationship anyway. That is matrix multiplication doing exactly what it is supposed to do. Which is, when you think about it, precisely the point.

weights[3,1] = 0.937 · seed 42 · d_k = 8
"wrong" → "corpus": 93.7%
"wrong" → everything else: 6.3%
THE OUTPUT
sentence = ["the", "corpus", "was", "wrong"]

# attention weights per token (row):
# "the"    → the=1.000
# "corpus" → the=0.390, corpus=0.610
# "was"    → the=0.220, corpus=0.510, was=0.270
# "wrong"  → the=0.020, corpus=0.937, was=0.030, wrong=0.013

# the model did not know about the essay
# it just did the matrix multiplication

# understanding that does not make the result
# less interesting; it changes what kind of
# interesting it is.
weights[3,:]: the 2.0% · corpus 93.7% · was 3.0% · wrong 1.3%

1. introduction

There is a period when you are reading about transformer architecture where you are technically following along: you can repeat the words back, you can locate the formula, you can say "the query attends to the key and returns a weighted sum of the values" in a way that sounds right, yet you are not understanding it at all. You are carrying the sentence around like an object you cannot open.

The moment it opened for me was not when I read a better explanation. It was when I implemented it. Specifically when I ran my own sentence through the finished code and looked at the output. The sentence was "the corpus was wrong", chosen deliberately because I had written an essay arguing exactly that. When I ran those four words through the implementation I had just built, "wrong" attended to "corpus" at 93.7%. The model had no access to the essay. It just did the matrix multiplication.

What this is: scaled dot-product attention implemented in pure Python and NumPy. No frameworks, no hidden machinery; every line written to be understood. The interactive pipeline above walks through each stage. The written sections below explain what each part is doing and why it is there.

This is a learning artefact built by an AI governance practitioner who needed to go inside the mechanism rather than continuing to write about it from the outside. The implementation is 23 lines. The understanding it required is the point.

2. the formula

Short enough to fit on one line. You see it cited in papers without surrounding explanation as if it is self-evident. It becomes clear once you have spent time with each part.

Attention(Q, K, V) = softmax( QKᵀ / √d_k ) · V

QKᵀ is the dot product between every query and every key at once. A (4 × 8) matrix times an (8 × 4) matrix gives a (4 × 4) score matrix where each cell is a similarity score between one query and one key.

÷ √d_k is the scaling factor. Dot products grow with dimension, pushing softmax into saturation and causing gradients to vanish. Dividing by √d_k keeps the variance of the scores at 1 regardless of d_k. The purpose becomes clear once you see what it prevents.
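
The growth of dot products with d_k is easy to verify numerically. A sketch, not part of the implementation above; the sample count and the d_k = 512 value are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 512

# 10,000 independent query/key pairs with unit-variance components
q = rng.standard_normal((10_000, d_k))
k = rng.standard_normal((10_000, d_k))

raw = (q * k).sum(axis=-1)        # raw dot products
scaled = raw / np.sqrt(d_k)       # with the √d_k correction

# raw.var() comes out near d_k; scaled.var() comes out near 1
```

The same check at any other d_k shows the raw variance tracking the dimension while the scaled variance stays put.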

softmax(·) converts each row of scores into a probability distribution: all positive, all between zero and one, all summing to exactly one. The exponential function amplifies differences, which is why a moderate lead in raw scores becomes a 93.7% result.

· V is the final weighted sum. Each output token is a smooth, differentiable blend of all value vectors, proportioned by the attention weights.
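
Put back together, the whole formula is one short function. A sketch, separate from the implementation above, using freshly drawn random matrices:

```python
import numpy as np

def attention(Q, K, V):
    # softmax(QKᵀ / √d_k) · V, with softmax applied row-wise
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(7)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
out = attention(Q, K, V)   # one context vector per query, shape (4, 8)
```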

3. q, k, v projections

Each token gets projected into three different vector spaces via three different learned weight matrices. In this implementation those matrices are random. In a real transformer they are the product of gradient descent across a training corpus, adjusted to minimise prediction error across billions of examples.

Q (Query): what is this token looking for? When "wrong" produces its output, it uses Q[3] to search the sequence for relevant information.

K (Key): what does this token contain? "corpus" advertises itself via K[1]. A high dot product between Q[3] and K[1] means "wrong" found what it was looking for in "corpus".

V (Value): what does this token contribute? It carries the information a token actually passes forward, blended proportionally by the attention weights computed from Q and K.
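
As a sketch of that chain, the three projections are just three matrix multiplies applied to the same embeddings; W_Q, W_K, and W_V here are random stand-ins for learned matrices:

```python
import numpy as np

rng = np.random.default_rng(42)
embeddings = rng.standard_normal((4, 8))     # four tokens, d_k = 8

# three random stand-ins for the learned weight matrices
W_Q, W_K, W_V = (rng.standard_normal((8, 8)) for _ in range(3))

Q = embeddings @ W_Q   # what each token looks for
K = embeddings @ W_K   # what each token advertises
V = embeddings @ W_V   # what each token passes forward

# one query against one key: a single scaled relevance score
score_31 = Q[3] @ K[1] / np.sqrt(8)
```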

The governance implication: the weight matrices are the corpus made mathematical. What the model learns to attend to is determined by the patterns in training data: which query vectors end up similar to which key vectors is the geometry the corpus shapes. The geometry determines the dot products. The dot products become the attention weights. Understanding that chain is a precondition for precise governance work.

4. attention scores

The dot product between two vectors is a single number measuring how much they point in the same direction. Multiply element by element, then sum. When the vectors are similar, meaning the words are related in the model's estimation, the score is high. When they are orthogonal, the score approaches zero.

This is the attention score between those two tokens. It is not a judgement, it is not comprehension; it is the product of two vectors' lengths and the cosine of the angle between them, and it functions as relevance. That is what the model uses to decide what to attend to.

Q @ K.T computes all n² scores simultaneously. For four tokens that is 16 scores. For a sequence of 2048 tokens it is over 4 million scores per head per layer. This quadratic scaling is the primary bottleneck motivating FlashAttention and related work, though the mathematical operation is the same at any scale.
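
The "all n² scores simultaneously" claim can be checked against an explicit double loop. A sketch with small random matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_k = 4, 8
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))

# one matrix multiply: all n² scores at once
scores = Q @ K.T

# the same scores, one dot product at a time
looped = np.empty((n, n))
for i in range(n):
    for j in range(n):
        looped[i, j] = np.sum(Q[i] * K[j])   # multiply element-wise, then sum

assert np.allclose(scores, looped)
```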

5. softmax

Once you have the raw attention scores, you need to convert them into weights that sum to one. That is what softmax does. The formula: for each score x_i, compute e^(x_i) divided by the sum of e^(x_j) for all j. The exponential function means small differences in raw scores become large differences in output weights.

Subtracting the maximum before exponentiating (x - max(x)) prevents overflow. It is a numerical stability fix: the largest value becomes zero before exponentiation, keeping the computation stable. The mathematical result is identical either way.

The sharpening effect of softmax is why "wrong" attending to "corpus" at 93.7% looks so decisive. The raw dot product score was already the highest in row 3. The exponential amplified that lead into near-total concentration. The mathematics produced the result from a score that was highest by a moderate margin.
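
The max-subtraction trick is worth seeing fail and succeed side by side. A minimal sketch, separate from the implementation above:

```python
import numpy as np

def softmax_stable(x):
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

big = np.array([1000.0, 1001.0, 1002.0])

# the naive version overflows: e^1000 is inf in float64, and inf/inf is nan
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(big) / np.exp(big).sum()

# the stable version works, and matches softmax([0, 1, 2]):
# subtracting a constant from every score leaves the weights unchanged
stable = softmax_stable(big)
```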

6. value aggregation

With attention weights in hand, the final step is a matrix multiply: weights @ V. Each output token is a weighted average of all value vectors, where the weights come from the softmax output.

For "wrong", the output vector is: 2.0% of V[the] + 93.7% of V[corpus] + 3.0% of V[was] + 1.3% of V[wrong]. The model is not retrieving "corpus". It is constructing a new representation of "wrong" that has been shaped by what it attended to across the whole sentence.

The function returns both the output and the attention weights. Returning the weights is a transparency decision: it makes the mechanism inspectable. You can see exactly what the model attended to and by how much. That distinction between having the output and being able to see the mechanism is one I think about in governance contexts regularly.
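
The blend for a single token can be reconstructed by hand, which is a useful sanity check. A sketch using stand-in weights (random rows normalised to sum to 1, not the seeded values above):

```python
import numpy as np

rng = np.random.default_rng(1)

# stand-in attention weights: random rows normalised to sum to 1
weights = rng.random((4, 4))
weights /= weights.sum(axis=-1, keepdims=True)
V = rng.standard_normal((4, 8))

output = weights @ V

# row 3 by hand: a weighted average of all four value vectors
row3 = sum(weights[3, j] * V[j] for j in range(4))
assert np.allclose(output[3], row3)
```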

7. causal mask

The attention formula as written computes scores between every pair of tokens in both directions. For a model generating text one token at a time, each position should attend only to prior tokens.

The causal mask enforces this by adding −1e9 at every position (i, j) where j is greater than i. After softmax, e^(−1e9) is effectively zero, so those positions contribute nothing: the upper triangle of the weight matrix becomes zero, leaving a lower-triangular attention pattern in which each token attends to itself and to everything before it.

The mask is added to the scores before softmax rather than applied separately, which means the whole sequence of score, scale, mask, and softmax collapses to one expression: softmax(Q @ K.T / √d_k + mask). That is the formula from the paper, implemented directly.
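
Both claims can be checked in a few lines. This sketch redefines the same two helpers shown earlier on this page and applies them to random scores:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def causal_mask(n):
    return np.triu(np.full((n, n), -1e9), k=1)

rng = np.random.default_rng(0)
scores = rng.standard_normal((4, 4))

# mask and softmax collapse into one expression
weights = softmax(scores + causal_mask(4))

# every position above the diagonal carries zero weight...
assert np.allclose(np.triu(weights, k=1), 0.0)
# ...and every row still sums to 1
assert np.allclose(weights.sum(axis=-1), 1.0)
```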

8. what this taught me

Building an attention mechanism from scratch changed the texture of the understanding I already had. It moved things from being known to being felt, which is a different and more durable thing.

The gap between what these systems appear to do and what they are actually doing is where most of the harm in AI governance lives. You can write about that gap from the outside. But there is a difference between knowing it exists and having a clear enough picture of the mechanism that you can hold both sides of the gap in your head at the same time and see precisely where they diverge. When I watched "wrong" attend to "corpus" at 93.7% and felt the pull of saying the model had found something, and then immediately traced that feeling back to matrix multiplication and a random seed, I was experiencing both sides of the gap in the same moment.

I can now write about this with more precision about where exactly the appearance and the mechanism come apart. The result is a better-formed set of questions.

Read the essay this came from: Attention as the Core of Cultural Machines: the argument that the corpus shapes what the model learns to attend to, and why that matters for governance.

This was a one-week, compressed-time project. The implementation took several sessions. The documentation took several more. The understanding it required is ongoing. That is fine. The point was always to go inside the thing and stand within it, able to say something true about what you are looking at.