I'd read the attention equation a dozen times before I tried to write it. Reading it and implementing it turn out to be completely different kinds of understanding — the second one is unforgiving.

Start with the shapes

Most explanations open with the softmax. I found it easier to start with the shapes: queries, keys, and values are all just matrices, and the whole operation is a story about which dimensions have to line up.

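Once I wrote the dimensions on paper, the equation stopped being symbols and started being plumbing: Q·Kᵀ gives a matrix of scores with one row per query and one column per key, a softmax turns each row into weights that sum to 1, and those weights take a weighted average of the rows of V.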
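For concreteness, here is one way the dimensions can be written out. This assumes the standard scaled dot-product form of attention. The symbols n (number of queries), m (number of keys and values), d_k, and d_v are labels chosen here for illustration and do not come from the text above.

```latex
\begin{aligned}
Q &\in \mathbb{R}^{n \times d_k}, \quad
K \in \mathbb{R}^{m \times d_k}, \quad
V \in \mathbb{R}^{m \times d_v} \\
S &= \frac{Q K^{\top}}{\sqrt{d_k}} \in \mathbb{R}^{n \times m} \\
A &= \operatorname{softmax}_{\text{rows}}(S) \in \mathbb{R}^{n \times m} \\
\text{output} &= A V \in \mathbb{R}^{n \times d_v}
\end{aligned}
```

Only two things have to line up: Q and K must share the inner dimension d_k, and K and V must share the row count m. Everything else follows from ordinary matrix multiplication.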

Understanding is mostly the slow removal of the ways you were confused.

What I'd tell past me

Write the smallest version first. A single head, no batching, tiny inputs you can verify by hand. Print the intermediate matrices. Only once it's correct and boring should you add heads, masks, and batches.
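A minimal sketch of what that smallest version might look like, assuming NumPy and the standard scaled dot-product form. The function names `softmax` and `attention`, the `verbose` flag, and the toy inputs are all hypothetical choices made for this example, not anything prescribed above.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def attention(Q, K, V, verbose=False):
    """Single-head attention, no batching, no masking (illustrative only)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # shape (n_queries, n_keys)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    out = weights @ V                   # shape (n_queries, d_v)
    if verbose:
        # Print every intermediate matrix so it can be checked by hand.
        print("scores:\n", scores)
        print("weights:\n", weights)
        print("out:\n", out)
    return out, weights


# Tiny hand-checkable inputs (made up for this example).
# Query 0 is all zeros, so its scores are all equal and its weights are uniform.
# Query 1 matches key 0 more closely than key 1.
Q = np.array([[0.0, 0.0],
              [1.0, 0.0]])
K = np.array([[1.0, 0.0],
              [0.0, 1.0]])
V = np.array([[1.0, 2.0],
              [3.0, 4.0]])

out, weights = attention(Q, K, V, verbose=True)
```

The zero query is the useful one to check on paper: its scores are all equal, so the weights are 0.5 each and the output row is just the mean of the rows of V, which is [2, 3]. The second query should put more weight on the first key than on the second. If either of those fails, the bug is in the plumbing, not in anything subtle.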