Because my general feeling about neural nets is: well, I understand gradient descent. If you can tell me what function you're computing from what inputs and what parameters, then maybe I'd be slow at it, but I'm confident I could reinvent backprop from scratch if I needed to. And if I'm lazy I can just ask pytorch to do it for me.
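(Concretely, "asking pytorch to do it" looks something like the toy sketch below; the particular model and loss are made up just to show the autograd call, not anything from a real codebase.)

```python
import torch

# A toy "function of inputs and parameters": a one-layer linear model.
x = torch.randn(8, 3)                      # inputs
w = torch.randn(3, 2, requires_grad=True)  # parameters
b = torch.zeros(2, requires_grad=True)

loss = ((x @ w + b) ** 2).mean()  # some scalar function of inputs and parameters
loss.backward()                   # pytorch does the backprop for me

print(w.grad.shape, b.grad.shape)  # gradients with respect to the parameters
```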
What I really wanted was a compact description of what function an attention unit computes, from what inputs (the "queries", "keys", and "values") and what parameters.
In practice (as far as I understand!), these inputs aren't wired up directly to other parts of the neural net; rather, you throw a whole matrix's worth of parameters in front of each of them: the "raw" data lives in some other linear space, and the parameters say how to smash it into the right linear space to do the dot products.
By doing this we let the network learn what "queries" it should be making, and what "keys" and "values" other data should produce.
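Here's roughly how I picture a single attention head, as a sketch rather than any particular library's implementation. The names (`W_q`, `W_k`, `W_v`, `attention_head`) are mine; the projection matrices are the "matrix's worth of parameters" that smash the raw features into query/key/value space, and the rest is the standard scaled dot-product recipe.

```python
import torch
import torch.nn.functional as F

def attention_head(x, W_q, W_k, W_v):
    # x: (seq_len, d_model) -- the "raw" data in its own linear space.
    # W_q, W_k, W_v: (d_model, d_head) -- learned projections into q/k/v space.
    q = x @ W_q   # what each token is asking for
    k = x @ W_k   # how each token advertises itself
    v = x @ W_v   # what each token contributes if attended to
    scores = q @ k.T / k.shape[-1] ** 0.5   # dot products, scaled by sqrt(d_head)
    weights = F.softmax(scores, dim=-1)     # each row sums to 1
    return weights @ v                      # weighted mix of values

seq_len, d_model, d_head = 5, 16, 8
x = torch.randn(seq_len, d_model)
W_q, W_k, W_v = (torch.randn(d_model, d_head) for _ in range(3))
out = attention_head(x, W_q, W_k, W_v)   # (seq_len, d_head)
```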
Here I'm less confident I understand what's going on, but Karpathy's videos (especially "Let's build GPT") gave me some sense of it.
I think the important thing is that the transformer architecture is an alternative to RNNs: instead of your forward function being a whole lot of recursive applications of the net to its own hidden-state output, you put in some attention units and let individual tokens in your stream dictate (via their "queries") what sort of past data they're interested in. The past data itself gets to figure out (via its "keys") how to advertise itself as interesting to future tokens, and (via its "values") how to contribute some signal for them to use.
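A minimal sketch of the "only look at past data" part, assuming q/k/v have already been produced by projections like the ones above: a causal mask stops each token from attending to anything later than itself. Again, the names here are mine, not from any particular codebase.

```python
import torch
import torch.nn.functional as F

def causal_attention(q, k, v):
    # q, k, v: (seq_len, d_head)
    seq_len = q.shape[0]
    scores = q @ k.T / k.shape[-1] ** 0.5
    # Mask out the future: token i may only attend to tokens 0..i.
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    scores = scores.masked_fill(~mask, float("-inf"))
    weights = F.softmax(scores, dim=-1)   # rows still sum to 1 over the visible past
    return weights @ v

q, k, v = (torch.randn(5, 8) for _ in range(3))
out = causal_attention(q, k, v)   # (5, 8)
```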
Nothing above says anything about the position of tokens in the stream: I understand trigonometric positional encodings are typically added to the token embeddings (rather than concatenated) so that the computation of query/key/value can depend usefully on where a token sits.
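The encodings themselves look something like this sketch, following the sin/cos scheme from the original transformer paper (alternating sine and cosine at geometrically spaced frequencies); the function name is mine.

```python
import torch

def sinusoidal_positions(seq_len, d_model):
    # One row per position; even dims get sin, odd dims get cos.
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)            # even dimensions
    freq = 1.0 / (10000 ** (i / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * freq)
    pe[:, 1::2] = torch.cos(pos * freq)
    return pe

x = torch.randn(5, 16)                # token embeddings
x = x + sinusoidal_positions(5, 16)   # added to, not concatenated with, the embeddings
```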