June 12, 2017. Eight researchers at Google dropped a paper on arXiv with a deliberately provocative title: "Attention Is All You Need." Nine years later, nearly all mainstream AI research builds on what they wrote.
Late 1980s. Michael Jordan (the cognitive scientist, not the basketball player) and Jeffrey Elman proposed the RNN (Recurrent Neural Network). The idea was simple: feed both today's input and yesterday's output into the network, and it can process sequences.
It was elegant on paper: a single line of time, learned directly. Then the problem hit. RNNs forget the start of long sentences. The mathematical name is "vanishing gradients": the training signal shrinks at every step it travels back through the sequence, so past ten or so words the influence of the first word essentially vanishes.
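To make the forgetting concrete, here is a minimal sketch of a one-number RNN in plain Python. The names `rnn_step` and `run_rnn` and the weights of 0.5 are illustrative assumptions, not anything from the original papers. It feeds the previous state back in at every step and tracks, via the chain rule, how strongly the latest state still depends on the very first one.

```python
import math

def rnn_step(x_t, h_prev, w_x=0.5, w_h=0.5):
    """One step of a scalar RNN: mix the current input with the previous state."""
    return math.tanh(w_x * x_t + w_h * h_prev)

def run_rnn(inputs, w_x=0.5, w_h=0.5):
    """Run over a sequence and record d(h_t)/d(h_0) after every step."""
    h = 0.0
    grad = 1.0  # sensitivity of the current state to the initial state
    grads = []
    for x_t in inputs:
        h = rnn_step(x_t, h, w_x, w_h)
        # Chain rule: tanh'(a) = 1 - tanh(a)^2, and a depends on h_prev via w_h.
        grad *= (1.0 - h * h) * w_h
        grads.append(grad)
    return h, grads

if __name__ == "__main__":
    _, grads = run_rnn([1.0] * 20)
    for step in (1, 5, 10, 20):
        print(f"after {step:2d} steps, sensitivity to the start = {grads[step - 1]:.2e}")
```

Every step multiplies the sensitivity by a factor smaller than one (here at most 0.5), so it decays geometrically. That shrinking product is the vanishing gradient in miniature.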
1997. A PhD student at TU Munich and his advisor published a paper titled "Long Short-Term Memory." Acronym: LSTM.
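The core idea: put gates inside each neural unit. The version that became standard has three. ① forget gate: how much of yesterday to forget (this one arrived in a 2000 follow-up by Gers, Schmidhuber, and Cummins; the 1997 paper had only the other two). ② input gate: how much new info to admit. ③ output gate: how much to expose right now. Each gate outputs a value between 0 and 1, and the network learns how to set it.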
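The three gates can be sketched for a single one-number cell as follows. This is a simplified illustration in plain Python under assumed names (`lstm_step`, the parameter dictionary `p`, and its keys are all hypothetical); real LSTMs use vectors and weight matrices, but the gating arithmetic has the same shape.

```python
import math

def sigmoid(z):
    """Squash any number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One step of a scalar LSTM cell.

    p maps each of "forget", "input", "output", "candidate"
    to a tuple (w_x, w_h, bias).
    """
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x_t + w_h * h_prev + b

    f = sigmoid(pre("forget"))       # how much of the old memory to keep
    i = sigmoid(pre("input"))        # how much new information to admit
    o = sigmoid(pre("output"))       # how much memory to expose right now
    g = math.tanh(pre("candidate"))  # the new information itself
    c = f * c_prev + i * g           # updated memory cell
    h = o * math.tanh(c)             # visible output of the unit
    return h, c, (f, i, o)

if __name__ == "__main__":
    # Forget gate pushed toward 1, input gate toward 0: the memory is carried over.
    p = {
        "forget": (0.0, 0.0, 10.0),
        "input": (0.0, 0.0, -10.0),
        "output": (0.0, 0.0, 0.0),
        "candidate": (1.0, 1.0, 0.0),
    }
    h, c, gates = lstm_step(1.0, 0.0, 2.0, p)
    print("memory before: 2.0, memory after:", round(c, 4))
```

The line `c = f * c_prev + i * g` is the key design choice. With the forget gate near 1, memory is carried forward by addition rather than being repeatedly squashed, which is what lets information survive across many steps.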
LSTM largely tamed the vanishing gradient problem. Dependencies 100 words apart became learnable. The 2014–2017 period was the golden age of LSTM: Google Translate, Apple Siri, and Amazon Alexa were all built on it.
December 2014, NeurIPS (then still called NIPS). Ilya Sutskever, a former Toronto PhD student of Hinton's, now at Google Brain, published "Sequence to Sequence Learning with Neural Networks" with Oriol Vinyals and Quoc Le.
The birth of "Encoder-Decoder." One LSTM compresses the entire input sentence into a single vector (encoder), and a second LSTM unrolls that vector word by word into the output (decoder). Applied to English-French translation, an ensemble of these LSTMs scored BLEU 34.8, ahead of a phrase-based statistical baseline at 33.3 on the same test set. Sutskever (the same Sutskever from EP01's Hinton lab and EP02's AlexNet team) would later co-found OpenAI.
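The weak point was that single vector: however long the sentence, everything had to squeeze through it. In 2015, Bengio's group introduced 'attention' as a side-channel that partially solved this, letting the decoder look back at the encoder's states for each output word. But fundamentally, LSTM's strict left-to-right processing was the bottleneck. GPUs love parallelism, and LSTM hated it.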
June 12, 2017. arXiv:1706.03762. The title was a punch: "Attention Is All You Need." Eight authors, all working at Google Brain or Google Research at the time.
The thesis: "Throw out RNNs and CNNs. Attention alone is enough." Their architecture, called the Transformer, lets every token look at every other token simultaneously (Self-Attention). No step-by-step recurrence, so a whole sequence can be processed in parallel during training. Perfect for GPUs.
"We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train."
From the abstract of "Attention Is All You Need" (Vaswani et al.)
Nine years later, in 2026, almost every major AI model is Transformer-based: ChatGPT, Claude, Gemini, Llama, BERT, GPT-4, ViT, AlphaFold, Sora, Whisper, autonomous driving vision. "Attention Is All You Need" is, by citation count, among the most-cited AI papers of all time (over 120,000 citations as of 2026).
An analogy: think of a search engine. You search "cat" (Query), and Google compares it to keywords on every webpage (Key), and pulls back the page contents (Value) that match best.
Transformer attention does the same thing. Each word (token) has its own Q, K, V vectors. A token's Q is compared against every token's K, its own included: the more similar, the more of that token's V gets blended in. This happens for all tokens against all tokens, simultaneously. So each token absorbs context from the entire sentence in one pass.
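The Q/K/V matching described above can be sketched in a few lines of plain Python. Two things here are assumptions beyond this post. In a real Transformer the Q, K, and V vectors come from learned projections of each token's embedding, whereas this sketch takes them as given. The scores are also divided by the square root of the key dimension, as in the paper's scaled dot-product attention. The function names are illustrative.

```python
import math

def dot(a, b):
    """Similarity between two vectors."""
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of vectors.

    Returns (outputs, weights): one blended vector and one row of
    attention weights per query token.
    """
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Compare this token's query with every token's key.
        scores = [dot(q, k) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        # Blend the value vectors in proportion to the weights.
        out = [sum(w_j * v[c] for w_j, v in zip(w, V)) for c in range(len(V[0]))]
        weights.append(w)
        outputs.append(out)
    return outputs, weights

if __name__ == "__main__":
    Q = [[1.0, 0.0], [0.0, 1.0]]
    K = [[1.0, 0.0], [0.0, 1.0]]
    V = [[1.0, 2.0], [3.0, 4.0]]
    outputs, weights = attention(Q, K, V)
    for w, out in zip(weights, outputs):
        print("weights:", [round(x, 3) for x in w], "output:", [round(x, 3) for x in out])
```

The Python loop visits queries one at a time only for readability. No query depends on another query's result, which is why real implementations compute the whole thing as a couple of matrix multiplications that a GPU can run at once.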
In the next post (EP04), we follow this Transformer through GPT-1, GPT-3, and finally to November 30, 2022 — the day ChatGPT exploded onto the world. The technical foundation that made 100 million users possible in 2 months.