AI History · EP 03

The day one paper unified all of AI

June 12, 2017. Eight researchers at Google Brain and Google Research posted a paper on arXiv with a deliberately provocative title: "Attention Is All You Need." Nine years later, the architecture it introduced underpins the large majority of mainstream AI research.

5 min read 2026.05.04 1997 → 2017

01 First: the brief happiness of RNNs

Late 1980s into 1990. Michael I. Jordan (the cognitive scientist, not the basketball player) and Jeffrey Elman proposed early forms of the RNN (Recurrent Neural Network). The idea was simple: feed the network both the current input and its own output or hidden state from the previous step, and it can process sequences.
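The recurrence can be sketched in a few lines of plain Python. This is a minimal illustration, assuming an Elman-style cell in which the previous step's hidden state is fed back alongside the current input; the function names, weights, and sizes are made up for the example and are not taken from the original papers.

```python
import math

def rnn_step(x, h_prev, W_x, W_h, b):
    """One recurrent step: h_t = tanh(W_x @ x_t + W_h @ h_{t-1} + b)."""
    h = []
    for i in range(len(b)):
        total = b[i]
        total += sum(W_x[i][j] * x[j] for j in range(len(x)))
        total += sum(W_h[i][j] * h_prev[j] for j in range(len(h_prev)))
        h.append(math.tanh(total))
    return h

def run_rnn(xs, W_x, W_h, b):
    """Process a sequence left to right, carrying the hidden state forward."""
    h = [0.0] * len(b)
    for x in xs:
        h = rnn_step(x, h, W_x, W_h, b)
    return h

# Toy setup (hypothetical numbers): 2-dim inputs, 2-dim hidden state.
W_x = [[0.5, -0.3], [0.1, 0.8]]
W_h = [[0.2, 0.0], [0.0, 0.2]]
b = [0.0, 0.0]

seq_a = [[1.0, 0.0], [0.0, 1.0]]
seq_b = [[0.0, 1.0], [1.0, 0.0]]  # same two tokens, opposite order
h_a = run_rnn(seq_a, W_x, W_h, b)
h_b = run_rnn(seq_b, W_x, W_h, b)
```

Because the state is threaded through every step, the two orderings end in different final states. That order sensitivity is what lets the network model sequences, and it is also why the steps cannot be computed independently of each other.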

It was elegant on paper. A single line of time, learned directly. Then the problem hit: RNNs forget the start of long sentences. The mathematical name is "vanishing gradients": the training signal shrinks at every step it travels back through time, so after roughly ten words the influence of the first word is often close to zero.
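A rough way to see the shrinkage is to multiply out the chain rule for a one-unit RNN. The sketch below assumes a scalar recurrence h_t = tanh(w * h_{t-1} + x_t) with a hypothetical weight of 0.9 and constant inputs; real networks use matrices, but the product-of-factors structure is the same.

```python
import math

def gradient_through_time(w, inputs):
    """Return d h_T / d h_0 for a scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h, grad = 0.0, 1.0
    for x in inputs:
        h = math.tanh(w * h + x)
        grad *= w * (1.0 - h * h)  # chain rule: one factor per time step
    return grad

inputs = [0.5] * 50  # hypothetical constant input at every step
g10 = abs(gradient_through_time(0.9, inputs[:10]))
g50 = abs(gradient_through_time(0.9, inputs))
```

Every factor here has magnitude below 1, so the product decays roughly geometrically with sequence length: `g50` comes out many orders of magnitude smaller than `g10`.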

02 1997: two Germans built an RNN that doesn't forget

1997. A PhD student at TU Munich and his advisor published a paper titled "Long Short-Term Memory." Acronym: LSTM.

🧬
Sepp Hochreiter & Jürgen Schmidhuber
TU Munich · 1997 · Neural Computation 9(8):1735-1780

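Their core idea: put learned gates inside each memory cell to control what flows in and out. The version in common use today has three. ① forget gate: how much of the previous cell state to keep. ② input gate: how much new info to admit. ③ output gate: how much to expose right now. The 1997 paper itself had only the input and output gates; the forget gate was added by Gers, Schmidhuber, and Cummins in 2000. Each gate learns its own 0–1 value automatically.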
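One step of the now-standard three-gate cell can be sketched as follows. This is a scalar toy, not the paper's formulation: the parameter dictionary, the gate names used as keys, and the hand-picked weights are all assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell. `p` maps a gate name to (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, bias = p[name]
        return w_x * x + w_h * h_prev + bias

    f = sigmoid(pre("forget"))        # fraction of the old cell state to keep
    i = sigmoid(pre("input"))         # fraction of the new candidate to admit
    o = sigmoid(pre("output"))        # fraction of the cell state to expose
    g = math.tanh(pre("candidate"))   # proposed new content
    c = f * c_prev + i * g            # additive update of the cell state
    h = o * math.tanh(c)
    return h, c

# Hypothetical weights: keep almost everything, admit almost nothing new.
params = {
    "forget": (0.0, 0.0, 10.0),
    "input": (0.0, 0.0, -10.0),
    "output": (0.0, 0.0, 10.0),
    "candidate": (1.0, 0.0, 0.0),
}
h, c = 0.0, 1.0  # start with the value 1.0 stored in the cell state
for _ in range(100):
    h, c = lstm_step(0.0, h, c, params)
```

With the forget gate held near 1, the stored value survives 100 steps almost unchanged. The additive update `c = f * c_prev + i * g` is the path along which gradients can flow without being squashed at every step.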

LSTM largely tamed the vanishing gradient problem. Dependencies 100 or more steps apart became learnable. The years 2014 to 2017 were LSTM's golden age: Google Translate's 2016 neural system was built on it, and speech products such as Apple's Siri and Amazon's Alexa were widely reported to use it as well.

03 2014: Google bolted two heads onto LSTM

December 2014, NeurIPS (then still called NIPS). A former Toronto PhD student of Hinton's, by then at Google Brain, and two colleagues published "Sequence to Sequence Learning with Neural Networks."

📐
Ilya Sutskever · Oriol Vinyals · Quoc V. Le
Google Brain · NeurIPS 2014

The birth of the "Encoder-Decoder." One LSTM compresses the entire input sentence into a single vector (the encoder), and a second LSTM unrolls that vector word by word into the output (the decoder). On WMT'14 English-to-French translation it scored BLEU 34.8, ahead of a phrase-based statistical baseline at 33.3, though still short of the best published result at the time. Sutskever (the same Sutskever from EP01's Hinton lab and EP02's AlexNet team) would later co-found OpenAI.

⚠️ But there was a critical flaw
The encoder squeezes the entire input into one fixed-size vector. Short sentences were fine; long ones lost too much information. Cramming 50 words into a single vector was nearly impossible.
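The bottleneck is easy to see in a toy encoder. The sketch below is an untrained stand-in, not an LSTM: the update rule, the `HIDDEN` size, and the numeric "tokens" are all hypothetical, chosen only to show that the decoder receives the same number of values whether the input has 3 tokens or 50.

```python
import math

HIDDEN = 4  # hypothetical size of the fixed context vector

def encode(tokens):
    """Toy recurrent encoder: returns the hidden state after every token."""
    states = []
    h = [0.0] * HIDDEN
    for t in tokens:
        h = [math.tanh(0.5 * h[k] + math.sin((k + 1) * t)) for k in range(HIDDEN)]
        states.append(h)
    return states

short_states = encode([1, 2, 3])
long_states = encode(list(range(1, 51)))

# A plain encoder-decoder hands only the final state to the decoder.
short_ctx = short_states[-1]
long_ctx = long_states[-1]
```

Both `short_ctx` and `long_ctx` hold exactly `HIDDEN` numbers, while `long_states` holds 50 vectors that the decoder never sees in this setup.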

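In a paper posted in 2014 and presented at ICLR 2015, Bengio's group (Bahdanau, Cho, and Bengio) introduced "attention" as a side channel that lets the decoder look back at every encoder state, which partially solved this. But fundamentally, LSTM's strict step-by-step processing remained the bottleneck. GPUs love parallelism, and LSTM could not offer it.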

04 June 12, 2017: everything changed

June 12, 2017. arXiv:1706.03762. The title was a punch: "Attention Is All You Need." Eight authors, all of whom did the work at Google Brain or Google Research.

Vaswani · Shazeer · Parmar · Uszkoreit · Jones · Gomez · Kaiser · Polosukhin
Google Brain / Google Research · NeurIPS 2017

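The thesis: throw out recurrence and convolutions, because attention alone is enough. Their architecture, called the Transformer, lets every token look at every other token simultaneously (Self-Attention). There is no step-by-step recurrence, so during training all positions are processed in parallel. Perfect for GPUs.

"We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train."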

— Vaswani et al., abstract of "Attention is All You Need"

Nine years later, in 2026, almost every major AI model is Transformer-based or built around attention: ChatGPT, Claude, Gemini, Llama, BERT, GPT-4, ViT, AlphaFold, Sora, Whisper, and many autonomous-driving vision systems. "Attention Is All You Need" is also among the most-cited AI papers of all time (over 120,000 citations as of 2026).

05 So what is attention, exactly

An analogy: think of a search engine. You type "cat" (the Query), the engine compares it against the keywords on every webpage (the Keys), and it returns the contents (the Values) of the pages that match best.

Transformer attention does the same thing, except that the match is soft rather than winner-take-all. Each word (token) has its own Q, K, and V vectors. A token's Q is compared against the K of every token in the sentence, itself included; the more similar the pair, the more of that token's V gets blended in. This happens for all tokens at once, so each token absorbs context from the entire sentence in a single layer.
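In formula form this is scaled dot-product attention, softmax(Q K^T / sqrt(d_k)) V. Here is a minimal sketch in plain Python, assuming tiny hand-written Q, K, and V vectors; in a real Transformer these come from learned projections of the token embeddings, which are omitted here.

```python
import math

def softmax(scores):
    """Turn raw similarity scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:  # each row is independent, so rows can run in parallel
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(w[j] * V[j][c] for j in range(len(V)))
                        for c in range(len(V[0]))])
    return outputs, weights

# Hypothetical 2-dim Q/K/V vectors for three tokens.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
out, w = attention(Q, K, V)
```

Each row of `w` is one token's distribution over all tokens, and each row of `out` is the correspondingly weighted blend of the value vectors. No row depends on another row's result, which is the property that makes the computation GPU-friendly.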

🎯 Why "multi-head"
Transformers run attention across many heads in parallel (eight in the original paper). Different heads tend to specialize: one may track syntactic relations (subject-verb), another long-range dependencies ("this" → the noun it refers to), another adjacent-word relations. It is like a person reading a sentence from several angles at once.
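The mechanics of multiple heads can be sketched by slicing each token vector into chunks, running attention on each chunk separately, and concatenating the results. This is a simplification under stated assumptions: the real architecture also applies learned projection matrices per head and a final output projection, and the helper names below are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(Q, K, V):
    """Scaled dot-product attention, returning one output vector per query."""
    d_k = len(K[0])
    out = []
    for q in Q:
        w = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K])
        out.append([sum(w[j] * V[j][c] for j in range(len(V)))
                    for c in range(len(V[0]))])
    return out

def split_heads(X, n_heads):
    """Slice each token vector into n_heads equal chunks: result[head][token]."""
    d_head = len(X[0]) // n_heads
    return [[x[h * d_head:(h + 1) * d_head] for x in X] for h in range(n_heads)]

def multi_head_attention(Q, K, V, n_heads):
    """Run attention per head, then concatenate the head outputs per token."""
    heads = [attend(q, k, v) for q, k, v in zip(split_heads(Q, n_heads),
                                                split_heads(K, n_heads),
                                                split_heads(V, n_heads))]
    return [sum((head[t] for head in heads), []) for t in range(len(Q))]

# Hypothetical 4-dim vectors for three tokens, used as Q, K, and V alike.
X = [[1.0, 0.0, 0.0, 1.0],
     [0.0, 1.0, 1.0, 0.0],
     [1.0, 1.0, 0.5, 0.5]]
two_heads = multi_head_attention(X, X, X, n_heads=2)
one_head = multi_head_attention(X, X, X, n_heads=1)
```

Each head sees only its own slice of the vectors, so it computes its own attention pattern. With `n_heads=1` the function reduces to ordinary single-head attention.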

In the next post (EP04), we follow the Transformer through GPT-1 and GPT-3 to November 30, 2022, the day ChatGPT exploded onto the world, and look at the technical foundation that made 100 million users in two months possible.
