AI History · EP 05

The bar-napkin idea that built image AI

2014, a Montreal pub. A 28-year-old PhD student floated an idea over beers — "what if two networks played a game where one tries to fool the other?" His friends laughed. He went home that night, wrote the code, ran it once. It worked on the first try.

5 min read · 2026.05.04 · 2014 → 2026

01 · 2014, the birth of GAN

🍺

Ian Goodfellow

b.1985 · Université de Montréal · NeurIPS 2014 · now DeepMind

The idea, in one sentence — "an adversarial game between a Generator and a Discriminator." The Generator produces fake images. The Discriminator tries to tell real from fake. Train them together: the Generator learns to make more realistic images, the Discriminator learns to spot subtler forgeries. At equilibrium, the Generator's output becomes indistinguishable from real.
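The game described above can be written as two loss functions, one per network. What follows is a minimal NumPy sketch, assuming the Discriminator outputs a probability that its input is real. The function names `discriminator_loss` and `generator_loss` are illustrative, and the generator loss shown is the commonly used non-saturating variant, not necessarily the exact form Goodfellow first ran.

```python
import numpy as np

def discriminator_loss(d_real, d_fake, eps=1e-12):
    """Binary cross-entropy: push D(real) toward 1 and D(fake) toward 0."""
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    return float(-np.mean(np.log(d_real + eps)) - np.mean(np.log(1.0 - d_fake + eps)))

def generator_loss(d_fake, eps=1e-12):
    """Non-saturating generator loss: push D(fake) toward 1."""
    d_fake = np.asarray(d_fake, dtype=float)
    return float(-np.mean(np.log(d_fake + eps)))

# At equilibrium the Discriminator can only guess: it outputs 0.5 for everything.
at_equilibrium = discriminator_loss([0.5, 0.5], [0.5, 0.5])

# A Discriminator that still separates real from fake has a lower loss,
# and a Generator that fools it more often has a lower loss of its own.
confident_d = discriminator_loss([0.99], [0.01])
fooling_g = generator_loss([0.9])
failing_g = generator_loss([0.1])
```

In a real training loop the two losses are minimized in alternation, each with respect to its own network's parameters. That alternation is what makes this a game and not a single optimization problem.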

The model was called GAN (Generative Adversarial Network). The first version made blurry 28×28 handwritten digits. Four years later, StyleGAN was generating photorealistic 1024×1024 fake faces that humans couldn't reliably distinguish from real photos.

"It just hit me at the bar. My friends said it would never work. I went home, had a glass of wine, coded it up. It worked on the first try. That's GAN."

— Ian Goodfellow, 2019 interview

02 · But GAN had a fatal flaw

From 2014 to 2020, GAN was the king of image generation. StyleGAN, BigGAN, CycleGAN — a flood of impressive follow-ups. But the field had been quietly aware of two chronic problems.

⚠️ The two GAN traps

① Mode collapse — the Generator learns "I just need to nail this one type" and loses diversity. Out of 1,000 categories, it ends up drawing only 10.
② Training instability — the two networks have to stay balanced. If one outpaces the other, learning halts. Catastrophic mid-training failures were common.
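Mode collapse can be made measurable. The sketch below assumes a hypothetical pretrained classifier has already assigned a class label to each generated sample. The labels here are simulated with random integers, and `mode_coverage` is an illustrative helper, not a standard metric from any library.

```python
import numpy as np

def mode_coverage(labels, num_classes):
    """Fraction of classes that appear at least once among generated samples."""
    return len(np.unique(labels)) / num_classes

rng = np.random.default_rng(0)

# Simulated labels: a healthy Generator spreads over all 1,000 classes,
# a collapsed one keeps producing the same 10.
healthy = rng.integers(0, 1000, size=50_000)
collapsed = rng.integers(0, 10, size=50_000)

healthy_coverage = mode_coverage(healthy, 1000)
collapsed_coverage = mode_coverage(collapsed, 1000)
```

A coverage value that falls during training while image quality looks fine is the typical signature of the first trap.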

03 · 2020, a new path that starts from noise

June 2020. A UC Berkeley PhD student submitted a paper to NeurIPS titled "Denoising Diffusion Probabilistic Models." Acronym: DDPM.

🎨

Jonathan Ho

UC Berkeley → Google Brain · DDPM (NeurIPS 2020)

The idea runs the other way. Take an image and add a little noise at each step until it becomes pure static (forward process). Train a network to undo that corruption by predicting the noise that was added. Then start from pure noise and remove it step by step, and a coherent image emerges (reverse process). Because the job is split into 1,000 small denoising steps, each one is a simple regression problem, and training is stable.
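The forward process and the training target fit in a few lines. This is a minimal NumPy sketch, assuming the linear noise schedule from the DDPM paper (betas from 1e-4 to 0.02 over 1,000 steps). The names `add_noise` and `training_loss` are illustrative, the "image" is random stand-in data, and the denoising network and the reverse sampler are left out.

```python
import numpy as np

T = 1000                                  # number of diffusion steps, as in the text
betas = np.linspace(1e-4, 0.02, T)        # linear noise schedule (assumed values)
alpha_bars = np.cumprod(1.0 - betas)      # share of the original signal left after t steps

def add_noise(x0, t, rng):
    """Forward process: jump directly from a clean image x0 to its noisy version at step t."""
    noise = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise
    return xt, noise

def training_loss(predicted_noise, true_noise):
    """Simple DDPM objective: mean squared error between predicted and true noise."""
    return float(np.mean((predicted_noise - true_noise) ** 2))

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(28, 28))    # stand-in for a real image scaled to [-1, 1]

x_early, _ = add_noise(x0, 10, rng)           # still almost the original image
x_late, noise = add_noise(x0, T - 1, rng)     # almost pure static

corr_early = np.corrcoef(x0.ravel(), x_early.ravel())[0, 1]
corr_late = np.corrcoef(x0.ravel(), x_late.ravel())[0, 1]
```

During training, a network would receive `x_late` (or any other noisy step) together with the step index and be scored by `training_loss` on how well it recovers `noise`. Generation then applies that network repeatedly, starting from random static.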

It was much slower than GAN at first, taking minutes per image. But quality, diversity, and stability were all stronger. Mode collapse largely disappeared. Training rarely blew up. And it kept getting better as it was scaled up.

04 · August 2022, everything exploded

April 2022: OpenAI announced DALL-E 2. May: Google announced Imagen. Both Diffusion-based. Both closed, with no public model weights.

And on August 22, 2022 — a German group changed everything.

🌊

Robin Rombach & Patrick Esser

CompVis (LMU Munich) → Stability AI · Stable Diffusion (2022.08.22)

Two of the authors of the Latent Diffusion paper (CVPR 2022), in collaboration with Stability AI, released Stable Diffusion, with model weights, code, and training details all openly available. Anyone could run it on their own GPU. Within days, tens of thousands of fine-tuned variants and hundreds of downstream tools appeared.

🌊 What one open release detonated

One year after Stable Diffusion's release, Civitai had registered over 100,000 fine-tuned models. ControlNet (precise control), LoRA (cheap fine-tuning), DreamBooth (subject personalization): the essential tools spread through the open-source community, built on or adapted to the open model. A small open model rewrote the industry faster than the larger closed models from OpenAI and Google did.

05 · 2024, video too

February 15, 2024. OpenAI announced Sora: videos up to 60 seconds long, generated from text alone. Coherent camera motion, persistent characters, mostly plausible physics. The technical core is the Diffusion Transformer (DiT), which replaces the U-Net with a Transformer backbone. The video is cut into patches, and attention relates those patches along the time axis as well as across space.
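The core mechanism can be shown on a toy clip: frames become a sequence of patch tokens, and self-attention lets every token look at every other token, including tokens from other frames. This is only a sketch of that idea in NumPy. Sora's actual architecture is not public in detail, the projection weights below are random and untrained, and `patchify` and `self_attention` are illustrative names.

```python
import numpy as np

def patchify(video, patch):
    """Split a (frames, height, width) array into flat patches: (tokens, patch * patch)."""
    f, h, w = video.shape
    v = video.reshape(f, h // patch, patch, w // patch, patch)
    v = v.transpose(0, 1, 3, 2, 4)            # (frames, rows, cols, patch, patch)
    return v.reshape(f * (h // patch) * (w // patch), patch * patch)

def self_attention(tokens, wq, wk, wv):
    """Single-head scaled dot-product attention over all tokens at once."""
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability before exp
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
video = rng.standard_normal((4, 16, 16))      # toy clip: 4 frames of 16x16 pixels
tokens = patchify(video, patch=8)             # 4 frames * 2 * 2 patches = 16 tokens

d = tokens.shape[1]
wq, wk, wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
out, weights = self_attention(tokens, wq, wk, wv)
```

Each row of `weights` spans all 16 tokens, so a patch in frame 0 can draw on patches in frame 3. In this toy setup, that is how the time axis ends up being handled by attention and not by a separate temporal module.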

The same year: Runway Gen-3 in June 2024, and China's Kuaishou Kling around the same time. In December 2024, Google announced Veo 2, claiming quality matching or exceeding Sora. Some Hollywood VFX studios began adopting these tools early.

06 · So what does image-generating AI mean

In EP04, ChatGPT conquered language. In EP05, Diffusion conquered vision. Illustration, photography, design, VFX, video advertising: every one of these industries saw its previous-generation toolset upended within two years.

And in less obvious places, Diffusion is now used in semiconductor fab defect data synthesis (Intel GFA), in medical image enhancement, and in drug molecule design (the next generation after AlphaFold). The mechanism, going from noise to meaningful pattern, turns out to apply to many kinds of data, not just images.

In the next post (EP06), we look at the actual hardware foundation that runs all of this — NVIDIA's GPU. From the 1999 GeForce 256 to the 2024 Blackwell. How CUDA became the academic standard, and why Google built its own chip (TPU).

