2014, a Montreal pub. A 28-year-old PhD student floated an idea over beers — "what if two networks played a game where one tries to fool the other?" His friends laughed. He went home that night, wrote the code, ran it once. It worked on the first try.
The idea, in one sentence — "an adversarial game between a Generator and a Discriminator." The Generator produces fake images. The Discriminator tries to tell real from fake. Train them together: the Generator learns to make more realistic images, the Discriminator learns to spot subtler forgeries. At equilibrium, the Generator's output becomes indistinguishable from real.
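To make the game concrete, here is a minimal sketch of the two losses the players minimize, assuming the Discriminator outputs a probability that an image is real. The function names `discriminator_loss` and `generator_loss` are hypothetical, chosen for illustration, and the Generator loss shown is the common "non-saturating" variant rather than the only possible choice. No network is trained here; the numbers simply show what the losses look like early on and at equilibrium.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy the Discriminator minimizes.

    d_real: D's probabilities of "real" on real images.
    d_fake: D's probabilities of "real" on generated images.
    """
    real_term = sum(-math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(-math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_loss(d_fake):
    """Non-saturating Generator loss: push D's output on fakes toward 1."""
    return sum(-math.log(p) for p in d_fake) / len(d_fake)

# Early in training: D spots fakes easily, so G's loss is high.
early_d = discriminator_loss([0.9, 0.95], [0.1, 0.05])
early_g = generator_loss([0.1, 0.05])

# At equilibrium: D can only guess, so it outputs 0.5 everywhere.
eq_d = discriminator_loss([0.5, 0.5], [0.5, 0.5])
eq_g = generator_loss([0.5, 0.5])

print(f"early:       D loss={early_d:.3f}, G loss={early_g:.3f}")
print(f"equilibrium: D loss={eq_d:.3f}, G loss={eq_g:.3f}")
```

When the Discriminator is reduced to guessing, its loss settles at 2·ln 2 and the Generator's at ln 2. That is the numerical signature of "indistinguishable from real."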
The model was called GAN (Generative Adversarial Network). The first version made blurry 28×28 handwritten digits. Four years later, StyleGAN was generating photorealistic 1024×1024 fake faces that humans couldn't reliably distinguish from real photos.
"It just hit me at the bar. My friends said it would never work. I went home, had a glass of wine, coded it up. It worked on the first try. That's GAN."
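— Ian Goodfellow, 2019 interview
From 2014 to 2020, GAN was the king of image generation. StyleGAN, BigGAN, CycleGAN: a flood of impressive follow-ups. But the field had been quietly aware of two chronic problems: mode collapse, where the Generator settles on a narrow range of outputs, and unstable training that could diverge without warning.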
June 2020. A UC Berkeley PhD student submitted a paper to NeurIPS titled "Denoising Diffusion Probabilistic Models." Acronym: DDPM.
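The idea runs in reverse. Take an image and add noise step by step until it becomes pure static (the forward process). Train a network to undo that noising, in practice by predicting the noise that was added at a given step. Then run the learned process backward: start from pure noise and remove a little at a time until a coherent image emerges (the reverse process). Because the work is split into 1,000 small steps, each step is an easy denoising problem and training is stable.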
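The forward process can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the paper's code: the linear noise schedule from 1e-4 to 0.02 is assumed here, the helper names `add_noise` and `estimate_x0` are hypothetical, and a perfect noise prediction stands in for the trained network.

```python
import math
import random

T = 1000  # number of noising steps, as in the text
# Linear noise schedule; the endpoints 1e-4 and 0.02 are an assumption.
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]

# alpha_bar[t] is the running product of (1 - beta): the share of
# the original signal's variance that survives up to step t.
alpha_bar = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bar.append(running)

def add_noise(x0, t, rng):
    """Forward process: jump straight to step t in closed form."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a = alpha_bar[t]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return xt, eps  # eps is the target the network learns to predict

def estimate_x0(xt, t, eps_pred):
    """Invert the forward formula, given a prediction of the noise."""
    a = alpha_bar[t]
    return [(x - math.sqrt(1.0 - a) * e) / math.sqrt(a)
            for x, e in zip(xt, eps_pred)]

rng = random.Random(0)
x0 = [0.8, -0.3, 0.5, 0.0]             # a toy 4-pixel "image"
xt, eps = add_noise(x0, 500, rng)       # halfway to pure static
recovered = estimate_x0(xt, 500, eps)   # true eps stands in for the network

print(f"signal left after the last step: {alpha_bar[-1]:.6f}")
```

By the last step almost none of the original signal remains, which is why generation can start from pure noise. The real reverse process does not jump back in one shot as `estimate_x0` does; it removes noise gradually over many steps, using the network's prediction at each one.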
It was much slower than GAN at first, taking minutes per image. But it won on quality, diversity, and stability: no mode collapse, and training that rarely diverged. And scaling it up reliably made it better.
April 2022 — OpenAI released DALL-E 2. May — Google's Imagen. Both Diffusion-based. Both closed (API only).
And on August 22, 2022 — a German group changed everything.
The researchers behind the Latent Diffusion paper (CVPR 2022), in collaboration with Stability AI, released Stable Diffusion with model weights, code, and training details all openly available. Anyone could run it on their own GPU. Within days, tens of thousands of fine-tuned variants and hundreds of downstream tools appeared.
February 15, 2024. OpenAI announced Sora. 60-second videos generated from text alone. Coherent camera motion, persistent characters, plausible physics. Technical core — Diffusion Transformer (DiT): replace U-Net with a Transformer backbone. The video's time axis is processed via attention.
The same year brought a wave of rivals: Runway Gen-3 in June 2024 and, around the same time, Kling from China's Kuaishou. In December 2024, Google announced Veo 2, with quality reported to match or exceed Sora. Some Hollywood VFX studios were early adopters of these tools.
In EP04, ChatGPT conquered language. In EP05, Diffusion conquered vision. Illustration, photography, design, VFX, video advertising: each of these industries saw its previous-generation toolset displaced within two years.
And in less obvious places, Diffusion is now used for synthesizing defect data in semiconductor fabs (Intel GFA), for medical image enhancement, and for drug molecule design (the next generation after AlphaFold). The mechanism, going from noise to meaningful pattern, turns out to apply to many kinds of data, not just images.
In the next post (EP06), we look at the actual hardware foundation that runs all of this — NVIDIA's GPU. From the 1999 GeForce 256 to the 2024 Blackwell. How CUDA became the academic standard, and why Google built its own chip (TPU).