In 1989, at Bell Labs in New Jersey, a computer started reading handwritten ZIP codes for the first time. Then 23 years of silence. In the fall of 2012, one model destroyed the ImageNet leaderboard — and everything changed.
1989. AT&T Bell Labs in New Jersey. A 28-year-old French researcher named Yann LeCun stood up at a conference and announced something modest-sounding: "We trained a neural network to recognize handwritten digits."
Yann LeCun is widely regarded as the de facto inventor of the CNN (Convolutional Neural Network). He created LeNet in 1989 and LeNet-5 in 1998, and shared the 2018 Turing Award with Hinton and Bengio.
LeCun's 'LeNet' introduced two ideas that turned out to be the foundation of computer vision. ① Locality — a pixel only meaningfully relates to its neighbors. ② Weight sharing — slide the same small filter (e.g. 3×3) across the entire image.
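To make the two ideas concrete, here is a minimal sketch in plain Python. The function name `slide_filter`, the toy image, and the hand-picked kernel values are all illustrative assumptions, not LeNet's actual code or learned weights. It only shows what "slide the same small filter across the image" means mechanically.

```python
def slide_filter(image, kernel):
    """Slide one small kernel over a 2-D image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Locality: each output value looks only at a kh x kw neighborhood.
            # Weight sharing: the same kernel values are reused at every (r, c).
            total = 0.0
            for i in range(kh):
                for j in range(kw):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# Toy 4x5 image: dark on the left, bright on the right.
image = [
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
]
# Hand-picked 3x3 vertical-edge kernel (illustrative, not learned).
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]
feature_map = slide_filter(image, kernel)
print(feature_map)  # [[0.0, 3.0, 3.0], [0.0, 3.0, 3.0]]
```

The output is zero where the window sees only flat dark pixels and positive where it straddles the edge. In a real CNN the nine kernel values would be learned during training rather than written by hand.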
Together these two ideas shrink the parameter count dramatically. A fully connected layer that wires every pixel of a 100×100 image to, say, 100 hidden units needs about a million weights; a single shared 3×3 filter needs nine. Suddenly the network was trainable.
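The arithmetic behind that comparison can be checked in a few lines. This is a rough sketch: the figure of 100 hidden units is an assumption chosen to reproduce the "about a million" number, and bias terms are ignored.

```python
height, width = 100, 100
hidden_units = 100  # assumed size of the fully connected layer

# Fully connected: every pixel gets its own weight to every hidden unit.
dense_weights = height * width * hidden_units

# Convolution: one 3x3 filter, reused at every position in the image.
conv_weights = 3 * 3

print(dense_weights)                  # 1000000
print(conv_weights)                   # 9
print(dense_weights // conv_weights)  # 111111
```

A practical CNN layer uses many filters rather than one, so the real saving is smaller than this, but the count still no longer grows with the size of the image.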
From the 1998 LeNet-5 paper through about 2010, CNNs barely registered outside the research community. The dominant computer vision pipelines paired hand-engineered features such as HOG and SIFT with classifiers such as SVMs. They were faster to train and worked better on the small datasets of the time.
The reason was the same as in EP01. Training a real neural network needed hundreds of thousands of labeled images plus a fast GPU, and neither was available. CNNs were filed under "elegant in theory, doesn't work in practice."
The 2012 ImageNet competition (ILSVRC). 1.2 million training images, 1,000 categories. Results were announced in the fall. A team from the University of Toronto won by a brutal margin: a top-5 error of 15.3%, against 26.2% for the second-place entry.
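For readers unfamiliar with the metric: a prediction counts as correct under top-5 if the true label is among the model's five highest-scoring classes. A minimal sketch of that calculation follows; the function name `top5_error` and the toy scores are made up for illustration and are not ILSVRC's official evaluation code.

```python
def top5_error(scores_per_image, true_labels):
    """Fraction of images whose true label is NOT among the 5 top-scoring classes."""
    misses = 0
    for scores, label in zip(scores_per_image, true_labels):
        # Class indices sorted from highest score to lowest.
        ranked = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
        if label not in ranked[:5]:
            misses += 1
    return misses / len(true_labels)


# Two toy images, six classes each (illustrative scores).
scores = [
    [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],  # class 0 ranks last, so label 0 is a miss
    [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],  # class 5 ranks first, so label 5 is a hit
]
labels = [0, 5]
error = top5_error(scores, labels)
print(error)  # 0.5
```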
Their model was 'AlexNet'. The three authors: Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton.
Hinton (the same Hinton from EP01) and his two PhD students. AlexNet was an 8-layer CNN trained on two NVIDIA GTX 580 GPUs, which were consumer gaming cards. That was the start of everything. ReLU activations, dropout regularization, and several other techniques that are standard today were popularized by that paper.
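Both techniques are simple enough to sketch in a few lines of plain Python. The helper names `relu` and `dropout` are illustrative. The dropout shown here is the "inverted" variant common in modern frameworks, which rescales the surviving values during training; the AlexNet paper instead multiplied outputs by 0.5 at test time, so treat this as a sketch of the idea rather than a reproduction of the paper.

```python
import random


def relu(values):
    """ReLU: pass positive values through, clamp negatives to zero."""
    return [max(0.0, v) for v in values]


def dropout(values, drop_prob, rng):
    """Inverted dropout for training: zero each value with probability
    drop_prob, and scale the survivors so the expected value is unchanged."""
    keep_prob = 1.0 - drop_prob
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]


activations = relu([-2.0, 0.0, 3.0])
print(activations)  # [0.0, 0.0, 3.0]

rng = random.Random(0)  # fixed seed so the run is repeatable
dropped = dropout([1.0, 2.0, 3.0, 4.0], drop_prob=0.5, rng=rng)
print(dropped)  # each entry is either 0.0 or twice its input
```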
The deep learning era had truly begun.
2012-2014. Everyone was racing to make networks deeper. AlexNet had 8 layers; VGG went to 16, then 19. Then something strange started happening: past roughly 20 layers, plain networks actually got worse, and not only on test data. Even their training error went up.
In December 2015, Kaiming He and colleagues at Microsoft Research Asia published ResNet. The core idea, in one sentence: "Add the input of a layer to its output." This is the skip connection (y = F(x) + x). That single change made it possible to train networks 152 layers deep.
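The formula y = F(x) + x can be sketched directly. This is a deliberately simplified toy: F is two small fully connected layers with a ReLU between them, whereas the actual ResNet blocks use convolutions and batch normalization. The names `linear`, `relu`, and `residual_block` are illustrative.

```python
def linear(x, weights):
    """Multiply vector x by a weight matrix given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def relu(x):
    return [max(0.0, v) for v in x]


def residual_block(x, w1, w2):
    fx = linear(relu(linear(x, w1)), w2)      # F(x): the learned transformation
    return [f + xi for f, xi in zip(fx, x)]   # y = F(x) + x: the skip connection


x = [1.0, -2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]

# With all-zero weights F(x) is zero, so the block passes x through unchanged.
y = residual_block(x, zeros, zeros)
print(y)  # [1.0, -2.0, 3.0]
```

The zero-weight case hints at why depth stops hurting: a block that has learned nothing useful behaves like the identity, so stacking more blocks does not have to make the network worse.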
ResNet reached 3.57% top-5 error on ImageNet, below the roughly 5% error commonly cited for a human labeler. Almost every subsequent vision model, including Transformers, uses skip connections. As of 2026, the ResNet paper is one of the most-cited works in AI.
After Google's 2017 Transformer (the topic of EP03) conquered language, people started asking — "Could we use Transformers for images too?"
October 2020. Google Research published ViT (Vision Transformer). It chopped images into 16×16 patches, treated each as a token, and learned attention between patches. On large enough datasets, it beat CNNs.
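The patch step is easy to picture in code. Below is a minimal sketch that only does the chopping; the learned linear projection, position embeddings, and attention layers that follow in a real ViT are left out. The function name `to_patches` and the 224×224 RGB input size are assumptions for illustration.

```python
def to_patches(image, patch_size):
    """Split an H x W x C nested-list image into flattened patch tokens."""
    height, width = len(image), len(image[0])
    assert height % patch_size == 0 and width % patch_size == 0
    tokens = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            token = []
            for r in range(top, top + patch_size):
                for c in range(left, left + patch_size):
                    token.extend(image[r][c])  # append this pixel's channels
            tokens.append(token)
    return tokens


# Blank 224x224 RGB image (assumed input size).
image = [[[0.0, 0.0, 0.0] for _ in range(224)] for _ in range(224)]
tokens = to_patches(image, 16)
print(len(tokens), len(tokens[0]))  # 196 768
```

A 224×224 image becomes a sequence of 196 tokens, each holding 16 × 16 × 3 = 768 numbers, and from there the model can treat the image much like a sentence of 196 words.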
The auto-portrait detection in your phone camera, Tesla's pedestrian detection, Samsung and LG fab defect inspection, medical X-ray analysis: all of it runs on CNNs (or their descendants).
The tiny shared filter LeCun introduced in 1989 has grown into ResNet-152 with about 60 million weights and ViT-Huge with more than 600 million. But the core idea of sliding a small filter across an image has not changed.
In the next post (EP03), we follow the parallel arc that started in 1997 when Sepp Hochreiter and Jürgen Schmidhuber published LSTM and ended with the 2017 Google paper "Attention Is All You Need" that unified everything.