AI History · EP 02

How computers got eyes —
a 30-year story

In 1989, at Bell Labs in New Jersey, a computer started reading handwritten ZIP codes for the first time. Then 23 years of silence. In the fall of 2012, one model destroyed the ImageNet leaderboard — and everything changed.

5 min read · 2026.05.04 · 1989 → 2020

01 · 1989, the machine that started reading mail

1989. AT&T Bell Labs in New Jersey. A 28-year-old French researcher named Yann LeCun stood up at a conference and announced something modest-sounding: "We trained a neural network to recognize handwritten digits."

📷
Yann LeCun
b.1960 · Bell Labs → NYU → Meta Chief AI Scientist

The de facto inventor of the CNN (Convolutional Neural Network). Created LeNet in 1989 and LeNet-5 in 1998. Shared the 2018 Turing Award with Hinton and Bengio.

LeCun's 'LeNet' introduced two ideas that turned out to be the foundation of computer vision. ① Locality — a pixel only meaningfully relates to its neighbors. ② Weight sharing — slide the same small filter (e.g. 3×3) across the entire image.
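Here is a minimal sketch of those two ideas in plain Python. It is not LeNet's actual code: the toy 5×5 image and the hand-picked vertical-edge kernel are assumptions for illustration, and in a real CNN the kernel values are learned rather than written by hand.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (lists of rows), stride 1, no padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Locality: each output value looks only at a kh x kw neighborhood.
            # Weight sharing: the same kernel values are reused at every (r, c).
            total = 0
            for i in range(kh):
                for j in range(kw):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        feature_map.append(row)
    return feature_map


# Toy image (an assumption): dark on the left (0), bright on the right (1).
image = [[0, 0, 1, 1, 1] for _ in range(5)]

# Hand-picked kernel (an assumption) that responds to dark-to-bright edges.
vertical_edge = [[-1, 0, 1],
                 [-1, 0, 1],
                 [-1, 0, 1]]

feature_map = conv2d_valid(image, vertical_edge)
for row in feature_map:
    print(row)
```

The output is large wherever the window straddles the dark-to-bright boundary and zero where the window sees only bright pixels. Nine numbers were enough to scan the whole image.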

Together these two ideas cut the weight count dramatically. Connecting every pixel of a 100×100 image to a fully connected layer of just 100 units takes about a million weights (10,000 × 100); one shared 3×3 filter takes nine. Suddenly the network was trainable.
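The arithmetic behind that comparison can be sketched as follows. The 100-unit fully connected layer is an assumption chosen to match the "about a million" figure, and biases are ignored for simplicity.

```python
# Input size taken from the text: a 100 x 100 grayscale image.
height, width = 100, 100

# Assumption: a fully connected layer with 100 units, each wired to every pixel.
hidden_units = 100
dense_weights = height * width * hidden_units

# One shared 3 x 3 filter, reused at every position of the image.
conv_weights = 3 * 3

print(dense_weights)                  # 1000000
print(conv_weights)                   # 9
print(dense_weights // conv_weights)  # 111111
```

A real convolutional layer uses several filters rather than one, so the practical saving is smaller than this ratio, but it remains several orders of magnitude.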

📮 What it was used for
AT&T deployed LeNet into actual production for the US Postal Service's automatic ZIP code reader. Through the 1990s, a substantial fraction of handwritten checks and mail routed in the US was being read by LeCun's network — though most people had no idea.

02 · And then 23 years on the sidelines

From the 1998 LeNet-5 paper through about 2010, CNNs barely registered outside the research community. The dominant computer vision techniques were simpler — SVMs, HOG, SIFT. They were faster to train and worked better on the small datasets of the time.

The reason was the same as in EP01. Training a real neural network needed hundreds of thousands of labeled images plus a fast GPU, and neither was available. CNNs were filed under "elegant in theory, doesn't work in practice."

03 · Fall 2012, the day everything changed

The 2012 ImageNet competition (ILSVRC): 1.2 million training images, 1,000 categories. Results were announced in the fall. A team from the University of Toronto won by a brutal margin: its top-5 error was 15.3%, against 26.2% for the second-place entry.
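For readers who have not met the metric: top-5 error counts a prediction as wrong only if the true label is missing from the model's five highest-scoring guesses. A minimal sketch, using made-up scores over six classes (the function name and the numbers are illustrative assumptions, not ILSVRC tooling):

```python
def top5_error(scores_per_image, true_labels):
    """Fraction of images whose true label is not among the 5 top-scoring classes."""
    misses = 0
    for scores, label in zip(scores_per_image, true_labels):
        # Class indices sorted from highest to lowest score, keep the best five.
        ranked = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
        if label not in ranked[:5]:
            misses += 1
    return misses / len(true_labels)


# Made-up scores for two images over six classes (index 0 scores highest).
scores = [
    [0.30, 0.25, 0.20, 0.15, 0.07, 0.03],
    [0.30, 0.25, 0.20, 0.15, 0.07, 0.03],
]
labels = [4, 5]  # class 4 is inside the top five, class 5 is not

error = top5_error(scores, labels)
print(error)  # 0.5
```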

Their model was 'AlexNet'. The three authors:

🏆
Alex Krizhevsky · Ilya Sutskever · Geoffrey Hinton
University of Toronto · NeurIPS 2012

Hinton (the same Hinton from EP01) and his two PhD students. AlexNet was an 8-layer CNN trained on two NVIDIA GTX 580 GPUs, consumer cards built for gaming. That was the start of everything. The paper popularized ReLU activations, dropout regularization, and several other techniques that are standard today.

📌 What that day meant
AlexNet's 15.3% top-5 error against the 26.2% of the second-place ISI team from Japan (using classical methods) was a gap of nearly 11 percentage points, the largest leap in ImageNet history. Within a few years, nearly every vision paper had switched to CNNs. SVMs, HOG, and SIFT effectively disappeared from the leaderboards.

And — the deep learning era had truly begun.

04 · 2015, the man who stacked 152 layers

2012 to 2014. Everyone was racing to make networks deeper. AlexNet 8 layers → VGG 16 → 19. Then something strange started happening: past about 20 layers, plain networks actually got worse, even on the training data, so overfitting could not be the explanation.

🇨🇳
Kaiming He
Microsoft Research Asia · ResNet (2015) · arXiv:1512.03385

In December 2015, He and colleagues at Microsoft Research Asia published ResNet. The core idea, in one sentence: "Add the input of a layer to its output." This is the skip connection (y = F(x) + x). That single change made it possible to train networks 152 layers deep.
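One common intuition for why this helps can be sketched in a few lines. The code below is a toy illustration, not the ResNet architecture: the vector input, the all-zero branch standing in for F, and the depth of 50 blocks are assumptions made for the example.

```python
def residual_block(x, f):
    """Return y = F(x) + x for a vector x given as a list of floats."""
    fx = f(x)
    return [a + b for a, b in zip(fx, x)]


def plain_layer(x, f):
    """A layer without a skip connection: y = F(x)."""
    return f(x)


def zero_branch(x):
    # Hypothetical F that has learned nothing yet: it outputs all zeros.
    return [0.0 for _ in x]


x = [1.0, -2.0, 3.0]
depth = 50  # illustrative depth, not a claim about ResNet's block count

y_residual = x
y_plain = x
for _ in range(depth):
    y_residual = residual_block(y_residual, zero_branch)
    y_plain = plain_layer(y_plain, zero_branch)

print(y_residual)  # [1.0, -2.0, 3.0]
print(y_plain)     # [0.0, 0.0, 0.0]
```

With the skip connection, a block that has learned nothing simply passes its input through, so adding more blocks does not have to hurt. Without it, the same block wipes out the signal.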

A ResNet ensemble reached 3.57% top-5 error on ImageNet, below the roughly 5% commonly cited as an estimate for a trained human annotator. Almost every subsequent vision model, including Transformers, uses skip connections. As of 2026, the ResNet paper is one of the most-cited works in AI.

05 · 2020, Transformers ate computer vision too

After Google's 2017 Transformer (the topic of EP03) conquered language, people started asking — "Could we use Transformers for images too?"

October 2020. Google Research published ViT (Vision Transformer). It chopped each image into patches of 16×16 pixels, treated each patch as a token, and learned attention between patches. On large enough datasets, it beat CNNs.
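The patch step alone can be sketched like this. The 224×224 RGB input is an assumption (a common ImageNet resolution), and the sketch stops before the parts that make ViT a Transformer: the learned linear projection, the position embeddings, and the attention layers are all omitted.

```python
def patchify(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened patch tokens."""
    height, width = len(image), len(image[0])
    assert height % patch_size == 0 and width % patch_size == 0
    tokens = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            token = []
            for r in range(top, top + patch_size):
                for c in range(left, left + patch_size):
                    token.extend(image[r][c])  # append every channel value
            tokens.append(token)
    return tokens


# Assumed input: a blank 224 x 224 image with 3 color channels.
image = [[[0.0, 0.0, 0.0] for _ in range(224)] for _ in range(224)]

tokens = patchify(image, 16)
num_tokens = len(tokens)    # 14 * 14 patches
token_dim = len(tokens[0])  # 16 * 16 pixels * 3 channels

print(num_tokens, token_dim)  # 196 768
```

Under these assumptions the image becomes a sequence of 196 tokens, which is what lets a model designed for sentences process a picture.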

🎯 But ResNet still rules in industry
ViT took academic SOTA, but as of 2026 the actual deployed models in semiconductor inspection, autonomous driving vision, and medical imaging are mostly still ResNet-based. Why: more stable on small datasets, faster inference, easier mobile deployment. ViT only dominates when the model and dataset are huge.

06 · So where is the computer's eye now

The auto-portrait detection in your phone camera, Tesla's pedestrian detection, Samsung and LG fab defect inspection, medical X-ray analysis — all of it is CNN (or its descendants).

The small shared filter LeCun built in 1989 has grown into ResNet-152 with about 60 million weights and ViT-Huge with over 600 million. But the core idea of CNNs, sliding a small filter across an image, has not changed.

In the next post (EP03), we follow the parallel arc that started in 1997, when Sepp Hochreiter & Jürgen Schmidhuber published LSTM, and ended with the 2017 Google paper "Attention Is All You Need" that unified everything.

🧪
Try it · AI Lab
See a CNN filter slide across an image →
Watch a 3×3 kernel sweep an 8×8 input and build the feature map. Compare 6 different kernels (horizontal/vertical edge, Sobel, blur, sharpen, identity).