Computer Vision
Computer vision enables machines to interpret and understand visual information from images and video. Understanding the different task types and the CNN architecture that powers most vision models is essential for choosing the right approach for your problem.
Vision Task Types
| Task | Output | Architecture | When to Use |
|---|---|---|---|
| Image Classification | Single label for entire image | CNN (ResNet, VGG, Inception) | "Is this a cat or dog?" — one label per image |
| Object Detection | Bounding boxes + labels | SSD, YOLO, Faster R-CNN | "Where are the cars and people?" — multiple objects with locations |
| Semantic Segmentation | Pixel-level class label | FCN, U-Net, DeepLab | "Label every pixel" — autonomous driving, medical imaging |
Decision Guide
"What is in this image?" → Image Classification (one label)
"Where are the objects?" → Object Detection (bounding boxes)
"Label every pixel" → Semantic Segmentation (pixel-level precision)
Use semantic segmentation when bounding boxes are not precise enough. For example, "detect yellow lane lines on roads" requires pixel-level precision — bounding boxes would be far too coarse.
"Customer segmentation" is a business term for clustering (K-Means), NOT semantic segmentation. Semantic segmentation is a computer vision technique for pixel-level image labeling. Do not confuse the two.
CNN Architecture Layers
Convolutional Neural Networks (CNNs) are the backbone of most computer vision models. They process images through a sequence of specialized layers.
| Layer | What It Does | Details |
|---|---|---|
| Convolutional Layer | Applies filters to detect features | Each filter detects one feature type (edges, textures, shapes). Early layers detect simple features; deeper layers detect complex patterns |
| Pooling Layer (MaxPool) | Reduces spatial dimensions | Takes the max value in each region. Reduces computation and helps prevent overfitting |
| Flatten Layer | Converts 2D feature maps to 1D vector | Bridges between convolutional and fully connected layers |
| Fully Connected (Dense) Layer | Combines features for final output | Last layer maps to the number of output classes |
CNN Data Flow
```
Input Image (224×224×3)
        ↓
Convolutional Layers (extract features: edges → shapes → objects)
        ↓
Pooling Layers (reduce spatial dimensions)
        ↓
Flatten (2D → 1D vector)
        ↓
Fully Connected Layers (combine features)
        ↓
Output (class probabilities via softmax)
```
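The data flow above can be sketched as a minimal PyTorch module. The layer sizes (16 and 32 filters, two conv/pool stages) are illustrative choices, not a prescribed architecture:

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """Minimal CNN mirroring the conv -> pool -> flatten -> FC data flow."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # filters detect simple features
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 224 -> 112
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # deeper layer, complex patterns
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 112 -> 56
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                                 # 2D feature maps -> 1D vector
            nn.Linear(32 * 56 * 56, num_classes),         # map features to class scores
        )

    def forward(self, x):
        return self.classifier(self.features(x))

x = torch.randn(1, 3, 224, 224)          # one RGB image, as in the diagram
logits = TinyCNN()(x)
probs = torch.softmax(logits, dim=1)     # class probabilities via softmax
print(logits.shape)  # torch.Size([1, 10])
```

Note how each pooling layer halves the spatial dimensions, which is why the flattened vector entering the dense layer has 32 × 56 × 56 elements.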
Transfer Learning for Vision
For most practical vision tasks, you do not train a CNN from scratch. Instead, use transfer learning:
- Load a pre-trained model (e.g., ResNet trained on ImageNet)
- Keep the convolutional layers — they capture universal visual features (edges, textures, shapes)
- Replace the last fully connected layer with one matching your number of classes
- Fine-tune on your dataset — optionally freeze early layers
```python
import torchvision.models as models
import torch.nn as nn

# Load pre-trained ResNet (the weights API replaces the deprecated pretrained=True)
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

# Freeze convolutional layers
for param in model.parameters():
    param.requires_grad = False

# Replace the final FC layer for 10 classes
model.fc = nn.Linear(model.fc.in_features, 10)
```
Object Detection Architectures
| Architecture | Speed | Accuracy | Approach |
|---|---|---|---|
| YOLO | Very fast (real-time) | Good | Single-pass: divides image into grid, predicts boxes and classes simultaneously |
| SSD | Fast | Good | Multi-scale feature maps for detecting objects of different sizes |
| Faster R-CNN | Slower | Highest | Two-stage: region proposal network + classification. Best accuracy |
Flashcards
What are the three main computer vision task types?
1) Image Classification — one label per image. 2) Object Detection — bounding boxes + labels for multiple objects. 3) Semantic Segmentation — pixel-level class labels for every pixel in the image.