Computer Vision

Computer vision enables machines to interpret and understand visual information from images and video. Understanding the different task types and the CNN architecture that powers most vision models is essential for choosing the right approach for your problem.

Vision Task Types

| Task | Output | Architecture | When to Use |
| --- | --- | --- | --- |
| Image Classification | Single label for entire image | CNN (ResNet, VGG, Inception) | "Is this a cat or dog?" — one label per image |
| Object Detection | Bounding boxes + labels | SSD, YOLO, Faster R-CNN | "Where are the cars and people?" — multiple objects with locations |
| Semantic Segmentation | Pixel-level class labels | FCN, U-Net, DeepLab | "Label every pixel" — autonomous driving, medical imaging |

Decision Guide

"What is in this image?"           → Image Classification (one label)
"Where are the objects?" → Object Detection (bounding boxes)
"Label every pixel" → Semantic Segmentation (pixel-level precision)
Key Insight

Use semantic segmentation when bounding boxes are not precise enough. For example, "detect yellow lane lines on roads" requires pixel-level precision — bounding boxes would be far too coarse.

Common Misconception

"Customer segmentation" is a business term for clustering (K-Means), NOT semantic segmentation. Semantic segmentation is a computer vision technique for pixel-level image labeling. Do not confuse the two.

CNN Architecture Layers

Convolutional Neural Networks (CNNs) are the backbone of most computer vision models. They process images through a sequence of specialized layers.

| Layer | What It Does | Details |
| --- | --- | --- |
| Convolutional Layer | Applies filters to detect features | Each filter detects one feature type (edges, textures, shapes). Early layers detect simple features; deeper layers detect complex patterns |
| Pooling Layer (MaxPool) | Reduces spatial dimensions | Takes the max value in each region. Reduces computation and helps prevent overfitting |
| Flatten Layer | Converts 2D feature maps to 1D vector | Bridges between convolutional and fully connected layers |
| Fully Connected (Dense) Layer | Combines features for final output | Last layer maps to the number of output classes |

CNN Data Flow

Input Image (224×224×3)
↓
Convolutional Layers (extract features: edges → shapes → objects)
↓
Pooling Layers (reduce spatial dimensions)
↓
Flatten (2D → 1D vector)
↓
Fully Connected Layers (combine features)
↓
Output (class probabilities via softmax)
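The flow above can be sketched as a tiny PyTorch model. All sizes here are illustrative choices, not taken from any named architecture:

```python
import torch
import torch.nn as nn

# Minimal CNN following the data flow above; layer sizes are
# illustrative, not from any particular architecture.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # conv: 3 input channels, 16 filters
    nn.ReLU(),
    nn.MaxPool2d(2),                 # pool: 224x224 -> 112x112
    nn.Flatten(),                    # 16x112x112 feature maps -> vector of 200704
    nn.Linear(16 * 112 * 112, 10),   # fully connected: map to 10 classes
    nn.Softmax(dim=1),               # class probabilities
)

x = torch.randn(1, 3, 224, 224)  # one 224x224 RGB image
probs = model(x)
print(probs.shape)  # torch.Size([1, 10])
```

In real training code the final softmax is usually folded into the loss function (e.g. `nn.CrossEntropyLoss` expects raw logits); it is kept explicit here only to match the diagram.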

Transfer Learning for Vision

For most practical vision tasks, you do not train a CNN from scratch. Instead, use transfer learning:

  1. Load a pre-trained model (e.g., ResNet trained on ImageNet)
  2. Keep the convolutional layers — they capture universal visual features (edges, textures, shapes)
  3. Replace the last fully connected layer with one matching your number of classes
  4. Fine-tune on your dataset — optionally freeze early layers
import torchvision.models as models
import torch.nn as nn

# Load pre-trained ResNet (newer torchvision versions prefer
# weights=models.ResNet50_Weights.DEFAULT over pretrained=True)
model = models.resnet50(pretrained=True)

# Freeze convolutional layers
for param in model.parameters():
    param.requires_grad = False

# Replace the final FC layer for 10 classes
model.fc = nn.Linear(model.fc.in_features, 10)

Object Detection Architectures

| Architecture | Speed | Accuracy | Approach |
| --- | --- | --- | --- |
| YOLO | Very fast (real-time) | Good | Single-pass: divides image into a grid, predicts boxes and classes simultaneously |
| SSD | Fast | Good | Multi-scale feature maps for detecting objects of different sizes |
| Faster R-CNN | Slower | Highest | Two-stage: region proposal network + classification. Best accuracy |

Flashcards

Question

What are the three main computer vision task types?

Answer

1) Image Classification — one label per image. 2) Object Detection — bounding boxes + labels for multiple objects. 3) Semantic Segmentation — pixel-level class labels for every pixel in the image.