Computer Vision
Computer vision enables machines to interpret and understand visual information from images and video. Understanding the different task types and the CNN architecture that powers most vision models is essential for choosing the right approach for your problem.
Vision Task Types
| Task | Output | Architecture | When to Use |
|---|---|---|---|
| Image Classification | Single label for entire image | CNN (ResNet, VGG, Inception) | "Is this a cat or dog?" — one label per image |
| Object Detection | Bounding boxes + labels | SSD, YOLO, Faster R-CNN | "Where are the cars and people?" — multiple objects with locations |
| Semantic Segmentation | Pixel-level class label | FCN, U-Net, DeepLab | "Label every pixel" — autonomous driving, medical imaging |
Decision Guide
"What is in this image?" → Image Classification (one label)
"Where are the objects?" → Object Detection (bounding boxes)
"Label every pixel" → Semantic Segmentation (pixel-level precision)
Use semantic segmentation when bounding boxes are not precise enough. For example, "detect yellow lane lines on roads" requires pixel-level precision — bounding boxes would be far too coarse.
"Customer segmentation" is a business term for clustering (K-Means), NOT semantic segmentation. Semantic segmentation is a computer vision technique for pixel-level image labeling. Do not confuse the two.
CNN Architecture Layers
Convolutional Neural Networks (CNNs) are the backbone of most computer vision models. They process images through a sequence of specialized layers.
| Layer | What It Does | Details |
|---|---|---|
| Convolutional Layer | Applies filters to detect features | Each filter detects one feature type (edges, textures, shapes). Early layers detect simple features; deeper layers detect complex patterns |
| Pooling Layer (MaxPool) | Reduces spatial dimensions | Takes the max value in each region. Reduces computation and helps prevent overfitting |
| Flatten Layer | Converts 2D feature maps to 1D vector | Bridges between convolutional and fully connected layers |
| Fully Connected (Dense) Layer | Combines features for final output | Last layer maps to the number of output classes |
CNN Data Flow
```
Input Image (224×224×3)
        ↓
Convolutional Layers (extract features: edges → shapes → objects)
        ↓
Pooling Layers (reduce spatial dimensions)
        ↓
Flatten (2D → 1D vector)
        ↓
Fully Connected Layers (combine features)
        ↓
Output (class probabilities via softmax)
```
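The data flow above can be sketched as a minimal PyTorch module. The layer sizes (16 and 32 filters, two conv/pool stages) are illustrative choices, not a prescribed architecture:

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """Minimal CNN mirroring the conv -> pool -> flatten -> FC data flow."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # filters detect simple features
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 224 -> 112
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # deeper layer, complex patterns
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 112 -> 56
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                                 # 2D feature maps -> 1D vector
            nn.Linear(32 * 56 * 56, num_classes),         # map features to class scores
        )

    def forward(self, x):
        return self.classifier(self.features(x))

x = torch.randn(1, 3, 224, 224)          # one RGB image, as in the diagram
logits = TinyCNN()(x)
probs = torch.softmax(logits, dim=1)     # class probabilities via softmax
print(logits.shape)  # torch.Size([1, 10])
```

Note how each pooling layer halves the spatial dimensions, which is why the flattened vector entering the dense layer has 32 × 56 × 56 elements.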
Transfer Learning for Vision
For most practical vision tasks, you do not train a CNN from scratch. Instead, use transfer learning:
- Load a pre-trained model (e.g., ResNet trained on ImageNet)
- Keep the convolutional layers — they capture universal visual features (edges, textures, shapes)
- Replace the last fully connected layer with one matching your number of classes
- Fine-tune on your dataset — optionally freeze early layers
```python
import torchvision.models as models
import torch.nn as nn

# Load pre-trained ResNet (the weights API replaces the deprecated pretrained=True)
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

# Freeze convolutional layers
for param in model.parameters():
    param.requires_grad = False

# Replace the final FC layer for 10 classes
model.fc = nn.Linear(model.fc.in_features, 10)
```
Object Detection Architectures
| Architecture | Speed | Accuracy | Approach |
|---|---|---|---|
| YOLO | Very fast (real-time) | Good | Single-pass: divides image into grid, predicts boxes and classes simultaneously |
| SSD | Fast | Good | Multi-scale feature maps for detecting objects of different sizes |
| Faster R-CNN | Slower | Highest | Two-stage: region proposal network + classification. Best accuracy |
Flashcards
What are the three main computer vision task types?
1) Image Classification — one label per image. 2) Object Detection — bounding boxes + labels for multiple objects. 3) Semantic Segmentation — pixel-level class labels for every pixel in the image.