Semantic Segmentation
Semantic segmentation assigns a class label to every pixel in an image; unlike instance segmentation, it does not distinguish between separate instances of the same class (e.g., two cars both receive the label "car").
1. Popular Datasets for Semantic Segmentation
A wide variety of datasets are available for training and evaluating semantic segmentation models. Each dataset has unique characteristics depending on the application domain, such as autonomous driving, general object segmentation, or medical imaging.
a. Autonomous Driving Datasets
- Cityscapes
- Description: Large-scale dataset for urban scene understanding. Includes street-level imagery from various German cities.
- Classes: 30 classes (19 commonly used for evaluation; e.g., road, person, car, building).
- Size: 5,000 finely annotated images (2,975 train / 500 validation / 1,525 test), plus 20,000 coarsely annotated images.
- Resolution: 2048×1024 pixels.
- Mapillary Vistas
- Description: A diverse dataset collected from various countries, designed for street-level segmentation tasks.
- Classes: 66 object categories.
- Size: 25,000 annotated high-resolution images.
- Resolution: Varies, high resolution.
- CamVid
- Description: One of the earlier datasets for autonomous driving, with video sequences.
- Classes: 32 classes (11 commonly used for evaluation).
- Size: 701 labeled frames.
- Resolution: 960×720 pixels.
b. General Object Segmentation Datasets
- PASCAL VOC 2012
- Description: Widely used dataset for image classification, detection, and segmentation tasks.
- Classes: 20 object categories plus a background class (21 in total).
- Size: 1,464 images (train), 1,449 (val), 1,456 (test).
- Resolution: Varies.
- MS COCO (Common Objects in Context)
- Description: Large-scale object detection, segmentation, and captioning dataset.
- Classes: 80 object categories.
- Size: 123,287 images with pixel-wise annotations.
- Resolution: Varies.
c. Medical Imaging Datasets
- Lung Nodule Analysis (LUNA)
- Description: A dataset for detecting and segmenting lung nodules in CT scans.
- Classes: Lung nodules.
- Size: 888 CT scans.
- Resolution: 3D volumetric (voxel-level) data.
- BraTS (Brain Tumor Segmentation)
- Description: A dataset for segmenting brain tumors using MRI scans.
- Classes: Tumor sub-regions (e.g., enhancing tumor, edema, necrotic core) versus healthy tissue.
- Size: 500+ multi-modal 3D MRI scans (the exact count varies by challenge year).
- Resolution: 3D volumetric (voxel-level) data.
2. Popular Models for Semantic Segmentation
a. Fully Convolutional Networks (FCN)
- Architecture: The first major deep learning-based approach for semantic segmentation. FCNs use convolutional layers for both feature extraction and pixel-wise classification.
- Key Idea: Replace fully connected layers with convolutional layers for dense pixel prediction.
- Reference: Long et al., 2015.
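A minimal PyTorch sketch of this idea, using a toy three-layer backbone rather than the VGG backbone of the original paper; a 1×1 convolution stands in for the fully connected classifier, and bilinear upsampling restores the input resolution:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFCN(nn.Module):
    """Toy FCN: conv backbone -> 1x1 conv classifier -> upsample to input size."""
    def __init__(self, num_classes: int):
        super().__init__()
        # Illustrative backbone: three strided convs downsample by 8x.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # The 1x1 convolution plays the role of the fully connected classifier.
        self.classifier = nn.Conv2d(128, num_classes, kernel_size=1)

    def forward(self, x):
        h, w = x.shape[-2:]
        logits = self.classifier(self.backbone(x))        # (N, C, h/8, w/8)
        return F.interpolate(logits, size=(h, w),
                             mode="bilinear", align_corners=False)

out = TinyFCN(num_classes=21)(torch.randn(1, 3, 128, 128))
print(out.shape)  # torch.Size([1, 21, 128, 128]) -- one score map per class
```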
b. U-Net
- Architecture: Symmetrical “U”-shaped network with encoder-decoder architecture.
- Key Idea: Encoder extracts features, and the decoder upsamples to full resolution using skip connections to recover spatial information.
- Applications: Extremely popular in medical imaging.
- Reference: Ronneberger et al., 2015.
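A one-level sketch of the pattern (the original network uses four levels and unpadded convolutions); `TinyUNet` and its channel sizes are illustrative:

```python
import torch
import torch.nn as nn

def double_conv(cin, cout):
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    def __init__(self, num_classes: int):
        super().__init__()
        self.enc = double_conv(3, 32)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = double_conv(32, 64)
        self.up = nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2)
        self.dec = double_conv(64, 32)   # 64 = 32 upsampled + 32 from the skip
        self.head = nn.Conv2d(32, num_classes, kernel_size=1)

    def forward(self, x):
        e = self.enc(x)                          # encoder features, kept for skip
        b = self.bottleneck(self.pool(e))        # low-resolution bottleneck
        d = self.up(b)                           # learned upsampling
        d = self.dec(torch.cat([d, e], dim=1))   # skip connection restores detail
        return self.head(d)

out = TinyUNet(num_classes=2)(torch.randn(1, 3, 64, 64))  # -> (1, 2, 64, 64)
```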
c. DeepLab (v1, v2, v3, v3+)
- Architecture: Encoder-decoder architecture with Atrous (dilated) convolutions to capture multi-scale context.
- Key Idea: Atrous convolutions and ASPP (Atrous Spatial Pyramid Pooling) enable capturing objects at multiple scales.
- DeepLab v3+: Combines spatial pyramid pooling with a decoder module for better object boundary detection.
- Reference: Chen et al., 2017.
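A rough sketch of an ASPP-style module; the actual DeepLab v3 ASPP also includes a 1×1 branch and global image pooling (omitted here), and the channel sizes below are illustrative:

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Parallel dilated 3x3 convs; larger rates see larger receptive fields."""
    def __init__(self, cin: int, cout: int, rates=(1, 6, 12, 18)):
        super().__init__()
        # padding == dilation keeps the spatial size unchanged for 3x3 kernels.
        self.branches = nn.ModuleList([
            nn.Conv2d(cin, cout, kernel_size=3, padding=r, dilation=r)
            for r in rates
        ])
        self.project = nn.Conv2d(cout * len(rates), cout, kernel_size=1)

    def forward(self, x):
        feats = [branch(x) for branch in self.branches]   # multi-scale context
        return self.project(torch.cat(feats, dim=1))

y = ASPP(256, 256)(torch.randn(1, 256, 32, 32))   # -> (1, 256, 32, 32)
```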
d. SegNet
- Architecture: Encoder-decoder architecture that recovers resolution using max-pooling indices from the encoder.
- Key Idea: Efficient upsampling using indices from max-pooling in the encoder, which reduces computational cost.
- Reference: Badrinarayanan et al., 2017.
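PyTorch exposes this mechanism directly; a minimal illustration of pooling with remembered indices and exact unpooling:

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(2, stride=2, return_indices=True)   # keep argmax positions
unpool = nn.MaxUnpool2d(2, stride=2)                    # reuse them to upsample

x = torch.randn(1, 8, 16, 16)
pooled, indices = pool(x)            # (1, 8, 8, 8) plus the winning locations
restored = unpool(pooled, indices)   # (1, 8, 16, 16); non-max positions are zero
```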
e. PSPNet (Pyramid Scene Parsing Network)
- Architecture: Uses a pyramid pooling module to capture multi-scale global context.
- Key Idea: Captures global scene-level context using pyramid pooling before the final pixel-wise prediction.
- Reference: Zhao et al., 2017.
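A sketch of the pyramid pooling module using the 1/2/3/6 bin sizes from the paper; the channel-reduction scheme is illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    def __init__(self, cin: int, bins=(1, 2, 3, 6)):
        super().__init__()
        cmid = cin // len(bins)
        # Each stage pools the whole map to a b x b grid, then reduces channels.
        self.stages = nn.ModuleList([
            nn.Sequential(nn.AdaptiveAvgPool2d(b),
                          nn.Conv2d(cin, cmid, kernel_size=1))
            for b in bins
        ])

    def forward(self, x):
        h, w = x.shape[-2:]
        pyramids = [F.interpolate(stage(x), size=(h, w), mode="bilinear",
                                  align_corners=False) for stage in self.stages]
        return torch.cat([x] + pyramids, dim=1)   # local + global context

y = PyramidPooling(256)(torch.randn(1, 256, 32, 32))   # -> (1, 512, 32, 32)
```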
f. HRNet (High-Resolution Network)
- Architecture: Maintains high-resolution feature maps throughout the network while using multi-scale fusion.
- Key Idea: Avoids the downsampling-heavy nature of many segmentation networks, preserving spatial detail.
- Reference: Wang et al., 2020.
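A toy two-branch exchange to illustrate the fusion idea; HRNet itself runs up to four parallel resolutions with repeated fusions, so `TwoBranchFusion` is only a conceptual sketch, not HRNet code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchFusion(nn.Module):
    """Exchange information between a high- and a low-resolution branch."""
    def __init__(self, c_hi: int = 32, c_lo: int = 64):
        super().__init__()
        self.hi_to_lo = nn.Conv2d(c_hi, c_lo, 3, stride=2, padding=1)  # downsample
        self.lo_to_hi = nn.Conv2d(c_lo, c_hi, kernel_size=1)           # match channels

    def forward(self, x_hi, x_lo):
        up = F.interpolate(self.lo_to_hi(x_lo), size=x_hi.shape[-2:],
                           mode="bilinear", align_corners=False)
        # Both branches continue at their own resolution after the exchange.
        return x_hi + up, x_lo + self.hi_to_lo(x_hi)

hi, lo = TwoBranchFusion()(torch.randn(1, 32, 64, 64), torch.randn(1, 64, 32, 32))
```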
3. Important Hyperparameters
Tuning hyperparameters is crucial for improving the performance of segmentation models. The most important ones are listed below; a configuration sketch follows the list.
a. Learning Rate
- Description: Controls the step size during optimization.
- Typical Range: 0.0001 – 0.01, usually combined with a schedule such as cosine annealing or warm restarts.
b. Batch Size
- Description: The number of samples processed before the model is updated.
- Typical Range: 2 – 16 (for large images due to memory constraints).
c. Number of Filters / Feature Maps
- Description: Number of filters in convolutional layers, which controls model capacity.
- Typical Range: 32 – 512 per layer, depending on model depth and complexity.
d. Optimizer
- Popular Choices:
- Adam (adaptive learning rate).
- SGD with momentum (common in large-scale datasets).
e. Weight Decay / L2 Regularization
- Description: Helps prevent overfitting by penalizing large weights.
- Typical Range: 0.0001 – 0.001.
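A minimal PyTorch sketch tying these choices together; the one-layer model is a stand-in and all values are illustrative, not tuned:

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 21, kernel_size=1)   # stand-in for a real segmentation net

# SGD with momentum plus L2 weight decay; cosine annealing decays the LR.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)

for epoch in range(50):
    # ... per batch: forward pass, loss.backward(), optimizer.step() ...
    optimizer.zero_grad()
    scheduler.step()   # one learning-rate update per epoch
```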
4. Popular Loss Functions
In semantic segmentation, the loss function plays a critical role: it is the pixel-wise error signal the network minimizes during training. Common loss functions, each followed by a short code sketch, include:
a. Cross-Entropy Loss
- Description: The most common loss for multi-class pixel-wise classification tasks.
- Formula: \(L = - \sum_{i} y_i \log(\hat{y}_i)\), where \(y_i\) is the true label and \(\hat{y}_i\) is the predicted probability for class \(i\).
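In PyTorch this is built in; `nn.CrossEntropyLoss` accepts (batch, classes, H, W) logits and an integer label map directly:

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()             # log-softmax + NLL in one call

logits = torch.randn(2, 21, 64, 64)           # raw per-class scores per pixel
target = torch.randint(0, 21, (2, 64, 64))    # integer class index per pixel
loss = criterion(logits, target)              # mean over all pixels
```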
b. Dice Loss
- Description: Measures the overlap between predicted and ground truth segmentation.
- Formula: \(L = 1 - \frac{2\,|Y \cap \hat{Y}|}{|Y| + |\hat{Y}|}\), where \(Y\) is the ground-truth mask and \(\hat{Y}\) is the predicted mask.
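A minimal differentiable ("soft") Dice loss for the binary case; the `dice_loss` helper and the epsilon smoothing follow the usual implementation pattern and are not from a specific paper:

```python
import torch

def dice_loss(probs: torch.Tensor, target: torch.Tensor, eps: float = 1e-6):
    """probs: predicted foreground probabilities; target: binary mask; (N, H, W)."""
    inter = (probs * target).sum(dim=(1, 2))
    denom = probs.sum(dim=(1, 2)) + target.sum(dim=(1, 2))
    return (1 - (2 * inter + eps) / (denom + eps)).mean()

probs = torch.sigmoid(torch.randn(2, 64, 64))
target = (torch.rand(2, 64, 64) > 0.5).float()
print(dice_loss(probs, target))
```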
c. Intersection over Union (IoU) Loss
- Description: Measures the overlap between the predicted and ground truth areas.
- Formula: \(\text{IoU} = \frac{|Y \cap \hat{Y}|}{|Y \cup \hat{Y}|}\), with the loss taken as \(L = 1 - \text{IoU}\); here \(|Y|\) is the area of the ground truth and \(|\hat{Y}|\) the area of the prediction.
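The soft IoU (Jaccard) loss differs from soft Dice only in the denominator; a sketch under the same binary-mask assumptions as above:

```python
import torch

def soft_iou_loss(probs: torch.Tensor, target: torch.Tensor, eps: float = 1e-6):
    inter = (probs * target).sum(dim=(1, 2))
    union = (probs + target - probs * target).sum(dim=(1, 2))  # soft union
    return (1 - (inter + eps) / (union + eps)).mean()
```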
d. Tversky Loss
- Description: A generalization of Dice Loss, controlling false positives and false negatives.
- Formula: \(L = 1 - \frac{|Y \cap \hat{Y}|}{|Y \cap \hat{Y}| + \alpha\,|\hat{Y} \setminus Y| + \beta\,|Y \setminus \hat{Y}|}\), where \(\alpha\) weights false positives and \(\beta\) weights false negatives; \(\alpha = \beta = 0.5\) recovers Dice loss.
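A sketch mirroring the formula above; the defaults alpha = 0.3, beta = 0.7 are a common choice when false negatives matter more (as in medical imaging), not a universal setting:

```python
import torch

def tversky_loss(probs, target, alpha=0.3, beta=0.7, eps=1e-6):
    """alpha weights false positives, beta false negatives; 0.5/0.5 gives Dice."""
    tp = (probs * target).sum(dim=(1, 2))          # soft true positives
    fp = (probs * (1 - target)).sum(dim=(1, 2))    # soft false positives
    fn = ((1 - probs) * target).sum(dim=(1, 2))    # soft false negatives
    return (1 - (tp + eps) / (tp + alpha * fp + beta * fn + eps)).mean()
```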
5. Other Important Topics
a. Data Augmentation
- Techniques: Random crop, horizontal/vertical flipping, color jittering, and elastic deformation.
- Purpose: Prevent overfitting and improve generalization.
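For segmentation, the image and its mask must receive identical spatial transforms, or the labels drift out of alignment. A sketch using torchvision's functional API; the crop size and jitter range are illustrative:

```python
import random
import torch
import torchvision.transforms.functional as TF

def joint_augment(image: torch.Tensor, mask: torch.Tensor, size: int = 256):
    """Apply the SAME random flip and crop to both image and mask."""
    if random.random() < 0.5:
        image, mask = TF.hflip(image), TF.hflip(mask)
    top = random.randint(0, image.shape[-2] - size)
    left = random.randint(0, image.shape[-1] - size)
    image = TF.crop(image, top, left, size, size)
    mask = TF.crop(mask, top, left, size, size)
    # Photometric jitter is applied to the image only -- labels are unaffected.
    image = TF.adjust_brightness(image, 0.8 + 0.4 * random.random())
    return image, mask

img, msk = joint_augment(torch.rand(3, 512, 512),
                         torch.randint(0, 21, (1, 512, 512)))
```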
b. Post-Processing Techniques
- CRF (Conditional Random Field): Often used as a post-processing step to refine segmentation boundaries by enforcing spatial consistency.
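A rough sketch using the third-party pydensecrf package, following its documented usage; the pairwise parameters (sxy, srgb, compat) below are illustrative defaults, not tuned values:

```python
import numpy as np
import pydensecrf.densecrf as dcrf
from pydensecrf.utils import unary_from_softmax

def crf_refine(image: np.ndarray, probs: np.ndarray, n_iters: int = 5):
    """image: (H, W, 3) uint8; probs: (C, H, W) float32 softmax output."""
    c, h, w = probs.shape
    d = dcrf.DenseCRF2D(w, h, c)
    d.setUnaryEnergy(unary_from_softmax(probs))   # -log p(class) per pixel
    d.addPairwiseGaussian(sxy=3, compat=3)        # location-only smoothness
    d.addPairwiseBilateral(sxy=80, srgb=13,       # color-sensitive smoothness
                           rgbim=np.ascontiguousarray(image), compat=10)
    q = d.inference(n_iters)
    return np.argmax(q, axis=0).reshape(h, w)     # refined label map
```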
c. Evaluation Metrics
- Mean Intersection over Union (mIoU): The most widely used evaluation metric for segmentation; the per-class IoU averaged over all classes (a minimal implementation is sketched below).
- Pixel Accuracy: The ratio of correctly predicted pixels to total pixels.
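A minimal confusion-matrix implementation of mIoU; the `mean_iou` helper is illustrative:

```python
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """pred, gt: integer label maps of the same shape."""
    cm = np.bincount(num_classes * gt.reshape(-1) + pred.reshape(-1),
                     minlength=num_classes ** 2).reshape(num_classes, num_classes)
    inter = np.diag(cm)                            # correctly labeled pixels
    union = cm.sum(0) + cm.sum(1) - inter
    iou = inter / np.maximum(union, 1)             # guard against empty classes
    return float(iou[union > 0].mean())            # average over present classes

print(mean_iou(np.random.randint(0, 3, (64, 64)),
               np.random.randint(0, 3, (64, 64)), num_classes=3))
```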
6. References
- Long, J., Shelhamer, E., & Darrell, T. (2015). “Fully Convolutional Networks for Semantic Segmentation.” CVPR.
- Ronneberger, O., Fischer, P., & Brox, T. (2015). “U-Net: Convolutional Networks for Biomedical Image Segmentation.” MICCAI.
- Chen, L. C., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. L. (2017). “DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs.” PAMI.
- Badrinarayanan, V., Kendall, A., & Cipolla, R. (2017). “SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation.” PAMI.
- Zhao, H., Shi, J., Qi, X., Wang, X., & Jia, J. (2017). “Pyramid Scene Parsing Network.” CVPR.
- Wang, J., Sun, K., Cheng, T., Jiang, B., Deng, C., Zhao, Y., & Liu, W. (2020). “Deep High-Resolution Representation Learning for Visual Recognition.” PAMI.