Face Mask Detection
Multi-task classification on 50,000+ face images, simultaneously predicting mask-wearing correctness (4 classes) and mask type (3 classes). Architecture hand-built from scratch: EfficientNet-B0 backbone feeding a custom FPN with three YOLOv3-style classification heads at 13×13, 26×26 and 52×52 scales.
Medical vs non-medical classification on held-out test set. Professor's words: "I don't understand how."
4-class: correct / mouth exposed / nose exposed / fully off. Multi-scale FPN heads per class.
Diverse dataset across races and mask types. Weighted multi-task loss (CrossEntropyLoss × 2).
Architecture Design
The core idea comes from object detection rather than standard classification. YOLOv3 predicts from feature maps at three scales so that each scale specialises in objects of a different size: the coarse 13×13 map handles large objects and the fine 52×52 map handles small ones. Applied here to a classification task, each FPN scale gets its own classification head, and the heads' predictions are aggregated. This is unconventional for a classification problem, since detection-style heads are normally used to predict bounding boxes, but the intuition holds: fine-grained mask-wearing cues (nose exposure, chin gap) live at different spatial frequencies than coarse mask-type features.
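A minimal PyTorch sketch of the per-scale heads, to make the idea concrete. It assumes each scale pools its feature map and predicts both tasks, and that aggregation means averaging logits across scales. The text only says the predictions are aggregated, so the averaging, the channel counts, and the names ScaleHead and MultiScaleHeads are illustrative assumptions rather than the project's actual code.

```python
import torch
from torch import nn


class ScaleHead(nn.Module):
    """Hypothetical head for one FPN scale: pool the map, then one linear layer per task."""

    def __init__(self, in_channels: int, n_wear: int = 4, n_type: int = 3):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # (B, C, H, W) -> (B, C, 1, 1)
        self.wear = nn.Linear(in_channels, n_wear)       # mask-wearing correctness, 4 classes
        self.mask_type = nn.Linear(in_channels, n_type)  # mask type, 3 classes

    def forward(self, fmap):
        pooled = self.pool(fmap).flatten(1)  # (B, C)
        return self.wear(pooled), self.mask_type(pooled)


class MultiScaleHeads(nn.Module):
    """One head per scale; logits are averaged across scales (an assumed aggregation rule)."""

    def __init__(self, channels=(256, 128, 64)):  # assumed channel counts, coarse to fine
        super().__init__()
        self.heads = nn.ModuleList([ScaleHead(c) for c in channels])

    def forward(self, fmaps):
        outputs = [head(f) for head, f in zip(self.heads, fmaps)]
        wear = torch.stack([w for w, _ in outputs]).mean(dim=0)
        mask_type = torch.stack([t for _, t in outputs]).mean(dim=0)
        return wear, mask_type


# Random stand-ins for FPN outputs at the 13x13, 26x26 and 52x52 scales.
fmaps = [
    torch.randn(2, 256, 13, 13),
    torch.randn(2, 128, 26, 26),
    torch.randn(2, 64, 52, 52),
]
wear_logits, type_logits = MultiScaleHeads()(fmaps)
print(wear_logits.shape, type_logits.shape)
```

Averaging logits keeps the three scales on equal footing. A learned weighting or concatenation before a final linear layer would be equally plausible readings of "aggregated".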
The three heads share an EfficientNet-B0 backbone, with features extracted at layers 4, 6 and 8. A custom FPN upsamples the deeper feature maps and concatenates them with the lateral connections at each scale. Each head outputs logits for its task independently, and the two cross-entropy losses are summed with equal weight during training. The professor's scepticism about the accuracy was reasonable, since mask type is a visually easy sub-task, but the architecture design was the point.
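The merge step and the loss can be sketched as follows, again in PyTorch. This is an illustration under stated assumptions, not the original implementation: the fusing 3×3 convolution, the nearest-neighbour upsampling, the channel counts, and the names ConcatMerge and multi_task_loss are all hypothetical. Only the upsample-then-concatenate pattern and the equal-weight sum of two cross-entropy terms come from the description above.

```python
import torch
import torch.nn.functional as F
from torch import nn


class ConcatMerge(nn.Module):
    """One top-down FPN step: upsample the coarser map, concatenate the lateral map, fuse."""

    def __init__(self, coarse_ch: int, lateral_ch: int, out_ch: int):
        super().__init__()
        # Assumed fusion layer; the source does not say what follows the concatenation.
        self.fuse = nn.Conv2d(coarse_ch + lateral_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, coarse, lateral):
        # Resize the coarse map to the lateral map's spatial size (e.g. 13x13 -> 26x26).
        up = F.interpolate(coarse, size=lateral.shape[-2:], mode="nearest")
        return self.fuse(torch.cat([up, lateral], dim=1))  # concatenate along channels


def multi_task_loss(wear_logits, type_logits, wear_y, type_y, w_wear=1.0, w_type=1.0):
    """Weighted sum of two cross-entropy terms; the defaults give the equal-weight sum."""
    wear_loss = F.cross_entropy(wear_logits, wear_y)
    type_loss = F.cross_entropy(type_logits, type_y)
    return w_wear * wear_loss + w_type * type_loss


# Hypothetical channel counts; spatial sizes follow the 13x13 and 26x26 scales.
coarse = torch.randn(2, 256, 13, 13)
lateral = torch.randn(2, 128, 26, 26)
merged = ConcatMerge(256, 128, 128)(coarse, lateral)

# Random logits and labels standing in for a batch of two images.
wear_logits, type_logits = torch.randn(2, 4), torch.randn(2, 3)
wear_y, type_y = torch.tensor([0, 3]), torch.tensor([1, 2])
loss = multi_task_loss(wear_logits, type_logits, wear_y, type_y)
print(merged.shape, loss.item())
```

Concatenation, as in YOLOv3, keeps the upsampled and lateral features in separate channels and lets the following layer decide how to mix them, whereas the original FPN formulation adds them element-wise.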