Road Damage Detection at Scale: YOLOv5 Ensemble for Real-Time, Low-Cost Road Infrastructure Monitoring

Links:
- GitHub: https://github.com/philippe-heitzmann/Object_Detection
- Website: https://deeproad.ai/
- Paper (PDF): https://philippeheitzmann.com/wp-content/uploads/2022/01/Philippe_Heitzmann_DeepRoad_Global_Road_Detection_Challenge_January_2022.pdf
Introduction
This work applies Faster R-CNN and YOLO (“You Only Look Once”) to the 2020 Global Road Damage Detection (GRDC) Challenge (IEEE). Using the GRDC dataset of annotated road surface images, the goal is to detect and classify common distresses (e.g., longitudinal, lateral, alligator cracks, potholes) in near real time from a dashboard-mounted smartphone feed.
➡️ Jump to Example Video Predictions.
We propose a supervised object-detection solution using YOLOv5 and Faster R-CNN, delivering a 0.68 F1-score and ranking top-5 of 121 teams (Dec 2021), while preserving real-time viability on GPU.
Dataset
The GRDC dataset includes 21,041 images (600×600 and 720×720) captured via dashboard-mounted smartphones (~25 mph) and hand-annotated by researchers from IIT Roorkee and the University of Tokyo.1 To improve geographic generalization, the dataset spans Japan (10,506), India (7,706), and the Czech Republic (2,829).
Although eight road-distress classes exist per the Japanese Maintenance Guidebooks,2 the competition focuses on the four most frequent: Longitudinal cracks (D00), Lateral cracks (D10), Alligator cracks (D20), and Potholes (D40).
Methodology
We compare two-stage and one-stage detectors on this task.
- Faster R-CNN (two-stage): Region proposals → RoI classification + box regression.3 It improved mAP and speed vs. earlier R-CNN variants but still has higher inference costs than modern one-stage methods.4
- YOLO (one-stage): Predicts boxes directly over a grid and uses non-maximum suppression (NMS) to finalize detections—optimized for speed.5 The latest YOLOv5 implementations can reach single-digit millisecond inference with GPUs.6
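The NMS step mentioned above can be sketched in a few lines. This is a minimal greedy, IoU-based version for illustration only; the actual YOLOv5 implementation runs per class on GPU tensors:

```python
# Minimal sketch of greedy non-maximum suppression (NMS), the step YOLO
# uses to collapse overlapping candidate boxes into final detections.
# Boxes are (x1, y1, x2, y2) tuples; the threshold value is illustrative.

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Keep the highest-scoring box; drop others overlapping it above iou_thresh."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep
```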
We trained YOLOv5-x (142M params) and YOLOv5-l (77M), plus Faster R-CNN. The YOLOv5 variants outperformed Faster R-CNN in both F1 and inference time, so we adopted YOLOv5 as our base architecture.
Ensembling (EM) and Test-Time Augmentation (TTA)
To boost accuracy:
- Model Ensembling (EM): Average predictions from multiple YOLOv5 models trained with different hyperparameters (batch size, optimizer, LR). Ensembling reduces variance and improves generalization, at the cost of longer inference and less interpretability.7
- Test-Time Augmentation (TTA): Run inference on flips and scaled versions (1.30×, 0.83×, 0.67×) of the same image, then merge via NMS. This further reduces generalization error.
We also combine EM + TTA: for each augmented view, run all ensemble models; aggregate and NMS to finalize boxes.
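The EM + TTA combination amounts to a nested loop: every ensemble member scores every augmented view, and detections are mapped back to the original frame before merging. The sketch below shows that pooling structure under simplifying assumptions (axis-aligned boxes, a fixed image width, callable model stubs); the final NMS merge is omitted for brevity:

```python
# Sketch of combining model ensembling (EM) with test-time augmentation (TTA):
# each ensemble member scores each augmented view, detections are mapped back
# to original-image coordinates, and the pooled set is then merged (via NMS,
# not shown). The scale factors mirror those in the text; the rest is assumed.

SCALES = [1.0, 1.30, 0.83, 0.67]  # original plus the TTA scales from the text
IMG_W = 600                        # assumed image width for un-flipping boxes

def unflip(box):
    """Map a box from a horizontally flipped image back to the original frame."""
    x1, y1, x2, y2 = box
    return (IMG_W - x2, y1, IMG_W - x1, y2)

def unscale(box, s):
    """Map a box from a resized image back to original-image coordinates."""
    return tuple(v / s for v in box)

def ensemble_tta(models, image):
    """Pool detections from every (model, augmented view) pair.

    Each model is a callable: model(image, flipped, scale) -> [(box, score), ...].
    The pooled list would be finalized with a single NMS pass in practice.
    """
    pooled = []
    for model in models:
        for s in SCALES:
            for flipped in (False, True):
                for box, score in model(image, flipped, s):
                    if flipped:
                        box = unflip(box)
                    pooled.append((unscale(box, s), score))
    return pooled
```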
Training Setup
We split the GRDC training set into 98% train / 2% validation (20,621 / 420 images).
YOLOv5 used its standard augmentation pipeline (flip, saturation, hue). Faster R-CNN trained on the unaltered set.
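A deterministic version of that 98/2 split can be written in a few lines; the seed and path layout here are assumptions, but the resulting counts match the 20,621 / 420 figures above:

```python
# Minimal sketch of a reproducible 98% / 2% train/validation split.
import random

def split_dataset(image_paths, val_frac=0.02, seed=42):
    """Shuffle deterministically, then carve off a small validation set."""
    paths = sorted(image_paths)          # sort first so the split is reproducible
    random.Random(seed).shuffle(paths)
    n_val = int(len(paths) * val_frac)   # 2% of 21,041 images -> 420
    return paths[n_val:], paths[:n_val]  # (train, val)
```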
Results
Per GRDC rules, submissions were evaluated on two held-out test sets (test1: 2,631 images; test2: 2,664 images).8
- Baseline training (default hyperparameters): YOLOv5-x F1 0.52; YOLOv5-l F1 0.52; Faster R-CNN F1 0.50.
- Hyperparameter tuning highlights: YOLOv5 performed best with batch sizes 8–32 and SGD with Nesterov momentum; Faster R-CNN with batch sizes 8–16 and SGD.
- Ensemble of six YOLOv5 models (x and l variants; batch sizes 32/16/8; 150 epochs): F1 0.57, while meeting the ≤0.5 s/image target (typically 0.21–0.40 s, at most ~0.42 s per image on GPU).
- Adding TTA raised this to F1 0.59.
- A grid search over NMS and confidence (C) thresholds delivered the best score: F1 0.68 with C = 0.25 and NMS = 0.999 (top-5 on the leaderboard).
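That final grid search is a straightforward exhaustive sweep. The sketch below shows the structure; the grid values (other than the winning 0.25 / 0.999 pair) and the `evaluate_f1` callable are hypothetical stand-ins for scoring ensemble predictions against held-out labels:

```python
# Sketch of the confidence / NMS-threshold grid search that produced the
# final F1 = 0.68 operating point. `evaluate_f1` is an assumed callback.
from itertools import product

CONF_GRID = [0.10, 0.15, 0.20, 0.25, 0.30, 0.40]   # illustrative grid values
NMS_GRID = [0.45, 0.60, 0.80, 0.95, 0.999]

def grid_search(evaluate_f1):
    """Return (conf, nms, f1) for the pair that maximizes validation F1."""
    return max(
        ((c, n, evaluate_f1(conf=c, nms=n)) for c, n in product(CONF_GRID, NMS_GRID)),
        key=lambda t: t[2],
    )
```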
System Implementation
A practical deployment path is smartphone-only data capture:
- Real-time or offline: Run on-device or stream to an edge/server for batch scoring.
- Multi-angle capture: Use multiple phones per vehicle (different fields of view) to improve recall via consensus across angles.
- Geo-indexed mapping: Use EXIF GPS coordinates to build road-quality maps at the segment level. Aggregate distress frequency × severity (confidence) into a segment score to prioritize maintenance. Export to a tabular store for DOT analytics across neighborhoods/cities/states.
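The "frequency × severity" segment score above can be sketched directly; segment IDs, detection tuples, and the use of confidence as a severity proxy are all illustrative assumptions here, not a fixed scoring spec:

```python
# Illustrative segment scoring for the geo-indexed mapping idea: aggregate
# distress frequency x severity per road segment, using detection confidence
# as a severity proxy. Input tuples and segment IDs are hypothetical.
from collections import defaultdict

def segment_scores(detections):
    """detections: iterable of (segment_id, distress_class, confidence).

    Returns {segment_id: score}; higher scores suggest higher maintenance priority.
    """
    scores = defaultdict(float)
    for seg, cls, conf in detections:
        scores[seg] += conf  # frequency x severity via summed confidences
    return dict(scores)

def prioritize(detections, top_k=10):
    """Rank segments worst-first to produce a maintenance shortlist."""
    ranked = sorted(segment_scores(detections).items(), key=lambda kv: -kv[1])
    return ranked[:top_k]
```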
Complementary signals: Low-cost smartphone accelerometer data can estimate road roughness (e.g., IRI proxies),9 which—combined with vision-based surface distress—yields a more complete health index.10
Conclusion
Human-only road inspections are expensive and infrequent. A YOLOv5 ensemble + TTA approach provides accurate, fast, and low-cost monitoring from commodity smartphones. We reached F1 0.68 on GRDC test sets (top-5 of 121 teams), while respecting a ≤0.5s per-image budget—supporting real-time use.
Product takeaway for transportation leaders: Start with a phone-based pilot to quantify savings and coverage gains; layer in multi-angle capture and geo-indexed scoring to operationalize maintenance planning. Add roughness sensing where budgets allow to round out the KPI set.
Example Video Predictions
i) Longitudinal Crack Detection
ii) Lateral Crack Detection
iii) Alligator Crack Detection
iv) Pothole Detection
Thanks for reading—check out DeepRoad AI to learn more: https://deeproad.ai/
Arya, D., Maeda, H., Ghosh, S. K., Toshniwal, D., Mraz, A., Kashiyama, T., and Sekimoto, Y., Deep learning-based road damage detection and classification for multiple countries, Automation in Construction, 132, 2021. ↩︎
Japan Road Association, Maintenance Guidebook for Road Pavements, 2013. http://www.road.or.jp/english/publication/index.html (accessed 2021-12-15). ↩︎
Ren, S., He, K., Girshick, R., Sun, J., Faster R-CNN: Towards real-time object detection with region proposal networks, NeurIPS, 2015. ↩︎
Girshick, R., Fast R-CNN, ICCV, 2015. ↩︎
Bochkovskiy, A., Wang, C.-Y., Liao, H.-Y. M., YOLOv4: Optimal speed and accuracy of object detection, arXiv:2004.10934, 2020. ↩︎
Solawetz, J., YOLOv5 is here: State-of-the-art object detection at 140 FPS, 2020. https://blog.roboflow.com/yolov5-is-here/ ↩︎
Dietterich, T. G., Kong, E. B., Machine learning bias, statistical bias, and statistical variance of decision tree algorithms, Citeseer Tech. Rep., 1995. ↩︎
GRDC Organizing Team, Data, https://rdd2020.sekilab.global/data/, 2020. ↩︎
Douangphachanh, V., Oneyama, H., Estimation of road roughness condition from smartphones under realistic settings, 13th Int’l Conf. on ITS Telecommunications (ITST), 2013. ↩︎
Mucka, P., Current approaches to quantify the longitudinal road roughness, Int. Journal of Pavement Engineering, 17(8):659–679, 2016. ↩︎