Abstract:
To address the challenge of detecting over-limit (oversize) vehicles in traffic, a vehicle 3D dimension estimation method integrating binocular vision and multi-stage features was proposed. First, the YOLOv8 network was adopted to detect vehicles and extract regions of interest, and the ResNet18 network was used to extract deep semantic features; the HRNet network was then employed to detect predefined vehicle key points. Constrained by the vehicle's physical dimensions, the intrinsic and extrinsic parameters of the binocular camera were iteratively optimized via gradient descent, and the initial 3D dimensions of the vehicle were computed from the detected key points using the triangulation principle. Finally, the deep features and the initial dimensions were concatenated into multi-modal features and fed into the designed binocular feature fusion multi-layer perceptron (BFMLP) for regression, yielding accurate vehicle length, width, and height. Comparative experiments, ablation studies, and typical case analyses were conducted in real traffic scenarios. The results show that the mean relative error (MRE) of the estimated 3D dimensions is 0.04, significantly outperforming traditional end-to-end regression methods and demonstrating the effectiveness of jointly optimizing multi-modal feature fusion and geometric constraints. The method exhibits stable measurement accuracy and engineering feasibility in controlled traffic scenes, achieving high-precision, real-time vehicle 3D dimension estimation based on deep learning and providing effective technical support for intelligent transportation management.
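The triangulation step mentioned above can be illustrated with a minimal sketch for a rectified stereo pair: depth follows from disparity as Z = f·B/d, and a vehicle dimension is the distance between two triangulated key points. The function names and calibration values below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def triangulate(pts_left, pts_right, f, cx, cy, baseline):
    """Recover 3D points (camera frame, metres) from matched key points
    in a rectified stereo pair, using the standard relations
        Z = f * B / d,  X = (u - cx) * Z / f,  Y = (v - cy) * Z / f.
    pts_left / pts_right: (N, 2) arrays of (u, v) pixel coordinates.
    """
    pts_left = np.asarray(pts_left, dtype=float)
    pts_right = np.asarray(pts_right, dtype=float)
    d = pts_left[:, 0] - pts_right[:, 0]      # disparity in pixels
    Z = f * baseline / d                      # depth from disparity
    X = (pts_left[:, 0] - cx) * Z / f
    Y = (pts_left[:, 1] - cy) * Z / f
    return np.stack([X, Y, Z], axis=1)

def dimension(p3d, i, j):
    """Euclidean distance between two triangulated key points,
    e.g. front and rear bumper points for vehicle length."""
    return float(np.linalg.norm(p3d[i] - p3d[j]))

# Synthetic example (assumed calibration: f = 700 px, baseline = 0.5 m):
# two points 4 m apart at 10 m depth project to these pixel coordinates.
p3d = triangulate([[180, 240], [460, 240]],
                  [[145, 240], [425, 240]],
                  f=700, cx=320, cy=240, baseline=0.5)
print(dimension(p3d, 0, 1))   # recovered separation in metres
```

In the paper's pipeline these initial geometric estimates are not the final output; they are concatenated with the deep features and refined by the BFMLP regressor.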