Task-Specific Zero-shot Quantization-Aware Training for Object Detection

Tsinghua University
ICCV 2025

*Indicates Equal Contribution

Abstract

Quantization is a key technique to reduce network size and computational complexity by representing the network parameters with a lower precision. Traditional quantization methods rely on access to original training data, which is often restricted due to privacy concerns or security challenges. Zero-shot Quantization (ZSQ) addresses this by using synthetic data generated from pre-trained models, eliminating the need for real training data. Recently, ZSQ has been extended to object detection. However, existing methods use unlabeled task-agnostic synthetic images that lack the specific information required for object detection, leading to suboptimal performance. In this paper, we propose a novel task-specific ZSQ framework for object detection networks, which consists of two main stages. First, we introduce a bounding box and category sampling strategy to synthesize a task-specific calibration set from the pre-trained network, reconstructing object locations, sizes, and category distributions without any prior knowledge. Second, we integrate task-specific training into the knowledge distillation process to restore the performance of quantized detection networks. Extensive experiments conducted on the MS-COCO and Pascal VOC datasets demonstrate the efficiency and state-of-the-art performance of our method.

Motivation



We visualize different types of synthetic images and analyze their effects on zero-shot quantization-aware training with the Mask R-CNN object detection network. The results demonstrate that task-specific images (Fig. (d)) enable the extraction of a richer set of features than task-agnostic images (Fig. (c)), whereas task-mismatched images (Fig. (b)) cause even more severe performance degradation. These findings highlight the importance of both a task-specific calibration set and task-specific training for effective zero-shot quantization-aware training of object detection networks.

Contribution

  1. Identify the drawback of task-agnostic calibration: We emphasize the significance of task-specific synthetic images for zero-shot quantization of object detection networks.
  2. Task-specific object detection image synthesis: We propose a bounding box sampling method tailored for object detection to reconstruct object categories, locations, and sizes in synthetic samples without any prior knowledge (a minimal sketch follows this list).
  3. Task-specific quantized network distillation: We integrate task-specific fine-tuning into quantized network distillation, effectively restoring the performance of quantized object detection networks.
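To make the sampling step in contribution 2 concrete, the sketch below assembles a set of synthetic detection targets and then optimizes a random image so that the pre-trained full-precision detector predicts them. The sampling distributions and helper names (e.g. teacher.detection_loss) are illustrative assumptions, not the exact procedure used in the paper.

import torch

def sample_task_specific_targets(num_classes, img_size=640, max_objects=8):
    # Sample how many objects the synthetic image should contain.
    n = torch.randint(1, max_objects + 1, (1,)).item()
    labels = torch.randint(0, num_classes, (n,))              # one category per object
    centers = torch.rand(n, 2) * img_size                     # box centers (cx, cy) in pixels
    sizes = (0.05 + 0.40 * torch.rand(n, 2)) * img_size       # box sizes, 5%-45% of the image
    boxes = torch.cat([centers - sizes / 2, centers + sizes / 2], dim=1)
    boxes = boxes.clamp(0, img_size - 1)                      # keep boxes inside the image
    return {"boxes": boxes, "labels": labels}

def synthesize_calibration_image(teacher, targets, img_size=640, steps=500, lr=0.05):
    # Optimize a random image so the pre-trained full-precision detector
    # predicts the sampled boxes and categories (BN-statistic alignment and
    # other regularizers are omitted for brevity).
    img = torch.randn(1, 3, img_size, img_size, requires_grad=True)
    opt = torch.optim.Adam([img], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = teacher.detection_loss(img, [targets])          # assumed detector API
        loss.backward()
        opt.step()
    return img.detach()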

Overall Structure



We construct a task-specific condensed calibration set in Stage 1, followed by quantization-aware training with task-specific distillation in Stage 2.
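A minimal sketch of how Stage 2 might look in PyTorch, assuming the calibration loader yields the synthetic images together with their sampled boxes and labels; forward_features and detection_loss are placeholder names for the detector's feature extractor and training loss, not the paper's actual interface.

import torch
import torch.nn.functional as F

def stage2_qat_distillation(quant_student, fp_teacher, calib_loader,
                            epochs=10, lr=1e-4, task_weight=1.0):
    # Stage 2: distill the quantized detector from the full-precision teacher
    # on the synthetic calibration set, while also training it against the
    # sampled task-specific targets.
    opt = torch.optim.SGD(quant_student.parameters(), lr=lr, momentum=0.9)
    fp_teacher.eval()
    for _ in range(epochs):
        for images, targets in calib_loader:                   # synthetic images + sampled labels
            with torch.no_grad():
                t_feats = fp_teacher.forward_features(images)  # assumed feature hook
            s_feats = quant_student.forward_features(images)
            distill_loss = sum(F.mse_loss(s, t) for s, t in zip(s_feats, t_feats))
            task_loss = quant_student.detection_loss(images, targets)  # uses synthetic labels
            loss = distill_loss + task_weight * task_loss
            opt.zero_grad()
            loss.backward()
            opt.step()
    return quant_student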

Comparison with YOLO Networks



We conduct experiments using both the widely adopted YOLOv5 series and the latest YOLOv11 series models, comparing against strong real-data QAT baselines such as LSQ and LSQ+. While LSQ and LSQ+ are trained on 120k real images, our approach requires only 2k ground-truth labels to generate the calibration set. Despite this significant reduction in training data, our method achieves strong performance across various bit-widths and network sizes, while also offering improved training efficiency.
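For reference, LSQ-style baselines learn the quantization step size jointly with the network weights. The sketch below is a simplified LSQ-style fake quantizer (it omits the gradient scaling and step-size initialization of the original method) and is meant only to illustrate the kind of quantizer optimized during quantization-aware training.

import torch
import torch.nn as nn

def round_ste(x):
    # Rounding with a straight-through gradient estimator.
    return (x.round() - x).detach() + x

class LSQFakeQuantize(nn.Module):
    # Minimal LSQ-style fake quantizer with a learnable step size.
    def __init__(self, bits=4, signed=True):
        super().__init__()
        self.qn = -(2 ** (bits - 1)) if signed else 0
        self.qp = 2 ** (bits - 1) - 1 if signed else 2 ** bits - 1
        self.step = nn.Parameter(torch.tensor(1.0))   # learned during QAT

    def forward(self, x):
        q = torch.clamp(x / self.step, self.qn, self.qp)
        return round_ste(q) * self.step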

Comparison with Mask R-CNN Networks


CNN Mask R-CNN | ViT Mask R-CNN

We also conduct experiments using Mask R-CNN with both CNN and Swin-T/S backbones. The strong performance across these setups highlights the generalizability of our method to different object detection architectures.

Comparison with Task-Agnostic Methods



We demonstrate that incorporating a task-specific loss enhances detection performance across various models, including YOLO11-s/m and Swin-T/S. This improvement stems from two key aspects:
  1. Task-specific training loss enriches the image generation process by embedding detailed information—such as bounding box categories and coordinates—leading to a data distribution that more closely mimics real-world images.
  2. During quantization-aware training, it allows the model to learn directly from labels, thereby improving its ability to extract and interpret meaningful visual information.

BibTeX

@misc{li2025taskspecificzeroshotquantizationawaretraining,
  title={Task-Specific Zero-shot Quantization-Aware Training for Object Detection},
  author={Changhao Li and Xinrui Chen and Ji Wang and Kang Zhao and Jianfei Chen},
  year={2025},
  eprint={2507.16782},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2507.16782}
}