Task-Specific Zero-shot Quantization-Aware Training for Object Detection

Tsinghua University
ICCV 2025

*Indicates Equal Contribution

Abstract

Quantization is a key technique to reduce network size and computational complexity by representing the network parameters with a lower precision. Traditional quantization methods rely on access to original training data, which is often restricted due to privacy concerns or security challenges. Zero-shot Quantization (ZSQ) addresses this by using synthetic data generated from pre-trained models, eliminating the need for real training data. Recently, ZSQ has been extended to object detection. However, existing methods use unlabeled task-agnostic synthetic images that lack the specific information required for object detection, leading to suboptimal performance. In this paper, we propose a novel task-specific ZSQ framework for object detection networks, which consists of two main stages. First, we introduce a bounding box and category sampling strategy to synthesize a task-specific calibration set from the pre-trained network, reconstructing object locations, sizes, and category distributions without any prior knowledge. Second, we integrate task-specific training into the knowledge distillation process to restore the performance of quantized detection networks. Extensive experiments conducted on the MS-COCO and Pascal VOC datasets demonstrate the efficiency and state-of-the-art performance of our method.

Motivation



We visualize different types of synthetic images and analyze their effects on zero-shot quantization-aware training with the Mask R-CNN object detection network. The results demonstrate that task-specific images (Fig. (d)) enable the extraction of a richer set of features than task-agnostic images (Fig. (c)), whereas task-mismatched images (Fig. (b)) cause even more severe performance degradation. These findings highlight the importance of both a task-specific calibration set and task-specific training for effective zero-shot quantization-aware training of object detection networks.

Contribution

  1. Identify the drawback of task-agnostic calibration: We emphasize the significance of task-specific synthetic images for zero-shot quantization of object detection networks.
  2. Task-specific object detection image synthesis: We propose a bounding box sampling method tailored for object detection to reconstruct object categories, locations, and sizes in synthetic samples without any prior knowledge (a minimal sketch follows this list).
  3. Task-specific quantized network distillation: We integrate task-specific fine-tuning into quantized network distillation, effectively restoring the performance of quantized object detection networks.
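To make the sampling step in contribution 2 concrete, the sketch below assembles a set of synthetic detection targets and then optimizes a random image so that the pre-trained full-precision detector predicts them. The sampling distributions and helper names (e.g. teacher.detection_loss) are illustrative assumptions, not the exact procedure used in the paper.

import torch

def sample_task_specific_targets(num_classes, img_size=640, max_objects=8):
    # Sample how many objects the synthetic image should contain.
    n = torch.randint(1, max_objects + 1, (1,)).item()
    labels = torch.randint(0, num_classes, (n,))              # one category per object
    centers = torch.rand(n, 2) * img_size                     # box centers (cx, cy) in pixels
    sizes = (0.05 + 0.40 * torch.rand(n, 2)) * img_size       # box sizes, 5%-45% of the image
    boxes = torch.cat([centers - sizes / 2, centers + sizes / 2], dim=1)
    boxes = boxes.clamp(0, img_size - 1)                      # keep boxes inside the image
    return {"boxes": boxes, "labels": labels}

def synthesize_calibration_image(teacher, targets, img_size=640, steps=500, lr=0.05):
    # Optimize a random image so the pre-trained full-precision detector
    # predicts the sampled boxes and categories (BN-statistic alignment and
    # other regularizers are omitted for brevity).
    img = torch.randn(1, 3, img_size, img_size, requires_grad=True)
    opt = torch.optim.Adam([img], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = teacher.detection_loss(img, [targets])          # assumed detector API
        loss.backward()
        opt.step()
    return img.detach()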

Overall Structure



We construct a task-specific condensed calibration set in Stage 1, followed by quantization-aware training with task-specific distillation in Stage 2.
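A minimal sketch of how Stage 2 might look in PyTorch, assuming the calibration loader yields the synthetic images together with their sampled boxes and labels; forward_features and detection_loss are placeholder names for the detector's feature extractor and training loss, not the paper's actual interface.

import torch
import torch.nn.functional as F

def stage2_qat_distillation(quant_student, fp_teacher, calib_loader,
                            epochs=10, lr=1e-4, task_weight=1.0):
    # Stage 2: distill the quantized detector from the full-precision teacher
    # on the synthetic calibration set, while also training it against the
    # sampled task-specific targets.
    opt = torch.optim.SGD(quant_student.parameters(), lr=lr, momentum=0.9)
    fp_teacher.eval()
    for _ in range(epochs):
        for images, targets in calib_loader:                   # synthetic images + sampled labels
            with torch.no_grad():
                t_feats = fp_teacher.forward_features(images)  # assumed feature hook
            s_feats = quant_student.forward_features(images)
            distill_loss = sum(F.mse_loss(s, t) for s, t in zip(s_feats, t_feats))
            task_loss = quant_student.detection_loss(images, targets)  # uses synthetic labels
            loss = distill_loss + task_weight * task_loss
            opt.zero_grad()
            loss.backward()
            opt.step()
    return quant_student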

Comparison with YOLO Networks



We conduct experiments using both the widely adopted YOLOv5 series and the latest YOLOv11 series models, comparing against strong real-data QAT baselines such as LSQ and LSQ+. While LSQ and LSQ+ are trained on 120k real images, our approach requires only 2k ground-truth labels to generate the calibration set. Despite this significant reduction in training data, our method achieves strong performance across various bit-widths and network sizes, while also offering improved training efficiency.
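For reference, LSQ-style baselines learn the quantization step size jointly with the network weights. The sketch below is a simplified LSQ-style fake quantizer (it omits the gradient scaling and step-size initialization of the original method) and is meant only to illustrate the kind of quantizer optimized during quantization-aware training.

import torch
import torch.nn as nn

def round_ste(x):
    # Rounding with a straight-through gradient estimator.
    return (x.round() - x).detach() + x

class LSQFakeQuantize(nn.Module):
    # Minimal LSQ-style fake quantizer with a learnable step size.
    def __init__(self, bits=4, signed=True):
        super().__init__()
        self.qn = -(2 ** (bits - 1)) if signed else 0
        self.qp = 2 ** (bits - 1) - 1 if signed else 2 ** bits - 1
        self.step = nn.Parameter(torch.tensor(1.0))   # learned during QAT

    def forward(self, x):
        q = torch.clamp(x / self.step, self.qn, self.qp)
        return round_ste(q) * self.step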

Comparison with Mask R-CNN Networks


CNN Mask R-CNN | ViT Mask R-CNN

We also conduct experiments using Mask R-CNN with both CNN and Swin-T/S backbones. The strong performance across these setups highlights the generalizability of our method to different object detection architectures.

Comparison with Task-Agnostic Methods



We demonstrate that incorporating a task-specific loss enhances detection performance across various models, including YOLO11-s/m and Swin-T/S. This improvement stems from two key aspects:
  1. Task-specific training loss enriches the image generation process by embedding detailed information—such as bounding box categories and coordinates—leading to a data distribution that more closely mimics real-world images.
  2. During quantization-aware training, it allows the model to learn directly from labels, thereby improving its ability to extract and interpret meaningful visual information.

BibTeX

@misc{li2025taskspecificzeroshotquantizationawaretraining,
  title={Task-Specific Zero-shot Quantization-Aware Training for Object Detection},
  author={Changhao Li and Xinrui Chen and Ji Wang and Kang Zhao and Jianfei Chen},
  year={2025},
  eprint={2507.16782},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2507.16782}
}