COPSD | Constitutional On-Policy Safe Distillation

Abstract

Why COPSD?

Augmented OPSD within Safety Boundaries eliminates geometric leakage to prevent response and entropy collapse.

Comprehensive 12-Benchmark Evaluation shows our 4B student model surpassing the 235B Oracle's safety performance.

Native VERL & vLLM Implementation provides a high-throughput, cluster-ready production pipeline with accelerated token rollouts.

Anti-Over-Refusal & Context-Aware Data mitigates hyper-conservatism through open-sourced multimodal splits that cover subtle real-world scenarios.

Method

Two Stages

Cross-SFT first calibrates the teacher. Then constitution-conditioned OPSD distills safer behavior without collapsing expressiveness.

Stage 1. Cross-SFT cold-start.
Stage 2. Constitution-conditioned on-policy distillation.
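
To make Stage 2 concrete, here is a minimal sketch of one common on-policy distillation objective. This page does not state the exact COPSD loss, so everything below is an assumption for illustration: the student samples its own response, a teacher scores the same tokens with the constitution added to its context, and the student minimizes a per-token reverse KL, KL(student || teacher), over response tokens only. The function names (`reverse_kl_per_token`, `on_policy_distill_loss`) and the toy shapes are hypothetical, and the random logits stand in for real model outputs.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def reverse_kl_per_token(student_logits, teacher_logits):
    # KL(student || teacher) at each position; inputs are (T, V), output is (T,).
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    return (np.exp(s) * (s - t)).sum(axis=-1)

def on_policy_distill_loss(student_logits, teacher_logits, response_mask):
    # Average the per-token KL over response positions only (mask == 1).
    # Assumption: both logit arrays are computed on the SAME student-sampled
    # tokens; the teacher additionally sees the constitution in its context.
    kl = reverse_kl_per_token(student_logits, teacher_logits)
    return float((kl * response_mask).sum() / max(response_mask.sum(), 1))

# Toy stand-ins for model outputs: 5 positions, vocabulary of 8 tokens.
rng = np.random.default_rng(0)
T, V = 5, 8
student_logits = rng.normal(size=(T, V))
teacher_logits = rng.normal(size=(T, V))
response_mask = np.array([0, 0, 1, 1, 1], dtype=float)  # first 2 tokens are prompt

loss = on_policy_distill_loss(student_logits, teacher_logits, response_mask)
print(loss)
```

Reverse KL is computed on the student's own samples, which is what makes the signal on-policy. Whether COPSD uses this divergence, a different one, or extra terms is not specified here.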

Results

Less Text, More Signal

Pareto Improvement

COPSD moves up and to the right on the safety-versus-helpfulness plot, improving both at once.

Safety versus helpfulness results for COPSD

Student Beats Teacher

Student_C surpasses Teacher_C and OPD.

Teacher-student comparison under COPSD and OPD

Ablation

Teacher tuning matters.

Safety Table

Main numbers.

Citation

BibTeX

@article{wen2026copsd,
  title        = {Constitutional On-Policy Safe Distillation},
  author       = {Ming Wen and Yuxuan Liu and Kun Yang and Yunhao Feng and Zhuoer Xu and Yuhao Sun and Shiwen Cui and Xiang Zheng and Xingjun Ma and Yu-Gang Jiang},
  journal      = {arXiv preprint arXiv:2606.03089},
  year         = {2026},
  url          = {https://arxiv.org/abs/2606.03089}
}