Cite Pretrain: Retrieval-Free Knowledge Attribution for Large Language Models
Published in Under Review, 2025
We introduce CitePretrainBench, the first benchmark that tests whether an LLM can trace its answers back to the passages it saw during continual pre-training. We also propose Active Indexing, a bidirectional fact-identifier augmentation that explicitly teaches citation during pre-training; with it, LLMs achieve noticeable improvements over baselines.
Recommended citation: Yukun Huang, Sanxing Chen, Jian Pei, Manzil Zaheer, Bhuwan Dhingra. (2025). "Cite Pretrain: Retrieval-Free Knowledge Attribution for Large Language Models." Under Review.
Download Paper
