Abstract
Researchers develop a VerbatimRAG-based extractive question answering system for research papers, using a novel ground truth dataset and a ModernBERT model to retrieve information more accurately.
Academic researchers need efficient and reliable methods for collecting high-quality information from trusted sources, but modern tools for AI-assisted research still suffer from the tendency of Large Language Models (LLMs) to produce factually inaccurate or nonsensical output, commonly referred to as hallucinations. We apply the extractive question answering system VerbatimRAG to research papers in the ACL Anthology, directly mapping user queries to verbatim text spans in retrieved documents. We contribute a novel ground truth dataset for the task of mapping user queries to relevant text spans in research papers, and use it to train and evaluate a variety of extractive models. Human annotation is performed by NLP researchers and is based on synthetic user queries generated using a custom pipeline based on the ScIRGen methodology, paired with chunks of research papers retrieved by VerbatimRAG. On this benchmark, a 150M-parameter ModernBERT token classifier trained on silver supervision from our pipeline achieves the best word-level F1 (53.6), ahead of the strongest evaluated LLM extractor (48.7).
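The word-level F1 reported above can be illustrated with a minimal sketch. It assumes that predicted and gold spans are each reduced to a set of word positions within the retrieved chunk, and that F1 is the harmonic mean of precision and recall over those positions. The function name `word_f1`, the handling of empty sets, and the per-example scoring are assumptions for illustration; the exact metric definition and averaging scheme used in the paper are not given here.

```python
from typing import Set


def word_f1(predicted: Set[int], gold: Set[int]) -> float:
    """Word-level F1 between two sets of word positions in a chunk.

    Assumption: an example where both sets are empty counts as a perfect
    match; the paper may treat this case differently.
    """
    if not predicted and not gold:
        return 1.0
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)  # share of predicted words that are gold
    recall = overlap / len(gold)          # share of gold words that were predicted
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the extractor marks words 5-8, the annotators marked 6-10.
example_score = word_f1({5, 6, 7, 8}, {6, 7, 8, 9, 10})
```

In this hypothetical example three words overlap, so precision is 3/4, recall is 3/5, and the F1 is 2/3.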
Community
Today we are releasing a new family of lightweight SOTA extractive models for grounded RAG.
Two 150M-parameter ModernBERT span extractors trained as token-classifiers. They beat public extractive baselines (Zilliz Semantic Highlight, Provence) across ACL, RAGBench, Squeez, and QASPER, and outperform LLM-based extractors 100x their size on our ACL-Verbatim benchmark.
Given a query and a retrieved chunk, the extractor returns the exact text spans that support the answer.
Rather than generating an answer with an LLM, you get verbatim evidence directly from the source: paragraphs, table captions, code blocks, or other relevant text.
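As a rough sketch of how a token classifier can yield verbatim spans, the snippet below assumes the model has already produced one relevance score per token, together with each token's character offsets in the chunk. Consecutive tokens that score at or above a threshold are merged, and the result is sliced directly out of the source text, so nothing is generated. The helper names `word_offsets` and `extract_spans`, the whitespace tokenization, and the 0.5 threshold are illustrative assumptions, not the released models' actual interface.

```python
import re
from typing import List, Tuple


def word_offsets(text: str) -> List[Tuple[int, int]]:
    """Character (start, end) offsets of whitespace-separated words.

    Assumption: stands in for the offsets a real subword tokenizer would give.
    """
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def extract_spans(
    chunk: str,
    offsets: List[Tuple[int, int]],
    scores: List[float],
    threshold: float = 0.5,
) -> List[str]:
    """Merge runs of tokens scoring >= threshold into verbatim text spans."""
    spans: List[Tuple[int, int]] = []
    start = end = None
    for (tok_start, tok_end), score in zip(offsets, scores):
        if score >= threshold:
            if start is None:
                start = tok_start  # open a new span at this token
            end = tok_end          # extend the open span to this token
        elif start is not None:
            spans.append((start, end))  # a low-scoring token closes the span
            start = end = None
    if start is not None:
        spans.append((start, end))
    # Slice the original chunk so every returned span is an exact substring.
    return [chunk[s:e] for s, e in spans]


# Hypothetical chunk and per-word scores, made up for illustration.
chunk = "We train on silver labels. The model has 150M parameters. Results follow."
scores = [0.1, 0.2, 0.1, 0.3, 0.2, 0.9, 0.8, 0.9, 0.95, 0.9, 0.1, 0.2]
evidence = extract_spans(chunk, word_offsets(chunk), scores)
```

Because the spans are substrings of the retrieved chunk, each one can be traced back to its exact position in the source document.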
Get this paper in your agent:
hf papers read 2605.21102
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash