arxiv:2606.12451

ToolSense: A Diagnostic Framework for Auditing Parametric Tool Knowledge in LLMs

Published on Jun 4

Submitted by Ashutosh Hathidara on Jun 12

SAP

Abstract

Parametric tool retrieval models perform markedly worse on realistic, ambiguous queries than on standard benchmarks, and some score near-random on factual probes, revealing a dissociation between retrieval performance and actual tool knowledge.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Large language models deployed as agents over large tool catalogs face a critical tool-retrieval bottleneck. Embedding-based retrieval relies on compact encoders that may under-capture specialized tool semantics. Parametric tool retrieval addresses this by encoding each tool as a virtual token appended to the LLM vocabulary and fine-tuning in two stages (memorization, then retrieval SFT) so that the LLM itself acts as the retriever, achieving strong performance on standard ToolBench retrieval benchmarks. Yet these benchmarks use verbose, fully-specified queries, and their evaluation applies constrained decoding that restricts outputs to valid token paths; neither property reveals whether the model actually understands its tools. We introduce ToolSense, an open-source LLM-powered diagnostic framework that takes any tool catalog as input and automatically generates three benchmarks: a Realistic Retrieval Benchmark (RRB) with queries at three ambiguity tiers, an MCQ probing benchmark, and a QA probing benchmark. Applying ToolSense to ToolBench (~47k tools) and evaluating five parametric model training configurations reveals a knowledge-retrieval dissociation. On RRB queries, several configurations collapse by ~50-64 percentage points compared to fully-specified ToolBench benchmarks, falling below the embedding-model baseline. Conversely, some models with strong retrieval performance score near-random on factual probes. We open-source the ToolSense framework and the ToolBench diagnostic benchmarks at https://github.com/SAP/toolsense.
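To illustrate why constrained decoding can mask a lack of tool understanding, here is a minimal sketch in Python. It is not the paper's implementation: the two-level tool paths, the token names, and the scores are all hypothetical, and a real system would mask logits over virtual tool tokens inside the LLM's decoder. The point is only that a trie mask forces the output onto a valid tool path even when the model's preferred tokens are not tool tokens at all.

```python
from typing import Dict, List, Optional, Sequence

END = "<end>"  # marker for a complete tool path (hypothetical convention)


def build_trie(paths: Sequence[Sequence[str]]) -> dict:
    """Build a nested-dict trie from valid tool token paths."""
    root: dict = {}
    for path in paths:
        node = root
        for tok in path:
            node = node.setdefault(tok, {})
        node[END] = {}
    return root


def greedy_decode(
    step_scores: Sequence[Dict[str, float]],
    trie: Optional[dict] = None,
) -> List[str]:
    """Greedy decoding; if a trie is given, only tokens on a valid path are allowed."""
    node = trie
    output: List[str] = []
    for scores in step_scores:
        if node is not None:
            allowed = [t for t in node if t != END]
            if not allowed:  # reached the end of a valid path
                break
            candidates = {t: scores.get(t, float("-inf")) for t in allowed}
        else:
            candidates = dict(scores)
        best = max(candidates, key=candidates.get)
        output.append(best)
        if node is not None:
            node = node[best]
    return output


# Hypothetical catalog: each tool is a (category, tool) token path.
TOOL_PATHS = [
    ["weather", "get_forecast"],
    ["weather", "get_alerts"],
    ["finance", "get_quote"],
]

# Toy scores: the model prefers non-tool tokens at both steps.
STEP_SCORES = [
    {"the": 2.0, "weather": 0.1, "finance": 0.05},
    {"is": 1.5, "get_forecast": 0.2, "get_alerts": 0.1, "get_quote": 0.3},
]

unconstrained = greedy_decode(STEP_SCORES)
constrained = greedy_decode(STEP_SCORES, build_trie(TOOL_PATHS))
print(unconstrained)
print(constrained)
```

In this toy case the unconstrained decode emits no tool at all, while the constrained decode always returns a well-formed tool path. A benchmark that scores only the constrained output therefore cannot distinguish a model that knows its tools from one that is rescued by the mask.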


Community

ashutosh1919

Paper submitter about 14 hours ago

Parametric tool retrieval trains LLMs to act as their own retrievers by encoding tools as virtual tokens, achieving >90% recall on standard benchmarks. However, these benchmarks rely on verbose, fully-specified queries and constrained trie decoding, which makes it impossible to tell whether the model truly understands its tools or is simply pattern-matching.

We introduce ToolSense, an open-source diagnostic framework that automatically generates three benchmarks from any tool catalog: a Realistic Retrieval Benchmark (RRB) with user-style queries at three ambiguity levels, an MCQ factual probe, and a QA inferential probe. Applying ToolSense to ToolBench (~47k tools) reveals a striking knowledge-retrieval dissociation: top parametric configurations collapse by 50–64 percentage points on realistic queries, falling below dense embedding baselines. Factual probing further shows that Stage 2 retrieval fine-tuning systematically erases the tool knowledge acquired during Stage 1 memorization. The best mitigation we found is combining LoRA with multi-format memorization.
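As a rough illustration of how a gap "in percentage points" between two benchmarks can be computed, the sketch below evaluates recall@k on two query sets and reports the difference. It is an assumption-laden example: the micro-averaged recall@k definition, the function names, and all query and tool IDs are made up for illustration and do not come from ToolSense or ToolBench.

```python
from typing import Dict, List, Set


def recall_at_k(
    ranked: Dict[str, List[str]],
    gold: Dict[str, Set[str]],
    k: int,
) -> float:
    """Micro-averaged recall@k: fraction of gold tools found in the top-k lists."""
    hits = 0
    total = 0
    for query_id, gold_tools in gold.items():
        top_k = set(ranked.get(query_id, [])[:k])
        hits += len(top_k & gold_tools)
        total += len(gold_tools)
    return hits / total if total else 0.0


def percentage_point_drop(before: float, after: float) -> float:
    """Absolute difference between two rates, expressed in percentage points."""
    return (before - after) * 100.0


# Toy gold labels shared by both query styles (hypothetical IDs).
GOLD = {"q1": {"t1"}, "q2": {"t4", "t5"}}

# Toy rankings for verbose, fully-specified queries.
RANKED_VERBOSE = {"q1": ["t1", "t2", "t3"], "q2": ["t4", "t5", "t6"]}

# Toy rankings for short, ambiguous rewrites of the same queries.
RANKED_AMBIGUOUS = {"q1": ["t9", "t1", "t3"], "q2": ["t7", "t8", "t4"]}

recall_verbose = recall_at_k(RANKED_VERBOSE, GOLD, k=2)
recall_ambiguous = recall_at_k(RANKED_AMBIGUOUS, GOLD, k=2)
drop_pp = percentage_point_drop(recall_verbose, recall_ambiguous)
print(f"verbose={recall_verbose:.3f} ambiguous={recall_ambiguous:.3f} drop={drop_pp:.1f} pp")
```

A percentage-point drop is an absolute difference between two rates, not a relative change, so a fall from 90% to 30% recall is a 60-point drop regardless of the starting value.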



