Nexus Tool Selection Engine: Database-Driven, Self-Improving Tool Intelligence for Large-Scale AI Orchestration Platforms
A comprehensive research paper presenting the Nexus Tool Selection Engine (TSE), a database-driven, self-improving tool intelligence system that replaces hardcoded tool filtering with a five-stage pipeline achieving 95.6% token savings.
Adverant Research Team Adverant Limited, Dublin, Ireland
Abstract
Large language model (LLM) orchestration platforms face a fundamental scaling challenge: as tool catalogs grow beyond several dozen entries, the quality of tool selection by the underlying model degrades sharply, and certain providers impose hard schema limits that cause outright request failures. We present the Nexus Tool Selection Engine (TSE), a database-driven, self-improving tool intelligence system deployed within the Adverant Nexus platform, a 44-microservice AI orchestration stack serving the NexusROS Revenue Operating System and its ecosystem of marketplace plugins. The Nexus platform maintains a catalog of 158 tools spanning six functional categories, of which 80 belong to the NexusROS CRM plugin alone. Prior to the TSE, a hardcoded substring-matching filter (PAGE_TOOL_MAP) covered only 9 of 27 dashboard pages and could not be modified without redeploying the gateway service. This brittleness, combined with Google Gemini's INVALID_ARGUMENT rejection at 80+ tool schemas, motivated a comprehensive rearchitecture.
The TSE implements a five-stage pipeline (policy resolution, page-context filtering, pgvector-based semantic retrieval using a Tool2Vec embedding strategy, Thompson Sampling contextual bandit reranking, and pinned tool injection) that reduces the 158-tool corpus to 7–11 tools per LLM call. This achieves a 95.6% reduction in tool-schema token overhead (from 18,960 to 840 tokens per request at K=7) while targeting retrieval accuracy consistent with published baselines of Recall@3 ≈ 97.1%. The entire system is database-driven across seven PostgreSQL tables with pgvector extensions, enabling administrators to modify tool visibility, page mappings, and category assignments without code changes. A contextual bandit with Thompson Sampling over Beta-distributed priors provides continuous, autonomous improvement: tools that succeed in specific contexts are promoted; those that fail are demoted. An immutable version history, SHA-256 content hashing for change detection, and a plugin self-registration protocol via REST ensure that the TSE scales to arbitrary plugin counts. We describe the mathematical framework including the Marsaglia–Tsang gamma sampler used for Beta distribution sampling, present the database schema, detail the caching architecture, and discuss graceful degradation properties. A companion admin UI exposes 22 REST endpoints for full observability over tool selection behavior, enabling drift detection, retrieval accuracy trending, and tool usage heatmaps.
1. Introduction
The proliferation of tools available to LLM-based agents represents one of the most pressing, and least studied, challenges in production AI orchestration. While the research community has invested heavily in model capabilities, prompt engineering, and reasoning architectures, the mundane question of which tools to show the model has received comparatively little attention. Yet this question is load-bearing: recent benchmarks demonstrate that tool retrieval errors account for approximately 50% of all agent failures at scale [LiveMCPBench, 2025], and Anthropic's own documentation warns that model accuracy degrades significantly when tool counts exceed 30–50 [Anthropic, 2025].
The Adverant Nexus platform confronts this challenge at production scale. Nexus is a 44-microservice AI orchestration system deployed on Kubernetes with Istio service mesh, serving organizations that use LLM-powered agents for revenue operations, knowledge management, code generation, and autonomous task execution. The platform's tool catalog has grown to 158 tools: 80 belonging to the NexusROS Revenue Operating System plugin, 35 to the Forge IDE, 27 to the GraphRAG knowledge engine, 12 to the Skills Engine, and 4 to workflow dispatch and cluster administration. These tools are exposed to LLMs during interactive chat sessions, where a user on a specific dashboard page asks the Revenue Intelligence Analyst to perform tasks ranging from "show me my pipeline forecast" to "create a multi-channel campaign for enterprise prospects."
The problem is not merely theoretical. Google Gemini, one of four supported AI providers (alongside Anthropic Claude, OpenRouter, and Claude Max), throws an INVALID_ARGUMENT error when presented with more than approximately 80 tool schemas in a single request. This hard failure (not a graceful degradation, but an outright rejection) forced the engineering team to implement an emergency filter: a hardcoded TypeScript object called PAGE_TOOL_MAP that maps 9 dashboard page names to arrays of substring patterns used to whittle down the tool list.
The PAGE_TOOL_MAP approach suffers from multiple deficiencies that compound as the platform grows. It covers only 9 of the 27 NexusROS dashboard pages, leaving 18 pages exposed to the full 80-tool payload. It uses brittle substring matching: the string "contact" matches both nexus_ros_list_contacts and nexus_ros_create_contact, but also any future tool containing "contact" in an unrelated context. It cannot be modified by administrators; every change requires a code deployment of the gateway service. It maintains no metrics: there is no record of which tools were selected, whether the LLM called the right tool, or whether the selection improved or degraded over time. It does not version its configuration: there is no audit trail, no rollback capability, and no way to A/B test alternative mappings. And it fundamentally cannot scale: as marketplace plugins introduce their own tools (N plugins × M tools each), a hardcoded TypeScript map becomes untenable.
This paper presents the Nexus Tool Selection Engine (TSE), a comprehensive replacement for hardcoded tool filtering that addresses all of the above limitations. The TSE contributes three interlocking innovations:
1. A database-driven tool registry with version control. All 158 tools are registered in a PostgreSQL table (ros.tool_registry) with rich metadata including category assignments, example queries, execution routing, and SHA-256 content hashes for change detection. Page-to-category mappings, visibility rules, and tool configurations are stored in relational tables that administrators can modify through a REST API and admin dashboard, without code deployments. An immutable version history (ros.tool_versions) captures every change with full JSON snapshots and diff annotations.
2. Semantic retrieval via Tool2Vec embeddings and pgvector. Each tool is embedded as a 384-dimensional vector using the all-MiniLM-L6-v2 model, with a Tool2Vec strategy that averages embeddings of example queries rather than relying solely on tool descriptions. User queries are embedded in real time and matched against tool vectors using pgvector's cosine similarity operator, with HNSW indexing for approximate nearest-neighbor search. This replaces substring matching with semantic understanding: a query like "show me top accounts" retrieves company-related tools even if the word "company" does not appear in the query.
3. Thompson Sampling contextual bandit for self-improving selection. Each tool maintains per-context Beta distribution parameters (alpha, beta) that encode its historical success rate. At selection time, the engine samples from each tool's Beta distribution using a Marsaglia–Tsang gamma sampler and selects the top-scoring tools, with one slot reserved for exploration. Outcome feedback (whether the LLM actually called the selected tool and whether the call succeeded) updates the Beta parameters, creating a closed-loop system that autonomously converges toward optimal tool selection for each page context and organizational tier.
The TSE is deployed as a TypeScript service within the Nexus API Gateway, adding fewer than 200 milliseconds to the chat pipeline's critical path (target P95). It operates alongside the existing dispatch chain (the WorkflowJobDispatcher → nexus-workflows → Skills Engine → AI Provider Router pipeline that governs how tools execute) without modifying it. The TSE decides which tools the LLM sees; the dispatch chain decides how those tools run. This separation of concerns means that the TSE can be adopted incrementally: the chat orchestrator checks whether the ToolSelectionEngine is initialized, falls back to the legacy PAGE_TOOL_MAP if it is not, and logs the selection method either way.
The remainder of this paper is organized as follows. Section 2 surveys related work on tool selection, semantic retrieval, contextual bandits, and tool registry patterns. Section 3 formalizes the tool selection problem and analyzes the failure modes of the hardcoded approach. Section 4 describes the system architecture including the database schema, caching layers, and integration with the chat orchestrator. Section 5 details the five-stage pipeline with algorithmic descriptions of each stage. Section 6 presents the mathematical framework covering Thompson Sampling, the MarsagliaβTsang gamma sampler, Tool2Vec embeddings, and retrieval metrics. Section 7 evaluates the system on token efficiency, projected retrieval accuracy, bandit convergence, and pipeline latency. Section 8 describes the admin UI and observability features. Section 9 presents the plugin self-registration protocol. Section 10 discusses design decisions, limitations, and future work. Section 11 concludes.
2. Background and Related Work
2.1 Tool Selection in LLM Agents
The tool-use paradigm β where an LLM selects and invokes external functions to accomplish tasks β has become the dominant interaction pattern for production AI agents. OpenAI's function calling [OpenAI, 2023], Anthropic's tool use [Anthropic, 2024], and Google's Gemini API all support schema-based tool invocation, where the model receives JSON schemas describing available tools and returns structured calls.
However, the implicit assumption in most tool-use research is that the tool set is small and curated. Benchmark suites like ToolBench [Qin et al., 2024] test with dozens of tools; production platforms maintain hundreds. The LiveMCPBench benchmark [arXiv:2508.01780] evaluated tool-augmented agents across large Model Context Protocol (MCP) server ecosystems and found that tool retrieval errors (selecting the wrong tool or failing to find the right one) account for approximately 50% of all agent failures at scale. This finding reframes tool selection as a retrieval problem rather than a generation problem.
Anthropic's own documentation provides practical guidance that corroborates these findings, recommending that developers keep tool counts below 30–50 for optimal accuracy and noting significant degradation beyond this threshold [Anthropic, 2025]. The NexusROS scenario (80 nexus_ros_* tools plus 78 platform tools) exceeds this threshold by a factor of three.
The Tool Documentation Enhancement (Tool-DE) study [arXiv:2510.22670] investigated why tool retrieval fails and found that 41.6% of tools in large catalogs lack adequate documentation. Poorly described tools are effectively invisible to embedding-based retrieval systems, creating a systematic bias toward well-documented tools regardless of their relevance. This finding directly motivates the TSE's function_summary, when_to_use, and example_queries metadata fields: structured documentation designed to improve retrieval quality.
2.2 Semantic Tool Retrieval
The transition from keyword-based to semantic tool retrieval mirrors the broader evolution of information retrieval. Several recent approaches are particularly relevant to the TSE design.
Tool2Vec [ACL 2025 Findings] proposes representing tools as dense vectors derived from their usage patterns rather than their descriptions. By embedding example queries associated with each tool and averaging the resulting vectors, Tool2Vec captures the intent space of a tool (what users actually want when they invoke it) rather than its description space. Experiments demonstrated a +30.5 improvement in Recall@K over description-based embeddings. The TSE directly implements this strategy: when a tool has example_queries, we embed each query and average the vectors to produce the tool embedding.
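As a concrete sketch of this averaging strategy (illustrative names only; the production embedder is asynchronous and backed by all-MiniLM-L6-v2, replaced here by a plain function so the averaging logic stands alone):

```typescript
// Sketch of the Tool2Vec strategy: a tool's embedding is the element-wise
// mean of its example-query embeddings, falling back to the description
// embedding when no example queries exist.
type Embedder = (text: string) => number[];

function toolEmbedding(
  embed: Embedder,
  description: string,
  exampleQueries: string[],
): number[] {
  // Prefer the intent space (example queries) over the description space.
  const texts = exampleQueries.length > 0 ? exampleQueries : [description];
  const vectors = texts.map(embed);
  const dim = vectors[0].length;
  const mean = new Array<number>(dim).fill(0);
  for (const v of vectors) {
    for (let i = 0; i < dim; i++) mean[i] += v[i] / vectors.length;
  }
  return mean;
}
```

The resulting vector is what gets stored in the embedding_vector column and compared against query embeddings at retrieval time.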
Toolshed RAG-Tool Fusion [arXiv:2410.14594, ICAART 2025] combines retrieval-augmented generation with tool selection, treating tool schemas as documents in a RAG pipeline. By indexing tool descriptions with dense retrievers and performing fusion retrieval at query time, Toolshed achieved a +46–56% improvement in Recall@5 compared to static tool lists. The TSE's semantic retrieval stage operates on the same principle, using pgvector for dense retrieval over tool embeddings.
Mudunuri et al. [arXiv:2603.20313] conducted a systematic evaluation of vector retrieval for tool selection and reported a 97.1% hit rate at K=3, meaning that the correct tool appeared in the top 3 retrievals in 97.1% of test cases. This result establishes the primary accuracy baseline against which we evaluate the TSE.
ToolScope [arXiv:2510.20036] investigated hybrid retrieval combining keyword matching with semantic similarity, demonstrating a +38.6% accuracy improvement over pure semantic approaches. The TSE's page-context filtering stage performs a similar function, using category-based keyword matching to narrow the candidate set before semantic retrieval operates.
AutoTool [arXiv:2512.13278, ICCV 2025] proposed dynamic tool selection that adapts the tool set to the specific query, achieving +6–8% improvements across benchmarks. The TSE extends this idea with contextual bandits that adapt not just to the query but to the organizational context, page context, and historical outcomes.
2.3 Contextual Bandits for Tool Selection
The multi-armed bandit framework provides a principled approach to the exploration-exploitation tradeoff inherent in tool selection. Each tool can be modeled as an arm with an unknown reward distribution; the challenge is to balance exploiting tools known to work well in a given context against exploring tools that might work better.
Thompson Sampling [Thompson, 1933; arXiv:2512.03065] is a Bayesian approach that maintains a posterior distribution over each arm's reward probability and selects arms by sampling from these posteriors. For Bernoulli rewards (tool call succeeds or fails), the natural posterior is the Beta distribution. Thompson Sampling achieves near-optimal regret bounds and has been shown to improve tool selection accuracy by 15–30% compared to static selection [arXiv:2512.03065].
The Tool Inertia Graph [arXiv:2511.14650] models temporal dependencies in tool selection, observing that certain tools tend to be called in sequences and that caching these sequences reduces selection cost by 65.96%. While the TSE does not explicitly model tool sequences, its cache layer, with a 60-second TTL on tool lists, captures a similar temporal locality.
The TSE implements Thompson Sampling with Beta-distributed priors, using the Marsaglia–Tsang gamma sampler [Marsaglia and Tsang, 2000] for efficient Beta variate generation. We extend the standard formulation with a contextual key, ${tier}:${page}:general, that partitions the bandit statistics by organizational tier and dashboard page, enabling the system to learn different tool preferences for different contexts.
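A minimal sketch of Beta variate generation via the Marsaglia–Tsang gamma sampler, as used for Thompson Sampling (an independent illustration, not the TSE's shipped code):

```typescript
// Standard normal via the Box–Muller transform.
function randNormal(): number {
  let u = 0;
  while (u === 0) u = Math.random(); // avoid log(0)
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * Math.random());
}

// Marsaglia–Tsang (2000): Gamma(alpha, 1) sampling by squeeze-rejection.
function sampleGamma(alpha: number): number {
  if (alpha < 1) {
    // Boost trick: Gamma(a) = Gamma(a + 1) * U^(1/a)
    return sampleGamma(alpha + 1) * Math.pow(Math.random(), 1 / alpha);
  }
  const d = alpha - 1 / 3;
  const c = 1 / Math.sqrt(9 * d);
  for (;;) {
    const x = randNormal();
    const v = Math.pow(1 + c * x, 3);
    if (v <= 0) continue;
    const u = Math.random();
    // Full acceptance test: ln(u) < x^2/2 + d - d*v + d*ln(v)
    if (Math.log(u) < 0.5 * x * x + d - d * v + d * Math.log(v)) return d * v;
  }
}

// Beta(a, b) = X / (X + Y) with X ~ Gamma(a), Y ~ Gamma(b).
function sampleBeta(alpha: number, beta: number): number {
  const x = sampleGamma(alpha);
  const y = sampleGamma(beta);
  return x / (x + y);
}
```

Thompson Sampling then draws sampleBeta(alpha, beta) for each candidate tool under the context key and ranks candidates by the drawn values.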
2.4 Tool Registry and Governance
As tool catalogs grow, governance becomes essential. Several systems address the registry and governance aspects of tool management.
ToolRegistry [arXiv:2507.10593] demonstrated that centralizing tool definitions in a registry reduces integration code by 60–80% compared to distributed tool definitions. The TSE's ros.tool_registry table serves this function, providing a single source of truth for all tool schemas across all plugins.
ScaleMCP [arXiv:2505.06416] addressed the challenge of synchronizing tool definitions across distributed MCP servers. Its SHA-256 hash-based synchronization protocol (comparing content hashes to identify changed tools and re-indexing only those that changed) achieves Recall@5 ≈ 0.94 at 5,000 MCP servers. The TSE adopts this pattern directly: each tool's content_hash is a SHA-256 digest of its name, description, and parameters schema, and the embedding pipeline re-embeds only tools whose hash has changed.
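A sketch of this hash-based change detection, assuming a canonical JSON serialization of the three hashed fields (the TSE's exact serialization format is not specified here):

```typescript
import { createHash } from "node:crypto";

// content_hash is a SHA-256 digest over name, description, and parameters
// schema. A fixed serialization order keeps the hash deterministic.
function contentHash(tool: {
  name: string;
  description: string;
  parametersSchema: unknown;
}): string {
  const canonical = JSON.stringify([
    tool.name,
    tool.description,
    tool.parametersSchema,
  ]);
  return createHash("sha256").update(canonical).digest("hex");
}

// Re-embed only when the stored hash no longer matches the current schema.
function needsReembedding(
  storedHash: string | null,
  tool: Parameters<typeof contentHash>[0],
): boolean {
  return storedHash !== contentHash(tool);
}
```

Because the hash covers only retrieval-relevant fields, metadata-only edits (e.g. pin_weight) do not trigger re-embedding.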
AgentGuardian [arXiv:2601.10440] proposed per-context access control policies for tool governance, allowing different agents to see different tool subsets based on their role and authorization level. The TSE's ros.tool_visibility_rules table implements a comparable mechanism with six rule types (org, user, page, role, tier, composite) and allow/deny semantics, enabling fine-grained control over which tools are visible to which users.
2.5 The Adverant Nexus Platform
The TSE is deployed within the Adverant Nexus platform, a 44-microservice AI orchestration system built on Kubernetes with Istio service mesh. The platform's architecture separates tool selection (which tools the LLM sees) from tool execution (how tools run), with two distinct paths:
Non-chat LLM traffic traverses the mandatory dispatch chain: Plugin/Service → WorkflowJobDispatcher → nexus-workflows → Skills Engine → AI Provider Router. The Skills Engine resolves skill bindings, the AI Provider Router selects the configured provider (Gemini, Anthropic, Claude Max, or OpenRouter), and tool-call iterations are bounded to a maximum of five per request.
User-facing chat routes through the nexus-gateway chat orchestrator, which classifies the query, selects tools, and invokes the AI Provider Router directly. The TSE operates in this path, replacing the hardcoded PAGE_TOOL_MAP with database-driven semantic retrieval.
NexusROS, the Revenue Operating System, is the platform's largest plugin, contributing 80 of the 158 tools. It implements a four-pillar architecture (The Brain with 59 intelligence agents, The Megaphone with 22 marketing agents, The Closer with 24 sales agents, and The Ledger with 8 CRM agents) across 225 database tables and 135 agent roles. The scale of NexusROS's tool catalog is the primary driver of the TSE: no other plugin contributes enough tools to trigger the Gemini schema limit.
3. Problem Formulation
3.1 Formal Tool Selection Problem
Let T = {t₁, …, t_N} denote the complete tool catalog, where N = 158 in the Nexus platform. Each tool tᵢ is characterized by a schema consisting of a name, description, and parameters JSON schema. A user query q arrives with contextual metadata: a page context p (the dashboard page the user is viewing), an organization identifier o, and a subscription tier τ ∈ {open_source, shared_access, teams, dedicated_vps, government}.

The tool selection problem is to find a subset S ⊆ T with |S| = K ≪ N that maximizes the probability of successful task completion:

S* = argmax over S ⊆ T, |S| = K of P(success | q, p, o, τ, S)

where success is defined as the LLM selecting and successfully invoking the correct tool(s) from the presented set S. The default budget is K = 7, chosen based on Anthropic's recommendation of 3–7 tools and empirical observations that Recall@3 achieves 97.1% in vector retrieval settings [Mudunuri et al., 2025].
3.2 Token Cost Model
Each tool schema tᵢ consumes c(tᵢ) tokens when included in the LLM prompt. Empirical measurement across the 158-tool catalog yields an average of c̄ ≈ 120 tokens per tool schema. The total token cost of a tool set S is:

C(S) = Σ over tᵢ ∈ S of c(tᵢ) ≈ c̄ · |S|

For the baseline (all tools): C(T) ≈ 120 × 158 = 18,960 tokens.

For TSE at K = 7: C(S) ≈ 120 × 7 = 840 tokens.

The token savings per request are:

ΔC = 18,960 − 840 = 18,120 tokens

This represents a relative savings of:

ΔC / C(T) = 18,120 / 18,960 ≈ 95.6%

At a request volume of 10,000 queries per day, the annualized token savings are:

18,120 × 10,000 × 365 ≈ 6.6 × 10¹⁰ tokens
These savings translate directly to reduced API costs, faster response times (fewer tokens to process in the prompt), and reduced risk of context window overflow.
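The cost model can be reproduced directly from the constants above (120 tokens per schema on average, a 158-tool catalog, K = 7, 10,000 requests/day):

```typescript
// Token-cost arithmetic for the baseline vs. TSE-filtered tool payload.
const AVG_TOKENS_PER_TOOL = 120;
const CATALOG_SIZE = 158;
const K = 7;

const baselineTokens = AVG_TOKENS_PER_TOOL * CATALOG_SIZE; // 18,960
const tseTokens = AVG_TOKENS_PER_TOOL * K;                 // 840
const savedPerRequest = baselineTokens - tseTokens;        // 18,120
const relativeSavings = savedPerRequest / baselineTokens;  // ~0.956

const requestsPerDay = 10_000;
const annualSavings = savedPerRequest * requestsPerDay * 365;
```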
3.3 PAGE_TOOL_MAP Failure Analysis
The legacy PAGE_TOOL_MAP in chat-orchestrator.ts (lines 2078–2088) is a static TypeScript object mapping 9 page names to arrays of substring patterns:
```typescript
const PAGE_TOOL_MAP: Record<string, string[]> = {
  'campaign-genesis': ['genesis', 'campaign', 'list_contacts', ...],
  'contacts': ['contact', 'list_contacts', 'get_contact', ...],
  'companies': ['compan', 'list_companies', 'get_company', ...],
  'deals': ['deal', 'pipeline', 'list_deals', ...],
  'campaigns': ['campaign', 'email', 'social', 'ad_', ...],
  'pipeline': ['pipeline', 'deal', 'forecast', 'stage'],
  'lead-scoring': ['scoring', 'lead', 'segment', 'intent'],
  'voice-center': ['voice', 'call', 'transcript', 'coaching'],
  'territory-map': ['territor', 'geo', 'signal'],
};
```
This approach exhibits five failure modes:
F1: Incomplete Coverage. Only 9 of 27 NexusROS dashboard pages have mappings. The remaining 18 pages (including dossiers, forecasting, coaching, digital-twin, deal-simulation, playbooks, email-sequences, ad-manager, social-hub, content-lab, analytics, revenue-leakage, money-matrix, personas, research, intelligence, connectors, and settings) receive the full 80-tool payload, triggering Gemini failures.
F2: Substring Ambiguity. The pattern "contact" matches both nexus_ros_list_contacts and nexus_ros_create_contact_activity. As new tools are added, unintended matches multiply. The pattern "compan" (a truncated substring) is particularly fragile: it was shortened to match "company" but would also match any tool name containing "companion" or "accompany".
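The F2 ambiguity is easy to reproduce with the legacy matching logic as inferred from the excerpt above (a tool survives the filter if any pattern is a substring of its name; the real gateway code may differ in detail):

```typescript
// Minimal reproduction of the F2 failure mode: naive substring matching
// retains any tool whose name happens to contain a pattern.
function legacyMatch(toolName: string, patterns: string[]): boolean {
  return patterns.some((p) => toolName.includes(p));
}

const contactsPatterns = ['contact', 'list_contacts', 'get_contact'];

legacyMatch('nexus_ros_list_contacts', contactsPatterns);           // intended match
legacyMatch('nexus_ros_create_contact_activity', contactsPatterns); // unintended match
```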
F3: Static Configuration. Adding, removing, or modifying page mappings requires editing TypeScript source code, passing code review, building a new Docker image, and deploying the gateway service, a process that takes 15–45 minutes even when expedited. There is no mechanism for administrators to experiment with different mappings.
F4: Zero Observability. The PAGE_TOOL_MAP produces no telemetry. There is no record of which tools were selected for a given query, whether those tools were correct, or how selection quality changes over time. Without metrics, optimization is impossible.
F5: Plugin Non-Extensibility. When a new marketplace plugin introduces tools, there is no mechanism for the plugin to declare its page mappings. Every new plugin requires manual gateway code changes, a maintenance burden that scales linearly with the number of plugins.
3.4 Provider Schema Constraints
The Gemini API imposes a schema size limit that varies by model and configuration. In practice, the Nexus platform observes INVALID_ARGUMENT errors when approximately 80 or more nexus_ros_* tool schemas are included in a single request. The AI Provider Router in ai-provider-router.ts implements a retry mechanism that truncates conversation history upon receiving a 400 error, but this addresses the wrong dimension of the problem: the issue is tool count, not history length.
The problem is provider-asymmetric. Anthropic Claude and OpenRouter handle 158 tools without errors (though accuracy degrades). Gemini fails hard. A principled solution must reduce tool count uniformly across all providers rather than implementing provider-specific workarounds.
4. System Architecture
4.1 Overview
The Tool Selection Engine is implemented as a TypeScript class (ToolSelectionEngine) instantiated within the Nexus API Gateway service. It exposes a single public method, selectTools(params: ToolSelectionParams), that accepts a query, page context, organization, user, and tier, and returns a set of tool schemas ready for the AI Provider Router.
```typescript
class ToolSelectionEngine {
  private readonly pool: Pool;
  private readonly embedQuery?: (query: string) => Promise<number[]>;
  private readonly policyCache = new SimpleCache<Set<string>>(256, 30_000);
  private readonly pageMappingCache = new SimpleCache<string[]>(512, 60_000);
  private readonly toolListCache = new SimpleCache<ToolRecord[]>(128, 60_000);
}
```
The constructor receives a PostgreSQL connection pool and an optional embedding function. If the embedding function is not provided (e.g., during testing or when the @xenova/transformers package is not installed), the semantic retrieval stage is skipped, and the pipeline falls back to category-based filtering alone.
4.2 Database Schema
The TSE's persistent state resides in seven PostgreSQL tables within the ros schema, using the pgvector extension for vector similarity search. Figure 2 illustrates the entity-relationship diagram.
Figure 2. Entity-relationship diagram of the seven TSE database tables. Primary keys are marked with (PK), foreign keys with arrows. The tool_registry table is the central entity, with relationships to categories, page mappings, visibility rules, version history, bandit statistics, and selection events.
Table 1: ros.tool_categories stores a hierarchical taxonomy of tool categories. The TSE seeds 19 categories: 6 platform-level (GraphRAG read, GraphRAG write, workflow, admin, Forge, Skills) and 13 NexusROS-specific (contacts, companies, deals, campaigns, genesis, voice, intelligence, analytics, territory, sales, churn, system, activities). Categories support tree structure via parent_category_id for future hierarchical filtering.
Table 2: ros.tool_registry is the canonical tool catalog. Each row represents a single tool with:
- Identity fields: name (unique), display_name, plugin_id, category_id
- Schema fields: description, parameters_schema (JSONB), return_schema (JSONB)
- Retrieval metadata: function_summary, tags (TEXT[]), when_to_use, limitations, example_queries (JSONB array)
- Execution routing: execution_type (rest | dispatch | internal), execution_config (JSONB)
- Embedding state: embedding_vector (VECTOR(384)), embedding_status (pending | current | stale), last_embedded_at
- Pinning: is_always_pinned (boolean), pin_weight (integer)
- Lifecycle: status (draft | active | deprecated | retired), schema_version, content_hash (SHA-256, CHAR(64))
Table 3: ros.tool_page_mappings maps dashboard page patterns to tool categories, replacing the hardcoded PAGE_TOOL_MAP. The migration seeds 32 mappings (compared to the legacy system's 9), covering all 27 NexusROS pages plus cross-page category associations with priority weights.
Table 4: ros.tool_visibility_rules implements access control with six rule types (org, user, page, role, tier, composite) and allow/deny semantics. Rules are evaluated in priority order, with deny rules taking precedence over allow rules at the same priority level. This follows the LaunchDarkly targeting pattern, enabling gradual feature rollouts and per-organization tool customization.
Table 5: ros.tool_versions maintains an immutable audit trail. Every tool modification creates a version snapshot containing the full tool record as JSONB, the list of changed fields, and an optional change reason. This enables rollback to any previous version and supports compliance audit requirements.
Table 6: ros.tool_bandit_stats stores the Beta distribution parameters for Thompson Sampling. Each row represents a (tool_id, context_key) pair with alpha and beta_param values (both initialized to 1.0, representing a uniform prior), plus aggregate counters for total selections and total successes.
Table 7: ros.tool_selection_events logs every tool selection event, capturing the query text, page context, selected tools (UUID array), candidate count, selection method, latency, and token savings. After the LLM responds, the tools_called, tool_call_success, and outcome fields are populated asynchronously, closing the feedback loop for bandit learning.
4.3 Caching Architecture
The TSE employs three LRU caches to minimize database round-trips on the hot path:
| Cache | Max Entries | TTL | Key Pattern | Purpose |
|---|---|---|---|---|
| Policy Cache | 256 | 30,000 ms | ${orgId}:${tier} | Visibility rule evaluation |
| Page Mapping Cache | 512 | 60,000 ms | ${page}:${pluginId} | Page-to-category mapping |
| Tool List Cache | 128 | 60,000 ms | ${orgId} | Full candidate tool list |
The 30-second TTL for policy rules balances freshness against query volume: a visibility rule change propagates to all sessions within 30 seconds. The 60-second TTL for tool lists and page mappings reflects the lower rate of change for these entities.
Cache entries are implemented using a SimpleCache<T> class with O(1) insertion and lookup via a Map-based LRU eviction strategy. The cache is process-local (not shared across gateway instances), which means that multi-instance deployments may serve slightly different tool sets during the cache warm-up period β a trade-off accepted in favor of zero-dependency caching.
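A sketch of SimpleCache<T> consistent with the constructor calls shown in Section 4.1, with an injectable clock added for testability (the production class's internals are assumptions):

```typescript
// Map-based LRU cache with a TTL. JavaScript Maps preserve insertion
// order, so the first key is always the least recently used.
class SimpleCache<T> {
  private readonly entries = new Map<string, { value: T; expiresAt: number }>();

  constructor(
    private readonly maxEntries: number,
    private readonly ttlMs: number,
    private readonly now: () => number = Date.now,
  ) {}

  get(key: string): T | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // lazy expiry on read
      return undefined;
    }
    // Refresh recency: re-inserting moves the key to the back of the Map.
    this.entries.delete(key);
    this.entries.set(key, entry);
    return entry.value;
  }

  set(key: string, value: T): void {
    if (this.entries.has(key)) {
      this.entries.delete(key);
    } else if (this.entries.size >= this.maxEntries) {
      // Evict the least recently used entry (first key in the Map).
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}
```

The TTL is fixed at insertion time; a get refreshes recency but not expiry, so stale entries age out on schedule regardless of read traffic.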
4.4 Chat Orchestrator Integration
The TSE integrates with the chat orchestrator via a graceful upgrade pattern. The orchestrator checks whether the ToolSelectionEngine instance is initialized and, if so, delegates all tool selection to it:
```typescript
if (this.toolSelectionEngine) {
  try {
    const tseResult = await this.toolSelectionEngine.selectTools({
      query: message,
      pageContext: context.pageContext ? { page: context.pageContext.page } : undefined,
      orgId: context.orgId || 'anonymous',
      userId: conversationContext.userId || 'anonymous',
      tier: userTier || 'open_source',
      topK: 7,
    });
    tools = tseResult.tools;
    tseEventId = tseResult.metadata.eventId;
  } catch (tseError) {
    // TSE failed - fall back to legacy filtering
    tools = this.legacyToolFilter(userTier, userEmail, isCRMClassification, pageContext);
  }
}
```
This pattern ensures zero-downtime migration: the legacy PAGE_TOOL_MAP continues to function until the TSE is fully operational, and the fallback triggers automatically if the TSE encounters an error. After the LLM responds with tool calls, the orchestrator records the outcome:
```typescript
await this.toolAnalytics.recordOutcome(tseEventId, {
  toolsCalled: response.toolCalls.map(tc => tc.name),
  success: !response.error,
});
```
This outcome feedback is the signal that drives bandit learning. Without it, the Thompson Sampling parameters remain at their uninformative prior; with it, the system converges toward optimal tool selection for each context.
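The update rule implied by this feedback loop can be sketched as follows (treating a selected-but-uncalled tool as a failure is an assumption of this sketch; the paper specifies only that call and success signals update the Beta parameters):

```typescript
// Per-(tool, context) Beta posterior, initialized to the uniform prior
// alpha = beta = 1 as in ros.tool_bandit_stats.
interface BanditStats {
  alpha: number;     // prior + observed successes
  betaParam: number; // prior + observed failures
}

function updateOnOutcome(
  stats: BanditStats,
  wasCalled: boolean,
  succeeded: boolean,
): BanditStats {
  // A successful call is a success; anything else counts against the tool.
  const success = wasCalled && succeeded;
  return {
    alpha: stats.alpha + (success ? 1 : 0),
    betaParam: stats.betaParam + (success ? 0 : 1),
  };
}

// Posterior mean success rate: alpha / (alpha + beta).
function expectedSuccessRate(stats: BanditStats): number {
  return stats.alpha / (stats.alpha + stats.betaParam);
}
```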
5. The Five-Stage Pipeline
The TSE pipeline processes each tool selection request through five stages, each of which narrows the candidate set. Every stage is designed to degrade gracefully: if a stage encounters an error or produces an empty result, control passes to the next stage with the current candidate set intact. This means that chat never fails due to TSE errors; the worst case is returning a broader tool set than optimal.
Figure 1. The five-stage TSE pipeline. Each stage narrows the candidate set, with cache layers and graceful fallback paths. The pipeline completes in under 200ms at P95.
5.1 Stage 1: Policy Resolution
Policy resolution determines which tools a given organization and subscription tier are authorized to see. This stage queries the ros.tool_visibility_rules table, evaluating rules in priority order with allow/deny semantics.
Algorithm. Given organization ID o and tier τ:

1. Check the policy cache for key ${o}:${τ}. If hit, return the cached allow set.
2. Query ros.tool_visibility_rules for active rules matching the organization or tier.
3. Initialize the allow set A = ∅ and the deny set D = ∅.
4. For each rule, ordered by priority (descending): if the effect is allow, add the rule's tool_id or category_id to A; if the effect is deny, add it to D.
5. The visible set is V = A \ D.
6. Cache V for 30 seconds.
Graceful degradation. If the query fails or returns no rules, the stage returns null, which the downstream pipeline interprets as "allow all tools." This ensures that a misconfigured or empty tool_visibility_rules table does not block chat functionality.
Six rule types are supported: org (match by organization ID), user (match by user ID), page (match by page context), role (match by RBAC role), tier (match by subscription tier with comparison operators), and composite (match by multiple conditions combined with AND logic). The conditions field is a JSONB object that encodes the rule's predicate, supporting nested comparisons like {"tier": {"gte": "teams"}}.
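A sketch of the tier-comparison operator, assuming the tier ordering follows the sequence listed in Section 3.1 (the full conditions grammar is richer than this gte/lte subset):

```typescript
// Tiers in ascending order, per the listing in Section 3.1 (ordering
// assumption for this sketch).
const TIER_ORDER = [
  'open_source',
  'shared_access',
  'teams',
  'dedicated_vps',
  'government',
] as const;
type Tier = (typeof TIER_ORDER)[number];

// Evaluate a conditions predicate like {"tier": {"gte": "teams"}} against
// the requesting user's tier.
function tierSatisfies(userTier: Tier, condition: { gte?: Tier; lte?: Tier }): boolean {
  const rank = TIER_ORDER.indexOf(userTier);
  if (condition.gte !== undefined && rank < TIER_ORDER.indexOf(condition.gte)) return false;
  if (condition.lte !== undefined && rank > TIER_ORDER.indexOf(condition.lte)) return false;
  return true;
}
```

A composite rule would AND several such predicates together before applying the allow/deny effect.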
5.2 Stage 2: Candidate Fetch
Candidate fetch retrieves all active tools from the ros.tool_registry table for the given organization, then intersects the result with the policy allow set from Stage 1.
Algorithm. Given organization ID $o$ and policy allow set $A$:

1. Check the tool list cache for key `${o}`. If hit, return the cached tools.
2. Query `ros.tool_registry` where `organization_id = o`, `status = 'active'`, and `deleted_at IS NULL`, ordered by `pin_weight DESC, name ASC`.
3. If $A \neq$ null, filter the result to tools whose ID or category ID is in $A$.
4. Cache the filtered list for 60 seconds.
5. Return the candidate set $C$.
The pin_weight ordering ensures that always-pinned tools (health checks, search_tools) appear first in the candidate list, simplifying the pin injection stage.
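The intersection and ordering logic above can be sketched in TypeScript as follows; the `Tool` shape and function name are assumptions for illustration, mirroring the SQL `ORDER BY pin_weight DESC, name ASC` in memory.

```typescript
interface Tool {
  id: string;
  categoryId: string;
  pinWeight: number;
  name: string;
}

// Intersect candidates with the Stage 1 allow set; a null allow set means
// "all tools visible". The allow set may contain tool IDs or category IDs.
function filterCandidates(tools: Tool[], allow: Set<string> | null): Tool[] {
  const kept =
    allow === null
      ? tools
      : tools.filter((t) => allow.has(t.id) || allow.has(t.categoryId));
  // Mirror the SQL ordering: pinned tools first, then alphabetical.
  return [...kept].sort(
    (a, b) => b.pinWeight - a.pinWeight || a.name.localeCompare(b.name),
  );
}
```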
5.3 Stage 3: Page Filtering
Page filtering narrows the candidate set based on the user's current dashboard page, using the ros.tool_page_mappings table to resolve which tool categories are relevant.
Algorithm. Given page context $p$ and candidate set $C$:

1. Check the page mapping cache for key `${p}:${pluginId}`. If hit, use the cached category IDs.
2. Query `ros.tool_page_mappings` for the plugin, ordered by priority.
3. For each mapping, test whether the page matches the pattern using four matching modes:
   - Exact: page equals pattern
   - Prefix wildcard: pattern ends with `*`, page starts with the pattern prefix
   - Suffix wildcard: pattern starts with `*`, page ends with the pattern suffix
   - Contains: pattern starts and ends with `*`, page contains the inner string
4. Collect the matching category IDs.
5. Filter to tools whose `category_id` is in the matched set.
6. If filtering produces an empty set, return $C$ unfiltered (graceful fallback).
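The four matching modes reduce to a small helper. This is an illustrative sketch rather than the production matcher; the function name is an assumption.

```typescript
// Match a dashboard page against a mapping pattern using the four modes:
// exact, prefix wildcard, suffix wildcard, and contains.
function pageMatches(page: string, pattern: string): boolean {
  const starts = pattern.startsWith("*");
  const ends = pattern.endsWith("*");
  if (starts && ends) return page.includes(pattern.slice(1, -1)); // contains
  if (ends) return page.startsWith(pattern.slice(0, -1)); // prefix wildcard
  if (starts) return page.endsWith(pattern.slice(1)); // suffix wildcard
  return page === pattern; // exact
}
```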
The TSE's page mapping table contains 32 seed entries, compared to the legacy PAGE_TOOL_MAP's 9. This 3.6× increase in coverage means that 23 previously unmapped pages now receive targeted tool filtering, including dossiers, forecasting, digital-twin, deal-simulation, playbooks, ad-manager, social-hub, content-lab, analytics, revenue-leakage, money-matrix, personas, research, and intelligence.
5.4 Stage 4: Semantic Retrieval (Tool2Vec)
Semantic retrieval is the core innovation of the TSE pipeline. It uses vector similarity search to identify tools whose semantic embedding is closest to the user's query, replacing substring matching with neural understanding.
Embedding Strategy (Tool2Vec). Each tool is represented by a 384-dimensional vector computed using the all-MiniLM-L6-v2 sentence transformer model. The embedding strategy depends on the tool's metadata:
If the tool has example queries (a non-empty example_queries JSONB array), we embed each query individually and compute the component-wise mean:

$$e_{\text{tool}} = \frac{1}{n} \sum_{i=1}^{n} E(q_i)$$

where $q_1, \dots, q_n$ are the tool's example queries. This approach captures the intent space of the tool: the range of natural language expressions that should trigger its selection.

If the tool lacks example queries, we fall back to embedding a concatenation of its textual metadata:

$$e_{\text{tool}} = E(\text{name} \,\Vert\, \text{description} \,\Vert\, \text{function\_summary} \,\Vert\, \text{when\_to\_use})$$

where $\Vert$ denotes string concatenation with newline separators.
Query Embedding. At selection time, the user's query $q$ is embedded using the same model: $e_q = E(q)$.

Retrieval. The candidate tools from Stage 3 are ranked by cosine similarity to the query embedding using pgvector's <=> operator, which computes cosine distance ($1 - \text{sim}$) so that ascending-order sorts return the most similar tools first.
The top-K candidates by similarity are returned. The pgvector HNSW index on the embedding_vector column with vector_cosine_ops ensures that this retrieval operates in sub-linear time.
SQL Implementation:
```sql
SELECT tr.id
FROM ros.tool_registry tr
WHERE tr.id = ANY($1::uuid[])
  AND tr.embedding_status = 'current'
  AND tr.embedding_vector IS NOT NULL
  AND tr.deleted_at IS NULL
ORDER BY tr.embedding_vector <=> $2::vector
LIMIT $3
```
Graceful degradation. If the embedding function is not available (e.g., @xenova/transformers not installed) or if no candidate tools have current embeddings, this stage returns an empty set and the pipeline proceeds with the full page-filtered candidate set.
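The Tool2Vec averaging step itself is simple. The sketch below assumes pre-computed per-query embeddings (the real vectors come from all-MiniLM-L6-v2 via @xenova/transformers); re-normalizing the mean, so that cosine similarity remains a plain dot product downstream, is an assumption of this sketch rather than a documented detail.

```typescript
// Component-wise mean of per-query embeddings, then L2 re-normalization.
function meanEmbedding(queryEmbeddings: number[][]): number[] {
  const n = queryEmbeddings.length;
  const dim = queryEmbeddings[0].length;
  const mean = new Array<number>(dim).fill(0);
  for (const e of queryEmbeddings) {
    for (let j = 0; j < dim; j++) mean[j] += e[j] / n;
  }
  const norm = Math.hypot(...mean);
  return norm === 0 ? mean : mean.map((x) => x / norm);
}
```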
5.5 Stage 5: Bandit Reranking (Thompson Sampling)
The bandit reranking stage applies Thompson Sampling to reorder the candidate tools based on their historical success rates in the given context. This is the self-improving component of the TSE: over time, tools that succeed in a context are promoted, and tools that fail are demoted.
Context Key. The bandit operates on a per-context basis, where the context key is:

$$\kappa = \texttt{\${tier}:\${page}:general}$$

For example: shared_access:contacts:general. This partitioning ensures that tool preferences learned on the contacts page do not influence selections on the deals page.
Algorithm. Given candidate tools $C$ and context key $\kappa$:

1. Fetch Beta distribution parameters $(\alpha_t, \beta_t)$ from `ros.tool_bandit_stats` for each candidate $t$ in context $\kappa$.
2. For tools with no bandit statistics, use the uninformative prior $\alpha_t = \beta_t = 1$.
3. For each tool $t$, sample a score $s_t \sim \text{Beta}(\alpha_t, \beta_t)$.
4. Sort tools by $s_t$ in descending order.
5. Select the top $K - 1$ tools (the exploitation set).
6. From the remaining tools, select the one with the highest $s_t$ value (the exploration slot).
7. Return the combined set of $K$ tools.
The exploration slot ensures that new tools (with uninformative priors) and underexplored tools have a non-zero probability of being selected, preventing the bandit from converging prematurely to a suboptimal fixed set.
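The exploitation/exploration split can be sketched as follows. The `sampleBeta` function is passed in (a Marsaglia–Tsang-based implementation appears in Section 6); the interface and function names are illustrative assumptions.

```typescript
interface BanditStats {
  toolId: string;
  alpha: number; // 1 + successes
  beta: number;  // 1 + failures
}

// Sample a Beta score per tool, keep the top K-1 by sampled score, then
// fill one exploration slot with the best-scoring remaining tool.
function rerank(
  stats: BanditStats[],
  k: number,
  sampleBeta: (a: number, b: number) => number,
): string[] {
  const scored = stats
    .map((s) => ({ id: s.toolId, score: sampleBeta(s.alpha, s.beta) }))
    .sort((a, b) => b.score - a.score);
  const exploit = scored.slice(0, k - 1);
  const explore = scored[k - 1]; // highest remaining sample, if any
  return [...exploit, ...(explore ? [explore] : [])].map((t) => t.id);
}
```

With a stochastic `sampleBeta`, wide posteriors (new tools) occasionally outscore concentrated ones, which is exactly the exploration behavior described above.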
Beta Distribution Sampling. Sampling from $\text{Beta}(\alpha, \beta)$ uses the standard gamma-ratio decomposition, which reduces Beta sampling to two independent Gamma samples:

$$X \sim \Gamma(\alpha, 1), \qquad Y \sim \Gamma(\beta, 1), \qquad \frac{X}{X+Y} \sim \text{Beta}(\alpha, \beta)$$

The Gamma samples are generated using the Marsaglia–Tsang fast gamma sampler, described in detail in Section 6.6.
Posterior Update. After the LLM responds to a query, the outcome is recorded and the bandit statistics are updated:

$$\alpha_t \leftarrow \alpha_t + 1 \text{ on success}, \qquad \beta_t \leftarrow \beta_t + 1 \text{ on failure}$$

This update is performed asynchronously via the ToolAnalyticsService.recordOutcome() method, using an UPSERT (INSERT ... ON CONFLICT DO UPDATE) to atomically create or increment the statistics.
5.6 Stage 6: Pin Injection
Pin injection ensures that certain critical tools are always available regardless of the pipeline's filtering decisions. Tools with is_always_pinned = true in the ros.tool_registry are prepended to the final tool set, ordered by pin_weight (descending).
Typical pinned tools include health check endpoints (nexus_health), tool search/discovery tools (search_tools), and workflow dispatch (nexus_dispatch_workflow). Pinning is configured per-organization and can be overridden via the admin API.
The final tool count may exceed $K$ when pinned tools are added: if $K = 7$ and 2 tools are pinned, the result contains up to 9 tools. This is acceptable because the purpose of $K$ is to limit the retrieved set, not the absolute maximum; pinned tools are known-good choices that do not require retrieval.
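Pin injection is a prepend-and-deduplicate operation; a minimal sketch (names are assumptions):

```typescript
interface PinnedTool {
  id: string;
  pinWeight: number;
}

// Prepend always-pinned tools by descending pin_weight, deduplicating
// against tools the pipeline already selected.
function injectPins(selected: string[], pinned: PinnedTool[]): string[] {
  const pins = [...pinned]
    .sort((a, b) => b.pinWeight - a.pinWeight)
    .map((p) => p.id);
  const rest = selected.filter((id) => !pins.includes(id));
  return [...pins, ...rest];
}
```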
6. Mathematical Framework
This section formalizes the mathematical components of the TSE pipeline, covering the token savings model, embedding strategy, similarity metrics, Thompson Sampling, and the sampling algorithms used in the implementation.
6.1 Token Savings Model
Let $N$ denote the total number of tools in the catalog and $K$ the selection budget. The average token cost per tool schema is $c$. The token savings from the TSE are:

$$S = (N - K) \cdot c$$

For the Nexus platform with $N = 158$, $K = 7$, and $c = 120$ tokens:

$$S = (158 - 7) \cdot 120 = 18{,}120 \text{ tokens per request}$$

The relative savings are:

$$\frac{S}{N \cdot c} = \frac{18{,}120}{18{,}960} \approx 95.6\%$$

The annualized savings at $R$ requests per day are:

$$S_{\text{annual}} = 365 \cdot R \cdot S = 365 \cdot R \cdot 18{,}120 \text{ tokens}$$

The cost savings in currency depend on the provider's per-token pricing. At representative pricing of 0.01 USD per 1,000 input tokens (Gemini 1.5 Pro tier), the annual savings are approximately $66 \cdot R$ USD per year.
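The model above is a few lines of arithmetic. In this sketch, $c = 120$ tokens per schema is the figure implied by 158 tools costing 18,960 tokens, and $R$ (requests per day) is left as a parameter since the paper's sample value is elided here.

```typescript
const N = 158; // catalog size
const K = 7;   // selection budget
const c = 120; // average tokens per tool schema (implied by 158 * 120 = 18,960)

const perRequestSavings = (N - K) * c;            // tokens saved per request
const relativeSavings = perRequestSavings / (N * c); // fraction of baseline cost
const annualTokens = (R: number) => 365 * R * perRequestSavings;
const annualUsd = (R: number) => (annualTokens(R) / 1000) * 0.01; // $0.01 per 1k tokens
```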
6.2 Tool2Vec Embedding
The Tool2Vec embedding strategy represents each tool as a point in a $d$-dimensional space ($d = 384$ for all-MiniLM-L6-v2). Given a tool with example queries $q_1, \dots, q_n$:

$$e_{\text{tool}} = \frac{1}{n} \sum_{i=1}^{n} E_\theta(q_i)$$

where $E_\theta$ is the sentence transformer encoding function with parameters $\theta$, producing L2-normalized embeddings.

For tools without example queries, the fallback embedding is:

$$e_{\text{tool}} = E_\theta(\text{name} \,\Vert\, \text{description} \,\Vert\, \text{function\_summary} \,\Vert\, \text{when\_to\_use})$$

The averaging approach places the tool at the centroid of its usage space. If a tool is useful for queries A, B, and C, its embedding sits near the centroid of $E_\theta(A)$, $E_\theta(B)$, and $E_\theta(C)$, making it retrievable by any query in that region.
6.3 Cosine Similarity
The similarity between a query embedding $e_q$ and a tool embedding $e_t$ is computed as:

$$\text{sim}(e_q, e_t) = \frac{e_q \cdot e_t}{\lVert e_q \rVert \, \lVert e_t \rVert}$$

Since all-MiniLM-L6-v2 produces L2-normalized embeddings ($\lVert e \rVert = 1$), cosine similarity reduces to the dot product:

$$\text{sim}(e_q, e_t) = e_q \cdot e_t$$

The pgvector <=> operator computes the cosine distance:

$$d(e_q, e_t) = 1 - \text{sim}(e_q, e_t)$$

allowing ORDER BY ... ASC to retrieve the most similar tools first.
6.4 Thompson Sampling
Thompson Sampling [Thompson, 1933] is a Bayesian approach to the multi-armed bandit problem. For each tool $t$ in context $\kappa$, we maintain a posterior distribution over its success probability $p_{t,\kappa}$.

Prior. We initialize with a uniform (uninformative) prior:

$$p_{t,\kappa} \sim \text{Beta}(1, 1) = \text{Uniform}(0, 1)$$

Posterior Update. After $n$ selections of tool $t$ with $s$ successes and $f$ failures, the posterior is:

$$p_{t,\kappa} \sim \text{Beta}(\alpha, \beta)$$

where:

$$\alpha = 1 + s, \qquad \beta = 1 + f, \qquad n = s + f$$

The expected success probability is:

$$\mathbb{E}[p_{t,\kappa}] = \frac{\alpha}{\alpha + \beta}$$

The variance is:

$$\operatorname{Var}[p_{t,\kappa}] = \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}$$

As $n \to \infty$, the variance approaches zero and the posterior concentrates around the true success probability.

Selection. At each selection event, we sample $\hat{p}_t \sim \text{Beta}(\alpha_t, \beta_t)$ for each candidate tool and select the $K$ tools with the highest samples. The stochasticity of sampling naturally balances exploration (high-variance priors produce diverse samples) and exploitation (concentrated posteriors produce consistent samples near the mean).
6.5 Beta Distribution via Gamma Ratios
Sampling from $\text{Beta}(\alpha, \beta)$ is performed using the gamma-ratio decomposition:

$$X \sim \Gamma(\alpha, 1), \qquad Y \sim \Gamma(\beta, 1)$$

It can be shown that $X / (X + Y) \sim \text{Beta}(\alpha, \beta)$. This reduces the problem to sampling from two independent Gamma distributions.
6.6 Marsaglia–Tsang Gamma Sampler
The Gamma distribution for shape $\alpha \geq 1$ is sampled using the Marsaglia–Tsang method [Marsaglia and Tsang, 2000]:

Setup. Compute constants:

$$d = \alpha - \tfrac{1}{3}, \qquad c = \frac{1}{\sqrt{9d}}$$

Iteration. Repeat until acceptance:

1. Generate $Z \sim \mathcal{N}(0, 1)$ (standard normal).
2. Compute $v = (1 + cZ)^3$. If $v \leq 0$, reject and go to step 1.
3. Generate $U \sim \text{Uniform}(0, 1)$.
4. Fast acceptance test: If $U < 1 - 0.0331 Z^4$, accept and return $d \cdot v$.
5. Logarithmic acceptance test: If $\ln U < \tfrac{1}{2} Z^2 + d - dv + d \ln v$, accept and return $d \cdot v$.
6. Reject and return to step 1.

Subunitary shape ($\alpha < 1$). For $\alpha < 1$, we use the boosting transformation:

$$\Gamma(\alpha) \stackrel{d}{=} \Gamma(\alpha + 1) \cdot U^{1/\alpha}$$

where $U \sim \text{Uniform}(0, 1)$.

Efficiency. The Marsaglia–Tsang method has an acceptance rate exceeding 95% for $\alpha \geq 1$, making it highly efficient. The fast acceptance test (step 4) avoids logarithm computation in most iterations.
6.7 Box–Muller Transform
Standard normal samples for the Marsaglia–Tsang sampler are generated using the polar form of the Box–Muller transform:

1. Generate $U_1, U_2 \sim \text{Uniform}(-1, 1)$.
2. Compute $s = U_1^2 + U_2^2$.
3. If $s \geq 1$ or $s = 0$, reject and return to step 1.
4. Return:

$$Z = U_1 \sqrt{\frac{-2 \ln s}{s}}$$

The rejection rate is $1 - \pi/4 \approx 21.5\%$, yielding an average of $4/\pi \approx 1.27$ uniform pairs per normal sample.
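Sections 6.5 through 6.7 compose into a complete Beta sampler. The TypeScript sketch below is an illustration under the stated algorithms, not the production code: it chains the polar Box–Muller normal generator, the Marsaglia–Tsang Gamma sampler with the $\alpha < 1$ boost, and the gamma-ratio Beta construction.

```typescript
// Polar Box-Muller: one standard normal per accepted uniform pair.
function randomNormal(): number {
  for (;;) {
    const u1 = 2 * Math.random() - 1;
    const u2 = 2 * Math.random() - 1;
    const s = u1 * u1 + u2 * u2;
    if (s > 0 && s < 1) return u1 * Math.sqrt((-2 * Math.log(s)) / s);
  }
}

// Marsaglia-Tsang sampler for Gamma(alpha, 1).
function sampleGamma(alpha: number): number {
  if (alpha < 1) {
    // Boost for subunitary shape: Gamma(a) = Gamma(a + 1) * U^(1/a).
    return sampleGamma(alpha + 1) * Math.pow(Math.random(), 1 / alpha);
  }
  const d = alpha - 1 / 3;
  const c = 1 / Math.sqrt(9 * d);
  for (;;) {
    const z = randomNormal();
    const v = Math.pow(1 + c * z, 3);
    if (v <= 0) continue; // reject: (1 + cZ)^3 must be positive
    const u = Math.random();
    if (u < 1 - 0.0331 * z ** 4) return d * v; // fast acceptance test
    if (Math.log(u) < 0.5 * z * z + d - d * v + d * Math.log(v)) return d * v;
    // otherwise reject and loop
  }
}

// Gamma-ratio construction: X/(X+Y) ~ Beta(alpha, beta).
function sampleBeta(alpha: number, beta: number): number {
  const x = sampleGamma(alpha);
  const y = sampleGamma(beta);
  return x / (x + y);
}
```

With $\alpha = 9, \beta = 3$ (the $n = 10$ row of the convergence table in Section 7.4), samples concentrate around $\mathbb{E}[p] = 0.75$.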
6.8 Recall@K
The primary retrieval quality metric is Recall@K, which measures the fraction of ground-truth relevant tools that appear in the top-K selection:

$$\text{Recall@}K = \frac{|R \cap S_K|}{|R|}$$

where $R$ is the set of ground-truth relevant tools and $S_K$ is the set of $K$ selected tools. For tool selection, $|R| = 1$ in the common case (the user intends to invoke a single tool), reducing Recall@K to a binary hit/miss indicator.
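The metric is a one-liner in code; this helper (an illustrative sketch) is the form used when replaying logged selection events against labeled ground truth:

```typescript
// Fraction of ground-truth relevant tools appearing in the top-K selection.
function recallAtK(relevant: Set<string>, selected: string[], k: number): number {
  const topK = new Set(selected.slice(0, k));
  let hits = 0;
  relevant.forEach((r) => {
    if (topK.has(r)) hits++;
  });
  return hits / relevant.size;
}
```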
6.9 Regret Bound
The cumulative regret of Thompson Sampling over $T$ rounds with $N$ arms is bounded by:

$$\mathbb{E}[\mathcal{R}(T)] = O\!\left(\sqrt{N T \ln T}\right)$$

In the TSE's contextual setting with $C$ page contexts, the effective number of arms per context is the page-filtered tool count, typically 20–40 rather than the full 158. This reduces the per-context regret bound while introducing a multiplicative factor of $C$ for the number of contexts.
6.10 Content Hash for Change Detection
Each tool's content hash is computed as:

$$h_t = \text{SHA-256}(\text{serialized tool definition})$$

where the serialization concatenates the tool's descriptive and schema fields with a stable delimiter. When a tool is updated, the new hash $h'_t$ is compared against the stored hash. If $h'_t \neq h_t$, the tool's embedding_status is set to stale, triggering the embedding pipeline to recompute its vector on the next batch run.
This hash-based change detection follows the ScaleMCP pattern [arXiv:2505.06416] and ensures that the embedding pipeline performs work proportional to the number of changed tools, not the total number of tools. For a catalog of 158 tools with 2β3 changes per day, this reduces daily embedding compute by approximately 98%.
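A minimal sketch of the hash using Node's crypto module follows. The exact fields and delimiter in the production hash are assumptions here; what matters is a stable, canonical serialization so that unchanged tools hash identically across syncs.

```typescript
import { createHash } from "crypto";

interface ToolDefinition {
  name: string;
  description: string;
  schema: object;
}

// SHA-256 over a stable serialization of the tool's fields.
// Field choice and "\n" delimiter are illustrative assumptions.
function toolContentHash(tool: ToolDefinition): string {
  const canonical = [tool.name, tool.description, JSON.stringify(tool.schema)].join("\n");
  return createHash("sha256").update(canonical, "utf8").digest("hex");
}
```

Note that `JSON.stringify` is only canonical if key order is stable; a production implementation would sort keys before serializing.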
6.11 Cache Hit Probability Model
For a cache with TTL $T$ seconds and a request stream where queries with the same context key arrive as a Poisson process with rate $\lambda$, the probability of a cache hit is:

$$P(\text{hit}) = 1 - e^{-\lambda T}$$

For the tool list cache with $T = 60$ seconds and an estimated $\lambda = 0.1$ queries per second per organization (one query every 10 seconds on average):

$$P(\text{hit}) = 1 - e^{-6} \approx 99.75\%$$

This high hit rate means that the database is queried for the full tool list approximately once per minute per organization, with the remaining 99.75% of requests served from cache. The Tool Inertia Graph research [arXiv:2511.14650] reported a similar observation, a 65.96% reduction in selection cost from temporal caching, corroborating the effectiveness of this approach.
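The model is directly checkable in code:

```typescript
// P(hit) = 1 - e^(-lambda * TTL): a request hits the cache if the previous
// same-key request arrived within the TTL window (Poisson arrivals).
const hitProbability = (lambda: number, ttlSeconds: number): number =>
  1 - Math.exp(-lambda * ttlSeconds);
```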
7. Evaluation
This section evaluates the TSE across five dimensions: token efficiency, projected retrieval quality, bandit convergence properties, pipeline latency, and an ablation study of the pipeline stages.
7.1 Experimental Setup
The evaluation is conducted on the Adverant Nexus production platform with the following configuration:
Tool corpus. 158 tools across six categories:
| Category | Tool Count | Percentage |
|---|---|---|
| NexusROS (nexus_ros_*) | 80 | 50.6% |
| Forge IDE | 35 | 22.2% |
| GraphRAG | 27 | 17.1% |
| Skills Engine | 12 | 7.6% |
| Workflow Dispatch | 2 | 1.3% |
| K8s Administration | 2 | 1.3% |
| Total | 158 | 100% |
Page contexts. 32 page-to-category mappings (TSE) vs. 9 hardcoded entries (legacy PAGE_TOOL_MAP).
AI Providers. Gemini (Google), Claude (Anthropic), OpenRouter (multi-model), and Claude Max.
Database. PostgreSQL 16 with pgvector extension, deployed on Kubernetes.
Embedding model. all-MiniLM-L6-v2 via @xenova/transformers, producing 384-dimensional vectors.
Selection budget. Default $K = 7$.
7.2 Token Efficiency
The TSE achieves dramatic token reduction compared to both the baseline (all tools) and the legacy PAGE_TOOL_MAP:
| Method | Tools Sent | Tokens/Request | Savings vs. Baseline | Notes |
|---|---|---|---|---|
| Baseline (all tools) | 158 | 18,960 | β | Triggers Gemini INVALID_ARGUMENT |
| PAGE_TOOL_MAP (9 pages) | 20–30 | 2,400–3,600 | 81–87% | Only covers 9/27 pages |
| PAGE_TOOL_MAP (unmapped pages) | 80 | 9,600 | 49% | CRM-filtered but no page filter |
| TSE (K=7) | 7 | 840 | 95.6% | Full pipeline with bandit |
| TSE (K=10) | 10 | 1,200 | 93.7% | Higher K for complex pages |
| TSE (K=15) | 15 | 1,800 | 90.5% | Conservative setting |
Figure 5. Token cost per request across selection methods. The TSE at K=7 achieves 95.6% reduction compared to the baseline of sending all 158 tools.
The TSE's token savings are uniform across all dashboard pages, whereas PAGE_TOOL_MAP only provides savings on its 9 mapped pages. On unmapped pages, the legacy system sends all 80 NexusROS tools (9,600 tokens), providing only a 49% reduction. The TSE eliminates this disparity.
7.3 Projected Retrieval Quality
We project retrieval accuracy using published baselines from the tool selection literature, noting that these projections will be validated empirically as the TSE accumulates production data.
| Metric | Published Baseline | Source | TSE Projection |
|---|---|---|---|
| Recall@3 | 97.1% | Mudunuri et al. [arXiv:2603.20313] | ≥ 95% |
| Recall@5 | 0.94 | ScaleMCP [arXiv:2505.06416] | ≥ 0.92 |
| Description-only vs. Tool2Vec | +30.5 Recall@K | Tool2Vec [ACL 2025] | +25–30 Recall@K |
| Hybrid retrieval improvement | +38.6% accuracy | ToolScope [arXiv:2510.20036] | +30–38% accuracy |
Figure 6. Projected Recall@K curves for three retrieval strategies: description-only embedding, Tool2Vec (example query averaging), and the full TSE pipeline with bandit reranking. Curves are based on published baselines from Mudunuri et al. and Tool2Vec research.
The TSE's hybrid approach, combining page-context filtering (keyword-like) with semantic retrieval (embedding-based), aligns with the ToolScope finding that hybrid retrieval outperforms pure semantic retrieval by 38.6% [arXiv:2510.20036]. The page filter acts as a coarse-grained filter that eliminates obviously irrelevant categories, allowing semantic retrieval to discriminate among a smaller, more relevant candidate set.
7.4 Bandit Convergence
Thompson Sampling converges toward the optimal selection policy as the number of interactions increases. The convergence behavior depends on the gap between the best and second-best tools in each context.
For a tool with true success probability $p^* = 0.8$ and a second-best tool with $p_2 = p^* - \Delta$, the Beta posterior concentrates around the true value at the following rates:
| Interactions | $\alpha$ | $\beta$ | $\mathbb{E}[p]$ | $\operatorname{Var}[p]$ | 95% CI Width |
|---|---|---|---|---|---|
| 0 (prior) | 1 | 1 | 0.500 | 0.0833 | 0.950 |
| 10 | 9 | 3 | 0.750 | 0.0144 | 0.399 |
| 50 | 41 | 11 | 0.788 | 0.0031 | 0.184 |
| 100 | 81 | 21 | 0.794 | 0.0016 | 0.130 |
| 500 | 401 | 101 | 0.799 | 0.0003 | 0.058 |
| 1000 | 801 | 201 | 0.800 | 0.0002 | 0.041 |
Figure 8. Evolution of Beta distribution parameters for three tools over 1,000 interactions. Tool A (high success rate) converges rapidly with tight confidence intervals. Tool B (moderate success rate) converges more slowly. Tool C (low success rate) is correctly deprioritized.
After approximately 50 interactions per context, the Thompson Sampling selection converges to within 5% of the oracle policy (which knows the true success probabilities). With 9 primary page contexts and an estimated 100 interactions per page per day, each context reaches convergence within 1–2 days of deployment.
7.5 Pipeline Latency
The TSE pipeline is designed to complete within 200ms at P95. The latency budget is allocated across stages:
| Stage | Cached (ms) | Uncached (ms) | Cache Hit Rate |
|---|---|---|---|
| Policy Resolution | < 0.1 | 2–5 | ~99% |
| Candidate Fetch | < 0.1 | 5–15 | ~99% |
| Page Filtering | < 0.5 | 1–3 | ~95% |
| Semantic Retrieval | 10–30 | 30–80 | N/A (per-query) |
| Bandit Reranking | 1–5 | 3–10 | N/A (per-query) |
| Pin Injection | < 0.5 | < 0.5 | N/A |
| Event Logging (async) | 0 | 0 | N/A |
| Total | 12–36 | 42–114 | – |
Figure 7. Per-stage latency breakdown showing cached (green) and uncached (blue) execution times. Semantic retrieval dominates the uncached path. Event logging is fully asynchronous and does not contribute to request latency.
The dominant latency contributor is semantic retrieval (embedding the query and executing the pgvector similarity search). The all-MiniLM-L6-v2 model runs in-process via ONNX runtime, avoiding network overhead. After the initial model load (2–5 seconds on cold start), inference takes 3–8 ms per query embedding. The pgvector HNSW index lookup adds 5–20 ms depending on the candidate set size.
Event logging is fully asynchronous (fire-and-forget INSERT), contributing zero latency to the critical path.
7.6 Ablation Study
To understand the contribution of each pipeline stage, we analyze the cumulative effect of adding stages to the baseline:
| Configuration | Tool Set Size | Est. Recall@5 | Tokens/Request | Notes |
|---|---|---|---|---|
| No TSE (full catalog) | 158 | 100% (trivial) | 18,960 | All tools present; Gemini fails |
| + Policy only | 100–158 | 100% | 12,000–18,960 | Tier filtering removes admin tools |
| + Page filter | 15–40 | ~85% | 1,800–4,800 | Category-based narrowing |
| + Semantic retrieval | 7–10 | ~94% | 840–1,200 | Vector similarity ranking |
| + Bandit reranking | 7 | ~95% | 840 | Self-improving selection |
| + Pin injection | 8–9 | ~96% | 960–1,080 | Critical tools guaranteed |
The page filtering stage provides the largest single reduction in tool count (from ~120 to ~25), while semantic retrieval provides the largest improvement in relevance (from ~85% to ~94% Recall@5). The bandit adds a smaller but cumulative improvement that grows over time as the posteriors concentrate.
7.7 Page Coverage Analysis
The TSE's database-driven approach dramatically expands page coverage compared to the legacy system:
| Metric | PAGE_TOOL_MAP | TSE |
|---|---|---|
| Mapped pages | 9 | 32 |
| Total NexusROS pages | 27 | 27 |
| Coverage ratio | 33.3% | 100% |
| Coverage increase | – | 3.6× |
| Mapping entries | 9 | 32 |
| Categories used | 0 (substring-based) | 19 |
| Priority weights | No | Yes (per mapping) |
| Admin-editable | No (requires deploy) | Yes (REST API) |
| Version controlled | No | Yes (tool_versions) |
Figure 10. Tool category relevance heatmap across 9 primary NexusROS pages. Darker cells indicate higher priority mappings. The matrix shows that most pages map to 2–3 categories, with cross-page categories (intelligence, analytics) appearing in multiple columns.
8. Admin UI and Observability
The TSE exposes 22 REST endpoints through the internal-tool-registry-routes.ts module, providing comprehensive management and observability capabilities.
8.1 Tool Catalog Management
Endpoints:
- `GET /internal/tool-registry/tools` – List all tools with filtering by plugin, category, status
- `GET /internal/tool-registry/tools/:id` – Get tool detail with full metadata
- `POST /internal/tool-registry/tools` – Register a new tool
- `PUT /internal/tool-registry/tools/:id` – Update tool (creates version snapshot)
- `DELETE /internal/tool-registry/tools/:id` – Soft-delete tool
Version Control:
- `GET /internal/tool-registry/tools/:id/versions` – Version history with diff annotations
- `POST /internal/tool-registry/tools/:id/rollback` – Rollback to a previous version
8.2 Category and Page Mapping Management
Endpoints:
- `GET /internal/tool-registry/categories` – List all categories (hierarchical)
- `GET /internal/tool-registry/categories/for-page/:page` – Categories for a specific page
- `GET /internal/tool-registry/page-mappings` – All page-to-category mappings
- `POST /internal/tool-registry/page-mappings` – Create a new mapping
8.3 Visibility Rules
Endpoints:
- `GET /internal/tool-registry/visibility-rules` – List active rules
- `POST /internal/tool-registry/visibility-rules` – Create a rule
- `PUT /internal/tool-registry/visibility-rules/:id` – Update a rule
- `DELETE /internal/tool-registry/visibility-rules/:id` – Delete a rule
8.4 Analytics and Observability
Endpoints:
- `GET /internal/tool-registry/analytics/metrics` – Selection metrics (total, latency, method breakdown, outcome breakdown, average tokens saved)
- `GET /internal/tool-registry/analytics/heatmap` – Tool usage heatmap (tool × page × count)
- `GET /internal/tool-registry/analytics/accuracy` – Daily retrieval accuracy trend (hit rate over time)
- `GET /internal/tool-registry/analytics/drift` – Embedding drift detection (tools with stale or missing embeddings)
Figure 9. Admin dashboard wireframe showing the tool catalog table with status indicators, the analytics panel with KPI strip and selection method breakdown, and the page mapping matrix editor.
8.5 Operational Endpoints
Endpoints:
- `POST /internal/tool-registry/embeddings/recompute` – Trigger embedding recomputation for stale tools
- `POST /internal/tool-registry/test-selection` – Test tool selection with a simulated query (returns selected tools and metadata without logging an event)
- `POST /internal/tool-registry/plugins/sync` – Plugin self-registration (see Section 9)
The admin UI is designed to follow existing NexusROS dashboard patterns: a card-grid hub page at /dashboard/settings/tool-catalog/ with tabbed views for Catalog, Plugins, and Analytics. Detail pages use the standard PageHeader + useThemeClasses() pattern from the existing settings infrastructure.
8.6 Drift Detection
The analytics service monitors embedding freshness through a drift detection query that identifies tools with stale, missing, or outdated embeddings:
```sql
SELECT id, name, embedding_status, last_embedded_at,
       EXTRACT(EPOCH FROM (CURRENT_TIMESTAMP
         - COALESCE(last_embedded_at, '1970-01-01'::TIMESTAMPTZ))) / 86400.0
         AS days_since_embed
FROM ros.tool_registry
WHERE deleted_at IS NULL
  AND status = 'active'
  AND (embedding_status != 'current'
       OR last_embedded_at IS NULL
       OR last_embedded_at < CURRENT_TIMESTAMP - INTERVAL '7 days')
ORDER BY days_since_embed DESC
```
Tools with embeddings older than 7 days are flagged for review. This threshold balances embedding freshness against unnecessary recomputation: tool descriptions rarely change more than once per week.
9. Plugin Self-Registration Protocol
The TSE is designed to scale with the Nexus marketplace, where plugins contribute their own tools. The plugin self-registration protocol allows plugins to declare their tools, categories, and page mappings via a single REST endpoint, eliminating the need for gateway code changes when new plugins are installed.
9.1 Registration Flow
When a plugin starts or updates its tool catalog, it calls:
```
POST /internal/tool-registry/plugins/sync
X-Service-Key: <internal service key>
Content-Type: application/json

{
  "pluginId": "nexus-ros",
  "version": "2.1.0",
  "tools": [...],
  "categories": [...],
  "pageMappings": [...]
}
```
The syncPluginTools() method in ToolRegistryService processes this request:
1. Hash comparison. For each tool in the payload, compute the SHA-256 content hash over its serialized definition and compare it against the existing hash in `ros.tool_registry`.
2. Differential sync. Tools with unchanged hashes are skipped. Changed tools are updated and their `embedding_status` set to `stale`. New tools are inserted. Tools present in the database but absent from the payload are soft-deleted (`deleted_at = CURRENT_TIMESTAMP`).
3. Version snapshot. For each updated tool, a version snapshot is created in `ros.tool_versions` with the changed fields and the new content hash.
4. Category and mapping upsert. Categories from the payload are upserted (INSERT ... ON CONFLICT DO UPDATE) by `(name, plugin_id)`. Page mappings are upserted by `(page_pattern, category_id, plugin_id)`.
5. Embedding trigger. The embedding pipeline's next batch run picks up tools with `embedding_status = 'stale'` and recomputes their vectors.
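The classification at the heart of the differential sync can be sketched as follows; the names are illustrative, not the actual `syncPluginTools()` internals.

```typescript
interface SyncTool {
  id: string;
  hash: string; // SHA-256 content hash from the plugin payload
}

// Classify payload tools against stored hashes: new tools are inserted,
// changed tools are updated (and later marked stale for re-embedding),
// unchanged tools are skipped, and missing tools are soft-deleted.
function diffSync(payload: SyncTool[], stored: Map<string, string>) {
  const inserted: string[] = [];
  const updated: string[] = [];
  const unchanged: string[] = [];
  for (const t of payload) {
    const prev = stored.get(t.id);
    if (prev === undefined) inserted.push(t.id);
    else if (prev !== t.hash) updated.push(t.id);
    else unchanged.push(t.id);
  }
  const payloadIds = new Set(payload.map((t) => t.id));
  const softDeleted = Array.from(stored.keys()).filter((id) => !payloadIds.has(id));
  return { inserted, updated, unchanged, softDeleted };
}
```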
Figure 12. Sequence diagram showing the plugin self-registration protocol. The plugin declares its tools, the registry performs differential sync via SHA-256 hash comparison, and changed tools are queued for re-embedding.
9.2 Sync Efficiency
For a plugin with $n$ tools where $m$ have changed since the last sync:
- Hash comparisons: $n$ (one per tool, in-memory after the initial query)
- Database writes: $m$ (only changed tools updated)
- Embedding recomputations: $m$ (only stale embeddings recomputed)
- Total database operations: $O(n + 2m)$ ($n$ reads + $m$ writes + $m$ version inserts)

For typical updates (2–3 tool changes per deployment), this means 158 reads and 6–9 writes, completing in under 100ms.
9.3 Version Control
Each tool maintains a schema_version field following semantic versioning (semver). The plugin sync endpoint preserves version information and enforces that:
- Major version changes (e.g., 1.x → 2.x) flag the tool for review in the admin dashboard
- Minor and patch changes are applied automatically
- Rollback restores the full tool record from the version snapshot in `ros.tool_versions`
10. Discussion
10.1 Graceful Degradation
Every stage of the TSE pipeline is designed to fail open β that is, a stage failure results in the current candidate set being passed to the next stage without reduction, rather than a hard error. This design principle ensures that the chat system never fails due to TSE errors:
- Stage 1 failure (policy query error): all tools are visible (null allow set)
- Stage 3 failure (page mapping error): all candidates pass through
- Stage 4 failure (embedding unavailable): category-filtered candidates are used
- Stage 5 failure (bandit stats unavailable): semantic-ranked order is preserved
Additionally, the chat orchestrator wraps the entire TSE call in a try-catch block, falling back to the legacy PAGE_TOOL_MAP if the TSE throws any error. This two-level safety net (graceful degradation within the TSE and legacy fallback outside it) makes the TSE a zero-risk upgrade.
10.2 Cold Start
New tools and new contexts face a cold start problem: without historical interaction data, the Thompson Sampling bandit has no signal to guide selection.
The TSE addresses cold start through three mechanisms:
-
Uninformative prior. New tools receive Beta(1, 1) priors, a uniform distribution that assigns equal probability to all success rates. This means new tools have a non-trivial chance of being sampled, especially when the exploration slot is allocated to them.
-
Exploration slot. One of the selection slots is explicitly reserved for exploration: the highest-scoring tool not in the exploitation set. This guarantees that underexplored tools are tried.
-
Semantic retrieval. Tools with good example queries will rank highly in semantic retrieval regardless of bandit history. The bandit only reranks the semantically retrieved set; it does not replace it.
10.3 Context Key Granularity
The current context key format, ${tier}:${page}:general, partitions bandit statistics by subscription tier and dashboard page. The general suffix is a placeholder for future query category classification.
Finer granularity (e.g., ${tier}:${page}:${queryCategory}) would enable the bandit to learn different tool preferences for different types of queries on the same page. For example, on the contacts page, "show me my contacts" (read intent) should select list_contacts, while "create a new lead" (write intent) should select create_contact. Query category classification, using a lightweight classifier or keyword matching, is planned for a future release.
10.4 Multi-Provider Compatibility
The TSE's default $K = 7$ was chosen to satisfy the constraints of all four supported AI providers:
- Gemini: Fails at ~80 tools, comfortable at 7
- Claude (Anthropic): Handles 158 tools but recommends < 50; 7 is well within range
- OpenRouter: Provider-dependent; 7 is universally safe
- Claude Max: Similar to Anthropic; 7 is optimal
By standardizing on a small $K$, the TSE eliminates the need for provider-specific tool count negotiation.
10.5 Limitations
Embedding quality dependency. The semantic retrieval stage is only as good as the tool embeddings, which in turn depend on tool documentation quality. The Tool-DE study [arXiv:2510.22670] found that 41.6% of tools in large catalogs lack adequate documentation. Poorly documented tools will have low-quality embeddings and may be systematically under-retrieved. Mitigating this requires investing in the function_summary, when_to_use, and example_queries metadata fields.
Cold start latency. The all-MiniLM-L6-v2 model takes 2–5 seconds to load on first invocation. This is amortized across all subsequent requests within the process lifetime, but cold restarts (e.g., pod restarts) incur the loading penalty on the first request.
Pin inflation. Pinned tools can inflate the tool count beyond $K$. If 5 tools are pinned and $K = 7$, the LLM receives 12 tools: still well below the 30–50 threshold, but potentially suboptimal for simple queries where fewer tools would suffice.
Single-tool assumption. The Recall@K metric and bandit reward model assume that the user intends to invoke a single tool per query. Multi-tool workflows, where a single query triggers a sequence of tool calls, are not explicitly modeled.
10.6 Future Work
Query category classification. Adding a lightweight classifier to categorize queries (read, write, analyze, generate) would enable finer-grained bandit contexts and improve retrieval for queries with ambiguous intent.
Cross-plugin tool deduplication. As multiple plugins contribute tools with overlapping functionality, the TSE could detect semantic duplicates via embedding similarity and present only the highest-quality version.
A/B testing framework. Exposing the ability to A/B test different selection configurations (K values, embedding models, bandit variants) would enable data-driven optimization of the pipeline parameters.
Federated bandit learning. Sharing anonymized bandit statistics across organizations (with differential privacy guarantees) could accelerate convergence for new deployments.
11. Conclusion
We presented the Nexus Tool Selection Engine, a database-driven, self-improving tool intelligence system that addresses the scaling challenge of tool selection in large-scale AI orchestration platforms. The TSE replaces a brittle, hardcoded substring-matching filter with a principled five-stage pipeline that combines access control, contextual filtering, semantic retrieval, and Thompson Sampling to select 7 tools from a catalog of 158, achieving a 95.6% reduction in tool-schema token overhead while targeting retrieval accuracy consistent with published baselines.
Three design principles distinguish the TSE from prior work:
Database-driven configurability. All tool definitions, page mappings, visibility rules, and version history reside in seven PostgreSQL tables, enabling administrators to modify the system without code deployments. This is a qualitative shift from the hardcoded approach, where every change required a gateway redeployment.
Self-improving selection. Thompson Sampling with Beta-distributed priors creates a closed-loop system that autonomously improves tool selection based on outcome feedback. Tools that succeed in a context are promoted; those that fail are demoted. The exploration slot prevents premature convergence, and the contextual key ensures that different pages learn different tool preferences.
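The closed loop described above can be sketched compactly: each (context, tool) pair keeps success and failure counts, and ranking draws one sample from each tool's Beta(successes+1, failures+1) posterior. In this sketch Python's `random.betavariate` stands in for the paper's Marsaglia–Tsang-based sampler, and the counts and tool names are illustrative:

```python
import random

def thompson_rank(stats: dict[str, tuple[int, int]], top_k: int) -> list[str]:
    """Rank tools for one context key (e.g., a dashboard page) by a single
    draw from each tool's Beta(successes+1, failures+1) posterior.

    Well-proven tools usually draw high values (promotion); tools with many
    failures draw low values (demotion); unseen tools draw from Beta(1, 1),
    i.e., uniformly, which keeps exploration alive.
    """
    draws = {
        tool: random.betavariate(s + 1, f + 1)  # one posterior sample per tool
        for tool, (s, f) in stats.items()
    }
    return sorted(draws, key=draws.get, reverse=True)[:top_k]

random.seed(0)
stats = {"create_deal": (40, 5), "send_email": (2, 20), "export_csv": (0, 0)}
print(thompson_rank(stats, top_k=2))
```

Because ranking is by posterior sample rather than posterior mean, a cold-start tool like `export_csv` occasionally outdraws an established one, which is exactly the exploration behavior that prevents premature convergence.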
Plugin extensibility. The self-registration protocol allows marketplace plugins to declare their tools, categories, and page mappings via a single REST endpoint. The SHA-256 content hash synchronization ensures that only changed tools are re-embedded, keeping synchronization cost proportional to the rate of change rather than the catalog size.
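The change-detection step can be illustrated with a canonical-JSON hash: serializing each tool definition with sorted keys before hashing makes the digest insensitive to field ordering, so re-embedding is triggered only by genuine content changes. A minimal sketch, with a hypothetical tool payload (the TSE's exact serialization format is not specified here):

```python
import hashlib
import json

def tool_content_hash(tool: dict) -> str:
    """Stable SHA-256 over a canonical JSON serialization of a tool definition."""
    canonical = json.dumps(tool, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

registered = {"name": "create_deal", "description": "Create a new CRM deal", "version": 1}
incoming = {"version": 1, "name": "create_deal", "description": "Create a new CRM deal"}
print(tool_content_hash(registered) == tool_content_hash(incoming))  # True: skip re-embedding

changed = dict(incoming, description="Create a CRM deal with pipeline stage")
print(tool_content_hash(registered) == tool_content_hash(changed))   # False: re-embed
```

On registration, the gateway need only compare stored digests against incoming ones, which is what keeps synchronization cost proportional to the rate of change rather than the catalog size.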
The TSE is deployed within the Adverant Nexus platform, a 44-microservice AI orchestration system, where it serves the NexusROS Revenue Operating System and its ecosystem of marketplace plugins. By expanding page coverage from 9 hardcoded entries to 32 database-driven mappings (a 3.6× increase), the TSE eliminates Gemini INVALID_ARGUMENT errors, reduces per-request token costs by 18,120 tokens, and provides complete observability over tool selection behavior through 22 REST endpoints.
The system's modular pipeline architecture, graceful degradation properties, and zero-downtime migration path from legacy filtering make it suitable for adoption by other AI orchestration platforms facing similar tool scaling challenges.
References
[1] Mudunuri, S., et al. (2025). "Scalable Vector Retrieval for Tool Selection in Large-Scale Agent Systems." arXiv preprint arXiv:2603.20313.
[2] Zhang, Y., et al. (2025). "Tool-DE: Tool Documentation Enhancement for Improved Retrieval in Large Language Model Agents." arXiv preprint arXiv:2510.22670.
[3] Chen, W., et al. (2025). "LiveMCPBench: Benchmarking Tool-Augmented LLM Agents in Large-Scale MCP Ecosystems." arXiv preprint arXiv:2508.01780.
[4] Li, J., et al. (2025). "Tool2Vec: Learning Dense Representations of Tools from Usage Patterns." Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025 Findings).
[5] Kumar, A., et al. (2025). "Toolshed: RAG-Tool Fusion for Scalable Tool Retrieval." arXiv preprint arXiv:2410.14594. Proceedings of ICAART 2025.
[6] Wang, H., et al. (2025). "Tool Inertia Graph: Exploiting Temporal Locality in Tool Selection." arXiv preprint arXiv:2511.14650.
[7] Park, S., et al. (2025). "Thompson Sampling for Dynamic Tool Selection in Multi-Agent Systems." arXiv preprint arXiv:2512.03065.
[8] Liu, R., et al. (2026). "AgentGuardian: Per-Context Access Control Policies for Tool Governance." arXiv preprint arXiv:2601.10440.
[9] Patel, D., et al. (2025). "ScaleMCP: Hash-Based Synchronization for Distributed Tool Registries." arXiv preprint arXiv:2505.06416.
[10] Zhao, M., et al. (2025). "ToolRegistry: A Centralized Approach to Tool Management in LLM Agent Ecosystems." arXiv preprint arXiv:2507.10593.
[11] Kim, S., et al. (2025). "ToolScope: Hybrid Retrieval for Accurate Tool Selection." arXiv preprint arXiv:2510.20036.
[12] Yang, X., et al. (2025). "AutoTool: Dynamic Tool Selection for Multi-Modal AI Agents." arXiv preprint arXiv:2512.13278. Proceedings of ICCV 2025.
[13] Anthropic. (2025). "Tool Use Best Practices." Anthropic Documentation. docs.anthropic.com
[14] Marsaglia, G. and Tsang, W. W. (2000). "A Simple Method for Generating Gamma Variables." ACM Transactions on Mathematical Software, 26(3), 363β372. doi:10.1145/358407.358414
[15] Thompson, W. R. (1933). "On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples." Biometrika, 25(3/4), 285β294. doi:10.2307/2332286
[16] Qin, Y., et al. (2024). "ToolBench: A Benchmark for Tool-Augmented LLMs." Proceedings of the 12th International Conference on Learning Representations (ICLR 2024).
[17] OpenAI. (2023). "Function Calling in the Chat Completions API." OpenAI Documentation.
