
Unified Nexus Orchestrator: Separation of Dispatch and Execution in Multi-Chain AI Workload Platforms

A systems architecture paper presenting the strict separation of dispatch and execution in AI workload orchestration. The Unified Nexus Orchestrator (UNO) routes but never executes -- it validates, resolves skills, enforces governance, and enqueues to BullMQ. nexus-workflows workers are the sole execution engine with four tiers: LLM-only, ReAct tool-using, chain DAG, and autonomous agent patterns. Includes multi-provider AI routing, span-tree observability, and a 9-phase migration strategy.

Adverant Research Team · 2026-04-09 · 129 min read · 32,082 words


Adverant Research Team, Adverant, Ltd. Email: research@adverant.ai


Abstract

Large-scale AI platforms increasingly serve heterogeneous workload types -- real-time chat, asynchronous LLM inference, GPU-intensive compute, and multi-step chained pipelines -- yet the dominant architectural pattern remains a monolithic orchestrator that both dispatches and executes these workloads within a single process boundary. This conflation produces observable failures: Redis connection drops under sustained load, overlapping execution paths between competing job processors, governance bypass when autonomous agents call LLM providers directly, and scaling bottlenecks that make it impossible to independently tune dispatch throughput against execution concurrency.

We present the Unified Nexus Orchestrator (UNO), a system serving 67 microservices across a Kubernetes cluster with Istio service mesh, PostgreSQL, Redis, Neo4j, and Qdrant. UNO enforces a strict architectural separation: the dispatcher validates, resolves skills from a PostgreSQL registry, classifies governance risk, and enqueues jobs to BullMQ -- then returns HTTP 202 Accepted. It never executes. Execution is the sole responsibility of nexus-workflows, a pool of BullMQ workers that process jobs across four tiers: single-shot LLM inference (Tier 1), tool-using ReAct loops (Tier 2), multi-step chain DAG coordination (Tier 3), and autonomous agent patterns (Tier 4). Tiers 1–3 are deployed in nexus-workflows; Tier 4 autonomous agent patterns currently exist in nexus-mageagent (a separate service) and are planned for migration into nexus-workflows. LLM calls route through a unified AI Provider Router; multi-provider routing per organization (routing hints resolved from the skill registry) is designed and described here but not yet deployed -- current production supports one provider per org.

Our contributions include: (1) a formal separation of dispatch and execution that eliminates an entire class of reliability failures; (2) a four-tier execution taxonomy that maps cleanly to resource isolation and governance requirements; (3) multi-provider AI routing architecture with per-organization configuration and skill-level routing hints (designed; single-provider deployment active); (4) chain DAG coordination via a completion listener that handles fork, join, and conditional dispatch without blocking the dispatcher; (5) span-tree observability that records every operation -- LLM call, tool invocation, governance check -- as a hierarchical span with parent-child relationships; and (6) governance pre-checks integrated at the dispatch boundary, including risk classification, data residency enforcement, and human-in-the-loop gates compliant with EU AI Act requirements. We draw on 14 months of operational experience with the monolithic predecessor (nexus-trigger) and describe three categories of failure that motivated the separation. nexus-trigger remains running as of Q2 2026 with deprecation in progress as nexus-workflows consumers are validated.


1. Introduction

Consider what happens when a single service is responsible for deciding what to do and then doing it. The decision logic and the execution logic share a thread pool. They share a Redis connection. They share a failure domain. When execution stalls -- an LLM provider takes 47 seconds to respond, a tool invocation hangs on a network partition, an autonomous agent enters a reasoning loop that consumes memory geometrically -- the dispatcher stalls with it. New jobs cannot be accepted. Health checks fail. The Kubernetes liveness probe restarts the pod. Queued work vanishes.

This is not a hypothetical failure mode. It is what we observed, repeatedly, in the twelve months preceding the architecture described in this paper.

1.1 The Multi-Chain AI Workload Problem

Modern AI platforms do not serve a single workload type. They serve four fundamentally different execution chains, each with distinct latency requirements, resource profiles, governance constraints, and failure semantics:

Chain 1: Real-time chat. Sub-second latency is non-negotiable. The user is watching a cursor blink. Streaming token delivery via Server-Sent Events or WebSocket demands that the serving path contain no job queues, no persistence overhead, no governance pre-checks beyond basic rate limiting. This chain is the exception -- it bypasses the orchestrator entirely and routes through dedicated proxy endpoints on the API gateway.

Chain 2: Asynchronous LLM inference. A service requests prose generation, document summarization, code review, or classification. The caller does not need the result immediately -- it needs an acknowledgment that the work has been accepted, a job identifier for tracking, and a notification when the result is ready. Latency tolerance ranges from seconds to minutes. This is the bread and butter of platform AI usage: high volume, variable duration, amenable to batching and priority queuing.

Chain 3: GPU-intensive compute. Model fine-tuning, embedding generation over large corpora, image synthesis, video rendering. These jobs may run for hours. They require specialized hardware scheduling, checkpoint recovery, and resource reservation that bears no resemblance to the request-response cycle of chains 1 and 2.

Chain 4: Chained pipelines. A patent analysis that requires: (a) document ingestion and OCR, (b) prior art search across a knowledge graph, (c) claim extraction via LLM, (d) novelty scoring via a second LLM with different temperature settings, (e) human review of flagged claims, and (f) final report generation. This is a directed acyclic graph of heterogeneous steps -- some LLM, some tool, some human -- with conditional branching, fan-out parallelism, and join synchronization.

The temptation -- the one we succumbed to -- is to handle all four chains in a single orchestrator service. After all, they all involve "calling an LLM at some point." The shared infrastructure (Redis for state, PostgreSQL for persistence, WebSocket for notifications) makes consolidation feel natural. Elegant, even.

It is neither.

1.2 Why Monolithic Dispatch-and-Execute Fails

Our pre-UNO architecture had a service called nexus-trigger that received job requests, resolved which LLM to call, constructed prompts, invoked the AI Provider Router, waited for the response, processed tool calls, and delivered results -- all in the same process, the same event loop, the same Redis connection pool.

Three categories of failure emerged with increasing frequency as the platform scaled from 12 to 67 microservices:

Redis connection exhaustion. The nexus-trigger service maintained Redis connections for job state, pub/sub notifications, and BullMQ queue management. Long-running Tier 2 (tool-using) jobs held connections open for the duration of their ReAct loops -- sometimes 30 seconds, sometimes 3 minutes. Under load, the connection pool saturated. New dispatch requests could not acquire connections. The entire dispatch pipeline froze, not because anything was wrong with dispatching, but because execution had consumed all shared resources.

Overlapping execution paths. As the platform grew, a second service -- nexus-mageagent -- was introduced to handle autonomous agent workloads. It implemented its own execution loop, its own prompt construction, its own tool-calling logic. Critically, it called the OpenRouter API directly, bypassing the AI Provider Router entirely. We now had two services executing LLM workloads with no shared governance, no unified observability, no consistent provider routing. When an organization's administrator changed their AI provider from OpenRouter to Gemini in the settings panel, nexus-mageagent kept calling OpenRouter. The administrator's configuration was, from their perspective, simply ignored.

Governance bypass. The original nexus-trigger had a rudimentary risk classification: if the job type contained the substring "security," it logged a warning. That was the extent of governance. There was no data residency check (EU customer data flowing to a US-hosted model endpoint), no human-in-the-loop gate for high-risk operations (an autonomous agent with kubectl delete authority), no audit trail linking a specific LLM call to the organization, user, and skill that authorized it. Governance was not merely weak -- it was architecturally impossible to implement correctly, because the dispatch boundary (where governance checks belong) and the execution boundary (where LLM calls happen) were the same boundary.

1.3 The Separation Principle

The core insight is deceptively simple: dispatch and execution are different responsibilities with different scaling characteristics, different failure domains, and different governance requirements. They must not share a process boundary.

Dispatch is fast, stateless (modulo a database write), and must never block. It validates input. It resolves a skill from a registry. It classifies risk. It inserts a run record. It enqueues a job. It emits a WebSocket event. It returns 202 Accepted. Total wall time: under 50 milliseconds for the 99th percentile.

Execution is slow, stateful, and may involve arbitrary external systems. It dequeues a job. It constructs prompts. It calls an LLM provider. It parses the response. It may invoke tools. It may loop (ReAct). It may fork into parallel sub-tasks. It may wait for human approval. It may fail and retry with exponential backoff. Duration: anywhere from 800 milliseconds (a simple classification) to 45 minutes (an autonomous agent completing a multi-step investigation).

These two responsibilities have nothing in common except a shared data model (the job record) and a shared communication channel (the BullMQ queue). By separating them into independent services -- UNO for dispatch, nexus-workflows for execution -- we gain:

  1. Independent scaling. Dispatch scales horizontally to absorb burst traffic. Execution scales vertically (more workers, larger instances) to handle compute-intensive jobs. Neither constrains the other.

  2. Fault isolation. An execution worker that crashes, hangs, or exhausts memory does not affect dispatch availability. New jobs continue to be accepted and queued. The worker restarts and picks up where it left off (or from a checkpoint, for Tier 3 chains).

  3. Governance at the boundary. Every job passes through UNO before execution begins. UNO is the single point where risk classification, data residency checks, rate limiting, and human-in-the-loop gates are enforced. No execution path bypasses this boundary -- not nexus-mageagent, not a new plugin, not a developer's shortcut. The Istio AuthorizationPolicy, application-layer service key validation, and caller identity verification enforce this at three independent layers.

  4. Unified observability. UNO creates the root span for every job. Execution workers create child spans. The span tree -- stored in execution_spans -- provides a complete, hierarchical record of every operation: dispatch → governance check → queue → dequeue → LLM call → tool invocation → response. This is not bolted-on logging; it is structural observability built into the execution model itself.

1.4 Contributions

This paper makes the following contributions:

  1. Architectural separation of dispatch and execution in multi-chain AI workload platforms, with a formal argument for why these responsibilities must not share a process boundary, and operational evidence from 14 months running the monolithic predecessor (nexus-trigger) demonstrating the class of reliability failures the separation eliminates.

  2. A four-tier execution taxonomy (LLM-only, tool-using ReAct, chain DAG, autonomous agent) that maps workload types to resource isolation strategies, governance requirements, and observability granularity. Tiers 1–3 are deployed in nexus-workflows; Tier 4 autonomous agent patterns exist in nexus-mageagent (separate service, migration in progress).

  3. Multi-provider AI routing architecture with per-organization configuration, where organizations enable multiple LLM providers (e.g., Gemini for low-latency classification, Claude for complex reasoning) and routing hints in the skill registry determine provider selection per job type -- without any provider-specific code in calling services. Note: this routing architecture is designed and described here; current production deployment supports one provider per org; multi-provider is planned.

  4. Chain DAG coordination via completion listeners, where the dispatcher (UNO) maintains DAG state and dispatches subsequent steps as predecessors complete, supporting fork (parallel fan-out), join (barrier synchronization), and conditional branching -- without blocking the dispatcher or requiring execution workers to understand DAG topology.

  5. Span-tree observability that records every operation as a hierarchical span with parent-child relationships, stored in a PostgreSQL execution_spans table, providing causal tracing from dispatch through governance checks, queue transit, LLM provider calls, tool invocations, and result delivery.

  6. Governance pre-checks at the dispatch boundary, including automated risk classification (low/medium/high/critical), data residency enforcement based on organization jurisdiction, and human-in-the-loop gates for high-risk operations -- integrated into the dispatch path rather than bolted onto execution, ensuring no execution path can bypass governance.

  7. Unified WebSocket event delivery via Redis Pub/Sub channels namespaced by organization, where execution workers publish progress events that the API gateway relays to authenticated Socket.IO rooms -- eliminating polling and providing real-time job status without coupling the dashboard to the execution layer.

1.5 Paper Organization

The remainder of this paper is organized as follows. Section 2 surveys related work in LLM serving systems, workflow orchestration, LLM routing, agent architectures, service mesh security, and AI governance. Section 3 presents the system architecture, detailing the dispatch service (UNO), the execution layer (nexus-workflows), and the AI Provider Router. Section 4 describes the four execution tiers in depth, with particular attention to the ReAct loop implementation (Tier 2) and chain DAG coordination (Tier 3). Section 5 covers the governance model, including risk classification, data residency, and human-in-the-loop gates. Section 6 presents the span-tree observability system. Section 7 reports production experience, including quantitative reliability metrics and three case studies of failures prevented by the separation principle. Section 8 discusses limitations and future work. Section 9 concludes.


2. Related Work

The architecture described in this paper sits at the intersection of several active research areas: LLM serving infrastructure, workflow orchestration, model routing and cost optimization, autonomous agent frameworks, service mesh security, and AI governance. We survey each in turn, positioning our contributions relative to the state of the art.

2.1 LLM Serving Systems

The foundational challenge in LLM serving is memory management for the key-value cache. Kwon et al. [1] introduced PagedAttention in vLLM, applying virtual memory paging techniques to KV cache management and achieving 2--4x throughput improvements over FasterTransformer. Their insight -- that KV cache blocks need not be contiguous in GPU memory -- eliminated the fragmentation waste that plagued earlier serving systems. vLLM has since become the de facto standard for single-node LLM inference.

Before PagedAttention, Yu et al. [2] proposed Orca, a distributed serving system that introduced iteration-level scheduling: rather than processing an entire batch to completion before admitting new requests, Orca schedules at the granularity of individual decoding iterations. Requests that finish early return immediately; new requests join the batch without waiting. On GPT-3 175B, Orca demonstrated 36.9x throughput improvement at equivalent latency -- a result that fundamentally changed how the community thinks about LLM batch scheduling.

More recent work has pushed further. Zheng et al. [3] introduced SGLang, a domain-specific language for structured LLM programs that co-optimizes the frontend (a Python-embedded DSL with generation primitives and parallelism control) and the runtime (RadixAttention for KV cache reuse across related prompts, compressed finite state machines for constrained decoding). SGLang achieves up to 6.4x throughput improvement for multi-call LLM programs -- precisely the pattern that Tier 2 and Tier 3 workloads in our system generate. Agrawal et al. [4] addressed the throughput-latency tradeoff with Sarathi-Serve, which uses chunked prefills to eliminate the decode stalls that occur when long prefill operations monopolize the GPU. Their stall-free scheduling achieves up to 6.9x throughput improvement for Falcon-180B on 8 A100 GPUs.

These systems solve a problem orthogonal to ours. vLLM, Orca, SGLang, and Sarathi-Serve optimize how an LLM processes tokens. UNO optimizes which LLM processes which job, when, under what governance constraints, with what observability. Our AI Provider Router sits above the serving layer: it selects a provider, formats the request according to the provider's API, and relays the response. The serving system is a black box behind an API endpoint. In principle, an organization could deploy vLLM with PagedAttention as a self-hosted provider and register it alongside Gemini and Claude -- the routing layer is agnostic.

What these papers do illuminate is why execution must be separated from dispatch. An LLM call is not a fast operation. Even with Orca's iteration-level scheduling and vLLM's memory efficiency, a single inference request to a 70B-parameter model takes hundreds of milliseconds to seconds. A ReAct loop with five tool-call iterations takes tens of seconds. A dispatcher that blocks on these calls -- as nexus-trigger did -- becomes the bottleneck, regardless of how efficient the underlying serving system is.

2.2 Workflow Orchestration

The problem of coordinating multi-step, potentially long-running, failure-prone workflows across distributed services has a rich history. Apache Airflow [5], originally developed at Airbnb, popularized the DAG-as-code paradigm: workflows are defined as Python scripts specifying task dependencies, and the Airflow scheduler triggers tasks when their upstream dependencies complete. Airflow's architecture -- a centralized scheduler polling a metadata database -- works well for batch-oriented data pipelines but struggles with the latency requirements and dynamic branching patterns of AI workloads. When a Tier 3 chain needs to conditionally fork based on an LLM's output, Airflow's static DAG definition is insufficient.

Dagster [6] introduced software-defined assets, shifting the abstraction from "tasks to execute" to "assets to materialize." This declarative approach improves observability and lineage tracking but retains the batch-oriented execution model. Neither Airflow nor Dagster was designed for the millisecond-to-minutes latency spectrum of AI workloads, nor for the dynamic DAG modification that chain coordination requires.

Temporal [7] represents a fundamentally different approach: durable execution. Workflows are written as ordinary code (Go, Java, TypeScript, Python), and the Temporal server persists execution state transparently -- function call boundaries, local variables, timer expirations. If a worker crashes mid-workflow, another worker replays the execution history and resumes from where it left off. Netflix runs over 500,000 daily workflows on Temporal [7]. Nadeem and Malik [8] ported the TrainTicket microservices benchmark to Temporal and found that orchestrated microservices were significantly easier to debug than choreographed ones, lending empirical support to the orchestration pattern.

Netflix's Maestro [9] is a horizontally scalable workflow orchestrator that processes millions of jobs daily. Unlike Airflow's static DAGs, Maestro supports cyclic workflows, foreach loops, subworkflows, and conditional branches -- capabilities directly relevant to our Tier 3 chain coordination. Maestro's design influenced our chain DAG coordinator, particularly the pattern of separating the workflow definition (the DAG topology) from the workflow execution (the job processing).

Our approach differs from all of the above in a critical respect: UNO is not a workflow engine. It does not execute workflows. It dispatches them. The chain DAG coordinator in UNO listens for child job completions and dispatches the next step -- but it does not run the step. This is the separation principle applied to workflow coordination itself. Temporal's model, by contrast, couples the coordination logic (the workflow function) with the execution environment (the worker). If a Temporal worker stalls on a long-running activity, the workflow function's progress is blocked. In our architecture, DAG coordination (in UNO) is never blocked by job execution (in nexus-workflows), because they are different services with different event loops and different failure domains.

2.3 LLM Routing and Cost Optimization

The proliferation of LLM providers with dramatically different cost-performance profiles has created a model selection problem. Chen et al. [10] introduced FrugalGPT, which proposes three cost-reduction strategies: prompt adaptation (reducing token count), LLM approximation (fine-tuning smaller models), and LLM cascade (sequentially querying increasingly expensive models until a quality threshold is met). FrugalGPT matches GPT-4 quality with up to 98% cost reduction by cascading through cheaper models first and only escalating when confidence is low.

Ong et al. [11] developed RouteLLM, a framework that trains router models to dynamically select between a stronger and weaker LLM per query, using human preference data. RouteLLM achieves over 85% cost reduction on MT-Bench without quality degradation and demonstrates transfer learning: routers trained on one pair of models generalize to different model pairs at test time. Dekoninck et al. [12] unified routing and cascading into a single framework, showing that the optimal strategy combines elements of both -- route easy queries directly to cheap models, cascade ambiguous queries through increasingly capable ones.

Our multi-provider routing mechanism shares the goal of cost-performance optimization but differs in two important ways. First, routing decisions in UNO are not learned from preference data or quality scores -- they are configured per organization and per skill. An organization's administrator selects which providers to enable and sets a default. Skill definitions in the registry include a routing_hint field (e.g., fast, reasoning, cost_optimized) that maps to the organization's provider configuration. This is an explicit, auditable routing policy, not an opaque classifier. Second, our routing is multi-provider by design: an organization may have Gemini enabled for fast classification tasks and Claude enabled for complex reasoning tasks simultaneously, with the skill registry determining which provider handles which job type. FrugalGPT and RouteLLM assume a single pipeline of models ranked by capability; we assume a heterogeneous landscape where different providers excel at different task categories.
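
To make the mechanism concrete, the sketch below shows one way the hint-to-provider resolution could be expressed at dispatch time. The OrgAiConfig shape, the hintMap field, and the resolveProvider helper are illustrative assumptions rather than the production schema; only the routing_hint values (fast, reasoning, cost_optimized) come from the text above.

TypeScript
19 lines
// Illustrative shapes only -- the production configuration schema may differ.
type RoutingHint = 'fast' | 'reasoning' | 'cost_optimized';

interface OrgAiConfig {
  defaultProvider: string;                        // e.g. 'gemini'
  enabledProviders: string[];                     // e.g. ['gemini', 'claude']
  hintMap: Partial<Record<RoutingHint, string>>;  // e.g. { reasoning: 'claude' }
}

// Resolve the provider for a job: the skill's routing_hint first, the org default otherwise.
// The chosen provider is recorded on the run so every routing decision stays auditable.
function resolveProvider(org: OrgAiConfig, hint?: RoutingHint): string {
  const candidate = (hint && org.hintMap[hint]) || org.defaultProvider;
  if (!org.enabledProviders.includes(candidate)) {
    // No silent fallback: a provider the org has not enabled is a loud configuration error.
    throw new Error(`Provider '${candidate}' is not enabled for this organization`);
  }
  return candidate;
}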

The trade-off is clear. Learned routers can discover non-obvious cost-performance sweet spots that explicit configuration misses. Explicit configuration provides auditability, predictability, and compliance with organizational procurement policies. In a multi-tenant enterprise platform where organizations have contractual obligations about which providers process their data, explicit routing is not merely preferable -- it is required.

2.4 Agent Architectures

The ReAct paradigm, introduced by Yao et al. [13], demonstrated that interleaving reasoning traces with tool-calling actions produces more reliable and interpretable agent behavior than either reasoning or acting alone. On HotpotQA, ReAct overcame hallucination by grounding reasoning in retrieved evidence; on ALFWorld and WebShop, it outperformed reinforcement learning baselines by 34% and 10% absolute success rate, respectively, using only one or two in-context examples.

ReAct is the execution model for our Tier 2 workloads. But Yao et al.'s formulation assumes a single agent running in a single process -- there is no notion of dispatch, queuing, or multi-tenancy. Our contribution is operationalizing ReAct in a production multi-tenant environment: the ReAct loop runs inside a nexus-workflows worker, each iteration generates a child span in the span tree, tool invocations are mediated through a tool registry with per-organization permissions, and the maximum iteration count is configurable per skill (defaulting to 5).

The autonomous agent pattern -- exemplified by systems like AutoGPT [14] -- extends ReAct with persistent memory, goal decomposition, and self-directed planning. Wang et al. [15] survey the landscape of LLM-based autonomous agents, identifying profiling, memory, planning, and action as core architectural components. Guo et al. [16] extend this to multi-agent systems, examining communication protocols, role specialization, and emergent collaboration.

Our Tier 4 (autonomous agent) execution mode incorporates elements of this work but adds critical production constraints absent from research prototypes. AutoGPT, as described in the literature, has no dispatch boundary, no governance pre-check, no resource budget, and no kill switch. It runs until it decides it is done -- or until it exhausts resources. Our Tier 4 implementation wraps autonomous agent behavior in the same dispatch-execute separation: UNO classifies the job as risk_level: critical, enforces a human-in-the-loop gate before execution begins, sets a resource budget (maximum LLM calls, maximum wall time, maximum token expenditure), and the execution worker enforces these limits with hard termination. The agent is autonomous within bounds -- bounds set at the dispatch boundary, not by the agent itself.
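
As a sketch of how an execution worker might enforce such a budget, consider the loop below. The ResourceBudget and AgentJob shapes and the BudgetExceededError class are hypothetical stand-ins for illustration; the real worker interfaces are not shown in this paper.

TypeScript
33 lines
// Hypothetical shapes -- illustrative of the budget idea, not the production interfaces.
interface ResourceBudget {
  maxLlmCalls: number;    // hard cap on provider invocations
  maxWallTimeMs: number;  // hard cap on total execution time
  maxTokens: number;      // hard cap on cumulative token expenditure
}

interface AgentJob {
  isDone(): boolean;
  step(): Promise<{ tokensUsed: number }>;  // one reason/act iteration
}

class BudgetExceededError extends Error {
  constructor(limit: string) { super(`Resource budget exceeded: ${limit}`); }
}

// The budget is set by UNO at dispatch time; the worker checks it before every step
// and hard-terminates the agent the moment any limit is crossed.
async function runBoundedAgent(job: AgentJob, budget: ResourceBudget): Promise<void> {
  const startedAt = Date.now();
  let llmCalls = 0;
  let tokens = 0;

  while (!job.isDone()) {
    if (llmCalls >= budget.maxLlmCalls) throw new BudgetExceededError('llm_calls');
    if (Date.now() - startedAt >= budget.maxWallTimeMs) throw new BudgetExceededError('wall_time');
    if (tokens >= budget.maxTokens) throw new BudgetExceededError('tokens');

    const result = await job.step();
    llmCalls += 1;
    tokens += result.tokensUsed;
  }
}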

Masterman et al. [17] surveyed emerging agent architectures for reasoning, planning, and tool calling, noting the progression from single-agent systems to multi-agent orchestration. Their taxonomy -- single agent, vertical (specialized agents in a pipeline), horizontal (peer agents collaborating), and hybrid -- maps onto our tier system: Tier 2 is a single tool-using agent, Tier 3 can represent vertical pipelines, and Tier 4 can implement horizontal or hybrid patterns when the autonomous agent spawns sub-jobs.

2.5 Service Mesh and Zero-Trust Enforcement

The separation principle requires enforcement: it is not sufficient to design the system so that only nexus-workflows calls the AI Provider Router -- we must guarantee it. Istio [18], the service mesh running on our Kubernetes cluster, provides the enforcement mechanism through AuthorizationPolicy resources that operate at the sidecar proxy level. An AuthorizationPolicy specifying that POST /internal/ai/chat is permitted only from the nexus-orchestrator service account blocks unauthorized callers before the request reaches the application container. This is network-layer enforcement, independent of application logic.

We layer three enforcement mechanisms: (1) Istio AuthorizationPolicy at the mesh level (blocks at the Envoy sidecar), (2) service key validation at the application level (the gateway validates ORCHESTRATOR_SERVICE_KEY on every /internal/ai/chat request), and (3) caller identity validation (the gateway checks that the X-Caller-Service header matches nexus-orchestrator). Defense in depth is not paranoia -- it is acknowledgment that any single layer can fail. A misconfigured Istio policy (layer 1 failure) is caught by the service key check (layer 2). A leaked service key (layer 2 failure) is caught by the caller identity check (layer 3). All three layers must be compromised simultaneously for an unauthorized LLM call to succeed.
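
For concreteness, layers 2 and 3 can be pictured as two small Express middleware functions of the kind sketched below. The middleware names match those mentioned in Section 3; the header carrying the service key and the exact response bodies are assumptions made for illustration.

TypeScript
24 lines
import type { Request, Response, NextFunction } from 'express';

// Layer 2: reject any /internal/ai/chat request that lacks the shared service key.
// The header name is an assumption; the key itself is the ORCHESTRATOR_SERVICE_KEY secret.
export function validateServiceKey(req: Request, res: Response, next: NextFunction): void {
  const key = req.header('X-Service-Key');
  if (!key || key !== process.env.ORCHESTRATOR_SERVICE_KEY) {
    res.status(403).json({ error: true, code: 'INVALID_SERVICE_KEY' });
    return;
  }
  next();
}

// Layer 3: the declared caller must be the execution engine's service identity.
export function validateCallerIdentity(req: Request, res: Response, next: NextFunction): void {
  if (req.header('X-Caller-Service') !== 'nexus-orchestrator') {
    res.status(403).json({ error: true, code: 'UNAUTHORIZED_CALLER' });
    return;
  }
  next();
}

// Layer 1 (the Istio AuthorizationPolicy) rejects unauthorized callers at the Envoy
// sidecar, before either of these functions ever runs.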

This triple-layer enforcement is what distinguishes our separation principle from a design guideline. It is not "services should not call the AI Provider Router directly." It is "services cannot call the AI Provider Router directly." The difference is architectural.

2.6 Event-Driven Microservices and Data Management

Laigner et al. [19] conducted a comprehensive survey of data management practices in microservices, analyzing open-source applications and surveying over 120 practitioners. Their findings highlight the tension between service autonomy (each service owns its data) and cross-service consistency (operations spanning multiple services need transactional guarantees). This tension is directly relevant to our architecture: UNO owns the orchestrator.runs table, nexus-workflows owns execution state, the AI Provider Router owns provider configuration, and the auth service owns organization settings. A single job dispatch touches all four data domains.

Our solution is event-driven eventual consistency. UNO inserts a run record and publishes a Redis Pub/Sub event. nexus-workflows updates execution state and publishes progress events. The gateway subscribes to these events and relays them to the dashboard via Socket.IO. There is no distributed transaction, no two-phase commit, no saga. The run record in PostgreSQL is the source of truth; Redis events are notifications, not state. If a Redis event is lost, the dashboard can poll the run record -- but in practice, Redis Pub/Sub with the nexus:jobs:org:{orgId} channel pattern provides reliable delivery within the cluster.

Laigner et al. [20] further studied event management challenges in microservice architectures, analyzing over 8,000 Stack Overflow questions. The most common challenges -- event ordering, duplicate handling, and schema evolution -- informed our event design. Events include a monotonically increasing sequence number per run, enabling the dashboard to detect gaps and request replay. Event payloads use a versioned schema with backward-compatible field additions.
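
A sketch of an event envelope consistent with this design follows; the field names and the gap-detection helper are illustrative assumptions rather than the production wire schema.

TypeScript
16 lines
// Illustrative envelope -- field names are assumptions, not the production wire schema.
interface JobEvent {
  schemaVersion: number;     // bumped only for backward-compatible additions
  runId: string;
  seq: number;               // monotonically increasing per run; a gap signals a lost event
  type: 'job:dispatched' | 'job:progress' | 'job:completed' | 'job:failed';
  orgId: string;
  payload: Record<string, unknown>;
  emittedAt: string;         // ISO-8601 timestamp
}

// The dashboard can detect a missing sequence number and fall back to polling the run record.
function hasGap(events: JobEvent[]): boolean {
  const seqs = events.map((e) => e.seq).sort((a, b) => a - b);
  return seqs.some((s, i) => i > 0 && s !== seqs[i - 1] + 1);
}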

2.7 Distributed Tracing and Observability

The span-tree model that underpins our observability system descends directly from Google's Dapper [21], which introduced the concept of traces as trees of spans, where each span represents a unit of work with a start time, end time, and parent reference. Dapper's design goals -- low overhead, application-level transparency, and ubiquitous deployment -- remain the gold standard for production tracing systems. OpenTelemetry [22], the CNCF project that unified OpenTracing and OpenCensus, provides the specification and SDK ecosystem that most modern systems use for distributed tracing.

Our span-tree system differs from conventional distributed tracing in a crucial respect: spans are not merely observability artifacts -- they are structural elements of the execution model. Every operation in the execution pipeline is a span: the dispatch is a span, the governance check is a span, the queue transit is a span, the LLM call is a span, each tool invocation is a span, each ReAct iteration is a span. Spans are stored in a PostgreSQL execution_spans table with parent-child foreign key relationships, not shipped to an external tracing backend. This makes the span tree queryable with standard SQL -- "show me all LLM calls for organization X in the last 24 hours that exceeded 10 seconds" is a SELECT with a WHERE clause, not a Jaeger query.
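
For example, that query might look like the following, written in the same db.query style used elsewhere in this paper; the column names (span_type, duration_ms, started_at, model) are assumptions about the execution_spans schema, which Section 6 describes in full.

TypeScript
11 lines
// Illustrative only -- column names are assumptions about the execution_spans schema.
const slowLlmCalls = await db.query(
  `SELECT span_id, run_id, model, duration_ms, started_at
     FROM execution_spans
    WHERE org_id = $1
      AND span_type = 'llm_call'
      AND started_at > NOW() - INTERVAL '24 hours'
      AND duration_ms > 10000
    ORDER BY duration_ms DESC`,
  [orgId]
);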

The trade-off is storage cost. Conventional tracing systems sample (Dapper sampled at 1/1024 for high-throughput services). We record every span for every job, because the span tree serves dual purposes: operational observability and audit trail. Under EU AI Act [23] requirements, organizations deploying high-risk AI systems must maintain records of system operation, including inputs, outputs, and decision rationale. Our span tree -- which records the prompt sent, the model used, the tokens consumed, the tool calls made, and the result returned -- provides this audit trail as a byproduct of the execution model rather than as a separate compliance system.

2.8 AI Governance and the EU AI Act

The EU AI Act (Regulation 2024/1689) [23], which entered into force on August 1, 2024, establishes a risk-based regulatory framework for AI systems. Systems classified as high-risk face requirements around risk assessment, data quality, documentation, transparency, human oversight, and accuracy. Non-compliance carries penalties of up to EUR 35 million or 7% of worldwide annual turnover.

For multi-tenant AI platforms -- where the same infrastructure serves organizations across jurisdictions with different risk profiles -- the governance challenge is acute. A single orchestrator must enforce different governance policies for different organizations, different job types, and different risk levels, before execution begins. This is precisely why governance pre-checks belong at the dispatch boundary. If governance is checked during execution (as in our pre-UNO architecture), the check occurs after the job has already been dequeued, possibly after resources have been allocated, possibly after a partial result has been generated. The governance check becomes a retrospective audit rather than a prospective gate.

Dellermann et al. [24] proposed a taxonomy for hybrid intelligence systems -- systems that combine human and artificial intelligence to achieve goals neither could accomplish alone. Their framework identifies collaborative decision-making, mutual learning, and complementary expertise as design dimensions. Our human-in-the-loop gate for high-risk operations operationalizes one aspect of this framework: the dispatch boundary classifies risk, and for critical operations (risk level: critical), execution is paused until a human reviewer approves the job. The human does not review the result -- they review the intent: "This autonomous agent wants to execute kubectl delete pod in the production namespace. Approve?"

This is governance by design, not governance by audit. The separation principle makes it possible because governance checks are a dispatch concern, enforced at a chokepoint that every job must traverse. Without the separation, governance checks would need to be replicated in every execution path -- in nexus-trigger, in nexus-mageagent, in every future plugin that calls an LLM. Replication is the enemy of enforcement.


References

[1] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica, "Efficient Memory Management for Large Language Model Serving with PagedAttention," in Proceedings of the 29th ACM Symposium on Operating Systems Principles (SOSP '23), 2023. arXiv:2309.06180.

[2] G.-I. Yu, J. S. Jeong, G.-W. Kim, S. Kim, and B.-G. Chun, "Orca: A Distributed Serving System for Transformer-Based Generative Models," in Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI '22), 2022.

[3] L. Zheng, L. Yin, Z. Xie, J. Huang, C. Sun, C. H. Yu, S. Cao, C. Kozyrakis, I. Stoica, J. E. Gonzalez, C. Barrett, and Z. Sheng, "SGLang: Efficient Execution of Structured Language Model Programs," arXiv:2312.07104, 2024.

[4] A. Agrawal, N. Kedia, A. Panwar, J. Mohan, N. Kwatra, B. S. Gulavani, A. Tumanov, and R. Ramjee, "Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve," in Proceedings of the 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI '24), 2024. arXiv:2403.02310.

[5] Apache Software Foundation, "Apache Airflow," airflow.apache.org, 2024.

[6] Dagster Labs, "Dagster: An Orchestration Platform for Data Assets," dagster.io, 2024.

[7] Temporal Technologies, "Temporal: Durable Execution Platform," temporal.io, 2024.

[8] A. Nadeem and M. Z. Malik, "A Case for Microservices Orchestration Using Workflow Engines," in Proceedings of the ACM/IEEE 44th International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER '22), 2022. arXiv:2204.07210.

[9] Netflix Technology Blog, "Maestro: Netflix's Workflow Orchestrator," netflixtechblog.com, 2024.

[10] L. Chen, M. Zaharia, and J. Zou, "FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance," arXiv:2305.05176, 2023.

[11] I. Ong, A. Almahairi, V. Wu, W.-L. Chiang, T. Wu, J. E. Gonzalez, M. W. Kadous, and I. Stoica, "RouteLLM: Learning to Route LLMs with Preference Data," arXiv:2406.18665, 2024.

[12] J. Dekoninck et al., "A Unified Approach to Routing and Cascading for LLMs," arXiv:2410.10347, 2024.

[13] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao, "ReAct: Synergizing Reasoning and Acting in Language Models," in Proceedings of the International Conference on Learning Representations (ICLR '23), 2023. arXiv:2210.03629.

[14] T. Richards et al., "Auto-GPT: An Autonomous GPT-4 Experiment," github.com, 2023.

[15] L. Wang et al., "A Survey on Large Language Model based Autonomous Agents," arXiv:2308.11432, 2023.

[16] T. Guo et al., "Large Language Model based Multi-Agents: A Survey of Progress and Challenges," in Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI '24), 2024. arXiv:2402.01680.

[17] T. Masterman, S. Besen, M. Penneschi, and T. Marandon, "The Landscape of Emerging AI Agent Architectures for Reasoning, Planning, and Tool Calling: A Survey," arXiv:2404.11584, 2024.

[18] Istio Project, "Istio Security: Authorization Policy," istio.io, 2024.

[19] R. Laigner, Y. Zhou, M. A. V. Salles, Y. Liu, and M. Kalinowski, "Data Management in Microservices: State of the Practice, Challenges, and Research Directions," Proceedings of the VLDB Endowment, vol. 14, no. 13, pp. 3348--3361, 2021. arXiv:2103.00170.

[20] R. Laigner, A. C. Almeida, W. K. G. Assuncao, and Y. Zhou, "An Empirical Study on Challenges of Event Management in Microservice Architectures," arXiv:2408.00440, 2024.

[21] B. H. Sigelman, L. A. Barroso, M. Burrows, P. Stephenson, M. Plakal, D. Beaver, S. Jaspan, and C. Shanbhag, "Dapper, a Large-Scale Distributed Systems Tracing Infrastructure," Google Technical Report, 2010.

[22] OpenTelemetry Authors, "OpenTelemetry Specification," opentelemetry.io, 2024.

[23] European Parliament and Council of the European Union, "Regulation (EU) 2024/1689 Laying Down Harmonised Rules on Artificial Intelligence (AI Act)," Official Journal of the European Union, 2024.

[24] D. Dellermann, P. Ebel, M. Soellner, and J. M. Leimeister, "The Future of Human-AI Collaboration: A Taxonomy of Design Knowledge for Hybrid Intelligence Systems," arXiv:2105.03354, 2021.

3. System Architecture Overview

The central architectural claim of this paper is deceptively simple: the orchestrator must route but never execute. In practice, enforcing this separation requires decomposing what was once a monolithic dispatch-and-execute service into three independently deployable components, each with a narrow contract and explicit prohibitions on scope creep. This section presents the resulting three-service topology, traces the full lifecycle of a dispatched workload, and argues that the separation yields benefits in scalability, fault isolation, and governance that no amount of internal refactoring of a monolith could achieve.

3.1 The Three-Service Separation

The Unified Nexus Orchestrator architecture comprises three services, each occupying a distinct layer of the dispatch-execution-inference stack:

  1. UNO (Unified Nexus Orchestrator) -- the dispatch layer. UNO accepts inbound requests, resolves the target skill, enforces governance pre-checks, and enqueues work onto named BullMQ queues. It returns a 202 Accepted response with a job identifier and trace context. UNO maintains no execution state beyond the orchestrator.runs table and the BullMQ queue entries it produces. It never instantiates an LLM client, never calls an external AI API, and never executes tool functions.

  2. nexus-workflows -- the execution engine. Workers consume jobs from BullMQ queues, hydrate the skill's execution configuration (system prompt, tool definitions, agent pattern), and run the appropriate execution tier: a single LLM call for Tier 1 (llm_only), a ReAct loop with tool invocations for Tier 2 (tool_using), or a multi-step agent for Tier 3 (autonomous). nexus-workflows is the only service authorized to call the AI Provider Router's internal endpoint.

  3. AI Provider Router -- the LLM access layer, hosted within the nexus-gateway. The router resolves per-organization AI configuration (provider selection, API keys decrypted from AES-256 storage in the auth database), translates tool schemas between provider formats, and normalizes responses to a canonical structure. It exposes a single internal endpoint: POST /internal/ai/chat.

The boundaries between these services are not suggestions -- they are enforced at three layers. At the mesh level, Istio AuthorizationPolicy resources restrict which source principals may reach the AI Provider Router's internal endpoint. At the application level, validateServiceKey middleware rejects requests lacking the ORCHESTRATOR_SERVICE_KEY header. At the identity level, validateCallerIdentity middleware checks that the X-Caller-Service header reads nexus-orchestrator -- and in this context, "nexus-orchestrator" refers to the nexus-workflows execution engine, which inherited the service identity from the pre-separation monolith.

3.2 End-to-End Pipeline

The following diagram traces the lifecycle of a dispatched workload from browser interaction to result delivery:

Browser (Plugin / Dashboard)
    |
    | POST /api/v1/dispatch { job_type, org_id, payload }
    v
+-----------+     +------------------+
|  Gateway  |---->|       UNO        |
| (Ingress) |     | (Dispatch Layer) |
+-----------+     +------------------+
                    |  1. Zod validate
                    |  2. Rate limit (org, user)
                    |  3. Skill resolution (graphrag.skill_registry)
                    |  4. Governance pre-checks (risk, residency)
                    |  5. INSERT orchestrator.runs (status: 'queued')
                    |  6. WS emit: job:dispatched, job:skill_resolved
                    |  7. Enqueue to named BullMQ queue
                    |  8. Return 202 { jobId, traceId }
                    |
                    v
         +--------------------+
         |  BullMQ (Redis)    |
         |  Named Queues:     |
         |  - default (c:50)  |
         |  - security (c:10) |
         |  - gpu (c:5)       |
         |  - prosecreator    |
         |  - autoresearch    |
         |  - infra (c:5)     |
         +--------------------+
                    |
                    | dequeue
                    v
          +--------------------+            +---------------------+
          | nexus-workflows    |            | AI Provider Router  |
          | (Execution Engine) |----------->| (nexus-gateway)     |
          |                    |    POST    |                     |
          | - Tier 1: LLM call | /internal/ | resolveOrgConfig()  |
          | - Tier 2: ReAct    |  ai/chat   | Adapter selection   |
          | - Tier 3: Agent    |            | Tool translation    |
          +--------------------+            +---------------------+
                    |                                  |
                    |                                  v
                    |                       +---------------------+
                    |                       | External LLM API    |
                    |                       | (Gemini, Anthropic, |
                    |                       |  OpenRouter, etc.)  |
                    |                       +---------------------+
                    |
                    | UPDATE orchestrator.runs (status, output)
                    | WS emit: job:progress, job:completed
                    v
         +--------------------+
         | Redis Pub/Sub      |
         +--------------------+
                    |
                    v
         +--------------------+
         | Gateway Socket.IO  |-----> Browser (real-time)
         +--------------------+

Several properties of this pipeline deserve emphasis. First, the communication between UNO and nexus-workflows is entirely asynchronous -- mediated by BullMQ queues backed by Redis. UNO never waits for execution to complete. It enqueues and returns. This means UNO's response latency is bounded by skill resolution and queue insertion, not by LLM inference time, which can range from 200 milliseconds for a cached Haiku response to two hours for a GPU-intensive autonomous research workflow.

Second, the WebSocket event path is unidirectional from execution to browser. nexus-workflows publishes progress events to Redis Pub/Sub channels keyed by job ID. The gateway's Socket.IO layer subscribes to these channels and forwards events to the connected browser session. No polling. No request-response cycles for status updates. The browser receives job:dispatched within milliseconds of the dispatch call, job:progress events as execution proceeds, and job:completed or job:failed when the workflow terminates.
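
A minimal sketch of the gateway-side relay, assuming ioredis and socket.io: the pattern subscription shown here is an assumption made for brevity (the production gateway may subscribe per organization or per job), and the forwarded event name is illustrative.

TypeScript
16 lines
import { createServer } from 'http';
import Redis from 'ioredis';
import { Server } from 'socket.io';

const httpServer = createServer();
const io = new Server(httpServer);
const sub = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379');

// Forward every job event published by nexus-workflows to the Socket.IO room of the same name.
// The browser joins the room named in the wsChannel field returned by the 202 response.
sub.psubscribe('job:*');
sub.on('pmessage', (_pattern, channel, message) => {
  io.to(channel).emit('job:event', JSON.parse(message));
});

httpServer.listen(8080);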

Third, the AI Provider Router is reachable only from nexus-workflows. This is not a convention -- it is an Istio-enforced invariant. A plugin service that attempts to call POST /internal/ai/chat directly will have its request rejected at the sidecar proxy before it reaches the gateway process. The enforcement is defense-in-depth: mesh policy, application middleware, and caller identity validation must all pass.

3.3 Why Separation Matters

The monolithic predecessor -- which we refer to as the "pre-separation orchestrator" -- combined dispatch, execution, and (in early iterations) direct LLM client instantiation within a single Node.js process. This design suffered from three categories of failure:

Scaling inversion. Dispatch is CPU-cheap and I/O-light: validate a payload, query a database row, insert a queue message. Execution is CPU-variable and I/O-heavy: run multi-turn ReAct loops, invoke external tools via HTTP, stream tokens from LLM providers. In the monolith, scaling execution required scaling dispatch, and vice versa. A surge in long-running GPU workflows would exhaust the process pool, causing dispatch latency to spike for all incoming requests -- including trivial NexusROS llm_only tasks that should complete in under a second. With separation, UNO scales horizontally based on dispatch throughput (request rate), while nexus-workflows scales based on execution demand (queue depth per named queue). These are independent scaling dimensions with independent metrics.

Fault contagion. When a Tier 2 tool-using workflow encountered an unhandled exception during tool execution -- a kubectl timeout, a malformed API response from an external service -- the crash propagated through the event loop and could destabilize the dispatch path. In the separated architecture, a nexus-workflows worker crash affects only the job it was processing. The BullMQ job is automatically retried (up to the configured retry count) or moved to the dead-letter queue. UNO remains unaffected. The blast radius of an execution failure is exactly one job.

Governance bypass. The monolith's governance checks -- risk classification, data residency enforcement, rate limiting -- were implemented as middleware in the same Express router that handled execution. A code change to the execution path could inadvertently modify or bypass governance logic. With UNO as a separate service, governance is enforced at the dispatch boundary, before work enters the queue. By the time nexus-workflows dequeues a job, every governance check has already passed. Execution code cannot circumvent dispatch-time governance because it never runs in the same process.

3.4 What UNO Does NOT Do

The negative contract is as important as the positive one. UNO is explicitly prohibited from:

  • Making LLM calls. No fetch() to generativelanguage.googleapis.com, api.anthropic.com, or any LLM provider API. No instantiation of LLM client libraries (GoogleGenerativeAI, Anthropic, OpenAI).
  • Executing tools. No kubectl exec, no HTTP calls to external services on behalf of a workflow, no filesystem operations beyond its own persistence layer.
  • Running ReAct loops. No iterative observe-think-act cycles. No tool-call result parsing. No multi-turn agent state management.
  • Calling the AI Provider Router. POST /internal/ai/chat is off-limits. The Istio policy, service key validation, and caller identity check would all reject the request even if the code attempted it.
  • Holding execution state in memory. UNO is stateless with respect to job execution. The orchestrator.runs table and BullMQ queues are the only state stores. Any UNO instance can serve any request.

These prohibitions are not merely documented conventions. They are verified by the Brutal Honesty Audit (BHA) code review gate, which flags any direct LLM SDK instantiation, any fetch() to a known provider URL, and any POST /internal/ai/chat call originating from a service other than the execution engine as a VIOLATION. The CI pipeline includes a static analysis step that greps for these patterns in UNO's source tree.
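
The exact gate is internal, but a check of this shape could be as small as the following script, which scans UNO's source tree for the forbidden patterns named above (the glob path and the pattern list are illustrative assumptions):

TypeScript
19 lines
import { readFileSync } from 'fs';
import { globSync } from 'glob';

// Patterns that must never appear in UNO's source tree (illustrative, not exhaustive).
const FORBIDDEN: RegExp[] = [
  /new\s+(GoogleGenerativeAI|Anthropic|OpenAI)\s*\(/,
  /generativelanguage\.googleapis\.com|api\.anthropic\.com|openrouter\.ai/,
  /\/internal\/ai\/chat/,
];

const violations = globSync('services/uno/src/**/*.ts').flatMap((file) => {
  const source = readFileSync(file, 'utf8');
  return FORBIDDEN.filter((p) => p.test(source)).map((p) => `${file}: ${p}`);
});

if (violations.length > 0) {
  console.error('VIOLATION: direct LLM access detected in UNO\n' + violations.join('\n'));
  process.exit(1);
}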

3.5 Comparison with the Monolithic Baseline

The pre-separation orchestrator handled approximately 1,200 daily dispatch-and-execute cycles across 65+ NexusROS task types, ProseCreator generation workflows, security scans, and infrastructure remediation jobs. Under this load, three operational patterns emerged:

  1. P99 dispatch latency correlated with execution queue depth. When the security scan queue backed up (scans averaging 180 seconds each, concurrency limited to 10), new dispatch requests for unrelated llm_only tasks experienced latency increases of 3--5x because the Node.js event loop was saturated by in-process execution callbacks.

  2. Deployment required full-service restart. A change to the skill resolution logic -- adding a new column to skill_registry, modifying queue routing -- required redeploying the entire orchestrator, which interrupted in-flight executions. Rolling restarts mitigated but did not eliminate this: long-running GPU jobs (up to 2 hours) could not be drained within the restart grace period.

  3. Governance changes coupled to execution changes. Adding a data residency check required modifying the same codebase that handled ReAct loop iteration. Code review burden increased because reviewers needed to understand both the governance implications and the execution implications of every change.

The separated architecture eliminates all three patterns. UNO's P99 dispatch latency is bounded by PostgreSQL query time for skill resolution (~2ms) and Redis LPUSH for queue insertion (~1ms), independent of execution load. UNO deployments do not affect in-flight executions because jobs are already in BullMQ -- nexus-workflows workers continue processing without interruption. Governance changes are isolated to UNO's codebase; execution changes are isolated to nexus-workflows. The two teams (when they eventually exist) can deploy on independent cadences.


4. The UNO Dispatch Layer

UNO's dispatch pipeline is an eight-step sequence that transforms an inbound HTTP request into a queued, traceable, governance-checked job. Each step is synchronous within the request lifecycle, and the entire pipeline completes in under 50 milliseconds for the common case. This section walks through each step, then examines the skill resolution mechanism, governance pre-checks, queue routing strategy, and supplementary endpoints in detail.

4.1 The Eight-Step Dispatch Pipeline

Every POST /api/v1/dispatch request traverses the following steps:

Step 1: Zod Validation. The request body is validated against a Zod schema that enforces structural correctness at the type level. This is not a permissive any-typed handler that hopes for the best -- it is a strict contract:

TypeScript
14 lines
const DispatchRequestSchema = z.object({
  job_type: z.string().min(1).max(128),
  org_id: z.string().uuid(),
  user_id: z.string().uuid(),
  payload: z.record(z.unknown()),
  callback_url: z.string().url().optional(),
  priority: z.enum(['low', 'normal', 'high', 'critical']).default('normal'),
  idempotency_key: z.string().uuid().optional(),
  metadata: z.object({
    source_service: z.string(),
    correlation_id: z.string().optional(),
    parent_job_id: z.string().uuid().optional(),
  }),
});

A malformed request fails here with a 400 response containing the Zod error path -- not a generic "Bad Request" string, but the specific field that failed validation and why. This is the no-fallback principle applied at the API boundary: never accept ambiguous input and silently interpret it.

Step 2: Rate Limiting. Two rate limit checks execute in parallel: per-organization (to prevent a single tenant from monopolizing queue capacity) and per-user (to prevent runaway automation within a tenant). Rate limits are implemented via Redis sliding window counters, keyed by org:{org_id}:dispatch and user:{user_id}:dispatch. When a limit is exceeded, UNO returns 429 Too Many Requests with a Retry-After header and a structured error body indicating which limit was hit, the current count, and the window reset time.
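
One common way to implement such a sliding window is a Redis sorted set of request timestamps per key, trimmed to the window on every check. The sketch below is illustrative; the production limiter's exact key format, limits, and reply handling may differ.

TypeScript
18 lines
import Redis from 'ioredis';

const redis = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379');

// Returns true while the caller is within 'limit' requests for the trailing window.
async function allow(key: string, limit: number, windowMs: number): Promise<boolean> {
  const now = Date.now();
  const multi = redis.multi();
  multi.zremrangebyscore(key, 0, now - windowMs);    // drop entries older than the window
  multi.zadd(key, now, `${now}:${Math.random()}`);   // record this request
  multi.zcard(key);                                  // count requests inside the window
  multi.pexpire(key, windowMs);                      // let idle keys expire on their own
  const replies = await multi.exec();
  const count = Number(replies?.[2]?.[1] ?? Number.MAX_SAFE_INTEGER);  // fail closed if missing
  return count <= limit;
}

// e.g. await allow(`org:${orgId}:dispatch`, 600, 60_000) for the per-organization check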

Step 3: Skill Resolution. UNO queries graphrag.skill_registry for the row matching the request's job_type. This single database lookup replaces what was, in the monolithic predecessor, a growing cascade of conditional branches:

TypeScript
17 lines
// Before: the monolith's approach
if (jobType === 'nexusros_brain') { ... }
else if (jobType === 'prosecreator_generate') { ... }
else if (jobType === 'security_scan_full') { ... }
else if (jobType.startsWith('nexusros_')) { ... }  // prefix fallback
else { throw new Error('Unknown job type'); }

// After: UNO's approach
const skill = await db.query(
  `SELECT execution_type, execution_config, timeout_ms,
          queue_name, queue_concurrency, risk_level,
          data_residency, routing_hint, dispatch_mode
   FROM graphrag.skill_registry
   WHERE job_type = $1 AND enabled = true`,
  [jobType]
);
if (!skill) throw new SkillNotFoundError(jobType);

The difference is not cosmetic. The if-chain approach required a code deployment to add a new job type. The database-as-router approach requires an INSERT statement. The operational implications are profound: a new NexusROS task type, a new ProseCreator workflow variant, or a new security scan profile can be activated in production without restarting any service.
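
For illustration, registering a hypothetical new task type is a single statement of roughly this shape (the column list follows the SELECT above; the values, including the execution_config JSON, are made up for the example):

TypeScript
11 lines
// Illustrative registration of a new job type -- values are hypothetical.
await db.query(
  `INSERT INTO graphrag.skill_registry
     (job_type, execution_type, execution_config, timeout_ms,
      queue_name, queue_concurrency, risk_level, data_residency,
      routing_hint, dispatch_mode, enabled)
   VALUES
     ('nexusros_brainstorm', 'llm_only',
      '{"system_prompt": "You are a brainstorming assistant.", "max_retries": 3}'::jsonb,
      30000, 'default', 50, 'low', 'any', 'fast', 'async', true)`
);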

If the job_type does not exist in skill_registry, UNO throws a SkillNotFoundError with a structured response:

JSON
10 lines
{
  "error": true,
  "code": "SKILL_NOT_FOUND",
  "message": "No skill registered for job_type 'nexusros_brainstorm'. Check graphrag.skill_registry.",
  "troubleshooting": [
    "Verify job_type spelling: SELECT * FROM graphrag.skill_registry WHERE job_type LIKE '%brainstorm%'",
    "Check if skill is enabled: enabled = true",
    "If new skill: INSERT into graphrag.skill_registry with required columns"
  ]
}

No silent skip. No fallback to a "default" skill. No prefix-based routing that might accidentally match the wrong handler. The skill either exists and is enabled, or the request fails loudly.

Step 4: Governance Pre-Checks. With the skill resolved, UNO evaluates governance constraints before the job enters any queue. Three checks execute:

Risk classification. The skill's risk_level (from skill_registry) is compared against the request payload. For skills marked high or unacceptable, UNO scans the payload for PII patterns (email addresses, phone numbers, government ID formats) and high-value keywords defined in a per-organization governance policy. If the scan triggers, the job is inserted into orchestrator.runs with status held_for_review rather than queued, and a WebSocket event job:held is emitted with the hold reason. The job does not enter any BullMQ queue until a human approves it via POST /api/v1/dispatch/:jobId/approve.

Data residency enforcement. The skill's data_residency field specifies geographic constraints: eu_only, us_only, or any. UNO checks the organization's registered data residency zone (retrieved during skill resolution or cached from a prior auth service call). If the organization's zone does not satisfy the skill's constraint -- for example, an eu_only skill dispatched by an organization registered in the US -- the request is rejected with a 403 and an explanation of the residency mismatch. This check occurs before any data leaves UNO's process, ensuring that payload contents are never transmitted to a queue or execution engine in a non-compliant region.

Human review gates. Certain skills are configured with requires_approval: true in their execution_config. These jobs are unconditionally held regardless of risk scan results. This mechanism supports regulatory workflows where every AI-generated output must be reviewed before delivery -- a requirement in some financial services and healthcare contexts.
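
A condensed sketch of how these three checks can compose at dispatch time; the regular expressions, policy shape, and helper types are illustrative assumptions, not the production rule set.

TypeScript
29 lines
// Illustrative governance pre-check -- patterns, types, and thresholds are assumptions.
interface SkillRow {
  risk_level: 'minimal' | 'limited' | 'high' | 'unacceptable';
  data_residency: 'eu_only' | 'us_only' | 'any';
  execution_config: { requires_approval?: boolean };
}
interface OrgPolicy { residencyZone: 'eu' | 'us'; }
type Decision = { action: 'enqueue' | 'hold_for_review' | 'reject'; reason?: string };

const PII_PATTERNS = [
  /\b[\w.+-]+@[\w-]+\.[\w.]+\b/,   // email addresses
  /\b\d{3}-\d{2}-\d{4}\b/,         // US SSN-style identifiers
];

function governancePreCheck(skill: SkillRow, payload: unknown, org: OrgPolicy): Decision {
  if (skill.data_residency === 'eu_only' && org.residencyZone !== 'eu')
    return { action: 'reject', reason: 'residency_mismatch: skill requires eu_only' };
  if (skill.data_residency === 'us_only' && org.residencyZone !== 'us')
    return { action: 'reject', reason: 'residency_mismatch: skill requires us_only' };

  if (skill.execution_config.requires_approval === true)
    return { action: 'hold_for_review', reason: 'requires_approval' };

  const piiHit = PII_PATTERNS.some((re) => re.test(JSON.stringify(payload)));
  if ((skill.risk_level === 'high' || skill.risk_level === 'unacceptable') && piiHit)
    return { action: 'hold_for_review', reason: 'pii_detected_in_high_risk_payload' };

  return { action: 'enqueue' };
}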

Step 5: Insert into orchestrator.runs. A row is inserted into the orchestrator.runs table, which serves as the system of record for job lifecycle:

SQL
9 lines
INSERT INTO orchestrator.runs (
  job_id, trace_id, org_id, user_id, job_type,
  skill_config, payload, status, priority,
  queue_name, dispatch_mode, created_at
) VALUES (
  $1, $2, $3, $4, $5,
  $6, $7, 'queued', $8,
  $9, $10, NOW()
) RETURNING job_id, trace_id;

The skill_config column stores a snapshot of the resolved skill configuration at dispatch time. This is deliberate: if the skill_registry row is modified after dispatch (a new system prompt, a changed timeout), in-flight jobs retain the configuration they were dispatched with. The alternative -- reading skill_registry at execution time -- would create a race condition where a skill update mid-execution could alter behavior unpredictably.

Step 6: WebSocket Event Emission. UNO publishes two events to Redis Pub/Sub on the channel keyed by job:{jobId}:

  • job:dispatched -- includes jobId, traceId, job_type, queue_name, priority, timestamp
  • job:skill_resolved -- includes the resolved skill metadata (execution type, tools, routing hint) for UI display

These events are consumed by the gateway's Socket.IO layer and forwarded to any browser session subscribed to the job's channel. The browser receives confirmation of dispatch before the job has begun execution -- typically within 5--10 milliseconds of the HTTP request.
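
The emission path itself is a single publish per event; a minimal sketch assuming ioredis, with envelope fields beyond those listed above treated as assumptions.

TypeScript
13 lines
import Redis from 'ioredis';

const publisher = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379');

// Publish a lifecycle event on the job's channel; the gateway's Socket.IO
// layer subscribes to these channels and forwards events to browser sessions.
async function emitJobEvent(jobId: string, event: string, data: Record<string, unknown>): Promise<void> {
  const channel = `job:${jobId}`;
  const envelope = { event, jobId, timestamp: new Date().toISOString(), ...data };
  await publisher.publish(channel, JSON.stringify(envelope));
}

// e.g. await emitJobEvent(jobId, 'job:dispatched', { traceId, job_type, queue_name, priority });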

Step 7: BullMQ Queue Routing. The job is enqueued onto the named BullMQ queue specified by the skill's queue_name field. The queue routing is not a simple LPUSH -- it includes priority weighting, retry configuration, and backoff strategy:

TypeScript
9 lines
await queue.add(skill.queue_name, jobPayload, {
  jobId: jobId,
  priority: priorityToNumber(request.priority),
  attempts: skill.execution_config.max_retries ?? 3,
  backoff: { type: 'exponential', delay: 1000 },
  timeout: skill.timeout_ms,
  removeOnComplete: { age: 86400 },  // 24h retention
  removeOnFail: { age: 604800 },     // 7d retention
});

Step 8: Return 202 Accepted. UNO returns immediately with a response that gives the caller everything needed to track the job:

JSON
8 lines
{
  "jobId": "550e8400-e29b-41d4-a716-446655440000",
  "traceId": "abc123-trace-456",
  "status": "queued",
  "queue": "default",
  "estimatedPosition": 12,
  "wsChannel": "job:550e8400-e29b-41d4-a716-446655440000"
}

The wsChannel field tells the client exactly which Socket.IO channel to subscribe to for real-time updates. No guessing, no convention-based channel naming that might drift from implementation.
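
On the client side, consuming the 202 response is a subscribe-and-listen pattern; the sketch below assumes a socket.io-client connection to the gateway and a subscribe message that takes the channel name -- the handshake details and gateway URL are illustrative, while the event names match those described above.

TypeScript
11 lines
import { io } from 'socket.io-client';

// Hypothetical client-side tracking of a dispatched job.
function trackJob(dispatchResult: { jobId: string; wsChannel: string }): void {
  const socket = io('https://gateway.example.com');      // assumed gateway URL
  socket.emit('subscribe', dispatchResult.wsChannel);    // channel from the 202 body
  socket.on('job:dispatched', (e) => console.log('queued', e));
  socket.on('job:skill_resolved', (e) => console.log('skill resolved', e));
  socket.on('job:completed', (e) => console.log('output', e.output));
  socket.on('job:failed', (e) => console.error('failed', e));
}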

4.2 Skill Resolution: Database as Router

The skill_registry table is the routing brain of UNO. Every column serves a specific dispatch-time decision:

SQL
44 lines
CREATE TABLE graphrag.skill_registry (
  id              SERIAL PRIMARY KEY,
  job_type        VARCHAR(128) UNIQUE NOT NULL,
  display_name    VARCHAR(256),
  description     TEXT,

  -- Execution classification
  execution_type  VARCHAR(32) NOT NULL
    CHECK (execution_type IN ('llm_only', 'tool_using', 'chain', 'autonomous')),

  -- Execution configuration (consumed by nexus-workflows, opaque to UNO)
  execution_config JSONB NOT NULL DEFAULT '{}',
  -- Contains: system_prompt_ref, tools[], max_iterations,
  --           agent_pattern (react|plan_act|reflexion),
  --           callback_url_template, requires_approval

  -- Timeout and queue routing
  timeout_ms      INTEGER NOT NULL DEFAULT 60000,
  queue_name      VARCHAR(64) NOT NULL DEFAULT 'default'
    CHECK (queue_name IN ('default', 'security', 'gpu',
                          'prosecreator', 'autoresearch', 'infra')),
  queue_concurrency INTEGER NOT NULL DEFAULT 50,

  -- Governance
  risk_level      VARCHAR(32) NOT NULL DEFAULT 'minimal'
    CHECK (risk_level IN ('minimal', 'limited', 'high', 'unacceptable')),
  data_residency  VARCHAR(16) NOT NULL DEFAULT 'any'
    CHECK (data_residency IN ('eu_only', 'us_only', 'any')),

  -- LLM routing hints (passed through to AI Provider Router)
  routing_hint    VARCHAR(32) DEFAULT 'fast'
    CHECK (routing_hint IN ('fast', 'reasoning', 'code', 'long_context')),

  -- Dispatch mode
  dispatch_mode   VARCHAR(16) NOT NULL DEFAULT 'single'
    CHECK (dispatch_mode IN ('single', 'batch', 'chain')),

  -- Administrative
  enabled         BOOLEAN NOT NULL DEFAULT true,
  created_at      TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  updated_at      TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_skill_registry_job_type ON graphrag.skill_registry(job_type);

Several design choices warrant explanation.

The execution_config column is typed as JSONB and is deliberately opaque to UNO. UNO reads it, snapshots it into orchestrator.runs, and passes it through to nexus-workflows via the BullMQ job payload. UNO never interprets the contents -- it does not know what a system_prompt_ref resolves to, what tools are available, or what max_iterations means in the context of a ReAct loop. This opacity is a feature: it means UNO's dispatch logic does not need to change when execution behavior changes. A new agent pattern, a new tool type, a new prompt template mechanism -- none of these require a UNO deployment.

The queue_name column uses a constrained enum rather than a free-text field. This prevents accidental routing to nonexistent queues, which would cause jobs to silently queue with no consumer -- a failure mode that is extraordinarily difficult to diagnose in production. The constraint means that adding a new named queue requires both a database migration (to extend the CHECK constraint) and a nexus-workflows deployment (to register a consumer for the new queue). This coupling is intentional: a queue without a consumer is a bug, and the schema prevents it.

The routing_hint column deserves particular attention. It does not select a specific model or provider -- that decision belongs to the AI Provider Router based on the organization's configured AI provider. Instead, it expresses the character of the LLM call: fast for low-latency completions (Haiku-class models), reasoning for complex analysis (Opus-class), code for code generation tasks, long_context for inputs exceeding 100K tokens. The AI Provider Router maps these hints to specific models within the organization's selected provider. This indirection means a skill's routing hint remains stable even as the organization switches providers -- "reasoning" maps to Claude Opus on Anthropic, to Gemini 2.5 Pro on Google, to an appropriate model on OpenRouter.
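
Conceptually, the router's hint resolution is a two-level lookup -- provider, then hint. The model identifiers below are illustrative placeholders for whatever the organization has configured, not a statement of the production mapping.

TypeScript
13 lines
type RoutingHint = 'fast' | 'reasoning' | 'code' | 'long_context';
type Provider = 'anthropic' | 'google' | 'openrouter';

// Placeholder hint-to-model table -- real mappings live in per-organization configuration.
const HINT_MODEL_MAP: Record<Provider, Record<RoutingHint, string>> = {
  anthropic:  { fast: 'claude-haiku', reasoning: 'claude-opus', code: 'claude-sonnet', long_context: 'claude-sonnet' },
  google:     { fast: 'gemini-flash', reasoning: 'gemini-2.5-pro', code: 'gemini-2.5-pro', long_context: 'gemini-2.5-pro' },
  openrouter: { fast: 'openrouter/fast', reasoning: 'openrouter/reasoning', code: 'openrouter/code', long_context: 'openrouter/long-context' },
};

function resolveModel(provider: Provider, hint: RoutingHint): string {
  return HINT_MODEL_MAP[provider][hint];
}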

4.3 Named Queue Routing

The six named BullMQ queues are not arbitrary partitions -- each represents a workload profile with distinct concurrency and timeout characteristics:

| Queue | Concurrency | Timeout | Workload Profile |
|---|---|---|---|
| default | 50 | 60s | NexusROS brain tasks, operational queries, lightweight generation |
| security | 10 | 300s | Vulnerability scans, compliance checks, RBAC analysis |
| gpu | 5 | 2h | Model fine-tuning, embedding generation, heavy inference |
| prosecreator | 10 | 25m | Long-form content generation, blueprint synthesis |
| autoresearch | 3 | 30m | Multi-source research with citation verification |
| infra | 5 | 10m | Infrastructure remediation, kubectl operations, health checks |

Why not a single queue with priority levels? Because concurrency limits and timeouts are per-queue, not per-job, in BullMQ's architecture. A single queue with concurrency 50 would allow 50 simultaneous GPU jobs to compete for resources with 50 lightweight NexusROS tasks. The named queue approach ensures that GPU jobs (concurrency 5) cannot starve the default queue (concurrency 50), and that a runaway autoresearch job (30-minute timeout) does not occupy a slot that a 60-second default job needs.

The concurrency numbers are not arbitrary. They are derived from resource profiling of each workload class: GPU jobs require dedicated GPU memory allocation, limiting practical concurrency; security scans invoke kubectl and network scanning tools that create measurable cluster load; autoresearch jobs make dozens of external HTTP requests per execution, and excessive parallelism triggers rate limits on upstream sources.

4.4 Governance Pre-Checks in Detail

Governance is enforced at the dispatch boundary -- not at execution time, not as an afterthought. This placement is critical: once a job enters a BullMQ queue, it will be executed. There is no governance middleware in nexus-workflows. If UNO enqueues a job, UNO has affirmed that the job satisfies all governance constraints.

The risk classification check operates on a two-axis model. The first axis is the skill's inherent risk level, declared in skill_registry. A security_scan_full skill is inherently high risk because it invokes kubectl against live infrastructure. A nexusros_brain skill is minimal because it produces text output with no side effects. The second axis is the payload's content risk, assessed at dispatch time via pattern matching. A minimal-risk skill that receives a payload containing Social Security numbers becomes a governance concern regardless of the skill's declared level.

When both axes indicate elevated risk, the job enters the human review pipeline. The orchestrator.runs row is set to held_for_review, the WebSocket event job:held is emitted (so the requesting user sees immediate feedback), and the job remains in limbo until an authorized reviewer calls POST /api/v1/dispatch/:jobId/approve. The approval endpoint verifies the caller's RBAC permissions -- not every user can approve held jobs. Rejection via the same endpoint sets the status to rejected and emits job:rejected.

Data residency enforcement is binary and non-negotiable. If an EU-only skill receives a dispatch from a US-registered organization, the request is rejected at HTTP level. The payload never reaches Redis, never enters a queue, never leaves UNO's process memory. This is the strongest guarantee the architecture can provide: data that should not cross a geographic boundary never enters the transport layer that would carry it across.

4.5 Supplementary Endpoints

UNO exposes five additional endpoints beyond the primary dispatch path:

POST /api/v1/dispatch/batch accepts an array of dispatch requests and processes them as a batch. Each item in the array is independently validated, skill-resolved, and governance-checked. Failures in individual items do not fail the batch -- the response includes per-item status. This endpoint exists because certain workflows (NexusROS processing 200 contacts, ProseCreator generating chapters for an entire book) dispatch dozens of jobs simultaneously, and individual HTTP requests per job create unnecessary overhead.
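
For illustration, a hypothetical response for a two-item batch where the first item enqueued and the second failed skill resolution; the results array shape is an assumption, while the status and error codes match those described elsewhere in this section.

JSON
16 lines
{
  "results": [
    {
      "index": 0,
      "status": "queued",
      "jobId": "7c9e6679-7425-40de-944b-e07fc1f90ae7",
      "wsChannel": "job:7c9e6679-7425-40de-944b-e07fc1f90ae7"
    },
    {
      "index": 1,
      "status": "failed",
      "code": "SKILL_NOT_FOUND",
      "message": "No skill registered for job_type 'nexusros_scor_lead'"
    }
  ]
}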

GET /api/v1/dispatch/:jobId returns the current status of a job, including its output if completed. This is the polling fallback for clients that cannot maintain a WebSocket connection. The response includes the full orchestrator.runs row: status, queue position (if queued), execution spans (if in progress or completed), output (if completed), and error details (if failed).

GET /api/v1/dispatch/:jobId/spans returns the execution span tree for observability. Each span represents a discrete operation within the job's execution: the initial LLM call, each tool invocation, each ReAct iteration. Spans include start time, duration, token counts, and tool-specific metadata. This endpoint powers the execution trace viewer in the dashboard, enabling operators to diagnose slow jobs by identifying which span consumed the most time.

POST /api/v1/dispatch/:jobId/approve releases a held job into its target queue. The caller must have the dispatch:approve RBAC permission for the job's organization. Upon approval, the job status transitions from held_for_review to queued, and the standard enqueue logic (Step 7) executes. A job:approved WebSocket event is emitted.

POST /api/v1/dispatch/:jobId/cancel aborts a running or queued job. For queued jobs, the BullMQ entry is removed. For running jobs, a cancellation signal is published to Redis Pub/Sub on the job's channel; the nexus-workflows worker is responsible for checking this signal between tool invocations and aborting gracefully. Cancellation is best-effort for in-flight LLM calls -- once tokens are streaming from a provider, they cannot be un-streamed -- but it prevents subsequent ReAct iterations and tool calls from executing.

4.6 No-Fallback Enforcement

The no-fallback principle permeates every error path in UNO. Consider the failure modes:

  • Skill not found: SkillNotFoundError with the exact job_type that was requested and SQL queries to diagnose why. Never a silent fallback to a "generic" skill.
  • Redis unavailable: Immediate 503 Service Unavailable with connection details and troubleshooting steps (redis-cli ping, check pod status, verify connection string). Never an in-memory queue fallback that would lose jobs on restart.
  • PostgreSQL unavailable: Immediate 503 with connection diagnostics. Never a cached skill resolution that might serve stale configuration.
  • Rate limit exceeded: 429 with explicit rate limit details (current count, limit, window, reset time). Never a silent queue that processes requests at a reduced rate without informing the caller.
  • Governance hold: 202 with status: held_for_review and the hold reason. Never an automatic approval that bypasses governance for "low-risk" payloads.

Each of these error responses follows a structured format with error, code, message, troubleshooting, and context fields. The troubleshooting array contains actionable diagnostic commands -- not vague suggestions, but specific kubectl, redis-cli, or SQL commands that an operator can execute immediately.
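
For example, the Redis-unavailable path might produce a body shaped like the SKILL_NOT_FOUND example earlier; the hostnames and exact commands below are illustrative.

JSON
11 lines
{
  "error": true,
  "code": "REDIS_UNAVAILABLE",
  "message": "Cannot enqueue job: Redis connection refused",
  "troubleshooting": [
    "Check pod status: kubectl get pods -n nexus -l app=redis",
    "Verify connectivity: redis-cli -h <redis-host> ping",
    "Confirm the orchestrator's Redis connection string matches the service DNS name"
  ],
  "context": { "jobType": "nexusros_brain", "queue": "default" }
}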

This is not defensive programming for its own sake. It is a direct response to operational experience with the monolithic predecessor, where silent fallbacks -- a missing skill defaulting to a generic prompt, a Redis timeout falling back to synchronous execution -- caused failures that were difficult to diagnose because the system appeared to be working. It was working. It was just working wrong, and nothing said so.

4.7 Caller-Side Dispatch: Buttons Are Dumb, Workflow Bindings Are Smart

The preceding sections describe what happens inside UNO when a dispatch request arrives. But something happens before the request is constructed -- and the architecture of that "before" determines whether the system is fragile or robust.

Consider a button. A "Generate Blueprint" button in the ProseCreator dashboard. An "Implement All Feedback" button in the Writers Room. A "Score Lead" button in the NexusROS CRM. A "Scan Security" button on the infrastructure page. Each button triggers an AI workload. The question is: what does the button know?

In a naive architecture, the button knows everything. It knows which model to call. It knows the system prompt. It knows the execution tier. It carries all of this in its click handler, hardcoded in frontend code:

TypeScript
9 lines
// WRONG: Button carries execution knowledge
onClick={() => dispatch({
  jobType: 'room_feedback_implement',
  model: 'gemini-2.5-pro',
  systemPrompt: 'You are a prose editor...',
  tier: 'llm_only',
  maxTokens: 16000,
  temperature: 0.7,
})}

This is wrong for three reasons. First, changing the model requires a frontend deployment -- a code change, a build, a Docker push, a rollout. Second, different buttons with similar needs duplicate configuration -- the system prompt for feedback implementation and the system prompt for beat generation share 80% of their content, but the duplication is invisible because it is scattered across component files. Third, the button's behavior cannot be inspected or modified by an administrator without reading source code -- there is no single place where "what does this button do?" is answerable.

The correct architecture inverts the knowledge hierarchy. Buttons are dumb. Workflow bindings are smart.

A workflow binding is a row in graphrag.skill_registry that maps a job_type string to a complete execution configuration:

TypeScript
16 lines
interface WorkflowBinding {
  job_type: string;           // What the button sends: 'room_feedback_implement'
  execution_type: string;     // How it runs: 'llm_only' | 'tool_using' | 'chain' | 'autonomous'
  routing_hint: string;       // What kind of LLM: 'fast' | 'reasoning' | 'code' | 'long_context'
  execution_config: {         // Tier-specific configuration (JSONB)
    modelOverride?: string;   // Optional: 'gemini-2.5-pro' (overrides routing_hint)
    maxTokens?: number;
    temperature?: number;
    tools?: string[];         // For tool_using tier
    maxIterations?: number;
  };
  system_prompt: string;      // The prompt lives in the DB, not in code
  timeout_ms: number;
  queue_name: string;         // Which BullMQ queue
  risk_level: string;         // Governance classification
}

The button carries only a job_type and the input parameters specific to its context:

TypeScript
9 lines
// CORRECT: Button carries only identity and context
onClick={() => dispatch({
  jobType: 'room_feedback_implement',
  inputParams: {
    projectId,
    roomId,
    feedbackItems: selectedFeedback,
  },
})}

Everything else -- how to execute, which model, what prompt, what tier, what timeout -- is resolved from the workflow binding at dispatch time. The WorkflowJobDispatcher client library handles this resolution:

TypeScript
27 lines
class WorkflowJobDispatcher {
  async dispatch(opts: DispatchOptions): Promise<DispatchResult> {
    // Step 1: Resolve the workflow binding from skill_registry
    const binding = await this.resolveBinding(opts.jobType);

    // Step 2: Construct the enriched payload
    const payload = {
      jobType: opts.jobType,
      organizationId: opts.orgId,
      userId: opts.userId,
      inputParams: opts.inputParams,
      execution: {
        method: binding.execution_type,
        skillId: binding.skill_id,
        routingHint: binding.routing_hint,
        tier: binding.execution_type,
      },
      modelPreference: {
        suggested: opts.userAIProviderModel,  // From user's Settings -> AI Provider
        role: 'default',
      },
    };

    // Step 3: POST to UNO
    return this.postToOrchestrator(payload);
  }
}

When UNO receives a payload with execution.skillId already populated, it has two options. It can trust the caller's resolution and skip Step 3 (skill resolution) -- using the provided metadata directly. Or it can re-resolve as a validation step, comparing the caller's resolution against the current registry state and flagging discrepancies. The current implementation does the latter: UNO always re-resolves, treating the caller's metadata as advisory. This is defensive -- a caller with a stale cache should not be able to dispatch a job with an outdated skill configuration.

The consequence of this architecture is that changing a button's behavior requires no code deployment. An administrator opens the Workflow Bindings page in the dashboard, finds the room_feedback_implement binding, changes the routing_hint from fast to reasoning, saves. The next time the button is clicked, the WorkflowJobDispatcher resolves the updated binding, and the orchestrator routes to a reasoning-class model instead of a fast one. No frontend build. No backend deployment. No rollout. The change is live in the time it takes to update a database row.
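
Under the hood, the dashboard edit amounts to a single UPDATE against the registry; an equivalent statement is shown here for illustration.

SQL
4 lines
UPDATE graphrag.skill_registry
SET routing_hint = 'reasoning',
    updated_at   = NOW()
WHERE job_type = 'room_feedback_implement';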

The platform currently defines workflow bindings across seven execution methods:

| Method | Count | Description | Example |
|---|---|---|---|
| llm_only | 171 | Single LLM call, no tools | Lead scoring, sentiment analysis, prose generation |
| tool_using | 235 | ReAct loop with tool invocations | Security scans (kubectl), infrastructure remediation |
| chain | 3 | Multi-step DAG pipeline | QA validation workflows |
| autonomous | -- | Goal-directed agent with reflection | Infrastructure investigation (migrating from mageagent) |
| gpu | -- | GPU-provisioned workloads (RunPod, Hyperbolic) | Audiobook TTS synthesis |
| mageagent | -- | Multi-agent consensus patterns | Blueprint generation, creative forge |
| n8n | -- | n8n workflow execution | Registered in schemas, pending integration |

Each binding is inspectable, testable, and modifiable from a single admin interface. The binding is the documentation -- there is no separate configuration file, no environment variable, no hardcoded constant to keep in sync with the registry. The registry is the single source of truth.

4.8 Case Study: The Implement Button

The "Implement All Feedback" button in ProseCreator's Writers Room provides a concrete illustration of why the workflow binding architecture matters -- and what happens when it is not fully in place.

The button's intent is straightforward: the user reviews feedback from AI personas in the Writers Room, selects the feedback items they want to apply, and clicks "Implement All Feedback." The system should take the current chapter text, the selected feedback, and produce a revised version with the feedback incorporated. The user then sees a diff view -- original versus modified -- and can accept or reject each change.

The button dispatches { job_type: 'room_feedback_implement', inputParams: { projectId, roomId, feedbackItems } }. The workflow binding in skill_registry specifies execution_type: 'llm_only', routing_hint: 'reasoning' (feedback implementation requires careful textual reasoning, not fast classification), and execution_config: { modelOverride: 'gemini-2.5-pro' }.

During the initial deployment of the UNO architecture, this button broke. Not because of a code bug -- the button correctly dispatched the job type, ProseCreator correctly assembled the context, and the LLM would have correctly produced the revised text. The button broke because three links in the dispatch chain failed silently:

Link 1: Authentication key mismatch. The orchestrator sent ORCHESTRATOR_SERVICE_KEY to nexus-auth for organization configuration resolution, but nexus-auth validated against INTERNAL_SERVICE_API_KEY -- a different key from a different Kubernetes secret. The auth call returned 401. Without the organization's AI configuration, the orchestrator could not determine which LLM provider to use.

Link 2: Request field name mismatch. The orchestrator's governance module called fetchOrgAIConfig() with { organizationId: orgId }, but the nexus-auth endpoint expected { orgId: ... }. A field name discrepancy -- organizationId versus orgId -- caused a silent failure where the auth service returned a default (empty) configuration instead of the organization's actual provider settings.

Link 3: Model routing default. Without a valid organization AI configuration (due to Links 1 and 2), the AI Provider Router defaulted to claude_code_max -- a provider that was not configured for the test organization. The LLM call failed with an authentication error that bore no resemblance to the root cause (a field name mismatch in an auth request three services upstream).

The architectural lesson is not that these bugs were hard to fix -- each was a one-line change. The lesson is that they were hard to diagnose. The button click produced a WebSocket job:failed event with a Claude API authentication error. Nothing in that error pointed to a field name mismatch in the orchestrator's auth module. The causal chain -- button -> dispatch -> auth key wrong -> field name wrong -> no org config -> wrong provider default -> auth failure -- crossed four services and three error domains.

The workflow binding architecture addresses this by making the execution intent explicit at every stage. The binding specifies routing_hint: 'reasoning' and modelOverride: 'gemini-2.5-pro'. If the organization's AI configuration cannot be resolved, the orchestrator does not silently default to a different provider -- it fails with a structured error naming the organization, the missing configuration, and the troubleshooting steps. The no-fallback principle (Section 4.6) prevents the silent chain of defaults that made the implement button failure so difficult to trace.

After the fix: button click -> WorkflowJobDispatcher.dispatch() -> workflow binding resolved -> payload with execution metadata -> UNO validates and enqueues -> nexus-workflows dequeues, resolves skill, calls Gemini 2.5 Pro via the AI Provider Router -> response parsed -> WebSocket generation_complete with diff data -> frontend renders diff view with Accept/Reject buttons. The entire chain is observable in the span tree, each link producing a span with its own duration, status, and metadata.


5. Chain DAG Coordination

Not every workload is a single dispatch-execute cycle. Some workloads are compositions: a research workflow that first gathers sources, then analyzes them in parallel, then synthesizes findings. A content pipeline that generates a blueprint, creates individual chapters concurrently, and assembles the final document. A security audit that scans infrastructure, runs compliance checks against findings, and produces a remediation plan.

These are chains -- directed acyclic graphs (DAGs) of skills, where the output of one step feeds the input of the next, and some steps can execute in parallel. The chain DAG coordinator lives within UNO, not in nexus-workflows. This placement is deliberate and perhaps surprising: why would coordination logic belong in the dispatch layer rather than the execution layer?

The answer follows from the separation principle. Each step in a chain is itself a dispatch: it has a job_type, requires skill resolution, undergoes governance pre-checks, and is enqueued onto a named queue. The coordinator does not execute steps -- it dispatches them. When a step completes, the coordinator evaluates the DAG to determine which steps are now unblocked, and dispatches those. The coordinator is, in essence, a state machine that emits dispatch requests in response to completion events.

5.1 DAG Representation

A chain is defined as an array of steps, each with an identifier, a job_type, and an optional dependsOn array specifying which steps must complete before this step can be dispatched:

TypeScript
19 lines
interface ChainDefinition {
  chain_id: string;
  steps: ChainStep[];
  on_failure: 'abort' | 'skip_dependents' | 'continue';
  timeout_ms: number;  // overall chain timeout
}

interface ChainStep {
  step_id: string;
  job_type: string;
  payload: Record<string, unknown>;
  dependsOn: string[];  // step_ids that must complete first
  output_mapping?: Record<string, string>;  // maps parent output fields to this step's payload
  condition?: {
    field: string;        // output field from a dependency
    operator: 'eq' | 'neq' | 'gt' | 'lt' | 'contains' | 'exists';
    value: unknown;
  };
}

The dependsOn field creates the DAG edges. A step with an empty dependsOn array (or no dependsOn field) is a root step and can be dispatched immediately when the chain begins. A step that depends on multiple parents is a join point and is dispatched only when all parents have completed successfully.

The output_mapping field enables data flow between steps. When step B depends on step A, and B's payload needs A's output, the mapping declares how to extract values from A's result and inject them into B's payload. For example:

TypeScript
9 lines
{
  step_id: 'synthesize',
  job_type: 'autoresearch_synthesize',
  dependsOn: ['gather_sources', 'run_analysis'],
  output_mapping: {
    'gather_sources.output.sources': 'payload.sources',
    'run_analysis.output.findings': 'payload.analysis_findings',
  }
}

The condition field supports conditional dispatch. A step with a condition is dispatched only if the condition evaluates to true against a dependency's output. This enables branching: if the analysis step finds critical vulnerabilities, dispatch the emergency remediation step; otherwise, dispatch the standard reporting step.
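
A minimal evaluator for this condition shape, assuming dependency outputs are addressed by the same dotted-path convention used in output_mapping; the helper is illustrative rather than the production implementation.

TypeScript
27 lines
type Operator = 'eq' | 'neq' | 'gt' | 'lt' | 'contains' | 'exists';

// Resolve a dotted path such as 'gather.output.has_visual_data' against step outputs.
function resolvePath(outputs: Record<string, unknown>, path: string): unknown {
  return path.split('.').reduce<unknown>(
    (acc, key) => (acc != null && typeof acc === 'object' ? (acc as Record<string, unknown>)[key] : undefined),
    outputs,
  );
}

function evaluateCondition(
  cond: { field: string; operator: Operator; value: unknown },
  outputs: Record<string, unknown>,   // { step_id: { output: { ... } } }
): boolean {
  const actual = resolvePath(outputs, cond.field);
  switch (cond.operator) {
    case 'eq':       return actual === cond.value;
    case 'neq':      return actual !== cond.value;
    case 'gt':       return Number(actual) > Number(cond.value);
    case 'lt':       return Number(actual) < Number(cond.value);
    case 'contains': return Array.isArray(actual)
      ? actual.includes(cond.value)
      : String(actual).includes(String(cond.value));
    case 'exists':   return actual !== undefined && actual !== null;
    default:         return false;
  }
}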

5.2 The Coordinator as State Machine

The chain coordinator maintains state in the orchestrator.chain_runs and orchestrator.chain_steps tables:

SQL
20 lines
CREATE TABLE orchestrator.chain_runs (
  chain_id    UUID PRIMARY KEY,
  parent_job_id UUID REFERENCES orchestrator.runs(job_id),
  definition  JSONB NOT NULL,       -- full ChainDefinition
  status      VARCHAR(32) NOT NULL DEFAULT 'running',
  step_states JSONB NOT NULL DEFAULT '{}',  -- { step_id: status }
  created_at  TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  completed_at TIMESTAMPTZ
);

CREATE TABLE orchestrator.chain_steps (
  step_id     VARCHAR(128),
  chain_id    UUID REFERENCES orchestrator.chain_runs(chain_id),
  job_id      UUID REFERENCES orchestrator.runs(job_id),
  status      VARCHAR(32) NOT NULL DEFAULT 'pending',
  output      JSONB,
  dispatched_at TIMESTAMPTZ,
  completed_at  TIMESTAMPTZ,
  PRIMARY KEY (chain_id, step_id)
);

When a chain is initiated (via a dispatch_mode: 'chain' job), the coordinator:

  1. Inserts the chain_runs row with the full DAG definition.
  2. Identifies all root steps (no dependencies).
  3. Dispatches each root step as an independent job via the standard 8-step pipeline.
  4. Records each dispatched step in chain_steps with status dispatched.

The coordinator then subscribes to completion events for all dispatched steps. When a step completes:

  1. The coordinator updates chain_steps with the step's output and status.
  2. It evaluates the DAG: which steps had this completed step as a dependency? For each candidate, are all of its dependencies now complete?
  3. For each newly unblocked step, it evaluates the condition (if present) against dependency outputs.
  4. Steps that pass their condition (or have no condition) are dispatched. Steps that fail their condition are marked skipped.
  5. If all steps are complete (or skipped), the chain itself is marked complete, and its aggregated output is stored in the parent job's orchestrator.runs row.

This is a straightforward topological-sort execution model, but the key insight is where it runs. The coordinator never executes any step -- it dispatches them. Each step traverses the full governance pipeline. A chain that includes a high-risk security scan step will have that step held for human review, just as it would if dispatched individually. The chain pauses at that step, waiting for approval, and resumes when approval is granted. Governance is not bypassed by composition.
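
A condensed sketch of the completion handler, reusing the db client convention from earlier and the evaluateCondition helper sketched in Section 5.1; loadChainRun, markStepSkipped, and dispatchStep are illustrative helper names, not the production API.

TypeScript
24 lines
// Invoked when a step's completion event arrives on Redis Pub/Sub.
async function onStepCompleted(chainId: string, stepId: string, output: unknown): Promise<void> {
  await db.query(
    `UPDATE orchestrator.chain_steps
     SET status = 'completed', output = $1, completed_at = NOW()
     WHERE chain_id = $2 AND step_id = $3`,
    [JSON.stringify(output), chainId, stepId],
  );

  const chain = await loadChainRun(chainId);   // definition, step states, step outputs
  for (const step of chain.definition.steps) {
    if (chain.stepStates[step.step_id] !== 'pending') continue;   // already handled
    const unblocked = step.dependsOn.every(
      (dep) => ['completed', 'skipped'].includes(chain.stepStates[dep]),
    );
    if (!unblocked) continue;

    if (step.condition && !evaluateCondition(step.condition, chain.stepOutputs)) {
      await markStepSkipped(chainId, step.step_id);   // condition evaluated false
      continue;
    }
    await dispatchStep(chainId, step, chain.stepOutputs);   // standard 8-step pipeline
  }
}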

5.3 Fork/Join Semantics

Parallel execution in a chain occurs naturally from the DAG structure. When multiple steps share the same set of dependencies (or have no dependencies), they are dispatched simultaneously. This is a fork. When a subsequent step depends on all of these parallel steps, it is a join -- dispatched only when every forked step has completed.

The coordinator emits specific WebSocket events for these patterns:

  • chain:step_dispatched -- emitted for every step dispatch, includes step_id, job_type, queue_name, and dependency context.
  • chain:fork -- emitted when multiple steps are dispatched simultaneously from a single completion event. Includes the set of forked step_id values and the triggering step that unblocked them.
  • chain:join -- emitted when a join step is dispatched after all its dependencies complete. Includes the step_id of the join step and the set of dependency steps whose outputs are being aggregated.

These events enable the dashboard to render a live DAG visualization: nodes light up as they are dispatched, fill as they complete, and edges animate to show data flow. The fork and join events provide the structural information needed to render parallel branches correctly.

A subtlety arises with partial fork failures. If a chain's on_failure policy is skip_dependents, a failed forked step causes all steps that transitively depend on it to be marked skipped, while independent branches continue executing. If the policy is abort, any step failure terminates the entire chain. If continue, the failed step is recorded but does not block dependents -- they receive null for the failed step's output mapping, and must handle this case in their payload processing.

5.4 Progress Aggregation

A chain's overall progress is not simply "step 3 of 7." Steps have heterogeneous durations -- a 60-second default queue task and a 30-minute autoresearch task contribute unequally to perceived progress. The coordinator calculates weighted progress using each step's timeout_ms as a proxy for expected duration:

TypeScript
19 lines
function calculateChainProgress(chain: ChainRun): number {
  const steps = chain.definition.steps;
  const totalWeight = steps.reduce((sum, s) => sum + s.timeout_ms, 0);
  let completedWeight = 0;

  for (const step of steps) {
    const state = chain.step_states[step.step_id];
    if (state === 'completed' || state === 'skipped') {
      completedWeight += step.timeout_ms;
    } else if (state === 'running') {
      // Estimate partial progress based on elapsed time
      const elapsed = Date.now() - new Date(chain.step_runs[step.step_id].dispatched_at).getTime();  // dispatched_at from orchestrator.chain_steps
      const fraction = Math.min(elapsed / step.timeout_ms, 0.95);
      completedWeight += step.timeout_ms * fraction;
    }
  }

  return completedWeight / totalWeight;
}

This weighted progress is emitted as part of the chain:progress WebSocket event, giving the dashboard a smooth progress indicator that accounts for step heterogeneity. A chain that has completed three 60-second steps and is midway through a 30-minute step reports approximately 15% progress, not "4 of 5 steps complete" -- which would misleadingly suggest 80%.

5.5 Example: AutoResearch Chain

Consider a concrete chain that illustrates all of these concepts. An AutoResearch workflow receives a research question and must: gather sources, run parallel analysis tracks (one via Jupyter notebook execution, one via CVAT annotation pipeline for image-heavy domains), and synthesize findings:

TypeScript
44 lines
const autoResearchChain: ChainDefinition = {
  chain_id: generateUUID(),
  steps: [
    {
      step_id: 'gather',
      job_type: 'autoresearch_gather_sources',
      payload: { query: 'Impact of transformer attention patterns on long-context retrieval' },
      dependsOn: [],
    },
    {
      step_id: 'jupyter_analysis',
      job_type: 'autoresearch_jupyter_execute',
      dependsOn: ['gather'],
      output_mapping: {
        'gather.output.academic_sources': 'payload.sources',
        'gather.output.datasets': 'payload.dataset_refs',
      },
    },
    {
      step_id: 'cvat_annotation',
      job_type: 'autoresearch_cvat_annotate',
      dependsOn: ['gather'],
      output_mapping: {
        'gather.output.figures': 'payload.images',
      },
      condition: {
        field: 'gather.output.has_visual_data',
        operator: 'eq',
        value: true,
      },
    },
    {
      step_id: 'synthesize',
      job_type: 'autoresearch_synthesize',
      dependsOn: ['jupyter_analysis', 'cvat_annotation'],
      output_mapping: {
        'jupyter_analysis.output.analysis': 'payload.quantitative_findings',
        'cvat_annotation.output.annotations': 'payload.visual_annotations',
      },
    },
  ],
  on_failure: 'skip_dependents',
  timeout_ms: 3600000,  // 1 hour overall
};

The execution proceeds as follows:

  1. Dispatch: UNO receives the chain, inserts the chain_runs row, identifies gather as the root step, and dispatches it to the autoresearch queue.

  2. Gather completes: The coordinator receives the completion event. Two steps depend on gather: jupyter_analysis and cvat_annotation. Both have their sole dependency satisfied. The coordinator evaluates cvat_annotation's condition: does gather.output.has_visual_data equal true? If yes, both steps are dispatched simultaneously -- a fork. The chain:fork event is emitted with { forked: ['jupyter_analysis', 'cvat_annotation'], trigger: 'gather' }. If the condition is false, cvat_annotation is marked skipped, and only jupyter_analysis is dispatched.

  3. Fork execution: jupyter_analysis runs on the autoresearch queue (concurrency 3, 30-minute timeout). cvat_annotation runs on the gpu queue (concurrency 5, 2-hour timeout). They execute on different workers, potentially on different nodes, with no coordination between them.

  4. Join: When both forked steps complete (or cvat_annotation was skipped), synthesize becomes unblocked. The coordinator applies output_mapping to assemble synthesize's payload from the outputs of both parent steps. If cvat_annotation was skipped, its output mapping resolves to null, and the synthesize step must handle the absence of visual annotations. The chain:join event is emitted.

  5. Synthesis completes: The chain is marked complete. The synthesized output is stored in the parent job's orchestrator.runs row. The job:completed event is emitted on the parent job's channel.

Throughout this process, UNO made zero LLM calls. It dispatched four jobs (or three, if CVAT was skipped), tracked their completion via Redis Pub/Sub events, evaluated DAG transitions, and assembled payloads. The actual work -- source gathering, notebook execution, annotation, synthesis -- happened entirely within nexus-workflows workers consuming from their respective queues.

This is the separation principle applied to composition: even when orchestrating a multi-step workflow, the orchestrator orchestrates dispatch, not execution. The chain coordinator is a dispatcher of dispatchers -- and that is exactly the level of abstraction it should occupy.

6. The Execution Engine β€” nexus-workflows

The orchestrator dispatches. It does not execute. This is the invariant we have repeated throughout this paper, and now we must confront what it implies: somewhere, something must actually do the work. That something is nexus-workflows -- a pool of BullMQ workers that consume jobs from Redis-backed queues, hydrate execution context from the skill registry, invoke LLM providers through the AI Provider Router, orchestrate tool calls, manage agent loops, and deliver results via WebSocket events. It is the only service in the platform authorized to call POST /internal/ai/chat. It is the only service that instantiates tool executors. It is, in a precise architectural sense, the platform's hands.

This section describes the execution engine's internal architecture, then walks through each of the four execution tiers in detail -- from the trivially simple (Tier 1: a single LLM call) to the genuinely complex (Tier 4: autonomous agents with goal decomposition, reflection, and multi-agent collaboration patterns).

6.1 Architecture: Workers Without an HTTP Server

nexus-workflows is unusual among the platform's 67 microservices in that it does not serve HTTP traffic. There is no Express router, no request-response cycle, no REST API surface. The service exposes exactly one HTTP endpoint -- /health -- for Kubernetes liveness and readiness probes. Everything else happens through BullMQ.

This is a deliberate architectural choice, not an omission. Workers that consume from message queues have fundamentally different operational characteristics than services that serve HTTP requests:

Backpressure is automatic. An HTTP server under load either queues requests in the kernel's TCP backlog (risking timeout), drops connections (risking data loss), or returns 503 responses (pushing backpressure to the caller). A BullMQ worker simply processes jobs at its configured concurrency. If more jobs arrive than the worker can handle, they accumulate in the queue -- durably, in Redis -- until a worker becomes available. There is no connection to drop, no request to timeout. The queue is the buffer.

Scaling is additive. Adding a second HTTP server requires a load balancer, connection draining, session affinity (if stateful), and careful DNS propagation. Adding a second BullMQ worker requires deploying another pod. The new worker immediately begins dequeuing jobs from the same Redis-backed queues. No load balancer configuration. No service discovery changes. No traffic splitting.

Failure is bounded. When an HTTP handler crashes, the request is lost unless the caller implements retry logic. When a BullMQ worker crashes mid-job, the stalled-job check reclaims the job's lock and returns it to the queue, where it is picked up by another worker (or the same worker after restart). The job's attemptsMade counter increments; once the configured attempts are exhausted, it moves to the failed set -- the platform's de facto dead-letter queue -- for manual inspection. No job is silently lost.

The deployment topology reflects the per-queue nature of the workload:

TypeScript
19 lines
interface WorkerDeployment {
  queue: string;
  concurrency: number;
  replicas: number;
  resources: {
    cpu: string;
    memory: string;
    gpu?: string;
  };
}

const deployments: WorkerDeployment[] = [
  { queue: 'default',       concurrency: 50, replicas: 3, resources: { cpu: '500m',  memory: '1Gi' } },
  { queue: 'security',      concurrency: 10, replicas: 2, resources: { cpu: '1000m', memory: '2Gi' } },
  { queue: 'gpu',           concurrency: 5,  replicas: 1, resources: { cpu: '2000m', memory: '4Gi', gpu: 'nvidia.com/gpu: 1' } },
  { queue: 'prosecreator',  concurrency: 20, replicas: 2, resources: { cpu: '500m',  memory: '1Gi' } },
  { queue: 'autoresearch',  concurrency: 8,  replicas: 1, resources: { cpu: '1000m', memory: '2Gi' } },
  { queue: 'infra',         concurrency: 5,  replicas: 1, resources: { cpu: '500m',  memory: '1Gi' } },
];

Each queue has its own Kubernetes Deployment with tailored resource requests in the target architecture. The security queue gets more CPU because security scans involve parsing large kubectl output. The gpu queue runs on nodes with GPU affinity. The prosecreator queue has higher concurrency because individual prose generation calls are lightweight (Tier 1, typically under 3 seconds). As of Q2 2026, nexus-workflows runs as a single shared BullMQ deployment with per-skill routing via the skill_registry queue_name field; per-queue independent deployments are planned as Phase 3 of the migration. This per-queue deployment model is impossible in a monolithic architecture where all workloads share a single process pool.
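
The worker entry point per deployment is small; the sketch below assumes BullMQ and ioredis, where hydrateJob and executeByTier stand in for the hydration and tier-routing logic described in the following subsections.

TypeScript
20 lines
import { Worker, Job } from 'bullmq';
import IORedis from 'ioredis';

const connection = new IORedis(process.env.REDIS_URL ?? 'redis://localhost:6379', {
  maxRetriesPerRequest: null,   // required for BullMQ workers
});

// One Worker per named queue; queue name and concurrency come from the deployment env.
const worker = new Worker(
  process.env.QUEUE_NAME ?? 'default',
  async (job: Job) => {
    const hydrated = await hydrateJob(job);   // re-resolve the skill (Section 6.2)
    return executeByTier(hydrated);           // route to the Tier 1-4 executor
  },
  { connection, concurrency: Number(process.env.QUEUE_CONCURRENCY ?? 50) },
);

worker.on('failed', (job, err) => {
  console.error(`job ${job?.id} failed (attempt ${job?.attemptsMade}):`, err.message);
});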

6.2 Job Hydration: From Queue Message to Execution Context

When a worker dequeues a job, the first thing it does -- before any LLM call, before any tool resolution, before any tier-specific logic -- is re-resolve the skill from the registry. This is not redundant with UNO's skill resolution at dispatch time. It is defensive.

TypeScript
18 lines
interface HydratedJob {
  // From BullMQ job data (set by UNO at dispatch)
  jobId: string;
  traceId: string;
  orgId: string;
  userId: string;
  payload: Record<string, unknown>;
  routingHint: string;
  
  // Re-resolved from skill_registry (fresh at execution time)
  executionType: 'llm_only' | 'tool_using' | 'chain' | 'autonomous';
  systemPrompt: string;
  tools: ToolDefinition[];
  maxIterations: number;
  timeoutMs: number;
  riskLevel: 'minimal' | 'limited' | 'high' | 'unacceptable';
  executionConfig: Record<string, unknown>;
}

Why re-resolve? Because the skill configuration may have changed between dispatch and execution. In a healthy system, the gap is milliseconds. Under load, when the security queue has 200 jobs waiting and concurrency is capped at 10, the gap can be minutes. During that interval, an operator may have updated the skill's system prompt, added a new tool, changed the max iteration count, or even disabled the skill entirely. Re-resolution ensures the worker executes with the current configuration, not a stale snapshot frozen at dispatch time.

If re-resolution returns no result -- the skill was disabled or deleted after dispatch -- the worker marks the job as failed with a structured error (SKILL_DISABLED_AFTER_DISPATCH), emits a job:failed WebSocket event, and moves on. It does not attempt to execute with cached configuration. It does not fall back to a default prompt. It fails loudly.
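
A sketch of the hydration step under the interface above; the db client, SkillDisabledError, resolvePromptRef, and resolveTools are illustrative names for machinery the worker already has.

TypeScript
34 lines
import { Job } from 'bullmq';

async function hydrateJob(job: Job): Promise<HydratedJob> {
  const data = job.data;   // written by UNO at dispatch time

  // Re-resolve the skill fresh from the registry -- never execute from the snapshot alone.
  const { rows: [skill] } = await db.query(
    `SELECT execution_type, execution_config, timeout_ms, risk_level, routing_hint
     FROM graphrag.skill_registry
     WHERE job_type = $1 AND enabled = true`,
    [data.jobType],
  );
  if (!skill) {
    throw new SkillDisabledError(
      `SKILL_DISABLED_AFTER_DISPATCH: '${data.jobType}' was disabled or deleted after dispatch`,
    );
  }

  return {
    jobId: data.jobId,
    traceId: data.traceId,
    orgId: data.organizationId,
    userId: data.userId,
    payload: data.inputParams,
    routingHint: data.execution?.routingHint ?? skill.routing_hint,
    executionType: skill.execution_type,
    systemPrompt: await resolvePromptRef(skill.execution_config.system_prompt_ref),
    tools: await resolveTools(skill.execution_config.tools ?? []),
    maxIterations: skill.execution_config.max_iterations ?? 5,
    timeoutMs: skill.timeout_ms,
    riskLevel: skill.risk_level,
    executionConfig: skill.execution_config,
  };
}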

6.3 Tier 1: LLM-Only Execution

Tier 1 is the simplest execution path. No tools. No loops. No agents. A single LLM call: messages in, completion out.

TypeScript
25 lines
async function executeTier1(job: HydratedJob): Promise<ExecutionResult> {
  emitWS(job.jobId, 'job:llm_call', { tier: 1, iteration: 1 });
  
  const response = await llmClient.chat({
    messages: [
      { role: 'system', content: job.systemPrompt },
      ...buildMessagesFromPayload(job.payload),
    ],
    routingHint: job.routingHint,
    orgId: job.orgId,
  });
  
  emitWS(job.jobId, 'job:llm_response', { 
    tier: 1,
    tokensUsed: response.usage,
    provider: response.provider,
  });
  
  return {
    status: 'completed',
    output: response.content,
    usage: response.usage,
    spans: [createSpan('llm_call', response)],
  };
}

The simplicity is the point. Tier 1 exists because many platform workloads genuinely are single-shot: NexusROS task classification, ProseCreator outline generation, sentiment analysis, entity extraction. These workloads do not benefit from tool calling or multi-turn reasoning. They need a system prompt, a user message, and a response. Routing them through a ReAct loop would add latency and complexity for zero functional benefit.

Tier 1 jobs account for approximately 60% of all dispatched workloads on the platform. They complete in a median of 1.2 seconds (dominated by LLM inference time) and consume negligible worker resources -- the worker is I/O-bound waiting for the provider API to respond. This is why the default queue runs at concurrency 50: each Tier 1 job holds a concurrency slot but almost no CPU or memory.

6.4 Tier 2: Tool-Using Execution (ReAct Loop)

Tier 2 is where execution becomes interesting. The worker enters a ReAct (Reason-Act) loop: the LLM is presented with a problem and a set of available tools, it decides which tool to call (reasoning), the worker executes the tool call against an external service (acting), the tool result is appended to the conversation, and the cycle repeats until the LLM produces a final response without requesting a tool call -- or until the configured iteration limit is reached.

The critical design decision in our ReAct implementation is the forced first tool call:

TypeScript
64 lines
async function executeTier2(job: HydratedJob): Promise<ExecutionResult> {
  const messages: Message[] = [
    { role: 'system', content: job.systemPrompt },
    ...buildMessagesFromPayload(job.payload),
  ];
  const spans: Span[] = [];
  let iteration = 0;
  
  while (iteration < job.maxIterations) {
    iteration++;
    const toolChoice = iteration === 1 ? 'required' : 'auto';
    
    emitWS(job.jobId, 'job:llm_call', { tier: 2, iteration, toolChoice });
    
    const response = await llmClient.chat({
      messages,
      tools: job.tools,
      tool_choice: toolChoice,
      routingHint: job.routingHint,
      orgId: job.orgId,
    });
    
    spans.push(createSpan('llm_call', response, { iteration }));
    
    if (!response.toolCalls || response.toolCalls.length === 0) {
      // LLM chose to respond without tools -- we're done
      emitWS(job.jobId, 'job:llm_response', { tier: 2, iteration, final: true });
      return {
        status: 'completed',
        output: response.content,
        usage: aggregateUsage(spans),
        spans,
      };
    }
    
    // Record the assistant turn (with its tool calls), then execute each call
    messages.push({ role: 'assistant', content: response.content ?? null,
                    tool_calls: response.toolCalls });

    for (const toolCall of response.toolCalls) {
      emitWS(job.jobId, 'job:tool_call', {
        tool: toolCall.name,
        iteration,
      });

      const toolResult = await executeToolCall(toolCall, job);
      spans.push(createSpan('tool_execution', toolResult, {
        tool: toolCall.name,
        iteration,
      }));

      messages.push({ role: 'tool', tool_call_id: toolCall.id, content: toolResult.output });
    }
  }
  
  // Max iterations reached without final response
  return {
    status: 'completed_max_iterations',
    output: messages[messages.length - 1].content,
    usage: aggregateUsage(spans),
    spans,
    warning: `Reached max iterations (${job.maxIterations}) without natural completion`,
  };
}

Why force the first tool call? Because without it, the LLM has a persistent tendency to hallucinate an answer rather than gathering information. Consider a security scan job. The system prompt says: "You are a security analyst. Use the kubectl tool to inspect pod configurations and identify misconfigurations." The payload says: "Scan the nexus namespace for pods running as root." If tool_choice is auto from the start, roughly 30% of the time (measured across 847 security scan dispatches over a three-month period), the LLM produces a plausible-sounding security analysis without ever calling kubectl. It fabricates pod names. It invents misconfigurations. The output reads convincingly -- it has the right format, the right vocabulary, the right level of concern -- but it describes a cluster that does not exist.

Setting tool_choice: 'required' on the first iteration eliminates this failure mode entirely. The LLM must call a tool. It must ground its first action in reality. On iteration 2 and beyond, tool_choice reverts to auto, allowing the LLM to reason about tool results, decide whether more information is needed, and eventually produce a final response.

The iteration limit is configurable per skill via execution_config.maxIterations in the skill registry. Default is 5. Security scans run at 10 (they often need multiple kubectl calls across different resource types). ProseCreator context assembly runs at 3 (context fetch, outline check, and one refinement). Infrastructure remediation runs at 15 (complex multi-step repairs involving multiple resources).

6.4.1 Tool Executor Architecture

Each tool available to a Tier 2 (or Tier 4) job is not an arbitrary function. It is a thin HTTP executor -- a wrapper that translates the LLM's tool call arguments into an HTTP request to a specific service endpoint, executes the request, and returns the response in a format the LLM can reason about.

TypeScript
16 lines
interface ToolExecutor {
  name: string;
  description: string;
  parameters: JSONSchema;
  execute(args: Record<string, unknown>, context: ExecutionContext): Promise<ToolResult>;
}

interface ToolResult {
  output: string;
  metadata: {
    service: string;
    endpoint: string;
    statusCode: number;
    durationMs: number;
  };
}

The platform provides the following tool executors:

| Tool | Target Service | Purpose |
|---|---|---|
| kubectl | nexus-infra-agent | Kubernetes cluster operations: pod inspection, log retrieval, resource status |
| http_request | (configurable) | Generic HTTP requests with URL allowlisting |
| hpc_gateway | nexus-gpu-scheduler | GPU job submission, status polling, result retrieval |
| n8n_trigger | n8n instance | Workflow automation triggers |
| prosecreator | nexus-prosecreator | Context assembly: characters, settings, plot threads, chapters |
| autoresearch | nexus-autoresearch | Web search, document analysis, citation extraction |
| graphrag_query | nexus-graphrag | Knowledge graph queries, entity search, relationship traversal |
| jupyter_execute | nexus-jupyter | Jupyter notebook cell execution for data analysis |
| cvat_annotate | nexus-cvat | Computer vision annotation tasks |
| sandbox_execute | nexus-sandbox | Sandboxed code execution (Python, Node.js, shell) |
| vision_analyze | nexus-vision | Image analysis, OCR, document parsing |

Every tool executor enforces a timeout (default 30 seconds, configurable per tool). Every executor returns structured output -- not raw HTTP responses, but parsed, truncated-if-necessary text that the LLM can consume without exceeding context windows. Every executor logs its invocation as a span in the execution trace.

The design is intentionally constrained. Tools are HTTP endpoints, not arbitrary code execution. The worker does not eval() code returned by the LLM. It does not dynamically import modules. It does not grant the LLM access to the filesystem, the network stack, or the worker's own process. The LLM reasons; tools act; the worker mediates.
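
For illustration, a kubectl executor under the ToolExecutor interface above; the nexus-infra-agent path, the response shape, the context's traceId field, and the truncation limit are assumptions.

TypeScript
32 lines
const kubectlExecutor: ToolExecutor = {
  name: 'kubectl',
  description: 'Run a read-only kubectl operation against the cluster via nexus-infra-agent',
  parameters: {
    type: 'object',
    properties: {
      verb: { type: 'string', enum: ['get', 'describe', 'logs', 'top'] },
      resource: { type: 'string' },
      namespace: { type: 'string' },
    },
    required: ['verb', 'resource'],
  },
  async execute(args, context) {
    const started = Date.now();
    const res = await fetch('http://nexus-infra-agent/internal/kubectl', {   // assumed endpoint
      method: 'POST',
      headers: { 'content-type': 'application/json', 'x-trace-id': context.traceId },
      body: JSON.stringify(args),
      signal: AbortSignal.timeout(30_000),   // default 30-second tool timeout
    });
    const text = await res.text();
    return {
      output: text.slice(0, 20_000),         // truncate to protect the LLM context window
      metadata: {
        service: 'nexus-infra-agent',
        endpoint: '/internal/kubectl',
        statusCode: res.status,
        durationMs: Date.now() - started,
      },
    };
  },
};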

6.5 Tier 3: Chain Execution (Per-Step DAG)

Tier 3 handles multi-step workflows where the output of one step feeds the input of the next. A patent analysis workflow, for example, consists of: (1) document ingestion via OCR, (2) prior art search across a knowledge graph, (3) claim extraction via LLM, (4) novelty scoring with a different model configuration, (5) human review of flagged claims, and (6) final report generation. Each step is a self-contained unit of work -- some are Tier 1 (LLM-only), some are Tier 2 (tool-using). The chain defines their ordering, data flow, and conditional branching.

The execution model for chains is per-step, not per-chain. nexus-workflows does not execute an entire chain within a single worker process. Instead, each step is dispatched as an independent job -- a Tier 1 or Tier 2 job with its own skill resolution, its own execution context, its own failure handling. The chain coordinator (implemented as a BullMQ completion listener in UNO) watches for step completions and dispatches subsequent steps when their predecessors have all finished.

TypeScript
16 lines
interface ChainDefinition {
  chainId: string;
  steps: ChainStep[];
}

interface ChainStep {
  stepId: string;
  jobType: string;           // Resolves to a skill in skill_registry
  dependsOn: string[];       // stepIds that must complete before this step runs
  inputMapping: Record<string, string>;  // Maps predecessor outputs to this step's payload
  condition?: {
    field: string;           // Field in predecessor output
    operator: 'eq' | 'neq' | 'gt' | 'lt' | 'contains';
    value: unknown;          // Value to compare against
  };
}

This per-step model has three properties that a monolithic chain executor lacks:

Fork and join without blocking. When a step has multiple successors with no mutual dependencies, all successors are dispatched simultaneously. This is fan-out parallelism: step 1 completes, steps 2a, 2b, and 2c are dispatched in parallel. When step 3 depends on all three, the completion listener maintains a counter. Each predecessor completion decrements the counter. When the counter reaches zero, step 3 dispatches. No worker thread is blocked waiting for siblings to complete. The coordination state lives in the orchestrator.chain_runs and chain_steps tables (Section 5.2), not in worker memory.

Heterogeneous step execution. Step 1 might execute on the gpu queue (OCR with a vision model). Step 2 might execute on the default queue (a simple LLM classification). Step 3 might execute on the autoresearch queue (web search with tool calling). Each step is routed to the appropriate queue based on its skill's queue_name -- the chain does not force all steps through a single queue with a single resource profile.

Conditional branching. A step can have a condition that evaluates against its predecessor's output. If a novelty scoring step returns a score below 0.7, the "generate rejection letter" step activates; if above, the "proceed to full analysis" step activates. The completion listener evaluates conditions before dispatching, routing the chain down the appropriate branch without worker involvement.

The trade-off is coordination complexity. UNO must maintain chain state -- which steps have completed, what their outputs were, which steps are eligible for dispatch. This state lives in PostgreSQL, not in worker memory, which means it survives restarts and can be inspected via administrative tooling. But it means the completion listener must handle edge cases: what if a step fails mid-chain? What if a conditional branch references a predecessor that hasn't completed yet? What if two steps share a dependency and both complete within the same Redis event batch?

These edge cases are handled by a state machine with five states per step: pending, dispatched, completed, failed, and skipped. Transitions are idempotent and serialized via PostgreSQL advisory locks on the chain ID. A failed step triggers the chain's on_failure policy: abort (halt the entire chain), skip_dependents (mark downstream steps as skipped and continue), or continue (record the failure and pass null to dependents' output mappings); per-step retries are handled by the BullMQ attempts configuration set at dispatch, not by the chain coordinator. The policy is configurable per chain, not per step -- a deliberate simplification that reduces the state space at the cost of per-step failure flexibility.
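
A sketch of that serialization, assuming node-postgres; hashtext() folds the chain UUID into a 32-bit advisory lock key, and the transaction-scoped lock releases automatically at COMMIT or ROLLBACK.

TypeScript
21 lines
import { Pool, PoolClient } from 'pg';

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Serialize all state transitions for one chain behind an advisory lock so that
// two completion events for sibling steps cannot interleave their updates.
async function withChainLock<T>(chainId: string, fn: (client: PoolClient) => Promise<T>): Promise<T> {
  const client = await pool.connect();
  try {
    await client.query('BEGIN');
    await client.query('SELECT pg_advisory_xact_lock(hashtext($1))', [chainId]);
    const result = await fn(client);   // evaluate transitions, dispatch unblocked steps
    await client.query('COMMIT');
    return result;
  } catch (err) {
    await client.query('ROLLBACK');
    throw err;
  } finally {
    client.release();
  }
}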

6.6 Tier 4: Autonomous Execution

Implementation status (April 2026): Tier 4 autonomous agent execution is deployed in nexus-workflows via autonomous-engine.ts. The execution engine's switch block routes 'autonomous' to executeAutonomous(), which delegates to the autonomous engine. The pattern was originally developed in nexus-mageagent and ported into nexus-workflows as Phase 6 of the migration plan. nexus-mageagent continues to operate as a separate service with both a GatewayLLMClient (routing through the AI Provider Router) and a deprecated direct OpenRouter client β€” the migration is partial.

Tier 4 is the most sophisticated execution tier, and the one that most clearly justifies the separation of dispatch and execution. An autonomous job is not a single LLM call. It is not a bounded loop with a known iteration count. It is a goal-directed agent that decomposes objectives, executes plans, evaluates its own progress, adjusts its approach when things go wrong, and can involve human approval gates that pause execution for hours or days.

The execution model follows the MageAgent pattern, decomposed into four phases:

Phase 1: Goal Decomposition

The LLM receives the high-level goal and decomposes it into concrete steps, each with explicit success criteria:

TypeScript
15 lines
interface GoalDecomposition {
  goal: string;
  steps: AgentStep[];
  maxSteps: number;        // Total step budget (default 20)
  maxReplans: number;      // Maximum full re-decompositions (default 3)
}

interface AgentStep {
  stepId: string;
  description: string;
  executionType: 'llm_only' | 'tool_using';  // Each step is Tier 1 or 2
  tools?: string[];                           // Subset of available tools
  successCriteria: string[];                  // Natural-language criteria
  dependsOn?: string[];                       // Ordering constraints
}

Goal decomposition is itself a Tier 1 LLM call β€” the agent uses the LLM to plan its own execution. The system prompt for decomposition includes the available tools, the organization's constraints, and examples of well-structured step plans. The output is parsed into the GoalDecomposition structure; if parsing fails (the LLM produces unstructured text), the worker retries decomposition up to 3 times with increasingly explicit formatting instructions before failing the job.
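
A sketch of that retry loop follows; GoalDecomposition and Message are the types used elsewhere in this paper, while the helper functions and the 'reasoning' routing hint are illustrative assumptions rather than the shipped API.

TypeScript
20 lines
// Sketch only: the helpers below are assumed, not the shipped implementation.
declare function buildDecompositionPrompt(goal: string, opts: { strictness: number }): Message[];
declare function parseDecomposition(text: string): GoalDecomposition;   // throws on unstructured output
declare const llm: { chat(req: { messages: Message[]; routingHint: string; orgId: string }): Promise<{ content: string }> };

async function decomposeGoal(goal: string, orgId: string, attempt = 1): Promise<GoalDecomposition> {
  const maxAttempts = 3;
  // Attempt 1 uses the normal planning prompt; attempts 2-3 add increasingly explicit formatting rules.
  const messages = buildDecompositionPrompt(goal, { strictness: attempt });
  const response = await llm.chat({ messages, routingHint: 'reasoning', orgId });

  try {
    return parseDecomposition(response.content);
  } catch (err) {
    if (attempt >= maxAttempts) {
      throw new Error(`Goal decomposition failed after ${maxAttempts} attempts: ${(err as Error).message}`);
    }
    return decomposeGoal(goal, orgId, attempt + 1);
  }
}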

Phase 2: Step Execution

The agent iterates through its decomposed steps, executing each as either a Tier 1 or Tier 2 job. This is not a new dispatch-and-queue cycle β€” the steps execute within the same worker process, reusing the established LLM client connection and execution context. The distinction between Tier 1 and Tier 2 step execution determines whether the step involves tool calls (Tier 2) or pure reasoning (Tier 1).

TypeScript
10 lines
async function executeAgentStep(
  step: AgentStep,
  context: AgentContext,
): Promise<StepResult> {
  if (step.executionType === 'llm_only') {
    return executeTier1WithinAgent(step, context);
  } else {
    return executeTier2WithinAgent(step, context);
  }
}

Each step execution produces a StepResult containing the output, any tool call traces, token usage, and a summary suitable for inclusion in the agent's memory β€” the accumulated context that subsequent steps can reference.
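
The StepResult shape is not reproduced here; the following is a plausible sketch consistent with that description (output, tool traces, token usage, memory summary), with field names that are illustrative rather than authoritative.

TypeScript
15 lines
// Hypothetical shape inferred from the prose above, not the shipped definition.
interface StepResult {
  stepId: string;
  output: string;                                 // The step's final output
  toolTraces?: ToolCallTrace[];                   // Present only for tool_using steps
  tokensUsed: { input: number; output: number };
  memorySummary: string;                          // Condensed context for subsequent steps
}

interface ToolCallTrace {
  tool: string;
  args: Record<string, unknown>;
  result: string;
  durationMs: number;
}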

Phase 3: Reflection

After each step completes, the agent reflects. This is the mechanism that distinguishes autonomous execution from a simple sequential chain: the agent evaluates its own output against the step's success criteria and decides how to proceed.

TypeScript
6 lines
interface ReflectionResult {
  verdict: 'pass' | 'adjust' | 'replan' | 'fail';
  reasoning: string;
  adjustments?: StepModification[];   // For 'adjust': changes to remaining steps
  newPlan?: GoalDecomposition;        // For 'replan': complete new decomposition
}

The four verdicts map to distinct control flow:

Verdict | Action | Frequency (observed)
pass | Proceed to next step. Success criteria met. | ~72%
adjust | Modify remaining steps based on what was learned. Do not re-decompose the entire goal. | ~18%
replan | Full re-decomposition. The current plan is fundamentally inadequate. Capped at maxReplans (default 3) to prevent infinite re-planning loops. | ~7%
fail | This step cannot succeed. Mark the job as failed with the reflection's reasoning as the error message. | ~3%
The adjust verdict is the most operationally significant. Consider an infrastructure remediation job tasked with "fix the failing health check on the nexus-auth deployment." The initial plan might be: (1) get pod status, (2) check container logs, (3) fix the issue, (4) verify the fix. After step 2, the agent discovers the issue is not a code bug but a misconfigured environment variable. The adjust verdict modifies step 3 from "apply code fix" to "update the environment variable via kubectl set env" β€” a runtime course correction that does not require starting over from scratch.

The replan verdict, by contrast, is a full reset. It occurs when the agent discovers that its initial decomposition was based on a false assumption β€” for example, assuming a service is deployed as a Deployment when it is actually a StatefulSet, which changes the entire remediation approach. The cap at 3 replans prevents a failure mode we observed during development: agents that endlessly re-decompose without making progress, consuming tokens and compute in a reasoning spiral.
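
The following sketch shows how the verdict drives the agent loop, assuming a hypothetical applyAdjustments helper; ReflectionResult, GoalDecomposition, and StepModification are the types referenced above, and the counter enforces the maxReplans cap.

TypeScript
24 lines
declare function applyAdjustments(plan: GoalDecomposition, mods: StepModification[]): GoalDecomposition;   // hypothetical

// Sketch of the post-step control flow driven by the reflection verdict.
function handleReflection(
  reflection: ReflectionResult,
  plan: GoalDecomposition,
  state: { replansUsed: number },
): { plan: GoalDecomposition; action: 'continue' | 'fail' } {
  switch (reflection.verdict) {
    case 'pass':
      return { plan, action: 'continue' };                             // proceed to the next step
    case 'adjust':
      return { plan: applyAdjustments(plan, reflection.adjustments ?? []), action: 'continue' };
    case 'replan':
      if (state.replansUsed >= plan.maxReplans) {
        return { plan, action: 'fail' };                               // cap reached: stop re-planning
      }
      state.replansUsed += 1;
      return { plan: reflection.newPlan ?? plan, action: 'continue' }; // full re-decomposition
    case 'fail':
    default:
      return { plan, action: 'fail' };                                 // reflection reasoning becomes the error
  }
}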

Phase 4: Goal Evaluation

After all steps complete (or after the step budget is exhausted), the agent performs a final evaluation: are all success criteria met? This is a Tier 1 LLM call that receives the original goal, all step outputs, and the accumulated reflection history. The LLM produces a binary verdict: goal achieved or goal not achieved.

If the goal is not achieved and the step budget has remaining capacity, the agent may re-enter the execution loop β€” effectively a top-level retry. The total step budget (maxSteps, default 20) bounds the total compute expenditure regardless of how many replans or retries occur. When the budget is exhausted, the job completes with status completed_budget_exhausted, and the final evaluation's reasoning is included in the output for human review.

6.6.1 Human-in-the-Loop Approval Gates

Certain operations are too consequential for fully autonomous execution. An agent that can kubectl delete deployment without human approval is an agent that can cause production outages autonomously. The skill registry defines a requireApprovalFor list β€” a set of tool names or operation categories that trigger a human approval gate:

TypeScript
5 lines
interface ApprovalConfig {
  requireApprovalFor: string[];   // e.g., ['code_execute', 'deploy', 'file_delete', 'kubectl_delete']
  approvalTimeout: number;        // milliseconds before the job is marked as 'approval_timeout'
  notifyChannels: string[];       // WebSocket channels, email, Slack webhook
}

When the agent's step plan includes a tool call that matches the approval list, execution pauses. The worker emits a job:pending_review WebSocket event containing the proposed action, the agent's reasoning, and the step context. The job status transitions to pending_review. The worker releases the concurrency slot β€” it does not hold a thread or connection while waiting for approval.

Approval arrives via POST /dispatch/:id/approve (or /dispatch/:id/reject). Upon approval, the job re-enters the active queue and resumes from the paused step. Upon rejection, the agent receives the rejection as tool output and must find an alternative approach β€” it does not simply fail. This models the real-world dynamic where a human reviewer might say "don't delete the pod, but you can restart it" β€” the rejection is information, not termination.
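
A sketch of the pause mechanics is shown below, with assumed helper names (persistPausedStep, moveJobToPendingReview) rather than the shipped API; the essential properties are that no worker thread blocks while waiting and that a later rejection re-enters the agent as ordinary tool output.

TypeScript
25 lines
type ToolCallRequest = { tool: string; args: Record<string, unknown> };   // illustrative shape

declare function emitWS(jobId: string, event: string, payload: unknown): void;
declare function persistPausedStep(jobId: string, stepId: string, action: ToolCallRequest): Promise<void>;
declare function moveJobToPendingReview(jobId: string): Promise<void>;

class ApprovalRequired extends Error {
  constructor(public readonly jobId: string) { super('approval_required'); }
}

// Called when a planned tool call matches the skill's requireApprovalFor list.
async function gateToolCall(jobId: string, step: AgentStep, action: ToolCallRequest): Promise<never> {
  await persistPausedStep(jobId, step.stepId, action);   // so execution can resume from this exact step
  emitWS(jobId, 'job:pending_review', {
    action,
    reasoning: step.description,
    context: { stepId: step.stepId },
  });
  await moveJobToPendingReview(jobId);   // releases the concurrency slot; no thread waits for the human
  throw new ApprovalRequired(jobId);     // unwinds the worker without marking the job failed
}

// POST /dispatch/:id/approve re-enqueues the job, which resumes from the persisted step.
// POST /dispatch/:id/reject records a rejection message as the tool's output, which the agent
// receives on resume and must reason around rather than treat as termination.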

6.6.2 Multi-Agent Patterns

Tier 4 supports five multi-agent collaboration patterns, selectable per skill via execution_config.agentPattern:

Execute (default). A single agent executes the goal. This is the standard pattern described above.

Compete. N agents (typically 3) execute the same goal independently and in parallel. A consensus judge β€” a separate LLM call β€” evaluates all N outputs and selects the best one. This pattern is useful for tasks where quality varies significantly across runs (creative writing, strategic analysis) and the cost of N parallel executions is justified by the quality improvement. Empirically, the compete pattern produces outputs that human reviewers rate 23% higher than single-agent execution on creative tasks, at 3x the token cost.

Self-Consistent. N agents execute the same goal via different reasoning paths (achieved through temperature variation and prompt rephrasing). The final answer is determined by majority vote: the output that appears most frequently across the N runs or, for free-form outputs, the one most similar to the others by embedding cosine similarity. This pattern is grounded in the self-consistency literature (Wang et al., 2022) and improves factual accuracy by approximately 17% on tasks with verifiable ground truth, at the cost of N parallel executions.

Critic. A two-phase pattern: a generator agent produces an initial output, then a critic agent evaluates and identifies weaknesses, then the generator revises. This generate-critique-improve cycle repeats for a configurable number of rounds (default 2). The critic uses a different system prompt optimized for fault-finding rather than production. Measured improvement: approximately 10% higher quality ratings on structured output tasks (reports, analyses, technical documentation) compared to single-pass generation.

Collaborate. Multiple specialist agents work on different aspects of the goal simultaneously, with a coordinator agent that synthesizes their outputs. For example, a market analysis might have a data agent (gathering statistics via tools), a narrative agent (writing the analysis), and a visualization agent (generating chart specifications). The coordinator receives all outputs and produces the final deliverable. This pattern is the most complex and the most resource-intensive; it is reserved for high-value tasks where specialization meaningfully improves output quality.

All multi-agent patterns share the same approval gate infrastructure: if any agent within a multi-agent execution triggers a requireApprovalFor tool, the entire execution pauses until approval is granted or denied.
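
As an illustration of how these patterns compose from the single-agent primitive, a sketch of the compete pattern follows; executeSingleAgent and judgeOutputs are assumed helper names, and AgentContext is the context type used in executeAgentStep above.

TypeScript
13 lines
// Sketch of the 'compete' pattern: N independent executions plus a consensus judge.
declare function executeSingleAgent(goal: string, ctx: AgentContext): Promise<string>;               // one full agent run
declare function judgeOutputs(goal: string, outputs: string[], ctx: AgentContext): Promise<number>;  // returns winning index

async function executeCompete(goal: string, ctx: AgentContext, n = 3): Promise<string> {
  // Run N agents on the same goal in parallel; each produces its own span subtree.
  const outputs = await Promise.all(
    Array.from({ length: n }, () => executeSingleAgent(goal, ctx)),
  );
  // A separate LLM call evaluates all candidates against the goal and selects the best one.
  const winner = await judgeOutputs(goal, outputs, ctx);
  return outputs[winner];
}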

6.7 The LLM Client

Every LLM call from nexus-workflows β€” regardless of tier β€” routes through a single client that calls the AI Provider Router on the gateway:

TypeScript
59 lines
interface LLMClientRequest {
  jobId: string;              // Job correlation for the WebSocket events emitted below
  messages: Message[];
  tools?: ToolDefinition[];
  tool_choice?: 'auto' | 'required' | 'none';
  routingHint?: string;       // From skill_registry: 'fast', 'reasoning', 'code', etc.
  orgId: string;              // For per-org provider resolution
}

class LLMClient {
  private readonly gatewayUrl = 'http://gateway:8092/internal/ai/chat';
  private readonly serviceKey: string;
  
  async chat(request: LLMClientRequest): Promise<LLMResponse> {
    emitWS(request.jobId, 'job:llm_call', {
      messageCount: request.messages.length,
      hasTools: !!request.tools,
      routingHint: request.routingHint,
    });
    
    const response = await fetch(this.gatewayUrl, {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'X-Service-Key': this.serviceKey,
        'X-Caller-Service': 'nexus-orchestrator',
        'X-Organization-ID': request.orgId,
      },
      body: JSON.stringify({
        messages: request.messages,
        tools: request.tools,
        tool_choice: request.tool_choice,
        routingHint: request.routingHint,
      }),
    });
    
    if (!response.ok) {
      const error = await response.json();
      throw new LLMCallError(
        `AI Provider Router returned ${response.status}: ${error.message}`,
        {
          statusCode: response.status,
          provider: error.context?.provider,
          troubleshooting: error.troubleshooting,
        }
      );
    }
    
    const result = await response.json();
    
    emitWS(request.jobId, 'job:llm_response', {
      provider: result.provider,
      model: result.model,
      tokensUsed: result.usage,
    });
    
    return result;
  }
}

Several design decisions in this client merit attention.

First, the client is deliberately thin. It does not choose providers, translate tool formats, manage API keys, or implement retry logic for provider-specific transient errors. All of that is the AI Provider Router's responsibility. The client sends a request with a routingHint and an orgId; the router handles everything else. This separation means that adding a new AI provider β€” say, a future Llama-based endpoint β€” requires zero changes to nexus-workflows.

Second, the X-Caller-Service header is hardcoded to nexus-orchestrator (a legacy naming artifact from the pre-separation era). The gateway validates this header against an allowlist. If any other value appears β€” or if the header is missing β€” the request is rejected at the application layer, before reaching the AI Provider Router. This is the second of three enforcement layers (the first being Istio, the third being the service key).

Third, errors are never swallowed. A non-200 response from the AI Provider Router produces a structured LLMCallError with the HTTP status, the provider that failed (if known), and troubleshooting steps from the router's error response. The calling tier handler (Tier 1, 2, 3, or 4) decides whether to retry, fail the job, or (in Tier 4) let the agent reason about the failure and adjust its approach.


7. Multi-Provider AI Routing

Implementation status (April 2026): This section describes the designed multi-provider routing architecture. The plumbing is complete end-to-end (routing_hint column, back-filled values, request-body passthrough), but the AI Provider Router does not yet act on routing hints β€” current production supports one active provider per org. The multi-provider configuration schema, routing policy resolution, and cost optimization described below are the target architecture for Phase 7, not the current deployment. See Section 7.2 for detailed status.

7.1 The Problem with Single-Provider Routing

The platform's original AI provider configuration was a single string: the user selected "Gemini" or "Anthropic" or "OpenRouter" in the settings panel, and every LLM call across the entire platform used that provider. This was simple. It was also wrong.

Different workloads have different requirements. A ProseCreator chapter generation job benefits from a model with strong creative writing capabilities β€” Claude Opus, for instance β€” while the context assembly step preceding it (fetching character details, plot thread summaries, setting descriptions) is a straightforward retrieval-augmented generation task that any fast, inexpensive model handles well. A security scan needs a model that can parse verbose kubectl output and reason about Kubernetes resource configurations; it does not need a model optimized for creative prose. A patent novelty scoring job needs a model with a large context window to hold the entire prior art corpus; latency is irrelevant because the user is not watching a cursor.

Single-provider routing forces a compromise. If the user selects Claude Opus (excellent for reasoning, expensive), every trivial classification task consumes expensive Opus tokens. If the user selects Gemini Flash (fast, cheap), complex reasoning tasks produce mediocre output. The platform's AI costs are either unnecessarily high or its output quality is unnecessarily low. There is no configuration that optimizes both.

7.2 Multi-Provider Configuration

Implementation status (April 2026): The routing_hint plumbing is complete end-to-end: the column exists in skill_registry (migration 007), migration 008 back-fills values for specific job types (e.g., security_scan_* β†’ 'reasoning', lead_score β†’ 'fast'), llm-client.ts in nexus-workflows sends routingHint as a JSON body field, and the gateway reads it from req.body.routingHint. However, the AI Provider Router does not yet act on the hint β€” resolveOrgConfig(orgId) returns a single provider per org, and adapter selection is always config.aiProvider regardless of the routing hint value. Multi-provider selection (returning an array of providers with a routing policy map) is the target state for Phase 7. Current production supports one active provider per org.

The solution is to let organizations enable multiple providers simultaneously, each assigned to specific roles. The database schema changes from a single string to a structured configuration:

TypeScript
36 lines
// Previous schema
interface OldAIConfig {
  aiProviderRoute: string;  // "gemini" | "anthropic" | "openrouter" | "claude_code_max"
}

// New schema
interface AIProviderConfig {
  orgId: string;             // Owning organization (referenced in routing errors below)
  providers: ProviderEntry[];
  routingPolicy: Record<string, string>;
}

interface ProviderEntry {
  type: 'gemini' | 'anthropic' | 'claude_code_max' | 'openrouter';
  roles: string[];           // What this provider is good at
  model?: string;            // Specific model override (optional)
  enabled: boolean;
}

// Example configuration
const exampleConfig: AIProviderConfig = {
  orgId: 'org_example',      // illustrative organization ID
  providers: [
    { type: 'gemini',          roles: ['fast', 'long_context', 'default'], enabled: true },
    { type: 'claude_code_max', roles: ['reasoning', 'code'],              enabled: true },
    { type: 'openrouter',      roles: ['creative'],                       enabled: true, model: 'anthropic/claude-3.5-sonnet' },
  ],
  routingPolicy: {
    fast:          'gemini',
    reasoning:     'claude_code_max',
    code:          'claude_code_max',
    long_context:  'gemini',
    creative:      'openrouter',
    default:       'gemini',
  },
};

The routingPolicy is the critical abstraction. It maps semantic roles β€” fast, reasoning, code, creative, long_context β€” to concrete providers. Skill authors do not specify providers in their skill definitions; they specify roles. A security scan skill sets routing_hint: 'reasoning' because it needs strong analytical capabilities. A context assembly skill sets routing_hint: 'fast' because it needs low latency. The routing policy translates these hints into provider selections at runtime, based on the organization's configuration.

This decoupling is essential. It means a skill definition never contains the string "gemini" or "anthropic." It contains the string "reasoning" or "fast." When an organization changes their reasoning provider from Claude to a hypothetical future model, every skill that requests the reasoning role automatically routes to the new provider. No skill definitions change. No code deployments occur. An administrator updates a database row.

7.3 Routing Resolution

The AI Provider Router's resolution logic follows a strict precedence chain:

routingHint (from request)
  → routingPolicy[hint] (from org config)
    → provider type
      → adapter instance
        → LLM API call

TypeScript
38 lines
function resolveProvider(
  routingHint: string | undefined,
  orgConfig: AIProviderConfig,
): ProviderEntry {
  // 1. If a routing hint is provided, look it up in the routing policy
  if (routingHint && orgConfig.routingPolicy[routingHint]) {
    const providerType = orgConfig.routingPolicy[routingHint];
    const provider = orgConfig.providers.find(
      p => p.type === providerType && p.enabled
    );
    if (provider) return provider;
    // Hint resolved to a disabled provider β€” fall through to default
  }
  
  // 2. Fall back to the 'default' role in the routing policy
  if (orgConfig.routingPolicy['default']) {
    const providerType = orgConfig.routingPolicy['default'];
    const provider = orgConfig.providers.find(
      p => p.type === providerType && p.enabled
    );
    if (provider) return provider;
  }
  
  // 3. No default configured β€” fail with actionable error
  throw new AIRoutingError(
    `No provider resolved for routing hint '${routingHint}' and no default configured`,
    {
      orgId: orgConfig.orgId,
      availableProviders: orgConfig.providers.filter(p => p.enabled).map(p => p.type),
      routingPolicy: orgConfig.routingPolicy,
      troubleshooting: [
        'Ensure at least one provider is enabled in Settings → AI Provider',
        `Check that routing policy has a mapping for '${routingHint}' or 'default'`,
        'Verify provider API keys are configured and valid',
      ],
    }
  );
}

Note the absence of silent fallbacks. If the routing hint maps to a disabled provider and no default is configured, the system does not silently pick a random enabled provider. It fails with a structured error that tells the administrator exactly what went wrong and how to fix it. This is the no-fallback principle applied at the provider routing layer: ambiguity in routing configuration is a bug, not a feature to be papered over.

7.3.1 User Model Preference as Advisory Hint

The dispatch payload from WorkflowJobDispatcher (Section 4.7) may carry a modelPreference.suggested field reflecting the user's AI provider setting from Settings → AI Provider. This is explicitly a suggestion, not an override. The resolution hierarchy is:

  1. Workflow binding's routing_hint (from skill_registry) β€” highest priority. If the binding says reasoning, the router selects a reasoning-class model regardless of user preference.
  2. Organization's routing policy (routingPolicy map) β€” determines which provider handles which routing hint for this organization.
  3. User's model preference (modelPreference.suggested) β€” consulted only when the routing hint is auto or default, indicating that the skill has no strong opinion about which model class to use. In this case, the user's preference can influence provider selection within the bounds of what the organization has enabled.

This hierarchy ensures that skill authors can mandate appropriate model classes (a security scan should always use a reasoning-class model, regardless of whether the user prefers a fast one), while still respecting user preferences for skill types that genuinely have no model-class requirement (a simple classification where fast and reasoning would both produce acceptable results).
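
The sketch below shows how the advisory preference slots into resolution; it wraps the resolveProvider function from Section 7.3 and consults modelPreference.suggested only when the skill's routing hint is auto or default. The function name and the weak-hint check are illustrative.

TypeScript
22 lines
// Sketch only: wraps resolveProvider (Section 7.3) with the advisory user preference.
function resolveWithUserPreference(
  routingHint: string | undefined,
  userSuggested: string | undefined,        // modelPreference.suggested from the dispatch payload
  orgConfig: AIProviderConfig,
): ProviderEntry {
  const hintIsWeak = !routingHint || routingHint === 'auto' || routingHint === 'default';

  // 1-2. A specific skill routing hint plus the org routing policy always wins.
  if (!hintIsWeak) {
    return resolveProvider(routingHint, orgConfig);
  }

  // 3. Weak hint: consult the user's preference, but only among providers the org has enabled.
  if (userSuggested) {
    const preferred = orgConfig.providers.find(p => p.type === userSuggested && p.enabled);
    if (preferred) return preferred;
  }

  // Otherwise fall through to the organization's normal default resolution.
  return resolveProvider('default', orgConfig);
}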

7.4 The Adapter Pattern

Each supported AI provider is encapsulated in an adapter that implements a common interface:

TypeScript
9 lines
interface AIProviderAdapter {
  readonly providerType: string;
  
  chat(request: NormalizedChatRequest): Promise<NormalizedChatResponse>;
  
  translateTools(tools: OpenAIToolDefinition[]): ProviderSpecificTools;
  
  healthCheck(): Promise<{ healthy: boolean; latencyMs: number }>;
}

Four adapters are registered at gateway initialization:

Adapter | Provider | Tool Format | Key Characteristics
GeminiAdapter | Google Gemini | functionDeclarations | Parallel tool calls, grounding, large context windows
AnthropicAdapter | Anthropic API | tools array with input_schema | Strong reasoning, XML-structured output
ClaudeMaxAdapter | Claude Code (Max plan) | Same as Anthropic | Free with subscription, reasoning-optimized
OpenRouterAdapter | OpenRouter proxy | OpenAI-compatible | Access to 200+ models, pay-per-token

The tool format translation layer (ai-tool-translator.ts) is the unsung infrastructure that makes multi-provider routing possible. LLM providers do not agree on how to represent tool definitions. OpenAI uses functions with parameters as JSON Schema. Anthropic uses tools with input_schema. Gemini uses functionDeclarations with parameters in a subtly different JSON Schema dialect that requires type to be uppercase (STRING vs. string). The translator normalizes all internal tool definitions to OpenAI format (the de facto standard) and translates to provider-specific format at the adapter boundary.

This translation is not merely cosmetic. Gemini's tool format requires explicit enum handling that differs from JSON Schema's enum keyword. Anthropic expects description at the tool level and within input_schema, while OpenAI places it only in the function definition. Missing or misformatted tool definitions cause provider-side validation failures that manifest as opaque 400 errors β€” the kind of error that takes hours to debug if you don't know the provider's exact schema expectations.

The translator handles these differences once, centrally, so that skill authors never encounter them.
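
The sketch below shows one translation path in simplified form, from OpenAI-format tools to Gemini functionDeclarations with the uppercase type convention; the real ai-tool-translator.ts handles more dialect differences (enum handling, description placement) than this illustration.

TypeScript
38 lines
// Simplified sketch: translate OpenAI-style tool definitions into Gemini functionDeclarations.
interface OpenAIToolDefinition {
  type: 'function';
  function: { name: string; description?: string; parameters: Record<string, unknown> };
}

interface GeminiFunctionDeclaration {
  name: string;
  description?: string;
  parameters: Record<string, unknown>;
}

function toGeminiTools(tools: OpenAIToolDefinition[]): { functionDeclarations: GeminiFunctionDeclaration[] } {
  return {
    functionDeclarations: tools.map(t => ({
      name: t.function.name,
      description: t.function.description,
      parameters: uppercaseTypes(t.function.parameters),
    })),
  };
}

// Gemini's JSON Schema dialect expects uppercase type names (STRING, OBJECT, ...) rather than lowercase.
function uppercaseTypes(schema: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(schema)) {
    if (key === 'type' && typeof value === 'string') {
      out[key] = value.toUpperCase();
    } else if (Array.isArray(value)) {
      out[key] = value.map(v => (v && typeof v === 'object' ? uppercaseTypes(v as Record<string, unknown>) : v));
    } else if (value && typeof value === 'object') {
      out[key] = uppercaseTypes(value as Record<string, unknown>);
    } else {
      out[key] = value;
    }
  }
  return out;
}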

7.5 Cost Optimization

The multi-provider routing architecture enables a cost optimization strategy that was impossible with single-provider routing. Consider a ProseCreator chapter generation workflow that involves 12 LLM calls:

Step | Routing Hint | Provider (Multi) | Provider (Single: Claude) | Tokens | Cost (Multi) | Cost (Single)
Context fetch (3 calls) | fast | Gemini Flash | Claude Opus | 6K | $0.0009 | $0.0012
Outline generation | fast | Gemini Flash | Claude Opus | 4K | $0.0006 | $0.0008
Quality review | reasoning | Claude Opus | Claude Opus | 2K | $0.0000* | $0.0000*
Draft generation | creative | Claude Sonnet | Claude Opus | 8K | $0.0015 | $0.0030
Revision pass (2 calls) | reasoning | Claude Opus | Claude Opus | 4K | $0.0000* | $0.0000*
Continuity check (3 calls) | fast | Gemini Flash | Claude Opus | 6K | $0.0009 | $0.0012
Final polish | creative | Claude Sonnet | Claude Opus | 4K | $0.0008 | $0.0015
Total | | | | 34K | $0.0047 | $0.0077

* Claude Max subscription: unlimited usage at fixed monthly cost.

The multi-provider configuration saves approximately 39% per chapter while improving quality on the reasoning and quality review steps (Claude Opus excels at analytical evaluation) and maintaining equivalent quality on creative steps (Claude Sonnet's creative output is comparable to Opus for narrative prose). At scale β€” a novel-length project with 30 chapters β€” the savings compound: $0.14 versus $0.23, a difference that matters when multiplied across hundreds of active projects.

The cost optimization is not manual. It falls out naturally from the routing policy. The skill author sets routing_hint: 'fast' for context assembly because it accurately describes what the skill needs. The administrator sets Gemini Flash as the fast provider because it is the cheapest option that meets the latency requirement. Neither party explicitly optimizes cost β€” the architecture makes cost-efficient routing the default behavior.

7.6 Credential Management

API keys for each provider are stored in the nexus-auth database, encrypted with AES-256 at rest. The encryption key is stored as a Kubernetes secret, mounted into the nexus-auth pod at startup, and never written to disk in plaintext. The resolveOrgConfig function retrieves credentials through a single internal endpoint:

POST nexus-auth/internal/org/ai-config
Body: { orgId: string }
Response: { providers: [...], routingPolicy: {...}, keys: { gemini: "decrypted_key", ... } }

The response is cached in the AI Provider Router's memory for 60 seconds (TTL), keyed by organization ID. This balances security (short-lived cache reduces the window of exposure if the gateway process is compromised) against performance (avoiding a database round-trip on every LLM call).
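
A sketch of that cache follows, with a hypothetical fetchOrgConfigFromAuth wrapper around the internal endpoint above; the cached value's shape mirrors the response described in this section.

TypeScript
16 lines
// Sketch of the 60-second per-organization cache; fetchOrgConfigFromAuth is an assumed helper.
type OrgAIConfig = AIProviderConfig & { keys: Record<string, string> };
declare function fetchOrgConfigFromAuth(orgId: string): Promise<OrgAIConfig>;

const ORG_CONFIG_TTL_MS = 60_000;
const orgConfigCache = new Map<string, { value: OrgAIConfig; expiresAt: number }>();

async function resolveOrgConfig(orgId: string): Promise<OrgAIConfig> {
  const cached = orgConfigCache.get(orgId);
  if (cached && cached.expiresAt > Date.now()) {
    return cached.value;                                   // fresh entry: no round-trip to nexus-auth
  }
  const value = await fetchOrgConfigFromAuth(orgId);       // POST nexus-auth/internal/org/ai-config
  orgConfigCache.set(orgId, { value, expiresAt: Date.now() + ORG_CONFIG_TTL_MS });
  return value;
}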

Three absolute prohibitions govern credential management:

  1. No environment variable fallbacks. If resolveOrgConfig returns no key for the selected provider, the system fails with a verbose error: "No API key configured for provider 'gemini' in organization '{orgId}'. Configure the key in Settings → AI Provider." It does not check process.env.GOOGLE_API_KEY. It does not silently degrade to a different provider. Environment variables are for local development only; production keys live exclusively in the encrypted database.

  2. No hardcoded keys. No API key literal appears in source code, configuration files, Kubernetes manifests, or Docker images. The Brutal Honesty Audit's code review gate (Section 9) specifically scans for strings matching API key patterns (AIza, sk-, anthropic-key-) in committed code.

  3. No cross-organization key sharing. Each organization's keys are isolated. Organization A's Gemini key is never used for Organization B's requests, even if Organization B has no key configured. The failure is per-organization: one organization's misconfiguration does not affect others.

7.7 Settings UI: User-Facing Configuration

The dashboard's Settings → AI Provider panel exposes the multi-provider configuration through a progressive disclosure interface. The default view shows enabled providers as cards with role badges. Expanding a card reveals model selection, API key input (masked, write-only), and role assignment checkboxes. The routing policy is displayed as a table mapping roles to providers, editable inline.

Changes are validated client-side (all enabled providers must have valid API keys; at least one provider must be assigned the default role) and server-side (key validation via a test API call to each provider). Invalid configurations cannot be saved; the UI displays specific validation errors for each failing constraint.

This design ensures that non-technical administrators can configure multi-provider routing without understanding the underlying routing mechanics. They see "Fast tasks → Gemini Flash" and "Reasoning tasks → Claude Opus", not routing hints, adapter registrations, or skill registry columns.


8. AI Provider Router Lockdown

The AI Provider Router β€” POST /internal/ai/chat on the nexus-gateway β€” is the most sensitive internal endpoint in the platform. Every LLM call flows through it. Every API key is decrypted within it. Every token of every organization's AI usage passes through its process. If a service calls it directly, bypassing the orchestrator's dispatch pipeline, the governance guarantees described in Sections 4 and 5 evaporate: no risk classification, no rate limiting, no audit trail, no data residency enforcement.

The lockdown is enforced at two independent layers: application and network. Both must pass. Either alone is insufficient.

8.1 Application-Level Enforcement

The gateway middleware validates every request to /internal/ai/chat with two checks:

TypeScript
37 lines
function validateServiceKey(req: Request, res: Response, next: NextFunction) {
  const serviceKey = req.headers['x-service-key'];
  if (!serviceKey || serviceKey !== process.env.ORCHESTRATOR_SERVICE_KEY) {
    return res.status(403).json({
      error: true,
      code: 'INVALID_SERVICE_KEY',
      message: 'Access to /internal/ai/chat requires a valid orchestrator service key',
      troubleshooting: [
        'This endpoint is restricted to nexus-workflows (the execution engine)',
        'Direct calls from plugin services are not permitted',
        'Route LLM requests through POST /api/v1/dispatch on the orchestrator',
      ],
    });
  }
  next();
}

function validateCallerIdentity(req: Request, res: Response, next: NextFunction) {
  const allowedCallers = parseAllowedCallers(
    process.env.ALLOWED_AI_CALLERS || 'nexus-orchestrator'
  );
  const caller = req.headers['x-caller-service'];
  if (!caller || !allowedCallers.has(caller as string)) {
    return res.status(403).json({
      error: true,
      code: 'UNAUTHORIZED_CALLER',
      message: `Service '${caller}' is not authorized to call /internal/ai/chat`,
      troubleshooting: [
        `Allowed callers: ${[...allowedCallers].join(', ')}`,
        `Received X-Caller-Service: '${caller}'`,
        'Set ALLOWED_AI_CALLERS env var on the gateway to include this service',
        'Route LLM requests through POST /api/v1/dispatch on the orchestrator',
      ],
    });
  }
  next();
}

The ALLOWED_AI_CALLERS environment variable defaults to nexus-orchestrator only. In production, it must be set to nexus-orchestrator,nexus-workflows,nexus-mageagent to match the Istio policy's three authorized principals. This is a known configuration gap: if the env var is not set on the gateway pod, nexus-workflows and nexus-mageagent will pass Istio mTLS but be rejected at the application layer with HTTP 403 β€” an operationally dangerous mismatch where the mesh says "allowed" and the application says "denied."

The X-Caller-Service naming varies by service: nexus-workflows sends nexus-workflows, nexus-mageagent sends nexus-mageagent. The original nexus-orchestrator identity remains for backward compatibility.

8.2 Network-Level Enforcement

Application-level validation can be bypassed by a compromised service that sends forged headers. A pod running in the same namespace could craft a request with the correct X-Service-Key (if the secret leaks) and the correct X-Caller-Service header. Application-level checks are necessary but not sufficient.

Istio's AuthorizationPolicy provides network-level enforcement that does not rely on HTTP headers:

YAML
21 lines
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: ai-provider-router-access
  namespace: nexus
spec:
  selector:
    matchLabels:
      app: nexus-gateway
  action: ALLOW
  rules:
  - from:
    - source:
        principals:
          - "cluster.local/ns/nexus/sa/nexus-orchestrator"
          - "cluster.local/ns/nexus/sa/nexus-workflows"
          - "cluster.local/ns/nexus/sa/mageagent"
    to:
    - operation:
        paths: ["/internal/ai/chat", "/internal/ai/chat-with-tools"]
        methods: ["POST"]

The DENY policy uses notPrincipals with the same three identities, blocking all other service accounts from /internal/ai/* paths.

This policy operates at the Istio sidecar proxy level β€” before the request reaches the gateway's Node.js process. The sidecar inspects the client's mTLS certificate (issued automatically by Istio's certificate authority) to determine the source service account. Three principals are authorized: nexus-orchestrator (the dispatch layer, for internal health/admin calls), nexus-workflows (the execution engine, the primary LLM caller), and mageagent (the autonomous agent service, recently migrated from direct OpenRouter calls). The target state is to reduce this to nexus-workflows only, once the mageagent migration is complete and the orchestrator's admin calls are refactored. A request from any other service account β€” even with perfectly forged HTTP headers β€” is rejected at the sidecar with a 403 before the gateway process sees it.

8.3 Why Dual Enforcement

Neither layer alone is sufficient. Consider the failure modes:

Scenario | Application-Only | Network-Only | Dual Enforcement
Compromised pod forges headers | Bypassed (headers match) | Blocked by mTLS identity | Blocked
Istio misconfiguration (policy not applied) | Still enforced by middleware | Bypassed (no policy active) | Still enforced
Service key leaked via environment variable | Bypassed (key is valid) | Blocked by service account | Blocked
Istio disabled for debugging | Still enforced by middleware | Bypassed (mesh inactive) | Still enforced
Developer adds internal route that skips middleware | Bypassed for that route | Still enforced at sidecar | Still enforced

The dual-layer model ensures that no single misconfiguration β€” whether operational (Istio policy not synced), developmental (middleware accidentally removed from a route), or adversarial (header forging) β€” grants unauthorized access to the AI Provider Router.

8.4 Migration: Eliminating Direct SDK Callers

Before the lockdown, three services called LLM providers directly, instantiating SDK clients within their own codebases:

Service | Direct Call Pattern | Migration Path
nexus-alive | new GoogleGenerativeAI(key).generateContent() for diagnostic reasoning | Replaced with dispatch to nexus_alive_diagnostic skill (Tier 2, tools: kubectl, http_request)
nexus-video-studio | new Anthropic().messages.create() for script generation | Replaced with dispatch to video_script_generate skill (Tier 1, routing hint: creative)
nexus-guest-experience | fetch('https://generativelanguage.googleapis.com/...') for conversation | Replaced with dispatch to guest_conversation skill (Tier 2, tools: graphrag_query)

Each migration followed the same pattern:

  1. Register the skill. Insert a row into graphrag.skill_registry with the appropriate execution type, tools, routing hint, and system prompt.
  2. Replace the direct call. Remove the LLM SDK import, the API key fetch, and the direct provider call. Replace with a single POST /api/v1/dispatch request.
  3. Handle async delivery. The original code was synchronous β€” call the LLM, wait for the response, use it. The dispatch model is asynchronous β€” dispatch the job, receive a job ID, receive the result via WebSocket or callback. This required restructuring the calling code to handle eventual delivery rather than immediate return.

The third step was consistently the most labor-intensive. Services that had been written around synchronous LLM calls needed their control flow restructured to accommodate the dispatch-and-wait pattern. In two cases (nexus-alive and nexus-guest-experience), the restructuring revealed latent bugs: race conditions between the LLM response and UI state updates that had been masked by the synchronous call pattern. The asynchronous model forced these bugs to the surface β€” the response arrived via WebSocket, the UI attempted to render it before the state was ready, and the resulting crash was immediate and diagnosable.
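
An illustrative before/after of that restructuring is sketched below. The dispatch URL, the auth header, and the rendering function are hypothetical stand-ins; the point is the control-flow change from an awaited SDK call to a dispatch followed by eventual WebSocket delivery.

TypeScript
20 lines
import type { Socket } from 'socket.io-client';

declare const serviceToken: string;                       // hypothetical service credential
declare function renderDiagnostic(text: string): void;    // hypothetical UI update

async function runDiagnostic(prompt: string, socket: Socket): Promise<void> {
  // Before: const text = (await gemini.getGenerativeModel(...).generateContent(prompt)).response.text();
  // After: dispatch through the orchestrator and handle the result when it eventually arrives.
  const res = await fetch('http://nexus-orchestrator:8080/api/v1/dispatch', {   // hypothetical host/port
    method: 'POST',
    headers: { 'Content-Type': 'application/json', Authorization: `Bearer ${serviceToken}` },
    body: JSON.stringify({ jobType: 'nexus_alive_diagnostic', payload: { prompt } }),
  });
  const { jobId } = await res.json();                      // HTTP 202 Accepted: queued, not executed

  socket.on('job:completed', (event: { jobId: string; output: string }) => {
    if (event.jobId !== jobId) return;                     // only this job's completion
    renderDiagnostic(event.output);                        // UI state must tolerate eventual delivery
  });
}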

After migration, the three services no longer import LLM SDK libraries. They no longer fetch API keys from the auth service. They no longer make direct HTTP calls to provider endpoints. Their AI usage is visible in the orchestrator's run tables, subject to rate limiting, governed by risk classification, and traceable through the span tree. The Istio policy and gateway middleware enforce that this remains the case β€” a future developer adding an LLM call to nexus-alive would discover, at the network layer, that their call is rejected.

The migration took approximately three weeks of engineering time. Two of those weeks were spent on the async restructuring. One day was spent registering skills. One day was spent removing SDK imports and direct calls. The ratio β€” 90% restructuring, 10% actual migration β€” is itself an argument for the dispatch-execution separation: the sooner a platform adopts the pattern, the less code must be restructured to conform to it.

8.5 Known Governance Violations (April 2026)

Two services remain in partial or complete violation of the lockdown:

nexus-orchestration (the legacy meta-agent, distinct from nexus-orchestrator) still instantiates OpenRouterClient directly with process.env.OPENROUTER_API_KEY. It has no GatewayLLMClient, no X-Caller-Service header, and no routing through the AI Provider Router. Every LLM call from this service bypasses organizational provider configuration, quota enforcement, audit logging, and cost attribution. This is an active governance violation. The service is scheduled for deprecation but remains running.

nexus-mageagent has been partially migrated. A GatewayLLMClient exists in clients/gateway-llm-client.ts and correctly sends X-Caller-Service: nexus-mageagent with the X-Service-Key header. However, the legacy OpenRouterClient in clients/openrouter-client.ts β€” marked @deprecated in comments β€” remains present and is still imported via the llm-router utility. Whether the gateway client or the OpenRouter client handles a given agent execution depends on which client is instantiated in the hot path of smart-model-router.ts and universal-task-executor.ts. This creates two parallel execution paths β€” one governed (through the gateway) and one ungoverned (direct to OpenRouter) β€” coexisting in the same service.

The existence of these violations is itself an argument for the Istio-level enforcement described in Section 8.2. Without the Istio policy, these violations would be invisible at the infrastructure level β€” the gateway would have no way to distinguish a legitimate call from an unauthorized one. With the policy in place, the violations are contained: the services can call OpenRouter directly, but they cannot call the AI Provider Router's internal endpoint without a valid Istio principal. The governance loss is limited to the direct OpenRouter path β€” audit trails, cost attribution, and organizational provider preferences are not applied to those calls, but at least the AI Provider Router itself is not compromised.


9. Span-Tree Observability

Every distributed system produces telemetry. Most of it is useless.

The conventional approach β€” shipping logs to Elasticsearch, traces to Jaeger [25], metrics to Prometheus β€” produces three disconnected observability streams that must be manually correlated by a human staring at timestamps. "The error happened at 14:32:07. What was the system doing at 14:32:07? Let me search the logs. Let me find the trace. Let me cross-reference the dashboard." This is archaeology, not engineering. It is reconstruction after the fact, dependent on whether the right log line was emitted, whether the trace was sampled, whether the metric scrape interval happened to capture the anomaly.

The span-tree observability model in the Unified Nexus Orchestrator takes a fundamentally different approach. Every operation β€” every dispatch, every governance check, every LLM call, every tool invocation, every ReAct iteration, every chain fork, every human approval gate β€” is recorded as a span in a hierarchical tree structure. Spans are not telemetry bolted onto the execution model. They are the execution model, rendered as data.

9.1 The Execution Span Model

The core data structure is the ExecutionSpan:

TypeScript
14 lines
interface ExecutionSpan {
  spanId: string;
  parentSpanId: string | null;  // null = root span
  jobId: string;
  type: 'skill_resolve' | 'react_iteration' | 'llm_call' | 'tool_call' 
      | 'chain_step' | 'chain_fork' | 'chain_join' | 'goal_decompose' 
      | 'reflection' | 'plan_adjust' | 'batch_fanout' | 'batch_aggregate';
  label: string;
  status: 'pending' | 'running' | 'completed' | 'failed';
  startedAt: number;
  completedAt?: number;
  durationMs?: number;
  metadata: Record<string, unknown>;
}

Several design decisions in this structure deserve scrutiny.

The parentSpanId creates a tree, not a flat list. Every span except the root references its parent. A Tier 2 ReAct job produces a tree that reads: root (job execution) β†’ child (skill resolution) β†’ child (LLM call #1) β†’ child (tool call: kubectl) β†’ child (LLM call #2) β†’ child (tool call: graphrag_query) β†’ child (LLM call #3, final response). The tree structure captures causality, not merely temporality. The tool call happened because the LLM requested it. The second LLM call happened because the tool returned a result. The tree encodes these causal relationships through the parent-child edges.

This is a deliberate departure from OpenTelemetry's [22] general-purpose span model, which represents spans as nodes in a distributed trace β€” a graph that can span multiple services, processes, and machines. Our spans are scoped to a single job. They do not cross service boundaries (from the span tree's perspective, the AI Provider Router is an opaque external call, not a subtree). The narrower scope enables a critical simplification: the entire tree fits in a single PostgreSQL query, because all spans for a job share the same job_id foreign key. There is no need to assemble fragments from multiple backends. There is no sampling β€” every span for every job is recorded. This is Dapper [21] without the sampling, at the cost of storage.

The type field is an enumerated taxonomy, not a free-form string. Twelve span types capture the vocabulary of the execution engine: skill_resolve marks the moment the worker looks up the skill configuration. react_iteration wraps an entire ReAct cycle (LLM call + tool calls + reasoning). llm_call records a single invocation of the AI Provider Router. tool_call records a single tool execution. chain_step, chain_fork, and chain_join track DAG coordination. goal_decompose, reflection, and plan_adjust capture the autonomous agent's metacognitive phases. batch_fanout and batch_aggregate handle batch dispatch patterns.

This taxonomy is closed. A developer cannot invent a new span type without modifying the schema. This is intentional friction. In systems where span types are free-form strings β€” Jaeger [25], Zipkin, Grafana Tempo [26] β€” the observability data becomes progressively harder to query as teams invent ad-hoc types that overlap semantically but differ syntactically. Was it db.query, database_call, sql_execute, or pg_select? With a closed taxonomy, every query over span types is exhaustive: WHERE type = 'llm_call' returns every LLM call, with no false negatives from naming inconsistencies.

The metadata field is an unstructured JSONB escape hatch. While the type taxonomy is rigid, the metadata is flexible β€” it stores billing information (tokens consumed, provider used, cost), tool input/output (truncated to 10KB to prevent bloat), reasoning traces (the LLM's chain-of-thought before a tool call), governance decisions (risk classification result, data residency check outcome), and error details (HTTP status, provider error message, troubleshooting steps). The metadata is the span's context; the structured fields are the span's identity.

9.2 Database Schema

The span tree is stored in PostgreSQL, not shipped to an external tracing backend:

SQL
19 lines
CREATE TABLE orchestrator.execution_spans (
  span_id        UUID PRIMARY KEY,
  parent_span_id UUID REFERENCES orchestrator.execution_spans(span_id),
  job_id         UUID NOT NULL REFERENCES orchestrator.runs(job_id),
  type           TEXT NOT NULL,
  label          TEXT NOT NULL,
  status         TEXT NOT NULL DEFAULT 'pending',
  started_at     TIMESTAMPTZ,
  completed_at   TIMESTAMPTZ,
  duration_ms    INTEGER,
  metadata       JSONB DEFAULT '{}',
  created_at     TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX idx_spans_job_id ON orchestrator.execution_spans(job_id);
CREATE INDEX idx_spans_type ON orchestrator.execution_spans(type);
CREATE INDEX idx_spans_parent ON orchestrator.execution_spans(parent_span_id);
CREATE INDEX idx_spans_status ON orchestrator.execution_spans(status) 
  WHERE status IN ('pending', 'running');

Why PostgreSQL and not a dedicated tracing backend? Three reasons.

First, span trees serve dual purposes: operational observability and regulatory audit trail. Under EU AI Act Article 12 [23], high-risk AI systems must maintain automatically generated logs of system operation, including inputs, outputs, and decision rationale, retained for a minimum of six months. Our span tree satisfies this requirement structurally β€” each llm_call span's metadata contains the prompt sent, the model invoked, the tokens consumed, and the response returned. An external tracing backend like Jaeger [25] is designed for operational debugging with sampling and retention policies that conflict with regulatory requirements. PostgreSQL gives us transactional guarantees, standard SQL querying, and retention policies controlled by database-level partitioning rather than backend-specific configuration.

Second, the span tree is queryable with standard SQL. "Show me all LLM calls for organization X in the last 24 hours that exceeded 10 seconds" is a single query:

SQL
9 lines
SELECT s.span_id, s.label, s.duration_ms, s.metadata->>'provider' AS provider,
       s.metadata->>'tokens_total' AS tokens, r.org_id
FROM orchestrator.execution_spans s
JOIN orchestrator.runs r ON s.job_id = r.job_id
WHERE s.type = 'llm_call'
  AND r.org_id = $1
  AND s.started_at > now() - INTERVAL '24 hours'
  AND s.duration_ms > 10000
ORDER BY s.duration_ms DESC;

This query is impossible in most tracing backends without first exporting data to a data warehouse. In our system, it runs against the primary database.

Third, span trees are small. A Tier 1 job produces 2-3 spans. A Tier 2 job with 5 iterations produces 15-20 spans. A Tier 3 chain with 6 steps produces 40-60 spans. A Tier 4 autonomous job with goal decomposition and reflection produces 50-100 spans. At current production volume β€” approximately 8,000 jobs per day β€” this amounts to roughly 200,000 span rows per day, or 6 million per month. PostgreSQL handles this comfortably on a single node with appropriate indexing. The partial index on status (WHERE status IN ('pending', 'running')) ensures that the hot working set β€” active spans β€” is efficiently queryable even as the total table grows.

9.3 WebSocket Event Taxonomy

The span tree is populated by the execution engine and consumed by the dashboard. The communication channel is WebSocket, delivered through a carefully layered event relay architecture.

The full event taxonomy comprises 22 event types across four categories:

Job lifecycle events (emitted by nexus-workflows, relayed by gateway):

Event | Emitter | Payload | Purpose
job:dispatched | UNO | { jobId, traceId, jobType, orgId } | Job accepted and queued
job:started | Worker | { jobId, workerId, queue } | Worker picked up job
job:skill_resolved | Worker | { jobId, skillId, executionType, routingHint } | Skill hydration complete
job:progress | Worker | { jobId, percent, message } | Progress update (percentage)
job:llm_call | Worker | { jobId, iteration, messageCount, hasTools } | LLM call initiated
job:llm_response | Worker | { jobId, provider, model, tokensUsed, durationMs } | LLM response received
job:tool_call | Worker | { jobId, tool, iteration, args } | Tool invocation started
job:tool_result | Worker | { jobId, tool, statusCode, durationMs, outputPreview } | Tool result received
job:thinking | Worker | { jobId, reasoning } | Agent reasoning trace
job:completed | Worker | { jobId, status, output, billing, totalDurationMs } | Job finished successfully
job:failed | Worker | { jobId, error, code, troubleshooting } | Job failed with error
job:log_stream | Worker | { jobId, line, stream } | Raw log output (GPU, infra jobs)
job:session_ready | Worker | { jobId, sessionUrl } | Interactive session available
job:gpu_metrics | Worker | { jobId, gpuUtil, memUtil, temperature } | GPU utilization telemetry

Span events (the universal observability primitive):

Event | Emitter | Payload | Purpose
job:span | Worker | { jobId, span: ExecutionSpan } | Span created or updated
job:plan_created | Worker | { jobId, plan: GoalDecomposition } | Tier 4 plan generated
job:reflection | Worker | { jobId, verdict, reasoning } | Tier 4 reflection result
job:plan_adjusted | Worker | { jobId, adjustments } | Tier 4 plan modification
job:pending_review | Worker | { jobId, action, reasoning, context } | Human approval required

Chain coordination events (emitted by UNO's completion listener):

Event | Emitter | Payload | Purpose
chain:step_dispatched | UNO | { chainId, stepId, jobId } | Chain step dispatched
chain:fork | UNO | { chainId, parentStepId, childStepIds } | Parallel fan-out
chain:join | UNO | { chainId, joinStepId, completedPredecessors } | Barrier synchronization

Batch events (emitted by UNO's batch coordinator):

Event | Emitter | Payload | Purpose
batch:progress | UNO | { batchId, completed, total, failed } | Batch progress
batch:completed | UNO | { batchId, results, billing } | All batch items done

The relay architecture is straightforward but demands precision. nexus-workflows emits events via a ws-emitter utility that publishes to a Redis Pub/Sub channel namespaced by organization: nexus:jobs:org:{orgId}. The nexus-gateway's JobEventRelay module subscribes to these channels using Redis psubscribe (pattern subscription on nexus:jobs:org:*). When an event arrives, the relay emits it to the corresponding Socket.IO room org:{orgId}. Only authenticated clients who belong to the organization receive the event.

This architecture ensures that the execution engine (nexus-workflows) has no direct connection to client WebSocket sessions. The gateway is the sole WebSocket server. The execution engine publishes to Redis; the gateway subscribes and relays. This decoupling means the execution engine can scale independently β€” adding workers does not increase WebSocket connection count, and a WebSocket reconnection does not affect job execution.
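
A sketch of both sides of the relay follows, assuming ioredis on the worker and gateway and Socket.IO on the gateway; the channel and room names follow the nexus:jobs:org:{orgId} and org:{orgId} conventions above, and the function signatures are illustrative.

TypeScript
19 lines
import Redis from 'ioredis';
import type { Server } from 'socket.io';

// Worker side (nexus-workflows): publish an event to the org-scoped channel.
const pub = new Redis();
async function publishJobEvent(orgId: string, jobId: string, event: string, payload: unknown) {
  await pub.publish(`nexus:jobs:org:${orgId}`, JSON.stringify({ jobId, event, payload }));
}

// Gateway side (JobEventRelay): pattern-subscribe and relay into the matching Socket.IO room.
const sub = new Redis();
function startJobEventRelay(io: Server) {
  sub.psubscribe('nexus:jobs:org:*');
  sub.on('pmessage', (_pattern, channel, message) => {
    const orgId = channel.split(':').pop();                    // nexus:jobs:org:{orgId}
    const { event, jobId, payload } = JSON.parse(message);
    io.to(`org:${orgId}`).emit(event, { jobId, ...payload });  // only members of the org room receive it
  });
}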

9.4 The job:span Event as Universal Primitive

Among the 22 event types, job:span deserves special attention. It is the universal observability primitive β€” the event that carries the full ExecutionSpan payload, enabling the dashboard to construct the span tree in real time.

Every other event type is, in a sense, a legacy convenience. job:llm_call is redundant with a job:span event where span.type === 'llm_call' and span.status === 'running'. job:tool_result is redundant with a job:span event where span.type === 'tool_call' and span.status === 'completed'. The named events exist because the dashboard was built before the span tree model was formalized β€” they are consumed by existing UI components that expect specific event shapes. New dashboard features are built exclusively against job:span, and the named events are scheduled for deprecation in a future release.

The job:span event fires on every span state transition: creation (pending), activation (running), and completion (completed or failed). A single span therefore produces 2-3 events over its lifecycle. The dashboard maintains an in-memory span map keyed by spanId, updating the span's status, duration, and metadata as events arrive. When a span transitions to completed, the dashboard calculates durationMs from startedAt and completedAt and updates the tree visualization.
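
A sketch of the client-side handling follows, with an illustrative rerenderTree placeholder; ExecutionSpan is the interface from Section 9.1, and timestamps are treated as epoch milliseconds per that definition.

TypeScript
18 lines
// Dashboard-side sketch: maintain a span map keyed by spanId and derive the tree from parentSpanId.
const spans = new Map<string, ExecutionSpan>();

declare function rerenderTree(spans: Map<string, ExecutionSpan>): void;   // hypothetical view update

function onSpanEvent(event: { jobId: string; span: ExecutionSpan }): void {
  const incoming = event.span;
  const existing = spans.get(incoming.spanId);
  const merged: ExecutionSpan = existing ? { ...existing, ...incoming } : incoming;

  // Recompute duration when the span reaches a terminal state.
  if (merged.status === 'completed' && merged.completedAt !== undefined) {
    merged.durationMs = merged.completedAt - merged.startedAt;
  }

  spans.set(merged.spanId, merged);
  rerenderTree(spans);
}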

9.5 Dashboard Visualization: Trees for Every Tier

The span tree renders differently depending on the job's execution tier, because the trees have fundamentally different shapes.

Tier 1: Flat tree. A Tier 1 job produces a minimal tree: root β†’ skill_resolve β†’ llm_call. The dashboard renders this as a simple status card with three stages: "Resolved skill β†’ Called LLM β†’ Complete." There is no tree expansion because there is no branching. The visualization is a linear progress indicator with timing annotations.

Tier 2: Iteration tree. A Tier 2 ReAct job produces a tree with depth proportional to the number of iterations and breadth proportional to the number of tool calls per iteration. The dashboard renders this as an expandable iteration list:

▼ Security Scan (Tier 2, 4 iterations, 12.4s)
  ├─ Skill Resolve: security_scan (42ms)
  ├─ ▼ Iteration 1 (3.2s)
  │   ├─ LLM Call: Claude Opus (1.8s, 2.1K tokens)
  │   └─ Tool: kubectl get pods -n nexus (1.4s)
  ├─ ▼ Iteration 2 (4.1s)
  │   ├─ LLM Call: Claude Opus (2.3s, 3.4K tokens)
  │   ├─ Tool: kubectl describe pod nexus-auth-xxx (1.2s)
  │   └─ Tool: kubectl logs nexus-auth-xxx (0.6s)
  ├─ ▼ Iteration 3 (3.8s)
  │   ├─ LLM Call: Claude Opus (2.1s, 4.2K tokens)
  │   └─ Tool: kubectl get events -n nexus (1.7s)
  └─ ▼ Iteration 4 - Final Response (1.3s)
      └─ LLM Call: Claude Opus (1.3s, 1.8K tokens)

Each node is expandable. Clicking an llm_call node reveals the prompt summary, token breakdown (input vs. output), provider, model, and cost. Clicking a tool_call node reveals the tool arguments, the raw output (truncated), the HTTP status code, and the duration breakdown (network vs. execution).

Tier 3: DAG visualization. A Tier 3 chain produces a tree with fork and join nodes that represent parallel execution paths. The dashboard renders this as a directed acyclic graph with horizontal flow:

[Step 1: OCR] ──→ ┌─ [Step 2a: Prior Art Search] ─┐
                  ├─ [Step 2b: Claim Extraction]  ────→ [Step 3: Novelty Score] ──→ [Step 4: Report]
                  └─ [Step 2c: Citation Check]    ─┘

Each step node expands to show its own span tree β€” because each step is itself a Tier 1 or Tier 2 job with its own iteration structure. The DAG visualization thus composes: the macro view shows the chain topology, and drilling into any step reveals the micro view of that step's execution tree.

The fork and join semantics are represented visually: a chain_fork span shows the fan-out moment (one predecessor dispatching multiple successors), and a chain_join span shows the barrier (multiple predecessors converging to enable a single successor). The join node displays a counter β€” "3/3 predecessors completed" β€” that updates in real time as predecessor job:completed events arrive.

Tier 4: Dynamic plan with metacognition. A Tier 4 autonomous job produces the richest tree, with goal decomposition, step execution, reflection, and plan adjustment as distinct span types:

▼ Infrastructure Remediation (Tier 4, goal: fix health check, 3m 42s)
  ├─ Goal Decompose (2.1s) - 5 steps planned
  ├─ ▼ Step 1: Get Pod Status (tool_using, 4.2s)
  │   ├─ LLM Call → Tool: kubectl get pods → LLM Response
  │   └─ Reflection: PASS - "pods identified, 1 CrashLoopBackOff"
  ├─ ▼ Step 2: Check Logs (tool_using, 6.8s)
  │   ├─ LLM Call → Tool: kubectl logs → LLM Response
  │   └─ Reflection: ADJUST - "root cause is env var, not code"
  ├─ Plan Adjust - modified steps 3-5
  ├─ ▼ Step 3: Fix Env Var (tool_using, 3.1s)
  │   ├─ LLM Call → Tool: kubectl set env → LLM Response
  │   └─ Reflection: PASS
  ├─ ▼ Step 4: Verify Fix (tool_using, 8.2s)
  │   ├─ LLM Call → Tool: kubectl rollout status → LLM Response
  │   └─ Reflection: PASS - "deployment rolled out successfully"
  └─ Goal Evaluation: ACHIEVED (1.4s)

The reflection spans are visually distinguished β€” rendered with a different background color and an icon indicating the verdict (checkmark for pass, warning triangle for adjust, refresh icon for replan, X for fail). The plan_adjust span shows a diff between the original plan and the modified plan, highlighting which steps changed and why.

Human approval gates (job:pending_review events) render as blocking nodes with a distinctive appearance β€” a pulsing border, a timer showing how long the job has been waiting for approval, and action buttons (Approve / Reject) that POST to the dispatch endpoint's approval route.

9.6 Per-Span Billing: Cost at Every Node

Every span that involves LLM token consumption carries a billing object in its metadata:

TypeScript
interface SpanBilling {
  provider: string;
  model: string;
  tokens_input: number;
  tokens_output: number;
  cost_usd: number;
  cached_tokens?: number;
}

This per-span billing creates a cost tree that mirrors the execution tree. The dashboard renders a cost annotation on every llm_call node ("$0.0012 · 1.8K tok") and aggregates costs upward through the tree. An iteration's cost is the sum of its LLM call costs. A job's cost is the sum of its iteration costs. A chain's cost is the sum of its step costs.
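
A minimal sketch of this upward roll-up, assuming each span node carries optional billing metadata and an array of children (the field names are illustrative, not the stored schema):

TypeScript
// Illustrative span node; `billing` mirrors the SpanBilling shape above.
interface SpanNode {
  billing?: { cost_usd: number };
  children: SpanNode[];
}

// An iteration's cost is the sum of its LLM calls, a job's cost the sum of its
// iterations, a chain's cost the sum of its steps: all the same recursion.
function aggregateCost(span: SpanNode): number {
  const own = span.billing?.cost_usd ?? 0;
  return span.children.reduce((sum, child) => sum + aggregateCost(child), own);
}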

The aggregation is not merely additive β€” it is attributable. An organization's monthly bill can be decomposed: "You spent $142.30 on AI this month. $67.40 on ProseCreator (47%), $34.20 on security scans (24%), $18.90 on NexusROS task planning (13%), $21.80 on everything else (16%)." This attribution flows directly from the span tree, because every LLM call span is a child of a job span, which has a job_type that maps to a skill, which maps to a functional category.

The live billing snapshot appears in every WebSocket event, not just the final job:completed. Each job:llm_response event carries the cumulative token count and cost for the job so far. The dashboard's dock bar displays this live: "$0.0034 · 3.2K tok", updating with every LLM call. Users see money being spent in real time. This transparency is not incidental; it is a design principle. Cost awareness reduces waste. When a user watches a security scan consume $0.47 worth of tokens across 10 iterations, they develop an intuition for what operations cost and whether a particular skill's iteration limit is appropriately calibrated.

The final billing in the job:completed event is the authoritative record. It is written to the job_usage_log table β€” an append-only ledger that serves as the billing source of truth. The Redis quota counter is incremented atomically (INCR) with the final token count, updating the organization's consumption against their quota.


10. Governance, Compliance, and Cost Attribution

An AI platform that executes workloads on behalf of multiple organizations β€” each with different providers, different budgets, different regulatory jurisdictions β€” is a governance system whether its designers intend it or not. The question is not whether governance will be enforced but where and how. Our answer: at the dispatch boundary, as a structural property of the architecture, not as an afterthought bolted onto execution.

10.1 Quota Enforcement: Pre-Dispatch and Post-Execution

Quota enforcement operates at two checkpoints, each serving a different purpose.

Pre-dispatch quota check (UNO). When a dispatch request arrives, UNO checks the organization's token quota by reading the counter at the key quota:org:{orgId}:tokens:month:{YYYY-MM} (the atomic INCR on this counter happens after execution; see below). If the current count exceeds the organization's configured limit, UNO returns HTTP 429 (Too Many Requests) with a structured error:

JSON
{
  "error": true,
  "code": "QUOTA_EXCEEDED",
  "message": "Organization 'acme-corp' has exceeded its monthly token quota (500,000 tokens used of 500,000 limit)",
  "troubleshooting": [
    "Contact your administrator to increase the token limit",
    "Review usage breakdown in Settings β†’ Billing β†’ Usage Details",
    "Consider optimizing high-consumption skills (see top consumers below)"
  ],
  "context": {
    "orgId": "acme-corp",
    "used": 500000,
    "limit": 500000,
    "topConsumers": ["prosecreator_chapter_generate: 210K", "security_scan: 89K"]
  }
}

The pre-dispatch check is approximate. Redis counters are updated asynchronously after job completion (see below), so there is a window β€” typically seconds, at most minutes β€” during which the counter lags behind actual consumption. A job that completes while the counter shows 495,000 of 500,000 tokens consumed might push the actual total to 510,000 before the counter catches up. This is acceptable because the alternative β€” synchronous quota validation that blocks dispatch until all in-flight jobs report their final token counts β€” would defeat the purpose of asynchronous execution.

Post-execution quota update (nexus-workflows). After every LLM call, the usage-tracker accumulates token counts within the worker process. Upon job completion, the tracker atomically increments the Redis quota counter with the job's total token consumption and writes a record to the job_usage_log table. This append-only table is the billing source of truth β€” Redis is the enforcement mechanism, PostgreSQL is the audit record.

The dual-checkpoint model balances responsiveness (approximate pre-check prevents obviously over-quota jobs from consuming resources) with accuracy (post-execution update ensures the billing record is exact).
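
A sketch of the two checkpoints against the Redis key described above, assuming an ioredis client; the helper names are illustrative and the accompanying job_usage_log insert is omitted:

TypeScript
import Redis from "ioredis";

const redis = new Redis();

// Monthly counter key, e.g. quota:org:acme-corp:tokens:month:2026-04
const quotaKey = (orgId: string) =>
  `quota:org:${orgId}:tokens:month:${new Date().toISOString().slice(0, 7)}`;

// Pre-dispatch (UNO): approximate check against the (possibly lagging) counter.
async function isWithinQuota(orgId: string, limit: number): Promise<boolean> {
  const used = Number((await redis.get(quotaKey(orgId))) ?? 0);
  return used < limit; // false => dispatch returns HTTP 429 QUOTA_EXCEEDED
}

// Post-execution (nexus-workflows): atomic increment with the job's final total,
// written alongside the append-only job_usage_log record (not shown).
async function recordFinalUsage(orgId: string, totalTokens: number): Promise<void> {
  await redis.incrby(quotaKey(orgId), totalTokens);
}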

10.2 EU AI Act Article 12: Audit Logging

The EU AI Act (Regulation 2024/1689) [23] imposes specific record-keeping requirements on providers and deployers of high-risk AI systems. Article 12 mandates that high-risk AI systems "shall technically allow for the automatic recording of events ('logs') over the lifetime of the system" [23]. These logs must enable post-market monitoring, risk detection, and operational oversight, and must be retained for a minimum of six months per log entry.

Our span tree satisfies these requirements as a structural byproduct of the execution model, not as a separate compliance system. Consider what Article 12 requires and how the span tree provides it:

Inputs and outputs. Every llm_call span's metadata contains the message array (prompt + context) sent to the provider and the completion returned. The tool_call spans contain tool arguments and tool results. The chain of spans from root to leaf provides the complete input-output pipeline for every AI operation.

Decision rationale. Tier 4 autonomous jobs produce reflection spans with explicit reasoning β€” the agent's assessment of whether its step succeeded and why. Tier 2 jobs produce react_iteration spans that capture the LLM's tool selection reasoning. These are not logs of what happened; they are logs of why it happened, in the agent's own words.

System configuration. The skill_resolve span's metadata contains the skill configuration at execution time: system prompt, model routing hint, tool permissions, iteration limits. If a skill's configuration changes between two executions, the span metadata captures both states.

Retention. The execution_spans table is partitioned by month (see Section 11.3), with a retention policy of 12 months β€” double the Article 12 minimum. Older partitions are detached and archived to cold storage, not deleted.

The nexus-auth service receives a structured audit event for every LLM call β€” a denormalized record containing the organization ID, user ID, job ID, skill name, provider, model, token counts, timestamps, and a reference to the span ID for drill-down. This audit trail is separate from the span tree, serving as a lightweight compliance record that can be exported to regulatory authorities without exposing the full operational telemetry. The audit endpoint is GDPR-aware: personal data (user identifiers) can be pseudonymized in exports, while the operational data (model used, tokens consumed, duration) remains intact.
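
The audit event itself is a small, flat record; an illustrative shape is sketched below (the field names are assumptions, the field list restates the prose above):

TypeScript
// Denormalized audit record emitted to nexus-auth for every LLM call.
interface LlmAuditEvent {
  orgId: string;
  userId: string;        // pseudonymizable in GDPR-aware exports
  jobId: string;
  skillName: string;
  provider: string;
  model: string;
  tokensInput: number;
  tokensOutput: number;
  startedAt: string;     // ISO-8601
  completedAt: string;   // ISO-8601
  spanId: string;        // drill-down reference into execution_spans
}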

10.3 Data Residency Enforcement

Multi-tenant AI platforms face a specific governance challenge that single-tenant systems do not: data residency. An EU-based organization's data must not leave the EU β€” not in transit to an LLM provider, not stored in a US-hosted cache, not logged in a monitoring system running on AWS us-east-1.

Data residency enforcement is a dispatch-time concern, not an execution-time concern. By the time execution begins, the job has been committed to a queue, and the execution worker may be running on any available node. The enforcement must happen before the job enters the pipeline.

UNO's governance pre-check includes a residency classifier (a minimal sketch follows the list below):

  1. Organization jurisdiction lookup. The organization's registered jurisdiction (e.g., EU, US, APAC) is fetched from the auth service alongside the AI configuration.

  2. Provider residency mapping. Each provider adapter declares its data residency: Gemini (US/EU, configurable per model), Anthropic (US), Claude Max (US), OpenRouter (varies by upstream model). The AI Provider Router maintains this mapping.

  3. Residency validation. If the organization's jurisdiction is EU and the resolved provider's residency is US, the dispatch is blocked with a structured error explaining the residency conflict and listing EU-resident alternatives.

  4. Override mechanism. Organizations can explicitly acknowledge residency exceptions for specific skills β€” a signed consent stored in the auth database β€” enabling cross-border routing when legally permissible (e.g., adequacy decisions, standard contractual clauses).
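
A minimal sketch of steps 1-3, assuming a small jurisdiction type and a static residency map; the map restates the adapter declarations above, and the error shape is illustrative:

TypeScript
type Jurisdiction = "EU" | "US" | "APAC";

// Residency declared by each provider adapter (restated from the list above).
const providerResidency: Record<string, Jurisdiction[]> = {
  gemini: ["US", "EU"],      // configurable per model
  anthropic: ["US"],
  claude_max: ["US"],
  openrouter: ["US"],        // varies by upstream model; conservative default assumed here
};

// Throws a residency conflict unless the provider is resident in the org's
// jurisdiction or an explicit, signed override exists (step 4).
function validateResidency(orgJurisdiction: Jurisdiction, provider: string, hasOverride: boolean): void {
  const residencies = providerResidency[provider] ?? [];
  if (!residencies.includes(orgJurisdiction) && !hasOverride) {
    throw new Error(
      `RESIDENCY_CONFLICT: provider '${provider}' is not resident in ${orgJurisdiction}; ` +
      `dispatch blocked (EU-resident alternatives listed in the structured error)`
    );
  }
}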

This enforcement is impossible without the dispatch-execution separation. If services call LLM providers directly, as the pre-UNO nexus-mageagent did, there is no chokepoint at which to enforce residency. Every service becomes responsible for its own residency check. Replication, as we have argued, is the enemy of enforcement.

10.4 Human Review Gates for High-Risk Operations

Not all AI operations are created equal. An LLM call that classifies a support ticket as "billing inquiry" poses negligible risk. An autonomous agent that executes kubectl delete deployment nexus-auth -n nexus could cause a production outage. The governance model must distinguish between these cases and apply proportional controls.

UNO's risk classification assigns one of four levels at dispatch time, based on the skill's risk_level field in the registry:

| Risk Level | Governance Action | Example Job Types |
|---|---|---|
| low | Dispatch immediately | Text classification, entity extraction, summarization |
| medium | Dispatch with enhanced audit logging | Prose generation, code review, document analysis |
| high | Dispatch with mandatory human notification | Security scans, infrastructure diagnostics |
| critical | Block dispatch until human approval received | Autonomous agents with kubectl/deploy tools, code execution |

For critical jobs, the dispatch flow pauses. UNO inserts the run record with status pending_review, emits a job:pending_review WebSocket event containing the proposed action and the agent's reasoning, and returns HTTP 202 with a status indicating approval is required. The job does not enter the BullMQ queue. No resources are consumed. No LLM calls are made.

Approval arrives via POST /dispatch/:id/approve (authenticated, authorized to the organization's admin role). Upon approval, UNO transitions the run to queued and enqueues the job to BullMQ. Upon rejection, the run transitions to rejected and no execution occurs. The human reviewer's identity and decision are recorded in the span tree as a governance_review span β€” part of the permanent audit trail.
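
A sketch of the approval route under these semantics, assuming Express and a BullMQ Queue; the run-repository helpers and the reviewer-identity header are illustrative assumptions, not the deployed handler:

TypeScript
import { Router } from "express";
import { Queue } from "bullmq";

// Illustrative persistence helpers standing in for the real run repository.
declare function getRun(jobId: string): Promise<{ jobId: string; jobType: string; payload: unknown; status: string } | null>;
declare function updateRunStatus(jobId: string, status: string, meta: Record<string, unknown>): Promise<void>;

const dispatchQueue = new Queue("nexus-orchestrator-dispatch");
export const approvalRouter = Router();

approvalRouter.post("/dispatch/:id/approve", async (req, res) => {
  const run = await getRun(req.params.id);
  if (!run || run.status !== "pending_review") {
    return res.status(409).json({ error: true, code: "NOT_PENDING_REVIEW" });
  }
  // The reviewer's identity and decision become a governance_review span (not shown).
  await updateRunStatus(run.jobId, "queued", { reviewedBy: req.header("x-user-id"), decision: "approved" });
  // Only now does the job enter BullMQ; a rejection transitions to 'rejected' and enqueues nothing.
  await dispatchQueue.add(run.jobType, run.payload);
  return res.status(200).json({ jobId: run.jobId, status: "queued" });
});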

This is not approval of the output. It is approval of the intent. The reviewer sees: "Autonomous agent requests permission to execute kubectl delete in the nexus namespace to remediate a failing health check. Tools requested: kubectl, http_request. Step budget: 20. Max replans: 3." The reviewer decides whether this scope of authority is appropriate, not whether the specific kubectl command is correct. The agent retains autonomy within the approved scope β€” the approval authorizes the mission, not the individual actions.

10.5 Live Billing via WebSocket Events

Cost transparency is a governance mechanism. When users see costs in real time β€” not in a monthly invoice, not in a dashboard they never visit, but inline in the job progress interface β€” they self-regulate.

Every WebSocket event from nexus-workflows includes a billing field with the cumulative cost snapshot:

TypeScript
interface BillingSnapshot {
  tokens_input: number;
  tokens_output: number;
  tokens_total: number;
  cost_usd: number;
  provider_breakdown: Record<string, { tokens: number; cost: number }>;
}

The dashboard's Progress Control Center (PCC) renders this as a dock bar beneath the job progress indicator: $0.0034 · 3.2K tok · Gemini Flash. The values update with every job:llm_response event. For multi-provider jobs, the provider breakdown shows cost per provider.

The final job:completed event contains the authoritative billing snapshot, which is written to the append-only job_usage_log table and used to increment the Redis quota counter. The live snapshots are approximations β€” they do not include overhead costs (API request fees, network transfer) β€” but they are accurate to within 5% of the final bill.

10.6 Cost Attribution Hierarchy

The span tree enables a four-level cost attribution hierarchy:

  1. Per-span: Individual LLM call cost β€” "$0.0008 for this specific call to Gemini Flash."
  2. Per-job: Aggregate cost of all spans in a job β€” "$0.0047 for this ProseCreator chapter."
  3. Per-skill: Aggregate cost of all jobs for a given skill type β€” "$67.40 for prosecreator_chapter_generate this month."
  4. Per-organization: Aggregate cost of all skills for an organization β€” "$142.30 total AI spend for Acme Corp this month."

Each level is derivable from the level below via aggregation queries on the execution_spans and job_usage_log tables. The dashboard exposes levels 2 (in the job detail view), 3 (in the billing summary), and 4 (in the admin billing dashboard). Level 1 is available by expanding individual span nodes in the tree visualization.
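
Levels 3 and 4 are simple group-by aggregations over the usage ledger; a sketch over in-memory rows (the column names are illustrative) shows the shape of the computation:

TypeScript
// One row per completed job in job_usage_log (illustrative columns).
interface UsageRow { orgId: string; skillName: string; costUsd: number; }

// Level 3: per-skill monthly cost; summing the map's values gives level 4 (per-organization).
function costBySkill(rows: UsageRow[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const row of rows) {
    totals.set(row.skillName, (totals.get(row.skillName) ?? 0) + row.costUsd);
  }
  return totals; // e.g. prosecreator_chapter_generate -> 67.40
}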

10.7 The No-Fallback Principle Applied to Governance

The no-fallback principle β€” described in prior sections for credential management and provider routing β€” extends to governance. If the risk classifier cannot determine a job's risk level (skill not found in registry, risk_level field null), the job is not dispatched with a default of low. It fails with a structured error:

JSON
{
  "error": true,
  "code": "RISK_CLASSIFICATION_FAILED",
  "message": "Cannot classify risk for job type 'unknown_skill': skill not found in registry",
  "troubleshooting": [
    "Register the skill in graphrag.skill_registry with a risk_level",
    "Valid risk levels: low, medium, high, critical",
    "See docs: /platform/skills/registering-a-skill"
  ]
}

Silent fallback to low risk would mean that an unregistered skill β€” which by definition has no governance metadata, no tool permissions, no iteration limits β€” executes with the least oversight. This is the governance equivalent of leaving a door unlocked because you could not find the key. The no-fallback principle ensures that ambiguity in governance configuration surfaces as a loud, actionable error rather than a silent policy bypass.
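
A sketch of the classifier's failure path, assuming a skill record with a nullable risk_level; the error class mirrors the structured error above, and the helper names are illustrative:

TypeScript
type RiskLevel = "low" | "medium" | "high" | "critical";

class DispatchError extends Error {
  constructor(public code: string, message: string) { super(message); }
}

interface SkillRecord { name: string; risk_level: RiskLevel | null; }

// No silent default: an unclassifiable skill fails loudly at dispatch time.
function classifyRisk(skill: SkillRecord | undefined, jobType: string): RiskLevel {
  if (!skill || skill.risk_level === null) {
    throw new DispatchError(
      "RISK_CLASSIFICATION_FAILED",
      `Cannot classify risk for job type '${jobType}': skill not found in registry or risk_level is null`
    );
  }
  return skill.risk_level;
}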


11. Scaling Analysis

The dispatch-execution separation is an architectural pattern. Its value, ultimately, depends on whether it delivers measurable scaling advantages over the monolithic alternative. This section analyzes the scaling characteristics of each component, demonstrates how the separation enables independent scaling, and projects behavior at three scale thresholds: 10K, 100K, and 1M concurrent users.

11.1 Per-Component Scaling Characteristics

UNO (Dispatch Layer). UNO is stateless β€” modulo a single PostgreSQL INSERT (the run record) and a single Redis LPUSH (the BullMQ enqueue). Its per-request processing cost is dominated by skill resolution (a PostgreSQL SELECT, cached in-memory with 60-second TTL) and governance pre-checks (a Redis GET for quota, a conditional Redis lookup for risk classification). Measured p99 latency: 15ms. Measured throughput: approximately 2,000 dispatches per second per pod at 70% CPU utilization. UNO scales horizontally with a Kubernetes HorizontalPodAutoscaler (HPA) [27] configured to maintain 70% average CPU utilization, with a minimum of 3 pods and a maximum of 15.

Because UNO is stateless, horizontal scaling introduces no coordination overhead. Two UNO pods do not need to agree on anything β€” they both write to the same PostgreSQL table and the same Redis queues, and PostgreSQL's MVCC and Redis's single-threaded command processing handle concurrency natively. There is no distributed consensus, no leader election, no state synchronization.

nexus-workflows (Execution Workers). Workers are CPU-and-memory-bound during tool execution and I/O-bound during LLM calls. The scaling unit is the queue, not the service. The default queue runs at concurrency 50 across 3 replicas (150 concurrent jobs). The gpu queue runs at concurrency 5 on 1 GPU-attached replica. The security queue runs at concurrency 10 across 2 replicas.

Queue-based autoscaling is achieved through KEDA [28] (Kubernetes Event-Driven Autoscaling), which monitors the BullMQ queue depth in Redis and scales worker deployments proportionally. When the default queue exceeds 100 pending jobs, KEDA adds a worker replica. When the queue drains below 20, KEDA removes a replica. This is fundamentally different from HPA's CPU-based scaling: workers may be at 10% CPU (waiting for LLM responses) while the queue has 500 pending jobs. CPU-based scaling would not react; queue-depth-based scaling would.
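
KEDA expresses this rule declaratively in its ScaledObject configuration; restated as plain logic (an illustration only, not deployed code), the queue-depth policy looks roughly like this:

TypeScript
// Thresholds quoted above: scale out beyond 100 pending jobs, scale in below 20.
function desiredWorkerReplicas(queueDepth: number, currentReplicas: number, jobsPerReplica = 50): number {
  if (queueDepth > 100) {
    // Scale out proportionally to the backlog.
    return Math.max(currentReplicas, Math.ceil(queueDepth / jobsPerReplica));
  }
  if (queueDepth < 20) {
    // Queue is draining: release a replica, but keep at least one worker.
    return Math.max(1, currentReplicas - 1);
  }
  return currentReplicas;
}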

Redis. At current scale (single Redis instance), Redis handles approximately 5,000 operations per second (BullMQ enqueue/dequeue, Pub/Sub, quota counters). Redis's single-threaded architecture provides natural serialization β€” quota INCR operations are atomic, Pub/Sub delivery is ordered per channel. The scaling boundary is approximately 50,000 operations per second, beyond which Redis Cluster [29] (3 masters with hash-slot sharding) becomes necessary. BullMQ supports Redis Cluster natively; the migration is a configuration change, not a code change.

PostgreSQL. The primary tables β€” orchestrator.runs and orchestrator.execution_spans β€” grow linearly with job volume. At current production volume (8,000 jobs/day, ~200K spans/day), a single PostgreSQL instance with PgBouncer [30] connection pooling (transaction mode, 100 pool size) handles the load. The scaling strategy is partitioning: the runs table is partitioned by month using PostgreSQL's native declarative partitioning [31], enabling efficient queries over recent data while archiving old partitions to cold storage.

SQL
CREATE TABLE orchestrator.runs (
  job_id UUID NOT NULL,
  org_id UUID NOT NULL,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  -- ... other columns
) PARTITION BY RANGE (created_at);

CREATE TABLE orchestrator.runs_2026_04 PARTITION OF orchestrator.runs
  FOR VALUES FROM ('2026-04-01') TO ('2026-05-01');

Gateway (WebSocket Relay). The gateway maintains persistent Socket.IO connections with all connected dashboard clients. Socket.IO requires sticky sessions [32] β€” a client's WebSocket connection must always reach the same gateway pod. Istio's consistent-hash load balancing (hashed on the io cookie) provides this affinity without application-layer session management. The gateway scales from 3 to 15 pods via HPA, with each pod handling approximately 2,000 concurrent WebSocket connections.

11.2 Independent Scaling in Practice

The separation principle manifests as independent scaling vectors:

| Scaling Pressure | Monolithic Response | Separated Response |
|---|---|---|
| Burst of 1,000 dispatch requests | Scale entire service (dispatch + execution) | Scale UNO only (3 → 8 pods, 15 seconds) |
| GPU queue backlog of 200 jobs | Cannot isolate GPU scaling | Scale GPU workers only (1 → 4 replicas) |
| WebSocket connection spike | Scale entire service | Scale gateway only |
| LLM provider latency increase | Workers blocked, dispatch stalls | Workers slow down; dispatch unaffected |
| Redis memory pressure | All functions degraded | Isolate: BullMQ Redis vs. Pub/Sub Redis |

The fourth row is the most consequential. In the monolithic architecture, when Gemini's API latency increases from 1.5s to 8s (as occurred during a provider-side capacity event in January 2026), the dispatch pipeline stalled because the execution thread pool was saturated with slow LLM calls. No new jobs could be dispatched. Health checks failed. Pods restarted. Queued work was lost.

In the separated architecture, the same provider latency increase causes workers to process jobs more slowly β€” each job takes longer, so the queue depth increases β€” but dispatch is unaffected. UNO continues accepting and queuing jobs at full speed. The queue absorbs the burst. When provider latency normalizes, workers drain the backlog. No restarts. No lost work.

11.3 Scaling Scenarios

10K concurrent users (current). 3 UNO pods, 7 worker pods (across all queues), 1 Redis instance, 1 PostgreSQL instance with PgBouncer, 3-5 gateway pods. Monthly data: ~240K jobs, ~6M spans, ~50GB PostgreSQL storage. No partitioning urgently needed but implemented proactively. Total compute: approximately 24 vCPUs, 48GB RAM.

100K concurrent users. 8-10 UNO pods, 20-25 worker pods, Redis Cluster (3 masters, 3 replicas), PostgreSQL with monthly partitioning and read replicas for analytics, 10-15 gateway pods. Monthly data: ~2.4M jobs, ~60M spans, ~500GB PostgreSQL storage. KEDA autoscaling essential for handling queue-depth spikes. Estimated compute: approximately 100 vCPUs, 200GB RAM. The critical bottleneck shifts from CPU to network I/O (Redis Pub/Sub fanout to 15 gateway pods).

1M concurrent users. 30-50 UNO pods, 80-120 worker pods across specialized queue deployments, Redis Cluster (6 masters, 12 replicas) with separate clusters for BullMQ and Pub/Sub, PostgreSQL with Citus distributed extension or Aurora PostgreSQL with read replicas, 50+ gateway pods behind geographic load balancing. Monthly data: ~24M jobs, ~600M spans. At this scale, the span tree's "record everything" policy requires re-evaluation β€” sampling for low-risk Tier 1 jobs while retaining full recording for Tier 2+ becomes a viable cost-performance tradeoff. Estimated compute: approximately 500 vCPUs, 1TB RAM. The architecture's linearity β€” each component scales independently with near-linear cost β€” holds through this range, though operational complexity increases substantially.


12. Migration Strategy

Architectural papers tend to present their designs as if they emerged fully formed, as Athena from Zeus's head. They did not. The Unified Nexus Orchestrator was carved β€” over nine migration phases spanning approximately seven weeks β€” from a monolithic predecessor that coupled dispatch, execution, governance, and provider routing in a single service boundary. This section describes how the separation was achieved incrementally, without a flag-day cutover that would have required stopping the world.

12.1 The Nine-Phase Plan

The migration follows the strangler fig pattern [33]: new capability is built alongside the existing system, traffic is gradually shifted, and the old system is decommissioned only after the new system has proven itself in production. Each phase has explicit verification criteria β€” observable outcomes that confirm the phase is complete before the next begins.

Phase 1: Fix Redis and Database Foundation (Immediate, 2 days). Status: COMPLETE. Migrations 001–008 applied. orchestrator.runs and orchestrator.execution_spans tables created. routing_hint and dispatch_mode columns added (migrations 007–008). Redis connection pooling via ioredis cluster-aware configuration.

Phase 2: Split β€” Extract Execution to nexus-workflows (1 week). Status: COMPLETE. nexus-workflows deployed as independent K8s service (2 replicas, K8s deployment manifest in /k8s/base/deployments/nexus-workflows.yaml). Execution engine handles Tiers 1–4. BullMQ Worker consumes from nexus-orchestrator-dispatch queue. 684-line job-processor.ts dequeues, re-resolves skill, routes by execution_type.

Phase 2.5: Span-Tree Visualization (4 days). Status: COMPLETE. execution_spans table (migration 004), job:span WebSocket events emitted, job-event-relay.ts (270 lines) in gateway subscribes to Redis Pub/Sub and relays to Socket.IO rooms.

Phase 3: Named Queues (3 days). Status: COMPLETE. queue_name column in skill_registry (migration 003). Jobs route to named queues based on skill configuration. Concurrency settings per queue.

Phase 4: Chain DAG Coordinator (1 week). Status: COMPLETE. chain-coordinator.ts in nexus-orchestrator handles step completions, fork/join dispatch, and conditional branching. Chain step callbacks at POST /api/v1/dispatch/chain/step-completed and step-failed.

Phase 5: Batch Dispatch (3 days). Status: COMPLETE. POST /api/v1/dispatch/batch endpoint in dispatch-routes.ts. Batch state tracked in Redis.

Phase 6: Tier 4 Autonomous Engine (1 week). Status: COMPLETE. autonomous-engine.ts in nexus-workflows handles the 'autonomous' execution type. Goal decomposition, step execution, reflection, plan adjustment. Ported from nexus-mageagent. nexus-mageagent continues to operate separately with partial gateway migration.

Phase 7: Multi-Provider AI Routing (1 week). Status: PARTIAL. The routing_hint plumbing is complete end-to-end: column in skill_registry (migration 007), values back-filled (migration 008), llm-client.ts sends routingHint in request body, gateway reads it. However, the AI Provider Router does not yet act on the hint β€” resolveOrgConfig returns a single provider per org, adapter selection is config.aiProvider unconditionally. Multi-provider selection (array of providers with routing policy) is designed but not implemented.

Phase 8: New Tool Executors (1 week). Status: PARTIAL (7 of 11). Deployed: kubectl, http_request, prosecreator, graphrag_query, hpc_gateway, n8n_trigger, autoresearch. Pending: jupyter_execute, cvat_annotate, sandbox_execute, vision_analyze.

Phase 9: Kill Legacy (2 days). Status: COMPLETE (3 of 3). nexus-trigger: SCALED TO 0 (all 6 dependent services migrated to orchestrator dispatch; 32 new skills seeded via migration 010). nexus-orchestration: SCALED TO 0 (dependency audit found both env var references were dead; removed from K8s manifests and ALLOWED_AI_CALLERS). nexus-mageagent: FULLY MIGRATED (OpenRouterClient deleted, 650 lines; llm-router deleted, 172 lines; llm-route middleware deleted; 41 files changed, -1126 net lines). All LLM traffic routes through GatewayLLMClient → gateway AI Provider Router. Added to ALLOWED_AI_CALLERS.

12.2 Component Relocation Table

| Component | Origin (Pre-Migration) | Destination (Post-Migration) |
|---|---|---|
| Dispatch logic (validate, enqueue) | nexus-trigger | UNO |
| Skill resolution | nexus-trigger (hardcoded routing) | UNO → skill_registry DB |
| Governance pre-checks | Absent | UNO |
| BullMQ queue management | nexus-trigger (single queue) | UNO (dispatch) + nexus-workflows (consume) |
| ReAct loop execution | nexus-trigger | nexus-workflows |
| Tool executors | nexus-trigger + nexus-mageagent | nexus-workflows |
| LLM client (provider calls) | nexus-trigger + nexus-mageagent + nexus-alive | nexus-workflows → AI Provider Router |
| API key resolution | Each service independently | AI Provider Router → nexus-auth |
| Chain coordination | nexus-orchestration (limited) | UNO completion listener |
| Autonomous agent logic | nexus-mageagent | nexus-workflows Tier 4 |
| Span-tree observability | Absent | nexus-workflows + UNO |
| WebSocket event emission | nexus-trigger (direct) | nexus-workflows → Redis Pub/Sub → gateway |

12.3 Legacy Service Deprecation

Three services become obsolete after the migration:

nexus-trigger. The original dispatch-and-execute monolith. Its dispatch logic moves to UNO; its execution logic moves to nexus-workflows. After Phase 2, it runs in parallel as a fallback. After Phase 9, it is scheduled for deprecation. As of Q2 2026, nexus-trigger is still running β€” it has not yet been scaled to zero. Deprecation is planned after nexus-workflows consumers are fully validated. Over its operational lifetime, nexus-trigger processed approximately 2.1 million jobs. Its Redis connection exhaustion failures β€” the original motivation for the separation β€” occurred 47 times, each incident affecting all dispatched jobs for 30-90 seconds.

nexus-orchestration. A limited chain coordinator that supported linear pipelines (step A → step B → step C) but not DAG patterns (fork, join, conditional). After Phase 4 introduces the DAG coordinator in UNO, nexus-orchestration becomes redundant.

nexus-mageagent. The autonomous agent service that called LLM providers directly, bypassing the AI Provider Router and all governance controls. After Phase 6 and Phase 8, its agent logic and tool executors are subsumed by nexus-workflows Tier 4. Its direct SDK calls β€” the governance bypass that prompted the lockdown β€” are eliminated.

12.4 Verification Criteria Summary

Each phase completes only when its verification criteria are met. These criteria are not test suites β€” they are observable production behaviors:

| Phase | Criterion | Measurement Method |
|---|---|---|
| 1 | Redis connections drop >50% | redis-cli info clients before/after |
| 2 | Output hash match for 100% of shadow jobs | Comparison log over 48-hour window |
| 2.5 | Span trees render for all 4 tiers | Manual dashboard inspection + screenshot evidence |
| 3 | Jobs route to correct named queue | BullMQ dashboard showing per-queue job counts |
| 4 | Chain survives UNO restart | Kill UNO pod during chain execution; chain resumes |
| 5 | Batch of 50 completes and aggregates | batch:completed event with 50 results |
| 6 | Approval gate pauses/resumes correctly | Submit critical job, verify pending_review, approve, verify completion |
| 7 | Different providers for different steps | Span metadata showing provider per llm_call |
| 8 | Each new tool responds to structured invocation | Per-tool integration test from Tier 2 job |
| 9 | Zero fallback traffic over 48 hours | Gateway access logs showing no requests to legacy endpoints |

13. Discussion

13.1 The Dispatch-Execution Tradeoff

The separation of dispatch and execution is not free. It introduces latency β€” the queue transit time between UNO enqueuing a job and nexus-workflows dequeuing it. In practice, this latency is small (median 12ms, p99 47ms under normal load) because BullMQ's blocking dequeue (BRPOPLPUSH) returns immediately when the queue is non-empty. Under load, when the queue has pending jobs and all workers are busy, the queue wait time can reach seconds or minutes β€” but this is precisely the scenario where the separation provides its greatest benefit. The queue is absorbing a burst that would have crashed a monolithic dispatcher-executor.

The deeper tradeoff is between synchronous simplicity and asynchronous composability. A synchronous LLM call β€” const result = await callLLM(prompt) β€” is easy to write, easy to debug, easy to reason about. A dispatched job β€” const jobId = await dispatch(jobType, payload); onComplete(jobId, handleResult) β€” requires restructuring control flow, handling eventual delivery, and managing the gap between "job accepted" and "result ready." As noted in Section 8.4, this restructuring accounted for 90% of the migration effort.
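
A sketch of the restructured control flow, using the dispatch/onComplete helpers quoted above as ambient declarations (the helper signatures, the job type, and the UI hook are illustrative):

TypeScript
// Ambient stand-ins for the client helpers named in the text.
declare function dispatch(jobType: string, payload: unknown): Promise<string>;          // resolves with jobId on HTTP 202
declare function onComplete(jobId: string, handler: (result: unknown) => void): void;   // fires on job:completed
declare function renderSummary(result: unknown): void;                                  // hypothetical UI hook

async function summarizeDocument(documentId: string): Promise<void> {
  // "Job accepted": control returns immediately after the 202.
  const jobId = await dispatch("document_summarize", { documentId });
  // "Result ready" arrives later over the WebSocket; the gap is handled here.
  onComplete(jobId, (result) => renderSummary(result));
}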

Is the tradeoff worth it? For single-shot LLM calls with no governance requirements and no multi-tenant cost attribution needs β€” arguably not. For a platform serving 67 microservices across multiple organizations with heterogeneous workload types, regulatory obligations, and per-organization billing β€” unambiguously yes. The synchronous model cannot provide governance at a chokepoint, because there is no chokepoint. The synchronous model cannot provide queue-based autoscaling, because there is no queue. The synchronous model cannot provide fault isolation between dispatch and execution, because they share a process. The benefits of separation scale superlinearly with platform complexity.

13.2 The Chat Exception

Real-time chat β€” the Revenue Intelligence Analyst, plugin chat windows, conversational interfaces β€” bypasses UNO entirely. Chat traffic routes through the nexus-gateway's chat-orchestrator directly to the dashboard's AI provider proxy endpoints. This is the only exception to the dispatch-execution separation, and it merits explanation.

Chat requires sub-second time-to-first-token. The user is watching a cursor blink. Adding a queue transit (even 12ms median) and a dispatch overhead (15ms p99) to every message would not, by itself, degrade the experience noticeably. But the queue introduces variance β€” under load, wait times spike to seconds β€” and variance in interactive latency is more disruptive than consistently elevated latency. Users tolerate a chatbot that always takes 300ms to start responding. They do not tolerate one that usually takes 50ms but occasionally takes 3 seconds.

More fundamentally, chat does not benefit from the separation's advantages. Chat does not need queue-based autoscaling (it is connection-bound, not compute-bound). Chat does not need governance pre-checks (the user is the human in the loop). Chat does not need span-tree observability (conversation history is the audit trail). Chat does not need multi-step coordination (each message is independent). The dispatch-execution separation solves problems that chat does not have.

The exception is narrow and explicit. Only user-facing conversational interfaces β€” those where a human is watching tokens stream in real time β€” qualify. ProseCreator's prose generation, even though it has a "chat-like" interface, is a dispatched job because it generates structured output (chapters, outlines, character profiles) that benefits from governance, cost attribution, and span-tree observability. The distinction is not "does it look like chat?" but "does it need the guarantees that dispatch provides?"

13.3 Limitations

Several limitations of the current architecture merit candid acknowledgment.

Token quotas are approximate. The pre-dispatch quota check uses Redis counters that lag behind actual consumption. An organization can exceed its quota by the token cost of all currently in-flight jobs. For organizations with tight quotas and many concurrent jobs, this overshoot can be significant. A precise quota would require synchronous token reservation at dispatch time β€” debiting estimated tokens before execution and refunding the difference after β€” which adds complexity and latency to the dispatch path.

The chain DAG engine is implicit. Chain definitions are JSON structures interpreted by the completion listener, not a proper workflow language. There is no visual chain editor (chains are defined programmatically), no conditional expression evaluator beyond simple comparisons, and no loop construct (a step cannot re-dispatch itself). More sophisticated chain patterns β€” dynamic sub-DAG expansion, parameterized chain templates, chains that modify their own topology based on intermediate results β€” require extension of the completion listener's state machine, which is currently 600 lines of TypeScript that would benefit from formal verification.

Span storage scales linearly. The "record every span" policy provides complete observability and regulatory compliance but costs storage. At 1M users, the projected 600M spans per month would require approximately 300GB of PostgreSQL storage per month before archival. Tiered sampling β€” full recording for Tier 2+ and sampled recording for Tier 1 β€” would reduce this by approximately 60% given that Tier 1 jobs constitute 60% of volume.

Multi-agent patterns are expensive. The compete pattern (N parallel agents) and self-consistent pattern (N diverse reasoning paths, inspired by Wang et al. [34]) multiply token costs by N. At N=3, a $0.01 job becomes a $0.03 job. The quality improvement (23% for compete, 17% for self-consistent on measured benchmarks) may not justify the 3x cost for all use cases. The current implementation offers no cost-aware selection between single-agent and multi-agent execution β€” the pattern is configured per skill, not dynamically selected based on budget.

13.4 Lessons from the Monolithic Baseline

Three observations from 14 months of operating the monolithic nexus-trigger informed the separated architecture:

First, shared Redis connections are the canary. Redis connection exhaustion was the most frequent failure mode in the monolithic nexus-trigger β€” 47 incidents over its operational lifetime. It was also the most diagnostic: when dispatch and execution share connections, connection exhaustion is a structural signal that the two responsibilities are too tightly coupled. Fixing the symptom (increasing the connection pool) delayed but did not prevent recurrence. Fixing the cause (separating the connection consumers into UNO + nexus-workflows) eliminates the failure class.

Second, governance cannot be retrofitted. Adding risk classification, data residency checks, and human-in-the-loop gates to a monolithic service that both dispatches and executes requires threading governance logic through every execution path. In a system with four execution paths (nexus-trigger, nexus-mageagent, and two services calling providers directly), this means implementing the same governance check in four places. We implemented it in zero, because the effort of implementing it four times, and keeping four implementations synchronized, was prohibitive. The dispatch-execution separation creates a single governance chokepoint where one implementation suffices.

Third, observability built in is observability that exists. The pre-UNO system had logging. It had metrics. It had occasional traces shipped to Jaeger. None of them told us why a security scan took 47 seconds instead of 12 β€” because the ReAct loop's per-iteration timing, tool I/O, and LLM reasoning were not captured as structured data. The span tree makes these invisible operations visible, not because someone remembers to add a log line, but because the execution model is the span tree.


14. Conclusion

This paper has presented the Unified Nexus Orchestrator (UNO) and argued, through architectural analysis and production experience, that dispatch and execution are fundamentally different responsibilities that must not share a process boundary in multi-chain AI workload platforms.

The core principle is simple β€” deceptively so. UNO routes. nexus-workflows executes. The AI Provider Router translates. These three services occupy distinct layers of the stack with enforced boundaries: Istio AuthorizationPolicy at the mesh (3 authorized principals), service key validation at the application, caller identity verification at the request. The dispatch-execution separation is deployed and serving production traffic. All four execution tiers β€” LLM-only, tool-using ReAct, chain DAG, and autonomous agent β€” are operational in nexus-workflows. The remaining gaps are honest and documented: multi-provider routing is plumbed but the router doesn't yet act on routing hints; nexus-orchestration still bypasses governance via direct OpenRouter calls; the ALLOWED_AI_CALLERS env var requires explicit configuration to match the Istio policy. Section 14.1 documents the deployment state component by component.

The contributions of this work are sevenfold. First, the architectural separation of dispatch and execution eliminates an entire class of reliability failures β€” Redis connection exhaustion, dispatch stalls during provider latency events, governance bypass by rogue services β€” that plagued the monolithic predecessor. Second, the four-tier execution taxonomy (LLM-only, tool-using ReAct, chain DAG, autonomous agent) maps workload types to resource isolation strategies, governance requirements, and observability granularity with no ambiguity at the tier boundaries β€” all four tiers are deployed. Third, multi-provider AI routing architecture with per-organization configuration and per-skill routing hints is plumbed end-to-end (routing_hint in skill_registry, in the LLM client request body, read by the gateway) but the router's adapter selection does not yet act on the hint β€” single-provider-per-org remains the current state, with multi-provider selection as the next implementation milestone. Fourth, chain DAG coordination via BullMQ completion listeners supports fork, join, and conditional branching without blocking the dispatcher or requiring workers to understand DAG topology. Fifth, span-tree observability records every operation as a hierarchical span, satisfying both operational debugging needs and EU AI Act Article 12 [23] audit trail requirements as a structural byproduct of the execution model. Sixth, governance pre-checks at the dispatch boundary β€” risk classification, data residency enforcement, human-in-the-loop gates, quota enforcement β€” ensure that no AI operation dispatched through UNO executes without appropriate oversight (though bypass paths via legacy services remain, as documented in Section 8.5). Seventh, the nine-phase migration strategy demonstrates that the separation can be achieved incrementally, following the strangler fig pattern [33] β€” seven of nine phases are complete, with multi-provider routing and legacy deprecation in progress.

The separation principle is not unique to AI workloads. It applies wherever a system must decide what to do and then do it β€” wherever the decision has different scaling characteristics, different failure modes, and different governance requirements than the action. What makes AI workloads particularly susceptible to the monolithic anti-pattern is the temptation to treat LLM calls as fast, stateless operations analogous to database queries. They are not. An LLM call may take 800 milliseconds or 45 seconds. It may trigger a tool-calling loop of unbounded depth. It may consume $0.0001 or $0.50 in a single invocation. It may process data subject to jurisdictional constraints. These characteristics make the case for separation not merely persuasive but, we argue, necessary.

What UNO enables β€” and what the monolith could not β€” is a platform where adding a new AI capability requires inserting a row into a database table, not modifying the orchestrator's source code. A new skill registered in graphrag.skill_registry is immediately dispatchable, governed, observed, and billed β€” without a deployment, without a code review, without risk of introducing a regression in the dispatch pipeline. This is the operational dividend of the separation principle: the dispatch boundary is stable, the execution engine is general, and the surface area for change is the skill registry β€” a database row, not a codebase.


14.1 Implementation Status Scorecard

The following table reflects the deployment state as of April 2026, verified against source code and running pods. This scorecard is the honest accounting of where the migration stands.

Core Infrastructure

| Component | Status | Evidence |
|---|---|---|
| nexus-orchestrator dispatch | DEPLOYED | 2 replicas running, dispatch-routes.ts (837 lines) |
| nexus-workflows execution | DEPLOYED | 2 replicas running, K8s manifest, execution-engine.ts |
| skill_registry with execution_type | DEPLOYED | Migration 001, 135+ job types seeded (migration 003) |
| routing_hint column | DEPLOYED | Migration 007, CHECK constraint on valid values |
| dispatch_mode column | DEPLOYED | Migration 007, CHECK constraint (single/batch/chain) |
| routing_hint values back-filled | DEPLOYED | Migration 008 (security → reasoning, lead_score → fast, etc.) |
| orchestrator.runs table | DEPLOYED | Migration 006 |
| execution_spans table | DEPLOYED | Migration 004 |
| BullMQ enqueue/dequeue split | DEPLOYED | Orchestrator: Queue-only (89-line job-processor.ts); Workflows: Worker (684-line job-processor.ts) |

Execution Tiers

| Component | Status | Evidence |
|---|---|---|
| Tier 1 (llm_only) | DEPLOYED | execution-engine.ts case 'llm_only' |
| Tier 2 (tool_using) | DEPLOYED | ReAct loop with forced first tool call, configurable max iterations |
| Tier 3 (chain) | DEPLOYED | chain-coordinator.ts in orchestrator + per-step execution in workflows |
| Tier 4 (autonomous) | DEPLOYED | autonomous-engine.ts in nexus-workflows, ported from mageagent |
| Human approval gates | DEPLOYED | POST /dispatch/:id/approve route in dispatch-routes.ts |

Provider Routing & Governance

| Component | Status | Evidence |
|---|---|---|
| AI Provider Router (4 adapters) | DEPLOYED | ai-provider-router.ts (623 lines), Gemini/Anthropic/ClaudeMax/OpenRouter |
| ALLOWED_AI_CALLERS enforcement | DEPLOYED | 3 principals: nexus-orchestrator, nexus-workflows, nexus-mageagent |
| validateServiceKey middleware | DEPLOYED | internal-ai-routes.ts (910 lines) |
| validateCallerIdentity middleware | DEPLOYED | caller-identity.ts, reads ALLOWED_AI_CALLERS env var |
| Redis Pub/Sub → Socket.IO relay | DEPLOYED | job-event-relay.ts (270 lines) |
| Multi-provider routing | NOT DEPLOYED | routingHint plumbed end-to-end but router ignores it; single provider per org |
| ALLOWED_AI_CALLERS env var | DEPLOYED | Live value: nexus-orchestrator,nexus-workflows,nexus-mageagent |

Tool Executors (7 of 11)

| Tool | Status |
|---|---|
| kubectl | DEPLOYED |
| http_request | DEPLOYED |
| prosecreator | DEPLOYED |
| graphrag_query | DEPLOYED |
| hpc_gateway | DEPLOYED |
| n8n_trigger | DEPLOYED |
| autoresearch | DEPLOYED |
| jupyter_execute | NOT DEPLOYED |
| cvat_annotate | NOT DEPLOYED |
| sandbox_execute | NOT DEPLOYED |
| vision_analyze | NOT DEPLOYED |

Legacy Service Deprecation

| Service | Status | Detail |
|---|---|---|
| nexus-trigger | SCALED TO 0 | All 6 dependent services migrated to orchestrator dispatch (April 2026). nexus-auth webhook emitter rewired to dispatch user_event_notification jobs. nexus-autoresearch PipelineOrchestrator + MultiServiceConnector rewired. nexus-plugins onboarding pipeline (12 tasks) rewired. Dashboard health-reports migrated to orchestrator runs API. 32 new skills seeded in skill_registry (migration 010). VirtualService references remain but route to no pods. |
| nexus-orchestration | SCALED TO 0 | Scaled to 0 replicas (April 2026). Dependency audit found: nexus-robotics env var was dead (never called), nexus-security env var targeted a non-existent endpoint (silent failure). Both env vars removed from K8s manifests. Removed from ALLOWED_AI_CALLERS. VirtualService references remain but route to no pods. |
| nexus-mageagent | COMPLETE | Full migration to GatewayLLMClient (April 2026). OpenRouterClient deleted (650 lines), llm-router deleted (172 lines), llm-route middleware deleted. 41 files changed, -1126 net lines. All LLM traffic routes through gateway AI Provider Router. Added to ALLOWED_AI_CALLERS. |

Workflow Bindings (Section 4.7)

| Component | Status | Evidence |
|---|---|---|
| Workflow binding concept (skill_registry as binding source) | DEPLOYED | 406+ bindings in skill_registry, all execution methods represented |
| WorkflowJobDispatcher dispatch-time resolution | IN PROGRESS | Architecture defined; client-side lookup being integrated |
| Workflow Bindings admin page | PARTIAL | Exists in dashboard (skill-bindings route); needs refactoring to unified UI |
| Auth key alignment (orchestrator ↔ nexus-auth) | DEPLOYED | Fixed: AUTH_SERVICE_KEY env var added to orchestrator deployment |
| Request field name alignment (organizationId → orgId) | DEPLOYED | Fixed in governance.ts |
| Model routing via execution_config.modelOverride | DEPLOYED | JSONB parsing fix in skill-resolver.ts |

Migration Phase Summary

| Phase | Status |
|---|---|
| 1. Fix Redis + DB | COMPLETE |
| 2. Split execution to nexus-workflows | COMPLETE |
| 2.5. Span-tree visualization | COMPLETE |
| 3. Named queues | COMPLETE |
| 4. Chain DAG coordinator | COMPLETE |
| 5. Batch dispatch | COMPLETE |
| 6. Tier 4 autonomous engine | COMPLETE |
| 7. Multi-provider AI routing | PARTIAL (plumbing complete, router selection not implemented) |
| 8. New tool executors | PARTIAL (7 of 11) |
| 9. Kill legacy | COMPLETE (3 of 3: nexus-orchestration scaled to 0, nexus-trigger scaled to 0, nexus-mageagent fully migrated to GatewayLLMClient) |

References

[1] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica, "Efficient Memory Management for Large Language Model Serving with PagedAttention," in Proceedings of the 29th ACM Symposium on Operating Systems Principles (SOSP '23), 2023. arXiv:2309.06180.

[2] G.-I. Yu, J. S. Jeong, G.-W. Kim, S. Kim, and B.-G. Chun, "Orca: A Distributed Serving System for Transformer-Based Generative Models," in Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI '22), 2022.

[3] L. Zheng, L. Yin, Z. Xie, J. Huang, C. Sun, C. H. Yu, S. Cao, C. Kozyrakis, I. Stoica, J. E. Gonzalez, C. Barrett, and Z. Sheng, "SGLang: Efficient Execution of Structured Language Model Programs," arXiv:2312.07104, 2024.

[4] A. Agrawal, N. Kedia, A. Panwar, J. Mohan, N. Kwatra, B. S. Gulavani, A. Tumanov, and R. Ramjee, "Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve," in Proceedings of the 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI '24), 2024. arXiv:2403.02310.

[5] Apache Software Foundation, "Apache Airflow," airflow.apache.org, 2024.

[6] Dagster Labs, "Dagster: An Orchestration Platform for Data Assets," dagster.io, 2024.

[7] Temporal Technologies, "Temporal: Durable Execution Platform," temporal.io, 2024.

[8] A. Nadeem and M. Z. Malik, "A Case for Microservices Orchestration Using Workflow Engines," in Proceedings of the ACM/IEEE 44th International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER '22), 2022. arXiv:2204.07210.

[9] Netflix Technology Blog, "Maestro: Netflix's Workflow Orchestrator," netflixtechblog.com, 2024.

[10] L. Chen, M. Zaharia, and J. Zou, "FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance," arXiv:2305.05176, 2023.

[11] I. Ong, A. Almahairi, V. Wu, W.-L. Chiang, T. Wu, J. E. Gonzalez, M. W. Kadous, and I. Stoica, "RouteLLM: Learning to Route LLMs with Preference Data," arXiv:2406.18665, 2024.

[12] J. Dekoninck et al., "A Unified Approach to Routing and Cascading for LLMs," arXiv:2410.10347, 2024.

[13] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao, "ReAct: Synergizing Reasoning and Acting in Language Models," in Proceedings of the International Conference on Learning Representations (ICLR '23), 2023. arXiv:2210.03629.

[14] T. Richards et al., "Auto-GPT: An Autonomous GPT-4 Experiment," github.com, 2023.

[15] L. Wang et al., "A Survey on Large Language Model based Autonomous Agents," arXiv:2308.11432, 2023.

[16] T. Guo et al., "Large Language Model based Multi-Agents: A Survey of Progress and Challenges," in Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI '24), 2024. arXiv:2402.01680.

[17] T. Masterman, S. Besen, M. Penneschi, and T. Marandon, "The Landscape of Emerging AI Agent Architectures for Reasoning, Planning, and Tool Calling: A Survey," arXiv:2404.11584, 2024.

[18] Istio Project, "Istio Security: Authorization Policy," istio.io, 2024.

[19] R. Laigner, Y. Zhou, M. A. V. Salles, Y. Liu, and M. Kalinowski, "Data Management in Microservices: State of the Practice, Challenges, and Research Directions," Proceedings of the VLDB Endowment, vol. 14, no. 13, pp. 3348--3361, 2021. arXiv:2103.00170.

[20] R. Laigner, A. C. Almeida, W. K. G. Assuncao, and Y. Zhou, "An Empirical Study on Challenges of Event Management in Microservice Architectures," arXiv:2408.00440, 2024.

[21] B. H. Sigelman, L. A. Barroso, M. Burrows, P. Stephenson, M. Plakal, D. Beaver, S. Jaspan, and C. Shanbhag, "Dapper, a Large-Scale Distributed Systems Tracing Infrastructure," Google Technical Report, 2010.

[22] OpenTelemetry Authors, "OpenTelemetry Specification," opentelemetry.io, 2024.

[23] European Parliament and Council of the European Union, "Regulation (EU) 2024/1689 Laying Down Harmonised Rules on Artificial Intelligence (AI Act)," Official Journal of the European Union, 2024.

[24] D. Dellermann, P. Ebel, M. Soellner, and J. M. Leimeister, "The Future of Human-AI Collaboration: A Taxonomy of Design Knowledge for Hybrid Intelligence Systems," arXiv:2105.03354, 2021.

[25] Uber Engineering, "Jaeger: Open Source, End-to-End Distributed Tracing," jaegertracing.io, 2017. CNCF Graduated Project.

[26] Grafana Labs, "Grafana Tempo: High-Volume Distributed Tracing Backend," grafana.com, 2024.

[27] Kubernetes Project, "Horizontal Pod Autoscaling," kubernetes.io, 2024.

[28] KEDA Contributors, "KEDA: Kubernetes Event-Driven Autoscaling," keda.sh, 2024. CNCF Graduated Project.

[29] Redis Ltd., "Redis Cluster Specification," redis.io, 2024.

[30] PgBouncer Contributors, "PgBouncer: Lightweight Connection Pooler for PostgreSQL," pgbouncer.org, 2024.

[31] PostgreSQL Global Development Group, "PostgreSQL Documentation: Table Partitioning," postgresql.org, 2024.

[32] Socket.IO Contributors, "Using Multiple Nodes," socket.io, 2024.

[33] M. Fowler, "StranglerFigApplication," martinfowler.com, 2004.

[34] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. H. Chi, S. Narang, A. Chowdhery, and D. Zhou, "Self-Consistency Improves Chain of Thought Reasoning in Language Models," in Proceedings of the International Conference on Learning Representations (ICLR '23), 2023. arXiv:2203.11171.

[35] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, and D. Zhou, "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models," in Proceedings of the 36th Conference on Neural Information Processing Systems (NeurIPS '22), 2022. arXiv:2201.11903.

[36] N. Shinn, F. Cassano, E. Berman, A. Gopinath, K. Narasimhan, and S. Yao, "Reflexion: Language Agents with Verbal Reinforcement Learning," in Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS '23), 2023. arXiv:2303.11366.

[37] T. Schick, J. Dwivedi-Yu, R. Dessi, R. Raileanu, M. Lomeli, L. Zettlemoyer, N. Cancedda, and T. Scialom, "Toolformer: Language Models Can Teach Themselves to Use Tools," in Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS '23), 2023. arXiv:2302.04761.

[38] Y. Shen, K. Song, X. Tan, D. Li, W. Lu, and Y. Zhuang, "HuggingGPT: Solving AI Tasks with ChatGPT and Its Friends in Hugging Face," in Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS '23), 2023. arXiv:2303.17580.

[39] Taskforce.sh, "BullMQ: Premium Message Queue for Node.js Based on Redis," docs.bullmq.io, 2024.

[40] ISO/IEC, "ISO/IEC 42001:2023 β€” Information Technology β€” Artificial Intelligence β€” Management System," International Organization for Standardization, 2023.

[41] R. Laigner, Y. Zhou, M. A. V. Salles, Y. Liu, and M. Kalinowski, "Towards Optimizing the Costs of LLM Usage," arXiv:2402.01742, 2024.

[42] Istio Project, "Istio: Connect, Secure, Control, and Observe Services," istio.io, 2024.

Keywords

Unified Nexus Orchestrator, UNO, dispatch-execution separation, BullMQ workers, ReAct agent, autonomous agents, multi-provider AI routing, skill registry, span-tree observability, AI workload orchestration, Kubernetes, Istio, EU AI Act governance