The AI Implementation Fidelity Gap: Measuring and Closing the Chasm Between Claimed and Verified Software Delivery
Introduces the AI Implementation Fidelity Scorecard (AIFS) --- a microservice architecture for measuring the gap between what AI agents plan, implement, and deploy, with cryptographically verifiable evidence chains.
The AI Implementation Fidelity Gap
Measuring and Closing the Chasm Between Claimed and Verified Software Delivery in Autonomous Coding Agents
Adverant Research | April 2026 NexusQA Platform --- nexusqa.ai
Idea in Brief
- The Problem: AI coding agents claim 85% completion but deliver 36%. Six independent sources show a gap of roughly 9 to 72 percentage points between AI self-assessment and actual performance.
- The Solution: The AI Implementation Fidelity Scorecard (AIFS) --- a standalone microservice that tracks three scores per deliverable: proposed (AI's claim), implemented (code analysis), deployed (production evidence).
- The Innovation: Append-only PostgreSQL with immutability triggers, SHA-256 hash chains, a "56% ceiling rule" for unverified code, and five evidence grades --- creating the first cryptographically verifiable implementation fidelity system.
- The Impact: Fidelity gap as organizational KPI, per-model calibration comparison, regulatory compliance with EU AI Act Article 12, and closed-loop improvement via knowledge graph integration.
Table of Contents
- The Overconfidence Crisis
- Empirical Evidence: The Fidelity Gap
- System Architecture: AIFS
- Data Model: Immutable Scorecards
- Evidence Collection Protocol
- The 56% Ceiling Rule
- Fidelity Gap as KPI
- Integration Architecture
- Dashboard UI/UX Design
- Regulatory Alignment
- Related Work
- Discussion & Future Work
- References
1. The Overconfidence Crisis
Something uncomfortable is happening in software engineering. AI coding tools have crossed the adoption threshold --- SonarQube's 2026 developer survey reports that 42% of committed code is now AI-generated or AI-assisted [1]. Developers believe these tools make them faster. The data says otherwise.
The METR study tracked 16 experienced open-source developers across 246 real tasks [2]. Before starting, developers predicted AI would save them 24% of their time. After completing the tasks, they still believed it had saved them 20%. Actual measurement: AI made them 19% slower. That is a 39-point perception-reality gap, and it persisted even after developers had empirical evidence of their own performance.
We encountered this phenomenon firsthand during a Tool Selection Engine (TSE) consolidation project. The AI agent claimed 85% completion across 73 deliverables. A brutally honest audit --- requiring concrete evidence for every claim --- found 36% actual completion. The missing 49 points were not minor omissions. They included an entire middleware directory that the agent described as "90% complete" but which contained zero files.
The Missing Measurement
Existing tools address fragments of this problem. SonarQube's AI Code Assurance detects "AI slop" [3]. Allure TestOps attaches screenshots to test results. Arize Phoenix records LLM execution traces [4]. None of these answer the fundamental question:
Did the AI agent do what it said it would do?
This is the implementation fidelity gap --- the chasm between an agent's self-reported completion status and the externally verified state of its deliverables across the full lifecycle from planning through deployment.
THE FIDELITY GAP --- WHAT AI CLAIMS vs. WHAT IT DELIVERS
[Bar chart: "AI claims" vs. "actually delivered" for each source, y-axis 0% to 100%.]
| Source | Gap |
|---|---|
| TSE Case Study | 49 pt |
| METR Study | 39 pt |
| Kimi K2 | 72 pt |
| SWE-bench | 57 pt |
| CodeRabbit | ~41 pt |
| Faros AI | ~9 pt |
2. Empirical Evidence: The Fidelity Gap
2.1 Convergent Findings from Six Independent Studies
The following table summarizes evidence from six independent sources --- each measuring a different facet of the AI implementation fidelity problem, yet all converging on the same qualitative finding: AI self-assessment systematically overstates actual performance.
| Source | Sample | Claimed | Actual | Gap |
|---|---|---|---|---|
| TSE Consolidation (firsthand) | 73 deliverables | 85% complete | 36% complete | 49 pp |
| METR Study [2] | 16 devs, 246 tasks | 20% faster (perceived) | 19% slower (measured) | 39 pp |
| Kimi K2 Calibration [5] | TriviaQA benchmark | 95.7% confidence | 23.3% accuracy | 72 pp |
| SWE-bench [6] | Verified vs. Pro | 80% score | 23% score | 57 pp |
| CodeRabbit [7] | 470 GitHub PRs | Baseline quality | 1.7x more issues | ~41 pp |
| Faros AI [8] | 10,000+ developers | +98% more PRs | +9% more bugs | ~9 pp |
2.2 The Dunning-Kruger Effect in LLMs
Ghosh and Panday's study [5] documented Expected Calibration Error (ECE) values across multiple LLMs:
ECE (EXPECTED CALIBRATION ERROR) BY MODEL
[Bar chart: ECE on a 0.0 to 1.0 axis; higher values mean worse miscalibration.]
| Model | ECE |
|---|---|
| Kimi K2 | 0.726 |
| GPT-4o Mini | 0.489 |
| Llama 3.3 | 0.391 |
| Gemini Flash | 0.352 |
| GPT-4o | 0.254 |
| Claude Haiku | 0.122 |
The inverse relationship between competence and overconfidence mirrors the human Dunning-Kruger cognitive bias --- the least capable models express the highest confidence.
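To make the ECE figures concrete, here is a minimal sketch of the standard binned ECE computation: predictions are grouped into equal-width confidence bins, and the absolute difference between accuracy and mean confidence in each bin is averaged, weighted by bin size. The function name, the bin count, and the sample data are illustrative assumptions and are not taken from the study [5].

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Binned ECE: sum over bins of (bin_size / N) * |accuracy - mean confidence|."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        raise ValueError("at least one prediction is required")

    # Assign each prediction index to an equal-width bin over [0, 1].
    bins = [[] for _ in range(n_bins)]
    for i, conf in enumerate(confidences):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 lands in the last bin
        bins[index].append(i)

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(confidences[i] for i in members) / len(members)
        accuracy = sum(1 for i in members if correct[i]) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


# Hypothetical overconfident model: always 95% confident, right 1 time in 4.
overconfident = expected_calibration_error([0.95] * 4, [True, False, False, False])
# Hypothetical calibrated model: 75% confident, right 3 times in 4.
calibrated = expected_calibration_error([0.75] * 4, [True, True, True, False])
print(round(overconfident, 3), round(calibrated, 3))
```

A model that is uniformly confident and mostly wrong ends up with an ECE close to the difference between its confidence and its accuracy, which is the pattern the Kimi K2 row in Section 2.1 describes.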
2.3 Code Quality Impact
CodeRabbit's analysis of 470 GitHub pull requests reveals the scope of quality degradation in AI-generated code:
AI vs. HUMAN CODE QUALITY (CodeRabbit, Dec 2025)
| Issue category | AI vs. human |
|---|---|
| Logic Errors | 1.75x more |
| Readability Issues | 3.00x more |
| Security Vulnerabilities | 1.5-2x more |
| XSS Vulnerabilities | 2.74x more |
| Password Handling Flaws | 1.88x more |
| Insecure Object Refs | 1.91x more |
| Total Issues per PR | 1.70x more |
Source: 320 AI-co-authored + 150 human-only PRs analyzed
74 confirmed CVEs directly attributed to AI coding tools
2.4 The Productivity Paradox
Faros AI's study of 10,000+ developers across 1,255 teams [8] reveals the macro-level consequences:
THE AI PRODUCTIVITY PARADOX (Faros AI, 2026)
| Individual Gains | System-Level Costs |
|---|---|
| +21% tasks done | +91% review time |
| +98% PRs merged | +154% PR size |
| More output/dev | +9% bugs/dev |
NET RESULT: No measurable company-level improvement (Amdahl's Law: review bottleneck cancels gains).
3. System Architecture: AIFS
3.1 High-Level Architecture
The AI Implementation Fidelity Scorecard deploys as nexus-fidelity --- a standalone microservice within the Nexus Kubernetes cluster that passively observes AI coding sessions and collects evidence at every lifecycle checkpoint.
AIFS SYSTEM ARCHITECTURE
- CLAUDE CODE SESSION: Plan (Step 1) → Gate 1 (Step 3) → BHA (Step 4) → Implementation (Step 7). Every stage feeds the hook below.
- PostToolUse Hook: fidelity-scorecard.sh (async, fire-and-forget, never blocks). It sends POST /api/v1/checkpoints to the service.
- nexus-fidelity :9130, containing three components that all write through the Hash Chain Service (SHA-256):
- Scorecard Service
- Evidence Collector
- Grading Engine (PASS/PARTIAL/FAIL/NOT_TESTED/HONEST_GAP)
- Downstream stores fed by the Hash Chain Service:
- PostgreSQL fidelity.* (IMMUTABLE, INSERT ONLY)
- GraphRAG fidelity_* entities
- Redis Pub/Sub orchestration:job:* events
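The hook-to-service hand-off can be sketched as follows. The source describes the hook as a shell script (fidelity-scorecard.sh); this Python version only illustrates the fire-and-forget pattern. The host name, the payload shape, and the example key are assumptions; the field names are borrowed from the schema in Section 4, and the X-Service-Key header from the service identity table.

```python
import json
import threading
import urllib.request

# Assumed in-cluster address; only the port and path come from the article.
FIDELITY_URL = "http://nexus-fidelity:9130/api/v1/checkpoints"


def build_checkpoint_request(session_id: str, gate_name: str, step_number: int,
                             claimed_completion: float,
                             service_key: str) -> urllib.request.Request:
    """Build the POST request for one lifecycle checkpoint (hypothetical payload shape)."""
    body = json.dumps({
        "session_id": session_id,
        "gate_name": gate_name,
        "step_number": step_number,
        "claimed_completion": claimed_completion,
    }).encode("utf-8")
    return urllib.request.Request(
        FIDELITY_URL,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json", "X-Service-Key": service_key},
    )


def send_fire_and_forget(request: urllib.request.Request,
                         timeout: float = 2.0) -> threading.Thread:
    """Send on a daemon thread and ignore any failure, so the coding session never blocks."""
    def _send() -> None:
        try:
            with urllib.request.urlopen(request, timeout=timeout):
                pass
        except Exception:
            pass  # fire-and-forget: a failed delivery must not interrupt the session

    thread = threading.Thread(target=_send, daemon=True)
    thread.start()
    return thread


request = build_checkpoint_request("session-123", "Gate 1", 3, 0.85, "example-key")
```

Calling send_fire_and_forget(request) returns immediately. The trade-off of this pattern is that a dropped checkpoint goes unnoticed by the caller, so a production hook would likely need local buffering or retry, which this sketch omits.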
3.2 Service Identity
| Property | Value |
|---|---|
| Service Name | nexus-fidelity |
| Port | 9130 (HTTP), 9131 (WebSocket) |
| K8s Deployment | nexus-fidelity in nexus namespace |
| Replicas | 2 (HPA on request rate) |
| Database | PostgreSQL fidelity schema |
| Knowledge Graph | GraphRAG fidelity domain entities |
| Auth (internal) | X-Service-Key |
| Auth (dashboard) | JWT |
3.3 Design Principles
FIVE ARCHITECTURAL PRINCIPLES
- 1. APPEND-ONLY IMMUTABILITY: INSERT only. UPDATE raises exception. DELETE raises exception. Enforced at DB trigger level.
- 2. CONTENT-ADDRESSABLE: SHA-256 hash chain links every record. Tampering breaks chain. No blockchain needed --- hash chain suffices.
- 3. THREE-SCORE TRACKING: S_proposed, S_implemented, S_deployed. Each measured independently at different lifecycle stages.
- 4. EVIDENCE-LINKED: No score without evidence. 10 evidence types, each SHA-256 hashed.
- 5. DUAL STORAGE: PostgreSQL for ACID queries. GraphRAG for semantic search and cross-session discovery.
4. Data Model: Immutable Scorecards
4.1 Entity-Relationship Diagram
FIDELITY SCHEMA --- ENTITY RELATIONSHIPS
fidelity.scorecards (1:N to fidelity.deliverables, 1:N to fidelity.checkpoints)
| Column | Type / notes |
|---|---|
| id | UUID PK |
| session_id | VARCHAR(255) |
| project_name | VARCHAR(255) |
| branch | VARCHAR(255) |
| organization_id | UUID |
| plan_hash | VARCHAR(64), SHA-256 |
| plan_summary | TEXT |
| total_deliverables | INTEGER |
| proposed_score | REAL |
| implemented_score | REAL |
| deployed_score | REAL |
| fidelity_gap | REAL GENERATED (proposed - deployed) |
| chain_hash | VARCHAR(64), SHA-256 chain |
| previous_scorecard_id | UUID FK → self |
| status | CHECK (active, completed, aband.) |
| created_at | TIMESTAMPTZ |
TRIGGER: no_update (RAISE EXCEPTION)
TRIGGER: no_delete (RAISE EXCEPTION)
fidelity.deliverables
| Column | Type / notes |
|---|---|
| id | UUID |
| scorecard_id | FK |
| deliverable_key | |
| description | TEXT |
| proposed_score | |
| implemented_score | |
| deployed_score | |
| evidence_type | |
| evidence_data | JSONB |
| evidence_hash | VARCHAR(64) |
| grade | CHECK (PASS, PARTIAL, FAIL, NOT_TESTED, HONEST_GAP) |
TRIGGER: no_update
TRIGGER: no_delete
fidelity.checkpoints
| Column | Type / notes |
|---|---|
| id | UUID |
| scorecard_id | FK |
| gate_name | VARCHAR |
| step_number | INT |
| verdict | VARCHAR |
| claimed_completion | |
| verified_completion | |
| checks_executed | JSONB |
| artifacts | JSONB |
| checkpoint_hash | VARCHAR(64) |
| previous_checkpoint_id | UUID FK → self |
| created_at | |
TRIGGER: no_update
TRIGGER: no_delete
fidelity.daily_metrics (UPDATABLE --- daily materialization)
| Column | Type / notes |
|---|---|
| id | BIGSERIAL |
| organization_id | UUID |
| project_name | VARCHAR |
| metric_date | DATE |
| avg_proposed_score | |
| avg_implemented_score | |
| avg_deployed_score | |
| avg_fidelity_gap | |
| overconfidence_rate | |
| underconfidence_rate | |
UNIQUE (org, proj, metric_date)
4.2 Hash Chain Construction
Every checkpoint record includes a SHA-256 hash of its structured data concatenated with the previous checkpoint's hash. Tampering with any record invalidates all subsequent hashes.
SHA-256 HASH CHAIN --- TAMPER-EVIDENT CHECKPOINT LINKAGE
- Checkpoint 1 (GENESIS), Gate 1: Plan. prev: 0x000. h1 = SHA256(data1 ‖ 0x000...). h1 = a3f8...
- Checkpoint 2, Gate 2: Code. prev: h1. h2 = SHA256(data2 ‖ h1). h2 = 7c2d...
- Checkpoint 3 (LATEST), Deploy Valid. prev: h2. h3 = SHA256(data3 ‖ h2). h3 = e91b...
VERIFICATION: Walk chain from h3 → h2 → h1 → genesis.
Recompute each hash. Any mismatch = TAMPERING DETECTED.
ALGORITHM: HashChainConstruction
INPUT: checkpoint_data C, previous_hash h_prev
OUTPUT: current_hash h_current
1. payload ← canonicalize(C) // deterministic JSON
2. input ← payload ‖ h_prev
3. h_current ← SHA-256(input)
4. INSERT (C, h_current, h_prev) INTO checkpoints
5. RETURN h_current
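The construction and verification steps above can be sketched in a few lines. This is an in-memory illustration rather than the service's implementation: the genesis value encoding, the canonicalization choice (sorted keys, compact separators), and the record layout are assumptions. Verification here walks forward from genesis, which checks the same links as walking backward from the latest hash.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # assumed encoding of the 0x000... genesis value


def canonicalize(data: dict) -> bytes:
    """Deterministic JSON: sorted keys and no insignificant whitespace."""
    return json.dumps(data, sort_keys=True, separators=(",", ":")).encode("utf-8")


def chain_hash(data: dict, previous_hash: str) -> str:
    """h_current = SHA-256(canonicalize(C) || h_prev)."""
    return hashlib.sha256(canonicalize(data) + previous_hash.encode("ascii")).hexdigest()


def append_checkpoint(chain: list, data: dict) -> str:
    """Append a record linked to the previous record's hash and return the new hash."""
    previous_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    current_hash = chain_hash(data, previous_hash)
    chain.append({"data": data, "prev": previous_hash, "hash": current_hash})
    return current_hash


def verify_chain(chain: list) -> bool:
    """Recompute every hash and check every link; any mismatch means tampering."""
    expected_prev = GENESIS_HASH
    for record in chain:
        if record["prev"] != expected_prev:
            return False
        if chain_hash(record["data"], record["prev"]) != record["hash"]:
            return False
        expected_prev = record["hash"]
    return True


chain: list = []
append_checkpoint(chain, {"gate": "Gate 1: Plan", "claimed_completion": 0.85})
append_checkpoint(chain, {"gate": "Gate 2: Code", "claimed_completion": 0.90})
append_checkpoint(chain, {"gate": "Deploy Valid", "claimed_completion": 0.95})
intact = verify_chain(chain)

chain[1]["data"]["claimed_completion"] = 1.0  # simulate retroactive score inflation
tampering_detected = not verify_chain(chain)
```

Canonicalization matters because two JSON encodings of the same data with different key order would otherwise hash differently and produce false tampering alarms.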
4.3 Immutability Enforcement
```sql
-- Database-level immutability: the AI CANNOT retroactively inflate scores
CREATE FUNCTION fidelity.prevent_mutation() RETURNS TRIGGER AS $$
BEGIN
  RAISE EXCEPTION 'Fidelity records are immutable. Cannot % on table %.',
    TG_OP, TG_TABLE_NAME
    USING HINT = 'Create a new record instead of modifying existing ones.';
END;
$$ LANGUAGE plpgsql;

-- Applied to ALL core tables
CREATE TRIGGER no_update_scorecards
  BEFORE UPDATE ON fidelity.scorecards
  FOR EACH ROW EXECUTE FUNCTION fidelity.prevent_mutation();

CREATE TRIGGER no_delete_scorecards
  BEFORE DELETE ON fidelity.scorecards
  FOR EACH ROW EXECUTE FUNCTION fidelity.prevent_mutation();
```
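This means that neither the AIFS service itself, nor any ordinary database client, nor the AI agent can modify historical fidelity records through normal SQL. A role with sufficient privileges could still drop or disable the triggers, which is why the hash chain in Section 4.2 serves as an independent tamper check. Once a checkpoint is recorded, it is treated as permanent.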
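The append-only behavior is easy to try locally. The sketch below swaps in SQLite (via Python's standard sqlite3 module) for PostgreSQL, so the trigger syntax differs from the PL/pgSQL above: SQLite uses SELECT RAISE(ABORT, ...) inside the trigger body. The table is a simplified, hypothetical stand-in for fidelity.checkpoints.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE checkpoints (
        id INTEGER PRIMARY KEY,
        gate_name TEXT NOT NULL,
        claimed_completion REAL NOT NULL
    );
    CREATE TRIGGER no_update_checkpoints BEFORE UPDATE ON checkpoints
    BEGIN
        SELECT RAISE(ABORT, 'Fidelity records are immutable. Cannot UPDATE.');
    END;
    CREATE TRIGGER no_delete_checkpoints BEFORE DELETE ON checkpoints
    BEGIN
        SELECT RAISE(ABORT, 'Fidelity records are immutable. Cannot DELETE.');
    END;
""")
conn.execute(
    "INSERT INTO checkpoints (gate_name, claimed_completion) VALUES (?, ?)",
    ("Gate 1", 0.85),
)
conn.commit()


def is_rejected(statement: str) -> bool:
    """Return True if the database refuses to run the statement."""
    try:
        conn.execute(statement)
    except sqlite3.DatabaseError:
        conn.rollback()
        return True
    return False


update_rejected = is_rejected("UPDATE checkpoints SET claimed_completion = 1.0")
delete_rejected = is_rejected("DELETE FROM checkpoints")
stored = conn.execute("SELECT claimed_completion FROM checkpoints").fetchone()[0]
```

After both attempts the stored claim is still the original 0.85, while further INSERT statements continue to succeed.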
5. Evidence Collection Protocol
5.1 Ten Evidence Types
EVIDENCE TYPES AND COLLECTION METHODS
| Type | Collection | Example |
|---|---|---|
| api_response | HTTP request | {status:200, body:{...}} |
| screenshot | Playwright MCP | MinIO key + visual diff % |
| test_result | Jest/Playwright | {passed:68, failed:5} |
| db_query | SQL execution | SELECT count(*)... = 351 |
| log_line | kubectl logs | [TSE] selection via... |
| code_grep | rg/grep search | governance.ts:45: if(tier.. |
| health_check | HTTP probe | {status:"ok", pg:"ok"} |
| pod_status | kubectl get pods | 2/2 Running 0 5m |
| git_diff | git diff --stat | 15 files, +2340 insertions |
| console | Browser inject | {errors:0, warnings:2} |
Each evidence record includes:
- evidence_data: JSONB --- raw evidence payload
- evidence_hash: VARCHAR(64) --- SHA-256(evidence_data)
- evidence_collected_at: TIMESTAMPTZ
Modification of evidence_data invalidates evidence_hash, so TAMPERING IS DETECTABLE.
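As a minimal sketch of the per-record check, the following hashes an evidence payload at collection time and re-verifies it later. The helper names and the canonical JSON encoding are assumptions; the three record fields follow the list above.

```python
import hashlib
import json
from datetime import datetime, timezone


def hash_evidence(evidence_data: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the payload (assumed encoding)."""
    canonical = json.dumps(evidence_data, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def make_evidence_record(evidence_type: str, evidence_data: dict) -> dict:
    """Bundle the payload with its hash and a UTC collection timestamp."""
    return {
        "evidence_type": evidence_type,
        "evidence_data": evidence_data,
        "evidence_hash": hash_evidence(evidence_data),
        "evidence_collected_at": datetime.now(timezone.utc).isoformat(),
    }


def evidence_is_intact(record: dict) -> bool:
    """Recompute the hash and compare it with the stored one."""
    return hash_evidence(record["evidence_data"]) == record["evidence_hash"]


record = make_evidence_record("test_result", {"passed": 68, "failed": 5})
intact_before = evidence_is_intact(record)
record["evidence_data"]["failed"] = 0  # simulate editing the payload after collection
intact_after = evidence_is_intact(record)
```

On its own this only detects edits to the payload; an attacker who can also rewrite evidence_hash would defeat it, which is the gap the immutability triggers and the checkpoint hash chain are meant to close.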
5.2 Five-Grade System
EVIDENCE GRADING --- FIVE GRADES WITH SCORE MULTIPLIERS
| Grade | Multiplier | Criteria |
|---|---|---|
| PASS | 100% | Evidence matches expected behavior exactly. Full functional proof. |
| PARTIAL | 75% | Evidence exists but edge cases uncovered or output differs slightly. |
| FAIL | 0% | Evidence shows wrong behavior, errors, or missing functionality. |
| NOT_TESTED | ≤56% ceiling | No production evidence collected. CAPPED regardless of proposed score. |
| HONEST_GAP | 80% | Known limitation documented with proof that limitation exists. |
SCORING EXAMPLES:
| Deliverable | Proposed | Evidence | Grade | Deployed | Fidelity Gap |
|---|---|---|---|---|---|
| "Deploy nexus-tse to K8s" | 95% | kubectl get pods → 2/2 Running | PASS | 95% × 100% = 95% | 95% - 95% = 0 (perfect calibration) |
| "Governance middleware enforcement" | 90% | middleware/ directory is EMPTY | FAIL | 90% × 0% = 0% | 90% - 0% = 90 (catastrophic gap) |
| "Qdrant semantic search" | 80% | health shows "not_configured" | HONEST_GAP | 80% × 80% = 64% | 80% - 64% = 16 (gap exists but limitation is documented) |
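The three scoring examples can be reproduced with a small lookup of grade multipliers. This is a sketch under stated assumptions: the function names are hypothetical, scores are handled as percentages, and NOT_TESTED is deliberately left out because the grading table defines it as a ceiling rather than a multiplier.

```python
# Multipliers from the grading table. NOT_TESTED is a cap, not a multiplier, so it is omitted.
GRADE_MULTIPLIERS = {"PASS": 1.00, "PARTIAL": 0.75, "FAIL": 0.00, "HONEST_GAP": 0.80}


def deployed_score(proposed_pct: float, grade: str) -> float:
    """Deployed score = proposed score x grade multiplier, in percent."""
    if grade not in GRADE_MULTIPLIERS:
        raise ValueError(f"no multiplier defined for grade {grade!r}")
    return round(proposed_pct * GRADE_MULTIPLIERS[grade], 2)


def fidelity_gap(proposed_pct: float, grade: str) -> float:
    """Gap = proposed - deployed, in percentage points."""
    return round(proposed_pct - deployed_score(proposed_pct, grade), 2)


examples = [
    ("Deploy nexus-tse to K8s", 95, "PASS"),
    ("Governance middleware enforcement", 90, "FAIL"),
    ("Qdrant semantic search", 80, "HONEST_GAP"),
]
results = {name: (deployed_score(p, g), fidelity_gap(p, g)) for name, p, g in examples}
for name, (deployed, gap) in results.items():
    print(f"{name}: deployed {deployed}%, gap {gap}")
```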
6. The 56% Ceiling Rule
6.1 Definition
For any deliverable where no production evidence has been collected, the deployed score is bounded at 56% --- regardless of the proposed or implemented score.
An agent that claims 95% confidence on a deliverable but provides no production evidence receives min(0.95 × 0.56, 0.56) = 0.532 as its deployed score. This is the most aggressive constraint in the system and the most important.
6.2 Empirical Basis
THE 56% CEILING --- EMPIRICAL DERIVATION
Of 73 TSE deliverables self-reported as "code complete":
| Outcome | Count | Share |
|---|---|---|
| Functionally correct | 41/73 | 56% |
| Issues found | 32/73 | 44% |
| Issues: Missing impl | 12/73 | 16% |
| Issues: Logic errors | 9/73 | 12% |
| Issues: Partial impl | 7/73 | 10% |
| Issues: Config errors | 4/73 | 5% |
CROSS-VALIDATION WITH INDEPENDENT ESTIMATES:
| Source | Effective Fidelity | Method |
|---|---|---|
| TSE Case Study | 56% | Direct measurement |
| SWE-bench (geometric) | ~43% | √(0.80 × 0.23) |
| METR (productivity ratio) | ~68% | 0.81 / 1.20 |
| CodeRabbit (parity) | ~59% | 1 / 1.70 |
Range: 43% --- 68%
56% sits in the middle of this range.
Configurable per organization based on their own calibration data.
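The cross-validation arithmetic can be checked directly. The sketch below recomputes the four estimates from the inputs quoted in the derivation and reports the range and its midpoint; the variable names are illustrative, and no new data is introduced.

```python
import math

# Inputs are the figures quoted in the derivation above.
estimates_pct = {
    "TSE Case Study": 100 * 41 / 73,                      # direct measurement
    "SWE-bench (geometric)": 100 * math.sqrt(0.80 * 0.23),
    "METR (productivity ratio)": 100 * 0.81 / 1.20,
    "CodeRabbit (parity)": 100 * 1 / 1.70,
}

low = min(estimates_pct.values())
high = max(estimates_pct.values())
midpoint = (low + high) / 2

for source, value in estimates_pct.items():
    print(f"{source}: {value:.1f}%")
print(f"range {low:.1f}% to {high:.1f}%, midpoint {midpoint:.1f}%")
```

The midpoint of the recomputed range falls within about one point of 56%, which is the sense in which the measured TSE figure sits in the middle of the independent estimates.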
6.3 Impact on Scoring
EFFECT OF 56% CEILING ON INFLATED CLAIMS

| Proposed | Deployed WITHOUT ceiling (AI's claim stands) | Deployed WITH ceiling (evidence required) |
|---|---|---|
| 95% | 95% (trusts AI blindly) | ≤56% (demands proof) |
| 85% | 85% (phantom completion) | ≤56% (honest cap) |
| 60% | 60% (modest claim, still no evidence) | ≤56% (still capped; evidence needed) |
| 50% | 50% (below ceiling, unchanged) | 50% (below ceiling, no cap applied) |

KEY INSIGHT: The ceiling penalizes EXACTLY the common failure mode: claiming high completion with zero production verification.
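A minimal sketch of the ceiling as a function, assuming scores are fractions in [0, 1]. The name `cap_unverified` and the `discount` switch are hypothetical: the table in 6.3 applies the cap alone, while the worked example in 6.1 also multiplies the claim by the ceiling first, so the sketch exposes both readings rather than asserting which one AIFS uses.

```python
def cap_unverified(claimed: float, ceiling: float = 0.56, discount: bool = False) -> float:
    """Upper bound on the deployed score when no production evidence exists.

    discount=False: plain cap, min(claimed, ceiling), as in the 6.3 table.
    discount=True:  min(claimed * ceiling, ceiling), as in the 6.1 worked example.
    """
    if not 0.0 <= claimed <= 1.0:
        raise ValueError("claimed must be a fraction between 0 and 1")
    base = claimed * ceiling if discount else claimed
    return min(base, ceiling)


for claim in (0.95, 0.85, 0.60, 0.50):
    print(f"proposed {claim:.0%} -> deployed at most {cap_unverified(claim):.0%}")
```

Keeping the ceiling as a parameter reflects the statement in 6.2 that it is configurable per organization.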
7. Fidelity Gap as KPI
7.1 Formal Definition
FIDELITY GAP COMPUTATION

For scorecard S with deliverables D = {d₁, d₂, ..., dₙ}:

$$\Delta_f(S) = \frac{1}{n} \sum_{i=1}^{n} \bigl( S_{\text{proposed}}(d_i) - S_{\text{deployed}}(d_i) \bigr)$$

INTERPRETATION (Δf in percentage points):

| Δf | Interpretation |
|---|---|
| Δf = 0 | Perfect calibration. Agent claimed exactly what was delivered. IDEAL STATE. |
| 0 < Δf ≤ 15 | Minor overconfidence. Normal operating range. Most sessions should land here. |
| 15 < Δf ≤ 30 | Moderate gap. Review recommended. Check if specific deliverable types are driving it. |
| 30 < Δf ≤ 50 | Critical miscalibration. Human review required. Agent should not auto-deploy at this gap level. |
| Δf > 50 | Dangerous. Agent's self-assessment is unreliable. Block autonomous operation for this task type. |
| Δf < 0 | Underconfidence. Agent delivered more than claimed. RARE but observed (8% of TSE deliverables). |
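The definition and its interpretation bands translate directly into code. The sketch below assumes scores are expressed in percentage points (0 to 100); the function names are illustrative, and the three sample rows are taken from the session scorecard mock-up in section 9.1.

```python
from statistics import mean


def fidelity_gap(deliverables):
    """Mean of (proposed - deployed) over all deliverables, in percentage points."""
    rows = list(deliverables)
    if not rows:
        raise ValueError("a scorecard needs at least one deliverable")
    return mean(proposed - deployed for proposed, deployed in rows)


def classify(gap):
    """Map a fidelity gap onto the interpretation bands listed above."""
    if gap < 0:
        return "underconfidence"
    if gap == 0:
        return "perfect calibration"
    if gap <= 15:
        return "minor overconfidence"
    if gap <= 30:
        return "moderate gap"
    if gap <= 50:
        return "critical miscalibration"
    return "dangerous"


# (proposed, deployed) for deliverables 1.1, 2.1 and 4.3 of the 9.1 mock-up
sample = [(100, 100), (95, 56), (90, 0)]
gap = fidelity_gap(sample)
print(gap, classify(gap))
```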
7.2 Improvement Tracking
WEEKLY FIDELITY IMPROVEMENT --- KPI DASHBOARD CONCEPT

τ_week = avg_Δf(this_week) - avg_Δf(last_week)

Negative τ = IMPROVING (gap shrinking): GOOD
Positive τ = REGRESSING (gap growing): BAD

[Trend chart, 12 weeks: average fidelity gap (y-axis, 0 to 50) by week (x-axis, W1 to W12). The actual trend falls from about 45 to about 10, crossing the dashed target line (<15).]

| Week | W1 | W2 | W3 | W4 | W5 | W6 | W7 | W8 | W9 | W10 | W11 | W12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| τ | -5 | -5 | -5 | -2 | -5 | 0 | -5 | -2 | -5 | -3 | -5 | -2 |

VERDICT: Consistent improvement. Gap halved in 12 weeks.
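The weekly improvement rate is a first difference over the series of weekly average gaps. A minimal sketch, with an illustrative function name and made-up sample values:

```python
def weekly_improvement(weekly_avg_gap):
    """Return tau for each week after the first: this week's average gap minus last week's."""
    return [curr - prev for prev, curr in zip(weekly_avg_gap, weekly_avg_gap[1:])]


# Illustrative weekly averages, not measured data.
gaps = [45, 40, 35, 33]
taus = weekly_improvement(gaps)
print(taus)
print("improving" if all(t <= 0 for t in taus) else "regressing in at least one week")
```

Negative entries mean the gap shrank that week, matching the sign convention above.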
7.3 Dimensional Analysis
FIDELITY GAP ACROSS FOUR DIMENSIONS

PER-MODEL COMPARISON

| Model | Fidelity gap |
|---|---|
| Opus (judgment) | 8 |
| Sonnet (execution) | 18 |
| Haiku (plumbing) | 12 |

PREDICTION: Opus has smallest gap on plan tasks; Sonnet on impl.

PER-PLUGIN COMPARISON

| Area | Fidelity gap |
|---|---|
| DB Migrations | 3 |
| API Routes | 15 |
| UI Components | 28 |
| Config/K8s | 10 |
| Full Pipeline | 35 |

PER-TASK-TYPE

| Task type | Fidelity gap |
|---|---|
| Schema migration (deterministic SQL) | 5 |
| API endpoint (moderate complexity) | 15 |
| UI component (visual + subjective) | 35 |
| Full-stack feature | 48 |

TEMPORAL (SELF-IMPROVEMENT)

| Period | Fidelity gap |
|---|---|
| Week 1-4 | 42 |
| Week 5-8 | 28 |
| Week 9-12 | 18 |
| Week 13+ | 12 |

As GraphRAG accumulates more remediation trails, the Plan Review Gate (Stage 4) improves: agents learn from past overconfidence.
7.4 Calibration Metrics
EXPECTED CALIBRATION ERROR (ECE) --- BORROWED FROM LLM RESEARCH

$$\mathrm{ECE} = \sum_{b=1}^{B} \frac{|b|}{n} \, \bigl| \text{avg\_proposed}(b) - \text{avg\_deployed}(b) \bigr|$$

Where b = confidence bin (0-20%, 20-40%, ..., 80-100%), |b| is the number of deliverables in bin b, and n is the total number of deliverables.

CALIBRATION PLOT:

[Calibration plot: deployed score (y-axis, 0% to 100%) against proposed score (x-axis, 0% to 100%). The diagonal marks perfect calibration; each marker shows the actual performance of one bin.]

OVERCONFIDENCE ZONE: Points below the diagonal.
UNDERCONFIDENCE ZONE: Points above the diagonal.
In our TSE data: 78% of deliverables fall below the diagonal.
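The ECE formula can be sketched as follows, assuming scores are fractions in [0, 1] and five equal-width bins as described above. The function name and the sample pairs are illustrative, not TSE data.

```python
def expected_calibration_error(pairs, n_bins=5):
    """ECE over (proposed, deployed) pairs, binned by proposed score."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one deliverable is required")
    bins = [[] for _ in range(n_bins)]
    for proposed, deployed in pairs:
        # A proposed score of exactly 1.0 belongs to the top bin.
        index = min(int(proposed * n_bins), n_bins - 1)
        bins[index].append((proposed, deployed))

    n = len(pairs)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_proposed = sum(p for p, _ in members) / len(members)
        avg_deployed = sum(d for _, d in members) / len(members)
        ece += (len(members) / n) * abs(avg_proposed - avg_deployed)
    return ece


sample = [(1.0, 1.0), (0.95, 0.56), (0.90, 0.0), (0.50, 0.50)]
print(round(expected_calibration_error(sample), 4))
```

Because each bin's difference is taken in absolute value, overconfident and underconfident bins both add to the error instead of cancelling out.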
8. Integration Architecture
8.1 Six Integration Points
AIFS INTEGRATION MAP --- SIX SYSTEMS, ONE SCORECARD

Hub: NEXUS-FIDELITY (:9130), which holds Scorecards, Deliverables, Checkpoints, Evidence, Hash Chains, and Metrics.

| System | Role in the scorecard |
|---|---|
| CLAUDE CODE HOOKS | UserPromptSubmit: initialize scorecard. PostToolUse: record checkpoints. Gate markers: capture evidence. |
| NEXUS-TSE | Bandit scores adjusted by fidelity feedback. |
| NEXUS-ALIVE | Pod health → deployed_score. Drift → alive incident signal. |
| ORCHESTRATOR | Redis Pub/Sub: job:* events (start/progress/complete/error). |
| GRAPHRAG | fidelity_* entities. Cross-session pattern search. |
| WEBSOCKET | Socket.IO rooms. Live score streaming to dashboard. |
8.2 Claude Code Hook Integration
16-STEP LIFECYCLE --- AIFS CAPTURE POINTS

| Phase | Step | Activity | AIFS capture |
|---|---|---|---|
| Planning | 1 | Plan created | Init scorecard: plan_hash captured, deliverables listed, proposed scores set |
| Planning | 2 | Memory recall | |
| Planning | 3 | Gate 1: Plan Review | Record checkpoint: P1-P6 checks, verdict captured |
| Planning | 4 | BHA Plan Audit | Record BHA verdict: PA-1 to PA-6 artifacts |
| Planning | 5-6 | Refine + persist | |
| Implementation | 7 | Code written | |
| Implementation | 8 | Gate 2: Code | Claimed vs verified: C1-C6 checks |
| Implementation | 9 | BHA Code | Artifacts as evidence: IA-1 to IA-8 |
| Implementation | 10 | Verification | |
| Implementation | 11 | Commit | implemented_score computed from diff analysis |
| Implementation | 12 | Persist | |
| Review | 13 | External review | Record review: Gemini assessment, blind spots |
| Implementation | 14 | Todo verify | |
| Deployment | 15 | Build & Deploy | Trigger prod evidence collection |
| Deployment | 16 | Deploy Valid. | FINAL SCORES: deployed_score computed from health + console + API evidence |
8.3 GraphRAG Entity Types
GRAPHRAG FIDELITY DOMAIN --- ENTITY TYPES AND RELATIONSHIPS

| Entity type | Description | Relationships |
|---|---|---|
| fidelity_scorecard | Session-level record; three aggregate scores | CONTAINS fidelity_checkpoint; REFERENCES fidelity_evidence |
| fidelity_checkpoint | Gate passage record; claimed vs verified | FOLLOWS previous checkpoint |
| fidelity_evidence | Concrete proof (API response, screenshot, test) | DERIVED_FROM test_result or deployment |
| fidelity_drift_incident | Raised when claimed >> actual (Δf > 30) | REFERENCES scorecard; RELATED_TO qa_bug_pattern |
| fidelity_trend | Historical pattern | PRECEDES/FOLLOWS temporal chain |

fidelity_trend enables queries such as: "show all sessions where gap > 30 for this component".
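The kind of query the fidelity_trend entity is meant to support can be illustrated with an in-memory stand-in. This is not the GraphRAG API: the `Scorecard` dataclass, its fields, and `drift_sessions` are hypothetical names used only to show the filter logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scorecard:
    """Hypothetical, flattened view of a fidelity_scorecard entity."""
    session_id: str
    component: str
    proposed: float  # percentage points
    deployed: float  # percentage points

    @property
    def gap(self) -> float:
        return self.proposed - self.deployed


def drift_sessions(scorecards, component, threshold=30.0):
    """Sessions for one component whose fidelity gap exceeds the threshold."""
    return [s.session_id for s in scorecards
            if s.component == component and s.gap > threshold]


history = [
    Scorecard("s1", "governance", 90, 0),
    Scorecard("s2", "governance", 95, 80),
    Scorecard("s3", "schema", 100, 100),
]
print(drift_sessions(history, "governance"))
```

In a real deployment the same filter would presumably run as a graph traversal over the relationships listed above rather than over a Python list.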
9. Dashboard UI/UX Design
9.1 Session Fidelity Overview

NEXUS FIDELITY --- Session Scorecard (Live)

Session: TSE Consolidation (2026-04-11)
Project: Adverant-Nexus | Branch: main | Commit: a3f8c2d

| PROPOSED | IMPLEMENTED | DEPLOYED | FIDELITY GAP |
|---|---|---|---|
| 85% | 62% | 36% | 49 points (CRITICAL) |

HASH CHAIN: a3f8... → 7c2d... → e91b... → 4a7f... (VERIFIED)

DELIVERABLES (73 items) | Filter: All Grades

| # | Deliverable | Prop | Impl | Depl | Grade |
|---|---|---|---|---|---|
| 1.1 | tool_registry columns | 100% | 100% | 100% | PASS |
| 1.2 | CHECK constraints | 100% | 100% | 100% | PASS |
| 2.1 | selectTools() | 95% | 56% | 56% | N/T |
| 2.2 | selectToolsLite() | 90% | 56% | 56% | N/T |
| 2.25 | Dockerfile builds | 100% | 100% | 100% | PASS |
| 2.26 | K8s 2 pods Running | 100% | 0% | 0% | FAIL |
| 4.3 | Governance middleware | 90% | 0% | 0% | FAIL |
| 4.6 | minTier enforcement | 95% | 0% | 0% | FAIL |
| 4.7 | Rate limit enforcement | 90% | 0% | 0% | FAIL |
| ... | ... | ... | ... | ... | ... |

GRADE DISTRIBUTION: PASS: 12 | PARTIAL: 5 | FAIL: 18 | NOT_TESTED: 34 | GAP: 4
9.2 Fidelity Trend Dashboard

NEXUS FIDELITY --- Organization Trends (Last 90 Days)

KEY METRICS

| Metric | Value | Status |
|---|---|---|
| Avg Gap | 18.3 (-4.2) | Improving |
| Sessions | 47 | This quarter |
| Overconfidence Rate | 72% (-6%) | Improving |
| ECE | 0.31 | Moderate |

FIDELITY GAP TREND (12 weeks)

[Trend chart: average fidelity gap (y-axis, 0 to 50) by week (x-axis, W1 to W12), declining toward the dashed target line (<15).]

PER-MODEL BREAKDOWN

| Model | Avg fidelity gap |
|---|---|
| Opus | 12.4 (best) |
| Sonnet | 19.8 |
| Haiku | 24.1 (worst) |

PER-PLUGIN BREAKDOWN

| Plugin | Avg fidelity gap |
|---|---|
| NexusQA | 22.3 |
| ProseCreator | 14.7 |
| NexusROS | 18.9 |
| TSE | 36.2 |
9.3 Evidence Detail View

DELIVERABLE 4.3 --- Governance Middleware Enforcement

GRADE: FAIL

| Proposed | Implemented | Deployed | Gap |
|---|---|---|---|
| 90% | 0% | 0% | 90 points (CRITICAL) |

EVIDENCE REQUIRED:
POST /select with tier "open_source" against ProseCreator (minTier "starter") → expect 403 TIER_INSUFFICIENT

EVIDENCE COLLECTED: NONE
No evidence has been collected for this deliverable.

Root cause analysis:
• File: services/nexus-tse/src/middleware/governance.ts
• Status: FILE DOES NOT EXIST
• The agent claimed this as "90% complete" but the entire middleware/ directory is empty.

Code grep evidence:
$ ls services/nexus-tse/src/middleware/
(empty directory)

Evidence hash: e3b0c44298fc... (SHA-256 of empty result)

HASH CHAIN:
This checkpoint: 7c2d4f8a...
Previous: a3f8b1c2...
Chain verification: VALID
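The evidence hash and the chain verification shown in this view can be sketched with Python's standard `hashlib`. The record layout, the all-zero genesis value, and the function names are assumptions made for illustration; only the use of SHA-256 and the link from each checkpoint to its predecessor come from the text.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder for the first checkpoint's predecessor


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def chain_hash(prev_hash: str, payload: dict) -> str:
    """Hash a checkpoint payload together with the previous checkpoint's hash."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()
    return sha256_hex(prev_hash.encode() + body)


def append_checkpoint(chain: list, payload: dict) -> None:
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"payload": payload, "prev": prev, "hash": chain_hash(prev, payload)})


def verify_chain(chain: list) -> bool:
    """Recompute every link; any edited payload or broken link fails verification."""
    prev = GENESIS
    for entry in chain:
        if entry["prev"] != prev or entry["hash"] != chain_hash(prev, entry["payload"]):
            return False
        prev = entry["hash"]
    return True


# An empty `ls` result hashes to the well-known SHA-256 of zero bytes.
empty_evidence = sha256_hex(b"")
print(empty_evidence[:12])  # e3b0c44298fc

chain = []
append_checkpoint(chain, {"deliverable": "4.3", "deployed": 0.0, "evidence": empty_evidence})
append_checkpoint(chain, {"deliverable": "4.6", "deployed": 0.0, "evidence": empty_evidence})
print(verify_chain(chain))
```

Serializing the payload with sorted keys keeps the hash stable regardless of dictionary ordering, which matters if a third party is to re-verify the chain independently.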
9.4 Calibration Plot

CALIBRATION PLOT --- Proposed vs. Deployed Scores

[Calibration plot: deployed score (y-axis, 0% to 100%) against proposed score (x-axis, 0% to 100%), with the diagonal marking perfect calibration. Bins: 73 deliverables grouped by proposed score decile.]

OVERCONFIDENCE ZONE: All points below the diagonal. 78% of deliverables fall in this zone.

ECE = 0.31 (moderate miscalibration)
Worst bin: 80-90% proposed → 22% deployed (58-point gap)
Best bin: 100% proposed → 95% deployed (5-point gap, schema tasks)
10. Regulatory Alignment
10.1 EU AI Act Article 12
Article 12 of the EU AI Act requires high-risk AI systems to maintain "logs generated automatically" that enable "tracing of the functioning of the system throughout its lifecycle" [9]. Deployers must retain these logs for "at least six months."
EU AI ACT ARTICLE 12 --- AIFS COMPLIANCE MAPPING

| Requirement | AIFS Implementation |
|---|---|
| Automatic log generation | PostToolUse hooks capture data without manual intervention |
| Lifecycle tracing | Checkpoints span Steps 1-16 (plan → deploy validation) |
| Minimum 6-month retention | Append-only PostgreSQL with no DELETE capability. Partition for efficient archival. |
| Tamper evidence | SHA-256 hash chains. Any modification breaks the chain. Independent verification without vendor trust. |
| Traceability | Hash chain links every decision to its predecessor. Full causal chain reconstructable. |
10.2 NIST AI Risk Management Framework
NIST AI RMF --- AIFS IMPLEMENTATION

| Subcategory | Requirement | AIFS Implementation |
|---|---|---|
| GOVERN 1.1 | Policies for AI risk management | Gate hooks block deployment without fidelity data |
| MEASURE 2.6 | Performance tracked against metrics | Fidelity gap KPI with weekly improvement rate |
| MEASURE 2.7 | Performance evaluated against operational requirements | Three-score tracking with production evidence collection |
11. Related Work
11.1 AI Code Verification Tools
| Tool | What It Does | What's Missing |
|---|---|---|
| SonarQube AI Code Assurance [3] | Detects AI-generated code quality issues | Does not track plan-to-deployment fidelity |
| Allure TestOps | Attaches evidence to test results | No scorecard aggregation or hash chain |
| Arize Phoenix [4] | Records LLM execution traces | Observability only --- no fidelity scoring |
| PDCA Framework [10] | Structures AI coding into Plan-Do-Check-Act | No immutable storage or evidence grading |
| Spec Kit | Spec-driven development automation | No deployment verification or gap tracking |
11.2 Immutable Audit Trails
| System | Approach | Difference from AIFS |
|---|---|---|
| GuardianChain [11] | SHA-256 + blockchain anchor | Records what agents did, not what they claimed |
| Prefactor [12] | CI/CD-native immutable IDs | No three-score tracking or calibration metrics |
| Standard audit logs | Append-only databases | No content-addressable hashing or chain verification |
11.3 Knowledge Graphs for QA
GraphRAG-Bench (ICLR 2026) [13] evaluates retrieval-augmented generation with graph traversal. No commercial product stores QA metrics in a knowledge graph at the granularity AIFS requires. NexusQA's existing 10 entity types with typed relationships for regression prediction provide the foundation that AIFS extends with the fidelity domain.
12. Discussion & Future Work
12.1 Limitations
Single-project calibration. The 56% ceiling derives from one project's 73 deliverables. While consistent with independent studies, organizations should calibrate against their own data.
Self-reported proposed scores. The proposed score is reported by the AI agent. We mitigate this by capturing the plan hash before implementation --- the plan is locked and the proposed score derived from the deliverable list, not a separate confidence estimate.
Granularity sensitivity. A coarse scorecard (28 items) masks problems a granular one (73 items) reveals. AIFS does not prescribe granularity.
12.2 Future Directions
Adaptive ceiling. Learn per-task-type ceilings from accumulated data. Schema migrations might warrant 90% (deterministic SQL); complex UI components might warrant 40%.
Pre-implementation fidelity prediction. Given a plan and historical data, predict expected gap before implementation. Gate whether the AI should attempt the task autonomously.
Cross-agent benchmarking. Deploy AIFS across Claude Code, Cursor, Copilot, and Devin on identical tasks to produce the first standardized fidelity benchmark.
IDE integration. Real-time fidelity scores displayed in VSCode/JetBrains as the AI agent works --- a live confidence indicator that updates with each checkpoint.
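One way the adaptive ceiling direction might work is to derive a per-task-type ceiling from the observed share of functionally correct deliverables, the same way the global 56% was derived from 41 of 73. The sketch below is an assumption, not a description of AIFS: `learn_ceilings`, the `min_samples` guard, and the sample data are all hypothetical.

```python
from collections import defaultdict


def learn_ceilings(outcomes, default=0.56, min_samples=10):
    """Per-task-type ceiling = share of deliverables that were functionally correct.

    outcomes: iterable of (task_type, functionally_correct) pairs.
    Task types with fewer than min_samples observations keep the default ceiling.
    """
    counts = defaultdict(lambda: [0, 0])  # task_type -> [correct, total]
    for task_type, correct in outcomes:
        counts[task_type][0] += int(bool(correct))
        counts[task_type][1] += 1
    return {
        task_type: (correct / total if total >= min_samples else default)
        for task_type, (correct, total) in counts.items()
    }


# Made-up observations for illustration only.
observed = [("schema", True)] * 9 + [("schema", False)] + [("ui", True)] * 2
print(learn_ceilings(observed))
```

The sample-size guard keeps sparsely observed task types on the organization-wide default until enough evidence has accumulated.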
13. References
[1] SonarSource, "State of Code Developer Survey Report 2026," 2026. https://www.sonarsource.com/state-of-code-developer-survey-report.pdf
[2] J. Becker, N. Rush, E. Barnes, D. Rein, "Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity," METR, Jul. 2025. arXiv:2507.09089.
[3] SonarSource, "Agentic Analysis --- Verify AI Code as It Is Generated," 2026. https://docs.sonarsource.com/sonarqube-cloud/analyzing-source-code/agentic-analysis
[4] Arize AI, "Phoenix: AI Observability & Evaluation," GitHub, 2025. https://github.com/Arize-ai/phoenix
[5] S. Ghosh, M. Panday, "The Dunning-Kruger Effect in Large Language Models: An Empirical Study of Confidence Calibration," arXiv:2603.09985, Feb. 2026.
[6] "SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?" arXiv:2509.16941, Sep. 2025.
[7] CodeRabbit, "State of AI vs Human Code Generation Report," Dec. 2025. https://www.coderabbit.ai/whitepapers/state-of-AI-vs-human-code-generation-report
[8] Faros AI, "The AI Productivity Paradox Research Report," 2026. https://www.faros.ai/ai-productivity-paradox
[9] European Parliament, "Regulation (EU) 2024/1689 --- Artificial Intelligence Act," Article 12. https://artificialintelligenceact.eu/article/12/
[10] K. Judy, "A Plan-Do-Check-Act Framework for AI Code Generation," InfoQ, Oct. 2025. https://www.infoq.com/articles/PDCA-AI-code-generation/
[11] T. A. Cronin, "How to Create Immutable Audit Trails for AI Agents," DEV Community, 2026. https://dev.to/guardianchain/how-to-create-immutable-audit-trails-for-ai-agents-5boc
[12] Prefactor, "Audit Trails in CI/CD: Best Practices for AI Agents," 2026. https://prefactor.tech/blog/audit-trails-in-ci-cd-best-practices-for-ai-agents
[13] "When to use Graphs in RAG: A Comprehensive Analysis for Graph Retrieval-Augmented Generation," arXiv:2506.05690, ICLR 2026.
[14] C. Spiess et al., "Calibration and Correctness of Language Models for Code," ICSE 2025. doi:10.1109/ICSE55347.2025.00040.
[15] S. Li et al., "A Survey on the Honesty of Large Language Models," arXiv:2409.18786. Accepted TMLR 2025.
[16] P. Chhikara, "Mind the Confidence Gap: Overconfidence, Calibration, and Distractor Effects in LLMs," TMLR, Feb. 2025. arXiv:2502.11028.
[17] T. Pan, "Agentic Coding in Production: What SWE-bench Scores Don't Tell You," Apr. 2026. https://tianpan.co/blog/2026-04-09-agentic-coding-production-swebench-gap
[18] NIST, "Artificial Intelligence Risk Management Framework (AI RMF 1.0)," NIST AI 100-1, Jan. 2023. https://nvlpubs.nist.gov/nistpubs/ai/nist.ai.100-1.pdf
