The AI Implementation Fidelity Gap

Measuring and Closing the Chasm Between Claimed and Verified Software Delivery in Autonomous Coding Agents

Adverant Research | April 2026 | NexusQA Platform --- nexusqa.ai

Idea in Brief

The Problem: In our case study, an AI coding agent claimed 85% completion across 73 deliverables; an evidence-based audit found 36%. Across six sources, the gap between claimed or perceived performance and measured performance ranges from roughly 9 to 72 percentage points.
The Solution: The AI Implementation Fidelity Scorecard (AIFS) --- a standalone microservice that tracks three scores per deliverable: proposed (AI's claim), implemented (code analysis), deployed (production evidence).
The Innovation: Append-only PostgreSQL with immutability triggers, SHA-256 hash chains, a "56% ceiling rule" for unverified code, and five evidence grades --- creating the first cryptographically verifiable implementation fidelity system.
The Impact: Fidelity gap as organizational KPI, per-model calibration comparison, regulatory compliance with EU AI Act Article 12, and closed-loop improvement via knowledge graph integration.

The Overconfidence Crisis
Empirical Evidence: The Fidelity Gap
System Architecture: AIFS
Data Model: Immutable Scorecards
Evidence Collection Protocol
The 56% Ceiling Rule
Fidelity Gap as KPI
Integration Architecture
Dashboard UI/UX Design
Regulatory Alignment
Related Work
Discussion & Future Work
References

1. The Overconfidence Crisis

Something uncomfortable is happening in software engineering. AI coding tools have crossed the adoption threshold --- SonarQube's 2026 developer survey reports that 42% of committed code is now AI-generated or AI-assisted [1]. Developers believe these tools make them faster. The data says otherwise.

The METR study tracked 16 experienced open-source developers across 246 real tasks [2]. Before starting, developers predicted AI would save them 24% of their time. After completing the tasks, they still believed it had saved them 20%. Actual measurement: AI made them 19% slower. That is a 39-point perception-reality gap, and it persisted even after developers had empirical evidence of their own performance.

We encountered this phenomenon firsthand during a Tool Selection Engine (TSE) consolidation project. The AI agent claimed 85% completion across 73 deliverables. A brutally honest audit --- requiring concrete evidence for every claim --- found 36% actual completion. The missing 49 points were not minor omissions. They included an entire middleware directory that the agent described as "90% complete" but which contained zero files.

The Missing Measurement

Existing tools address fragments of this problem. SonarQube's AI Code Assurance detects "AI slop" [3]. Allure TestOps attaches screenshots to test results. Arize Phoenix records LLM execution traces [4]. None of these answer the fundamental question:

Did the AI agent do what it said it would do?

This is the implementation fidelity gap --- the chasm between an agent's self-reported completion status and the externally verified state of its deliverables across the full lifecycle from planning through deployment.

 THE FIDELITY GAP --- WHAT AI CLAIMS vs. WHAT IT DELIVERS
 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

  100% ┬─────────────────────────────────────────────────
       │  ████                                   AI CLAIMS
   90% │  ████  ░░░░                             ─────────
       │  ████  ░░░░                             ACTUALLY
   80% │  ████  ░░░░  ████                       DELIVERED
       │  ████  ░░░░  ████                       ░░░░░░░░░
   70% │  ████  ░░░░  ████
       │  ████  ░░░░  ████  ░░░░
   60% │  ████  ░░░░  ████  ░░░░
       │  ████  ░░░░  ████  ░░░░  ████
   50% │  ████  ░░░░  ████  ░░░░  ████
       │  ████  ░░░░  ████  ░░░░  ████
   40% │  ████  ░░░░  ████  ░░░░  ████  ░░░░
       │  ████  ████  ████  ░░░░  ████  ░░░░
   30% │  ████  ████  ████  ░░░░  ████  ░░░░
       │  ████  ████  ████  ████  ████  ░░░░
   20% │  ████  ████  ████  ████  ████  ░░░░
       │  ████  ████  ████  ████  ████  ████
   10% │  ████  ████  ████  ████  ████  ████
       │  ████  ████  ████  ████  ████  ████
    0% ┴──────┴──────┴──────┴──────┴──────┴──────
       TSE     METR    Kimi   SWE-   Code   Faros
       Case    Study    K2    bench  Rabbit   AI
       Study

        49pt    39pt   72pt   57pt   ~41pt   ~9pt
        gap     gap    gap    gap    gap     gap

2. Empirical Evidence: The Fidelity Gap

2.1 Convergent Findings from Six Independent Studies

The following table summarizes evidence from six independent sources --- each measuring a different facet of the AI implementation fidelity problem, yet all converging on the same qualitative finding: AI self-assessment systematically overstates actual performance.

Source	Sample	Claimed	Actual	Gap
TSE Consolidation (firsthand)	73 deliverables	85% complete	36% complete	49 pp
METR Study [2]	16 devs, 246 tasks	20% faster (perceived)	19% slower (measured)	39 pp
Kimi K2 Calibration [5]	TriviaQA benchmark	95.7% confidence	23.3% accuracy	72 pp
SWE-bench [6]	Verified vs. Pro	80% score	23% score	57 pp
CodeRabbit [7]	470 GitHub PRs	Baseline quality	1.7x more issues	~41 pp
Faros AI [8]	10,000+ developers	98% more PRs	9% more bugs	~9 pp

2.2 The Dunning-Kruger Effect in LLMs

Ghosh and Panday's study [5] documented Expected Calibration Error (ECE) values across multiple LLMs:

 ECE (EXPECTED CALIBRATION ERROR) BY MODEL
 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
                                                  WORSE
 Kimi K2       ████████████████████████████████████  0.726
                                                     ↑
 GPT-4o Mini   █████████████████████████             0.489
                                                     │
 Llama 3.3     ████████████████████                  0.391   MISCALIBRATION
                                                     │
 Gemini Flash  ██████████████████                    0.352
                                                     │
 GPT-4o        █████████████                         0.254
                                                     ↓
 Claude Haiku  ██████                                0.122
                                                  BETTER
 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
              0.0    0.2    0.4    0.6    0.8    1.0

The inverse relationship between competence and overconfidence mirrors the human Dunning-Kruger cognitive bias --- the least capable models express the highest confidence.
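For readers unfamiliar with the metric, the sketch below computes ECE in its common equal-width-bin form: predictions are grouped by stated confidence, and each bin contributes its share of samples times the absolute difference between accuracy and mean confidence. The function name, the ten-bin default, and the toy data are assumptions made for illustration; the exact binning used in [5] is not stated here and may differ.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Equal-width-bin ECE: sum over bins of (bin share) * |accuracy - mean confidence|."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("at least one prediction is required")
    if any(c < 0.0 or c > 1.0 for c in confidences):
        raise ValueError("confidences must lie in [0, 1]")

    # Each bin accumulates [sum of confidences, number correct, sample count].
    bins = [[0.0, 0, 0] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx][0] += conf
        bins[idx][1] += int(ok)
        bins[idx][2] += 1

    total = len(confidences)
    ece = 0.0
    for conf_sum, n_correct, count in bins:
        if count == 0:
            continue  # empty bins contribute nothing
        mean_conf = conf_sum / count
        accuracy = n_correct / count
        ece += (count / total) * abs(accuracy - mean_conf)
    return ece


# Toy data invented for illustration: always 90% confident, right one time in four.
toy_ece = expected_calibration_error([0.9] * 4, [True, False, False, False])
print(round(toy_ece, 3))
```

A model that is uniformly confident and mostly wrong, as in the toy data, lands in a single bin, so its ECE is simply the distance between its confidence and its accuracy. That is why a reading of 95.7% confidence against 23.3% accuracy corresponds to an ECE near 0.72.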

2.3 Code Quality Impact

CodeRabbit's analysis of 470 GitHub pull requests [7] reveals the scope of quality degradation in AI-generated code:

 AI vs. HUMAN CODE QUALITY (CodeRabbit, Dec 2025)
 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

                           AI      Human     Multiplier
                          ─────    ─────     ──────────
 Logic Errors             ████     ██        1.75x more
 Readability Issues       ██████   ██        3.00x more
 Security Vulnerabilities ████     ██        1.5-2x more
 XSS Vulnerabilities      █████    ██        2.74x more
 Password Handling Flaws  ████     ██        1.88x more
 Insecure Object Refs     ████     ██        1.91x more
 Total Issues per PR      ████     ██        1.70x more

 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 Source: 320 AI-co-authored + 150 human-only PRs analyzed
 74 confirmed CVEs directly attributed to AI coding tools

2.4 The Productivity Paradox

Faros AI's study of 10,000+ developers across 1,255 teams [8] reveals the macro-level consequences:

 THE AI PRODUCTIVITY PARADOX (Faros AI, 2026)
 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

          Individual Gains              System-Level Costs
       ┌──────────────────┐          ┌──────────────────┐
       │  +21% tasks done  │          │ +91% review time  │
       │  +98% PRs merged  │    →     │ +154% PR size     │
       │  More output/dev  │          │ +9% bugs/dev      │
       └──────────────────┘          └──────────────────┘

                    ┌─────────────────────┐
                    │   NET RESULT:        │
                    │   No measurable      │
                    │   company-level      │
                    │   improvement        │
                    │                      │
                    │   (Amdahl's Law:     │
                    │   review bottleneck  │
                    │   cancels gains)     │
                    └─────────────────────┘

3. System Architecture: AIFS

3.1 High-Level Architecture

The AI Implementation Fidelity Scorecard deploys as nexus-fidelity --- a standalone microservice within the Nexus Kubernetes cluster that passively observes AI coding sessions and collects evidence at every lifecycle checkpoint.

 AIFS SYSTEM ARCHITECTURE
 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

 ┌─────────────────────────────────────────────────────────────────────┐
 │                        CLAUDE CODE SESSION                          │
 │                                                                     │
 │  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────────────┐   │
 │  │  Plan     │→ │  Gate 1  │→ │  BHA     │→ │  Implementation  │   │
 │  │  (Step 1) │  │  (Step 3)│  │  (Step 4)│  │  (Step 7)        │   │
 │  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────────┬─────────┘   │
 │       │              │              │                  │             │
 │  ┌────┴──────────────┴──────────────┴──────────────────┴──────────┐ │
 │  │              PostToolUse Hook: fidelity-scorecard.sh           │ │
 │  │              (async, fire-and-forget, never blocks)            │ │
 │  └──────────────────────────────┬────────────────────────────────┘ │
 └─────────────────────────────────┼──────────────────────────────────┘
                                   │ POST /api/v1/checkpoints
                                   ▼
 ┌─────────────────────────────────────────────────────────────────────┐
 │                     nexus-fidelity :9130                            │
 │  ┌───────────────┐  ┌───────────────┐  ┌─────────────────────────┐ │
 │  │  Scorecard     │  │  Evidence      │  │  Grading Engine         │ │
 │  │  Service       │  │  Collector     │  │  (PASS/PARTIAL/FAIL/    │ │
 │  │                │  │                │  │   NOT_TESTED/HONEST_GAP)│ │
 │  └───────┬───────┘  └───────┬───────┘  └────────────┬────────────┘ │
 │          │                  │                        │              │
 │  ┌───────┴──────────────────┴────────────────────────┴────────────┐ │
 │  │                Hash Chain Service (SHA-256)                    │ │
 │  └───────────────────────────┬────────────────────────────────────┘ │
 └──────────────────────────────┼──────────────────────────────────────┘
                                │
              ┌─────────────────┼─────────────────┐
              ▼                 ▼                  ▼
 ┌────────────────┐  ┌──────────────┐  ┌─────────────────┐
 │   PostgreSQL    │  │   GraphRAG   │  │   Redis Pub/Sub │
 │  fidelity.*     │  │  fidelity_*  │  │  orchestration: │
 │  (IMMUTABLE)    │  │  entities    │  │  job:* events   │
 │  INSERT ONLY    │  │              │  │                 │
 └────────────────┘  └──────────────┘  └─────────────────┘
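The hook in the diagram is a shell script; the Python sketch below only illustrates the "async, fire-and-forget, never blocks" contract it describes. The hostname, the service key value, and the payload fields are assumptions (the field names are borrowed from the checkpoints table in Section 4.1); the actual request schema of `POST /api/v1/checkpoints` is not specified here.

```python
import json
import threading
import urllib.request

# Hypothetical values: the in-cluster hostname and the key are placeholders.
FIDELITY_URL = "http://nexus-fidelity.nexus.svc:9130/api/v1/checkpoints"
SERVICE_KEY = "replace-me"


def build_checkpoint_request(payload: dict) -> urllib.request.Request:
    """Build the POST request without sending it."""
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    return urllib.request.Request(
        FIDELITY_URL,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json", "X-Service-Key": SERVICE_KEY},
    )


def _send_quietly(request: urllib.request.Request, timeout: float) -> None:
    try:
        with urllib.request.urlopen(request, timeout=timeout):
            pass
    except Exception:
        # Observation must never break the coding session, so errors are dropped.
        pass


def report_checkpoint(payload: dict, timeout: float = 2.0) -> threading.Thread:
    """Send the checkpoint on a daemon thread and return immediately."""
    request = build_checkpoint_request(payload)
    thread = threading.Thread(target=_send_quietly, args=(request, timeout), daemon=True)
    thread.start()
    return thread


# Example payload (values invented): building the request involves no network I/O.
example = build_checkpoint_request(
    {"gate_name": "Gate 1", "step_number": 3, "claimed_completion": 0.85}
)
print(example.get_method(), example.full_url)
```

Swallowing every exception is a deliberate trade-off in this sketch: a lost checkpoint is preferred over a blocked or failed agent session. A production hook would probably want to log or queue the failure somewhere.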

3.2 Service Identity

Property	Value
Service Name	`nexus-fidelity`
Port	9130 (HTTP), 9131 (WebSocket)
K8s Deployment	`nexus-fidelity` in `nexus` namespace
Replicas	2 (HPA on request rate)
Database	PostgreSQL `fidelity` schema
Knowledge Graph	GraphRAG `fidelity` domain entities
Auth (internal)	X-Service-Key
Auth (dashboard)	JWT

3.3 Design Principles

 FIVE ARCHITECTURAL PRINCIPLES
 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

 ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
 │  1. APPEND-ONLY │    │ 2. CONTENT-     │    │ 3. THREE-SCORE  │
 │   IMMUTABILITY  │    │  ADDRESSABLE    │    │   TRACKING      │
 │                 │    │                 │    │                 │
 │  INSERT only.   │    │  SHA-256 hash   │    │  S_proposed     │
 │  UPDATE raises  │    │  chain links    │    │  S_implemented  │
 │  exception.     │    │  every record.  │    │  S_deployed     │
 │  DELETE raises  │    │  Tampering      │    │                 │
 │  exception.     │    │  breaks chain.  │    │  Each measured  │
 │                 │    │                 │    │  independently  │
 │  Enforced at    │    │  No blockchain  │    │  at different   │
 │  DB trigger     │    │  needed --- hash  │    │  lifecycle      │
 │  level.         │    │  chain suffices │    │  stages.        │
 └─────────────────┘    └─────────────────┘    └─────────────────┘

 ┌─────────────────┐    ┌─────────────────┐
 │ 4. EVIDENCE-    │    │  5. DUAL        │
 │   LINKED        │    │   STORAGE       │
 │                 │    │                 │
 │  No score       │    │  PostgreSQL     │
 │  without        │    │  for ACID       │
 │  evidence.      │    │  queries.       │
 │                 │    │                 │
 │  10 evidence    │    │  GraphRAG for   │
 │  types, each    │    │  semantic       │
 │  SHA-256        │    │  search and     │
 │  hashed.        │    │  cross-session  │
 │                 │    │  discovery.     │
 └─────────────────┘    └─────────────────┘

4. Data Model: Immutable Scorecards

4.1 Entity-Relationship Diagram

 FIDELITY SCHEMA --- ENTITY RELATIONSHIPS
 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

 ┌─────────────────────────────────────────────────┐
 │                fidelity.scorecards               │
 │ ─────────────────────────────────────────────── │
 │ id              UUID PK                         │
 │ session_id      VARCHAR(255)                    │
 │ project_name    VARCHAR(255)                    │
 │ branch          VARCHAR(255)                    │
 │ organization_id UUID                            │
 │ plan_hash       VARCHAR(64)  ← SHA-256          │
 │ plan_summary    TEXT                            │
 │ total_deliverables INTEGER                      │
 │ proposed_score  REAL                            │
 │ implemented_score REAL                          │
 │ deployed_score  REAL                            │
 │ fidelity_gap    REAL  GENERATED (prop - depl)   │
 │ chain_hash      VARCHAR(64)  ← SHA-256 chain    │
 │ previous_scorecard_id UUID FK → self            │
 │ status          CHECK (active|completed|aband.) │
 │ created_at      TIMESTAMPTZ                     │
 │ ═══════════════════════════════════════════════ │
 │ TRIGGER: no_update (RAISE EXCEPTION)            │
 │ TRIGGER: no_delete (RAISE EXCEPTION)            │
 └───────┬─────────────────────────┬───────────────┘
         │ 1:N                     │ 1:N
         ▼                         ▼
 ┌──────────────────┐     ┌──────────────────────┐
 │ fidelity.        │     │ fidelity.             │
 │ deliverables     │     │ checkpoints           │
 │ ────────────── │     │ ──────────────────── │
 │ id          UUID│     │ id          UUID      │
 │ scorecard_id FK │     │ scorecard_id FK       │
 │ deliverable_key │     │ gate_name    VARCHAR  │
 │ description TEXT│     │ step_number  INT      │
 │ proposed_score  │     │ verdict      VARCHAR  │
 │ implemented_    │     │ claimed_completion    │
 │   score         │     │ verified_completion   │
 │ deployed_score  │     │ checks_executed JSONB │
 │ evidence_type   │     │ artifacts       JSONB │
 │ evidence_data   │     │ checkpoint_hash       │
 │   JSONB         │     │   VARCHAR(64)         │
 │ evidence_hash   │     │ previous_checkpoint_  │
 │   VARCHAR(64)   │     │   id UUID FK → self   │
 │ grade  CHECK    │     │ created_at            │
 │ (PASS|PARTIAL|  │     │ ════════════════════ │
 │  FAIL|NOT_TEST| │     │ TRIGGER: no_update    │
 │  HONEST_GAP)    │     │ TRIGGER: no_delete    │
 │ ══════════════ │     └──────────────────────┘
 │ TRIGGER: no_upd │
 │ TRIGGER: no_del │
 └──────────────────┘

                    ┌──────────────────────┐
                    │ fidelity.             │
                    │ daily_metrics         │
                    │ ──────────────────── │
                    │ id      BIGSERIAL    │
                    │ organization_id UUID │
                    │ project_name VARCHAR │
                    │ metric_date   DATE   │
                    │ avg_proposed_score   │
                    │ avg_implemented_score│
                    │ avg_deployed_score   │
                    │ avg_fidelity_gap     │
                    │ overconfidence_rate  │
                    │ underconfidence_rate │
                    │ UNIQUE (org, proj,   │
                    │   metric_date)       │
                    │ ════════════════════ │
                    │ (UPDATABLE --- daily   │
                    │  materialization)    │
                    └──────────────────────┘
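As a rough illustration of the three-score model, the sketch below mirrors the score columns of `fidelity.scorecards` in a frozen Python dataclass, with `fidelity_gap` derived as proposed minus deployed, matching the generated column in the diagram. The 0-100 score scale and the mapping of the TSE audit's 36% to the deployed score are assumptions made for the example; the schema itself does not fix either.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Scorecard:
    """Hypothetical in-memory mirror of the score columns in fidelity.scorecards."""

    proposed_score: float
    implemented_score: Optional[float] = None
    deployed_score: Optional[float] = None

    @property
    def fidelity_gap(self) -> Optional[float]:
        # Mirrors GENERATED (proposed - deployed); undefined until deployment is measured.
        if self.deployed_score is None:
            return None
        return self.proposed_score - self.deployed_score


# The agent's claim is recorded first; later measurements create NEW records
# rather than editing the old one, echoing the append-only rule.
claimed = Scorecard(proposed_score=85.0)
audited = replace(claimed, deployed_score=36.0)

print(claimed.fidelity_gap, audited.fidelity_gap)
```

Because the dataclass is frozen, `replace` returns a second object and leaves the first untouched, which is the in-memory analogue of inserting a new scorecard row linked through `previous_scorecard_id`.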

4.2 Hash Chain Construction

Every checkpoint record includes a SHA-256 hash of its structured data concatenated with the previous checkpoint's hash. Tampering with any record invalidates all subsequent hashes.

 SHA-256 HASH CHAIN --- TAMPER-EVIDENT CHECKPOINT LINKAGE
 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

 GENESIS                                              LATEST
 ┌──────────────┐    ┌──────────────┐    ┌──────────────┐
 │ Checkpoint 1 │    │ Checkpoint 2 │    │ Checkpoint 3 │
 │ Gate 1: Plan │    │ Gate 2: Code │    │ Deploy Valid │
 │              │    │              │    │              │
 │ data: {...}  │    │ data: {...}  │    │ data: {...}  │
 │              │    │              │    │              │
 │ prev: 0x000  │    │ prev: h₁     │    │ prev: h₂     │
 │              │    │              │    │              │
 │ h₁ = SHA256( │───▶│ h₂ = SHA256( │───▶│ h₃ = SHA256( │
 │   data₁ ‖    │    │   data₂ ‖    │    │   data₃ ‖    │
 │   0x000...)  │    │   h₁)        │    │   h₂)        │
 │              │    │              │    │              │
 │ h₁ = a3f8... │    │ h₂ = 7c2d... │    │ h₃ = e91b... │
 └──────────────┘    └──────────────┘    └──────────────┘

 VERIFICATION: Walk chain from h₃ → h₂ → h₁ → genesis.
 Recompute each hash. Any mismatch = TAMPERING DETECTED.

 ┌───────────────────────────────────────────────────────┐
 │  ALGORITHM: HashChainConstruction                     │
 │  ─────────────────────────────────────────────────── │
 │  INPUT:  checkpoint_data C, previous_hash h_prev      │
 │  OUTPUT: current_hash h_current                       │
 │                                                       │
 │  1. payload ← canonicalize(C)  // deterministic JSON  │
 │  2. input   ← payload ‖ h_prev                        │
 │  3. h_current ← SHA-256(input)                        │
 │  4. INSERT (C, h_current, h_prev) INTO checkpoints    │
 │  5. RETURN h_current                                  │
 └───────────────────────────────────────────────────────┘
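The construction and verification steps above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the service's code: the chain is a plain list instead of a database table, `canonicalize` is assumed to mean JSON with sorted keys and compact separators, and the genesis value is assumed to be 64 zero characters. The verifier also walks forward from genesis rather than backward from the latest record, which checks the same links.

```python
import hashlib
import json
from typing import Optional

GENESIS_HASH = "0" * 64  # stand-in for the diagram's 0x000... genesis value


def canonicalize(data: dict) -> bytes:
    """Deterministic JSON: sorted keys, no insignificant whitespace."""
    return json.dumps(data, sort_keys=True, separators=(",", ":")).encode("utf-8")


def chain_hash(data: dict, previous_hash: str) -> str:
    """h_current = SHA-256(payload || h_prev)."""
    return hashlib.sha256(canonicalize(data) + previous_hash.encode("ascii")).hexdigest()


def append_checkpoint(chain: list, data: dict) -> str:
    """Append a record linked to the current tail and return its hash."""
    previous_hash = chain[-1]["hash"] if chain else GENESIS_HASH
    current_hash = chain_hash(data, previous_hash)
    chain.append({"data": data, "prev": previous_hash, "hash": current_hash})
    return current_hash


def first_broken_link(chain: list) -> Optional[int]:
    """Return the index of the first record that fails verification, or None."""
    expected_prev = GENESIS_HASH
    for index, record in enumerate(chain):
        link_ok = record["prev"] == expected_prev
        hash_ok = chain_hash(record["data"], record["prev"]) == record["hash"]
        if not (link_ok and hash_ok):
            return index
        expected_prev = record["hash"]
    return None


# Three checkpoints with invented values, named after the gates in the diagram.
chain: list = []
append_checkpoint(chain, {"gate_name": "Gate 1: Plan", "claimed_completion": 0.85})
append_checkpoint(chain, {"gate_name": "Gate 2: Code", "claimed_completion": 0.85})
append_checkpoint(chain, {"gate_name": "Deploy Valid", "verified_completion": 0.36})
print(first_broken_link(chain))
```

Editing the data of any stored record changes its recomputed hash, so `first_broken_link` reports that record's index. Rewriting the stored hash to match would instead break the `prev` link of the next record.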

4.3 Immutability Enforcement


-- Database-level immutability: the AI CANNOT retroactively inflate scores
CREATE FUNCTION fidelity.prevent_mutation() RETURNS TRIGGER AS $$
BEGIN
  RAISE EXCEPTION
    'Fidelity records are immutable. Cannot % on table %.',
    TG_OP, TG_TABLE_NAME
  USING HINT = 'Create a new record instead of modifying existing ones.';
END;
$$ LANGUAGE plpgsql;

-- Applied to ALL core tables (scorecards shown; deliverables and checkpoints carry the same pair of triggers)
CREATE TRIGGER no_update_scorecards
  BEFORE UPDATE ON fidelity.scorecards
  FOR EACH ROW EXECUTE FUNCTION fidelity.prevent_mutation();

CREATE TRIGGER no_delete_scorecards
  BEFORE DELETE ON fidelity.scorecards
  FOR EACH ROW EXECUTE FUNCTION fidelity.prevent_mutation();

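With these triggers in place, an UPDATE or DELETE against a core table fails whether it is issued by the AIFS service itself, a direct database client, or the AI agent. Once a checkpoint is recorded, it can be superseded by a new record but never edited. Row-level triggers do not fire on TRUNCATE, and a role that owns the table can disable them, so the guarantee also depends on restricting those privileges; the hash chain in Section 4.2 then serves as a second check on any record altered out of band.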
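The append-only behavior can be tried locally without a PostgreSQL instance. The sketch below uses SQLite (via Python's standard `sqlite3` module) as a stand-in, with SQLite's `RAISE(ABORT, ...)` playing the role of `RAISE EXCEPTION`. The simplified table and the `attempt` helper are assumptions for the demonstration only; the production schema and trigger function are the PostgreSQL ones shown above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE scorecards (
        id INTEGER PRIMARY KEY,
        proposed_score REAL NOT NULL
    );

    -- SQLite analogue of fidelity.prevent_mutation(): abort the statement.
    CREATE TRIGGER no_update_scorecards BEFORE UPDATE ON scorecards
    BEGIN
        SELECT RAISE(ABORT, 'Fidelity records are immutable. Cannot UPDATE.');
    END;

    CREATE TRIGGER no_delete_scorecards BEFORE DELETE ON scorecards
    BEGIN
        SELECT RAISE(ABORT, 'Fidelity records are immutable. Cannot DELETE.');
    END;
    """
)
conn.execute("INSERT INTO scorecards (proposed_score) VALUES (85.0)")


def attempt(sql: str) -> str:
    """Run one statement and report whether the database allowed it."""
    try:
        conn.execute(sql)
        return "allowed"
    except sqlite3.DatabaseError as exc:
        return f"rejected: {exc}"


print(attempt("UPDATE scorecards SET proposed_score = 100.0 WHERE id = 1"))
print(attempt("DELETE FROM scorecards WHERE id = 1"))
print(attempt("INSERT INTO scorecards (proposed_score) VALUES (36.0)"))
```

Only the INSERT goes through; the original row keeps its score of 85.0, and a corrected measurement has to be written as a new row.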

5. Evidence Collection Protocol

5.1 Ten Evidence Types

 EVIDENCE TYPES AND COLLECTION METHODS
 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

 ┌──────────────┬──────────────────┬──────────────────────────────┐
 │ Type         │ Collection       │ Example                      │
 ├──────────────┼──────────────────┼──────────────────────────────┤
 │ api_response │ HTTP request     │ {status:200, body:{...}}     │
 │ screenshot   │ Playwright MCP   │ MinIO key + visual diff %    │
 │ test_result  │ Jest/Playwright  │ {passed:68, failed:5}        │
 │ db_query     │ SQL execution    │ SELECT count(*)... = 351     │
 │ log_line     │ kubectl logs     │ [TSE] selection via...       │
 │ code_grep    │ rg/grep search   │ governance.ts:45: if(tier..  │
 │ health_check │ HTTP probe       │ {status:"ok", pg:"ok"}       │
 │ pod_status   │ kubectl get pods │ 2/2 Running 0 5m             │
 │ git_diff     │ git diff --stat  │ 15 files, +2340 insertions   │
 │ console      │ Browser inject   │ {errors:0, warnings:2}       │
 └──────────────┴──────────────────┴──────────────────────────────┘

 Each evidence record includes:
 ┌─────────────────────────────────────────────────────────────────┐
 │  evidence_data: JSONB      ← Raw evidence payload              │
 │  evidence_hash: VARCHAR(64) ← SHA-256(evidence_data)           │
 │  evidence_collected_at: TIMESTAMPTZ                             │
 │                                                                 │
 │  Modification of evidence_data invalidates evidence_hash →      │
 │  TAMPERING IS DETECTABLE                                        │
 └─────────────────────────────────────────────────────────────────┘
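A minimal sketch of the tamper check described in the box above. The text specifies SHA-256 over evidence_data but not how the JSONB payload is serialized, so the canonical JSON encoding used here (sorted keys, compact separators) is an assumption, as are the function names.

```python
import hashlib
import json


def evidence_hash(evidence_data: dict) -> str:
    """SHA-256 hex digest over a canonical JSON encoding of the payload."""
    canonical = json.dumps(evidence_data, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_untampered(evidence_data: dict, stored_hash: str) -> bool:
    """Recompute the digest and compare it with the stored evidence_hash."""
    return evidence_hash(evidence_data) == stored_hash


# Example payload taken from the test_result row of the table above.
record = {"passed": 68, "failed": 5}
stored = evidence_hash(record)

# The same record after someone edits the counts.
tampered = {"passed": 73, "failed": 0}
```

Sorting keys before hashing keeps the digest stable when a JSON object is re-serialized with a different key order, which matters because JSONB does not preserve key order.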

5.2 Five-Grade System

 EVIDENCE GRADING --- FIVE GRADES WITH SCORE MULTIPLIERS
 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

 ┌──────────────┬────────┬─────────────────────────────────────────┐
 │    Grade     │ Multi. │ Criteria                                │
 ├──────────────┼────────┼─────────────────────────────────────────┤
 │              │        │                                         │
 │  ✓  PASS     │  100%  │ Evidence matches expected behavior      │
 │              │        │ exactly. Full functional proof.          │
 │              │        │                                         │
 │  ~  PARTIAL  │   75%  │ Evidence exists but edge cases           │
 │              │        │ uncovered or output differs slightly.   │
 │              │        │                                         │
 │  ✗  FAIL     │    0%  │ Evidence shows wrong behavior,          │
 │              │        │ errors, or missing functionality.       │
 │              │        │                                         │
 │  ?  NOT_     │  ≤56%  │ No production evidence collected.       │
 │     TESTED   │  ceil. │ CAPPED regardless of proposed score.    │
 │              │        │                                         │
 │  ◊  HONEST_  │   80%  │ Known limitation documented with        │
 │     GAP      │        │ proof that limitation exists.           │
 │              │        │                                         │
 └──────────────┴────────┴─────────────────────────────────────────┘

 SCORING EXAMPLES:

 Deliverable: "Deploy nexus-tse to K8s"
 ┌───────────────────────────────────────────────────────────────┐
 │  Proposed: 95%  │  Evidence: kubectl get pods → 2/2 Running  │
 │  Grade: PASS    │  Deployed: 95% × 100% = 95%                │
 │  Fidelity Gap:  │  95% - 95% = 0 (perfect calibration)       │
 └───────────────────────────────────────────────────────────────┘

 Deliverable: "Governance middleware enforcement"
 ┌───────────────────────────────────────────────────────────────┐
 │  Proposed: 90%  │  Evidence: middleware/ directory is EMPTY   │
 │  Grade: FAIL    │  Deployed: 90% × 0% = 0%                   │
 │  Fidelity Gap:  │  90% - 0% = 90 (catastrophic gap)          │
 └───────────────────────────────────────────────────────────────┘

 Deliverable: "Qdrant semantic search"
 ┌───────────────────────────────────────────────────────────────┐
 │  Proposed: 80%  │  Evidence: health shows "not_configured"   │
 │  Grade: HONEST_ │  Deployed: 80% × 80% = 64%                │
 │         GAP     │  Fidelity Gap: 80% - 64% = 16             │
 │                 │  (Gap exists but limitation is documented) │
 └───────────────────────────────────────────────────────────────┘
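The grading table and the three examples above can be sketched as a small function. Scores are in percentage points; NOT_TESTED is modeled as a cap at the ceiling rather than a multiplier, following the "≤56%" entry in the table. The names are illustrative and not taken from the AIFS codebase.

```python
# Multipliers for the four grades that scale the proposed score.
GRADE_MULTIPLIER = {
    "PASS": 1.00,
    "PARTIAL": 0.75,
    "FAIL": 0.00,
    "HONEST_GAP": 0.80,
}

NOT_TESTED_CEILING = 56.0  # percentage points


def deployed_score(proposed: float, grade: str,
                   ceiling: float = NOT_TESTED_CEILING) -> float:
    """Deployed score for one deliverable, in percentage points."""
    if grade == "NOT_TESTED":
        # No production evidence: the claim is capped, never raised.
        return min(proposed, ceiling)
    return proposed * GRADE_MULTIPLIER[grade]


def fidelity_gap(proposed: float, grade: str) -> float:
    """Proposed minus deployed, in percentage points."""
    return proposed - deployed_score(proposed, grade)


k8s_gap = fidelity_gap(95, "PASS")            # "Deploy nexus-tse to K8s"
governance_gap = fidelity_gap(90, "FAIL")     # "Governance middleware enforcement"
qdrant_gap = fidelity_gap(80, "HONEST_GAP")   # "Qdrant semantic search"
```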

6. The 56% Ceiling Rule

6.1 Definition

For any deliverable where no production evidence has been collected, the deployed score is bounded at 56% --- regardless of the proposed or implemented score.

An agent that claims 95% confidence on a deliverable but provides no production evidence receives min(0.95, 0.56) = 0.56 as its deployed score; a claim already below the ceiling, such as 50%, passes through unchanged. This is the most aggressive constraint in the system and the most important.

6.2 Empirical Basis

 THE 56% CEILING --- EMPIRICAL DERIVATION
 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

 Of 73 TSE deliverables self-reported as "code complete":

 ┌────────────────────────────────────────────────┐
 │                                                │
 │   Functionally correct:  41/73 = 56%  ████████ │
 │                                                │
 │   Issues found:          32/73 = 44%  ██████   │
 │     ├─ Missing impl:     12/73 = 16%  ███      │
 │     ├─ Logic errors:      9/73 = 12%  ██       │
 │     ├─ Partial impl:      7/73 = 10%  ██       │
 │     └─ Config errors:     4/73 =  5%  █        │
 │                                                │
 └────────────────────────────────────────────────┘

 CROSS-VALIDATION WITH INDEPENDENT ESTIMATES:

 Source                    Effective Fidelity    Method
 ─────────────────────     ──────────────────    ────────────────────
 TSE Case Study            56%                   Direct measurement
 SWE-bench (geometric)     ~43%                  √(0.80 × 0.23)
 METR (productivity ratio) ~68%                  0.81 / 1.20
 CodeRabbit (parity)       ~59%                  1 / 1.70

 Range: 43% --- 68%
 56% sits in the middle of this range.
 Configurable per organization based on their own calibration data.
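The cross-validation arithmetic can be reproduced directly. The inputs are the figures shown in the table above; the dictionary keys and variable names are illustrative.

```python
import math

# Effective-fidelity estimates, computed from the figures in the table.
estimates = {
    "TSE case study": 41 / 73,
    "SWE-bench (geometric mean)": math.sqrt(0.80 * 0.23),
    "METR (productivity ratio)": 0.81 / 1.20,
    "CodeRabbit (parity)": 1 / 1.70,
}

low = min(estimates.values())
high = max(estimates.values())
midpoint = (low + high) / 2

for source, value in estimates.items():
    print(f"{source:28s} {value:.1%}")
print(f"range {low:.1%} to {high:.1%}, midpoint {midpoint:.1%}")
```

The midpoint of the range lands within about one percentage point of the 56% measured in the TSE case study.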

6.3 Impact on Scoring

 EFFECT OF 56% CEILING ON INFLATED CLAIMS
 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

                    WITHOUT Ceiling              WITH Ceiling
                    (AI's claim stands)          (Evidence required)

 Proposed: 95%  →   Deployed: 95%           →   Deployed: ≤56%
                    (trusts AI blindly)          (demands proof)

 Proposed: 85%  →   Deployed: 85%           →   Deployed: ≤56%
                    (phantom completion)         (honest cap)

 Proposed: 60%  →   Deployed: 60%           →   Deployed: ≤56%
                    (modest claim, still          (still capped ---
                     no evidence)                 evidence needed)

 Proposed: 50%  →   Deployed: 50%           →   Deployed: 50%
                    (below ceiling, unchanged)   (below ceiling,
                                                  no cap applied)

 ═══════════════════════════════════════════════════════════════════
 KEY INSIGHT: The ceiling penalizes EXACTLY the common failure mode
 --- claiming high completion with zero production verification.

7. Fidelity Gap as KPI

7.1 Formal Definition

 FIDELITY GAP COMPUTATION
 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

 For scorecard S with deliverables D = {d₁, d₂, ..., dₙ}:

            1    n
   Δf(S) = ─── × Σ  ( S_proposed(dᵢ) - S_deployed(dᵢ) )
            n   i=1

 INTERPRETATION:

   Δf = 0        Perfect calibration. Agent claimed exactly what
                  was delivered. IDEAL STATE.

   0 < Δf ≤ 15   Minor overconfidence. Normal operating range.
                  Most sessions should land here.

   15 < Δf ≤ 30  Moderate gap. Review recommended.
                  Check if specific deliverable types are driving it.

   30 < Δf ≤ 50  Critical miscalibration. Human review required.
                  Agent should not auto-deploy at this gap level.

   Δf > 50       Dangerous. Agent's self-assessment is unreliable.
                  Block autonomous operation for this task type.

   Δf < 0        Underconfidence. Agent delivered more than claimed.
                  RARE but observed (8% of TSE deliverables).
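A minimal sketch of the gap computation and the interpretation bands above, assuming scores are expressed in percentage points. The function names and band labels are illustrative.

```python
from statistics import mean


def session_fidelity_gap(deliverables: list[tuple[float, float]]) -> float:
    """Mean of (proposed - deployed) over all deliverables, in points."""
    if not deliverables:
        raise ValueError("a scorecard needs at least one deliverable")
    return mean(proposed - deployed for proposed, deployed in deliverables)


def classify_gap(gap: float) -> str:
    """Map a fidelity gap onto the interpretation bands."""
    if gap < 0:
        return "underconfident"
    if gap == 0:
        return "perfect"
    if gap <= 15:
        return "minor"
    if gap <= 30:
        return "moderate"
    if gap <= 50:
        return "critical"
    return "dangerous"


# (proposed, deployed) pairs from the three scoring examples in Section 5.2.
session = [(95, 95), (90, 0), (80, 64)]
gap = session_fidelity_gap(session)
band = classify_gap(gap)
```

For these three deliverables the mean gap is (0 + 90 + 16) / 3, roughly 35.3 points, which falls in the band that requires human review.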

7.2 Improvement Tracking

 WEEKLY FIDELITY IMPROVEMENT --- KPI DASHBOARD CONCEPT
 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

 ρ_week = avg_Δf(this_week) - avg_Δf(last_week)

   Negative ρ = IMPROVING (gap shrinking)  ↓ GOOD
   Positive ρ = REGRESSING (gap growing)   ↑ BAD

 TREND CHART (12 weeks):

  50 ┬──────────────────────────────────────────────────
     │  ●                                    Target: <15
  45 │   ╲
     │    ╲
  40 │     ●
     │      ╲                               Actual trend
  35 │       ╲                              ─────────────
     │        ●
  30 │         ╲──●
     │              ╲
  25 │               ●──●
     │                    ╲
  20 │                     ●──●
     │                          ╲──●
  15 │─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─╲─●─ ─ TARGET ─ ─ ─
     │                               ╲──●
  10 │                                    ╲──●
     │
   5 │
     │
   0 ┴──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──
     W1  W2  W3  W4  W5  W6  W7  W8  W9 W10 W11 W12

  ρ: -5  -5  -5  -2  -5  0  -5  -2  -5  -3  -5  -2
  VERDICT: Consistent improvement. Gap cut by more than half in 12 weeks.
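The weekly improvement rate is a first difference over the weekly average gap. A short sketch follows; the sample values are hypothetical and do not come from the chart.

```python
def weekly_improvement(avg_gaps: list[float]) -> list[float]:
    """rho for each week after the first: this week's gap minus last week's."""
    return [curr - prev for prev, curr in zip(avg_gaps, avg_gaps[1:])]


def trend(rho: float) -> str:
    """Negative rho means the gap is shrinking."""
    if rho < 0:
        return "improving"
    if rho > 0:
        return "regressing"
    return "flat"


# Hypothetical weekly average gaps, in percentage points.
weekly_avg_gap = [48.0, 43.0, 43.0, 45.0]
rhos = weekly_improvement(weekly_avg_gap)
labels = [trend(r) for r in rhos]
```

A series of n weekly averages yields n - 1 values of ρ, since the first week has no predecessor to compare against.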

7.3 Dimensional Analysis

 FIDELITY GAP ACROSS FOUR DIMENSIONS
 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

 PER-MODEL COMPARISON                    PER-PLUGIN COMPARISON
 ┌─────────────────────────┐            ┌─────────────────────────┐
 │                         │            │                         │
 │ Opus (judgment)    ██ 8 │            │ DB Migrations      █ 3  │
 │ Sonnet (execution) ████ │            │ API Routes       ███ 15 │
 │                      18 │            │ UI Components   █████   │
 │ Haiku (plumbing)  ██ 12 │            │                     28  │
 │                         │            │ Config/K8s      ██ 10   │
 │ PREDICTION: Opus has    │            │ Full Pipeline  ██████   │
 │ smallest gap on plan    │            │                     35  │
 │ tasks; Sonnet on impl   │            │                         │
 └─────────────────────────┘            └─────────────────────────┘

 PER-TASK-TYPE                           TEMPORAL (SELF-IMPROVEMENT)
 ┌─────────────────────────┐            ┌─────────────────────────┐
 │                         │            │                         │
 │ Schema migration  █ 5   │            │ Week 1-4:  ████████ 42  │
 │ (deterministic SQL)     │            │ Week 5-8:  █████   28   │
 │                         │            │ Week 9-12: ███     18   │
 │ API endpoint    ███ 15  │            │ Week 13+:  ██      12   │
 │ (moderate complexity)   │            │                         │
 │                         │            │ As GraphRAG accumulates │
 │ UI component  ██████ 35 │            │ more remediation trails │
 │ (visual + subjective)   │            │ the Plan Review Gate    │
 │                         │            │ (Stage 4) improves →    │
 │ Full-stack   ████████   │            │ agents learn from past  │
 │ feature          48     │            │ overconfidence.          │
 └─────────────────────────┘            └─────────────────────────┘

7.4 Calibration Metrics

 EXPECTED CALIBRATION ERROR (ECE) --- BORROWED FROM LLM RESEARCH
 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

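         B
 ECE  =  Σ  (|b| / n) × | avg_proposed(b) - avg_deployed(b) |
        b=1

 Where b indexes the B confidence bins (e.g., 0-20%, 20-40%, ..., 80-100%),
 |b| is the number of deliverables in bin b, and n is the total number
 of deliverables.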

 CALIBRATION PLOT:

 Deployed    Perfect calibration
 Score       (diagonal line)
             /
  100% ┬────/──────────────────────────────
       │   /
   80% │──/────●                    ● = actual bin
       │ /      ╲                       performance
   60% │/        ●
       /          ╲
   40% │           ●──────●
       │                    ╲
   20% │                     ●
       │
    0% ┴──┬──┬──┬──┬──┬──┬──┬──┬──┬──
       0  10 20 30 40 50 60 70 80 90 100
                Proposed Score (%)

 OVERCONFIDENCE ZONE: Points below the diagonal.
 UNDERCONFIDENCE ZONE: Points above the diagonal.
 In our TSE data: 78% of deliverables fall below the diagonal.
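A minimal sketch of the ECE computation, assuming scores on a 0-1 scale and equal-width bins over the proposed score. The function name and the five-bin default are illustrative choices.

```python
def expected_calibration_error(pairs: list[tuple[float, float]],
                               num_bins: int = 5) -> float:
    """ECE over (proposed, deployed) pairs with scores in [0, 1]."""
    if not pairs:
        raise ValueError("need at least one deliverable")
    bins: list[list[tuple[float, float]]] = [[] for _ in range(num_bins)]
    for proposed, deployed in pairs:
        # A proposed score of exactly 1.0 belongs to the last bin.
        index = min(int(proposed * num_bins), num_bins - 1)
        bins[index].append((proposed, deployed))

    total = len(pairs)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_proposed = sum(p for p, _ in members) / len(members)
        avg_deployed = sum(d for _, d in members) / len(members)
        ece += (len(members) / total) * abs(avg_proposed - avg_deployed)
    return ece


# One well-calibrated low claim and one overconfident high claim.
ece_two_bins = expected_calibration_error([(0.1, 0.1), (0.9, 0.5)])
```

Because each bin contributes the absolute difference of its averages, overconfidence and underconfidence inside the same bin can cancel; finer bins reduce that effect at the cost of noisier per-bin averages.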

8. Integration Architecture

8.1 Six Integration Points

 AIFS INTEGRATION MAP --- SIX SYSTEMS, ONE SCORECARD
 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

                          ┌──────────────────┐
                          │   CLAUDE CODE    │
                          │   HOOKS          │
                          │ ──────────────── │
                          │ UserPromptSubmit │ ──── Initialize scorecard
                          │ PostToolUse      │ ──── Record checkpoints
                          │ (gate markers)   │ ──── Capture evidence
                          └────────┬─────────┘
                                   │
         ┌────────────────┐        │        ┌────────────────┐
         │  NEXUS-TSE     │        │        │  NEXUS-ALIVE   │
         │ ────────────── │        │        │ ──────────────  │
         │ Bandit scores  │◄───────┼───────►│ Pod health →   │
         │ adjusted by    │        │        │ deployed_score  │
         │ fidelity       │        │        │                 │
         │ feedback       │        │        │ Drift → alive   │
         │                │        │        │ incident signal │
         └────────────────┘        │        └────────────────┘
                                   │
                          ┌────────▼─────────┐
                          │  NEXUS-FIDELITY  │
                          │  :9130           │
                          │ ═══════════════  │
                          │ Scorecards       │
                          │ Deliverables     │
                          │ Checkpoints      │
                          │ Evidence         │
                          │ Hash Chains      │
                          │ Metrics          │
                          └────────┬─────────┘
                                   │
         ┌────────────────┐        │        ┌────────────────┐
         │ ORCHESTRATOR   │        │        │  GRAPHRAG      │
         │ ────────────── │        │        │ ──────────────  │
         │ Redis Pub/Sub: │───────►│◄──────►│ fidelity_*     │
         │ job:* events   │        │        │ entities       │
         │ start/progress │        │        │                │
         │ complete/error │        │        │ Cross-session  │
         └────────────────┘        │        │ pattern search │
                                   │        └────────────────┘
                          ┌────────▼─────────┐
                          │  WEBSOCKET       │
                          │ ──────────────── │
                          │ Socket.IO rooms  │
                          │ Live score       │
                          │ streaming to     │
                          │ dashboard        │
                          └──────────────────┘

8.2 Claude Code Hook Integration

 16-STEP LIFECYCLE --- AIFS CAPTURE POINTS (★)
 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

 PLANNING PHASE                     IMPLEMENTATION PHASE
 ┌──────────┐                       ┌──────────────────┐
 │ ★ Step 1 │  Plan created         │ Step 7           │  Code written
 │   AIFS:  │  plan_hash captured   │                  │
 │   Init   │  deliverables listed  │                  │
 │   score- │  proposed scores set  │                  │
 │   card   │                       │                  │
 └────┬─────┘                       └────────┬─────────┘
      │                                      │
 ┌────▼─────┐                       ┌────────▼─────────┐
 │ Step 2   │  Memory recall        │ ★ Step 8         │  Gate 2: Code
 └────┬─────┘                       │   AIFS:          │  C1-C6 checks
      │                             │   claimed vs     │
 ┌────▼─────┐                       │   verified       │
 │ ★ Step 3 │  Gate 1: Plan Review  └────────┬─────────┘
 │   AIFS:  │  P1-P6 checks                 │
 │   Record │  verdict captured     ┌────────▼─────────┐
 │   check- │                       │ ★ Step 9         │  BHA Code
 │   point  │                       │   AIFS:          │  IA-1 to IA-8
 └────┬─────┘                       │   artifacts as   │
      │                             │   evidence       │
 ┌────▼─────┐                       └────────┬─────────┘
 │ ★ Step 4 │  BHA Plan Audit                │
 │   AIFS:  │  PA-1 to PA-6        ┌────────▼─────────┐
 │   Record │  artifacts            │ Step 10          │  Verification
 │   BHA    │                       └────────┬─────────┘
 │   verdict│                                │
 └────┬─────┘                       ┌────────▼─────────┐
      │                             │ ★ Step 11        │  Commit
 ┌────▼─────┐                       │   AIFS:          │  implemented_
 │ Steps    │                       │   score from     │  score computed
 │ 5-6      │  Refine + persist     │   diff analysis  │
 └──────────┘                       └────────┬─────────┘
                                             │
 REVIEW PHASE                       ┌────────▼─────────┐
 ┌──────────┐                       │ Step 12          │  Persist
 │ ★ Step 13│  External review      └────────┬─────────┘
 │   AIFS:  │  Gemini assessment             │
 │   Record │  blind spots          ┌────────▼─────────┐
 │   review │                       │ Step 14          │  Todo verify
 └────┬─────┘                       └────────┬─────────┘
      │                                      │
      │                             DEPLOYMENT PHASE
      │                             ┌────────▼─────────┐
      │                             │ ★ Step 15        │  Build & Deploy
      │                             │   AIFS:          │
      │                             │   trigger prod   │
      └─────────────────────────────│   evidence       │
                                    │   collection     │
                                    └────────┬─────────┘
                                             │
                                    ┌────────▼─────────┐
                                    │ ★ Step 16        │  Deploy Valid.
                                    │   AIFS:          │  FINAL SCORES
                                    │   deployed_score │
                                    │   computed from  │
                                    │  health + console│
                                    │   + API evidence │
                                    └──────────────────┘

8.3 GraphRAG Entity Types

 GRAPHRAG FIDELITY DOMAIN --- ENTITY TYPES AND RELATIONSHIPS
 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

 ┌──────────────────────┐            ┌──────────────────────┐
 │ fidelity_scorecard   │─CONTAINS──▶│ fidelity_checkpoint  │
 │                      │            │                      │
 │ session-level record │            │ gate passage record  │
 │ three aggregate      │            │ claimed vs verified  │
 │ scores               │──FOLLOWS──▶│ previous checkpoint  │
 └──────────┬───────────┘            └──────────────────────┘
            │
            │ REFERENCES
            ▼
 ┌──────────────────────┐         ┌──────────────────────┐
 │ fidelity_evidence    │         │ fidelity_drift_      │
 │                      │         │ incident             │
 │ concrete proof       │         │                      │
 │ (API response,       │         │ when claimed >>      │
 │  screenshot, test)   │         │ actual (Δf > 30)     │
 │                      │         │                      │
 │ DERIVED_FROM →       │         │ REFERENCES →         │
 │ test_result or       │         │ scorecard            │
 │ deployment           │         │ RELATED_TO →         │
 └──────────────────────┘         │ qa_bug_pattern       │
                                  └──────────────────────┘

                          ┌──────────────────────┐
                          │ fidelity_trend       │
                          │                      │
                          │ historical pattern   │
                          │ PRECEDES/FOLLOWS     │
                          │ temporal chain       │
                          │                      │
                          │ Enables: "show all   │
                          │ sessions where gap   │
                          │ > 30 for this        │
                          │ component"           │
                          └──────────────────────┘
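The drift rule and the example query in the diagram can be sketched with plain in-memory records. This does not use any GraphRAG API; the dataclasses, the component field, and the session identifiers are hypothetical stand-ins for the entity types above.

```python
from dataclasses import dataclass

DRIFT_THRESHOLD = 30.0  # a gap above this records a drift incident


@dataclass(frozen=True)
class Scorecard:
    session_id: str
    component: str       # hypothetical field used by the example query
    fidelity_gap: float  # percentage points


@dataclass(frozen=True)
class DriftIncident:
    scorecard: Scorecard  # stands in for the REFERENCES edge


def detect_drift(scorecards: list[Scorecard]) -> list[DriftIncident]:
    """Create one incident per scorecard whose gap exceeds the threshold."""
    return [DriftIncident(card) for card in scorecards
            if card.fidelity_gap > DRIFT_THRESHOLD]


def sessions_with_drift(incidents: list[DriftIncident],
                        component: str) -> list[str]:
    """Sessions where the gap exceeded the threshold for one component."""
    return [incident.scorecard.session_id for incident in incidents
            if incident.scorecard.component == component]


scorecards = [
    Scorecard("s-001", "nexus-tse", 49.0),
    Scorecard("s-002", "nexus-tse", 12.0),
    Scorecard("s-003", "prosecreator", 35.0),
]
incidents = detect_drift(scorecards)
tse_sessions = sessions_with_drift(incidents, "nexus-tse")
```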

9. Dashboard UI/UX Design

9.1 Session Fidelity Overview

 ┌──────────────────────────────────────────────────────────────────────┐
 │  NEXUS FIDELITY --- Session Scorecard                    ⟳ Live │ ⚙  │
 ├──────────────────────────────────────────────────────────────────────┤
 │                                                                      │
 │  Session: TSE Consolidation (2026-04-11)                            │
 │  Project: Adverant-Nexus  Branch: main  Commit: a3f8c2d            │
 │                                                                      │
 │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌─────────┐ │
 │  │  PROPOSED     │  │ IMPLEMENTED  │  │  DEPLOYED     │  │ FIDELITY│ │
 │  │              │  │              │  │              │  │   GAP   │ │
 │  │    85%       │  │    62%       │  │    36%       │  │         │ │
 │  │   ┌────┐    │  │   ┌────┐    │  │   ┌────┐    │  │   49    │ │
 │  │   │████│    │  │   │████│    │  │   │░░░░│    │  │  points │ │
 │  │   │████│    │  │   │████│    │  │   │░░░░│    │  │         │ │
 │  │   │████│    │  │   │    │    │  │   │    │    │  │ CRITICAL│ │
 │  │   │████│    │  │   │    │    │  │   │    │    │  │  ⚠ ⚠ ⚠  │ │
 │  │   └────┘    │  │   └────┘    │  │   └────┘    │  │         │ │
 │  └──────────────┘  └──────────────┘  └──────────────┘  └─────────┘ │
 │                                                                      │
 │  HASH CHAIN: a3f8...→7c2d...→e91b...→4a7f...  ✓ VERIFIED           │
 │                                                                      │
 ├──────────────────────────────────────────────────────────────────────┤
 │                                                                      │
 │  DELIVERABLES (73 items)                     Filter: ▼ All Grades   │
 │                                                                      │
 │  ┌────┬────────────────────────────┬──────┬──────┬──────┬─────────┐ │
 │  │ #  │ Deliverable                │ Prop │ Impl │ Depl │  Grade  │ │
 │  ├────┼────────────────────────────┼──────┼──────┼──────┼─────────┤ │
 │  │1.1 │ tool_registry columns      │  100%│  100%│  100%│ ✓ PASS  │ │
 │  │1.2 │ CHECK constraints          │  100%│  100%│  100%│ ✓ PASS  │ │
 │  │2.1 │ selectTools()              │   95%│   56%│   56%│ ? N/T   │ │
 │  │2.2 │ selectToolsLite()          │   90%│   56%│   56%│ ? N/T   │ │
 │  │2.25│ Dockerfile builds          │  100%│  100%│  100%│ ✓ PASS  │ │
 │  │2.26│ K8s 2 pods Running         │  100%│    0%│    0%│ ✗ FAIL  │ │
 │  │4.3 │ Governance middleware      │   90%│    0%│    0%│ ✗ FAIL  │ │
 │  │4.6 │ minTier enforcement        │   95%│    0%│    0%│ ✗ FAIL  │ │
 │  │4.7 │ Rate limit enforcement     │   90%│    0%│    0%│ ✗ FAIL  │ │
 │  │... │ ...                        │   ...│   ...│   ...│  ...    │ │
 │  └────┴────────────────────────────┴──────┴──────┴──────┴─────────┘ │
 │                                                                      │
 │  GRADE DISTRIBUTION:                                                │
 │  ✓ PASS: 12  ~ PARTIAL: 5  ✗ FAIL: 18  ? NOT_TESTED: 34  ◊ GAP: 4 │
 │  ████████░░░░░████████████████████████████████████████████████░░░░   │
 │                                                                      │
 └──────────────────────────────────────────────────────────────────────┘

9.2 Fidelity Trend Dashboard

 ┌──────────────────────────────────────────────────────────────────────┐
 │  NEXUS FIDELITY --- Organization Trends            Last 90 Days  ▼   │
 ├──────────────────────────────────────────────────────────────────────┤
 │                                                                      │
 │  KEY METRICS                                                        │
 │  ┌───────────────┐ ┌───────────────┐ ┌───────────────┐ ┌─────────┐ │
 │  │ Avg Gap       │ │ Sessions      │ │ Overconfidence│ │  ECE    │ │
 │  │               │ │               │ │ Rate          │ │         │ │
 │  │  18.3  ↓ -4.2 │ │    47         │ │  72%   ↓ -6%  │ │  0.31   │ │
 │  │  ████████████ │ │               │ │  ████████████ │ │ ████    │ │
 │  │  Improving    │ │  This quarter │ │  Improving    │ │ Moderate│ │
 │  └───────────────┘ └───────────────┘ └───────────────┘ └─────────┘ │
 │                                                                      │
 │  FIDELITY GAP TREND (12 weeks)                                      │
 │  50 ┬────────────────────────────────────────────────────           │
 │     │  ●                                                            │
 │  40 │   ╲                                                           │
 │     │    ●                                                          │
 │  30 │     ╲──●                             ─── Avg Fidelity Gap    │
 │     │          ╲──●                        - - Target (<15)        │
 │  20 │               ●──●                                           │
 │     │                    ╲──●                                       │
 │  15 │─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ╲──● ─ ─ ─ TARGET ─ ─ ─ ─ ─ ─ ─ ─  │
 │  10 │                             ╲──●──●                           │
 │     │                                                               │
 │   0 ┴──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──                          │
 │     W1  W2 W3 W4 W5 W6 W7 W8 W9 W10W11W12                        │
 │                                                                      │
 │  PER-MODEL BREAKDOWN                                                │
 │  ┌────────────────────────────────────────────────────────────────┐ │
 │  │ Opus      ██████████░░░░░░░░░░░░░░░░░░░░░░░░  12.4  (best)   │ │
 │  │ Sonnet    ██████████████████░░░░░░░░░░░░░░░░░  19.8           │ │
 │  │ Haiku     █████████████████████████░░░░░░░░░░  24.1  (worst)  │ │
 │  └────────────────────────────────────────────────────────────────┘ │
 │                                                                      │
 │  PER-PLUGIN BREAKDOWN                                               │
 │  ┌────────────────────────────────────────────────────────────────┐ │
 │  │ NexusQA     ████████████████████░░░░░░░░░░░░  22.3            │ │
 │  │ ProseCreator████████████░░░░░░░░░░░░░░░░░░░░  14.7            │ │
 │  │ NexusROS    █████████████████░░░░░░░░░░░░░░░  18.9            │ │
 │  │ TSE         ██████████████████████████████████  36.2            │ │
 │  └────────────────────────────────────────────────────────────────┘ │
 │                                                                      │
 └──────────────────────────────────────────────────────────────────────┘

9.3 Evidence Detail View

 ┌──────────────────────────────────────────────────────────────────────┐
 │  DELIVERABLE 4.3 --- Governance Middleware Enforcement                │
 ├──────────────────────────────────────────────────────────────────────┤
 │                                                                      │
 │  SCORES                                    GRADE: ✗ FAIL            │
 │  ┌───────────┐ ┌───────────┐ ┌───────────┐                         │
 │  │ Proposed  │ │ Implement │ │ Deployed  │ Gap: 90 points          │
 │  │   90%     │ │    0%     │ │    0%     │ ⚠ CRITICAL              │
 │  └───────────┘ └───────────┘ └───────────┘                         │
 │                                                                      │
 │  EVIDENCE REQUIRED:                                                 │
 │  POST /select with tier "open_source" against ProseCreator          │
 │  (minTier "starter") → expect 403 TIER_INSUFFICIENT                │
 │                                                                      │
 │  EVIDENCE COLLECTED: ∅ NONE                                         │
 │  ┌────────────────────────────────────────────────────────────────┐ │
 │  │  ⚠ No evidence has been collected for this deliverable.       │ │
 │  │                                                                │ │
 │  │  Root cause analysis:                                          │ │
 │  │  • File: services/nexus-tse/src/middleware/governance.ts       │ │
 │  │  • Status: FILE DOES NOT EXIST                                 │ │
 │  │  • The agent claimed this as "90% complete" but the entire     │ │
 │  │    middleware/ directory is empty.                              │ │
 │  │                                                                │ │
 │  │  Code grep evidence:                                           │ │
 │  │  $ ls services/nexus-tse/src/middleware/                       │ │
 │  │  (empty directory)                                             │ │
 │  │                                                                │ │
 │  │  Evidence hash: e3b0c44298fc... (SHA-256 of empty result)      │ │
 │  └────────────────────────────────────────────────────────────────┘ │
 │                                                                      │
 │  HASH CHAIN:                                                        │
 │  This checkpoint: 7c2d4f8a...                                       │
 │  Previous: a3f8b1c2...                                              │
 │  Chain verification: ✓ VALID                                        │
 │                                                                      │
 └──────────────────────────────────────────────────────────────────────┘

9.4 Calibration Plot

 ┌──────────────────────────────────────────────────────────────────────┐
 │  CALIBRATION PLOT --- Proposed vs. Deployed Scores                    │
 ├──────────────────────────────────────────────────────────────────────┤
 │                                                                      │
 │  Deployed                                                           │
 │  Score (%)        ╱ Perfect                                         │
 │                  ╱  Calibration                                     │
 │  100 ┬─────────╱──────────────────────────────                      │
 │      │        ╱                                                     │
 │   80 │───────╱─────●                                                │
 │      │      ╱       ╲                                               │
 │   60 │─────╱─────────●          Bins: 73 deliverables               │
 │      │    ╱            ╲        grouped by proposed                 │
 │   40 │───╱──────────────●       score decile                        │
 │      │  ╱                 ╲                                         │
 │   20 │─╱───────────────────●────●                                   │
 │      │╱                                                             │
 │    0 ┴──┬──┬──┬──┬──┬──┬──┬──┬──┬──                                │
 │         0  10 20 30 40 50 60 70 80 90 100                           │
 │                  Proposed Score (%)                                  │
 │                                                                      │
 │  OVERCONFIDENCE ZONE: All points below the diagonal                 │
 │  78% of deliverables fall in this zone.                             │
 │                                                                      │
 │  ECE = 0.31 (moderate miscalibration)                               │
 │  Worst bin: 80-90% proposed → 22% deployed (58-point gap)           │
 │  Best bin: 100% proposed → 95% deployed (5-point gap, schema tasks) │
 │                                                                      │
 └──────────────────────────────────────────────────────────────────────┘

10. Regulatory Alignment

10.1 EU AI Act Article 12

Article 12 of the EU AI Act requires high-risk AI systems to maintain "logs generated automatically" that enable "tracing of the functioning of the system throughout its lifecycle" [9]. The retention period is set separately: under Article 26(6), deployers must keep these logs for "at least six months."

 EU AI ACT ARTICLE 12 --- AIFS COMPLIANCE MAPPING
 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

 REQUIREMENT                          AIFS IMPLEMENTATION
 ─────────────────────────           ─────────────────────────────────
 Automatic log generation     →      PostToolUse hooks capture data
                                     without manual intervention

 Lifecycle tracing            →      Checkpoints span Steps 1-16
                                     (plan → deploy validation)

 Minimum 6-month retention    →      Append-only PostgreSQL with
                                     no DELETE capability. Partition
                                     for efficient archival.

 Tamper evidence              →      SHA-256 hash chains. Any
                                     modification breaks the chain.
                                     Independent verification
                                     without vendor trust.

 Traceability                 →      Hash chain links every decision
                                     to its predecessor. Full causal
                                     chain reconstructable.
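
The tamper-evidence and traceability rows both rest on one property: each record's hash covers its own content plus its predecessor's hash, so a verifier holding only the records can recompute the chain. The following is a minimal sketch of that check, not the AIFS schema. The field names (`payload`, `prev_hash`, `hash`), the all-zero genesis value, and the canonical-JSON serialization are assumptions made for illustration.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed sentinel for the first record's predecessor

def record_hash(payload: dict, prev_hash: str) -> str:
    # Canonical JSON so the same payload always hashes identically.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()

def append_record(chain: list, payload: dict) -> None:
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({
        "payload": payload,
        "prev_hash": prev_hash,
        "hash": record_hash(payload, prev_hash),
    })

def first_broken_link(chain: list):
    """Return the index of the first invalid record, or None if intact."""
    prev_hash = GENESIS
    for i, rec in enumerate(chain):
        expected = record_hash(rec["payload"], prev_hash)
        if rec["prev_hash"] != prev_hash or rec["hash"] != expected:
            return i
        prev_hash = rec["hash"]
    return None
```

Editing any stored payload changes its recomputed hash, so `first_broken_link` reports the altered record without needing access to the database that produced it.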

10.2 NIST AI Risk Management Framework

 NIST AI RMF --- AIFS IMPLEMENTATION
 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

 GOVERN 1.1  Policies for AI risk    Gate hooks block deployment
             management               without fidelity data

 MEASURE 2.6 Performance tracked     Fidelity gap KPI with
             against metrics          weekly improvement rate

 MEASURE 2.7 Performance evaluated   Three-score tracking with
             against operational      production evidence
             requirements             collection

11. Related Work

11.1 AI Code Verification Tools

Tool	What It Does	What's Missing
SonarQube AI Code Assurance [3]	Detects AI-generated code quality issues	Does not track plan-to-deployment fidelity
Allure TestOps	Attaches evidence to test results	No scorecard aggregation or hash chain
Arize Phoenix [4]	Records LLM execution traces	Observability only --- no fidelity scoring
PDCA Framework [10]	Structures AI coding into Plan-Do-Check-Act	No immutable storage or evidence grading
Spec Kit	Spec-driven development automation	No deployment verification or gap tracking

11.2 Immutable Audit Trails

System	Approach	Difference from AIFS
GuardianChain [11]	SHA-256 + blockchain anchor	Records what agents did, not what they claimed
Prefactor [12]	CI/CD-native immutable IDs	No three-score tracking or calibration metrics
Standard audit logs	Append-only databases	No content-addressable hashing or chain verification

11.3 Knowledge Graphs for QA

GraphRAG-Bench (ICLR 2026) [13] evaluates retrieval-augmented generation with graph traversal. To our knowledge, no commercial product stores QA metrics in a knowledge graph at the granularity AIFS requires. NexusQA's existing 10 entity types, with typed relationships for regression prediction, provide the foundation that AIFS extends with the fidelity domain.

12. Discussion & Future Work

12.1 Limitations

Single-project calibration. The 56% ceiling derives from one project's 73 deliverables. While consistent with independent studies, organizations should calibrate against their own data.

Self-reported proposed scores. The proposed score is reported by the AI agent. We mitigate this by capturing the plan hash before implementation: the plan is locked, and the proposed score is derived from the deliverable list rather than from a separate confidence estimate.

Granularity sensitivity. A coarse scorecard (28 items) masks problems a granular one (73 items) reveals. AIFS does not prescribe granularity.

12.2 Future Directions

Adaptive ceiling. Learn per-task-type ceilings from accumulated data. Schema migrations might warrant 90% (deterministic SQL); complex UI components might warrant 40%.
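
One simple way to learn such ceilings is sketched below, purely as an assumption about how this might work: take the mean deployed score observed for each task type, and fall back to the global 56% ceiling when a type has too few samples. The function names, the `min_samples` threshold, and the choice of the mean as the estimator are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

DEFAULT_CEILING = 56.0  # global ceiling from the paper, used as the fallback

def learn_ceilings(history, min_samples=5):
    """history: iterable of (task_type, deployed_score_percent)."""
    by_type = defaultdict(list)
    for task_type, deployed in history:
        by_type[task_type].append(deployed)
    # Only trust a per-type ceiling once enough deliverables back it.
    return {
        task_type: mean(scores)
        for task_type, scores in by_type.items()
        if len(scores) >= min_samples
    }

def capped_score(claimed, task_type, ceilings):
    """Cap an unverified claim at the learned or default ceiling."""
    return min(claimed, ceilings.get(task_type, DEFAULT_CEILING))
```

The sample-size fallback matters here: without it, a task type with one lucky deliverable would earn a ceiling the data cannot support.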

Pre-implementation fidelity prediction. Given a plan and historical data, predict expected gap before implementation. Gate whether the AI should attempt the task autonomously.
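
As a rough illustration of the gating idea, the sketch below predicts the gap for a task type as the mean historical difference between proposed and deployed scores, then allows autonomous execution only when that prediction is within a threshold. The 20-point threshold, the function names, and the use of a plain mean are assumptions. A real predictor would likely use richer features of the plan.

```python
from statistics import mean

def predict_gap(task_type, history):
    """history: list of (task_type, proposed, deployed), scores in percent.

    Returns the mean historical gap for the task type, or None if the
    type has never been seen.
    """
    gaps = [proposed - deployed for t, proposed, deployed in history if t == task_type]
    return mean(gaps) if gaps else None

def may_run_autonomously(task_type, history, max_gap=20.0):
    gap = predict_gap(task_type, history)
    # With no history there is nothing to predict from, so require review.
    return gap is not None and gap <= max_gap
```

Defaulting unseen task types to human review keeps the gate conservative while the historical record is still thin.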

Cross-agent benchmarking. Deploy AIFS across Claude Code, Cursor, Copilot, and Devin on identical tasks to produce the first standardized fidelity benchmark.

IDE integration. Real-time fidelity scores displayed in VSCode/JetBrains as the AI agent works --- a live confidence indicator that updates with each checkpoint.

13. References

[1] SonarSource, "State of Code Developer Survey Report 2026," 2026. https://www.sonarsource.com/state-of-code-developer-survey-report.pdf

[2] J. Becker, N. Rush, E. Barnes, D. Rein, "Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity," METR, Jul. 2025. arXiv:2507.09089.

[3] SonarSource, "Agentic Analysis --- Verify AI Code as It Is Generated," 2026. https://docs.sonarsource.com/sonarqube-cloud/analyzing-source-code/agentic-analysis

[4] Arize AI, "Phoenix: AI Observability & Evaluation," GitHub, 2025. https://github.com/Arize-ai/phoenix

[5] S. Ghosh, M. Panday, "The Dunning-Kruger Effect in Large Language Models: An Empirical Study of Confidence Calibration," arXiv:2603.09985, Feb. 2026.

[6] "SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?" arXiv:2509.16941, Sep. 2025.

[7] CodeRabbit, "State of AI vs Human Code Generation Report," Dec. 2025. https://www.coderabbit.ai/whitepapers/state-of-AI-vs-human-code-generation-report

[8] Faros AI, "The AI Productivity Paradox Research Report," 2026. https://www.faros.ai/ai-productivity-paradox

[9] European Parliament, "Regulation (EU) 2024/1689 --- Artificial Intelligence Act," Article 12. https://artificialintelligenceact.eu/article/12/

[10] K. Judy, "A Plan-Do-Check-Act Framework for AI Code Generation," InfoQ, Oct. 2025. https://www.infoq.com/articles/PDCA-AI-code-generation/

[11] T. A. Cronin, "How to Create Immutable Audit Trails for AI Agents," DEV Community, 2026. https://dev.to/guardianchain/how-to-create-immutable-audit-trails-for-ai-agents-5boc

[12] Prefactor, "Audit Trails in CI/CD: Best Practices for AI Agents," 2026. https://prefactor.tech/blog/audit-trails-in-ci-cd-best-practices-for-ai-agents

[13] "When to use Graphs in RAG: A Comprehensive Analysis for Graph Retrieval-Augmented Generation," arXiv:2506.05690, ICLR 2026.

[14] C. Spiess et al., "Calibration and Correctness of Language Models for Code," ICSE 2025. doi:10.1109/ICSE55347.2025.00040.

[15] S. Li et al., "A Survey on the Honesty of Large Language Models," arXiv:2409.18786. Accepted TMLR 2025.

[16] P. Chhikara, "Mind the Confidence Gap: Overconfidence, Calibration, and Distractor Effects in LLMs," TMLR, Feb. 2025. arXiv:2502.11028.

[17] T. Pan, "Agentic Coding in Production: What SWE-bench Scores Don't Tell You," Apr. 2026. https://tianpan.co/blog/2026-04-09-agentic-coding-production-swebench-gap

[18] NIST, "Artificial Intelligence Risk Management Framework (AI RMF 1.0)," NIST AI 100-1, Jan. 2023. https://nvlpubs.nist.gov/nistpubs/ai/nist.ai.100-1.pdf

---

Published by Adverant Research. NexusQA Platform --- nexusqa.ai

This document is available by direct link only and is not indexed by search engines.

Keywords

AI fidelity, implementation gap, code verification, immutable scorecard, NexusQA