Session Health Monitor

Deliverable Brief & Integration Note · companion to design spec session-health-monitor-spec-v2.md

0 · Document control

DOCUMENT
session_health_monitor_deliverable_brief
VERSION
2.0
DATE
2026-07-01
STATUS
For Steve's review
OWNER
Steve Burson
CONSULTANT
Mike · system consultant
SOURCE SPEC
session-health-monitor-spec-v2.md
REVIEW
Codex — reviewed 2026-07-01
STANDARDS APPLIED
arc42 (brief structure) · C4 (topology diagram) · Diátaxis (reference tables kept separate from explanation)

Revision history. v2.0 — updated to design spec v2.0: externalized two-tier judge with loosened escalation triggers, incident-clustered scoring with five fabrication sub-types and cause-classified tool errors, hybrid global/rolling/per-project baseline, empirically-verified display behavior on the desktop app, and the tamper-evident (not tamper-resistant) framing for the Windows ledger. v1.0 — first deliverable brief for the Session Health Monitor: problem, goal, integration into the Foundry, the measured signals, overhead, honest limits, privacy, and the pre-build sign-off items. Content is faithful to design spec v2.0; nothing is built yet.

How to read this document

This is the deliverable brief — the plain-English companion that travels with the tool. It answers, in order: what problem it solves, what it is and is not, exactly how it plugs into your system, what it measures and at what cost, and where its limits are. The full engineering detail lives in the source spec; this document is the summary-first layer on top of it. Reference material (the signal table, the open-items list) is kept in tables; the reasoning sits in the prose around them.

One honest frame up front. "Session Health" is a single 0–100 score = the inverse of operational risk (100 = healthy, it counts down as risk rises). It reflects observed operational risk — not literal model intelligence. It improves how sessions are run (when to compact, when to start fresh, which workflows to avoid), not the model's raw capability. Everything below is calibrated to that claim, and no larger one.

1 · Summary

The Session Health Monitor is a Claude Code skill that watches an agent session in real time for cognitive degradation — fabrication, repeated errors, and thrash — and reports a live Session Health score. It persists a per-session scorecard, and mines cross-session trends into concrete operating-discipline recommendations.

Watches

Deterministic signals every turn (zero agent judgment); semantic checks — fabrication, quiet-wrongness — run on escalation via a fresh-context judge reading structured evidence, never the agent's own prose.

Scores

One 0–100 number = inverse of operational risk, incident-clustered so one failure can't multi-count, always with the top contributing incidents shown.

Escalates

A two-tier judge: cheap deterministic checks every turn, escalating to a cross-family judge (OpenAI Codex, gpt-5.5) on loosened triggers — not gated on high-stakes.

Learns

Scorecards accrue; once minimum sample sizes are met, trends become "when to compact / start fresh / avoid workflow X," calibrated separately per global/user/project.

2 · The problem being solved

This is the same failure mode that drives the Foundry — but measured live. Across sessions, a Builder agent will occasionally fabricate (claim a file was written when it was not, claim work done that was not), repeat the same error, thrash on the same files, or drift as context fills. Today there is no signal at the moment it happens. You find out after the fact, by reading output that turned out to be wrong.

The monitor closes that gap: it turns "the session felt off" into a number you can see while the session is still running, and into a record you can review afterward. Critically, it does this without asking the agent to grade itself — the same principle your audit layer already rests on.

3 · Goal & scope

In scope

  • Compute a live 0–100 Session Health score from the session transcript
  • Display it via the channels each environment actually supports (see §§6–7) — text first-line, CLI status bar, OS notification for criticals
  • Persist per-session scorecards to a local, tamper-evident store
  • Turn cross-session trends into operating-discipline recommendations
  • Cross-platform: Windows-primary, macOS supported; runtime auto-detected at install

Out of scope

  • Improving the model itself — this measures risk, it does not raise capability
  • Gating or blocking the build — it is advisory, never an enforcement wall
  • A compliance / court / audit-of-record instrument — it is an operating aid
  • Cross-party data handling — it is a same-user, same-machine tool

4 · How it integrates into your system

It is the "you don't audit yourself" principle, applied at the session level.

The monitor was shaped to your architecture rather than bolted on. Three deliberate design choices line up with the Foundry:

Externalized scoring

The score is computed outside the monitored agent: deterministic signals from a hook/script reading the session transcript, semantic signals from a fresh-context subagent reading structured evidence it did not assemble itself. Your Node C rule, at session granularity.

Cross-lineage judge

Escalations go to OpenAI Codex (gpt-5.5) — a different family from the subject, so it does not share the subject's blind spots. Same lineage-separation logic as your audit layer.

Chained ledger

An append-only, SHA-256-chained JSONL ledger is the source of truth; SQLite is a rebuildable projection. This mirrors the Foundry integrity backbone.

Session Health Monitor — Integration Topology Externalized scoring · the monitored agent never grades itself · cross-lineage judge on escalation DWG session_health_topology REV 2.0 DATE 2026-07-01 STATUS design DOC session_health_monitor_deliverable_brief_v2.html session boundary scoring runs outside the agent MONITORED SESSION Claude Code · Anthropic lineage pure build — no self-scoring Persists as it works: • transcript · tool results • diffs · exit codes · file checks at most, transcribes a number the external layer already computed evidence HOOK / SCRIPT LAYER trusted · external · never the agent Computes deterministic signals every turn: • context % · repeat errors · tool-error cause • rework/churn · compactions · denial-reissue Assembles the evidence bundle (not the agent — avoids circularity) Precomputes Health → ~/.claude/session-health/current on escalation only ESCALATION JUDGE OpenAI Codex · gpt-5.5 · cross-family "you don't audit yourself" Reads structured evidence → • fabrication verdict • may override the score triggers loosened — not gated on high-stakes append LOCAL STORE session-health-ledger.jsonl — append-only, SHA-256 chained authoritative source of truth (each line carries the prior line's hash) session-health.db — SQLite, derived scorecards + cross-session history rebuildable projection — verify chain on startup, quarantine broken suffixes ONE PRECOMPUTED SCORE ● Health 0–100 = inverse of risk every channel reads the same value DISPLAY CHANNELS Text first-line — advisory; only when score crosses a 10% floor / urgent flag not a reliable channel — a degraded agent could omit it CLI status bar — always-on bonus; cheap file read, no interpreter start desktop app does not render custom status bars (verified) — text line + OS notification cover it there OS notification — criticals; osascript / PowerShell toast fired from the hook, bypasses the agent — the can't-miss channel OPERATING-DISCIPLINE RECOMMENDATIONS Cross-session trends → "degrades past turn N", "tool X spikes errors" → when to compact / start fresh / which workflows to avoid emitted only after minimum sample sizes are met calibration kept separate: global · per-user · per-project (no overfitting) informs how sessions are run
Figure 1 · Integration topology. The monitored agent (left of the boundary) only builds. Everything that scores, stores, judges, and displays sits outside it. The purple loop is the cross-session feedback that turns history into operating discipline. Diagram address: session_health_topology rev 2.0.

5 · What it measures

The score is Health = 100 − risk, where risk is a weighted roll-up of incident-clustered signals — correlated events are grouped into one incident so a single failure does not multi-count. Every signal is normalized 0–1. A single score is displayed, but it is always computed from — and traceable back to — this incident-level breakdown.

SignalWeight*What it captures
Context pressure0.25Flat below ~60% util, steep ramp after; from context_window.used_percentage
Repeat errors / re-corrections0.20Same error or re-correction vs baseline
Fabrication (broadened)0.155 sub-types (below) — not just "file doesn't exist"
Tool-error rate (cause-classified)0.10Only cognitive causes count — infra flakiness excluded
Rework / thrash0.10Edit reverts, re-touched files, churn
Compactions (paired only)0.10Counts only when paired with lost context / re-asking — a bare compaction is healthy hygiene
Verification gap0.07"Done / verified" with no supporting evidence
Denial-reissue0.03Verbatim retry of a denied or failed call

provisional *Weights are provisional until calibration. They are labeled as such until the cross-session store calibrates them against real labeled incidents. They will not be presented as precise before then.

Fabrication → five sub-signals

(a) nonexistent artifact (file / symbol / flag) · (b) unsupported completion claim · (c) claim contradicted by tool output · (d) stale / unsourced external fact · (e) user-corrected false statement. Only a hard contradiction counts as "confirmed fabrication"; the rest score as lower-weight "unsupported-claim risk."

Tool errors are classified by cause

policy-denied · syntax · missing-dep · network · permission · timeout · test-failure · unexpected-output. Only cognitive causes degrade the score — a flaky network or a denied permission must never read as "the model got dumb."

Inspectability is required — the score is never a black box. The scorecard always exposes the top contributing incidents ("why it moved"), so you can tell real degradation from noisy telemetry. Baseline is hybrid: fixed global thresholds + a rolling session baseline + a per-project historical baseline (not just the first ~10 calls, which is too fragile).

6 · Overhead & impact

Designed so the thing that runs constantly costs almost nothing.

Hot path (status bar)The ~300ms status-bar cadence must not start Python. The hook precomputes the score into a tiny file (~/.claude/session-health/current); the status line is a cheap file read (cat / type) — zero interpreter startup at cadence. This is a CLI-only path; the desktop app does not render custom status bars.
Two-tier judgeFast path: deterministic signals every turn, no model call. Judge path: the cross-family Codex (gpt-5.5) call fires on a loosened set of triggers — any urgent flag, a sharp score drop, repeated user corrections, an unsupported completion claim, repeated same-file churn, or random sampling. High-stakes context raises priority but is never a required gate, since a degraded agent may misjudge what counts as high-stakes. Model cost is incurred only when one of these actually trips.
Core runtimePython 3 (confirmed present on the target Windows box; Node is not guaranteed on PATH). OS-specific bits branch inside one codebase; runtime is auto-detected at install.
FootprintOne Python core + two thin installer stubs + a local JSONL/SQLite store. No always-on service; minimal footprint — Python 3 standard library plus the OS-native notification path.

7 · Failure modes & honest limits

Steve is a systems engineer, so this section is deliberately blunt — the limits are stated, not buried.

The desktop app does not render a status bar — verified empirically. Claude Code's desktop app does not support custom status lines, so the CLI status bar is a CLI-only channel; it does not show up there. On desktop, Session Health surfaces as the text first-line (only when the score crosses a 10% floor (100→90→80…) or an urgent flag trips — not every message) plus the OS notification for criticals. If a future desktop build adds status-line support, the bar auto-activates — no rebuild required.
The advisory text line is not a reliable alert. A degraded agent could omit or garble its own first-line status. That is why criticals fire an OS notification straight from the hook, bypassing the agent entirely — that is the can't-miss, agent-independent channel. The text line is treated as advisory display only.
Tamper-evident, not tamper-resistant (on Windows). The hash-chain detects edits to the ledger; it does not prevent deletion. An optional remote seal (periodically email/commit the ledger head-hash offsite) is available but is lower priority for a solo-user tool.
Weights are uncalibrated at ship. Until the cross-session store has labeled incidents to calibrate against, the score is directionally useful but not precise — and the tool says so rather than implying false precision.
Quiet wrongness (false negatives) is the hard case. An agent can be confidently wrong without tripping a deterministic signal. Mitigation is periodic challenge checks — requiring evidence for done-claims, sampling assertions, comparing final answers to actual tool output — plus random-sampling escalation. It reduces the risk; it does not eliminate it.

8 · Privacy note

This is a same-user, same-machine tool, so there is no cross-party leak. Two facts still warrant the minimal controls the tool ships with. First: this stays within Steve's own accounts — no cross-party sharing — but escalation does send recent session context to Steve's own OpenAI account (a different policy surface than Anthropic). Second: the ledger/DB are cleartext on disk.

These are lightweight controls for Steve's install; Steve can choose to keep or remove them before build.

9 · Make-vs-buy & dependencies

ComponentMake / BuyNotes
Scoring core, parsing, installer, hooksMakePython 3 standard library
Local storeMakeAppend-only JSONL + SQLite (both stdlib-reachable)
Escalation judgeBuyOpenAI Codex / gpt-5.5 — your existing second-opinion tooling
NotificationsMakeosascript (macOS) / PowerShell toast (Windows) — OS-native
Build methodAssembled via the custom agent-builder skill

10 · Install & staged rollout

Packaging. One Python core + two thin installer stubs (.command for macOS, .bat for Windows). Roughly a 2-click install via an emailed link that downloads the installer; it lays down the skill, the hooks (score computation + OS-notification), the status-line script, and the local store, auto-detecting the environment.

Install honesty. True silent link-install is not safe or possible — unsigned stubs trip SmartScreen (Windows) / Gatekeeper (macOS). The tool ships checksums and shows the exact install path, so you verify before you run.

Rollout rhythm follows the house pattern: recommend → agree → drill. This brief is the "recommend" step. Nothing installs until you sign off.

11 · Open items & sign-off

Items to close before build. The first three are closed during pre-build discovery where possible; the installer also auto-detects and confirms them at install time. The last two are your calls.

#ItemOwner
1Confirm Python minor version on Steve's box (Python 3 confirmed)auto-detect + confirm
2Claude Code config path on Windows (%APPDATA%\Claude vs ~/.claude)auto-detect + confirm
3Windows installer shell (Git Bash / WSL / PowerShell)auto-detect + confirm
4Conformance to your Foundry conventions — inventory done; confirm specificsSteve
5Keep or strip the lightweight privacy toggle (§8)Steve
What we need from you to proceed. A yes on the design as framed here, plus your calls on items 4 and 5. On sign-off, this moves from "recommend" to the implementation drill-down — built via agent-builder, with pre-send review complete.
Session Health Monitor · Deliverable Brief v2.0 · faithful to design spec v2.0 · prepared by Mike, system consultant · 2026-07-01. Nothing in this document is built yet.