Session Health Monitor
Deliverable Brief & Integration Note · companion to design spec session-health-monitor-spec-v2.md
0 · Document control
session_health_monitor_deliverable_brief
2.0
2026-07-01
For Steve's review
Steve Burson
Mike · system consultant
session-health-monitor-spec-v2.md
Codex — reviewed 2026-07-01
arc42 (brief structure) · C4 (topology diagram) · Diátaxis (reference tables kept separate from explanation)
Revision history.
v2.0. Updated to design spec v2.0: externalized two-tier judge with loosened escalation triggers, incident-clustered scoring with five fabrication sub-types and cause-classified tool errors, hybrid global/rolling/per-project baseline, empirically verified display behavior on the desktop app, and the tamper-evident (not tamper-resistant) framing for the Windows ledger.
v1.0. First deliverable brief for the Session Health Monitor: problem, goal, integration into the Foundry, the measured signals, overhead, honest limits, privacy, and the pre-build sign-off items.
Content is faithful to design spec v2.0; nothing is built yet.
How to read this document
This is the deliverable brief — the plain-English companion that travels with the tool. It answers, in order: what problem it solves, what it is and is not, exactly how it plugs into your system, what it measures and at what cost, and where its limits are. The full engineering detail lives in the source spec; this document is the summary-first layer on top of it. Reference material (the signal table, the open-items list) is kept in tables; the reasoning sits in the prose around them.
1 · Summary
The Session Health Monitor is a Claude Code skill that watches an agent session in real time for cognitive degradation (fabrication, repeated errors, and thrash) and reports a live Session Health score. It persists a per-session scorecard and mines cross-session trends into concrete operating-discipline recommendations.
Watches
Deterministic signals every turn (zero agent judgment); semantic checks — fabrication, quiet-wrongness — run on escalation via a fresh-context judge reading structured evidence, never the agent's own prose.
Scores
One 0–100 number, computed as 100 minus operational risk; incident-clustered so one failure can't multi-count, and always shown with the top contributing incidents.
Escalates
A two-tier judge: cheap deterministic checks every turn, escalating to a cross-family judge (OpenAI Codex, gpt-5.5) on loosened triggers — not gated on high-stakes.
Learns
Scorecards accrue; once minimum sample sizes are met, trends become "when to compact / start fresh / avoid workflow X," calibrated separately per global/user/project.
2 · The problem being solved
This is the same failure mode that drives the Foundry — but measured live. Across sessions, a Builder agent will occasionally fabricate (claim a file was written when it was not, claim work done that was not), repeat the same error, thrash on the same files, or drift as context fills. Today there is no signal at the moment it happens. You find out after the fact, by reading output that turned out to be wrong.
The monitor closes that gap: it turns "the session felt off" into a number you can see while the session is still running, and into a record you can review afterward. Critically, it does this without asking the agent to grade itself — the same principle your audit layer already rests on.
3 · Goal & scope
In scope
- Compute a live 0–100 Session Health score from the session transcript
- Display it via the channels each environment actually supports (see §§6–7) — text first-line, CLI status bar, OS notification for criticals
- Persist per-session scorecards to a local, tamper-evident store
- Turn cross-session trends into operating-discipline recommendations
- Cross-platform: Windows-primary, macOS supported; runtime auto-detected at install
Out of scope
- Improving the model itself — this measures risk, it does not raise capability
- Gating or blocking the build — it is advisory, never an enforcement wall
- A compliance / court / audit-of-record instrument — it is an operating aid
- Cross-party data handling — it is a same-user, same-machine tool
4 · How it integrates into your system
It is the "you don't audit yourself" principle, applied at the session level.
The monitor was shaped to your architecture rather than bolted on. Three deliberate design choices line up with the Foundry:
Externalized scoring
The score is computed outside the monitored agent: deterministic signals from a hook/script reading the session transcript, semantic signals from a fresh-context subagent reading structured evidence it did not assemble itself. Your Node C rule, at session granularity.
Cross-lineage judge
Escalations go to OpenAI Codex (gpt-5.5) — a different family from the subject, so it does not share the subject's blind spots. Same lineage-separation logic as your audit layer.
Chained ledger
An append-only, SHA-256-chained JSONL ledger is the source of truth; SQLite is a rebuildable projection. This mirrors the Foundry integrity backbone.
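A minimal sketch of how a SHA-256-chained, append-only JSONL ledger can work, assuming each record stores the previous record's hash. The field names (`prev`, `payload`, `hash`), the all-zero genesis value, and the function names are illustrative assumptions, not the spec's actual record format.

```python
import hashlib
import json
from pathlib import Path

GENESIS = "0" * 64  # assumed placeholder "previous hash" for the first record


def _digest(prev_hash: str, payload: dict) -> str:
    """Hash the previous hash together with a canonical JSON form of the payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_record(ledger: Path, payload: dict) -> str:
    """Append one record whose hash chains to the last line already in the file."""
    prev_hash = GENESIS
    if ledger.exists():
        lines = ledger.read_text(encoding="utf-8").splitlines()
        if lines:
            prev_hash = json.loads(lines[-1])["hash"]
    record = {"prev": prev_hash, "payload": payload, "hash": _digest(prev_hash, payload)}
    with ledger.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")
    return record["hash"]


def verify(ledger: Path) -> bool:
    """Recompute every hash in order; any edited or reordered record breaks the chain."""
    prev_hash = GENESIS
    for line in ledger.read_text(encoding="utf-8").splitlines():
        record = json.loads(line)
        if record["prev"] != prev_hash:
            return False
        if record["hash"] != _digest(prev_hash, record["payload"]):
            return False
        prev_hash = record["hash"]
    return True
```

A chain like this detects edits made after the fact but does not prevent them, which matches the tamper-evident (not tamper-resistant) framing. The SQLite projection could be rebuilt at any time by replaying the verified lines.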
session_health_topology rev 2.0
5 · What it measures
The score is Health = 100 − risk, where risk is a weighted roll-up of incident-clustered signals — correlated events are grouped into one incident so a single failure does not multi-count. Every signal is normalized 0–1. A single score is displayed, but it is always computed from — and traceable back to — this incident-level breakdown.
| Signal | Weight* | What it captures |
|---|---|---|
| Context pressure | 0.25 | Flat below ~60% util, steep ramp after; from context_window.used_percentage |
| Repeat errors / re-corrections | 0.20 | Same error or re-correction vs baseline |
| Fabrication (broadened) | 0.15 | 5 sub-types (below) — not just "file doesn't exist" |
| Tool-error rate (cause-classified) | 0.10 | Only cognitive causes count — infra flakiness excluded |
| Rework / thrash | 0.10 | Edit reverts, re-touched files, churn |
| Compactions (paired only) | 0.10 | Counts only when paired with lost context / re-asking — a bare compaction is healthy hygiene |
| Verification gap | 0.07 | "Done / verified" with no supporting evidence |
| Denial-reissue | 0.03 | Verbatim retry of a denied or failed call |
*Weights are provisional. They stay labeled as such until the cross-session store calibrates them against real labeled incidents, and they will not be presented as precise before then.
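The roll-up in the table can be sketched as follows, using the provisional weights exactly as listed (they sum to 1.00). The dictionary keys, the function names, and the linear shape of the context-pressure ramp above the ~60% knee are assumptions for illustration; the brief only says the curve is flat below the knee and steep after it. Incident clustering is assumed to have happened already, so each signal arrives as one normalized 0–1 value.

```python
# Provisional weights copied from the signal table; keys are hypothetical names.
WEIGHTS = {
    "context_pressure": 0.25,
    "repeat_errors": 0.20,
    "fabrication": 0.15,
    "tool_error_rate": 0.10,
    "rework_thrash": 0.10,
    "paired_compactions": 0.10,
    "verification_gap": 0.07,
    "denial_reissue": 0.03,
}


def context_pressure(used_percentage: float, knee: float = 60.0) -> float:
    """Zero at or below the knee, then a linear ramp to 1.0 at 100% (assumed shape)."""
    if used_percentage <= knee:
        return 0.0
    return min(1.0, (used_percentage - knee) / (100.0 - knee))


def health_score(signals: dict) -> float:
    """Health = 100 - risk, where risk is the weighted sum of 0-1 signals scaled to 0-100."""
    risk = 0.0
    for name, weight in WEIGHTS.items():
        value = min(1.0, max(0.0, signals.get(name, 0.0)))  # clamp to the 0-1 range
        risk += weight * value
    return round(100.0 * (1.0 - risk), 1)


# Example: context window at 80% used, plus one confirmed fabrication incident.
example = health_score({
    "context_pressure": context_pressure(80.0),
    "fabrication": 1.0,
})
```

Because the weights sum to 1.00, a session with every signal at its maximum scores 0 and a session with no signals scores 100.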
Fabrication → five sub-signals
(a) nonexistent artifact (file / symbol / flag) · (b) unsupported completion claim · (c) claim contradicted by tool output · (d) stale / unsourced external fact · (e) user-corrected false statement. Only a hard contradiction counts as "confirmed fabrication"; the rest score as lower-weight "unsupported-claim risk."
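One way to express the confirmed-versus-unsupported split, as a sketch. The five sub-types come from the list above; the `hard_contradiction` flag, the class names, and the 0.4 weight for unsupported-claim risk are assumptions, since the brief does not give the lower weight or say how a hard contradiction is detected.

```python
from dataclasses import dataclass
from enum import Enum


class SubType(Enum):
    NONEXISTENT_ARTIFACT = "a"
    UNSUPPORTED_COMPLETION_CLAIM = "b"
    CONTRADICTED_BY_TOOL_OUTPUT = "c"
    STALE_OR_UNSOURCED_FACT = "d"
    USER_CORRECTED_STATEMENT = "e"


CONFIRMED_WEIGHT = 1.0
UNSUPPORTED_WEIGHT = 0.4  # assumed value; the brief only says "lower-weight"


@dataclass(frozen=True)
class Claim:
    sub_type: SubType
    hard_contradiction: bool  # assumed flag: evidence directly contradicts the claim


def label(claim: Claim) -> str:
    """Only a hard contradiction is labeled confirmed fabrication."""
    if claim.hard_contradiction:
        return "confirmed fabrication"
    return "unsupported-claim risk"


def fabrication_signal(claims: list) -> float:
    """Normalized 0-1 signal: the strongest single claim, so one incident is counted once."""
    weights = [
        CONFIRMED_WEIGHT if c.hard_contradiction else UNSUPPORTED_WEIGHT
        for c in claims
    ]
    return max(weights, default=0.0)
```

Taking the maximum rather than a sum is a design choice made here to stay consistent with incident clustering; the spec may aggregate differently.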
Tool errors are classified by cause
policy-denied · syntax · missing-dep · network · permission · timeout · test-failure · unexpected-output. Only cognitive causes degrade the score — a flaky network or a denied permission must never read as "the model got dumb."
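A sketch of cause-filtered error counting. The eight cause labels are taken from the list above. Which of them count as cognitive is an assumption here: the brief only states that network flakiness and denied permissions must not count, so the `COGNITIVE_CAUSES` set below is a hypothetical split, not the spec's.

```python
from collections import Counter

# The eight causes named in the brief.
CAUSES = {
    "policy-denied", "syntax", "missing-dep", "network",
    "permission", "timeout", "test-failure", "unexpected-output",
}

# ASSUMPTION: treated as cognitive for this sketch; the real split is set by the spec.
COGNITIVE_CAUSES = {"syntax", "test-failure", "unexpected-output"}


def cognitive_error_rate(error_causes: list, total_tool_calls: int) -> float:
    """Share of tool calls that failed for a cognitive cause, clamped to 0-1."""
    unknown = set(error_causes) - CAUSES
    if unknown:
        raise ValueError(f"unclassified causes: {sorted(unknown)}")
    if total_tool_calls <= 0:
        return 0.0
    counts = Counter(error_causes)
    cognitive = sum(counts[cause] for cause in COGNITIVE_CAUSES)
    return min(1.0, cognitive / total_tool_calls)
```

With this filter, ten calls that hit three network or permission failures still produce a rate of zero, so infrastructure trouble does not lower the score.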
6 · Overhead & impact
Designed so the thing that runs constantly costs almost nothing.
| Area | Design |
|---|---|
| Hot path (status bar) | The ~300ms status-bar cadence must not start Python. The hook precomputes the score into a tiny file (~/.claude/session-health/current); the status line is a cheap file read (cat / type), so there is zero interpreter startup at cadence. This is a CLI-only path; the desktop app does not render custom status bars. |
| Two-tier judge | Fast path: deterministic signals every turn, no model call. Judge path: the cross-family Codex (gpt-5.5) call fires on a loosened set of triggers — any urgent flag, a sharp score drop, repeated user corrections, an unsupported completion claim, repeated same-file churn, or random sampling. High-stakes context raises priority but is never a required gate, since a degraded agent may misjudge what counts as high-stakes. Model cost is incurred only when one of these actually trips. |
| Core runtime | Python 3 (confirmed present on the target Windows box; Node is not guaranteed on PATH). OS-specific bits branch inside one codebase; runtime is auto-detected at install. |
| Footprint | One Python core + two thin installer stubs + a local JSONL/SQLite store. No always-on service; minimal footprint — Python 3 standard library plus the OS-native notification path. |
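The hot path and the escalation decision from the table above can be sketched together. The trigger list follows the brief; every threshold, the 5% sampling rate, and the names `TurnState`, `should_escalate`, and `publish_score` are assumptions for illustration. Note that high-stakes context is deliberately not an input to the gate.

```python
import os
import random
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class TurnState:
    urgent_flag: bool = False
    score_drop: float = 0.0            # points lost since the previous turn
    user_corrections: int = 0          # recent corrections from the user
    unsupported_completion_claim: bool = False
    same_file_edits: int = 0           # repeated edits to one file


def should_escalate(state: TurnState, *, drop_threshold: float = 15.0,
                    correction_threshold: int = 2, churn_threshold: int = 3,
                    sample_rate: float = 0.05, rng=random.random) -> bool:
    """Any single trigger is enough; thresholds are assumed, not from the spec."""
    if state.urgent_flag or state.unsupported_completion_claim:
        return True
    if state.score_drop >= drop_threshold:
        return True
    if state.user_corrections >= correction_threshold:
        return True
    if state.same_file_edits >= churn_threshold:
        return True
    return rng() < sample_rate  # random sampling catches quiet failures


def publish_score(score: float, target: Path) -> None:
    """Write the score to a temp file, then swap it in so readers never see a partial write."""
    target.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=target.parent)
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        fh.write(f"{score:.0f}\n")
    os.replace(tmp_name, target)
```

The hook would call `publish_score` once per turn; the status line then only reads the file, so no interpreter starts at the status-bar cadence.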
7 · Failure modes & honest limits
Steve is a systems engineer, so this section is deliberately blunt — the limits are stated, not buried.
8 · Privacy note
This is a same-user, same-machine tool, so there is no cross-party leak. Two facts still warrant the minimal controls the tool ships with. First, escalation sends recent session context to Steve's own OpenAI account: nothing leaves Steve's own accounts, but OpenAI is a different policy surface from Anthropic. Second, the ledger and the SQLite database are stored in cleartext on disk.
- Escalation on/off toggle — default ON.
- One-line disclosure at install — "escalation sends recent session context to your OpenAI account."
- Retention / delete-history for the local store.
- Optional secret redaction on the escalation payload (API keys, tokens).
These are lightweight controls for Steve's install; Steve can choose to keep or remove them before build.
9 · Make-vs-buy & dependencies
| Component | Make / Buy | Notes |
|---|---|---|
| Scoring core, parsing, installer, hooks | Make | Python 3 standard library |
| Local store | Make | Append-only JSONL + SQLite (both stdlib-reachable) |
| Escalation judge | Buy | OpenAI Codex / gpt-5.5 — your existing second-opinion tooling |
| Notifications | Make | osascript (macOS) / PowerShell toast (Windows) — OS-native |
| Build method | — | Assembled via the custom agent-builder skill |
10 · Install & staged rollout
Packaging. One Python core + two thin installer stubs (.command for macOS, .bat for Windows). Roughly a 2-click install via an emailed link that downloads the installer; it lays down the skill, the hooks (score computation + OS-notification), the status-line script, and the local store, auto-detecting the environment.
Rollout rhythm follows the house pattern: recommend → agree → drill. This brief is the "recommend" step. Nothing installs until you sign off.
11 · Open items & sign-off
Items to close before build. The first three are closed during pre-build discovery where possible; the installer also auto-detects and confirms them at install time. The last two are your calls.
| # | Item | Owner |
|---|---|---|
| 1 | Confirm Python minor version on Steve's box (Python 3 confirmed) | auto-detect + confirm |
| 2 | Claude Code config path on Windows (%APPDATA%\Claude vs ~/.claude) | auto-detect + confirm |
| 3 | Windows installer shell (Git Bash / WSL / PowerShell) | auto-detect + confirm |
| 4 | Conformance to your Foundry conventions — inventory done; confirm specifics | Steve |
| 5 | Keep or strip the lightweight privacy toggle (§8) | Steve |