Session Health Monitor

Deliverable Brief & Integration Note · companion to design spec session-health-monitor-spec-v2.md

0 · Document control

DOCUMENT
session_health_monitor_deliverable_brief

VERSION
2.0

DATE
2026-07-01

STATUS
For Steve's review

OWNER
Steve Burson

CONSULTANT
Mike · system consultant

SOURCE SPEC
session-health-monitor-spec-v2.md

REVIEW
Codex — reviewed 2026-07-01

STANDARDS APPLIED
arc42 (brief structure) · C4 (topology diagram) · Diátaxis (reference tables kept separate from explanation)

Revision history. v2.0 — updated to design spec v2.0: externalized two-tier judge with loosened escalation triggers, incident-clustered scoring with five fabrication sub-types and cause-classified tool errors, hybrid global/rolling/per-project baseline, empirically-verified display behavior on the desktop app, and the tamper-evident (not tamper-resistant) framing for the Windows ledger. v1.0 — first deliverable brief for the Session Health Monitor: problem, goal, integration into the Foundry, the measured signals, overhead, honest limits, privacy, and the pre-build sign-off items. Content is faithful to design spec v2.0; nothing is built yet.

How to read this document

This is the deliverable brief — the plain-English companion that travels with the tool. It answers, in order: what problem it solves, what it is and is not, exactly how it plugs into your system, what it measures and at what cost, and where its limits are. The full engineering detail lives in the source spec; this document is the summary-first layer on top of it. Reference material (the signal table, the open-items list) is kept in tables; the reasoning sits in the prose around them.

One honest frame up front. "Session Health" is a single 0–100 score = the inverse of operational risk (100 = healthy, it counts down as risk rises). It reflects observed operational risk — not literal model intelligence. It improves how sessions are run (when to compact, when to start fresh, which workflows to avoid), not the model's raw capability. Everything below is calibrated to that claim, and no larger one.

1 · Summary

The Session Health Monitor is a Claude Code skill that watches an agent session in real time for cognitive degradation — fabrication, repeated errors, and thrash — and reports a live Session Health score. It persists a per-session scorecard, and mines cross-session trends into concrete operating-discipline recommendations.

Watches

Deterministic signals every turn (zero agent judgment); semantic checks — fabrication, quiet-wrongness — run on escalation via a fresh-context judge reading structured evidence, never the agent's own prose.

Scores

One 0–100 number = inverse of operational risk, incident-clustered so one failure can't multi-count, always with the top contributing incidents shown.

Escalates

A two-tier judge: cheap deterministic checks every turn, escalating to a cross-family judge (OpenAI Codex, gpt-5.5) on loosened triggers — not gated on high-stakes.

Learns

Scorecards accrue; once minimum sample sizes are met, trends become "when to compact / start fresh / avoid workflow X," calibrated separately per global/user/project.

2 · The problem being solved

This is the same failure mode that drives the Foundry — but measured live. Across sessions, a Builder agent will occasionally fabricate (claim a file was written when it was not, claim work done that was not), repeat the same error, thrash on the same files, or drift as context fills. Today there is no signal at the moment it happens. You find out after the fact, by reading output that turned out to be wrong.

The monitor closes that gap: it turns "the session felt off" into a number you can see while the session is still running, and into a record you can review afterward. Critically, it does this without asking the agent to grade itself — the same principle your audit layer already rests on.

3 · Goal & scope

In scope

Compute a live 0–100 Session Health score from the session transcript
Display it via the channels each environment actually supports (see §§6–7) — text first-line, CLI status bar, OS notification for criticals
Persist per-session scorecards to a local, tamper-evident store
Turn cross-session trends into operating-discipline recommendations
Cross-platform: Windows-primary, macOS supported; runtime auto-detected at install

Out of scope

Improving the model itself — this measures risk, it does not raise capability
Gating or blocking the build — it is advisory, never an enforcement wall
A compliance / court / audit-of-record instrument — it is an operating aid
Cross-party data handling — it is a same-user, same-machine tool

4 · How it integrates into your system

It is the "you don't audit yourself" principle, applied at the session level.

The monitor was shaped to your architecture rather than bolted on. Three deliberate design choices line up with the Foundry:

Externalized scoring

The score is computed outside the monitored agent: deterministic signals from a hook/script reading the session transcript, semantic signals from a fresh-context subagent reading structured evidence it did not assemble itself. Your Node C rule, at session granularity.

Cross-lineage judge

Escalations go to OpenAI Codex (gpt-5.5) — a different family from the subject, so it does not share the subject's blind spots. Same lineage-separation logic as your audit layer.

Chained ledger

An append-only, SHA-256-chained JSONL ledger is the source of truth; SQLite is a rebuildable projection. This mirrors the Foundry integrity backbone.

Figure 1 · Integration topology. The monitored agent (left of the boundary) only builds. Everything that scores, stores, judges, and displays sits outside it. The purple loop is the cross-session feedback that turns history into operating discipline. Diagram address: session_health_topology rev 2.0.

5 · What it measures

The score is Health = 100 − risk, where risk is a weighted roll-up of incident-clustered signals — correlated events are grouped into one incident so a single failure does not multi-count. Every signal is normalized 0–1. A single score is displayed, but it is always computed from — and traceable back to — this incident-level breakdown.

Signal	Weight*	What it captures
Context pressure	0.25	Flat below ~60% util, steep ramp after; from `context_window.used_percentage`
Repeat errors / re-corrections	0.20	Same error or re-correction vs baseline
Fabrication (broadened)	0.15	5 sub-types (below) — not just "file doesn't exist"
Tool-error rate (cause-classified)	0.10	Only cognitive causes count — infra flakiness excluded
Rework / thrash	0.10	Edit reverts, re-touched files, churn
Compactions (paired only)	0.10	Counts only when paired with lost context / re-asking — a bare compaction is healthy hygiene
Verification gap	0.07	"Done / verified" with no supporting evidence
Denial-reissue	0.03	Verbatim retry of a denied or failed call

provisional *Weights are provisional until calibration. They are labeled as such until the cross-session store calibrates them against real labeled incidents. They will not be presented as precise before then.

Fabrication → five sub-signals

(a) nonexistent artifact (file / symbol / flag) · (b) unsupported completion claim · (c) claim contradicted by tool output · (d) stale / unsourced external fact · (e) user-corrected false statement. Only a hard contradiction counts as "confirmed fabrication"; the rest score as lower-weight "unsupported-claim risk."

Tool errors are classified by cause

policy-denied · syntax · missing-dep · network · permission · timeout · test-failure · unexpected-output. Only cognitive causes degrade the score — a flaky network or a denied permission must never read as "the model got dumb."

Inspectability is required — the score is never a black box. The scorecard always exposes the top contributing incidents ("why it moved"), so you can tell real degradation from noisy telemetry. Baseline is hybrid: fixed global thresholds + a rolling session baseline + a per-project historical baseline (not just the first ~10 calls, which is too fragile).

6 · Overhead & impact

Designed so the thing that runs constantly costs almost nothing.

Hot path (status bar)	The ~300ms status-bar cadence must not start Python. The hook precomputes the score into a tiny file (`~/.claude/session-health/current`); the status line is a cheap file read (`cat` / `type`) — zero interpreter startup at cadence. This is a CLI-only path; the desktop app does not render custom status bars.
Two-tier judge	Fast path: deterministic signals every turn, no model call. Judge path: the cross-family Codex (gpt-5.5) call fires on a loosened set of triggers — any urgent flag, a sharp score drop, repeated user corrections, an unsupported completion claim, repeated same-file churn, or random sampling. High-stakes context raises priority but is never a required gate, since a degraded agent may misjudge what counts as high-stakes. Model cost is incurred only when one of these actually trips.
Core runtime	Python 3 (confirmed present on the target Windows box; Node is not guaranteed on PATH). OS-specific bits branch inside one codebase; runtime is auto-detected at install.
Footprint	One Python core + two thin installer stubs + a local JSONL/SQLite store. No always-on service; minimal footprint — Python 3 standard library plus the OS-native notification path.

7 · Failure modes & honest limits

Steve is a systems engineer, so this section is deliberately blunt — the limits are stated, not buried.

The desktop app does not render a status bar — verified empirically. Claude Code's desktop app does not support custom status lines, so the CLI status bar is a CLI-only channel; it does not show up there. On desktop, Session Health surfaces as the text first-line (only when the score crosses a 10% floor (100→90→80…) or an urgent flag trips — not every message) plus the OS notification for criticals. If a future desktop build adds status-line support, the bar auto-activates — no rebuild required.

The advisory text line is not a reliable alert. A degraded agent could omit or garble its own first-line status. That is why criticals fire an OS notification straight from the hook, bypassing the agent entirely — that is the can't-miss, agent-independent channel. The text line is treated as advisory display only.

Tamper-evident, not tamper-resistant (on Windows). The hash-chain detects edits to the ledger; it does not prevent deletion. An optional remote seal (periodically email/commit the ledger head-hash offsite) is available but is lower priority for a solo-user tool.

Weights are uncalibrated at ship. Until the cross-session store has labeled incidents to calibrate against, the score is directionally useful but not precise — and the tool says so rather than implying false precision.

Quiet wrongness (false negatives) is the hard case. An agent can be confidently wrong without tripping a deterministic signal. Mitigation is periodic challenge checks — requiring evidence for done-claims, sampling assertions, comparing final answers to actual tool output — plus random-sampling escalation. It reduces the risk; it does not eliminate it.

8 · Privacy note

This is a same-user, same-machine tool, so there is no cross-party leak. Two facts still warrant the minimal controls the tool ships with. First: this stays within Steve's own accounts — no cross-party sharing — but escalation does send recent session context to Steve's own OpenAI account (a different policy surface than Anthropic). Second: the ledger/DB are cleartext on disk.

Escalation on/off toggle — default ON.
One-line disclosure at install — "escalation sends recent session context to your OpenAI account."
Retention / delete-history for the local store.
Optional secret redaction on the escalation payload (API keys, tokens).

These are lightweight controls for Steve's install; Steve can choose to keep or remove them before build.

9 · Make-vs-buy & dependencies

Component	Make / Buy	Notes
Scoring core, parsing, installer, hooks	Make	Python 3 standard library
Local store	Make	Append-only JSONL + SQLite (both stdlib-reachable)
Escalation judge	Buy	OpenAI Codex / gpt-5.5 — your existing second-opinion tooling
Notifications	Make	osascript (macOS) / PowerShell toast (Windows) — OS-native
Build method	—	Assembled via the custom agent-builder skill

10 · Install & staged rollout

Packaging. One Python core + two thin installer stubs (.command for macOS, .bat for Windows). Roughly a 2-click install via an emailed link that downloads the installer; it lays down the skill, the hooks (score computation + OS-notification), the status-line script, and the local store, auto-detecting the environment.

Install honesty. True silent link-install is not safe or possible — unsigned stubs trip SmartScreen (Windows) / Gatekeeper (macOS). The tool ships checksums and shows the exact install path, so you verify before you run.

Rollout rhythm follows the house pattern: recommend → agree → drill. This brief is the "recommend" step. Nothing installs until you sign off.

11 · Open items & sign-off

Items to close before build. The first three are closed during pre-build discovery where possible; the installer also auto-detects and confirms them at install time. The last two are your calls.

#	Item	Owner
1	Confirm Python minor version on Steve's box (Python 3 confirmed)	auto-detect + confirm
2	Claude Code config path on Windows (`%APPDATA%\Claude` vs `~/.claude`)	auto-detect + confirm
3	Windows installer shell (Git Bash / WSL / PowerShell)	auto-detect + confirm
4	Conformance to your Foundry conventions — inventory done; confirm specifics	Steve
5	Keep or strip the lightweight privacy toggle (§8)	Steve

What we need from you to proceed. A yes on the design as framed here, plus your calls on items 4 and 5. On sign-off, this moves from "recommend" to the implementation drill-down — built via agent-builder, with pre-send review complete.

Session Health Monitor · Deliverable Brief v2.0 · faithful to design spec v2.0 · prepared by Mike, system consultant · 2026-07-01. Nothing in this document is built yet.