Introduction

Varta is a zero-dependency, zero-allocation health protocol designed for distributed local agents and networked clusters.

The Problem: “The Observer Gap”

In high-performance or safety-critical systems, monitoring process health is often surprisingly expensive or dangerously imprecise.

  • Expensive: Monitoring agents that consume 5-10% CPU just to check if others are alive.
  • Imprecise: TCP-based health checks that fail due to network congestion, not process failure.
  • Fragile: Monitoring systems that crash when the target process panics or deadlocks.

The Varta Philosophy: “Zero-Everything”

Varta was built to bridge this gap by providing a protocol that is:

  • Zero Dependencies: Production crates have empty [dependencies] sections.
  • Zero Allocations: After initialization, the beat path never touches the heap.
  • Zero Block: The agent never waits for the observer. If the observer is busy, the heartbeat is simply dropped.

How it Works

  1. Agents emit a 32-byte fixed-layout frame (VLP) over a Unix Domain Socket or UDP.
  2. The Observer (varta-watch) polls these frames, tracks per-pid state machines, and triggers recovery actions if a “stall” is detected.

Ready to get started? Check out the Installation guide.

Installation

Varta is currently in rapid development (post-v0.1.0). While it is not yet published to crates.io, it is designed to be easily included as a path dependency.

Adding to your Rust project

Add varta-client to your Cargo.toml as a path dependency:

[dependencies.varta-client]
path = "path/to/varta/crates/varta-client"

Optional Features

You can enable specific transport or safety features:

[dependencies.varta-client]
path = "path/to/varta/crates/varta-client"
features = [
    "panic-handler", # Automatic 'Critical' beat on thread panic
    "udp",           # Support for networked agents
    "secure-udp",    # Encrypted UDP transport (requires crypto deps)
]

Installing the Observer

To build and install the varta-watch observer binary:

cargo install --path crates/varta-watch

Verifying the Toolchain

Varta is pinned to a specific stable toolchain via rust-toolchain.toml. We recommend matching this for production builds:

rustup show

The Minimum Supported Rust Version (MSRV) is 1.70.0.

VLP Frame — Wire Layout (v0.2)

The Varta Lifeline Protocol carries a single message type: a 32-byte fixed-layout health frame. Every byte position is pinned at the protocol level so encode/decode is a handful of from_le_bytes / to_le_bytes calls and a single CRC-32C pass — nothing else.

Byte map

offset │ size │ field      │ notes
───────┼──────┼────────────┼──────────────────────────────────────────────
 0     │  2   │ magic      │ const [0x56, 0x41]  (ASCII "VA")
 2     │  1   │ version    │ const 0x02         (v0.1 → BadVersion)
 3     │  1   │ status     │ Status::{Ok=0, Degraded=1, Critical=2, Stall=3}
 4     │  4   │ pid        │ u32 little-endian — emitter's process id
 8     │  8   │ timestamp  │ u64 little-endian — emitter-local monotonic
16     │  8   │ nonce      │ u64 little-endian — strictly increasing
24     │  4   │ payload    │ u32 little-endian — opaque app context (v0.2)
28     │  4   │ crc32c     │ u32 LE CRC-32C over bytes 0..28        (v0.2)
───────┴──────┴────────────┴──────────────────────────────────────────────
                                                              total 32 bytes

v0.2 wire integrity (CRC-32C)

Bytes 28..32 carry a CRC-32C (Castagnoli, polynomial 0x1EDC6F41, init 0xFFFFFFFF, reflected, output-XOR 0xFFFFFFFF) computed over bytes 0..28. The CRC catches:

  • Non-ECC RAM bit flips and cosmic-ray single-event upsets on the agent or the observer host.
  • NIC firmware corruption between RX queue and userspace.
  • In-process memory corruption between Frame::encode and the transport write (or between the transport read and Frame::decode), including the gap between crypto::seal / crypto::open and the frame-level codec on the secure-UDP transport. AEAD tag failures surface separately as crypto::AuthError; the CRC is the defence-in-depth catch for everything that AEAD does not (in-process corruption on either side of the seal/open boundary).

Decode order is fixed: magic → version → CRC → status → pid → timestamp → nonce. CRC verification sits between version and field-range checks so random bytes from a wrong-protocol sender still surface as BadMagic / BadVersion (preserving the “this isn’t even VLP” diagnostic) while a single-bit-flipped status byte surfaces as BadCrc, never as a valid frame with the wrong meaning.

Implementation: crates/varta-vlp/src/crc32c.rs carries a const-fn 256-entry lookup table; per-frame cost is ~28 cycles (~9 ns on Apple Silicon). Hardware CRC-32C is available on x86_64 (SSE 4.2) and ARMv8.1+ via core::arch intrinsics; a future target_feature cfg can drop the cost to ~1 cycle without changing the wire format.

The payload field shrank from u64 (v0.1) to u32 (v0.2) to make room for the CRC trailer inside the 32-byte budget. Agents needing more than 4 bytes of context should externalize the data and reference it from the payload (e.g. as a slot index into a shared ring buffer).

The two compile-time assertions in crates/varta-vlp/src/lib.rs lock this in:

const _: () = assert!(core::mem::size_of::<Frame>() == 32);
const _: () = assert!(core::mem::align_of::<Frame>() == 8);

A drift in field order, padding, or width breaks the build. The integration test frame_round_trip_matches_golden_bytes cross-checks a hand-computed golden byte array against Frame::encode, so the layout is also pinned at runtime.

Why #[repr(C, align(8))]

  • repr(C) pins field order to declaration order. Without it the compiler is free to reorder fields, which would silently break a wire format consumed by any tool that decodes by offset (including varta-watch itself).
  • align(8) makes the struct’s start address 8-byte aligned, matching the natural alignment of the three u64 fields. The first 8 bytes (magic + version + status + pid) total exactly 8 bytes, so once the struct is 8-aligned the u64 fields land on 8-byte boundaries with zero padding. size_of therefore equals the sum of the field widths (32), and the const-assert proves it.
  • No unsafe is required at the encode/decode boundary because we never transmute the struct to or from [u8; 32]. The body of Frame::encode and Frame::decode is a sequence of to_le_bytes / from_le_bytes calls against fixed-length array slices, all of which are checked at the type system level.

Why little-endian on the wire

  • Every tier-1 target Varta will plausibly run on (x86_64, aarch64) is little-endian natively, so to_le_bytes is a no-op copy on the hot path.
  • Even on a hypothetical big-endian target the cost is one bswap-class instruction per integer field — a rounding error against UDS write/read.
  • Pinning byte order in the spec means a frame captured on one host can be decoded byte-for-byte on another, which keeps the varta-watch recovery command testable in isolation.

Why zero-dependency

  • The protocol crate is the foundation everything else links against. Any registry crate it pulls in (bytes, byteorder, zerocopy, …) becomes a transitive obligation for every agent that wants to integrate Varta. Keeping [dependencies] empty preserves the “drop in one path dep, get health signaling” contract.
  • The whole crate is a struct, an enum, and four free functions. There is nothing here that core does not already provide.
  • Empty deps also keep the audit surface minimal: the only unsafe in the workspace will live in varta-client and varta-watch (where required for UDS plumbing), never in the protocol crate itself.

VLP Transports

The Varta Lifeline Protocol (VLP) wire format is entirely transport-agnostic — a 32-byte, 8-byte-aligned #[repr(C)] frame. The transport layer is abstracted via traits that allow swapping out the underlying socket type without modifying the protocol core.

Architecture

┌──────────────────────────────────────────────────────────────────┐
│  varta-vlp                                                       │
│   Frame (32 bytes) │ Status │ DecodeError                        │
│   Zero dependencies. Never changes.                              │
└────────────┬───────────────────────────────┬─────────────────────┘
             │                               │
 ┌───────────▼───────────────┐   ┌───────────▼───────────────┐
 │  varta-client             │   │  varta-watch              │
 │                           │   │                           │
 │  BeatTransport            │   │  BeatListener             │
 │   ├── UdsTransport        │   │   ├── UdsListener         │
 │   ├── UdpTransport        │   │   ├── UdpListener         │
 │   │     (udp feat.)       │   │   │     (udp feat.)       │
 │   └── SecureUdpTransport  │   │   └── SecureUdpListener   │
 │       (secure-udp feat.)  │   │       (secure-udp feat.)  │
 └───────────────────────────┘   └───────────────────────────┘

Agent side (varta-client)

pub trait BeatTransport: Send + 'static {
    fn send(&mut self, buf: &[u8; 32]) -> io::Result<usize>;
    fn reconnect(&mut self) -> io::Result<()>;
}

Varta<T: BeatTransport> owns a transport and calls send(2) on every beat(). The default transport is UdsTransport (Unix Domain Socket). When the udp feature is enabled, UdpTransport is available via Varta::connect_udp(addr). When the secure-udp feature is enabled, SecureUdpTransport is available via Varta::connect_secure_udp(addr, key) — every beat is encrypted with ChaCha20-Poly1305 AEAD (RFC 8439).
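
For illustration, here is a minimal in-memory implementation of the trait: a test double that collects frames instead of writing to a socket. `MemTransport` is not a varta-client type; the trait is repeated so the sketch stands alone.

```rust
use std::io;

pub trait BeatTransport: Send + 'static {
    fn send(&mut self, buf: &[u8; 32]) -> io::Result<usize>;
    fn reconnect(&mut self) -> io::Result<()>;
}

/// Test double: records every emitted frame in memory.
#[derive(Default)]
struct MemTransport { frames: Vec<[u8; 32]> }

impl BeatTransport for MemTransport {
    fn send(&mut self, buf: &[u8; 32]) -> io::Result<usize> {
        self.frames.push(*buf);
        Ok(buf.len()) // a real transport returns the byte count from the socket
    }
    fn reconnect(&mut self) -> io::Result<()> { Ok(()) }
}
```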

Observer side (varta-watch)

pub trait BeatListener: Send + 'static {
    fn recv(&mut self) -> RecvResult;
    fn drain_decrypt_failures(&mut self) -> u64 { 0 }  // default = 0
    fn drain_truncated(&mut self) -> u64 { 0 }         // default = 0
}

The Observer holds a Vec<Box<dyn BeatListener>> and polls all listeners round-robin on each poll() call. When --udp-port is passed at the CLI, a UdpListener is added alongside the UDS listener.
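
A sketch of that round-robin pass. `RecvResult` and the `Observer` struct here are simplified stand-ins for the real varta-watch types (the `drain_*` counters are omitted):

```rust
/// Stand-in for varta-watch's RecvResult.
enum RecvResult { Frame([u8; 32]), WouldBlock }

trait BeatListener: Send + 'static {
    fn recv(&mut self) -> RecvResult;
}

struct Observer { listeners: Vec<Box<dyn BeatListener>> }

impl Observer {
    /// One poll() pass: every listener gets a turn; each is drained
    /// until it reports WouldBlock.
    fn poll(&mut self) -> Vec<[u8; 32]> {
        let mut frames = Vec::new();
        for listener in self.listeners.iter_mut() {
            while let RecvResult::Frame(f) = listener.recv() {
                frames.push(f);
            }
        }
        frames
    }
}
```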

Transport comparison

| | UDS (default) | UDP (feature = “udp”) | Secure UDP (feature = “secure-udp”) |
|---|---|---|---|
| Addressing | Filesystem path | IP:PORT | IP:PORT |
| Encryption | None (kernel isolation) | None | ChaCha20-Poly1305 AEAD |
| Authentication | Kernel PID + UID via SO_PASSCRED (Linux) / LOCAL_PEERTOKEN (macOS) | None | Poly1305 tag + PID in IV prefix (master-key mode) — wire-content only, not the sending process |
| Replay protection | None (local IPC) | None | Per-sender IV counter monotonicity |
| Trust model | Filesystem permissions + kernel credential attestation | Network segmentation | 256-bit pre-shared or per-agent derived key |
| Origin classification | KernelAttested | NetworkUnverified | NetworkUnverified (cryptographic binding ≠ kernel attestation) |
| Recovery-eligible by default? | Yes | No (see [peer-authentication.md → Recovery eligibility]) | No (same gate; even master-key derivation cannot replace kernel attestation) |
| Frame size | 32 bytes | 32 bytes | 60 bytes (AEAD overhead) |
| Socket cleanup | UdsListener::drop unlinks socket | Kernel reclaims port | Kernel reclaims port |
| Use case | Local IPC, process monitoring | IoT/edge, microservices | Anything crossing untrusted networks |

Recovery-on-UDP is structurally rejected by default. Combining any recovery flag (--recovery-cmd / --recovery-exec / *-file) with --udp-port is a startup hard-error unless the operator passes --i-accept-recovery-on-unauthenticated-transport. Even with the flag, the runtime origin gate still refuses to fire recovery for UDP-origin stalls — flipping Recovery::with_allow_unauthenticated_source(true) is a separate, conscious choice. See book/src/architecture/peer-authentication.md for the full threat model.

CLI additions

# Listen on UDS only (default)
varta-watch --socket /tmp/varta.sock --threshold-ms 500

# Listen on UDS + UDP (requires --features udp at build time)
varta-watch --socket /tmp/varta.sock --threshold-ms 500 \
            --udp-port 9000 --udp-bind-addr 0.0.0.0

# UDS + UDP with the default bind address
varta-watch --socket /tmp/varta.sock --threshold-ms 500 \
            --udp-port 9000

# UDP with ChaCha20-Poly1305 encryption
# Generate a 256-bit key (64 hex chars)
openssl rand -hex 32 > /tmp/varta.key

varta-watch --socket /tmp/varta.sock --threshold-ms 500 \
            --udp-port 9000 --key-file /tmp/varta.key

# Rotation: accept old key while transitioning to new key
openssl rand -hex 32 > /tmp/varta-new.key
varta-watch --socket /tmp/varta.sock --threshold-ms 500 \
            --udp-port 9000 --key-file /tmp/varta.key \
            --accepted-key-file /tmp/varta-new.key

# Per-agent key derivation from master key
# The observer derives agent-specific keys from the PID embedded in
# each frame's iv_random prefix. Compromise of one agent's key does
# not reveal other agents' keys or the master key.
openssl rand -hex 32 > /tmp/varta-master.key
varta-watch --socket /tmp/varta.sock --threshold-ms 500 \
            --udp-port 9000 --master-key-file /tmp/varta-master.key

Feature flags

| Crate | Flag | Effect |
|---|---|---|
| varta-vlp | crypto | Enables ChaCha20-Poly1305 AEAD (seal, open, Key). no_std-compatible — all four RustCrypto deps are default-features = false. |
| varta-vlp | std | Opt-in std-dependent conveniences (Key::from_file, std::path::Path-typed helpers). Off by default so the crate is #![no_std] + alloc-free out of the box — ready for FreeRTOS/Zephyr targets. |
| varta-client | udp | Enables UdpTransport, Varta::connect_udp(), install_panic_handler_udp() |
| varta-client | secure-udp | Enables SecureUdpTransport, Varta::connect_secure_udp(); implies udp, varta-vlp/crypto, and varta-vlp/std (the secure_udp example calls Key::from_file). |
| varta-watch | udp | Enables UdpListener, --udp-port / --udp-bind-addr CLI flags |
| varta-watch | secure-udp | Enables SecureUdpListener, --key-file / --accepted-key-file / --master-key-file; implies udp-core |
| varta-tests | udp | Enables UDP integration tests |
| varta-bench | udp | Enables udp-latency benchmark subcommand |

Security

  • UDS: On Linux, the kernel attests the sender’s PID and UID via SCM_CREDENTIALS. The observer rejects frames where frame.pid != peer_pid or peer_uid != observer_uid. On macOS, getsockopt(LOCAL_PEERTOKEN) is attempted for the same verification, falling back to --socket-mode 0600. On other platforms, the only defence is --socket-mode.

  • UDP (plaintext): No kernel credential mechanism exists. peer_pid is always 0, which causes the observer to skip PID verification. Trust must be established at the network layer — firewall rules, VPC boundaries.

  • UDP (secure): Every frame is encrypted with ChaCha20-Poly1305 (RFC 8439) using a 256-bit key. Primitives are provided by the chacha20poly1305 crate (RustCrypto, NCC Group audit 2020) — no hand-rolled crypto. Key derivation uses HKDF-SHA256 (RFC 5869) via the hkdf + sha2 crates. Two key modes:

    • Shared key: A single pre-shared key for all agents (--key-file).
    • Master key: Per-agent keys derived from the agent’s PID via HKDF-SHA256 (--master-key-file). The PID is embedded in the iv_random prefix so the observer can derive the correct agent key before decryption. Compromise of one agent’s key does not reveal other agents’ keys or the master key. Note: the HKDF-based KDF is incompatible with the ChaCha20-PRF KDF used in earlier releases — agents must re-key when upgrading from a pre-RustCrypto build if master-key mode was in use.
  • Replay attacks are blocked by enforcing monotonic IV counters per sender. Key rotation is supported via --accepted-key-file (no downtime required).

  • Panic-hook entropy: install_panic_handler_secure_udp reads entropy at install time and fails closed if all sources (getrandom, getentropy, /dev/urandom) are unavailable. In chrooted environments without /dev, use install_panic_handler_secure_udp_accept_degraded_entropy to opt into a non-cryptographic fallback — see book/src/architecture/peer-authentication.md for the full nonce-reuse risk analysis.
  • Recovery commands: Two execution modes:

    • --recovery-cmd: Shell mode — templates executed via /bin/sh -c with the PID as $1 (positional argument, never string-interpolated).
    • --recovery-exec: Exec mode — commands executed directly via execvp(2) with {pid} replaced in arguments. No shell is involved.
    • --recovery-cmd-file / --recovery-exec-file: Read templates from files with mandatory ownership/permission checks (UID match, mode ≤ 0600).

Container / PID-namespace semantics

Frame.pid carries the agent’s PID in the agent’s PID namespace. The observer’s kernel-attested peer PID (SO_PASSCRED / LOCAL_PEERTOKEN / SCM_CREDS) is in the observer’s namespace. When the two namespaces differ:

  • The pid in the frame cannot be used to identify a process the observer can kill(2) or systemctl restart — the same numeric PID refers to a different process in each namespace.
  • The existing frame.pid == peer_pid check at observer ingress catches most cases (different namespaces usually produce different numeric pids), but same-pid collisions across containers (every container’s first process is PID 1) are invisible to that gate.

varta-watch therefore (Linux only):

  1. Reads /proc/self/ns/pid once at startup and caches the inode as the observer’s namespace identity.
  2. For every kernel-attested beat (UDS), reads /proc/<peer_pid>/ns/pid and compares the inode to the observer’s. Mismatch ⇒ drop the beat (varta_frame_namespace_mismatch_total++) and emit Event::NamespaceConflict.
  3. Per-pid tracker slots pin the namespace inode at first beat; a later beat with a different Some(_) inode is rejected as Update::NamespaceConflict (varta_tracker_namespace_conflict_total++).
  4. Recovery commands refuse to spawn for cross-namespace stalls and log an audit record with reason=cross_namespace_agent (varta_recovery_refused_total{reason="cross_namespace_agent"}++).
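
Steps 1 and 2 boil down to comparing `/proc/<pid>/ns/pid` symlink targets. A std-only, Linux-only sketch (function names here are illustrative, not the varta-watch API):

```rust
use std::fs;

/// Namespace identity of a pid, read from the /proc/<pid>/ns/pid symlink.
/// The link target looks like "pid:[4026531836]"; the number is the inode.
fn pid_ns_identity(pid: &str) -> Option<String> {
    fs::read_link(format!("/proc/{pid}/ns/pid"))
        .ok()
        .map(|target| target.to_string_lossy().into_owned())
}

/// Step 2: compare a kernel-attested peer pid's namespace against ours.
fn same_pid_namespace(peer_pid: u32) -> bool {
    match (pid_ns_identity("self"), pid_ns_identity(&peer_pid.to_string())) {
        (Some(ours), Some(theirs)) => ours == theirs,
        _ => false, // unreadable /proc entry: refuse, same as a mismatch
    }
}
```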

Escape hatch — --allow-cross-namespace-agents

When agents are intentionally run with --pid=host (containers sharing the host PID namespace), the observer’s namespace and the agents’ namespace agree at the kernel level — the gate above is a no-op.

For deployments where the agent runs in a private namespace and the operator has out-of-band PID translation (e.g. CNI metadata that lets a recovery script translate container pids to host pids), pass --allow-cross-namespace-agents. The audit log and metrics still fire, but beats are admitted and recovery is permitted.

--strict-namespace-check

Treat namespace mismatch as a fatal startup error: on the first Event::NamespaceConflict, the daemon logs a FATAL line and exits with a non-zero status. Used in environments where the operator wants the daemon to fail loudly rather than silently log audit refusals.

Non-Linux platforms

PID namespaces are a Linux kernel concept. On macOS and the BSDs, observer_pid_namespace_inode() returns None and all comparisons short-circuit to “match”. The CLI flags are accepted for portability but have no runtime effect.

UDP transports

UDP listeners (plain or secure) have no kernel peer-cred mechanism. peer_pid is 0; peer_pid_ns_inode is None. Recovery is already refused for NetworkUnverified origins by the existing transport gate — namespace mismatch adds nothing for UDP. See peer-authentication.md for the full trust model.

Secure UDP — replay-shadow threat boundary (H4)

SecureUdpListener keeps per-sender replay state in a bounded HashMap indexed by SocketAddr:

  • Capacity: MAX_SENDER_STATES = 1024 simultaneously-tracked senders.
  • After capacity is reached, force_evict_oldest_sender stashes the evicted sender’s (addr, SenderState) in a single-slot last_evicted: Option<(SocketAddr, SenderState)> shadow so a replay attempt from the just-evicted sender is still rejected.

The shadow is one entry deep. An attacker who can spoof UDP source addresses can cycle ≥1025 distinct sources to overwrite the shadow with their own chaff, then replay a captured frame from the target sender as if it were a “new” sender — the listener has no surviving record of the target’s last counter and accepts the replay.
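
A condensed sketch of the bounded map plus single-slot shadow. Names and the eviction choice are illustrative; the real SecureUdpListener state machine also tracks IV prefixes and evicts the oldest sender rather than an arbitrary one.

```rust
use std::collections::HashMap;

const MAX_SENDER_STATES: usize = 2; // 1024 in varta-watch; tiny here for illustration

#[derive(Clone)]
struct SenderState { last_counter: u64 }

struct ReplayMap {
    live: HashMap<String, SenderState>,          // keyed by sender address
    last_evicted: Option<(String, SenderState)>, // the 1-deep shadow
}

impl ReplayMap {
    /// Returns true if the frame's counter is fresh for this sender.
    fn check_and_update(&mut self, addr: &str, counter: u64) -> bool {
        // If this sender was the one just force-evicted, resurrect its
        // state so its old counters are still rejected.
        if let Some((a, s)) = self.last_evicted.take() {
            if a == addr {
                self.live.insert(a, s);
            } else {
                self.last_evicted = Some((a, s));
            }
        }
        if let Some(state) = self.live.get_mut(addr) {
            if counter <= state.last_counter {
                return false; // replay (or stale reorder): reject
            }
            state.last_counter = counter;
            return true;
        }
        // New sender: evict someone into the shadow slot if we're full.
        if self.live.len() >= MAX_SENDER_STATES {
            if let Some(victim) = self.live.keys().next().cloned() {
                let s = self.live.remove(&victim).unwrap();
                self.last_evicted = Some((victim, s)); // overwrites older shadow
            }
        }
        self.live.insert(addr.to_string(), SenderState { last_counter: counter });
        true
    }
}
```

The attack above is visible in this sketch: each new sender past capacity overwrites `last_evicted`, so cycling enough spoofed sources flushes the target's state out of both the map and the shadow.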

Why the shadow isn’t deeper

A 1-deep shadow is acceptable for the loopback configuration: only processes on the same host can produce loopback source addresses (forging a 127.0.0.0/8 source on a UDP packet requires CAP_NET_RAW, and even then the kernel refuses spoofed loopback arriving from external interfaces). On any reachable network — VLAN, VPC, the public internet — the source address is freely forgeable, and a deeper shadow merely raises the attacker’s required address budget rather than closing the gap. Bounding the shadow to a single slot keeps the eviction story constant-time and aligns the threat boundary with a clean operational constraint (network reach) rather than a fuzzy quantitative argument about how many spoofed sources are “enough”.

Mitigation

varta-watch defaults --udp-bind-addr to 127.0.0.1 when secure-UDP keys are configured. Operators who genuinely need the listener to accept non-loopback peers must pass --i-accept-secure-udp-non-loopback explicitly — a CLI flag whose name signals the residual risk. When the flag is set, a high-visibility startup warning is emitted to stderr and the operator is expected to constrain network reach (firewall, private VLAN, mTLS-fronted tunnel) so that no untrusted host can reach the bound port.

The recovery gate on NetworkUnverified origins (see peer-authentication.md) remains independent of this flag — opting in to non-loopback secure-UDP does NOT enable recovery commands from UDP-origin beats. Those still require the separate --secure-udp-i-accept-recovery-on-unauthenticated-transport acknowledgement.

Fork-safety on secure-UDP

After fork(2), a child process inherits its parent’s SecureUdpTransport state — the 16-byte iv_session_salt, the iv_prefix_index, and the iv_counter. Three nominally-independent fields whose product defines the AEAD nonce. If the child ever calls Varta::beat() without intervention, it derives the same 12-byte ChaCha20-Poly1305 nonce its parent has already emitted under the same key — a catastrophic confidentiality and integrity failure (Poly1305 key recovery, plaintext XOR leak).

How Varta enforces fork-safety structurally

Varta::connect snapshots std::process::id() into a private connect_pid field. Every Varta::beat reads the current PID and compares — on mismatch (i.e. the handle is now in a forked child), the wrapper invokes transport.reconnect() before building the frame. SecureUdpTransport::reconnect() re-reads OS entropy into a fresh 16-byte session salt, recomputes the IV prefix, and resets the prefix index and counter to zero. The child’s first emitted frame therefore uses an IV prefix derived from independent entropy — nonce collision across the fork boundary is impossible.

Auto-recovery is silent: the caller observes BeatOutcome::Sent. The event is observable via Varta::fork_recoveries() -> u64 (suggested Prometheus name: varta_client_fork_recoveries_total). The local session epoch resets too — nonce → 0, start → Instant::now(), last_timestamp → 0, consecutive_dropped → 0 — so the child’s wire stream looks like a fresh session to the observer.
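
The PID-snapshot check reduces to a few lines. This sketch is illustrative, not the actual Varta internals; it only shows the detection side, with the transport reset standing in for `transport.reconnect()`:

```rust
/// Detects that the handle has crossed a fork boundary by comparing the
/// current PID against the one snapshotted at connect time.
struct ForkGuard {
    connect_pid: u32,
    fork_recoveries: u64,
}

impl ForkGuard {
    fn new() -> Self {
        Self { connect_pid: std::process::id(), fork_recoveries: 0 }
    }

    /// Returns true when the transport must re-derive its IV state
    /// (fresh session salt, prefix, counter) before the next beat.
    fn needs_reconnect(&mut self) -> bool {
        let now = std::process::id();
        if now != self.connect_pid {
            self.connect_pid = now;       // adopt the child's identity
            self.fork_recoveries += 1;    // exposed as fork_recoveries()
            return true;
        }
        false
    }
}
```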

Observer view

The observer’s per-sender state in SecureUdpListener is keyed by (SocketAddr, iv_prefix) with a 1-deep replay history (see H4 replay shadow above). When the forked child sends frames from the same source port with a new IV prefix, the observer transitions its current state into the prev_* slots and accepts the new prefix as a fresh session — no replay error, no protocol-level signal required. Fork-recovery is entirely transparent to the wire format.

Advanced callers

Callers using SecureUdpTransport directly (without the Varta wrapper) do not get auto-detection. The BeatTransport trait is intentionally low-level; the safety policy lives one layer up. Direct-transport users must call SecureUdpTransport::reconnect() themselves in the forked child before the first beat.

Panic-hook parallel

install_panic_handler_secure_udp caches an 8-byte IV at install time to avoid the (non-async-signal-safe) entropy read inside the panic hook itself. The same fork hazard applies: a child that panics would otherwise emit (cached_iv, iv_counter=1) — colliding with the parent’s identical pair if the parent panicked too. The installer snapshots install_pid and, inside the hook, re-runs the entropy chain (getrandom/getentropy/dev/urandom) when the PID has changed. The strict variant fails closed (skips the secure frame) when no entropy source is reachable; the accept-degraded-entropy variant falls back to fallback_iv_random() per the documented degraded-entropy policy.

Cross-references

  • Observer liveness — the watcher’s own liveness story: in-process self-watchdog, systemd sd_notify, hardware watchdog, and paired-observer pattern
  • Safety profiles — compile-time vs. runtime feature gating for production-safe builds
  • Peer authentication — kernel-level PID attestation and transport trust classification
  • Namespaces — dedicated reference for cross-namespace deployments

Future transports

Additional transports can be added by implementing BeatTransport (agent side) and BeatListener (observer side), without touching the protocol core:

  • Shared memory (memfd, shm) — Wasm plugins writing directly to a shared ring buffer
  • Unix pipes (pipe, fifo) — stdin/stdout health frames for supervised processes
  • WebSocket — for browser-based health dashboards

Observer Liveness — “Who Watches the Watcher?”

varta-watch is the single observer for all agents on a host. If it crashes or its poll loop hangs, no agent gets a Stall event and no recovery fires — the entire monitoring layer fails silently. For life-support deployments this is the most critical functional gap.

This document describes four independent, layered defenses. Deploy as many as your environment supports; each catches failure modes the others cannot.


Threat model

| Failure mode | L1 | L2 | L3 | L4 |
|---|---|---|---|---|
| Poll loop hangs (stuck in I/O or computation) | ✓ | ✓* | ✓ | ✓ |
| Process crash (SIGSEGV, stack overflow, OOM) | | ✓ | ✓† | ✓ |
| Watchdog thread dies silently (panic, signal) | | ✓‡ | ✓† | |
| Kernel hang / host deadlock | | | ✓ | |
| Misconfiguration (wrong socket path, wrong user) | | | | |

*systemd detects a hang only if WATCHDOG=1 stops arriving; the self-watchdog ensures that also stops when the loop wedges.
†hardware watchdog fires when the kick loop stops; process crash achieves this.
‡since H5 the watchdog thread is the sole source of WATCHDOG=1; if it dies, the emission stream stops and systemd’s WatchdogSec= fires.


L1 — In-process self-watchdog (--self-watchdog-secs)

A background thread checks that the main poll loop has ticked at least once within the configured deadline. If not, it calls process::abort().

varta-watch --self-watchdog-secs 4 ...
  • The background thread is the only non-main thread in the binary. The beat path and observer loop remain single-threaded.
  • process::abort() produces SIGABRT, which appears in journalctl, enables core dumps, and triggers Restart=on-abort in systemd units.
  • The deadline should be set to roughly 2× the expected worst-case poll latency (typically --threshold-ms + reaping time).
  • H5 (post-2026-05-13): the watchdog thread is ALSO the sole emitter of systemd WATCHDOG=1. Emission used to live on the main loop, which left a silent-failure window: if the watchdog thread died while the main loop remained healthy, WATCHDOG=1 kept arriving from the main thread and systemd had no way to notice the in-process abort path was already gone. Now WATCHDOG=1 emission is moved to the watchdog thread (via a dup(2)-ed copy of the notify socket carved off SdNotify with take_watchdog_notifier). If the thread dies, the emission stream stops and WatchdogSec= fires. This is the only design where systemd can detect a dead watchdog while the main loop is still alive.
  • Auto-enable: when $WATCHDOG_USEC is set by the service manager and --self-watchdog-secs is not passed, the watchdog thread is spawned unconditionally with a 4 s deadline. Operators with tighter WatchdogSec= values can override via the CLI. This collapses the L1+L2 layers structurally: enabling WatchdogSec= in the unit automatically buys both the in-process abort path and the WATCHDOG=1 emission stream.
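
The core of L1 fits in a few lines of std-only Rust. This is a sketch of the mechanism described above, not the varta-watch source; the real thread also owns WATCHDOG=1 emission per H5.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{Duration, Instant};

/// Written by the main poll loop on every tick; read by the watchdog thread.
static LAST_TICK_NS: AtomicU64 = AtomicU64::new(0);

/// Nanoseconds since the process-local epoch the loop was started with.
fn now_ns(epoch: Instant) -> u64 {
    epoch.elapsed().as_nanos() as u64
}

/// Background thread: abort if the main loop hasn't ticked within `deadline`.
fn spawn_self_watchdog(epoch: Instant, deadline: Duration) {
    std::thread::spawn(move || loop {
        std::thread::sleep(deadline / 4);
        let last = LAST_TICK_NS.load(Ordering::Relaxed);
        if now_ns(epoch).saturating_sub(last) > deadline.as_nanos() as u64 {
            // SIGABRT: visible in journalctl, triggers Restart=on-abort.
            std::process::abort();
        }
    });
}
```

The main loop's only obligation is `LAST_TICK_NS.store(now_ns(epoch), Ordering::Relaxed)` once per iteration; the two threads share nothing else.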

L2 — systemd sd_notify watchdog integration

varta-watch speaks the sd_notify(3) protocol natively. Set Type=notify in the service unit and configure WatchdogSec=:

[Service]
Type=notify
NotifyAccess=main
WatchdogSec=5s
Restart=on-watchdog
RestartSec=1s
TimeoutStartSec=10s
ExecStart=/usr/bin/varta-watch \
    --socket /run/varta/agents.sock \
    --threshold-ms 5000 \
    --self-watchdog-secs 4 \
    --hw-watchdog /dev/watchdog \
    --heartbeat-file /run/varta/heartbeat

varta-watch sends:

  • READY=1 after observer bind succeeds and all listeners are attached
  • WATCHDOG=1 every WATCHDOG_USEC / 2 microseconds while the poll loop runs
  • STOPPING=1 when the SHUTDOWN latch flips

If WATCHDOG=1 stops arriving, systemd kills and restarts the process. This catches both crashes (no more sends) and hangs (LAST_TICK_NS stops advancing, the self-watchdog aborts, systemd restarts).

$NOTIFY_SOCKET and $WATCHDOG_USEC are passed automatically by systemd; no extra flags are needed.


L3 — Hardware watchdog (--hw-watchdog)

On hosts with a kernel hardware watchdog (e.g. /dev/watchdog), varta-watch can kick it once per poll iteration. If the kick stops, the kernel reboots the host — even if the OS itself is wedged.

varta-watch --hw-watchdog /dev/watchdog ...

Magic close: on a clean shutdown (SIGTERM/SIGINT followed by graceful exit) varta-watch writes the magic byte 'V' to disarm the watchdog before exiting. A crash or hang leaves the watchdog armed; the kernel reboots after its timeout.

The /dev/watchdog device is typically root-owned (mode 0600). Run varta-watch as root or grant the CAP_SYS_ADMIN capability, or use a watchdog daemon (e.g. watchdog(8)) for the actual device management.


L4 — Paired observers (operational)

A second monitoring process scrapes the first observer’s liveness signals and restarts it if they stall. This requires no code changes — use the existing --heartbeat-file and /metrics primitives.

Heartbeat-file poller

#!/bin/sh
HEARTBEAT=/run/varta/heartbeat
while :; do
    prev=$(awk '{print $1}' "$HEARTBEAT" 2>/dev/null || echo 0)
    sleep 5
    cur=$(awk '{print $1}' "$HEARTBEAT" 2>/dev/null || echo 0)
    if [ "$cur" -le "$prev" ]; then
        logger -t varta-watchdog "heartbeat stalled (loop_count=$prev); restarting"
        systemctl restart varta-watch
    fi
done

The first field in the heartbeat file is a monotonically increasing loop counter. If it stops advancing, the observer is wedged or dead.

Prometheus uptime scraper

/metrics exposes varta_watch_uptime_seconds. A second Prometheus instance (or Alertmanager rule) can alert when the gauge stops increasing:

# Alert when varta-watch uptime has not increased for 30 seconds.
alert: VartaWatchStalled
expr: rate(varta_watch_uptime_seconds[30s]) == 0
for: 30s
labels:
  severity: critical

Threading note

--self-watchdog-secs spawns one background thread. This is the only non-main thread in the varta-watch binary, and that property is a load-bearing architectural invariant, not an accident. All agent beat processing, stall detection, recovery spawning, and Prometheus serving happen on the main thread. The watchdog thread reads two atomics (SHUTDOWN and LAST_TICK_NS), calls process::abort() on wedge, and writes WATCHDOG=1 to its own dup(2)-ed UnixDatagram fd; it never touches shared mutable state. The dup-ed fd is independent kernel state — both threads own their own descriptor and there is no synchronisation between them on the notify path.

The single-threaded design is what lets the project preserve its zero-alloc, ABI-stable beat contract: a beat is decoded into a stack-allocated [u8; 32] and dispatched through the per-pid tracker without locking, because nothing else holds a reference. Moving any phase of the loop to a second thread would require a lock-free SPSC ring between threads at the ingress and break that contract. Stall-detection latency under scrape load is instead bounded by an explicit per-iteration latency budget — see below.

Why /metrics is on the poll thread

“Doesn’t scrape latency variance steal time from beat ingestion?”

It can, by up to ~200 ms per iteration — the structural cap of PromExporter::serve_pending (100 ms serve deadline + 100 ms drain deadline, see exporter.rs). The obvious mitigation is to spawn a second thread that owns serve_pending and reads tracker state through a shared snapshot. We deliberately do not do this. Three reasons:

  1. The beat path would acquire a lock on every tick. Whether via Arc<Mutex<PromExporter>> or an SPSC snapshot ring, every record-side counter increment (pe.record_beat(...), pe.record_stall(...), pe.record_loop_tick(...) etc.) becomes either a mutex acquisition or a single-producer write into a wait-free queue. Neither is zero-overhead on the hot path, and both introduce per-architecture memory-ordering questions that the current &mut self model eliminates by construction.
  2. The zero-allocation invariant becomes harder to enforce. The beat path is currently zero-alloc post-connect, enforced by the varta-tests guard allocator. A snapshot ring requires either a pre-sized arena (more state on the hot path) or per-snapshot allocation (kills the invariant). Both are worse than what we have.
  3. The variance is already bounded and now observable. Scrape work per iteration is capped at ~200 ms by PROM_READ_DEADLINE = 10 ms, PROM_MAX_CONNECTIONS_PER_SERVE = 8, PROM_MAX_DRAIN_PER_SERVE = 50, the 100 ms serve deadline, and the per-IP token bucket. Operators see the variance through varta_observer_serve_pending_seconds (new — see “Observing scrape-induced latency” below); beat-path latency is iteration_seconds - serve_pending_seconds in PromQL.

Scrape-storm alarms and beat-path alarms therefore route off different metrics, and the load-bearing single-thread invariant is preserved.


Latency budget — worst-case poll iteration time

A bounded iteration time guarantees a bounded stall-detection latency. The table below names the phases of the poll loop in main.rs and the upper-bound source for each:

| Phase | Worst case | Source / constant | Observable as |
|---|---|---|---|
| 1. Drain queued stall events | O(queue) · ~1 µs | Observer::poll_pending — one stack pop per call | (subsumed in iteration_seconds) |
| 2. Observer::poll() (one recv each) | read_timeout · N | UDS recv(2) blocks up to --read-timeout-ms (default 100 ms) per listener; UDP listeners are non-blocking | (subsumed in iteration_seconds) |
| 3. Maintenance counter drains | <1 ms | Constant work over observer.drain_* counters | (subsumed in iteration_seconds) |
| 4. Recovery::try_reap | ~64 µs | ≤64 non-blocking waitpid(2) (WNOHANG) syscalls (bounded outstanding-pids fan-out) | (subsumed in iteration_seconds) |
| 5. PromExporter::serve_pending | ≤200 ms | 100 ms serve deadline + 100 ms drain deadline (see exporter.rs) | varta_observer_serve_pending_seconds (independent histogram) |
| 6. Heartbeat-file atomic write | <5 ms | Same-dir write + rename (write_heartbeat_atomic) | (subsumed in iteration_seconds) |
| 7. sd_notify + HW watchdog kicks | <1 ms | One sendmsg(2) + one write(2) | (subsumed in iteration_seconds) |
| Iteration total (worst case) | ~310 ms | UDS read_timeout (100 ms) + serve_pending (≤200 ms) + small fixed work, assuming a single UDS listener | varta_observer_iteration_seconds |

Two observations the table makes explicit:

  • The UDS read-timeout is the idle floor: with no incoming beats and no scrape pressure, every iteration costs about read_timeout. This is intentional — it yields CPU between recvs without busy-spinning. Lower the floor by lowering --read-timeout-ms, at the cost of a tighter idle poll loop.
  • The worst-case active iteration is bounded by read_timeout + serve_pending, since recv(2) returns early as soon as a frame arrives and serve_pending is the only other phase that can spend more than a few milliseconds.

The default soft budget is 250 ms (--iteration-budget-ms). Iterations exceeding it increment varta_observer_iteration_budget_exceeded_total and are visible in the varta_observer_iteration_seconds histogram. The budget is advisory: hard wedges (seconds, never returning) remain the responsibility of --self-watchdog-secs.

The idle sleep at the end of an iteration with no pending I/O (10 ms) is excluded from the histogram. Idle time is a throttling primitive, not work latency; including it would mask the bad iterations.

Tuning relationship

For a given --threshold-ms T, stall-detection latency is bounded by T + per_iteration_worst_case. With defaults (--threshold-ms 5000, --read-timeout-ms 100, default serve_pending bounds) the worst case is ~310 ms, so a stalled agent surfaces no later than ~5.31 s after its last beat.

The soft --iteration-budget-ms (default 250 ms) sits between the typical case (~100 ms idle floor) and the worst case (~310 ms under scrape storm) so the budget-exceeded counter fires only during real scrape pressure, not on every active iteration. Operators with higher --read-timeout-ms or multiple listeners should raise the budget proportionally (budget ≥ read_timeout × N_listeners + 150 ms).

--self-watchdog-secs should be set such that self_watchdog_secs × 1000 ≥ 4 × iteration_budget_ms so transient overruns during scrape bursts do not trigger false-positive aborts. The default guidance (--self-watchdog-secs 4 with --iteration-budget-ms 250) gives a 16× margin (4000 ms ÷ 250 ms), well above the worst-case ratio.

Observing scrape-induced latency

Three metrics together let an operator separate scrape pressure from beat-path slowness:

  • varta_observer_iteration_seconds — wall time for the entire poll iteration (drain → poll → maintenance → recovery reap → serve_pending → heartbeat write → watchdog kicks). Bucketed by [0.001, 0.005, 0.01, 0.05, 0.1, 0.25, 0.5, 1.0, +Inf]. Includes serve_pending — unchanged contract.
  • varta_observer_serve_pending_seconds — wall time for the serve_pending phase alone. Same bucket boundaries as iteration_seconds so the two are coherent. Configurable soft budget via --scrape-budget-ms (default 250 ms); overruns increment varta_observer_scrape_budget_exceeded_total.
  • varta_observer_iteration_budget_exceeded_total — iterations exceeding --iteration-budget-ms (default 250 ms). Includes serve_pending time.

Beat-path latency is then a PromQL expression — the difference between iteration time and serve-pending time:

# P99 beat-path latency ≈ P99(iteration_seconds) − P99(serve_pending_seconds).
# Note: subtracting quantiles is approximate (the P99 of a difference is
# not the difference of P99s), but in practice serve_pending and the rest
# of the iteration are weakly correlated, so the approximation is
# monotonic with the true beat-path latency. For exact numbers, compute
# beat_path_seconds in a recording rule from the two histograms instead.
histogram_quantile(0.99,
  sum by (le) (rate(varta_observer_iteration_seconds_bucket[5m])))
- histogram_quantile(0.99,
    sum by (le) (rate(varta_observer_serve_pending_seconds_bucket[5m])))

Alarms that should fire on beat-path slowness route off iteration_seconds - serve_pending_seconds or off iteration_budget_exceeded_total minus scrape_budget_exceeded_total when scrape overruns dominate the budget overruns.

Alarms that should fire on scrape-storm pressure route off scrape_budget_exceeded_total and serve_pending_seconds quantiles directly.

# Warn — more than 10% of recent iterations exceeded the soft budget.
alert: VartaIterationBudgetOverruns
expr: rate(varta_observer_iteration_budget_exceeded_total[5m])
    / rate(varta_observer_iteration_seconds_count[5m]) > 0.10
for: 5m
labels: { severity: warning }

# Crit — 99th-percentile iteration time has exceeded 500 ms (twice the budget).
alert: VartaIterationP99High
expr: histogram_quantile(0.99,
        sum by (le) (rate(varta_observer_iteration_seconds_bucket[5m]))) > 0.5
for: 5m
labels: { severity: critical }

# Warn — sustained scrape pressure (≥10% of serve_pending calls over budget).
# Fires on scrape-storm symptoms specifically, NOT on beat-path slowness.
alert: VartaScrapeStormPressure
expr: rate(varta_observer_scrape_budget_exceeded_total[5m])
    / rate(varta_observer_serve_pending_seconds_count[5m]) > 0.10
for: 5m
labels: { severity: warning }

# Crit — beat-path P99 latency exceeds 200 ms.  Derived: subtract scrape
# time from iteration time so this alarm is immune to scrape storms.
# (See "Observing scrape-induced latency" for the approximation caveat —
# put this in a recording rule for production use.)
alert: VartaBeatPathP99High
expr: |
  (histogram_quantile(0.99,
     sum by (le) (rate(varta_observer_iteration_seconds_bucket[5m])))
   - histogram_quantile(0.99,
     sum by (le) (rate(varta_observer_serve_pending_seconds_bucket[5m])))) > 0.2
for: 5m
labels: { severity: critical }

Tracker bounded-work guarantee

Each beat frame triggers at most one call to find_evictable_slot when the tracker is at capacity. That call scans at most eviction_scan_window slots (default 256, configurable via --eviction-scan-window).

Per-frame slot reads ≤ eviction_scan_window.

A full table sweep — confirming every slot is ineligible — takes at most:

ceil(tracker_capacity / eviction_scan_window)

consecutive record() calls (the rotating cursor resumes where it stopped).

With defaults (capacity = 256, window = 256) this is 1 call. With --tracker-capacity 4096 --eviction-scan-window 16 the sweep takes 256 calls — each individual call still reads ≤ 16 slots, so the per-frame beat-path cost stays bounded.

The varta_tracker_eviction_scan_window_max gauge (set once at startup) exposes the configured window so dashboards can derive the worst-case sweep depth. Operators alert on varta_tracker_eviction_scan_truncated_total to detect when the cap engages under a unique-pid flood.

Combining this bound with the iteration-budget WCET derivation above gives the full per-iteration worst case, including eviction work:

iteration_max ≤ read_timeout × N_listeners + serve_pending_max + eviction_scan_window × slot_read_ns

Tick-latency budget and hardware-watchdog margin

Bench-derived p99 cap

Under the canonical stress profile — 4096-slot tracker, balanced eviction policy, 30 agents × 100 Hz (≈ 3 000 beats/s) over UDS — the varta_observer_iteration_seconds p99 is ≤ 5 ms.

Run the bench to reproduce the measurement on your hardware:

cargo build --workspace --release --features prometheus-exporter
cargo run -p varta-bench --release -- tick-distribution

The bench asserts p99 ≤ 5 ms and exits non-zero if the cap is breached, printing the full bucket distribution and observed percentiles for triage. It also reports varta_tracker_eviction_scan_truncated_total and varta_observer_iteration_budget_exceeded_total so you can confirm the eviction-scan cap engages under the test load without blowing the latency budget.

Soft iteration budget

--iteration-budget-ms (default 250 ms) is the soft per-iteration ceiling. Overruns increment varta_observer_iteration_budget_exceeded_total but do not abort the loop. The default 250 ms gives 50× headroom over the 5 ms p99 cap; overruns therefore indicate genuine scrape-storm pressure, not normal active-load variance. See the “Latency budget” section for the full derivation.

Hardware-watchdog timeout floor

Operators deploying --hw-watchdog /dev/watchdog must configure the kernel watchdog device with a timeout of ≥ 30 s. The derivation:

| Margin factor | Value | Note |
|---|---|---|
| p99 iteration time | ≤ 5 ms | Bench-certified under canonical load |
| Iteration budget (soft) | 250 ms | Default; raise for higher --read-timeout-ms |
| Self-watchdog deadline | 4 s | Default; auto-set from $WATCHDOG_USEC |
| Recommended device timeout | ≥ 30 s | ≥ 6000× p99 cap, ≥ 7× self-watchdog deadline |

The observer kicks the hardware watchdog at the end of every poll iteration (after heartbeat-file write and sd_notify). A single missed kick cannot trip the device; a sustained stall of ≥ device-timeout will. The 30 s floor provides ample budget for:

  • Audit-log filesystem stalls (varta_log_suppressed_total{kind="audit_io"} will show rate limiting if these recur)
  • Prometheus scrape contention (serve_pending_seconds quantiles)
  • The H5 self-watchdog’s 4 s deadline with ≥ 7× margin

Round-robin fairness bound

Observer::poll() rotates the next_listener_start cursor on every non-WouldBlock receive. Per-listener worst-case admission delay is therefore bounded by N_listeners × per-listener-recv-cost. Under the canonical bench profile (single UDS listener) this is simply the UDS recv latency; with N additional UDP listeners add N × ~10 µs per iteration.

Eviction scan under stress

The bench will record non-zero varta_tracker_eviction_scan_truncated_total when the tracker fills and the 256-slot eviction window exhausts without finding a stalled slot. This is expected and by design — the cap proves the per-frame cost stays bounded even under a unique-pid flood. The p99 assertion holds even when the truncation counter is non-zero.


Debounce table semantics under load

The Recovery runner keeps a per-pid ledger of the most recent recovery fire (LastFiredTable). Each subsequent stall for the same pid is gated on now - last_fired[pid] >= debounce; closer-than-debounce stalls return RecoveryOutcome::Debounced and never spawn a child.

Capacity and eviction policy

The ledger is a fixed-size, array-backed table with capacity MAX_LAST_FIRED_CAPACITY = 4096. Capacity is sized to make the M8 adversarial-burst pattern costly: 4096 distinct pids would have to stall faster than debounce cadence before the eviction policy is engaged. Per-slot cost is Option<LastFiredSlot> ≈ 24 bytes → ~96 KiB total — within budget for the observer.

When the table is full and a stall arrives for a new pid, the policy is fail-closed:

  1. The oldest slot is identified by a single bounded linear scan.
  2. If that slot’s age is at least debounce, it is evicted and the new pid takes its place. Per-pid debounce semantics are preserved because the evicted pid’s window has already elapsed. The eviction is counted in varta_recovery_last_fired_evictions_total (operators tune capacity on this signal).
  3. If the oldest slot’s age is below debounce, the recovery is refused. The runner returns RecoveryOutcome::RefusedDebounceCapacity { pid }, emits a RefusedRecord { reason: "debounce_capacity" } to the audit log, and bumps both varta_recovery_outcomes_total{outcome="refused_debounce_capacity"} and varta_recovery_refused_total{reason="debounce_capacity"}.

Eviction is debounce-respecting churn; refusal is suppression. Operators tune capacity on the first signal and alert on the second.

Clock-regression defense

All age comparisons use Instant::saturating_duration_since, which returns Duration::ZERO on regression. ZERO-duration entries are treated as “not eligible for eviction” — preventing a backwards clock blip from auto-evicting the whole table.

# Alert immediately on any debounce-capacity refusal — this is either
# legitimate scale-out past 4096 concurrent stalls or the M8
# adversarial stall-burst pattern.  Either case warrants paging.
rate(varta_recovery_refused_total{reason="debounce_capacity"}[5m]) > 0
# Warn on sustained eviction churn — debounce semantics are still
# intact, but capacity is becoming a bottleneck under steady-state
# load.  Tune MAX_LAST_FIRED_CAPACITY or audit which pids are
# stalling.
rate(varta_recovery_last_fired_evictions_total[5m]) > 0.1
# Page on any non-zero invariant-violation count — the defensive
# fall-throughs in LastFiredTable should never fire in correct
# operation.  Non-zero values indicate a code bug, not load.
varta_recovery_invariant_violations_total > 0

Bounded-WCET guarantee

Every LastFiredTable operation is a linear scan over a fixed-size backing store. The unit test last_fired_table_prune_bounded_wcet asserts the prune sweep completes in under 5 ms in debug builds at full capacity (a future refactor that reintroduces O(n²) behaviour disguised as “cleanup” is caught by this test).

The pre-M8 HashMap-based implementation was the source of the debounce-bypass bug closed by this section: reactive pruning at the top of on_stall (prune_threshold = debounce * 10) left the map full of fresh entries under adversarial load, and the at_capacity branch skipped the debounce check entirely. The new table never skips the check; capacity pressure surfaces as a refusal or an audited eviction.


Cross-references

  • Safety profiles — compile-time vs. runtime feature gating for production-safe builds
  • VLP transports — transport-level trust classification
  • Peer authentication — kernel-level PID attestation
  • Verification — symbolic verification of Frame::decode (M7) and the LastFiredTable invariants on the verification roadmap

Recovery — Non-Blocking Spawn / Async Reap

Status: implemented (Sessions 01–03 completed). The --recovery-timeout-ms flag is live in varta-watch; see crates/varta-watch/src/config.rs and crates/varta-watch/src/recovery.rs.

1. Problem

varta-watch runs a single thread driving Observer::poll on a 100 ms read-timeout cadence. When a stalled pid crosses its silence threshold, the observer surfaces Event::Stall and the binary calls Recovery::on_stall(pid).

Today, Recovery::on_stall (crates/varta-watch/src/recovery.rs:71) shells out via Command::new("/bin/sh").arg("-c").arg(&rendered).status(). status() blocks the calling thread until the child exits, which means the entire poll loop — beat decoding, exporter pumping, Prometheus serving, stall surfacing for other pids — freezes for the duration of the recovery template. A misbehaving template (sleep 30, a slow restart script) effectively takes the observer offline.

This is blocker B1 for v0.1.0.

2. Goal

Replace the blocking shell-out with a non-blocking spawn followed by an asynchronous reap on subsequent observer ticks, and add an optional kill-after deadline so a runaway template cannot consume an unbounded recovery slot. All within the project’s hard constraints:

  • Zero registry dependencies in varta-watch (path-only deps).
  • No new threads. No tokio, no executors.
  • No unsafe. The crate already declares #![deny(unsafe_op_in_unsafe_fn, rust_2018_idioms)].
  • Library code does not print; diagnostics live in crates/varta-watch/src/main.rs only.

3. API surface (Session 01 lock-in)

The public surface in varta_watch::recovery becomes:

use std::process::ExitStatus;
use std::time::Duration;

#[derive(Debug)]
pub enum RecoveryOutcome {
    /// A child process was forked and is now outstanding. The observer
    /// has NOT waited on it. Reap on a later tick via `try_reap`.
    Spawned { child_pid: u32 },

    /// The previous invocation for this pid is still inside the per-pid
    /// debounce window; nothing was spawned.
    Debounced,

    /// `Command::spawn` failed before the shell could run (e.g. fork
    /// failure, `/bin/sh` missing). Surfaced verbatim.
    SpawnFailed(std::io::Error),

    /// A previously-`Spawned` child has exited and was reaped on this
    /// tick. The observer never blocks waiting for this transition.
    Reaped { child_pid: u32, status: ExitStatus },

    /// A previously-`Spawned` child exceeded `recovery_timeout` and was
    /// killed via `kill(2)` on this tick.
    Killed { child_pid: u32 },

    /// `try_wait` or `kill` failed for an outstanding child. The pid is
    /// still tracked; the observer will retry on the next tick.
    ReapFailed(std::io::Error),
}

pub struct Recovery { /* private */ }

impl Recovery {
    /// Backwards-compatible constructor. Equivalent to
    /// `with_timeout(template, debounce, None)`.
    pub fn new(template: String, debounce: Duration) -> Self;

    /// Construct a runner with an optional per-child deadline.
    ///
    /// `timeout = None` ⇒ children are reaped but never killed
    /// (preserves v0.1.0 semantics for users who tolerate long-running
    /// recovery templates).
    pub fn with_timeout(
        template: String,
        debounce: Duration,
        timeout: Option<Duration>,
    ) -> Self;

    /// Render `{pid}` and spawn `/bin/sh -c <rendered>` non-blockingly.
    /// Returns `Spawned`, `Debounced`, or `SpawnFailed` — never blocks.
    pub fn on_stall(&mut self, pid: u32) -> RecoveryOutcome;

    /// Drain completed (or deadline-exceeded) children for one observer
    /// tick. Returns one outcome per state transition observed:
    /// `Reaped`, `Killed`, or `ReapFailed`. Never blocks; returns an
    /// empty vector when no children have transitioned since the last
    /// tick.
    pub fn try_reap(&mut self) -> Vec<RecoveryOutcome>;
}

Config gains:

pub struct Config {
    /* existing fields */
    pub recovery_timeout: Option<Duration>,
}

The --recovery-timeout-ms <MS> flag is not parsed in Session 01 — that is Session 03’s deliverable. Session 01 only widens the type.

4. Lifecycle of one recovery

                    debounce-suppressed
                ┌──────────────► Debounced
                │
  Event::Stall ─┤                                  spawn ok
                │                              ┌────────────► Outstanding
                └─► Recovery::on_stall(pid) ───┤
                                               │ spawn err
                                               └────────────► SpawnFailed
                                                              (terminal)

  on every Observer tick:
      Recovery::try_reap()
         │
         ├─► child exited ─────► Reaped { child_pid, status }   (terminal)
         │
         ├─► deadline exceeded ─► kill(2) ─► Killed { child_pid } (terminal)
         │
         └─► try_wait/kill errno ─► ReapFailed(io::Error)        (retry)

Outstanding lives in a HashMap<u32, _> keyed by stalled pid (cold path; allocation acceptable per the operator rules). One outstanding child per stalled pid; if the pid stalls again while a child is still outstanding, the per-pid debounce window suppresses a duplicate spawn.

5. Tick budget

The observer’s READ_TIMEOUT is 100 ms. try_reap is invoked once per Observer::poll iteration (Session 02 owns the wiring). Worst-case latencies:

| Event | Latency upper bound |
|---|---|
| Successful child → Reaped surfaces | one tick (≤ 100 ms) after exit |
| Deadline exceeded → Killed surfaces | one tick (≤ 100 ms) after deadline |
| kill(2) → Reaped of killed child | one further tick (≤ 100 ms) |

These are additive with the observer’s normal stall-detection latency; they do not affect beat decoding or exporter throughput on the critical path.

6. Default behaviour when --recovery-timeout-ms is omitted

Config::recovery_timeout = None is the default. In that mode, Recovery::with_timeout stores no deadline; outstanding children are reaped on completion but are never killed. This preserves v0.1.0 semantics for operators whose recovery templates are intentionally long-running (e.g. service restarts that block on health checks).

Operators who want the kill-after behaviour set --recovery-timeout-ms <MS> explicitly. Sub-100 ms values still work but the kill is surfaced no faster than one tick after the deadline.

7. Concurrency model

  • Children are pid-indexed in HashMap<u32, Outstanding>. The observer’s Tracker is bounded to 64 distinct pids, so the map caps at 64 outstanding children in steady state.
  • Debounce is per-pid and unchanged. A repeat stall for the same pid inside the debounce window returns Debounced regardless of whether a child is still outstanding.
  • No locks; the Recovery struct is owned exclusively by the binary’s poll loop and is !Send by virtue of holding std::process::Child values, which is fine since the observer is single-threaded.

8. Out of scope for this epic

  • varta-vlp (frame ABI is frozen).
  • varta-client (no agent-side change).
  • Observer poll cadence (still 100 ms read timeout).
  • Exporter line schema.
  • Panic-handler feature.

9. Cross-references

  • Session 02 (docs/claude-sessions/recovery-async-spawn/session-02-recovery-impl.md) owns the green-phase implementation in crates/varta-watch/src/recovery.rs and the try_reap wiring in crates/varta-watch/src/main.rs / observer.rs.
  • Session 03 (docs/claude-sessions/recovery-async-spawn/session-03-cli-and-loop-integration.md) owns the --recovery-timeout-ms parser, the HELP-text update, and threading cfg.recovery_timeout into Recovery::with_timeout at the binary call site.
  • Acceptance contract: docs/acceptance/varta-v0-1-0.md, subsection Recovery — non-blocking.

10. Failing tests gating Sessions 02 and 03

Session 01 lands these as red-phase acceptance tests:

| Test | File | Owned by |
|---|---|---|
| recovery_spawn_returns_within_50ms_for_slow_template | crates/varta-watch/tests/recovery_e2e.rs | Session 02 |
| recovery_try_reap_yields_reaped_for_completed_child | crates/varta-watch/tests/recovery_e2e.rs | Session 02 |
| recovery_try_reap_kills_after_timeout | crates/varta-watch/tests/recovery_e2e.rs | Session 02 |
| recovery_concurrent_pids_run_in_parallel | crates/varta-watch/tests/recovery_e2e.rs | Session 02 |
| cli_help_lists_recovery_timeout_ms_flag | crates/varta-watch/tests/cli_smoke.rs | Session 03 |
| cli_parses_recovery_timeout_ms | crates/varta-watch/tests/cli_smoke.rs | Session 03 |

Peer Authentication

Varta’s observer trusts the kernel, not the wire. Two layers of defence in depth ensure that process identity cannot be spoofed by anything that can reach the Unix Domain Socket.

Layer 1: socket file permissions (--socket-mode)

After bind(2), the observer chmods the socket file to 0600 by default (owner read and write only). Only processes running under the same UID as the observer can connect(2) to the socket.

| Flag | Default | Format | Behaviour |
|---|---|---|---|
| --socket-mode | 0600 | Octal (e.g. 0660) | File mode applied via chmod(2) after bind. Pass 0660 to allow group access. |

Layer 2: kernel credential verification

Linux

The observer sets SO_PASSCRED on the socket after binding. Every recvmsg(2) call then receives a SCM_CREDENTIALS ancillary message containing a struct ucred { pid, uid, gid } populated by the kernel. The observer compares ucred.pid against frame.pid from the VLP wire format. If they disagree the frame is silently dropped and varta_frame_auth_failures_total is incremented. The ucred.uid field is implicitly trusted by Layer 1 (--socket-mode 0600 already restricts access to the owning UID), but could be checked as a fail-safe if a permission bypass is ever discovered.

macOS

On macOS, the observer first attempts getsockopt(LOCAL_PEERTOKEN) immediately after each recvmsg(2). LOCAL_PEERTOKEN returns an audit_token_t containing the sender’s PID, UID, GID, and audit information. Because the observer is single-threaded and calls getsockopt immediately after recvmsg, no other datagram can arrive between the two syscalls.

When LOCAL_PEERTOKEN succeeds, the observer performs the same PID + UID verification as on Linux. When it fails (e.g. on older macOS versions or unconnected SOCK_DGRAM where the kernel doesn’t expose per-datagram credentials), the observer falls back to two separate getsockopt calls:

  1. LOCAL_PEERPID (0x0002) — returns the peer’s PID directly.
  2. LOCAL_PEERCRED (0x0001) — returns a struct xucred with the peer’s UID in cr_uid.

If the fallback also fails, the observer records the sentinel PID 0 and relies on --socket-mode 0600 as the primary defence.

FreeBSD, DragonFly BSD, NetBSD

On FreeBSD-family platforms, the observer sets LOCAL_CREDS on the socket (value 0x0002 on FreeBSD/DragonFly, 0x0001 on NetBSD). Every recvmsg(2) then receives a SCM_CREDS ancillary message containing a struct cmsgcred { cmcred_pid, cmcred_uid, cmcred_euid, cmcred_gid, ... } populated by the kernel. The observer extracts cmcred_pid and cmcred_euid and performs the same PID + UID verification as on Linux.

The ancillary buffer is sized at 256 bytes — sufficient for the 84-byte cmsgcred with generous headroom for future kernel extensions.

Note: On platforms other than Linux, macOS, FreeBSD, DragonFly, and NetBSD (OpenBSD, Solaris, illumos, etc.), varta-watch emits a startup warning via stderr: "per-datagram PID verification is unavailable. The only defence is --socket-mode (default 0600); any process under the same UID can impersonate any PID." This is by design — the kernel does not expose per-datagram peer credentials for unconnected SOCK_DGRAM on these platforms. Containers that run multiple processes under the same UID should be aware of this limitation.

UDP transport authentication

For network-based agents that emit beats over UDP, the trust model is cryptographic, not kernel-attested. UDP has no peer-credential mechanism on any platform — recvmsg(2) cannot tell the observer who sent a datagram, only where it claims to be from. Varta therefore requires authentication at the AEAD layer, and refuses to bind an unauthenticated UDP listener without two layers of explicit opt-in.

Compile-time features (crates/varta-watch/Cargo.toml)

| Cargo feature | What it enables | Production posture |
|---|---|---|
| secure-udp | SecureUdpListener (ChaCha20-Poly1305 AEAD + per-sender replay) | Recommended |
| unsafe-plaintext-udp | UdpListener (no authentication) | Forbidden in production |
| udp-core | Internal — shared UDP socket wiring | (transitive) |

A build that does not include unsafe-plaintext-udp cannot link the plaintext path at all. Passing --udp-port without keys to such a build hard-errors at startup; there is no warn-and-continue path.

Runtime selection rules

When --udp-port is set, the observer chooses exactly one listener:

  1. If --features secure-udp is compiled in and --key-file / --master-key-file resolve to a usable key, bind SecureUdpListener.
  2. Otherwise, only the plaintext path remains. It is bound only if both --features unsafe-plaintext-udp is compiled in and --i-accept-plaintext-udp was passed on the command line.
  3. Any other configuration is a hard error (InvalidInput).

When the plaintext path is taken, a high-visibility varta_warn! is emitted at startup naming the bound address, so the choice appears in SIEM / syslog logs:

UDP on <addr> is running WITHOUT authentication (--i-accept-plaintext-udp). Any device with network reach to this port can inject heartbeats, suppress stall detection, or trigger false recovery commands. NOT for production / safety-critical use.

--i-accept-plaintext-udp is intentionally verbose: an operator who types it is making an explicit statement that this build is for development or testing, not for a hospital VLAN.

Why no kernel-level UDP credentials

Unix Domain Sockets carry SCM_CREDENTIALS / LOCAL_PEERTOKEN / SCM_CREDS per-datagram. UDP carries none of those. Even on a single host where --udp-bind-addr 127.0.0.1 is used, any local process can send to that port — there is no equivalent of --socket-mode 0600 for network sockets. AEAD is the only durable defence.

Recovery eligibility and transport-origin gating

Recovery commands (--recovery-cmd / --recovery-exec and the *-file variants) take the stalled agent’s frame.pid and substitute it into the spawned process (kill -9 {pid}, systemctl restart agent@{pid}.service, etc.). That makes recovery a privileged action that targets an arbitrary process by id — and means the wire-level frame.pid must be tied back to the real sending process, not just to whoever holds an AEAD key.

The trust invariant

A recovery command MUST NEVER fire for a pid whose beat lifetime is not kernel-attested. In practice that means:

| Transport | Kernel-attested? | Recovery-eligible by default? |
|---|---|---|
| UDS | Yes — SO_PASSCRED / LOCAL_PEERTOKEN / SCM_CREDS | Yes |
| Plaintext UDP | No — peer_pid is always 0 | No |
| Secure UDP | No — the frame is cryptographically authenticated, but the kernel does not attest the sending process; a holder of the AEAD key (or a per-agent key derived from a leaked master key) can forge a beat for any pid | No |

Internally each beat is tagged with a BeatOrigin (KernelAttested vs NetworkUnverified). The tracker pins the origin on the slot’s first beat and rejects subsequent beats from a different origin as Event::OriginConflict (counter: varta_origin_conflict_total). First-origin-wins prevents an attacker on an untrusted transport from “tainting” a slot that legitimately belongs to a kernel-attested agent.

Two-layer enforcement

  1. Startup hard-error. If any --recovery-cmd / --recovery-cmd-file / --recovery-exec / --recovery-exec-file is configured and --udp-port is set, the daemon refuses to start with ConfigError::RecoveryRequiresAuthenticatedTransport. Operators must pass --i-accept-recovery-on-unauthenticated-transport to proceed. The flag is verbose by design (matches the --i-accept-<risk> convention) and shows up in cargo tree / startup banners.

  2. Runtime origin gate. Even with the accept flag, Recovery::on_stall refuses to spawn the recovery command when the stalled slot’s pinned origin is NetworkUnverified. The refusal returns the typed RecoveryOutcome::RefusedUnauthenticatedSource { pid }, increments varta_recovery_refused_total{reason="unauthenticated_transport"}, and emits a structured refused record into the recovery audit log (--recovery-audit-file). To enable UDP-origin recovery the operator must construct the Recovery with with_allow_unauthenticated_source(true) — a second, conscious choice on top of the startup flag.

Why secure-UDP isn’t enough

The secure-UDP master-key mode binds frame.pid to the 4-byte PID prefix in iv_random[0..4] and derives a per-agent key from the master key. That is a useful cryptographic binding for the UDP threat model — a holder of a single derived agent key cannot forge frames for other pids. But the binding lives at the protocol layer, not at the kernel layer:

  • A leak of the shared key lets anyone forge any pid.
  • A leak of the master key lets anyone derive any agent key.
  • A leak of any per-agent key still lets the holder forge beats for that one pid — e.g. suppress real beats so that a stall triggers recovery against the agent’s own pid during a legitimate maintenance window.

Kernel attestation has no such failure mode: the kernel knows which process owns the socket fd, and that knowledge cannot be forged by any amount of key material. This is why Varta classifies all UDP variants (plain and secure) as NetworkUnverified for the recovery-eligibility decision.

Recovery command authentication boundary

--recovery-cmd (inline shell) and --recovery-cmd-file (file-based shell) both spawn /bin/sh -c <template> with the observer’s full process authority. In a safety-critical deployment a recovery template like systemctl restart {service} or kill -9 {pid} can terminate unrelated production processes if the template body is mis-edited or if shell metacharacters appear unexpectedly.

To prevent accidental shell-mode deployment, shell mode requires --i-accept-shell-risk at runtime. Without that flag, startup hard-errors with a message that recommends --recovery-exec (which calls execvp(2) directly — no shell, no metacharacter interpretation, no injection surface). This applies to both the inline and file-based forms; the shell-injection risk is identical regardless of where the template comes from.

--recovery-exec and --recovery-exec-file do not require an accept flag — they are the default-safe path.

Prometheus /metrics endpoint exposure

The /metrics endpoint is HTTP/1.0 with mandatory bearer-token authentication. When --prom-addr is set, --prom-token-file is required: the observer refuses to start without it. Every scrape must send Authorization: Bearer <hex> where <hex> is the lowercase 64-character hex encoding of the file’s 32 random bytes (the format produced by openssl rand -hex 32). Missing or wrong tokens get HTTP/1.0 401 Unauthorized and bump varta_prom_auth_failures_total.
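For illustration, the mapping from the token file’s 32 raw bytes to the <hex> value a scraper presents is plain lowercase hex. The helper below is hypothetical, not the exporter’s code:

```rust
// Hypothetical helper: lowercase hex encoding of raw token bytes.
// 32 random bytes become the 64-character string sent after "Bearer ".
fn to_lower_hex(bytes: &[u8]) -> String {
    use std::fmt::Write;
    let mut s = String::with_capacity(bytes.len() * 2);
    for b in bytes {
        // {:02x} zero-pads each byte to two lowercase hex digits.
        write!(s, "{:02x}", b).expect("writing to a String cannot fail");
    }
    s
}
```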

The token file is loaded through the same hardened validator that guards --key-file (see “Secret-file validation” below): regular file, no symlinks, owned by the observer UID, mode 0o600 or stricter, opened with O_NOFOLLOW.

The endpoint also retains four DoS-protection layers from earlier work, so that a hostile scraper cannot exhaust file descriptors or starve the observer’s poll loop even before the auth check runs:

  1. Serve budget — at most PROM_MAX_CONNECTIONS_PER_SERVE=8 accepted connections per outer poll tick, and a 100 ms wall-clock deadline.
  2. Drain budget — after the serve budget is exhausted, an additional PROM_MAX_DRAIN_PER_SERVE=50 connections may be accepted and immediately closed, so the kernel accept queue does not back up.
  3. Per-source-IP token bucket — every accepted connection (in both serve and drain phases) decrements a per-IP token bucket sized by --prom-rate-limit-burst (default 10) and refilled at --prom-rate-limit-per-sec (default 5). Connections from an IP whose bucket is empty are closed without serving and counted as varta_prom_connections_dropped_total{reason="rate_limit"}.
  4. Per-IP table cap — the per-IP map is bounded to 1024 entries; when full, stale entries (no activity in 60 s) are evicted first, then if necessary the oldest entry is force-evicted and counted as varta_prom_connections_dropped_total{reason="ip_table_full"}.
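Layer 3 is a classic token bucket. The sketch below uses the documented defaults (burst 10, refill 5/sec are the flag defaults; the burst-2/refill-1 values in the usage note are just for illustration); the field layout and fractional-second clock are assumptions, not the observer’s actual implementation:

```rust
// Illustrative per-IP token bucket: allows `burst` connections at once,
// then refills at `refill_per_sec` tokens per second up to the burst cap.
struct TokenBucket {
    tokens: f64,
    burst: f64,
    refill_per_sec: f64,
    last_refill_s: f64, // monotonic-clock seconds
}

impl TokenBucket {
    fn new(burst: u32, refill_per_sec: u32, now_s: f64) -> Self {
        Self {
            tokens: burst as f64, // a fresh bucket starts full
            burst: burst as f64,
            refill_per_sec: refill_per_sec as f64,
            last_refill_s: now_s,
        }
    }

    /// Returns true if the connection may be served; false means the
    /// bucket is empty and the connection should be closed unserved.
    fn try_take(&mut self, now_s: f64) -> bool {
        let elapsed = (now_s - self.last_refill_s).max(0.0);
        // Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.burst);
        self.last_refill_s = now_s;
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```

With burst 2 and refill 1/sec, two back-to-back connections are served, a third is dropped, and one more token becomes available a second later.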

Token comparison is constant-time

The exporter compares the presented and expected tokens via varta_vlp::ct_eq — the same constant-time XOR-and-OR routine that guards Poly1305 tag verification. This prevents byte-by-byte timing oracles from leaking the prefix of the token to a remote scraper.
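The XOR-and-OR pattern looks like this — a sketch in the spirit of varta_vlp::ct_eq, whose real signature and internals may differ:

```rust
// Constant-time equality sketch: accumulate all byte differences with
// XOR/OR instead of returning at the first mismatch, so execution time
// does not reveal the position of the first wrong byte.
fn ct_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false; // lengths are public here (token length is fixed)
    }
    let mut diff: u8 = 0;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y; // non-zero iff any byte pair differs
    }
    diff == 0
}
```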

Bind-address recommendation

The bearer token is the authoritative authentication boundary. Loopback bind (127.0.0.1:<port> or [::1]:<port>) behind a reverse proxy remains the recommended posture for defense in depth, but is no longer the only defense. The observer still emits a startup varta_warn! whenever the bound address is non-loopback, so the exposure is visible in audit logs.

Prometheus scrape config

The standard authorization: block injects the bearer token verbatim:

scrape_configs:
  - job_name: 'varta'
    static_configs:
      - targets: ['varta-host:9100']
    authorization:
      type: Bearer
      credentials_file: /etc/prometheus/varta-prom.token

The credentials_file should be the same content as --prom-token-file on the observer; Prometheus reads it with the same 0600-or-stricter expectation.

Secret-file validation

Every file containing key material — --key-file, --accepted-key-file, --master-key-file, and the new --prom-token-file — flows through validate_secret_file in varta-watch/src/config.rs. The validator enforces:

  1. The path is not a symlink (symlink_metadata + is_symlink).
  2. The path resolves to a regular file (not a directory, FIFO, block/char device, etc.).
  3. The mode is 0o600 or stricter (mode & 0o077 == 0).
  4. The file is owned by the observer’s UID (kernel-attested via stat.uid, not derived from the env).
  5. The file is opened with O_NOFOLLOW to close the TOCTOU window between the metadata check and the read.

A failure on any of these aborts startup with a typed ConfigError naming the failing constraint (insecure permissions ..., must not be a symlink, owned by uid X, expected uid Y, etc.).
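A condensed sketch of those five checks, assuming the observer’s UID is passed in as a parameter. The real validate_secret_file returns typed ConfigError values rather than io::Error, and the O_NOFOLLOW constant would come from the project’s own FFI layer:

```rust
use std::fs::{self, File, OpenOptions};
use std::io;
use std::os::unix::fs::{MetadataExt, OpenOptionsExt};
use std::path::Path;

// Linux value of O_NOFOLLOW, inlined for illustration only.
const O_NOFOLLOW: i32 = 0o400000;

fn validate_secret_file(path: &Path, observer_uid: u32) -> io::Result<File> {
    // 1-2. lstat: never follows a symlink, and exposes the file type.
    let meta = fs::symlink_metadata(path)?;
    if meta.file_type().is_symlink() {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "must not be a symlink"));
    }
    if !meta.file_type().is_file() {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "must be a regular file"));
    }
    // 3. 0o600 or stricter: no group/other permission bits set.
    if meta.mode() & 0o077 != 0 {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "insecure permissions"));
    }
    // 4. Owned by the observer's UID (kernel-attested via stat).
    if meta.uid() != observer_uid {
        return Err(io::Error::new(io::ErrorKind::PermissionDenied, "wrong owner"));
    }
    // 5. O_NOFOLLOW closes the TOCTOU window between the lstat and the open.
    OpenOptions::new().read(true).custom_flags(O_NOFOLLOW).open(path)
}
```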

Why environment-variable keys are gone

Earlier releases offered --key-env <NAME> as a key-source fallback. That flag is removed. Passing it now returns ConfigError::RemovedFlag with an inline migration hint pointing at --key-file. The motivation:

  • On Linux, /proc/<pid>/environ is readable by any process running under the same UID; a peer with a UDS connection to the observer (which already has UID-restricted access) can read the master key out of the observer’s own environment.
  • In containers, docker inspect <container> exposes every environment variable to anyone with read access to the Docker socket — typically all members of the docker group, which is often a superset of the in-container UID.
  • systemd-journald captures process environment on demand for crash reports; an env-var key ends up in /var/log/journal indefinitely.

File-based keys avoid all three exposures and slot into the same ownership/permission model as TLS private keys, SSH host keys, and any other long-lived secret an operator already knows how to manage.

The Key type in varta_vlp::crypto also lost its Copy derive and gained a Drop impl that volatile-zeros the secret bytes before the allocation is returned to the stack, closing a small but real leak surface in core dumps and ASLR-defeated speculative reads.

Shutdown grace and systemd

--shutdown-grace-ms (default 5000, minimum 100) bounds the time Recovery::drop blocks waiting for outstanding recovery children to exit after issuing SIGKILL during shutdown. Children that outlive the grace are abandoned to PID 1 for reaping; the observer process exits either way, so the bound on shutdown latency is deterministic.

In a systemd unit, TimeoutStopSec must be at least shutdown_grace_ms + 2 s (roughly: grace + reap margin) to ensure that systemd does not SIGKILL the observer mid-grace and leak an unreaped recovery child:

[Service]
Environment=VARTA_SHUTDOWN_GRACE_MS=5000
ExecStart=/usr/local/bin/varta-watch --shutdown-grace-ms ${VARTA_SHUTDOWN_GRACE_MS} ...
TimeoutStopSec=7s
KillMode=mixed

KillMode=mixed is recommended: systemd sends SIGTERM to the main observer process only; the observer then runs its own Drop sequence to kill+reap any recovery children it had spawned. This is what the shutdown-grace tunable is designed around.

Recovery command environment isolation

When --recovery-env KEY=VALUE is specified (repeatable), the recovery child process runs with a sanitized environment:

  1. The child’s environment is cleared entirely.
  2. PATH is set to /usr/bin:/bin (sufficient to locate common tools).
  3. Only the explicitly-listed KEY=VALUE pairs are exported.

Without --recovery-env, the child inherits the observer’s full environment (backward compatible). This flag provides defense-in-depth against environment-variable-based injection vectors (e.g. a malicious LD_PRELOAD or IFS in the observer’s environment that could affect /bin/sh -c behaviour).
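The three steps above can be sketched with std::process::Command; the daemon’s own fork/exec path differs, but the environment semantics are the same:

```rust
use std::process::Command;

// Sketch of the sanitized-environment spawn: clear everything, set a
// minimal PATH, then export only the explicitly listed pairs.
fn sanitized_command(program: &str, pairs: &[(&str, &str)]) -> Command {
    let mut cmd = Command::new(program);
    cmd.env_clear();                  // 1. wipe the inherited environment
    cmd.env("PATH", "/usr/bin:/bin"); // 2. minimal PATH for common tools
    for (k, v) in pairs {
        cmd.env(k, v);                // 3. only explicit KEY=VALUE pairs
    }
    cmd
}
```

Spawning `/usr/bin/env` through this helper shows only PATH plus the listed pairs — a leaked LD_PRELOAD or IFS in the observer’s environment never reaches the child.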

Shell-mode recovery is gated by --i-accept-shell-risk at startup (see the “Recovery command authentication boundary” section above). When the flag is set, the observer still emits a single audit-trail varta_warn! at startup so that the choice is captured in any SIEM / syslog ingest alongside the other startup banners.

Template safety

The {pid} substitution in --recovery-cmd is safe regardless of the authentication outcome. A u32 PID formatted as a decimal string contains only the characters 0–9 and can never carry shell metacharacters (;, |, &, $, `, etc.).
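As a sketch (the actual template engine is not shown in this chapter), the substitution is plain string replacement over a decimal-formatted u32:

```rust
// Illustrative only: {pid} becomes the decimal form of a u32, which by
// construction contains only the characters 0-9.
fn substitute_pid(template: &str, pid: u32) -> String {
    template.replace("{pid}", &pid.to_string())
}
```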

Metrics

| Metric | Type | Description |
|---|---|---|
| varta_frame_auth_failures_total | counter | Incremented every time a frame’s claimed PID does not match the kernel-verified sender PID (Linux only). |
| varta_beats_total{pid="..."} | counter | Per-PID total of accepted beats (only incremented after authentication passes). |
| varta_prom_connections_dropped_total{reason="..."} | counter | /metrics connections accepted but closed before serving. Reasons: drain (serve budget exhausted), rate_limit (per-IP token bucket empty), ip_table_full (per-IP state map force-evicted). |
| varta_prom_auth_failures_total | counter | /metrics scrapes that arrived without Authorization: Bearer <hex> or with a wrong token. Always emitted on every scrape (even at zero), so absent() alert rules stay green-on-green until the first incident. |
| varta_recovery_refused_total{reason="..."} | counter | Recovery commands NOT spawned because of a structural safety gate. Only reason currently defined: unauthenticated_transport (stalled slot’s pinned origin was NetworkUnverified and the operator did not enable UDP-origin recovery). Emitted at zero on every scrape. |
| varta_origin_conflict_total | counter | Beats dropped because the slot’s pinned transport origin disagreed with the beat’s origin (first-origin-wins). Non-zero values indicate either operator misconfiguration (same pid emitted from two transports) or an active spoofing attempt. |

Trust model summary

Process ── connect(2) to UDS ──┐
                               ├─ [FAIL]  Kernel blocks (Layer 1: --socket-mode 0600, wrong UID)
                               ├─ [PASS]  Layer 2: SO_PASSCRED → ucred.pid (Linux)
                               │          Layer 2: LOCAL_PEERTOKEN → audit_token.pid (macOS, best-effort)
                               │          Layer 2: LOCAL_CREDS → cmsgcred.pid (FreeBSD, DragonFly, NetBSD)
                               │          ├─ [PID MISMATCH] → Drop frame + bump counter
                               │          ├─ [UID MISMATCH] → Drop frame as IoError
                               │          └─ [PID MATCH + UID MATCH] →
                               ↓
                          [SUCCESS]  Observer trusts the PID → tracks,
                                     surfaces stalls, triggers --recovery-cmd
                                     with {pid} substitution.

The trust boundary is the kernel: a frame is only accepted if the kernel attests that the sending process’s PID matches the one encoded in the VLP frame and that the sending process runs under the observer’s UID. On Linux this is enforced per-datagram via SO_PASSCRED; on macOS via getsockopt(LOCAL_PEERTOKEN) with LOCAL_PEERPID/LOCAL_PEERCRED fallback; on FreeBSD / DragonFly / NetBSD via LOCAL_CREDS + SCM_CREDS. Platforms without kernel-level credential passing fall back to --socket-mode 0600.

Security limitations

No forward secrecy

The KDF derives per-agent and per-epoch keys from a single master key. An epoch key can decrypt frames from past epochs if the agent key is compromised. True forward secrecy requires bidirectional ephemeral key exchange (e.g. X25519), which is incompatible with the connectionless, one-way heartbeat model.

When the master key is rotated, all agents must be updated atomically. The observer reads the master key once at startup from --master-key-file. To rotate keys, restart the observer with the new master key file. SIGHUP-based hot-reload is planned for a future release.

Panic-hook entropy policy (secure UDP)

install_panic_handler_secure_udp reads 8 bytes of cryptographic entropy at install time (getrandom(2) on Linux, getentropy(3) on macOS/BSD, falling back to /dev/urandom). The IV is pre-computed once so that no file I/O occurs inside the panic handler itself (async-signal-safety).

Fail-closed default: if all entropy sources fail — common in chrooted environments without a mounted /dev — the function returns Err(PanicInstallError::EntropyUnavailable) and the hook is NOT registered. This prevents a panic-time Critical frame from reusing a deterministic IV under the same AEAD key, which would be a catastrophic nonce-reuse failure.

Degraded-entropy opt-in: use install_panic_handler_secure_udp_accept_degraded_entropy to fall back to a non-cryptographic IV derived from PID, TID, monotonic time, and a counter (SipHash-2-4). This always succeeds but accepts nonce-reuse risk if the process panics more than once. The verbose function name is intentional structural enforcement matching the project’s --i-accept-<risk> convention.

Little-endian only

The VLP wire format uses little-endian integer encoding natively. Protocol correctness depends on the host being little-endian (all tier-1 targets — x86_64 and aarch64 — satisfy this). Building on a big-endian host is a compile error. See book/src/architecture/vlp-frame.md for design rationale.

Panic-hook key lifetime — accepted residual

The secure-UDP panic handler (install_panic_handler_secure_udp, install_panic_handler_secure_udp_accept_degraded_entropy) captures a Key by move into a Box<dyn Fn> registered via std::panic::set_hook. The Box is the single owner of the captured Key for the lifetime of the process — Key is !Clone (see crates/varta-vlp/src/crypto/mod.rs), so no duplicate of the secret bytes can exist anywhere else in the address space.

The !Clone invariant pins the count of in-memory copies to one. The remaining concern is the lifetime of that one copy on process exit:

  • Normal hook replacement (std::panic::take_hook): the prior Box is dropped, the captured Key’s ZeroizeOnDrop fires, and the 32 secret bytes are wiped before the heap page is returned to the allocator. OK.
  • panic = "unwind" profile, normal process exit: the panic-hook Box is leaked by the runtime — Drop is not called on registry-held objects at exit. The captured Key bytes persist in heap memory until the kernel reclaims the page. Linux does not zero pages on reclaim (memory contents are reused; zero-on-allocation guarantees apply only to new allocations into the same process).
  • panic = "abort" profile: the panic-hook closure never runs, but set_hook still owns the Box — same residual as the normal-exit case. Additionally, no Drop runs anywhere during abort().

This residual is accepted: there is no async-signal-safe mechanism that can reliably wipe a heap-resident secret at process exit. atexit handlers do not run on abort(), are not async-signal-safe, and race the panic hook firing. mlock / memfd_secret cannot prevent the kernel from copying the page during scheduler context switches or core dumps. The minimum-surface design is to keep the captured Key alive in a single Box and treat the OS process boundary as the security boundary: inspecting the memory of a live process requires ptrace or /proc/<pid>/mem privileges, at which point all in-memory secrets in any design are accessible.


Cross-references

  • Safety profiles — compile-time feature gating for dangerous recovery paths; production-safe build verification recipe
  • Observer liveness — defending against varta-watch itself crashing or hanging
  • VLP transports — transport-level trust classification and BeatOrigin semantics

PID-namespace semantics

Varta agents and the varta-watch observer can run on the same host but in different Linux PID namespaces (typical when agents run in containers and the observer on the host, or vice-versa). This document defines what the protocol does in that case, why, and how operators configure it.

Problem statement

std::process::id() (called by Varta::beat()) returns the agent’s PID in the calling process’s PID namespace (see pid_namespaces(7)). The observer’s kernel-attested peer PID (SO_PASSCRED / LOCAL_PEERTOKEN / SCM_CREDS) is the PID as seen from the observer’s namespace.

Two consequences when namespaces differ:

  1. The numeric pid is meaningless across the boundary. PID 17 in container A is a different process from PID 17 on the host. kill(2) against PID 17 in the observer’s namespace targets the observer-namespace process, not the agent.
  2. Collisions are guaranteed. Every container’s first process is PID 1. Two containerized agents binding the same observer socket will both claim PID 1.

Threat model

| Scenario | Risk |
|---|---|
| Host observer, host agents | None. |
| Host observer, agent in --pid=host container | None — agent uses host PIDs. |
| Host observer, agent in private-PID container | Cross-namespace: kill targets the wrong process. |
| Two private-PID containers, shared observer | PID collisions: containers claim the same pid. |
| Container observer, host agents | Cross-namespace. |

Detection

On Linux, every process’s PID namespace has a unique inode exposed at /proc/<pid>/ns/pid (stat(1) it, or readlink(1) for the canonical pid:[NNNN] form). Two processes share a PID namespace iff their /proc/<pid>/ns/pid symlinks resolve to the same inode.

varta-watch caches its own inode at startup (crate::peer_cred::observer_pid_namespace_inode()) and, for every kernel-attested beat, reads the peer’s inode (crate::peer_cred::read_pid_namespace_inode(peer_pid)). Both helpers are allocation-free; the per-beat read is one readlink(2) syscall into a stack buffer (sub-microsecond on modern Linux).

Non-Linux platforms (macOS, BSD) return None from both helpers and the comparison short-circuits to “match”. UDP listeners set peer_pid_ns_inode = None because there is no kernel attestation; the existing UDP recovery refusal gate is the relevant protection there.
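On Linux, the detection can be sketched as below. The real helpers are allocation-free and use raw readlink(2) into a stack buffer; this version leans on std and allocates, and the function names are only loosely modeled on the ones above:

```rust
use std::fs;

/// Parses the inode out of the canonical `pid:[NNNN]` readlink form.
fn parse_ns_inode(link: &str) -> Option<u64> {
    link.strip_prefix("pid:[")?.strip_suffix(']')?.parse().ok()
}

/// PID-namespace inode of `pid` as seen by this process, or None if
/// /proc is unavailable, the peer already exited, or access is denied.
fn pid_namespace_inode(pid: u32) -> Option<u64> {
    let link = fs::read_link(format!("/proc/{pid}/ns/pid")).ok()?;
    parse_ns_inode(link.to_str()?)
}

/// Two processes share a PID namespace iff their inodes match.
fn same_pid_namespace(a: u32, b: u32) -> Option<bool> {
    Some(pid_namespace_inode(a)? == pid_namespace_inode(b)?)
}
```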

Mitigation by deployment style

| Deployment | Default behaviour | Operator action |
|---|---|---|
| Single namespace (host or container) | Pass-through. | None. |
| Containerized agents with --pid=host | Pass-through (same kernel-attested ns). | None. |
| Containerized agents with private PID namespace | Beats dropped at receive; recovery refused. Audit log shows reason=cross_namespace_agent. | Either fix the deployment (run agents with --pid=host) or accept the risk via --allow-cross-namespace-agents and arrange out-of-band PID translation in the recovery template. |
| Mixed: some agents same-ns, some cross-ns | Same-ns agents work; cross-ns agents refused and audit-logged. | Same as above; the gate is per-beat. |
| Operator wants fail-fast on misconfiguration | Defaults silently drop and audit. | Pass --strict-namespace-check — the daemon exits non-zero on the first cross-ns beat. |

Audit and metrics inventory

| Surface | Linux signal |
|---|---|
| varta_frame_namespace_mismatch_total (counter) | Kernel-attested frames dropped at receive (peer ns ≠ observer ns). |
| varta_tracker_namespace_conflict_total (counter) | Beats dropped because the slot’s pinned ns inode disagreed with the beat’s (first-namespace-wins). |
| varta_recovery_refused_total{reason="cross_namespace_agent"} (counter) | Stalls refused at recovery time because the slot’s ns inode differed from the observer’s. |
| varta_recovery_outcomes_total{outcome="refused_cross_namespace"} (counter) | Same event, broken down on the outcome axis. |
| Audit log record with reason=cross_namespace_agent | TSV record in --recovery-audit-file. |
| Event::NamespaceConflict | Emitted to consumers via Observer::poll() so file/Prom exporters can record it. |

All counters are emitted at every scrape even at zero, so absent() alert rules stay green-on-green until the first event.

API surface

  • Observer::observer_pid_namespace_inode() -> Option<u64> — returns the observer’s cached PID-namespace inode (Linux only).
  • Observer::with_allow_cross_namespace(bool) -> Self — opt out of the default refuse-and-audit behaviour. Wired from --allow-cross-namespace-agents.
  • Observer::drain_cross_namespace_drops() -> u64 — counter drain.
  • Observer::drain_namespace_conflicts() -> u64 — counter drain.
  • Tracker::pid_ns_inode_of(pid: u32) -> Option<Option<u64>> — observer-side introspection.
  • Recovery::with_allow_cross_namespace(bool) -> Self — same opt-out at the recovery layer.
  • Recovery::on_stall(pid, origin, cross_namespace_agent: bool) — caller-supplied cross-ns flag (typically derived from Event::Stall::pid_ns_inode vs Observer::observer_pid_namespace_inode()).
  • Recovery::take_refused_cross_namespace() -> u64 — counter drain.
  • RecoveryOutcome::RefusedCrossNamespace { pid } — refusal variant.

CLI flags

--allow-cross-namespace-agents   Permit beats and recovery for agents whose
                                 kernel-attested PID namespace differs from
                                 the observer's. Default off — beats dropped
                                 at receive (counted) and recovery refused
                                 (audit + counter).

--strict-namespace-check         Fatal startup error on first cross-namespace
                                 beat. Default off — log + counter only.

Edge cases

  • /proc/<peer_pid>/ns/pid unreadable (ptrace_may_access denial, peer exited between recvmsg and readlink, /proc not mounted): the helper returns None. The tracker’s None → Some upgrade allows one-shot recovery so a transient /proc unavailability does not pin a slot as permanently unknown.
  • Existing frame.pid != peer_pid check fires first for most real cross-namespace traffic (the two namespaces almost always produce different numeric pids for the same process). The namespace gate is belt-and-suspenders for the surprising case where the pids happen to collide.
  • unsafe_code = "deny" is workspace-wide. The new readlink FFI follows the established peer_cred.rs pattern (extern "C" + one-line unsafe { ... } blocks with a SAFETY comment).
  • Frame ABI is unchanged — the 32-byte Frame is not touched. All state lives observer-side.

Cross-references

  • vlp-transports.md — overall transport model.
  • peer-authentication.md — kernel-attested PID and the BeatOrigin trust classification.
  • pid_namespaces(7) and user_namespaces(7) man pages — kernel reference.

Recovery audit log (schema v2)

The recovery audit log (varta-watch/src/audit.rs) is the canonical forensic record of every recovery action the daemon took or refused. It exists to satisfy three operational requirements:

  1. Traceability. For an IEC 62304 Class C device — or an aviation ground-station — every recovery action must be reconstructable after the fact: what was spawned, when, why, with what outcome.
  2. Survivability. A power cut on the host must not silently drop the most recent audit records.
  3. Tamper-evidence. A reviewer must be able to detect retroactive editing of historical records.

Schema v1 (the pre-2026 format) satisfied only the first of these. Schema v2 — the current format — satisfies all three when the daemon is built with the audit-chain feature.

File format

Two file-level header lines, then one record per line. Fields are tab-separated. Every record kind carries a leading seq column and a trailing chain column. Free-form fields (program paths, refusal reasons) have their \t, \n, and \r bytes replaced with a single space at write time so a maliciously-chosen argv[0] can never inject columns.

# varta-watch recovery audit v2

boot

seq    wallclock_ms    observer_ns    boot    daemon_pid    prev_chain|-    reason    chain

A boot record opens every audit-log session and every post-rotation generation. The reason column carries one of six stable tokens:

| reason | when it fires | prev_chain |
|---|---|---|
| fresh | brand-new file with no prior content | - |
| resume | clean v2 tail from a prior session | last chain |
| legacy_v1 | existing file uses the v1 schema; the v2 section starts here | - |
| corrupt_tail | v2 file with a torn last record (kernel partial write); the file is ftruncate’d to the last newline before this record is appended | last good chain if recoverable, else - |
| schema_drift | header is neither v1 nor v2 | - |
| rotation | rotation generation roll | last chain of the pre-rotation file |

spawn

seq    wallclock_ms    observer_ns    spawn    agent_pid    child_pid    mode    program    source    template_len    chain

Emitted at the moment a recovery child is fork(2) + execvp(2)’d. mode ∈ {exec, shell}; program is the path actually invoked (/bin/sh for shell mode, argv[0] for exec mode); source is either the literal "inline" or the path-string for --recovery-cmd-file / --recovery-exec-file. The command template itself is not logged — it may contain secrets, and the source path is already auditable.

complete

seq    wallclock_ms    observer_ns    complete    agent_pid    child_pid    outcome    exit_code|-    signal|-    duration_ns    stdout_len    stderr_len    truncated    chain

Emitted on reap, kill-after-timeout, or reap failure. outcome is one of reaped, killed, reap_failed. exit_code and signal are mutually exclusive: at most one is a number, the other is -.

refused

seq    wallclock_ms    observer_ns    refused    agent_pid    reason    chain

Emitted when a stall is detected but recovery is structurally declined (e.g. unauthenticated transport, cross-namespace agent). reason is a stable short token so SIEM consumers can alert on it without parsing free text.

Sequencing

seq is a u64 starting at 1 on the first boot record. It is strictly monotonic within a daemon lifetime and across daemon restarts (the new daemon resumes from last_seq + 1 after parsing the existing tail). A consumer detects record loss as a gap: seq[i+1] - seq[i] > 1.
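A consumer-side gap check is a one-liner; this helper is illustrative and not part of the daemon:

```rust
/// Returns the first (prev, next) pair where next != prev + 1, i.e. the
/// first point at which audit records were lost.
fn first_seq_gap(seqs: &[u64]) -> Option<(u64, u64)> {
    seqs.windows(2)
        .find(|w| w[1] != w[0] + 1)
        .map(|w| (w[0], w[1]))
}
```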

Durability cadence

Every record_* call is followed by BufWriter::flush() and File::sync_data() (= fdatasync(2) on Linux) at a configurable cadence controlled by --recovery-audit-sync-every <N>:

  • N = 1 (default, IEC 62304 Class C-conforming): one fdatasync per record.
  • N > 1: one fdatasync per N records. The daemon emits a startup warning and the build is not Class C-conforming. Up to N - 1 records can be lost on power cut.
  • N = 0: rejected at parse time.

In addition, the daemon unconditionally syncs:

  • Before every rotation rename.
  • After writing the post-rotation boot record.
  • In Drop (best-effort; not load-bearing for correctness).

Tamper-evidence: the hash chain

When the daemon is built with --features audit-chain, every record’s trailing chain column is the lowercase-hex SHA-256 of:

DOMAIN || 0x00 || kind || 0x00 || prev_chain_raw || 0x00 || body_with_seq

where:

  • DOMAIN = b"VARTA-AUDIT-v2". The trailing v2 is the schema version; a future v3 mandatorily bumps this so chains across schemas cannot be confused.
  • kind is the bytes b"boot" / b"spawn" / b"complete" / b"refused".
  • prev_chain_raw is the raw 32-byte prior chain hash (not its hex form), or [0u8; 32] for the very first record in a fresh file.
  • body_with_seq is the TSV line from the seq column up to (but not including) the chain column — no trailing \n.
  • The 0x00 separators prevent field-boundary confusion: e.g. (kind="ab", body="cd") and (kind="abcd", body="") hash to distinct strings.

The construction is implemented once in crates/varta-vlp/src/crypto/hash.rs::audit_chain_hash so callers cannot accidentally drop the domain separation or transpose the input order.
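The preimage layout can be written out explicitly. The sketch below builds only the byte concatenation that is fed to SHA-256 (the hashing itself is elided, and the helper name is hypothetical — the real one is audit_chain_hash in varta-vlp):

```rust
// Builds DOMAIN || 0x00 || kind || 0x00 || prev_chain_raw || 0x00 || body,
// the exact byte string the chain hash is computed over.
fn chain_preimage(kind: &[u8], prev_chain_raw: &[u8; 32], body_with_seq: &[u8]) -> Vec<u8> {
    const DOMAIN: &[u8] = b"VARTA-AUDIT-v2";
    let mut m = Vec::with_capacity(DOMAIN.len() + kind.len() + 32 + body_with_seq.len() + 3);
    m.extend_from_slice(DOMAIN);
    m.push(0x00);
    m.extend_from_slice(kind);          // b"boot" / b"spawn" / ...
    m.push(0x00);
    m.extend_from_slice(prev_chain_raw); // raw 32 bytes, or [0u8; 32] when fresh
    m.push(0x00);
    m.extend_from_slice(body_with_seq); // TSV from seq up to the chain column
    m
}
```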

What this detects

  • Any byte edited in any historical record. The edited record’s own chain stops matching, and every subsequent chain also stops matching.
  • Any record deleted. The chain breaks at the deletion point.
  • Any record inserted. Same — the chain over the synthetic record cannot match the next legitimate record.
  • Records reordered. The chain validates only in original order.

What this does NOT detect

A pure SHA-256 hash chain — without a secret key — can be recomputed end-to-end by an attacker with write access to the file. Tampering is only detectable when the latest chain head is verified against an externally trusted source. Operators in safety-critical deployments should periodically export tail -1 audit.log | cut -f<last> to a sealed log (Tang, AWS S3 with object-lock, a hardware HSM, etc.). The daemon does not do this — it is an operational policy decision.

A future HMAC-keyed mode is out of scope for v2 to avoid forcing a key-distribution workflow on every Class C deployment.

When audit-chain is disabled

If the daemon is built without --features audit-chain:

  • The chain column is the literal string -.
  • The daemon emits a startup warning explicitly stating that the build is not IEC 62304 Class C-conforming.
  • seq and fdatasync cadence still work — record loss is detectable; power-cut durability is preserved; only tamper-evidence is absent.

The build remains zero-registry-dep (the audit-chain feature propagates the existing optional crypto deps in varta-vlp/crypto).

Rotation

When --recovery-audit-max-bytes <N> is set, the file rotates after any write that pushes it over the threshold: PATH → PATH.1 → … → PATH.5. Five generations are kept; the oldest is unlinked. This is the same generation count as the event-stream FileExporter.

The chain spans rotation: the first non-header record in the new generation is a boot with reason=rotation whose prev_chain column is the final chain of the just-rotated file. A reviewer who pieces generations together by seq order can replay-verify the chain across the entire history.

Verification recipe

# 1. Confirm seq is strictly monotonic across all generations.
cat audit.log.5 audit.log.4 audit.log.3 audit.log.2 audit.log.1 audit.log \
    | grep -v '^#' \
    | awk -F'\t' 'NR==1 { prev = $1; next } $1 != prev+1 { print "GAP at seq", $1; exit 1 } { prev = $1 }'

# 2. Confirm chain validates (requires the daemon's
# audit_chain_hash helper exposed in a verification tool — out of scope
# for the daemon binary itself, see book/src/architecture/peer-authentication.md
# for the pattern).

# 3. Cross-check that the chain head matches the latest sealed-log entry
# the operator exports to their trusted store.

CLI surface

| Flag | Required | Default | Meaning |
|---|---|---|---|
| --recovery-audit-file <PATH> | no | unset | Append audit records to PATH. Created mode 0600. |
| --recovery-audit-max-bytes <N> | no | unbounded | Rotate after a write that pushes the file past N bytes. |
| --recovery-audit-sync-every <N> | no | 1 | fdatasync cadence. 1 is the only Class C-conforming value. |

Threat model

| Threat | Detected? | Mechanism |
|---|---|---|
| Record loss from buffer-only flush + power cut | yes | seq gap; durability cadence; rotation pre-rename sync |
| Record loss from process kill | yes | seq gap; resume boot on restart |
| Single record edit (any byte) | yes (with chain) | hash chain divergence |
| Bulk re-write by attacker with file-write access AND chain re-computation | no | requires an external sealed chain-head log |
| Schema downgrade (v2 → v1) | yes | schema_drift boot or first-line header check |
| Replay of a captured audit file in a different deployment | yes (with chain) | initial prev_chain = [0; 32] differs per host/lifetime |

Compile-time Configuration (Class-A profile)

The Class-A safety-critical profile builds varta-watch with the compile-time-config Cargo feature. In this profile the runtime binary has no argv parser, no Prometheus HTTP exporter, and a single neutral --help body that mentions no flag names. Every operational knob is supplied at compile time by build.rs from a static KEY = VALUE file pointed to by the VARTA_CONFIG_FILE environment variable.

The Class-A binary is verified by the CI safety-profiles job:

B=target/release/varta-watch
strings "$B" | grep -E -- "(GET /metrics|HTTP/1\.|--[a-z])"
# expect: no output

When to use this profile

  • Hospital VLAN deployments where every CVE surface is a liability.
  • IEC 62304 Class C medical devices (insulin pumps, holter monitors, ventilators) where the host configuration is part of the validated firmware.
  • Avionics / industrial-control systems where the binary must boot from a signed image and accept no operator input post-deployment.

For SRE / cloud deployments use the default-feature build (or --features prometheus-exporter for /metrics). The two profiles are mutually exclusive at compile time via a compile_error! guard in crates/varta-watch/src/lib.rs.

Build recipe

export VARTA_CONFIG_FILE=/etc/varta/varta.conf
cargo build -p varta-watch --release \
  --no-default-features --features secure-udp,compile-time-config

secure-udp is the recommended companion feature — Class-A almost always wants authenticated transport. Other features that combine cleanly with compile-time-config: audit-chain, json-log, unsafe-shell-recovery (only when the operator’s signed config explicitly opts in via i_accept_shell_risk = true).

The prometheus-exporter feature is forbidden in combination with compile-time-config; cargo build fails with a clear compile_error! diagnostic.

File grammar

Plain text, UTF-8. Lines that begin with # or are entirely whitespace are ignored. Each remaining line is KEY = VALUE:

  • The = separator may have any amount of whitespace on either side.
  • KEY must be in the KNOWN_KEYS catalogue (see below).
  • VALUE is the rest of the line after the first =, trimmed.
  • Quoting is not supported — paths and strings are taken verbatim.
  • Repeated singleton keys are a build error; repeated list keys (recovery_env) accumulate.
  • Unknown keys are a build error that surfaces during cargo build.

Example:

# /etc/varta/varta.conf

socket = /run/varta/varta.sock
threshold_ms = 5000
socket_mode = 0600

# Recovery: exec-mode only, never shell.
recovery_exec_cmd = /usr/local/sbin/varta-recover {pid}
recovery_audit_file = /var/log/varta/recovery.tsv
recovery_audit_sync_every = 1

# Authenticated UDP listener bound to loopback.
udp_port = 8443
udp_bind_addr = 127.0.0.1
secure_key_file = /etc/varta/agent.key

# Hospital deployment: medical-device clock semantics + strict mode.
clock_source = boottime
strict_namespace_check = true

Accepted keys

| Key | Type | Default | Notes |
|---|---|---|---|
| socket | path | required | UDS path the observer binds. |
| threshold_ms | u64 | required | Per-pid silence window. Minimum 10. |
| socket_mode | octal | 0600 | UDS file mode after bind. |
| read_timeout_ms | u64 | 100 | UDS read timeout per poll call. |
| udp_port | u16 | none | Bind a UDP listener on this port. |
| udp_bind_addr | ip | runtime default | Loopback for secure-UDP; 0.0.0.0 for plaintext. |
| secure_key_file | path | none | 64-hex-char primary key (secure-udp). |
| accepted_key_file | path | none | One key per line for rotation. |
| master_key_file | path | none | 64-hex-char master for per-agent derivation. |
| recovery_cmd | string | none | Shell template (requires unsafe-shell-recovery). |
| recovery_exec_cmd | string | none | program args … invoked via execvp. |
| recovery_cmd_file | path | none | Read recovery_cmd from a hardened file. |
| recovery_exec_file | path | none | Read recovery_exec_cmd from a hardened file. |
| recovery_debounce_ms | u64 | 1000 | Per-pid debounce window. |
| recovery_env | list-of-string | empty | KEY=VALUE; repeatable. |
| recovery_timeout_ms | u64 | none | Kill-after deadline for recovery children. |
| recovery_audit_file | path | none | TSV recovery audit log. |
| recovery_audit_max_bytes | u64 | none | Audit-file rotation byte cap. |
| recovery_audit_sync_every | u32 | 1 | fdatasync cadence (1 = every record). |
| recovery_capture_stdio | bool | false | Capture child stdio for audit. |
| recovery_capture_bytes | u32 | 4096 | Stdio capture cap. Max 1048576. |
| file_export | path | none | TSV event-stream sink. |
| export_file_max_bytes | u64 | none | Event-file rotation cap. |
| heartbeat_file | path | none | Per-tick liveness file. |
| tracker_capacity | usize | 256 | Max tracked PIDs. |
| tracker_eviction_policy | enum | strict | strict or balanced. |
| eviction_scan_window | usize | 256 | Max slots scanned per eviction attempt. Range [1, 4096]. |
| max_beat_rate | u32 | none | Per-pid beats/sec cap. |
| clock_source | enum | monotonic | monotonic or boottime (Linux only). |
| iteration_budget_ms | u64 | 250 | Per-iteration soft budget. Range [50, 60000]. |
| scrape_budget_ms | u64 | 250 | Per-serve_pending soft budget. Range [50, 60000]. |
| shutdown_after_secs | u64 | none | Self-terminate after this uptime. |
| shutdown_grace_ms | u64 | 5000 | Drop blocking time during shutdown. Minimum 100. |
| self_watchdog_secs | u64 | none | Self-watchdog deadline (auto-enables under systemd). |
| hw_watchdog | path | none | Hardware watchdog device (/dev/watchdog). |
| i_accept_plaintext_udp | bool | false | Runtime acknowledgement. |
| i_accept_shell_risk | bool | false | Runtime acknowledgement. |
| i_accept_recovery_on_secure_udp | bool | false | Recovery on secure-UDP transport. |
| i_accept_recovery_on_plaintext_udp | bool | false | Recovery on plaintext UDP. |
| i_accept_secure_udp_non_loopback | bool | false | Non-loopback secure-UDP bind. |
| allow_cross_namespace_agents | bool | false | Permit cross-PID-namespace beats. |
| strict_namespace_check | bool | false | Fatal exit on cross-namespace agent. |
| inject_wedge_ms | u64 | none | Test-hooks only (requires test-hooks feature). |

Operational contract

  • --help (and any other argv) is rejected at startup. The binary exits non-zero with the neutral diagnostic “this binary was configured at compile time; refusing to accept command-line arguments”.
  • Diagnostic messages in stderr / sd_notify use neutral wording — no --flag-name strings appear anywhere in the binary. See the cerebrum entry on pub const &str being unconditionally linked for the rationale.
  • The configuration file is consumed once, at cargo build time. The resulting binary is immutable: redeployment requires a new build. This is the structural feature operators rely on for Class-A release-gating.

See also

Safety Profiles

varta-watch ships with a two-layer gate for every structurally-dangerous capability: a compile-time Cargo feature that must be explicitly enabled, AND a runtime flag that must be passed by the operator. Neither layer alone is sufficient; both must be active.

This document defines what “production-safe” means for Varta and how to verify a binary before deploying it to a safety-critical environment.

Profile matrix

| Profile | Features | argv | /metrics | Recovery |
|---|---|---|---|---|
| SRE / cloud | prometheus-exporter (+ optional unsafe-*) | full GNU-style parser | HTTP /metrics + Bearer token | shell or exec |
| Class-A safety-critical | secure-udp, compile-time-config | none (build-time fixed) | absent | exec only (or unsafe-shell-recovery + signed acknowledgement) |

The two profiles are mutually exclusive: prometheus-exporter cannot combine with compile-time-config (a compile_error! in crates/varta-watch/src/lib.rs rejects the combination at build time). This is the structural guarantee Class-A builds rest on — the Class-A binary cannot ship with an HTTP server linked in.


Production-safe build

A production-safe varta-watch binary is built with default features only:

cargo build -p varta-watch --release

No --features argument is needed or wanted. Default features are empty.

What is absent from a production-safe build

| Dangerous capability | Cargo feature | Runtime flag |
|---|---|---|
| Plaintext (unauthenticated) UDP listener | unsafe-plaintext-udp | --i-accept-plaintext-udp |
| Shell-mode recovery (/bin/sh -c) | unsafe-shell-recovery | --i-accept-shell-risk |

Without the compile-time feature, the code path is not linked into the binary. A misconfigured deployment cannot accidentally enable the dangerous path at runtime.

Verification recipe

cargo build -p varta-watch --release
strings target/release/varta-watch | grep -F "/bin/sh" && echo "FAIL" || echo "OK"

The strings check is belt-and-suspenders: because the dangerous code is #[cfg(feature = ...)]-gated at the source level, the literal string is compiled out entirely, so it cannot appear anywhere in the binary.


Unsafe features

unsafe-plaintext-udp

Compiles in the plaintext UdpListener transport. Any device with network access to the bound port can inject heartbeats, suppress stall detection, or trigger false recovery commands.

# varta-watch/Cargo.toml
[features]
unsafe-plaintext-udp = ["udp-core"]

Even with this feature, the listener will not bind unless --i-accept-plaintext-udp is also passed at runtime.

unsafe-shell-recovery

Compiles in the RecoveryMode::Shell variant, which passes the recovery template to the system shell (sh -c). A template-injection vector can execute arbitrary commands with the observer’s authority.

[features]
unsafe-shell-recovery = []

Even with this feature, shell-mode recovery will not activate unless --i-accept-shell-risk is also passed at runtime.


Class-A safety-critical features

prometheus-exporter (opt-in HTTP exposition)

The Prometheus /metrics endpoint, the bearer-token loader, the per-IP rate-limit table, and every --prom-* argv flag live behind this feature. When absent the binary contains zero HTTP / TCP-accept code and the only exporter linked is FileExporter (one-way append-only TSV sink — no listener, no network surface).

[features]
prometheus-exporter = []

Verification recipe (default build, feature off):

cargo build -p varta-watch --release
B=target/release/varta-watch
strings "$B" | grep -E -- "(GET /metrics|HTTP/1\.|WWW-Authenticate|Bearer realm)" \
  && echo "FAIL" || echo "OK"

compile-time-config (no argv parser, no runtime config)

Replaces the runtime argv parser with a build-script-generated constant populated from $VARTA_CONFIG_FILE (a KEY = VALUE text file). When the feature is on:

  • Config::from_args is excluded from compilation; the 292-arm match block carrying every --flag-name literal is not linked.
  • Config::HELP is a neutral one-liner that contains no flag names.
  • The binary refuses any argv tokens with CompileTimeArgvForbidden.

Cannot be combined with prometheus-exporter — the combination is rejected at compile time by a compile_error! in lib.rs.

export VARTA_CONFIG_FILE=/etc/varta/varta.conf
cargo build -p varta-watch --release \
  --no-default-features --features secure-udp,compile-time-config

Verification recipe:

B=target/release/varta-watch
FORBIDDEN="GET /metrics|HTTP/1\.|WWW-Authenticate|--socket|--prom-addr|--help|--i-accept|/bin/sh"
strings "$B" | grep -E -- "$FORBIDDEN" && echo "FAIL" || echo "OK"

See compile-time-config.md for the canonical KEY=VALUE grammar and key catalogue.


Always use --recovery-exec instead of --recovery-cmd for production deployments. --recovery-exec invokes the program directly via execvp(2) with no shell involved; shell metacharacters have no effect.


Miri policy

Miri (cargo miri test) runs on every push under -Zmiri-strict-provenance and covers the three unsafe-code clusters that cannot be audited by reading alone:

| Cluster | Miri target | What it proves |
|---|---|---|
| peer_cred cmsg pointer-walk | cargo miri test -p varta-watch --lib peer_cred | No UB in the hand-written cmsghdr traversal; synthetic buffers only — no syscalls |
| Tracker slot-index arithmetic | cargo miri test -p varta-watch --lib tracker | No out-of-bounds indexing or stale pointer reads in the fixed-capacity slot array |
| Client classifier | cargo miri test -p varta-client --test classifier | BeatError is Copy-safe and errno extraction has no provenance issues |

Tests that require real syscalls (Unix datagram bind, recvmsg, process spawn) carry #[cfg_attr(miri, ignore)] so they are silently skipped when Miri runs, without requiring a separate test-filter command.


Clock source for stall detection

Stall threshold accounting depends on a monotonic time source. Which “monotonic” is correct depends on the deployment profile:

| Profile | --clock-source | Rationale |
|---|---|---|
| SRE / cloud server / VM | monotonic (default) | CLOCK_MONOTONIC pauses on host suspend, hypervisor pause, and live-migration freeze. A 30-minute host-suspend-for-maintenance must NOT fan out a stall alert across every agent. |
| Medical implant / holter / insulin pump (Linux) | boottime (Linux only) | CLOCK_BOOTTIME advances during suspend. A 4-hour deep-sleep IS a 4-hour silence; stall detection MUST fire on wake-up regardless of whether the device suspended itself. |
| Embedded sensor with deep sleep (Linux) | boottime (Linux only) | Same as medical — battery-conscious devices that aggressively suspend need stall semantics that count the suspended time. |
| macOS / iOS-hosted device with sleep semantics | monotonic-raw (macOS / iOS only) | CLOCK_MONOTONIC_RAW on Darwin is backed by mach_continuous_time and advances through sleep — the Darwin equivalent of Linux’s CLOCK_BOOTTIME. |

Platform support

boottime semantics require Linux’s CLOCK_BOOTTIME clock (clk_id 7, available since 2.6.39). The Darwin equivalent is CLOCK_MONOTONIC_RAW (clk_id = 4), backed by mach_continuous_time; it advances through sleep just like CLOCK_BOOTTIME. Because the same numeric clk_id = 4 on Linux refers to CLOCK_MONOTONIC_RAW with different semantics (it opts out of NTP slewing but still pauses during suspend), the two are exposed as distinct ClockSource variants — boottime (Linux only) and monotonic-raw (macOS / iOS only) — and each is rejected at startup on the other family with ConfigError::ClockSourceUnsupported.

BSD operators have only monotonic: no kernel clock on FreeBSD / NetBSD / OpenBSD / DragonFly advances through suspend in a way usable by clock_gettime(2).

Example rejection messages:

clock source `boottime` is not supported on `macos` (Linux only; on
macOS use `monotonic-raw` for advance-through-sleep semantics)
clock source `monotonic-raw` is not supported on `linux` (macOS / iOS
only; on Linux use `boottime` for advance-through-sleep semantics)

This is structural enforcement: a misconfigured medical-device deployment exits non-zero rather than silently picking a clock that pauses on sleep.

Self-watchdog alignment

The in-process self-watchdog (--self-watchdog-secs) reads the same kernel clock as the observer. An operator who configures boottime for the observer gets watchdog deadline accounting that also advances during suspend; an SRE operator on monotonic gets identical-to-historical watchdog behaviour minus the previous wall-clock NTP-backward-step foot-gun.

Verification recipe (Linux)

# Confirm the configured clock source is in effect.
journalctl -u varta-watch | grep -i 'clock'   # binary logs no startup banner today;
                                              # operators can read /proc/<pid>/maps
                                              # to confirm clock_gettime imports.

# Behavioural smoke test — requires a real suspend / resume cycle.
# (systemctl has no `resume` verb; rtcwake suspends and wakes automatically.)
sudo rtcwake -m mem -s 60
curl -fsS http://localhost:9090/metrics -H "Authorization: Bearer <hex>" \
  | grep -E 'varta_(stall_total|beats_total|watch_uptime_seconds)'
# Expect: with --clock-source boottime, varta_stall_total advanced during the
# suspend window; with --clock-source monotonic, it did not.

Cross-reference

The secure-udp transport applies the same “no surprises on the beat path” posture: the IV-prefix derivation (H6) reads OS entropy only at connect() and reconnect() — every steady-state beat uses a deterministic HKDF counter-mode expansion. Together, H6 + H7 keep the agent and observer loops free of any syscall that can block or stall under suspend.


Cross-references

Varta v0.1.0 — Bench Harness Results

Per-metric measurements captured by the dependency-free varta-bench harness (Session 06). Each row corresponds to one acceptance contract assertion in docs/acceptance/varta-v0-1-0.md.

Host

| Field | Value |
|---|---|
| OS | Darwin 25.4.0 (xnu-12377.101.15) arm64 |
| Hardware | Apple Silicon (Mac, T6050 series) |
| Rust toolchain | rustc 1.93.1 (01f6ddf75 2026-02-11) — pinned via rust-toolchain.toml |
| Working tree | epic/varta-v0-1-0--s06-integration-and-bench clean at run time |

Results

| Metric | Threshold | Measured | Status | Command |
|---|---|---|---|---|
| latency | p99 < 1 µs | p99 = 916 ns | PASS | cargo run -p varta-bench --release -- latency |
| cpu-50-agents | < 0.1 % | 0.0552 % | PASS | cargo run -p varta-bench --release -- cpu-50-agents |
| binary-size | Δ < 20 KB | Δ = 3 872 B | PASS | cargo run -p varta-bench --release -- binary-size |

Auxiliary latency metrics (same run): p50 = 584 ns, p99.9 = 1042 ns.

Reproducibility

# Build the workspace once so varta-watch is in target/release.
cargo build --workspace --release

cargo run -p varta-bench --release -- latency
cargo run -p varta-bench --release -- cpu-50-agents       # ~35 s wall
cargo run -p varta-bench --release -- binary-size         # ~5 s wall

cpu-50-agents waits for the daemon to self-exit via --shutdown-after-secs 35 before snapshotting getrusage(RUSAGE_CHILDREN), so the measurement covers the full wall window over which the 50 agent threads emit at 1 Hz. The wall is therefore the dominant cost.

Threshold notes

  • latency: thresholds are tagged HOST-DEPENDENT in crates/varta-bench/src/main.rs. Apple Silicon laptops show p99 ≈ 900 ns idle. Virtualised CI runners with noisy neighbours can spike — if the bench reports STATUS: WARN with a measured value above 1 µs, the harness is doing its job and a CI gate should classify it as a soft failure (warning, not red).
  • cpu-50-agents: the daemon is mostly blocked in recvfrom(2) with the 100 ms read timeout. CPU usage scales sublinearly with agent count because the kernel batches wakeups. 0.0552 % of a 35 s wall is ~19 ms of daemon CPU.
  • binary-size: link-time pulls in Varta::connect, the Frame codec, and the BeatOutcome enum. The diff is dominated by Rust’s standard-library boilerplate for UnixDatagram plus a few KB of generated code for the encoder. The fixture crates use lto = false, codegen-units = 1, opt-level = 3 so size comparisons are stable across runs.

Status

All three contract assertions PASS on the host above. No WARN or FAIL deviations to record for this session.

Contributing to Varta

First, thank you for contributing! Varta is a high-assurance health protocol, and we maintain strict architectural and safety standards.

The Varta “Hard Constraints”

Every contribution must adhere to these load-bearing invariants:

  1. Zero Registry Dependencies: Production crates (varta-vlp, varta-client, varta-watch) must have empty [dependencies] sections (other than internal path dependencies).
  2. Zero Heap Allocation: No heap allocation is permitted on the beat() path after connection. We verify this with zero_alloc tests using a guard allocator.
  3. Non-Blocking I/O: The beat path must never block. WouldBlock is handled as Dropped.
  4. ABI Stability: Any change to the 32-byte Frame layout is a breaking change and requires a VLP version bump.
  5. Strict Linting: We run with deny(unsafe_code) at the workspace level. Permitted unsafe blocks (e.g., for FFI) must be explicitly allowed with #[allow(unsafe_code)] to create an audit trail.

Development Workflow

Prerequisites

  • Rust stable (for production builds)
  • Rust nightly (for fuzzing and Miri)
  • cargo-fuzz and miri components installed

The “JUSTIFY” Rule

If you must #[ignore] a test, the CI will fail unless you provide a // JUSTIFY: <reason> comment within 2 lines of the attribute. This ensures we don’t accidentally leave gaps in our safety coverage.

Running the Suite

# Lint & Format
cargo fmt
cargo clippy --workspace -- -D warnings

# Tests
cargo test --workspace

# Fuzzing (Mandatory for protocol changes)
cargo fuzz run frame_decode -- -max_total_time=30

# Miri (UB Audit)
cargo miri test -p varta-vlp

Pull Request Process

  1. Benchmarks: If your change touches the beat() path, you must run cargo run -p varta-bench --release -- latency and include the results in your PR description.
  2. Documentation: Update design.md or crate READMEs if logic changes.
  3. Zero-Alloc Verification: Ensure cargo test -p varta-tests --test zero_alloc still passes.

Code of Conduct

We follow the Contributor Covenant. Please be respectful and professional.

Security Policy

Supported Versions

The following versions of Varta are currently being supported with security updates.

| Version | Supported |
|---|---|
| v0.2.x | :white_check_mark: |
| < v0.2 | :x: |

Reporting a Vulnerability

Varta is designed for high-assurance and safety-critical health monitoring. Security and protocol integrity are our highest priorities.

If you discover a security vulnerability or a protocol-level defect that could compromise system safety, please do not report it via a public issue.

Please use the GitHub Private Vulnerability Reporting feature. This allows you to securely disclose the vulnerability to the maintainers without making it public.

What to include

When reporting, please provide:

  1. A descriptive title.
  2. The specific crate and version affected.
  3. A clear description of the vulnerability or safety concern.
  4. Steps to reproduce (including hardware/OS context if relevant).
  5. A proof-of-concept if available.

Our Commitment

We will:

  • Acknowledge your report within 48 hours.
  • Provide a timeline for a fix and keep you updated.
  • Give credit (if desired) in the eventual security advisory.

Varta Project Roadmap

This roadmap outlines the path from Varta’s current state to a “High-Assurance” v1.0.0 release suitable for safety-critical deployments.

Phase 1: Foundation (Current - v0.2.x) :white_check_mark:

Focus on protocol stability, local/network transport, and security audits.

  • VLP Protocol Definition (32-byte frames).
  • Zero-allocation UDS/UDP transport.
  • AEAD encryption for networked agents.
  • Fuzzing and Miri integration in CI.
  • Initial Prometheus exporter.

Phase 2: Observability & Resilience (v0.3 - v0.5)

Enhancing the observer and providing more “industrial” features.

  • Structured Logging: full json-log support across all crates.
  • Tamper-Evident Logs: SHA-256 hash chaining for recovery audits.
  • mdBook Documentation: A comprehensive “Varta Book” explaining protocol internals.
  • Crates.io Publication: Formal release of production-ready crates.

Phase 3: Compliance & Integration (v0.6 - v0.9)

Preparing for formal certification standards (IEC 62304, ISO 26262).

  • Static Analysis: Integrate cargo-geiger and custom safety-profile audits.
  • Multi-Language SDKs: C/C++ bindings for legacy embedded systems.
  • Hardware Watchdog Integration: Native drivers for Linux watchdogd and platform-specific hardware timers.
  • Self-Diagnostic Suite: Integrated tests for observer clock drift and jitter.

Phase 4: High-Assurance v1.0

The stable, safety-certified release.

  • Formal Verification: TLA+ or Kani proofs for core state machines.
  • Third-Party Security Audit: Formal cryptographic and code audit by a specialized firm.
  • ABI Freeze: Finalize the VLP wire format for long-term compatibility.
  • v1.0.0 Release: LTS support for critical infrastructure.