Introduction
Varta is a zero-dependency, zero-allocation health protocol designed for distributed local agents and networked clusters.
The Problem: “The Observer Gap”
In high-performance or safety-critical systems, monitoring process health is often surprisingly expensive or dangerously imprecise.
- Expensive: Monitoring agents that consume 5-10% CPU just to check if others are alive.
- Imprecise: TCP-based health checks that fail due to network congestion, not process failure.
- Fragile: Monitoring systems that crash when the target process panics or deadlocks.
The Varta Philosophy: “Zero-Everything”
Varta was built to bridge this gap by providing a protocol that is:
- Zero Dependencies: Production crates have empty `[dependencies]` sections.
- Zero Allocations: After initialization, the beat path never touches the heap.
- Zero Block: The agent never waits for the observer. If the observer is busy, the heartbeat is simply dropped.
How it Works
- Agents emit a 32-byte fixed-layout frame (VLP) over a Unix Domain Socket or UDP.
- The Observer (`varta-watch`) polls these frames, tracks per-pid state machines, and triggers recovery actions if a “stall” is detected.
Ready to get started? Check out the Installation guide.
Installation
Varta is currently in rapid development (post-v0.1.0). While it is not yet published to crates.io, it is designed to be easily included as a path dependency.
Adding to your Rust project
Add the varta-client to your Cargo.toml:
[dependencies.varta-client]
path = "path/to/varta/crates/varta-client"
Optional Features
You can enable specific transport or safety features:
[dependencies.varta-client]
path = "path/to/varta/crates/varta-client"
features = [
"panic-handler", # Automatic 'Critical' beat on thread panic
"udp", # Support for networked agents
"secure-udp", # Encrypted UDP transport (requires crypto deps)
]
Installing the Observer
To build and install the varta-watch observer binary:
cargo install --path crates/varta-watch
Verifying the Toolchain
Varta is pinned to a specific stable toolchain via rust-toolchain.toml. We recommend matching this for production builds:
rustup show
The Minimum Supported Rust Version (MSRV) is 1.70.0.
VLP Frame — Wire Layout (v0.2)
The Varta Lifeline Protocol carries a single message type: a 32-byte
fixed-layout health frame. Every byte position is pinned at the protocol level
so encode/decode is a handful of from_le_bytes / to_le_bytes calls and a
single CRC-32C pass — nothing else.
Byte map
offset │ size │ field │ notes
───────┼──────┼────────────┼──────────────────────────────────────────────
0 │ 2 │ magic │ const [0x56, 0x41] (ASCII "VA")
2 │ 1 │ version │ const 0x02 (v0.1 → BadVersion)
3 │ 1 │ status │ Status::{Ok=0, Degraded=1, Critical=2, Stall=3}
4 │ 4 │ pid │ u32 little-endian — emitter's process id
8 │ 8 │ timestamp │ u64 little-endian — emitter-local monotonic
16 │ 8 │ nonce │ u64 little-endian — strictly increasing
24 │ 4 │ payload │ u32 little-endian — opaque app context (v0.2)
28 │ 4 │ crc32c │ u32 LE CRC-32C over bytes 0..28 (v0.2)
───────┴──────┴────────────┴──────────────────────────────────────────────
total 32 bytes
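The byte map translates directly into a `#[repr(C, align(8))]` struct. The sketch below is illustrative only — field names mirror the byte map, but the real definition lives in `crates/varta-vlp` and may differ in visibility and helper methods:

```rust
// Illustrative sketch of the v0.2 frame layout. Field order and widths
// follow the byte map above; repr(C, align(8)) pins them in place.
#[repr(C, align(8))]
pub struct Frame {
    pub magic: [u8; 2],  // offset 0  — const [0x56, 0x41] ("VA")
    pub version: u8,     // offset 2  — const 0x02
    pub status: u8,      // offset 3  — Ok=0, Degraded=1, Critical=2, Stall=3
    pub pid: u32,        // offset 4  — emitter's process id (LE on the wire)
    pub timestamp: u64,  // offset 8  — emitter-local monotonic (LE)
    pub nonce: u64,      // offset 16 — strictly increasing (LE)
    pub payload: u32,    // offset 24 — opaque app context (v0.2)
    pub crc32c: u32,     // offset 28 — CRC-32C over bytes 0..28 (v0.2)
}

fn main() {
    // The widths sum to 32 with zero padding, and alignment is 8.
    assert_eq!(core::mem::size_of::<Frame>(), 32);
    assert_eq!(core::mem::align_of::<Frame>(), 8);
    println!("size=32 align=8");
}
```

These are the same two properties the crate's compile-time assertions (shown later on this page) lock in.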
v0.2 wire integrity (CRC-32C)
Bytes 28..32 carry a CRC-32C (Castagnoli, polynomial 0x1EDC6F41,
init 0xFFFFFFFF, reflected, output-XOR 0xFFFFFFFF) computed over
bytes 0..28. The CRC catches:
- Non-ECC RAM bit flips and cosmic-ray single-event upsets on the agent or the observer host.
- NIC firmware corruption between RX queue and userspace.
- In-process memory corruption between `Frame::encode` and the transport write (or between the transport read and `Frame::decode`), including the gap between `crypto::seal` / `crypto::open` and the frame-level codec on the secure-UDP transport. AEAD tag failures surface separately as `crypto::AuthError`; the CRC is the defence-in-depth catch for everything that AEAD does not (in-process corruption on either side of the seal/open boundary).
Decode order is fixed: magic → version → CRC → status → pid → timestamp → nonce. CRC verification sits between the version check and the field-range checks, so random bytes from a wrong-protocol sender still surface as `BadMagic` / `BadVersion` (preserving the “this isn’t even VLP” diagnostic), while a single-bit-flipped status byte surfaces as `BadCrc`, never as a valid frame with the wrong meaning.
Implementation: `crates/varta-vlp/src/crc32c.rs` carries a `const fn`-built 256-entry lookup table; per-frame cost is ~28 cycles (~9 ns on Apple Silicon). Hardware CRC-32C is available on x86_64 (SSE 4.2) and ARMv8.1+ via `core::arch` intrinsics; a future `target_feature` cfg can drop the cost to ~1 cycle without changing the wire format.
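The table-driven approach can be sketched as follows. This is a minimal stand-alone sketch using the parameters stated above (reflected form of polynomial 0x1EDC6F41 is 0x82F63B78); the crate's actual `crc32c.rs` may differ in structure:

```rust
// Build the 256-entry CRC-32C table at compile time.
const fn build_table() -> [u32; 256] {
    let mut table = [0u32; 256];
    let mut i = 0;
    while i < 256 {
        let mut crc = i as u32;
        let mut bit = 0;
        while bit < 8 {
            // 0x82F63B78 is the bit-reflected Castagnoli polynomial.
            crc = if crc & 1 != 0 { (crc >> 1) ^ 0x82F6_3B78 } else { crc >> 1 };
            bit += 1;
        }
        table[i] = crc;
        i += 1;
    }
    table
}

const TABLE: [u32; 256] = build_table();

// One pass over the data: init 0xFFFFFFFF, reflected, output-XOR 0xFFFFFFFF.
fn crc32c(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &b in data {
        crc = (crc >> 8) ^ TABLE[((crc ^ b as u32) & 0xFF) as usize];
    }
    crc ^ 0xFFFF_FFFF
}

fn main() {
    // RFC 3720 check value for CRC-32C over "123456789".
    assert_eq!(crc32c(b"123456789"), 0xE306_9283);
    println!("ok");
}
```

For a VLP frame the input is exactly `&frame_bytes[0..28]` and the result is written little-endian into bytes 28..32.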
The payload field shrank from u64 (v0.1) to u32 (v0.2) to make room
for the CRC trailer inside the 32-byte budget. Agents needing more than
4 bytes of context should externalize the data and reference it from the
payload (e.g. as a slot index into a shared ring buffer).
The two compile-time assertions in crates/varta-vlp/src/lib.rs lock this in:
const _: () = assert!(core::mem::size_of::<Frame>() == 32);
const _: () = assert!(core::mem::align_of::<Frame>() == 8);
A drift in field order, padding, or width breaks the build. The integration
test frame_round_trip_matches_golden_bytes cross-checks a hand-computed
golden byte array against Frame::encode, so the layout is also pinned at
runtime.
Why #[repr(C, align(8))]
- `repr(C)` pins field order to declaration order. Without it the compiler is free to reorder fields, which would silently break a wire format consumed by any tool that decodes by offset (including `varta-watch` itself).
- `align(8)` makes the struct’s start address 8-byte aligned, matching the natural alignment of the two `u64` fields. The first four fields (magic + version + status + pid) total exactly 8 bytes, so once the struct is 8-aligned the `u64` fields land on 8-byte boundaries with zero padding. `size_of` therefore equals the sum of the field widths (32), and the const-assert proves it.
- No `unsafe` is required at the encode/decode boundary because we never transmute the struct to or from `[u8; 32]`. The body of `Frame::encode` and `Frame::decode` is a sequence of `to_le_bytes` / `from_le_bytes` calls against fixed-length array slices, all of which are checked at the type-system level.
Why little-endian on the wire
- Every tier-1 target Varta will plausibly run on (x86_64, aarch64) is little-endian natively, so `to_le_bytes` is a no-op copy on the hot path.
- Even on a hypothetical big-endian target the cost is one `bswap`-class instruction per integer field — a rounding error against UDS write/read.
- Pinning byte order in the spec means a frame captured on one host can be decoded byte-for-byte on another, which keeps the `varta-watch` recovery command testable in isolation.
Why zero-dependency
- The protocol crate is the foundation everything else links against. Any registry crate it pulls in (`bytes`, `byteorder`, `zerocopy`, …) becomes a transitive obligation for every agent that wants to integrate Varta. Keeping `[dependencies]` empty preserves the “drop in one path dep, get health signaling” contract.
- The whole crate is a struct, an enum, and four free functions. There is nothing here that `core` does not already provide.
- Empty deps also keep the audit surface minimal: the only `unsafe` in the workspace will live in `varta-client` and `varta-watch` (where required for UDS plumbing), never in the protocol crate itself.
Cross-references
- Acceptance contract: `docs/acceptance/varta-v0-1-0.md`
- Crate root: `crates/varta-vlp/src/lib.rs`
- Integration tests: `crates/varta-vlp/tests/frame.rs`
VLP Transports
The Varta Lifeline Protocol (VLP) wire format is entirely transport-agnostic — a 32-byte,
8-byte-aligned #[repr(C)] frame. The transport layer is abstracted via traits that
allow swapping out the underlying socket type without modifying the protocol core.
Architecture
┌──────────────────────────────────────────────────────────────────┐
│                            varta-vlp                             │
│             Frame (32 bytes) │ Status │ DecodeError              │
│                Zero dependencies. Never changes.                 │
└─────────────┬────────────────────────────────┬───────────────────┘
              │                                │
     ┌────────▼─────────────────┐     ┌────────▼─────────────────┐
     │       varta-client       │     │       varta-watch        │
     │                          │     │                          │
     │ BeatTransport            │     │ BeatListener             │
     │  ├── UdsTransport        │     │  ├── UdsListener         │
     │  ├── UdpTransport        │     │  ├── UdpListener         │
     │  │     (udp feat.)       │     │  │     (udp feat.)       │
     │  └── SecureUdpTransport  │     │  └── SecureUdpListener   │
     │        (secure-udp feat.)│     │        (secure-udp feat.)│
     └──────────────────────────┘     └──────────────────────────┘
Agent side (varta-client)
pub trait BeatTransport: Send + 'static {
    fn send(&mut self, buf: &[u8; 32]) -> io::Result<usize>;
    fn reconnect(&mut self) -> io::Result<()>;
}
Varta<T: BeatTransport> owns a transport and calls send(2) on every beat().
The default transport is UdsTransport (Unix Domain Socket). When the udp
feature is enabled, UdpTransport is available via Varta::connect_udp(addr).
When the secure-udp feature is enabled, SecureUdpTransport is available
via Varta::connect_secure_udp(addr, key) — every beat is encrypted with
ChaCha20-Poly1305 AEAD (RFC 8439).
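To illustrate the trait contract, here is a hypothetical in-memory transport of the kind a test harness might use — not part of the crate, and it repeats the trait definition so the sketch is self-contained:

```rust
use std::io;

// Trait as shown above.
pub trait BeatTransport: Send + 'static {
    fn send(&mut self, buf: &[u8; 32]) -> io::Result<usize>;
    fn reconnect(&mut self) -> io::Result<()>;
}

// Hypothetical test double: records every frame instead of writing to a socket.
pub struct VecTransport {
    pub frames: Vec<[u8; 32]>,
    pub reconnects: u32,
}

impl BeatTransport for VecTransport {
    fn send(&mut self, buf: &[u8; 32]) -> io::Result<usize> {
        self.frames.push(*buf);
        Ok(32) // a full VLP frame is always exactly 32 bytes
    }
    fn reconnect(&mut self) -> io::Result<()> {
        self.reconnects += 1;
        Ok(())
    }
}

fn main() {
    let mut t = VecTransport { frames: Vec::new(), reconnects: 0 };
    t.send(&[7u8; 32]).unwrap();
    assert_eq!(t.frames.len(), 1);
    println!("sent {} frame(s)", t.frames.len());
}
```

Because `Varta<T: BeatTransport>` is generic, the same wrapper logic (fork detection, session state) drives UDS, UDP, and test transports unchanged.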
Observer side (varta-watch)
pub trait BeatListener: Send + 'static {
    fn recv(&mut self) -> RecvResult;
    fn drain_decrypt_failures(&mut self) -> u64 { 0 } // default = 0
    fn drain_truncated(&mut self) -> u64 { 0 } // default = 0
}
The Observer holds a Vec<Box<dyn BeatListener>> and polls all listeners
round-robin on each poll() call. When --udp-port is passed at the CLI,
a UdpListener is added alongside the UDS listener.
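The round-robin polling described above can be sketched as follows. This is a hedged simplification: `RecvResult` is reduced to `Option<[u8; 32]>` here, whereas the real enum carries error detail:

```rust
// Simplified listener trait for the sketch.
pub trait BeatListener: Send + 'static {
    fn recv(&mut self) -> Option<[u8; 32]>;
}

struct Observer {
    listeners: Vec<Box<dyn BeatListener>>,
}

impl Observer {
    // One poll() pass gives every listener exactly one recv attempt, so a
    // chatty UDP listener cannot starve the UDS listener (or vice versa).
    fn poll(&mut self) -> Vec<[u8; 32]> {
        let mut frames = Vec::new();
        for l in &mut self.listeners {
            if let Some(f) = l.recv() {
                frames.push(f);
            }
        }
        frames
    }
}

// Test double: yields its frame once, then reports "nothing pending".
struct OneShot(Option<[u8; 32]>);
impl BeatListener for OneShot {
    fn recv(&mut self) -> Option<[u8; 32]> {
        self.0.take()
    }
}

fn main() {
    let mut obs = Observer {
        listeners: vec![Box::new(OneShot(Some([1u8; 32]))), Box::new(OneShot(None))],
    };
    assert_eq!(obs.poll().len(), 1); // only the first listener had a frame
    assert_eq!(obs.poll().len(), 0);
    println!("ok");
}
```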
Transport comparison
| | UDS (default) | UDP (feature = “udp”) | Secure UDP (feature = “secure-udp”) |
|---|---|---|---|
| Addressing | Filesystem path | IP:PORT | IP:PORT |
| Encryption | None (kernel isolation) | None | ChaCha20-Poly1305 AEAD |
| Authentication | Kernel PID + UID via SO_PASSCRED (Linux) / LOCAL_PEERTOKEN (macOS) | None | Poly1305 tag + PID in IV prefix (master-key mode) — wire-content only, not the sending process |
| Replay protection | None (local IPC) | None | Per-sender IV counter monotonicity |
| Trust model | Filesystem permissions + kernel credential attestation | Network segmentation | 256-bit pre-shared or per-agent derived key |
| Origin classification | KernelAttested | NetworkUnverified | NetworkUnverified (cryptographic binding ≠ kernel attestation) |
| Recovery-eligible by default? | Yes | No (see [peer-authentication.md → Recovery eligibility]) | No (same gate; even master-key derivation cannot replace kernel attestation) |
| Frame size | 32 bytes | 32 bytes | 60 bytes (AEAD overhead) |
| Socket cleanup | UdsListener::drop unlinks socket | Kernel reclaims port | Kernel reclaims port |
| Use case | Local IPC, process monitoring | IoT/edge, microservices | Anything crossing untrusted networks |
Recovery-on-UDP is structurally rejected by default. Combining any recovery flag (`--recovery-cmd` / `--recovery-exec` / `*-file`) with `--udp-port` is a startup hard-error unless the operator passes `--i-accept-recovery-on-unauthenticated-transport`. Even with the flag, the runtime origin gate still refuses to fire recovery for UDP-origin stalls — flipping `Recovery::with_allow_unauthenticated_source(true)` is a separate, conscious choice. See `book/src/architecture/peer-authentication.md` for the full threat model.
CLI additions
# Listen on UDS only (default)
varta-watch --socket /tmp/varta.sock --threshold-ms 500
# Listen on UDS + UDP (requires --features udp at build time)
varta-watch --socket /tmp/varta.sock --threshold-ms 500 \
--udp-port 9000 --udp-bind-addr 0.0.0.0
# UDP-only (no UDS)
varta-watch --socket /tmp/varta.sock --threshold-ms 500 \
--udp-port 9000
# UDP with ChaCha20-Poly1305 encryption
# Generate a 256-bit key (64 hex chars)
openssl rand -hex 32 > /tmp/varta.key
varta-watch --socket /tmp/varta.sock --threshold-ms 500 \
--udp-port 9000 --key-file /tmp/varta.key
# Rotation: accept old key while transitioning to new key
openssl rand -hex 32 > /tmp/varta-new.key
varta-watch --socket /tmp/varta.sock --threshold-ms 500 \
--udp-port 9000 --key-file /tmp/varta.key \
--accepted-key-file /tmp/varta-new.key
# Per-agent key derivation from master key
# The observer derives agent-specific keys from the PID embedded in
# each frame's iv_random prefix. Compromise of one agent's key does
# not reveal other agents' keys or the master key.
openssl rand -hex 32 > /tmp/varta-master.key
varta-watch --socket /tmp/varta.sock --threshold-ms 500 \
--udp-port 9000 --master-key-file /tmp/varta-master.key
Feature flags
| Crate | Flag | Effect |
|---|---|---|
| `varta-vlp` | `crypto` | Enables ChaCha20-Poly1305 AEAD (`seal`, `open`, `Key`). `no_std`-compatible — all four RustCrypto deps are `default-features = false`. |
| `varta-vlp` | `std` | Opt-in std-dependent conveniences (`Key::from_file`, `std::path::Path`-typed helpers). Off by default so the crate is `#![no_std]` + alloc-free out of the box — ready for FreeRTOS/Zephyr targets. |
| `varta-client` | `udp` | Enables `UdpTransport`, `Varta::connect_udp()`, `install_panic_handler_udp()` |
| `varta-client` | `secure-udp` | Enables `SecureUdpTransport`, `Varta::connect_secure_udp()`; implies `udp`, `varta-vlp/crypto`, and `varta-vlp/std` (the `secure_udp` example calls `Key::from_file`). |
| `varta-watch` | `udp` | Enables `UdpListener`, `--udp-port` / `--udp-bind-addr` CLI flags |
| `varta-watch` | `secure-udp` | Enables `SecureUdpListener`, `--key-file` / `--accepted-key-file` / `--master-key-file`; implies `udp-core` |
| `varta-tests` | `udp` | Enables UDP integration tests |
| `varta-bench` | `udp` | Enables `udp-latency` benchmark subcommand |
Security
- UDS: On Linux, the kernel attests the sender’s PID and UID via `SCM_CREDENTIALS`. The observer rejects frames where `frame.pid != peer_pid` or `peer_uid != observer_uid`. On macOS, `getsockopt(LOCAL_PEERTOKEN)` is attempted for the same verification, falling back to `--socket-mode 0600`. On other platforms, the only defence is `--socket-mode`.
- UDP (plaintext): No kernel credential mechanism exists. `peer_pid` is always 0, which causes the observer to skip PID verification. Trust must be established at the network layer — firewall rules, VPC boundaries.
- UDP (secure): Every frame is encrypted with ChaCha20-Poly1305 (RFC 8439) using a 256-bit key. Primitives are provided by the `chacha20poly1305` crate (RustCrypto, NCC Group audit 2020) — no hand-rolled crypto. Key derivation uses HKDF-SHA256 (RFC 5869) via the `hkdf` + `sha2` crates. Two key modes:
  - Shared key: A single pre-shared key for all agents (`--key-file`).
  - Master key: Per-agent keys derived from the agent’s PID via HKDF-SHA256 (`--master-key-file`). The PID is embedded in the `iv_random` prefix so the observer can derive the correct agent key before decryption. Compromise of one agent’s key does not reveal other agents’ keys or the master key. Note: the HKDF-based KDF is incompatible with the ChaCha20-PRF KDF used in earlier releases — agents must re-key when upgrading from a pre-RustCrypto build if master-key mode was in use.
  - Replay attacks are blocked by enforcing monotonic IV counters per sender. Key rotation is supported via `--accepted-key-file` (no downtime required).
  - Panic-hook entropy: `install_panic_handler_secure_udp` reads entropy at install time and fails closed if all sources (`getrandom`, `getentropy`, `/dev/urandom`) are unavailable. In chrooted environments without `/dev`, use `install_panic_handler_secure_udp_accept_degraded_entropy` to opt into a non-cryptographic fallback — see `book/src/architecture/peer-authentication.md` for the full nonce-reuse risk analysis.
- Recovery commands: Two execution modes:
  - `--recovery-cmd`: Shell mode — templates executed via `/bin/sh -c` with the PID as `$1` (positional argument, never string-interpolated).
  - `--recovery-exec`: Exec mode — commands executed directly via `execvp(2)` with `{pid}` replaced in arguments. No shell is involved.
  - `--recovery-cmd-file` / `--recovery-exec-file`: Read templates from files with mandatory ownership/permission checks (UID match, mode ≤ 0600).
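The exec-mode substitution can be sketched as a pure argv transformation (hypothetical helper name; the real argv handling lives in `varta-watch`):

```rust
// Hypothetical sketch of exec-mode argument preparation: replace the literal
// token "{pid}" in each template argument. The resulting argv is handed to
// execvp(2)-style spawning directly — no shell, no command-line string is
// ever built, so the PID can never be shell-interpreted.
fn build_exec_argv(template: &[&str], pid: u32) -> Vec<String> {
    template
        .iter()
        .map(|arg| arg.replace("{pid}", &pid.to_string()))
        .collect()
}

fn main() {
    let argv = build_exec_argv(&["systemctl", "restart", "agent@{pid}.service"], 4242);
    assert_eq!(argv, ["systemctl", "restart", "agent@4242.service"]);
    println!("{:?}", argv);
}
```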
Container / PID-namespace semantics
`Frame.pid` carries the agent’s PID in the agent’s PID namespace. The observer’s kernel-attested peer PID (`SO_PASSCRED` / `LOCAL_PEERTOKEN` / `SCM_CREDS`) is in the observer’s namespace. When the two namespaces differ:
- The pid in the frame cannot be used to identify a process the observer can `kill(2)` or `systemctl restart` — the same numeric PID refers to a different process in each namespace.
- The existing `frame.pid == peer_pid` check at observer ingress catches most cases (different namespaces usually produce different numeric pids), but same-pid collisions across containers (every container’s first process is PID 1) are invisible to that gate.
varta-watch therefore (Linux only):
- Reads `/proc/self/ns/pid` once at startup and caches the inode as the observer’s namespace identity.
- For every kernel-attested beat (UDS), reads `/proc/<peer_pid>/ns/pid` and compares the inode to the observer’s. Mismatch ⇒ drop the beat (`varta_frame_namespace_mismatch_total`++) and emit `Event::NamespaceConflict`.
- Per-pid tracker slots pin the namespace inode at first beat; a later beat with a different `Some(_)` inode is rejected as `Update::NamespaceConflict` (`varta_tracker_namespace_conflict_total`++).
- Recovery commands refuse to spawn for cross-namespace stalls and log an audit record with `reason=cross_namespace_agent` (`varta_recovery_refused_total{reason="cross_namespace_agent"}`++).
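The namespace-identity comparison boils down to reading the `/proc/<pid>/ns/pid` symlink, whose target has the form `pid:[<inode>]`. A hedged sketch of the parsing step (hypothetical helper; the real code is in `varta-watch`):

```rust
// Parse the target of a /proc/<pid>/ns/pid symlink, e.g. "pid:[4026531836]",
// into the namespace inode number. Returns None on any unexpected shape.
fn parse_ns_inode(link_target: &str) -> Option<u64> {
    link_target
        .strip_prefix("pid:[")?
        .strip_suffix(']')?
        .parse()
        .ok()
}

fn main() {
    assert_eq!(parse_ns_inode("pid:[4026531836]"), Some(4026531836));
    assert_eq!(parse_ns_inode("garbage"), None);
    // On Linux, the observer's own identity would come from something like:
    //   std::fs::read_link("/proc/self/ns/pid")
    // and the peer's from /proc/<peer_pid>/ns/pid.
    println!("ok");
}
```

Two processes share a PID namespace exactly when the parsed inodes are equal, which is why a single cached inode suffices as the observer's identity.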
Escape hatch — --allow-cross-namespace-agents
When agents are intentionally run with --pid=host (containers sharing the
host PID namespace), the observer’s namespace and the agents’ namespace agree
at the kernel level — the gate above is a no-op.
For deployments where the agent runs in a private namespace and the
operator has out-of-band PID translation (e.g. CNI metadata that lets a
recovery script translate container pids to host pids), pass
--allow-cross-namespace-agents. The audit log and metrics still fire, but
beats are admitted and recovery is permitted.
--strict-namespace-check
Treat namespace mismatch as a fatal startup error: on the first
Event::NamespaceConflict, the daemon logs a FATAL line and exits with a
non-zero status. Used in environments where the operator wants the daemon to
fail loudly rather than silently log audit refusals.
Non-Linux platforms
PID namespaces are a Linux kernel concept. On macOS and the BSDs,
observer_pid_namespace_inode() returns None and all comparisons
short-circuit to “match”. The CLI flags are accepted for portability but
have no runtime effect.
UDP transports
UDP listeners (plain or secure) have no kernel peer-cred mechanism.
peer_pid is 0; peer_pid_ns_inode is None. Recovery is already refused
for NetworkUnverified origins by the existing transport gate — namespace
mismatch adds nothing for UDP. See
peer-authentication.md for the full trust model.
Secure UDP — replay-shadow threat boundary (H4)
SecureUdpListener keeps per-sender replay state in a bounded HashMap
indexed by SocketAddr:
- Capacity: `MAX_SENDER_STATES = 1024` simultaneously-tracked senders.
- After capacity is reached, `force_evict_oldest_sender` stashes the evicted sender’s `(addr, SenderState)` in a single-slot `last_evicted: Option<(SocketAddr, SenderState)>` shadow so a replay attempt from the just-evicted sender is still rejected.
The shadow is one entry deep. An attacker who can spoof UDP source addresses can cycle ≥1025 distinct sources to overwrite the shadow with their own chaff, then replay a captured frame from the target sender as if it were a “new” sender — the listener has no surviving record of the target’s last counter and accepts the replay.
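The bounded map plus 1-deep shadow can be sketched as follows. This is a hypothetical simplification — `SenderState` is reduced to the last-seen IV counter, and eviction picks an arbitrary entry rather than the true oldest:

```rust
use std::collections::HashMap;
use std::net::SocketAddr;

const MAX_SENDER_STATES: usize = 1024;

#[derive(Clone, Copy)]
struct SenderState {
    last_counter: u64, // highest IV counter seen from this sender
}

struct ReplayTable {
    senders: HashMap<SocketAddr, SenderState>,
    last_evicted: Option<(SocketAddr, SenderState)>, // the 1-deep shadow
}

impl ReplayTable {
    // Accept a frame iff its counter is strictly greater than anything we
    // remember for this sender — including the single-slot shadow.
    fn check_and_update(&mut self, addr: SocketAddr, counter: u64) -> bool {
        let known = self.senders.get(&addr).copied().or_else(|| {
            match self.last_evicted {
                Some((a, s)) if a == addr => Some(s),
                _ => None,
            }
        });
        if let Some(state) = known {
            if counter <= state.last_counter {
                return false; // replay (or reorder): reject
            }
        }
        if self.senders.len() >= MAX_SENDER_STATES && !self.senders.contains_key(&addr) {
            // Evict one entry; the real implementation evicts the oldest.
            if let Some(victim) = self.senders.keys().next().copied() {
                let s = self.senders.remove(&victim).unwrap();
                self.last_evicted = Some((victim, s)); // stash in the shadow
            }
        }
        self.senders.insert(addr, SenderState { last_counter: counter });
        true
    }
}

fn main() {
    let mut t = ReplayTable { senders: HashMap::new(), last_evicted: None };
    let a: SocketAddr = "10.0.0.1:9000".parse().unwrap();
    assert!(t.check_and_update(a, 1));
    assert!(!t.check_and_update(a, 1)); // immediate replay rejected
    assert!(t.check_and_update(a, 2));
    println!("ok");
}
```

The H4 gap is visible in the sketch: once `last_evicted` is overwritten by a second eviction, a replay from the first evicted sender is indistinguishable from a new sender.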
Why the shadow isn’t deeper
A 1-deep shadow is acceptable for the loopback configuration: only
processes on the same host can craft loopback source addresses
(127.0.0.0/8 requires CAP_NET_RAW to set as a UDP source, and even
then the kernel refuses spoofed loopback from external interfaces). On
any reachable network — VLAN, VPC, the public internet — the source
address is freely forgeable, and a deeper shadow merely raises the
attacker’s required address budget rather than closing the gap.
Bounding the shadow to a single slot keeps the eviction story
constant-time and aligns the threat boundary with a clean operational
constraint (network reach), rather than a fuzzy quantitative argument
about how many spoofed sources are “enough”.
Mitigation
varta-watch defaults --udp-bind-addr to 127.0.0.1 when secure-UDP
keys are configured. Operators who genuinely need the listener to
accept non-loopback peers must pass --i-accept-secure-udp-non-loopback
explicitly — a CLI flag whose name signals the residual risk. When the
flag is set, a high-visibility startup warning is emitted to stderr and
the operator is expected to constrain network reach (firewall, private
VLAN, mTLS-fronted tunnel) so that no untrusted host can reach the bound
port.
The recovery gate on NetworkUnverified origins (see
peer-authentication.md) remains independent
of this flag — opting in to non-loopback secure-UDP does NOT enable
recovery commands from UDP-origin beats. Those still require the
separate
--secure-udp-i-accept-recovery-on-unauthenticated-transport
acknowledgement.
Fork-safety on secure-UDP
After fork(2), a child process inherits its parent’s
SecureUdpTransport state — the 16-byte iv_session_salt, the
iv_prefix_index, and the iv_counter. Three nominally-independent
fields whose product defines the AEAD nonce. If the child ever calls
Varta::beat() without intervention, it derives the same 12-byte
ChaCha20-Poly1305 nonce its parent has already emitted under the same
key — a catastrophic confidentiality and integrity failure (Poly1305
key recovery, plaintext XOR leak).
How Varta enforces fork-safety structurally
Varta::connect snapshots std::process::id() into a private
connect_pid field. Every Varta::beat reads the current PID and
compares — on mismatch (i.e. the handle is now in a forked child), the
wrapper invokes transport.reconnect() before building the frame.
SecureUdpTransport::reconnect() re-reads OS entropy into a fresh
16-byte session salt, recomputes the IV prefix, and resets the prefix
index and counter to zero. The child’s first emitted frame therefore
uses an IV prefix derived from independent entropy — nonce collision
across the fork boundary is impossible.
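The PID-snapshot check can be sketched as a tiny wrapper (hypothetical simplification; the real `Varta` also resets the session epoch and counts the event):

```rust
use std::process;

// Simplified sketch: snapshot the PID at connect time, compare on every
// beat. A mismatch means this handle now lives in a forked child, so the
// transport must re-key its IV state before the first frame is emitted.
struct VartaSketch {
    connect_pid: u32,
    fork_recoveries: u64,
}

impl VartaSketch {
    fn new() -> Self {
        Self { connect_pid: process::id(), fork_recoveries: 0 }
    }

    fn beat(&mut self) {
        if process::id() != self.connect_pid {
            // In the real client this calls transport.reconnect(), which
            // re-reads OS entropy and resets the IV prefix and counters.
            self.fork_recoveries += 1;
            self.connect_pid = process::id();
        }
        // ... encode and send the 32-byte frame ...
    }
}

fn main() {
    let mut v = VartaSketch::new();
    v.beat(); // same process: no recovery triggered
    assert_eq!(v.fork_recoveries, 0);
    println!("fork_recoveries={}", v.fork_recoveries);
}
```

The check is one integer comparison per beat, which is why it can live on the hot path without breaking the zero-allocation contract.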
Auto-recovery is silent: the caller observes BeatOutcome::Sent. The
event is observable via Varta::fork_recoveries() -> u64 (suggested
Prometheus name: varta_client_fork_recoveries_total). The local
session epoch resets too — nonce → 0, start → Instant::now(),
last_timestamp → 0, consecutive_dropped → 0 — so the child’s
wire stream looks like a fresh session to the observer.
Observer view
The observer’s per-sender state in SecureUdpListener is keyed by
(SocketAddr, iv_prefix) with a 1-deep replay history (see
H4 replay shadow above).
When the forked child sends frames from the same source port with a
new IV prefix, the observer transitions its current state into the
prev_* slots and accepts the new prefix as a fresh session — no
replay error, no protocol-level signal required. Fork-recovery is
entirely transparent to the wire format.
Advanced callers
Callers using SecureUdpTransport directly (without the Varta
wrapper) do not get auto-detection. The BeatTransport trait is
intentionally low-level; the safety policy lives one layer up.
Direct-transport users must call SecureUdpTransport::reconnect()
themselves in the forked child before the first beat.
Panic-hook parallel
install_panic_handler_secure_udp caches an 8-byte IV at install time
to avoid the (non-async-signal-safe) entropy read inside the panic
hook itself. The same fork hazard applies: a child that panics would
otherwise emit (cached_iv, iv_counter=1) — colliding with the
parent’s identical pair if the parent panicked too. The installer
snapshots install_pid and, inside the hook, re-runs the entropy
chain (getrandom/getentropy → /dev/urandom) when the PID has
changed. The strict variant fails closed (skips the secure frame) when
no entropy source is reachable; the accept-degraded-entropy variant
falls back to fallback_iv_random() per the documented degraded-entropy
policy.
Cross-references
- Observer liveness — the watcher’s own liveness story: in-process self-watchdog, systemd `sd_notify`, hardware watchdog, and paired-observer pattern
- Safety profiles — compile-time vs. runtime feature gating for production-safe builds
- Peer authentication — kernel-level PID attestation and transport trust classification
- Namespaces — dedicated reference for cross-namespace deployments
Future transports
Additional transports can be implemented by implementing BeatTransport (agent
side) and BeatListener (observer side) without touching the protocol core:
- Shared memory (`memfd`, `shm`) — Wasm plugins writing directly to a shared ring buffer
- Unix pipes (`pipe`, `fifo`) — stdin/stdout health frames for supervised processes
- WebSocket — for browser-based health dashboards
Observer Liveness — “Who Watches the Watcher?”
varta-watch is the single observer for all agents on a host. If it crashes
or its poll loop hangs, no agent gets a Stall event and no recovery fires —
the entire monitoring layer fails silently. For life-support deployments this
is the most critical functional gap.
This document describes four independent, layered defenses. Deploy as many as your environment supports; each catches failure modes the others cannot.
Threat model
| Failure mode | L1 | L2 | L3 | L4 |
|---|---|---|---|---|
| Poll loop hangs (stuck in I/O or computation) | ✓ | ✓* | ✗ | ✓ |
| Process crash (SIGSEGV, stack overflow, OOM) | ✗ | ✓ | ✓† | ✓ |
| Watchdog thread dies silently (panic, signal) | ✗ | ✓‡ | ✓† | ✓ |
| Kernel hang / host deadlock | ✗ | ✗ | ✓ | ✗ |
| Misconfiguration (wrong socket path, wrong user) | ✗ | ✗ | ✗ | ✓ |
*systemd detects a hang only if WATCHDOG=1 stops arriving; the self-watchdog
ensures that also stops when the loop wedges.
†hardware watchdog fires when the kick loop stops; process crash achieves this.
‡since H5 the watchdog thread is the sole source of WATCHDOG=1; if it
dies, the emission stream stops and systemd’s WatchdogSec= fires.
L1 — In-process self-watchdog (--self-watchdog-secs)
A background thread checks that the main poll loop has ticked at least once
within the configured deadline. If not, it calls process::abort().
varta-watch --self-watchdog-secs 4 ...
- The background thread is the only non-main thread in the binary. The beat path and observer loop remain single-threaded.
- `process::abort()` produces SIGABRT, which appears in `journalctl`, enables core dumps, and triggers `Restart=on-abort` in systemd units.
- The deadline should be set to roughly 2× the expected worst-case poll latency (typically `--threshold-ms` + reaping time).
- H5 (post-2026-05-13): the watchdog thread is ALSO the sole emitter of systemd `WATCHDOG=1`. Emission used to live on the main loop, which left a silent-failure window: if the watchdog thread died while the main loop remained healthy, `WATCHDOG=1` kept arriving from the main thread and systemd had no way to notice the in-process abort path was already gone. Now `WATCHDOG=1` emission is moved to the watchdog thread (via a `dup(2)`-ed copy of the notify socket carved off `SdNotify` with `take_watchdog_notifier`). If the thread dies, the emission stream stops and `WatchdogSec=` fires. This is the only design where systemd can detect a dead watchdog while the main loop is still alive.
- Auto-enable: when `$WATCHDOG_USEC` is set by the service manager and `--self-watchdog-secs` is not passed, the watchdog thread is spawned unconditionally with a 4 s deadline. Operators with tighter `WatchdogSec=` values can override via the CLI. This collapses the L1+L2 layers structurally: enabling `WatchdogSec=` in the unit automatically buys both the in-process abort path and the `WATCHDOG=1` emission stream.
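The core of the self-watchdog is a pure comparison over an atomic tick counter, sketched below with hypothetical names (the real thread also emits `WATCHDOG=1` and calls `process::abort()` when the check fires):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Bumped by the main poll loop once per iteration (monotonic nanoseconds).
static LAST_TICK_NS: AtomicU64 = AtomicU64::new(0);

// Pure decision function: has the main loop failed to tick within the
// deadline? The watchdog thread aborts the process when this returns true.
fn is_wedged(now_ns: u64, deadline_ns: u64) -> bool {
    let last = LAST_TICK_NS.load(Ordering::Relaxed);
    now_ns.saturating_sub(last) > deadline_ns
}

fn main() {
    let deadline_ns = 4_000_000_000; // 4 s, matching --self-watchdog-secs 4
    LAST_TICK_NS.store(1_000_000_000, Ordering::Relaxed);
    assert!(!is_wedged(2_000_000_000, deadline_ns)); // 1 s since tick: healthy
    assert!(is_wedged(9_000_000_000, deadline_ns));  // 8 s since tick: wedged
    println!("ok");
}
```

Keeping the check to one atomic load per wakeup is what lets the watchdog thread avoid touching any shared mutable state on the beat path.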
L2 — systemd sd_notify watchdog integration
varta-watch speaks the sd_notify(3) protocol natively. Set
Type=notify in the service unit and configure WatchdogSec=:
[Service]
Type=notify
NotifyAccess=main
WatchdogSec=5s
Restart=on-watchdog
RestartSec=1s
TimeoutStartSec=10s
ExecStart=/usr/bin/varta-watch \
--socket /run/varta/agents.sock \
--threshold-ms 5000 \
--self-watchdog-secs 4 \
--hw-watchdog /dev/watchdog \
--heartbeat-file /run/varta/heartbeat
varta-watch sends:
- `READY=1` after observer bind succeeds and all listeners are attached
- `WATCHDOG=1` every `WATCHDOG_USEC / 2` microseconds while the poll loop runs
- `STOPPING=1` when the SHUTDOWN latch flips
If WATCHDOG=1 stops arriving, systemd kills and restarts the process. This
catches both crashes (no more sends) and hangs (LAST_TICK_NS stops advancing,
the self-watchdog aborts, systemd restarts).
$NOTIFY_SOCKET and $WATCHDOG_USEC are passed automatically by systemd;
no extra flags are needed.
L3 — Hardware watchdog (--hw-watchdog)
On hosts with a kernel hardware watchdog (e.g. /dev/watchdog), varta-watch
can kick it once per poll iteration. If the kick stops, the kernel reboots the
host — even if the OS itself is wedged.
varta-watch --hw-watchdog /dev/watchdog ...
Magic close: on a clean shutdown (SIGTERM/SIGINT followed by graceful exit)
varta-watch writes the magic byte 'V' to disarm the watchdog before
exiting. A crash or hang leaves the watchdog armed; the kernel reboots after
its timeout.
The /dev/watchdog device is typically root-owned (mode 0600). Run
varta-watch as root or grant the CAP_SYS_ADMIN capability, or use a
watchdog daemon (e.g. watchdog(8)) for the actual device management.
L4 — Paired observers (operational)
A second monitoring process scrapes the first observer’s liveness signals and
restarts it if they stall. This requires no code changes — use the existing
--heartbeat-file and /metrics primitives.
Heartbeat-file poller
#!/bin/sh
HEARTBEAT=/run/varta/heartbeat
while :; do
prev=$(awk '{print $1}' "$HEARTBEAT" 2>/dev/null || echo 0)
sleep 5
cur=$(awk '{print $1}' "$HEARTBEAT" 2>/dev/null || echo 0)
if [ "$cur" -le "$prev" ]; then
logger -t varta-watchdog "heartbeat stalled (loop_count=$prev); restarting"
systemctl restart varta-watch
fi
done
The first field in the heartbeat file is a monotonically increasing loop counter. If it stops advancing, the observer is wedged or dead.
Prometheus uptime scraper
/metrics exposes varta_watch_uptime_seconds. A second Prometheus instance
(or Alertmanager rule) can alert when the gauge stops increasing:
# Alert when varta-watch uptime has not increased for 30 seconds.
alert: VartaWatchStalled
expr: rate(varta_watch_uptime_seconds[30s]) == 0
for: 30s
labels:
severity: critical
Threading note
--self-watchdog-secs spawns one background thread. This is the only
non-main thread in the varta-watch binary, and that property is a
load-bearing architectural invariant, not an accident. All agent beat
processing, stall detection, recovery spawning, and Prometheus serving happen
on the main thread. The watchdog thread reads two atomics (SHUTDOWN
and LAST_TICK_NS), calls process::abort() on wedge, and writes
WATCHDOG=1 to its own dup(2)-ed UnixDatagram fd; it never touches
shared mutable state. The dup-ed fd is independent kernel state — both
threads own their own descriptor and there is no synchronisation between
them on the notify path.
The single-threaded design is what lets the project preserve its zero-alloc,
ABI-stable beat contract: a beat is decoded into a stack-allocated
[u8; 32] and dispatched through the per-pid tracker without locking,
because nothing else holds a reference. Moving any phase of the loop to a
second thread would require a lock-free SPSC ring between threads at the
ingress and break that contract. Stall-detection latency under scrape load
is instead bounded by an explicit per-iteration latency budget — see below.
Why /metrics is on the poll thread
“Doesn’t scrape latency variance steal time from beat ingestion?”
It can, by up to ~200 ms per iteration — the structural cap of
PromExporter::serve_pending (100 ms serve deadline + 100 ms drain
deadline, see exporter.rs). The obvious mitigation is to spawn a second
thread that owns serve_pending and reads tracker state through a shared
snapshot. We deliberately do not do this. Three reasons:
- The beat path would acquire a lock on every tick. Whether via `Arc<Mutex<PromExporter>>` or an SPSC snapshot ring, every record-side counter increment (`pe.record_beat(...)`, `pe.record_stall(...)`, `pe.record_loop_tick(...)`, etc.) becomes either a mutex acquisition or a single-producer write into a wait-free queue. Neither is zero-overhead on the hot path, and both introduce per-architecture memory-ordering questions that the current `&mut self` model eliminates by construction.
- The zero-allocation invariant becomes harder to enforce. The beat path is currently zero-alloc post-`connect`, enforced by the `varta-tests` guard allocator. A snapshot ring requires either a pre-sized arena (more state on the hot path) or per-snapshot allocation (kills the invariant). Both are worse than what we have.
- The variance is already bounded and now observable. Scrape work per iteration is capped at ~200 ms by `PROM_READ_DEADLINE = 10 ms`, `PROM_MAX_CONNECTIONS_PER_SERVE = 8`, `PROM_MAX_DRAIN_PER_SERVE = 50`, the 100 ms serve deadline, and the per-IP token bucket. Operators see the variance through `varta_observer_serve_pending_seconds` (new — see “Observing scrape-induced latency” below); beat-path latency is `iteration_seconds - serve_pending_seconds` in PromQL.
Scrape-storm alarms and beat-path alarms therefore route off different metrics, and the load-bearing single-thread invariant is preserved.
Latency budget — worst-case poll iteration time
A bounded iteration time guarantees a bounded stall-detection latency. The
table below names the phases of the poll loop in main.rs and the
upper-bound source for each:
| Phase | Worst case | Source / constant | Observable as |
|---|---|---|---|
| 1. Drain queued stall events | O(queue)·~1 µs | Observer::poll_pending — one stack pop per call | (subsumed in iteration_seconds) |
| 2. Observer::poll() (one recv each) | ≤ read_timeout·N | UDS recv(2) blocks up to --read-timeout-ms (default 100 ms) per listener; UDP listeners are non-blocking | (subsumed in iteration_seconds) |
| 3. Maintenance counter drains | <1 ms | Constant work over observer.drain_* counters | (subsumed in iteration_seconds) |
| 3. Recovery::try_reap | ~64 µs | ≤64 waitpid(2, WNOHANG) syscalls (bounded outstanding-pids fan) | (subsumed in iteration_seconds) |
| 3. PromExporter::serve_pending | ≤200 ms | 100 ms serve deadline + 100 ms drain deadline (see exporter.rs) | varta_observer_serve_pending_seconds (independent histogram) |
| 4. Heartbeat-file atomic write | <5 ms | Same-dir write + rename (write_heartbeat_atomic) | (subsumed in iteration_seconds) |
| 4. sd_notify + HW watchdog kicks | <1 ms | One sendmsg(2) + one write(2) | (subsumed in iteration_seconds) |
| Iteration total (worst case) | ~310 ms | UDS read_timeout (100 ms) + serve_pending (≤200 ms) + small fixed work — assuming a single UDS listener | varta_observer_iteration_seconds |
Two observations the table makes explicit:
- The UDS read-timeout is the idle floor: with no incoming beats and no scrape pressure, every iteration costs about `read_timeout`. This is intentional — it yields CPU between recvs without busy-spinning. Lower the floor by lowering `--read-timeout-ms`, at the cost of a tighter idle poll loop.
- The worst-case active iteration is bounded by `read_timeout + serve_pending`, since `recv(2)` returns early as soon as a frame arrives and `serve_pending` is the only other phase that can spend more than a few milliseconds.
The default soft budget is 250 ms (--iteration-budget-ms). Iterations
exceeding it increment varta_observer_iteration_budget_exceeded_total and
are visible in the varta_observer_iteration_seconds histogram. The budget
is advisory: hard wedges (seconds, never returning) remain the responsibility
of --self-watchdog-secs.
The idle sleep at the end of an iteration with no pending I/O (10 ms) is excluded from the histogram. Idle time is a throttling primitive, not work latency; including it would mask the bad iterations.
Tuning relationship
For a given --threshold-ms T, stall-detection latency is bounded by
T + per_iteration_worst_case. With defaults
(--threshold-ms 5000, --read-timeout-ms 100, default serve_pending bounds)
the worst case is ~310 ms, so a stalled agent surfaces no later than
~5.31 s after its last beat.
The soft --iteration-budget-ms (default 250 ms) sits between the typical
case (~100 ms idle floor) and the worst case (~310 ms under scrape storm)
so the budget-exceeded counter fires only during real scrape pressure, not
on every active iteration. Operators with higher --read-timeout-ms or
multiple listeners should raise the budget proportionally
(budget ≥ read_timeout × N_listeners + 150 ms).
--self-watchdog-secs should be set such that
self_watchdog_secs × 1000 ≥ 4 × iteration_budget_ms so transient overruns
during scrape bursts do not trigger false-positive aborts. The default
guidance (--self-watchdog-secs 4 with --iteration-budget-ms 250) gives a
16× margin (4000 ms ÷ 250 ms), well above the worst-case ratio.
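The arithmetic in this section can be captured in a small helper (a sketch with illustrative names, not part of varta-watch; the small fixed per-iteration work is ignored):

```rust
use std::time::Duration;

/// Sketch of the documented tuning relations, using the quantities named
/// in this section. Names here are illustrative only.
struct Tuning {
    threshold: Duration,         // --threshold-ms
    read_timeout: Duration,      // --read-timeout-ms
    n_listeners: u32,            // number of bound listeners
    serve_pending_cap: Duration, // structural ~200 ms serve_pending cap
}

impl Tuning {
    /// Worst-case poll iteration: read_timeout × N_listeners + serve_pending
    /// cap (small fixed work ignored in this sketch).
    fn iteration_worst_case(&self) -> Duration {
        self.read_timeout * self.n_listeners + self.serve_pending_cap
    }

    /// Stall-detection latency bound: T + per-iteration worst case.
    fn stall_detection_bound(&self) -> Duration {
        self.threshold + self.iteration_worst_case()
    }

    /// Recommended soft budget: read_timeout × N_listeners + 150 ms.
    fn recommended_budget(&self) -> Duration {
        self.read_timeout * self.n_listeners + Duration::from_millis(150)
    }
}
```

With the defaults (threshold 5000 ms, read timeout 100 ms, one listener) this reproduces the ~300 ms iteration bound, the ~5.3 s detection bound, and the 250 ms budget quoted above.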
Observing scrape-induced latency
Three metrics together let an operator separate scrape pressure from beat-path slowness:
- `varta_observer_iteration_seconds` — wall time for the entire poll iteration (drain → poll → maintenance → recovery reap → serve_pending → heartbeat write → watchdog kicks). Bucketed by `[0.001, 0.005, 0.01, 0.05, 0.1, 0.25, 0.5, 1.0, +Inf]`. Includes serve_pending — unchanged contract.
- `varta_observer_serve_pending_seconds` — wall time for the `serve_pending` phase alone. Same bucket boundaries as `iteration_seconds` so the two are coherent. Configurable soft budget via `--scrape-budget-ms` (default 250 ms); overruns increment `varta_observer_scrape_budget_exceeded_total`.
- `varta_observer_iteration_budget_exceeded_total` — iterations exceeding `--iteration-budget-ms` (default 250 ms). Includes serve_pending time.
Beat-path latency is then a PromQL expression — the difference between iteration time and serve-pending time:
```promql
# P99 beat-path latency = P99(iteration_seconds) − P99(serve_pending_seconds).
# Note: subtracting quantiles is approximate (P99 of diff ≠ diff of P99s),
# but in practice serve_pending and the rest of the iteration are weakly
# correlated, so the approximation is monotonic with the true beat-path
# latency. Use sum by (le) (rate(...)) if you want exact derived histograms
# (compute beat_path_seconds in a recording rule from the two histograms).
histogram_quantile(0.99,
  sum by (le) (rate(varta_observer_iteration_seconds_bucket[5m])))
- histogram_quantile(0.99,
  sum by (le) (rate(varta_observer_serve_pending_seconds_bucket[5m])))
```
Alarms that should fire on beat-path slowness route off
iteration_seconds - serve_pending_seconds or off
iteration_budget_exceeded_total minus scrape_budget_exceeded_total
when scrape overruns dominate the budget overruns.
Alarms that should fire on scrape-storm pressure route off
scrape_budget_exceeded_total and serve_pending_seconds quantiles
directly.
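Where exact derived histograms are wanted, the subtraction can be materialised ahead of time as a recording rule, as suggested above (a sketch; the group and rule names are illustrative):

```yaml
groups:
  - name: varta-derived
    rules:
      # Pre-compute the per-bucket beat-path rate so alerts can run
      # histogram_quantile over an exact derived histogram instead of
      # subtracting two approximated quantiles.
      - record: varta:beat_path_seconds_bucket:rate5m
        expr: |
          sum by (le) (rate(varta_observer_iteration_seconds_bucket[5m]))
          - sum by (le) (rate(varta_observer_serve_pending_seconds_bucket[5m]))
```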
Recommended Prometheus alerts
```yaml
# Warn — more than 10% of recent iterations exceeded the soft budget.
alert: VartaIterationBudgetOverruns
expr: rate(varta_observer_iteration_budget_exceeded_total[5m])
      / rate(varta_observer_iteration_seconds_count[5m]) > 0.10
for: 5m
labels: { severity: warning }

# Crit — 99th-percentile iteration time has exceeded 500 ms (twice the budget).
alert: VartaIterationP99High
expr: histogram_quantile(0.99,
      sum by (le) (rate(varta_observer_iteration_seconds_bucket[5m]))) > 0.5
for: 5m
labels: { severity: critical }

# Warn — sustained scrape pressure (≥10% of serve_pending calls over budget).
# Fires on scrape-storm symptoms specifically, NOT on beat-path slowness.
alert: VartaScrapeStormPressure
expr: rate(varta_observer_scrape_budget_exceeded_total[5m])
      / rate(varta_observer_serve_pending_seconds_count[5m]) > 0.10
for: 5m
labels: { severity: warning }

# Crit — beat-path P99 latency exceeds 200 ms. Derived: subtract scrape
# time from iteration time so this alarm is immune to scrape storms.
# (See "Observing scrape-induced latency" for the approximation caveat —
# put this in a recording rule for production use.)
alert: VartaBeatPathP99High
expr: |
  (histogram_quantile(0.99,
    sum by (le) (rate(varta_observer_iteration_seconds_bucket[5m])))
  - histogram_quantile(0.99,
    sum by (le) (rate(varta_observer_serve_pending_seconds_bucket[5m])))) > 0.2
for: 5m
labels: { severity: critical }
```
Tracker bounded-work guarantee
Each beat frame triggers at most one call to find_evictable_slot when the
tracker is at capacity. That call scans at most eviction_scan_window slots
(default 256, configurable via --eviction-scan-window).
Per-frame slot reads ≤ eviction_scan_window.
A full table sweep — confirming every slot is ineligible — takes at most:
ceil(tracker_capacity / eviction_scan_window)
consecutive record() calls (the rotating cursor resumes where it stopped).
With defaults (capacity = 256, window = 256) this is 1 call. With
--tracker-capacity 4096 --eviction-scan-window 16 the sweep takes 256 calls —
each individual call still reads ≤ 16 slots, so the per-frame beat-path cost
stays bounded.
The varta_tracker_eviction_scan_window_max gauge (set once at startup) exposes
the configured window so dashboards can derive the worst-case sweep depth.
Operators alert on varta_tracker_eviction_scan_truncated_total to detect when
the cap engages under a unique-pid flood.
Combine this bound with the iteration-budget WCET derivation above:
iteration_max ≤ read_timeout × N_listeners + eviction_scan_window × slot_read_ns
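The bounded scan described above can be sketched as a rotating cursor (illustrative code, not the crate's implementation; "evictable" stands in for the real slot-eligibility check):

```rust
/// Rotating-cursor eviction scan: at most `scan_window` slots are inspected
/// per call, and the cursor resumes where it stopped, so a full sweep of
/// `capacity` slots costs ceil(capacity / scan_window) consecutive calls.
struct EvictionCursor {
    next: usize,
}

impl EvictionCursor {
    /// Scan up to `scan_window` slots starting at the cursor; return the
    /// index of the first evictable slot, advancing the cursor either way.
    fn find_evictable(&mut self, evictable: &[bool], scan_window: usize) -> Option<usize> {
        let capacity = evictable.len();
        for _ in 0..scan_window.min(capacity) {
            let i = self.next;
            self.next = (self.next + 1) % capacity;
            if evictable[i] {
                return Some(i);
            }
        }
        None // window exhausted: the real crate bumps eviction_scan_truncated_total
    }
}
```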
Tick-latency budget and hardware-watchdog margin
Bench-derived p99 cap
Under the canonical stress profile — 4096-slot tracker, balanced
eviction policy, 30 agents × 100 Hz (≈ 3 000 beats/s) over UDS — the
varta_observer_iteration_seconds p99 is ≤ 5 ms.
Run the bench to reproduce the measurement on your hardware:
```shell
cargo build --workspace --release --features prometheus-exporter
cargo run -p varta-bench --release -- tick-distribution
```
The bench asserts p99 ≤ 5 ms and exits non-zero if the cap is breached,
printing the full bucket distribution and observed percentiles for triage.
It also reports varta_tracker_eviction_scan_truncated_total and
varta_observer_iteration_budget_exceeded_total so you can confirm the
eviction-scan cap engages under the test load without blowing the latency
budget.
Soft iteration budget
--iteration-budget-ms (default 250 ms) is the soft per-iteration
ceiling. Overruns increment varta_observer_iteration_budget_exceeded_total
but do not abort the loop. The default 250 ms gives 50× headroom over the
5 ms p99 cap; overruns therefore indicate genuine scrape-storm pressure, not
normal active-load variance. See the “Latency budget” section for the full
derivation.
Hardware-watchdog timeout floor
Operators deploying --hw-watchdog /dev/watchdog must configure the
kernel watchdog device with a timeout of ≥ 30 s. The derivation:
| Margin factor | Value | Note |
|---|---|---|
| p99 iteration time | ≤ 5 ms | Bench-certified under canonical load |
| Iteration budget (soft) | 250 ms | Default; raise for higher --read-timeout-ms |
| Self-watchdog deadline | 4 s | Default auto-set from $WATCHDOG_USEC |
| Recommended device timeout | ≥ 30 s | ≥ 6000× p99 cap, ≥ 7× self-watchdog deadline |
The observer kicks the hardware watchdog at the end of every poll iteration
(after heartbeat-file write and sd_notify). A single missed kick cannot
trip the device; a sustained stall of ≥ device-timeout will. The 30 s
floor provides ample budget for:
- Audit-log filesystem stalls (`varta_log_suppressed_total{kind="audit_io"}` will show rate limiting if these recur)
- Prometheus scrape contention (`serve_pending_seconds` quantiles)
- The H5 self-watchdog’s 4 s deadline with ≥ 7× margin
Round-robin fairness bound
Observer::poll() rotates the next_listener_start cursor on every
non-WouldBlock receive. Per-listener worst-case admission delay is therefore
bounded by N_listeners × per-listener-recv-cost. Under the canonical bench
profile (single UDS listener) this is simply the UDS recv latency; with
N additional UDP listeners add N × ~10 µs per iteration.
Eviction scan under stress
The bench will record non-zero varta_tracker_eviction_scan_truncated_total
when the tracker fills and the 256-slot eviction window exhausts without
finding a stalled slot. This is expected and by design — the cap proves the
per-frame cost stays bounded even under a unique-pid flood. The p99 assertion
holds even when the truncation counter is non-zero.
Debounce table semantics under load
The Recovery runner keeps a per-pid ledger of the most recent recovery
fire (LastFiredTable). Each subsequent stall for the same pid is
gated on now - last_fired[pid] >= debounce; closer-than-debounce
stalls return RecoveryOutcome::Debounced and never spawn a child.
Capacity and eviction policy
The ledger is a fixed-size, array-backed table with capacity
MAX_LAST_FIRED_CAPACITY = 4096. Capacity is sized to make the M8
adversarial-burst pattern costly: 4096 distinct pids would have to
stall faster than debounce cadence before the eviction policy is
engaged. Per-slot cost is Option<LastFiredSlot> ≈ 24 bytes →
~96 KiB total — within budget for the observer.
When the table is full and a stall arrives for a new pid, the policy is fail-closed:
- The oldest slot is identified by a single bounded linear scan.
- If that slot’s age is at least `debounce`, it is evicted and the new pid takes its place. Per-pid debounce semantics are preserved because the evicted pid’s window has already elapsed. The eviction is counted in `varta_recovery_last_fired_evictions_total` (operators tune capacity on this signal).
- If the oldest slot’s age is below `debounce`, the recovery is refused. The runner returns `RecoveryOutcome::RefusedDebounceCapacity { pid }`, emits a `RefusedRecord { reason: "debounce_capacity" }` to the audit log, and bumps both `varta_recovery_outcomes_total{outcome="refused_debounce_capacity"}` and `varta_recovery_refused_total{reason="debounce_capacity"}`.
Eviction is debounce-respecting churn; refusal is suppression. Operators tune capacity on the first signal and alert on the second.
Clock-regression defense
All age comparisons use Instant::saturating_duration_since, which
returns Duration::ZERO on regression. ZERO-duration entries are
treated as “not eligible for eviction” — preventing a backwards
clock blip from auto-evicting the whole table.
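The debounce gate and the clock-regression defense can be sketched together (illustrative types and names, not the crate's actual `LastFiredTable`; the full-table eviction path is elided):

```rust
use std::time::{Duration, Instant};

/// One ledger entry: the last time recovery fired for a pid.
#[derive(Clone, Copy)]
struct LastFiredSlot {
    pid: u32,
    last_fired: Instant,
}

/// Fixed-capacity, array-backed ledger (Vec here for brevity).
struct LastFiredTable {
    slots: Vec<Option<LastFiredSlot>>,
    debounce: Duration,
}

impl LastFiredTable {
    fn new(capacity: usize, debounce: Duration) -> Self {
        Self { slots: vec![None; capacity], debounce }
    }

    /// Returns true if a recovery may fire for `pid` at `now`.
    fn may_fire(&mut self, pid: u32, now: Instant) -> bool {
        // Existing entry: gate on the debounce window.
        // saturating_duration_since returns ZERO on clock regression,
        // which fails the >= check, so a backwards blip refuses rather
        // than fires (and never auto-evicts the table).
        if let Some(slot) = self.slots.iter_mut().flatten().find(|s| s.pid == pid) {
            if now.saturating_duration_since(slot.last_fired) >= self.debounce {
                slot.last_fired = now;
                return true;
            }
            return false; // Debounced
        }
        // New pid: take a free slot if any (bounded eviction scan elided).
        if let Some(free) = self.slots.iter_mut().find(|s| s.is_none()) {
            *free = Some(LastFiredSlot { pid, last_fired: now });
            return true;
        }
        false // table full: the real runner refuses fail-closed
    }
}
```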
Recommended alerts
```promql
# Alert immediately on any debounce-capacity refusal — this is either
# legitimate scale-out past 4096 concurrent stalls or the M8
# adversarial stall-burst pattern. Either case warrants paging.
rate(varta_recovery_refused_total{reason="debounce_capacity"}[5m]) > 0

# Warn on sustained eviction churn — debounce semantics are still
# intact, but capacity is becoming a bottleneck under steady-state
# load. Tune MAX_LAST_FIRED_CAPACITY or audit which pids are stalling.
rate(varta_recovery_last_fired_evictions_total[5m]) > 0.1

# Page on any non-zero invariant-violation count — the defensive
# fall-throughs in LastFiredTable should never fire in correct
# operation. Non-zero values indicate a code bug, not load.
varta_recovery_invariant_violations_total > 0
```
Bounded-WCET guarantee
Every LastFiredTable operation is a linear scan over a fixed-size
backing store. The unit test last_fired_table_prune_bounded_wcet
asserts the prune sweep completes in under 5 ms in debug builds at
full capacity (a future refactor that reintroduces O(n²) behaviour
disguised as “cleanup” is caught by this test).
The pre-M8 HashMap-based implementation was the source of the
debounce-bypass bug closed by this section: reactive pruning at the
top of on_stall (prune_threshold = debounce * 10) left the map
full of fresh entries under adversarial load, and the at_capacity
branch skipped the debounce check entirely. The new table never
skips the check; capacity pressure surfaces as a refusal or an
audited eviction.
Cross-references
- Safety profiles — compile-time vs. runtime feature gating for production-safe builds
- VLP transports — transport-level trust classification
- Peer authentication — kernel-level PID attestation
- Verification — symbolic verification of `Frame::decode` (M7) and the `LastFiredTable` invariants on the verification roadmap
Recovery — Non-Blocking Spawn / Async Reap
Status: implemented (Sessions 01–03 completed). The `--recovery-timeout-ms` flag is live in `varta-watch`; see `crates/varta-watch/src/config.rs` and `crates/varta-watch/src/recovery.rs`.
1. Problem
varta-watch runs a single thread driving Observer::poll on a 100 ms
read-timeout cadence. When a stalled pid crosses its silence threshold,
the observer surfaces Event::Stall and the binary calls
Recovery::on_stall(pid).
Today, Recovery::on_stall (crates/varta-watch/src/recovery.rs:71)
shells out via Command::new("/bin/sh").arg("-c").arg(&rendered).status().
status() blocks the calling thread until the child exits, which means
the entire poll loop — beat decoding, exporter pumping, Prometheus
serving, stall surfacing for other pids — freezes for the duration
of the recovery template. A misbehaving template (sleep 30, a slow
restart script) effectively takes the observer offline.
This is blocker B1 for v0.1.0.
2. Goal
Replace the blocking shell-out with a non-blocking spawn followed by an asynchronous reap on subsequent observer ticks, and add an optional kill-after deadline so a runaway template cannot consume an unbounded recovery slot. All within the project’s hard constraints:
- Zero registry dependencies in `varta-watch` (path-only deps).
- No new threads. No `tokio`, no executors.
- No `unsafe`. The crate already declares `#![deny(unsafe_op_in_unsafe_fn, rust_2018_idioms)]`.
- Library code does not print; diagnostics live in `crates/varta-watch/src/main.rs` only.
3. API surface (Session 01 lock-in)
The public surface in varta_watch::recovery becomes:
```rust
use std::process::ExitStatus;
use std::time::Duration;

#[derive(Debug)]
pub enum RecoveryOutcome {
    /// A child process was forked and is now outstanding. The observer
    /// has NOT waited on it. Reap on a later tick via `try_reap`.
    Spawned { child_pid: u32 },

    /// The previous invocation for this pid is still inside the per-pid
    /// debounce window; nothing was spawned.
    Debounced,

    /// `Command::spawn` failed before the shell could run (e.g. fork
    /// failure, `/bin/sh` missing). Surfaced verbatim.
    SpawnFailed(std::io::Error),

    /// A previously-`Spawned` child has exited and was reaped on this
    /// tick. The observer never blocks waiting for this transition.
    Reaped { child_pid: u32, status: ExitStatus },

    /// A previously-`Spawned` child exceeded `recovery_timeout` and was
    /// killed via `kill(2)` on this tick.
    Killed { child_pid: u32 },

    /// `try_wait` or `kill` failed for an outstanding child. The pid is
    /// still tracked; the observer will retry on the next tick.
    ReapFailed(std::io::Error),
}

pub struct Recovery { /* private */ }

impl Recovery {
    /// Backwards-compatible constructor. Equivalent to
    /// `with_timeout(template, debounce, None)`.
    pub fn new(template: String, debounce: Duration) -> Self;

    /// Construct a runner with an optional per-child deadline.
    ///
    /// `timeout = None` ⇒ children are reaped but never killed
    /// (preserves v0.1.0 semantics for users who tolerate long-running
    /// recovery templates).
    pub fn with_timeout(
        template: String,
        debounce: Duration,
        timeout: Option<Duration>,
    ) -> Self;

    /// Render `{pid}` and spawn `/bin/sh -c <rendered>` non-blockingly.
    /// Returns `Spawned`, `Debounced`, or `SpawnFailed` — never blocks.
    pub fn on_stall(&mut self, pid: u32) -> RecoveryOutcome;

    /// Drain completed (or deadline-exceeded) children for one observer
    /// tick. Returns one outcome per state transition observed:
    /// `Reaped`, `Killed`, or `ReapFailed`. Never blocks; returns an
    /// empty vector when no children have transitioned since the last
    /// tick.
    pub fn try_reap(&mut self) -> Vec<RecoveryOutcome>;
}
```
Config gains:
```rust
pub struct Config {
    /* existing fields */
    pub recovery_timeout: Option<Duration>,
}
```
The --recovery-timeout-ms <MS> flag is not parsed in Session 01 —
that is Session 03’s deliverable. Session 01 only widens the type.
4. Lifecycle of one recovery
```text
Event::Stall ─► Recovery::on_stall(pid)
                   │
                   ├─ debounce-suppressed ─► Debounced
                   │
                   ├─ spawn ok ────────────► Outstanding
                   │
                   └─ spawn err ───────────► SpawnFailed (terminal)

on every Observer tick:

Recovery::try_reap()
   │
   ├─► child exited ──────────► Reaped { child_pid, status }  (terminal)
   │
   ├─► deadline exceeded ─► kill(2) ─► Killed { child_pid }   (terminal)
   │
   └─► try_wait/kill errno ───► ReapFailed(io::Error)         (retry)
```
Outstanding lives in a HashMap<u32, _> keyed by stalled pid (cold
path; allocation acceptable per the operator rules). One outstanding
child per stalled pid; if the pid stalls again while a child is still
outstanding, the per-pid debounce window suppresses a duplicate spawn.
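The non-blocking spawn / async-reap pattern itself can be demonstrated with plain std (a sketch of the call pattern only, not the `Recovery` implementation; debounce and the kill-after deadline are elided):

```rust
use std::process::{Child, Command, ExitStatus};
use std::thread::sleep;
use std::time::Duration;

/// Render `{pid}` and spawn non-blockingly: `spawn()` returns as soon as
/// the child is forked, unlike `status()`, which blocks until it exits.
fn spawn_recovery(template: &str, pid: u32) -> std::io::Result<Child> {
    let rendered = template.replace("{pid}", &pid.to_string());
    Command::new("/bin/sh").arg("-c").arg(rendered).spawn()
}

/// One reap attempt per observer tick. `try_wait()` is the non-blocking
/// counterpart of `wait()`: `None` means "still running, retry next tick".
fn try_reap_once(child: &mut Child) -> std::io::Result<Option<ExitStatus>> {
    child.try_wait()
}

/// Poll until the child transitions, sleeping one "tick" between attempts,
/// so other poll-loop work could run between reap attempts.
fn reap_on_ticks(child: &mut Child, tick: Duration) -> ExitStatus {
    loop {
        if let Some(status) = try_reap_once(child).expect("try_wait failed") {
            return status;
        }
        sleep(tick);
    }
}
```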
5. Tick budget
The observer’s READ_TIMEOUT is 100 ms. try_reap is invoked once
per Observer::poll iteration (Session 02 owns the wiring). Worst-case
latencies:
| Event | Latency upper bound |
|---|---|
| Successful child → Reaped surfaces | one tick (≤ 100 ms) after exit |
| Deadline exceeded → Killed surfaces | one tick (≤ 100 ms) after deadline |
| kill(2) → Reaped of killed child | one further tick (≤ 100 ms) |
These are additive with the observer’s normal stall-detection latency; they do not affect beat decoding or exporter throughput on the critical path.
6. Default behaviour when --recovery-timeout-ms is omitted
Config::recovery_timeout = None is the default. In that mode,
Recovery::with_timeout stores no deadline; outstanding children are
reaped on completion but are never killed. This preserves v0.1.0
semantics for operators whose recovery templates are intentionally
long-running (e.g. service restarts that block on health checks).
Operators who want the kill-after behaviour set
--recovery-timeout-ms <MS> explicitly. Sub-100 ms values still work
but the kill is surfaced no faster than one tick after the deadline.
7. Concurrency model
- Children are pid-indexed in `HashMap<u32, Outstanding>`. The observer’s `Tracker` is bounded to 64 distinct pids, so the map caps at 64 outstanding children in steady state.
- Debounce is per-pid and unchanged. A repeat stall for the same pid inside the debounce window returns `Debounced` regardless of whether a child is still outstanding.
- No locks; the `Recovery` struct is owned exclusively by the binary’s poll loop and is `!Send` by virtue of holding `std::process::Child` values, which is fine since the observer is single-threaded.
8. Out of scope for this epic
- `varta-vlp` (frame ABI is frozen).
- `varta-client` (no agent-side change).
- Observer poll cadence (still 100 ms read timeout).
- Exporter line schema.
- Panic-handler feature.
9. Cross-references
- Session 02 (`docs/claude-sessions/recovery-async-spawn/session-02-recovery-impl.md`) owns the green-phase implementation in `crates/varta-watch/src/recovery.rs` and the `try_reap` wiring in `crates/varta-watch/src/main.rs` / `observer.rs`.
- Session 03 (`docs/claude-sessions/recovery-async-spawn/session-03-cli-and-loop-integration.md`) owns the `--recovery-timeout-ms` parser, the HELP-text update, and threading `cfg.recovery_timeout` into `Recovery::with_timeout` at the binary call site.
- Acceptance contract: `docs/acceptance/varta-v0-1-0.md`, subsection Recovery — non-blocking.
10. Failing tests gating Sessions 02 and 03
Session 01 lands these as red-phase acceptance tests:
| Test | File | Owned by |
|---|---|---|
| recovery_spawn_returns_within_50ms_for_slow_template | crates/varta-watch/tests/recovery_e2e.rs | Session 02 |
| recovery_try_reap_yields_reaped_for_completed_child | crates/varta-watch/tests/recovery_e2e.rs | Session 02 |
| recovery_try_reap_kills_after_timeout | crates/varta-watch/tests/recovery_e2e.rs | Session 02 |
| recovery_concurrent_pids_run_in_parallel | crates/varta-watch/tests/recovery_e2e.rs | Session 02 |
| cli_help_lists_recovery_timeout_ms_flag | crates/varta-watch/tests/cli_smoke.rs | Session 03 |
| cli_parses_recovery_timeout_ms | crates/varta-watch/tests/cli_smoke.rs | Session 03 |
Peer Authentication
Varta’s observer trusts the kernel, not the wire. Two layers of defence in-depth ensure that process identity cannot be spoofed by anything that can reach the Unix Domain Socket.
Layer 1: socket file permissions (--socket-mode)
After bind(2), the observer chmods the socket file to 0600 by
default (owner read and write only). Only processes running under the
same UID as the observer can connect(2) to the socket.
| Flag | Default | Format | Behaviour |
|---|---|---|---|
| --socket-mode | 0600 | Octal (e.g. 0660) | File mode applied via chmod(2) after bind. Pass 0660 to allow group access. |
Layer 2: kernel credential verification
Linux
The observer sets SO_PASSCRED on the socket after binding. Every
recvmsg(2) call then receives a SCM_CREDENTIALS ancillary message
containing a struct ucred { pid, uid, gid } populated by the kernel.
The observer compares ucred.pid against frame.pid from the VLP wire
format. If they disagree the frame is silently dropped and
varta_frame_auth_failures_total is incremented. The ucred.uid
field is implicitly trusted by Layer 1 (--socket-mode 0600 already
restricts access to the owning UID), but could be checked as a
fail-safe if a permission bypass is ever discovered.
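Once the kernel has supplied the credentials, the verification decision itself is a plain comparison (a sketch with illustrative types; the `recvmsg`/`SCM_CREDENTIALS` plumbing is elided):

```rust
/// Kernel-supplied sender credentials (mirrors the fields of Linux
/// `struct ucred` that this check uses).
struct Ucred {
    pid: u32,
    uid: u32, // constrained by Layer 1 (--socket-mode); not re-checked here
}

#[derive(Debug, PartialEq)]
enum AuthDecision {
    Accept,
    /// Dropped silently; varta_frame_auth_failures_total is incremented.
    DropPidMismatch,
}

/// Layer-2 check: the wire-claimed pid must match the kernel-attested pid.
fn verify_frame(frame_pid: u32, cred: &Ucred) -> AuthDecision {
    if frame_pid == cred.pid {
        AuthDecision::Accept
    } else {
        AuthDecision::DropPidMismatch
    }
}
```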
macOS
On macOS, the observer first attempts getsockopt(LOCAL_PEERTOKEN)
immediately after each recvmsg(2). LOCAL_PEERTOKEN returns an
audit_token_t containing the sender’s PID, UID, GID, and audit
information. Because the observer is single-threaded and calls
getsockopt immediately after recvmsg, no other datagram can arrive
between the two syscalls.
When LOCAL_PEERTOKEN succeeds, the observer performs the same PID +
UID verification as on Linux. When it fails (e.g. on older macOS
versions or unconnected SOCK_DGRAM where the kernel doesn’t expose
per-datagram credentials), the observer falls back to two separate
getsockopt calls:
- `LOCAL_PEERPID` (0x0002) — returns the peer’s PID directly.
- `LOCAL_PEERCRED` (0x0001) — returns a `struct xucred` with the peer’s UID in `cr_uid`.
If the fallback also fails, the observer falls back to the sentinel
PID 0 — relying on --socket-mode 0600 as the primary defence.
FreeBSD, DragonFly BSD, NetBSD
On FreeBSD-family platforms, the observer sets LOCAL_CREDS on the
socket (value 0x0002 on FreeBSD/DragonFly, 0x0001 on NetBSD). Every
recvmsg(2) then receives a SCM_CREDS ancillary message containing a
struct cmsgcred { cmcred_pid, cmcred_uid, cmcred_euid, cmcred_gid, ... }
populated by the kernel. The observer extracts cmcred_pid and
cmcred_euid and performs the same PID + UID verification as on Linux.
The ancillary buffer is sized at 256 bytes — sufficient for the 84-byte
cmsgcred with generous headroom for future kernel extensions.
Note: On platforms other than Linux, macOS, FreeBSD, DragonFly, and NetBSD (OpenBSD, Solaris, illumos, etc.), `varta-watch` emits a startup warning via stderr: "per-datagram PID verification is unavailable. The only defence is --socket-mode (default 0600); any process under the same UID can impersonate any PID." This is by design — the kernel does not expose per-datagram peer credentials for unconnected `SOCK_DGRAM` on these platforms. Containers that run multiple processes under the same UID should be aware of this limitation.
UDP transport authentication
For network-based agents that emit beats over UDP, the trust model is
cryptographic, not kernel-attested. UDP has no peer-credential
mechanism on any platform — recvmsg(2) cannot tell the observer who
sent a datagram, only where it claims to be from. Varta therefore
requires authentication at the AEAD layer, and refuses to bind an
unauthenticated UDP listener without two layers of explicit opt-in.
Compile-time features (crates/varta-watch/Cargo.toml)
| Cargo feature | What it enables | Production posture |
|---|---|---|
| secure-udp | SecureUdpListener (ChaCha20-Poly1305 AEAD + per-sender replay) | Recommended |
| unsafe-plaintext-udp | UdpListener (no authentication) | Forbidden in production |
| udp-core | Internal — shared UDP socket wiring | (transitive) |
A build that does not include unsafe-plaintext-udp cannot link the
plaintext path at all. Passing --udp-port without keys to such a build
hard-errors at startup; there is no warn-and-continue path.
Runtime selection rules
When --udp-port is set, the observer chooses exactly one listener:
- If `--features secure-udp` is compiled in and `--key-file` / `--master-key-file` resolve to a usable key, bind `SecureUdpListener`.
- Otherwise, only the plaintext path remains. It is bound only if both `--features unsafe-plaintext-udp` is compiled in and `--i-accept-plaintext-udp` was passed on the command line.
- Any other configuration is a hard error (`InvalidInput`).
When the plaintext path is taken, a high-visibility varta_warn! is
emitted at startup naming the bound address, so the choice appears in
SIEM / syslog logs:
```text
UDP on <addr> is running WITHOUT authentication (--i-accept-plaintext-udp).
Any device with network reach to this port can inject heartbeats, suppress
stall detection, or trigger false recovery commands. NOT for production /
safety-critical use.
```
--i-accept-plaintext-udp is intentionally verbose: an operator who
types it is making an explicit statement that this build is for
development or testing, not for a hospital VLAN.
Why no kernel-level UDP credentials
Unix Domain Sockets carry SCM_CREDENTIALS / LOCAL_PEERTOKEN /
SCM_CREDS per-datagram. UDP carries none of those. Even on a single
host where --udp-bind-addr 127.0.0.1 is used, any local process can
send to that port — there is no equivalent of --socket-mode 0600 for
network sockets. AEAD is the only durable defence.
Recovery eligibility and transport-origin gating
Recovery commands (--recovery-cmd / --recovery-exec and the *-file
variants) take the stalled agent’s frame.pid and substitute it into
the spawned process (kill -9 {pid}, systemctl restart agent@{pid}.service,
etc.). That makes recovery a privileged action that targets an arbitrary
process by id — and means the wire-level frame.pid must be tied back to
the real sending process, not just to whoever holds an AEAD key.
The trust invariant
A recovery command MUST NEVER fire for a pid whose beat lifetime is not kernel-attested. In practice that means:
| Transport | Kernel-attested? | Recovery-eligible by default? |
|---|---|---|
| UDS | Yes — SO_PASSCRED / LOCAL_PEERTOKEN / SCM_CREDS | Yes |
| Plaintext UDP | No — peer_pid is always 0 | No |
| Secure UDP | No — frame is cryptographically authenticated but the kernel does not attest the sending process; a holder of the AEAD key (or a per-agent key derived from a leaked master key) can forge a beat for any pid | No |
Internally each beat is tagged with a BeatOrigin
(KernelAttested vs NetworkUnverified). The tracker pins the origin on
the slot’s first beat and rejects subsequent beats from a different
origin as Event::OriginConflict (counter:
varta_origin_conflict_total). First-origin-wins prevents an attacker on
an untrusted transport from “tainting” a slot that legitimately belongs to
a kernel-attested agent.
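First-origin-wins pinning can be sketched as follows (illustrative types, not the tracker's actual code):

```rust
/// Transport trust class pinned to a tracker slot on its first beat.
#[derive(Debug, Clone, Copy, PartialEq)]
enum BeatOrigin {
    KernelAttested,    // UDS with SO_PASSCRED / LOCAL_PEERTOKEN / SCM_CREDS
    NetworkUnverified, // any UDP variant, plain or secure
}

#[derive(Debug, PartialEq)]
enum BeatEvent {
    Accepted,
    /// Surfaced as Event::OriginConflict; bumps varta_origin_conflict_total.
    OriginConflict,
}

struct Slot {
    pinned: Option<BeatOrigin>,
}

impl Slot {
    fn record(&mut self, origin: BeatOrigin) -> BeatEvent {
        match self.pinned {
            None => {
                self.pinned = Some(origin); // first beat pins the origin
                BeatEvent::Accepted
            }
            Some(p) if p == origin => BeatEvent::Accepted,
            // Later beats from a different transport cannot taint the slot.
            Some(_) => BeatEvent::OriginConflict,
        }
    }
}
```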
Two-layer enforcement
- Startup hard-error. If any `--recovery-cmd` / `--recovery-cmd-file` / `--recovery-exec` / `--recovery-exec-file` is configured and `--udp-port` is set, the daemon refuses to start with `ConfigError::RecoveryRequiresAuthenticatedTransport`. Operators must pass `--i-accept-recovery-on-unauthenticated-transport` to proceed. The flag is verbose by design (matches the `--i-accept-<risk>` convention) and shows up in `cargo tree` / startup banners.
- Runtime origin gate. Even with the accept flag, `Recovery::on_stall` refuses to spawn the recovery command when the stalled slot’s pinned origin is `NetworkUnverified`. The refusal returns the typed `RecoveryOutcome::RefusedUnauthenticatedSource { pid }`, increments `varta_recovery_refused_total{reason="unauthenticated_transport"}`, and emits a structured `refused` record into the recovery audit log (`--recovery-audit-file`). To enable UDP-origin recovery the operator must construct the `Recovery` with `with_allow_unauthenticated_source(true)` — a second, conscious choice on top of the startup flag.
Why secure-UDP isn’t enough
The secure-UDP master-key mode binds frame.pid to the 4-byte PID prefix
in iv_random[0..4] and derives a per-agent key from the master key.
That is a useful cryptographic binding for the UDP threat model — a
holder of a single derived agent key cannot forge frames for other
pids. But the binding lives at the protocol layer, not at the kernel
layer:
- A leak of the shared key lets anyone forge any pid.
- A leak of the master key lets anyone derive any agent key.
- A leak of any per-agent key still lets that agent forge its own pid to misbehave (e.g. stop sending → trigger recovery against its own pid during legitimate maintenance windows).
Kernel attestation has no such failure mode: the kernel knows which
process owns the socket fd, and that knowledge cannot be forged by any
amount of key material. This is why Varta classifies all UDP variants
(plain and secure) as NetworkUnverified for the recovery-eligibility
decision.
Recovery command authentication boundary
--recovery-cmd (inline shell) and --recovery-cmd-file (file-based
shell) both spawn /bin/sh -c <template> with the observer’s full
process authority. In a safety-critical deployment a recovery template
like systemctl restart {service} or kill -9 {pid} can terminate
unrelated production processes if the template body is mis-edited or if
shell metacharacters appear unexpectedly.
To prevent accidental shell-mode deployment, shell mode requires
--i-accept-shell-risk at runtime. Without that flag, startup
hard-errors with a message that recommends --recovery-exec (which
calls execvp(2) directly — no shell, no metacharacter interpretation,
no injection surface). This applies to both the inline and file-based
forms; the shell-injection risk is identical regardless of where the
template comes from.
--recovery-exec and --recovery-exec-file do not require an
accept flag — they are the default-safe path.
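The exec-versus-shell distinction can be sketched with `std::process::Command`. This is an illustrative approximation, not Varta's spawn code — the helper names are invented, and the example assumes `/bin/echo` and `/bin/sh` exist:

```rust
use std::process::Command;

// Sketch: how exec mode and shell mode differ when a recovery template
// fires for pid 4242. Helper names are hypothetical.
fn spawn_exec_mode(program: &str, pid: u32) -> std::io::Result<std::process::Output> {
    // Exec mode: argv goes straight to execvp(2). The pid is one argv
    // entry; shell metacharacters are never interpreted.
    Command::new(program).arg(pid.to_string()).output()
}

fn spawn_shell_mode(template: &str, pid: u32) -> std::io::Result<std::process::Output> {
    // Shell mode: the whole template passes through /bin/sh -c, so every
    // metacharacter in the template body is live. This is the path that
    // requires --i-accept-shell-risk.
    let cmd = template.replace("{pid}", &pid.to_string());
    Command::new("/bin/sh").arg("-c").arg(cmd).output()
}

fn main() -> std::io::Result<()> {
    let out = spawn_exec_mode("/bin/echo", 4242)?;
    assert_eq!(String::from_utf8_lossy(&out.stdout).trim(), "4242");
    let out = spawn_shell_mode("echo {pid}", 4242)?;
    assert_eq!(String::from_utf8_lossy(&out.stdout).trim(), "4242");
    Ok(())
}
```

Both produce the same output here, but only the shell variant would also honour a template body like `echo {pid}; rm -rf /` — which is the entire injection surface the accept flag exists to call out.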
Prometheus /metrics endpoint exposure
The /metrics endpoint is HTTP/1.0 with mandatory bearer-token
authentication. When --prom-addr is set, --prom-token-file is
required: the observer refuses to start without it. Every scrape must
send `Authorization: Bearer <hex>`, where `<hex>` is the lowercase 64-character
hex form of the file’s 32 random bytes (the format produced by
openssl rand -hex 32). Missing or wrong tokens get
HTTP/1.0 401 Unauthorized and bump varta_prom_auth_failures_total.
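The expected header shape can be sketched as a parser. This is an illustrative helper (`parse_bearer` is not Varta's API); it encodes the constraint stated above — 64 lowercase hex characters, nothing else:

```rust
// Sketch: extracting and shape-checking the bearer token from a scrape
// request header line. A 32-byte token is 64 lowercase hex characters.
fn parse_bearer(header: &str) -> Option<&str> {
    let tok = header.strip_prefix("Authorization: Bearer ")?.trim_end();
    // Enforce the expected shape up front: exactly 64 lowercase hex chars.
    (tok.len() == 64 && tok.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f')))
        .then_some(tok)
}

fn main() {
    let good = format!("Authorization: Bearer {}", "ab".repeat(32));
    assert!(parse_bearer(&good).is_some());
    // Wrong case and wrong length both fail the shape check → 401 path.
    assert!(parse_bearer("Authorization: Bearer DEADBEEF").is_none());
}
```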
The token file is loaded through the same hardened validator that
guards --key-file (see “Secret-file validation” below): regular file,
no symlinks, owned by the observer UID, mode 0o600 or stricter,
opened with O_NOFOLLOW.
The endpoint also retains four DoS-protection layers from earlier work, so that a hostile scraper cannot exhaust file descriptors or starve the observer’s poll loop even before the auth check runs:
- Serve budget — at most `PROM_MAX_CONNECTIONS_PER_SERVE=8` accepted connections per outer poll tick, and a 100 ms wall-clock deadline.
- Drain budget — after the serve budget is exhausted, an additional `PROM_MAX_DRAIN_PER_SERVE=50` connections may be accepted and immediately closed, so the kernel accept queue does not back up.
- Per-source-IP token bucket — every accepted connection (in both serve and drain phases) decrements a per-IP token bucket sized by `--prom-rate-limit-burst` (default 10) and refilled at `--prom-rate-limit-per-sec` (default 5). Connections from an IP whose bucket is empty are closed without serving and counted as `varta_prom_connections_dropped_total{reason="rate_limit"}`.
- Per-IP table cap — the per-IP map is bounded to 1024 entries; when full, stale entries (no activity in 60 s) are evicted first, then if necessary the oldest entry is force-evicted and counted as `varta_prom_connections_dropped_total{reason="ip_table_full"}`.
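The token-bucket layer can be sketched as below, with the documented defaults (burst 10, refill 5/s). Timekeeping is simplified to explicit `now` values in seconds, and the type/field names are illustrative, not Varta's:

```rust
// Sketch: a per-IP token bucket. One bucket instance per source IP.
struct Bucket {
    tokens: f64,
    last: f64,    // last refill time, seconds
    burst: f64,   // --prom-rate-limit-burst (default 10)
    per_sec: f64, // --prom-rate-limit-per-sec (default 5)
}

impl Bucket {
    fn new(burst: f64, per_sec: f64, now: f64) -> Self {
        Bucket { tokens: burst, last: now, burst, per_sec }
    }

    /// true = serve the connection; false = close without serving and
    /// count as varta_prom_connections_dropped_total{reason="rate_limit"}.
    fn try_take(&mut self, now: f64) -> bool {
        // Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = (self.tokens + (now - self.last) * self.per_sec).min(self.burst);
        self.last = now;
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut b = Bucket::new(10.0, 5.0, 0.0);
    // The burst allows 10 back-to-back connections, then the bucket is empty.
    assert!((0..10).all(|_| b.try_take(0.0)));
    assert!(!b.try_take(0.0));
    // One second later, 5 tokens have refilled.
    assert!((0..5).all(|_| b.try_take(1.0)));
    assert!(!b.try_take(1.0));
}
```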
Token comparison is constant-time
The exporter compares the presented and expected tokens via
varta_vlp::ct_eq — the same constant-time XOR-and-OR routine that
guards Poly1305 tag verification. This prevents byte-by-byte timing
oracles from leaking the prefix of the token to a remote scraper.
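The XOR-and-OR pattern the text describes looks like this. The sketch below illustrates the technique only — the real routine is `varta_vlp::ct_eq`, whose exact signature this example does not claim to reproduce:

```rust
// Sketch of constant-time comparison: every byte is examined regardless of
// where the first mismatch occurs, so running time does not leak how long
// the matching prefix is.
fn ct_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        // Lengths are public here (a valid token is always 64 hex chars),
        // so this early return leaks nothing secret.
        return false;
    }
    let mut diff = 0u8;
    for (x, y) in a.iter().zip(b) {
        diff |= x ^ y; // accumulate differences; no early exit
    }
    diff == 0
}

fn main() {
    assert!(ct_eq(b"deadbeef", b"deadbeef"));
    assert!(!ct_eq(b"deadbeef", b"deadbeee"));
    assert!(!ct_eq(b"dead", b"deadbeef"));
}
```

A naive `a == b` slice comparison may return at the first mismatching byte, which is exactly the timing oracle this construction removes.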
Bind-address recommendation
The bearer token is the authoritative authentication boundary. Loopback
bind (127.0.0.1:<port> or [::1]:<port>) behind a reverse proxy
remains the recommended posture for defense in depth, but is no longer
the only defense. The observer still emits a startup varta_warn!
whenever the bound address is non-loopback, so the exposure is visible
in audit logs.
Prometheus scrape config
The standard authorization: block injects the bearer token verbatim:
```yaml
scrape_configs:
  - job_name: 'varta'
    static_configs:
      - targets: ['varta-host:9100']
    authorization:
      type: Bearer
      credentials_file: /etc/prometheus/varta-prom.token
```
The credentials_file should be the same content as
--prom-token-file on the observer; Prometheus reads it with the same
0600-or-stricter expectation.
Secret-file validation
Every file containing key material — --key-file, --accepted-key-file,
--master-key-file, and the new --prom-token-file — flows through
validate_secret_file in varta-watch/src/config.rs. The validator
enforces:
- The path is not a symlink (`symlink_metadata` + `is_symlink`).
- The path resolves to a regular file (not a directory, FIFO, block/char device, etc.).
- The mode is `0o600` or stricter (`mode & 0o077 == 0`).
- The file is owned by the observer’s UID (kernel-attested via `stat.uid`, not derived from the env).
- The file is opened with `O_NOFOLLOW` to close the TOCTOU window between the metadata check and the read.
A failure on any of these aborts startup with a typed ConfigError
naming the failing constraint (insecure permissions ..., must not be a symlink, owned by uid X, expected uid Y, etc.).
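The metadata checks can be sketched with std's Unix extensions. This is an illustrative approximation of `validate_secret_file`, not the real code: the real validator also re-opens with `O_NOFOLLOW` to close the TOCTOU window, and returns a typed `ConfigError` rather than a `String`:

```rust
use std::os::unix::fs::MetadataExt;
use std::path::Path;

// Sketch: the symlink / regular-file / mode / ownership checks.
// check_secret_file is a hypothetical name for illustration.
fn check_secret_file(path: &Path, observer_uid: u32) -> Result<(), String> {
    // symlink_metadata does NOT follow symlinks, so a symlink is visible.
    let meta = std::fs::symlink_metadata(path).map_err(|e| e.to_string())?;
    if meta.file_type().is_symlink() {
        return Err("must not be a symlink".into());
    }
    if !meta.file_type().is_file() {
        return Err("must be a regular file".into());
    }
    // All group/other bits must be clear: 0600 or stricter.
    if meta.mode() & 0o077 != 0 {
        return Err(format!("insecure permissions {:04o}", meta.mode() & 0o7777));
    }
    if meta.uid() != observer_uid {
        return Err(format!("owned by uid {}, expected uid {}", meta.uid(), observer_uid));
    }
    Ok(())
}

fn main() {
    use std::os::unix::fs::PermissionsExt;
    let p = std::env::temp_dir().join("varta-demo.key");
    std::fs::write(&p, b"0123456789abcdef0123456789abcdef").unwrap();
    std::fs::set_permissions(&p, std::fs::Permissions::from_mode(0o600)).unwrap();
    let my_uid = std::fs::metadata(&p).unwrap().uid(); // we created it → our uid
    assert!(check_secret_file(&p, my_uid).is_ok());
    std::fs::set_permissions(&p, std::fs::Permissions::from_mode(0o644)).unwrap();
    assert!(check_secret_file(&p, my_uid).is_err()); // group/other-readable
    std::fs::remove_file(&p).unwrap();
}
```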
Why environment-variable keys are gone
Earlier releases offered --key-env <NAME> as a key-source fallback.
That flag is removed. Passing it now returns
ConfigError::RemovedFlag with an inline migration hint pointing at
--key-file. The motivation:
- On Linux, `/proc/<pid>/environ` is readable by any process running under the same UID; a peer with a UDS connection to the observer (which already has UID-restricted access) can read the master key out of the observer’s own environment.
- In containers, `docker inspect <container>` exposes every environment variable to anyone with read access to the Docker socket — typically all members of the `docker` group, which is often a superset of the in-container UID.
- `systemd-journald` captures process environment on demand for crash reports; an env-var key ends up in `/var/log/journal` indefinitely.
File-based keys avoid all three exposures and slot into the same ownership/permission model as TLS private keys, SSH host keys, and any other long-lived secret an operator already knows how to manage.
The Key type in varta_vlp::crypto also lost its Copy derive and
gained a Drop impl that volatile-zeros the secret bytes before the
allocation is returned to the stack, closing a small but real leak
surface in core dumps and ASLR-defeated speculative reads.
Shutdown grace and systemd
--shutdown-grace-ms (default 5000, minimum 100) bounds the time
Recovery::drop blocks waiting for outstanding recovery children to
exit after issuing SIGKILL during shutdown. Children that outlive the
grace are abandoned to PID 1 for reaping; the observer process exits
either way, so the bound on shutdown latency is deterministic.
In a systemd unit, TimeoutStopSec must be at least
shutdown_grace_ms + 2 s (roughly: grace + reap margin) to ensure
that systemd does not SIGKILL the observer mid-grace and leak an
unreaped recovery child:
```ini
[Service]
Environment=VARTA_SHUTDOWN_GRACE_MS=5000
ExecStart=/usr/local/bin/varta-watch --shutdown-grace-ms ${VARTA_SHUTDOWN_GRACE_MS} ...
TimeoutStopSec=7s
KillMode=mixed
```
KillMode=mixed is recommended: systemd sends SIGTERM to the main
observer process only; the observer then runs its own Drop sequence to
kill+reap any recovery children it had spawned. This is what the
shutdown-grace tunable is designed around.
Recovery command environment isolation
When --recovery-env KEY=VALUE is specified (repeatable), the recovery
child process runs with a sanitized environment:
- The child’s environment is cleared entirely.
- `PATH` is set to `/usr/bin:/bin` (sufficient to locate common tools).
- Only the explicitly-listed `KEY=VALUE` pairs are exported.
Without --recovery-env, the child inherits the observer’s full
environment (backward compatible). This flag provides defense-in-depth
against environment-variable-based injection vectors (e.g. a malicious
LD_PRELOAD or IFS in the observer’s environment that could affect
/bin/sh -c behaviour).
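The sanitized spawn maps directly onto `std::process::Command`. The sketch below is illustrative (the helper name is invented) and assumes `/usr/bin/env` exists on the host:

```rust
use std::process::Command;

// Sketch: clear everything, pin PATH, export only the operator's pairs.
fn sanitized_command(program: &str, env_pairs: &[(&str, &str)]) -> Command {
    let mut cmd = Command::new(program);
    cmd.env_clear()                    // child starts from an empty environment
        .env("PATH", "/usr/bin:/bin"); // enough to locate common tools
    for (k, v) in env_pairs {
        cmd.env(k, v); // only explicitly-listed KEY=VALUE pairs survive
    }
    cmd
}

fn main() -> std::io::Result<()> {
    // LD_PRELOAD / IFS etc. from the observer's environment never reach
    // the child, even if they are set in this process.
    let out = sanitized_command("/usr/bin/env", &[("SERVICE", "pump-ctl")]).output()?;
    let text = String::from_utf8_lossy(&out.stdout).into_owned();
    assert!(text.contains("SERVICE=pump-ctl"));
    assert!(text.contains("PATH=/usr/bin:/bin"));
    assert!(!text.contains("LD_PRELOAD"));
    Ok(())
}
```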
Shell-mode recovery is gated by --i-accept-shell-risk at startup
(see the “Recovery command authentication boundary” section above).
When the flag is set, the observer still emits a single audit-trail
varta_warn! at startup so that the choice is captured in any SIEM /
syslog ingest alongside the other startup banners.
Template safety
The {pid} substitution in --recovery-cmd is safe regardless of the
authentication outcome. A u32 PID formatted as a decimal string
contains only the characters 0–9 and can never carry shell
metacharacters (;, |, &, $, `, etc.).
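The safety argument is easy to demonstrate. The `substitute_pid` helper below is hypothetical (for illustration), but the property it checks is exactly the one the text relies on:

```rust
// Sketch: a u32 rendered as decimal contains only '0'..='9', so the
// substituted value can never smuggle shell metacharacters.
fn substitute_pid(template: &str, pid: u32) -> String {
    template.replace("{pid}", &pid.to_string())
}

fn main() {
    // Even the largest possible pid is pure digits.
    let pid_str = u32::MAX.to_string();
    assert!(pid_str.chars().all(|c| c.is_ascii_digit())); // no ; | & $ ` possible
    assert_eq!(substitute_pid("kill -9 {pid}", 4242), "kill -9 4242");
}
```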
Metrics
| Metric | Type | Description |
|---|---|---|
| `varta_frame_auth_failures_total` | counter | Incremented every time a frame’s claimed PID does not match the kernel-verified sender PID (Linux only). |
| `varta_beats_total{pid="..."}` | counter | Per-PID total of accepted beats (only incremented after authentication passes). |
| `varta_prom_connections_dropped_total{reason="..."}` | counter | /metrics connections accepted but closed before serving. Reasons: `drain` (serve budget exhausted), `rate_limit` (per-IP token bucket empty), `ip_table_full` (per-IP state map force-evicted). |
| `varta_prom_auth_failures_total` | counter | /metrics scrapes that arrived without `Authorization: Bearer <hex>` or with a wrong token. Always emitted on every scrape (even at zero), so `absent()` alert rules stay green until the first incident. |
| `varta_recovery_refused_total{reason="..."}` | counter | Recovery commands NOT spawned because of a structural safety gate. Only reason currently defined: `unauthenticated_transport` (stalled slot’s pinned origin was `NetworkUnverified` and the operator did not enable UDP-origin recovery). Emitted at zero on every scrape. |
| `varta_origin_conflict_total` | counter | Beats dropped because the slot’s pinned transport origin disagreed with the beat’s origin (first-origin-wins). Non-zero values indicate either operator misconfiguration (same pid emitted from two transports) or an active spoofing attempt. |
Trust model summary
```text
Process ── connect(2) to UDS ──┐
                               ├─ [FAIL] Kernel blocks (Layer 1: --socket-mode 0600, wrong UID)
                               ├─ [PASS] Layer 2: SO_PASSCRED     → ucred.pid       (Linux)
                               │         Layer 2: LOCAL_PEERTOKEN → audit_token.pid (macOS, best-effort)
                               │         Layer 2: LOCAL_CREDS     → cmsgcred.pid    (FreeBSD, DragonFly, NetBSD)
                               │   ├─ [PID MISMATCH] → Drop frame + bump counter
                               │   ├─ [UID MISMATCH] → Drop frame as IoError
                               │   └─ [PID MATCH + UID MATCH] →
                               ↓
[SUCCESS] Observer trusts the PID → tracks,
          surfaces stalls, triggers --recovery-cmd
          with {pid} substitution.
```
The trust boundary is the kernel: a frame is only accepted if the kernel
attests that the sending process’s PID matches the one encoded in the
VLP frame and that the sending process runs under the observer’s UID.
On Linux this is enforced per-datagram via SO_PASSCRED; on macOS via
getsockopt(LOCAL_PEERTOKEN) with LOCAL_PEERPID/LOCAL_PEERCRED fallback;
on FreeBSD / DragonFly / NetBSD via LOCAL_CREDS + SCM_CREDS. Platforms
without kernel-level credential passing fall back to --socket-mode 0600.
Security limitations
No forward secrecy
The KDF derives per-agent and per-epoch keys from a single master key. If an agent key is compromised, frames captured from past epochs can still be decrypted. True forward secrecy requires bidirectional ephemeral key exchange (e.g. X25519), which is incompatible with the connectionless, one-way heartbeat model.
When the master key is rotated, all agents must be updated atomically.
The observer reads the master key once at startup from --master-key-file. To
rotate keys, restart the observer with the new master key file. SIGHUP-based
hot-reload is planned for a future release.
Panic-hook entropy policy (secure UDP)
install_panic_handler_secure_udp reads 8 bytes of cryptographic entropy at
install time (getrandom(2) on Linux, getentropy(3) on macOS/BSD, falling
back to /dev/urandom). The IV is pre-computed once so that no file I/O
occurs inside the panic handler itself (async-signal-safety).
Fail-closed default: if all entropy sources fail — common in chrooted
environments without a mounted /dev — the function returns
Err(PanicInstallError::EntropyUnavailable) and the hook is NOT registered.
This prevents a panic-time Critical frame from reusing a deterministic IV
under the same AEAD key, which would be a catastrophic nonce-reuse failure.
Degraded-entropy opt-in: use
install_panic_handler_secure_udp_accept_degraded_entropy to fall back to a
non-cryptographic IV derived from PID, TID, monotonic time, and a counter
(SipHash-2-4). This always succeeds but accepts nonce-reuse risk if the
process panics more than once. The verbose function name is intentional
structural enforcement matching the project’s --i-accept-<risk> convention.
Little-endian only
The VLP wire format uses little-endian integer encoding natively.
Protocol correctness depends on the host being little-endian (all tier-1
targets — x86_64 and aarch64 — satisfy this). Building on a big-endian
host is a compile error. See book/src/architecture/vlp-frame.md for design
rationale.
Panic-hook key lifetime — accepted residual
The secure-UDP panic handler (install_panic_handler_secure_udp,
install_panic_handler_secure_udp_accept_degraded_entropy) captures a Key
by move into a Box<dyn Fn> registered via std::panic::set_hook. The Box
is the single owner of the captured Key for the lifetime of the
process — Key is !Clone (see crates/varta-vlp/src/crypto/mod.rs), so
no duplicate of the secret bytes can exist anywhere else in the address
space.
The !Clone invariant pins the count of in-memory copies to one. The
remaining concern is the lifetime of that one copy on process exit:
- Normal hook replacement (`std::panic::take_hook`): the prior Box is dropped, the captured `Key`’s `ZeroizeOnDrop` fires, and the 32 secret bytes are wiped before the heap page is returned to the allocator. OK.
- `panic = "unwind"` profile, normal process exit: the panic-hook Box is leaked by the runtime — `Drop` is not called on registry-held objects at exit. The captured `Key` bytes persist in heap memory until the kernel reclaims the page. Linux does not zero pages on reclaim (memory contents are reused; zero-on-allocation guarantees apply only to new allocations into the same process).
- `panic = "abort"` profile: the panic-hook closure never runs, but `set_hook` still owns the Box — same residual as the normal-exit case. Additionally, no `Drop` runs anywhere during `abort()`.
This residual is accepted: there is no async-signal-safe mechanism
that can reliably wipe a heap-resident secret at process exit. atexit
handlers do not run on abort(), are not async-signal-safe, and race the
panic hook firing. mlock / memfd_secret cannot prevent the kernel
from copying the page during scheduler context switches or core dumps.
The minimum-surface design is to keep the captured Key alive in a
single Box and treat the OS process boundary as the security boundary:
inspecting the memory of a live process requires ptrace or
/proc/<pid>/mem privileges, at which point all in-memory secrets in
any design are accessible.
Cross-references
- Safety profiles — compile-time feature gating for dangerous recovery paths; production-safe build verification recipe
- Observer liveness — defending against `varta-watch` itself crashing or hanging
- VLP transports — transport-level trust classification and `BeatOrigin` semantics
PID-namespace semantics
Varta agents and the varta-watch observer can run on the same host but in
different Linux PID namespaces (typical when agents run in containers and the
observer on the host, or vice-versa). This document defines what the protocol
does in that case, why, and how operators configure it.
Problem statement
std::process::id() (called by Varta::beat()) returns the agent’s PID in
the calling process’s PID namespace (see pid_namespaces(7)). The observer’s
kernel-attested peer PID (SO_PASSCRED / LOCAL_PEERTOKEN / SCM_CREDS) is
the PID as seen from the observer’s namespace.
Two consequences when namespaces differ:
- The numeric pid is meaningless across the boundary. PID 17 in container A is a different process from PID 17 on the host. `kill(2)` against PID 17 in the observer’s namespace targets the observer-namespace process, not the agent.
- Collisions are guaranteed. Every container’s first process is PID 1. Two containerized agents binding the same observer socket will both claim PID 1.
Threat model
| Scenario | Risk |
|---|---|
| Host observer, host agents | None. |
| Host observer, agent in --pid=host container | None — agent uses host PIDs. |
| Host observer, agent in private-PID container | Cross-namespace: kill targets wrong process. |
| Two private-PID containers, shared observer | Pid collisions: containers claim same pid. |
| Container observer, host agents | Cross-namespace. |
Detection
On Linux, every process’s PID namespace has a unique inode exposed at
/proc/<pid>/ns/pid (stat(1) it, or readlink(1) for the canonical
pid:[NNNN] form). Two processes share a PID namespace iff their
/proc/<pid>/ns/pid symlinks resolve to the same inode.
varta-watch caches its own inode at startup
(crate::peer_cred::observer_pid_namespace_inode()) and, for every
kernel-attested beat, reads the peer’s inode
(crate::peer_cred::read_pid_namespace_inode(peer_pid)). Both helpers are
allocation-free; the per-beat read is one readlink(2) syscall into a stack
buffer (sub-microsecond on modern Linux).
Non-Linux platforms (macOS, BSD) return None from both helpers and the
comparison short-circuits to “match”. UDP listeners set peer_pid_ns_inode = None because there is no kernel attestation; the existing UDP recovery
refusal gate is the relevant protection there.
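The canonical `pid:[NNNN]` form can be parsed as below. The `parse_ns_inode` helper is illustrative, not the real `crate::peer_cred` code (which reads into a stack buffer without allocating); the live-readlink check is guarded so the sketch also runs on non-Linux hosts:

```rust
// Sketch: extracting the namespace inode from the canonical "pid:[NNNN]"
// form that readlink returns for /proc/<pid>/ns/pid.
fn parse_ns_inode(link: &str) -> Option<u64> {
    link.strip_prefix("pid:[")?.strip_suffix(']')?.parse().ok()
}

fn main() {
    assert_eq!(parse_ns_inode("pid:[4026531836]"), Some(4026531836));
    assert_eq!(parse_ns_inode("net:[4026531992]"), None); // wrong namespace kind
    // On Linux the live value comes from the symlink itself:
    if let Ok(link) = std::fs::read_link("/proc/self/ns/pid") {
        // Two processes share a PID namespace iff these inodes are equal.
        assert!(parse_ns_inode(&link.to_string_lossy()).is_some());
    }
}
```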
Mitigation by deployment style
| Deployment | Default behaviour | Operator action |
|---|---|---|
| Single namespace (host or container) | Pass-through. | None. |
| Containerized agents with --pid=host | Pass-through (same kernel-attested ns). | None. |
| Containerized agents with private PID namespace | Beats dropped at receive; recovery refused. Audit log shows reason=cross_namespace_agent. | Either fix the deployment (run agents with --pid=host) or accept the risk via --allow-cross-namespace-agents and arrange out-of-band PID translation in the recovery template. |
| Mixed: some agents same-ns, some cross-ns | Same-ns agents work; cross-ns agents refused and audit-logged. | Same as above; the gate is per-beat. |
| Operator wants fail-fast on misconfigure | Defaults silently drop and audit. | Pass --strict-namespace-check — daemon exits non-zero on first cross-ns beat. |
Audit and metrics inventory
| Surface | Linux signal |
|---|---|
| `varta_frame_namespace_mismatch_total` (counter) | Kernel-attested frames dropped at receive (peer ns ≠ observer ns). |
| `varta_tracker_namespace_conflict_total` (counter) | Beats dropped because the slot’s pinned ns inode disagreed with the beat’s (first-namespace-wins). |
| `varta_recovery_refused_total{reason="cross_namespace_agent"}` (counter) | Stalls refused at recovery time because the slot’s ns inode differed from the observer’s. |
| `varta_recovery_outcomes_total{outcome="refused_cross_namespace"}` (counter) | Same event, broken down on the outcome axis. |
| Audit log record with `reason=cross_namespace_agent` | TSV record in `--recovery-audit-file`. |
| `Event::NamespaceConflict` | Emitted to consumers via `Observer::poll()` so file/Prom exporters can record it. |
All counters are emitted at every scrape even at zero, so absent() alert
rules stay green-on-green until the first event.
API surface
- `Observer::observer_pid_namespace_inode() -> Option<u64>` — returns the observer’s cached PID-namespace inode (Linux only).
- `Observer::with_allow_cross_namespace(bool) -> Self` — opt out of the default refuse-and-audit behaviour. Wired from `--allow-cross-namespace-agents`.
- `Observer::drain_cross_namespace_drops() -> u64` — counter drain.
- `Observer::drain_namespace_conflicts() -> u64` — counter drain.
- `Tracker::pid_ns_inode_of(pid: u32) -> Option<Option<u64>>` — observer-side introspection.
- `Recovery::with_allow_cross_namespace(bool) -> Self` — same opt-out at the recovery layer.
- `Recovery::on_stall(pid, origin, cross_namespace_agent: bool)` — caller-supplied cross-ns flag (typically derived from `Event::Stall::pid_ns_inode` vs `Observer::observer_pid_namespace_inode()`).
- `Recovery::take_refused_cross_namespace() -> u64` — counter drain.
- `RecoveryOutcome::RefusedCrossNamespace { pid }` — refusal variant.
CLI flags
```text
--allow-cross-namespace-agents   Permit beats and recovery for agents whose
                                 kernel-attested PID namespace differs from
                                 the observer's. Default off — beats dropped
                                 at receive (counted) and recovery refused
                                 (audit + counter).
--strict-namespace-check         Fatal startup error on first cross-namespace
                                 beat. Default off — log + counter only.
```
Edge cases
- `/proc/<peer_pid>/ns/pid` unreadable (`ptrace_may_access` denial, peer exited between `recvmsg` and `readlink`, `/proc` not mounted): the helper returns `None`. The tracker’s `None → Some` upgrade allows one-shot recovery, so a transient `/proc` unavailability does not pin a slot as permanently unknown.
frame.pid != peer_pidcheck fires first for most real cross-namespace traffic (the two namespaces almost always produce different numeric pids for the same process). The namespace gate is belt-and-suspenders for the surprising case where the pids happen to collide. unsafe_code = "deny"is workspace-wide. The newreadlinkFFI follows the establishedpeer_cred.rspattern (extern "C"+ one-lineunsafe { ... }blocks with a SAFETY comment).- Frame ABI is unchanged — the 32-byte
Frameis not touched. All state lives observer-side.
Cross-references
- `vlp-transports.md` — overall transport model.
- `peer-authentication.md` — kernel-attested PID and the `BeatOrigin` trust classification.
- `pid_namespaces(7)` and `user_namespaces(7)` man pages — kernel reference.
Recovery audit log (schema v2)
The recovery audit log (varta-watch/src/audit.rs) is the canonical
forensic record of every recovery action the daemon took or refused. It
exists to satisfy three operational requirements:
- Traceability. For an IEC 62304 Class C device — or an aviation ground-station — every recovery action must be reconstructable after the fact: what was spawned, when, why, with what outcome.
- Survivability. A power cut on the host must not silently drop the most recent audit records.
- Tamper-evidence. A reviewer must be able to detect retroactive editing of historical records.
Schema v1 (the pre-2026 format) satisfied only the first of these. Schema
v2 — the current format — satisfies all three when the daemon is built
with the audit-chain feature.
File format
Two file-level header lines, then one record per line. Fields are
tab-separated. Every record kind carries a leading seq column and a
trailing chain column. Free-form fields (program paths, refusal
reasons) have their \t, \n, and \r bytes replaced with a single
space at write time so a maliciously-chosen argv[0] can never inject
columns.
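The write-time field sanitization can be sketched as a one-liner. The `sanitize_field` name is illustrative; the behaviour is exactly what the paragraph above specifies:

```rust
// Sketch: replace \t, \n, \r in free-form fields with a single space so a
// hostile argv[0] cannot inject TSV columns or extra records.
fn sanitize_field(raw: &str) -> String {
    raw.chars()
        .map(|c| if matches!(c, '\t' | '\n' | '\r') { ' ' } else { c })
        .collect()
}

fn main() {
    assert_eq!(
        sanitize_field("/bin/evil\tname\nwith\rbreaks"),
        "/bin/evil name with breaks"
    );
    // Benign paths pass through untouched.
    assert_eq!(
        sanitize_field("/usr/local/sbin/varta-recover"),
        "/usr/local/sbin/varta-recover"
    );
}
```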
```text
# varta-watch recovery audit v2
```
boot
```text
seq wallclock_ms observer_ns boot daemon_pid prev_chain|- reason chain
```
A boot record opens every audit-log session and every post-rotation
generation. The reason column carries one of six stable tokens:
| reason | when it fires | prev_chain |
|---|---|---|
| fresh | brand-new file with no prior content | - |
| resume | clean v2 tail from a prior session | last chain |
| legacy_v1 | existing file uses v1 schema; v2 section starts here | - |
| corrupt_tail | v2 file with a torn last record (kernel partial write); the file is ftruncate’d to the last newline before this record is appended | last good chain if recoverable, else - |
| schema_drift | header is neither v1 nor v2 | - |
| rotation | rotation generation roll | last chain of pre-rotation file |
spawn
```text
seq wallclock_ms observer_ns spawn agent_pid child_pid mode program source template_len chain
```
Emitted at the moment a recovery child is fork(2) + execvp(2)’d.
mode ∈ {exec, shell}; program is the path actually invoked
(/bin/sh for shell mode, argv[0] for exec mode); source is either
the literal "inline" or the path-string for --recovery-cmd-file /
--recovery-exec-file. The command template itself is not logged —
it may contain secrets, and the source path is already auditable.
complete
```text
seq wallclock_ms observer_ns complete agent_pid child_pid outcome exit_code|- signal|- duration_ns stdout_len stderr_len truncated chain
```
Emitted on reap, kill-after-timeout, or reap failure. outcome is one of
reaped, killed, reap_failed. exit_code and signal are mutually
exclusive: at most one is a number, the other is -.
refused
```text
seq wallclock_ms observer_ns refused agent_pid reason chain
```
Emitted when a stall is detected but recovery is structurally declined
(e.g. unauthenticated transport, cross-namespace agent). reason is a
stable short token so SIEM consumers can alert on it without parsing
free text.
Sequencing
seq is a u64 starting at 1 on the first boot record. It is strictly
monotonic within a daemon lifetime and across daemon restarts (the
new daemon resumes from last_seq + 1 after parsing the existing tail).
A consumer detects record loss as a gap: seq[i+1] - seq[i] > 1.
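The consumer-side gap check is a two-element window scan. The `first_gap` helper below is illustrative, not part of Varta's API:

```rust
// Sketch: find the first seq gap in a parsed audit log. Adjacent records
// must differ by exactly 1; anything larger means lost records. Restart
// boundaries are invisible here by design — seq is monotonic across
// daemon restarts.
fn first_gap(seqs: &[u64]) -> Option<(u64, u64)> {
    seqs.windows(2)
        .find(|w| w[1] != w[0] + 1)
        .map(|w| (w[0], w[1]))
}

fn main() {
    assert_eq!(first_gap(&[1, 2, 3, 4]), None);
    assert_eq!(first_gap(&[1, 2, 5, 6]), Some((2, 5))); // records 3 and 4 lost
}
```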
Durability cadence
Every record_* call is followed by BufWriter::flush() and
File::sync_data() (= fdatasync(2) on Linux) at a configurable cadence
controlled by --recovery-audit-sync-every <N>:
- `N = 1` (default, IEC 62304 Class C-conforming): one `fdatasync` per record.
- `N > 1`: one `fdatasync` per `N` records. The daemon emits a startup warning and the build is not Class C-conforming. Up to `N - 1` records can be lost on power cut.
- `N = 0`: rejected at parse time.
In addition, the daemon unconditionally syncs:
- Before every rotation rename.
- After writing the post-rotation `boot` record.
- In `Drop` (best-effort; not load-bearing for correctness).
Tamper-evidence: the hash chain
When the daemon is built with --features audit-chain, every record’s
trailing chain column is the lowercase-hex SHA-256 of:
```text
DOMAIN || 0x00 || kind || 0x00 || prev_chain_raw || 0x00 || body_with_seq
```
where:
- `DOMAIN = b"VARTA-AUDIT-v2"`. The trailing `v2` is the schema version; a future v3 mandatorily bumps this so chains across schemas cannot be confused.
- `kind` is the bytes `b"boot"` / `b"spawn"` / `b"complete"` / `b"refused"`.
- `prev_chain_raw` is the raw 32-byte prior chain hash (not its hex form), or `[0u8; 32]` for the very first record in a fresh file.
- `body_with_seq` is the TSV line from the `seq` column up to (but not including) the chain column — no trailing `\n`.
- The `0x00` separators prevent field-boundary confusion: e.g. `(kind="ab", body="cd")` and `(kind="abcd", body="")` hash to distinct strings.
The construction is implemented once in
crates/varta-vlp/src/crypto/hash.rs::audit_chain_hash so callers
cannot accidentally drop the domain separation or transpose the input
order.
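The chain construction can be sketched as below. To keep the example dependency-free, std's `DefaultHasher` (SipHash) stands in for SHA-256 — it is NOT collision-resistant and must not be used for a real audit chain; only the domain-separation and chaining structure is the point:

```rust
use std::hash::Hasher;

// Sketch: domain-separated hash chaining. DefaultHasher is a stand-in for
// SHA-256, purely for illustration.
fn chain_hash(kind: &[u8], prev_chain: u64, body_with_seq: &[u8]) -> u64 {
    let mut h = std::collections::hash_map::DefaultHasher::new();
    // DOMAIN || 0x00 || kind || 0x00 || prev || 0x00 || body
    h.write(b"VARTA-AUDIT-v2");
    h.write(&[0]);
    h.write(kind);
    h.write(&[0]);
    h.write(&prev_chain.to_le_bytes()); // real code feeds the raw 32-byte digest
    h.write(&[0]);
    h.write(body_with_seq);
    h.finish()
}

fn main() {
    let first = chain_hash(b"boot", 0, b"1\t...\tfresh"); // prev zeroed: fresh file
    let second = chain_hash(b"spawn", first, b"2\t...\t/bin/echo");
    // Editing any historical byte changes every subsequent link:
    let edited_first = chain_hash(b"boot", 0, b"1\tEDITED");
    let tampered = chain_hash(b"spawn", edited_first, b"2\t...\t/bin/echo");
    assert_ne!(second, tampered);
    // Separators keep field boundaries distinct:
    assert_ne!(chain_hash(b"ab", 0, b"cd"), chain_hash(b"abcd", 0, b""));
}
```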
What this detects
- Any byte edited in any historical record. The edited record’s own chain stops matching, and every subsequent chain also stops matching.
- Any record deleted. The chain breaks at the deletion point.
- Any record inserted. Same — the chain over the synthetic record cannot match the next legitimate record.
- Records reordered. The chain validates only in original order.
What this does NOT detect
A pure SHA-256 hash chain — without a secret key — can be recomputed
end-to-end by an attacker with write access to the file. Tampering is
only detectable when the latest chain head is verified against an
externally trusted source. Operators in safety-critical deployments
should periodically export tail -1 audit.log | cut -f<last> to a
sealed log (Tang, AWS S3 with object-lock, a hardware HSM, etc.). The
daemon does not do this — it is an operational policy decision.
A future HMAC-keyed mode is out of scope for v2 to avoid forcing a key-distribution workflow on every Class C deployment.
When audit-chain is disabled
If the daemon is built without --features audit-chain:
- The `chain` column is the literal string `-`.
- The daemon emits a startup warning explicitly stating that the build is not IEC 62304 Class C-conforming.
- `seq` and the `fdatasync` cadence still work — record loss is detectable and power-cut durability is preserved; only tamper-evidence is absent.
The build remains zero-registry-dep (the audit-chain feature
propagates the existing optional crypto deps in varta-vlp/crypto).
Rotation
When --recovery-audit-max-bytes <N> is set, the file rotates after any
write that pushes it over the threshold: PATH → PATH.1 → … →
PATH.5. Five generations are kept; the oldest is unlinked. The same
generation count as the event-stream FileExporter.
The chain spans rotation: the first non-header record in the new
generation is a boot with reason=rotation whose prev_chain column
is the final chain of the just-rotated file. A reviewer who pieces
generations together by seq order can replay-verify the chain across
the entire history.
Verification recipe
```sh
# 1. Confirm seq is strictly monotonic across all generations.
cat audit.log.5 audit.log.4 audit.log.3 audit.log.2 audit.log.1 audit.log \
  | grep -v '^#' \
  | awk -F'\t' 'NR==1 { prev = $1; next } $1 != prev+1 { print "GAP at seq", $1; exit 1 } { prev = $1 }'

# 2. Confirm the chain validates (requires the daemon's audit_chain_hash
#    helper exposed in a verification tool — out of scope for the daemon
#    binary itself; see book/src/architecture/peer-authentication.md for
#    the pattern).

# 3. Cross-check that the chain head matches the latest sealed-log entry
#    the operator exports to their trusted store.
```
CLI surface
| Flag | Required | Default | Meaning |
|---|---|---|---|
| --recovery-audit-file <PATH> | no | unset | Append audit records to PATH. Created mode 0600. |
| --recovery-audit-max-bytes <N> | no | unbounded | Rotate after a write that pushes the file past N bytes. |
| --recovery-audit-sync-every <N> | no | 1 | fdatasync cadence. 1 is the only Class C-conforming value. |
Threat model
| Threat | Detected? | Mechanism |
|---|---|---|
| Record loss from buffer-only flush + power cut | yes | seq gap; durability cadence; rotation pre-rename sync |
| Record loss from process kill | yes | seq gap; resume boot on restart |
| Single record edit (any byte) | yes (with chain) | hash chain divergence |
| Bulk re-write by attacker with file-write access AND chain re-computation | no | requires an external sealed chain-head log |
| Schema downgrade (v2 → v1) | yes | schema_drift boot or first-line header check |
| Replay of a captured audit file in a different deployment | yes (with chain) | initial prev_chain = [0; 32] differs per host/lifetime |
Compile-time Configuration (Class-A profile)
The Class-A safety-critical profile builds varta-watch with the
compile-time-config Cargo feature. In this profile the runtime binary
has no argv parser, no Prometheus HTTP exporter, and a single
neutral --help body that mentions no flag names. Every operational
knob is supplied at compile time by build.rs from a static
KEY = VALUE file pointed to by the VARTA_CONFIG_FILE environment
variable.
The Class-A binary is verified by the CI safety-profiles job:
```sh
B=target/release/varta-watch
strings "$B" | grep -E -- "(GET /metrics|HTTP/1\.|--[a-z])"
# expect: no output
```
When to use this profile
- Hospital VLAN deployments where every CVE surface is a liability.
- IEC 62304 Class C medical devices (insulin pumps, holter monitors, ventilators) where the host configuration is part of the validated firmware.
- Avionics / industrial-control systems where the binary must boot from a signed image and accept no operator input post-deployment.
For SRE / cloud deployments use the default-feature build (or
--features prometheus-exporter for /metrics). The two profiles are
mutually exclusive at compile time via a compile_error! guard in
crates/varta-watch/src/lib.rs.
Build recipe
export VARTA_CONFIG_FILE=/etc/varta/varta.conf
cargo build -p varta-watch --release \
--no-default-features --features secure-udp,compile-time-config
secure-udp is the recommended companion feature — Class-A almost
always wants authenticated transport. Other features that combine
cleanly with compile-time-config: audit-chain, json-log,
unsafe-shell-recovery (only when the operator’s signed config
explicitly opts in via i_accept_shell_risk = true).
The prometheus-exporter feature is forbidden in combination with
compile-time-config; cargo build fails with a clear compile_error!
diagnostic.
File grammar
Plain text, UTF-8. Lines that begin with # or are entirely whitespace
are ignored. Each remaining line is KEY = VALUE:
- The `=` separator may have any amount of whitespace on either side.
- `KEY` must be in the `KNOWN_KEYS` catalogue (see below).
- `VALUE` is the rest of the line after the first `=`, trimmed. Quoting is not supported; paths and strings are taken verbatim.
- Repeated singleton keys are a build error; repeated list keys (`recovery_env`) accumulate.
- Unknown keys are a build error that surfaces during `cargo build`.
Example:
# /etc/varta/varta.conf
socket = /run/varta/varta.sock
threshold_ms = 5000
socket_mode = 0600
# Recovery: exec-mode only, never shell.
recovery_exec_cmd = /usr/local/sbin/varta-recover {pid}
recovery_audit_file = /var/log/varta/recovery.tsv
recovery_audit_sync_every = 1
# Authenticated UDP listener bound to loopback.
udp_port = 8443
udp_bind_addr = 127.0.0.1
secure_key_file = /etc/varta/agent.key
# Hospital deployment: medical-device clock semantics + strict mode.
clock_source = boottime
strict_namespace_check = true
Accepted keys
| Key | Type | Default | Notes |
|---|---|---|---|
| socket | path | required | UDS path the observer binds. |
| threshold_ms | u64 | required | Per-pid silence window. Minimum 10. |
| socket_mode | octal | 0600 | UDS file mode after bind. |
| read_timeout_ms | u64 | 100 | UDS read timeout per poll call. |
| udp_port | u16 | none | Bind a UDP listener on this port. |
| udp_bind_addr | ip | runtime default | Loopback for secure-UDP; 0.0.0.0 for plaintext. |
| secure_key_file | path | none | 64-hex-char primary key (secure-udp). |
| accepted_key_file | path | none | One key per line for rotation. |
| master_key_file | path | none | 64-hex-char master for per-agent derivation. |
| recovery_cmd | string | none | Shell template (requires unsafe-shell-recovery). |
| recovery_exec_cmd | string | none | program args … invoked via execvp. |
| recovery_cmd_file | path | none | Read recovery_cmd from a hardened file. |
| recovery_exec_file | path | none | Read recovery_exec_cmd from a hardened file. |
| recovery_debounce_ms | u64 | 1000 | Per-pid debounce window. |
| recovery_env | list-of-string | empty | KEY=VALUE; repeatable. |
| recovery_timeout_ms | u64 | none | Kill-after deadline for recovery children. |
| recovery_audit_file | path | none | TSV recovery audit log. |
| recovery_audit_max_bytes | u64 | none | Audit-file rotation byte cap. |
| recovery_audit_sync_every | u32 | 1 | fdatasync cadence (1 = every record). |
| recovery_capture_stdio | bool | false | Capture child stdio for audit. |
| recovery_capture_bytes | u32 | 4096 | Stdio capture cap. Max 1048576. |
| file_export | path | none | TSV event-stream sink. |
| export_file_max_bytes | u64 | none | Event-file rotation cap. |
| heartbeat_file | path | none | Per-tick liveness file. |
| tracker_capacity | usize | 256 | Max tracked PIDs. |
| tracker_eviction_policy | enum | strict | strict or balanced. |
| eviction_scan_window | usize | 256 | Max slots scanned per eviction attempt. Range [1, 4096]. |
| max_beat_rate | u32 | none | Per-pid beats/sec cap. |
| clock_source | enum | monotonic | monotonic or boottime (Linux only). |
| iteration_budget_ms | u64 | 250 | Per-iteration soft budget. Range [50, 60000]. |
| scrape_budget_ms | u64 | 250 | Per-serve_pending soft budget. Range [50, 60000]. |
| shutdown_after_secs | u64 | none | Self-terminate after this uptime. |
| shutdown_grace_ms | u64 | 5000 | Drop blocking time during shutdown. Minimum 100. |
| self_watchdog_secs | u64 | none | Self-watchdog deadline (auto-enables under systemd). |
| hw_watchdog | path | none | Hardware watchdog device (/dev/watchdog). |
| i_accept_plaintext_udp | bool | false | Runtime acknowledgement. |
| i_accept_shell_risk | bool | false | Runtime acknowledgement. |
| i_accept_recovery_on_secure_udp | bool | false | Recovery on secure-UDP transport. |
| i_accept_recovery_on_plaintext_udp | bool | false | Recovery on plaintext UDP. |
| i_accept_secure_udp_non_loopback | bool | false | Non-loopback secure-UDP bind. |
| allow_cross_namespace_agents | bool | false | Permit cross-PID-namespace beats. |
| strict_namespace_check | bool | false | Fatal exit on cross-namespace agent. |
| inject_wedge_ms | u64 | none | Test-hooks only (requires test-hooks feature). |
Operational contract
- `--help` (and any other argv) is rejected at startup. The binary exits non-zero with the neutral diagnostic “this binary was configured at compile time; refusing to accept command-line arguments”.
- Diagnostic messages in stderr / sd_notify use neutral wording; no `--flag-name` strings appear anywhere in the binary. See the cerebrum entry on `pub const &str` being unconditionally linked for the rationale.
- The configuration file is consumed once, at `cargo build` time. The resulting binary is immutable: redeployment requires a new build. This is the structural feature operators rely on for Class-A release-gating.
See also
- Safety profiles overview
- Peer authentication — key-file requirements
- Observer liveness — self-watchdog wiring
Safety Profiles
varta-watch ships with a two-layer gate for every structurally-dangerous
capability: a compile-time Cargo feature that must be explicitly enabled,
AND a runtime flag that must be passed by the operator. Neither layer
alone is sufficient; both must be active.
This document defines what “production-safe” means for Varta and how to verify a binary before deploying it to a safety-critical environment.
Profile matrix
| Profile | Features | argv | /metrics | Recovery |
|---|---|---|---|---|
| SRE / cloud | prometheus-exporter (+ optional unsafe-*) | full GNU-style parser | HTTP /metrics + Bearer-token | shell or exec |
| Class-A safety-critical | secure-udp, compile-time-config | none (build-time fixed) | absent | exec only (or unsafe-shell-recovery + signed acknowledgement) |
The two profiles are mutually exclusive: prometheus-exporter cannot
combine with compile-time-config (a compile_error! in
crates/varta-watch/src/lib.rs rejects the combination at build time).
This is the structural guarantee Class-A builds rest on — the Class-A
binary cannot ship with an HTTP server linked in.
Production-safe build
A production-safe varta-watch binary is built with default features only:
cargo build -p varta-watch --release
No --features argument is needed or wanted. Default features are empty.
What is absent from a production-safe build
| Dangerous capability | Cargo feature | Runtime flag |
|---|---|---|
| Plaintext (unauthenticated) UDP listener | unsafe-plaintext-udp | --i-accept-plaintext-udp |
| Shell-mode recovery (/bin/sh -c) | unsafe-shell-recovery | --i-accept-shell-risk |
Without the compile-time feature, the code path is not linked into the binary. A misconfigured deployment cannot accidentally enable the dangerous path at runtime.
Verification recipe
cargo build -p varta-watch --release
strings target/release/varta-watch | grep -F "/bin/sh" && echo "FAIL" || echo "OK"
The strings check is belt-and-suspenders: because the dangerous code is
#[cfg(feature = ...)]-gated at the source level, the gated items are stripped
before code generation, so the literal string is never compiled into the
binary and cannot appear in it.
Unsafe features
unsafe-plaintext-udp
Compiles in the plaintext UdpListener transport. Any device with network
access to the bound port can inject heartbeats, suppress stall detection, or
trigger false recovery commands.
# varta-watch/Cargo.toml
[features]
unsafe-plaintext-udp = ["udp-core"]
Even with this feature, the listener will not bind unless
--i-accept-plaintext-udp is also passed at runtime.
unsafe-shell-recovery
Compiles in the RecoveryMode::Shell variant, which passes the recovery
template to the system shell (sh -c). A template-injection vector can
execute arbitrary commands with the observer’s authority.
[features]
unsafe-shell-recovery = []
Even with this feature, shell-mode recovery will not activate unless
--i-accept-shell-risk is also passed at runtime.
Class-A safety-critical features
prometheus-exporter (opt-in HTTP exposition)
The Prometheus /metrics endpoint, the bearer-token loader, the per-IP
rate-limit table, and every --prom-* argv flag live behind this
feature. When absent the binary contains zero HTTP / TCP-accept code
and the only exporter linked is FileExporter (one-way append-only
TSV sink — no listener, no network surface).
[features]
prometheus-exporter = []
Verification recipe (default build, feature off):
cargo build -p varta-watch --release
B=target/release/varta-watch
strings "$B" | grep -E -- "(GET /metrics|HTTP/1\.|WWW-Authenticate|Bearer realm)" \
&& echo "FAIL" || echo "OK"
compile-time-config (no argv parser, no runtime config)
Replaces the runtime argv parser with a build-script-generated constant
populated from $VARTA_CONFIG_FILE (a KEY = VALUE text file). When
the feature is on:
- `Config::from_args` is excluded from compilation; the 292-arm match block carrying every `--flag-name` literal is not linked.
- `Config::HELP` is a neutral one-liner that contains no flag names.
- The binary refuses any argv tokens with `CompileTimeArgvForbidden`.
Cannot be combined with prometheus-exporter — the combination is
rejected at compile time by a compile_error! in lib.rs.
export VARTA_CONFIG_FILE=/etc/varta/varta.conf
cargo build -p varta-watch --release \
--no-default-features --features secure-udp,compile-time-config
Verification recipe:
B=target/release/varta-watch
FORBIDDEN="GET /metrics|HTTP/1\.|WWW-Authenticate|--socket|--prom-addr|--help|--i-accept|/bin/sh"
strings "$B" | grep -E -- "$FORBIDDEN" && echo "FAIL" || echo "OK"
See compile-time-config.md for the canonical KEY=VALUE grammar and key catalogue.
Recommended transport for recovery
Always use --recovery-exec instead of --recovery-cmd for production
deployments. --recovery-exec invokes the program directly via execvp(2)
with no shell involved; shell metacharacters have no effect.
Miri policy
Miri (cargo miri test) runs on every push under -Zmiri-strict-provenance and covers
the three unsafe-code clusters that cannot be audited by reading alone:
| Cluster | Miri target | What it proves |
|---|---|---|
| peer_cred cmsg pointer-walk | cargo miri test -p varta-watch --lib peer_cred | No UB in the hand-written cmsghdr traversal; synthetic buffers only, no syscalls |
| Tracker slot-index arithmetic | cargo miri test -p varta-watch --lib tracker | No out-of-bounds indexing or stale pointer reads in the fixed-capacity slot array |
| Client classifier | cargo miri test -p varta-client --test classifier | BeatError is Copy-safe and errno extraction has no provenance issues |
Tests that require real syscalls (Unix datagram bind, recvmsg, process spawn) carry
#[cfg_attr(miri, ignore)] so they are silently skipped when Miri runs, without
requiring a separate test-filter command.
Clock source for stall detection
Stall threshold accounting depends on a monotonic time source. Which “monotonic” is correct depends on the deployment profile:
| Profile | --clock-source | Rationale |
|---|---|---|
| SRE / cloud server / VM | monotonic (default) | CLOCK_MONOTONIC pauses on host suspend, hypervisor pause, and live-migration freeze. A 30-minute host-suspend-for-maintenance must NOT fan out a stall alert across every agent. |
| Medical implant / holter / insulin pump (Linux) | boottime (Linux only) | CLOCK_BOOTTIME advances during suspend. A 4-hour deep-sleep IS a 4-hour silence; stall detection MUST fire on wake-up regardless of whether the device suspended itself. |
| Embedded sensor with deep sleep (Linux) | boottime (Linux only) | Same as medical — battery-conscious devices that aggressively suspend need stall semantics that count the suspended time. |
| macOS / iOS-hosted device with sleep semantics | monotonic-raw (macOS / iOS only) | CLOCK_MONOTONIC_RAW on Darwin is backed by mach_continuous_time and advances through sleep — the Darwin equivalent of Linux’s CLOCK_BOOTTIME. |
Platform support
boottime semantics require Linux’s CLOCK_BOOTTIME clock (clk_id 7,
available since 2.6.39). The Darwin equivalent is CLOCK_MONOTONIC_RAW
(clk_id = 4), backed by mach_continuous_time; it advances through
sleep just like CLOCK_BOOTTIME. Because the same numeric clk_id = 4
on Linux refers to CLOCK_MONOTONIC_RAW with different semantics (it
opts out of NTP slewing but still pauses during suspend), the two are
exposed as distinct ClockSource variants — boottime (Linux only) and
monotonic-raw (macOS / iOS only) — and each is rejected at startup on
the other family with ConfigError::ClockSourceUnsupported.
BSD operators have only monotonic: no kernel clock on FreeBSD /
NetBSD / OpenBSD / DragonFly advances through suspend in a way usable
by clock_gettime(2).
Example rejection messages:
clock source `boottime` is not supported on `macos` (Linux only; on
macOS use `monotonic-raw` for advance-through-sleep semantics)
clock source `monotonic-raw` is not supported on `linux` (macOS / iOS
only; on Linux use `boottime` for advance-through-sleep semantics)
This is structural enforcement: a misconfigured medical-device deployment exits non-zero rather than silently picking a clock that pauses on sleep.
Self-watchdog alignment
The in-process self-watchdog (--self-watchdog-secs) reads the same kernel
clock as the observer. An operator who configures boottime for the
observer gets watchdog deadline accounting that also advances during
suspend; an SRE operator on monotonic gets identical-to-historical
watchdog behaviour minus the previous wall-clock NTP-backward-step
foot-gun.
Verification recipe (Linux)
# Confirm the configured clock source is in effect.
journalctl -u varta-watch | grep -i 'clock' # binary logs no startup banner today;
# operators can read /proc/<pid>/maps
# to confirm clock_gettime imports.
# Behavioural smoke test: requires a real suspend / resume cycle.
# (There is no `systemctl resume`; resume happens on wake. rtcwake
# suspends to RAM and programs an RTC alarm to wake 60 s later.)
sudo rtcwake -m mem -s 60
curl -fsS http://localhost:9090/metrics -H "Authorization: Bearer <hex>" \
| grep -E 'varta_(stall_total|beats_total|watch_uptime_seconds)'
# Expect: with --clock-source boottime, varta_stall_total advanced during the
# suspend window; with --clock-source monotonic, it did not.
Cross-reference
The secure-udp transport applies the same “no surprises on the beat path”
posture: the IV-prefix derivation (H6) reads OS entropy only at connect()
and reconnect() — every steady-state beat uses a deterministic HKDF
counter-mode expansion. Together, H6 + H7 keep the agent and observer
loops free of any syscall that can block or stall under suspend.
Cross-references
- Observer liveness — defending against
varta-watchitself crashing or hanging - Peer authentication — kernel-level PID attestation and transport trust classification
Varta v0.1.0 — Bench Harness Results
Per-metric measurements captured by the dependency-free varta-bench
harness (Session 06). Each row corresponds to one acceptance contract
assertion in docs/acceptance/varta-v0-1-0.md.
Host
| Field | Value |
|---|---|
| OS | Darwin 25.4.0 (xnu-12377.101.15) arm64 |
| Hardware | Apple Silicon (Mac, T6050 series) |
| Rust toolchain | rustc 1.93.1 (01f6ddf75 2026-02-11) — pinned via rust-toolchain.toml |
| Working tree | epic/varta-v0-1-0--s06-integration-and-bench clean at run time |
Results
| Metric | Threshold | Measured | Status | Command |
|---|---|---|---|---|
| latency | p99 < 1 µs | p99 = 916 ns | PASS | cargo run -p varta-bench --release -- latency |
| cpu-50-agents | < 0.1 % | 0.0552 % | PASS | cargo run -p varta-bench --release -- cpu-50-agents |
| binary-size | Δ < 20 KB | Δ = 3 872 B | PASS | cargo run -p varta-bench --release -- binary-size |
Auxiliary latency metrics (same run): p50 = 584 ns, p99.9 = 1042 ns.
Reproducibility
# Build the workspace once so varta-watch is in target/release.
cargo build --workspace --release
cargo run -p varta-bench --release -- latency
cargo run -p varta-bench --release -- cpu-50-agents # ~35 s wall
cargo run -p varta-bench --release -- binary-size # ~5 s wall
cpu-50-agents waits for the daemon to self-exit via
--shutdown-after-secs 35 before snapshotting getrusage(RUSAGE_CHILDREN),
so the measurement covers the full wall window over which the 50 agent
threads emit at 1 Hz. The wall is therefore the dominant cost.
Threshold notes
- `latency`: thresholds are tagged HOST-DEPENDENT in `crates/varta-bench/src/main.rs`. Apple Silicon laptops show p99 ≈ 900 ns idle. Virtualised CI runners with noisy neighbours can spike; if the bench reports `STATUS: WARN` with a measured value above 1 µs, the harness is doing its job and a CI gate should classify it as a soft failure (warning, not red).
- `cpu-50-agents`: the daemon is mostly blocked in `recvfrom(2)` with the 100 ms read timeout. CPU usage scales sublinearly with agent count because the kernel batches wakeups. 0.0552 % of a 35 s wall is ~19 ms of daemon CPU.
- `binary-size`: link-time pulls in `Varta::connect`, the `Frame` codec, and the `BeatOutcome` enum. The diff is dominated by Rust's standard-library boilerplate for `UnixDatagram` plus a few KB of generated code for the encoder. The fixture crates use `lto = false`, `codegen-units = 1`, `opt-level = 3` so size comparisons are stable across runs.
Status
All three contract assertions PASS on the host above. No WARN or FAIL deviations to record for this session.
Contributing to Varta
First, thank you for contributing! Varta is a high-assurance health protocol, and we maintain strict architectural and safety standards.
The Varta “Hard Constraints”
Every contribution must adhere to these load-bearing invariants:
- Zero Registry Dependencies: Production crates (`varta-vlp`, `varta-client`, `varta-watch`) must have empty `[dependencies]` sections (other than internal path dependencies).
- Zero Heap Allocation: No heap allocation is permitted on the `beat()` path after connection. We verify this with `zero_alloc` tests using a guard allocator.
- Non-Blocking I/O: The beat path must never block. `WouldBlock` is handled as `Dropped`.
- ABI Stability: Any change to the 32-byte `Frame` layout is a breaking change and requires a VLP version bump.
- Strict Linting: We run with `deny(unsafe_code)` at the workspace level. Permitted unsafe blocks (e.g., for FFI) must be explicitly allowed with `#[allow(unsafe_code)]` to create an audit trail.
Development Workflow
Prerequisites
- Rust stable (for production builds)
- Rust nightly (for fuzzing and Miri)
- `cargo-fuzz` and the `miri` component installed
The “JUSTIFY” Rule
If you must #[ignore] a test, the CI will fail unless you provide a // JUSTIFY: <reason> comment within 2 lines of the attribute. This ensures we don’t accidentally leave gaps in our safety coverage.
Running the Suite
# Lint & Format
cargo fmt
cargo clippy --workspace -- -D warnings
# Tests
cargo test --workspace
# Fuzzing (Mandatory for protocol changes)
cargo fuzz run frame_decode -- -max_total_time=30
# Miri (UB Audit)
cargo miri test -p varta-vlp
Pull Request Process
- Benchmarks: If your change touches the `beat()` path, you must run `cargo run -p varta-bench --release -- latency` and include the results in your PR description.
- Documentation: Update `design.md` or crate READMEs if logic changes.
- Zero-Alloc Verification: Ensure `cargo test -p varta-tests --test zero_alloc` still passes.
Code of Conduct
We follow the Contributor Covenant. Please be respectful and professional.
Security Policy
Supported Versions
The following versions of Varta are currently being supported with security updates.
| Version | Supported |
|---|---|
| v0.2.x | :white_check_mark: |
| < v0.2 | :x: |
Reporting a Vulnerability
Varta is designed for high-assurance and safety-critical health monitoring. Security and protocol integrity are our highest priorities.
If you discover a security vulnerability or a protocol-level defect that could compromise system safety, please do not report it via a public issue.
Recommended Method: GitHub Private Vulnerability Reporting
Please use the GitHub Private Vulnerability Reporting feature. This allows you to securely disclose the vulnerability to the maintainers without making it public.
What to include
When reporting, please provide:
- A descriptive title.
- The specific crate and version affected.
- A clear description of the vulnerability or safety concern.
- Steps to reproduce (including hardware/OS context if relevant).
- A proof-of-concept if available.
Our Commitment
We will:
- Acknowledge your report within 48 hours.
- Provide a timeline for a fix and keep you updated.
- Give credit (if desired) in the eventual security advisory.
Varta Project Roadmap
This roadmap outlines the path from Varta’s current state to a “High-Assurance” v1.0.0 release suitable for safety-critical deployments.
Phase 1: Foundation (Current - v0.2.x) :white_check_mark:
Focus on protocol stability, local/network transport, and security audits.
- VLP Protocol Definition (32-byte frames).
- Zero-allocation UDS/UDP transport.
- AEAD encryption for networked agents.
- Fuzzing and Miri integration in CI.
- Initial Prometheus exporter.
Phase 2: Observability & Resilience (v0.3 - v0.5)
Enhancing the observer and providing more “industrial” features.
- Structured Logging: full
json-logsupport across all crates. - Tamper-Evident Logs: SHA-256 hash chaining for recovery audits.
- mdBook Documentation: A comprehensive “Varta Book” explaining protocol internals.
- Crates.io Publication: Formal release of production-ready crates.
Phase 3: Compliance & Integration (v0.6 - v0.9)
Preparing for formal certification standards (IEC 62304, ISO 26262).
- Static Analysis: Integrate
cargo-geigerand custom safety-profile audits. - Multi-Language SDKs: C/C++ bindings for legacy embedded systems.
- Hardware Watchdog Integration: Native drivers for Linux
watchdogdand platform-specific hardware timers. - Self-Diagnostic Suite: Integrated tests for observer clock drift and jitter.
Phase 4: High-Assurance v1.0
The stable, safety-certified release.
- Formal Verification: TLA+ or Kani proofs for core state machines.
- Third-Party Security Audit: Formal cryptographic and code audit by a specialized firm.
- ABI Freeze: Finalize the VLP wire format for long-term compatibility.
- v1.0.0 Release: LTS support for critical infrastructure.