Observability and results
Observability should explain what happened without copying sensitive payloads into every system that watches a sandbox. A result object, audit record, log line, metric, and receipt each have a different job.
Use this guide when an SDK, agent tool, CI job, or service wrapper needs to return command status and enough evidence to debug or verify the run later.
Evidence model
| Evidence | Best for | Do not store |
|---|---|---|
| Command result | Immediate caller decision: success, failure, timeout. | Unbounded stdout/stderr or raw secrets. |
| Receipt | Portable verification for one run. | Raw argv, env values, stdout, stderr, or host paths. |
| Audit ID | Correlating local lifecycle and policy events. | Guest payload bytes. |
| Logs | Operator debugging. | Unredacted secrets, full model prompts, large file contents. |
| Boot report | Readiness and startup debugging. | Application payloads. |
| Metrics | Dashboards, alerts, trend analysis. | Labels with argv, env values, file contents, or secret names. |
Keep identifiers everywhere. Keep payloads only where the caller explicitly asked for bounded output.
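The split between identifiers and payloads can be sketched in code. This is a minimal illustration, not a shipped SDK type: `RunIdentifiers`, `CommandResult`, `audit_metadata`, and `caller_view` are hypothetical names, and the field names are assumptions modeled on the job record shown later in this guide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RunIdentifiers:
    """Identifiers that are safe to copy into every evidence channel."""
    run_id: str
    audit_id: str


@dataclass(frozen=True)
class CommandResult:
    """Hypothetical result shape: identifiers always, payloads optional."""
    ids: RunIdentifiers
    exit_code: Optional[int]
    timed_out: bool
    stdout: Optional[str] = None
    stderr: Optional[str] = None


def _bounded(text: Optional[str], max_bytes: int) -> Optional[str]:
    """Cut text to a byte budget without splitting a UTF-8 character."""
    if text is None:
        return None
    return text.encode("utf-8")[:max_bytes].decode("utf-8", errors="ignore")


def audit_metadata(result: CommandResult) -> dict:
    """Identifiers and status only; never stdout, stderr, argv, or env."""
    return {
        "run_id": result.ids.run_id,
        "audit_id": result.ids.audit_id,
        "exit_code": result.exit_code,
        "timed_out": result.timed_out,
    }


def caller_view(result: CommandResult, include_output: bool = False,
                max_output_bytes: int = 65536) -> dict:
    """Payloads appear only when the caller opts in, and only up to the budget."""
    view = audit_metadata(result)
    if include_output:
        view["stdout"] = _bounded(result.stdout, max_output_bytes)
        view["stderr"] = _bounded(result.stderr, max_output_bytes)
    return view
```

The design choice here is that output is opt-in per call, so a new log or metric integration cannot pick up payload bytes by default.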
CLI result workflow
For a one-shot command:

```shell
mvmctl run \
  --profile restrictive \
  --json \
  --receipt /tmp/agent-run-receipt.json \
  -- python /work/task.py
```

Then verify and inspect:

```shell
mvmctl receipt verify /tmp/agent-run-receipt.json
mvmctl audit tail -n 20
mvmctl metrics --json
```

Store the receipt path, exit status, run identifier, audit identifier, and timeout state with the job record. Do not copy raw command output into audit metadata or metric labels.
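A CI job or service wrapper might drive the one-shot command from Python. The sketch below is an assumption-laden illustration: `build_run_argv` and `run_and_record` are hypothetical helpers, the flags are only those shown in this guide, and the schema of the `--json` output is not assumed, so the run and audit identifiers are left for the caller to fill in from whatever the CLI actually reports. The timeout here is the wrapper's own process timeout, not a sandbox-side limit.

```python
import subprocess
from typing import Optional, Sequence


def build_run_argv(receipt_path: str, command: Sequence[str],
                   profile: str = "restrictive") -> list:
    """Assemble the one-shot invocation using only flags shown in this guide."""
    return ["mvmctl", "run", "--profile", profile, "--json",
            "--receipt", receipt_path, "--", *command]


def run_and_record(receipt_path: str, command: Sequence[str],
                   timeout_seconds: Optional[float] = 30) -> dict:
    """Run the command and keep status fields only, not raw output."""
    argv = build_run_argv(receipt_path, command)
    try:
        proc = subprocess.run(argv, capture_output=True, timeout=timeout_seconds)
        exit_status, timed_out = proc.returncode, False
    except subprocess.TimeoutExpired:
        # Wrapper-side timeout: the child process was killed by subprocess.run.
        exit_status, timed_out = None, True
    # stdout/stderr are deliberately not copied into the job record.
    return {
        "receipt_path": receipt_path,
        "exit_status": exit_status,  # exit status of the mvmctl process itself
        "timed_out": timed_out,
        "run_id": None,    # fill in from the CLI's JSON output (schema not assumed)
        "audit_id": None,  # fill in from the CLI's JSON output (schema not assumed)
    }
```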
Named sandbox workflow
For a long-running sandbox:

```shell
mvmctl up ./service --name service-dev --metrics-port 9100
mvmctl wait service-dev --for all
mvmctl boot-report service-dev --json
mvmctl logs service-dev -n 200
mvmctl metrics --json
```

Use:
- `boot-report` for launch and readiness questions;
- `logs` for operator debugging;
- `metrics` for health and trend data;
- `audit tail` and `audit verify` for policy and lifecycle evidence.
SDK result target
Runtime SDKs should converge on a bounded result shape:

```python
result = sandbox.commands.run(
    ["python", "/work/task.py"],
    timeout_seconds=30,
    max_output_bytes=65536,
)

job_record = {
    "sandbox_id": sandbox.id,
    "exit_code": result.exit_code,
    "timed_out": result.timed_out,
    "stdout": redact(result.stdout),
    "stderr": redact(result.stderr),
    "run_id": result.run_id,
    "audit_id": result.audit_id,
    "receipt": result.receipt_path,
}
```

```ts
const result = await sandbox.commands.run(["node", "/work/task.js"], {
  timeoutSeconds: 30,
  maxOutputBytes: 65536,
});

const jobRecord = {
  sandbox_id: sandbox.id,
  exit_code: result.exitCode,
  timed_out: result.timedOut,
  stdout: redact(result.stdout),
  stderr: redact(result.stderr),
  run_id: result.runId,
  audit_id: result.auditId,
  receipt: result.receiptPath,
};
```

This is a product target. Check Lifecycle matrix and Operations cookbook before treating a helper as shipped in a specific language.
Redaction rules
Redact before data leaves the process that received it:
- secrets and credential-like strings;
- model prompts or user data unless the caller explicitly asked for them;
- guest file contents in error messages;
- long stdout/stderr beyond a configured byte budget;
- host paths that reveal local usernames or workspace layout;
- network URLs with embedded credentials or tokens.
If redaction fails or times out, return a redaction failure instead of sending unreviewed output to a model, ticket, trace, or shared log.
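One way to make redaction fail closed is sketched below. The patterns are deliberately small, illustrative assumptions and not a complete or reviewed rule set. `redact`, `RedactionError`, and `safe_output` are hypothetical names, and this sketch covers redaction failure only; enforcing a redaction time limit would need additional machinery.

```python
import re

# Illustrative patterns only; a real deployment would maintain a reviewed set.
_PATTERNS = [
    # key=value or key: value pairs whose key looks credential-like
    (re.compile(r"(?i)\b(api[_-]?key|token|password|secret)\b(\s*[=:]\s*)\S+"),
     r"\1\2[REDACTED]"),
    # URLs with embedded userinfo, such as scheme://user:pass@host
    (re.compile(r"(?i)\b([a-z][a-z0-9+.-]*://)[^/\s:@]+:[^/\s@]+@"),
     r"\1[REDACTED]@"),
    # home directories that reveal a local username
    (re.compile(r"/(home|Users)/[^/\s]+"), r"/\1/[USER]"),
]


class RedactionError(Exception):
    """Raised when output could not be redacted and must not be forwarded."""


def redact(text: str, max_bytes: int = 65536) -> str:
    """Apply every pattern, then cut the result to a byte budget."""
    if not isinstance(text, str):
        raise RedactionError("expected text output")
    try:
        for pattern, replacement in _PATTERNS:
            text = pattern.sub(replacement, text)
        encoded = text.encode("utf-8")
    except Exception as exc:
        raise RedactionError("redaction failed") from exc
    if len(encoded) > max_bytes:
        # errors="ignore" drops a UTF-8 character split by the cut.
        text = encoded[:max_bytes].decode("utf-8", errors="ignore") + "\n[TRUNCATED]"
    return text


def safe_output(text) -> dict:
    """Fail closed: return a typed failure rather than unreviewed output."""
    try:
        return {"ok": True, "text": redact(text)}
    except RedactionError:
        return {"ok": False, "error": "redaction_failed"}
```

Callers that forward output to a model, ticket, trace, or shared log would check `ok` first and send only the failure marker when it is false.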
Metrics rules
Good metric labels:
- backend name;
- lifecycle state;
- policy profile;
- result class;
- exit status bucket;
- timeout boolean;
- image or artifact hash prefix when policy allows it.
Bad metric labels:
- command argv;
- environment values;
- secret names or values;
- stdout or stderr fragments;
- file paths containing user data;
- model prompts, URLs with tokens, or request bodies.
Metrics are for aggregate behavior. Use receipts and audit IDs for drill-down.
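A label allow-list is one way to enforce these rules at the point where metrics are emitted. The sketch below assumes hypothetical label names and bucket names derived from the lists above; `safe_labels` and `exit_status_bucket` are not part of any shipped API, and the value checks are crude heuristics rather than a guarantee.

```python
# Hypothetical label names derived from the "good labels" list above.
ALLOWED_LABELS = frozenset({
    "backend", "lifecycle_state", "policy_profile", "result_class",
    "exit_bucket", "timed_out", "artifact_hash_prefix",
})


def exit_status_bucket(exit_code: int) -> str:
    """Collapse exit codes into a few low-cardinality buckets (names are illustrative)."""
    if exit_code == 0:
        return "success"
    if exit_code < 0 or exit_code >= 128:
        return "signal"
    return "failure"


def safe_labels(labels: dict, max_value_len: int = 32) -> dict:
    """Reject unknown label names and values that look like paths or payloads."""
    unknown = set(labels) - ALLOWED_LABELS
    if unknown:
        raise ValueError(f"disallowed metric labels: {sorted(unknown)}")
    cleaned = {}
    for name, value in labels.items():
        text = str(value).lower() if isinstance(value, bool) else str(value)
        looks_like_payload = (
            len(text) > max_value_len
            or any(ch.isspace() or ch in "/=" for ch in text)
        )
        if looks_like_payload:
            raise ValueError(f"label value for {name!r} looks like a payload")
        cleaned[name] = text
    return cleaned
```

Rejecting unknown names outright, instead of silently dropping them, surfaces a mislabeled metric during development rather than in a dashboard.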
Failure classification
Return typed failures to callers:
| Failure | Example caller action |
|---|---|
| Validation failed | Reject the model/tool request before launch. |
| Policy denied | Ask for a reviewed policy change or remove the operation. |
| Build failed | Inspect build logs and artifact inputs. |
| Boot not ready | Inspect boot report, backend logs, and guest readiness. |
| Command failed | Return bounded stderr/stdout and exit code. |
| Timeout | Stop, snapshot, or retry intentionally. |
| Transport failed | Retry only if the operation is idempotent. |
| Cleanup failed | Mark retained state sensitive and ask an operator to clean up. |
Do not collapse these into one generic runtime error. Policy denial, command failure, timeout, and cleanup failure all imply different next steps.
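The table above could map onto a small typed-failure model. This is a sketch under assumptions: `FailureKind`, `SandboxFailure`, `next_step`, and `should_retry` are hypothetical names, and the action strings simply restate the table.

```python
from enum import Enum


class FailureKind(Enum):
    VALIDATION_FAILED = "validation_failed"
    POLICY_DENIED = "policy_denied"
    BUILD_FAILED = "build_failed"
    BOOT_NOT_READY = "boot_not_ready"
    COMMAND_FAILED = "command_failed"
    TIMEOUT = "timeout"
    TRANSPORT_FAILED = "transport_failed"
    CLEANUP_FAILED = "cleanup_failed"


# Example caller actions, restated from the table above.
NEXT_STEP = {
    FailureKind.VALIDATION_FAILED: "Reject the model/tool request before launch.",
    FailureKind.POLICY_DENIED: "Ask for a reviewed policy change or remove the operation.",
    FailureKind.BUILD_FAILED: "Inspect build logs and artifact inputs.",
    FailureKind.BOOT_NOT_READY: "Inspect boot report, backend logs, and guest readiness.",
    FailureKind.COMMAND_FAILED: "Return bounded stderr/stdout and exit code.",
    FailureKind.TIMEOUT: "Stop, snapshot, or retry intentionally.",
    FailureKind.TRANSPORT_FAILED: "Retry only if the operation is idempotent.",
    FailureKind.CLEANUP_FAILED: "Mark retained state sensitive and ask an operator to clean up.",
}


class SandboxFailure(Exception):
    """A failure that carries its kind, so callers never parse message strings."""

    def __init__(self, kind: FailureKind, message: str, *, idempotent: bool = False):
        super().__init__(message)
        self.kind = kind
        self.idempotent = idempotent


def next_step(failure: SandboxFailure) -> str:
    return NEXT_STEP[failure.kind]


def should_retry(failure: SandboxFailure) -> bool:
    """Automatic retry only for transport failures on idempotent operations."""
    return failure.kind is FailureKind.TRANSPORT_FAILED and failure.idempotent
```

Timeouts are deliberately excluded from `should_retry`: the table calls for an intentional decision to stop, snapshot, or retry, not an automatic one.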