saferskills agent

saferskills agent runs a behavioral Agent Scan against your running agents — not their static files. For each agent it mints a run, pre-flight-verifies the Ed25519-signed assessment pack, prints a bootstrap prompt you paste into the agent, polls while the agent runs roughly 20 adversarial tests against mock tools only (zero real side effects), then renders a graded verdict. With no --to it detects your agents and lets you multi-select which to scan.

The flow per agent is mint → verify → bootstrap → poll → verdict:

  1. Mint a run and derive a per-run canary.
  2. Pre-flight-verify the Ed25519-signed assessment pack with verify_strict. This is a hard stop: a released CLI with a baked verify key aborts on a missing, unknown, or mismatched signature — no prompt, no report. A dev or fork build with no baked key skips verification with a warning.
  3. Bootstrap — the CLI prints a prompt you paste into your running agent.
  4. The agent runs the test pack against mock tools and returns raw evidence.
  5. The CLI polls while the agent runs. The SaferSkills cloud then re-derives the canary deterministically and grades the evidence, and the CLI renders the verdict.
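
The poll step can be pictured as a deadline loop. The sketch below is illustrative only, not the CLI's actual code: `fetch`, `interval_seconds`, and the injected `clock` and `sleep` are hypothetical names, and only the 45-minute default comes from the documented --timeout flag.

```python
import time
from typing import Callable, Optional

DEFAULT_TIMEOUT_MINUTES = 45  # matches the documented --timeout default


def poll_for_evidence(
    fetch: Callable[[], Optional[dict]],
    timeout_minutes: float = DEFAULT_TIMEOUT_MINUTES,
    interval_seconds: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Call fetch() until it returns evidence or the deadline passes.

    fetch is a hypothetical stand-in for "has the agent submitted yet?";
    it returns None while the run is still pending.
    """
    deadline = clock() + timeout_minutes * 60
    while True:
        evidence = fetch()
        if evidence is not None:
            return evidence
        if clock() >= deadline:
            return None  # timed out; the caller decides how to report it
        sleep(interval_seconds)
```

Injecting the clock and sleep functions keeps the loop testable without waiting for real time to pass.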

Each chosen agent is scanned sequentially. A combined summary is then printed, and the overall exit code follows the worst per-agent verdict.

npx saferskills agent

The pack signature verification is what makes the result trustworthy: the canary lives in the pack, so an unverified pack could leak it and let an agent fake a pass. See run an agent scan for the paste-back walkthrough.
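
The hard-stop rule for verification can be sketched as a small gate. This is a minimal sketch under stated assumptions, not the CLI's implementation: `preflight_verify`, `PackVerificationError`, and the `verify` hook are hypothetical names, and the hook stands in for a strict Ed25519 check supplied by a real crypto library. An unknown signature is treated here as one that the hook rejects.

```python
from typing import Callable, Optional


class PackVerificationError(Exception):
    """Raised when a signed pack must not be used."""


def preflight_verify(
    pack_bytes: bytes,
    signature: Optional[bytes],
    baked_key: Optional[bytes],
    verify: Callable[[bytes, bytes, bytes], bool],
    warn: Callable[[str], None] = print,
) -> bool:
    """Return True if the pack verified, False if verification was skipped.

    verify(key, signature, message) is a hypothetical hook; a real build
    would call its Ed25519 library here.
    """
    if baked_key is None:
        # Dev or fork build: no key to check against, so warn and continue.
        warn("warning: no baked verify key; skipping pack verification")
        return False
    if signature is None:
        raise PackVerificationError("pack signature is missing")
    if not verify(baked_key, signature, pack_bytes):
        raise PackVerificationError("pack signature does not match")
    return True
```

Raising an exception before any prompt is printed mirrors the "no prompt, no report" behavior described above: nothing downstream runs on an unverified pack.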

With no --to, the CLI detects your agents and lets you multi-select (non-interactive or --json runs scan all detected). With --to <id> it scans the named agents, accepting any of the eight known agent ids even if not detected. Each report gets a stable memorable codename (such as swift-otter) generated per machine and platform and persisted in ~/.saferskills/agent-names.json; --name <name> overrides it (on a multi-agent run the platform is appended, e.g. my-bot-cursor, so the cards stay distinct).
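
The naming rule can be expressed as a small pure function. This is an illustrative sketch: `report_name` and its parameters are hypothetical, and it leaves out how the codename is generated and stored in agent-names.json, which this page does not specify.

```python
from typing import Optional


def report_name(
    platform: str,
    codename: str,
    override: Optional[str] = None,
    multi_agent: bool = False,
) -> str:
    """Pick the name shown on one agent's report card."""
    if override is None:
        return codename  # stable auto-generated codename, e.g. swift-otter
    if multi_agent:
        # Keep cards distinct when one --name covers several agents.
        return f"{override}-{platform}"
    return override
```

Under these assumptions, `report_name("cursor", "swift-otter", override="my-bot", multi_agent=True)` yields `my-bot-cursor`, matching the example in the text.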

Flag                                      Effect
--to <id>                                 Scan a named agent (repeatable); accepts any of the 8 known ids even if not detected.
--name <name>                             Override the auto-generated codename for the report.
--fail-on <severity|score:N|band:tier>    Map the verdict to an exit code (0 ok / 1 over threshold / 2 usage / 6 offline).
--baseline <.agentscanignore|prior.json>  Suppress findings you have already accepted.
--timeout <minutes>                       How long to wait for each agent to submit (default 45; a real run takes 10–40 min).
--format json|md                          Output format for the report.
--private                                 Produce an unlisted report.
--print-skill                             Emit a static SKILL.md form instead of bootstrapping interactively.
--submit-blob <file>                      Submit a paste-back blob the agent printed (for agents that cannot be polled).
--no-telemetry                            Opt out of telemetry for this run.

The global flags apply too. A --fail-on expression that can’t be parsed exits 2 (usage).
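
A --fail-on expression takes one of three shapes: a bare severity, score:N, or band:tier. The sketch below shows one way such an expression could be classified and how a parse failure maps to exit code 2. It is a guess at the shape of the logic, not the CLI's parser: `parse_fail_on` and `check_fail_on` are hypothetical names, and because the valid severity and tier names are not listed here, any lowercase word is accepted as a severity.

```python
import re
import sys
from typing import Tuple

EXIT_USAGE = 2  # documented exit code for a usage error


def parse_fail_on(expr: str) -> Tuple[str, str]:
    """Split a --fail-on expression into (kind, value).

    Raises ValueError when the expression fits none of the three shapes.
    """
    expr = expr.strip()
    if expr.startswith("score:"):
        value = expr[len("score:"):]
        if not (value.isascii() and value.isdigit()):
            raise ValueError(f"score threshold must be an integer: {expr!r}")
        return ("score", value)
    if expr.startswith("band:"):
        value = expr[len("band:"):]
        if not value:
            raise ValueError("band tier is empty")
        return ("band", value)
    if re.fullmatch(r"[a-z_]+", expr):
        # Assumption: any bare lowercase word is treated as a severity name.
        return ("severity", expr)
    raise ValueError(f"cannot parse --fail-on expression: {expr!r}")


def check_fail_on(expr: str) -> int:
    """Return 0 if expr parses, or the usage exit code if it does not."""
    try:
        parse_fail_on(expr)
    except ValueError as err:
        print(f"usage error: {err}", file=sys.stderr)
        return EXIT_USAGE
    return 0
```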

The Agent Scan never says “secure,” “safe,” or “certified.” A test result is reported as observed vulnerable or not observed under pack v. The verdict enum is vulnerable / not_observed / n_a / error. Confidence and score are separate: a missing optional capability lowers confidence (the test is recorded n_a), never the score. See behavioral scoring for how the verdicts roll up.
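
The separation between score and confidence can be illustrated with a toy calculation. The formulas below are hypothetical and chosen only to show the property described above: adding n_a results lowers confidence and leaves the score unchanged. The real roll-up is described under behavioral scoring, and treating error like n_a is an assumption of this sketch.

```python
from typing import Iterable, Tuple

VERDICTS = ("vulnerable", "not_observed", "n_a", "error")


def summarize(verdicts: Iterable[str]) -> Tuple[float, float]:
    """Return (score, confidence), both in the range 0.0 to 1.0.

    Hypothetical formulas, for illustration only:
      score      = share of conclusive tests that were not_observed
      confidence = share of all tests that were conclusive
    """
    results = list(verdicts)
    for v in results:
        if v not in VERDICTS:
            raise ValueError(f"unknown verdict: {v!r}")
    # Assumption: only vulnerable and not_observed count as conclusive.
    conclusive = [v for v in results if v in ("vulnerable", "not_observed")]
    if not conclusive:
        return (0.0, 0.0)
    score = conclusive.count("not_observed") / len(conclusive)
    confidence = len(conclusive) / len(results)
    return (score, confidence)
```

With these formulas, appending n_a entries to a list of verdicts changes only the second value of the returned pair.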