Compliance & Audit

AgentSpec's compliance system scores your agent against security and quality best practices.

Running an Audit

```bash
# Declaration checks only (no I/O)
agentspec audit agent.yaml

# + proof records from a running sidecar (dual score)
agentspec audit agent.yaml --url http://localhost:4001
```

Output without --url:

  AgentSpec Audit — my-agent
  ────────────────────────────────
  Score : B  78/100
  Rules : 18 passed / 4 failed / 22 total

  Category Scores
    owasp-llm-top10          65% ████████████░░░░░░░░
    model-resilience          90% ██████████████████░░
    memory-hygiene            70% ██████████████░░░░░░
    evaluation-coverage       85% █████████████████░░░

  Violations (4)

  [critical] [X] SEC-LLM-06 — Sensitive data disclosure: PII scrub in memory hygiene
    Long-term memory declared without piiScrubFields — PII may be persisted.
    Path: /spec/memory/hygiene/piiScrubFields
    → Add spec.memory.hygiene.piiScrubFields: [ssn, credit_card, bank_account]
    → Prove: Microsoft Presidio
    https://microsoft.github.io/presidio/

Output with --url http://localhost:4001 (dual score):

  AgentSpec Audit — research-agent
  ══════════════════════════════════
  Declared score : C  65/100  — what your spec says
  Proved score   : F  35/100  — what has been verified
  Pending proof  : 4 rules — run external tools and POST to http://localhost:4001/proof/rule/:ruleId
  Rules : 18 passed / 4 failed / 22 total

Evidence Tiers

Every audit rule carries an evidence tier label that tells you what kind of evidence backs the finding:

| Badge | Tier | Meaning | How to prove |
| --- | --- | --- | --- |
| [D] | Declarative | Manifest analysis only: reads the YAML, no I/O | (always available) |
| [P] | Probed | Health check verified at infrastructure level | agentspec health <file> |
| [B] | Behavioral | Runtime events confirmed actual execution | AgentSpec EventPush + sidecar |
| [X] | External | Proved by an external CI tool (k6, Presidio, Promptfoo, LiteLLM) | POST to /proof/rule/:ruleId |

Declared vs Proved

The declared score reflects only what your agent.yaml says: it confirms that the relevant fields are filled in, not that the controls work. The proved score counts only what has actually been verified:

Declared score:  65  C   ← you said it; we checked the YAML
Proved score:    35  F   ← only this fraction has been independently verified
Pending proof:   4 rules ← these pass declaratively but need external tool verification
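
As a rough illustration of how the "pending proof" list could be derived, the sketch below filters a set of audit results. The tuple layout and the sample values are hypothetical and are not AgentSpec's internal data model; it also assumes that only [X] rules count as pending, following the description above.

```python
# Hypothetical audit results (illustration only):
# rule ID -> (evidence tier, passes declaratively, proof on record)
results = {
    "SEC-LLM-01": ("B", True, False),
    "SEC-LLM-04": ("X", True, False),
    "SEC-LLM-06": ("X", True, True),
    "MEM-02": ("P", True, True),
    "MODEL-01": ("X", False, False),
}


def pending_proof(results):
    """External-tier rules that pass declaratively but have no proof record yet."""
    return sorted(
        rule_id
        for rule_id, (tier, declared_ok, has_proof) in results.items()
        if tier == "X" and declared_ok and not has_proof
    )


print(pending_proof(results))  # ['SEC-LLM-04']
```

A rule that fails declaratively (MODEL-01 here) is a violation rather than a pending proof, so it is left out of the list.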

Use the sidecar proof endpoint to submit verification results:

```bash
# After k6 rate limit test passes
curl -X POST http://localhost:4001/proof/rule/SEC-LLM-04 \
  -H 'Content-Type: application/json' \
  -d '{"verifiedBy":"k6","method":"1200 req/min, 429 at 1000 — 100% enforced"}'
```

See Proof Integration Guide for tool-by-tool instructions.
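
If the proof is submitted from a test script rather than from the shell, the same request can be sketched with Python's standard library. The helper name `build_proof_request` is hypothetical. Only the URL path and the two JSON fields come from the curl example; the shape of the sidecar's response is not documented here, so the sketch does not interpret it.

```python
import json
import urllib.request


def build_proof_request(base_url, rule_id, verified_by, method):
    """Build (without sending) a POST to the sidecar proof endpoint."""
    body = json.dumps({"verifiedBy": verified_by, "method": method}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/proof/rule/{rule_id}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_proof_request(
    "http://localhost:4001",
    "SEC-LLM-04",
    "k6",
    "1200 req/min, 429 at 1000; 100% enforced",
)
print(req.get_method(), req.full_url)

# To actually submit it, with a sidecar listening on that port:
# with urllib.request.urlopen(req, timeout=5) as resp:
#     print(resp.status)
```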

Rule Classification

All 25 rules are classified by evidence tier:

Probed — verified by agentspec health

| Rule | Description | Severity |
| --- | --- | --- |
| SEC-LLM-03 | System prompt loaded from versioned $file: | medium |
| SEC-LLM-05 | Model provider and version pinned | medium |
| SEC-LLM-09 | Evaluation framework + CI gate configured | medium |
| SEC-LLM-10 | API keys use $secret: not $env: | high |
| MODEL-02 | Model version pinned (not "latest") | medium |
| MEM-02 | TTL set for all memory backends | high |
| MEM-03 | Audit log enabled | medium |
| MEM-04 | Vector store namespace isolated | medium |
| MEM-05 | Short-term memory max tokens bounded | low |
| EVAL-01 | Evaluation dataset declared | medium |
| EVAL-02 | CI gate enabled | medium |
| EVAL-03 | Hallucination threshold configured | medium |
| OBS-01 | Tracing backend declared | medium |

Behavioral — verified by runtime events

| Rule | Description | Severity | Proof tool |
| --- | --- | --- | --- |
| SEC-LLM-01 | Input guardrail actually invoked | high | AgentSpec EventPush |
| SEC-LLM-02 | Output guardrail actually invoked | high | AgentSpec EventPush |
| OBS-02 | Log lines contain structured JSON | low | AgentSpec EventPush |

External — verified by dedicated CI tools

| Rule | Description | Severity | Proof tool |
| --- | --- | --- | --- |
| SEC-LLM-04 | Rate limit enforced under load | medium | k6 |
| SEC-LLM-06 | PII actually scrubbed from memory | critical | Microsoft Presidio |
| SEC-LLM-07 | Tool annotations respected by agent | medium | Promptfoo |
| SEC-LLM-08 | Destructive tools flagged and constrained | high | Promptfoo |
| MODEL-01 | Fallback actually invoked on failure | high | LiteLLM chaos test |
| MODEL-03 | Cost controls enforced by spend tracker | medium | LiteLLM Spend Tracking |
| MODEL-04 | Retry strategy works correctly | low | pytest-mockllm |
| MEM-01 | PII scrub fields actually prevent PII storage | critical | Microsoft Presidio |
| OBS-03 | Log redaction prevents PII in log aggregators | medium | Microsoft Presidio |

Compliance Packs

owasp-llm-top10

10 rules aligned to OWASP LLM Top 10 (2025):

| Rule ID | Description | Severity | Tier |
| --- | --- | --- | --- |
| SEC-LLM-01 | Input guardrail required (prompt injection) | high | [B] |
| SEC-LLM-02 | Output guardrail required (insecure output) | high | [B] |
| SEC-LLM-03 | System prompt loaded from versioned file | medium | [P] |
| SEC-LLM-04 | Rate limiting + cost controls (model DoS) | medium | [X] |
| SEC-LLM-05 | Model provider and version pinned | medium | [P] |
| SEC-LLM-06 | PII scrub for long-term memory | critical | [X] |
| SEC-LLM-07 | Tool annotations declared | medium | [X] |
| SEC-LLM-08 | destructiveHint on all tools | high | [X] |
| SEC-LLM-09 | Evaluation + CI gate | medium | [P] |
| SEC-LLM-10 | API keys use $secret not $env | high | [P] |

model-resilience

| Rule ID | Description | Severity | Tier |
| --- | --- | --- | --- |
| MODEL-01 | Fallback model declared | high | [X] |
| MODEL-02 | Model version pinned (not "latest") | medium | [P] |
| MODEL-03 | Cost controls declared | medium | [X] |
| MODEL-04 | Fallback retry strategy | low | [X] |

memory-hygiene

| Rule ID | Description | Severity | Tier |
| --- | --- | --- | --- |
| MEM-01 | PII scrub fields for long-term memory | critical | [X] |
| MEM-02 | TTL set for all memory backends | high | [P] |
| MEM-03 | Audit log enabled | medium | [P] |
| MEM-04 | Vector store namespace isolated | medium | [P] |
| MEM-05 | Short-term memory max tokens bounded | low | [P] |

evaluation-coverage

| Rule ID | Description | Severity | Tier |
| --- | --- | --- | --- |
| EVAL-01 | Evaluation dataset declared | medium | [P] |
| EVAL-02 | CI gate enabled | medium | [P] |
| EVAL-03 | Hallucination threshold configured | medium | [P] |

observability

| Rule ID | Description | Severity | Tier |
| --- | --- | --- | --- |
| OBS-01 | Tracing backend declared | medium | [P] |
| OBS-02 | Structured logging enabled | low | [B] |
| OBS-03 | Sensitive fields redacted from logs | medium | [X] |

Scoring

  • Each rule has a weight: critical=4, high=3, medium=2, low=1, info=0
  • Declared score = (sum of passed weights) / (sum of total weights) × 100
  • Proved score = (sum of proved weights) / (sum of total weights) × 100
    • Proved = [P] rules that pass + [B] rules observed + [X] rules with proof records
  • Grades: A≥90, B≥75, C≥60, D≥45, F<45
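
The arithmetic above can be sketched in a few lines of Python. This is an illustration of the stated formulas, not AgentSpec's implementation: the `RuleResult` type and the sample results are hypothetical, and the handling of a zero total weight and the absence of rounding are assumptions.

```python
from dataclasses import dataclass

# Weights and grade thresholds as listed above.
WEIGHTS = {"critical": 4, "high": 3, "medium": 2, "low": 1, "info": 0}
GRADES = [(90, "A"), (75, "B"), (60, "C"), (45, "D")]


@dataclass
class RuleResult:  # hypothetical record type, not an AgentSpec API
    rule_id: str
    severity: str
    passed: bool  # rule passes declaratively
    proved: bool  # [P] passing, [B] observed, or [X] with a proof record


def score(results, attr):
    """Weighted percentage of rules whose `attr` flag is true."""
    total = sum(WEIGHTS[r.severity] for r in results)
    if total == 0:
        return 0.0  # assumption: an empty or all-info rule set scores 0
    earned = sum(WEIGHTS[r.severity] for r in results if getattr(r, attr))
    return earned * 100 / total


def grade(value):
    """Map a 0-100 score to a letter grade."""
    for threshold, letter in GRADES:
        if value >= threshold:
            return letter
    return "F"


results = [
    RuleResult("SEC-LLM-06", "critical", passed=True, proved=False),
    RuleResult("SEC-LLM-01", "high", passed=True, proved=True),
    RuleResult("SEC-LLM-05", "medium", passed=True, proved=True),
    RuleResult("MODEL-04", "low", passed=False, proved=False),
]

declared = score(results, "passed")  # (4 + 3 + 2) of 10 weight points
proved = score(results, "proved")    # (3 + 2) of 10 weight points
print(f"Declared: {grade(declared)} {declared:.0f}/100")
print(f"Proved:   {grade(proved)} {proved:.0f}/100")
```

With these sample results the declared score is 90 (A) and the proved score is 50 (D): the unproved critical rule alone accounts for the 40-point gap.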

Suppressing Rules

If a rule doesn't apply to your use case:

```yaml
spec:
  compliance:
    suppressions:
      - rule: SEC-LLM-10
        reason: "Development environment only — production uses $secret"
        approvedBy: security-team
        expires: 2026-06-01    # ISO date — suppression auto-expires
```

Suppressed rules are excluded from scoring but logged in the audit report.
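
One way to picture the expiry behaviour is the sketch below, which splits rules into those that count toward the score and those that are only logged. The function names and the dictionary layout are hypothetical, and treating a suppression as still active on its expiry date is an assumption rather than documented behaviour.

```python
from datetime import date

# Hypothetical in-memory form of spec.compliance.suppressions.
suppressions = [
    {
        "rule": "SEC-LLM-10",
        "reason": "Development environment only",
        "approvedBy": "security-team",
        "expires": date(2026, 6, 1),
    },
]


def active_suppressions(suppressions, today):
    """Rule IDs whose suppression has not yet expired."""
    # Assumption: a suppression still applies on the expiry date itself.
    return {s["rule"] for s in suppressions if today <= s["expires"]}


def scorable(rule_ids, suppressions, today):
    """Split rules into (scored, logged-only) lists for a given audit date."""
    skip = active_suppressions(suppressions, today)
    scored = [r for r in rule_ids if r not in skip]
    logged = [r for r in rule_ids if r in skip]
    return scored, logged


rules = ["SEC-LLM-09", "SEC-LLM-10"]
print(scorable(rules, suppressions, date(2026, 5, 1)))
print(scorable(rules, suppressions, date(2026, 7, 1)))
```

Before the expiry date SEC-LLM-10 is logged but not scored; after it, the rule returns to the scored set.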

Running in CI

```bash
# Fail CI if declared score drops below 70
agentspec audit agent.yaml --fail-below 70

# Run only security rules
agentspec audit agent.yaml --pack owasp-llm-top10

# Fetch proof records from sidecar + dual score in JSON
agentspec audit agent.yaml --url http://localhost:4001 --json --output audit-report.json

# Output JSON for processing
agentspec audit agent.yaml --json --output audit-report.json
```
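
For pipelines driven by a script instead of raw shell steps, the invocations above could be assembled programmatically. The sketch below uses only the flags shown in this section. The helper names are hypothetical, and two things are assumptions: that these flags can be combined freely in one invocation, and that a failed `--fail-below` gate is reported through a non-zero exit status.

```python
import subprocess


def audit_command(manifest, fail_below=None, pack=None, url=None, json_output=None):
    """Assemble an `agentspec audit` argument list from the flags shown above."""
    cmd = ["agentspec", "audit", manifest]
    if url:
        cmd += ["--url", url]
    if pack:
        cmd += ["--pack", pack]
    if fail_below is not None:
        cmd += ["--fail-below", str(fail_below)]
    if json_output:
        cmd += ["--json", "--output", json_output]
    return cmd


def run_audit(cmd):
    """Run the audit and return its exit status (assumed non-zero on a failed gate)."""
    return subprocess.run(cmd, check=False).returncode


cmd = audit_command("agent.yaml", fail_below=70, pack="owasp-llm-top10")
print(" ".join(cmd))

# In a CI job with agentspec installed:
# raise SystemExit(run_audit(cmd))
```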

Scheduled Audits

```yaml
spec:
  compliance:
    packs:
      - owasp-llm-top10
      - model-resilience
    auditSchedule: weekly    # daily | weekly | monthly | on-change
```

This is declarative — actual scheduling requires a cron job or CI workflow that runs agentspec audit.

Released under the Apache 2.0 License.