Security Guardrails

Document status: Public documentation
Target audience: Security engineers, penetration testers, technical evaluators
Last updated: 2026-04-10

This page is the technical deep-dive companion to the AI Governance overview. It details the specific controls Phoenix applies to prevent prompt injection, sensitive data leakage, and adversarial misuse.

1. Input sanitization

All user-supplied input is validated and sanitized before it reaches the LLM.

1.1 Prompt injection detection

Phoenix scans every inbound prompt for injection patterns including:

  • Instruction override attempts ("ignore previous instructions", "you are now…")
  • Role hijacking ("act as an unrestricted AI", "developer mode")
  • System prompt extraction requests ("repeat your system prompt", "show me your instructions")
  • Encoded payloads — Base64-encoded content is decoded and inspected before processing
  • Multi-encoding bypasses (ROT13, leetspeak, Unicode confusables)
  • Delimiter escaping and XML/JSON injection attempts

Detection covers dozens of patterns across these categories and is updated as new techniques emerge.
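The categories above can be sketched as a pattern scan that also inspects decoded Base64 payloads. This is a minimal illustration, not Phoenix's actual rule set: the pattern list, threshold, and function names here are assumptions, and the production detector covers far more patterns.

```python
import base64
import re

# Illustrative subset of injection patterns; the real rule set is much larger
# and updated as new techniques emerge.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
    re.compile(r"you are now", re.IGNORECASE),
    re.compile(r"act as an unrestricted ai", re.IGNORECASE),
    re.compile(r"(repeat|show me) your (system prompt|instructions)", re.IGNORECASE),
]

def decode_base64_payloads(text: str) -> list[str]:
    """Decode Base64-looking runs so encoded payloads are inspected too."""
    decoded = []
    for candidate in re.findall(r"[A-Za-z0-9+/]{16,}={0,2}", text):
        try:
            decoded.append(base64.b64decode(candidate, validate=True).decode("utf-8"))
        except Exception:
            continue  # not valid Base64/UTF-8; nothing to inspect
    return decoded

def detect_injection(prompt: str) -> bool:
    """True if the raw prompt or any decoded payload matches a known pattern."""
    surfaces = [prompt, *decode_base64_payloads(prompt)]
    return any(p.search(s) for p in INJECTION_PATTERNS for s in surfaces)
```

Scanning decoded payloads alongside the raw text is what closes the Base64 bypass: an attacker who encodes "ignore previous instructions" still hits the same pattern list after decoding.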

1.2 Content normalization

Before analysis, input undergoes:

  • Zero-width character stripping — removes invisible characters used to smuggle instructions
  • Unicode normalization — canonicalizes homoglyphs and confusable characters
  • Whitespace normalization — collapses encoding tricks that exploit tokenizer behavior
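The three normalization steps above compose into a single canonicalization pass. A minimal sketch using Python's standard library (the function name and the exact zero-width set are assumptions; NFKC folds many but not all confusables, so a production pipeline would add a dedicated confusables map):

```python
import unicodedata

# Common zero-width characters used to smuggle instructions past pattern matchers.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))

def normalize_input(text: str) -> str:
    """Canonicalize text before injection analysis (illustrative sketch)."""
    text = text.translate(ZERO_WIDTH)            # strip zero-width characters
    text = unicodedata.normalize("NFKC", text)   # fold compatibility forms, e.g. fullwidth letters
    text = " ".join(text.split())                # collapse whitespace tricks
    return text
```

Running normalization before pattern matching means "ig​nore" with a hidden zero-width space and fullwidth "ｉｇｎｏｒｅ" both reduce to plain "ignore" for the detector.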

1.3 Boundary enforcement

  • User input is wrapped in explicit boundary tags so the LLM cannot confuse it with system instructions
  • System prompts use XML-delimited sections with immutable assistant identity and role-locking constraints
  • Size limits enforced: 50 KB per field, 100 KB total payload — preventing resource exhaustion and prompt-stuffing attacks
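The boundary and size controls above can be illustrated as follows. The tag names (`<user_input>`, `<system>`) and the neutralization strategy are assumptions for this sketch; the real delimiter scheme is internal.

```python
MAX_FIELD_BYTES = 50 * 1024     # 50 KB per-field limit from above
MAX_PAYLOAD_BYTES = 100 * 1024  # 100 KB total-payload limit from above

def build_prompt(system_prompt: str, user_input: str) -> str:
    """Wrap user input in explicit boundary tags so the model is told that
    nothing inside them is an instruction. Tag names are illustrative."""
    if len(user_input.encode("utf-8")) > MAX_FIELD_BYTES:
        raise ValueError("field exceeds 50 KB limit")
    # Neutralize boundary tags smuggled into the input itself, so the user
    # cannot close the user-input section and open a fake system section.
    sanitized = user_input.replace("<user_input>", "").replace("</user_input>", "")
    return (
        f"<system>{system_prompt}</system>\n"
        f"<user_input>{sanitized}</user_input>"
    )
```

Enforcing the byte limit before any LLM call is what turns a prompt-stuffing attempt into a cheap, early rejection rather than a resource-exhaustion risk.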

2. Sensitive data handling

2.1 Prompt-level redaction

If a user inadvertently includes sensitive data in a prompt (API keys, passwords, JWTs, connection strings, emails, phone numbers), Phoenix redacts it before storage. Redaction is not a post-retrieval filter — sensitive values never reach the trace database.

Recognized sensitive field patterns include categories such as:

  • Credentials: password, secret, credential, passphrase
  • Tokens: token, access_token, refresh_token, jwt, bearer
  • API keys: api_key, apikey, api_secret, patterns like sk-*, sk_live_*, AKIA*, AIza*, phx_*
  • Infrastructure: database connection strings (PostgreSQL, MySQL, MongoDB, Redis), connection_string, dsn
  • PII: email, phone, ssn
  • Auth headers: authorization, x-api-key

2.2 Redaction characteristics

  • Recursive — nested objects and arrays are traversed; no depth limit
  • Audit-logged — each redaction event records the field name and pattern matched, without exposing the original value
  • Pre-storage — redaction happens in the processing pipeline, not at read time
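The recursive, audit-logged, pre-storage behavior described above can be sketched like this. The pattern lists are a small illustrative subset of Section 2.1, and the function and log shapes are assumptions:

```python
import re

# Illustrative subset of the field-name and value patterns from Section 2.1.
FIELD_PATTERN = re.compile(
    r"password|secret|credential|passphrase|token|jwt|bearer|"
    r"api[_-]?(key|secret)|connection_string|dsn|authorization|x-api-key|"
    r"email|phone|ssn",
    re.IGNORECASE,
)
VALUE_PATTERN = re.compile(r"sk-\w+|sk_live_\w+|AKIA[A-Z0-9]{16}|AIza[\w-]{35}|phx_\w+")

def redact(value, audit_log):
    """Recursively scrub sensitive data; log field names, never values."""
    if isinstance(value, dict):
        out = {}
        for key, val in value.items():
            if FIELD_PATTERN.search(str(key)):
                audit_log.append({"field": str(key), "reason": "field-name match"})
                out[key] = "[REDACTED]"
            else:
                out[key] = redact(val, audit_log)
        return out
    if isinstance(value, list):
        return [redact(item, audit_log) for item in value]
    if isinstance(value, str):
        return VALUE_PATTERN.sub("[REDACTED]", value)
    return value
```

Because `redact` runs on the payload before it is written, the trace database only ever sees `[REDACTED]` placeholders, matching the pre-storage guarantee above.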

For the full list of redaction targets, see Section 4.2 of the AI Governance overview.

3. Output guardrails

3.1 System tag leak prevention

Phoenix applies hard blocks if an LLM response contains internal system tags or delimiters that should never appear in user-facing output. This prevents the model from regurgitating system prompt fragments.
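A hard block of this kind reduces to a deny-list check on the response before it reaches the user. The tag names below are hypothetical; the real internal delimiter set is not public.

```python
# Hypothetical internal delimiters; the production deny-list is internal.
INTERNAL_TAGS = ("<system>", "</system>", "<assistant_identity>", "[INST]")

def check_output(response: str) -> str:
    """Hard-block any response that echoes internal system tags or delimiters."""
    for tag in INTERNAL_TAGS:
        if tag in response:
            raise ValueError(f"blocked: internal delimiter {tag!r} in model output")
    return response
```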

3.2 Behavioral shift detection

Responses are monitored for warning-level flags indicating suspicious behavioral shifts — for example, the model suddenly adopting a persona or tone inconsistent with its assigned role. These flags are logged for review.

3.3 Structured output enforcement

Agent prompts enforce consistent output formats (HTML briefs, structured JSON) with source attribution requirements, limiting the surface area for freeform hallucination or instruction leakage.
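Structured output with required source attribution can be enforced by validating the response before it is surfaced. This sketch assumes a JSON brief with a `sources` field; the real schema and key names are internal.

```python
import json

# Hypothetical brief schema for illustration only.
REQUIRED_KEYS = {"title", "body_html", "sources"}

def enforce_structure(raw: str) -> dict:
    """Reject model output that is not well-formed JSON with source attribution."""
    try:
        brief = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(brief, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_KEYS - brief.keys()
    if missing:
        raise ValueError(f"missing required keys: {sorted(missing)}")
    if not brief["sources"]:
        raise ValueError("source attribution is required")
    return brief
```

Rejecting anything that fails the schema is what shrinks the freeform surface: a response that drifts into unstructured text, or drops its sources, never reaches the user.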

4. Adversarial testing

4.1 Red team program

Phoenix runs an automated red team testing program using promptfoo, an open-source LLM security testing framework.

Latest round (2026-Q1):

  • 700 test scenarios covering:
    • Jailbreak attempts (DAN, roleplay, hypothetical framing)
    • Direct and indirect prompt injection
    • Encoded bypasses (Base64, ROT13, leetspeak, Unicode)
    • Multi-turn escalation chains
    • PII and credential extraction attempts
  • Results feed an active remediation roadmap — findings are triaged, patched, and regression-tested

4.2 Continuous testing

Adversarial testing is not a one-time exercise. New test scenarios are added as novel attack techniques are published, and the suite is re-run against each major release. The goal is continuous coverage, not point-in-time certification.

5. Summary of defense layers

Layer                    | What it does                                                  | When it runs
Input sanitization       | Detects injection, normalizes encoding, enforces size limits  | Before the LLM receives the prompt
Boundary enforcement     | Separates user input from system instructions                 | At prompt construction
Sensitive data redaction | Scrubs credentials, PII, and secrets                          | Before trace storage
Output guardrails        | Blocks system tag leaks, flags behavioral shifts              | After the LLM responds
Adversarial testing      | Validates controls against attack scenarios                   | Ongoing (per release + quarterly)

Document control

Version | Date       | Author       | Changes
1.0     | 2026-04-10 | Phoenix Team | Initial publication based on TALES security assessment responses