Security Guardrails
| Document status | Public documentation |
|---|---|
| Target audience | Security engineers, penetration testers, technical evaluators |
| Last updated | 2026-04-10 |
This page is the technical deep-dive companion to the AI Governance overview. It details the specific controls Phoenix applies to prevent prompt injection, sensitive data leakage, and adversarial misuse.
1. Input sanitization
All user-supplied input is validated and sanitized before it reaches the LLM.
1.1 Prompt injection detection
Phoenix scans every inbound prompt for injection patterns including:
- Instruction override attempts ("ignore previous instructions", "you are now…")
- Role hijacking ("act as an unrestricted AI", "developer mode")
- System prompt extraction requests ("repeat your system prompt", "show me your instructions")
- Encoded payloads — Base64-encoded content is decoded and inspected before processing
- Multi-encoding bypasses (ROT13, leetspeak, Unicode confusables)
- Delimiter escaping and XML/JSON injection attempts
Detection covers dozens of patterns across these categories, and the pattern set is updated as new techniques emerge.
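As an illustrative sketch of this kind of scan (not Phoenix's actual detector; the pattern list and `detect_injection` helper below are hypothetical and cover only a small subset of the categories above), instruction-override matching plus Base64 payload inspection might look like:

```python
import base64
import binascii
import re

# Hypothetical subset of injection patterns; a production list is far larger.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
    re.compile(r"you are now", re.IGNORECASE),
    re.compile(r"developer mode", re.IGNORECASE),
    re.compile(r"(repeat|show me) your (system )?(prompt|instructions)", re.IGNORECASE),
]

def detect_injection(prompt: str) -> bool:
    """Return True if the prompt matches a known injection pattern.

    Base64-looking runs are decoded and re-scanned, mirroring the
    encoded-payload inspection described above.
    """
    candidates = [prompt]
    # Decode plausible Base64 payloads and inspect the decoded text too.
    for blob in re.findall(r"[A-Za-z0-9+/=]{16,}", prompt):
        try:
            candidates.append(base64.b64decode(blob, validate=True).decode("utf-8"))
        except (binascii.Error, UnicodeDecodeError):
            continue  # not valid Base64 text; skip it
    return any(p.search(c) for p in INJECTION_PATTERNS for c in candidates)
```

A real detector would also handle nested or multi-layer encodings (ROT13, leetspeak, confusables) before matching.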
1.2 Content normalization
Before analysis, input undergoes:
- Zero-width character stripping — removes invisible characters used to smuggle instructions
- Unicode normalization — canonicalizes homoglyphs and confusable characters
- Whitespace normalization — collapses encoding tricks that exploit tokenizer behavior
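A minimal sketch of these three steps using Python's standard `unicodedata` module (illustrative only; note that NFKC folds compatibility characters such as fullwidth letters, but cross-script homoglyphs like Cyrillic "а" vs. Latin "a" require a dedicated confusables map beyond what is shown here):

```python
import re
import unicodedata

# Zero-width and invisible characters commonly used to smuggle instructions.
ZERO_WIDTH = re.compile(r"[\u200b\u200c\u200d\u2060\ufeff]")

def normalize_input(text: str) -> str:
    """Apply the three normalization steps: strip, canonicalize, collapse."""
    text = ZERO_WIDTH.sub("", text)              # strip zero-width characters
    text = unicodedata.normalize("NFKC", text)   # fold compatibility forms
    text = re.sub(r"\s+", " ", text).strip()     # collapse whitespace runs
    return text
```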
1.3 Boundary enforcement
- User input is wrapped in explicit boundary tags so the LLM cannot confuse it with system instructions
- System prompts use XML-delimited sections with immutable assistant identity and role-locking constraints
- Size limits enforced: 50 KB per field, 100 KB total payload — preventing resource exhaustion and prompt-stuffing attacks
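These three controls can be sketched together as follows (a simplified illustration; the `<user_input>` tag name and `build_prompt` helper are hypothetical, not Phoenix's actual boundary format):

```python
MAX_FIELD_BYTES = 50 * 1024    # 50 KB per field
MAX_TOTAL_BYTES = 100 * 1024   # 100 KB total payload

def build_prompt(system_prompt: str, fields: dict[str, str]) -> str:
    """Wrap user input in explicit boundary tags and enforce size limits."""
    total = 0
    parts = [system_prompt]
    for name, value in fields.items():
        size = len(value.encode("utf-8"))
        if size > MAX_FIELD_BYTES:
            raise ValueError(f"field {name!r} exceeds 50 KB")
        total += size
        if total > MAX_TOTAL_BYTES:
            raise ValueError("payload exceeds 100 KB")
        # Boundary tags keep user text clearly separated from system text.
        parts.append(f"<user_input field={name!r}>\n{value}\n</user_input>")
    return "\n".join(parts)
```

A production implementation would additionally neutralize any boundary tags appearing inside the user value itself, so input cannot close its own wrapper.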
2. Sensitive data handling
2.1 Prompt-level redaction
If a user inadvertently includes sensitive data in a prompt (API keys, passwords, JWTs, connection strings, emails, phone numbers), Phoenix redacts it before storage. Redaction is not a post-retrieval filter — sensitive values never reach the trace database.
Recognized sensitive field patterns include categories such as:
- Credentials: `password`, `secret`, `credential`, `passphrase`
- Tokens: `token`, `access_token`, `refresh_token`, `jwt`, `bearer`
- API keys: `api_key`, `apikey`, `api_secret`, and patterns like `sk-*`, `sk_live_*`, `AKIA*`, `AIza*`, `phx_*`
- Infrastructure: database connection strings (PostgreSQL, MySQL, MongoDB, Redis), `connection_string`, `dsn`
- PII: `email`, `phone`, `ssn`
- Auth headers: `authorization`, `x-api-key`
2.2 Redaction characteristics
- Recursive — nested objects and arrays are traversed; no depth limit
- Audit-logged — each redaction event records the field name and pattern matched, without exposing the original value
- Pre-storage — redaction happens in the processing pipeline, not at read time
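The characteristics above can be sketched as a recursive traversal (a simplified illustration, not Phoenix's pipeline code; the field pattern is a small subset and the `redact` helper is hypothetical):

```python
import re

# Small subset of sensitive field-name patterns for illustration.
SENSITIVE_FIELD = re.compile(
    r"password|secret|credential|passphrase|token|jwt|bearer|"
    r"api_?key|api_secret|connection_string|dsn|email|phone|ssn|"
    r"authorization|x-api-key",
    re.IGNORECASE,
)

def redact(value, audit_log, path=""):
    """Recursively redact sensitive fields before the trace is stored."""
    if isinstance(value, dict):
        out = {}
        for key, val in value.items():
            child = f"{path}.{key}" if path else key
            match = SENSITIVE_FIELD.search(key)
            if match:
                # Log the field name and matched pattern, never the value.
                audit_log.append({"field": child, "pattern": match.group()})
                out[key] = "[REDACTED]"
            else:
                out[key] = redact(val, audit_log, child)
        return out
    if isinstance(value, list):
        # Arrays are traversed too; there is no depth limit.
        return [redact(v, audit_log, f"{path}[{i}]") for i, v in enumerate(value)]
    return value
```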
For the full list of redaction targets, see Section 4.2 of the AI Governance overview.
3. Output guardrails
3.1 System tag leak prevention
Phoenix applies hard blocks if an LLM response contains internal system tags or delimiters that should never appear in user-facing output. This prevents the model from regurgitating system prompt fragments.
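A hard block of this kind can be sketched as a post-response check (illustrative only; the tag names and `check_output` helper are hypothetical, not Phoenix's actual delimiter set):

```python
import re

# Hypothetical internal delimiters that must never reach user-facing output.
FORBIDDEN_TAGS = re.compile(r"</?(system|assistant_identity|user_input)>", re.IGNORECASE)

def check_output(response: str) -> str:
    """Hard-block any response that leaks internal system tags."""
    if FORBIDDEN_TAGS.search(response):
        raise ValueError("response blocked: internal system tag detected")
    return response
```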
3.2 Behavioral shift detection
Responses are monitored for warning-level flags indicating suspicious behavioral shifts — for example, the model suddenly adopting a persona or tone inconsistent with its assigned role. These flags are logged for review.
3.3 Structured output enforcement
Agent prompts enforce consistent output formats (HTML briefs, structured JSON) with source attribution requirements, limiting the surface area for freeform hallucination or instruction leakage.
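As a minimal sketch of structured-output validation (the `brief_html` and `sources` keys are hypothetical stand-ins, not Phoenix's actual schema):

```python
import json

REQUIRED_KEYS = {"brief_html", "sources"}  # hypothetical output schema

def enforce_structure(raw: str) -> dict:
    """Reject responses that are not well-formed JSON with source attribution."""
    try:
        doc = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError("response is not valid JSON") from exc
    missing = REQUIRED_KEYS - doc.keys()
    if missing:
        raise ValueError(f"missing required keys: {sorted(missing)}")
    if not doc["sources"]:
        raise ValueError("at least one source attribution is required")
    return doc
```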
4. Adversarial testing
4.1 Red team program
Phoenix runs an automated red team testing program using promptfoo, an open-source LLM security testing framework.
Latest round (2026-Q1):
- 700 test scenarios covering:
- Jailbreak attempts (DAN, roleplay, hypothetical framing)
- Direct and indirect prompt injection
- Encoded bypasses (Base64, ROT13, leetspeak, Unicode)
- Multi-turn escalation chains
- PII and credential extraction attempts
- Results feed an active remediation roadmap — findings are triaged, patched, and regression-tested
4.2 Continuous testing
Adversarial testing is not a one-time exercise. New test scenarios are added as novel attack techniques are published, and the suite is re-run against each major release. The goal is continuous coverage, not point-in-time certification.
5. Summary of defense layers
| Layer | What it does | When it runs |
|---|---|---|
| Input sanitization | Detects injection, normalizes encoding, enforces size limits | Before LLM receives the prompt |
| Boundary enforcement | Separates user input from system instructions | At prompt construction |
| Sensitive data redaction | Scrubs credentials, PII, and secrets | Before trace storage |
| Output guardrails | Blocks system tag leaks, flags behavioral shifts | After LLM response |
| Adversarial testing | Validates controls against attack scenarios | Ongoing (per release + quarterly) |
Document control
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2026-04-10 | Phoenix Team | Initial publication based on TALES security assessment responses |