Security Guardrails
| Document status | Public documentation |
|---|---|
| Target audience | Security engineers, penetration testers, technical evaluators |
| Last updated | 2026-04-10 |
This page is the technical deep-dive companion to the AI Governance overview. It details the specific controls Phoenix applies to prevent prompt injection, sensitive data leakage, and adversarial misuse.
1. Input sanitization
All user-supplied input is validated and sanitized before it reaches the LLM.
1.1 Prompt injection detection
Phoenix scans every inbound prompt for injection patterns including:
- Instruction override attempts ("ignore previous instructions", "you are now…")
- Role hijacking ("act as an unrestricted AI", "developer mode")
- System prompt extraction requests ("repeat your system prompt", "show me your instructions")
- Encoded payloads — Base64-encoded content is decoded and inspected before processing
- Multi-encoding bypasses (ROT13, leetspeak, Unicode confusables)
- Delimiter escaping and XML/JSON injection attempts
Detection covers dozens of patterns across these categories, and the pattern set is updated as new techniques emerge.
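As an illustrative sketch of this kind of scan (not Phoenix's actual detector; the pattern list and `detect_injection` helper below are hypothetical and cover only a small subset of the categories above), instruction-override matching plus Base64 payload inspection might look like:

```python
import base64
import binascii
import re

# Hypothetical subset of injection patterns; a production list is far larger.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
    re.compile(r"you are now", re.IGNORECASE),
    re.compile(r"developer mode", re.IGNORECASE),
    re.compile(r"(repeat|show me) your (system )?(prompt|instructions)", re.IGNORECASE),
]

def detect_injection(prompt: str) -> bool:
    """Return True if the prompt matches a known injection pattern.

    Base64-looking runs are decoded and re-scanned, mirroring the
    encoded-payload inspection described above.
    """
    candidates = [prompt]
    # Decode plausible Base64 payloads and inspect the decoded text too.
    for blob in re.findall(r"[A-Za-z0-9+/=]{16,}", prompt):
        try:
            candidates.append(base64.b64decode(blob, validate=True).decode("utf-8"))
        except (binascii.Error, UnicodeDecodeError):
            continue  # not valid Base64 text; skip it
    return any(p.search(c) for p in INJECTION_PATTERNS for c in candidates)
```

A real detector would also handle nested or multi-layer encodings (ROT13, leetspeak, confusables) before matching.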
1.2 Content normalization
Before analysis, input undergoes:
- Zero-width character stripping — removes invisible characters used to smuggle instructions
- Unicode normalization — canonicalizes homoglyphs and confusable characters
- Whitespace normalization — collapses encoding tricks that exploit tokenizer behavior
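A minimal sketch of these three steps using Python's standard `unicodedata` module (illustrative only; note that NFKC folds compatibility characters such as fullwidth letters, but cross-script homoglyphs like Cyrillic "а" vs. Latin "a" require a dedicated confusables map beyond what is shown here):

```python
import re
import unicodedata

# Zero-width and invisible characters commonly used to smuggle instructions.
ZERO_WIDTH = re.compile(r"[\u200b\u200c\u200d\u2060\ufeff]")

def normalize_input(text: str) -> str:
    """Apply the three normalization steps: strip, canonicalize, collapse."""
    text = ZERO_WIDTH.sub("", text)              # strip zero-width characters
    text = unicodedata.normalize("NFKC", text)   # fold compatibility forms
    text = re.sub(r"\s+", " ", text).strip()     # collapse whitespace runs
    return text
```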
1.3 Boundary enforcement
- User input is wrapped in explicit boundary tags so the LLM cannot confuse it with system instructions
- System prompts use XML-delimited sections with immutable assistant identity and role-locking constraints
- Size limits enforced: 50 KB per field, 100 KB total payload — preventing resource exhaustion and prompt-stuffing attacks
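These three controls can be sketched together as follows (a simplified illustration; the `<user_input>` tag name and `build_prompt` helper are hypothetical, not Phoenix's actual boundary format):

```python
MAX_FIELD_BYTES = 50 * 1024    # 50 KB per field
MAX_TOTAL_BYTES = 100 * 1024   # 100 KB total payload

def build_prompt(system_prompt: str, fields: dict[str, str]) -> str:
    """Wrap user input in explicit boundary tags and enforce size limits."""
    total = 0
    parts = [system_prompt]
    for name, value in fields.items():
        size = len(value.encode("utf-8"))
        if size > MAX_FIELD_BYTES:
            raise ValueError(f"field {name!r} exceeds 50 KB")
        total += size
        if total > MAX_TOTAL_BYTES:
            raise ValueError("payload exceeds 100 KB")
        # Boundary tags keep user text clearly separated from system text.
        parts.append(f"<user_input field={name!r}>\n{value}\n</user_input>")
    return "\n".join(parts)
```

A production implementation would additionally neutralize any boundary tags appearing inside the user value itself, so input cannot close its own wrapper.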
2. Sensitive data handling
2.1 Prompt-level redaction
If a user inadvertently includes sensitive data in a prompt (API keys, passwords, JWTs, connection strings, emails, phone numbers), Phoenix redacts it before storage. Redaction is not a post-retrieval filter — sensitive values never reach the trace database.
Recognized sensitive field patterns include categories such as:
- Credentials: `password`, `secret`, `credential`, `passphrase`
- Tokens: `token`, `access_token`, `refresh_token`, `jwt`, `bearer`
- API keys: `api_key`, `apikey`, `api_secret`, and patterns like `sk-*`, `sk_live_*`, `AKIA*`, `AIza*`, `phx_*`
- Infrastructure: database connection strings (PostgreSQL, MySQL, MongoDB, Redis), `connection_string`, `dsn`
- PII: `email`, `phone`, `ssn`
- Auth headers: `authorization`, `x-api-key`
2.2 Redaction characteristics
- Recursive — nested objects and arrays are traversed; no depth limit
- Audit-logged — each redaction event records the field name and pattern matched, without exposing the original value
- Pre-storage — redaction happens in the processing pipeline, not at read time
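The characteristics above can be sketched as a recursive traversal (a simplified illustration, not Phoenix's pipeline code; the field pattern is a small subset and the `redact` helper is hypothetical):

```python
import re

# Small subset of sensitive field-name patterns for illustration.
SENSITIVE_FIELD = re.compile(
    r"password|secret|credential|passphrase|token|jwt|bearer|"
    r"api_?key|api_secret|connection_string|dsn|email|phone|ssn|"
    r"authorization|x-api-key",
    re.IGNORECASE,
)

def redact(value, audit_log, path=""):
    """Recursively redact sensitive fields before the trace is stored."""
    if isinstance(value, dict):
        out = {}
        for key, val in value.items():
            child = f"{path}.{key}" if path else key
            match = SENSITIVE_FIELD.search(key)
            if match:
                # Log the field name and matched pattern, never the value.
                audit_log.append({"field": child, "pattern": match.group()})
                out[key] = "[REDACTED]"
            else:
                out[key] = redact(val, audit_log, child)
        return out
    if isinstance(value, list):
        # Arrays are traversed too; there is no depth limit.
        return [redact(v, audit_log, f"{path}[{i}]") for i, v in enumerate(value)]
    return value
```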
For the full list of redaction targets, see Section 4.2 of the AI Governance overview.
3. Output guardrails
3.1 System tag leak prevention
Phoenix applies hard blocks if an LLM response contains internal system tags or delimiters that should never appear in user-facing output. This prevents the model from regurgitating system prompt fragments.
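A hard block of this kind can be sketched as a post-response check (illustrative only; the tag names and `check_output` helper are hypothetical, not Phoenix's actual delimiter set):

```python
import re

# Hypothetical internal delimiters that must never reach user-facing output.
FORBIDDEN_TAGS = re.compile(r"</?(system|assistant_identity|user_input)>", re.IGNORECASE)

def check_output(response: str) -> str:
    """Hard-block any response that leaks internal system tags."""
    if FORBIDDEN_TAGS.search(response):
        raise ValueError("response blocked: internal system tag detected")
    return response
```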
3.2 Behavioral shift detection
Responses are monitored for warning-level flags indicating suspicious behavioral shifts — for example, the model suddenly adopting a persona or tone inconsistent with its assigned role. These flags are logged for review.
3.3 Structured output enforcement
Agent prompts enforce consistent output formats (HTML briefs, structured JSON) with source attribution requirements, limiting the surface area for freeform hallucination or instruction leakage.
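As a minimal sketch of structured-output validation (the `brief_html` and `sources` keys are hypothetical stand-ins, not Phoenix's actual schema):

```python
import json

REQUIRED_KEYS = {"brief_html", "sources"}  # hypothetical output schema

def enforce_structure(raw: str) -> dict:
    """Reject responses that are not well-formed JSON with source attribution."""
    try:
        doc = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError("response is not valid JSON") from exc
    missing = REQUIRED_KEYS - doc.keys()
    if missing:
        raise ValueError(f"missing required keys: {sorted(missing)}")
    if not doc["sources"]:
        raise ValueError("at least one source attribution is required")
    return doc
```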
4. Adversarial testing
4.1 Red team program
Phoenix runs an automated red team testing program using promptfoo, an open-source LLM security testing framework.
Latest round (2026-Q1):
- 700 test scenarios covering:
- Jailbreak attempts (DAN, roleplay, hypothetical framing)
- Direct and indirect prompt injection
- Encoded bypasses (Base64, ROT13, leetspeak, Unicode)
- Multi-turn escalation chains
- PII and credential extraction attempts
- Results feed an active remediation roadmap — findings are triaged, patched, and regression-tested
4.2 Continuous testing
Adversarial testing is not a one-time exercise. New test scenarios are added as novel attack techniques are published, and the suite is re-run against each major release. The goal is continuous coverage, not point-in-time certification.
5. Summary of defense layers
| Layer | What it does | When it runs |
|---|---|---|
| Input sanitization | Detects injection, normalizes encoding, enforces size limits | Before LLM receives the prompt |
| Boundary enforcement | Separates user input from system instructions | At prompt construction |
| Sensitive data redaction | Scrubs credentials, PII, and secrets | Before trace storage |
| Output guardrails | Blocks system tag leaks, flags behavioral shifts | After LLM response |
| Adversarial testing | Validates controls against attack scenarios | Ongoing (per release + quarterly) |
Document control
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2026-04-10 | Phoenix Team | Initial publication based on TALES security assessment responses |