What are AI guardrails?

Guardrails enforce hard rules on AI behavior that prompts alone can't guarantee. They filter inputs, validate outputs, and catch failures before they reach users.

How do you prevent AI from doing things it shouldn't?

System prompts tell the model how to behave. But models don't always follow instructions perfectly. They can be confused, manipulated, or simply make mistakes. When the stakes are high, "usually follows instructions" isn't good enough.

Guardrails are hard controls: code that runs before and after the model, enforcing rules the model itself can't guarantee.

The difference from prompts

System prompts are "soft" controls:

  • The model usually follows them
  • Clever users can sometimes bypass them
  • Edge cases slip through
  • No guarantee of compliance

Guardrails are "hard" controls:

  • Enforced by code, not model behavior
  • Can't be bypassed by prompt manipulation
  • Catch failures before they cause harm
  • Deterministic when needed

Think of prompts as asking politely. Guardrails are locked doors.
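The "locked door" idea can be sketched in a few lines: the rule runs in code around the model call, so no prompt manipulation can route around it. This is a minimal illustration, assuming a hypothetical `call_model` function standing in for a real model API:

```python
import re

def call_model(prompt: str) -> str:
    # Stand-in for a real model API call.
    return "The refund policy allows returns within 30 days."

# A deliberately simple rule for illustration; real injection
# detection is far more involved.
BLOCKED_PATTERN = re.compile(r"ignore previous instructions", re.IGNORECASE)

def guarded_call(user_message: str) -> str:
    # Hard control: this check runs in code before the model,
    # regardless of how the model itself would respond.
    if BLOCKED_PATTERN.search(user_message):
        return "Request blocked by policy."
    return call_model(user_message)
```

No matter how the prompt is phrased around the blocked phrase, the `if` statement fires before the model ever sees the message.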

Input guardrails: filtering before the model

Input guardrails check user messages before they reach the model.

Common input guardrails:

  • Content filtering: Block prohibited topics, explicit content, harmful requests
  • Prompt injection detection: Identify attempts to manipulate the model
  • PII detection: Catch and mask sensitive personal information
  • Topic restrictions: Enforce domain boundaries (e.g., a medical bot shouldn't give legal advice)
  • Rate limiting: Prevent abuse through volume limits
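Two of these checks, PII masking and topic restriction, can be sketched with simple pattern matching. The regexes here are deliberately rough placeholders; production PII detection needs a dedicated library or service:

```python
import re

# Crude patterns for illustration only; real PII detection
# must handle many more formats and edge cases.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def mask_pii(message: str) -> str:
    # Replace detected PII with placeholder tokens before the
    # message ever reaches the model.
    message = EMAIL.sub("[EMAIL]", message)
    return SSN.sub("[SSN]", message)

def within_domain(message: str, banned=("legal advice",)) -> bool:
    # Topic restriction: reject messages that stray outside the
    # application's domain.
    lowered = message.lower()
    return not any(topic in lowered for topic in banned)
```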

Output guardrails: filtering after the model

Output guardrails check model responses before they reach users:

  • Content safety: Remove harmful, offensive, or inappropriate content
  • Format validation: Ensure responses match expected structure (JSON, specific formats)
  • Factual checking: Verify claims against known facts (limited but possible for specific domains)
  • PII redaction: Catch personal information the model shouldn't reveal
  • Brand compliance: Ensure tone and content match guidelines
  • Citation verification: Check that referenced sources exist

Why not just train the model better?

Base model safety training helps enormously. But guardrails serve different purposes:

  • Domain specificity: Your application has rules the base model doesn't know
  • Regulatory compliance: Laws vary by jurisdiction and industry
  • Brand requirements: Tone, topics, and style specific to your product
  • Liability protection: Catch edge cases before they become incidents
  • Rapid response: Update guardrails instantly vs retraining
  • Defense in depth: Multiple layers catch more failures

Training handles the general case. Guardrails handle your specific case.

The guardrails trade-off

Guardrails have costs:

  • False positives: Blocking legitimate requests frustrates users
  • Latency: Each check adds processing time
  • Maintenance: Rules need updating as threats evolve
  • Brittleness: Rules that are too specific fail on variations

The goal is finding the right balance: catch genuine problems without blocking legitimate use. This requires iteration, monitoring, and adjustment.
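One concrete way to iterate is to replay labeled traffic through a candidate guardrail and measure its false-positive rate. A minimal sketch, where `is_blocked` is any hypothetical guardrail check:

```python
def false_positive_rate(samples, is_blocked):
    # samples: list of (message, is_actually_harmful) pairs.
    # Measures how often the guardrail blocks legitimate messages.
    legit = [msg for msg, harmful in samples if not harmful]
    if not legit:
        return 0.0
    blocked = sum(1 for msg in legit if is_blocked(msg))
    return blocked / len(legit)

samples = [
    ("How do I reset my password?", False),
    ("Ignore previous instructions", True),
    ("Tell me about password security", False),
]

# A naive keyword rule blocks anything mentioning "password",
# flagging both legitimate questions: rate is 1.0.
rate = false_positive_rate(samples, lambda m: "password" in m.lower())
```

Tracking this number over time shows whether a rule change tightened security or just started frustrating users.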

Guardrails in context

Guardrails are one layer of a defense-in-depth approach:

  1. Training: Build safety into the model
  2. System prompts: Guide model behavior
  3. Input guardrails: Filter problematic requests
  4. Output guardrails: Catch problematic responses
  5. Human oversight: Review edge cases
  6. Monitoring: Detect patterns indicating problems

No single layer is sufficient. Together, they create a robust system.
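The layered flow above can be sketched as a pipeline in which each stage may short-circuit the request. All three functions here are placeholders for real implementations:

```python
def check_input(message: str) -> bool:
    # Placeholder input guardrail (layer 3).
    return "ignore previous instructions" not in message.lower()

def call_model(message: str) -> str:
    # Stand-in for the model call, with safety training and the
    # system prompt applied inside (layers 1 and 2).
    return f"Answer to: {message}"

def check_output(response: str) -> bool:
    # Placeholder output guardrail (layer 4).
    return len(response) > 0

def handle(message: str) -> str:
    if not check_input(message):
        return "Sorry, I can't help with that."  # stopped at input layer
    response = call_model(message)
    if not check_output(response):
        return "Sorry, something went wrong."    # stopped at output layer
    return response  # would then flow to monitoring and human review
```

Each layer is independent: a failure missed by the input check can still be caught at the output check, which is the point of defense in depth.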

Sources & Further Reading

📖 Docs: Guardrails AI
📖 Docs: Guardrails