A system-level approach to prompt injection: separating instruction and data channels in LLM agents [P]

Prompt injection has emerged as one of the most persistent failure modes in tool-using LLM systems, particularly in agentic workflows where models interact with external data sources.

Most mitigation strategies focus on input filtering or model-side alignment, but these approaches struggle because the core issue is structural:

Approach

I explored a system-level mitigation strategy by introducing a middleware layer (Sentinel Gateway) that enforces a strict separation between:

- Instruction channel: trusted, runtime-issued commands

- Data channel: untrusted external inputs (web, files, APIs)

Instead of attempting to classify malicious inputs, the system ensures that:

All agent actions require a signed, scoped runtime authorization token, effectively decoupling observation from execution.

Implementation

- FastAPI middleware layer for agent tool calls

- Token-based authorization for execution requests

- Streamlit interface for inspection and debugging

- Audit logging of agent decisions and tool usage

- Supports multi-agent integration patterns (e.g., Claude-based sessions)

- Local or Postgres-backed persistence layer

Repo

https://github.com/cmtopbas/Sentinel-Gateway

Discussion question

I’m interested in feedback on:

- whether instruction/data separation is a meaningful abstraction for agent safety

- failure modes in token-based execution gating

- how this compares conceptually to other agent safety or sandboxing approaches

原始关键词#instruction#separating#injection#approach#channels#agents

查看原文reddit.com

单一来源，暂无交叉验证