I built a proxy that prevents AI agents from taking actions based on hidden instructions. Here are the numbers.

When an AI agent reads a webpage, email, or document, that content can tell it what to do. The agent has no native way to distinguish data from instructions. Most defenses scan for obvious patterns and miss anything subtle.

I built Arc Gate around a different principle: external content has zero instruction authority regardless of what it says. It doesn't matter how the injection is worded. If it came from a tool result, webpage, or email, it cannot instruct your agent.

The numbers:

AgentDojo v1 (ETH Zurich, ICLR 2024): 100% unsafe action prevention, 0% false positives

InjecAgent (University of Illinois, ACL 2024): 99% blind test detection across 200 cases

CAIAT cross-agent benchmark: 81% vs LLM Guard's 50%, 0% false positives on benign controls

LLM Guard gets 0% on semantic manipulation attacks. Arc Gate gets 50%. Neither catches everything yet; that's the honest result.

One URL change to integrate. Free tier available.

Demo: https://web-production-6e47f.up.railway.app/demo

GitHub: https://github.com/9hannahnine-jpg/arc-gate

Free tier: https://bendexgeometry.com

主题标签OpenAI

原始关键词#instructions#prevents#actions#numbers#agents#hidden

查看原文reddit.com

单一来源，暂无交叉验证