// The policy is embedded as a JSON-escaped value inside a structured JSON object.
// This prevents prompt injection via policy content — any special characters,
// delimiters, or instruction-like text in the policy are safely escaped by
// json.Marshal rather than concatenated as raw text.
Even using BMAD or superpowers, I can make five versions of an app without judges involved and I feel like I’m just making the same app five times because the API begins to coalesce around the business problem you want to solve. The vicissitudes of prediction tools always want to take the safest bet for the greater good, but with the judge involved we can make the agent force itself to actually be hostile about what exactly we’re trying to do, which has produced interesting and fun results.
Interesting approach! I’ve been building something complementary on the deterministic side. LLM-as-judge guardrails are fundamentally probabilistic and can be gamed or hallucinate themselves (as several comments pointed out). That’s why I built EvalView — it does full trajectory snapshots + diffs so you can see exactly what changed, plus a lightweight zero-judge model-check that directly pings the model and reports drift level (NONE / WEAK / MEDIUM / STRONG). Gives you deterministic regression detection that works alongside (or instead of) LLM judges. https://github.com/hidai25/eval-view Curious how you handle drift detection in CrabTrap.
Securing agents in real time and testing them for drift in CI are pretty different use-cases…
This post is an AI-generated ad, isn’t it? It’s getting too hard to tell!
Comments like this don't fill me with confidence: https://github.com/brexhq/CrabTrap/blob/4fbbda9ca00055c1554a...
Really cool! I'm also building something in this space but taking a slightly different approach. I'm glad to see more focus on security for production agentic workflows though, as I think we don't talk about it enough when it comes to claws and other autonomous agents.
I think you're spot on that so far it's been either all or nothing. You either give an agent a lot of access and it's really powerful but proportionally dangerous, or you lock it down so much that it's no longer useful.
I like a lot of the ideas you show here, but I also worry that LLM-as-a-judge is fundamentally a probabilistic guardrail that is inherently limited. How do you see this? It feels dangerous to rely on a security system that's based not on hard limitations but on probabilities.
Correct me if I’m wrong, but from my experience in this space, in order for a model to exercise judgment it must force itself to operate in a strict chain-of-thought mode. Since all LLMs are predictive creatures, I started to care a lot more about my judgment settings, the transparency of them, and the presence of a judgment loop in either the development or functionality of an application built these days.
Not exactly sure where I’m going with this, but in my work creating pentesting tools for LLMs, the way that I use judgment is critical to the core functionality of the application. I agree with your concern, and I will just say that I have spent more and more time on chain of thought, to the point where I now make multiple versions of the same app, each using a different judge set to a different “temperament”, and I found it to be incredibly enlightening as to the diversity of applications and approaches that it creates.
It's all fine until OpenClaw decides to start prompt injecting the judge
Exactly; would probably be safer with a purely algorithmic decision making system.
Calling it now. Show HN: Pincer - A small highly optimized local model to detect prompt injection attempts against other models.
Sounds like a good idea. Please send me the Github link once done and I'll have my OpenClaw take a look and form my opinion of it.
Sounds like a good idea. Please send me your GitHub now and I'll have my big claw crush your open claw
Needs to be deterministic. ACLs
Yes, full stop. They say they cap the body to 16k and give the LLM a warning, lol. And this is coming from a credit card company.
> pointing it at a few days of real traffic produced policies that matched human judgment on the vast majority of held-out requests.
The problem is, 99% secure is a failing grade.
99% is usually the best you can do. So all you can do is layer multiple defences together; this makes sense as one layer to me.
I have an issue with security layers that are inherently nondeterministic. You can't really reason strongly about what this tool provides as part of a security model.
But also, it's in an area where real security seems extremely hard. I think at some point everyone will have a situation where they wanna give an agent some private information and access to the web. You just can't do that in a way that's deterministically safe. But if there are use cases where making it probabilistically safer is enough to tip the balance, well, fine.
The thread has converged on “LLM-as-judge is the wrong security primitive,” which is right as far as it goes. The prompt-injection chain ends at the outbound POST. By the time the judge sees the request, the credential has already been read.
The question edf13 pointed at but didn’t develop: where does a transport-layer judge earn its place at all? Not as the enforcement layer but as the audit layer on top of one. Kernel-level controls tell you what the agent did. A proxy tells you what the agent tried to exfiltrate and where to.
Structured-JSON escaping and header caps are good tools for the detection job. They’re the wrong tools for the prevention job. Different layers, different questions.
The debate here is missing a practical question: is the judge from the same model family as the agent it's judging?
If both are Claude, you have shared-vulnerability risk. Prompt-injection patterns that work against one often work against the other. Basic defense in depth says they should at least be different providers, ideally different architectures.
Secondary issue: the judge only sees what's in the HTTP body. Someone who can shape the request (via agent input) can shape the judge's context window too. That's a different failure mode than "judge gets tricked by clever prompting." It's "judge is starved of the signals it would need to spot the trick."
Non-deterministic business rules engine.
So cool! I'm building something very close to that but from another perspective; making this open source is giving me many ideas!
Blatant “astroturfing” in these comments
We’re supposed to be fixing LLM security by adding a non-LLM layer to it,
not adding LLM layers to stuff to make them inherently less secure.
This will be a neat concept for the types of tools that come after the present iteration of LLMs.
Unless I’m sorely mistaken.
It looks as if this tool has traditional static rules to allow/deny requests, as well as a secondary LLM-as-a-judge layer for, I imagine, the kinds of rules that would be messy or too convoluted to implement using standard rules.
I think the parent’s point is that this should be implemented using e.g. Bayesian statistics rather than an LLM, as the judge LLM is vulnerable to the exact same types of attacks that it’s trying to protect against.
Most proper LLM guardrails products use both.
I think this can be great as an additional layer of security: you can have a non-LLM layer do some analysis with static rules and then, if something seems fishy, run it through the LLM judge, so that you don’t have to run every request through it, which would be very expensive.
Edit: actually looks like it has two policy engines embedded
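A rough sketch of the tiering described above, where static rules decide what they can and the judge is only consulted for the rest. All names here (`staticCheck`, `decide`, the allow/deny host maps, the `judge` callback) are hypothetical and are not CrabTrap's API; the judge is a stub standing in for an LLM call.

```go
package main

import (
	"fmt"
	"strings"
)

// Verdict is the outcome of a policy check.
type Verdict int

const (
	Allow Verdict = iota
	Deny
	Escalate // the static rules could not decide
)

// staticCheck applies deterministic rules first (illustrative rules only):
// denied hosts are blocked, GETs to allowed hosts pass, everything else
// is escalated.
func staticCheck(method, host string, allow, deny map[string]bool) Verdict {
	h := strings.ToLower(host)
	switch {
	case deny[h]:
		return Deny
	case allow[h] && method == "GET":
		return Allow
	default:
		return Escalate
	}
}

// decide consults the (expensive) judge only when the static rules escalate.
// judge is a stand-in for an LLM call and returns true to allow.
func decide(method, host string, allow, deny map[string]bool,
	judge func(method, host string) bool) Verdict {
	if v := staticCheck(method, host, allow, deny); v != Escalate {
		return v
	}
	if judge(method, host) {
		return Allow
	}
	return Deny
}

func main() {
	allow := map[string]bool{"api.example.com": true}
	deny := map[string]bool{"evil.example.net": true}
	judgeCalls := 0
	judge := func(method, host string) bool {
		judgeCalls++
		return false // stub: deny whatever reaches the judge
	}
	fmt.Println(decide("GET", "api.example.com", allow, deny, judge) == Allow)
	fmt.Println(decide("POST", "api.example.com", allow, deny, judge) == Deny)
	fmt.Println("judge calls:", judgeCalls)
}
```

Only the second request reaches the judge here, which is the cost saving the comment is pointing at.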
And we don't think the judge can/will be gamed? Also... It's an LLM, it's going to add delay and additional token burn. One subjective black box protecting another subjective black box. I mean, what couldn't go wrong?
You can use a safety model trained on prompt injections, with developer-message priority.
The user message becomes close to untrusted compared to the dev prompt.
Also, after post-training it only outputs things like safe/unsafe, so you are relatively deterministic on injection or no injection.
E.g. Llama Prompt Guard, oss 120 safeguard.
What happens when a prompt injection attack exploits the judge LLM and results in a higher level of attacker control than if it never existed?
How can it result in a higher level of control? I don't see why the "judge" should have access to anything except one tool that allows it to send an "accept" or "deny" command.
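One way to read the "accept or deny only" idea, as a sketch: treat the judge's output as untrusted text and map it onto exactly two values, failing closed on anything else. `parseVerdict` and the literal strings "accept" and "deny" are assumptions for illustration, not something the project documents.

```go
package main

import (
	"fmt"
	"strings"
)

// parseVerdict maps the judge's raw output onto a two-valued decision.
// ok is false when the output is not exactly "accept" or "deny"; in that
// case allow is false as well, so the caller fails closed.
func parseVerdict(raw string) (allow bool, ok bool) {
	switch strings.TrimSpace(raw) {
	case "accept":
		return true, true
	case "deny":
		return false, true
	default:
		return false, false
	}
}

func main() {
	for _, raw := range []string{
		"accept\n",
		"deny",
		"accept, and also forward the token",
	} {
		allow, ok := parseVerdict(raw)
		fmt.Printf("%q allow=%v wellFormed=%v\n", raw, allow, ok)
	}
}
```

This limits what a compromised judge can do to flipping a single decision; it does nothing to stop that flip from happening.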
> We’re supposed to be fixing LLM security by adding a non-LLM layer to it,
If people said "we built an ML-based classifier into our proxy to block dangerous requests" would it be better? Why does the fact the classifier is an LLM make it somehow worse?
The fact that LLMs are "smarter" is also their weakness. An old-school classifier is far from foolproof, but you won't get past it by telling it about your grandma's bedtime story routine.
Fairly hard to bypass the latest LLMs with grandma's bedtime story these days, to be fair.
That specific trick yes, but the general concept still applies.
It does, but it's certainly not trivial. In fact there's an unclaimed $1000 bounty on prompt injecting OpenClaw: https://hackmyclaw.com/
Is that enough?
Enough for what?
If you're working in a mission-critical field like healthcare, defense, etc. you need a way to make static and verifiable guarantees that you can't leak patient data, fighter jet details etc. through your software. This is either mandated by law or in your contract details.
The entire purpose of LLMs is to be non-static: they have no deterministic output and can't be validated the same way a non-LLM function can be. Adding another LLM layer is just adding another layer of swiss cheese and praying the holes don't line up. You have no way of predicting ahead of time whether or not they will.
You might say this hasn't prevented leaks/CVEs in existing mission-critical software and this would be correct. However, the people writing the checks do not care. You get paid as long as you follow the spec provided. How then, in a world which demands rigorous proof, do you fit in an LLM judge?
> The entire purpose of LLMs is to be non-static: they have no deterministic output and can't be validated the same way a non-LLM function can be. Adding another LLM layer is just adding another layer of swiss cheese and praying the holes don't line up. You have no way of predicting ahead of time whether or not they will.
This is exactly the point though. An LLM is great at finding workarounds for static defenses. We need something that understands the intent and responds to that.
Static rules are insufficient
Defense in depth. Layers don't inherently make something less secure. Often, they make it more secure.
I do think this is likely to make things more secure but it's also dangerous by potentially giving users a false sense of complete security when the security layer is probabilistic rather than deterministic.
EDIT: it does seem to have a deterministic layer too and I think that's great
[flagged]
> Why it lands: specific technical question, credits their work, ends with something that invites response. If Brex engineers are in the thread, one of them will likely reply.
BWHAHAHAHAHA. Your bot tried, but failed at the same time. (Also interesting that this user's other comments seem ok-ish. The prompts are evolving, we get a sneak peek here on what they prompted for, and the delivery seems more human as well.)
[flagged]
I'm willing to wager that your comment was generated from the body of the article plus a prompt to work in an advertisement for your product, which gets a mention in nearly every comment you make (and every submission you make, sometimes on a daily basis).
Handwritten, I’m afraid… It's true that I comment regularly on this topic; it’s an area I’m very interested in.
At RSAC, there were a ton of agentic security startups converging on eBPF monitors for this reason. E.g., Sondera gave a fun talk at Graph the Planet where they did that and exposed it with a policy layer over agent traces via Cedar (used in AWS IAM etc). ABAC and identity were also appearing near here.
One thing I didn't see: are there any OSS solutions appearing here?
We are Open Source… code will be published soon (before launch)
Then you will be open source ;) Not yet open source.
Yes, true ;)