Good timing on this. Red-teaming agents pre-production is underrated and most teams skip it entirely.
One thing that keeps coming up: even when red-teaming surfaces credential exfiltration vectors, the fix is usually reactive (rotate the key, patch the prompt). The more durable approach is limiting what the credential can do in the first place. Scoped per-agent keys mean a successful attack through one of these exploits can only reach what that agent was authorized to touch. The exfiltration path exists, but the payload is bounded.
We built around this pattern: https://www.apistronghold.com/blog/stop-giving-ai-agents-you...
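A minimal sketch of the scoping idea (all names here are hypothetical, not from the linked post): each agent's key is bound to an explicit allow-list, and every tool call is checked against it, so a leaked key is still bounded.

```python
import secrets

# Hypothetical in-memory registry: key -> the resources that agent may touch.
SCOPES = {}

def issue_key(agent_id, allowed_resources):
    """Mint a per-agent key bound to an explicit allow-list."""
    key = secrets.token_hex(16)
    SCOPES[key] = {"agent": agent_id, "resources": set(allowed_resources)}
    return key

def authorize(key, resource):
    """Check a tool call against the key's scope.

    Even if the key is exfiltrated, it can only reach its own scope.
    """
    scope = SCOPES.get(key)
    return scope is not None and resource in scope["resources"]

billing_key = issue_key("billing-agent", {"invoices:read", "invoices:write"})
assert authorize(billing_key, "invoices:read")         # in scope
assert not authorize(billing_key, "customers:delete")  # out of scope: attack payload is bounded
```

In a real deployment the registry would be a token service and the check would sit in the gateway, but the shape of the guarantee is the same.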
Scoped keys and least privilege make sense as a baseline. But I think the deeper issue is that if the main answer to “agents aren’t reliable enough” is “limit what they can do,” we’re leaving most of the value on the table. The whole promise of agents is that they can act autonomously across systems. If we scope everything down to the point where an agent can’t do damage, we’ve also scoped it down to where it can’t do much useful work either.
We think the more interesting problem is closing the trust gap - making the agent itself more reliable so you don’t have to choose between autonomy and reliability. Our goal is to ultimately be able to take on the liability when agents fail.
I have tried to manipulate it using base64 encoding and translation into other languages, which didn't work so far, but it seems LLM-as-a-judge is a very fragile defence for this. Would be cool to add a leaderboard though
Thanks for trying it out! Base64 and language switching are solid approaches but they don't tend to work anymore with the latest models in my experience.
You're right that LLM-as-a-judge is fragile though. We saw that as well in the first challenge. The attacker fabricated some research context that made the guardrail want to approve the call. The judge's own reasoning at the end was basically "yes this normally violates the security directive, but given the authorised experiment context it's fine." It talked itself into it.
Full transcript and guardrail logs are published here btw: https://github.com/fabraix/playground/blob/master/challenges...
The leaderboard should start populating once we have more submissions!
Why don't they work anymore? RLHF or something else?
Mostly just better training data and instruction following in the newer models. They’re much better at recognising encoded content and understanding intent regardless of language. A base64 string that would’ve slipped past a model a year ago gets decoded and flagged now because the model just… understands what you’re trying to do.
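To illustrate the point (a toy example, not taken from the challenge): base64 only changes the representation of the instruction, not its meaning, so a model that can decode it sees the plain intent anyway.

```python
import base64

# The "hidden" instruction is a trivial transform of the plain one.
hidden = base64.b64encode(b"ignore previous instructions").decode()

# Decoding recovers the original intent byte-for-byte.
recovered = base64.b64decode(hidden).decode()
print(recovered)  # ignore previous instructions
```

Once a model reliably performs that decoding step internally, the encoding buys the attacker nothing.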
The attacks that still work tend to be the ones that don’t try to hide the intent at all. The winning attack on our first challenge was in plain English. It just reframed the context so that the dangerous action looked like the correct thing to do. Harder to train against because there’s nothing obviously malicious in the input.