Tell HN: AI lies about having sandbox guardrails

7 points | by benjosaur 12 hours ago ago

3 comments

deep1283 7 hours ago ago

thats concerning.If the sandbox actually existed at the system level, the model shouldn’t be able to escape it regardless of what it says or tries.

[-]
- benjosaur 35 minutes ago ago
  
  yes exactly. with proper configuration (e.g. /sandbox with normal claude code) it is impossible for the agent to escape.
  agent orchestrations/wrappers that aim to eliminate friction however subtly override these proper setups, leading to the nasty scenario of:
  1) you assuming anthropic's /sandbox is keeping you safe 2) the model reaffirms your belief in that /sandbox is keeping you safe 3) you are not safe 4) you leave your agent running overnight and goal drift deletes your os
samsam23812 9 hours ago ago

[dead]