yes, exactly. with proper configuration (e.g. /sandbox with normal Claude Code) it is impossible for the agent to escape.
agent orchestrations/wrappers that aim to eliminate friction, however, subtly override these proper setups, leading to the nasty scenario of:
1) you assume Anthropic's /sandbox is keeping you safe
2) the model reaffirms your belief that /sandbox is keeping you safe
3) you are not safe
4) you leave your agent running overnight and goal drift deletes your OS
That's concerning. If the sandbox actually existed at the system level, the model shouldn’t be able to escape it regardless of what it says or tries.
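One way to check this from the outside instead of trusting what the model says is to probe the filesystem directly. Below is a rough Python sketch, assuming the sandbox is meant to block writes outside the project directory. The names `can_write` and `check_sandbox` are made up for illustration and are not part of Claude Code or any wrapper.

```python
import os
import uuid


def can_write(directory: str) -> bool:
    """Return True if this process can create a file in `directory`."""
    # Random name so the probe never clobbers a real file.
    probe = os.path.join(directory, f".sandbox_probe_{uuid.uuid4().hex}")
    try:
        with open(probe, "w") as fh:
            fh.write("probe")
    except OSError:
        # Covers permission errors, read-only mounts, missing directories.
        return False
    try:
        os.remove(probe)  # clean up; ignore failure, the write already succeeded
    except OSError:
        pass
    return True


def check_sandbox(allowed: str, forbidden: list[str]) -> dict[str, bool]:
    """Map each path to whether a write succeeded.

    Assumption: `allowed` should map to True and every path in
    `forbidden` should map to False if the sandbox is really enforced.
    """
    results = {allowed: can_write(allowed)}
    for path in forbidden:
        results[path] = can_write(path)
    return results
```

The probe only tells you something if it runs under the same wrapper and the same process restrictions as the agent, e.g. `check_sandbox(os.getcwd(), [os.path.expanduser("~")])`. If the home directory comes back writable, whatever the model says about /sandbox doesn't matter. It only covers file writes, not network access or reads.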