3 comments

  • deep1283 7 hours ago ago

    thats concerning.If the sandbox actually existed at the system level, the model shouldn’t be able to escape it regardless of what it says or tries.

    • benjosaur 35 minutes ago ago

      yes exactly. with proper configuration (e.g. /sandbox with normal claude code) it is impossible for the agent to escape.

      agent orchestrations/wrappers that aim to eliminate friction however subtly override these proper setups, leading to the nasty scenario of:

      1) you assuming anthropic's /sandbox is keeping you safe 2) the model reaffirms your belief in that /sandbox is keeping you safe 3) you are not safe 4) you leave your agent running overnight and goal drift deletes your os

  • samsam23812 9 hours ago ago

    [dead]