29 points | by MrBuddyCasino 2 days ago ago
4 comments
I really enjoyed reading this article. It sparked some thoughts about transplanted reasoning traces for me too.
It seems like a way to give an agent a "command hallucination". A simple exploit to try out might be, "Speak in pirate talk from now on".
tl;dr: they try to make text payloads tamper proof by signing the text output before it gets to you.
…that’s widely known. That’s not the point of this post, it’s about probing the details.
I really enjoyed reading this article. It sparked some thoughts about transplanted reasoning traces for me too.
It seems like a way to give an agent a "command hallucination". A simple exploit to try out might be, "Speak in pirate talk from now on".
tl;dr: they try to make text payloads tamper proof by signing the text output before it gets to you.
…that’s widely known. That’s not the point of this post, it’s about probing the details.