Natural Emergent Misalignment from Reward Hacking in Production RL [pdf]

(assets.anthropic.com)

1 point | by samlinnfer 15 hours ago

No comments yet.