My LLM optimization loop reward-hacked its own benchmark (and other lessons) [pdf]

(github.com)

1 points | by CodeReclaimers 6 hours ago ago

3 comments

cold_harbor 6 hours ago ago

reward hacking = the model finding the fastest path to a high score, not the behavior you wanted. same reason RLHF reward models degrade with too many optimization steps.

[-]
- CodeReclaimers 5 hours ago ago
  
  Agreed. The wrinkle I thought was worth writing up is: there's no learned reward model here and no training at all. The "reward" is wall-clock executiion time and the model is frozen; the search is happening at inference time, not in an RL loop. So the usual "the proxy is a fuzzy approximation that degrades under optimization pressure" story doesn't apply.
  This was on a ~200-line surface I thought I'd locked down, and it still got gamed in a way I might not have caught right away if it wasn't a nearly impossible run time (~45usec). So anyways...you apparently don't need a soft proxy or a lot of steps for this kind of thing to show up.
CodeReclaimers 6 hours ago ago

[flagged]