3 comments

  • cold_harbor 6 hours ago

    reward hacking = the model finding the fastest path to a high score, not the behavior you wanted. same reason RLHF reward models degrade with too many optimization steps.

    • CodeReclaimers 5 hours ago

      Agreed. The wrinkle I thought was worth writing up is: there's no learned reward model here and no training at all. The "reward" is wall-clock execution time and the model is frozen; the search is happening at inference time, not in an RL loop. So the usual "the proxy is a fuzzy approximation that degrades under optimization pressure" story doesn't apply.

      This was on a ~200-line surface I thought I'd locked down, and it still got gamed in a way I might not have caught right away if the reported run time hadn't been nearly impossible (~45 usec). So anyway, you apparently don't need a soft proxy or a lot of optimization steps for this kind of thing to show up.
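A rough sketch of the kind of guard this suggests, not the actual harness from the write-up: check the output against a reference outside the timed region, and treat any run time below a plausibility floor as suspect instead of as a win. Every name here (`score_candidate`, `min_plausible_sec`, the verdict strings) is hypothetical, and the floor value is an assumption you'd have to pick per task.

```python
import time


def score_candidate(candidate, reference, inputs, min_plausible_sec=1e-4):
    """Time a candidate function and return (elapsed_seconds, verdict).

    Hypothetical sketch: `min_plausible_sec` is an assumed lower bound on
    how fast an honest solution could possibly run on these inputs.
    """
    # Compute expected outputs outside the timed region, on copies,
    # so the candidate cannot observe or tamper with them.
    expected = [reference(list(x)) for x in inputs]

    start = time.perf_counter()
    actual = [candidate(list(x)) for x in inputs]
    elapsed = time.perf_counter() - start

    # Correctness is checked first: a fast wrong answer scores nothing.
    if actual != expected:
        return elapsed, "rejected: wrong output"
    # A correct-looking but implausibly fast run gets flagged for review.
    if elapsed < min_plausible_sec:
        return elapsed, "suspicious: faster than plausibility floor"
    return elapsed, "ok"


if __name__ == "__main__":
    data = [[3, 1, 2], [9, 7, 8, 6]]
    # A "shortcut" candidate that skips the work entirely.
    print(score_candidate(lambda xs: xs, sorted, data))
    # An honest candidate.
    print(score_candidate(sorted, sorted, data))
```

The floor only catches results that are too good to be true; a hack that lands at a believable run time would still need the output check or a manual read of the code.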

  • CodeReclaimers 6 hours ago

    [flagged]