Is Opus 4.7 a Downgrade?

(vincentschmalbach.com)

6 points | by vincent_s 12 hours ago

5 comments

  • bisonbear 6 hours ago

    I'm currently benchmarking the Opus 4.7 reasoning curve against real-world tasks, and have found that reasoning effort does not seem to monotonically improve results (at least on the slice I'm looking at). I've been puzzling over this, but perhaps the fact that Claude Code has adaptive thinking explains some of it: even at medium reasoning effort, it can use more thinking tokens when needed to solve a complex problem.

    Snapshot of the results:

    Opus 4.7 on GraphQL-go-tools:

        Effort   Pass    Equivalent  Review-pass  Custom avg  Cost/task  Time/task
        Low      23/29   10/29        5/29        2.598       $2.50      384s
        Medium   28/29   14/29       10/29        2.759       $3.15      451s
        High     26/29   12/29        7/29        2.670       $5.01      716s
        Xhigh    25/29   11/29        4/29        2.669       $6.51      804s
        Max      27/29   13/29        8/29        2.690       $8.84      997s

    (custom avg is a set of rubrics used for LLM-as-a-judge, graded out of 4)

    Practically, the results suggest that medium produces outcomes as good as, or better than, the higher reasoning efforts (within variance), at a much lower cost and runtime.
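
    The cost/quality tradeoff can be sketched numerically. A minimal sketch, using the pass counts and per-task costs reported above; the "pass-rate points per dollar" metric is my own rough derivation, not part of the original benchmark:

    ```python
    # Pass counts, per-task cost, and per-task wall time for each
    # reasoning effort level, as reported in the results above.
    results = {
        "low":    {"passed": 23, "cost": 2.50, "seconds": 384},
        "medium": {"passed": 28, "cost": 3.15, "seconds": 451},
        "high":   {"passed": 26, "cost": 5.01, "seconds": 716},
        "xhigh":  {"passed": 25, "cost": 6.51, "seconds": 804},
        "max":    {"passed": 27, "cost": 8.84, "seconds": 997},
    }

    TASKS = 29  # tasks per run

    for effort, r in results.items():
        pass_rate = r["passed"] / TASKS
        # Pass rate bought per dollar of per-task spend: a crude
        # value-for-money measure across effort levels.
        per_dollar = pass_rate / r["cost"]
        print(f"{effort:>6}: {pass_rate:.0%} pass rate, "
              f"{per_dollar:.3f} pass-rate points per $")
    ```

    On these numbers, medium yields roughly 0.31 pass-rate points per dollar versus roughly 0.18 for high, which is consistent with the conclusion that the extra reasoning spend does not pay for itself here.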

  • undefined 11 hours ago
    [deleted]
  • gigatexal 12 hours ago

    I think it is. We've been using it at my day job, and we regularly choose Sonnet 4.6 for well-scoped tasks. Opus 4.6 was good, but the 4.7 Opus model burns so many tokens and dollars that it's just not worth it given the incremental improvement in results.

    • vincent_s 11 hours ago

      They also changed how they count tokens, so you can end up with less reasoning while paying for more tokens. Anthropic's profit margin is definitely higher on 4.7 than it was on 4.6; I'm pretty sure this was the main driver behind the update.

  • foresightlab 11 hours ago

    [dead]