The conclusion "open models are roughly 8-10 months behind the closed frontier" is presented without adequate evidence.
Most of the open models that were benchmarked are old, e.g. 6 months or more older than the latest open-weights models launched in the last few months.
The only big and recent open model that I see mentioned is GLM 5.1.
For such a study to be credible, it must benchmark all of the many open-weights models launched during the last 3 months, and in their full versions. I also want to see a list of the exact model versions that were benchmarked and what was used for inference.
Only then could one draw a meaningful conclusion about the current delay in months.
In any case, a single value for how far behind the open models are does not tell much, because actual performance depends heavily on the specific problem.
Following the links from TFA to the private benchmarks, which are supposed to be more trustworthy, I see benchmarks where GPT 5.5 and Opus 4.6 are beaten by models like Qwen 3.7, e.g. in SimpleBench.
Of course, I assume that on average the OpenAI and Anthropic models are better, but this does not guarantee that they will be better than the open weights models on any particular problem.
So the value of the delay in "months" provides little information. If the OpenAI and Anthropic models had an advantage so great that they beat the Chinese models in every benchmark, that would be newsworthy. As long as their advantage is only probabilistic, i.e. they win more benchmarks than they lose, that advantage is not decisive, and you cannot be certain that paying for them will really get you the best results.
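To make the "probabilistic advantage" point concrete, here is a minimal Python sketch. The model names, benchmark names, and scores are all made-up placeholders, not real results, and `win_rate` and `wins_for` are hypothetical helpers written just for this illustration. It only shows that a model can win the majority of benchmarks while still losing on specific ones.

```python
from typing import Dict, List

# Hypothetical placeholder scores, NOT real benchmark results.
# Keys are invented benchmark names; values map invented model labels to scores.
scores: Dict[str, Dict[str, float]] = {
    "bench_a": {"closed_model": 0.82, "open_model": 0.74},
    "bench_b": {"closed_model": 0.61, "open_model": 0.66},
    "bench_c": {"closed_model": 0.90, "open_model": 0.85},
    "bench_d": {"closed_model": 0.47, "open_model": 0.52},
    "bench_e": {"closed_model": 0.73, "open_model": 0.70},
}


def win_rate(table: Dict[str, Dict[str, float]], a: str, b: str) -> float:
    """Fraction of benchmarks where model `a` scores strictly higher than `b`."""
    wins = sum(1 for row in table.values() if row[a] > row[b])
    return wins / len(table)


def wins_for(table: Dict[str, Dict[str, float]], a: str, b: str) -> List[str]:
    """Sorted names of the benchmarks where model `a` beats model `b`."""
    return sorted(name for name, row in table.items() if row[a] > row[b])


closed_rate = win_rate(scores, "closed_model", "open_model")
open_wins = wins_for(scores, "open_model", "closed_model")

print(f"closed model wins {closed_rate:.0%} of benchmarks")
print(f"open model still wins on: {open_wins}")
```

With these placeholder numbers the closed model wins most of the benchmarks, yet anyone whose workload looks like the benchmarks in `open_wins` would be better served by the open model. A single "months behind" figure hides exactly that.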
I wonder how Sonnet vs Opus stacks up in a similar time-based comparison.
How far behind are open models compared to Sonnet?
It may be that the absolute SOTA models are way ahead of open models, but the gap in the mid tier really does feel like it's compressing. I'd love to see empirical data about it though.
DeepSeek is really good, and I've found it consistently better than ChatGPT. I don't understand the hype around OpenAI. I don't know if they were better and got worse, because I could swear ChatGPT was good at some point in the past.
I am thrilled with DeepSeek V4 Flash via API as it's very good and very cheap. But for everyday advanced tasks and coding, even DS V4 Pro is nowhere near GPT 5.5 or Claude Opus 4.7 (never mind 4.8). I think V4 is a great release, but the frontier is moving fast, too.
One year at most, six months at the least, and by 2030 the difference will not even matter. The same goes for the memory/chip market, and in 2035, look out…