For me it's the opposite - we don't get enough models to test. In the last 6 months, we got Claude 3.7, OpenAI o1, Grok 3, Gemini 2.5 Pro, and OpenAI o3. That's it - 5 frontier models. Not that hard to test each one of them manually, which I did for many hours and with many different tasks. o1 --> o3 and 2.5 Pro are the ones I'm using the most.
I couldn't care less about benchmarks - I know what these models are capable of from personal experience.
> There are so many models, and so many new ones being released all the time, that I have a hard time knowing which ones to prioritize testing anecdotally
Just pick one and use it. The ones you’ve heard of (if you are not obsessively refreshing AI model rankings pages) are basically the same.
I’m sure I’ll get a ton of pushback that the one somebody loves is obviously so much better than the other one, but whatever.
Just give me OpenAI’s most popular model, their fastest model, and their newest model. I’ll pick among those 3 based on what I’m prioritizing in the moment (speed, depth, everyday use).
For me it's the opposite - we don't get enough models to test. In the last 6 months, we got Claude 3.7, OpenAI o1, Grok 3, Gemini 2.5 Pro, and OpenAI o3. That's it - 5 frontier models. Not that hard to test each one of them manually, which I did for many hours and with many different tasks. o1 --> o3 and 2.5 Pro are the ones I'm using the most.
I couldn't care less about benchmarks - I know what these models are capable of from personal experience.
> There are so many models, and so many new ones being released all the time, that I have a hard time knowing which ones to prioritize testing anecdotally
Just pick one and use it. The ones you’ve heard of (if you are not obsessively refreshing AI model rankings pages) are basically the same.
I’m sure I’ll get a ton of pushback that the one somebody loves is obviously so much better than the other one, but whatever.
Just give me OpenAI’s most popular model, their fastest model, and their newest model. I’ll pick among those 3 based on what I’m prioritizing in the moment (speed, depth, everyday use).