Hey HN, I'm Adina, Stefan's co-founder at superglue. When we started working on LLM-powered integrations about a year ago, the models were barely good enough to handle simple mappings. We started benchmarking our performance as an internal evals project and thought it would be fun to open source it, to create more transparency around LLM performance. Our goal here is to understand how we can make agents production-ready and improve reliability across the board.
Love the benchmarks. Is it better to use a single LLM for performance, or would you always advise adding a self-reflection step?
Self-reflection is very important for both humans and LLMs, indeed.
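To make that concrete, here is a minimal sketch of a single reflection pass. `call_llm` is a hypothetical placeholder for whatever completion client you use, not superglue's actual pipeline:

```python
# Minimal sketch of a generate -> critique -> revise loop.
# call_llm is a hypothetical placeholder, not superglue's API.

def call_llm(prompt: str) -> str:
    """Stand-in for your chat-completion client."""
    raise NotImplementedError

def generate_with_reflection(task: str) -> str:
    # First pass: produce a candidate answer.
    draft = call_llm(f"Solve the following integration task:\n{task}")

    # Reflection pass: ask the model to critique its own output.
    critique = call_llm(
        "Review the draft below for mistakes (wrong endpoints, bad field "
        f"mappings, malformed payloads).\nTask: {task}\nDraft:\n{draft}\n"
        "List concrete problems, or reply OK if there are none."
    )

    # Keep the draft if the critique is clean; otherwise revise once.
    if critique.strip().upper().startswith("OK"):
        return draft
    return call_llm(
        f"Task: {task}\nDraft:\n{draft}\nCritique:\n{critique}\n"
        "Produce a corrected version."
    )
```

The reflection adds at least one extra call per task, so it is worth measuring whether it actually moves the success rate before making it the default.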
Really appreciate you sharing this. What I'm trying to use is o3, so I'd be curious to see it in the benchmarks. Still, seeing the raw traces tells me the tooling is starting to cross the “actually usable” line and makes me want to try it on my examples this weekend. Looking forward to the MCP benchmark as well.
Awesome work Stefan, this is super insightful! Really appreciate the transparency and open-sourcing the benchmark. The 68% success rate is a wake-up call for anyone building with LLMs. Your 91% integration layer result is impressive and shows that tooling matters. Excited to see what you uncover next with MCP!
I would expect most developers to fail at this challenge. Here’s the doc - you’ve got one chance to get the API to do this.
I can’t tell from the description if the LLMs are allowed to try and then correct based on any errors received.
Though it would be surprising if that helped. Most APIs don’t tell you what you’ve done wrong…
We would have assumed the LLMs would be much better at writing working code, since these aren't random APIs but well-established API patterns they should be able to one-shot (e.g. Stripe). Bad error messages are a problem indeed. We will release another benchmark run with retries very soon.
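To make "retries" concrete: the idea is to feed the raw API error back into the next attempt. A rough sketch, with `build_request` and `execute_request` as hypothetical placeholders rather than our actual implementation:

```python
# Sketch of a retry loop that feeds API error responses back to the model.
# build_request and execute_request are hypothetical placeholders.

from typing import Optional

def build_request(task: str, feedback: Optional[str] = None) -> dict:
    """LLM call that emits an HTTP request spec (method, url, headers, body)."""
    raise NotImplementedError

def execute_request(request: dict) -> tuple[int, str]:
    """Send the request and return (status_code, response_body)."""
    raise NotImplementedError

def run_with_retries(task: str, max_attempts: int = 3) -> Optional[dict]:
    feedback = None
    for _ in range(max_attempts):
        request = build_request(task, feedback)
        status, body = execute_request(request)
        if 200 <= status < 300:
            return request  # the spec that actually worked
        # Pass the raw error along; as noted above, this only helps
        # if the API returns something more useful than a bare 400.
        feedback = f"Previous attempt failed with HTTP {status}: {body}"
    return None
```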
Thanks for the self-host option. I tried the Slack example and was very impressed with the results, thank you!
Fascinating read. Interesting how Opus performs worse than Sonnet.
Quite interesting actually, not sure why; I assume it just overthinks. What surprised me even more is how badly o4-mini performed, after taking up hours of evaluation time and more credits than all other LLMs combined. More thinking != better (integration) coding performance.
Exciting benchmarks, great work Adina and Stefan!
What's the hello world of superglue?
Very interesting! Curious to see the benchmarks for MCP!