Hey HN, I'm Adina, Stefan's co-founder at superglue. When we started working on LLM-powered integrations about a year ago, the models were barely good enough to handle simple mappings. We started benchmarking our performance as an internal evals project and thought it would be fun to open source it, to create more transparency around LLM performance. Our goal here is to understand how we can make agents production-ready and improve reliability across the board.
Love the benchmarks. Is it better to use a single LLM for performance, or would you always advise adding a self-reflection step?
Self-reflection is very important for both humans and LLMs, indeed.
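To make that concrete, here is a minimal sketch of a single reflection pass. `call_llm` is a hypothetical placeholder for whatever completion client you use, not superglue's actual pipeline:

```python
# Minimal sketch of a generate -> critique -> revise loop.
# call_llm is a hypothetical placeholder, not superglue's API.

def call_llm(prompt: str) -> str:
    """Stand-in for your chat-completion client."""
    raise NotImplementedError

def generate_with_reflection(task: str) -> str:
    # First pass: produce a candidate answer.
    draft = call_llm(f"Solve the following integration task:\n{task}")

    # Reflection pass: ask the model to critique its own output.
    critique = call_llm(
        "Review the draft below for mistakes (wrong endpoints, bad field "
        f"mappings, malformed payloads).\nTask: {task}\nDraft:\n{draft}\n"
        "List concrete problems, or reply OK if there are none."
    )

    # Keep the draft if the critique is clean; otherwise revise once.
    if critique.strip().upper().startswith("OK"):
        return draft
    return call_llm(
        f"Task: {task}\nDraft:\n{draft}\nCritique:\n{critique}\n"
        "Produce a corrected version."
    )
```

The reflection adds at least one extra call per task, so it is worth measuring whether it actually moves the success rate before making it the default.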
Really appreciate you sharing this. What I'm trying to use is o3, so I'd be curious to see it in the benchmarks. Still, seeing the raw traces tells me the tooling is starting to cross the “actually usable” line and makes me want to try it on my examples this weekend. Looking forward to the MCP benchmark as well.
Awesome work Stefan, this is super insightful! Really appreciate the transparency and open-sourcing the benchmark. The 68% success rate is a wake-up call for anyone building with LLMs. Your 91% integration layer result is impressive and shows that tooling matters. Excited to see what you uncover next with MCP!
I would expect most developers to fail at this challenge. Here’s the doc - you’ve got one chance to get the API to do this.
I can’t tell from the description if the LLMs are allowed to try and then correct based on any errors received.
Though it would be surprising if that helped. Most APIs don’t tell you what you’ve done wrong…
We would have assumed the LLMs would be much better at writing working code, since these aren't random APIs but well-established API patterns they should be able to one-shot (e.g. Stripe). Bad error messages are a problem indeed. We will release another benchmark run with retries very soon.
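To make "retries" concrete: the idea is to feed the raw API error back into the next attempt. A rough sketch, with `build_request` and `execute_request` as hypothetical placeholders rather than our actual implementation:

```python
# Sketch of a retry loop that feeds API error responses back to the model.
# build_request and execute_request are hypothetical placeholders.

from typing import Optional

def build_request(task: str, feedback: Optional[str] = None) -> dict:
    """LLM call that emits an HTTP request spec (method, url, headers, body)."""
    raise NotImplementedError

def execute_request(request: dict) -> tuple[int, str]:
    """Send the request and return (status_code, response_body)."""
    raise NotImplementedError

def run_with_retries(task: str, max_attempts: int = 3) -> Optional[dict]:
    feedback = None
    for _ in range(max_attempts):
        request = build_request(task, feedback)
        status, body = execute_request(request)
        if 200 <= status < 300:
            return request  # the spec that actually worked
        # Pass the raw error along; as noted above, this only helps
        # if the API returns something more useful than a bare 400.
        feedback = f"Previous attempt failed with HTTP {status}: {body}"
    return None
```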
Thanks for the self-host option. I tried the Slack example and was very impressed with the results, thank you!
Fascinating read. Interesting how Opus performs worse than Sonnet.
Quite interesting actually, not sure why; I assume it just overthinks. What surprised me even more is how badly o4-mini performed, after taking up hours of evaluation time and more credits than all other LLMs combined. More thinking != better (integration) coding performance.
Exciting benchmarks, great work Adina and Stefan!
What's the hello world of superglue?
Very interesting! Curious to see the benchmarks for MCP!