Ask HN: How do you monitor and retry failed webhooks in production?

4 points | by GoatPerfect 10 hours ago

6 comments

blundergoat 10 hours ago

We treat webhooks as at-least-once delivery over an unreliable transport and design for duplicates and out-of-order events.
A few rules that have saved us:
- Persist before responding. Never process inline. Write payload to DB, return 200 fast.
- Idempotency key required. Use either the provider's event ID or a hash of the payload.
- Async worker processes from queue. Exponential backoff + max attempts.
- Dead letter queue + dashboard. Humans need visibility.
- Alert on backlog growth, not single failures. One failure is noise. A growing retry queue is signal.
- Relying on provider retries alone has bitten us more than once.
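
A minimal sketch of the rules above, not anyone's production code. It assumes SQLite as the store; the table `webhook_events` and the helpers `idempotency_key`, `receive`, `backoff_seconds`, and `backlog_is_growing` are hypothetical names chosen for illustration, and the backoff base and cap are arbitrary. The HTTP handler would call `receive` and return 200 whether or not the event was new.

```python
import hashlib
import sqlite3
from typing import List, Optional


def idempotency_key(payload: bytes, event_id: Optional[str] = None) -> str:
    """Prefer the provider's event ID; otherwise hash the raw body."""
    return event_id or hashlib.sha256(payload).hexdigest()


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Create the (hypothetical) event table if it does not exist yet."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS webhook_events ("
        " event_key TEXT PRIMARY KEY,"
        " payload BLOB NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'pending',"
        " attempts INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def receive(conn: sqlite3.Connection, payload: bytes,
            event_id: Optional[str] = None) -> bool:
    """Persist the event before responding; return True only if it was new."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO webhook_events (event_key, payload) VALUES (?, ?)",
        (idempotency_key(payload, event_id), payload),
    )
    conn.commit()
    # The primary key makes a duplicate delivery a no-op (rowcount == 0).
    return cur.rowcount == 1


def backoff_seconds(attempt: int, base: float = 2.0, cap: float = 3600.0) -> float:
    """Delay before retry number `attempt` (1-based), doubling up to a cap."""
    return min(cap, base * 2 ** (attempt - 1))


def backlog_is_growing(samples: List[int], min_samples: int = 3) -> bool:
    """True when the pending count rose on each of the last few checks."""
    recent = samples[-min_samples:]
    return len(recent) == min_samples and all(
        a < b for a, b in zip(recent, recent[1:])
    )
```

A worker would pick up rows with `status = 'pending'`, wait `backoff_seconds(attempts)` between tries, and mark a row as dead-lettered once it hits the max attempt count. Adding random jitter to the delay is a common refinement that this sketch leaves out.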

- GoatPerfect 9 hours ago
  
  Thank you so much for the tips! I was feeling nervous about relying on provider retries as well. I especially like the idea of alerting on backlog growth. There's nothing I hate more than a bunch of emails and notifications!
  
  - chickensong 9 hours ago
    
    This was a nice goat exchange
JacobArthurs 9 hours ago

We receive the webhook, return 200 immediately, and push the payload to a message queue for processing. That way you own the retry logic, can inspect stuck messages, and DLQ alerts handle repeated failures automatically.
Idempotency becomes your responsibility, though, since messages can be delivered more than once.
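
Roughly, the consumer side could look like the sketch below. It uses an in-memory `deque` as a stand-in for a real message queue, and `drain`, the `id` and `attempts` fields, and `max_attempts` are hypothetical names for illustration. In production the `seen` set would live in a durable store, and requeued messages would be delayed rather than retried immediately.

```python
from collections import deque
from typing import Callable, List, Set, Tuple


def drain(queue: deque, handler: Callable[[dict], None],
          max_attempts: int = 3) -> Tuple[Set[str], List[dict]]:
    """Handle each message ID at most once; retry failures, then dead-letter."""
    seen: Set[str] = set()          # IDs that were processed successfully
    dead_letters: List[dict] = []   # messages that exhausted their attempts
    while queue:
        msg = queue.popleft()
        if msg["id"] in seen:
            continue  # duplicate delivery: already processed, skip it
        try:
            handler(msg)
            seen.add(msg["id"])
        except Exception:
            msg["attempts"] = msg.get("attempts", 0) + 1
            if msg["attempts"] >= max_attempts:
                dead_letters.append(msg)  # park it for a human to inspect
            else:
                queue.append(msg)  # requeue; a real worker would back off first
    return seen, dead_letters
```

Calling `drain(deque(messages), handler)` returns the set of processed IDs plus the dead-lettered messages, and the size of that second list is what a DLQ alert would watch.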
toomuchtodo 10 hours ago

Have you checked out https://svix.com? No affiliation, I just like the product. Might also check out https://www.standardwebhooks.com/

- GoatPerfect 9 hours ago
  
  I just checked them out! Looks like it would make handling failures a breeze!