PgQue: Zero-Bloat Postgres Queue

(github.com)

80 points | by gmcabrita 8 hours ago

9 comments

  • mind-blight 36 minutes ago

    The vacuum pressure is real. Using a system with the SKIP LOCKED technique + polling caused massive DB perf issues as the queue depth grew. The query to see the current jobs in the queue ended up being the main performance bottleneck, which caused slower throughput, which caused a larger queue depth, and so on.
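
    For context, the SKIP LOCKED pattern under discussion typically looks something like this (table and column names are illustrative, not from any particular library):

    ```sql
    -- Claim a batch of pending jobs; rows already locked by other
    -- workers are skipped rather than waited on.
    BEGIN;

    WITH claimed AS (
      SELECT id
      FROM jobs
      WHERE status = 'pending'
      ORDER BY id
      LIMIT 10
      FOR UPDATE SKIP LOCKED
    )
    UPDATE jobs
    SET status = 'running'
    FROM claimed
    WHERE jobs.id = claimed.id
    RETURNING jobs.id, jobs.payload;

    COMMIT;
    ```

    Every such UPDATE (and the later UPDATE or DELETE on completion) leaves a dead tuple behind, which is where the vacuum pressure described above comes from.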

    Scaling the workers sometimes exacerbates the problem because you run into connection limits or polling hammering the DB.

    I love the idea of pg as a queue, but I'm more skeptical of it after dealing with it in production.

  • saberd 3 hours ago

    I don't understand the latency graph. It says it has 0.25ms consumer latency.

    Then in the latency tradeoff section it says end-to-end latency is between 1-2 seconds.

    Is this under heavy load or always? How does this compare to pgmq end to end latency?

    • samokhvalov 2 hours ago

      (PgQue author here)

      I didn't understand these nuances in the beginning myself.

      We have 3 kinds of latencies when dealing with event messages:

      1. producer latency – how long does it take to insert an event message?

      2. subscriber latency – how long does it take to get a message? (or a batch of all new messages, like in this case)

      3. end-to-end event delivery time – how long does it take for a message to go from producer to consumer?

      In the case of PgQ/PgQue, the 3rd one is limited by "tick" frequency – by default, once per second (I'm thinking about how to simplify more frequent configs; pg_cron is limited to 1/s).

      1 and 2, meanwhile, are both sub-ms for PgQue. Consumers just don't see fresh messages until the next tick happens; the consuming queries themselves are fast.
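
      The tick mechanism can be sketched roughly like this (table and column names are hypothetical, not PgQue's actual schema):

      ```sql
      -- A periodic "tick" (e.g. run once per second via pg_cron)
      -- records a high-water mark over the event table:
      INSERT INTO ticks (last_event_id)
      SELECT coalesce(max(id), 0) FROM events;

      -- Consumers only read up to the latest tick. Producing and
      -- consuming are both sub-ms, but a fresh event waits up to one
      -- tick interval (half of it on average) before becoming visible:
      SELECT id, payload
      FROM events
      WHERE id > 500  -- this consumer's stored position (illustrative)
        AND id <= (SELECT max(last_event_id) FROM ticks);
      ```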

      Hope this helps. Thanks for the question. Will add this to the README.

  • odie5533 3 hours ago

    Postgres durability without having to run Kafka or RabbitMQ clusters seems pretty enticing. May reach for it when I next need an outbox pattern or small fan out.

  • cout 3 hours ago

    I think it's great that projects like this exist where people are building middleware in different ways than others. Still, as someone who routinely uses shared memory queues, the idea of considering a queue built inside a database to be "zero bloat" leaves me scratching my head a bit. I can see why someone would want that, but one person's feature is bloat to someone else.

    • pierrekin 3 hours ago

      In Postgres land bloat refers to dead tuples that are left in place during certain operations and need to be vacuumed later.

      It’s challenging to write a queue that doesn’t create bloat, which is why this project cites it as a feature.
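
      For anyone wanting to see it, dead tuples on a queue table show up directly in the standard statistics view (the table name here is illustrative):

      ```sql
      SELECT relname,
             n_live_tup,
             n_dead_tup,       -- "bloat" waiting to be vacuumed
             last_autovacuum
      FROM pg_stat_user_tables
      WHERE relname = 'jobs';
      ```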

  • halfcat an hour ago

    So if I understand this correctly, there are three main approaches:

    1. SKIP LOCKED family

    2. Partition-based + DROP old partitions (no VACUUM required)

    3. TRUNCATE family (PgQue’s approach)

    And the benefit of PgQue is the failure mode, when a worker gets stuck:

    - Table grows indefinitely, instead of

    - VACUUM-starved death spiral

    And a table growing is easier to reason about operationally?

    • samokhvalov an hour ago

      Taxonomy is correct. But the benefit isn't "table grows indefinitely vs. vacuum-starved death spiral".

      In all three approaches, if the consumer falls behind, events accumulate.

      The real distinction is cost per event under MVCC pressure. Under held xmin (idle-in-transaction, long-running writer, lagging logical slot, physical standby with hot_standby_feedback=on):

      1. SKIP LOCKED systems: every DELETE or UPDATE creates a dead tuple that autovacuum can't reclaim (the xmin horizon is held). Indexes bloat, and subsequent FOR UPDATE SKIP LOCKED scans don't help.

      2. Partition + DROP (some SKIP LOCKED systems already support it, e.g. PGMQ): old partitions drop cleanly, but the active partition is still DELETE-based and accumulates dead tuples — same pathology within the active window, just bounded by retention. Another thing is that DROPping and attaching/detaching partitions is more painful than working with a few existing ones and using TRUNCATE.

      3. PgQue / PgQ: active event table is INSERT-only. Each consumer remembers its own pointer (ID of last event processed) independently. CPU stays flat under xmin pressure.
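
      A sketch of that INSERT-only + pointer pattern (schema and names are illustrative, not PgQue's actual API):

      ```sql
      CREATE TABLE events (
        id      bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        payload jsonb NOT NULL
      );

      CREATE TABLE consumer_positions (
        consumer_name text PRIMARY KEY,
        last_event_id bigint NOT NULL DEFAULT 0
      );

      -- 1) Read a batch past this consumer's pointer; the events table
      --    is never UPDATEd or DELETEd, so it accumulates no dead tuples:
      SELECT id, payload
      FROM events
      WHERE id > (SELECT last_event_id FROM consumer_positions
                  WHERE consumer_name = 'worker-1')
      ORDER BY id
      LIMIT 100;

      -- 2) After processing, advance the pointer -- a single-row UPDATE
      --    on a tiny table, cheap for autovacuum to keep clean:
      UPDATE consumer_positions
      SET last_event_id = 100  -- id of the last processed event (illustrative)
      WHERE consumer_name = 'worker-1';

      -- 3) Once every consumer is past a rotated table's contents,
      --    TRUNCATE reclaims the space instantly, regardless of xmin:
      TRUNCATE events;
      ```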

      I posted a few more benchmark charts on my LinkedIn and Twitter, and plan to post an article explaining all this with examples. Among them is a demo of a 30-minute held-xmin benchmark at 2,000 ev/s: PgQue sustains the full producer rate at ~14% CPU, while SKIP LOCKED queues are pinned at 55-87% CPU with throughput dropping 20-80%; what's even worse, after the xmin horizon gets unblocked, not all of them recover and catch up within the next 30 min.

  • bfivyvysj 2 hours ago

    Cool