Message Broker & Queue Design
Introduction
Message brokers decouple producers and consumers, enabling asynchronous, resilient, and scalable communication. Designing them well requires clarity on delivery guarantees, ordering, failure handling, and consumer behavior.
Core Concepts
- Dead Letter Queue (DLQ): A quarantine queue for messages that keep failing after retries, isolating poison messages and enabling targeted debugging.
- Idempotency: Consumers must safely handle duplicate deliveries (common under at-least-once semantics) using idempotency keys or de-duplication stores; see the consumer sketch after this list.
- Delivery semantics:
  - At-least-once (the common default): May deliver duplicates; requires idempotent consumers.
  - At-most-once: Drops messages on failure; simplest, but risks loss.
  - Exactly-once: Complex; typically achieved via transactional producers plus transactional processing (e.g., Kafka transactions with Kafka Streams/Flink).
- Consumer groups: Multiple consumers share a subscription; each message/partition is handled by one consumer in the group, providing horizontal scale and load balancing.
- Ordering & partitioning: Partitions provide parallelism. Use stable keys (e.g., user_id) to preserve per-key ordering; see the keying sketch after this list. Global ordering is expensive; design for per-key ordering where possible.
- Offsets & acknowledgments: Durable offsets (Kafka) or explicit acks (RabbitMQ/Pulsar) control progress and replay.
- Backpressure: Use max in-flight limits, prefetch/window tuning, and rate limiting to prevent overload.
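The idempotency and at-least-once points above can be made concrete with a small, broker-agnostic sketch. The SQLite table, the handle_once helper, and the process callback are illustrative names rather than any broker's API; the trick is that a primary-key insert in the same transaction as the side effect rejects duplicates on redelivery.

```python
import sqlite3

# De-duplication store: a primary-key insert either claims a message ID or
# fails if that ID was already processed, turning at-least-once delivery
# into effectively-once processing.
db = sqlite3.connect("inbox.db")
db.execute("CREATE TABLE IF NOT EXISTS processed (message_id TEXT PRIMARY KEY)")

def handle_once(message_id: str, payload: bytes, process) -> bool:
    """Run process(payload) at most once per message_id; False on duplicates."""
    try:
        with db:  # transaction: the marker insert and the work commit together
            db.execute("INSERT INTO processed (message_id) VALUES (?)",
                       (message_id,))
            process(payload)  # if this raises, the marker rolls back too
        return True
    except sqlite3.IntegrityError:
        return False  # duplicate delivery: already handled, safe to ack and skip
```

Note that if the side effect lives outside the database (an HTTP call, say), the marker only narrows the duplicate window; true exactly-once needs the effect and the marker in one transactional store.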
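Per-key ordering likewise falls out of a stable key-to-partition mapping. The toy hash below is not Kafka's actual partitioner (which applies murmur2 to the key bytes), but it shows the invariant that matters: the same key always lands on the same partition, so events for one entity stay in order while different keys fan out for parallelism.

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    """Map a stable key (e.g., a user_id) to a partition deterministically."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Every event keyed by "user-42" hits the same partition and stays ordered;
# other users spread across the remaining partitions.
assert partition_for("user-42", 12) == partition_for("user-42", 12)
```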
Common Patterns in Industry
- Outbox Pattern: Write domain changes and outbound events in the same transactional store; a relay publishes to the broker, avoiding dual-write inconsistencies. Pair with CDC or a background dispatcher; see the outbox sketch after this list.
- Inbox / De-duplication: Store processed message IDs (or hashes) to make consumers idempotent under retries and replays.
- Retry with Backoff + DLQ: Configure bounded retries with exponential backoff and jitter; route exhausted messages to a DLQ for inspection (see the retry sketch after this list). Consider a parking-lot queue for manual reprocessing.
- Transactional Producer / Exactly-Once Processing: Use transactional producers and transactional sinks (e.g., Kafka transactions + Streams/Flink) when duplicate effects are unacceptable (payments, ledgers).
- Fan-out & Routing: Use topics with multiple consumer groups for pub/sub, or routing keys/exchanges (RabbitMQ) for selective delivery.
- Saga / Orchestration vs Choreography: Coordinate multi-step workflows via orchestrators or event-driven choreography; use compensating actions instead of cross-service distributed transactions.
- Schema Governance: Enforce compatibility with a schema registry (Avro/Protobuf/JSON Schema) so producers and consumers can evolve without breaking each other.
- Log Compaction & Retention: For Kafka-like logs, use compaction for latest-value streams and time/size-based retention for history.
- Observability: Emit structured logs, metrics (lag, throughput, error rate, DLQ depth), and traces around produce/consume paths.
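A minimal outbox sketch, assuming a SQLite store and a publish(topic, payload) callback standing in for the broker client (both are hypothetical names). The essential move is committing the domain row and the event row atomically, with a relay publishing afterwards; the relay itself is at-least-once, so pair it with the inbox pattern above.

```python
import json
import sqlite3

db = sqlite3.connect("app.db")
db.executescript("""
CREATE TABLE IF NOT EXISTS orders (id TEXT PRIMARY KEY, status TEXT);
CREATE TABLE IF NOT EXISTS outbox (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    topic TEXT, payload TEXT, published INTEGER DEFAULT 0);
""")

def place_order(order_id: str) -> None:
    # Domain write and outbound event commit in one transaction: no dual-write gap.
    with db:
        db.execute("INSERT INTO orders VALUES (?, 'placed')", (order_id,))
        db.execute("INSERT INTO outbox (topic, payload) VALUES (?, ?)",
                   ("orders", json.dumps({"order_id": order_id, "status": "placed"})))

def relay(publish) -> None:
    # Background dispatcher: publish pending rows, then mark them sent.
    rows = db.execute("SELECT id, topic, payload FROM outbox "
                      "WHERE published = 0 ORDER BY id").fetchall()
    for row_id, topic, payload in rows:
        publish(topic, payload)  # a crash here re-publishes next run: at-least-once
        with db:
            db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
```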
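Retry with backoff and DLQ routing, sketched as a generic wrapper; process and send_to_dlq are assumed callbacks, and the base delay, cap, and attempt limit are tunables rather than recommendations.

```python
import random
import time

MAX_ATTEMPTS = 5

def consume_with_retry(message, process, send_to_dlq) -> None:
    """Bounded retries with exponential backoff and full jitter; exhausted
    messages route to the DLQ instead of blocking the queue."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            process(message)
            return
        except Exception as exc:
            if attempt == MAX_ATTEMPTS:
                send_to_dlq(message, reason=str(exc))  # park the poison message
                return
            # full jitter: sleep uniformly in [0, base * 2^attempt], capped at 30s
            time.sleep(random.uniform(0, min(30.0, 0.5 * 2 ** attempt)))
```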
Design Considerations
- Throughput & Latency: Choose partition count, batching, compression, and acks (e.g., acks=all) based on durability vs latency needs; see the producer config sketch after this list.
- Ordering vs Parallelism: More partitions increase parallelism, but ordering holds only within a partition; pick keys that match access patterns.
- Storage & Retention: Plan for retention size, compaction, and tiered storage if available.
- Security: TLS in transit, authentication and authorization (SASL, OAuth, or mTLS client certificates), and per-topic/queue ACLs.
- Failure Modes: Plan for consumer restarts, redelivery, offset rewind, poison pills, and broker failover.
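As one concrete instance of the durability-versus-latency dials, a Kafka producer configuration might look like the sketch below. The option names are librdkafka-style keys as accepted by the confluent-kafka Python client; the broker address and the values are assumptions to benchmark against your own workload, not universal answers.

```python
from confluent_kafka import Producer  # requires the confluent-kafka package

config = {
    "bootstrap.servers": "broker:9092",  # placeholder address
    "acks": "all",               # wait for all in-sync replicas: durability first
    "enable.idempotence": True,  # broker de-duplicates producer retries
    "compression.type": "lz4",   # fewer bytes on the wire and on disk
    "linger.ms": 5,              # small batching window: throughput vs latency
}
producer = Producer(config)
```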
When to Choose Which
- High-throughput event streaming & replay: Kafka, Pulsar.
- Work queues with flexible routing and per-message acks: RabbitMQ.
- Cloud-managed simplicity: Amazon SQS/SNS, Google Cloud Pub/Sub, Azure Event Hubs.
Quick Checklist
- Define the delivery guarantee: at-most-once, at-least-once, or exactly-once.
- Ensure idempotent consumers; add de-dup storage if needed.
- Configure retries with backoff and DLQ.
- Pick a partition key that preserves the ordering you need.
- Enforce schemas and compatibility.
- Monitor lag, DLQ depth, error rates, and throughput.