E-Commerce Checkout System Design
Introduction
E-commerce checkout systems are among the most critical and challenging components in online retail. They must absorb write-heavy traffic spikes during peak periods such as flash sales, maintain strict data consistency (ACID properties), integrate with unreliable third-party services such as payment gateways, and prevent race conditions that could lead to overselling or double-charging customers.
Core Characteristics
- Write-Intensive Workloads: During flash sales or promotional events, the system experiences sudden spikes in write operations (inventory updates, order creation, payment processing).
- Strict Data Consistency: Requires ACID transactions to prevent overselling, ensure payment integrity, and maintain order accuracy.
- Third-Party Integration Risks: Payment gateways, shipping providers, and fraud detection services can fail or timeout, requiring robust error handling.
- Race Condition Sensitivity: Multiple users competing for limited inventory must be serialized through locking or atomic updates.
- Idempotency Requirements: Network retries and user double-clicks must not result in duplicate charges or orders.
High-Level Architecture
graph TD
Client[Frontend Web/App] -->|Checkout Request| APIGW[API Gateway + Rate Limiter]
subgraph "Core Transaction Services"
APIGW --> OrderSvc[Order Orchestrator Service]
OrderSvc -->|1. Reserve Stock| InventorySvc[Inventory Service]
InventorySvc -->|Lock Row| InventoryDB[(SQL DB - ACID)]
OrderSvc -->|2. Process Payment| PaymentSvc[Payment Service Wrapper]
PaymentSvc -->|Idempotent Request| 3rdPartyPG[External Payment Gateway - Stripe/Midtrans]
OrderSvc -->|3. Create Order Data| OrderDB[(Order SQL DB)]
end
subgraph "Post-Transaction Async"
OrderSvc -->|Order Success Event| MQ[Message Queue - RabbitMQ/SQS]
MQ --> NotifSvc[Notification Service - Email/WA]
MQ --> AnalyticsSvc[Data Warehouse Loader]
MQ --> ShippingSvc[Shipping Service]
end
Architecture Components
API Gateway & Rate Limiting
Purpose: Single entry point that handles authentication, rate limiting, and request routing.
- Rate Limiter: Critical for preventing bot attacks during flash sales. Implement token bucket or sliding window algorithms.
- Request Throttling: Protect backend services from traffic spikes using per-user/IP rate limits.
- Authentication: Validate JWT tokens before routing to backend services.
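The token bucket algorithm mentioned above can be sketched in a few lines. This is a minimal single-process illustration, not a gateway implementation: the class name, the parameters, and the injectable `clock` are all assumptions made for the example, and a real gateway would keep one bucket per user or IP in shared storage.

```python
import time
from typing import Callable


class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `refill_rate` tokens per second."""

    def __init__(self, capacity: float, refill_rate: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.clock = clock            # injectable so the behaviour can be tested
        self.tokens = capacity        # start with a full bucket
        self.last_refill = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and consume `cost` tokens if the request may proceed."""
        now = self.clock()
        elapsed = now - self.last_refill
        # Add the tokens earned since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A bucket with `capacity=2` and `refill_rate=1.0` accepts two back-to-back requests, rejects a third, and accepts one more after a second has passed.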
Order Orchestrator Service
Purpose: Central service that coordinates the multi-step checkout process.
- Orchestration Logic: Manages the sequence of operations (inventory reservation → payment → order creation).
- State Management: Tracks order state transitions (pending → processing → completed or failed).
- Compensation Logic: Handles rollback operations when any step fails.
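The state tracking described above can be expressed as an explicit transition table, which makes illegal jumps easy to reject. The sketch below is an illustration only: the `OrderState` enum and `transition` helper are hypothetical names, and allowing pending → failed (for example when stock reservation fails) is an assumption, not something the design above specifies.

```python
from enum import Enum


class OrderState(Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"


# Which states each state may move to; terminal states map to an empty set.
ALLOWED_TRANSITIONS = {
    OrderState.PENDING: {OrderState.PROCESSING, OrderState.FAILED},
    OrderState.PROCESSING: {OrderState.COMPLETED, OrderState.FAILED},
    OrderState.COMPLETED: set(),
    OrderState.FAILED: set(),
}


def transition(current: OrderState, target: OrderState) -> OrderState:
    """Return the new state, or raise ValueError if the move is not allowed."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```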
Inventory Service
Purpose: Manages product stock levels with strict consistency guarantees.
- SQL Database Requirement: MUST use SQL database (PostgreSQL/MySQL) for ACID transactions and row-level locking.
- Stock Reservation: Temporary holds on inventory during checkout process.
- Locking Strategy:
  - Pessimistic Locking: SELECT ... FOR UPDATE to prevent concurrent modifications.
  - Optimistic Locking: Version fields with conflict detection (better for low contention, since every conflict forces a retry).
Payment Service Wrapper
Purpose: Abstraction layer between internal services and external payment gateways.
- Never Direct Frontend Access: Backend must be the intermediary to record status and handle errors.
- Idempotency Keys: Generate unique transaction IDs to prevent duplicate charges.
- Retry Logic: Implement exponential backoff for transient failures.
- Status Tracking: Maintain payment state (initiated → processing → succeeded/failed).
Message Queue (Post-Transaction)
Purpose: Decouple non-critical operations from the main transaction flow.
- Event Publishing: Emit order success events after transaction commits.
- Downstream Services: Notification, analytics, shipping, and inventory finalization.
- Eventual Consistency: Acceptable for non-critical operations (email notifications, analytics).
Key Design Patterns
1. Saga Pattern for Distributed Transactions
Problem: Traditional ACID transactions don't work across multiple services and external APIs.
Solution: Saga Pattern coordinates distributed transactions through a sequence of local transactions with compensating actions.
Types of Saga:
- Orchestration-Based: Central orchestrator (Order Service) coordinates all steps.
- Pros: Centralized control, easier to understand flow.
- Cons: Single point of failure, tight coupling.
- Choreography-Based: Each service publishes events and reacts to events from others.
- Pros: Loose coupling, better scalability.
- Cons: Harder to debug, distributed control.
Example Flow:
sequenceDiagram
participant Client
participant OrderSvc
participant InventorySvc
participant PaymentSvc
participant PaymentGW
Client->>OrderSvc: Initiate Checkout
OrderSvc->>InventorySvc: Reserve Stock (T1)
InventorySvc-->>OrderSvc: Stock Reserved
OrderSvc->>PaymentSvc: Process Payment (T2)
PaymentSvc->>PaymentGW: Charge Customer
PaymentGW-->>PaymentSvc: Payment Failed
PaymentSvc-->>OrderSvc: Payment Failed
OrderSvc->>InventorySvc: Compensate: Release Stock (C1)
InventorySvc-->>OrderSvc: Stock Released
OrderSvc-->>Client: Checkout Failed
Compensation Actions:
- If payment fails after stock reservation → Release reserved stock.
- If order creation fails after payment → Initiate refund process.
- If inventory update fails after payment → Rollback payment (if possible) or flag for manual review.
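The orchestration flow and its compensation actions can be sketched as a list of steps, each paired with an undo function. This is a simplified in-memory illustration: `run_saga`, the step names, and the stand-in functions are hypothetical, and a production orchestrator would also persist saga state and handle failures inside the compensations themselves.

```python
from typing import Callable, List, Tuple

# (step name, action, compensating action)
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> Tuple[bool, List[str]]:
    """Run steps in order; on failure, undo completed steps in reverse order."""
    log: List[str] = []
    completed: List[Tuple[str, Callable[[], None]]] = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"compensated:{done_name}")
            return False, log
        log.append(f"done:{name}")
        completed.append((name, compensate))
    return True, log


# Stand-ins for the Inventory and Payment services (hypothetical).
stock = {"sku-1": 1}

def reserve_stock() -> None:
    stock["sku-1"] -= 1

def release_stock() -> None:
    stock["sku-1"] += 1

def charge_customer() -> None:
    raise RuntimeError("payment declined")

def refund_customer() -> None:
    pass  # nothing to refund in this example

ok, log = run_saga([
    ("reserve_stock", reserve_stock, release_stock),
    ("charge_payment", charge_customer, refund_customer),
])
```

Here the payment step fails, so the saga releases the reserved stock and reports failure, mirroring the sequence diagram.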
2. Idempotency
Problem: Network retries, user double-clicks, or system failures can cause duplicate payment requests.
Solution: Design all payment APIs to be idempotent using unique transaction keys.
Implementation:
- Generate Idempotency Key: Create unique key per transaction (e.g., user_id + order_id + timestamp_hash). Generate it once per checkout attempt and reuse the same key on every retry.
- Check Before Processing: Query database/Redis to see if this key was already processed.
- Store Result: If processed, return cached result. If not, process and store result.
- TTL: Set expiration on idempotency keys (e.g., 24 hours).
Example:
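import json

# Assumes `redis` is a connected Redis client and `payment_gateway` is the
# gateway wrapper; both are configured elsewhere.
def process_payment(order_id, amount, idempotency_key):
    cache_key = f"payment:{idempotency_key}"
    # Return the stored result if this key was already processed
    cached_result = redis.get(cache_key)
    if cached_result:
        return json.loads(cached_result)
    # Process payment, forwarding the same key to the gateway
    result = payment_gateway.charge(amount, idempotency_key)
    # Cache the result for 24 hours
    redis.setex(cache_key, 86400, json.dumps(result))
    return result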
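One detail the example leaves open is where `idempotency_key` comes from. The key only helps if a retry produces the same value as the original request, so it should be derived from data that is fixed for the checkout attempt. The sketch below is one possible derivation; the function name and the `attempt` counter (incremented only when the user deliberately starts a new payment attempt) are assumptions for illustration.

```python
import hashlib


def make_idempotency_key(user_id: int, order_id: int, attempt: int = 1) -> str:
    """Derive a stable key: identical inputs always yield the identical key."""
    raw = f"{user_id}:{order_id}:{attempt}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()
```

Network retries and double-clicks reuse the same `attempt` value and therefore the same key, while a deliberate second attempt after a declined card gets a new one.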
3. Handling Race Conditions
Problem: 1000 users trying to buy the last item simultaneously can cause overselling.
Solutions:
Pessimistic Locking (Database Level)
BEGIN; -- START TRANSACTION also works in both MySQL and PostgreSQL
SELECT quantity FROM inventory
WHERE product_id = ?
FOR UPDATE; -- Locks the row
UPDATE inventory
SET quantity = quantity - 1
WHERE product_id = ? AND quantity > 0;
COMMIT;
- Pros: Serializes access to the locked row, so concurrent buyers cannot oversell it.
- Cons: Can cause deadlocks, reduces concurrency.
Optimistic Locking (Version Field)
UPDATE inventory
SET quantity = quantity - 1, version = version + 1
WHERE product_id = ?
AND version = ? -- Expected version
AND quantity > 0;
- Pros: Better concurrency, no deadlocks.
- Cons: Requires retry logic on version conflicts.
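The retry logic that optimistic locking requires can be sketched as a read, conditional update, and retry loop. The example uses SQLite from the Python standard library purely so it runs without a server; the `reserve_one` helper and the simplified table are assumptions, and the same pattern applies to PostgreSQL or MySQL.

```python
import sqlite3


def reserve_one(conn: sqlite3.Connection, product_id: int, max_attempts: int = 3) -> bool:
    """Decrement stock by one using a version check; retry on conflicts."""
    for _ in range(max_attempts):
        row = conn.execute(
            "SELECT quantity, version FROM inventory WHERE product_id = ?",
            (product_id,),
        ).fetchone()
        if row is None or row[0] <= 0:
            return False  # unknown product or sold out
        cur = conn.execute(
            "UPDATE inventory SET quantity = quantity - 1, version = version + 1 "
            "WHERE product_id = ? AND version = ? AND quantity > 0",
            (product_id, row[1]),
        )
        conn.commit()
        if cur.rowcount == 1:
            return True  # our expected version was still current
        # rowcount == 0: another writer changed the row first, so re-read and retry
    return False


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inventory ("
    "product_id INTEGER PRIMARY KEY, "
    "quantity INTEGER NOT NULL, "
    "version INTEGER NOT NULL DEFAULT 0)"
)
conn.execute("INSERT INTO inventory (product_id, quantity, version) VALUES (1, 1, 0)")
conn.commit()

first = reserve_one(conn, 1)   # takes the last unit
second = reserve_one(conn, 1)  # finds the product sold out
```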
Redis-Based Inventory (High-Performance)
For extremely high-traffic scenarios (flash sales):
-- Lua script for atomic decrement
local current = redis.call('GET', KEYS[1])
if current and tonumber(current) > 0 then
  return redis.call('DECR', KEYS[1])
else
  return -1
end
- Pros: Very fast; the check and decrement run atomically in memory, avoiding row-lock contention in the database.
- Cons: Requires synchronization with persistent database, eventual consistency.
Database Design Considerations
Inventory Database Schema
CREATE TABLE inventory (
product_id BIGINT PRIMARY KEY,
quantity INT NOT NULL CHECK (quantity >= 0),
reserved_quantity INT DEFAULT 0,
version INT DEFAULT 0, -- For optimistic locking
updated_at TIMESTAMP DEFAULT NOW()
);
-- product_id is the primary key and is already indexed, so no separate index is needed.
Order Database Schema
CREATE TABLE orders (
order_id BIGINT PRIMARY KEY AUTO_INCREMENT,
user_id BIGINT NOT NULL,
status ENUM('pending', 'processing', 'completed', 'failed', 'cancelled'),
total_amount DECIMAL(10, 2) NOT NULL,
payment_id VARCHAR(255) UNIQUE, -- Idempotency key
created_at TIMESTAMP DEFAULT NOW(),
updated_at TIMESTAMP DEFAULT NOW()
);
CREATE TABLE order_items (
order_item_id BIGINT PRIMARY KEY AUTO_INCREMENT,
order_id BIGINT NOT NULL,
product_id BIGINT NOT NULL,
quantity INT NOT NULL,
price DECIMAL(10, 2) NOT NULL,
FOREIGN KEY (order_id) REFERENCES orders(order_id)
);
CREATE INDEX idx_user_id ON orders(user_id);
-- payment_id is already indexed by its UNIQUE constraint, so no separate index is needed.
Payment Status Tracking
CREATE TABLE payment_transactions (
transaction_id BIGINT PRIMARY KEY AUTO_INCREMENT,
order_id BIGINT NOT NULL,
idempotency_key VARCHAR(255) UNIQUE NOT NULL,
amount DECIMAL(10, 2) NOT NULL,
status ENUM('initiated', 'processing', 'succeeded', 'failed', 'refunded'),
gateway_response JSON,
created_at TIMESTAMP DEFAULT NOW(),
updated_at TIMESTAMP DEFAULT NOW(),
FOREIGN KEY (order_id) REFERENCES orders(order_id)
);
-- idempotency_key is already indexed by its UNIQUE constraint, so no separate index is needed.
Payment Integration Patterns
Payment Gateway Wrapper Service
Best Practices:
- Never Expose Gateway Credentials: Frontend should never directly call payment gateways.
- Status Polling: For asynchronous payment gateways, implement webhook handlers and polling fallback.
- Timeout Handling: Set reasonable timeouts (e.g., 30 seconds) and implement retries with exponential backoff.
- Idempotent Requests: Always include idempotency keys in gateway requests.
- Webhook Security: Verify webhook signatures to prevent fraud.
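Webhook signature verification usually comes down to recomputing an HMAC over the raw request body and comparing it in constant time. The sketch below shows that generic idea with HMAC-SHA256; the function names are hypothetical, and each real gateway defines its own header names and signing scheme (some also sign a timestamp), so the gateway's documentation takes precedence.

```python
import hashlib
import hmac


def sign_payload(secret: bytes, payload: bytes) -> str:
    """Compute the hex HMAC-SHA256 signature of the raw webhook body."""
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()


def verify_webhook(secret: bytes, payload: bytes, received_signature: str) -> bool:
    """Return True only if the signature matches the payload."""
    expected = sign_payload(secret, payload)
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(expected.encode("utf-8"),
                               received_signature.encode("utf-8"))
```

The check has to run against the exact bytes received, before any JSON parsing or re-serialization, because re-encoding can change the payload and invalidate the signature.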
Payment Flow
sequenceDiagram
participant Client
participant OrderSvc
participant PaymentSvc
participant PaymentGW
participant Webhook
Client->>OrderSvc: Initiate Payment
OrderSvc->>PaymentSvc: Process Payment (with idempotency_key)
PaymentSvc->>PaymentGW: Create Charge Request
PaymentGW-->>PaymentSvc: Payment Initiated (pending)
PaymentSvc-->>OrderSvc: Payment Processing
OrderSvc-->>Client: Payment Initiated
PaymentGW->>Webhook: Payment Status Update
Webhook->>PaymentSvc: Update Status
PaymentSvc->>OrderSvc: Payment Completed
OrderSvc->>Client: Order Confirmed
Scalability & Performance
Caching Strategy
- Product Catalog: Cache product details in Redis (TTL: 5 minutes).
- Inventory Counts: Use Redis for real-time inventory during flash sales, sync with DB periodically.
- User Sessions: Store cart data in Redis with session expiration.
Database Optimization
- Connection Pooling: Use connection pools (e.g., 20-50 connections per service instance).
- Read Replicas: Route read queries to replicas, writes to primary.
- Partitioning: Partition orders table by date or user_id for large-scale systems.
- Indexing: Strategic indexes on frequently queried columns (user_id, payment_id, status).
Load Balancing
- Horizontal Scaling: Scale Order Service and Inventory Service independently based on load.
- Database Sharding: For very large scale, shard by user_id or product_id.
- CDN: Serve static assets (product images) via CDN.
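Sharding by user_id needs a routing function that always sends the same user to the same shard. A minimal sketch, assuming a fixed shard count and a hypothetical `shard_for` helper, is shown below; it uses a stable hash rather than Python's built-in `hash()`, which is not guaranteed to be consistent across processes for all types.

```python
import hashlib


def shard_for(user_id: int, shard_count: int) -> int:
    """Map a user_id to a shard index in the range [0, shard_count)."""
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % shard_count
```

Simple modulo routing like this reshuffles most keys when `shard_count` changes, so resharding needs a migration plan or a scheme such as consistent hashing.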
Failure Handling
Circuit Breaker Pattern
Implement circuit breakers for payment gateway calls:
- Closed: Normal operation, requests pass through.
- Open: After X consecutive failures, stop sending requests and fail fast with an error instead of calling the gateway.
- Half-Open: After a timeout, allow one trial request. If it succeeds, close the circuit; if it fails, reopen it.
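The three states can be sketched as a small class wrapped around the gateway call. This is a single-threaded illustration with hypothetical names (`CircuitBreaker`, `CircuitOpenError`) and assumed default thresholds; a production breaker would also need locking so that only one trial request passes in the half-open state.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def call(self, func: Callable[[], T]) -> T:
        if self.state == "open":
            raise CircuitOpenError("circuit is open; failing fast")
        try:
            result = func()
        except Exception:
            self.failures += 1
            # A failed trial request, or too many consecutive failures, (re)opens the circuit.
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```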
Retry Strategy
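import random
import time

class TransientError(Exception):
    """Raised for failures that are safe to retry (e.g. timeouts)."""

def retry_with_backoff(func, max_retries=3):
    for attempt in range(max_retries):
        try:
            return func()
        except TransientError:
            if attempt == max_retries - 1:
                raise  # Out of attempts: surface the last error
            # Exponential backoff (1s, 2s, 4s, ...) plus up to 1s of jitter
            wait_time = (2 ** attempt) + random.uniform(0, 1)
            time.sleep(wait_time)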
Dead Letter Queue (DLQ)
For async operations (notifications, analytics):
- Route failed messages to DLQ after max retries.
- Monitor DLQ depth and alert on threshold.
- Implement manual reprocessing mechanism.
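The routing rule above can be illustrated with an in-memory consumer loop. This is only a model of the behaviour: the `consume` function and the message shapes are hypothetical, and brokers such as RabbitMQ or SQS provide dead-lettering as a configuration feature rather than application code.

```python
from collections import deque
from typing import Callable, Dict, List


def consume(messages: List[str], handler: Callable[[str], None],
            max_retries: int = 3) -> List[Dict[str, object]]:
    """Process messages; after max_retries failed attempts, move one to the DLQ."""
    queue = deque((msg, 0) for msg in messages)
    dead_letters: List[Dict[str, object]] = []
    while queue:
        msg, attempts = queue.popleft()
        try:
            handler(msg)
        except Exception as exc:
            attempts += 1
            if attempts >= max_retries:
                # Keep the error alongside the message to support manual reprocessing.
                dead_letters.append({"message": msg, "error": str(exc), "attempts": attempts})
            else:
                queue.append((msg, attempts))  # requeue for another attempt
    return dead_letters


def send_notification(msg: str) -> None:
    """Stand-in handler that always fails for one malformed message."""
    if msg == "bad":
        raise ValueError("malformed payload")


dead = consume(["ok", "bad"], send_notification, max_retries=3)
```

The length of the returned list corresponds to the DLQ depth that the monitoring bullet refers to.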
Security Considerations
- PCI DSS Compliance: Never store raw credit card numbers. Use tokenization.
- Input Validation: Validate all inputs (quantities, amounts) to prevent negative values or injection attacks.
- Rate Limiting: Prevent abuse with per-user and per-IP rate limits.
- Audit Logging: Log all payment transactions and inventory changes for compliance.
- Encryption: Encrypt sensitive data in transit (TLS) and at rest.
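Input validation for quantities and amounts can be sketched as a single function that rejects anything suspicious before it reaches the order logic. The limits used here (`MAX_QUANTITY`, two decimal places) and the function name are assumptions for illustration; real limits depend on the catalog and currency.

```python
from decimal import Decimal, InvalidOperation
from typing import Tuple

MAX_QUANTITY = 100  # assumed per-line cap, for illustration only


def validate_line_item(quantity: object, unit_price: str) -> Tuple[int, Decimal]:
    """Return (quantity, price) if both are acceptable, else raise ValueError."""
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(quantity, bool) or not isinstance(quantity, int):
        raise ValueError("quantity must be an integer")
    if not 1 <= quantity <= MAX_QUANTITY:
        raise ValueError("quantity out of range")
    try:
        price = Decimal(unit_price)
    except (InvalidOperation, TypeError, ValueError):
        raise ValueError("unit_price is not a valid number") from None
    # Check finiteness first: ordering comparisons on NaN would raise.
    if not price.is_finite() or price < 0:
        raise ValueError("unit_price must be a finite, non-negative amount")
    if price.as_tuple().exponent < -2:
        raise ValueError("unit_price has more than two decimal places")
    return quantity, price
```

Parsing amounts as `Decimal` instead of `float` keeps them consistent with the `DECIMAL(10, 2)` columns and avoids binary rounding errors.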
Monitoring & Observability
Key Metrics
- Order Success Rate: Percentage of successful checkouts.
- Payment Success Rate: Percentage of successful payments.
- Inventory Accuracy: Discrepancy between actual and recorded inventory.
- API Latency: P50, P95, P99 latencies for checkout flow.
- Error Rates: By service and error type.
- Queue Depth: Message queue backlog for async operations.
Distributed Tracing
Use OpenTelemetry to trace requests across:
- API Gateway → Order Service → Inventory Service → Payment Service → Payment Gateway
Alerting
Set up alerts for:
- Payment failure rate > 5%
- Inventory discrepancies
- High API latency (P95 > 2 seconds)
- Queue depth exceeding threshold
- Circuit breaker opening
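Two of these thresholds can be evaluated directly from raw counters and latency samples. The sketch below uses a nearest-rank percentile and hypothetical function and alert names; in practice a monitoring system would compute these over a time window rather than an in-memory list.

```python
import math
from typing import List, Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample covering pct% of the data."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_alerts(payment_attempts: int, payment_failures: int,
                 latencies_s: Sequence[float]) -> List[str]:
    """Return the names of the alert rules that are currently firing."""
    alerts: List[str] = []
    # Payment failure rate > 5%
    if payment_attempts and payment_failures / payment_attempts > 0.05:
        alerts.append("payment_failure_rate")
    # P95 latency > 2 seconds
    if latencies_s and percentile(latencies_s, 95) > 2.0:
        alerts.append("p95_latency")
    return alerts
```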
Design Checklist
- Implement rate limiting at API Gateway level.
- Use SQL database for inventory and orders (ACID guarantees).
- Design payment APIs with idempotency keys.
- Implement Saga pattern for distributed transactions.
- Add compensation logic for rollback scenarios.
- Handle race conditions with appropriate locking strategy.
- Never expose payment gateway credentials to frontend.
- Implement circuit breakers for external service calls.
- Set up retry logic with exponential backoff.
- Configure dead letter queues for async operations.
- Add comprehensive monitoring and alerting.
- Implement distributed tracing for debugging.
- Plan for horizontal scaling of services.
- Design for graceful degradation during failures.
- Ensure PCI DSS compliance for payment handling.