E-Commerce Checkout System Design

Introduction

E-commerce checkout systems are among the most critical and challenging components in online retail. They must absorb heavy write traffic during peak periods such as flash sales, maintain strict data consistency (ACID transactions), integrate with unreliable third-party services such as payment gateways, and prevent race conditions that could lead to overselling or double-charging customers.

Core Characteristics

  • Write-Intensive Workloads: During flash sales or promotional events, the system experiences sudden spikes in write operations (inventory updates, order creation, payment processing).
  • Strict Data Consistency: Requires ACID transactions to prevent overselling, ensure payment integrity, and maintain order accuracy.
  • Third-Party Integration Risks: Payment gateways, shipping providers, and fraud detection services can fail or timeout, requiring robust error handling.
  • Race Condition Sensitivity: Multiple users competing for limited inventory require careful concurrency control (row locking or atomic conditional updates).
  • Idempotency Requirements: Network retries and user double-clicks must not result in duplicate charges or orders.

High-Level Architecture

graph TD
Client[Frontend Web/App] -->|Checkout Request| APIGW[API Gateway + Rate Limiter]

subgraph "Core Transaction Services"
APIGW --> OrderSvc[Order Orchestrator Service]
OrderSvc -->|1. Reserve Stock| InventorySvc[Inventory Service]
InventorySvc -->|Lock Row| InventoryDB[(SQL DB - ACID)]

OrderSvc -->|2. Process Payment| PaymentSvc[Payment Service Wrapper]
PaymentSvc -->|Idempotent Request| 3rdPartyPG[External Payment Gateway - Stripe/Midtrans]

OrderSvc -->|3. Create Order Data| OrderDB[(Order SQL DB)]
end

subgraph "Post-Transaction Async"
OrderSvc -->|Order Success Event| MQ[Message Queue - RabbitMQ/SQS]
MQ --> NotifSvc[Notification Service - Email/WA]
MQ --> AnalyticsSvc[Data Warehouse Loader]
MQ --> ShippingSvc[Shipping Service]
end

Architecture Components

API Gateway & Rate Limiting

Purpose: Single entry point that handles authentication, rate limiting, and request routing.

  • Rate Limiter: Critical for preventing bot attacks during flash sales. Implement token bucket or sliding window algorithms.
  • Request Throttling: Protect backend services from traffic spikes using per-user/IP rate limits.
  • Authentication: Validate JWT tokens before routing to backend services.
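
The token bucket algorithm mentioned above can be sketched as follows. This is a minimal single-process illustration, not a production limiter: the class name, the parameters, and the injectable `clock` are assumptions made for the example, and a real gateway would usually keep bucket state in a shared store.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts up to `capacity`, refilled at `refill_rate` tokens per second."""

    def __init__(self, capacity: float, refill_rate: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.clock = clock              # injectable so tests can control time
        self.tokens = capacity          # start full so an initial burst is allowed
        self.last_refill = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and consume `cost` tokens if the request may proceed."""
        now = self.clock()
        elapsed = now - self.last_refill
        # Add the tokens earned since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

For per-user or per-IP limits, one option is a dictionary of buckets keyed by user ID or IP address, with a rejected request answered by HTTP 429.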

Order Orchestrator Service

Purpose: Central service that coordinates the multi-step checkout process.

  • Orchestration Logic: Manages the sequence of operations (inventory reservation → payment → order creation).
  • State Management: Tracks order state transitions (pending → processing → completed/failed).
  • Compensation Logic: Handles rollback operations when any step fails.
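
One way to keep state management honest is to make the allowed transitions explicit and reject everything else. The sketch below assumes that `completed` and `failed` are terminal states and that a pending order may fail directly; the names `OrderState`, `ALLOWED`, and `transition` are illustrative only.

```python
from enum import Enum


class OrderState(Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"


# Allowed transitions; COMPLETED and FAILED are terminal in this sketch.
ALLOWED = {
    OrderState.PENDING: {OrderState.PROCESSING, OrderState.FAILED},
    OrderState.PROCESSING: {OrderState.COMPLETED, OrderState.FAILED},
    OrderState.COMPLETED: set(),
    OrderState.FAILED: set(),
}


def transition(current: OrderState, target: OrderState) -> OrderState:
    """Return the new state, or raise ValueError for an illegal transition."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```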

Inventory Service

Purpose: Manages product stock levels with strict consistency guarantees.

  • SQL Database Requirement: Use a SQL database (PostgreSQL/MySQL) as the source of truth for stock, for its ACID transactions and row-level locking.
  • Stock Reservation: Temporary holds on inventory during checkout process.
  • Locking Strategy:
    • Pessimistic Locking: SELECT ... FOR UPDATE to prevent concurrent modifications.
    • Optimistic Locking: Version fields with conflict detection (better for low contention; under heavy contention most attempts conflict and must retry).

Payment Service Wrapper

Purpose: Abstraction layer between internal services and external payment gateways.

  • Never Direct Frontend Access: Backend must be the intermediary to record status and handle errors.
  • Idempotency Keys: Generate unique transaction IDs to prevent duplicate charges.
  • Retry Logic: Implement exponential backoff for transient failures.
  • Status Tracking: Maintain payment state (initiated → processing → succeeded/failed).

Message Queue (Post-Transaction)

Purpose: Decouple non-critical operations from the main transaction flow.

  • Event Publishing: Emit order success events after transaction commits.
  • Downstream Services: Notification, analytics, shipping, and inventory finalization.
  • Eventual Consistency: Acceptable for non-critical operations (email notifications, analytics).

Key Design Patterns

1. Saga Pattern for Distributed Transactions

Problem: A single ACID transaction cannot span multiple services, separate databases, and external APIs.

Solution: Saga Pattern coordinates distributed transactions through a sequence of local transactions with compensating actions.

Types of Saga:

  • Orchestration-Based: Central orchestrator (Order Service) coordinates all steps.
    • Pros: Centralized control, easier to understand flow.
    • Cons: Single point of failure, tight coupling.
  • Choreography-Based: Each service publishes events and reacts to events from others.
    • Pros: Loose coupling, better scalability.
    • Cons: Harder to debug, distributed control.

Example Flow:

sequenceDiagram
participant Client
participant OrderSvc
participant InventorySvc
participant PaymentSvc
participant PaymentGW

Client->>OrderSvc: Initiate Checkout
OrderSvc->>InventorySvc: Reserve Stock (T1)
InventorySvc-->>OrderSvc: Stock Reserved

OrderSvc->>PaymentSvc: Process Payment (T2)
PaymentSvc->>PaymentGW: Charge Customer
PaymentGW-->>PaymentSvc: Payment Failed
PaymentSvc-->>OrderSvc: Payment Failed

OrderSvc->>InventorySvc: Compensate: Release Stock (C1)
InventorySvc-->>OrderSvc: Stock Released
OrderSvc-->>Client: Checkout Failed

Compensation Actions:

  • If payment fails after stock reservation → Release reserved stock.
  • If order creation fails after payment → Initiate refund process.
  • If inventory update fails after payment → Void or refund the payment (if possible) or flag for manual review.
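
The orchestration-based flow and its compensation rules can be sketched as a list of steps, each pairing a local transaction with its compensating action. This is a minimal in-process illustration; `SagaStep` and `run_saga` are hypothetical names, and a real orchestrator would also persist saga state so that it can resume after a crash.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SagaStep:
    name: str
    action: Callable[[], None]       # local transaction (T1, T2, ...)
    compensate: Callable[[], None]   # compensating action (C1, C2, ...)


def run_saga(steps: List[SagaStep]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[SagaStep] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            # The failed step did not complete, so only earlier steps are undone.
            for done in reversed(completed):
                done.compensate()
            return False
        completed.append(step)
    return True
```

Compensations can themselves fail, so in practice they should be idempotent and retried, with manual review as the last resort.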

2. Idempotency

Problem: Network retries, user double-clicks, or system failures can cause duplicate payment requests.

Solution: Design all payment APIs to be idempotent using unique transaction keys.

Implementation:

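  1. Generate Idempotency Key: Create one unique key per checkout attempt (e.g., a UUID stored with the order, or a hash of user_id + order_id) and reuse it on every retry. A key that mixes in a fresh timestamp per request would differ between retries and defeat deduplication.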
  2. Check Before Processing: Query database/Redis to see if this key was already processed.
  3. Store Result: If processed, return cached result. If not, process and store result.
  4. TTL: Set expiration on idempotency keys (e.g., 24 hours).

Example:

import json

# Assumes `redis` (a connected Redis client) and `payment_gateway` are configured elsewhere.
def process_payment(order_id, amount, idempotency_key):
    # Check if already processed
    cached_result = redis.get(f"payment:{idempotency_key}")
    if cached_result:
        return json.loads(cached_result)

    # Process payment; passing the key lets the gateway deduplicate concurrent retries too
    result = payment_gateway.charge(amount, idempotency_key)

    # Cache result
    redis.setex(
        f"payment:{idempotency_key}",
        86400,  # 24 hours TTL
        json.dumps(result)
    )

    return result

3. Handling Race Conditions

Problem: 1,000 users trying to buy the last item simultaneously can cause overselling if stock is read and then written without coordination.

Solutions:

Pessimistic Locking (Database Level)

START TRANSACTION; -- Accepted by both PostgreSQL and MySQL

SELECT quantity FROM inventory
WHERE product_id = ?
FOR UPDATE; -- Locks the row

UPDATE inventory
SET quantity = quantity - 1
WHERE product_id = ? AND quantity > 0;

COMMIT;
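  • Pros: Serializes concurrent updates to the same row, so stock cannot be oversold.
  • Cons: Reduces concurrency under contention; can deadlock when transactions lock multiple rows in different orders.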

Optimistic Locking (Version Field)

UPDATE inventory
SET quantity = quantity - 1, version = version + 1
WHERE product_id = ?
AND version = ? -- Expected version
AND quantity > 0;
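  • Pros: Better concurrency when conflicts are rare; no row locks are held between the read and the write.
  • Cons: Requires retry logic on version conflicts, and retries multiply under heavy contention.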
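
The retry logic that optimistic locking requires can be sketched as below. The example uses an in-memory SQLite database purely so that it runs anywhere; the function name `reserve_one`, the simplified table, and the retry limit are assumptions for illustration, and the same pattern applies to PostgreSQL or MySQL drivers.

```python
import sqlite3


def reserve_one(conn: sqlite3.Connection, product_id: int, max_attempts: int = 3) -> bool:
    """Decrement stock by one using a version check; retry on conflict."""
    for _ in range(max_attempts):
        row = conn.execute(
            "SELECT quantity, version FROM inventory WHERE product_id = ?",
            (product_id,),
        ).fetchone()
        if row is None or row[0] <= 0:
            return False  # unknown product or sold out
        cur = conn.execute(
            "UPDATE inventory SET quantity = quantity - 1, version = version + 1 "
            "WHERE product_id = ? AND version = ? AND quantity > 0",
            (product_id, row[1]),
        )
        conn.commit()
        if cur.rowcount == 1:
            return True  # the expected version matched, so the update applied
        # rowcount == 0: another writer changed the row first; re-read and retry
    return False


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inventory (product_id INTEGER PRIMARY KEY, "
    "quantity INTEGER NOT NULL, version INTEGER NOT NULL DEFAULT 0)"
)
conn.execute("INSERT INTO inventory VALUES (1, 1, 0)")  # one unit in stock
conn.commit()
```

The key signal is the affected-row count: zero rows updated means the version moved (or stock ran out) between the read and the write.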

Redis-Based Inventory (High-Performance)

For extremely high-traffic scenarios (flash sales):

-- Lua script for atomic decrement
local current = redis.call('GET', KEYS[1])
if current and tonumber(current) > 0 then
    return redis.call('DECR', KEYS[1])
else
    return -1
end
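  • Pros: Very fast; the script runs atomically in memory, and a single Redis node can typically serve on the order of 100,000 or more operations per second.
  • Cons: Redis is not the system of record here, so counts must be synchronized with the persistent database and are only eventually consistent with it.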

Database Design Considerations

Inventory Database Schema

CREATE TABLE inventory (
    product_id BIGINT PRIMARY KEY,
    quantity INT NOT NULL CHECK (quantity >= 0),
    reserved_quantity INT NOT NULL DEFAULT 0,
    version INT NOT NULL DEFAULT 0, -- For optimistic locking
    updated_at TIMESTAMP DEFAULT NOW()
);

-- No separate index on product_id is needed: the PRIMARY KEY already provides one.

Order Database Schema

CREATE TABLE orders (
    order_id BIGINT PRIMARY KEY AUTO_INCREMENT,
    user_id BIGINT NOT NULL,
    status ENUM('pending', 'processing', 'completed', 'failed', 'cancelled'),
    total_amount DECIMAL(10, 2) NOT NULL,
    payment_id VARCHAR(255) UNIQUE, -- Idempotency key
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);

CREATE TABLE order_items (
    order_item_id BIGINT PRIMARY KEY AUTO_INCREMENT,
    order_id BIGINT NOT NULL,
    product_id BIGINT NOT NULL,
    quantity INT NOT NULL,
    price DECIMAL(10, 2) NOT NULL,
    FOREIGN KEY (order_id) REFERENCES orders(order_id)
);

CREATE INDEX idx_user_id ON orders(user_id);
-- payment_id needs no extra index: its UNIQUE constraint already creates one.

Payment Status Tracking

CREATE TABLE payment_transactions (
    transaction_id BIGINT PRIMARY KEY AUTO_INCREMENT,
    order_id BIGINT NOT NULL,
    idempotency_key VARCHAR(255) UNIQUE NOT NULL,
    amount DECIMAL(10, 2) NOT NULL,
    status ENUM('initiated', 'processing', 'succeeded', 'failed', 'refunded'),
    gateway_response JSON,
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW(),
    FOREIGN KEY (order_id) REFERENCES orders(order_id)
);

-- idempotency_key needs no extra index: its UNIQUE constraint already creates one.
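
The uniqueness of idempotency_key can itself serve as the deduplication check: insert the key first, and treat a constraint violation as "already seen". The sketch below uses an in-memory SQLite table with a simplified schema so that it is runnable; `charge_once` and the `charge` callable are hypothetical stand-ins for the payment wrapper and the gateway call.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payment_transactions ("
    " idempotency_key TEXT PRIMARY KEY,"
    " order_id INTEGER NOT NULL,"
    " status TEXT NOT NULL,"
    " gateway_response TEXT)"
)


def charge_once(order_id: int, key: str, charge) -> dict:
    """Call `charge()` at most once per key; replay the stored result otherwise."""
    try:
        # Claim the key first; the unique constraint rejects a second claim.
        conn.execute(
            "INSERT INTO payment_transactions (idempotency_key, order_id, status) "
            "VALUES (?, ?, 'initiated')",
            (key, order_id),
        )
        conn.commit()
    except sqlite3.IntegrityError:
        conn.rollback()
        row = conn.execute(
            "SELECT status, gateway_response FROM payment_transactions "
            "WHERE idempotency_key = ?",
            (key,),
        ).fetchone()
        # A request that is still in flight has no stored response yet.
        return json.loads(row[1]) if row[1] else {"status": row[0]}
    result = charge()
    conn.execute(
        "UPDATE payment_transactions SET status = ?, gateway_response = ? "
        "WHERE idempotency_key = ?",
        (result["status"], json.dumps(result), key),
    )
    conn.commit()
    return result
```

Claiming the key before calling the gateway closes the window in which two concurrent requests could both pass a "has this key been processed?" check.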

Payment Integration Patterns

Payment Gateway Wrapper Service

Best Practices:

  1. Never Expose Gateway Credentials: Frontend should never directly call payment gateways.
  2. Status Polling: For asynchronous payment gateways, implement webhook handlers and polling fallback.
  3. Timeout Handling: Set reasonable timeouts (e.g., 30 seconds) and retry with exponential backoff. A timed-out charge has an unknown outcome, so retry only with the same idempotency key.
  4. Idempotent Requests: Always include idempotency keys in gateway requests.
  5. Webhook Security: Verify webhook signatures to prevent fraud.
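
Webhook signature verification can be sketched as an HMAC comparison over the raw request body. Each gateway defines its own scheme (header name, encoding, and often a timestamp in the signed payload), so the generic HMAC-SHA256 hex format and the function names here are assumptions; follow the provider's documentation for the real format.

```python
import hashlib
import hmac


def sign(secret: bytes, payload: bytes) -> str:
    """Hex-encoded HMAC-SHA256 of the raw request body."""
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()


def verify_webhook(secret: bytes, payload: bytes, received_signature: str) -> bool:
    """Recompute the signature and compare in constant time."""
    expected = sign(secret, payload)
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(expected.encode(), received_signature.encode())
```

The check must run on the exact bytes received, before any JSON parsing or re-serialization, because even a whitespace change alters the signature.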

Payment Flow

sequenceDiagram
participant Client
participant OrderSvc
participant PaymentSvc
participant PaymentGW
participant Webhook

Client->>OrderSvc: Initiate Payment
OrderSvc->>PaymentSvc: Process Payment (with idempotency_key)
PaymentSvc->>PaymentGW: Create Charge Request
PaymentGW-->>PaymentSvc: Payment Initiated (pending)
PaymentSvc-->>OrderSvc: Payment Processing
OrderSvc-->>Client: Payment Initiated

PaymentGW->>Webhook: Payment Status Update
Webhook->>PaymentSvc: Update Status
PaymentSvc->>OrderSvc: Payment Completed
OrderSvc->>Client: Order Confirmed

Scalability & Performance

Caching Strategy

  • Product Catalog: Cache product details in Redis (TTL: 5 minutes).
  • Inventory Counts: Use Redis for real-time inventory during flash sales, sync with DB periodically.
  • User Sessions: Store cart data in Redis with session expiration.

Database Optimization

  • Connection Pooling: Use connection pools (e.g., 20-50 connections per service instance).
  • Read Replicas: Route read queries to replicas, writes to primary.
  • Partitioning: Partition orders table by date or user_id for large-scale systems.
  • Indexing: Strategic indexes on frequently queried columns (user_id, payment_id, status).

Load Balancing

  • Horizontal Scaling: Scale Order Service and Inventory Service independently based on load.
  • Database Sharding: For very large scale, shard by user_id or product_id.
  • CDN: Serve static assets (product images) via CDN.

Failure Handling

Circuit Breaker Pattern

Implement circuit breakers for payment gateway calls:

  • Closed: Normal operation, requests pass through.
  • Open: After X consecutive failures, stop sending requests and fail fast with an error instead of waiting on the gateway.
  • Half-Open: After a timeout, try one request. If successful, close the circuit; if it fails, reopen it.
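
The three states can be sketched as a small wrapper around the gateway call. This is a single-threaded illustration under assumed names (`CircuitBreaker`, `CircuitOpenError`) and assumed default thresholds; a concurrent service would also need to ensure that only one trial request runs while half-open.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def call(self, func: Callable[[], object]) -> object:
        if self.state == "open":
            raise CircuitOpenError("failing fast; gateway presumed unavailable")
        try:
            result = func()  # closed: normal call; half-open: trial call
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or reopen after a failed trial
            raise
        self.failures = 0
        self.opened_at = None  # a success closes the circuit
        return result
```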

Retry Strategy

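import random
import time

class TransientError(Exception):
    """Placeholder for retryable failures (timeouts, HTTP 5xx)."""

def retry_with_backoff(func, max_retries=3):
    for attempt in range(max_retries):
        try:
            return func()
        except TransientError:
            if attempt == max_retries - 1:
                raise  # Out of attempts: surface the last error
            # Exponential backoff with jitter
            wait_time = (2 ** attempt) + random.uniform(0, 1)
            time.sleep(wait_time)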

Dead Letter Queue (DLQ)

For async operations (notifications, analytics):

  • Route failed messages to DLQ after max retries.
  • Monitor DLQ depth and alert on threshold.
  • Implement manual reprocessing mechanism.
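
The routing rule can be sketched with in-memory queues. Brokers such as RabbitMQ (dead-letter exchanges) and SQS (redrive policies) provide this natively, so the code below only illustrates the logic; the `consume` function and the `attempts` and `error` message fields are assumptions made for the example.

```python
from collections import deque
from typing import Callable, Deque, Dict, List


def consume(queue: Deque[Dict], handler: Callable[[Dict], None],
            dead_letters: List[Dict], max_retries: int = 3) -> None:
    """Drain `queue`; requeue failures until `max_retries`, then dead-letter them."""
    while queue:
        message = queue.popleft()
        try:
            handler(message)
        except Exception as exc:
            message["attempts"] = message.get("attempts", 0) + 1
            if message["attempts"] >= max_retries:
                message["error"] = str(exc)  # keep the cause for manual reprocessing
                dead_letters.append(message)
            else:
                queue.append(message)        # retry later, behind other messages
```

Monitoring then reduces to watching the length of `dead_letters`, and manual reprocessing to moving selected messages back onto the main queue once the cause is fixed.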

Security Considerations

  • PCI DSS Compliance: Never store raw credit card numbers. Use tokenization.
  • Input Validation: Validate all inputs (quantities, amounts) to prevent negative values or injection attacks.
  • Rate Limiting: Prevent abuse with per-user and per-IP rate limits.
  • Audit Logging: Log all payment transactions and inventory changes for compliance.
  • Encryption: Encrypt sensitive data in transit (TLS) and at rest.

Monitoring & Observability

Key Metrics

  • Order Success Rate: Percentage of successful checkouts.
  • Payment Success Rate: Percentage of successful payments.
  • Inventory Accuracy: Discrepancy between actual and recorded inventory.
  • API Latency: P50, P95, P99 latencies for checkout flow.
  • Error Rates: By service and error type.
  • Queue Depth: Message queue backlog for async operations.
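
For the latency metric, percentiles can be computed from raw samples with the nearest-rank method, sketched below. Metrics systems usually estimate percentiles from histograms rather than from sorted samples, so this function is only an illustration of what P50, P95, and P99 mean; the name `percentile` is an assumption.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]
```

With latencies collected per checkout request, `percentile(latencies, 95)` gives the value that 95% of requests stayed at or below, which is the quantity a P95 alert compares against its threshold.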

Distributed Tracing

Use OpenTelemetry to trace requests across:

  • API Gateway → Order Service → Inventory Service → Payment Service → Payment Gateway

Alerting

Set up alerts for:

  • Payment failure rate > 5%
  • Inventory discrepancies
  • High API latency (P95 > 2 seconds)
  • Queue depth exceeding threshold
  • Circuit breaker opening

Design Checklist

  • Implement rate limiting at API Gateway level.
  • Use SQL database for inventory and orders (ACID guarantees).
  • Design payment APIs with idempotency keys.
  • Implement Saga pattern for distributed transactions.
  • Add compensation logic for rollback scenarios.
  • Handle race conditions with appropriate locking strategy.
  • Never expose payment gateway credentials to frontend.
  • Implement circuit breakers for external service calls.
  • Set up retry logic with exponential backoff.
  • Configure dead letter queues for async operations.
  • Add comprehensive monitoring and alerting.
  • Implement distributed tracing for debugging.
  • Plan for horizontal scaling of services.
  • Design for graceful degradation during failures.
  • Ensure PCI DSS compliance for payment handling.