Jaeger: Distributed Tracing System

Jaeger is an open-source distributed tracing system originally built by teams at Uber and later donated to the Cloud Native Computing Foundation (CNCF). It helps developers monitor and troubleshoot microservices-based distributed systems by providing visibility into request flows across services.

What is Distributed Tracing?

Distributed tracing is a method of tracking requests as they flow through a distributed system. It helps answer questions like:

  • Which services are involved in processing a request?
  • How long does each service take to process the request?
  • Where are the bottlenecks in the system?
  • What happens when a request fails?

Key Concepts

Trace

A trace represents the entire journey of a request through the system, from the initial request to the final response.

Span

A span represents a single operation within a trace. Each span has:

  • Operation Name: What operation is being performed
  • Start Time: When the operation started
  • Duration: How long the operation took
  • Tags: Key-value pairs for metadata
  • Logs: Timestamped events during the span
  • References: Links to parent/child spans
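
To make these attributes concrete, the sketch below uses the OpenTracing Go API (the same API the instrumentation examples further down rely on); the operation names, tag value, and log message are illustrative only.

package main

import (
    "github.com/opentracing/opentracing-go"
    otlog "github.com/opentracing/opentracing-go/log"
)

func main() {
    // Parent span: the operation name and start time are set here;
    // the duration is recorded when Finish is called.
    parent := opentracing.GlobalTracer().StartSpan("http-request")
    defer parent.Finish()

    parent.SetTag("http.method", "GET") // tag: key-value metadata

    // Child span: the ChildOf reference links it to its parent.
    child := opentracing.GlobalTracer().StartSpan(
        "db-query",
        opentracing.ChildOf(parent.Context()),
    )
    child.LogFields(otlog.String("event", "query executed")) // timestamped log
    child.Finish()
}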

Service

A service is a logical grouping of spans and operations emitted by one application or component, identified by a shared service name.

Jaeger Architecture

Jaeger consists of several components:

Jaeger Client Libraries

  • Instrumentation Libraries: For different programming languages
  • Tracers: Generate spans and traces
  • Transport: Send trace data to Jaeger backend

Jaeger Agent

  • Deployment: Runs as a daemon on each host
  • Function: Receives traces from client libraries and forwards them to collectors
  • Benefits: Reduces load on client applications
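
As a minimal sketch, the Go client used in the examples below can be pointed at a local agent through the reporter configuration; the host and port are assumptions and should match where your agent actually listens (6831/udp is the agent's default port for compact thrift spans).

package main

import (
    "log"

    "github.com/uber/jaeger-client-go/config"
)

func main() {
    // Report spans to a Jaeger agent running on the same host.
    cfg := config.Configuration{
        ServiceName: "my-service",
        Sampler: &config.SamplerConfig{
            Type:  "const",
            Param: 1,
        },
        Reporter: &config.ReporterConfig{
            LocalAgentHostPort: "localhost:6831", // assumed agent address
            LogSpans:           true,
        },
    }

    tracer, closer, err := cfg.NewTracer()
    if err != nil {
        log.Fatal(err)
    }
    defer closer.Close()

    _ = tracer // use with opentracing.SetGlobalTracer(tracer) as in the examples below
}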

Jaeger Collector

  • Function: Receives traces from agents and clients
  • Processing: Validates, processes, and stores traces
  • Storage: Writes traces to configured storage backend

Jaeger Query Service

  • Function: Provides API for retrieving traces
  • UI: Powers the Jaeger UI for trace visualization
  • Search: Allows searching traces by various criteria
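
The same HTTP API that powers the UI can be called directly. The sketch below hits the /api/services endpoint also used in the Debug Commands section, and prints the raw JSON because the exact response schema can differ between Jaeger versions.

package main

import (
    "fmt"
    "io"
    "log"
    "net/http"
)

func main() {
    // List the services known to Jaeger via the Query Service HTTP API
    // (equivalent to `curl http://localhost:16686/api/services`).
    resp, err := http.Get("http://localhost:16686/api/services")
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    body, err := io.ReadAll(resp.Body)
    if err != nil {
        log.Fatal(err)
    }

    // The response is JSON; printed raw here rather than assuming a schema.
    fmt.Println(string(body))
}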

Jaeger Storage

  • Backend: Configurable storage (Elasticsearch, Cassandra, etc.)
  • Retention: Configurable data retention policies

Installation and Setup

Using Docker Compose

version: "3.8"
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686" # Jaeger UI
      - "14268:14268" # HTTP collector
      - "14250:14250" # gRPC collector
    environment:
      - COLLECTOR_OTLP_ENABLED=true
    command: ["--log-level=debug"]

Using Kubernetes

apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
        - name: jaeger
          image: jaegertracing/all-in-one:latest
          ports:
            - containerPort: 16686
            - containerPort: 14268
          env:
            - name: COLLECTOR_OTLP_ENABLED
              value: "true"

Instrumentation Examples

Go Example

package main

import (
    "fmt"
    "log"
    "time"

    "github.com/opentracing/opentracing-go"
    "github.com/uber/jaeger-client-go/config"
)

func main() {
    // Initialize the Jaeger tracer
    cfg := config.Configuration{
        ServiceName: "my-service",
        Sampler: &config.SamplerConfig{
            Type:  "const",
            Param: 1,
        },
        Reporter: &config.ReporterConfig{
            LogSpans: true,
        },
    }

    tracer, closer, err := cfg.NewTracer()
    if err != nil {
        log.Fatal(err)
    }
    defer closer.Close()

    opentracing.SetGlobalTracer(tracer)

    // Create a span
    span := tracer.StartSpan("main-operation")
    defer span.Finish()

    // Add tags
    span.SetTag("operation", "main")
    span.SetTag("version", "1.0.0")

    // Simulate work
    time.Sleep(100 * time.Millisecond)

    // Create a child span
    childSpan := tracer.StartSpan("child-operation", opentracing.ChildOf(span.Context()))
    defer childSpan.Finish()

    childSpan.SetTag("child.operation", "processing")
    time.Sleep(50 * time.Millisecond)

    fmt.Println("Operation completed")
}
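
The example above stays inside a single process. For a trace to follow a request across service boundaries, the span context must be propagated, typically by injecting it into outgoing HTTP headers and extracting it on the receiving side. A minimal sketch using the OpenTracing API (the package name, downstream URL, and operation name are illustrative):

package tracingexample

import (
    "net/http"

    "github.com/opentracing/opentracing-go"
)

// callDownstream injects the current span's context into outgoing HTTP headers
// so the downstream service can continue the same trace.
func callDownstream(span opentracing.Span) (*http.Request, error) {
    req, err := http.NewRequest("GET", "http://downstream-service/api/users", nil)
    if err != nil {
        return nil, err
    }
    err = opentracing.GlobalTracer().Inject(
        span.Context(),
        opentracing.HTTPHeaders,
        opentracing.HTTPHeadersCarrier(req.Header),
    )
    return req, err
}

// handleRequest extracts the caller's span context from incoming headers and
// starts a server-side span that joins the same trace.
func handleRequest(w http.ResponseWriter, r *http.Request) {
    wireCtx, _ := opentracing.GlobalTracer().Extract(
        opentracing.HTTPHeaders,
        opentracing.HTTPHeadersCarrier(r.Header),
    )
    serverSpan := opentracing.GlobalTracer().StartSpan(
        "handle-request",
        opentracing.ChildOf(wireCtx), // links the server span to the client span
    )
    defer serverSpan.Finish()

    w.WriteHeader(http.StatusOK)
}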

Python Example

from jaeger_client import Config
import time


def init_tracer(service_name):
    config = Config(
        config={
            'sampler': {
                'type': 'const',
                'param': 1,
            },
            'logging': True,
        },
        service_name=service_name,
    )
    return config.initialize_tracer()


def main():
    tracer = init_tracer('my-python-service')

    with tracer.start_span('main-operation') as span:
        span.set_tag('operation', 'main')
        span.set_tag('version', '1.0.0')

        time.sleep(0.1)  # Simulate work

        with tracer.start_span('child-operation', child_of=span) as child_span:
            child_span.set_tag('child.operation', 'processing')
            time.sleep(0.05)

    print("Operation completed")


if __name__ == '__main__':
    main()

Configuration

Sampling Configuration

sampling:
  type: "probabilistic"
  param: 0.1 # Sample 10% of traces

Storage Configuration

storage:
  type: "elasticsearch"
  elasticsearch:
    server-urls: "http://elasticsearch:9200"
    index-prefix: "jaeger"

Best Practices

1. Meaningful Operation Names

// Good
span := tracer.StartSpan("user-authentication")
span := tracer.StartSpan("database-query")

// Bad
span := tracer.StartSpan("operation1")
span := tracer.StartSpan("do-stuff")

2. Appropriate Tagging

span.SetTag("http.method", "GET")
span.SetTag("http.url", "/api/users")
span.SetTag("http.status_code", 200)
span.SetTag("user.id", userID)

3. Error Handling

// Mark the span as failed and record the error details.
// ("log" here is github.com/opentracing/opentracing-go/log)
span := tracer.StartSpan("risky-operation")
defer span.Finish()

if err := riskyFunction(); err != nil {
    span.SetTag("error", true)
    span.LogFields(log.String("error.message", err.Error()))
    return err
}

4. Sampling Strategies

  • Const: Always sample (useful for development)
  • Probabilistic: Sample a percentage of traces
  • Rate Limiting: Sample up to a certain rate per second
  • Adaptive: Adjust sampling based on traffic
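
A minimal sketch of selecting these strategies in the Go client used above (the parameter values are illustrative; adaptive sampling is computed by the Jaeger backend, which the "remote" sampler type defers to):

package main

import (
    "fmt"

    "github.com/uber/jaeger-client-go/config"
)

func main() {
    // Illustrative sampler configurations for the Go client from the examples above.
    samplers := []*config.SamplerConfig{
        {Type: "const", Param: 1},           // always sample (development)
        {Type: "probabilistic", Param: 0.1}, // sample ~10% of traces
        {Type: "ratelimiting", Param: 100},  // at most 100 traces per second
        {Type: "remote"},                    // fetch the strategy from the Jaeger backend
    }
    for _, s := range samplers {
        fmt.Printf("%s sampler, param=%v\n", s.Type, s.Param)
    }
}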

Integration with Other Tools

OpenTelemetry Integration

import (
    "log"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/jaeger"
    "go.opentelemetry.io/otel/sdk/resource"
    "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.17.0" // version suffix depends on your SDK release
)

func initTracer() {
    // Export spans to the Jaeger collector's HTTP endpoint
    exp, err := jaeger.New(jaeger.WithCollectorEndpoint(jaeger.WithEndpoint("http://localhost:14268/api/traces")))
    if err != nil {
        log.Fatal(err)
    }

    // Batch spans and attach the service name as a resource attribute
    tp := trace.NewTracerProvider(
        trace.WithBatcher(exp),
        trace.WithResource(resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceNameKey.String("my-service"),
        )),
    )

    otel.SetTracerProvider(tp)
}
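
Once the tracer provider is registered, spans are created through the OpenTelemetry API rather than the OpenTracing one; a brief sketch (the tracer name, span name, and attribute are illustrative):

package main

import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
)

func doWork(ctx context.Context) {
    // Obtain a tracer from the globally registered provider and start a span.
    tracer := otel.Tracer("my-service")
    ctx, span := tracer.Start(ctx, "main-operation")
    defer span.End()

    span.SetAttributes(attribute.String("operation", "main"))

    _ = ctx // pass ctx to downstream calls so child spans join the same trace
}

func main() {
    doWork(context.Background())
}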

Prometheus Integration

# jaeger-prometheus.yml
metrics:
  backend: prometheus
  prometheus:
    listen-address: ":8888"

Troubleshooting

Common Issues

  1. Traces not appearing in UI

    • Check if Jaeger collector is running
    • Verify client configuration
    • Check sampling configuration
  2. High memory usage

    • Adjust batch size
    • Configure appropriate sampling
    • Check for memory leaks in instrumentation
  3. Storage issues

    • Monitor storage backend health
    • Configure appropriate retention policies
    • Consider data archiving

Debug Commands

# Check Jaeger collector health
curl http://localhost:14269/

# Check storage backend
curl http://localhost:16686/api/services

# View trace by trace ID
curl "http://localhost:16686/api/traces/{trace-id}"

Performance Considerations

Sampling

  • Use probabilistic sampling in production
  • Consider adaptive sampling for high-traffic services
  • Monitor sampling overhead

Storage

  • Choose appropriate storage backend
  • Configure retention policies
  • Consider data archiving strategies

Network

  • Use Jaeger agent for high-throughput scenarios
  • Configure appropriate batch sizes
  • Monitor network bandwidth usage

Further Resources