Jaeger: Distributed Tracing System
Jaeger is an open-source distributed tracing system originally built by teams at Uber and later donated to the Cloud Native Computing Foundation (CNCF). It helps developers monitor and troubleshoot microservices-based distributed systems by providing visibility into request flows across services.
What is Distributed Tracing?
Distributed tracing is a method of tracking requests as they flow through a distributed system. It helps answer questions like:
- Which services are involved in processing a request?
- How long does each service take to process the request?
- Where are the bottlenecks in the system?
- What happens when a request fails?
Key Concepts
Trace
A trace represents the entire journey of a request through the system, from the initial request to the final response.
Span
A span represents a single operation within a trace. Each span has the following attributes (illustrated in the sketch after this list):
- Operation Name: What operation is being performed
- Start Time: When the operation started
- Duration: How long the operation took
- Tags: Key-value pairs for metadata
- Logs: Timestamped events during the span
- References: Links to parent/child spans
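As a concrete illustration of these attributes, here is a minimal sketch using the OpenTracing Go API (the same API used in the examples later in this document); the operation names and tag values are placeholders:

import (
    "github.com/opentracing/opentracing-go"
    otlog "github.com/opentracing/opentracing-go/log"
)

func handleRequest() {
    // Assumes a tracer has already been registered with opentracing.SetGlobalTracer.
    tracer := opentracing.GlobalTracer()

    // A span: one operation within the trace, with a name, start time, and duration.
    parent := tracer.StartSpan("handle-request")
    defer parent.Finish()

    // Tags: key-value metadata describing the operation.
    parent.SetTag("http.method", "GET")

    // Logs: timestamped events recorded while the span is active.
    parent.LogFields(otlog.String("event", "cache miss"))

    // References: the child span records a ChildOf reference to its parent.
    child := tracer.StartSpan("load-user", opentracing.ChildOf(parent.Context()))
    defer child.Finish()
}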
Service
A service is the logical component that emits spans, typically a single microservice or application. Every span reported by an instrumented application carries its service name, which is how Jaeger groups and filters operations in the UI.
Jaeger Architecture
Jaeger consists of several components:
Jaeger Client Libraries
- Instrumentation Libraries: Language-specific libraries for instrumenting application code
- Tracers: Create spans and assemble them into traces
- Transport: Send trace data to the Jaeger backend
Note that the Jaeger-specific client libraries have since been retired in favor of the OpenTelemetry SDKs; the OpenTracing-based examples later in this document still illustrate the same concepts.
Jaeger Agent
- Deployment: Runs as a daemon on each host
- Function: Receives spans from client libraries (typically over UDP) and forwards them to collectors (see the configuration sketch below)
- Benefits: Reduces load on client applications
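For example, with the (now retired) jaeger-client-go library, the reporter is pointed at the local agent's UDP endpoint; the host and port below are the library defaults and are shown only for illustration:

cfg := config.Configuration{
    ServiceName: "my-service",
    Reporter: &config.ReporterConfig{
        // The agent receives spans over UDP; 6831 is the default compact-Thrift port.
        LocalAgentHostPort: "localhost:6831",
    },
}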
Jaeger Collector
- Function: Receives traces from agents and clients
- Processing: Validates, processes, and stores traces
- Storage: Writes traces to configured storage backend
Jaeger Query Service
- Function: Provides API for retrieving traces
- UI: Powers the Jaeger UI for trace visualization
- Search: Allows searching traces by various criteria
Jaeger Storage
- Backend: Configurable storage (Elasticsearch, Cassandra, etc.)
- Retention: Configurable data retention policies
Installation and Setup
Using Docker Compose
version: "3.8"
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"   # Jaeger UI
      - "14268:14268"   # HTTP collector (jaeger.thrift)
      - "14250:14250"   # gRPC collector
      - "4317:4317"     # OTLP gRPC (needed when COLLECTOR_OTLP_ENABLED=true)
      - "4318:4318"     # OTLP HTTP
    environment:
      - COLLECTOR_OTLP_ENABLED=true
    command: ["--log-level=debug"]
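Assuming the file above is saved as docker-compose.yml (the file name is just a convention), the stack can be started and checked from the command line:

# Start the all-in-one container in the background
docker compose up -d
# Verify the UI is serving (should print 200)
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:16686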
Using Kubernetes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
        - name: jaeger
          image: jaegertracing/all-in-one:latest
          ports:
            - containerPort: 16686   # UI
            - containerPort: 14268   # HTTP collector
            - containerPort: 4317    # OTLP gRPC
          env:
            - name: COLLECTOR_OTLP_ENABLED
              value: "true"
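The Deployment by itself is not reachable from other pods, so a Service is usually created alongside it. A minimal sketch, reusing the labels and ports from the Deployment above:

apiVersion: v1
kind: Service
metadata:
  name: jaeger
spec:
  selector:
    app: jaeger
  ports:
    - name: ui
      port: 16686
      targetPort: 16686
    - name: http-collector
      port: 14268
      targetPort: 14268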
Instrumentation Examples
Go Example
package main

import (
    "fmt"
    "log"
    "time"

    "github.com/opentracing/opentracing-go"
    "github.com/uber/jaeger-client-go/config"
)

func main() {
    // Initialize the Jaeger tracer
    cfg := config.Configuration{
        ServiceName: "my-service",
        Sampler: &config.SamplerConfig{
            Type:  "const",
            Param: 1,
        },
        Reporter: &config.ReporterConfig{
            LogSpans: true,
        },
    }
    tracer, closer, err := cfg.NewTracer()
    if err != nil {
        log.Fatal(err)
    }
    defer closer.Close()
    opentracing.SetGlobalTracer(tracer)

    // Create a span
    span := tracer.StartSpan("main-operation")
    defer span.Finish()

    // Add tags
    span.SetTag("operation", "main")
    span.SetTag("version", "1.0.0")

    // Simulate work
    time.Sleep(100 * time.Millisecond)

    // Create a child span
    childSpan := tracer.StartSpan("child-operation", opentracing.ChildOf(span.Context()))
    defer childSpan.Finish()
    childSpan.SetTag("child.operation", "processing")
    time.Sleep(50 * time.Millisecond)

    fmt.Println("Operation completed")
}
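The example above stays inside a single process. For the distributed part, the span context has to be carried across service boundaries, typically in HTTP headers. A minimal sketch using the OpenTracing Inject/Extract APIs (req, r, and the URL are placeholders; this also needs "net/http" and "github.com/opentracing/opentracing-go/ext"):

// Client side: inject the current span context into the outgoing request headers.
req, _ := http.NewRequest("GET", "http://downstream/api/users", nil)
_ = tracer.Inject(
    span.Context(),
    opentracing.HTTPHeaders,
    opentracing.HTTPHeadersCarrier(req.Header),
)

// Server side: extract the context and start a span that continues the same trace.
spanCtx, _ := tracer.Extract(
    opentracing.HTTPHeaders,
    opentracing.HTTPHeadersCarrier(r.Header),
)
serverSpan := tracer.StartSpan("handle-users", ext.RPCServerOption(spanCtx))
defer serverSpan.Finish()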
Python Example
import time

from jaeger_client import Config


def init_tracer(service_name):
    config = Config(
        config={
            'sampler': {
                'type': 'const',
                'param': 1,
            },
            'logging': True,
        },
        service_name=service_name,
    )
    return config.initialize_tracer()


def main():
    tracer = init_tracer('my-python-service')
    with tracer.start_span('main-operation') as span:
        span.set_tag('operation', 'main')
        span.set_tag('version', '1.0.0')
        time.sleep(0.1)  # Simulate work
        with tracer.start_span('child-operation', child_of=span) as child_span:
            child_span.set_tag('child.operation', 'processing')
            time.sleep(0.05)
    print("Operation completed")
    # Give the reporter a moment to flush spans, then close the tracer.
    time.sleep(2)
    tracer.close()


if __name__ == '__main__':
    main()
Configuration
Sampling Configuration
Samplers are configured per client, either in code (as in the examples above) or through environment variables read by the client libraries:
# Sample 10% of traces
JAEGER_SAMPLER_TYPE=probabilistic
JAEGER_SAMPLER_PARAM=0.1
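Sampling can also be managed centrally by serving a strategies file from the collector (via the --sampling.strategies-file flag); clients configured with the remote sampler then fetch their strategy from it. A sketch of such a file, with a placeholder service name:

{
  "default_strategy": {
    "type": "probabilistic",
    "param": 0.1
  },
  "service_strategies": [
    { "service": "my-service", "type": "ratelimiting", "param": 2 }
  ]
}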
Storage Configuration
The collector and query service (or the all-in-one binary) select a storage backend through environment variables or the equivalent CLI flags:
SPAN_STORAGE_TYPE=elasticsearch
ES_SERVER_URLS=http://elasticsearch:9200
ES_INDEX_PREFIX=jaeger
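In a Docker Compose setup this translates into environment entries on the Jaeger service. A sketch that assumes an elasticsearch service is defined in the same file:

  jaeger:
    image: jaegertracing/all-in-one:latest
    environment:
      - SPAN_STORAGE_TYPE=elasticsearch
      - ES_SERVER_URLS=http://elasticsearch:9200
      - ES_INDEX_PREFIX=jaeger
    depends_on:
      - elasticsearch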
Best Practices
1. Meaningful Operation Names
// Good
span := tracer.StartSpan("user-authentication")
span := tracer.StartSpan("database-query")
// Bad
span := tracer.StartSpan("operation1")
span := tracer.StartSpan("do-stuff")
2. Appropriate Tagging
span.SetTag("http.method", "GET")
span.SetTag("http.url", "/api/users")
span.SetTag("http.status_code", 200)
span.SetTag("user.id", userID)
3. Error Handling
// Here "log" is the OpenTracing field package: github.com/opentracing/opentracing-go/log
span := tracer.StartSpan("risky-operation")
defer span.Finish()
if err := riskyFunction(); err != nil {
    span.SetTag("error", true)
    span.LogFields(log.String("error.message", err.Error()))
    return err
}
4. Sampling Strategies
- Const: Always sample (useful for development)
- Probabilistic: Sample a percentage of traces
- Rate Limiting: Sample up to a certain rate per second
- Adaptive: Adjust sampling based on traffic
Integration with Other Tools
OpenTelemetry Integration
import (
    "log"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/jaeger"
    "go.opentelemetry.io/otel/sdk/resource"
    "go.opentelemetry.io/otel/sdk/trace"
    // The semconv package is versioned; use the path matching your otel release.
    semconv "go.opentelemetry.io/otel/semconv/v1.4.0"
)

func initTracer() {
    // Export spans to the Jaeger collector's HTTP endpoint.
    exp, err := jaeger.New(jaeger.WithCollectorEndpoint(jaeger.WithEndpoint("http://localhost:14268/api/traces")))
    if err != nil {
        log.Fatal(err)
    }
    tp := trace.NewTracerProvider(
        trace.WithBatcher(exp),
        trace.WithResource(resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceNameKey.String("my-service"),
        )),
    )
    otel.SetTracerProvider(tp)
}
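Once the tracer provider is registered, spans are created through the OpenTelemetry API rather than the OpenTracing one. A minimal sketch (function, span, and attribute names are placeholders); note that newer OpenTelemetry releases deprecate the dedicated Jaeger exporter in favor of sending OTLP directly to Jaeger's OTLP port:

import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
)

func handleUsers(ctx context.Context) {
    // Obtain a tracer from the globally registered provider.
    tracer := otel.Tracer("my-service")

    // Start a span; the returned context carries it to downstream calls.
    ctx, span := tracer.Start(ctx, "handle-users")
    defer span.End()

    span.SetAttributes(attribute.String("http.method", "GET"))
    _ = ctx // pass ctx onward so child spans nest under this one
}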
Prometheus Integration
Jaeger components report internal metrics in Prometheus format and expose them on their admin HTTP ports (for example, 14269 for the collector). The metrics backend can be set explicitly with the --metrics-backend flag or the METRICS_BACKEND environment variable:
# Scrape target for Prometheus: the collector's admin endpoint
curl http://localhost:14269/metrics
Troubleshooting
Common Issues
- Traces not appearing in UI
  - Check if the Jaeger collector is running
  - Verify the client configuration (agent/collector endpoint)
  - Check the sampling configuration
- High memory usage
  - Adjust batch sizes
  - Configure appropriate sampling
  - Check for memory leaks in instrumentation
- Storage issues
  - Monitor storage backend health
  - Configure appropriate retention policies
  - Consider data archiving
Debug Commands
# Check Jaeger collector health
curl http://localhost:14269/
# List known services via the query API (exercises the storage backend)
curl http://localhost:16686/api/services
# View trace by trace ID
curl "http://localhost:16686/api/traces/{trace-id}"
Performance Considerations
Sampling
- Use probabilistic sampling in production
- Consider adaptive sampling for high-traffic services
- Monitor sampling overhead
Storage
- Choose appropriate storage backend
- Configure retention policies
- Consider data archiving strategies
Network
- Use Jaeger agent for high-throughput scenarios
- Configure appropriate batch sizes (see the sketch after this list)
- Monitor network bandwidth usage
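With the jaeger-client-go configuration shown earlier, batching is tuned on the ReporterConfig; the values below are illustrative starting points, not recommendations:

Reporter: &config.ReporterConfig{
    // Size of the in-memory span queue before spans are dropped.
    QueueSize: 1000,
    // How often buffered spans are flushed to the agent or collector.
    BufferFlushInterval: 1 * time.Second,
    LocalAgentHostPort:  "localhost:6831",
},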
Further Resources
- Jaeger Documentation: https://www.jaegertracing.io/docs/
- OpenTracing Specification: https://opentracing.io/
- OpenTelemetry: https://opentelemetry.io/
- Jaeger GitHub: https://github.com/jaegertracing/jaeger