OpenTelemetry: Observability Framework
OpenTelemetry is a collection of APIs, SDKs, and tools for instrumenting, generating, collecting, and exporting telemetry data (metrics, logs, and traces) to help you analyze your software's performance and behavior. It's a CNCF project that provides a unified approach to observability.
What is OpenTelemetry?
OpenTelemetry is the merger of the OpenTracing and OpenCensus projects, providing a single, vendor-neutral observability framework. It enables you to:
- Instrument your applications with minimal code changes
- Generate telemetry data (traces, metrics, logs)
- Collect and export data to various backends
- Analyze application performance and behavior
Three Pillars of Observability
1. Traces
Traces show the path of a request through your system, helping you understand:
- Which services are involved
- How long each operation takes
- Where bottlenecks occur
- How services communicate
2. Metrics
Metrics are numerical measurements over time, including:
- System metrics (CPU, memory, disk)
- Application metrics (request rate, error rate)
- Business metrics (user registrations, sales)
3. Logs
Logs are discrete events with timestamps, providing:
- Detailed information about what happened
- Context for debugging issues
- Audit trails for compliance
Key Components
OpenTelemetry API
- Language-specific APIs for instrumentation
- Vendor-neutral interfaces
- Consistent across programming languages
OpenTelemetry SDK
- Implementation of the API
- Configuration and customization
- Resource detection and processing
OpenTelemetry Collector
- Data pipeline for telemetry data
- Receivers for different data sources
- Processors for data transformation
- Exporters for various backends
Instrumentation Libraries
- Auto-instrumentation for popular frameworks
- Manual instrumentation APIs
- Semantic conventions for consistency
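The practical consequence of the API/SDK split is that libraries and shared code depend only on the API, while the executable wires up the SDK (exporter, resource, sampling). Below is a minimal Go sketch of the library side; the payments package and Charge function are illustrative, and the SDK side is exactly what initTracer does in the example that follows.
// Library code: depends only on the API module (go.opentelemetry.io/otel).
// If the application never installs an SDK, these calls become cheap no-ops.
package payments

import (
    "context"

    "go.opentelemetry.io/otel"
)

func Charge(ctx context.Context, amountCents int) error {
    // Tracers come from the globally registered provider; the library does
    // not care whether that provider is the SDK or the default no-op.
    ctx, span := otel.Tracer("payments").Start(ctx, "charge")
    defer span.End()

    // ... business logic, passing ctx on so downstream calls join the trace ...
    return nil
}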
Installation and Setup
Go Example
package main

import (
    "context"
    "log"
    "time"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/exporters/jaeger"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.17.0"
)

func initTracer() func() {
    // Create Jaeger exporter
    exp, err := jaeger.New(jaeger.WithCollectorEndpoint(jaeger.WithEndpoint("http://localhost:14268/api/traces")))
    if err != nil {
        log.Fatal(err)
    }

    // Create a resource describing this service
    res, err := resource.New(context.Background(),
        resource.WithAttributes(
            semconv.ServiceNameKey.String("my-service"),
            semconv.ServiceVersionKey.String("1.0.0"),
        ),
    )
    if err != nil {
        log.Fatal(err)
    }

    // Create tracer provider
    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exp),
        sdktrace.WithResource(res),
    )

    // Set global tracer provider
    otel.SetTracerProvider(tp)

    // Return cleanup function that flushes and shuts down the provider
    return func() {
        if err := tp.Shutdown(context.Background()); err != nil {
            log.Printf("error shutting down tracer provider: %v", err)
        }
    }
}

func main() {
    cleanup := initTracer()
    defer cleanup()

    tracer := otel.Tracer("my-service")

    // Create a root span
    ctx, span := tracer.Start(context.Background(), "main-operation")
    defer span.End()

    // Add attributes
    span.SetAttributes(
        semconv.ServiceNameKey.String("my-service"),
        semconv.ServiceVersionKey.String("1.0.0"),
    )

    // Simulate work
    time.Sleep(100 * time.Millisecond)

    // Create a child span; passing ctx makes it a child of main-operation
    _, childSpan := tracer.Start(ctx, "child-operation")
    childSpan.SetAttributes(
        attribute.String("operation.name", "processing"),
    )
    time.Sleep(50 * time.Millisecond)
    childSpan.End()

    log.Println("Operation completed")
}
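The context returned by tracer.Start is what carries the parent/child relationship. In practice you pass that context into helper functions, and each helper starts its own span from it. A minimal sketch, assuming the same imports as the example above (processOrder and the order.id attribute are illustrative):
// processOrder continues the trace started by its caller. It only depends
// on the OpenTelemetry API, not the SDK.
func processOrder(ctx context.Context, orderID string) error {
    tracer := otel.Tracer("my-service")

    // Starting a span from the incoming context makes it a child of the
    // caller's span.
    ctx, span := tracer.Start(ctx, "process-order")
    defer span.End()

    span.SetAttributes(attribute.String("order.id", orderID))

    // Any further calls that receive ctx (database queries, HTTP requests,
    // other helpers) keep extending the same trace.
    time.Sleep(20 * time.Millisecond) // simulate work
    return nil
}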
Python Example
from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.resources import Resource
import time


def init_tracer():
    # Create Jaeger exporter
    jaeger_exporter = JaegerExporter(
        agent_host_name="localhost",
        agent_port=6831,
    )

    # Create resource
    resource = Resource.create({
        "service.name": "my-python-service",
        "service.version": "1.0.0",
    })

    # Create tracer provider
    tracer_provider = TracerProvider(resource=resource)
    trace.set_tracer_provider(tracer_provider)

    # Add span processor
    span_processor = BatchSpanProcessor(jaeger_exporter)
    tracer_provider.add_span_processor(span_processor)

    return trace.get_tracer(__name__)


def main():
    tracer = init_tracer()

    with tracer.start_as_current_span("main-operation") as span:
        span.set_attributes({
            "service.name": "my-python-service",
            "service.version": "1.0.0",
        })
        time.sleep(0.1)  # Simulate work

        with tracer.start_as_current_span("child-operation") as child_span:
            child_span.set_attributes({
                "operation.name": "processing",
            })
            time.sleep(0.05)

    print("Operation completed")


if __name__ == "__main__":
    main()
OpenTelemetry Collector
Configuration Example
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_mib: 512

exporters:
  jaeger:
    endpoint: jaeger:14250
    tls:
      insecure: true
  prometheus:
    endpoint: "0.0.0.0:8889"
  logging:
    loglevel: debug

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [jaeger, logging]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus, logging]
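To send the Go example's spans through this Collector instead of directly to Jaeger, swap the Jaeger exporter for an OTLP exporter pointed at the Collector's gRPC receiver. A minimal sketch; the endpoint and options match the receiver configured above, so adjust them for your deployment:
import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func initOTLPTracer(ctx context.Context) (*sdktrace.TracerProvider, error) {
    // Export spans over OTLP/gRPC to the Collector's receiver on port 4317.
    exp, err := otlptracegrpc.New(ctx,
        otlptracegrpc.WithEndpoint("localhost:4317"),
        otlptracegrpc.WithInsecure(), // plaintext gRPC; use TLS credentials in production
    )
    if err != nil {
        return nil, err
    }

    tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp))
    otel.SetTracerProvider(tp)
    return tp, nil
}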
Docker Compose Setup
version: "3.8"

services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC receiver
      - "4318:4318"   # OTLP HTTP receiver
      - "8889:8889"   # Prometheus metrics
    depends_on:
      - jaeger
      - prometheus

  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"
      - "14250:14250"
    environment:
      - COLLECTOR_OTLP_ENABLED=true

  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
Auto-Instrumentation
Go Auto-Instrumentation
import (
    "net/http"

    "github.com/gin-gonic/gin"
    "go.opentelemetry.io/contrib/instrumentation/github.com/gin-gonic/gin/otelgin"
    "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

// HTTP client instrumentation: every outgoing request gets a client span
// and trace-context headers.
client := &http.Client{
    Transport: otelhttp.NewTransport(http.DefaultTransport),
}

// Gin framework instrumentation: every incoming request gets a server span.
r := gin.Default()
r.Use(otelgin.Middleware("my-service"))
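For plain net/http servers (without Gin), the same contrib family provides otelhttp.NewHandler, which wraps any http.Handler. A minimal sketch, assuming a tracer provider has already been registered as in the setup example above (the handler and port are illustrative):
package main

import (
    "log"
    "net/http"

    "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

func main() {
    // Business handler: otelhttp creates a server span for each request and
    // extracts any incoming trace context from the request headers.
    hello := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte("hello"))
    })

    // Wrap the handler; "hello-endpoint" becomes the span name.
    wrapped := otelhttp.NewHandler(hello, "hello-endpoint")

    log.Fatal(http.ListenAndServe(":8080", wrapped))
}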
Python Auto-Instrumentation
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.psycopg2 import Psycopg2Instrumentor
# Auto-instrument libraries
RequestsInstrumentor().instrument()
FlaskInstrumentor().instrument_app(app)
Psycopg2Instrumentor().instrument()
Metrics Collection
Go Metrics Example
import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/metric"
)

func main() {
    // Obtain a meter from the globally registered meter provider
    meter := otel.Meter("my-service")

    // Create a counter
    requestCounter, _ := meter.Int64Counter(
        "requests_total",
        metric.WithDescription("Total number of requests"),
    )

    // Create a histogram
    requestDuration, _ := meter.Float64Histogram(
        "request_duration_seconds",
        metric.WithDescription("Request duration in seconds"),
    )

    // Record metrics
    requestCounter.Add(context.Background(), 1)
    requestDuration.Record(context.Background(), 0.5)
}
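Instruments obtained from otel.Meter are no-ops until an SDK MeterProvider is registered. A minimal sketch using the stdout metric exporter; the exporter choice and interval are illustrative, and in the Collector setup above you would normally export via OTLP instead:
import (
    "log"
    "time"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/stdout/stdoutmetric"
    sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func initMeter() *sdkmetric.MeterProvider {
    // Print metrics to stdout; swap for an OTLP exporter in real deployments.
    exp, err := stdoutmetric.New()
    if err != nil {
        log.Fatal(err)
    }

    mp := sdkmetric.NewMeterProvider(
        // Collect and export on a fixed interval.
        sdkmetric.WithReader(sdkmetric.NewPeriodicReader(exp, sdkmetric.WithInterval(10*time.Second))),
    )

    // Register globally so otel.Meter(...) returns instruments backed by this provider.
    otel.SetMeterProvider(mp)
    return mp
}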
Python Metrics Example
from opentelemetry import metrics

meter = metrics.get_meter("my-python-service")

# Create a counter
request_counter = meter.create_counter(
    name="requests_total",
    description="Total number of requests",
)

# Create a histogram
request_duration = meter.create_histogram(
    name="request_duration_seconds",
    description="Request duration in seconds",
)

# Record metrics (no-ops until an SDK MeterProvider is configured)
request_counter.add(1)
request_duration.record(0.5)
Logging Integration
Go Logging Example
import (
    "context"
    "log/slog"

    "go.opentelemetry.io/otel/trace"
)

// logWithTrace emits a structured log line carrying the current trace and
// span IDs, so log records can be correlated with traces in the backend.
func logWithTrace(ctx context.Context, message string) {
    spanContext := trace.SpanFromContext(ctx).SpanContext()
    slog.InfoContext(ctx, message,
        "trace_id", spanContext.TraceID().String(),
        "span_id", spanContext.SpanID().String(),
    )
}
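Called inside an active span, the IDs in the log record match the exported trace, which is what makes log/trace correlation possible in the backend. A short usage sketch, assuming the otel import and tracer setup from the earlier example (handleRequest is illustrative):
func handleRequest(ctx context.Context) {
    ctx, span := otel.Tracer("my-service").Start(ctx, "handle-request")
    defer span.End()

    // The emitted log record carries this span's trace_id and span_id.
    logWithTrace(ctx, "processing request")
}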
Python Logging Example
import logging
from opentelemetry import trace

def log_with_trace(message):
    # get_current_span() returns a non-recording span (never None) when there
    # is no active trace, so check the span context instead.
    span_context = trace.get_current_span().get_span_context()
    if span_context.is_valid:
        logging.info(f"{message} - trace_id: {format(span_context.trace_id, '032x')}")
    else:
        logging.info(message)
Best Practices
1. Semantic Conventions
import "go.opentelemetry.io/otel/semconv/v1.17.0"
// Use semantic conventions for consistent attributes
span.SetAttributes(
semconv.HTTPMethodKey.String("GET"),
semconv.HTTPURLKey.String("/api/users"),
semconv.HTTPStatusCodeKey.Int(200),
)
2. Resource Attributes
// Name the variable res to avoid shadowing the resource package
res := resource.NewWithAttributes(
    semconv.SchemaURL,
    semconv.ServiceNameKey.String("my-service"),
    semconv.ServiceVersionKey.String("1.0.0"),
    semconv.DeploymentEnvironmentKey.String("production"),
)
3. Sampling Configuration
sampler := sdktrace.TraceIDRatioBased(0.1) // Sample 10% of traces
tp := sdktrace.NewTracerProvider(
    sdktrace.WithSampler(sampler),
    sdktrace.WithBatcher(exporter),
)
4. Error Handling
// Record panics on the active span before re-panicking
// (codes is go.opentelemetry.io/otel/codes)
defer func() {
    if r := recover(); r != nil {
        span.RecordError(fmt.Errorf("panic: %v", r))
        span.SetStatus(codes.Error, "panic occurred")
        panic(r)
    }
}()
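Panics are the exceptional case; the more common pattern is to record returned errors on the span so failed operations show up as errored spans. A minimal sketch, assuming the context, otel, and codes imports from the snippets above (doWork is a hypothetical function):
func handle(ctx context.Context) error {
    ctx, span := otel.Tracer("my-service").Start(ctx, "handle")
    defer span.End()

    if err := doWork(ctx); err != nil {
        // Attach the error as a span event and mark the span as failed.
        span.RecordError(err)
        span.SetStatus(codes.Error, err.Error())
        return err
    }

    span.SetStatus(codes.Ok, "")
    return nil
}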
Integration Examples
Prometheus Integration
# otel-collector-config.yaml
exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
    const_labels:
      label1: value1
Relabeling rules (metric_relabel_configs) belong in the Prometheus server's scrape configuration rather than in this exporter.
Jaeger Integration
exporters:
  jaeger:
    endpoint: jaeger:14250
    tls:
      insecure: true
    sending_queue:
      num_consumers: 10
      queue_size: 1000
Grafana Integration
# grafana-datasource.yaml
apiVersion: 1
datasources:
  - name: Jaeger
    type: jaeger
    url: http://jaeger:16686
    access: proxy
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    access: proxy
Performance Considerations
Batch Processing
processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
    send_batch_max_size: 2048
Memory Management
processors:
  memory_limiter:
    limit_mib: 512
    spike_limit_mib: 128
    check_interval: 5s
Sampling
// Head-based sampling: keep a fixed fraction of new traces
sampler := sdktrace.TraceIDRatioBased(0.1)

// Parent-based sampling: respect the caller's decision, ratio-sample new roots
sampler = sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.1))
True tail-based sampling (deciding after a whole trace has been collected) is not done in the SDK; it is handled in the OpenTelemetry Collector with the tail_sampling processor.
Troubleshooting
Common Issues
- No traces appearing
  - Check collector configuration
  - Verify exporter endpoints
  - Check sampling configuration
- High memory usage
  - Adjust batch sizes
  - Configure memory limits
  - Check for memory leaks
- Performance issues
  - Enable batching
  - Configure appropriate sampling
  - Monitor resource usage
Debug Commands
# Check collector health (requires the health_check extension to be enabled)
curl http://localhost:13133/
# Check metrics endpoint
curl http://localhost:8889/metrics
# View collector logs
docker logs otel-collector
Further Resources
- OpenTelemetry Documentation: https://opentelemetry.io/docs/
- OpenTelemetry Go: https://github.com/open-telemetry/opentelemetry-go
- OpenTelemetry Python: https://github.com/open-telemetry/opentelemetry-python
- OpenTelemetry Collector: https://github.com/open-telemetry/opentelemetry-collector
- Semantic Conventions: https://opentelemetry.io/docs/reference/specification/semantic_conventions/