
OpenTelemetry: Observability Framework

OpenTelemetry is a collection of APIs, SDKs, and tools for instrumenting, generating, collecting, and exporting telemetry data (metrics, logs, and traces) to help you analyze your software's performance and behavior. It's a CNCF project that provides a unified approach to observability.

What is OpenTelemetry?

OpenTelemetry is the result of merging the OpenTracing and OpenCensus projects into a single, vendor-neutral observability framework. It enables you to:

  • Instrument your applications with minimal code changes
  • Generate telemetry data (traces, metrics, logs)
  • Collect and export data to various backends
  • Analyze application performance and behavior

Three Pillars of Observability

1. Traces

Traces show the path of a request through your system, helping you understand:

  • Which services are involved
  • How long each operation takes
  • Where bottlenecks occur
  • How services communicate

2. Metrics

Metrics are numerical measurements over time, including:

  • System metrics (CPU, memory, disk)
  • Application metrics (request rate, error rate)
  • Business metrics (user registrations, sales)

3. Logs

Logs are discrete events with timestamps, providing:

  • Detailed information about what happened
  • Context for debugging issues
  • Audit trails for compliance

Key Components

OpenTelemetry API

  • Language-specific APIs for instrumentation
  • Vendor-neutral interfaces
  • Consistent across programming languages

OpenTelemetry SDK

  • Implementation of the API
  • Configuration and customization
  • Resource detection and processing

OpenTelemetry Collector

  • Data pipeline for telemetry data
  • Receivers for different data sources
  • Processors for data transformation
  • Exporters for various backends

Instrumentation Libraries

  • Auto-instrumentation for popular frameworks
  • Manual instrumentation APIs
  • Semantic conventions for consistency

Installation and Setup

Go Example

package main

import (
    "context"
    "log"
    "time"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/exporters/jaeger"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.17.0"
)

func initTracer() func() {
    // Create Jaeger exporter
    exp, err := jaeger.New(jaeger.WithCollectorEndpoint(jaeger.WithEndpoint("http://localhost:14268/api/traces")))
    if err != nil {
        log.Fatal(err)
    }

    // Create resource describing this service
    res, err := resource.New(context.Background(),
        resource.WithAttributes(
            semconv.ServiceNameKey.String("my-service"),
            semconv.ServiceVersionKey.String("1.0.0"),
        ),
    )
    if err != nil {
        log.Fatal(err)
    }

    // Create tracer provider
    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exp),
        sdktrace.WithResource(res),
    )

    // Set global tracer provider
    otel.SetTracerProvider(tp)

    // Return cleanup function that flushes remaining spans
    return func() {
        if err := tp.Shutdown(context.Background()); err != nil {
            log.Fatal(err)
        }
    }
}

func main() {
    cleanup := initTracer()
    defer cleanup()

    tracer := otel.Tracer("my-service")

    // Create a root span
    ctx, span := tracer.Start(context.Background(), "main-operation")
    defer span.End()

    // Add attributes
    span.SetAttributes(
        semconv.ServiceNameKey.String("my-service"),
        semconv.ServiceVersionKey.String("1.0.0"),
    )

    // Simulate work
    time.Sleep(100 * time.Millisecond)

    // Create a child span from the root span's context
    _, childSpan := tracer.Start(ctx, "child-operation")
    defer childSpan.End()

    childSpan.SetAttributes(
        attribute.String("operation.name", "processing"),
    )

    time.Sleep(50 * time.Millisecond)

    log.Println("Operation completed")
}

Python Example

from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.resources import Resource
import time

def init_tracer():
    # Create Jaeger exporter
    jaeger_exporter = JaegerExporter(
        agent_host_name="localhost",
        agent_port=6831,
    )

    # Create resource
    resource = Resource.create({
        "service.name": "my-python-service",
        "service.version": "1.0.0",
    })

    # Create tracer provider
    tracer_provider = TracerProvider(resource=resource)
    trace.set_tracer_provider(tracer_provider)

    # Add span processor
    span_processor = BatchSpanProcessor(jaeger_exporter)
    tracer_provider.add_span_processor(span_processor)

    return trace.get_tracer(__name__)

def main():
    tracer = init_tracer()

    with tracer.start_as_current_span("main-operation") as span:
        span.set_attributes({
            "service.name": "my-python-service",
            "service.version": "1.0.0",
        })

        time.sleep(0.1)  # Simulate work

        with tracer.start_as_current_span("child-operation") as child_span:
            child_span.set_attributes({
                "operation.name": "processing",
            })
            time.sleep(0.05)

    print("Operation completed")

if __name__ == "__main__":
    main()

OpenTelemetry Collector

Configuration Example

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_mib: 512

exporters:
  jaeger:
    endpoint: jaeger:14250
    tls:
      insecure: true
  prometheus:
    endpoint: "0.0.0.0:8889"
  logging:
    loglevel: debug

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [jaeger, logging]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus, logging]
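
The pipeline above accepts OTLP, so applications can send spans to the Collector instead of directly to Jaeger. A minimal Go sketch, assuming the Collector is reachable on the default OTLP gRPC port (4317) without TLS:

import (
    "context"
    "log"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func initOTLPTracer() *sdktrace.TracerProvider {
    // Connect to the Collector's OTLP gRPC receiver on port 4317
    exp, err := otlptracegrpc.New(context.Background(),
        otlptracegrpc.WithEndpoint("localhost:4317"),
        otlptracegrpc.WithInsecure(),
    )
    if err != nil {
        log.Fatal(err)
    }

    tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp))
    otel.SetTracerProvider(tp)
    return tp
}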

Docker Compose Setup

version: "3.8"
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC receiver
      - "4318:4318"   # OTLP HTTP receiver
      - "8889:8889"   # Prometheus metrics
    depends_on:
      - jaeger
      - prometheus

  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"
      - "14250:14250"
    environment:
      - COLLECTOR_OTLP_ENABLED=true

  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

Auto-Instrumentation

Go Auto-Instrumentation

import (
    "net/http"

    "github.com/gin-gonic/gin"

    "go.opentelemetry.io/contrib/instrumentation/github.com/gin-gonic/gin/otelgin"
    "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

// HTTP client instrumentation
client := &http.Client{
    Transport: otelhttp.NewTransport(http.DefaultTransport),
}

// Gin framework instrumentation
r := gin.Default()
r.Use(otelgin.Middleware("my-service"))
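
On the server side, incoming HTTP requests can be traced by wrapping a handler with otelhttp.NewHandler. A minimal sketch; the route and operation name here are illustrative:

import (
    "log"
    "net/http"

    "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

func main() {
    mux := http.NewServeMux()
    mux.HandleFunc("/api/users", func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte("ok"))
    })

    // Wrap the mux so every incoming request gets a span;
    // "http-server" becomes part of the span name.
    handler := otelhttp.NewHandler(mux, "http-server")
    log.Fatal(http.ListenAndServe(":8080", handler))
}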

Python Auto-Instrumentation

from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.psycopg2 import Psycopg2Instrumentor

# Auto-instrument libraries
RequestsInstrumentor().instrument()
FlaskInstrumentor().instrument_app(app)
Psycopg2Instrumentor().instrument()

Metrics Collection

Go Metrics Example

import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/metric"
)

func main() {
    // Obtain a meter from the global MeterProvider
    meter := otel.Meter("my-service")

    // Create a counter
    requestCounter, _ := meter.Int64Counter(
        "requests_total",
        metric.WithDescription("Total number of requests"),
    )

    // Create a histogram
    requestDuration, _ := meter.Float64Histogram(
        "request_duration_seconds",
        metric.WithDescription("Request duration in seconds"),
    )

    // Record metrics
    requestCounter.Add(context.Background(), 1)
    requestDuration.Record(context.Background(), 0.5)
}
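
The example above records against the global MeterProvider, so nothing is exported until an SDK provider is installed. A minimal setup sketch, assuming the stdout metric exporter for local testing (the OTLP or Prometheus exporters plug in the same way):

import (
    "context"
    "log"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/stdout/stdoutmetric"
    sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func initMeter() func() {
    // Exporter that prints metrics to stdout; useful for local testing
    exp, err := stdoutmetric.New()
    if err != nil {
        log.Fatal(err)
    }

    // A periodic reader collects and exports metrics on an interval
    mp := sdkmetric.NewMeterProvider(
        sdkmetric.WithReader(sdkmetric.NewPeriodicReader(exp)),
    )
    otel.SetMeterProvider(mp)

    // Cleanup function flushes any remaining metrics
    return func() {
        if err := mp.Shutdown(context.Background()); err != nil {
            log.Fatal(err)
        }
    }
}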

Python Metrics Example

from opentelemetry import metrics

meter = metrics.get_meter("my-python-service")

# Create a counter
request_counter = meter.create_counter(
    name="requests_total",
    description="Total number of requests",
)

# Create a histogram
request_duration = meter.create_histogram(
    name="request_duration_seconds",
    description="Request duration in seconds",
)

# Record metrics
request_counter.add(1)
request_duration.record(0.5)

Logging Integration

Go Logging Example

import (
    "context"
    "log/slog"

    "go.opentelemetry.io/otel/trace"
)

// logWithTrace emits a structured log record carrying the current
// trace and span IDs so logs can be correlated with traces.
func logWithTrace(ctx context.Context, message string) {
    spanContext := trace.SpanFromContext(ctx).SpanContext()

    slog.InfoContext(ctx, message,
        "trace_id", spanContext.TraceID().String(),
        "span_id", spanContext.SpanID().String(),
    )
}

Python Logging Example

import logging
from opentelemetry import trace

def log_with_trace(message):
    span = trace.get_current_span()
    span_context = span.get_span_context()
    if span_context.is_valid:
        # Format the trace ID as the usual 32-character hex string
        logging.info("%s - trace_id: %032x", message, span_context.trace_id)
    else:
        logging.info(message)

Best Practices

1. Semantic Conventions

import "go.opentelemetry.io/otel/semconv/v1.17.0"

// Use semantic conventions for consistent attributes
span.SetAttributes(
semconv.HTTPMethodKey.String("GET"),
semconv.HTTPURLKey.String("/api/users"),
semconv.HTTPStatusCodeKey.Int(200),
)

2. Resource Attributes

resource := resource.NewWithAttributes(
    semconv.SchemaURL,
    semconv.ServiceNameKey.String("my-service"),
    semconv.ServiceVersionKey.String("1.0.0"),
    semconv.DeploymentEnvironmentKey.String("production"),
)

3. Sampling Configuration

sampler := sdktrace.TraceIDRatioBased(0.1) // Sample 10% of traces

tp := sdktrace.NewTracerProvider(
    sdktrace.WithSampler(sampler),
    sdktrace.WithBatcher(exporter),
)

4. Error Handling

defer func() {
    if r := recover(); r != nil {
        span.RecordError(fmt.Errorf("panic: %v", r))
        span.SetStatus(codes.Error, "panic occurred")
        panic(r)
    }
}()
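
The same two calls also cover ordinary returned errors, as in the sketch below; doWork is a hypothetical stand-in for real application logic:

import (
    "context"
    "fmt"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/codes"
)

// handleRequest records a returned error on the active span so it is
// visible in the trace backend. doWork is a hypothetical helper.
func handleRequest(ctx context.Context) error {
    ctx, span := otel.Tracer("my-service").Start(ctx, "handle-request")
    defer span.End()

    if err := doWork(ctx); err != nil {
        span.RecordError(err)
        span.SetStatus(codes.Error, err.Error())
        return fmt.Errorf("handle request: %w", err)
    }

    span.SetStatus(codes.Ok, "")
    return nil
}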

Integration Examples

Prometheus Integration

# otel-collector-config.yaml
exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
    const_labels:
      label1: value1

Jaeger Integration

exporters:
  jaeger:
    endpoint: jaeger:14250
    tls:
      insecure: true
    sending_queue:
      num_consumers: 10
      queue_size: 1000

Grafana Integration

# grafana-datasource.yaml
apiVersion: 1
datasources:
  - name: Jaeger
    type: jaeger
    url: http://jaeger:16686
    access: proxy
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    access: proxy

Performance Considerations

Batch Processing

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
    send_batch_max_size: 2048

Memory Management

processors:
  memory_limiter:
    limit_mib: 512
    spike_limit_mib: 128
    check_interval: 5s

Sampling

// Head-based sampling: the decision is made at the root span
sampler := sdktrace.TraceIDRatioBased(0.1)

// Respect the parent's sampling decision; sample 10% of root spans.
// This is still head-based sampling; true tail-based sampling is done
// in the Collector (e.g. with the tail_sampling processor).
sampler := sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.1))
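
ParentBased is a composite sampler whose per-case behavior can be overridden. The sketch below spells out its defaults explicitly, using options from the Go SDK's sdktrace package:

import sdktrace "go.opentelemetry.io/otel/sdk/trace"

// Root spans: sample 10%. Spans with a sampled remote parent: always
// sample, so distributed traces stay complete. Spans with an unsampled
// remote parent: never sample. (These are also the defaults.)
sampler := sdktrace.ParentBased(
    sdktrace.TraceIDRatioBased(0.1),
    sdktrace.WithRemoteParentSampled(sdktrace.AlwaysSample()),
    sdktrace.WithRemoteParentNotSampled(sdktrace.NeverSample()),
)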

Troubleshooting

Common Issues

  1. No traces appearing

    • Check collector configuration
    • Verify exporter endpoints
    • Check sampling configuration
  2. High memory usage

    • Adjust batch sizes
    • Configure memory limits
    • Check for memory leaks
  3. Performance issues

    • Enable batching
    • Configure appropriate sampling
    • Monitor resource usage

Debug Commands

# Check collector health
curl http://localhost:13133/

# Check metrics endpoint
curl http://localhost:8889/metrics

# View collector logs
docker logs otel-collector

Further Resources