
OpenTelemetry: Observability Framework

OpenTelemetry is a collection of APIs, SDKs, and tools for instrumenting, generating, collecting, and exporting telemetry data (metrics, logs, and traces) to help you analyze your software's performance and behavior. It's a CNCF project that provides a unified approach to observability.

What is OpenTelemetry?

OpenTelemetry is the result of merging the OpenTracing and OpenCensus projects into a single, vendor-neutral observability framework. It enables you to:

  • Instrument your applications with minimal code changes
  • Generate telemetry data (traces, metrics, logs)
  • Collect and export data to various backends
  • Analyze application performance and behavior

Three Pillars of Observability

1. Traces

Traces show the path of a request through your system, helping you understand:

  • Which services are involved
  • How long each operation takes
  • Where bottlenecks occur
  • How services communicate

2. Metrics

Metrics are numerical measurements over time, including:

  • System metrics (CPU, memory, disk)
  • Application metrics (request rate, error rate)
  • Business metrics (user registrations, sales)

3. Logs

Logs are discrete events with timestamps, providing:

  • Detailed information about what happened
  • Context for debugging issues
  • Audit trails for compliance

Key Components

OpenTelemetry API

  • Language-specific APIs for instrumentation
  • Vendor-neutral interfaces
  • Consistent across programming languages

OpenTelemetry SDK

  • Implementation of the API
  • Configuration and customization
  • Resource detection and processing

OpenTelemetry Collector

  • Data pipeline for telemetry data
  • Receivers for different data sources
  • Processors for data transformation
  • Exporters for various backends

Instrumentation Libraries

  • Auto-instrumentation for popular frameworks
  • Manual instrumentation APIs
  • Semantic conventions for consistency

Installation and Setup

Go Example

package main

import (
    "context"
    "log"
    "time"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/exporters/jaeger"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.17.0"
)

func initTracer() func() {
    // Create Jaeger exporter
    exp, err := jaeger.New(jaeger.WithCollectorEndpoint(jaeger.WithEndpoint("http://localhost:14268/api/traces")))
    if err != nil {
        log.Fatal(err)
    }

    // Create resource describing this service
    res, err := resource.New(context.Background(),
        resource.WithAttributes(
            semconv.ServiceNameKey.String("my-service"),
            semconv.ServiceVersionKey.String("1.0.0"),
        ),
    )
    if err != nil {
        log.Fatal(err)
    }

    // Create tracer provider
    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exp),
        sdktrace.WithResource(res),
    )

    // Set global tracer provider
    otel.SetTracerProvider(tp)

    // Return cleanup function that flushes remaining spans
    return func() {
        if err := tp.Shutdown(context.Background()); err != nil {
            log.Fatal(err)
        }
    }
}

func main() {
    cleanup := initTracer()
    defer cleanup()

    tracer := otel.Tracer("my-service")

    // Create a root span
    ctx, span := tracer.Start(context.Background(), "main-operation")
    defer span.End()

    // Add attributes
    span.SetAttributes(
        semconv.ServiceNameKey.String("my-service"),
        semconv.ServiceVersionKey.String("1.0.0"),
    )

    // Simulate work
    time.Sleep(100 * time.Millisecond)

    // Create a child span from the root span's context
    _, childSpan := tracer.Start(ctx, "child-operation")
    defer childSpan.End()

    childSpan.SetAttributes(
        attribute.String("operation.name", "processing"),
    )

    time.Sleep(50 * time.Millisecond)

    log.Println("Operation completed")
}

Python Example

from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.resources import Resource
import time

def init_tracer():
    # Create Jaeger exporter
    jaeger_exporter = JaegerExporter(
        agent_host_name="localhost",
        agent_port=6831,
    )

    # Create resource
    resource = Resource.create({
        "service.name": "my-python-service",
        "service.version": "1.0.0",
    })

    # Create tracer provider
    tracer_provider = TracerProvider(resource=resource)
    trace.set_tracer_provider(tracer_provider)

    # Add span processor
    span_processor = BatchSpanProcessor(jaeger_exporter)
    tracer_provider.add_span_processor(span_processor)

    return trace.get_tracer(__name__)

def main():
    tracer = init_tracer()

    with tracer.start_as_current_span("main-operation") as span:
        span.set_attributes({
            "service.name": "my-python-service",
            "service.version": "1.0.0",
        })

        time.sleep(0.1)  # Simulate work

        with tracer.start_as_current_span("child-operation") as child_span:
            child_span.set_attributes({
                "operation.name": "processing",
            })
            time.sleep(0.05)

    print("Operation completed")

if __name__ == "__main__":
    main()

OpenTelemetry Collector

Configuration Example

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_mib: 512

exporters:
  jaeger:
    endpoint: jaeger:14250
    tls:
      insecure: true
  prometheus:
    endpoint: "0.0.0.0:8889"
  logging:
    loglevel: debug

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [jaeger, logging]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus, logging]
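
The pipeline above accepts OTLP, so applications can send spans to the Collector instead of directly to Jaeger. A minimal Go sketch, assuming the Collector is reachable on the default OTLP gRPC port (4317) without TLS:

import (
    "context"
    "log"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func initOTLPTracer() *sdktrace.TracerProvider {
    // Connect to the Collector's OTLP gRPC receiver on port 4317
    exp, err := otlptracegrpc.New(context.Background(),
        otlptracegrpc.WithEndpoint("localhost:4317"),
        otlptracegrpc.WithInsecure(),
    )
    if err != nil {
        log.Fatal(err)
    }

    tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp))
    otel.SetTracerProvider(tp)
    return tp
}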

Docker Compose Setup

version: "3.8"
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC receiver
      - "4318:4318"   # OTLP HTTP receiver
      - "8889:8889"   # Prometheus metrics
    depends_on:
      - jaeger
      - prometheus

  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"
      - "14250:14250"
    environment:
      - COLLECTOR_OTLP_ENABLED=true

  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

Auto-Instrumentation

Go Auto-Instrumentation

import (
    "net/http"

    "github.com/gin-gonic/gin"

    "go.opentelemetry.io/contrib/instrumentation/github.com/gin-gonic/gin/otelgin"
    "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

// HTTP client instrumentation
client := &http.Client{
    Transport: otelhttp.NewTransport(http.DefaultTransport),
}

// Gin framework instrumentation
r := gin.Default()
r.Use(otelgin.Middleware("my-service"))
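
On the server side, incoming HTTP requests can be traced by wrapping a handler with otelhttp.NewHandler. A minimal sketch; the route and operation name here are illustrative:

import (
    "log"
    "net/http"

    "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

func main() {
    mux := http.NewServeMux()
    mux.HandleFunc("/api/users", func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte("ok"))
    })

    // Wrap the mux so every incoming request gets a span;
    // "http-server" becomes part of the span name.
    handler := otelhttp.NewHandler(mux, "http-server")
    log.Fatal(http.ListenAndServe(":8080", handler))
}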

Python Auto-Instrumentation

from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.psycopg2 import Psycopg2Instrumentor

# Auto-instrument libraries
RequestsInstrumentor().instrument()
FlaskInstrumentor().instrument_app(app)
Psycopg2Instrumentor().instrument()

Metrics Collection

Go Metrics Example

import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/metric"
)

func main() {
    // Obtain a meter from the global MeterProvider
    meter := otel.Meter("my-service")

    // Create a counter
    requestCounter, _ := meter.Int64Counter(
        "requests_total",
        metric.WithDescription("Total number of requests"),
    )

    // Create a histogram
    requestDuration, _ := meter.Float64Histogram(
        "request_duration_seconds",
        metric.WithDescription("Request duration in seconds"),
    )

    // Record metrics
    requestCounter.Add(context.Background(), 1)
    requestDuration.Record(context.Background(), 0.5)
}
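
The example above records against the global MeterProvider, so nothing is exported until an SDK provider is installed. A minimal setup sketch, assuming the stdout metric exporter for local testing (the OTLP or Prometheus exporters plug in the same way):

import (
    "context"
    "log"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/stdout/stdoutmetric"
    sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func initMeter() func() {
    // Exporter that prints metrics to stdout; useful for local testing
    exp, err := stdoutmetric.New()
    if err != nil {
        log.Fatal(err)
    }

    // A periodic reader collects and exports metrics on an interval
    mp := sdkmetric.NewMeterProvider(
        sdkmetric.WithReader(sdkmetric.NewPeriodicReader(exp)),
    )
    otel.SetMeterProvider(mp)

    // Cleanup function flushes any remaining metrics
    return func() {
        if err := mp.Shutdown(context.Background()); err != nil {
            log.Fatal(err)
        }
    }
}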

Python Metrics Example

from opentelemetry import metrics

meter = metrics.get_meter("my-python-service")

# Create a counter
request_counter = meter.create_counter(
    name="requests_total",
    description="Total number of requests",
)

# Create a histogram
request_duration = meter.create_histogram(
    name="request_duration_seconds",
    description="Request duration in seconds",
)

# Record metrics
request_counter.add(1)
request_duration.record(0.5)

Logging Integration

Go Logging Example

import (
    "context"
    "log/slog"

    "go.opentelemetry.io/otel/trace"
)

// logWithTrace emits a structured log record carrying the current
// trace and span IDs so logs can be correlated with traces.
func logWithTrace(ctx context.Context, message string) {
    spanContext := trace.SpanFromContext(ctx).SpanContext()

    slog.InfoContext(ctx, message,
        "trace_id", spanContext.TraceID().String(),
        "span_id", spanContext.SpanID().String(),
    )
}

Python Logging Example

import logging
from opentelemetry import trace

def log_with_trace(message):
    span = trace.get_current_span()
    span_context = span.get_span_context()
    if span_context.is_valid:
        # Format the trace ID as the usual 32-character hex string
        logging.info("%s - trace_id: %032x", message, span_context.trace_id)
    else:
        logging.info(message)

Best Practices

1. Semantic Conventions

import "go.opentelemetry.io/otel/semconv/v1.17.0"

// Use semantic conventions for consistent attributes
span.SetAttributes(
semconv.HTTPMethodKey.String("GET"),
semconv.HTTPURLKey.String("/api/users"),
semconv.HTTPStatusCodeKey.Int(200),
)

2. Resource Attributes

resource := resource.NewWithAttributes(
    semconv.SchemaURL,
    semconv.ServiceNameKey.String("my-service"),
    semconv.ServiceVersionKey.String("1.0.0"),
    semconv.DeploymentEnvironmentKey.String("production"),
)

3. Sampling Configuration

sampler := sdktrace.TraceIDRatioBased(0.1) // Sample 10% of traces

tp := sdktrace.NewTracerProvider(
    sdktrace.WithSampler(sampler),
    sdktrace.WithBatcher(exporter),
)

4. Error Handling

defer func() {
    if r := recover(); r != nil {
        span.RecordError(fmt.Errorf("panic: %v", r))
        span.SetStatus(codes.Error, "panic occurred")
        panic(r)
    }
}()
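
The same two calls also cover ordinary returned errors, as in the sketch below; doWork is a hypothetical stand-in for real application logic:

import (
    "context"
    "fmt"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/codes"
)

// handleRequest records a returned error on the active span so it is
// visible in the trace backend. doWork is a hypothetical helper.
func handleRequest(ctx context.Context) error {
    ctx, span := otel.Tracer("my-service").Start(ctx, "handle-request")
    defer span.End()

    if err := doWork(ctx); err != nil {
        span.RecordError(err)
        span.SetStatus(codes.Error, err.Error())
        return fmt.Errorf("handle request: %w", err)
    }

    span.SetStatus(codes.Ok, "")
    return nil
}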

Integration Examples

Prometheus Integration

# otel-collector-config.yaml
exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
    const_labels:
      label1: value1

Jaeger Integration

exporters:
  jaeger:
    endpoint: jaeger:14250
    tls:
      insecure: true
    sending_queue:
      num_consumers: 10
      queue_size: 1000

Grafana Integration

# grafana-datasource.yaml
apiVersion: 1
datasources:
  - name: Jaeger
    type: jaeger
    url: http://jaeger:16686
    access: proxy
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    access: proxy

Performance Considerations

Batch Processing

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
    send_batch_max_size: 2048

Memory Management

processors:
  memory_limiter:
    limit_mib: 512
    spike_limit_mib: 128
    check_interval: 5s

Sampling

// Head-based sampling: the decision is made at the root span
sampler := sdktrace.TraceIDRatioBased(0.1)

// Respect the parent's sampling decision; sample 10% of root spans.
// This is still head-based sampling; true tail-based sampling is done
// in the Collector (e.g. with the tail_sampling processor).
sampler := sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.1))
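
ParentBased is a composite sampler whose per-case behavior can be overridden. The sketch below spells out its defaults explicitly, using options from the Go SDK's sdktrace package:

import sdktrace "go.opentelemetry.io/otel/sdk/trace"

// Root spans: sample 10%. Spans with a sampled remote parent: always
// sample, so distributed traces stay complete. Spans with an unsampled
// remote parent: never sample. (These are also the defaults.)
sampler := sdktrace.ParentBased(
    sdktrace.TraceIDRatioBased(0.1),
    sdktrace.WithRemoteParentSampled(sdktrace.AlwaysSample()),
    sdktrace.WithRemoteParentNotSampled(sdktrace.NeverSample()),
)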

Troubleshooting

Common Issues

  1. No traces appearing

    • Check collector configuration
    • Verify exporter endpoints
    • Check sampling configuration
  2. High memory usage

    • Adjust batch sizes
    • Configure memory limits
    • Check for memory leaks
  3. Performance issues

    • Enable batching
    • Configure appropriate sampling
    • Monitor resource usage

Debug Commands

# Check collector health
curl http://localhost:13133/

# Check metrics endpoint
curl http://localhost:8889/metrics

# View collector logs
docker logs otel-collector

Further Resources