Tracing & APM
Distributed tracing solves a problem that metrics and logs cannot: following a single request across service boundaries. When a user clicks “Place Order”, the request may travel through a load balancer, authentication service, product catalog, order service, payment gateway, and notification service. If one of them is slow, tracing reveals exactly where the latency lives.
Core Tracing Concepts
A trace represents the end-to-end journey of a single request. It is composed of spans, each representing a unit of work (an HTTP request, a database query, a queue publish). Spans contain:
- Span ID — unique identifier for this unit of work
- Trace ID — shared across all spans in the request
- Parent Span ID — links this span to its caller
- Timing — start time and duration
- Attributes — HTTP method, URL, status code, error details
- Events — application-level annotations (e.g., “cache miss”, “retry attempt”)
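The fields above can be sketched as a small data model. This is an illustrative sketch only: the `Span` class, the `self_time_ms` helper, and the sample trace are hypothetical names and values, not part of any tracing SDK, and the self-time calculation assumes child spans do not overlap.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    trace_id: str                   # shared by every span in the request
    span_id: str                    # unique to this unit of work
    parent_span_id: Optional[str]   # None for the root span
    name: str
    start_ms: float                 # offset from the start of the trace
    duration_ms: float
    attributes: dict = field(default_factory=dict)
    events: list = field(default_factory=list)


def self_time_ms(span, spans):
    """Time spent in a span itself, excluding its direct children."""
    children = [s for s in spans if s.parent_span_id == span.span_id]
    return span.duration_ms - sum(c.duration_ms for c in children)


def slowest_by_self_time(spans):
    """Return the span where the most time was actually spent."""
    return max(spans, key=lambda s: self_time_ms(s, spans))


# A made-up trace for one "Place Order" request
trace = [
    Span("t1", "a", None, "POST /orders", 0, 480, {"http.method": "POST"}),
    Span("t1", "b", "a", "auth.verify", 5, 20),
    Span("t1", "c", "a", "payment.charge", 30, 400, events=["retry attempt"]),
    Span("t1", "d", "c", "db.insert", 40, 35),
]

print(slowest_by_self_time(trace).name)  # payment.charge
```

The parent links are what let a tracing backend turn a flat list of spans into a tree and attribute latency to the right service.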
AWS X-Ray
X-Ray is AWS’s managed tracing service. It receives trace data from instrumented applications and the X-Ray daemon, then provides a service map showing connections and latency between components.
Instrumenting with the X-Ray SDK
Python (Flask) example:
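
    from flask import Flask, jsonify
    from aws_xray_sdk.core import xray_recorder
    from aws_xray_sdk.ext.flask.middleware import XRayMiddleware

    app = Flask(__name__)

    # Name the service as it should appear on the X-Ray service map
    xray_recorder.configure(service='order-service')
    # Open a segment for every incoming Flask request
    XRayMiddleware(app, xray_recorder)

    @app.route('/api/orders')
    def list_orders():
        # Record the query as a subsegment so its latency is reported separately
        with xray_recorder.in_subsegment('query_database'):
            # `db` is the application's own database handle, defined elsewhere
            orders = db.query('SELECT * FROM orders')
        return jsonify(orders)

X-Ray Daemon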
The daemon runs as a sidecar or host-level service, buffering and batching trace data before sending it to the X-Ray API:

    # Run X-Ray daemon in a Docker container
    docker run --network host \
      -e AWS_REGION=us-east-1 \
      -v /home/ec2-user/.aws/:/root/.aws/ \
      amazon/aws-xray-daemon

For ECS Fargate tasks, deploy the daemon as a sidecar container in the same task definition.
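To make the daemon's role concrete, here is a rough sketch of what an SDK hands to it: a small JSON header, a newline, and a segment document, sent as one UDP datagram. The sketch assumes the daemon is listening on its default address `127.0.0.1:2000`; the helper names (`new_trace_id`, `build_datagram`, `send_segment`) are hypothetical, and in practice the X-Ray SDK does this for you.

```python
import json
import os
import socket
import time

DAEMON_ADDRESS = ("127.0.0.1", 2000)  # assumed: the daemon's default UDP listener
HEADER = '{"format": "json", "version": 1}'


def new_trace_id(now=None):
    """X-Ray trace IDs look like 1-<8 hex digits of epoch seconds>-<24 random hex digits>."""
    epoch = int(now if now is not None else time.time())
    return f"1-{epoch:08x}-{os.urandom(12).hex()}"


def build_datagram(name, start_time, end_time, trace_id=None):
    """Build the bytes for one completed segment (minimal fields only)."""
    segment = {
        "name": name,
        "id": os.urandom(8).hex(),  # 16 hex digits
        "trace_id": trace_id or new_trace_id(start_time),
        "start_time": start_time,   # epoch seconds, fractional
        "end_time": end_time,
    }
    return (HEADER + "\n" + json.dumps(segment)).encode("utf-8")


def send_segment(payload, address=DAEMON_ADDRESS):
    """Fire-and-forget UDP send; the daemon batches and uploads to the X-Ray API."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, address)


payload = build_datagram("order-service", 1700000000.0, 1700000000.25)
# send_segment(payload)  # uncomment when a daemon is actually running
```

Because the send is UDP to a local process, the application never blocks on the X-Ray API; the buffering and batching described above happen inside the daemon.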
OpenTelemetry
OpenTelemetry (OTel) is the industry-standard, vendor-neutral framework for generating telemetry. It consists of SDKs for instrumenting applications and the OpenTelemetry Collector for receiving, processing, and exporting telemetry to any backend.
OpenTelemetry Collector
The Collector is a vendor-agnostic proxy that sits between your instrumented applications and your observability backends:

    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318

    processors:
      batch:
        timeout: 1s
        send_batch_size: 1024

    exporters:
      awsxray:
        region: us-east-1
      prometheus:
        endpoint: 0.0.0.0:8889
      # Recent Collector releases replace the `logging` exporter with `debug`
      logging:
        loglevel: debug

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [awsxray, logging]
        metrics:
          receivers: [otlp]
          processors: [batch]
          exporters: [prometheus, logging]

Instrumenting with OpenTelemetry SDKs
Node.js example using the OpenTelemetry JS SDK:

    const { NodeSDK } = require('@opentelemetry/sdk-node');
    const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
    const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');

    const sdk = new NodeSDK({
      traceExporter: new OTLPTraceExporter({
        url: 'http://otel-collector:4317',
      }),
      instrumentations: [getNodeAutoInstrumentations()],
    });

    sdk.start();

The auto-instrumentation packages capture HTTP requests, database calls, and gRPC calls without any manual span creation.
Tip
OpenTelemetry’s big advantage is portability. Instrument once with the OTel SDK, point the Collector at any backend (X-Ray, Datadog, Jaeger, Zipkin, Grafana Tempo), and switch providers by changing exporter config — no code changes.
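The Collector's pipeline idea (buffer, batch, then fan out to whichever exporters are configured) can be illustrated with a toy model. This is only a sketch of the concept, not the Collector's implementation, which is written in Go. The `BatchProcessor` class and its `tick` method are hypothetical; the `send_batch_size` and `timeout_s` parameters mirror the `batch` processor settings in the config shown earlier.

```python
import time


class BatchProcessor:
    """Buffers items and flushes when the batch is full or the timeout has elapsed."""

    def __init__(self, exporters, send_batch_size=1024, timeout_s=1.0, clock=time.monotonic):
        self.exporters = exporters            # callables that accept a list of items
        self.send_batch_size = send_batch_size
        self.timeout_s = timeout_s
        self.clock = clock                    # injectable so the timeout can be tested
        self.buffer = []
        self.first_item_at = None

    def add(self, item):
        if not self.buffer:
            self.first_item_at = self.clock()
        self.buffer.append(item)
        if len(self.buffer) >= self.send_batch_size:
            self.flush()

    def tick(self):
        """Call periodically; flushes a partial batch once it is older than the timeout."""
        if self.buffer and self.clock() - self.first_item_at >= self.timeout_s:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        batch, self.buffer = self.buffer, []
        for export in self.exporters:         # fan out the same batch to every backend
            export(batch)


# Two stand-in "backends"; swapping providers means changing this list only
xray_batches, log_batches = [], []
processor = BatchProcessor([xray_batches.append, log_batches.append], send_batch_size=3)
for span_name in ["a", "b", "c", "d"]:
    processor.add(span_name)

print(xray_batches)      # [['a', 'b', 'c']]
print(processor.buffer)  # ['d']
```

The application code that produces spans never references a backend, which is the property that makes switching exporters a configuration change.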
APM Tools Comparison
| Tool | Strengths | Weaknesses | Typical Cost |
|---|---|---|---|
| AWS X-Ray | Deep AWS integration, pay-per-trace pricing, no infrastructure to manage | AWS-only, limited query capabilities | ~$5 per million traces recorded |
| Datadog APM | Rich UI, custom dashboards, integrated logs/traces/metrics | Expensive at scale, vendor lock-in | ~$31/host/month |
| New Relic | Full-stack observability, AI-driven insights, broad language support | Pricing complexity, data retention limits | ~$0.25/GB ingested |
| Grafana Tempo | Open source, works with Grafana+Loki+Prometheus, cost-effective | Requires self-managed infrastructure, steeper learning curve | Free (self-hosted) or ~$8/host (Grafana Cloud) |
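For per-trace pricing, a back-of-the-envelope estimate shows how quickly volume adds up. The sketch below takes the table's rough figure of about $5 per million traces as its default and ignores free tiers, retrieval charges, and regional differences; the function name and the 200 requests/second workload are made up for illustration.

```python
def monthly_trace_cost(requests_per_second, sample_rate, usd_per_million_traces=5.0, days=30):
    """Rough monthly cost of recording a sampled fraction of all requests."""
    seconds = days * 24 * 60 * 60
    traces = requests_per_second * seconds * sample_rate
    return traces / 1_000_000 * usd_per_million_traces


# Hypothetical service handling 200 requests per second
print(f"${monthly_trace_cost(200, 1.0):,.2f}")   # $2,592.00 when tracing everything
print(f"${monthly_trace_cost(200, 0.05):,.2f}")  # $129.60 when tracing 5%
```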
Sampling Strategies
Tracing every request at scale is expensive. Use sampling to keep costs predictable:
- Head-based sampling — Decide at the request start whether to trace (e.g., 5% of all requests). Simple but may miss rare errors.
- Tail-based sampling — Record all spans but decide which traces to retain after seeing the full trace. Keeps all errors and slow traces.
- Rate-based sampling — Trace up to N requests per second.
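The three strategies above differ mainly in when the decision is made and what information is available at that point. The sketch below is one minimal way to express each; the function and class names, the dictionary span layout (`trace_id`, `start_ms`, `end_ms`, `error`), and the thresholds are all assumptions for illustration, not any SDK's API.

```python
import hashlib
import time


def head_sample(trace_id, rate=0.05):
    """Head-based: decide from the trace ID alone, so every service agrees."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate


def tail_keep(spans, slow_ms=1000.0, baseline_rate=0.05):
    """Tail-based: decide after the whole trace has been collected."""
    has_error = any(s.get("error") for s in spans)
    total_ms = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    if has_error or total_ms >= slow_ms:
        return True  # always keep errors and slow traces
    return head_sample(spans[0]["trace_id"], baseline_rate)


class RateLimiter:
    """Rate-based: admit at most max_per_second traces in each one-second window."""

    def __init__(self, max_per_second, clock=time.monotonic):
        self.max_per_second = max_per_second
        self.clock = clock
        self.window = None
        self.count = 0

    def allow(self):
        window = int(self.clock())
        if window != self.window:  # a new second has started, reset the counter
            self.window, self.count = window, 0
        if self.count < self.max_per_second:
            self.count += 1
            return True
        return False


failed = [{"trace_id": "t-42", "start_ms": 0, "end_ms": 120, "error": True}]
print(tail_keep(failed, baseline_rate=0.0))  # True: kept because it contains an error
```

Hashing the trace ID rather than calling a random number generator keeps the head-based decision consistent across services, so a trace is either recorded everywhere or nowhere. The tail-based path needs every span buffered until the trace completes, which is where its extra cost comes from.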
Info
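A common pattern is to sample about 5% of normal traffic while retaining 100% of traces that contain errors. X-Ray sampling rules cover the first half: they are evaluated when a request starts, so they can match on attributes such as service name, HTTP method, and URL path and apply a reservoir plus a fixed rate, but they cannot know whether the request will later fail. Retaining every error trace requires a tail-based decision, for example with the OpenTelemetry Collector's tail_sampling processor.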
Summary
Distributed tracing is essential for understanding request paths in microservice architectures. AWS X-Ray provides a managed, AWS-native solution while OpenTelemetry gives you vendor independence. The Collector pattern — applications send OTLP to a collector, which exports to one or more backends — has become the industry standard and is the approach the module project uses.