Dashboards & Alerting

Dashboards make metrics visible; alerting makes them actionable. This page covers Grafana, a widely used open-source visualization platform, and the discipline of defining service-level objectives and alerting rules that balance sensitivity with specificity.

Grafana

Grafana connects to dozens of data sources (CloudWatch, Prometheus, Azure Monitor, Google Cloud Monitoring, Loki, Elasticsearch) and renders their data in customizable dashboards.

Adding a Data Source

# Provision a CloudWatch data source via YAML
cat <<EOF > datasources.yaml
apiVersion: 1
datasources:
  - name: CloudWatch
    type: cloudwatch
    access: proxy
    jsonData:
      authType: default
      defaultRegion: us-east-1
    editable: false
EOF

For Prometheus metrics exposed by the OpenTelemetry Collector, add another entry under datasources:

  - name: OTEL Prometheus
    type: prometheus
    access: proxy
    url: http://otel-collector:8889
    isDefault: true
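Before Grafana loads a provisioning file, it can help to sanity-check the entries. The sketch below is a hypothetical pre-flight check, not part of Grafana: it mirrors the two entries above as Python dictionaries and flags a missing `name` or `type`, duplicate names, and more than one entry marked `isDefault`. The `check_datasources` helper and the rules it enforces are assumptions chosen for illustration.

```python
from typing import Any

# Hypothetical in-memory copy of the two provisioning entries shown above.
DATASOURCES: list[dict[str, Any]] = [
    {
        "name": "CloudWatch",
        "type": "cloudwatch",
        "access": "proxy",
        "jsonData": {"authType": "default", "defaultRegion": "us-east-1"},
        "editable": False,
    },
    {
        "name": "OTEL Prometheus",
        "type": "prometheus",
        "access": "proxy",
        "url": "http://otel-collector:8889",
        "isDefault": True,
    },
]


def check_datasources(entries: list[dict[str, Any]]) -> list[str]:
    """Return human-readable problems; an empty list means every check passed."""
    problems: list[str] = []
    for index, entry in enumerate(entries):
        for key in ("name", "type"):
            if not entry.get(key):
                problems.append(f"entry {index}: missing '{key}'")
    names = [entry.get("name") for entry in entries if entry.get("name")]
    for name in sorted({n for n in names if names.count(n) > 1}):
        problems.append(f"duplicate data source name: {name}")
    defaults = [entry.get("name") for entry in entries if entry.get("isDefault")]
    if len(defaults) > 1:
        problems.append(f"more than one default data source: {defaults}")
    return problems


print(check_datasources(DATASOURCES) or "no problems found")
```

A check like this fits naturally into CI, so a malformed entry is caught before it reaches a running Grafana instance.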

Building a Dashboard

A well-designed dashboard answers a specific question. Common panel types:

  • Time series — CPU, memory, request rate, latency over time
  • Stat — Single current value (e.g., “5xx errors in last 5 minutes”)
  • Table — Top-N lists (e.g., slowest endpoints)
  • Heatmap — Latency distribution over time

Use template variables to make dashboards reusable across environments:

# Query variable listing EC2 instance IDs.
# The label name depends on your CloudWatch exporter; dimension_InstanceId is an assumption.
label_values(aws_ec2_cpuutilization_average, dimension_InstanceId)
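To see why a single variable lets one panel serve many environments, here is a simplified sketch of variable substitution. It is not Grafana's implementation: the `interpolate` helper, the `instance_id` label, and the instance IDs are hypothetical, and the selected values are assumed to contain no regex metacharacters.

```python
def interpolate(query: str, selections: dict[str, list[str]]) -> str:
    """Replace each $name in the query with its selected values joined by '|'."""
    # Substitute longer names first so "$instance_type" is not clobbered by "$instance".
    for name in sorted(selections, key=len, reverse=True):
        query = query.replace(f"${name}", "|".join(selections[name]))
    return query


# Hypothetical panel query that filters on a dashboard variable via a regex matcher.
PANEL_QUERY = 'aws_ec2_cpuutilization_average{instance_id=~"$instance"}'

print(interpolate(PANEL_QUERY, {"instance": ["i-0aaa", "i-0bbb"]}))
```

The panel query stays fixed; only the selection changes, which is what makes the dashboard reusable.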

Tip

Follow the “single-purpose dashboard” principle: each dashboard should answer one question. A “latency” dashboard, an “errors” dashboard, and a “capacity” dashboard are more useful than a single 40-panel monster.

Integrating Loki for Logs

Loki is Grafana’s log aggregation system designed for cost-efficient log storage. It indexes labels (not full text) and works directly with Prometheus-style selectors:

{app="order-service"} |= "ERROR" |= "database_timeout"

Add Loki as a data source and use the Explore view to correlate logs with metrics panels.
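Chained `|=` filters in the query above are combined with AND: a line must contain every string to pass. The sketch below mimics that behavior in plain Python so the semantics are easy to check. The `line_filter` helper and the sample log lines are made up for illustration and do not come from Loki.

```python
def line_filter(lines: list[str], *needles: str) -> list[str]:
    """Keep lines that contain every needle (case-sensitive substring match)."""
    return [line for line in lines if all(needle in line for needle in needles)]


# Invented log lines standing in for the {app="order-service"} stream.
SAMPLE_LINES = [
    "2024-01-01T00:00:00Z ERROR database_timeout after 30s",
    "2024-01-01T00:00:01Z ERROR cache_miss for key=cart",
    "2024-01-01T00:00:02Z WARN database_timeout, retrying",
]

for line in line_filter(SAMPLE_LINES, "ERROR", "database_timeout"):
    print(line)
```

Only the first sample line passes, because it is the only one that contains both strings.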

SLOs and SLIs

Service Level Indicators (SLIs) are the metrics you measure — request latency, error rate, uptime. Service Level Objectives (SLOs) are the targets you set — “99.9% of requests complete in under 500ms over a 30-day rolling window.”

Defining an SLO

SLI: Proportion of GET /api/products requests with status < 500
Goal: 99.9%
Window: 30 days rolling
Measurement: (successful requests / total requests) * 100
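The measurement formula translates directly into an error budget. The sketch below assumes you already have request totals for the window; `evaluate_slo`, `SloReport`, and the traffic numbers are hypothetical names and values used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    sli_percent: float       # measured success ratio, in percent
    allowed_failures: float  # failures the goal tolerates at this traffic level
    budget_remaining: float  # fraction of the error budget left (negative if overspent)


def evaluate_slo(total: int, failed: int, goal_percent: float = 99.9) -> SloReport:
    """Apply (successful requests / total requests) * 100 and derive the budget."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 < goal_percent < 100:
        raise ValueError("goal_percent must be between 0 and 100 (exclusive)")
    sli_percent = (total - failed) / total * 100
    allowed_failures = total * (100 - goal_percent) / 100
    budget_remaining = 1 - failed / allowed_failures
    return SloReport(sli_percent, allowed_failures, budget_remaining)


# Hypothetical 30-day totals: 1,000,000 requests, 400 of them with status >= 500.
report = evaluate_slo(total=1_000_000, failed=400)
print(f"SLI {report.sli_percent:.2f}%, budget left {report.budget_remaining:.0%}")
```

With these numbers the goal tolerates about 1,000 failed requests, so 400 failures leave roughly 60% of the budget.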

Burn Rate Alerting

Burn rate measures how fast you are consuming your error budget, relative to the pace that would use it up exactly at the end of the SLO window. A 1% error rate against a 99.9% SLO (0.1% budget) is a burn rate of 10: at that pace, a 30-day budget is gone in 3 days.

# Prometheus-style burn rate alert
groups:
  - name: slo-alerts
    rules:
      - alert: HighErrorBudgetBurnRate
        # Fires when the 1h error ratio exceeds 14.4x the 0.1% budget.
        # 14.4x sustained for 1h consumes 2% of a 30-day budget; tune the factor to taste.
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error budget burn rate is critically high"
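The arithmetic behind burn rate is small enough to check by hand. The sketch below reproduces the 10x figure above and shows how a burn-rate factor maps to budget consumed over an alert window. The function names are hypothetical, and the 14.4 factor is an example value often used for a one-hour window, not a requirement.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Observed error ratio divided by the error budget (1 - SLO target)."""
    return error_ratio / (1 - slo_target)


def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Hours until the whole budget is spent if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")
    return window_days * 24 / rate


def budget_consumed(rate: float, alert_window_hours: float, window_days: int = 30) -> float:
    """Fraction of the window's budget spent during the alert window."""
    return rate * alert_window_hours / (window_days * 24)


rate = burn_rate(error_ratio=0.01, slo_target=0.999)
print(f"burn rate: {rate:.1f}x")
print(f"budget exhausted in: {hours_to_exhaustion(rate):.0f} hours")
print(f"budget spent by 14.4x over 1h: {budget_consumed(14.4, 1):.1%}")
```

Working backwards from "how much budget am I willing to lose before someone is paged" is usually an easier way to pick a threshold than guessing an error percentage.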

Alert Fatigue Prevention

Alert fatigue occurs when too many alerts desensitize operators. Prevention strategies:

  • Use multi-condition rules — Alert only when a high-severity condition persists (e.g., error rate > 5% for 5 minutes)
  • Group related alerts — A single “Region us-east-1 is degraded” avoids 50 per-service alerts
  • Define runbook links — Every alert should link to a playbook
  • Review alert noise quarterly — Tune thresholds or remove stale rules
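Grouping is the strategy that lends itself most readily to a code sketch. The example below collapses per-service alerts that share a label into one notification per label value. The `group_alerts` helper, the alert dictionaries, and the message wording are hypothetical; Alertmanager and similar tools provide grouping natively, so this only illustrates the idea.

```python
from collections import defaultdict


def group_alerts(alerts: list[dict[str, str]], key: str = "region") -> list[str]:
    """Collapse alerts that share the same label value into one summary message."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for alert in alerts:
        grouped[alert.get(key, "unknown")].append(alert["service"])
    return [
        f"{key} {value} is degraded: {len(services)} services affected "
        f"({', '.join(sorted(services))})"
        for value, services in sorted(grouped.items())
    ]


# Invented firing alerts: three services in one region, one in another.
FIRING = [
    {"service": "order-service", "region": "us-east-1"},
    {"service": "cart-service", "region": "us-east-1"},
    {"service": "auth-service", "region": "us-east-1"},
    {"service": "order-service", "region": "eu-west-1"},
]

for message in group_alerts(FIRING):
    print(message)
```

Four alerts become two notifications, and the on-call engineer sees the regional pattern immediately instead of reconstructing it from individual pages.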

Caution

Never use email for production alerts. Use PagerDuty, Opsgenie, or a dedicated Slack channel backed by an on-call rotation. Email is too easy to ignore.

CloudWatch Alarm Example

aws cloudwatch put-metric-alarm \
  --alarm-name "order-service-5xx-high" \
  --metric-name 5xxCount \
  --namespace MyApp \
  --statistic Sum \
  --period 300 \
  --evaluation-periods 2 \
  --threshold 50 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-team
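The interaction between `--period`, `--evaluation-periods`, and `--threshold` is easier to see in a simulation. The sketch below is a simplified model, not CloudWatch's actual evaluator: it assumes every 300-second period has a data point and ignores the INSUFFICIENT_DATA state and missing-data handling. The `alarm_states` helper and the sample sums are hypothetical.

```python
def alarm_states(
    period_sums: list[float], threshold: float = 50, evaluation_periods: int = 2
) -> list[str]:
    """Return 'ALARM' or 'OK' after each period, given per-period Sum values."""
    states = []
    for i in range(len(period_sums)):
        # Look at the most recent `evaluation_periods` data points, if available.
        window = period_sums[max(0, i - evaluation_periods + 1) : i + 1]
        breaching = len(window) == evaluation_periods and all(
            value > threshold for value in window  # GreaterThanThreshold
        )
        states.append("ALARM" if breaching else "OK")
    return states


# Invented 5xx counts for five consecutive 5-minute periods.
SUMS = [10, 60, 70, 40, 55]
print(alarm_states(SUMS))
```

A single period above 50 does not trigger the alarm; two consecutive breaching periods (10 minutes in total) do, which is what keeps a brief spike from paging anyone.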

Summary

Grafana provides a unified view across clouds and on-premises infrastructure. SLOs and burn-rate alerting give you a data-driven approach to reliability. Invest time in dashboard design and alert tuning: both are force multipliers for any operations team.