
Grafana Dashboards and Alerting for Rust Metrics

Grafana is the primary visualization tool for Prometheus metrics. It turns raw data (requests per second, latency percentiles) into intuitive dashboards with graphs, heatmaps, and gauges. Grafana also powers alerting: define rules like "Alert if p99 latency exceeds 500ms" and have Grafana notify you via Slack, email, or PagerDuty.

This article covers creating dashboards, writing PromQL queries, and setting up alerts to measure SLO attainment and catch problems before they impact users.

Setting Up Grafana with Prometheus

If you are running Prometheus locally, start Grafana:

docker run -p 3000:3000 \
-e GF_SECURITY_ADMIN_PASSWORD=admin \
grafana/grafana:latest

Grafana runs on http://localhost:3000. Log in with username admin, password admin.

Add Prometheus as a data source:

  1. Go to Connections → Data sources (in older Grafana versions: Configuration (gear icon) → Data Sources).
  2. Click Add data source.
  3. Select Prometheus.
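  4. Set URL to the address of your Prometheus server. Because Grafana runs in a container here, http://localhost:9090 points at the Grafana container itself. On Docker Desktop use http://host.docker.internal:9090; on Linux you can start the container with --network host so that http://localhost:9090 reaches the host.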
  5. Click Save & Test.

Now you can build dashboards and queries.

Creating a Dashboard

  1. Click Create (plus icon) → Dashboard.
  2. Click Add a new panel.
  3. Configure the query (see below).
  4. Customize visualization options.
  5. Save the dashboard.

A typical dashboard for a Rust service includes:

  • Request rate (requests per second).
  • Latency (p50, p95, p99).
  • Error rate (failed requests as percentage).
  • Resource usage (CPU, memory, connections).

Writing PromQL Queries for Common Metrics

Request Rate

rate(http_requests_total[5m])

This computes the average per-second request rate over the last 5 minutes. The result is one time series per label combination; wrap the expression in sum() if you want a single total line on the graph.
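To make the calculation concrete, here is a minimal Rust sketch of what rate() does with the raw counter samples in a window. It is a simplification under stated assumptions: `Sample` and `per_second_rate` are hypothetical names for illustration, and Prometheus additionally extrapolates to the window boundaries, which this sketch omits.

```rust
/// One scraped counter sample: (unix timestamp in seconds, counter value).
type Sample = (f64, f64);

/// Simplified per-second rate over the samples in one window.
/// A drop in the counter is treated as a process restart (counter reset).
/// Unlike Prometheus, this does not extrapolate to the window edges.
fn per_second_rate(samples: &[Sample]) -> Option<f64> {
    if samples.len() < 2 {
        return None; // a rate needs at least two samples
    }
    let mut increase = 0.0;
    for pair in samples.windows(2) {
        let (prev, cur) = (pair[0].1, pair[1].1);
        // After a reset the counter restarts at 0, so the whole current value is new.
        increase += if cur >= prev { cur - prev } else { cur };
    }
    let elapsed = samples[samples.len() - 1].0 - samples[0].0;
    if elapsed <= 0.0 {
        return None;
    }
    Some(increase / elapsed)
}

fn main() {
    // Hypothetical scrapes every 15s; the counter grows by 30 per scrape.
    let samples = [(0.0, 100.0), (15.0, 130.0), (30.0, 160.0), (45.0, 190.0)];
    println!("{:?}", per_second_rate(&samples)); // prints Some(2.0)
}
```

The counter grew by 90 over 45 seconds, so the rate is 2 requests per second. A longer window such as [5m] smooths out short bursts; a shorter one reacts faster but is noisier.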

Error Rate

sum(rate(http_requests_total{status=~"5.."}[5m])) 
/ sum(rate(http_requests_total[5m]))

This is the ratio of 5xx errors to all requests. Multiply by 100 for a percentage.
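The same ratio can be sketched in Rust to show what the two sum() calls contribute. The `StatusRate` struct and the sample numbers are assumptions made for illustration; the status check mirrors the regex 5.., which PromQL anchors at both ends.

```rust
/// Per-second request rate for one `status` label value (hypothetical data shape).
struct StatusRate<'a> {
    status: &'a str,
    per_second: f64,
}

/// Mirrors the anchored regex `5..`: exactly three characters, starting with '5'.
fn is_server_error(status: &str) -> bool {
    status.len() == 3 && status.starts_with('5')
}

/// Ratio of 5xx traffic to all traffic, or None when there is no traffic
/// (PromQL would produce NaN for 0 / 0).
fn error_ratio(rates: &[StatusRate<'_>]) -> Option<f64> {
    let total: f64 = rates.iter().map(|r| r.per_second).sum();
    if total == 0.0 {
        return None;
    }
    let errors: f64 = rates
        .iter()
        .filter(|r| is_server_error(r.status))
        .map(|r| r.per_second)
        .sum();
    Some(errors / total)
}

fn main() {
    let rates = [
        StatusRate { status: "200", per_second: 95.0 },
        StatusRate { status: "404", per_second: 3.0 },
        StatusRate { status: "500", per_second: 1.5 },
        StatusRate { status: "503", per_second: 0.5 },
    ];
    if let Some(ratio) = error_ratio(&rates) {
        println!("{:.2}%", ratio * 100.0); // prints 2.00%
    }
}
```

Summing before dividing matters: dividing per-series rates first would only pair up series with identical labels, and the result would not be the overall error ratio.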

Latency Percentiles

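# p50 (median)
histogram_quantile(0.5, sum by (le) (rate(http_request_duration_ms_bucket[5m])))

# p95
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_ms_bucket[5m])))

# p99
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_ms_bucket[5m])))

Histogram buckets are cumulative counters, so each query first takes rate() over the buckets and sums by le; applying histogram_quantile to the raw buckets would describe every request since the process started rather than recent traffic. Each query returns a time series. You can overlay them on a single graph to see how the latency distribution changes over time.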
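A percentile computed from a histogram is an estimate, and a short sketch shows why. The function below is a simplified Rust model of the linear interpolation that histogram_quantile performs over cumulative buckets; the bucket bounds and counts are made-up values, and edge cases that Prometheus handles (such as negative bounds) are left out.

```rust
/// Estimate quantile `q` from cumulative histogram buckets.
/// `buckets` holds (upper bound `le`, cumulative count), sorted by bound,
/// and must end with the +Inf bucket.
fn histogram_quantile(q: f64, buckets: &[(f64, f64)]) -> Option<f64> {
    let (last_bound, total) = *buckets.last()?;
    if buckets.len() < 2 || !last_bound.is_infinite() || total <= 0.0 || !(0.0..=1.0).contains(&q) {
        return None;
    }
    let rank = q * total;
    // First bucket whose cumulative count reaches the target rank.
    let idx = buckets.iter().position(|&(_, count)| count >= rank)?;
    if idx == buckets.len() - 1 {
        // The quantile falls in the +Inf bucket: report the highest finite bound.
        return Some(buckets[idx - 1].0);
    }
    let (upper, count) = buckets[idx];
    let (lower, prev_count) = if idx == 0 { (0.0, 0.0) } else { buckets[idx - 1] };
    if count == prev_count {
        return Some(lower); // empty bucket, nothing to interpolate
    }
    // Assume observations are spread evenly inside the bucket.
    Some(lower + (upper - lower) * (rank - prev_count) / (count - prev_count))
}

fn main() {
    // Hypothetical bucket bounds in milliseconds with cumulative counts.
    let buckets = [
        (50.0, 600.0),
        (100.0, 900.0),
        (250.0, 980.0),
        (500.0, 998.0),
        (f64::INFINITY, 1000.0),
    ];
    for q in [0.5, 0.95, 0.99] {
        if let Some(ms) = histogram_quantile(q, &buckets) {
            println!("p{:.0} = {:.2} ms", q * 100.0, ms);
        }
    }
}
```

With these numbers the p95 estimate is 193.75 ms, interpolated inside the 100 to 250 ms bucket. The estimate can never be more precise than the bucket it lands in, so choose bucket bounds that bracket your SLO threshold closely.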

Requests by Endpoint

sum by (endpoint) (rate(http_requests_total[5m]))

This splits request rate by endpoint, so you can identify which endpoints are busiest.
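For readers less familiar with PromQL aggregation, the following Rust sketch models what sum by (endpoint) does: it keeps one label and adds up every series that shares its value, discarding the other labels. The `Series` struct, the `sum_by` helper, and the sample data are hypothetical.

```rust
use std::collections::BTreeMap;

/// One time series at a single instant: its labels and current per-second rate.
struct Series<'a> {
    labels: &'a [(&'a str, &'a str)],
    per_second: f64,
}

/// Group series by the value of `label` and sum their rates.
fn sum_by<'a>(label: &str, series: &[Series<'a>]) -> BTreeMap<&'a str, f64> {
    let mut totals = BTreeMap::new();
    for s in series {
        // A series without the label is grouped under the empty value.
        let key = s
            .labels
            .iter()
            .find(|(name, _)| *name == label)
            .map(|(_, value)| *value)
            .unwrap_or("");
        *totals.entry(key).or_insert(0.0) += s.per_second;
    }
    totals
}

fn main() {
    let series = [
        Series { labels: &[("endpoint", "/users"), ("status", "200")], per_second: 40.0 },
        Series { labels: &[("endpoint", "/users"), ("status", "500")], per_second: 2.0 },
        Series { labels: &[("endpoint", "/orders"), ("status", "200")], per_second: 15.0 },
    ];
    for (endpoint, rate) in sum_by("endpoint", &series) {
        println!("{endpoint}: {rate} req/s");
    }
}
```

The two /users series collapse into a single 42 req/s series, which is why the status breakdown disappears from the resulting graph.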

Memory Usage

process_memory_usage_bytes

This is a gauge that tracks current memory use. If node_exporter is also scraped, you can express it as a percentage of the host's total memory:

process_memory_usage_bytes / on() group_left() node_memory_MemTotal_bytes * 100

Building a Dashboard: Complete Example

Here is a dashboard for a Rust HTTP service:

Panel 1: Request Rate (Graph)

  • Title: Requests Per Second
  • Query: rate(http_requests_total[5m])
  • Graph type: Line
  • Y-axis label: Requests/sec

Panel 2: Latency (Graph with multiple series)

  • Title: Request Latency (p50, p95, p99)
  • Queries:
    • histogram_quantile(0.50, sum by (le) (rate(http_request_duration_ms_bucket[5m]))) (alias: p50)
    • histogram_quantile(0.95, sum by (le) (rate(http_request_duration_ms_bucket[5m]))) (alias: p95)
    • histogram_quantile(0.99, sum by (le) (rate(http_request_duration_ms_bucket[5m]))) (alias: p99)
  • Graph type: Line
  • Y-axis label: Milliseconds

Panel 3: Error Rate (Gauge)

  • Title: Error Rate
  • Query: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100
  • Graph type: Gauge
  • Thresholds: 0-1% green, 1-5% yellow, 5%+ red

Panel 4: Requests by Endpoint (Bar chart)

  • Title: Requests by Endpoint
  • Query: sum by (endpoint) (rate(http_requests_total[5m]))
  • Graph type: Bar chart
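The gauge thresholds in Panel 3 amount to a simple banding rule, sketched here in Rust. The `Level` enum and `error_rate_level` function are illustrative names only; the sketch assumes each threshold is the lower bound of its band, so exactly 1% already counts as yellow.

```rust
/// Colour band shown by the error-rate gauge.
#[derive(Debug, PartialEq)]
enum Level {
    Green,
    Yellow,
    Red,
}

/// Map an error rate in percent to the bands 0-1% green, 1-5% yellow, 5%+ red.
fn error_rate_level(percent: f64) -> Level {
    if percent >= 5.0 {
        Level::Red
    } else if percent >= 1.0 {
        Level::Yellow
    } else {
        Level::Green
    }
}

fn main() {
    for percent in [0.4, 1.0, 7.5] {
        println!("{percent}% -> {:?}", error_rate_level(percent));
    }
}
```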

Setting Up Alerts

Grafana can fire alerts when a condition is met, such as a high error rate, a latency spike, or a service that is down.

Step 1: Create an Alert Rule

  1. Open a dashboard.
  2. Click a panel's menu → Edit.
  3. Go to the Alert tab.
  4. Set a Condition (e.g., "if p99 latency > 500ms").
  5. Set For (how long the condition must be true: 5m, 1m, etc.).
  6. Save.

Step 2: Configure Notification Channel

  1. Go to Alerting → Notification channels (called Contact points under Grafana's unified alerting, the default since version 9).
  2. Click New channel.
  3. Select type: Slack, Email, PagerDuty, Webhook, etc.
  4. Configure credentials (Slack webhook URL, email address).
  5. Save.

In the alert rule, select your notification channel from the Send to dropdown.

Example Alert Rule: SLO Violation

Create an alert that fires if p99 latency exceeds your SLO:

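Condition: histogram_quantile(0.99, sum by (le) (rate(http_request_duration_ms_bucket[5m]))) > 200
For: 5m
Send to: #sre-alerts (Slack)

This alert fires if p99 latency stays above 200ms for 5 consecutive minutes, that is, if the slowest 1% of requests take longer than 200ms throughout that period.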

Error rate alert:

Condition: (sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) > 0.01
For: 1m
Send to: #incident-response (Slack/PagerDuty)

This fires if more than 1% of requests are errors for 1 minute.
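The For setting is what keeps a brief spike from paging anyone. The Rust sketch below models that behaviour as a small state machine: a breached condition is first pending and only fires once it has held for the whole duration. It is a simplified model rather than Grafana's implementation, and `AlertRule`, `AlertState`, and the evaluation timeline are hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum AlertState {
    Normal,
    Pending,
    Firing,
}

/// A threshold rule with a "For" duration, evaluated at discrete points in time.
struct AlertRule {
    threshold: f64,
    for_secs: u64,
    breach_started: Option<u64>,
}

impl AlertRule {
    fn new(threshold: f64, for_secs: u64) -> Self {
        AlertRule { threshold, for_secs, breach_started: None }
    }

    /// Evaluate the rule at `now_secs` (assumed to be non-decreasing) with the current value.
    fn evaluate(&mut self, now_secs: u64, value: f64) -> AlertState {
        if value <= self.threshold {
            // Condition cleared: forget the breach, so the timer restarts next time.
            self.breach_started = None;
            return AlertState::Normal;
        }
        let started = *self.breach_started.get_or_insert(now_secs);
        if now_secs.saturating_sub(started) >= self.for_secs {
            AlertState::Firing
        } else {
            AlertState::Pending
        }
    }
}

fn main() {
    // p99 > 200ms for 5m, evaluated once a minute.
    let mut rule = AlertRule::new(200.0, 300);
    let evaluations = [(0, 150.0), (60, 250.0), (180, 240.0), (300, 250.0), (360, 270.0), (420, 180.0)];
    for (now, p99) in evaluations {
        println!("t={}s p99={}ms -> {:?}", now, p99, rule.evaluate(now, p99));
    }
}
```

In this timeline the breach starts at t=60s, so the rule stays pending until t=360s and then fires. A single evaluation below the threshold resets the timer, which is the trade-off behind the For value: longer durations suppress noise but delay the notification.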

Dashboard Best Practices

  1. Group related metrics: Put request rate and latency on the same dashboard so you can correlate them (a latency spike often follows a traffic spike).

  2. Use consistent time ranges: Show all panels over the same interval (last 24 hours, last 7 days) for coherent analysis.

  3. Annotate events: Add annotations for deployments, config changes, etc., so you can see their impact on metrics.

  4. Use thresholds: Set visual thresholds on gauges (green/yellow/red) to quickly spot problems.

  5. Label metrics clearly: "p99 Latency (ms, 5m rolling)" is better than "latency".

  6. Avoid dashboard bloat: A dashboard with 20+ panels is hard to read. Create separate dashboards for different concerns (performance, errors, resources).

Exporting Dashboards as Code

Grafana dashboards are JSON. You can version control them in Git:

  1. Click dashboard menu → Export.
  2. Copy the JSON.
  3. Save to dashboards/my-service.json in your repo.
  4. Check it into Git.

To import:

  1. Click Create → Import.
  2. Paste the JSON or upload the file.

This lets you version-control dashboards alongside your infrastructure-as-code (Terraform, Ansible).
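Because a dashboard is plain JSON, it can also be generated rather than exported by hand. The Rust sketch below assembles a minimal dashboard document using only the standard library. The field names (title, panels, type, gridPos, targets, expr, refId) follow Grafana's dashboard JSON model, but a real export contains many more fields, and the helper names and the two-column layout are assumptions made for this example.

```rust
/// Escape the characters that commonly appear in PromQL and need escaping in JSON strings.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            _ => out.push(c),
        }
    }
    out
}

/// Render one time series panel, placed in a two-column grid of 12x8 cells.
fn panel_json(index: usize, title: &str, expr: &str) -> String {
    format!(
        r#"{{"type":"timeseries","title":"{}","gridPos":{{"x":{},"y":{},"w":12,"h":8}},"targets":[{{"refId":"A","expr":"{}"}}]}}"#,
        json_escape(title),
        (index % 2) * 12,
        (index / 2) * 8,
        json_escape(expr)
    )
}

/// Render a dashboard from (panel title, PromQL expression) pairs.
fn dashboard_json(title: &str, panels: &[(&str, &str)]) -> String {
    let rendered: Vec<String> = panels
        .iter()
        .enumerate()
        .map(|(i, (panel_title, expr))| panel_json(i, panel_title, expr))
        .collect();
    format!(r#"{{"title":"{}","panels":[{}]}}"#, json_escape(title), rendered.join(","))
}

fn main() {
    let json = dashboard_json(
        "My Service",
        &[
            ("Requests Per Second", "sum(rate(http_requests_total[5m]))"),
            ("5xx Rate", r#"sum(rate(http_requests_total{status=~"5.."}[5m]))"#),
        ],
    );
    // Redirect this output to dashboards/my-service.json and commit it.
    println!("{json}");
}
```

For anything beyond a sketch, a JSON library such as serde_json is a safer choice than string formatting, since it handles every escaping case.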

Common Query Patterns

Use Case                    PromQL Query
Request rate                rate(http_requests_total[5m])
Error rate (%)              (sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) * 100
p99 latency                 histogram_quantile(0.99, sum by (le) (rate(http_request_duration_ms_bucket[5m])))
Active connections          db_connections_active
Cache hit ratio             sum(rate(cache_hits_total[5m])) / (sum(rate(cache_hits_total[5m])) + sum(rate(cache_misses_total[5m])))
CPU usage (% of one core)   rate(process_cpu_seconds_total[5m]) * 100
Memory usage (MiB)          process_memory_usage_bytes / 1024 / 1024

Key Takeaways

  • Grafana visualizes Prometheus metrics in dashboards with graphs, gauges, and heatmaps.
  • Write PromQL queries to compute rates, percentiles, and aggregations.
  • Define alert rules (latency > threshold, error rate > threshold) to catch problems early.
  • Configure notification channels (Slack, email, PagerDuty) so you are alerted 24/7.
  • Version control dashboards as JSON in Git for reproducibility.
  • Group related metrics on dashboards to correlate cause-and-effect.

Frequently Asked Questions

Can I create dashboards programmatically?

Yes. Use the Grafana HTTP API to create or update dashboards. You can also use the Grafana Terraform provider to manage dashboards as infrastructure-as-code.

How do I alert on missing data?

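Use the up metric, which Prometheus records for every scrape target:

up{job="my_service"} == 0

This fires if Prometheus cannot scrape your service (the service is down or unreachable). If the target can disappear from service discovery entirely, up returns no series at all, so also alert on absent(up{job="my_service"}).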

Can I create dashboards from JSON templates?

Yes. Grafana supports dashboard templating with variables. You can create a template dashboard, export it, and reuse it for multiple services by changing variable values.

What is the difference between Grafana and Prometheus?

  • Prometheus: Time-series database, stores metrics, executes PromQL queries.
  • Grafana: Visualization layer, reads from Prometheus, displays graphs, evaluates alert rules.

How do I avoid alert fatigue?

  1. Set thresholds realistically: Alert on 5% error rate, not every error.
  2. Use "For" duration: Require the condition to persist (5+ minutes) before alerting.
  3. Group alerts: Send related alerts to a single channel to avoid duplication.
  4. Auto-resolve: Configure alerts to auto-resolve when the condition clears.

Further Reading