Grafana Dashboards and Alerting for Rust Metrics
Grafana is the primary visualization tool for Prometheus metrics. It turns raw data (requests per second, latency percentiles) into intuitive dashboards with graphs, heatmaps, and gauges. Grafana also powers alerting: define rules like "Alert if p99 latency exceeds 500ms" and have Grafana notify you via Slack, email, or PagerDuty.
This article covers creating dashboards, writing PromQL queries, and setting up alerts to measure SLO attainment and catch problems before they impact users.
Setting Up Grafana with Prometheus
If you are running Prometheus locally, start Grafana:
docker run -p 3000:3000 \
-e GF_SECURITY_ADMIN_PASSWORD=admin \
grafana/grafana:latest
Grafana runs on http://localhost:3000. Log in with username admin, password admin.
Add Prometheus as a data source:
- Go to Settings (gear icon) → Data Sources.
- Click Add data source.
- Select Prometheus.
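- Set URL to http://localhost:9090 (your Prometheus server). If Grafana runs in Docker as shown above, localhost inside the container refers to the container itself; on Docker Desktop use http://host.docker.internal:9090, or put both containers on a shared Docker network and use the Prometheus container name.
- Click Save & Test.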
Now you can build dashboards and queries.
Creating a Dashboard
- Click Create (plus icon) → Dashboard.
- Click Add a new panel.
- Configure the query (see below).
- Customize visualization options.
- Save the dashboard.
A typical dashboard for a Rust service includes:
- Request rate (requests per second).
- Latency (p50, p95, p99).
- Error rate (failed requests as percentage).
- Resource usage (CPU, memory, connections).
Writing PromQL Queries for Common Metrics
Request Rate
rate(http_requests_total[5m])
This computes the average per-second request rate over the last 5 minutes. The result is one time series per label combination; wrap the expression in sum(...) if you want a single total to graph.
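To make the calculation concrete, here is a minimal Python sketch of what a rate over a counter boils down to: total increase divided by elapsed time, with counter resets counted as fresh increase. The function name `simple_rate` and the sample data are hypothetical, and the sketch deliberately skips the window-edge extrapolation that Prometheus applies, so treat it as an approximation of the idea rather than the exact algorithm.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (unix_timestamp_seconds, counter_value)


def simple_rate(samples: List[Sample]) -> float:
    """Average per-second increase of a counter across the given samples.

    Simplified on purpose: Prometheus' rate() also extrapolates to the
    edges of the window, which this sketch does not do.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter was reset (for example, the process
        # restarted), so the whole current value counts as new increase.
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Hypothetical scrape results, one every 60 seconds, with a restart
# between the second and third sample.
window = [(0.0, 100.0), (60.0, 160.0), (120.0, 20.0), (180.0, 80.0)]
print(simple_rate(window))
```

The reset handling is why a service restart does not show up as a large negative spike in a rate graph.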
Error Rate
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m]))
This is the ratio of 5xx errors to all requests. Multiply by 100 for a percentage.
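The same expression can be evaluated outside Grafana through the Prometheus HTTP API, which is handy for checking a query before putting it on a panel. The sketch below assumes Prometheus is reachable at http://localhost:9090 as in the setup above; the helper names (`query_url`, `scalar_from_response`, `fetch_error_percent`) are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ERROR_RATIO = (
    'sum(rate(http_requests_total{status=~"5.."}[5m]))'
    " / sum(rate(http_requests_total[5m]))"
)


def query_url(base_url: str, promql: str) -> str:
    """Build the URL for an instant query against /api/v1/query."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


def scalar_from_response(payload: dict) -> float:
    """Return the single value of an instant-vector result."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    result = payload["data"]["result"]
    if len(result) != 1:
        raise ValueError(f"expected exactly one series, got {len(result)}")
    _timestamp, value = result[0]["value"]
    return float(value)  # Prometheus encodes sample values as strings


def fetch_error_percent(base_url: str = "http://localhost:9090") -> float:
    """Query the live server and convert the ratio to a percentage."""
    with urlopen(query_url(base_url, ERROR_RATIO), timeout=5) as resp:
        return scalar_from_response(json.load(resp)) * 100
```

Calling `fetch_error_percent()` needs a running Prometheus; the parsing helper can be exercised on a saved response without one.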
Latency Percentiles
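# Buckets are cumulative counters: take rate() first, then sum by (le) across instances.
# p50 (median)
histogram_quantile(0.5, sum by (le) (rate(http_request_duration_ms_bucket[5m])))
# p95
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_ms_bucket[5m])))
# p99
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_ms_bucket[5m])))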
Each query returns a time series. You can overlay them on a single graph to see how latency distribution changes over time.
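It helps to know that histogram_quantile does not see individual request durations. It estimates the percentile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The Python sketch below illustrates that idea under simplifying assumptions (classic buckets with a +Inf bucket, non-negative bounds); the bucket values are made up for illustration.

```python
import math
from typing import List, Tuple

Bucket = Tuple[float, float]  # (upper bound "le", cumulative count)


def histogram_quantile(q: float, buckets: List[Bucket]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets."""
    if not 0 <= q <= 1:
        raise ValueError("q must be within [0, 1]")
    buckets = sorted(buckets)
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("histogram needs a +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    if idx == len(buckets) - 1:
        # The rank falls into the +Inf bucket, so the best available
        # answer is the highest finite upper bound.
        return buckets[-2][0] if len(buckets) > 1 else math.nan
    upper, count = buckets[idx]
    lower, below = buckets[idx - 1] if idx > 0 else (0.0, 0.0)
    in_bucket = count - below
    if in_bucket == 0:
        return upper
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / in_bucket


# Hypothetical latency buckets in milliseconds.
example = [(50.0, 60.0), (100.0, 90.0), (250.0, 99.0), (math.inf, 100.0)]
for q in (0.5, 0.95, 0.99):
    print(q, round(histogram_quantile(q, example), 1))
```

Because of the interpolation, the accuracy of a reported percentile depends on how closely the bucket boundaries bracket the latencies you care about, such as your SLO threshold.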
Requests by Endpoint
sum by (endpoint) (rate(http_requests_total[5m]))
This splits request rate by endpoint, so you can identify which endpoints are busiest.
Memory Usage
process_memory_usage_bytes
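This is a gauge that tracks current memory. You can express it as a percentage of total host memory; the form below assumes exactly one node_memory_MemTotal_bytes series (a single node_exporter target):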
process_memory_usage_bytes / on() group_left() node_memory_MemTotal_bytes * 100
Building a Dashboard: Complete Example
Here is a dashboard for a Rust HTTP service:
Panel 1: Request Rate (Graph)
- Title: Requests Per Second
- Query: rate(http_requests_total[5m])
- Graph type: Line
- Y-axis label: Requests/sec
Panel 2: Latency (Graph with multiple series)
- Title: Request Latency (p50, p95, p99)
- Queries:
  - histogram_quantile(0.50, sum by (le) (rate(http_request_duration_ms_bucket[5m]))) (alias: p50)
  - histogram_quantile(0.95, sum by (le) (rate(http_request_duration_ms_bucket[5m]))) (alias: p95)
  - histogram_quantile(0.99, sum by (le) (rate(http_request_duration_ms_bucket[5m]))) (alias: p99)
- Graph type: Line
- Y-axis label: Milliseconds
Panel 3: Error Rate (Gauge)
- Title: Error Rate
- Query: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100
- Graph type: Gauge
- Thresholds: 0-1% green, 1-5% yellow, 5%+ red
Panel 4: Requests by Endpoint (Bar chart)
- Title: Requests by Endpoint
- Query: sum by (endpoint) (rate(http_requests_total[5m]))
- Graph type: Bar chart
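The four panels can also be written down as data. The Python sketch below builds a minimal dashboard dictionary and prints it as JSON. The field names (`panels`, `targets`, `expr`, `legendFormat`, `gridPos`, `fieldConfig`) follow what Grafana dashboard exports contain, but this is a reduced subset and an assumption about your Grafana version; compare it with a dashboard exported from your own instance before relying on it. The `panel` helper and the dashboard title are hypothetical, and the latency queries apply rate() to the buckets before computing quantiles.

```python
import json

LATENCY_BUCKETS = "sum by (le) (rate(http_request_duration_ms_bucket[5m]))"
ERROR_PERCENT = (
    'sum(rate(http_requests_total{status=~"5.."}[5m]))'
    " / sum(rate(http_requests_total[5m])) * 100"
)


def panel(panel_id, title, kind, queries, x, y):
    """Build one panel; queries is a list of (promql, legend) pairs."""
    return {
        "id": panel_id,
        "title": title,
        "type": kind,
        "gridPos": {"x": x, "y": y, "w": 12, "h": 8},
        "targets": [
            {"refId": chr(ord("A") + i), "expr": expr, "legendFormat": legend}
            for i, (expr, legend) in enumerate(queries)
        ],
    }


latency_queries = [
    (f"histogram_quantile({q}, {LATENCY_BUCKETS})", name)
    for q, name in [(0.50, "p50"), (0.95, "p95"), (0.99, "p99")]
]

error_gauge = panel(3, "Error Rate", "gauge", [(ERROR_PERCENT, "errors %")], 0, 8)
# Thresholds from the panel description: green below 1%, yellow from 1%, red from 5%.
error_gauge["fieldConfig"] = {
    "defaults": {
        "unit": "percent",
        "thresholds": {
            "mode": "absolute",
            "steps": [
                {"color": "green", "value": None},
                {"color": "yellow", "value": 1},
                {"color": "red", "value": 5},
            ],
        },
    }
}

dashboard = {
    "title": "Rust HTTP Service",  # hypothetical title
    "panels": [
        panel(1, "Requests Per Second", "timeseries",
              [("sum(rate(http_requests_total[5m]))", "req/s")], 0, 0),
        panel(2, "Request Latency (p50, p95, p99)", "timeseries",
              latency_queries, 12, 0),
        error_gauge,
        panel(4, "Requests by Endpoint", "barchart",
              [("sum by (endpoint) (rate(http_requests_total[5m]))",
                "{{endpoint}}")], 12, 8),
    ],
}

print(json.dumps(dashboard, indent=2))
```

Generating the JSON from a small script keeps the queries in one place, so a change such as a different rate window is made once rather than per panel.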
Setting Up Alerts
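Grafana can fire alerts when a condition is met, such as a high error rate, a latency spike, or a service that is down. The steps below follow the legacy dashboard alerting UI; in recent Grafana versions (unified alerting), rules are managed under Alerting → Alert rules, and notification channels are called contact points.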
Step 1: Create an Alert Rule
- Open a dashboard.
- Click a panel's menu → Edit.
- Go to the Alert tab.
- Set a Condition (e.g., "if p99 latency > 500ms").
- Set For (how long the condition must be true: 5m, 1m, etc.).
- Save.
Step 2: Configure Notification Channel
- Go to Alerting → Notification channels.
- Click New channel.
- Select type: Slack, Email, PagerDuty, Webhook, etc.
- Configure credentials (Slack webhook URL, email address).
- Save.
Step 3: Link Alert to Notification Channel
In the alert rule, select your notification channel from the Send to dropdown.
Example Alert Rule: SLO Violation
Create an alert that fires if p99 latency exceeds your SLO:
Condition: histogram_quantile(0.99, sum by (le) (rate(http_request_duration_ms_bucket[5m]))) > 200
For: 5m
Send to: #sre-alerts (Slack)
This alert fires when p99 latency stays above 200ms for 5 consecutive minutes, meaning more than 1% of requests are slower than 200ms throughout that window.
Error rate alert:
Condition: (sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) > 0.01
For: 1m
Send to: #incident-response (Slack/PagerDuty)
This fires if more than 1% of requests are errors for 1 minute.
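The effect of the For setting is easiest to see as a small state machine: a breach puts the rule into a pending state, and it only starts firing once the breach has lasted for the whole duration without interruption. The Python sketch below is a simplified model of that behavior, not Grafana's implementation; the function name `alert_states` and the sample values are hypothetical.

```python
from typing import Iterable, List, Optional, Tuple


def alert_states(
    samples: Iterable[Tuple[float, float]],
    threshold: float,
    for_seconds: float,
) -> List[str]:
    """Return "normal", "pending", or "firing" for each (timestamp, value)."""
    states: List[str] = []
    breach_started: Optional[float] = None
    for ts, value in samples:
        if value > threshold:
            if breach_started is None:
                breach_started = ts  # first evaluation above the threshold
            held_for = ts - breach_started
            states.append("firing" if held_for >= for_seconds else "pending")
        else:
            breach_started = None  # any recovery restarts the clock
            states.append("normal")
    return states


# Hypothetical p99 latency in ms, evaluated once per minute.
p99 = [(0, 150), (60, 220), (120, 230), (180, 180), (240, 210),
       (300, 250), (360, 240), (420, 260), (480, 230), (540, 225)]
print(alert_states(p99, threshold=200, for_seconds=300))
```

In this run the short spike at 60s to 120s never notifies anyone, while the sustained breach starting at 240s fires at 540s. A longer For value trades detection speed for fewer false alarms.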
Dashboard Best Practices
- Group related metrics: Put request rate and latency on the same dashboard so you can correlate them (a latency spike often follows a traffic spike).
- Use consistent time ranges: Show all panels over the same interval (last 24 hours, last 7 days) for coherent analysis.
- Annotate events: Add annotations for deployments, config changes, and similar events so you can see their impact on metrics.
- Use thresholds: Set visual thresholds on gauges (green/yellow/red) to spot problems quickly.
- Label panels clearly: "Request latency (p99, 5m rolling)" is better than "latency".
- Avoid dashboard bloat: A dashboard with 20+ panels is hard to read. Create separate dashboards for different concerns (performance, errors, resources).
Exporting Dashboards as Code (Dashboard as Code)
Grafana dashboards are JSON. You can version control them in Git:
- Click dashboard menu → Export.
- Copy the JSON.
- Save to dashboards/my-service.json in your repo.
- Check it into Git.
To import:
- Click Create → Import.
- Paste the JSON or upload the file.
This lets you version-control dashboards alongside your infrastructure-as-code (Terraform, Ansible).
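Importing can also be scripted, for example from a CI job that pushes the JSON in your repo to Grafana. The Python sketch below prepares a request for Grafana's dashboard create/update endpoint (POST /api/dashboards/db). The Grafana URL comes from the setup above; the API token, the function name `dashboard_upload_request`, and the choice to always overwrite are assumptions for illustration.

```python
import json
from urllib.request import Request


def dashboard_upload_request(
    grafana_url: str, api_token: str, dashboard: dict
) -> Request:
    """Prepare (but do not send) a create-or-update request for a dashboard."""
    # Exported JSON carries the numeric id of the instance it came from.
    # Clearing it lets the target Grafana match on "uid" or create a new one.
    body = {"dashboard": {**dashboard, "id": None}, "overwrite": True}
    return Request(
        f"{grafana_url}/api/dashboards/db",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Usage sketch (needs a running Grafana and a valid token):
#
#   from urllib.request import urlopen
#   with open("dashboards/my-service.json") as f:
#       request = dashboard_upload_request(
#           "http://localhost:3000", "YOUR_TOKEN", json.load(f))
#   with urlopen(request, timeout=10) as response:
#       print(response.status)
```

Setting overwrite to true makes the repository the source of truth; manual edits made in the UI are replaced on the next upload, which may or may not be what your team wants.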
Common Query Patterns
| Use Case | PromQL Query |
|---|---|
| Request rate | rate(http_requests_total[5m]) |
| Error rate (%) | (sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) * 100 |
| p99 latency | histogram_quantile(0.99, sum by (le) (rate(http_request_duration_ms_bucket[5m]))) |
| Active connections | db_connections_active |
| Cache hit ratio | rate(cache_hits_total[5m]) / (rate(cache_hits_total[5m]) + rate(cache_misses_total[5m])) |
| CPU usage (%) | rate(process_cpu_seconds_total[5m]) * 100 |
| Memory usage (MB) | process_memory_usage_bytes / 1024 / 1024 |
Key Takeaways
- Grafana visualizes Prometheus metrics in dashboards with graphs, gauges, and heatmaps.
- Write PromQL queries to compute rates, percentiles, and aggregations.
- Define alert rules (latency > threshold, error rate > threshold) to catch problems early.
- Configure notification channels (Slack, email, PagerDuty) so you are alerted 24/7.
- Version control dashboards as JSON in Git for reproducibility.
- Group related metrics on dashboards to correlate cause-and-effect.
Frequently Asked Questions
Can I create dashboards programmatically?
Yes. Use the Grafana HTTP API to create or update dashboards. You can also manage dashboards as infrastructure-as-code with the Grafana Terraform provider.
How do I alert on missing data?
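Use the up metric, which Prometheus records for every scrape target:
up{job="my_service"} == 0
This fires if Prometheus cannot scrape your service, which usually means the service is down. If the target can disappear from service discovery entirely, up has no series at all, and absent(up{job="my_service"}) covers that case.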
Can I create dashboards from JSON templates?
Yes. Grafana supports dashboard templating with variables. You can create a template dashboard, export it, and reuse it for multiple services by changing variable values.
What is the difference between Grafana and Prometheus?
- Prometheus: Time-series database, stores metrics, executes PromQL queries.
- Grafana: Visualization layer, reads from Prometheus, displays graphs, alerts.
How do I avoid alert fatigue?
- Set thresholds realistically: Alert on 5% error rate, not every error.
- Use "For" duration: Require condition to persist (5+ minutes) before alerting.
- Group alerts: Send related alerts to a single channel to avoid duplication.
- Auto-resolve: Configure alerts to auto-resolve when the condition clears.