Histograms for Latency: Measuring Request Distribution

A histogram measures the distribution of values over time, answering questions like "How many requests took less than 100ms?" and "What is the 99th percentile latency?" Histograms are essential for latency-sensitive services: you cannot guarantee 99% of requests meet your SLA by only knowing the average.

Histograms do not store percentiles directly. They expose bucket counts, a sum, and a count, from which the backend estimates percentiles (p50, p95, p99, p99.9) at query time, enabling precise alerting. Unlike counters and gauges, histograms depend on bucket boundaries that are fixed upfront, but once configured, recording an observation is cheap.

Understanding Histogram Buckets

A histogram divides a range of values into buckets and counts how many observations fall into each. When Prometheus scrapes your metrics, it receives cumulative bucket counts, and the percentile calculation happens in the visualization layer or alerting rules.

For request latency in milliseconds, typical buckets might be:

Bucket   Upper bound (ms)   Requests in bucket
1        5                  5,234
2        10                 3,102
3        25                 1,847
4        50                 812
5        100                421
6        250                103
7        500                34
8        1000               12
9        2500               2
10       +Inf               1

From this histogram (11,568 requests in total), you can estimate:

  • p50 (median): between 5-10ms (approximately 5.9ms by interpolation).
  • p99: between 100-250ms (approximately 153ms).
  • p99.9: between 500-1000ms (approximately 643ms).

The wider your buckets, the less accurate the percentile estimate, but the smaller your memory footprint. Histograms with 10-15 buckets are standard.
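Percentile estimates for a table like the one above come from linear interpolation inside the bucket that contains the target rank. The following is a minimal sketch of that idea, not the exact routine any particular backend uses; the function name estimate_quantile is hypothetical, and it assumes observations are spread evenly within each bucket and that the first bucket starts at 0.

```rust
/// Estimate quantile `q` (0.0..=1.0) from per-bucket counts by linear
/// interpolation inside the bucket that contains the target rank.
/// `upper_bounds[i]` is the inclusive upper bound of bucket `i`; the last
/// bound is expected to be `f64::INFINITY`.
fn estimate_quantile(upper_bounds: &[f64], counts: &[u64], q: f64) -> Option<f64> {
    if upper_bounds.len() != counts.len() || !(0.0..=1.0).contains(&q) {
        return None;
    }
    let total: u64 = counts.iter().sum();
    if total == 0 {
        return None;
    }
    let rank = q * total as f64;
    let mut seen = 0u64; // observations in all earlier buckets
    for (i, (&upper, &count)) in upper_bounds.iter().zip(counts).enumerate() {
        let cumulative = seen + count;
        if count > 0 && cumulative as f64 >= rank {
            // Assumption: the first bucket starts at 0.
            let lower = if i == 0 { 0.0 } else { upper_bounds[i - 1] };
            if upper.is_infinite() {
                // No upper edge to interpolate towards; report the lower edge.
                return Some(lower);
            }
            let fraction = (rank - seen as f64) / count as f64;
            return Some(lower + (upper - lower) * fraction);
        }
        seen = cumulative;
    }
    None
}
```

Fed the ten bucket counts from the table, this yields roughly 5.9ms for p50 and roughly 153ms for p99. The estimate can never be more precise than the width of the bucket it lands in, which is why bucket placement matters.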

Creating Histograms in Rust

use opentelemetry::global;
use opentelemetry::KeyValue;

let meter = global::meter("my_app");

let request_duration = meter.f64_histogram("http_request_duration_ms")
    .with_description("HTTP request duration in milliseconds")
    .with_unit("ms")
    .init();

// In your request handler, measure elapsed time:
let start = std::time::Instant::now();

// ... handle request ...

let elapsed_ms = start.elapsed().as_secs_f64() * 1000.0;
request_duration.record(elapsed_ms, &[
    KeyValue::new("http.method", "GET"),
    KeyValue::new("http.status_code", 200),
]);

Each call to .record() captures a single observation; OpenTelemetry assigns it to a bucket and updates the count and sum. Unlike counters, which you increment, you call .record() with the measured value.
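To make the bucketing step concrete, here is a simplified, std-only sketch of what recording into an explicit-bucket histogram involves. It is an illustration under stated assumptions, not the OpenTelemetry SDK's implementation: the BucketHistogram type is hypothetical, bounds are treated as inclusive upper bounds, and there is no concurrency handling.

```rust
/// A simplified explicit-bucket histogram: `bounds` are sorted inclusive upper
/// bounds, and `counts` has one extra slot for values above the last bound.
struct BucketHistogram {
    bounds: Vec<f64>,
    counts: Vec<u64>,
    sum: f64,
    count: u64,
}

impl BucketHistogram {
    fn new(bounds: Vec<f64>) -> Self {
        let counts = vec![0; bounds.len() + 1];
        Self { bounds, counts, sum: 0.0, count: 0 }
    }

    /// Record one observation: locate its bucket, then update the totals.
    fn record(&mut self, value: f64) {
        // Binary search for the first bound that is >= value. If no bound
        // qualifies, the index points at the overflow slot.
        let index = self.bounds.partition_point(|&bound| bound < value);
        self.counts[index] += 1;
        self.sum += value;
        self.count += 1;
    }
}
```

With bounds of 5, 10, and 25, recording 3.2, 5.0, 7.5, and 100.0 leaves two observations in the first bucket, one in the second, none in the third, and one in the overflow slot. The work per call is a binary search and three updates, which is why recording stays cheap as traffic grows.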

Automatic Bucket Boundaries

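By default, the OpenTelemetry SDK uses fixed explicit bucket boundaries of 0, 5, 10, 25, 50, 75, 100, 250, 500, 750, 1000, 2500, 5000, 7500, and 10000, with a final overflow bucket up to infinity. These suit latencies recorded in milliseconds but are too coarse if you record seconds. (An exponential histogram aggregation also exists, but it must be selected explicitly.) If you have specific requirements, set custom boundaries: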

let request_duration = meter.f64_histogram("http_request_duration_ms")
    .with_description("HTTP request duration in milliseconds")
    .with_unit("ms")
    // Explicit boundaries in ms. HistogramBuilder::with_boundaries is only
    // available in newer opentelemetry releases (which may also finish the
    // builder with .build() instead of .init()); older releases configure
    // buckets through an SDK View instead.
    .with_boundaries(vec![
        1.0, 5.0, 10.0, 25.0, 50.0, 100.0, 250.0, 500.0, 1000.0, 2500.0, 5000.0,
    ])
    .init();

Choose boundaries that align with your SLA. If your SLA is "99% of requests under 200ms", use buckets like [50, 100, 150, 200, 300, ...] so that 200ms is a bucket boundary.
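The reason to put the SLA threshold on a bucket boundary is that compliance can then be read from the counts exactly, with no interpolation. A minimal sketch, assuming cumulative counts as Prometheus exposes them; the function name fraction_within is hypothetical:

```rust
/// Fraction of observations at or below `threshold`, given sorted bucket
/// upper bounds and cumulative counts (the last cumulative count is the total).
/// Returns `None` unless `threshold` is exactly one of the bounds, because the
/// count at an arbitrary value cannot be read off the histogram exactly.
fn fraction_within(bounds: &[f64], cumulative: &[u64], threshold: f64) -> Option<f64> {
    let total = *cumulative.last()?;
    if total == 0 || bounds.len() != cumulative.len() {
        return None;
    }
    let index = bounds.iter().position(|&b| b == threshold)?;
    Some(cumulative[index] as f64 / total as f64)
}
```

With bounds [50, 100, 150, 200, 300, +Inf] and cumulative counts [700, 900, 960, 992, 999, 1000], the fraction at 200ms is 0.992, so a "99% under 200ms" target is met. Asking about 250ms returns None because no bucket ends there; the histogram can only answer that question approximately.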

Instrumenting Request Handlers with Histograms

In a real web service:

use axum::{routing::get, Router};
use opentelemetry::{global, metrics::Histogram};
use std::sync::OnceLock;
use std::time::Instant;

// Build the instrument once and reuse it; creating it on every request is wasteful.
// Assumes a meter provider has been installed before the first request arrives.
fn request_duration() -> &'static Histogram<f64> {
    static HISTOGRAM: OnceLock<Histogram<f64>> = OnceLock::new();
    HISTOGRAM.get_or_init(|| {
        global::meter("web_service")
            .f64_histogram("http_request_duration_ms")
            .with_description("Request latency")
            .with_unit("ms")
            .init()
    })
}

#[tokio::main]
async fn main() {
    let app = Router::new()
        .route("/api/users", get(list_users_with_histogram));

    let listener = tokio::net::TcpListener::bind("127.0.0.1:8080")
        .await
        .expect("Failed to bind");

    axum::serve(listener, app).await.expect("Server error");
}

async fn list_users_with_histogram() -> String {
    let start = Instant::now();

    // Simulate some work
    tokio::time::sleep(tokio::time::Duration::from_millis(45)).await;
    let result = "Users: Alice, Bob, Charlie".to_string();

    // Record the elapsed time in milliseconds on the shared instrument.
    let duration_ms = start.elapsed().as_secs_f64() * 1000.0;
    request_duration().record(duration_ms, &[]);

    result
}

The issue with this approach is that you must manually wrap every handler. For cleaner instrumentation, use a middleware or the tracing crate (covered in later articles).
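Short of full middleware, one way to cut the per-handler boilerplate is a small guard that records the elapsed time when it goes out of scope. The sketch below is std-only and hypothetical: LatencyGuard is not part of any library, and its closure stands in for a call such as request_duration.record(ms, &[]).

```rust
use std::time::Instant;

/// Records the elapsed time in milliseconds when dropped. `record` is a
/// hypothetical stand-in for a call such as `request_duration.record(ms, &[])`.
struct LatencyGuard<F: FnMut(f64)> {
    start: Instant,
    record: F,
}

impl<F: FnMut(f64)> LatencyGuard<F> {
    fn new(record: F) -> Self {
        Self { start: Instant::now(), record }
    }
}

impl<F: FnMut(f64)> Drop for LatencyGuard<F> {
    fn drop(&mut self) {
        // Same conversion as in the handlers above: seconds to milliseconds.
        let elapsed_ms = self.start.elapsed().as_secs_f64() * 1000.0;
        (self.record)(elapsed_ms);
    }
}
```

A handler would create the guard on its first line, for example let _guard = LatencyGuard::new(|ms| ...);, and the measurement is taken on every exit path, including early returns. The trade-off is that the guard cannot see the response, so attributes such as the status code have to be captured some other way.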

Histograms with Attributes

Like counters and gauges, histograms can have attributes to split results by dimension:

let request_duration = meter.f64_histogram("http_request_duration_ms")
    .with_description("Request latency by method and status")
    .init();

request_duration.record(elapsed_ms, &[
    KeyValue::new("http.method", "GET"),
    KeyValue::new("http.status_code", 200),
    KeyValue::new("http.target", "/api/users"),
]);

This creates separate histogram distributions for each combination of attributes. You can then alert on p99 latency per endpoint: "Alert if p99 latency for /api/users > 500ms".

Comparing Histograms to Gauges for Latency

Metric Type   Use Case                                            Example
Histogram     Track distribution of values, compute percentiles   Request latency (p50, p95, p99)
Gauge         Track a single instantaneous value                  Last request duration
Counter       Track cumulative count                              Total requests processed

You might emit both: a histogram for latency distribution analysis, and a gauge for "last request duration" (useful for debugging individual requests).

Reading Prometheus Output

When you scrape metrics, a histogram generates multiple time series. For http_request_duration_ms, the metrics endpoint exposes:

# Buckets
http_request_duration_ms_bucket{le="1.0", ...} 0
http_request_duration_ms_bucket{le="5.0", ...} 123
http_request_duration_ms_bucket{le="10.0", ...} 234
...
http_request_duration_ms_bucket{le="+Inf", ...} 456

# Sum and count
http_request_duration_ms_sum 15234.5 # total ms across all requests
http_request_duration_ms_count 456 # total number of requests

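The _sum and _count series give you average latency: sum(rate(http_request_duration_ms_sum[5m])) / sum(rate(http_request_duration_ms_count[5m])) is the mean over the last 5 minutes. The _bucket series are cumulative (each le bucket includes every smaller one), and they enable percentile queries.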
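The relationship between per-bucket counts and the exposed series can be sketched in a few lines. This is an illustration only; the helper names cumulative_counts and average are hypothetical, and a real exporter does this work for you.

```rust
/// Convert per-bucket counts into the cumulative counts that the `le`
/// ("less than or equal") series carry.
fn cumulative_counts(per_bucket: &[u64]) -> Vec<u64> {
    per_bucket
        .iter()
        .scan(0u64, |running, &count| {
            *running += count;
            Some(*running)
        })
        .collect()
}

/// Mean of all observations, from the `_sum` and `_count` values.
/// Returns `None` when nothing has been recorded.
fn average(sum: f64, count: u64) -> Option<f64> {
    if count == 0 {
        None
    } else {
        Some(sum / count as f64)
    }
}
```

For the sample output above, average(15234.5, 456) comes to about 33.4ms. The last cumulative count always equals _count, which is why the le="+Inf" bucket and http_request_duration_ms_count show the same value.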

Key Takeaways

  • Histograms measure distributions of values and let you estimate percentiles (p50, p95, p99) at query time.
  • Choose bucket boundaries aligned with your SLA (e.g., if the SLA is "under 200ms", make 200ms a bucket boundary).
  • The OpenTelemetry SDK defaults to fixed explicit boundaries (0, 5, 10, 25, and so on up to 10000); customize them if they do not fit your latency range.
  • Record latency as (end_time - start_time).as_secs_f64() * 1000.0 for milliseconds.
  • Histograms create multiple time series (one per bucket); use judiciously to avoid cardinality explosion.

Frequently Asked Questions

What is the difference between histogram buckets and percentiles?

Buckets are the storage mechanism; percentiles are computed from buckets. A histogram with buckets 1, 5, 10, 25, 50 can estimate p50 (median) by finding the bucket where cumulative count crosses 50% of total. The more buckets, the more accurate the percentile.

How do I compute p99 latency in Prometheus?

Use the histogram_quantile() function:

histogram_quantile(0.99, rate(http_request_duration_ms_bucket[5m]))

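This computes the 99th percentile latency over the last 5 minutes, separately for each label combination. To get a single value across all series, aggregate first while keeping the le label: histogram_quantile(0.99, sum by (le) (rate(http_request_duration_ms_bucket[5m]))). Replace 0.99 with 0.95 for p95, 0.50 for p50 (median), etc.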

Is histogram cardinality a problem?

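Yes. A histogram with 10 finite bucket boundaries exposes about 13 time series per unique attribute combination: 10 buckets, the +Inf bucket, _sum, and _count. The number of combinations is the product of the distinct values of each attribute, so it grows quickly. With 100 endpoints as the only attribute, one histogram emits about 1,300 series (100 endpoints × 13); adding a method attribute with 5 values and a status attribute with 6 values could raise that to 39,000. Limit attributes carefully, and consider aggregating or using high-level metrics.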
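A back-of-the-envelope estimate helps before adding an attribute. The sketch below is a rough upper bound under the assumption that every attribute combination actually occurs; the function name histogram_series is hypothetical.

```rust
/// Rough count of Prometheus time series produced by one histogram:
/// one `_bucket` series per finite bound, one for `le="+Inf"`, plus `_sum`
/// and `_count`, multiplied by the number of distinct attribute combinations.
/// `attribute_value_counts` holds the number of distinct values per attribute.
fn histogram_series(finite_bounds: usize, attribute_value_counts: &[usize]) -> usize {
    // The product of an empty slice is 1: a histogram with no attributes.
    let combinations: usize = attribute_value_counts.iter().product();
    (finite_bounds + 3) * combinations
}
```

For example, histogram_series(10, &[100]) gives 1,300, and histogram_series(10, &[5, 6, 100]) gives 39,000. Real numbers are usually lower because many combinations never occur, but the multiplication shows why a single high-cardinality attribute dominates the total.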

Can I change bucket boundaries after deployment?

No. Bucket boundaries are fixed at initialization. If you need to change them, you must redeploy. Plan boundaries based on your SLA and expected latency ranges.

What is the overhead of histograms compared to counters?

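Histograms have higher memory and CPU overhead than counters, but recording itself is cheap: a bucket lookup plus updates to a few counters. The main cost is memory and exported series, since every attribute combination carries one count per bucket plus a sum and a count. If profiling shows that recording is a bottleneck in a high-concurrency service, sampling (for example, recording 1 in 1,000 requests) reduces overhead, at the cost that _count no longer reflects total requests and rare slow requests may be missed.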

Further Reading