OpenTelemetry stats reports histograms incorrectly #31016

Closed
@dashpole

Title: OpenTelemetry stats reports histograms incorrectly

Description:

While sending Envoy OpenTelemetry metrics to an OpenTelemetry collector and using the logging exporter, I observed a histogram whose Count did not match the sum of its bucket counts (see below). From the OTLP proto definition:

// bucket_counts is an optional field contains the count values of histogram
// for each bucket.
//
// The sum of the bucket_counts must equal the value in the count field.
//
// The number of elements in bucket_counts array must be by one greater than
// the number of elements in explicit_bounds array.
repeated fixed64 bucket_counts = 6;

The number of bucket_counts also appears to be the same as the number of explicit bounds, rather than one greater.

Reading through the implementation, it looks like we are using computedBuckets():

data_point->add_bucket_counts(histogram_stats.computedBuckets()[i]);

... which appears to be cumulative, i.e. the number of samples below each bucket's threshold:

/**
* Returns computed bucket values during the period. The vector contains an approximation
* of samples below each quantile bucket defined in supportedBuckets(). This vector is
* guaranteed to be the same length as supportedBuckets().
*/
virtual const std::vector<uint64_t>& computedBuckets() const PURE;

computeDisjointBuckets() seems like it potentially does what we are looking for.

/**
* Returns version of computedBuckets() with disjoint buckets. This vector is
* guaranteed to be the same length as supportedBuckets().
*/
virtual std::vector<uint64_t> computeDisjointBuckets() const PURE;
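
To illustrate the difference, here is a minimal sketch of turning cumulative counts into the disjoint counts OTLP expects. It assumes the input behaves like computedBuckets() as documented above (one cumulative count per bound). The helper name toOtlpBucketCounts and its parameters are hypothetical and not part of Envoy. Since both accessors are documented as having the same length as supportedBuckets(), the sketch also appends the extra overflow bucket for samples above the last bound, derived from the total sample count.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper (not an Envoy API): converts cumulative bucket counts,
// one per explicit bound, into OTLP-style disjoint bucket_counts with one
// extra trailing overflow bucket.
std::vector<uint64_t> toOtlpBucketCounts(const std::vector<uint64_t>& cumulative,
                                         uint64_t total_count) {
  std::vector<uint64_t> disjoint;
  disjoint.reserve(cumulative.size() + 1);

  uint64_t previous = 0;
  for (uint64_t c : cumulative) {
    // Cumulative counts should be non-decreasing; clamp defensively because
    // the values are documented as approximations.
    disjoint.push_back(c >= previous ? c - previous : 0);
    if (c > previous) {
      previous = c;
    }
  }

  // Overflow bucket: samples above the last explicit bound.
  disjoint.push_back(total_count >= previous ? total_count - previous : 0);
  return disjoint;
}
```

With N bounds the result has N + 1 entries, and as long as the total count is at least the largest cumulative value, the entries sum to the total count, which matches the two requirements quoted from the proto.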

Collector logging exporter output:

StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2023-11-22 00:54:12.184643877 +0000 UTC
Count: 1
Sum: 375.000000
ExplicitBounds #0: 0.500000
ExplicitBounds #1: 1.000000
ExplicitBounds #2: 5.000000
ExplicitBounds #3: 10.000000
ExplicitBounds #4: 25.000000
ExplicitBounds #5: 50.000000
ExplicitBounds #6: 100.000000
ExplicitBounds #7: 250.000000
ExplicitBounds #8: 500.000000
ExplicitBounds #9: 1000.000000
ExplicitBounds #10: 2500.000000
ExplicitBounds #11: 5000.000000
ExplicitBounds #12: 10000.000000
ExplicitBounds #13: 30000.000000
ExplicitBounds #14: 60000.000000
ExplicitBounds #15: 300000.000000
ExplicitBounds #16: 600000.000000
ExplicitBounds #17: 1800000.000000
ExplicitBounds #18: 3600000.000000
Buckets #0, Count: 0
Buckets #1, Count: 0
Buckets #2, Count: 0
Buckets #3, Count: 0
Buckets #4, Count: 0
Buckets #5, Count: 0
Buckets #6, Count: 0
Buckets #7, Count: 0
Buckets #8, Count: 1
Buckets #9, Count: 1
Buckets #10, Count: 1
Buckets #11, Count: 1
Buckets #12, Count: 1
Buckets #13, Count: 1
Buckets #14, Count: 1
Buckets #15, Count: 1
Buckets #16, Count: 1
Buckets #17, Count: 1
Buckets #18, Count: 1

The sum of the bucket counts is 11 (buckets #8 through #18), but the count is 1.
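
The two OTLP requirements quoted earlier can be checked mechanically. The following is a rough sketch only: the function names are hypothetical, and the check works on plain vectors rather than the actual OTLP protobuf types.

```cpp
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical helper: total of all bucket counts.
uint64_t bucketCountSum(const std::vector<uint64_t>& bucket_counts) {
  return std::accumulate(bucket_counts.begin(), bucket_counts.end(), uint64_t{0});
}

// Hypothetical check of the two requirements from the proto comment:
//  1. bucket_counts has exactly one more element than explicit_bounds.
//  2. The sum of bucket_counts equals the count field.
bool satisfiesOtlpHistogramInvariants(uint64_t count, size_t num_explicit_bounds,
                                      const std::vector<uint64_t>& bucket_counts) {
  if (bucket_counts.size() != num_explicit_bounds + 1) {
    return false;
  }
  return bucketCountSum(bucket_counts) == count;
}
```

For the data point in the exporter output (19 explicit bounds, 19 bucket counts, Count: 1), this check fails on both conditions. A disjoint layout with 20 entries and a single 1 in bucket #8 would pass.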

Repro steps:
Run Envoy configured with the OpenTelemetry stats sink and send to an OpenTelemetry collector with the logging exporter, with logLevel: debug to print out the OTLP.


cc @ohadvano
