
Best Practices for Managing Data Volume with the OpenTelemetry Collector

CaitlinHalla
Splunk Employee

We can’t guarantee the health of our services or a great user experience without data from our applications. Is CPU usage high? Has the number of requests spiked? Do we have too many Kubernetes nodes stuck in a NotReady state? These metrics and many others reflect what our services and customers are experiencing, but if we can’t see them, we can’t fix them. So, we build out an observability practice, instrument our services, collect all the metrics, export them to an observability backend, and quickly realize that our systems produce an overwhelming amount of data. We store metrics to understand trends, but data storage costs money, so collecting and storing everything quickly becomes a budgetary problem. In the noise of all that data, it’s also difficult to identify which metrics are relevant to determining what is actually hurting our services and users.

In a world where setting up a functional observability practice is as easy as installing the OpenTelemetry Collector configured with auto-instrumentation, there are several ways to manage metric pipelines so that metric collection is sustainable, cost-effective, and genuinely valuable to the reliability of our applications. In this post, we’ll look at how OpenTelemetry processors in particular can help manage data from within metric pipelines so unhelpful data never gets exported and stored, letting you focus on service reliability, faster troubleshooting, and lower observability costs.

Managing Metric Data Volume Best Practices

The following best practices help eliminate metric noise, reduce metric collection volume, and ensure helpful metrics are available and ready to support troubleshooting efforts. 

  1. Use OpenTelemetry semantic conventions for metric names and attributes
  2. Collect metrics intentionally 
  3. Monitor the pipeline itself
  4. Optimize the pipeline and exporting processes

Let’s take a look at each of these.

OpenTelemetry Semantic Conventions

Using OpenTelemetry metrics semantic conventions when naming metrics or metric attributes helps with data analysis and troubleshooting. Defining clear metric names and attributes also helps identify redundancies or commonalities between metrics. Without the use of semantic conventions, different engineering teams might use different names for the same metric, leading to metric redundancy and increased data volume. For example, metrics around total HTTP requests could be named: http_requests_total, total_http_requests, http_request_count, etc. With semantic conventions in place, these individual metrics can be consolidated into one single, shared metric like http.server.requests, which captures aggregated total requests and attributes like request method and endpoint. When metrics follow naming conventions, aggregations, filters, and transformations can more easily be applied to reduce the volume of metric data, reduce the cost of backend platform storage, and improve the effectiveness of observability practices. 
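As a loose sketch of what that consolidation could look like in practice, the metrics transform processor (covered below) can fold convention-breaking variants into a single convention-aligned metric. The legacy names here are hypothetical examples, not names pulled from a real system:

```yaml
processors:
  metricstransform/consolidate_http:
    transforms:
      # match the hypothetical legacy counters emitted by different teams
      - include: '^(http_requests_total|total_http_requests|http_request_count)$'
        match_type: regexp
        action: combine
        new_name: http.server.requests
```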

Collect and Store Metrics Intentionally

With semantically named metrics, it’s easier to identify those that provide value and rename or remove the ones that don’t, but how do you determine which metrics are and are not helpful? Here are some questions to consider:

  • Could the metric be used for an actionable, high-priority alert? 
  • Would the data reported by the metric create a meaningful dashboard?
  • Is the individual metric meaningful? Or would an aggregation be more impactful? 

It’s also important to note that not every metric needs to be (or should be) exported to a backend observability platform – data that isn’t relevant for troubleshooting or development purposes doesn’t need to be readily available there. Cold data that won’t actively be used, like metrics retained for compliance or audit purposes, can instead be exported to storage like Amazon S3 (perhaps even Glacier). This lowers storage costs and keeps observability backends clear of metrics that aren’t immediately helpful in monitoring the resiliency and performance of applications.
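As a sketch of what that separation might look like, a second metrics pipeline can route archival data to the awss3 exporter from the Collector contrib distribution while everything else flows to the observability backend. The bucket name, resource attribute, and endpoint below are hypothetical:

```yaml
receivers:
  otlp:
    protocols:
      grpc:

exporters:
  otlphttp:
    endpoint: https://example-backend:4318   # hypothetical observability backend
  awss3:
    s3uploader:
      region: us-east-1
      s3_bucket: example-metrics-archive     # hypothetical archive bucket
      s3_prefix: audit-metrics

processors:
  batch:
  filter/archive_only:
    metrics:
      metric:
        # drop everything not explicitly flagged for archival (hypothetical resource attribute)
        - 'resource.attributes["data.tier"] != "archive"'

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
    metrics/archive:
      receivers: [otlp]
      processors: [filter/archive_only, batch]
      exporters: [awss3]
```

In practice you would likely also filter the archive-flagged metrics out of the main pipeline so they’re only stored once.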

Monitor the pipeline

Monitoring the pipeline itself (e.g. Collector performance and resource limits) can help identify delays and/or constraints in processing or exporting. This ensures data integrity and quick insight into any issues with metric collection. It also provides insight into the performance and effectiveness of metric collection so you can iterate on which metrics you’re collecting and how you’re collecting them.
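One practical way to do this is to turn up the Collector’s own internal telemetry and scrape it like any other target. The sketch below uses the Collector’s default internal metrics port (8888); the receiver name and scrape interval are illustrative:

```yaml
receivers:
  prometheus/internal:
    config:
      scrape_configs:
        - job_name: otel-collector
          scrape_interval: 30s
          static_configs:
            - targets: ['0.0.0.0:8888']   # default endpoint for the Collector's own Prometheus metrics

service:
  telemetry:
    metrics:
      level: detailed   # emit more granular internal metrics (accepted/refused/dropped counts, queue sizes, etc.)
```

The prometheus/internal receiver then feeds a normal metrics pipeline so the Collector’s own health data lands in the same backend as everything else.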

Optimize the pipeline and exporting process

Optimizing the pipeline collection and exporting processes ensures efficient data flow from collection to the backend platform so you can prevent bottlenecks and delays and successfully use the metrics you collect for performance monitoring and troubleshooting.
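Concretely, this usually comes down to ordering processors sensibly (memory_limiter first, batch last is the commonly recommended pattern) and tuning the exporter’s queue and retry behavior. The endpoint and values below are illustrative:

```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
  batch:

exporters:
  otlp:
    endpoint: example-backend:4317   # hypothetical backend endpoint
    sending_queue:
      enabled: true
      num_consumers: 10    # parallel workers draining the queue
      queue_size: 5000     # batches buffered in memory before data is dropped
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_elapsed_time: 300s

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]   # memory_limiter first, batch last
      exporters: [otlp]
```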

OpenTelemetry Processors

So how do you put these best practices into… practice? The OpenTelemetry Collector provides several processors that can be configured to transform data before it’s sent to observability platform backends. We can think of these processors more as pre-processors, taking many points of data and interpreting or condensing them into more meaningful information. Processors offer more control over metric collection so data can be reported in useful ways that reduce metric noise and storage costs. 

  1. Filter Processor

Metric data can be included or excluded by configuring the filter processor in the OpenTelemetry Collector configuration file. Any low-priority or unhelpful metrics, like those with particular names, types, or attribute values, can be filtered out. Here’s an example that shows how to drop an HTTP healthcheck metric:

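What follows is a minimal sketch of that configuration; the metric name and health-check route are assumptions, so adjust them to match what your instrumentation actually emits:

```yaml
processors:
  filter/drop_healthcheck:
    error_mode: ignore
    metrics:
      metric:
        # drop the (hypothetical) healthcheck metric by name
        - 'name == "http.server.healthcheck.duration"'
      datapoint:
        # or drop only the data points recorded for the health-check route
        - 'attributes["http.route"] == "/healthz"'
```

The processor also has to be listed in the metrics pipeline under service for the conditions to take effect.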

This metric doesn’t provide meaningful or actionable data around application performance or reliability, so to reduce metric volume and storage costs, we can drop it before exporting it to our backend observability platform.

  2. Attribute and Metric Transform Processors

The OpenTelemetry attributes and metrics transform processors can be configured to modify and/or consolidate metrics. Their functionality overlaps a bit – you can add attributes or update attribute values using either processor. The metrics transform processor provides more room for data manipulation, and the docs note that if you’re already using it for this kind of work, there’s no need to switch over to the attributes processor.

The metrics transform processor can also rename metrics, so you can bring metric names and attributes in line with semantic conventions and reduce the number of discrete but related metrics, which are often billed separately in observability backends. For example, when using multiple cloud providers like Amazon Web Services (AWS) and Google Cloud Platform (GCP), each provider reports CPU utilization data under a different metric name. Instead of reporting these metrics separately, they can be combined into a single metric name following semantic conventions to reduce cardinality and improve metric management. Here’s an example of updating AWS and GCP CPU utilization metrics to report under a single cloud.vm.cpu.utilization metric with attributes indicating the cloud provider and cloud service:

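A sketch of that configuration might look like the following; the incoming AWS and GCP metric names and the attribute values are illustrative assumptions rather than fixed names:

```yaml
processors:
  metricstransform/cloud_cpu:
    transforms:
      # hypothetical AWS-reported CPU metric
      - include: aws.ec2.cpuutilization
        action: update
        new_name: cloud.vm.cpu.utilization
        operations:
          - action: add_label
            new_label: cloud.provider
            new_value: aws
          - action: add_label
            new_label: cloud.service
            new_value: ec2
      # hypothetical GCP-reported CPU metric
      - include: gcp.compute.instance.cpu.utilization
        action: update
        new_name: cloud.vm.cpu.utilization
        operations:
          - action: add_label
            new_label: cloud.provider
            new_value: gcp
          - action: add_label
            new_label: cloud.service
            new_value: compute_engine
```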

  3. Group by Attributes Processor

To organize metric data and more easily apply aggregations and transformations to specific groups of metrics, use the group by attributes processor. For example, if you’re collecting data from multiple services, each running on multiple instances, the group by attributes processor can be configured to group metrics by service name as follows:

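A minimal sketch of that configuration, keyed on the standard service.name resource attribute:

```yaml
processors:
  groupbyattrs:
    keys:
      - service.name   # regroup data points under one resource per service
```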

Aggregations can be applied to sum or average metrics for each instance of each service. If service_a is running on instance_1, instance_2, and instance_3, we can use the group by attributes processor to combine these individual instance metrics into one single aggregated service_a metric. This reduces cardinality and data volume, while also making the metric data easier to troubleshoot. 

  4. Batch Processor

While the batch processor doesn’t manipulate the raw metric data itself, it does contribute to an effective metrics pipeline. Effective exporting of our data means it gets to where it needs to go and is readily available for use when we need it, and batching compresses the data and reduces the number of outgoing connections to improve export performance. The processor can easily be configured within the Collector configuration file by specifying batch under the processors block:

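At its simplest, that looks like the following:

```yaml
processors:
  # an empty batch entry uses the processor's defaults; batch still needs to be listed under service.pipelines to take effect
  batch:
```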

Or additional configuration options like batch size and timeout can be specified for more fine-grained control:

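The values below are illustrative and would need to be tuned for your own data volume:

```yaml
processors:
  batch:
    send_batch_size: 8192       # send a batch once this many data points have accumulated
    send_batch_max_size: 10000  # hard upper bound on batch size (0 means no limit)
    timeout: 5s                 # or flush after this much time, whichever comes first
```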

  5. Memory Limiter Processor

Like the batch processor, the memory limiter processor is related to the overall functionality of the metric pipeline. Using this processor helps ensure metric collection functions properly and data is collected, processed, and exported successfully. The memory limiter processor performs periodic checks of memory usage to prevent the Collector from running out of memory. If the Collector hits its memory limits, it will start refusing data, which can lead to metric data loss. Here’s how to configure the memory limiter with a couple of the available options within the Collector configuration file:

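A sketch with illustrative values that would need to be sized for your environment; the memory limiter is typically listed first in the pipeline so it can push back before other processors do any work:

```yaml
processors:
  memory_limiter:
    check_interval: 1s     # how often memory usage is checked
    limit_mib: 512         # hard limit on the Collector's memory usage
    spike_limit_mib: 128   # the soft limit is limit_mib minus this headroom for spikes
```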

Wrap Up

Observability isn’t about collecting all of the data. Collecting all the data is counterproductive to maintaining reliable systems that successfully support our customers. Instead, observability is about surfacing and analyzing actionable data. Managing metric data volume at the point of collection with OpenTelemetry processors helps reduce the noise, making it easier to detect anomalies and resolve issues faster.

OpenTelemetry, a native part of Splunk Observability Cloud, provides built-in metric pipeline management and one unified observability backend platform for profiles, metrics, traces, and logs – no third-party pipeline management tools required. With Splunk Observability Cloud, you can also manage some metrics after ingestion with Metrics Pipeline Management (MPM). Interested in reducing metric volume and storage costs while improving troubleshooting efficiency? Start a Splunk Observability Cloud 14-day free trial and adopt OpenTelemetry to tame your metric pipeline and experience the benefits of one centrally located backend observability platform (aka Splunk Observability Cloud). Using the power of Splunk Observability, watch your metric data flow in and your data storage costs go down, all while optimizing troubleshooting and reducing time to resolve incidents thanks to helpful and well-managed metric data.
