This blog post is part of an ongoing series on OpenTelemetry.
What happens when the OpenTelemetry Collector cannot send data? Does it drop the data, queue it in memory, or queue it on disk? Let's find out which settings are available and how they work!
To check how queueing works, I've set up a small test environment. It consists of several data sources (PostgreSQL, collectd and Prometheus), an OpenTelemetry Collector, and Splunk Enterprise as the destination.
All collected data is sent to Splunk via one HEC endpoint. The internal metrics of the otel collector are sent to Splunk via a second HEC endpoint.
This test environment, including all example configurations, can be found in my Splunk GitHub repository.
I will test the queueing of the otel collector by temporarily disabling the HEC endpoint that receives the collected metrics. The other HEC endpoint will keep receiving the internal metrics, so we can see what's going on inside the collector while it cannot send data.
Queueing in the OpenTelemetry Collector
Queueing is implemented in most exporters via the exporterhelper. These settings are available in both exporters relevant for Splunk: sapm and splunk_hec. For other exporters, please check their documentation or implementation.
Settings for queueing are thus set per exporter in the collector.yaml file. Two types of queues are available, and I will be using them together:
In Memory Queue
By default, without any configuration, data is queued in memory only. When data cannot be sent, the export is retried (for up to 5 minutes by default) and the data is then dropped.
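The retry behaviour can be tuned per exporter as well. As a sketch, these are the exporterhelper retry_on_failure settings with their default values:

```yaml
exporters:
  splunk_hec/metrics:
    # ...endpoint, token, etc...
    retry_on_failure:
      enabled: true
      initial_interval: 5s   # wait before the first retry
      max_interval: 30s      # upper bound for the backoff interval
      max_elapsed_time: 300s # give up (and drop the data) after 5 minutes
```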
If, for any reason, the collector is restarted in this period, the queued data will be gone. If you can accept this type of data loss, you can keep the defaults or tune the queue size in collector.yaml for each exporter like this:
sending_queue
queue_size (default = 5000): Maximum number of batches kept in memory before dropping. Users should calculate this as num_seconds * requests_per_second / requests_per_batch, where:
- num_seconds is the number of seconds to buffer in case of a backend outage
- requests_per_second is the average number of requests per second
- requests_per_batch is the average number of requests per batch (if the batch processor is used, the metric batch_send_size can be used for estimation)
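As a worked example with assumed numbers (not measured from my environment): buffering 600 seconds of an average 10 requests per second, batched at 5 requests per batch, gives 600 * 10 / 5 = 1200 batches:

```yaml
exporters:
  splunk_hec/metrics:
    # ...endpoint, token, etc...
    sending_queue:
      enabled: true
      # num_seconds * requests_per_second / requests_per_batch
      # = 600 * 10 / 5 = 1200 batches
      queue_size: 1200
```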
On-Disk Persistent Queue
This queue is stored on disk, so it persists even when the collector is restarted. On restart, the queue is picked up and exporting resumes.
Testing Queueing Out
In my example test environment, I've set up the metrics HEC exporter to use a persistent queue which is stored on disk using the file_storage extension.
From my collector.yaml:
```yaml
extensions:
  file_storage/psq:
    directory: /persistant_sending_queue
    timeout: 10s # in what time a file lock should be obtained
    compaction:
      directory: /persistant_sending_queue
      on_start: true
      on_rebound: true
      rebound_needed_threshold_mib: 5
      rebound_trigger_threshold_mib: 3
```
I put the persistent queue on disk in the directory /persistant_sending_queue. For the test, I've set some very small limits on the size of the queue on disk: 3 MiB and 5 MiB as the triggers for compaction.
I've also configured my splunk_hec/metrics exporter to use this queue:
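The relevant part of that exporter configuration looks roughly like this (endpoint, token and other settings omitted); the sending_queue.storage field points at the file_storage extension defined above:

```yaml
exporters:
  splunk_hec/metrics:
    # ...endpoint, token, etc...
    sending_queue:
      enabled: true
      storage: file_storage/psq # persist the queue via the file_storage extension
```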
When I disable the metrics HEC input, we see the following happen:
- data is queued
- data is persisted to disk
- no metrics are being received
What Happens When the Connection is Restored?
When the metrics HEC endpoint is enabled again (see green annotation), we see the following happen:
- the queue is drained
- the persistent queue size on disk is slightly reduced
- metric data from the outage is backfilled, so no gaps remain
Even though the queue was completely drained, the on-disk persistent queue has not shrunk much. Since we didn't hit the configured high water mark of 5 MiB for the on-disk queue, compaction was not started. We did go below the low water mark of 3 MiB, so the next time the file grows above 5 MiB and the queue is flushed, compaction will happen.
I ran the same experiment again: let the queue fill up, then enabled the HEC endpoint again to flush the queue. This time the file went above the high water mark of 5 MiB, and we see the on-disk file size is reduced as it should be.
Considerations When Scaling up
Be very mindful of your queue sizes. sending_queue.queue_size controls how many batches are held in memory or on disk before the collector starts dropping data. This value is set per exporter, so to determine the total memory or disk footprint of the collector, the values of all exporters have to be added together.
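As a hypothetical sketch (exporter names and sizes invented for illustration), the worst-case footprint is the sum of all configured queues:

```yaml
exporters:
  splunk_hec/metrics:
    sending_queue:
      queue_size: 1200 # up to 1200 batches for this exporter
  splunk_hec/logs:
    sending_queue:
      queue_size: 800  # up to 800 batches for this exporter
# worst case across the collector: 1200 + 800 = 2000 queued batches
```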