
Observability Cloud | A Guide to Spinnaker Infrastructure Monitoring

rohits
Splunk Employee

Introduction

Spinnaker is an open-source, multi-cloud continuous delivery platform composed of a series of microservices, each performing a specific function. Understanding the performance and health of these individual components is critical for maintaining a robust Spinnaker environment. Here are the primary services you might want to monitor:

  • Deck: The browser-based UI. Key monitoring aspects include load times, error rates, and user activity.
  • Gate: The API gateway and the main entry point for all programmatic access to internal Spinnaker services. Monitor error rates, request/response times, and traffic volume.
  • Orca: The orchestration engine. It handles all ad-hoc operations and pipelines. Monitor task execution times, task failure rates, and queue length.
  • Clouddriver: Responsible for all mutating operations and caching infrastructure. Monitor cache times, error rates, and operation completion times.
  • Front50: The metadata store, persisting application, project, pipeline definitions, and pipeline execution history. Monitor read/write times, error rates, and data volumes.
  • Rosco: Responsible for producing machine images. Monitor image baking times, error rates, and queue lengths.
  • Echo: The eventing service. Monitor event delivery times, error rates, and queue lengths.
  • Fiat: The authorization service. Monitor authorization times, error rates, and volume of authorization checks.
  • Igor: Integrates with CI systems and other tools. Monitor job completion times, error rates, and queue lengths.

Each of these services exposes metrics that can be scraped by the Splunk Distribution of the OpenTelemetry Collector and analyzed for performance and health insights.

This document serves as a comprehensive guide for monitoring Spinnaker infrastructure and services running on Kubernetes (K8s) with Splunk Observability Cloud and the Splunk Platform. Because no direct integration is currently available, several steps are required to enable end-to-end monitoring of Spinnaker.

Enabling Prometheus Endpoints for Spinnaker Monitoring


Armory, the provider of hosted Spinnaker, recommends using their Observability plugin to expose metrics via Prometheus endpoints. These endpoints can then be scraped by the Splunk Distribution of the OpenTelemetry Collector.

The plugin offers direct integrations with other observability products, but connecting it to Splunk Observability Cloud is straightforward once the Splunk Distribution of the OpenTelemetry Collector is configured to scrape the Prometheus endpoints.
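For reference, the plugin is enabled through standard Spinnaker plugin configuration in a profile such as spinnaker-local.yml. The snippet below is a rough sketch only; the plugin ID, repository URL, and options shown are assumptions drawn from the plugin's public documentation, so verify them against the Armory Observability Plugin docs before use:

spinnaker:
  extensibility:
    plugins:
      # Plugin ID as published by Armory (verify against the plugin docs)
      Armory.ObservabilityPlugin:
        enabled: true
        config:
          metrics:
            prometheus:
              # Exposes the /aop-prometheus endpoint that is scraped later in this guide
              enabled: true
    repositories:
      armory-observability-plugin-releases:
        # Public plugin repository index (verify the URL in the plugin docs)
        url: https://raw.githubusercontent.com/armory-plugins/pluginRepository/master/repositories.json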

Installation of the Splunk Distribution of the OpenTelemetry Collector


Install the collector using the Helm chart on the Spinnaker Kubernetes cluster. For detailed instructions, refer to Collector Installation.

Given the extensive custom configuration needed to send logs to Splunk Cloud/Enterprise and metrics to Splunk Observability Cloud, we recommend using a custom values.yaml file.

Below is a sample configuration snippet for Splunk Observability Cloud and Splunk Platform:

 

splunkPlatform:
  # Required for Splunk Enterprise/Cloud. URL to a Splunk instance to send data
  # to. e.g. "http://10.202.11.190:8088/services/collector/event". Setting this parameter
  # enables Splunk Platform as a destination. Use the /services/collector/event
  # endpoint for proper extraction of fields.
  endpoint: https://10.202.7.134:8088/services/collector
  # Required for Splunk Enterprise/Cloud (if `endpoint` is specified). Splunk
  # HTTP Event Collector token.
  # Alternatively the token can be provided as a secret.
  # Refer to https://github.com/signalfx/splunk-otel-collector-chart/blob/main/docs/advanced-configuration.md#provide-tokens-as-a-secret
  token: xxx-xxx-xxx-xxx-xxx
  # Name of the Splunk event type index targeted. Required when ingesting logs to Splunk Platform.
  index: "pure"
  logsEnabled: true
  metricsEnabled: false
  tracesEnabled: false
splunkObservability:
  realm: us1
  accessToken: XXXXXXXX
  ingestUrl: https://ingest.us1.signalfx.com
  apiUrl: ""
  metricsEnabled: true
  tracesEnabled: true
  logsEnabled: false
logsEngine: otel

 

Ensure that metricsEnabled is set to true under splunkObservability so that the Prometheus metrics are sent to Splunk Observability Cloud. Since logsEnabled is set to true under splunkPlatform, it can remain false here.
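Note that the chart also expects a clusterName value, which tags all telemetry with the Kubernetes cluster it came from; a minimal sketch (the name itself is up to you):

clusterName: spinnaker-k8s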

The following configuration in the values.yaml file enables the collector pods to scrape Spinnaker pods for Prometheus metrics:

 

config:
  receivers:
    prometheus/spinnaker:
      config:
        scrape_configs:
          - job_name: 'spinnaker'
            kubernetes_sd_configs:
              - role: pod
            metrics_path: /aop-prometheus
            scheme: https
            relabel_configs:
              - source_labels: [__meta_kubernetes_pod_ip, __meta_kubernetes_pod_container_port_number]
                action: replace
                target_label: __address__
                separator: ":"
            tls_config:
              insecure_skip_verify: true

 

This configuration sets up a job to scrape metrics from all pods (role: pod) in the Spinnaker Kubernetes cluster, using the /aop-prometheus metrics path. The insecure_skip_verify: true is used to bypass TLS verification, but be aware that this can be a security risk and should only be used for testing purposes or if you understand the implications.
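If you would rather not scrape every pod in the cluster, you can narrow the job to Spinnaker pods with an additional keep rule before the address rewrite. This is a minimal sketch assuming standard Spinnaker installations label their pods with app: spin; verify the labels on your own pods first:

            relabel_configs:
              # Keep only pods carrying the (assumed) Spinnaker label app=spin; drop everything else
              - source_labels: [__meta_kubernetes_pod_label_app]
                regex: spin
                action: keep
              # Existing rule from above: build the scrape address from the pod IP and container port
              - source_labels: [__meta_kubernetes_pod_ip, __meta_kubernetes_pod_container_port_number]
                action: replace
                target_label: __address__
                separator: ":"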

Sample Helm command to install the OpenTelemetry Collector:

helm install splunk-otel-collector splunk-otel-collector-chart/splunk-otel-collector -f condensed_values.yaml
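If the chart repository has not been added to your Helm client yet, the usual prerequisite per the chart's public documentation is:

helm repo add splunk-otel-collector-chart https://signalfx.github.io/splunk-otel-collector-chart
helm repo update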

Verification of Metrics Ingestion in Splunk Observability Cloud


Confirming the correct ingestion of metrics in Splunk Observability Cloud may initially pose a challenge, particularly if you don't immediately know the names of the Prometheus metrics. However, you can work around this by performing the following steps:

  • Curl the endpoint: Curl the /aop-prometheus endpoint to retrieve the names of the metrics, e.g. https://localhost:7002/aop-prometheus
  • Enable Debug Logging on the OpenTelemetry Collector: Adjust the collector's configuration to enable debug logging. This setting will let you view more detailed information about the collector's operations, including metric names. Here is a sample configuration to enable debug logging:

 

config:
  service:
    telemetry:
      logs:
        level: "debug"

 

  • Use Splunk SignalFlow to Identify Metrics: Splunk SignalFlow allows you to write data computations for your metrics. Using SignalFlow, you can isolate and display the metrics collected from the Prometheus endpoints. Here's an example of a SignalFlow query that lists all the metrics exposed by the Prometheus endpoints:

A = data('*', filter=filter('sf_metric', '*') and filter('k8s.pod.name', 'spin-orca-*')).count(by=['sf_metric']).publish(label='A')

By following these steps, you should be able to verify the ingestion of metrics from your Spinnaker services into Splunk Observability Cloud.

Creation of Spinnaker Metrics Dashboard


Presently, Splunk Observability Cloud does not come with out-of-the-box (OOTB) dashboards for Spinnaker. However, this does not preclude you from creating insightful, customized visualizations of your Spinnaker performance metrics.

One approach is to leverage the plethora of open-source Grafana dashboards available for each Spinnaker service. A repository containing these dashboards can be found at uneeq-oss/spinnaker-mixin.

To create your own Splunk dashboards, examine the code of these Grafana dashboards and construct corresponding SignalFlow expressions in Splunk. Let's consider the following Grafana query as an example:

 

sum by (controller, status) (
  rate(controller_invocations_seconds_sum{container="orca"}[$__rate_interval])
)
/
sum by (controller, status) (
  rate(controller_invocations_seconds_count{container="orca"}[$__rate_interval])
)

 

You can translate this Grafana query into the following SignalFlow expressions:

 

A = data('controller_invocations_seconds_sum', filter=filter('k8s.container.name', 'orca')).sum(by=['controller', 'status']).publish(label='A', enable=False)

B = data('controller_invocations_seconds_count', filter=filter('k8s.container.name', 'orca')).sum(by=['controller', 'status']).publish(label='B', enable=False)

C = (A/B).publish(label='C')

 

Using this strategy, you can create a Splunk Observability Cloud dashboard that suits your specific monitoring needs for Spinnaker.

A sample dashboard would look like this:

[Screenshot: sample Spinnaker metrics dashboard in Splunk Observability Cloud]

Enabling Armory Continuous Deployment Logging Data


To log data about individual accounts and functions within Armory Continuous Deployment, you can push this data directly to Splunk HEC endpoints without going through the OpenTelemetry Collector. For more details, follow this link: Developer Insights.
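For reference, a direct push to a Splunk HEC endpoint looks roughly like the sketch below; the host, token, and event payload here are hypothetical placeholders, and the Developer Insights integration handles the actual formatting for you:

# Hypothetical example of posting a single JSON event to a Splunk HEC endpoint
curl -k https://10.202.7.134:8088/services/collector/event \
  -H "Authorization: Splunk <HEC_TOKEN>" \
  -d '{"index": "spinnaker", "sourcetype": "_json", "event": {"account": "example-account", "message": "deployment started"}}'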

  1. You can import the JSON for the available dashboards into Splunk Cloud/Enterprise. Read more.
  2. Modify the index accordingly; the default index in these dashboards is spinnaker.

Troubleshooting Tips


If you encounter issues, such as not seeing any metrics being ingested from the Prometheus endpoints, consider the following tips to identify and resolve the problem:

  1. Inspect the Logs of the OpenTelemetry Collector Pods: The logs of the collector pods can provide valuable insights when you don't see any metrics coming in from the Prometheus endpoints. These logs may contain debugging information that can help pinpoint any issues with the scraping of the endpoints. You can check the logs of the collector pods by using the kubectl logs command. Suppose your OpenTelemetry Collector pod is named otel-collector-abcde; you can view its logs with the following command:

 

kubectl logs otel-collector-abcde

 

To continuously stream the logs, add the -f flag as shown below:

 

kubectl logs -f otel-collector-abcde

 

If you prefer to select the pods by label rather than by name, use a label selector that matches your OpenTelemetry Collector pods, as shown below.
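A minimal sketch, assuming your pods carry the app.kubernetes.io/name=otel-collector label; adjust the selector to whatever labels your Helm release applies:

# Stream logs from all pods matching the label selector
kubectl logs -f -l app.kubernetes.io/name=otel-collector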

  2. Understand the Role of the Service Discovery (SD) Config: To view the SD config, inspect the configuration of your OpenTelemetry Collector. Depending on how you deployed the collector, this configuration might be located in a ConfigMap, a command-line argument, or a file in the pod. If it's in a ConfigMap, you can view it with:

 

kubectl get configmap otel-collector-config -o yaml

 

If the configuration is passed as a command-line argument or a file, you might need to examine the pod specification or access the pod's filesystem to find it. To inspect the pod specification, use:

 

kubectl get pod otel-collector-abcde -o yaml

 

  3. Datapoints being dropped: There are certain organization-level dashboards that can help you find throttling issues at the collector/token level that might prevent data points from being ingested into the platform.
    • Start by looking at the dashboards here: Dashboards -> Built-in Dashboard Groups -> Organization metrics -> IMM Throttling.
    • Look at the token throttling and data points dropped dashboards, which will look something like this:

[Screenshots: token throttling and data points dropped dashboards]

  4. Then look at the collector pod logs, which will show you which metrics are being dropped. The logs may look like this:

 

2023-07-20T22:45:51.977Z debug translation/converter.go:240 dropping datapoint {"kind": "exporter", "data_type": "metrics", "name": "signalfx", "reason": "number of dimensions is larger than 36", "datapoint": "source:\"\" metric:\"controller_invocations_contentLength_total

 

Splunk will drop data points/MTS if they don't meet certain limits; in the above case, the number of dimensions exceeded 36. Read more.

  5. The best way to eliminate throttling and get rid of these errors is to drop these metrics at the Prometheus receiver level if they are not required; metric_relabel_configs is the important key here. Read more.

 

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: 'otel-collector'
          scrape_interval: 5s
          static_configs:
            - targets: ['0.0.0.0:8888']
        - job_name: k8s
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
              regex: "true"
              action: keep
          metric_relabel_configs:
            - source_labels: [__name__]
              regex: "(request_duration_seconds.*|response_duration_seconds.*)"
              action: keep
 

  6. Another error that can cause data points to be dropped and token throttling is this: 2023-07-21T17:28:49.553Z debug translation/converter.go:105 Datapoint does not match the filter, skipping {"kind": "exporter", "data_type": "metrics", "name": "signalfx", "dp": "source:\"\" metric:\"k8s.pod.memory.working_set\"

The above error shows that the SignalFx exporter is skipping this metric because it is excluded by default by the exporter. These errors can be ignored, as they are expected on Kubernetes clusters. Read more.
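If you do want such a metric, the SignalFx exporter's include_metrics option can re-include metrics that are excluded by default; a minimal sketch (verify the option against the exporter documentation for your collector version):

exporters:
  signalfx:
    include_metrics:
      # Re-include a metric the exporter excludes by default
      - metric_names: [k8s.pod.memory.working_set]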

-----------

We hope you found this informative and helpful. Want to dive in even further? Experience the difference for yourself and start your free trial of our observability platform now!
