Introduction
Spinnaker is an open-source, multi-cloud continuous delivery platform composed of a set of microservices, each performing a specific function. Understanding the performance and health of these individual components is critical for maintaining a robust Spinnaker environment. The primary services you might want to monitor include Clouddriver, Orca, Gate, Echo, Front50, Igor, Fiat, Rosco, and Kayenta.
Each of these services exposes metrics that can be scraped by the Splunk Distribution of the OpenTelemetry Collector and analyzed for performance and health insights.
This document serves as a comprehensive guide for monitoring Spinnaker infrastructure and services running on Kubernetes (K8s) via Splunk Observability Cloud and Splunk Platform. Because there is currently no direct integration available, several steps are needed to enable end-to-end monitoring of Spinnaker.
Armory, the provider of hosted Spinnaker, recommends using their Observability plugin to expose metrics via Prometheus endpoints. These endpoints can then be scraped by the Splunk Distribution of the OpenTelemetry Collector.
The plugin offers direct integration with other observability products, but connecting it to Splunk Observability is straightforward once the Splunk Distribution of the OpenTelemetry collector is set to scrape the Prometheus endpoints.
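As a rough illustration only (the plugin ID and configuration keys may differ by version, so consult the Armory Observability Plugin documentation for the authoritative settings), enabling the plugin's Prometheus endpoint in a Spinnaker service profile looks roughly like this:

spinnaker:
  extensibility:
    plugins:
      Armory.ObservabilityPlugin:
        enabled: true
        config:
          metrics:
            prometheus:
              # Exposes the /aop-prometheus scrape endpoint on each Spinnaker service
              enabled: true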
Install the collector using the Helm chart on the Spinnaker Kubernetes Cluster. For detailed instructions, refer to Collector Installation.
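If you haven't added the chart repository yet, it can be added and refreshed with the standard Helm commands (the repository alias shown here matches the one used in the sample install command later in this document):

helm repo add splunk-otel-collector-chart https://signalfx.github.io/splunk-otel-collector-chart
helm repo update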
Given the need for extensive custom configuration to enable the sending of logs to Splunk Cloud/Enterprise and metrics to Splunk Observability Cloud, we recommend the use of a custom values.yaml file.
Below is a sample configuration snippet for Splunk Observability Cloud and Splunk Platform:
splunkPlatform:
  # Required for Splunk Enterprise/Cloud. URL to a Splunk instance to send data
  # to. e.g. "http://10.202.11.190:8088/services/collector/event". Setting this parameter
  # enables Splunk Platform as a destination. Use the /services/collector/event
  # endpoint for proper extraction of fields.
  endpoint: https://10.202.7.134:8088/services/collector
  # Required for Splunk Enterprise/Cloud (if `endpoint` is specified). Splunk
  # HTTP Event Collector token.
  # Alternatively the token can be provided as a secret.
  # Refer to https://github.com/signalfx/splunk-otel-collector-chart/blob/main/docs/advanced-configuration.md#provide-tokens-as-a-secret
  token: xxx-xxx-xxx-xxx-xxx
  # Name of the Splunk event type index targeted. Required when ingesting logs to Splunk Platform.
  index: "pure"
  logsEnabled: true
  metricsEnabled: false
  tracesEnabled: false
splunkObservability:
  realm: us1
  accessToken: XXXXXXXX
  ingestUrl: https://ingest.us1.signalfx.com
  apiUrl: ""
  metricsEnabled: true
  tracesEnabled: true
  logsEnabled: false
logsEngine: otel
Ensure that metricsEnabled is set to true under splunkObservability so that Prometheus metrics are sent to Splunk Observability Cloud. If logsEnabled is set to true under splunkPlatform, it can remain false under splunkObservability.
The following configuration in the values.yaml file enables the collector pods to scrape Spinnaker pods for Prometheus metrics:
config:
  receivers:
    prometheus/spinnaker:
      config:
        scrape_configs:
          - job_name: 'spinnaker'
            kubernetes_sd_configs:
              - role: pod
            metrics_path: /aop-prometheus
            scheme: https
            relabel_configs:
              - source_labels: [__meta_kubernetes_pod_ip, __meta_kubernetes_pod_container_port_number]
                action: replace
                target_label: __address__
                separator: ":"
            tls_config:
              insecure_skip_verify: true
This configuration sets up a job that scrapes metrics from all pods (role: pod) in the Spinnaker Kubernetes cluster, using the /aop-prometheus metrics path exposed by the Armory Observability plugin. The insecure_skip_verify: true setting bypasses TLS certificate verification; be aware that this is a security risk and should only be used for testing purposes or when you understand the implications.
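Depending on your chart version, a receiver defined under config is not automatically wired into a pipeline, so you may also need to reference prometheus/spinnaker in the metrics pipeline for its data to be exported. A minimal sketch follows; the other receiver names reflect typical chart defaults and may differ in your setup:

config:
  service:
    pipelines:
      metrics:
        receivers: [hostmetrics, kubeletstats, otlp, signalfx, prometheus/spinnaker]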
Sample Helm command to install the OpenTelemetry Collector with the custom values file:
helm install splunk-otel-collector splunk-otel-collector-chart/splunk-otel-collector -f condensed_values.yaml
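After the install completes, you can confirm that the collector pods are running before looking for data (pod names will vary with your Helm release name and chart version):

kubectl get pods | grep splunk-otel-collector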
Confirming the correct ingestion of metrics in Splunk Observability Cloud may initially pose a challenge, particularly if you don't immediately know the names of the Prometheus metrics. However, you can work around this by performing the following steps:
Enable debug logging on the collector: setting the collector's internal telemetry log level to debug in values.yaml produces much more detailed output about what the collector is receiving and exporting, which helps confirm that the Prometheus endpoints are actually being scraped:
config:
  service:
    telemetry:
      logs:
        level: "debug"
Use Splunk SignalFlow to identify metrics: Splunk SignalFlow allows you to write data computations for your metrics. Using SignalFlow, you can isolate and display the metrics collected from the Prometheus endpoints. Here's an example of a SignalFlow query that lists all the metric names being reported for a given set of Spinnaker pods (in this case, the Orca pods):
A = data('*', filter=filter('sf_metric', '*') and filter('k8s.pod.name', 'spin-orca-*')).count(by=['sf_metric']).publish(label='A')
By following these steps, you should be able to verify the ingestion of metrics from your Spinnaker services into Splunk Observability Cloud.
Presently, Splunk Observability Cloud does not come with out-of-the-box (OOTB) dashboards for Spinnaker. However, this does not preclude you from creating insightful, customized visualizations of your Spinnaker performance metrics.
One approach is to leverage the plethora of open-source Grafana dashboards available for each Spinnaker service. A repository containing these dashboards can be found at uneeq-oss/spinnaker-mixin.
To create your own Splunk dashboards, examine the code of these Grafana dashboards and construct corresponding SignalFlow expressions in Splunk. Let's consider the following Grafana query as an example:
sum by (controller, status) (
  rate(controller_invocations_seconds_sum{container="orca"}[$__rate_interval])
)
/
sum by (controller, status) (
  rate(controller_invocations_seconds_count{container="orca"}[$__rate_interval])
)
You can translate this Grafana query into the following SignalFlow expressions:
A = data('controller_invocations_seconds_sum', filter=filter('k8s.container.name', 'orca')).sum(by=['controller', 'status']).publish(label='A', enable=False)
B = data('controller_invocations_seconds_count', filter=filter('k8s.container.name', 'orca')).sum(by=['controller', 'status']).publish(label='B', enable=False)
C = (A/B).publish(label='C')
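As another illustration of the same translation pattern, a Grafana panel that charts the per-second invocation rate of a counter could be approximated with the rate rollup in SignalFlow. The metric and filter names below simply mirror the example above; adjust them to your environment:

A = data('controller_invocations_seconds_count', filter=filter('k8s.container.name', 'orca'), rollup='rate').sum(by=['controller']).publish(label='Invocations per second')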
Using this strategy, you can create a Splunk Observability Cloud dashboard that suits your specific monitoring needs for Spinnaker.
To log data about individual accounts and functions within Armory Continuous Deployment, you can directly push this data to Splunk HEC endpoints without going through the OpenTelemetry collector. For more details, follow this link: Developer Insights.
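For reference, a raw HTTP Event Collector request looks like the following. The host, token, and event payload are placeholders (the index matches the one used earlier in this document); the /services/collector/event endpoint and the Splunk <token> authorization header are standard HEC conventions:

curl -k https://<splunk-host>:8088/services/collector/event \
  -H "Authorization: Splunk <hec-token>" \
  -d '{"event": {"account": "prod", "function": "deploy"}, "sourcetype": "armory:insights", "index": "pure"}'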
If you encounter issues, such as not seeing any metrics being ingested from the Prometheus endpoints, consider the following tips to identify and resolve the problem:
Check the collector logs. Inspect the logs of an OpenTelemetry Collector pod (replace otel-collector-abcde with the name of one of your collector pods):
kubectl logs otel-collector-abcde
To continuously stream the logs, add the -f flag as shown below:
kubectl logs -f otel-collector-abcde
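If you don't know the exact pod name, you can target the pods by label selector instead:

kubectl logs -l app.kubernetes.io/name=otel-collector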
Replace app.kubernetes.io/name=otel-collector with the appropriate label selector for your OpenTelemetry Collector pods.
Check the collector configuration. If the configuration is supplied through a ConfigMap, you can inspect it with:
kubectl get configmap otel-collector-config -o yaml
If the configuration is passed as a command-line argument or a file, you might need to examine the pod specification or access the pod's filesystem to find it. To inspect the pod specification, use:
kubectl get pod otel-collector-abcde -o yaml
Check for dropped data points. With debug logging enabled, you may see entries like the following:
2023-07-20T22:45:51.977Z debug translation/converter.go:240 dropping datapoint {"kind": "exporter", "data_type": "metrics", "name": "signalfx", "reason": "number of dimensions is larger than 36", "datapoint": "source:\"\" metric:\"controller_invocations_contentLength_total
Splunk Observability Cloud will drop data points/MTS that don't meet certain limits; in the case above, the number of dimensions exceeded 36. Read more.
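One way to stay under this limit is to drop labels you don't need before the data reaches the exporter, using metric_relabel_configs in the 'spinnaker' scrape job shown earlier. A minimal sketch (the label name is hypothetical; drop whichever high-cardinality labels you don't need):

metric_relabel_configs:
  # Drop a label that pushes the dimension count over the limit
  - action: labeldrop
    regex: "someUnneededLabel"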
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: 'otel-collector'
          scrape_interval: 5s
          static_configs:
            - targets: ['0.0.0.0:8888']
        - job_name: k8s
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
              regex: "true"
              action: keep
          metric_relabel_configs:
            - source_labels: [__name__]
              regex: "(request_duration_seconds.*|response_duration_seconds.*)"
              action: keep
The above error shows that the SignalFx exporter is skipping this metric because it is excluded by default by the exporter. These errors can be ignored, as they are expected in Kubernetes clusters. Read more.
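If you do need one of these excluded metrics, the SignalFx exporter supports an include_metrics option that overrides its default exclusions. A minimal sketch, using a metric name from the scrape config above purely as a placeholder:

config:
  exporters:
    signalfx:
      include_metrics:
        # Re-include a metric that the exporter excludes by default
        - metric_names:
            - request_duration_seconds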
-----------
We hope you found this informative and helpful. Want to dive in even further? Experience the difference for yourself and start your free trial of our observability platform now!