GPU metrics monitoring via nvidia-smi

radko
Explorer

Hello. I can't get Splunk to index GPU data into a metrics index. The GPU server has a working forwarder that sends infrastructure data from the Splunk Add-on for Unix and Linux to a metrics index called infra_metrics, but the GPU metrics never appear in that same index. I am using a script that collects the metrics and is executable by the splunkfwd user:

/opt/splunkforwarder/etc/apps/gpu_monitor/bin/gpu_metrics.sh
metric_name:gpu.utilization _value=0 gpu_index=0 gpu_name=NVIDIA_L40S
metric_name:gpu.memory_used_pct _value=0.00 gpu_index=0 gpu_name=NVIDIA_L40S
metric_name:gpu.temperature _value=35 gpu_index=0 gpu_name=NVIDIA_L40S
metric_name:gpu.power_draw _value=38.95 gpu_index=0 gpu_name=NVIDIA_L40S

I have the following setup:
/opt/splunkforwarder/etc/apps/gpu_monitor/local# cat inputs.conf
[script:///opt/splunkforwarder/etc/apps/gpu_monitor/bin/gpu_metrics.sh]
interval = 60
index = infra_metrics
sourcetype = gpu:metrics
disabled = false

/opt/splunkforwarder/etc/apps/gpu_monitor/local# cat props.conf
[gpu:metrics]
DATAMODE = metric
METRICS_PROTOCOL = true
LINE_BREAKER = ([\r\n]+)

PickleRick
SplunkTrust

OK. And what actually is your problem here?

Is your script not being run properly?

Does it not produce data?

Is it not getting parsed?

Something else?

What have you already done to debug the issue?

radko
Explorer

When I check the contents of my metrics index with | mcatalog values(metric_name) WHERE index=infra_metrics, I don't see any GPU values. My script produces this output:

metric_name=gpu.utilization value=0 gpu_index=0 gpu_name=NVIDIA_L40S
metric_name=gpu.memory_used_pct value=0.00 gpu_index=0 gpu_name=NVIDIA_L40S
metric_name=gpu.temperature value=35 gpu_index=0 gpu_name=NVIDIA_L40S
metric_name=gpu.power_draw value=39.14 gpu_index=0 gpu_name=NVIDIA_L40S
 
I have also tried the metric_name:gpu.temperature form, and I have tried both gpu_metrics and gpu:metrics as the sourcetype.

The Splunk user can run gpu_metrics.sh and can run nvidia-smi commands directly, but nothing is ingested or parsed into my index. As far as I can tell I have configured everything correctly, yet the data is just not there.

This is my setup:
/opt/splunkforwarder/etc/apps/gpu_monitor/bin/gpu_metrics.sh
 

#!/bin/bash
# Collect per-GPU metrics from nvidia-smi and print one line per metric.

NVIDIA_SMI=/usr/bin/nvidia-smi

"$NVIDIA_SMI" \
    --query-gpu=index,name,utilization.gpu,utilization.memory,memory.total,memory.used,temperature.gpu,power.draw \
    --format=csv,noheader,nounits |
while IFS=',' read -r gpu_index gpu_name util_gpu mem_util mem_total mem_used temp power
do
    # Trim surrounding whitespace; replace spaces in the GPU name with underscores
    gpu_index=$(echo "$gpu_index" | xargs)
    gpu_name=$(echo "$gpu_name" | xargs | tr ' ' '_')
    util_gpu=$(echo "$util_gpu" | xargs)
    mem_total=$(echo "$mem_total" | xargs)
    mem_used=$(echo "$mem_used" | xargs)
    temp=$(echo "$temp" | xargs)
    power=$(echo "$power" | xargs)

    # Calculate memory usage as a percentage (mem_util is read but not used)
    mem_used_pct=0
    if [ "$mem_total" -gt 0 ]; then
        mem_used_pct=$(awk -v used="$mem_used" -v total="$mem_total" 'BEGIN {printf "%.2f", (used / total) * 100}')
    fi

    # One line per metric, with gpu_index and gpu_name as dimensions
    echo "metric_name:gpu.utilization _value=$util_gpu gpu_index=$gpu_index gpu_name=$gpu_name"
    echo "metric_name:gpu.memory_used_pct _value=$mem_used_pct gpu_index=$gpu_index gpu_name=$gpu_name"
    echo "metric_name:gpu.temperature _value=$temp gpu_index=$gpu_index gpu_name=$gpu_name"
    echo "metric_name:gpu.power_draw _value=$power gpu_index=$gpu_index gpu_name=$gpu_name"
done
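
One way to check the formatting logic on a machine without a GPU is to move the per-row work into a function and feed it a sample CSV row. The sketch below is not the script from the post: format_gpu_row is a hypothetical helper name, the memory total in the usage example is a made-up number, and the output uses the key=value layout from the pasted output earlier in this reply rather than the metric_name: form.

```shell
#!/bin/bash
# Hypothetical helper: read CSV rows on stdin, in the column order of the
# --query-gpu list used in the post (index, name, utilization.gpu,
# utilization.memory, memory.total, memory.used, temperature.gpu, power.draw),
# and print one key=value line per metric.
format_gpu_row() {
    awk -F',' '{
        # Trim whitespace around every field
        for (i = 1; i <= NF; i++) gsub(/^[ \t]+|[ \t]+$/, "", $i)
        name = $2
        gsub(/ /, "_", name)          # NVIDIA L40S -> NVIDIA_L40S
        total = $5 + 0                # force numeric comparison
        used = $6 + 0
        pct = (total > 0) ? (used / total) * 100 : 0
        dims = "gpu_index=" $1 " gpu_name=" name
        printf "metric_name=gpu.utilization value=%s %s\n", $3, dims
        printf "metric_name=gpu.memory_used_pct value=%.2f %s\n", pct, dims
        printf "metric_name=gpu.temperature value=%s %s\n", $7, dims
        printf "metric_name=gpu.power_draw value=%s %s\n", $8, dims
    }'
}

# Usage with a sample row (values are illustrative only):
#   printf '0, NVIDIA L40S, 0, 0, 46068, 0, 35, 38.95\n' | format_gpu_row
```

Keeping the formatting in a function that only reads stdin means the same code can sit behind the real nvidia-smi pipe on the GPU server and behind a printf on a workstation.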

 

/opt/splunkforwarder/etc/apps/gpu_monitor/local# cat inputs.conf
[script:///opt/splunkforwarder/etc/apps/gpu_monitor/bin/gpu_metrics.sh]
interval = 60
index = infra_metrics
sourcetype = gpu:metrics
disabled = false

 

/opt/splunkforwarder/etc/apps/gpu_monitor/local# cat props.conf
[gpu:metrics]
DATAMODE = metric
METRICS_PROTOCOL = true
LINE_BREAKER = ([\r\n]+)

PickleRick
SplunkTrust

OK. If you were able to successfully run

/opt/splunkforwarder/bin/splunk cmd /opt/splunkforwarder/etc/apps/gpu_monitor/bin/gpu_metrics.sh

and got meaningful results, I'd start by ingesting the data into a normal event index. If that works but ingesting it as metrics does not, the problem is in how the metrics schema is being parsed.
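
Before sending the output to an event index, it may help to confirm that every token on every line is a plain key=value pair, assuming the events are meant to rely on key=value field extraction at search time. The sketch below is only an illustration: find_non_kv_lines is a hypothetical helper name, and the token pattern (a key made of letters, digits, underscores, and dots, then a single =, then a value) is an assumption about what counts as well formed.

```shell
#!/bin/bash
# Hypothetical check: read lines on stdin and report each line that contains
# a whitespace-separated token not shaped like key=value.
# Exit status is 1 if at least one line was reported, 0 otherwise.
find_non_kv_lines() {
    awk '{
        for (i = 1; i <= NF; i++) {
            if ($i !~ /^[A-Za-z_][A-Za-z0-9_.]*=[^=]+$/) {
                printf "line %d: %s\n", NR, $0
                bad = 1
                break                 # report each line once
            }
        }
    } END { exit (bad ? 1 : 0) }'
}

# Usage, with the script path from the post:
#   /opt/splunkforwarder/etc/apps/gpu_monitor/bin/gpu_metrics.sh | find_non_kv_lines
```

With this pattern, a line that starts with metric_name:gpu.temperature is reported because that token has no = sign, while the metric_name=gpu.temperature form passes.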

radko
Explorer

Thank you for the solution. I created a normal event index that receives the output of gpu_metrics.sh. I've also set the script to run via a cron job, and I get consistent results.
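
The post does not say how the cron job hands its output to Splunk. One possibility, sketched here purely as an assumption, is a wrapper script that appends timestamped lines to a log file for a file monitor input to read. stamp_lines is a hypothetical helper name and the log path in the comment is a placeholder.

```shell
#!/bin/bash
# Hypothetical helper: prefix every stdin line with one UTC timestamp taken
# when the function starts, so all lines from a single run share the same time.
stamp_lines() {
    local ts line
    ts=$(date -u +%Y-%m-%dT%H:%M:%SZ)
    while IFS= read -r line; do
        printf '%s %s\n' "$ts" "$line"
    done
}

# Inside a wrapper script run from cron (the log path is a placeholder):
#   /opt/splunkforwarder/etc/apps/gpu_monitor/bin/gpu_metrics.sh | stamp_lines >> /var/log/gpu_metrics.log
```

Writing an explicit timestamp at the start of each line leaves less for timestamp recognition to guess when the file is read later than the moment the script ran.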
