Hello. I have the following issue: I can't get Splunk to index GPU data in a metrics index. On the GPU server I have a working forwarder that sends infrastructure data, collected by the Splunk Add-on for Linux and Unix, to a metrics index called infra_metrics. Unfortunately, I can't get Splunk to index the GPU metrics in that same index. I am using a script that collects the metrics and is executable by the splunkfwd user:
/opt/splunkforwarder/etc/apps/gpu_monitor/bin/gpu_metrics.sh
metric_name:gpu.utilization _value=0 gpu_index=0 gpu_name=NVIDIA_L40S
metric_name:gpu.memory_used_pct _value=0.00 gpu_index=0 gpu_name=NVIDIA_L40S
metric_name:gpu.temperature _value=35 gpu_index=0 gpu_name=NVIDIA_L40S
metric_name:gpu.power_draw _value=38.95 gpu_index=0 gpu_name=NVIDIA_L40S
I have the following setup:
/opt/splunkforwarder/etc/apps/gpu_monitor/local# cat inputs.conf
[script:///opt/splunkforwarder/etc/apps/gpu_monitor/bin/gpu_metrics.sh]
interval = 60
index = infra_metrics
sourcetype = gpu:metrics
disabled = false
/opt/splunkforwarder/etc/apps/gpu_monitor/local# cat props.conf
[gpu:metrics]
DATAMODE = metric
METRICS_PROTOCOL = true
LINE_BREAKER = ([\r\n]+)
OK. And what actually is your problem here?
Is your script not being run properly?
Does it not produce data?
Is it not getting parsed?
Something else?
What have you already done to debug the issue?
When I check the contents of my metrics index, I don't see any GPU metrics (checked via | mcatalog values(metric_name) where index=infra_metrics). The script itself does produce output. This is the script:
#!/bin/bash
NVIDIA_SMI=/usr/bin/nvidia-smi
$NVIDIA_SMI \
--query-gpu=index,name,utilization.gpu,utilization.memory,memory.total,memory.used,temperature.gpu,power.draw \
--format=csv,noheader,nounits | while IFS=',' read -r gpu_index gpu_name util_gpu mem_util mem_total mem_used temp power
do
gpu_index=$(echo "$gpu_index" | xargs)
gpu_name=$(echo "$gpu_name" | xargs | tr ' ' '_')
util_gpu=$(echo "$util_gpu" | xargs)
mem_total=$(echo "$mem_total" | xargs)
mem_used=$(echo "$mem_used" | xargs)
temp=$(echo "$temp" | xargs)
power=$(echo "$power" | xargs)
# calculate memory percentage
mem_used_pct=0
if [ "$mem_total" -gt 0 ]; then
mem_used_pct=$(awk "BEGIN {printf \"%.2f\", ($mem_used/$mem_total)*100}")
fi
# Proper Splunk metrics format
echo "metric_name:gpu.utilization _value=$util_gpu gpu_index=$gpu_index gpu_name=$gpu_name"
echo "metric_name:gpu.memory_used_pct _value=$mem_used_pct gpu_index=$gpu_index gpu_name=$gpu_name"
echo "metric_name:gpu.temperature _value=$temp gpu_index=$gpu_index gpu_name=$gpu_name"
echo "metric_name:gpu.power_draw _value=$power gpu_index=$gpu_index gpu_name=$gpu_name"
done
/opt/splunkforwarder/etc/apps/gpu_monitor/local# cat inputs.conf
[script:///opt/splunkforwarder/etc/apps/gpu_monitor/bin/gpu_metrics.sh]
interval = 60
index = infra_metrics
sourcetype = gpu:metrics
disabled = false
/opt/splunkforwarder/etc/apps/gpu_monitor/local# cat props.conf
[gpu:metrics]
DATAMODE = metric
METRICS_PROTOCOL = true
LINE_BREAKER = ([\r\n]+)
OK. If you were able to successfully run
/opt/splunk/bin/splunk cmd /opt/splunkforwarder/etc/apps/gpu_monitor/bin/gpu_metrics.sh
and got meaningful results, I'd start by ingesting the data into a normal event index. If that works but ingesting the same data as metrics does not, the problem lies in how the metrics schema is being parsed.
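Before changing any index settings, it can help to confirm that every line the script emits carries a numeric value, since a non-numeric placeholder in a field would not make a usable measurement. The following is only a minimal sketch: check_metric_lines is a hypothetical helper, not part of the gpu_monitor app or of Splunk, and it assumes the line layout shown in the question (a _value=<number> field on each line).

```shell
#!/bin/bash
# Hypothetical helper (not part of Splunk or the original app): reads metric
# lines on stdin and reports any line whose _value field is missing or
# non-numeric. Assumes the "_value=<number>" layout used in the question.
check_metric_lines() {
    local line value bad=0 n=0
    # Regexes kept in variables so bash treats them as patterns, not literals.
    local re_value='(^|[[:space:]])_value=([^[:space:]]*)'
    local re_num='^-?[0-9]+(\.[0-9]+)?$'
    while IFS= read -r line; do
        n=$((n + 1))
        # Pull out the text after "_value=" up to the next whitespace.
        if [[ $line =~ $re_value ]]; then
            value="${BASH_REMATCH[2]}"
        else
            echo "line $n: no _value field: $line"
            bad=$((bad + 1))
            continue
        fi
        # Accept integers and decimals, optionally signed.
        if ! [[ $value =~ $re_num ]]; then
            echo "line $n: non-numeric _value '$value': $line"
            bad=$((bad + 1))
        fi
    done
    echo "checked=$n bad=$bad"
    # Exit status is 0 only when every line passed.
    [ "$bad" -eq 0 ]
}
```

Once the function is sourced into the shell, it could be used as /opt/splunkforwarder/etc/apps/gpu_monitor/bin/gpu_metrics.sh | check_metric_lines, run as the splunkfwd user so the check sees the same output the forwarder would.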
Thank you for the solution. I created a normal event index that receives the output from gpu_metrics.sh. I've also set the script up to run via a cron job, and I get consistent results.