Splunk Search

Parsing multiple values in fields from JSON key value format

sholl
Engager

I have some JSON output that is in key value structure (protobuf3 formatted--this is OTLP data going into Splunk Enterprise events) and it has multiple values in each field. There are multiple key value attributes stored under an attributes parent, and then its fields are under a metric parent. I want to take the host.name attribute and map it to every metrics value I see.

Here is working example of the raw json:

 

 

{
  "resourceMetrics": [
    {
      "resource": {
        "attributes": [
          {
            "key": "host.name",
            "value": {
              "stringValue": "myname1"
            }
          },
          {
            "key": "telemetry.sdk.name",
            "value": {
              "stringValue": "my_sdk"
            }
          }
        ]
      },
      "scopeMetrics": [
        {
          "metrics": [
            {
              "name": "hw.host.energy",
              "gauge": {
                "dataPoints": [
                  {
                    "timeUnixNano": "1712951030986039000",
                    "asDouble": 359
                  }
                ]
              }
            },
                    {
              "name": "hw.host.power",
              "gauge": {
                "dataPoints": [
                  {
                    "timeUnixNano": "1712951030986039000",
                    "asDouble": 26
                  }
                ]
              }
            }
          ]
        }
      ]
    },
    {
      "resource": {
        "attributes": [
          {
            "key": "host.name",
            "value": {
              "stringValue": "myname2"
            }
          },
          {
            "key": "telemetry.sdk.name",
            "value": {
              "stringValue": "my_sdk"
            }
          }
        ]
      },
      "scopeMetrics": [
        {
          "metrics": [
            {
              "name": "hw.host.energy",
              "gauge": {
                "dataPoints": [
                  {
                    "timeUnixNano": "1712951030987780000",
                    "asDouble": 211
                  }
                ]
              }
            }
          ]
        }
      ]
    }
  ]
}

 

 

There may be multiple attributes, in various order, but I am only interested in grabbing the host.name value from there, and then associating host.name to all metrics under the metrics parent within the resource parent. The metrics parent may contain multiple metrics in the array. And then new resources (with new host.name and new metrics) would show up as the next resource entry in the resources array.

So what I want is something like this in a row-based format of host.name.value > metric:

host.namemetric
host.name,myname1hw.host.energy,359
host.name,myname1hw.host.power,26
host.name,myname2hw.host.energy,211

 

The problem I am having is I don't want the other attributes from the attribute parent, which in the example is the telemetry.sdk.name key and value. But since they are there, I can't figure out how to zip and expand properly, as the telemetry.sdk.name value gets associated to legit metrics, looking something like below, which would mean if I drop row 2 I lose the power metric = 26 for myname1.

Parsing some spaths, the structure looks something like this:

attr_zipmetric_zip
host.name,myname1hw.host.energy,359
telemetry.sdk.name,my_sdkhw.host.power,26
host.name,myname2hw.host.energy,211
telemetry.sdk.name,my_sdk 

 

I looked at mvfilter but can't seem to find a way to handle a variable amount of attributes that may show up in the left column attr_zip, as it seems I ned to know how many values I fill down in the field, and I am not sure how to get a count of the values fro the right column metric_zip to know how many values down in attr_zip to fill.

In JSON, all the metrics values share the same resource so I should logically be able to reference the parent resource.attribute.host.name.value, and concatenate that to every metric value.

Here's my current SPL, where I can get the columns concatenated properly, but would need to drop the rows in attr_zip that don't match the key of host.name:

 

 

| spath output=host_name path=resourceMetrics{}.resource.attributes{} 
| mvexpand host_name 
| spath output=attribute path=resourceMetrics{}.resource.attributes{}.key
| spath output=attribute_value path=resourceMetrics{}.resource.attributes{}.value.stringValue
| spath output=time resourceMetrics{}.scopeMetrics{}.metrics{}.gauge.dataPoints{}.timeUnixNano
| spath output=metric_name resourceMetrics{}.scopeMetrics{}.metrics{}.name
| spath output=metric_value resourceMetrics{}.scopeMetrics{}.metrics{}.gauge.dataPoints{}.asDouble
| eval attr_zip=mvzip(attribute, attribute_value)
| eval metric_zip=mvzip(metric_name, metric_value)
| table attribute,attribute_value, attr_zip, metric_zip

 

 

 

Anyone able to offer some guidance?

Labels (1)
0 Karma
1 Solution

yuanliu
SplunkTrust
SplunkTrust

Before I rant, thank you for sharing valid mock data in text.  This said, this is the second time in as many consecutive days I feel like screaming at lazy developers who makes some terrible use of JSON arrays. (The developer might be you.  But the rant stands😏)  Your data would have much cleaner, self-evidenced semantics had the developer simply use this:

 

[
 {
  "attributes": {"host.name":{"stringValue":"myname1"},"telemetry.sdk.name":{"stringValue":"my_sdk"}},
  "metrics": {"hw.host.energy":{"dataPoints":[{"timeUnixNano":"1712951030986039000","asDouble":359}]},"hw.host.power":{"dataPoints":[{"timeUnixNano":"1712951030986039000","asDouble":26}]}}
 },
 {
  "attributes": {"host.name":{"stringValue":"myname2"},"telemetry.sdk.name":{"stringValue":"my_sdk"}},
  "metrics": {"hw.host.energy":{"dataPoints":[{"timeUnixNano":"1712951030987780000","asDouble":211}]}}
]

 

In other words, only two JSON arrays in the original data are used correctly.  resourceMetrics.resource.attributes[] and resourceMetrics.scopeMetrics.metrics[] are total abomination of the intent of JSON arrays.  Speak to your developers to see if they could change the data structure not just for Splunk, but for future maintainers of their own code and any other downstream team as well.

Now that this is off my chest, I understand that it will take more than one day for developers to change code even if you convince them on day one.  Here is the SPL that I use to tabulate your data like the following:

host.name.stringValuehw.host.energy{}.asDoublehw.host.energy{}.timeUnixNanohw.host.power{}.asDoublehw.host.power{}.timeUnixNanosdk.name.stringValue
myname13591712951030986039000261712951030986039000my_sdk
myname22111712951030987780000  my_sdk

In this form, I have assumed that dataPoints[] is the only node of interest under resourceMetrics[].scopeMetrics[].metrics.gauge 

| spath path=resourceMetrics{}
| fields - _* resourceMetrics{}.*
| mvexpand resourceMetrics{}
| spath input=resourceMetrics{} path=resource.attributes{}
| spath input=resourceMetrics{} path=scopeMetrics{}
| spath input=scopeMetrics{} path=metrics{}
| fields - resourceMetrics{} scopeMetrics{}
| foreach resource.attributes{} mode=multivalue
    [eval key = mvappend(key, json_extract(<<ITEM>>, "key"))]
| eval idx = mvrange(0, mvcount(key))
| eval attributes_good = json_object()
| foreach idx mode=multivalue
    [eval attribute = mvindex('resource.attributes{}', <<ITEM>>),
    attributes_good = json_set_exact(attributes_good, json_extract(attribute, "key"), json_extract(attribute, "value"))]
| fields - key attribute resource.attributes{}
| foreach metrics{} mode=multivalue
    [eval name = mvappend(name, json_extract(<<ITEM>>, "name"))]
| eval name = if(isnull(name), json_extract('metrics{}', "name"), name)
| eval idx = mvrange(0, mvcount(name))
| eval metrics_good = json_object()
| foreach idx mode=multivalue
    [eval metric = mvindex('metrics{}', <<ITEM>>),
    metrics_good = json_set_exact(metrics_good, json_extract(metric, "name"), json_extract(metric, "gauge.dataPoints"))]
    ``` the above assumes that gauge.dataPoints is the only subnode of interest ```
| fields - idx name metric metrics{}
``` the above transforms array-laden JSON into easily understandable JSON ```
| spath input=attributes_good
| spath input=metrics_good
| fields - *_good
``` the following is only needed if dataPoints[] actually contain multiple values.  This is the only code requiring prior knowledge about data fields ```
| mvexpand hw.host.energy{}.timeUnixNano
| mvexpand hw.host.power{}.timeUnixNano

(The fields - xxx commands are not essential; they just declutter view.)  Hope this helps.

This is an emulation you can play with and compare with real data:

| makeresults
| eval _raw = "{
  \"resourceMetrics\": [
    {
      \"resource\": {
        \"attributes\": [
          {
            \"key\": \"host.name\",
            \"value\": {
              \"stringValue\": \"myname1\"
            }
          },
          {
            \"key\": \"telemetry.sdk.name\",
            \"value\": {
              \"stringValue\": \"my_sdk\"
            }
          }
        ]
      },
      \"scopeMetrics\": [
        {
          \"metrics\": [
            {
              \"name\": \"hw.host.energy\",
              \"gauge\": {
                \"dataPoints\": [
                  {
                    \"timeUnixNano\": \"1712951030986039000\",
                    \"asDouble\": 359
                  }
                ]
              }
            },
                    {
              \"name\": \"hw.host.power\",
              \"gauge\": {
                \"dataPoints\": [
                  {
                    \"timeUnixNano\": \"1712951030986039000\",
                    \"asDouble\": 26
                  }
                ]
              }
            }
          ]
        }
      ]
    },
    {
      \"resource\": {
        \"attributes\": [
          {
            \"key\": \"host.name\",
            \"value\": {
              \"stringValue\": \"myname2\"
            }
          },
          {
            \"key\": \"telemetry.sdk.name\",
            \"value\": {
              \"stringValue\": \"my_sdk\"
            }
          }
        ]
      },
      \"scopeMetrics\": [
        {
          \"metrics\": [
            {
              \"name\": \"hw.host.energy\",
              \"gauge\": {
                \"dataPoints\": [
                  {
                    \"timeUnixNano\": \"1712951030987780000\",
                    \"asDouble\": 211
                  }
                ]
              }
            }
          ]
        }
      ]
    }
  ]
}"
| spath
``` data emulation above ```

Final thoughts about data structure with self-evidence semantics: If my speculation about dataPoints[] being the only node of interest under resourceMetrics[].scopeMetrics[].metrics.gauge stands, good data could be further simplified to

[
 {
  "attributes": {"host.name":{"stringValue":"myname1"},"telemetry.sdk.name":{"stringValue":"my_sdk"}},
  "metrics": {"hw.host.energy":[{"timeUnixNano":"1712951030986039000","asDouble":359}],"hw.host.power":[{"timeUnixNano":"1712951030986039000","asDouble":26}]}
 },
 {
  "attributes": {"host.name":{"stringValue":"myname2"},"telemetry.sdk.name":{"stringValue":"my_sdk"}},
  "metrics": {"hw.host.energy":[{"timeUnixNano":"1712951030987780000","asDouble":211}]}
]

I do understand that listing hw.host.energy and hw.host.power as coexisting columns is different from your illustrated output and may not suite your needs.  But presentation can easily be adapted.  Bad data structure remains bad.

View solution in original post

yuanliu
SplunkTrust
SplunkTrust

Before I rant, thank you for sharing valid mock data in text.  This said, this is the second time in as many consecutive days I feel like screaming at lazy developers who makes some terrible use of JSON arrays. (The developer might be you.  But the rant stands😏)  Your data would have much cleaner, self-evidenced semantics had the developer simply use this:

 

[
 {
  "attributes": {"host.name":{"stringValue":"myname1"},"telemetry.sdk.name":{"stringValue":"my_sdk"}},
  "metrics": {"hw.host.energy":{"dataPoints":[{"timeUnixNano":"1712951030986039000","asDouble":359}]},"hw.host.power":{"dataPoints":[{"timeUnixNano":"1712951030986039000","asDouble":26}]}}
 },
 {
  "attributes": {"host.name":{"stringValue":"myname2"},"telemetry.sdk.name":{"stringValue":"my_sdk"}},
  "metrics": {"hw.host.energy":{"dataPoints":[{"timeUnixNano":"1712951030987780000","asDouble":211}]}}
]

 

In other words, only two JSON arrays in the original data are used correctly.  resourceMetrics.resource.attributes[] and resourceMetrics.scopeMetrics.metrics[] are total abomination of the intent of JSON arrays.  Speak to your developers to see if they could change the data structure not just for Splunk, but for future maintainers of their own code and any other downstream team as well.

Now that this is off my chest, I understand that it will take more than one day for developers to change code even if you convince them on day one.  Here is the SPL that I use to tabulate your data like the following:

host.name.stringValuehw.host.energy{}.asDoublehw.host.energy{}.timeUnixNanohw.host.power{}.asDoublehw.host.power{}.timeUnixNanosdk.name.stringValue
myname13591712951030986039000261712951030986039000my_sdk
myname22111712951030987780000  my_sdk

In this form, I have assumed that dataPoints[] is the only node of interest under resourceMetrics[].scopeMetrics[].metrics.gauge 

| spath path=resourceMetrics{}
| fields - _* resourceMetrics{}.*
| mvexpand resourceMetrics{}
| spath input=resourceMetrics{} path=resource.attributes{}
| spath input=resourceMetrics{} path=scopeMetrics{}
| spath input=scopeMetrics{} path=metrics{}
| fields - resourceMetrics{} scopeMetrics{}
| foreach resource.attributes{} mode=multivalue
    [eval key = mvappend(key, json_extract(<<ITEM>>, "key"))]
| eval idx = mvrange(0, mvcount(key))
| eval attributes_good = json_object()
| foreach idx mode=multivalue
    [eval attribute = mvindex('resource.attributes{}', <<ITEM>>),
    attributes_good = json_set_exact(attributes_good, json_extract(attribute, "key"), json_extract(attribute, "value"))]
| fields - key attribute resource.attributes{}
| foreach metrics{} mode=multivalue
    [eval name = mvappend(name, json_extract(<<ITEM>>, "name"))]
| eval name = if(isnull(name), json_extract('metrics{}', "name"), name)
| eval idx = mvrange(0, mvcount(name))
| eval metrics_good = json_object()
| foreach idx mode=multivalue
    [eval metric = mvindex('metrics{}', <<ITEM>>),
    metrics_good = json_set_exact(metrics_good, json_extract(metric, "name"), json_extract(metric, "gauge.dataPoints"))]
    ``` the above assumes that gauge.dataPoints is the only subnode of interest ```
| fields - idx name metric metrics{}
``` the above transforms array-laden JSON into easily understandable JSON ```
| spath input=attributes_good
| spath input=metrics_good
| fields - *_good
``` the following is only needed if dataPoints[] actually contain multiple values.  This is the only code requiring prior knowledge about data fields ```
| mvexpand hw.host.energy{}.timeUnixNano
| mvexpand hw.host.power{}.timeUnixNano

(The fields - xxx commands are not essential; they just declutter view.)  Hope this helps.

This is an emulation you can play with and compare with real data:

| makeresults
| eval _raw = "{
  \"resourceMetrics\": [
    {
      \"resource\": {
        \"attributes\": [
          {
            \"key\": \"host.name\",
            \"value\": {
              \"stringValue\": \"myname1\"
            }
          },
          {
            \"key\": \"telemetry.sdk.name\",
            \"value\": {
              \"stringValue\": \"my_sdk\"
            }
          }
        ]
      },
      \"scopeMetrics\": [
        {
          \"metrics\": [
            {
              \"name\": \"hw.host.energy\",
              \"gauge\": {
                \"dataPoints\": [
                  {
                    \"timeUnixNano\": \"1712951030986039000\",
                    \"asDouble\": 359
                  }
                ]
              }
            },
                    {
              \"name\": \"hw.host.power\",
              \"gauge\": {
                \"dataPoints\": [
                  {
                    \"timeUnixNano\": \"1712951030986039000\",
                    \"asDouble\": 26
                  }
                ]
              }
            }
          ]
        }
      ]
    },
    {
      \"resource\": {
        \"attributes\": [
          {
            \"key\": \"host.name\",
            \"value\": {
              \"stringValue\": \"myname2\"
            }
          },
          {
            \"key\": \"telemetry.sdk.name\",
            \"value\": {
              \"stringValue\": \"my_sdk\"
            }
          }
        ]
      },
      \"scopeMetrics\": [
        {
          \"metrics\": [
            {
              \"name\": \"hw.host.energy\",
              \"gauge\": {
                \"dataPoints\": [
                  {
                    \"timeUnixNano\": \"1712951030987780000\",
                    \"asDouble\": 211
                  }
                ]
              }
            }
          ]
        }
      ]
    }
  ]
}"
| spath
``` data emulation above ```

Final thoughts about data structure with self-evidence semantics: If my speculation about dataPoints[] being the only node of interest under resourceMetrics[].scopeMetrics[].metrics.gauge stands, good data could be further simplified to

[
 {
  "attributes": {"host.name":{"stringValue":"myname1"},"telemetry.sdk.name":{"stringValue":"my_sdk"}},
  "metrics": {"hw.host.energy":[{"timeUnixNano":"1712951030986039000","asDouble":359}],"hw.host.power":[{"timeUnixNano":"1712951030986039000","asDouble":26}]}
 },
 {
  "attributes": {"host.name":{"stringValue":"myname2"},"telemetry.sdk.name":{"stringValue":"my_sdk"}},
  "metrics": {"hw.host.energy":[{"timeUnixNano":"1712951030987780000","asDouble":211}]}
]

I do understand that listing hw.host.energy and hw.host.power as coexisting columns is different from your illustrated output and may not suite your needs.  But presentation can easily be adapted.  Bad data structure remains bad.

sholl
Engager

Thanks, I will check out your advice in a bit.

Yes, I agree that the data structure is not ideal for parsing. Unfortunately, this is output from an OpenTelemetry collector following the OpenTelemetry standard (which Spunk also embraces, though we don’t have native parsing for it yet in Splunk Enterprise), so if this takes off as the cross-vendor standard for pushing telemetry, then we are going to have to deal with ingesting in this format more and more. Or maybe it is an opportunity to suggest formatting changes to CNCF to the standard.

0 Karma
Get Updates on the Splunk Community!

Now Available: Cisco Talos Threat Intelligence Integrations for Splunk Security Cloud ...

At .conf24, we shared that we were in the process of integrating Cisco Talos threat intelligence into Splunk ...

Preparing your Splunk Environment for OpenSSL3

The Splunk platform will transition to OpenSSL version 3 in a future release. Actions are required to prepare ...

Easily Improve Agent Saturation with the Splunk Add-on for OpenTelemetry Collector

Agent Saturation What and Whys In application performance monitoring, saturation is defined as the total load ...