Splunk Search

Parsing multiple values in fields from JSON key value format

sholl
Engager

I have some JSON output in key-value structure (protobuf3-formatted OTLP data going into Splunk Enterprise events), and it has multiple values in each field. Multiple key-value attributes are stored under an attributes parent, and the metric fields sit under a metrics parent. I want to take the host.name attribute and map it to every metric value I see.

Here is a working example of the raw JSON:

{
  "resourceMetrics": [
    {
      "resource": {
        "attributes": [
          {
            "key": "host.name",
            "value": {
              "stringValue": "myname1"
            }
          },
          {
            "key": "telemetry.sdk.name",
            "value": {
              "stringValue": "my_sdk"
            }
          }
        ]
      },
      "scopeMetrics": [
        {
          "metrics": [
            {
              "name": "hw.host.energy",
              "gauge": {
                "dataPoints": [
                  {
                    "timeUnixNano": "1712951030986039000",
                    "asDouble": 359
                  }
                ]
              }
            },
            {
              "name": "hw.host.power",
              "gauge": {
                "dataPoints": [
                  {
                    "timeUnixNano": "1712951030986039000",
                    "asDouble": 26
                  }
                ]
              }
            }
          ]
        }
      ]
    },
    {
      "resource": {
        "attributes": [
          {
            "key": "host.name",
            "value": {
              "stringValue": "myname2"
            }
          },
          {
            "key": "telemetry.sdk.name",
            "value": {
              "stringValue": "my_sdk"
            }
          }
        ]
      },
      "scopeMetrics": [
        {
          "metrics": [
            {
              "name": "hw.host.energy",
              "gauge": {
                "dataPoints": [
                  {
                    "timeUnixNano": "1712951030987780000",
                    "asDouble": 211
                  }
                ]
              }
            }
          ]
        }
      ]
    }
  ]
}


There may be multiple attributes, in varying order, but I am only interested in grabbing the host.name value from there and then associating host.name with all metrics under the metrics parent within the same resource parent. The metrics parent may contain multiple metrics in its array, and each new resource (with a new host.name and new metrics) shows up as the next entry in the resourceMetrics array.

So what I want is something like this in a row-based format of host.name.value > metric:

host.name                 | metric
host.name,myname1         | hw.host.energy,359
host.name,myname1         | hw.host.power,26
host.name,myname2         | hw.host.energy,211

The problem I am having is that I don't want the other attributes from the attributes parent, which in the example is the telemetry.sdk.name key and value. But since they are there, I can't figure out how to zip and expand properly: the telemetry.sdk.name value gets associated with legit metrics, looking something like the table below, which means that if I drop row 2 I lose the power metric = 26 for myname1.

Parsing some spaths, the structure looks something like this:

attr_zip                  | metric_zip
host.name,myname1         | hw.host.energy,359
telemetry.sdk.name,my_sdk | hw.host.power,26
host.name,myname2         | hw.host.energy,211
telemetry.sdk.name,my_sdk |

I looked at mvfilter but can't find a way to handle a variable number of attributes showing up in the left column attr_zip: it seems I would need to know how many values to fill down in that field, and I am not sure how to get a count of the values from the right column metric_zip to know how many values down in attr_zip to fill.
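
For illustration, the closest I get with mvfilter is something like this (a sketch of the dead end; attr_zip_hostonly is just a scratch field name). It keeps only the host.name pairs, but then attr_zip_hostonly has two values against three in metric_zip, so the positional pairing is lost:

| eval attr_zip_hostonly = mvfilter(match(attr_zip, "^host\.name,"))
``` attr_zip_hostonly: host.name,myname1 and host.name,myname2 -- no longer aligned with metric_zip ```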

In the JSON, all the metric values share the same resource, so I should logically be able to reference the parent resource.attributes host.name value and concatenate that to every metric value.

Here's my current SPL, where I can get the columns concatenated properly, but I would still need to drop the rows in attr_zip whose key is not host.name:

| spath output=host_name path=resourceMetrics{}.resource.attributes{}
| mvexpand host_name
| spath output=attribute path=resourceMetrics{}.resource.attributes{}.key
| spath output=attribute_value path=resourceMetrics{}.resource.attributes{}.value.stringValue
| spath output=time path=resourceMetrics{}.scopeMetrics{}.metrics{}.gauge.dataPoints{}.timeUnixNano
| spath output=metric_name path=resourceMetrics{}.scopeMetrics{}.metrics{}.name
| spath output=metric_value path=resourceMetrics{}.scopeMetrics{}.metrics{}.gauge.dataPoints{}.asDouble
| eval attr_zip=mvzip(attribute, attribute_value)
| eval metric_zip=mvzip(metric_name, metric_value)
| table attribute, attribute_value, attr_zip, metric_zip

Anyone able to offer some guidance?

1 Solution

yuanliu
SplunkTrust

Before I rant, thank you for sharing valid mock data in text.  That said, this is the second time in as many days that I have felt like screaming at lazy developers who make such terrible use of JSON arrays. (The developer might be you.  But the rant stands😏)  Your data would have much cleaner, self-evident semantics had the developer simply used this:

[
 {
  "attributes": {"host.name":{"stringValue":"myname1"},"telemetry.sdk.name":{"stringValue":"my_sdk"}},
  "metrics": {"hw.host.energy":{"dataPoints":[{"timeUnixNano":"1712951030986039000","asDouble":359}]},"hw.host.power":{"dataPoints":[{"timeUnixNano":"1712951030986039000","asDouble":26}]}}
 },
 {
  "attributes": {"host.name":{"stringValue":"myname2"},"telemetry.sdk.name":{"stringValue":"my_sdk"}},
  "metrics": {"hw.host.energy":{"dataPoints":[{"timeUnixNano":"1712951030987780000","asDouble":211}]}}
 }
]

In other words, only two JSON arrays in the original data are used correctly.  resourceMetrics.resource.attributes[] and resourceMetrics.scopeMetrics.metrics[] are a total abomination of the intent of JSON arrays.  Speak to your developers to see if they can change the data structure, not just for Splunk but for future maintainers of their own code and any downstream team as well.

Now that this is off my chest: I understand that it will take more than one day for the developers to change their code even if you convince them on day one.  Here is the SPL I use to tabulate your data like the following:

host.name.stringValue | hw.host.energy{}.asDouble | hw.host.energy{}.timeUnixNano | hw.host.power{}.asDouble | hw.host.power{}.timeUnixNano | sdk.name.stringValue
myname1               | 359                       | 1712951030986039000           | 26                       | 1712951030986039000          | my_sdk
myname2               | 211                       | 1712951030987780000           |                          |                              | my_sdk

In this form, I have assumed that dataPoints[] is the only node of interest under resourceMetrics[].scopeMetrics[].metrics.gauge.

| spath path=resourceMetrics{}
| fields - _* resourceMetrics{}.*
| mvexpand resourceMetrics{}
| spath input=resourceMetrics{} path=resource.attributes{}
| spath input=resourceMetrics{} path=scopeMetrics{}
| spath input=scopeMetrics{} path=metrics{}
| fields - resourceMetrics{} scopeMetrics{}
| foreach resource.attributes{} mode=multivalue
    [eval key = mvappend(key, json_extract(<<ITEM>>, "key"))]
| eval idx = mvrange(0, mvcount(key))
| eval attributes_good = json_object()
| foreach idx mode=multivalue
    [eval attribute = mvindex('resource.attributes{}', <<ITEM>>),
    attributes_good = json_set_exact(attributes_good, json_extract(attribute, "key"), json_extract(attribute, "value"))]
| fields - key attribute resource.attributes{}
| foreach metrics{} mode=multivalue
    [eval name = mvappend(name, json_extract(<<ITEM>>, "name"))]
| eval name = if(isnull(name), json_extract('metrics{}', "name"), name)
| eval idx = mvrange(0, mvcount(name))
| eval metrics_good = json_object()
| foreach idx mode=multivalue
    [eval metric = mvindex('metrics{}', <<ITEM>>),
    metrics_good = json_set_exact(metrics_good, json_extract(metric, "name"), json_extract(metric, "gauge.dataPoints"))]
    ``` the above assumes that gauge.dataPoints is the only subnode of interest ```
| fields - idx name metric metrics{}
``` the above transforms array-laden JSON into easily understandable JSON ```
| spath input=attributes_good
| spath input=metrics_good
| fields - *_good
``` the following is only needed if dataPoints[] actually contain multiple values.  This is the only code requiring prior knowledge about data fields ```
| mvexpand hw.host.energy{}.timeUnixNano
| mvexpand hw.host.power{}.timeUnixNano

(The fields - xxx commands are not essential; they just declutter the view.)  Hope this helps.

This is an emulation you can play with and compare with real data:

| makeresults
| eval _raw = "{
  \"resourceMetrics\": [
    {
      \"resource\": {
        \"attributes\": [
          {
            \"key\": \"host.name\",
            \"value\": {
              \"stringValue\": \"myname1\"
            }
          },
          {
            \"key\": \"telemetry.sdk.name\",
            \"value\": {
              \"stringValue\": \"my_sdk\"
            }
          }
        ]
      },
      \"scopeMetrics\": [
        {
          \"metrics\": [
            {
              \"name\": \"hw.host.energy\",
              \"gauge\": {
                \"dataPoints\": [
                  {
                    \"timeUnixNano\": \"1712951030986039000\",
                    \"asDouble\": 359
                  }
                ]
              }
            },
            {
              \"name\": \"hw.host.power\",
              \"gauge\": {
                \"dataPoints\": [
                  {
                    \"timeUnixNano\": \"1712951030986039000\",
                    \"asDouble\": 26
                  }
                ]
              }
            }
          ]
        }
      ]
    },
    {
      \"resource\": {
        \"attributes\": [
          {
            \"key\": \"host.name\",
            \"value\": {
              \"stringValue\": \"myname2\"
            }
          },
          {
            \"key\": \"telemetry.sdk.name\",
            \"value\": {
              \"stringValue\": \"my_sdk\"
            }
          }
        ]
      },
      \"scopeMetrics\": [
        {
          \"metrics\": [
            {
              \"name\": \"hw.host.energy\",
              \"gauge\": {
                \"dataPoints\": [
                  {
                    \"timeUnixNano\": \"1712951030987780000\",
                    \"asDouble\": 211
                  }
                ]
              }
            }
          ]
        }
      ]
    }
  ]
}"
| spath
``` data emulation above ```

Final thoughts about data structures with self-evident semantics: if my speculation stands that dataPoints[] is the only node of interest under resourceMetrics[].scopeMetrics[].metrics.gauge, the good data could be simplified further to

[
 {
  "attributes": {"host.name":{"stringValue":"myname1"},"telemetry.sdk.name":{"stringValue":"my_sdk"}},
  "metrics": {"hw.host.energy":[{"timeUnixNano":"1712951030986039000","asDouble":359}],"hw.host.power":[{"timeUnixNano":"1712951030986039000","asDouble":26}]}
 },
 {
  "attributes": {"host.name":{"stringValue":"myname2"},"telemetry.sdk.name":{"stringValue":"my_sdk"}},
  "metrics": {"hw.host.energy":[{"timeUnixNano":"1712951030987780000","asDouble":211}]}
 }
]

I do understand that listing hw.host.energy and hw.host.power as coexisting columns is different from your illustrated output and may not suit your needs.  But presentation can easily be adapted, as sketched below.  Bad data structure remains bad.
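
If you do want your original row-based format, here is one way to adapt it (a sketch only: it assumes the attributes_good and metrics_good fields built above, so splice it in before the | fields - *_good line; json_extract_exact is used because the key names contain literal dots):

| eval host_name = json_extract(json_extract_exact(attributes_good, "host.name"), "stringValue")
| eval metric_name = json_array_to_mv(json_keys(metrics_good))
| mvexpand metric_name
``` read only the first dataPoints element for each metric ```
| eval first_point = mvindex(json_array_to_mv(json_extract_exact(metrics_good, metric_name)), 0)
| eval metric_value = json_extract(first_point, "asDouble")
| table host_name metric_name metric_value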


sholl
Engager

Thanks, I will check out your advice in a bit.

Yes, I agree that the data structure is not ideal for parsing. Unfortunately, this is output from an OpenTelemetry collector following the OpenTelemetry standard (which Splunk also embraces, though we don’t have native parsing for it yet in Splunk Enterprise), so if this takes off as the cross-vendor standard for pushing telemetry, we are going to have to deal with ingesting this format more and more. Or maybe it is an opportunity to suggest formatting changes to the standard through the CNCF.
