Splunk Search

Multi-value field from complex, flattened JSON array

bruceclarke
Contributor

Hi all,

For reference, I've seen this Splunk Answer post, but it doesn't quite get me where I want: https://answers.splunk.com/answers/297026/extract-value-from-json-array-of-objects.html?utm_source=t...

I have logs that time how long each client's browser takes to load different resources from our server. These logs are at a page level, so each log has information about upwards of 20 different resources that were loaded for each page. The logs look something like:

{
    Context: {
      Database:  DatabaseA
      RequestContext: {
        CallTree:  0 
        Path:  /heartbeat/boomerang 
        RequestId:  XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX 
     } 
      User:  Training User 
      UserId:  1
   } 
    Data: {
      restiming[0][rt_con_end]:  2094.2379999905825 
      restiming[0][rt_con_st]:  2094.2379999905825 
      restiming[0][rt_dns_end]:  2094.2379999905825 
      restiming[0][rt_dns_st]:  2094.2379999905825 
      restiming[0][rt_dur]:  70.596999998088 
      restiming[0][rt_fet_st]:  2094.2379999905825 
      restiming[0][rt_in_type]:  script 
      restiming[0][rt_name]:  https://myWebsite:80/path/to/someJsFile.js
      restiming[0][rt_req_st]:  2158.332999999402 
      restiming[0][rt_res_end]:  2164.8349999886705 
      restiming[0][rt_res_st]:  2158.7020000006305 
      restiming[0][rt_st]:  2094.2379999905825 
      restiming[1][rt_con_end]:  2334.4559999968624 
      restiming[1][rt_con_st]:  2334.4559999968624 
      restiming[1][rt_dns_end]:  2334.4559999968624 
      restiming[1][rt_dns_st]:  2334.4559999968624 
      restiming[1][rt_dur]:  26.542999999946915 
      restiming[1][rt_fet_st]:  2334.4559999968624 
      restiming[1][rt_in_type]:  css 
      restiming[1][rt_name]:  https://myWebsite:80/Images/notebook-skin.png 
      restiming[1][rt_req_st]:  2336.8760000012117 
      restiming[1][rt_res_end]:  2360.9989999968093 
      restiming[1][rt_res_st]:  2360.43499999505 
      restiming[1][rt_st]:  2334.4559999968624 
      ...
    }
}

So, unfortunately, the information for each resource is not in proper JSON format. What I want to do is zip all of the information into a multi-value field, then mvexpand that field to get one event per resource. After that I can aggregate how long it takes, on average, to load different images, JavaScript files, or CSS files.

I was thinking something like my search below might work, but I'm not sure that it would guarantee that each resource would line up with the appropriate time:

{baseSearch}
| rex "\"restiming\[[0-9]+\]\[rt_name\]\":\"(?<resourceFile>.+?)\""
| rex "\"restiming\[[0-9]+\]\[rt_in_type\]\":\"(?<resourceType>.+?)\""
| rex "\"restiming\[[0-9]+\]\[rt_dur\]\":\"(?<resourceLoadTime>.+?)\""
| eval resourceDetails=mvzip(mvzip(resourceFile, resourceType, ":::"), resourceLoadTime, ":::")
| mvexpand resourceDetails
| {split up resourceDetails into its three parts again}
| stats avg(resourceLoadTime) by resourceType, resourceFile
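For the placeholder step, I'd probably use something like split() and mvindex() to break resourceDetails apart again (just a sketch, assuming the ":::" delimiter from the mvzip above):

| eval parts=split(resourceDetails, ":::")
| eval resourceFile=mvindex(parts, 0), resourceType=mvindex(parts, 1), resourceLoadTime=mvindex(parts, 2)
| fields - parts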

My questions about this search are: (1) Am I guaranteed that the three fields I extract will line up for the mvzip? If not, is there a way to order them by array index? (2) Is there a better way to do this, perhaps using the foreach command to apply some logic to each resource? (3) Any general tips for turning this flattened JSON array into the typical format? It's obviously pretty frustrating to have to lean on the rex command so much.

Hopefully the example search gets across what I'm trying to do, but let me know if you need more information.

1 Solution

gcato
Contributor

Hi bruceclarke,

You could use Splunk's SEDCMD to reformat the input data into proper JSON. Using your example data, I got Splunk to recognise it as JSON with the following configuration in props.conf.

[my_reformatted_json]
DATETIME_CONFIG = CURRENT
LINE_BREAKER = ([\r\n]+)\s*{
SHOULD_LINEMERGE = false
MAX_EVENTS = 256
NO_BINARY_CHECK = true
category = Custom
disabled = false
pulldown_type = true
# use sed to massage the data to make it JSON compliant
SEDCMD-1addleadfieldquote = s/([\w\[\]]+:\s)/"\1/g
SEDCMD-2addtrailingfieldquote = s/:\s/": /g
SEDCMD-3addkvcommas = s/(\w)(\s*)([\r\n])/\1,\3/g
SEDCMD-4addvaluequotes = s/(:\s+)([[a-zA-Z\-\/][^,]+)/: "\2"/g
SEDCMD-5striplastkvquotes = s/(,)([\r\n])(\s+\})/\2\3/g
SEDCMD-6addbracecommas = s/(\}\s*)([\r\n])(\s+\")/},\2\3/g
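To illustrate the intent (roughly; I haven't traced every rule against every line), a Data entry like

restiming[0][rt_in_type]:  script

should come out as valid JSON along the lines of

"restiming[0][rt_in_type]": "script",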

I'm not sure this will perform well over a very large data set. You may also need to increase MAX_EVENTS if the number of event lines is greater than 256.

Alternatively, you could use props and transforms to extract field/value pairs from the input data. For example:

props.conf
...
[my_json]
REPORT-json-extr = json-extr

transforms.conf
...
[json-extr]
REGEX = ([\w\[\]]+):\s+(?!{)([^\r\n]+)
FORMAT = $1::$2
MV_ADD = true
REPEAT_MATCH = true

This would only extract lines with matching field: value pairs from the input, but maybe that's all you need anyway.

Hope this helps.


nnmiller
Contributor

Have you considered pre-processing the logs to either manipulate them into proper JSON or to do the evaluations? You could send these events to a separate log, operate on them outside of Splunk, then index the results.

Splunk is powerful, but in this case a purpose-built script with the ability to do consistency checks and generate consistent output might be a better choice.
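If you do go the external route, here is a minimal sketch of such a pre-processing step in Python. The flattened key names follow the sample event above; the function name and the exact shape of the input dict are assumptions:

```python
import json
import re

def unflatten_restiming(data):
    """Group flattened restiming[i][key] entries from the Data section
    into a list of dicts, one per resource, so the result is ordinary JSON."""
    pattern = re.compile(r"restiming\[(\d+)\]\[(\w+)\]")
    resources = {}
    for field, value in data.items():
        match = pattern.fullmatch(field)
        if match:
            index, key = int(match.group(1)), match.group(2)
            resources.setdefault(index, {})[key] = value
    # Order by array index so resources line up with their timings
    return [resources[i] for i in sorted(resources)]

# Example with two flattened resources from the sample event:
flat = {
    "restiming[0][rt_dur]": "70.596999998088",
    "restiming[0][rt_in_type]": "script",
    "restiming[1][rt_dur]": "26.542999999946915",
    "restiming[1][rt_in_type]": "css",
}
print(json.dumps(unflatten_restiming(flat)))
```

The output is a proper JSON array, which Splunk's JSON extraction (or spath) can then handle without any rex gymnastics.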


bruceclarke
Contributor

We have considered that, however because of how we're consuming the logs, it would be difficult. I'm hoping solving this in search is doable. If it is, then I think it would be an easier route to take.


nnmiller
Contributor

There's also this Splunk answer by Martin Mueller, which has some detail about using spath with mvexpand.
