Hi all,
For reference, I've seen this Splunk Answer post, but it doesn't quite get me where I want: https://answers.splunk.com/answers/297026/extract-value-from-json-array-of-objects.html?utm_source=t...
I have logs that time how long each client's browser takes to load different resources from our server. These logs are at a page level, so each log has information about upwards of 20 different resources that were loaded for each page. The logs look something like:
{
Context: {
Database: DatabaseA
RequestContext: {
CallTree: 0
Path: /heartbeat/boomerang
RequestId: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
}
User: Training User
UserId: 1
}
Data: {
restiming[0][rt_con_end]: 2094.2379999905825
restiming[0][rt_con_st]: 2094.2379999905825
restiming[0][rt_dns_end]: 2094.2379999905825
restiming[0][rt_dns_st]: 2094.2379999905825
restiming[0][rt_dur]: 70.596999998088
restiming[0][rt_fet_st]: 2094.2379999905825
restiming[0][rt_in_type]: script
restiming[0][rt_name]: https://myWebsite:80/path/to/someJsFile.js
restiming[0][rt_req_st]: 2158.332999999402
restiming[0][rt_res_end]: 2164.8349999886705
restiming[0][rt_res_st]: 2158.7020000006305
restiming[0][rt_st]: 2094.2379999905825
restiming[1][rt_con_end]: 2334.4559999968624
restiming[1][rt_con_st]: 2334.4559999968624
restiming[1][rt_dns_end]: 2334.4559999968624
restiming[1][rt_dns_st]: 2334.4559999968624
restiming[1][rt_dur]: 26.542999999946915
restiming[1][rt_fet_st]: 2334.4559999968624
restiming[1][rt_in_type]: css
restiming[1][rt_name]: https://myWebsite:80/Images/notebook-skin.png
restiming[1][rt_req_st]: 2336.8760000012117
restiming[1][rt_res_end]: 2360.9989999968093
restiming[1][rt_res_st]: 2360.43499999505
restiming[1][rt_st]: 2334.4559999968624
...
}
}
So, unfortunately, the information for each resource is not in proper JSON format. What I want to do is zip all of the information into a multi-value field, then mvexpand that field so there is one event per resource. After that I can aggregate data about how long it takes, on average, to load different images, JavaScript files, or CSS files.
I was thinking something like my search below might work, but I'm not sure that it would guarantee that each resource would line up with the appropriate time:
{baseSearch}
| rex "\"restiming\[[0-9]+\]\[rt_name\]\":\"(?<resourceFile>.+?)\""
| rex "\"restiming\[[0-9]+\]\[rt_in_type\]\":\"(?<resourceType>.+?)\""
| rex "\"restiming\[[0-9]+\]\[rt_dur\]\":\"(?<resourceLoadTime>.+?)\""
| eval resourceDetails=mvzip(mvzip(resourceFile, resourceType, ":::"), resourceLoadTime, ":::")
| mvexpand resourceDetails
| {split up resourceDetails into its three parts again}
| stats avg(resourceLoadTime) by resourceType, resourceFile
My questions about this search are: (1) Am I guaranteed that the three fields I extract will line up for the mvzip? If not, is there a way to order them by array index? (2) Is there a better way to do this, perhaps using the foreach command to apply some logic to each resource? (3) Any general tips for turning this flattened JSON array into the typical format? It's obviously pretty frustrating to have to rely on the rex command so much.
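To illustrate question (1): rex-style extraction collects matches in order of appearance, so the three multivalue fields only line up if every resource emits all three keys. A minimal Python sketch of the zip/expand/split logic (Python's re standing in for rex, and the sample event text being hypothetical):

```python
import re

# Hypothetical flattened event text covering two resources.
event = """
restiming[0][rt_dur]: 70.59
restiming[0][rt_in_type]: script
restiming[0][rt_name]: https://myWebsite:80/path/to/someJsFile.js
restiming[1][rt_dur]: 26.54
restiming[1][rt_in_type]: css
restiming[1][rt_name]: https://myWebsite:80/Images/notebook-skin.png
"""

# Extraction preserves order of appearance, so these three lists line up
# only when every resource has all three keys. Capturing the [N] index
# as well (as in the later examples) is the safer route.
names = re.findall(r"restiming\[[0-9]+\]\[rt_name\]:\s*(\S+)", event)
types = re.findall(r"restiming\[[0-9]+\]\[rt_in_type\]:\s*(\S+)", event)
durs  = re.findall(r"restiming\[[0-9]+\]\[rt_dur\]:\s*(\S+)", event)

# mvzip analogue: one "name:::type:::dur" string per resource.
zipped = [":::".join(t) for t in zip(names, types, durs)]
for row in zipped:
    name, rtype, dur = row.split(":::")  # the mvexpand + split step
    print(rtype, float(dur), name)
```

If a resource is missing one of the keys, zip silently shifts everything after it, which is exactly the misalignment the question worries about.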
Hopefully the example search gets across what I'm trying to do, but let me know if you need more information.
Hi bruceclark,
You could use Splunk's SEDCMD to reformat the input data into proper JSON. Using your example data, I got Splunk to recognise the events as JSON with the following configuration in props.conf:
[my_reformatted_json]
DATETIME_CONFIG = CURRENT
LINE_BREAKER = ([\r\n]+)\s*{
SHOULD_LINEMERGE = false
MAX_EVENTS = 256
NO_BINARY_CHECK = true
category = Custom
disabled = false
pulldown_type = true
# use sed to massage the data to make it JSON compliant
SEDCMD-1addleadfieldquote = s/([\w\[\]]+:\s)/"\1/g
SEDCMD-2addtrailingfieldquote = s/:\s/": /g
SEDCMD-3addkvcommas = s/(\w)(\s*)([\r\n])/\1,\3/g
SEDCMD-4addvaluequotes = s/(:\s+)([[a-zA-Z\-\/][^,]+)/: "\2"/g
SEDCMD-5striplastkvquotes = s/(,)([\r\n])(\s+\})/\2\3/g
SEDCMD-6addbracecommas = s/(\}\s*)([\r\n])(\s+\")/},\2\3/g
I'm not sure whether this will perform well over a very large data set. You may also need to increase MAX_EVENTS if the number of event lines is greater than 256.
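To see what the SEDCMD chain is doing, here is a minimal Python sketch of how the first, second, and fourth substitutions act on a single key/value line (Python's re.sub standing in for sed; the remaining substitutions handle commas and braces across lines):

```python
import re

line = "Database: DatabaseA"

# SEDCMD-1: open a quote in front of the key (and its trailing ": ").
line = re.sub(r"([\w\[\]]+:\s)", r'"\1', line)
# SEDCMD-2: close the key's quote at the ": " separator.
line = re.sub(r":\s", r'": ', line)
# SEDCMD-4: wrap the value in quotes.
line = re.sub(r"(:\s+)([[a-zA-Z\-\/][^,]+)", r': "\2"', line)

print(line)  # '"Database": "DatabaseA"'
```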
OR
You could also use props and transforms to extract field value pairs from the input data. For example
props.conf
...
[my_json]
REPORT-json-extr = json-extr
transforms.conf
...
[json-extr]
REGEX = ([\w\[\]]+):\s+(?!{)([^\r\n]+)
FORMAT = $1::$2
MV_ADD = true
REPEAT_MATCH = true
This would only extract lines with matching field: value pairs from the input. But maybe that's all you need anyway.
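The key piece of that REGEX is the negative lookahead (?!{), which skips lines whose "value" is just the opening brace of a nested object. A quick Python check of the same pattern against a few sample lines:

```python
import re

# The transform's regex: a key (word chars and brackets), then ": ",
# then a value that must not start with "{".
pattern = re.compile(r"([\w\[\]]+):\s+(?!{)([^\r\n]+)")

sample = """Context: {
Database: DatabaseA
restiming[0][rt_in_type]: script
RequestContext: {
restiming[0][rt_dur]: 70.596999998088"""

# "Context: {" and "RequestContext: {" are skipped by the lookahead;
# the real key/value lines are captured as (key, value) tuples.
pairs = pattern.findall(sample)
print(pairs)
```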
Hope this helps.
Have you considered pre-processing the logs to either manipulate them into proper JSON or to do the evaluations? You could send these events to a separate log, operate on them outside of Splunk, then index the results.
Splunk is powerful, but in this case a purpose-built script with the ability to do consistency checks and generate consistent output might be a better choice.
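If pre-processing ever becomes an option, a small script could regroup the flattened restiming[N][key] entries into a proper JSON array before indexing. A minimal sketch, assuming the Data section has already been parsed into a flat dict (function and field names here are hypothetical):

```python
import json
import re
from collections import defaultdict

def restructure(flat):
    """Regroup flattened restiming[N][key] entries into a real JSON array.

    `flat` is a dict of the Data section's key/value pairs; anything not
    matching the restiming pattern is passed through untouched.
    """
    resources = defaultdict(dict)
    out = {}
    for key, value in flat.items():
        m = re.fullmatch(r"restiming\[(\d+)\]\[(\w+)\]", key)
        if m:
            resources[int(m.group(1))][m.group(2)] = value
        else:
            out[key] = value
    # Order by array index so each resource's fields stay together.
    out["restiming"] = [resources[i] for i in sorted(resources)]
    return out

data = {
    "restiming[0][rt_in_type]": "script",
    "restiming[0][rt_dur]": 70.596999998088,
    "restiming[1][rt_in_type]": "css",
    "restiming[1][rt_dur]": 26.542999999946915,
}
print(json.dumps(restructure(data), indent=2))
```

The resulting events are ordinary JSON, so spath would pick up the restiming array with no rex at all.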
We have considered that; however, because of how we're consuming the logs, it would be difficult. I'm hoping solving this in search is doable. If it is, I think it would be the easier route to take.
There's also this Splunk Answer by Martin Mueller, which has some details about using spath with mvexpand.