Hi all,
Currently on 6.5.2, but hopefully upgrading to 7.x in the next few months.
I have some data that is basically
<timestamp> <component> <json payload>
I'm wondering if there is a way that I can replace _raw with just the <json payload> at search time. I know I can do it with EVAL/replace in props, but I'm hoping to do it before that. The end goal is to have the entire event be json by the time auto KV runs, so that Splunk will parse out all of the json fields.
I have no problem extracting the json payload, but I'm struggling to find a way to actually replace _raw with EXTRACT or REPORT (both run before auto KV). If I use EXTRACT with a named capture group of _raw, then Splunk will rename it to just raw...so that doesn't help.
With REPORT, DEST_KEY isn't available at search time, so I can't use that. I've tried using a named capture group there too with CLEAN_KEYS = false, but nothing gets extracted in that case. I also tried capturing the json and then using FORMAT = _raw::$1, but that didn't seem to work either.
Am I missing something obvious? Is this possible?
Worst case, I'll try to extract the json fields with REPORT, but unless the json is pretty flat, the extracted fields aren't the easiest to work with (parent/child relationships get lost). Or maybe I'll just do it at index time and extract the component then too.
Maybe someday Splunk will be able to automatically parse json/xml/etc when that structured data is just in an event (not the entire event)?
EDIT: Adding a sample event.
Mar 26 13:44:57 myserver java {"userName":null,"enterpriseName":null,"type":"UPDATE","entityType":"letters","entityVersion":null,"assoicatedEvent":true,"eventReceivedTime":1522071897164,"entities":[{"children":null,"parentType":"letters","entityScope":"GLOBAL","lastUpdatedBy":"GUID","lastUpdatedDate":1522071896000,"creationDate":1518127649000,"alreadyMarkedForUnavailable":false,"unavailableTimestamp":0,"letters":[],"name":"server","location":null,"description":null,"productVersion":"version","status":"UP","lastStateChange":1521206768829,"disks":[{"children":null,"parentType":"letters","entityScope":"GLOBAL","name":"name:","unit":"GB","used":0.10546875,"available":0.90234375,"ID":"GUID","parentID":"GUID","externalID":null,"size":1.0087890625}],"owner":"GUID","ID":"GUID","parentID":"GUID","externalID":null,"address":"IP","managementIP":"IP","BGPPeers":null,"messages":null,"peakMemoryUsage":74.54,"averageMemoryUsage":74.54,"currentMemoryUsage":74.54,"peakCPUUsage":37.79,"averageCPUUsage":25.45,"currentCPUUsage":25.57}],"diffMap":{"EntityKey ( id = GUID entityClassType = letters )":{"added":null,"deleted":null,"modified":{"currentCPUUsage":{"oldValue":"25.26","newValue":"25.57"},"peakCPUUsage":{"oldValue":"37.78","newValue":"37.79"}}}},"ignoreDiffInMediationEvents":false,"updateMechanism":"DEFAULT","requestID":null,"sourceEnterpriseID":null}
I think I have the same issue. Basically, the event is not recognized by Splunk as valid json because of the string before your json object: "Mar 26 13:44:57 myserver java". You may also need to remove \" if that is what is in your _raw event text.
The solution I ended up with was to have the application team modify the log format so the string of fields that starts before the json object was included within the json object itself so as to deliver valid json to splunk.
Another option is to use the index-time feature ingest_eval, which can modify _raw at index time to remove or re-arrange it.
See: https://docs.splunk.com/Documentation/Splunk/7.2.6/Data/IngestEval
props.conf
[mysourcetype]
TRANSFORMS = evalatindextime
transforms.conf (remove the leading string)
[evalatindextime]
ingest_eval = _raw=replace(_raw,"^(.*?)({.*)$","\2")
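As a sanity check, here is what that replace() does, sketched in Python rather than Splunk (an illustration only; Splunk's replace() uses PCRE, but the pattern and behavior are the same here, minus the conf escaping):

```python
import json
import re

# Strip everything before the first "{" so only the JSON payload remains.
event = 'Mar 26 13:44:57 myserver java {"type":"UPDATE","entityType":"letters"}'

# Equivalent of replace(_raw, "^(.*?)({.*)$", "\2"):
# the lazy first group eats the prefix, the second group keeps the JSON.
payload = re.sub(r'^(.*?)(\{.*)$', r'\2', event)

print(payload)                      # {"type":"UPDATE","entityType":"letters"}
print(json.loads(payload)["type"])  # UPDATE
```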
You could also use replace() here not to remove the text, but to capture parts of _raw and then reassemble them in a way that produces valid JSON.
[evalatindextime]
ingest_eval = _raw=replace(_raw,"^([a-zA-Z]{3}\s\d{1,2}\s\d{2}:\d{2}:\d{2})\s(\S+)\s(\S+)\s({)(.*)$","\4\"timestamp\":\"\1\",\"server\":\"\2\",\"type\":\"\3\",\5")
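To see the capture-and-rebuild idea concretely, here is the same transformation sketched in Python (illustration only; the timestamp/server/type field names mirror the conf example above and are not anything Splunk mandates):

```python
import json
import re

# Fold the syslog-style prefix into the JSON object itself, so the
# whole event becomes valid JSON and auto KV can parse it.
event = 'Mar 26 13:44:57 myserver java {"status":"UP"}'

# Capture timestamp, host, and component, then re-emit them as JSON
# fields just inside the opening brace of the original payload.
pattern = r'^([A-Za-z]{3}\s\d{1,2}\s\d{2}:\d{2}:\d{2})\s(\S+)\s(\S+)\s(\{)(.*)$'
rebuilt = re.sub(pattern, r'\4"timestamp":"\1","server":"\2","type":"\3",\5', event)

parsed = json.loads(rebuilt)
print(parsed["timestamp"])  # Mar 26 13:44:57
print(parsed["server"])     # myserver
print(parsed["type"])       # java
```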
The issue I had was that the nested json object had \" around values and fields, and the object itself had quotes around it. I used replace() to turn the \" into " and removed the quotes around the nested object.
replace(_raw,"\\\"","\"")
Here's what my bad events looked like:
{
"field":"value",
"field2":"value2",
"message":"12:50:43.222 Username Company address {\"nestedfield1\":\"nestedvalue1\",\"nestedfield3\",:\"nestedvalue3\" }"
}
Here's what my good event looks like after using replace():
{
"field":"value",
"field2":"value2",
"message":{"timestamp":"12:50:43.222","username":"foo","company":"mycompany","address":"an address here","nestedfield1":"nestedvalue1","nestedfield2":"nestedvalue2","nestedfield3","nestedvalue3"}
}
Splunk parses out the top-level fields/values fine (field, field2, message), but the json object in message was not being parsed. After replacing \" with " and removing the quotes around the nested json object, all worked as expected.
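Those two fixes can be sketched in Python (illustration only; the naive quote-stripping step assumes "{ and }" only occur around the nested object, which held for my events but may not for yours):

```python
import json

# A "bad" event: the nested object is a quoted string with escaped quotes.
bad = '{"field":"value","message":"{\\"nested\\":\\"val\\"}"}'

# Fix 1: turn the escaped \" back into plain quotes
# (the equivalent of replace(_raw, "\\\"", "\"") in the conf).
good = bad.replace('\\"', '"')

# Fix 2: drop the quotes wrapping the nested object so it becomes
# a real JSON object instead of a string.
good = good.replace('"{', '{').replace('}"', '}')

parsed = json.loads(good)
print(parsed["message"]["nested"])  # val
```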
Try using a calculated field like this:
[YourSourcetypeHere]
EVAL-_raw = replace(_raw, "^[^\{]+", "")
This also happens after Splunk's auto KV functionality, which sort of defeats the overall goal of having Splunk parse the json for me like it does with KV_MODE = json on fully-json events.
https://docs.splunk.com/Documentation/Splunk/7.2.6/Knowledge/Searchtimeoperationssequence
You can do this at index time with SEDCMD, like this:
[YourSourcetypeHere]
SEDCMD-json_only = s/^[^\{]+//
Deploy to your Indexers, restart Splunk there, send in fresh events, profit!
Thanks Woodcock. I know it can be done at index time. More than anything, I'm trying to confirm here that it is not possible to do at search time before Splunk does its kv processing.
Why can't you just add this:
| rex mode=sed "s/^[^\{]+//"
| kv
Like this:
| makeresults
| eval _raw="Mar 26 13:44:57 myserver java {\"userName\":null,\"enterpriseName\":null,\"type\":\"UPDATE\",\"entityType\":\"letters\",\"entityVersion\":null,\"assoicatedEvent\":true,\"eventReceivedTime\":1522071897164,\"entities\":[{\"children\":null,\"parentType\":\"letters\",\"entityScope\":\"GLOBAL\",\"lastUpdatedBy\":\"GUID\",\"lastUpdatedDate\":1522071896000,\"creationDate\":1518127649000,\"alreadyMarkedForUnavailable\":false,\"unavailableTimestamp\":0,\"letters\":[],\"name\":\"server\",\"location\":null,\"description\":null,\"productVersion\":\"version\",\"status\":\"UP\",\"lastStateChange\":1521206768829,\"disks\":[{\"children\":null,\"parentType\":\"letters\",\"entityScope\":\"GLOBAL\",\"name\":\"name:\",\"unit\":\"GB\",\"used\":0.10546875,\"available\":0.90234375,\"ID\":\"GUID\",\"parentID\":\"GUID\",\"externalID\":null,\"size\":1.0087890625}],\"owner\":\"GUID\",\"ID\":\"GUID\",\"parentID\":\"GUID\",\"externalID\":null,\"address\":\"IP\",\"managementIP\":\"IP\",\"BGPPeers\":null,\"messages\":null,\"peakMemoryUsage\":74.54,\"averageMemoryUsage\":74.54,\"currentMemoryUsage\":74.54,\"peakCPUUsage\":37.79,\"averageCPUUsage\":25.45,\"currentCPUUsage\":25.57}],\"diffMap\":{\"EntityKey ( id = GUID entityClassType = letters )\":{\"added\":null,\"deleted\":null,\"modified\":{\"currentCPUUsage\":{\"oldValue\":\"25.26\",\"newValue\":\"25.57\"},\"peakCPUUsage\":{\"oldValue\":\"37.78\",\"newValue\":\"37.79\"}}}},\"ignoreDiffInMediationEvents\":false,\"updateMechanism\":\"DEFAULT\",\"requestID\":null,\"sourceEnterpriseID\":null}"
| rex mode=sed "s/^[^\{]+//"
| kv
The goal is to be able to use the resulting fields in the base search, e.g. index="blah" parent.child.subchild="whatever" ...I don't want to have to pull back all of the results, then parse the json, then filter events. Even if I specify a keyword in the base search to limit the initial results, it's not the desired functionality.
Were you able to achieve this?
I haven't looked at trying since around the time I posted this, but no, I haven't.
@maciep can you add some sample events (mock up/anonymize data if needed)? For JSON data, the start pattern and end pattern should be sufficient without the complete payload info.
Ok, updated the question with a sample scrubbed event. I forgot the host name is in there before the json too.