We have to model a regex in order to extract in Splunk (at index time) some fileds from our event. These fields will be used in search using the tstats
command. The regex will be used in a configuration file in Splunk settings transformation.conf
.
The main aspect of the fields we want extract at index time is that they have the same json key but a different father json-key.
Is it possible modelling this extraction using regex?
This is an example of Splunk event having the structure described before (json by the way):
{
"info":{
"eventSource":"",
"sourceType":"I/O",
"status":{
"code":"",
"msg":"",
"msgError":""
},
"transactionId":null,
"traceId":null,
"timestampStart":"2019-05-16T21:30:55.174Z",
"timestampEnd":"2019-05-16T21:30:55.174Z",
"companyIDCode":"",
"channelIDCode":"",
"branchCode":"",
"searchFields":{
"key_3":"value",
"key_2":"value",
"key_1":"value"
},
"annotation":{},
"caller":{
"id":"",
"version":"",
"acronym":""
},
"called":{
"id":"",
"version":"",
"acronym":""
},
"storage":{
"id":"",
"start":"",
"end":""
}
}
},
"headers":[],
"payLoad":{
"input":{
"encoding":"1024",
"ccsid":"1024",
"data":"dati_in"
},
"output":{
"encoding":"1024",
"ccsid":"1024",
"data":"dati_out"
}
}
}
The attended result is something like that:
We tried something like that
[calledid]
REGEX =(?<=called).*"id":"(?P<calledid>.*?)(?=")
FORMAT = calledid::"$1"
WRITE_META =true
but it dowsn't work cause it matches until the last id he finds. Such as:
":{"id":"","version":"","acronym":""},"storage":{"id":"
Thanks in advance.
We solve it using these regex:
[called_id]
REGEX = (?:info":{)(?:[\s\S]*?)(?:"called":{)(?:[\s\S]*?)(?:"id":)(?:(?:(?:")([^"]*)(?:"))|(null))(?:(?:[\}])|,)(?:(?:[^}]*)(?:}))
FORMAT = chiamato::"$1"
WRITE_META = true
[caller_id]
REGEX = (?:info":{)(?:[\s\S]*?)(?:"caller":{)(?:[\s\S]*?)(?:"id":)(?:(?:(?:")([^"]*)(?:"))|(null))(?:(?:[\}])|,)(?:(?:[^}]*)(?:}))
FORMAT = chiamante::"$1"
WRITE_META = true
in transforms.conf
TRANSFORMS-callee_id = callee_id
TRANSFORMS-caller_id = caller_id
and enabling the fields in field.conf
Hope will help other people.
We solve it using these regex:
[called_id]
REGEX = (?:info":{)(?:[\s\S]*?)(?:"called":{)(?:[\s\S]*?)(?:"id":)(?:(?:(?:")([^"]*)(?:"))|(null))(?:(?:[\}])|,)(?:(?:[^}]*)(?:}))
FORMAT = chiamato::"$1"
WRITE_META = true
[caller_id]
REGEX = (?:info":{)(?:[\s\S]*?)(?:"caller":{)(?:[\s\S]*?)(?:"id":)(?:(?:(?:")([^"]*)(?:"))|(null))(?:(?:[\}])|,)(?:(?:[^}]*)(?:}))
FORMAT = chiamante::"$1"
WRITE_META = true
in transforms.conf
TRANSFORMS-callee_id = callee_id
TRANSFORMS-caller_id = caller_id
and enabling the fields in field.conf
Hope will help other people.
*Update: Based on the data you provided in another comment I tweaked the regex.
I would avoid lookaheads and lookbehinds if possible, especially with a big payload. It's too easy to have an poorly performing or broken regex.
You also don't need to use the FORMAT command in transforms.conf if your regex is formatted to include the field names with the extractions.
You can extract the caller and called ID and version fields with two stanzas, one for caller and one for called.
[callerid]
REGEX = caller\"\s*:\s*\{\s*\"id\":\s*\"(?<callerid>[^\"]+)?\"\,\s*\"version\":\s*\"(?<callerversion>[^\"]+)?\"
[calledid]
REGEX = called\":\s*\{\s*\"id\":\s*\"(?<calledid>[^\"]+)?\"\,\s*\"version\":\s*\"(?<calledversion>[^\"]+)?\"
This was the inline search I used to test it:
| makeresults | eval test="{\"info\": {\"eventSource\": \"RPA\", \"sourceType\": \"I\", \"status\": {\"code\": \"0000\", \"msg\": \"Inizio Schedulazione\", \"msgError\": \"\"}, \"transactionId\": \"66083\", \"traceId\": \"124021\", \"timestampStart\": \"2019-10-16T11:34:00.000Z\", \"timestampEnd\": \"null\", \"companyIDCode\": \"01\", \"channelIDCode\": \"\", \"branchCode\": \"\", \"searchFields\": [{\"VDI\": \"WPVRTM2004\"}, {\"PROCESSO\": \"Assegni\"}], \"annotation\": [{\"TIPO\": \"SCHEDULAZIONE\"}, {\"RISORSE POOL\": \"SI\"}], \"caller\": {\"id\": \"VWFM\", \"version\": \"1\", \"acronym\": \"WRPA0\"}, \"called\": {\"id\": \"Assegni\", \"version\": \"1\", \"acronym\": \"WRPA0\"}}, \"payLoad\": {\"output\": {\"encoding\": \"\", \"ccsid\": \"\", \"data\": \"\"}, \"input\": {\"encoding\": \"\", \"ccsid\": \"\", \"data\": \"\"}}}"
| rex field=test "called\":\s*\{\s*\"id\":\s*\"(?<calledid>[^\"]+)?\"\,\s*\"version\":\s*\"(?<calledversion>[^\"]+)?\""
| rex field=test "caller\"\s*:\s*\{\s*\"id\":\s*\"(?<callerid>[^\"]+)?\"\,\s*\"version\":\s*\"(?<callerversion>[^\"]+)?\""
@wenthold this is how I've configured based on your tweaks:
in transformation.conf
[caller]
REGEX = caller\"\s*:\s*\{\s*\"id\":\s*\"(?<callerid>[^\"]+)?\"\,\s*\"version\":\s*\"(?<callerversion>[^\"]+)?\"
[called]
REGEX = called\":\s*\{\s*\"id\":\s*\"(?<calledid>[^\"]+)?\"\,\s*\"version\":\s*\"(?<calledversion>[^\"]+)?\"
in fields.conf
[caller]
INDEXED = true
[called]
INDEXED = true
in props.conf
[evento_srctyp]
TRANSFORMS-caller = caller
TRANSFORMS-called = called
then I tried to perform a search like
| tstats values where index=nbp_index_application by callerid
but I had no results. Also if I search in the extracted fields I don't found the fields callerid
, callerversion
, calledid
, calledversion
If you run btool from the command line do you see the configuration items listed?
/opt/splunk/bin/splunk cmd btool props list evento_srctyp --debug | less
Replace /opt/splunk with whatever your $SPLUNK_HOME path is. I don't know if "evento_srctyp" is the actual sourcetype or if you masked it, but "splunk cmd btool props list {SOURCETYPE} --debug" will dump out the configuration.
I wouldn't worry about indexing the field until you have the extractions working. When you can run the search:sourcetype=evento_srctyp | table _time host source callerid callerversion calledid calledversion
and get results, then review link text and then work on the indexed extraction.
From the deployment server cli do you mean?
Use btool on a system where the configuration has been deployed, not on the deployment server.
I would probably test the field extraction part on a search head first, then when the extraction syntax has been deployed I would remove it from the testing search head and deploy it to the indexers or heavy forwarders and work on the indexed extraction.
@wenthold We fixed it adding WRITE_META = true in the transformation.conf; anyway running a search it extract only two fields (keeping the event I gave to you):
It seems to forget the callerversion
and the calledversion
If two of the fields are working then my first guess would be that it's an issue with whitespacing in the regular expression.
Try running a search that will give you the raw json results and add the following:
| rex field=_raw "caller\"\s*:\s*\{\s*\"id\":\s*\"(?<test_callerid>[^\"]+)?\"\,\s*\"version\"\s*:\s*\"(?<test_callerversion>[^\"]+)?\""
See if you get the test_* fields extracted from the json. If the only thing at this point that needs to be tweaked is the regular expression, you might want to try tweaking it on regex101: https://regex101.com/r/4lGvKg/1