Solved: Re: INDEXED_EXTRACTIONS=json with transform

kamermans · ‎07-15-2014

I have JSON data prefixed by syslog that I would like to index using INDEXED_EXTRACTIONS=json. Here's an example of the data:

May 13 10:26:42 ip-10-11-12-13 myapp-17: {"headers":{"Accept":"*\/*","Accept-Language":"en-gb,en;q=0.5","User-Agent":"Mozilla\/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko\/20100101 Firefox\/29.0"},"date":1399976802,"node":"ip-10-11-12-13","source":"myapp-17","client_ip":"17.18.19.20"}

I need to strip off the part at the beginning of the line that was added by syslog (everything before the first "{" character), then process the event as JSON:
{ "client_ip": "17.18.19.20", "date": 1399976802, "headers": { "Accept": "*/*", "Accept-Language": "en-gb,en;q=0.5", "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko/20100101 Firefox/29.0" }, "node": "ip-10-11-12-13", "source": "myapp-17" }

I have tried the following methods:

Removing the leader by pretending it's a line breaker:

LINE_BREAKER=((:?^|\n).+?){
SHOULD_LINEMERGE=false

Removing the leader with SEDCMD:

SEDCMD-StripHeader=s/^[^{]+//

Removing the leader via a transform on _raw:

;transforms.conf

[StripSyslog]
REGEX = ^[^{]+(.*)$
FORMAT = $1
DEST_KEY = _raw

;props.conf

TRANSFORMS-StripSyslog = StripSyslog

All of these methods work with KV_MODE=json, but none of them work with INDEXED_EXTRACTIONS=json.

What I don't like about KV_MODE=json is that my events lose their hierarchical nature, so the keys in the headers.* collection are mixed in with the other keys. For example, with INDEXED_EXTRACTIONS=json I can do "headers.User-Agent"="Mozilla/*". More importantly, I can group these headers.* keys to determine their relative frequency, which is not possible with KV_MODE=json since the keys are flattened.

In the splunkd.log file I see this error:
07-15-2014 12:33:16.384 -0400 ERROR JsonLineBreaker - JSON StreamID: 0 having confkey=source::/tmp/myfile.gz|host::17-18-19-20|JsonSyslog|3 had parsing error: Unexpected character while looking for value: 'M'

This tells me that the JsonLineBreaker is probably trying to parse the line before applying any of the aforementioned transformations (the "M" is from "May 13 10:26:42...").
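To illustrate why the parser stops at the "M", here is a minimal sketch using Python's standard json module. This is only an analogy: it is not Splunk's JsonLineBreaker, the helper name first_parse_error is hypothetical, and the sample line is a shortened version of the event above.

```python
import json

# Shortened sample event: syslog prefix followed by a JSON payload.
line = 'May 13 10:26:42 ip-10-11-12-13 myapp-17: {"date":1399976802,"node":"ip-10-11-12-13"}'

def first_parse_error(text):
    """Return (offset, character) where strict JSON parsing fails, or None if it parses."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        # Slice instead of index so an error at end-of-input does not raise IndexError.
        return err.pos, text[err.pos:err.pos + 1]
    return None

# The whole line fails at offset 0, the "M" of "May".
print(first_parse_error(line))
# The same line parses once everything before the first "{" is dropped.
print(first_parse_error(line[line.index("{"):]))
```

A strict JSON parser fed the raw line fails on the very first character, which is consistent with the prefix still being present when the JSON parsing runs.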

Is there any way to apply a transformation before the JsonLineBreaker kicks in, or perhaps to extend that class in order to strip the leader out?

I am looking for a definitive answer here as the obvious workarounds (scripted input, change my data format, "sed -i" the file before input) are not great long-term.
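For reference, the "clean the file before input" workaround could be sketched in Python roughly as follows. This is a sketch under assumptions, not a Splunk feature: the function names strip_syslog_prefix and clean_lines are hypothetical, and it assumes one event per line with the JSON payload starting at the first "{".

```python
import json

def strip_syslog_prefix(line):
    """Return the JSON payload of a syslog-prefixed line, or None if there is none."""
    start = line.find("{")
    if start == -1:
        return None  # no JSON object on this line
    payload = line[start:].strip()
    try:
        json.loads(payload)  # validate only; the original text is returned unchanged
    except ValueError:
        return None  # text after the prefix is not valid JSON
    return payload

def clean_lines(lines):
    """Yield only the JSON payloads from an iterable of raw log lines."""
    for line in lines:
        payload = strip_syslog_prefix(line)
        if payload is not None:
            yield payload
```

The output of clean_lines could be written to a new file, one payload per line, and that file monitored with INDEXED_EXTRACTIONS=json instead of the original. Lines without a valid JSON payload are dropped silently in this sketch, which may or may not be what you want.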

Masa · ‎09-25-2014

Unfortunately, Splunk has no solution for your case.

INDEXED_EXTRACTIONS is applied when the file is read and events are parsed, which is before transforms.conf is applied.

rturk · ‎09-23-2014

Hi kamermans - Did you have any luck with this? I am having a similar issue.
