Getting Data In

INDEXED_EXTRACTIONS=json with transform

kamermans
Path Finder

I have JSON data prefixed with a syslog header that I would like to index using INDEXED_EXTRACTIONS=json. Here's an example of the data:

May 13 10:26:42 ip-10-11-12-13 myapp-17: {"headers":{"Accept":"*\/*","Accept-Language":"en-gb,en;q=0.5","User-Agent":"Mozilla\/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko\/20100101 Firefox\/29.0"},"date":1399976802,"node":"ip-10-11-12-13","source":"myapp-17","client_ip":"17.18.19.20"}

I need to strip off the prefix that syslog added at the beginning of each line (everything before the first "{" character), then process the event as JSON:

{
    "client_ip": "17.18.19.20",
    "date": 1399976802,
    "headers": {
        "Accept": "*/*",
        "Accept-Language": "en-gb,en;q=0.5",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko/20100101 Firefox/29.0"
    },
    "node": "ip-10-11-12-13",
    "source": "myapp-17"
}
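To make the intended result concrete, here is a minimal Python sketch of the transformation, run outside Splunk. It only illustrates the desired output and says nothing about how Splunk should be configured; `strip_syslog_leader` is a hypothetical helper name, not part of any Splunk API.

```python
import json


def strip_syslog_leader(line: str) -> str:
    """Drop everything before the first '{' (hypothetical helper)."""
    start = line.find("{")
    if start == -1:
        raise ValueError("no JSON object found in line")
    return line[start:]


# The sample event from above, kept as a raw string so the JSON "\/" escapes survive.
raw = r'May 13 10:26:42 ip-10-11-12-13 myapp-17: {"headers":{"Accept":"*\/*","Accept-Language":"en-gb,en;q=0.5","User-Agent":"Mozilla\/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko\/20100101 Firefox\/29.0"},"date":1399976802,"node":"ip-10-11-12-13","source":"myapp-17","client_ip":"17.18.19.20"}'

event = json.loads(strip_syslog_leader(raw))

# Pretty-print with sorted keys; the nested "headers" object stays nested.
print(json.dumps(event, indent=4, sort_keys=True))
```

The JSON parser also decodes the escaped `\/` sequences to plain `/`, which is why the desired output shows `*/*` rather than `*\/*`.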

I have tried the following methods:

  1. Removing the leader by treating it as part of the line breaker:

LINE_BREAKER=((:?^|\n).+?){
SHOULD_LINEMERGE=false

  2. Removing the leader with SEDCMD:

SEDCMD-StripHeader=s/^[^{]+//

  3. Removing the leader via a transform on _raw:

# transforms.conf

[StripSyslog]
REGEX = ^[^{]+(.*)$
FORMAT = $1
DEST_KEY = _raw

# props.conf

TRANSFORMS-StripSyslog = StripSyslog
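As a sanity check on the patterns themselves, the SEDCMD and transform regexes can be tried outside Splunk. The sketch below uses Python's `re` module as a stand-in for Splunk's regex engine (an assumption, though these patterns use nothing engine-specific) and a shortened version of the sample event.

```python
import re

# Shortened sample line: syslog leader followed by a JSON object.
raw = r'May 13 10:26:42 ip-10-11-12-13 myapp-17: {"node":"ip-10-11-12-13","source":"myapp-17"}'

# Rough equivalent of SEDCMD-StripHeader=s/^[^{]+//
via_sed = re.sub(r"^[^{]+", "", raw, count=1)

# Rough equivalent of the transform: REGEX = ^[^{]+(.*)$ with FORMAT = $1
match = re.match(r"^[^{]+(.*)$", raw)
via_transform = match.group(1) if match else raw

# Both approaches should leave only the JSON object.
print(via_sed == via_transform)
print(via_sed)
```

This only confirms that the regexes strip the leader as intended; it does not model when Splunk applies them relative to the JSON parsing.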

All of these methods work with KV_MODE=json, but none of them work with INDEXED_EXTRACTIONS=json.

What I don't like about KV_MODE=json is that my events lose their hierarchical nature, so the keys in the headers.* collection are mixed in with the other keys. For example, with INDEXED_EXTRACTIONS=json I can do "headers.User-Agent"="Mozilla/*". More importantly, I can group these headers.* keys to determine their relative frequency, which is not possible with KV_MODE=json since the keys are flattened.

In the splunkd.log file I see this error:
07-15-2014 12:33:16.384 -0400 ERROR JsonLineBreaker - JSON StreamID: 0 having confkey=source::/tmp/myfile.gz|host::17-18-19-20|JsonSyslog|3 had parsing error: Unexpected character while looking for value: 'M'

This tells me that the JsonLineBreaker is probably trying to parse the line before applying any of the aforementioned transformations (the "M" is from "May 13 10:26:42...").
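The same kind of failure can be reproduced with any strict JSON parser fed the unstripped line. The sketch below uses Python's `json` module purely as an analogy; it is not Splunk's JsonLineBreaker, and the sample line is shortened.

```python
import json

# Unstripped line: the syslog leader is still in front of the JSON object.
raw = 'May 13 10:26:42 ip-10-11-12-13 myapp-17: {"date":1399976802}'

try:
    json.loads(raw)
except json.JSONDecodeError as err:
    # The parser stops at the first character that cannot start a JSON value.
    failed_at = err.pos
    offending_char = raw[err.pos]
    print(failed_at, repr(offending_char))
```

Here the parser fails at offset 0 on the 'M' of "May", which is consistent with the character named in the splunkd.log error.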

Is there any way to apply a transformation before the JsonLineBreaker kicks in, or perhaps to extend that class in order to strip the leader out?

I am looking for a definitive answer here as the obvious workarounds (scripted input, change my data format, "sed -i" the file before input) are not great long-term.

This is probably relevant to these other questions as well:

Masa
Splunk Employee

Unfortunately, Splunk has no solution for your case.

INDEXED_EXTRACTIONS is applied when the file is read and events are parsed, which happens before transforms.conf is applied.

rturk
Builder

Hi kamermans - Did you have any luck with this? I am having a similar issue.
