Getting Data In

INDEXED_EXTRACTIONS=json with transform

kamermans
Path Finder

I have JSON data prefixed by syslog that I would like to index using INDEXED_EXTRACTIONS=json. Here's an example of the data:

May 13 10:26:42 ip-10-11-12-13 myapp-17: {"headers":{"Accept":"*\/*","Accept-Language":"en-gb,en;q=0.5","User-Agent":"Mozilla\/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko\/20100101 Firefox\/29.0"},"date":1399976802,"node":"ip-10-11-12-13","source":"myapp-17","client_ip":"17.18.19.20"}

I need to strip off everything before the first "{" character (the prefix added by syslog), then process the event as JSON:

{
    "client_ip": "17.18.19.20",
    "date": 1399976802,
    "headers": {
        "Accept": "*/*",
        "Accept-Language": "en-gb,en;q=0.5",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko/20100101 Firefox/29.0"
    },
    "node": "ip-10-11-12-13",
    "source": "myapp-17"
}
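To make the goal concrete, here is a minimal Python sketch of the transformation described above, run outside Splunk. It only illustrates the intended result (drop the leader, keep the nested structure); it is not something Splunk itself executes. The helper name `strip_syslog_leader` is hypothetical, and the sample line is a shortened copy of the event above.

```python
import json

# Shortened copy of the sample event: syslog leader followed by a JSON object.
RAW = (
    'May 13 10:26:42 ip-10-11-12-13 myapp-17: '
    '{"headers":{"Accept":"*\\/*","Accept-Language":"en-gb,en;q=0.5"},'
    '"date":1399976802,"node":"ip-10-11-12-13",'
    '"source":"myapp-17","client_ip":"17.18.19.20"}'
)


def strip_syslog_leader(line):
    """Return the line from its first '{' onward (hypothetical helper)."""
    start = line.find("{")
    if start == -1:
        raise ValueError("no JSON object found in line")
    return line[start:]


# Parse what is left as JSON; nested keys stay grouped under "headers".
event = json.loads(strip_syslog_leader(RAW))
print(event["headers"]["Accept"])
print(event["client_ip"])
```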

I have tried the following methods:

  1. Removing the leader by pretending it's a line breaker:

LINE_BREAKER=((:?^|\n).+?){
SHOULD_LINEMERGE=false

  2. Removing the leader with SEDCMD:

SEDCMD-StripHeader=s/^[^{]+//

  3. Removing the leader via a transform on _raw:

;transforms.conf

[StripSyslog]
REGEX = ^[^{]+(.*)$
FORMAT = $1
DEST_KEY = _raw

;props.conf

TRANSFORMS-StripSyslog = StripSyslog

All of these methods work with KV_MODE=json, but none of them work with INDEXED_EXTRACTIONS=json.

What I don't like about KV_MODE=json is that my events lose their hierarchical nature, so the keys in the headers.* collection are mixed in with the other keys. For example, with INDEXED_EXTRACTIONS=json I can do "headers.User-Agent"="Mozilla/*". More importantly, I can group these headers.* keys to determine their relative frequency, which is not possible with KV_MODE=json since the keys are flattened.

In the splunkd.log file I see this error:
07-15-2014 12:33:16.384 -0400 ERROR JsonLineBreaker - JSON StreamID: 0 having confkey=source::/tmp/myfile.gz|host::17-18-19-20|JsonSyslog|3 had parsing error: Unexpected character while looking for value: 'M'

This tells me that the JsonLineBreaker is probably trying to parse the line before applying any of the aforementioned transformations (the "M" is from "May 13 10:26:42...").
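The same failure mode can be reproduced with any strict JSON parser that is handed the raw line. The Python sketch below is only an analogy for what the error message suggests, not a description of how JsonLineBreaker is implemented; the sample line is a shortened, illustrative event.

```python
import json

# Raw line as it arrives, with the syslog leader still attached.
line = 'May 13 10:26:42 ip-10-11-12-13 myapp-17: {"node":"ip-10-11-12-13"}'

try:
    json.loads(line)
    failed_at = None
except json.JSONDecodeError as err:
    failed_at = err.pos  # offset of the character the parser rejected

if failed_at is not None:
    # The parser gives up on the very first character of the leader.
    print("parser stopped at", repr(line[failed_at]))
```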

Is there any way to apply a transformation before the JsonLineBreaker kicks in, or perhaps to extend that class in order to strip the leader out?

I am looking for a definitive answer here as the obvious workarounds (scripted input, change my data format, "sed -i" the file before input) are not great long-term.
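For reference, the "clean the file before input" workaround could look roughly like the Python sketch below. It assumes gzipped, UTF-8 input with one event per line; the function name `strip_leaders` and the choice to skip lines that contain no "{" are assumptions made for illustration.

```python
import gzip


def strip_leaders(src_path, dst_path):
    """Copy a gzipped log, dropping everything before the first '{' on each line.

    Returns the number of lines written.
    """
    kept = 0
    with gzip.open(src_path, "rt", encoding="utf-8") as src, \
         gzip.open(dst_path, "wt", encoding="utf-8") as dst:
        for line in src:
            start = line.find("{")
            if start == -1:
                continue  # assumption: lines without a JSON object are skipped
            dst.write(line[start:])  # the trailing newline is preserved
            kept += 1
    return kept
```

The cleaned copy would then be the file that Splunk monitors with INDEXED_EXTRACTIONS=json. This keeps the extra processing step outside Splunk, which is exactly the maintenance cost I would like to avoid.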

This is probably relevant to these other questions as well:

1 Solution

Masa
Splunk Employee

Unfortunately, Splunk has no solution for your case.

INDEXED_EXTRACTIONS is applied when the file is read and its events are parsed, which is before transforms.conf is applied.

rturk
Builder

Hi kamermans - Did you have any luck with this? I am having a similar issue.
