INDEXED_EXTRACTIONS=json with transform

kamermans
Path Finder

I have JSON data prefixed by syslog that I would like to index using INDEXED_EXTRACTIONS=json. Here's an example of the data:

May 13 10:26:42 ip-10-11-12-13 myapp-17: {"headers":{"Accept":"*\/*","Accept-Language":"en-gb,en;q=0.5","User-Agent":"Mozilla\/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko\/20100101 Firefox\/29.0"},"date":1399976802,"node":"ip-10-11-12-13","source":"myapp-17","client_ip":"17.18.19.20"}

I need to strip off the prefix that syslog added to each line (everything before the first "{" character), then process the rest of the event as JSON:

{
    "client_ip": "17.18.19.20",
    "date": 1399976802,
    "headers": {
        "Accept": "*/*",
        "Accept-Language": "en-gb,en;q=0.5",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko/20100101 Firefox/29.0"
    },
    "node": "ip-10-11-12-13",
    "source": "myapp-17"
}
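To make the target result concrete, here is a minimal Python sketch of the transformation. It is an illustration only, not Splunk configuration: `strip_syslog_leader` is a hypothetical helper name, the sample is a shortened copy of the event above, and the sketch assumes the syslog prefix never contains a "{" character.

```python
import json

def strip_syslog_leader(line: str) -> str:
    """Return the part of the line that starts at the first '{'."""
    start = line.find("{")
    if start == -1:
        raise ValueError("no JSON object found in line")
    return line[start:]

# Shortened version of the sample event from the question.
sample = (
    'May 13 10:26:42 ip-10-11-12-13 myapp-17: '
    '{"headers":{"Accept":"*\\/*"},"date":1399976802,'
    '"node":"ip-10-11-12-13","source":"myapp-17","client_ip":"17.18.19.20"}'
)

# After stripping, the remainder parses as ordinary JSON and keeps its nesting.
event = json.loads(strip_syslog_leader(sample))
print(event["headers"]["Accept"])
print(event["client_ip"])
```

The nested "headers" object survives as a real sub-object, which is the structure I want Splunk to preserve at index time.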

I have tried the following methods:

  1. Removing the leader by pretending it's a line breaker:

LINE_BREAKER=((:?^|\n).+?){
SHOULD_LINEMERGE=false

  2. Removing the leader with SEDCMD:

SEDCMD-StripHeader=s/^[^{]+//

  3. Removing the leader via a transform on _raw:

# transforms.conf

[StripSyslog]
REGEX = ^[^{]+(.*)$
FORMAT = $1
DEST_KEY = _raw

# props.conf

TRANSFORMS-StripSyslog = StripSyslog

All of these methods work with KV_MODE=json, but none of them work with INDEXED_EXTRACTIONS=json.

What I don't like about KV_MODE=json is that my events lose their hierarchical nature, so the keys in the headers.* collection are mixed in with the other keys. For example, with INDEXED_EXTRACTIONS=json I can do "headers.User-Agent"="Mozilla/*". More importantly, I can group these headers.* keys to determine their relative frequency, which is not possible with KV_MODE=json since the keys are flattened.
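As a rough illustration of the grouping I mean, here is a small Python sketch (not Splunk search syntax) that counts how often each key appears under the nested "headers" object. The three events are invented for the example and `header_key_frequency` is a hypothetical name.

```python
from collections import Counter

def header_key_frequency(events):
    """Count occurrences of each key under the nested 'headers' object."""
    counts = Counter()
    for event in events:
        # Events without a 'headers' object contribute nothing.
        counts.update(event.get("headers", {}).keys())
    return counts

# Invented sample events; only the nesting matters here.
events = [
    {"headers": {"Accept": "*/*", "User-Agent": "Mozilla/5.0"}, "node": "a"},
    {"headers": {"Accept": "*/*"}, "node": "b"},
    {"node": "c"},
]

freq = header_key_frequency(events)
print(freq["Accept"], freq["User-Agent"])
```

Because the keys stay grouped under "headers", top-level keys such as "node" never end up in the count.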

In the splunkd.log file I see this error:
07-15-2014 12:33:16.384 -0400 ERROR JsonLineBreaker - JSON StreamID: 0 having confkey=source::/tmp/myfile.gz|host::17-18-19-20|JsonSyslog|3 had parsing error: Unexpected character while looking for value: 'M'

This tells me that the JsonLineBreaker is probably trying to parse the line before applying any of the aforementioned transformations (the "M" is from "May 13 10:26:42...").
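The same kind of failure is easy to reproduce with any strict JSON parser. The sketch below uses Python's standard `json` module as a stand-in; it says nothing about how JsonLineBreaker is implemented and only shows that an unstripped line is rejected at its very first character.

```python
import json

# Unstripped line, shortened from the sample event.
raw = 'May 13 10:26:42 ip-10-11-12-13 myapp-17: {"node":"ip-10-11-12-13"}'

try:
    json.loads(raw)
    failed_at = None
except json.JSONDecodeError as err:
    # err.pos is the character offset at which parsing stopped.
    failed_at = err.pos
    print(f"parse failed at offset {err.pos}: {raw[err.pos]!r}")
```

The parser stops at offset 0, on the "M" of "May", which matches the character reported in splunkd.log.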

Is there any way to apply a transformation before the JsonLineBreaker kicks in, or perhaps to extend that class in order to strip the leader out?

I am looking for a definitive answer here as the obvious workarounds (scripted input, change my data format, "sed -i" the file before input) are not great long-term.


Masa
Splunk Employee

Unfortunately, Splunk has no solution for your case.

INDEXED_EXTRACTIONS is applied when the file is read and its events are parsed, which happens before transforms.conf is applied.

rturk
Builder

Hi kamermans, did you have any luck with this? I am having a similar issue.
