Getting Data In

INDEXED_EXTRACTIONS=json with transform

kamermans
Path Finder

I have JSON data prefixed by syslog that I would like to index using INDEXED_EXTRACTIONS=json. Here's an example of the data:

May 13 10:26:42 ip-10-11-12-13 myapp-17: {"headers":{"Accept":"*\/*","Accept-Language":"en-gb,en;q=0.5","User-Agent":"Mozilla\/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko\/20100101 Firefox\/29.0"},"date":1399976802,"node":"ip-10-11-12-13","source":"myapp-17","client_ip":"17.18.19.20"}

I need to strip off the stuff at the beginning of the, which was added by syslog, so everything before the first "{" char, then process the event as JSON:

{
"client_ip": "17.18.19.20",
"date": 1399976802,
"headers": {
"Accept": "*/*",
"Accept-Language": "en-gb,en;q=0.5",
"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:29.0) Gecko/20100101 Firefox/29.0",
},
"node": "ip-10-11-12-13",
"source": "myapp-17",
}

I have tried the following methods:

  1. Remove leader by pretending it's a line breaker

LINE_BREAKER=((:?^|\n).+?){
SHOULD_LINEMERGE=false

  1. Removing the leader with SEDCMD:

SEDCMD-StripHeader=s/^[^{]+//

  1. Removing the leader via a transform on _raw:

;transforms.conf

[StripSyslog]
REGEX = ^[^{]+(.*)$
FORMAT = $1
DEST_KEY = _raw

;props.conf

TRANSFORMS-StripSyslog = StripSyslog

All of these methods work with KV_MODE=json, but none of them work with INDEXED_EXTRACTIONS=json.

What I don't like about KV_MODE=json is that my events lose their hierarchical nature, so the keys in the headers.* collection are mixed in with the other keys. For example, with INDEXED_EXTRACTIONS=json I can do "headers.User-Agent"="Mozilla/*". More importantly, I can group these headers.* keys to determine their relative frequency, which is not possible with KV_MODE=json since the keys are flattened.

In the splunkd.log file I see this error:
07-15-2014 12:33:16.384 -0400 ERROR JsonLineBreaker - JSON StreamID: 0 having confkey=source::/tmp/myfile.gz|host::17-18-19-20|JsonSyslog|3 had parsing error: Unexpected character while looking for value: 'M'

This tells me that the JsonLineBreaker is probably trying to parse the line before applying any of the aforementioned transformations (the "M" is from "May 13 10:26:42...").

Is there any way to apply a transformation before the JsonLineBreaker kicks in, or perhaps to extend that class in order to strip the leader out?

I am looking for a definitive answer here as the obvious workarounds (scripted input, change my data format, "sed -i" the file before input) are not great long-term.

This is probably relevant to these other questions as well:

0 Karma
1 Solution

Masa
Splunk Employee
Splunk Employee

Unfortunately, there is no solution at Splunk for your case.

INDEXED_EXTRACTIOIN happens at reading file and parsing event time before transforms.conf is applied.

View solution in original post

Masa
Splunk Employee
Splunk Employee

Unfortunately, there is no solution at Splunk for your case.

INDEXED_EXTRACTIOIN happens at reading file and parsing event time before transforms.conf is applied.

rturk
Builder

Hi kamermans - Did you have any luck with this? I am having a similar issue.

0 Karma
Got questions? Get answers!

Join the Splunk Community Slack to learn, troubleshoot, and make connections with fellow Splunk practitioners in real time!

Meet up IRL or virtually!

Join Splunk User Groups to connect and learn in-person by region or remotely by topic or industry.

Get Updates on the Splunk Community!

Build the Future of Agentic AI: Join the Splunk Agentic Ops Hackathon

AI is changing how teams investigate incidents, detect threats, automate workflows, and build intelligent ...

[Puzzles] Solve, Learn, Repeat: Character substitutions with Regular Expressions

This challenge was first posted on Slack #puzzles channelFor BORE at .conf23, we had a puzzle question which ...

Splunk Community Badges!

  Hey everyone! Ready to earn some serious bragging rights in the community? Along with our existing badges ...