Hi all,
Splunk lets you customize how data is segmented in the index files with a regex, for example for this timestamp:
segmenters.conf :
[seg_rule]
FILTER=^\d\d\d\d-\d\d-\d\d\s*\d\d:\d\d:\d\d(.*)$
This keeps the timestamp (located at the beginning of the log) from being segmented; only the rest, captured by (.*), is segmented. This saves space in the index files, but we lose the ability to search for the timestamp other than through the _time field.
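For context, the segmenter stanza only takes effect once props.conf points at it. A minimal sketch of the wiring, assuming a hypothetical sourcetype called my_sourcetype (that name is made up for illustration):

```ini
# segmenters.conf
[seg_rule]
# Segmentation only happens on the text captured by the first group,
# i.e. everything after the leading timestamp.
FILTER = ^\d\d\d\d-\d\d-\d\d\s*\d\d:\d\d:\d\d(.*)$

# props.conf
# my_sourcetype is a placeholder; use your own sourcetype, source or host stanza.
[my_sourcetype]
SEGMENTATION = seg_rule
```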
My issue is the following: I want to do the same for every date value in my data, not only timestamps. But the Splunk documentation for segmenters.conf says that:
"segmentation will only take place on the first group of the matching regex."
So we can't filter anything located IN THE MIDDLE of the log, because that would need at least 2 capture groups. I tried it, and indeed, it only segments the part before the matched date and filters out the rest.
Any idea please?
I ran into the same limitation myself.
The "single capture group" setting is set in stone.
You've got 2 options (that I know of):
- if possible, use syslog-ng to rewrite your data before it is ingested by Splunk (rearrange your event so that all the "junk" data you don't want segmented is at the beginning of your event)
- use index-time TRANSFORMS-foo to rewrite your _raw so that your "junk" data is discarded or placed at the beginning of your event
I haven't tried the second option, but according to https://wiki.splunk.com/Community:HowIndexingWorks, index-time segmentation should happen in the annotator processor, which comes after the regexreplacement processor, so it should work.
Thanks for your ideas @pmalcakdoj, that's really relevant I think.
Concerning the second point, I have to say that it's really smart, but I don't see how to rewrite _raw by switching positions in the log.
I do want to keep that data in the log, so I just want to move it to the beginning at index time, and then use my segmenters.conf modifications to avoid segmenting it. But how do I edit _raw="xxxxx junkdata zzzz" to get _raw="junkdata xxxxx zzzz" with props.conf and transforms.conf?
You would need to capture all segments with capture groups and then reorder them in the FORMAT field with backreferences such as "$2 $1 $3 ..." (capture groups are numbered from $1).
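A rough sketch of what that could look like, untested and based on the same _raw-rewriting pattern used for masking data at index time. The stanza names (my_sourcetype, move_date_to_front) and the date layout (a leading timestamp plus one YYYY-MM-DD date somewhere later in the event) are assumptions for illustration only:

```ini
# props.conf
# my_sourcetype is a placeholder sourcetype name.
[my_sourcetype]
TRANSFORMS-reorder = move_date_to_front

# transforms.conf
[move_date_to_front]
# $1 = leading timestamp, $2 = text before the date,
# $3 = first date found after the timestamp, $4 = rest of the event.
REGEX = ^(\d{4}-\d\d-\d\d\s*\d\d:\d\d:\d\d)(.*?)(\d{4}-\d\d-\d\d)(.*)$
# Rebuild the event with the date moved right behind the timestamp.
FORMAT = $1 $3$2$4
# Write the result back over the raw event.
DEST_KEY = _raw
```

With this, an event like "2014-01-01 10:00:00 aaa 2015-06-30 bbb" would be rewritten so that it starts with "2014-01-01 10:00:00 2015-06-30". The FILTER in segmenters.conf would then have to be extended to skip the moved date as well, and since segmentation only takes place when FILTER matches, that extra date part should probably be optional in the regex so that events without a second date still get segmented. This only handles one date per event; more dates would need more capture groups.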
Thank you teacher, I think you rock indeed
You rock. You are crazy but you rock.
haha, glad I could help
I've never heard of a use case where memory space is so tight that you would use this approach. As a potential alternative, have you considered using regular expressions in your props.conf for all of it?
You need @pmalcakdoj to chime in. He is the only other guy that I know of crazy enough to actually modify segmenters.conf