How do you filter out dates from being segmented in segmenters.conf?

New Member

Hi all,

Splunk lets you customize how data is segmented in the index files with a regex, for example for this timestamp:

segmenters.conf:

[seg_rule]
FILTER=^\d\d\d\d-\d\d-\d\d\s*\d\d:\d\d:\d\d(.*)$

This keeps the timestamp (located at the beginning of the log) from being segmented; only the rest, captured by (.*), is segmented. This saves space, but we can no longer search for the timestamp other than through the _time field.
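For context, a segmenter stanza like the one above only takes effect once it is assigned to some data in props.conf. A minimal sketch, assuming a sourcetype named my_sourcetype (a hypothetical name, not from the original post):

```ini
# props.conf
# "my_sourcetype" is a hypothetical sourcetype name used for illustration.
[my_sourcetype]
# Use the [seg_rule] stanza from segmenters.conf at index time
SEGMENTATION = seg_rule
```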

My issue is the following: I want to do the same for every date value in my data, not only timestamps. But the Splunk documentation for segmenters.conf says:

"segmentation will only take place on the first group of the matching regex."

So we can't filter out something located in the MIDDLE of the log, because that would require at least two matching groups. I tried it, and indeed it only segments the part before the date match and filters out the rest.
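To illustrate the limitation, a rule of the following shape would show the behavior described. This is only a sketch of what such an attempt might look like; the stanza name seg_rule_mid is hypothetical and the exact regex that was tried is not given in the post:

```ini
# segmenters.conf
# Hypothetical stanza name; illustrates the single-group limitation.
[seg_rule_mid]
# Only group 1 (the text BEFORE the date) is segmented.
# The date and everything after it are left unsegmented, because
# segmentation is applied to the first capture group only.
FILTER = ^(.*?)\d\d\d\d-\d\d-\d\d.*$
```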

Any ideas, please?

Path Finder

I ran into the same limitation myself.
The "single capture group" behavior is set in stone.

You've got 2 options (that I know of):
- if possible, use syslog-ng to rewrite your data before it is ingested by Splunk (rearrange your event so that all the "junk" data you don't want segmented is at the beginning of your event)
- use an index-time TRANSFORMS-foo to rewrite your _raw so that your "junk" data is discarded or placed at the beginning of your event

I haven't tried the second option, but according to https://wiki.splunk.com/Community:HowIndexingWorks, index-time segmentation should happen in the annotator processor, which comes after the regexreplacement processor, so it should work.

New Member

Thanks for your ideas @pmalcakdoj, I think they're really relevant.
The second option is really smart, but I don't see how to rewrite _raw so that parts of the log switch positions.
I do want to keep that data in the log, so I just want to move it to the beginning at index time and then use my segmenters.conf modifications to keep it from being segmented. But how do I edit _raw="xxxxx junkdata zzzzz" to get _raw="junkdata xxxxx zzzzz" with props.conf and transforms.conf?

Path Finder

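You would need to capture all segments with capture groups and then reorder them in the FORMAT field with backreferences such as "$2 $1 $3 ..." (capture groups are numbered from $1; $0 refers to what was in DEST_KEY before the regex was applied).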
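The reordering described above could look roughly like the following. This is an untested sketch, not a verified configuration: the sourcetype my_sourcetype and the stanza names move_date_to_front and seg_rule_moved are hypothetical, and the regex only moves the first date that is preceded by whitespace.

```ini
# props.conf
# "my_sourcetype" is a hypothetical sourcetype name.
[my_sourcetype]
TRANSFORMS-reorder = move_date_to_front
SEGMENTATION = seg_rule_moved

# transforms.conf
[move_date_to_front]
# $1 = text before the date, $2 = the date, $3 = text after it
REGEX = ^(.*?)\s(\d{4}-\d{2}-\d{2})\s(.*)$
# Rewrite the event with the date moved to the front
FORMAT = $2 $1 $3
DEST_KEY = _raw

# segmenters.conf
[seg_rule_moved]
# Skip the moved date and the original leading timestamp;
# only group 1 (the rest of the event) is segmented.
FILTER = ^\d{4}-\d\d-\d\d\s+\d{4}-\d\d-\d\d\s+\d\d:\d\d:\d\d(.*)$
```

Under these assumptions, an event such as "2019-03-01 12:00:00 user=alice expires 2019-04-01 status=ok" would be rewritten to "2019-04-01 2019-03-01 12:00:00 user=alice expires status=ok" before segmentation. Events without a second date are not covered by this sketch and would need their own handling.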

New Member

Thank you, teacher. I think you rock indeed.

Esteemed Legend

You rock. You are crazy but you rock.

Path Finder

Haha, glad I could help.

Builder

I've never heard of a use case where space is so tight that you would use this approach. As a potential alternative, have you considered using regular expressions in your props.conf for all of it?

Esteemed Legend

You need @pmalcakdoj to chime in. He is the only other guy I know of who is crazy enough to actually modify segmenters.conf.