I've never worked with splunk regex before so I'm probably just missing something.
I've been up and down the https://docs.splunk.com/Documentation/Splunk/latest/Data/Advancedsourcetypeoverrides and https://docs.splunk.com/Documentation/SCS/latest/SearchReference/RexCommandOverview pages.
All I'm trying to do is set up some regex for a props/transforms that finds any instance of "ssh" and changes its sourcetype to "authentication".
My search:
index=accounting sourcetype=linux_admin | rex field=_raw "(?<ssh>\bssh\b)"
Scoped down to the last 60 minutes, I'm getting 2,700 results and none of them have anything to do with ssh.
When I run
"index=accounting sourcetype=linux_admin ssh" - which gives the results I'm actually looking for...
I only get 28 results and they're all pertaining to ssh.
What am I missing?
Thanks for the input!!
For verifying regexes https://regex101.com is usually sufficient 🙂
Anyway, if you're not using the capture group names for field extraction, don't use capture groups. It makes the regexes easier to read and saves a bit of performance because Splunk doesn't have to retain the capture group contents. It's a tiny bit of a difference but it's there.
So since you're trying to "recast" the data to a static sourcetype, it's enough to use
REGEX = \bssh\b
to match your events.
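As a quick sanity check outside Splunk, here is a minimal Python sketch of what `\bssh\b` does and does not match. The sample lines are made up for illustration, and Python's `re` stands in for Splunk's PCRE engine (Python spells named groups `(?P<name>...)`). It also shows that the named group does not change which lines match.

```python
import re

# Hypothetical sample lines, not taken from the thread's data.
samples = [
    "Accepted publickey for alice from 10.0.0.5 port 22 ssh2",
    "sshd[1234]: Failed password for root",
    "user ran: ssh admin@10.0.0.9",
    "key type ssh-rsa accepted",
    "installed package openssh-server",
]

plain = re.compile(r"\bssh\b")
# Python needs (?P<ssh>...); the (?<ssh>...) spelling used in the thread is PCRE syntax.
named = re.compile(r"(?P<ssh>\bssh\b)")

for line in samples:
    # With or without the capture group, the match/no-match decision is the same.
    assert bool(plain.search(line)) == bool(named.search(line))

# Only lines where "ssh" stands alone as a word are kept.
matched = [line for line in samples if plain.search(line)]
print(matched)
```

Note that the word boundaries exclude "sshd", "ssh2" and "openssh", so it is worth checking how the word actually appears in your events.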
And you're misunderstanding the relation between fields and capture groups.
If you do
| rex "(?<ssh>\bssh\b)"
Splunk will create a field named "ssh" because that's what the capture group is named. But it will match against the whole raw message, because _raw is the default when you don't specify a field to match against. You can extract data from a specific field using the field= parameter. Like
| rex field=message "(?<ssh>\bssh\b)"
This would create a field named "ssh" only if a field named "message" already exists at this point of your search pipeline (whether from default extractions defined for your data, or manually extracted or created earlier in the search) and contains the word "ssh".
But anyway, this has nothing to do with transforms.
With transforms, it's the SOURCE_KEY option which decides what the REGEX will be matched against (it defaults to _raw). One big caveat though (beginners often fall into this trap) - during ingest-time processing (and that's what you're trying to do) Splunk has no idea about any search-time extracted fields. You can only use indexed fields in index-time transforms (and they must have been already extracted if they are custom fields).
And again, index-time transforms have nothing to do with searching. (And datamodels are something yet completely different so let's not mix it all ;-))
Your config seems pretty OK at first glance but
1. Naming your sourcetype just "authentication" isn't a very good practice. It's usually better to name your sourcetypes in a more unique way. Usually it's some form of convention using vendor name, maybe product and the "kind" of data. Like "apache:error" or "cisco:ios" and so on.
2. You restarted the HF after pushing this config, didn't you?
3. Is the linux_audit sourcetype the original sourcetype of your data or isn't it also a rewritten sourcetype? (I don't remember that one to be honest). Because Splunk decides just once - at the beginning of the ingestion pipeline - what props and transforms options are relevant for the event. And even if you overwrite the event's metadata "in flight" to recast it to another sourcetype, host or source, it will still get processed till the end of the indexing phase according to the original sourcetype/host/source.
4. Oh, and you applied this config in the right place of your infrastructure? On the first "heavy" component in your events' path?
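Point 3 can be pictured with a deliberately simplified Python model. This is only a toy sketch of the behaviour described above, not how Splunk is implemented; the `PROPS` table, the `ingest` function and the `tag_as_seen` transform are hypothetical names invented for the illustration.

```python
import re

# Hypothetical stand-in for props.conf: sourcetype -> names of transforms to run.
PROPS = {
    "linux_audit": ["change_sourcetype_authentication"],
    "authentication": ["tag_as_seen"],  # hypothetical transform, for illustration only
}

def change_sourcetype_authentication(event):
    # Mirrors REGEX = \bssh\b combined with DEST_KEY = MetaData:Sourcetype.
    if re.search(r"\bssh\b", event["_raw"]):
        event["sourcetype"] = "authentication"

def tag_as_seen(event):
    # Would only run if settings for "authentication" were applied to the event.
    event["seen_by_authentication_props"] = True

TRANSFORMS = {
    "change_sourcetype_authentication": change_sourcetype_authentication,
    "tag_as_seen": tag_as_seen,
}

def ingest(event):
    # The transforms are selected once, from the sourcetype the event arrived with.
    selected = PROPS.get(event["sourcetype"], [])
    for name in selected:
        TRANSFORMS[name](event)
    return event

result = ingest({"sourcetype": "linux_audit", "_raw": "user ran: ssh admin@10.0.0.9"})
print(result["sourcetype"], "seen_by_authentication_props" in result)
```

In this model the event ends up with the new sourcetype, but the settings attached to that new sourcetype are never applied to it during the same pass.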
Before you delve into crevices, maybe check something more obvious: rex (or a regex-based automatic extraction) does not filter results by itself. You still need a filter to do that.
index=accounting sourcetype=linux_admin | rex field=_raw "(?<ssh>\bssh\b)"
Have you tried adding a filter after rex, like this?
index=accounting sourcetype=linux_admin
| rex field=_raw "(?<ssh>\bssh\b)"
| where isnotnull(ssh)
This tells Splunk to return only those events in which the regex has a match.
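To make the difference concrete, here is a small Python sketch that mimics the two steps. The `rex` and `where_isnotnull` functions are hypothetical toy models written for this example, not Splunk APIs, and the events are invented.

```python
import re

def rex(events, pattern, field="_raw"):
    """Toy model of `| rex`: add named-group fields on a match, never drop events."""
    regex = re.compile(pattern)
    out = []
    for event in events:
        event = dict(event)  # copy so the input list stays untouched
        match = regex.search(event.get(field, ""))
        if match:
            event.update(match.groupdict())
        out.append(event)
    return out

def where_isnotnull(events, field):
    """Toy model of `| where isnotnull(field)`: keep only events that have the field."""
    return [event for event in events if event.get(field) is not None]

# Hypothetical events.
events = [
    {"_raw": "user ran: ssh admin@10.0.0.9"},
    {"_raw": "cron job finished"},
    {"_raw": "disk usage at 91%"},
]

extracted = rex(events, r"(?P<ssh>\bssh\b)")  # Python named-group syntax
filtered = where_isnotnull(extracted, "ssh")
print(len(extracted), len(filtered))
```

The extraction step returns every event it was given, which is consistent with the 2,700 results in the original search; only the explicit filter narrows the set.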
If you use autoextraction as your props.conf shows, then to apply a filter you need something like
index=accounting sourcetype=linux_admin ssh=*
But here is another obvious mismatch.
props.conf
[linux_audit]
TRANSFORMS-changesourcetype = change_sourcetype_authentication
This stanza applies to sourcetype linux_audit, NOT linux_admin as suggested in your original search. Is this a typo when you set up the autoextraction?
That was a great catch! But that was just a typo on my part. All of this is happening on an air-gapped system, so I'm having to hand jam all this over.
Nice catch about the linux_audit vs. linux_admin. But while I recognize linux_audit, I don't recall ever seeing linux_admin, so that might actually be the typo.
Rick,
Thanks for the reply! Seems like this is much more involved than I initially thought.
It's not that I am trying to use the regex as a means of doing searches. I was only running the search to see if the regex I had was actually hitting the data I'm looking for, so rex is out because I'm really not trying to extract anything. Thanks for that clarification. I ran the search with regex instead of rex and it did come back with what I'm looking for.
Like I mentioned, I'm just trying to create a props/transforms set to catch data that matches a certain regex and change its sourcetype to authentication in an attempt to CIM the data. Something like:
props.conf
[linux_audit]
TRANSFORMS-changesourcetype = change_sourcetype_authentication
transforms.conf
[change_sourcetype_authentication]
REGEX=(?<ssh>\bssh\b)
FORMAT = sourcetype::authentication
DEST_KEY=MetaData:Sourcetype
Nothing was coming back when I pushed that to my HFs, so I was trying to search the regex to see if it was even hitting anything. If I understand correctly, the <ssh> field needs to already exist for this to work?
With that in mind, to your 4th point, does that mean this approach would not be an ideal one? All my indexes are customer based so organizing datamodels by indexes isn't an option.
Do I just have a typo somewhere I'm missing or am I just going down the wrong lane?
BTW, why would you want to override the sourcetype for a relatively well known and well implemented and supported linux_audit sourcetype?
I think I was going too far down that particular rabbit hole. I was planning to combine audit.log and linux secure into one sourcetype but finally realized there's no good reason for doing that when I can just call on both types.
Sourcetype is the "kind" of messages you get. It's not about what is contained within those events but how it's represented.
If you want a nice and easy way of searching for events with a similar "meaning", you can use tags or eventtypes. And you might want to dig into datamodels.
Yeah, I've been looking into data models and figuring out how to set my eventtypes to set up CIM, that's kinda how I fell down this particular rabbit hole.
Thank you for all the input here. I was really getting caught up in the capture group without realizing that wasn't what I was even trying to figure out.
1. The SCS docs are not for your normal Splunk Enterprise or Splunk Cloud searching. Yes, they often pop up in google search results.
2. The rex command is for extracting fields from your data. So your
search | rex
SPL means that Splunk will search for the events matching the search terms of the search command (if you don't specify any command, Splunk implicitly uses the search one) and then from all those events it will try to extract the fields using the regex you provide. In your case it was a regex capturing a group named ssh, so if your data matched the regex a field named ssh would be created. But if the event doesn't match the regex then the field simply isn't extracted. Nothing else happens.
3. If you want to filter your data by a regex, you have to use the regex command, not rex. But be aware that regex command doesn't capture anything. It just matches the event or an already extracted field against a regex and based on that filters the event stream.
4. While regex-based filtering can sometimes have its uses, it's very inefficient as a "base" method of searching.
In your case, the search for the word "ssh" returned just 28 results. And the search without that word returned 2700 events. If you specify a direct search term, Splunk can check its indexes and only consider for further processing the events which contain the word you searched for. But if you did
search | regex "\bssh\b"
Splunk would have to first fetch all 2700 events from the index and then try to match every single one of them to see if it fits the regex. That is a very inefficient way of searching. You'd still get the same 28 results but the processing overhead on this search would be humongous compared to simply searching for the word "ssh".
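The cost difference can be sketched with a tiny inverted index in Python. This is a rough illustration only: the tokenizer, the `term_search` and `regex_scan` functions and the events are assumptions made up for the example, and Splunk's real indexing and segmentation are more involved.

```python
import re
from collections import defaultdict

# Hypothetical events.
events = [
    "user ran: ssh admin@10.0.0.9",
    "cron job finished",
    "disk usage at 91%",
    "ssh session closed",
]

# Build a simple inverted index: token -> positions of the events containing it.
index = defaultdict(set)
for pos, raw in enumerate(events):
    for token in re.findall(r"\w+", raw.lower()):
        index[token].add(pos)

def term_search(term):
    """Look the term up in the index; only the matching events are touched."""
    positions = sorted(index.get(term.lower(), set()))
    return [events[p] for p in positions], len(positions)

def regex_scan(pattern):
    """Fetch every event and test it against the regex."""
    regex = re.compile(pattern)
    hits = [raw for raw in events if regex.search(raw)]
    return hits, len(events)

term_hits, term_examined = term_search("ssh")
scan_hits, scan_examined = regex_scan(r"\bssh\b")
print(term_hits == scan_hits, term_examined, scan_examined)
```

Both approaches return the same events here, but the index lookup only touches the events that contain the term, while the regex scan has to read all of them.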