Index time field extraction: regexp issue

Super_Knulps
Explorer

Hello,

Since I often search a specific expression in a large set of events, I would like to index it.

Every single instance that I am running has the following format:
instance-name.generic-name.subdomain.domain.com

In this expression, only domain.com is static and will never change.
I would like to extract generic-name for all of my events.

props.conf

[generic-name]
TRANSFORMS-generic-name = generic-name

transforms.conf

[generic-name]

REGEX = (?<instancename>[^\.]+)\.(?<gname>[^\.]+)\.(?<subdomain>[^\.]+)\.(?<domain>[^\.]+)\.

fields.conf

[gname]
INDEXED = True

I am not receiving anything in the Splunk dashboard. Is the problem in my configuration files or in my regular expression?
Thank you in advance for your help.

Update: I have tried all of the regular expressions suggested below and there is still no result. I don't receive any data in my sourcetype.

rsennett_splunk
Splunk Employee

I've decided to add a totally separate answer here, since if I'm right, your regex is fine (it was just the markup bug we're dealing with now that confused everyone) but your transforms syntax is off.
Create an indexed field:

[extracted-gname]
REGEX =  whatevercomesbeforeit [^\.]+\.(?<gname>[^\.]+)\.[^\.]+\..+
FORMAT = gname::$1

[extracting-from-host]
SOURCE_KEY = MetaData:Host
REGEX =   [^\.]+\.(?<gname>[^\.]+)\.[^\.]+\..+
FORMAT = gname::$1

[indexed-gname]
REGEX =  whatevercomesbeforeit [^\.]+\.(?<gname>[^\.]+)\.[^\.]+\..+
FORMAT = gname::$1
WRITE_META = true
With Splunk... the answer is always "YES!". It just might require more regex than you're prepared for!

Super_Knulps
Explorer

Okay, I am going to try this.
This goes in transforms.conf, right?

On props.conf I have:

[generic-name]
TRANSFORMS-generic-name = extracting-from-host
TRANSFORMS-generic-name = extracted-gname
TRANSFORMS-generic-name = indexed-gname

And fields.conf is still:

 [gname]
 INDEXED = True

Oh, yes, the value is not in _raw; it is in host.
For example, I have these events:

5/6/15 
3:40:17.000 PM  
Script-name=SMTP-RELAY | Status = OK | Proc=Postfix is running | SMTP=Connection to SMTP port succeed
host = instancename3.generic-name.subdomain source = /opt/splunk/bin/scripts/smtp-relay.pl sourcetype = generic-name
5/6/15 
3:40:06.000 PM  
Script-name=SMTP-CHAIN| Status=KO | Description=Email not received from X instance
host = instancename2.generic-name.subdomain  source = /opt/splunk/bin/scripts/smtp-chain.pl sourcetype = generic-name

While pasting those events I noticed that the host looks like instancename.generic-name.subdomain. The domain.com part is no longer there (if we are extracting from this field), so the regexp is a bit shorter:

[^\.]+\.(?<gname>[^\.]+)

And yes, the general idea would be that I have, like host and sourcetype, a field called gname on those events.
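
The point about the shorter host value can be checked outside Splunk. Below is a minimal sketch, assuming the host string shown in the events above; it uses Python's `re` module, which spells named groups `(?P<name>...)` where Splunk's PCRE accepts `(?<name>...)`. The names `LONG` and `SHORT` are illustrative only.

```python
import re

# Python writes named groups as (?P<name>...); Splunk's PCRE accepts (?<name>...).
LONG = re.compile(r"[^.]+\.(?P<gname>[^.]+)\.[^.]+\..+")  # needs at least three dots
SHORT = re.compile(r"[^.]+\.(?P<gname>[^.]+)")            # needs only one dot

# Host value as it appears in the events above (no domain.com suffix).
host = "instancename.generic-name.subdomain"

long_match = LONG.search(host)
short_match = SHORT.search(host)

print(long_match)                  # None: the host contains only two dots
print(short_match.group("gname"))  # generic-name
```

A pattern written for the full four-part name silently matches nothing once the domain suffix is missing, which is one way to end up with no gname values at all.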

rsennett_splunk
Splunk Employee

Yes. If you want to extract the new field gname as an indexed extraction, use TRANSFORMS in props.conf (rather than REPORT, which would be search-time) and follow my example for [indexed-gname]. Note the last bit: WRITE_META = true.
Here is the definition from the transforms.conf documentation:

WRITE_META = [true|false]
* NOTE: This attribute is only valid for index-time field extractions.
* Automatically writes REGEX to metadata.
* Required for all index-time field extractions except for those where DEST_KEY = _meta (see the description of the DEST_KEY attribute, below)
* Use instead of DEST_KEY = _meta.
* Defaults to false.

That will do it. You've no need for messing with fields.conf

That said... it is not recommended to REPLACE the host or source metadata fields.
By all means... make new index-time field called gname.
But if you mess with the metadata fields that Splunk uses to tell you where things came from at the time of index, you'll be super sorry later if anything in there changes... like the instance name etc... and all you have is host=generic-name and no evidentiary history of where it actually got the data. I say this from watching it destroy a customer's ability to look at anything granularly... because all they had was the "name" which was also the source, and host and the name of the app... a giant well-meaning mess.

Super_Knulps
Explorer

Okay, I am starting to understand all of that, thank you!
Now I have data! But the extraction is matching what is in _raw and not what is in fields such as host.
My regexp looks good in regex101: [^.]+.(?<gname>[^.]+).[^.]+
Event that is matched:
5/6/15
6:46:34.000 PM
Host=instancename.generic-name.subdomain.domain.com | Status=OK | Message=Connection to SMTP Port (25) succeeded
host = instancename.generic-name.subdomain source = /opt/splunk/etc/apps/myapp/bin/smtp.pl sourcetype =smtpchecker

I believe that it is matched due to Host=instancename.generic-name.subdomain.domain.com.

When I run the search you suggested:
sourcetype=smtpchecker | rex field=Host "[^.]+.(?<gname>[^.]+).[^.]+" | head 10 | table Host
The result is:
Host
instancename.generic-name.subdomain.domain.com
But I have the same result with [^.]+.(?<gname>[^.]+)

I am going to play around with the regexp and try to make it match. The problem might also be that the host field is not being parsed, even though SOURCE_KEY = MetaData:Host should be correct.

rsennett_splunk
Splunk Employee

To test, you want to do this (you grabbed the original field and not the new one):
sourcetype=smtpchecker | rex field=Host "[^.]+.(?<gname>[^.]+).[^.]+" | head 10 | table Host gname

I traditionally table both, and I should have specified that earlier: with a column for Host and a column for gname, you'll be able to see what you grabbed right next to what you grabbed it from.

Then, because I'm paranoid that way, I would keep increasing head 10 to head 100 and so on, to make sure that the value of Host doesn't vary, such as losing the domain. But you can just grab the first part to avoid that. If the regex is anchored in the specific field, you don't have to give all that extra info.

Super_Knulps
Explorer

Thank you, the regexp is fine. It matches what I want when I run:
sourcetype=smtpchecker | rex field=Host "[^.]+.(?<gname>[^.]+).[^.]+" | head 10 | table Host gname

But in practice, I cannot search sourcetype=generic-name gname=*. The only results that I get are for this kind of event:

Host=instancename.toextract.subdomain.domain.com | Status=OK | Message=Connection to SMTP Port (25) succeeded|
host = instancename.toextract.subdomain source = /opt/splunk/etc/apps/myapp/bin/smtp-check.pl sourcetype = generic-name

But not for this event which is in the same sourcetype:

Script=SMTP-RELAY CHECK | Status = OK | Postfix=Postfix is running | SMTP=Connection to SMTP port succeed
host = instancename2.toextract.subdomain source = /opt/splunk/bin/scripts/smtprelay.pl sourcetype = generic-name

I think the reason is that the _raw data matches in the first case but not in the second.

Another example:
sourcetype=othersourcetype gname=* | stats count by gname

gname                              count
00 | Load_5_min=0                  29
03 | Load_5_min=0                  2
049041748046875e-05|Minimum=8      1
05 | Load_5_min=0                  2
059906005859375e-06|
07 | Load_5_min=0

Those are not the gname I am looking for.
They are coming from this kind of event:

Script=load-check | Status=Ok | Load_1_min=0.02 | Load_5_min=0.02 | Load_15_min=0.00
host = instancename.nametoextract.subodmain source = /opt/splunkforwarder/bin/scripts/load-check.shy sourcetype =othersourcetype

I think that MetaData:Host is not working as expected. I am actually starting to despair...
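
The odd gname values above have the shape you would get if the pattern ran against the raw event text instead of the host. Here is a minimal sketch of that effect, assuming the load-check event quoted above and a hypothetical fully qualified name; it uses Python's `re` module (named groups are written `(?P<name>...)` there), and `extract_gname` is an illustrative helper, not a Splunk function. It does not prove which key the transform actually read.

```python
import re

# Same shape as the pattern from the accepted answer, in Python named-group syntax.
PATTERN = re.compile(r"[^.]+\.(?P<gname>[^.]+)\.[^.]+\..+")

# Raw text of the load-check event quoted above.
raw = ("Script=load-check | Status=Ok | Load_1_min=0.02 | "
       "Load_5_min=0.02 | Load_15_min=0.00")

# Hypothetical fully qualified name, used only for comparison.
fqdn = "instancename.nametoextract.subdomain.domain.com"


def extract_gname(text):
    """Return the gname capture, or None when the pattern does not match."""
    match = PATTERN.search(text)
    return match.group("gname") if match else None


print(extract_gname(raw))   # 02 | Load_5_min=0
print(extract_gname(fqdn))  # nametoextract
```

The decimal points in the load averages act as the "dots" the pattern looks for, so any event with enough of them yields a gname, whatever its host is.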

rsennett_splunk
Splunk Employee

Have you named two fields in the same sourcetype the same thing? I'm confused...

Super_Knulps
Explorer

Oh no, my bad, I was confused. The field name is different from the sourcetype. I have edited my post.

rsennett_splunk
Splunk Employee

and now we know why your original regex wasn't working...
Always a good idea to test inline with the rex command

...|rex field=whatever "yourregex"|head 10|table yourfield

woodcock
Esteemed Legend

I would also like to backtrack from my comment on the "indexed_value" problem. If you are using index-time extractions ("TRANSFORMS-" or "EXTRACT-") then it cannot be the problem.

Super_Knulps
Explorer

Is there a difference between TRANSFORMS and EXTRACT? I have been using TRANSFORMS so far, as in the documentation.

woodcock
Esteemed Legend

No, EXTRACT- is inline and TRANSFORMS- is split (and gives you more nuanced options such as MV_ADD, etc.)

martin_mueller
SplunkTrust

I know. I'm questioning whether indexed extractions are the right tool for the job.

Set this in props.conf:

[your_sourcetype]
...
EXTRACT-gname = ^[^.]+\.(?<gname>[^.]+) in host

See if that works, and see if that selects the correct events (scanCount vs resultCount).

martin_mueller
SplunkTrust

If that's good in terms of scanCount vs resultCount and you want to get rid of the ugly host=*.some-gname.* you can do this field extraction:

<some regex> in host

That'll extract your gname from the host field and let you search using gname=some-gname, backed by the host field.

Super_Knulps
Explorer

You are talking about search time extraction while I am asking for index time.

martin_mueller
SplunkTrust

If you're trying to search on a part of the host you could do this:

index=foo sourcetype=generic-name host=*.some-gname.*

Should be pretty quick in terms of identifying the right events because host already is indexed. Loading the events is a different matter of course, so look at scanCount vs eventCount to check if your search is well-targeted or not.

rsennett_splunk
Splunk Employee

You are missing some parts in your regex:

YOURS: (capturing group not capturing anything, just naming the field):

[^\.]+\.(?<gname>)[^\.]\.domain\.com

MINE: (capturing group now contains the generic segment):

[^\.]+\.(?<gname>[^\.]+)\.[^\.]+\..+

in case it's not clear... here is the segment zoomed in - note the closing paren, and without the + you get the directive once... not one or more:

yours: (?<gname>)[^\.]

mine:  (?<gname>[^\.]+)
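
The difference between the two capturing groups can be reproduced with a short sketch. It assumes the hostname format from the question plus a hypothetical one-character subdomain, and uses Python's `re` module, where named groups are written `(?P<name>...)`. The variable names are illustrative only.

```python
import re

# Python writes named groups as (?P<name>...); Splunk's PCRE accepts (?<name>...).
EMPTY_GROUP = re.compile(r"[^.]+\.(?P<gname>)[^.]\.domain\.com")   # the "yours" shape
FILLED_GROUP = re.compile(r"[^.]+\.(?P<gname>[^.]+)\.[^.]+\..+")   # the "mine" shape

fqdn = "instance-name.generic-name.subdomain.domain.com"
# Hypothetical name with a one-character subdomain, the only case "yours" can match.
short_sub = "instance-name.generic-name.s.domain.com"

no_match = EMPTY_GROUP.search(fqdn)                   # [^.] allows one character only
empty = EMPTY_GROUP.search(short_sub).group("gname")  # matches, but captures nothing
filled = FILLED_GROUP.search(fqdn).group("gname")

print(no_match)     # None
print(repr(empty))  # ''
print(filled)       # generic-name
```

Even when the first pattern does match, the group closes before consuming anything, so the field would be empty either way.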

Super_Knulps
Explorer

Thank you for your answer!
I am still not receiving any result from your search.
Actually I have also tried it on regexpr.com and you are matching everything with your regexp.
Maybe I am missing something but it does not seem to work.

rsennett_splunk
Splunk Employee

Try regex101.com; it will show you what you are capturing and what you are not, and it will also walk you through the regex. You can see it working here:

https://regex101.com/r/zH0tS1/1

woodcock
Esteemed Legend

It is your REGEX; try this one:
(?<instancename>[^\.]+)\.(?<gname>[^\.]+)\.(?<subdomain>[^\.]+)\.(?<domain>[^\.]+)
