Splunk Search

What is the most efficient approach to create a new field from the last portion of the source field's value?

gesman
Communicator

I need to create a 'site' field from the 'source' field by grabbing the last fragment of source, such as:
/var/logs/dir/subdomain1.domainA.com -> subdomain1.domainA.com
/var/logs/dir/domainB.com -> domainB.com

Every search query filters on 'site' extensively, so my idea was to use either index-time extraction or search-time extraction via props/transforms.

Considering that the data comes via a universal forwarder to the indexer, which approach is the most efficient?

0 Karma

martin_mueller
SplunkTrust

If you're going to search on site=foo then both are going to be terrible.

The calculated field (props.conf EVAL-site) and extracted fragment (props.conf EXTRACT-site ... in source) are both going to be terribly slow to filter on because Splunk cannot build an efficient source selector based on them. Effectively, both these searches should do the same:

site=foo
source=*foo

However, to the machine those two are not identical. The first asks for a field called site, which technically could come from anywhere. The second asks for specific sources ending in foo, so Splunk can look up matching sources and then load only those.

To put numbers to the theory, I've defined both eval'd and extracted fields on my PC's splunkd sourcetype and ran these three searches:

index=_internal sourcetype=splunkd site_eval="license_usage.log"
index=_internal sourcetype=splunkd site_extract="license_usage.log"
index=_internal sourcetype=splunkd source="*license_usage.log"

The first two take five seconds, scanning 100k events and returning 2 events. The third takes 0.3 seconds, scanning 2 events and returning 2 events. In terms of events scanned, that's a 50000x speedup.

My take-away from this is as follows: For searching, use source=*yoursite instead of site=yoursite. For reporting, create the site field using calculated fields or search-time extractions (doesn't matter much) to get ... | stats count by site.

You could define an index-time field site for searching, but there's no speed advantage over source=*site to outweigh the disadvantages.
If you absolutely need the prettier search for site=yoursite without an indexed field, you could fiddle with fields.conf to teach Splunk how to use the site field, something along the lines of the source:: example in http://docs.splunk.com/Documentation/Splunk/6.2.1/Admin/fieldsconf. Here be dragons, though: less-than-careful settings in fields.conf can muck up a lot of things.

martin_mueller
SplunkTrust

Use source="*\\site.com" then.

0 Karma

martin_mueller
SplunkTrust

Another thought - if you do really have thousands or even millions of unique sources, it'd be worth a thought to split source and site up at index time. Let source be the common path, and site the site-specific part. That way you don't duplicate high cardinality at index time by adding a site field on top of the full source field but rather move the cardinality elsewhere. When that starts to make sense depends on your data.

As for site=subdomain.site.com vs site=site.com, a filter on source=*/site.com should fix that.

0 Karma

gesman
Communicator

I thought about it, but this won't work in a Windows environment with backslashes.

0 Karma

martin_mueller
SplunkTrust

Yes. On search-time extracted fields, field=*suffix is typically bad because it means you have to load everything, extract, and then filter. On index-time extracted fields such as source, field=*value is typically not bad because you only need to look at all the extracted values, select the matches, and then load matching raw data.

To judge how bad your filters are, compare the scanCount in the job inspector with the returned results.
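The difference between the two filter styles can be sketched with a toy model in Python. This is only an illustration of the reasoning above, not Splunk's actual storage engine: the event list, the `SOURCE_INDEX` dictionary, and both function names are hypothetical, and each function returns a pair that plays the role of scanCount and result count.

```python
# Toy model only: not Splunk's actual storage engine.
from fnmatch import fnmatchcase

# Hypothetical data: 1000 events spread evenly across 4 sources.
SOURCES = [
    "/var/logs/dir/subdomain1.domainA.com",
    "/var/logs/dir/domainA.com",
    "/var/logs/dir/domainB.com",
    "/var/logs/dir/other.org",
]
EVENTS = [(SOURCES[i % 4], f"event {i}") for i in range(1000)]

# "Index-time" structure: each distinct source value -> positions of its events.
SOURCE_INDEX = {}
for pos, (src, _raw) in enumerate(EVENTS):
    SOURCE_INDEX.setdefault(src, []).append(pos)


def search_by_source(pattern):
    """Match the wildcard against distinct source values, then load only those events."""
    hits = [p for src, positions in SOURCE_INDEX.items()
            if fnmatchcase(src, pattern) for p in positions]
    return len(hits), len(hits)  # (scanned, returned)


def search_by_site(site):
    """Load every event, derive site from source, then filter."""
    scanned = returned = 0
    for src, _raw in EVENTS:
        scanned += 1
        if src.rsplit("/", 1)[-1] == site:
            returned += 1
    return scanned, returned


print(search_by_source("*/domainB.com"))  # scanned equals returned
print(search_by_site("domainB.com"))      # scanned equals the total event count
```

In this model the wildcard is only tested against the four distinct source values, while the site filter has to touch all 1000 events. It also shows why the leading slash matters: `*domainA.com` would pick up the subdomain source as well, while `*/domainA.com` would not.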

0 Karma

gesman
Communicator

Thanks Martin. That's a great idea about source=*site.com to prefilter on sources.
I still need to add site=site.com because there could also be site=subdomain.site.com.
I assume that even if I have 10,000 different sources, source=*site.com will still be faster than loading everything and then post-filtering on the 'site' field?

0 Karma

gesman
Communicator

I ended up putting this into /splunk/etc/apps/MY_APP/local/props.conf:

[access_combined]
EVAL-site = replace(source, "^.*?/([^/]+)$", "\1")
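The pattern in that EVAL can be sanity-checked outside Splunk. The sketch below is a Python translation, assuming Python's `re` behaves like Splunk's regex engine for a pattern this simple. `PATTERN_ANY_SEP` is a hypothetical variant, not something posted in this thread, that also accepts backslash separators for the Windows case raised earlier; the Windows path is a made-up example.

```python
import re

# Python translation of the replace() pattern above; assumed to behave like
# Splunk's regex engine for this simple pattern.
PATTERN = re.compile(r"^.*?/([^/]+)$")
# Hypothetical variant that also accepts backslash separators (Windows paths).
PATTERN_ANY_SEP = re.compile(r"^.*[/\\]([^/\\]+)$")


def site_from_source(source, pattern=PATTERN):
    """Return the last path fragment, or the input unchanged if nothing matches."""
    return pattern.sub(r"\1", source)


for s in ["/var/logs/dir/subdomain1.domainA.com",
          "/var/logs/dir/domainB.com",
          r"C:\logs\dir\domainB.com"]:
    print(s, "->", site_from_source(s), "|", site_from_source(s, PATTERN_ANY_SEP))
```

With the forward-slash-only pattern, a backslash path does not match and comes back unchanged, so site would equal the full source on such inputs.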

aljohnson_splun
Splunk Employee

Is there a reason why you couldn't just use rex?

source=* | rex field=source ".*/(?<end_of_source>.*)"
0 Karma

gesman
Communicator

I have tons of queries and don't want to inject the same thing into every single query, knowing that it is needed for each and every one of them.

0 Karma

DavidHourani
Super Champion

Macro, maybe?

0 Karma

lguinn2
Legend

[Updated]
You can do search-time extraction of a field from another field. BUT - you can also do a calculated field! Calculated fields are also search time artifacts, and are preferred over index-time extractions. I strongly advise you to avoid index-time field extractions if you possibly can. They are not more efficient, they are less flexible and they consume more disk space.

Test this eval command. If it works, use it to create a calculated field on the indexer (or search head if you have one):

source=*.com 
| eval site = replace(source,".*/(.*?)$", "\1")

gesman
Communicator

I am not going to do index-time extractions, but:

"You can't do search-time extraction of a field from another field"
The props.conf documentation says that I can, though, like this in my case:

[access_combined]
EXTRACT-site = [/\\](?<site>[^/\\]+])$ in source

I thought I'd be able to use it like above especially making it sourcetype-specific. Shouldn't it work?
Your eval will certainly work (not sure why the double slashes, though). Could you please elaborate on the difference between EXTRACT-site and EVAL-site?

0 Karma

lguinn2
Legend

Ha! You are right, and I had forgotten that you could do this (EXTRACT-site = [/](?<site>[^/]+])$ in source). I used // because I can't type. 🙂
I fixed my answer.

0 Karma

gesman
Communicator

Thank you.
I tried EXTRACT-* but it didn't work for some reason. The EVAL-* approach did, so I went with it.
Keep in mind that we cannot do field aliasing on EVAL-ed fields, because aliasing is done before EVAL-ing.
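One possible reason the EXTRACT-site pattern quoted earlier did not match (a guess; the thread never confirms the cause) is the stray `]` just before the closing parenthesis, which makes the regex require a literal `]` at the end of the source. The Python sketch below illustrates this. It assumes Python's `re` behaves like Splunk's engine here, uses Python's `(?P<site>...)` spelling for the named group, and `WITHOUT_BRACKET` is a hypothetical correction rather than a pattern anyone in this thread tested.

```python
import re

# Python spells named groups (?P<name>...), where PCRE also accepts (?<name>...).
AS_POSTED = re.compile(r"[/\\](?P<site>[^/\\]+])$")        # stray "]" before ")"
WITHOUT_BRACKET = re.compile(r"[/\\](?P<site>[^/\\]+)$")   # hypothetical correction


def extract_site(pattern, source):
    """Return the captured site, or None when the pattern does not match."""
    m = pattern.search(source)
    return m.group("site") if m else None


src = "/var/logs/dir/domainB.com"
print(extract_site(AS_POSTED, src))        # no match for an ordinary path
print(extract_site(WITHOUT_BRACKET, src))  # captures the last fragment
```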

0 Karma

somesoni2
Revered Legend

Search-time field extractions are preferred to index-time field extractions. More details at the links below:

http://answers.splunk.com/answers/2535/search-time-vs-index-time-field-extraction.html

http://docs.splunk.com/Documentation/Splunk/6.2.1/Indexer/Indextimeversussearchtime

0 Karma