Re: What is the most efficient approach to create ...

gesman · ‎01-21-2015

I need to create 'site' field from 'source' field by grabbing last fragment of source, such as:
/var/logs/dir/subdomain1.domainA.com -> subdomain1.domainA.com
/var/logs/dir/domainB.com -> domainB.com

Every search query filters on 'site' extensively, so my idea was to use either index-time extraction or search-time extraction via props/transforms.

Considering that data is coming via universal forwarder to indexer - which approach is the most efficient?
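
For reference, the mapping being asked for can be sketched outside Splunk. This is a minimal Python illustration, not Splunk configuration: the function name last_fragment is made up for the example, and it assumes forward-slash paths as in the two samples above.

```python
def last_fragment(source: str) -> str:
    """Return everything after the final '/' in a source path."""
    # rpartition splits on the last '/'; if there is none, the whole
    # string comes back unchanged as the third element.
    return source.rpartition("/")[2]


for path in ("/var/logs/dir/subdomain1.domainA.com", "/var/logs/dir/domainB.com"):
    print(path, "->", last_fragment(path))
```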

martin_mueller · ‎01-21-2015

If you're going to search on site=foo then both are going to be terrible.

The calculated field (props.conf EVAL-site) and extracted fragment (props.conf EXTRACT-site ... in source) are both going to be terribly slow to filter on because Splunk cannot build an efficient source selector based on them. Effectively, both these searches should do the same:

site=foo
source=*foo

However, to the machine those two are not identical. The first asks for a field called site, which technically could come from anywhere. The second asks for specific sources ending in foo, so Splunk can look up matching sources and then load only those.

To put numbers to the theory, I've defined both eval'd and extracted fields on my PC's splunkd sourcetype and ran these three searches:

index=_internal sourcetype=splunkd site_eval="license_usage.log"
index=_internal sourcetype=splunkd site_extract="license_usage.log"
index=_internal sourcetype=splunkd source="*license_usage.log"

The first two take five seconds each, scanning 100k events and returning 2 events. The third takes 0.3 seconds, scanning 2 events and returning 2 events. That is roughly 17x faster in wall-clock time and a 50,000x reduction in events scanned.
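
The gap between those numbers can be mimicked with a toy model. The Python below is only a sketch of the idea, not of Splunk's internals: it assumes a made-up in-memory "index" that maps each source to its events, so a source filter can pick matching sources first while a filter on a derived field has to touch every event. All names, paths and event counts are invented for the illustration.

```python
# Hypothetical toy index: source path -> list of raw events.
index = {
    "/opt/splunk/var/log/splunk/splunkd.log": ["event"] * 1000,
    "/opt/splunk/var/log/splunk/metrics.log": ["event"] * 500,
    "/opt/splunk/var/log/splunk/license_usage.log": ["event"] * 2,
}


def search_by_derived_field(site: str):
    """Model of site=...: derive the field per event, then filter."""
    scanned, matched = 0, 0
    for source, events in index.items():
        for _ in events:
            scanned += 1  # every event has to be loaded
            if source.rpartition("/")[2] == site:
                matched += 1
    return scanned, matched


def search_by_source_suffix(suffix: str):
    """Model of source=*suffix: select sources first, then load events."""
    scanned, matched = 0, 0
    for source, events in index.items():
        if source.endswith(suffix):  # decided from the source list alone
            scanned += len(events)
            matched += len(events)
    return scanned, matched


print(search_by_derived_field("license_usage.log"))  # (1502, 2)
print(search_by_source_suffix("license_usage.log"))  # (2, 2)
```

Both searches return the same 2 events; only the number of events touched along the way differs, which is what scanCount reports.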

My take-away from this is as follows: For searching, use source=*yoursite instead of site=yoursite. For reporting, create the site field using calculated fields or search-time extractions (doesn't matter much) to get ... | stats count by site.

You could define an index-time field site for searching, but there's no speed advantage over source=*site to outweigh the disadvantages.
If you absolutely need the prettier search for site=yoursite without an indexed field you could fiddle with fields.conf to teach Splunk how to use the site field, something along the lines of the source:: example in http://docs.splunk.com/Documentation/Splunk/6.2.1/Admin/fieldsconf - here be dragons though, less than careful settings in fields.conf can muck up a lot of things.

martin_mueller · ‎01-23-2015

Use source="*\\site.com" then.

martin_mueller · ‎01-23-2015

Another thought - if you do really have thousands or even millions of unique sources, it'd be worth a thought to split source and site up at index time. Let source be the common path, and site the site-specific part. That way you don't duplicate high cardinality at index time by adding a site field on top of the full source field but rather move the cardinality elsewhere. When that starts to make sense depends on your data.

As for site=subdomain.site.com vs site=site.com, a filter on source=*/site.com should fix that.
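
The difference between the two wildcards is easy to check outside Splunk. The Python sketch below treats source=*X as a plain "ends with X" test, which is an assumption made for illustration (it ignores anything else Splunk does when matching, such as case handling). The sample paths are hypothetical.

```python
sources = [
    "/var/logs/dir/site.com",
    "/var/logs/dir/subdomain.site.com",
    "/var/logs/dir/othersite.com",
]

# source=*site.com  -> suffix "site.com"
loose = [s for s in sources if s.endswith("site.com")]
# source=*/site.com -> suffix "/site.com"
strict = [s for s in sources if s.endswith("/site.com")]

print(loose)   # all three paths
print(strict)  # only /var/logs/dir/site.com
```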

gesman · ‎01-23-2015

I thought about it - this won't work for Windows environment with backslashes.

martin_mueller · ‎01-23-2015

Yes. On search-time extracted fields, field=*suffix is typically bad because it means you have to load everything, extract, and then filter. On index-time extracted fields such as source, field=*value is typically not bad because you only need to look at all the extracted values, select the matches, and then load matching raw data.

To judge how bad your filters are, compare the scanCount in the job inspector with the returned results.

gesman · ‎01-22-2015

Thanks Martin. That's a great idea about source=*site.com to prefilter on sources.
I still need to add site=site.com because there could also be site=subdomain.site.com.
I assume that even if I have 10,000 different sources, source=*site.com will still be faster than loading everything and then post-filtering on the 'site' field?

gesman · ‎01-21-2015

I ended up putting this into /splunk/etc/apps/MY_APP/local/props.conf:

[access_combined]
EVAL-site = replace(source, "^.*?/([^/]+)$", "\1")
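
The regular expression in that EVAL can be tried outside Splunk. The Python below is only an approximation: it assumes Python's re.sub behaves like Splunk's replace() for this particular pattern, and the function name and sample paths are invented for the example.

```python
import re

PATTERN = r"^.*?/([^/]+)$"


def site_from_source(source: str) -> str:
    # Same pattern and replacement as the EVAL-site line; if the
    # pattern does not match, the input comes back unchanged.
    return re.sub(PATTERN, r"\1", source)


print(site_from_source("/var/logs/dir/subdomain1.domainA.com"))
print(site_from_source(r"C:\logs\dir\domainB.com"))
```

The second call prints the path unchanged because it contains no forward slash, so this pattern on its own would not cover Windows-style sources.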

aljohnson_splun · ‎01-21-2015

Is there a reason why you couldn't just use rex?

source=* | rex field=source ".*/(?<end_of_source>.*)"

gesman · ‎01-21-2015

I have tons of queries and don't want to inject the same thing into every single one, knowing that it is needed for each and every one of them.

DavidHourani · ‎01-23-2015

Macro maybe?

lguinn2 · ‎01-21-2015

[Updated]
You can do search-time extraction of a field from another field. BUT - you can also do a calculated field! Calculated fields are also search time artifacts, and are preferred over index-time extractions. I strongly advise you to avoid index-time field extractions if you possibly can. They are not more efficient, they are less flexible and they consume more disk space.

Test this eval command. If it works, use it to create a calculated field on the indexer (or search head if you have one):

source=*.com 
| eval site = replace(source,".*/(.*?)$", "\1")

gesman · ‎01-21-2015

I am not going to do index-time extractions, but:

You wrote: "You can't do search-time extraction of a field from another field"
The props.conf docs say that I can, though, like this in my case:

[access_combined]
EXTRACT-site = [/\\](?<site>[^/\\]+])$ in source

I thought I'd be able to use it as above, especially since it makes the extraction sourcetype-specific. Shouldn't it work?
Your eval will certainly work (not sure why the double slashes, though). Could you please elaborate on the difference between EXTRACT-site and EVAL-site?

lguinn2 · ‎01-22-2015

Ha! You are right and I had forgotten that you could do this (EXTRACT-site = [/](?<site>[^/]+])$ in source). I used // because I can't type. 🙂
I fixed my answer.

gesman · ‎01-22-2015

Thank you.
I tried EXTRACT-* but it didn't work for some reason. The EVAL-* approach did, so I went with it.
One thing to keep in mind: we cannot do field aliasing on EVAL-ed fields, because aliasing is applied before calculated fields are evaluated.
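
One possible reason the EXTRACT-site attempt failed, offered as a guess rather than a diagnosis: the pattern as posted has a stray ] after [^/\\]+, so it only matches sources that end in a literal ]. The Python check below is a sketch of that. It rewrites the named group as (?P<site>...) because Python's re module requires that spelling, and the corrected pattern is a suggestion, not something confirmed in this thread.

```python
import re

# Pattern as posted, with Python's (?P<name>...) group syntax.
posted = re.compile(r"[/\\](?P<site>[^/\\]+])$")
# Same pattern with the stray ']' removed (a suggested fix).
fixed = re.compile(r"[/\\](?P<site>[^/\\]+)$")

source = "/var/logs/dir/domainB.com"

print(posted.search(source))               # None
print(fixed.search(source).group("site"))  # domainB.com
```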

somesoni2 · ‎01-21-2015

Search-time field extractions are preferred over index-time field extractions. More details at the links below:

http://answers.splunk.com/answers/2535/search-time-vs-index-time-field-extraction.html

http://docs.splunk.com/Documentation/Splunk/6.2.1/Indexer/Indextimeversussearchtime

What is the most efficient approach to create a new field from the last portion of the source field's value?
