I need to create a 'site' field from the 'source' field by grabbing the last fragment of the source path, such as:
/var/logs/dir/subdomain1.domainA.com -> subdomain1.domainA.com
/var/logs/dir/domainB.com -> domainB.com
Every search query filters on 'site' extensively, so my idea was to use either index-time extraction or search-time extraction via props/transforms.
Considering that the data is coming via a universal forwarder to the indexer, which approach is the most efficient?
If you're going to search on site=foo then both are going to be terrible.
The calculated field (props.conf EVAL-site) and extracted fragment (props.conf EXTRACT-site ... in source) are both going to be terribly slow to filter on because Splunk cannot build an efficient source selector based on them. Effectively, both these searches should do the same:
site=foo
source=*foo
However, to the machine those two are not identical. The first asks for a field called site, which technically could come from anywhere. The second asks for specific sources ending in foo, so Splunk can look up matching sources and then load only those.
To put numbers to the theory, I've defined both eval'd and extracted fields on my PC's splunkd sourcetype and ran these three searches:
index=_internal sourcetype=splunkd site_eval="license_usage.log"
index=_internal sourcetype=splunkd site_extract="license_usage.log"
index=_internal sourcetype=splunkd source="*license_usage.log"
The first two take five seconds, scanning 100k events and returning 2 events. The third takes 0.3 seconds, scanning 2 events and returning 2 events. In terms of events scanned, that's a 50,000x reduction; in wall-clock time it's roughly 17x faster.
My take-away from this is as follows: For searching, use source=*yoursite instead of site=yoursite. For reporting, create the site field using calculated fields or search-time extractions (doesn't matter much) to get ... | stats count by site.
You could define an index-time field site for searching, but there's no speed advantage over source=*site to outweigh the disadvantages.
If you absolutely need the prettier search for site=yoursite without an indexed field you could fiddle with fields.conf to teach Splunk how to use the site field, something along the lines of the source:: example in http://docs.splunk.com/Documentation/Splunk/6.2.1/Admin/fieldsconf - here be dragons though, less-than-careful settings in fields.conf can muck up a lot of things.
Use source="*\\site.com" then.
Another thought - if you do really have thousands or even millions of unique sources, it'd be worth a thought to split source and site up at index time. Let source be the common path, and site the site-specific part. That way you don't duplicate high cardinality at index time by adding a site field on top of the full source field but rather move the cardinality elsewhere. When that starts to make sense depends on your data.
As for site=subdomain.site.com vs site=site.com, a filter on source=*/site.com should fix that.
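To see why the leading separator matters, here is a small Python sketch of the suffix matching. It only models what a leading-wildcard filter does (plain "ends with"); `matches_suffix` and the sample paths are made up for illustration and are not anything Splunk provides.

```python
def matches_suffix(source, suffix):
    """Model a leading-wildcard filter such as source=*<suffix> (hypothetical helper)."""
    return source.endswith(suffix)


# Made-up sample sources, following the paths in the question.
sources = [
    "/var/logs/dir/site.com",
    "/var/logs/dir/subdomain.site.com",
]

# source=*site.com also catches the subdomain source.
loose = [s for s in sources if matches_suffix(s, "site.com")]

# source=*/site.com only catches the source whose last fragment is exactly site.com.
strict = [s for s in sources if matches_suffix(s, "/site.com")]

# Windows-style path: the separator in the suffix has to be a backslash instead.
windows_hit = matches_suffix("C:\\logs\\dir\\site.com", "\\site.com")

print(loose)
print(strict)
print(windows_hit)
```

Including the path separator in the suffix is what keeps subdomain.site.com out of a search meant for site.com.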
I thought about it - this won't work for Windows environment with backslashes.
Yes. On search-time extracted fields, field=*suffix is typically bad because it means you have to load everything, extract, and then filter. On index-time extracted fields such as source, field=*value is typically not bad because you only need to look at all the extracted values, select the matches, and then load matching raw data.
To judge how bad your filters are, compare the scanCount in the job inspector with the returned results.
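The difference between the two strategies can be sketched with a toy model in Python. This is not how Splunk is implemented internally; the event list, the `SOURCE_INDEX` dictionary and both function names are assumptions made up purely to show why the scan counts differ.

```python
from collections import defaultdict

# Hypothetical toy data: 100 events spread evenly over four sources.
EVENTS = [
    {"source": f"/var/logs/dir/{site}", "raw": f"event {i}"}
    for i, site in enumerate(
        ["domainA.com", "domainB.com", "sub.domainB.com", "domainC.com"] * 25
    )
]

# Stand-in for index-time metadata: source value -> positions of its events.
SOURCE_INDEX = defaultdict(list)
for pos, event in enumerate(EVENTS):
    SOURCE_INDEX[event["source"]].append(pos)


def search_by_source_suffix(suffix):
    """Match the distinct source values first, then load only the matching events."""
    positions = [
        pos
        for source, plist in SOURCE_INDEX.items()
        if source.endswith(suffix)
        for pos in plist
    ]
    results = [EVENTS[pos] for pos in positions]
    return len(positions), results  # (events scanned, events returned)


def search_by_site_field(site):
    """Load every event, derive site at search time, then filter."""
    scanned = 0
    results = []
    for event in EVENTS:
        scanned += 1
        if event["source"].rsplit("/", 1)[-1] == site:
            results.append(event)
    return scanned, results


fast_scan, fast_results = search_by_source_suffix("/domainB.com")
slow_scan, slow_results = search_by_site_field("domainB.com")
print(fast_scan, len(fast_results))
print(slow_scan, len(slow_results))
```

Both functions return the same events in this toy, but the first only touches the events of the matching source while the second touches all of them, which is the gap scanCount makes visible.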
Thanks Martin. That's a great idea about source=*site.com to prefilter on sources.
I still need to add site=site.com because there could also be site=subdomain.site.com.
I assume that even if I have 10,000 different sources, source=*site.com will still be faster than loading everything and then post-filtering on the 'site' field?
I ended up putting this into /splunk/etc/apps/MY_APP/local/props.conf:
[access_combined]
EVAL-site = replace(source, "^.*?/([^/]+)$", "\1")
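For anyone who wants to check that regex outside Splunk, here is a rough Python equivalent. It assumes Python's `re.sub` behaves like eval's replace() for this particular pattern (the two regex dialects differ in general), and `site_from_source` is just a name picked for the sketch.

```python
import re


def site_from_source(source):
    """Rough Python stand-in for replace(source, "^.*?/([^/]+)$", "\\1")."""
    # The whole string matches, so it is replaced by the last path fragment.
    # If the source contains no forward slash, nothing matches and the
    # source comes back unchanged.
    return re.sub(r"^.*?/([^/]+)$", r"\1", source)


print(site_from_source("/var/logs/dir/subdomain1.domainA.com"))
print(site_from_source("/var/logs/dir/domainB.com"))
print(site_from_source("C:\\logs\\dir\\domainB.com"))
```

The last call shows the Windows caveat: with backslash-only paths the pattern never matches, so site would end up equal to the full source.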
Is there a reason why you couldn't just use rex?
source=* | rex field=source ".*/(?<end_of_source>.*)"
I have tons of queries and don't want to inject the same thing into every single query, knowing that it is needed for each of them.
Macro maybe?
[Updated]
You can do search-time extraction of a field from another field. BUT - you can also do a calculated field! Calculated fields are also search time artifacts, and are preferred over index-time extractions. I strongly advise you to avoid index-time field extractions if you possibly can. They are not more efficient, they are less flexible and they consume more disk space.
Test this eval command. If it works, use it to create a calculated field on the indexer (or search head if you have one):
source=*.com
| eval site = replace(source,".*/(.*?)$", "\1")
I am not going to do index-time extractions, but:
You can't do search-time extraction of a field from another field
props.conf DOC says that I can though, like this in my case:
[access_combined]
EXTRACT-site = [/\\](?<site>[^/\\]+)$ in source
I thought I'd be able to use it like above especially making it sourcetype-specific. Shouldn't it work?
Your eval certainly will work (not sure why double slashes though), could you elaborate please on the difference between EXTRACT-site and EVAL-site?
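A pattern like this can be sanity-checked outside Splunk before it goes into props.conf. The sketch below is Python, so the named group is written `(?P<site>...)` instead of `(?<site>...)`; `extract_site` and both compiled patterns are names made up for this check, and it assumes the pattern behaves the same under Splunk's regex engine.

```python
import re

# Last path fragment after either a forward slash or a backslash.
SITE_RE = re.compile(r"[/\\](?P<site>[^/\\]+)$")

# Same pattern with a stray "]" after the "+": it now demands a literal "]"
# as the last character of the source, so ordinary paths no longer match.
STRAY_BRACKET_RE = re.compile(r"[/\\](?P<site>[^/\\]+])$")


def extract_site(source, pattern=SITE_RE):
    """Return the captured site, or None when the pattern does not match."""
    match = pattern.search(source)
    return match.group("site") if match else None


print(extract_site("/var/logs/dir/subdomain1.domainA.com"))
print(extract_site("C:\\logs\\dir\\domainB.com"))
print(extract_site("/var/logs/dir/domainB.com", STRAY_BRACKET_RE))
```

The third call is worth a look if an extraction silently returns nothing: one extra bracket inside the capture group is enough to stop the field from ever being populated.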
Ha! You are right and I had forgotten that you could do this (EXTRACT-site = [/](?<site>[^/]+)$ in source). I used // because I can't type. 🙂
I fixed my answer.
Thank you.
I tried EXTRACT-* but it didn't work for some reason. The EVAL-* approach did, so I went with it.
Not to forget that we cannot do field aliasing on EVAL-ed fields, because aliasing is done before EVAL-ing.
Search-time field extractions are preferred to index-time field extractions. More details at the links below:
http://answers.splunk.com/answers/2535/search-time-vs-index-time-field-extraction.html
http://docs.splunk.com/Documentation/Splunk/6.2.1/Indexer/Indextimeversussearchtime