I know this has been asked many times, and answered in splunkbase and in the documentation -- yet here I am, not sure if an index-time extraction would be right for our situation.
A little background
We have quite a few different web applications whose logs have an identical format, but the specific application cannot be determined by the actual event data -- except by the source. For example, the logs for app foo in the production environment are in /path/to/apps/foo/prod/log/* , and the logs for app bar in the beta environment are in /path/to/apps/bar/beta/log/* .
When searching Splunk, we almost always want to restrict the results to a specific application and environment. Over the course of a day, this index can receive several million events, of which the app being searched for may account for only 100 or so. Up to this point, we've been including something like source="*/app/env/*" in each search, which works and is quite fast, but is a bit cumbersome. Sometimes we're searching over a few apps and would like the app name and environment pulled out into fields. In those cases we've used rex to our advantage, but again, it's a bit cumbersome (albeit fast) to have to add a rex to every search that needs it.
What I tried
After absorbing the index-time vs search-time docs, and after reading quite a few questions on this subject, I came up with a search-time extraction:
(in transforms.conf on the indexer)
[inhouse_app]
REGEX = /path/to/apps/(?<app>[^/]*)/(?<env>[^/]*)/.*
SOURCE_KEY = source
Perfect! It works like a charm! So, we changed a bunch of our searches to use app=appname env=environment instead of source="*/appname/environment/*", and very quickly found out that our search performance had degraded to the point where it was almost unusable in quite a few instances. For example:
Search: index=myapps source="*/app1/prod/*"
This search has completed and has returned 14 results by scanning 14 events in 0.193 seconds.
Search: index=myapps app=app1 env=prod
This search has completed and has returned 14 results by scanning 198,502 events in 91.228 seconds.
So, maybe that wasn't the way to go 😞 I also tried using tags, creating a separate tag for each source...but that's even more cumbersome. It works, but we have an ever-changing set of apps, and continuously messing around with the tags isn't something I necessarily want to do.
So...
While I continue to try to talk myself out of using an index-time extraction, it keeps seeming to me like the way to go. Thoughts?
(If there's any crucial information that I left out, feel free to ask for more -- I'd be more than happy to help you help me 😄 )
The Solution
Since the general consensus was that it would be acceptable to extract these fields at index-time, that's just what I did. I'm creating completely new fields (not overwriting any of the default ones, like sourcetype), and it is working like a charm. For posterity, here is how I accomplished this:
fields.conf
[app]
INDEXED = true
INDEXED_VALUE = false
[env]
INDEXED = true
INDEXED_VALUE = false
transforms.conf
[app_env]
SOURCE_KEY = MetaData:Source
REGEX = /path/to/apps/([^/]+)/([^/]+)/
WRITE_META = true
FORMAT = app::"$1" env::"$2"
props.conf
[my_sourcetype]
# ... other sourcetype related stuff
TRANSFORMS-appenv = app_env
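For anyone adapting this, it can help to sanity-check the regex against a few source paths before touching the indexer, since index-time fields are baked into the index once written. Below is a rough Python sketch of what the REGEX and FORMAT lines above should produce. It is only an approximation: Splunk uses PCRE rather than Python's re module (this simple pattern should behave the same in both), the helper name index_time_fields and the file name access.log are made up for illustration, and nothing here talks to Splunk.

```python
import re

# Same pattern as REGEX in the [app_env] stanza above.
# It is unanchored, so it may match anywhere in the source path.
APP_ENV = re.compile(r"/path/to/apps/([^/]+)/([^/]+)/")


def index_time_fields(source):
    """Mimic FORMAT = app::"$1" env::"$2" for one source path.

    Hypothetical helper for offline testing only. Returns None when
    the pattern does not match, in which case the transform would add
    no fields to the event.
    """
    match = APP_ENV.search(source)
    if match is None:
        return None
    return 'app::"{}" env::"{}"'.format(match.group(1), match.group(2))


if __name__ == "__main__":
    # access.log is an invented file name; only the directory layout
    # comes from the question.
    for path in (
        "/path/to/apps/foo/prod/log/access.log",
        "/path/to/apps/bar/beta/log/access.log",
        "/var/log/messages",
    ):
        print(path, "->", index_time_fields(path))
```

If a path you expect to match comes back as None, the regex is what needs fixing, not fields.conf or props.conf.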