I've read the docs in the splunk manual on parse-time indexed fields. http://docs.splunk.com/Documentation/Splunk/6.1.3/Data/Configureindex-timefieldextraction
But I still have a question. We're going to be searching 15 months' worth of authentication data to see if users have logged in within the previous 15 months. We'll have to run this search for 700,000 different user IDs, so the speed of each individual search is very important.
We've already decided to create a summary index that extracts the auth information from the main LDAP and Active Directory logs and creates a new, reduced data set. However, I'm still concerned that searching 15 months' worth of data will take a LONG time when repeated 700,000 times. For example, if each search takes an average of 0.5 seconds, the full run will take about 4 days.
I'm wondering if creating an index-time field for the user ID would speed things up dramatically. This is what we'd do with a database table, but I'm not sure if "indexed" means the same thing in Splunk.
Basically, each search would need to go back and look for the first successful auth event for its user ID, and could stop there. Unfortunately, we expect a significant number of users to have no auth events at all, so those searches would have to scan the entire data set before coming up empty.
Does this sound like a good use case for creating an index-time field?
Thanks,
Andrew
Great question! You are right - in some ways, "indexed" means something different in Splunk than in a traditional RDBMS.
I assert that this is not a good case for creating an index-time field, because all keywords in events are already indexed. So all the user names (and the auth info) are already indexed. This is true for all keywords, even if they are not extracted into fields.
Splunk also uses other techniques, like bloom filters, to quickly skip over portions of the index that do not contain the data.
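In other words, a search that includes the bare user ID as a keyword already takes advantage of the index. For example (the user ID here is made up), something like this can stop at the first match:

sourcetype=whatever "jsmith42" "successful auth keywords" | head 1

Splunk searches most-recent-first, so head 1 returns the most recent matching event and then stops; since you only need to know whether the user logged in at all during the window, any match is enough.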
Finally, rather than running 700,000 searches - why not run one search that lists the first successful auth event for each user that appears in the data and then compare that to the list of 700,000 users? Here is a potential search:
sourcetype=whatever "successful auth keywords" | stats earliest(_time) as firstAuth by user
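For the comparison step, one option (just a sketch - it assumes you load the 700,000 IDs into a lookup file; the name all_users.csv and its user column are made up) would be to append the lookup to the stats results and keep the users with no firstAuth:

sourcetype=whatever "successful auth keywords"
| stats earliest(_time) as firstAuth by user
| inputlookup append=t all_users.csv
| stats first(firstAuth) as firstAuth by user
| where isnull(firstAuth)

Users that appear only in the lookup end up with an empty firstAuth, so the final where clause leaves you with exactly the users who never authenticated.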
I suggest that you give us more info about the actual data and the searches - I am sure the community can come up with more ideas for you.
BTW, have you considered getting your data into the Common Information Model (CIM)? From there you could enable acceleration on the Authentication data model and search using tstats or pivot. The results would be much faster, but you'd have to keep the data online for the full 15 months.
Alternately, you could set up your own custom accelerated data model on your summary index and search that instead.
Either option should give you really good performance compared with a raw search.
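For example (a sketch, assuming your events are mapped to the CIM Authentication data model and acceleration is enabled), a single tstats search could pull the first successful login per user straight from the accelerated summaries:

| tstats min(_time) as firstAuth from datamodel=Authentication where Authentication.action="success" by Authentication.user

Because tstats reads the accelerated data model rather than the raw events, it should run dramatically faster than the equivalent raw search.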
Thanks for your help!
OOhh - just remembered the RIGHT answer to the question "why do the default index-time fields exist" - because sometimes they AREN'T IN THE DATA!! For example, what if you had a log file with events that looked like this:
3 Sep 2014 23:20 ERROR 317 transaction 4712
If you index just the data - how the heck do you know which server that happened on? Indexing source, sourcetype and host provides a base of information about the origin of the data. You have to index these fields separately because they aren't actually IN the event. That's why it is "metadata" - data about the data.
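For example (the host and sourcetype values here are invented), those metadata fields let you scope a search to one server even though the server name never appears in the raw event:

host=appserver01 sourcetype=app_log "ERROR 317"

Without the indexed host field, there would be no way to tell which of your servers produced that event.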
I believe that there are multiple reasons for the default index-time fields. First, there may be a historical component.
Second, the default fields are guaranteed to exist for all data, regardless of its origin. I believe that the default fields are also stored differently than user-created index-time field extractions. And of course the index-time field extractions are stored differently than the keywords, but I believe they are stored in the same tsidx files.
I don't know enough about the details of the tsidx files to really say more.
"why not run one search that lists the first successful auth event for each user that appears in the data and then compare that to the list of 700,000 users" -- That's a great idea! Thanks.
Thanks for the reply. A question. If "all keywords in events are already indexed", why do the default, index-time fields exist at all? What's the point of them? How are they different than keyword indexes, or are they?
Not necessarily.
So, as I understand it, you have a summary index that just contains 15 months' worth of logins, which is a good place to start. Where are the 700k names coming from?