We have a use case where we receive data from 2 different sources. Please note some key characteristics:
1. Our data volumes are low (roughly 5-10 GB per day)
2. Searches will be very frequent, since they are dashboard-driven (with auto-refresh) and many users will be on the dashboards constantly
3. Our data is mostly unstructured: in one case we have to parse it with regular expressions, and in another we have to extract fields from XML
Since our requirement is faster searches so that users see results quickly (i.e. less data but more searches), we were thinking we could extract the fields at index time and persist them, rather than doing search-time extraction. We have read online that in some cases index-time extraction can make searches faster, even though it increases the index size (capacity-wise, not a problem for us) and adds load at indexing time.
Could you please let me know whether the increase in index size and indexing overhead is going to offset the search-time speed-up we are expecting to gain from this?
Have you looked into using loadjob? You could schedule a search to populate a base result set, then use | loadjob within the dashboard panel searches to produce extremely fast-loading dashboards.
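As a rough sketch (the search, schedule, and names here are all hypothetical), the base search could be a scheduled report defined in savedsearches.conf:

```ini
# savedsearches.conf -- hypothetical scheduled base search
[base_dashboard_stats]
search = index=main sourcetype=my_source | stats count by host, status
enableSched = 1
cron_schedule = */5 * * * *
dispatch.earliest_time = -15m
dispatch.latest_time = now
```

Each dashboard panel search then becomes something like `| loadjob savedsearch="admin:search:base_dashboard_stats" | where status="error"`, which reads the artifacts of the most recent scheduled run instead of re-running the search for every user and every refresh.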
Unfortunately with a lot of data / searching problems, the answer depends on the data you're searching across and the types of searches you need to perform.
Splunk's documentation notes that index time extractions could be beneficial in some specific use cases: http://docs.splunk.com/Documentation/Splunk/6.3.1/Data/Configureindex-timefieldextraction
If your searches and data do not fit those use cases, you may not see a search speed-up at all. And of course the risk is that if your data format changes (think: upgrades) and the extractions stop working properly, you'll have to re-index the data to fix things.
Since you mention a lot of users hitting the same dashboard with auto-refresh, would a regularly run scheduled report embedded in the dashboard fit your use case? That way the search runs once per time period (potential limitation: the cron scheduler), and whoever loads the dashboard simply loads the results of the most recent scheduled run instead of kicking off their own search. One ref: http://docs.splunk.com/Documentation/Splunk/6.3.1/Viz/AddPanels
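In Simple XML, that can look like the following (the label and report name are illustrative, and exact behavior depends on version and panel settings); the panel references a saved, scheduled report by name instead of defining its own inline search:

```xml
<!-- Hypothetical Simple XML dashboard; the panel references a saved
     report rather than running an inline search per viewer -->
<dashboard>
  <label>Ops Overview</label>
  <row>
    <panel>
      <table>
        <search ref="base_dashboard_stats"></search>
      </table>
    </panel>
  </row>
</dashboard>
```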
Report acceleration, data model acceleration, and summary indexing all have similar limitations around frequency of execution / currency of data, but they have potential benefits as well depending on your use cases.
The standard recommendations of course all still apply: use limited time windows, explicit indexes, default fields, and as many terms as possible before the first pipe to limit the data that needs to be read from disk; then use streaming commands and trim fields as early as possible (before the first non-streaming command) to limit the data that has to travel over the network from the indexers to the search head.
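As an illustrative example of those recommendations (the index, sourcetype, and field names are made up): the search below puts all of its filtering terms before the first pipe, drops unneeded fields with the streaming fields command, and only then runs the transforming stats command:

```spl
index=web sourcetype=access_combined status=500 earliest=-15m
| fields host, uri, status
| stats count by host, uri
```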
Based on your requirements, I would use data model acceleration (if pivoting is required) or report acceleration. Read the following page carefully and choose the best option for your requirements.
If you do index-time field extraction, you are tying yourself unnecessarily to your current data format and schema, and also placing extra load on your indexers.
Thanks for the response.
1. But even if we use a data model, aren't we tying the format to it? Meaning that if additional attributes are added, the data model structure would also have to change.
2. We do some enrichment of the data based on the fields we extract, and if all of this has to happen at search time it would slow things down completely (especially when we are showing 50k-100k records in a drill-down dashboard and parsing/extracting at run time).
What would be the better option then?