I want to summarise data so that our management can report on it using a search form. This will tell us how often a particular service is being used, and which "options" are used with it. The basic workflow is:
I think I may need to use a summary index or report acceleration. I am pretty new to both of these concepts, though, so I would welcome any advice, input, or suggestions. Basically, the report must show data further back than the index's retention period, and the search must run faster than a search over the entire raw index.
My concern with report acceleration is that the data is stored in the hot/warm/cold buckets and would seem to be subject to the index's 90-day retention. However, if I were to create a summary index to store the summarised data in, I could give it its own retention period (a year or more).
However, report acceleration sounds handy in that it handles backfill, and I don't have to worry about choosing the time range for a saved search that populates a summary index based on how often it runs (for example, running it every day at midnight and looking back -1d, so that no events are missed and none are counted more than once).
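To make the "no gaps, no double counting" concern concrete, here is a minimal Python sketch of the windowing logic only; it is not Splunk code. It assumes a nightly job that runs shortly after midnight and covers the previous full day as a half-open interval [start, end). The function names `previous_day_window` and `in_window` are made up for illustration. In Splunk terms this roughly corresponds to snapping the search range to day boundaries (for example `earliest=-1d@d latest=@d`) instead of using a range relative to the exact run time.

```python
from datetime import datetime, timedelta

def previous_day_window(run_time):
    """Return the half-open [start, end) window for the full day before run_time."""
    # Snap the end of the window to midnight of the day the job runs on.
    end = run_time.replace(hour=0, minute=0, second=0, microsecond=0)
    start = end - timedelta(days=1)
    return start, end

def in_window(event_time, window):
    """True if event_time falls inside the half-open window."""
    start, end = window
    return start <= event_time < end

# Three consecutive nightly runs, each starting a few minutes after midnight.
runs = [datetime(2024, 3, day, 0, 5) for day in (2, 3, 4)]
windows = [previous_day_window(run) for run in runs]

# An event that lands exactly on a day boundary belongs to exactly one window.
boundary_event = datetime(2024, 3, 2, 0, 0, 0)
matches = [w for w in windows if in_window(boundary_event, w)]
```

Because each window ends exactly where the next one starts, and the end is exclusive, it does not matter whether the job starts at 00:05 or 00:20. This sketch does not deal with events that are indexed late, after their day's window has already been summarised.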
I'm also wondering how to let the report be limited to specific time periods. I assume the saved search would also need to return the date for this purpose...
The option that seems best to me right now is to create a new index with a retention of 1 year, then create a saved search that pulls events, returns only the date and "option" fields, and stores them in this index. The search form would then search this index of condensed data.
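As a rough illustration of that idea, the Python sketch below reduces each event to just a date and an "option" value, then counts options within a chosen date range. The event layout, field names, and sample values are invented for the example, and the functions `condense` and `option_counts` are hypothetical; this shows the data flow, not how Splunk stores or searches a summary index.

```python
from collections import Counter
from datetime import date

# Hypothetical raw events; only "time" and "option" matter for the report.
raw_events = [
    {"time": "2024-01-15T09:12:44", "option": "pdf", "user": "alice", "bytes": 5120},
    {"time": "2024-01-15T10:03:10", "option": "csv", "user": "bob", "bytes": 880},
    {"time": "2024-01-20T16:45:02", "option": "pdf", "user": "carol", "bytes": 7300},
    {"time": "2024-02-02T08:30:00", "option": "csv", "user": "alice", "bytes": 910},
]

def condense(event):
    """Keep only the date and the option, dropping every other field."""
    return {"date": date.fromisoformat(event["time"][:10]), "option": event["option"]}

# This list plays the role of the long-retention summary data.
summary = [condense(event) for event in raw_events]

def option_counts(records, start, end):
    """Count how often each option was used between start and end, inclusive."""
    return Counter(r["option"] for r in records if start <= r["date"] <= end)

january = option_counts(summary, date(2024, 1, 1), date(2024, 1, 31))
```

Keeping the date on every condensed record is what allows the search form to restrict the report to an arbitrary period later.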
Any tips/suggestions/etc would be appreciated.
Ideally, since the events are quite cleanly separated, you should feed them into a data aggregation system that can process them in bulk, such as Hadoop. If you want that data store to be updated in real time, you can look into streaming solutions like Spark Streaming.
Since storage in Hadoop would be much larger, you would have no trouble keeping even a year's worth of data in it. You can further enable your users to drill down in an OLAP-like fashion using the Shark framework on top of Spark. Users will be able to filter the data in custom ways and connect conventional BI tools like Tableau for further analysis. Since Shark caches data in memory, performance is quite good: about 5 seconds on hundreds of GB of data.
Big Data consultant
You seem to have done a fair amount of reading. It seems to me that your limiting factor is the retention policy for your data. I don't see a way to maintain a year's worth of data with report acceleration; a summary index makes the most sense. However, you're right about the problems with summary indexes. You will need to do the things you mentioned to avoid duplicate and missing data.
The summary index should be populated with fine-grained statistics or, as I've found helpful, with a subset of the event fields. If you store a subset of the actual data instead of statistics on the data, you have more freedom to change the statistics later.
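To show what "fine-grained" buys you, here is a small Python sketch that rolls daily per-option counts up into monthly totals. The numbers are invented and `roll_up_monthly` is a hypothetical helper; it only illustrates why statistics stored at a fine grain can still be combined into coarser reports later.

```python
from collections import defaultdict

# Hypothetical fine-grained summary: one count per (day, option).
daily_counts = {
    ("2024-01-15", "pdf"): 2,
    ("2024-01-16", "pdf"): 1,
    ("2024-01-16", "csv"): 4,
    ("2024-02-01", "csv"): 3,
}

def roll_up_monthly(daily):
    """Sum daily counts into (month, option) totals."""
    monthly = defaultdict(int)
    for (day, option), count in daily.items():
        month = day[:7]  # "YYYY-MM-DD" -> "YYYY-MM"
        monthly[(month, option)] += count
    return dict(monthly)

monthly_counts = roll_up_monthly(daily_counts)
```

Plain counts add up cleanly like this, but a statistic such as the number of distinct users per day cannot be summed into a correct monthly figure. That is one case where storing a subset of the event fields, rather than the statistic itself, leaves more room to change the report later.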