This is more of a question about the "right" way of doing things versus what is possible.
I want to know if there is anything I am forgetting or not considering that will make the following solution problematic. I have never seen this documented or discussed in any Splunk documentation, apps, or forums, so I wanted to find out whether there is a reason for its absence that I don't know about.
The scenario I have is the need to handle a large set of sensor data (> 15 fields) from thousands of endpoints (i.e., GB of data per day). The sensor data is periodically sampled, and I almost always look at averages, minimums, maximums, and weighted averages in 5-minute intervals.
This seems like a good place to use summary indexing instead of data models/pivot, so that is the path I went down.
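As an illustration only, a summary-populating search of this kind might look like the sketch below. The index name (sensors) and the field names are placeholders, and the search is assumed to be scheduled every 5 minutes with summary indexing enabled.

```spl
index=sensors earliest=-5m@m latest=@m
| bin _time span=5m
| sistats avg(Sensor_Field_1) min(Sensor_Field_1) max(Sensor_Field_1) by _time Endpoint_Name
```

Reporting against the summary index would then repeat the same aggregations with the regular stats command. The saved search name (sensor_summary_5m) is hypothetical:

```spl
index=summary search_name="sensor_summary_5m"
| stats avg(Sensor_Field_1) min(Sensor_Field_1) max(Sensor_Field_1) by _time Endpoint_Name
```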
The issue I have is that a lot of disk space is wasted because of how the summary aggregation fields (psrsvd_*) from sistats are written to a summary index in the format "Field=Value". In some cases, I actually see errors because the _raw field is too big (if I compute avg, min, and max on all sensor fields).
The solution I devised to get around this (and to be more efficient) is writing the summary data from sistats out as pipe-delimited (|) raw events that look like the following (the numbers represent sistats output for my sensor data).
I then defined a new source type for my summary index that specifies the appropriate field names for the pipe-delimited summary statistics fields (psrsvd_*, etc.).
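A source type like this could be defined with a delimiter-based search-time extraction, roughly as sketched below. The sourcetype name (sensor_summary), the transform name, and the field list are all hypothetical; the field names and their order have to match what the sistats output actually contains.

```ini
# props.conf on the search head
# "sensor_summary" is a hypothetical sourcetype name for the summary events
[sensor_summary]
REPORT-sensor_summary_fields = sensor_summary_delims

# transforms.conf
# Split each raw event on "|" and assign the listed field names in order.
# The names below are illustrative; use the fields present in your sistats output.
[sensor_summary_delims]
DELIMS = "|"
FIELDS = "Endpoint_Name", "psrsvd_ct_Sensor_Field_1", "psrsvd_sm_Sensor_Field_1"
```

One consequence of this design is that the extraction is purely positional: if the populating search changes (a field is added or reordered), the FIELDS list has to be updated to match, and older events will no longer line up with the new list.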
This seems to work fine in terms of retrieving and processing the summary index data, and it saves around 25% of disk space.
So, is this OK to do for a large-scale deployment? Are there other things I need to consider? Is there a better solution that is more maintainable?
If you are getting errors about the size of the events, you could change your TRUNCATE setting. Since it's in your summary index, you'd need to put the value on the search head where you are running the search from (probably for the "stash" sourcetype):
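A minimal sketch of such a stanza, assuming the summary events really do use the "stash" sourcetype. The value shown is an arbitrary example rather than a recommendation (the default is 10000 bytes, and 0 disables truncation):

```ini
# props.conf on the search head that runs the summary-populating search
[stash]
# Maximum event length in bytes before the event is truncated
TRUNCATE = 100000
```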
That might solve your truncation issue, but the bigger issue is that the format does waste space. If you have good extractions using the format you have, you're fine. The advantage of using the sistats command and the field names that it generates is that it masks the summarization process a bit more for people who aren't familiar with it, by preserving the field names.
You might also want to look at using tscollect and tstats instead; tscollect works similarly to summary indexing but writes the data to indexed fields (in tsidx files).
In addition, there's a Splunk Answers post that addresses a somewhat similar question:
It points to this document:
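As a rough sketch of that approach, assuming a hypothetical namespace called sensor_ts and placeholder index and field names, the collecting search could look like this:

```spl
index=sensors
| fields _time Endpoint_Name Sensor_Field_1
| tscollect namespace=sensor_ts
```

and the reporting search would read from the namespace with tstats:

```spl
| tstats avg(Sensor_Field_1) FROM sensor_ts BY _time Endpoint_Name span=5m
```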
Worth a read if you haven't checked it out already.
Thank you for the detailed response. I will look into your suggestions today. Will try to come back with findings/questions ASAP.
Based on my understanding, tscollect/tstats is not really meant to support backfilling, i.e., you should do your initial tscollect query with the intention that it will capture all of the relevant data you need at that time. Is that correct?
Regardless, I think I have stumbled onto another solution thanks to the documentation for tstats.
At first I disregarded using Pivot/Datamodels because of the limitations on the PERIOD option (e.g., 1 minute, 1 hour, 1 day, etc.). Now that I know tstats can access datamodels directly, is it possible to define a datamodel for my sensor data, accelerate that datamodel, and then use tstats to pull out the relevant data from the summarized datamodel at the time span I want, e.g., 5 minutes instead of the 1 minute I would get with the pivot command? The command I have in mind would be like the following:
| tstats summariesonly=t prestats=t avg(Sensor_Field_1) as Sensor_Field_1 FROM datamodel=mydm BY _time Endpoint_Name span=5m | timechart span=5m avg(Sensor_Field_1) by Endpoint_Name
The other benefit with this solution is the datamodel will sit on the indexers in the distributed environment and I won't have to maintain nearly as much in terms of savedsearches, search head access, etc. The Pivot/Datamodel framework would handle all of that.
tscollect would work similarly to summary indexing, where you'd expect to have to backfill manually (or at least with a script). Using accelerated reports or accelerated data models will take care of the backfill for you.
If you are concerned about storage, make sure to keep an eye on the data model's storage utilization. You can choose the time range to accelerate over, so if you only report on activity over, say, the last month, only accelerate that time period. That will reduce the overall storage footprint.
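For example, limiting acceleration to roughly the last month could be expressed in datamodels.conf along these lines. This is a sketch that assumes the data model is named mydm, as in the tstats command above; the same setting is also available as the summary range in the data model's acceleration settings in the UI.

```ini
# datamodels.conf in the app that owns the data model
# The stanza name is the data model name ("mydm" is taken from the example search)
[mydm]
acceleration = true
# Only build and keep acceleration summaries for about the last month
acceleration.earliest_time = -1mon
```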
You've already done the work of building the summary data, and now you've got the data model. The good news is that you can enable and disable acceleration of the data model. So you might try testing it out and see which one performs better for you. All things being equal, the data model sounds like a better solution, since any backfilling will be taken care of for you.
It looks like the datamodel is going to be the way to go for now. Trying to manage the summary data is going to be a pain at this scale. Thanks for all of your help.