Hi - I'm trying to summary index a query that gives me a range of distinctive errors happened over the last 30 days, with the following SI query:
sourcetype="errorEvents" earliest=-30d@d latest=@d| DEDUP errorCode | SISTATS count as ErrNum max(_time) as _time
I'm searching a large amount of data that will have duplicates from time to time, so DEDUP is required to get the distinctive data count. My problem is when I run the query using STATS over the past 30 days, I'll get something like 10,000 counts, but when I run it from the summary-indexed results (i.e. index="summary" index_type="30DaysErrorEvents" | stats count as ErrNum MAX(_time) as _time), it returns something like 80,000 counts. I have scheduled the summary index to run every day, so every day's result will be based on the previous 30 days (with errorCode dedup'ed).
Can someone give me some details and difference between the SISTATS and STATS command and why I'm getting different results? By the way, the MAX(_time) doesn't seem to work on SISTATS - it still uses the actual date when the count start (30 days ago) as the summary index date, so the summary-indexed data is always 30 days off.
Thanks in advance!
You are seeing higher counts when referencing the summary index because insufficient information is persisted to deduplicate the common errorCodes stored across different days. Specifically, all you're storing per day is the count of distinct errorCode values and the maximum time. Taking the sum of all of these isn't going to give you what you want.
You should store in your summary something like:
sourcetype="errorEvents" | sistats dc(errorCode) max(_time)
You can then search the summary:
index=summary source=30DaysErrorEvents | stats dc(errorCode) as ErrNum max(_time) as _time
We only need the distinctive error count over the past 30 days, storing the distinctive error codes generated every day and DC it over the past 30 days will make the final query takes longer and storage-wise may not be the best.
If that's the case, you should not be using sistats, since it is intended for aggregating (non-overlapping) distinct summaries. If all you want to do is store a daily number, use stats. But be aware that you will not be able to get the counts e.g., for a week or a month's worth of data, which sistats would allow.
I would imagine that your problem is that your summarization is wrong, because you are overlapping every time your run it. You should not run your summary over the past 30 days every day. You should run it daily for the past day, without overlap. I am also not sure why you are recording max(_time), and more importantly why you are overwriting _time with it.
Thanks for your reply. To clarify:
The reason I wanted to search over 30 days instead of summarizing day-by-day is we actually want DISTINCTIVE results over 30 days, not on a day-by-day basis (for example, if error code 1234 shows up on both day 1, 5, and 9, it will be reported as happened once, NOT 4 times over the past 30-day).
Because we are running report over the past 30 days, we want the end result to associate with the "end date" of the data set rather than the start date. Overwriting time with MAX(time) seems to get the result we wanted when using STATS, but not SISTATS.
That's not how sistats works, and it seems like you don't need it. If you were to use sistats, you should run
sistats dc(errorCode) or
sistats count by errorCode, and you should run it daily, without overlapping. you can then report using
stats dc(errorCode) or
stats count by errorCode | stats count (depending which sistats you summarized with). You should not need to include or change _time.
Incidentally I gave a presentation at the Splunk users conference about how to use the si- commands, and hopefully the audio and slides will be posted some time soon.
Thanks, I'll just switch to STATS instead. By the way, efficiency-wise (storage, search, speed...), are there any disadvantages indexing results from STATS?