I am building up summary indexing for my reports, and while everything is working fine, I have some questions:
1°) How does sistats median()/dc(field) work? I can't find the algorithm documented anywhere, and it clearly doesn't store every distinct value of the field. I can't find precise documentation on how these measures are computed (e.g. per day) and aggregated (e.g. per month). (I have checked the doc "Use summary indexing", but it gives only a rough description of the algorithm.)
2°) How does the overlap command work? I understand that it finds redundant or missing events in a summary index, but what does that mean exactly? (I have read the doc "Configure summary indexing".) My trouble is understanding how Splunk knows whether events are missing (how can it tell that events haven't been indexed?).
E.g.: I have a search that runs every 5 minutes and uses sistats to sum everything up into a summary index. Is there a chance that I run into overlapping or missing events (other than when splunkd goes down and/or the search takes longer than the scheduled time range, 5 minutes here)?
Does anyone have info on this? I am currently seeing weird behavior with sistats dc(). When I compare it with a dc() that does not use the summary index, I get discrepancies. I investigated, and when I run values(field), some values are clearly missing from the summary index. I really don't know how that's possible (I have run the fillsummaryindex.py script, so this shouldn't come from a lack of summarizing).
When using sistats with the dc() and median() functions, you have to be careful if you have lots of distinct values. The reason is that sistats stores these unique values in the expanded stats data, which is what allows the broader-time statistics to be calculated after the fact.
For example, let's do a simple search using sistats and use these functions on _time, which will have lots of distinct values:
index=_internal earliest=-15m | sistats dc(_time), median(_time) by host
If you take a look at the results, you'll see the field psrsvd_vm__time, which stores the meta-info for dc(), and psrsvd_rd__time, which stores the meta-info for median(). Both of these contain lots of values.
Thus, if you have lots of distinct values, the sistats meta-data will give you a very large summary index. I've seen summary indexes grow very large when summarizing millions of events with a high percentage of unique values stored by sistats. Because of this, you might want to look into a different way of counting things if you can.
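To illustrate why the values themselves have to travel with the summary, here is a minimal Python sketch. It is not Splunk's actual implementation, and the per-day data is made up. It only shows that per-period distinct counts cannot simply be added up to get the count for a longer period, while merging the per-period value sets gives the right answer.

```python
# Hypothetical per-day sets of values seen for one field (illustrative data only).
daily_values = {
    "day1": {"alice", "bob", "carol"},
    "day2": {"bob", "carol", "dave"},
    "day3": {"alice", "erin"},
}

def naive_dc(per_day):
    """Sum the per-day distinct counts (wrong: values repeated across days are double-counted)."""
    return sum(len(values) for values in per_day.values())

def merged_dc(per_day):
    """Union the per-day value sets, then count (correct, but needs the values themselves)."""
    merged = set()
    for values in per_day.values():
        merged |= values
    return len(merged)

naive_total = naive_dc(daily_values)    # 3 + 3 + 2 = 8
merged_total = merged_dc(daily_values)  # alice, bob, carol, dave, erin = 5
```

The gap between the two totals is the reason a summary that keeps only a number per period cannot answer dc() over a longer range.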
In regards to overlap, that command isn't suggested for summary indexing. There are a few tricks you can use to keep from getting overlap when storing summary data. If you don't really have transactions in your data (e.g. event A.1 occurs, then event A.2 occurs 1 minute later), the suggested configuration is to build an offset into your earliest/latest times on the scheduled search page, and to use the @ character to snap times to the top of the minute/hour/etc. For example, let's say I have my summary search scheduled to run at the top and bottom of the hour via cron scheduling. Then I might make my earliest time -35m@m and my latest time -5m@m. This gives data coming from the forwarder 5 minutes to get across the wire and be indexed.
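As a rough check of that arithmetic, the following Python sketch (not Splunk code) computes the search window for a few consecutive runs. The run times are hypothetical, and the window is treated as half-open (earliest included, latest excluded), which is an assumption here. Because both bounds are snapped to the minute, each window starts exactly where the previous one ended, even when the scheduler fires a few seconds late.

```python
from datetime import datetime, timedelta

def summary_window(run_time, earliest_offset_min=35, latest_offset_min=5):
    """Return (earliest, latest) for one run, mimicking -35m@m and -5m@m."""
    def offset_and_snap(minutes):
        shifted = run_time - timedelta(minutes=minutes)
        return shifted.replace(second=0, microsecond=0)  # the "@m" snap
    return offset_and_snap(earliest_offset_min), offset_and_snap(latest_offset_min)

def is_contiguous(windows):
    """True if every window starts exactly where the previous one ended."""
    return all(prev[1] == nxt[0] for prev, nxt in zip(windows, windows[1:]))

# Hypothetical run times: scheduled at :00 and :30, started a few seconds late.
runs = [
    datetime(2024, 1, 1, 10, 0, 3),
    datetime(2024, 1, 1, 10, 30, 7),
    datetime(2024, 1, 1, 11, 0, 1),
]
windows = [summary_window(r) for r in runs]
# 09:25-09:55, 09:55-10:25, 10:25-10:55: no gaps and no overlap
```

Without the snap, the few seconds of scheduler delay would shift each window slightly, which is one way small gaps or overlaps can creep in.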
Another, more complicated scenario is where you actually have transactions. In this case you typically only want to summarize transactions where you have both a start and an end event logged. We need to assume that a transaction completes within a certain time, say 5 minutes. Let's create an example where we are summarizing the average duration of transactions by company. We'll assume a sliding 5-minute window (which corresponds to the 300 in the search below), and we'll assume that a transaction completes within 5 minutes. We'll schedule this search to run every 5 minutes, with an earliest time of -15m@m and a latest time of -5m@m, so that we build in 5 minutes for data to get into the index, 5 minutes for our window, and 5 minutes for the transaction to complete. Our summary search is then:
index=foo | stats earliest(_time) as start_time, latest(_time) as end_time by my_unique_identifier, company_name | eval duration=end_time-start_time | addinfo | where start_time < (info_min_time+300) | sistats avg(duration) by company_name
The important lines to create the sliding window are the addinfo and where commands - essentially we're only including events that start within the first 5m (300 seconds) of the lower time bound of our search (info_min_time).
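The effect of that filter can be sketched in Python (again not Splunk code, with hypothetical epoch-style run times). With a run every 300 seconds over the range -15m to -5m, keeping only transactions that start in the first 300 seconds of the range means every transaction is picked up by exactly one run.

```python
WINDOW = 300  # seconds; matches the 300 in the search above

def counted_in_run(start_time, info_min_time, info_max_time, window=WINDOW):
    """True if a transaction starting at start_time is kept by this run."""
    in_search_range = info_min_time <= start_time < info_max_time
    return in_search_range and start_time < info_min_time + window

def runs_counting(start_time, run_times):
    """List the run times whose filter keeps the transaction."""
    kept = []
    for now in run_times:
        # Equivalent of earliest=-15m@m and latest=-5m@m for this run
        info_min_time, info_max_time = now - 900, now - 300
        if counted_in_run(start_time, info_min_time, info_max_time):
            kept.append(now)
    return kept

# Hypothetical run times in seconds, one run every 5 minutes.
run_times = [3000, 3300, 3600, 3900]

first = runs_counting(2500, run_times)   # kept only by the run at 3300
second = runs_counting(2700, run_times)  # kept only by the run at 3600
```

A transaction that starts in the second half of a run's range is skipped by that run and picked up by the next one, by which time its end event has had 5 minutes to arrive.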
Thanks very much for the insight. I am in exactly the situation where I have a lot of unique values, so my summary index will grow quite fast. I have inspected the fields "psrsvd_rd_myfield" and "psrsvd_vm_myfield", but they don't look like they store all the distinct values (I have something like 40,000 distinct values, but I can see only around 2,000 in these fields). That's why I asked about the logic/algorithm behind it.
Regarding summary indexing and the scheduled time for transactions, I am exactly in the case you describe (with transactions), and I came up with the same approach you have given here (schedule a search to run every X minutes, ending 5 minutes before the current time, using addinfo and a marker for the start and end of each transaction). I have also improved the search a bit by putting the info_min_time comparison in a macro with parameters, so I can change the "offset" (300 here).