Hi - I need to run a query that counts unique values in a large dataset (>1 million events) over a 30-day period, for example:
sourcetype="largeDataset" earliest=-30d@d latest=@d| dedup uniqueValueColumn| stats count as valueCount
This large dataset changes every hour, and we have scheduled this search and summary indexed it (using sistats) on a daily basis. Right now the query takes about 2 hours to run, but as the dataset gets larger, we expect the execution time to get longer and longer. We can't simply index a unique value count on a day-to-day basis, since every day there will be values repeated from previous days, and we want a unique count over the full 30 days.
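Roughly, the scheduled search that feeds the summary index today is the sistats form of the query above (a sketch; the real field and search names differ):
sourcetype="largeDataset" earliest=-30d@d latest=@d | sistats dc(uniqueValueColumn)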
Instead of running the same query every day and summary indexing the resulting count, I was thinking about building a multi-level summary index and getting the result from the summary indexes. Does anyone have a good indexing strategy for this scenario?
This is rather an open-ended question; any input is welcome.
You can certainly use sistats dc(uniqueValueColumn), but I'd probably make the search more explicit (and more condensed) with this daily summary:
sourcetype="largeDataset" earliest=-1d@d latest=@d | stats count by uniqueValueColumn
Then the search to report on this would be:
index=summary source=my_summary | stats count by uniqueValueColumn | stats count as valueCount
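If you want the report pinned explicitly to the trailing 30 days, you would add the time range to the summary search, something like:
index=summary source=my_summary earliest=-30d@d latest=@d | stats count by uniqueValueColumn | stats count as valueCount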
We currently schedule the search to run every night, but it takes a very long time to finish and puts a heavy workload on the Splunk server. If there is no other way to solve this, I guess that will be our only option if we decide to stick with Splunk for this report.
If you have a good partition on these records, you can delete them that way. For example, run "index=summary source=my_search a* | delete", and repeat for "b*", "c*", etc. As for retrieving more than 500k records for the same second, we could either run the summary more often, or tweak the timestamps so that these >500k events are spread over several seconds rather than just one. I am concerned, however, that if we have this many records for one day, the summary will be suboptimal. We may want to just consider scheduling a nightly search that reports over the previous 30 days.
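For instance, an hourly version of the same summary would just be a tighter window (a sketch, using the same placeholder names as above):
sourcetype="largeDataset" earliest=-1h@h latest=@h | stats count by uniqueValueColumn
and the nightly 30-day report would essentially be the original search, e.g. expressed as a distinct count:
sourcetype="largeDataset" earliest=-30d@d latest=@d | stats dc(uniqueValueColumn) as valueCount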
Stephen, using your query does save the unique values each day. I have the search run and indexed once every day. However, since we have over 500,000 unique daily values, I run into the "IndexScopedSearch" error when searching the summary index; the search fails because there are more than 500,000 records at one exact millisecond. Now I need to figure out a way to delete those records (the data was not put in a separate index, so I can't just use the clean eventdata command) and think of another way to index the data.
Note that we're not storing a daily count summary. We're storing, in the most condensed fashion possible, the list of values seen each day. The second search's stats count by uniqueValueColumn has the net effect of performing that dedup across days, and the stats count at the end counts the deduplicated values.
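To make that concrete: if the same value shows up on three different days, the summary index holds three rows for it (one per daily run), but stats count by uniqueValueColumn in the reporting search collapses those into a single row, so the final stats count includes that value exactly once.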
Thanks Stephen - as mentioned in the original question, I can't do a daily count summary like this because the value has to be unique across 30 days, not just unique per day. For example, the value 'abc' can appear on days 1, 2, and 3. If we dedup over the past 30 days, 'abc' will be counted only once; however, if we dedup on a daily basis, 'abc' will be counted in the summary index for day 1, day 2, and day 3, and when we add up the total unique value count over the past 30 days, 'abc' will be counted three times.