
Summary Indexing Strategy on a large data set

clincg
Path Finder

Hi - I need to run a query that counts unique values in a large data set (>1 million events) over a 30-day period, for example:

sourcetype="largeDataset" earliest=-30d@d latest=@d| dedup uniqueValueColumn| stats count as valueCount

This large dataset changes every hour, and we have scheduled and summary indexed this search (using sistats) on a daily basis. Right now the query takes about 2 hours to run, but as the dataset grows we expect the execution time to get longer and longer. We can't simply index a unique value count on a day-by-day basis, since every day there are values that repeat from previous days, and we want a unique count over 30 days.
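The scheduled summary-populating search is roughly of this form (a sketch with sistats and the placeholder sourcetype and field names from the example above):

sourcetype="largeDataset" earliest=-30d@d latest=@d | sistats dc(uniqueValueColumn)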

Instead of running the same query every day and summary indexing the resulting count, I was thinking about doing a multi-level summary index and getting the result from the summary indexes. Does anyone have a good indexing strategy for this scenario?

This is a rather open-ended question; any input is welcome.

1 Solution

Stephen_Sorkin
Splunk Employee

You can certainly use sistats dc(uniqueValueColumn), but I'd probably make the search more explicit (and more condensed) with this daily summary:

sourcetype="largeDataset" earliest=-1d@d latest=@d | stats count by uniqueValueColumn

Then the search to report on this would be:

index=summary source=my_summary | stats count by uniqueValueColumn | stats count as valueCount
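For comparison, the sistats variant mentioned above would look roughly like this daily summary (same placeholder names):

sourcetype="largeDataset" earliest=-1d@d latest=@d | sistats dc(uniqueValueColumn)

with the report over the summary then being:

index=summary source=my_summary | stats dc(uniqueValueColumn) as valueCount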

clincg
Path Finder

We currently schedule the search to run every night, but it takes a very long time to finish and puts a lot of load on the Splunk server. If there is no other way to solve this, I guess that will be our only option if we decide to stick with Splunk for this report.


Stephen_Sorkin
Splunk Employee

If you have a good partition on these records, you can delete them that way. For example, "index=summary source=my_search a* | delete", and repeat for "b*", "c*", etc. As for retrieving more than 500k records for the same second, we could either run the summary more often, or tweak the timestamps so that those >500k events are spread over several seconds instead of just one. I am concerned, however, that if we have this many records for one day, the summary will be suboptimal. We may want to just schedule a nightly search that reports over the previous 30 days.
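As a sketch of the timestamp-tweaking idea, the daily summary search could give each row its own second within the summarized day, for example:

sourcetype="largeDataset" earliest=-1d@d latest=@d | stats count by uniqueValueColumn | streamstats count as row | eval _time = relative_time(now(), "-1d@d") + (row % 86400)

And the nightly 30-day report, if we go that route, would simply be:

sourcetype="largeDataset" earliest=-30d@d latest=@d | stats dc(uniqueValueColumn) as valueCount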


clincg
Path Finder

Stephen, using your query does save the unique values each day. I have the search run and summary indexed once every day. However, since we have over 500,000 unique daily values, I run into the "IndexScopedSearch" error when searching the summary index; the search fails because there are more than 500,000 records at one exact millisecond. Now I need to figure out a way to delete those records (the data was not put in a separate index, so I can't just wipe it with the clean eventdata command) and think of another way to index the data.


Stephen_Sorkin
Splunk Employee

Note that we're not storing a daily count summary. We're storing, in the most condensed fashion possible, the list of values seen each day. The second search has the net effect of performing that dedup, and the stats count at the end counts the deduplicated values.
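A tiny self-contained illustration of that effect, using makeresults to fabricate three daily rows (two days of "abc", one day of "xyz"):

| makeresults | eval raw="abc,day1 abc,day2 xyz,day1" | makemv delim=" " raw | mvexpand raw | eval uniqueValueColumn=mvindex(split(raw,","),0) | stats count by uniqueValueColumn | stats count as valueCount

The first stats collapses the two "abc" rows into one, so valueCount comes out as 2, i.e. the deduplicated count across days.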


clincg
Path Finder

Thanks Stephen - as mentioned in the original question, I can't do a daily count summary like this because the values have to be unique across 30 days, not unique per day. For example, the value "abc" can appear on days 1, 2, and 3. If we dedup over the past 30 days, "abc" is counted only once. However, if we dedup on a daily basis, "abc" is counted in the summary index for day 1, day 2, and day 3, and when we add up the total unique value counts over the past 30 days, "abc" is counted 3 times.
