Create these accelerated reports, then roll up with "| stats dc(cookie)" over the time range in the time picker:
search = sourcetype=iis | bucket span=1m _time | stats count by _time cookie
search = sourcetype=iis | bucket span=1h _time | stats count by _time cookie
search = sourcetype=iis | bucket span=1d _time | stats count by _time cookie
search = sourcetype=iis | bucket span=1w _time | stats count by _time cookie
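Each of these reports produces (_time, cookie, count) rows at its span; the report reading them just appends the rollup. For the hourly variant, the end-to-end search is effectively:

```
sourcetype=iis
| bucket span=1h _time
| stats count by _time cookie
| stats dc(cookie)
```

Since the cookie values themselves survive into the rollup stage, the final dc(cookie) over the picked time range stays exact; the acceleration only saves re-reading raw events.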
Using the accelerated data model directly:
dc(cookie) on 1 sec of data - instant
dc(cookie) on 1 hr of data - 3 min
dc(cookie) on 1.5 hr of data - I ran out of patience
dc(cookie) over all time - didn't even attempt
Computing hour, cookie, count, then dedup:
dc(cookie) over all time - I ran out of patience
Computing minute, cookie, count, then dedup consecutive=true:
dc(cookie) over all time - 7 minutes
(All time = 14 days.)
What kind of data structures are these?
An accelerated data model.
How can I do a quick estimate?
I'm guessing your cookies are extremely high-cardinality, i.e., there is a large number of distinct values. Computing a dc() over that is a very high-load task for a data model: it has to keep (probably) millions of distinct values around, and for each new value it has to check whether it has seen that value before. That's a nightmare to compute accurately.
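To answer the quick-estimate question above: Splunk's estdc() function computes an approximate distinct count with bounded memory, and tstats can run it against the accelerated model directly. A sketch, assuming a data model named Web with a cookie field (names are placeholders, not from this thread):

```
| tstats estdc(Web.cookie) AS approx_distinct_cookies from datamodel=Web
```

estdc() trades a small, documented error margin for dramatically lower memory and merge cost, which is usually the right trade at this cardinality.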
Yeah, this should scale out to multiple indexers, and probably (I haven't checked myself yet) also to multiple search pipelines on one indexer (a 6.3 feature).
Before adding tons of hardware, you should spend some time (and maybe money) figuring out whether your current data model / search is the best way to answer your core question - determining the best approach comes one step before figuring out if you can make the chosen approach faster.
It is running single-core, CPU-bound on Splunk 6.3. I am using the auto (750 MB) bucket size, not the 10 GB auto_high_volume size, so I think inter-bucket parallelization should be possible. Simply adding up hourly dc(cookie) values would count a 24-hour-long session 24 times in the worst case. How can I run a parallel sort-merge-uniq in Splunk?
This is your starting point for parallelizing things on a single box: http://docs.splunk.com/Documentation/Splunk/6.3.0/Capacity/Parallelization
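Among other things, that page covers the 6.3 batch-search pipeline setting in limits.conf; a sketch (setting name per the 6.3 docs, the value is illustrative only and costs extra CPU/memory per search):

```
# limits.conf on the indexer - allow multiple pipelines for batch searches
[search]
batch_search_max_pipeline = 2
```

Check the linked page for the full list of knobs and their hardware prerequisites before turning any of them up.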
Your described hourly distinct count would be something like
... | timechart span=1h dc(cookie) or a row split by time in pivot/datamodel terms.
The cookies have a 1-year lifetime. If I set the query time range to 1 year, it should saturate a parallel sort-merge-uniq on any cluster that I can afford. Even C++ may not be fast enough.
I have seen some Splunk apps that append new distinct keys to a CSV file every 5 to 10 minutes. How can I write that?
Yeah, doing one year of cookies accurately in one go is nigh-on impossible - even n log(n) becomes less than fun at that scale.
Here's a traditional approach of keeping state in CSV files: http://blogs.splunk.com/2011/01/11/maintaining-state-of-the-union/
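The pattern from that post, adapted to this thread's cookie question, is a scheduled search that merges newly seen cookies into a lookup. A sketch, run every 5-10 minutes over the last 5-10 minutes (seen_cookies.csv is a hypothetical lookup name):

```
sourcetype=iis
| stats count by cookie
| fields cookie
| inputlookup append=t seen_cookies.csv
| dedup cookie
| outputlookup seen_cookies.csv
```

The running distinct total is then just `| inputlookup seen_cookies.csv | stats count`. Note the CSV grows with cardinality, so at millions of cookies the merge itself gets expensive.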
Depending on your actual use case, you may be better off with precomputing chunks of distinct counts and storing them in a summary. Summing that up will give you higher numbers than reality, but if you're only looking for trends then that'd be fine. If you need more accurate numbers you could compare the chunked numbers with real numbers computed for a longer but still manageable time range, and apply that correction to your chunked data from then on.
It'll work, it just depends on what your actual requirements are. All we're doing here is throwing technical solutions around, it's impossible to tell what works best for your use case without knowing your use case.
Yes, in principle smaller will always be faster.
However, in this case you won't see large gains - the issue is cardinality, not data model size.
This should imho be solved through dropping the accuracy requirement and computing chunks, e.g. build a distinct count each hour and store that in a summary, letting your reports read the summaries only. You can attempt to improve accuracy by estimating how many duplicate cookies you can expect.
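A sketch of that chunked approach using summary indexing via collect (cookie_summary is a hypothetical summary index that would need to exist):

```
sourcetype=iis
| bucket span=1h _time
| stats dc(cookie) AS hourly_dc by _time
| collect index=cookie_summary
```

Reports then read only the summary, e.g. `index=cookie_summary | stats sum(hourly_dc) AS approx_total`. As noted above, a cookie active in N hours is counted N times, so the sum overshoots reality and is best used for trends or corrected by a calibration factor.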
I know, that's why you'd need to calculate a rough conversion factor - and chunked data would be most useful for trends, not for precise absolute numbers. In return it'd be much much cheaper to compute.
Instead of chunking, you could sample - e.g. drop the last few chars of your cookie and calculate a conversion factor from this lower number to the real number. With every hex char dropped you reduce cardinality by a factor of 16.
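A sketch of that truncation idea in SPL, dropping the last two hex chars (the *256 correction is an assumption that the trailing chars are uniformly random and the keyspace is densely used; otherwise calibrate the factor against an exact count over a manageable window):

```
sourcetype=iis
| eval cookie_short=substr(cookie, 1, len(cookie)-2)
| stats dc(cookie_short) AS reduced_dc
| eval estimated_dc = reduced_dc * 256
```

The dc() still sees every event, but the distinct set it must hold is ~256x smaller, which is where the memory and merge savings come from.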
In a shell script I can probably do a parallel sort-merge in O(n log n) time with linear speedup by adding CPU cores. Can I parallelize this query in Splunk? The hash of the cookies should work as the key for the parallel map-reduce operation.
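The map-reduce shape described above can be expressed in SPL by sharding on a cookie prefix; since the shards partition the cookie space, the per-shard distinct counts sum to the exact total. A sketch:

```
sourcetype=iis
| eval shard=substr(cookie, 1, 1)
| stats dc(cookie) AS shard_dc by shard
| stats sum(shard_dc) AS total_dc
```

The by-shard stats splits one huge distinct-set merge into 16 independent smaller ones; whether that beats a plain dc() in practice depends on where your merge bottleneck actually sits, so measure before committing to it.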