Monitoring Splunk

Why does our data model's "distinct count(cookie) AS UniqueVisitor" take so long?

hylam
Contributor

17 GB of IIS log files, with a 2.5 GB, 100% accelerated data model. 16 cores, 8 GB RAM (2 GB free). The pivot ran single-core CPU-bound; disk activity was minimal. Any ideas? Thanks.

hylam
Contributor

Create these accelerated reports, then roll up with "| stats dc(cookie)" over the time range in the time picker:

[cookieMinute]
search = sourcetype=iis | bucket span=1m _time | stats count by _time cookie
[cookieHour]
search = sourcetype=iis | bucket span=1h _time | stats count by _time cookie
[cookieDay]
search = sourcetype=iis | bucket span=1d _time | stats count by _time cookie
[cookieWeek]
search = sourcetype=iis | bucket span=1w _time | stats count by _time cookie
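The key property of these reports is that "count by _time cookie" keeps one row per (bucket, cookie) pair, so a dc(cookie) rollup over the summary rows is still exact - it reads far fewer rows than the raw events without losing any distinct cookies. A minimal sketch of that property in Python (synthetic toy data, names invented for illustration):

```python
from collections import Counter

# toy event stream: (minute_bucket, cookie), with repeats within a bucket
events = [(0, "a"), (0, "a"), (0, "b"), (1, "a"), (1, "c"), (2, "b")]

# the "accelerated report": count by (minute, cookie) - one row per pair
per_minute = Counter(events)

# the rollup: dc(cookie) over the summary rows only
dc_from_summary = len({cookie for (_, cookie) in per_minute})

# ground truth computed from the raw events
dc_raw = len({cookie for (_, cookie) in events})

assert dc_from_summary == dc_raw == 3
```

Summing per-bucket dc values would overcount a cookie seen in several buckets; deduplicating the (bucket, cookie) pairs first, as these reports do, avoids that.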

hylam
Contributor
  • using accelerated data model directly
    dc(cookie) on 1 sec of data - instant
    dc(cookie) on 1 hr of data - 3 min
    dc(cookie) on 1.5 hr of data - I ran out of patience
    dc(cookie) all time - didn't even attempt

  • computing hour,cookie,count then dedup
    dc(cookie) on all time - I ran out of patience

  • computing minute,cookie,count then dedup consecutive=true
    dc(cookie) on all time - 7 minutes

all time = 14 days

What kind of data structures are these?
accelerated data model
tsidx

How can I do a quick estimate?
https://www.google.com/search?q=distinct+value+estimation
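For a quick estimate, Splunk's stats command also offers estdc(), an approximate distinct count. The idea behind such estimators can be sketched outside Splunk - below is a minimal HyperLogLog-style estimator in Python (everything here is a synthetic illustration, not Splunk's actual implementation):

```python
import hashlib
import math

def hll_estimate(items, b=10):
    """HyperLogLog-style distinct count estimate.

    b index bits -> m = 2**b registers; relative error ~ 1.04/sqrt(m).
    Memory is m bytes regardless of cardinality - that's the whole point.
    """
    m = 1 << b
    registers = [0] * m
    for item in items:
        h = int.from_bytes(hashlib.md5(item.encode()).digest()[:8], "big")
        idx = h & (m - 1)                         # low b bits pick a register
        rest = h >> b                             # remaining 64-b bits
        rank = (64 - b) - rest.bit_length() + 1   # leading-zero run + 1
        registers[idx] = max(registers[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:                  # small-range correction
        return m * math.log(m / zeros)
    return raw

# 100,000 distinct synthetic cookies; estimate lands within a few percent
est = hll_estimate(f"cookie{i}" for i in range(100_000))
```

With 1024 one-byte registers this tracks millions of cookies in ~1 KB of state, which is why approximate dc scales where exact dc does not.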

hylam
Contributor

dc(cookie) from indexed data also took 7 min.

martin_mueller
SplunkTrust

I'm guessing your cookies are extremely high-cardinality, i.e. there is a large number of distinct values. Computing a dc() over that is a very high-load task for a data model. It has to keep (probably) millions of different values around and, for each new value, check whether it has seen that value before. That's a nightmare to compute accurately.
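To put a rough number on that cost: a plain hash set - the natural exact-dc data structure - already spends tens of megabytes on its table alone for a million short values, before counting the strings themselves (Python used purely as an illustration):

```python
import sys

seen = set()
for i in range(1_000_000):
    cookie = f"cookie{i}"
    if cookie not in seen:   # every single event pays this membership check
        seen.add(cookie)

# size of the set's hash table itself, excluding the string objects it holds
table_mib = sys.getsizeof(seen) / 2**20
print(f"{table_mib:.0f} MiB of table for {len(seen):,} distinct values")
```

And unlike the sketch-based estimators, this memory grows linearly with cardinality, which is what makes exact dc over a year of cookies so painful.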

martin_mueller
SplunkTrust

Yeah, this should scale out to multiple indexers and probably (I didn't check myself yet) also to multiple search pipelines on one indexer (a 6.3 feature).

Before adding tons of hardware, you should spend some time (and maybe money) figuring out whether your current data model / search is the best way to answer your core question - determining the best approach comes one step before figuring out how to make the chosen approach faster.

hylam
Contributor

It is running single-core CPU-bound on Splunk 6.3. I am using the auto (750 MB) bucket size, not the 10 GB auto_high_volume bucket size, so I think inter-bucket parallelization is possible. Simply adding up hourly dc(cookie) values would count a 24-hour-long session 24 times in the worst case. How can I run a parallel sort-merge-uniq in Splunk?

martin_mueller
SplunkTrust

This is your starting point for parallelizing things on a single box: http://docs.splunk.com/Documentation/Splunk/6.3.0/Capacity/Parallelization

Your described hourly distinct count would be something like ... | timechart span=1h dc(cookie), or a row split by time in pivot/data model terms.

hylam
Contributor

The cookies have a 1-year lifetime. If I set the query time range to 1 year, it should saturate a parallel sort-merge-uniq on any cluster I can afford. Even C++ may not be fast enough.

I have seen some Splunk apps that append new distinct keys to a CSV file every 5 to 10 minutes. How can I write that?

martin_mueller
SplunkTrust

Yeah, doing one year of cookies accurately in one go is nigh-on impossible - even n log(n) becomes less than fun at that scale.

Here's a traditional approach of keeping state in CSV files: http://blogs.splunk.com/2011/01/11/maintaining-state-of-the-union/

Depending on your actual use case, you may be better off precomputing chunks of distinct counts and storing them in a summary. Summing those up will give you higher numbers than reality, but if you're only looking for trends, that's fine. If you need more accurate numbers, you could compare the chunked numbers with real numbers computed over a longer but still manageable time range, and apply that correction factor to your chunked data from then on.
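The overcounting and the correction factor are easy to see on synthetic data - here a deliberate worst case where every cookie is active in every hourly chunk (all names and numbers invented for illustration):

```python
from collections import defaultdict

# synthetic traffic: 100 cookies, each active in all 24 hourly chunks
events = [(hour, f"user{u}") for hour in range(24) for u in range(100)]

per_hour = defaultdict(set)
for hour, cookie in events:
    per_hour[hour].add(cookie)

chunked = sum(len(s) for s in per_hour.values())   # sum of hourly dc values
true_dc = len({c for _, c in events})              # exact dc over the day

# chunked overcounts each cookie once per chunk it appears in: 2400 vs 100
correction = true_dc / chunked

# later, apply the calibrated factor to cheap chunked numbers
estimate = chunked * correction
```

Real traffic sits somewhere between this worst case (every session spans every chunk) and the best case (no session spans a chunk boundary), which is why calibrating the factor against a real exact count matters.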

hylam
Contributor

How about keeping a slowly changing dimension csv?

firstSeenTime,lastSeenTime,cookie

http://blogs.splunk.com/2011/01/11/maintaining-state-of-the-union/
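The merge step for that kind of state is simple - each batch of new events folds into the existing first/last-seen table. A sketch of the logic in Python (the dict stands in for the CSV lookup; all names synthetic):

```python
def merge_state(state, events):
    """Fold (time, cookie) events into {cookie: (first_seen, last_seen)}."""
    for t, cookie in events:
        if cookie in state:
            first, last = state[cookie]
            state[cookie] = (min(first, t), max(last, t))
        else:
            state[cookie] = (t, t)
    return state

state = {}
merge_state(state, [(100, "a"), (200, "b")])   # first 5-minute batch
merge_state(state, [(300, "a")])               # next batch: "a" seen again

# the running distinct count is just the number of rows kept
distinct_cookies = len(state)
```

In the Splunk version from the linked blog post, inputlookup/outputlookup play the role of `state`, and the table only grows by one row per genuinely new cookie.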

martin_mueller
SplunkTrust

It'll work - it just depends on what your actual requirements are. All we're doing here is throwing technical solutions around; it's impossible to tell what works best for your use case without knowing your use case.

hylam
Contributor

The slowly changing dimension approach would not work in a lot of cases, though.

hylam
Contributor

Will it be any faster if I use a minimal number of fields in the data model?

martin_mueller
SplunkTrust

Yes - in principle, smaller is always faster.

However, in this case you won't see large gains: the issue is cardinality, not data model size.
This should, imho, be solved by dropping the accuracy requirement and computing chunks, e.g. building a distinct count each hour and storing that in a summary, letting your reports read only the summaries. You can attempt to improve accuracy by estimating how many duplicate cookies to expect.

hylam
Contributor

"visitor-hours" and "visitors" are different. 24 visitor-hours ! = 24 visitors

hylam
Contributor

Will a summary index holding these work? I would then roll up:
minute sum, hour sum, day sum, week sum, month sum, year sum

martin_mueller
SplunkTrust

I know - that's why you'd need to calculate a rough conversion factor, and why chunked data is most useful for trends rather than precise absolute numbers. In return, it'd be much, much cheaper to compute.

martin_mueller
SplunkTrust

Instead of chunking, you could sample - e.g. drop the last few characters of your cookie and calculate a conversion factor from this lower number to the real number. With every hex character dropped, you reduce the cardinality by a factor of 16.
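A statistically cleaner variant of the same idea is hash sampling: keep only the cookies whose hash ends in one fixed hex character (a deterministic 1/16 sample), count those exactly, and multiply by 16. A sketch with synthetic cookies (all names invented for illustration):

```python
import hashlib

cookies = {f"user{i}" for i in range(16_000)}   # 16,000 distinct, synthetic

def h(cookie):
    return hashlib.md5(cookie.encode()).hexdigest()

# deterministic 1/16 sample: the same cookie always lands in (or out of)
# the sample, so repeats across events don't bias the count
sample = {c for c in cookies if h(c)[-1] == "0"}

estimate = len(sample) * 16   # should land within a few percent of 16,000
```

Because membership depends only on the cookie's own hash, the sampled exact count scales up cleanly, at 1/16 of the memory and sort cost.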

hylam
Contributor

In a shell script I can probably do a parallel sort-merge in O(n log n) time with linear speedup from adding CPU cores. Can I parallelize this query in Splunk? A hash of the cookie should work as the key for the parallel map-reduce operation.
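Hash-partitioning does make dc() embarrassingly parallel, because the partitions are disjoint: per-partition distinct counts can simply be summed, with no merge-dedup step. A serial sketch of the scheme in Python (each partition could run on its own core or indexer; data is synthetic):

```python
import hashlib
from collections import defaultdict

# 50,000 synthetic events over 5,000 distinct cookies (lots of repeats)
cookies = [f"user{i % 5_000}" for i in range(50_000)]

def partition(cookie, n=16):
    """Route a cookie to one of n disjoint partitions by its hash."""
    return int(hashlib.md5(cookie.encode()).hexdigest(), 16) % n

# map: every occurrence of a cookie lands in the same partition
parts = defaultdict(set)
for c in cookies:
    parts[partition(c)].add(c)

# reduce: per-partition dc values sum to the exact global dc
total_dc = sum(len(s) for s in parts.values())
```

Since a given cookie can only ever appear in one partition, the sum is exact, not an estimate - the scheme trades no accuracy for the parallelism.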
