Monitoring Splunk

Why does our data model's "distinct count(cookie) AS UniqueVisitor" take so long?

hylam
Contributor

17 GB of IIS log files, with a 2.5 GB, 100% accelerated data model. 16 cores, 8 GB RAM (2 GB free). The pivot ran single-core CPU-bound; disk activity was minimal. Any ideas? Thanks.

hylam
Contributor

Create these accelerated reports, then roll up with "| stats dc(cookie)" over the time range in the time picker:

[cookieMinute]
search = sourcetype=iis | bucket span=1m _time | stats count by _time cookie
[cookieHour]
search = sourcetype=iis | bucket span=1h _time | stats count by _time cookie
[cookieDay]
search = sourcetype=iis | bucket span=1d _time | stats count by _time cookie
[cookieWeek]
search = sourcetype=iis | bucket span=1w _time | stats count by _time cookie
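The key property of these reports is that "count by _time cookie" keeps one row per (bucket, cookie) pair, so a dc(cookie) rollup over the summary rows is still exact - it reads far fewer rows than the raw events without losing any distinct cookies. A minimal sketch of that property in Python (synthetic toy data, names invented for illustration):

```python
from collections import Counter

# toy event stream: (minute_bucket, cookie), with repeats within a bucket
events = [(0, "a"), (0, "a"), (0, "b"), (1, "a"), (1, "c"), (2, "b")]

# the "accelerated report": count by (minute, cookie) - one row per pair
per_minute = Counter(events)

# the rollup: dc(cookie) over the summary rows only
dc_from_summary = len({cookie for (_, cookie) in per_minute})

# ground truth computed from the raw events
dc_raw = len({cookie for (_, cookie) in events})

assert dc_from_summary == dc_raw == 3
```

Summing per-bucket dc values would overcount a cookie seen in several buckets; deduplicating the (bucket, cookie) pairs first, as these reports do, avoids that.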

hylam
Contributor
  • using accelerated data model directly
    dc(cookie) on 1 sec of data - instant
    dc(cookie) on 1 hr of data - 3 min
    dc(cookie) on 1.5 hr of data - I ran out of patience
    dc(cookie) all time - didn't even attempt

  • computing hour,cookie,count then dedup
    dc(cookie) on all time - I ran out of patience

  • computing minute,cookie,count then dedup consecutive=true
    dc(cookie) on all time - 7 minutes

all time = 14 days

What kind of data structures are these?
accelerated data model
tsidx

How can I do a quick estimate?
https://www.google.com/search?q=distinct+value+estimation
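For a quick estimate, Splunk's stats command also offers estdc(), an approximate distinct count. The idea behind such estimators can be sketched outside Splunk - below is a minimal HyperLogLog-style estimator in Python (everything here is a synthetic illustration, not Splunk's actual implementation):

```python
import hashlib
import math

def hll_estimate(items, b=10):
    """HyperLogLog-style distinct count estimate.

    b index bits -> m = 2**b registers; relative error ~ 1.04/sqrt(m).
    Memory is m bytes regardless of cardinality - that's the whole point.
    """
    m = 1 << b
    registers = [0] * m
    for item in items:
        h = int.from_bytes(hashlib.md5(item.encode()).digest()[:8], "big")
        idx = h & (m - 1)                         # low b bits pick a register
        rest = h >> b                             # remaining 64-b bits
        rank = (64 - b) - rest.bit_length() + 1   # leading-zero run + 1
        registers[idx] = max(registers[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:                  # small-range correction
        return m * math.log(m / zeros)
    return raw

# 100,000 distinct synthetic cookies; estimate lands within a few percent
est = hll_estimate(f"cookie{i}" for i in range(100_000))
```

With 1024 one-byte registers this tracks millions of cookies in ~1 KB of state, which is why approximate dc scales where exact dc does not.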

hylam
Contributor

dc(cookie) from indexed data also took 7 min.

martin_mueller
SplunkTrust

I'm guessing your cookies are extremely high-cardinality, i.e. there is a large number of distinct values. Computing a dc() over that is a very high-load task for a data model. It has to keep (probably) millions of different values around and, for each new value, check whether it has seen that value before. That's a nightmare to compute accurately.
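To put a rough number on that cost: a plain hash set - the natural exact-dc data structure - already spends tens of megabytes on its table alone for a million short values, before counting the strings themselves (Python used purely as an illustration):

```python
import sys

seen = set()
for i in range(1_000_000):
    cookie = f"cookie{i}"
    if cookie not in seen:   # every single event pays this membership check
        seen.add(cookie)

# size of the set's hash table itself, excluding the string objects it holds
table_mib = sys.getsizeof(seen) / 2**20
print(f"{table_mib:.0f} MiB of table for {len(seen):,} distinct values")
```

And unlike the sketch-based estimators, this memory grows linearly with cardinality, which is what makes exact dc over a year of cookies so painful.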

martin_mueller
SplunkTrust

Yeah, this should scale out to multiple indexers and probably (I didn't check myself yet) also to multiple search pipelines on one indexer (a 6.3 feature).

Before adding tons of hardware, you should spend some time (and maybe money) figuring out whether your current data model / search is the best way to answer your core question - determining the best approach comes one step before figuring out how to make the chosen approach faster.

hylam
Contributor

It is running single-core CPU-bound on Splunk 6.3. I am using the auto (750 MB) bucket size, not the 10 GB auto_high_volume bucket size, so I think inter-bucket parallelization is possible. Simply adding up hourly dc(cookie) values would count a 24-hour-long session 24 times in the worst case. How can I run a parallel sort-merge-uniq in Splunk?

martin_mueller
SplunkTrust

This is your starting point for parallelizing things on a single box: http://docs.splunk.com/Documentation/Splunk/6.3.0/Capacity/Parallelization

Your described hourly distinct count would be something like ... | timechart span=1h dc(cookie), or a row split by time in pivot/data model terms.

hylam
Contributor

The cookies have a 1-year lifetime. If I set the query time range to 1 year, it should saturate a parallel sort-merge-uniq on any cluster I can afford. Even C++ may not be fast enough.

I have seen some Splunk apps that append new distinct keys to a CSV file every 5 to 10 minutes. How can I write that?

martin_mueller
SplunkTrust

Yeah, doing one year of cookies accurately in one go is nigh-on impossible - even n log(n) becomes less than fun at that scale.

Here's a traditional approach of keeping state in CSV files: http://blogs.splunk.com/2011/01/11/maintaining-state-of-the-union/

Depending on your actual use case, you may be better off precomputing chunks of distinct counts and storing them in a summary. Summing those up will give you higher numbers than reality, but if you're only looking for trends, that's fine. If you need more accurate numbers, you could compare the chunked numbers with real numbers computed over a longer but still manageable time range, and apply that correction factor to your chunked data from then on.
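The overcounting and the correction factor are easy to see on synthetic data - here a deliberate worst case where every cookie is active in every hourly chunk (all names and numbers invented for illustration):

```python
from collections import defaultdict

# synthetic traffic: 100 cookies, each active in all 24 hourly chunks
events = [(hour, f"user{u}") for hour in range(24) for u in range(100)]

per_hour = defaultdict(set)
for hour, cookie in events:
    per_hour[hour].add(cookie)

chunked = sum(len(s) for s in per_hour.values())   # sum of hourly dc values
true_dc = len({c for _, c in events})              # exact dc over the day

# chunked overcounts each cookie once per chunk it appears in: 2400 vs 100
correction = true_dc / chunked

# later, apply the calibrated factor to cheap chunked numbers
estimate = chunked * correction
```

Real traffic sits somewhere between this worst case (every session spans every chunk) and the best case (no session spans a chunk boundary), which is why calibrating the factor against a real exact count matters.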

hylam
Contributor

How about keeping a slowly changing dimension csv?

firstSeenTime,lastSeenTime,cookie

http://blogs.splunk.com/2011/01/11/maintaining-state-of-the-union/
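The merge step for that kind of state is simple - each batch of new events folds into the existing first/last-seen table. A sketch of the logic in Python (the dict stands in for the CSV lookup; all names synthetic):

```python
def merge_state(state, events):
    """Fold (time, cookie) events into {cookie: (first_seen, last_seen)}."""
    for t, cookie in events:
        if cookie in state:
            first, last = state[cookie]
            state[cookie] = (min(first, t), max(last, t))
        else:
            state[cookie] = (t, t)
    return state

state = {}
merge_state(state, [(100, "a"), (200, "b")])   # first 5-minute batch
merge_state(state, [(300, "a")])               # next batch: "a" seen again

# the running distinct count is just the number of rows kept
distinct_cookies = len(state)
```

In the Splunk version from the linked blog post, inputlookup/outputlookup play the role of `state`, and the table only grows by one row per genuinely new cookie.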

martin_mueller
SplunkTrust

It'll work - it just depends on what your actual requirements are. All we're doing here is throwing technical solutions around; it's impossible to tell what works best for your use case without knowing your use case.

hylam
Contributor

The slowly changing dimension approach would not work in a lot of cases, though.

hylam
Contributor

Will it be any faster if I use a minimal number of fields in the data model?

martin_mueller
SplunkTrust

Yes - in principle, smaller is always faster.

However, in this case you won't see large gains: the issue is cardinality, not data model size.
This should, imho, be solved by dropping the accuracy requirement and computing chunks, e.g. building a distinct count each hour and storing that in a summary, letting your reports read only the summaries. You can attempt to improve accuracy by estimating how many duplicate cookies to expect.

hylam
Contributor

"visitor-hours" and "visitors" are different. 24 visitor-hours ! = 24 visitors

hylam
Contributor

Will a summary index holding these work? I would then roll up:
minute sum, hour sum, day sum, week sum, month sum, year sum

martin_mueller
SplunkTrust

I know - that's why you'd need to calculate a rough conversion factor, and why chunked data is most useful for trends rather than precise absolute numbers. In return, it'd be much, much cheaper to compute.

martin_mueller
SplunkTrust

Instead of chunking, you could sample - e.g. drop the last few characters of your cookie and calculate a conversion factor from this lower number to the real number. With every hex character dropped, you reduce the cardinality by a factor of 16.
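A statistically cleaner variant of the same idea is hash sampling: keep only the cookies whose hash ends in one fixed hex character (a deterministic 1/16 sample), count those exactly, and multiply by 16. A sketch with synthetic cookies (all names invented for illustration):

```python
import hashlib

cookies = {f"user{i}" for i in range(16_000)}   # 16,000 distinct, synthetic

def h(cookie):
    return hashlib.md5(cookie.encode()).hexdigest()

# deterministic 1/16 sample: the same cookie always lands in (or out of)
# the sample, so repeats across events don't bias the count
sample = {c for c in cookies if h(c)[-1] == "0"}

estimate = len(sample) * 16   # should land within a few percent of 16,000
```

Because membership depends only on the cookie's own hash, the sampled exact count scales up cleanly, at 1/16 of the memory and sort cost.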

hylam
Contributor

In a shell script I can probably do a parallel sort-merge in O(n log n) time with linear speedup from adding CPU cores. Can I parallelize this query in Splunk? A hash of the cookie should work as the key for the parallel map-reduce operation.
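Hash-partitioning does make dc() embarrassingly parallel, because the partitions are disjoint: per-partition distinct counts can simply be summed, with no merge-dedup step. A serial sketch of the scheme in Python (each partition could run on its own core or indexer; data is synthetic):

```python
import hashlib
from collections import defaultdict

# 50,000 synthetic events over 5,000 distinct cookies (lots of repeats)
cookies = [f"user{i % 5_000}" for i in range(50_000)]

def partition(cookie, n=16):
    """Route a cookie to one of n disjoint partitions by its hash."""
    return int(hashlib.md5(cookie.encode()).hexdigest(), 16) % n

# map: every occurrence of a cookie lands in the same partition
parts = defaultdict(set)
for c in cookies:
    parts[partition(c)].add(c)

# reduce: per-partition dc values sum to the exact global dc
total_dc = sum(len(s) for s in parts.values())
```

Since a given cookie can only ever appear in one partition, the sum is exact, not an estimate - the scheme trades no accuracy for the parallelism.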
