Splunk Search

Accelerating searches over data with millions of logs

Path Finder

Hi,

Each day I receive millions of events of type "video_view". These logs are taken for hundreds of thousands of videos.

index="main"  action="video" | stats count as views by video_name |  sort -views | head 1000

index="main"  action="video" | stats dc(user) as views by video_name |  sort -views | head 1000

As you can see, we need to table both views and distinct user views with the video names. What is the best way to approach a very fast loading version of these?
We have previously gone with accelerated searches, but due to the massive amount of data (logs and different names of videos), this still takes minutes to load at times.

0 Karma

Super Champion

You should use summary indexing.

Here is a good document:

http://docs.splunk.com/Documentation/Splunk/6.0/Knowledge/Usesummaryindexing

Here is a video explaination:

http://www.splunk.com/view/SP-CAAACZW

0 Karma

SplunkTrust
SplunkTrust

I guess summary indexing will solve both your problem, with or without including time. Suppose your create a sumary index search with "index=main action=video | stats count by user,videoname " which will run hourly storing into main1h index, then using "index=main1h source='yoursummary index search name' | stats sum(count) as views by videoname | sort -views |head 1000" will serve your first requirement and "index=main1h source='yoursummary index search name' | stats dc(user) as views by video_name| sort -views |head 1000" will serve second.

0 Karma

Super Champion

I would create the summary index with table stats.

index="main" action="video" | table _time, action, user, video_name

It will be far faster to search the subset of data in the summary index then it would be searching the raw data.

0 Karma

Path Finder

How do you mean to include _time? I don't see how this would solve the problem at the moment

0 Karma

Super Champion

Include _time
I had the same problem with failed logon events with over a hundred million windows security events. It changed a searches that were basically impossible to ones that complete in seconds.

0 Karma

Path Finder

Summary indexing is certainly one solution to the first query.

However, if I were to use this for the latter then it would count a distinct user view once per time the summary index runs. So, if the index was run hourly then it would be counted up to 24 times in a report which is meant to show distinct video views over the last 24 hours. This should never be more than once, or it's not distinct.

Any idea to get around this? I've used summary indexes in the past but the same issue always arises when distinct user counts get involved

0 Karma