Hi,
Each day I receive millions of events of type "video_view". These logs are taken for hundreds of thousands of videos.
index="main" action="video" | stats count as views by video_name | sort -views | head 1000
index="main" action="video" | stats dc(user) as views by video_name | sort -views | head 1000
As you can see, we need to table both views and distinct user views with the video names. What is the best way to approach a very fast loading version of these?
We have previously gone with accelerated searches, but due to the massive amount of data (logs and different names of videos), this still takes minutes to load at times.
You should use summary indexing.
Here is a good document:
http://docs.splunk.com/Documentation/Splunk/6.0/Knowledge/Usesummaryindexing
Here is a video explaination:
I guess summary indexing will solve both your problem, with or without including _time. Suppose your create a sumary index search with "index=main action=video | stats count by user,video_name " which will run hourly storing into main_1h index, then using "index=main_1h source='yoursummary index search name' | stats sum(count) as views by video_name | sort -views |head 1000" will serve your first requirement and "index=main_1h source='yoursummary index search name' | stats dc(user) as views by video_name| sort -views |head 1000" will serve second.
I would create the summary index with table stats.
index="main" action="video" | table _time, action, user, video_name
It will be far faster to search the subset of data in the summary index then it would be searching the raw data.
How do you mean to include _time? I don't see how this would solve the problem at the moment
Include _time
I had the same problem with failed logon events with over a hundred million windows security events. It changed a searches that were basically impossible to ones that complete in seconds.
Summary indexing is certainly one solution to the first query.
However, if I were to use this for the latter then it would count a distinct user view once per time the summary index runs. So, if the index was run hourly then it would be counted up to 24 times in a report which is meant to show distinct video views over the last 24 hours. This should never be more than once, or it's not distinct.
Any idea to get around this? I've used summary indexes in the past but the same issue always arises when distinct user counts get involved