Re: Accelerating searches over data with millions ...

anthonycopus · ‎12-13-2013

Hi,

Each day I receive millions of events of type "video_view". These logs are taken for hundreds of thousands of videos.

index="main"  action="video" | stats count as views by video_name |  sort -views | head 1000

index="main"  action="video" | stats dc(user) as views by video_name |  sort -views | head 1000

As you can see, we need to table both views and distinct user views with the video names. What is the best way to approach a very fast loading version of these?
We have previously gone with accelerated searches, but due to the massive amount of data (logs and different names of videos), this still takes minutes to load at times.

lukejadamec · ‎12-13-2013

You should use summary indexing.

Here is a good document:

http://docs.splunk.com/Documentation/Splunk/6.0/Knowledge/Usesummaryindexing

Here is a video explaination:

http://www.splunk.com/view/SP-CAAACZW

somesoni2 · ‎12-16-2013

I guess summary indexing will solve both your problem, with or without including _time. Suppose your create a sumary index search with "index=main action=video | stats count by user,video_name " which will run hourly storing into main_1h index, then using "index=main_1h source='yoursummary index search name' | stats sum(count) as views by video_name | sort -views |head 1000" will serve your first requirement and "index=main_1h source='yoursummary index search name' | stats dc(user) as views by video_name| sort -views |head 1000" will serve second.

lukejadamec · ‎12-16-2013

I would create the summary index with table stats.

index="main" action="video" | table _time, action, user, video_name

It will be far faster to search the subset of data in the summary index then it would be searching the raw data.

anthonycopus · ‎12-16-2013

How do you mean to include _time? I don't see how this would solve the problem at the moment

lukejadamec · ‎12-13-2013

Include _time
I had the same problem with failed logon events with over a hundred million windows security events. It changed a searches that were basically impossible to ones that complete in seconds.

anthonycopus · ‎12-13-2013

Summary indexing is certainly one solution to the first query.

However, if I were to use this for the latter then it would count a distinct user view once per time the summary index runs. So, if the index was run hourly then it would be counted up to 24 times in a report which is meant to show distinct video views over the last 24 hours. This should never be more than once, or it's not distinct.

Any idea to get around this? I've used summary indexes in the past but the same issue always arises when distinct user counts get involved

Accelerating searches over data with millions of logs

Introducing Splunk Enterprise 9.2

Adoption of RUM and APM at Splunk

Routing logs with Splunk OTel Collector for Kubernetes