I'm having a tough time preventing a particular scheduled saved search from writing duplicates into my summary index, and I'm looking for some advice.
The premise: I have a lot of Apache web logs (hits) in Splunk that I want to summarize into web "sessions", keyed on a combination of website hostname, IP address, and user agent. Not perfectly accurate, but close enough for this purpose. My saved search looks at the web logs since the beginning of the day and creates a unique "session_id" along with other useful data such as landing page, user agent, etc. I want this search to run every 5 minutes so that the summary index stays up to date for a dashboard.
My planned saved search to populate the summary index was:
earliest=@d index=apache_logs status=200 contenttype=text/html
| stats earliest(_time) AS _time earliest(referer) AS referer earliest(uri) AS landing_page BY host_header clientip useragent
| eval session_id=md5(""._time.host_header.clientip.useragent)
| search NOT [ search earliest=@d index=website_summary search_name="Website Sessions" | table session_id ]
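As I understand it, Splunk expands that subsearch into one big OR filter applied to the outer search, roughly like this (the hash values here are made-up examples, not real session IDs):

| search NOT ( ( session_id="9a0364b9e99bb480dd25e1f0284c8555" ) OR ( session_id="c4ca4238a0b923820dcc509a6f75849b" ) OR ... )

so the number of session_id values that can be excluded is bounded by the subsearch result limit.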
I've used subsearches to exclude duplicates in the past, and it normally works great. The problem is that we have so many sessions per day that the subsearch returns far more than the 10,000-result subsearch limit. Even if I split it into 24 subsearches (one per hour), each one would still exceed 10,000 results. I thought about switching to append or join, which have a 50,000-result limit, but that is still too low.
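For reference, I believe the cap I'm hitting is the default subsearch setting in limits.conf (the values below are the defaults as I understand them), and raising it globally doesn't seem like a safe workaround:

[subsearch]
# maximum number of results a subsearch can return (the 10,000 cap I keep hitting)
maxout = 10000
# maximum number of seconds a subsearch may run before it is finalized
maxtime = 60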
I can't shorten the search timeframe either: if the search can't see back to the start of the day, it won't know a session's true start time and will generate duplicate sessions (sessions effectively reset each day, which is why the search's earliest time is @d).
Any ideas for doing the deduplication without a subsearch? Or maybe I'm looking at this wrong and need to try something completely different?