Knowledge Management

How to make sure a saved search does not create duplicates in a summary index?

rnotley
Engager

I'm having a tough time getting a particular scheduled saved search to stop generating duplicates in my summary index. Looking for some advice.

The premise: I have a lot of Apache web logs (hits) in Splunk that I want to summarize into web "sessions", which I derive from a combination of website hostname, IP address, and user agent. That's not perfectly accurate, but it's close enough for this purpose. My saved search looks at the web logs since the beginning of the day and creates a unique "session_id" along with other important data like landing page, user agent, etc. I want this search to run every 5 minutes so that the summary index stays up to date for a dashboard.

My planned saved search to populate the summary index was:

earliest=@d index=apache_logs status=200 contenttype=text/html
| stats earliest(_time) AS _time earliest(referer) AS referer earliest(uri) AS landing_page BY host_header clientip useragent
| eval session_id=md5(""._time.host_header.clientip.useragent)
| search NOT [ search earliest=@d index=website_summary search_name="Website Sessions" | table session_id ]

I've used subsearches to exclude duplicates in the past, and it normally works great. The problem is that we have many sessions per day, so the subsearch returns far more than the 10,000-result subsearch limit. Even if I split it into 24 subsearches (one per hour), each would still exceed 10,000 results. I thought about switching to append or join, which have a 50,000-result limit, but that is still too low.
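For reference, the 10,000 cap comes from the subsearch settings in limits.conf. A sketch with what I believe are the defaults (raising maxout globally isn't a real fix, since oversized subsearch results are silently truncated):

[subsearch]
# Maximum number of results a subsearch can return (default 10000)
maxout = 10000
# Maximum subsearch runtime in seconds before it finalizes (default 60)
maxtime = 60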

I can't shrink the time frame, because then the search wouldn't know the true start time of each session and would generate duplicates. For example, a visitor who landed at 00:05 would get a different earliest(_time), and therefore a different md5, in any window that doesn't reach back to 00:05. Sessions effectively reset every day, which is why the search start time is @d.

Any ideas for doing the deduplication without a subsearch? Or maybe I'm looking at this wrong and need to try something completely different?

1 Solution

rnotley
Engager

Well, I figured it out. The saved search I ended up with is:

| multisearch
  [ search earliest=@d index=website_summary search_name="Website Sessions" ]
  [ search earliest=@d index=apache_logs status=200 contenttype=text/html
    | eval date=strftime(_time, "%Y-%m-%d")
    | eval session_id=md5("".date.host_header.clientip.useragent ) ]
| stats earliest(_time) AS _time
        earliest(referer) AS referer
        earliest(uri) AS uri
        first(search_name) AS search_name
        BY session_id host_header clientip useragent 
| where isnull(search_name)

Here was my logic leading up to this result:

  1. I knew I couldn't use a subsearch, so I decided to try multisearch. This way I wouldn't have any problem with limits, but since both result sets would be mixed together, I needed a way to filter out the duplicate session_ids.

  2. Because multisearch only accepts streaming searches, I couldn't use stats inside it, but I needed stats because the session_id was derived from earliest(_time). So I had to find a way to generate a session_id before stats. Instead of using _time, I created a date field with strftime(_time, "%Y-%m-%d") and used that instead. Perfect!

  3. Now I just needed a way to remove the duplicates. By running stats on both result sets together, I could key off a field that I knew existed in the summary index but not in the new result set. search_name fit the bill, since summary indexing stamps it on every event, so I added first(search_name) to the stats and then used where isnull(search_name) to keep just the new results. The config sketch after this list shows where search_name comes from.
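For completeness, here's roughly how the saved search is wired up in savedsearches.conf. The stanza name and index are mine, the schedule is an assumption, and enabling the summary index action is what stamps search_name onto each summary event:

[Website Sessions]
# Run every 5 minutes (assumed schedule)
cron_schedule = */5 * * * *
enableSched = 1
# Summary indexing adds search_name (plus other info_* fields) to each event
action.summary_index = 1
action.summary_index._name = website_summary
# search = the multisearch pipeline shown above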
