Solved: Why is search not working properly on duplicate inner records?

user9025 · ‎10-13-2022

I have a Splunk query that is intended to return every IpAddress for which "Event A" occurred in the window from 24 hours ago to 4 hours ago, but for which "Event B" did not occur at any point in the last 24 hours.

"Event A" is known to occur at most once per IP address, but "Event B" can occur multiple times for the same IP address.

Following is the query:

index=prod-* sourcetype="kube:service" "Event A"  earliest=-24h latest=-4h  |table IpAddress | search NOT [search index=prod-* sourcetype="kube:service" AND ("Event B")  earliest=-24h latest=-0h |table IpAddress ]

Why does this first query not work?

It still returns an IP address that has an "Event A" event even when there are multiple "Event B" events for that same IP address.

However, if I add dedup IpAddress to the inner subsearch (the NOT clause), it works as expected.

Updated query:

index=prod-* sourcetype="kube:service" "Event A"  earliest=-24h latest=-4h  |table IpAddress | search NOT [search index=prod-* sourcetype="kube:service" AND ("Event B")  earliest=-24h latest=-0h |dedup IpAddress|table IpAddress ]

jdunlea · ‎10-13-2022

If you have a lot of events with "EVENT B" in your data, then you might be hitting the event limit for the subsearch (10k events). Therefore the subsearch will return only the first 10k events, which might only have a small number of IP addresses (if many events have the same IP address).

Using dedup makes the result count much smaller, probably fewer than 10k IP addresses, so the subsearch can return all of the IP addresses to the outer search, which then does the filtering.
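
One way to test the truncation hypothesis is to run the inner search on its own and compare the raw event count with the number of distinct IP addresses. This is only a sketch that reuses the index, sourcetype and field name from the question; event_b_events and distinct_ips are illustrative names, not fields in the data:

```
index=prod-* sourcetype="kube:service" "Event B" earliest=-24h latest=now
| stats count AS event_b_events dc(IpAddress) AS distinct_ips
```

If event_b_events is well above 10,000 while distinct_ips is below it, the subsearch without dedup was most likely cut off before it had seen every IP address.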

Side note: You might be able to do this using a single search (no subsearch) by doing something like the following (please note: you will need to create the event_flag field yourself using your own regex/match)

index=prod-* sourcetype="kube:service" ("Event A"  earliest=-24h latest=-4h) OR ("Event B" earliest=-24h latest=-0h)  | eval event_flag=if(match(_raw,"Event A"),"Event_A","Event_B")
| stats values(event_flag) as event_flag dc(event_flag) as event_count by IpAddress
| search event_count=1 event_flag="Event_A"

johnhuang · ‎10-13-2022

Subsearches have limitations, including a limit of 10k results and a 60-second runtime. The dedup reduces the number of results to fewer than 10k.

Subsearches are also inefficient compared to other methods. You should write a primary search that includes both event types and use stats, etc. to filter. If you need help with this, you should provide the actual search terms/fields for Event A and B.
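
For reference, these two limits correspond to settings in the [subsearch] stanza of limits.conf. The values below are believed to be the shipped defaults; treat them as an assumption and check the limits.conf documentation for your Splunk version before relying on or changing them:

```
[subsearch]
# maximum number of results a subsearch returns to the outer search
maxout = 10000
# maximum number of seconds a subsearch runs before it is finalized
maxtime = 60
```

Raising these values only moves the ceiling, so reducing the subsearch output (for example with dedup) is usually the safer fix.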
