(This is the first post in a two-part series.)
Splunk Enterprise Security is a fantastic tool with robust features that give you great insight into your protected infrastructure, helping you strike a balance between visibility and alert fatigue.
In my role at Splunk, I often design and test non-production environments for Splunk Enterprise Security. Lately, I've been diving into various correlation searches (ES7) and detections (ES8). During this journey, I've found myself wondering, “Why isn't this finding being generated?” While troubleshooting, I came across a few scenarios that I think are worth sharing. Now, my situation is pretty straightforward since I have control over the data my instance receives. It’s easy for me to pinpoint when findings should have popped up. However, this might not be the case in a real production environment. So, having a solid and regular plan for testing detections in your environment is super important. And don’t forget, Splunk has a wealth of resources available to help you tackle these challenges!
Just a heads-up, I’ll be sticking with Enterprise Security 8 terminology for the rest of this post. A comprehensive glossary can be found here.
However, here is a little table with the closest Enterprise Security 7 equivalents to keep things clear.
Reason #1: The search is not valid (any longer?).
Occam's principle suggests that the simplest explanation is often the correct one. While this may seem obvious, there have been instances where I anticipated a finding, only to discover that the search itself was flawed.
Let’s illustrate this with an example. In this scenario, I have established the following rule:
It seems the search has run successfully 25 times before. However, in the last 24 hours there are no new items in the queue, even though I'm certain some matching events occurred. So, why aren't these events showing up?
To get to the bottom of this, let’s take a closer look at the detection search. Just click on the detection name to move forward.
If you are familiar with Splunk Enterprise Security, you will notice this particular search is rather simple. In production, you would definitely want more restrictive queries to avoid performance overhead, but its simplicity will help us illustrate the idea.
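To make this concrete, here is a rough sketch of what such a simple detection search could look like. The index, sourcetype, and field names are placeholders for illustration, not the actual search from my lab:

    index=security sourcetype=linux_secure action=success tag=privileged ```hypothetical example: successful actions involving assets or users tagged as privileged```
    | table _time, user, src, dest, action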
Now, let's copy that detection and run it directly against our data. To do so, we could, of course, go to the Search & Reporting app, but there is no need, as there is a Search tab right in the Enterprise Security top menu.
No results in the past 24 hours. Not even in verbose mode. Interesting. So I went ahead and removed one of the search constraints, in this case, the tag, and ran the search again.
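In terms of the sketch above, that means running something like this, with the tag filter dropped:

    index=security sourcetype=linux_secure action=success ```same placeholder search, minus tag=privileged```
    | table _time, user, src, dest, action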
We have results. Hence, the problem must be with the tag. To prove it, I removed the table command, executed my search again, and checked the available tags. As expected, the privileged tag was not there.
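If you prefer to check the tags directly in SPL rather than browsing the fields sidebar, a quick way (still using my placeholder index and sourcetype) is something like:

    index=security sourcetype=linux_secure action=success ```list which tags are actually attached to the matching events```
    | stats count by tag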
The privileged tag had been removed from some users, which broke the detection. The solution is to either edit the search and remove the tag constraint, or restore the tag on the data.
So, I went with editing the detection, and after a while I started getting new hits on it.
In this case the actual root cause was a miscommunication between the Splunk Administrator and the Splunk Enterprise Security Detection Engineer. Something that would never happen in your organization, right?
Now, if you don't feel as comfortable with SPL, or even if you are an expert who would appreciate some assistance, remember that you can leverage guided mode.
As a side note, this seems like a perfect opportunity to introduce detection versioning.
Reason #2: Wrong time range.
This is a subcategory of the previous reason. One that deserves to be mentioned on its own.
The time range of the search defined on a detection is shown in the time range section. Clearly.
Let’s analyze the example below:
In this case, the search will run every 60 minutes, capturing all events that occurred between 72 minutes and 12 minutes before each execution time. As a result, if the detection is scheduled to run at 10:20 PM but the event happened at 10:18 PM, it won't show up until the next run, at 11:20 PM.
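For reference, that schedule and time range would correspond to something like the following savedsearches.conf attributes (illustrative values only; in practice you would configure them through the detection editor rather than by hand):

    # Run at minute 20 of every hour
    cron_schedule = 20 * * * *
    # Look back from 72 minutes to 12 minutes before each run
    dispatch.earliest_time = -72m@m
    dispatch.latest_time = -12m@m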
Now, usually you won't be messing with the timing and will set the latest time as close to real time as possible. Honestly, I have seen more instances of events being duplicated because of time adjustments: if the cron schedule is set to every 10 minutes but the search includes events from the past 60 minutes, you may end up with duplicated results. In fact, over-correcting that second scenario is how I ended up creating some uncovered time windows. By the way, pay close attention to snapping.
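To see why snapping matters, take the same placeholder search from before and add an @h snap to the earliest time:

    index=security sourcetype=linux_secure action=success earliest=-60m@h latest=now ```@h snaps the boundary down to the top of the hour```
    | table _time, user, src, dest, action

If this runs at 10:20 PM, -60m lands on 9:20 PM and @h then snaps it down to 9:00 PM, so the window is actually 80 minutes wide. With an hourly schedule, consecutive runs overlap by 20 minutes, which produces exactly the kind of duplication mentioned above; snapping the latest time too (for example, earliest=-60m@h latest=@h) gives back-to-back windows with no overlap and no gap.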
I will come back with some more reasons I have found in an upcoming post.