Need help Optimizing Search in HUNK

EricLloyd79
Builder

We are currently using MapRFS and with our restrictions on directory structure, we are having a hard time getting optimized searches with Hunk.

Basically, the search will find all the events and then just keep searching through all files.

Our restrictions require us to have a folder called current that the current hour's logs go into. At the top of the hour the log is rolled, and we move the rolled file into a subdirectory based on date/time.

Our current directory structure looks like:
/mapr/mapr.oly.cequintecid.com/user/mapr/data/(sourcetype)/(host)/current/(year)/(month)/(day)/(hour)

The current hour goes into a log file in /mapr/mapr.oly.cequintecid.com/user/mapr/data/(sourcetype)/(host)/current
and then is moved at the top of the hour to the corresponding
...(year)/(month)/(day)/(hour)
folder

We had search optimization before, when we were putting the current hour's log file directly into the hour subdirectory further down, but we cannot do this anymore due to internal restrictions.

Suggestions are welcome.

Here is our indexes.conf for the virtual index we are using:
(screenshot of indexes.conf; image not available)

0 Karma
1 Solution

rdagan_splunk
Splunk Employee

The issue is that your HDFS directory contains 4 subdirectory levels between /user/mapr/ and the date folders: /data/(sourcetype)/(host)/current/
However, the regex contains only 3 'ignore all' (.*?) segments.

A few workarounds are:
regex = .*?/current/(\d+)/(\d+)/(\d+)/(\d+)/.*
or
regex = .*?/(\d+)/(\d+)/(\d+)/(\d+)/.*
or
regex = /user/mapr/.*?/.*?/.*?/.*?/(\d+)/(\d+)/(\d+)/(\d+)/.*
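
A quick way to sanity-check these regexes outside Hunk is to run them against a sample path. The sketch below uses Python's re module rather than the Java regex engine Hunk runs on, and assumes the pattern may match anywhere in the path; the sample path (sourcetype syslog, host host01, file app.log) and the helper name extract_time_parts are made up for illustration.

```python
import re

# Hypothetical sample path; the sourcetype, host and file name are invented.
SAMPLE = "/user/mapr/data/syslog/host01/current/2015/06/01/13/app.log"

# The three workaround regexes quoted above.
WORKAROUNDS = [
    r".*?/current/(\d+)/(\d+)/(\d+)/(\d+)/.*",
    r".*?/(\d+)/(\d+)/(\d+)/(\d+)/.*",
    r"/user/mapr/.*?/.*?/.*?/.*?/(\d+)/(\d+)/(\d+)/(\d+)/.*",
]

def extract_time_parts(pattern, path):
    """Return the captured (year, month, day, hour) strings, or None if no match."""
    match = re.search(pattern, path)
    return match.groups() if match else None

for pattern in WORKAROUNDS:
    print(pattern, "->", extract_time_parts(pattern, SAMPLE))
```

A file that still sits directly in current/ has no date folders in its path, so none of the patterns can extract a time from it.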

0 Karma

EricLloyd79
Builder

The solution was to change it to the answer by rdagan: "Since the vix.input.1.path starts searching from the /current/ and below ... you should change the time regex = .*?/(\d+)/(\d+)/(\d+)/(\d+)/.* "

0 Karma

EricLloyd79
Builder

I did find that it ran faster when I used the configuration below.
How does this look to you in terms of performance for finding the last 15 mins of data? Is it what you would expect?

This search has completed and has returned 16 results by scanning 119,993,301 events in 198.176 seconds

[mapr-curr]
vix.input.1.et.format = yyyyMMddHH
vix.input.1.et.regex = /user/mapr/data/.*?/.*?/current/(\d+)/(\d+)/(\d+)/(\d+)/.*
vix.input.1.lt.format = yyyyMMddHH
vix.input.1.lt.offset = 3600
vix.input.1.lt.regex = /user/mapr/data/.*?/.*?/current/(\d+)/(\d+)/(\d+)/(\d+)/.*
vix.input.1.path = /user/mapr/data/${sourcetype}/${host}/current/...
vix.provider = maproly
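
As a rough illustration of what this stanza asks for, the sketch below joins the four captured groups, parses them with the yyyyMMddHH format, and adds the 3600-second lt offset to get the time window a path covers. This is my reading of the et/lt settings, not a copy of Hunk's implementation; the function name path_time_window and the sample path are hypothetical, and the directory names are assumed to be zero-padded.

```python
import re
from datetime import datetime, timedelta

# Values taken from the stanza above.
ET_REGEX = r"/user/mapr/data/.*?/.*?/current/(\d+)/(\d+)/(\d+)/(\d+)/.*"
LT_OFFSET_SECONDS = 3600  # vix.input.1.lt.offset

def path_time_window(path):
    """Return (earliest, latest) for a path, or None if no time can be extracted."""
    match = re.search(ET_REGEX, path)
    if match is None:
        return None
    stamp = "".join(match.groups())                   # e.g. "2015060113"
    earliest = datetime.strptime(stamp, "%Y%m%d%H")   # Python equivalent of yyyyMMddHH
    latest = earliest + timedelta(seconds=LT_OFFSET_SECONDS)
    return earliest, latest

# Hypothetical path: sourcetype "syslog", host "host01".
print(path_time_window("/user/mapr/data/syslog/host01/current/2015/06/01/13/app.log"))
```

With this reading, each hour folder maps to a one-hour window, which is what lets a 15-minute search skip every other folder.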
0 Karma

rdagan_splunk
Splunk Employee

Since the vix.input.1.path starts searching from the /current/ and below ... you should change the time regex = .*?/(\d+)/(\d+)/(\d+)/(\d+)/.*

0 Karma

EricLloyd79
Builder

Thanks, I noticed that earlier and changed it to the solutions you suggested, and it still displays the same behavior. It will find the events requested and then continue searching through all the rest of the million files.

I'm pretty baffled, as I have created an almost identical subdirectory structure with a new virtual index on it, and that one runs with optimization.

I'm updating the screenshot in the original question to what I have now.

0 Karma

burwell
SplunkTrust

Hi. So when you say

Basically, the search will find all the events and then just keep searching through all files.

I assume you are watching with debug or something to see that Splunk keeps looking at the files to see if they match the regex?

That is the behavior I have observed too. How would it know not to keep looking?

0 Karma

EricLloyd79
Builder

So it doesn't always do this.
Hunk can avoid this through the regex you create when specifying a timeline, either in Splunk Web or in indexes.conf. My vix.input.1.et.regex (et stands for earliest time, lt for latest time) is:

/user/mapr/.*?/.*?/.*?/(\d+)/(\d+)/(\d+)/(\d+)/.*

and I have the format set to yyyyMMddHH

So it will find that first (\d+) and identify it as the year
Second one is the month
Third is the Day
Fourth is the hour
So now Hunk can search through ONLY the directories for the time you specified. Say you only wanted to search the last 60 mins.
It will know the HOUR of your logs from the regex and ONLY search in the appropriate folders, based on whichever hours fall into the last 60 mins.
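
To make the "last 60 mins" case concrete, here is a small sketch that lists which (year)/(month)/(day)/(hour) folders a search window touches. The function name hour_dirs_for_window and the example timestamp are made up, and zero-padded folder names are an assumption; it only illustrates the idea, not Hunk's code.

```python
from datetime import datetime, timedelta

def hour_dirs_for_window(earliest, latest):
    """Return the year/month/day/hour folder suffixes that overlap [earliest, latest)."""
    # Start at the top of the hour that contains the earliest time.
    cursor = earliest.replace(minute=0, second=0, microsecond=0)
    dirs = []
    while cursor < latest:
        dirs.append(cursor.strftime("%Y/%m/%d/%H"))
        cursor += timedelta(hours=1)
    return dirs

# Hypothetical "now" for a search over the last 60 minutes.
now = datetime(2015, 6, 1, 14, 20)
print(hour_dirs_for_window(now - timedelta(minutes=60), now))
# -> ['2015/06/01/13', '2015/06/01/14']
```

A 60-minute window that does not start on the hour touches two hour folders, so those two are the only dated folders that should need reading.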

I've seen this work before, BEFORE I had a current folder in my path, and now it doesn't.

0 Karma

ledion
Path Finder

Hunk will list all the files, then eliminate as many as it can by just looking at the path, e.g. filtering via fields extracted from the path, or by time range. If it cannot eliminate a file, then it will process its contents and apply the search to them. Just to be clear, there are two parts to a search: (a) list files and try to eliminate them, and (b) read files and process their content.

EricLloyd79 - are you seeing files that shouldn't be processed being processed? If so, please provide an example
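
The two parts described above can be sketched roughly as follows. This is only an illustration of part (a) under my own assumptions (one-hour folders, zero-padded names, Python's re instead of Hunk's Java regex engine); split_by_path and the sample paths are hypothetical names, not Hunk internals.

```python
import re
from datetime import datetime, timedelta

TIME_REGEX = re.compile(r".*?/(\d+)/(\d+)/(\d+)/(\d+)/.*")

def split_by_path(paths, search_earliest, search_latest):
    """Part (a): return (must_read, skipped) using only the file paths."""
    must_read, skipped = [], []
    for path in paths:
        match = TIME_REGEX.search(path)
        try:
            start = datetime.strptime("".join(match.groups()), "%Y%m%d%H")
        except (AttributeError, ValueError):
            # No usable time in the path, so the file cannot be eliminated.
            must_read.append(path)
            continue
        end = start + timedelta(hours=1)
        if start < search_latest and end > search_earliest:
            must_read.append(path)   # folder overlaps the search window
        else:
            skipped.append(path)     # eliminated without opening the file
    return must_read, skipped
```

Part (b) would then read only the must_read files. A file still sitting directly in current/ always lands in must_read, because its path carries no date folders.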

0 Karma

EricLloyd79
Builder

I am not sure how to tell which files it is searching through. All I can see is that it has found x files out of y files, and even though I see all the results for the time I specified in the search results, it continues to search through more files, up into the range of millions. I tried turning on debug mode for Splunk, running this, and viewing the splunkd.log file to see where it is searching once all the files are found, but I could not see where in the logs that information would be.

0 Karma

ledion
Path Finder

Is x (in "x of y files") in the general vicinity of the number of files it needs to search? I.e. the ones that fall within the earliest/latest of the search, plus current?

0 Karma

EricLloyd79
Builder

Yes, the events it finds are correct; that isn't the issue. The issue is that after it finds them, it continues searching through the rest of the files.

It's even more bizarre to me because I have a separate directory with a new virtual index that is nearly identical to the first one (the only difference I can find is the names), and when it searches, it finds the events in the time frame asked for and then wraps up the search easily.

I realize it is very difficult to troubleshoot this sort of thing without seeing it. Any suggestions are welcome, though, as I will continue trying to figure out what makes one virtual index run faster than the other.

0 Karma

ledion
Path Finder

Is the et/lt regex correct, and markdown is just messing it up so that it shows as .? ?

0 Karma

EricLloyd79
Builder

Thanks for pointing that out. It was a fluke of the HTML in this interface. I changed it and uploaded an image of what my et/lt regex looks like with the asterisk included.

0 Karma

ledion
Path Finder

Below you say it doesn't "always do it" - with the current configs that you have in place it should always search the input files for which it cannot figure out the time range they belong to - ie the most recent hour of data will be always searched. Is that what you see?

0 Karma