We are currently using MapRFS and with our restrictions on directory structure, we are having a hard time getting optimized searches with Hunk.
Basically, the search will find all the events and then just keep searching through all files.
Our restrictions require us to have a folder called current that our current-hour logs go into; at the top of the hour the log is rolled, and we move the rolled file into subdirectories based on date/time.
Our current directory structure looks like:
/mapr/mapr.oly.cequintecid.com/user/mapr/data/(sourcetype)/(host)/current/(year)/(month)/(day)/(hour)
The current hour goes into a log file in /mapr/mapr.oly.cequintecid.com/user/mapr/data/(sourcetype)/(host)/current and is then moved at the top of the hour to the corresponding ...(year)/(month)/(day)/(hour) folder.
Searches were optimized before, when we wrote the current-hour log file directly into its hour subdirectory, but we can no longer do this due to internal restrictions.
Suggestions are welcome.
Here is our indexes.conf for the virtual index we are using:
The issue is that your HDFS path has 4 directory levels ahead of the date directories: /data/(sourcetype)/(host)/current/
However, the regex contains only 3 "ignore all" (.*?) segments.
A few workarounds are:
regex = .*?/current/(\d+)/(\d+)/(\d+)/(\d+)/.*
or
regex = .*?/(\d+)/(\d+)/(\d+)/(\d+)/.*
or
regex = /user/mapr/.*?/.*?/.*?/.*?/(\d+)/(\d+)/(\d+)/(\d+)/.*
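As a quick sanity check, the three workaround patterns can be tried against a sample path outside of Hunk. This is only a sketch in Python: the sourcetype, host, and file name in the sample path are made up, and it assumes the pattern is applied to the whole path and that the captured groups are joined in order. Hunk's own matching may differ in detail.

```python
import re

# Hypothetical sample path; sourcetype, host and file name are invented.
sample = "/user/mapr/data/syslog/host01/current/2015/06/02/13/events.log"

# The three workaround patterns from the answer above.
candidates = [
    r".*?/current/(\d+)/(\d+)/(\d+)/(\d+)/.*",
    r".*?/(\d+)/(\d+)/(\d+)/(\d+)/.*",
    r"/user/mapr/.*?/.*?/.*?/.*?/(\d+)/(\d+)/(\d+)/(\d+)/.*",
]

def extract_time_key(pattern, path):
    """Return the captured groups joined together, or None if no match."""
    m = re.fullmatch(pattern, path)
    return "".join(m.groups()) if m else None

results = [extract_time_key(p, sample) for p in candidates]
for pattern, key in zip(candidates, results):
    print(pattern, "->", key)
```

With zero-padded directory names, the joined groups line up with a yyyyMMddHH format.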
The solution was to change it per the answer by rdagan: "Since the vix.input.1.path starts searching from /current/ and below ... you should change the time regex = .*?/(\d+)/(\d+)/(\d+)/(\d+)/.*"
I did find that it ran faster when I used the configuration below.
How does this look to you in terms of performance for finding the last 15 minutes of data? Is this expected?
This search has completed and has returned 16 results by scanning 119,993,301 events in 198.176 seconds
[mapr-curr]
vix.input.1.et.format = yyyyMMddHH
vix.input.1.et.regex = /user/mapr/data/.*?/.*?/current/(\d+)/(\d+)/(\d+)/(\d+)/.*
vix.input.1.lt.format = yyyyMMddHH
vix.input.1.lt.offset = 3600
vix.input.1.lt.regex = /user/mapr/data/.*?/.*?/current/(\d+)/(\d+)/(\d+)/(\d+)/.*
vix.input.1.path = /user/mapr/data/${sourcetype}/${host}/current/...
vix.provider = maproly
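To make the et/lt settings in that stanza concrete, here is a rough Python sketch of how a path could be turned into a time range. It is an illustration under assumptions, not Hunk's actual code: it assumes the captured groups are concatenated and parsed with the format, that lt is the same timestamp plus the 3600-second offset, and that directory names are zero-padded. The sample paths and the function name are made up.

```python
import re
from datetime import datetime, timedelta

# Regex copied from vix.input.1.et.regex / vix.input.1.lt.regex above.
TIME_REGEX = r"/user/mapr/data/.*?/.*?/current/(\d+)/(\d+)/(\d+)/(\d+)/.*"
TIME_FORMAT = "%Y%m%d%H"    # Python equivalent of yyyyMMddHH
LT_OFFSET_SECONDS = 3600    # vix.input.1.lt.offset

def path_time_range(path):
    """Return (earliest, latest) for a path, or None if no time can be extracted."""
    m = re.fullmatch(TIME_REGEX, path)
    if m is None:
        return None
    earliest = datetime.strptime("".join(m.groups()), TIME_FORMAT)
    return earliest, earliest + timedelta(seconds=LT_OFFSET_SECONDS)

# Hypothetical paths: one rolled file and one file still sitting in current/.
rolled = "/user/mapr/data/syslog/host01/current/2015/06/02/13/events.log"
live = "/user/mapr/data/syslog/host01/current/events.log"

print(path_time_range(rolled))
print(path_time_range(live))
```

The file that sits directly in current/ has no date directories, so under these assumptions no time range can be derived for it from the path alone.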
Since the vix.input.1.path starts searching from /current/ and below ... you should change the time regex = .*?/(\d+)/(\d+)/(\d+)/(\d+)/.*
Thanks, I noticed that earlier and changed it to the solutions you suggested, but it still shows the same behavior. It finds the requested events and then continues searching through the rest of the millions of files.
I'm pretty baffled, as I have created an almost identical subdirectory structure with a new virtual index on it, and that one runs with optimization.
I'm updating the screenshot in the original question to show what I have now.
Hi. So when you say
"Basically, the search will find all the events and then just keep searching through all files."
I assume you are watching with debug or something to see that Splunk keeps looking at the files to see if they match the regex?
That is the behavior I have observed too. How would it know not to keep looking?
So it doesn't always do this.
Hunk can avoid this using the regex you define for time extraction, either in Splunk Web or in indexes.conf. My vix.input.1.et.regex (et stands for earliest time, lt for latest time) is
/user/mapr/.*?/.*?/.*?/(\d+)/(\d+)/(\d+)/(\d+)/.*
and I have the format set to yyyyMMddHH
So it will find the first (\d+) and identify it as the year
The second one is the month
The third is the day
The fourth is the hour
So now Hunk can search through ONLY the directories for the time you specified. Say you only wanted to search the last 60 minutes.
It will know the HOUR of your logs from the regex and ONLY search the folders whose hours fall into the last 60 minutes.
I've seen this work BEFORE I had a current folder in my path, and now it doesn't.
Hunk will list all the files, then eliminate as many as it can just by looking at the path, e.g. filtering via fields extracted from the path, or by time range. If it cannot eliminate a file, it will process its contents and apply the search to them. Just to be clear, there are two parts to a search: (a) list files and try to eliminate them, and (b) read files and process their content.
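The two parts described above can be modelled roughly like this. It is a simplified Python sketch of the described behaviour, not Hunk's implementation; the function name, paths, and times are invented for illustration. Each listed file carries either a time range derived from its path or None when no range could be derived.

```python
from datetime import datetime

def split_candidates(files, search_start, search_end):
    """Part (a): decide from path-derived time ranges which files must be read.

    files is a list of (path, time_range) pairs, where time_range is
    (earliest, latest) or None if nothing could be extracted from the path.
    """
    to_read, skipped = [], []
    for path, time_range in files:
        if time_range is None:
            # No time range known, so the file cannot be eliminated.
            to_read.append(path)
            continue
        earliest, latest = time_range
        if earliest < search_end and latest > search_start:
            to_read.append(path)   # overlaps the search window
        else:
            skipped.append(path)   # eliminated without opening the file
    return to_read, skipped

# Hypothetical listing: two rolled hours plus the live file in current/.
base = "/user/mapr/data/syslog/host01/current"
files = [
    (base + "/2015/06/02/11/events.log", (datetime(2015, 6, 2, 11), datetime(2015, 6, 2, 12))),
    (base + "/2015/06/02/13/events.log", (datetime(2015, 6, 2, 13), datetime(2015, 6, 2, 14))),
    (base + "/events.log", None),
]

to_read, skipped = split_candidates(
    files, datetime(2015, 6, 2, 13, 15), datetime(2015, 6, 2, 13, 30)
)
print("read:", to_read)      # part (b) would process only these
print("skipped:", skipped)
```

In this model, any file whose path yields no time range always ends up in the list to read, whatever the search window is.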
EricLloyd79 - are you seeing files being processed that shouldn't be? If so, please provide an example.
I am not sure how to tell which files it is searching through. All I can see is that it has found x number of files of y number of files and even though I see all the results for the time I specified in the search results, it continues to search through more files up in the range of millions. I tried turning on Debug mode for Splunk and running this and viewing the splunkd.log file to see if I can view where it is searching once all the files are found but I could not see in the logs where that information would be.
Is x (in "x of y files") in the general vicinity of the files it needs to search? i.e. the ones that fall within earliest/latest of the search + current?
Yes, the events it finds are correct; that isn't the issue. The issue is that after it finds them, it continues searching through the rest of the files.
It's even more bizarre to me because I have a separate directory with a new virtual index that is nearly identical to the first one (the only difference I can find is the names), and when it searches, it finds the events in the requested time frame and then wraps up the search easily.
I realize it is very difficult to troubleshoot this sort of thing without seeing it. Any suggestions are welcome, though, as I will continue trying to figure out what makes one virtual index run faster than the other.
Is the et/lt regex actually correct, and is markdown just mangling the .*? into .? when it is displayed?
Thanks for pointing that out. It was a fluke of how HTML is handled in this interface. I changed it and uploaded an image of what my et/lt regex looks like with the asterisk included.
Below you say it doesn't "always do it". With the configs you currently have in place, it should always search the input files for which it cannot figure out the time range they belong to, i.e. the most recent hour of data will always be searched. Is that what you see?