Currently I am writing events into a file on a continuous basis.
The rate is about 5,000 events every 5 minutes, which is about 60,000 events per hour, and it continues.
I am running the search query below, which is taking close to 7 minutes. Is there any way I can reduce this search time to a few seconds? Can summary indexing help in this case?
EventCode="0" OR EventCode="4624" earliest=-30m | transaction ComputerName startswith=(EventCode="4624") endswith=(EventCode="0") maxevents=2 | eval duration=round(duration,1) | search duration<30 AND duration>2 | stats stdev(duration) as StdDevDuration avg(duration) as AvgDuration count(ComputerName) as count_value | table AvgDuration, StdDevDuration, count_value
sideview's recommendations for tweaking the base search are also very good. Doing the transaction part with stats is possible, but it's very gnarly (I've done it, but it was confusing and time consuming, and involves the use of multi-value fields and the use of first() and last(), etc.). See more on that below.
The first thing I would check, after optimizing the base search as sideview mentioned, is ensuring that my regular expressions (field extractions) aren't performing poorly for non-built-in fields such as EventCode and ComputerName. If those are extracted by Splunk implicitly (i.e., they appear in the _raw events as "EventCode=xxx ComputerName=yyy"), skip my next paragraph and proceed to the transaction info; otherwise continue on.
Try running the base search as recommended by dwaddel, again. After the search completes, click on the Job Inspector icon (it looks like an "i") near where you have the option to save, pause, etc. In the job inspector, you should see a summary of how long each portion of the search took. If you see command.search.kv taking a lot of the time used for the search (say, greater than 30s out of 3 minutes), it means that the regular expression being used to extract the EventCode field is fairly slow.
Try downloading "Regex Coach" from the internet, then take a sample event from Splunk along with the field extraction for EventCode. You'll need to strip out the extra syntax inside the parentheses that names the capture group (it should look like "?<EventCode>"). Then single-step through the regex to see if it is doing a lot of backing up and moving forward. Read about regex optimizations and apply anything you can to make it do less moving around. Paste the new regex back into your field extraction (adding the group name back in), and test that it works and shows up in 100% of your results (if the old one did). You should also do this for the ComputerName field, since it will be required for the timechart/stats portion of the search.
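As a hypothetical illustration (these patterns are made up, not your actual extraction), an extraction built on a lazy wildcard, such as

EventCode=(?<EventCode>.*?)\s

makes the engine expand the match one character at a time with a check after each step, while an anchored character class like

EventCode=(?<EventCode>\d+)

matches in a single forward pass and fails fast on events that don't contain the field.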
As for the transaction part, that is going to be much tougher because you have two distinct field values that are important for calculating the duration, and they must occur in a set order (otherwise I'd say take a look at this for more info about that). Transaction processing time goes up exponentially based on the number of events it has to deal with at once. This is also in line with Lamar's recommendation of minimizing the events found in the base search as much as possible.
Also note that the transaction command is single threaded (actually, a lot of Splunk searching is), so you are limited to the speed of one CPU. You may be able to get better results by segmenting your base search into smaller partitions based on a list of ComputerName values, and running more transaction commands in parallel so that the searches are distributed across CPUs (if you have a multi-core system -- I haven't tried this, but it should work).
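For example (the "siteA-*"/"siteB-*" hostname patterns here are made up; substitute whatever partitions your ComputerName values naturally fall into), you could run two searches like these at the same time and combine their results afterward:

EventCode="0" OR EventCode="4624" ComputerName="siteA-*" earliest=-30m | transaction ComputerName startswith=(EventCode="4624") endswith=(EventCode="0") maxevents=2 | ... rest of your search

EventCode="0" OR EventCode="4624" ComputerName="siteB-*" earliest=-30m | transaction ComputerName startswith=(EventCode="4624") endswith=(EventCode="0") maxevents=2 | ... rest of your search

Each search then hands the transaction command only a fraction of the events, and the two can run on separate CPUs.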
If there are many cases (say, greater than 10%) where you receive EventCode 4624 but not an accompanying 0 (or vice versa), then an option would be to filter off events for hosts that don't have both of those EventCodes, prior to the transaction command (since their transaction duration will always be 0 and they can easily be identified with a stats command). Use a subsearch before the first pipe, like this:
EventCode=0 OR EventCode=4624 [search EventCode=0 OR EventCode=4624 | stats dc(EventCode) AS CodeCount by ComputerName | search CodeCount>1 | table ComputerName ] | ... your transaction command etc
Of course, if the field extraction time for EventCode and ComputerName is slow, then this will add more time up front, but if they are fast, and many events are eliminated, transaction could speed up a great deal.
One thing you should check is to run just this search:
EventCode="0" OR EventCode="4624" earliest=-30m
Then click the little "i" icon, and scroll down to find both scanCount and eventCount. Compare the two.
To get the events for this search, Splunk will technically use the index to retrieve events matching:
0 OR 4624
and then, once the events are off disk, it will extract the EventCode fields and compare them against those values. Obviously, though, that index search can match many events that will just get thrown away.
If the word EventCode is literally in the data, I would add EventCode as a search term.
In any event, if scanCount is a lot greater than eventCount, you want to explore what you can do to that initial search clause to make scanCount be much closer to eventCount, and then you'll get a corresponding increase in search speed.
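For instance (the explicit parentheses are added here for clarity, and this assumes the word EventCode appears literally in _raw, as above), adding the bare word gives the index a term to match on, so fewer non-matching events are pulled off disk:

EventCode (EventCode="0" OR EventCode="4624") earliest=-30m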
As for reducing the time spent in transaction, the drastic solution is to do away with transaction entirely and use only eval, streamstats and stats. The search language gets a little scary looking, but here's an example:
EventCode="0" OR EventCode="4624" | streamstats count | eval transaction_id=if(EventCode=="4624",ComputerName+"."+count,null()) | eval inOrOut=if(EventCode=="4624","in",if(EventCode=="0","out",null())) | streamstats last(transaction_id) as transaction_id by ComputerName | streamstats current=f last(inOrOut) as inOrOut2 by ComputerName | eval inOrOut2=if(inOrOut=="in","in",inOrOut2) | search inOrOut2="in" | stats min(_time) as earliestTime max(_time) as latestTime by transaction_id | eval duration=latestTime-earliestTime | where duration>2 AND duration<30
The transaction command is a relatively expensive operator in the Splunk pipeline. I would spend more time trying to refine your base search if you can. Meaning: if you have a specific sourcetype (it looks like you're searching WinEvent::Security logs), refine your search with that sourcetype rather than searching across all of the data in your index.
So, your search might look something like this:
sourcetype="WinEvent::Security" EventCode="0" OR EventCode="4624" earliest=-30m | transaction ComputerName startswith=(EventCode="4624") endswith=(EventCode="0") maxevents=2 | eval duration=round(duration,1) | search duration<30 AND duration>2 | stats stdev(duration) as StdDevDuration avg(duration) as AvgDuration count(ComputerName) as count_value | table AvgDuration, StdDevDuration, count_value
It is very difficult to do summary indexing with transaction, because you can't be sure that both events that compose the transaction will be present in the time range of your summary-populating search. So, while I wouldn't rule it out right away, I'd consider it far more difficult.
In terms of accelerating the search you have, look at what you have to the left of the first | character, the base search. How long does just this part take to run?
EventCode="0" OR EventCode="4624" earliest=-30m
Can you give Splunk any additional hints to help filter down the set of events? The string "0" is likely very common. Is there a source you can add to the search to help filter down further?
This part (EventCode="0" OR EventCode="4624" earliest=-30m) takes close to 3 minutes to run. I tried adding source/sourcetype, but it doesn't improve the performance by much. These two events come one below the other, so maybe the transaction command can be changed if that is the culprit.