I would like to get advice on how to make the search below more efficient (and probably simpler). The sourcetype contains massive amounts of logs, and the search takes about 24 hours or more to finish. Also, assume that summary indexing is not an option at this point in time.
Here is the search:
sourcetype=vpn <my search>
| bucket _time span=1d
| rex field=_raw "(?i)\(LOCAL\\\(?P<Username>[^)]+)\)"
| rex field=_raw "(?i)\((?P<UserAcct>[^\d]+)\)"
| eval user=coalesce(Username,UserAcct)
| fields + Username UserAcct user
| convert ctime(_time) as timestamp
| stats dc(timestamp) AS "Days_Logged_In" by user
| where Days_Logged_In>11
| sort -Days_Logged_In
All answers have different good points and trying out each of them would probably help anyone who has a similar issue decide what's best. In my case, I decided to stick with the original search and make it a scheduled search since the need for the information is not urgent. Running it as a report in the background made more sense.
The search is basically okay, though I would probably slightly shorten it as:
sourcetype=vpn <my search>
| rex "(?i)\(LOCAL\\\(?P<Username>[^)]+)\)"
| rex "(?i)\((?P<UserAcct>[^\d]+)\)"
| eval user=coalesce(Username,UserAcct)
| bucket _time span=1d
| stats dc(_time) as days_logged_in by user
| where days_logged_in > 11
| sort - days_logged_in
If this is a 100% dense search (i.e., the events returned are pretty much all the events in the index for the time period in question), then I would expect this to run in 8 hours or less on a single server with the recommended CPU and disk. Probably less. However, if the events are more sparse or rare within the index, then it would probably run proportionately slower. If they're 10%, then it would take about 10 times as long per matched event (i.e., 5 to 10 hours), and if they're 1 in 20, about 20 times as long, down to about 1 in 500 or 1 in 1000, at which point it will stabilize at a constant level. (This is known as a sparse search at this point; see http://docs.splunk.com/Documentation/Splunk/6.1.2/Installation/HowsearchtypesaffectSplunkperformance for an explanation.)
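If you're not sure how dense your search is, one rough way to estimate it is to compare the matched event count against the total event count for the sourcetype over a short window. This is just a sketch; index=main is an assumption, so substitute whatever index actually holds the VPN data:

sourcetype=vpn <my search> earliest=-1h | stats count
| tstats count where index=main sourcetype=vpn earliest=-1h

The ratio of the first count to the second tells you roughly where you sit on the dense-to-sparse spectrum.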
If you want better performance, you can:
Edit: Sorry, I thought it was a total of 50 million, not 50 million per day. So that should be 30 times what I suggested, or 7 to 10 hours, not 15-20 minutes.
To clarify the 24 hours I originally mentioned: I ran the search during peak hours of our infra. When I ran the job in the background during off-peak hours, it finished after 6 hours.
Thanks gkanapathy.
So I ran a background job on this earlier, and it was still running as I posted the question. It finished in around 6 hours, which is close to the estimate. So I expect any optimization of the search itself to still come in around the same duration.
How is loading 1.5 billion events taking 20 minutes on a single reasonably-specced server? My maths suggests 1.25M events per second for that, which sounds like a bit much.
Note that the field extractions are not a significant contributor to the performance. You could optimize them away and it's not likely to make a difference.
Some things you could do:
Create a search that looks over the last day and schedule it to run daily. The search will be basically what you already have, but will record the date and the last login time. For example:
sourcetype=vpn earliest=-24h | <all your field extractions> | dedup user | table _time, user | outputlookup vpn.csv
Now, modify that so that it appends the data (as opposed to overwriting it every run) and keeps the last 30 days:
sourcetype=vpn earliest=-24h | <all your field extractions> | dedup user | table _time, user | append [|inputlookup vpn.csv] | where _time>(now()-(30*24*60*60)) | outputlookup vpn.csv
Now you can run a stats count command against that lookup table on demand:
|inputlookup vpn.csv | stats count by user
The above searches are still a little messy, but I think they illustrate the method. Here is the scheduled search I would actually run every 2 hours:
sourcetype=vpn earliest=-2h | <all your field extractions> | eval day=floor(_time/86400) | dedup user, day | table day, user | append [|inputlookup vpn.csv] | where day>floor((now()-2592000)/86400) | outputlookup vpn.csv
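To get back to the original report from that day-keyed lookup, an on-demand search along these lines should work (a sketch, using the Days_Logged_In > 11 threshold from the question; dc(day) is used instead of count so any duplicate user/day rows don't inflate the result):

|inputlookup vpn.csv | stats dc(day) as days_logged_in by user | where days_logged_in > 11 | sort - days_logged_in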
Hi gkanapathy,
Thanks for that note, will keep that in mind.
You're more or less re-creating summarization as a big CSV lookup with this method. It's a little more efficient than summarization, but less flexible.
rtadams,
Very interesting!!! I think I will give this a test run.
There's a nice writeup of using lookups in Splunk Blogs: http://blogs.splunk.com/2011/01/11/maintaining-state-of-the-union/
Easiest answer, make summary indexing an option 🙂
To properly look at this, check the search job inspector for what takes the most time. You can run the search over an hour or a day of data for this if you don't have a month-long job still available.
Once you know what takes the most time you know where best to start optimizing.
Additionally, what's in <my search>? If your number of scanned events is high versus your number of matched events, then those filters may be inefficient.
If all optimizing falls short and summary indexing is still not an option, you can index the user fields.
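For reference, an index-time extraction of the user field might look something like the sketch below. The stanza and field names are made up, the regex is adapted from the rex in the question, and note that index-time extractions only apply to data indexed after the change is deployed to the indexers:

# props.conf
[vpn]
TRANSFORMS-vpnuser = vpn_user_extract

# transforms.conf
[vpn_user_extract]
REGEX = (?i)\(LOCAL\\([^)]+)\)
FORMAT = vpn_user::$1
WRITE_META = true

# fields.conf
[vpn_user]
INDEXED = true

With vpn_user indexed, the daily count could then be driven by tstats, e.g. | tstats count where sourcetype=vpn by vpn_user, _time span=1d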
Martin, this is a great point. It'll be on my list to further narrow down events. Thanks.
Any wildcards in those minimal strings? Do all your events matched by those strings contain user names?
Without knowing what's taking up the most time it's just guesswork. Additionally, without knowing your data it's hard to find the best optimization - loading as little data as needed. For example, you might now load all the events that contain a user name, but maybe it'd be enough to only load sign-in events.
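For example, something along these lines, where the quoted string is purely hypothetical and would need to match whatever your VPN's login events actually say:

sourcetype=vpn "Authentication successful" <my search> | ...

Literal strings in the base search are applied against the index itself, so Splunk reads fewer events off disk before any filtering or field extraction happens.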
Martin,
Thank you for the insight, and I like the idea of troubleshooting the bottleneck.
I've been considering the summary indexing option as it seems to be my best bet so far, but I would be interested if someone can still tweak this more.