Splunk Search

How to improve search efficiency for monthly vpn users login data averaging 50 million events per day?

Communicator

I would like to get advice on how to make the search below more efficient (and probably simpler). The sourcetype contains massive amounts of log and the search takes about 24hours or more to finish. Also, let's consider that summary indexing is not an option at this point of time.

Below is what the search is:

  1. The search will simply look for vpn users logging in for 11 or more days in a month.
  2. There are two regex field extractions because the data sources have different formats.
  3. The sourcetype has an average of 50 million events per day.

sourcetype=vpn <my search>|bucket _time span=1d |rex field=_raw "(?i)\(LOCAL\\\(?P<Username>[^)]+)\)" | rex field=_raw "(?i)\((?P<UserAcct>[^\d]+)\)"| eval user=coalesce(Username,UserAcct)|fields + Username UserAcct user|convert ctime(_time) as timestamp | stats dc(timestamp) AS "Days_Logged_In" by user|where Days_Logged_In>11|sort -Days_Logged_in

Tags (3)
1 Solution

Communicator

All answers have different good points and trying out each of them would probably help anyone who has a similar issue decide what's best. In my case, I decided to stick with the original search and make it a scheduled search since the need for the information is not urgent. Running it as a report in the background made more sense.

View solution in original post

0 Karma

Communicator

All answers have different good points and trying out each of them would probably help anyone who has a similar issue decide what's best. In my case, I decided to stick with the original search and make it a scheduled search since the need for the information is not urgent. Running it as a report in the background made more sense.

View solution in original post

0 Karma

Splunk Employee
Splunk Employee

The search is basically okay, though I would probably slightly shorten it as:

sourcetype=vpn <my search>
| rex "(?i)\(LOCAL\\\(?P<Username>[^)]+)\)"
| rex "(?i)\((?P<UserAcct>[^\d]+)\)"
| eval user=coalesce(Username,UserAcct)
| bucket _time span=1d
| stats dc(_time) as days_logged_in by user
| where days_logged_in > 11
| sort - days_logged_in

If this is a 100% dense search, (i.e., the events returned is pretty much all the events in the index for the time period in question) then I would expect this to run in 20 minutes 8 hours or less on a single server with the recommended CPU and disk. Probably less. However, if the events are more spare or rare within the index, then it would probably run proportionately slower. If they're 10%, then it would take about 10 times as long (i.e., 5 to 10 hours) and if they're 1 in 20, about 20 times as long, down till about 1 in 500 or 1 in 1000 when it will stabilize to a constant level. (This is known as a sparse search at this point, see http://docs.splunk.com/Documentation/Splunk/6.1.2/Installation/HowsearchtypesaffectSplunkperformance for explanation.)

If you want better performance, you can:

  • Try to isolate the selected events to their own index, or at least with less "other" data so that the search is more dense
  • Use summarization, or report acceleration, or create an accelerated data model
  • Check to be sure that the hardware is giving you the real CPU and disk performance required
  • Distribute the index over multiple indexer nodes
  • A combination of the above

Edit: Sorry, I thought it was a total of 50 million, not 50 million per day. So that should be 30 times what I suggested, or 7 to 10 hours, not 15-20 minutes.

Communicator

To clarify, the 24 hours I originally mentioned, I ran the search during peak hours of our infra. I ran the job on the background on off peak hours and finished the job after 6 hours.

0 Karma

Communicator

Thanks gkanapathy.

So I ran a background job on this earlier and was still running as I posted the question. It finished at around 6 hours which is close to the estimate. So I am expecting any optimization on the search itself will still come around the same duration.

0 Karma

Splunk Employee
Splunk Employee

Sorry, I thought it was a total of 50 million, not 50 million per day. So that should be 30 times what I suggested, or 7 to 10 hours, not 15-20 minutes.

SplunkTrust
SplunkTrust

How is loading 1.5 billion events taking 20 minutes on a single reasonably-specced server? My maths suggest 1.25Meps for that, which sounds like a bit much.

0 Karma

Splunk Employee
Splunk Employee

Note that the field extractions are not a significant contributor to the performance. You could optimize them away and it's not likely to make a difference.

0 Karma

Contributor

Some things you could do:

  • Summary indexing (which you dismissed, but hey, it might be your best bet
  • Setting up the extractions as index time extractions to offload some of that workload
  • Use a lookup table (see below)

Create a search that looks over the last day and schedule it to run daily. The search will be basically what you already have, but will record the date and the last login time. For example:

sourcetype=vpn earliest=-24h | <all your field extractions> | dedup user | table _time, user | outputlookup vpn.csv

Now, modify that so that it appends the data (as opposed to overwritting every run) and keeps the last 30 days:

sourcetype=vpn earliest=-24h | <all your field extractions> | dedup user | table _time, user  | append [|inputlookup vpn.csv] | where _time>(now()-(30*24*60*60)) | outputlookup vpn.csv

Now, you can on-demand run a stats count command against that lookup table:

|inputlookup vpn.csv | stats count by user

The above searchs are still a little messy, but I think they illustrate the method better. Here is the scheduled search I would actually run every 2 hours:

sourcetype=vpn earliest=-2h | <all your field extractions> | eval day=round(_time/86400) | dedup user, day | table day, user | append [|inputlookup vpn.csv] | where day>round((now()-(2592000))) | outputlookup vpn.csv

Communicator

Hi gkanapathy,
Thanks for that note, will keep that in mind.

0 Karma

Splunk Employee
Splunk Employee

You're more or less re-creating summarization into a big CSV lookup with this method. It's a little more efficient that summarization, but less flexible.

0 Karma

Communicator

rtadams,
Very interesting!!! I think I will give this a test run.

0 Karma

SplunkTrust
SplunkTrust

There's a nice writeup of using lookups in Splunk Blogs: http://blogs.splunk.com/2011/01/11/maintaining-state-of-the-union/

SplunkTrust
SplunkTrust

Easiest answer, make summary indexing an option 🙂

To properly look at this, check the search job inspector for what takes the most time. You can run the search over an hour or a day of data for this if you don't have a month-long job still available.
Once you know what takes the most time you know where best to start optimizing.

Additionally, what's in <my search>? If your number of scanned events is high vs your number of matched events then those filters may be inefficient.

If all optimizing falls short and summary indexing remains not an option you can still index the user fields.

Communicator

Martin, this is a great point. It'll be on my list to further narrow down events. Thanks.

0 Karma

SplunkTrust
SplunkTrust

Any wildcards in those minimal strings? Do all your events matched by those strings contain user names?

Without knowing what's taking up the most time it's just guesswork. Additionally, without knowing your data it's hard to find the best optimization - loading as little data as needed. For example, you might now load all the events that contain a user name, but maybe it'd be enough to only load sign-in events.

0 Karma

Communicator

Martin,
Thank you for the insight and I like the idea of troubleshooting the bottleneck. To answer your question, '' contains very minimal strings, in fact just a 5-letter word.

I've been considering the summary indexing option as it seems to be my best bet so far, but I would be interested if someone can still tweak this more.

0 Karma