On IIS logs, suppose I have 60,000 transactions per 24 hours. How can I get a random sample of, say, 5,000 events? I need a random sample for each day over, say, the last 50 days. I want to build control charts based on response time (time_taken) from the IIS logs.
Here is a random sample macro I use:
From macros.conf:
[Random_Sample(1)]
args = RandomSamplePercentEventsToKeep
definition = eval RandomSampleSeed = random()\
| sort 0 -RandomSampleSeed\
| eventstats count AS RandomSampleTotalEventCount\
| eval RandomSampleNumberToKeep = ceil($RandomSamplePercentEventsToKeep$ * RandomSampleTotalEventCount / 100)\
| streamstats count AS RandomSampleSerialNumber\
| where RandomSampleSerialNumber <= RandomSampleNumberToKeep\
| fields - RandomSample*
iseval = 0
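For example, to keep roughly 5,000 out of 60,000 events for one day, pass about 8.33 as the percentage. This search is only a sketch; the index name iis is an assumption, so adjust it to your environment:

index=iis earliest=-1d@d latest=@d
| `Random_Sample(8.33)`
| stats count avg(time_taken) AS avg_time stdev(time_taken) AS stdev_time

Run one such search per day (for instance via a scheduled search) to collect the 50 daily samples for your control chart.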
The latest Splunk Cloud version recently gained an event sampling feature, so it's reasonable to assume that's coming to Splunk Enterprise some day as well.
http://docs.splunk.com/Documentation/Splunk/6.3.1511/Search/Retrieveasamplesetofevents
Until then, you could fake a sampling rate of 1:60 by only looking at a specific date_second, or a sampling rate of 1:30 by looking at two seconds, and so on. If your data is sufficiently well spread, this non-random sampling should work well enough.
For both sampling approaches, make sure you don't break your transactions if they consist of multiple events each.
Alternatively, just run over all your data without sampling.
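If you do run over everything, a daily control chart can be built directly, for instance like this sketch (the index name is an assumption, and three-sigma limits are used as the usual control chart convention):

index=iis earliest=-50d@d latest=@d
| bin _time span=1d
| stats avg(time_taken) AS daily_avg by _time
| eventstats avg(daily_avg) AS center stdev(daily_avg) AS sigma
| eval ucl = center + 3 * sigma
| eval lcl = center - 3 * sigma
| table _time daily_avg center ucl lcl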
For 1:60 sampling, add date_second=42 to your search; any other second will do.
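For example (a sketch; the index name is an assumption, and for 1:30 you would OR two seconds together):

index=iis date_second=42 earliest=-1d@d latest=@d
| stats avg(time_taken) AS avg_time stdev(time_taken) AS stdev_time

index=iis (date_second=12 OR date_second=42) earliest=-1d@d latest=@d
| stats avg(time_taken) AS avg_time stdev(time_taken) AS stdev_time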
To check whether this gives you a reasonable sample, you could run statistics by second to see if there are any outliers, e.g. lots of events generated at second zero by cron jobs.
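Something like this (again, the index name is an assumption) shows how evenly events are spread across the 60 seconds:

index=iis earliest=-1d@d latest=@d
| stats count by date_second
| sort 0 date_second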
You really should consider running over your entire data set first, though. 60,000 events over 24 hours isn't that much if you have reference-spec hardware or better.
Can you give me an example of how to fake the sampling?