
How to pull a random sample across a period of time?

Kishi_B
New Member

Any ideas on how to pull a random sample for a logging application that spans the full month and does not specify sources or source types? We’re trying to make this generic enough that it can be applied to any system that starts logging, so we can scan samples of whatever raw data it has logged. The query that has been used historically only pulls the 25 most recent events:

index=co_lob co_id=app1 co_env=prod
| head 25
| stats latest(_time) as latestinput, latest(source) as source, latest(_raw) as latestraw, count by host, index, co_id, sourcetype, co_env
| convert timeformat="%Y-%m-%d %H:%M:%S" ctime(latestinput) AS latestinput
| eval application="app1"
| table application, count, host, index, latestinput, latestraw, source, sourcetype, co_id, co_env

 

I found the information on random() and tried:

index=co_lob co_id=app1 co_env=prod | eval rand=random() % 50 | head 50

 

and was going to go from there to extract into the right table format for the scanning, but even just running it for the week to date, it times out. I'm trying to get a random 50 or 100 events from across an entire month. Using Event Sampling doesn't work either: even at a ratio of 1 : 100,000,000, some of these applications log millions of transactions an hour, so it causes performance issues and is still too much for review.

 

Thank you in advance for any guidance 🙂


Kishi_B
New Member

Thank you, will give that a shot


bowesmana
SplunkTrust

Firstly, as a general solution for searching over longer time periods and large datasets, summary indexing can be a good option; a sketch of that idea follows.
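As a rough sketch of that idea (untested, and the summary index name summary_samples is just a placeholder you would need to create), schedule an hourly search that keeps a handful of random events and writes them to the summary index:

index=co_lob co_id=app1 co_env=prod earliest=-1h@h latest=@h
| eval r=random()
| sort 5 r
| fields - r
| collect index=summary_samples

Your monthly review then only has to search the small summary index:

index=summary_samples earliest=-1mon@mon latest=@mon
| table _time, host, index, source, sourcetype, _raw

But to address the specific performance question...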

You could generate a random value for a known field inside a subsearch. For example, if you have the auto-generated date_* fields in your search:

index=co_lob co_id=app1 co_env=prod [
  | makeresults 
  | eval date_second=random() % 60 
  | fields date_second
]
...

then that will pick a random value for the second and ONLY look for records with that value. You could go further and also choose a random minute, or even a narrow window of a random few seconds within the month:

index=co_lob co_id=app1 co_env=prod [
  | makeresults 
  | eval date_second=random() % 60, date_minute=random() % 60
  | fields date_second date_minute
]
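To take the random-few-seconds idea further, a subsearch can also return earliest and latest fields, which Splunk applies as the time bounds of the outer search. A sketch (assuming a roughly 30-day month, i.e. 2592000 seconds) that picks a random 10-second window within the previous month:

index=co_lob co_id=app1 co_env=prod [
  | makeresults
  | eval earliest=relative_time(now(), "-1mon@mon") + (random() % 2592000)
  | eval latest=earliest + 10
  | fields earliest latest
]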

If you combine that with event sampling, that may help.

You could also look at known values of other fields in your data and then randomise those in the subsearch, as in the sketch below.
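For instance (a sketch, assuming host is a sensible field to randomise on), tstats can cheaply enumerate the hosts in the index and the subsearch can pick one at random:

index=co_lob co_id=app1 co_env=prod [
  | tstats count where index=co_lob by host
  | eval r=random()
  | sort 1 r
  | fields host
]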

Another very good way to improve performance is the TERM() directive, which can significantly speed up the search if the term you search for actually exists in your data, e.g. if your _raw data actually contains co_id=app1 and co_env=prod with no major breaker characters, then you should use

index=co_lob TERM(co_id=app1) TERM(co_env=prod)
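Putting those pieces together, a rough, untested sketch of the whole thing: filter with TERM(), narrow to a random second via the subsearch, then sort on a random value to keep 50 events. (The sort on the random field is also what the original attempt was missing; eval rand=random() followed by head 50 on its own just returns the 50 most recent events.)

index=co_lob TERM(co_id=app1) TERM(co_env=prod) [
  | makeresults
  | eval date_second=random() % 60
  | fields date_second
]
| eval r=random()
| sort 50 r
| fields - r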

 
