Any ideas on how to pull a random sample for the logging application that spans the full month and does not specify sources or source types? We’re trying to make this generic enough that it can be applied to any system that starts logging, so we can scan samples of whatever raw data it has logged. The query that has been used historically only pulls the 25 most recent events:
index=co_lob co_id=app1 co_env=prod
| head 25
| stats latest(_time) as latestinput, latest(source) as source, latest(_raw) as latestraw, count by host, index, co_id, sourcetype, co_env
| convert timeformat="%Y-%m-%d %H:%M:%S" ctime(latestinput) AS latestinput
| eval application="app1"
| table application, count, host, index, latestinput, latestraw, source, sourcetype, co_id, co_env
I found the information on random() and tried:
index=co_lob co_id=app1 co_env=prod | eval rand=random() % 50 | head 50
and was going to go from there to extract into the right table format for the scanning, but even just running it for the week to date, it times out. I’m trying to get a random 50 or 100 events from across an entire month. Using Event Sampling doesn’t work because even at 1 : 100,000,000, some of these applications are logging millions of transactions an hour, so it causes performance issues and returns too much for review.
Thank you in advance for any guidance 🙂
Thank you, will give that a shot
Firstly, as a general approach to looking over longer time periods and large datasets, summary indexing can be a solution, but to address the specific performance question...
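For reference, a minimal summary-indexing sketch; the hourly time window and the summary index name co_summary are assumptions, and the summary index would need to exist and be populated by a scheduled search:
index=co_lob co_id=app1 co_env=prod earliest=-1h@h latest=@h
| stats count, latest(_time) as latestinput by host, index, co_id, sourcetype, co_env
| collect index=co_summary
Your monthly review could then run against the much smaller summary index instead of the raw data.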
You could generate a random value of a known field inside a subsearch. For example, if you have the date_* fields auto-generated in your search:
index=co_lob co_id=app1 co_env=prod [
| makeresults
| eval date_second=random() % 60
| fields date_second
]
...
then that will pick a random value for the second and ONLY look for records with that value. You could go further and also choose a random minute, or even a full time range of a random few seconds within the month (a sketch of that variant follows the next example):
index=co_lob co_id=app1 co_env=prod [
| makeresults
| eval date_second=random() % 60, date_minute=random() % 60
| fields date_second date_minute
]
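For the random-few-seconds variant, a subsearch that returns fields named earliest and latest sets the time range of the outer search. A minimal sketch, assuming a 30-day lookback (2592000 seconds) and a 5-second window, both of which are arbitrary choices you can adjust:
index=co_lob co_id=app1 co_env=prod [
| makeresults
| eval earliest=now() - (random() % 2592000)
| eval latest=earliest + 5
| fields earliest latest
]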
If you combine that with sampling, that may help.
You could also look at known values of other fields in your data and then randomise those in the subsearch.
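For instance, a minimal sketch that picks one random host value seen in the index (host is just an example of an indexed field; tstats keeps the enumeration fast):
index=co_lob co_id=app1 co_env=prod [
| tstats count where index=co_lob by host
| eval r=random()
| sort r
| head 1
| fields host
]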
Another very good way to improve performance is to use the TERM() directive, which can significantly increase search performance if the TERM you search for actually exists in your data. For example, if your _raw data actually contains co_id=app1 and co_env=prod, with no major breaker characters, then you should use
index=co_lob TERM(co_id=app1) TERM(co_env=prod)
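Putting the pieces together with the table format from the original search, a rough sketch; this assumes the TERM() values exist verbatim in _raw and keeps the 50-event cap you tried, with the random second/minute subsearch from above:
index=co_lob TERM(co_id=app1) TERM(co_env=prod) [
| makeresults
| eval date_second=random() % 60, date_minute=random() % 60
| fields date_second date_minute
]
| head 50
| stats latest(_time) as latestinput, latest(source) as source, latest(_raw) as latestraw, count by host, index, co_id, sourcetype, co_env
| convert timeformat="%Y-%m-%d %H:%M:%S" ctime(latestinput) AS latestinput
| eval application="app1"
| table application, count, host, index, latestinput, latestraw, source, sourcetype, co_id, co_env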