Monitoring Splunk

How to Know What Buckets Are Searched (hot/warm/cold)

David
Splunk Employee
Splunk Employee

I'm trying to analyze the effectiveness of our bucketing strategy, to ensure that we're making optimal use of hot/warm bucket SSDs. I would like to be able to look across all of our searches to see what hits hot/warm buckets and what hits cold buckets, ideally including how many events are retrieved and how much time was spent, but at least how many cold buckets were needed.

I don't see this being logged by default (with the except of results_preview.csv, which only covers interactive searches and I don't seem to have many of them). I've tried starting with --debug, and I've tried selectively incrementing a bunch nice-sounding log channels to debug. I understand that I can make an educated guess by looking at the search times and comparing with bucket timestamps (via |dbinspect), but there are a lot of downsides to that approach (time delta between search and analysis, some buckets may not contain the term for needle/haystack searches, no detail on what was retrieved, etc.).

Any pointers for how I can pull this type of information?

Edit: ...Bumping on the off chance that anyone missed this earlier...

AzJimbo
Path Finder

I was playing with koshyk's query and came up with this one.

Make sure you use fast mode and event sampling 1:100,000 or else it might take forever.

index=whatever
|rename _bkt AS bucketId, _cd AS cd
|stats earliest(_time) as first_event latest(_time) as last_event
by bucketId
|join type=left bucketId
[|dbinspect index=whatever |fields bucketId, state]
|stats first(first_event) AS first_event, last(last_event) AS last_event dc(bucketId) AS buckets by state
|eval last_event=strftime(last_event,"%m/%d/%Y %H:%M")
|eval first_event=strftime(first_event,"%m/%d/%Y %H:%M")
|table first_event, last_event, state buckets |sort - last_event

And if you want to see 'em all:

index=*
|rename _bkt AS bucketId, _cd AS cd
|stats earliest(_time) as first_event latest(_time) as last_event 
by bucketId, index
|join type=left bucketId
[|dbinspect index=* |fields bucketId, state index]
|stats first(first_event) AS first_event, last(last_event) AS last_event dc(bucketId) AS buckets  by state index
|eval last_event=strftime(last_event,"%m/%d/%Y %H:%M")
|eval first_event=strftime(first_event,"%m/%d/%Y %H:%M")
|table index first_event, last_event, state buckets |sort - last_event

koshyk
Super Champion

Found almost a way to identify the event to bucketId and where it came from

index=main| head 100 | rename _bkt as bucketId| rename _cd as cd|   stats count by bucketId,cd,_raw| join type=left bucketId [|dbinspect index=main ] | table _raw,bucketId,state,splunk_server

The logic is
- find the events and bucketId, _cd
- join this data with dbinspect

0 Karma

koshyk
Super Champion

upvoted. Hope Splunk will give a method to map buckets to events

0 Karma

Lowell
Super Champion

Hey David!

Here's a proof-of-concept to determine which buckets are accessed for each search. The approach is rather low-level, Linux specific, and has several rough edges. (Aka, don't try this on production). Disclaimers aside, I think it may give you the kind of answers you are looking for, or at least provide a jumping-off point.

In Splunk 6 (and possibly earlier), there's a settings called search_process_mode which allows "debugging" all search commands. Essentially, this allows you to run an arbitrary command instead of the normal splunkd search ... process. So basically we can drop in a debugging script to intercept the new search request and then run the standard worker process. This proof of concept uses the strace command to tracks file activity, which includes every open() call to each bucket. The file activity gets dumped to a strace.log file thrown into the dispatch folder.

Here's the basic setup:

inputs.conf:

[monitor://$SPLUNK_HOME/var/run/splunk/dispatch]
index = _internal
whitelist = strace\.log$
sourcetype = splunk_search_strace

limits.conf:

[search]
search_process_mode = debug search-perf.sh

$SPLUNK_HOME/bin/scripts/search-perf.sh

#!/bin/bash
shift
search_id=`echo "$*" | sed -re 's/.* --id=([^ ]+).*/\1/'`
echo "`date` [${search_id}] START SEARCH   ARGS: $*" >> $SPLUNK_HOME/var/log/splunk/search-perf.log

strace -ttt -f -e trace=file $SPLUNK_HOME/bin/splunkd $* 2> $SPLUNK_HOME/var/run/splunk/dispatch/${search_id}/strace.log
echo "`date` [${search_id}] END SEARCH" >> $SPLUNK_HOME/var/log/splunk/search-perf.log

Here's an example search that shows a list of indexes and buckets accessed per search:

index=_internal sourcetype=splunk_search_strace open (merged_lexicon OR .tsidx OR Hosts.data OR Sources.data OR Sourcetypes.data)
| rex field=source "/dispatch/(?<sid>[^/]+)/"
| rex "open\(\"(?<path>[^\"]+)\","
| rex field=path "/(?<idx>[^/]+)/(?<phase>[^/]+)/(?<bkt>[^/]+)/[^/]+$"
| transaction sid
| table sid, path, duration, idx, phase, bkt

Right now you can use this to see what buckets are being accessed, but I think you could expand the example search to see which buckets are scanned (in other words, add some logic looking for open calls to 'rawdata/journal.gz' rather than just open calls to tsidx files; I'm guessing you could also look at the effectiveness of bloom filters using this technique too). I'm sure there's a way to get statistics on IO usage if you (or someone you know) has voodoo strace skills to determine times take per read and amount of data accessed.


Other random notes

I found out that you can just run strace without redirecting stderr to a log file and it shows up in the splunkd.log. (Search for "ProcessDispatchedSearch - PROCESS_SEARCH") I thought about using this approach, but then you can't tell which search the message came from. Additional, using the dispatch folder has the advantage of using Splunk's default job aging behavior to cleanup work for us.

Also, I saw in the docs a new setting in Splunk 6.1 that looks promising, but I couldn't locate where these new metrics go. But I could have been looking in the wrong location.

limits.conf:

[search_metrics]
debug_metrics = true

This may be worth a second look.

martin_mueller
SplunkTrust
SplunkTrust

The smaller issue may be solvable if you add a subsearch using dbinspect to your search and use its output to add "debug rows" to your results... that's assuming your use of the search allows extra info without breaking. That should give you the then-current bucket info any time the search is run.

0 Karma

David
Splunk Employee
Splunk Employee

Agreed -- that scenario could take me 40% there.. but it also has the issue of requiring that the search be run very often to get an accurate result. (i.e., buckets will have rolled between when Search X ran and when I run |dbinspect.) That's in addition to the bigger issue you pointed out, of having no intelligence to say whether bloom filters or metadata excluded all the cold buckets because they're looking for a new host/feature/etc.

0 Karma

martin_mueller
SplunkTrust
SplunkTrust

I don't have a turnkey solution, but here's a tidbit to make your educated guesses a bit easier:

| dbinspect index=foo

Run that over some timerange, and it'll tell you the buckets from that index for the given timerange as well as their state of heat. You get their size and number of events, but not the number of events scanned or matched for any particular search.

Get Updates on the Splunk Community!

Video | Welcome Back to Smartness, Pedro

Remember Splunk Community member, Pedro Borges? If you tuned into Episode 2 of our Smartness interview series, ...

Detector Best Practices: Static Thresholds

Introduction In observability monitoring, static thresholds are used to monitor fixed, known values within ...

Expert Tips from Splunk Education, Observability in Action, Plus More New Articles on ...

Splunk Lantern is a Splunk customer success center that provides advice from Splunk experts on valuable data ...