I need to create an alert that fires when all of the queues below are at 100% for a given indexer. I am using the built-in "DMC Alert - Saturated Event-Processing Queues" alert, but I need to tweak it so that it fires only when all four queues ("aggQueue.*", "indexQueue.0*", "parsingQueue.*", and "typingQueue.0") are at 100% for that host.
| rest splunk_server_group=dmc_group_indexer /services/server/introspection/queues
| search title=tcpin_queue* OR title=parsingQueue* OR title=aggQueue* OR title=typingQueue* OR title=indexQueue*
| eval fifteen_min_fill_perc = round(value_cntr3_size_bytes_lookback / max_size_bytes * 100,2)
| fields title fifteen_min_fill_perc splunk_server
| where fifteen_min_fill_perc > 99
| rename splunk_server as Instance, title AS "Queue name", fifteen_min_fill_perc AS "Average queue fill percentage (last 15min)"
Queue name Average queue fill percentage (last 15min) Instance
I use this search:
index=_internal source=*metrics.log sourcetype=splunkd group=queue
| eval name=case(name=="aggqueue","2 - Aggregation Queue", name=="indexqueue", "4 - Indexing Queue", name=="parsingqueue", "1 - Parsing Queue", name=="typingqueue", "3 - Typing Queue", name=="splunktcpin", "0 - TCP In Queue", name=="tcpin_cooked_pqueue", "0 - TCP In Queue")
| eval max=if(isnotnull(max_size_kb),max_size_kb,max_size)
| eval curr=if(isnotnull(current_size_kb),current_size_kb,current_size)
| eval fill_perc=round((curr/max)*100,2)
| bin _time span=1m
| stats Median(fill_perc) AS "fill_percentage" max(max) AS max max(curr) AS curr by host, _time, name
| where (fill_percentage>70 AND name!="4 - Indexing Queue") OR (fill_percentage>70 AND name="4 - Indexing Queue")
| sort -_time
Removing tcpin_queue* and counting the number of distinct base queue names by Splunk instance should allow you to alert when all 4 queues across any number of pipelines have breached your threshold:
| rest splunk_server_group=dmc_group_indexer /services/server/introspection/queues
| search ```title=tcpin_queue* OR``` title=parsingQueue* OR title=aggQueue* OR title=typingQueue* OR title=indexQueue*
| eval fifteen_min_fill_perc = round(value_cntr3_size_bytes_lookback / max_size_bytes * 100,2)
| fields title fifteen_min_fill_perc splunk_server
| where fifteen_min_fill_perc > 99
| rex field=title "(?<basename>[^.]+)"
| eventstats dc(basename) as distinct_count by splunk_server
| where distinct_count==4
| fields - basename distinct_count
| rename splunk_server as Instance, title AS "Queue name", fifteen_min_fill_perc AS "Average queue fill percentage (last 15min)"
I've added the rex, eventstats, where, and fields commands on lines 6-9 to your original search.
In my own environments, I also keep an eye on blocked queues:
| tstats latest(PREFIX(max_size_kb=)) as max_size_kb latest(PREFIX(largest_size=)) as largest_size where index=_internal source=*metrics.log* TERM(group=queue) TERM(blocked=true) by host PREFIX(name=)
@tscroggins Thank you for looking into my query. I tried the search you posted, and the results are the same as those from my search. What I am looking for is a consolidated report. For example, in the output I pasted in my original post, instance "Y" has all four queues full (parsingQueue* OR title=aggQueue* OR title=typingQueue* OR title=indexQueue), so my output should contain only this instance name. I will set up an alert for this host for further action. Any suggestions, please?
In the table in your original post, only instance X would pass the new where clause. If you want to reduce the results to just an instance name, you can add stats, dedup, etc. to your search:
| stats count by splunk_server
| fields - count
These would replace the rename command.
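Put together, the full search might look like the sketch below. It is assembled only from the commands already shown in this thread and has not been run against your environment. The fields - basename distinct_count step is left out here because stats keeps only splunk_server and count anyway.

```spl
| rest splunk_server_group=dmc_group_indexer /services/server/introspection/queues
| search title=parsingQueue* OR title=aggQueue* OR title=typingQueue* OR title=indexQueue*
| eval fifteen_min_fill_perc = round(value_cntr3_size_bytes_lookback / max_size_bytes * 100,2)
| fields title fifteen_min_fill_perc splunk_server
| where fifteen_min_fill_perc > 99
| rex field=title "(?<basename>[^.]+)"
| eventstats dc(basename) as distinct_count by splunk_server
| where distinct_count==4
| stats count by splunk_server ```one row per instance; takes the place of rename```
| fields - count
```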
The query seems to work, but only partially. When I run it, I get results for a splunk_server where one of the parsing queue pipelines is not above the threshold I set (> 80). Per my requirement, server xyz should not show up, because its parsing_queue.0 is not above the threshold. (It should be reported only if all 4 queues on all 3 of its pipelines are above 80.)
title fifteen_min_fill_perc splunk_server
I would also appreciate it if you could help me understand why dc is used here and how it works.
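To illustrate how dc behaves here, the self-contained sketch below uses made-up sample values for a hypothetical server xyz with two pipelines. The sample data, the 80 threshold, and the field names saturated, saturated_queue_types, total_pipeline_queues, saturated_pipeline_queues, and all_pipelines_saturated are assumptions for illustration only. dc counts distinct values, so once rex strips the pipeline suffix it counts queue types rather than individual pipeline queues: parsingQueue.1 alone is enough for parsingQueue to be counted, even when parsingQueue.0 is below the threshold. Comparing the number of saturated rows with the total number of rows per server is one possible way to require every pipeline instead.

```spl
| makeresults
| eval row=split("xyz:parsingQueue.0:42;xyz:parsingQueue.1:95;xyz:aggQueue.0:95;xyz:aggQueue.1:96;xyz:typingQueue.0:97;xyz:typingQueue.1:98;xyz:indexQueue.0:99;xyz:indexQueue.1:99", ";") ```hypothetical sample data, not real output```
| mvexpand row ```one event per pipeline queue```
| eval f=split(row, ":"), splunk_server=mvindex(f, 0), title=mvindex(f, 1), fifteen_min_fill_perc=tonumber(mvindex(f, 2))
| rex field=title "(?<basename>[^.]+)" ```parsingQueue.0 and parsingQueue.1 both become parsingQueue```
| eval saturated=if(fifteen_min_fill_perc > 80, 1, 0) ```assumed threshold of 80```
| stats dc(eval(if(saturated==1, basename, null()))) as saturated_queue_types count as total_pipeline_queues sum(saturated) as saturated_pipeline_queues by splunk_server
| eval all_pipelines_saturated=if(saturated_pipeline_queues == total_pipeline_queues, "yes", "no")
```

With this sample, xyz ends up with saturated_queue_types=4, which is why it passes the distinct_count==4 test, but only 7 of its 8 pipeline queues are saturated, so all_pipelines_saturated is "no". In a real search, this comparison would have to run before the where fifteen_min_fill_perc filter removes the unsaturated rows; otherwise the total can no longer be counted.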