Splunk Search

Need help with query performance improvement

username13
Explorer

Hi guys,

I need to evaluate a disruption. It can last multiple hours, so I need to use data that is at least 4h old. The query needs to show all disruptions that last longer than 15 minutes, with their starting timestamp and their last occurring timestamp. To group the logged events, I need a transaction that also contains the field CompleteDescription. If this field contains specific values (which can be seen in the query), it is a disruption.

The query I've built works so far, but it is too slow to collect data from multiple hours. Does anyone have an idea how to improve the query's performance?

Thank you!

 

 

index=log sourcetype=servlog
| transaction ThreadId host maxspan=180s startswith=(LogMessage=start) endswith=(LogMessage=end) 
| stats earliest(_time) as "first", latest(_time) as "last", count by Type, CompleteDescription
| eventstats sum(count) as count_full by Type, CompleteDescription 
| eventstats sum(count_full) as total by Type
| eval percentage = round((count_full/total)*100,0)
| eval time_diff = round((last - first)/60, 0)
| eval CompleteDescription=upper(CompleteDescription) 
| search Type!=SSL (CompleteDescription = "MISSING RESPONSE" OR CompleteDescription = "TIMEOUT" OR CompleteDescription = "TECHNICAL ERROR" OR CompleteDescription = "INTERNAL SYSTEM ERROR" OR CompleteDescription = "NO REACHABILITY") total >= 10 percentage >= 50 time_diff >= 30
| convert ctime(first) ctime(last) 
| table Type, CompleteDescription, count_type, count, percentage 
| sort - percentage, total


johnhuang
Motivator

Why did you specify maxspan=180s for the transaction if your goal is to look for events longer than 15 minutes?

 

| transaction ThreadId host maxspan=180s startswith=(LogMessage=start) endswith=(LogMessage=end) 
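
If the goal is to capture disruptions of 15 minutes or more, the span cap has to be at least that long. As a sketch (the 4h value is an assumption here, based on the multi-hour disruptions you describe):

index=log sourcetype=servlog
| transaction ThreadId host maxspan=4h startswith=(LogMessage=start) endswith=(LogMessage=end)

Note that maxspan accepts s, m, h, and d units, so you can size it to the longest disruption you expect to see.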

 


You dropped the fields ThreadId and host from the subsequent grouping with stats. Was this intentional?

 

| stats earliest(_time) as "first", latest(_time) as "last", count by Type, CompleteDescription

 

 

Expanding on @gcusello's solution, we can make a simple adjustment to give you output similar to your original query.

 

| stats range(_time) AS duration_secs earliest(_time) AS first latest(_time) AS last count values(CompleteDescription) AS CompleteDescription values(Type) AS Type BY ThreadId host

 


You can also move one of your filters up to improve efficiency:

 

index=log sourcetype=servlog
| stats range(_time) AS duration_secs earliest(_time) AS first latest(_time) AS last count values(CompleteDescription) AS CompleteDescription values(Type) AS Type BY ThreadId host
| where duration_secs>=1800
| eventstats sum(count) as count_full by Type, CompleteDescription 
| eventstats sum(count_full) as total by Type
| eval percentage = round((count_full/total)*100,0)
| eval time_diff = round((last - first)/60, 0)
| eval CompleteDescription=upper(CompleteDescription) 
| search Type!=SSL (CompleteDescription = "MISSING RESPONSE" OR CompleteDescription = "TIMEOUT" OR CompleteDescription = "TECHNICAL ERROR" OR CompleteDescription = "INTERNAL SYSTEM ERROR" OR CompleteDescription = "NO REACHABILITY") total >= 10 percentage >= 50 time_diff >= 30
| convert ctime(first) ctime(last) 
| table Type, CompleteDescription, count_type, count, percentage 
| sort - percentage, total

 


If you're still having issues, could you provide sample events?


gcusello
SplunkTrust

Hi @username13,

the transaction command is a very slow command.

If the pair ThreadId + host is unique (in other words, if startswith and endswith are additional conditions and not essential), you could put those fields in the stats command and get a faster search:

index=log sourcetype=servlog
| stats earliest(_time) as "first" latest(_time) as "last" count by ThreadId host Type CompleteDescription
| eventstats sum(count) as count_full by Type, CompleteDescription 
| eventstats sum(count_full) as total by Type
| eval percentage = round((count_full/total)*100,0)
| eval time_diff = round((last - first)/60, 0)
| eval CompleteDescription=upper(CompleteDescription) 
| search Type!=SSL (CompleteDescription = "MISSING RESPONSE" OR CompleteDescription = "TIMEOUT" OR CompleteDescription = "TECHNICAL ERROR" OR CompleteDescription = "INTERNAL SYSTEM ERROR" OR CompleteDescription = "NO REACHABILITY") total >= 10 percentage >= 50 time_diff >= 30
| convert ctime(first) ctime(last) 
| table Type, CompleteDescription, count_type, count, percentage 
| sort - percentage, total
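
If you do need to keep transaction, a common mitigation (a sketch; assuming the fields listed below are the only ones your pipeline uses) is to filter events and trim fields before transaction runs, so it processes fewer and smaller events:

index=log sourcetype=servlog
| fields _time ThreadId host Type CompleteDescription LogMessage
| transaction ThreadId host maxspan=180s startswith=(LogMessage=start) endswith=(LogMessage=end)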

Ciao.

Giuseppe

 


username13
Explorer

Hi Giuseppe,

thank you very much for your help. Unfortunately, my data is incomplete without the transaction command, so I guess I still have to keep it. But I'll keep it in mind! 🙂
