Hi guys,
I need to evaluate a disruption. It can last multiple hours, so I need to use data which is at least 4h old. This query needs to show all disruptions that are longer than 15 minutes with it's starting timestamp and it's last occurring timestamp. To group all logged events, I need a transaction which also contains the field CompleteDescription. If this field contains specific values which can be seen in the query, it is a disruption.
The query I've build works so far but is to slow to collect data from multiple hours. Does anyone have an idea how to improve the query for more performance?
Thank you!
index=log sourcetype=servlog
| transaction ThreadId host maxspan=180s startswith=(LogMessage=start) endswith=(LogMessage=end)
| stats earliest(_time) as "first", latest(_time) as "last", count by Type, CompleteDescription
| eventstats sum(count) as count_full by Type, CompleteDescription
| eventstats sum(count_full) as total by Type
| eval percentage = round((count_full/total)*100,0)
| eval time_diff = round((last - first)/60, 0)
| eval CompleteDescription=upper(CompleteDescription)
| search Type!=SSL (CompleteDescription = "MISSING RESPONSE" OR CompleteDescription = "TIMEOUT" OR CompleteDescription = "TECHNICAL ERROR" OR CompleteDescription = "INTERNAL SYSTEM ERROR" OR CompleteDescription = "NO REACHABILITY") total >= 10 percentage >= 50 time_diff >= 30
| convert ctime(first) ctime(last)
| table Type, CompleteDescription, count_type, count, percentage
| sort - percentage, total
Why did you specified maxspan=180s for the transaction if your goal is to look for events longer than 15 minutes?
| transaction ThreadId host maxspan=180s startswith=(LogMessage=start) endswith=(LogMessage=end)
You dropped the field ThreadId and Host from the subsequent grouping using stats. Was this intentional?
| stats earliest(_time) as "first", latest(_time) as "last", count by Type, CompleteDescription
Expanding on the @gcusello's solution, we can make a simple adjustment to give you a similar output as your original query.
| stats range(_time) AS duration_secs earliest(_time) as "first" latest(_time) as "last" count by values(CompleteDescription) AS CompleteDescription values(Type) AS Type BY ThreadId host
You can move one of your filter up to improve efficiency
| where duration_secs>=1800
index=log sourcetype=servlog
| stats range(_time) AS duration_secs earliest(_time) as "first" latest(_time) as "last" count by values(CompleteDescription) AS CompleteDescription values(Type) AS Type BY ThreadId host
| where duration_secs>=1800
| eventstats sum(count) as count_full by Type, CompleteDescription
| eventstats sum(count_full) as total by Type
| eval percentage = round((count_full/total)*100,0)
| eval time_diff = round((last - first)/60, 0)
| eval CompleteDescription=upper(CompleteDescription)
| search Type!=SSL (CompleteDescription = "MISSING RESPONSE" OR CompleteDescription = "TIMEOUT" OR CompleteDescription = "TECHNICAL ERROR" OR CompleteDescription = "INTERNAL SYSTEM ERROR" OR CompleteDescription = "NO REACHABILITY") total >= 10 percentage >= 50 time_diff >= 30
| convert ctime(first) ctime(last)
| table Type, CompleteDescription, count_type, count, percentage
| sort - percentage, total
If you're still having issues, could you provide sample events?
Hi @username13,
the transaction command is a very slow command.
if you couple "ThreadId" "host" is unique (if in other words startswith and endswith are additional and not relevant), you could put them in the stats command having a faster search
index=log sourcetype=servlog
| stats earliest(_time) as "first" latest(_time) as "last" count by ThreadId host Type CompleteDescription
| eventstats sum(count) as count_full by Type, CompleteDescription
| eventstats sum(count_full) as total by Type
| eval percentage = round((count_full/total)*100,0)
| eval time_diff = round((last - first)/60, 0)
| eval CompleteDescription=upper(CompleteDescription)
| search Type!=SSL (CompleteDescription = "MISSING RESPONSE" OR CompleteDescription = "TIMEOUT" OR CompleteDescription = "TECHNICAL ERROR" OR CompleteDescription = "INTERNAL SYSTEM ERROR" OR CompleteDescription = "NO REACHABILITY") total >= 10 percentage >= 50 time_diff >= 30
| convert ctime(first) ctime(last)
| table Type, CompleteDescription, count_type, count, percentage
| sort - percentage, total
Ciao.
Giuseppe
Hi Giuseppe,
thank you very much for your help. Unfortunately my data is incomplete without the transaction command. So I guess I still have to keep it. But I'll keep it in mind! 🙂