Splunk Search

Transaction with multiple down events but only one up event results in incorrect duration calculation

NJL
Explorer

I've created a dashboard showing downtime for BGP adjacencies and WAN circuits.
It works (almost) perfectly, but recently on one circuit we've seen a down event on a circuit and it's taken a long time to come back up.
While troubleshooting the circuit, the engineers have tried several things to bring up the interface, but it has failed several times.
This troubleshooting has caused "down" events identical to the "original" down event.
I want the dashboard to be able to show the duration between the first "down" event and the recent "up" event, but what the dashboard is showing is very misleading.

This is the search used:

 index=net <host-ip> AND "%LINK-3-UPDOWN: Interface"
 | rex "%LINK-3-UPDOWN: Interface (?P<Interface_Name>(.+)),"
 | transaction host startswith="Down" endswith="Up" keepevicted=true
 | eval duration=if(duration==0,now()-_time,duration)
 | convert rmunit(duration) as numSecs
 | eval string_dur = tostring(round(numSecs), "duration") 
 | eval formatted_dur = replace(string_dur,"(?:(\d+)\+)?0?(\d+):0?(\d+):0?(\d+)","\1d \2h \3m \4s")
 | eval stringSecs = replace(formatted_dur, "^d (0h (0m )?)?","") 
 | eval InterfaceUpTime = _time + duration
 | table host, Interface_Name, _time, , InterfaceUpTime, stringSecs 
 | rename _time AS "Interface Down", InterfaceUpTime AS "Interface Up", host AS "Reporting Host", Interface_Name AS "Interface", stringSecs AS "Total Downtime"
 | convert timeformat="%a-%d-%b-%Y %H:%M:%S" ctime("Interface Down") ctime("Interface Up")

And this is the filtered log output from the device feeding into Splunk:

 008464: Jun  3 2018 12:50:53.356 dst: %LINK-3-UPDOWN: Interface GigabitEthernet0/0/0, changed state to up
 008466: Jun  3 2018 12:50:53.356 dst: %LINK-3-UPDOWN: SIP0/0: Interface GigabitEthernet0/0/0, changed state to up
 008493: Jun  8 2018 10:55:34.676 dst: %LINK-3-UPDOWN: Interface GigabitEthernet0/0/0, changed state to down
 008495: Jun  8 2018 10:55:34.675 dst: %LINK-3-UPDOWN: SIP0/0: Interface GigabitEthernet0/0/0, changed state to down
 008527: Jun 12 2018 13:15:29.641 dst: %LINK-3-UPDOWN: Interface GigabitEthernet0/0/0, changed state to down
 008529: Jun 12 2018 13:15:29.644 dst: %LINK-3-UPDOWN: SIP0/0: Interface GigabitEthernet0/0/0, changed state to down
 008548: Jun 14 2018 17:35:31.894 dst: %LINK-3-UPDOWN: Interface GigabitEthernet0/0/0, changed state to down
 008549: Jun 14 2018 17:35:31.897 dst: %LINK-3-UPDOWN: SIP0/0: Interface GigabitEthernet0/0/0, changed state to down
 008558: Jun 14 2018 17:43:38.631 dst: %LINK-3-UPDOWN: Interface GigabitEthernet0/0/0, changed state to down
 008559: Jun 14 2018 17:43:38.634 dst: %LINK-3-UPDOWN: SIP0/0: Interface GigabitEthernet0/0/0, changed state to down
 008573: Jun 15 2018 07:22:05.050 dst: %LINK-3-UPDOWN: Interface GigabitEthernet0/0/0, changed state to up
 008575: Jun 15 2018 07:22:05.050 dst: %LINK-3-UPDOWN: SIP0/0: Interface GigabitEthernet0/0/0, changed state to up
 008576: Jun 15 2018 07:22:44.183 dst: %LINK-3-UPDOWN: Interface GigabitEthernet0/0/0, changed state to down
 008579: Jun 15 2018 07:22:44.183 dst: %LINK-3-UPDOWN: SIP0/0: Interface GigabitEthernet0/0/0, changed state to down
 008580: Jun 15 2018 07:23:16.648 dst: %LINK-3-UPDOWN: Interface GigabitEthernet0/0/0, changed state to up
 008582: Jun 15 2018 07:23:16.648 dst: %LINK-3-UPDOWN: SIP0/0: Interface GigabitEthernet0/0/0, changed state to up

What I'd like to see in the dashboard was the duration between the "Down" event Jun 8 10:55 and until the "Up" event Jun 15 07:23.

Instead I'm seeing this:

 Reporting Host    Interface        Interface Down              Interface Up                Total Downtime
 X.X.X.X    GigabitEthernet0/0/0    Fri-15-Jun-2018 07:22:44    Fri-15-Jun-2018 07:23:17    33s
 X.X.X.X    GigabitEthernet0/0/0    Thu-14-Jun-2018 17:43:39    Fri-15-Jun-2018 07:22:05    13h 38m 26s
 X.X.X.X    GigabitEthernet0/0/0    Thu-14-Jun-2018 17:35:32    Fri-15-Jun-2018 09:06:24    15h 30m 52s
 X.X.X.X    GigabitEthernet0/0/0    Tue-12-Jun-2018 13:15:29    Fri-15-Jun-2018 09:06:24    2d 19h 50m 55s
 X.X.X.X    GigabitEthernet0/0/0    Sun-03-Jun-2018 12:06:57    Sun-03-Jun-2018 12:50:54    43m 57s
 X.X.X.X    GigabitEthernet0/0/0    Sun-03-Jun-2018 11:45:27    Sun-03-Jun-2018 12:04:57    19m 30s
 X.X.X.X    GigabitEthernet0/0/0    Sun-03-Jun-2018 09:03:27    Sun-03-Jun-2018 10:45:27    1h 42m 0s
 X.X.X.X    GigabitEthernet0/0/0    Sat-02-Jun-2018 23:59:05    Sun-03-Jun-2018 00:34:49    35m 44s
 X.X.X.X    GigabitEthernet0/0/0    Wed-23-May-2018 00:37:20    Wed-23-May-2018 00:40:27    3m 7s
 X.X.X.X    GigabitEthernet0/0/0    Fri-08-Jun-2018 10:55:35    Fri-15-Jun-2018 09:06:24    6d 22h 10m 49s

Any help appreciated 🙂

0 Karma
1 Solution

DalJeanis
Legend

Transaction is almost never the right answer. This should get the data you are asking for, and then you can reformat it as needed.

index=net <host-ip> AND "%LINK-3-UPDOWN: Interface"
| rex "%LINK-3-UPDOWN: Interface (?P<Interface_Name>(.+)),"

| rename COMMENT as "Find the new status and kill all events without it"
| rename COMMENT as "WARNING - Make sure that that this line properly extracts the status"
| rex "changed state to (?<myStatus>(up|down)"
| where isnotnull(myStatus)

| rename COMMENT as "sort data into order, copy the prior status, kill records that just maintain the prior status"
| sort 0 host _time
| streamstats current=f last(myStatus) as prevStatus by host
| where isnull(prevStatus) OR myStatus <> prevStatus

| rename COMMENT as "reverse the prior order, copy the next later time backwards"
| reverse 
| streamstats current=f last(_time) as nextTime last(myStatus) as nextStatus by host

| rename COMMENT as "calculate duration, fill in for any still down hosts, and kill the up records"
| eval duration= if(isnotnull(nextTime), nextTime - _time, now() - _time) 
| where myStatus == "down"

| rename COMMENT as "sort into desired order"
| sort 0 _time  

You can identify down interfaces by isnull(nextStatus).

View solution in original post

DalJeanis
Legend

Transaction is almost never the right answer. This should get the data you are asking for, and then you can reformat it as needed.

index=net <host-ip> AND "%LINK-3-UPDOWN: Interface"
| rex "%LINK-3-UPDOWN: Interface (?P<Interface_Name>(.+)),"

| rename COMMENT as "Find the new status and kill all events without it"
| rename COMMENT as "WARNING - Make sure that that this line properly extracts the status"
| rex "changed state to (?<myStatus>(up|down)"
| where isnotnull(myStatus)

| rename COMMENT as "sort data into order, copy the prior status, kill records that just maintain the prior status"
| sort 0 host _time
| streamstats current=f last(myStatus) as prevStatus by host
| where isnull(prevStatus) OR myStatus <> prevStatus

| rename COMMENT as "reverse the prior order, copy the next later time backwards"
| reverse 
| streamstats current=f last(_time) as nextTime last(myStatus) as nextStatus by host

| rename COMMENT as "calculate duration, fill in for any still down hosts, and kill the up records"
| eval duration= if(isnotnull(nextTime), nextTime - _time, now() - _time) 
| where myStatus == "down"

| rename COMMENT as "sort into desired order"
| sort 0 _time  

You can identify down interfaces by isnull(nextStatus).

NJL
Explorer

Thanks for the elaborate answer!

It seems to work, so thank you very much for your help!

NJL
Explorer

Anybody? 😄

0 Karma

Sukisen1981
Champion

hi,
have you considered applying something like : eventstats max(Interface Up) as uptime, min(Interface Down) as down | eva duration=uptime-downtime?
The issue here is splunk can not know what is the actual time when human engineers are working to fix the issue, so for example 07:22:44 is a downtime event. Now, you know that the human engineers are working in this time , but such information is not available to splunk. The above pseudo code i posted is merely doing max(uptime)- min(downtime) . How do you 'tell' splunk or any other system to ignore the down time at 07:22:44 on an automated basis? Since, you know that the downtime at 07:22:44 should not be considered, the above code will work but how are you going to inform splunk which is the actual downtime to consider as the earliest?

0 Karma
Get Updates on the Splunk Community!

.conf24 | Registration Open!

Hello, hello! I come bearing good news: Registration for .conf24 is now open!   conf is Splunk’s rad annual ...

ICYMI - Check out the latest releases of Splunk Edge Processor

Splunk is pleased to announce the latest enhancements to Splunk Edge Processor.  HEC Receiver authorization ...

Introducing the 2024 SplunkTrust!

Hello, Splunk Community! We are beyond thrilled to announce our newest group of SplunkTrust members!  The ...