I am experiencing periodic duplicate notable events in my search head cluster. I have a feeling this has something to do with how an SHC handles notable event syncing between search heads. Has anyone else run into this?
@Joetron, from your description it sounds like the Notable Event Suppression Keys are not getting updated properly across the cluster.
For some insight into what might be happening, here is how the Notable Event Suppression process works. Each Notable Event Suppression set up on a Search Head has an expire time associated with its Notable Event Suppression Key. When that expire time is reached, the Search Head the suppression is configured on sends a heartbeat to the Search Head Captain. That heartbeat contains the Notable Event Suppression Key & List, which is to be distributed to the members of the Search Head Cluster.
The Captain does not preemptively push the updated Notable Event Suppression Key & List. Instead, it waits for the Search Head Cluster members to check in during their regular heartbeat check-in to the Captain. So if heartbeat_period in the [shclustering] stanza has been adjusted away from the default of 5 seconds, the following situation can arise in the Search Head Cluster:
1) The Notable Event Suppression Key expires
2) A new Notable Event Suppression Key & List are generated on the Search Head
3) The Search Head sends a heartbeat to the Search Head Captain, and this heartbeat contains the new Notable Event Suppression Key & List
4) The Captain does not preemptively send the new Notable Event Suppression Key & List to the Search Head Cluster members
5) On the next heartbeat check-in completed by each Search Head Cluster member, the Captain hands over the new Notable Event Suppression Key & List
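The steps above can be sketched as a small toy model. This is only an illustration of the pull-on-check-in behavior described here, not Splunk's actual implementation; the class names, method names, and key values are all made up for the example.

```python
class Captain:
    """Toy captain: stores the latest key and hands it out only on check-in."""

    def __init__(self, key):
        self.key = key

    def receive_heartbeat(self, new_key=None):
        # A heartbeat may carry a new key. The captain stores it and returns
        # its current key to the caller. Nothing is pushed to other members.
        if new_key is not None:
            self.key = new_key
        return self.key


class Member:
    """Toy cluster member that only learns the current key when it checks in."""

    def __init__(self, name, key):
        self.name = name
        self.key = key

    def check_in(self, captain, new_key=None):
        self.key = captain.receive_heartbeat(new_key)


captain = Captain("key-1")
sh1 = Member("SH-1", "key-1")
sh2 = Member("SH-2", "key-1")

sh1.check_in(captain, new_key="key-2")  # steps 1-3: SH-1 reports the new key
stale_key = sh2.key                     # step 4: SH-2 still holds the old key
sh2.check_in(captain)                   # step 5: SH-2 picks it up on check-in
fresh_key = sh2.key
```

Between SH-1's heartbeat and SH-2's next check-in, SH-2 is still working from the old key, which is the window where a duplicate could slip through.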
So if heartbeat_period in the [shclustering] stanza of server.conf has been raised above the default of 5 seconds, you may be hitting a case where the Notable Event Suppression Key expires just after the Search Head Cluster members have checked in with the Captain. The Captain then waits for them to check in again before handing over the new Notable Event Suppression Key & List.
Scenario: the Notable Event Suppression Key expires on SH-1 at 11:02:15 am, and SH-1 sends its heartbeat with the updated key, which reaches the Captain at 11:02:17. SH-2 had checked in with the Captain at 11:02:15 and has a heartbeat_period of 30 seconds, so its next check-in is at 11:02:45. That means SH-2 will not check in for another 28 seconds, and during that window a duplicate Notable Event could be generated by any Search Head Cluster member that has not yet received the updated Notable Event Suppression Key & List.
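To put numbers on the scenario, here is a rough sketch that computes how long a member keeps running with the old key. It counts from the moment the Captain has the new key (11:02:17) to the member's next check-in (11:02:15 plus one heartbeat_period). The function name is made up for illustration, the date is arbitrary, and it assumes check-ins happen at exact multiples of heartbeat_period, which a real cluster will not guarantee.

```python
from datetime import datetime, timedelta


def seconds_until_member_sync(key_at_captain, member_last_checkin, heartbeat_period):
    """Seconds a member keeps the old key after the captain has the new one."""
    period = timedelta(seconds=heartbeat_period)
    next_checkin = member_last_checkin
    # Step forward one heartbeat at a time until the first check-in that
    # happens at or after the captain received the new key.
    while next_checkin < key_at_captain:
        next_checkin += period
    return (next_checkin - key_at_captain).total_seconds()


key_at_captain = datetime(2024, 1, 1, 11, 2, 17)  # new key reaches the Captain
sh2_last_checkin = datetime(2024, 1, 1, 11, 2, 15)  # SH-2's last check-in

window_30s = seconds_until_member_sync(key_at_captain, sh2_last_checkin, 30)
window_5s = seconds_until_member_sync(key_at_captain, sh2_last_checkin, 5)
print(window_30s, window_5s)  # 28.0 3.0
```

With the default of 5 seconds the exposure in this example shrinks to 3 seconds, which is why a larger heartbeat_period makes duplicates more likely.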
Double check heartbeat_period in the [shclustering] stanza of server.conf. If it is set higher than the default of 5 seconds, lower it to keep this scenario from arising and causing duplicate Notable Events due to Suppression Keys not being distributed to the Search Head Cluster members in time.
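For reference, the setting would look something like this in server.conf on the cluster members. The value shown is the default mentioned above. Which copy of server.conf actually holds the setting (for example one under $SPLUNK_HOME/etc/system/local/) is an assumption here and depends on how your deployment is laid out.

```
[shclustering]
heartbeat_period = 5
```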
Hopefully this information helps shed some light on the "Notable Event Suppression" process.