
Inconsistent Results on Alert Manager Recent Incidents

Explorer

On the incident posture screen the Informational --> Critical boxes update and show the proper number and status of events.

The Recent Incidents panel does not show all triggered alerts.

The app is installed in a distributed environment using a non-standard index. The search head is standalone, the index data is sent to clustered indexers, and the summary indexes are stored on the search head.

When an alert triggers, I am seeing the following log entries:

2015-04-09 11:50:06,187 DEBUG Create event will be: time=2015-04-09T11:50:06.187492 severity=INFO origin="alert_handler" event_id="f5e728746d9ec28b42db2b41ba85109e" user="splunk-system-user" action="create" alert="XXXX Alert Name" incident_id="624b2d98-14df-43b9-9765-fac36e8662e0" job_id="schedulermesearch_RMD5853430e3bafc3e3fat1428605400164" result_id="0" owner="unassigned" status="new" urgency="high" ttl="86400" alert_time="1428605401"

action = create
eventtype = failed_login
eventtype = nix-all-logs
eventtype = nix_errors error
host = Host.name
index = _internal
source = /opt/splunk/var/log/splunk/alert_manager.log
sourcetype = key_indicators_controller-2

When checking the Recent Incidents screen or searching | `all_alerts`, I do not see the alert listed.

The counters for Informational --> Critical count up, but there isn't an incident to respond to. It appears almost random. Originally I thought the events were being truncated, so I increased TRUNCATE in props.conf to allow for larger than 10000, but that hasn't fixed the issue.
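
For reference, the props.conf change was along these lines (the stanza name here is just a placeholder for the sourcetype in question):

[your_sourcetype_here]
# TRUNCATE defaults to 10000 bytes; raised it in case large result sets were being cut off
TRUNCATE = 50000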

Any ideas on what could cause this?


Re: Inconsistent Results on Alert Manager Recent Incidents

Explorer

Hi

Can you check the eventtype "alert_base" in the TA-alert_manager and see which index it is set to?
It should be configured like this:

[alert_base]
search = index=<your_custom_index_name_here>

I think we forgot to mention in the documentation that you'll need to adjust the index name in the eventtype if you're using a custom index.

The data model and everything else are based on this eventtype.
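
A quick way to sanity-check it (assuming the eventtype is named alert_base as above) is something like:

eventtype=alert_base | stats count by index, sourcetype

If that returns nothing from your custom index, the eventtype is still pointing at the wrong place.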

Thanks,
Simon


Re: Inconsistent Results on Alert Manager Recent Incidents

Explorer

I have this set both on the search head and distributed to the indexers via the cluster master. Before that setting was in place, nothing showed up on the screen.

The weird part is that I do see the trigger action for alert_handler.py, but the alert only makes it to Recent Incidents "most" of the time. Below is an alert that only shows up about 1/3 of the time (XXXX inserted to hide some details).

Alert name: XX Failed Login Alert

index=XXXXX (EventCode=529 OR EventCode=530 OR EventCode=531 OR EventCode=532 OR EventCode=533 OR EventCode=534 OR EventCode=535 OR EventCode=536 OR EventCode=537 OR EventCode=539 OR EventCode=4625) AND Message=Fail NOT (Message=XXXXXXX) NOT (host=XXXXX01 OR host=XXXXX02) | eval AccountName=mvfilter(AccountName!="-") | stats count by host, AccountName, SourceNetworkAddress, LogonType | search count>3

Schedule: -10m --> now, cron */10 * * * *
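
(If it helps with debugging, whether the scheduled search itself fires every 10 minutes can be checked from Splunk's internal scheduler logs with something along these lines:)

index=_internal sourcetype=scheduler savedsearch_name="XX Failed Login Alert" | table _time, status, run_time, result_count

A status of skipped or continued there would point at the scheduler rather than at the Alert Manager.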

Also running Splunk 6.2.1


Re: Inconsistent Results on Alert Manager Recent Incidents

Explorer

That's kind of weird. Did you configure auto_ttl_resolve or auto_previous_resolve?
Did you try to set "All" as filter option for "Status"?
Could we maybe set up a screen share session in order to debug your issue?
Sorry for all the inconvenience.


Re: Inconsistent Results on Alert Manager Recent Incidents

Explorer

I rechecked the Incident settings and both are unchecked.

I tried setting the Status filter to "All" and still don't see the alerts. They also don't show up when searching just | `all_alerts`.

Screen sharing isn't an option in my environment, but I can attach log files etc. or try just about anything. (I don't see an attach option for comments that I could use to send them to you.)

Is there a specific log or functionality that I should be looking at? Is there a limit to how often an alert can trigger?


Re: Inconsistent Results on Alert Manager Recent Incidents

Contributor

There shouldn't be any limit on how many alerts can be created, although I could imagine locking issues when alerts fire simultaneously.

Can you please provide the alert_manager.log file? It contains the main log information written when alerts fire. You could paste it here, for example: https://gist.github.com/
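
(If grabbing the file from disk is awkward, the same entries can also be pulled from Splunk, based on the index and source shown in your first post:)

index=_internal source=*alert_manager.log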

Further, you can remove the comments from lines 23 and 24 in alert_handler.py (i.e. uncomment them) and maybe set a different path for the two files they write to. Afterwards, check whether anything gets written to those two files.

Thanks,
Simon


Re: Inconsistent Results on Alert Manager Recent Incidents

Explorer

https://gist.github.com/04c0f199194e3945dc75.git

In the log on 4/10/15, between 11:47:04 and 11:54:04, nothing showed in Recent Incidents. That is one example where an event did not show.

I will try removing the comments on those lines and report back.


Re: Inconsistent Results on Alert Manager Recent Incidents

Explorer

Thanks for providing the logfile. It's really weird since there are a few log entries showing that the alert handler finished correctly:

2015-04-10 11:47:04,733 INFO Alert handler finished. duration=0.718s
2015-04-10 11:48:04,671 INFO Alert handler finished. duration=0.706s
2015-04-10 11:49:04,869 INFO Alert handler finished. duration=0.77s
2015-04-10 11:50:06,267 INFO Alert handler finished. duration=0.717s
2015-04-10 11:51:30,768 INFO Alert handler finished. duration=0.709s
2015-04-10 11:52:07,434 INFO Alert handler finished. duration=0.756s
2015-04-10 11:53:04,671 INFO Alert handler finished. duration=0.732s

If there had been issues somewhere, the alert handler wouldn't have finished.
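
(To line those up with the incidents that were actually created, a search over the same log using the two message strings from above and from your first post might help:)

index=_internal source=*alert_manager.log ("Create event will be" OR "Alert handler finished") | table _time, _raw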

A few more things to double-check:

  • Check the "incidents" collection to see whether the data has been written for those incidents, e.g. (note: no column should be empty!):

| inputlookup incidents | where incident_id="3f3f5812-480c-465d-8f89-c2b94e658eec"

  • Check if the metadata has been written to the index. You should get one event with sourcetype=alert_metadata containing correctly formatted JSON data and one event with key/value data and sourcetype=incident_change, both belonging to the incident id:

index=alerts-to-inf incident_id="b35b7f4f-691d-4114-8efc-7a39820a9a11"

Re: Inconsistent Results on Alert Manager Recent Incidents

Explorer

The one listed above showed up on the Recent Incidents screen. I ran the commands above on BOTH that incident and "8b82d86a-b742-4f61-8f95-9c312015d2f4", which does not show up on the Recent Incidents screen. All fields were present and the JSON appeared to be formatted correctly (nothing malformed as plain text).

The Recent Incidents search string references all_alerts, but when I run that search the incident I listed above shows up nowhere.

I am removing the comments on lines 23 and 24 now. To change where the output goes, can I simply replace the /tmp/stdxxx paths with another location, or is there a variable elsewhere (I didn't see one)?


Re: Inconsistent Results on Alert Manager Recent Incidents

Explorer

We're getting closer. The good news is: the alert handler is working fine. I think the issue is somewhere around the data model or the macro.

Once again, please give these queries a try and let me know what they return:
Tstats search:

| tstats values(all_alerts.alert) as alert, values(all_alerts.app) as app, values(all_alerts.eventsearch) as eventsearch, values(all_alerts.search) as search, values(all_alerts.impact) as impact, values(all_alerts.earliest) as earliest, values(all_alerts.latest) as latest, count from datamodel="alert_manager" where nodename="all_alerts" by all_alerts.job_id, all_alerts.incident_id, all_alerts.result_id, _time | search all_alerts.incident_id="8b82d86a-b742-4f61-8f95-9c312015d2f4"

Pivot search:

| pivot alert_manager all_alerts count(all_alerts) AS "count" FILTER incident_id is "8b82d86a-b742-4f61-8f95-9c312015d2f4"

Eventtype search:

eventtype="alertmetadata" incidentid="8b82d86a-b742-4f61-8f95-9c312015d2f4" | table app, earliest, eventSearch, impact, incidentid, jobid, latest, name, owner, result_id, ttl, urgency
