On the incident posture screen the Informational --> Critical boxes update and show the proper number and status of events.
The Recent Incidents panel does not show all triggered alerts.
The app is installed in a distributed environment using a non-standard index. The search head is standalone, the index data is sent to clustered indexers, and the summary indexes are stored on the search head.
When an event triggers I am seeing the following logs:
2015-04-09 11:50:06,187 DEBUG Create event will be: time=2015-04-09T11:50:06.187492 severity=INFO origin="alerthandler" eventid="f5e728746d9ec28b42db2b41ba85109e" user="splunk-system-user" action="create" alert="XXXX Alert Name" incidentid="624b2d98-14df-43b9-9765-fac36e8662e0" jobid="schedulermesearch_RMD5853430e3bafc3e3fat1428605400164" resultid="0" owner="unassigned" status="new" urgency="high" ttl="86400" alerttime="1428605401"
action = create eventtype = failed_login eventtype = nix-all-logs eventtype = nix_errors error host = Host.name index = _internal source = /opt/splunk/var/log/splunk/alert_manager.log sourcetype = key_indicators_controller-2
When checking the Recent Incidents screen, or when searching | all_alerts, I do not see the alert listed.
The counters for Informational --> Critical count up, but there isn't an incident to respond to. It appears almost random. Originally I thought the events were being truncated, so I increased TRUNCATE in props.conf to allow events larger than 10000, but that hasn't fixed the issue.
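For reference, the truncation change described was along these lines in props.conf (the stanza name here is a placeholder; I applied it to the relevant sourcetype):

[my_sourcetype]
TRUNCATE = 50000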
Any ideas on what could cause this?
Can you check which index the eventtype "alert_base" in TA-alert_manager is set to?
It should be configured like this:
[alert_base]
search = index=<your_custom_index_name_here>
I think we forgot to mention in the documentation that you'll need to adjust the index name in the eventtype if you're using a custom index.
The data model and everything else are based on this eventtype.
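To verify that the eventtype actually resolves to your custom index, a quick sanity check is a search like the following (just a sketch):

eventtype=alert_base | stats count by index, sourcetype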
I have this set on the search head and also distributed it to the indexers via the cluster master. Before that setting, nothing showed up on the screen.
The weird part is that while I see the trigger action for alert_handler.py, the alert only makes it to Recent Incidents "most" of the time. Below is an alert that only shows up about 1/3 of the time (XXXX inserted to hide some details):
XX Failed Login Alert
index=XXXXX (EventCode=529 OR EventCode=530 OR EventCode=531 OR EventCode=532 OR EventCode=533 OR EventCode=534 OR EventCode=535 OR EventCode=536 OR EventCode=537 OR EventCode=539 OR EventCode=4625) AND Message=Fail NOT (Message=XXXXXXX) NOT (host=XXXXX01 OR host=XXXXX02) | eval AccountName=mvfilter(AccountName!="-") | stats count by host, AccountName, SourceNetworkAddress, LogonType | search count>3
Scheduled via cron: time range -10m --> now, on the schedule */10 * * * *
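For reference, that schedule corresponds roughly to the following savedsearches.conf settings (the stanza name is a placeholder):

[XX Failed Login Alert]
cron_schedule = */10 * * * *
dispatch.earliest_time = -10m
dispatch.latest_time = now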
Also running Splunk 6.2.1
That's kind of weird. Did you configure auto_ttl_resolve or auto_previous_resolve?
Did you try to set "All" as filter option for "Status"?
Could we maybe set up a screen share session in order to debug your issue?
Sorry for all the inconvenience.
I rechecked the Incident settings and both are unchecked.
I tried setting all and still don't see the alerts. They also don't show up when searching just | all_alerts
Screen sharing isn't an option in my environment, but I can attach log files etc. or try just about anything. (I don't see an attach option for comments, so I could send them to you.)
Is there a specific log or functionality that I should be looking at? Is there a limit to how often an alert can trigger?
There shouldn't be any limitation on how many alerts can be created, although I can imagine locking issues when alerts fire simultaneously.
Can you please provide the alert_manager.log file, which contains the main log information when alerts fire? You could paste it here, for example: https://gist.github.com/
Further, you can remove the comment markers on lines 23 and 24 in alert_handler.py, and maybe set a different path for those two files. Afterwards, check whether anything gets written to them.
In the log from 4/10/15, between 11:47:04 and 11:54:04, nothing showed up in Recent Incidents. That is one example of an event that did not show.
I will try disabling the comments and report back.
Thanks for providing the logfile. It's really weird since there are a few log entries showing that the alert handler finished correctly:
2015-04-10 11:47:04,733 INFO Alert handler finished. duration=0.718s
2015-04-10 11:48:04,671 INFO Alert handler finished. duration=0.706s
2015-04-10 11:49:04,869 INFO Alert handler finished. duration=0.77s
2015-04-10 11:50:06,267 INFO Alert handler finished. duration=0.717s
2015-04-10 11:51:30,768 INFO Alert handler finished. duration=0.709s
2015-04-10 11:52:07,434 INFO Alert handler finished. duration=0.756s
2015-04-10 11:53:04,671 INFO Alert handler finished. duration=0.732s
If there had been issues somewhere, the alert handler wouldn't have finished.
A few more things to double-check:
| inputlookup incidents | where incident_id="3f3f5812-480c-465d-8f89-c2b94e658eec"
The one listed above showed up on the Recent Incidents screen. I ran the commands above on BOTH that incident and "8b82d86a-b742-4f61-8f95-9c312015d2f4", which does not show up in the Recent Incidents screen. All fields were present and the JSON appeared to be formatted correctly (nothing malformed as plain text).
The Recent Incidents search string references all_alerts, but when I run that search the incident I listed above shows up nowhere.
I am changing lines 23 and 24 now. To change where they go, can I simply replace the /tmp/stdxxx paths with another location, or is there a variable elsewhere (I didn't see one)?
We're getting closer. The good news is: the alert handler is working fine. I think the issue is somewhere around the data model or the macro.
Once again, please give a try with these queries and let me know what they return:
| tstats values(all_alerts.alert) as alert, values(all_alerts.app) as app, values(all_alerts.event_search) as event_search, values(all_alerts.search) as search, values(all_alerts.impact) as impact, values(all_alerts.earliest) as earliest, values(all_alerts.latest) as latest, count from datamodel="alert_manager" where nodename="all_alerts" by all_alerts.job_id, all_alerts.incident_id, all_alerts.result_id, _time | search all_alerts.incident_id="8b82d86a-b742-4f61-8f95-9c312015d2f4"
| pivot alert_manager all_alerts count(all_alerts) AS "count" FILTER incident_id is "8b82d86a-b742-4f61-8f95-9c312015d2f4"
eventtype="alert_metadata" incident_id="8b82d86a-b742-4f61-8f95-9c312015d2f4" | table app, earliest, eventSearch, impact, incident_id, job_id, latest, name, owner, result_id, ttl, urgency
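If the tstats query returns nothing while the raw eventtype search does, one more cross-check is to search the data model directly with the datamodel command, which does not rely on acceleration summaries (assuming the model is named alert_manager with an all_alerts object, as in the queries above):

| datamodel alert_manager all_alerts search | search all_alerts.incident_id="8b82d86a-b742-4f61-8f95-9c312015d2f4"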