On the Incident Posture screen, the Informational --> Critical boxes update and show the proper number and status of events.
However, Recent Incidents does not show all triggered alerts.
The app is installed in a distributed environment using a non-standard index: the search head is standalone, the index is forwarded to clustered indexers, and summary indexes are stored on the search head.
When an event triggers, I see the following logs:
2015-04-09 11:50:06,187 DEBUG Create event will be: time=2015-04-09T11:50:06.187492 severity=INFO origin="alert_handler" event_id="f5e728746d9ec28b42db2b41ba85109e" user="splunk-system-user" action="create" alert="XXXX Alert Name" incident_id="624b2d98-14df-43b9-9765-fac36e8662e0" job_id="scheduler_mesearch_RMD5853430e3bafc3e3f_at_1428605400_164" result_id="0" owner="unassigned" status="new" urgency="high" ttl="86400" alert_time="1428605401"
action = create
eventtype = failed_login eventtype = nix-all-logs eventtype = nix_errors error
host = Host.name
index = _internal
source = /opt/splunk/var/log/splunk/alert_manager.log
sourcetype = key_indicators_controller-2
When checking the Recent Incidents screen, or when searching with | all_alerts,
I do not see the alert listed.
The counters for Informational --> Critical count up, but there isn't an incident to respond to. It appears almost random. Originally I thought the event was being truncated, so I increased TRUNCATE in props.conf to allow events larger than 10000 characters, but that hasn't fixed the issue.
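For reference, the truncation change I made looks roughly like this (the stanza name below is a placeholder, not my actual sourcetype):

```ini
# props.conf on the indexers (illustrative; stanza name is a placeholder)
[alert_manager_log_sourcetype]
TRUNCATE = 20000
```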
Any ideas on what could cause this?
Issue resolved:
The issue ended up being field extractions not working past 10,000 characters.
Added the following to limits.conf on the search head and the search peers:
[kv]
maxchars = 20240
Problem resolved with this change
Hi
Can you check the eventtype "alert_base" in the TA-alert_manager to see which index it is set to?
It should be configured like this:
[alert_base]
search = index=<your_custom_index_name_here>
I think we forgot to mention in the documentation that you'll need to adjust the index name in the eventtype if you're using a custom index.
The datamodel and all the rest is based on this eventtype.
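As a quick sanity check (an illustrative search using the datamodel and object names from this thread), you can confirm the datamodel actually returns events with:

```
| datamodel alert_manager all_alerts search | head 5
```

If this returns nothing while the eventtype search returns events, the problem is between the eventtype and the datamodel, not in the alert handler.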
Thanks,
Simon
I have this set on both the search head and distributed to the indexers via the cluster master. Before that setting nothing showed up on the screen.
The weird part is that while I see the trigger action for alert_handler.py run, the alert only makes it to Recent Incidents most of the time. Below is an alert that only shows up about one third of the time (XXXX inserted to hide some details).
Alert Name
XX Failed Login Alert
index=XXXXX (EventCode=529 OR EventCode=530 OR EventCode=531 OR EventCode=532 OR EventCode=533 OR EventCode=534 OR EventCode=535 OR EventCode=536 OR EventCode=537 OR EventCode=539 OR EventCode=4625) AND Message=Fail NOT (Message=XXXXXXX) NOT (host=XXXXX01 OR host=XXXXX02) | eval Account_Name=mvfilter(Account_Name!="-") | stats count by host, Account_Name, Source_Network_Address, Logon_Type | search count>3
Scheduled cron -10m --> now @ */10 * * * *
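For context, that schedule maps to a savedsearches.conf stanza roughly like this (a sketch only; the actual search string is redacted above and other settings may differ):

```ini
# savedsearches.conf (illustrative sketch of the schedule settings)
[XX Failed Login Alert]
cron_schedule = */10 * * * *
dispatch.earliest_time = -10m
dispatch.latest_time = now
enableSched = 1
```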
Also running Splunk 6.2.1
That's kind of weird. Did you configure auto_ttl_resolve or auto_previous_resolve?
Did you try to set "All" as filter option for "Status"?
Could we maybe set up a screen share session in order to debug your issue?
Sorry for all the inconvenience.
I rechecked the Incident settings and both are unchecked.
I tried setting "All" and still don't see the alerts. They also don't show up when just searching | all_alerts.
Screen sharing isn't an option in my environment, but I can attach log files etc. or try just about anything. (I don't see an attach option for comments, or I would send them to you.)
Is there a specific log or functionality I should be looking at? Is there a limit to how often an alert can trigger?
There shouldn't be any limitation on how many alerts can be created, although I can imagine locking issues when firing alerts simultaneously.
Can you please provide the alert_manager.log file, which contains the main log information when firing alerts? You could paste it here, for example: https://gist.github.com/
Further, you can uncomment lines 23 and 24 in alert_handler.py and maybe set a different path for these two files. Later, check whether anything gets written to those two files.
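For anyone following along: those lines are a stdout/stderr redirection for debugging. A minimal Python sketch of the idea (the file names here are assumptions, not the actual contents of alert_handler.py):

```python
import os
import sys
import tempfile

# Sketch of the debug-redirection idea (assumed paths, not the real
# alert_handler.py lines): send stdout/stderr to files so that silent
# failures leave a trace on disk.
out_path = os.path.join(tempfile.gettempdir(), "alert_handler_stdout.log")
err_path = os.path.join(tempfile.gettempdir(), "alert_handler_stderr.log")

orig_stdout, orig_stderr = sys.stdout, sys.stderr
sys.stdout = open(out_path, "a")
sys.stderr = open(err_path, "a")

print("debug: alert handler started")  # lands in out_path, not the console

# Restore the originals so the rest of the session behaves normally.
sys.stdout.flush()
sys.stdout, sys.stderr = orig_stdout, orig_stderr
```

If the script dies before logging anything, whatever it wrote to stderr will still show up in the redirected file.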
Thanks,
Simon
https://gist.github.com/04c0f199194e3945dc75.git
In the log, between 11:47:04 and 11:54:04 on 4/10/15, nothing showed in Recent Incidents. That is one example of an event that did not show.
I will try disabling the comments and report back.
Thanks for providing the logfile. It's really weird since there are a few log entries showing that the alert handler finished correctly:
2015-04-10 11:47:04,733 INFO Alert handler finished. duration=0.718s
2015-04-10 11:48:04,671 INFO Alert handler finished. duration=0.706s
2015-04-10 11:49:04,869 INFO Alert handler finished. duration=0.77s
2015-04-10 11:50:06,267 INFO Alert handler finished. duration=0.717s
2015-04-10 11:51:30,768 INFO Alert handler finished. duration=0.709s
2015-04-10 11:52:07,434 INFO Alert handler finished. duration=0.756s
2015-04-10 11:53:04,671 INFO Alert handler finished. duration=0.732s
If there had been issues somewhere, the alert handler wouldn't have finished.
A few more things to double-check:
| inputlookup incidents | where incident_id="3f3f5812-480c-465d-8f89-c2b94e658eec"
The one listed above showed up on the Recent Incidents screen. I ran the commands above on both that incident and "8b82d86a-b742-4f61-8f95-9c312015d2f4", which does not show up in the Recent Incidents screen. All fields were present and the JSON appeared to be formatted correctly (nothing malformed as plain text).
The Recent Incidents search string references all_alerts, but when I search that, the incident I listed above shows up nowhere.
I am enabling lines 23 and 24 now. To change where the files go, can I simply replace the /tmp/stdxxx paths with another location, or is there a variable elsewhere (I didn't see one)?
We're getting closer. The good news is: the alert handler is working fine. I think the issue is somewhere around the datamodel or the macro.
Once again, please give a try with these queries and let me know what they return:
Tstats search:
| tstats values(all_alerts.alert) as alert, values(all_alerts.app) as app, values(all_alerts.event_search) as event_search, values(all_alerts.search) as search, values(all_alerts.impact) as impact, values(all_alerts.earliest) as earliest, values(all_alerts.latest) as latest, count from datamodel="alert_manager" where nodename="all_alerts" by all_alerts.job_id, all_alerts.incident_id, all_alerts.result_id, _time | search all_alerts.incident_id="8b82d86a-b742-4f61-8f95-9c312015d2f4"
Pivot search:
| pivot alert_manager all_alerts count(all_alerts) AS "count" FILTER incident_id is "8b82d86a-b742-4f61-8f95-9c312015d2f4"
Eventtype search:
eventtype="alert_metadata" incident_id="8b82d86a-b742-4f61-8f95-9c312015d2f4" | table app, earliest, eventSearch, impact, incident_id, job_id, latest, name, owner, result_id, ttl, urgency
1st search: Error in 'TsidxStats': Could not find datamodel: alert_manager
2nd search: Error in 'DataModelEvaluator': Data model 'alert_manager' was not found
3rd search looks good; however, job_id was blank.
EDIT:
Found my error in searching: the datamodel was restricted to the Search and Reporting app. I shared it globally and it works now.
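For reference, sharing the datamodel globally can also be done in the app's metadata (an illustrative snippet; the object name and roles may differ in your setup):

```ini
# metadata/local.meta in the alert_manager app (illustrative)
[datamodel/alert_manager]
export = system
access = read : [ * ], write : [ admin ]
```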
1st Search: Returns No Results Found
2nd search: returns count 1
3rd remains the same; however, I neglected to say that result_id is also blank.
Did you enable datamodel acceleration? pivot and tstats access the same datamodel... That's really weird...
Let's try again with these two queries:
| tstats summariesonly=false allow_old_summaries=true count from datamodel="alert_manager" where nodename="all_alerts" all_alerts.incident_id="1000000-000-000-000-000000000000"
and
| tstats count from datamodel="alert_manager" where nodename="all_alerts" all_alerts.incident_id="1000000-000-000-000-000000000000"
Further, I cannot imagine why job_id and result_id should be blank; the logfile says those two fields were parsed correctly for incident_id 8b82d86a-b742-4f61-8f95-9c312015d2f4 (see https://gist.github.com/sgman/04c0f199194e3945dc75#file-alert_manager_log-L4938)
Can you please paste the plain text event from the output of
eventtype="alert_metadata" incident_id="8b82d86a-b742-4f61-8f95-9c312015d2f4"
to a gist? Although I don't think it's the main reason the tstats search doesn't return the events; I reproduced an incident with an empty result_id and job_id, and that wasn't an issue so far.
The two tstats queries returned "No Results Found".
The results of the 3rd search are at https://gist.github.com/sgman/14985c08cedfdbccc523
Many thanks for providing all the information, but I'm really sorry: I don't think I'm able to solve this issue.
My impression is that somehow the datamodel didn't include the events generated by the alert handler. I have no idea why, since I'm not able to reproduce it at all, even with the events you provided that don't appear in the datamodel. Since the events have been written to the index correctly and are retrievable with the eventtype (which is the "base" of the datamodel), there must be something else going on.
Would you agree to open a Splunk support case and try to explain the issue, namely that the datamodel doesn't cover/show some events?
Let me know if I can support you.
Thanks for your patience and understanding,
Simon
A Splunk ticket is open, but while I'm waiting I dug into the data models. I posted a few examples of the incident_id portion of the datamodel. I noticed the data doesn't always get written in the same order, and when incident_id is near the end it doesn't get extracted. If you have a few minutes, could you take a look and see if you spot anything I didn't?
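To illustrate why the field order matters, here is a toy Python model of extraction with a character cap (a deliberate simplification of Splunk's [kv] maxchars behavior, not its actual code):

```python
import re

# Toy stand-in for Splunk's search-time KV extraction: only the first
# `maxchars` characters of the raw event are scanned for key="value" pairs.
def extract_fields(raw_event, maxchars):
    return dict(re.findall(r'(\w+)="([^"]*)"', raw_event[:maxchars]))

# Fake event: a large payload field first, incident_id near the end.
event = 'payload="%s" incident_id="8b82d86a"' % ("x" * 11000)

# 10240 is the default [kv] maxchars; 20240 is the value from the fix above.
print("incident_id" in extract_fields(event, 10240))   # False: cut off
print("incident_id" in extract_fields(event, 20240))   # True: fully parsed
```

This matches the observed symptom: whenever the writer happened to emit incident_id past the 10k mark, the field silently vanished, which is why raising maxchars in limits.conf resolved it.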