On the incident posture screen the Informational --> Critical boxes update and show the proper number and status of events.
The Recent Incidents panel does not show all triggered alerts.
The app is installed in a distributed environment using a non-standard index. The search head is standalone, the index data is sent to clustered indexers, and the summary indexes are stored on the search head.
When an event triggers I am seeing the following logs:
2015-04-09 11:50:06,187 DEBUG Create event will be: time=2015-04-09T11:50:06.187492 severity=INFO origin="alerthandler" eventid="f5e728746d9ec28b42db2b41ba85109e" user="splunk-system-user" action="create" alert="XXXX Alert Name" incidentid="624b2d98-14df-43b9-9765-fac36e8662e0" jobid="schedulermesearch_RMD5853430e3bafc3e3fat1428605400164" resultid="0" owner="unassigned" status="new" urgency="high" ttl="86400" alerttime="1428605401"
action = create eventtype = failed_login eventtype = nix-all-logs eventtype = nix_errors error host = Host.name index = _internal source = /opt/splunk/var/log/splunk/alert_manager.log sourcetype = key_indicators_controller-2
When checking the Recent Incidents screen, or when searching | all_alerts, I do not see the alert listed.
The counters for Informational --> Critical count up, but there isn't an incident to respond to. It appears almost random. Originally I thought the events were being truncated, so I increased TRUNCATE in props.conf to allow events larger than 10000, but that hasn't fixed the issue.
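For reference, the truncation change described was along these lines in props.conf (the stanza name here is a placeholder; I applied it to the relevant sourcetype):

[my_sourcetype]
TRUNCATE = 50000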
Any ideas on what could cause this?
Can you check which index the eventtype "alert_base" in TA-alert_manager is set to?
It should be configured like this:
[alert_base]
search = index=<your_custom_index_name_here>
I think we forgot to mention in the documentation that you'll need to adjust the index name in the eventtype if you're using a custom index.
The data model and everything else are based on this eventtype.
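To verify that the eventtype actually resolves to your custom index, a quick sanity check is a search like the following (just a sketch):

eventtype=alert_base | stats count by index, sourcetype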
I have this set on the search head and also distributed it to the indexers via the cluster master. Before that setting, nothing showed up on the screen.
The weird part is that while I see the trigger action for alert_handler.py, the alert only makes it to Recent Incidents "most" of the time. Below is an alert that only shows up about 1/3 of the time (XXXX inserted to hide some details):
XX Failed Login Alert
index=XXXXX (EventCode=529 OR EventCode=530 OR EventCode=531 OR EventCode=532 OR EventCode=533 OR EventCode=534 OR EventCode=535 OR EventCode=536 OR EventCode=537 OR EventCode=539 OR EventCode=4625) AND Message=Fail NOT (Message=XXXXXXX) NOT (host=XXXXX01 OR host=XXXXX02) | eval AccountName=mvfilter(AccountName!="-") | stats count by host, AccountName, SourceNetworkAddress, LogonType | search count>3
Scheduled via cron: time range -10m --> now, on the schedule */10 * * * *
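For reference, that schedule corresponds roughly to the following savedsearches.conf settings (the stanza name is a placeholder):

[XX Failed Login Alert]
cron_schedule = */10 * * * *
dispatch.earliest_time = -10m
dispatch.latest_time = now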
Also running Splunk 6.2.1
That's kind of weird. Did you configure auto_ttl_resolve or auto_previous_resolve?
Did you try to set "All" as filter option for "Status"?
Could we maybe set up a screen share session in order to debug your issue?
Sorry for all the inconvenience.
I rechecked the Incident settings and both are unchecked.
I tried setting all and still don't see the alerts. They also don't show up when searching just | all_alerts
Screen sharing isn't an option in my environment, but I can attach log files etc. or try just about anything. (I don't see an attach option for comments, so I could send them to you.)
Is there a specific log or functionality that I should be looking at? Is there a limit to how often an alert can trigger?
There shouldn't be any limitation on how many alerts can be created, although I can imagine locking issues when alerts fire simultaneously.
Can you please provide the alert_manager.log file, which contains the main log information when alerts fire? You could paste it here, for example: https://gist.github.com/
Further, you can remove the comment markers on lines 23 and 24 in alert_handler.py, and maybe set a different path for those two files. Afterwards, check whether anything gets written to them.
In the log from 4/10/15, between 11:47:04 and 11:54:04, nothing showed up in Recent Incidents. That is one example of an event that did not show.
I will try disabling the comments and report back.
Thanks for providing the logfile. It's really weird since there are a few log entries showing that the alert handler finished correctly:
2015-04-10 11:47:04,733 INFO Alert handler finished. duration=0.718s
2015-04-10 11:48:04,671 INFO Alert handler finished. duration=0.706s
2015-04-10 11:49:04,869 INFO Alert handler finished. duration=0.77s
2015-04-10 11:50:06,267 INFO Alert handler finished. duration=0.717s
2015-04-10 11:51:30,768 INFO Alert handler finished. duration=0.709s
2015-04-10 11:52:07,434 INFO Alert handler finished. duration=0.756s
2015-04-10 11:53:04,671 INFO Alert handler finished. duration=0.732s
If there had been issues somewhere, the alert handler wouldn't have finished.
A few more things to double-check:
| inputlookup incidents | where incident_id="3f3f5812-480c-465d-8f89-c2b94e658eec"
The one listed above showed up on the Recent Incidents screen. I ran the commands above on BOTH that incident and "8b82d86a-b742-4f61-8f95-9c312015d2f4", which does not show up in the Recent Incidents screen. All fields were present and the JSON appeared to be formatted correctly (nothing malformed as plain text).
The Recent Incidents search string references all_alerts, but when I run that search the incident I listed above shows up nowhere.
I am changing lines 23 and 24 now. To change where they go, can I simply replace the /tmp/stdxxx paths with another location, or is there a variable elsewhere (I didn't see one)?
We're getting closer. The good news is: the alert handler is working fine. I think the issue is somewhere around the data model or the macro.
Once again, please give a try with these queries and let me know what they return:
| tstats values(all_alerts.alert) as alert, values(all_alerts.app) as app, values(all_alerts.event_search) as event_search, values(all_alerts.search) as search, values(all_alerts.impact) as impact, values(all_alerts.earliest) as earliest, values(all_alerts.latest) as latest, count from datamodel="alert_manager" where nodename="all_alerts" by all_alerts.job_id, all_alerts.incident_id, all_alerts.result_id, _time | search all_alerts.incident_id="8b82d86a-b742-4f61-8f95-9c312015d2f4"
| pivot alert_manager all_alerts count(all_alerts) AS "count" FILTER incident_id is "8b82d86a-b742-4f61-8f95-9c312015d2f4"
eventtype="alert_metadata" incident_id="8b82d86a-b742-4f61-8f95-9c312015d2f4" | table app, earliest, eventSearch, impact, incident_id, job_id, latest, name, owner, result_id, ttl, urgency
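If the tstats query returns nothing while the raw eventtype search does, one more cross-check is to search the data model directly with the datamodel command, which does not rely on acceleration summaries (assuming the model is named alert_manager with an all_alerts object, as in the queries above):

| datamodel alert_manager all_alerts search | search all_alerts.incident_id="8b82d86a-b742-4f61-8f95-9c312015d2f4"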