Alerting

Best practices for large number of alerts

alexiri
Communicator

We're using Splunk to monitor the logs of IBM's Tivoli Storage Manager, and we'd like to replace our current home-grown alerting system. We'd like to create alerts based on the TSM error code, and the idea is to have one alert per error code so that they can be managed and thresholded independently (i.e. we don't want several occurrences of a "benign" or understood error code eclipsing the others).

The problem is that there are many error codes that we'd like to alert on, at current count about 300. We'd also like to alert on every other error code in case we miss something, but for those we should only get one generic alert.

Now, we could have alerts with searches like these:

  • Alert 1: "search tsmcode=ANR0102E"
  • Alert 2: "search tsmcode=ANR3423E"
  • ...
  • Alert 3XX (the generic one): "search eventtype=error NOT tsmcode=ANR0102E NOT tsmcode=ANR3423E NOT ..."

but this seems kind of hard to manage, not to mention messy. Is there a better way to do this?

1 Solution

mw
Splunk Employee

I think that using a lookup might be the best way. Your lookup file could look something like:

tsmcode,alert,severity
ANR0102E,1,low
ANR3423E,1,high
...

With automatic lookups your search would become more like:

# catch anything else?
eventtype=error NOT alert=*

or similar. And, of course, with the addition of severity to the mix, you could treat messages more appropriately, and likely from just a few searches.
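
To wire up the automatic lookup, the configuration might look roughly like this (a sketch; the file name tsm_codes.csv and the sourcetype "tsm" are assumptions, not from the original post):

  # transforms.conf: register the CSV as a lookup table
  [tsm_codes]
  filename = tsm_codes.csv

  # props.conf: apply it automatically to the TSM sourcetype
  [tsm]
  LOOKUP-tsm_codes = tsm_codes tsmcode OUTPUT alert severity

With that in place, every event carrying a tsmcode field gets alert and severity added at search time, so searches like the one above work without an explicit lookup command.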


Hema_Nithya
Explorer

Hi,
We are planning to monitor our TSM servers with Splunk as well. Could you please help us with what data needs to be fed into Splunk to get a complete report?


mw
Splunk Employee

Why do you need to create 300 alerts? I would imagine that the same lookup would let you limit yourself to just a few alert searches. In other words, at least in my experience, you wouldn't treat 300 error codes in 300 different ways; you would treat them in groups by severity ("critical", and so on). With a lookup, the severity is added automatically, so you would only need one or a few searches, IMHO.
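
For example, with the severity field from the lookup, the 300 codes could collapse into a handful of searches along these lines (illustrative only):

  # one alert search per severity group
  eventtype=error severity=high
  eventtype=error severity=low

  # plus the catch-all for codes missing from the lookup
  eventtype=error NOT alert=*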

alexiri
Communicator

Hi Mike,

Yes, something like this may be the easiest way to deal with the generic alert. I guess I could probably also generate the CSV file programmatically if I can get Splunk to give me a list of configured alerts. (Is this possible?)
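
(One way to get such a list from within Splunk, assuming your role is allowed to run the built-in rest search command, might be something like:

  | rest /servicesNS/-/-/saved/searches | table title search

and the results could then be exported as CSV. This is my own sketch, not something confirmed in this thread.)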

Can you think of any solution to the first issue, ie. having to create 300 alerts in Splunk?

Cheers,

Alex
