Alerting

Best practices for large number of alerts

alexiri
Communicator

We're using Splunk to monitor the logs of IBM's Tivoli Storage Manager and we'd like to replace our current home-grown alerting system. We'd like to create alerts based on the TSM error code, the idea being to have one alert per error code so that they can be managed and thresholded independently (i.e. we don't want several occurrences of a "benign" or understood error code eclipsing the others).

The problem is that there are many error codes we'd like to alert on, about 300 at the current count. We'd also like to alert on every other error code in case we miss something, but that catch-all should only ever generate one alert.

Now, we could have alerts with searches like these:

  • Alert 1: "search tsmcode=ANR0102E"
  • Alert 2: "search tsmcode=ANR3423E"
  • ...
  • Alert 3XX (the generic one): "search eventtype=error NOT tsmcode=ANR0102E NOT tsmcode=ANR3423E NOT ..."

but this seems kind of hard to manage, not to mention messy. Is there a better way to do this?

1 Solution

mw
Splunk Employee

I think that using a lookup might be the best way. Your lookup file could look something like:

tsmcode,alert,severity
ANR0102E,1,low
ANR3423E,1,high
...

With automatic lookups your search would become more like:

# catch anything else?
eventtype=error NOT alert=*

or similar. And, of course, with the addition of severity to the mix, you could treat messages more appropriately, and likely from just a few searches.
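For reference, the automatic lookup wiring for this might look roughly like the following, assuming the CSV is saved as tsm_alerts.csv and the TSM events arrive under a sourcetype called tsm (both names are placeholders here):

  # transforms.conf: define the lookup table
  [tsm_alerts]
  filename = tsm_alerts.csv

  # props.conf: apply it automatically to the TSM sourcetype
  [tsm]
  LOOKUP-tsm_alerts = tsm_alerts tsmcode OUTPUT alert severity

With that in place, every event carrying a tsmcode field picks up the alert and severity fields on its own, so the catch-all search above and any severity-based searches don't need an explicit lookup command.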

Hema_Nithya
Explorer

Hi,
We are planning to monitor our TSM servers with Splunk as well. Could you please advise what data needs to be fed into Splunk to get a complete report?

mw
Splunk Employee

Why do you need to create 300 alerts? I would imagine the same lookup could be used to limit yourself to just a few alert searches. In other words, at least in my experience, you wouldn't treat 300 error codes in 300 different ways; you would treat them in groups by severity ("critical", "high", and so on). With the lookup adding the severity, you would only need one or a few searches, IMHO.
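As a rough sketch of what that could reduce to (the severity values are just whatever you put in the CSV):

  # one alert search per severity group
  eventtype=error severity=critical
  eventtype=error severity=high

  # catch-all for codes not yet in the lookup
  eventtype=error NOT alert=*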

alexiri
Communicator

Hi Mike,

Yes, something like this may be the easiest way to deal with the generic alert. I guess I could probably also generate the CSV file programmatically if I can get Splunk to give me a list of configured alerts. (Is this possible?)
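(For reference, the rest search command can list saved searches, and alerts are saved searches under the hood; a rough sketch, with tsm_alerts_export.csv as a placeholder name:

  # list scheduled saved searches (alerts) and dump them to a CSV
  | rest /services/saved/searches
  | search is_scheduled=1
  | table title search
  | outputlookup tsm_alerts_export.csv

The output would still need to be mapped back to the tsmcode for each alert.)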

Can you think of any solution to the first issue, ie. having to create 300 alerts in Splunk?

Cheers,

Alex
