We already have Splunk installed and are using it pretty heavily. Our use cases consist mainly of indexing various log files or databases, and we have provided multiple teams with dashboards for their apps/systems.
Now my management would like us to evaluate Splunk as an enterprise monitoring tool, in other words as a replacement for something like SCOM, ITM, Up.Time, etc. For us, that would mean monitoring/alerting for about 10k servers (Windows, Linux, Unix) covering memory, disk, processors, services, processes, SQL, Exchange, web sites, etc.
Is anyone trying to use Splunk in that capacity and on that scale? I know the data is on the servers and there are apps out there to collect most of it. But I'm a little leery of bringing all of that data back to the infrastructure and searching on it there, as opposed to most other monitoring tools, which have an agent that receives some sort of policy and sends only alerts/status back to the infrastructure.
And then there's the management of it all: different servers with different alerting needs, thresholds, polling intervals and recurrence. Not to mention the requirement to be able to take action against a server depending on the issue, e.g. restarting a service or cleaning up disk space.
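To make the management problem a bit more concrete, here is a rough Python sketch of the kind of per-server policy an agent-based tool typically evaluates locally. It is only an illustration: `Policy`, `POLICIES`, `evaluate`, `check_server`, the role names, the thresholds and the action names are all hypothetical and are not part of Splunk or any of the tools mentioned above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Policy:
    """Hypothetical per-metric rule; none of these fields come from a real product."""
    metric: str                   # e.g. "disk_pct"
    threshold: float              # alert when samples exceed this value
    interval_s: int               # polling interval (informational in this sketch)
    recurrence: int               # consecutive breaches required (assumed >= 1)
    action: Optional[str] = None  # name of a remediation step, if any


def evaluate(policy: Policy, samples: List[float]) -> Optional[str]:
    """Return an alert string if the last `recurrence` samples all exceed the threshold."""
    recent = samples[-policy.recurrence:]
    if len(recent) < policy.recurrence:
        return None  # not enough data points yet
    if all(value > policy.threshold for value in recent):
        suffix = f" -> {policy.action}" if policy.action else ""
        return f"{policy.metric} above {policy.threshold}{suffix}"
    return None


# Assumed example: different server roles get different thresholds and intervals.
POLICIES: Dict[str, List[Policy]] = {
    "web": [Policy("disk_pct", 90.0, 300, 2, "cleanup_tmp")],
    "db": [Policy("disk_pct", 80.0, 60, 3), Policy("mem_pct", 95.0, 60, 3)],
}


def check_server(role: str, metrics: Dict[str, List[float]]) -> List[str]:
    """Evaluate every policy for a role against locally collected samples."""
    alerts = []
    for policy in POLICIES.get(role, []):
        alert = evaluate(policy, metrics.get(policy.metric, []))
        if alert:
            alerts.append(alert)
    return alerts
```

With something like this, only the strings returned by `check_server` would travel back to the infrastructure. Doing the same centrally means shipping every sample and keeping an equivalent table of thresholds per server somewhere on the search side.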
As is usually the case, I'm sure it can be done. But I feel like we would be shoehorning Splunk into that type of solution, and it may be more trouble than it's worth.
If anyone out there has any relevant experience and could share some advice/guidance, that would be great.
I am using Splunk to do some of our application monitoring, but not at that scale. I primarily alert when certain conditions are met when searching the log files (e.g. ORA errors, certain Java exceptions, etc.). I have been interested in using Splunk to gather OS metrics via the *nix & Windows apps, but don't have the license capacity to pull this off at the moment. Besides that, we already have 4-5 other tools collecting OS metrics. It would be nice to have that data alongside the app logs in order to try to find correlations (e.g. CPU goes to 100% while executing a particular Java method).
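As a rough illustration of that kind of log condition, here is a small Python sketch. It is plain Python, not Splunk's search language, and the patterns, `PATTERNS` and `find_alerts` are made up for the example. It assumes Oracle errors appear as `ORA-` followed by five digits and that Java exceptions are logged with a fully qualified class name.

```python
import re
from typing import Iterable, List, Tuple

# Assumed patterns for illustration only; real logs may need different ones.
PATTERNS = {
    "oracle_error": re.compile(r"\bORA-\d{5}\b"),
    "java_exception": re.compile(r"\b(?:[A-Za-z_]\w*\.)+[A-Z]\w*Exception\b"),
}


def find_alerts(lines: Iterable[str]) -> List[Tuple[int, str, str]]:
    """Return (line number, pattern name, matched text) for each matching line."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        for name, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                hits.append((lineno, name, match.group(0)))
    return hits
```

Each hit keeps the line number, so a match could later be lined up against OS metrics from the same time window if that data were available.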
Thanks for the feedback, Jeremiah. We currently use Splunk in a similar way, for example alerting on errors in log files as requested by various app/support teams. And I agree that having the OS metric data would be nice for event correlation.