Hi Splunkers!
I’d like to pick your brains: do you know of 3-5 key Windows event log events to monitor that would indicate a machine has crashed or is having trouble with a particular component (application, hardware, driver, etc.)? I’m working on a set of alerts in Splunk for my program to help maintain its uptime SLAs.
I’m looking for a simple text string, event ID, error code, or pattern to search for in Splunk that would indicate a system has gone down or is degraded (i.e. something is failing).
I’ve done some research. Here’s what I’ve got so far:
System Log, Event ID: 41, Source: Microsoft-Windows-Kernel-Power
Description: The system has rebooted without cleanly shutting down first.
Kernel-Power Event ID 41 is logged when the computer restarts unexpectedly or without shutting down cleanly first.
Likely causes are that the operating system stopped responding or crashed, or that the server lost power.
System Log, Event ID: 6008, Source: EventLog
Description: “The previous system shutdown at <time> on <date> was unexpected.” This event ID tells you that the system started up after it was not shut down properly.
System Log, Event ID: 18, Source: Microsoft-Windows-WHEA-Logger
Description: “A fatal hardware error has occurred.” This error indicates that there is a hardware problem.
Application Log, Level: Error, Source: Application Error
Description: Tracks applications that have crashed or faulted on the system.
System Log, Event ID: 7000, Source: Service Control Manager
Description: “The <service name> service failed to start due to the following error: <error>”. This error is logged when a service fails to start normally.
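To make the list above concrete, here is a rough Python sketch that turns the four System log events into a single Splunk search string. Several things in it are assumptions on my part rather than anything confirmed: the index name `wineventlog`, the `WinEventLog:<log>` source naming, and the `EventCode` field (these depend on how the Windows inputs are set up in your environment), and the helper name `build_search` is made up for illustration. The Application Error entry is left out because I don't have an event ID for it yet.

```python
# Events from the list above: (log name, event ID, source, short meaning).
WATCHED_EVENTS = [
    ("System", 41, "Microsoft-Windows-Kernel-Power", "rebooted without a clean shutdown"),
    ("System", 6008, "EventLog", "previous shutdown was unexpected"),
    ("System", 18, "Microsoft-Windows-WHEA-Logger", "fatal hardware error"),
    ("System", 7000, "Service Control Manager", "service failed to start"),
]


def build_search(events, index="wineventlog"):
    """Build an SPL search string matching the given events by log and event ID.

    Assumes events are indexed with source="WinEventLog:<log>" and an
    EventCode field; adjust both to match your own inputs.
    """
    # Group event IDs by the log they are written to.
    ids_by_log = {}
    for log, event_id, _source, _meaning in events:
        ids_by_log.setdefault(log, []).append(event_id)

    # One clause per log: (source="WinEventLog:System" (EventCode=41 OR ...))
    clauses = []
    for log, ids in ids_by_log.items():
        codes = " OR ".join(f"EventCode={i}" for i in ids)
        clauses.append(f'(source="WinEventLog:{log}" ({codes}))')

    # Count matches per host and event ID, keeping the latest timestamp.
    return (
        f"index={index} (" + " OR ".join(clauses) + ") "
        "| stats count latest(_time) AS last_seen BY host EventCode"
    )


if __name__ == "__main__":
    print(build_search(WATCHED_EVENTS))
```

The printed string can be pasted into the search bar as a starting point for an alert. Since other sources can reuse the same event ID, it is probably worth adding a filter on the source name once I know which field holds it in our data.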
Any thoughts?