I’d like to pick your brain to see if you know of 3-5 key Windows event log events to monitor that would indicate a machine that has crashed or is having trouble with a particular component (application, hardware, driver, etc.). I’m working on a set of alerts in Splunk for my program to assist with maintaining their uptime SLAs.
I’m looking to search in Splunk for a simple text string, event id, error code, or pattern that would indicate that a system has gone down or is degraded (i.e. something is failing).
I’ve done some research. Here’s what I’ve got so far:
System Log, Event ID: 41, Source: Microsoft-Windows-Kernel-Power
Description: The system has rebooted without cleanly shutting down first.
The Kernel-Power event ID 41 error occurs when the computer shuts down or restarts unexpectedly.
An unexpected reboot error appears in the log when the system fails to shut down and restart gracefully. A likely cause of this error is that the operating system stopped responding and crashed, or the server lost power.
System Log, Event ID: 6008, Source: EventLog
Description: “The previous system shutdown at <time> on <date> was unexpected.” This event ID lets you know that the system started after it was not shut down properly.
System Log, Event ID: 18, Source: Microsoft-Windows-WHEA-Logger
Description: “A fatal hardware error has occurred.” This error indicates that there is a hardware problem.
First, I would like to say that I think you are on the right track: your research is valid and you found good events.
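To make the list concrete, here is a minimal sketch in Python of matching those three event ID / source pairs. It is only an illustration of the logic, not a Splunk search: the field names `EventCode`, `SourceName` and `host`, the function names, and the short labels are my assumptions, so check them against what your own data actually contains.

```python
# Event ID / source pairs taken from the question above.
# The short labels are illustrative, not official event descriptions.
CRASH_SIGNATURES = {
    (41, "Microsoft-Windows-Kernel-Power"): "rebooted without a clean shutdown",
    (6008, "EventLog"): "previous shutdown was unexpected",
    (18, "Microsoft-Windows-WHEA-Logger"): "fatal hardware error",
}


def classify_event(event):
    """Return a label if the event matches a known signature, else None.

    `event` is a hypothetical dict with 'EventCode' and 'SourceName' keys
    (assumed field names; adjust them to your own extraction).
    """
    try:
        key = (int(event["EventCode"]), event["SourceName"])
    except (KeyError, TypeError, ValueError):
        # Missing or non-numeric fields: treat as "no match".
        return None
    return CRASH_SIGNATURES.get(key)


def alerts_for(events):
    """Collect (host, label) pairs for every event that matches."""
    alerts = []
    for event in events:
        label = classify_event(event)
        if label is not None:
            alerts.append((event.get("host", "unknown"), label))
    return alerts


if __name__ == "__main__":
    sample = [
        {"host": "srv01", "EventCode": "41",
         "SourceName": "Microsoft-Windows-Kernel-Power"},
        {"host": "srv02", "EventCode": "7036",
         "SourceName": "Service Control Manager"},
    ]
    for host, label in alerts_for(sample):
        print(host, label)
```

Matching on the event ID and the source together matters, because the same event ID number can be used by different sources with a completely different meaning.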
Windows logs are verbose and there is plenty to look for. Having said that, I have seen many Windows admins and ops people look at different events for the same or similar use cases.
Before I continue, I highly recommend consulting your Windows admins / SMEs and asking them what they see most often and what is important for them to be alerted on.
Of the many sources online on this subject, these two links are pretty good. I chose them because they also add a security aspect to the mix: http://www.redblue.team/2015/09/spotting-adversary-with-windows-event.html http://www.redblue.team/2015/09/spotting-adversary-with-windows-event_21.html
From an operational perspective, I have often seen that WinHostMon (Windows host monitoring) can be very useful as a source, either on top of the system logs or by itself.