Solved: How do I monitor JBOSS/Tomcat/Apache/etc and raise...

nickhills · ‎12-28-2017

I have seen this (or a similar) question many times on Answers, and I thought I would create a post on my preferred way to do this.
As with most things Splunk, there are any number of ways to tackle this, and I would welcome any feedback or different approaches.

Let’s start with defining what we want to monitor:

Scenario 1:
I want to know if the [servicename] process stops

Scenario 2:
I want to know if the [servicename] stops working.

These two scenarios are similar, but subtly different.
In Scenario 1, the process running the service has stopped, either because it died (horribly), because someone stopped it, or because it never started.
These are all valid things to be alerted to. However, if you have worked with Java/PHP long enough, you will know that JBoss/Tomcat/Apache can very often still be ‘running’ while doing nothing of any use.

This is why I suggest considering Scenario 2: do you really care whether the [servicename] process is running, as long as you know the service is still working?

All of the above services are relatively chatty, even when idle, so you would expect to see them writing log lines every few minutes.
If the logs stop being written, then either the service is stuck but still running (Scenario 2), or it has stopped (Scenario 1), or something else is wrong, such as a full disk, which you probably also care about.

So how to monitor for these scenarios?

If my comment helps, please give it a thumbs up!

nickhills · ‎12-28-2017

Scenario 1
To monitor a running process there are a few options of varying complexity and suitability.

1.) Install the relevant TA and enable process monitoring (splunk_TA_windows, splunk_TA_nix). Both of these TAs provide a mechanism to monitor all running processes and report them to Splunk.
2.) Write a script, deployed to the host, which determines whether the service is running and reports the result as an input. This is often suggested on Answers, but unless your use case is very specific, I feel there is little value in reinventing the wheel. (Circumstances where rolling your own might be useful include monitoring AS400/MF systems, checking that a service returns a specific response on a socket, or verifying that a chain of dependencies is running and communicating. There are many others, I am sure, but if you're starting from “is x process running”, writing a custom script just makes it harder than it needs to be.)
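
For the cases where a custom script is justified, here is a minimal sketch of option 2 in Python. It assumes the check is a plain TCP connect and that the script's standard output is collected by Splunk as a scripted input. The service name, host, port and the key=value event layout are all hypothetical; adjust them to your own setup.

```python
import socket
import time
from datetime import datetime, timezone


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable or timed out: treat all of these as "down".
        return False


def format_event(service: str, up: bool, now: float) -> str:
    """Build one timestamped key=value line describing the check result."""
    stamp = datetime.fromtimestamp(now, tz=timezone.utc).isoformat()
    status = "up" if up else "down"
    return f"{stamp} service={service} status={status}"


if __name__ == "__main__":
    # Hypothetical target: a Tomcat connector listening on localhost:8080.
    print(format_event("tomcat", port_is_open("127.0.0.1", 8080), time.time()))
```

Each run prints one line, so the interval you schedule the script at becomes the resolution of your monitoring. A successful connect only shows that something is listening on the port, not that the application behind it is answering correctly.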

Whichever way you choose to monitor your service, Splunk generally reports on the occurrence of events rather than their absence, so we need a search that spots the events that are missing.
A search like the following will get you there:

index=os sourcetype=ps splunkd
  | stats latest(_time) as lastSeen by host
  | eval status=if(lastSeen<(now() - 300), "late","recent")
  | table host status

You will want to adjust these times to suit, but you should run the search over a broad enough time range to capture a time when the service was working, spot that it has stopped, receive the alert and resolve it.
In my environment 1 hour is sufficient, but you might choose 24 or 48 hours, depending on your needs.
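
To make the late/recent logic concrete outside of SPL, here is a small Python sketch of the same comparison. The host names and timestamps are made up for illustration; the 300 second threshold is the one used in the search.

```python
import time

LATE_AFTER_SECONDS = 300  # same threshold as the search above


def classify(last_seen: dict[str, float], now: float,
             threshold: float = LATE_AFTER_SECONDS) -> dict[str, str]:
    """Map each host to "late" or "recent" based on its newest event time."""
    return {
        host: "late" if seen < now - threshold else "recent"
        for host, seen in last_seen.items()
    }


if __name__ == "__main__":
    now = time.time()
    # Hypothetical hosts: web01 reported 60 seconds ago, web02 10 minutes ago.
    print(classify({"web01": now - 60, "web02": now - 600}, now))
```

A host only appears in the result if it has at least one event in the input. The search behaves the same way: a host with no events inside the time range is not reported as late, it is simply absent, which is why the time range needs to reach back to a point when the service was still reporting.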

Scenario 2
This is the subtle difference: the logic is the same, except that you are monitoring the output of the process (the fact that it's writing logs) rather than the fact that a process is consuming resources. This also has the happy coincidence of being much faster, because we can use tsidx metadata rather than event data.

| tstats earliest(sourcetype) as sourcetypes earliest(_time) as etime where sourcetype=messages by host
  | join host [| tstats latest(sourcetype) as sourcetypes latest(_time) as ltime where index=* by host ]
  | eval status=if(ltime<(now() - 300), "late", "recent")
  | table host status

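In this case we look for a specific sourcetype (messages in the example above; I use tomcat:catalina) and find the earliest event for each host in the search window.
For each host in that list, the subsearch then finds the latest event from the same host. As written, the subsearch matches any sourcetype, so add the same sourcetype filter to it if you only care about the service's own logs.
Finally, calculate the time between the last event and now(): if it is more than 300 seconds, mark the host as late.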

Whichever approach you take, you now have a table which reports each hostname and whether the data you care about has been seen recently or is ‘late’.

Add | stats count by status and you can show the split between late and recent hosts in a pie chart visualization.

If you want a single-value panel showing how many hosts are experiencing issues, you can use:

<row>
         <panel>
           <single>
             <search>
               <query><![CDATA[
| tstats earliest(sourcetype) as sourcetypes earliest(_time) as etime where sourcetype=messages by host
| join host [| tstats latest(sourcetype) as sourcetypes latest(_time) as ltime where index=* by host ]
| eval status=if(ltime<(now() - 300), "late", "recent")
| table host status
| search status=late
| stats count
               ]]></query>
               <earliest>@d</earliest>
               <latest>now</latest>
               <sampleRatio>1</sampleRatio>
             </search>
             <option name="colorBy">value</option>
             <option name="colorMode">block</option>
             <option name="drilldown">none</option>
             <option name="numberPrecision">0</option>
             <option name="rangeColors">["0x65a637","0xd93f3c"]</option>
             <option name="rangeValues">[0]</option>
             <option name="showSparkline">1</option>
             <option name="showTrendIndicator">1</option>
             <option name="trendColorInterpretation">standard</option>
             <option name="trendDisplayMode">absolute</option>
             <option name="underLabel">Hosts Missing service x</option>
             <option name="unitPosition">after</option>
             <option name="useColors">1</option>
             <option name="useThousandSeparators">1</option>
           </single>
         </panel>
       </row>

Hopefully that covers a few of the options available; there are plenty of others. Please submit other answers if you have alternatives.

nickhills · ‎01-02-2018

If you are looking for an app which will give you excellent insight into your data sources (and you don't need to build monitoring into your own app), take a look at the quite excellent Meta Woot by Discovered Intelligence: https://splunkbase.splunk.com/app/2949/

