We run ITSI 4.2.1 on Splunk 7.2.8 in a search head cluster (SHC) with a multisite indexer cluster, and some notable events have intermittently gone missing.
There are times when correlation searches do NOT create notable events. This happens at random times throughout the day. Most of the time notable events are created, but occasionally business-critical alerts are missed.
Here's what I found:
i) The search below counts scheduled runs that succeeded and returned results but fired no alert action. It shows that the notables go missing intermittently, and only when the searches were assigned to one particular search head.
index=_internal source=*scheduler.log status=success result_count > 0 alert_action="" (savedsearch_name=A OR savedsearch_name=B OR savedsearch_name=C) | stats count by host
On that SH, this started 7 days ago.
ii) There are also errors in splunkd.log:
ERROR ModularInputs - Unable to initialize modular input "itsi_entity_exchange_consumer"
A timechart showed that this error first appeared at a specific time, Oct 26 05:49, and has persisted since then. splunkd.log also shows that Splunk stopped and started again around that time.
/var/log/messages shows the kernel complaining about running out of memory, and the OOM killer killing kvstore and splunkd at Oct 26 05:46. Splunk was then presumably restarted at 05:49.
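For anyone who wants to repeat the /var/log/messages check, here is a minimal sketch. The function name find_oom_kills is made up for illustration, and the grep patterns are the usual kernel OOM phrases; the exact wording and the log location can differ by kernel version and distribution, so treat both as assumptions.

```shell
#!/bin/sh
# Print kernel OOM-killer lines from a syslog-style file.
# Default path is the one used in this post; some systems log to
# /var/log/syslog or only to the journal instead (assumption).
find_oom_kills() {
    logfile="${1:-/var/log/messages}"
    # Case-insensitive match on the common kernel OOM phrases.
    grep -iE 'out of memory|oom-killer|killed process' "$logfile"
}

# Usage (reading the file usually needs root):
#   find_oom_kills /var/log/messages
```

Comparing the timestamps it prints with the first occurrence of the ModularInputs error in splunkd.log is how the 05:46 / 05:49 correlation above was made.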
iii) Ever since then, the itsi_event_generator alert action has failed to load Python libraries:
10-29-2019 00:12:49.313 -0700 ERROR sendmodalert - action=itsi_event_generator STDERR - Traceback (most recent call last):
10-29-2019 00:12:49.313 -0700 ERROR sendmodalert - action=itsi_event_generator STDERR - File "/users/splunk/etc/apps/SA-ITOA/bin/itsi_event_generator.py", line 8, in <module>
10-29-2019 00:12:49.313 -0700 ERROR sendmodalert - action=itsi_event_generator STDERR - from ITOA.setup_logging import getLogger
10-29-2019 00:12:49.313 -0700 ERROR sendmodalert - action=itsi_event_generator STDERR - File "/users/splunk/etc/apps/SA-ITOA/lib/ITOA/setup_logging.py", line 9, in <module>
10-29-2019 00:12:49.313 -0700 ERROR sendmodalert - action=itsi_event_generator STDERR - from splunk.appserver.mrsparkle.lib import i18n
SNIP
10-29-2019 00:12:49.313 -0700 ERROR sendmodalert - action=itsi_event_generator STDERR - _install_highlighting()
10-29-2019 00:12:49.314 -0700 ERROR sendmodalert - action=itsi_event_generator STDERR - File "/users/splunk/lib/python2.7/site-packages/mako/exceptions.py", line 252, in _install_highlighting
10-29-2019 00:12:49.314 -0700 ERROR sendmodalert - action=itsi_event_generator STDERR - _install_fallback()
10-29-2019 00:12:49.314 -0700 ERROR sendmodalert - action=itsi_event_generator STDERR - File "/users/splunk/lib/python2.7/site-packages/mako/exceptions.py", line 243, in _install_fallback
10-29-2019 00:12:49.314 -0700 ERROR sendmodalert - action=itsi_event_generator STDERR - from mako.filters import html_escape
10-29-2019 00:12:49.314 -0700 ERROR sendmodalert - action=itsi_event_generator STDERR - ImportError: cannot import name html_escape
iv) The ImportError suggests that Python fails to load a library. We suspected that .pyc files were corrupted during the OOM situation.
On the SH, we followed the steps below so that the interpreter regenerates the .pyc files when Splunk starts and imports the modules.
$ splunk stop
$ cd $SPLUNK_HOME/lib/python2.7/site-packages
$ find . -name "*.pyc" -exec rm -f {} \;
$ splunk start
After this step the issue did not come back.
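The cleanup steps above can also be wrapped in a small helper that reports how many files it removed. This is only a sketch: purge_pyc is a made-up name, and it assumes Splunk is already stopped and that you pass it the site-packages directory from the steps above.

```shell
#!/bin/sh
# Remove compiled Python bytecode under a directory and report the count.
# Python writes fresh .pyc files the next time the modules are imported.
purge_pyc() {
    dir="$1"
    [ -d "$dir" ] || { echo "not a directory: $dir" >&2; return 1; }
    # Count first so there is a record of what was removed.
    count=$(find "$dir" -type f -name '*.pyc' | grep -c .)
    find "$dir" -type f -name '*.pyc' -exec rm -f {} +
    echo "removed $count .pyc files from $dir"
}

# Usage, with Splunk stopped:
#   purge_pyc "$SPLUNK_HOME/lib/python2.7/site-packages"
```

After starting Splunk again, one way to confirm the import from the traceback works is to run `$SPLUNK_HOME/bin/splunk cmd python -c 'from mako.filters import html_escape'`. If the import is healthy it prints nothing; if the problem is still there it should show the same ImportError.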