All Apps and Add-ons

Splunk Cloud - HUGE uptick in _internal errors involving .py persistent

kelstahl8705
Path Finder

Hey Splunk Community 🙂

OK, I've got a tale of woe, intrigue, revenge, index=_*, and Python 3.7.

My tale begins a few weeks ago when the other Splunk admin and I were just like, "OK, I know searches can be slow, but like EVERYTHING is just dragging." We opened a support ticket, talked about it with AOD, let our Splunk team know, got told we might be under-provisioned for SVCs and indexers, no wait over-provisioned, no wait do better searches, no wait again Skynet is like "why is your instance doing that?". We also got a Splunk engineer assigned to our case and were told our instance is fine.

Le sigh. When I tell you I rabbled rabbled rabbled racka facka Mr. Krabs... I was definitely salty.
So I took it upon myself to dive deeper than I have ever EEEEEVER dived before:

index=_* error OR failed OR severe OR ( sourcetype=access_* ( 404 OR 500 OR 503 ) )

I know, I know, it was a rough one, BUT down the rabbit hole I went. I ran this search back as far as my instance would go, October 2022, and counted from there. I was trying to find any sort of 'spike' or anomaly, something to explain that our instance is not fine.
October 2022 - 2
November 2022 - 0
December 2022 - 0
January - 25
February - 0
March - 29
April - 15
May - 44
June - 1,843
July - 40,081
August - 569,004
September - 119,696,269
October - don't ask... OK, fine: so far in October there are 21,604,091

The climb is real, and now I had to find what was causing it. From August back, it was a lot of connection/timeout errors from the UF on some endpoints, so nothing super weird, just a lot of them.

SEPTEMBER, specifically 9/2/23 11:49:25.331 AM, this girl blew up!
The first event_message was...
09-02-2023 16:49:25.331 +0000 ERROR PersistentScript [3873892 PersistentScriptIo] - From {/opt/splunk/bin/python3.7 /opt/splunk/etc/apps/TA-Zscaler_CIM/bin/TA_Zscaler_CIM_rh_settings.py persistent}: WARNING:root:Run function: get_password failed: Traceback (most recent call last):

The rest of the event messages that followed are in the 3 attached screenshots.

I did a 'last 15 min' search, but like September's, this hits the millions. Also, I see it's not just one app; it's several of the apps we use the API to get logs into Splunk with, but not all of the apps we use show up on the list (weird), and it's not just limited to 3rd party apps; the Splunk Cloud admin app is on there among others (see attached VSC doc). I also checked whether any of these apps might be out of date, and they are all on their current versions.

I did see one post on the community (https://community.splunk.com/t5/All-Apps-and-Add-ons/ERROR-PersistentScript-23354-PersistentScriptIo...) but there was no reply.
I also first posted on the Slack channel to see if anyone else was experiencing, or had experienced, this:
https://splunk-usergroups.slack.com/archives/C23PUUYAF/p1696351395640639
and last but not least, I did open another support ticket, so hopefully I can give an update if I get some good deets!

Appreciate you 🙂
-Kelly


tscroggins
Influencer

Hi Kelly,

The following error is normal when no proxy is enabled or no proxy credentials are saved in TA-Zscaler_CIM:

PersistentScript - From {/opt/splunk/bin/python3.7 /opt/splunk/etc/apps/TA-Zscaler_CIM/bin/TA_Zscaler_CIM_rh_settings.py persistent}: solnlib.credentials.CredentialNotExistException: Failed to get password of realm=__REST_CREDENTIAL__#TA-Zscaler_CIM#configs/conf-ta_zscaler_cim_settings, user=proxy.

The error is likely normal in TA-sailpoint_identitynow-auditevent-add-on and TA-trendmicrocloudappsecurity for the same reason.
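For context, the pattern behind this error can be sketched in plain Python. This is a simplified stand-in for solnlib's credential store, not the actual implementation; the class and function names below are hypothetical. The point is that a missing proxy credential raises an exception rather than returning empty, and a settings handler treats that as "no proxy configured":

```python
# Simplified illustration of the credential-lookup pattern behind the error.
# Stand-in for solnlib's CredentialManager; names here are hypothetical.

class CredentialNotExistException(Exception):
    """Raised when no password is stored for the requested realm/user."""

class FakeCredentialStore:
    def __init__(self, secrets):
        self._secrets = secrets  # {(realm, user): password}

    def get_password(self, realm, user):
        try:
            return self._secrets[(realm, user)]
        except KeyError:
            raise CredentialNotExistException(
                f"Failed to get password of realm={realm}, user={user}"
            )

def load_proxy_settings(store):
    """Mimics a REST settings handler: a missing proxy credential is normal
    when no proxy is configured, so it is caught and treated as 'no proxy'."""
    try:
        return store.get_password(
            "__REST_CREDENTIAL__#TA-Zscaler_CIM#configs/conf-ta_zscaler_cim_settings",
            "proxy",
        )
    except CredentialNotExistException:
        return None  # no proxy configured; the add-on logs this and carries on

store = FakeCredentialStore({})  # no proxy credentials saved
assert load_proxy_settings(store) is None
```

The add-ons log the exception on the way to handling it, which is why the warning shows up in _internal even though nothing is actually broken.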

The read timeout error in TA-trendmicrocloudappsecurity is caused by the Trend Micro /v1/siem/security_events endpoint not returning an HTTP response within 5 minutes, the default read timeout inherited by TA-trendmicrocloudappsecurity when it calls the Splunk Add-on Builder helper.send_http_request() method with timeout=None. The timeout value is not configurable, but TA-trendmicrocloudappsecurity/bin/input_module_tmcas_detection_logs.py could be modified to use a longer timeout value:

response = helper.send_http_request(
    url,
    "GET",
    parameters=params,
    payload=None,
    headers=headers,
    cookies=None,
    verify=True,
    cert=None,
    timeout=(None, 600),  # (connect, read) in seconds; 600 exceeds the 5-minute default
    use_proxy=use_proxy,
)

However, this change should be made by Trend Micro, preferably by making the connect and read timeout values fully configurable.
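As a sketch of what "fully configurable" could look like, the add-on could read optional connect/read timeout settings and fall back to sane defaults. The setting names `connect_timeout` and `read_timeout` below are made up for illustration; the output is the `(connect, read)` tuple that requests-style HTTP helpers accept:

```python
# Hypothetical sketch: parse optional timeout settings into a (connect, read)
# tuple. The setting names "connect_timeout"/"read_timeout" are assumptions.

DEFAULT_READ_TIMEOUT = 300  # seconds; matches the 5-minute default described above

def build_timeout(settings):
    """Return a (connect, read) timeout tuple from a settings dict.

    Missing or empty values fall back to defaults; connect may stay None
    (wait indefinitely to connect) while read gets a finite default.
    """
    def parse(value, default):
        if value in (None, ""):
            return default
        return float(value)

    connect = parse(settings.get("connect_timeout"), None)
    read = parse(settings.get("read_timeout"), DEFAULT_READ_TIMEOUT)
    return (connect, read)

assert build_timeout({}) == (None, 300)
assert build_timeout({"read_timeout": "600"}) == (None, 600.0)
```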

Explosions in splunkd.log events can often be caused by failures in modular or scripted inputs, where a script logs a message before a process fails, Splunk immediately restarts the process, and the cycle repeats ad infinitum. Your screenshots don't necessarily point to that, but you may get closer to a cause with:

index=_internal source=*splunkd.log* host=*splunkdcloud*
| cluster showcount=t
| sort 10 - cluster_count
| table cluster_count _raw

If you don't see anything with a cluster_count of the expected magnitude, remove host=*splunkdcloud* from the search. Change the sort limit from 10 to 0 to show all results.
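Conceptually, what that search does can be sketched in plain Python (a toy approximation, not Splunk's actual clustering algorithm): normalize each event so near-duplicates collapse onto one key, count each group, and sort by count descending so the noisiest pattern surfaces first.

```python
# Toy approximation of | cluster showcount=t | sort - cluster_count.
# Not Splunk's algorithm; digit runs are masked so events that differ only
# in timestamps/PIDs fall into the same cluster.
from collections import Counter
import re

def reduce_event(raw):
    """Collapse digit runs so near-identical events share one cluster key."""
    return re.sub(r"\d+", "#", raw)

def cluster(events, limit=10):
    counts = Counter(reduce_event(e) for e in events)
    return counts.most_common(limit)  # [(cluster_key, cluster_count), ...]

events = [
    "09-02-2023 16:49:25.331 ERROR PersistentScript [3873892] get_password failed",
    "09-02-2023 16:49:26.112 ERROR PersistentScript [3873901] get_password failed",
    "09-02-2023 16:50:01.007 WARN  TailReader could not read file",
]
top = cluster(events)
assert top[0][1] == 2  # the PersistentScript pattern dominates
```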


kelstahl8705
Path Finder

Holy events, tscroggins!
That search you provided blew my mind and my instance. ☠️
I did a 24 hour search, and I have like 10,000 stats results.

It is like so overwhelming reading all of these, I don't even know where to begin.
You and your search are the real MVP though. I did have to take out the host=*splunkdcloud* from the search because I did get zero, but after I did that, BOOM, all the results.


tscroggins
Influencer

Thwarted by high cardinality! You can adjust the similarity threshold of the cluster command with the t option:

| cluster showcount=t t=0.5

or change how the cluster command determines similarity with the match option:

| cluster showcount=t match=termset ``` unordered terms ```
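A rough intuition for match=termset with a threshold t (toy illustration, not Splunk's implementation): treat each event as an unordered set of terms and merge two events into one cluster when their overlap exceeds the threshold, so a lower t means fewer, bigger clusters.

```python
# Toy illustration of termset similarity with a threshold t.
# Jaccard overlap stands in for Splunk's measure, which may differ.

def termset(event):
    """Split an event into an unordered set of terms."""
    return set(event.lower().split())

def similar(a, b, t=0.8):
    """True when the Jaccard overlap of the two term sets reaches t."""
    sa, sb = termset(a), termset(b)
    return len(sa & sb) / len(sa | sb) >= t

a = "ERROR PersistentScript get_password failed user=proxy"
b = "get_password failed ERROR user=proxy PersistentScript"  # same terms, reordered
c = "WARN TailReader could not read file"
assert similar(a, b, t=0.8)      # unordered terms still match
assert not similar(a, c, t=0.5)  # different events stay apart
```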

If you find a frequently occurring event unrelated to your original question and want a bit of help, you'll get the best answer by starting a new question. Everyone here loves solving problems!

