Good day,
I just installed Splunk Enterprise 8.1.0.1 over the top of 8.0.0 on a new all-in-one Windows Server 2019 Datacenter Core box (the drive was moved from the old system to the new one). Initially I got some splunkd crashes related to incompatible Python 2 code in apps, which prevented splunkd from starting. I disabled the offending apps from the command line and... all good. The new system has 56 GB of memory and 16 vCPUs. I have two indexes with their colddb paths set to F: and G:, while the hot/warm buckets still live on the E: drive.
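For anyone following along, disabling the apps from the command line was roughly this (the install path, credentials, and app name below are placeholders, not the real ones):
rem run from the Splunk bin directory (default install path shown)
cd "C:\Program Files\Splunk\bin"
rem disable the offending app - the name here is just an example
splunk disable app some_python2_only_app -auth admin:placeholder
rem restart so splunkd comes up without the app loaded
splunk restart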
Today I noticed that splunkd is still crashing, but on a different thread than the one that halted it before.
Access violation, cannot read at address [0x000002EC99CBE6F8] Exception address: [0x00007FF65CF8EAB6] Crashing thread: BucketSummaryActorThread
The only 'direct' reference I found to this was about disabling Transparent Huge Pages (THP) on a Linux system, with nothing relevant to a Windows Server 2019 Core system. I'm sure there are some parallels between memory management on Linux and on Server Core, but I don't see them.
Any assistance is appreciated.
Best regards,
Greg
Hi Greg
Sorry about the issues.
I have verified that "Lock Pages in Memory" is not enabled in the Local Security Policy (secpol) on this system.
I may have identified the source of the crash: a SplunkAppForWebAnalytics acceleration job that fires every ten minutes and crashes the BucketSummaryActorThread. Now I need to figure out how to address it.
ERROR SavedSplunker - savedsearch_id="nobody;SplunkAppForWebAnalytics;_ACCELERATE_DM_SplunkAppForWebAnalytics_Web_ACCELERATE_", message="The search job
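To tie the crashes to this job I have been checking the scheduler log for that saved search with something along these lines (field list trimmed to what I care about):
index=_internal sourcetype=scheduler app="SplunkAppForWebAnalytics" savedsearch_name="_ACCELERATE_DM_SplunkAppForWebAnalytics_Web_ACCELERATE_"
| table _time, savedsearch_name, status, run_time, result_count
The status and run_time fields make it easy to line the scheduled runs up against the crash timestamps.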
I found where a similar issue was fixed in 8.0.2. Could this be related in any way to those previous bug fixes?
"https://docs.splunk.com/Documentation/Splunk/8.0.2/ReleaseNotes/Fixedissues","Search issues","2020-01-06","SPL-180268, SPL-177675","Crash in BucketSummaryActorThread for a specific summary directory, persists after removing"
OK... I have finally isolated the individual job that is crashing the thread. I would like to disable a scheduled job under Web Analytics called "Generate Goal summary - Scheduled", as it is the culprit; I ran it manually and it crashed. I checked under Goals and do not see any configured, and the WA_goals.csv lookup is empty. Under "Goals" > "Setup" I do not see any goals configured for any of our sites, so I don't think it is used.
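If it is safe to turn off, my plan would be to override it in a local savedsearches.conf inside the app rather than touch the app defaults; a sketch, assuming the stanza name matches the search name shown in the UI and the app directory matches the app ID from the error above:
# %SPLUNK_HOME%\etc\apps\SplunkAppForWebAnalytics\local\savedsearches.conf
[Generate Goal summary - Scheduled]
# take it off the scheduler (alternatively, disabled = 1 disables the search entirely)
enableSched = 0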
Here is the code for that job, in case any of you can figure out whether it is a bug between this app and 8.1.0.1 or just bad code!
| inputlookup WA_goals
| map maxsearches=10000 search="search tag=web eventtype=pageview site=\"$site$\"
| eval goal_id=\"$goal_id$\"
| eval goal_start=\"$start$\"
| eval goal_end=\"$end$\"
| rex field=goal_start mode=sed \"s/\*/%/g\"
| rex field=goal_end mode=sed \"s/\*/%/g\"
| lookup WA_sessions user AS user OUTPUT http_session,http_session_start,http_session_end,http_session_pageviews,http_session_duration,http_referer,http_referer_domain AS http_session_referrer_domain,http_referer_hostname AS http_session_referrer_hostname,http_session_channel
| where isnotnull(http_session)
| lookup user_agents http_user_agent
| eval ua_mobile=if(eventtype==\"ua-mobile\",'ua_device', \"\")
| bucket _time span=10m
| stats dc(http_session) AS Sessions,dc(user) AS Users,count(eval(like(uri,goal_start))) AS Entries,count(eval(like(uri,goal_end))) AS Completed by _time, site, goal_id, http_session_channel, http_session_referrer_domain,ua_family,ua_mobile,ua_os_family,goal_id
| eval goal_start=\"$start$\"
| eval goal_end=\"$end$\"
| collect index=goal_summary
"
That job was just a coincidence in scheduling; the crash is still occurring. I see that it matches the Web data model acceleration cron schedule. I decided to try a rebuild, but the crash still persists. Does anyone have any other suggestions, short of not using the Web Analytics app? That is a no-go, as the customer uses it very extensively.
Configuration Settings
These settings can be changed by going to Actions and selecting Edit Acceleration.
allow_old_summaries = true
allow_skew = 0
backfill_time = -
cron_schedule = 2,12,22,32,42,52 * * * *
earliest_time = -3mon
hunk.compression_codec = -
hunk.dfs_block_size = 0
hunk.file_format = -
manual_rebuilds = true
max_concurrent = 6
max_time = 3600
poll_buckets_until_maxtime = false
schedule_priority = highest
workload_pool = -
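For completeness, the same settings can be expressed in a local datamodels.conf override; a sketch, assuming the data model itself is named "Web" as the _ACCELERATE_DM_SplunkAppForWebAnalytics_Web_ACCELERATE_ search ID suggests (flipping acceleration to 0 here would also be a way to pause the acceleration temporarily while testing):
# %SPLUNK_HOME%\etc\apps\SplunkAppForWebAnalytics\local\datamodels.conf
[Web]
# set to 0 to pause acceleration while troubleshooting
acceleration = 1
acceleration.earliest_time = -3mon
acceleration.cron_schedule = 2,12,22,32,42,52 * * * *
acceleration.allow_old_summaries = true
acceleration.allow_skew = 0
acceleration.manual_rebuilds = true
acceleration.max_concurrent = 6
acceleration.max_time = 3600
acceleration.schedule_priority = highest
acceleration.poll_buckets_until_maxtime = false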
Hi Greg
Sorry about the issues.
Thank you so much. Upgrading the Splunk App for Web Analytics to 2.2.5 resolved the issue.
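For anyone hitting this later, you can confirm which version of the app is actually installed with a quick REST search, something like:
| rest /services/apps/local
| search title="SplunkAppForWebAnalytics"
| table label, version, disabled
The title value is the app's directory name, so adjust it if your install uses a different app ID.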