After installing Splunk for Box, why is cluster master unable to connect to peer, but UI shows peer status as "Up"?

sat94541
Communicator

Since I installed SplunkForBox, my Splunk deployment has been unstable. We have a cluster master (CM), two cluster peers, and one search head, with SF=2 and RF=2.

The issue is that the CM is very often unable to connect to one of the peers, which causes searches to fail. During this time the cluster master UI still shows the peer's status as "Up". The second peer, "dctlaplog01.discovery.com", has no issue.

halr9000
Motivator

Version 1.1 of the Box app has been published. The offending scheduled searches have been removed.

niemesrw
Path Finder

This still looks like it hasn't been fixed. 1.2 still has searches like these: index=box earliest = -30d|dedup event_id| stats count by ip_address| sort ip_address

rbal_splunk
Splunk Employee

During debugging we found that the SH -> IDX connectivity issues coincided with the following records in the peer's splunkd_access.log. They show searches streaming MASSIVE amounts of data back to the search head (SH), for periods as long as 90 seconds:


172.25.254.24 - splunk-system-user [26/Aug/2014:10:45:09.828 -0400] "GET /services/search/jobs/remote_dctlapsrch01_1409064304.10954/search.log HTTP/1.0" 200 41345 - - - 3ms

127.0.0.1 - splunk-system-user [26/Aug/2014:10:45:09.870 -0400] "POST /servicesNS/nobody/SplunkAppForXenApp/saved/searches/UserName%20EXP%20-%20Disk%20Space%20Cirtical/notify?trigger.condition_state=0 HTTP/1.0" 200 1884 - - - 6ms

172.25.254.24 - splunk-system-user [26/Aug/2014:10:43:40.590 -0400] "POST /services/streams/search?sh_sid=scheduler__nobody__SplunkForBox__RMD56a59ae778f9744fe_at_1409064000_4352 HTTP/1.0" 200 795110673 - - - 89975ms

172.25.254.24 - splunk-system-user [26/Aug/2014:10:43:40.588 -0400] "POST /services/streams/search?sh_sid=scheduler__nobody__SplunkForBox__RMD562d32cf5c0e92d47_at_1409064000_4353 HTTP/1.0" 200 801119413 - - - 90222ms

172.25.254.24 - splunk-system-user [26/Aug/2014:10:43:40.719 -0400] "POST /services/streams/search?sh_sid=scheduler__nobody__SplunkForBox__RMD517f0fa7737fcd363_at_1409064000_4355 HTTP/1.0" 200 789633822 - - - 90433ms

172.25.254.72 - splunk-system-user [26/Aug/2014:10:45:09.152 -0400] "POST /services/streams/search?sh_sid=scheduler__nobody__sos__RMD59d4672721e98f163_at_1409064300_1734 HTTP/1.0" 200 2223 - - - 2293ms

172.25.254.72 - splunk-system-user [26/Aug/2014:10:45:09.153 -0400] "POST /services/streams/search?sh_sid=SummaryDirector_1409064301.3248 HTTP/1.0" 200 170257 - - - 2372ms

172.25.254.24 - splunk-system-user [26/Aug/2014:10:45:09.151 -0400] "POST /services/streams/search?sh_sid=1409064306.10955 HTTP/1.0" 200 7863 - - - 2648ms

172.25.254.24 - splunk-system-user [26/Aug/2014:10:45:09.151 -0400] "POST /services/streams/search?sh_sid=1409064304.10954 HTTP/1.0" 200 7841 - - - 2657ms

172.25.254.24 - splunk-system-user [26/Aug/2014:10:43:40.269 -0400] "POST /services/streams/search?sh_sid=scheduler_nobodySplunkForBox_login_at_1409064000_4356 HTTP/1.0" 200 816007836 - - - 91849ms

172.25.254.24 - splunk-system-user [26/Aug/2014:10:43:40.717 -0400] "POST /services/streams/search?sh_sid=scheduler_nobodySplunkForBox_RMD50e158fc1bca7fae7_at_1409064000_4354 HTTP/1.0" 200 816100097 - - - 91426ms

172.25.254.24 - splunk-system-user [26/Aug/2014:10:45:12.251 -0400] "GET /services/search/jobs/remote_dctlapsrch01_1409064295.10952/search.log HTTP/1.0" 200 102188 - - - 3ms

172.25.254.24 - splunk-system-user [26/Aug/2014:10:45:12.369 -0400] "GET /services/search/jobs/remote_dctlapsrch01_1409064296.10953/search.log HTTP/1.0" 200 102189 - - - 3ms

172.25.254.24 - splunk-system-user [26/Aug/2014:10:43:40.648 -0400] "POST /services/streams/search?sh_sid=scheduler_nobodySplunkForBox_RMD5f9de3dfc7a54d55b_at_1409064000_4357 HTTP/1.0" 200 801332138 - - - 92451ms

(...)
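For anyone who wants to look for the same pattern in their own splunkd_access.log, here is a minimal Python sketch that picks out large /services/streams/search responses. It assumes records are laid out like the ones quoted above (request in quotes, then status, response bytes, and a trailing duration in ms). The function name and the 100 MB threshold are my own choices for illustration, not anything Splunk ships.

```python
import re

# Tail of a splunkd_access.log record, as in the samples above:
#   "<METHOD> <uri> HTTP/x.y" <status> <bytes> - - - <N>ms
RECORD = re.compile(
    r'"(?P<method>[A-Z]+) (?P<uri>\S+) HTTP/[\d.]+" '
    r'(?P<status>\d{3}) (?P<bytes>\d+) .*? (?P<ms>\d+)ms\s*$'
)

def large_streams(lines, min_bytes=100 * 1024 * 1024):
    """Yield (sid, mebibytes, seconds) for big /services/streams/search responses.

    min_bytes is an arbitrary cut-off chosen for this sketch.
    """
    for line in lines:
        m = RECORD.search(line)
        if not m or not m.group("uri").startswith("/services/streams/search"):
            continue
        size = int(m.group("bytes"))
        if size < min_bytes:
            continue
        # Everything after "sh_sid=" is the search-head search ID.
        sid = m.group("uri").partition("sh_sid=")[2]
        yield sid, size / (1024 * 1024), int(m.group("ms")) / 1000
```

Pass it any iterable of lines, for example an open file handle on the peer's splunkd_access.log. The search IDs it returns can then be looked up in audit.log to find the search string.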


Each of the SplunkForBox records above accounts for roughly 780 MB of data streamed from the indexer to the search head. This indicates poorly written searches that use non-streaming search commands (reverse, transaction, head, table, dedup) very early in the search pipeline and therefore fail to leverage map-reduce.

Example from audit.log:

…2/log/audit.log.5:08-25-2014 23:06:24.353 -0400 INFO AuditLogger - Audit:[timestamp=08-25-2014 23:06:24.353, user=splunk-system-user, action=search, info=granted , search_id='scheduler_nobodySplunkForBox_RMD5f9de3dfc7a54d55b_at_1409022300_16118', search='search index=box earliest = -30d|dedup event_id| stats count by created_by.name| sort created_by.name|rename created_by.name as user_name', autojoin='1', buckets=0, ttl=600, max_count=500000, maxtime=8640000, enable_lookups='1', extra_fields='', apiStartTime='ZERO_TIME', apiEndTime='Mon Aug 25 23:05:00 2014', savedsearch_name="user_name"][n/a]

Note how "dedup" is the very first command that we apply to the events returned, of which there are quite a few:

……./log/audit.log.5:08-25-2014 23:08:08.616 -0400 INFO AuditLogger - Audit:[timestamp=08-25-2014 23:08:08.616, user=splunk-system-user, action=search, info=completed, search_id='scheduler_nobodySplunkForBox_RMD5f9de3dfc7a54d55b_at_1409022300_16118', total_run_time=82.40, event_count=1234680, result_count=4939, available_count=0, scan_count=1235682, drop_count=0, exec_time=1409022384, api_et=N/A, api_lt=1409022300.000000000, search_et=1406430300.000000000, search_lt=1409022385.577498000, is_realtime=0, savedsearch_name="user_name"][n/a]
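To put a number on "quite a few", the event_count and result_count fields of the completed record can be compared directly. The snippet below is a rough sketch: the helper name is made up, the regex is a naive key=value splitter rather than a real audit.log parser, and the record string is only an excerpt of the line quoted above.

```python
import re

def audit_fields(record):
    """Return the key=value pairs of an audit.log record as a dict of strings."""
    pairs = re.findall(r"(\w+)=([^,\]]*)", record)
    return {key: value.strip("'\" ") for key, value in pairs}

# Excerpt of the 'completed' record quoted above.
record = ("info=completed, total_run_time=82.40, event_count=1234680, "
          "result_count=4939, scan_count=1235682")

fields = audit_fields(record)
# About 250 events were pulled in for every row of the final result.
events_per_result = int(fields["event_count"]) / int(fields["result_count"])
```

A ratio that high means the search collects far more events than it finally reports, which is where the streamed volume comes from when the reduction cannot happen on the indexers.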

2) We checked the SplunkForBox app's saved searches:

………/savedsearches.txt | tr -s " " | cut -f2- -d " " | grep -P '^search|\['

[Box Top Downloaders]
[event_type]
search = index=box earliest = -30d|dedup event_id| stats count by event_type| sort event_type
[ip_address]
search = index=box earliest = -30d|dedup event_id| stats count by ip_address| sort ip_address
[item_name]
search = index=box earliest = -30d|dedup event_id| stats count by source.item_name| sort source.item_name | rename source.item_name as item_name
[item_type]
search = index=box earliest = -30d|dedup event_id| stats count by source.item_type| sort source.item_type | rename source.item_type as item_type
[login]
search = index=box earliest = -30d|dedup event_id| stats count by created_by.login| sort created_by.login|rename created_by.login as login
[user_name]
search = index=box earliest = -30d|dedup event_id| stats count by created_by.name| sort created_by.name|rename created_by.name as user_name

Note the systematic use of "dedup" as the second search operator (directly after the base search) in each of these scheduled searches, which basically guarantees that almost all matched events will be shipped back to the search head.
The current searches are not efficient in a distributed search environment. The app provider has been notified so that these searches can be improved.

sorenmaigaard
Path Finder

I don't seem to be able to find the Box app anymore.
The prior link just gives a 404.

What happened to this app?

Best
Soren

halr9000
Motivator

The app listing had an error which has since been corrected.

hrottenberg_spl
Splunk Employee

We are bringing this to Box's attention and will assist them with tuning the searches. Watch the app for future updates.

martin_mueller
SplunkTrust

Would replacing this

... | dedup event_id | stats count by X

with this

... | stats dc(event_id) by X

do the same thing? If so, smart pre-stats calculations on the indexers might severely reduce the network load.

That's assuming the values for X are the same throughout one event_id value... but I think the dedup approach needs that assumption to hold as well.
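To sanity-check that idea outside Splunk, here is a toy Python model. It is only a sketch of the reasoning, not how Splunk implements dedup or dc() internally, and the sample events and function names are made up. One function mimics "dedup event_id | stats count by X" run centrally over all events. The other two mimic each peer computing a partial distinct count and the search head merging the partials.

```python
from collections import defaultdict

def dedup_then_count(events, field):
    """Toy model of `dedup event_id | stats count by <field>` over all events."""
    seen = set()
    counts = defaultdict(int)
    for ev in events:
        if ev["event_id"] in seen:
            continue  # dedup keeps only the first event per event_id
        seen.add(ev["event_id"])
        counts[ev[field]] += 1
    return dict(counts)

def partial_dc(events, field):
    """What one peer could compute locally: distinct event_ids per field value."""
    partial = defaultdict(set)
    for ev in events:
        partial[ev[field]].add(ev["event_id"])
    return partial

def merge_dc(partials):
    """Search-head side: union the per-peer sets, then count each group."""
    merged = defaultdict(set)
    for partial in partials:
        for key, ids in partial.items():
            merged[key] |= ids
    return {key: len(ids) for key, ids in merged.items()}

# Hypothetical data: event 2 was indexed on both peers.
peer_a = [{"event_id": 1, "ip_address": "10.0.0.1"},
          {"event_id": 2, "ip_address": "10.0.0.2"}]
peer_b = [{"event_id": 2, "ip_address": "10.0.0.2"},
          {"event_id": 3, "ip_address": "10.0.0.1"}]

central = dedup_then_count(peer_a + peer_b, "ip_address")
distributed = merge_dc([partial_dc(peer_a, "ip_address"),
                        partial_dc(peer_b, "ip_address")])
assert central == distributed
```

In this model the two approaches agree as long as X is constant for a given event_id. If a duplicated event_id carried different X values, dedup would count it once under the first value seen, while the distinct count would count it once under each value. In the distributed version the peers only hand back event_id values grouped by X rather than whole events.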
