Deployment Architecture

Any fixes or workarounds for these post 6.5.1 upgrade issues?

twinspop
Influencer

Upgraded my clusters from 6.4.4 to 6.5.1 last night. Things appeared okay, but this morning 2 problems surfaced:

  • Scheduled searches are not running on the SHC. [EDIT: If you open the saved search settings and click save, they show a scheduled time, but don't actually fire.]
  • 2 of our 10 clustered indexers have full queues. A restart of Splunk gets things moving again for a few minutes, then it's back to full queues and blocked indexing. No errors are being logged, and there is no indication of why they're blocked, or why they work for 5-10 minutes and then stop.

Anyone else?

EDIT: mistaken description. Fake-editing the scheduled search gives it a "scheduled time" in the future, but it doesn't fire.

EDIT 2: The scheduling problem looks to be related to a known bug that was due to be fixed in 6.5.1, but apparently wasn't. https://answers.splunk.com/answers/456812/why-are-alerts-not-working-after-upgrade-to-splunk-1.html

EDIT 3: The problem referenced in EDIT 2 above was not related, although the error message was similar. See answer below.

0 Karma
1 Solution

twinspop
Influencer

Answer for problem 1: The error in the _internal index, vector::_M_range_check, led us to this more detailed error:

12-08-2016 20:12:33.160 -0500 ERROR StatsProcessor - Error in 'stats' command: 3 duplicate rename field(s). Original renames: [c ftime ltime I_EMAIL I_CELL I_DIALCODE I_UID VIEW_ID ERROR_ID DELIVERY_METHOD CALLER OUTPUT_TYPE VIEW_ID ERROR_ID URI ORGID AOID ORG_NAME UID VALIDATED_CHANNEL_COUNT UID_RETRIEVED FN_NOT_MATCH R_FN DELIVERY_METHOD]. Duplicate renames: [DELIVERY_METHOD ERROR_ID VIEW_ID].

There was an accelerated saved search that had multiple duplicated fields. As soon as I edited the search to remove the dupes, everything cleared up. 6.4.4 did not trip up on this, but 6.5.1 did. Big thanks to Terrance Lam @ Splunk Support for finding this.
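If you hit the same error and want to spot the repeated fields quickly, here is a minimal Python sketch. It only counts the names copied from the "Original renames" list in the log line above; the function name `duplicate_fields` is made up for illustration and is not anything Splunk provides.

```python
from collections import Counter

# Field list copied from the "Original renames" part of the StatsProcessor error above.
ORIGINAL_RENAMES = (
    "c ftime ltime I_EMAIL I_CELL I_DIALCODE I_UID VIEW_ID ERROR_ID "
    "DELIVERY_METHOD CALLER OUTPUT_TYPE VIEW_ID ERROR_ID URI ORGID AOID "
    "ORG_NAME UID VALIDATED_CHANNEL_COUNT UID_RETRIEVED FN_NOT_MATCH R_FN "
    "DELIVERY_METHOD"
).split()


def duplicate_fields(fields):
    """Return the field names that occur more than once, sorted alphabetically."""
    counts = Counter(fields)
    return sorted(name for name, n in counts.items() if n > 1)


print(duplicate_fields(ORIGINAL_RENAMES))
# → ['DELIVERY_METHOD', 'ERROR_ID', 'VIEW_ID']
```

The result matches the "Duplicate renames" list that Splunk reported, so the same check can be run against the field list of any other saved search before upgrading.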

Still dealing with problem 2. No closer to resolution there.

EDIT - Answer for problem 2: The 2 indexers that were periodically blocking all indexing could not reach our AD server for LDAP auth. The connection was timing out. This had always been happening, but 6.5.1 appears to handle it badly: the entire splunkd process occasionally blocks for long periods of time. As a quick workaround, I pointed the AD server's hostname at localhost in my /etc/hosts file. The connection is now refused immediately, and Splunk handles that better than a timeout.
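To check from an indexer whether a connection to the auth server times out or is refused, something like the following Python sketch could be used. It is only an illustration of the timeout-versus-refused distinction described above: the function name `probe_tcp` is hypothetical, and the host and port you pass in are whatever your own LDAP configuration uses.

```python
import socket


def probe_tcp(host, port, timeout=3.0):
    """Classify a TCP connection attempt as 'open', 'refused', 'timeout', or 'error'."""
    try:
        # create_connection resolves the host and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The peer answered right away with a reset: fast failure.
        return "refused"
    except socket.timeout:
        # Nothing answered within `timeout` seconds: slow failure.
        return "timeout"
    except OSError:
        # Anything else, such as a DNS failure or an unreachable network.
        return "error"
```

For example, `probe_tcp("ad.example.com", 389)` would probe the standard LDAP port on a placeholder hostname. A result of "timeout" corresponds to the slow failure that was blocking splunkd here, while "refused" corresponds to the fast failure that the /etc/hosts workaround produces.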

dxu_splunk
Splunk Employee

For #2:

1) What's your indexing throughput like for those indexers, pre-upgrade and post-upgrade? (If you're maxed out on throughput, it could just be an excessive amount of data being forwarded there.)
2) Are those indexers constantly creating hot buckets? (Rolling hot buckets and creating new ones slows down indexing rates a lot.)
3) Any consistent ERRORs / WARNs in the logs for those 2 indexers?
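One way to see which queues are backing up is to count the blocked queue entries in metrics.log. The Python sketch below assumes the usual `group=queue, name=..., blocked=true` key-value layout of those lines; the sample lines are made up for illustration (they are not from this deployment), and `blocked_queue_counts` is a hypothetical helper name.

```python
import re
from collections import Counter

# Pulls the queue name out of a metrics.log "group=queue" line.
QUEUE_NAME = re.compile(r"group=queue,.*?\bname=(?P<name>[^,\s]+)")

# Made-up sample lines in the assumed metrics.log layout (illustration only).
SAMPLE_LINES = [
    "INFO Metrics - group=queue, name=indexqueue, blocked=true, max_size_kb=500, current_size_kb=499",
    "INFO Metrics - group=queue, name=typingqueue, blocked=true, max_size_kb=500, current_size_kb=499",
    "INFO Metrics - group=queue, name=parsingqueue, max_size_kb=6144, current_size_kb=0",
    "INFO Metrics - group=queue, name=indexqueue, blocked=true, max_size_kb=500, current_size_kb=499",
    "INFO Metrics - group=thruput, name=index_thruput, instantaneous_kbps=0.0",
]


def blocked_queue_counts(lines):
    """Count, per queue name, how many lines report blocked=true."""
    counts = Counter()
    for line in lines:
        if "blocked=true" not in line:
            continue
        match = QUEUE_NAME.search(line)
        if match:
            counts[match.group("name")] += 1
    return counts


print(dict(blocked_queue_counts(SAMPLE_LINES)))
# → {'indexqueue': 2, 'typingqueue': 1}
```

To run it against a real indexer, pass an open file handle for metrics.log (normally under $SPLUNK_HOME/var/log/splunk/) instead of `SAMPLE_LINES`.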

0 Karma

twinspop
Influencer

1) In the 7 MB/s range. Unchanged from before. If it were load related, I would expect the surviving members to be overwhelmed when the 2 dropped. They are not. They handle the additional load fine, up to 15 MB/s sometimes.
2) BucketMover activity is no more or less than the other indexers. (When they're active. When they block for long periods of time, BucketMover activity disappears, as it should.)
3) No. That would be nice! At least I'd have somewhere to start.

2 pieces of curious info: The 2 usually block together. Rarely does one start blocking without the other. And when they are blocked, they take an abnormally long time to restart. While waiting for the restart to happen, with the "..." crawling across the screen, server load drops to 0. While running but blocked, load is ~1. And while running normally, load is ~10.

0 Karma

dxu_splunk
Splunk Employee

contact support 😉

are there a lot of buckets on those 2 indexers?

0 Karma

twinspop
Influencer

See the LDAP answer above.

0 Karma

bshuler_splunk
Splunk Employee

I have found that these kinds of questions involve more diagnosis than can normally be done in an Answers post.

I recommend you contact Splunk Support for assistance with this.

0 Karma

twinspop
Influencer

Looks like the scheduling issue is a carryover from 6.5.0 that was not fixed. Hoping there's a workaround somewhere: https://answers.splunk.com/answers/456812/why-are-alerts-not-working-after-upgrade-to-splunk-1.html

0 Karma

twinspop
Influencer

Not related. Same error, but different cause.

0 Karma

twinspop
Influencer

Done. Still waiting for help. 😞 Effectively down in the meantime.

0 Karma