Deployment Architecture

Any fixes or workarounds for these post 6.5.1 upgrade issues?

Influencer

Upgraded my clusters from 6.4.4 to 6.5.1 last night. Things appeared okay, but this morning 2 problems surfaced:

  • scheduled searches are not running on the SHC. If you open the saved search settings and click save, [EDIT: They show a schedule time, but don't actually fire.]
  • 2/10 of our clustered indexers have filled queues. A restart of splunk gets things moving again for a few minutes, then back to full queues, blocked indexing. No errors are being logged. No indication of why they're blocked. Or why they work for 5-10 minutes, then stop.

Anyone else?

EDIT: mistaken description. Fake-editing the scheduled search gives it a "scheduled time" in the future, but it doesn't fire.

EDIT 2: Scheduling problems looks to be related to a known bug that was due to be fixed in 6.5.1, but apparently wasn't. https://answers.splunk.com/answers/456812/why-are-alerts-not-working-after-upgrade-to-splunk-1.html

EDIT 3: The problem referenced in EDIT 2 above was not related, although the error message was similar. See answer below.

0 Karma
1 Solution

Influencer

Answer for problem 1: The error in the _internal index, vector::_M_range_check, led us to this more detailed error:

12-08-2016 20:12:33.160 -0500 ERROR StatsProcessor - Error in 'stats' command: 3 duplicate rename field(s). Original renames: [c ftime ltime I_EMAIL I_CELL I_DIALCODE I_UID VIEW_ID ERROR_ID DELIVERY_METHOD CALLER OUTPUT_TYPE VIEW_ID ERROR_ID URI ORGID AOID ORG_NAME UID VALIDATED_CHANNEL_COUNT UID_RETRIEVED FN_NOT_MATCH R_FN DELIVERY_METHOD]. Duplicate renames: [DELIVERY_METHOD ERROR_ID VIEW_ID].

There was an accelerated saved search that had multiple duplicated fields. As soon as I edited the search to remove the dupes, everything cleared up. 6.4.4 did not trip up on this, but 6.5.1 did. Big thanks to Terrance Lam @ Splunk Support for finding this.

Still dealing with problem 2. No closer to resolution there.

EDIT - Answer for problem 2: The 2 indexers that were periodically blocking all indexing could not see our AD server for LDAP auth. The connection was timing out. This was always happening, but 6.5.1 appears to handle it badly. The entire splunkd process blocks for long periods of time occasionally. I entered the hostname in my /etc/hosts file pointing to localhost as a quick work around. The connection is immediately refused and Splunk handles that better.

View solution in original post

Influencer

Answer for problem 1: The error in the _internal index, vector::_M_range_check, led us to this more detailed error:

12-08-2016 20:12:33.160 -0500 ERROR StatsProcessor - Error in 'stats' command: 3 duplicate rename field(s). Original renames: [c ftime ltime I_EMAIL I_CELL I_DIALCODE I_UID VIEW_ID ERROR_ID DELIVERY_METHOD CALLER OUTPUT_TYPE VIEW_ID ERROR_ID URI ORGID AOID ORG_NAME UID VALIDATED_CHANNEL_COUNT UID_RETRIEVED FN_NOT_MATCH R_FN DELIVERY_METHOD]. Duplicate renames: [DELIVERY_METHOD ERROR_ID VIEW_ID].

There was an accelerated saved search that had multiple duplicated fields. As soon as I edited the search to remove the dupes, everything cleared up. 6.4.4 did not trip up on this, but 6.5.1 did. Big thanks to Terrance Lam @ Splunk Support for finding this.

Still dealing with problem 2. No closer to resolution there.

EDIT - Answer for problem 2: The 2 indexers that were periodically blocking all indexing could not see our AD server for LDAP auth. The connection was timing out. This was always happening, but 6.5.1 appears to handle it badly. The entire splunkd process blocks for long periods of time occasionally. I entered the hostname in my /etc/hosts file pointing to localhost as a quick work around. The connection is immediately refused and Splunk handles that better.

View solution in original post

Splunk Employee
Splunk Employee

for #2

1) whats your indexing - thruput like for those indexers? pre-upgrade and post-upgrade. (if ur maxed on thruput, could just be an excessive amount of data being forwarded there)
2) are those indexers constantly creating hot buckets? (rolling hot buckets and creating new hot buckets slows down indexing rates a lot)
3) any consistent ERRORS / WARNS in the logs for those 2 indexers?

0 Karma

Influencer

1) In the 7 MB/s range. Unchanged from before. If it was load related, when the 2 dropped, I would expect the surfing members to be overwhelmed. They are not. They handle the additional load fine, up to 15 MB/s sometimes.
2) BucketMover activity is no more or less than the other indexers. (When they're active. When they block for long periods of time, BucketMover activity disappears, as it should.)
3) No. That would be nice! At least I'd have somewhere to start.

2 pieces of curious info: The 2 usually block together. Rarely does one start blocking without the other. And when they are blocked they take an abnormally long time to restart. While waiting for the restart to happen, with the "..." crawling across the screen, server load drops to 0. Vs while running but blocked, load is at ~ 1. And while running normally, load is ~ 10.

0 Karma

Splunk Employee
Splunk Employee

contact support 😉

are there a lot of buckets on those 2 indexers?

0 Karma

Influencer

See the ldap answer above

0 Karma

Splunk Employee
Splunk Employee

I have found that these kinds of questions involve more diagnosis than can normally be done in an Answers post.

I recommend you contact Splunk Support for assistance with this.

0 Karma

Influencer

Looks like the scheduling issue is a carryover from 6.5.0 that was not fixed. Hoping there's a workaround somewhere: https://answers.splunk.com/answers/456812/why-are-alerts-not-working-after-upgrade-to-splunk-1.html

0 Karma

Influencer

Not related. Same error, but different cause.

0 Karma

Influencer

Done. Still waiting for help. 😞 effectively down in the meantime.

0 Karma