Upgraded my clusters from 6.4.4 to 6.5.1 last night. Things appeared okay, but this morning 2 problems surfaced:
Anyone else?
EDIT: mistaken description. Fake-editing the scheduled search gives it a "scheduled time" in the future, but it doesn't fire.
EDIT 2: The scheduling problem looks to be related to a known bug that was due to be fixed in 6.5.1, but apparently wasn't. https://answers.splunk.com/answers/456812/why-are-alerts-not-working-after-upgrade-to-splunk-1.html
EDIT 3: The problem referenced in EDIT 2 above was not related, although the error message was similar. See answer below.
Answer for problem 1: The error in the _internal index, vector::_M_range_check, led us to this more detailed error:
12-08-2016 20:12:33.160 -0500 ERROR StatsProcessor - Error in 'stats' command: 3 duplicate rename field(s). Original renames: [c ftime ltime I_EMAIL I_CELL I_DIALCODE I_UID VIEW_ID ERROR_ID DELIVERY_METHOD CALLER OUTPUT_TYPE VIEW_ID ERROR_ID URI ORGID AOID ORG_NAME UID VALIDATED_CHANNEL_COUNT UID_RETRIEVED FN_NOT_MATCH R_FN DELIVERY_METHOD]. Duplicate renames: [DELIVERY_METHOD ERROR_ID VIEW_ID].
There was an accelerated saved search that had multiple duplicated fields. As soon as I edited the search to remove the dupes, everything cleared up. 6.4.4 did not trip up on this, but 6.5.1 did. Big thanks to Terrance Lam @ Splunk Support for finding this.
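If you hit the same StatsProcessor error and the field list is long, a few lines of Python can pull the repeated names out of the message so you know what to remove from the search. This is only a rough sketch: the function name duplicate_fields is made up for illustration, and it assumes the message keeps the "Original renames: [...]" layout shown in the log line above.

```python
from collections import Counter

# Log line copied from the error above.
SAMPLE = (
    "12-08-2016 20:12:33.160 -0500 ERROR StatsProcessor - Error in 'stats' command: "
    "3 duplicate rename field(s). Original renames: [c ftime ltime I_EMAIL I_CELL "
    "I_DIALCODE I_UID VIEW_ID ERROR_ID DELIVERY_METHOD CALLER OUTPUT_TYPE VIEW_ID "
    "ERROR_ID URI ORGID AOID ORG_NAME UID VALIDATED_CHANNEL_COUNT UID_RETRIEVED "
    "FN_NOT_MATCH R_FN DELIVERY_METHOD]. Duplicate renames: [DELIVERY_METHOD ERROR_ID VIEW_ID]."
)


def duplicate_fields(error_line):
    """Return the field names listed more than once under 'Original renames'."""
    marker = "Original renames: ["
    start = error_line.index(marker) + len(marker)
    end = error_line.index("]", start)
    # Field names are whitespace-separated inside the brackets.
    counts = Counter(error_line[start:end].split())
    return sorted(name for name, n in counts.items() if n > 1)


if __name__ == "__main__":
    print(duplicate_fields(SAMPLE))  # ['DELIVERY_METHOD', 'ERROR_ID', 'VIEW_ID']
```

The result matches the "Duplicate renames" list Splunk prints, so the script is mostly useful as a cross-check while editing the saved search.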
Still dealing with problem 2. No closer to resolution there.
EDIT - Answer for problem 2: The 2 indexers that were periodically blocking all indexing could not reach our AD server for LDAP auth. The connection was timing out. This was always happening, but 6.5.1 appears to handle it badly: the entire splunkd process occasionally blocks for long periods of time. As a quick workaround, I added the AD server's hostname to my /etc/hosts file, pointing it at localhost. The connection is now refused immediately, and Splunk handles that better.
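To check from an indexer whether a connection to the LDAP server times out (the bad case here) or is refused right away, something like the following Python sketch may help. The function name probe, the hostname ad.example.com, and port 389 are all placeholders I picked for illustration, not values from our environment; substitute whatever your LDAP strategy actually points at.

```python
import socket


def probe(host, port, timeout=3.0):
    """Classify a TCP connection attempt as 'open', 'refused', 'timeout', or 'error: ...'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The remote end (or localhost, with the /etc/hosts workaround) rejected us immediately.
        return "refused"
    except socket.timeout:
        # No answer within `timeout` seconds: the case that seemed to stall splunkd.
        return "timeout"
    except OSError as exc:
        # DNS failures, unreachable networks, and so on.
        return "error: {}".format(exc)


if __name__ == "__main__":
    # Hypothetical host and port; replace with your own LDAP server.
    print(probe("ad.example.com", 389))
```

Running it on a healthy indexer and on one of the blocking indexers should show quickly whether they see the server differently.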
for #2
1) What's your indexing thruput like on those indexers, pre-upgrade and post-upgrade? (If you're maxed out on thruput, it could just be an excessive amount of data being forwarded there.)
2) Are those indexers constantly creating hot buckets? (Rolling hot buckets and creating new ones slows down indexing rates a lot.)
3) Any consistent ERRORs / WARNs in the logs for those 2 indexers?
1) In the 7 MB/s range. Unchanged from before. If it were load related, when the 2 dropped, I would expect the surviving members to be overwhelmed. They are not. They handle the additional load fine, up to 15 MB/s sometimes.
2) BucketMover activity is no more or less than the other indexers. (When they're active. When they block for long periods of time, BucketMover activity disappears, as it should.)
3) No. That would be nice! At least I'd have somewhere to start.
2 pieces of curious info: The 2 usually block together. Rarely does one start blocking without the other. And when they are blocked, they take an abnormally long time to restart. While waiting for the restart to happen, with the "..." crawling across the screen, server load drops to 0. By comparison, while running but blocked, load is at ~1, and while running normally, load is ~10.
contact support 😉
are there a lot of buckets on those 2 indexers?
See the ldap answer above
I have found that these kinds of questions involve more diagnosis than can normally be done in an Answers post.
I recommend you contact Splunk Support for assistance with this.
Looks like the scheduling issue is a carryover from 6.5.0 that was not fixed. Hoping there's a workaround somewhere: https://answers.splunk.com/answers/456812/why-are-alerts-not-working-after-upgrade-to-splunk-1.html
Not related. Same error, but different cause.
Done. Still waiting for help. 😞 effectively down in the meantime.