Solved: Is there any way to delay bucket-fix activity performed by the master when an indexer goes down unintentionally?

fatemabwudel · ‎07-01-2016

Hi,

We have a cluster of 3 indexers with a replication factor of 3 and a search factor of 2.

Just curious to know if we can delay or disable the bucket-fixing that the master performs when an indexer goes down.
The problem we are dealing with right now is that we don't know whether we will have enough disk space for the master to stream the extra bucket copies (both searchable and non-searchable) needed to meet the cluster's replication factor and search factor when an indexer fails suddenly. A sudden failure gives us no time to enable maintenance mode on the master and stop the streaming of extra copies to the remaining peers.
When we were speccing out the hardware we weren't aware of this functionality and thought we would only need enough storage for one month's worth of searchable data. But now, if the cluster is running at 80-90% disk usage with just the data we designed it for and an indexer suddenly dies on us, the master will try to fix the replication and search factors as soon as it learns the peer is offline. If we could delay that bucket replication and streaming activity for some time, it would give us room to fix the failed indexer and put it back in. That would be more in line with our use case, where data should be searchable at all times, assuming that no two indexers fail simultaneously.

The same behavior is available when an indexer goes down "intentionally", via maintenance mode. I'm just looking for a setting to tune or tweak that does the same thing when an indexer dies "unintentionally", i.e. in indexer failure situations.

Any help would be appreciated.

Thanks.
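
The capacity concern in the question above can be put into rough numbers. The sketch below is only an illustration: it assumes a peer holds at most one copy of any given bucket (so a cluster cannot keep more copies than it has peers), it counts copies rather than bytes, and the function name copies_per_peer is made up for this example.

```python
def copies_per_peer(peers, replication_factor, search_factor):
    """Average copies stored per peer, per bucket: (total, searchable).

    Assumption: a peer holds at most one copy of a bucket, so the number
    of copies the cluster can keep is capped by the number of peers.
    """
    total = min(replication_factor, peers)
    searchable = min(search_factor, total)
    return total / peers, searchable / peers


# Healthy cluster from the question: 3 peers, RF=3, SF=2.
before = copies_per_peer(peers=3, replication_factor=3, search_factor=2)
# Same cluster after one peer fails.
after = copies_per_peer(peers=2, replication_factor=3, search_factor=2)

print("healthy  (3 peers): total=%.2f searchable=%.2f" % before)
print("degraded (2 peers): total=%.2f searchable=%.2f" % after)
```

If that assumption holds, the number of copies per surviving peer does not change in this particular layout, but the share of buckets each peer must hold in searchable form rises from about two thirds to all of them. How much disk that costs depends on the size of the index files relative to the raw data, which this sketch does not model.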

maciep · ‎07-02-2016

We currently have our heartbeat timeout set at 30 minutes, because our indexers can be too busy to respond to the cm heartbeat in a reasonable timeframe. We worked heavily with support/dev on our situation and slowly bumped the heartbeat timeout up that high. I'm not sure if setting it to 3 days is a good idea or if that will have a negative effect on other cluster tasks. Maybe contact support to see if they have any input.

Here are the 2 searches we're using in our environment. We run the first one every 15 minutes and store the results in a summary index. It runs on the cm and uses rest calls to gather stats.

| rest splunk_server=local /services/cluster/master/fixup level=generation 
| eval age = now() - 'initial.timestamp' 
| stats count as gen_count max(age) as gen_max_age min(age) as gen_min_age 
| appendcols 
[
    | rest splunk_server=local /services/cluster/master/fixup level=search_factor 
    | eval age = now() - 'initial.timestamp' 
    | stats count as sf_count max(age) as sf_max_age min(age) as sf_min_age
] 
| appendcols 
[
    | rest splunk_server=local /services/cluster/master/fixup level=replication_factor 
    | eval age = now() - 'initial.timestamp' 
    | stats count as rf_count max(age) as rf_max_age min(age) as rf_min_age
] 
| appendcols 
[
    | rest splunk_server=local /services/cluster/master/peers 
    | stats count(eval(status="Up")) as peers_up count(eval(status="Down")) as peers_down count(eval(status="Pending")) as peers_pending
] 
| appendcols 
[
    | rest splunk_server=local /services/cluster/master/info 
    | table indexing_ready_flag initialized_flag maintenance_mode rolling_restart_flag service_ready_flag
] 
| eval _time = now()
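
For readers less familiar with SPL, the age statistics computed for each fixup level above can be sketched in Python. This is an illustration only, not part of our setup: summarize_fixups is a hypothetical helper, and it assumes each fixup entry is a dict carrying an 'initial.timestamp' field in epoch seconds, like the field used in the search.

```python
import time


def summarize_fixups(entries, now=None):
    """Return (count, max_age, min_age) in seconds for a list of fixup entries.

    Mirrors: eval age = now() - 'initial.timestamp'
             | stats count max(age) min(age)
    """
    now = time.time() if now is None else now
    ages = [now - float(entry["initial.timestamp"]) for entry in entries]
    if not ages:
        # No pending fixups at this level.
        return 0, None, None
    return len(ages), max(ages), min(ages)


# Two made-up fixup entries, evaluated at a fixed "now" for repeatability.
sample = [{"initial.timestamp": 1000}, {"initial.timestamp": 1090}]
print(summarize_fixups(sample, now=1100))
```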

The next search is what we use to alert us if there appears to be an issue, and it runs once an hour. I don't fully understand exactly which metrics indicate an issue, so I'm trying to guess at what might be a problem. If we don't have 4 events over the past hour, that's a problem, and it's likely because not all data is searchable rather than because the first search didn't run. It's also a problem if we've had peers go pending or down (not up), some of the fixup counts are higher than expected, and fixup tasks are older than expected. But of course, if we've been in maintenance mode at some point in the last hour, don't alert, because we're obviously working on something. The formatting of the results is to prepare them for the script that sends traps to our event management system.

[the summary index search]
| stats count as event_count min(peers_up) as min_peers_up min(gen_count) as min_gen_count min(gen_max_age) as max_gen_age values(maintenance_mode) as maint_mode
| eval max_gen_age = coalesce(max_gen_age,0) 
| where (event_count < 4 OR (min_peers_up < [our indexer count] AND min_gen_count > 100 AND max_gen_age > 120)) AND NOT match(maint_mode,"1")
| eval 1 = now()
| eval 2 = "[Our The Alert group]"
| eval 3 = "[Our Cluster Master]"
| eval 4 = "Critical"
| eval 5 = "Splunk - The indexer cluster may not be healthy.  Please address if needed."
| table 1 2 3 4 5
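
The where clause above can be restated roughly in Python to make the decision easier to follow. This is an approximate sketch, not what we run: should_alert and its parameter names are hypothetical, the thresholds (4 samples, 100 fixups, 120 seconds) are copied from the search, and edge cases such as an empty result set may behave differently in SPL.

```python
def should_alert(samples, indexer_count, expected_samples=4,
                 gen_count_limit=100, gen_age_limit=120):
    """samples: one dict per summary event from the past hour, with the keys
    peers_up, gen_count, gen_max_age and maintenance_mode."""
    # Any sample taken in maintenance mode suppresses the alert.
    if any(str(s.get("maintenance_mode")) == "1" for s in samples):
        return False
    # Missing samples are treated as a problem in their own right.
    if len(samples) < expected_samples:
        return True
    min_peers_up = min(s["peers_up"] for s in samples)
    min_gen_count = min(s["gen_count"] for s in samples)
    # Like the search, take the minimum of the per-sample max ages,
    # falling back to 0 when no sample reported one.
    ages = [s["gen_max_age"] for s in samples
            if s.get("gen_max_age") is not None]
    gen_age = min(ages) if ages else 0
    return (min_peers_up < indexer_count
            and min_gen_count > gen_count_limit
            and gen_age > gen_age_limit)


healthy = [{"peers_up": 3, "gen_count": 0, "gen_max_age": None,
            "maintenance_mode": 0}] * 4
degraded = [{"peers_up": 2, "gen_count": 500, "gen_max_age": 600,
             "maintenance_mode": 0}] * 4
print(should_alert(healthy, indexer_count=3))
print(should_alert(degraded, indexer_count=3))
```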

And finally, here is the script that would put us into maintenance mode, so it could be called from a search too. We would only run it from the cm, and it lives in $SPLUNK_HOME/bin/scripts. We are not currently using it, but have it in our back pocket if needed. We did not write this script, but it seems pretty self-explanatory.

import sys
import urllib2
import urllib
import ssl

BASE_SPLUNK_URL = 'https://localhost:8089'


def main():
    #Grab the session key from standard in and URLDecode it (it's URLEncoded)
    sessionKey = sys.stdin.readline().strip()
    #Drop the leading "sessionKey=" prefix (11 characters)
    sessionKey = sessionKey[11:]
    sessionKey = urllib.unquote(sessionKey).decode('utf8')

    #Put the cluster master into maintenance mode
    putClusterMasterInMM(sessionKey)

    return


def putClusterMasterInMM(sessionKey):
    try:
        #prep the request
        request = urllib2.Request(BASE_SPLUNK_URL + '/services/cluster/master/control/default/maintenance/')
        request.add_header("Authorization", "Splunk {0}".format(sessionKey))
        request.add_data(urllib.urlencode({'mode': 'true'}))

        #execute the request
        server_content = urllib2.urlopen(request, context=ssl._create_unverified_context())

        return True
    except Exception:
        #any failure (connection, auth, HTTP error) is reported as False
        return False


if __name__ == "__main__":
    main()

Hope this helps a bit and maybe gives you a few ideas if needed.
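
The script above is written for Python 2 (urllib2). For anyone on Python 3, a minimal sketch of the same request might look like the following. It is an untested adaptation rather than the script we were given: the helper names parse_session_key, build_maintenance_request and send are made up, the endpoint and the "sessionKey=" prefix are taken from the original script, and certificate verification is still disabled as it was there.

```python
import ssl
import urllib.parse
import urllib.request

BASE_SPLUNK_URL = "https://localhost:8089"
MAINTENANCE_PATH = "/services/cluster/master/control/default/maintenance/"


def parse_session_key(line):
    """Strip the 'sessionKey=' prefix from the stdin line and URL-decode it."""
    prefix = "sessionKey="
    line = line.strip()
    if line.startswith(prefix):
        line = line[len(prefix):]
    return urllib.parse.unquote(line)


def build_maintenance_request(session_key):
    """Build (but do not send) the POST that enables maintenance mode."""
    data = urllib.parse.urlencode({"mode": "true"}).encode("utf-8")
    request = urllib.request.Request(BASE_SPLUNK_URL + MAINTENANCE_PATH,
                                     data=data)
    request.add_header("Authorization", "Splunk {0}".format(session_key))
    return request


def send(request):
    """Send the request; TLS verification is off, as in the original script."""
    context = ssl._create_unverified_context()
    with urllib.request.urlopen(request, context=context) as response:
        return response.status


# Building the request has no side effects; call send(req) on the cm to apply it.
req = build_maintenance_request(parse_session_key("sessionKey=abc%3D%3D\n"))
print(req.get_method(), req.full_url)
```

Keeping the request construction separate from send() makes it possible to inspect what would be posted before pointing it at a real cluster master.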

fatemabwudel · ‎07-02-2016

Thank you Maciep for providing the information. I will try these searches and the script out on our cluster.

Just wanted to ask: are you running DMC and the Fire Brigade app on your cluster? These two give pretty decent information about the overall health of the cluster and provide insight into the health of indexes in Splunk environments. They also have alerting capability if some abnormal situation occurs in the cluster (their documentation mentions it; I haven't tried it).
So I'm just curious whether those alerts could be used to trigger the script you provided and enter maintenance mode.

maciep · ‎07-02-2016

Yep, we do have dmc and fire brigade in our environment.

We're on 6.3.4, so I'm not sure if any more improvements have been made in DMC in 6.4.x that might help. Admittedly, I haven't researched what it can do from an alerting perspective.

I don't look at fire brigade as often as I should, and we haven't upgraded in a while either. But it's there when we need it. I'm hoping that someday we can just rely on dmc. I think Splunk is going to keep trying to move in that direction...hopefully

maciep · ‎07-01-2016

We have a similar need for different reasons, but that functionality doesn't exist. I have submitted an enhancement request to allow us to choose how the cm should react to a down indexer: run fixup tasks like normal, automatically go into maintenance mode, or delay fixup tasks but generate an alert or event of some sort so that we can manually enable maintenance mode if needed.

In the meantime, I created a search that drops metrics from the indexer cluster via rest into a summary index. I then search that data once an hour to try to determine whether all data is searchable; if it isn't, that generates an alert for us. It's not the most robust system, but it has been pretty good at identifying when our cluster isn't healthy.

We also have another script that will put the cluster in maintenance mode automatically, so we can kick that off as an alert script from the search too. But we don't want to do that just yet.

I can share if you're interested.

fatemabwudel · ‎07-02-2016

Hi Maciep,

Thanks for the quick reply and for suggesting a solution. Yeah, it would be great if you could share the script! Thanks!

Also, one more note: as I was exploring solutions to this problem, I came across this in the documentation:
"When a peer node goes down for any reason besides the offline command, it stops sending the periodic heartbeat to the master. This causes the master to detect the loss and initiate remedial action. The master coordinates essentially the same actions as when the peer gets taken offline intentionally, except for the following:

* The downed peer does not continue to participate in ongoing searches.
* The master waits only for the length of the heartbeat timeout (by default, 60 seconds) before reassigning primacy and initiating bucket-fixing actions."

And there is an attribute called heartbeat_timeout in server.conf that defines the heartbeat timeout:

heartbeat_timeout = <positive integer>
* Only valid for mode=master
* Determines when the master considers a slave down. Once a slave is down, the master will initiate fixup steps to replicate buckets from the dead slave to its peers.
* Defaults to 60s.

So does that mean we can set it to a higher number, long enough for us to troubleshoot the down indexer, maybe 259200 seconds (3 days) to account for the weekend?
The main issue with that is the master would not have up-to-date information about the peers in the cluster for up to 3 days after the last heartbeat it received from them.

Any thoughts on other consequences of tweaking this setting, or on what other functionality it would impact?

Thanks,
Fatema.
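
The two heartbeat_timeout values mentioned in this thread (30 minutes and 3 days) convert to seconds as sketched below. The snippet only does the arithmetic and prints the text of a stanza. It assumes the setting goes in the [clustering] stanza of server.conf on the master, and heartbeat_stanza is a hypothetical helper, not a Splunk tool. Whether a value as large as 3 days is safe is exactly the open question above.

```python
from datetime import timedelta


def heartbeat_stanza(timeout):
    """Render a server.conf fragment for the given timeout (a timedelta)."""
    seconds = int(timeout.total_seconds())
    return "[clustering]\nheartbeat_timeout = {0}\n".format(seconds)


# 30 minutes, the value one poster reports running with.
print(heartbeat_stanza(timedelta(minutes=30)))
# 3 days, the value asked about in the question.
print(heartbeat_stanza(timedelta(days=3)))
```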
