EMR issues, MapReduce job killed

Ledion_Bitincka
Splunk Employee

I'm running into an issue with Hunk searches that spawn a MapReduce job in my EMR cluster. The MR job gets killed after some time even though no user issued a kill command, so it looks like Hunk itself is killing the job. Digging through the task logs, I noticed the following interesting line:

INFO com.splunk.mr.SplunkMR$SplunkBaseMapper (main): No heart beat received from Splunk, killing the MR job id=job_201310150205_0001

hyan_splunk
Splunk Employee

vix.splunk.heartbeat.threshold - number of heartbeats the search head may miss before the MR job kills itself
vix.splunk.heartbeat.interval - how often the search head sends a heartbeat

The heartbeat works by having the search head rename a heartbeat file. The default 1-second interval assumes that a rename on your file system completes within a second. If that is not the case, e.g. on s3n, or because the network connection between the search head and the Hadoop NameNode is slow, or simply because your Hadoop cluster is too busy to react, you should increase the interval. Conversely, you can decrease the heartbeat threshold to make Hunk kill runaway MR jobs more promptly.
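To make the mechanism a bit more concrete, here is a minimal Python sketch of a rename-based heartbeat with a missed-heartbeat counter. It is not Hunk's actual code: the `heartbeat.<n>` file naming, the `beat` and `current_beat` helpers, and the `Watchdog` class are all hypothetical, and it runs against a local temp directory rather than HDFS or S3.

```python
import os
import tempfile


def beat(directory, seq):
    """Search-head side (hypothetical layout): advance the heartbeat by renaming the file."""
    old = os.path.join(directory, f"heartbeat.{seq}")
    new = os.path.join(directory, f"heartbeat.{seq + 1}")
    os.replace(old, new)
    return seq + 1


def current_beat(directory):
    """Task side: read the sequence number encoded in the heartbeat file name."""
    names = [n for n in os.listdir(directory) if n.startswith("heartbeat.")]
    return int(names[0].split(".")[1])


class Watchdog:
    """Counts consecutive polls that saw no new heartbeat (assumed logic, not Hunk's)."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.last_seen = None
        self.missed = 0

    def poll(self, observed):
        """Call once per heartbeat interval; returns True when the job should abort."""
        if observed != self.last_seen:
            self.last_seen = observed
            self.missed = 0
        else:
            self.missed += 1
        return self.missed >= self.threshold


def demo():
    """One successful beat, then the search head goes silent for three polls."""
    with tempfile.TemporaryDirectory() as workdir:
        open(os.path.join(workdir, "heartbeat.0"), "w").close()
        dog = Watchdog(threshold=3)
        decisions = [dog.poll(current_beat(workdir))]
        beat(workdir, 0)
        decisions.append(dog.poll(current_beat(workdir)))
        for _ in range(3):
            decisions.append(dog.poll(current_beat(workdir)))
        return decisions
```

With a threshold of 3, `demo()` returns `[False, False, False, False, True]`: the watchdog only asks for an abort on the third consecutive poll that sees no new heartbeat. A slow rename looks exactly like a silent search head from the task's point of view, which is why slow file systems trigger the kill.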

Ledion_Bitincka
Splunk Employee

This issue is caused by a mechanism that Hunk has in place to limit runaway jobs, i.e. MapReduce jobs that keep running even though the client is no longer interested in the results. Hunk handles this by having the Hunk server write a heartbeat to a specific location in the file system (hdfs/maprfs/s3n ...), and the map tasks check that heartbeat to make sure the Hunk search is still running. However, this relies on file system rename operations completing relatively quickly (<1s). On some file systems, such as s3n, rename operations take much longer; when the delay exceeds the missed-heartbeat threshold, the MapReduce job kills itself and you see the log message above. The runaway-job logic can be disabled completely by setting the following variable in the provider:

vix.splunk.heartbeat = 0 

or you can tinker with the default heartbeat interval (in ms) and threshold (in missed heartbeats) by setting/updating:

vix.splunk.heartbeat.threshold     = 60
vix.splunk.heartbeat.interval      = 1000
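As a rough way to reason about these two values together, the small Python helper below multiplies them out. It assumes the job tolerates roughly `threshold` consecutive missed heartbeats of `interval` milliseconds each before killing itself; that reading follows from the units given above but is not confirmed against Hunk's implementation, and `tolerance_seconds` is a hypothetical name.

```python
def tolerance_seconds(threshold, interval_ms):
    """Approximate silence (in seconds) tolerated before the MR job gives up.

    Assumption: the window is simply threshold * interval.
    """
    return threshold * interval_ms / 1000.0


# The values shown above: 60 missed heartbeats at 1000 ms each.
default_window = tolerance_seconds(60, 1000)

# Raising only the interval to 5000 ms widens the window accordingly.
slow_fs_window = tolerance_seconds(60, 5000)

print(default_window, slow_fs_window)
```

Under that assumption, the values above give a window of about 60 seconds, and raising the interval to 5000 ms stretches it to about 300 seconds. Lowering the threshold shrinks the window, which makes runaway jobs die sooner but leaves less slack for slow renames.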

Finally, it might be a good idea to switch from s3n to s3: according to this article, s3 offers an efficient implementation of renames.

Masa
Splunk Employee

Is there any guidance for tuning the following values?

vix.splunk.heartbeat.threshold = 60
vix.splunk.heartbeat.interval = 1000
