Re: EMR issues, MapReduce job killed

Ledion_Bitincka · ‎11-14-2013

I'm running into an issue with Hunk searches that spawn a MapReduce job in my EMR cluster. The MR job is killed after some time even though no user issued a kill command, so it seems that Hunk itself is killing the job. Digging through the task logs, I noticed the following line:

INFO com.splunk.mr.SplunkMR$SplunkBaseMapper (main): No heart beat received from Splunk, killing the MR job id=job_201310150205_0001

hyan_splunk · ‎04-08-2015

vix.splunk.heartbeat.threshold - number of heartbeats the search head may miss before the MR job kills itself
vix.splunk.heartbeat.interval - how often the search head sends a heartbeat (in ms)

The heartbeat is implemented by the search head renaming a heartbeat file. The default 1-second interval assumes that a rename operation on your file system completes within a second. If that is not the case, e.g. on an s3n file system, when the network connection between the search head and the Hadoop NameNode is slow, or simply because your Hadoop cluster is too busy to respond, you should increase the interval. Conversely, you can decrease the heartbeat threshold to make Hunk kill runaway MR jobs more promptly.
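
To make the interaction between the two settings concrete, here is a minimal Python sketch of a missed-heartbeat check. It is not Hunk's actual implementation: the function names, the way missed heartbeats are counted (whole intervals elapsed since the last observed heartbeat), and the `>=` comparison are all assumptions made for illustration.

```python
def missed_heartbeats(last_seen_ms: int, now_ms: int, interval_ms: int) -> int:
    """Whole heartbeat intervals elapsed since the last observed heartbeat.

    Hypothetical helper: assumes a heartbeat is expected once per interval.
    """
    return max(0, (now_ms - last_seen_ms) // interval_ms)


def should_kill(last_seen_ms: int, now_ms: int,
                interval_ms: int = 1000, threshold: int = 60) -> bool:
    """Return True once the number of missed heartbeats reaches the threshold.

    Defaults mirror vix.splunk.heartbeat.interval = 1000 and
    vix.splunk.heartbeat.threshold = 60 from this thread.
    """
    return missed_heartbeats(last_seen_ms, now_ms, interval_ms) >= threshold
```

Under these assumptions, raising the interval gives a slow rename more time to land before it counts as a miss, while lowering the threshold shortens how long a job survives after the search head goes away.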

Ledion_Bitincka · ‎11-14-2013

This issue is caused by a mechanism that Hunk has in place to reduce runaway jobs, i.e. MapReduce jobs that keep running even though the client is no longer interested in the results. Hunk solves this problem by having the Hunk server write a heartbeat to a specific location in the file system (hdfs/maprfs/s3n ...), and the map tasks check the heartbeat to ensure that the Hunk search is still running. However, this relies on file system rename operations completing relatively quickly (<1s). In some file systems, such as s3n, rename operations take much longer; when this time exceeds the missed-heartbeat threshold, the MapReduce job kills itself and you see the above log message. The runaway job logic can be completely disabled by setting the following variable in the provider:

vix.splunk.heartbeat = 0

Alternatively, you can tune the default heartbeat interval (in ms) and threshold (in missed heartbeats) by setting/updating:

vix.splunk.heartbeat.threshold     = 60
vix.splunk.heartbeat.interval      = 1000
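
As a rough way to reason about these two values, the sketch below computes how long a job would tolerate silence (threshold times interval) and suggests an interval from a measured rename latency. This is only back-of-the-envelope arithmetic, not documented Hunk behavior; the function names, the `safety_factor` of 2, and the 1000 ms floor are assumptions chosen for illustration.

```python
import math


def silence_window_s(threshold: int, interval_ms: int) -> float:
    """Approximate seconds without a heartbeat before the job gives up.

    Assumes the job gives up after `threshold` consecutive missed
    heartbeats, each spaced `interval_ms` apart.
    """
    return threshold * interval_ms / 1000.0


def suggest_interval_ms(rename_latency_ms: float,
                        safety_factor: float = 2.0,
                        floor_ms: int = 1000) -> int:
    """Pick an interval comfortably above the observed rename latency.

    Hypothetical heuristic: never go below the 1000 ms default, and leave
    headroom (safety_factor) over the slowest rename you have measured.
    """
    return max(floor_ms, math.ceil(rename_latency_ms * safety_factor))


if __name__ == "__main__":
    # Defaults from this thread: 60 missed heartbeats at 1000 ms each.
    print(silence_window_s(60, 1000))
    # Example: renames measured at about 3.5 s on a slow file system.
    print(suggest_interval_ms(3500))
```

With the defaults, the window works out to roughly 60 seconds; if you raise the interval and want to keep a similar window, lower the threshold proportionally.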

Finally, it might be a good idea to switch to using s3 rather than s3n; according to this article, s3 offers an efficient implementation of renames.

Masa · ‎04-08-2015

Is there any guidance for tuning the following values?

vix.splunk.heartbeat.threshold = 60
vix.splunk.heartbeat.interval = 1000
