After solving this problem, we ran into another issue: the application master container could not launch, and it appeared to be failing during the resource localization stage.
search.log shows the following stack traces:
ERROR ERP.REDACTED - SplunkMR - ERROR while waiting for MapReduce job to complete, job_id=[!http://resourcemanager.redacted.com:8088/cluster/app/application_1386796141359_1780612 job_1386796141359_1780612], state=FAILED, reason=Application application_1386712341359_0000612 failed 3 times due to AM Container for appattempt_1386712341359_0000612_000003 exited with exitCode: -1000 due to: RemoteTrace:
ERROR ERP.REDACTED - java.io.IOException: Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: "node1234.redacted.com/10.123.123.123"; destination host is: ""namenode.redacted.com":8020;
ERROR ERP.REDACTED - at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:738)
ERROR ERP.REDACTED - at org.apache.hadoop.ipc.Client.call(Client.java:1098)
ERROR ERP.REDACTED - at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:195)
ERROR ERP.REDACTED - at com.sun.proxy.$Proxy7.getFileInfo(Unknown Source)
ERROR ERP.REDACTED - at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
ERROR ERP.REDACTED - at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
ERROR ERP.REDACTED - at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
ERROR ERP.REDACTED - at java.lang.reflect.Method.invoke(Method.java:601)
ERROR ERP.REDACTED - at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:102)
ERROR ERP.REDACTED - at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:67)
ERROR ERP.REDACTED - at com.sun.proxy.$Proxy7.getFileInfo(Unknown Source)
ERROR ERP.REDACTED - at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1305)
ERROR ERP.REDACTED - at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:734)
ERROR ERP.REDACTED - at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:176)
ERROR ERP.REDACTED - at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:51)
ERROR ERP.REDACTED - at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:284)
ERROR ERP.REDACTED - at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:282)
ERROR ERP.REDACTED - at java.security.AccessController.doPrivileged(Native Method)
ERROR ERP.REDACTED - at javax.security.auth.Subject.doAs(Subject.java:415)
ERROR ERP.REDACTED - at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1284)
ERROR ERP.REDACTED - at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:281)
ERROR ERP.REDACTED - at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:51)
ERROR ERP.REDACTED - at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
ERROR ERP.REDACTED - at java.util.concurrent.FutureTask.run(FutureTask.java:166)
ERROR ERP.REDACTED - at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
ERROR ERP.REDACTED - at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
ERROR ERP.REDACTED - at java.util.concurrent.FutureTask.run(FutureTask.java:166)
ERROR ERP.REDACTED - at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
ERROR ERP.REDACTED - at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
ERROR ERP.REDACTED - at java.lang.Thread.run(Thread.java:722)
ERROR ERP.REDACTED - Caused by: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
ERROR ERP.REDACTED - at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:537)
ERROR ERP.REDACTED - at java.security.AccessController.doPrivileged(Native Method)
ERROR ERP.REDACTED - at javax.security.auth.Subject.doAs(Subject.java:415)
ERROR ERP.REDACTED - at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1284)
ERROR ERP.REDACTED - at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:501)
ERROR ERP.REDACTED - at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:585)
ERROR ERP.REDACTED - at org.apache.hadoop.ipc.Client$Connection.access$2100(Client.java:207)
ERROR ERP.REDACTED - at org.apache.hadoop.ipc.Client.getConnection(Client.java:1204)
ERROR ERP.REDACTED - at org.apache.hadoop.ipc.Client.call(Client.java:1074)
ERROR ERP.REDACTED - ... 28 more
ERROR ERP.REDACTED - Caused by: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
ERROR ERP.REDACTED - at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:212)
ERROR ERP.REDACTED - at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:140)
ERROR ERP.REDACTED - at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:409)
ERROR ERP.REDACTED - at org.apache.hadoop.ipc.Client$Connection.access$1300(Client.java:207)
ERROR ERP.REDACTED - at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:578)
ERROR ERP.REDACTED - at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:575)
ERROR ERP.REDACTED - at java.security.AccessController.doPrivileged(Native Method)
ERROR ERP.REDACTED - at javax.security.auth.Subject.doAs(Subject.java:415)
ERROR ERP.REDACTED - at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1284)
ERROR ERP.REDACTED - at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:574)
ERROR ERP.REDACTED - ... 31 more
ERROR ERP.REDACTED - Caused by: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)
ERROR ERP.REDACTED - at sun.security.jgss.krb5.Krb5InitCredential.getInstance(Krb5InitCredential.java:147)
ERROR ERP.REDACTED - at sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:121)
ERROR ERP.REDACTED - at sun.security.jgss.krb5.Krb5MechFactory.getMechanismContext(Krb5MechFactory.java:187)
ERROR ERP.REDACTED - at sun.security.jgss.GSSManagerImpl.getMechanismContext(GSSManagerImpl.java:223)
ERROR ERP.REDACTED - at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:212)
ERROR ERP.REDACTED - at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:179)
ERROR ERP.REDACTED - at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:193)
ERROR ERP.REDACTED - ... 40 more
ERROR ERP.REDACTED - at LocalTrace:
ERROR ERP.REDACTED - org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: "node1234.redacted.com/10.123.123.123"; destination host is: ""namenode.redacted.com":8020;
ERROR ERP.REDACTED - at org.apache.hadoop.yarn.server.nodemanager.api.protocolrecords.impl.pb.LocalResourceStatusPBImpl.convertFromProtoFormat(LocalResourceStatusPBImpl.java:217)
ERROR ERP.REDACTED - at org.apache.hadoop.yarn.server.nodemanager.api.protocolrecords.impl.pb.LocalResourceStatusPBImpl.getException(LocalResourceStatusPBImpl.java:147)
ERROR ERP.REDACTED - at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.update(ResourceLocalizationService.java:820)
ERROR ERP.REDACTED - at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker.processHeartbeat(ResourceLocalizationService.java:497)
ERROR ERP.REDACTED - at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.heartbeat(ResourceLocalizationService.java:224)
ERROR ERP.REDACTED - at org.apache.hadoop.yarn.server.nodemanager.api.impl.pb.service.LocalizationProtocolPBServiceImpl.heartbeat(LocalizationProtocolPBServiceImpl.java:46)
ERROR ERP.REDACTED - at org.apache.hadoop.yarn.proto.LocalizationProtocol$LocalizationProtocolService$2.callBlockingMethod(LocalizationProtocol.java:57)
ERROR ERP.REDACTED - at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Server.call(ProtoOverHadoopRpcEngine.java:353)
ERROR ERP.REDACTED - at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1543)
ERROR ERP.REDACTED - at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1539)
ERROR ERP.REDACTED - at java.security.AccessController.doPrivileged(Native Method)
ERROR ERP.REDACTED - at javax.security.auth.Subject.doAs(Subject.java:415)
ERROR ERP.REDACTED - at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1284)
ERROR ERP.REDACTED - at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1537)
ERROR ERP.REDACTED - Caused by: org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
ERROR ERP.REDACTED - at org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl.getCause(YarnRemoteExceptionPBImpl.java:90)
ERROR ERP.REDACTED - at org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl.getCause(YarnRemoteExceptionPBImpl.java:28)
ERROR ERP.REDACTED - at java.lang.Throwable.printStackTrace(Throwable.java:664)
ERROR ERP.REDACTED - at java.lang.Throwable.printStackTrace(Throwable.java:720)
ERROR ERP.REDACTED - at org.apache.hadoop.yarn.exceptions.YarnRemoteException.printStackTrace(YarnRemoteException.java:48)
ERROR ERP.REDACTED - at org.apache.hadoop.util.StringUtils.stringifyException(StringUtils.java:69)
ERROR ERP.REDACTED - at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ResourceFailedTransition.transition(ContainerImpl.java:702)
ERROR ERP.REDACTED - at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ResourceFailedTransition.transition(ContainerImpl.java:695)
ERROR ERP.REDACTED - at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:359)
ERROR ERP.REDACTED - at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:299)
ERROR ERP.REDACTED - at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
ERROR ERP.REDACTED - at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:445)
ERROR ERP.REDACTED - at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:828)
ERROR ERP.REDACTED - at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:71)
ERROR ERP.REDACTED - at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:556)
ERROR ERP.REDACTED - at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:549)
ERROR ERP.REDACTED - at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:130)
ERROR ERP.REDACTED - at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:77)
ERROR ERP.REDACTED - at java.lang.Thread.run(Thread.java:722)
ERROR ERP.REDACTED - Caused by: org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: GSS initiate failed
ERROR ERP.REDACTED - at org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl.getCause(YarnRemoteExceptionPBImpl.java:90)
ERROR ERP.REDACTED - at org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl.getCause(YarnRemoteExceptionPBImpl.java:28)
ERROR ERP.REDACTED - at java.lang.Throwable.printEnclosedStackTrace(Throwable.java:706)
ERROR ERP.REDACTED - at java.lang.Throwable.printStackTrace(Throwable.java:666)
ERROR ERP.REDACTED - ... 16 more
ERROR ERP.REDACTED - Caused by: org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)
ERROR ERP.REDACTED - at org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl.getCause(YarnRemoteExceptionPBImpl.java:90)
ERROR ERP.REDACTED - at org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl.getCause(YarnRemoteExceptionPBImpl.java:28)
ERROR ERP.REDACTED - at java.lang.Throwable.printEnclosedStackTrace(Throwable.java:706)
ERROR ERP.REDACTED - at java.lang.Throwable.printEnclosedStackTrace(Throwable.java:708)
ERROR ERP.REDACTED - ... 17 more
ERROR ERP.REDACTED - .Failing this attempt.. Failing the application.
This problem was pretty nasty to troubleshoot. From the stack traces above, it looks like the resource localizer is trying to use Kerberos to authenticate against the NameNode. That attempt fails (as it should) because the node has no Kerberos credentials to use: there is no keytab file there, so no TGT can be found. After job submission there should be no need for a keytab anymore, because every stage after submission must use Hadoop's delegation tokens. In this case the Hunk server was properly getting the delegation tokens from the NameNode, as can be seen in the following log line:
DEBUG ERP.REDACTED - SecurityUtil - Acquired token Ident: 00 16 ... 05 8b, Kind: HDFS_DELEGATION_TOKEN, Service: [external-ip-address]:8020
However, the delegation token was issued for a service identified by the external IP of the NameNode (as can be seen from Service: [external-ip-address]:8020), whereas the resource localizer communicates with the NameNode using an internal address (local host is: "node1234.redacted.com/10.123.123.123"; destination host is: "namenode.redacted.com":8020).
Because the token's service name did not match the address the localizer was connecting to, the localizer could not find the token and fell back to Kerberos. The root cause of the problem was therefore a mismatch between the client's and the cluster's value of hadoop.security.token.service.use_ip. The fix was to set the following flag in the provider stanza:
[REDACTED]
....
vix.hadoop.security.token.service.use_ip = false
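To make the mismatch easier to picture, here is a small Python sketch of the lookup. It is a simplified model, not Hadoop's actual code: the names Address, build_token_service and select_token are made up for illustration, and the two IP addresses are placeholders rather than values from the logs above. The only assumption taken from Hadoop is that a token is stored and looked up under a "host:port" service name, where the host part is the IP when use_ip is true and the hostname otherwise.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Address:
    """How one machine sees the NameNode (hypothetical helper type)."""
    hostname: str
    ip: str
    port: int


def build_token_service(addr: Address, use_ip: bool) -> str:
    """Model of the service name a delegation token is filed under."""
    host = addr.ip if use_ip else addr.hostname.lower()
    return f"{host}:{addr.port}"


def select_token(credentials: dict, addr: Address, use_ip: bool):
    """Return the token whose service name matches exactly, else None."""
    return credentials.get(build_token_service(addr, use_ip))


# Placeholder addresses: the client resolves the NameNode to an external IP,
# the cluster nodes resolve the same hostname to an internal IP.
client_view = Address("namenode.redacted.com", "203.0.113.10", 8020)
node_view = Address("namenode.redacted.com", "10.0.0.10", 8020)


def localizer_finds_token(client_use_ip: bool, cluster_use_ip: bool) -> bool:
    # The client acquires the token and ships it with the job.
    service = build_token_service(client_view, client_use_ip)
    credentials = {service: "HDFS_DELEGATION_TOKEN"}
    # The resource localizer on a cluster node looks the token up again.
    return select_token(credentials, node_view, cluster_use_ip) is not None


print(localizer_finds_token(client_use_ip=True, cluster_use_ip=False))   # False
print(localizer_finds_token(client_use_ip=False, cluster_use_ip=False))  # True
```

In this model the first call corresponds to the failing setup (the token is filed under the external IP, so the node never finds it and falls back to Kerberos), and the second corresponds to the setup after vix.hadoop.security.token.service.use_ip = false is applied on the client side.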