ERROR while waiting for MapReduce job to complete - AM killed

Ledion_Bitincka
Splunk Employee

After solving this problem, we are running into an issue where the application master container cannot launch. It seems to be failing during the resource localization stage.

search.log shows the following stack traces:

ERROR ERP.REDACTED -  SplunkMR - ERROR while waiting for MapReduce job to complete, job_id=[!http://resourcemanager.redacted.com:8088/cluster/app/application_1386796141359_1780612 job_1386796141359_1780612], state=FAILED, reason=Application application_1386712341359_0000612 failed 3 times due to AM Container for appattempt_1386712341359_0000612_000003 exited with  exitCode: -1000 due to: RemoteTrace: 
ERROR ERP.REDACTED -  java.io.IOException: Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: "node1234.redacted.com/10.123.123.123"; destination host is: ""namenode.redacted.com":8020; 
ERROR ERP.REDACTED -    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:738)
ERROR ERP.REDACTED -    at org.apache.hadoop.ipc.Client.call(Client.java:1098)
ERROR ERP.REDACTED -    at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:195)
ERROR ERP.REDACTED -    at com.sun.proxy.$Proxy7.getFileInfo(Unknown Source)
ERROR ERP.REDACTED -    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
ERROR ERP.REDACTED -    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
ERROR ERP.REDACTED -    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
ERROR ERP.REDACTED -    at java.lang.reflect.Method.invoke(Method.java:601)
ERROR ERP.REDACTED -    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:102)
ERROR ERP.REDACTED -    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:67)
ERROR ERP.REDACTED -    at com.sun.proxy.$Proxy7.getFileInfo(Unknown Source)
ERROR ERP.REDACTED -    at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1305)
ERROR ERP.REDACTED -    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:734)
ERROR ERP.REDACTED -    at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:176)
ERROR ERP.REDACTED -    at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:51)
ERROR ERP.REDACTED -    at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:284)
ERROR ERP.REDACTED -    at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:282)
ERROR ERP.REDACTED -    at java.security.AccessController.doPrivileged(Native Method)
ERROR ERP.REDACTED -    at javax.security.auth.Subject.doAs(Subject.java:415)
ERROR ERP.REDACTED -    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1284)
ERROR ERP.REDACTED -    at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:281)
ERROR ERP.REDACTED -    at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:51)
ERROR ERP.REDACTED -    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
ERROR ERP.REDACTED -    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
ERROR ERP.REDACTED -    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
ERROR ERP.REDACTED -    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
ERROR ERP.REDACTED -    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
ERROR ERP.REDACTED -    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
ERROR ERP.REDACTED -    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
ERROR ERP.REDACTED -    at java.lang.Thread.run(Thread.java:722)
ERROR ERP.REDACTED -  Caused by: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
ERROR ERP.REDACTED -    at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:537)
ERROR ERP.REDACTED -    at java.security.AccessController.doPrivileged(Native Method)
ERROR ERP.REDACTED -    at javax.security.auth.Subject.doAs(Subject.java:415)
ERROR ERP.REDACTED -    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1284)
ERROR ERP.REDACTED -    at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:501)
ERROR ERP.REDACTED -    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:585)
ERROR ERP.REDACTED -    at org.apache.hadoop.ipc.Client$Connection.access$2100(Client.java:207)
ERROR ERP.REDACTED -    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1204)
ERROR ERP.REDACTED -    at org.apache.hadoop.ipc.Client.call(Client.java:1074)
ERROR ERP.REDACTED -    ... 28 more
ERROR ERP.REDACTED -  Caused by: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
ERROR ERP.REDACTED -    at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:212)
ERROR ERP.REDACTED -    at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:140)
ERROR ERP.REDACTED -    at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:409)
ERROR ERP.REDACTED -    at org.apache.hadoop.ipc.Client$Connection.access$1300(Client.java:207)
ERROR ERP.REDACTED -    at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:578)
ERROR ERP.REDACTED -    at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:575)
ERROR ERP.REDACTED -    at java.security.AccessController.doPrivileged(Native Method)
ERROR ERP.REDACTED -    at javax.security.auth.Subject.doAs(Subject.java:415)
ERROR ERP.REDACTED -    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1284)
ERROR ERP.REDACTED -    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:574)
ERROR ERP.REDACTED -    ... 31 more
ERROR ERP.REDACTED -  Caused by: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)
ERROR ERP.REDACTED -    at sun.security.jgss.krb5.Krb5InitCredential.getInstance(Krb5InitCredential.java:147)
ERROR ERP.REDACTED -    at sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:121)
ERROR ERP.REDACTED -    at sun.security.jgss.krb5.Krb5MechFactory.getMechanismContext(Krb5MechFactory.java:187)
ERROR ERP.REDACTED -    at sun.security.jgss.GSSManagerImpl.getMechanismContext(GSSManagerImpl.java:223)
ERROR ERP.REDACTED -    at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:212)
ERROR ERP.REDACTED -    at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:179)
ERROR ERP.REDACTED -    at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:193)
ERROR ERP.REDACTED -    ... 40 more
ERROR ERP.REDACTED -   at LocalTrace: 
ERROR ERP.REDACTED -    org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: Failed on local exception: java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]; Host Details : local host is: "node1234.redacted.com/10.123.123.123"; destination host is: ""namenode.redacted.com":8020; 
ERROR ERP.REDACTED -    at org.apache.hadoop.yarn.server.nodemanager.api.protocolrecords.impl.pb.LocalResourceStatusPBImpl.convertFromProtoFormat(LocalResourceStatusPBImpl.java:217)
ERROR ERP.REDACTED -    at org.apache.hadoop.yarn.server.nodemanager.api.protocolrecords.impl.pb.LocalResourceStatusPBImpl.getException(LocalResourceStatusPBImpl.java:147)
ERROR ERP.REDACTED -    at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.update(ResourceLocalizationService.java:820)
ERROR ERP.REDACTED -    at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker.processHeartbeat(ResourceLocalizationService.java:497)
ERROR ERP.REDACTED -    at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.heartbeat(ResourceLocalizationService.java:224)
ERROR ERP.REDACTED -    at org.apache.hadoop.yarn.server.nodemanager.api.impl.pb.service.LocalizationProtocolPBServiceImpl.heartbeat(LocalizationProtocolPBServiceImpl.java:46)
ERROR ERP.REDACTED -    at org.apache.hadoop.yarn.proto.LocalizationProtocol$LocalizationProtocolService$2.callBlockingMethod(LocalizationProtocol.java:57)
ERROR ERP.REDACTED -    at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Server.call(ProtoOverHadoopRpcEngine.java:353)
ERROR ERP.REDACTED -    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1543)
ERROR ERP.REDACTED -    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1539)
ERROR ERP.REDACTED -    at java.security.AccessController.doPrivileged(Native Method)
ERROR ERP.REDACTED -    at javax.security.auth.Subject.doAs(Subject.java:415)
ERROR ERP.REDACTED -    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1284)
ERROR ERP.REDACTED -    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1537)
ERROR ERP.REDACTED -  Caused by: org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
ERROR ERP.REDACTED -    at org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl.getCause(YarnRemoteExceptionPBImpl.java:90)
ERROR ERP.REDACTED -    at org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl.getCause(YarnRemoteExceptionPBImpl.java:28)
ERROR ERP.REDACTED -    at java.lang.Throwable.printStackTrace(Throwable.java:664)
ERROR ERP.REDACTED -    at java.lang.Throwable.printStackTrace(Throwable.java:720)
ERROR ERP.REDACTED -    at org.apache.hadoop.yarn.exceptions.YarnRemoteException.printStackTrace(YarnRemoteException.java:48)
ERROR ERP.REDACTED -    at org.apache.hadoop.util.StringUtils.stringifyException(StringUtils.java:69)
ERROR ERP.REDACTED -    at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ResourceFailedTransition.transition(ContainerImpl.java:702)
ERROR ERP.REDACTED -    at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl$ResourceFailedTransition.transition(ContainerImpl.java:695)
ERROR ERP.REDACTED -    at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:359)
ERROR ERP.REDACTED -    at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:299)
ERROR ERP.REDACTED -    at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
ERROR ERP.REDACTED -    at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:445)
ERROR ERP.REDACTED -    at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:828)
ERROR ERP.REDACTED -    at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:71)
ERROR ERP.REDACTED -    at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:556)
ERROR ERP.REDACTED -    at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:549)
ERROR ERP.REDACTED -    at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:130)
ERROR ERP.REDACTED -    at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:77)
ERROR ERP.REDACTED -    at java.lang.Thread.run(Thread.java:722)
ERROR ERP.REDACTED -  Caused by: org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: GSS initiate failed
ERROR ERP.REDACTED -    at org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl.getCause(YarnRemoteExceptionPBImpl.java:90)
ERROR ERP.REDACTED -    at org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl.getCause(YarnRemoteExceptionPBImpl.java:28)
ERROR ERP.REDACTED -    at java.lang.Throwable.printEnclosedStackTrace(Throwable.java:706)
ERROR ERP.REDACTED -    at java.lang.Throwable.printStackTrace(Throwable.java:666)
ERROR ERP.REDACTED -    ... 16 more
ERROR ERP.REDACTED -  Caused by: org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)
ERROR ERP.REDACTED -    at org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl.getCause(YarnRemoteExceptionPBImpl.java:90)
ERROR ERP.REDACTED -    at org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl.getCause(YarnRemoteExceptionPBImpl.java:28)
ERROR ERP.REDACTED -    at java.lang.Throwable.printEnclosedStackTrace(Throwable.java:706)
ERROR ERP.REDACTED -    at java.lang.Throwable.printEnclosedStackTrace(Throwable.java:708)
ERROR ERP.REDACTED -    ... 17 more
ERROR ERP.REDACTED -  .Failing this attempt.. Failing the application.

Ledion_Bitincka
Splunk Employee

This problem was pretty nasty to troubleshoot. From the stack traces above, it seems the resource localizer is trying to use Kerberos to authenticate against the NameNode. That attempt fails (as it should) because there is no keytab file for it to find. After job submission, keytab files should no longer be needed, because from that point on the job must use Hadoop's delegation tokens. In the above case the Hunk server was properly getting the delegation tokens from the NameNode, as can be seen in the following log line:

DEBUG ERP.REDACTED -  SecurityUtil - Acquired token Ident: 00 16 ... 05 8b, Kind: HDFS_DELEGATION_TOKEN, Service: [external-ip-address]:8020

However, the delegation token was issued for a service identified by the external IP of the NameNode (as can be seen from Service: [external-ip-address]:8020), whereas the resource localizer communicates with the NameNode using an internal IP (local host is: "node1234.redacted.com/10.123.123.123"; destination host is: ""namenode.redacted.com":8020; ). With no token matching the service it was contacting, the localizer apparently fell back to Kerberos, which produced the errors above.

Thus the root cause of the problem was a mismatch between the client's and the cluster's value of hadoop.security.token.service.use_ip. The fix was to set the following flag in the provider:

[REDACTED]
 ....
vix.hadoop.security.token.service.use_ip = false
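
To illustrate why the mismatch matters, here is a minimal sketch in Python. It is a simplified model of how a delegation token is keyed by its service name, not Hadoop's actual code. The function names, the hostname and the addresses are all hypothetical, and it assumes the cluster side keys tokens by hostname (use_ip = false), which is consistent with the fix above.

```python
# Simplified model of delegation-token lookup; NOT Hadoop's actual code.
# All names, hostnames and addresses here are hypothetical.

def build_token_service(host, port, use_ip, dns):
    """Return the service key a token is stored or looked up under."""
    # With use_ip enabled the key is built from the resolved IP address,
    # otherwise from the hostname as configured.
    name = dns[host] if use_ip else host.lower()
    return f"{name}:{port}"


def find_token(credentials, service):
    """Return the token stored for this service key, or None."""
    return credentials.get(service)


# Each side resolves the NameNode differently (external vs internal IP).
client_dns = {"namenode.example.com": "203.0.113.10"}
cluster_dns = {"namenode.example.com": "10.0.0.10"}


def localizer_finds_token(client_use_ip, cluster_use_ip):
    # The client stores the token under its own service key at submission ...
    key = build_token_service("namenode.example.com", 8020,
                              client_use_ip, client_dns)
    credentials = {key: "HDFS_DELEGATION_TOKEN"}
    # ... and the resource localizer on the cluster looks it up later.
    wanted = build_token_service("namenode.example.com", 8020,
                                 cluster_use_ip, cluster_dns)
    return find_token(credentials, wanted) is not None


# Client keys by IP, cluster by hostname: the lookup misses.
print(localizer_finds_token(True, False))   # False
# Both sides key by hostname: the lookup succeeds.
print(localizer_finds_token(False, False))  # True
```

In this model a missed lookup corresponds to the localizer having no usable token, which matches the fallback to Kerberos seen in the stack traces. Note that keying by IP on both sides would still miss here, because the two sides resolve the NameNode to different addresses.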
