Yarn Container exit code 143

suarezry
Builder

We are running a 10-datanode Hortonworks HDP v2.5 cluster on Ubuntu 14.04. Whenever I run a large YARN job, the map task shows as SUCCEEDED but with a Note: "Container killed by the ApplicationMaster. Container killed on request. Exit code is 143 Container exited with a non-zero exit code 143"

Can someone help me troubleshoot this?

yarn-yarn-nodemanager-datanode.log

2017-04-03 10:15:18,140 INFO  containermanager.ContainerManagerImpl (ContainerManagerImpl.java:startContainerInternal(810)) - Start request for container_e10_1484675915702_18333_01_000003 by user root
2017-04-03 10:15:18,151 INFO  application.ApplicationImpl (ApplicationImpl.java:transition(304)) - Adding container_e10_1484675915702_18333_01_000003 to application application_1484675915702_18333
2017-04-03 10:15:18,153 INFO  container.ContainerImpl (ContainerImpl.java:handle(1163)) - Container container_e10_1484675915702_18333_01_000003 transitioned from NEW to LOCALIZING
2017-04-03 10:15:18,157 INFO  yarn.YarnShuffleService (YarnShuffleService.java:initializeContainer(184)) - Initializing container container_e10_1484675915702_18333_01_000003
2017-04-03 10:15:18,157 INFO  yarn.YarnShuffleService (YarnShuffleService.java:initializeContainer(185)) - Initializing container container_e10_1484675915702_18333_01_000003
2017-04-03 10:15:18,358 INFO  localizer.ResourceLocalizationService (ResourceLocalizationService.java:handle(712)) - Created localizer for container_e10_1484675915702_18333_01_000003
2017-04-03 10:15:18,406 INFO  localizer.ResourceLocalizationService (ResourceLocalizationService.java:writeCredentials(1194)) - Writing credentials to the nmPrivate file /grid/3/hadoop/yarn/local/nmPrivate/container_e10_1484675915702_18333_01_000003.tokens. Credentials list: 
2017-04-03 10:15:18,407 INFO  container.ContainerImpl (ContainerImpl.java:handle(1163)) - Container container_e10_1484675915702_18333_01_000003 transitioned from LOCALIZING to LOCALIZED
2017-04-03 10:15:18,458 INFO  container.ContainerImpl (ContainerImpl.java:handle(1163)) - Container container_e10_1484675915702_18333_01_000003 transitioned from LOCALIZED to RUNNING
2017-04-03 10:15:18,462 INFO  nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:buildCommandExecutor(281)) - launchContainer: [bash, /grid/1/hadoop/yarn/local/usercache/root/appcache/application_1484675915702_18333/container_e10_1484675915702_18333_01_000003/default_container_executor.sh]
2017-04-03 10:15:18,465 INFO  nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:startLocalizer(126)) - Copying from /grid/3/hadoop/yarn/local/nmPrivate/container_e10_1484675915702_18333_01_000003.tokens to /grid/2/hadoop/yarn/local/usercache/root/appcache/application_1484675915702_18333/container_e10_1484675915702_18333_01_000003.tokens
2017-04-03 10:15:20,998 INFO  monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(375)) - Starting resource-monitoring for container_e10_1484675915702_18333_01_000003
2017-04-03 10:15:21,144 INFO  monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(464)) - Memory usage of ProcessTree 851 for container-id container_e10_1484675915702_18333_01_000003: 148.7 MB of 2 GB physical memory used; 2.1 GB of 4.2 GB virtual memory used
2017-04-03 10:15:24,293 INFO  monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(464)) - Memory usage of ProcessTree 851 for container-id container_e10_1484675915702_18333_01_000003: 305.4 MB of 2 GB physical memory used; 2.4 GB of 4.2 GB virtual memory used
2017-04-03 10:15:24,734 INFO  containermanager.ContainerManagerImpl (ContainerManagerImpl.java:stopContainerInternal(960)) - Stopping container with container Id: container_e10_1484675915702_18333_01_000003
2017-04-03 10:15:24,734 INFO  container.ContainerImpl (ContainerImpl.java:handle(1163)) - Container container_e10_1484675915702_18333_01_000003 transitioned from RUNNING to KILLING
2017-04-03 10:15:24,734 INFO  launcher.ContainerLaunch (ContainerLaunch.java:cleanupContainer(425)) - Cleaning up container container_e10_1484675915702_18333_01_000003
2017-04-03 10:15:24,743 WARN  nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:launchContainer(237)) - Exit code from container container_e10_1484675915702_18333_01_000003 is : 143
2017-04-03 10:15:24,756 INFO  container.ContainerImpl (ContainerImpl.java:handle(1163)) - Container container_e10_1484675915702_18333_01_000003 transitioned from KILLING to CONTAINER_CLEANEDUP_AFTER_KILL
2017-04-03 10:15:24,757 INFO  nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:deleteAsUser(480)) - Deleting absolute path : /grid/1/hadoop/yarn/local/usercache/root/appcache/application_1484675915702_18333/container_e10_1484675915702_18333_01_000003
2017-04-03 10:15:24,757 INFO  nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:deleteAsUser(480)) - Deleting absolute path : /grid/2/hadoop/yarn/local/usercache/root/appcache/application_1484675915702_18333/container_e10_1484675915702_18333_01_000003
2017-04-03 10:15:24,757 INFO  nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:deleteAsUser(480)) - Deleting absolute path : /grid/3/hadoop/yarn/local/usercache/root/appcache/application_1484675915702_18333/container_e10_1484675915702_18333_01_000003
2017-04-03 10:15:24,757 INFO  nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:deleteAsUser(480)) - Deleting absolute path : /grid/0/hadoop/yarn/local/usercache/root/appcache/application_1484675915702_18333/container_e10_1484675915702_18333_01_000003
2017-04-03 10:15:24,757 INFO  container.ContainerImpl (ContainerImpl.java:handle(1163)) - Container container_e10_1484675915702_18333_01_000003 transitioned from CONTAINER_CLEANEDUP_AFTER_KILL to DONE
2017-04-03 10:15:24,757 INFO  application.ApplicationImpl (ApplicationImpl.java:transition(347)) - Removing container_e10_1484675915702_18333_01_000003 from application application_1484675915702_18333
2017-04-03 10:15:24,757 INFO  logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:startContainerLogAggregation(512)) - Considering container container_e10_1484675915702_18333_01_000003 for log-aggregation
2017-04-03 10:15:24,758 INFO  yarn.YarnShuffleService (YarnShuffleService.java:stopContainer(190)) - Stopping container container_e10_1484675915702_18333_01_000003
2017-04-03 10:15:24,758 INFO  yarn.YarnShuffleService (YarnShuffleService.java:stopContainer(191)) - Stopping container container_e10_1484675915702_18333_01_000003
2017-04-03 10:15:26,338 INFO  nodemanager.NodeStatusUpdaterImpl (NodeStatusUpdaterImpl.java:removeOrTrackCompletedContainersFromContext(553)) - Removed completed containers from NM context: [container_e10_1484675915702_18333_01_000003]
2017-04-03 10:15:27,294 INFO  monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(390)) - Stopping resource-monitoring for container_e10_1484675915702_18333_01_000003
2017-04-03 10:15:34,491 INFO  logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:doContainerLogAggregation(567)) - Uploading logs for container container_e10_1484675915702_18333_01_000003. Current good log dirs are /grid/1/hadoop/yarn/log,/grid/2/hadoop/yarn/log,/grid/3/hadoop/yarn/log,/grid/0/hadoop/yarn/log
2017-04-03 10:15:34,495 INFO  nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:deleteAsUser(489)) - Deleting path : /grid/1/hadoop/yarn/log/application_1484675915702_18333/container_e10_1484675915702_18333_01_000003/syslog
2017-04-03 10:15:34,496 INFO  nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:deleteAsUser(489)) - Deleting path : /grid/1/hadoop/yarn/log/application_1484675915702_18333/container_e10_1484675915702_18333_01_000003/directory.info
2017-04-03 10:15:34,496 INFO  nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:deleteAsUser(489)) - Deleting path : /grid/1/hadoop/yarn/log/application_1484675915702_18333/container_e10_1484675915702_18333_01_000003/stdout
2017-04-03 10:15:34,496 INFO  nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:deleteAsUser(489)) - Deleting path : /grid/1/hadoop/yarn/log/application_1484675915702_18333/container_e10_1484675915702_18333_01_000003/stderr
2017-04-03 10:15:34,496 INFO  nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:deleteAsUser(489)) - Deleting path : /grid/1/hadoop/yarn/log/application_1484675915702_18333/container_e10_1484675915702_18333_01_000003/launch_container.sh
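
For readers decoding the exit code in the log above: on POSIX systems, an exit code above 128 conventionally means the process was terminated by a signal, with the signal number equal to the code minus 128. So 143 corresponds to signal 15 (SIGTERM), which fits the "Stopping container" and "RUNNING to KILLING" lines. A minimal Python sketch of that arithmetic follows; the function name `describe_exit_code` is made up for illustration and is not part of Hadoop or Splunk.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Interpret a shell-style exit code.

    By POSIX shell convention, a code above 128 means the process was
    terminated by signal number (code - 128).
    """
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not defined on this platform.
            name = "unknown signal"
        return f"terminated by signal {signum} ({name})"
    return f"exited normally with status {code}"


if __name__ == "__main__":
    print(describe_exit_code(143))
```

This only says how the container process ended (it was asked to terminate), not why the stop was requested.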

rdagan_splunk
Splunk Employee

It looks as if you have some memory issues in the Hadoop nodes, so some of the jobs are being killed.
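
One way to check the memory theory is to pull the `ContainersMonitorImpl` "Memory usage of ProcessTree" lines out of the NodeManager log and compare usage against the reported limits. The sketch below is only an illustration: the regular expression is written against the two lines pasted in this thread (other Hadoop versions may format them differently), and `memory_fractions` and `SAMPLE` are hypothetical names.

```python
import re

# Multipliers to convert the units printed in the log into megabytes.
UNIT_TO_MB = {"KB": 1 / 1024, "MB": 1.0, "GB": 1024.0, "TB": 1024.0 * 1024.0}

# Pattern modeled on the "Memory usage of ProcessTree" lines in this thread
# (an assumption about the format, not an official specification).
USAGE_RE = re.compile(
    r"container-id (?P<cid>\S+): "
    r"(?P<pu>[\d.]+) (?P<puu>[KMGT]B) of (?P<pl>[\d.]+) (?P<plu>[KMGT]B) "
    r"physical memory used; "
    r"(?P<vu>[\d.]+) (?P<vuu>[KMGT]B) of (?P<vl>[\d.]+) (?P<vlu>[KMGT]B) "
    r"virtual memory used"
)


def to_mb(value: str, unit: str) -> float:
    """Convert a value such as ('2', 'GB') to megabytes."""
    return float(value) * UNIT_TO_MB[unit]


def memory_fractions(lines):
    """Return (container_id, physical_fraction, virtual_fraction) per matching line."""
    results = []
    for line in lines:
        m = USAGE_RE.search(line)
        if m is None:
            continue  # Not a memory-usage line.
        phys = to_mb(m["pu"], m["puu"]) / to_mb(m["pl"], m["plu"])
        virt = to_mb(m["vu"], m["vuu"]) / to_mb(m["vl"], m["vlu"])
        results.append((m["cid"], phys, virt))
    return results


# Message portions of the two monitoring lines from the log in this thread.
SAMPLE = [
    "Memory usage of ProcessTree 851 for container-id "
    "container_e10_1484675915702_18333_01_000003: 148.7 MB of 2 GB physical "
    "memory used; 2.1 GB of 4.2 GB virtual memory used",
    "Memory usage of ProcessTree 851 for container-id "
    "container_e10_1484675915702_18333_01_000003: 305.4 MB of 2 GB physical "
    "memory used; 2.4 GB of 4.2 GB virtual memory used",
]

if __name__ == "__main__":
    for cid, phys, virt in memory_fractions(SAMPLE):
        print(f"{cid}: physical {phys:.0%} of limit, virtual {virt:.0%} of limit")
```

The two samples posted here report 148.7 MB and 305.4 MB of a 2 GB physical limit, so it would be worth running the same check across the other containers and nodes before concluding that memory is the cause.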


lguinn2
Legend

Is this a Splunk problem, or is it that you are using Splunk to detect the problem?

If it is actually a Hadoop/YARN problem, then this is not the forum for that question, although it is possible that someone here might know the answer...


suarezry
Builder

Yes, it is a Splunk problem. I am running a Hadoop search using Splunk Analytics for Hadoop and I am getting this problem. I would like some suggestions on how to troubleshoot this.


lguinn2
Legend

Ah - that's helpful to know. Have you looked at the splunkd.log for any error messages?


suarezry
Builder

There is nothing sticking out in the search.log:
https://pastebin.com/rmBTBFcG
