Yarn Container exit code 143

Builder

We are running a 10-datanode Hortonworks HDP v2.5 cluster on Ubuntu 14.04. Whenever I run a large YARN job, the map task shows as SUCCEEDED but with the note "Container killed by the ApplicationMaster. Container killed on request. Exit code is 143 Container exited with a non-zero exit code 143"

Can someone help me troubleshoot this?

yarn-yarn-nodemanager-datanode.log

2017-04-03 10:15:18,140 INFO  containermanager.ContainerManagerImpl (ContainerManagerImpl.java:startContainerInternal(810)) - Start request for container_e10_1484675915702_18333_01_000003 by user root
2017-04-03 10:15:18,151 INFO  application.ApplicationImpl (ApplicationImpl.java:transition(304)) - Adding container_e10_1484675915702_18333_01_000003 to application application_1484675915702_18333
2017-04-03 10:15:18,153 INFO  container.ContainerImpl (ContainerImpl.java:handle(1163)) - Container container_e10_1484675915702_18333_01_000003 transitioned from NEW to LOCALIZING
2017-04-03 10:15:18,157 INFO  yarn.YarnShuffleService (YarnShuffleService.java:initializeContainer(184)) - Initializing container container_e10_1484675915702_18333_01_000003
2017-04-03 10:15:18,157 INFO  yarn.YarnShuffleService (YarnShuffleService.java:initializeContainer(185)) - Initializing container container_e10_1484675915702_18333_01_000003
2017-04-03 10:15:18,358 INFO  localizer.ResourceLocalizationService (ResourceLocalizationService.java:handle(712)) - Created localizer for container_e10_1484675915702_18333_01_000003
2017-04-03 10:15:18,406 INFO  localizer.ResourceLocalizationService (ResourceLocalizationService.java:writeCredentials(1194)) - Writing credentials to the nmPrivate file /grid/3/hadoop/yarn/local/nmPrivate/container_e10_1484675915702_18333_01_000003.tokens. Credentials list: 
2017-04-03 10:15:18,407 INFO  container.ContainerImpl (ContainerImpl.java:handle(1163)) - Container container_e10_1484675915702_18333_01_000003 transitioned from LOCALIZING to LOCALIZED
2017-04-03 10:15:18,458 INFO  container.ContainerImpl (ContainerImpl.java:handle(1163)) - Container container_e10_1484675915702_18333_01_000003 transitioned from LOCALIZED to RUNNING
2017-04-03 10:15:18,462 INFO  nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:buildCommandExecutor(281)) - launchContainer: [bash, /grid/1/hadoop/yarn/local/usercache/root/appcache/application_1484675915702_18333/container_e10_1484675915702_18333_01_000003/default_container_executor.sh]
2017-04-03 10:15:18,465 INFO  nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:startLocalizer(126)) - Copying from /grid/3/hadoop/yarn/local/nmPrivate/container_e10_1484675915702_18333_01_000003.tokens to /grid/2/hadoop/yarn/local/usercache/root/appcache/application_1484675915702_18333/container_e10_1484675915702_18333_01_000003.tokens
2017-04-03 10:15:20,998 INFO  monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(375)) - Starting resource-monitoring for container_e10_1484675915702_18333_01_000003
2017-04-03 10:15:21,144 INFO  monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(464)) - Memory usage of ProcessTree 851 for container-id container_e10_1484675915702_18333_01_000003: 148.7 MB of 2 GB physical memory used; 2.1 GB of 4.2 GB virtual memory used
2017-04-03 10:15:24,293 INFO  monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(464)) - Memory usage of ProcessTree 851 for container-id container_e10_1484675915702_18333_01_000003: 305.4 MB of 2 GB physical memory used; 2.4 GB of 4.2 GB virtual memory used
2017-04-03 10:15:24,734 INFO  containermanager.ContainerManagerImpl (ContainerManagerImpl.java:stopContainerInternal(960)) - Stopping container with container Id: container_e10_1484675915702_18333_01_000003
2017-04-03 10:15:24,734 INFO  container.ContainerImpl (ContainerImpl.java:handle(1163)) - Container container_e10_1484675915702_18333_01_000003 transitioned from RUNNING to KILLING
2017-04-03 10:15:24,734 INFO  launcher.ContainerLaunch (ContainerLaunch.java:cleanupContainer(425)) - Cleaning up container container_e10_1484675915702_18333_01_000003
2017-04-03 10:15:24,743 WARN  nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:launchContainer(237)) - Exit code from container container_e10_1484675915702_18333_01_000003 is : 143
2017-04-03 10:15:24,756 INFO  container.ContainerImpl (ContainerImpl.java:handle(1163)) - Container container_e10_1484675915702_18333_01_000003 transitioned from KILLING to CONTAINER_CLEANEDUP_AFTER_KILL
2017-04-03 10:15:24,757 INFO  nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:deleteAsUser(480)) - Deleting absolute path : /grid/1/hadoop/yarn/local/usercache/root/appcache/application_1484675915702_18333/container_e10_1484675915702_18333_01_000003
2017-04-03 10:15:24,757 INFO  nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:deleteAsUser(480)) - Deleting absolute path : /grid/2/hadoop/yarn/local/usercache/root/appcache/application_1484675915702_18333/container_e10_1484675915702_18333_01_000003
2017-04-03 10:15:24,757 INFO  nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:deleteAsUser(480)) - Deleting absolute path : /grid/3/hadoop/yarn/local/usercache/root/appcache/application_1484675915702_18333/container_e10_1484675915702_18333_01_000003
2017-04-03 10:15:24,757 INFO  nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:deleteAsUser(480)) - Deleting absolute path : /grid/0/hadoop/yarn/local/usercache/root/appcache/application_1484675915702_18333/container_e10_1484675915702_18333_01_000003
2017-04-03 10:15:24,757 INFO  container.ContainerImpl (ContainerImpl.java:handle(1163)) - Container container_e10_1484675915702_18333_01_000003 transitioned from CONTAINER_CLEANEDUP_AFTER_KILL to DONE
2017-04-03 10:15:24,757 INFO  application.ApplicationImpl (ApplicationImpl.java:transition(347)) - Removing container_e10_1484675915702_18333_01_000003 from application application_1484675915702_18333
2017-04-03 10:15:24,757 INFO  logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:startContainerLogAggregation(512)) - Considering container container_e10_1484675915702_18333_01_000003 for log-aggregation
2017-04-03 10:15:24,758 INFO  yarn.YarnShuffleService (YarnShuffleService.java:stopContainer(190)) - Stopping container container_e10_1484675915702_18333_01_000003
2017-04-03 10:15:24,758 INFO  yarn.YarnShuffleService (YarnShuffleService.java:stopContainer(191)) - Stopping container container_e10_1484675915702_18333_01_000003
2017-04-03 10:15:26,338 INFO  nodemanager.NodeStatusUpdaterImpl (NodeStatusUpdaterImpl.java:removeOrTrackCompletedContainersFromContext(553)) - Removed completed containers from NM context: [container_e10_1484675915702_18333_01_000003]
2017-04-03 10:15:27,294 INFO  monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(390)) - Stopping resource-monitoring for container_e10_1484675915702_18333_01_000003
2017-04-03 10:15:34,491 INFO  logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:doContainerLogAggregation(567)) - Uploading logs for container container_e10_1484675915702_18333_01_000003. Current good log dirs are /grid/1/hadoop/yarn/log,/grid/2/hadoop/yarn/log,/grid/3/hadoop/yarn/log,/grid/0/hadoop/yarn/log
2017-04-03 10:15:34,495 INFO  nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:deleteAsUser(489)) - Deleting path : /grid/1/hadoop/yarn/log/application_1484675915702_18333/container_e10_1484675915702_18333_01_000003/syslog
2017-04-03 10:15:34,496 INFO  nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:deleteAsUser(489)) - Deleting path : /grid/1/hadoop/yarn/log/application_1484675915702_18333/container_e10_1484675915702_18333_01_000003/directory.info
2017-04-03 10:15:34,496 INFO  nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:deleteAsUser(489)) - Deleting path : /grid/1/hadoop/yarn/log/application_1484675915702_18333/container_e10_1484675915702_18333_01_000003/stdout
2017-04-03 10:15:34,496 INFO  nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:deleteAsUser(489)) - Deleting path : /grid/1/hadoop/yarn/log/application_1484675915702_18333/container_e10_1484675915702_18333_01_000003/stderr
2017-04-03 10:15:34,496 INFO  nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:deleteAsUser(489)) - Deleting path : /grid/1/hadoop/yarn/log/application_1484675915702_18333/container_e10_1484675915702_18333_01_000003/launch_container.sh

Splunk Employee

It looks as if you have some memory issues in the Hadoop nodes, so some of the jobs are being killed.
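
One way to check the memory theory against the NodeManager log posted above is to pull out the ContainersMonitorImpl "Memory usage" lines and see how close each container got to its limits. The following is only a minimal Python sketch under some assumptions: the regular expression is written against the two sample lines in the question, the helper names (`signal_for_exit_code`, `peak_memory_fraction`) are made up for illustration, and the exit-code decoding assumes the usual shell convention of 128 plus the signal number (so 143 would correspond to signal 15, SIGTERM).

```python
import re
import signal

# Pattern written against the "Memory usage of ProcessTree ..." lines in the
# posted log; other Hadoop versions may format this line differently.
MEM_RE = re.compile(
    r"Memory usage of ProcessTree \d+ for container-id (?P<cid>[^\s:]+): "
    r"(?P<pused>[\d.]+) (?P<pu>[KMG]B) of (?P<plim>[\d.]+) (?P<pl>[KMG]B) "
    r"physical memory used; "
    r"(?P<vused>[\d.]+) (?P<vu>[KMG]B) of (?P<vlim>[\d.]+) (?P<vl>[KMG]B) "
    r"virtual memory used"
)

# Conversion factors to megabytes.
UNIT_TO_MB = {"KB": 1.0 / 1024, "MB": 1.0, "GB": 1024.0}


def to_mb(value, unit):
    """Convert a value such as ('2', 'GB') to megabytes."""
    return float(value) * UNIT_TO_MB[unit]


def signal_for_exit_code(code):
    """Return the signal name if the code follows the 128+N convention."""
    if code <= 128:
        return None
    try:
        return signal.Signals(code - 128).name
    except ValueError:
        return None


def peak_memory_fraction(lines):
    """Map container id -> (peak physical fraction, peak virtual fraction)."""
    peaks = {}
    for line in lines:
        m = MEM_RE.search(line)
        if m is None:
            continue
        phys = to_mb(m["pused"], m["pu"]) / to_mb(m["plim"], m["pl"])
        virt = to_mb(m["vused"], m["vu"]) / to_mb(m["vlim"], m["vl"])
        old_phys, old_virt = peaks.get(m["cid"], (0.0, 0.0))
        peaks[m["cid"]] = (max(old_phys, phys), max(old_virt, virt))
    return peaks


# The two memory samples from the log in the question.
SAMPLE = [
    "2017-04-03 10:15:21,144 INFO  monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(464)) - Memory usage of ProcessTree 851 for container-id container_e10_1484675915702_18333_01_000003: 148.7 MB of 2 GB physical memory used; 2.1 GB of 4.2 GB virtual memory used",
    "2017-04-03 10:15:24,293 INFO  monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(464)) - Memory usage of ProcessTree 851 for container-id container_e10_1484675915702_18333_01_000003: 305.4 MB of 2 GB physical memory used; 2.4 GB of 4.2 GB virtual memory used",
]

if __name__ == "__main__":
    print("exit code 143 ->", signal_for_exit_code(143))
    for cid, (phys, virt) in peak_memory_fraction(SAMPLE).items():
        print(f"{cid}: physical {phys:.0%}, virtual {virt:.0%} of limit")
```

To run it over the whole file rather than the two sample lines, pass an open file handle for yarn-yarn-nodemanager-datanode.log to `peak_memory_fraction`. Containers whose peak fraction is close to 1.0 would be the ones worth a closer look.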

Legend

Is this a Splunk problem, or is it that you are using Splunk to detect the problem?

If it is actually a Hadoop/YARN problem, then this is not the forum for that question, although it is possible that someone here might know the answer...

Builder

Yes, it is a Splunk problem. I am running a Hadoop search using Splunk Analytics for Hadoop and I am getting this problem. I would like some suggestions on how to troubleshoot this.

Legend

Ah - that's helpful to know. Have you looked at the splunkd.log for any error messages?

Builder

Nothing stands out in the search.log:
https://pastebin.com/rmBTBFcG
