We are running a 10-datanode Hortonworks HDP v2.5 cluster on Ubuntu 14.04. Whenever I run a large YARN job, the map task shows as SUCCEEDED but with a note: "Container killed by the ApplicationMaster. Container killed on request. Exit code is 143 Container exited with a non-zero exit code 143"
Can someone help me troubleshoot this?
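For reference when reading the note above: by the usual shell convention, an exit code above 128 means the process was terminated by a signal whose number is the exit code minus 128. The snippet below is a minimal sketch of that arithmetic; the helper name `describe_exit_code` is made up for illustration and is not part of Hadoop or Splunk.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Decode a shell-style exit code (hypothetical helper, for illustration).

    Codes above 128 conventionally mean "terminated by signal (code - 128)".
    """
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not defined on this platform
            name = "unknown signal"
        return f"exit code {code} = 128 + {signum} ({name})"
    return f"exit code {code} (not a signal-based code)"


print(describe_exit_code(143))
```

On Linux, signal 15 is SIGTERM, so exit code 143 says the container process was asked to terminate. By itself it does not say why the request was made.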
yarn-yarn-nodemanager-datanode.log:
2017-04-03 10:15:18,140 INFO containermanager.ContainerManagerImpl (ContainerManagerImpl.java:startContainerInternal(810)) - Start request for container_e10_1484675915702_18333_01_000003 by user root
2017-04-03 10:15:18,151 INFO application.ApplicationImpl (ApplicationImpl.java:transition(304)) - Adding container_e10_1484675915702_18333_01_000003 to application application_1484675915702_18333
2017-04-03 10:15:18,153 INFO container.ContainerImpl (ContainerImpl.java:handle(1163)) - Container container_e10_1484675915702_18333_01_000003 transitioned from NEW to LOCALIZING
2017-04-03 10:15:18,157 INFO yarn.YarnShuffleService (YarnShuffleService.java:initializeContainer(184)) - Initializing container container_e10_1484675915702_18333_01_000003
2017-04-03 10:15:18,157 INFO yarn.YarnShuffleService (YarnShuffleService.java:initializeContainer(185)) - Initializing container container_e10_1484675915702_18333_01_000003
2017-04-03 10:15:18,358 INFO localizer.ResourceLocalizationService (ResourceLocalizationService.java:handle(712)) - Created localizer for container_e10_1484675915702_18333_01_000003
2017-04-03 10:15:18,406 INFO localizer.ResourceLocalizationService (ResourceLocalizationService.java:writeCredentials(1194)) - Writing credentials to the nmPrivate file /grid/3/hadoop/yarn/local/nmPrivate/container_e10_1484675915702_18333_01_000003.tokens. Credentials list:
2017-04-03 10:15:18,407 INFO container.ContainerImpl (ContainerImpl.java:handle(1163)) - Container container_e10_1484675915702_18333_01_000003 transitioned from LOCALIZING to LOCALIZED
2017-04-03 10:15:18,458 INFO container.ContainerImpl (ContainerImpl.java:handle(1163)) - Container container_e10_1484675915702_18333_01_000003 transitioned from LOCALIZED to RUNNING
2017-04-03 10:15:18,462 INFO nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:buildCommandExecutor(281)) - launchContainer: [bash, /grid/1/hadoop/yarn/local/usercache/root/appcache/application_1484675915702_18333/container_e10_1484675915702_18333_01_000003/default_container_executor.sh]
2017-04-03 10:15:18,465 INFO nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:startLocalizer(126)) - Copying from /grid/3/hadoop/yarn/local/nmPrivate/container_e10_1484675915702_18333_01_000003.tokens to /grid/2/hadoop/yarn/local/usercache/root/appcache/application_1484675915702_18333/container_e10_1484675915702_18333_01_000003.tokens
2017-04-03 10:15:20,998 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(375)) - Starting resource-monitoring for container_e10_1484675915702_18333_01_000003
2017-04-03 10:15:21,144 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(464)) - Memory usage of ProcessTree 851 for container-id container_e10_1484675915702_18333_01_000003: 148.7 MB of 2 GB physical memory used; 2.1 GB of 4.2 GB virtual memory used
2017-04-03 10:15:24,293 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(464)) - Memory usage of ProcessTree 851 for container-id container_e10_1484675915702_18333_01_000003: 305.4 MB of 2 GB physical memory used; 2.4 GB of 4.2 GB virtual memory used
2017-04-03 10:15:24,734 INFO containermanager.ContainerManagerImpl (ContainerManagerImpl.java:stopContainerInternal(960)) - Stopping container with container Id: container_e10_1484675915702_18333_01_000003
2017-04-03 10:15:24,734 INFO container.ContainerImpl (ContainerImpl.java:handle(1163)) - Container container_e10_1484675915702_18333_01_000003 transitioned from RUNNING to KILLING
2017-04-03 10:15:24,734 INFO launcher.ContainerLaunch (ContainerLaunch.java:cleanupContainer(425)) - Cleaning up container container_e10_1484675915702_18333_01_000003
2017-04-03 10:15:24,743 WARN nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:launchContainer(237)) - Exit code from container container_e10_1484675915702_18333_01_000003 is : 143
2017-04-03 10:15:24,756 INFO container.ContainerImpl (ContainerImpl.java:handle(1163)) - Container container_e10_1484675915702_18333_01_000003 transitioned from KILLING to CONTAINER_CLEANEDUP_AFTER_KILL
2017-04-03 10:15:24,757 INFO nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:deleteAsUser(480)) - Deleting absolute path : /grid/1/hadoop/yarn/local/usercache/root/appcache/application_1484675915702_18333/container_e10_1484675915702_18333_01_000003
2017-04-03 10:15:24,757 INFO nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:deleteAsUser(480)) - Deleting absolute path : /grid/2/hadoop/yarn/local/usercache/root/appcache/application_1484675915702_18333/container_e10_1484675915702_18333_01_000003
2017-04-03 10:15:24,757 INFO nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:deleteAsUser(480)) - Deleting absolute path : /grid/3/hadoop/yarn/local/usercache/root/appcache/application_1484675915702_18333/container_e10_1484675915702_18333_01_000003
2017-04-03 10:15:24,757 INFO nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:deleteAsUser(480)) - Deleting absolute path : /grid/0/hadoop/yarn/local/usercache/root/appcache/application_1484675915702_18333/container_e10_1484675915702_18333_01_000003
2017-04-03 10:15:24,757 INFO container.ContainerImpl (ContainerImpl.java:handle(1163)) - Container container_e10_1484675915702_18333_01_000003 transitioned from CONTAINER_CLEANEDUP_AFTER_KILL to DONE
2017-04-03 10:15:24,757 INFO application.ApplicationImpl (ApplicationImpl.java:transition(347)) - Removing container_e10_1484675915702_18333_01_000003 from application application_1484675915702_18333
2017-04-03 10:15:24,757 INFO logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:startContainerLogAggregation(512)) - Considering container container_e10_1484675915702_18333_01_000003 for log-aggregation
2017-04-03 10:15:24,758 INFO yarn.YarnShuffleService (YarnShuffleService.java:stopContainer(190)) - Stopping container container_e10_1484675915702_18333_01_000003
2017-04-03 10:15:24,758 INFO yarn.YarnShuffleService (YarnShuffleService.java:stopContainer(191)) - Stopping container container_e10_1484675915702_18333_01_000003
2017-04-03 10:15:26,338 INFO nodemanager.NodeStatusUpdaterImpl (NodeStatusUpdaterImpl.java:removeOrTrackCompletedContainersFromContext(553)) - Removed completed containers from NM context: [container_e10_1484675915702_18333_01_000003]
2017-04-03 10:15:27,294 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(390)) - Stopping resource-monitoring for container_e10_1484675915702_18333_01_000003
2017-04-03 10:15:34,491 INFO logaggregation.AppLogAggregatorImpl (AppLogAggregatorImpl.java:doContainerLogAggregation(567)) - Uploading logs for container container_e10_1484675915702_18333_01_000003. Current good log dirs are /grid/1/hadoop/yarn/log,/grid/2/hadoop/yarn/log,/grid/3/hadoop/yarn/log,/grid/0/hadoop/yarn/log
2017-04-03 10:15:34,495 INFO nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:deleteAsUser(489)) - Deleting path : /grid/1/hadoop/yarn/log/application_1484675915702_18333/container_e10_1484675915702_18333_01_000003/syslog
2017-04-03 10:15:34,496 INFO nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:deleteAsUser(489)) - Deleting path : /grid/1/hadoop/yarn/log/application_1484675915702_18333/container_e10_1484675915702_18333_01_000003/directory.info
2017-04-03 10:15:34,496 INFO nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:deleteAsUser(489)) - Deleting path : /grid/1/hadoop/yarn/log/application_1484675915702_18333/container_e10_1484675915702_18333_01_000003/stdout
2017-04-03 10:15:34,496 INFO nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:deleteAsUser(489)) - Deleting path : /grid/1/hadoop/yarn/log/application_1484675915702_18333/container_e10_1484675915702_18333_01_000003/stderr
2017-04-03 10:15:34,496 INFO nodemanager.DefaultContainerExecutor (DefaultContainerExecutor.java:deleteAsUser(489)) - Deleting path : /grid/1/hadoop/yarn/log/application_1484675915702_18333/container_e10_1484675915702_18333_01_000003/launch_container.sh
It looks as if you have some memory issues in the Hadoop nodes, so some of the jobs are being killed.
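One way to check the memory hypothesis against the NodeManager log posted above is to pull out the `ContainersMonitorImpl` "Memory usage of ProcessTree" lines and compare usage to the container limits. The following is a rough sketch, not an official tool: the regular expression is written against the line format seen in this thread, the function names are invented for illustration, and the units are assumed to be binary (1 GB = 1024 MB).

```python
import re

# Assumption: sizes in the log use binary prefixes (1 GB = 1024 MB).
UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}

# Pattern modelled on the monitor lines in the posted log.
MEM_RE = re.compile(
    r"container-id (?P<cid>\S+): "
    r"(?P<pu>[\d.]+) (?P<puu>\w+) of (?P<pl>[\d.]+) (?P<plu>\w+) "
    r"physical memory used; "
    r"(?P<vu>[\d.]+) (?P<vuu>\w+) of (?P<vl>[\d.]+) (?P<vlu>\w+) "
    r"virtual memory used"
)


def to_bytes(value: str, unit: str) -> float:
    """Convert a value such as ('305.4', 'MB') to bytes."""
    return float(value) * UNITS[unit]


def memory_ratios(line: str):
    """Return (container_id, physical_ratio, virtual_ratio), or None
    if the line is not a memory-usage line."""
    m = MEM_RE.search(line)
    if m is None:
        return None
    phys = to_bytes(m["pu"], m["puu"]) / to_bytes(m["pl"], m["plu"])
    virt = to_bytes(m["vu"], m["vuu"]) / to_bytes(m["vl"], m["vlu"])
    return m["cid"], phys, virt


# Last monitor sample before the container was stopped (from the log above).
sample = (
    "Memory usage of ProcessTree 851 for container-id "
    "container_e10_1484675915702_18333_01_000003: 305.4 MB of 2 GB "
    "physical memory used; 2.4 GB of 4.2 GB virtual memory used"
)
cid, phys, virt = memory_ratios(sample)
print(f"{cid}: physical {phys:.0%}, virtual {virt:.0%}")
```

Keep in mind that the monitor only samples every few seconds, so a short spike between samples would not show up in these lines.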
Is this a Splunk problem, or is it that you are using Splunk to detect the problem?
If it is actually a Hadoop/YARN problem, then this is not the forum for that question - although it is possible that someone here might know the answer...
Yes, it is a Splunk problem. I am running a Hadoop search using Splunk Analytics for Hadoop and I am getting this problem. I would like some suggestions on how to troubleshoot this.
Ah - that's helpful to know. Have you looked at the splunkd.log for any error messages?
There is nothing sticking out in the search.log:
https://pastebin.com/rmBTBFcG