Thursday, January 14, 2016

[ FAQ ] Avoid MapReduce Out-of-Memory Errors

Source From Here
How-To
If your installation uses the TaskTracker (pre-YARN distributions), your lens builds can fail with out-of-memory errors. Better tuning of the TaskTracker task properties and Java Virtual Machine (JVM) settings can fix these errors.

When creating a MapReduce job, Hadoop does not dynamically detect system resources to determine the number of map or reduce task slots to allocate. Instead, the MapReduce job tries to use as many task slots as it is allowed, each with as much JVM memory as it is allowed. A Platfora configuration can pass a JVM allocation to a MapReduce job, but it cannot set the number of allowable tasks; the Hadoop configuration controls that.

The maximum number of simultaneous tasks that can run on a Hadoop TaskTracker node is configured by the following MapReduce configuration properties:
mapred.tasktracker.map.tasks.maximum
The maximum number of map tasks that will be run simultaneously by a task tracker. (Default is 2)

mapred.tasktracker.reduce.tasks.maximum
The maximum number of reduce tasks that will be run simultaneously by a task tracker. (Default is 2)
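As a minimal sketch, an administrator might set these two limits in mapred-site.xml on each TaskTracker node as shown below. The values 5 and 3 are illustrative only (they match the sizing example later in this post), not recommendations:

<!-- mapred-site.xml on each TaskTracker node; example values only -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>5</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>3</value>
</property>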

Note.
In MR1, the mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum (mapred-default.xml) properties dictated how many map and reduce slots each TaskTracker had. These properties no longer exist in YARN. Instead, YARN uses yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores (yarn-default.xml), which control the amount of memory and CPU on each node that is available to both map and reduce tasks. If you were using Cloudera Manager to configure these properties automatically, Cloudera Manager takes care of this in MR2 as well. If you are configuring them manually, set them to the amount of memory and number of cores on the machine after subtracting out resources needed for other services.
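For reference, a YARN cluster would instead size its NodeManager resources in yarn-site.xml. The following is a minimal sketch, assuming a node with 32 GB of RAM (4 GB reserved for other services) and 8 cores; the values are illustrative, not recommendations:

<!-- yarn-site.xml: total resources a NodeManager offers to containers -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>28672</value>  <!-- 28 GB = 32 GB minus 4 GB reserved -->
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>8</value>
</property>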

Your Hadoop administrator configures these on the Hadoop TaskTracker nodes.

The maximum number of tasks able to run on a single TaskTracker node is independent of the number of tasks a job needs. For example, a job may require 10 map tasks. If the maximum number of map task slots is 5, the job runs 5 tasks at a time until it completes all 10. Likewise, the maximum number of reduce task slots could be set to 10, while a job may only need to use 2 reduce slots.

Your Hadoop administrator should make sure that the number of task slots is sized according to the amount of memory and CPU available on your Hadoop TaskTracker nodes, and the typical job workload. If the tracker nodes have swap enabled, administrators can reduce these limits to take that into account.

The total JVM size that Hadoop allocates per task slot is set by the mapred.child.java.opts property. You set this in Platfora's local mapred-site.xml file. Platfora needs at least a 1 GB JVM size for its task slots. If you decide to use a larger JVM size to optimize lens build performance, make sure not to over-allocate system memory on your Hadoop TaskTracker nodes.

In addition to its operating system requirements, a TaskTracker node needs enough RAM to support the TaskTracker process, the DataNode JVM, and any other processes the node may run. Think of this as the node's RAM requirements; 4 GB is a typical reserve. A good rule of thumb for setting mapred.child.java.opts for Platfora is:
(total node RAM – node RAM requirements) / (mapred.tasktracker.map.tasks.maximum + mapred.tasktracker.reduce.tasks.maximum)

For example, if a TaskTracker node has 32 GB of RAM, minus 4 GB of reserve memory, then 28 GB is available for all MapReduce tasks. If the maximum number of map tasks allowed is 5 and the maximum number of reduce tasks is 3, then no more than 8 tasks can run at one time on the node. 28 GB divided by 8 is 3.5 GB per task (-Xmx3500M).
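Applying that result, the per-task heap from the worked example could be set in Platfora's local mapred-site.xml roughly as follows. This is a sketch based on the 3.5 GB figure computed above; tune the value for your own nodes:

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx3500M</value>  <!-- 28 GB / 8 task slots, from the example above -->
</property>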

Supplement
Hadoop parameter settings – mapred-site.xml
FAQ - Out of Memory Error in Hadoop
You can assign more memory by editing the conf/mapred-site.xml file and adding the property:
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m</value>
</property>
This will start the Hadoop JVMs with more heap space.
