This is a general overview. If you have any questions, please get in touch.
Frequently Asked Questions
- Why does -update not create the parent source-directory under a pre-existing target directory?
The behaviour of -update and -overwrite is described in detail in the Usage section of this document. In short, if either option is used with a pre-existing destination directory, the contents of each source directory are copied over, rather than the source-directory itself. This behaviour is consistent with the legacy DistCp implementation as well.
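The path mapping described above can be sketched in a few lines of shell. This is only an illustration of where a file ends up, assuming the target directory already exists; target_path is a hypothetical helper written for this example and is not part of Hadoop, and the /source/first and /target paths are made up.

```shell
# Illustration only: mimics how a source file maps to a target path.
# target_path is a hypothetical helper, not a Hadoop command.
# Arguments: <mode> <source-dir> <file-path> <target-dir>
#   mode "plain"  : neither -update nor -overwrite; the source directory
#                   itself is created under the existing target
#   mode "update" : -update or -overwrite; only the contents are copied
target_path() {
  local mode=$1 src_dir=${2%/} file=$3 target=${4%/}
  local rel=${file#"$src_dir"/}   # file path relative to the source dir
  if [ "$mode" = "update" ]; then
    printf '%s/%s\n' "$target" "$rel"
  else
    printf '%s/%s/%s\n' "$target" "$(basename "$src_dir")" "$rel"
  fi
}

target_path plain  /source/first /source/first/1 /target   # /target/first/1
target_path update /source/first /source/first/1 /target   # /target/1
```

With "plain" the directory name first is preserved under /target, while "update" flattens it away, which is why the parent source-directory does not appear under the target.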
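Note: when the target directory already exists and -update or -overwrite is given on the command line, only the contents of the source directory are copied, not the source directory itself. Both distcp and distcp2 behave this way.
- How does the new DistCp differ in semantics from the Legacy DistCp?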
- With Legacy DistCp, files that were skipped during copy also had their file-attributes (permissions, owner/group info, etc.) left unchanged. These are now updated, even if the file-copy is skipped.
- Note: with the default (legacy) distcp, files that have already been copied are skipped.
- Empty root directories among the source-path inputs were not created at the target, in Legacy DistCp. These are now created.
- Note: legacy distcp does not create empty source directories at the target, but distcp2 does.
- Why does the new DistCp use more maps than legacy DistCp?
Legacy DistCp works by figuring out what files need to be actually copied to target before the copy-job is launched, and then launching as many maps as required for copy. So if a majority of the files need to be skipped (because they already exist, for example), fewer maps will be needed. As a consequence, the time spent in setup (i.e. before the M/R job) is higher.
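Note: the default (legacy) distcp analyses which files need to be copied before the job runs. If a large number of files are skipped (because they already exist), only a few maps are needed, but the time spent before the M/R job starts is longer.
The new DistCp calculates only the contents of the source-paths. It doesn't try to filter out what files can be skipped. That decision is put off till the M/R job runs. This is much faster (vis-a-vis execution-time), but the number of maps launched will be as specified in the -m option, or 20 (default) if unspecified.
Note: distcp2 only lists the contents of the source paths and does not filter out the files that will be skipped; that work is left to the M/R job. The default number of maps is 20.
- Why does DistCp not run faster when more maps are specified?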
At present, the smallest unit of work for DistCp is a file, i.e., a file is processed by only one map. Increasing the number of maps to a value exceeding the number of files would yield no performance benefit. In that case, the number of maps launched would equal the number of files.
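As a rough sketch of the rule above, the number of maps that actually get work can be modelled as the smaller of the requested map count and the number of files. The effective_maps helper below is hypothetical and only restates this FAQ's description; it is not taken from the DistCp source. The default of 20 is the value quoted earlier for an unspecified -m.

```shell
# Illustration only: models the map count described in this FAQ.
# effective_maps is a hypothetical helper, not a Hadoop command.
# Arguments: <file-count> [requested-maps]
effective_maps() {
  local files=$1 requested=${2:-20}   # 20 is the default when -m is not given
  if [ "$requested" -gt "$files" ]; then
    echo "$files"        # more maps than files: capped at one map per file
  else
    echo "$requested"
  fi
}

effective_maps 8 50      # 8
effective_maps 1000 50   # 50
effective_maps 1000      # 20
```

Under this model, raising -m from 50 to 500 for a copy of 8 files changes nothing, because only 8 maps have a file to process.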
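Note: the smallest unit of work is a file, and a given file is handled by a single map. Raising the number of maps beyond the number of files brings no performance gain; the number of maps should match the number of files.
- Why does DistCp run out of memory?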
If the number of individual files/directories being copied from the source path(s) is extremely large (e.g. 1,000,000 paths), DistCp might run out of memory while determining the list of paths for copy. This is not unique to the new DistCp implementation.
To get around this, consider changing the -Xmx JVM heap-size parameters, as follows:
bash$ export HADOOP_CLIENT_OPTS="-Xms64m -Xmx1024m"
bash$ hadoop distcp2 /source /target