Problem Description
I am working on a cluster where a dataset is kept in HDFS in a distributed manner. Here is what I have:
[hmi@bdadev-5 ~]$ hadoop fs -ls /bdatest/clm/data/
Found 1840 items
-rw-r--r-- 3 bda supergroup 0 2015-08-11 00:32 /bdatest/clm/data/_SUCCESS
-rw-r--r-- 3 bda supergroup 34404390 2015-08-11 00:32 /bdatest/clm/data/part-00000
-rw-r--r-- 3 bda supergroup 34404062 2015-08-11 00:32 /bdatest/clm/data/part-00001
-rw-r--r-- 3 bda supergroup 34404259 2015-08-11 00:32 /bdatest/clm/data/part-00002
....
....
The data format is:
[hmi@bdadev-5 ~]$ hadoop fs -cat /bdatest/clm/data/part-00000|head
V|485715986|1|8ca217a3d75d8236|Y|Y|Y|Y/1X||Trimode|SAMSUNG|1x/Trimode|High|Phone|N|Y|Y|Y|N|Basic|Basic|Basic|Basic|N|N|N|N|Y|N|Basic-Communicator|Y|Basic|N|Y|1X|Basic|1X|||SAM|Other|SCH-A870|SCH-A870|N|N|M2MC|
So, what I want to do is count the total number of lines in the original data file data. My understanding is that the distributed chunks like part-00000, part-00001, etc. have overlaps, so just counting the number of lines in the part-xxxx files and summing them won't work. Also, the original dataset data is ~70GB in size. How can I efficiently find out the total number of lines?
Recommended Answer
More efficiently, you can use Spark to count the number of lines. The following snippet does it (sc is the SparkContext):
text_file = sc.textFile("hdfs://...")
count = text_file.count()
print(count)
This prints the total number of lines.
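For a standalone job, here is a minimal sketch of the same count as a spark-submit script. It is only an illustration under assumptions: the application name is arbitrary, and the input path is the directory from the question; adjust both for your cluster.
# line_count.py -- run with: spark-submit line_count.py
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("LineCount")   # arbitrary app name
sc = SparkContext(conf=conf)

# textFile() reads every part-xxxxx file under the directory as one RDD
lines = sc.textFile("/bdatest/clm/data/")
print(lines.count())

sc.stop()
Each executor counts only its own partitions and ships a partial sum back to the driver, which is why this scales to a ~70GB dataset without pulling the data onto one machine.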
Note: the data in the different part files does not overlap.
Using hdfs dfs -cat /bdatest/clm/data/part-* | wc -l will also give you the answer, but this streams all the data to the local machine and takes longer.
The best solution is to use MapReduce or Spark. MapReduce will take longer to develop and execute. If Spark is installed, it is the best choice.
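To make the development-effort comparison concrete, here is a hedged Hadoop Streaming sketch of the same line count using two small Python scripts; the file names are hypothetical and the streaming jar location varies by distribution.
# mapper.py -- count the lines of each input split, emit one partial count
import sys

count = 0
for _ in sys.stdin:
    count += 1
print("lines\t%d" % count)

# reducer.py -- sum the partial counts from all mappers
import sys

total = 0
for line in sys.stdin:
    _, value = line.rstrip("\n").split("\t")
    total += int(value)
print(total)

# Submit (jar path is an assumption; check your distribution):
# hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
#     -input /bdatest/clm/data -output /tmp/line_count \
#     -mapper mapper.py -reducer reducer.py \
#     -file mapper.py -file reducer.py
Even for something this trivial, you end up maintaining two scripts plus a job submission, which is the development overhead this answer is pointing at.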