How does Hadoop read input files?

Problem Description

I have a CSV file to analyze with Hadoop MapReduce. Will Hadoop parse it line by line? If so, I want to split each line on commas to get the fields I need to analyze. Or is there a better way to parse the CSV and feed it into Hadoop? The file is 10 GB and comma-delimited, and I want to use Java with Hadoop. Does the Text-type parameter "value" in the map() method below contain each line as parsed in by MapReduce? This is the part I'm most confused about.

This is my code:

public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    try {
        // "value" holds one full line of the CSV file
        String[] tokens = value.toString().split(",");

        String crimeType = tokens[5].trim();
        int year = Integer.parseInt(tokens[17].trim());

        // context.write() requires Hadoop Writable types,
        // not a raw String and int
        context.write(new Text(crimeType), new IntWritable(year));

    } catch (Exception e) {
        // skip malformed lines
    }
}
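
For this snippet to compile, it has to live inside a Mapper subclass whose declared output types match what context.write() emits. A minimal sketch of that enclosing class, using the hypothetical name CrimeMapper:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical class name. The four generic parameters declare the
// input key/value types (byte offset, line of text) and the output
// key/value types (crime type, year).
public class CrimeMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // map() body as shown above
    }
}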


Recommended Answer

Yes, by default Hadoop uses TextInputFormat, which feeds the mapper one line at a time from the input file. The key passed to the mapper is the byte offset of the line just read. Be careful with CSV files, though: a single column/field may contain a line break. You might want to look for a CSV-aware input reader like this one: https://github.com/mvallebr/CSVInputFormat/blob/master/src/main/java/org/apache/hadoop/mapreduce/lib/input/CSVNLineInputFormat.java
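
If you keep that default, no input format needs to be configured at all. Below is a minimal driver sketch under that assumption; the class names CrimeDriver and CrimeMapper, the job name, and the command-line argument handling are illustrative, not part of the original question:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CrimeDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "crime-by-year");
        job.setJarByClass(CrimeDriver.class);
        job.setMapperClass(CrimeMapper.class);   // hypothetical mapper from above
        job.setOutputKeyClass(Text.class);       // matches context.write(Text, IntWritable)
        job.setOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(0);                // map-only for illustration; add a reducer as needed
        // No setInputFormatClass() call: TextInputFormat is the default,
        // so the mapper receives (byte offset, line) pairs automatically.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}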
