MRJob and Python: .csv file output from the reducer?

Question

I'm using the mrjob module with Python 2.7. I have created a class that inherits from MRJob, and the inherited mapper function maps everything correctly.

The problem is that I would like the reducer function to output a .csv file. Here is the code for the reducer:

def reducer(self, geo_key, info_list):
    info_list.insert(0, ['Name,Age,Gender,Height'])
    for set in info_list:
        yield set

Then I run this on the command line: python -m map_csv <inputfile.txt> outputfile.csv

I keep getting this error, and I don't really understand why:

Counters from step 1:
  Unencodable output:
    TypeError: 785

The info_list parameter in the reducer is simply a list of lists, each holding values that match the types in the header, for example:

[
['Bill', 28, 'Male',75],
['Emily', 16, 'Female',56],
['Jason', 21, 'Male',63]]

Any idea what the problem is here? Thanks!

Recommended answer

To manage input and output formats in mrjob, you need to use protocols.
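As a rough illustration of what a protocol is: mrjob treats a protocol as an object with a read(line) method that returns a (key, value) pair and a write(key, value) method that returns one encoded line. The sketch below is a hypothetical minimal CSV protocol built only on the standard library. The class name SimpleCsvProtocol is made up for this example, it assumes Python 3 and UTF-8 data, and it is not the implementation used by any existing package.

```python
import csv
import io


class SimpleCsvProtocol(object):
    """Hypothetical minimal CSV protocol (illustration only)."""

    def read(self, line):
        # Parse one CSV line into a list of string fields; there is no key.
        if isinstance(line, bytes):
            line = line.decode('utf-8')
        row = next(csv.reader([line]))
        return None, row

    def write(self, key, value):
        # Ignore the key and format the value (a list or tuple) as one CSV
        # line without a trailing newline, encoded as bytes.
        buf = io.StringIO()
        csv.writer(buf, lineterminator='').writerow(value)
        return buf.getvalue().encode('utf-8')
```

A class like this would be assigned to OUTPUT_PROTOCOL on the job class, and the reducer would then yield (None, row) pairs.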

Luckily, there is an existing package that implements a CSV protocol you could use: https://pypi.python.org/pypi/mr3px

Import the package in your job script:

from mr3px.csvprotocol import CsvProtocol

Specify the protocol in your job class:

class CsvOutputJob(MRJob):
    ...
    OUTPUT_PROTOCOL = CsvProtocol  # write output as CSV

Then just yield your list (or tuple) of fields:

def reducer(self, geo_key, info_list):
    for row in info_list:
        yield (None, row)

Note that you cannot reliably add a header row to this output, because Hadoop will use several reducers to generate the output in parallel.
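One possible workaround is to add the header after the job has finished, when combining the output files. The sketch below assumes the job output has been copied to a local directory of Hadoop-style part-* files. The function name merge_with_header and its parameters are hypothetical and are not part of mrjob or mr3px.

```python
import glob
import os
import shutil


def merge_with_header(part_dir, out_path, header):
    """Write the header line, then append every part-* file in name order."""
    with open(out_path, 'w') as out:
        out.write(header + '\n')
        for part in sorted(glob.glob(os.path.join(part_dir, 'part-*'))):
            with open(part) as src:
                # Copy the rows of this part file unchanged.
                shutil.copyfileobj(src, out)
```

For example, merge_with_header('output_dir', 'outputfile.csv', 'Name,Age,Gender,Height') would produce a single CSV file with exactly one header row. Each part file is expected to end with a newline.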

To use this package on EMR, you'll need to install it during the instance bootstrap phase by adding an item to the bootstrap section of your config:

runners:
  emr:
    ...
    bootstrap:
      - sudo apt-get install -y python-setuptools
      - sudo easy_install pip
      - sudo pip install mr3px

Disclaimer: I am the maintainer of the mr3px package, which is forked from mr3po.

