Problem description
I'm using the MRJob module with Python 2.7. I have created a class that inherits from MRJob, and everything is mapped correctly by the inherited mapper function.
The problem is that I would like the reducer function to output a .csv file. Here is the code for the reducer:
def reducer(self, geo_key, info_list):
    info_list.insert(0, ['Name,Age,Gender,Height'])
    for set in info_list:
        yield set
Then I run this on the command line: python -m map_csv <inputfile.txt> outputfile.csv
I keep getting this error, and I don't really understand why:
Counters from step 1:
Unencodable output:
TypeError: 785
The info_list parameter in the reducer is simply a list containing lists of various values that match the types in the header, for example:
[
['Bill', 28, 'Male',75],
['Emily', 16, 'Female',56],
['Jason', 21, 'Male',63]]
Any idea what the problem is here? Thanks!
Recommended answer
To manage input and output formats in mrjob, you need to use protocols.
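To make the idea concrete, here is a minimal sketch of what a CSV output protocol does. It assumes Python 3 and the convention in recent mrjob versions that a protocol is an object whose write(key, value) method returns one encoded line without a trailing newline. SimpleCsvProtocol is a hypothetical name used only for illustration; it is not the mr3px implementation.

```python
import csv
import io


class SimpleCsvProtocol(object):
    """Hypothetical stand-in for a CSV output protocol (not mr3px's code)."""

    def write(self, key, value):
        # The key is ignored; the value (a list or tuple of fields)
        # is formatted as a single CSV line.
        buf = io.StringIO()
        csv.writer(buf, lineterminator="").writerow(value)
        # Assumption: the framework expects bytes and adds the newline itself.
        return buf.getvalue().encode("utf-8")


if __name__ == "__main__":
    proto = SimpleCsvProtocol()
    print(proto.write(None, ["Bill", 28, "Male", 75]))
```

Using the csv module rather than a plain ','.join() means fields that contain commas or quotes are escaped correctly.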
Luckily, there is an existing package which implements a CSV protocol that you could use - https://pypi.python.org/pypi/mr3px
Import the package in your job script
from mr3px.csvprotocol import CsvProtocol
Specify the protocol in your job class
class CsvOutputJob(MRJob):
    ...
    OUTPUT_PROTOCOL = CsvProtocol  # write output as CSV
Then just yield your list (or tuple) of fields as the value, with None as the key:
def reducer(self, geo_key, info_list):
    for row in info_list:
        yield (None, row)
Note that you cannot reliably add a header row to this output because Hadoop will use several reducers to generate the output in parallel.
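One possible workaround, sketched here under assumptions, is to add the header after the job has finished, when the reducer outputs are merged into a single file. The function name concat_with_header and the part-file paths are hypothetical, and the sketch assumes the output files have already been copied to the local filesystem.

```python
import shutil


def concat_with_header(part_paths, out_path, header_line):
    """Write header_line once, then append the contents of each part file."""
    with open(out_path, "w") as out:
        out.write(header_line + "\n")
        for path in part_paths:
            with open(path) as part:
                # Copy the part file verbatim after the header.
                shutil.copyfileobj(part, out)
```

For example, concat_with_header(["part-00000", "part-00001"], "outputfile.csv", "Name,Age,Gender,Height") would produce one CSV file with a single header row.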
To use this package on EMR, you'll need to install it during the instance bootstrap phase by adding an item to the bootstrap section of your config:
runners:
  emr:
    ...
    bootstrap:
    - sudo apt-get install -y python-setuptools
    - sudo easy_install pip
    - sudo pip install mr3px
Disclaimer: I am the maintainer of the mr3px package, which is forked from mr3po.