
Problem description

I am running this code in a Hadoop cluster to compute the probability of each value; my data is in a CSV file.

When I run this code in the cluster I get the error "java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1". Can anyone fix my code?

#!/usr/bin/env python3
"""mapper.py"""
import sys

# Get input lines from stdin
for line in sys.stdin:
    # Remove spaces from beginning and end of the line
    line = line.strip()

    # Split it into tokens
    #tokens = line.split()

    #Get probability_mass values
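    # NOTE: iterating over a string yields single characters, not the split tokens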
    for probability_mass in line:
        print(str(probability_mass)+ '\t1')

#!/usr/bin/env python3
"""reducer.py"""
import sys
from collections import defaultdict


counts = defaultdict(int)

# Get input from stdin
for line in sys.stdin:
    #Remove spaces from beginning and end of the line
    line = line.strip()

    # skip empty lines
    if not line:
        continue

    # parse the input from mapper.py
    k,v = line.split('\t', 1)
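    # NOTE: v is the mapper's second tab field, which the mapper above always emits as '1'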
    counts[v] += 1

total = sum(counts.values())
probability_mass = {k:v/total for k,v in counts.items()}
print(probability_mass)

My input file:

marks
10
10
60
10
30

Expected output: the probability of each number:

{10: 0.6, 60: 0.2, 30: 0.2}

But the result still looks like this:
{1:1} {1:1} {1:1} {1:1} {1:1} {1:1}

Recommended answer

The real error should be available in the YARN UI, but putting the probability as the key won't allow you to sum all the values at once, because they would all end up in different reducers.

If you have no key to group the values by, then you can use this, which funnels all the data to one reducer:

print('%s\t%s' % (None, probability_mass))
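
To show where that line fits, here is a minimal corrected mapper.py sketch (my own illustration, not from the original answer; it assumes the input is one mark per line under a header row named "marks"):

#!/usr/bin/env python3
"""mapper.py (sketch)"""
import sys

for line in sys.stdin:
    line = line.strip()
    # skip blank lines and the CSV header row (assumed to be "marks")
    if not line or line == 'marks':
        continue
    # emit a constant key so every record is routed to the same reducer
    print('%s\t%s' % (None, line))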

Here is a working example for the output that you wanted, which I tested with just an input file, not in Hadoop:

import sys
from collections import defaultdict

counts = defaultdict(int)

# Get input from stdin
for line in sys.stdin:
    #Remove spaces from beginning and end of the line
    line = line.strip()

    # skip empty lines
    if not line:
        continue

    # parse the input from mapper.py
    k,v = line.split('\t', 1)
    counts[v] += 1

total = float(sum(counts.values()))
probability_mass = {k:v/total for k,v in counts.items()}
print(probability_mass)

Output

{'10': 0.6, '60': 0.2, '30': 0.2}

You can test it locally with cat file.txt | python mapper.py | sort | python reducer.py (plain sort rather than sort -u, since -u would drop the duplicate records that the counts depend on).
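
With the sketched mapper above and the sample input, the intermediate lines piped into sort would look like:

None	10
None	10
None	60
None	10
None	30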

Plus, mrjob or pyspark are higher-level frameworks that would provide more useful features.
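
As an illustration of that point, here is a minimal mrjob sketch of the same job (my own example, not from the original answer; the class name and the header check are assumptions):

from mrjob.job import MRJob
from collections import Counter

class MRProbabilityMass(MRJob):
    def mapper(self, _, line):
        token = line.strip()
        # skip blanks and the assumed "marks" header row
        if token and token != 'marks':
            # constant key funnels every record to one reducer
            yield None, token

    def reducer(self, _, values):
        counts = Counter(values)
        total = float(sum(counts.values()))
        yield None, {k: v / total for k, v in counts.items()}

if __name__ == '__main__':
    MRProbabilityMass.run()

Saved as, say, mr_probability.py (a hypothetical file name), it can be run locally with python mr_probability.py file.txt; mrjob handles the shuffle between the mapper and the reducer.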

