Python Columns in Spark Totals

So I have a dataset, and what I'm doing is taking a column from the dataset and then mapping it to key-value pairs. The problem is that I can't sum up my values:

position = 1
myData = dataSplit.map(lambda arr: (arr[position]))
print myData.take(10)
myData2 = myData.map(lambda line: line.split(',')).map(lambda fields: ("Column", fields[0])).groupByKey().map(lambda (Column, values): (Column, sum(float(values))))
print myData2.take(10)


This prints out the following:

[u'18964', u'18951', u'18950', u'18949', u'18960', u'18958', u'18956', u'19056', u'18948', u'18969']
TypeError: float() argument must be a string or a number

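The TypeError happens because after groupByKey the value side is an iterable of strings, and float() is being applied to the whole iterable rather than to each element. A minimal pure-Python illustration of the same failure, using values taken from the output above:

values = [u'18964', u'18951']
try:
    float(values)  # fails: float() cannot convert a list/iterable
except TypeError as e:
    print(e)  # float() argument must be a string or a number
print(sum(float(v) for v in values))  # 37915.0 -- convert each element, then sum
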

So when I change it to:

myData2 = myData.map(lambda line: line.split(',')).map(lambda fields: ("Column", fields[0])).groupByKey().map(lambda (Column, values): (values))


I see the following:

[<pyspark.resultiterable.ResultIterable object at 0x7fab6c43f1d0>]

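A ResultIterable is how PySpark represents the grouped values for a key; printing it only shows the object, not its contents. To inspect the grouped values you can materialize them, e.g. (a sketch against the same pipeline, using mapValues to convert each group to a list):

myData2 = myData.map(lambda line: line.split(',')).map(lambda fields: ("Column", fields[0])).groupByKey().mapValues(list)
print(myData2.take(1))  # [('Column', [u'18964', u'18951', ...])]
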

If I just do:

myData2 = myData.map(lambda line: line.split(',')).map(lambda fields: ("Column", fields[0]))


I get this:

[('Column', u'18964'), ('Column', u'18951'), ('Column', u'18950'), ('Column', u'18949'), ('Column', u'18960'), ('Column', u'18958'), ('Column', u'18956'), ('Column', u'19056'), ('Column', u'18948'), ('Column', u'18969')]


Any suggestions?

Best Answer

Solved it:

myData2 = myData.map(lambda line: line.split(',')).map(lambda fields: ("Column", float(fields[0]))).groupByKey().map(lambda (Column, values): (Column, sum(values)))
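
Note that lambda (Column, values): relies on Python 2 tuple-parameter unpacking, which was removed in Python 3. A sketch of an equivalent variant that is Python 3-compatible and also avoids materializing all values per key, using reduceByKey in place of groupByKey plus sum:

myData2 = myData.map(lambda line: line.split(',')) \
                .map(lambda fields: ("Column", float(fields[0]))) \
                .reduceByKey(lambda a, b: a + b)  # sums values pairwise per key
print(myData2.take(10))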

Regarding python - Python Columns in Spark Totals, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/28769716/
