Problem description
I have a CSV file that contains numeric values.
val row = withoutHeader.map {
  line => {
    val arr = line.split(',')
    for (h <- 0 until arr.length) {
      if (arr(h).trim == "") {
        val abc = avgrdd.filter { case ((x, y), z) => x == h && y == arr(dependent_col_index).toDouble } //crashing here
        arr(h) = //imputing with the value above
      }
    }
    arr.mkString(",")
  }
}
This is a snippet of the code where I am trying to impute the missing values with the mean of class labels.
avgrdd contains the averages as key-value pairs, where the key is the pair (column index, class label value). avgrdd is computed using combiners, and I can see that it produces the correct results.
dependent_col_index is the column containing the class labels.
The line with the filter crashes with a NullPointerException. If I remove that line, the output is the original array (comma separated).
I am confused about why the filter operation causes a crash.
Please suggest how to fix this issue.
Example
col1,dependent_col_index
4,1
8,0
,1
21,1
21,0
,1
25,1
,0
34,1
mean for class 1 is 84/4 = 21 and for class 0 is 29/2 = 14.5
Required Output
4,1
8,0
21,1
21,1
21,0
21,1
25,1
14.5,0
34,1
Thanks!
Recommended answer
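You are trying to execute an RDD transformation inside another RDD transformation. Remember that you cannot use an RDD inside another RDD's transformation; this causes an error. RDD operations can only be invoked from the driver, so referencing avgrdd inside the function passed to withoutHeader.map fails on the executors, here with the NullPointerException you observed.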
The way to proceed is as follows:
- Transform the source RDD withoutHeader into an RDD of pairs <Class, Value> of the correct type (Long in your case). Cache it.
- Calculate avgrdd on top of withoutHeader. This should be an RDD of pairs <Class, AvgValue>.
- Join the withoutHeader RDD and avgrdd together. This way, for each row you would have a structure <Class, <Value, AvgValue>>.
- Execute map on top of the result to replace each missing Value with AvgValue.
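The steps above could be sketched roughly as follows. This is only an illustration under several assumptions: the object ImputeByClassMean and the helpers parse, fmt and impute are hypothetical names, the class label is kept as a String key, the averages are recomputed with reduceByKey rather than taken from the asker's combiner-based avgrdd, and the per-class averages are stored as a Map from column index to mean so that rows with several feature columns also work. It uses leftOuterJoin instead of a plain join so that rows of a class with no known values are passed through rather than dropped.

```scala
import org.apache.spark.rdd.RDD

object ImputeByClassMean {
  // Hypothetical helper: split a CSV line, keeping trailing empty fields.
  def parse(line: String): Array[String] = line.split(",", -1).map(_.trim)

  // Hypothetical helper: print whole numbers without a trailing ".0".
  def fmt(d: Double): String = if (d == d.toLong) d.toLong.toString else d.toString

  def impute(withoutHeader: RDD[String], dependentColIndex: Int): RDD[String] = {
    // Step 1: key every row by its class label and cache the result.
    val byClass: RDD[(String, Array[String])] = withoutHeader
      .map(parse)
      .map(arr => (arr(dependentColIndex), arr))
      .cache()

    // Step 2: mean per (class, column) over the non-missing cells,
    // regrouped as class -> Map(column index -> mean).
    val avgByClass: RDD[(String, Map[Int, Double])] = byClass
      .flatMap { case (cls, arr) =>
        arr.indices.collect {
          case h if h != dependentColIndex && arr(h).nonEmpty =>
            ((cls, h), (arr(h).toDouble, 1L))
        }
      }
      .reduceByKey { case ((s1, n1), (s2, n2)) => (s1 + s2, n1 + n2) }
      .map { case ((cls, h), (sum, n)) => (cls, (h, sum / n)) }
      .groupByKey()
      .mapValues(_.toMap)

    // Steps 3 and 4: join rows with their class averages, then fill the gaps.
    byClass.leftOuterJoin(avgByClass).map { case (_, (arr, avgsOpt)) =>
      val avgs = avgsOpt.getOrElse(Map.empty[Int, Double])
      arr.zipWithIndex.map { case (v, h) =>
        if (v.isEmpty) avgs.get(h).map(fmt).getOrElse(v) else v
      }.mkString(",")
    }
  }
}
```

Calling ImputeByClassMean.impute(withoutHeader, dependent_col_index) returns the imputed lines. Note that the join shuffles the data, so the rows will generally not come back in their original order.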
Another option might be to split the RDD into two parts at step 3 (one RDD with missing values, the other with non-missing values), join avgrdd only with the RDD that contains missing values, and then take the union of the two parts. This would be faster if only a small fraction of the values are missing.
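A rough sketch of this split-and-union variant, under assumptions that are not in the original answer: ImputeMissingOnly is a hypothetical name, byClass is assumed to be a cached RDD of (class label, row cells), and avgByClass is assumed to map each class label to a Map from column index to mean.

```scala
import org.apache.spark.rdd.RDD

object ImputeMissingOnly {
  // byClass: (class label, cells of the row); assumed to be cached by the
  // caller, since it is scanned twice below.
  // avgByClass: (class label, Map(column index -> mean)); assumed layout.
  def impute(byClass: RDD[(String, Array[String])],
             avgByClass: RDD[(String, Map[Int, Double])]): RDD[String] = {
    // Split the rows into those with at least one empty cell and the rest.
    val incomplete = byClass.filter { case (_, arr) => arr.exists(_.isEmpty) }
    val complete = byClass.filter { case (_, arr) => !arr.exists(_.isEmpty) }

    // Only the incomplete rows take part in the join.
    val filled = incomplete.leftOuterJoin(avgByClass).map { case (_, (arr, avgsOpt)) =>
      val avgs = avgsOpt.getOrElse(Map.empty[Int, Double])
      arr.zipWithIndex.map { case (v, h) =>
        if (v.isEmpty) avgs.get(h).map(_.toString).getOrElse(v) else v
      }.mkString(",")
    }

    // Complete rows skip the join and are only re-serialized.
    complete.map { case (_, arr) => arr.mkString(",") }.union(filled)
  }
}
```

Whether this is actually faster depends on the data: the two filters scan byClass twice, so the saving comes from shuffling fewer rows in the join.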