How to handle empty entries in SparkR

Question

I have a SparkSQL DataFrame.

Some entries in this data are empty but they don't behave like NULL or NA. How could I remove them? Any ideas?

In R I can easily remove them, but in SparkR it says that there is a problem with the S4 system/methods.

Thanks.

Recommended answer

SparkR Column provides a long list of useful methods including isNull and isNotNull:

> library(magrittr)  # provides the %>% pipe used below
> people_local <- data.frame(Id=1:4, Age=c(21, 18, 30, NA))
> people <- createDataFrame(sqlContext, people_local)
> head(people)
  Id Age
1  1  21
2  2  18
3  3  30
4  4  NA

> filter(people, isNotNull(people$Age)) %>% head()
  Id Age
1  1  21
2  2  18
3  3  30

> filter(people, isNull(people$Age)) %>% head()
  Id Age
1  4  NA

Please keep in mind that there is no distinction between NA and NaN in SparkR.
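The question mentions entries that are empty but do not behave like NULL or NA. If those entries are empty strings, isNull will not match them, because an empty string is an ordinary value. Below is a minimal sketch under that assumption. The data frame `names_local` and its contents are hypothetical, and `sqlContext` is assumed to be initialised as in the examples above:

```r
library(SparkR)
library(magrittr)  # provides %>%

# Hypothetical data: one real name, one empty string, one whitespace-only string.
names_local <- data.frame(
    Id = 1:3, Name = c("Alice", "", "  "), stringsAsFactors = FALSE)
# sqlContext is assumed to exist already; newer SparkR releases take the
# local data frame as the only required argument of createDataFrame.
names_df <- createDataFrame(sqlContext, names_local)

# Empty strings are ordinary values, so isNull does not flag them.
filter(names_df, isNull(names_df$Name)) %>% count()

# Keep only rows whose Name still has characters after trimming whitespace.
non_empty <- filter(names_df, trim(names_df$Name) != "")
non_empty %>% head()
```

Whether trimming is appropriate depends on the data; dropping the `trim` call removes only entries that are exactly the empty string.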

If you prefer operations on a whole data frame there is a set of NA functions including fillna and dropna:

> fillna(people, 99) %>% head()
  Id Age
1  1  21
2  2  18
3  3  30
4  4  99

> dropna(people) %>% head()
  Id Age
1  1  21
2  2  18
3  3  30

Both can be adjusted to consider only some subset of columns (cols), and dropna has some additional useful parameters. For example, you can specify the minimum number of non-null columns a row must have to be kept:

> people_with_names_local <- data.frame(
    Id=1:4, Age=c(21, 18, 30, NA), Name=c("Alice", NA, "Bob", NA))
> people_with_names <- createDataFrame(sqlContext, people_with_names_local)
> people_with_names %>% head()
  Id Age  Name
1  1  21 Alice
2  2  18  <NA>
3  3  30   Bob
4  4  NA  <NA>

> dropna(people_with_names, minNonNulls=2) %>% head()
  Id Age  Name
1  1  21 Alice
2  2  18  <NA>
3  3  30   Bob
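The cols argument mentioned above can be sketched in the same way. This is an illustrative example rather than part of the original answer; it rebuilds the same data, assumes `sqlContext` is initialised as before, and the variable names `age_known` and `filled` are made up for the example:

```r
library(SparkR)
library(magrittr)  # provides %>%

people_with_names_local <- data.frame(
    Id = 1:4, Age = c(21, 18, 30, NA), Name = c("Alice", NA, "Bob", NA),
    stringsAsFactors = FALSE)
# sqlContext is assumed to exist already, as in the examples above.
people_with_names <- createDataFrame(sqlContext, people_with_names_local)

# Drop a row only when Age is missing; a missing Name is tolerated.
age_known <- dropna(people_with_names, cols = "Age")
age_known %>% head()

# Fill each column with its own default via a named list
# (in this form the cols argument is ignored).
filled <- fillna(people_with_names, list(Age = 0, Name = "unknown"))
filled %>% head()
```

Restricting dropna to Age keeps the row with Id 2, which a plain dropna(people_with_names) would discard because its Name is missing.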
