This article explains how to handle null entries in SparkR; the recommended answer below may be a useful reference for anyone facing the same problem.

Problem Description

I have a SparkSQL DataFrame.

Some entries in this data are empty, but they don't behave like NULL or NA. How can I remove them? Any ideas?

In R I can easily remove them, but in SparkR it says that there is a problem with the S4 system/methods.

Thanks.

Recommended Answer

The SparkR Column class provides a long list of useful methods, including isNull and isNotNull:

> people_local <- data.frame(Id=1:4, Age=c(21, 18, 30, NA))
> people <- createDataFrame(sqlContext, people_local)
> head(people)

  Id Age
1  1  21
2  2  18
3  3  30
4  4  NA

> filter(people, isNotNull(people$Age)) %>% head()
  Id Age
1  1  21
2  2  18
3  3  30

> filter(people, isNull(people$Age)) %>% head()
  Id Age
1  4  NA
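The same filtering semantics can be sketched in plain Python, with rows as dicts and None standing in for NULL (an illustrative analogy using the same data as above, not SparkR itself):

```python
# Illustrative analogy in plain Python: rows as dicts, None standing in for NULL.
# Mirrors filter(people, isNotNull(people$Age)) and isNull(people$Age) above.
people = [
    {"Id": 1, "Age": 21},
    {"Id": 2, "Age": 18},
    {"Id": 3, "Age": 30},
    {"Id": 4, "Age": None},
]

# isNotNull: keep rows whose Age is present
with_age = [row for row in people if row["Age"] is not None]

# isNull: keep rows whose Age is missing
without_age = [row for row in people if row["Age"] is None]

print([row["Id"] for row in with_age])     # [1, 2, 3]
print([row["Id"] for row in without_age])  # [4]
```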

Please keep in mind that there is no distinction between NA and NaN in SparkR.
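For contrast, plain Python (like base R) does distinguish a missing value from the floating-point NaN; a quick sketch of the difference that SparkR collapses:

```python
import math

missing = None        # stands in for a NULL / NA: the absence of a value
nan = float("nan")    # IEEE 754 NaN: a real float value that is "not a number"

# NaN is a value, not an absence
assert nan is not None
assert math.isnan(nan)

# NaN famously does not equal itself
assert nan != nan

# None is not a number at all: math.isnan(None) would raise TypeError,
# so missing-ness and NaN-ness are checked differently here,
# whereas SparkR treats both uniformly as null.
print(missing is None, math.isnan(nan))  # True True
```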

If you prefer operations on a whole data frame, there is a set of NA functions, including fillna and dropna:

> fillna(people, 99) %>% head()
 Id Age
1  1  21
2  2  18
3  3  30
4  4  99

> dropna(people) %>% head()
 Id Age
1  1  21
2  2  18
3  3  30

Both can be adjusted to consider only some subset of columns (cols), and dropna has some additional useful parameters. For example, you can specify the minimal number of non-null columns a row must have:

> people_with_names_local <- data.frame(
    Id=1:4, Age=c(21, 18, 30, NA), Name=c("Alice", NA, "Bob", NA))
> people_with_names <- createDataFrame(sqlContext, people_with_names_local)
> people_with_names %>% head()
  Id Age  Name
1  1  21 Alice
2  2  18  <NA>
3  3  30   Bob
4  4  NA  <NA>

> dropna(people_with_names, minNonNulls=2) %>% head()
  Id Age  Name
1  1  21 Alice
2  2  18  <NA>
3  3  30   Bob
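The minNonNulls semantics can be sketched in plain Python (a hypothetical helper illustrating the behavior, not the SparkR implementation): a row survives only if it has at least the given number of non-null fields.

```python
# Hypothetical helper sketching dropna(..., minNonNulls=n) semantics,
# with rows as dicts and None standing in for NULL.
def dropna_min_non_nulls(rows, min_non_nulls):
    return [
        row for row in rows
        if sum(value is not None for value in row.values()) >= min_non_nulls
    ]

people_with_names = [
    {"Id": 1, "Age": 21,   "Name": "Alice"},
    {"Id": 2, "Age": 18,   "Name": None},
    {"Id": 3, "Age": 30,   "Name": "Bob"},
    {"Id": 4, "Age": None, "Name": None},
]

# Rows 1-3 each have at least two non-null fields; row 4 has only one (Id),
# so it is dropped -- matching the SparkR output above.
kept = dropna_min_non_nulls(people_with_names, 2)
print([row["Id"] for row in kept])  # [1, 2, 3]
```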

