Problem description
I have a SparkSQL DataFrame.
Some entries in this data are empty, but they don't behave like NULL or NA. How can I remove them? Any ideas?
In R I can easily remove them, but in SparkR it says that there is a problem with the S4 system/methods.
Thanks.
Recommended answer
SparkR Column provides a long list of useful methods, including isNull and isNotNull:
> library(magrittr)  # provides the %>% pipe used below
> people_local <- data.frame(Id=1:4, Age=c(21, 18, 30, NA))
> people <- createDataFrame(sqlContext, people_local)
> head(people)
Id Age
1 1 21
2 2 18
3 3 30
4 4 NA
> filter(people, isNotNull(people$Age)) %>% head()
Id Age
1 1 21
2 2 18
3 3 30
> filter(people, isNull(people$Age)) %>% head()
Id Age
1 4 NA
Please keep in mind that there is no distinction between NA and NaN in SparkR.
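The question mentions entries that are empty yet do not behave like NULL or NA. One possible cause, which is an assumption here because the question does not show the data, is that they are zero-length strings. The sketch below uses a hypothetical Name column and the Spark 1.x-style sqlContext initialisation used elsewhere in this answer; on Spark 2.x and later you would likely call sparkR.session() and createDataFrame(names_local) instead.

```r
library(SparkR)

# Spark 1.x-style initialisation (assumption: a local Spark install is available).
sc <- sparkR.init(master = "local[1]")
sqlContext <- sparkRSQL.init(sc)

# Hypothetical data: two Name entries are empty strings rather than NA.
names_local <- data.frame(
  Id = 1:4, Name = c("Alice", "", "Bob", ""), stringsAsFactors = FALSE)
names_df <- createDataFrame(sqlContext, names_local)

# isNotNull keeps every row here, because "" is a value, not a null.
not_null <- filter(names_df, isNotNull(names_df$Name))

# Comparing the column against "" removes the empty entries.
non_empty <- filter(names_df, names_df$Name != "")

head(non_empty)
```

If the empty entries turn out to be something else (whitespace, a placeholder string), the comparison would need to be adjusted accordingly.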
If you prefer operations on a whole data frame, there is a set of NA functions including fillna and dropna:
> fillna(people, 99) %>% head()
Id Age
1 1 21
2 2 18
3 3 30
4 4 99
> dropna(people) %>% head()
Id Age
1 1 21
2 2 18
3 3 30
Both can be adjusted to consider only some subset of columns (cols), and dropna has some additional useful parameters. For example, you can specify the minimal number of non-null columns:
> people_with_names_local <- data.frame(
Id=1:4, Age=c(21, 18, 30, NA), Name=c("Alice", NA, "Bob", NA))
> people_with_names <- createDataFrame(sqlContext, people_with_names_local)
> people_with_names %>% head()
Id Age Name
1 1 21 Alice
2 2 18 <NA>
3 3 30 Bob
4 4 NA <NA>
> dropna(people_with_names, minNonNulls=2) %>% head()
Id Age Name
1 1 21 Alice
2 2 18 <NA>
3 3 30 Bob
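The cols argument mentioned above, and the how argument of dropna, can be sketched as follows. This reuses the same example data and again assumes the Spark 1.x-style sqlContext setup; the variable names age_known, not_all_null and age_filled are only illustrative.

```r
library(SparkR)

# Spark 1.x-style initialisation, matching the sqlContext used above
# (assumption: a local Spark install is available).
sc <- sparkR.init(master = "local[1]")
sqlContext <- sparkRSQL.init(sc)

people_with_names_local <- data.frame(
  Id = 1:4, Age = c(21, 18, 30, NA), Name = c("Alice", NA, "Bob", NA))
people_with_names <- createDataFrame(sqlContext, people_with_names_local)

# Drop rows only when Age is null; nulls in Name are ignored.
age_known <- dropna(people_with_names, cols = "Age")

# how = "all" drops a row only if every considered column is null.
not_all_null <- dropna(people_with_names, how = "all", cols = c("Age", "Name"))

# Fill only the Age column, leaving nulls in Name untouched.
age_filled <- fillna(people_with_names, 0, cols = "Age")

head(age_known)
head(not_all_null)
head(age_filled)
```

Restricting cols keeps a null in an unrelated column from removing or altering a row.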