r - 使用Sparklyr和Dplyr时获得不同的结果

刚才我开始使用参考文献sparklyr学习sparklyr软件包。
我做了文件中写的。
使用以下代码时

delay <- flights_tbl %>%
  group_by(tailnum) %>%
  summarise(count = n(), dist = mean(distance), delay = mean(arr_delay)) %>%
  filter(count > 20, dist < 2000, !is.na(delay)) %>%
  collect

Warning messages:
1: Missing values are always removed in SQL.
Use `AVG(x, na.rm = TRUE)` to silence this warning
2: Missing values are always removed in SQL.
Use `AVG(x, na.rm = TRUE)` to silence this warning

> delay
# A tibble: 2,961 x 4
   tailnum count  dist  delay
   <chr>   <dbl> <dbl>  <dbl>
 1 N14228  111    1547  3.71
 2 N24211  130    1330  7.70
 3 N668DN   49.0  1028  2.62
 4 N39463  107    1588  2.16
 5 N516JB  288    1249 12.0
 6 N829AS  230     228 17.2
 7 N3ALAA   63.0  1078  3.59
 8 N793JB  283    1529  4.72
 9 N657JB  285    1286  5.03
10 N53441  102    1661  0.941
# ... with 2,951 more rows

以类似的方式，我想使用nycflights13::flights包对dplyr数据集应用相同的操作

nycflights13::flights %>%
    group_by(tailnum) %>%
    summarise(count = n(), dist = mean(distance), delay = mean(arr_delay)) %>%
     filter(count > 20, dist < 2000, !is.na(delay))

# A tibble: 1,319 x 4
   tailnum count  dist   delay
   <chr>   <int> <dbl>   <dbl>
 1 N102UW     48   536   2.94
 2 N103US     46   535 - 6.93
 3 N105UW     45   525 - 0.267
 4 N107US     41   529 - 5.73
 5 N108UW     60   534 - 1.25
 6 N109UW     48   536 - 2.52
 7 N110UW     40   535   2.80
 8 N111US     30   536 - 0.467
 9 N11206    111  1414  12.7
10 N112US     38   535 - 0.947
# ... with 1,309 more rows

我的问题是为什么我得到不同的结果？

正如文档dplyr中提到的那样，是完整的后端操作
对于sparklyr。

 > sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: i386-w64-mingw32/i386 (32-bit)
Running under: Windows 7 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 [2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 [4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] bindrcpp_0.2   dplyr_0.7.4    sparklyr_0.7.0

loaded via a namespace (and not attached):
 [1] DBI_0.7            readr_1.1.1        withr_2.1.1
 [4] nycflights13_0.2.2 rprojroot_1.3-2    lattice_0.20-35
 [7] foreign_0.8-69     pkgconfig_2.0.1    config_0.2
[10] utf8_1.1.3         compiler_3.4.0     stringr_1.3.0
[13] parallel_3.4.0     xtable_1.8-2       Rcpp_0.12.15
[16] cli_1.0.0          shiny_1.0.5        plyr_1.8.4
[19] httr_1.3.1         tools_3.4.0        openssl_1.0
[22] nlme_3.1-131.1     broom_0.4.3        R6_2.2.2
[25] dbplyr_1.2.1       bindr_0.1          purrr_0.2.4
[28] assertthat_0.2.0   curl_3.1           digest_0.6.15
[31] mime_0.5           stringi_1.1.6      rstudioapi_0.7
[34] reshape2_1.4.3     hms_0.4.1          backports_1.1.2
[37] htmltools_0.3.6    grid_3.4.0         glue_1.2.0
[40] httpuv_1.3.5       rlang_0.2.0        psych_1.7.8
[43] magrittr_1.5       rappdirs_0.3.1     lazyeval_0.2.1
[46] yaml_2.1.16        crayon_1.3.4       tidyr_0.8.0
[49] pillar_1.1.0       base64enc_0.1-3    mnormt_1.5-5
[52] jsonlite_1.5       tibble_1.4.2       Lahman_6.0-0

最佳答案

关键区别在于，在非火花发生器中，我们没有在na.rm = TRUE中使用mean，因此，当我们采用NA时，那些在“距离”或“ arr_delay”中具有mean的元素将变为NA。在sparklyr中，NA值已被删除，因此不需要该参数

我们可以检查'distance'和'arr_delay'中的NA元素

nycflights13::flights %>%
       summarise_at(vars(distance, arr_delay), funs(sum(is.na(.))))
# A tibble: 1 x 2
#  distance arr_delay
#     <int>     <int>
#1        0      9430  #### number of NAs

因此，如果我们对此进行了纠正，则输出将相同

res <- nycflights13::flights %>%
    group_by(tailnum) %>%
    summarise(count = n(),
              dist = mean(distance, na.rm = TRUE),
              delay = mean(arr_delay, na.rm = TRUE)) %>%
     filter(count > 20, dist < 2000, !is.na(delay))  %>%
     arrange(tailnum)
res
# A tibble: 2,961 x 4
#   tailnum count  dist   delay
#   <chr>   <int> <dbl>   <dbl>
# 1 N0EGMQ    371   676   9.98
# 2 N10156    153   758  12.7
# 3 N102UW     48   536   2.94
# 4 N103US     46   535 - 6.93
# 5 N104UW     47   535   1.80
# 6 N10575    289   520  20.7
# 7 N105UW     45   525 - 0.267
# 8 N107US     41   529 - 5.73
# 9 N108UW     60   534 - 1.25
#10 N109UW     48   536 - 2.52
# ... with 2,951 more rows

使用sparklyr

library(sparklyr)
library(dplyr)
library(nycflights13)
sc <- spark_connect(master = "local")


flights_tbl <- copy_to(sc, nycflights13::flights, "flights")

delay <- flights_tbl %>%
  group_by(tailnum) %>%
  summarise(count = n(), dist = mean(distance), delay = mean(arr_delay)) %>%
  filter(count > 20, dist < 2000, !is.na(delay)) %>%
  arrange(tailnum) %>%
  collect
delay
# A tibble: 2,961 x 4
#   tailnum count  dist   delay
#   <chr>   <dbl> <dbl>   <dbl>
# 1 N0EGMQ  371     676   9.98
# 2 N10156  153     758  12.7
# 3 N102UW   48.0   536   2.94
# 4 N103US   46.0   535 - 6.93
# 5 N104UW   47.0   535   1.80
# 6 N10575  289     520  20.7
# 7 N105UW   45.0   525 - 0.267
# 8 N107US   41.0   529 - 5.73
# 9 N108UW   60.0   534 - 1.25
#10 N109UW   48.0   536 - 2.52
# ... with 2,951 more rows

关于r - 使用Sparklyr和Dplyr时获得不同的结果，我们在Stack Overflow上找到一个类似的问题：https://stackoverflow.com/questions/48921171/