dplyr: integer sampling within mutate

Problem description

I am trying to generate a column in a tbl_df that holds a random integer, either 0 or 1. This is the code I am using:

    library(dplyr)
    set.seed(0)

    # Dummy data.frame to test
    df <- tbl_df(data.frame(x = rep(1:3, each = 4)))

    # Generate the random integer column
    df_test = df %>% mutate(pop = sample(0:1, 1, replace = TRUE))

But this does not work the way I expected: the generated column is all zeros. Is this because the statement within mutate is evaluated in parallel and hence ends up using the same seed for the first random draw?

    df_test
    Source: local data frame [12 x 2]

        x pop
    1   1   0
    2   1   0
    3   1   0
    4   1   0
    5   2   0
    6   2   0
    7   2   0
    8   2   0
    9   3   0
    10  3   0
    11  3   0
    12  3   0

I have been racking my brain over this for the past few hours. Any idea what the flaw in my script is?

Solution

The way your code is written, sample(0:1, 1, replace = TRUE) is evaluated once and returns a single value, and that one value is then assigned to the entire column (this is called "vector recycling"). Parallel evaluation and the seed are not involved.

The best solution in this case is Steven Beaupré's answer, which creates a random vector as long as your data.frame:

    df %>% mutate(pop = sample(0:1, n(), replace = TRUE))

Generally, if you want to apply a function row by row in dplyr, as you thought would happen here, you can use rowwise(), though it is not required in this example.

Here is an example of rowwise(). Without it, max(a, b) returns the single largest value across both columns (6), which is then recycled to every row:

    df2 <- data.frame(a = c(1, 3, 6), b = c(2, 4, 5))

    df2 %>% mutate(m = max(a, b))
      a b m
    1 1 2 6
    2 3 4 6
    3 6 5 6

    df2 %>% rowwise() %>% mutate(m = max(a, b))
      a b m
    1 1 2 2
    2 3 4 4
    3 6 5 6

Since rowwise() groups the data by each row, operations are potentially slower than without any grouping. It is therefore usually better to use vectorized functions whenever possible instead of operating row by row.

Benchmarking:

The approach with rowwise() is roughly 30 to 40 times slower (about 34x at the median):

    library(microbenchmark)
    library(ggplot2)  # provides the autoplot() generic used below

    # tbl_df() is deprecated in dplyr >= 1.0.0; as_tibble() is the current equivalent
    df <- tbl_df(data.frame(x = rep(1:1000, each = 4)))

    bench <- microbenchmark(
      vectorized = df2 <- df %>% mutate(pop = sample(0:1, n(), replace = TRUE)),
      rowwise = df2 <- df %>% rowwise() %>% mutate(pop = sample(0:1, 1, replace = TRUE)),
      times = 1000
    )

    options(microbenchmark.unit = "relative")
    print(bench)
    autoplot(bench)

    Unit: relative
           expr      min       lq     mean   median       uq     max neval
     vectorized  1.00000  1.00000  1.00000  1.00000  1.00000  1.0000  1000
        rowwise 42.53169 42.29486 36.94876 33.70456 34.92621 71.7682  1000
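To make the recycling behaviour concrete, here is a minimal sketch that can be run as-is. It is an illustration only: it builds the data with tibble() rather than tbl_df(), the variable names (recycled, per_row, rowwise_max, vectorized_max) are made up for this example, and the use of base R's pmax() as a vectorized replacement for the rowwise() max is a suggestion that does not appear in the original answer.

```r
library(dplyr)

set.seed(0)
df <- tibble(x = rep(1:3, each = 4))

# One draw, recycled to all 12 rows: every value in `pop` is the same.
recycled <- df %>% mutate(pop = sample(0:1, 1, replace = TRUE))

# n() draws, one per row of the data frame.
per_row <- df %>% mutate(pop = sample(0:1, n(), replace = TRUE))

# How many distinct values does the recycled column hold?
n_distinct(recycled$pop)  # 1

# Row-wise maximum, done two ways.
df2 <- data.frame(a = c(1, 3, 6), b = c(2, 4, 5))

# rowwise() evaluates max() once per row.
rowwise_max <- df2 %>%
  rowwise() %>%
  mutate(m = max(a, b)) %>%
  ungroup()

# pmax() is the element-wise (parallel) maximum and needs no grouping.
vectorized_max <- df2 %>% mutate(m = pmax(a, b))

vectorized_max$m  # 2 4 6
```

Which value the recycled column holds depends on the R version and the seed, so the sketch only checks that there is a single distinct value. When a vectorized counterpart such as pmax() exists, it avoids the per-row grouping overhead measured in the benchmark above.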
09-05 16:26