Problem description
I currently face the following issue.
I want to come up with R code that creates a new column called, e.g., reviews_last30days in my main dataframe listings, which should count all reviews for each unique listings$ID.
The unique reviews per ID are listed in another dataframe like this:
REVIEWS
ID review_date
1 2015-12-30
1 2015-12-31
1 2016-10-27
2 2014-05-10
2 2016-10-19
2 2016-10-22
2 2016-10-23
I also need to add a date condition, e.g. such that only the last 30 days before last_scrape are considered.
Hence, my result should look somewhat like the third column below (UPDATE: see the edit for a better description of the expected result):
LISTINGS
ID last_scrape reviews_last30days
1 2016-11-15 1
2 2016-11-15 3
So finally, for each ID, the column reviews_last30days should count the review_date entries that fall within the 30 days before last_scrape.
I have already converted both date columns with as.Date using the format "%Y-%m-%d".
Sorry if my problem is not formulated clearly enough, it's quite hard to explain or visualize, but in terms of code it hopefully shouldn't be that complicated after all.
Besides the input REVIEWS shown above, I have a second input dataframe, call it OVERVIEW, that currently looks somewhat like this in simplified form:
OVERVIEW
ID last_scrape
1 2016-11-15
2 2016-11-15
3 2016-11-15
4 2017-01-15
5 2017-01-15
6 2017-01-15
7 2017-01-15
etc
So what I actually need is code that counts all review_date entries for which the ID in OVERVIEW matches the ID in REVIEWS and the review_date from REVIEWS is at most 30 days before the last_scrape in OVERVIEW.
The code should then ideally assign this newly calculated value as a new column in OVERVIEW like this:
OVERVIEW
ID last_scrape rev_last30days
1 2016-11-15 1
2 2016-11-15 3
3 2016-11-15 ..
4 2017-01-15 ..
5 2017-01-15 ..
6 2017-01-15 ..
7 2017-01-15 ..
etc
#2 EDIT - hopefully my last ;)
Thanks for your help so far @mfidino! Running your latest code still results in one minor error, namely the following:
TOTALREV$review_date <- ymd(TOTALREV$review_date)
TOTALLISTINGS$last_scraped.calc <- ymd(TOTALLISTINGS$last_scraped.calc)
gen_listings <- function(review = NULL, overview = NULL){
  # tibble to return
  to_return <- review %>%
    inner_join(., overview, by = 'listing_id') %>%
    group_by(listing_id) %>%
    summarise(last_scraped.calc = unique(last_scraped.calc),
              reviews_last30days = sum(review_date >= (last_scraped.calc-30)))
  return(to_return)
}
REVIEWCOUNT <- gen_listings(TOTALREV, TOTALLISTINGS)
Error: Column `last_scraped.calc` must be length 1 (a summary value), not 2
Do you have any idea how to fix this error?
NOTE: I used the column names from my original file; the code should otherwise be the same.
If it helps, here are some properties of the vector last_scraped.calc:
$ last_scraped.calc : Date, format: "2018-08-07" "2018-08-07" ...
typeof(TOTALLISTINGS$last_scraped.calc)
[1] "double"
length(TOTALLISTINGS$last_scraped.calc)
[1] 549281
and
unique(TOTALLISTINGS$last_scraped.calc)
 [1] "2018-08-07" "2019-01-13" "2018-08-15" "2019-01-16" "2018-08-14" "2019-01-15" "2019-01-14" "2019-01-22"
 [9] "2018-08-22" "2018-08-21" "2019-01-28" "2018-08-20" "2019-01-23" "2019-01-31" "2018-08-09" "2018-08-10"
[17] "2018-08-08" "2018-08-16"
Any further help is much appreciated - thanks in advance!
Recommended answer
You can do this pretty easily with dplyr. I am also using lubridate::ymd() here instead of as.Date().
library(lubridate)
library(dplyr)
REVIEWS <- data.frame(ID = c(1,1,1,2,2,2,2),
review_date = c("2015-12-30",
"2015-12-31",
"2016-10-27",
"2014-05-10",
"2016-10-19",
"2016-10-22",
"2016-10-23"))
REVIEWS$review_date <- ymd(REVIEWS$review_date)
LISTINGS <- REVIEWS %>%
  group_by(ID) %>%
  summarise(last_scrape = max(review_date),
            reviews_last30days = sum(review_date >= (max(review_date)-30)))
The output of LISTINGS is your expected output:
# A tibble: 2 x 3
ID last_scrape reviews_last30days
<dbl> <date> <int>
1 1 2016-10-27 1
2 2 2016-10-23 3
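The count above relies on two base R behaviours: subtracting a number from a Date shifts it by that many days, and sum() over a logical vector counts its TRUE values. A minimal sketch of just that step, using the four review dates of ID 2 from the example (the names cutoff, in_window and n_recent are only illustrative):

```r
# Review dates for ID 2, taken from the REVIEWS example above
review_date <- as.Date(c("2014-05-10", "2016-10-19", "2016-10-22", "2016-10-23"))

# Subtracting 30 from a Date moves it back 30 days
cutoff <- max(review_date) - 30

# Element-wise comparison gives one TRUE/FALSE per review
in_window <- review_date >= cutoff

# sum() treats TRUE as 1 and FALSE as 0, so this is the number of recent reviews
n_recent <- sum(in_window)
print(n_recent)
```

Here only the 2014 review falls outside the window, so the count for ID 2 is 3, matching the table above.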
EDIT:
If, instead, you are interested in letting last_scrape be an input rather than the latest review date per group, you can modify the code as follows. This assumes that there can be multiple last_scrape values per ID:
library(lubridate)
library(dplyr)
REVIEWS <- data.frame(ID = c(1,1,1,2,2,2,2),
review_date = c("2015-12-30",
"2015-12-31",
"2016-10-27",
"2014-05-10",
"2016-10-19",
"2016-10-22",
"2016-10-23"))
REVIEWS$review_date <- ymd(REVIEWS$review_date)
OVERVIEW <- data.frame(ID = rep(1:7, 2),
last_scrape = c("2016-11-15",
"2016-11-15",
"2016-11-15",
"2017-01-15",
"2017-01-15",
"2017-01-15",
"2017-01-15",
"2016-11-20",
"2016-11-20",
"2016-11-20",
"2017-01-20",
"2017-01-20",
"2017-01-20",
"2017-01-20"))
OVERVIEW$last_scrape <- ymd(OVERVIEW$last_scrape)
gen_listings <- function(review = NULL, overview = NULL){
  # tibble to return
  to_return <- review %>%
    inner_join(., overview, by = 'ID') %>%
    group_by(ID, last_scrape) %>%
    summarise(
      reviews_last30days = sum(review_date >= (last_scrape-30)))
  return(to_return)
}
LISTINGS <- gen_listings(REVIEWS, OVERVIEW)
This LISTINGS object is:
ID last_scrape reviews_last30days
<dbl> <date> <int>
1 1 2016-11-15 1
2 1 2016-11-20 1
3 2 2016-11-15 3
4 2 2016-11-20 2
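Note that IDs 3 to 7 do not appear in this result: inner_join() keeps only IDs that have at least one row in REVIEWS. If every row of OVERVIEW should be kept, with a count of 0 where there are no reviews, one possible variation is to start from OVERVIEW and use left_join(). The sketch below is an assumption about what is wanted, not part of the original answer: the function name count_recent_reviews and the window_days argument are hypothetical, the upper bound review_date <= last_scrape is an added guard against reviews dated after the scrape, and .groups = "drop" needs dplyr 1.0.0 or later.

```r
library(dplyr)

REVIEWS <- data.frame(
  ID = c(1, 1, 1, 2, 2, 2, 2),
  review_date = as.Date(c("2015-12-30", "2015-12-31", "2016-10-27",
                          "2014-05-10", "2016-10-19", "2016-10-22",
                          "2016-10-23"))
)

# ID 3 has no reviews at all
OVERVIEW <- data.frame(
  ID = c(1, 2, 3),
  last_scrape = as.Date(c("2016-11-15", "2016-11-15", "2016-11-15"))
)

# Hypothetical helper: count reviews in the window_days before each last_scrape
count_recent_reviews <- function(review, overview, window_days = 30) {
  overview %>%
    # left_join keeps every overview row; unmatched IDs get review_date = NA
    left_join(review, by = "ID") %>%
    group_by(ID, last_scrape) %>%
    summarise(
      # na.rm = TRUE turns the NA comparison for unmatched IDs into a count of 0
      reviews_last30days = sum(review_date >= last_scrape - window_days &
                                 review_date <= last_scrape, na.rm = TRUE),
      .groups = "drop"
    )
}

RESULT <- count_recent_reviews(REVIEWS, OVERVIEW)
print(RESULT)
```

With this toy input, IDs 1 and 2 keep their counts of 1 and 3, and ID 3 is reported with 0 instead of being dropped.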