Problem description
I have generated a series of hourly timestamps with:
intervals <- seq(as.POSIXct("2018-01-20 00:00:00", tz = 'America/Los_Angeles'),
                 as.POSIXct("2018-01-20 03:00:00", tz = 'America/Los_Angeles'),
                 by = "hour")
> intervals
[1] "2018-01-20 00:00:00 PST" "2018-01-20 01:00:00 PST" "2018-01-20 02:00:00 PST"
[4] "2018-01-20 03:00:00 PST"
Given a dataset with messy and unevenly spaced timestamps, how would one match the time values from that dataset to the closest hourly timestamp by id, and remove the other timestamps in between? For example:
> test
                         time id     amount
312   2018-01-20 00:02:14 PST  1 54.9508346
8652  2018-01-20 00:54:41 PST  2 30.5557992
13809 2018-01-20 01:19:27 PST  3 90.5459248
586   2018-01-20 00:03:35 PST  1 79.7635973
9077  2018-01-20 00:56:37 PST  2 75.5356406
21546 2018-01-20 02:25:05 PST  3 36.6017705
7275  2018-01-20 00:47:45 PST  1 12.7618139
12768 2018-01-20 01:15:30 PST  2 72.4465838
1172  2018-01-20 00:08:01 PST  3 81.0468155
24106 2018-01-20 03:04:10 PST  1  0.8615881
14464 2018-01-20 01:25:04 PST  2 49.8718743
15344 2018-01-20 01:29:30 PST  3 85.0054113
14255 2018-01-20 01:23:22 PST  1 34.5093891
21565 2018-01-20 02:25:40 PST  2 69.0175725
15602 2018-01-20 01:31:32 PST  3 61.8602426
would produce:
> output
             interval id     amount
1 2018-01-20 01:00:00  1 12.7618139
2          2018-01-20  1 54.9508346
3 2018-01-20 03:00:00  1  0.8615881
4 2018-01-20 01:00:00  2 75.5356400
5 2018-01-20 02:00:00  2 69.0175700
6          2018-01-20  3 81.0468200
7 2018-01-20 01:00:00  3 90.5459200
8 2018-01-20 02:00:00  3 36.6017700
I am aware that data.table has
setDT(reference)[data, refvalue, roll = "nearest", on = "datetime"]
with roll = "nearest", but how would one find the nearest match in intervals for every id in test and retain the amount attribute?
Any suggestions would be appreciated! Here is the sample data:
dput(test)
structure(list(time = c("2018-01-20 00:02:14 PST", "2018-01-20 00:54:41 PST",
"2018-01-20 01:19:27 PST", "2018-01-20 00:03:35 PST", "2018-01-20 00:56:37 PST",
"2018-01-20 02:25:05 PST", "2018-01-20 00:47:45 PST", "2018-01-20 01:15:30 PST",
"2018-01-20 00:08:01 PST", "2018-01-20 03:04:10 PST", "2018-01-20 01:25:04 PST",
"2018-01-20 01:29:30 PST", "2018-01-20 01:23:22 PST", "2018-01-20 02:25:40 PST",
"2018-01-20 01:31:32 PST"), id = c(1, 2, 3, 1, 2, 3, 1, 2, 3,
1, 2, 3, 1, 2, 3), amount = c(54.9508346011862, 30.5557992309332,
90.5459248460829, 79.763597343117, 75.5356406327337, 36.6017704829574,
12.7618139144033, 72.4465838400647, 81.0468154959381, 0.861588073894382,
49.8718742514029, 85.0054113194346, 34.5093891490251, 69.0175724914297,
61.8602426256984)), .Names = c("time", "id", "amount"), row.names = c(312L,
8652L, 13809L, 586L, 9077L, 21546L, 7275L, 12768L, 1172L, 24106L,
14464L, 15344L, 14255L, 21565L, 15602L), class = "data.frame")
Recommended answer
Another option is a join inside j with data.table:
library(data.table)

# convert 'test' to a 'data.table' first with 'setDT'
# and convert the 'time' column to a datetime format
# (tz is set explicitly so that it matches the time zone of 'intervals')
setDT(test)[, time := as.POSIXct(time, tz = 'America/Los_Angeles')][]

# perform the join
test[, .SD[.(time = intervals), on = .(time), roll = 'nearest'], by = id]
which gives:
    id                time     amount
 1:  1 2018-01-20 00:00:00 54.9508346
 2:  1 2018-01-20 01:00:00 12.7618139
 3:  1 2018-01-20 02:00:00 34.5093891
 4:  1 2018-01-20 03:00:00  0.8615881
 5:  2 2018-01-20 00:00:00 30.5557992
 6:  2 2018-01-20 01:00:00 75.5356406
 7:  2 2018-01-20 02:00:00 69.0175725
 8:  2 2018-01-20 03:00:00 69.0175725
 9:  3 2018-01-20 00:00:00 81.0468155
10:  3 2018-01-20 01:00:00 90.5459248
11:  3 2018-01-20 02:00:00 36.6017705
12:  3 2018-01-20 03:00:00 36.6017705
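To illustrate why a rolling join returns one row for every value it is asked to look up, here is a minimal sketch on made-up numbers. The tables obs and grid and their values are hypothetical and only serve the illustration; plain numbers stand in for the timestamps.

```r
library(data.table)

# Hypothetical toy data: two observations and three grid points
obs  <- data.table(time = c(10, 55), amount = c(1.5, 2.5))
grid <- data.table(time = c(0, 60, 120))

# For every row of 'grid', take the row of 'obs' whose 'time' is nearest.
# The 'time' column of the result holds the grid values.
res <- obs[grid, on = .(time), roll = "nearest"]
print(res)
```

Every grid point receives a match, so the observation at 55 is used for both 60 and 120. The same mechanism is what repeats an amount across several hours within one id.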
In the join above, some amount values are assigned to more than one time per id. If you don't want that and only want to keep the rows that are closest to a time, you could refine the approach as follows:
test[, r := rowid(id)
     ][, .SD[.(time = intervals)
             , on = .(time)
             , roll = 'nearest'
             , .(time, amount, r, time_diff = abs(x.time - i.time))
             ][, .SD[which.min(time_diff)], by = r]
       , by = id][, c('r','time_diff') := NULL][]
which gives:
    id                time     amount
 1:  1 2018-01-20 00:00:00 54.9508346
 2:  1 2018-01-20 01:00:00 12.7618139
 3:  1 2018-01-20 02:00:00 34.5093891
 4:  1 2018-01-20 03:00:00  0.8615881
 5:  2 2018-01-20 00:00:00 30.5557992
 6:  2 2018-01-20 01:00:00 75.5356406
 7:  2 2018-01-20 02:00:00 69.0175725
 8:  3 2018-01-20 00:00:00 81.0468155
 9:  3 2018-01-20 01:00:00 90.5459248
10:  3 2018-01-20 02:00:00 36.6017705
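A different way to read the question is to start from the observations rather than from the hourly grid: snap each timestamp to its nearest full hour, then keep only the closest observation per id and hour. The following is a rough sketch of that idea, not the method used in the answer above. It assumes that all timestamps lie within the range of intervals, so rounding to the hour is equivalent to picking the nearest element of intervals. The columns interval and time_diff are names introduced here for illustration, and the sample data is re-entered with the "PST" suffix dropped and the amounts shortened to the printed precision.

```r
library(data.table)

tz <- "America/Los_Angeles"

# Sample data from the question, with the time zone set explicitly
test <- data.table(
  time = as.POSIXct(c(
    "2018-01-20 00:02:14", "2018-01-20 00:54:41", "2018-01-20 01:19:27",
    "2018-01-20 00:03:35", "2018-01-20 00:56:37", "2018-01-20 02:25:05",
    "2018-01-20 00:47:45", "2018-01-20 01:15:30", "2018-01-20 00:08:01",
    "2018-01-20 03:04:10", "2018-01-20 01:25:04", "2018-01-20 01:29:30",
    "2018-01-20 01:23:22", "2018-01-20 02:25:40", "2018-01-20 01:31:32"
  ), tz = tz),
  id = rep(1:3, times = 5),
  amount = c(54.9508346, 30.5557992, 90.5459248, 79.7635973, 75.5356406,
             36.6017705, 12.7618139, 72.4465838, 81.0468155, 0.8615881,
             49.8718743, 85.0054113, 34.5093891, 69.0175725, 61.8602426)
)

# Snap every observation to its nearest full hour. round() on a POSIXct
# returns POSIXlt, so convert back before storing it in a column.
test[, interval := as.POSIXct(round(time, "hours"))]

# Distance in seconds between each observation and its hour
test[, time_diff := abs(as.numeric(difftime(time, interval, units = "secs")))]

# Per id and hour, keep only the closest observation
output <- test[, .SD[which.min(time_diff)], by = .(id, interval)][
  , .(interval, id, amount)]
setorder(output, id, interval)
print(output)
```

On the sample data this keeps eight rows, with hours that have no nearby observation for an id simply absent, which appears to correspond to the output requested in the question. Whether that or the grid-driven join is preferable depends on whether every hour must appear in the result.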