Matching data to the nearest time value by id

Question

I have generated a series of hourly timestamps with:

intervals <- seq(as.POSIXct("2018-01-20 00:00:00", tz = 'America/Los_Angeles'), as.POSIXct("2018-01-20 03:00:00", tz = 'America/Los_Angeles'), by="hour")

> intervals
[1] "2018-01-20 00:00:00 PST" "2018-01-20 01:00:00 PST" "2018-01-20 02:00:00 PST"
[4] "2018-01-20 03:00:00 PST" 

Given a dataset with messy, unevenly spaced timestamps, how would one match the time values in that dataset to the closest hourly timestamp by id, and drop the other timestamps in between? For example:

> test
                         time      id     amount
312   2018-01-20 00:02:14 PST       1 54.9508346
8652  2018-01-20 00:54:41 PST       2 30.5557992
13809 2018-01-20 01:19:27 PST       3 90.5459248
586   2018-01-20 00:03:35 PST       1 79.7635973
9077  2018-01-20 00:56:37 PST       2 75.5356406
21546 2018-01-20 02:25:05 PST       3 36.6017705
7275  2018-01-20 00:47:45 PST       1 12.7618139
12768 2018-01-20 01:15:30 PST       2 72.4465838
1172  2018-01-20 00:08:01 PST       3 81.0468155
24106 2018-01-20 03:04:10 PST       1  0.8615881
14464 2018-01-20 01:25:04 PST       2 49.8718743
15344 2018-01-20 01:29:30 PST       3 85.0054113
14255 2018-01-20 01:23:22 PST       1 34.5093891
21565 2018-01-20 02:25:40 PST       2 69.0175725
15602 2018-01-20 01:31:32 PST       3 61.8602426

would produce:

> output
             interval id     amount
1 2018-01-20 01:00:00  1 12.7618139
2          2018-01-20  1 54.9508346
3 2018-01-20 03:00:00  1  0.8615881
4 2018-01-20 01:00:00  2 75.5356400
5 2018-01-20 02:00:00  2 69.0175700
6          2018-01-20  3 81.0468200
7 2018-01-20 01:00:00  3 90.5459200
8 2018-01-20 02:00:00  3 36.6017700

I understand that data.table has

setDT(reference)[data, refvalue, roll = "nearest", on = "datetime"]

with roll = "nearest", but how would one find the nearest match in intervals for every id in test and retain the amount attribute?

Any suggestions would be appreciated! Here is the sample data:

 dput(test)
structure(list(time = c("2018-01-20 00:02:14 PST", "2018-01-20 00:54:41 PST", 
"2018-01-20 01:19:27 PST", "2018-01-20 00:03:35 PST", "2018-01-20 00:56:37 PST", 
"2018-01-20 02:25:05 PST", "2018-01-20 00:47:45 PST", "2018-01-20 01:15:30 PST", 
"2018-01-20 00:08:01 PST", "2018-01-20 03:04:10 PST", "2018-01-20 01:25:04 PST", 
"2018-01-20 01:29:30 PST", "2018-01-20 01:23:22 PST", "2018-01-20 02:25:40 PST", 
"2018-01-20 01:31:32 PST"), id = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 
1, 2, 3, 1, 2, 3), amount = c(54.9508346011862, 30.5557992309332, 
90.5459248460829, 79.763597343117, 75.5356406327337, 36.6017704829574, 
12.7618139144033, 72.4465838400647, 81.0468154959381, 0.861588073894382, 
49.8718742514029, 85.0054113194346, 34.5093891490251, 69.0175724914297, 
61.8602426256984)), .Names = c("time", "id", "amount"), row.names = c(312L, 
8652L, 13809L, 586L, 9077L, 21546L, 7275L, 12768L, 1172L, 24106L, 
14464L, 15344L, 14255L, 21565L, 15602L), class = "data.frame")


Recommended answer

Another option is to do the join inside j with data.table:

# convert 'test' to a 'data.table' with 'setDT' and parse the 'time' column
# as a datetime in the same time zone as 'intervals'
setDT(test)[, time := as.POSIXct(time, tz = 'America/Los_Angeles')][]

# perform the join
test[, .SD[.(time = intervals), on = .(time), roll = 'nearest'], by = id]

which gives:

    id                time     amount
 1:  1 2018-01-20 00:00:00 54.9508346
 2:  1 2018-01-20 01:00:00 12.7618139
 3:  1 2018-01-20 02:00:00 34.5093891
 4:  1 2018-01-20 03:00:00  0.8615881
 5:  2 2018-01-20 00:00:00 30.5557992
 6:  2 2018-01-20 01:00:00 75.5356406
 7:  2 2018-01-20 02:00:00 69.0175725
 8:  2 2018-01-20 03:00:00 69.0175725
 9:  3 2018-01-20 00:00:00 81.0468155
10:  3 2018-01-20 01:00:00 90.5459248
11:  3 2018-01-20 02:00:00 36.6017705
12:  3 2018-01-20 03:00:00 36.6017705
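
To see why some amounts show up more than once in the output above, it may help to look at roll = "nearest" on a tiny example. The sketch below uses hypothetical toy data (the names readings and grid and the values are not from the question). Every row of the lookup table gets exactly one match, so a reading is reused whenever it is the nearest one for several grid points.

```r
library(data.table)

# Hypothetical toy data (not from the question): three irregular readings
readings <- data.table(
  time  = as.POSIXct(c("2018-01-20 00:10:00",
                       "2018-01-20 00:50:00",
                       "2018-01-20 02:20:00"), tz = "UTC"),
  value = c(10, 20, 30)
)

# Hourly grid to match against: 00:00, 01:00, 02:00, 03:00
grid <- data.table(
  time = seq(as.POSIXct("2018-01-20 00:00:00", tz = "UTC"),
             by = "hour", length.out = 4)
)

# x[i] join: for each row of 'grid' (i), take the row of 'readings' (x)
# whose 'time' is nearest; the result has one row per grid timestamp
res <- readings[grid, on = .(time), roll = "nearest"]
print(res)
```

The reading at 02:20 is the nearest one for both 02:00 and 03:00, so its value appears twice. That is the same effect seen for ids 2 and 3 at 03:00 in the output above.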


In the above approach, some amount values are assigned to more than one time within an id. If you don't want that and only want to keep each value at the time it is closest to, you could refine the approach as follows:

test[, r := rowid(id)
     ][, .SD[.(time = intervals)
             , on = .(time)
             , roll = 'nearest'
             , .(time, amount, r, time_diff = abs(x.time - i.time))
             ][, .SD[which.min(time_diff)], by = r]
       , by = id][, c('r','time_diff') := NULL][]

which gives:

    id                time     amount
 1:  1 2018-01-20 00:00:00 54.9508346
 2:  1 2018-01-20 01:00:00 12.7618139
 3:  1 2018-01-20 02:00:00 34.5093891
 4:  1 2018-01-20 03:00:00  0.8615881
 5:  2 2018-01-20 00:00:00 30.5557992
 6:  2 2018-01-20 01:00:00 75.5356406
 7:  2 2018-01-20 02:00:00 69.0175725
 8:  3 2018-01-20 00:00:00 81.0468155
 9:  3 2018-01-20 01:00:00 90.5459248
10:  3 2018-01-20 02:00:00 36.6017705
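
As a possible alternative (this is not the answer's method), the join can be run in the other direction: for every observation look up its nearest hourly timestamp, then keep the closest observation within each id and interval. The sketch below uses only the id 1 and id 2 rows of the question's sample data, with amounts rounded to two decimals. The names obs, grid, matched and gap are introduced here for illustration.

```r
library(data.table)

tz <- "America/Los_Angeles"

# Subset of the question's sample data (ids 1 and 2, amounts rounded)
obs <- data.table(
  time = as.POSIXct(c("2018-01-20 00:02:14", "2018-01-20 00:54:41",
                      "2018-01-20 00:03:35", "2018-01-20 00:56:37",
                      "2018-01-20 00:47:45", "2018-01-20 01:15:30",
                      "2018-01-20 03:04:10", "2018-01-20 01:25:04",
                      "2018-01-20 01:23:22", "2018-01-20 02:25:40"), tz = tz),
  id     = rep(c(1, 2), times = 5),
  amount = c(54.95, 30.56, 79.76, 75.54, 12.76, 72.45, 0.86, 49.87, 34.51, 69.02)
)

# Hourly grid; 'interval' copies the join column so it is kept after the join
grid <- data.table(
  time = seq(as.POSIXct("2018-01-20 00:00:00", tz = tz),
             as.POSIXct("2018-01-20 03:00:00", tz = tz), by = "hour")
)
grid[, interval := time]

# For every observation (i), find the nearest hourly timestamp in 'grid' (x)
matched <- grid[obs, on = .(time), roll = "nearest"]

# Absolute distance in seconds between each observation and its interval
matched[, gap := abs(as.numeric(difftime(time, interval, units = "secs")))]

# Keep only the closest observation per (id, interval)
res <- matched[, .SD[which.min(gap)], by = .(id, interval)
               ][order(id, interval), .(interval, id, amount)]
print(res)
```

On this subset the result should have five rows: id 1 at 00:00, 01:00 and 03:00, and id 2 at 01:00 and 02:00. Unlike the refinement above, an hourly timestamp with no observation nearest to it (such as 02:00 for id 1) is simply left out.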

