问题描述
我正在处理包含数千行的贸易数据集.每条记录都有一个基于符号和日期的唯一键.给定符号的交易记录是不规则的,因此使用动物园将是自然的选择.我需要使用滞后和合并来创建一个新的数据集.但是,我不知道如何在动物园中设置多列索引以使用滞后功能.以下是示例数据集和预期输出.
I am working with trade dataset with thousands of rows. Every record has a unique key based on a symbol and date. Trade records for a given symbol are irregular, hence using zoo will be natural choice. I need to use lag and merge to create a new dataset. However, I don't know how to setup multi-column index in zoo in order to use lag function. Below is a sample dataset and intended output.
df = data.frame(
dt = as.Date(c("2015-01-01", "2015-01-05", "2015-01-06",
"2015-01-01", "2015-01-02")),
id = c("i1", "i1", "i1", "i2", "i2"),
v1 = c(110, 115, 119, 212, 213),
v2 = c(100, 170, 180, 202, 210),
v3 = c(11, 13, 16, 22, 24)
)
df$id = as.character(df$id)
输出应该是
2015-01-01, i1, 110, 100, 11, 2015-01-05, i1, 115, 170, 13
2015-01-05, i1, 115, 170, 13, 2015-01-06, i1, 119, 180, 16
2015-01-06, i1, 119, 180, 16, NA, NA, NA, NA, NA
2015-01-01, i2, 212, 202, 22, 2015-01-02, i2, 213, 210, 24
2015-01-02, i2, 213, 210, 24, NA, NA, NA, NA, NA
在 SO 中,有许多帖子完成了分组"滞后操作,但仅针对单个列.无论列数如何,我都在寻找合并完整行.
In SO, there are number of posts accomplishing "grouped" lag operations but for a single column only. I am looking for merging complete row, regardless of number of columns.
更新此问题...
以下是解决基于zoo的分组"滞后操作的一种可能方法.
Following is one possible way to solve the "grouped" lag operation based on zoo.
doProcessing = function(df){
icolnames = colnames(df)
tt = zoo(df, df$dt)
tt1 = merge(tt, lag(tt, 1))
colnames(tt1) = c(icolnames, paste0("lag_", icolnames))
data.frame(tt1, stringsAsFactors=F)
}
fin_df = do.call(rbind, with(df, by(df, list(id), doProcessing, simplify=F)))
这个最终输出帧将每个字段作为因子.如何根据输入数据帧获得正确的输出结构?
This final output frame has every field as factor. How do I get the output structure right as per input data frame?
基于@Grothendieck 的 lapply 思想,下面给出了上述问题的可能解决方案.
Based on @Grothendieck's idea of lapply, a possible solution to the above problem is given below.
doProcessing = function(df){
icolnames = colnames(df)
tt = zoo(df, df$dt)
tt1 = merge(tt, lag(tt, 1))
colnames(tt1) = c(icolnames, paste0("lag_", icolnames))
data.frame(tt1, stringsAsFactors=F)
}
fin_df = do.call(rbind, with(df, by(df, list(id), doProcessing, simplify=F)))
仍然需要一些帮助,了解结果数据框如何将每一列作为因子.如何恢复原始结构?
Still need some help, some how resultant data frame has every column as factors. How do I get original structure back?
原始数据框结构
> str(df)
'data.frame': 5 obs. of 5 variables:
$ dt: Date, format: "2015-01-05" "2015-01-01" ...
$ id: chr "i1" "i1" "i1" "i2" ...
$ v1: num 115 110 119 212 213
$ v2: num 170 100 180 202 210
$ v3: num 13 11 16 22 24
结果数据框看起来像
> str(fin_df)
'data.frame': 5 obs. of 10 variables:
$ dt : Factor w/ 4 levels "2015-01-01","2015-01-05",..: 1 2 3 1 4
$ id : Factor w/ 2 levels "i1","i2": 1 1 1 2 2
$ v1 : Factor w/ 5 levels "110","115","119",..: 1 2 3 4 5
$ v2 : Factor w/ 5 levels "100","170","180",..: 1 2 3 4 5
$ v3 : Factor w/ 5 levels "11","13","16",..: 1 2 3 4 5
$ lag_dt: Factor w/ 3 levels "2015-01-05","2015-01-06",..: 1 2 NA 3 NA
$ lag_id: Factor w/ 2 levels "i1","i2": 1 1 NA 2 NA
$ lag_v1: Factor w/ 3 levels "115","119","213": 1 2 NA 3 NA
$ lag_v2: Factor w/ 3 levels "170","180","210": 1 2 NA 3 NA
$ lag_v3: Factor w/ 3 levels "13","16","24": 1 2 NA 3 NA
推荐答案
zoo zoo 对象是时间序列,所以通常这样做的方式是使用宽格式:
zoo zoo objects are time series so normally the way this is done so that the result is a time series is to use wide form:
read.zoo(df, split = 2) # zoo object created by splitting on column 2
给予:
v1.i1 v2.i1 v3.i1 v1.i2 v2.i2 v3.i2
2015-01-01 110 100 11 212 202 22
2015-01-02 NA NA NA 213 210 24
2015-01-05 115 170 13 NA NA NA
2015-01-06 119 180 16 NA NA NA
list 或简单地将数据框拆分为动物园对象列表
list or simply split the data frame into a list of zoo objects
L <- lapply(split(df[-2], df$id), read.zoo)
给予:
> L
$i1
v1 v2 v3
2015-01-01 110 100 11
2015-01-05 115 170 13
2015-01-06 119 180 16
$i2
v1 v2 v3
2015-01-01 212 202 22
2015-01-02 213 210 24
melt 使用 reshape2 包,我们可以创建长表单:
melt Using the reshape2 package we could create long form:
m <- melt(df, id = 1:2)
给予:
> m
dt id variable value
1 2015-01-01 i1 v1 110
2 2015-01-05 i1 v1 115
3 2015-01-06 i1 v1 119
4 2015-01-01 i2 v1 212
5 2015-01-02 i2 v1 213
6 2015-01-01 i1 v2 100
7 2015-01-05 i1 v2 170
8 2015-01-06 i1 v2 180
9 2015-01-01 i2 v2 202
10 2015-01-02 i2 v2 210
11 2015-01-01 i1 v3 11
12 2015-01-05 i1 v3 13
13 2015-01-06 i1 v3 16
14 2015-01-01 i2 v3 22
15 2015-01-02 i2 v3 24
在这种形式下,很容易得到各种切片.例如,
In this form it's easy to get various slices. For example,
> subset(m, dt == "2015-01-01")
dt id variable value
1 2015-01-01 i1 v1 110
4 2015-01-01 i2 v1 212
6 2015-01-01 i1 v2 100
9 2015-01-01 i2 v2 202
11 2015-01-01 i1 v3 11
14 2015-01-01 i2 v3 22
3d 数组 另一种可能是将其表示为一个 3 维数组.m
来自上面的 melt
解决方案.我们可以置换第二个参数的组件以获得变化:
3d array Another possibility is to represent it as a three dimensional array. m
is from melt
solution above. We could permute the components of the second argument to get variations:
a <- acast(m, ... ~ id ~ dt)
giving:
, , 2015-01-01
i1 i2
v1 110 212
v2 100 202
v3 11 22
, , 2015-01-02
i1 i2
v1 NA 213
v2 NA 210
v3 NA 24
, , 2015-01-05
i1 i2
v1 115 NA
v2 170 NA
v3 13 NA
, , 2015-01-06
i1 i2
v1 119 NA
v2 180 NA
v3 16 NA
各种切片很容易得到:a[1,,]
, a[,1,]
, a[,,1]
.
Various slices are easy to get: a[1,,]
, a[,1,]
, a[,,1]
.
更新增加了解决方案并重新整理和改进了一些解决方案.
Updates Have added to solution and rearranged and improved some solutions.
这篇关于动物园时间序列中的多部分索引的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!