Problem description
In order to establish seasonal effects on energy use, I need to align the energy use information that I have from a billing database with monthly temperatures.
I'm working with a billing dataset that has bills of varying lengths and start and end dates, and I'd like to obtain the monthly average for each account within each month. For example, I have a billing database that has the following characteristics:
acct amount begin end days
1 2242 11349 2009-10-06 2009-11-04 29
2 2242 12252 2009-11-04 2009-12-04 30
3 2242 21774 2009-12-04 2010-01-08 35
4 2242 18293 2010-01-08 2010-02-05 28
5 2243 27217 2009-10-06 2009-11-04 29
6 2243 117 2009-11-04 2009-12-04 30
7 2243 14543 2009-12-04 2010-01-08 35
I would like to figure out how to coerce these somewhat irregular time series (for each account) to get the average amount per day within each month that is spanned within each bill, such that:
acct amount begin end days avgamtpday
1 2242 11349 2009-10-01 2009-10-31 31 X
2 2242 12252 2009-11-01 2009-11-30 30 X
3 2242 21774 2009-12-01 2009-12-31 31 X
4 2242 18293 2010-01-01 2010-01-31 31 X
4 2242 18293 2010-02-01 2010-02-28 28 X
5 2243 27217 2009-10-01 2009-10-31 31 X
6 2243 117 2009-11-01 2009-11-30 30 X
7 2243 14543 2009-12-01 2009-12-31 31 X
7 2243 14543 2010-01-01 2010-01-31 31 X
I'm fairly agnostic to whichever tool can do this, since I only have to do this once.
An additional wrinkle is the table is about 150,000 rows long, which is not really very big by most standards, but big enough to make a loop solution in R difficult. I've investigated using the zoo, xts, and tempdisagg packages in R. I started writing a really ugly loop that would split each bill, then create one row for each month within an existing bill, and then tapply() to summarize by accts and months, but honestly, couldn't see how to do it efficiently.
In MySQL, I've tried this:
But for reasons I don't understand, my server doesn't like this table, and gets hung up on the inner join, even when I stage the different calculations. I'm investigating if there are any temporary memory limits on it.
Thanks!
Here's a start using data.table:
billdata <- read.table(text=" acct amount begin end days
1 2242 11349 2009-10-06 2009-11-04 29
2 2242 12252 2009-11-04 2009-12-04 30
3 2242 21774 2009-12-04 2010-01-08 35
4 2242 18293 2010-01-08 2010-02-05 28
5 2243 27217 2009-10-06 2009-11-04 29
6 2243 117 2009-11-04 2009-12-04 30
7 2243 14543 2009-12-04 2010-01-08 35", sep=" ", header=TRUE, row.names=1)
require(data.table)
DT = as.data.table(billdata)
First, convert the begin and end columns to dates. Unlike with a data.frame, := updates the column by reference, so this doesn't copy the entire dataset.
DT[,begin:=as.Date(begin)]
DT[,end:=as.Date(end)]
Then find the time span, find the prevailing bill for each day, and aggregate.
alldays = DT[,seq(min(begin),max(end),by="day")]
setkey(DT, acct, begin)
DT[CJ(unique(acct),alldays),
mean(amount/days,na.rm=TRUE),
by=list(acct,month=format(begin,"%Y-%m")), roll=TRUE]
acct month V1
1: 2242 2009-10 391.34483
2: 2242 2009-11 406.69448
3: 2242 2009-12 601.43226
4: 2242 2010-01 646.27465
5: 2242 2010-02 653.32143
6: 2243 2009-10 938.51724
7: 2243 2009-11 97.36172
8: 2243 2009-12 375.68065
9: 2243 2010-01 415.51429
10: 2243 2010-02 415.51429
I think you'll find the prevailing join logic quite cumbersome in SQL, and slower.
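To make the "prevailing bill" idea concrete, the following base R sketch reproduces row 2 of the output by hand. It is only an illustration, not part of the original answer. It assumes, as roll=TRUE implies, that each day takes the daily rate of the most recent bill whose begin date is on or before that day, so 2009-11-04 already belongs to the second bill.

```r
# Daily rates of the first two bills for account 2242 (amount / days).
rate1 <- 11349 / 29   # bill 1: begins 2009-10-06
rate2 <- 12252 / 30   # bill 2: begins 2009-11-04

# November 2009: Nov 1-3 still fall under bill 1, Nov 4-30 under bill 2.
nov <- c(rep(rate1, 3), rep(rate2, 27))
nov_mean <- mean(nov)

print(round(nov_mean, 5))  # 406.69448, matching row 2 above
```

The same weighting by number of days explains every other row: a month's value is the mean of the daily rates of whichever bills cover its days.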
I say it's a start because it's not quite correct. Notice that row 10 repeats row 9: account 2243, unlike account 2242, doesn't stretch into 2010-02, but alldays spans all accounts, so its last bill is rolled forward. To finish it off you could rbind in the last row for each account and use rolltolast instead of roll. Or perhaps create alldays by account rather than across all accounts.
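A small base R sketch of where the spurious row comes from, using only dates from the example data (the variable names are made up for illustration): the global day sequence runs to the latest end date of any account, which is later than account 2243's own last bill.

```r
# Latest end date over all accounts vs. the last end date of account 2243.
global_end <- as.Date("2010-02-05")  # max(end), from account 2242
end_2243   <- as.Date("2010-01-08")  # last(end) for account 2243

# Days that alldays contains but account 2243 has no bill for.
extra_days   <- seq(end_2243 + 1, global_end, by = "day")
extra_months <- unique(format(extra_days, "%Y-%m"))

print(length(extra_days))  # 28
print(extra_months)        # "2010-01" "2010-02"
```

With roll=TRUE, those 28 days all inherit the rate of account 2243's last bill. The January mean is unaffected because every January day already has that rate, but February gains a row it should not have.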
See if speed is acceptable on the above, and we can go from there.
It's likely you will hit a bug in 1.8.2 that has been fixed in 1.8.3. I'm using v1.8.3.
Let me know and we can either work around, or upgrade to 1.8.3 from R-Forge.
Btw, nice example data. That made it quicker to answer.
Here's the full answer alluded to above. It's a bit tricky I have to admit, as it combines together several features of data.table. This should work in 1.8.2 as it happens, but I've only tested in 1.8.3.
DT[ setkey(DT[,seq(begin[1],last(end),by="day"),by=acct]),
mean(amount/days,na.rm=TRUE),
by=list(acct,month=format(begin,"%Y-%m")), roll=TRUE]
acct month V1
1: 2242 2009-10 391.34483
2: 2242 2009-11 406.69448
3: 2242 2009-12 601.43226
4: 2242 2010-01 646.27465
5: 2242 2010-02 653.32143
6: 2243 2009-10 938.51724
7: 2243 2009-11 97.36172
8: 2243 2009-12 375.68065
9: 2243 2010-01 415.51429
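As a cross-check that does not depend on any particular data.table version, the same logic can be sketched in base R. This is an assumption-laden illustration rather than the answer's method: monthly_rate is a hypothetical helper, and findInterval() stands in for the rolling join by picking, for each day, the last bill whose begin date is on or before that day.

```r
# Example bills from the question (base R only).
bills <- data.frame(
  acct   = c(2242, 2242, 2242, 2242, 2243, 2243, 2243),
  amount = c(11349, 12252, 21774, 18293, 27217, 117, 14543),
  begin  = as.Date(c("2009-10-06", "2009-11-04", "2009-12-04", "2010-01-08",
                     "2009-10-06", "2009-11-04", "2009-12-04")),
  end    = as.Date(c("2009-11-04", "2009-12-04", "2010-01-08", "2010-02-05",
                     "2009-11-04", "2009-12-04", "2010-01-08")),
  days   = c(29, 30, 35, 28, 29, 30, 35)
)

# Hypothetical helper: monthly mean of the prevailing daily rate, one account.
monthly_rate <- function(b) {
  b <- b[order(b$begin), ]
  # Every day from this account's first begin date to its last end date.
  day <- seq(b$begin[1], b$end[nrow(b)], by = "day")
  # Index of the last bill whose begin date is <= each day.
  idx <- findInterval(as.numeric(day), as.numeric(b$begin))
  rate <- (b$amount / b$days)[idx]
  m <- tapply(rate, format(day, "%Y-%m"), mean)
  data.frame(acct = b$acct[1], month = names(m), avgamtpday = as.numeric(m))
}

result <- do.call(rbind, lapply(split(bills, bills$acct), monthly_rate))
rownames(result) <- NULL
print(result)
```

On the example data this yields the same nine account/month rows as the join above. For 150,000 bills the per-account lapply() will likely be slower than the keyed join, so it is better suited to validating a sample than to the full run.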