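Over on Cross Validated, I asked a question about analysing data by date without generating spurious spikes and troughs by binning the data by month. For example, if someone pays a bill on the last day of every month but on one occasion pays a few days late, then one month will reflect zero expense and the following month will reflect double the usual expense. All spurious junk.
One of the answers to my question explained the concept of interpolating the cumulative sum with linear spline smoothing to overcome the hiccups caused by binning. I am intrigued by this and want to implement it in R, but I can't find any examples online. I don't just want to print plots. I want to get the instantaneous slope at each time point (perhaps each day), but that slope should be derived from a spline that takes as input the points from a few days (or perhaps a few weeks or months) before the time point to a few days after it. In other words, at the end of the day I want something like a data frame in which one column is a per-day rate, such as money per day or patients per day, that is not subject to vagaries such as whether I paid a few days late or whether the month happened to have 5 operating days (rather than the usual 4).
Here is some simplified simulation and plotting to show what I am up against.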
library(lubridate)
library(ggplot2)
library(reshape2)
library(plyr)#needed for ddply() below
dates <- seq(as.Date("2010-02-01"), length=24, by="1 month") - 1
dates[5] <- dates[5]+3 #we are making one payment date that is 3 days late
dates#look how the payment date is the last day of every month except for
#2010-05 where it takes place on 2010-06-03 - naughty boy!
amounts <- rep(50,each=24)# pay $50 every month
register <- data.frame(dates,amounts)#this is the starting register or ledger
ggplot(data=register,aes(dates,amounts))+geom_point()#look carefully and you will see that 2010-05 has no dots in it and 2010-06 has two dots
register.by.month <- ddply(register,.(y=year(dates),month=month(dates)),summarise,month.tot=sum(amounts))#create a summary of totals by month, but it ends up omitting a month in which nothing happened. Further badness is that it creates a new data frame where one is not needed. Instead I created a new variable that allocates each date into a particular "zone" such as month or
register$cutmonth <- as.Date(cut(register$dates, breaks = "month"))#until recently I did not know that the cut function can handle dates
table(register$cutmonth)#see how there are two payments in the month of 2010-06
#now let's look at what we paid each month. What is the total for each month?
ggplot(register, aes(cutmonth, amounts))+ stat_summary(fun = sum, geom = "bar")#that is the truth but it is a useless truth (fun replaces the deprecated fun.y in ggplot2 >= 3.3.0)
#so let's use cumulative expense over time
register$cumamount <- cumsum(register$amounts)
cum <- ggplot(data=register,aes(dates,cumamount))+geom_point()
cum+stat_smooth()
#That was for everything the same every month; now let's introduce a trend where the amounts start to go up in the second year
increase <- c(rep(1,each=12),seq(from=1.01,to=1.9,length.out=12))
amounts.up <- round(amounts*increase,digits=2)#this is the monthly amount, growing in each month of the second year
register <- cbind(register,amounts.up)#add the variable to the data frame
register$cumamount.up <- cumsum(register$amounts.up) #work out the cumulative sum for the new scenario
ggplot(data=register,aes(x=dates))+
geom_point(aes(y=amounts, colour="amounts",shape="amounts"))+
geom_point(aes(y=amounts.up, colour="amounts.up",shape="amounts.up"))# the plot of amount by date
#I am now going to plot the cumulative amount over time, but now that I have two scenarios it is easier to deal with the data frame in long format (melted) rather than wide format (cast)
#before I can melt, the reshape2 package unfortunately can't handle the Date class, so I have to turn the dates into characters and then back again
register[,c("dates","cutmonth")] <- lapply(register[,c("dates","cutmonth")],as.character)
register.long <- melt(register,measure.vars=c("amounts","amounts.up"))#call the melt() generic; reshape2 does not export melt.data.frame
register[,c("dates","cutmonth")] <- lapply(register[,c("dates","cutmonth")],as.Date)#restore the Date class in the wide data frame
register.long[,c("dates","cutmonth")] <- lapply(register.long[,c("dates","cutmonth")],as.Date)
ggplot(register.long, aes(cutmonth,value))+ stat_summary(fun = sum, geom = "bar")+facet_grid(. ~ variable)
#that is the truth but it is a useless truth. Furthermore it appears as if 2010-06 is similar to what is going on in 2011-12
#that is patently absurd. All that happened was that the 2010-05 payment was delayed by 3 days.
#so let's use cumulative expense over time
register.long$cumvalue <- ave(register.long$value,register.long$variable,FUN=cumsum)#running total within each scenario
ggplot(data=register.long,aes(dates,cumvalue,colour=variable))+geom_point() + scale_y_continuous(name='cumulative sum of amounts ($)')
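So for the simple plot, the variable interpolate.daily would be about $50/30.4 = $1.64 per day for every day of the year. For the second plot, where the amount paid each month starts to go up in the second year, it would show a daily rate of $1.64 per day for every day in the first year, and for dates in the second year one would see the daily rate gradually increase from $1.64 per day to about $3.12 per day.
Thank you very much for reading all the way to the end. You must be as intrigued as I am!
Best Answer
Here is one basic way to do this. Of course, there are more sophisticated options and parameters to tune, but this should be a good starting point.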
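The arithmetic behind those two target rates can be checked with a few lines of R. This is only a sketch: it assumes an average month of 30.4 days, as in the paragraph above, and the names days.per.month, base.rate and final.rate are made up for illustration.

```r
# Expected daily rates implied by the monthly payments.
# Assumption: an average month has 30.4 days (365 / 12, rounded).
days.per.month <- 30.4

base.rate  <- 50 / days.per.month        # flat scenario: $50 every month
final.rate <- 50 * 1.9 / days.per.month  # last month of the rising scenario ($95)

# Both rates in dollars per day
c(base = base.rate, final = final.rate)
```

These are the values a well-behaved daily interpolation should hover around at the start and at the end of the series.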
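# rebuild the simulated payments from the question
dates <- seq(as.Date("2010-02-01"), length=24, by="1 month") - 1
dates[5] <- dates[5]+3 # one payment is 3 days late
amounts <- rep(50,each=24)
increase <- c(rep(1,each=12),seq(from=1.01,to=1.9,length.out=12))
amounts.up <- round(amounts*increase,digits=2)
# cumulative amount paid as of each payment date
df = data.frame(dates=dates, cumamount.up=cumsum(amounts.up))
# interpolating cubic spline through the cumulative totals
df.spline = splinefun(df$dates, df$cumamount.up)
# evaluate the first derivative (money per day) on every calendar day
newdates = seq(min(df$dates), max(df$dates), by=1)
money.per.day = df.spline(newdates, deriv=1)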
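Since the question asks for a data frame rather than a plot, the daily values can be collected into one. The following is a self-contained sketch of that step; the names cum.fun and daily are hypothetical, and the dates are converted with as.numeric() explicitly so that the spline works on plain day counts.

```r
# Rebuild the simulated payments (same scenario as above)
dates <- seq(as.Date("2010-02-01"), length.out = 24, by = "1 month") - 1
dates[5] <- dates[5] + 3
amounts.up <- round(50 * c(rep(1, 12), seq(1.01, 1.9, length.out = 12)), 2)

# Interpolating spline through the cumulative totals (x = days since 1970-01-01)
cum.fun <- splinefun(as.numeric(dates), cumsum(amounts.up))

# One row per calendar day: interpolated running total and its slope
newdates <- seq(min(dates), max(dates), by = 1)
daily <- data.frame(
  date          = newdates,
  cumulative    = cum.fun(as.numeric(newdates)),
  money.per.day = cum.fun(as.numeric(newdates), deriv = 1)
)
head(daily)
```

The money.per.day column is the instantaneous slope of the spline on each day, so it can be merged with other daily data by date.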
If you plot money.per.day, you can see some interesting behavior of the spline:
plot(newdates, money.per.day, type='l')
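An interpolating spline has to pass through every cumulative total, so the late payment shows up as a wiggle in the derivative. One possible variation, which is not part of the approach above, is to fit a smoothing spline with smooth.spline() from the stats package and take its derivative with predict(). This is a sketch under assumptions: the value df = 6 is an arbitrary illustrative choice that would need tuning, and the names fit and smooth.rate are made up.

```r
# Rebuild the simulated payments (same scenario as above)
dates <- seq(as.Date("2010-02-01"), length.out = 24, by = "1 month") - 1
dates[5] <- dates[5] + 3
amounts.up <- round(50 * c(rep(1, 12), seq(1.01, 1.9, length.out = 12)), 2)

# Smoothing spline on the cumulative totals; it does not have to hit every point.
# Assumption: df = 6 (effective degrees of freedom) is chosen only for illustration.
fit <- smooth.spline(as.numeric(dates), cumsum(amounts.up), df = 6)

# First derivative of the fitted curve on every calendar day (money per day)
newdates <- seq(min(dates), max(dates), by = 1)
smooth.rate <- predict(fit, as.numeric(newdates), deriv = 1)$y

plot(newdates, smooth.rate, type = "l",
     xlab = "date", ylab = "money per day ($)")
```

A lower df gives a smoother daily rate at the cost of reacting more slowly to real changes in spending, so the right value depends on how much month-to-month noise should be ignored.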