Question
Not sure whether this question would be better off at Cross Validated, but I think it is as much of a programming question as a pure statistical one.
I have a 102 x 1147 data frame where there are years (between 1960 and 2016) and each record is a scientific paper. I count the number of papers published each year within certain topics (guided by values in specific columns), and I want to calculate the linear slope from the year and the annual count of the number of papers.
Here's my script, first the linear model, then the plot:
library(ggplot2)

# THEME 1 (POPABU)
# Count papers per year and POPABU value, then keep the rows where POPABU == 1
sub2 <- subset(as.data.frame(table(sysrev60[, c("YR", "POPABU")])),
               POPABU == 1, select = c(1, 3))
# table() returns YR as a factor; convert its labels back to numbers
sub2$YR <- as.numeric(as.character(sub2$YR))

# Build a plotmath label (equation and r^2) from the data frame passed in
lm_eqn <- function(df) {
  m <- lm(Freq ~ YR, data = df)
  eq <- substitute(italic(y) == a + b %.% italic(x)*","~~italic(r)^2~"="~r2,
                   list(a = format(coef(m)[1], digits = 2),
                        b = format(coef(m)[2], digits = 2),
                        r2 = format(summary(m)$r.squared, digits = 3)))
  as.character(as.expression(eq))
}

ggplot(sub2, aes(x = YR, y = Freq)) +
  scale_y_continuous(limits = c(0, 20), expand = c(0, 0)) +
  scale_x_continuous(breaks = seq(1960, 2015, by = 5)) +
  geom_bar(stat = "identity") +
  geom_text(x = 1960, y = 16, label = lm_eqn(sub2), size = 5, hjust = 0, parse = TRUE) +
  stat_smooth(method = "lm", col = "red") +
  xlab(" ") + ylab("No of papers") +
  annotate("text", x = 1960, y = 18, label = "THEME 1",
           family = "serif", size = 7, hjust = 0, color = "darkred")
My problem is that this procedure only calculates the linear relation between the year and the counts > 0. There are a number of years where the count of papers equals 0, and I need the regression to cover the same period (1960-2016) for all the 25 different topics I am studying, i.e. I need to force the regression to include a 0 for every year the count of papers is 0.
I've made subsets of the large data frame corresponding to each topic I want to study the publication rate for. Here's the dput() output of my 'sub2' data frame:
dput(sub2)
structure(list(YR = c(1960, 1961, 1962, 1963, 1964, 1965, 1966,
1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977,
1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988,
1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999,
2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010,
2011, 2012, 2013, 2014, 2015, 2016), Freq = c(0L, 0L, 0L, 0L,
0L, 1L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 2L, 1L, 0L, 1L,
3L, 0L, 1L, 0L, 2L, 0L, 3L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 1L, 2L,
0L, 2L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 2L, 0L, 1L,
1L, 1L, 2L, 3L, 5L)), .Names = c("YR", "Freq"), row.names = 58:114, class = "data.frame")
As you can see, there seem to be explicit 0's in my data frame, but the regression doesn't seem to care.
I have a feeling that this could be done by a small tweak of my script. How do I do that?
Recommended answer
What you have so far does take the zeros into account. We can double-check this by calculating the coefficients manually, in case you think lm() is doing something weird for some reason:
# Make sure zeros are there:
sub2$Freq
[1] 0 0 0 0 0 1 0 0 0 0 1 1 1 0 0 0 2 1 0 1 3 0 1 0 2 0 3 0 1 0 1 0 0 1 1 2 0 2
[39] 0 0 0 1 0 0 0 0 0 1 0 2 0 1 1 1 2 3 5
# Yep
X <- cbind(rep(1, nrow(sub2)), sub2$YR) # add a column of 1s for intercept
solve(t(X) %*% X) %*% t(X) %*% sub2$Freq # (X'X)^-1 X'Y -- OLS formula
[,1]
[1,] -38.1778584
[2,] 0.0195748
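As a further cross-check, here is a minimal sketch that fits the same model with lm() and asks how many rows it used. The sub2 data frame is rebuilt from the dput() output in the question so the snippet runs on its own; the name fit is just illustrative. If lm() were silently dropping the zero years, nobs() would report fewer than 57 observations.

```r
# Rebuild sub2 from the dput() output in the question (57 years, zeros included)
sub2 <- data.frame(
  YR = 1960:2016,
  Freq = c(0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 2, 1, 0, 1,
           3, 0, 1, 0, 2, 0, 3, 0, 1, 0, 1, 0, 0, 1, 1, 2, 0, 2, 0, 0,
           0, 1, 0, 0, 0, 0, 0, 1, 0, 2, 0, 1, 1, 1, 2, 3, 5)
)

# Fit the same model the plot uses
fit <- lm(Freq ~ YR, data = sub2)

# Number of observations that actually entered the fit
print(nobs(fit))

# Intercept and slope, for comparison with the manual OLS result
print(coef(fit))
```

The coefficients printed here should agree with the manual result above (roughly -38.18 and 0.0196).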
Taking rounding into account, this matches what's displayed on the plot that results from your posted code:
When we use all the values, including the zeros, the intercept is about -38 and the year coefficient is about 0.02. So there's absolutely nothing wrong there. What may be leading you to think that it's ignoring the zeros is that there are no bars for the years where Freq is zero, but that's just because the plot accurately reflects the values: when the height of a bar is zero, there is no bar to see.
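To see what ignoring the zeros would actually look like, here is a rough sketch that fits the model twice: once on all 57 years and once on only the years with at least one paper. The data frame is rebuilt from the dput() output in the question, and fit_all and fit_pos are illustrative names, not part of the original script. With this data the two fits use different numbers of rows and give different intercepts, and the all-years fit is the one that matches the equation on the plot.

```r
# Rebuild sub2 from the dput() output in the question
sub2 <- data.frame(
  YR = 1960:2016,
  Freq = c(0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 2, 1, 0, 1,
           3, 0, 1, 0, 2, 0, 3, 0, 1, 0, 1, 0, 0, 1, 1, 2, 0, 2, 0, 0,
           0, 1, 0, 0, 0, 0, 0, 1, 0, 2, 0, 1, 1, 1, 2, 3, 5)
)

# Fit on every year, zeros included (what the posted script does)
fit_all <- lm(Freq ~ YR, data = sub2)

# Hypothetical fit with the zero-count years removed, for contrast only
fit_pos <- lm(Freq ~ YR, data = subset(sub2, Freq > 0))

# Compare coefficients and the number of rows each fit used
print(rbind(all_years = coef(fit_all), nonzero_only = coef(fit_pos)))
print(c(all_years = nobs(fit_all), nonzero_only = nobs(fit_pos)))
```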