Problem description
I am trying to write a function that finds multiple first occurrences of an event in a given year. Events happen to different firms at different moments in time. So an event might happen for the first time in 1980 to firm c and afterwards in 1981 to firm b. In that case, all I need to find is firm c_1980 and the associated value in the matrix.
If, however, an event does NOT happen UNTIL it happens to firm a in 1986 and to firm e in 1986 as well, then I need to find both a_1986 and e_1986 as the outcome, with their respective values in the matrix.
My (2500 * 800) matrix has 2500 different events on the vertical axis and 800 different year_firm combinations on the horizontal one. All values are between 0 and 10 in the real matrix (between 0 and 2 in the example), with the vast majority being zeros.
Example data:
av<-matrix(rep(0:2),10,40)
av[1:7,]=0 ; av[9,3:14]=0
av[,c(22,38)]=1
colnames(av)<-paste(c("a","b","c","d","e"),rep(1980:1987, each=5),sep="_")
col.av<-colnames(av)
rownames(av)<-paste("X",1:10,sep="")
row.av<-rownames(av)
The main formula I have been using gives, for each row, the position in the matrix of the first occurrence:
first<-max.col(av>0,"first")
This works fine to find the first occurrence. However, as the data show, sometimes there are multiple occurrences in the same year (e.g. the event in row 8 occurs in 1980 for firms a, b, d, and e; given that this is the first year in which row 8 becomes non-zero, I would need to find 4 different values as output).
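As a quick illustration of what max.col(..., "first") returns, here is a minimal sketch on a made-up 3 x 4 matrix (the names m, first_toy and has_event and the toy values are hypothetical, not part of the real data). It also shows one thing to watch for: a row without any positive value still gets column 1 back, so such rows would need to be flagged separately.

```r
# Toy matrix (hypothetical values): rows are events, columns are firm_year slots
m <- rbind(c(0, 0, 2, 1),
           c(0, 0, 0, 0),
           c(1, 0, 0, 2))

# Column index of the first positive entry in each row
first_toy <- max.col(m > 0, ties.method = "first")

# A row with no positive entry is an all-way tie, so "first" returns column 1;
# mark those rows as NA instead of treating column 1 as a real occurrence
has_event <- rowSums(m > 0) > 0
first_toy[!has_event] <- NA
first_toy
```

In the example data every row has at least one positive value, so this only matters for the real matrix if some events never occur.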
My code so far is basically a patchwork solution:
avdum1<-cbind(seq_len(nrow(av)),first)
avdum2<-cbind(row.av,first)
Using avdum1 and avdum2 as well as the original row and column names, I can then build a matrix that gives me the first occurrence in the original matrix, together with the exact value of that first occurrence (1 or 2) and the firm-year combination:
firsttime<-matrix(cbind(row.av,col.av[first],av[avdum1]),nrow=nrow(av),ncol=3)
So far so good. Now, to find other first occurrences in the same year, what I do is
av[avdum1]<-0
This sets the original first occurrences to zero. I then run through the entire process again, expand the firsttime matrix, split the column names into years and firm names (a, b, c, d, e), compare the years, and check whether the second first occurrence happened in the same year. If it did, I have to rerun the entire process a third time, and so on (my real dataset has 40 firms).
This becomes pretty cumbersome, so I'm wondering if there is a smarter way to do it. Maybe a localized search once a positive event has been spotted, based on the relative position of that event in the matrix?
(If you copy the example data, you can ignore the initial warning when producing the matrix.)
Expected results:
for rows 1 to 7, the result would be b_1984 with value 1
for row 8, the result should be a_1980 with 1, b_1980 with 2, d_1980 with 1 and e_1980 with 2
for row 9, a_1980 with 2
for row 10, b_1980 with 1, c_1980 with 2, and e_1980 with 1
Hopefully this clarifies some of the previous questions/comments.
Any suggestions would be very welcome!
Recommended answer
I gave it a shot, although I followed a somewhat different path than yours. Perhaps there is a way to manipulate your data as is to get the result (and maybe even a fast one), but I preferred to use a "long" format instead. A long format can also be manipulated quickly with packages like "data.table" and "dplyr".
First, I transformed your av to a long format like this:
#turn to long format
long_DF = as.data.frame(as.table(av), responseName = "value")
#tidy up
tmp = do.call(rbind.data.frame, strsplit(as.character(long_DF[[2]]), "_"))
long_DF$firm = tmp[, 1] ; long_DF$year = tmp[, 2]
long_DF$event = long_DF[[1]] ; long_DF = long_DF[-(1:2)]
long_DF[c(1,4,5,8,15,16,20), ]
# value firm year event
#1 0 a 1980 X1
#4 0 a 1980 X4
#5 0 a 1980 X5
#8 1 a 1980 X8
#15 0 b 1980 X5
#16 0 b 1980 X6
#20 1 b 1980 X10
From here on, I guess there are many different and more efficient approaches, but I could only come up with the following:
#3D array
res = xtabs(value ~ firm + year + event, long_DF)
res[, , 3, drop = FALSE]
#, , event = X3
#
# year
#firm 1980 1981 1982 1983 1984 1985 1986 1987
# a 0 0 0 0 0 0 0 0
# b 0 0 0 0 1 0 0 0
# c 0 0 0 0 0 0 0 1
# d 0 0 0 0 0 0 0 0
# e 0 0 0 0 0 0 0 0
For each slice along the 3rd dimension (i.e. each event) you can search for 1) which cells ([row, column]) are above 0 and 2) which of those sit in the smallest column index (i.e. the earliest year in which the event occurred). An implementation of this could be the following function:
#function to apply to each 3rd dimension
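f1 = function(x) {
    # positions (firm row, year column) of all positive cells in this slice
    wh = which(x > 0, arr.ind = TRUE)
    # keep only the cells in the earliest year; the year column is indexed by
    # position because the column names of `wh` may follow names(dimnames(x))
    # ("firm"/"year") rather than "row"/"col", depending on the R version
    wh2 = which(wh[, 2] == min(wh[, 2]))
    wh3 = wh[wh2, , drop = FALSE]
    cbind.data.frame(firm = rownames(x)[wh3[, 1]],
                     year = colnames(x)[wh3[, 2]],
                     val = x[wh3])
}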
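To see what such a function returns for a single slice, here is a small sketch on a made-up 3 x 3 firm-by-year matrix. The name first_hits and the toy values are hypothetical; the function repeats the same logic, indexes the index matrix by position, and adds a guard for slices without any positive value (an assumption about how such events should be handled).

```r
# Hypothetical helper: same idea as the function above, on one firm x year slice
first_hits <- function(x) {
  hits <- which(x > 0, arr.ind = TRUE)          # (row, column) of positive cells
  if (nrow(hits) == 0) {                        # event never occurs: empty result
    return(data.frame(firm = character(0), year = character(0), val = numeric(0)))
  }
  hits <- hits[hits[, 2] == min(hits[, 2]), , drop = FALSE]  # earliest year only
  data.frame(firm = rownames(x)[hits[, 1]],
             year = colnames(x)[hits[, 2]],
             val  = x[hits])
}

# Toy slice: firm a is hit in 1982, firms b and c are both hit first in 1981
slice <- matrix(c(0, 0, 1,
                  0, 2, 0,
                  0, 1, 0),
                nrow = 3, byrow = TRUE,
                dimnames = list(firm = c("a", "b", "c"),
                                year = c("1980", "1981", "1982")))

toy_res <- first_hits(slice)
toy_res
```

Firm a is dropped because its first positive value comes a year later than the first hits for b and c.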
Then apply the function along the 3rd dimension, i.e. to each event:
ret = apply(res, 3, f1)
#ret
ans = cbind.data.frame(event = rep(names(ret), sapply(ret, nrow)),
                       do.call(rbind.data.frame, ret))
ans
# event firm year val
#X1 X1 b 1984 1
#X2 X2 b 1984 1
#X3 X3 b 1984 1
#X4 X4 b 1984 1
#X5 X5 b 1984 1
#X6 X6 b 1984 1
#X7 X7 b 1984 1
#X8.1 X8 a 1980 1
#X8.2 X8 b 1980 2
#X8.3 X8 d 1980 1
#X8.4 X8 e 1980 2
#X9 X9 a 1980 2
#X10.1 X10 b 1980 1
#X10.2 X10 c 1980 2
#X10.3 X10 e 1980 1
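The same result can probably also be reached without reshaping, by working on the wide matrix directly. The following is only a minimal sketch under stated assumptions: column names always follow the firm_year pattern, and the names year, hits, hit_year, first_year, keep and out are made up for illustration. It rebuilds the example data so that it runs on its own.

```r
# Rebuild the example data from the question (the recycling warning is harmless)
av <- suppressWarnings(matrix(rep(0:2), 10, 40))
av[1:7, ] <- 0
av[9, 3:14] <- 0
av[, c(22, 38)] <- 1
colnames(av) <- paste(c("a", "b", "c", "d", "e"),
                      rep(1980:1987, each = 5), sep = "_")
rownames(av) <- paste("X", 1:10, sep = "")

# Year of every column, parsed from the part after the underscore
year <- as.integer(sub(".*_", "", colnames(av)))

# All positive cells as (row, column) pairs, plus the year of each cell
hits <- which(av > 0, arr.ind = TRUE)
hit_year <- year[hits[, 2]]

# Earliest year per event (row), repeated for every hit of that event
first_year <- ave(hit_year, hits[, 1], FUN = min)

# Keep every hit that falls in its event's first year
keep <- hits[hit_year == first_year, , drop = FALSE]
keep <- keep[order(keep[, 1], keep[, 2]), , drop = FALSE]

out <- data.frame(event     = rownames(av)[keep[, 1]],
                  firm_year = colnames(av)[keep[, 2]],
                  val       = av[keep])
out
```

Because which() only visits the non-zero cells, this might scale reasonably to a sparse 2500 x 800 matrix, although that has not been timed here.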