This question already has answers here:
Reshaping multiple sets of measurement columns (wide format) into single columns (long format)

(7个答案)


3年前关闭。




我担心我的头衔不能完全抓住我的问题。
让我尽我所能解释我的问题。这是我数据的一部分。
pdpass begyr1 gvkey1 endyr1 begyr2 gvkey2 endyr2 begyr3 gvkey3 endyr3 begyr4 gvkey4 endyr4 begyr5 gvkey5 endyr5
-16348   1991  26005   1993   1994   8852   1996   1997   3708   2005     NA     NA     NA     NA     NA     NA
-16179   2006 174876   2006     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
-16161   1993  30964   2005     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
-15631   1996 111491   2006     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
-14668   1988  15854   2003     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA     NA
-13980   1986  13218   1995   1996   9921   2005     NA     NA     NA     NA     NA     NA     NA     NA     NA

由于数据量很大,因此我也将键入dput值。
#dput

structure(list(pdpass = c(-16348L, -16179L, -16161L, -15631L, -14668L, -13980L),
begyr1 = c(1991L, 2006L, 1993L, 1996L, 1988L,  1986L), gvkey1 = c(26005L, 174876L, 30964L, 111491L, 15854L, 13218L), endyr1 = c(1993L, 2006L, 2005L, 2006L, 2003L, 1995L),
begyr2 = c(1994L, NA, NA, NA, NA, 1996L), gvkey2 = c(8852L, NA, NA, NA, NA, 9921L), endyr2 = c(1996L, NA, NA, NA, NA, 2005L),
begyr3 = c(1997L, NA, NA, NA, NA, NA), gvkey3 = c(3708L, NA, NA, NA, NA, NA), endyr3 = c(2005L, NA, NA, NA, NA, NA),
begyr4 = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_), gvkey4 = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_), endyr4 = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_),
begyr5 = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_), gvkey5 = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_), endyr5 = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_)), row.names = c(NA, -6L), class = "data.frame", .Names = c("pdpass", "begyr1", "gvkey1", "endyr1", "begyr2", "gvkey2", "endyr2", "begyr3", "gvkey3", "endyr3", "begyr4", "gvkey4", "endyr4", "begyr5", "gvkey5", "endyr5"))

我想将其整理成长格式。
pdpass列是ID变量。我想为每个begyrs值收集所有endyearsgvkeypdpass值。
但是,从原始数据中可以看到,有些pdpass仅具有一个gvkey, endyear, begyears值,而有些则具有多个。 (它们表示为gvkey2, endyear2, begyear2, gvkey3 ...等)。

基本上,我只想在gvkey-endyear-begyear值以每个pdpass相同的数字结尾时才进行一行

轻松地说,这是我想要的输出:
   pdpass gvkeyN  gvkey begyearN begyear endyearN endyear
1  -16348 gvkey1  26005 begyear1    1991 endyear1    1993
2  -16348 gvkey2   8852 begyear2    1994 endyear2    1996
3  -16348 gvkey3   3708 begyear3    1997 endyear3    2005
4  -16179 gvkey1 174876 begyear1    2006 endyear1    2006
5  -16161 gvkey1  30964 begyear1    1993 endyear1    2005
6  -15631 gvkey1 111491 begyear1    1996 endyear1    2006
7  -14668 gvkey1  15854 begyear1    1988 endyear1    2003
8  -13980 gvkey1  13218 begyear1    1986 endyear1    1995
9  -13980 gvkey2  13218 begyear1    1996 endyear2    2005
10 -13956 gvkey1   8674 begyear1    1968 endyear2    1987

# code
desired<-data.frame(pdpass=c(rep(-16348, 3), -16179, -16161, -15631, -14668, rep(-13980, 2), -13956),
gvkeyN=c("gvkey1", "gvkey2", "gvkey3", rep("gvkey1", 5), "gvkey2", "gvkey1"),
gvkey=c(26005, 8852, 3708, 174876, 30964, 111491, 15854, 13218, 13218, 8674),
begyearN=c("begyear1", "begyear2", "begyear3", rep("begyear1", 5), "begyear1", "begyear1"),
begyear=c(1991, 1994, 1997, 2006, 1993, 1996, 1988, 1986, 1996, 1968),
endyearN=c("endyear1", "endyear2", endyear3", rep("endyear1", 5), "endyear2", "endyear2"),
endyear=c(1993, 1996, 2005, 2006, 2005, 2006, 2003, 1995, 2005, 1987))`

到目前为止,这是我尝试过的
require(tidyr)
d1<-gather(NBER, gvkeyN, gvkey, -pdpass, -endyr1, -begyr1, -endyr2, -begyr2, -endyr3, -begyr3, -endyr4, -begyr4, -endyr5, -begyr5, na.rm=TRUE)
d2<-gather(d1, begyrN, begyear, -gvkeyN, -gvkey, -pdpass, -endyr1, -endyr2, -endyr3, -endyr4, -endyr5, na.rm=TRUE)
d3<-gather(d2, endyrN, endyear, -gvkeyN, -gvkey, -pdpass, -begyrN, -begyear, na.rm=TRUE)
d4<-arrange(d3, pdpass)
head(d4, 10)

pdpass gvkeyN gvkey begyrN begyear endyrN endyear
1  -16348 gvkey1 26005 begyr1    1991 endyr1    1993
2  -16348 gvkey2  8852 begyr1    1991 endyr1    1993
3  -16348 gvkey3  3708 begyr1    1991 endyr1    1993
4  -16348 gvkey1 26005 begyr2    1994 endyr1    1993
5  -16348 gvkey2  8852 begyr2    1994 endyr1    1993
6  -16348 gvkey3  3708 begyr2    1994 endyr1    1993
7  -16348 gvkey1 26005 begyr3    1997 endyr1    1993
8  -16348 gvkey2  8852 begyr3    1997 endyr1    1993
9  -16348 gvkey3  3708 begyr3    1997 endyr1    1993
10 -16348 gvkey1 26005 begyr1    1991 endyr2    1996

如您所见,我得到了不必要的行。我需要的是gvkey, endyear, and begyear末尾的数字完全匹配的行。 (例如gvkey 1 begyr 1 endyr 1gvkey 2 begyr 2 endyr 2gvkey3 begyr3 endyr3等。);但是,我得到了1,2和3的组合。因此,上面的2:9行都是不必要的。

也许,我可以在收集后以某种方式手动删除所有不必要的行。但是,我认为这可能是更好地利用gather函数的某种方式。我看过类似的问题,但找不到该问题的解决方案。
任何人都可以为我提供任何这些解决方案吗?

最佳答案

我们可以使用melt中的data.table

library(data.table)
melt(setDT(df1), measure = patterns("^gvkey", "^begyr", "^endyr"),
    na.rm = TRUE, value.name = c("gvkey", "begyear", "endyear"))

09-06 07:46