This question already has answers here:
Reshaping multiple sets of measurement columns (wide format) into single columns (long format)
(7 answers)
Closed 3 years ago.
I am afraid my title does not fully capture my question.
Let me explain the problem as best I can. Here is part of my data.
pdpass begyr1 gvkey1 endyr1 begyr2 gvkey2 endyr2 begyr3 gvkey3 endyr3 begyr4 gvkey4 endyr4 begyr5 gvkey5 endyr5
-16348 1991 26005 1993 1994 8852 1996 1997 3708 2005 NA NA NA NA NA NA
-16179 2006 174876 2006 NA NA NA NA NA NA NA NA NA NA NA NA
-16161 1993 30964 2005 NA NA NA NA NA NA NA NA NA NA NA NA
-15631 1996 111491 2006 NA NA NA NA NA NA NA NA NA NA NA NA
-14668 1988 15854 2003 NA NA NA NA NA NA NA NA NA NA NA NA
-13980 1986 13218 1995 1996 9921 2005 NA NA NA NA NA NA NA NA NA
Since the data set is large, I am also including the dput output.
# dput
structure(list(pdpass = c(-16348L, -16179L, -16161L, -15631L, -14668L, -13980L),
begyr1 = c(1991L, 2006L, 1993L, 1996L, 1988L, 1986L), gvkey1 = c(26005L, 174876L, 30964L, 111491L, 15854L, 13218L), endyr1 = c(1993L, 2006L, 2005L, 2006L, 2003L, 1995L),
begyr2 = c(1994L, NA, NA, NA, NA, 1996L), gvkey2 = c(8852L, NA, NA, NA, NA, 9921L), endyr2 = c(1996L, NA, NA, NA, NA, 2005L),
begyr3 = c(1997L, NA, NA, NA, NA, NA), gvkey3 = c(3708L, NA, NA, NA, NA, NA), endyr3 = c(2005L, NA, NA, NA, NA, NA),
begyr4 = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_), gvkey4 = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_), endyr4 = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_),
begyr5 = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_), gvkey5 = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_), endyr5 = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_)), row.names = c(NA, -6L), class = "data.frame", .Names = c("pdpass", "begyr1", "gvkey1", "endyr1", "begyr2", "gvkey2", "endyr2", "begyr3", "gvkey3", "endyr3", "begyr4", "gvkey4", "endyr4", "begyr5", "gvkey5", "endyr5"))
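I want to tidy this into long format. The pdpass column is the ID variable. For each pdpass, I want to collect all of the begyr, endyr, and gvkey values. However, as you can see from the raw data, some pdpass values have only one set of gvkey, endyr, and begyr values, while others have several (denoted gvkey2, endyr2, begyr2, gvkey3, and so on). Basically, I want one row per pdpass for each set of gvkey, endyr, and begyr columns that end in the same number. Put simply, this is my desired output: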
pdpass gvkeyN gvkey begyearN begyear endyearN endyear
1 -16348 gvkey1 26005 begyear1 1991 endyear1 1993
2 -16348 gvkey2 8852 begyear2 1994 endyear2 1996
3 -16348 gvkey3 3708 begyear3 1997 endyear3 2005
4 -16179 gvkey1 174876 begyear1 2006 endyear1 2006
5 -16161 gvkey1 30964 begyear1 1993 endyear1 2005
6 -15631 gvkey1 111491 begyear1 1996 endyear1 2006
7 -14668 gvkey1 15854 begyear1 1988 endyear1 2003
8 -13980 gvkey1 13218 begyear1 1986 endyear1 1995
9 -13980 gvkey2 13218 begyear1 1996 endyear2 2005
10 -13956 gvkey1 8674 begyear1 1968 endyear2 1987
# code
desired <- data.frame(pdpass = c(rep(-16348, 3), -16179, -16161, -15631, -14668, rep(-13980, 2), -13956),
gvkeyN = c("gvkey1", "gvkey2", "gvkey3", rep("gvkey1", 5), "gvkey2", "gvkey1"),
gvkey = c(26005, 8852, 3708, 174876, 30964, 111491, 15854, 13218, 13218, 8674),
begyearN = c("begyear1", "begyear2", "begyear3", rep("begyear1", 5), "begyear1", "begyear1"),
begyear = c(1991, 1994, 1997, 2006, 1993, 1996, 1988, 1986, 1996, 1968),
endyearN = c("endyear1", "endyear2", "endyear3", rep("endyear1", 5), "endyear2", "endyear2"),
endyear = c(1993, 1996, 2005, 2006, 2005, 2006, 2003, 1995, 2005, 1987))
Here is what I have tried so far:
library(tidyr)
library(dplyr)  # arrange() comes from dplyr
# NBER is the wide data frame shown above; each gather() call stacks one family of columns
d1 <- gather(NBER, gvkeyN, gvkey, -pdpass, -endyr1, -begyr1, -endyr2, -begyr2, -endyr3, -begyr3, -endyr4, -begyr4, -endyr5, -begyr5, na.rm = TRUE)
d2 <- gather(d1, begyrN, begyear, -gvkeyN, -gvkey, -pdpass, -endyr1, -endyr2, -endyr3, -endyr4, -endyr5, na.rm = TRUE)
d3 <- gather(d2, endyrN, endyear, -gvkeyN, -gvkey, -pdpass, -begyrN, -begyear, na.rm = TRUE)
d4 <- arrange(d3, pdpass)
head(d4, 10)
pdpass gvkeyN gvkey begyrN begyear endyrN endyear
1 -16348 gvkey1 26005 begyr1 1991 endyr1 1993
2 -16348 gvkey2 8852 begyr1 1991 endyr1 1993
3 -16348 gvkey3 3708 begyr1 1991 endyr1 1993
4 -16348 gvkey1 26005 begyr2 1994 endyr1 1993
5 -16348 gvkey2 8852 begyr2 1994 endyr1 1993
6 -16348 gvkey3 3708 begyr2 1994 endyr1 1993
7 -16348 gvkey1 26005 begyr3 1997 endyr1 1993
8 -16348 gvkey2 8852 begyr3 1997 endyr1 1993
9 -16348 gvkey3 3708 begyr3 1997 endyr1 1993
10 -16348 gvkey1 26005 begyr1 1991 endyr2 1996
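As you can see, I get unnecessary rows. What I need are the rows in which the numbers at the end of gvkey, endyr, and begyr match exactly (for example gvkey1 begyr1 endyr1; gvkey2 begyr2 endyr2; gvkey3 begyr3 endyr3; and so on). Instead, I get every combination of 1, 2, and 3, so rows 2 through 9 above are all unnecessary. Perhaps I could delete the unnecessary rows by hand after gathering, but I suspect there is a better way to use the gather function. I have looked at similar questions but could not find a solution to this problem. Can anyone suggest one?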
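To make the manual clean-up idea concrete, here is a minimal sketch that is not part of the original question. It assumes a two-row subset of the data above, selects columns with starts_with() rather than by exclusion, and introduces a hypothetical helper, suffix(), that strips everything except the digits from a key name. Gathering in three passes crosses every gvkey with every begyr and endyr; the filter then keeps only the rows whose three suffixes agree.

```r
library(tidyr)
library(dplyr)

# Illustrative two-row subset of the question's data (an assumption, not the full file)
NBER <- data.frame(
  pdpass = c(-16348L, -13980L),
  begyr1 = c(1991L, 1986L), gvkey1 = c(26005L, 13218L), endyr1 = c(1993L, 1995L),
  begyr2 = c(1994L, 1996L), gvkey2 = c(8852L, 9921L),   endyr2 = c(1996L, 2005L),
  begyr3 = c(1997L, NA),    gvkey3 = c(3708L, NA),      endyr3 = c(2005L, NA)
)

# Three independent gathers: every gvkey is paired with every begyr and endyr
crossed <- NBER %>%
  gather(gvkeyN, gvkey, starts_with("gvkey"), na.rm = TRUE) %>%
  gather(begyrN, begyear, starts_with("begyr"), na.rm = TRUE) %>%
  gather(endyrN, endyear, starts_with("endyr"), na.rm = TRUE)

# Hypothetical helper: keep only the trailing digits of a key such as "gvkey2"
suffix <- function(x) gsub("[^0-9]", "", x)

# Keep the rows where all three keys carry the same number
matched <- crossed %>%
  filter(suffix(gvkeyN) == suffix(begyrN),
         suffix(gvkeyN) == suffix(endyrN)) %>%
  arrange(pdpass, gvkeyN)

print(matched)
```

This works, but the intermediate table grows with the cube of the number of column sets per pdpass, which is one reason a reshape that handles all three column families in a single step is usually preferable.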
Best answer
We can use melt from data.table:
library(data.table)
# df1 is the wide data frame from the question; each pattern becomes one value column
melt(setDT(df1), measure.vars = patterns("^gvkey", "^begyr", "^endyr"),
     na.rm = TRUE, value.name = c("gvkey", "begyear", "endyear"))
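For readers who would rather stay within tidyr, a similar reshape can be sketched with pivot_longer(), assuming tidyr 1.0.0 or later. This is an alternative, not the answerer's method; the two-row data frame and the column name spell are illustrative assumptions. The special ".value" entry in names_to tells pivot_longer() to use the first captured group (begyr, gvkey, or endyr) as the output column name, so columns sharing a numeric suffix land in the same row.

```r
library(tidyr)

# Illustrative two-row subset of the question's data (an assumption, not the full file)
df1 <- data.frame(
  pdpass = c(-16348L, -13980L),
  begyr1 = c(1991L, 1986L), gvkey1 = c(26005L, 13218L), endyr1 = c(1993L, 1995L),
  begyr2 = c(1994L, 1996L), gvkey2 = c(8852L, 9921L),   endyr2 = c(1996L, 2005L),
  begyr3 = c(1997L, NA),    gvkey3 = c(3708L, NA),      endyr3 = c(2005L, NA)
)

long <- pivot_longer(
  df1,
  cols = -pdpass,
  # first capture group names the value column, second becomes the "spell" index
  names_to = c(".value", "spell"),
  names_pattern = "^(begyr|gvkey|endyr)([0-9]+)$",
  # drop the rows in which every value column is NA
  values_drop_na = TRUE
)

print(long)
```

On this subset the result should have one row per pdpass and suffix: three rows for -16348 and two for -13980.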