本文介绍了在MySQL或R中重塑表的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有下表:

perid   date    rating
10001   2005    RD
10001   2006    GN
10001   2007    GD
10002   2008    GD
10002   2009    YW
10002   2010    GN
10002   2011    GN
10003   2005    GD
10003   2006    GN
10003   2007    YW

如何将表格转换为以下格式:

how can I turn this table to the following format:

perid    2005    2006    2007    2008    2009    2010    2011
10001    RD      GN      GN      N/A     N/A     N/A     N/A
10002    N/A     N/A     N/A     GD      YW      GN      GN
10003    GD      GN      YW      N/A     N/A     N/A     N/A

或者如果我能在R中做到这一点?

Or if I can do this in R?

谢谢,P

推荐答案

在基数R中,要使用的函数为reshape,您将把数据从长"转换为宽".

In base R, the function to use would be reshape, and you would be converting your data from "long" to "wide".

reshape(mydf, direction = "wide", idvar="perid", timevar="date")
#   perid rating.2005 rating.2006 rating.2007 rating.2008 rating.2009 rating.2010 rating.2011
# 1 10001          RD          GN          GD        <NA>        <NA>        <NA>        <NA>
# 4 10002        <NA>        <NA>        <NA>          GD          YW          GN          GN
# 8 10003          GD          GN          YW        <NA>        <NA>        <NA>        <NA>

或者,您可以从"reshape2"包中查看dcast并尝试:

Alternatively, you can look at dcast from the "reshape2" package and try:

library(reshape2)
dcast(mydf, perid ~ date, value.var="rating")
#   perid 2005 2006 2007 2008 2009 2010 2011
# 1 10001   RD   GN   GD <NA> <NA> <NA> <NA>
# 2 10002 <NA> <NA> <NA>   GD   YW   GN   GN
# 3 10003   GD   GN   YW <NA> <NA> <NA> <NA>

为获得更快的速度,请将您的data.frame转换为data.table并改用dcast.data.table.

For better speed, convert your data.frame to a data.table and use dcast.data.table instead.

library(reshape2)
library(data.table)
DT <- data.table(mydf)
dcast.data.table(DT, perid ~ date, value.var = "rating")
#    perid 2005 2006 2007 2008 2009 2010 2011
# 1: 10001   RD   GN   GD   NA   NA   NA   NA
# 2: 10002   NA   NA   NA   GD   YW   GN   GN
# 3: 10003   GD   GN   YW   NA   NA   NA   NA


在您的注释中,听起来好像您在第1列和第2列的组合中有重复的值,这意味着dcast默认情况下将使用length作为其聚合函数.


From your comments, it sounds like you have duplicated values among the combinations of column 1 and 2, which means that by default, dcast will use length as its aggregation function.

要克服这一点,您需要创建一个辅助ID(实际上是时间")列,可以这样做.

To get past this, you need to make a secondary ID (or "time", actually) column, which can be done like this.

首先,提供一些示例数据.请注意第1行和第2行中前两列的重复组合.

First, some sample data. Note the duplicated combination of the first two columns in rows 1 and 2.

mydf <- data.frame(
  period = c(10001, 10001, 10002, 10002, 10003, 10003, 10001, 10001),
  date = c(2005, 2005, 2006, 2007, 2005, 2006, 2006, 2007),
  rating = c("RD", "GN", "GD", "GD", "YW", "GN", "GD", "YN"))
mydf
#   period date rating
# 1  10001 2005     RD
# 2  10001 2005     GN
# 3  10002 2006     GD
# 4  10002 2007     GD
# 5  10003 2005     YW
# 6  10003 2006     GN
# 7  10001 2006     GD
# 8  10001 2007     YN

尝试dcast时,它只是计算"每个组合下的数字.

When you try dcast, it just "counts" the number under each combination.

## Not what you want
dcast(mydf, period ~ date, value.var="rating")
# Aggregation function missing: defaulting to length
#   period 2005 2006 2007
# 1  10001    2    1    1
# 2  10002    0    1    1
# 3  10003    1    1    0

要么确定应该删除哪个重复的行,要么,如果所有数据都属于您的数据集中,则添加一个时间"变量,如下所示:

Either decide which duplicated row should be dropped, or, if all the data belongs in your dataset, add a "time" variable, like this:

mydf$time <- ave(1:nrow(mydf), mydf$period, mydf$date, FUN = seq_along)
mydf
#   period date rating time
# 1  10001 2005     RD    1
# 2  10001 2005     GN    2
# 3  10002 2006     GD    1
# 4  10002 2007     GD    1
# 5  10003 2005     YW    1
# 6  10003 2006     GN    1
# 7  10001 2006     GD    1
# 8  10001 2007     YN    1

现在,dcast应该可以正常工作.这是一个半长版本...

Now, dcast should work fine. Here's a semi-long version...

dcast(mydf, period + time ~ date, value.var="rating")
#   period time 2005 2006 2007
# 1  10001    1   RD   GD   YN
# 2  10001    2   GN <NA> <NA>
# 3  10002    1 <NA>   GD   GD
# 4  10003    1   YW   GN <NA>

...和半宽版.

dcast(mydf, period ~ date + time, value.var="rating")
#   period 2005_1 2005_2 2006_1 2007_1
# 1  10001     RD     GN     GD     YN
# 2  10002   <NA>   <NA>     GD     GD
# 3  10003     YW   <NA>     GN   <NA>

这篇关于在MySQL或R中重塑表的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持!

10-11 03:13