
Problem description

I have a data set containing data sorted in rows like this:

*VarName1* - *VarValue1*
*VarName2* - *VarValue2*
*Etc.*

I want each VarName to become its own column. I achieved this with the following code:

DFP1 <- as.data.frame(t(DFP)) #DFP contains the data

Now, this is a very big data set. It contains multiple years (millions of rows) of data. The code above creates a data frame with more than 1E6 columns. I need to split these columns by entry. In the first part of the data, a new entry starts at every 86th column, so I tried this:

tmp <- data.frame(
       X = DFP$noFloat,
       ind = rep(1:86, nrow(DFP)/86)
)

y <- rbind(DFP$nmlVar[1:86], unstack(tmp, X~ind))

This works for a few rows. The problem is that the number of variables has increased over the years, so I cannot assume that every entry has the same number of variables. As a result, variable values end up under the wrong names. I am looking for a way to match variables and values based on their variable names.

I am new to advanced data analysis, so please let me know if you need more information.

EDIT: I created some sample data showing what DFP looks like, to hopefully make my question clearer:

DFP <- data.frame(
    nmlVar = c("Batch", "Mass", "Length", "Product","Batch", "Mass",
    "Length", "Product", "Batch", "Mass", "Length", "Width", "Product"),

    noFloat = c(254578, 20, 24, 24547, 254579, 23, 24, 24547, 254580, 20,
    24, 19, 24547)
)

Important to note here is the appearance of the new variable Width in the third recurrence. The introduction of new variables like this is typical for my data set. The key indicator is Batch: the data should be split each time the variable Batch appears.

dput output of sample data:

structure(list(nmlVar = structure(c(1L, 3L, 2L, 4L, 1L, 3L, 2L,
4L, 1L, 3L, 2L, 5L, 4L), .Label = c("Batch", "Length", "Mass",
"Product", "Width"), class = "factor"), noFloat = c(254578, 20,
24, 24547, 254579, 23, 24, 24547, 254580, 20, 24, 19, 24547)), .Names = c("nmlVar",
"noFloat"), row.names = c(NA, -13L), class = "data.frame")
Solution

Is this what you are after?

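library(dplyr)
library(tidyr)
DFP %>%
  # start a new sample id at every "Batch" row
  mutate(sample = cumsum(nmlVar == 'Batch')) %>%
  # one column per variable name; variables missing from a sample become NA
  spread(nmlVar, noFloat)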

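Gives:

  sample  Batch Length Mass Product Width
1      1 254578     24   20   24547    NA
2      2 254579     24   23   24547    NA
3      3 254580     24   20   24547    19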

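To make the grouping step more explicit, here is a minimal sketch that first prints the group ids produced by cumsum() and then performs the same reshape with pivot_wider(), which supersedes spread() in tidyr 1.0.0 and later. It assumes the sample DFP from the question; the names group_id and wide are illustrative only and do not come from the original answer.

```r
library(dplyr)
library(tidyr)

# Sample data from the question
DFP <- data.frame(
  nmlVar = c("Batch", "Mass", "Length", "Product",
             "Batch", "Mass", "Length", "Product",
             "Batch", "Mass", "Length", "Width", "Product"),
  noFloat = c(254578, 20, 24, 24547,
              254579, 23, 24, 24547,
              254580, 20, 24, 19, 24547)
)

# nmlVar == "Batch" is TRUE on the first row of every entry; the running
# sum of those TRUEs is an id that increases by one at each new entry.
group_id <- cumsum(DFP$nmlVar == "Batch")
print(group_id)
# 1 1 1 1 2 2 2 2 3 3 3 3 3

# Same reshape with pivot_wider(): one row per sample, one column per
# variable name. Variables absent from a sample are filled with NA.
wide <- DFP %>%
  mutate(sample = cumsum(nmlVar == "Batch")) %>%
  pivot_wider(names_from = nmlVar, values_from = noFloat)
print(wide)
```

Because values are matched by name within each group, an entry with an extra variable (such as Width in the third sample) no longer shifts the other values. Note that pivot_wider() returns a tibble and may order the columns differently from spread().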
