How to build a data matrix from a mixed and messy CSV file?


Problem description

I have a huge .csv file:

Transcript Id   Gene Id(name)   Mirna Name  miTG score
ENST00000286800 ENSG00000156273 (BACH1) hsa-let-7a-5p   1
UTR3    21:30717114-30717142    0.05994568
UTR3    21:30717414-30717442    0.13591267
ENST00000345080 ENSG00000187772 (LIN28B)    hsa-let-7a-5p   1
UTR3    6:105526681-105526709   0.133514751

and I want to build a matrix like this from it:

Transcript Id    Gene Id(name)   Mirna Name        miTG score    UTR3        MRE_score
ENST00000286800 ENSG00000156273 (BACH1) hsa-let-7a-5p       1  21:30717414-30717442 0.13591267

I want to add three new columns to my new matrix, called UTR3, MRE_score and CDS.

For every Gene ID (for example ENST00000286800), there are several UTR3 entries in the original matrix (here two for ENST00000286800 and one for ENST00000345080). We choose the UTR3 with the highest score in the third column. In the new matrix, the value of UTR3 for every Gene ID will be the value from the second column of the original matrix.

Can anybody help me reshape this data and build my new matrix?

`Recommended answer`

You could try to structure the CSV using regular expressions:

textfile <- "ENST00000286800 ENSG00000156273 (BACH1) hsa-let-7a-5p   1
UTR3    21:30717114-30717142    0.05994568
UTR3    21:30717414-30717442    0.13591267
ENST00000345080 ENSG00000187772 (LIN28B)    hsa-let-7a-5p   1
UTR3    6:105526681-105526709   0.133514751"
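# Read the raw text line by line
txt <- readLines(textConnection(textfile))

# TRUE for transcript header lines (starting with ENST), FALSE for UTR3 lines
sepr <- grepl("^ENST.*", txt)
# Lengths of the FALSE runs = number of UTR3 lines under each header
# (assumes every header line is followed by at least one UTR3 line)
r <- rle(sepr)
r <- r$lengths[!r$values]

# Header lines: transcript id, gene id, "(gene name) miRNA name", miTG score
regex <- "(\\S+)\\s+(\\S+)\\s(\\([^)]+\\)\\s+\\S+)\\s+(\\d+)"
m <- regexec(regex, txt[sepr])
m1 <- as.data.frame(t(sapply(regmatches(txt[sepr], m), "[", 2:5)))
# Repeat each header row once per UTR3 line that follows it
m1 <- m1[rep(1:nrow(m1), r),]

# UTR3 lines: label, genomic position, MRE score
regex <- "(\\S+)\\s+(\\S+)\\s+(\\S+)"
m <- regexec(regex, txt[!sepr])
m2 <- as.data.frame(t(sapply(regmatches(txt[!sepr], m), "[", 2:4)))

# Combine both parts, dropping the constant "UTR3" label column
df <- cbind(m1, m2[,-1])
names(df) <- c("Transcript Id",    "Gene Id(name)",   "Mirna Name",        "miTG score",    "UTR3",        "MRE_score"   )
rownames(df) <- NULL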
df
# Transcript Id   Gene Id(name)                Mirna Name miTG score                  UTR3   MRE_score
# 1 ENST00000286800 ENSG00000156273     (BACH1) hsa-let-7a-5p          1  21:30717114-30717142  0.05994568
# 2 ENST00000286800 ENSG00000156273     (BACH1) hsa-let-7a-5p          1  21:30717414-30717442  0.13591267
# 3 ENST00000345080 ENSG00000187772 (LIN28B)    hsa-let-7a-5p          1 6:105526681-105526709 0.133514751
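
The result above still contains every UTR3 row, while the question asks for only the highest-scoring UTR3 per transcript. A minimal sketch of that final step might look like the following. It is not part of the original answer. The small `df` here is a hand-written stand-in for the data frame built above (only the columns needed for the selection), and `best` is a name introduced purely for illustration.

```r
# Stand-in for the `df` produced by the answer's code (assumption: same rows,
# reduced to the columns needed for the selection).
df <- data.frame(
  `Transcript Id` = c("ENST00000286800", "ENST00000286800", "ENST00000345080"),
  UTR3 = c("21:30717114-30717142", "21:30717414-30717442", "6:105526681-105526709"),
  MRE_score = c("0.05994568", "0.13591267", "0.133514751"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# regmatches() yields text, so convert the score before comparing.
# as.character() guards against factor columns in older R versions.
df$MRE_score <- as.numeric(as.character(df$MRE_score))

# Sort by transcript, then by descending score, and keep the first
# (highest-scoring) row of each transcript.
ord <- order(df$`Transcript Id`, -df$MRE_score)
best <- df[ord, ]
best <- best[!duplicated(best$`Transcript Id`), ]
rownames(best) <- NULL
best
```

Applied to the full `df` from the answer, the same three steps (convert, sort, drop duplicates) would leave one row per transcript with all six columns intact. If two UTR3 entries of a transcript tie on score, this sketch keeps whichever comes first after sorting.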
