Problem description
I have a huge .csv file:
Transcript Id Gene Id(name) Mirna Name miTG score
ENST00000286800 ENSG00000156273 (BACH1) hsa-let-7a-5p 1
UTR3 21:30717114-30717142 0.05994568
UTR3 21:30717414-30717442 0.13591267
ENST00000345080 ENSG00000187772 (LIN28B) hsa-let-7a-5p 1
UTR3 6:105526681-105526709 0.133514751
I want to build a matrix like this from it:
Transcript Id Gene Id(name) Mirna Name miTG score UTR3 MRE_score
ENST00000286800 ENSG00000156273 (BACH1) hsa-let-7a-5p 1 21:30717414-30717442 0.13591267
I want to add three new columns to my new matrix, called UTR3, MRE_score and CDS.
For every Gene ID (for example ENST00000286800), there are several UTR3 rows in the original matrix (here two UTR3 rows for ENST00000286800 and one for ENST00000345080). We choose the UTR3 with the highest score in the third column. In the new matrix, the value of UTR3 for every Gene ID will be the value in the second column of that row in the original matrix.
Can anybody help me reshape this data and build my new matrix?
Recommended answer
You could try to structure the CSV using regular expressions:
textfile <- "ENST00000286800 ENSG00000156273 (BACH1) hsa-let-7a-5p 1
UTR3 21:30717114-30717142 0.05994568
UTR3 21:30717414-30717442 0.13591267
ENST00000345080 ENSG00000187772 (LIN28B) hsa-let-7a-5p 1
UTR3 6:105526681-105526709 0.133514751"
txt <- readLines(textConnection(textfile))
# TRUE for transcript lines, FALSE for the UTR3 detail lines that follow them
sepr <- grepl("^ENST", txt)
# Index of the transcript line that each detail line belongs to
grp <- cumsum(sepr)[!sepr]
# Transcript lines: transcript id, gene id with "(name)", miRNA name, miTG score
regex <- "(\\S+)\\s+(\\S+\\s+\\([^)]+\\))\\s+(\\S+)\\s+(\\S+)"
m <- regexec(regex, txt[sepr])
m1 <- as.data.frame(t(sapply(regmatches(txt[sepr], m), "[", 2:5)))
# Repeat each transcript row once per detail line
m1 <- m1[grp, ]
# Detail lines: region label, coordinates, MRE score
regex <- "(\\S+)\\s+(\\S+)\\s+(\\S+)"
m <- regexec(regex, txt[!sepr])
m2 <- as.data.frame(t(sapply(regmatches(txt[!sepr], m), "[", 2:4)))
# Drop the constant "UTR3" label and keep coordinates and score
df <- cbind(m1, m2[, -1])
names(df) <- c("Transcript Id", "Gene Id(name)", "Mirna Name", "miTG score", "UTR3", "MRE_score")
rownames(df) <- NULL
df
#     Transcript Id            Gene Id(name)    Mirna Name miTG score                  UTR3   MRE_score
# 1 ENST00000286800  ENSG00000156273 (BACH1) hsa-let-7a-5p          1  21:30717114-30717142  0.05994568
# 2 ENST00000286800  ENSG00000156273 (BACH1) hsa-let-7a-5p          1  21:30717414-30717442  0.13591267
# 3 ENST00000345080 ENSG00000187772 (LIN28B) hsa-let-7a-5p          1 6:105526681-105526709 0.133514751
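The data frame above still holds every UTR3 row, whereas the question asks for only the highest-scoring one per transcript. That last step might look like the following base R sketch. It assumes the column names used above and that MRE_score was parsed as text. The `df` here is rebuilt by hand from the sample so the snippet runs on its own, and `best` is only an illustrative name.

```r
# Rows as produced by the parsing step (values copied from the sample data)
df <- data.frame(
  "Transcript Id" = c("ENST00000286800", "ENST00000286800", "ENST00000345080"),
  "Gene Id(name)" = c("ENSG00000156273 (BACH1)", "ENSG00000156273 (BACH1)",
                      "ENSG00000187772 (LIN28B)"),
  "Mirna Name"    = "hsa-let-7a-5p",
  "miTG score"    = "1",
  "UTR3"          = c("21:30717114-30717142", "21:30717414-30717442",
                      "6:105526681-105526709"),
  "MRE_score"     = c("0.05994568", "0.13591267", "0.133514751"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Assumption: scores arrive as text, so convert them before comparing
df$MRE_score <- as.numeric(df$MRE_score)

# Sort by transcript and descending score, then keep the first row per transcript
ord  <- order(df[["Transcript Id"]], -df$MRE_score)
best <- df[ord, ]
best <- best[!duplicated(best[["Transcript Id"]]), ]
rownames(best) <- NULL
best
```

For ENST00000286800 this keeps the 21:30717414-30717442 row, which matches the example row in the question. The sample contains no CDS lines, so the CDS column is not covered here.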