regex - How do I edit names in R?

I have a variable called name that I want to use as the column names of a matrix, but before doing that I need to edit the names stored in name.

>name
[722] "TCGA-OL-A66N-01A-12R-A31S-13.isoform.quantification.txt"
[723] "TCGA-OL-A66O-01A-11R-A31S-13.isoform.quantification.txt"
[724] "TCGA-OL-A66P-01A-11R-A31S-13.isoform.quantification.txt"

I only want to keep the characters before the fourth "-".

Expected output:

  >name
    [722] "TCGA-OL-A66N-01A"
    [723] "TCGA-OL-A66O-01A"
    [724] "TCGA-OL-A66P-01A"

Can someone help me do this in R?

Best answer

If the lengths vary, so that a fixed character count (nchar) cannot be relied on, you can use str_split_fixed() from stringr.
stringr solution:

library(stringr)

name <- c(
    "TCGA-OL-A66N-01A-12R-A31S-13.isoform.quantification.txt",
    "TCGA-OL-A66O-01A-11R-A31S-13.isoform.quantification.txt",
    "TCGA-OL-A66P-01A-11R-A31S-13.isoform.quantification.txt")

apply(str_split_fixed(name,"-",5)[,1:4],1,paste0,collapse="-")

This gives you:

## "TCGA-OL-A66N-01A" "TCGA-OL-A66O-01A" "TCGA-OL-A66P-01A"
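For comparison, here is a minimal sketch of the fixed-width route that the answer sets aside. It assumes the part to keep is always exactly the first 16 characters, which holds for the three sample names but is not guaranteed for other data.

```r
name <- c(
    "TCGA-OL-A66N-01A-12R-A31S-13.isoform.quantification.txt",
    "TCGA-OL-A66O-01A-11R-A31S-13.isoform.quantification.txt",
    "TCGA-OL-A66P-01A-11R-A31S-13.isoform.quantification.txt")

# Assumption: the wanted prefix is exactly 16 characters wide in every element
short_fixed <- substr(name, 1, 16)
short_fixed
## "TCGA-OL-A66N-01A" "TCGA-OL-A66O-01A" "TCGA-OL-A66P-01A"
```

If any field before the fourth "-" changes width, this silently cuts in the wrong place, which is why the splitting approach is safer.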

Explanation:

str_split_fixed(name,"-",5)

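Splits each element of the vector name into 5 pieces at the first 4 occurrences of "-"; the fifth piece holds whatever is left over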

[,1:4]

Keeps the first 4 pieces of each element of name (the first 4 columns of the resulting matrix)

apply(...,1,paste0,collapse="-")

Pastes them back together with "-" to rebuild each name (row by row)
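The three steps can be run one at a time to see the intermediate values. This is only an illustrative sketch: the variable names parts, kept and short are made up here, and drop = FALSE is an addition that is not in the original answer. It keeps the subset a matrix so that apply() still works if name happens to contain a single element.

```r
library(stringr)

name <- c(
    "TCGA-OL-A66N-01A-12R-A31S-13.isoform.quantification.txt",
    "TCGA-OL-A66O-01A-11R-A31S-13.isoform.quantification.txt",
    "TCGA-OL-A66P-01A-11R-A31S-13.isoform.quantification.txt")

# Step 1: split into 5 pieces; the 5th piece is the unsplit remainder
parts <- str_split_fixed(name, "-", 5)
dim(parts)   # 3 rows (one per name), 5 columns
parts[1, ]
## "TCGA" "OL" "A66N" "01A" "12R-A31S-13.isoform.quantification.txt"

# Step 2: keep the first 4 columns (drop = FALSE is an added safeguard)
kept <- parts[, 1:4, drop = FALSE]

# Step 3: rejoin each row with "-"
short <- apply(kept, 1, paste0, collapse = "-")
short
## "TCGA-OL-A66N-01A" "TCGA-OL-A66O-01A" "TCGA-OL-A66P-01A"
```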

But what if I have a lot of names?

Here I compare the stringr + apply() approach with @BondedDust's regex approach (sub(), labelled grep_ninja below) and a base strsplit approach.

First, let's scale this up to roughly ten thousand names (3 names repeated 3,334 times, so 10,002):

name <- rep(name,3.334e3)

Then run a microbenchmark:

library(microbenchmark)

microbenchmark(
  stringr_apply = apply(str_split_fixed(name, "-", 5)[, 1:4], 1, paste0, collapse = "-"),
  grep_ninja = sub("^([^-]*[-][^-]*[-][^-]*[-][^-]*)([-].*$)", "\\1", name),
  strsplit = sapply(lapply(strsplit(name, "\\-"), "[", 1:4), paste, collapse = "-"),
  times = 25)

which gives:

# Unit: milliseconds
#           expr       min       lq    median        uq       max neval
#  stringr_apply 845.44542 874.5674 899.27849 941.22628 976.88903    25
#     grep_ninja  25.51796  25.7066  25.85404  25.95922  27.89165    25
#       strsplit 115.10626 123.2645 126.45171 130.10334 147.39517    25

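It looks like base pattern matching/replacement scales much better: here it saves a little under a second (median of about 26 ms versus about 899 ms), which is more than 30 times faster than the slowest approach.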
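Timings only matter if the approaches agree, so a quick consistency check in base R may be worthwhile. The sketch below also tries a shorter way of writing the same regex using a repetition count. That compact pattern is a suggestion made here, not part of the original answers, and the variable names are made up for illustration.

```r
name <- c(
    "TCGA-OL-A66N-01A-12R-A31S-13.isoform.quantification.txt",
    "TCGA-OL-A66O-01A-11R-A31S-13.isoform.quantification.txt",
    "TCGA-OL-A66P-01A-11R-A31S-13.isoform.quantification.txt")

# The pattern used in the benchmark: four "-"-free fields written out by hand
long_pattern <- sub("^([^-]*[-][^-]*[-][^-]*[-][^-]*)([-].*$)", "\\1", name)

# Hypothetical compact form: three "field-" groups, then a fourth field
short_pattern <- sub("^(([^-]*-){3}[^-]*).*", "\\1", name)

# The base strsplit approach from the benchmark
by_split <- sapply(lapply(strsplit(name, "\\-"), "[", 1:4), paste, collapse = "-")

# Both comparisons should be TRUE if the approaches agree on this data
identical(long_pattern, short_pattern)
identical(long_pattern, by_split)
```

Note that the regex versions leave a string unchanged when it contains fewer than three "-", whereas the strsplit version pads the missing fields with NA, so the two can differ on malformed input.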

Regarding "regex - How do I edit names in R?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/23918200/