Problem description
I have a character vector of stock tickers in which each ticker name is joined to the country where that ticker is based, in the form country_name/ticker_name. I am trying to split each string and delete everything up to and including the '/', returning a character vector of only the ticker names. Here is an example vector:
sample_string <- c('US/SPY', 'US/AOL', 'US/MTC', 'US/PHA', 'US/PZI',
'US/AOL', 'US/BRCM')
My initial thought was to use the stringr library. I don't really have any experience with that package, but here is what I was trying:
library(stringr)
split_string <- str_split(sample_string, '/')
But I was unsure how to return only the second element of each list entry as a single vector.
How would I do this over a large character vector (~105 million entries)?
Recommended answer
Here are some benchmarks covering all the methods suggested by @David Arenburg, plus another method using str_extract from the stringr package.
sample_string <- rep(sample_string, 1000000)
library(data.table); library(stringr)
s1 <- function() sub(".*/(.*)", "\\1", sample_string)
s2 <- function() sub(".*/", "", sample_string)
s3 <- function() str_extract(sample_string, "(?<=/)(.*)")
s4 <- function() tstrsplit(sample_string, "/", fixed = TRUE)[[2]]
length(sample_string)
# [1] 7000000
identical(s1(), s2())
# [1] TRUE
identical(s1(), s3())
# [1] TRUE
identical(s1(), s4())
# [1] TRUE
microbenchmark::microbenchmark(s1(), s2(), s3(), s4(), times = 5)
# Unit: seconds
# expr min lq mean median uq max neval
# s1() 3.916555 3.917370 4.046708 3.923246 3.925184 4.551184 5
# s2() 3.584694 3.593755 3.726922 3.610284 3.646449 4.199426 5
# s3() 3.051398 3.062237 3.354410 3.138080 3.722347 3.797985 5
# s4() 1.908283 1.964223 2.349522 2.117521 2.760612 2.996971 5
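The tstrsplit method is the fastest, with a median of about 2.1 seconds against roughly 3.1 to 3.9 seconds for the regex-based methods.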
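As a side note on the original question of how to pull the second element out of each split result: the following is a minimal base R sketch, not one of the benchmarked methods. It assumes every entry contains exactly one '/', and it uses strsplit with vapply instead of stringr or data.table so that it runs without extra packages. The variable names parts and tickers are illustrative only.

```r
# Example vector from the question
sample_string <- c("US/SPY", "US/AOL", "US/MTC", "US/PHA", "US/PZI",
                   "US/AOL", "US/BRCM")

# strsplit returns a list with one character vector per input string;
# fixed = TRUE treats "/" as a literal string rather than a regex
parts <- strsplit(sample_string, "/", fixed = TRUE)

# Take the second piece of each list element and collapse to a character vector
tickers <- vapply(parts, function(p) p[2], character(1))

print(tickers)
```

tstrsplit does the same extraction in one step by transposing the split result, so that [[2]] is already the vector of second pieces.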
Update:
Adding another method from @Frank. This comparison is not strictly fair, because the result depends on the actual data: when there are many duplicated values, as in the sample_string produced above, the advantage is quite obvious:
s5 <- function() setDT(list(sample_string))[, v := tstrsplit(V1, "/", fixed = TRUE)[[2]], by=V1]$v
identical(s1(), s5())
# [1] TRUE
microbenchmark::microbenchmark(s1(), s2(), s3(), s4(), s5(), times = 5)
# Unit: milliseconds
# expr min lq mean median uq max neval
# s1() 3905.97703 3913.264 3922.8540 3913.4035 3932.2680 3949.3575 5
# s2() 3568.63504 3576.755 3713.7230 3660.5570 3740.8252 4021.8426 5
# s3() 3029.66877 3032.898 3061.0584 3052.6937 3086.9714 3103.0604 5
# s4() 1322.42430 1679.475 1985.5440 1801.9054 1857.8056 3266.1101 5
# s5() 82.71379 101.899 177.8306 121.6682 209.0579 373.8141 5
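The speed-up in s5() comes from doing the string work once per distinct value rather than once per element. A minimal base R sketch of the same idea follows, assuming the data contains many repeated strings. The helper name strip_prefix_unique is hypothetical, and this variant was not part of the benchmark above, so no timing is claimed for it.

```r
# Smaller repeated sample so the sketch runs quickly
sample_string <- rep(c("US/SPY", "US/AOL", "US/MTC", "US/PHA", "US/PZI",
                       "US/AOL", "US/BRCM"), 1000)

# Hypothetical helper: clean each distinct string once, then map back
strip_prefix_unique <- function(x) {
  ux <- unique(x)                # distinct input strings
  cleaned <- sub(".*/", "", ux)  # run the regex only on the distinct values
  cleaned[match(x, ux)]          # expand back to the original length and order
}

result <- strip_prefix_unique(sample_string)
head(result)
```

If nearly every entry is unique, the unique and match steps add overhead without saving any regex work, so the gain depends on how heavily the real data repeats.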