Problem description
I have data imported from a .csv file. The first column contains character strings that contain text within parentheses. The data look like:
symbol
___________________________________________
1 | Apollo Senior Floating Rate Fund Inc. (AFT)
2 | Apollo Tactical Income Fund Inc. (AIF)
3 | Altra Industrial Motion Corp. (AIMC)
4 | Allegion plc (ALLE)
5 | Amphenol Corporation (APH)
6 | Ares Management Corporation (ARES)
7 | ARMOUR Residential REIT, Inc. (ARR)
8 | Banc of California, Inc. (BANC)
9 | BlackRock Resources (BCX)
10| Belden Inc (BDC)
...
I need to convert that column of data into a list such as:
symbol2
___________________________________________
1 | AFT
2 | AIF
3 | AIMC
4 | ALLE
5 | APH
6 | ARES
7 | ARR
8 | BANC
9 | BCX
10| BDC
...
My ultimate goal is to get a single character string in which the pieces of text enclosed in parentheses are separated by ";", like this:
"AFT;AIF;AIMC;ALLE;APH;ARES;ARR;BANC;BCX;BDC;..."
I could use
paste(symbol2, collapse = ";")
but I can't figure out how to isolate the desired text.
I've tried everything listed here (extract a substring in R according to a pattern) by replacing the ":" with "(" and could not get anything to work. I tried:
gsub("(?<=\\()[^()]*(?=\\))(*SKIP)(*F)|.", "", symbol, perl=T)
as recommended here (Extract text in parentheses in R), but the output is
"c(4, 5, 2, 1, 3, 6, 7, 8, 17, 9,...)"
Any help?
Recommended answer
1) read.table Use read.table with the indicated sep and comment values (comment partially matches the comment.char argument) to get a two-column data frame in which the first column holds the names and the second column holds the symbols. Finally, take that second column and collapse it into a single string. No packages or regular expressions are used.
DF2 <- read.table(text = unlist(DF), sep = "(", comment = ")")
paste(DF2[[2]], collapse = ";")
## [1] "AFT;AIF;AIMC;ALLE;APH;ARES;ARR;BANC;BCX;BDC"
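To make the intermediate step visible, here is a minimal sketch of the same idea on a hypothetical two-element vector `x` (not the original `DF`). The names `x`, `parts`, and `symbols` are illustrative only, and `colClasses = "character"` is an added assumption to keep symbols such as "T" or "F" from being converted to logical values.

```r
# Hypothetical two-row input; the real data would come from DF$symbol.
x <- c("Allegion plc (ALLE)", "Belden Inc (BDC)")

# "(" separates the name from the symbol; ")" starts a comment, so it and
# anything after it on the line are discarded.
parts <- read.table(text = x, sep = "(", comment.char = ")",
                    strip.white = TRUE, colClasses = "character")

parts$V1  # company names
parts$V2  # ticker symbols

# Collapse the symbol column into one ";"-separated string.
symbols <- paste(parts$V2, collapse = ";")
symbols
```

If the names may contain apostrophes or double quotes, passing `quote = ""` as well is probably worthwhile, since read.table treats both as quoting characters by default.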
2) dplyr We can use separate from tidyr to split the name and symbol columns, dropping the name column at the same time. Then unlist the result and collapse it into a single string. tidyr 0.8.2 or later must be used.
library(dplyr)
library(tidyr)
DF %>%
separate(symbol, c(NA, "symbol2"), "[()]", extra = "drop") %>%
unlist %>%
paste(collapse = ";")
## [1] "AFT;AIF;AIMC;ALLE;APH;ARES;ARR;BANC;BCX;BDC"
3) gsub We can match everything up to and including (, i.e. ".*\\(", and also everything from ) onwards, i.e. "\\).*", and replace both matches with the empty string. Then collapse as before.
paste(gsub(".*\\(|\\).*", "", DF$symbol), collapse = ";")
## [1] "AFT;AIF;AIMC;ALLE;APH;ARES;ARR;BANC;BCX;BDC"
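As a worked illustration of what each branch of that pattern removes, the following sketch applies the two halves one at a time to a single string. The variable names and the second sample string are hypothetical and are not part of the original data.

```r
# One sample value from the question's data.
x <- "ARMOUR Residential REIT, Inc. (ARR)"

# First branch: delete everything up to and including the last "(".
after_open <- sub(".*\\(", "", x)        # leaves "ARR)"

# Second branch: delete the ")" and anything that follows it.
symbol <- sub("\\).*", "", after_open)   # leaves "ARR"

# Both branches in one call, as in the answer.
combined <- gsub(".*\\(|\\).*", "", x)

# Because ".*" is greedy, only the last parenthesised group survives when a
# name contains an earlier one (made-up example).
nested <- gsub(".*\\(|\\).*", "", "Foo (Bar) Inc. (FOO)")
```

The greedy match is what makes this safe for names that themselves contain parentheses, as long as the symbol is the final parenthesised group.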
4) trimws This is another base solution. It requires R 3.6.0 or later (r-devel at the time of writing). We define whitespace as anything other than parentheses and use trimws to trim it away. Then we define whitespace as the parentheses themselves and trim those away. That leaves us with the symbols, which we can now collapse. In the call, white partially matches the whitespace argument.
paste(trimws(trimws(DF$symbol, white = "[^()]"), white = "[()]"), collapse = ";")
## [1] "AFT;AIF;AIMC;ALLE;APH;ARES;ARR;BANC;BCX;BDC"
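The two trimming passes can be seen separately in this short sketch, which assumes R 3.6.0 or later (where trimws has the whitespace argument). The names `x`, `step1`, and `step2` are illustrative only.

```r
# One sample value from the question's data.
x <- "Banc of California, Inc. (BANC)"

# Pass 1: treat every non-parenthesis character as "whitespace".
# Trimming from both ends stops at the first "(" and the last ")".
step1 <- trimws(x, whitespace = "[^()]")      # "(BANC)"

# Pass 2: treat the parentheses themselves as "whitespace".
step2 <- trimws(step1, whitespace = "[()]")   # "BANC"
```

Note that trimming only works from the ends inward, so this relies on the symbol being the last parenthesised group in the string.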
Note
The input in reproducible form is:
Lines <- "
symbol
1 | Apollo Senior Floating Rate Fund Inc. (AFT)
2 | Apollo Tactical Income Fund Inc. (AIF)
3 | Altra Industrial Motion Corp. (AIMC)
4 | Allegion plc (ALLE)
5 | Amphenol Corporation (APH)
6 | Ares Management Corporation (ARES)
7 | ARMOUR Residential REIT, Inc. (ARR)
8 | Banc of California, Inc. (BANC)
9 | BlackRock Resources (BCX)
10| Belden Inc (BDC)"
DF <- read.table(text = Lines, sep = "|", strip.white = TRUE, as.is = TRUE)