


I have a text file of a presidential debate. Eventually, I want to parse the text into a dataframe where each row is a statement, with one column with the speaker's name and another column with the statement. For example:

"Bob Smith: Hi Steve. How are you doing? Steve Brown: Hi Bob. I'm doing well!"


  name         text
1 Bob Smith    Hi Steve. How are you doing?
2 Steve Brown  Hi Bob. I'm doing well!


Question: How do I split the statements from the names? I tried splitting on the colon:

data <- strsplit(data, split=":")

But that gives me:

"Bob Smith" "Hi Steve. How are you doing? Steve Brown" "Hi Bob. I'm doing well!"

What I want is:

"Bob Smith" "Hi Steve. How are you doing?" "Steve Brown" "Hi Bob. I'm doing well!"


I doubt this will fix all of your parsing needs, but one way to solve your most immediate question with strsplit is to use lookaround assertions. You'll need Perl-compatible regex (perl=TRUE) for that, though.

Here you instruct strsplit to split on either a colon followed by a space, or a space that has a punctuation character immediately before it and nothing but word characters or spaces between it and the next colon. \\pP matches punctuation characters and \\w matches word characters (letters, digits, and underscore).

data <- "Bob Smith: Hi Steve. How are you doing? Steve Brown: Hi Bob. I'm doing well!"
strsplit(data,split="(: |(?<=\\pP) (?=[\\w ]+:))",perl=TRUE)
[1] "Bob Smith"                    "Hi Steve. How are you doing?" "Steve Brown"
[4] "Hi Bob. I'm doing well!"
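To get from that vector to the data frame you described, here is a minimal sketch. It assumes the pieces strictly alternate name, statement, name, statement; the variable names parts and debate are just placeholders I made up.

```r
# Sample input from the question
data <- "Bob Smith: Hi Steve. How are you doing? Steve Brown: Hi Bob. I'm doing well!"

# strsplit returns a list with one element per input string, so take the first
parts <- strsplit(data, split = "(: |(?<=\\pP) (?=[\\w ]+:))", perl = TRUE)[[1]]

# Assumption: odd positions are names, even positions are statements.
# The logical index c(TRUE, FALSE) is recycled to pick every other element.
debate <- data.frame(
  name = parts[c(TRUE, FALSE)],
  text = parts[c(FALSE, TRUE)],
  stringsAsFactors = FALSE
)

debate
```

The alternation assumption breaks as soon as a statement contains something like "Here is the thing: ..." after a sentence end, because that phrase would be picked up as a name. It is worth checking that length(parts) is even before building the data frame.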


