Problem description
I have a transcript, and to analyze each speaker I need to collect only that speaker's words into a string. The problem is that not every line starts with the speaker's name. Here's a snippet of my text file:
BOB: blah blah blah blah
blah hello goodbye etc.
JERRY:.............................................
...............
BOB:blah blah blah
blah blah blah
blah.
I want to collect only the words said by the chosen speaker (in this case BOB), add them to a string, and exclude the words from JERRY and any other speakers. Any ideas?
Edit: There are line breaks between paragraphs and before any new speaker starts.
Recommended answer
A regex is a good fit here. Because the same pattern is matched against every line, you can save a bit of processing by compiling it once up front. The idea is to remember the most recent speaker label, so that lines without a label are credited to whoever spoke last.
import re

speaker_words = {}
# Compile once: a speaker label, a colon, then the rest of the line.
speaker_pattern = re.compile(r'^(\w+?):(.*)$')

with open("transcript.txt", "r") as f:
    lines = f.readlines()

current_speaker = None
for line in lines:
    line = line.strip()
    match = speaker_pattern.match(line)
    if match is not None:
        # A new speaker starts here; keep only the text after the colon.
        current_speaker = match.group(1)
        line = match.group(2).strip()
        if current_speaker not in speaker_words:
            speaker_words[current_speaker] = []
    if current_speaker:
        # you may want to do some sort of punctuation filtering too
        words = [word.strip() for word in line.split(' ') if len(word.strip()) > 0]
        speaker_words[current_speaker].extend(words)

print(speaker_words)
For the snippet above, this outputs the following (reformatted here for readability):
{
"BOB": ['blah', 'blah', 'blah', 'blah', 'blah', 'hello', 'goodbye', 'etc.', 'blah', 'blah', 'blah', 'blah', 'blah', 'blah', 'blah.'],
"JERRY": ['.............................................', '...............']
}
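Since the question asks for a single string for one chosen speaker, the same approach can be narrowed down. The following is a minimal sketch, not part of the original answer: the function name `words_for_speaker` and the inline `sample` text are made up for illustration, and JERRY's lines are stand-in words rather than the dots from the snippet.

```python
import re

# Same pattern as in the answer: a speaker label, a colon, then the rest of the line.
SPEAKER_PATTERN = re.compile(r'^(\w+?):(.*)$')


def words_for_speaker(lines, speaker):
    """Return everything `speaker` said as one space-separated string.

    Hypothetical helper: lines without a label are credited to the most
    recent speaker, exactly as in the loop above.
    """
    collected = []
    current_speaker = None
    for line in lines:
        line = line.strip()
        match = SPEAKER_PATTERN.match(line)
        if match is not None:
            current_speaker = match.group(1)
            line = match.group(2)
        if current_speaker == speaker:
            # split() with no argument drops empty strings and extra whitespace
            collected.extend(line.split())
    return " ".join(collected)


# Made-up sample in the same shape as the transcript snippet.
sample = """BOB: blah blah blah blah
blah hello goodbye etc.

JERRY: one two
three

BOB:blah blah blah
blah blah blah
blah.
"""

bob_text = words_for_speaker(sample.splitlines(), "BOB")
print(bob_text)
```

To read from a file instead, pass the open file object (or `f.readlines()`) as `lines`. Note that the comparison is case-sensitive, so "bob" would not match the label "BOB" unless you normalize the case first.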