I have read in the corpus I am using:
from nltk.corpus import PlaintextCorpusReader

file_directory = 'path'
my_corpus = PlaintextCorpusReader(file_directory, '.*', encoding='latin1')
Then I preprocess it:
import re
import nltk
from nltk.corpus import stopwords

# fids was not defined in the post; assuming it holds every file id in the corpus
fids = my_corpus.fileids()
totalwords = my_corpus.words()
docs = [my_corpus.words(f) for f in fids]
docs2 = [[w.lower() for w in doc] for doc in docs]
docs3 = [[w for w in doc if re.search('^[a-z]+$', w)] for doc in docs2]
stop_list = stopwords.words('english')
docs4 = [[w for w in doc if w not in stop_list] for doc in docs3]
wordscount = [w for doc in docs4 for w in doc]
fd_dist_total = nltk.FreqDist(wordscount)
# common_words (the number of entries to print) is defined elsewhere
print(fd_dist_total.most_common(common_words))
The output I get is:
words = [('ubs', 131), ('pacific', 130), ('us', 121), ('credit', 113), ('aum', 108), ('suisse', 102), ('asia', 98), ('arm', 95)]
I would like to know whether I can replace the 102 occurrences of 'suisse' with 'credit-suisse' and, likewise, replace 'asia' with 'asia-pacific'.
Expected output:
words = [('credit-suisse', 102), ('credit', 11) , ('pacific', 32), ('asia-pacific', 98)]
I tried using:
wordscount1 = [w.replace('asia','asia-pacific').replace('suisse', 'credit-suisse') for w in wordscount]
But I run into obvious errors.
Please guide me.
Best answer
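This is underspecified, because we do not know how to ensure that, for example, count('credit') >= count('suisse'), so that the remaining count is not negative. In particular, you want:
'credit-suisse' in place of 'suisse', keeping 'credit' (the first term) as credit minus suisse;
but at the same time you want 'asia-pacific' in place of 'asia', keeping 'pacific' (the second term) as pacific minus asia
(the opposite of the first case).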
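To make the asymmetry concrete, here is a minimal sketch that reproduces the two subtractions implied by the expected output in the question. The counts are copied from the question; the variable names are only illustrative.

```python
# Counts taken from the output shown in the question.
counts = {'credit': 113, 'suisse': 102, 'asia': 98, 'pacific': 130}

# Case 1: the compound takes the count of the SECOND term ('suisse'),
# and the first term ('credit') keeps the remainder.
credit_left = counts['credit'] - counts['suisse']    # 113 - 102

# Case 2: the compound takes the count of the FIRST term ('asia'),
# and the second term ('pacific') keeps the remainder.
pacific_left = counts['pacific'] - counts['asia']    # 130 - 98

print(credit_left, pacific_left)  # 11 32
```

Both results match the expected output, but each case subtracts in a different direction, which is why a single rule cannot be inferred from the example alone.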
You definitely have to clarify that requirement. Maybe your replacement terms are ordered in some way? In any case, as a starting point:
words = [('ubs', 131), ('pacific', 130), ('us', 121),
         ('credit', 113), ('aum', 108), ('suisse', 102),
         ('asia', 98), ('arm', 95)]
d = dict(words)
for terms in (('credit', 'suisse'), ('asia', 'pacific')):
    v1 = d.get(terms[1])
    if v1:
        d['-'.join(terms)] = v1
        v0 = d.get(terms[0], 0)
        d[terms[0]] = v0 - v1  # how to handle zero or negative values here?
        # it is unclear if it should be v1-v0 or v0-v1
        # or even abs(v0-v1)
from pprint import pprint
pprint(d)
pprint(d.items())
Producing:
sh$ python3 p.py
{'arm': 95,
'asia': -32, # <- notice that value
'asia-pacific': 130,
'aum': 108,
'credit': 11, # <- and this one
'credit-suisse': 102,
'pacific': 130,
'suisse': 102,
'ubs': 131,
'us': 121}
dict_items([('us', 121), ('suisse', 102), ('aum', 108), ('arm', 95),
('asia-pacific', 130), ('ubs', 131), ('asia', -32),
('credit', 11), ('credit-suisse', 102), ('pacific', 130)])
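If the underlying requirement is that the compound should be counted only where the two words actually occur next to each other, one alternative is to merge adjacent tokens before counting instead of adjusting the totals afterwards. The sketch below assumes that interpretation; `merge_pairs` is a hypothetical helper, and the short token list merely stands in for the question's `wordscount`.

```python
from collections import Counter

def merge_pairs(tokens, pairs, sep='-'):
    """Join adjacent tokens (a, b) into 'a-b' whenever (a, b) is in pairs."""
    pairs = set(pairs)
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in pairs:
            merged.append(tokens[i] + sep + tokens[i + 1])
            i += 2  # skip both halves of the compound
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Hypothetical token list standing in for `wordscount`.
tokens = ['credit', 'suisse', 'asia', 'pacific', 'credit', 'pacific', 'ubs']
merged = merge_pairs(tokens, [('credit', 'suisse'), ('asia', 'pacific')])
counts = Counter(merged)
print(counts.most_common())
```

With this approach the standalone counts of 'credit' and 'pacific' drop automatically and can never go negative. Two caveats: it is probably safer to apply it per document (each list in `docs4`) than to the flattened list, so that words from neighbouring files are not joined, and removing stop words beforehand can make words adjacent that were not adjacent in the original text.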
On the topic of "python - replacing multiple words in a list", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/29071320/