Checking for (whole-only) words in a string


Question

I am training on Checkio. The task is called Popular Words. The goal is to search a given string for words from a list (of strings).

For example:

textt="When I was One I had just begun When I was Two I was nearly new"

wwords=['i', 'was', 'three', 'near']

My code is as follows:

def popular_words(text: str, words: list) -> dict:
    # your code here

    occurence={}
    text=text.lower()


    for i in words:
        occurence[i]=(text.count(i))

    # incorrectly takes "nearly" as "near"


    print(occurence)
    return(occurence)

popular_words(textt,wwords)

This works almost fine, returning

{'i': 4, 'was': 3, 'three': 0, 'near': 1}

thus counting "near" as part of "nearly". That trap was obviously the author's intention. However, I cannot find a way to get around it other than

"search for words that are not first (index 0) or last (last index) and for these that begin/end with whitespace"

May I ask for help, please? Ideally by building upon this rather childish code.

Recommended answer

You'd be better off splitting your sentence and then counting the words, not the substrings:

textt="When I was One I had just begun When I was Two I was nearly new"
wwords=['i', 'was', 'three', 'near']
text_words = textt.lower().split()
result = {w:text_words.count(w) for w in wwords}

print(result)

This prints:

{'three': 0, 'i': 4, 'near': 0, 'was': 3}
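Folded back into the function from the question, the same idea might look like the sketch below. It keeps the original popular_words signature and assumes that words in the text are separated only by whitespace and that the entries in words are already lowercase.

```python
def popular_words(text: str, words: list) -> dict:
    # Split on whitespace so that only whole words are compared.
    text_words = text.lower().split()
    # list.count matches whole list items, so "nearly" is not counted as "near".
    return {w: text_words.count(w) for w in words}


textt = "When I was One I had just begun When I was Two I was nearly new"
wwords = ['i', 'was', 'three', 'near']
print(popular_words(textt, wwords))
```

The only change from the question's code is that count is called on the list of words instead of on the string itself.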

If the text contains punctuation, you're better off using a regular expression to split the string on non-alphanumeric characters:

import re

textt="When I was One, I had just begun.I was Two when I was nearly new"

wwords=['i', 'was', 'three', 'near']
text_words = re.split(r"\W+",textt.lower())
result = {w:text_words.count(w) for w in wwords}

Result:

{'was': 3, 'near': 0, 'three': 0, 'i': 4}

(Another alternative is to use findall on word characters: text_words = re.findall(r"\w+",textt.lower()))
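One practical difference between the two calls, shown as a small sketch with a made-up sample sentence: when the text begins or ends with punctuation, re.split leaves an empty string in the list, while re.findall returns only the words. For counting specific words this makes no difference, but it matters if the list is reused elsewhere.

```python
import re

# Hypothetical sample that ends with punctuation.
sample = "When I was One, I had just begun."

# re.split leaves an empty string where the text ends with a separator.
split_words = re.split(r"\W+", sample.lower())
# re.findall returns only the runs of word characters.
found_words = re.findall(r"\w+", sample.lower())

print(split_words)
print(found_words)
```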

Now, if your list of "important" words is big, it may be better to count all the words once and filter afterwards, using the classic collections.Counter:

import collections

text_words = collections.Counter(re.split(r"\W+",textt.lower()))
result = {w:text_words[w] for w in wwords}  # indexing gives 0 for absent words
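Put together as a complete function, the Counter approach could be sketched as follows. It reuses the popular_words name and signature from the question and swaps in re.findall for tokenizing. Note that indexing a Counter returns 0 for a word that never appears, whereas .get() without a default returns None.

```python
import collections
import re


def popular_words(text: str, words: list) -> dict:
    # Count every word in the text in a single pass.
    counts = collections.Counter(re.findall(r"\w+", text.lower()))
    # Indexing a Counter yields 0 for missing keys instead of raising KeyError.
    return {w: counts[w] for w in words}


textt = "When I was One, I had just begun.I was Two when I was nearly new"
wwords = ['i', 'was', 'three', 'near']
print(popular_words(textt, wwords))
```

This scans the text once regardless of how many words are looked up, whereas calling list.count for each word rescans the whole list every time.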
