I have a pandas DataFrame with a single column named "texts", as shown below.

texts
throne one
bar one
foo two
bar three
foo two
bar two
foo one
foo three
one three

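For each row, I want to count how many times the three words 'one', 'two', 'three' occur, counting a match only when it is a complete word. The expected output is shown below.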
    texts   counts
    throne one  1
    bar one     1
    foo two     1
    bar three   1
    foo two     1
    bar two     1
    foo one     1
    foo three   1
    one three   2

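As you can see, the count for the first row is 1, because "throne" is not treated as one of the searched values: the "one" inside it is not a complete word, only part of "throne".
Any help?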

Best answer

Use pd.Series.str.count with a regex built by joining the words with '|'

words = 'one two three'.split()

df.assign(counts=df.texts.str.count('|'.join(words)))

        texts  counts
0  throne one       2
1     bar one       1
2     foo two       1
3   bar three       1
4     foo two       1
5     bar two       1
6     foo one       1
7   foo three       1
8   one three       2
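The first row comes out as 2 because the plain alternation also matches inside longer words. A minimal sketch using only the standard `re` module (the variable names here are illustrative, not from the answer) shows where the two matches in 'throne one' fall:

```python
import re

# Same alternation the answer builds: 'one|two|three'
pattern = '|'.join('one two three'.split())

# Collect the (start, end) span of every match in the first row's text
spans = [m.span() for m in re.finditer(pattern, 'throne one')]
print(spans)  # [(3, 6), (7, 10)]: the first match is the 'one' inside 'throne'
```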

To stop 'throne' from being counted, we can add word boundaries (\b) to the regex
words = 'one two three'.split()

df.assign(counts=df.texts.str.count('|'.join(map(r'\b{}\b'.format, words))))

        texts  counts
0  throne one       1
1     bar one       1
2     foo two       1
3   bar three       1
4     foo two       1
5     bar two       1
6     foo one       1
7   foo three       1
8   one three       2
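For anyone who wants to run this end to end, here is a self-contained sketch. The DataFrame construction is an assumption based on the sample data in the question; the counting line is the word-boundary approach shown above.

```python
import pandas as pd

# Rebuild the sample data from the question (assumed construction)
df = pd.DataFrame({'texts': [
    'throne one', 'bar one', 'foo two', 'bar three', 'foo two',
    'bar two', 'foo one', 'foo three', 'one three',
]})

words = 'one two three'.split()

# Wrap each word in \b...\b so it only matches as a whole word,
# then join the alternatives with '|'
pattern = '|'.join(r'\b{}\b'.format(w) for w in words)

result = df.assign(counts=df.texts.str.count(pattern))
print(result)
```

Note that `assign` returns a new DataFrame and leaves `df` unchanged, so the result has to be bound to a name if you want to keep it.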

For flair, use the raw form of an f-string (available from Python 3.6)
words = 'one two three'.split()

df.assign(counts=df.texts.str.count('|'.join(fr'\b{w}\b' for w in words)))

        texts  counts
0  throne one       1
1     bar one       1
2     foo two       1
3   bar three       1
4     foo two       1
5     bar two       1
6     foo one       1
7   foo three       1
8   one three       2
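As a possible variation (not part of the answer), the alternatives can be placed in one non-capturing group with a single pair of word boundaries, and each word can be passed through `re.escape` in case the search list ever contains regex metacharacters. The helper name `count_whole_words` and the extra 'One two' row are hypothetical, added only to show the optional `flags` argument of `Series.str.count`.

```python
import re
import pandas as pd

def count_whole_words(texts, words, flags=0):
    """Count whole-word occurrences of any of `words` in each string of `texts`.

    Hypothetical helper for illustration; `texts` is a pandas Series of strings.
    """
    # Escape each word so characters like '.' or '+' are matched literally
    alternatives = '|'.join(re.escape(w) for w in words)
    # One pair of word boundaries around a non-capturing group
    pattern = r'\b(?:{})\b'.format(alternatives)
    return texts.str.count(pattern, flags=flags)

texts = pd.Series(['throne one', 'one three', 'One two'])
words = ['one', 'two', 'three']

case_sensitive = count_whole_words(texts, words)
case_insensitive = count_whole_words(texts, words, flags=re.IGNORECASE)
print(case_sensitive.tolist())    # [1, 2, 1]
print(case_insensitive.tolist())  # [1, 2, 2]
```

One caveat: `\b` only marks a boundary between a word character and a non-word character, so words that begin or end with punctuation would need a different boundary test.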

Source: "python - Return count of multiple words present in a pandas column", a similar question found on Stack Overflow: https://stackoverflow.com/questions/49676597/
