Text mining with Python and Pandas

Question

This may be a duplicate, but I had no luck finding it...

I am doing some text mining in Python with Pandas. I have words in a DataFrame, with their Porter stems and some other statistics alongside. This means the DataFrame can contain similar words that share exactly the same Porter stem. I would like to aggregate these similar words into a new column and then drop the duplicates with respect to the Porter stem.

import pandas as pd
pda = pd.DataFrame.from_dict({'Word': ['bank', 'hold', 'banking', 'holding', 'bank'], 'Porter': ['bank', 'hold', 'bank', 'hold', 'bank'], 'SomeData': ['12', '13', '12', '13', '12']})

pdm = pd.DataFrame(pda.groupby(['Porter'])['Word'].apply(list))

What I would ideally like to have:

# Word      Porter               Merged    SomeData
# bank        bank      [bank, banking]          12
# hold        hold      [hold, holding]          13
# banking     bank      [bank, banking]          12
# holding     hold      [hold, holding]          13
# bank        bank      [bank, banking]          12

After dropping the duplicates:

# Word      Porter               Merged    SomeData
# bank        bank      [bank, banking]          12
# hold        hold      [hold, holding]          13

I tried the following join, but it brought me no closer to my goal:

pda.join(pdm, on="Porter", how="left")

Thanks in advance for your help.

Edit: the code above has been modified.

Recommended answer

You can apply set instead of list to each group, so duplicate words within a stem are removed automatically:

import pandas as pd
pda = pd.DataFrame.from_dict({'Word': ['bank', 'hold', 'banking', 'holding', 'bank'],
                              'Porter': ['bank', 'hold', 'bank', 'hold', 'bank'],
                              'SomeData': ['12', '13', '12', '13', '12']})

pdm = pd.DataFrame(pda.groupby(['Porter'])['Word'].apply(set))
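The snippet above only builds the per-stem sets; it does not yet produce the table asked for in the question. One possible way to finish the job is sketched below. It is not part of the original answer, and the names merged, result and deduped are illustrative. It assumes that a sorted list is acceptable in place of a set (so the order is deterministic) and that keeping the first row per stem is the desired way to drop duplicates. Naming the aggregated column Merged also avoids a clash with the existing Word column when joining.

```python
import pandas as pd

pda = pd.DataFrame({
    "Word": ["bank", "hold", "banking", "holding", "bank"],
    "Porter": ["bank", "hold", "bank", "hold", "bank"],
    "SomeData": ["12", "13", "12", "13", "12"],
})

# Collect the distinct words for each stem. sorted(set(...)) removes
# duplicates and gives a stable order (assumption: order is not important).
merged = (
    pda.groupby("Porter")["Word"]
    .apply(lambda words: sorted(set(words)))
    .rename("Merged")
)

# Attach the merged list to every row by looking up its stem.
result = pda.join(merged, on="Porter")

# Keep only the first row for each stem.
deduped = result.drop_duplicates(subset="Porter").reset_index(drop=True)

print(result)
print(deduped)
```

Here result corresponds to the first table in the question and deduped to the second, except that the Merged column comes after SomeData; reorder the columns if the exact layout matters.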
