Problem description
This may be a duplicate, but I had no luck finding it...
I am doing some text mining in Python with Pandas. I have a DataFrame of words, each with its Porter stem next to it along with some other statistics. This means the DataFrame can contain similar words that share exactly the same Porter stem. I would like to aggregate these similar words into a new column, then drop the rows that are duplicates with respect to the Porter stem.
import pandas as pd
pda = pd.DataFrame.from_dict({'Word': ['bank', 'hold', 'banking', 'holding', 'bank'], 'Porter': ['bank', 'hold', 'bank', 'hold', 'bank'], 'SomeData': ['12', '13', '12', '13', '12']})
pdm = pd.DataFrame(pda.groupby(['Porter'])['Word'].apply(list))
What I would ideally like to have:
# Word Porter Merged SomeData
# bank bank [bank, banking] 12
# hold hold [hold, holding] 13
# banking bank [bank, banking] 12
# holding hold [hold, holding] 13
# bank bank [bank, banking] 12
After dropping the duplicates:
# Word Porter Merged SomeData
# bank bank [bank, banking] 12
# hold hold [hold, holding] 13
I tried the following, but it got me no closer to my goal:
pda.join(pdm, on="Porter", how="left")
Thanks in advance for your help.
Edit: the code above was modified.
Recommended answer
You can apply a set instead of a list, so repeated words within each Porter group are removed automatically:
import pandas as pd
pda = pd.DataFrame.from_dict({'Word': ['bank', 'hold', 'banking', 'holding', 'bank'],
'Porter': ['bank', 'hold', 'bank', 'hold', 'bank'],
'SomeData': ['12', '13', '12', '13', '12']})
pdm = pd.DataFrame(pda.groupby(['Porter'])['Word'].apply(set))
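The snippet above only builds the per-stem collection. To get all the way to the table asked for in the question, the aggregated words still have to be attached to the original rows and the duplicate stems dropped. Below is a minimal sketch of one way to do that, not necessarily what the answer's author had in mind. It assumes the new column should be called Merged (as in the desired output), that keeping the first row per stem is acceptable, and that a sorted list is preferred over a set so the order is deterministic. The names merged, with_merged and deduped are made up for illustration. Note that the join attempted in the question most likely fails because both frames have a column named Word; renaming the aggregated column avoids that clash.

```python
import pandas as pd

pda = pd.DataFrame({
    'Word': ['bank', 'hold', 'banking', 'holding', 'bank'],
    'Porter': ['bank', 'hold', 'bank', 'hold', 'bank'],
    'SomeData': ['12', '13', '12', '13', '12'],
})

# Collect the distinct words for each stem. sorted(set(...)) removes repeats
# and gives a list with a stable order (assumption: order by spelling is fine).
merged = (
    pda.groupby('Porter')['Word']
       .apply(lambda words: sorted(set(words)))
       .rename('Merged')  # rename so it does not clash with pda['Word']
)

# Attach the aggregated list to every original row via the stem.
with_merged = pda.join(merged, on='Porter')

# Keep only the first row for each stem (assumption: the first row is the
# one worth keeping, as in the question's expected output).
deduped = with_merged.drop_duplicates(subset='Porter').reset_index(drop=True)

print(deduped)
```

If a different row per stem should survive, drop_duplicates accepts keep='last', or the frame can be sorted before dropping.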