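I am trying to count the punctuation marks in the content
column of this DataFrame. I have tried this, but it does not work. My DataFrame looks like this:
I would like the result to look like this:
with a punctuation count for each article in place of the sentiment.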
In:
text_words = df.content.str.split()
punctuation_count = {}
punctuation_count[','] = 0
punctuation_count[';'] = 0
punctuation_count["'"] = 0
punctuation_count['-'] = 0

def search_for_single_quotes(word):
    single_quote = "'"
    search_char_index = word.find(single_quote)
    search_char_count = word.count(single_quote)
    if search_char_index == -1 and search_char_count != 1:
        return
    index_before = search_char_index - 1
    index_after = search_char_index + 1
    if index_before >= 0 and word[index_before].isalpha() and index_after == len(word) - 1 and word[index_after].isalpha():
        punctuation_count[single_quote] += 1

for word in text_words:
    for search_char in [',', ';']:
        search_char_count = word.count(search_char)
        punctuation_count[search_char] += search_char_count
    search_for_single_quotes(word)
    search_for_hyphens(word)
Out:
AttributeError: 'list' object has no attribute 'find'
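The traceback follows from what `df.content.str.split()` returns: a Series in which every element is a list of tokens, so each `word` in the outer loop is a whole list rather than a string. The sketch below reproduces this on made-up data (the two sentences are an assumption, not the asker's DataFrame) and shows one way to loop over rows and then over tokens. It counts raw occurrences only and does not reproduce the apostrophe-between-letters check from `search_for_single_quotes`.

```python
import pandas as pd

# Hypothetical sample data, not the asker's DataFrame.
df = pd.DataFrame(
    {"content": ["It's a well-known fact, really; yes.", "No quotes here, friend."]}
)

text_words = df["content"].str.split()

# Each element of the Series is a list of tokens, not a string,
# which is why calling .find() on it raises AttributeError.
first_row = text_words.iloc[0]
print(type(first_row).__name__)
print(hasattr(first_row, "find"))

# Iterate rows first, then the tokens inside each row.
punctuation_count = {",": 0, ";": 0, "'": 0, "-": 0}
for words in text_words:      # one list per row
    for word in words:        # one string per token
        for char in punctuation_count:
            punctuation_count[char] += word.count(char)

print(punctuation_count)
```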
Best answer
Given the following input:
import pandas as pd
df = pd.DataFrame(['I love, pizza, hamberget and chips!!.', 'I like drink beer,, cofee and water!.'], columns=['content'])
content
0 I love, pizza, hamberget and chips!!.
1 I like drink beer,, cofee and water!.
Try this code:
import string
# count how many characters of l1 also appear in l2
count = lambda l1, l2: sum(1 for x in l1 if x in l2)
df['count_punct'] = df.content.apply(lambda s: count(s, string.punctuation))
which gives:
content count_punct
0 I love, pizza, hamberget and chips!!. 5
1 I like drink beer,, cofee and water!. 4
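The same per-row total can probably also be obtained without a Python-level lambda, by letting pandas count regex matches. This is a sketch of an alternative, not part of the original answer: it builds a character class from `string.punctuation` (ASCII punctuation only) and passes it to `Series.str.count`. The name `punct_pattern` is made up for illustration.

```python
import re
import string

import pandas as pd

df = pd.DataFrame(
    ["I love, pizza, hamberget and chips!!.", "I like drink beer,, cofee and water!."],
    columns=["content"],
)

# Character class matching any single ASCII punctuation character;
# re.escape protects characters such as ], \ and - inside the brackets.
punct_pattern = "[" + re.escape(string.punctuation) + "]"

# str.count treats the pattern as a regex and counts matches per row.
df["count_punct"] = df["content"].str.count(punct_pattern)
print(df["count_punct"].tolist())
```

Non-ASCII punctuation (curly quotes, for instance) is not in `string.punctuation`, so it is ignored by both this version and the lambda above.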
If you want to collect the punctuation marks of each row in a list:
accumulate = lambda l1,l2: [x for x in l1 if x in l2]
df['acc_punct_list'] = df.content.apply(lambda s: accumulate(s, string.punctuation))
which gives:
content count_punct acc_punct_list
0 I love, pizza, hamberget and chips!!. 5 [,, ,, !, !, .]
1 I like drink beer,, cofee and water!. 4 [,, ,, !, .]
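If you want to accumulate the punctuation of each row in a dictionary, and then turn each key into a DataFrame column:
from collections import Counter
df['acc_punct_dict'] = df.content.apply(lambda s: {k:v for k, v in Counter(s).items() if k in string.punctuation})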
content acc_punct_dict
0 I love, pizza, hamberget and chips!!. {',': 2, '!': 2, '.': 1}
1 I like drink beer,, cofee and water!. {',': 2, '!': 1, '.': 1}
Now expand the dictionaries into the columns of a new DataFrame:
df_punct = df.acc_punct_dict.apply(pd.Series)
, ! .
0 2 2 1
1 2 1 1
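In the example both rows happen to contain the same three marks. When rows differ, the expanded frame has missing values wherever a mark does not occur in a row. A minimal sketch of one way to handle that, on made-up data: it builds the frame with `pd.DataFrame(acc.tolist())` in place of `apply(pd.Series)` (a substitution on my part, which should yield the same columns) and fills the gaps with 0.

```python
from collections import Counter
import string

import pandas as pd

# Hypothetical rows whose punctuation differs from one another.
df = pd.DataFrame({"content": ["Hello, world!", "Wait; what?"]})

# One dict of punctuation counts per row, as in the step above.
acc = df["content"].apply(
    lambda s: {k: v for k, v in Counter(s).items() if k in string.punctuation}
)

# Keys missing from a row become NaN; replace them with 0 and
# cast back to integers, since NaN forces a float dtype.
df_punct = pd.DataFrame(acc.tolist(), index=df.index).fillna(0).astype(int)
print(df_punct)
```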
If you want to combine the new DataFrame with the original one, simply do:
df_res = pd.concat([df, df_punct], axis=1)
which gives:
content acc_punct_dict , ! .
0 I love, pizza, hamberget and chips!!. {',': 2, '!': 2, '.': 1} 2 2 1
1 I like drink beer,, cofee and water!. {',': 2, '!': 1, '.': 1} 2 1 1
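Note: if you do not need the dictionary column, you can drop it with:
df_res.drop('acc_punct_dict', axis=1)  # returns a new DataFrame; assign the result to keep it
Regarding python - counting punctuation in a DataFrame column, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/58252056/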